OPENAI MUSENET : USING GPT-2 ON MUSIC

Author : Wayne Cheng

May 7, 2021 

OpenAI trained the world's most advanced NLP AI to create music, using the GPT-2 model to create a songwriting application called MuseNet.

Introduction

What happens if the world’s most advanced natural language processing AI is used to create music? That’s what the developers at OpenAI set out to learn, using the GPT-2 model to create a songwriting application called MuseNet.

According to OpenAI, MuseNet can generate music up to 4 minutes long with up to 10 instruments. Using MuseNet, users can create a new song with multiple instruments, in the style of artists ranging from Mozart to Lady Gaga.

In this article, I will examine OpenAI’s MuseNet in detail.

Summary of the Tool

MuseNet is available as a web application, which can be accessed from the website. The tool can be used for free.

Similar to Google Magenta or Amazon DeepComposer, MuseNet transforms a short melody into a song in the style of a composer. In addition, the generated music may contain multiple instruments.

User Experience

Starting with the simple mode, users can select a song style and a starting melody. With the default settings of song style "Chopin" and starting melody "Mozart's Rondo alla Turca," the tool works well; the generated song repeats melodic motifs and sounds musical overall. Even mismatched song styles and melodies sound good, such as "Adele's Someone Like You" in the style of "Rachmaninoff." Most of the songs generated in simple mode contain only one instrument. Note that in simple mode the songs are all pre-generated, so the runtime is negligible.

In the advanced mode, users have access to many more song styles and starting melodies. Users can also choose to generate songs with additional instruments such as guitar, bass, and drums. Because the starting melodies contain only piano, the other instruments come in later in the song, if at all. With the option of no starting melody, however, songs can be generated with most of the chosen instrumentation present. The songs take about a minute to generate. Depending on the chosen options, the generated music can sound quite good.

Technical Details

OpenAI provides a detailed explanation of the dataset and underlying technology that powers MuseNet.

Training Dataset


The training dataset consists of MIDI files from Classical Archives (classical music), BitMidi (popular music), and MAESTRO (piano performances). It's not clear how many MIDI files were used.

As with other autoregressive machine learning architectures, the input data is encoded into tokens. Each note token combines an instrument, a volume, and a pitch. A special "wait" token marks the passage of time between notes.
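As a rough illustration of this kind of encoding, the sketch below turns a few notes into note tokens separated by "wait" tokens. The token spelling (`instrument:vVOLUME:PITCH` and `wait:TICKS`), the `Note` class, and the tick-based time unit are assumptions made for this example, not MuseNet's actual vocabulary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Note:
    start: int       # onset time in ticks (hypothetical time unit)
    instrument: str  # e.g. "piano"
    volume: int      # MIDI velocity, 0-127
    pitch: int       # MIDI pitch number, 0-127


def encode(notes: List[Note]) -> List[str]:
    """Turn notes into note tokens, inserting a wait token whenever time advances."""
    tokens: List[str] = []
    now = 0
    for note in sorted(notes, key=lambda n: (n.start, n.pitch)):
        gap = note.start - now
        if gap > 0:
            tokens.append(f"wait:{gap}")
            now = note.start
        # One token carries instrument, volume, and pitch together.
        tokens.append(f"{note.instrument}:v{note.volume}:{note.pitch}")
    return tokens


chord_then_melody = [
    Note(0, "piano", 72, 60),
    Note(0, "piano", 72, 64),
    Note(12, "violin", 80, 67),
]
print(encode(chord_then_melody))
# ['piano:v72:60', 'piano:v72:64', 'wait:12', 'violin:v80:67']
```

Notes that start at the same time simply appear back to back, so a chord needs no special token; only the gaps between onsets are spelled out.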


The data is further augmented through pitch transposition, volume alteration, timing alteration, and "mixup" (essentially an interpolation between the token embeddings of two data entries).
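These augmentations can be sketched in a few lines. The note layout `(start_tick, pitch, volume)`, the function names, and the random ranges below are all assumptions chosen for illustration; OpenAI does not publish the exact values or code.

```python
import random
from typing import List, Tuple

NoteEvent = Tuple[int, int, int]  # (start_tick, pitch, volume), hypothetical layout


def transpose(notes: List[NoteEvent], semitones: int) -> List[NoteEvent]:
    """Shift every pitch by the same interval, clamped to the MIDI range."""
    return [(t, min(127, max(0, p + semitones)), v) for t, p, v in notes]


def scale_volume(notes: List[NoteEvent], factor: float) -> List[NoteEvent]:
    """Make the whole passage louder or softer, clamped to 1-127."""
    return [(t, p, min(127, max(1, round(v * factor)))) for t, p, v in notes]


def stretch_time(notes: List[NoteEvent], factor: float) -> List[NoteEvent]:
    """Speed up or slow down the passage by scaling onset times."""
    return [(round(t * factor), p, v) for t, p, v in notes]


def mixup(emb_a: List[float], emb_b: List[float], lam: float) -> List[float]:
    """Interpolate between two embedding vectors; lam=1.0 returns emb_a."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]


def augment(notes: List[NoteEvent], rng: random.Random) -> List[NoteEvent]:
    """Apply one random variant of each note-level augmentation (ranges are assumed)."""
    notes = transpose(notes, rng.randint(-3, 3))
    notes = scale_volume(notes, rng.uniform(0.8, 1.2))
    return stretch_time(notes, rng.uniform(0.95, 1.05))


melody = [(0, 60, 72), (12, 64, 72), (24, 67, 80)]
print(augment(melody, random.Random(0)))
print(mixup([1.0, 0.0], [0.0, 1.0], 0.25))  # [0.25, 0.75]
```

The first three functions operate on the notes themselves, while mixup operates on embedding vectors inside the model, which is why it takes lists of floats rather than notes.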

Machine Learning Architectures


As mentioned previously, MuseNet is based on the powerful GPT-2 model, which is a large-scale transformer model trained to predict the next token in a sequence.
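To make "predict the next token in a sequence" concrete, here is a minimal sketch of the autoregressive generation loop. A hypothetical bigram count table stands in for the transformer; the real model conditions on thousands of previous tokens rather than just the last one, but the loop has the same shape: predict, sample, append, repeat.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def fit_bigram(tokens: List[str]) -> Dict[str, Counter]:
    """Count which token follows which (a toy stand-in for a trained model)."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table


def generate(table: Dict[str, Counter], prompt: List[str],
             length: int, rng: random.Random) -> List[str]:
    """Extend the prompt one token at a time, sampling from the predicted distribution."""
    out = list(prompt)
    for _ in range(length):
        choices = table.get(out[-1])
        if not choices:  # no known continuation for this token
            break
        candidates = list(choices.keys())
        weights = list(choices.values())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return out


corpus = ["C", "E", "G", "C", "E", "G"]
table = fit_bigram(corpus)
print(generate(table, ["C"], 4, random.Random(0)))
# ['C', 'E', 'G', 'C', 'E']
```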

A good visual introduction to the GPT-2 architecture can be found here.

Unlike the original encoder-decoder transformer, GPT-2 uses only the decoder blocks. Within those blocks, it uses the same masked self-attention mechanism as the original decoder, so each token can attend only to itself and the tokens that come before it.
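The masking idea can be shown with a small, dependency-free sketch. It implements a single attention head with no learned projections, which is a simplification for illustration and not GPT-2's actual implementation; the function names are my own.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries: List[Vector], keys: List[Vector]) -> List[List[float]]:
    """Row i holds the attention of position i over positions 0..i; later positions get 0."""
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Only score keys at positions <= i: this is the causal mask.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights.append(softmax(scores) + [0.0] * (len(keys) - i - 1))
    return weights


def causal_self_attention(queries: List[Vector], keys: List[Vector],
                          values: List[Vector]) -> List[Vector]:
    """Weighted sum of value vectors using the masked attention weights."""
    weights = causal_attention_weights(queries, keys)
    width = len(values[0])
    return [[sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
            for row in weights]


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention_weights(x, x):
    print([round(w, 3) for w in row])
```

The printed matrix is lower triangular: the first position can only attend to itself, and no position places any weight on a later one. This is what lets the model be trained on whole sequences while still only ever predicting the next token from the past.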

MuseNet also uses the recompute technique from the Sparse Transformer. Recompute saves memory by recalculating intermediate activations during the backward pass instead of storing them, which makes it practical to train a very deep neural network. The transformer is 72 layers deep, with 24 attention heads, and full attention over a context of 4096 tokens. By comparison, the original transformer uses 6 layers, 8 attention heads, and a context of 512 tokens.
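Some rough arithmetic shows why recomputation matters at this scale. The sketch below counts the attention weights for a single sequence using the layer, head, and context figures above. It assumes dense attention matrices stored as 16-bit floats and ignores every other activation, so treat the numbers as an order-of-magnitude illustration rather than MuseNet's measured memory use.

```python
BYTES_PER_VALUE = 2  # assumption: 16-bit floats


def attention_entries(layers: int, heads: int, context: int) -> int:
    """Attention weights for one sequence if every layer's matrices are kept."""
    return layers * heads * context * context


def to_gib(entries: int, bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Convert a count of stored values to gibibytes."""
    return entries * bytes_per_value / 2**30


# Keeping every layer's attention matrices for the backward pass.
all_layers = attention_entries(layers=72, heads=24, context=4096)
# Keeping only one layer's matrices at a time and recomputing the rest.
one_layer = attention_entries(layers=1, heads=24, context=4096)

print(all_layers)          # 28991029248
print(to_gib(all_layers))  # 54.0
print(to_gib(one_layer))   # 0.75
```

Under these assumptions, storing the attention matrices of all 72 layers would take about 54 GiB per sequence, while holding one layer at a time takes under 1 GiB. The price is extra computation, since the discarded values must be calculated again during the backward pass.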

Compared with other songwriting AI tools like Google Magenta or Amazon DeepComposer, MuseNet has by far the most advanced AI architecture. To my ear, this shows in the quality of its generated music, which is the best of the three.

Conclusion

OpenAI MuseNet demonstrates the possibilities of applying a state-of-the-art transformer architecture to the domain of music generation. At its best, the result is music that is hard to distinguish from the work of a human composer.

The tool is free to use and easy to understand. There are currently not many options to choose from, so hopefully that will improve with further development.

About the Author

Wayne Cheng is the founder and machine learning engineer at Audoir. His focus is on the use of generative deep learning to build songwriting tools. Prior to starting Audoir, he worked as a hardware design engineer for Silicon Valley startups and as an audio engineer for creative organizations. He has an MSEE from UC Davis and a Music Technology degree from Foothill College.