HOW TO BUILD A SONGWRITING AI
Author: Wayne Cheng
May 14, 2021
How can we use today's most advanced AI technology to create music? What is the technology that powers Google Magenta, Amazon DeepComposer, OpenAI MuseNet, AIVA, and Amper Music?
Background on Deep Learning
Today’s most advanced AI technology is deep learning, which uses an artificial neural network to map between data domains. Within the artificial neural network, a chain of equations mathematically maps data from X to Y.
This mapping relies on a process known as “automatic feature extraction,” where the AI learns the relevant features of X that can be mapped to Y. For example, in an image recognition application, the AI learns that an image with fur, ears, and a tail maps to the word “dog.”
There are two phases when building a deep learning AI: training and prediction (also known as inference). During the training phase, the AI is given a dataset of paired X and Y, and the AI learns how to map X to Y. During the prediction phase, the AI is given new X data and is tasked with predicting the corresponding Y.
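The two phases can be sketched in a few lines of Python. This is only a toy illustration: a single linear equation stands in for the neural network's chain of equations, and the function names, dataset, learning rate, and step count are all assumptions made for the example.

```python
# Toy illustration of training vs. prediction. One linear equation
# (y = w * x + b) stands in for a neural network; this is an assumption
# made to keep the sketch short.

def train(xs, ys, steps=2000, lr=0.05):
    """Training phase: learn w and b so that w * x + b approximates y."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(model, x):
    """Prediction phase: apply the learned mapping to new X."""
    w, b = model
    return w * x + b

# Training data: paired X and Y, where Y happens to be 2X + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]
model = train(train_x, train_y)

# New X that was never seen during training.
prediction = predict(model, 10.0)
print(round(prediction, 2))  # close to 21.0
```

The model is never told the rule "2X + 1"; it recovers the mapping from the examples alone, which is the sense in which the AI "learns" during training.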
Background on Autoregressive AI Models
Music is inherently an autoregressive, time-series, or sequential problem: each note can be predicted from the sequence of notes that came before it. This is why deep learning autoregressive models have worked best in AI music applications.
Internally, in a deep learning autoregressive model, a sequence of notes is represented as a state. In the example below, the first note would update the state to grey. The next note would update the state to orange, and so on. During training, it’s the orange arrows that are updated; the AI learns how to represent a sequence of notes as an internal state.
During prediction, the AI converts the internal state to a probability distribution. You can think of a probability distribution as a set of weighted dice. The dice are then rolled to determine the next note for the sequence. This generated note is (usually) fed back to the AI to update the internal state, so that in the next time step, the next note can be generated for the sequence.
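The generation loop described above can be sketched as follows. This is a deliberately simplified stand-in, not a deep learning model: the "state" here is just the previous note, and the probability table is hand-written for illustration, whereas a trained model would compute the distribution from a learned state that summarizes the whole sequence.

```python
import random

# Hypothetical hand-written probabilities. A trained model would compute
# these from its internal state; here the "state" is only the previous note.
NEXT_NOTE_WEIGHTS = {
    "C": {"D": 0.5, "E": 0.3, "G": 0.2},
    "D": {"C": 0.3, "E": 0.7},
    "E": {"D": 0.4, "G": 0.6},
    "G": {"C": 0.8, "E": 0.2},
}

def generate(start_note, length, rng):
    """Generate `length` notes, feeding each sampled note back as the new state."""
    sequence = [start_note]
    state = start_note
    for _ in range(length - 1):
        distribution = NEXT_NOTE_WEIGHTS[state]
        notes = list(distribution)
        weights = list(distribution.values())
        # Roll the weighted dice to pick the next note.
        next_note = rng.choices(notes, weights=weights, k=1)[0]
        sequence.append(next_note)
        state = next_note  # feed the generated note back in
    return sequence

rng = random.Random(0)  # fixed seed so the run is repeatable
melody = generate("C", 8, rng)
print(melody)
```

Because each note is sampled rather than chosen deterministically, changing the seed gives a different melody from the same probabilities.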
How to Build an AI that Creates Music
An autoencoder architecture is typically used to build an AI music creation tool. In this architecture, an encoder AI is used to map data from X to Z, and a decoder AI is used to map data from Z to Y. Z is a representation of the relevant features of X that maps to Y.
The following picture shows a 2-D simplification of the Z space. During training, the AI would map music of similar features to similar locations on the Z space. For example, the AI would map songs by the Beatles to the green dots, and songs by Mariah Carey to the blue dots.
One technique of training the autoencoder is reconstruction. During training, X is mapped to Z, and then Z is mapped back to X.
During prediction, we can then remove the first half of the autoencoder, and just use the Z to X mapping. If a random point is selected in the Z space, then the decoder AI can generate new random music. If a point halfway between the green and blue dots is selected, then the decoder AI can generate music that’s about 50% Beatles and 50% Mariah Carey.
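Picking a point between two locations in the Z space is plain linear interpolation. The sketch below shows only that step; the 2-D coordinates are made up for illustration, and the decoder that would turn the resulting point into music is not included.

```python
def interpolate(z_a, z_b, t):
    """Return the point a fraction t of the way from z_a to z_b."""
    return [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Hypothetical 2-D coordinates for the two clusters in the Z space.
z_green = [1.0, 4.0]  # e.g. where Beatles songs were mapped
z_blue = [5.0, 2.0]   # e.g. where Mariah Carey songs were mapped

# t = 0.5 selects the point halfway between the two.
z_mix = interpolate(z_green, z_blue, 0.5)
print(z_mix)  # [3.0, 3.0]
```

Values of t closer to 0 or 1 would lean the blend toward one style or the other; the point would then be passed to the trained decoder to generate the music.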
Another technique is transformation, where in both training and prediction, X is mapped to Z, and Z is mapped to Y. This is similar to the X to Y mapping discussed previously, except that there is more control over how the relevant features are represented in Z.
With this technique, music can be created from context. For example, text can be transformed to music, chord progression can be transformed to music, or even a melodic motif can be transformed to music.
Additional Processing of AI Generated Music
Once the AI is trained, it is capable of generating a large amount of derivative music. However, the generated music suffers from two problems.
First, the AI will plagiarize from the training dataset. Plagiarism in this context is where a sequence of notes in the generated music is exactly the same as a sequence of notes in the training data. Depending on the tuning of the hyperparameters, the AI will generate more or less plagiarized music. In order for the AI tool to be used in an ethical way, a plagiarism checker needs to be used to discard music with detected instances of plagiarism.
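One simple way to detect exact note-sequence matches is to compare n-grams (runs of n consecutive notes). The sketch below is an assumption about how such a checker could work, not a description of any particular tool; the function names and the choice of n = 4 as the threshold are illustrative.

```python
def ngrams(notes, n):
    """Return the set of all runs of n consecutive notes in `notes`."""
    return {tuple(notes[i:i + n]) for i in range(len(notes) - n + 1)}

def is_plagiarized(generated, training_songs, n=4):
    """True if any run of n notes in `generated` appears verbatim in the training data."""
    generated_ngrams = ngrams(generated, n)
    return any(generated_ngrams & ngrams(song, n) for song in training_songs)

training_songs = [
    ["C", "D", "E", "F", "G", "A"],
    ["E", "E", "F", "G", "G", "F"],
]
copied = ["A", "C", "D", "E", "F", "B"]    # contains C-D-E-F from the first song
original = ["G", "E", "C", "A", "F", "D"]  # shares no 4-note run with either song

# Discard any generated melody with a detected match.
candidates = [copied, original]
kept = [melody for melody in candidates if not is_plagiarized(melody, training_songs)]
print(kept)
```

A smaller n flags more music as plagiarized (short runs of notes are common to many songs), while a larger n lets longer copied passages through, so the threshold is a judgment call.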
Second, the generated music varies widely in quality. Because it is very tedious to manually evaluate a large batch of generated music, a critic can be used to rank the music based on quality.
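The ranking step itself is straightforward once a critic can assign each piece a score. The article does not say how the critic is built, so the scoring rule below is a placeholder invented for illustration (it simply prefers melodies with smaller leaps between notes); in practice the critic could be any function or model that returns a quality score.

```python
def critic_score(pitches):
    """Placeholder critic (an assumption): smaller average leaps score higher."""
    leaps = [abs(b - a) for a, b in zip(pitches, pitches[1:])]
    return -sum(leaps) / len(leaps)

# A hypothetical batch of generated melodies, as MIDI note numbers.
batch = {
    "melody_a": [60, 62, 64, 65, 67],  # mostly stepwise motion
    "melody_b": [60, 72, 55, 79, 48],  # very large leaps
    "melody_c": [60, 64, 62, 67, 65],  # moderate leaps
}

# Rank the batch from highest to lowest score.
ranked = sorted(batch, key=lambda name: critic_score(batch[name]), reverse=True)
print(ranked)  # ['melody_a', 'melody_c', 'melody_b']
```

Only the top of the ranked list then needs to be reviewed by a human.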
Representing Music as Data
The lead sheet format has been in use since the 1930s, and according to Berklee College of Music, it is the most efficient way of representing music ideas. The essential information of a song’s composition is distilled into the main melody and chords for the harmony.
Distilling the essential information into a compact and efficient dataset makes it easier for the AI to extract the relevant features to map between data domains.
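As a rough sketch of what "melody plus chords" might look like as data, the example below stores a lead sheet as two lists. The class names, fields, and the use of MIDI note numbers and beat-based timing are assumptions made for illustration, not a standard format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Note:
    pitch: int       # MIDI note number (60 is middle C)
    start: float     # onset, in beats from the start of the song
    duration: float  # length, in beats

@dataclass
class Chord:
    symbol: str      # chord symbol as written on the sheet, e.g. "Am"
    start: float     # onset, in beats
    duration: float  # length, in beats

@dataclass
class LeadSheet:
    title: str
    melody: List[Note]
    chords: List[Chord]

    def chord_at(self, beat: float) -> Optional[str]:
        """Return the chord symbol sounding at `beat`, or None if there is none."""
        for chord in self.chords:
            if chord.start <= beat < chord.start + chord.duration:
                return chord.symbol
        return None

# A two-bar example in 4/4: four melody notes over two chords.
sheet = LeadSheet(
    title="Example",
    melody=[Note(60, 0.0, 1.0), Note(64, 1.0, 1.0), Note(67, 2.0, 2.0), Note(69, 4.0, 4.0)],
    chords=[Chord("C", 0.0, 4.0), Chord("Am", 4.0, 4.0)],
)
print(sheet.chord_at(2.0))
print(sheet.chord_at(4.0))
```

Everything else about a recording (instrumentation, performance detail, production) is left out, which is what keeps the dataset compact.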
Keep in mind that the quality of the training data will determine the quality of the generated music. So it is essential to find the highest quality data when training the AI.
Songwriting AI is built with the following deep learning technologies:
Autoregressive or time-series AI models
Autoencoder architecture (X -> Z -> Y)
Once the music has been generated, it has to be processed with a plagiarism checker and a critic.
About the Author
Wayne Cheng is the founder and AI mobile app developer at Audoir. His focus is on the use of generative deep learning to build songwriting tools. Prior to starting Audoir, he worked as a hardware design engineer for Silicon Valley startups, and as an audio engineer for creative organizations. He has an MSEE from UC Davis, and a Music Technology degree from Foothill College.