About OpenAI Jukebox

Jukebox is a neural network from OpenAI that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. The model weights, code, and a tool for exploring the generated samples are publicly available.

Features of OpenAI Jukebox

  1. Music Generation: Jukebox can produce music samples from scratch when provided with genre, artist, and lyrics as input. It covers a wide range of music and singing styles and can generalize to lyrics not seen during training.
  2. Motivation and Prior Work: Automatic music generation has been studied for over half a century. Earlier approaches generated music symbolically, for example as a piano roll that specifies the timing, pitch, and instrument of each note. Symbolic generators have limitations, however: they cannot capture human voices or many of the subtler timbres and dynamics of music. Jukebox instead models music directly as raw audio, which is challenging because the sequences are very long: a typical 4-minute song at CD quality has over 10 million timesteps.
  3. Approach: Jukebox uses an autoencoder that compresses audio to a discrete space using vector quantization (VQ-VAE). It draws inspiration from VQ-VAE-2 and applies its approach to music. The VQ-VAE has three levels, which compress the raw audio by 8x, 32x, and 128x respectively. The compressed audio retains essential information about pitch, timbre, and volume.
  4. Dataset: Jukebox was trained on a dataset of 1.2 million songs, 600,000 of which are in English. This dataset was paired with corresponding lyrics and metadata from LyricWiki.
  5. Artist and Genre Conditioning: The model can be given additional information such as the artist and genre of each song. This reduces the entropy of the audio prediction and makes it possible to steer generation toward a specific style.
  6. Lyrics Conditioning: In addition to artist and genre, the model can be conditioned on the lyrics of a song. This is difficult because no well-aligned dataset exists: the lyrics are available only per song, without timing information. Heuristics were therefore used to match portions of audio to their corresponding lyrics.
  7. Limitations: Jukebox is a significant advance, but it has clear limitations. The generated songs often lack familiar larger musical structures such as repeating choruses. The downsampling and upsampling process introduces noise, and the models are slow to sample from, taking roughly 9 hours to render one minute of audio.
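
To make the compression described under Approach more concrete, the following is a minimal sketch and not Jukebox's actual code. It shows two things: how many discrete codes each level would need for a clip, and the nearest-neighbor codebook lookup at the heart of vector quantization. The 44.1 kHz sample rate, the 8x/32x/128x factors, and the names tokens_per_level, quantize, and dequantize are assumptions chosen for illustration. A real VQ-VAE learns its encoder, decoder, and codebook; here the codebook is a tiny hand-written list.

```python
SAMPLE_RATE_HZ = 44_100          # assumed CD-quality sample rate
HOP_FACTORS = (8, 32, 128)       # assumed per-level compression factors


def tokens_per_level(duration_s, sample_rate=SAMPLE_RATE_HZ, factors=HOP_FACTORS):
    """Return {factor: number of discrete codes} for a clip of the given length."""
    n_samples = int(duration_s * sample_rate)
    return {f: n_samples // f for f in factors}


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


def dequantize(indices, codebook):
    """Replace each index with its codebook vector (the lossy reconstruction)."""
    return [codebook[i] for i in indices]


if __name__ == "__main__":
    # A 4-minute song: raw samples versus codes at each level.
    print(tokens_per_level(240))

    # Toy 2-D codebook and encoder outputs, purely illustrative.
    codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    latents = [[0.1, -0.1], [0.9, 1.2], [4.0, 4.5]]
    codes = quantize(latents, codebook)
    print(codes, dequantize(codes, codebook))
```

Under these assumptions, the coarsest level represents a 4-minute song with tens of thousands of codes rather than millions of raw samples. That is what makes modeling long-range structure tractable, at the cost of the detail that the finer levels must restore.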

Additional Features

  • Future Directions: OpenAI’s audio team is working on generating audio samples conditioned on different kinds of priming information. They have seen early success conditioning on MIDI files and stem files. The team believes that collaborations between humans and models will be an exciting creative space in the future.
  • Engagement with Musicians: OpenAI shared Jukebox with 10 musicians from various genres to gather feedback. While the musicians found Jukebox interesting, they didn’t find it immediately applicable to their creative processes due to its current limitations.
  • Collaboration Opportunities: OpenAI is reaching out to the wider creative community as they believe generative work across text, images, and audio will continue to improve. They are inviting creative collaborators to help build useful tools or new works of art in these domains.