Latent Jamming

Latent Jamming is an improvisation practice with real-time capable neural audio models that embraces concepts of algorithmic and/or generative composition techniques. It has been one of my main practical research topics since 2023.


Motivation and background

Coming from a traditional electronic music background (Drum & Bass, Breaks, Electronica), where deterministically driven production routines in a technologically homogeneous setup are dominant, I have centered my practical research of the last years on two main questions:

  1. How can techniques of generative music and algorithmic composition be injected into electronic music genres that are deterministically driven? (see e.g. Fibonacci Jungle, Risset Rhythms)
  2. How can generative AI be integrated holistically into creative processes in electronic music production, rather than merely as one more new tool among many in existing production routines?

To home in on these questions, in particular the second one, I train neural nets on the musical material I have written and produced in the past and work with the trained models in real-time settings. I apply compositional concepts from generative music and algorithmic composition as mediators between the human performer and the generative abilities of the neural nets, displacing and circumventing concepts of authorship and genius by empowering multiple independent agents in an improvisation-driven, co-creative process that leads to musical output, but not necessarily to a fixed recording artifact.

Sharing agency

With this approach, I aim to amplify one key quality of neural audio models: their unexpected behaviour when generating output. This quality sets the models apart from conventional musical instruments, where control over the produced sound is usually the objective. My goal when making music powered by neural nets is to share agency by finding the right equilibrium between establishing control and embracing the lack thereof.

Creative considerations

I use deep learning algorithms to interpret and extract key characteristics of particular subsets of my audio data; my creative intent is to expand these characteristics into something genuinely new.

Finding a novel approach to music production

In contrast to similar AI-augmented practices in contemporary music production, where models are often used as a source of samples or sound items in otherwise conventional production routines, my interest in neural audio synthesis lies in generating (electronic) music in a real-time compositional dialogue with single models. Consequently, my training data consists explicitly of self-contained assets (i.e. full tracks), not separated stems of one instrument, synthesizer, or other homogeneous sound samples.

"back in our day we didn't have ai we used REAL synthesizers... to sound like drums" (dadabots)

The object of this approach is my own music, written in past years under a traditional electronic music production paradigm. Preselection and categorization is a first creative act in the process: material with a particular sonic character (e.g. sparse, dense, or attributable to a particular genre), from a particular working phase, or from a dedicated output selection (e.g. an album) is separated into various datasets.
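As an illustration of this preselection step (not the actual tooling I use), a minimal Python sketch might group full tracks into per-category dataset folders; all paths and category labels below are hypothetical:

```python
# Illustrative sketch of the preselection step: grouping full tracks into separate
# dataset folders by a manually assigned category (sparse/dense/genre/album, etc.).
# Paths and category labels are hypothetical.
import shutil
from pathlib import Path

selection = {
    "sparse":     ["tracks/ambient_sketch_01.wav", "tracks/slow_tail_02.wav"],
    "dense_dnb":  ["tracks/rollers_03.wav", "tracks/breaks_edit_07.wav"],
    "album_2019": ["tracks/album19_a1.wav", "tracks/album19_a2.wav"],
}

for category, files in selection.items():
    target = Path("datasets") / category
    target.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy(f, target / Path(f).name)   # copy each full track into its dataset
```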

Building hybrid instruments

Using the open-source neural audio architectures RAVE, vschaos2, MSPrior, and AFTER, I trained various models on these curated selections of my earlier works. Capable of reproducing and respawning the sound characteristics they learned during training, these models become hybrids of instruments and sound machines that partly act autonomously. (For example, RAVE models are known to produce sound spontaneously on silent input when the training data did not explicitly contain silence.)
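As a minimal illustration of this hybrid-instrument behaviour outside of the actual performance setup, the following Python sketch loads a RAVE model exported to TorchScript and decodes the latent representation of silence; the file name, sample rate, and buffer size are assumptions for illustration:

```python
# Hedged sketch: treating an exported RAVE model as an instrument from Python.
# "my_rave_model.ts" is a hypothetical file name for a model exported to TorchScript;
# encode()/decode() are the methods such exported models expose.
import torch

model = torch.jit.load("my_rave_model.ts").eval()

# Roughly one second of silence, assuming 44.1 kHz audio and a latent frame size
# of 2048 samples (both model-dependent assumptions).
silence = torch.zeros(1, 1, 2048 * 22)

with torch.no_grad():
    z = model.encode(silence)   # latent representation of "nothing"
    audio = model.decode(z)     # the model may still answer with audible material

print(audio.shape)              # (batch, channels, samples)
```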

Learning to navigate in latent space

The compositional setup used to make music with the models requires an experimental approach that embraces this understanding of them as both instruments of a new type and autonomous actors. Interaction with the models happens in latent space, where conventional compositional techniques cannot be applied. Different models hardly share behavioural similarities; each model requires its own exploration and empirical observation. The compositional setup is therefore mainly a boilerplate template combining different techniques that have proven successful in similar use cases, while putting it into action resembles learning an instrument from scratch.

Embracing new qualities

The results of working with this approach can bear strong similarities to the musical characteristics of the original material; however, the amalgamation of sounds performed by the models, as well as their unexpected behaviour, generally results in a new quality of output that challenges both performer and listener. As such, making music with neural audio models in real-time settings implies a paradigm shift in electronic music production.

Technical setup

For the compositional process, I use Pure Data (Pd), where RAVE and vschaos2 (as well as MSPrior and AFTER) models can be run in real time using the nn~ object. In Pd, I programmed a set of custom abstractions, tailored explicitly to these model types, that allow building frameworks for semi-generative or algorithmic use cases.

With these abstractions, I can intervene directly in the latent space of the models, overriding their intended use case of timbre transfer on audio material and instead injecting mimicked latent embeddings. This allows me to guide the models' outputs, comparable to tuning, and to some extent playing, an instrument.
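A hedged Python sketch of this kind of latent intervention (mirroring in code what the Pd abstractions do around nn~, not the abstractions themselves) could generate slow sinusoidal latent trajectories instead of encoding incoming audio and feed them to the decoder; the file name and latent dimensionality are assumptions:

```python
# Sketch of "latent embedding mimicry": synthetic latent trajectories replace the
# encoder output and are fed straight to the decoder. Model file name and the
# latent dimensionality (n_latents) are illustrative assumptions.
import math
import torch

model = torch.jit.load("my_rave_model.ts").eval()

n_latents = 16      # assumed latent dimensionality of the exported model
n_frames = 512      # number of latent frames to generate

# One slow sine per latent dimension, each with its own rate and depth,
# loosely mimicking the shape of latents an encoder would produce.
t = torch.linspace(0.0, 1.0, n_frames)
rates = torch.linspace(0.25, 4.0, n_latents).unsqueeze(1)    # cycles over the trajectory
depths = torch.linspace(0.1, 2.0, n_latents).unsqueeze(1)    # per-dimension amplitude
z = depths * torch.sin(2 * math.pi * rates * t)              # (n_latents, n_frames)

with torch.no_grad():
    audio = model.decode(z.unsqueeze(0))                     # add batch dimension

print(audio.shape)
```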

Compositional and performative considerations

Tuning and setting control thresholds

The compositional process usually includes a lot of exploratory work until a constellation of parameters is found that leads to musically coherent and/or novel results. Once a parameter constellation (or tuning) for the models has been established, the amount of human influence on the compositional level is determined. This includes defining the range within which control levels may vary as the models create their output, and it also implies balancing the amount of perceivable rhythmic structure or repetition.
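To make the idea of a parameter constellation more concrete, here is a small sketch that assumes per-dimension latent ranges as the tuning; the dimension indices and range values are purely illustrative:

```python
# Sketch of a "parameter constellation" as plain data: per-latent-dimension ranges
# bounding how far latents may drift, plus a helper that enforces them.
# Dimension indices and range values are illustrative assumptions.
import torch

latent_ranges = {
    0: (-1.5, 1.5),   # broad variation allowed on this dimension
    1: (-0.3, 0.3),   # tightly constrained dimension
    2: (-0.8, 0.2),   # asymmetric range, biasing the output in one direction
}

def apply_constellation(z: torch.Tensor, ranges: dict) -> torch.Tensor:
    """Clamp selected latent dimensions of z (batch, dims, frames) to their ranges."""
    z = z.clone()
    for dim, (lo, hi) in ranges.items():
        z[:, dim] = z[:, dim].clamp(lo, hi)
    return z
```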

Finding pieces

While performing, the model's behaviour can be stabilized, but the actual output is usually not exactly repeatable a second time. For that reason, I call this musical practice Latent Jamming, referring to a co-creative situation where human and artificial agents interact in an improvisational setting. In terms of compositional or performative practice, the process is therefore driven less deterministically than exploratorily: less writing a piece than finding a piece.

Ethical considerations

Selecting data

From an ethical point of view, neural audio model training, like basically all AI model training, requires consideration of dataset provenance, in particular regarding questions of authorship and licensing. Using only my own musical material, and excluding remixes and collaborations with other artists, is not only an aesthetically driven decision but also a practical one, since I do not touch the rights of any other creator.

Considering bias

While bias is generally considered problematic in LLMs, it can be highly desirable when training neural audio models; in my use case, it did not require any additional consideration.

Compensating environmental footprint

Training AI models is widely known to come at a significant environmental cost. Training RAVE and vschaos2 neural audio models in cloud data centers appears to be comparatively cheap: for example, 170 GPU hours of training a RAVE model on Kaggle correspond to around 24.48 kg CO₂, while 12 GPU hours for a vschaos2 model correspond to around 1.73 kg CO₂. These numbers are rough estimates based on an hourly power consumption of 300 W for a Tesla P100 GPU plus infrastructure and a global electricity carbon intensity of 0.48 kg CO₂/kWh.
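For transparency, the estimate above follows directly from hours × power × carbon intensity; a minimal sketch of the calculation, using the assumptions stated in the text:

```python
# Rough CO2 estimate for GPU training runs, based on the assumptions stated above
# (300 W for GPU plus infrastructure, 0.48 kg CO2 per kWh); an estimate, not a measurement.

def training_co2_kg(gpu_hours: float, power_kw: float = 0.3,
                    intensity_kg_per_kwh: float = 0.48) -> float:
    """Return the estimated CO2 emissions in kg for a training run."""
    return gpu_hours * power_kw * intensity_kg_per_kwh

print(training_co2_kg(170))  # RAVE example: 170 h * 0.3 kW * 0.48 kg/kWh = 24.48 kg
print(training_co2_kg(12))   # vschaos2 example: 12 h * 0.3 kW * 0.48 kg/kWh ≈ 1.73 kg
```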

In the EU, the most efficient way to compensate CO₂ as a private person is by buying (and retiring) fractions of EU Allowances (EUAs) for CO₂ emissions. I’ve chosen ForTomorrow to compensate for my own environmental footprint in this manner on a yearly basis. 


Use cases and examples

Over the past years, I have developed various frameworks in Pure Data that build on the idea of Latent Jamming in order to explore new ways of music co-creation. You can find these under Works.