Tag: neural networks

  • UnRAVEl: speculative composition in latent space

    RAVE models encode data from the audio domain into highly compressed latent representations. Based on statistical information retrieved from these encodings, a speculative compositional practice can be established inside the latent space of the models. It is derived from the improvisation tactics empirically proven in Latent Jamming.

    As a proof of concept (POC), I have written UnRAVEl, a set of Python scripts that cover a three-step process:

    • audio data encoding: 1-n audio files are encoded into arrays in the shape of a model’s latent space.
    • generation of synthetic data: 1-n encodings are evaluated for their data distribution. Based on the results, arrays of synthetic data are generated. These are used to populate preset patterns and to apply alterations to these patterns.
    • decoding of synthetic data into audio data: 1-n generated (synthetic) data arrays are decoded and upsampled back to the audio domain using the same model as in the encoding process.

    Considerations and hypotheses

    Prior distribution vs. encoding audio

    While models contain statistics learned during training, encoding real-world audio data through a model for evaluation can be more robust when it comes to deviations from the original data set. However, domain data similar to what the model has seen during training should yield the most truthful statistics for informing synthetic data generation.

    Distribution

    VAEs like RAVE assume a Gaussian latent prior. Generating synthetic data using mean and standard deviation retrieved from the encodings should create truthful results.
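
    A minimal sketch of this idea, assuming encodings are available as NumPy arrays of shape (data points, latent dimensions). The function name and the stand-in data are illustrative and not taken from UnRAVEl:

```python
import numpy as np

def sample_normal(encodings, n_points, seed=None):
    """Draw synthetic latent data from per-dimension mean and standard deviation.

    encodings: list of arrays, each of shape (data points, latent dimensions).
    Returns an array of shape (n_points, latent dimensions).
    """
    stacked = np.concatenate(encodings, axis=0)
    mean = stacked.mean(axis=0)  # one mean per latent dimension
    std = stacked.std(axis=0)    # one standard deviation per latent dimension
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mean, scale=std, size=(n_points, stacked.shape[1]))

# Stand-in for real encodings: two "files" with 8 latent dimensions each.
rng = np.random.default_rng(0)
encodings = [rng.normal(0.5, 2.0, size=(500, 8)),
             rng.normal(0.5, 2.0, size=(300, 8))]
synthetic = sample_normal(encodings, n_points=11, seed=1)
print(synthetic.shape)  # (11, 8)
```

    Each latent dimension is treated independently here; correlations between dimensions are ignored.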

    Data distribution in a set of encodings.

    In UnRAVEl, this is covered by sampling from a normal distribution. Other distribution types, e.g. uniform or correlation-based, are experimental but, depending on the model, can create more interesting output since they sample from a different value range and follow a different logic.

    Normal data distribution in a generated array.
    Uniform data distribution in a generated array.
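
    The two experimental alternatives could look roughly like the following sketch. It assumes that "uniform" means sampling between the observed minimum and maximum of each latent dimension, and that "correlation-based" means a multivariate normal built from the covariance of the encodings. Both readings, and all names, are assumptions rather than UnRAVEl’s actual implementation:

```python
import numpy as np

def sample_uniform(encodings, n_points, seed=None):
    """Sample each latent dimension uniformly between its observed min and max."""
    stacked = np.concatenate(encodings, axis=0)
    rng = np.random.default_rng(seed)
    return rng.uniform(stacked.min(axis=0), stacked.max(axis=0),
                       size=(n_points, stacked.shape[1]))

def sample_correlated(encodings, n_points, seed=None):
    """Sample from a multivariate normal that keeps cross-dimension covariance."""
    stacked = np.concatenate(encodings, axis=0)
    mean = stacked.mean(axis=0)
    cov = np.cov(stacked, rowvar=False)  # shape (latent dims, latent dims)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_points)

# Stand-in encoding with 4 latent dimensions.
encoding = np.random.default_rng(0).normal(0.0, 1.5, size=(400, 4))
uniform_data = sample_uniform([encoding], n_points=11, seed=1)
correlated_data = sample_correlated([encoding], n_points=11, seed=1)
print(uniform_data.shape, correlated_data.shape)
```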

    Tempo quantizing

    Latent embeddings created with RAVE are highly compressed representations of audio domain data. At a sample rate of 44.1 kHz and a compression ratio of 2048, one second of audio corresponds to about 21 data points (44100 / 2048 ≈ 21.5) per latent dimension. The audio domain resolution is high enough to be more or less irrelevant for tempo considerations. With the low resolution in latent space, however, the achievable tempi are limited to:

    60 * (model sample rate / model compression) / latent data points

    For example:

    60 * (44100 / 2048) / 11 = 117.45 BPM

    This leads to the following quantized tempi (in 4ths, or 8ths for double time) that can be achieved by looping k latent data points.

    k     Tempo (BPM)   Double time (BPM)   Audio
    7     184.57        -                   Example
    8     161.49        322.98              Example
    9     143.55        -                   Example
    10    129.19        258.38              Example
    11    117.45        -                   Example
    12    107.66        215.32              Example
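
    The formula can be checked with a few lines of Python. The function name is illustrative; the default values are the ones used in the example above:

```python
def loop_tempo_bpm(k, sample_rate=44100, compression=2048):
    """Tempo in BPM when a loop of k latent data points lasts one beat."""
    latent_rate = sample_rate / compression  # latent data points per second
    return 60 * latent_rate / k

for k in range(7, 13):
    print(k, round(loop_tempo_bpm(k), 2))
```

    Rounding to two decimals can differ from the table in the last digit (for example at k = 8), since the listed values appear to be truncated.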

    Patterns

    The compositional approach in latent space exemplified in UnRAVEl is based on high level structural considerations, e.g. repeating (parts of) data arrays, replacing data points and/or slightly altering them while boundaries like value distribution or tempo quantizing need to be considered.

    Compositional ideas can be established by defining patterns; in UnRAVEl, four patterns have been implemented as a starting point.

    fibo

    A given array of shape (data points, latent dimensions) is repeated along the Fibonacci series of integers: 1 corresponds to the first row in the array, 2 corresponds to the first two rows in the array, …, 8 corresponds to rows 0-7 and so on.
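
    A sketch of how such a pattern could be built with NumPy. It assumes the series starts 1, 1, 2, 3, … and that segments are clipped once they exceed the array length; the function name is illustrative:

```python
import numpy as np

def fibo_pattern(array, steps):
    """Concatenate the first n rows of `array` for n along the Fibonacci series."""
    a, b = 1, 1
    parts = []
    for _ in range(steps):
        parts.append(array[:a])  # slicing clips if a exceeds the number of rows
        a, b = b, a + b
    return np.concatenate(parts, axis=0)

latents = np.arange(16, dtype=float).reshape(8, 2)  # 8 data points, 2 latent dimensions
pattern = fibo_pattern(latents, steps=6)            # segment lengths 1, 1, 2, 3, 5, 8
print(pattern.shape)  # (20, 2)
```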


    orale

    An approximation to a standard sequence in electronic music building an array using a 3:1 scheme where the original array is repeated three times and a fourth time with subtle changes applied to its values. This sequence is then repeated and altered again in the same scheme of 3:1.
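
    The nested 3:1 scheme might be sketched as follows. The "subtle changes" are modelled here as small Gaussian noise, which is an assumption; the actual alteration in UnRAVEl may differ, and the names are illustrative:

```python
import numpy as np

def alter(array, amount, rng):
    """Return a copy with small Gaussian noise added (one possible subtle change)."""
    return array + rng.normal(0.0, amount, size=array.shape)

def orale_pattern(array, amount=0.05, seed=None):
    """Apply the 3:1 scheme twice: three repeats plus one altered copy, nested."""
    rng = np.random.default_rng(seed)
    phrase = np.concatenate([array] * 3 + [alter(array, amount, rng)], axis=0)
    return np.concatenate([phrase] * 3 + [alter(phrase, amount, rng)], axis=0)

latents = np.zeros((11, 4))  # 11 data points, 4 latent dimensions
pattern = orale_pattern(latents, seed=0)
print(pattern.shape)  # (176, 4)
```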


    blender

    Two arrays are blended into one another by replacing single data points sequentially after n repetitions, starting with the first value in the first dimension, followed by the first value in the second dimension and so on until the last value in the last dimension has been reached.
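
    One way to read this description in code, assuming both arrays have shape (data points, latent dimensions) and that each intermediate state is repeated n times before the next value is replaced. The function name is illustrative:

```python
import numpy as np

def blender_pattern(source, target, n):
    """Blend `source` into `target` one value at a time, n repetitions per stage."""
    if source.shape != target.shape:
        raise ValueError("both arrays need the same shape")
    current = source.copy()
    stages = [np.tile(current, (n, 1))]
    rows, dims = source.shape
    for row in range(rows):        # data points
        for dim in range(dims):    # latent dimensions
            current[row, dim] = target[row, dim]
            stages.append(np.tile(current, (n, 1)))
    return np.concatenate(stages, axis=0)

source = np.zeros((2, 2))
target = np.ones((2, 2))
pattern = blender_pattern(source, target, n=2)
print(pattern.shape)  # (20, 2)
```

    The output grows with the number of values in the array, so this reading only stays manageable for short arrays.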


    swapper

    Values of randomly picked data points in two arrays of the same size are swapped. The altered array is repeated n times.
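
    A possible sketch, assuming that single values at random positions are exchanged between the two arrays and that both altered arrays are then repeated. Which array is kept, and how many positions are swapped, is not stated above; the parameters are illustrative:

```python
import numpy as np

def swapper_pattern(first, second, n_swaps, n, seed=None):
    """Swap values at random positions between two arrays, then repeat n times."""
    if first.shape != second.shape:
        raise ValueError("both arrays need the same shape")
    rng = np.random.default_rng(seed)
    a, b = first.copy(), second.copy()
    a_flat, b_flat = a.reshape(-1), b.reshape(-1)  # views on the copies
    picks = rng.choice(a.size, size=n_swaps, replace=False)
    a_flat[picks], b_flat[picks] = b_flat[picks].copy(), a_flat[picks].copy()
    return np.tile(a, (n, 1)), np.tile(b, (n, 1))

first = np.zeros((4, 2))
second = np.ones((4, 2))
out_a, out_b = swapper_pattern(first, second, n_swaps=3, n=2, seed=0)
print(out_a.shape, out_b.shape)  # (8, 2) (8, 2)
```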


    Use in Pure Data

    UnRAVEl generates .npy arrays that can be decoded to the audio domain using the dedicated script. Alternatively, latent audio files can be written: these are multichannel (one channel per latent dimension), double precision (to allow values outside the -1/+1 boundary) .wav files at a resolution of e.g. 21 Hz (if the data source was 44.1 kHz). This format works with an abstraction I’ve written in Pure Data: ch4ns0n/ch8ns0n (note that only models with 4 and 8 latent dimensions are supported, but the component is fairly easy to extend).
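
    A file of this kind can be produced with SciPy, for example. This is only a sketch of the format described above: it has not been checked against the files UnRAVEl writes or against what ch4ns0n/ch8ns0n expects, and the function name is illustrative:

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

def write_latent_wav(path, latents, rate=21):
    """Write a (data points, latent dimensions) array as a float64 multichannel WAV."""
    wavfile.write(path, rate, np.asarray(latents, dtype=np.float64))

# Stand-in latent array: 42 data points, 8 latent dimensions, values beyond -1/+1.
latents = np.random.default_rng(0).normal(0.0, 2.0, size=(42, 8))
path = os.path.join(tempfile.mkdtemp(), "latents.wav")
write_latent_wav(path, latents)

rate, restored = wavfile.read(path)
print(rate, restored.shape, restored.dtype)
```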


  • Black Latents | Latent Diffusion

    Black Latents | Latent Diffusion is a Gradio application that uses RAVE-Latent Diffusion models to spawn audio items from Black Latents, a RAVE V2 VAE trained on the Black Plastics series.

    A demo version is accessible on Hugging Face. The full application can be retrieved from GitHub for local inference.


    Latent Diffusion with RAVE

    The RAVE architecture makes timbre transfer on audio input possible, but you can also generate audio by using its decoder layer as a neural audio synthesizer, e.g. in Latent Jamming.

    Another approach to using RAVE to spawn new audio information has been provided by Moisés Horta Valenzuela (aka 𝔥𝔢𝔵𝔬𝔯𝔠𝔦𝔰𝔪𝔬𝔰) with his RAVE-Latent Diffusion model.

    Latent diffusion models in general are quite efficient since they operate on highly compressed representations of the original data. The key idea of RAVE-Latent Diffusion is to replicate the structural coherence of audio information by encoding (longer) audio sequences into their latent representations using a RAVE encoder and then training a denoising diffusion model on these embeddings. The trained model is able to unconditionally generate new, similar sequences of the same length, which can be decoded back into the audio domain using the RAVE model’s decoder.

    The original package by 𝔥𝔢𝔵𝔬𝔯𝔠𝔦𝔰𝔪𝔬𝔰 supports a latent embedding length down to a window size of 2048, which translates to about 95 seconds of audio at 44.1 kHz, suitable for compositional level information.

    In my fork RAVE-Latent Diffusion (Flex’ed), I extended the code to support a minimum of 256, which equals about 12 seconds at 44.1 kHz, and implemented a few other improvements and additional training options.
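
    The relation between window size and duration can be verified with a short calculation. It assumes a compression ratio of 2048 samples per latent data point, the value used in the UnRAVEl example further up on this page; the function name is illustrative:

```python
def window_seconds(latent_length, sample_rate=44100, compression=2048):
    """Duration in seconds covered by `latent_length` latent data points."""
    return latent_length * compression / sample_rate

print(round(window_seconds(2048), 1))  # 95.1
print(round(window_seconds(256), 1))   # 11.9
```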

    Black Latents: turning Black Plastics into a RAVE model

    The motivation for training Black Latents was to extract dominant characteristics from my Black Plastics series, a compilation of 7 EPs with a total of 28 audio tracks in the genres Experimental Techno, Breakbeats and Drum & Bass, which I released between 2012 and 2020.

    I trained the model using the RAVE V2 architecture with a higher capacity of 128 and submitted it to the RAVE model challenge 2025 hosted by IRCAM, where it was publicly voted into first place. The model is available on the Forum IRCAM website.

    Using Black Latents | Latent Diffusion to spawn audio

    For Black Latents | Latent Diffusion, I trained diffusion models in 7 different configurations and context window lengths, again using the audio material from the Black Plastics series as the base data set together with the Black Latents VAE.

    The application itself is a simple Gradio interface to the generate script of RAVE-Latent Diffusion (Flex’ed). In the UI, you can choose from the different diffusion models, define seeds and set additional parameters like temperature or latent normalization before generating audio items through the Black Latents model decoder.

    Depending on the diffusion model and parameter selection, the output ranges from stumbling rhythmic micro structures to items that resemble the macro-scale structure of the training data.

    Other examples

    I published earlier experiments with RAVE-Latent Diffusion and a different set of RAVE models in the form of two albums:

    MARTSMÆN – RLDG_0da02c80cb [datamarts/2KOMMA4]: Bandcamp, Nina

    MARTSM^N – RLDG_835770db1c [datamarts/2KOMMA3]: Bandcamp, Nina

  • Neural network bending in Pure Data

    The practice of bending systems, that is: modifying or disrupting their intended functions, has been a recurring aspect of artistic practice across different cultural contexts. More recently, the bending of neural networks has become a point of interest for researchers and practitioners, driven partly by the desire to expand the models’ generative capabilities through alterations to their underlying structures for processing and reproducing information.

    “One common criticism of using deep generative models in an artistic and creative context, is that they can only reproduce samples that fit the distribution of samples in the training set. However, by introducing deterministic controlled filters into the computation graph during inference, these models can be used to produce a large array of novel results.”

    Broad et al., “Network Bending: Expressive Manipulation of Generative Models in Multiple Domains”, https://www.mdpi.com/1421002

    A few months back I came across Błażej Kotowski’s fork of nn~. It adds new functionality to the nn~ object that exposes neural net layers along with their weights and biases for compatible model architectures (e.g. RAVE, vschaos2, MSPrior or AFTER). It also allows you to modify weights and biases and push them back into the respective layer. That means we can now hack into these models and experiment with network bending in real time, purposefully altering and partly disrupting the capabilities of the model, both in terms of processing and creating audio information.

    Bender abstraction for Pure Data

    Inspired by Błażej’s video, I’ve created an abstraction in Pure Data that can modify the neural net’s data in various ways, such as offsetting, randomizing or inverting values. That component is called Bender and is available on GitHub.

    Since the changes can have a dramatic effect on the sound, I’ve added a method that lets you control the percentage of data points affected when applying adjustments. This makes the results much less extreme, allowing you to bend your neural network in a more subtle way.

    You can select the desired percentage by moving the slider to a position between 0 and 100%. The number of data points is then calculated and evenly distributed within the selected layer. Any adjustments made using the sliders next to the array will only affect these specific data points, not the entire array.
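
    Bender itself is a Pure Data abstraction; the following Python sketch only illustrates the selection logic described above, using NumPy arrays as a stand-in for a layer’s weights. The function names, modes and parameters are assumptions made for illustration and do not mirror Bender’s internals:

```python
import numpy as np

def select_points(n_values, percent):
    """Indices of `percent` % of the values, spread evenly across the layer."""
    count = int(round(n_values * percent / 100))
    if count == 0:
        return np.array([], dtype=int)
    return np.unique(np.linspace(0, n_values - 1, count).round().astype(int))

def bend(weights, percent, mode="offset", amount=0.1, seed=None):
    """Return a copy of `weights` with only the selected values modified."""
    bent = weights.astype(float).ravel()  # astype makes a copy; the input stays untouched
    idx = select_points(bent.size, percent)
    if mode == "offset":
        bent[idx] += amount
    elif mode == "randomize":
        rng = np.random.default_rng(seed)
        bent[idx] = rng.uniform(-amount, amount, size=idx.size)
    elif mode == "invert":
        bent[idx] = -bent[idx]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return bent.reshape(weights.shape)

weights = np.ones((4, 5))                  # stand-in for one layer's weights
bent = bend(weights, percent=25, mode="invert")
print(int((bent == -1).sum()))             # 5 of 20 values inverted
```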

    Limitations

    The number of data points per layer can range from a few thousand to millions, depending on the model’s architecture and training setup. This can impact the real-time performance of network bending, depending on your workstation’s configuration. I haven’t found a practical solution for this issue yet, but it might be addressed in the future.