Disentangler and Decoder Outcome

After nearly a month of training, refactoring, fine-tuning and losing hair, I hit a wall with my disentangler / decoder models. No matter what I did, no matter how sophisticated I made the decoder, I could not get it to reconstruct the high-frequency parts of the voice outputs. My investigations led me to take another look at the embeddings being output by the disentangler, and sure enough, the high-frequency elements were almost non-existent.
The disentangler was doing an amazing job of separating voice from noise. I mean, I was absolutely amazed at how well it was working, and for the longest time I was convinced that my decoder was the problem. The decoder, too, was doing a great job of reconstructing the voice and noise, but everything seemed a bit muffled.
 
To give you an example of how the disentangler / decoder performed, here is some audio that was processed through the disentangler and then through the decoder.
 

The output

 
To give you a sense of what the disentangler had to work with:
 
And finally, this is what I was hoping it would sound like (the original voice clip before mixing in the noise):
 
So you can see that while it did a good job of separating the voice from the noise, the resulting voice was too muffled and lacked the high-frequency vocal elements needed for a crisp, rich reconstruction.
 
Ultimately, I needed to rethink how the disentangler model should be architected, and settled on a slightly more modular and flexible approach. First, I moved away from treating voice and noise with the same set of conditions. This was a major change, and one might say it should have been a design consideration from the start. But starting with the obvious doesn't help me find the limitations I need to understand to build better, more reliable AGI.
 
The enhanced disentangler's new core philosophy has become:
  1. Primary: strong self-supervised learning (NT-Xent + BYOL, same as before) for AGI compatibility
  2. Secondary: supervised guidance using clean voice / noise targets for audio quality
  3. Specialised processing: voice-specific and noise-specific feature extraction ← This is new
  4. Maintains both batch and streaming capabilities for different use cases
 
Further, I pushed the model to be a bit more detailed. Previously I was using 25 ms frames with 10 ms hops for the raw embeddings and 100 Hz for the PCEN. Now it uses a hop length of ~5 ms and a much more feature-rich 400 Hz for the PCEN, giving a stronger temporal dimension after embedding (2.5 ms per embedding frame). Given the inference speeds I saw with the original model, this extra level of detail was still quite achievable without impacting the model's ability to remain real-time.
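As a quick sanity check on the framing change, here is a minimal sketch of the frame-rate arithmetic. The 10 ms → ~5 ms hop and the 100 Hz → 400 Hz PCEN rate come from the paragraph above; the 16 kHz sample rate is my assumption, not something stated here.

```python
# Frame-rate arithmetic only; the sample rate is an assumption,
# everything else comes from the paragraph above.
SAMPLE_RATE = 16_000                      # assumed, not stated in the post

old_hop = int(0.010 * SAMPLE_RATE)        # 10 ms hop  -> 160 samples -> 100 Hz frame rate
new_hop = int(0.005 * SAMPLE_RATE)        # ~5 ms hop  ->  80 samples -> 200 Hz frame rate
pcen_frame_ms = 1000 / 400                # 400 Hz PCEN -> 2.5 ms per embedding frame

print(old_hop, new_hop, SAMPLE_RATE / new_hop, pcen_frame_ms)
```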
 
Before, we had multiple embeddings:
Voice/Noise Raw, Voice/Noise PCEN, Voice/Noise Fused, Raw Features and PCEN Features, all 128-dim (except the two feature sets, which were 64-dim). In the new model I still have multiple embeddings, but a much more feature-focused set per voice / noise path. These are:
 
Voice Embeddings: 160-dim total
├── Pitch Features: 32-dim (F0, harmonic tracking)
├── Harmonic Features: 32-dim (harmonic template matching)
├── Formant Features: 32-dim (vocal tract resonances)
├── VAD Features: 16-dim (voice activity detection)
└── Spectral Features: 48-dim (multi-scale voice patterns)

 

Noise Embeddings: 144-dim total
├── Broadband Features: 32-dim (non-harmonic spectral)
├── Transient Features: 24-dim (onset/impact detection)
├── Environmental Features: 28-dim (environmental sounds)
├── Texture Features: 20-dim (noise pattern analysis)
├── Non-harmonic Features: 24-dim (non-periodic content)
└── Statistical Features: 16-dim (temporal statistics)
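To make the split concrete, here is a minimal sketch of how per-branch embedding heads could be laid out. Only the branch names and dimensions come from the breakdowns above; the shared encoder width and the single-Linear heads are simplifying assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

# Sketch only: branch names/dims are from the post; the shared feature width
# (D_IN) and one Linear per branch are my own simplification.
D_IN = 256  # assumed width of the shared encoder output per frame

VOICE_BRANCHES = {"pitch": 32, "harmonic": 32, "formant": 32,
                  "vad": 16, "spectral": 48}                       # 160-dim total
NOISE_BRANCHES = {"broadband": 32, "transient": 24, "environmental": 28,
                  "texture": 20, "non_harmonic": 24, "statistical": 16}  # 144-dim total

class BranchHeads(nn.Module):
    """Project shared features into named sub-embeddings and concatenate them."""
    def __init__(self, branches: dict, d_in: int = D_IN):
        super().__init__()
        self.heads = nn.ModuleDict({name: nn.Linear(d_in, dim)
                                    for name, dim in branches.items()})

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_in) -> (batch, frames, sum of branch dims)
        return torch.cat([head(x) for head in self.heads.values()], dim=-1)

voice_head = BranchHeads(VOICE_BRANCHES)   # -> 160-dim voice embedding per frame
noise_head = BranchHeads(NOISE_BRANCHES)   # -> 144-dim noise embedding per frame
```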
 
More importantly, the loss weights were skewed much more towards preserving high-frequency elements in the voice path, to ensure reconstruction could be achieved with much more crispness than I saw in the first prototype.
 
The training approach, while largely the same as for the first model, was enhanced with some supervised guidance, but to maintain the primary requirement that this be AGI-ready, I kept that to 20% of the total loss weight.
 

Training Approach:

Total Loss Weights:
├── NTXENT (Contrastive): 50% - Primary disentanglement signal
├── BYOL (Self-supervised): 30% - Robust representation learning
└── Supervised Guidance: 20% - Audio quality bootstrap
 
1. NTXENT Contrastive Loss: TRUE positive embeddings (optimized path)
- Separate voice/noise NT-Xent instances with memory banks
- Pre-computed positive embeddings with gradient flow
2. BYOL Self-supervised: EMA target encoders (prevents collapse)
- Voice/noise target encoders with τ=0.999 momentum
- Stop-gradient on target side for proper BYOL learning
3. Supervised Reconstruction: Multi-component reconstruction loss
- Time-domain MSE loss for accuracy
- Multi-resolution STFT loss (5 scales: 256-4096 FFT)
- Spectral convergence loss for magnitude consistency
- High-frequency emphasis (4-8kHz weighted 4-6x)
4. Separation Loss: Barlow Twins style decorrelation
- Cross-correlation minimization between voice/noise embeddings
- Bounded variance maintenance (prevents collapse)


So the big design changes were all centered around domain-specific feature extraction rather than simply separating voice and noise loosely. 

Training was done using 1.9 million 2-second triplets of clean voice, clean noise and a mix of both, covering 10 languages: 1 million English samples and 100,000 samples from each of the other languages with strong variations from English (Arabic, Mandarin, Swahili etc.).
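For illustration, here is a hypothetical sketch of how one such triplet could be assembled. The post only states that each triplet is a 2-second clean voice clip, a clean noise clip and their mix; the RMS-based scaling and the random SNR range below are my assumptions.

```python
import torch

# Hypothetical triplet construction: (clean_voice, clean_noise, mix).
def make_triplet(voice: torch.Tensor, noise: torch.Tensor,
                 snr_db_range=(-5.0, 20.0)):
    snr_db = torch.empty(1).uniform_(*snr_db_range)   # assumed SNR range
    v_rms = voice.pow(2).mean().sqrt().clamp(min=1e-8)
    n_rms = noise.pow(2).mean().sqrt().clamp(min=1e-8)
    # Scale the noise so the voice/noise power ratio matches the sampled SNR.
    noise_scaled = noise * (v_rms / n_rms) / (10 ** (snr_db / 20))
    mix = voice + noise_scaled
    return voice, noise_scaled, mix
```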

 

It took 78 hours of continuous training and 67,000 steps before the model reached very good metrics and plateaued.

 

Now I am generating the dataset for the new decoder with the new disentangler (1 million samples). While I have done some tests on the embeddings and the numbers look much better (especially the pain point of missing high-frequency content), only after the decoder has been trained am I going to know for sure whether the overhaul has paid off. It's always a tense few days of training and waiting to see if things are improving, but that's part of the journey, I suppose.

 

Training the decoder starts tomorrow (the dataset creation has a few hundred thousand more samples to prepare).

I'm excited to see the results in a few days' time.