Description
With the first Sesame CSM model openly available, we should implement a local example similar to their online research demo. The released CSM model appears to use Kyutai's Mimi audio codec, which we would have to implement in a similar way to the WavTokenizer. Next, we can modify the talk-llama example to support audio generation with the CSM. This way we will be able to plug in any LLM for the text response generation and use Sesame for speech input/output, as sketched below.
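As a rough illustration of how the pieces could fit together, here is a minimal C++ sketch of the proposed loop. None of these interfaces exist yet: `mimi_context`, `mimi_decode`, `csm_context`, `csm_generate`, the model paths, and the helper functions are all hypothetical placeholders for the APIs we would need to design, loosely following the style of the existing examples.

```cpp
// Hypothetical pipeline sketch: speech -> text -> LLM reply -> CSM tokens -> Mimi decode.
// All types and functions below are placeholders, not existing APIs.

#include <cstdint>
#include <string>
#include <vector>

// Mimi codec context, to be implemented with ggml like the WavTokenizer (hypothetical).
struct mimi_context;
mimi_context * mimi_init_from_file(const char * path_model);
// Decode Mimi audio tokens into PCM samples (hypothetical signature).
std::vector<float> mimi_decode(mimi_context * ctx, const std::vector<int32_t> & tokens);

// CSM context: generates Mimi audio tokens from text (hypothetical).
struct csm_context;
csm_context * csm_init_from_file(const char * path_model);
std::vector<int32_t> csm_generate(csm_context * ctx, const std::string & text);

// Placeholders for the parts talk-llama already provides (whisper STT + pluggable LLM).
std::string transcribe_user_audio();
std::string llm_generate_response(const std::string & prompt);

int main() {
    // hypothetical model paths
    csm_context  * csm  = csm_init_from_file("models/csm-1b.gguf");
    mimi_context * mimi = mimi_init_from_file("models/mimi.gguf");

    while (true) {
        // 1. speech -> text with whisper, as talk-llama already does
        const std::string user_text = transcribe_user_audio();

        // 2. text -> text with any LLM plugged in for the response
        const std::string reply = llm_generate_response(user_text);

        // 3. text -> Mimi audio tokens with the CSM
        const std::vector<int32_t> audio_tokens = csm_generate(csm, reply);

        // 4. Mimi tokens -> PCM waveform, then play it back
        const std::vector<float> pcm = mimi_decode(mimi, audio_tokens);
        // play_audio(pcm); // e.g. SDL playback, as in the existing examples
    }
}
```

The key design point is that steps 1 and 2 stay exactly as they are in talk-llama today; only steps 3 and 4 are new, so the CSM + Mimi pair slots in as a drop-in replacement for the current speech-output path.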