The goal is to design a chatbot with an interface that can assist a user with a variety of tasks. Notably, it can
- perform searches on the web
- answer queries regarding an image
- generate images
- answer queries regarding a document
- retain memory of the user's interactions with it.
Additionally, the chatbot needs to be deployable on a scalable backend using Kubernetes.
The project uses state-of-the-art models and inference libraries.
- Mixtral-8x7B-Instruct-v0.1 (6.0bpw/exl2)
- Our main model for the agent; it is also used as the generation model for RAG.
- SDXL-Turbo
- A fast image-generation model that needs only 1-2 denoising steps (see the SDXL-Turbo sketch after this list).
- LLaVA-1.6
- A multi-modal, instruction-following model for describing uploaded images.
- CodeLlama-34b-Instruct-hf (4.65bpw/exl2)
- An instruction-tuned model for code synthesis.
- bge-small-en-v1.5
- An embedding model used for RAG on uploaded documents.
- ExLlamaV2
- A quantization library that claims better compression and inference latency than AWQ and GPTQ, and supports a flexible bits-per-weight (bpw) representation. It is used for the Mixtral and Code Llama models. TabbyAPI provides an OpenAI-compatible API for ExLlamaV2 models (see the client sketch after this list).
- SGLang
- An inference library built on top of vLLM that achieves higher throughput. It has native support for multi-modal models such as LLaVA.
- Text-Embeddings Inference
- An embedding-inference server by Hugging Face, used to host the embedding model (also covered in the client sketch below).
- TorchServe
- A production-grade library for serving PyTorch models, used here for SDXL-Turbo.
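To give a feel for the 1-2 step generation mentioned above, here is a minimal standalone sketch of calling SDXL-Turbo through `diffusers` (in the project the model actually sits behind TorchServe; the prompt and output path are illustrative):

```python
import torch
from diffusers import AutoPipelineForText2Image

# Load SDXL-Turbo in fp16 to keep VRAM usage modest.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# SDXL-Turbo is distilled for 1-4 step sampling; guidance is disabled.
image = pipe(
    prompt="a watercolor painting of a lighthouse at dusk",
    num_inference_steps=1,
    guidance_scale=0.0,
).images[0]
image.save("lighthouse.png")
```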
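Because TabbyAPI (and SGLang) expose OpenAI-compatible endpoints while Text-Embeddings Inference exposes a plain `/embed` route, client code stays simple. A sketch, assuming hypothetical in-cluster service URLs:

```python
import requests
from openai import OpenAI

# Hypothetical in-cluster service URLs; adjust to your deployment.
TABBY_URL = "http://tabbyapi:5000/v1"
TEI_URL = "http://text-embeddings-inference:80"

# TabbyAPI speaks the OpenAI chat-completions protocol, so the stock
# OpenAI client works once base_url (and a dummy key) are pointed at it.
llm = OpenAI(base_url=TABBY_URL, api_key="not-needed")
reply = llm.chat.completions.create(
    model="Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Summarise RAG in one sentence."}],
)
print(reply.choices[0].message.content)

# TEI exposes a plain /embed route returning one vector per input string.
vectors = requests.post(
    f"{TEI_URL}/embed",
    json={"inputs": ["What is in the uploaded document?"]},
).json()
print(len(vectors[0]))  # 384 dimensions for bge-small-en-v1.5
```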
LangChain is used to build the agent.
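A rough sketch of how such an agent can be wired up in LangChain: the tool bodies below are hypothetical placeholders standing in for the search, image-generation, and document-RAG backends, and the endpoint URL is deployment-specific.

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

# The agent LLM is Mixtral behind TabbyAPI's OpenAI-compatible endpoint.
llm = ChatOpenAI(
    base_url="http://tabbyapi:5000/v1",
    api_key="not-needed",
    model="Mixtral-8x7B-Instruct-v0.1",
    temperature=0,
)

# Placeholder tool bodies; each would call one of the backend services.
def web_search(query: str) -> str:
    return f"search results for {query!r}"  # call a search API here

def generate_image(prompt: str) -> str:
    return "path/to/image.png"  # call the SDXL-Turbo service here

def ask_document(question: str) -> str:
    return "answer from RAG"  # embed, retrieve, and answer here

tools = [
    Tool(name="web_search", func=web_search,
         description="Search the web for current information."),
    Tool(name="generate_image", func=generate_image,
         description="Create an image from a text prompt."),
    Tool(name="ask_document", func=ask_document,
         description="Answer questions about the uploaded document."),
]

# Conversation memory gives the agent recall of earlier turns.
memory = ConversationBufferMemory(memory_key="chat_history")
agent = initialize_agent(
    tools, llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    memory=memory,
    verbose=True,
)
print(agent.run("Search for today's weather in Paris, then draw it."))
```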
The frontend is built with Streamlit, and the models are deployed on Amazon EKS.
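A minimal sketch of the Streamlit side: a chat transcript kept in `st.session_state`, with a placeholder `answer()` standing in for the LangChain agent above.

```python
import streamlit as st

st.title("Multi-tool chatbot")

def answer(prompt: str) -> str:
    # Placeholder: route the prompt through the LangChain agent here.
    return f"(echo) {prompt}"

# Keep the transcript in session state so it survives Streamlit reruns.
if "history" not in st.session_state:
    st.session_state.history = []

for role, text in st.session_state.history:
    with st.chat_message(role):
        st.markdown(text)

if prompt := st.chat_input("Ask me anything..."):
    st.session_state.history.append(("user", prompt))
    with st.chat_message("user"):
        st.markdown(prompt)
    reply = answer(prompt)
    st.session_state.history.append(("assistant", reply))
    with st.chat_message("assistant"):
        st.markdown(reply)
```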
Our models total ~73 GB! Each time a node is provisioned, these files are downloaded onto it, which can take significant time, on top of pulling the Docker images needed for inference. A few tricks can save both bandwidth and time: roughly 14 minutes per model!
See details here.
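One generic trick (an illustration, not necessarily the exact approach in the linked write-up) is to pull each model once onto a shared volume and point every pod's Hugging Face cache at it, so weights are not re-downloaded on every node provision:

```python
from huggingface_hub import snapshot_download

# Hypothetical shared mount (e.g. an EFS/FSx-backed PersistentVolume);
# pods point HF_HUB_CACHE at this path so each model is fetched from
# the Hub once instead of on every freshly provisioned node.
CACHE_DIR = "/mnt/model-cache"

for repo in (
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "stabilityai/sdxl-turbo",
    "BAAI/bge-small-en-v1.5",
):
    snapshot_download(repo_id=repo, cache_dir=CACHE_DIR)
```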
- Do not trust LLM benchmarks. A 34B model* with a high score on the HF leaderboard performed extremely poorly as an agent.
- Quantization can lobotomize an LLM; performance degrades even at following instructions or choosing the right tool.
- The same prompts might not work for different models.
* not named for legal reasons 😉