mmgxa/llm_lmm_chatbot

A Multi-Model, Multi-Modal Agent Based Chatbot on EKS

Objective:

The goal is to design a chatbot with an interface that can assist a user with different tasks. Notably, it can

  • perform search on the web
  • answer queries regarding an image
  • generate images
  • answer queries regarding a document
  • retain memory of the user's interactions with it.

Additionally, the chatbot needs to be deployable on a scalable backend using Kubernetes.
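The memory requirement above is commonly implemented as a rolling window over recent chat turns that gets prepended to the next prompt. A minimal sketch of that idea (illustrative only, not the project's actual code; class and method names are assumptions):

```python
from collections import deque


class ChatMemory:
    """Keep only the last `max_turns` (user, assistant) exchanges."""

    def __init__(self, max_turns: int = 5):
        # deque with maxlen silently drops the oldest turn when full
        self.turns = deque(maxlen=max_turns)

    def add(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def as_prompt(self) -> str:
        # Serialize the retained history for prepending to the next LLM prompt.
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


memory = ChatMemory(max_turns=2)
memory.add("Hi", "Hello!")
memory.add("What is EKS?", "Amazon's managed Kubernetes service.")
memory.add("Thanks", "You're welcome!")
print(memory.as_prompt())
```

Windowed memory keeps the prompt size bounded, which matters when every extra token costs inference latency on the backend.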

Tech Stack

The project uses state-of-the-art libraries for inference.

Models used: Mixtral and Code Llama (text), LLaVA (multi-modal), SDXL-Turbo (image generation), plus an embeddings model for document retrieval.

Inference Libraries used:

  • ExLlamaV2
  • A quantization library that claims better compression and lower inference latency than AWQ and GPTQ. It also supports a range of bit widths. It is used for the Mixtral and Code Llama models; TabbyAPI provides an OpenAI-compatible API for ExLlamaV2 models.
  • SGLang
  • An inference library built on top of vLLM that offers higher throughput. It has native support for multi-modal models such as LLaVA.
  • Text-Embeddings Inference
    • An embeddings inference library by HuggingFace to host our embeddings model.
  • TorchServe
  • A production-grade library for serving PyTorch models. It is used for the SDXL-Turbo model.
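Because TabbyAPI exposes an OpenAI-compatible endpoint, a standard chat-completions request works against the self-hosted model. A sketch of building such a request with only the standard library (the service URL, port, and model name below are assumptions, not taken from the project):

```python
import json
import urllib.request

# Assumed in-cluster service address; the real host/port depend on the deployment.
TABBY_URL = "http://tabbyapi.default.svc.cluster.local:5000/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "mixtral") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        TABBY_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("What is Kubernetes?")
payload = json.loads(req.data)
print(payload["messages"][0]["content"])
```

Sticking to the OpenAI wire format means the same client code works whether the agent talks to TabbyAPI, SGLang, or a hosted API.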

LangChain is used for building the agent.
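At its core, an agent like this lets the LLM choose one registered tool per step and routes the chosen name to an implementation. A toy, pure-Python sketch of that dispatch loop (tool names and bodies are illustrative placeholders, not the project's code; the real tools call the deployed models):

```python
from typing import Callable, Dict

# Hypothetical tool implementations standing in for the real model calls.
def web_search(query: str) -> str:
    return f"[search results for: {query}]"

def image_qa(query: str) -> str:
    return f"[LLaVA answer about the image: {query}]"

def generate_image(query: str) -> str:
    return f"[SDXL-Turbo image for: {query}]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": web_search,
    "image_qa": image_qa,
    "generate_image": generate_image,
}


def dispatch(tool_name: str, tool_input: str) -> str:
    """Route the model's chosen tool name to the matching implementation."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        # A malformed tool choice is one of the failure modes noted in
        # the Key Findings below: weak or over-quantized models pick badly.
        return f"Unknown tool: {tool_name}"
    return tool(tool_input)


print(dispatch("search", "latest EKS version"))
```

LangChain wraps this pattern with prompt templates and output parsing, but the tool-selection step is what the Key Findings below stress-test.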

The frontend is built with Streamlit, and the models are deployed on EKS.

Demo


Optimizing for Scaling

Our models total ~73 GB! Each time a node is provisioned, these files must be downloaded onto it, which can take significant time, on top of the Docker images needed for inference. A few tricks can save both bandwidth and time: roughly 14 minutes per model.

See details here.

Key Findings:

  • Do not trust LLM benchmarks. A 34B model* with a high score on the HF leaderboard performed extremely poorly as an agent.
  • Quantization can lobotomize an LLM; performance degrades even at following instructions or choosing the right tool.
  • The same prompts might not work for different models.

* not named for legal reasons 😉

Architecture

Architecture diagram
