This project provides a way to version models, data, and runs using MLflow. It enables data professionals to compare model performance after changes to hyperparameters, application architecture, data ingestion, or the metrics used.
Comparing the performance of a data solution allows the data professional to select a model that balances latency, throughput, reliability, accuracy metrics, and other performance indicators. What matters is not only the model with the best accuracy but also one that can be integrated smoothly into the overall data solution.
However, as always, the first step in deciding whether to use an LLM solution is to analyze whether it is truly necessary and viable for your current goal and overall context. If it is, this repository may help you start the journey of LLMOps model versioning with MLflow.
- **Create and Activate Environment**

  Create and activate a Conda or Virtualenv environment.

- **Install Dependencies**

  First, visit PyTorch's Get Started page to select and install the appropriate CUDA or CPU version for your system.

  Next, install the necessary dependencies and sub-dependencies listed in the `requirements.txt` file by running:

  ```bash
  pip install -r requirements.txt
  ```
- **Start the MLflow Server**

  To set up the MLflow server, you need an artifact store where MLflow saves model artifacts such as trained models and plots. You can use S3, Azure Blob Storage, Google Cloud Storage, or even a shared filesystem. You also need a backend store where the tracking server records experiment data; by default this is the local filesystem, but for more robust use you may want a database such as MySQL or SQLite. This project also configures an authentication mechanism to prevent unauthorized users from accessing the logged experiment data and artifacts. The MLflow server uses port 5000:

  ```bash
  mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlflow-artifacts --app-name basic-auth --port 5000
  ```
- **Configure Authentication**

  To use the `auth.py` file and change the login credentials, create a `.env` file with the following format:

  ```env
  MLFLOW_TRACKING_URI=http://localhost:5000
  MLFLOW_TRACKING_USERNAME=admin
  MLFLOW_TRACKING_PASSWORD=password
  MLFLOW_TRACKING_NEW_PASSWORD=new_password
  ```

  The default login credentials are `admin` and `password`.
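As a rough illustration of what this step can look like in code, the sketch below loads the `.env` values, points MLflow at the secured tracking server, and rotates the admin password through the basic-auth client. This is a sketch only, not the repository's `auth.py`, and it assumes MLflow 2.x with the `basic-auth` app enabled and the `python-dotenv` package installed:

```python
# Sketch only: the real auth.py in this repository may differ.
import os

from dotenv import load_dotenv  # python-dotenv
import mlflow
from mlflow.server import get_app_client

load_dotenv()  # read the MLFLOW_TRACKING_* values from the .env file

# MLflow picks up MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD from the
# environment when talking to a server started with --app-name basic-auth.
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

# Rotate the default admin password using the basic-auth app client.
auth_client = get_app_client("basic-auth", tracking_uri=os.environ["MLFLOW_TRACKING_URI"])
auth_client.update_user_password(
    username=os.environ["MLFLOW_TRACKING_USERNAME"],
    password=os.environ["MLFLOW_TRACKING_NEW_PASSWORD"],
)
```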
The versioning code is divided into files by overall goal: authentication setup, training logging, and inferencing. The functions that are defined and used have docstrings to simplify debugging and code refactoring, and type hints are used for the same reason. The latter are especially useful because some objects from the `torch` and `transformers` libraries are similar, exposing the same methods while having different architectures. For example, `FastTokenizer` is used for smaller input data, while `LazyTokenizer` is used for larger data volumes. Lazy models delay loading the model's full state until it is actually needed, drastically reducing the initial memory footprint.
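As an illustration of the type-hinted style described above, a tokenization helper might look like the following sketch (the function name and signature are hypothetical, not taken from the repository):

```python
import torch
from transformers import BatchEncoding, PreTrainedTokenizerBase


def tokenize_texts(
    texts: list[str],
    tokenizer: PreTrainedTokenizerBase,  # base class covering both fast and slow tokenizers
    max_length: int = 128,
) -> BatchEncoding:
    """Tokenize a batch of texts into padded tensors ready for the model."""
    return tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
```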
- **Tokenization and Padding**

  Using the same base model for tokenization and fine-tuning is important because different LLM architectures may use different tokenization strategies and padding mechanisms. For instance, autoregressive models like GPT-2 do not define a padding token by default, which affects how padding and tokenization work.
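  For a GPT-2-style tokenizer, the usual workaround is to reuse the end-of-sequence token for padding; a brief sketch (not necessarily how this project handles it):

  ```python
  from transformers import AutoTokenizer

  # GPT-2 ships with an EOS token but no pad token by default.
  tokenizer = AutoTokenizer.from_pretrained("gpt2")
  if tokenizer.pad_token is None:
      tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding
  ```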
- **Dataset Preparation**

  The dataset used for fine-tuning is shuffled, tokenized, and saved to the Parquet file format. Parquet datasets are used because they reduce storage space and speed up data loading and processing due to their optimized compression and efficient columnar access. After converting the data to Parquet format, dataloaders are used to handle the data in batches for the fine-tuned model. Training and testing datasets are logged to the run before fine-tuning.
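  A rough sketch of this step, in which the file names, column names, and batch size are illustrative rather than the repository's exact code:

  ```python
  import mlflow
  import pandas as pd
  import torch
  from torch.utils.data import DataLoader, TensorDataset
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

  # Shuffle the raw data and persist the prepared dataset in Parquet format.
  df = pd.read_csv("data/train.csv").sample(frac=1.0, random_state=42)
  df.to_parquet("data/train.parquet", index=False)

  # Log the dataset file to the MLflow run before fine-tuning starts.
  mlflow.log_artifact("data/train.parquet")

  # Tokenize and wrap the tensors in a DataLoader that feeds the model in batches.
  encodings = tokenizer(df["text"].tolist(), padding=True, truncation=True, return_tensors="pt")
  dataset = TensorDataset(encodings["input_ids"],
                          encodings["attention_mask"],
                          torch.tensor(df["label"].values))
  train_loader = DataLoader(dataset, batch_size=16, shuffle=True)
  ```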
- **GPU Check**

  Since both the data and the model will be moved to the GPU (if available), the script checks whether CUDA is available on the system.
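  The check itself is a short sketch along these lines:

  ```python
  import torch

  # Use the GPU when CUDA is available, otherwise fall back to the CPU.
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.to(device)  # `model` here stands for the model being fine-tuned
  ```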
- **Optimizer**

  For training, the AdamW optimizer is used to drive the gradient descent updates. AdamW is preferred over Adam because it applies weight decay directly to the parameters (decoupled from the gradient update) rather than folding it into the gradients as an L2 penalty, resulting in more stable training.
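  For instance, with illustrative learning-rate and weight-decay values:

  ```python
  from torch.optim import AdamW

  # Weight decay is applied directly to the parameters, decoupled from the gradient step.
  optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
  ```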
- **Performance Logging**

  During training, the current model accuracy, precision, recall, and F1 score are logged. While evaluating, gradient calculation and dropout layers are disabled, and batch normalization switches to the mean and variance accumulated during training. This allows the data professional to track how these performance criteria change from epoch to epoch.

After training, both the base torch model and the fine-tuned model are logged and registered for later use.
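A condensed sketch of the evaluation and logging loop described above, assuming a `transformers`-style classification model and scikit-learn metrics (the helper name is hypothetical):

```python
import mlflow
import torch
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def evaluate_and_log(model, test_loader, device, epoch):
    """Evaluate the model and log epoch-level metrics to the active MLflow run."""
    model.eval()  # disables dropout; batch norm switches to its running statistics
    preds, labels = [], []
    with torch.no_grad():  # gradient calculation is disabled during evaluation
        for input_ids, attention_mask, batch_labels in test_loader:
            logits = model(input_ids.to(device),
                           attention_mask=attention_mask.to(device)).logits
            preds.extend(logits.argmax(dim=-1).cpu().tolist())
            labels.extend(batch_labels.tolist())

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="weighted")
    mlflow.log_metrics(
        {"accuracy": accuracy_score(labels, preds),
         "precision": precision,
         "recall": recall,
         "f1": f1},
        step=epoch,
    )
```

After the loop, the models can be logged and registered with `mlflow.pytorch.log_model(..., registered_model_name=...)`, which is what makes them retrievable by name and version later on.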
Each of the models registered during the previous stage is loaded from the MLflow artifact store. Two simple demonstration examples are defined to showcase the difference in results produced by the fine-tuning process. Because the demonstration examples are small, the fast version of the `distilbert-base-uncased` tokenizer is used to tokenize the inputs for the registered base torch model. Both models are then used for inference separately, and their results are compared.
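Loading a registered model for inference goes through the model registry URI. A sketch, assuming the registered models are `transformers` sequence-classification models and using placeholder registry names and inputs:

```python
import mlflow.pytorch
import torch
from transformers import AutoTokenizer

# Load specific versions of the registered models from the MLflow model registry.
base_model = mlflow.pytorch.load_model("models:/base-distilbert/1")
fine_tuned_model = mlflow.pytorch.load_model("models:/fine-tuned-distilbert/1")

# Fast tokenizer for the small demonstration inputs.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)
inputs = tokenizer(["This movie was great!", "This movie was terrible."],
                   padding=True, return_tensors="pt")

# Run both models on the same inputs and compare their predictions.
with torch.no_grad():
    base_preds = base_model(**inputs).logits.argmax(dim=-1)
    tuned_preds = fine_tuned_model(**inputs).logits.argmax(dim=-1)
print(base_preds.tolist(), tuned_preds.tolist())
```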
After inference, a new run is executed to log and register a new version of the base model. MLflow can keep multiple versions of a registered model, allowing the data professional to switch to the preferred one depending on context and goals. After the new version of the base torch model is logged, the previous version is deleted, and, to complete the demonstration, the registered model itself is deleted as well.
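Deleting a model version, or the registered model itself, goes through the MLflow client API; the model name and version below are placeholders:

```python
from mlflow import MlflowClient

client = MlflowClient()

# Remove a single version of a registered model...
client.delete_model_version(name="base-distilbert", version="1")

# ...or remove the registered model entirely, along with all of its remaining versions.
client.delete_registered_model(name="base-distilbert")
```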
"With words one man can make another blessed, or drive him to despair" — Sigmund Freud