Nimify Anything

Python 3.10+ | NVIDIA NIM | License: MIT | Docker | CI/CD

CLI that wraps any ONNX or TensorRT engine into an NVIDIA NIM microservice with auto-generated OpenAPI + Prometheus metrics. Turn any model into a production-ready API in seconds.

🚀 Overview

NVIDIA NIM (NVIDIA Inference Microservices) added open MoE models in July 2025, but developers still hand-roll deployment configs. Nimify automates the entire process:

  • One-command deployment from model file to production API
  • Auto-generated OpenAPI with type-safe clients
  • Built-in monitoring via Prometheus/Grafana
  • Smart autoscaling based on latency and GPU utilization
  • Helm charts for Kubernetes deployment

⚡ Quick Demo

# Transform any ONNX model into a NIM service
nimify create my-model.onnx --name my-service --port 8080

# Deploy to Kubernetes with autoscaling
nimify deploy my-service --replicas 3 --autoscale

# Access your API
curl http://localhost:8080/v1/predict -d '{"input": [1, 2, 3]}'
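
A successful call returns the model's outputs as JSON; for the toy input above the response would look roughly like this (illustrative only; the exact fields follow the generated schema, e.g. a predictions array):

{
  "predictions": [0.12, 0.85, 0.03]
}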

Nimify Demo

📋 Requirements
# Core dependencies
python>=3.10
onnx>=1.16.0
onnxruntime-gpu>=1.18.0
tensorrt>=10.0.0
tritonclient>=2.45.0
nvidia-pyindex

# API & Infrastructure
fastapi>=0.110.0
uvicorn>=0.30.0
pydantic>=2.0.0
prometheus-client>=0.20.0

# Deployment tools (installed separately, not via pip)
docker>=24.0.0
kubernetes>=29.0.0
helm>=3.14.0

πŸ› οΈ Installation

From PyPI

pip install nimify-anything

From Source

git clone https://github.com/danieleschmidt/nimify-anything.git
cd nimify-anything
pip install -e .

Verify Installation

# Check version and dependencies
nimify --version
nimify doctor

🚦 Usage Examples

Basic Model Wrapping

# ONNX model
nimify create model.onnx --name my-classifier

# TensorRT engine
nimify create model.trt --name my-detector --input-shapes "images:3,224,224"

# Hugging Face model
nimify create "facebook/bart-large-mnli" --source huggingface

Advanced Configuration

# Create with custom settings
nimify create model.onnx \
    --name sentiment-analyzer \
    --port 8080 \
    --max-batch-size 32 \
    --dynamic-batching \
    --gpu-memory 4GB \
    --metrics-port 9090

Python API

from nimify import Nimifier, ModelConfig

# Configure model
config = ModelConfig(
    name="my-model",
    max_batch_size=64,
    dynamic_batching=True,
    preferred_batch_sizes=[8, 16, 32, 64],
    max_queue_delay_microseconds=100
)

# Create NIM service
nim = Nimifier(config)
service = nim.wrap_model(
    "model.onnx",
    input_schema={"input": "float32[?,3,224,224]"},
    output_schema={"predictions": "float32[?,1000]"}
)

# Generate artifacts
service.generate_openapi("openapi.json")
service.generate_helm_chart("./helm/my-model")
service.build_container("myregistry/my-model:latest")

πŸ—οΈ Architecture

┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│   Model File    │────▶│   Nimifier   │────▶│  NIM Service    │
│ (ONNX/TRT/HF)   │     │   Engine     │     │  (Container)    │
└─────────────────┘     └──────────────┘     └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│ Model Analysis  │     │   Triton     │     │   Kubernetes    │
│                 │     │   Config     │     │   Deployment    │
└─────────────────┘     └──────────────┘     └─────────────────┘

🎯 Features

Auto-Generated OpenAPI

# Generated openapi.yaml
openapi: 3.0.0
info:
  title: my-model NIM API
  version: 1.0.0
paths:
  /v1/predict:
    post:
      summary: Run inference
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                input:
                  type: array
                  items:
                    type: number
      responses:
        "200":
          description: Successful prediction
          content:
            application/json:
              schema:
                type: object
                properties:
                  predictions:
                    type: array
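
Because this is a standard OpenAPI 3.0 document, the type-safe clients mentioned in the overview can be produced with any off-the-shelf generator; for example, with the openapi-python-client package (the openapi.json path matches the Python API example above):

# Generate a typed Python client from the exported spec
pip install openapi-python-client
openapi-python-client generate --path openapi.json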

Prometheus Metrics

# Automatically exposed metrics:
# - nim_request_duration_seconds
# - nim_request_count_total
# - nim_batch_size_histogram
# - nim_gpu_utilization_percent
# - nim_model_loading_time_seconds
# - nim_queue_size
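
Once Prometheus scrapes the service, these metrics answer the usual operational questions. A few illustrative queries (assuming nim_request_duration_seconds is exported as a histogram with the standard _bucket series):

# p99 request latency over the last 5 minutes
histogram_quantile(0.99, sum(rate(nim_request_duration_seconds_bucket[5m])) by (le))

# Throughput in requests per second
sum(rate(nim_request_count_total[1m]))

# Mean GPU utilization across replicas
avg(nim_gpu_utilization_percent)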

Smart Autoscaling

# Generated HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: nim_gpu_utilization_percent
      target:
        type: AverageValue
        averageValue: "80"  # percent, served via the custom metrics API
  - type: Pods
    pods:
      metric:
        name: nim_request_duration_seconds_p99
      target:
        type: AverageValue
        averageValue: "100m"  # 100ms
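
Both metrics in this HPA are pod-level custom metrics, so the cluster needs a custom metrics API such as the Prometheus Adapter. A minimal sketch of an adapter rule that derives nim_request_duration_seconds_p99 from the latency histogram (assumes the Prometheus Adapter is installed and that nim_request_duration_seconds is exported as a histogram; this rule is not generated by nimify):

# prometheus-adapter config excerpt (sketch)
rules:
  - seriesQuery: 'nim_request_duration_seconds_bucket{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^nim_request_duration_seconds_bucket$"
      as: "nim_request_duration_seconds_p99"
    metricsQuery: >-
      histogram_quantile(0.99,
        sum(rate(nim_request_duration_seconds_bucket{<<.LabelMatchers>>}[2m]))
        by (<<.GroupBy>>, le))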

🐳 Container Building

Automatic Dockerfile Generation

# Auto-generated Dockerfile
FROM nvcr.io/nvidia/tritonserver:24.06-py3

# Install NIM runtime
RUN pip install nvidia-nim-runtime

# Copy model and config
COPY model_repository/ /models/
COPY nim_config.pbtxt /models/my-model/config.pbtxt

# Expose ports
EXPOSE 8000 8001 8002

# Launch Triton with NIM
CMD ["tritonserver", "--model-repository=/models", "--nim-mode"]

Build and Push

# Build optimized container
nimify build my-model --optimize --tag myregistry/my-model:v1

# Push to registry
nimify push myregistry/my-model:v1

# Or use GitHub Actions
nimify generate-ci --platform github

☸️ Kubernetes Deployment

Helm Chart Generation

# Generate production-ready Helm chart
nimify helm create my-model --values prod-values.yaml

# Deploy to Kubernetes
helm install my-model ./my-model-chart \
  --namespace nim \
  --set image.tag=v1 \
  --set autoscaling.enabled=true

Generated Resources

# values.yaml
replicaCount: 3

image:
  repository: myregistry/my-model
  tag: latest
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80
  targetPort: 8000

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetGPUUtilizationPercentage: 80
  targetLatencyMilliseconds: 100

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 16Gi
  requests:
    nvidia.com/gpu: 1
    memory: 8Gi

monitoring:
  prometheus:
    enabled: true
    port: 9090
  grafana:
    enabled: true
    dashboards:
      - nim-overview
      - gpu-metrics

📊 Monitoring Dashboard

Grafana Integration

# Deploy Grafana dashboard
nimify grafana deploy --model my-model

# Access dashboard
kubectl port-forward svc/grafana 3000:3000

Pre-built Dashboards

  • Request latency P50/P95/P99
  • Throughput (requests/sec)
  • GPU utilization and memory
  • Batch size distribution
  • Queue depth and wait times
  • Error rates and types

🔧 Advanced Features

Multi-Model Serving

# Create ensemble service
nimify ensemble create \
  --name multi-stage-pipeline \
  --models preprocessor:preprocess.onnx \
           detector:yolov8.trt \
           classifier:resnet50.onnx \
  --pipeline sequential

A/B Testing

import nimify
from nimify import ABTestConfig

# Configure A/B test
ab_config = ABTestConfig(
    variants={
        "control": "model_v1.onnx",
        "treatment": "model_v2.onnx"
    },
    traffic_split={"control": 0.8, "treatment": 0.2},
    metrics=["latency", "accuracy"]
)

nimify.create_ab_test("my-experiment", ab_config)

Custom Preprocessors

from nimify import Preprocessor

@Preprocessor.register("image_normalize")
def normalize_image(input_data):
    """Custom preprocessing logic"""
    return (input_data - 127.5) / 127.5

# Use in configuration
nimify create model.onnx \
  --preprocessor image_normalize \
  --postprocessor argmax
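
The --postprocessor argmax flag implies a matching registration hook for outputs. A sketch of what that could look like, mirroring the Preprocessor pattern above (the Postprocessor class is an assumption here, not a documented API):

import numpy as np
from nimify import Postprocessor  # hypothetical counterpart to Preprocessor

@Postprocessor.register("argmax")
def argmax_labels(output_data):
    """Reduce raw class scores to the index of the top class per sample."""
    return np.argmax(np.asarray(output_data), axis=-1).tolist()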

🧪 Testing & Validation

Load Testing

# Run built-in load test
nimify loadtest my-model \
  --concurrent-users 100 \
  --duration 5m \
  --rps 1000

# Generate report
nimify loadtest report --output performance.html

Model Validation

from nimify import ModelValidator

validator = ModelValidator()

# Validate model serving
results = validator.validate(
    model_path="model.onnx",
    test_data="test_samples.json",
    checks=["output_shape", "latency", "throughput"]
)

assert results.passed, f"Validation failed: {results.errors}"

🔄 CI/CD Integration

GitHub Actions

# .github/workflows/nimify.yml
name: Build and Deploy NIM

on:
  push:
    branches: [main]

jobs:
  nimify:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    
    - name: Install Nimify
      run: pip install nimify-anything
    
    - name: Build NIM service
      run: |
        nimify create model.onnx --name my-model
        nimify build my-model --tag ${{ github.sha }}
    
    - name: Deploy to Kubernetes
      run: |
        nimify deploy my-model \
          --image my-model:${{ github.sha }} \
          --wait

GitLab CI

# .gitlab-ci.yml
stages:
  - build
  - deploy

# Install the CLI in every job (assumes an image with Python available)
default:
  before_script:
    - pip install nimify-anything

build-nim:
  stage: build
  script:
    - nimify create $MODEL_PATH --name $SERVICE_NAME
    - nimify build $SERVICE_NAME --tag $CI_COMMIT_SHA
    - nimify push $REGISTRY/$SERVICE_NAME:$CI_COMMIT_SHA

deploy-nim:
  stage: deploy
  script:
    - nimify deploy $SERVICE_NAME --image $REGISTRY/$SERVICE_NAME:$CI_COMMIT_SHA

🎯 Real-World Examples

Computer Vision Pipeline

# Object detection service
nimify create yolov8.onnx \
  --name object-detector \
  --input-type image \
  --output-type bounding-boxes \
  --preprocessing resize,normalize \
  --postprocessing nms

# Deploy with GPU optimization
nimify deploy object-detector \
  --gpu-memory 8GB \
  --tensorrt-optimization aggressive
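
A client call against the deployed detector could then look like the following (hypothetical request shape; the actual payload format is defined by the generated OpenAPI schema for --input-type image):

import base64
import requests

# Hypothetical host and payload layout; adjust to the generated schema
with open("street.jpg", "rb") as f:
    payload = {"input": base64.b64encode(f.read()).decode("ascii")}

resp = requests.post("http://object-detector:8000/v1/predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())  # e.g. bounding boxes with class ids and confidence scores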

NLP Service

# Text classification
nimify create bert-sentiment.onnx \
  --name sentiment-analyzer \
  --input-type text \
  --tokenizer bert-base-uncased \
  --max-sequence-length 512

Time Series Prediction

# Financial forecasting
nimify create lstm-forecast.onnx \
  --name stock-predictor \
  --input-shape "sequence:30,features:5" \
  --output-shape "predictions:5" \
  --streaming-mode true

🤝 Contributing

We welcome contributions! Priority areas:

  • Additional model format support
  • Custom metric collectors
  • Cloud provider integrations
  • Performance optimizations
  • Documentation improvements

See CONTRIBUTING.md for guidelines.

📄 Citation

@software{nimify_anything,
  title={Nimify Anything: Automated NVIDIA NIM Service Generation},
  author={Your Name},
  year={2025},
  url={https://github.com/danieleschmidt/nimify-anything}
}

πŸ† Acknowledgments

  • NVIDIA for the NIM framework
  • The Triton Inference Server team
  • Contributors to the Kubernetes ecosystem

πŸ“ License

MIT License - See LICENSE for details.

🔗 Resources

📧 Contact
