AI Model Inferencing on Oracle Cloud Infrastructure
Running Mistral AI Models on Oracle Cloud Infrastructure
A step-by-step guide to deploying Mistral language models on OCI
Introduction
Oracle Cloud Infrastructure (OCI) offers powerful compute instances with GPU acceleration, making it an excellent platform for running large language models. In this guide, I’ll walk through the process of setting up an OCI instance to run Mistral AI models using vLLM for efficient inference.
Acknowledgements
- Git repo by Bogdan Bazarca using Oracle’s Resource Manager
- The Oracle Solution NACI-AI-CN-DEV team
Prerequisites
Before starting, you’ll need:
- An Oracle Cloud account with access to GPU instances (A10 recommended)
- Your Hugging Face access token to download Mistral models
- Basic familiarity with Linux and cloud environments
Step 1: Setting Up Your OCI Instance
First, provision an appropriate compute instance in OCI:
- Navigate to the OCI Console and create a new compute instance
- Select an A10 GPU shape (VM.GPU.A10.1 or VM.GPU.A10.2)
- Choose Oracle Linux 8 or compatible image
- Set up network access, ensuring port 8888 is accessible for Jupyter Notebook
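If you prefer to provision the instance programmatically instead of through the Console, the OCI Python SDK can launch the same shape. The following is a minimal sketch, assuming you have a local OCI config file; the OCIDs, availability domain, and key path are placeholders you must replace with your own values:
# Minimal sketch: launch a VM.GPU.A10.1 instance with the OCI Python SDK.
# All OCIDs below are placeholders, not values from this guide.
import oci

config = oci.config.from_file()  # reads ~/.oci/config by default
compute = oci.core.ComputeClient(config)

details = oci.core.models.LaunchInstanceDetails(
    availability_domain="YOUR_AD_NAME",
    compartment_id="ocid1.compartment.oc1..example",
    display_name="mistral-a10",
    shape="VM.GPU.A10.1",
    source_details=oci.core.models.InstanceSourceViaImageDetails(
        image_id="ocid1.image.oc1..example"  # Oracle Linux 8 image OCID for your region
    ),
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="ocid1.subnet.oc1..example",
        assign_public_ip=True,
    ),
    metadata={"ssh_authorized_keys": open("/path/to/public_key.pub").read()},
)

instance = compute.launch_instance(details).data
print("Launched:", instance.id, instance.lifecycle_state)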
Connecting to Your Instance
After provisioning, you’ll need to connect to your instance via SSH using the key pair you uploaded or generated during instance creation. Here are common connection methods:
Standard SSH Connection (Recommended for Production)
For secure connections with proper host verification:
ssh -i /path/to/private_key opc@your_instance_ip
Development/Automation SSH Connection
For development environments or automated scripts where interactive prompts would break automation:
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /path/to/private_key opc@your_instance_ip
Security Note: This command bypasses SSH host verification and should only be used in development environments or temporary instances. It skips the security check that helps prevent man-in-the-middle attacks and doesn’t save the host key to your known_hosts file. For production environments, use the standard connection method.
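If you script the connection from Python rather than the shell, paramiko offers the same trade-off: AutoAddPolicy below is roughly the Python equivalent of StrictHostKeyChecking=no, so treat this as a development-only sketch (the host IP and key path are placeholders):
# Development-only sketch: connect via paramiko and run a quick GPU check
import paramiko

client = paramiko.SSHClient()
# Auto-accepting unknown host keys mirrors StrictHostKeyChecking=no; avoid in production
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname="your_instance_ip", username="opc",
               key_filename="/path/to/private_key", timeout=15)

stdin, stdout, stderr = client.exec_command("nvidia-smi --query-gpu=name,memory.total --format=csv")
print(stdout.read().decode())
client.close()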
Step 2: System Setup
The setup below handles the following system configuration:
Package Installation
The script installs essential packages for development:
dnf install -y dnf-utils zip unzip gcc
dnf install -y python3-devel
dnf install -y rust cargo
Docker Installation
Docker is required for container management:
dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
dnf remove -y runc
dnf install -y docker-ce --nobest
systemctl enable docker.service
NVIDIA Toolkit Setup
For GPU acceleration:
# Add NVIDIA's container toolkit repository first, if it isn't already configured
dnf config-manager --add-repo=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
dnf install -y nvidia-container-toolkit
systemctl start docker.service
Step 3: Python Environment Configuration
The script sets up Python 3.10 and Conda for dependency management:
# Install Python 3.10.6
wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tar.xz
tar -xf Python-3.10.6.tar.xz
cd Python-3.10.6/
./configure --enable-optimizations
make -j $(nproc)
sudo make altinstall
# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -u -p ~/miniconda3
Step 4: Creating a Conda Environment for Mistral
A dedicated environment ensures all dependencies are properly managed:
conda create -n mistral python=3.10.9 -y
conda activate mistral
conda install pip -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install jupyter vllm huggingface-hub tqdm gradio
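Before moving on, it’s worth confirming inside the mistral environment that PyTorch can actually see the A10; a quick check along these lines is enough:
# Sanity check: confirm the CUDA build of PyTorch sees the GPU
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))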
Step 5: Downloading the Mistral Model
For this step you’ll need to visit Hugging Face and create an account (or sign in). Navigate to Settings -> Access Tokens and create the necessary User Access Token. Next, navigate to the latest Mistral model (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The download uses two values:
- APIKEYVAL: Your Hugging Face access token
- MODEL: The specific Mistral model to use (e.g., “Mistral-7B-Instruct-v0.2”)
The model is downloaded using the Hugging Face Hub API:
import os
from huggingface_hub import snapshot_download

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # Replace with your model
local_dir = "/home/opc/models/Mistral-7B-Instruct-v0.2"
access_token = "YOUR_HF_ACCESS_TOKEN"  # your Hugging Face token (APIKEYVAL)

os.makedirs(local_dir, exist_ok=True)

# Download the full snapshot (weights, tokenizer, config) into local_dir
snapshot_download(
    repo_id=model_name,
    local_dir=local_dir,
    token=access_token
)
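After the download completes, a quick listing confirms the snapshot actually landed in local_dir (exact file names vary by model revision, so this simply prints what is there):
# List the downloaded snapshot to confirm weights, tokenizer, and config are present
from pathlib import Path

local_dir = Path("/home/opc/models/Mistral-7B-Instruct-v0.2")
for p in sorted(local_dir.iterdir()):
    print(f"{p.name:40s} {p.stat().st_size / 1e6:,.1f} MB")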
Step 6: Launching the Model Server
vLLM is used to serve the model with an OpenAI-compatible API:
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model /home/opc/models/Mistral-7B-Instruct-v0.2 \
--tokenizer hf-internal-testing/llama-tokenizer \
--max-model-len 16384 \
--enforce-eager \
--gpu-memory-utilization 0.8 \
--max-num-seqs 2
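Once the server is up, vLLM’s OpenAI-compatible API also exposes a models listing, which makes a convenient health check from the instance itself (port 8000 is vLLM’s default):
# Quick health check against the vLLM server's OpenAI-compatible API
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print("Serving:", model["id"])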
Step 7: Setting Up Jupyter Notebook
A Jupyter Notebook server is started for interactive development:
jupyter notebook --ip=0.0.0.0 --port=8888
Configuring Firewall Access
For security reasons, most cloud instances have firewalls enabled by default. To allow access to Jupyter Notebook’s port, you’ll need to configure the firewall:
# On Oracle Linux with firewalld
sudo firewall-cmd --permanent --add-port=8888/tcp
sudo firewall-cmd --reload
In addition to the local firewall, make sure to configure your OCI Security List to allow inbound traffic on port 8888 from your desired source (either your IP address for better security, or 0.0.0.0/0 for open access).
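To confirm from your own machine that both the local firewall and the security list are configured, a simple TCP check against the relevant ports (22 for SSH, 8888 for Jupyter, 8000 for vLLM) is enough; the IP below is a placeholder:
# Check which ports on the instance are reachable from your machine
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

instance_ip = "your_instance_ip"  # replace with the instance's public IP
for port in (22, 8888, 8000):
    print(f"port {port}: {'reachable' if port_open(instance_ip, port) else 'blocked or closed'}")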
Step 8: Testing Your Deployment
The script creates two Jupyter Notebooks for testing:
- A simple API request notebook
- A Gradio interface for interactive model testing (a minimal sketch follows the API request example below)
Example API Request:
import requests
import json
import os

# MODEL is set by the setup script; fall back to the model used in this guide
model = os.getenv("MODEL", "Mistral-7B-Instruct-v0.2")
url = "http://0.0.0.0:8000/v1/chat/completions"

headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

data = {
    "model": f"/home/opc/models/{model}",
    "messages": [{"role": "user", "content": "Write a short conclusion."}],
    "max_tokens": 64
}

response = requests.post(url, headers=headers, json=data)
result = response.json()
print("Response:", json.dumps(result, indent=4))
Performance Considerations
- The A10 GPU (24 GB) provides excellent performance for 7B-parameter models; 13B-class models generally need the two-GPU VM.GPU.A10.2 shape or quantization
- Use the --max-model-len parameter to adjust context length based on your needs
- Adjust --gpu-memory-utilization to balance performance and stability (a rough memory estimate follows this list)
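As a rough back-of-envelope for why 0.8 is a reasonable starting point on a single 24 GB A10 (approximate numbers, not measured values):
# Back-of-envelope GPU memory budget for Mistral-7B in fp16 on one 24 GB A10
params_billion = 7.24      # approximate parameter count of Mistral-7B
bytes_per_param = 2        # fp16
weights_gb = params_billion * bytes_per_param      # ~14.5 GB of weights

gpu_gb = 24
utilization = 0.8          # matches --gpu-memory-utilization 0.8
budget_gb = gpu_gb * utilization                   # memory vLLM is allowed to use

print(f"weights ~{weights_gb:.1f} GB, vLLM budget {budget_gb:.1f} GB, "
      f"~{budget_gb - weights_gb:.1f} GB left for KV cache and activations")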
Troubleshooting
Model and Server Issues
Common issues and solutions:
- Model download fails: Verify your Hugging Face access token has proper permissions
- Server fails to start: Check GPU memory usage and adjust utilization parameter
- Slow inference: Consider optimizing batch size or using a different precision format
SSH Connectivity Issues
When connecting to your OCI instance:
- Permission denied errors:
  - Ensure your private key has the correct permissions: chmod 600 /path/to/private_key
  - Verify you’re using the correct username (the default is opc for Oracle Linux)
- Connection timeouts:
  - Verify the instance is running in the OCI Console
  - Check security list rules to ensure SSH port 22 is open
  - Try connecting from the OCI Cloud Shell to rule out local network issues
- Host key verification failures:
  - If the instance was recreated with the same IP, remove the old key from ~/.ssh/known_hosts
  - Run: ssh-keygen -R your_instance_ip
- Automation script connection issues:
  - For scripts that need non-interactive connections, use the StrictHostKeyChecking=no option
  - Ensure your script handles SSH timeouts and retries appropriately (a minimal retry sketch follows this list)
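For the automation case, a small retry wrapper covers transient timeouts (for example while a freshly launched instance is still booting). This is a sketch using paramiko, with the same development-only caveat about skipping host key verification:
# Sketch: retry an SSH connection a few times before giving up (development use)
import time
import paramiko

def connect_with_retries(host, key_path, user="opc", attempts=5, delay=10):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # dev/automation only
    for attempt in range(1, attempts + 1):
        try:
            client.connect(hostname=host, username=user,
                           key_filename=key_path, timeout=15)
            return client
        except (paramiko.SSHException, OSError) as exc:
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            time.sleep(delay)
    raise RuntimeError(f"could not reach {host} after {attempts} attempts")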
Conclusion
With this setup, you now have a fully functional Mistral AI model running on OCI with an OpenAI-compatible API endpoint. This environment is perfect for development, testing, or even small-scale production deployments of Mistral models.
Next Steps
- Explore fine-tuning Mistral models for your specific use case
- Implement proper security for production deployments
- Set up monitoring and logging for your model server
- Optimize performance for your specific workload
Happy modeling!