AI Model Inferencing on Oracle Cloud Infrastructure
Running Mistral AI Models on Oracle Cloud Infrastructure
A step-by-step guide to deploying Mistral language models on OCI
Introduction
Oracle Cloud Infrastructure (OCI) offers powerful compute instances with GPU acceleration, making it an excellent platform for running large language models. In this guide, I’ll walk through the process of setting up an OCI instance to run Mistral AI models using vLLM for efficient inference.
Acknowledgements
- Git repo by Bogdan Bazarca using Oracle’s Resource Manager
- The Oracle Solution NACI-AI-CN-DEV team
Prerequisites
Before starting, you’ll need:
- An Oracle Cloud account with access to GPU instances (A10 recommended)
- Your Hugging Face access token to download Mistral models
- Basic familiarity with Linux and cloud environments
Step 1: Setting Up Your OCI Instance
First, provision an appropriate compute instance in OCI:
- Navigate to the OCI Console and create a new compute instance
- Select an A10 GPU shape (VM.GPU.A10.1 or VM.GPU.A10.2)
- Choose Oracle Linux 8 or compatible image
- Set up network access, ensuring port 8888 is accessible for Jupyter Notebook
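If you prefer to provision the instance programmatically instead of through the Console, the OCI Python SDK can launch the same shape. The following is a minimal sketch, assuming you have a local OCI config file; the OCIDs, availability domain, and key path are placeholders you must replace with your own values:
# Minimal sketch: launch a VM.GPU.A10.1 instance with the OCI Python SDK.
# All OCIDs below are placeholders, not values from this guide.
import oci

config = oci.config.from_file()  # reads ~/.oci/config by default
compute = oci.core.ComputeClient(config)

details = oci.core.models.LaunchInstanceDetails(
    availability_domain="YOUR_AD_NAME",
    compartment_id="ocid1.compartment.oc1..example",
    display_name="mistral-a10",
    shape="VM.GPU.A10.1",
    source_details=oci.core.models.InstanceSourceViaImageDetails(
        image_id="ocid1.image.oc1..example"  # Oracle Linux 8 image OCID for your region
    ),
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="ocid1.subnet.oc1..example",
        assign_public_ip=True,
    ),
    metadata={"ssh_authorized_keys": open("/path/to/public_key.pub").read()},
)

instance = compute.launch_instance(details).data
print("Launched:", instance.id, instance.lifecycle_state)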
Connecting to Your Instance
After provisioning, you’ll need to connect to your instance via SSH using the key pair you uploaded or generated during instance creation. Here are common connection methods:
Standard SSH Connection (Recommended for Production)
For secure connections with proper host verification:
ssh -i /path/to/private_key opc@your_instance_ip
Development/Automation SSH Connection
For development environments or automated scripts where interactive prompts would break automation:
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /path/to/private_key opc@your_instance_ip
Security Note: This command bypasses SSH host verification and should only be used in development environments or temporary instances. It skips the security check that helps prevent man-in-the-middle attacks and doesn’t save the host key to your known_hosts file. For production environments, use the standard connection method.
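If you script the connection from Python rather than the shell, paramiko offers the same trade-off: AutoAddPolicy below is roughly the Python equivalent of StrictHostKeyChecking=no, so treat this as a development-only sketch (the host IP and key path are placeholders):
# Development-only sketch: connect via paramiko and run a quick GPU check
import paramiko

client = paramiko.SSHClient()
# Auto-accepting unknown host keys mirrors StrictHostKeyChecking=no; avoid in production
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname="your_instance_ip", username="opc",
               key_filename="/path/to/private_key", timeout=15)

stdin, stdout, stderr = client.exec_command("nvidia-smi --query-gpu=name,memory.total --format=csv")
print(stdout.read().decode())
client.close()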
Step 2: System Setup
The setup below handles the following system configuration:
Package Installation
The script installs essential packages for development:
dnf install -y dnf-utils zip unzip gcc
dnf install -y python3-devel
dnf install -y rust cargo
Docker Installation
Docker is required for container management:
dnf config-manager --add-repo=https://download.docker.com/linux/centos/docker-ce.repo
dnf remove -y runc
dnf install -y docker-ce --nobest
systemctl enable docker.service
NVIDIA Toolkit Setup
For GPU acceleration:
# Add NVIDIA's container toolkit repository first, if it isn't already configured
dnf config-manager --add-repo=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
dnf install -y nvidia-container-toolkit
systemctl start docker.service
Step 3: Python Environment Configuration
The script sets up Python 3.10 and Conda for dependency management:
# Install Python 3.10.6
wget https://www.python.org/ftp/python/3.10.6/Python-3.10.6.tar.xz
tar -xf Python-3.10.6.tar.xz
cd Python-3.10.6/
./configure --enable-optimizations
make -j $(nproc)
sudo make altinstall
# Install Miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -u -p ~/miniconda3
Step 4: Creating a Conda Environment for Mistral
A dedicated environment ensures all dependencies are properly managed:
conda create -n mistral python=3.10.9 -y
conda activate mistral
conda install pip -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install jupyter vllm huggingface-hub tqdm gradio
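Before moving on, it’s worth confirming inside the mistral environment that PyTorch can actually see the A10; a quick check along these lines is enough:
# Sanity check: confirm the CUDA build of PyTorch sees the GPU
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))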
Step 5: Downloading the Mistral Model
For this step you’ll need to visit Hugging Face and create an account (or sign in). Navigate to Settings -> Access Tokens and create the necessary User Access Token. Next, navigate to the latest Mistral model (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2). The download uses two values:
- APIKEYVAL: Your Hugging Face access token
- MODEL: The specific Mistral model to use (e.g., “Mistral-7B-Instruct-v0.2”)
The model is downloaded using the Hugging Face Hub API:
import os
from huggingface_hub import snapshot_download

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # Replace with your model
local_dir = "/home/opc/models/Mistral-7B-Instruct-v0.2"
access_token = "YOUR_HF_ACCESS_TOKEN"  # your Hugging Face token (APIKEYVAL)

os.makedirs(local_dir, exist_ok=True)

# Download the full snapshot (weights, tokenizer, config) into local_dir
snapshot_download(
    repo_id=model_name,
    local_dir=local_dir,
    token=access_token
)
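After the download completes, a quick listing confirms the snapshot actually landed in local_dir (exact file names vary by model revision, so this simply prints what is there):
# List the downloaded snapshot to confirm weights, tokenizer, and config are present
from pathlib import Path

local_dir = Path("/home/opc/models/Mistral-7B-Instruct-v0.2")
for p in sorted(local_dir.iterdir()):
    print(f"{p.name:40s} {p.stat().st_size / 1e6:,.1f} MB")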
Step 6: Launching the Model Server
vLLM is used to serve the model with an OpenAI-compatible API:
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--model /home/opc/models/Mistral-7B-Instruct-v0.2 \
--tokenizer hf-internal-testing/llama-tokenizer \
--max-model-len 16384 \
--enforce-eager \
--gpu-memory-utilization 0.8 \
--max-num-seqs 2
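Once the server is up, vLLM’s OpenAI-compatible API also exposes a models listing, which makes a convenient health check from the instance itself (port 8000 is vLLM’s default):
# Quick health check against the vLLM server's OpenAI-compatible API
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print("Serving:", model["id"])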
Step 7: Setting Up Jupyter Notebook
A Jupyter Notebook server is started for interactive development:
jupyter notebook --ip=0.0.0.0 --port=8888
Configuring Firewall Access
For security reasons, most cloud instances have firewalls enabled by default. To allow access to Jupyter Notebook’s port, you’ll need to configure the firewall:
# On Oracle Linux with firewalld
sudo firewall-cmd --permanent --add-port=8888/tcp
sudo firewall-cmd --reload
In addition to the local firewall, make sure to configure your OCI Security List to allow inbound traffic on port 8888 from your desired source (either your IP address for better security, or 0.0.0.0/0 for open access).
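To confirm from your own machine that both the local firewall and the security list are configured, a simple TCP check against the relevant ports (22 for SSH, 8888 for Jupyter, 8000 for vLLM) is enough; the IP below is a placeholder:
# Check which ports on the instance are reachable from your machine
import socket

def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

instance_ip = "your_instance_ip"  # replace with the instance's public IP
for port in (22, 8888, 8000):
    print(f"port {port}: {'reachable' if port_open(instance_ip, port) else 'blocked or closed'}")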
Step 8: Testing Your Deployment
The script creates two Jupyter Notebooks for testing:
- A simple API request notebook
- A Gradio interface for interactive model testing (a minimal sketch follows the API request example below)
Example API Request:
import requests
import json
import os

# MODEL is set by the setup script; fall back to the model used in this guide
model = os.getenv("MODEL", "Mistral-7B-Instruct-v0.2")
url = "http://0.0.0.0:8000/v1/chat/completions"

headers = {
    "accept": "application/json",
    "Content-Type": "application/json",
}

data = {
    "model": f"/home/opc/models/{model}",
    "messages": [{"role": "user", "content": "Write a short conclusion."}],
    "max_tokens": 64
}

response = requests.post(url, headers=headers, json=data)
result = response.json()
print("Response:", json.dumps(result, indent=4))
Performance Considerations
- The A10 GPU (24 GB) provides excellent performance for 7B-parameter models; 13B-class models generally need the two-GPU VM.GPU.A10.2 shape or quantization
- Use the --max-model-len parameter to adjust context length based on your needs
- Adjust --gpu-memory-utilization to balance performance and stability (a rough memory estimate follows this list)
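As a rough back-of-envelope for why 0.8 is a reasonable starting point on a single 24 GB A10 (approximate numbers, not measured values):
# Back-of-envelope GPU memory budget for Mistral-7B in fp16 on one 24 GB A10
params_billion = 7.24      # approximate parameter count of Mistral-7B
bytes_per_param = 2        # fp16
weights_gb = params_billion * bytes_per_param      # ~14.5 GB of weights

gpu_gb = 24
utilization = 0.8          # matches --gpu-memory-utilization 0.8
budget_gb = gpu_gb * utilization                   # memory vLLM is allowed to use

print(f"weights ~{weights_gb:.1f} GB, vLLM budget {budget_gb:.1f} GB, "
      f"~{budget_gb - weights_gb:.1f} GB left for KV cache and activations")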
Troubleshooting
Model and Server Issues
Common issues and solutions:
- Model download fails: Verify your Hugging Face access token has proper permissions
- Server fails to start: Check GPU memory usage and adjust utilization parameter
- Slow inference: Consider optimizing batch size or using a different precision format
SSH Connectivity Issues
When connecting to your OCI instance:
- Permission denied errors:
  - Ensure your private key has the correct permissions: chmod 600 /path/to/private_key
  - Verify you’re using the correct username (the default is opc for Oracle Linux)
- Connection timeouts:
  - Verify the instance is running in the OCI Console
  - Check security list rules to ensure SSH port 22 is open
  - Try connecting from the OCI Cloud Shell to rule out local network issues
- Host key verification failures:
  - If the instance was recreated with the same IP, remove the old key from ~/.ssh/known_hosts
  - Run: ssh-keygen -R your_instance_ip
- Automation script connection issues:
  - For scripts that need non-interactive connections, use the StrictHostKeyChecking=no option
  - Ensure your script handles SSH timeouts and retries appropriately (a minimal retry sketch follows this list)
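For the automation case, a small retry wrapper covers transient timeouts (for example while a freshly launched instance is still booting). This is a sketch using paramiko, with the same development-only caveat about skipping host key verification:
# Sketch: retry an SSH connection a few times before giving up (development use)
import time
import paramiko

def connect_with_retries(host, key_path, user="opc", attempts=5, delay=10):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # dev/automation only
    for attempt in range(1, attempts + 1):
        try:
            client.connect(hostname=host, username=user,
                           key_filename=key_path, timeout=15)
            return client
        except (paramiko.SSHException, OSError) as exc:
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            time.sleep(delay)
    raise RuntimeError(f"could not reach {host} after {attempts} attempts")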
Conclusion
With this setup, you now have a fully functional Mistral AI model running on OCI with an OpenAI-compatible API endpoint. This environment is perfect for development, testing, or even small-scale production deployments of Mistral models.
Next Steps
- Explore fine-tuning Mistral models for your specific use case
- Implement proper security for production deployments
- Set up monitoring and logging for your model server
- Optimize performance for your specific workload
Happy modeling!