## What are SLMs?
Small Language Models (SLMs) are language models with roughly 1-13 billion parameters, compared with the hundreds of billions of parameters in frontier models such as GPT-4. Despite their smaller size, SLMs can match or exceed large models on specific tasks when properly fine-tuned.
### Key Insight
A 7B-parameter model fine-tuned for entity extraction will often outperform GPT-4 on that specific task, while costing roughly 100x less to run (compare the per-request costs in the table below).
## SLMs vs LLMs
| Aspect | Small Language Models | Large Language Models |
|---|---|---|
| Parameters | 1B - 13B | 70B - 1T+ |
| Latency | 10-100ms | 500-3000ms |
| Cost per request | $0.0001 - $0.001 | $0.01 - $0.10 |
| Memory | 4-16 GB | 80-320 GB |
| Can run locally | Yes (CPU or single GPU) | Rarely (typically multi-GPU or hosted API) |
| Task scope | Narrow (specialized) | Broad (general-purpose) |
| Fine-tuning | Practical and affordable | Expensive and complex |
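To make the cost row concrete: at one million requests per day, $0.0001 per request comes to $100/day, while $0.01 per request comes to $10,000/day. The per-request gap looks small but compounds quickly at volume.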
## Model Tiers in Simplex
Simplex provides automatic model tiering, selecting the right model for each task:
| Tier | Example Model | Latency | Cost per request | Use Case |
|---|---|---|---|---|
| `fast` | Llama 7B | 10-50ms | $0.0001 | Classification, simple extraction |
| `default` | Llama 70B | 50-200ms | $0.001 | General tasks, summarization |
| `quality` | Claude/GPT-4 | 200-2000ms | $0.01 | Complex reasoning, creative writing |
```simplex
// Fast tier for simple classification
let sentiment = await ai::classify<Sentiment>(
    text,
    model: "fast"
)

// Default tier for general tasks
let summary = await ai::complete(prompt)

// Quality tier for complex reasoning
let analysis = await ai::complete(
    complex_prompt,
    model: "quality"
)
```
## Specialists in CHAI
In Cognitive Hive AI, each specialist wraps an SLM fine-tuned for a specific task:
```simplex
specialist Summarizer {
    model: "summarization-7b",
    domain: "text summarization",
    memory: 8.GB,
    temperature: 0.3,
    max_tokens: 500,

    receive Summarize(text: String, style: SummaryStyle) -> String {
        let prompt = match style {
            SummaryStyle::Brief => "Summarize in 1-2 sentences: {text}",
            SummaryStyle::Detailed => "Provide a detailed summary: {text}",
            SummaryStyle::Bullets => "Summarize as bullet points: {text}"
        }
        infer(prompt)
    }

    receive SummarizeBatch(docs: List<String>) -> List<String> {
        docs.map(doc => infer("Summarize: {doc}"))
    }
}
```
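A minimal usage sketch. The definition above only shows the receive handlers; the `spawn`/`ask` invocation syntax below is an assumption for illustration, not documented Simplex API:

```simplex
// Assumed invocation syntax: `spawn` and `ask` are illustrative names,
// not confirmed by the specialist definition above.
let summarizer = spawn Summarizer
let brief = await summarizer.ask(Summarize(article_text, SummaryStyle::Brief))
let summaries = await summarizer.ask(SummarizeBatch(docs))
```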
## Common Specialist Types
- **Entity Extractor**: Extracts named entities (people, places, organizations, dates) from text. Fine-tuned on NER datasets (sketched after this list).
- **Sentiment Analyzer**: Classifies text sentiment (positive, negative, neutral) with confidence scores.
- **Summarizer**: Condenses long documents into concise summaries. Can produce different styles.
- **Classifier**: Categorizes content into predefined classes. Great for routing and tagging.
- **Translator**: Translates between languages. Specialized models often beat general LLMs.
- **Code Generator**: Generates code in specific languages. Fine-tuned on language-specific corpora.
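As a second example of the pattern, here is what an entity-extraction specialist could look like, mirroring the Summarizer above. The model name, the `Entity` type, and the typed return from `infer` are assumptions (typed coercion is shown elsewhere only for `ai::classify<Sentiment>`):

```simplex
// Hypothetical specialist: "ner-7b" and Entity are illustrative names.
specialist EntityExtractor {
    model: "ner-7b",
    domain: "named entity recognition",
    memory: 8.GB,
    temperature: 0.0,   // deterministic decoding suits extraction
    max_tokens: 300,

    receive Extract(text: String) -> List<Entity> {
        // Assumes infer can coerce output to the declared return type,
        // as ai::classify<Sentiment> does for classification.
        infer("Extract all people, places, organizations, and dates: {text}")
    }
}
```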
## Deployment Options

### CPU Deployment
For lower throughput needs, SLMs can run on CPU instances:
- Pros: Cheapest option, widely available spot instances
- Cons: Higher latency (100-500ms), lower throughput
- Best for: Development, low-traffic applications
### GPU Deployment
For production workloads, GPU instances provide the best performance:
- Pros: Low latency (10-50ms), high throughput
- Cons: More expensive, less availability
- Best for: Production, high-traffic applications
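One way the CPU/GPU choice might surface per specialist is a device setting alongside the resource fields shown earlier. The `device` field below is an assumption for illustration; Simplex's actual deployment configuration may differ:

```simplex
// Hypothetical `device` field: "cpu" for dev/low traffic,
// "gpu" for latency-sensitive production workloads.
specialist TagRouter {
    model: "classifier-7b",   // illustrative model name
    memory: 8.GB,
    device: "cpu",

    receive Route(text: String) -> String {
        infer("Assign one category tag: {text}")
    }
}
```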
## Quantization
Reduce model size and memory requirements through quantization:
| Precision | Memory (7B model) | Quality Loss |
|---|---|---|
| FP16 (default) | 14 GB | None |
| INT8 | 7 GB | Minimal |
| INT4 | 3.5 GB | Slight |
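These figures follow directly from memory ≈ parameter count × bytes per weight: 7B × 2 bytes (FP16) ≈ 14 GB, 7B × 1 byte (INT8) ≈ 7 GB, and 7B × 0.5 bytes (INT4) ≈ 3.5 GB, plus some runtime overhead for activations and the KV cache.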
## Fine-Tuning Specialists
Fine-tuning transforms a general SLM into a task-specific specialist:
```toml
[model]
base = "llama-7b"
output = "my-summarizer-7b"

[training]
method = "lora"  # Parameter-efficient fine-tuning
learning_rate = 2e-4
epochs = 3
batch_size = 8

[data]
train = "data/train.jsonl"
validation = "data/val.jsonl"
format = "instruction"  # Input/output pairs
```
### Training Data Format
{"input": "Summarize: [long article]", "output": "[summary]"}
{"input": "Summarize: [another article]", "output": "[another summary]"}
...
## When to Use SLMs vs LLMs

### Use SLMs When
- Task is well-defined and narrow
- High volume (1000+ requests/day)
- Low latency required (<100ms)
- Cost is a concern
- Data privacy requires local processing
- Predictable, consistent outputs needed
### Use LLMs When
- Task requires broad knowledge or reasoning
- Low volume or prototyping
- Task is novel or undefined
- Quality is more important than cost
- Complex multi-step reasoning needed
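These guidelines map directly onto the tiering API from earlier. A sketch, assuming the hypothetical `Task`/`TaskKind` types and the tier names from the table above:

```simplex
// Route by task shape using the documented tier names.
// Task and TaskKind are hypothetical types for illustration.
let tier = match task.kind {
    TaskKind::Classification => "fast",     // narrow, high-volume: SLM
    TaskKind::Summarization  => "default",
    TaskKind::OpenReasoning  => "quality"   // broad or novel reasoning: LLM
}
let answer = await ai::complete(task.prompt, model: tier)
```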