## What are SLMs?
Small Language Models (SLMs) are language models with roughly 1-13 billion parameters, compared with the hundreds of billions of parameters in frontier models such as GPT-4. Despite their smaller size, SLMs can match or exceed large models on specific tasks when properly fine-tuned.
### Key Insight
A 7B-parameter model fine-tuned for entity extraction will often outperform GPT-4 on that specific task, while costing roughly 100x less to run (compare the per-request costs in the table below).
## SLMs vs LLMs
| Aspect | Small Language Models | Large Language Models |
|---|---|---|
| Parameters | 1B - 13B | 70B - 1T+ |
| Latency | 10-100ms | 500-3000ms |
| Cost per request | $0.0001 - $0.001 | $0.01 - $0.10 |
| Memory | 4-16 GB | 80-320 GB |
| Can run locally | Yes (CPU or single GPU) | Rarely (typically multi-GPU or hosted API) |
| Task scope | Narrow (specialized) | Broad (general-purpose) |
| Fine-tuning | Practical and affordable | Expensive and complex |
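To make the cost row concrete: at one million requests per day, $0.0001 per request comes to $100/day, while $0.01 per request comes to $10,000/day. The per-request gap looks small but compounds quickly at volume.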
## Model Tiers in Simplex
Simplex provides automatic model tiering, selecting the right model for each task:
| Tier | Example Model | Latency | Cost per request | Use Case |
|---|---|---|---|---|
| `fast` | Llama 7B | 10-50ms | $0.0001 | Classification, simple extraction |
| `default` | Llama 70B | 50-200ms | $0.001 | General tasks, summarization |
| `quality` | Claude/GPT-4 | 200-2000ms | $0.01 | Complex reasoning, creative writing |
```simplex
// Fast tier for simple classification
let sentiment = await ai::classify<Sentiment>(
    text,
    model: "fast"
)

// Default tier for general tasks
let summary = await ai::complete(prompt)

// Quality tier for complex reasoning
let analysis = await ai::complete(
    complex_prompt,
    model: "quality"
)
```
## Specialists in CHAI
In Cognitive Hive AI, each specialist wraps an SLM fine-tuned for a specific task:
```simplex
specialist Summarizer {
    model: "summarization-7b",
    domain: "text summarization",
    memory: 8.GB,
    temperature: 0.3,
    max_tokens: 500,

    receive Summarize(text: String, style: SummaryStyle) -> String {
        let prompt = match style {
            SummaryStyle::Brief => "Summarize in 1-2 sentences: {text}",
            SummaryStyle::Detailed => "Provide a detailed summary: {text}",
            SummaryStyle::Bullets => "Summarize as bullet points: {text}"
        }
        infer(prompt)
    }

    receive SummarizeBatch(docs: List<String>) -> List<String> {
        docs.map(doc => infer("Summarize: {doc}"))
    }
}
```
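A minimal usage sketch. The definition above only shows the receive handlers; the `spawn`/`ask` invocation syntax below is an assumption for illustration, not documented Simplex API:

```simplex
// Assumed invocation syntax: `spawn` and `ask` are illustrative names,
// not confirmed by the specialist definition above.
let summarizer = spawn Summarizer
let brief = await summarizer.ask(Summarize(article_text, SummaryStyle::Brief))
let summaries = await summarizer.ask(SummarizeBatch(docs))
```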
## Common Specialist Types
- **Entity Extractor**: Extracts named entities (people, places, organizations, dates) from text. Fine-tuned on NER datasets (sketched after this list).
- **Sentiment Analyzer**: Classifies text sentiment (positive, negative, neutral) with confidence scores.
- **Summarizer**: Condenses long documents into concise summaries. Can produce different styles.
- **Classifier**: Categorizes content into predefined classes. Great for routing and tagging.
- **Translator**: Translates between languages. Specialized models often beat general LLMs.
- **Code Generator**: Generates code in specific languages. Fine-tuned on language-specific corpora.
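As a second example of the pattern, here is what an entity-extraction specialist could look like, mirroring the Summarizer above. The model name, the `Entity` type, and the typed return from `infer` are assumptions (typed coercion is shown elsewhere only for `ai::classify<Sentiment>`):

```simplex
// Hypothetical specialist: "ner-7b" and Entity are illustrative names.
specialist EntityExtractor {
    model: "ner-7b",
    domain: "named entity recognition",
    memory: 8.GB,
    temperature: 0.0,   // deterministic decoding suits extraction
    max_tokens: 300,

    receive Extract(text: String) -> List<Entity> {
        // Assumes infer can coerce output to the declared return type,
        // as ai::classify<Sentiment> does for classification.
        infer("Extract all people, places, organizations, and dates: {text}")
    }
}
```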
## Deployment Options

### CPU Deployment
For lower throughput needs, SLMs can run on CPU instances:
- Pros: Cheapest option, widely available spot instances
- Cons: Higher latency (100-500ms), lower throughput
- Best for: Development, low-traffic applications
### GPU Deployment
For production workloads, GPU instances provide the best performance:
- Pros: Low latency (10-50ms), high throughput
- Cons: More expensive, less availability
- Best for: Production, high-traffic applications
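One way the CPU/GPU choice might surface per specialist is a device setting alongside the resource fields shown earlier. The `device` field below is an assumption for illustration; Simplex's actual deployment configuration may differ:

```simplex
// Hypothetical `device` field: "cpu" for dev/low traffic,
// "gpu" for latency-sensitive production workloads.
specialist TagRouter {
    model: "classifier-7b",   // illustrative model name
    memory: 8.GB,
    device: "cpu",

    receive Route(text: String) -> String {
        infer("Assign one category tag: {text}")
    }
}
```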
## Quantization
Reduce model size and memory requirements through quantization:
| Precision | Memory (7B model) | Quality Loss |
|---|---|---|
| FP16 (default) | 14 GB | None |
| INT8 | 7 GB | Minimal |
| INT4 | 3.5 GB | Slight |
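These figures follow directly from memory ≈ parameter count × bytes per weight: 7B × 2 bytes (FP16) ≈ 14 GB, 7B × 1 byte (INT8) ≈ 7 GB, and 7B × 0.5 bytes (INT4) ≈ 3.5 GB, plus some runtime overhead for activations and the KV cache.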
## Fine-Tuning Specialists
Fine-tuning transforms a general SLM into a task-specific specialist:
```toml
[model]
base = "llama-7b"
output = "my-summarizer-7b"

[training]
method = "lora"  # Parameter-efficient fine-tuning
learning_rate = 2e-4
epochs = 3
batch_size = 8

[data]
train = "data/train.jsonl"
validation = "data/val.jsonl"
format = "instruction"  # Input/output pairs
```
### Training Data Format
{"input": "Summarize: [long article]", "output": "[summary]"}
{"input": "Summarize: [another article]", "output": "[another summary]"}
...
## When to Use SLMs vs LLMs

### Use SLMs When
- Task is well-defined and narrow
- High volume (1000+ requests/day)
- Low latency required (<100ms)
- Cost is a concern
- Data privacy requires local processing
- Predictable, consistent outputs needed
### Use LLMs When
- Task requires broad knowledge or reasoning
- Low volume or prototyping
- Task is novel or undefined
- Quality is more important than cost
- Complex multi-step reasoning needed
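These guidelines map directly onto the tiering API from earlier. A sketch, assuming the hypothetical `Task`/`TaskKind` types and the tier names from the table above:

```simplex
// Route by task shape using the documented tier names.
// Task and TaskKind are hypothetical types for illustration.
let tier = match task.kind {
    TaskKind::Classification => "fast",     // narrow, high-volume: SLM
    TaskKind::Summarization  => "default",
    TaskKind::OpenReasoning  => "quality"   // broad or novel reasoning: LLM
}
let answer = await ai::complete(task.prompt, model: tier)
```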