What are SLMs?

Small Language Models (SLMs) are language models with roughly 1-13 billion parameters, compared to the hundreds of billions (or more) in large models like GPT-4. Despite their smaller size, SLMs can match or exceed large models on specific tasks when properly fine-tuned.

Key Insight

A 7B-parameter model fine-tuned for entity extraction will often outperform GPT-4 at that specific task, while costing roughly 100x less to run.

SLMs vs LLMs

| Aspect | Small Language Models | Large Language Models |
|--------|----------------------|----------------------|
| Parameters | 1B - 13B | 70B - 1T+ |
| Latency | 10-100ms | 500-3000ms |
| Cost per request | $0.0001 - $0.001 | $0.01 - $0.10 |
| Memory | 4-16 GB | 80-320 GB |
| Can run locally | Yes (CPU or single GPU) | No (requires multi-GPU) |
| Task scope | Narrow (specialized) | Broad (general-purpose) |
| Fine-tuning | Practical and affordable | Expensive and complex |

Model Tiers in Simplex

Simplex provides automatic model tiering, selecting the right model for each task:

| Tier | Example Model | Latency | Cost per request | Use Case |
|------|---------------|---------|------------------|----------|
| fast | Llama 7B | 10-50ms | $0.0001 | Classification, simple extraction |
| default | Llama 70B | 50-200ms | $0.001 | General tasks, summarization |
| quality | Claude/GPT-4 | 200-2000ms | $0.01 | Complex reasoning, creative writing |
model-tiers.sx
// Fast tier for simple classification
let sentiment = await ai::classify<Sentiment>(
    text,
    model: "fast"
)

// Default tier for general tasks
let summary = await ai::complete(prompt)

// Quality tier for complex reasoning
let analysis = await ai::complete(
    complex_prompt,
    model: "quality"
)

Specialists in CHAI

In Cognitive Hive AI, each specialist wraps an SLM fine-tuned for a specific task:

summarizer.sx
specialist Summarizer {
    model: "summarization-7b",
    domain: "text summarization",
    memory: 8.GB,
    temperature: 0.3,
    max_tokens: 500,

    receive Summarize(text: String, style: SummaryStyle) -> String {
        let prompt = match style {
            SummaryStyle::Brief => "Summarize in 1-2 sentences: {text}",
            SummaryStyle::Detailed => "Provide a detailed summary: {text}",
            SummaryStyle::Bullets => "Summarize as bullet points: {text}"
        }
        infer(prompt)
    }

    receive SummarizeBatch(docs: List<String>) -> List<String> {
        docs.map(doc => infer("Summarize: {doc}"))
    }
}
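
Invocation syntax for specialists is not shown in this section, so the sketch below assumes an actor-style spawn/ask API; both calls are hypothetical, and only the Summarize message shape comes from the definition above:

use-summarizer.sx
// spawn and ask are assumed actor primitives, not documented Simplex API
let summarizer = spawn Summarizer
let brief = await summarizer.ask(Summarize(article, SummaryStyle::Brief))
let bullets = await summarizer.ask(Summarize(article, SummaryStyle::Bullets))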

Common Specialist Types

Entity Extractor

Extracts named entities (people, places, organizations, dates) from text. Fine-tuned on NER datasets.

Sentiment Analyzer

Classifies text sentiment (positive, negative, neutral) with confidence scores.

Summarizer

Condenses long documents into concise summaries. Can produce different styles.

Classifier

Categorizes content into predefined classes. Great for routing and tagging.

Translator

Translates between languages. Specialized models often beat general LLMs.

Code Generator

Generates code in specific languages. Fine-tuned on language-specific corpora.
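
As a concrete illustration of the pattern, here is how one of these might be defined: a hypothetical entity extractor modeled on the Summarizer above. The model name, the Entity type, and the typed infer<T> call are assumptions (typed inference here mirrors the ai::classify<Sentiment> call shown earlier), not documented API:

entity-extractor.sx
specialist EntityExtractor {
    model: "ner-7b",                  // assumed fine-tuned NER model name
    domain: "named entity recognition",
    memory: 8.GB,
    temperature: 0.0,                 // extraction should be deterministic
    max_tokens: 300,

    receive Extract(text: String) -> List<Entity> {
        // infer<T> is hypothetical, mirroring the typed ai::classify<T> call
        infer<List<Entity>>("Extract all people, places, organizations, and dates: {text}")
    }
}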

Deployment Options

CPU Deployment

For lower throughput needs, SLMs can run on CPU instances:

  • Pros: Cheapest option, widely available spot instances
  • Cons: Higher latency (100-500ms), lower throughput
  • Best for: Development, low-traffic applications

GPU Deployment

For production workloads, GPU instances provide the best performance (a configuration sketch follows the list below):

  • Pros: Low latency (10-50ms), high throughput
  • Cons: More expensive, less availability
  • Best for: Production, high-traffic applications
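
If the deployment target is expressed in the specialist config, the choice might look like the sketch below. The device field is a hypothetical illustration, not a documented Simplex option:

deployment.sx
specialist Summarizer {
    model: "summarization-7b",
    memory: 8.GB,
    device: GPU,    // hypothetical field: ~10-50ms latency; use CPU for dev/low traffic

    receive Summarize(text: String) -> String {
        infer("Summarize: {text}")
    }
}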

Quantization

Quantization stores model weights at lower numeric precision, reducing size and memory requirements at a small cost in output quality:

| Precision | Memory (7B model) | Quality Loss |
|-----------|-------------------|--------------|
| FP16 (default) | 14 GB | None |
| INT8 | 7 GB | Minimal |
| INT4 | 3.5 GB | Slight |
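
Applied to a specialist, the precision could plausibly be requested next to the model, as in this sketch; the quantization field is an assumption, shown only to connect the table to the config style used above:

quantized-summarizer.sx
specialist QuantizedSummarizer {
    model: "summarization-7b",
    quantization: INT8,   // hypothetical field: ~7 GB of weights instead of 14 GB
    memory: 8.GB,         // weights plus headroom for activations
    temperature: 0.3,
    max_tokens: 500,

    receive Summarize(text: String) -> String {
        infer("Summarize: {text}")
    }
}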

Fine-Tuning Specialists

Fine-tuning transforms a general SLM into a task-specific specialist:

finetune.toml
[model]
base = "llama-7b"
output = "my-summarizer-7b"

[training]
method = "lora"           # Parameter-efficient fine-tuning
learning_rate = 2e-4
epochs = 3
batch_size = 8

[data]
train = "data/train.jsonl"
validation = "data/val.jsonl"
format = "instruction"    # Input/output pairs

Training Data Format

train.jsonl
{"input": "Summarize: [long article]", "output": "[summary]"}
{"input": "Summarize: [another article]", "output": "[another summary]"}
...
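
Once training completes, the output name from finetune.toml can be used wherever a model name appears. A sketch reusing the specialist syntax from summarizer.sx above (everything here follows patterns already shown; only the wiring to the fine-tuned model is new):

my-summarizer.sx
specialist MySummarizer {
    model: "my-summarizer-7b",    // the fine-tuned output from finetune.toml
    domain: "text summarization",
    memory: 8.GB,
    temperature: 0.3,
    max_tokens: 500,

    receive Summarize(text: String) -> String {
        infer("Summarize: {text}")
    }
}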

When to Use SLMs vs LLMs

Use SLMs When:

  • Task is well-defined and narrow
  • High volume (1000+ requests/day)
  • Low latency required (<100ms)
  • Cost is a concern
  • Data privacy requires local processing
  • Predictable, consistent outputs needed

Use LLMs When:

  • Task requires broad knowledge or reasoning
  • Low volume or prototyping
  • Task is novel or undefined
  • Quality is more important than cost
  • Complex multi-step reasoning needed
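
The two lists are not mutually exclusive. A common middle ground is to route most traffic to the fast tier and escalate only uncertain cases to the quality tier. The sketch below assumes classification results carry a confidence field; that field, the Category type, and the 0.8 threshold are all illustrative assumptions, not part of the API shown in this section:

escalation.sx
// Try the cheap SLM first; fall back to the LLM tier on low confidence.
let fast_result = await ai::classify<Category>(text, model: "fast")
let result = if fast_result.confidence < 0.8 {
    // .confidence is an assumed field; the threshold is illustrative
    await ai::classify<Category>(text, model: "quality")
} else {
    fast_result
}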

Next Steps