Running open-source LLMs on your own machine

There’s something satisfying about running a language model entirely on your own machine — no API key, no rate limit, no data leaving your desk. It’s also gotten remarkably easy.

The setup starts with uv, which has become my default way to spin up a clean Python environment:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install torch transformers accelerate bitsandbytes
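
Before loading anything heavy, it can help to confirm that the environment really contains what was just installed. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and it only checks that each package can be found on the import path, not that a GPU is usable.

```python
import importlib.util
import sys

# Packages installed by the setup commands above.
REQUIRED = ("torch", "transformers", "accelerate", "bitsandbytes")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    missing = missing_packages()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages found.")
```

Run it with the virtual environment activated; an empty "Missing" list suggests the install step went through.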

From there, loading an open-weight instruction-tuned model in 4-bit is a few lines:

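from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load the weights in 4-bit via bitsandbytes.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
# Gemma weights are gated on Hugging Face: accept the license and
# authenticate (e.g. `huggingface-cli login`) before the first download.
model_id = "google/gemma-7b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # let accelerate place the layers on the available device(s)
    quantization_config=quantization_config,
)

# Move the inputs to wherever the model landed instead of hard-coding "cuda".
model_inputs = tokenizer(
    ["The secret to baking a good cake is "], return_tensors="pt"
).to(model.device)
# max_length counts the prompt tokens plus the generated tokens.
generated_ids = model.generate(**model_inputs, max_length=30)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])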

On a consumer GPU, google/gemma-7b-it in 4-bit ran comfortably and produced coherent, useful output — for the cake prompt above, it confidently informed me that the secret is “patience and practice,” which is either wisdom or a very well-trained prior.
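
A bit of arithmetic shows why 4-bit is what makes this fit. The sketch below estimates memory for the weights alone at different precisions. It assumes a nominal 7 billion parameters, read off the "7b" in the model name; the real count, the quantization metadata, any layers left unquantized, and the activations and KV cache all add to the true footprint, so treat the numbers as a lower bound rather than a measurement. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a nominal 7 billion parameters, taken from the "7b" in the model name.
NOMINAL_PARAMS = 7e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under that assumption the weights drop from roughly 14 GB at 16-bit to roughly 3.5 GB at 4-bit, which is the difference between needing a high-memory card and fitting on a typical consumer GPU.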

The bigger point isn’t the cake. It’s that the barrier to running a genuinely capable model locally has quietly collapsed. A weekend’s worth of uv incantations gets you a private, offline, inspectable model: a useful thing to have, if only as a sandbox for understanding what these systems can and can’t do with no network connection in the loop.