Choosing a local LLM often starts with asking “what is the largest model that fits in my VRAM?” However, on consumer GPUs like the RTX 4060 Ti 16GB, inference speed and compatibility matter just as much as raw size. The whichllm tool helps measure these factors through a simple CLI.

📑Table of Contents
  1. whichllm Tool Basics and Installation
  2. Benchmark Results on RTX 4060 Ti 16GB
  3. Selection Criteria Beyond VRAM Capacity
  4. Models Measured and Comparison Table
  5. whichllm CLI Examples and Practical Tips
  6. Frequently Asked Questions about Local LLM Selection
  7. Summary

whichllm Tool Basics and Installation

whichllm is a CLI utility designed to benchmark local LLMs on consumer GPUs such as the RTX 4060 Ti 16GB. It reports VRAM usage, inference speed, and model compatibility in one place. The project is hosted on GitHub at https://github.com/aktsmm/whichllm.

Installation requires a Python environment with CUDA-enabled NVIDIA drivers. Clone the repository or install via pip, then run whichllm --help to see available commands. The tool shines when you need to profile multiple models without manually launching each one through Ollama or llama.cpp.


Benchmark Results on RTX 4060 Ti 16GB

On the RTX 4060 Ti 16GB, models in the 7B–13B range ran reliably. VRAM consumption varies significantly with quantization level (for example, Q4_K_M versus Q5_K_M).

For instance, Llama 3 8B at Q5_K_M used approximately 6.2 GB and achieved around 45 tokens per second. Larger 13B models often exceeded 10 GB, leaving less headroom for longer contexts or batch processing.

These figures come from direct whichllm runs. Real-world speed depends on prompt length and hardware specifics.
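
As a back-of-the-envelope cross-check on measurements like these, VRAM use can be approximated as quantized weights plus the KV cache. The sketch below is not part of whichllm. The bits-per-weight values are approximate averages for llama.cpp K-quants, and the layer and head counts are those of Llama 3 8B (32 layers, 8 KV heads, head dimension 128); all of these are assumptions introduced here. Runtime buffers and framework overhead are ignored, so real usage will be somewhat higher.

```python
# Rough VRAM estimate: quantized weights plus an fp16 KV cache.
# Bits-per-weight values are approximate averages for llama.cpp K-quants
# (an assumption for illustration, not numbers reported by whichllm).
APPROX_BITS_PER_WEIGHT = {"q4_k_m": 4.85, "q5_k_m": 5.7}

GIB = 1024 ** 3


def weights_gib(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in GiB."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB (fp16 by default)."""
    # The factor 2 accounts for storing both keys and values.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / GIB


# Llama 3 8B (about 8.03B parameters) at Q5_K_M with an 8K context.
w = weights_gib(8.03, "q5_k_m")
kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=8192)
print(f"weights ~ {w:.2f} GiB, KV cache ~ {kv:.2f} GiB, total ~ {w + kv:.2f} GiB")

# A 70B model at Q4_K_M, weights only, against a 16 GiB card.
print(f"70B weights ~ {weights_gib(70, 'q4_k_m'):.1f} GiB")
```

Under these assumptions an 8K-token context adds 1 GiB of KV cache for Llama 3 8B, which is why long contexts eat into headroom quickly once the weights already occupy most of the card.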


Selection Criteria Beyond VRAM Capacity

Focusing only on maximum VRAM fit is insufficient. Slow inference reduces practicality, and lower accuracy can make a model unsuitable for certain tasks. Compatibility with GGUF formats and specific quantization methods also deserves attention.

whichllm surfaces these trade-offs in a single report, making it easier to decide between speed-focused 7B-class models and accuracy-focused 13B-class options.

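Model          | VRAM (GB) | Speed (tokens/s) | Accuracy | Compatibility
---------------+-----------+------------------+----------+--------------
Llama 3 8B Q4  | 5.1       | 52               | High     | Good
Mistral 7B Q5  | 5.8       | 48               | Medium   | Good
Llama 3 13B Q4 | 9.4       | 32               | High     | Caution
Gemma 2 9B Q5  | 6.5       | 41               | High     | Good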

Source: whichllm measurements on RTX 4060 Ti 16GB (June 2026, GitHub repository)

An 8B-class model often strikes the best balance for daily use.
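
One way to make this kind of trade-off explicit is to filter the measured rows by a VRAM budget, a minimum speed, and a minimum accuracy rating, then rank what remains by speed. The sketch below uses the rows from the comparison table; the thresholds, the `shortlist` helper, and the "Low" accuracy level are hypothetical choices for illustration, not whichllm features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    vram_gb: float
    tokens_per_s: float
    accuracy: str
    compatibility: str


# Rows copied from the comparison table in this article.
RESULTS = [
    Result("Llama 3 8B Q4", 5.1, 52, "High", "Good"),
    Result("Mistral 7B Q5", 5.8, 48, "Medium", "Good"),
    Result("Llama 3 13B Q4", 9.4, 32, "High", "Caution"),
    Result("Gemma 2 9B Q5", 6.5, 41, "High", "Good"),
]

# Ordering of the accuracy labels ("Low" is assumed; the table only
# contains "Medium" and "High").
ACCURACY_RANK = {"Low": 0, "Medium": 1, "High": 2}


def shortlist(results, vram_budget_gb, min_tokens_per_s, min_accuracy="High"):
    """Keep rows that satisfy every constraint, fastest first."""
    keep = [
        r for r in results
        if r.vram_gb <= vram_budget_gb
        and r.tokens_per_s >= min_tokens_per_s
        and ACCURACY_RANK[r.accuracy] >= ACCURACY_RANK[min_accuracy]
        and r.compatibility == "Good"
    ]
    return sorted(keep, key=lambda r: r.tokens_per_s, reverse=True)


# Hypothetical policy: leave 4 GB of the 16 GB free and require 40 tokens/s.
for r in shortlist(RESULTS, vram_budget_gb=12, min_tokens_per_s=40):
    print(f"{r.model}: {r.vram_gb} GB, {r.tokens_per_s} tokens/s")
```

With these example thresholds, only the two 8B to 9B-class rows rated High survive, which matches the observation about 8B-class models above.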


Models Measured and Comparison Table

The comparison table in the previous section summarizes representative models tested with whichllm. The test set covered common GGUF-quantized models available through Ollama and Hugging Face.

Llama 3 series models showed strong compatibility across tasks. Mistral-based models tended to deliver higher speed but occasionally showed task-specific accuracy differences.


whichllm CLI Examples and Practical Tips

Basic usage is straightforward. Here are common commands:

whichllm benchmark --model llama3:8b --gpu rtx4060ti --quant q5_k_m
whichllm list --sort vram
whichllm profile --model mistral:7b --output json

Run with --dry-run first to estimate VRAM before committing to a full benchmark. For multiple models, scripting the calls saves time. JSON output makes it easy to archive or compare results later.
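
Scripting the calls could look like the following minimal sketch. The subcommand and flags are copied from the example commands in this article and have not been verified against the tool itself; the `build_benchmark_command`, `run_all`, and `save_outputs` helpers are hypothetical names, and the output is stored as raw text because the tool's output schema is not documented here.

```python
import json
import subprocess


def build_benchmark_command(model, quant, gpu="rtx4060ti"):
    """Build the argv list, mirroring the article's benchmark example."""
    return ["whichllm", "benchmark", "--model", model,
            "--gpu", gpu, "--quant", quant]


def run_all(jobs, runner=subprocess.run):
    """Run one benchmark per (model, quant) pair and collect raw stdout."""
    outputs = {}
    for model, quant in jobs:
        cmd = build_benchmark_command(model, quant)
        # check=True raises CalledProcessError if the command fails.
        completed = runner(cmd, capture_output=True, text=True, check=True)
        outputs[f"{model}@{quant}"] = completed.stdout
    return outputs


def save_outputs(outputs, path):
    """Archive the collected raw outputs as a single JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(outputs, fh, indent=2)


# Preview the commands without executing anything.
JOBS = [("llama3:8b", "q5_k_m"), ("mistral:7b", "q5_k_m")]
for model, quant in JOBS:
    print(" ".join(build_benchmark_command(model, quant)))
```

Calling `save_outputs(run_all(JOBS), "results.json")` would then execute the benchmarks and archive the results, provided whichllm is on the PATH. The `runner` parameter exists so the loop can be exercised with a stand-in function before spending GPU time.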


Frequently Asked Questions about Local LLM Selection

Q: Which GPUs does whichllm support?

It primarily targets NVIDIA RTX 40-series cards but works with any CUDA-capable GPU. Check the GitHub repository for the full compatibility list.

Q: Can a 70B model run on 16 GB VRAM?

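Not entirely in VRAM. At roughly 4 bits per weight, the weights of a 70B model alone take about 35 GB, more than twice the 16 GB available. Even with aggressive quantization, 13B-class models are the practical limit on this hardware.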

Q: Do the benchmark numbers match official reports?

Not necessarily. Differences arise from hardware and environment variations, which is why whichllm emphasizes measurements taken on your own machine.

Q: What are the prerequisites for installation?

Python 3.10 or newer, CUDA 12.x, and the latest NVIDIA drivers are required.

Q: How can I share results with my team?

Use the JSON or Markdown report output. Some users post results as GitHub issues for community feedback.

Q: How does whichllm compare to ollama bench?

whichllm integrates VRAM, speed, and compatibility metrics in one workflow, which many users find more convenient.


Summary

Using whichllm on an RTX 4060 Ti 16GB setup lets you select local LLMs based on measured data rather than VRAM size alone. Considering speed and compatibility alongside capacity leads to more satisfying real-world performance. Start by exploring the GitHub repository and running your first benchmark.

Author

krona23

Over 20 years in the IT industry, serving as Division Head and CTO at multiple companies running large-scale web services in Japan. Experienced across Windows, iOS, Android, and web development. Currently focused on AI-native transformation. At DevGENT, sharing practical guides on AI code editors, automation tools, and LLMs in three languages.
