When you implement a custom Large Language Model (LLM), it’s important to understand the specific use cases you can apply it to, as each has distinct performance metrics that reveal how well a model can solve a particular task.
You can use an LLM for text classification to assign predefined labels to text for tasks like sentiment analysis and topic modeling. The performance of LLMs in this area can be significantly enhanced through fine-tuning. For example, a Llama-3-8B model’s accuracy on spam SMS detection jumped from 39% to 98% after being fine-tuned[1]. Similarly, a fine-tuned Qwen-7B model saw its accuracy improve by up to 38.6% on certain datasets[1]. For complex tasks like multiclass classification, models such as Llama3 and GPT-4 can outperform traditional machine learning methods[2].
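To make this concrete, here is a minimal sketch of zero-shot classification by prompting an instruction-tuned model through the transformers text-generation pipeline. The model name (a gated Hub checkpoint), the label set, and the prompt wording are illustrative assumptions rather than the setup used in the cited studies; the fine-tuning gains described above would come on top of this kind of baseline.

```python
# Minimal zero-shot classification sketch with the transformers text-generation
# pipeline. Model name, labels, and prompt wording are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

LABELS = ["spam", "not_spam"]

def classify(text: str) -> str:
    prompt = (
        f"Classify the following SMS message as exactly one of {LABELS}. "
        f"Answer with the label only.\n\nMessage: {text}\nLabel:"
    )
    reply = generator(prompt, max_new_tokens=5, return_full_text=False)[0]["generated_text"]
    reply = reply.strip().lower()
    # Fall back to the first label if the model answers off-format.
    return next((label for label in LABELS if label in reply), LABELS[0])

print(classify("WINNER!! Claim your free prize now by replying to this message."))
```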
You can leverage LLMs to create concise summaries of long documents. While traditional metrics like ROUGE exist, they often show a low correlation with human judgments on the quality of LLM-generated summaries[3]. Newer evaluation methods use another LLM to assess summaries based on key dimensions like completeness, correctness, and readability[3]. Research indicates that LLM-based evaluations align more closely with human assessments compared to older automated metrics[4].
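As an illustration of the LLM-as-judge approach, here is a sketch of a judge prompt covering the three dimensions mentioned above. The 1–5 scale and the wording are assumptions, not the rubric from the cited papers, and the actual judge call is left to whatever LLM client you use.

```python
# Sketch of an LLM-as-judge prompt for summary evaluation; scale and wording
# are illustrative only.
JUDGE_PROMPT = """You are evaluating a summary of a source document.
Rate the summary from 1 (poor) to 5 (excellent) on each dimension:
- completeness: does it cover the key points of the source?
- correctness: is every statement supported by the source?
- readability: is it fluent and easy to follow?
Return JSON like {"completeness": 4, "correctness": 5, "readability": 4}.

Source:
{source}

Summary:
{summary}
"""

def build_judge_prompt(source: str, summary: str) -> str:
    # str.format would clash with the literal JSON braces, so substitute manually.
    return JUDGE_PROMPT.replace("{source}", source).replace("{summary}", summary)

print(build_judge_prompt("Q3 revenue rose 12% on cloud growth.", "Revenue grew 12% in Q3."))
```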
You can utilize LLMs for powerful machine translation, and recent models have shown superior performance over traditional systems. In a blind comparison study, LLMs like Claude 3.5-Sonnet and GPT-4o were rated as producing “good” quality translations more often than dedicated services like Google Translate and DeepL[5]. Claude 3.5-Sonnet was the top performer, achieving good translations in about 78% of cases across several languages[5]. These findings were validated at the WMT24 conference, where Claude 3.5-Sonnet was the best system in 9 of 11 language pairs[5]. Furthermore, fine-tuning can dramatically boost performance; a 13B LLaMA-2 model fine-tuned for translation (ALMA-13B) saw its performance increase by an average of 12 BLEU points, outperforming GPT-3.5[6].
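For reference, BLEU scores like the ones cited for ALMA-13B can be computed with the sacrebleu package; the sentence pairs below are toy placeholders, not data from the cited studies.

```python
# Corpus-level BLEU with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["The cat sits on the mat.", "He bought three apples yesterday."]
# One reference stream, aligned with the hypotheses.
references = [["The cat is sitting on the mat.", "He bought three apples yesterday."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```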
To build advanced AI systems that can access your private data, you can use Retrieval-Augmented Generation (RAG). This technique significantly improves factual accuracy. A Stanford study found that without RAG, LLMs answered questions with only 34.7% accuracy, but with access to correct reference documents via RAG, their accuracy soared to 94%[7]. Similarly, when answering questions from enterprise SQL databases, an LLM’s accuracy improved from 16.7% to 54.2% when it could access a knowledge graph representation of the data[8].
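A minimal sketch of the RAG pattern is shown below, assuming a toy in-memory document store and TF-IDF retrieval; production systems typically use dense embeddings and a vector database, and the final generation call is omitted. The assembled prompt is then passed to the LLM, which grounds its answer in the retrieved context.

```python
# Minimal RAG sketch: TF-IDF retrieval over a toy document store, then prompt
# assembly. The documents and question are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 via chat and email.",
    "Shipping to EU countries takes 3-5 business days.",
]

vectorizer = TfidfVectorizer().fit(DOCUMENTS)
doc_matrix = vectorizer.transform(DOCUMENTS)

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [DOCUMENTS[i] for i in top]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_rag_prompt("How long do I have to return an item?"))
```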
You can use LLMs to create original written material, from professional emails to marketing copy. The quality of this generated content is measured using a variety of metrics that assess its relevance, coherence, and factual accuracy[9].
If you are a developer, you can use LLMs as coding assistants. Performance is often measured with the pass@k metric, which calculates the probability that at least one of k generated code samples is correct. Performance varies by model size and capability: a smaller model might achieve a pass@1 of 27% and a pass@5 of 66%, while a state-of-the-art model like GPT-4o can achieve a pass@1 of 90.2% on the HumanEval benchmark[10]. In more specific tasks, such as implementing software design patterns, LLMs have demonstrated high success rates, ranging from 84% for the Observer pattern to 94% for the Strategy pattern[11].
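Pass@k is commonly computed with the unbiased estimator introduced for HumanEval-style evaluation: draw n samples per problem, count the c that pass the tests, and estimate the chance that a random subset of k contains at least one pass. A minimal implementation:

```python
# Unbiased pass@k estimator: given n samples per problem with c correct,
# estimate the probability that at least one of k samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 problems with 2, 0, and 5 correct samples.
per_problem = [pass_at_k(10, c, k=5) for c in (2, 0, 5)]
print(sum(per_problem) / len(per_problem))  # mean pass@5 over the benchmark
```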
You can automate data entry by using an LLM to pull structured information from unstructured text. In this domain, LLMs can achieve very high accuracy. For instance, models like Claude-3.5-sonnet have reached an accuracy of 96.2% on certain data extraction tasks[12]. Performance can be further improved with a human-in-the-loop approach; one study found that an LLM-only method achieved 95.7% accuracy, but an LLM-assisted method where a human validates the output increased accuracy to 97.3%[12].
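A common pattern for this kind of extraction is to request JSON for a fixed field schema and validate the reply before it enters your pipeline. The sketch below uses illustrative field names and a canned model reply in place of a real LLM call.

```python
# Schema-constrained extraction sketch: ask for JSON over fixed fields, then
# validate the reply. Field names and the example reply are placeholders.
import json

FIELDS = ["name", "email", "order_id", "issue"]

def build_extraction_prompt(text: str) -> str:
    return (
        f"Extract the following fields from the text: {FIELDS}. "
        "Use null for anything not present. Return only JSON.\n\n"
        f"Text: {text}"
    )

def parse_extraction(reply: str) -> dict:
    record = json.loads(reply[reply.find("{"): reply.rfind("}") + 1])
    # Keep only expected keys so malformed extra fields don't leak downstream.
    return {field: record.get(field) for field in FIELDS}

# Example with a canned model reply.
reply = '{"name": "Ana Ruiz", "email": "ana@example.com", "order_id": "A-1432", "issue": "late delivery"}'
print(parse_extraction(reply))
```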
You can power modern chatbots and virtual assistants with LLMs. The effectiveness of these conversational agents is evaluated using a multi-dimensional approach that goes beyond simple accuracy[13].
When developing a custom LLM, defining the type of data it will process is crucial, as performance varies significantly across domains.
LLMs trained on general-purpose data are designed to handle a wide array of topics. The quality of this training data is mission-critical, as it directly impacts the accuracy, reliability, and generalization capabilities of the model [15]. Performance is often measured using broad benchmarks like MMLU (Massive Multitask Language Understanding), which assesses knowledge across 57 subjects. An empirical formula called the “Performance Law” can even predict a model’s MMLU score based on its parameters [16]. However, it’s important to recognize that general performance benchmarks can be unsatisfying because the range of potential use cases is vast and difficult to capture in a single metric [17]. A fundamental way to evaluate performance is by calculating the cross-entropy loss or negative log likelihood on a dataset, which measures how well the model predicts the next token in a sequence [18].
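As a small worked example of that last point, the snippet below computes the mean next-token cross-entropy (and the corresponding perplexity) for a sample sentence. GPT-2 is used only because it is small and ungated; swap in the model you are evaluating.

```python
# Measure next-token cross-entropy (negative log likelihood) for a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small, ungated stand-in for your own model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return mean cross-entropy over predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"cross-entropy per token: {loss.item():.3f} nats "
      f"(perplexity {torch.exp(loss).item():.1f})")
```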
Mathematics presents a unique challenge for LLMs due to its demand for precision and structured reasoning. When evaluated on questions from the Math Stack Exchange, GPT-4 demonstrated the best performance among several models with a normalized discounted cumulative gain (nDCG) of 0.48 and a Precision@10 of 0.37, though it still failed to provide consistently accurate answers for more complex problems [19][20]. The quality of pre-training data is a key factor. By systematically rewriting and refining mathematical training datasets, a Llama-3.1-8B model’s accuracy improved by +12.4 on the GSM8K benchmark and +7.6 on the MATH benchmark [21].
LLMs are increasingly used for code generation, and their performance can be measured by their ability to produce efficient and correct code. In a study comparing 18 LLMs on Leetcode problems, researchers found that models could generate code that was, on average, more efficient than solutions written by humans [22]. Another controlled experiment evaluated leading models on data science coding challenges, revealing that ChatGPT and Claude had the highest success rates, particularly in analytical and algorithmic tasks [23]. The quality of training data is also paramount for coding. Continual pre-training with a refined codebase (SwallowCode) allowed a Llama-3.1-8B model to boost its pass@1 score by +17.0 on the HumanEval benchmark, demonstrating a significant improvement in code generation capability [21].
The legal field requires a high degree of precision and domain-specific knowledge. To measure LLM capabilities in this area, the LawBench benchmark was created. It assesses models on three levels: legal knowledge memorization, understanding, and application across 20 different tasks [24][25]. In an extensive evaluation of 51 different LLMs, GPT-4 was found to be the best-performing model by a significant margin [24][25]. Another key metric is factuality. Research shows that fine-tuning a model on legal-specific documents can improve its factual precision from 63% to 81% when answering questions about case law and legislation [26].
In healthcare, LLMs are evaluated on their ability to accurately process and analyze complex medical information. A systematic study focused on extracting patient data from unstructured medical reports found that GPT-4o achieved the highest overall accuracy at 91.4% across various categories like patient demographics and diagnostic details [27]. When benchmarked across multiple healthcare tasks, specialized medical LLMs typically provide more faithful answers than general models. However, general LLMs sometimes demonstrate better robustness and can generate more comprehensive (though potentially less accurate) content [28]. In a benchmark for global health, LLMs performed better than top human experts on questions related to tropical and infectious diseases, though their performance could be further improved with intentional fine-tuning using in-context learning [29].
After defining your use case and data type, the next step is to select a model from a platform like the Hugging Face Model Hub. This process involves several key technical decisions.
First, select a model family that aligns with your project’s technical requirements. Different open-source families are developed by various organizations and possess unique architectural strengths and performance characteristics.
You must also decide between a “chat” and an “instruct” model, a distinction based on the model’s fine-tuning and intended purpose.
While base models are trained to predict the next word in a sequence, these fine-tuned models are specialized for structured interactions. However, the distinction can sometimes be fluid, as it’s often possible to give instructions to chat models or have a conversation with an instruct model.
The size of a model, measured by its number of parameters (e.g., 8B for 8 billion), directly impacts its performance and computational cost. Larger models generally provide more accurate and nuanced outputs but require significantly more memory and processing power.
A practical rule of thumb is that you need approximately 2 bytes of VRAM per parameter to run a model in its standard 16-bit precision format (FP16 or BF16). For example, a 7-billion-parameter model requires about 14 GB of VRAM for inference. You can use tools like Hugging Face's accelerate estimate-memory command to calculate the precise memory requirements for a specific model without needing to download it first.
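The rule of thumb itself is simple arithmetic, sketched below. It only accounts for the weights; the accelerate estimate-memory command gives a more precise per-model breakdown, and real deployments also need headroom for activations and the KV cache.

```python
# Back-of-the-envelope estimate of VRAM needed just for the weights:
# parameters times bytes per parameter (ignores activations and KV cache).
def weight_vram_gb(params_billion: float, bits: int = 16) -> float:
    # billions of parameters x (bits / 8) bytes per parameter = gigabytes
    return params_billion * (bits / 8)

for params, bits in [(7, 16), (8, 4), (70, 16)]:
    print(f"{params}B parameters at {bits}-bit ≈ {weight_vram_gb(params, bits):.0f} GB")
```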
If your hardware resources are limited, you can use a quantized version of a model. Quantization is a compression technique that reduces a model’s memory footprint and accelerates inference by lowering the numerical precision of its weights—for example, from 16-bit floating-point numbers (FP16) down to 4-bit integers (INT4). This process involves a trade-off, as very low precision can sometimes slightly degrade model accuracy. Popular methods include GPTQ, AWQ, and Bitsandbytes.
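For instance, a 4-bit model can be loaded through transformers with a bitsandbytes configuration roughly as follows; the checkpoint name is a placeholder, and a CUDA GPU plus the bitsandbytes package are assumed.

```python
# Sketch: load a causal LM in 4-bit with bitsandbytes via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs/CPU automatically
)
```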
Here is a real-world example of how to select an LLM for a customer support bot, following the step-by-step process of defining the use case, data type, and model specifications.
First, we define the specific application and its requirements.
Next, we identify the kind of data the model will need to process.
With the use case and data defined, we can now choose a specific model from the Hugging Face Hub by making several key technical decisions.
For this use case, the LLaMA 3 family is an excellent choice. It is a powerful, widely-supported open-source family known for its strong performance on a variety of benchmarks. Its use of Grouped-Query Attention (GQA) provides faster inference speed, which is critical for a responsive, real-time chatbot experience.
A chat model is the clear choice. These models are specifically fine-tuned on conversational data to handle multi-turn dialogues and maintain context, which is essential for a natural-feeling customer support interaction. An instruction-tuned model might handle a single command well but would struggle to manage the back-and-forth flow of a real conversation.
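In practice, the conversation history is passed to a chat model through its chat template, so the model sees the multi-turn structure it was fine-tuned on. A minimal sketch, assuming the (gated) Llama 3 instruct checkpoint; any chat model with a template works the same way.

```python
# Format a multi-turn support conversation with the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise customer support assistant."},
    {"role": "user", "content": "My order A-1432 hasn't arrived yet."},
    {"role": "assistant", "content": "Sorry about that. Could you confirm the shipping country?"},
    {"role": "user", "content": "Germany."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the formatted string you would tokenize and pass to model.generate
```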
An 8-billion-parameter model, such as Llama-3-8B-Instruct, offers the best balance of performance, speed, and cost for this application.
To deploy the chatbot in a cost-effective and scalable manner, we will use a quantized version of the model.
Quantizing the Llama-3-8B model to 4-bit precision reduces its memory requirement from ~16 GB to approximately 4-5 GB. This allows the chatbot to run on cheaper hardware and handle more concurrent users, and it frees up VRAM for the RAG system's retrieval components, leading to a more efficient and affordable production deployment.

After selecting your model, the final steps are to verify its performance and deploy it. This process ensures the model meets your technical and quality standards before being integrated into a live application.
You should evaluate your chosen model’s capabilities using benchmarking, which involves testing it on standardized tasks and datasets to objectively measure its performance[43][44]. You can use established open benchmarks or create your own.
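If you create your own benchmark, even a small exact-match evaluation can catch regressions early. Here is a minimal sketch with an illustrative two-question eval set and a dummy stand-in for the model call:

```python
# Minimal custom benchmark: exact-match accuracy over hand-written Q&A pairs.
# The eval set and the dummy model function are illustrative placeholders.
EVAL_SET = [
    {"question": "What is the capital of France?", "answer": "paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]

def evaluate(generate_answer, eval_set=EVAL_SET) -> float:
    """Return exact-match accuracy of `generate_answer` over the eval set."""
    correct = sum(
        generate_answer(item["question"]).strip().lower() == item["answer"]
        for item in eval_set
    )
    return correct / len(eval_set)

# Demo with a dummy "model"; replace the lambda with a call to your LLM.
print(evaluate(lambda q: "Paris" if "France" in q else "unknown"))
```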
Hugging Face's lighteval library can help you build and run these tailored evaluations.

Once you are satisfied with the model's performance, you can deploy it. You can either use a managed cloud service or host it on your own hardware.
Choose an Inference Provider: For scalability and ease of management, you can use a managed inference provider. These platforms handle the complexities of GPU infrastructure and scaling[49]. When choosing a provider, key technical performance metrics to consider are latency (how quickly you get a response) and throughput (how many tokens per second the model can generate)[50]; a simple way to probe both is sketched after these deployment options.
Deploy Locally: If your priorities are privacy, cost control, or offline access, you can deploy the model on your own hardware. This gives you full control over your data and eliminates recurring API fees. A key consideration is hardware capacity; for example, a 70-billion-parameter model like Llama-2-70B requires approximately 140 GB of VRAM to run in 16-bit precision, meaning it would need at least six 24GB GPUs[50]. A range of tools is available to simplify local deployment.
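Whichever route you choose, you can probe latency and throughput directly against the serving endpoint. The sketch below assumes an OpenAI-compatible API (offered by many hosted providers and by common local servers); the base URL, API key, and model name are placeholders, and streamed chunk counts are only a rough proxy for tokens per second.

```python
# Rough latency / throughput probe against an OpenAI-compatible endpoint.
# base_url, api_key, and model are placeholders for your own deployment.
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="your-model-name",
    messages=[{"role": "user", "content": "Explain RAG in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
        if first_token_at is None:
            first_token_at = time.perf_counter()

total = time.perf_counter() - start
print(f"time to first token: {first_token_at - start:.2f}s")
print(f"~{chunks / total:.1f} chunks/s (rough proxy for tokens/s)")
```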
Here is a summary of the key stages: define the use case and its requirements, identify the type of data the model will process, choose a model family and decide between a chat and an instruct variant, pick a parameter size and apply quantization if hardware is limited, benchmark the model against your own evaluation criteria, and finally deploy it through an inference provider or on your own hardware.
By following a systematic process, you can successfully develop and launch a custom Large Language Model tailored to your specific needs. This structured approach guides you from the initial idea to a fully operational AI tool.