In this article, we’ll cover the main models Harvey uses and how they compare.
Last updated: Nov 21, 2025
Overview
To help you maximize your impact, Harvey leverages different advanced AI Large Language Models (LLMs), each designed with unique strengths. When you ask Harvey for assistance, our multi-model system will break down the request into sub-tasks, select a model to use, then synthesize the outputs for you. The current models we use are a mix of the following:
OpenAI GPT-5
OpenAI o3 model suite
OpenAI GPT-4.1 (4.1, 4.1-mini, 4.1-nano) and 4o model suite
Google Gemini 2.5 Pro model suite
Anthropic Sonnet/Opus 4 model suite
By default, Harvey’s 'Auto' mode will select a model for you, but if you prefer to choose yourself, workspace admins can enable the Model Selector for individual users or across your workspace. Just keep in mind that no single model is best for every task, so we recommend Auto.
Notes on the latest models:
Gemini 3 Pro: We are working in parallel with the teams at Google and DeepMind to bring their Early Access model towards General Availability, at which point the model will be made available within Harvey.
Claude Sonnet 4.5 and GPT-5.1 are available via selection in our Model Selector; GPT-5.1 is available only in the US and EU. We’re currently testing these models to evaluate further integration.
GPT-5 is included in Auto mode for Assistant chat and draft-style tasks for US and EU customers. We will expand its integration in Auto mode as we continue testing the model for reliability at scale. If you prefer to use GPT-5 across product areas in your workspace now, it is available in the Model Selector for the US and EU.
Our Model Evaluation Process
Harvey’s model evaluation methodology is comprehensive: we assess not only raw model performance, but also the safety and reliability of each model before including it in our system. The pillars of our model evaluation are BigLaw Bench, product performance, and unstructured evaluation.
BigLaw Bench
BigLaw Bench is our benchmark for measuring the ability of LLMs to complete real-world legal tasks. It evaluates both general-purpose LLMs and Harvey’s specialized agentic systems using detailed, lawyer-designed rubrics that score answer quality (accuracy, completeness, legal reasoning) and source reliability (verifiable citations to legal documents).
This approach ensures the models are judged not just on linguistic output, but on their ability to perform trusted, billable legal work with traceable, high-fidelity results. To learn more, read our blog article on Introducing BigLaw Bench.
Product Performance
In a test environment, we integrate the model into key product surfaces (Assistant, Vault, and Workflows) and measure system performance through both human preference and traditional machine learning metrics (accuracy, precision, recall, hallucination rate, latency, and more) over product datasets.
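For intuition, here is a minimal sketch of how dataset-level metrics like these can be computed. The `Example` fields and the labeling scheme are hypothetical illustrations, not Harvey’s internal evaluation schema.

```python
from dataclasses import dataclass

@dataclass
class Example:
    predicted: bool      # did the system flag/extract the item?
    expected: bool       # lawyer-labeled ground truth
    hallucinated: bool   # did the answer cite a nonexistent source? (hypothetical label)

def evaluate(examples: list[Example]) -> dict[str, float]:
    tp = sum(e.predicted and e.expected for e in examples)
    fp = sum(e.predicted and not e.expected for e in examples)
    fn = sum(not e.predicted and e.expected for e in examples)
    tn = sum(not e.predicted and not e.expected for e in examples)
    n = len(examples)
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "hallucination_rate": sum(e.hallucinated for e in examples) / n,
    }

print(evaluate([Example(True, True, False), Example(True, False, True),
                Example(False, True, False), Example(False, False, False)]))
```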
Unstructured Evaluation
Researchers test the model in open-ended, less predictable ways to uncover negative behaviors such as toxicity or bias, as well as to identify major, sudden improvements in reasoning ability.
After we evaluate a model, we revisit the evaluation throughout its lifecycle to ensure we’re offering our users the best functionality.
Model Comparison
To help you navigate the models we offer, we’ve put together a high-level comparison table, including availability by region.
Note: If the Model Selector is enabled in your workspace but you’re not seeing a particular model, it may not be available in your region. Ask your Customer Success Manager to confirm what’s available to you.
| Model | Developer | Release Date | Strengths | Weaknesses | BigLaw Bench Score | Knowledge Cut-off | Regional Availability |
|---|---|---|---|---|---|---|---|
| GPT-5.1 | OpenAI | November 12, 2025 | Legal reasoning; more detailed and better-structured outputs; instruction following | Pending further evaluation | 91.8% | September 2024 | US, EU |
| GPT-5 (reasoning) | OpenAI | August 7, 2025 | Analysis detail and quality; hard problem solving, particularly long-form writing and agentic behavior | Can overthink, providing overly complicated answers to straightforward problems; formatting, particularly structured use of headers and lists | 89.22% | September 2024 | US, EU (AU coming late 2025) |
| o3 | OpenAI | April 16, 2025 | Foundational knowledge and reasoning; planning and execution of agentic problems and hard tasks | Sometimes hallucinates, especially if pressed for details it is not well positioned to provide; can overthink straightforward problems | 84.13% | June 2024 | US, EU, AU |
| GPT-4.1 | OpenAI | April 14, 2025 | Drafts comprehensive and organized outputs; pulling out key information and workflow-specific tasks | Inconsistent quotation frequency and quality; occasionally over-prioritizes conciseness | 78.39% | June 2024 | US, EU, AU |
| Gemini 2.5 Pro | Google | March 25, 2025 | Drafts longer, detailed outputs; multi-step analysis and outputs | Can overthink straightforward problems; occasional misplaced or stiff tone | 85.02% | February 2025 | US, EU |
| Claude Sonnet 4 | Anthropic | May 22, 2025 | Strong output structure and organization; pulling out key information and workflow-specific tasks | Occasionally too wordy; can lack depth and thoroughness on lengthier questions | 81.37% | January 2025 | US, EU |
| Opus 4 | Anthropic | May 22, 2025 | Strong formatting and clarity; grounding responses in underlying documents | Occasionally misses key factual or legal elements or over-simplifies; sometimes too rigid in formatting for tasks requiring specific formats | 82.70% | January 2025 | US, EU |
| Claude Sonnet 4.5 | Anthropic | September 29, 2025 | Pending further evaluation | Pending further evaluation | 89.55% | January 2025 | US, EU, AU |
Looking forward, the landscape of AI technology is continuously evolving, and so is Harvey. We are constantly evaluating the latest models and their performance to ensure we are always providing the most advanced and effective solutions. Stay up-to-date on model availability and feature enhancements by following our Release Notes.
FAQs
Terminology
Generative AI refers to a class of artificial intelligence systems designed to generate new content, ideas, or solutions based on patterns and data it has learned from. It uses machine learning models to create text, images, music, and even code or designs, mimicking human creativity. Unlike traditional AI, which focuses on recognizing patterns or making decisions based on existing data, generative AI can produce novel outputs, often by learning from vast datasets.
When you input information into Harvey, or when Harvey generates a response, it’s processed in tokens. A token is the smallest unit of text a model can understand. Tokens have statistical properties that make them more effective units for a model to work with than whole words.
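As a rough illustration, you can inspect tokenization yourself using OpenAI’s open-source tiktoken library. Using this particular tokenizer is an assumption for illustration; Harvey does not document which tokenizer each underlying model uses.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
text = "The indemnification clause survives termination."
tokens = enc.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")
# Each token is a sub-word piece, e.g. 'The', ' indem', 'n', 'ification', ...
print([enc.decode([t]) for t in tokens])
```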
Think of a "context window" as the amount of information an AI model can 'see' and process at one time. When you're working with Harvey on a document, the context window determines how much of that document the model can consider to generate its response.
A larger context window means the model can review more text at once, leading to more comprehensive and accurate analysis, especially for lengthy and complex documents like credit agreements.
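For a back-of-the-envelope sense of scale, the sketch below estimates whether a document fits in a context window. The window size and the ~4-characters-per-token ratio are illustrative assumptions, not figures for any specific Harvey model.

```python
CONTEXT_WINDOW_TOKENS = 200_000   # illustrative assumption, not a Harvey figure
CHARS_PER_TOKEN = 4               # common rule of thumb for English prose

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(text: str, reserve_for_answer: int = 4_000) -> bool:
    # Leave headroom for prompt scaffolding and the model's response.
    return estimated_tokens(text) + reserve_for_answer <= CONTEXT_WINDOW_TOKENS

# A 150-page credit agreement at ~3,000 characters per page:
agreement = "x" * 150 * 3_000
print(estimated_tokens(agreement))   # ~112,500 tokens
print(fits_in_context(agreement))    # True under these assumptions
```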
Yes. LLMs are trained on data up to a certain point in time, known as a "knowledge cutoff date." This means the model's knowledge of events or information beyond that date may be limited.
If you select a model via the Model Selector, the knowledge cutoff depends on the model used to respond to your query. Most of the models we support have similar knowledge cutoffs.
If you’re using "Auto" mode or don’t have the Model Selector available, the knowledge cutoff is June 2024.
When you use Harvey, we combine AI's broad knowledge with specific, real-time information from your documents or trusted data sources to optimize performance.
If you have the Model Selector enabled, you’ll see that we indicate whether each model is best for complex or general-purpose tasks.
General-purpose tasks: Summarizing and analyzing documents, writing, ideation, and more everyday work.
Complex tasks: Multi-step reasoning, long-form drafting, complex workflows, and more.
Models tagged for “complex tasks” are reasoning models that take more time to think before answering, which usually lets them perform better on tasks at the edge of model capabilities. They may be slower to process, and sometimes overthink the easy things.
If you’re not sure where your task falls, choose Auto, and Harvey will match your task with the best model for the job.
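Conceptually, the choice looks something like the toy router below. The model names and heuristic are hypothetical stand-ins for illustration, not Harvey’s actual selection logic.

```python
# Hypothetical illustration of routing by task profile, not Harvey's logic.
def pick_model(multi_step: bool, long_form: bool) -> str:
    if multi_step or long_form:
        return "reasoning-model"       # slower; thinks before answering
    return "general-purpose-model"     # faster for everyday work

print(pick_model(multi_step=False, long_form=False))  # summarizing an NDA
print(pick_model(multi_step=True, long_form=True))    # drafting a long memo
```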
Security
Harvey is built by and for the world's best legal and financial companies, and data security is paramount. We engineer our platform with robust privacy and security measures. Sensitive data is handled with the utmost confidentiality and is not used to train the underlying public AI models, nor is it shared externally without your explicit permission. We prioritize practical solutions that honor established practices while using technology to enhance professional work.
AI models are incredibly powerful, but they are designed as sophisticated reasoning tools. Sometimes a model provides a very comprehensive response that seems to "overthink" a simple query. This occurs because these models are built for deep reasoning on hard problems.
If you encounter an unexpected response, consider rephrasing your question or, if available for the task, trying a different model that is optimized for more succinct answers.
Updates to the models powering Harvey occur regularly, aimed at continuous improvement and enhancing your experience without requiring you to manage complex technical configurations.
Although most users will benefit from Harvey's default selection, power users might like the option to choose themselves due to personal model preferences. You may find that you prefer the results or tone of certain models depending on the use case.
Not at this time. Our multi-model approach makes it challenging to break this down clearly per query; however, this request is on our development list for future consideration.
Harvey specializes in complex professional tasks, particularly in legal and tax workflows.
No. Harvey is built so you can work with simple, clear instructions and minimizes the need for detailed prompt engineering. We also provide example libraries and features that automatically refine prompts for better outputs.
Think of Harvey as a digital associate—a fast and effective thought partner and generator of first drafts. While the output requires verification, similar to the work of a junior colleague, it saves significant time and enhances overall quality. Harvey also allows you to save favorite prompts for future use, expediting workflows and saving you time.
For more on getting the most out of your prompts, visit our Prompt Writing article.
Usage is measured relatively and is rarely, if ever, exceeded. If Harvey observes that a Customer’s monthly usage is materially disproportionate compared to other similarly situated customers, Harvey will notify the Customer. If, after 48 hours of such notice, the Customer’s usage remains outside typical usage patterns, Harvey may reasonably limit the Customer’s access to the Service to help maintain overall service quality.
Hallucinations are a type of error that occurs when LLMs (Large Language Models) generate inaccurate or fabricated information. This can happen because LLMs are trained on large and diverse corpora of text, but may lack sufficient domain knowledge or logical reasoning.
Harvey minimizes hallucinations by using domain-specific models and knowledge bases:
Domain-Specific Models: These models are trained on large datasets of specialized legal documents, capturing the nuances and complexities of the law. This narrows the gap between general-purpose models, which lack domain expertise, and human experts with specialized knowledge.
Knowledge Bases: Many of Harvey’s tools use domain-specific resources, such as statutes, case law, and legal ontologies, to ground responses in authoritative sources. For legal tasks, Harvey references selected case law and statutes, and when documents are uploaded, the output will include citations from the user-provided materials.
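To make the grounding idea concrete, here is a minimal retrieval-augmented generation (RAG) sketch. The toy lexical retriever and prompt format are illustrative assumptions, not Harvey’s pipeline.

```python
# Toy RAG sketch: retrieve relevant passages, then build a prompt that
# forces the model to answer only from cited sources. Illustrative only.
def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    # Rank passages by word overlap with the query (a stand-in for real retrieval).
    q_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))
    return ranked[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    passages = retrieve(query, documents)
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the sources below, citing them as [n]. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

docs = ["Section 9.2: This agreement is governed by New York law.",
        "Section 4.1: Interest accrues at SOFR plus 2.25%."]
print(build_grounded_prompt("What law governs the agreement?", docs))
```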
Technical
We invest in several areas that go beyond training a single model:
Composable systems: Harvey combines many small model “calls” together. This means a single query might use dozens—or even hundreds—of steps behind the scenes to give you a complete answer with supporting references.
Multi-model strategy: We partner with more than one AI provider. Each model has different strengths, so leveraging them all makes our results more accurate and reduces the need for heavy custom training.
Orchestration: As our systems get more complex, we route your request to the right workflow. This helps ensure you get the best possible result without extra effort.
Workflows: We’re creating step-by-step processes that guide you through legal work from start to finish.
Harvey uses domain expertise to convert professional processes into high-quality AI agents that produce expert-quality work product:
Each request is routed to a cascading series of LLMs tuned for legal synthesis, RAG systems that incorporate public or user-provided data, and powerful reasoning models like o1 that help orchestrate the work end to end.
This allows Harvey to both reduce hallucinations and improve response output quality.
Our system is designed to break a larger task into smaller tasks that AI models are more likely to execute correctly. We use a “Partner”-like model to create a plan and delegate each subtask to the right model; the type of work being completed determines the model that is used.
The Details
We build our systems to work the way a partner coordinates complex, multi-step tasks: delegating the work in a specific order to specific associates.
How It Works
Our Partner model creates a plan to solve the overall task and identifies which subsystems to delegate portions of each task to.
Based on these assignments, our subsystems will receive the necessary context from the Partner model.
The subsystem then performs the task to the best of its abilities.
The output is sent back to the Partner model.
The Partner model synthesizes the output and produces the final result.
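Schematically, that plan/delegate/synthesize loop looks like the sketch below. The subsystem names and the call_model helper are hypothetical, illustrating the pattern rather than Harvey’s implementation.

```python
# Hypothetical sketch of the Partner model's plan/delegate/synthesize loop.
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    subsystem: str   # e.g. "case_law_search", "summarizer" (illustrative names)

def call_model(subsystem: str, prompt: str) -> str:
    # Stand-in for a real LLM or tool call routed to the named subsystem.
    return f"[{subsystem}] result for: {prompt}"

def partner(query: str) -> str:
    # 1. Plan: decompose the query into subtasks with assigned subsystems.
    plan = [
        Subtask("Identify the sub-questions behind the query", "planner"),
        Subtask("Search relevant case law for each sub-question", "case_law_search"),
        Subtask("Summarize and highlight key points per case", "summarizer"),
    ]
    # 2. Delegate: each subsystem receives the necessary context.
    results = [call_model(t.subsystem, f"{t.description}: {query}") for t in plan]
    # 3. Synthesize: combine subsystem outputs into one final answer.
    return call_model("synthesizer", " | ".join(results))

print(partner("Can the pass-through defense defeat class cert in N.D. Cal.?"))
```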
Example (Case Law)
Let’s say you ran this query: “Can the pass through defense be used to defeat class cert in nd cal.” The Partner model might generate the following list of tasks:
A list of sub-questions that need to be answered in order to resolve the original question.
A list of case law searches to answer each sub-question, with examples of what the subsystems should look for.
A simple plan to answer the sub-questions (“summarize these cases and highlight these things for each sub-question”).
Synthesize the results into a final answer.
Example (Drafting)
Let’s say you ran this query: “Draft a client alert on the attached EU AI Act of 2024” with an uploaded copy of the EU AI Act of 2024. The Partner model might generate the following list of tasks:
A list of sub-questions that need to be answered to gather all the relevant information from the provided document.
A simple plan to answer the sub-questions (“summarize these details and extract key information for each sub-question”).
An outline of the client alert, to be filled in with information collected above.
Synthesize the results into a final answer, following the outline.
Currently, GPT-5 thinking is the default and only version of GPT-5 available for selection in Harvey’s Model Selector. We’re in the process of quality testing GPT-5, which will be incorporated and optimized in Auto mode once testing is complete. For the best performance, we recommend using Auto mode, which is designed to deliver the highest-level reasoning and adapt as we continue making improvements.