On April 17, 2025, OpenAI officially released its new reasoning models, o3 (the full version) and o4-mini, in a late-night livestream, replacing older models such as o1 and o3-mini. The update brings significant improvements in knowledge reasoning, multimodal processing, and coding, and revises pricing to give developers and users a more cost-efficient experience.
I. Overview of the model: a comprehensive upgrade from parameters to positioning
OpenAI's o3 and o4-mini are based on a new architecture and focus on different scenarios:
- o3: the full version of the flagship model. It focuses on advanced reasoning and tool use, supports full tool access (e.g., Python, web browsing, and function calls), and for the first time integrates visual reasoning into the chain of thought, which makes it suited to complex problem solving.
- o4-mini: a lightweight, high-performance model focused on fast reasoning and code/vision tasks, offering an outstanding price/performance ratio.
II. Performance comparison: ahead of the older models across the board
1. Knowledge-based reasoning: accuracy jumps when tools are enabled
In math competitions, science problems, and cross-disciplinary tests, o3 and o4-mini clearly outperform the older models, especially when they are allowed to call tools:
Dataset / task | o1 | o3-mini | o3 (no tools) | o3 (with Python) | o4-mini (no tools) | o4-mini (with Python) |
---|---|---|---|---|---|---|
AIME 2024 math competition (accuracy %) | 74.3 | 87.3 | 91.6 | 95.2 | 93.4 | 98.7 |
Codeforces coding contest (Elo) | 1891 | 2073 | – | 2719 | – | 2073 |
GPQA Diamond science questions (accuracy %) | 78 | 77 | 83.3 | – | 81.4 | – |
Humanity's Last Exam (accuracy %) | 13.4 | 20.3 | 20.3 | 24.9 | 14.28 | 17.7 |
Key Findings:
- With Python enabled, o3's AIME accuracy rises from 91.6% to 95.2%, and its Humanity's Last Exam accuracy rises from 20.3% to 24.9% thanks to the toolchain.
- Although o4-mini is a lightweight model, it reaches 93.4% on AIME without tools, close to o3 with tools, so its price/performance ratio is outstanding. o4-mini-high also solved the latest Project Euler problem in 2 minutes and 55 seconds. The problem is not simple: only 15 people solved it within 30 minutes. It was published only a few days earlier, so it is unlikely to be in the model's training set, which suggests that o4-mini-high relied on "thinking" to solve it.
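The tool-related gains above mix two notions that are easy to confuse: the gain in percentage points and the relative gain. A minimal sketch of the arithmetic, using only the scores listed in the table (the function name `tool_gain` is illustrative):

```python
# Scores copied from the benchmark table above, as
# (accuracy % without tools, accuracy % with Python).
scores = {
    ("o3", "AIME 2024"): (91.6, 95.2),
    ("o4-mini", "AIME 2024"): (93.4, 98.7),
    ("o3", "Humanity's Last Exam"): (20.3, 24.9),
    ("o4-mini", "Humanity's Last Exam"): (14.28, 17.7),
}


def tool_gain(without_tools: float, with_tools: float) -> tuple[float, float]:
    """Return (gain in percentage points, relative gain in percent)."""
    absolute = with_tools - without_tools
    relative = absolute / without_tools * 100
    return round(absolute, 2), round(relative, 1)


for (model, benchmark), (base, tooled) in scores.items():
    points, percent = tool_gain(base, tooled)
    print(f"{model} on {benchmark}: +{points} points ({percent}% relative)")
```

For o3, this gives a gain of 3.6 points on AIME and 4.6 points on Humanity's Last Exam. The second is a much larger relative improvement because the starting score is low.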
2. Multimodal Visual Reasoning: From "Image Recognition" to "Image Thinking"
For the first time, o3 and o4-mini support the integration of visual reasoning into the chain of thought, far surpassing older models in complex image understanding tasks:
Dataset | Task description | o1 | o3 | o4-mini |
---|---|---|---|---|
MMMU (college-level multimodal understanding) | Combined formula and figure problem solving (accuracy %) | 77.6 | 82.9 | 81.6 |
MathVista (visual math) | Geometry / function-graph reasoning (accuracy %) | 71.8 | 87.5 | 84.3 |
CharXiv-Reasoning | Scientific chart comprehension (accuracy %) | 55.1 | 75.4 | 72 |
Significance of the breakthrough: o3 can "look at a picture and think" much as a person does, moving from "pixel processing" to "scene reasoning". In one example, a user took a photo on the way to work and asked o3 to identify the location. The model first cropped and zoomed into parts of the picture, analyzed the key information in it, then searched related web pages to narrow the range step by step, and finally gave a specific location.
3. Code and engineering capabilities: o3 is the first choice for developers
In software engineering tasks, o3 leads with tool access and code comprehension, while o4-mini is balanced in lightweight scenarios:
Coding task | Metric | o1-high | o3-mini | o3-high | o4-mini-high |
---|---|---|---|---|---|
SWE-bench Verified (accuracy %) | Algorithms / system design | 48.9 | 69.1 | 69.1 | 68.1 |
Aider code editing (whole) | Multi-language whole-file rewrite (%) | 66.7 | 81.3 | 81.3 | 64.4 |
SWE-Lancer freelance earnings | Freelance tasks ($) | 118,000 | 177,000 | 236,000 | – |
Practical value: on the SWE-Lancer freelance coding tasks, o3 earned $236,000, far ahead of the older models, which makes it a strong core tool for enterprise-level development; o4-mini is suited to rapid prototyping and lightweight debugging.

4. Tool use and execution: o3 as a new paradigm for building agents
o3 shows stronger task coherence in tool-collaboration scenarios such as multi-turn instruction following, browser operation, and function calls:
Tool task | Metric | o1-high | o3-mini | o3 (with tools) | o4-mini (with tools) |
---|---|---|---|---|---|
Scale MultiChallenge | Multi-turn instruction following (accuracy %) | 28.3 | 44.93 | 56.51 | 42.99 |
BrowseComp browser operations | Information retrieval (accuracy %) | 32.4 | 50.0 | 70.8 | 52.0 |
Tau-bench function calls | Structured output (accuracy %) | 49.7 | 51.5 | 57.6 (Retail) | 65.6 (Retail) |
Key Benefits: o3 has commercial-grade capabilities in automating complex processes by autonomously operating virtual browsers and calling APIs to generate structured outputs such as flight booking JSON.
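To make "structured outputs such as flight booking JSON" concrete, here is a minimal sketch of what a function-calling setup might look like. The article does not show the real format, so everything here is an assumption for illustration: the tool name `book_flight`, its fields, the sample model output, and the helper `parse_tool_arguments`. The tool is described with a JSON Schema, and the arguments returned by the model are parsed and checked before anything is executed.

```python
import json

# Hypothetical tool definition in the JSON Schema style used for function calling.
# The name "book_flight" and all of its fields are illustrative assumptions.
BOOK_FLIGHT_TOOL = {
    "type": "function",
    "function": {
        "name": "book_flight",
        "description": "Book a one-way flight for a group of passengers.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string"},
                "passengers": {"type": "integer"},
            },
            "required": ["origin", "destination", "date", "passengers"],
        },
    },
}

# Only the two JSON types used by the schema above are mapped.
PYTHON_TYPES = {"string": str, "integer": int}


def parse_tool_arguments(raw_arguments: str, tool: dict) -> dict:
    """Parse the JSON arguments produced by a model and check them against the schema."""
    schema = tool["function"]["parameters"]
    args = json.loads(raw_arguments)
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key, value in args.items():
        if key not in schema["properties"]:
            raise ValueError(f"unexpected field: {key}")
        expected = PYTHON_TYPES[schema["properties"][key]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {key!r} should be of type {expected.__name__}")
    return args


# Made-up example of arguments a model might return for this tool.
model_output = '{"origin": "SFO", "destination": "NRT", "date": "2025-05-01", "passengers": 2}'
booking = parse_tool_arguments(model_output, BOOK_FLIGHT_TOOL)
print(booking["origin"], "->", booking["destination"], "for", booking["passengers"])
```

Validating before acting matters in agent workflows: a malformed or incomplete call is rejected locally instead of reaching a real booking API.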
III. Parameters and Pricing: Full Optimization of Price/Performance Ratio
Model | Reasoning ability | Speed | Price (input / output, USD per million tokens) | Supported inputs | Context window |
---|---|---|---|---|---|
o1 | Basic | Slowest | $15 / $60 | Text / Image | 200,000 |
o3-mini | High | Moderate | $1.1 / $4.4 | Text | 200,000 |
o4-mini | High | Moderate | $1.1 / $4.4 | Text / Image | 200,000 |
o3 | Highest | Slowest | $10 / $40 | Text / Image | 200,000 |
o1-pro | Specialized domains | Slowest | $150 / $600 | Text / Image | 200,000 |
Core adjustments: o3 is priced one third lower than o1, for a much better price/performance ratio; o4-mini is priced the same as o3-mini but adds image input and stronger reasoning.
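The price comparison can be checked with a short calculation. This is a rough sketch that assumes the listed prices apply per one million tokens (the unit OpenAI uses for its published API prices) and takes the first figure in each pair as the input price and the second as the output price. The token counts in the example are made up.

```python
# (input price, output price) in USD, taken from the pricing table above.
# Assumption: each price applies per one million tokens.
PRICES_PER_MILLION = {
    "o1": (15.0, 60.0),
    "o3-mini": (1.1, 4.4),
    "o4-mini": (1.1, 4.4),
    "o3": (10.0, 40.0),
    "o1-pro": (150.0, 600.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request for the given token counts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Made-up workload: 10,000 input tokens and 2,000 output tokens.
for name in ("o1", "o3", "o4-mini"):
    print(f"{name}: ${request_cost(name, 10_000, 2_000):.4f}")

# For the same workload, o3 costs two thirds of what o1 costs.
ratio = request_cost("o3", 10_000, 2_000) / request_cost("o1", 10_000, 2_000)
print(f"o3 / o1 cost ratio: {ratio:.3f}")
```

Because o3's input and output prices are both one third lower than o1's, the ratio stays at two thirds regardless of the mix of input and output tokens.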