I. Introduction
On April 15, 2025, OpenAI officially launched the new GPT-4.1 series of models: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano. The release marks another major step forward in model performance, cost-efficiency, and real-world applicability, above all in coding, instruction following, and long-context processing, while giving developers stronger options at lower prices and latency.
GPT-4.1 mini is now available at ShirtAI for free, unlimited use, one click from the official website: www.lsshirtai.com

If you want to call GPT-4.1, check out this website: https://coultra.blueshirtmap.com/
II. A Leap in Coding Ability: Strengthening Every Dimension from Code Generation to Engineering Practice
In the core arena of software development, the GPT-4.1 series marks a qualitative shift from "code-fragment generation" to "complex engineering processing". On real-world software engineering tasks, the model completes 54.6% of the SWE-bench Verified suite, 21.4 percentage points above its predecessor GPT-4o and well ahead of the yet-to-be-released GPT-4.5 preview. The breakthrough shows not only in the accuracy of code logic but also in deep understanding of multi-language codebases: on the Aider multi-language diff benchmark, GPT-4.1 scores more than twice as high as GPT-4o, follows the diff format precisely to output only the modified lines, and raises the output-token ceiling to a stable 32,768, significantly cutting developers' debugging costs. In front-end development, human graders preferred its generated web applications for functionality and aesthetics 80% of the time, and for the first time its full-stack capability exceeds that of most dedicated code models.
Comparison of core indicators:
Model | SWE-bench Verified | Aider multi-language benchmark | Front-end manual score | Output token limit | Code diff accuracy
---|---|---|---|---|---
GPT-4.1 | 54.6% | 11.2 | 80% | 32,768 | 53%
GPT-4.5 Preview | 38.0% | 7.4 | 52% | 16,384 | 45%
o3-mini-high | 49.3% | 9.8 | 65% | 16,384 | 60%
o1 | 41.2% | 6.1 | 48% | 128,000 | 62%
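To make the diff-oriented workflow described above concrete, here is a minimal sketch of requesting patch-style output, assuming the `openai` Python SDK (v1.x) and an `OPENAI_API_KEY` in the environment; the file name, prompt, and system instruction are illustrative placeholders, not an official recipe.

```python
# Minimal sketch, assuming the openai Python SDK (v1.x); the file name,
# prompt, and system instruction are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

source = open("app.py").read()  # hypothetical file to be patched

response = client.chat.completions.create(
    model="gpt-4.1",
    max_tokens=32768,  # GPT-4.1's raised output-token ceiling
    messages=[
        {
            "role": "system",
            "content": (
                "You are a code-editing assistant. Reply ONLY with a "
                "unified diff of the changed lines; never restate "
                "unchanged code."
            ),
        },
        {
            "role": "user",
            "content": f"Add input validation to parse_args().\n\n{source}",
        },
    ],
)

print(response.choices[0].message.content)  # the diff to review and apply
```

Because the model returns only the changed lines, the reply stays small even for large source files, which is what keeps review and debugging costs down.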
III. Instruction-Following Breakthrough: Accuracy and Reliability in Complex Task Processing
Faced with complex instructions carrying multiple steps and constraints, GPT-4.1 makes the leap from "fuzzy matching" to "precise execution". On Scale's MultiChallenge benchmark its instruction-following score reaches 38.3%, 10.5 percentage points higher than GPT-4o's, and its IFEval score of 87.4% far exceeds its predecessor's 81.0%. The model particularly strengthens three hard cases: format compliance (e.g., nested XML/YAML structures), negative instructions (explicitly refusing sensitive requests), and ordered tasks (executing workflows step by step). In OpenAI's internal evaluations, the rate of invalid edits on difficult prompts plunged from 9% with GPT-4o to 2%, and in multi-turn dialog its contextual coherence reaches 92%, accurately tracking the details demanded by earlier instructions, which delivers industrial-grade reliability for intelligent customer service, automated workflows, and similar scenarios.
Comparison of core indicators:
Model | MultiChallenge | IFEval | Multi-turn dialog coherence | Negative-instruction compliance | Ordered-task completion rate
---|---|---|---|---|---
GPT-4.1 | 38.3% | 87.4% | 92% | 98% | 95%
GPT-4.5 Preview | 44.2% | 81.0% | 78% | 89% | 82%
o3-mini-high | 40.1% | 85.2% | 88% | 96% | 91%
o1 | 45.1% | 87.1% | 89% | 97% | 94%
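Format compliance is easiest to appreciate with a guardrail around it. Below is a hedged sketch that asks for strictly structured YAML and fails loudly on violations; it assumes the `openai` SDK and PyYAML (`pip install openai pyyaml`), and the ticket text and three-key schema are invented for illustration.

```python
# Hedged sketch of a format-compliance guard; the ticket text and the
# three-key schema are invented for illustration.
import yaml
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Summarize the ticket below as YAML with exactly three top-level keys: "
    "severity (one of low/medium/high), summary (one sentence), and "
    "next_steps (a list of strings). Output nothing except the YAML.\n\n"
    "Ticket: checkout page returns HTTP 500 when the cart is empty."
)

reply = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": PROMPT}],
).choices[0].message.content

data = yaml.safe_load(reply)  # raises yaml.YAMLError if not valid YAML
assert set(data) == {"severity", "summary", "next_steps"}, "schema violated"
print(data["severity"], "-", data["summary"])
```

The stricter the model's format adherence, the less often the parse-and-assert stage trips, which is exactly what the IFEval and invalid-edit numbers above are measuring by proxy.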
IV. Long-Context Innovation: Million-Token Windows Open New Possibilities for Deep Multi-Scenario Applications
GPT-4.1 comes standard with a 1-million-token context window, pushing long-text processing into a new dimension: it can hold roughly 8 complete React codebases or 3,000 pages of legal documents, eliminating the "out-of-context" pain point of previous models. On the Video-MME long-video (no subtitles) analysis task, the model scores 72.0%, a 6.7-percentage-point improvement over GPT-4o; tests on the open-source Graphwalks dataset show its multi-hop reasoning accuracy at million-token scale reaching 61.7%, far beyond the short-context o1 model's 48.7%. OpenAI has also optimized the economics of long-context requests: the 1-million-token window is included in standard pricing, the prompt-cache discount rises from 50% to 75%, and response latency for 128K-token inputs drops to 15 seconds, 30% faster than GPT-4.5, a grounded technical solution for scenarios such as legal contract review and audits of large codebases.
Comparison of core indicators:
Model | Context window (tokens) | Video-MME (no subtitles) | Graphwalks reasoning | Cache discount | 128K-token latency
---|---|---|---|---|---
GPT-4.1 | 1,000,000 | 72.0% | 61.7% | 75% | 15 s
GPT-4.5 Preview | 128,000 | 65.3% | 42.0% | 50% | 22 s
o3-mini-high | 256,000 | 68.5% | 55.2% | 50% | 18 s
o1 | 128,000 | 64.1% | 48.7% | 50% | 25 s
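As a rough illustration of what a million-token window enables, the sketch below stuffs an entire (hypothetical) repository into a single request instead of building a chunking-and-retrieval pipeline; the `my_repo` directory, the `load_config()` symbol, and the question are placeholders.

```python
# Rough sketch of a single-request, whole-repository review enabled by
# the 1M-token window; "my_repo" and the question are hypothetical.
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# Concatenate an entire codebase into one prompt. With a million-token
# window, projects that previously required chunking/retrieval pipelines
# can fit in a single call (watch the token count for very large repos).
corpus = "\n\n".join(
    f"### {path}\n{path.read_text(errors='ignore')}"
    for path in sorted(Path("my_repo").rglob("*.py"))
)

answer = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {
            "role": "user",
            "content": (
                "Across the files below, trace every caller of load_config() "
                "and flag call sites that ignore its return value.\n\n" + corpus
            ),
        }
    ],
)

print(answer.choices[0].message.content)
```

This is the pattern the Graphwalks numbers gesture at: multi-hop questions whose evidence is scattered across the whole input rather than sitting in one retrievable chunk.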
V. Cost and Efficiency: A Pragmatic Upgrade for Developers
OpenAI's "tiered pricing + performance optimization" strategy allows developers of all sizes to get a cost-effective option. The entry-level model, GPT-4.1 nano, reduces input cost to $2/million tokens and output cost to $8/million tokens while maintaining a million-token window, and reduces latency by 50% compared to GPT-4o, making it the preferred choice for light-loaded tasks such as text categorization and auto-completion; the mid-range model, GPT-4.1 mini, surpasses the performance of GPT-4o while reducing cost by 60% in medium-loaded scenarios such as code generation and multi-round dialog. The mid-range model, GPT-4.1 mini, outperforms GPT-4o in code generation, multi-round dialogs, and other medium load scenarios while costing 60% less. In comparison, the input cost of GPT-4.5 preview is as high as $75/million tokens, which is only 1/25th of the price/performance ratio of GPT-4.1, which is the main reason why it will be abandoned by July 2025, and the new model uniformly adopts the "GPT-4" model. In addition, the new model adopts a uniform "no surcharge for long contexts" policy, which completely changes the cost pain point of the previous model when dealing with long text.
Comparison of core indicators:
Model | Input cost ($/M tokens) | Output cost ($/M tokens) | Latency (128K tokens)
---|---|---|---
GPT-4.1 nano | 0.10 | 0.40 | 5 s
GPT-4.1 mini | 0.40 | 1.60 | 8 s
GPT-4.1 | 2.00 | 8.00 | 15 s
GPT-4.5 Preview | 75.00 | 150.00 | 22 s
o3-mini-high | 1.10 | 4.40 | 18 s
o1 | 15.00 | 60.00 | 25 s
\* Cost-performance index = (coding capability + instruction score + context window) / (cost + latency); higher is better.
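For budgeting, per-request cost at the table's listed rates is simple arithmetic: token count divided by one million, times the per-million price. A small self-contained calculator (prices copied from the table above; the token counts are illustrative):

```python
# Self-contained cost estimator; prices copied from the table above
# (USD per million tokens), token counts are illustrative.
PRICES = {
    "gpt-4.1-nano": (0.10, 0.40),  # (input, output)
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1": (2.00, 8.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Example: a 128K-token prompt answered with 2K tokens on each tier.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 128_000, 2_000):.4f}")
```

At these rates the same 128K-token request costs fractions of a cent on nano and a few cents on full GPT-4.1, which is the practical meaning of the tiered-pricing strategy described above.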
If you want official paid accounts for GPT Plus, Claude Pro, or Grok Super and don't know how to pay for the subscription yourself, you can contact our professional team (wx: abch891).