Alibaba Cloud recently launched its latest multimodal AI model, Qwen-VLo, which drew a strong reaction from the AI community on release. After trying it for the first time, many users said its image generation even surpasses that of GPT-4o and shows impressive creative ability.
As Alibaba Cloud's latest work in multimodal AI, Qwen-VLo not only inherits its predecessor's strengths in image understanding and generation but also improves significantly in several areas, including user interaction, editing accuracy, and language support. The model is currently free for users worldwide and can be used directly through the Qwen Chat platform.
Technical features and innovative highlights
Core Technical Advantages
Qwen-VLo's core technical advantages can be summarized as follows:
Feature | Capability | Technical advantage |
---|---|---|
Detail handling | Enhanced detail capture | High semantic consistency throughout the generation process |
Editing | Single-instruction image editing | Supports style transfer, adding and removing elements, adding text, and other operations |
Language support | Multilingual compatibility | Covers multiple languages, including Chinese and English, improving the experience for users worldwide |
Resolution adaptation | Flexible aspect ratios | Inputs and outputs support arbitrary resolutions and aspect ratios |
Upgraded Image Understanding
Beyond image generation, Qwen-VLo also performs well at image recognition and interpretation. The model can accurately identify specific objects in an image. For example, after generating an image containing pets, it can correctly name breeds such as tabby cats and beagles, which shows the depth of its visual understanding.
More notably, Qwen-VLo also has an image annotation capability that lets it detect and segment objects in existing images. For example, when asked to segment a banana, it marks the banana's complete outline with a red mask. This precise semantic segmentation provides a solid foundation for subsequent image editing.
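To make the idea of a "red mask" concrete, here is a minimal Python sketch of how a segmentation result is commonly visualized: a boolean mask selects the object's pixels, and red is blended into them. This is not Qwen-VLo's implementation; the function name `overlay_red_mask`, the `alpha` parameter, and the toy image are all illustrative assumptions.

```python
def overlay_red_mask(image, mask, alpha=0.5):
    """Blend red into every pixel where mask is True.

    image: list of rows, each a list of (r, g, b) tuples.
    mask:  list of rows of booleans with the same height and width.
    alpha: blend strength (hypothetical parameter, 0 = no tint, 1 = solid red).
    """
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("mask must have the same height and width as image")
    red = (255, 0, 0)
    out = []
    for img_row, mask_row in zip(image, mask):
        row = []
        for pixel, selected in zip(img_row, mask_row):
            if selected:
                # Linear blend between the original color and pure red.
                pixel = tuple(int((1 - alpha) * c + alpha * r)
                              for c, r in zip(pixel, red))
            row.append(pixel)
        out.append(row)
    return out


# Toy example: a 4x4 gray image with a 2x2 "object" in the middle.
image = [[(100, 100, 100)] * 4 for _ in range(4)]
mask = [[1 <= y <= 2 and 1 <= x <= 2 for x in range(4)] for y in range(4)]
highlighted = overlay_red_mask(image, mask)
# highlighted[1][1] is (177, 50, 50); pixels outside the mask are unchanged.
```

In practice the mask would come from the segmentation model rather than being written by hand; the blending step is the same either way.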

In-depth testing of image editing features
Object Replacement Test
In hands-on tests, Qwen-VLo's image editing performed well. The first tests covered simple object replacement:
Test Case One: Drink Substitution
- Initial task: generate a cartoon-style image of a polar bear drinking cola
- Edit instruction: replace the cola with milk
- Test result: the replacement succeeded; the background and the polar bear stayed essentially unchanged, and only the drink changed


Test Case Two: Animal Replacement
- Initial task: generate a photorealistic image of a bird
- Edit instruction: replace the bird with a pigeon
- Test result: the species was replaced accurately and the surrounding scene stayed consistent


It is worth noting that in a test based on the "garlic bird" meme, the model did not understand the internet slang, but it still attempted the basic bird-replacement instruction and showed solid instruction following.

Multi-step composite editing
A more complex test involved a multi-step image creation and editing process:
- Sketch generation stage: create a basic line sketch
- Coloring stage: add color and detail to the sketch
- Text addition stage: add Chinese text to the image
- Text editing stage: modify the existing text
Throughout the process, Qwen-VLo kept the main subject and background stable. Although small details varied slightly between steps, the overall editing results were satisfactory. In particular, the model showed strong text comprehension and rendering when editing both Chinese and English text.




Progressive Generation Explained
Innovation in the generation mechanism
Qwen-VLo uses a progressive image generation mechanism that is more than a visual effect. Unlike the "pseudo-progressive" display of some models, the image is actually produced step by step.
Characteristics of the generation process
Watching Qwen-VLo generate an image reveals the following characteristics:
- Top-down construction: the image is generated incrementally from top to bottom
- Dynamic adjustment: the predicted content is continuously adjusted and refined during generation
- Semantic consistency: the final result is kept coherent as a whole
This generation mechanism is especially well suited to images containing long passages of text that need fine control, such as advertisement designs or comic panels. The model continually corrects itself during generation, much like a human artist who "thinks while drawing", and this "visual chain of thought" opens up new possibilities for AI creation.
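The three characteristics described above can be illustrated with a toy Python sketch. It is not Qwen-VLo's algorithm, whose internals the article does not describe. It only shows the general shape of top-down generation with revision: each new row is predicted from the rows above it, and the previous row is revisited once more context exists. The names `generate_top_down`, `predict_row`, and `refine_row` are hypothetical stand-ins for a real model.

```python
def generate_top_down(height, width, predict_row, refine_row):
    """Yield a snapshot of the canvas after each row is produced."""
    canvas = []
    for y in range(height):
        # Top-down construction: the new row is predicted from all rows so far.
        canvas.append(predict_row(canvas, y, width))
        # Dynamic adjustment: revise the previous row using the new context.
        if y > 0:
            canvas[y - 1] = refine_row(canvas[y - 1], canvas[y])
        # Copy the rows so later revisions do not alter earlier snapshots.
        yield [row[:] for row in canvas]


def predict_row(canvas, y, width):
    # Hypothetical stand-in for a model: brightness grows with the row index.
    return [y * 10] * width


def refine_row(row, next_row):
    # Hypothetical stand-in: pull each value toward the row below it.
    return [(a + b) // 2 for a, b in zip(row, next_row)]


snapshots = list(generate_top_down(3, 2, predict_row, refine_row))
# snapshots[-1] is [[5, 5], [15, 15], [20, 20]]
```

Each snapshot corresponds to what a user would see at that moment: the image grows downward, and rows that were already visible can still change slightly as later rows arrive.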

User Experience Case Studies
Since Qwen-VLo became publicly available, a large number of creative use cases have emerged from the user community:
Creative Drawing Assistant
- Users upload hand-drawn sketches, and the model automatically colors them and refines the details
- Supports anime character design, style transfer, and other creative needs

Marketing material production
- Quickly generate promotional posters containing specific text
- Create brand logo displays, such as "Qwen Chat" promotional signage

Entertainment content creation
- Meme image creation, with support for adding popular captions and emoticons
- Restyling film and television characters, for example in a Ghibli animation style


An important feature of Qwen-VLo is that it lowers the barrier to AI image creation. Users do not need complex prompt engineering skills; they only need to describe what they want in natural language to get satisfactory results. This "conversational creation" mode makes AI creation accessible and enjoyable for ordinary users.
Users can currently visit https://chat.qwen.ai/ to try the full capabilities of Qwen-VLo for free and see this multimodal AI technology in action.