Decoding AI | Birth of a “Chinese Sora”: Domestic AI Races to Overtake on the Multimodal Track

This Spring Festival, Professor Zhu Jun, Vice Dean of the Institute for Artificial Intelligence at Tsinghua University and co-founder and chief scientist of Shengshu Technology, was alarmed by Sora, the video model launched by OpenAI. He said he used the word “alarmed” partly because of the outstanding performance Sora demonstrated, and partly out of concern about OpenAI’s unreleased technology and the uncertainty of its future breakthroughs. At the time, many people were asking: when will China have a model that can generate long videos the way Sora does?

Recently, at the Zhongguancun Forum, Zhu Jun, on behalf of Tsinghua University and Shengshu Technology, released Vidu, China’s first long-duration, high-consistency, high-dynamics video model. Zhu Jun said that Vidu, a joint research effort, can be regarded as the latest achievement of full-stack independent innovation, with technical breakthroughs along multiple dimensions: it can simulate the real physical world, shows imagination, and understands multi-shot camera language. It is no longer limited to simple camera pushes and pulls, and can generate videos of up to 16 seconds with a single click.

Previously, the industry had quipped that there were only two kinds of models in video generation: OpenAI’s Sora, and everything that was not Sora. Vidu has now broken that framing. In the view of many industry insiders, no first-mover monopoly has yet formed in video generation, and latecomers who master the algorithmic principles and accumulate enough engineering experience are fully capable of catching up with Sora.


Vidu’s Birth Process

Before Sora, companies such as Runway, Pika, Google, and Meta had already launched products in the text-to-video field. Vidu’s launch therefore invites comparison with these products.

In Zhu Jun’s demonstration, apart from Sora, which currently cannot be tested online, Vidu was compared with popular online systems such as Pika and Runway Gen-2. The latter two generate short videos of at most 4 seconds, whereas Vidu can generate 16-second videos. Zhu Jun believes Vidu’s semantic understanding is also more prominent.

Zhu Jun said the team had conducted extensive research in areas such as diffusion models and Bayesian deep learning. After Sora came out, the team found that its own technical roadmap was highly consistent with Sora’s, so it pushed firmly ahead. In September 2022, the team released U-ViT, the first architecture to fuse Diffusion and Transformer, while the Sora team released the DiT architecture three months later.
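The fusion idea behind architectures like U-ViT can be illustrated with a toy sketch. This is a hypothetical illustration only, not Shengshu’s or OpenAI’s actual code: the core notion is that image patches, the diffusion timestep, and the text condition are all embedded as tokens and concatenated into one sequence for a shared Transformer backbone. The `embed` stub and all names here are invented for illustration.

```python
def embed(value, dim=4):
    """Stub embedding: map a scalar/id to a fixed-size vector.
    A real model would use learned embedding tables or projections."""
    return [float(value) / (i + 1) for i in range(dim)]

def tokenize_inputs(image_patches, timestep, text_ids, dim=4):
    """The 'everything is a token' idea: condition, time, and noisy
    image patches all enter the same sequence for one backbone."""
    tokens = [embed(t, dim) for t in text_ids]        # text condition tokens
    tokens.append(embed(timestep, dim))               # diffusion-timestep token
    tokens += [embed(p, dim) for p in image_patches]  # noisy image-patch tokens
    return tokens

seq = tokenize_inputs(image_patches=[7, 8, 9, 10], timestep=500, text_ids=[3, 14, 15])
print(len(seq))  # 3 text + 1 time + 4 patch tokens = 8
```

In a real U-ViT- or DiT-style model, this unified token sequence is what lets a single Transformer backbone serve the diffusion process without separate pathways per modality.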

On this route, Zhu Jun said, the team has been conducting large-scale training. In March 2023, the team open-sourced UniDiffuser, the world’s first multimodal large model based on the fusion architecture, and the first to verify the laws of large-scale training and scaling. Sora’s appearance then spurred the team to accelerate: it urgently launched the project and reported to the leadership of Haidian District, receiving substantial support. Two months later, Vidu was ready to be shown.

Zhu Jun said at the event that some people might ask: how could a breakthrough be achieved within two months of Sora’s release? Is it technically simpler than Sora? Is it just a cheap knockoff?

“Sorting out the timeline shows that the key milestones of Vidu and Sora are staggered.” Zhu Jun said the team also ran into many difficulties while building Vidu, for example with computing power. In 2023, with limited compute, the team concentrated its resources on image tasks, while on the video side it focused on developing larger models at smaller computational cost, with an emphasis on verifying how the models behave after scaling up.

Zhu Jun said that Sora’s technical route differs from that of large language models: it centers on the diffusion model, with the Transformer being only one component. Many mistakenly believe it is a branch of Transformer-based work, but it is not. The team therefore needed to fully understand the different algorithmic principles, and beyond that there was much experience and insight to master in model architecture, including large-scale engineering implementation.
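The point that the diffusion process is the core algorithm, with the network being only one pluggable component inside it, can be shown with a minimal sketch. This is a hypothetical toy on 1-D data, not the actual Vidu or Sora pipeline, and it deliberately omits the exact DDPM posterior coefficients; the noise predictor here is an untrained stub standing in for what would be a Transformer backbone.

```python
import math
import random

def noise_schedule(num_steps):
    """Linearly increasing noise levels beta_t."""
    return [0.0001 + (0.02 - 0.0001) * t / (num_steps - 1) for t in range(num_steps)]

def add_noise(x, beta, rng):
    """One forward diffusion step: mix signal with Gaussian noise."""
    return [math.sqrt(1 - beta) * xi + math.sqrt(beta) * rng.gauss(0, 1) for xi in x]

def stub_noise_predictor(x, t):
    """Stand-in for the learned backbone (a Transformer in U-ViT/DiT).
    A trained model would predict the noise present in x at step t."""
    return [0.0 for _ in x]  # untrained stub: predicts no noise

def sample(num_steps, dim, predictor, seed=0):
    """Reverse process: start from pure noise, iteratively denoise.
    Note the loop is the diffusion algorithm; the predictor is swappable."""
    rng = random.Random(seed)
    betas = noise_schedule(num_steps)
    x = [rng.gauss(0, 1) for _ in range(dim)]
    for t in reversed(range(num_steps)):
        eps = predictor(x, t)
        # simplified update: remove predicted noise, rescale
        x = [(xi - betas[t] * ei) / math.sqrt(1 - betas[t]) for xi, ei in zip(x, eps)]
    return x

out = sample(num_steps=50, dim=8, predictor=stub_noise_predictor)
print(len(out))  # 8 generated values
```

Swapping `stub_noise_predictor` for any other function with the same signature leaves the sampler untouched, which is the sense in which the Transformer is “just a part of it.”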

“When we trained the first version of UniDiffuser, the computing power used was nearly 40 times what the same model required by the middle of last year. The team cut the computing requirement by a factor of 40 in half a year; in other words, with the same compute it could train a model 40 times larger. Beyond that, the compute consumed by long videos and the network-bandwidth demands of distributed systems all posed new challenges that had to be tackled bit by bit, alongside compute support and the governance of high-quality data.” Zhu Jun said the team’s past experience with images and short videos, combined with these factors, produced the final result.

In January of this year, the team achieved 4-second video generation, matching the results of Pika and Runway, and broke through to 8 seconds at the end of March. Although the improvement was only a few seconds, in Zhu Jun’s view it was enormous progress and confirmed that the technical route was correct. In April, the team intensified its efforts further. Vidu is now showing 16-second results to the public, but Zhu Jun believes it will iterate even faster in the near future.

As for the name, Vidu is partly an abbreviation of “video”, signifying a large video model; it is also a pun on “We do”, signaling the team’s determination to the outside world. “The current progress is still preliminary, and we hope to cooperate with high-quality domestic partners to jointly advance the technology,” Zhu Jun said.

Shengshu’s Valuation Has Reached $100 Million

Shengshu Technology, the R&D team behind Vidu, was officially established in March 2023, jointly incubated by Ruilai Intelligence (RealAI), Ant Group, and Baidu Ventures, with former Ruilai Intelligence Vice President Tang Jiayu appointed as CEO. In June 2023, the company completed an angel round of nearly 100 million yuan, led by Ant Group with participation from BV Baidu Ventures and Zhuoyuan Capital, at a post-investment valuation of 100 million US dollars.

Zhou Zhifeng, a partner at Qiming Venture Partners, said that large models have been gradually shifting from pure language to multimodal exploration. Shengshu Technology chose the multimodal track from its founding and is the earliest and most deeply experienced team in this field in China; much of its work has been cited by the OpenAI and Stable Diffusion teams.

The core members of Shengshu Technology’s founding team come from the Institute for Artificial Intelligence at Tsinghua University, with Zhu Jun, the institute’s Vice Dean, serving as chief scientist. CEO Tang Jiayu earned his bachelor’s and master’s degrees in Tsinghua’s Department of Computer Science and is a member of the THUNLP group (the department’s Natural Language Processing and Social Humanities Computing Laboratory). CTO Bao Fan is a doctoral student in the same department and a member of Professor Zhu Jun’s research group; he has long worked on diffusion models and led both the U-ViT and UniDiffuser projects.

After completing the financing in 2023, Tang Jiayu said in a media interview that, globally, research on multimodal large models is still at an early stage and the technology is not yet mature. This is unlike the red-hot language models, where overseas teams are already a generation ahead. Rather than struggling to catch up on language models, Tang Jiayu therefore believes multimodality is an important opening for domestic teams to seize on the large-model track.

On catching up with OpenAI, Tang Jiayu said it is relatively easier for China to catch up with Sora now than it was to chase ChatGPT last year. Sora is roughly at the GPT-2 stage and has not yet formed a clear first-mover or monopoly advantage, and the Shengshu team is very familiar with the underlying architecture. Once the team accumulates enough engineering experience, catching up with Sora is entirely possible.

As for why Shengshu Technology was spun off to operate independently, Tang Jiayu cited two main considerations. First, business positioning: Ruilai Intelligence focuses on secure and controllable artificial intelligence solutions, such as improving the security and reliability of AI technology and applications for B-end customers, while Shengshu focuses on multimodal large models and application development, mainly C-end products. Second, large-model startups demand huge resource investment in their early stages, making independent, separate operation more suitable.

In January 2024, Shengshu Technology launched a short-video generation feature on its visual creative design platform PixWeaver, supporting 4-second short videos with high aesthetic quality. After Sora’s launch in February, Shengshu Technology set up a formal R&D team to accelerate work on the text-to-video direction. In March it achieved 8-second video generation internally, and in April it surpassed 16-second generation, breaking through in both quality and duration.

On the technical route, Vidu adopts a Diffusion-Transformer fusion architecture fully consistent with Sora’s. Unlike multi-step pipelines that reach long durations by inserting interpolated frames, Vidu takes the same route as Sora and generates high-quality video directly in a single stage. At the bottom layer it is a single model, fully end-to-end: generation happens in one pass, with no intermediate frame interpolation or other multi-step processing, and the conversion from text to video is direct and continuous.
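The contrast between the two routes can be sketched in a toy example. This is purely illustrative under invented names, not either system’s actual code: `fake_model` stands in for a generative model, the interpolation pipeline produces sparse keyframes and synthesizes the in-between frames in a second stage, while the end-to-end route (the approach the article attributes to Vidu and Sora) emits every frame in one model call.

```python
def fake_model(prompt, num_frames):
    """Stand-in for a generative model: returns num_frames 'frames',
    each a float derived deterministically from the prompt."""
    base = float(sum(prompt.encode()) % 100)
    return [base + i for i in range(num_frames)]

def interpolation_pipeline(prompt, num_frames):
    """Multi-step route: generate every 4th frame as a keyframe,
    then fill the gaps by linear interpolation in a second stage."""
    keyframes = fake_model(prompt, num_frames // 4)
    frames = []
    for a, b in zip(keyframes, keyframes[1:] + [keyframes[-1]]):
        for k in range(4):
            frames.append(a + (b - a) * k / 4)  # in-between frames are synthesized
    return frames[:num_frames]

def end_to_end(prompt, num_frames):
    """Single-stage route: one model call produces every frame directly."""
    return fake_model(prompt, num_frames)

clip_a = interpolation_pipeline("a cat on a beach", 16)
clip_b = end_to_end("a cat on a beach", 16)
print(len(clip_a), len(clip_b))  # 16 16
```

The practical difference the article points to is that in the first route the intermediate frames are guessed from their neighbors, while in the second every frame comes straight from the model, which helps temporal continuity.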

(Image: top, Vidu; bottom, Sora)

Racing on the AI Long Track

In February of this year, the Sora video model released by OpenAI shocked the market. At the Zhongguancun Forum, Huang Tiejun, chairman of the Beijing Academy of Artificial Intelligence, said that for the past two months everyone has been swamped by Sora, and there is a problem with this: a few dozen videos have sent everyone rushing after it like fans chasing a star, which is not a healthy phenomenon. Any successful technology is the result of long-term accumulation; even with artificial intelligence developing this quickly, excellent results are hard to achieve without prior accumulation.

Hype aside, Sora has become the new catch-up target in video generation, after ChatGPT. Although Sora has shown capabilities far beyond its peers, it has not opened to the public as Pika and Runway did. Instead it has taken a conservative strategy similar to Google’s and Meta’s: announce first, run internal testing slowly, and wait for the right moment to open up.

Chen Chen, a partner at Analysys, said Sora’s decision not to open to the public mainly comes down to two considerations. First, security: given the risk that text-to-video technology could be abused, OpenAI may need to run a series of safety tests and optimizations. Second, business strategy: GPT went through 4 to 6 months of internal testing before gradually opening up, possibly because OpenAI needed to assess the model’s actual operating costs first. ChatGPT’s operating costs are already very high, and adding Sora could raise them by an order of magnitude, so OpenAI needs to work out a commercialization route before the product launches.

At present, many domestic companies are moving into video models. By Chen Chen’s observation, they fall into three categories. The first is the established tech giants, such as ByteDance, which has long been active in video and earlier released the high-definition text-to-video model MagicVideo-V2; Alibaba Cloud, Tencent, Baidu, iFLYTEK and others are continuing general-purpose multimodal large models while also building industry-specific models in vertical fields. The second is vendors specializing in visual analysis, such as Hikvision, which have begun investing in video model development. The third comprises vendors focused on content development and creative marketing, such as Kunlun Wanwei and Wanxing Technology, which have also built their own large video models.

Chen Chen told reporters: “Judging by the generation results, Vidu’s semantic understanding, video duration, quality, and consistency have reached a leading position in the domestic text-to-video field. Moreover, Vidu’s technical route resembles Sora’s, adopting single-model end-to-end generation, which is also why its video fluency and visual quality look good.”

Chen Chen noted, however, that compared with Sora, Vidu still lags in duration, richness of visual elements, and detail. But Vidu is a staged product, and further breakthroughs in model capability are only a matter of time. Sora, for its part, has yet to open up, possibly because it still needs to integrate practical task-handling capabilities and resolve questions of resources and business model. From this perspective, compared with large language models, China started relatively early on visual models and has deep accumulation in technology and experience. What is needed now is to leverage the collaborative advantages of the domestic industrial chain and bring multimodal capabilities into rich B-end and C-end application scenarios.

On whether domestic AI companies can overtake on the curve through multimodality, Chen Chen told reporters that breakthroughs in large video models will inevitably accelerate progress toward AGI. But the key to AGI still lies in whether a system can spontaneously handle an unbounded range of tasks and whether it has a cognitive architecture consistent with humans’. Dissenting voices about Sora have also emerged recently, with some experts doubting that Sora is a true path to AGI. Still, today’s relatively independent technical routes may well converge in the future, producing truly intelligent and flexible AGI models.

As for who leads and who trails, Chen Chen said that at the current pace of model iteration, any talk of one side surpassing another is temporary. AI’s development is not a zero-sum game; it will surely be the result of shared progress.