Heart of the Machine Original

Author: Zenan
From autonomous driving to recommendation systems, machine learning development can now be done with a unified platform.

Different machine learning tasks run on one unified platform, speed doubles, and GPU scheduling leaves zero fragments. This is the latest technology that Volcano Engine has opened up.

On July 20, the Volcano Engine FORCE Power Conference was held in Beijing. At the event, the brand unveiled a series of new capabilities for Volcano Engine, just a year after its launch.

On the AI side, Volcano Engine launched a multi-cloud deployment solution for its machine learning and intelligent recommendation platforms. According to Xiang Liang, head of the machine learning system at Volcano Engine, AI training jobs for ByteDance businesses as different as Douyin, Xigua Video, and Feishu are all submitted through one unified training platform and trained by one unified training system.

The solution released this time also follows the principle of "unification and openness". The goal is to let algorithm engineers turn their own ideas into reality efficiently.

Pictured: Xiang Liang, head of the machine learning system at Volcano Engine.

Unification and openness of machine learning capabilities
Volcano Engine grew out of ByteDance's technology platform. Its algorithm engineering and business platform can be divided into recommendation systems and machine learning platforms. Both are built on ByteDance's unified machine learning system, which in turn rests on powerful computing infrastructure.
This unified system serves ByteDance's video, content, and e-commerce businesses. Xiang Liang believes that although these businesses differ, they can all be abstracted into machine learning problems and trained in a unified way.
"On Douyin, how long users watch a video and how often they like, share, or follow look like different things on the surface. Once they are turned into machine learning tasks, they reduce to the same problem: given that event A has occurred, predict the probability of event B. Estimating the probability that a user likes or comments after reading an article on Dongchedi is analogous, and so are e-commerce applications," Xiang Liang said.
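The "event A, then event B" framing can be sketched in a few lines of Python. This is only an illustration of the abstraction, not ByteDance's training system: the `Task` class, the event names, and the smoothed-frequency estimator are all assumptions made for the example, and a production system would train a model on rich features rather than count events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One prediction task: given that `trigger` happened, did `target` follow?"""
    name: str
    trigger: str  # event A (hypothetical event name)
    target: str   # event B (hypothetical event name)


def build_examples(sessions, task):
    """Turn raw sessions (lists of event names) into 0/1 labels for one task."""
    labels = []
    for events in sessions:
        if task.trigger in events:
            first = events.index(task.trigger)
            # Label is 1 only if the target occurs after the trigger.
            labels.append(1 if task.target in events[first + 1:] else 0)
    return labels


def estimate_probability(labels, prior=0.5, strength=2.0):
    """Smoothed frequency estimate of P(B | A); the prior avoids 0/0 on empty data."""
    return (sum(labels) + prior * strength) / (len(labels) + strength)


if __name__ == "__main__":
    sessions = [
        ["view", "like"],
        ["view"],
        ["view", "share"],
        ["read", "comment"],
    ]
    # Two different businesses, one shared code path.
    for task in (Task("video_like", "view", "like"),
                 Task("article_comment", "read", "comment")):
        labels = build_examples(sessions, task)
        print(task.name, round(estimate_probability(labels), 3))
```

The point of the sketch is that both tasks go through the same `build_examples` and `estimate_probability` functions; only the task definition changes.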
For a data-driven company like ByteDance, one of the most direct benefits of running different business systems on a unified platform is fewer "variables". Because every business shares the same underlying engineering system, it is easier to tell which factors actually improved a business, so proven knowledge can be reused quickly across businesses and new ideas can be turned directly into productivity. This reduces engineering investment, makes individual engineers and researchers more self-sufficient, and improves the efficiency of innovation.
"This is also why we are opening this AI infrastructure to external companies through Volcano Engine," Xiang Liang said. "The whole point of a B2B service is to help customers focus on their own business, on what they are good at."
According to Xiang Liang, ByteDance did not invent the "unified" architecture, but in order to support its businesses better it keeps polishing this system, hoping to push performance, experience, and the use of people and resources as far as possible.
Take the "zero fragmentation" capability of the Volcano Engine machine learning platform as an example. Because GPUs are expensive, improving GPU utilization has always been a pressing need for customers. Drawing on ByteDance's huge pool of GPU resources, the system dynamically optimizes how it allocates capacity across the differing needs of many users once the pool is large enough. In most cases, Volcano Engine can guarantee that all users reach a 100% application rate without worrying about resource fragmentation. Because internal and external workloads share one larger resource pool, zero fragmentation can be guaranteed for external customers.
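To make "fragmentation" concrete, here is a toy scheduler sketch. It is not Volcano Engine's scheduler; the node sizes, job sizes, and the two placement policies are assumptions chosen only to show how free GPUs can be stranded across nodes so that a large job cannot be placed even though enough GPUs are free in total.

```python
def schedule(capacities, requests, policy="best_fit"):
    """Place each job on a single node and return (placements, free GPUs per node).

    best_fit: choose the node that leaves the fewest GPUs unused (packs tightly).
    spread:   choose the node that leaves the most GPUs unused (spreads load).
    A placement of None means no single node could host the job.
    """
    free = list(capacities)
    placements = []
    for request in requests:
        # (leftover GPUs after placement, node index) for every node that fits.
        fits = [(gpus - request, node)
                for node, gpus in enumerate(free) if gpus >= request]
        if not fits:
            placements.append(None)
            continue
        _, node = min(fits) if policy == "best_fit" else max(fits)
        free[node] -= request
        placements.append(node)
    return placements, free


if __name__ == "__main__":
    nodes = [8, 8]      # two hypothetical 8-GPU machines
    jobs = [4, 4, 8]    # two small jobs, then one whole-machine job
    for policy in ("spread", "best_fit"):
        placed, free = schedule(nodes, jobs, policy)
        print(policy, placed, free)
```

With the spreading policy, the two 4-GPU jobs land on different machines, leaving 4 free GPUs on each; 8 GPUs are idle, yet the 8-GPU job cannot run. Packing the small jobs onto one machine keeps a whole machine free for it. A larger shared pool gives a scheduler more room to find such tight packings.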
"Volcano Engine has been working hard to help customers reduce costs," Xiang Liang said. "We believe that only by starting from the interests of customers can we make the cake bigger and bigger."
A better experience for developers

At the FORCE Power Conference, Volcano Engine's newly released multi-cloud deployment solution for its machine learning and intelligent recommendation platforms put the emphasis on developer experience.
Many developers building machine learning businesses find that the GPUs they use for training are often underutilized. The traditional practice is to give R&D engineers many physical development machines equipped with GPUs, and those cards sit idle whenever no training job is running. The standalone online development machine module of the Volcano Engine machine learning platform improves efficiency while matching the experience of a physical development machine.
"After a developer shuts the machine down and restarts it, everything from before is preserved: the operations performed, the data downloaded, and the environment configured," Xiang Liang said. "And the computing power is released as soon as the machine is shut down."
The development machine module integrates well with containers, which makes it easy to switch between environments. The Volcano Engine machine learning platform also provides tools for monitoring and experiment tracking. To reproduce a setup, Volcano Engine can supply the development environment as an image. Once development is finished, the code can be saved in the cloud through Job-based training, training can be started on the machine learning platform with one click, and the results of different experiments can be compared.

Beyond helping customers achieve GPU "zero fragmentation", the Volcano Engine machine learning platform also works on computing, networking, and storage to give developers the smoothest and fastest experience it can.

In computing, Volcano Engine provides a range of operator optimizations that can double the speed of existing operators.
In communication, Volcano Engine has open sourced two communication libraries: BytePS handles parameter communication and parameter synchronization, while veGiantModel mainly accelerates multi-machine parallel training of very large models.
In storage, Volcano Engine provides two solutions, TOS object storage and the vePFS distributed file system, which address the complex file and environment handling challenges found in real work while meeting storage requirements for high performance and ease of use.
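"Parameter communication and synchronization" can be illustrated with a toy parameter-server loop. This sketch does not use or reproduce the BytePS API; the `ParameterServer` class, the `pull`/`push` method names, and the tiny linear-regression workload are assumptions made for the example. Each worker pulls the current parameters, computes a gradient on its own data shard, and pushes it back; the server averages the gradients and updates the shared parameters.

```python
def local_gradient(params, batch):
    """Gradient of mean squared error for y = w * x + b on one worker's shard."""
    w, b = params
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]


class ParameterServer:
    """Holds the shared parameters and applies averaged worker gradients."""

    def __init__(self, params, learning_rate):
        self.params = list(params)
        self.learning_rate = learning_rate

    def pull(self):
        """Workers fetch a copy of the current parameters."""
        return list(self.params)

    def push(self, worker_grads):
        """Average one gradient per worker and take a gradient-descent step."""
        count = len(worker_grads)
        for i in range(len(self.params)):
            mean_grad = sum(grad[i] for grad in worker_grads) / count
            self.params[i] -= self.learning_rate * mean_grad


def train(shards, steps=2000, learning_rate=0.05):
    """Synchronous data-parallel training: every step uses all workers' gradients."""
    server = ParameterServer([0.0, 0.0], learning_rate)
    for _ in range(steps):
        params = server.pull()
        grads = [local_gradient(params, shard) for shard in shards]
        server.push(grads)
    return server.pull()


if __name__ == "__main__":
    # Two workers, each holding half of the points on the line y = 2x + 1.
    shards = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
    w, b = train(shards)
    print(round(w, 4), round(b, 4))
```

In a real cluster the pull and push steps are network transfers, which is why the efficiency of the communication library matters so much for multi-machine training speed.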

Intelligent recommendation is also an important technical driver of the rapid growth of ByteDance's business. The intelligent recommendation platform launched by Volcano Engine draws fully on existing practice in real-time performance and scale, and can deliver second-level real-time updates and training of ultra-large-scale recommendation and advertising models.
Building an end-to-end recommendation system involves data processing, feature engineering, rule orchestration, verification of recommendation results, and more. On Volcano Engine, these steps no longer span multiple systems. A single platform is enough: feed in user behavior, read out the recommendation results, and a complete recommendation service can be built without worrying about the details. For customers in different industries, Volcano Engine offers customizable templates, so companies can tailor the tools to their own business.
The intelligent recommendation platform also provides more than ten model structures, and training can start as soon as the optimization goal is set. Custom models are developed in a low-code way. The platform ships with a variety of code examples and offers tools such as code comparison, effect comparison, and training logs, which help engineers get started faster.
Whether the model is preset or custom, the bottom layer of Volcano Engine runs on one self-developed training and inference solution. Features such as streaming training and real-time model parameter adjustment help ensure the performance and effectiveness of model training.
In terms of deployment methods, the machine learning platform and intelligent recommendation platform support four different deployment methods, including public cloud deployment, VPC deployment, private cloud and dedicated AZ deployment.
New drivers of growth in the cloud
"ByteDance has grown alongside the explosion of technologies such as deep learning. At the same time, our systems have been rooted in the cloud from the very beginning," Xiang Liang said.
ByteDance has made its own business fully cloud native. At the end of last year, Volcano Engine officially released its cloud computing products. Drawing on these capabilities, Volcano Engine offers enterprises a complete set of solutions for building cloud-native systems.
Volcano Engine has so far won over thousands of benchmark companies and institutions, serving customers in industries such as finance, energy, automobiles, and consumer electronics. Companies are building more and more new capabilities on top of Volcano Engine.
Building on the Volcano Engine machine learning platform, the autonomous driving company Qingzhou Zhihang has created an R&D tool chain called Qingzhou Matrix, which it uses throughout its own development system. With simulation at its core, Qingzhou Matrix connects the whole process from data processing, labeling, and training to large-scale simulation and technical output. It provides safe storage and efficient transfer of vehicle data and supports automatic labeling, quality inspection, training, and evaluation for a variety of models, allowing the autonomous driving AI brain to learn on its own from massive amounts of data.
Here, Volcano Engine directly connects 10,000 GPUs over an RDMA network and combines this with its self-developed BytePS distributed training framework and high-performance operator library, so that multi-machine acceleration efficiency exceeds 90% for mainstream models and GPU utilization for autonomous driving model training rises by 30%. The seamless connection between the model life cycle management tools and Volcano Engine's self-developed storage, together with personalized service, greatly speeds up the training of autonomous driving models on Qingzhou Matrix.
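A common way to read a figure like "multi-machine acceleration efficiency exceeds 90%" is as scaling efficiency: measured cluster throughput divided by the ideal throughput of perfect linear scaling. The article does not state which definition was used, so the formula below is an assumption, and the throughput numbers are made up for illustration.

```python
def scaling_efficiency(single_throughput, cluster_throughput, workers):
    """Measured cluster throughput as a fraction of perfect linear scaling."""
    if single_throughput <= 0 or workers <= 0:
        raise ValueError("single_throughput and workers must be positive")
    return cluster_throughput / (single_throughput * workers)


if __name__ == "__main__":
    # Hypothetical numbers: one worker processes 1,000 samples/s,
    # and 8 workers together process 7,360 samples/s.
    efficiency = scaling_efficiency(1000, 7360, 8)
    print(f"{efficiency:.0%}")
```

Under this definition, 100% would mean that adding machines costs nothing in communication overhead; the gap below 100% is time lost to synchronizing parameters between machines.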
For recommendation systems, Volcano Engine takes advantage of the latest hardware architecture and NVIDIA's customized optimizations for the recommendation system pipeline. This helps enterprises quickly build, deploy, and scale state-of-the-art deep learning recommendation systems while significantly reducing costs and task latency.

Cloud-based IT infrastructure is going through another shift. Five years ago, 58% of enterprises opted for a multi-cloud architecture. By 2021, 80% of enterprises had chosen one, and 79% of those customers choose more than two public clouds. In the multi-cloud era, the vast majority of application workloads will be deployed on cloud-native infrastructure, and cloud native is becoming the digital "new infrastructure" of enterprises.
In an ever-changing world, the Volcano Engine is an engine that will help companies keep moving forward.
