Meta, formerly known as Facebook, is launching one of the world’s most powerful supercomputers, the AI Research SuperCluster (RSC). The company says it will be the fastest AI supercomputer in the world when it’s “fully built out in mid-2022” at an undisclosed location. The social media giant changed its name to Meta in October 2021.
Meta is announcing the AI Research SuperCluster (RSC), our latest AI supercomputer for AI research. RSC will allow our researchers to do new, groundbreaking experiments in #AI. Learn more about RSC and the important role it will play: https://t.co/l9CcQuFLyM pic.twitter.com/gD8Ve74ZqQ
— Meta AI (@MetaAI) January 24, 2022
Meta cites the need to “identify harmful content” on internet platforms as one of the “critical” reasons for developing the supercomputer. The Meta AI blog states:
“To fully realize the benefits of self-supervised learning and transformer-based models, various domains, whether vision, speech, language, or for critical use cases like identifying harmful content, will require training increasingly large, complex, and adaptable models.”
Meta’s supercomputer will be capable of processing “quintillions of operations per second.” The blog explains that processing power on this scale is needed to help Meta’s AI researchers:
“Build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more. Our researchers will be able to train the largest models needed to develop advanced AI for computer vision, NLP, speech recognition, and more.”
The supercomputer will be the backbone of Meta’s metaverse platform and its AI-driven features and products. Speech recognition, the deep learning architecture known as the transformer, and self-supervised learning, a way of building the kind of “background learning” that comes so naturally in human development, are just a few of the areas that require such a powerful computing system. High processing speed is essential to advancing those AI techniques and products.
In 2013, Meta (then Facebook) first began to “advance the state of the art of AI” by creating the Facebook AI Research (FAIR) group. The FAIR team presented some of its research at the International Conference on Learning Representations (ICLR) in April 2020.
While Meta developed the first iteration of its supercomputer infrastructure in 2017 with its “22,000 NVIDIA V100 Tensor Core GPUs,” it wasn’t until 2020 that new graphics processing units (GPUs) and the necessary network fabric technology were available. Per cisco.com, network fabric is:
“[The] mesh of connections between network devices such as access points, switches, and routers that transports data to its destination. ‘Fabric’ can mean the physical wirings that make up these connections, but usually it refers to a virtualized, automated lattice of overlay connections on top of the physical topology.”
The network fabric connects the GPUs.
Advances in computing technology have cut the training time required for large AI models from nine weeks to three. The Meta blog explains that RSC today comprises:
“A total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs—with each A100 GPU being more powerful than the V100 used in our previous system. Early benchmarks on RSC, compared with Meta’s legacy production and research infrastructure, have shown that it runs computer vision workflows up to 20 times faster, runs the NVIDIA Collective Communication Library (NCCL) more than nine times faster, and trains large-scale NLP models three times faster.”
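The node and GPU counts in the quote above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the two totals come from the Meta blog quote, while the function name and the even-division check are assumptions made for this example.

```python
# Figures quoted in the Meta blog post.
DGX_SYSTEMS = 760
TOTAL_GPUS = 6_080


def gpus_per_system(total_gpus: int, systems: int) -> int:
    """Return GPUs per compute node, raising if the totals do not divide evenly."""
    per_system, remainder = divmod(total_gpus, systems)
    if remainder:
        raise ValueError("GPU total is not a whole multiple of the node count")
    return per_system


print(gpus_per_system(TOTAL_GPUS, DGX_SYSTEMS))  # prints 8
```

The result, eight GPUs per system, matches the eight-GPU configuration of an NVIDIA DGX A100.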
Nvidia announced its partnership with Meta on Monday and confirmed the challenges Meta faced in building the system during the pandemic because of “industry-wide wafer supply constraints.” A short promotional video showing the A100 can be viewed here.
The diagram below shows Phase 1 of RSC’s storage tier with “175 petabytes of Pure Storage FlashArray, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.” The A100 GPUs now in use are much more powerful than the V100s in Meta’s previous system.
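As a rough illustration, the three Phase 1 storage figures quoted above can be added up and compared. The capacities are taken from the quote; the dictionary name and the percentage breakdown are assumptions introduced for this sketch, not something Meta reports.

```python
# Phase 1 storage capacities in petabytes, as quoted by Meta.
PHASE1_STORAGE_PB = {
    "Pure Storage FlashArray": 175,
    "Penguin Computing Altus cache": 46,
    "Pure Storage FlashBlade": 10,
}

total_pb = sum(PHASE1_STORAGE_PB.values())
print(total_pb)  # prints 231

# Share of the combined capacity held by each tier.
for name, capacity_pb in PHASE1_STORAGE_PB.items():
    print(f"{name}: {capacity_pb / total_pb:.1%}")
```

Summing the tiers this way treats bulk and cache storage alike, which is a simplification: the cache tier may hold copies of data that also lives in the bulk tier.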
A single petabyte is 1,000 terabytes, or one quadrillion bytes, an enormous amount of data. The graphic from cobaltiron.com below demonstrates its capacity.
Raw performance and the ability to scale that performance both matter for AI systems. Wei Li, Ph.D. from ScaledML, explains why in this thirty-minute video.
When the RSC project is complete, the InfiniBand network fabric will connect 16,000 GPUs as endpoints, “making it one of the largest such networks deployed to date.” Meta has also employed a caching and storage system that can serve “16 TB/s of training data,” and the company plans to “scale it up to 1 exabyte.” The graphics below show examples of the storage capacity of one terabyte and one exabyte, respectively. Five exabytes represent “all words ever spoken by human beings,” according to highscalability.com.
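To put the two storage figures in proportion, the sketch below estimates how long a single pass over one exabyte would take at 16 TB/s. This is a back-of-the-envelope calculation, not a description of how Meta uses the system: it assumes decimal (SI) units and a sustained, uninterrupted transfer rate, and the function name is hypothetical.

```python
# Figures quoted by Meta; the decimal (SI) unit definition is an assumption.
TB_PER_EB = 1_000_000       # 1 exabyte = 10^6 terabytes in SI units
THROUGHPUT_TB_PER_S = 16    # "16 TB/s of training data"
TARGET_CAPACITY_EB = 1      # "scale it up to 1 exabyte"


def seconds_to_stream(capacity_eb: float, tb_per_s: float) -> float:
    """Seconds needed to read `capacity_eb` exabytes once at `tb_per_s`."""
    return capacity_eb * TB_PER_EB / tb_per_s


seconds = seconds_to_stream(TARGET_CAPACITY_EB, THROUGHPUT_TB_PER_S)
print(f"{seconds:,.0f} s (~{seconds / 3600:.1f} hours)")  # 62,500 s (~17.4 hours)
```

Under those assumptions, reading a full exabyte once would take a little over 17 hours.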
Phase 2 of the RSC buildout in 2022 will increase capacity from 6,080 to 16,000 GPUs, which will “increase AI training performance by more than 2.5x,” says Meta. Reliability and security are top priorities for the network. With the InfiniBand network fabric, data packets are rarely lost.
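The planned growth in GPU count can be set against the quoted performance figure with a short calculation. The GPU totals come from Meta; the eight-GPUs-per-system value is implied by the 6,080 GPUs across 760 systems reported for Phase 1, and applying it to Phase 2 is an assumption of this sketch.

```python
PHASE1_GPUS = 6_080
PHASE2_GPUS = 16_000
GPUS_PER_DGX_A100 = 8  # implied by 6,080 GPUs across 760 systems

gpu_ratio = PHASE2_GPUS / PHASE1_GPUS

# Hypothetical: assumes Phase 2 uses the same eight-GPU node type as Phase 1.
phase2_systems = PHASE2_GPUS // GPUS_PER_DGX_A100

print(f"GPU count grows {gpu_ratio:.2f}x")  # GPU count grows 2.63x
print(phase2_systems)  # prints 2000
```

A roughly 2.63x increase in GPUs is consistent with Meta’s claim of a performance gain of “more than 2.5x,” although training performance does not necessarily scale linearly with GPU count.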