Is OpenAI using Kubernetes?

Recently, I stumbled upon an article from two years ago titled "Scaling Kubernetes to 7,500 Nodes". The original text is here: https://openai.com/research/scaling-kubernetes-to-7500-nodes#unsolvedproblems . Although it is a bit dated, fortunately the content has not aged much. At the very least, the OpenAI team's posts record the growth of their Kubernetes cluster and share the lessons learned along the way, which is well worth studying. We have also recently run into some scheduling issues with KubeGems PAI (our machine learning platform), so I re-read the article and took some simple notes to share with everyone.

Resource scheduling

Explanation: Because the GPUs on each node use NVLink and GPUDirect NICs, only one Pod is scheduled per node, monopolizing all of its resources to get the most out of the hardware. To achieve this effect, a NodeSelector plus a DaemonSet would easily be enough, and would also minimize the scheduling pressure on Kubernetes (as mentioned later, OpenAI does not actually use the DaemonSet approach, which cannot express more advanced scheduling policies; it is only an example here). In an exclusive-node scenario there is indeed no need for bin-packing (filling nodes with as many Pods as possible) or fragmentation-aware scheduling, because the minimum unit of resource in the cluster is the node rather than the Pod; there is likewise no need to consider CPU NUMA topology, and no contention between Pods for node resources.
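
To make the exclusive-node idea concrete, here is a minimal sketch (not OpenAI's actual manifest) of a Pod that requests every GPU on a node, submitted with the official Python client; the node label, image, and the GPU count of 8 are assumptions for illustration.

```python
# One Pod that requests all 8 GPUs of a node, so only one such Pod fits per
# node and it effectively owns the whole machine.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-worker-0"},
    "spec": {
        "nodeSelector": {"node.kubernetes.io/instance-type": "gpu-8x"},  # hypothetical label
        "containers": [{
            "name": "trainer",
            "image": "example.com/trainer:latest",          # placeholder image
            "resources": {"limits": {"nvidia.com/gpu": "8"}},  # claim every GPU on the node
        }],
    },
}

# Submit it with the official Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()                                   # or load_incluster_config()
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```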

New knowledge: Full bisection bandwidth means that any half of the nodes in a cluster can communicate with the other half at full bandwidth, without being throttled. For example, suppose a system has 16 nodes, each with a 10 Gb/s network connection. If the fabric is designed well, then any 8 nodes should be able to communicate at 10 Gb/s with the other 8 nodes simultaneously. The main advantage of full bisection bandwidth is that it greatly improves the parallel processing capability of the system, because it allows every node to make full use of its network bandwidth.
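
A quick back-of-the-envelope check of the example above:

```python
# With 16 nodes at 10 Gb/s each, a full-bisection-bandwidth fabric lets any
# 8 nodes talk to the other 8 at full line rate at the same time.
nodes = 16
link_gbps = 10
bisection_gbps = (nodes // 2) * link_gbps
print(f"Full bisection bandwidth: {bisection_gbps} Gb/s")   # -> 80 Gb/s
```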

Explanation: A taint named after the team, e.g. openai.com/team=teamname:NoSchedule, is applied to the nodes, so that a team must add the matching toleration to its Pods in order to use those resources, which keeps the teams coordinated. They also developed their own webhook controller that injects the tolerations, and lets low-priority Pods ignore the taints so they can be scheduled onto idle capacity. A similar effect can also be achieved with Volcano's resource scheduling, via its Queue and reclaim mechanisms.
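
Below is a minimal sketch of the per-team taint and toleration, using the taint key from the text; the node name, team name, and the use of the Python client are assumptions for illustration, and in the setup described above the toleration would be injected by the webhook rather than written by hand.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# 1. Taint a node for a team (roughly equivalent to:
#    kubectl taint nodes gpu-node-01 openai.com/team=research:NoSchedule)
taint = {"key": "openai.com/team", "value": "research", "effect": "NoSchedule"}
v1.patch_node("gpu-node-01", {"spec": {"taints": [taint]}})  # node name is hypothetical

# 2. A Pod from that team must carry the matching toleration to land there.
toleration = {
    "key": "openai.com/team",
    "operator": "Equal",
    "value": "research",
    "effect": "NoSchedule",
}
# In the described setup a mutating webhook adds this toleration automatically,
# and low-priority Pods are allowed to ignore the taints entirely.
```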

Explanation: Gang scheduling is very important when processing MPI jobs because of their synchronous communication pattern. MPI is a programming model for parallel computing that lets processes cooperate on a common task by passing messages. A very common operation in MPI is collective communication, in which all processes must participate at the same time. If any process lags or becomes unavailable, all the other processes block and wait for it, so MPI jobs depend heavily on all participating processes running in lockstep. OpenAI implements gang scheduling by embedding a Kubernetes scheduler plugin. The plugin is called Coscheduling and has been merged into the scheduler-plugins mainline: https://github.com/kubernetes/enhancements/pull/1463 . As mentioned earlier, gang scheduling can also be obtained by extending Kubernetes' scheduler with Volcano.
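
For reference, here is a rough sketch of what gang scheduling with the Coscheduling plugin looks like (from kubernetes-sigs/scheduler-plugins, not Volcano): a PodGroup declares the minimum number of members, and the worker Pods opt in via a label. The exact label key, API group, and scheduler name can differ between plugin versions, so treat this as illustrative rather than OpenAI's configuration.

```python
# "Only start this job when all 4 MPI workers can be placed at once."
pod_group = {
    "apiVersion": "scheduling.x-k8s.io/v1alpha1",
    "kind": "PodGroup",
    "metadata": {"name": "mpi-job-1"},
    "spec": {"minMember": 4},           # all 4 workers must be schedulable together
}

worker_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "mpi-job-1-worker-0",
        "labels": {"scheduling.x-k8s.io/pod-group": "mpi-job-1"},  # join the gang
    },
    "spec": {
        "schedulerName": "scheduler-plugins-scheduler",  # scheduler with Coscheduling enabled
        "containers": [{"name": "worker", "image": "example.com/mpi-worker:latest"}],
    },
}
```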

Parallel job processing

Explanation: All worker nodes involved in running an MPI job must take regular checkpoints. This is a form of fault tolerance: when a job errors out or the system crashes, the job's state can be restored from the checkpoint, avoiding a complete restart after a failure.

New concept: Semi-stateful Pod. The runtime carrier of a parallel task is a Pod, and its state data is mainly the checkpoints produced while the task runs; obviously this data needs to be persisted to a PVC. It is called "semi-stateful" because even if the container dies, the worst case is that the whole task pauses and rolls back to the previous checkpoint before restarting, rather than causing the kind of irreversible disaster a truly stateful application would suffer.
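
A minimal sketch of the semi-stateful pattern, assuming a hypothetical PVC mount path and checkpoint format: the worker always resumes from the newest checkpoint it can find, so losing the Pod only costs the steps since the last save.

```python
import glob
import os
import pickle

CKPT_DIR = "/mnt/checkpoints"           # PVC mount path (hypothetical)

def save_checkpoint(step, model_state):
    path = os.path.join(CKPT_DIR, f"ckpt-{step:08d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "model": model_state}, f)

def load_latest_checkpoint():
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "ckpt-*.pkl")))
    if not ckpts:
        return 0, None                   # no checkpoint yet: start from scratch
    with open(ckpts[-1], "rb") as f:
        state = pickle.load(f)
    return state["step"], state["model"]

def train_one_step(model_state):
    return (model_state or 0) + 1        # stand-in for the real training step

step, model_state = load_latest_checkpoint()
while step < 100_000:
    model_state = train_one_step(model_state)
    step += 1
    if step % 1000 == 0:                 # checkpoint every 1000 steps
        save_checkpoint(step, model_state)
```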

Network

Explanation: For networking, the cluster barely relies on Kubernetes' service networking: kube-proxy and Ingress are not in the data path. Pods on a node talk to other Pods directly, using host networking / Pod IPs, because the MPI workers need direct network connectivity to one another rather than going through service endpoints.
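
For illustration, this is roughly what host networking looks like on a Pod spec (the name and image are placeholders, not OpenAI's manifest):

```python
# With hostNetwork the container shares the node's network namespace, so the
# MPI worker is reachable at the node's address without any service layer.
mpi_worker = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "mpi-worker-0"},
    "spec": {
        "hostNetwork": True,                        # use the node's network namespace directly
        "dnsPolicy": "ClusterFirstWithHostNet",     # keep cluster DNS usable with hostNetwork
        "containers": [{"name": "worker", "image": "example.com/mpi-worker:latest"}],
    },
}
```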

Explanation: As the cluster grew toward 7,500 nodes, the Flannel overlay network could no longer keep up, so they switched to the native IP configuration of Azure VMSS for Pod networking, via the corresponding CNI plugins.

New knowledge: VMSS (Virtual Machine Scale Sets) is an Azure service for managing a group of identical, automatically scaling virtual machines, including their network configuration.

There is limited information here; what seems to be meant is that OpenAI uses the IP configurations managed by Azure VMSS to assign addresses directly to Kubernetes Pods through the CNI plugin.

Explanation: Pod traffic leaving the Pod network is NATed, and iptables rules are used to mark and track each Pod's network traffic.

Storage

Explanation: OpenAI uses Azure Blob storage as its main storage; checkpoints are written to blob storage, which scales easily and does not require POSIX file-system semantics.
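
As a sketch of the idea of writing checkpoints to an object store instead of a POSIX file system, here is an upload using the azure-storage-blob SDK; the connection string, container, and paths are placeholders, not OpenAI's setup.

```python
from azure.storage.blob import BlobServiceClient

# Connect to the storage account (placeholder connection string).
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="checkpoints", blob="run-42/ckpt-00010000.bin")

# Upload a locally written checkpoint file; the blob store scales without
# needing POSIX file-system semantics.
with open("/tmp/ckpt-00010000.bin", "rb") as f:
    blob.upload_blob(f, overwrite=True)
```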

API servers

Explanation: The main optimization here is to move Kubernetes Events out into a separate etcd cluster, to reduce the latency caused by the IO of recording a large number of events.
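
The usual way to do this split on the kube-apiserver is the --etcd-servers-overrides flag, which routes a single resource (here, events) to its own etcd cluster; the endpoints below are placeholders and mirror the idea rather than OpenAI's exact configuration.

```python
# kube-apiserver flags, expressed as a Python list for illustration:
# events go to a dedicated etcd, everything else stays on the main cluster.
apiserver_flags = [
    "--etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379",
    "--etcd-servers-overrides=/events#https://etcd-events-0:2379",
]
```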

Explanation: When running a very large number of nodes, the flood of List-Watch requests becomes quite obvious; many small streams add up to a river. When all of those requests converge on the API server, the bandwidth they consume can reach 1 GB/s! Fortunately, we were on a Kubernetes version with EndpointSlices, which reduced the load on the API server by a factor of 1000.

New knowledge: EndpointSlices is a feature introduced in Kubernetes 1.16. It spreads Endpoint information across multiple smaller objects, each containing only a portion of the endpoints. This way, adding, deleting, or modifying an endpoint only requires updating one small EndpointSlice object instead of the entire Endpoints object, which greatly improves Kubernetes' performance and scalability in large clusters.
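
A small illustration with the Python client: one Service's endpoints are spread over several EndpointSlice objects (each holding up to about 100 endpoints by default), labelled with the owning Service's name. The namespace and service name below are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()
discovery = client.DiscoveryV1Api()

# List the EndpointSlices backing a single (hypothetical) Service.
slices = discovery.list_namespaced_endpoint_slice(
    namespace="default",
    label_selector="kubernetes.io/service-name=my-service",
)
for s in slices.items:
    print(s.metadata.name, len(s.endpoints or []), "endpoints")
```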

Monitoring

Explanation: We also accumulated a large number of useless metrics, which is really "annoying" since most of them are never looked at. Our Prometheus also OOMed frequently; it later turned out this was caused by a pile-up of heavy histogram queries. So we set an execution timeout for queries on the backend, and Prometheus' memory never blew up again.
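
A small sketch of capping query execution time when a backend talks to Prometheus: the HTTP query API accepts a per-query timeout parameter (the server also has a global --query.timeout flag). The URL and query below are placeholders.

```python
import requests

# Ask Prometheus to abort the query if it runs longer than 30s, and also
# bound the HTTP call itself on the client side.
resp = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": "sum(rate(http_requests_total[5m]))", "timeout": "30s"},
    timeout=35,
)
resp.raise_for_status()
print(resp.json()["status"])
```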

In addition, after a Prometheus restart, replaying the WAL file was too slow for us to tolerate. Later, with the help of Robust Perception, we learned to increase the GOMAXPROCS setting so that more threads run the replay in parallel, which speeds it up.
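
A sketch of that GOMAXPROCS tweak, expressed as the env section of a Prometheus container spec; the value is illustrative, not necessarily what OpenAI used.

```python
# Raise GOMAXPROCS so the Go runtime can use more CPU cores during startup,
# when the WAL is being replayed.
prometheus_container = {
    "name": "prometheus",
    "image": "prom/prometheus:v2.45.0",
    "env": [
        {"name": "GOMAXPROCS", "value": "24"},   # illustrative value
    ],
}
```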

Huh? So even OpenAI's engineers didn't know that at first!

Summary

