Taking Control: Self-Hosting GitHub Runners in Action
Take control of your infrastructure and revolutionize your CI process with self-hosted runners in GitHub Actions. Boost productivity like never before.
Continuous Integration (CI) is a software development practice in which code changes from multiple developers are regularly merged into a shared repository. It ensures that each change is tested and integrated early, promoting early bug detection and collaboration. GitHub Actions is a CI/CD (Continuous Integration/Continuous Deployment) platform provided by GitHub that automates the CI process, allowing developers to define workflows as code and automate tasks such as building, testing, and deploying applications directly from their GitHub repositories. These workflows run on machines known as runners.
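To make this concrete, here is a minimal example of a workflow defined as code. It is only a sketch: the job and step names are placeholders, and the file would live under `.github/workflows/` in the repository.

```yaml
# .github/workflows/ci.yml - a minimal CI workflow (names are placeholders)
name: CI
on: [push, pull_request]          # trigger on every push and pull request

jobs:
  test:
    runs-on: ubuntu-latest        # a GitHub-hosted runner; self-hosted labels can go here instead
    steps:
      - uses: actions/checkout@v4 # check out the repository
      - name: Run tests
        run: echo "your build and test commands go here"
```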
GitHub Actions offers a convenient solution by providing GitHub-hosted runners, which are well suited for small-to-medium-sized projects. These runners come preconfigured with the necessary tools and packages, making it simple to get your CI process up and running.
As projects become more complex, certain tasks can become burdensome for your engineering team. For example, running a large test suite might consume a significant amount of time. In such cases, it is crucial to explore ways to enhance your CI process. GitHub offers larger GitHub-hosted runners, which provide more computational power to tackle these challenges effectively (or, you can use Redefine.dev to optimize your CI).
Moreover, GitHub Actions provides you with the exhilarating opportunity to self-host your runners, granting you the autonomy to manage your own infrastructure and supercharge your CI process.
This article dives into the exciting world of self-hosted runners in GitHub Actions, empowering you with the freedom to take full control of your infrastructure and revolutionize your CI process.
Freedom and Control: The Choice to Self-Host GitHub Runners
Initially, our preference was to minimize the utilization of self-hosted runners and rely on GitHub-hosted runners. However, as our workflows grew in complexity, we began utilizing GitHub-hosted large runners to address the increased demands, including training AI models and executing resource-intensive computing tasks.
While this approach sufficed for a period, we soon encountered challenges in terms of cost as our GitHub expenses escalated substantially due to the growing number of resource-intensive workflows. This realization prompted us to explore alternative solutions, leading us to venture into the realm of hosting our own runners.
Taking Control: Self-Hosting GitHub Runners in Action
In our quest for a suitable self-hosted runner solution, we came across several projects that offer convenient setups on popular cloud providers. Notable examples include ec2-github-runner and terraform-aws-github-runner, which provide self-hosted runners using resources like EC2 instances or Lambda functions. Additionally, we explored actions-runner-controller, a versatile solution that enables the creation of a Runner resource on your Kubernetes cluster, allowing for greater flexibility.
After careful evaluation, we opted to proceed with the actions-runner-controller solution. Despite being in the early stages of development, the project boasts comprehensive documentation and aligns well with our existing experience with Kubernetes. This choice provided us with a straightforward approach to creating and managing self-hosted runners within our Kubernetes cluster.
Actions-runner-controller offers a user-friendly quickstart guide that allowed us to swiftly get started. We established a dedicated EKS cluster equipped with auto-scaling nodes, completing the setup in just two days. This effort resulted in the successful implementation of fully functional self-hosted GitHub runners.
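To give a feel for the setup, here is a minimal sketch of a RunnerDeployment resource, loosely following the actions-runner-controller quickstart. The names and repository are placeholders, and the exact API surface may differ between controller versions, so treat this as an illustration rather than a drop-in manifest.

```yaml
# Minimal RunnerDeployment sketch (assumes actions-runner-controller and its
# GitHub credentials are already installed in the cluster; names are placeholders).
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  replicas: 2
  template:
    spec:
      repository: my-org/my-repo   # repository (or organization) these runners serve
      labels:
        - self-hosted-k8s          # custom label that workflows can target
```

A workflow then targets these runners with something like `runs-on: [self-hosted, self-hosted-k8s]` instead of a GitHub-hosted label.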
However, our journey was not without its share of challenges and pitfalls. In the following sections, we will discuss some of the major obstacles we encountered along the way and detail the strategies we employed to overcome them.
Cache it, don't trash it
Implementing caching in CI can significantly improve runtime, but it requires careful configuration to avoid issues. For instance, certain tests involving Docker containers can be time-consuming due to image pulls. Fortunately, actions-runner-controller offers a helpful example for Docker layer caching. All in all, introducing caching successfully into your CI requires careful consideration, fulfilling specific requirements, and making informed decisions. Here are the key steps we took:
- We set up a CSI (Container Storage Interface) driver in the Kubernetes cluster, such as the EBS CSI driver.
- We opted for the DinD (Docker-in-Docker) runner, which combines the Docker daemon and the runner in a single pod for better performance and reduced IO overhead (see the sketch after this list).
- The mechanism is not perfect, and cache volumes tend to accumulate, which can impact resource usage. We tackled this by implementing a daily cleanup job that removes volumes left unused for two weeks.
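Below is a rough sketch of how these pieces fit together: a StorageClass backed by the EBS CSI driver and a DinD runner that mounts a cache volume at `/var/lib/docker`. It is modeled on the actions-runner-controller Docker layer caching example, but field names and the recommended pattern may vary between controller versions, so check the current documentation before copying it.

```yaml
# StorageClass provided by the EBS CSI driver (sketch; names are placeholders).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-cache
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
---
# DinD runner with a Docker layer cache volume.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: dind-runner
spec:
  template:
    spec:
      repository: my-org/my-repo          # placeholder
      dockerdWithinRunnerContainer: true  # DinD: dockerd runs inside the runner container
      volumeMounts:
        - name: docker-cache
          mountPath: /var/lib/docker      # keep pulled images/layers on the cache volume
      volumes:
        - name: docker-cache
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: ebs-cache
                resources:
                  requests:
                    storage: 20Gi
```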
By following these steps, we effectively harnessed caching in our CI workflows, improving efficiency while mitigating potential issues.
Bytes, Bytes Everywhere! Taming NAT Gateway Costs
We encountered skyrocketing costs associated with our NAT gateway due to high data transfer volumes. Although our runners operated within a private subnet without direct internet access, they still required connections to external services. To tackle this issue, we implemented the following measures to reduce bandwidth usage and mitigate costs:
- Analyzing VPC flow logs: We carefully examined the traffic endpoints within our VPC flow logs to identify the main sources of data transfer.
- Leveraging ECR for Docker pulls: To optimize data transfer, we shifted our CI processes to pull Docker images from the Elastic Container Registry (ECR) rather than external sources. We also set up a pull-through cache where necessary (see the sketch after this list).
- Utilizing AWS CodeArtifact for package installations: By leveraging AWS CodeArtifact, specifically for package installations (such as pip in our case), we reduced the need for external data transfers.
- Adding VPC endpoints: We established VPC endpoints to create logical connections directly from our private subnet to the required external services, eliminating the need for data transfer through the NAT gateway.
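For illustration, here is a CloudFormation-style sketch of two of these measures: an ECR pull-through cache rule and an interface VPC endpoint for the ECR API. The resource names, region, and VPC/subnet/security-group IDs are placeholders, and a complete setup also needs the `ecr.dkr` endpoint plus an S3 gateway endpoint for image pulls.

```yaml
# CloudFormation sketch (placeholders throughout; not a complete template).
Resources:
  PublicEcrPullThroughCache:
    Type: AWS::ECR::PullThroughCacheRule
    Properties:
      EcrRepositoryPrefix: ecr-public      # cached images appear under this prefix in your registry
      UpstreamRegistryUrl: public.ecr.aws  # cache images from ECR Public instead of pulling over the NAT gateway

  EcrApiEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: com.amazonaws.eu-west-1.ecr.api   # adjust to your region
      VpcId: vpc-0123456789abcdef0
      SubnetIds: [subnet-0123456789abcdef0]
      SecurityGroupIds: [sg-0123456789abcdef0]
      PrivateDnsEnabled: true
```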
Implementing these measures resulted in a remarkable 75% reduction in data transfer through the NAT gateway, bringing our NAT gateway costs down to a sustainable level.
From Zero to Hero: CI Autoscaling Unleashed!
Implementing autoscaling was a fundamental aspect of our CI platform from the beginning. We recognized the importance of dynamically adjusting the number of runners based on workload demands. The actions-runner-controller project provides a comprehensive guide on autoscaling runners, and it is worth reviewing its documentation to understand the pros and cons of each strategy.
To cater to workflows that require significant resources, we implemented pod-level scaling to and from zero. This was achieved by combining two scaling metrics, as detailed in the actions-runner-controller documentation.
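The resulting autoscaler looks roughly like the sketch below: a primary PercentageRunnersBusy metric combined with TotalNumberOfQueuedAndInProgressWorkflowRuns as a fallback that can scale up from zero, when there are no runners to compute a percentage from. Names, thresholds, and the repository are placeholders in the style of the actions-runner-controller documentation.

```yaml
# HorizontalRunnerAutoscaler sketch: scale a RunnerDeployment to/from zero
# by combining two metrics (names and thresholds are placeholders).
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: example-runnerdeploy
  minReplicas: 0                 # allow scaling all the way down to zero runners
  maxReplicas: 10
  metrics:
    - type: PercentageRunnersBusy                        # primary: fraction of runners that are busy
      scaleUpThreshold: '0.75'
      scaleDownThreshold: '0.25'
      scaleUpFactor: '2'
      scaleDownFactor: '0.5'
    - type: TotalNumberOfQueuedAndInProgressWorkflowRuns # fallback: enables scale-up from zero
      repositoryNames:
        - my-org/my-repo
```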
In addition to pod scaling, we also configured node scaling. We utilized the cluster autoscaler and made additional configurations to the Auto Scaling group, enabling EC2 instances (nodes) to scale automatically from zero based on workload requirements.
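On the node side, a sketch of what this can look like with eksctl is shown below: a node group that is allowed to scale to zero, with the tags the cluster autoscaler uses for auto-discovery and for scaling from zero. The cluster name, region, instance type, and labels are placeholders, and the exact fields depend on your eksctl and cluster autoscaler versions.

```yaml
# eksctl ClusterConfig sketch (placeholders throughout).
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ci-runners
  region: eu-west-1
nodeGroups:
  - name: runners-xlarge
    instanceType: m5.2xlarge
    minSize: 0                   # the autoscaler may scale this group down to zero nodes
    maxSize: 10
    desiredCapacity: 0
    labels:
      workload: ci-runner
    tags:
      # auto-discovery tags for the cluster autoscaler
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/ci-runners: "owned"
      # node-template tag so the autoscaler knows the labels of not-yet-created nodes
      k8s.io/cluster-autoscaler/node-template/label/workload: ci-runner
```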
By implementing both pod and node autoscaling, we ensured our CI infrastructure dynamically adjusted its capacity to handle varying workloads efficiently.
Noisy Neighbors: Managing Trust Challenges
When running multiple runner pods on the same node, resource sharing can lead to contention and reliability issues among the pods. A spike in one pod's resource usage may cause other pods on the same node to fail. Identifying and resolving these problems can be difficult, as they may surface as cancelled pods or flaky tests.
The "noisy neighbor" problem, as we refer to it, remains a significant challenge for us. While we haven't completely mitigated it yet, we have learned some valuable tips from our experience:
- Set resource limits for your pods: It is essential to define resource limits for each pod. This best practice applies well beyond CI, but it is particularly critical in this context. Without resource limits, the impact of the noisy neighbor problem becomes far more severe (see the sketch after this list).
- Avoid overloading nodes: We aim to keep a balanced workload on each node by limiting the number of runner pods to 2-3 per node. Additionally, we reserve around 5%-10% of the resources on each node to accommodate spikes in our workflows.
- Implement monitoring: Monitoring is crucial for identifying and addressing issues promptly. We utilize Prometheus and Grafana to monitor our pods closely, enabling us to detect any anomalies and take necessary actions.
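As a minimal example of the first tip, here is a sketch of a RunnerDeployment with explicit requests and limits; the values are placeholders and should be sized to your own workflows, and the field layout assumes the runner spec exposes standard Kubernetes resource settings.

```yaml
# RunnerDeployment with explicit resource requests/limits (values are placeholders).
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: limited-runner
spec:
  template:
    spec:
      repository: my-org/my-repo   # placeholder
      resources:
        requests:                  # what the scheduler reserves for each runner pod
          cpu: "2"
          memory: 4Gi
        limits:                    # hard cap so one runner cannot starve its neighbors
          cpu: "4"
          memory: 8Gi
```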
While we continue to fine-tune our approach to mitigate the noisy neighbor problem, following these tips has helped us manage and minimize its impact on our CI environment.
Empowering CI: Harnessing the Benefits of Self-Hosted GitHub Runners
Although hosting your own GitHub runners may initially seem like a challenging endeavor that demands attention and some DevOps skills, the advantages it offers make the journey worthwhile. We have experienced and continue to enjoy these benefits in our daily work.
Customized Images, Powerful Runners
By utilizing self-hosted runners, you gain the flexibility to create customized runner images tailored to your specific needs. These images can include project-specific dependencies, libraries, and components for testing and building purposes, such as Python, Node.js, and other related tools. Leveraging our custom images, we cut the setup time of our tests by approximately 75%, from 4 minutes to just 1 minute. Skipping these installations significantly improved our workflow efficiency and allowed us to focus more on running tests and building our projects.
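For illustration, pointing the runners at such an image is a small change in the runner spec; the image reference below is a placeholder for an image built on top of the default runner image with your project's dependencies baked in.

```yaml
# Fragment of a RunnerDeployment using a custom runner image (placeholder reference).
spec:
  template:
    spec:
      image: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/ci/custom-runner:latest
```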
Taking Control
Self-hosting your own runners provides you with greater control over your CI environment. You gain the flexibility to customize resource allocation based on your specific requirements. Additionally, you can include platforms that are not officially supported by GitHub-hosted runners, such as the arm64 architecture (at the time this article was written). If you need additional resources for your pods, you have the freedom to allocate them as needed. Moreover, self-hosted runners allow easy access to the pods (for example via `kubectl exec`), enabling deeper investigation of any CI issues that arise. This level of control empowers you to tailor your CI setup and troubleshoot more effectively.
Beyond Testing: Unleashing the Power of Self-Hosted Runners
With the level of freedom provided by self-hosted runners, we went beyond using them solely for testing purposes. By utilizing larger machines, we were able to execute a wide range of compute-intensive workflows that demanded significantly more resources. This flexibility allowed us to handle tasks such as data processing, AI model training, and other resource-intensive operations within our CI environment. By leveraging the power of self-hosted runners, we effectively scaled our workflows and achieved efficient execution of compute-heavy tasks.
Money Matters: Self-Hosted Runners Win the Cost Game
Upon comparing the costs of large GitHub-hosted runners with those of our self-hosted runners, we observed a significant reduction in expenses. Overall, running our own hosted runners proved to be approximately 55% less expensive than relying solely on GitHub-hosted runners. This saving demonstrates the financial benefit of leveraging self-hosted runners and aligning resource allocation with our specific needs. By optimizing our CI infrastructure, we achieved substantial cost efficiency while maintaining the desired level of performance.
Self-Hosted Runners: Was It Worth the Journey?
Undoubtedly, when we embarked on hosting our own runners, we encountered unforeseen challenges and invested time in ensuring the reliability of our infrastructure. There were instances where we had to extensively investigate and troubleshoot peculiar issues, and it also introduced some complexity to our workflows. However, considering the significant cost reduction, increased flexibility, and the valuable knowledge we gained about the components influencing our CI and associated costs, we believe this journey was well worth it. We hope that this article serves as an inspiration for you to implement self-hosted runners more smoothly, avoiding the pitfalls we encountered along the way.