Migrating and Running Continuous Integration Systems at Scale in AWS
Peter Caron, HERE
AWS Loft, München | November 16, 2016
Agenda
1. Transition to Cloud
   • Lift and Shift
   • Crawl, Walk, Run, Sprint
2. CI / CD Overview
3. Challenges
HERE’s Business
HERE is one of the world’s leading map data companies and is now able to deliver the next generation of mobility and location-based services.
HERE Products
HERE software serves map, traffic and location data to a variety of target platforms:
• HERE Open Location Platform
• Embedded Automobile Navigation
• Enterprise Extensions
• Mobile Apps
HERE’s Challenge
HERE needed a CI system that could handle complex, heterogeneous deployments and releases, and that could scale.
Before the Cloud
Jenkins in the Data Centre (Region #1, Region #2, … Region #x)
Build Systems
• ~5+ Builds per month
• 40T+ Tests cycle
• Jenkins Under the Desk
• Homemade Build Tools
CD Platform Pipelines
• 17 Services
• 2 Unique pipelines
• 1000+ VMs on VMware
• 1-ish Deployments/month
• 100s Acceptance tests/month
• ~10 Build runs/month
AWS Services in Production
Diagram components (Region #1, Region #2):
• Jenkins Master EC2 instances (each in a security group)
• Amazon VPC
• Amazon EFS
• Amazon S3
• Amazon CloudWatch
• Spot Instances
• AWS Device Farm
Common CI Systems – CCI (Jenkins / Electric Flow)
• 110K+ Builds per day
• 25M+ Tests per day
CI for Micro-services – JaaS (Jenkins as a Service)
• 130 Products and services
CD Platform Pipelines (go as a Service)
• 36 Services
• 668 Unique pipelines
• 600+ VMs on AWS
• 40+ Deployments/month
• 100s Acceptance tests/day
• 1400+ Build runs/month
What kind(s) of integration and testing to use?
Tools and stages:
• Jenkins unit testing
• go integration testing
• go deployment
• Real Device testing
• Mesos / Marathon deployment orchestration
What gets delivered:
• Real-time Data Services
• Static Data Services
• Micro Services
• Customer Integration
• Embedded and Downloaded Applications
Transition to Cloud
Moving CI workstreams to AWS
Moving our CI / CD infrastructure to AWS …
• Git
• Gerrit
• Jenkins
• Go
• Splunk
It was a simple lift and shift from our local infrastructure
… and everything worked well from Day 1
Uh, not exactly!
Plan to Grow
• Get your Workflow right
   • i.e. Get your CI act together first
• Know your Capacity and Limits
• Focus on Testing
• Set Expectations Internally
• Know your Fallback options
• Monitor changes (costs)
Create a Culture Change
1. Start small, iterate
   • A single developer group before your flagship product
2. Understand your changes
   • There is infrastructure outside the control of your developer. Don’t let it become Expensive Hosting 2.0
3. Infrastructure as Code is not just a buzzword
   • Apply it if you have one or more people using CI
4. Measure Results and Adapt WoW
   • Only react to verifiable metrics
What did we learn?
• Don’t trust the plugins
• Capacity is always underestimated
• Costs will be high
• Plan fallback
• Trust the developers – just enough
• Moving to the Cloud will help nothing
• People will use it
… and what could we have done better?
Do Continuous Integration
Moving CI ways of working to a Cloud
Client Pipelines
Stages, from Mainline up to the Release candidate (each stage appears once per pipeline in the diagram):
• Pre-submit Verification
• Submit Verification (SV): every 5 min, duration < 20 min
• Baseline Sanity Tests: runs on each successful SV, duration < 20 min
• Full Verification: runs every 3 hrs, duration: 3 hrs
• E2E (Manual Tests): runs every day?, duration: ?
• Release candidate
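The staged layout above, where a later stage only runs after the earlier one succeeds, can be sketched in a few lines of Python. This is an illustrative model only, not HERE's pipeline code: the `Stage` class, the `run_pipeline` function and the string-matching checks are hypothetical stand-ins for real verification jobs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    check: Callable[[str], bool]  # hypothetical check: True when the stage passes


def run_pipeline(change: str, stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that passed.
    """
    passed = []
    for stage in stages:
        if not stage.check(change):
            break  # later, more expensive stages never start
        passed.append(stage.name)
    return passed


# Stage names come from the slide; the checks are made up for illustration.
STAGES = [
    Stage("Pre-submit Verification", lambda c: "syntax-error" not in c),
    Stage("Submit Verification", lambda c: "merge-conflict" not in c),
    Stage("Baseline Sanity Tests", lambda c: "crash" not in c),
    Stage("Full Verification", lambda c: "regression" not in c),
]

if __name__ == "__main__":
    print(run_pipeline("change with crash", STAGES))
```

The point of the ordering is cost: the cheap, frequent stages filter out bad changes before the 3-hour Full Verification is spent on them.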
Service Pipelines
Stages, ending in the Artifact (each stage appears once per pipeline in the diagram):
• Pre-submit Verification
• Submit Verification
• Baseline Sanity Tests
• Full Verification
• Artifact
Challenges and Lessons Learned
Maintaining CI ways of working in a Cloud
Our biggest challenge
Handling the Loads
[Chart: Common CI Runs 2015-2016; values shown: 81,283 and 2,499,109]
Transparency
Measurement and Monitoring
• Dashboards provide visibility
• What is the system doing?
Transparency
Measure it, Use it
• What slowed down this build?
• Know what your system is doing!
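One way to answer "What slowed down this build?" is to compare per-step durations against a known-good baseline. The sketch below assumes you already collect step timings somewhere; the function name `slowest_regression` and the step names are hypothetical, not taken from the talk.

```python
from typing import Dict, Optional, Tuple


def slowest_regression(
    baseline: Dict[str, float], current: Dict[str, float]
) -> Optional[Tuple[str, float]]:
    """Return (step, extra_seconds) for the step that grew the most
    compared with the baseline, or None if nothing got slower."""
    worst: Optional[Tuple[str, float]] = None
    for step, seconds in current.items():
        # A step missing from the baseline counts as entirely new time.
        extra = seconds - baseline.get(step, 0.0)
        if extra > 0 and (worst is None or extra > worst[1]):
            worst = (step, extra)
    return worst


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    baseline = {"checkout": 30.0, "compile": 300.0, "unit tests": 120.0}
    current = {"checkout": 95.0, "compile": 310.0, "unit tests": 118.0}
    print(slowest_regression(baseline, current))
```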
Other Challenges
EC2 Instance types and plugins
• Build Rotator
• Fluentd
• BFA Plugin
• Hierarchy Killer Plugin
• Timestamper Plugin
Special Challenges (Peaks and Valleys)
• CO2 (Choose your AWS region first)
• Performance (Watch your Queues)
• Use Containers (Duh!)
The Advantages of CI in the Cloud
• Security in our infrastructure
• Stability: automated tests run reliably in a consistent infrastructure
• Rapid scaling: slaves come online fast
• Cost control: slaves go off-line fast
• Common AWS tools are known to Engineers
• High availability: Master servers are always available
• Multi-regional presence reduces latency
• Parallel builds and testing will reduce time and costs
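"Watch your Queues", rapid scaling and cost control all come down to sizing the build fleet from the current load. A minimal sketch of such a sizing rule follows; the function name, the two-executors-per-agent default and the cap of 50 agents are assumptions made for illustration, not figures from the talk.

```python
import math


def desired_agents(
    queued_builds: int,
    running_builds: int,
    executors_per_agent: int = 2,  # assumed value
    max_agents: int = 50,          # assumed cost cap
) -> int:
    """Number of build agents needed for the current load.

    Scales up when the queue grows and back to zero when idle,
    but never beyond max_agents.
    """
    if executors_per_agent < 1:
        raise ValueError("executors_per_agent must be at least 1")
    needed = math.ceil((queued_builds + running_builds) / executors_per_agent)
    return min(needed, max_agents)


if __name__ == "__main__":
    print(desired_agents(queued_builds=7, running_builds=3))
```

The cap is the cost-control half of the rule: a burst of queued builds cannot start an unbounded number of instances.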
Load and Scalability
[Chart: Number of Build Runs per day]
Speed and Predictability
[Chart: Mean Duration of pre-commit validation runs]
Final Thought
Avoid creating a big pile of poo!
Questions?
Thank you
Contact
Peter Caron
Service Automation and Continuous Integration
HERE
Invalidenstrasse 116
10115 Berlin
[email protected]
Plugins

Plugin name: BuildRotator Plugin
Version affected: ---
Issue: The LogRotator that comes with Jenkins tries to be much smarter than needed, so it loads the entire job history at least twice to work out what can be removed and what cannot.
Action: Update plugin. Replace "LogRotator" with "BuildRotator" as the build discard mechanism, everywhere.
Download: BuildRotator.hpi

Plugin name: Fluentd
Version affected: ---
Issue: Send data to Fluentd
Action: Install and enjoy.
Download: fluentd.hpi

Plugin name: BFA Plugin
Version affected: 1.13.0 and earlier
Issue: When there is a huge number of aborted builds, BFA needs to process all of them, which creates a queue and slows down Jenkins and the feedback itself.
Action: Build failure analyzer => Advanced => "Ignore aborted builds" option should be enabled in the Jenkins configuration.
Download: Available in Jenkins
Plugins (continued)

Plugin name: HierarchyKillerPlugin
Version affected: 0.98 and earlier
Issue: When the plugin goes to kill an item in the queue, it kills the first job in the queue instead of the job that is connected to the upstream. FIX: the correct API call is now used.
Action: Update plugin and have fun.
Download: build-hierarchy-killer.hpi

Plugin name: Timestamper Plugin
Version affected: 1.8.4 and earlier
Issue: Even though Jenkins needs only the last 150 KB, the plugin reads the entire log (because of the # of users we have up to 3 GB) to calculate timestamps for the last X lines. The main problem is that the plugin stores timestamps in an encoded format (VarInt). FIX: read only the last 150 KB of logs for finished builds.
Action: Update plugin and enjoy.
Download: Available in Jenkins
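The Timestamper fix described above boils down to reading only the tail of a large file instead of the whole thing. The following Python sketch shows that idea in isolation; it is not the plugin's code (Jenkins plugins are written in Java), and the function name `read_log_tail` and the constant `TAIL_BYTES` are hypothetical.

```python
import os

TAIL_BYTES = 150 * 1024  # the 150 KB figure quoted on the slide


def read_log_tail(path: str, max_bytes: int = TAIL_BYTES) -> bytes:
    """Return at most the last max_bytes of the file at path.

    Seeks directly to the tail, so a multi-gigabyte log costs the
    same to read as a small one.
    """
    with open(path, "rb") as log:
        log.seek(0, os.SEEK_END)
        size = log.tell()
        # Clamp at 0 so files shorter than max_bytes are read whole.
        log.seek(max(0, size - max_bytes))
        return log.read()
```

Note that the tail may start in the middle of a line, so a caller that needs whole lines would drop everything up to the first newline.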
Contributions
Plugins
• S3
• BFA
• DSL
• EC2
• Unit
• Gerrit
• ccache
Core
• Jenkins
• XML library