Spinnaker is a multi-cloud continuous delivery platform, originally developed by Netflix and Google engineers, designed to simplify deploying microservices on a continuous basis. To achieve this, every new deployment should be continuously verified against the stable release. AI-based Autopilot brings this "continuous verification" capability to Spinnaker by continuously generating a health score for the new deployment relative to the stable release. Based on the generated health score, the new deployment is either rolled out or rolled back.
How Continuous Verification Works in Spinnaker
New Deployment Initialization
A new version of a microservice is deployed to a controlled environment. This version is tested in parallel with the existing stable release.
Health Score Calculation
AI-driven Autopilot continuously monitors the performance, reliability, and error rates of the new deployment. It generates a health score based on multiple factors, including:
- Application performance metrics (e.g., response time, CPU/memory usage)
- Error logs and failure rates (e.g., HTTP 5xx errors)
- User experience indicators (e.g., latency, request success rate)
- System anomalies detected by machine learning models
Decision Making: Roll Forward or Rollback
If the health score is above a predefined threshold, the new deployment is rolled out to production. If the health score indicates issues, the system automatically rolls back the deployment to prevent failures from affecting end-users.
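As a minimal sketch of this decision step (the 0-100 score range and the threshold value are assumptions for illustration, not Autopilot's actual configuration):

```python
# Hypothetical sketch of the roll-forward / rollback decision.
# The 0-100 score range and the 80-point threshold are illustrative assumptions.
HEALTH_THRESHOLD = 80.0

def decide(health_score: float) -> str:
    """Return the pipeline action for a canary health score."""
    if health_score >= HEALTH_THRESHOLD:
        return "roll_forward"   # promote the new release to production
    return "roll_back"          # revert to the stable release
```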
Automated Feedback and Improvement
Insights from previous deployments are stored and analyzed for future improvements. AI models continuously learn from past deployment behavior, making future releases more stable and reliable.
An application consists of multiple services. These services are packaged together into a deployment container, which forms a Docker image. When the application is deployed, this Docker image is shipped through the Kubernetes platform.
A new release of the application, containerized in a Docker image, can be tested against the old release by comparing the logs generated by the different services under the test suites. These can be database or operational logs, stored and retrieved through the Elasticsearch tool.
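For illustration, the logs for the two releases could be pulled from Elasticsearch roughly as follows (the index name, field names, release tags, and client version are assumptions, not the actual setup):

```python
# Sketch: pulling release logs from Elasticsearch for comparison.
# Index name, field names, and release labels are hypothetical.
from elasticsearch import Elasticsearch  # elasticsearch-py client (8.x style API)

es = Elasticsearch("http://localhost:9200")

def fetch_logs(release: str, size: int = 1000) -> list[str]:
    """Fetch raw log messages tagged with a given release id."""
    resp = es.search(
        index="app-logs",
        query={"term": {"release.keyword": release}},
        size=size,
    )
    return [hit["_source"]["message"] for hit in resp["hits"]["hits"]]

stable_logs = fetch_logs("v1.4.2")   # stable release
canary_logs = fetch_logs("v1.5.0")   # new release under verification
```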
To compare application logs between releases across services, the logs first need to be clustered contextually.
There are established techniques for clustering natural-language sentences by tokenizing their words. Treating each log event as one sentence, word embeddings can be generated for the logs, and feeding these embeddings to clustering models yields the log clusters.
These log clusters can then be easily compared between releases by classifying them into "WARN", "ERROR" and "CRITICAL" labels.
A health score is generated from the ratio of log cluster events in the new release to those in the stable release, with respect to these labels.
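A minimal sketch of such a label-based scoring scheme (the severity weights and the scoring formula are illustrative assumptions, not Autopilot's exact computation):

```python
from collections import Counter

# Sketch: score a release by comparing per-label log-cluster event counts
# against the stable baseline. Weights and formula are illustrative only.
SEVERITY_WEIGHTS = {"WARN": 1.0, "ERROR": 3.0, "CRITICAL": 9.0}

def health_score(new_labels: list[str], stable_labels: list[str]) -> float:
    """Return a 0-100 score; lower when the new release emits relatively
    more WARN/ERROR/CRITICAL cluster events than the stable release."""
    new_counts, stable_counts = Counter(new_labels), Counter(stable_labels)
    penalty = 0.0
    for label, weight in SEVERITY_WEIGHTS.items():
        ratio = (new_counts[label] + 1) / (stable_counts[label] + 1)  # smoothed ratio
        penalty += weight * max(0.0, ratio - 1.0)
    return max(0.0, 100.0 - 10.0 * penalty)
```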
I tested different word embedding techniques, such as TF-IDF, Word2Vec and Sentence2Vec, as well as BERT, which can capture the meaning of log messages even though they differ from natural text. To obtain meaningful clusters from the log embeddings, I evaluated different clustering algorithms such as hierarchical clustering and DBSCAN.
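A minimal sketch of one of those combinations, TF-IDF embeddings fed to DBSCAN (the sample log lines and parameters are illustrative; Word2Vec or BERT embeddings can be swapped in for the vectorizer):

```python
# Sketch: contextual clustering of log events with TF-IDF embeddings and DBSCAN.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

logs = [
    "ERROR payment-service timeout while calling gateway",
    "ERROR payment-service timeout while calling gateway retry=2",
    "WARN cache-service eviction rate above threshold",
    "CRITICAL order-service database connection refused",
]

# Treat each log event as one "sentence" and embed it.
embeddings = TfidfVectorizer(token_pattern=r"[A-Za-z][A-Za-z0-9_\-]+").fit_transform(logs)

# Cosine distance suits sparse TF-IDF vectors; eps is data-dependent.
labels = DBSCAN(eps=0.5, min_samples=1, metric="cosine").fit_predict(embeddings)

for log, cluster in zip(logs, labels):
    print(cluster, log)   # the two similar timeout errors should land in one cluster
```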
Application performance metrics (APM) of the new release, such as latency (response time), throughput (requests per minute), error rates, and CPU/memory utilization, are checked against the corresponding metrics of the stable version of the application to produce a health score.
Central tendencies and non-parametric statistical tests such as the Wilcoxon signed-rank test can be applied to find statistically significant differences between the APM metrics.
To quantify the differences between metrics, the "effect size" can be taken into account.
I tested different non-parametric statistical methods, such as the Wilcoxon signed-rank test and the Kruskal-Wallis test, to compare the distributions of APM metrics and quantified the differences in terms of effect size.
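A sketch of that comparison using SciPy (the latency samples are synthetic, and Cliff's delta computed from the Mann-Whitney U statistic stands in for the effect-size measure):

```python
# Sketch: comparing an APM metric (per-request latency) between stable and new
# releases with non-parametric tests, plus Cliff's delta as an effect size.
import numpy as np
from scipy.stats import wilcoxon, kruskal, mannwhitneyu

rng = np.random.default_rng(0)
stable = rng.normal(200, 20, 500)   # latency in ms, stable release
canary = rng.normal(215, 20, 500)   # latency in ms, new release

# Paired comparison (assumes the same time buckets are sampled for both releases).
w_stat, w_p = wilcoxon(stable, canary)

# Independent-samples comparison across releases.
k_stat, k_p = kruskal(stable, canary)

# Cliff's delta via the Mann-Whitney U statistic: delta = 2U/(n*m) - 1.
u_stat, _ = mannwhitneyu(canary, stable, alternative="two-sided")
cliffs_delta = 2.0 * u_stat / (len(canary) * len(stable)) - 1.0

print(f"Wilcoxon p={w_p:.4f}, Kruskal-Wallis p={k_p:.4f}, Cliff's delta={cliffs_delta:.3f}")
```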
The application consists of a number of microservices. Tracking the logs generated by the different microservices in one place becomes essential for tracing the log events related to a KPI degradation (e.g., a drop in responses per minute).
But this tracking comes with many hurdles: variable load conditions, missing logs from database services, logs and metrics being recorded at different points in time, and collection and monitoring happening at fixed intervals while system behavior changes at variable intervals.
To solve this, the APM metrics are tracked for anomaly detection. Once an anomaly is detected, the log events around that time bucket are collected. The assumption here is that once a KPI degrades, the problems causing the impact occur more frequently.
A metric-log causal relationship is then built using the count of log events (the log cluster size) in moving time buckets around the anomaly, which gives a weighting of the log events causing the KPI degradation.
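A rough sketch of that weighting step (the bucket size, window, and data layout of (timestamp, cluster_id) tuples are assumptions for illustration):

```python
# Sketch: weight log clusters by how their bucketed event counts move with the
# KPI around a detected anomaly window.
from collections import defaultdict
import numpy as np

def bucket_counts(events, start, end, bucket_s=60):
    """Count events per cluster in fixed time buckets inside [start, end)."""
    n_buckets = int((end - start) // bucket_s)
    counts = defaultdict(lambda: np.zeros(n_buckets))
    for ts, cluster_id in events:
        if start <= ts < end:
            counts[cluster_id][int((ts - start) // bucket_s)] += 1
    return counts

def rank_clusters(events, kpi_series, start, end, bucket_s=60):
    """Rank log clusters by correlation of their bucketed counts with the KPI
    (kpi_series is assumed to be sampled on the same buckets)."""
    counts = bucket_counts(events, start, end, bucket_s)
    weights = {}
    for cluster_id, series in counts.items():
        if series.std() > 0 and kpi_series.std() > 0:
            weights[cluster_id] = abs(np.corrcoef(series, kpi_series)[0, 1])
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
```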
Algorithms like Luminol can be used to detect anomalies in an APM metric and to correlate that anomaly across the other APM metrics.
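A rough sketch of that step using LinkedIn's luminol library (the time-series values are synthetic and the correlation threshold is illustrative):

```python
# Sketch: APM anomaly detection plus cross-metric correlation with luminol.
# luminol accepts time series as {timestamp: value} dicts.
from luminol.anomaly_detector import AnomalyDetector
from luminol.correlator import Correlator

latency = {t: 200 + (300 if 50 <= t < 60 else 0) for t in range(120)}   # KPI with a spike
error_rate = {t: 1 + (20 if 48 <= t < 62 else 0) for t in range(120)}   # second APM metric

detector = AnomalyDetector(latency)
for anomaly in detector.get_anomalies():
    window = anomaly.get_time_window()           # (start, end) of the anomalous period
    correlator = Correlator(latency, error_rate, window)
    if correlator.is_correlated(threshold=0.8):
        print("error_rate correlates with the latency anomaly at", window)
```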
In this way, the root cause of a KPI degradation can be tracked across the application's different services by combining logs and APM metrics.
My responsibility was to implement the different mechanisms for contextual clustering of log events and to reduce false positives. This contribution resulted in effective clustering of log events and an accurate health score for the application.
To get an accurate quantification of the differences among the APM metrics, I contributed to evaluating different non-parametric statistical tests and to the calculation of effect size.
I built the root cause analysis mechanism for a client who specifically wanted a solution for detecting and tracking KPI degradations; it was later integrated into the Autopilot product.
White paper - https://www.researchgate.net/figure/Workflow-of-extracting-correlation-between-logs-and-metrics_fig2_304291940
The outcome of this work was accurate health score generation for new releases, paving the way for a reliable continuous verification stage in CI/CD pipelines. The overall integration of Autopilot led to a one-third reduction in deployment time and greater confidence in canary deployments.