Part I – Introduction to Project Canary
Engineering success and data-driven, actionable insights
Success in software development programs requires engineering managers to be more proactive than reactive. They must not only stay on top of things that demand attention “at the moment”, but also anticipate and get ahead of development risks and red flags, preferably before they occur and cause engineering failures and stress within teams.
Engineering teams often have established frameworks, experience, knowledge, and tools to proactively assess and respond to risks. They also maintain a list of the probable risks in a risk registry. In larger organizations, there are dedicated groups that track and assess risks and surface them to leadership teams when things start turning red. However, this is easier said than done since most of them rely on manual tracking, post-event data, and intervention. Like any manual process, continuous risk evaluation only works if the risks are flagged and addressed in time and the engineering team is working off a common understanding of risks with no room for misinterpretations.
It is widely accepted that in a software engineering ecosystem, the cost of failure typically far exceeds the cost of development, so mature engineering teams would rather anticipate and forecast risks in advance, without manual dependencies.
At Persistent Systems, we have been working with the world’s leading software companies, helping them build their products. Over the last 30 years, we have seen engineering programs face similar challenges that could easily have been avoided had there been a data-driven alerting mechanism.
To put things into perspective, on a typical day, a senior engineering leader oversees initiatives for 10-15 ISV clients, spanning 100+ engineering programs with a total team size of 1000-1300 engineers. As a part of the leadership team, this role is primarily geared towards business growth and client success. In reality, however, most engineering leaders often find themselves firefighting surprise escalations and dealing with talent management and attrition-related issues, leaving them very little bandwidth to focus on their primary goals. These surprises also impact business predictability and the overall client relationship.
To address this, we developed an intelligent alerting framework and tool that we call ‘Project Canary’ especially geared towards product engineering scenarios. The objective of this data-driven framework is to predict such risks at least two weeks in advance so that engineering teams can intervene early to mitigate them. Specifically, it derives actionable insights around:
- WHICH engineering POD may potentially run into issues?
- WHAT is the likely risk?
- What needs to be done NOW to avoid it?
Why ‘Project Canary’
The name originated from the team’s discussion of things and actions associated with an “early warning”. A quick Google search led us to the explanation for Sentinel Species on Wikipedia.
In the olden days, air quality within mine shafts was often unpredictable with the possibility of dangerous gases such as carbon monoxide posing a great risk to miners. To mitigate this, miners would carry caged canaries into mines with them. The presence of dangerous gases within the mine would cause the canary to die thus giving an early warning to miners.
Thankfully, the advent of sensors and technology towards the end of the 20th century replaced this age-old method with a more humane alerting mechanism. We drew inspiration from this to build a framework that would give early warnings to software development teams.
Over the years, a lot of study has gone into analyzing and measuring risks during software development, and a focused effort has been made to develop tools that help predict risks to engineering outcomes. From poor software quality to misinterpreted and unclear requirements, from scope creep to schedule variance, from inadequate testing to lack of collaboration, the list of reasons why software engineering can fail is endless. However, addressing these is straightforward provided they are surfaced at the right time and engineering leads take proactive measures to mitigate them.
For example, if there is scope creep or a schedule variance, teams almost inevitably start working longer hours to bring things back on schedule. In other cases, there is a distinct pattern of increased working hours around a planned software release or deployment. Observations like these allow us to extrapolate larger patterns: common conditions that precede failures, common actions taken to prevent them, and behavior patterns observed in teams as they resolve certain issues.
Most software engineering teams maintain a record of such data or are implicitly generating signals that can be useful to derive these patterns. Often, this dark data just lies unused in the repositories. What if we can build an intelligent, self-learning machine to analyze this data from the software development lifecycle, sentiment in email, and other digital communications? Can it yield an improved risk alerting system?
With these questions in mind, we challenged ourselves to mine the wealth of information in this dark data, derive meaningful, actionable insights from it, and drive increased success and stronger outcomes from our product engineering collaborations.
What does the Canary smell?
To augment traditional approaches to measuring risk, where standard frameworks measure mostly lagging indicators, Project Canary relies on leading indicators. Leading indicators can be defined as the parameters or events that reflect the health of a development sprint. Provided promptly, these indicators allow leadership to take proactive actions to influence engineering outcomes.
Within the Project Canary framework, we have focused on flagging anomalies in leading indicators that are regularly captured within most organizations, and on triggering alerts where there is a deviation in the patterns of these datasets.
- Swipe-In/Swipe-Out Data – Using statistical algorithms (e.g. moving averages and standard deviations over some period) on attendance records of team members, we were able to identify trends in hours spent at work and discover deviations to raise an alert. This helped identify whether the team was putting in extra hours (effort) compared to its usual pattern of working. Such a deviation indicated that a critical deadline was looming, and it was flagged as a schedule variance risk.
- Email Sentiments – We used machine learning-based APIs to score negative sentiment in emails. An alert is raised when there is an increase in emails with negative sentiment or when a single email has a very high negative sentiment score. Being extremely conscious of privacy and information security, we designed this module to run without any human intervention, and no data was stored in any storage system. In fact, to begin with, we restricted the ML algorithm to scanning a very narrow slice of the metadata of emails exchanged between the engineering team and the product owner. This information can be used to understand changes in the patterns and frequencies of conversations and to check for sudden spikes of negative sentiment, which could be an early indicator of something going wrong. It is important to note that not all negative sentiment indicates risk; hence, the algorithm is designed to adjust for this.
- Team member dissatisfaction and attrition – A sudden increase in workload or the possibility of failure can lead team members to think about moving out of their teams. The Canary framework tries to correlate these trends to see whether they are related to employee attrition, and vice versa. Being able to correlate attrition with its causes not only helps mitigate flight risk but, in some cases, can also help HR reverse attrition.
- Internal job openings, indents – The Canary framework identifies trends in open positions within engineering teams and triggers alerts based on indent or requisition aging. A rise in hiring for senior leads with a strong technology background in the middle of milestones may also be classified as a vacuum in technical expertise, or may help identify a change in scope or a risk.
- Software delivery release cycles – Not all alerts within the Canary framework are classified as critical risks. Some alerts are simple notifications that help the leadership team track ongoing developmental milestones and take action if required.
- Meetings Data – As critical developmental milestones approach, the frequency of team meetings (daily stand-ups, reviews, etc.) increases. The Canary framework correlates signals such as a sudden spike in senior team members attending meetings and negative-connotation words in the subject lines of unplanned meetings to surface signs of turbulence.
- Risk Data – In addition to the datasets mentioned above, most engineering teams track standard risk KPIs and maintain a risk register. The Canary framework sends an information alert for major changes seen in these risk metrics. These metrics are later used to augment the leading indicators and tune the ML (machine learning) models for overall risk predictions and more relevant insights.
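To make the swipe-in/swipe-out idea above more concrete, here is a minimal sketch of how a moving average and standard deviation could be used to flag unusually long working days. It is only an illustration, not the actual Canary implementation: the function name `flag_unusual_hours`, the 10-day window, the threshold of 2 standard deviations, and the sample data are all assumptions made for this example.

```python
from statistics import mean, stdev


def flag_unusual_hours(daily_hours, window=10, threshold=2.0):
    """Return the indices of days with unusually high working hours.

    A day is flagged when its value lies more than `threshold` standard
    deviations above the mean of the previous `window` days.
    Both parameters are illustrative assumptions, not Canary defaults.
    """
    flagged = []
    for i in range(window, len(daily_hours)):
        history = daily_hours[i - window:i]  # trailing window, excludes day i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: skip to avoid dividing by zero.
            continue
        z_score = (daily_hours[i] - mu) / sigma
        if z_score > threshold:
            flagged.append(i)
    return flagged


# Hypothetical team-average hours per working day.
team_hours = [8.0, 8.2, 7.9, 8.1, 8.0, 8.3, 7.8, 8.1, 8.0, 8.2, 8.1, 10.5, 11.0]
alerts = flag_unusual_hours(team_hours)
print(alerts)  # [11, 12]
```

In this made-up series, the last two days stand out against the team's usual pattern of roughly eight hours, which is the kind of deviation that could be surfaced as a possible schedule variance risk.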
These are just a few examples of how the data can be converted into actionable insights within the Canary framework to build an intelligent alerting framework. The framework continues to evolve by bringing in additional datasets and features that we are working on to make the Canary more intelligent and actionable.
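As one more illustration, the email sentiment rule described earlier (alert on a rise in negative emails, or on a single very negative email) could be sketched as follows. This assumes each email has already been given a negative-sentiment score between 0 and 1 by some sentiment API; the function name `sentiment_alert` and all cutoff values are hypothetical choices for this example. Comparing against a baseline period is one simple way to reflect the point that not every negative email signals risk.

```python
def sentiment_alert(baseline_scores, current_scores,
                    negative_cutoff=0.6, severe_cutoff=0.9, rise_margin=0.2):
    """Return a reason string if an alert should fire, otherwise None.

    Scores are assumed to be negative-sentiment values in [0, 1].
    All cutoffs are illustrative assumptions.
    """
    # Rule 1: a single email with a very high negative score.
    if any(score >= severe_cutoff for score in current_scores):
        return "single highly negative email"

    def negative_share(scores):
        # Fraction of emails at or above the negative cutoff.
        if not scores:
            return 0.0
        return sum(score >= negative_cutoff for score in scores) / len(scores)

    # Rule 2: the share of negative emails rose noticeably versus the baseline.
    rise = negative_share(current_scores) - negative_share(baseline_scores)
    if rise >= rise_margin:
        return "rise in negative emails"
    return None


# Hypothetical scores for a baseline period and the current period.
baseline = [0.1, 0.2, 0.7, 0.1, 0.3]
print(sentiment_alert(baseline, [0.7, 0.65, 0.2, 0.8]))  # rise in negative emails
print(sentiment_alert(baseline, [0.1, 0.95]))            # single highly negative email
print(sentiment_alert(baseline, [0.1, 0.2, 0.3]))        # None
```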
In the next blog, we will showcase how the Canary framework reacts to signals and communicates them so that data-driven, actionable insights can be delivered. Staying with the bird analogy, we hope you can “fly back” to our Persistent blog soon and join us for the next installment.