As data engineers, we do our fair share of troubleshooting bugs and issues that arise when managing data. In this article, we’ll focus specifically on working with data pipelines. We’ve prepared this high-level overview of the steps to follow when troubleshooting and fixing common issues in big-data pipelines. While it is not a complete guide to troubleshooting specific technologies, it is intended to guide a data engineer through the process.
Learn the recommended steps to follow when handling the most common causes of data pipeline failures.
Step 1: Incident Information
In this first step, collect all possible information from the reported incident: who reported the issue, who is affected by it, its priority, when it happened, user screenshots or videos, etc. If the issue was reported by an automated alerting system (e.g., CloudWatch, PagerDuty), include all the information that is part of the alert and identify the system that generated it. With this information, we can easily identify (1) whether the observed behavior is an actual issue, (2) the pipelines that could be involved, and (3) where to start the investigation.
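As a minimal sketch, the intake details above can be captured in a single structured record. The field names and the `IncidentReport` type here are our own assumptions, not a standard; adapt them to whatever your issue tracker actually stores.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical structure for keeping incident intake details in one place.
@dataclass
class IncidentReport:
    reporter: str              # who reported the issue
    affected: list             # who is affected by the issue
    priority: str              # e.g. "P1", "P2"
    occurred_at: datetime      # when it happened
    source: str = "manual"     # "manual", or the alerting system's name
    attachments: list = field(default_factory=list)  # screenshots, videos

def is_automated(report: IncidentReport) -> bool:
    """Automated alerts carry the name of the system that generated them."""
    return report.source != "manual"
```

Recording the alert source explicitly makes it trivial to tell manual reports apart from monitoring alerts later in the investigation.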
Ensure you have all the documentation needed to respond to the alert. Having a runbook is very convenient and can save you a lot of troubleshooting time by helping you identify known issues and avoid common mistakes during the process.
An issue tracking system will help you communicate easily with all affected users, keep all related information in one place, and maintain an open channel for questions, updates, and progress on the investigation and resolution of the issue.
Step 2: Software and Infrastructure
Identify the software version and source code repository that were in use during the incident, when the last deployment was made, the infrastructure involved (e.g., cluster, network resources, databases, storage), and the list of other jobs that were running in the same cluster at the time of the failure. Adequate documentation for the deployment process and infrastructure will help speed up this step.
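One way to make this step cheap is to snapshot the deployment context automatically when a job starts. The sketch below assumes the deployment process exports environment variables named `GIT_SHA`, `DEPLOY_TIME`, and `CLUSTER_ID`; those names are hypothetical, so substitute whatever your process actually records.

```python
import os
import platform

# Assumed env-var names; use whatever your deployment pipeline sets.
def deployment_context(env=os.environ):
    """Snapshot the software and infrastructure context for later triage."""
    return {
        "git_sha": env.get("GIT_SHA", "unknown"),
        "deployed_at": env.get("DEPLOY_TIME", "unknown"),
        "cluster": env.get("CLUSTER_ID", "unknown"),
        "python": platform.python_version(),
    }
```

Logging this dictionary at job startup means the incident record already contains the version and infrastructure details this step asks for.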
Step 3: Logs and Metrics
Identify the location of the log files and metrics generated by the cluster at the moment of the failure. These should cover both software and infrastructure (e.g., Apache Spark logs and metrics, network load, cluster CPU utilization, available memory). Error messages in the logs or out-of-range metrics can provide hints as to whether the issue is related to infrastructure, data, or software. Adequate instrumentation in the code and a centralized management tool for both logs and metrics are key for this step: they simplify the process and can save you weeks or even months of hard work identifying and correlating the different events that happened in the cluster during the same time window. For more information on logs and metrics, look for the best practices for Logging, Application Performance Metrics, and Operational Statistics that are part of the DataOps movement.
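Correlating events in the same time window is exactly what a centralized logging tool does for you; when you have to do it by hand, the core operation is just filtering timestamped records around the failure time. A minimal sketch, assuming each log or metric entry is a dict with a `ts` timestamp (our own convention, not any tool's format):

```python
from datetime import datetime, timedelta

def events_in_window(events, failure_time, minutes=10):
    """Keep events whose timestamp falls within ±minutes of the failure."""
    window = timedelta(minutes=minutes)
    return [e for e in events if abs(e["ts"] - failure_time) <= window]
```

The same filter works on parsed application logs, metric samples, or cluster events, which is what lets you line them up side by side for the failure window.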
Step 4: Reproduce and Isolate the Issue
Identify the lines of code, the data, and the context that reproduce the issue. Depending on the type of issue and the information you get from the logs and metrics, this can be a trivial task or an effort that requires weeks or even months of hard work. Look for a troubleshooting guide specific to a framework or technology if you can tell the issue is coming from that single piece of your tech stack.
If your logs don’t provide enough information, reproduce the issue in a testing environment with the corresponding software version and an infrastructure similar to production. This may require preparing a dataset and executing some of the previous or parallel jobs that were running at the time of the failure, or using a different infrastructure. Doing so will usually let you determine whether the issue is related to infrastructure, data, or software, and provide hints about which pipeline stages are failing. This step may involve several pipeline executions that take hours to run; an adequate testing environment, such as a QA environment with the right infrastructure, data, and automated testing tools, can help speed things up.
Identify the code component where the issue is located and the smallest dataset required to reproduce it. This may require running several tests in which an isolated part of the code is executed with different datasets or environment variables. It is recommended to set the logging level to “Debug” to get more detailed information during the testing process.
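Shrinking the dataset can itself be automated: repeatedly halve the input and keep whichever half still triggers the failure. The sketch below assumes you have a `fails` predicate that runs the isolated code and reports whether the bug reproduces; both that predicate and the `minimize` helper are illustrative, not part of any framework.

```python
import logging

logging.basicConfig(level=logging.DEBUG)  # surface debug-level detail while testing
log = logging.getLogger("repro")

def minimize(records, fails):
    """Bisect records down to a small subset that still reproduces the bug."""
    while len(records) > 1:
        half = len(records) // 2
        first, second = records[:half], records[half:]
        if fails(first):
            records = first
        elif fails(second):
            records = second
        else:
            break  # the failure needs records from both halves; stop here
        log.debug("reduced to %d records", len(records))
    return records
```

This only finds a single-record trigger when one exists; failures that depend on combinations of records will stop at a larger subset, which is still a useful starting point.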
If all of the above fails but you have managed to isolate the issue to a small piece of code, you may need manual or automated instrumentation of the code (e.g., adding more debug logs on certain lines, or using an external tool such as an IDE or a profiler to report execution details at the method or line level). This may include context information such as global variables, database connections, etc.
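In Python, the standard library's `cProfile` is one such external tool: it reports where time is spent at the method level without touching the code under test. A minimal sketch, where `suspect_stage` is a hypothetical stand-in for the isolated pipeline code:

```python
import cProfile
import io
import pstats

def suspect_stage(rows):
    """Hypothetical isolated pipeline step under investigation."""
    return sorted(rows, key=lambda r: r["key"])

def profile(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return result, out.getvalue()
```

The report lists the most expensive calls first, which often points straight at the method worth instrumenting further.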
Step 5: Automate the Issue Scenario
We recommend implementing an automated test that reproduces the issue and adding it to your Bug Verification Testing Suite. This ensures the same issue will not come back in a future release.
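Such a regression test can be as small as pinning the fixed behaviour on the inputs that used to fail. The sketch below is written in pytest style; `normalize_amount` is a hypothetical pipeline function that we imagine once crashed on null input, used only to illustrate the shape of the test.

```python
def normalize_amount(raw):
    """Parse a currency string; treat missing values as 0.0 (the fix)."""
    if raw is None or raw.strip() == "":
        return 0.0
    return float(raw.replace(",", ""))

def test_normalize_amount_handles_nulls():
    # Regression test: the hypothetical original bug crashed on None and "".
    assert normalize_amount(None) == 0.0
    assert normalize_amount("") == 0.0
    assert normalize_amount("1,234.5") == 1234.5
```

Once this lives in the suite, any future change that reintroduces the crash fails CI before it reaches production.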
Step 6: Root Cause Hypothesis
Form a hypothesis about the main cause of the issue. Determine whether the issue originates in a different part of the pipeline, such as missing validation in a previous step, or a radical change in the data distribution that caused skewed data in the cluster. Use the five whys technique to be sure that you have found the actual root cause.
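A skewed-data hypothesis, for instance, can be checked with a quick measurement rather than guesswork. A minimal sketch: compute what share of rows the most frequent partition key holds (the `skew_ratio` helper and any threshold you apply to it are our own illustration, not a standard metric).

```python
from collections import Counter

def skew_ratio(keys):
    """Share of rows held by the most frequent key (1.0 = all one key)."""
    counts = Counter(keys)
    return max(counts.values()) / len(keys)
```

A ratio near `1 / number_of_distinct_keys` suggests a balanced distribution; a ratio close to 1.0 means one key dominates, which would support the skew hypothesis.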
Step 7: Verify Root Cause Hypothesis
Make the required changes in the code to verify your hypothesis. Optimize the automated test and dataset that reproduce the issue on this particular piece of code, so you can test faster and iterate more quickly over different scenarios. If you cannot verify the original hypothesis, repeat the process with a new one.
Step 8: Create the Hot-Fix
Depending on the priority of the issue, you may choose between two options. The first is to implement the final, high-quality fix, which may involve complex changes such as a redesign of the pipeline, an upgrade of a framework or library, or changes to the infrastructure. The second is a quick fix that will be difficult to maintain or may break again in the future, but that buys you time to implement the final fix. Of course, this second option requires the whole team to agree to run the quick fix in production, with the corresponding implications and risks, and to add the task of implementing the final fix to the backlog of your task tracking system.
Step 9: Submit the Hot-Fix
Once you have the required changes in code and configuration, it is recommended to create a hotfix branch in your source code repository, as documented in the Gitflow workflow. Create the corresponding pull request so other data engineers can review the code changes and their automated tests. Then let the code pass through your Continuous Integration tool to verify code quality, and let the automated tools catch any other issues that may arise from the change, including integration issues in a QA environment.
Step 10: Deploy the Fix
Revisit the Continuous Deployment process to ensure it supports all the changes required by the fix. Discuss a deployment strategy with the team if the change may impact end users or overload the current infrastructure or external services (e.g., databases, web services, etc.).
- Back-fill. Run a back-fill process for any missing or corrupted data produced as part of the issue. Use available logs, metrics, or production data to determine whether the issue went unnoticed before the incident was reported, and find the first instance of the issue so you know where to start.
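Finding where the back-fill must start often reduces to comparing the partitions that should exist against the ones that do. A minimal sketch for daily partitions, where `present` would come from your storage listings, logs, or metrics (the helper itself is illustrative):

```python
from datetime import date, timedelta

def missing_dates(start, end, present):
    """List the daily partitions between start and end that are absent."""
    day, gaps = start, []
    while day <= end:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps
```

The first gap in the returned list is the earliest instance of the problem and therefore the starting point for the back-fill.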
- Incident Report. Once the issue is fixed, fill out the corresponding Incident Report, including the timeline of events since the first instance of the issue, the root cause, the temporary or final solution, and the parts of the development process that can be improved to avoid these kinds of issues in the future (e.g., improving the logging strategy, adding more operational statistics, improving the automated testing environment, etc.).
Thanks for reading, we hope this guide helped to demystify the process of troubleshooting pipelines. Our data team is multidisciplinary and always happy to share its secrets. If you need help with managing your data or have questions about working with our data engineers, reach out to us at firstname.lastname@example.org or contact us here.