Data Ghost Stories

Our data engineers know that a lot can go wrong when working with data. Through hiccups, mixups, and lessons learned, our data team has compiled a few chilling stories about the dangers of mismanaging data.

A pipeline nightmare

It was a pleasant Friday, I recall. I was enjoying my after-lunch coffee and getting ready to head back home when the tech lead asked me to work on a simple pipeline over the weekend. The task: enrich every record of a teensy-weensy dataset with information from an API. Easy peasy, I thought.

So I rolled up my sleeves and started implementing the pipeline. After some testing and several cups of black coffee, it seemed ready for production on Sunday night. I scheduled the pipeline to run first thing in the morning and went to bed feeling satisfied with the implementation.

The next morning, I woke up to multiple back-and-forth emails about my pipeline. The last one, to my horror, highlighted the following message in bold red letters: 2000+ USD EXPENSE DETECTED THIS MORNING. It turned out the production data was not small, but rather several hundred billion records long. The job was naively trying to backfill everything, causing one of the worst heart-stopping moments of my life.

By Aldo Orozco, Data Engineer at Wizeline
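One way to guard against a surprise like this is a pre-flight check that estimates the cost of a run before the first API call is made. The snippet below is only a minimal sketch, not the pipeline from the story: the function names, the per-call price, and the budget are all hypothetical, and it assumes one API call per record.

```python
def estimate_cost_usd(record_count: int, cost_per_call_usd: float) -> float:
    """Estimated spend, assuming every record triggers exactly one API call."""
    return record_count * cost_per_call_usd


def check_budget(record_count: int, cost_per_call_usd: float, budget_usd: float) -> float:
    """Return the estimated cost, or raise before any API call is made.

    All three inputs are hypothetical: the record count would come from a
    cheap count on the production source, the price from the API's pricing.
    """
    estimate = estimate_cost_usd(record_count, cost_per_call_usd)
    if estimate > budget_usd:
        raise RuntimeError(
            f"Estimated cost {estimate:,.2f} USD exceeds budget "
            f"{budget_usd:,.2f} USD; refusing to start the backfill."
        )
    return estimate
```

Calling `check_budget` with the real production record count, rather than the size of the test dataset, is the part that matters: the check is only as good as the count it is given.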

A world with too much data

Imagine a world where AI algorithms make decisions using biased data, causing episodes of injustice, racism, or prejudice… Well, that may already be true today. For example, the AI systems of some police departments show a racist trend, personnel selection systems leave out minority groups, and evaluation systems can reduce the amount of government aid a person receives, or even fire them, using arbitrary metrics without considering other factors.

We should be aware that the models that feed the algorithms are only simplified representations of a system and are therefore likely to leave out exceptions, outliers, and more complex background factors that have not been properly considered.

While technology and access to large amounts of data have brought positive innovations and advances, we must keep in mind that algorithms may fail and that data may not be neutral and could contain biases. Fortunately, none of this is inherent to the algorithms, and we can take actions that minimize the negative effects. Some things we can do:

  • Make sure there is a meaningful, human appeals process.
  • Plan for how to catch and address mistakes in advance.
  • Take responsibility, even when our work is just one part of the system.
  • Be on the lookout for bias.
  • Choose not to just optimize metrics.
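As a small illustration of what being on the lookout for bias can look like in practice, the sketch below compares how often a system selects people from different groups. It is a rough first check under simple assumptions, not a full fairness audit: the function names and the input format (pairs of group label and decision) are hypothetical, and a large gap is a prompt for human review rather than proof of bias.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Map each group to the fraction of its members that were selected.

    `decisions` is an iterable of (group, was_selected) pairs.
    """
    totals = defaultdict(int)
    selected = defaultdict(int)
    for group, was_selected in decisions:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


def max_rate_gap(rates):
    """Difference between the highest and lowest selection rate."""
    return max(rates.values()) - min(rates.values())


# Made-up decisions, for illustration only.
sample = [
    ("a", True), ("a", True), ("a", False), ("a", True),
    ("b", True), ("b", False), ("b", False), ("b", False),
]
rates = selection_rates(sample)
print(rates, max_rate_gap(rates))
```

A check like this only looks at one metric, so it complements the points above rather than replacing them.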

A world without data

It was late at night, my G.A.N. model was halfway through training, and I was only monitoring the output logs. I was thrilled that I had made it work! I remember looking at the logs, thinking of the old “It’s compiling” meme and relating to my fellow software engineers at Wizeline… then suddenly, it all went black.

… to be continued.

Nellie Luna

Posted by Nellie Luna on October 29, 2019