The Dirty Truth About Data Science Jobs

NHS Jobs Dirty Data

"Sexiest job of the 21st century. Data scientists spend 60% of their time on cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time preparing and managing data for analysis. 76% of data scientists view data preparation as the least enjoyable part of their work: 57% regard cleaning and organizing data as the least enjoyable part, and 19% say this about collecting data sets." ~ Forbes ~

The most fundamental prerequisite for a Data Science project is access to a real-world production dataset! It is needed to capture the full breadth of real-world user data, both its statistical norms and its outlier dirty data. The first task when encountering any dataset is an EDA (Exploratory Data Analysis). Reality is far more creative than any QA tester could imagine. Grayface showed me production examples of unvalidated user TOWN names, including "at home" and "London / Birmingham / Manchester". Both are dirty data when you are expecting to encode a scalar geocode. The first rule of Frontend Development is: never trust user data!

In retrospect, I should have had the foresight to request production dataset access as a day-one client non-negotiable. It should be treated as a critical-path due-diligence question for future projects. In practice, STAGING appeared halfway through the project, marking a team milestone into UAT. This was the first time we could see the website rendered with realistic data and observe how search queries behaved in practice. NHS DevOps had a shortage of environments: DEV was performing the role of CI, so STAGING was performing the role of DEV. Did STAGING dataset access mark the symbolic beginning of a new Data Science project, or a push-to-production UAT code freeze? There is a larger industry-wide problem of expecting Data Science projects to be conducted without access to production datasets.
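The dirty TOWN names above can be caught with a first-pass validator. This is a hypothetical sketch, not the project's actual code: `clean_town` and the tiny `KNOWN_TOWNS` reference set are invented for illustration, standing in for a real gazetteer or geocoding lookup.

```python
import re

# Hypothetical sketch: a first-pass validator for a free-text TOWN field,
# assuming some reference list of known town names (here a tiny sample).
KNOWN_TOWNS = {"london", "birmingham", "manchester", "leeds"}

def clean_town(raw):
    """Return a canonical town name, or None if the value is dirty."""
    value = raw.strip().lower()
    # Reject multi-town answers like "London / Birmingham / Manchester"
    if re.search(r"[/,;&]| and ", value):
        return None
    # Reject anything not in the reference list, e.g. "at home"
    if value not in KNOWN_TOWNS:
        return None
    return value.title()

# The production examples quoted above:
print(clean_town("London"))                            # "London"
print(clean_town("at home"))                           # None
print(clean_town("London / Birmingham / Manchester"))  # None
```

Rejected values would then be routed to manual review or a fuzzy-match step rather than silently geocoded.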
I was spoiled by Kaggle, which always provided a nice, clean dataset on day one. The PI School of AI taught me how to build my own datasets, metrics, and leaderboards from scratch. Public Domain datasets are a luxury. Sometimes dataset access requires understanding a bureaucracy well enough to ask for every special exception and appeal in the book! Kaggle uses SDV/CTGAN to anonymize private datasets for public competitions. Many gatekeepers only have the power to say NO! Who are the real decision-makers with the power to say YES?

~ Saving NHS Jobs (p143) ~

User Acceptance Testing
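The synthetic-data idea behind SDV/CTGAN, mentioned above, can be illustrated in vastly simplified form: fit value frequencies per column on a private table, then sample new rows from those distributions. Everything here is hypothetical illustration with invented helper names; a real CTGAN models joint column dependencies with a GAN, whereas this stdlib sketch treats columns as independent.

```python
import random
from collections import Counter

def fit_marginals(rows):
    """Count value frequencies per column of a list-of-dicts table."""
    marginals = {}
    for row in rows:
        for col, val in row.items():
            marginals.setdefault(col, Counter())[val] += 1
    return marginals

def sample_synthetic(marginals, n, seed=0):
    """Draw n synthetic rows, each column sampled from its own marginal."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights)[0]
        out.append(row)
    return out

# Invented toy "private" table, purely for illustration
private = [
    {"town": "London", "band": "5"},
    {"town": "Leeds",  "band": "6"},
    {"town": "London", "band": "7"},
]
synthetic = sample_synthetic(fit_marginals(private), n=5)
print(synthetic)
```

The synthetic rows preserve per-column statistics without reproducing any original record verbatim, which is the basic bargain that lets a private dataset be shared for public competition.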