How I Approach a New Dataset: From Raw File to a Decision
It usually starts the same way. A spreadsheet lands in my inbox, the column names are a mix of cryptic abbreviations and outright lies, and the message attached says something like 'can you pull the numbers on this by Friday?'. No schema, no definitions, and a question vague enough to mean five different things. Over time I have stopped treating each of these as a one-off scramble and started running the same workflow every time. Here is what that looks like, from raw file to a conclusion I would happily put my name on.
Start with the question, not the data
The strongest temptation, and one I still have to resist, is to open the file and start poking. It feels productive. It almost never is. Before I load a single row I try to pin down what decision this analysis is meant to inform, because the same dataset answers 'are we losing customers?' and 'which customers are we losing?' very differently, and only one of them is usually what the person actually needs.
So I ask, even if I have to ask on their behalf: what would change depending on the answer? If nothing would change, the analysis is trivia, and it is worth saying so kindly before spending two days on it. Framing the question tightly up front saves more time than any tool or shortcut I know.
Profile before you analyze
Once I know what I am looking for, the first contact with the data is not analysis, it is reconnaissance. I want to understand the shape and the health of the dataset before I trust a single number that comes out of it.
- The shape: how many rows and columns, and is that even close to what I expected?
- Types: is a date stored as a date or as text, is a number actually numeric or secretly a string with stray commas?
- Missing values: which columns have nulls, how many, and is the missingness random or suspiciously patterned?
- Ranges and outliers: minimums, maximums, and anything physically impossible like a negative age or a future signup date.
- Duplicates and cardinality: are the 'unique' identifiers actually unique, and do categorical columns have the handful of values I expect or two hundred messy variants?
None of this is glamorous and all of it is load-bearing. Most wrong conclusions I have seen, including a few of my own, trace back to a number that was quietly broken at this stage and never questioned.
Cleaning is where the truth hides
Data cleaning has a reputation as the boring part, the chores you do before the real work. I have come to believe the opposite: cleaning is the real work, because the decisions you make here silently shape every result that follows. How you handle a missing value, whether you drop a row or impute it, how you collapse 'NY', 'New York', and 'new york ' into one category, all of it is analysis wearing overalls.
My rule is to make every cleaning decision explicit and reversible. I never edit the source file. I write the transformations as code so the path from raw to clean is documented, reviewable, and re-runnable when the data refreshes next month. If I cannot explain why a row disappeared, that is a bug, not a tidy-up.
Analysis in small, testable steps
With clean data, the analysis itself is often less dramatic than people expect. I work in small steps and check each one before building on it, the same way I would write code. Compute a number, then sanity-check it against something I already believe to be true. If the totals do not reconcile with a figure someone trusts, I stop and find out why before going a step further.
I also try to answer the question the simplest way that could possibly work before reaching for anything fancy. A clear group-by and a couple of well-chosen comparisons settle most questions. Sophisticated methods are worth it when the simple version genuinely is not enough, not as a default flex.
Validate like you are trying to prove yourself wrong
This is the habit I would most want to pass on. Once I have a result I like, I deliberately try to break it. A finding that survives a real attempt to disprove it is worth far more than one I merely hoped was true.
- Do the parts sum to the whole, and do my segment numbers reconcile with the overall total?
- Would the conclusion survive a slightly different cut of the data, or does it hinge on one fragile choice?
- Is the effect large enough and the sample big enough to be real, or am I reading meaning into noise?
- Could a simpler explanation, a data artifact, a seasonal pattern, a reporting lag, account for what I am seeing?
If a result only holds under one specific framing, that is not an insight, it is a coincidence I got attached to.
The deliverable is a decision, not a chart
Finally, I try to remember what the work is actually for. A folder of charts is not a deliverable; a clearer decision is. So I lead with the answer to the original question in plain language, show the one or two visuals that genuinely support it, and stay honest about what the data cannot tell us. The caveats are not hedging, they are part of being trustworthy.
I have learned that how a finding is framed matters almost as much as whether it is correct. The same true result can be acted on or ignored depending on whether the person hearing it understands what it means for the choice in front of them.
The goal was never the analysis. It was a decision made with a little more clarity and a little less guesswork than before.
None of this requires exotic tools. It is mostly discipline: frame the question, distrust the data until it earns your trust, make every step explicit, and try hard to prove yourself wrong before you let anyone else rely on the answer. Do that consistently and the messy Friday CSV stops being a scramble and starts being a process.