Compare, Contrast, and Evolve: A Data Quality Literature Review
A survey of how data quality has been defined, measured, and debated across decades of research — and why the field still lacks the practical, quantitative foundation modern AI demands.
Writing
Essays on AI strategy, machine learning in practice, and what it actually takes to build organizations that benefit from both.
Data doesn't need to be perfect to be useful. It needs to be fit for a specific purpose. A framework for evaluating data quality against the use case that actually matters.
How do you measure whether data is good enough to guide the distribution of critical resources? An applied case study in defining and quantifying data quality for a specific, consequential use case.
Building a model to predict COVID-19 deaths is hard enough. Building one that accounts for how the underlying data changes over time is harder — and significantly more honest.
How America's COVID-19 data was actually made — the fragmented public health infrastructure, political decisions, and reporting failures that shaped what we knew and when we knew it.
Tracking California's COVID-19 death data daily — watching numbers disappear, reappear, and rewrite history — and what it reveals about the state of modern data.
We have built sophisticated ways to use data — machine learning, generative AI, real-time models. We have not built the methods to know whether that data is worth using.
Standard line graphs hide how data changes over time. Shifted line plots, filtered heat maps, lag plots, bifrost plots, and impact plots — a visualization toolkit for data that revises itself.
Data doesn't need to be perfect. But its imperfections must be understood. A closing argument for building a practical, urgent field of data quality research.
A practical walkthrough of building a multi-agent AI system that researches a topic, drafts course content, and evaluates its own output — with real lessons from building it.
A talk on the real-world application of generative AI in industry: what's working, what's hype, and where the field is actually headed. Delivered at ODSC West's Gen AI X Summit, 2024.
The boardroom conversation about AI has outpaced most leaders' ability to evaluate what they're hearing. Here's how to ask better questions — and what the answers should sound like.
The full dissertation submitted to Cornell University's Field of Statistics, 2023. A rigorous treatment of data quality applied to America's COVID-19 data — the problems, the metrics, and a framework for thinking about fitness for use.
The future of AI in education isn't more data — it's better data. A talk on why data selection, not data volume, is the critical design decision for effective AI in learning contexts.
A panel conversation on leadership, career trajectories, and what it takes to advance equity in data science and analytics. Data Leaders USA, 2023.
Tips and tricks for every level of practical data science — from the talk at ODSC East 2023 that resonated far beyond the conference room.
The pain of sharing code across notebooks is real. jupyckage solves it — one command, proper package structure, no boilerplate.
A practical, candid guide to breaking into data science — what hiring managers actually look for, how to build a portfolio that stands out, and how to navigate the transition from wherever you're starting.
Most organizations treat AI ROI as a measurement problem. It's actually a strategic one. The questions you ask before you build determine everything that comes after.
A keynote from the GET Cities Kickoff Summit on what it takes to build technology that moves the world rather than just moving fast.
What every professional — technical or not — actually needs to understand about data in order to make better decisions, ask better questions, and hold data-driven claims to a higher standard.
Data scientists run cross-validation constantly — but many do it without understanding why. Here's the full reasoning, from first principles to production.
Curiosity is a data scientist's greatest asset and most dangerous liability. How to follow interesting threads without losing weeks — and how to know the difference between a distraction and a discovery.
What actually makes a data science portfolio stand out — project selection, storytelling, demonstrating judgment and impact rather than technical execution alone.
Data science teams spend months building models that technically work — and fail to move the business. The culprit is almost always the same: an unstated assumption between output and outcome.
AI, machine learning, models, features — the words get thrown around in every boardroom. Here's what they actually mean, in plain language, with no condescension.
A keynote for nonprofit and civic sector leaders on how the data revolution was reshaping civil society — the opportunities, the risks, and what organizations need to do to use data without losing sight of the people behind it.
A talk on the human dimension of data work — how data can illuminate community, identity, and connection, and what it means to use data in service of people rather than in spite of them.