Data Scientist

1. What does a data scientist actually do? 🤔

In one sentence

Imagine cramming a detective, a statistician, and an interpreter into a single person. Except the “case” isn’t a murder mystery but a business mystery like “Why have our app’s sign-ups been dropping since last month?”, and the clue isn’t a bloody knife but a messy dataset millions of rows long, way bigger than any spreadsheet. 📊

A data scientist uses data to do things like this:

Defining the problem: This is where the real work begins. You translate a vague request like “boost our revenue” into a solvable question like “which customer segments are churning, at what point, and why?”
Collecting and cleaning data: Scraping together data scattered all over the place and cleaning up the blanks, typos, and outliers (this is genuinely 50–80% of the job… not glamorous)
Exploration and analysis (EDA): Turning the data over and over to find patterns and the weird stuff
Modeling: Building things like prediction models, recommendation engines, or churn predictors (this is where machine learning shows up)
Causal inference: Telling the difference between “these move together” (correlation) and “this caused that” (causation). This is where the real experts live.
Communication: Explaining the results in a single chart and a single sentence so even an executive gets it (no matter how good your model is, if you can’t convince anyone, it won’t get used)
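
To make the “collecting and cleaning” step a bit more concrete, here’s a minimal Python sketch, not a recipe. Everything in it is made up for illustration: the records, the field names (user_id, country, age), the COUNTRY_FIXES lookup, and the 0–120 age range. Real cleaning rules depend on your data and your domain.

```python
import statistics

# Hypothetical raw sign-up records: blanks, a typo'd country, and one absurd age.
raw_rows = [
    {"user_id": "1", "country": "KR", "age": "29"},
    {"user_id": "2", "country": "kr ", "age": ""},
    {"user_id": "3", "country": "US", "age": "34"},
    {"user_id": "4", "country": "Untied States", "age": "31"},
    {"user_id": "5", "country": "US", "age": "340"},
    {"user_id": "6", "country": "", "age": "27"},
]

# Assumed lookup of known misspellings; in practice you build this by eyeballing the data.
COUNTRY_FIXES = {"UNTIED STATES": "US", "UNITED STATES": "US", "KOREA": "KR"}


def clean_country(value):
    """Trim, upper-case, and map known typos; return None for blanks."""
    text = value.strip().upper()
    if not text:
        return None
    return COUNTRY_FIXES.get(text, text)


def clean_age(value, low=0, high=120):
    """Parse age as an int; treat blanks and out-of-range values as missing."""
    text = value.strip()
    if not text.isdigit():
        return None
    age = int(text)
    return age if low <= age <= high else None


def clean_rows(rows):
    cleaned = [
        {
            "user_id": int(r["user_id"]),
            "country": clean_country(r["country"]),
            "age": clean_age(r["age"]),
        }
        for r in rows
    ]
    # Fill missing ages with the median of the valid ones (one simple choice among many).
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    median_age = statistics.median(known_ages)
    for r in cleaned:
        # Keep a flag so later analysis can tell real values from filled-in ones.
        r["age_was_missing"] = r["age"] is None
        if r["age"] is None:
            r["age"] = median_age
    return cleaned


cleaned_rows = clean_rows(raw_rows)
for row in cleaned_rows:
    print(row)
```

Notice that the sketch keeps an age_was_missing flag instead of silently overwriting values. Whether to fill, drop, or flag a bad value is a judgment call, and that judgment is a big part of why cleaning eats so much of the job.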

Let me show you a snapshot of “a day in the life of a data scientist” (not an exact schedule, just the vibe):

Morning: Slack is piled up with “Why is this number like this?” questions. You pull the data, fire off queries (SQL), and check your hypotheses one by one in a notebook (Jupyter).
Midday: Meeting with the product team. You’re hashing out “Are we even solving the right problem?” Half of it is data, half of it is dealing with people.
Afternoon: You run a model and its accuracy is suspiciously high. (That’s not a reason to celebrate; it’s usually a sign of a bug like data leakage.) Debugging begins.
Evening: You distill what you found into a single slide. You write down clearly “So here’s what we should do.” This one sentence is often far more important than the code.
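
The “suspiciously high accuracy” moment from the afternoon can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the data is synthetic, the field names (logins_last_month, cancellation_fee, churned) are invented, and the 0.98 threshold is arbitrary. The idea is a quick screen: if a single feature predicts the label almost perfectly on its own, it probably contains information that wouldn’t exist yet at prediction time.

```python
import random

random.seed(0)


def make_customers(n=2000):
    """Synthetic churn data. All field names here are made up for illustration."""
    rows = []
    for _ in range(n):
        logins = random.randint(0, 30)
        # Fewer logins -> more likely to churn, but with plenty of noise.
        churn_prob = 0.7 if logins < 10 else 0.2
        churned = random.random() < churn_prob
        rows.append({
            "logins_last_month": logins,
            # LEAK: this fee is only ever charged *after* a customer cancels,
            # so it would not exist yet at prediction time.
            "cancellation_fee": 1 if churned else 0,
            "churned": churned,
        })
    return rows


def best_single_feature_accuracy(rows, feature):
    """Accuracy of the best 'feature <= t' (or flipped) rule on the given rows."""
    labels = [r["churned"] for r in rows]
    best = 0.0
    for t in sorted({r[feature] for r in rows}):
        preds = [r[feature] <= t for r in rows]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(rows)
        best = max(best, acc, 1 - acc)  # 1 - acc covers the flipped rule
    return best


def flag_suspicious_features(rows, features, threshold=0.98):
    """Return features that predict the label almost perfectly on their own."""
    return [f for f in features if best_single_feature_accuracy(rows, f) >= threshold]


customers = make_customers()
features = ["logins_last_month", "cancellation_fee"]
accuracies = {f: best_single_feature_accuracy(customers, f) for f in features}
suspicious = flag_suspicious_features(customers, features)
print(accuracies)
print("Check these for leakage:", suspicious)
```

A flag from a screen like this isn’t proof of leakage, just a prompt to go ask “when is this column actually filled in?” That question usually ends the debugging faster than any amount of model tuning.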

The coolest part? You’re constantly switching modes: an engineer writing code, a statistician staring at numbers, a consultant persuading the room, and a critical thinker who can say “Actually, isn’t this question itself wrong?” You do all of it in a single day.

Why this job is awesome ✨

Let me be honest with you. Data scientist was once called “the sexiest job of the 21st century” (Harvard Business Review, 2012, co-authored by DJ Patil). Even now that the hype has died down, the reasons it’s awesome are clear.

You get your hands on real decisions. What Netflix recommends, who a bank approves for a loan, which patient a hospital sees first: behind each of these decisions there’s a data scientist. A single line in the model you built can change the experience of millions of people.

Digging all the way into “why?” becomes your job. For a curious person, this is heaven. Poking into “what’s this pattern?” is the work, and when you find the answer they pay you and praise you for it.

There are genuinely rewarding moments too:

When you discover an insight nobody knew and the company changes direction (“Wait, that feature we thought was a failure is actually the thing keeping our core customers around?”)
When a single clean chart makes the whole room go “ohhh…”
When the revenue graph bends upward thanks to the recommendation system you built

On top of that, the future keeps getting more interesting. Things like LLMs, generative AI, causal inference tools, and MLOps are opening new doors that the previous generation of data scientists never had. (More on that in section 2.)

A cold reality check ⚠️

If you’re considering becoming a data scientist even a little, you deserve to know the truth, not the Instagram highlight reel.

80% of the work isn’t glamorous. The “building cool AI” you see in movies is one slice of the job. The reality is data cleaning, filling in blanks, standardizing formats, hunting down a value someone entered wrong. There’s even a running joke that “80% of data science is data cleaning, and the other 20% is complaining about data cleaning.” 😅

The “ambiguous job” trap. What a data scientist actually does varies wildly from company to company. At one place you build machine learning models, at another you just build dashboards (which is really closer to a data analyst), and at another you build data pipelines (that’s a data engineer). Before you join, you absolutely must ask “What exactly does a data scientist do at this company?”

The gap between expectations and reality is huge. Executives often misunderstand data science as “magic” and expect unrealistic things, like “Use AI to predict next quarter’s revenue exactly.” That mismatch is a big reason behind the often-quoted statistic that the average tenure of a data scientist is only 1.7 years.

Correcting a misconception: data science isn’t about “a genius building AI alone.” Most of it is messy reality + persuading people + relentless debugging. You have to be able to enjoy that to last.

2. Will this job still be promising in the future? 📈

A reality check on the job market

Good news: demand is still strong. According to the US Bureau of Labor Statistics (BLS), data scientist is one of the fastest-growing occupations, with about 23,400 job openings projected per year from 2024 to 2034 and projected employment growth of about 34% over that decade (several times the average across all jobs). McKinsey projected that US demand for data scientists would exceed supply by more than 50%.

Bad news: that doesn’t mean getting in is easy. A polarization is underway where “junior positions shrink while demand for senior/specialized talent grows.” People who can do only basic analysis have become common, and as AI automates that part, the value of “just an average data scientist” is dropping. You need to differentiate.

Will AI replace this job?

This is Reputo’s core perspective. AI isn’t replacing data scientists; it’s reshaping the role. Let’s look precisely at what’s happening.

What AI/LLMs are absorbing (work whose value is dropping):

Writing code: ChatGPT/Claude can crank out pandas code, SQL, and visualizations in an instant
Basic EDA (exploratory analysis): “summarize this data” is now something LLMs do
Basic model tuning: AutoML and agents automate model comparison and hyperparameter search

In fact, the industry says the data scientist’s role is shifting from a “doer” to an “orchestrator.” Breaking a complex task into small chunks that AI agents can execute, designing feedback loops, and building guardrails that catch the AI when it’s wrong: these are the new core competencies.

But here’s what AI can’t do, so its value is going up:

Defining the problem: “Which question should I turn this business situation into?” No matter how good the model is, solving the wrong question scores a zero. AI can’t do this.
Causal inference: This is the real core. An LLM is a correlation engine, so it can explain causal inference techniques but it can’t actually do causal inference. Causal inference requires understanding the data-generating process, intervening on variables, and reasoning about counterfactuals that never once appeared in the training data. “Should we raise the price?” “Should we give this customer a discount?” These “what should we do” questions belong to the realm of causation, not prediction, and they’re where AI is weakest.
ML system design: Reliably putting a model into a real service (MLOps), monitoring it, and preparing for when it breaks is still a human’s job.
LLM evaluation: Paradoxically, the work of verifying whether AI-generated output is correct is becoming a new job. Judging “Is this LLM output trustworthy?” is the data scientist’s new weapon.
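
Here’s a small Python simulation of why “should we give this customer a discount?” is a causal question and not a prediction question. It’s a toy under stated assumptions, not a real causal inference method: the customers are synthetic, the true effect of a discount is fixed at $5 by construction (TRUE_DISCOUNT_EFFECT), and the targeting rates and spend levels are invented. Because we built the data-generating process ourselves, we can compare a naive comparison of observational data against a randomized experiment.

```python
import random

random.seed(42)

TRUE_DISCOUNT_EFFECT = 5.0  # assumed: a discount adds $5 of spend on average


def simulate(n, randomized):
    """Return (got_discount, spend) pairs for n synthetic customers."""
    rows = []
    for _ in range(n):
        loyal = random.random() < 0.5
        if randomized:
            # Experiment: a coin flip decides, independent of loyalty.
            discount = random.random() < 0.5
        else:
            # Observational: marketing mostly targets loyal customers.
            discount = random.random() < (0.8 if loyal else 0.2)
        base = 60.0 if loyal else 30.0  # loyal customers spend more anyway
        spend = base + (TRUE_DISCOUNT_EFFECT if discount else 0.0) + random.gauss(0, 5)
        rows.append((discount, spend))
    return rows


def mean_difference(rows):
    """Average spend with a discount minus average spend without."""
    treated = [s for d, s in rows if d]
    control = [s for d, s in rows if not d]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Naive comparison on observational data mixes the discount effect with loyalty.
naive_estimate = mean_difference(simulate(20000, randomized=False))
# Randomization breaks the link between loyalty and discount.
experiment_estimate = mean_difference(simulate(20000, randomized=True))

print(f"True effect:         {TRUE_DISCOUNT_EFFECT:.1f}")
print(f"Naive estimate:      {naive_estimate:.1f}")
print(f"Experiment estimate: {experiment_estimate:.1f}")
```

With these made-up numbers, the naive gap comes out several times larger than the true effect, because discounted customers were mostly loyal ones who spend more anyway. The randomized estimate lands close to 5. Knowing that loyalty drives both the discount and the spend is exactly the “understanding the data-generating process” part.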

One-line summary: Analysis, modeling, and coding get automated, and the value shifts to problem definition, causal inference, ML system design, and LLM evaluation. The people who climb up to this higher level use AI as a superpower, not a threat. They make AI do the grunt work while they focus on “which question to solve” and “is this really causation.”

💰 Actual salaries

The question students always ask: “So… how much does a data scientist make?” Let me answer with real numbers.

🇺🇸 United States (USD, total compensation = base + stock + bonus, levels.fyi/Glassdoor 2026):

Overall median: about $155,000 ~ $176,000 (roughly 210 million ~ 240 million KRW)
Entry-level: about $152,000 ~ $190,000 (Google L3 new grads at ~$190K), up about $40K from 2025
Big Tech median: Google $335K, Meta $288K, Amazon $250K, Microsoft $248K (roughly 340 million ~ 460 million KRW)
Senior: at Google, recent offers typically land between $310,000 ~ $410,000

🇰🇷 South Korea (KRW, as of 2026):

Entry-level: starts around 33 million KRW
5 years in: around 55.9 million KRW
10 years in: around 83.7 million KRW
Overall average: around 57.45 million KRW
The salary gap between large corporations and SMEs is about 12.85 million KRW. Company size and industry (domain) heavily determine pay.

Reality check: don’t let the US numbers spin your head. The US has different costs of living, taxes, and work-visa barriers, and Korea’s Big Tech (Naver, Kakao, Coupang, Toss, etc.) or foreign companies pay far above the Korean average. And here’s the key point: someone with “experience directly defining a business problem and solving it with a model” earns noticeably more than someone who just repeats simple analysis. That “value going up” area I mentioned above is literally your salary.

Is it right for me? (self-assessment)

Think of it like a game character build. Data science rewards certain stats.

It’s a perfect fit for people who:

Are very curious: people for whom “why is this?” is a verbal tic
Can tolerate ambiguity: people who enjoy messy problems with no predetermined answer
Are logical but also communicate: people who can both read the numbers and explain those numbers to a person
Are meticulous: a single small error in the data can flip your entire conclusion
Are skeptical: people who can think “This result is too good. Isn’t there a bug somewhere?”

Essential aptitudes (not optional):

Math/statistics foundations: probability, statistics, a bit of linear algebra (use them as a black box and you’ll eventually hit a wall)
Coding: at minimum Python, and SQL is practically required
Business sense: the ability to understand why the technique is needed

Honestly, it might be tough for people who:

Can only feel at ease when the answer is clear-cut (data science is a world of “probably” and “with about this much probability”)
Hate persuading people with a passion (if you’re great at analysis but can’t communicate, you won’t get recognized)
Get bored quickly with repetitive work like data cleaning

Work-life balance: the field itself is better than being a doctor or in investment banking, but stress can run high because of deadlines and unrealistic expectations. (I’ll lay it out honestly in section 3.)

3. The cold truths you must know: the downsides ⚠️

Stress and the expectations mismatch

The hardest thing about this job isn’t actually the data, it’s people’s expectations.

Executives misunderstand data science as “magic” and demand unrealistic things (“Use AI to predict next quarter’s revenue exactly”)
Projects are usually rushed for time, requirements keep changing, and stakeholder feedback pours in endlessly
When an analysis you poured weeks into gets a reaction of “Um, so what?”, that really gets to your mental health

Burnout in data roles is real. One survey (DataKitchen) found that 97% of data engineers have experienced burnout, and data scientists feel similar pressure from unrealistic expectations, overwork, and lack of recognition.

The “invisible work” that’s hard to get credit for

Data science is often undervalued. You can stay up several nights to deliver a clean analysis, but to people’s eyes it just looks like “a single chart.” The struggle of the data cleaning, debugging, and verification behind it is invisible. You’ll spend your whole career fighting the misconception of “What exactly does data science even do?”

High turnover (the 1.7-year mystery)

There’s a statistic that the average tenure of a data scientist is 1.7 years. Why so short?

Organizations misunderstand the data scientist’s role (dumping the work of analysts and engineers on them)
Unrealistic expectations + uncooperative data infrastructure
The frustration of “the model I built never actually ships to the service” (many projects end after just a PoC)

This is often a structural industry problem, not your personal failure. That’s why choosing “a company with a mature data culture” is as important as the salary.

Economic and career realities

The pay is good, but the title “data scientist” doesn’t guarantee the job. At some places you’re a high-level analyst, at others you’re worked like an engineer.
The tech changes fast. The skills that were hot five years ago are now automated by LLMs. Lifelong learning isn’t an option, it’s a survival condition.
As AI encroaches on basic tasks, the seats for “data scientists who only do the basics” are shrinking. If you don’t climb up, you’re at risk.

Stories from people who quit

Common regrets/reasons from people who left data science:

“I barely ever saw my analysis reflected in an actual decision. I got exhausted making nothing but PoCs”
“I didn’t know data cleaning was the whole job. I thought I’d be building cool AI”
“The politics of persuading executives was harder than the analysis”

Bottom line: if you’re overflowing with curiosity, enjoy ambiguity, can persuade people, and are prepared to keep climbing upward (problem definition, causation, systems) in the AI era, data science is still an amazing path. But if you come in only seeing the picture of “an AI genius building a cool model alone,” you can get burned by the real-world data cleaning and office politics.

4. The legends of this field 🏆

Data science’s legends aren’t just “straight-A geniuses.” Someone who hated math, someone who taught themselves without an elite degree, an immigrant who couldn’t speak a word of English, someone who got rejected from a PhD program: these people built an entire field from scratch.

DJ Patil: the person who coined the word “data scientist”

Did you know that DJ Patil once hated math, and got rejected by Google and Yahoo?

Raised by Indian immigrant parents, he hated math in his school days, but later earned a PhD in applied mathematics at the University of Maryland and fell into the world of data. Fun fact: his first job was at eBay, and that was because his mother knew someone there. Not a glamorous start.

The real turning point was LinkedIn. Working there as the head of data products in 2008, he and Jeff Hammerbacher (then at Facebook) realized there was no word for this new thing they were doing, and coined the job title “Data Scientist.” In 2012 he co-authored the Harvard Business Review piece calling it “the sexiest job of the 21st century,” and in 2015 he was personally tapped by President Obama to become the first-ever U.S. Chief Data Scientist at the White House. The advice he always gives students is simple: “Build a portfolio with real projects, show impact, not code.”

Hilary Mason: the pragmatic data scientist who proved herself with a blog

Did you know that Hilary Mason got rejected from a PhD program, but became famous precisely by sharing that failure on her blog?

She majored in computer science at Grinnell College and started her career in academic machine learning. But she soon realized she was more drawn to building things people would actually use than to writing papers. So she turned away from academia and toward the startup world.

Her springboard was the position of chief scientist at the URL-shortening service bitly. There she led a team for four years studying “how people’s attention moves across the internet in real time.” In 2014 she co-founded Fast Forward Labs with a colleague, doing the work of translating cutting-edge machine learning research into something companies could actually use, and the company was acquired by Cloudera in 2017. Her philosophy compresses into a single sentence: “Ship a messy prototype every day, perfectionism kills innovation.” It’s a case of becoming a data science icon through the habit of “build and share,” even without an academic PhD.

Fei-Fei Li: from an immigrant who couldn’t speak English to “the Godmother of AI”

Did you know that Fei-Fei Li came to the US at age 15 without a word of English, working in her family’s dry cleaners on weekends while she studied?

Born in Beijing and raised in Chengdu, China, she immigrated to New Jersey with her parents at 15. They started out in a one-room apartment, her father fixing cameras and her mother working as a cashier. She attended school while working on weekends at the dry cleaners her family ran. Yet she scored a perfect mark in math and entered Princeton’s physics department on a full scholarship.

Her greatest achievement is ImageNet (2006–). At the time, AI researchers were obsessed only with “smarter algorithms,” but Fei-Fei thought the opposite: “For a computer to see the world, it first needs an enormous amount of labeled data.” So she built a giant dataset of millions of images, each classified one by one by humans. At first she was even mocked with “how is that research,” but this dataset became the spark of the 2012 deep learning revolution and the foundation of today’s facial recognition and self-driving cars. She proved a core lesson of data science: data matters as much as the model. Now she co-directs Stanford’s Human-Centered AI Institute (HAI) and is called “the Godmother of AI.”

Cassie Kozyrkov: the person who turned “decisions” into a science

Did you know that Cassie Kozyrkov, an immigrant from South Africa, bombed at communication in her early consulting work, then fixed it and created Google’s first-ever role for it?

Coming from South Africa to the US, she broke through cultural barriers to study math and physics. The biggest problem she saw while working as a data scientist wasn’t the technology, it was that “people make bad decisions with data.” Even when you build a great model, no one was properly figuring out what decision to make with it or how.

So she created an entirely new field, “Decision Intelligence.” It’s a discipline that ties together statistics, machine learning, psychology, and management to address “how do we make better decisions with data.” In 2017 Google appointed her the company’s first-ever “Chief Decision Scientist,” and she trained 20,000 people inside Google and influenced more than 500 projects. Her message is the future of data science itself: “Rather than chasing the perfect answer, ask a better question.” In an era where AI spits out answers automatically, she showed in advance that the value lies with “the person asking the question.”

Andrew Ng: the person who opened AI education to everyone

Did you know that a single online course Andrew Ng made was taken by 100,000 people, giving birth to one of the world’s largest online education platforms?

Born in London and raised mostly in Hong Kong and Singapore, he became a Stanford professor and served as the founding leader of the Google Brain team and chief scientist at Baidu. He’s written more than 100 papers in machine learning and robotics. But what truly made him a legend wasn’t research, it was education.

In 2011 he put Stanford’s machine learning course online for free, and more than 100,000 people enrolled. This became one of the first MOOCs (Massive Open Online Courses), and the next year he co-founded Coursera with Daphne Koller. In 2017 he founded DeepLearning.AI to make AI education even more accessible. Today, almost everyone self-teaching data science and machine learning passes through his courses in some form. As his line “AI is the new electricity” goes, he’s the person who turned AI from something for a few experts into something for everyone.

5. How do I prepare? 🎯

If you’re still a student (high schooler/university student)

You don’t need to be a “genius.” You need consistency and real projects.

Subjects to study (build a solid foundation):

Math/statistics: probability, statistics, linear algebra (the real backbone of data science; weak here and you’ll eventually hit a wall)
Programming: Python first, then SQL (the two languages for handling data)
Computer science basics: data structures and algorithms are enough
If there’s a statistics class, take it no matter what. Even if AI writes the code, judging “whether this statistic makes sense” is on you.

Skills to develop (the things that actually make a difference):

“The power to ask questions”: the habit of looking at data and throwing out “why?” and “so what should we do?”
Storytelling: practice explaining analysis results to a person (presentations, a blog, anything)
Skeptical thinking: suspecting “Is this result real? Is it correlation or causation?”
The skill of using AI as a tool: having ChatGPT/Claude write code and being able to verify the result

Projects you can start this week (for real):

Grab one beginner dataset on Kaggle (kaggle.com) (like the Titanic survival prediction) and analyze it from start to finish
Take public data on a topic you’re interested in (sports records, YouTube stats, your neighborhood’s air quality, etc.) and make a small analysis + chart
Write up that analysis on a blog or GitHub. As with Hilary Mason, the habit of “build and share” becomes your portfolio
Retype a Kaggle notebook and annotate “why this code is used” line by line (learning as if you’re teaching is the cheat code)

The goal isn’t “padding your résumé.” It’s making a small piece that proves “I can take messy data and carry it all the way to a meaningful conclusion.” As DJ Patil said, show impact, not code.

If you’re switching from another field

Data science is one of the fields with the most active transitions. Almost any background that has “worked with numbers” becomes an asset: statistics, economics, physics, psychology, marketing. (In fact, people with domain knowledge are strong. Medical data is best solved by someone who knows medicine, financial data by someone who knows finance.)

Things that transfer well:

Domain expertise: the ability to define the “real problems” of an industry you already know (the part AI can never do!)
Analytical thinking and statistics: if you have research experience, you’re already halfway there
Communication: the ability to persuade with results works no matter which field you came from

Realistic expectation: you’ll have to newly learn the basics of Python, SQL, and machine learning. But because it’s not “starting from zero” but the combination of “your existing strengths + data skills,” you can actually be more competitive than a pure newcomer. Focusing for 6–12 months with Andrew Ng’s online courses can get you a portfolio.

Essential skills

Let me organize a practical skill stack by priority:

Top priority: statistics/probability
- Why: the one skill that has only become more important in the AI era. It’s the basis for judging “can I trust this model’s result”
- Resources: Introduction to Statistical Learning (free) from section 6, StatQuest on YouTube
Top priority: Python + SQL
- Why: the basic tools for touching data. Even when LLMs help with code, you need to be able to read and fix it
- Resources: Python for Data Analysis from section 6, Kaggle’s free courses
Top priority: problem definition & causal thinking
- Why: the area AI can’t automate. This is where your salary is decided
- Resources: Cassie Kozyrkov’s Decision Intelligence writings, an introductory book on causal inference
Important: machine learning
- Why: still a core tool. But this is where “understanding the principles and using them” vs “copy-pasting” diverge
- Resources: Andrew Ng’s machine learning course, Hands-On Machine Learning
Important: communication & visualization
- Why: an analysis that can’t persuade doesn’t get used. The power of a single chart, a single sentence
- Resources: running a blog, practicing data visualization

6. Learning resources 📚

Must-read books

There’s a reason these books are famous. They show you how data scientists actually think. And one free bonus book:

An Introduction to Statistical Learning (free PDF): https://www.statlearning.com/, the statistical learning textbook most recommended to data science beginners. It’s explained so the equations aren’t scary. There are both R and Python versions.

Recommended online courses

Don’t just read theory, type the code yourself while taking the course. That’s real studying. Additional strong recommendations:

Andrew Ng’s Machine Learning Specialization (Coursera/DeepLearning.AI): https://www.deeplearning.ai/courses/machine-learning-specialization/, made by the legend himself, the gold standard for an intro to machine learning
fast.ai, Practical Deep Learning for Coders (free): https://course.fast.ai/, a “code first, theory later” approach rather than “math first.” The best for people who want to build something fast

Free materials (learning without spending money)

We live in an age where even students with light wallets can use world-class materials for free:

Practice platforms
- Kaggle: https://www.kaggle.com/, real datasets + competitions + free mini-courses (Python, Pandas, ML, SQL). The playground for getting into data science
- Google Colab: https://colab.research.google.com/, free notebooks that run Python and machine learning right in your browser with no installation
YouTube (building intuition for theory)
- StatQuest with Josh Starmer: https://www.youtube.com/@statquest, the channel that makes you truly understand statistics and machine learning. One “BAM!” and the concept sticks
- 3Blue1Brown (intuition for linear algebra/calculus): https://www.youtube.com/@3blue1brown
Reading
- Towards Data Science: https://towardsdatascience.com/, practical writing from working data scientists (causal inference, careers in the AI era, etc.)
- Cassie Kozyrkov’s writing (Medium): https://kozyr.com/, pieces that break down Decision Intelligence in plain terms

Communities

Data science isn’t something you do alone. Ask when you’re stuck, and learn by peeking at other people’s analyses:

Kaggle Discussions/Notebooks: https://www.kaggle.com/discussions, how other people solve the same data is out in the open. The best textbook there is
r/datascience (Reddit): https://www.reddit.com/r/datascience/, realistic career advice, honest talk from the industry
Joining a Kaggle competition as a team: hands-on experience + networking + portfolio all at once

One last thing. You don’t start this field perfectly prepared. As Hilary Mason says, “Ship a messy prototype every day.” Grab one Kaggle dataset and type the first line today. That’s the real start of being a data scientist. You’ve got this! 💪
