# Run this cell to set up packages for lecture.
from lec01_imports import *
Welcome to DSC 10! 👋
- DSC 10 is a guided tour of data science.
- It was developed by UC Berkeley in 2015 and adapted by UCSD in 2017.
- You'll learn just enough programming and statistics to do data science.
- We'll cover statistics without too much math – instead, we'll use simulation.
- This class lays the foundation for all other courses in the DSC major.
Agenda
- Course staff.
- What is data science?
- How will this course run?
- Fun demo.
Course staff
Instructor: Nishant Kheterpal (call me Nishant)
- BS in EECS at Berkeley; currently pursuing a PhD in Robotics at Michigan 〽️ 🏈.
- I've taken and helped teach the course DSC 10 was originally based on!
- Teaching at Michigan: grad-level autonomous vehicles and intro programming.
- Industry experience before grad school at Apple, GM, Ike, and Uber.
- Outside interests: 🚴‍♂️, 🏎️, 🥖, ✈️.
In addition, we have many other course staff members who are here to support you in discussion, office hours, and online.
- Undergraduate tutors: Jack Determan, Ashley Ho, Jason Huynh, Zoe Ludena, Nick Swetlin, Guoxuan Xu
- Stuffed panda mascot: Baby Panda. 🐼
Learn more about them at dsc10.com/staff, and come say hi in office hours!
What is "data science"? 🤔
Data science is about drawing useful conclusions from data using computation. Throughout the quarter, we'll touch on several aspects of data science:
- First 2 weeks: use Python to explore data.
  - Lots of visualization 📈📊 and "data manipulation", using industry-standard tools.
- Next 2 weeks: use data to make inferences about a population, given just a sample.
  - Rely heavily on simulation, rather than formulas.
- Last week: use data from the past to predict what may happen in the future.
  - A taste of machine learning 🤖.
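To make "simulation, rather than formulas" concrete, here is a minimal sketch that is not taken from the course materials. It estimates the chance of rolling at least one six in four rolls of a die by repeating the experiment many times, then compares the estimate with the exact formula. The dice question, the function name `simulate_at_least_one_six`, and the seed are all illustrative assumptions.

```python
import numpy as np

# Seed chosen arbitrarily (an assumption) so that the run is repeatable.
rng = np.random.default_rng(seed=10)

def simulate_at_least_one_six(num_rolls=4, repetitions=100_000):
    """Estimate P(at least one six in num_rolls rolls) by repeating the experiment."""
    # Each row is one repetition; each entry is a die face from 1 to 6.
    rolls = rng.integers(1, 7, size=(repetitions, num_rolls))
    has_six = (rolls == 6).any(axis=1)
    # The proportion of repetitions with at least one six is the estimate.
    return has_six.mean()

estimate = simulate_at_least_one_six()
exact = 1 - (5 / 6) ** 4  # formula-based answer, shown only for comparison
print(estimate, exact)
```

With enough repetitions the simulated proportion should land close to the exact value, which is the sense in which simulation can stand in for a formula.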
It can be fun, too!
The site The Pudding is home to several interactive data-rich articles.
Course logistics
Getting set up
- Ed: Q&A forum. All announcements will be made here. You should have gotten an email invitation; if not, join here.
- Gradescope: Where you will submit all assignments, and where all of your grades will live. You will be automatically added to Gradescope within 24 hours of enrolling in the course.
- DataHub: Where you will access and run all code in this class. Access at datahub.ucsd.edu. Learn how to use it in the first discussion section!
- We will not be using Canvas for anything!
First tasks
- Fill out the required Welcome Survey as soon as possible.
- Take the pretest, which will help you gauge your preparedness, brush up on prerequisite knowledge, and learn test-taking skills. Solutions will be posted on Wednesday.
Lecture
- Lectures will be in-person and recorded for viewing afterwards.
- Recordings can be found at podcast.ucsd.edu a few hours later.
- Slides/code from lecture will be linked on the course website, both in a "runnable" code format and as an HTML file (✏️), which you can save as a PDF and annotate on your tablet.
- We will try to make lectures engaging. Bring your laptop or tablet, if you have one.
Concept Check ✅ – Answer at cc.dsc10.com
Is it acceptable to recline your seat on an airplane?
A. Yes, you paid for the seat!
B. Only if the person in front of you reclined their seat first.
C. Only if you ask the person behind you and they're fine with it.
D. No, it's rude.
(We are always going to use the same link for Concept Checks, so you should bookmark it.)
Discussion
- The first discussion section is today after lecture. A tutor, Nick, will help you get set up with Jupyter notebooks, the programming environment we'll be using all quarter.
- In future discussions, you will practice with the conceptual ideas in the course and prepare for quizzes and exams by working through past quiz and exam problems (see practice.dsc10.com).
- Problem sets are posted online, so bring a computer or tablet to access them. As with quizzes and exams, though, you will answer the problems on paper.
- Problem sets aren't submitted anywhere.
- No podcasting; you need to be an active participant in these sessions to benefit.
Labs
- Labs refer to lab assignments, which are a required part of the course and help you develop fluency in Python and working with data.
- While working on labs, you'll be able to run autograder tests which tell you if your answers are correct.
- For labs, if you pass all autograder tests, you will get 100%!
- You must submit labs individually, but you can discuss ideas with others (no sharing code).
- All assignments, including labs, will be due at 11:59PM on the due date and submitted to Gradescope.
- The first lab will have submission instructions.
Homeworks and projects
- Homework assignments build off of skills you develop in labs.
- A key difference between homeworks and labs is that passing autograder tests does not guarantee a perfect score!
- In homeworks, we have "hidden tests" that are only run after you submit the assignment.
- The tests that are available to you within the assignment itself only verify that your answer is reasonable/on the right track.
- Again, you must work on homeworks yourself, but you can discuss ideas with other students (no sharing code).
- In the Midterm Project and Final Project, you will do a deep dive into a dataset! Projects are longer than homeworks, so we give you more time to work on them. They're also very rewarding!
- You can work on projects with partners, following these project partner guidelines. Both of you should actively contribute to all parts of the project.
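The difference between the visible tests and the "hidden tests" described above can be pictured with a purely hypothetical sketch. The function names, the expected value, and the tolerance below are invented for illustration and do not describe the course's actual autograder: a visible check only confirms that an answer is plausible, while a hidden check compares it against the correct value.

```python
def public_test(answer):
    """Hypothetical visible check: is the answer a plausible proportion?"""
    return isinstance(answer, float) and 0 <= answer <= 1

def hidden_test(answer, expected=0.25, tolerance=1e-6):
    """Hypothetical hidden check: does the answer match the expected value?"""
    return abs(answer - expected) <= tolerance

# A made-up answer that looks reasonable but is not correct.
student_answer = 0.5

print(public_test(student_answer))  # passes the sanity check
print(hidden_test(student_answer))  # fails the stricter check
```

Under these assumptions, an answer can pass every visible test and still lose credit once the stricter checks run.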
Quiz
- There will be one quiz on July 10, meant to help prepare you for the midterm exam. It will follow lecture, at 1:30PM.
- Your quiz score can replace your lowest homework/lab score.
- There are no makeup quizzes, so if you miss the quiz, your lowest HW/lab grade stands.
Exams
We will have two exams this session.
- Midterm Exam: Thursday, July 18, during lecture, 11AM-12PM.
- Final Exam: Saturday, August 3, 11:30AM-2:29PM, location TBD.
Both exams will be conducted in person and on paper. Let us know of any conflicts on the Welcome Survey.
Readings and resources
- We will draw readings from two sources. Readings for each lecture will be posted on the course homepage.
  - Computational and Inferential Thinking (CIT), the textbook created for Berkeley's version of this course.
  - babypandas notes, written specifically for the first part of DSC 10.
- The Resources tab of the course website contains links to helpful resources that you'll want to use throughout the course (e.g. DSC 10 Reference Sheet, programming tutorials, supplemental videos).
- The Debugging tab of the course website has answers to many common technical issues.
Weekly schedule
- Because of the fast-paced nature of this class over the summer, the schedule will differ from week to week.
- We'll send out a weekly message noting the deadlines for the week, and what we recommend working on at what times.
- Always refer to the course website for the current schedule.
Getting help
This is a tough, fast-paced course, but we're here to help you – here's how:
- Office Hours (OH).
  - Not held in an office – rather, held in a large open study space (HDSI 155).
  - Come with questions, or just to work!
  - See the schedule and instructions on the 📆 Calendar.
- Ed.
  - Post here with any logistical or conceptual questions; please don't email.
  - No code or solutions in public posts. Such posts should be private to course staff.
  - Otherwise, post publicly (anonymously, if you'd like).
- 🚨 Important: Use these to your advantage!
Advice from previous students
At the end of each quarter, we ask DSC 10 students to give advice to future students in the course. Here are some responses from last quarter's students:
Start the assignments (especially the midterm/final projects) early! It became so manageable with more time to split up sections and think things through without a crazy overbearing time pressure.
Be prepared to spend a lot of time in this class, regardless of whether you have any prior knowledge in programming or statistics. Everything is doable, but you will need to put in a significant amount of effort to succeed and sometimes you'll have to think outside of the box to come up with solutions.
Go to office hours!! It is the best resource available. The tutors are more than willing to help you out. The tutors made my time at DSC 10 not only manageable but also enjoyable. Also, prepare for the quizzes at least one day in advance so that you can retain the material better.
Practice is the most important thing you can do to succeed in this course. Also, grab a friend - two (or more) heads are better than one! And don't be afraid to ask for help when needed.
Academic Integrity policies
Collaboration
- Discuss all questions with each other (except, of course, on quizzes and exams).
- Projects are submitted in pairs or individually. Both partners should contribute to all parts of the project, not split it up.
- Labs and homeworks are submitted individually.
- No other person should complete your work for you or write any of the code you submit in this course, with the exception of the work you do with a project partner.
- Don't give someone else your code or look at someone else's code.
Generative Artificial Intelligence (GenAI)
- The syllabus includes a discussion of these tools and how you may use them in this class. Please read this carefully, ask questions about it, and proceed with care!
We're here for you!
Regardless of your background, you can succeed in this course. No prior programming or statistics experience will be assumed!
Watch on YouTube: We’re All Data Scientists | Rebecca Nugent | TEDxCMU.
Demo
Little Women (1868)
- Little Women, by Louisa May Alcott, is a novel that follows the lives of four sisters – Meg, Jo, Beth, and Amy.
- A movie based on the novel was released in 2019, starring Emma Watson (Meg) and Timothée Chalamet (Laurie).
- Using tools from this class, we'll learn (a bit) about the plot of the book, without reading it.
- Do not worry about any of this code – we'll cover the necessary pieces in the weeks to come. Sit back and relax!
# Read in 'lw.txt' to a variable called little_women_text.
# Using "with" closes the file automatically once the text has been read.
with open('data/lw.txt') as f:
    little_women_text = f.read()
# See the first three thousand characters.
little_women_text[:3000]
'The Project Gutenberg EBook of Little Women, by Louisa May Alcott\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever. You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.net\n\n\nTitle: Little Women\n\nAuthor: Louisa May Alcott\n\nPosting Date: September 13, 2008 [EBook #514]\nRelease Date: May, 1996\n[This file last updated on August 19, 2010]\n\nLanguage: English\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK LITTLE WOMEN ***\n\n\n\n\nLITTLE WOMEN\n\n\nby\n\nLouisa May Alcott\n\n\n\n\nCONTENTS\n\n\nPART 1\n\n ONE PLAYING PILGRIMS\n TWO A MERRY CHRISTMAS\n THREE THE LAURENCE BOY\n FOUR BURDENS\n FIVE BEING NEIGHBORLY\n SIX BETH FINDS THE PALACE BEAUTIFUL\n SEVEN AMY\'S VALLEY OF HUMILIATION\n EIGHT JO MEETS APOLLYON\n NINE MEG GOES TO VANITY FAIR\n TEN THE P.C. AND P.O.\n ELEVEN EXPERIMENTS\n TWELVE CAMP LAURENCE\n THIRTEEN CASTLES IN THE AIR\n FOURTEEN SECRETS\n FIFTEEN A TELEGRAM\n SIXTEEN LETTERS\n SEVENTEEN LITTLE FAITHFUL\n EIGHTEEN DARK DAYS\n NINETEEN AMY\'S WILL\n TWENTY CONFIDENTIAL\n TWENTY-ONE LAURIE MAKES MISCHIEF, AND JO MAKES PEACE\n TWENTY-TWO PLEASANT MEADOWS\n TWENTY-THREE AUNT MARCH SETTLES THE QUESTION\n\n\nPART 2\n\n TWENTY-FOUR GOSSIP\n TWENTY-FIVE THE FIRST WEDDING\n TWENTY-SIX ARTISTIC ATTEMPTS\n TWENTY-SEVEN LITERARY LESSONS\n TWENTY-EIGHT DOMESTIC EXPERIENCES\n TWENTY-NINE CALLS\n THIRTY CONSEQUENCES\n THIRTY-ONE OUR FOREIGN CORRESPONDENT\n THIRTY-TWO TENDER TROUBLES\n THIRTY-THREE JO\'S JOURNAL\n THIRTY-FOUR FRIEND\n THIRTY-FIVE HEARTACHE\n THIRTY-SIX BETH\'S SECRET\n THIRTY-SEVEN NEW IMPRESSIONS\n THIRTY-EIGHT ON THE SHELF\n THIRTY-NINE LAZY LAURENCE\n FORTY THE VALLEY OF THE SHADOW\n FORTY-ONE LEARNING TO FORGET\n FORTY-TWO ALL ALONE\n FORTY-THREE SURPRISES\n FORTY-FOUR MY LORD AND LADY\n FORTY-FIVE DAISY AND DEMI\n FORTY-SIX UNDER THE UMBRELLA\n FORTY-SEVEN HARVEST TIME\n\n\n\nCHAPTER ONE\n\nPLAYING PILGRIMS\n\n"Christmas won\'t be Christmas without any presents," grumbled Jo, lying\non the rug.\n\n"It\'s so dreadful to be poor!" sighed Meg, looking down at her old\ndress.\n\n"I don\'t think it\'s fair for some girls to have plenty of pretty\nthings, and other girls nothing at all," added little Amy, with an\ninjured sniff.\n\n"We\'ve got Father and Mother, and each other," said Beth contentedly\nfrom her corner.\n\nThe four young faces on which the firelight shone brightened at the\ncheerful words, but darkened again as Jo said sadly, "We haven\'t got\nFather, and shall not have him for a long time." She didn\'t say\n"perhaps never," but each silently added it, thinking of Father far\naway, where the fighting was.\n\nNobody spoke for a minute; then Meg said in an altered tone, "You know\nthe reason Mother proposed not having any presents this Christmas was\nbecause it is going to b'
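The output above is full of `\n` sequences because a notebook displays the `repr` of a string when an expression is the last line of a cell, and `repr` spells each newline character as the two characters `\n`. Passing the same string to `print` renders each newline as an actual line break. Here is a minimal sketch on a short made-up string (`sample` is an illustrative name, not part of the lecture code):

```python
# A short stand-in for the start of the book text.
sample = 'LITTLE WOMEN\n\nby\n\nLouisa May Alcott'

# What a notebook shows for a bare expression: quotes plus escaped newlines.
displayed = repr(sample)

# What print() writes: the same characters, with each newline as a line break.
printed_lines = sample.split('\n')

print(displayed)
print(sample)
```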
# Print the first three thousand characters.
print(little_women_text[:3000])
The Project Gutenberg EBook of Little Women, by Louisa May Alcott

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: Little Women

Author: Louisa May Alcott

Posting Date: September 13, 2008 [EBook #514]
Release Date: May, 1996
[This file last updated on August 19, 2010]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK LITTLE WOMEN ***




LITTLE WOMEN


by

Louisa May Alcott




CONTENTS


PART 1

 ONE PLAYING PILGRIMS
 TWO A MERRY CHRISTMAS
 THREE THE LAURENCE BOY
 FOUR BURDENS
 FIVE BEING NEIGHBORLY
 SIX BETH FINDS THE PALACE BEAUTIFUL
 SEVEN AMY'S VALLEY OF HUMILIATION
 EIGHT JO MEETS APOLLYON
 NINE MEG GOES TO VANITY FAIR
 TEN THE P.C. AND P.O.
 ELEVEN EXPERIMENTS
 TWELVE CAMP LAURENCE
 THIRTEEN CASTLES IN THE AIR
 FOURTEEN SECRETS
 FIFTEEN A TELEGRAM
 SIXTEEN LETTERS
 SEVENTEEN LITTLE FAITHFUL
 EIGHTEEN DARK DAYS
 NINETEEN AMY'S WILL
 TWENTY CONFIDENTIAL
 TWENTY-ONE LAURIE MAKES MISCHIEF, AND JO MAKES PEACE
 TWENTY-TWO PLEASANT MEADOWS
 TWENTY-THREE AUNT MARCH SETTLES THE QUESTION


PART 2

 TWENTY-FOUR GOSSIP
 TWENTY-FIVE THE FIRST WEDDING
 TWENTY-SIX ARTISTIC ATTEMPTS
 TWENTY-SEVEN LITERARY LESSONS
 TWENTY-EIGHT DOMESTIC EXPERIENCES
 TWENTY-NINE CALLS
 THIRTY CONSEQUENCES
 THIRTY-ONE OUR FOREIGN CORRESPONDENT
 THIRTY-TWO TENDER TROUBLES
 THIRTY-THREE JO'S JOURNAL
 THIRTY-FOUR FRIEND
 THIRTY-FIVE HEARTACHE
 THIRTY-SIX BETH'S SECRET
 THIRTY-SEVEN NEW IMPRESSIONS
 THIRTY-EIGHT ON THE SHELF
 THIRTY-NINE LAZY LAURENCE
 FORTY THE VALLEY OF THE SHADOW
 FORTY-ONE LEARNING TO FORGET
 FORTY-TWO ALL ALONE
 FORTY-THREE SURPRISES
 FORTY-FOUR MY LORD AND LADY
 FORTY-FIVE DAISY AND DEMI
 FORTY-SIX UNDER THE UMBRELLA
 FORTY-SEVEN HARVEST TIME



CHAPTER ONE

PLAYING PILGRIMS

"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.

"It's so dreadful to be poor!" sighed Meg, looking down at her old
dress.

"I don't think it's fair for some girls to have plenty of pretty
things, and other girls nothing at all," added little Amy, with an
injured sniff.

"We've got Father and Mother, and each other," said Beth contentedly
from her corner.

The four young faces on which the firelight shone brightened at the
cheerful words, but darkened again as Jo said sadly, "We haven't got
Father, and shall not have him for a long time." She didn't say
"perhaps never," but each silently added it, thinking of Father far
away, where the fighting was.

Nobody spoke for a minute; then Meg said in an altered tone, "You know
the reason Mother proposed not having any presents this Christmas was
because it is going to b
# Create a variable "chapters" by splitting the text on 'CHAPTER '.
chapters = little_women_text.split('CHAPTER ')
# Create a DataFrame with one column - the text of each chapter.
bpd.DataFrame().assign(chapters=chapters)
| | chapters |
|---|---|
| 0 | The Project Gutenberg EBook of Little Women, b... |
| 1 | ONE\n\nPLAYING PILGRIMS\n\n"Christmas won't be... |
| 2 | TWO\n\nA MERRY CHRISTMAS\n\nJo was the first t... |
| 3 | THREE\n\nTHE LAURENCE BOY\n\n"Jo! Jo! Where ... |
| 4 | FOUR\n\nBURDENS\n\n"Oh, dear, how hard it does... |
| ... | ... |
| 43 | FORTY-THREE\n\nSURPRISES\n\nJo was alone in th... |
| 44 | FORTY-FOUR\n\nMY LORD AND LADY\n\n"Please, Mad... |
| 45 | FORTY-FIVE\n\nDAISY AND DEMI\n\nI cannot feel ... |
| 46 | FORTY-SIX\n\nUNDER THE UMBRELLA\n\nWhile Lauri... |
| 47 | FORTY-SEVEN\n\nHARVEST TIME\n\nFor a year Jo a... |
48 rows × 1 columns
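The table has 48 rows even though the book has 47 chapters, because `split` also returns whatever comes before the first delimiter; here that is the Project Gutenberg front matter in row 0. A minimal sketch on a made-up string (the text and the names `text` and `pieces` are illustrative only):

```python
# A tiny stand-in for the book: some front matter followed by two chapters.
text = 'front matter CHAPTER ONE first chapter text CHAPTER TWO second chapter text'

# Splitting on the delimiter removes it and returns the pieces around it.
pieces = text.split('CHAPTER ')

for i, piece in enumerate(pieces):
    print(i, repr(piece))
```

Two delimiters produce three pieces, in the same way that 47 chapter headings produce 48 rows.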
# Number of occurrences of each name in each chapter.
# Note: np.char.count counts substrings, so 'Jo' also matches inside longer words such as 'John'.
counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, 'Amy'),
    Beth=np.char.count(chapters, 'Beth'),
    Jo=np.char.count(chapters, 'Jo'),
    Meg=np.char.count(chapters, 'Meg'),
    Laurie=np.char.count(chapters, 'Laurie'),
)
counts
| | Amy | Beth | Jo | Meg | Laurie |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 23 | 26 | 44 | 26 | 0 |
| 2 | 13 | 12 | 21 | 20 | 0 |
| 3 | 2 | 2 | 62 | 36 | 16 |
| 4 | 14 | 18 | 34 | 17 | 0 |
| ... | ... | ... | ... | ... | ... |
| 43 | 31 | 8 | 61 | 3 | 29 |
| 44 | 13 | 0 | 9 | 0 | 10 |
| 45 | 1 | 2 | 6 | 2 | 0 |
| 46 | 2 | 1 | 56 | 4 | 2 |
| 47 | 10 | 3 | 37 | 6 | 13 |
48 rows × 5 columns
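One detail worth knowing about these counts: `np.char.count` counts occurrences of a substring in each element, not whole words. The sketch below uses two made-up sentences (not text from the book) to show the difference, with a regular-expression word-boundary count as one possible alternative; the variable names are illustrative assumptions.

```python
import re
import numpy as np

# Made-up sentences, used only to illustrate substring matching.
passages = np.array(['Jo met John and Josephine.', 'Meg and Jo laughed.'])

# Counts every place the characters "Jo" appear, including inside longer words.
substring_counts = np.char.count(passages, 'Jo')

# Counts "Jo" only when it stands alone as a word (\b marks a word boundary).
whole_word_counts = np.array([len(re.findall(r'\bJo\b', p)) for p in passages])

print(substring_counts)
print(whole_word_counts)
```

For a rough picture of who dominates each chapter, the substring count is usually close enough, but it can overstate short names.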
# Cumulative number of times each name appears.
# Row 0 holds the text before 'CHAPTER ONE', so the Chapter column runs one ahead of the book's chapter numbers.
cumulative_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get('Amy')),
    Beth=np.cumsum(counts.get('Beth')),
    Jo=np.cumsum(counts.get('Jo')),
    Meg=np.cumsum(counts.get('Meg')),
    Laurie=np.cumsum(counts.get('Laurie')),
    Chapter=np.arange(1, 49, 1),
)
cumulative_counts
| | Amy | Beth | Jo | Meg | Laurie | Chapter |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 23 | 26 | 44 | 26 | 0 | 2 |
| 2 | 36 | 38 | 65 | 46 | 0 | 3 |
| 3 | 38 | 40 | 127 | 82 | 16 | 4 |
| 4 | 52 | 58 | 161 | 99 | 16 | 5 |
| ... | ... | ... | ... | ... | ... | ... |
| 43 | 619 | 459 | 1435 | 673 | 571 | 44 |
| 44 | 632 | 459 | 1444 | 673 | 581 | 45 |
| 45 | 633 | 461 | 1450 | 675 | 581 | 46 |
| 46 | 635 | 462 | 1506 | 679 | 583 | 47 |
| 47 | 645 | 465 | 1543 | 685 | 596 | 48 |
48 rows × 6 columns
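`np.cumsum` turns per-chapter counts into running totals: each entry is the sum of all entries up to and including that position. A minimal sketch using the first five 'Amy' counts from the per-chapter table (the variable names are illustrative):

```python
import numpy as np

# First five per-chapter counts of 'Amy' (rows 0 through 4 of the counts table).
per_chapter = np.array([0, 23, 13, 2, 14])

# Running total: 0, 0+23, 0+23+13, and so on.
running_total = np.cumsum(per_chapter)

print(running_total)  # → [ 0 23 36 38 52]
```

These values match the first five 'Amy' entries in the cumulative table.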
# Putting it all together, we get a helpful visualization.
# Reshape to "long" format: one row per (name, chapter) pair.
cumulative_counts_df = (
    cumulative_counts.drop(columns=['Chapter']).to_df().melt()
    .rename(columns={'variable': 'name', 'value': 'Count'})
)
cumulative_counts_df = cumulative_counts_df.assign(Chapter=list(range(1, 49)) * 5)
px.line(cumulative_counts_df, x='Chapter', y='Count', color='name',
        width=900, height=600,
        title='Cumulative Number of Times Each Name Appears', template='ggplot2')
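The reshaping step is what lets a single `px.line` call draw one line per name: `melt` stacks the name columns on top of one another, producing one row per (name, chapter) pair. Here is a minimal sketch with plain pandas on a tiny made-up table (three chapters and two names; `wide` and `long` are illustrative names):

```python
import pandas as pd

# A small "wide" table: one column per name, one row per chapter.
wide = pd.DataFrame({'Amy': [0, 23, 36], 'Jo': [0, 44, 65]})

# melt() stacks the columns: 'variable' holds the old column name, 'value' the entry.
long = wide.melt().rename(columns={'variable': 'name', 'value': 'Count'})

# The chapter numbers repeat once per name, in the same stacked order.
long = long.assign(Chapter=list(range(1, 4)) * 2)

print(long)
```

In the long table, the `color='name'` argument has a single column to group by, which is why the wide table is reshaped before plotting.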