# Run this cell to set up packages for lecture.
import babypandas as bpd
import matplotlib.pyplot as plt
import numpy as np
plt.style.use("ggplot")
Welcome to DSC 10! 👋¶
- DSC 10 is a guided tour of data science.
- It was developed by UC Berkeley in 2015 and adapted by UCSD in 2017.
- You'll learn just enough programming and statistics to do data science.
- We'll cover statistics without too much math – instead, we'll use simulation.
- This class lays the foundation for all other courses in the DSC major.
Agenda¶
- Course staff.
- What is data science?
- How will this course run?
- Fun demo.
- What is code? What are Jupyter Notebooks?
- Expressions.
Course staff¶
Instructor: Dr. Janine Tiefenbruck (call me Janine)¶
- BS in Math and Computer Science at Loyola Maryland, PhD in Math (combinatorics) at UCSD 🔱.
- Teaching at UCSD: Math ➡️ CSE ➡️ DSC.
- 11 years of teaching!
- Mostly teach DSC 10, sometimes DSC 40A or DSC 80.
- Outside interests: crafting, board games, hiking, baking.
Course staff¶
In addition, we have many other course staff members who are here to support you in discussion, office hours, and online.
- Graduate TA: Zeyu Bian.
- Undergraduate tutors: Bianca Grunbaum, Kate Feng, Raymond Williams, Sofia Tkachenko, Austin Flippo, Pranav Rajaram.
- Stuffed panda mascot: Baby Panda. 🐼
Learn more about them at dsc10.com/staff, and come say hi in office hours!
What is "data science"? 🤔¶
Everyone seems to have their own definition of data science.
What is "data science"?¶
Data science is about drawing useful conclusions from data using computation. Throughout the quarter, we'll touch on several aspects of data science:
- First 4 weeks: use Python to explore data.
- Lots of visualization 📈📊 and "data manipulation", using industry-standard tools.
- Next 4 weeks: use data to make inferences about a population, given just a sample.
- Rely heavily on simulation, rather than formulas.
- Last 2 weeks: use data from the past to predict what may happen in the future.
- A taste of machine learning 🤖.
It can be fun, too!¶
The site The Pudding is home to several interactive data-rich articles.
(source)
Course logistics¶
Important Dates¶
We will have four quizzes and two exams this quarter. All will be conducted in person and on paper.
- Quiz 1: Wednesday, April 15th
- Quiz 2: Wednesday, April 22nd
- Midterm Exam: Friday, May 1st during lecture
- Quiz 3: Wednesday, May 20th
- Quiz 4: Wednesday, May 27th
- Final Exam: Saturday, June 6th from 3:00PM to 6:00PM
Let us know of any conflicts and share your availability for course meetings on the Welcome Survey.
Getting started¶
Your first task is to complete the following by Thursday, April 2nd at 11:59PM.
- Join Campuswire (join code: 6479).
- Check if you can access Gradescope. If not, send a private message to the instructional staff on Campuswire with your name, PID, and email address so that we can add you and you can submit assignments.
- Read the syllabus and course website and complete the Syllabus Check.
- Fill out the Welcome Survey.
Academic Integrity policies¶
Collaboration¶
- Discuss all questions with each other (except, of course, on quizzes and exams).
- Projects are submitted in pairs or individually. Both partners should contribute to all parts of the project, not split it up.
- Labs and homeworks are submitted individually.
- No other person should complete your work for you or write any of the code you submit in this course, with the exception of the work you do with a project partner.
- Don't give someone else your code or look at someone else's code.
Generative Artificial Intelligence (GenAI)¶
- This class has a custom AI Tutor trained on course materials and designed to help you understand course content. Learn about how it works, see our recommendations for how to use it, and please try it out!
- The syllabus includes a discussion about the use of GenAI tools and how you may use them in this course. Please read this carefully and ask if you have any questions.
Getting help¶
This is a tough, fast-paced course, but we're here to help you – here's how:
- Discussion Section.
- Today's discussion (3PM or 4PM) will focus on getting you set up in our programming environment.
- Future discussions will focus on preparation for quizzes and exams (the hardest part of the course).
- Office Hours (OH).
- Not held in an office, but in a large open study space.
- Come with questions, or just to work!
- See the schedule and instructions on the 📆 Calendar.
- Campuswire.
- Post here with any logistical or conceptual questions; please don't email or use Canvas.
- No code or solutions in public posts. Such posts should be private to course staff.
- Otherwise, post publicly (anonymously, if you'd like).
- Resources, Resources, Resources.
- The course website includes links to course notes, a reference sheet, tutor-created videos and slideshows, interactive diagrams, practice exams with solutions, and more.
- 🚨 Important: Use these to your advantage!
Advice from previous students¶
At the end of each quarter, we ask DSC 10 students to give advice to future students in the course. Here are some responses from past students:
Start the assignments (especially the midterm/final projects) early! It became so manageable with more time to split up sections and think things through without a crazy overbearing time pressure.
Be prepared to spend a lot of time in this class, regardless of whether you have any prior knowledge in programming or statistics. Everything is doable, but you will need to put in a significant amount of effort to succeed and sometimes you'll have to think outside of the box to come up with solutions.
Go to office hours!! It is the best resource available. The tutors are more than willing to help you out. The tutors made my time at DSC 10 not only manageable but also enjoyable. Also, prepare for the quizzes at least one day in advance so that you can retain the material better.
Practice is the most important thing you can do to succeed in this course. Also, grab a friend - two (or more) heads are better than one! And don't be afraid to ask for help when needed.
We're here for you!¶
Regardless of your background, you can succeed in this course with lots of hard work. No prior programming or statistics experience will be assumed! We'll start at the beginning, but we will move fast!
Inspirational TED talk: 🎥 We’re All Data Scientists by Rebecca Nugent.
Wellness resources¶
Demo¶
Little Women (1868)¶
- Little Women, by Louisa May Alcott, is a novel that follows the lives of four sisters – Meg, Jo, Beth, and Amy.
- A movie based on the novel was released in 2019, starring Emma Watson (Meg) and Timothée Chalamet (Laurie).
- Using tools from this class, we'll learn (a bit) about the plot of the book, without reading it.
- Do not worry about any of this code – we'll cover the necessary pieces in the weeks to come. Sit back and relax!
# Read in 'lw.txt' to a variable called little_women_text.
little_women_text = open("data/lw.txt").read()
# See the first three thousand characters.
little_women_text[:3000]
# Print the first three thousand characters.
print(little_women_text[:3000])
# Create a variable "chapters" by splitting the text on 'CHAPTER '.
chapters = little_women_text.split("CHAPTER ")
# Create a DataFrame with one column - the text of each chapter.
bpd.DataFrame().assign(chapters=chapters)
# Number of occurrences of each name in each chapter.
counts = bpd.DataFrame().assign(
    Amy=np.char.count(chapters, "Amy"),
    Beth=np.char.count(chapters, "Beth"),
    Jo=np.char.count(chapters, "Jo"),
    Meg=np.char.count(chapters, "Meg"),
    Laurie=np.char.count(chapters, "Laurie"),
)
counts
# Cumulative number of times each name appears.
cumulative_counts = bpd.DataFrame().assign(
    Amy=np.cumsum(counts.get("Amy")),
    Beth=np.cumsum(counts.get("Beth")),
    Jo=np.cumsum(counts.get("Jo")),
    Meg=np.cumsum(counts.get("Meg")),
    Laurie=np.cumsum(counts.get("Laurie")),
    Chapter=np.arange(1, 49, 1),
)
cumulative_counts
# Putting it all together, we get a helpful visualization.
import plotly.express as px
cumulative_counts_df = (
    cumulative_counts.drop(columns=["Chapter"])
    .to_df()
    .melt()
    .rename(columns={"variable": "name", "value": "Count"})
)
cumulative_counts_df = cumulative_counts_df.assign(
    Chapter=list(range(1, 49)) * 5
)
px.line(
    cumulative_counts_df,
    x="Chapter",
    y="Count",
    color="name",
    width=700,
    height=500,
    title="Cumulative Number of Times Each Name Appears",
    template="ggplot2",
)
- In Chapter 32, Jo moves to New York alone. Her relationship with which sister suffers the most from this faraway move?
- Laurie is a man who marries one of the sisters at the end. Which one?
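The demo above leans on three steps: splitting one long string into pieces, counting how often a name occurs in each piece, and keeping a running total. Here is a minimal sketch of those steps on a made-up string. The text and the variable names are hypothetical and are not taken from lw.txt, and the sketch uses plain NumPy rather than babypandas so that it can run on its own.

```python
import numpy as np

# Made-up text standing in for the novel; this is NOT the contents of lw.txt.
toy_text = "intro CHAPTER ONE Jo met Amy. CHAPTER TWO Jo and John wrote to Jo."

# Splitting on "CHAPTER " returns a list of pieces. The first piece is
# whatever came before the first "CHAPTER ".
pieces = toy_text.split("CHAPTER ")

# np.char.count counts how many times a piece of text occurs in each piece.
jo_counts = np.char.count(pieces, "Jo")
amy_counts = np.char.count(pieces, "Amy")

# np.cumsum turns the per-piece counts into running totals.
jo_running = np.cumsum(jo_counts)

print(pieces)
print(jo_counts)
print(amy_counts)
print(jo_running)
```

Note that the counting matches text anywhere it appears, so "John" in the made-up string adds to the "Jo" count. Counts like these are a rough signal, not an exact tally of how often a character is mentioned.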
What is code? What are Jupyter Notebooks? 💻¶
What is code?¶
- Instructions for computers are written in programming languages, and are referred to as code.
- “Computer programs” are nothing more than recipes: we write programs that tell the computer exactly what to do, and it does exactly that – nothing more, and nothing less.
Why Python?¶
- It's popular!
(source and methodology)
- It has a variety of use cases. Some examples:
- Web development.
- Data science and machine learning.
- Scripting and automation.
- It's (relatively) easy to dive right in! 🏊
Jupyter Notebooks 📓¶
- Often, but not in this class, code is written in a text editor and then run in a command-line interface (or both steps are done in an IDE).
- Jupyter Notebooks allow us to write and run code within a single document. They also allow us to embed text and code. We will be using Jupyter Notebooks throughout the quarter.
- DataHub is a server that allows you to run Jupyter Notebooks from your web browser without having to install any software locally.
Expressions¶
Python as a calculator¶
- An expression is a combination of values, operators, and functions that evaluates to some value.
- For now, let's think of Python like a calculator – it takes expressions and evaluates them.
- We will enter our expressions in code cells. To run a code cell, either:
- Hit shift+enter (or shift+return) on your keyboard (strongly preferred), or
- Press the "▶ Run" button in the toolbar.
23
-15 + 2.718
4**3
(2 + 3 + 4) / 3
# Only one value is displayed. Why?
9 + 10
13 / 4
21
Arithmetic operations¶
| Operation | Operator | Example | Value |
|---|---|---|---|
| Addition | + | 2 + 3 | 5 |
| Subtraction | - | 2 - 3 | -1 |
| Multiplication | * | 2 * 3 | 6 |
| Division | / | 7 / 3 | 2.33333 |
| Remainder | % | 7 % 3 | 1 |
| Exponentiation | ** | 2 ** 0.5 | 1.41421 |
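To check the table yourself, the examples can be run one at a time. The sketch below wraps each one in print so that every value is shown when the lines are run together. The comments give the table's values, which are rounded for division and exponentiation; Python shows more digits.

```python
print(2 + 3)     # 5
print(2 - 3)     # -1
print(2 * 3)     # 6
print(7 / 3)     # about 2.33333
print(7 % 3)     # 1, the remainder when 7 is divided by 3
print(2 ** 0.5)  # about 1.41421, the square root of 2
```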
Python uses the typical order of operations – PEMDAS (BEDMAS? 🛏️)¶
5 * 2**3
(5 * 2) ** 3
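As a small sketch of what the two cells above evaluate to, the comments below spell out the order in which Python does the work. The last line is an extra example, not from the lecture, showing that multiplication and division sit at the same level and are evaluated left to right.

```python
# Exponentiation happens before multiplication.
print(5 * 2**3)      # 5 * 8 = 40

# Parentheses are evaluated first, so the multiplication happens before the power.
print((5 * 2) ** 3)  # 10 ** 3 = 1000

# Same level of precedence: evaluated left to right.
print(8 / 4 * 2)     # (8 / 4) * 2 = 4.0
```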
Activity¶
In the cell below, write an expression that's equivalent to
$$(19 + 6 \cdot 3) - 15 \cdot \left(\sqrt{100} \cdot \frac{1}{30}\right) \cdot \frac{3}{5} + \frac{4^2}{2^3} + \left( 6 - \frac{2}{3} \right) \cdot 12 $$
Try to use parentheses only when necessary.
Summary, next time, reminders¶
Summary¶
- Expressions evaluate to values. Python will display the value of the last expression in a cell by default.
- Python knows about all of the standard mathematical operators and follows PEMDAS.
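To see the first point in action, here is a minimal sketch based on the three-line cell from earlier in the lecture. Wrapping an expression in print is one way to show a value that is not on the last line of a cell; print appeared in the demo but has not been formally introduced yet.

```python
print(9 + 10)  # shown because of print: 19
print(13 / 4)  # shown because of print: 3.25
21             # in a notebook cell, the last expression is displayed automatically
```

Without the two calls to print, a notebook would display only 21.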
Next time¶
- We'll learn how to use variables to store values so that we can use them later in our code.
- We'll compute values using functions like max, min, and round.
- We'll discover that there are multiple different ways of storing values in Python. These are called data types.
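As a small preview, and only a sketch of what is to come, here is what calling those three functions looks like. The numbers are arbitrary, and the last two lines use the built-in type function, which this lecture does not mention, to hint at what "data types" means.

```python
print(max(3, 7))          # the larger of the two values: 7
print(min(3, 7))          # the smaller of the two values: 3
print(round(3.14159, 2))  # rounded to 2 decimal places: 3.14

# Whole numbers and numbers with a decimal point are stored as different types.
print(type(3))
print(type(3.0))
```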
Reminders¶
- Attend discussion section this afternoon at 3PM or 4PM in PODEM 1A20.
- Complete the items in the Getting Started section of the syllabus by Thursday, April 2nd at 11:59PM.
- Then, work on the Pretest and Lab 0, both due Monday, April 6th at 11:59PM. Access assignments by clicking links from the homepage of dsc10.com.
