import numpy as np
import babypandas as bpd
import pandas as pd
Agenda
- More review of old exam problems.
- Working on personal projects.
- Demo: Gapminder.
- Some parting thoughts.
More review
Ask some questions!
We'll take some time to work on past exam questions. Feel free to ask about specific topics you want more practice with, or specific problems you've tried that you want me to explain.
Personal projects
Using Jupyter Notebooks after DSC 10
- You may be interested in working on data science projects of your own.
- In this video, we show you how to make blank notebooks and upload datasets of your own to DataHub.
- After this quarter, depending on the classes you're enrolled in, you may not have access to DataHub. Eventually, you'll want to install Jupyter Notebooks on your computer.
- Anaconda is a great way to do that, as it also installs many commonly used packages.
- You may want to download your work from DataHub so you can refer to it after the course ends (though you can look at it on Gradescope too).
- Remember, all babypandas code is regular pandas code, too!
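To make that last point concrete, here is a minimal sketch that uses only pandas. The tiny table and its column names are made up for illustration; the point is that the methods from this course (.get, .assign, .sort_values, and Boolean indexing) are ordinary pandas methods.

```python
import pandas as pd

# A small, made-up table; in DSC 10 this would have been a bpd.DataFrame.
pets = pd.DataFrame({
    'name': ['Zoe', 'Milo', 'Pip'],
    'weight_kg': [4.0, 30.5, 0.5],
})

# The babypandas-style methods below also exist in regular pandas.
heavy = (
    pets
    .assign(weight_lb=pets.get('weight_kg') * 2.2)
    .sort_values(by='weight_lb', ascending=False)
)
heavy = heavy[heavy.get('weight_kg') > 1]

# Regular pandas also offers bracket notation for columns.
same_column = pets['weight_kg'].equals(pets.get('weight_kg'))
```

Regular pandas has many more features than the subset we used, so code you find online may look different even though the ideas are the same.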
Finding data
These sites allow you to search for datasets (in CSV format) from a variety of different domains. Some may require you to sign up for an account; these are generally reputable sources.
Note that all of these links are also available at rampure.org/find-datasets.
- Data is Plural.
- FiveThirtyEight.
- CORGIS.
- Kaggle Datasets.
- Google's dataset search.
- DataHub.io.
- Data.world.
- R datasets.
- Wikipedia. (Use this site to extract and download tables as CSVs.)
- Awesome Public Datasets GitHub repo.
- Links to even more sources.
Domain-specific sources of data
- Sports: Basketball Reference, Baseball Reference, etc.
- US Government Sources: census.gov, data.gov, data.ca.gov, data.sfgov.org, FBI's Crime Data Explorer, Centers for Disease Control and Prevention.
- Global Development: data.worldbank.org, databank.worldbank.org, WHO.
- Transportation: New York Taxi trips, Bureau of Transportation Statistics, SFO Air Traffic Statistics.
- Music: Spotify Charts.
- COVID: Johns Hopkins.
- Any Google Forms survey you've administered! (Go to the results spreadsheet, then go to "File > Download > Comma-separated values".)
Tip: if a site only allows you to download a file as an Excel file, not a CSV file, you can download it, open it in a spreadsheet viewer (Excel, Numbers, Google Sheets), and export it to a CSV.
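Once you have a CSV file, loading it should look familiar. The sketch below is only an illustration: the file name my_data.csv and its contents are hypothetical, and the file is written to a temporary folder just so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Stand-in for a CSV you downloaded or exported from a spreadsheet program.
csv_text = "country,year,lifeExp\nAfghanistan,2007,43.828\nAlbania,2007,76.423\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'my_data.csv')  # hypothetical file name
    with open(path, 'w') as f:
        f.write(csv_text)
    # Same idea as bpd.read_csv from this course.
    my_data = pd.read_csv(path)
```

Regular pandas can also read Excel files directly with pd.read_excel, though that typically needs an extra package such as openpyxl to be installed.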
Demo: Gapminder
plotly
- All of the visualizations (scatter plots, histograms, etc.) in this course were created using a library called matplotlib.
- This library was called under the hood every time we wrote df.plot.
- plotly is a different visualization library that allows us to create interactive visualizations.
- You may learn about it in a future course, but we'll briefly show you some cool visualizations you can make with it.
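One way to see the connection between df.plot and matplotlib is to look at what a plotting call returns. This is a small sketch with made-up values, using regular pandas with its default plotting backend; the 'Agg' line is only there so the code can run without a display.

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the sketch runs without a display
import matplotlib.axes
import pandas as pd

points = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})  # made-up values

# The plotting call hands back a matplotlib object...
ax = points.plot(kind='scatter', x='x', y='y')
is_matplotlib = isinstance(ax, matplotlib.axes.Axes)

# ...so matplotlib methods can be used to customize the plot.
ax.set_title('Made with pandas, drawn by matplotlib')
```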
import plotly.express as px
Gapminder dataset
Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels. - Gapminder Wikipedia
gapminder = px.data.gapminder()
gapminder
| country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | AFG | 4 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 | ZWE | 716 |
1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 | ZWE | 716 |
1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 | ZWE | 716 |
1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 | ZWE | 716 |
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 | ZWE | 716 |
1704 rows × 8 columns
The dataset contains information for each country for several different years.
gapminder.get('year').unique()
array([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007], dtype=int64)
Let's start by just looking at 2007 data (the most recent year in the dataset).
gapminder_2007 = gapminder[gapminder.get('year') == 2007]
gapminder_2007
| country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
35 | Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.367465 | DZA | 12 |
47 | Angola | Africa | 2007 | 42.731 | 12420476 | 4797.231267 | AGO | 24 |
59 | Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.379640 | ARG | 32 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1655 | Vietnam | Asia | 2007 | 74.249 | 85262356 | 2441.576404 | VNM | 704 |
1667 | West Bank and Gaza | Asia | 2007 | 73.422 | 4018332 | 3025.349798 | PSE | 275 |
1679 | Yemen, Rep. | Asia | 2007 | 62.698 | 22211743 | 2280.769906 | YEM | 887 |
1691 | Zambia | Africa | 2007 | 42.384 | 11746035 | 1271.211593 | ZMB | 894 |
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 | ZWE | 716 |
142 rows × 8 columns
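As a quick sanity check on the two tables above: 12 years times 142 countries accounts for all 1704 rows, assuming every country appears exactly once per year (this matches the counts shown, but is not verified here). The sketch also replays the Boolean-mask query on four rows copied from the output above.

```python
import pandas as pd

years = [1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007]
total_rows = 1704
# Rows per year, if each country appears once per year.
rows_per_year = total_rows // len(years)

# Four rows copied from the gapminder output above.
sample = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Zimbabwe', 'Zimbabwe'],
    'year': [1952, 1957, 2002, 2007],
    'lifeExp': [28.801, 30.332, 39.989, 43.487],
})

mask = sample.get('year') == 2007  # Series of True/False values, one per row
sample_2007 = sample[mask]         # keeps only the rows where mask is True
```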
Scatter plot
We can plot life expectancy vs. GDP per capita. If you hover over a point, you will see the name of the country.
px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', hover_name='country')
In future courses, you'll learn about transformations. Here, we'll apply a log transformation to the x-axis to make the plot look a little more linear.
px.scatter(gapminder_2007, x='gdpPercap', y='lifeExp', log_x=True, hover_name='country')
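To see why the log scale helps, here is a small sketch using four GDP per capita values copied from the 2007 table above. It illustrates the idea only, not how plotly draws the axis: on a log scale, equal ratios take up equal distances, so a handful of very large values no longer stretches the axis.

```python
import numpy as np

# gdpPercap in 2007 for Zimbabwe, Afghanistan, Albania, and Argentina.
gdp = np.array([469.709298, 974.580338, 5937.029526, 12779.379640])
log_gdp = np.log10(gdp)

# On the raw scale, the largest value is roughly 27 times the smallest.
raw_ratio = gdp.max() / gdp.min()

# On the log scale, the same values span only about 1.4 units.
log_spread = log_gdp.max() - log_gdp.min()
```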
Animated scatter plot
We can take things one step further.
px.scatter(gapminder,
           x = 'gdpPercap',
           y = 'lifeExp',
           hover_name = 'country',
           color = 'continent',
           size = 'pop',
           size_max = 60,
           log_x = True,
           range_y = [30, 90],
           animation_frame = 'year',
           title = 'Life Expectancy, GDP Per Capita, and Population over Time'
)
Watch this video if you want to see an even-more-animated version of this plot.
Animated histogram
px.histogram(gapminder,
             x = 'lifeExp',
             animation_frame = 'year',
             range_x = [20, 90],
             range_y = [0, 50],
             title = 'Distribution of Life Expectancy over Time')
Choropleth
px.choropleth(gapminder,
              locations = 'iso_alpha',
              color = 'lifeExp',
              hover_name = 'country',
              hover_data = {'iso_alpha': False},
              title = 'Life Expectancy Per Country',
              color_continuous_scale = px.colors.sequential.tempo
)
Parting thoughts
From Lecture 1: What is "data science"?
Data science is about drawing useful conclusions from data using computation. Throughout the quarter, we touched on several aspects of data science:
- In the first 4 weeks, we used Python to explore data.
- Lots of visualization and "data manipulation", using industry-standard tools.
- In the next 4 weeks, we used data to infer about a population, given just a sample.
- We relied heavily on simulation, rather than formulas.
- In the last 2 weeks, we used data from the past to predict what may happen in the future.
- A taste of machine learning.
- In future DSC courses, including DSC 20 and 40A, you'll revisit all three of these aspects of data science.
Thank you!
- This course would not have been possible without...
- Graduate TA: Ashley Ho.
- Undergraduate tutors: Jack Determan, Kate Feng, Michelle Hong, Jason Huynh, Minchan Kim, Athulith Paraselli, Pranav Rajaram, Bill Wang, Raymond Williams.
- Learn more about tutoring: it's fun, and you can be a tutor as early as your 3rd quarter at UCSD!
- Keep in touch! dsc10.com/staff
- After grades are released, we'll make a post on Ed where you can ask course staff for advice on courses, data science, and UCSD more generally.