# Run this cell to set up packages for lecture.
from lec06_imports import *
There are several keyboard shortcuts built into Jupyter Notebooks designed to help you save time. To see them, either click the keyboard button in the toolbar above or hit the H key on your keyboard (as long as you're not actively editing a cell).
Particularly useful shortcuts:
| Action | Keyboard shortcut |
|---|---|
| Run cell + jump to next cell | SHIFT + ENTER |
| Save the notebook | CTRL/CMD + S |
| Create new cell above/below | A/B |
| Delete cell | DD |
We'll continue working with the same data from last time.
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states = states.set_index('State')
states
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
| ... | ... | ... | ... | ... | ... | ... |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
.groupby aggregates rows¶.groupby aggregates (collects) all rows with the same value in a specified column (e.g. 'Region') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values from different rows with the same value in the specified column.
states.groupby('Region').sum()
| Population | Land Area | Density | |
|---|---|---|---|
| Region | |||
| Midwest | 68985454 | 750524 | 1298.78 |
| Northeast | 57609148 | 161912 | 4957.49 |
| South | 125576562 | 868356 | 3189.37 |
| West | 78588572 | 1751054 | 881.62 |
.groupby in general¶To use .groupby:
.groupby(column_name) will gather rows which have the same value in the specified column (column_name)..count(), .sum(), .mean(), .median(), .max(), and .min().states.groupby('Region').mean()
| Population | Land Area | Density | |
|---|---|---|---|
| Region | |||
| Midwest | 5.75e+06 | 62543.67 | 108.23 |
| Northeast | 6.40e+06 | 17990.22 | 550.83 |
| South | 7.85e+06 | 54272.25 | 199.34 |
| West | 6.05e+06 | 134696.46 | 67.82 |
states.groupby('Party').median()
| Population | Land Area | Density | |
|---|---|---|---|
| Party | |||
| Democratic | 5.71e+06 | 30843.0 | 226.57 |
| Republican | 4.51e+06 | 56539.0 | 74.62 |
.count()¶Notice the column names don't make sense after grouping with the .count() aggregation method.
states.groupby('Region').count()
| Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|
| Region | |||||
| Midwest | 12 | 12 | 12 | 12 | 12 |
| Northeast | 9 | 9 | 9 | 9 | 9 |
| South | 16 | 16 | 16 | 16 | 16 |
| West | 13 | 13 | 13 | 13 | 13 |
.assign, .drop, and .get¶.assign to create a new column containing the same values as an existing column..drop(columns=list_of_column_labels) to drop any columns you no longer need..get(list_of_column_labels) to keep only certain columns. The columns will appear in the order you specify, so this is also useful for reordering columns!.get¶list of column names gives a DataFrame. (Even if the list has just one element!)states.get('Capital City')
State
Alabama Montgomery
Alaska Juneau
Arizona Phoenix
...
West Virginia Charleston
Wisconsin Madison
Wyoming Cheyenne
Name: Capital City, Length: 50, dtype: object
states.get(['Capital City', 'Party'])
| Capital City | Party | |
|---|---|---|
| State | ||
| Alabama | Montgomery | Republican |
| Alaska | Juneau | Republican |
| Arizona | Phoenix | Republican |
| ... | ... | ... |
| West Virginia | Charleston | Republican |
| Wisconsin | Madison | Republican |
| Wyoming | Cheyenne | Republican |
50 rows × 2 columns
states.get(['Capital City'])
| Capital City | |
|---|---|
| State | |
| Alabama | Montgomery |
| Alaska | Juneau |
| Arizona | Phoenix |
| ... | ... |
| West Virginia | Charleston |
| Wisconsin | Madison |
| Wyoming | Cheyenne |
50 rows × 1 columns
Change the DataFrame states_by_region so that it only has one column, called 'Count', containing the number of states in each region.
states_by_region = states.groupby('Region').count()
states_by_region
| Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|
| Region | |||||
| Midwest | 12 | 12 | 12 | 12 | 12 |
| Northeast | 9 | 9 | 9 | 9 | 9 |
| South | 16 | 16 | 16 | 16 | 16 |
| West | 13 | 13 | 13 | 13 | 13 |
In Lecture 1, we were able to answer questions about the plot of Little Women without having to read the novel and without having to understand Python code. Some of those questions included:
We answered these questions from a data visualization alone!
bpd.read_csv('data/lw_counts.csv').plot(x='Chapter', linewidth=3.0);

There are two main types of variables:
Note that here, "variable" does not mean a variable in Python, but rather it means a column in a DataFrame.
Which of these is not a numerical variable?
A. Fuel economy in miles per gallon.
B. Number of quarters at UCSD.
C. College at UCSD (Sixth, Seventh, etc).
D. Bank account number.
E. More than one of these are not numerical variables.
The type of visualization we create depends on the kinds of variables we're visualizing.
We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.
An exoplanet is a planet outside our solar system. NASA has discovered over 5,000 exoplanets so far in its search for signs of life beyond Earth. 👽
| Column | Contents |
|---|---|
'Distance'| Distance from Earth, in light years.
'Magnitude'| Apparent magnitude, which measures brightness in such a way that brighter objects have lower values.
'Type'| Categorization of planet based on its composition and size.
'Year'| When the planet was discovered.
'Detection'| The method of detection used to discover the planet.
'Mass'| The ratio of the planet's mass to Earth's mass.
'Radius'| The ratio of the planet's radius to Earth's radius.
exo = bpd.read_csv('data/exoplanets.csv').set_index('Name')
exo
| Distance | Magnitude | Type | Year | Detection | Mass | Radius | |
|---|---|---|---|---|---|---|---|
| Name | |||||||
| 11 Comae Berenices b | 304.0 | 4.72 | Gas Giant | 2007 | Radial Velocity | 6165.90 | 11.88 |
| 11 Ursae Minoris b | 409.0 | 5.01 | Gas Giant | 2009 | Radial Velocity | 4684.81 | 11.99 |
| 14 Andromedae b | 246.0 | 5.23 | Gas Giant | 2008 | Radial Velocity | 1525.58 | 12.65 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |
| YZ Ceti c | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.14 | 1.05 |
| YZ Ceti d | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.09 | 1.03 |
5043 rows × 7 columns
'Distance' and 'Magnitude'?exo.plot(kind='scatter', x='Distance', y='Magnitude');
Further planets have greater 'Magnitude' (meaning they are less bright), which makes sense.
The data appears curved because 'Magnitude' is measured on a logarithmic scale. A decrease of one unit in 'Magnitude' corresponds to a 2.5 times increase in brightness.

df, usedf.plot(
kind='scatter',
x=x_column_for_horizontal,
y=y_column_for_vertical
)
df..plot, it will hide the weird text output that displays.The majority of exoplanets are less than 10,000 light years away; if we'd like to zoom in on just these exoplanets, we can query before plotting.
exo[exo.get('Distance') < 10000].plot(kind='scatter', x='Distance', y='Magnitude');
'Magnitude' of newly discovered exoplanets changed over time?# There were multiple exoplanets discovered each year.
# What operation can we apply to this DataFrame so that there is one row per year?
exo
| Distance | Magnitude | Type | Year | Detection | Mass | Radius | |
|---|---|---|---|---|---|---|---|
| Name | |||||||
| 11 Comae Berenices b | 304.0 | 4.72 | Gas Giant | 2007 | Radial Velocity | 6165.90 | 11.88 |
| 11 Ursae Minoris b | 409.0 | 5.01 | Gas Giant | 2009 | Radial Velocity | 4684.81 | 11.99 |
| 14 Andromedae b | 246.0 | 5.23 | Gas Giant | 2008 | Radial Velocity | 1525.58 | 12.65 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| YZ Ceti b | 12.0 | 12.07 | Terrestrial | 2017 | Radial Velocity | 0.70 | 0.91 |
| YZ Ceti c | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.14 | 1.05 |
| YZ Ceti d | 12.0 | 12.07 | Super Earth | 2017 | Radial Velocity | 1.09 | 1.03 |
5043 rows × 7 columns
'Magnitude' of all exoplanets discovered in each 'Year'.exo.groupby('Year').mean()
| Distance | Magnitude | Mass | Radius | |
|---|---|---|---|---|
| Year | ||||
| 1995 | 50.00 | 5.45 | 146.20 | 13.97 |
| 1996 | 51.33 | 5.12 | 1020.67 | 13.09 |
| 1997 | 57.00 | 5.41 | 332.10 | 13.53 |
| ... | ... | ... | ... | ... |
| 2021 | 1944.22 | 13.01 | 255.42 | 4.44 |
| 2022 | 508.61 | 10.62 | 943.16 | 6.77 |
| 2023 | 451.89 | 12.09 | 162.78 | 7.12 |
29 rows × 4 columns
exo.groupby('Year').mean().plot(
kind='line',
y='Magnitude',
linewidth=3.0);
It looks like the brightest planets were discovered first, which makes sense.
NASA's Kepler space telescope began its nine-year mission in 2009, leading to a boom in the discovery of exoplanets.
df, usedf.plot(
kind='line',
x=x_column_for_horizontal,
y=y_column_for_vertical
)
x= argument.If you're curious how line plots work under the hood, watch this video we made a few quarters ago.
YouTubeVideo('glzZ04D1kDg')
'Type's of exoplanets, on average?types = exo.groupby('Type').mean()
types
| Distance | Magnitude | Year | Mass | Radius | |
|---|---|---|---|---|---|
| Type | |||||
| Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
| Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
| Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
| Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |
types.plot(kind='barh', y='Radius');
types.plot(kind='barh', y='Mass');
'Gas Giant's are aptly named!
df, usedf.plot(
kind='barh',
x=categorical_column_name,
y=numerical_column_name
)
'barh' stands for "horizontal".y='Mass' even though mass is measured by x-axis length.What are the most popular 'Detection' methods for discovering exoplanets?
# Count how many exoplanets are discovered by each detection method.
popular_detection = exo.groupby('Detection').count()
popular_detection
| Distance | Magnitude | Type | Year | Mass | Radius | |
|---|---|---|---|---|---|---|
| Detection | ||||||
| Astrometry | 1 | 1 | 1 | 1 | 1 | 1 |
| Direct Imaging | 50 | 50 | 50 | 50 | 50 | 50 |
| Disk Kinematics | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| Radial Velocity | 1019 | 1019 | 1019 | 1019 | 1019 | 1019 |
| Transit | 3914 | 3914 | 3914 | 3914 | 3914 | 3914 |
| Transit Timing Variations | 23 | 23 | 23 | 23 | 23 | 23 |
11 rows × 6 columns
# Give columns more meaningful names and eliminate redundancy.
popular_detection = (popular_detection.assign(Count=popular_detection.get('Distance'))
.get(['Count'])
.sort_values(by='Count', ascending=False)
)
popular_detection
| Count | |
|---|---|
| Detection | |
| Transit | 3914 |
| Radial Velocity | 1019 |
| Direct Imaging | 50 |
| ... | ... |
| Astrometry | 1 |
| Disk Kinematics | 1 |
| Pulsar Timing | 1 |
11 rows × 1 columns
# Notice that the bars appear in the opposite order relative to the DataFrame.
popular_detection.plot(kind='barh', y='Count');
# Change "barh" to "bar" to get a vertical bar chart.
# These are harder to read, but the bars do appear in the same order as the DataFrame.
popular_detection.plot(kind='bar', y='Count');
Can we look at both the average 'Magnitude' and the average 'Radius' for each 'Type' at the same time?
bars = types.get(['Magnitude', 'Radius']).plot(kind='barh');
for i, patch in enumerate(bars.patches):
if i<4:
patch.set_hatch('//')
plt.legend();
How did we do that?
When calling .plot, if we omit the y=column_name argument, all other columns are plotted.
types
| Distance | Magnitude | Year | Mass | Radius | |
|---|---|---|---|---|---|
| Type | |||||
| Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
| Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
| Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
| Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |
types.plot(kind='barh');
Remember, to select multiple columns, use .get([column_1, ..., column_k]). This returns a DataFrame.
types
| Distance | Magnitude | Year | Mass | Radius | |
|---|---|---|---|---|---|
| Type | |||||
| Gas Giant | 1096.40 | 10.30 | 2013.73 | 1472.39 | 12.74 |
| Neptune-like | 2189.02 | 13.52 | 2016.59 | 15.28 | 3.11 |
| Super Earth | 1916.26 | 13.85 | 2016.43 | 5.81 | 1.58 |
| Terrestrial | 1373.60 | 13.45 | 2016.37 | 1.62 | 0.85 |
types.get(['Magnitude', 'Radius'])
| Magnitude | Radius | |
|---|---|---|
| Type | ||
| Gas Giant | 10.30 | 12.74 |
| Neptune-like | 13.52 | 3.11 |
| Super Earth | 13.85 | 1.58 |
| Terrestrial | 13.45 | 0.85 |
types.get(['Magnitude', 'Radius']).plot(kind='barh');