# Run this cell to set up packages for lecture.
from lec05_imports import *
Agenda¶
- Querying.
- Querying with multiple conditions.
- Grouping.
- After class: challenge problems.
Don't forget about these resources!¶
You belong here! 🤝¶
- We're moving very quickly in this class.
- This may be the first time you're ever writing code, and you may question whether or not you belong in this class, or if data science is for you.
- We promise, no matter what your prior experience is, the answer is yes, you belong!
- Please come to office hours (see the schedule here) and post on Campuswire for help – we're here to make sure you succeed in this course.
The data: US states 🗽¶
We'll continue working with the same data from last time.
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states = states.set_index('State')
states
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
| ... | ... | ... | ... | ... | ... | ... |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
Example 5: Which states are in the West?¶
Key concept: Querying.
The problem¶
We want to create a DataFrame consisting of only the states whose 'Region' is 'West'. How do we do that?
The solution¶
# This DataFrame only contains rows where the 'Region' is 'West'!
only_west = states[states.get('Region') == 'West']
only_west
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
| California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
| ... | ... | ... | ... | ... | ... | ... |
| Utah | West | Salt Lake City | 3271616 | 82170 | Republican | 39.82 |
| Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
13 rows × 6 columns
🤯 What just happened?
Aside: Booleans¶
- When we compare two values, the result is either
TrueorFalse.- Notice, these words are not in quotes.
boolis a data type in Python, just likeint,float, andstr.- It stands for "Boolean", named after George Boole, an early mathematician.
- There are only two possible Boolean values:
TrueorFalse.- Yes or no.
- On or off.
- 1 or 0.
5 == 6
False
type(5 == 6)
bool
9 + 10 < 21
True
Comparison operators¶
There are several types of comparisons we can make.
| symbol | meaning |
|---|---|
== |
equal to |
!= |
not equal to |
< |
less than |
<= |
less than or equal to |
> |
greater than |
>= |
greater than or equal to |
When comparing an entire Series to a single value, the result is a Series of bools (via broadcasting).
states
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
| ... | ... | ... | ... | ... | ... | ... |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
states.get('Region') == 'West'
State
Alabama False
Alaska True
Arizona True
...
West Virginia False
Wisconsin False
Wyoming True
Name: Region, Length: 50, dtype: bool
What is a query? 🤔¶
- A query is code that extracts rows from a DataFrame for which certain condition(s) are true.
- We use queries to filter DataFrames to contain only the rows that satisfy given conditions.
How do we query a DataFrame?¶
To select only certain rows of states:
- Make a sequence (list/array/Series) of
Trues (keep) andFalses (toss), usually by making a comparison. - Then pass it into
states[sequence_goes_here].
states[states.get('Region') == 'West']
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
| California | West | Sacramento | 39538223 | 155779 | Democratic | 253.81 |
| ... | ... | ... | ... | ... | ... | ... |
| Utah | West | Salt Lake City | 3271616 | 82170 | Republican | 39.82 |
| Washington | West | Olympia | 7705281 | 66456 | Democratic | 115.95 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
13 rows × 6 columns
What if the condition isn't satisfied?¶
states[states.get('Region') == 'Pacific Northwest']
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State |
Example 6: What proportion of US states are Republican?¶
Key concept: Shape of a DataFrame.
Strategy¶
- Query to extract a DataFrame of just the states where the
'Party'is'Republican'. - Count the number of such states.
- Divide by the total number of states.
only_rep = states[states.get('Party') == 'Republican']
only_rep
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
| ... | ... | ... | ... | ... | ... | ... |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
31 rows × 6 columns
Shape of a DataFrame¶
.shapereturns the number of rows and columns in a given DataFrame..shapeis not a method, so we don't use parentheses..shapeis an attribute, as it describes the DataFrame.
- Access each with
[]:.shape[0]for rows..shape[1]for columns.
only_rep.shape
(31, 6)
# Number of rows.
only_rep.shape[0]
31
# Number of columns.
only_rep.shape[1]
6
# What proportion of US states are Republican?
only_rep.shape[0] / states.shape[0]
0.62
Example 7: Which Midwestern state has the most land area?¶
Key concepts: Working with the index. Combining multiple steps.
Strategy¶
- Query to extract a DataFrame of just the states in the
'Midwest'. - Sort by
'Land Area'in descending order. - Extract the first element from the index.
midwest = states[states.get('Region') == 'Midwest']
midwest
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Illinois | Midwest | Springfield | 12812508 | 55519 | Democratic | 230.78 |
| Indiana | Midwest | Indianapolis | 6785528 | 35826 | Republican | 189.40 |
| Iowa | Midwest | Des Moines | 3190369 | 55857 | Republican | 57.12 |
| ... | ... | ... | ... | ... | ... | ... |
| Ohio | Midwest | Columbus | 11799448 | 40861 | Republican | 288.77 |
| South Dakota | Midwest | Pierre | 886667 | 75811 | Republican | 11.70 |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
12 rows × 6 columns
midwest_sorted = midwest.sort_values(by='Land Area', ascending=False)
midwest_sorted
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Kansas | Midwest | Topeka | 2937880 | 81759 | Republican | 35.93 |
| Minnesota | Midwest | Saint Paul | 5706494 | 79627 | Democratic | 71.67 |
| Nebraska | Midwest | Lincoln | 1961504 | 76824 | Republican | 25.53 |
| ... | ... | ... | ... | ... | ... | ... |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
| Ohio | Midwest | Columbus | 11799448 | 40861 | Republican | 288.77 |
| Indiana | Midwest | Indianapolis | 6785528 | 35826 | Republican | 189.40 |
12 rows × 6 columns
- The answer is Kansas, but how do we get it in code?
midwest_sorted.get('State').iloc[0]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[18], line 1 ----> 1 midwest_sorted.get('State').iloc[0] File ~\OneDrive\Documents\DSC 10\dsc10\dsc10-2026-sp-private\.venv\Lib\site-packages\babypandas\utils.py:20, in suppress_warnings.<locals>.wrapper(*args, **kwargs) 18 with warnings.catch_warnings(): 19 warnings.simplefilter("ignore") ---> 20 return func(*args, **kwargs) File ~\OneDrive\Documents\DSC 10\dsc10\dsc10-2026-sp-private\.venv\Lib\site-packages\babypandas\bpd.py:328, in DataFrame.get(self, key) 326 if any(mask): 327 k = [key] if isinstance(key, str) else key --> 328 raise KeyError("{} not found in columns".format(np.array(k)[mask])) 330 f = _lift_to_pd(self._pd.get) 331 return f(key=key) KeyError: "['State'] not found in columns"
Working with the index¶
- We can't use
.getbecause.getis only for columns, and there is no column called'State'.- Instead,
'State'is the index of the DataFrame.
- Instead,
- To extract the index of a DataFrame, use
.index.- Like
.shape, this is an attribute of the DataFrame, not a method. Don't use parentheses.
- Like
- Access particular elements in the index with
[].
midwest_sorted.index
Index(['Kansas', 'Minnesota', 'Nebraska', 'South Dakota', 'North Dakota',
'Missouri', 'Michigan', 'Iowa', 'Illinois', 'Wisconsin', 'Ohio',
'Indiana'],
dtype='object', name='State')
midwest_sorted.index[0]
'Kansas'
Combining multiple steps¶
It is not necessary to define the intermediate variables
midwestandmidwest_sorted. We can do everything in one line of code.When solving a multi-step problem, develop your solution incrementally. Write one piece of code at a time and run it.
# Final solution, which you should build up one step at a time.
states[states.get('Region') == 'Midwest'].sort_values(by='Land Area', ascending=False).index[0]
'Kansas'
- If a line of code gets too long, enclose it in parentheses to split it over multiple lines.
# You can space your code out like this if needed.
(
states[states.get('Region') == 'Midwest']
.sort_values(by='Land Area', ascending=False)
.index[0]
)
'Kansas'
Concept Check ✅ – Answer at cc.dsc10.com¶
Which expression below evaluates to the total population of the 'West'?
A. states[states.get('Region') == 'West'].get('Population').sum()
B. states.get('Population').sum()[states.get('Region') == 'West']
C. states['West'].get('Population').sum()
D. More than one of the above.
...
Ellipsis
Example 8: What are the top three most-populated Republican states in the South?¶
Key concepts: Queries with multiple conditions. Selecting rows by position.
Multiple conditions¶
- To write a query with multiple conditions, use
&for "and" and|for "or".&: All conditions must be true.|: At least one condition must be true.
- You must use
(parentheses)around each condition! - 🚨 Don't use the Python keywords
andandorhere! They do not behave as you'd want.- See BPD 10.3 for an explanation.
states[(states.get('Party') == 'Republican') & (states.get('Region') == 'South')]
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
| Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
| Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
| ... | ... | ... | ... | ... | ... | ... |
| Tennessee | South | Nashville | 6910840 | 41235 | Republican | 167.60 |
| Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
13 rows × 6 columns
# You can also add line breaks within brackets.
states[(states.get('Party') == 'Republican') &
(states.get('Region') == 'South')]
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
| Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
| Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
| ... | ... | ... | ... | ... | ... | ... |
| Tennessee | South | Nashville | 6910840 | 41235 | Republican | 167.60 |
| Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
13 rows × 6 columns
The & and | operators work element-wise!¶
(states.get('Party') == 'Republican')
State
Alabama True
Alaska True
Arizona True
...
West Virginia True
Wisconsin True
Wyoming True
Name: Party, Length: 50, dtype: bool
(states.get('Region') == 'South')
State
Alabama True
Alaska False
Arizona False
...
West Virginia True
Wisconsin False
Wyoming False
Name: Region, Length: 50, dtype: bool
(states.get('Party') == 'Republican') & (states.get('Region') == 'South')
State
Alabama True
Alaska False
Arizona False
...
West Virginia True
Wisconsin False
Wyoming False
Length: 50, dtype: bool
Original Question: What are the top three most-populated Republican states in the South?¶
(
states[(states.get('Party') == 'Republican') &
(states.get('Region') == 'South')]
.sort_values(by='Population', ascending=False)
)
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
| Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
| Georgia | South | Atlanta | 10711908 | 57513 | Republican | 186.25 |
| ... | ... | ... | ... | ... | ... | ... |
| Arkansas | South | Little Rock | 3011524 | 52035 | Republican | 57.87 |
| Mississippi | South | Jackson | 2961279 | 46923 | Republican | 63.11 |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
13 rows × 6 columns
How do we extract the first three rows of this DataFrame?
Using .take to select rows by position¶
- Querying allows us to select rows that satisfy a certain condition.
- We can also select rows in specific positions with
.take(sequence_of_integer_positions). This keeps only the rows whose positions are in the specified sequence (list/array).- This is analogous to using
.iloc[]on a Series. - It's rare to need to select rows by integer position. Querying is far more useful.
- This is analogous to using
(
states[(states.get('Party') == 'Republican') &
(states.get('Region')=='South')]
.sort_values(by='Population', ascending=False)
.take([0, 1, 2])
)
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Texas | South | Austin | 29145505 | 261232 | Republican | 111.57 |
| Florida | South | Tallahassee | 21538187 | 53625 | Republican | 401.64 |
| Georgia | South | Atlanta | 10711908 | 57513 | Republican | 186.25 |
.take(np.arange(3))could equivalently be used in place of.take([0, 1, 2]).
Extra Practice¶
Write code to answer each question below.
- What is the capital city of the state in the
'West'with the largest land area? - How many states in the
'Northeast'have more land area than an average US state? - What is the total population of the
'Midwest','South', and'Northeast?
✅ Click here to see the answers after you've attempted the problems on your own.
- What is the capital city of the state in the West with the largest land area?
states[states.get('Region') == 'West'].sort_values(by='Land Area', ascending=False).get('Capital City').iloc[0]
- How many states in the Northeast have more land area than an average US state?
states[(states.get('Region') == 'Northeast') &
(states.get('Land Area') > states.get('Land Area').mean())].shape[0]
- What is the total population of the Midwest, South, and Northeast?
states[(states.get('Region') == 'Midwest') |
(states.get('Region') == 'South') |
(states.get('Region') == 'Northeast')].get('Population').sum()
or equivalently:
states.get('Population').sum() - states[states.get('Region') == 'West'].get('Population').sum()
...
Ellipsis
Adjusting columns¶
Adjusting columns with .assign, .drop, and .get¶
- To rename a column, use
.assignto create a new column containing the same values as an existing column.- New columns are added on the right.
states.assign(Pop_in_millions=states.get('Population') / 1_000_000)
| Region | Capital City | Population | Land Area | Party | Density | Pop_in_millions | |
|---|---|---|---|---|---|---|---|
| State | |||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 | 5.02 |
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 | 0.73 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 | 7.15 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 | 1.79 |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 | 5.89 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 | 0.58 |
50 rows × 7 columns
- Then use
.drop(columns=list_of_column_labels)to drop any columns you no longer need.- Alternatively, use
.get(list_of_column_labels)to keep only certain columns. The columns will appear in the order you specify, so this is also useful for reordering columns!
- Alternatively, use
(
states
.assign(Pop_in_millions=states.get('Population') / 1_000_000)
.drop(columns=['Population'])
)
| Region | Capital City | Land Area | Party | Density | Pop_in_millions | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 50645 | Republican | 99.21 | 5.02 |
| Alaska | West | Juneau | 570641 | Republican | 1.29 | 0.73 |
| Arizona | West | Phoenix | 113594 | Republican | 62.96 | 7.15 |
| ... | ... | ... | ... | ... | ... | ... |
| West Virginia | South | Charleston | 24038 | Republican | 74.62 | 1.79 |
| Wisconsin | Midwest | Madison | 54158 | Republican | 108.82 | 5.89 |
| Wyoming | West | Cheyenne | 97093 | Republican | 5.94 | 0.58 |
50 rows × 6 columns
(
states
.assign(Pop_in_millions=states.get('Population') / 1_000_000)
# .get() is also useful for reordering columns
.get(['Pop_in_millions', 'Capital City'])
)
| Pop_in_millions | Capital City | |
|---|---|---|
| State | ||
| Alabama | 5.02 | Montgomery |
| Alaska | 0.73 | Juneau |
| Arizona | 7.15 | Phoenix |
| ... | ... | ... |
| West Virginia | 1.79 | Charleston |
| Wisconsin | 5.89 | Madison |
| Wyoming | 0.58 | Cheyenne |
50 rows × 2 columns
Two ways to .get¶
- Getting a single column name gives a Series.
- Getting a
listof column names gives a DataFrame (even if the list has just one element).
states.get('Capital City')
State
Alabama Montgomery
Alaska Juneau
Arizona Phoenix
...
West Virginia Charleston
Wisconsin Madison
Wyoming Cheyenne
Name: Capital City, Length: 50, dtype: object
states.get(['Capital City', 'Party'])
| Capital City | Party | |
|---|---|---|
| State | ||
| Alabama | Montgomery | Republican |
| Alaska | Juneau | Republican |
| Arizona | Phoenix | Republican |
| ... | ... | ... |
| West Virginia | Charleston | Republican |
| Wisconsin | Madison | Republican |
| Wyoming | Cheyenne | Republican |
50 rows × 2 columns
states.get(['Capital City'])
| Capital City | |
|---|---|
| State | |
| Alabama | Montgomery |
| Alaska | Juneau |
| Arizona | Phoenix |
| ... | ... |
| West Virginia | Charleston |
| Wisconsin | Madison |
| Wyoming | Cheyenne |
50 rows × 1 columns
Example 9: Which region is most populated?¶
Key concept: Grouping by one column.
Organizing states by region¶
We can find the total population of any one region using the tools we already have.
states[states.get('Region') == 'West'].get('Population').sum()
78588572
states[states.get('Region') == 'Midwest'].get('Population').sum()
68985454
But can we find the total population of every region all at the same time, without writing very similar code multiple times? Yes, there is a better way!
A new method: .groupby¶
This expression calculates the total population of each region, all at once!
states.get(['Region', 'Population']).groupby('Region').sum()
| Population | |
|---|---|
| Region | |
| Midwest | 68985454 |
| Northeast | 57609148 |
| South | 125576562 |
| West | 78588572 |
What just happened? 🤯
An illustrative example: Pets 🐱 🐶🐹¶
Consider the DataFrame pets, shown below.
| Species | Color | Weight | Age | |
|---|---|---|---|---|
| 0 | dog | black | 40 | 5.0 |
| 1 | cat | golden | 15 | 8.0 |
| 2 | cat | black | 20 | 9.0 |
| 3 | dog | white | 80 | 2.0 |
| 4 | dog | golden | 25 | 0.5 |
| 5 | hamster | golden | 1 | 3.0 |
Let's see what happens under the hood when we run pets.groupby('Species').max().
show_grouping_animation()
Let's try it out!¶
pets = bpd.DataFrame().assign(
Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
Weight=[40, 15, 20, 80, 25, 1],
Age=[5, 8, 9, 2, 0.5, 3]
)
pets
| Species | Color | Weight | Age | |
|---|---|---|---|---|
| 0 | dog | black | 40 | 5.0 |
| 1 | cat | golden | 15 | 8.0 |
| 2 | cat | black | 20 | 9.0 |
| 3 | dog | white | 80 | 2.0 |
| 4 | dog | golden | 25 | 0.5 |
| 5 | hamster | golden | 1 | 3.0 |
pets.groupby('Species').max()
| Color | Weight | Age | |
|---|---|---|---|
| Species | |||
| cat | golden | 20 | 9.0 |
| dog | white | 80 | 5.0 |
| hamster | golden | 1 | 3.0 |
pets.get(['Species', 'Weight', 'Age'])
| Species | Weight | Age | |
|---|---|---|---|
| 0 | dog | 40 | 5.0 |
| 1 | cat | 15 | 8.0 |
| 2 | cat | 20 | 9.0 |
| 3 | dog | 80 | 2.0 |
| 4 | dog | 25 | 0.5 |
| 5 | hamster | 1 | 3.0 |
It takes several steps to go from the original pets DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.
Back to states: which region is most populated?¶
states
| Region | Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|---|
| State | ||||||
| Alabama | South | Montgomery | 5024279 | 50645 | Republican | 99.21 |
| Alaska | West | Juneau | 733391 | 570641 | Republican | 1.29 |
| Arizona | West | Phoenix | 7151502 | 113594 | Republican | 62.96 |
| ... | ... | ... | ... | ... | ... | ... |
| West Virginia | South | Charleston | 1793716 | 24038 | Republican | 74.62 |
| Wisconsin | Midwest | Madison | 5893718 | 54158 | Republican | 108.82 |
| Wyoming | West | Cheyenne | 576851 | 97093 | Republican | 5.94 |
50 rows × 6 columns
- We only need the
'Region'and'Population'columns to answer the question, so we select those first using.get. - We want to group together states that are in the same
'Region'and sum their'Population's.
states.get(['Region', 'Population']).groupby('Region').sum()
| Population | |
|---|---|
| Region | |
| Midwest | 68985454 |
| Northeast | 57609148 |
| South | 125576562 |
| West | 78588572 |
# Note the use of .index – remember, the index isn't a column!
(
states
.get(['Region', 'Population'])
.groupby('Region')
.sum()
.sort_values(by='Population', ascending=False)
.index[0]
)
'South'
Using .groupby in general¶
In short, .groupby aggregates (collects) all rows with the same value in a specified column (e.g. 'Region') into a single row in the resulting DataFrame, combining the values from the different rows using an aggregation method (e.g. .sum()).
To use .groupby:
- Isolate the columns you'll need.
- Use
.get([...])to select the columns you'll need. - Rule of thumb:
- Choose a categorical column to group by (e.g.
Region). - Choose one or more columns to aggregate (e.g.
Population,Land Area, etc.). Most commonly, these are numerical columns. - The aggregation method should make sense for the data in the columns you want to aggregate (e.g.
.mean()works forintandfloatvalues, but not for strings).
- Choose a categorical column to group by (e.g.
- Use
- Make groups with
.groupby(categorical_column)..groupby(categorical_column)will gather rows which have the same value in the specified column.- In the resulting DataFrame, there will be one row for every unique value in that column.
- Follow up with an aggregation method.
- The aggregation method is applied within each group.
- The aggregation method is applied individually to each column.
- If it doesn't make sense to use the aggregation method on a column, you will get an error or nonsensical result.
- Aggregation methods you should know:
.count(),.sum(),.mean(),.median(),.max(), and.min().- 🚨 Note: In this class, avoid using
.size()which is similar but not the same as.count().
- 🚨 Note: In this class, avoid using
The role of .get¶
Earlier we used .groupby to find the total population of each region as follows.
states.get(['Region', 'Population']).groupby('Region').sum()
| Population | |
|---|---|
| Region | |
| Midwest | 68985454 |
| Northeast | 57609148 |
| South | 125576562 |
| West | 78588572 |
Observe what happens if we don't start with .get.
states.groupby('Region').sum()
| Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|
| Region | |||||
| Midwest | SpringfieldIndianapolisDes MoinesTopekaLansing... | 68985454 | 750524 | DemocraticRepublicanRepublicanRepublicanRepubl... | 1298.78 |
| Northeast | HartfordAugustaBostonConcordTrentonAlbanyHarri... | 57609148 | 161912 | DemocraticDemocraticDemocraticDemocraticDemocr... | 4957.49 |
| South | MontgomeryLittle RockDoverTallahasseeAtlantaFr... | 125576562 | 868356 | RepublicanRepublicanDemocraticRepublicanRepubl... | 3189.37 |
| West | JuneauPhoenixSacramentoDenverHonoluluBoiseHele... | 78588572 | 1751054 | RepublicanRepublicanDemocraticDemocraticDemocr... | 881.62 |
Observations on grouping¶
- After grouping, the index changes. The new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.
💡 Pro-Tip: look for keywords "per," "for each," and "indexed by" when solving problems.
- The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, you will get an error or a nonsensical result.
- Since the aggregation method is applied to each column separately, the rows of the resulting DataFrame need to be interpreted with care.
- The column names don't make sense after grouping with the
.count()aggregation method.
states.groupby('Region').count()
| Capital City | Population | Land Area | Party | Density | |
|---|---|---|---|---|---|
| Region | |||||
| Midwest | 12 | 12 | 12 | 12 | 12 |
| Northeast | 9 | 9 | 9 | 9 | 9 |
| South | 16 | 16 | 16 | 16 | 16 |
| West | 13 | 13 | 13 | 13 | 13 |
Dropping, renaming, and reordering columns¶
Consider dropping unneeded columns and renaming columns as follows:
- Use
.assignto create a new column containing the same values as the old column(s). - Use
.drop(columns=list_of_column_labels)to drop the old column(s).- Alternatively, use
.get(list_of_column_labels)to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
- Alternatively, use
states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
States=states_by_region.get('Capital City')
).get(['States'])
states_by_region
| States | |
|---|---|
| Region | |
| Midwest | 12 |
| Northeast | 9 |
| South | 16 |
| West | 13 |
Challenge Problems: IMDb dataset 🎞️¶
We won't cover this section in class. Instead, it's here for you to practice with some harder examples that are typical of what we will ask in exams or quizzes.
The video below walks through the solutions (it's also linked here). You can also see the solutions by clicking the "✅ Click here to see the answer." button below each question.
from IPython.display import YouTubeVideo
YouTubeVideo('xg7rnjWnZ48')
Before watching the video or looking at the solutions, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams! Feel free to ask about them in office hours or on Campuswire.
imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating')
imdb
| Votes | Rating | Year | Decade | |
|---|---|---|---|---|
| Title | ||||
| Relatos salvajes | 46987 | 8.0 | 2014 | 2010 |
| La battaglia di Algeri | 32385 | 8.0 | 1966 | 1960 |
| Jaws | 364225 | 8.0 | 1975 | 1970 |
| ... | ... | ... | ... | ... |
| The Godfather: Part II | 692753 | 9.0 | 1974 | 1970 |
| The Godfather | 1027398 | 9.2 | 1972 | 1970 |
| The Shawshank Redemption | 1498733 | 9.2 | 1994 | 1990 |
250 rows × 4 columns
Question: How many movies appear from each decade?¶
imdb.groupby('Decade').count()
| Votes | Rating | Year | |
|---|---|---|---|
| Decade | |||
| 1920 | 4 | 4 | 4 |
| 1930 | 7 | 7 | 7 |
| 1940 | 14 | 14 | 14 |
| ... | ... | ... | ... |
| 1990 | 42 | 42 | 42 |
| 2000 | 50 | 50 | 50 |
| 2010 | 29 | 29 | 29 |
10 rows × 3 columns
# We'll learn how to make plots like this in the next lecture!
imdb.groupby('Decade').count().plot(y='Year');
Question: What was the highest rated movie of the 1990s?¶
Let's try to do this two different ways.
Without grouping¶
imdb[imdb.get('Decade') == 1990].sort_values('Rating', ascending=False).index[0]
'The Shawshank Redemption'
Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different than the way we extract columns, with .get(), because the index is not a column.
With grouping¶
imdb.reset_index().groupby('Decade').max()
| Title | Votes | Rating | Year | |
|---|---|---|---|---|
| Decade | ||||
| 1920 | The Kid | 98794 | 8.3 | 1927 |
| 1930 | The Wizard of Oz | 259235 | 8.5 | 1939 |
| 1940 | The Treasure of the Sierra Madre | 350551 | 8.6 | 1949 |
| ... | ... | ... | ... | ... |
| 1990 | Unforgiven | 1498733 | 9.2 | 1999 |
| 2000 | Yip Man | 1473049 | 8.9 | 2009 |
| 2010 | X-Men: Days of Future Past | 1271949 | 8.7 | 2015 |
10 rows × 4 columns
- It turns out that this method does not yield the correct answer.
- When we use an aggregation method (e.g.
.max()), aggregation is done to each column individually. - While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
- Taking the
maxis not helpful here.
Question: How many years have more than 3 movies rated above 8.5?¶
✅ Click here to see the answer.
good_movies_per_year = imdb[imdb.get('Rating') > 8.5].groupby('Year').count()
good_movies_per_year[good_movies_per_year.get('Votes') > 3].shape[0]
As mentioned below, you can also use:
(good_movies_per_year.get('Votes') > 3).sum()
Aside: Using .sum() on a boolean array¶
- Summing a boolean array gives a count of the number of
Trueelements because Python treatsTrueas 1 andFalseas 0. - Can you use that fact here?
Question: Out of the years with more than 3 movies, which had the highest average rating?¶
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3
# The year with the highest average rating, among the years with more than 3 movies
imdb.groupby('Year').mean()[more_than_3_ix].sort_values(by='Rating').index[-1]
Question: Which year had the longest movie titles, on average?¶
Hint: Use .str.len() on the column or index that contains the names of the movies.
✅ Click here to see the answer.
(
imdb.assign(title_length=imdb.index.str.len())
.groupby('Year').mean()
.sort_values(by='title_length')
.index[-1]
)
The year is 1964 – take a look at the movies from 1964 by querying!
Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?¶
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3
# The sum of the ratings of movies from years that had at least 3 movies on the list
total_rating = imdb.groupby('Year').sum()[more_than_3_ix].get('Rating').sum()
# The total number of movies from years that had at least 3 movies on the list
count = imdb.groupby('Year').count()[more_than_3_ix].get('Rating').sum()
# The correct answer
average_rating = total_rating / count
# Close, but incorrect:
# Doesn't account for the fact that different years have different numbers of movies on the list
close_but_wrong = imdb.groupby('Year').mean()[more_than_3_ix].get('Rating').mean()
Summary, next time¶
Summary¶
- To write queries that involve multiple conditions:
- Put parentheses around all conditions.
- Separate conditions using
&if you require all to be true, or|if you require at least one to be true.
df.groupby(column_name).agg_method()aggregates all rows with the same value forcolumn_nameinto a single row in the resulting DataFrame, usingagg_method()to combine values.- Aggregation methods you should know:
.count(),.sum(),.mean(),.median(),.max(), and.min(). - 💡 Pro-Tip: look for keywords "per," "for each," and "indexed by" when solving problems.
- Aggregation methods you should know:
Next time¶
A picture is worth a 1000 words – it's time to visualize!