In [1]:
# Run this cell to set up packages for lecture.
from lec05_imports import *

Lecture 5 – Querying and Grouping¶

DSC 10, Winter 2026¶

Agenda¶

  • Querying.
  • Querying with multiple conditions.
  • Grouping.
  • After class: challenge problems.

Don't forget about these resources!¶

  • DSC 10 reference sheet.
  • babypandas notes.
  • The Resources tab of the course website.

You belong here! 🤝¶

  • We're moving very quickly in this class.
  • This may be the first time you're ever writing code, and you may question whether or not you belong in this class, or if data science is for you.
  • We promise, no matter what your prior experience is, the answer is yes, you belong!
    • Watch: 🎥 Developing a Growth Mindset with Carol Dweck.
  • Please come to office hours (see the schedule here) and post on Campuswire for help – we're here to make sure you succeed in this course.

The data: US states 🗽¶

We'll continue working with the same data from last time.

In [2]:
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states = states.set_index('State')
states
Out[2]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

Example 5: Which states are in the West?¶

Key concept: Querying.

The problem¶

We want to create a DataFrame consisting of only the states whose 'Region' is 'West'. How do we do that?

The solution¶

In [3]:
# This DataFrame only contains rows where the 'Region' is 'West'!
only_west = states[states.get('Region') == 'West']
only_west
Out[3]:
Region Capital City Population Land Area Party Density
State
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
California West Sacramento 39538223 155779 Democratic 253.81
... ... ... ... ... ... ...
Utah West Salt Lake City 3271616 82170 Republican 39.82
Washington West Olympia 7705281 66456 Democratic 115.95
Wyoming West Cheyenne 576851 97093 Republican 5.94

13 rows × 6 columns

🤯 What just happened?

Aside: Booleans¶

  • When we compare two values, the result is either True or False.
    • Notice, these words are not in quotes.
  • bool is a data type in Python, just like int, float, and str.
    • It stands for "Boolean", named after George Boole, an early mathematician.
  • There are only two possible Boolean values: True or False.
    • Yes or no.
    • On or off.
    • 1 or 0.
In [4]:
5 == 6
Out[4]:
False
In [5]:
type(5 == 6)
Out[5]:
bool
In [6]:
9 + 10 < 21
Out[6]:
True

Comparison operators¶

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

When comparing an entire Series to a single value, the result is a Series of bools (via broadcasting).

In [7]:
states
Out[7]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [8]:
states.get('Region') == 'West'
Out[8]:
State
Alabama          False
Alaska            True
Arizona           True
                 ...  
West Virginia    False
Wisconsin        False
Wyoming           True
Name: Region, Length: 50, dtype: bool

What is a query? 🤔¶

  • A query is code that extracts rows from a DataFrame for which certain condition(s) are true.
  • We use queries to filter DataFrames to contain only the rows that satisfy given conditions.

How do we query a DataFrame?¶

To select only certain rows of states:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
  2. Then pass it into states[sequence_goes_here].
In [9]:
states[states.get('Region') == 'West']
Out[9]:
Region Capital City Population Land Area Party Density
State
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
California West Sacramento 39538223 155779 Democratic 253.81
... ... ... ... ... ... ...
Utah West Salt Lake City 3271616 82170 Republican 39.82
Washington West Olympia 7705281 66456 Democratic 115.95
Wyoming West Cheyenne 576851 97093 Republican 5.94

13 rows × 6 columns

What if the condition isn't satisfied?¶

In [10]:
states[states.get('Region') == 'Pacific Northwest']
Out[10]:
Region Capital City Population Land Area Party Density
State

Example 6: What proportion of US states are Republican?¶

Key concept: Shape of a DataFrame.

Strategy¶
  1. Query to extract a DataFrame of just the states where the 'Party' is 'Republican'.
  2. Count the number of such states.
  3. Divide by the total number of states.
In [11]:
only_rep = states[states.get('Party') == 'Republican']
only_rep
Out[11]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

31 rows × 6 columns

Shape of a DataFrame¶

  • .shape returns the number of rows and columns in a given DataFrame.
    • .shape is not a method, so we don't use parentheses.
    • .shape is an attribute, as it describes the DataFrame.
  • Access each with []:
    • .shape[0] for rows.
    • .shape[1] for columns.
In [12]:
only_rep.shape
Out[12]:
(31, 6)
In [13]:
# Number of rows.
only_rep.shape[0]
Out[13]:
31
In [14]:
# Number of columns.
only_rep.shape[1]
Out[14]:
6
In [15]:
# What proportion of US states are Republican?
only_rep.shape[0] / states.shape[0]
Out[15]:
0.62

Example 7: Which Midwestern state has the most land area?¶

Key concepts: Working with the index. Combining multiple steps.

Strategy¶
  1. Query to extract a DataFrame of just the states in the 'Midwest'.
  2. Sort by 'Land Area' in descending order.
  3. Extract the first element from the index.
In [16]:
midwest = states[states.get('Region') == 'Midwest']
midwest
Out[16]:
Region Capital City Population Land Area Party Density
State
Illinois Midwest Springfield 12812508 55519 Democratic 230.78
Indiana Midwest Indianapolis 6785528 35826 Republican 189.40
Iowa Midwest Des Moines 3190369 55857 Republican 57.12
... ... ... ... ... ... ...
Ohio Midwest Columbus 11799448 40861 Republican 288.77
South Dakota Midwest Pierre 886667 75811 Republican 11.70
Wisconsin Midwest Madison 5893718 54158 Republican 108.82

12 rows × 6 columns

In [17]:
midwest_sorted = midwest.sort_values(by='Land Area', ascending=False)
midwest_sorted
Out[17]:
Region Capital City Population Land Area Party Density
State
Kansas Midwest Topeka 2937880 81759 Republican 35.93
Minnesota Midwest Saint Paul 5706494 79627 Democratic 71.67
Nebraska Midwest Lincoln 1961504 76824 Republican 25.53
... ... ... ... ... ... ...
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Ohio Midwest Columbus 11799448 40861 Republican 288.77
Indiana Midwest Indianapolis 6785528 35826 Republican 189.40

12 rows × 6 columns

  • The answer is Kansas, but how do we get it in code?
In [18]:
midwest_sorted.get('State').iloc[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[18], line 1
----> 1 midwest_sorted.get('State').iloc[0]

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/babypandas/utils.py:20, in suppress_warnings.<locals>.wrapper(*args, **kwargs)
     18 with warnings.catch_warnings():
     19     warnings.simplefilter("ignore")
---> 20     return func(*args, **kwargs)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/babypandas/bpd.py:328, in DataFrame.get(self, key)
    326 if any(mask):
    327     k = [key] if isinstance(key, str) else key
--> 328     raise KeyError("{} not found in columns".format(np.array(k)[mask]))
    330 f = _lift_to_pd(self._pd.get)
    331 return f(key=key)

KeyError: "['State'] not found in columns"

Working with the index¶

  • We can't use .get because .get is only for columns, and there is no column called 'State'.
    • Instead, 'State' is the index of the DataFrame.
  • To extract the index of a DataFrame, use .index.
    • Like .shape, this is an attribute of the DataFrame, not a method. Don't use parentheses.
  • Access particular elements in the index with [].
In [19]:
midwest_sorted.index
Out[19]:
Index(['Kansas', 'Minnesota', 'Nebraska', 'South Dakota', 'North Dakota',
       'Missouri', 'Michigan', 'Iowa', 'Illinois', 'Wisconsin', 'Ohio',
       'Indiana'],
      dtype='object', name='State')
In [20]:
midwest_sorted.index[0]
Out[20]:
'Kansas'

Combining multiple steps¶

  • It is not necessary to define the intermediate variables midwest and midwest_sorted. We can do everything in one line of code.

  • When solving a multi-step problem, develop your solution incrementally. Write one piece of code at a time and run it.

In [21]:
# Final solution, which you should build up one step at a time.
states[states.get('Region') == 'Midwest'].sort_values(by='Land Area', ascending=False).index[0]
Out[21]:
'Kansas'
  • If a line of code gets too long, enclose it in parentheses to split it over multiple lines.
In [22]:
# You can space your code out like this if needed.
(
    states[states.get('Region') == 'Midwest']
    .sort_values(by='Land Area', ascending=False)
    .index[0]
)
Out[22]:
'Kansas'

Concept Check ✅ – Answer at cc.dsc10.com¶

Which expression below evaluates to the total population of the 'West'?

A. states[states.get('Region') == 'West'].get('Population').sum()

B. states.get('Population').sum()[states.get('Region') == 'West']

C. states['West'].get('Population').sum()

D. More than one of the above.

In [23]:
...
Out[23]:
Ellipsis

Example 8: What are the top three most-populated Republican states in the South?¶

Key concepts: Queries with multiple conditions. Selecting rows by position.

Multiple conditions¶

  • To write a query with multiple conditions, use & for "and" and | for "or".
    • &: All conditions must be true.
    • |: At least one condition must be true.
  • You must use (parentheses) around each condition!
  • 🚨 Don't use the Python keywords and and or here! They do not behave as you'd want.
    • See BPD 10.3 for an explanation.
In [24]:
states[(states.get('Party') == 'Republican') & (states.get('Region') == 'South')]
Out[24]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Arkansas South Little Rock 3011524 52035 Republican 57.87
Florida South Tallahassee 21538187 53625 Republican 401.64
... ... ... ... ... ... ...
Tennessee South Nashville 6910840 41235 Republican 167.60
Texas South Austin 29145505 261232 Republican 111.57
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

In [25]:
# You can also add line breaks within brackets.
states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]
Out[25]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Arkansas South Little Rock 3011524 52035 Republican 57.87
Florida South Tallahassee 21538187 53625 Republican 401.64
... ... ... ... ... ... ...
Tennessee South Nashville 6910840 41235 Republican 167.60
Texas South Austin 29145505 261232 Republican 111.57
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

The & and | operators work element-wise!¶

In [26]:
(states.get('Party') == 'Republican')
Out[26]:
State
Alabama          True
Alaska           True
Arizona          True
                 ... 
West Virginia    True
Wisconsin        True
Wyoming          True
Name: Party, Length: 50, dtype: bool
In [27]:
(states.get('Region') == 'South')
Out[27]:
State
Alabama           True
Alaska           False
Arizona          False
                 ...  
West Virginia     True
Wisconsin        False
Wyoming          False
Name: Region, Length: 50, dtype: bool
In [28]:
(states.get('Party') == 'Republican') & (states.get('Region') == 'South')
Out[28]:
State
Alabama           True
Alaska           False
Arizona          False
                 ...  
West Virginia     True
Wisconsin        False
Wyoming          False
Length: 50, dtype: bool

Original Question: What are the top three most-populated Republican states in the South?¶

In [29]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]
    .sort_values(by='Population', ascending=False)
)
Out[29]:
Region Capital City Population Land Area Party Density
State
Texas South Austin 29145505 261232 Republican 111.57
Florida South Tallahassee 21538187 53625 Republican 401.64
Georgia South Atlanta 10711908 57513 Republican 186.25
... ... ... ... ... ... ...
Arkansas South Little Rock 3011524 52035 Republican 57.87
Mississippi South Jackson 2961279 46923 Republican 63.11
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

How do we extract the first three rows of this DataFrame?

Using .take to select rows by position¶

  • Querying allows us to select rows that satisfy a certain condition.
  • We can also select rows in specific positions with .take(sequence_of_integer_positions). This keeps only the rows whose positions are in the specified sequence (list/array).
    • This is analogous to using .iloc[] on a Series.
    • It's rare to need to select rows by integer position. Querying is far more useful.
In [30]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region')=='South')]
    .sort_values(by='Population', ascending=False)
    .take([0, 1, 2])
)
Out[30]:
Region Capital City Population Land Area Party Density
State
Texas South Austin 29145505 261232 Republican 111.57
Florida South Tallahassee 21538187 53625 Republican 401.64
Georgia South Atlanta 10711908 57513 Republican 186.25
  • .take(np.arange(3)) could equivalently be used in place of .take([0, 1, 2]).

Extra Practice¶

Write code to answer each question below.

  1. What is the capital city of the state in the 'West' with the largest land area?
  2. How many states in the 'Northeast' have more land area than an average US state?
  3. What is the total population of the 'Midwest', 'South', and 'Northeast?
✅ Click here to see the answers after you've attempted the problems on your own.
  1. What is the capital city of the state in the West with the largest land area?
states[states.get('Region') == 'West'].sort_values(by='Land Area', ascending=False).get('Capital City').iloc[0]
  1. How many states in the Northeast have more land area than an average US state?
states[(states.get('Region') == 'Northeast') & 
       (states.get('Land Area') > states.get('Land Area').mean())].shape[0]
 
  1. What is the total population of the Midwest, South, and Northeast?
states[(states.get('Region') == 'Midwest') | 
       (states.get('Region') == 'South') | 
       (states.get('Region') == 'Northeast')].get('Population').sum()

     Alternate solution to 3:

states.get('Population').sum() - states[states.get('Region') == 'West'].get('Population').sum()
    
In [31]:
...
Out[31]:
Ellipsis

Adjusting columns¶

Adjusting columns with .assign, .drop, and .get¶

  1. To rename a column, use .assign to create a new column containing the same values as an existing column.
    • New columns are added on the right.
In [32]:
states.assign(Pop_in_millions=states.get('Population') / 1_000_000)
Out[32]:
Region Capital City Population Land Area Party Density Pop_in_millions
State
Alabama South Montgomery 5024279 50645 Republican 99.21 5.02
Alaska West Juneau 733391 570641 Republican 1.29 0.73
Arizona West Phoenix 7151502 113594 Republican 62.96 7.15
... ... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62 1.79
Wisconsin Midwest Madison 5893718 54158 Republican 108.82 5.89
Wyoming West Cheyenne 576851 97093 Republican 5.94 0.58

50 rows × 7 columns

  1. Then use .drop(columns=list_of_column_labels) to drop any columns you no longer need.
    • Alternatively, use .get(list_of_column_labels) to keep only certain columns. The columns will appear in the order you specify, so this is also useful for reordering columns!
In [33]:
(
    states
    .assign(Pop_in_millions=states.get('Population') / 1_000_000)
    .drop(columns=['Population'])
)
Out[33]:
Region Capital City Land Area Party Density Pop_in_millions
State
Alabama South Montgomery 50645 Republican 99.21 5.02
Alaska West Juneau 570641 Republican 1.29 0.73
Arizona West Phoenix 113594 Republican 62.96 7.15
... ... ... ... ... ... ...
West Virginia South Charleston 24038 Republican 74.62 1.79
Wisconsin Midwest Madison 54158 Republican 108.82 5.89
Wyoming West Cheyenne 97093 Republican 5.94 0.58

50 rows × 6 columns

In [34]:
(
    states
    .assign(Pop_in_millions=states.get('Population') / 1_000_000)
    # .get() is also useful for reordering columns
    .get(['Pop_in_millions', 'Capital City'])
)
Out[34]:
Pop_in_millions Capital City
State
Alabama 5.02 Montgomery
Alaska 0.73 Juneau
Arizona 7.15 Phoenix
... ... ...
West Virginia 1.79 Charleston
Wisconsin 5.89 Madison
Wyoming 0.58 Cheyenne

50 rows × 2 columns

Two ways to .get¶

  1. Getting a single column name gives a Series.
  2. Getting a list of column names gives a DataFrame. (Even if the list has just one element!)
In [35]:
states.get('Capital City')
Out[35]:
State
Alabama          Montgomery
Alaska               Juneau
Arizona             Phoenix
                    ...    
West Virginia    Charleston
Wisconsin           Madison
Wyoming            Cheyenne
Name: Capital City, Length: 50, dtype: object
In [36]:
states.get(['Capital City', 'Party'])
Out[36]:
Capital City Party
State
Alabama Montgomery Republican
Alaska Juneau Republican
Arizona Phoenix Republican
... ... ...
West Virginia Charleston Republican
Wisconsin Madison Republican
Wyoming Cheyenne Republican

50 rows × 2 columns

In [37]:
states.get(['Capital City'])
Out[37]:
Capital City
State
Alabama Montgomery
Alaska Juneau
Arizona Phoenix
... ...
West Virginia Charleston
Wisconsin Madison
Wyoming Cheyenne

50 rows × 1 columns

Example 9: Which region is most populated?¶

Key concept: Grouping by one column.

Organizing states by region¶

We can find the total population of any one region using the tools we already have.

In [38]:
states[states.get('Region') == 'West'].get('Population').sum()
Out[38]:
np.int64(78588572)
In [39]:
states[states.get('Region') == 'Midwest'].get('Population').sum()
Out[39]:
np.int64(68985454)

But can we find the total population of every region all at the same time, without writing very similar code multiple times? Yes, there is a better way!

A new method: .groupby¶

Observe what happens when we use the .groupby method on states with the argument 'Region'.

In [40]:
states.get(['Region', 'Population']).groupby('Region').sum()
Out[40]:
Population
Region
Midwest 68985454
Northeast 57609148
South 125576562
West 78588572

These populations (for the 'West' and 'Midwest') match the ones we found on the previous slide, except now we get the populations for all regions at the same time. What just happened? 🤯

An illustrative example: Pets 🐱 🐶🐹¶

Consider the DataFrame pets, shown below.

Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0

Let's see what happens under the hood when we run pets.groupby('Species').mean().

In [41]:
show_grouping_animation()

Let's try it out!¶

In [42]:
pets = bpd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets
Out[42]:
Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0
In [43]:
pets[pets.get('Species') == 'hamster'].get(['Species', 'Weight', 'Age'])
Out[43]:
Species Weight Age
5 hamster 1 3.0
In [44]:
pets.get(['Species', 'Weight', 'Age']).groupby('Species').mean()
Out[44]:
Weight Age
Species
cat 17.50 8.5
dog 48.33 2.5
hamster 1.00 3.0

It takes several steps to go from the original pets DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.

Back to states: which region is most populated?¶

In [45]:
states
Out[45]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [46]:
states.get(['Region', 'Population']).groupby('Region').sum()
Out[46]:
Population
Region
Midwest 68985454
Northeast 57609148
South 125576562
West 78588572
In [47]:
# Note the use of .index – remember, the index isn't a column!
(
    states
    .groupby('Region')
    .sum()
    .sort_values(by='Population', ascending=False)
    .index[0]
)
Out[47]:
'South'

Using .groupby in general¶

In short, .groupby aggregates (collects) all rows with the same value in a specified column (e.g. 'Region') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values from different rows with the same value in the specified column.

To use .groupby:

  1. Choose a column to group by and column(s) to aggregate with .get().
    • .get(['Region', 'Population', ...]) to select out the columns you want to group by and aggregate together.
    • Rule of thumb:
      • Choose one categorical column to group by (in this case, Region).
      • Choose one or more numeric columns to aggregate (Population, Land Area, etc.)
      • Don't include non-numeric columns to aggregate unless your aggregation method doesn't perform arithmetic (e.g. .count() is okay on string columns, but not .mean()).
  1. Make groups with .groupby().
    • .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
    • In the resulting DataFrame, there will be one row for every unique value in that column.
  1. Call an aggregation method.
    • The aggregation method will be applied within each group.
    • The aggregation method is applied individually to each column.
      • If it doesn't make sense to use the aggregation method on a column, you will get an error or nonsensical result.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

Observations on grouping¶

  1. After grouping, the index changes. The new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.
In [48]:
states
Out[48]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [49]:
states.get(['Region', 'Population']).groupby('Region').sum()
Out[49]:
Population
Region
Midwest 68985454
Northeast 57609148
South 125576562
West 78588572

Tip: look for keywords "per," "for each," and "indexed by" when solving problems.

  1. The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, you can get an (long, scary-looking) error. It really helps to ask an AI tool to help translate these to regular English.
In [50]:
states.get(['Region', 'Party']).groupby('Region').mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:1944, in GroupBy._agg_py_fallback(self, how, values, ndim, alt)
   1943 try:
-> 1944     res_values = self._grouper.agg_series(ser, alt, preserve_dtype=True)
   1945 except Exception as err:

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/ops.py:873, in BaseGrouper.agg_series(self, obj, func, preserve_dtype)
    871     preserve_dtype = True
--> 873 result = self._aggregate_series_pure_python(obj, func)
    875 npvalues = lib.maybe_convert_objects(result, try_float=False)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/ops.py:894, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
    893 for i, group in enumerate(splitter):
--> 894     res = func(group)
    895     res = extract_result(res)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2461, in GroupBy.mean.<locals>.<lambda>(x)
   2458 else:
   2459     result = self._cython_agg_general(
   2460         "mean",
-> 2461         alt=lambda x: Series(x, copy=False).mean(numeric_only=numeric_only),
   2462         numeric_only=numeric_only,
   2463     )
   2464     return result.__finalize__(self.obj, method="groupby")

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/series.py:6570, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
   6562 @doc(make_doc("mean", ndim=1))
   6563 def mean(
   6564     self,
   (...)   6568     **kwargs,
   6569 ):
-> 6570     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/generic.py:12485, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12478 def mean(
  12479     self,
  12480     axis: Axis | None = 0,
   (...)  12483     **kwargs,
  12484 ) -> Series | float:
> 12485     return self._stat_function(
  12486         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12487     )

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/generic.py:12442, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12440 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12442 return self._reduce(
  12443     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12444 )

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/series.py:6478, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   6474     raise TypeError(
   6475         f"Series.{name} does not allow {kwd_name}={numeric_only} "
   6476         "with non-numeric dtypes."
   6477     )
-> 6478 return op(delegate, skipna=skipna, **kwds)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
   1699 if isinstance(x, str):
   1700     # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701     raise TypeError(f"Could not convert string '{x}' to numeric")
   1702 try:

TypeError: Could not convert string 'DemocraticRepublicanRepublicanRepublicanRepublicanDemocraticRepublicanRepublicanRepublicanRepublicanRepublicanRepublican' to numeric

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[50], line 1
----> 1 states.get(['Region', 'Party']).groupby('Region').mean()

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/babypandas/utils.py:20, in suppress_warnings.<locals>.wrapper(*args, **kwargs)
     18 with warnings.catch_warnings():
     19     warnings.simplefilter("ignore")
---> 20     return func(*args, **kwargs)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/babypandas/bpd.py:1514, in DataFrameGroupBy.mean(self)
   1510 """
   1511 Compute mean of group.
   1512 """
   1513 f = _lift_to_pd(self._pd.mean)
-> 1514 return f()

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/babypandas/bpd.py:1596, in _lift_to_pd.<locals>.closure(*vargs, **kwargs)
   1590 vargs = [x._pd if isinstance(x, types) else x for x in vargs]
   1591 kwargs = {
   1592     k: x._pd if isinstance(x, types) else x
   1593     for (k, x) in kwargs.items()
   1594 }
-> 1596 a = func(*vargs, **kwargs)
   1597 if isinstance(a, pd.DataFrame):
   1598     return DataFrame(data=a)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2459, in GroupBy.mean(self, numeric_only, engine, engine_kwargs)
   2452     return self._numba_agg_general(
   2453         grouped_mean,
   2454         executor.float_dtype_mapping,
   2455         engine_kwargs,
   2456         min_periods=0,
   2457     )
   2458 else:
-> 2459     result = self._cython_agg_general(
   2460         "mean",
   2461         alt=lambda x: Series(x, copy=False).mean(numeric_only=numeric_only),
   2462         numeric_only=numeric_only,
   2463     )
   2464     return result.__finalize__(self.obj, method="groupby")

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2005, in GroupBy._cython_agg_general(self, how, alt, numeric_only, min_count, **kwargs)
   2002     result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
   2003     return result
-> 2005 new_mgr = data.grouped_reduce(array_func)
   2006 res = self._wrap_agged_manager(new_mgr)
   2007 if how in ["idxmin", "idxmax"]:

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/internals/managers.py:1488, in BlockManager.grouped_reduce(self, func)
   1484 if blk.is_object:
   1485     # split on object-dtype blocks bc some columns may raise
   1486     #  while others do not.
   1487     for sb in blk._split():
-> 1488         applied = sb.apply(func)
   1489         result_blocks = extend_blocks(applied, result_blocks)
   1490 else:

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/internals/blocks.py:395, in Block.apply(self, func, **kwargs)
    389 @final
    390 def apply(self, func, **kwargs) -> list[Block]:
    391     """
    392     apply the function to my values; return a block if we are not
    393     one
    394     """
--> 395     result = func(self.values, **kwargs)
    397     result = maybe_coerce_values(result)
    398     return self._split_op_result(result)

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2002, in GroupBy._cython_agg_general.<locals>.array_func(values)
   1999     return result
   2001 assert alt is not None
-> 2002 result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
   2003 return result

File ~/repos/dsc10/dsc10-2026-wi-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:1948, in GroupBy._agg_py_fallback(self, how, values, ndim, alt)
   1946     msg = f"agg function failed [how->{how},dtype->{ser.dtype}]"
   1947     # preserve the kind of exception that raised
-> 1948     raise type(err)(msg) from err
   1950 dtype = ser.dtype
   1951 if dtype == object:

TypeError: agg function failed [how->mean,dtype->object]

Or, you can get a nonsensical result:

In [51]:
# whoa, what happened here?
states.get(['Region', 'Party']).groupby('Region').sum()
Out[51]:
Party
Region
Midwest DemocraticRepublicanRepublicanRepublicanRepubl...
Northeast DemocraticDemocraticDemocraticDemocraticDemocr...
South RepublicanRepublicanDemocraticRepublicanRepubl...
West RepublicanRepublicanDemocraticDemocraticDemocr...
  1. Since the aggregation method is applied to each column separately, the rows of the resulting DataFrame need to be interpreted with care.
In [52]:
states.groupby('Region').max()
Out[52]:
Capital City Population Land Area Party Density
Region
Midwest Topeka 12812508 81759 Republican 288.77
Northeast Trenton 20201249 47126 Republican 1263.12
South Tallahassee 29145505 261232 Republican 636.37
West Santa Fe 39538223 570641 Republican 253.81
In [53]:
12812508 / 81759 == 288.77
Out[53]:
False
  1. The column names don't make sense after grouping with the .count() aggregation method.
In [54]:
states.groupby('Region').count()
Out[54]:
Capital City Population Land Area Party Density
Region
Midwest 12 12 12 12 12
Northeast 9 9 9 9 9
South 16 16 16 16 16
West 13 13 13 13 13

Dropping, renaming, and reordering columns¶

Consider dropping unneeded columns and renaming columns as follows:

  1. Use .assign to create a new column containing the same values as the old column(s).
  2. Use .drop(columns=list_of_column_labels) to drop the old column(s).
    • Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
In [55]:
states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
                    States=states_by_region.get('Capital City')
                    ).get(['States'])
states_by_region
Out[55]:
States
Region
Midwest 12
Northeast 9
South 16
West 13

Example Exam Problems: IMDb dataset 🎞️¶

No description has been provided for this image

Extra practice¶

We won't cover this section in class. Instead, it's here for you to practice with some harder examples that are typical of what we will ask in exams or quizzes.

The video below walks through the solutions (it's also linked here). You can also see the solutions by clicking the "✅ Click here to see the answer." button below each question.

In [56]:
from IPython.display import YouTubeVideo
YouTubeVideo('xg7rnjWnZ48')
Out[56]:

Before watching the video or looking at the solutions, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams! Feel free to ask about them in office hours or on Ed.

In [57]:
imdb = bpd.read_csv('data/imdb.csv').set_index('Title').sort_values(by='Rating')
imdb
Out[57]:
Votes Rating Year Decade
Title
Akira 91652 8.0 1988 1980
Per un pugno di dollari 124671 8.0 1964 1960
Guardians of the Galaxy 527349 8.0 2014 2010
... ... ... ... ...
The Godfather: Part II 692753 9.0 1974 1970
The Shawshank Redemption 1498733 9.2 1994 1990
The Godfather 1027398 9.2 1972 1970

250 rows × 4 columns

Question: How many movies appear from each decade?¶

In [58]:
imdb.groupby('Decade').count()
Out[58]:
Votes Rating Year
Decade
1920 4 4 4
1930 7 7 7
1940 14 14 14
... ... ... ...
1990 42 42 42
2000 50 50 50
2010 29 29 29

10 rows × 3 columns

In [59]:
# We'll learn how to make plots like this in the next lecture!
imdb.groupby('Decade').count().plot(y='Year');
No description has been provided for this image

Question: What was the highest rated movie of the 1990s?¶

Let's try to do this two different ways.

Without grouping¶

In [60]:
imdb[imdb.get('Decade') == 1990].sort_values('Rating', ascending=False).index[0]
Out[60]:
'The Shawshank Redemption'

Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different than the way we extract columns, with .get(), because the index is not a column.

With grouping¶

In [61]:
imdb.reset_index().groupby('Decade').max()
Out[61]:
Title Votes Rating Year
Decade
1920 The Kid 98794 8.3 1927
1930 The Wizard of Oz 259235 8.5 1939
1940 The Treasure of the Sierra Madre 350551 8.6 1949
... ... ... ... ...
1990 Unforgiven 1498733 9.2 1999
2000 Yip Man 1473049 8.9 2009
2010 X-Men: Days of Future Past 1271949 8.7 2015

10 rows × 4 columns

  • It turns out that this method does not yield the correct answer.
  • When we use an aggregation method (e.g. .max()), aggregation is done to each column individually.
  • While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
  • Taking the max is not helpful here.

Question: How many years have more than 3 movies rated above 8.5?¶

In [ ]:
 
✅ Click here to see the answer.
good_movies_per_year = imdb[imdb.get('Rating') > 8.5].groupby('Year').count()
good_movies_per_year[good_movies_per_year.get('Votes') > 3].shape[0]    

As mentioned below, you can also use:

(good_movies_per_year.get('Votes') > 3).sum() 

Aside: Using .sum() on a boolean array¶

  • Summing a boolean array gives a count of the number of True elements because Python treats True as 1 and False as 0.
  • Can you use that fact here?
In [ ]:
 

Question: Out of the years with more than 3 movies, which had the highest average rating?¶

In [ ]:
 
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3

# The year with the highest average rating, among the years with more than 3 movies
imdb.groupby('Year').mean()[more_than_3_ix].sort_values(by='Rating').index[-1] 

Question: Which year had the longest movie titles, on average?¶

Hint: Use .str.len() on the column or index that contains the names of the movies.

In [ ]:
 
✅ Click here to see the answer.
(
    imdb.assign(title_length=imdb.index.str.len())
    .groupby('Year').mean()
    .sort_values(by='title_length')
    .index[-1]
)

The year is 1964 – take a look at the movies from 1964 by querying!

Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?¶

In [ ]:
 
✅ Click here to see the answer.
# A Series of Trues and Falses; True when there were at least 3 movies on the list from that year
more_than_3_ix = imdb.groupby('Year').count().get('Votes') > 3

# The sum of the ratings of movies from years that had at least 3 movies on the list
total_rating = imdb.groupby('Year').sum()[more_than_3_ix].get('Rating').sum()

# The total number of movies from years that had at least 3 movies on the list
count = imdb.groupby('Year').count()[more_than_3_ix].get('Rating').sum()

# The correct answer
average_rating = total_rating / count

# Close, but incorrect: 
# Doesn't account for the fact that different years have different numbers of movies on the list
close_but_wrong = imdb.groupby('Year').mean()[more_than_3_ix].get('Rating').mean()
    

Summary, next time¶

Summary¶

  • To write queries that involve multiple conditions:
    • Put parentheses around all conditions.
    • Separate conditions using & if you require all to be true, or | if you require at least one to be true.
  • df.groupby(column_name).agg_method() aggregates all rows with the same value for column_name into a single row in the resulting DataFrame, using agg_method() to combine values.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().
    • Tip: look for keywords "per," "for each," and "indexed by" when solving problems.

Next time¶

A picture is worth a 1000 words – it's time to visualize!