In [1]:
# Run this cell to set up packages for lecture.
from lec03_imports import *

Lecture 3 – Arrays and DataFrames¶

DSC 10, Summer Session I 2025¶

Agenda¶

  • Recap of arrays and ranges
  • DataFrames
    • Querying
    • Grouping

Note:¶

  • Remember to check the resources tab of the course website for programming resources.
  • Reminders:
    • Lab 0 due tomorrow
    • Lab 1 due Sunday
    • HW 1 due Monday
    • Quiz 1 on Monday
  • Some key links moving forward:
    • DSC 10 reference sheet.
    • babypandas notes.
    • babypandas documentation.

DataFrames¶

pandas¶

  • pandas is a Python package that allows us to work with tabular data – that is, data in the form of a table that we might otherwise work with as a spreadsheet (in Excel or Google Sheets).
  • pandas is the tool for doing data science in Python.
No description has been provided for this image

But pandas is not so cute...¶

No description has been provided for this image

Enter babypandas!¶

  • We at UCSD have created a smaller, nicer version of pandas called babypandas.
  • It keeps the important stuff and has much better error messages.
  • It's easier to learn, but is still valid pandas code. You are learning pandas!
    • Think of it like learning how to build LEGOs with many, but not all, of the possible Lego blocks. You're still learning how to build LEGOs, and you can still build cool things!
No description has been provided for this image

DataFrames in babypandas 🐼¶

  • Tables in babypandas (and pandas) are called "DataFrames."
  • To use DataFrames, we'll need to import babypandas.
In [2]:
import babypandas as bpd

Reading data from a file 📖¶

  • We'll usually work with data stored in the CSV format. CSV stands for "comma-separated values."

  • We can read in a CSV using bpd.read_csv(...). Replace the ... with a path to the CSV file relative to your notebook; if the file is in the same folder as your notebook, this is just the name of the file.

In [3]:
# Our CSV file is stored not in the same folder as our notebook, 
# but within a folder called data.
states = bpd.read_csv('data/states.csv')
states
Out[3]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

About the data 🗽¶

Most of the data is self-explanatory, but there are a few things to note:

  • 'Population' figures come from the 2020 census.
  • 'Land Area' is measured in square miles.
  • The 'Region' column places each state in one of four regions, as determined by the US Census Bureau.
No description has been provided for this image
  • The 'Party' column classifies each state as 'Democratic' or 'Republican' based on a political science measurement called the Cook Partisan Voter Index.
No description has been provided for this image (source)

Structure of a DataFrame¶

  • DataFrames have columns and rows.
    • Think of each column as an array. Columns contain data of the same type.
  • Each column has a label, e.g. 'Capital City' and 'Land Area'.
    • Column labels are stored as strings.
  • Each row has a label too – these are shown in bold at the start of the row.
    • Right now, the row labels are 0, 1, 2, and so on.
    • Together, the row labels are called the index. The index is not a column!
In [4]:
# This DataFrame has 50 rows and 6 columns.
states
Out[4]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

Example 1: Population density¶

Key concepts: Accessing columns, performing calculations with them, and adding new columns.

Finding population density¶

Question: What is the population density of each state, in people per square mile?

In [5]:
states
Out[5]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

  • We have, separately, the population and land area of each state.
  • Steps:
    • Get the 'Population' column.
    • Get the 'Land Area' column.
    • Divide these columns element-wise.
    • Add a new column to the DataFrame with these results.

Step 1 – Getting the 'Population' column¶

  • We can get a column from a DataFrame using .get(column_name).
  • 🚨 Column names are case sensitive!
  • Column names are strings, so we need to use quotes.
  • The result looks like a 1-column DataFrame, but is actually a Series.
In [6]:
states
Out[6]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

In [7]:
states.get('Population')
Out[7]:
0     5024279
1      733391
2     7151502
       ...   
47    1793716
48    5893718
49     576851
Name: Population, Length: 50, dtype: int64

Digression: Series¶

  • A Series is like an array, but with an index.
  • In particular, Series support arithmetic, just like arrays.
In [8]:
states.get('Population')
Out[8]:
0     5024279
1      733391
2     7151502
       ...   
47    1793716
48    5893718
49     576851
Name: Population, Length: 50, dtype: int64
In [9]:
type(states.get('Population'))
Out[9]:
babypandas.bpd.Series

Steps 2 and 3 – Getting the 'Land Area' column and dividing element-wise¶

In [10]:
states.get('Land Area')
Out[10]:
0      50645
1     570641
2     113594
       ...  
47     24038
48     54158
49     97093
Name: Land Area, Length: 50, dtype: int64
  • Just like with arrays, we can perform arithmetic operations with two Series, as long as they have the same length and same index.
  • Operations happen element-wise (by matching up corresponding index values), and the result is also a Series.
In [11]:
states.get('Population') / states.get('Land Area')
Out[11]:
0      99.21
1       1.29
2      62.96
       ...  
47     74.62
48    108.82
49      5.94
Length: 50, dtype: float64

Step 4 – Adding the densities to the DataFrame as a new column¶

  • Use .assign(name_of_column=data_in_series) to assign a Series (or array, or list) to a DataFrame.
  • 🚨 Don't put quotes around name_of_column.
  • This creates a new DataFrame, which we must save to a variable if we want to keep using it.
In [12]:
states.assign(
    Density=states.get('Population') / states.get('Land Area')
)
Out[12]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

In [13]:
states
Out[13]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

In [14]:
states = states.assign(
    Density=states.get('Population') / states.get('Land Area')
)
states
Out[14]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

Example 2: Exploring population density¶

Key concept: Computing statistics of columns using Series methods.

Questions¶

  • What is the highest population density of any one state?
  • What is the average population density across all states?

Series, like arrays, have helpful methods, including .min(), .max(), and .mean().

In [15]:
states.get('Density').max()
Out[15]:
1263.1212945335872

What state does this correspond to? We'll see how to find out shortly!

Other statistics:

In [16]:
states.get('Density').min()
Out[16]:
1.2852055845969708
In [17]:
states.get('Density').mean()
Out[17]:
206.54513507096468
In [18]:
states.get('Density').median()
Out[18]:
108.31649013462203
In [19]:
# Lots of information at once!
states.get('Density').describe()
Out[19]:
count      50.00
mean      206.55
std       274.93
          ...   
50%       108.32
75%       224.57
max      1263.12
Name: Density, Length: 8, dtype: float64

Example 3: Which state has the highest population density?¶

Key concepts: Sorting. Accessing using integer positions.

Step 1 – Sorting the DataFrame¶

  • Use the .sort_values(by=column_name) method to sort.
    • The by= can be omitted, but helps with readability.
  • Like most DataFrame methods, this returns a new DataFrame.
In [20]:
states.sort_values(by='Density')
Out[20]:
State Region Capital City Population Land Area Party Density
1 Alaska West Juneau 733391 570641 Republican 1.29
49 Wyoming West Cheyenne 576851 97093 Republican 5.94
25 Montana West Helena 1084225 145546 Republican 7.45
... ... ... ... ... ... ... ...
20 Massachusetts Northeast Boston 7029917 7800 Democratic 901.27
38 Rhode Island Northeast Providence 1097379 1034 Democratic 1061.29
29 New Jersey Northeast Trenton 9288994 7354 Democratic 1263.12

50 rows × 7 columns

This sorts, but in ascending order (small to large). The opposite would be nice!

Step 1 – Sorting the DataFrame in descending order¶

  • Use .sort_values(by=column_name, ascending=False) to sort in descending order.
  • ascending is an optional argument. If omitted, it will be set to True by default.
    • This is an example of a keyword argument, or a named argument.
    • If we want to specify the sorting order, we must use the keyword ascending=.
In [21]:
ordered_states = states.sort_values(by='Density', ascending=False)
ordered_states
Out[21]:
State Region Capital City Population Land Area Party Density
29 New Jersey Northeast Trenton 9288994 7354 Democratic 1263.12
38 Rhode Island Northeast Providence 1097379 1034 Democratic 1061.29
20 Massachusetts Northeast Boston 7029917 7800 Democratic 901.27
... ... ... ... ... ... ... ...
25 Montana West Helena 1084225 145546 Republican 7.45
49 Wyoming West Cheyenne 576851 97093 Republican 5.94
1 Alaska West Juneau 733391 570641 Republican 1.29

50 rows × 7 columns

In [22]:
# We must specify the role of False by using ascending=, 
# otherwise Python does not know how to interpret this.
states.sort_values(by='Density', False)
  File "/var/folders/2k/9mnd960x2j1d9b35wyjwwx200000gp/T/ipykernel_13708/2563070448.py", line 3
    states.sort_values(by='Density', False)
                                          ^
SyntaxError: positional argument follows keyword argument

Step 2 – Extracting the state name¶

  • We saw that the most densely populated state is New Jersey, but how do we extract that information using code?
  • First, grab an entire column as a Series.
  • Navigate to a particular entry of the Series using .iloc[integer_position].
    • iloc stands for "integer location" and is used to count the rows, starting at 0.
In [23]:
ordered_states
Out[23]:
State Region Capital City Population Land Area Party Density
29 New Jersey Northeast Trenton 9288994 7354 Democratic 1263.12
38 Rhode Island Northeast Providence 1097379 1034 Democratic 1061.29
20 Massachusetts Northeast Boston 7029917 7800 Democratic 901.27
... ... ... ... ... ... ... ...
25 Montana West Helena 1084225 145546 Republican 7.45
49 Wyoming West Cheyenne 576851 97093 Republican 5.94
1 Alaska West Juneau 733391 570641 Republican 1.29

50 rows × 7 columns

In [24]:
ordered_states.get('State')
Out[24]:
29       New Jersey
38     Rhode Island
20    Massachusetts
          ...      
25          Montana
49          Wyoming
1            Alaska
Name: State, Length: 50, dtype: object
In [25]:
# We want the first entry of the Series, which is at "integer location" 0.
ordered_states.get('State').iloc[0]
Out[25]:
'New Jersey'
  • The row label that goes with New Jersey is 29, because our original data was alphabetized by state and New Jersey is the 30th state alphabetically. But we don't use the row label when accessing with iloc; we use the integer position counting from the top.
  • If we try to use the row label (29) with iloc, we get the state with the 30th highest population density, which is not New Jersey.
In [26]:
ordered_states.get('State').iloc[29]
Out[26]:
'Minnesota'

Example 4: What is the population density of Pennsylvania?¶

Key concepts: Setting the index. Accessing using row labels.

Population density of Pennsylvania¶

We know how to get the 'Density' of all states. How do we find the one that corresponds to Pennsylvania?

In [27]:
states
Out[27]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

In [28]:
# Which one is Pennsylvania?
states.get('Density')
Out[28]:
0      99.21
1       1.29
2      62.96
       ...  
47     74.62
48    108.82
49      5.94
Name: Density, Length: 50, dtype: float64

Utilizing the index¶

  • When we load in a DataFrame from a CSV, columns have meaningful names, but rows do not.
In [29]:
bpd.read_csv('data/states.csv')
Out[29]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

  • The row labels (or the index) are how we refer to specific rows. Instead of using numbers, let's refer to these rows by the names of the states they correspond to.
  • This way, we can easily identify, for example, which row corresponds to Pennsylvania.

Setting the index¶

  • To change the index, use .set_index(column_name).
  • Row labels should be unique identifiers.
    • Each row should have a different, descriptive name that corresponds to the contents of that row's data.
In [30]:
states
Out[30]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

In [31]:
states.set_index('State')
Out[31]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

  • Now there is one fewer column. When you set the index, a column becomes the index, and the old index disappears.
  • 🚨 Like most DataFrame methods, .set_index returns a new DataFrame; it does not modify the original DataFrame.
In [32]:
states
Out[32]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

In [33]:
states = states.set_index('State')
states
Out[33]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [34]:
# Which one is Pennsylvania? The one whose row label is "Pennsylvania"!
states.get('Density')
Out[34]:
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
                  ...  
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64

Accessing using the row label¶

To pull out one particular entry of a DataFrame corresponding to a row and column with certain labels:

  1. Use .get(column_name) to extract the entire column as a Series.
  2. Use .loc[] to access the element of a Series with a particular row label.

In this class, we'll always first access a column, then a row (but row, then column is also possible).

In [35]:
states.get('Density')
Out[35]:
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
                  ...  
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64
In [36]:
states.get('Density').loc['Pennsylvania']
Out[36]:
290.60858681804973

Summary: Accessing elements of a DataFrame¶

  • First, .get the appropriate column as a Series.
  • Then, use one of two ways to access an element of a Series:
    • .iloc[] uses the integer position.
    • .loc[] uses the row label.
    • Each is best for different scenarios.
In [37]:
states.get('Density')
Out[37]:
State
Alabama           99.21
Alaska             1.29
Arizona           62.96
                  ...  
West Virginia     74.62
Wisconsin        108.82
Wyoming            5.94
Name: Density, Length: 50, dtype: float64
In [38]:
states.get('Density').iloc[4]
Out[38]:
253.80971119342146
In [39]:
states.get('Density').loc['California']
Out[39]:
253.80971119342146

Note¶

  • Sometimes the integer position and row label are the same.
  • This happens by default with bpd.read_csv.
In [40]:
bpd.read_csv('data/states.csv')
Out[40]:
State Region Capital City Population Land Area Party
0 Alabama South Montgomery 5024279 50645 Republican
1 Alaska West Juneau 733391 570641 Republican
2 Arizona West Phoenix 7151502 113594 Republican
... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican
48 Wisconsin Midwest Madison 5893718 54158 Republican
49 Wyoming West Cheyenne 576851 97093 Republican

50 rows × 6 columns

In [41]:
bpd.read_csv('data/states.csv').get('Capital City').loc[35]
Out[41]:
'Oklahoma City'
In [42]:
bpd.read_csv('data/states.csv').get('Capital City').iloc[35]
Out[42]:
'Oklahoma City'

Querying and Grouping¶

The data: US states 🗽¶

We'll continue working with the same data from last time.

In [43]:
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states
Out[43]:
State Region Capital City Population Land Area Party Density
0 Alabama South Montgomery 5024279 50645 Republican 99.21
1 Alaska West Juneau 733391 570641 Republican 1.29
2 Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ... ...
47 West Virginia South Charleston 1793716 24038 Republican 74.62
48 Wisconsin Midwest Madison 5893718 54158 Republican 108.82
49 Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 7 columns

Index by state, not row numbers.

In [44]:
states = states.set_index('State')

Example 5: Which states are in the West?¶

Key concept: Querying.

Not covered on the quiz!

The problem¶

We want to create a DataFrame consisting of only the states whose 'Region' is 'West'. How do we do that?

The solution¶

In [45]:
# This DataFrame only contains rows where the 'Region' is 'West'!
only_west = states[states.get('Region') == 'West']
only_west
Out[45]:
Region Capital City Population Land Area Party Density
State
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
California West Sacramento 39538223 155779 Democratic 253.81
... ... ... ... ... ... ...
Utah West Salt Lake City 3271616 82170 Republican 39.82
Washington West Olympia 7705281 66456 Democratic 115.95
Wyoming West Cheyenne 576851 97093 Republican 5.94

13 rows × 6 columns

Aside: Booleans (another type)¶

  • When we compare two values, the result is either True or False.
    • Notice, these words are not in quotes.
  • bool is a data type in Python, just like int, float, and str.
    • It stands for "Boolean", named after George Boole, an early mathematician.
  • There are only two possible Boolean values: True or False.
    • Yes or no.
    • On or off.
    • 1 or 0.
In [46]:
5 == 6
Out[46]:
False
In [47]:
type(5 == 6)
Out[47]:
bool
In [48]:
9 + 10 < 21
Out[48]:
True

Comparison operators¶

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

When comparing an entire Series to a single value, the result is a Series of bools (via broadcasting).

In [49]:
states
Out[49]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [50]:
states.get('Region') == 'West'
Out[50]:
State
Alabama          False
Alaska            True
Arizona           True
                 ...  
West Virginia    False
Wisconsin        False
Wyoming           True
Name: Region, Length: 50, dtype: bool

What is a query? 🤔¶

  • A query is code that extracts rows from a DataFrame for which certain condition(s) are true.
  • We use queries to filter DataFrames to contain only the rows that satisfy given conditions.

How do we query a DataFrame?¶

To select only certain rows of states:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (discard), usually by making a comparison.
  2. Then pass it into states[sequence_goes_here].
In [51]:
states[states.get('Region') == 'Pacific Northwest']
Out[51]:
Region Capital City Population Land Area Party Density
State
In [52]:
states.get('Region') == 'Pacific Northwest'
Out[52]:
State
Alabama          False
Alaska           False
Arizona          False
                 ...  
West Virginia    False
Wisconsin        False
Wyoming          False
Name: Region, Length: 50, dtype: bool

Example 6: What proportion of US states are Republican?¶

Key concept: Shape of a DataFrame.

Strategy¶
  1. Query to extract a DataFrame of just the states where the 'Party' is 'Republican'.
  2. Count the number of such states.
  3. Divide by the total number of states.
In [53]:
only_rep = states[states.get('Party') == 'Republican']
only_rep
Out[53]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

31 rows × 6 columns

Shape of a DataFrame¶

  • .shape returns the number of rows and columns in a given DataFrame.
    • .shape is not a method, so we don't use parentheses.
    • .shape is an attribute, as it describes the DataFrame.
  • Access each with []:
    • .shape[0] for rows.
    • .shape[1] for columns.
In [54]:
only_rep.shape
Out[54]:
(31, 6)
In [55]:
# Number of rows.
only_rep.shape[0]
Out[55]:
31
In [56]:
# Number of columns.
only_rep.shape[1]
Out[56]:
6
In [57]:
# What proportion of US states are Republican?
only_rep.shape[0] / states.shape[0]
Out[57]:
0.62

Example 7: Which Midwestern state has the most land area?¶

Key concepts: Working with the index. Combining multiple steps.

Strategy¶
  1. Query to extract a DataFrame of just the states in the 'Midwest'.
  2. Sort by 'Land Area' in descending order.
  3. Extract the first element from the index.
In [58]:
midwest = states[states.get('Region') == 'Midwest']
midwest
Out[58]:
Region Capital City Population Land Area Party Density
State
Illinois Midwest Springfield 12812508 55519 Democratic 230.78
Indiana Midwest Indianapolis 6785528 35826 Republican 189.40
Iowa Midwest Des Moines 3190369 55857 Republican 57.12
... ... ... ... ... ... ...
Ohio Midwest Columbus 11799448 40861 Republican 288.77
South Dakota Midwest Pierre 886667 75811 Republican 11.70
Wisconsin Midwest Madison 5893718 54158 Republican 108.82

12 rows × 6 columns

In [59]:
midwest_sorted = midwest.sort_values(by='Land Area', ascending=False)
midwest_sorted
Out[59]:
Region Capital City Population Land Area Party Density
State
Kansas Midwest Topeka 2937880 81759 Republican 35.93
Minnesota Midwest Saint Paul 5706494 79627 Democratic 71.67
Nebraska Midwest Lincoln 1961504 76824 Republican 25.53
... ... ... ... ... ... ...
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Ohio Midwest Columbus 11799448 40861 Republican 288.77
Indiana Midwest Indianapolis 6785528 35826 Republican 189.40

12 rows × 6 columns

In [60]:
midwest_sorted.get('State').iloc[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/2k/9mnd960x2j1d9b35wyjwwx200000gp/T/ipykernel_13708/3899766623.py in <cell line: 0>()
----> 1 midwest_sorted.get('State').iloc[0]

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/babypandas/utils.py in wrapper(*args, **kwargs)
     18         with warnings.catch_warnings():
     19             warnings.simplefilter("ignore")
---> 20             return func(*args, **kwargs)
     21 
     22     return wrapper

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/babypandas/bpd.py in get(self, key)
    325         if any(mask):
    326             k = [key] if isinstance(key, str) else key
--> 327             raise KeyError("{} not found in columns".format(np.array(k)[mask]))
    328 
    329         f = _lift_to_pd(self._pd.get)

KeyError: "['State'] not found in columns"

Working with the index¶

  • We can't use .get because .get is only for columns, and there is no column called 'State'.
    • Instead, 'State' is the index of the DataFrame.
  • To extract the index of a DataFrame, use .index.
    • Like .shape, this is an attribute of the DataFrame, not a method. Don't use parentheses.
  • Access particular elements in the index with [].
In [61]:
midwest_sorted.index
Out[61]:
Index(['Kansas', 'Minnesota', 'Nebraska', 'South Dakota', 'North Dakota',
       'Missouri', 'Michigan', 'Iowa', 'Illinois', 'Wisconsin', 'Ohio',
       'Indiana'],
      dtype='object', name='State')
In [62]:
midwest_sorted.index[0]
Out[62]:
'Kansas'

Combining multiple steps¶

  • It is not necessary to define the intermediate variables midwest and midwest_sorted. We can do everything in one line of code.

  • When solving a multi-step problem, develop your solution incrementally. Write one piece of code at a time and run it.

In [63]:
# Full answer, which you should build up one step at a time.
states[states.get('Region') == 'Midwest'].sort_values(by='Land Area', ascending=False).index[0]
Out[63]:
'Kansas'
  • If a line of code gets too long, enclose it in parentheses to split it over multiple lines.
In [64]:
# You can space your code out like this if needed.
(
    states[states.get('Region') == 'Midwest']
    .sort_values(by='Land Area', ascending=False)
    .index[0]
)
Out[64]:
'Kansas'

Concept Check ✅ – Answer at cc.dsc10.com¶

Which expression below evaluates to the total population of the 'West'?

A. states[states.get('Region') == 'West'].get('Population').sum()

B. states.get('Population').sum()[states.get('Region') == 'West']

C. states['West'].get('Population').sum()

D. More than one of the above.

Example 8: What are the top three most-populated Republican states in the South?¶

Key concepts: Queries with multiple conditions. Selecting rows by position.

Multiple conditions¶

  • To write a query with multiple conditions, use & for "and" and | for "or".
    • &: All conditions must be true.
    • |: At least one condition must be true.
  • You must use (parentheses) around each condition!
  • 🚨 Don't use the Python keywords and and or here! They do not behave as you'd want.
    • See BPD 10.3 for an explanation.
In [65]:
states[(states.get('Party') == 'Republican') & (states.get('Region') == 'South')]
Out[65]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Arkansas South Little Rock 3011524 52035 Republican 57.87
Florida South Tallahassee 21538187 53625 Republican 401.64
... ... ... ... ... ... ...
Tennessee South Nashville 6910840 41235 Republican 167.60
Texas South Austin 29145505 261232 Republican 111.57
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

In [66]:
# You can also add line breaks within brackets.
states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]
Out[66]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Arkansas South Little Rock 3011524 52035 Republican 57.87
Florida South Tallahassee 21538187 53625 Republican 401.64
... ... ... ... ... ... ...
Tennessee South Nashville 6910840 41235 Republican 167.60
Texas South Austin 29145505 261232 Republican 111.57
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

The & and | operators work element-wise!¶

In [67]:
(states.get('Party') == 'Republican')
Out[67]:
State
Alabama          True
Alaska           True
Arizona          True
                 ... 
West Virginia    True
Wisconsin        True
Wyoming          True
Name: Party, Length: 50, dtype: bool
In [68]:
(states.get('Region') == 'South')
Out[68]:
State
Alabama           True
Alaska           False
Arizona          False
                 ...  
West Virginia     True
Wisconsin        False
Wyoming          False
Name: Region, Length: 50, dtype: bool
In [69]:
(states.get('Party') == 'Republican') & (states.get('Region') == 'South')
Out[69]:
State
Alabama           True
Alaska           False
Arizona          False
                 ...  
West Virginia     True
Wisconsin        False
Wyoming          False
Length: 50, dtype: bool

Original Question: What are the top three most-populated Republican states in the South?¶

In [70]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region') == 'South')]
    .sort_values(by='Population', ascending=False)
)
Out[70]:
Region Capital City Population Land Area Party Density
State
Texas South Austin 29145505 261232 Republican 111.57
Florida South Tallahassee 21538187 53625 Republican 401.64
Georgia South Atlanta 10711908 57513 Republican 186.25
... ... ... ... ... ... ...
Arkansas South Little Rock 3011524 52035 Republican 57.87
Mississippi South Jackson 2961279 46923 Republican 63.11
West Virginia South Charleston 1793716 24038 Republican 74.62

13 rows × 6 columns

How do we extract the first three rows of this DataFrame?

Using .take to select rows by position¶

  • Querying allows us to select rows that satisfy a certain condition.
  • We can also select rows in specific positions with .take(sequence_of_integer_positions). This keeps only the rows whose positions are in the specified sequence (list/array).
    • This is analogous to using .iloc[] on a Series.
    • It's rare to need to select rows by integer position. Querying is far more useful.
In [71]:
(
    states[(states.get('Party') == 'Republican') & 
       (states.get('Region')=='South')]
    .sort_values(by='Population', ascending=False)
    .take([0, 1, 2])
)
Out[71]:
Region Capital City Population Land Area Party Density
State
Texas South Austin 29145505 261232 Republican 111.57
Florida South Tallahassee 21538187 53625 Republican 401.64
Georgia South Atlanta 10711908 57513 Republican 186.25
  • .take(np.arange(3)) could equivalently be used in place of .take([0, 1, 2]).

Example 9: Which region is most populated?¶

Key concept: Grouping by one column.

Organizing states by region¶

We can find the total population of any one region using the tools we already have.

In [72]:
states[states.get('Region') == 'West'].get('Population').sum()
Out[72]:
78588572
In [73]:
states[states.get('Region') == 'Midwest'].get('Population').sum()
Out[73]:
68985454

But can we find the total population of every region all at the same time, without writing very similar code multiple times? Yes, there is a better way!

A new method: .groupby¶

Observe what happens when we use the .groupby method on states with the argument 'Region'.

In [74]:
states.groupby('Region').sum()
Out[74]:
Population Land Area Density
Region
Midwest 68985454 750524 1298.78
Northeast 57609148 161912 4957.49
South 125576562 868356 3189.37
West 78588572 1751054 881.62

These populations (for the 'West' and 'Midwest') match the ones we found on the previous slide, except now we get the populations for all regions at the same time. What just happened? 🤯

An illustrative example: Pets 🐱 🐶🐹¶

Consider the DataFrame pets, shown below.

Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0

Let's see what happens under the hood when we run pets.groupby('Species').mean().

In [75]:
show_grouping_animation()

Let's try it out!¶

In [76]:
pets = bpd.DataFrame().assign(
    Species=['dog', 'cat', 'cat', 'dog', 'dog', 'hamster'],
    Color=['black', 'golden', 'black', 'white', 'golden', 'golden'],
    Weight=[40, 15, 20, 80, 25, 1],
    Age=[5, 8, 9, 2, 0.5, 3]
)
pets
Out[76]:
Species Color Weight Age
0 dog black 40 5.0
1 cat golden 15 8.0
2 cat black 20 9.0
3 dog white 80 2.0
4 dog golden 25 0.5
5 hamster golden 1 3.0
In [77]:
pets.groupby('Species').mean()
Out[77]:
Weight Age
Species
cat 17.50 8.5
dog 48.33 2.5
hamster 1.00 3.0

It takes several steps to go from the original pets DataFrame to this grouped DataFrame, but we don't get to see any of Python's inner workings, just the final output.

Back to states: which region is most populated?¶

In [78]:
states
Out[78]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [79]:
states.groupby('Region').sum()
Out[79]:
Population Land Area Density
Region
Midwest 68985454 750524 1298.78
Northeast 57609148 161912 4957.49
South 125576562 868356 3189.37
West 78588572 1751054 881.62
In [80]:
# Note the use of .index – remember, the index isn't a column!
(
    states
    .groupby('Region')
    .sum()
    .sort_values(by='Population', ascending=False)
    .index[0]
)
Out[80]:
'South'

Using .groupby in general¶

In short, .groupby aggregates (collects) all rows with the same value in a specified column (e.g. 'Region') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values from different rows with the same value in the specified column.

To use .groupby:

  1. Choose a column to group by.
    • .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
    • In the resulting DataFrame, there will be one row for every unique value in that column.
  1. Choose an aggregation method.
    • The aggregation method will be applied within each group.
    • The aggregation method is applied individually to each column.
      • If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

Observations on grouping¶

  1. After grouping, the index changes. The new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.
In [81]:
states
Out[81]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

In [82]:
states.groupby('Region').sum()
Out[82]:
Population Land Area Density
Region
Midwest 68985454 750524 1298.78
Northeast 57609148 161912 4957.49
South 125576562 868356 3189.37
West 78588572 1751054 881.62

Tip: look for keywords "per," "for each," and "indexed by" when solving problems.

  1. The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, the column will disappear. 🐇🎩
  1. Since the aggregation method is applied to each column separately, the rows of the resulting DataFrame need to be interpreted with care.
In [83]:
states.groupby('Region').max()
Out[83]:
Capital City Population Land Area Party Density
Region
Midwest Topeka 12812508 81759 Republican 288.77
Northeast Trenton 20201249 47126 Republican 1263.12
South Tallahassee 29145505 261232 Republican 636.37
West Santa Fe 39538223 570641 Republican 253.81
In [84]:
12812508 / 81759 == 288.77
Out[84]:
False
  1. The column names don't make sense after grouping with the .count() aggregation method.
In [85]:
states.groupby('Region').count()
Out[85]:
Capital City Population Land Area Party Density
Region
Midwest 12 12 12 12 12
Northeast 9 9 9 9 9
South 16 16 16 16 16
West 13 13 13 13 13

Dropping, renaming, and reordering columns¶

Consider dropping unneeded columns and renaming columns as follows:

  1. Use .assign to create a new column containing the same values as the old column(s).
  2. Use .drop(columns=list_of_column_labels) to drop the old column(s).
    • Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
In [86]:
states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
                    States=states_by_region.get('Capital City')
                    ).get(['States'])
states_by_region
Out[86]:
States
Region
Midwest 12
Northeast 9
South 16
West 13

Summary, next time¶

Summary¶

  • Arrays make it easy to perform arithmetic operations on all elements of an array and to perform element-wise operations on multiple arrays.
  • Ranges are arrays of equally-spaced numbers.
  • We learned many DataFrame methods and techniques. Don't feel the need to memorize them all right away.
  • Instead, refer to this lecture, the DSC 10 reference sheet, the babypandas notes, and the babypandas documentation when working on assignments.
  • Over time, these techniques will become more and more familiar. Lab 1 will walk you through many of them.
  • Practice! Frame your own questions using this dataset and try to answer them.

Next time¶

We'll frame more questions and learn more DataFrame manipulation techniques to answer them. In particular, we'll learn about querying and grouping.

Have a good Fourth of July weekend! 🌭🎆🇺🇸¶