import babypandas as bpd
import numpy as np
What is the difference between babypandas and Python?¶
In short, babypandas is a library in Python. This means that babypandas is a set of tools, built on top of the Python programming language, that gives us a convenient way to manipulate DataFrames.
Note: In the table and examples below, list stands for a list variable and df stands for a DataFrame.
Some common similarities can be found in the table below:
Similarities | Python | babypandas |
---|---|---|
Creating Data | Lists | bpd.DataFrame() , bpd.Series() |
Indexing | list[index] | .loc[] , .iloc[] |
Adding Data | list.append() | df.assign(new_column = data) |
Removing Data | list.remove() | df.drop(columns=['col']) |
Applying Function | func(list) | df.get('col').apply(func) |
Aggregation | sum(list) , max(list) , etc. | df.get('col').sum() , df.get('col').mean() |
We will go through the elements in the table above with examples to help explain the differences between babypandas and Python.
Our Data:¶
Before we get into any examples, I will establish the variables we will use.
- pop_estimates is a DataFrame.
- tutors is a list of strings (the tutors).
- fav_nums is a list of my (Zoe's) favorite numbers.
You might recall the pop_estimates dataset from Lab 1. The estimates in the column "Population" come from the International Database, maintained by the US Census Bureau.
pop_estimates = bpd.read_csv("data/world_population_2023.csv")
pop_estimates.iloc[0:5] #Here are the first five rows displayed for your convenience
| | Year | Population |
---|---|---|
0 | 1950 | 2558023014 |
1 | 1951 | 2595838116 |
2 | 1952 | 2637936830 |
3 | 1953 | 2683288876 |
4 | 1954 | 2731379529 |
You might recall this dataset from Homework 3. In the states DataFrame, each state's 'Favorite Cereal' is defined as the cereal, among the top 20 varieties, that has been searched on Google disproportionately often in that state.
states = bpd.read_csv('data/states.csv')
states = states.set_index("State")
states.iloc[0:5]
State | Favorite Cereal |
---|---|
Delaware | Cap’n Crunch |
Illinois | Cinnamon Toast Crunch |
Kentucky | Cookie Crisp |
Arkansas | Froot Loops |
North Carolina | Cinnamon Toast Crunch |
The rest of the variables are made up by me (Zoe)!
tutors = ["Jack", "Ashley", "Jason", "Zoe", "Nick", "Guoxuan"]
tutors
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
fav_nums = [2, np.e, 14] # e is such a nice number!
fav_nums
[2, 2.718281828459045, 14]
Creating Data¶
- To create a DataFrame you can use bpd.DataFrame()
  - Refer to the babypandas documentation for details
- To create a Series you can use bpd.Series()
- To create a list you use brackets ([])
In DSC 10 we read DataFrames in, so this is not super relevant to you all, but it is good to know!
# This makes an empty DataFrame... exciting!
bpd.DataFrame()
Indexing¶
Lists¶
Recall we can index lists using brackets [] and inside we follow [start : stop : step]. This is also known as slicing!
Note: We use the brackets to extract individual or multiple elements.
# First element in tutors
tutors[0]
'Jack'
# First three elements in tutors
tutors[0:3] # notice that stop is exclusive!
['Jack', 'Ashley', 'Jason']
# Every other tutor
print(tutors) # What did tutors look like originally?
tutors[0:len(tutors):2]
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
['Jack', 'Jason', 'Nick']
# You do not need to specify 0!
tutors[::2]
['Jack', 'Jason', 'Nick']
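The [start : stop : step] pattern can be explored a little further on a small list. The sketch below uses a made-up list called letters (it is not part of this notebook's data) to show that a missing start or stop defaults to the ends of the list, and that negative values count from the end.

```python
# `letters` is a hypothetical list used only for illustration.
letters = ["a", "b", "c", "d", "e", "f"]

first_two = letters[:2]        # start omitted, so slicing begins at index 0
from_third = letters[2:]       # stop omitted, so slicing runs to the end
every_third = letters[::3]     # step of 3 picks indices 0 and 3
last_item = letters[-1]        # a negative index counts from the end
reversed_copy = letters[::-1]  # a negative step walks backwards

print(first_two)      # ['a', 'b']
print(from_third)     # ['c', 'd', 'e', 'f']
print(every_third)    # ['a', 'd']
print(last_item)      # f
print(reversed_copy)  # ['f', 'e', 'd', 'c', 'b', 'a']
```

Slicing always builds a new list, so letters itself is unchanged after all of these lines.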
Babypandas - iloc¶
Recall we can use df.iloc[] to select rows by integer position. Like list slicing, iloc uses [start : stop], and stop is exclusive.
# The first row
pop_estimates.iloc[0]
Year                1950
Population    2558023014
Name: 0, dtype: int64
# The first three rows
pop_estimates.iloc[0:3]
| | Year | Population |
---|---|---|
0 | 1950 | 2558023014 |
1 | 1951 | 2595838116 |
2 | 1952 | 2637936830 |
# Remember we do not need to put 0 if we don't want to!
pop_estimates.iloc[:3]
| | Year | Population |
---|---|---|
0 | 1950 | 2558023014 |
1 | 1951 | 2595838116 |
2 | 1952 | 2637936830 |
iloc selects by integer position, which means we can still use it when the index labels are not integers.
# The first row
states.iloc[0]
Favorite Cereal    Cap’n Crunch
Name: Delaware, dtype: object
We can also give .iloc a list of integer positions if we want specific rows.
pop_estimates.iloc[[2, 14, 44]]
| | Year | Population |
---|---|---|
2 | 1952 | 2637936830 |
14 | 1964 | 3282150912 |
44 | 1994 | 5650427498 |
Babypandas - loc¶
Recall we can use df.loc[] to select rows by their index label. loc also uses [start : stop], but unlike iloc, its stop label is included.
# The first row
pop_estimates.loc[0] #the index's label is 0!
Year                1950
Population    2558023014
Name: 0, dtype: int64
# A refresher on the states DataFrame
states.iloc[0:5]
State | Favorite Cereal |
---|---|
Delaware | Cap’n Crunch |
Illinois | Cinnamon Toast Crunch |
Kentucky | Cookie Crisp |
Arkansas | Froot Loops |
North Carolina | Cinnamon Toast Crunch |
# We want California
states.loc["California"]
Favorite Cereal    Honey Bunches of Oats
Name: California, dtype: object
# We want Illinois and Arkansas
states.loc[["Illinois", "Arkansas"]]
State | Favorite Cereal |
---|---|
Illinois | Cinnamon Toast Crunch |
Arkansas | Froot Loops |
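One difference between the two accessors is easy to miss when slicing: iloc stops before the stop position, while loc includes the stop label. The sketch below is written with plain pandas (the library babypandas is built on) and rebuilds the first four rows of states by hand, so the variable name cereal and the way the table is constructed are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical re-creation of the first four rows of the states DataFrame.
cereal = pd.DataFrame({
    "State": ["Delaware", "Illinois", "Kentucky", "Arkansas"],
    "Favorite Cereal": ["Cap'n Crunch", "Cinnamon Toast Crunch",
                        "Cookie Crisp", "Froot Loops"],
}).set_index("State")

# iloc slices by position: rows 0 and 1 (position 2 is excluded).
by_position = cereal.iloc[0:2]

# loc slices by label: every row from Delaware through Kentucky, inclusive.
by_label = cereal.loc["Delaware":"Kentucky"]

print(list(by_position.index))  # ['Delaware', 'Illinois']
print(list(by_label.index))     # ['Delaware', 'Illinois', 'Kentucky']
```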
Adding Data¶
Lists - .append¶
This will add an item to the end of the list. It happens in place.
Note: When something happens in place it means the object is modified directly. You do not need to re-assign the variable.
tutors
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
tutors.append("Baby Panda")
tutors
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan', 'Baby Panda']
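Because .append works in place, it gives back None instead of the updated list. A common slip is to write something like my_list = my_list.append(x), which replaces the list with None. The sketch below uses a made-up list called snacks to show this, and contrasts it with +, which builds a new list.

```python
# `snacks` is a hypothetical list used only for illustration.
snacks = ["apple", "banana"]

result = snacks.append("cherry")  # changes snacks directly
print(snacks)  # ['apple', 'banana', 'cherry']
print(result)  # None

# By contrast, + creates a brand new list and leaves the original alone.
more_snacks = snacks + ["date"]
print(snacks)       # ['apple', 'banana', 'cherry']
print(more_snacks)  # ['apple', 'banana', 'cherry', 'date']
```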
# np.append takes an array first, then the values you want to add to it! The result is a new, flattened array.
np.append([1, 2, 3], [[2, 4, 6],[1, 3, 5]])
array([1, 2, 3, 2, 4, 6, 1, 3, 5])
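Unlike list.append, np.append is a function that returns a new array and leaves its input untouched, so the result has to be saved in a variable. A minimal sketch, with evens as a made-up array:

```python
import numpy as np

# `evens` is a hypothetical array used only for illustration.
evens = np.array([2, 4, 6])

longer = np.append(evens, 8)  # returns a new array

print(evens)   # [2 4 6]
print(longer)  # [2 4 6 8]
```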
DataFrames - .assign¶
This will add a new column to the DataFrame. It does not happen in place.
pop_estimates.assign(Population_Dupe = pop_estimates.get("Population")).iloc[0:5]
| | Year | Population | Population_Dupe |
---|---|---|---|
0 | 1950 | 2558023014 | 2558023014 |
1 | 1951 | 2595838116 | 2595838116 |
2 | 1952 | 2637936830 | 2637936830 |
3 | 1953 | 2683288876 | 2683288876 |
4 | 1954 | 2731379529 | 2731379529 |
pop_estimates.iloc[0:5] #Notice it was not changed!
| | Year | Population |
---|---|---|
0 | 1950 | 2558023014 |
1 | 1951 | 2595838116 |
2 | 1952 | 2637936830 |
3 | 1953 | 2683288876 |
4 | 1954 | 2731379529 |
# This means we need a variable or need to re-assign the variable to contain the updated information
temp = pop_estimates.assign(Population_Dupe = pop_estimates.get("Population"))
temp.iloc[0:5]
| | Year | Population | Population_Dupe |
---|---|---|---|
0 | 1950 | 2558023014 | 2558023014 |
1 | 1951 | 2595838116 | 2595838116 |
2 | 1952 | 2637936830 | 2637936830 |
3 | 1953 | 2683288876 | 2683288876 |
4 | 1954 | 2731379529 | 2731379529 |
Removing Data¶
List - .remove¶
This is not necessary for our class. I am pointing it out so you do not try to .drop from a list or dictionary!
.remove happens in place.
tutors
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan', 'Baby Panda']
tutors.remove("Baby Panda")
tutors
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
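Two details of .remove are worth seeing once: it removes only the first matching value, and it raises a ValueError when the value is not in the list. The sketch below uses a made-up list called numbers, and also checks that lists really have no .drop method.

```python
# `numbers` is a hypothetical list used only for illustration.
numbers = [3, 1, 3, 2]

numbers.remove(3)  # in place: only the first 3 is removed
print(numbers)     # [1, 3, 2]

# Removing a value that is not in the list raises a ValueError.
try:
    numbers.remove(99)
    outcome = "removed"
except ValueError:
    outcome = "99 is not in the list"
print(outcome)     # 99 is not in the list

# Lists do not have a .drop method; that one belongs to DataFrames.
has_drop = hasattr(numbers, "drop")
print(has_drop)    # False
```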
DataFrame - .drop¶
This will remove the column we specify to drop. It does not happen in place.
temp.drop(columns = "Population_Dupe").iloc[0:5]
| | Year | Population |
---|---|---|
0 | 1950 | 2558023014 |
1 | 1951 | 2595838116 |
2 | 1952 | 2637936830 |
3 | 1953 | 2683288876 |
4 | 1954 | 2731379529 |
# A list of column names also works, which lets you drop multiple columns at once
temp.drop(columns = ["Population_Dupe"]).iloc[0:5]
| | Year | Population |
---|---|---|
0 | 1950 | 2558023014 |
1 | 1951 | 2595838116 |
2 | 1952 | 2637936830 |
3 | 1953 | 2683288876 |
4 | 1954 | 2731379529 |
# Notice once again that we did not update temp, so the column was not dropped
temp.iloc[0:5]
| | Year | Population | Population_Dupe |
---|---|---|---|
0 | 1950 | 2558023014 | 2558023014 |
1 | 1951 | 2595838116 | 2595838116 |
2 | 1952 | 2637936830 | 2637936830 |
3 | 1953 | 2683288876 | 2683288876 |
4 | 1954 | 2731379529 | 2731379529 |
temp = temp.drop(columns = "Population_Dupe")
temp.iloc[0:5]
| | Year | Population |
---|---|---|
0 | 1950 | 2558023014 |
1 | 1951 | 2595838116 |
2 | 1952 | 2637936830 |
3 | 1953 | 2683288876 |
4 | 1954 | 2731379529 |
Applying Functions¶
# This function determines if a year was from the 20th or 21st century
def determine_century(year):
    if 1900 <= year <= 1999:
        return "20th Century"
    elif year >= 2000:
        return "21st Century"
# This function creates a list of Booleans: True for each number in the list that is greater than 5
def bigger_five(nums):
    output = []
    for num in nums:
        if num > 5:
            output.append(True)
        else:
            output.append(False)
    return output
Lists - The function goes around your value!¶
# Recall the variable fav_nums
fav_nums
[2, 2.718281828459045, 14]
bigger_five(fav_nums)
[False, False, True]
Babypandas - .apply¶
This is a method that will apply your function to each element of a Series. Call it on a Series (use .get to pull out a column first), not on a whole DataFrame!
pop_estimates.get("Year").apply(determine_century) # You get a series back
0     20th Century
1     20th Century
2     20th Century
3     20th Century
4     20th Century
          ...
69    21st Century
70    21st Century
71    21st Century
72    21st Century
73    21st Century
Name: Year, Length: 74, dtype: object
pop_estimates = pop_estimates.assign(Century = pop_estimates.get("Year").apply(determine_century))
pop_estimates.iloc[[0, -1]]
| | Year | Population | Century |
---|---|---|---|
0 | 1950 | 2558023014 | 20th Century |
73 | 2023 | 7982019198 | 21st Century |
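Conceptually, .apply calls the function once per element and collects the results. The loop below is a rough plain-Python picture of that idea, not how babypandas actually implements it. The list years is made up, and determine_century is repeated here only so the block runs on its own.

```python
def determine_century(year):
    if 1900 <= year <= 1999:
        return "20th Century"
    elif year >= 2000:
        return "21st Century"

# `years` is a hypothetical list standing in for the "Year" column.
years = [1950, 1999, 2000, 2023]

centuries = []
for year in years:
    # One call per element, just like .apply does for each entry of a Series.
    centuries.append(determine_century(year))

print(centuries)
# ['20th Century', '20th Century', '21st Century', '21st Century']
```

Note that determine_century returns None for any year before 1900, because neither branch matches. That is fine for this dataset, which starts in 1950.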
Aggregation¶
A function is a block of reusable code that performs a specific task. It can take in inputs, perform operations, and return an output. Functions are defined with def. They are not tied to any specific object. I (Zoe) like to think of a function as something that hugs its arguments: it surrounds the thing we want to transform.
A method also performs a specific task, but it is associated with an object. Methods are defined within a class and are called on objects. I (Zoe) like to think of a method as something that follows an object: it always comes after a variable and a dot (.).
This might get a bit technical, but here is a table of the differences:
Function | Method |
---|---|
Independent and not associated with any object | Associated with an object (instance methods) or a class (class methods) |
Called by its name directly | Called on an object or class |
Parameters are user-defined | First parameter is self (for instance methods) or cls (for class methods) |
Defined using the def keyword outside of a class | Defined inside a class |
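The rows of this table can be seen side by side in a few lines of code. The sketch below defines one function and one method; the names double and Tally are made up for illustration and have nothing to do with babypandas.

```python
def double(x):
    """A function: it stands alone and is called by its name."""
    return 2 * x


class Tally:
    """A tiny hypothetical class that holds a list of numbers."""

    def __init__(self, values):
        self.values = values

    def total(self):
        """A method: defined inside the class, so its first parameter is self."""
        return sum(self.values)


my_tally = Tally([2, 14])

function_result = double(7)       # the function hugs its argument
method_result = my_tally.total()  # the method follows the object and a dot

# Calling a method on an object passes that object in as self.
same_result = Tally.total(my_tally)

print(function_result)  # 14
print(method_result)    # 16
print(same_result)      # 16
```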
Lists - Functions (min and max)¶
# If I want the minimum of a list I use a function
min(fav_nums)
2
# If I want the maximum of a list I use a function
max(fav_nums)
14
Babypandas - Methods (.min and .max)¶
# If I want the minimum of a Series I use a method
pop_estimates.get("Population").min()
2558023014
# If I want the maximum of a Series I use a method
pop_estimates.get("Population").max()
7982019198
Lists - Functions (mean)¶
np.mean(fav_nums)
6.239427276153015
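np.mean is a function, so it wraps around the list in the same way sum and len do. As a sanity check, the mean can be rebuilt from those two built-in functions. This sketch re-creates fav_nums with math.e in place of np.e (the same constant) so that it runs without NumPy.

```python
import math

# Same values as fav_nums above, using math.e instead of np.e.
fav_nums = [2, math.e, 14]

# The mean is the total divided by the number of elements.
manual_mean = sum(fav_nums) / len(fav_nums)
print(manual_mean)
```

The printed value should agree with the np.mean result above, up to floating-point rounding.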
Babypandas - Methods (.mean)¶
pop_estimates.get("Population").mean()
5082349061.162162
Lists - Functions (sum)¶
sum(fav_nums)
18.718281828459045
Babypandas - Methods (.sum)¶
pop_estimates.get("Population").sum()
376093830526
The End!¶
As you can see, there are differences between plain Python (the programming language) and babypandas. I hope you can refer to this as a guide to help you avoid common mistakes. If you have questions, please post them on Ed. Thank you!