In [1]:
import babypandas as bpd
import numpy as np

What is the difference between babypandas and Python?¶

In short, Python is the programming language, and babypandas is a library written for it. We import babypandas into Python to get a fast and powerful way to manipulate DataFrames.

Note: In the table below, list stands for a list variable and df stands for a DataFrame.

Some common similarities can be found in the table below:

Similarities        Python                        babypandas
Creating Data       Lists                         bpd.DataFrame(), bpd.Series()
Indexing            list[index]                   df.loc[], df.iloc[]
Adding Data         list.append()                 df.assign(new_column = data)
Removing Data       list.remove()                 df.drop(columns=['col'])
Applying Function   func(list)                    df.get('col').apply(func)
Aggregation         sum(list), max(list), etc.    df.sum(), df.mean()

We will go through the elements in the table above with examples to help explain the differences between babypandas and Python.


Table of Contents:¶

Click on the links below to quickly navigate to your desired topic.

  • Defined Variables
  • Creating Data
  • Indexing
  • Adding Data
  • Removing Data
  • Applying Functions
  • Aggregation

Our Data:¶

Before we get into any examples, here are the variables we will use.

  • pop_estimates is a DataFrame.
  • states is a DataFrame.
  • tutors is a list of strings (the tutors).
  • fav_nums is a list of my (Zoe's) favorite numbers.

You might recall this dataset from Lab 1. The estimates in the column "Population" come from the International Database, maintained by the US Census Bureau.

In [2]:
pop_estimates = bpd.read_csv("data/world_population_2023.csv")
pop_estimates.iloc[0:5] #Here are the first five rows displayed for your convenience
Out[2]:
Year Population
0 1950 2558023014
1 1951 2595838116
2 1952 2637936830
3 1953 2683288876
4 1954 2731379529

You might recall this dataset from Homework 3. In the states DataFrame, each state's 'Favorite Cereal' is the cereal, among the top 20 varieties, that was Google searched disproportionately often in that state.

In [3]:
states = bpd.read_csv('data/states.csv')
states = states.set_index("State")
states.iloc[0:5]
Out[3]:
Favorite Cereal
State
Delaware Cap’n Crunch
Illinois Cinnamon Toast Crunch
Kentucky Cookie Crisp
Arkansas Froot Loops
North Carolina Cinnamon Toast Crunch

The rest of the variables are made up by me (Zoe)!

In [4]:
tutors = ["Jack", "Ashley", "Jason", "Zoe", "Nick", "Guoxuan"]
tutors
Out[4]:
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
In [5]:
fav_nums = [2, np.e, 14] # e is such a nice number!
fav_nums
Out[5]:
[2, 2.718281828459045, 14]

Creating Data¶

  • To create a DataFrame you can do bpd.DataFrame()
    • Refer to documentation for using babypandas
  • To create a Series you can do bpd.Series()
  • To create a list you use brackets ([])

In DSC 10 we read DataFrames in, so this is not super relevant to you all, but it is good to know!

In [6]:
# This makes an empty DataFrame... exciting!
bpd.DataFrame()
Out[6]:

Indexing¶

Lists¶

Recall we can index lists using brackets [] and inside we are following [start : stop : step]. This is also known as slicing!

Note: The brackets let us extract either a single element or several elements at once.

In [7]:
# First element in tutors
tutors[0]
Out[7]:
'Jack'
In [8]:
# First three elements in tutors
tutors[0:3] # notice that stop is exclusive!
Out[8]:
['Jack', 'Ashley', 'Jason']
In [9]:
# Every other tutor
print(tutors) # What did tutors look like originally?
tutors[0:len(tutors):2]
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
Out[9]:
['Jack', 'Jason', 'Nick']
In [10]:
# You do not need to specify 0!
tutors[::2]
Out[10]:
['Jack', 'Jason', 'Nick']
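
The same [start : stop : step] rule also accepts negative numbers, which count from the end of the list. Here is a small sketch using a made-up list (letters is not one of the variables in this guide), in case it helps to see the pattern:

```python
# Hypothetical list, used only for illustration
letters = ["a", "b", "c", "d", "e"]

last = letters[-1]          # -1 refers to the last element
last_two = letters[-2:]     # from the second-to-last element to the end
backwards = letters[::-1]   # a step of -1 walks through the list in reverse

print(last)
print(last_two)
print(backwards)
```

Slicing never changes the original list; each slice is a new list.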

Babypandas - iloc¶

Recall we can use df.iloc[] to select rows by integer location. Like list slicing, iloc accepts [start : stop], and stop is exclusive.

In [11]:
# The first element
pop_estimates.iloc[0]
Out[11]:
Year                1950
Population    2558023014
Name: 0, dtype: int64
In [12]:
# The first three elements
pop_estimates.iloc[0:3]
Out[12]:
Year Population
0 1950 2558023014
1 1951 2595838116
2 1952 2637936830
In [13]:
# Remember we do not need to put 0 if we don't want to!
pop_estimates.iloc[:3]
Out[13]:
Year Population
0 1950 2558023014
1 1951 2595838116
2 1952 2637936830

iloc selects by integer location (position), not by label, which means we can still use it when the index labels are not integers.

In [14]:
# The first element
states.iloc[0]
Out[14]:
Favorite Cereal    Cap’n Crunch
Name: Delaware, dtype: object

We can also give .iloc a list of integer locations if we want specific rows.

In [15]:
pop_estimates.iloc[[2, 14, 44]]
Out[15]:
Year Population
2 1952 2637936830
14 1964 3282150912
44 1994 5650427498

Babypandas - loc¶

Recall we can use df.loc[] to select rows by their index label. loc also accepts [start : stop], where start and stop are labels rather than positions.

In [16]:
# The first element 
pop_estimates.loc[0] #the index's label is 0!
Out[16]:
Year                1950
Population    2558023014
Name: 0, dtype: int64
In [17]:
# A refresher on the states DataFrame
states.iloc[0:5]
Out[17]:
Favorite Cereal
State
Delaware Cap’n Crunch
Illinois Cinnamon Toast Crunch
Kentucky Cookie Crisp
Arkansas Froot Loops
North Carolina Cinnamon Toast Crunch
In [18]:
# We want California
states.loc["California"]
Out[18]:
Favorite Cereal    Honey Bunches of Oats
Name: California, dtype: object
In [19]:
# We want Illinois and Arkansas
states.loc[["Illinois", "Arkansas"]]
Out[19]:
Favorite Cereal
State
Illinois Cinnamon Toast Crunch
Arkansas Froot Loops

Adding Data¶

Lists - .append¶

This will add an item to the end of the list. It happens in place.

Note: When something happens in place it means the object is modified directly. You do not need to re-assign the variable.

In [20]:
tutors
Out[20]:
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
In [21]:
tutors.append("Baby Panda")
In [22]:
tutors
Out[22]:
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan', 'Baby Panda']
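
To make "in place" more concrete, here is a minimal sketch with a made-up list (names is hypothetical). It contrasts .append, which changes the list directly and returns None, with +, which builds a new list and leaves the original alone. A common mistake is writing my_list = my_list.append(x), which replaces the list with None.

```python
# Hypothetical list, used only for illustration
names = ["Ada", "Grace"]

result = names.append("Alan")   # in place: names changes, and .append returns None
combined = names + ["Edsger"]   # not in place: builds a new list, names stays the same

print(result)
print(names)
print(combined)
```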

Numpy Array - np.append¶

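This will add the values of one array to the end of another and give back a new array. If you do not specify an axis, both inputs are flattened first. Note this is different from a list's version!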

You can read more about it here.

In [23]:
# You first give it an array, then you give it the values you want to add to the original array!
np.append([1, 2, 3], [[2, 4, 6],[1, 3, 5]])
Out[23]:
array([1, 2, 3, 2, 4, 6, 1, 3, 5])
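
One difference that is easy to miss: np.append does not happen in place. The sketch below uses made-up arrays (original and extended are hypothetical names) to show that the first array is left untouched and the result has to be saved to a variable.

```python
import numpy as np

# Hypothetical arrays, used only for illustration
original = np.array([1, 2, 3])

extended = np.append(original, [4, 5])   # returns a new array

print(original)   # still has three elements
print(extended)   # has five elements
```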

DataFrames - .assign¶

This will add a new column to the DataFrame. It does not happen in place.

In [24]:
pop_estimates.assign(Population_Dupe = pop_estimates.get("Population")).iloc[0:5]
Out[24]:
Year Population Population_Dupe
0 1950 2558023014 2558023014
1 1951 2595838116 2595838116
2 1952 2637936830 2637936830
3 1953 2683288876 2683288876
4 1954 2731379529 2731379529
In [25]:
pop_estimates.iloc[0:5] #Notice it was not changed!
Out[25]:
Year Population
0 1950 2558023014
1 1951 2595838116
2 1952 2637936830
3 1953 2683288876
4 1954 2731379529
In [26]:
# To keep the new column, save the result to a new variable or re-assign the existing one
temp = pop_estimates.assign(Population_Dupe = pop_estimates.get("Population"))
temp.iloc[0:5]
Out[26]:
Year Population Population_Dupe
0 1950 2558023014 2558023014
1 1951 2595838116 2595838116
2 1952 2637936830 2637936830
3 1953 2683288876 2683288876
4 1954 2731379529 2731379529

Removing Data¶

List - .remove¶

This is not necessary for our class. I am pointing it out so you do not try to use .drop on a list or dictionary!

.remove happens in place.

In [27]:
tutors
Out[27]:
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan', 'Baby Panda']
In [28]:
tutors.remove("Baby Panda")
In [29]:
tutors
Out[29]:
['Jack', 'Ashley', 'Jason', 'Zoe', 'Nick', 'Guoxuan']
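
Two details about .remove are worth knowing: it removes only the first matching value, and it raises a ValueError if the value is not in the list. Here is a small sketch with a made-up list (nums and missing are hypothetical names):

```python
# Hypothetical list, used only for illustration
nums = [3, 1, 3, 2]

nums.remove(3)    # removes only the first 3
print(nums)

# Removing a value that is not in the list raises a ValueError
try:
    nums.remove(99)
    missing = False
except ValueError:
    missing = True

print(missing)
```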

DataFrame - .drop¶

This will remove the column we specify to drop. It does not happen in place.

In [30]:
temp.drop(columns = "Population_Dupe").iloc[0:5]
Out[30]:
Year Population
0 1950 2558023014
1 1951 2595838116
2 1952 2637936830
3 1953 2683288876
4 1954 2731379529
In [31]:
# Use a list to drop multiple columns
temp.drop(columns = ["Population_Dupe"]).iloc[0:5]
Out[31]:
Year Population
0 1950 2558023014
1 1951 2595838116
2 1952 2637936830
3 1953 2683288876
4 1954 2731379529
In [32]:
# Notice once again that we did not update temp, so the column was not dropped
temp.iloc[0:5]
Out[32]:
Year Population Population_Dupe
0 1950 2558023014 2558023014
1 1951 2595838116 2595838116
2 1952 2637936830 2637936830
3 1953 2683288876 2683288876
4 1954 2731379529 2731379529
In [33]:
temp = temp.drop(columns = "Population_Dupe")
temp.iloc[0:5]
Out[33]:
Year Population
0 1950 2558023014
1 1951 2595838116
2 1952 2637936830
3 1953 2683288876
4 1954 2731379529

Applying Functions¶

For this part I have made some functions below.

In [34]:
# This function determines if a year was from the 20th or 21st century

def determine_century(year):
    if 1900 <= year <= 1999:
        return "20th Century"
    elif year >= 2000:
        return "21st Century"
In [35]:
# This function creates a list of Booleans indicating whether each number in the list is greater than 5

def bigger_five(nums):
    output = []
    for num in nums:
        if num > 5:
            output.append(True)
        else:
            output.append(False)
    return output

Lists - The function goes around your value!¶

In [36]:
# Recall the variable fav_nums
fav_nums
Out[36]:
[2, 2.718281828459045, 14]
In [37]:
bigger_five(fav_nums)
Out[37]:
[False, False, True]

Babypandas - .apply¶

This is a method that will apply your function to each element of a Series, one at a time. This will not work on a DataFrame!

In [38]:
pop_estimates.get("Year").apply(determine_century) # You get a series back
Out[38]:
0     20th Century
1     20th Century
2     20th Century
3     20th Century
4     20th Century
          ...     
69    21st Century
70    21st Century
71    21st Century
72    21st Century
73    21st Century
Name: Year, Length: 74, dtype: object
In [39]:
pop_estimates = pop_estimates.assign(Century = pop_estimates.get("Year").apply(determine_century))
pop_estimates.iloc[[0, -1]]
Out[39]:
Year Population Century
0 1950 2558023014 20th Century
73 2023 7982019198 21st Century
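
For comparison, the closest plain-Python equivalent of .apply is calling the function once per element yourself, for example with a list comprehension. This is only a sketch: years is a made-up list, and determine_century is repeated here so the cell runs on its own. It also shows that the function returns None for years before 1900, because neither branch matches.

```python
def determine_century(year):
    if 1900 <= year <= 1999:
        return "20th Century"
    elif year >= 2000:
        return "21st Century"

# Hypothetical list of years, used only for illustration
years = [1950, 1999, 2000, 2023]

# Call the function on each element, like .apply does for a Series
centuries = [determine_century(year) for year in years]
print(centuries)

# Neither branch matches, so the function returns None
before_1900 = determine_century(1850)
print(before_1900)
```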

Aggregation¶

A function is a block of reusable code that performs a specific task. It can take in inputs, perform operations, and return an output. Functions are defined with def and are not tied to any specific object. I (Zoe) like to think of a function as something that hugs its elements (arguments): it surrounds the thing we want to transform.

A method also performs a specific task, but it is associated with an object. Methods are defined within a class and are called on objects. I (Zoe) like to think of a method as something that follows an element (an object): it always comes after a variable and a dot (.).

This might get a bit technical, but here is a table of the differences:

Function                                            Method
Independent and not associated with any object      Associated with an object (instance methods) or a class (class methods)
Called by its name directly                         Called on an object or class
Parameters are user-defined                         First parameter is self (for instance methods) or cls (for class methods)
Defined using the def keyword outside of a class    Defined inside a class
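
If the table feels abstract, the sketch below defines one function and one method that do the same job. Everything in it (double, Counter, doubled) is made up for illustration and is not part of babypandas. You will not need to write classes in this course; the point is only to see where self comes from.

```python
# A function: defined on its own and called by name, with the value inside the parentheses
def double(x):
    return 2 * x

# A method: defined inside a class and called on an object with a dot
class Counter:
    def __init__(self, start):
        self.value = start

    def doubled(self):
        # self is the object the method was called on
        return 2 * self.value

c = Counter(7)

from_function = double(7)         # the function "hugs" its argument
from_method = c.doubled()         # the method "follows" its object
same_thing = Counter.doubled(c)   # what Python does behind the scenes: c is passed in as self

print(from_function, from_method, same_thing)
```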

Lists - Functions (min and max)¶

In [40]:
# If I want the minimum of a list I use a function
min(fav_nums)
Out[40]:
2
In [41]:
# If I want the maximum of a list I use a function
max(fav_nums)
Out[41]:
14

Babypandas - Methods (.min and .max)¶

In [42]:
# If I want the minimum of a Series I use a method
pop_estimates.get("Population").min()
Out[42]:
2558023014
In [43]:
# If I want the maximum of a Series I use a method
pop_estimates.get("Population").max()
Out[43]:
7982019198

Lists - Functions (mean)¶

In [44]:
np.mean(fav_nums)
Out[44]:
6.239427276153015
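
Plain Python has no built-in mean function, which is why the cell above borrows np.mean. As a sketch of two alternatives that avoid numpy, you can divide the sum by the length, or use the statistics module from Python's standard library. Here fav_nums is rebuilt with math.e so the cell runs on its own.

```python
import math
import statistics

fav_nums = [2, math.e, 14]

# Option 1: add everything up and divide by how many numbers there are
mean_by_hand = sum(fav_nums) / len(fav_nums)

# Option 2: the statistics module from the standard library
mean_stats = statistics.mean(fav_nums)

print(mean_by_hand)
print(mean_stats)
```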

Babypandas - Methods (.mean)¶

In [45]:
pop_estimates.get("Population").mean()
Out[45]:
5082349061.162162

Lists - Functions (sum)¶

In [46]:
sum(fav_nums)
Out[46]:
18.718281828459045

Babypandas - Methods (.sum)¶

In [47]:
pop_estimates.get("Population").sum()
Out[47]:
376093830526

The End!¶

As you can see, there are differences between plain Python (the programming language) and babypandas (a library). I hope you can refer to this guide to help you avoid common mistakes. If you have questions, please post them on Ed. Thank you!

Back to Table of Contents