# Run this cell to set up packages for lecture.
from lec08_imports import *
*Reminder:* Use the DSC 10 Reference Sheet.
So far, we've used functions (e.g. max, np.sqrt, len) and methods (e.g. .groupby, .assign, .plot).
multiples_of_10 = np.arange(10, 130, 10)
multiples_of_10
array([ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
multiples_of_8 = np.arange(8, 13*8, 8)
multiples_of_8
array([ 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96])
What if we want to find the multiples of some other number, k? We can copy-paste and change some numbers, but that is prone to error.
multiples_of_5 = ...
multiples_of_5
Ellipsis
It turns out that we can define our own "multiples" function just once, and re-use it many times for different values of k. 🔁
def multiples(k):
    '''This function returns the
    first twelve multiples of k.'''
    return np.arange(k, 13*k, k)
multiples(8)
array([ 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96])
multiples(5)
array([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60])
Note that we only had to specify how to calculate multiples a single time!
Functions are a way to divide our code into small subparts to prevent us from writing repetitive code. Each time we define our own function in Python, we will use the following pattern.
show_def()
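As a minimal sketch of that pattern, here is an annotated definition. The names double, value, and result are hypothetical and chosen only for illustration; they do not appear elsewhere in the lecture.

```python
# Header: the keyword def, the function's name, its parameters in parentheses, and a colon.
def double(value):
    '''Docstring: a short description of what the function does.'''
    # Body: indented code that runs each time the function is called.
    result = value * 2
    # return specifies what a call to the function evaluates to.
    return result

print(double(21))  # 42
```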
We can use a function such as bpd.read_csv without knowing how it works.
multiples(7)
array([ 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84])
multiples(-2)
array([ -2, -4, -6, -8, -10, -12, -14, -16, -18, -20, -22, -24])
triple has one parameter, x.
def triple(x):
    return x * 3
When we call triple with the argument 5, within the body of triple, x means 5.
triple(5)
15
We can call triple with other arguments, even strings!
triple(7 + 8)
45
triple('triton')
'tritontritontriton'
The names you choose for a function’s parameters are only known to that function (known as local scope). The rest of your notebook is unaffected by parameter names.
def triple(x):
    return x * 3
triple(7)
21
Since we haven't defined an x outside of the body of triple, our notebook doesn't know what x means.
x
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_181/32546335.py in <module>
----> 1 x

NameError: name 'x' is not defined
We can define an x outside of the body of triple, but that doesn't change how triple works.
x = 15
# When triple(12) is called, you can pretend
# there's an invisible line inside the body of triple
# that says x = 12.
# The x = 15 above is ignored.
triple(12)
36
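The same behavior can be checked in one self-contained cell. This sketch repeats the lecture's definition of triple; the name result is introduced here only for illustration.

```python
def triple(x):
    return x * 3

x = 15                 # a global x, unrelated to the parameter x
result = triple(12)    # within this call, the parameter x means 12

print(result)  # 36
print(x)       # 15: calling triple did not change the global x
```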
Functions can take any number of arguments.
greeting takes no arguments.
def greeting():
    return 'Hi! 👋'
greeting()
'Hi! 👋'
custom_multiples takes two arguments!
def custom_multiples(k, how_many):
    '''This function returns the
    first how_many multiples of k.'''
    return np.arange(k, (how_many + 1)*k, k)
custom_multiples(10, 7)
array([10, 20, 30, 40, 50, 60, 70])
custom_multiples(2, 100)
array([ 2, 4, 6, ..., 196, 198, 200])
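When a function takes several arguments, they can also be passed by name. The sketch below illustrates this with a plain-Python stand-in, custom_multiples_list (a hypothetical name, not part of the lecture), which returns a list instead of a NumPy array so that it runs without any imports.

```python
def custom_multiples_list(k, how_many):
    '''Returns the first how_many multiples of k as a plain Python list.'''
    return [k * i for i in range(1, how_many + 1)]

# Positional arguments are matched to parameters by their order.
print(custom_multiples_list(10, 7))              # [10, 20, 30, 40, 50, 60, 70]

# Keyword arguments are matched by name, so their order does not matter.
print(custom_multiples_list(how_many=7, k=10))   # [10, 20, 30, 40, 50, 60, 70]
```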
The body of a function is not run until you use (call) the function.
Here, we can define where_is_the_error without seeing an error message.
def where_is_the_error(something):
    '''A function to illustrate that errors don't occur
    until functions are executed (called).'''
    return (1 / 0) + something
It is only when we call where_is_the_error that Python gives us an error message.
where_is_the_error(5)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_181/3423408763.py in <module>
----> 1 where_is_the_error(5)

/tmp/ipykernel_181/3410008411.py in where_is_the_error(something)
      2     '''A function to illustrate that errors don't occur
      3     until functions are executed (called).'''
----> 4     return (1 / 0) + something

ZeroDivisionError: division by zero
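To see both halves of this in a single cell, one option is to catch the error instead of letting it stop the notebook. This is only a sketch; try/except has not been covered in this lecture, and the name outcome is introduced here for illustration.

```python
def where_is_the_error(something):
    '''Defining this function raises no error, even though the body divides by zero.'''
    return (1 / 0) + something

# The definition above ran without complaint. The error appears only when we call it.
try:
    where_is_the_error(5)
    outcome = 'no error'
except ZeroDivisionError as err:
    outcome = 'ZeroDivisionError: ' + str(err)

print(outcome)
```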
first_name
Let's create a function called first_name that takes in someone's full name and returns their first name. Example behavior is shown below.
>>> first_name('Pradeep Khosla')
'Pradeep'
Hint: Use the string method .split.
General strategy for writing functions:
'Pradeep Khosla'.split(' ')[0]
'Pradeep'
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
first_name('Pradeep Khosla')
'Pradeep'
# What if there are three names?
first_name('Chancellor Pradeep Khosla')
'Chancellor'
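The three-name case shows that first_name returns whatever comes before the first space. A related edge case is leading whitespace: splitting on ' ' then produces an empty string first. The sketch below compares the lecture's version with a hypothetical variant, first_name_stripped, which calls .split() with no argument so that any run of whitespace is treated as one separator and leading whitespace is ignored.

```python
def first_name(full_name):
    '''The lecture's version: splits on single spaces.'''
    return full_name.split(' ')[0]

def first_name_stripped(full_name):
    '''Hypothetical variant: splits on any run of whitespace.'''
    return full_name.split()[0]

print(repr(first_name('Pradeep Khosla')))             # 'Pradeep'
print(repr(first_name('  Pradeep Khosla')))           # ''
print(repr(first_name_stripped('  Pradeep Khosla')))  # 'Pradeep'
```

Neither version can tell a title from a first name, so 'Chancellor Pradeep Khosla' still yields 'Chancellor'.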
The return keyword specifies what the output of your function should be, i.e. what a call to your function will evaluate to.
print and return work differently!
def pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b.'''
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)
x = pythagorean(3, 4)
5.0
# No output – why?
x
# Errors – why?
x + 10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_181/3707561498.py in <module>
      1 # Errors – why?
----> 2 x + 10

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
def better_pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b,
    and actually returns the result.
    '''
    c = (a ** 2 + b ** 2) ** 0.5
    return c
x = better_pythagorean(3, 4)
x
5.0
x + 10
15.0
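One way to make the difference concrete is to inspect what each call evaluates to. In this sketch, which repeats both definitions, a function with no return statement evaluates to None; that is why the TypeError above mentions 'NoneType'. The names printed_only and returned are introduced here for illustration.

```python
def pythagorean(a, b):
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)              # displays 5.0, but the value is not handed back

def better_pythagorean(a, b):
    c = (a ** 2 + b ** 2) ** 0.5
    return c              # the call itself evaluates to c

printed_only = pythagorean(3, 4)
returned = better_pythagorean(3, 4)

print(printed_only is None)  # True
print(returned + 10)         # 15.0
```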
Once a function executes a return statement, it stops running.
def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)
motivational('Fall seven times and stand up eight.')
0
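For contrast, a sketch of the same function with the two lines swapped: because print now comes before return, it runs. The name motivational_fixed is hypothetical.

```python
def motivational_fixed(quote):
    # This line runs because it comes before the return statement.
    print("Here's a motivational quote:", quote)
    return 0

value = motivational_fixed('Fall seven times and stand up eight.')
print(value)  # 0
```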
The DataFrame roster contains the names and lecture sections of all students enrolled in DSC 10 this quarter. The first names are real, while the last names have been anonymized for privacy.
roster = bpd.read_csv('data/roster-anon.csv')
roster
| | name | section |
---|---|---|
0 | Camilla Ozqcfu | B |
1 | Yifan Ignjpe | A |
2 | Ali Jjsojh | A |
... | ... | ... |
491 | Qiankang Tzjsbb | D |
492 | Kim Wxnmuh | C |
493 | Jazmine Sxcvft | C |
494 rows × 2 columns
What is the most common first name among DSC 10 students? (Any guesses?)
roster
| | name | section |
---|---|---|
0 | Camilla Ozqcfu | B |
1 | Yifan Ignjpe | A |
2 | Ali Jjsojh | A |
... | ... | ... |
491 | Qiankang Tzjsbb | D |
492 | Kim Wxnmuh | C |
493 | Jazmine Sxcvft | C |
494 rows × 2 columns
'name' column.
first_name function
Somehow, we need to call first_name on every student's 'name'.
roster
| | name | section |
---|---|---|
0 | Camilla Ozqcfu | B |
1 | Yifan Ignjpe | A |
2 | Ali Jjsojh | A |
... | ... | ... |
491 | Qiankang Tzjsbb | D |
492 | Kim Wxnmuh | C |
493 | Jazmine Sxcvft | C |
494 rows × 2 columns
roster.get('name').iloc[0]
'Camilla Ozqcfu'
first_name(roster.get('name').iloc[0])
'Camilla'
first_name(roster.get('name').iloc[1])
'Yifan'
Ideally, there's a better solution than doing this hundreds of times...
.apply
To apply a function func_name to every element of column 'col' in DataFrame df, use
df.get('col').apply(func_name)
The .apply method is a Series method.
Use .apply on Series, not DataFrames.
The output of .apply is also a Series.
Pass the function itself, as in .apply(first_name), not a call to it, as in .apply(first_name()).
roster.get('name')
0       Camilla Ozqcfu
1         Yifan Ignjpe
2           Ali Jjsojh
              ...
491    Qiankang Tzjsbb
492         Kim Wxnmuh
493     Jazmine Sxcvft
Name: name, Length: 494, dtype: object
roster.get('name').apply(first_name)
0       Camilla
1         Yifan
2           Ali
         ...
491    Qiankang
492         Kim
493     Jazmine
Name: name, Length: 494, dtype: object
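Conceptually, .apply calls the given function once per element and collects the results. The sketch below mimics that idea in plain Python, with a list standing in for a Series; apply_to_each is a hypothetical helper and is not part of babypandas.

```python
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

def apply_to_each(values, func):
    '''Calls func on every element of values and collects the results in a list.'''
    return [func(value) for value in values]

names = ['Camilla Ozqcfu', 'Yifan Ignjpe', 'Ali Jjsojh']

# first_name is passed without parentheses: we hand over the function itself,
# and apply_to_each is the one that calls it.
print(apply_to_each(names, first_name))  # ['Camilla', 'Yifan', 'Ali']
```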
roster = roster.assign(
    first=roster.get('name').apply(first_name)
)
roster
| | name | section | first |
---|---|---|---|
0 | Camilla Ozqcfu | B | Camilla |
1 | Yifan Ignjpe | A | Yifan |
2 | Ali Jjsojh | A | Ali |
... | ... | ... | ... |
491 | Qiankang Tzjsbb | D | Qiankang |
492 | Kim Wxnmuh | C | Kim |
493 | Jazmine Sxcvft | C | Jazmine |
494 rows × 3 columns
Now that we have a column containing first names, we can find the distribution of first names.
name_counts = (
    roster
    .groupby('first')
    .count()
    .sort_values('name', ascending=False)
    .get(['name'])
)
name_counts
first | name |
---|---|
Andrew | 7 |
Noah | 5 |
Joseph | 5 |
... | ... |
Heaven | 1 |
Harsh | 1 |
Ziyong | 1 |
427 rows × 1 columns
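The group-count-sort steps can be mirrored in plain Python with collections.Counter. This is only an analogy to the DataFrame code, not how babypandas works internally, and the list firsts is toy data made up for illustration rather than the actual roster.

```python
from collections import Counter

# Toy data, not the real roster.
firsts = ['Andrew', 'Noah', 'Andrew', 'Joseph', 'Noah', 'Andrew']

# Counter tallies how many times each value appears (like .groupby('first').count()).
counts = Counter(firsts)

# most_common() lists (value, count) pairs from most to least frequent
# (like .sort_values('name', ascending=False)).
print(counts.most_common())  # [('Andrew', 3), ('Noah', 2), ('Joseph', 1)]
```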
Below:
Hint: Start by defining a DataFrame with only the names in name_counts that appeared at least twice. You can use this DataFrame to answer both questions.
shared_names = name_counts[name_counts.get('name') >= 2]

# Bar chart.
shared_names.sort_values('name').plot(kind='barh', y='name', figsize=(5, 8));

# Proportion = # students with a shared name / total # of students.
shared_names.get('name').sum() / roster.shape[0]
...
Ellipsis
...
Ellipsis
.apply works with built-in functions, too!
name_counts.get('name')
first
Andrew    7
Noah      5
Joseph    5
         ..
Heaven    1
Harsh     1
Ziyong    1
Name: name, Length: 427, dtype: int64
# Not necessarily meaningful, but doable.
name_counts.get('name').apply(np.log)
first
Andrew    1.95
Noah      1.61
Joseph    1.61
          ...
Heaven    0.00
Harsh     0.00
Ziyong    0.00
Name: name, Length: 427, dtype: float64
In name_counts, first names are stored in the index, which is not a Series. This means we can't use .apply on it.
name_counts.index
Index(['Andrew', 'Noah', 'Joseph', 'Ethan', 'Michael', 'Christopher', 'Daniel', 'Justin', 'Abhinav', 'Jaden', ... 'Ingkawat', 'I-Shan', 'Humza', 'Huilin', 'Honoka', 'Hirkani', 'Hilary', 'Heaven', 'Harsh', 'Ziyong'], dtype='object', name='first', length=427)
name_counts.index.apply(max)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_181/1905262767.py in <module>
----> 1 name_counts.index.apply(max)

AttributeError: 'Index' object has no attribute 'apply'
To help, we can use .reset_index() to turn the index of a DataFrame into a column, and to reset the index back to the default of 0, 1, 2, 3, and so on.
# What is the max of an individual string?
name_counts.reset_index().get('first').apply(max)
0      w
1      o
2      s
      ..
424    v
425    s
426    y
Name: first, Length: 427, dtype: object
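To answer the question in the comment: the max of a single string is its largest character in Python's character ordering, where every uppercase letter comes before every lowercase letter. A small sketch using the first three names from the index (results is a name introduced here):

```python
results = [max(name) for name in ['Andrew', 'Noah', 'Joseph']]
print(results)      # ['w', 'o', 's']

# Lowercase letters come after uppercase ones, so 'a' beats 'Z'.
print(max('aZ'))    # 'a'
```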
'Jayden Lcaert' wants to see if there's another 'Jayden' in their section.
Strategy:
1. Which section is 'Jayden Lcaert' in?
2. How many people in that section have a first name of 'Jayden'?
roster
| | name | section | first |
---|---|---|---|
0 | Camilla Ozqcfu | B | Camilla |
1 | Yifan Ignjpe | A | Yifan |
2 | Ali Jjsojh | A | Ali |
... | ... | ... | ... |
491 | Qiankang Tzjsbb | D | Qiankang |
492 | Kim Wxnmuh | C | Kim |
493 | Jazmine Sxcvft | C | Jazmine |
494 rows × 3 columns
which_section = roster[roster.get('name') == 'Jayden Lcaert'].get('section').iloc[0]
which_section
'B'
first_cond = roster.get('first') == 'Jayden' # A Boolean Series!
section_cond = roster.get('section') == which_section # A Boolean Series!
how_many = roster[first_cond & section_cond].shape[0]
how_many
2
shared_first_and_section
Let's create a function named shared_first_and_section. It will take in the full name of a student and return the number of students in their section with the same first name (including them).
Note: This is the first function we're writing that involves using a DataFrame within the function – this is fine!
def shared_first_and_section(name):
    # First, find the row corresponding to that full name in roster.
    # We're assuming that full names are unique.
    row = roster[roster.get('name') == name]
    # Then, get that student's first name and section.
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    # Now, find all the students with the same first name and section.
    shared_info = roster[(roster.get('first') == first) & (roster.get('section') == section)]
    # Return the number of such students.
    return shared_info.shape[0]
shared_first_and_section('Jayden Lcaert')
2
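The same logic can be traced without a DataFrame. The sketch below is a plain-Python analogy that uses a small list of (name, section) pairs; toy_roster and shared_first_and_section_toy are hypothetical names, and every entry other than 'Jayden Lcaert' and 'Ali Jjsojh' is made up for illustration.

```python
# Toy stand-in for roster: (full name, section) pairs.
toy_roster = [
    ('Jayden Lcaert', 'B'),
    ('Jayden Example', 'B'),   # hypothetical student
    ('Jayden Sample', 'A'),    # hypothetical student
    ('Ali Jjsojh', 'A'),
]

def shared_first_and_section_toy(name):
    # Find that student's section. We're assuming that full names are unique.
    section = [sec for full, sec in toy_roster if full == name][0]
    # Get that student's first name.
    first = name.split(' ')[0]
    # Keep students whose first name AND section both match.
    matches = [
        full for full, sec in toy_roster
        if full.split(' ')[0] == first and sec == section
    ]
    return len(matches)

print(shared_first_and_section_toy('Jayden Lcaert'))  # 2
print(shared_first_and_section_toy('Jayden Sample'))  # 1
```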
Now, let's add a column to roster that contains the values returned by shared_first_and_section.
roster = roster.assign(shared=roster.get('name').apply(shared_first_and_section))
roster
| | name | section | first | shared |
---|---|---|---|---|
0 | Camilla Ozqcfu | B | Camilla | 1 |
1 | Yifan Ignjpe | A | Yifan | 1 |
2 | Ali Jjsojh | A | Ali | 2 |
... | ... | ... | ... | ... |
491 | Qiankang Tzjsbb | D | Qiankang | 1 |
492 | Kim Wxnmuh | C | Kim | 1 |
493 | Jazmine Sxcvft | C | Jazmine | 1 |
494 rows × 4 columns
Let's find all of the students who are in a section with someone that has the same first name as them.
roster[(roster.get('shared') >= 2)].sort_values('shared', ascending=False)
| | name | section | first | shared |
---|---|---|---|---|
486 | Joseph Jpoelz | C | Joseph | 4 |
411 | Joseph Vdhfyo | C | Joseph | 4 |
329 | Joseph Jaqwhh | C | Joseph | 4 |
... | ... | ... | ... | ... |
226 | William Mjsrep | D | William | 2 |
154 | Jayden Lcaert | B | Jayden | 2 |
2 | Ali Jjsojh | A | Ali | 2 |
46 rows × 4 columns
While the DataFrame above contains the information we were looking for, it is not organized very conveniently and it is somewhat redundant.
Wouldn't it be great if we could create a DataFrame like the one below? We'll see how next time!
Find the longest first name in the class that is shared by at least two students in the same section.
Hint: You'll have to use both .assign and .apply.
with_len = roster.assign(name_len=roster.get('first').apply(len))
with_len[with_len.get('shared') >= 2].sort_values('name_len', ascending=False).get('first').iloc[0]
...
Ellipsis
The .apply method allows us to call a function on every single element of a Series, which usually comes from .getting a column of a DataFrame.
More advanced DataFrame manipulations!