# Run this cell to set up packages for lecture.
from lec08_imports import *
*Reminder:* Use the DSC 10 Reference Sheet.
So far, we've used built-in functions (e.g. max, np.sqrt, len) and methods (e.g. .groupby, .assign, .plot).
multiples_of_10 = np.arange(10, 130, 10)
multiples_of_10
array([ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120])
multiples_of_8 = np.arange(8, 13*8, 8)
multiples_of_8
array([ 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96])
What if we want to find the multiples of some other number, k? We can copy-paste and change some numbers, but that is prone to error.
multiples_of_5 = ...
multiples_of_5
Ellipsis
It turns out that we can define our own "multiples" function just once, and re-use it many times for different values of k. 🔁
def multiples(k):
    '''This function returns the
    first twelve multiples of k.'''
    return np.arange(k, 13*k, k)
multiples(8)
array([ 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96])
multiples(5)
array([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60])
Note that we only had to specify how to calculate multiples a single time!
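The stop value 13*k may look surprising for twelve multiples. np.arange excludes its stop value, so the last element is 12*k. Python's built-in range uses the same start, stop, step convention, so here is a rough sketch of the same idea without NumPy. The name multiples_with_range is hypothetical and used only for this illustration.

```python
def multiples_with_range(k):
    '''Returns the first twelve multiples of k as a list.'''
    # The stop value 13*k is excluded, so the last element is 12*k.
    return list(range(k, 13 * k, k))

first_twelve = multiples_with_range(8)
print(first_twelve)       # [8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96]
print(len(first_twelve))  # 12
```

This version returns a list rather than an array, so it is an analogy for the endpoint behavior, not a replacement for multiples.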
Functions are a way to divide our code into small subparts to prevent us from writing repetitive code. Each time we define our own function in Python, we will use the following pattern.
show_def()
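As a rough text version of the pattern, here is a minimal sketch. The names function_name, parameter_1, and parameter_2 are placeholders chosen for illustration, not names used elsewhere in this lecture.

```python
def function_name(parameter_1, parameter_2):
    '''Docstring: a short description of what the function does.'''
    # The body is indented and can use the parameters by name.
    result = parameter_1 + parameter_2
    # The return statement determines what a call evaluates to.
    return result

# Calling the function with two arguments.
total = function_name(3, 4)
print(total)  # 7
```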
We can call a function, such as bpd.read_csv, without knowing how it works.
multiples(7)
array([ 7, 14, 21, 28, 35, 42, 49, 56, 63, 70, 77, 84])
multiples(-2)
array([ -2, -4, -6, -8, -10, -12, -14, -16, -18, -20, -22, -24])
triple has one parameter, x.
def triple(x):
    return x * 3
When we call triple with the argument 5, within the body of triple, x means 5.
triple(5)
15
We can call triple with other arguments, even strings!
triple(7 + 8)
45
triple('triton')
'tritontritontriton'
The names you choose for a function’s parameters are only known to that function (known as local scope). The rest of your notebook is unaffected by parameter names.
def triple(x):
    return x * 3
triple(7)
21
Since we haven't defined an x outside of the body of triple, our notebook doesn't know what x means.
x
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_181/32546335.py in <module>
----> 1 x

NameError: name 'x' is not defined
We can define an x outside of the body of triple, but that doesn't change how triple works.
x = 15
# When triple(12) is called, you can pretend
# there's an invisible line inside the body of triple
# that says x = 12.
# The x = 15 above is ignored.
triple(12)
36
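The same idea holds when a function assigns to its parameter. As a small sketch, the hypothetical function add_one below changes x inside its body, and the x defined outside is untouched.

```python
def add_one(x):
    # This assignment only rebinds the local x inside add_one.
    x = x + 1
    return x

x = 15
result = add_one(12)

print(result)  # 13
print(x)       # 15, the x outside the function is unchanged
```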
Functions can take any number of arguments.
greeting takes no arguments.
def greeting():
    return 'Hi! 👋'
greeting()
'Hi! 👋'
custom_multiples takes two arguments!
def custom_multiples(k, how_many):
    '''This function returns the
    first how_many multiples of k.'''
    return np.arange(k, (how_many + 1)*k, k)
custom_multiples(10, 7)
array([10, 20, 30, 40, 50, 60, 70])
custom_multiples(2, 100)
array([ 2, 4, 6, ..., 196, 198, 200])
The body of a function is not run until you use (call) the function.
Here, we can define where_is_the_error without seeing an error message.
def where_is_the_error(something):
    '''A function to illustrate that errors don't occur
    until functions are executed (called).'''
    return (1 / 0) + something
It is only when we call where_is_the_error that Python gives us an error message.
where_is_the_error(5)
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
/tmp/ipykernel_181/3423408763.py in <module>
----> 1 where_is_the_error(5)

/tmp/ipykernel_181/3410008411.py in where_is_the_error(something)
      2     '''A function to illustrate that errors don't occur
      3     until functions are executed (called).'''
----> 4     return (1 / 0) + something

ZeroDivisionError: division by zero
first_name
Let's create a function called first_name that takes in someone's full name and returns their first name. Example behavior is shown below.
>>> first_name('Pradeep Khosla')
'Pradeep'
Hint: Use the string method .split.
General strategy for writing functions:
'Pradeep Khosla'.split(' ')[0]
'Pradeep'
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
first_name('Pradeep Khosla')
'Pradeep'
# What if there are three names?
first_name('Chancellor Pradeep Khosla')
'Chancellor'
The return keyword specifies what the output of your function should be, i.e. what a call to your function will evaluate to.
print and return work differently!
def pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b.'''
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)
x = pythagorean(3, 4)
5.0
# No output – why?
x
# Errors – why?
x + 10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_181/3707561498.py in <module>
      1 # Errors – why?
----> 2 x + 10

TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
def better_pythagorean(a, b):
    '''Computes the hypotenuse length of a right triangle with legs a and b,
    and actually returns the result.
    '''
    c = (a ** 2 + b ** 2) ** 0.5
    return c
x = better_pythagorean(3, 4)
x
5.0
x + 10
15.0
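To see the difference side by side, here is a small sketch with two hypothetical functions, prints_sum and returns_sum. A function that never reaches a return statement evaluates to None, which is why the earlier x + 10 raised a TypeError.

```python
def prints_sum(a, b):
    # Displays the value, but the call itself evaluates to None.
    print(a + b)

def returns_sum(a, b):
    # The call evaluates to the sum, so it can be used in later code.
    return a + b

printed = prints_sum(3, 4)    # 7 is displayed here
returned = returns_sum(3, 4)  # nothing is displayed here

print(printed is None)  # True
print(returned + 10)    # 17
```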
Once a function executes a return statement, it stops running.
def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)
motivational('Fall seven times and stand up eight.')
0
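One way to confirm that the line after return never runs is to record each step in a list. The sketch below uses a hypothetical function stops_early and a hypothetical list named steps.

```python
steps = []

def stops_early():
    steps.append('before return')
    return 'done'
    # This line is never reached, because the function already returned.
    steps.append('after return')

value = stops_early()

print(value)  # done
print(steps)  # ['before return']
```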
The DataFrame roster contains the names and lecture sections of all students enrolled in DSC 10 this quarter. The first names are real, while the last names have been anonymized for privacy.
roster = bpd.read_csv('data/roster-anon.csv')
roster
| | name | section |
|---|---|---|
| 0 | Camilla Ozqcfu | B |
| 1 | Yifan Ignjpe | A |
| 2 | Ali Jjsojh | A |
| ... | ... | ... |
| 491 | Qiankang Tzjsbb | D |
| 492 | Kim Wxnmuh | C |
| 493 | Jazmine Sxcvft | C |
494 rows × 2 columns
What is the most common first name among DSC 10 students? (Any guesses?)
roster
| | name | section |
|---|---|---|
| 0 | Camilla Ozqcfu | B |
| 1 | Yifan Ignjpe | A |
| 2 | Ali Jjsojh | A |
| ... | ... | ... |
| 491 | Qiankang Tzjsbb | D |
| 492 | Kim Wxnmuh | C |
| 493 | Jazmine Sxcvft | C |
494 rows × 2 columns
The first names are contained in the 'name' column.
Using the first_name function
Somehow, we need to call first_name on every student's 'name'.
roster
| | name | section |
|---|---|---|
| 0 | Camilla Ozqcfu | B |
| 1 | Yifan Ignjpe | A |
| 2 | Ali Jjsojh | A |
| ... | ... | ... |
| 491 | Qiankang Tzjsbb | D |
| 492 | Kim Wxnmuh | C |
| 493 | Jazmine Sxcvft | C |
494 rows × 2 columns
roster.get('name').iloc[0]
'Camilla Ozqcfu'
first_name(roster.get('name').iloc[0])
'Camilla'
first_name(roster.get('name').iloc[1])
'Yifan'
Ideally, there's a better solution than doing this hundreds of times...
.apply
To apply the function func_name to every element of column 'col' in DataFrame df, use
df.get('col').apply(func_name)
The .apply method is a Series method.
Use .apply on Series, not DataFrames.
The output of .apply is also a Series.
Pass just the name of the function, without calling it: use .apply(first_name), not .apply(first_name()).
roster.get('name')
0 Camilla Ozqcfu
1 Yifan Ignjpe
2 Ali Jjsojh
...
491 Qiankang Tzjsbb
492 Kim Wxnmuh
493 Jazmine Sxcvft
Name: name, Length: 494, dtype: object
roster.get('name').apply(first_name)
0 Camilla
1 Yifan
2 Ali
...
491 Qiankang
492 Kim
493 Jazmine
Name: name, Length: 494, dtype: object
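Conceptually, .apply calls the given function once per element and collects the results. The sketch below is a plain-Python analogy using a list and a hypothetical helper named apply_to_each. It is not how babypandas implements .apply, and it returns a list instead of a Series.

```python
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

def apply_to_each(values, func):
    '''Hypothetical stand-in for .apply: calls func once per element.'''
    return [func(value) for value in values]

names = ['Camilla Ozqcfu', 'Yifan Ignjpe', 'Ali Jjsojh']

# Pass the function itself, without parentheses.
firsts = apply_to_each(names, first_name)
print(firsts)  # ['Camilla', 'Yifan', 'Ali']
```

Writing first_name() instead would try to call first_name with no arguments before the helper ever runs, which raises a TypeError.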
roster = roster.assign(
    first=roster.get('name').apply(first_name)
)
roster
| | name | section | first |
|---|---|---|---|
| 0 | Camilla Ozqcfu | B | Camilla |
| 1 | Yifan Ignjpe | A | Yifan |
| 2 | Ali Jjsojh | A | Ali |
| ... | ... | ... | ... |
| 491 | Qiankang Tzjsbb | D | Qiankang |
| 492 | Kim Wxnmuh | C | Kim |
| 493 | Jazmine Sxcvft | C | Jazmine |
494 rows × 3 columns
Now that we have a column containing first names, we can find the distribution of first names.
name_counts = (
    roster
    .groupby('first')
    .count()
    .sort_values('name', ascending=False)
    .get(['name'])
)
name_counts
| first | name |
|---|---|
| Andrew | 7 |
| Noah | 5 |
| Joseph | 5 |
| ... | ... |
| Heaven | 1 |
| Harsh | 1 |
| Ziyong | 1 |
427 rows × 1 columns
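Below, create a bar chart of the first names shared by at least two students, and find the proportion of students whose first name is shared by at least two students.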
Hint: Start by defining a DataFrame with only the names in name_counts that appeared at least twice. You can use this DataFrame to answer both questions.
shared_names = name_counts[name_counts.get('name') >= 2]
# Bar chart.
shared_names.sort_values('name').plot(kind='barh', y='name', figsize=(5, 8));
# Proportion = # students with a shared name / total # of students.
shared_names.get('name').sum() / roster.shape[0]
.apply works with built-in functions, too!
name_counts.get('name')
first
Andrew 7
Noah 5
Joseph 5
..
Heaven 1
Harsh 1
Ziyong 1
Name: name, Length: 427, dtype: int64
# Not necessarily meaningful, but doable.
name_counts.get('name').apply(np.log)
first
Andrew 1.95
Noah 1.61
Joseph 1.61
...
Heaven 0.00
Harsh 0.00
Ziyong 0.00
Name: name, Length: 427, dtype: float64
In name_counts, first names are stored in the index, which is not a Series. This means we can't use .apply on it.
name_counts.index
Index(['Andrew', 'Noah', 'Joseph', 'Ethan', 'Michael', 'Christopher', 'Daniel',
'Justin', 'Abhinav', 'Jaden',
...
'Ingkawat', 'I-Shan', 'Humza', 'Huilin', 'Honoka', 'Hirkani', 'Hilary',
'Heaven', 'Harsh', 'Ziyong'],
dtype='object', name='first', length=427)
name_counts.index.apply(max)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_181/1905262767.py in <module>
----> 1 name_counts.index.apply(max)

AttributeError: 'Index' object has no attribute 'apply'
To help, we can use .reset_index() to turn the index of a DataFrame into a column, and to reset the index back to the default of 0, 1, 2, 3, and so on.
# What is the max of an individual string?
name_counts.reset_index().get('first').apply(max)
0 w
1 o
2 s
..
424 v
425 s
426 y
Name: first, Length: 427, dtype: object
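Why is the max of 'Andrew' the letter 'w'? Calling max on a string returns its largest character, and characters are compared by their Unicode code points. A short sketch with only built-in functions:

```python
print(max('Andrew'))  # w
print(max('Noah'))    # o

# ord gives a character's code point; lowercase letters come after uppercase.
print(ord('A'), ord('w'))  # 65 119
print('Z' < 'a')           # True
```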
'Jayden Lcaert' wants to see if there's another 'Jayden' in their section.
Strategy:
Which section is 'Jayden Lcaert' in?
How many students in that section have the first name 'Jayden'?
roster
| | name | section | first |
|---|---|---|---|
| 0 | Camilla Ozqcfu | B | Camilla |
| 1 | Yifan Ignjpe | A | Yifan |
| 2 | Ali Jjsojh | A | Ali |
| ... | ... | ... | ... |
| 491 | Qiankang Tzjsbb | D | Qiankang |
| 492 | Kim Wxnmuh | C | Kim |
| 493 | Jazmine Sxcvft | C | Jazmine |
494 rows × 3 columns
which_section = roster[roster.get('name') == 'Jayden Lcaert'].get('section').iloc[0]
which_section
'B'
first_cond = roster.get('first') == 'Jayden' # A Boolean Series!
section_cond = roster.get('section') == which_section # A Boolean Series!
how_many = roster[first_cond & section_cond].shape[0]
how_many
2
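The & above combines the two Boolean Series element by element: a row is kept only when both conditions are True for it. As a plain-Python analogy, the sketch below does the same with lists. The four-student data is made up for illustration and is not taken from roster.

```python
# Made-up data: one entry per student.
firsts = ['Jayden', 'Ali', 'Jayden', 'Jayden']
sections = ['B', 'A', 'B', 'C']

# One True/False value per student, like a Boolean Series.
first_cond = [first == 'Jayden' for first in firsts]
section_cond = [section == 'B' for section in sections]

# Element-wise "and": True only where both conditions hold.
both = [f and s for f, s in zip(first_cond, section_cond)]

# True counts as 1, so sum counts the matching students.
how_many = sum(both)
print(both)      # [True, False, True, False]
print(how_many)  # 2
```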
shared_first_and_section
Let's create a function named shared_first_and_section. It will take in the full name of a student and return the number of students in their section with the same first name (including them).
Note: This is the first function we're writing that involves using a DataFrame within the function – this is fine!
def shared_first_and_section(name):
    # First, find the row corresponding to that full name in roster.
    # We're assuming that full names are unique.
    row = roster[roster.get('name') == name]
    # Then, get that student's first name and section.
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    # Now, find all the students with the same first name and section.
    shared_info = roster[(roster.get('first') == first) & (roster.get('section') == section)]
    # Return the number of such students.
    return shared_info.shape[0]
shared_first_and_section('Jayden Lcaert')
2
Now, let's add a column to roster that contains the values returned by shared_first_and_section.
roster = roster.assign(shared=roster.get('name').apply(shared_first_and_section))
roster
| | name | section | first | shared |
|---|---|---|---|---|
| 0 | Camilla Ozqcfu | B | Camilla | 1 |
| 1 | Yifan Ignjpe | A | Yifan | 1 |
| 2 | Ali Jjsojh | A | Ali | 2 |
| ... | ... | ... | ... | ... |
| 491 | Qiankang Tzjsbb | D | Qiankang | 1 |
| 492 | Kim Wxnmuh | C | Kim | 1 |
| 493 | Jazmine Sxcvft | C | Jazmine | 1 |
494 rows × 4 columns
Let's find all of the students who are in a section with someone who has the same first name as them.
roster[(roster.get('shared') >= 2)].sort_values('shared', ascending=False)
| | name | section | first | shared |
|---|---|---|---|---|
| 486 | Joseph Jpoelz | C | Joseph | 4 |
| 411 | Joseph Vdhfyo | C | Joseph | 4 |
| 329 | Joseph Jaqwhh | C | Joseph | 4 |
| ... | ... | ... | ... | ... |
| 226 | William Mjsrep | D | William | 2 |
| 154 | Jayden Lcaert | B | Jayden | 2 |
| 2 | Ali Jjsojh | A | Ali | 2 |
46 rows × 4 columns
While the DataFrame above contains the information we were looking for, it is not organized very conveniently and it is somewhat redundant.
Wouldn't it be great if we could create a DataFrame like the one below? We'll see how next time!

Find the longest first name in the class that is shared by at least two students in the same section.
Hint: You'll have to use both .assign and .apply.
with_len = roster.assign(name_len=roster.get('first').apply(len))
with_len[with_len.get('shared') >= 2].sort_values('name_len', ascending=False).get('first').iloc[0]
The .apply method allows us to call a function on every element of a Series, which usually comes from using .get on a column of a DataFrame.
Next time: more advanced DataFrame manipulations!