In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 5)
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
from IPython.display import display, IFrame
def show_def():
    src = "https://docs.google.com/presentation/d/e/2PACX-1vRKMMwGtrQOeLefj31fCtmbNOaJuKY32eBz1VwHi_5ui0AGYV3MoCjPUtQ_4SB1f9x4Iu6gbH0vFvmB/embed?start=false&loop=false&delayms=60000"
    width = 960
    height = 569
    display(IFrame(src, width, height))
```

- Lab 2 is due **tomorrow at 11:59PM**.
- Homework 2 is due **Tuesday 1/31 at 11:59PM**.
- The Midterm Project will be released in the middle of next week. Start thinking about who you may want to partner up with!
  - You don't have to work with a partner, but it is highly recommended.
  - If you do, your partner doesn't have to be from your lecture section.

- Functions.
- Applying functions to DataFrames.
- Example: Student names.

**Reminder:** Use the DSC 10 Reference Sheet. You can also use it on exams!

- We've learned how to do quite a bit in Python:
- Manipulate arrays, Series, and DataFrames.
- Perform operations on strings.
- Create visualizations.

- But so far, we've been restricted to using existing functions (e.g. `max`, `np.sqrt`, `len`) and methods (e.g. `.groupby`, `.assign`, `.plot`).

Suppose you drive to a restaurant 🥘 in LA, located exactly 100 miles away.

- For the first 50 miles, you drive at 80 miles per hour.
- For the last 50 miles, you drive at 60 miles per hour.

**Question:** What is your **average speed** throughout the journey?

$$\text{average speed} = \frac{\text{distance}}{\text{time}} = \frac{50 + 50}{\text{time}_1 + \text{time}_2} \text{ miles per hour}$$

In segment 1, when you drove 50 miles at 80 miles per hour, you drove for $\frac{50}{80}$ hours:

$$\text{speed}_1 = \frac{\text{distance}_1}{\text{time}_1}$$
$$80 \text{ miles per hour} = \frac{50 \text{ miles}}{\text{time}_1} \implies \text{time}_1 = \frac{50}{80} \text{ hours}$$

Then,

$$\text{average speed} = \frac{50 + 50}{\frac{50}{80} + \frac{50}{60}} \text{ miles per hour} $$
$$\begin{align*}\text{average speed} &= \frac{50}{50} \cdot \frac{1 + 1}{\frac{1}{80} + \frac{1}{60}} \text{ miles per hour} \\ &= \frac{2}{\frac{1}{80} + \frac{1}{60}} \text{ miles per hour} \end{align*}$$

The **harmonic mean** ($\text{HM}$) of two positive numbers, $a$ and $b$, is defined as

$$\text{HM} = \frac{2}{\frac{1}{a} + \frac{1}{b}}$$

It is often used to find the average of multiple **rates**.
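As a quick sanity check on the derivation above, the following sketch computes the average speed directly as total distance over total time and compares it with the harmonic mean of the two speeds. The variable names are illustrative and are not from the lecture.

```python
# Two 50-mile segments, driven at 80 mph and 60 mph.
distance_1, distance_2 = 50, 50
speed_1, speed_2 = 80, 60

# time = distance / speed for each segment.
time_1 = distance_1 / speed_1
time_2 = distance_2 / speed_2

# Average speed is total distance divided by total time.
direct_average = (distance_1 + distance_2) / (time_1 + time_2)

# The harmonic mean of the two speeds.
hm = 2 / (1 / speed_1 + 1 / speed_2)

# The two agree up to floating-point rounding, and both are below
# the ordinary (arithmetic) mean of 70.
print(round(direct_average, 2), round(hm, 2))
```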

Finding the harmonic mean of 80 and 60 is not hard:

In [2]:

```
2 / (1 / 80 + 1 / 60)
```

Out[2]:

68.57142857142857

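But what if we want to find the harmonic mean of other pairs of numbers? **Retyping the expression each time would require a lot of copy-pasting, which is prone to error.**

Instead, we can **define** our own "harmonic mean" **function** just once, and re-use it multiple times.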

In [3]:

```
def harmonic_mean(a, b):
    return 2 / (1 / a + 1 / b)
```

In [4]:

```
harmonic_mean(80, 60)
```

Out[4]:

68.57142857142857

In [5]:

```
harmonic_mean(20, 40)
```

Out[5]:

26.666666666666664

Note that we only had to specify how to calculate the harmonic mean once!

Functions are a way to divide our code into small subparts to prevent us from writing repetitive code. Each time we **define** our own function in Python, we will use the following pattern.

In [6]:

```
show_def()
```
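Since the slide embedded by `show_def()` does not appear in a text version of these notes, here is a rough sketch of the general pattern. The function `rectangle_area` is made up for illustration and is not part of the lecture.

```python
def rectangle_area(width, height):          # def, function name, parameters, colon
    '''Returns the area of a rectangle.'''  # optional description (docstring)
    area = width * height                   # indented body: the work the function does
    return area                             # return: the value a call evaluates to

# Calling the function with the arguments 3 and 4.
print(rectangle_area(3, 4))
```

Everything indented under the `def` line belongs to the function; the first unindented line ends the definition.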

- Functions take in inputs, known as **arguments**, do something, and produce some outputs.
- The beauty of functions is that **you don't need to know how they are implemented in order to use them!**
  - This is the premise of the idea of **abstraction** in computer science; you'll hear a lot about this in DSC 20.

In [7]:

```
harmonic_mean(20, 40)
```

Out[7]:

26.666666666666664

In [8]:

```
harmonic_mean(79, 894)
```

Out[8]:

145.17163412127442

In [9]:

```
harmonic_mean(-2, 4)
```

Out[9]:

-8.0

The function `triple` has one **parameter**, `x`.

In [10]:

```
def triple(x):
    return x * 3
```

When we call `triple` with the **argument** 5, you can pretend that there's an invisible first line in the body of `triple` that says `x = 5`.

In [11]:

```
triple(5)
```

Out[11]:

15

Note that arguments can be of any type!

In [12]:

```
triple('triton')
```

Out[12]:

'tritontritontriton'
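This works because `x * 3` means whatever `*` means for the type of `x`. A small sketch, with a few extra inputs that are not from the lecture:

```python
def triple(x):
    return x * 3

print(triple(5))         # int: ordinary multiplication
print(triple(2.5))       # float: ordinary multiplication
print(triple('triton'))  # str: the string repeated three times
print(triple([1, 2]))    # list: the list repeated three times
```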

Functions can have any number of arguments. So far, we've created a function that takes two arguments, `harmonic_mean`, and a function that takes one argument, `triple`.

`greeting` takes no arguments!

In [13]:

```
def greeting():
    return 'Hi! 👋'
```

In [14]:

```
greeting()
```

Out[14]:

'Hi! 👋'

The body of a function is not run until you use (**call**) the function.

Here, we can define `where_is_the_error` without seeing an error message.

In [15]:

```
def where_is_the_error(something):
    '''You can describe your function within triple quotes. For example, this function
    illustrates that errors don't occur until functions are executed (called).'''
    return (1 / 0) + something
```

It is only when we **call** `where_is_the_error` that Python gives us an error message.

In [16]:

```
where_is_the_error(5)
```
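The error message itself is not reproduced above; dividing by zero raises a `ZeroDivisionError`. The sketch below uses `try`/`except`, which is outside what these notes cover, purely to show *when* the error happens.

```python
def where_is_the_error(something):
    '''Defining this function runs without error, even though the body is broken.'''
    return (1 / 0) + something

print('Defined without any error.')

try:
    where_is_the_error(5)
    outcome = 'no error'
except ZeroDivisionError:
    # The body only runs, and only fails, when the function is called.
    outcome = 'ZeroDivisionError'

print(outcome)
```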

Let's create a function called `first_name` that takes in someone's full name and returns their first name. Example behavior is shown below.

```
>>> first_name('Pradeep Khosla')
'Pradeep'
```

*Hint*: Use the string method `.split`.

General strategy for writing functions:

- First, try and get the behavior to work on a single example.
- Then, encapsulate that behavior inside a function.

In [17]:

```
'Pradeep Khosla'.split(' ')[0]
```

Out[17]:

'Pradeep'

In [18]:

```
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]
```

In [19]:

```
first_name('Pradeep Khosla')
```

Out[19]:

'Pradeep'

In [20]:

```
# What if there are three names?
first_name('Chancellor Pradeep Khosla')
```

Out[20]:

'Chancellor'
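To see why, it helps to look at what `.split(' ')` produces before indexing. `first_name` always takes element 0 of that list, whatever it happens to be.

```python
def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

# .split(' ') breaks the string into a list of pieces at each space.
pieces = 'Chancellor Pradeep Khosla'.split(' ')
print(pieces)

# Position 0 is 'Chancellor', so that is what first_name returns.
print(first_name('Chancellor Pradeep Khosla'))
```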

- The `return` keyword specifies what the output of your function should be, i.e. what a call to your function will evaluate to.
- Most functions we write will use `return`, but using `return` is not required.
- Be careful: `print` and `return` work differently!

In [21]:

```
def pythagorean(a, b):
    '''Computes the hypotenuse length of a triangle with legs a and b.'''
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)
```

In [22]:

```
x = pythagorean(3, 4)
```

5.0

In [23]:

```
# No output – why?
x
```

In [24]:

```
# Errors – why?
x + 10
```
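Both cells above have the same explanation: `pythagorean` prints `c` but never returns it, so the call evaluates to Python's special value `None`. A short sketch of what is going on (the `try`/`except` is only there so the example runs to completion):

```python
def pythagorean(a, b):
    '''Prints, but does not return, the hypotenuse length.'''
    c = (a ** 2 + b ** 2) ** 0.5
    print(c)

x = pythagorean(3, 4)   # 5.0 is printed as a side effect of the call

# With no return statement, the call evaluates to None.
print(x is None)

try:
    x + 10
    outcome = 'no error'
except TypeError:
    # None cannot be added to a number.
    outcome = 'TypeError'

print(outcome)
```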

In [25]:

```
def better_pythagorean(a, b):
    '''Computes the hypotenuse length of a triangle with legs a and b, and actually returns the result.'''
    c = (a ** 2 + b ** 2) ** 0.5
    return c
```

In [26]:

```
x = better_pythagorean(3, 4)
x
```

Out[26]:

5.0

In [27]:

```
x + 10
```

Out[27]:

15.0

Once a function executes a `return` statement, it stops running.

In [28]:

```
def motivational(quote):
    return 0
    print("Here's a motivational quote:", quote)
```

In [29]:

```
motivational('Fall seven times and stand up eight.')
```

Out[29]:

0
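If the intent were to print the quote and then return a value, the `print` call would need to come before `return`. A hypothetical variant, `motivational_fixed`, which is not part of the lecture:

```python
def motivational_fixed(quote):
    # print runs because it comes before the return statement.
    print("Here's a motivational quote:", quote)
    return 0

result = motivational_fixed('Fall seven times and stand up eight.')
```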

The names you choose for a function’s parameters are only known to that function (known as **local scope**). The rest of your notebook is unaffected by parameter names.

In [30]:

```
def what_is_awesome(s):
    return s + ' is awesome!'
```

In [31]:

```
what_is_awesome('data science')
```

Out[31]:

'data science is awesome!'

In [32]:

```
s
```

In [33]:

```
s = 'DSC 10'
```

In [34]:

```
what_is_awesome('data science')
```

Out[34]:

'data science is awesome!'
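The cells above can be condensed into one sketch. Evaluating `s` on its own before it has been defined in the notebook raises a `NameError`, since the parameter `s` only exists inside `what_is_awesome`; the `try`/`except` below is only there so the example keeps running.

```python
def what_is_awesome(s):
    return s + ' is awesome!'

what_is_awesome('data science')

try:
    s  # the parameter s does not exist outside the function
    status = 'defined'
except NameError:
    status = 'not defined'
print(status)

# Defining a separate, notebook-level s does not change what the function does.
s = 'DSC 10'
print(what_is_awesome('data science'))
print(s)
```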

The DataFrame `roster` contains the names and lecture sections of all students enrolled in DSC 10 this quarter. The first names are real, while the last names have been anonymized for privacy.

In [35]:

```
roster = bpd.read_csv('data/roster-anon.csv')
roster
```

Out[35]:

| | name | section |
|---|---|---|
| 0 | Anya Iatypd | 10AM |
| 1 | Nathaniel Kcyrfu | 11AM |
| 2 | Jae Oadpmw | 10AM |
| ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM |
| 348 | Alex Lrmwwt | 11AM |
| 349 | Giovanni Ibkdsu | 11AM |

350 rows × 2 columns

What is the most common first name among DSC 10 students? (Any guesses?)

In [36]:

```
roster
```

Out[36]:

| | name | section |
|---|---|---|
| 0 | Anya Iatypd | 10AM |
| 1 | Nathaniel Kcyrfu | 11AM |
| 2 | Jae Oadpmw | 10AM |
| ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM |
| 348 | Alex Lrmwwt | 11AM |
| 349 | Giovanni Ibkdsu | 11AM |

350 rows × 2 columns

**Problem**: We can't answer that right now, since we don't have a column with first names. If we did, we could group by it.

**Solution**: Use our function that extracts first names on *every* element of the `'name'` column.

Somehow, we need to call `first_name` on every student's `'name'`.

In [37]:

```
roster
```

Out[37]:

| | name | section |
|---|---|---|
| 0 | Anya Iatypd | 10AM |
| 1 | Nathaniel Kcyrfu | 11AM |
| 2 | Jae Oadpmw | 10AM |
| ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM |
| 348 | Alex Lrmwwt | 11AM |
| 349 | Giovanni Ibkdsu | 11AM |

350 rows × 2 columns

In [38]:

```
roster.get('name').iloc[0]
```

Out[38]:

'Anya Iatypd'

In [39]:

```
first_name(roster.get('name').iloc[0])
```

Out[39]:

'Anya'

In [40]:

```
first_name(roster.get('name').iloc[1])
```

Out[40]:

'Nathaniel'

Ideally, there's a better solution than doing this hundreds of times...

- To **apply** a function to every element of column `column_name` in DataFrame `df`, use

`df.get(column_name).apply(function_name)`

- The `.apply` method is a **Series** method.
  - **Important:** We use `.apply` on Series, **not** DataFrames.
  - The output of `.apply` is also a Series.
- Pass *just the name* of the function; don't call it!
  - Good ✅: `.apply(first_name)`.
  - Bad ❌: `.apply(first_name())`.
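A minimal sketch of this pattern on a tiny made-up table. It uses plain `pandas` rather than `babypandas`, and the names are invented, so treat it as an illustration of the idea rather than the lecture's data.

```python
import pandas as pd

def first_name(full_name):
    '''Returns the first name given a full name.'''
    return full_name.split(' ')[0]

# A tiny made-up DataFrame standing in for roster.
toy = pd.DataFrame({'name': ['Ada Lovelace', 'Alan Turing', 'Grace Hopper']})

# Pass the function itself; .apply calls it once per element.
firsts = toy.get('name').apply(first_name)
print(list(firsts))

# Conceptually, this is the same as calling first_name on each element by hand.
by_hand = [first_name(n) for n in toy.get('name')]
print(by_hand)

# Writing first_name() instead tries to call the function right away,
# with no argument, which fails before .apply is ever reached.
try:
    toy.get('name').apply(first_name())
    outcome = 'no error'
except TypeError:
    outcome = 'TypeError'
print(outcome)
```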

In [41]:

```
roster.get('name').apply(first_name)
```

Out[41]:

0 Anya
1 Nathaniel
2 Jae
...
347 Danny
348 Alex
349 Giovanni
Name: name, Length: 350, dtype: object

In [42]:

```
roster = roster.assign(
    first=roster.get('name').apply(first_name)
)
roster
```

Out[42]:

| | name | section | first |
|---|---|---|---|
| 0 | Anya Iatypd | 10AM | Anya |
| 1 | Nathaniel Kcyrfu | 11AM | Nathaniel |
| 2 | Jae Oadpmw | 10AM | Jae |
| ... | ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM | Danny |
| 348 | Alex Lrmwwt | 11AM | Alex |
| 349 | Giovanni Ibkdsu | 11AM | Giovanni |

350 rows × 3 columns

In [43]:

```
name_counts = roster.groupby('first').count().sort_values('name', ascending=False).get(['name'])
name_counts
```

Out[43]:

| first | name |
|---|---|
| Ryan | 6 |
| Jason | 3 |
| Ethan | 3 |
| ... | ... |
| Hannah | 1 |
| Gwendal | 1 |
| Zoe | 1 |

315 rows × 1 columns

Below:

- Create a **bar chart** showing the number of students with each first name, but only include first names shared by at least two students.
- Determine the **proportion** of students in DSC 10 who have a first name that is shared by at least two students.

In [44]:

```
...
```

Out[44]:

Ellipsis

In [45]:

```
...
```

Out[45]:

Ellipsis

`.apply` works with built-in functions, too! For instance, to find the length of each name, we might use the `len` function:

In [46]:

```
roster
```

Out[46]:

| | name | section | first |
|---|---|---|---|
| 0 | Anya Iatypd | 10AM | Anya |
| 1 | Nathaniel Kcyrfu | 11AM | Nathaniel |
| 2 | Jae Oadpmw | 10AM | Jae |
| ... | ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM | Danny |
| 348 | Alex Lrmwwt | 11AM | Alex |
| 349 | Giovanni Ibkdsu | 11AM | Giovanni |

350 rows × 3 columns

In [47]:

```
roster.get('first').apply(len)
```

Out[47]:

0 4
1 9
2 3
..
347 5
348 4
349 8
Name: first, Length: 350, dtype: int64

We were able to apply `first_name` to the `'name'` column because it's a Series. The `.apply` method doesn't work on the index, because the index is not a Series.

In [48]:

```
indexed_by_name = roster.set_index('name')
indexed_by_name
```

Out[48]:

| name | section | first |
|---|---|---|
| Anya Iatypd | 10AM | Anya |
| Nathaniel Kcyrfu | 11AM | Nathaniel |
| Jae Oadpmw | 10AM | Jae |
| ... | ... | ... |
| Danny Zsoyxb | 10AM | Danny |
| Alex Lrmwwt | 11AM | Alex |
| Giovanni Ibkdsu | 11AM | Giovanni |

350 rows × 2 columns

In [49]:

```
indexed_by_name.index.apply(first_name)
```
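The call above fails because an index has no `.apply` method. A small sketch with made-up data and plain `pandas` (an assumption; the lecture uses `babypandas`) makes the difference visible without triggering the error:

```python
import pandas as pd

# A tiny made-up table, indexed by name like indexed_by_name.
toy = pd.DataFrame({
    'name': ['Ada Lovelace', 'Alan Turing'],
    'section': ['10AM', '11AM'],
}).set_index('name')

# A column is a Series, which has .apply; the index is not a Series.
print(hasattr(toy.get('section'), 'apply'))
print(hasattr(toy.index, 'apply'))
```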

Use `.reset_index()` to turn the index of a DataFrame into a column, and to reset the index back to the default of 0, 1, 2, 3, and so on.

In [50]:

```
indexed_by_name.reset_index()
```

Out[50]:

| | name | section | first |
|---|---|---|---|
| 0 | Anya Iatypd | 10AM | Anya |
| 1 | Nathaniel Kcyrfu | 11AM | Nathaniel |
| 2 | Jae Oadpmw | 10AM | Jae |
| ... | ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM | Danny |
| 348 | Alex Lrmwwt | 11AM | Alex |
| 349 | Giovanni Ibkdsu | 11AM | Giovanni |

350 rows × 3 columns

In [51]:

```
indexed_by_name.reset_index().get('name').apply(first_name)
```

Out[51]:

0 Anya
1 Nathaniel
2 Jae
...
347 Danny
348 Alex
349 Giovanni
Name: name, Length: 350, dtype: object

- Suppose you're one of the $\approx$18\% of students in DSC 10 who has a first name that is shared with at least one other student.
- Let's try and determine whether someone **in your lecture section** shares the same first name as you.

In [52]:

```
roster
```

Out[52]:

| | name | section | first |
|---|---|---|---|
| 0 | Anya Iatypd | 10AM | Anya |
| 1 | Nathaniel Kcyrfu | 11AM | Nathaniel |
| 2 | Jae Oadpmw | 10AM | Jae |
| ... | ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM | Danny |
| 348 | Alex Lrmwwt | 11AM | Alex |
| 349 | Giovanni Ibkdsu | 11AM | Giovanni |

350 rows × 3 columns

For example, maybe `'Giovanni Ibkdsu'` wants to see if there's another `'Giovanni'` in their section.

Strategy:

- What section is `'Giovanni Ibkdsu'` in?
- How many people are in that section and named `'Giovanni'`?

In [53]:

```
what_section = roster[roster.get('name') == 'Giovanni Ibkdsu'].get('section').iloc[0]
what_section
```

Out[53]:

'11AM'

In [54]:

```
how_many = roster[(roster.get('section') == what_section) & (roster.get('first') == 'Giovanni')].shape[0]
how_many
```

Out[54]:

2
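The second query keeps only the rows where two conditions hold at once, combined with `&`. Naming each condition first can make that easier to read. A sketch of the same two steps on a small made-up table (plain `pandas`, invented names):

```python
import pandas as pd

# A made-up stand-in for roster.
toy = pd.DataFrame({
    'name': ['Sam Aaa', 'Sam Bbb', 'Sam Ccc', 'Kim Ddd'],
    'section': ['11AM', '11AM', '9AM', '11AM'],
    'first': ['Sam', 'Sam', 'Sam', 'Kim'],
})

# Step 1: look up one student's section.
section = toy[toy.get('name') == 'Sam Aaa'].get('section').iloc[0]

# Step 2: keep rows where both conditions hold, then count them.
in_section = toy.get('section') == section
same_first = toy.get('first') == 'Sam'
how_many = toy[in_section & same_first].shape[0]

print(section, how_many)
```

When the conditions are written inline instead, each one needs its own parentheses, as in the cell above.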

Let's create a function named `shared_first_and_section`. It will take in the **full name** of a student and return **the number** of students with the same first name and section as that student (including them).

*Note*: This is the first function we're writing that involves using a DataFrame within the function – this is fine!

In [55]:

```
def shared_first_and_section(name):
    # First, find the row corresponding to that full name in roster.
    # We're assuming that full names are unique.
    row = roster[roster.get('name') == name]
    # Then, get that student's first name and section.
    first = row.get('first').iloc[0]
    section = row.get('section').iloc[0]
    # Now, find all the students with the same first name and section.
    shared_info = roster[(roster.get('first') == first) & (roster.get('section') == section)]
    # Return the number of such students.
    return shared_info.shape[0]
```

In [56]:

```
shared_first_and_section('Giovanni Ibkdsu')
```

Out[56]:

2

In [57]:

```
shared_first_and_section('Danny Zsoyxb')
```

Out[57]:

1

Now, let's add a column to `roster` that contains the values returned by `shared_first_and_section`.

In [58]:

```
roster = roster.assign(shared=roster.get('name').apply(shared_first_and_section))
roster
```

Out[58]:

| | name | section | first | shared |
|---|---|---|---|---|
| 0 | Anya Iatypd | 10AM | Anya | 1 |
| 1 | Nathaniel Kcyrfu | 11AM | Nathaniel | 1 |
| 2 | Jae Oadpmw | 10AM | Jae | 1 |
| ... | ... | ... | ... | ... |
| 347 | Danny Zsoyxb | 10AM | Danny | 1 |
| 348 | Alex Lrmwwt | 11AM | Alex | 1 |
| 349 | Giovanni Ibkdsu | 11AM | Giovanni | 2 |

350 rows × 4 columns

In [59]:

```
roster[(roster.get('shared') > 1)].sort_values('shared', ascending=False)
```

Out[59]:

| | name | section | first | shared |
|---|---|---|---|---|
| 300 | Ryan Siubvw | 9AM | Ryan | 3 |
| 140 | Ryan Pxydjz | 9AM | Ryan | 3 |
| 167 | Ryan Nwivbq | 9AM | Ryan | 3 |
| ... | ... | ... | ... | ... |
| 37 | Jasmine Nztgqf | 9AM | Jasmine | 2 |
| 35 | Ruby Lopqun | 9AM | Ruby | 2 |
| 349 | Giovanni Ibkdsu | 11AM | Giovanni | 2 |

25 rows × 4 columns

We can narrow this down to a particular lecture section if we'd like.

In [60]:

```
one_section_only = roster[(roster.get('shared') > 1) &
                          (roster.get('section') == '9AM')].sort_values('shared', ascending=False)
one_section_only
```

Out[60]:

| | name | section | first | shared |
|---|---|---|---|---|
| 140 | Ryan Pxydjz | 9AM | Ryan | 3 |
| 167 | Ryan Nwivbq | 9AM | Ryan | 3 |
| 300 | Ryan Siubvw | 9AM | Ryan | 3 |
| ... | ... | ... | ... | ... |
| 212 | Jonathan Jgchdp | 9AM | Jonathan | 2 |
| 316 | Bruce Rinuux | 9AM | Bruce | 2 |
| 339 | Jonathan Lspjmb | 9AM | Jonathan | 2 |

13 rows × 4 columns

In [61]:

```
one_section_only.get('first').unique()
```

Out[61]:

array(['Ryan', 'Bruce', 'Eddie', 'Ruby', 'Jasmine', 'Jonathan'], dtype=object)

While the DataFrames on the previous slide contain the info we were looking for, they're not organized very conveniently. For instance, there are three rows containing the fact that there are 3 `'Ryan'`s in the 9AM lecture section.

Wouldn't it be great if we could create a DataFrame like the one below? We'll see how on Monday!

| | section | first | count |
|---|---|---|---|
| 0 | 9AM | Ryan | 3 |
| 1 | 10AM | Ryan | 2 |
| 2 | 9AM | Ruby | 2 |
| 3 | 10AM | Jason | 2 |
| 4 | 11AM | Giovanni | 2 |

Find the shortest first name in the class that is shared by at least two students in the same section.

*Hint*: You'll have to use both `.assign` and `.apply`.

In [62]:

```
...
```

Out[62]:

Ellipsis

- Functions are a way to divide our code into small subparts to prevent us from writing repetitive code.
- The `.apply` method allows us to call a function on every single element of a Series, which usually comes from `.get`ting a column of a DataFrame.

More advanced DataFrame manipulations!