In [1]:

```
# Set up packages for lecture. Don't worry about understanding this code, but
# make sure to run it if you're following along.
import numpy as np
import babypandas as bpd
import pandas as pd
from matplotlib_inline.backend_inline import set_matplotlib_formats
import matplotlib.pyplot as plt
set_matplotlib_formats("svg")
plt.style.use('ggplot')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option("display.max_rows", 7)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)
from IPython.display import display, IFrame
def binning_animation():
    src="https://docs.google.com/presentation/d/e/2PACX-1vTnRGwEnKP2V-Z82DlxW1b1nMb2F0zWyrXIzFSpQx_8Wd3MFaf56y2_u3JrLwZ5SjWmfapL5BJLfsDG/embed?start=false&loop=false&delayms=60000"
    width=900
    height=307
    display(IFrame(src, width, height))
```

- Lab 2 is due **Saturday 1/28 at 11:59PM**.
- Homework 2 is due **Tuesday 1/21 at 11:59PM**.
- Come to office hours for help! See the calendar for directions.
- Optional extra videos from past quarters to supplement the last lecture:
    - Using `str.contains()`.
    - How line plots work with sorting.

- Distributions.
- Density histograms.
- Overlaid plots.

The type of visualization we create depends on the kinds of variables we're visualizing.

**Scatter plot**: numerical vs. numerical. Example: weight vs. height.

**Line plot**: sequential numerical (time) vs. numerical. Example: height vs. time.

**Bar chart**: categorical vs. numerical. Example: heights of different family members.

**Histogram**: distribution of numerical.

**Note:** We may interchange the words "plot", "chart", and "graph"; they all mean the same thing.

- The distribution of a variable consists of all values of the variable that occur in the data, along with their frequencies.
- Distributions help you understand: *How often does a variable take on a certain value?*
- Both categorical and numerical variables have distributions.
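
As a minimal sketch of this definition, the distribution of a small categorical variable can be tabulated in plain Python. The `majors` list is made-up data for illustration only; it does not come from the lecture's datasets.

```python
from collections import Counter

# Hypothetical data: one value of the variable per individual.
majors = ['Data Science', 'Math', 'Data Science', 'Biology', 'Math', 'Data Science']

# The distribution pairs each value that occurs with its frequency.
distribution = Counter(majors)
print(distribution)

# Dividing by the number of individuals gives proportions instead of counts.
proportions = {value: count / len(majors) for value, count in distribution.items()}
print(proportions)
```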

The distribution of a categorical variable can be displayed as a table or bar chart, among other ways! For example, let's look at the colleges of students enrolled in DSC 10 this quarter.

In [2]:

```
colleges = bpd.DataFrame().assign(College=['Seventh', 'Sixth', 'Roosevelt', 'Warren', 'Marshall', 'Muir', 'Revelle'],
Students=[45, 81, 46, 41, 50, 42, 43])
colleges
```

Out[2]:

| | College | Students |
|---|---|---|
| 0 | Seventh | 45 |
| 1 | Sixth | 81 |
| 2 | Roosevelt | 46 |
| 3 | Warren | 41 |
| 4 | Marshall | 50 |
| 5 | Muir | 42 |
| 6 | Revelle | 43 |

In [3]:

```
colleges.plot(kind='barh', x='College', y='Students');
```

In [4]:

```
colleges.plot(kind='bar', x='College', y='Students');
```

The distribution of a numerical variable cannot always be accurately depicted with a bar chart. For example, let's look at the number of streams for each of the top 200 songs on Spotify. 🎵

In [5]:

```
charts = bpd.read_csv('data/regional-us-daily-2023-01-21.csv')
charts = (charts.set_index('rank')
.assign(million_streams = np.round(charts.get('streams')/1000000, 2))
.get(['track_name', 'artist_names', 'streams', 'million_streams'])
)
charts
```

Out[5]:

rank | track_name | artist_names | streams | million_streams |
---|---|---|---|---|
1 | Flowers | Miley Cyrus | 3356361 | 3.36 |
2 | Kill Bill | SZA | 2479445 | 2.48 |
3 | Creepin' (with The Weeknd & 21 Savage) | Metro Boomin, The Weeknd, 21 Savage | 1337320 | 1.34 |
... | ... | ... | ... | ... |
198 | Major Distribution | Drake, 21 Savage | 266986 | 0.27 |
199 | Sun to Me | Zach Bryan | 266968 | 0.27 |
200 | The Real Slim Shady | Eminem | 266698 | 0.27 |

200 rows × 4 columns

To see the distribution of the number of streams, we need to group by the `'million_streams'` column.

In [6]:

```
stream_counts = charts.groupby('million_streams').count()
stream_counts = stream_counts.assign(Count=stream_counts.get('track_name')).drop(columns=['track_name', 'artist_names', 'streams'])
stream_counts
```

Out[6]:

million_streams | Count |
---|---|
0.27 | 17 |
0.28 | 20 |
0.29 | 19 |
... | ... |
1.34 | 1 |
2.48 | 1 |
3.36 | 1 |

51 rows × 1 columns

In [7]:

```
stream_counts.plot(kind='bar', y='Count', figsize=(15,5));
```

- This obscures the fact that the top two songs are outliers, with **many more streams** than the other songs.
- The horizontal axis should be numerical (like a number line), not categorical. There should be more space between certain bars than others.

Instead of a bar chart, we'll visualize the distribution of a numerical variable with a **density histogram**. Let's see what a density histogram for `'million_streams'` looks like. What do you notice about this visualization?

In [8]:

```
# Ignore the code for right now.
charts.plot(kind='hist', y='million_streams', density=True, bins=np.arange(0, 4, 0.5), ec='w');
```

- Binning is the act of counting the number of numerical values that fall within ranges defined by two endpoints. These ranges are called “bins”.
- A value falls in a bin if it is **greater than or equal to the left** endpoint and **less than the right** endpoint.
    - [a, b): a is included, b is not.
- The width of a bin is its right endpoint minus its left endpoint.
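
The binning rule can be sketched in a few lines of plain Python. The helper `count_in_bins` and the sample values are hypothetical and only illustrate the [a, b) rule; this is not how the plotting library implements binning.

```python
def count_in_bins(values, edges):
    """Count how many values fall in each bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            # Left endpoint included, right endpoint excluded.
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

values = [0.27, 0.5, 0.99, 1.0, 1.34]   # made-up values
edges = [0, 0.5, 1, 1.5]                # bins [0, 0.5), [0.5, 1), [1, 1.5)
counts = count_in_bins(values, edges)
print(counts)  # [1, 2, 2]
```

Note that 0.5 lands in the second bin and 1.0 in the third, because a value equal to a bin's right endpoint belongs to the next bin.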

In [9]:

```
binning_animation()
```

- **Density histograms** (not bar charts!) visualize the distribution of a single numerical variable by placing numbers into bins.
- To create one from a DataFrame `df`, use `df.plot(kind='hist', y=column_name, density=True)`.
- Optional but recommended: Use `ec='w'` to see where bins start and end more clearly.

- By default, Python will bin your data into 10 equally sized bins.
- You can specify another number of equally sized bins by setting the optional argument `bins` equal to some other integer value.
- You can also specify custom bin start and endpoints by setting `bins` equal to a sequence of bin endpoints.
    - Can be a `list` or `numpy` array.
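
The two forms of `bins` can be compared without drawing anything by using `np.histogram`, which accepts the same kinds of `bins` values. This is a sketch on made-up numbers; treating `np.histogram` as a stand-in for the binning done by `df.plot(kind='hist', ...)` is an assumption.

```python
import numpy as np

# Hypothetical values, in millions of streams.
values = np.array([0.27, 0.28, 0.31, 0.45, 0.62, 1.34, 2.48, 3.36])

# An integer asks for that many equal-width bins spanning the smallest to the largest value.
counts_int, edges_int = np.histogram(values, bins=4)
print(edges_int)   # 5 endpoints define 4 bins

# A sequence gives the bin endpoints explicitly.
counts_seq, edges_seq = np.histogram(values, bins=[0, 1, 2, 3, 4])
print(counts_seq)  # [5 1 1 1]
```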

In [10]:

```
# There are 10 bins by default, some of which are empty.
charts.plot(kind='hist', y='million_streams', density=True, ec='w');
```

In [11]:

```
charts.plot(kind='hist', y='million_streams', density=True, bins=20, ec='w');
```

In [12]:

```
charts.plot(kind='hist', y='million_streams', density=True,
bins=[0, 1, 2, 3, 4, 5],
ec='w');
```

In the three histograms above, what is different and what is the same?

- The general shape of all three histograms is the same, regardless of the bins. This shape is called *right-skewed*.
- More bins gives a finer, more granular picture of the distribution of the variable `'million_streams'`.
- The $y$-axis values seem to change a lot when we change the bins. Hang onto that thought; we'll see why shortly.

- In a histogram, only the last bin is inclusive of the right endpoint!
- The bins you specify need not include all data values. Data values not in any bin won't be shown in the histogram.
- For equally sized bins, use `np.arange`.
    - Be **very careful** with the endpoints. Example: `bins=np.arange(4)` creates the bins [0, 1), [1, 2), [2, 3].
- Bins need not be equally sized.
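
The endpoint pitfall is easy to see numerically. The sketch below uses made-up values and `np.histogram` (assumed here to bin the same way the plot does) to show that values beyond the last endpoint are silently left out.

```python
import numpy as np

edges = np.arange(4)
print(edges)  # [0 1 2 3]: three bins, and the last one ends at 3, not 4

values = np.array([0.3, 2.9, 3.0, 3.36])  # hypothetical values
counts, _ = np.histogram(values, bins=edges)
print(counts)  # [1 0 2]: 3.0 is in the last bin [2, 3], but 3.36 is dropped

# Extending the endpoints to 4 keeps every value.
counts_all, _ = np.histogram(values, bins=np.arange(5))
print(counts_all)  # [1 0 1 2]
```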

In [13]:

```
charts.plot(kind='hist', y='million_streams', density=True,
bins=np.arange(4),
ec='w');
```

In [14]:

```
charts.plot(kind='hist', y='million_streams', density=True,
bins=[0, 0.5, 1, 1.5, 2.5, 4],
ec='w');
```

- In a density histogram, the $y$-axis can be hard to interpret, but it's designed to give the histogram a very nice property: $$\textbf{The bars of a density histogram }$$ $$\textbf{have a combined total area of 1.}$$
- This means the area of a bar is equal to the proportion of all data points that fall into that bin.
- Proportions and percentages represent the same thing.
- A proportion is a decimal between 0 and 1; a percentage is between 0\% and 100\%.
- The proportion 0.34 means 34\%.
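
The total-area property can be checked numerically. This is a sketch on made-up values, using `np.histogram(..., density=True)` to compute bar heights; that the plotted heights are computed the same way is an assumption.

```python
import numpy as np

# Hypothetical values, in millions of streams.
values = np.array([0.27, 0.28, 0.31, 0.45, 0.62, 1.34, 2.48, 3.36])
edges = np.array([0, 0.5, 1, 1.5, 2.5, 4])

heights, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)          # right endpoint minus left endpoint

areas = heights * widths         # area of each bar = proportion of values in that bin
print(areas)
print(areas.sum())               # total area of all bars
```

Here 4 of the 8 values fall in [0, 0.5), so the first bar has area 0.5 even though its height is 1.0.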

In [15]:

```
charts.plot(kind='hist', y='million_streams', density=True,
bins=[0, 0.5, 1, 1.5, 2.5, 4],
ec='w');
```

Based on this histogram, what proportion of the top 200 songs had less than half a million streams?

- The height of the [0, 0.5) bar looks to be just shy of 1.6.
- The width of the bin is 0.5 - 0 = 0.5.
- Therefore, using the formula for the area of a rectangle, $$\text{Area} = \text{Height} \times \text{Width} \approx 1.6 \times 0.5 = 0.8$$
- Since areas represent proportions, this means that the proportion of top 200 songs with less than 0.5 million streams was roughly 0.8 (or 80\%).

In [16]:

```
first_bin = charts[charts.get('million_streams') < 0.5].shape[0]
first_bin
```

Out[16]:

159

In [17]:

```
first_bin/200
```

Out[17]:

0.795

This closely matches our estimate of 0.8. (The match isn't exact, since we only estimated the bar's height by eye.)

Since a bar of a histogram is a rectangle, its area is given by

$$\text{Area} = \text{Height} \times \text{Width}$$

That means

$$\text{Height} = \frac{\text{Area}}{\text{Width}} = \frac{\text{Proportion (or Percentage)}}{\text{Width}}$$

This implies that height measures a proportion per unit on the $x$-axis, that is, a *density*, which is why we call it a density histogram.
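
As a quick check of this formula, the numbers from the example above (159 of the 200 songs fall in the [0, 0.5) bin) can be plugged in directly. The variable names are only illustrative.

```python
count_in_bin = 159    # songs with fewer than 0.5 million streams, computed earlier
total = 200           # number of songs in the chart
width = 0.5 - 0       # right endpoint minus left endpoint

proportion = count_in_bin / total   # area of the bar
height = proportion / width         # density: proportion per million streams
print(proportion, height)           # 0.795 1.59
```

The resulting height of 1.59 agrees with the bar looking "just shy of 1.6".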

In [18]:

```
charts.plot(kind='hist', y='million_streams', density=True,
bins=[0, 0.5, 1, 1.5, 2.5, 4],
ec='w');
```

The $y$-axis units here are "proportion per million streams", since the $x$-axis represents millions of streams.

- Unfortunately, the $y$-axis label on the histogram always displays as "Frequency". **This is wrong!**
    - Can fix with `plt.ylabel(...)` but we usually don't.

Suppose we created a density histogram of people's shoe sizes. 👟 Below are the bins we chose along with their heights.

Bin | Height of Bar |
---|---|
[3, 7) | 0.05 |
[7, 10) | 0.1 |
[10, 12) | 0.15 |
[12, 16] | $X$ |

What should the value of $X$ be so that this is a valid histogram?

A. 0.02 B. 0.05 C. 0.2 D. 0.5 E. 0.7
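
One way to check an answer to this question is to use the fact that the bar areas must add up to 1. The sketch below only restates the table in code; the variable names are illustrative.

```python
# Bins from the table above; the height of the last bar is the unknown X.
bins = [(3, 7), (7, 10), (10, 12), (12, 16)]
known_heights = [0.05, 0.1, 0.15]

# Area already accounted for by the first three bars (height times width).
known_area = sum(h * (right - left) for h, (left, right) in zip(known_heights, bins))

# The last bar must supply whatever area remains to reach a total of 1.
last_left, last_right = bins[-1]
X = (1 - known_area) / (last_right - last_left)
print(round(known_area, 2), round(X, 2))
```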

Bar chart | Histogram |
---|---|
Shows the distribution of a categorical variable | Shows the distribution of a numerical variable |
1 categorical axis, 1 numerical axis | 2 numerical axes |
Bars have arbitrary, but equal, widths and spacing | Horizontal axis is numerical and to scale |
Lengths of bars are proportional to the numerical quantity of interest | Height measures density; areas are proportional to the proportion (percent) of individuals |

In this class, **"histogram" will always mean a "density histogram".** We will **only** use density histograms.

*Note:* It's possible to create what's called a *frequency histogram* where the $y$-axis simply represents a count of the number of values in each bin. While easier to interpret, frequency histograms don't have the important property that the total area is 1, so they can't be connected to probability in the same way that density histograms can. That makes them far less useful for data scientists.

The data for both cities comes from macrotrends.net.

In [19]:

```
population = bpd.read_csv('data/sd-sj-2022.csv').set_index('date')
population
```

Out[19]:

date | Pop SD | Growth SD | Pop SJ | Growth SJ |
---|---|---|---|---|
1970 | 1209000 | 3.69 | 1009000 | 4.34 |
1971 | 1252000 | 3.56 | 1027000 | 1.78 |
1972 | 1297000 | 3.59 | 1046000 | 1.85 |
... | ... | ... | ... | ... |
2021 | 3272000 | 0.65 | 1799000 | 0.45 |
2022 | 3295000 | 0.70 | 1809000 | 0.56 |
2023 | 3319000 | 0.73 | 1821000 | 0.66 |

54 rows × 4 columns

In [20]:

```
population.plot(kind='line', y='Growth SD',
title='San Diego population growth rate', legend=False);
```

In [21]:

```
population.plot(kind='line', y='Growth SJ',
title='San Jose population growth rate', legend=False);
```