# Run this cell to set up packages for lecture.
from lec04_imports import *

states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states = states.set_index('State')
states
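
The `Density` column is each state's population divided by its land area. As a plain-Python sketch of that row-by-row division, here are three rows copied from the `states` table (the dictionary and variable names are illustrative, not part of the lecture code):

```python
# Population and land area for three states, copied from the states table.
population = {'Alabama': 5024279, 'Alaska': 733391, 'Arizona': 7151502}
land_area = {'Alabama': 50645, 'Alaska': 570641, 'Arizona': 113594}

# Density is people per unit of land area, computed state by state.
density = {state: population[state] / land_area[state] for state in population}

for state, value in density.items():
    print(state, round(value, 2))
```

The rounded values match the `Density` column shown for these three states.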

states.groupby('Region').count()

states.get('Capital City')

State
Alabama          Montgomery
Alaska               Juneau
Arizona             Phoenix
                    ...    
West Virginia    Charleston
Wisconsin           Madison
Wyoming            Cheyenne
Name: Capital City, Length: 50, dtype: object

states.get(['Capital City', 'Party'])

states.get(['Capital City'])

states_by_region = states.groupby('Region').count()
states_by_region

bpd.read_csv('data/lw_counts.csv').plot(x='Chapter');

exo = bpd.read_csv('data/exoplanets.csv').set_index('Name')
exo

exo.plot(kind='scatter', x='Distance', y='Magnitude');

# General pattern (placeholder names, not runnable as written).
df.plot(
    kind='scatter',
    x=x_column_for_horizontal,
    y=y_column_for_vertical
)

exo[exo.get('Distance') < 10000].plot(kind='scatter', x='Distance', y='Magnitude');

# There were multiple exoplanets discovered each year.
# What operation can we apply to this DataFrame so that there is one row per year?
exo

exo.groupby('Year').mean()
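
Grouping by `'Year'` and taking the mean collapses all rows that share a year into a single row. A rough plain-Python sketch of that idea, using only the six `(Year, Magnitude)` pairs visible in the `exo` preview (the helper name `mean_by_group` is made up for illustration):

```python
from collections import defaultdict

def mean_by_group(pairs):
    """Return {group: mean of the values in that group}, with groups sorted."""
    groups = defaultdict(list)
    for group, value in pairs:
        groups[group].append(value)
    return {group: sum(values) / len(values)
            for group, values in sorted(groups.items())}

# (Year, Magnitude) pairs from the six rows shown in the exo preview.
rows = [(2007, 4.72), (2009, 5.01), (2008, 5.23),
        (2017, 12.07), (2017, 12.07), (2017, 12.07)]

yearly = mean_by_group(rows)
print(yearly)
```

Because this uses six rows rather than the full dataset, the means will not match the real grouped table. The point is only that three rows for 2017 become one.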

exo.groupby('Year').mean().plot(kind='line', y='Magnitude');

# General pattern (placeholder names, not runnable as written).
df.plot(
    kind='line',
    x=x_column_for_horizontal,
    y=y_column_for_vertical
)

YouTubeVideo('glzZ04D1kDg')

types = exo.groupby('Type').mean()
types

types.plot(kind='barh', y='Radius');

types.plot(kind='barh', y='Mass');

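# General pattern (placeholder names, not runnable as written).
# As in the cells above, omitting x uses the index as the categories.
df.plot(
    kind='barh',
    x=categorical_column_name,
    y=numerical_column_name
)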

# Count how many exoplanets are discovered by each detection method.
popular_detection = exo.groupby('Detection').count()
popular_detection

# Give columns more meaningful names and eliminate redundancy.
popular_detection = (popular_detection.assign(Count=popular_detection.get('Distance'))
                                      .get(['Count'])
                                      .sort_values(by='Count', ascending=False)
                    )
popular_detection
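
The cell above counts rows per detection method, keeps a single `Count` column, and sorts from most to least common. The same idea can be sketched with `collections.Counter`. The short list of methods below is invented for illustration and does not reflect the real counts:

```python
from collections import Counter

# Hypothetical detection methods, one entry per exoplanet.
detections = ['Transit', 'Radial Velocity', 'Transit', 'Direct Imaging',
              'Transit', 'Radial Velocity']

# Counter tallies each distinct value; most_common() sorts by count, descending.
counts = Counter(detections).most_common()
print(counts)
```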

# Notice that the bars appear in the opposite order relative to the DataFrame.
popular_detection.plot(kind='barh', y='Count');

# Change "barh" to "bar" to get a vertical bar chart. 
# These are harder to read, but the bars do appear in the same order as the DataFrame.
popular_detection.plot(kind='bar', y='Count');

types.get(['Magnitude', 'Radius']).plot(kind='barh');

types

types.plot(kind='barh');

types

types.get(['Magnitude', 'Radius'])

types.get(['Magnitude', 'Radius']).plot(kind='barh');

# Remember, when we group and use .count(), the column names aren't meaningful.
type_counts = exo.groupby('Type').count()
type_counts

# As a result, we could have set y='Magnitude', for example, and gotten the same plot.
type_counts.plot(kind='barh', y='Distance', 
                 legend=False, title='Distribution of Exoplanet Types');

exo.groupby('Type').mean().get('Radius')

Type
Gas Giant       12.74
Neptune-like     3.11
Super Earth      1.58
Terrestrial      0.85
Name: Radius, dtype: float64

terr = exo[exo.get('Type') == 'Terrestrial']
terr

terr.get('Radius').describe()

count    193.00
mean       0.85
std        0.26
          ...  
50%        0.86
75%        0.92
max        3.13
Name: Radius, Length: 8, dtype: float64

terr_radius = terr.groupby('Radius').count()
terr_radius = (terr_radius
                 .assign(Count=terr_radius.get('Distance'))
                 .get(['Count'])
              )
terr_radius

# Ignore the code for now; it is explained below.
terr.plot(kind='hist', y='Radius', density=True, bins=np.arange(0, 3.5, 0.25), ec='w');

# There are 7 terrestrial exoplanets with a radius of exactly 1.0,
# but the height of the bar starting at 1.0 is not 7!
terr[terr.get('Radius') == 1]
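
A bar's height is not a count because a density histogram first assigns each value to a bin and then scales by the bin's width. The sketch below illustrates only the assignment step, under the usual convention that a bin includes its left endpoint and excludes its right one (the last bin also includes its right endpoint). The function name and sample radii are hypothetical:

```python
from bisect import bisect_right

def bin_index(value, edges):
    """Return the index of the bin [edges[i], edges[i+1]) containing value, or None."""
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return len(edges) - 2  # the last bin is closed on the right
    return bisect_right(edges, value) - 1

edges = [0, 0.25, 0.5, 0.75, 1.0, 1.25]
for radius in [0.74, 0.75, 1.0, 1.25]:
    print(radius, '-> bin', bin_index(radius, edges))
```

Under this convention a radius of exactly 1.0 lands in the bin that starts at 1.0, together with every other value in that bin.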

# General pattern (placeholder names, not runnable as written).
df.plot(
    kind='hist',
    y=column_name,
    density=True
)

# There are 10 bins by default, some of which are empty.
terr.plot(kind='hist', y='Radius', density=True, ec='w');

terr.plot(kind='hist', y='Radius', density=True, bins=20, ec='w');

terr.plot(kind='hist', y='Radius', density=True, bins=[0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5], ec='w');

# np.arange excludes its stop value, so stop at 4 to include the edge 3.5.
terr.plot(kind='hist', y='Radius', density=True,
          bins=np.arange(0, 4, 0.5),
          ec='w');
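
`np.arange(start, stop, step)` leaves out `stop`, which matters when the result is used as bin edges: values beyond the last edge are not drawn. The pure-Python stand-in below (`arange_like` is a made-up helper, not a NumPy function) mimics that behavior for the simple case of a positive step, so the resulting edges can be inspected:

```python
import math

def arange_like(start, stop, step):
    """Mimic np.arange for a positive step: start, start + step, ..., excluding stop."""
    count = max(0, math.ceil((stop - start) / step))
    return [start + i * step for i in range(count)]

print(arange_like(0, 3.5, 0.5))  # last edge is 3.0
print(arange_like(0, 4, 0.5))    # last edge is 3.5
```

Since the largest terrestrial radius is 3.13, edges that stop at 3.0 would leave that planet out of the histogram.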

terr.sort_values('Radius', ascending=False)

terr.plot(kind='hist', y='Radius', density=True,
          bins=[0, 0.25, 0.5, 0.75, 2, 4], ec='w');
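
With bins of unequal width, bar heights are still chosen so that each bar's area equals the proportion of values in its bin, which makes the total area 1. A minimal sketch of that calculation follows, assuming left-inclusive bins with a closed last bin and assuming every value falls inside some bin. The radii are invented and `density_heights` is a hypothetical helper:

```python
def density_heights(values, edges):
    """Return bar heights such that height * width = proportion of values in each bin."""
    heights = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        # Bins are [left, right), except the last one, which is [left, right].
        in_bin = [v for v in values
                  if left <= v < right or (i == last and v == right)]
        heights.append(len(in_bin) / len(values) / (right - left))
    return heights

edges = [0, 0.25, 0.5, 0.75, 2, 4]
radii = [0.4, 0.6, 0.7, 0.8, 0.9, 0.9, 1.0, 1.1, 1.8, 3.1]  # made-up values

heights = density_heights(radii, edges)
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]
print(heights)
print(sum(areas))
```

The wide bin from 0.75 to 2 holds most of the values yet has a modest height, because its proportion is spread over a width of 1.25.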

in_range = terr[(terr.get('Radius') >= 0.5) & (terr.get('Radius') < 0.75)].shape[0]
in_range

39

in_range / terr.shape[0]

0.20207253886010362
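
Since a bar's area equals the proportion of values in its bin, and area is height times width, the height of the bar for the bin [0.5, 0.75) follows by dividing the proportion above by the bin width. A worked version of that arithmetic, using the counts from the cells above:

```python
in_range = 39        # terrestrial exoplanets with 0.5 <= Radius < 0.75
total = 193          # all terrestrial exoplanets
width = 0.75 - 0.5   # width of the bin [0.5, 0.75)

proportion = in_range / total   # area of the bar
height = proportion / width     # height = area / width

print(round(proportion, 4), round(height, 4))
```

The bar over [0.5, 0.75) should therefore reach a height of roughly 0.81.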

Action	Keyboard shortcut
Run cell + jump to next cell	SHIFT + ENTER
Save the notebook	CTRL/CMD + S
Create new cell above/below	A/B
Delete cell	DD

Name	Distance	Magnitude	Type	Year	Detection	Mass	Radius
11 Comae Berenices b	304.0	4.72	Gas Giant	2007	Radial Velocity	6165.90	11.88
11 Ursae Minoris b	409.0	5.01	Gas Giant	2009	Radial Velocity	4684.81	11.99
14 Andromedae b	246.0	5.23	Gas Giant	2008	Radial Velocity	1525.58	12.65
...	...	...	...	...	...	...	...
YZ Ceti b	12.0	12.07	Terrestrial	2017	Radial Velocity	0.70	0.91
YZ Ceti c	12.0	12.07	Super Earth	2017	Radial Velocity	1.14	1.05
YZ Ceti d	12.0	12.07	Super Earth	2017	Radial Velocity	1.09	1.03

Year	Distance	Magnitude	Mass	Radius
1995	50.00	5.45	146.20	13.97
1996	51.33	5.12	1020.67	13.09
1997	57.00	5.41	332.10	13.53
...	...	...	...	...
2021	1944.22	13.01	255.42	4.44
2022	508.61	10.62	943.16	6.77
2023	451.89	12.09	162.78	7.12

Type	Distance	Magnitude	Year	Mass	Radius
Gas Giant	1096.40	10.30	2013.73	1472.39	12.74
Neptune-like	2189.02	13.52	2016.59	15.28	3.11
Super Earth	1916.26	13.85	2016.43	5.81	1.58
Terrestrial	1373.60	13.45	2016.37	1.62	0.85

State	Region	Capital City	Population	Land Area	Party	Density
Alabama	South	Montgomery	5024279	50645	Republican	99.21
Alaska	West	Juneau	733391	570641	Republican	1.29
Arizona	West	Phoenix	7151502	113594	Republican	62.96
...	...	...	...	...	...	...
West Virginia	South	Charleston	1793716	24038	Republican	74.62
Wisconsin	Midwest	Madison	5893718	54158	Republican	108.82
Wyoming	West	Cheyenne	576851	97093	Republican	5.94

Region	Capital City	Population	Land Area	Party	Density
Midwest	12	12	12	12	12
Northeast	9	9	9	9	9
South	16	16	16	16	16
West	13	13	13	13	13

Detection	Distance	Magnitude	Type	Year	Mass	Radius
Astrometry	1	1	1	1	1	1
Direct Imaging	50	50	50	50	50	50
Disk Kinematics	1	1	1	1	1	1
...	...	...	...	...	...	...
Radial Velocity	1019	1019	1019	1019	1019	1019
Transit	3914	3914	3914	3914	3914	3914
Transit Timing Variations	23	23	23	23	23	23

Type	Distance	Magnitude	Year	Detection	Mass	Radius
Gas Giant	1480	1480	1480	1480	1480	1480
Neptune-like	1793	1793	1793	1793	1793	1793
Super Earth	1577	1577	1577	1577	1577	1577
Terrestrial	193	193	193	193	193	193

Name	Distance	Magnitude	Type	Year	Detection	Mass	Radius
EPIC 201497682 b	825.0	13.95	Terrestrial	2019	Transit	0.26	0.69
EPIC 201757695.02	1884.0	14.97	Terrestrial	2020	Transit	0.69	0.91
EPIC 201833600 c	840.0	14.71	Terrestrial	2019	Transit	0.97	1.00
...	...	...	...	...	...	...	...
TRAPPIST-1 e	41.0	17.02	Terrestrial	2017	Transit	0.69	0.92
TRAPPIST-1 h	41.0	17.02	Terrestrial	2017	Transit	0.33	0.76
YZ Ceti b	12.0	12.07	Terrestrial	2017	Radial Velocity	0.70	0.91

Name	Distance	Magnitude	Type	Year	Detection	Mass	Radius
Kepler-33 c	3944.0	14.10	Terrestrial	2011	Transit	0.39	3.13
K2-138 f	661.0	12.25	Terrestrial	2017	Transit	1.63	2.85
Kepler-11 b	2108.0	13.82	Terrestrial	2010	Transit	1.90	1.80
...	...	...	...	...	...	...	...
Kepler-102 b	352.0	12.07	Terrestrial	2014	Transit	4.30	0.47
Kepler-444 b	119.0	8.87	Terrestrial	2015	Transit	0.04	0.40
Kepler-37 e	209.0	9.77	Terrestrial	2014	Transit Timing Variations	0.03	0.37

Lecture 4 – Data Visualization 📈, Density, and Density Histograms

DSC 10, Summer 2025

Agenda

Aside: Keyboard shortcuts

Adjusting columns

`.count()`

Adjusting columns with `.assign`, `.drop`, and `.get`

Two ways to `.get`

Activity

Why visualize?

Little Women

Why visualize?

Terminology

Individuals and variables

Types of variables

Examples of numerical variables

Examples of categorical variables

Concept Check ✅

Types of visualizations

Scatter plots

The data: exoplanets discovered by NASA 🪐

Scatter plots

Scatter plots

Zooming in 🔍

Line plots 📉

Line plots

Line plots

Extra video on line plots

Bar charts 📊

Bar charts

Bar charts

Bar charts and sorting

Multiple plots on the same axes

Overlaying plots

Selecting multiple columns at once

Distributions

What is the distribution of a variable?

Distributions of categorical variables

Terrestrial exoplanets 🌑

Visualizing the distribution of `'Radius'`, a numerical variable

Density histograms

Density histograms show the distribution of numerical variables

First key idea behind histograms: Binning 🗑️

Plotting a density histogram

Customizing the bins

Observations

Second key idea behind histograms: Total area is 1

Example calculation

Example calculation
