In [1]:
# Run this cell to set up packages for lecture.
from lec06_imports import *
%reload_ext pandas_tutor

Lecture 6 – Data Visualization 📈¶

DSC 10, Spring 2026¶

Agenda¶

  • Recap: Grouping.
  • Adjusting columns.
  • Why visualize?
  • Terminology.
  • Scatter plots.
  • Line plots.
  • Bar charts.

💡 Pro-Tip: Using Pandas Tutor¶

Pandas Tutor (built by Sam Lau) is pre-installed on DataHub. It visualizes the last line of a Jupyter cell. To use it, there are two steps:

  1. Add the line %reload_ext pandas_tutor to the top of your notebook and run it (notice that we already did that in this notebook).
  2. To visualize a cell, add %%pt to the top of the cell:
In [2]:
%%pt

df = bpd.read_csv('data/dogs_small.csv')
df.get(['size', 'longevity']).groupby('size').mean()

💡 Pro-Tip: Using keyboard shortcuts¶

There are several keyboard shortcuts built into Jupyter Notebooks designed to help you save time. To see them, either click the keyboard button in the toolbar above or hit the H key on your keyboard (as long as you're not actively editing a cell).

Particularly useful shortcuts:

Action | Keyboard shortcut
Run cell + jump to next cell | SHIFT + ENTER
Save the notebook | CTRL/CMD + S
Create new cell above/below | A / B
Delete cell | D, D

Recap: Grouping¶

The data: US states 🗽¶

We'll continue working with the same data from last time.

In [3]:
states = bpd.read_csv('data/states.csv')
states = states.assign(Density=states.get('Population') / states.get('Land Area'))
states = states.set_index('State')
states
Out[3]:
Region Capital City Population Land Area Party Density
State
Alabama South Montgomery 5024279 50645 Republican 99.21
Alaska West Juneau 733391 570641 Republican 1.29
Arizona West Phoenix 7151502 113594 Republican 62.96
... ... ... ... ... ... ...
West Virginia South Charleston 1793716 24038 Republican 74.62
Wisconsin Midwest Madison 5893718 54158 Republican 108.82
Wyoming West Cheyenne 576851 97093 Republican 5.94

50 rows × 6 columns

.groupby aggregates rows¶

In short, .groupby aggregates (collects) all rows with the same value in a specified column (e.g. 'Region') into a single row in the resulting DataFrame, combining the values from the different rows using an aggregation method (e.g. .sum()).

In [4]:
states.get(['Region', 'Population']).groupby('Region').sum()
Out[4]:
Population
Region
Midwest 68985454
Northeast 57609148
South 125576562
West 78588572

Using .groupby in general¶

  1. Isolate the columns you'll need.
    • Use .get([...]) to select them.
    • Rule of thumb:
      • Choose a categorical column to group by (e.g. Region).
      • Choose one or more columns to aggregate (e.g. Population, Land Area, etc.). Most commonly, these are numerical columns.
      • The aggregation method should make sense for the data in the columns you want to aggregate (e.g. .mean() works for int and float values, but not for strings).
  2. Make groups with .groupby(categorical_column).
    • .groupby(categorical_column) will gather rows which have the same value in the specified column.
    • In the resulting DataFrame, there will be one row for every unique value in that column.
  3. Follow up with an aggregation method.
    • The aggregation method is applied within each group.
    • The aggregation method is applied individually to each column.
      • If it doesn't make sense to use the aggregation method on a column, you will get an error or nonsensical result. This is why we use .get first, so we only aggregate the columns we want to!
    • Aggregation methods you should know: .count(), .sum(), .mean(), .median(), .max(), and .min().
      • 🚨 Note: In this class, avoid using .size() which is similar but not the same as .count().
In [5]:
(states
    # Step 1
    .get(['Region', 'Population'])
    # Step 2
    .groupby('Region')
    # Step 3
    .sum()
)
Out[5]:
Population
Region
Midwest 68985454
Northeast 57609148
South 125576562
West 78588572

The role of .get¶

Observe what happens if we don't start with .get.

In [6]:
states.groupby('Region').sum()
Out[6]:
Capital City Population Land Area Party Density
Region
Midwest SpringfieldIndianapolisDes MoinesTopekaLansing... 68985454 750524 DemocraticRepublicanRepublicanRepublicanRepubl... 1298.78
Northeast HartfordAugustaBostonConcordTrentonAlbanyHarri... 57609148 161912 DemocraticDemocraticDemocraticDemocraticDemocr... 4957.49
South MontgomeryLittle RockDoverTallahasseeAtlantaFr... 125576562 868356 RepublicanRepublicanDemocraticRepublicanRepubl... 3189.37
West JuneauPhoenixSacramentoDenverHonoluluBoiseHele... 78588572 1751054 RepublicanRepublicanDemocraticDemocraticDemocr... 881.62
In [7]:
states.groupby('Region').mean()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:1944, in GroupBy._agg_py_fallback(self, how, values, ndim, alt)
   1943 try:
-> 1944     res_values = self._grouper.agg_series(ser, alt, preserve_dtype=True)
   1945 except Exception as err:

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/ops.py:873, in BaseGrouper.agg_series(self, obj, func, preserve_dtype)
    871     preserve_dtype = True
--> 873 result = self._aggregate_series_pure_python(obj, func)
    875 npvalues = lib.maybe_convert_objects(result, try_float=False)

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/ops.py:894, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
    893 for i, group in enumerate(splitter):
--> 894     res = func(group)
    895     res = extract_result(res)

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2461, in GroupBy.mean.<locals>.<lambda>(x)
   2458 else:
   2459     result = self._cython_agg_general(
   2460         "mean",
-> 2461         alt=lambda x: Series(x, copy=False).mean(numeric_only=numeric_only),
   2462         numeric_only=numeric_only,
   2463     )
   2464     return result.__finalize__(self.obj, method="groupby")

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/series.py:6570, in Series.mean(self, axis, skipna, numeric_only, **kwargs)
   6562 @doc(make_doc("mean", ndim=1))
   6563 def mean(
   6564     self,
   (...)   6568     **kwargs,
   6569 ):
-> 6570     return NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/generic.py:12485, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  12478 def mean(
  12479     self,
  12480     axis: Axis | None = 0,
   (...)  12483     **kwargs,
  12484 ) -> Series | float:
> 12485     return self._stat_function(
  12486         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  12487     )

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/generic.py:12442, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  12440 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 12442 return self._reduce(
  12443     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  12444 )

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/series.py:6478, in Series._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   6474     raise TypeError(
   6475         f"Series.{name} does not allow {kwd_name}={numeric_only} "
   6476         "with non-numeric dtypes."
   6477     )
-> 6478 return op(delegate, skipna=skipna, **kwds)

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:720, in nanmean(values, axis, skipna, mask)
    719 the_sum = values.sum(axis, dtype=dtype_sum)
--> 720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/nanops.py:1701, in _ensure_numeric(x)
   1699 if isinstance(x, str):
   1700     # GH#44008, GH#36703 avoid casting e.g. strings to numeric
-> 1701     raise TypeError(f"Could not convert string '{x}' to numeric")
   1702 try:

TypeError: Could not convert string 'SpringfieldIndianapolisDes MoinesTopekaLansingSaint PaulJefferson CityLincolnBismarckColumbusPierreMadison' to numeric

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 states.groupby('Region').mean()

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/babypandas/utils.py:20, in suppress_warnings.<locals>.wrapper(*args, **kwargs)
     18 with warnings.catch_warnings():
     19     warnings.simplefilter("ignore")
---> 20     return func(*args, **kwargs)

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/babypandas/bpd.py:1514, in DataFrameGroupBy.mean(self)
   1510 """
   1511 Compute mean of group.
   1512 """
   1513 f = _lift_to_pd(self._pd.mean)
-> 1514 return f()

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/babypandas/bpd.py:1596, in _lift_to_pd.<locals>.closure(*vargs, **kwargs)
   1590 vargs = [x._pd if isinstance(x, types) else x for x in vargs]
   1591 kwargs = {
   1592     k: x._pd if isinstance(x, types) else x
   1593     for (k, x) in kwargs.items()
   1594 }
-> 1596 a = func(*vargs, **kwargs)
   1597 if isinstance(a, pd.DataFrame):
   1598     return DataFrame(data=a)

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2459, in GroupBy.mean(self, numeric_only, engine, engine_kwargs)
   2452     return self._numba_agg_general(
   2453         grouped_mean,
   2454         executor.float_dtype_mapping,
   2455         engine_kwargs,
   2456         min_periods=0,
   2457     )
   2458 else:
-> 2459     result = self._cython_agg_general(
   2460         "mean",
   2461         alt=lambda x: Series(x, copy=False).mean(numeric_only=numeric_only),
   2462         numeric_only=numeric_only,
   2463     )
   2464     return result.__finalize__(self.obj, method="groupby")

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2005, in GroupBy._cython_agg_general(self, how, alt, numeric_only, min_count, **kwargs)
   2002     result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
   2003     return result
-> 2005 new_mgr = data.grouped_reduce(array_func)
   2006 res = self._wrap_agged_manager(new_mgr)
   2007 if how in ["idxmin", "idxmax"]:

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/internals/managers.py:1488, in BlockManager.grouped_reduce(self, func)
   1484 if blk.is_object:
   1485     # split on object-dtype blocks bc some columns may raise
   1486     #  while others do not.
   1487     for sb in blk._split():
-> 1488         applied = sb.apply(func)
   1489         result_blocks = extend_blocks(applied, result_blocks)
   1490 else:

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/internals/blocks.py:395, in Block.apply(self, func, **kwargs)
    389 @final
    390 def apply(self, func, **kwargs) -> list[Block]:
    391     """
    392     apply the function to my values; return a block if we are not
    393     one
    394     """
--> 395     result = func(self.values, **kwargs)
    397     result = maybe_coerce_values(result)
    398     return self._split_op_result(result)

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:2002, in GroupBy._cython_agg_general.<locals>.array_func(values)
   1999     return result
   2001 assert alt is not None
-> 2002 result = self._agg_py_fallback(how, values, ndim=data.ndim, alt=alt)
   2003 return result

File ~/Desktop/sp26-dsc10/dsc10-2026-sp-private/.venv/lib/python3.14/site-packages/pandas/core/groupby/groupby.py:1948, in GroupBy._agg_py_fallback(self, how, values, ndim, alt)
   1946     msg = f"agg function failed [how->{how},dtype->{ser.dtype}]"
   1947     # preserve the kind of exception that raised
-> 1948     raise type(err)(msg) from err
   1950 dtype = ser.dtype
   1951 if dtype == object:

TypeError: agg function failed [how->mean,dtype->object]

Observations on grouping¶

  1. After grouping, the index changes. The new row labels are the group labels (i.e., the unique values in the column that we grouped on), sorted in ascending order.

💡 Pro-Tip: look for keywords "per," "for each," and "indexed by" when solving problems.

  2. The aggregation method is applied separately to each column. If it does not make sense to apply the aggregation method to a certain column, you will get an error or a nonsensical result.
  3. Since the aggregation method is applied to each column separately, the rows of the resulting DataFrame need to be interpreted with care.
  4. The column names don't make sense after grouping with the .count() aggregation method.
In [8]:
states.groupby('Region').count()
Out[8]:
Capital City Population Land Area Party Density
Region
Midwest 12 12 12 12 12
Northeast 9 9 9 9 9
South 16 16 16 16 16
West 13 13 13 13 13
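The first observation above, that the new index consists of the group labels sorted in ascending order, can be checked with a tiny made-up example. This sketch uses plain pandas (babypandas wraps pandas, so grouping behaves the same way); the data is hypothetical.

```python
import pandas as pd

# Made-up mini-dataset; note the 'Region' values are NOT in sorted order.
df = pd.DataFrame({'Region': ['West', 'South', 'West', 'Midwest'],
                   'Population': [10, 20, 30, 40]})
grouped = df.groupby('Region').sum()

# The group labels become the index, sorted in ascending order.
print(list(grouped.index))  # ['Midwest', 'South', 'West']
```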

Dropping, renaming, and reordering columns¶

Consider dropping unneeded columns and renaming columns as follows:

  1. Use .assign to create a new column containing the same values as the old column(s).
  2. Use .drop(columns=list_of_column_labels) to drop the old column(s).
    • Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
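The .drop route in step 2 isn't demonstrated below, so here is a minimal sketch with made-up data (plain pandas, but the same pattern works on a babypandas DataFrame):

```python
import pandas as pd

# Hypothetical data: rename 'Capital City' to 'Capital'.
df = pd.DataFrame({'Capital City': ['Sacramento', 'Austin'],
                   'Population': [39, 30]})

# Step 1: copy the column under the new name.
# Step 2: drop the old column.
renamed = df.assign(Capital=df.get('Capital City')).drop(columns=['Capital City'])

print(list(renamed.columns))  # ['Population', 'Capital']
```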
In [9]:
states_by_region = states.groupby('Region').count()
states_by_region = states_by_region.assign(
                    States=states_by_region.get('Capital City')
                    ).get(['States'])
states_by_region
Out[9]:
States
Region
Midwest 12
Northeast 9
South 16
West 13

Why visualize?¶

Little Women¶

In Lecture 1, we were able to answer questions about the plot of Little Women without having to read the novel and without having to understand Python code. Some of those questions included:

  • Who is the main character?
  • Which pair of characters gets married at the end?

We answered these questions from a data visualization alone!

In [10]:
bpd.read_csv('data/lw_counts.csv').plot(x='Chapter', linewidth=3.0);

Why visualize?¶

  • Computers are better than humans at crunching numbers, but humans are better at identifying visual patterns.
  • Visualizations allow us to understand lots of data quickly – they make it easier to spot trends and communicate our results with others.
  • There are many types of visualizations; in this class, we'll look at scatter plots, line plots, bar charts, and histograms, but there are many others.
    • The right choice depends on the type of data.

Terminology¶

Individuals and variables¶

  • Individual (row): Person/place/thing for which data is recorded. Also called an observation.
  • Variable (column): Something that is recorded for each individual. Also called a feature.

Types of variables¶

There are two main types of variables:

  • Numerical: It makes sense to do arithmetic with the values.
  • Categorical: Values fall into categories that may or may not have an order to them.

Note that here, "variable" does not mean a variable in Python, but rather it means a column in a DataFrame.

Examples of numerical variables¶

  • Salaries of NBA players 🏀.
    • Individual: An NBA player.
    • Variable: Their salary.
  • Company's annual profit 💰.
    • Individual: A company.
    • Variable: Its annual profit.
  • Flu shots administered per day 💉.
    • Individual: Date.
    • Variable: Number of flu shots administered on that date.

Examples of categorical variables¶

  • Movie genres 🎬.
    • Individual: A movie.
    • Variable: Its genre.
  • Zip codes 🏠.
    • Individual: US resident.
    • Variable: Zip code.
      • Even though they look like numbers, zip codes are categorical (arithmetic doesn't make sense).
  • Level of prior programming experience for students in DSC 10 🧑‍🎓.
    • Individual: Student in DSC 10.
    • Variable: Their level of prior programming experience, e.g. none, low, medium, or high.
      • There is an order to these categories!

Concept Check ✅ – Answer at cc.dsc10.com¶

Which of these is not a numerical variable?

A. Fuel economy in miles per gallon.

B. Number of quarters at UCSD.

C. College at UCSD (Sixth, Seventh, etc).

D. Bank account number.

E. More than one of these are not numerical variables.

Types of visualizations¶

The type of visualization we create depends on the kinds of variables we're visualizing.

  • Scatter plot: Numerical vs. numerical.
  • Line plot: Sequential numerical (time) vs. numerical.
  • Bar chart: Categorical vs. numerical.
  • Histogram: Numerical.
    • Will cover next time.

We use the words "plot," "chart," and "graph" interchangeably; they all mean the same thing.

Scatter plots¶

The data: exoplanets discovered by NASA 🪐¶

An exoplanet is a planet outside our solar system. NASA has discovered over 5,000 exoplanets so far in its search for signs of life beyond Earth. 👽

Column | Contents
'Distance' | Distance from Earth, in light years.
'Magnitude' | Apparent magnitude, which measures brightness in such a way that brighter objects have lower values.
'Type' | Categorization of the planet based on its composition and size.
'Year' | When the planet was discovered.
'Detection' | The method of detection used to discover the planet.
'Mass' | The ratio of the planet's mass to Earth's mass.
'Radius' | The ratio of the planet's radius to Earth's radius.

In [11]:
exo = bpd.read_csv('data/exoplanets.csv').set_index('Name')
exo
Out[11]:
Distance Magnitude Type Year Detection Mass Radius
Name
11 Comae Berenices b 304.0 4.72 Gas Giant 2007 Radial Velocity 6165.90 11.88
11 Ursae Minoris b 409.0 5.01 Gas Giant 2009 Radial Velocity 4684.81 11.99
14 Andromedae b 246.0 5.23 Gas Giant 2008 Radial Velocity 1525.58 12.65
... ... ... ... ... ... ... ...
YZ Ceti b 12.0 12.07 Terrestrial 2017 Radial Velocity 0.70 0.91
YZ Ceti c 12.0 12.07 Super Earth 2017 Radial Velocity 1.14 1.05
YZ Ceti d 12.0 12.07 Super Earth 2017 Radial Velocity 1.09 1.03

5043 rows × 7 columns

Scatter plots¶

  • What is the relationship between 'Distance' and 'Magnitude'?
In [12]:
exo.plot(kind='scatter', x='Distance', y='Magnitude');
  • Farther planets have greater 'Magnitude' (meaning they are less bright), which makes sense.

  • The data appears curved because 'Magnitude' is measured on a logarithmic scale. A decrease of one unit in 'Magnitude' corresponds to roughly a 2.5-times increase in brightness.

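The factor of 2.5 comes from the definition of the magnitude scale: a difference of 5 magnitudes corresponds to a factor of exactly 100 in brightness. A quick check of the per-magnitude factor:

```python
# 5 magnitudes = a factor of 100 in brightness, so 1 magnitude is a
# factor of 100 ** (1/5).
per_magnitude = 100 ** (1 / 5)
print(round(per_magnitude, 3))  # 2.512
```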

Scatter plots¶

  • Scatter plots visualize the relationship between two numerical variables.
  • To create one from a DataFrame df, use
df.plot(
    kind='scatter', 
    x=x_column_for_horizontal, 
    y=y_column_for_vertical
)
  • The resulting scatter plot has one point per row of df.
  • If you put a semicolon after a call to .plot, it will hide the weird text output that displays.

Zooming in 🔍¶

The majority of exoplanets are less than 10,000 light years away; if we'd like to zoom in on just these exoplanets, we can query before plotting.

In [13]:
exo[exo.get('Distance') < 10000].plot(kind='scatter', x='Distance', y='Magnitude');

Line plots 📉¶

Line plots¶

  • How has the 'Magnitude' of newly discovered exoplanets changed over time?
In [14]:
# There were multiple exoplanets discovered each year.
# What operation can we apply to this DataFrame so that there is one row per year?
exo
Out[14]:
Distance Magnitude Type Year Detection Mass Radius
Name
11 Comae Berenices b 304.0 4.72 Gas Giant 2007 Radial Velocity 6165.90 11.88
11 Ursae Minoris b 409.0 5.01 Gas Giant 2009 Radial Velocity 4684.81 11.99
14 Andromedae b 246.0 5.23 Gas Giant 2008 Radial Velocity 1525.58 12.65
... ... ... ... ... ... ... ...
YZ Ceti b 12.0 12.07 Terrestrial 2017 Radial Velocity 0.70 0.91
YZ Ceti c 12.0 12.07 Super Earth 2017 Radial Velocity 1.14 1.05
YZ Ceti d 12.0 12.07 Super Earth 2017 Radial Velocity 1.09 1.03

5043 rows × 7 columns

  • Let's calculate the average 'Magnitude' of all exoplanets discovered in each 'Year'.
In [15]:
exo.get(['Year', 'Magnitude']).groupby('Year').mean()
Out[15]:
Magnitude
Year
1995 5.45
1996 5.12
1997 5.41
... ...
2021 13.01
2022 10.62
2023 12.09

29 rows × 1 columns

In [16]:
(exo
    .get(['Year', 'Magnitude'])
    .groupby('Year')
    .mean()
    .plot(
        kind='line', 
        y='Magnitude',
        linewidth=3.0)
);
  • It looks like the brightest planets were discovered first, which makes sense.

  • NASA's Kepler space telescope began its nine-year mission in 2009, leading to a boom in the discovery of exoplanets.

Line plots¶

  • Line plots show trends in numerical variables over time.
  • To create one from a DataFrame df, use
df.plot(
    kind='line', 
    x=x_column_for_horizontal, 
    y=y_column_for_vertical
)
  • To use the index as the x-axis, omit the x= argument.
    • This doesn't work for scatter plots, but it works for most other plot types.
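The index-as-x behavior can be sketched with made-up yearly data. This uses plain pandas with a non-interactive matplotlib backend so it runs headlessly; the numbers are hypothetical.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical averages, with 'Year' as the index.
by_year = pd.DataFrame({'Magnitude': [5.4, 6.1, 7.0]},
                       index=pd.Index([1995, 1996, 1997], name='Year'))

# No x= argument: the index ('Year') supplies the x-axis values.
ax = by_year.plot(kind='line', y='Magnitude')
```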

Bar charts 📊¶

Bar charts¶

  • How big are each of the different 'Type's of exoplanets, on average?
In [17]:
exo
Out[17]:
Distance Magnitude Type Year Detection Mass Radius
Name
11 Comae Berenices b 304.0 4.72 Gas Giant 2007 Radial Velocity 6165.90 11.88
11 Ursae Minoris b 409.0 5.01 Gas Giant 2009 Radial Velocity 4684.81 11.99
14 Andromedae b 246.0 5.23 Gas Giant 2008 Radial Velocity 1525.58 12.65
... ... ... ... ... ... ... ...
YZ Ceti b 12.0 12.07 Terrestrial 2017 Radial Velocity 0.70 0.91
YZ Ceti c 12.0 12.07 Super Earth 2017 Radial Velocity 1.14 1.05
YZ Ceti d 12.0 12.07 Super Earth 2017 Radial Velocity 1.09 1.03

5043 rows × 7 columns

In [18]:
types = (exo
    .get(['Type', 'Mass', 'Radius', 'Magnitude', 'Distance'])
    .groupby('Type')
    .mean()
)
types
Out[18]:
Mass Radius Magnitude Distance
Type
Gas Giant 1472.39 12.74 10.30 1096.40
Neptune-like 15.28 3.11 13.52 2189.02
Super Earth 5.81 1.58 13.85 1916.26
Terrestrial 1.62 0.85 13.45 1373.60
In [19]:
types.plot(kind='barh', y='Radius');
In [20]:
types.plot(kind='barh', y='Mass');
  • It looks like the 'Gas Giant's are aptly named!

Bar charts¶

  • Bar charts visualize the relationship between a categorical variable and a numerical variable.
  • In a bar chart...
    • The thickness and spacing of bars are arbitrary.
    • The order of the categorical labels doesn't matter.
  • To create one from a DataFrame df, use
df.plot(
    kind='barh', 
    x=categorical_column_name, 
    y=numerical_column_name
)
  • The "h" in 'barh' stands for "horizontal".
    • It's easier to read labels this way.
  • Note that in the previous chart, we set y='Mass' even though mass is shown along the x-axis: the y= argument names the column whose values determine the lengths of the bars.
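In the examples above, the categorical variable is the index, but when it is a regular column you pass it as x=, as the template shows. A small sketch with made-up data (plain pandas, non-interactive backend):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical counts; 'Kind' is a regular column, not the index.
pets = pd.DataFrame({'Kind': ['cat', 'dog', 'fish'],
                     'Count': [8, 11, 3]})

# Pass the categorical column as x= and the numerical column as y=.
ax = pets.plot(kind='barh', x='Kind', y='Count')
# Each bar's length encodes the corresponding 'Count' value.
```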

Bar charts and sorting¶

What are the most popular 'Detection' methods for discovering exoplanets?

In [21]:
# Count how many exoplanets are discovered by each detection method.
popular_detection = exo.groupby('Detection').count()
popular_detection
Out[21]:
Distance Magnitude Type Year Mass Radius
Detection
Astrometry 1 1 1 1 1 1
Direct Imaging 50 50 50 50 50 50
Disk Kinematics 1 1 1 1 1 1
... ... ... ... ... ... ...
Radial Velocity 1019 1019 1019 1019 1019 1019
Transit 3914 3914 3914 3914 3914 3914
Transit Timing Variations 23 23 23 23 23 23

11 rows × 6 columns

In [22]:
# Give columns more meaningful names and eliminate redundancy.
popular_detection = (popular_detection.assign(Count=popular_detection.get('Distance'))
                                      .get(['Count'])
                                      .sort_values(by='Count', ascending=False)
                    )
popular_detection
Out[22]:
Count
Detection
Transit 3914
Radial Velocity 1019
Direct Imaging 50
... ...
Astrometry 1
Disk Kinematics 1
Pulsar Timing 1

11 rows × 1 columns

In [23]:
# Notice that the bars appear in the opposite order relative to the DataFrame.
popular_detection.plot(kind='barh', y='Count');
In [24]:
# Change "barh" to "bar" to get a vertical bar chart. 
# These are harder to read, but the bars do appear in the same order as the DataFrame.
popular_detection.plot(kind='bar', y='Count');
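Since 'barh' draws the first row of the DataFrame at the bottom, one way to put the largest bar on top is to sort ascending before plotting. A sketch using the counts from above (plain pandas; plotting line left commented):

```python
import pandas as pd

counts = pd.DataFrame({'Count': [3914, 1019, 50]},
                      index=pd.Index(['Transit', 'Radial Velocity', 'Direct Imaging'],
                                     name='Detection'))

# Sort smallest-first; 'barh' then draws the largest bar at the top.
flipped = counts.sort_values(by='Count')
print(list(flipped.index))  # ['Direct Imaging', 'Radial Velocity', 'Transit']
# flipped.plot(kind='barh', y='Count')
```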

Multiple plots on the same axes¶

Can we look at both the average 'Magnitude' and the average 'Radius' for each 'Type' at the same time?

In [25]:
types.get(['Magnitude', 'Radius']).plot(kind='barh');

How did we do that?

Overlaying plots¶

When calling .plot, if we omit the y=column_name argument, all columns (other than x, if specified) are plotted.

In [26]:
types
Out[26]:
Mass Radius Magnitude Distance
Type
Gas Giant 1472.39 12.74 10.30 1096.40
Neptune-like 15.28 3.11 13.52 2189.02
Super Earth 5.81 1.58 13.85 1916.26
Terrestrial 1.62 0.85 13.45 1373.60
In [27]:
types.plot(kind='barh');

Selecting multiple columns at once¶

Remember, to select multiple columns, use .get([column_1, ..., column_k]). This returns a DataFrame.

In [28]:
types
Out[28]:
Mass Radius Magnitude Distance
Type
Gas Giant 1472.39 12.74 10.30 1096.40
Neptune-like 15.28 3.11 13.52 2189.02
Super Earth 5.81 1.58 13.85 1916.26
Terrestrial 1.62 0.85 13.45 1373.60
In [29]:
types.get(['Magnitude', 'Radius'])
Out[29]:
Magnitude Radius
Type
Gas Giant 10.30 12.74
Neptune-like 13.52 3.11
Super Earth 13.85 1.58
Terrestrial 13.45 0.85
In [30]:
types.get(['Magnitude', 'Radius']).plot(kind='barh');

Summary¶

Summary¶

  • Visualizations make it easy to extract patterns from datasets.
  • There are two main types of variables: categorical and numerical.
  • The types of the variables we're visualizing inform our choice of which type of visualization to use.
  • Today, we looked at scatter plots, line plots, and bar charts.
  • Next time: histograms.