Lecture 6 – More Queries and GroupBy

DSC 10, Fall 2022

Announcements

Agenda

Resources:

About the Data: Get It Done service requests 👷

The requests DataFrame contains a summary of all service requests so far in 2022, broken down by neighborhood and service.

Recap: queries

What is a query? 🤔

How do we query a DataFrame?

To select only certain rows of requests:

  1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
  2. Then pass it into requests[sequence_goes_here].

Element-wise comparisons

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

Example 6: Which neighborhood has the most 'Weed Cleanup' requests?

Key concept: Selecting rows (via Boolean indexing).

Strategy

  1. Query to extract a DataFrame of just the 'Weed Cleanup' requests.
  2. Sort by 'total' in descending order.
  3. Extract the first element from the 'neighborhood' column.

What if the condition isn't satisfied?

Concept Check ✅ – Answer at cc.dsc10.com

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

Activity 🚘

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

Example 7: How many service requests were for 'Pothole' or 'Pavement Maintenance'?

Key concept: Queries with multiple conditions.

Multiple conditions

The & and | operators work element-wise

Original Question: How many service requests were for 'Pothole' or 'Pavement Maintenance'?

Concept Check ✅ – Answer at cc.dsc10.com

Each of the following questions can be answered by querying the requests DataFrame.

  1. Which neighborhood had the most 'Street Flooded' requests?
  2. In the 'Kearny Mesa' neighborhood, how many different types of services have open requests?
  3. How many requests have been closed in the 'La Jolla' neighborhood?

How many of the questions above require the query to have multiple conditions?

A. 0              B. 1              C. 2              D. 3

Bonus: Try to write the code to answer each question.

Example 8: Which neighborhood had the most Get It Done requests?

Key concept: Grouping by one column.

Organizing requests by neighborhood

It seems like there has to be a better way. And there is!

GroupBy: Split, aggregate, and combine

Observe what happens when we use the .groupby method on requests with the argument 'neighborhood'.

Note that the 'total' counts for Carmel Valley and Torrey Hills are the same as we saw on the previous slide. What just happened? 🤯

An illustrative example: Pets 🐱 🐶🐹

Consider the DataFrame pets containing pet species, colors, and weights.

Visualizing pets.groupby('Species').mean()

  1. Split the rows of pets into "groups" according to their values in the 'Species' column.
  2. Aggregate the rows with the same value of 'Species' by taking the mean of all numerical columns.
  3. Combine these means into a new DataFrame that is indexed by 'Species' and sorted by 'Species' in ascending order.

Note that the result contains just one row for cats, one row for dogs, and one row for hamsters!

Pandas Tutor

Back to Get It Done service requests 👷

Using .groupby in general

In short, .groupby aggregates all rows with the same value in a specified column (e.g. 'neighborhood') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values.

  1. Choose a column to group by.
    • .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
    • On the previous slide, we grouped by 'neighborhood'.
    • In the resulting DataFrame, there was one row for every unique value of 'neighborhood'.
  2. Choose an aggregation method.
    • The aggregation method will be applied within each group.
    • On the previous slide, we applied the .sum() method to every 'neighborhood'.
    • The aggregation method is applied individually to each column (e.g. the sums were computed separately for 'closed', 'open', and 'total').
      • If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output – we'll look at this in more detail shortly.
    • Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

Observation #1

Observation #2

The 'service' column has disappeared. Why?

Disappearing columns ✨🐇🎩

Observation #3

Activity

Write a line of code that evaluates to the service that has been requested in the fewest number of neighborhoods.

Strategy:

  1. Choose a column to group by.
  2. Choose an aggregation method. Some common ones are .count(), .sum(), .mean(), .median(), .max(), and .min().
  3. Sort the resulting DataFrame and extract the relevant piece of information (the service name).

Note: On an assignment, in questions like this with multiple correct answers (a tie for the fewest), you can put any one of them. We'll accept any correct answer.

Observation #4

More practice: IMDb dataset 🎞️

Question: How many movies appear from each decade?

Question: What was the highest rated movie of the 1990s?

Let's try to do this two different ways.

Without grouping

Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different than the way we extract columns, with .get(), because the index is not a column.

With grouping

Challenge problems

We won't cover these problems in class, but they're here for you to practice with some harder examples. To access the solutions, you'll need to watch this solution walkthrough video (start at 10:00).

Before watching the video, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams!

Question: How many years have more than 3 movies rated above 8.5?

Aside: Using .sum() on a Boolean array/Series

Question: Out of the years with more than 3 movies, which had the highest average rating?

Question: Which year had the longest movie titles, on average?

Hint: Use .str.len() on the column or index that contains the names of the movies.

Question: What is the average rating of movies from years that had at least 3 movies in the Top 250?