# Lecture 6 – More Queries and GroupBy¶

## DSC 10, Fall 2022¶

### Announcements¶

• Lab 2 is due on Saturday 10/8 at 11:59pm.
• Homework 2 is due on Tuesday 10/11 at 11:59pm.
• Avoid submission errors. You are responsible for submitting your work in the right format to Gradescope by the deadline.
• Discussion solutions are posted on Monday evenings to practice.dsc10.com.

### Agenda¶

• Recap: queries.
• Queries with multiple conditions.
• GroupBy.
• Extra practice, including challenge problems.

### About the Data: Get It Done service requests 👷¶

The requests DataFrame contains a summary of all service requests so far in 2022, broken down by neighborhood and service.

## Recap: queries¶

### What is a query? 🤔¶

• A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
• We often use queries to filter DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

### How do we query a DataFrame?¶

To select only certain rows of requests:

1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
2. Then pass it into requests[sequence_goes_here].

### Element-wise comparisons¶

There are several types of comparisons we can make.

symbol meaning
== equal to
!= not equal to
< less than
<= less than or equal to
> greater than
>= greater than or equal to

## Example 6: Which neighborhood has the most 'Weed Cleanup' requests?¶

Key concept: Selecting rows (via Boolean indexing).

### Strategy¶

1. Query to extract a DataFrame of just the 'Weed Cleanup' requests.
2. Sort by 'total' in descending order.
3. Extract the first element from the 'neighborhood' column.

### Concept Check ✅ – Answer at cc.dsc10.com¶

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

### Activity 🚘¶

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

## Example 7: How many service requests were for 'Pothole' or 'Pavement Maintenance'?¶

Key concept: Queries with multiple conditions.

### Multiple conditions¶

• To write a query with multiple conditions, use & for "and" and | for "or".
• You must use (parentheses) around each condition!
• ⚠️ Don't use the Python keywords and and or here! They do not behave as you'd want.

### Concept Check ✅ – Answer at cc.dsc10.com¶

Each of the following questions can be answered by querying the requests DataFrame.

1. Which neighborhood had the most 'Street Flooded' requests?
2. In the 'Kearny Mesa' neighborhood, how many different types of services have open requests?
3. How many requests have been closed in the 'La Jolla' neighborhood?

How many of the questions above require the query to have multiple conditions?

A. 0              B. 1              C. 2              D. 3

Bonus: Try to write the code to answer each question.

## Example 8: Which neighborhood had the most Get It Done requests?¶

Key concept: Grouping by one column.

### Organizing requests by neighborhood¶

• We can find the total number of Get It Done requests for any one neighborhood.
• For example, requests[requests.get('neighborhood') == 'Carmel Valley'].get('total').sum().
• But how can we find the total requests for every neighborhood at the same time?

It seems like there has to be a better way. And there is!

### GroupBy: Split, aggregate, and combine¶

Observe what happens when we use the .groupby method on requests with the argument 'neighborhood'.

Note that the 'total' counts for Carmel Valley and Torrey Hills are the same as we saw on the previous slide. What just happened? 🤯

### An illustrative example: Pets 🐱 🐶🐹¶

Consider the DataFrame pets containing pet species, colors, and weights.

### Visualizing pets.groupby('Species').mean()¶

1. Split the rows of pets into "groups" according to their values in the 'Species' column.
2. Aggregate the rows with the same value of 'Species' by taking the mean of all numerical columns.
3. Combine these means into a new DataFrame that is indexed by 'Species' and sorted by 'Species' in ascending order.

Note that the result contains just one row for cats, one row for dogs, and one row for hamsters!

### Pandas Tutor¶

• In the last cell, we saw not just the output of the code, but a visualization of the inner workings of the code.
• This is thanks to Pandas Tutor, a new tool developed by Sam Lau, who taught this course over the summer.
• Pandas Tutor draws diagrams to explain pandas (and babypandas) code.
• Add %%pt to the top of a code cell to explain the last line of babypandas code.
• This requires Pandas Tutor to be imported, which we already did in this notebook.
• You can also use Pandas Tutor through its website, pandastutor.com.

### Using .groupby in general¶

In short, .groupby aggregates all rows with the same value in a specified column (e.g. 'neighborhood') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values.

1. Choose a column to group by.
• .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
• On the previous slide, we grouped by 'neighborhood'.
• In the resulting DataFrame, there was one row for every unique value of 'neighborhood'.
2. Choose an aggregation method.
• The aggregation method will be applied within each group.
• On the previous slide, we applied the .sum() method to every 'neighborhood'.
• The aggregation method is applied individually to each column (e.g. the sums were computed separately for 'closed', 'open', and 'total').
• If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output – we'll look at this in more detail shortly.
• Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

### Observation #1¶

• The index has changed to neighborhood names.
• In general, the new row labels are the group labels (i.e., the unique values in the column that we grouped on).

### Observation #2¶

The 'service' column has disappeared. Why?

### Disappearing columns ✨🐇🎩¶

• The aggregation method – .sum(), in this case – is applied to each column.
• If it doesn't make sense to apply it to a particular column, that column will disappear.
• For instance, we can't sum strings, like in the 'service' column.
• However, we can compute the max of several strings. How?

### Observation #3¶

• The aggregation method is applied to each column separately.
• The rows of the resulting DataFrame need to be interpreted with care.
• Have there been 1703 'Weed Cleanup' requests in Balboa Park so far this year?
• Why isn't the 'total' column equal to the sum of the 'closed' and 'open' columns, as it originally was?

### Activity¶

Write a line of code that evaluates to the service that has been requested in the fewest number of neighborhoods.

Strategy:

1. Choose a column to group by.
2. Choose an aggregation method. Some common ones are .count(), .sum(), .mean(), .median(), .max(), and .min().
3. Sort the resulting DataFrame and extract the relevant piece of information (the service name).

Note: On an assignment, in questions like this with multiple correct answers (a tie for the fewest), you can put any one of them. We'll accept any correct answer.

### Observation #4¶

• The column names of the output of .groupby don't make sense when using the .count() aggregation method.
• Consider dropping unneeded columns and renaming columns as follows:
1. Use .assign to create a new column containing the same values as the old column(s).
2. Use .drop(columns=list_of_column_labels) to drop the old column(s).

## More practice: IMDb dataset 🎞️¶

### Question: What was the highest rated movie of the 1990s?¶

Let's try to do this two different ways.

#### Without grouping¶

Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different than the way we extract columns, with .get(), because the index is not a column.

#### With grouping¶

• It turns out that this method does not yield the correct answer.
• When we use an aggregation method (e.g. .max()), aggregation is done to each column individually.
• While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
• Taking the max is not helpful here.

## Challenge problems¶

We won't cover these problems in class, but they're here for you to practice with some harder examples. To access the solutions, you'll need to watch this solution walkthrough video (start at 10:00).

Before watching the video, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams!

### Question: How many years have more than 3 movies rated above 8.5?¶

#### Aside: Using .sum() on a Boolean array/Series¶

• Summing a Boolean array/Series gives a count of the number of True elements. This is because Python treats True as 1 and False as 0.
• Can you use that fact here?

### Question: Which year had the longest movie titles, on average?¶

Hint: Use .str.len() on the column or index that contains the names of the movies.