# Lecture 5 – More Querying and GroupBy¶

## DSC 10, Winter 2023¶

### Announcements¶

• Lab 1 is due Saturday at 11:59PM.
• Homework 1 is due Tuesday at 11:59PM.
• Come to office hours for help! Mine are 12:30-2:30 today. See the calendar for directions.

### Agenda¶

• Recap: queries.
• Queries with multiple conditions.
• GroupBy.
• Extra practice, including challenge problems.

### About the Data: Get It Done service requests 👷¶

The requests DataFrame contains a summary of all service requests so far in 2022, broken down by neighborhood and service.

## Recap: queries¶

### What is a query? 🤔¶

• A "query" is code that extracts rows from a DataFrame for which certain condition(s) are true.
• We often use queries to filter DataFrames so that they only contain the rows that satisfy the conditions stated in our questions.

### How do we query a DataFrame?¶

To select only certain rows of requests:

1. Make a sequence (list/array/Series) of Trues (keep) and Falses (toss), usually by making a comparison.
2. Then pass it into requests[sequence_goes_here].
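The two steps above can be sketched as follows. The data here is made up for illustration (the real requests DataFrame isn't shown in these notes), and plain pandas is used since babypandas supports the same calls:

```python
import pandas as pd

# Hypothetical miniature version of the requests data (invented for illustration).
requests = pd.DataFrame().assign(
    neighborhood=['Downtown', 'La Jolla', 'Downtown'],
    service=['Pothole', 'Graffiti', 'Tree Maintenance'],
    total=[120, 45, 30],
)

# Step 1: build a Boolean Series -- True marks the rows to keep.
is_downtown = requests.get('neighborhood') == 'Downtown'

# Step 2: pass the Boolean Series into requests[...] to keep only those rows.
downtown = requests[is_downtown]
```

Here, `downtown` contains only the two 'Downtown' rows.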

### Element-wise comparisons¶

There are several types of comparisons we can make.

| Symbol | Meaning |
| --- | --- |
| == | equal to |
| != | not equal to |
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
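These comparisons are applied element-wise, one entry at a time. A quick sketch with made-up numbers:

```python
import pandas as pd

totals = pd.Series([120, 45, 30])

# The comparison is applied element-wise, producing a Boolean Series.
above_40 = totals > 40
```

`above_40` contains True, True, False, one Boolean per element of `totals`.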

## Example 5: Which neighborhood has the most 'Tree Maintenance' requests? 🌳¶

Key concept: Querying

### Strategy¶

1. Query to extract a DataFrame of just the 'Tree Maintenance' requests.
2. Sort by 'total' in descending order.
3. Extract the first element from the 'neighborhood' column.
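The three-step strategy above looks like this in code. The data is made up; the real requests DataFrame would be used in lecture:

```python
import pandas as pd

# Made-up data standing in for the real requests DataFrame.
requests = pd.DataFrame().assign(
    neighborhood=['Downtown', 'La Jolla', 'University'],
    service=['Tree Maintenance', 'Tree Maintenance', 'Pothole'],
    total=[30, 55, 80],
)

# 1. Keep only the 'Tree Maintenance' rows.
trees = requests[requests.get('service') == 'Tree Maintenance']

# 2. Sort by 'total', largest first.
trees = trees.sort_values(by='total', ascending=False)

# 3. Grab the first entry of the 'neighborhood' column.
answer = trees.get('neighborhood').iloc[0]
```

With this toy data, `answer` is 'La Jolla', the neighborhood with the larger 'Tree Maintenance' total.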

### Concept Check ✅ – Answer at cc.dsc10.com¶

Which expression below evaluates to the total number of service requests in the 'Downtown' neighborhood?

A. requests[requests.get('neighborhood') == 'Downtown'].get('total').sum()

B. requests.get('total').sum()[requests.get('neighborhood') == 'Downtown']

C. requests['Downtown'].get('total').sum()

D. More than one of the above.

### Activity 🚘¶

Question: What is the most commonly requested service in the 'University' neighborhood (near UCSD)?

Write one line of code that evaluates to the answer.

## Example 6: How many service requests were for 'Pothole' or 'Pavement Maintenance'?¶

Key concept: Queries with multiple conditions.

### Multiple conditions¶

• To write a query with multiple conditions, use & for "and" and | for "or".
• You must use (parentheses) around each condition!
• ⚠️ Don't use the Python keywords `and` and `or` here! They don't behave the way you'd want when comparing Series.
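Here is a sketch of both kinds of multi-condition queries, again on invented data:

```python
import pandas as pd

# Invented stand-in for the requests DataFrame.
requests = pd.DataFrame().assign(
    neighborhood=['Downtown', 'Downtown', 'La Jolla'],
    service=['Pothole', 'Pavement Maintenance', 'Pothole'],
    total=[10, 20, 5],
)

# "or" (|): keep rows whose service is 'Pothole' OR 'Pavement Maintenance'.
# Note the parentheses around each condition -- they are required!
either = requests[(requests.get('service') == 'Pothole') |
                  (requests.get('service') == 'Pavement Maintenance')]

# "and" (&): keep 'Pothole' rows that are ALSO in 'Downtown'.
both = requests[(requests.get('service') == 'Pothole') &
                (requests.get('neighborhood') == 'Downtown')]
```

In this toy example, `either` keeps all three rows, while `both` keeps only the first.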

### Concept Check ✅ – Answer at cc.dsc10.com¶

Each of the following questions can be answered by querying the requests DataFrame.

1. Which neighborhood had the most 'Street Flooded' requests?
2. In the 'Kearny Mesa' neighborhood, how many different types of services have open requests?
3. How many requests have been closed in the 'La Jolla' neighborhood?

How many of the questions above require the query to have multiple conditions?

A. 0    B. 1    C. 2    D. 3

Bonus: Try to write the code to answer each question.

### Selecting rows by position with .take¶

• Querying allows us to select rows that satisfy a certain condition.
• We can also select rows in specific positions with .take([list_of_integer_positions]). This keeps only the rows whose positions are in the specified list.
• This is analogous to using .iloc[] on a Series.
• It's rare to need to select rows by integer position. Querying is far more useful.
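A minimal sketch of .take, on made-up data:

```python
import pandas as pd

# Invented stand-in for the requests DataFrame.
requests = pd.DataFrame().assign(
    neighborhood=['Downtown', 'La Jolla', 'University', 'Kearny Mesa'],
    total=[120, 45, 80, 60],
)

# Keep only the rows at integer positions 0 and 2 (the first and third rows).
subset = requests.take([0, 2])
```

`subset` keeps the 'Downtown' and 'University' rows, in that order.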

## Example 7: Which neighborhood had the most Get It Done requests?¶

Key concept: Grouping by one column.

### Organizing requests by neighborhood¶

• We can find the total number of Get It Done requests for any one neighborhood.
• For example, requests[requests.get('neighborhood') == 'Carmel Valley'].get('total').sum().
• But how can we find the total requests for every neighborhood at the same time?

It seems like there has to be a better way. And there is!

### GroupBy: Split, aggregate, and combine¶

Observe what happens when we use the .groupby method on requests with the argument 'neighborhood'.

Note that the 'total' counts for Carmel Valley and Torrey Hills are the same as we saw on the previous slide. What just happened? 🤯
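The grouping described above can be sketched as follows, with invented counts for two neighborhoods (the real numbers come from the requests data used in lecture):

```python
import pandas as pd

# Made-up request counts; the real data has many rows per neighborhood.
requests = pd.DataFrame().assign(
    neighborhood=['Carmel Valley', 'Carmel Valley', 'Torrey Hills'],
    total=[10, 15, 7],
)

# One output row per unique 'neighborhood'; 'total' holds each group's sum.
totals = requests.groupby('neighborhood').sum()
```

The result has one row per neighborhood, indexed by neighborhood name, with each group's 'total' values added up.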

### An illustrative example: Pets 🐱 🐶🐹¶

Consider the DataFrame pets shown below.

### Visualizing pets.groupby('Species').mean()¶

1. Split the rows of pets into "groups" according to their values in the 'Species' column.
2. Aggregate the rows with the same value of 'Species' by taking the mean of all numerical columns.
3. Combine these means into a new DataFrame that is indexed by 'Species' and sorted by 'Species' in ascending order.

Note that the result contains just one row for cats, one row for dogs, and one row for hamsters!
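A tiny stand-in for the pets example (the column names and values here are invented, since the actual pets DataFrame isn't reproduced in these notes):

```python
import pandas as pd

# Hypothetical pets data with one numeric column.
pets = pd.DataFrame().assign(
    Species=['dog', 'cat', 'dog', 'hamster'],
    Weight=[40.0, 10.0, 60.0, 1.0],
)

# Split by 'Species', average each group's numeric columns, combine the results.
mean_weights = pets.groupby('Species').mean()
```

The result is indexed by 'Species' in ascending order (cat, dog, hamster), and the two dog rows collapse into one row with the average weight.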

### Pandas Tutor¶

• In the last cell, we saw not just the output of the code, but a visualization of the inner workings of the code.
• This is thanks to Pandas Tutor, a new tool developed by Sam Lau, who taught this course over the summer.
• Pandas Tutor draws diagrams to explain pandas (and babypandas) code.
• Add %%pt to the top of a code cell to explain the last line of babypandas code.
• This requires Pandas Tutor to be imported, which we already did in this notebook.
• You can also use Pandas Tutor through its website, pandastutor.com.

### Using .groupby in general¶

In short, .groupby aggregates all rows with the same value in a specified column (e.g. 'neighborhood') into a single row in the resulting DataFrame, using an aggregation method (e.g. .sum()) to combine values.

1. Choose a column to group by.
• .groupby(column_name) will gather rows which have the same value in the specified column (column_name).
• On the previous slide, we grouped by 'neighborhood'.
• In the resulting DataFrame, there was one row for every unique value of 'neighborhood'.
2. Choose an aggregation method.
• The aggregation method will be applied within each group.
• On the previous slide, we applied the .sum() method to every 'neighborhood'.
• The aggregation method is applied individually to each column (e.g. the sums were computed separately for 'closed', 'open', and 'total').
• If it doesn't make sense to use the aggregation method on a column, the column is dropped from the output – we'll look at this in more detail shortly.
• Common aggregation methods include .count(), .sum(), .mean(), .median(), .max(), and .min().

### Observation #1¶

• The index has changed to neighborhood names.
• In general, the new row labels are the group labels (i.e., the unique values in the column that we grouped on).

### Observation #2¶

The 'service' column has disappeared. Why?

### Disappearing columns ✨🐇🎩¶

• The aggregation method – .sum(), in this case – is applied to each column.
• If it doesn't make sense to apply it to a particular column, that column will disappear.
• For instance, we can't sum strings, like in the 'service' column.
• However, we can compute the max of several strings. How?

### Observation #3¶

• The aggregation method is applied to each column separately.
• The rows of the resulting DataFrame need to be interpreted with care.

Why isn't the 'total' column equal to the sum of the 'closed' and 'open' columns, as it originally was?

### Two choices to make when using .groupby¶

How do we find the number of different services requested in each neighborhood?

Two choices:

1. What column should we group by?
2. What aggregation method should we use?
• Some common ones are .count(), .sum(), .mean(), .median(), .max(), and .min().

### Observation #4¶

• The column names of the output of .groupby don't make sense when using the .count() aggregation method.
• Consider dropping unneeded columns and renaming columns as follows:
1. Use .assign to create a new column containing the same values as the old column(s).
2. Use .drop(columns=list_of_column_labels) to drop the old column(s). Alternatively, use .get(list_of_column_labels) to keep only the columns in the given list. The columns will appear in the order you specify, so this is also useful for reordering columns!
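The rename-and-drop recipe above might look like this, on invented data (the column name num_services is a made-up choice):

```python
import pandas as pd

# Invented stand-in for the requests DataFrame.
requests = pd.DataFrame().assign(
    neighborhood=['Downtown', 'Downtown', 'La Jolla'],
    service=['Pothole', 'Graffiti', 'Pothole'],
    total=[10, 20, 5],
)

# .count() counts the rows in each group; every column shows the same counts,
# so the original column names ('service', 'total') no longer describe the values.
counts = requests.groupby('neighborhood').count()

# Copy one column into a sensibly named new one, then keep only that column.
counts = counts.assign(num_services=counts.get('service')).get(['num_services'])
```

After the last line, the output has a single, accurately named column: the number of rows (services) per neighborhood.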

## More practice: IMDb dataset 🎞️¶

### Question: What was the highest rated movie of the 1990s?¶

Let's try to do this two different ways.

#### Without grouping¶

Note: The command to extract the index of a DataFrame is .index - no parentheses! This is different from the way we extract columns, with .get(), because the index is not a column.

#### With grouping¶

• It turns out that this method does not yield the correct answer.
• When we use an aggregation method (e.g. .max()), aggregation is done to each column individually.
• While it's true that the highest rated movie from the 1990s has a rating of 9.2, that movie is not Unforgiven – instead, Unforgiven is the movie that's the latest in the alphabet among all movies from the 1990s.
• Taking the max is not helpful here.
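The pitfall can be reproduced on a tiny invented dataset (these titles and ratings are made up to mirror the lecture's point, not taken from the real IMDb data):

```python
import pandas as pd

# Invented 1990s movies illustrating the pitfall.
movies = pd.DataFrame().assign(
    decade=['1990s', '1990s', '1990s'],
    title=['Goodfellas', 'Unforgiven', 'The Shawshank Redemption'],
    rating=[8.7, 8.2, 9.2],
)

per_decade = movies.groupby('decade').max()
# The max rating (9.2) and the max title ('Unforgiven', alphabetically last)
# come from DIFFERENT rows -- each column is maximized independently.
```

The 9.2 in the output belongs to 'The Shawshank Redemption', not 'Unforgiven', so reading the row as a single movie is a mistake.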

## Challenge problems¶

We won't cover these problems in class, but they're here for you to practice with some harder examples. To access the solutions, you'll need to watch this solution walkthrough video (start at 10:00).

Before watching the video, make sure to try these problems on your own – they're great prep for homeworks, projects, and exams!

### Question: How many years have more than 3 movies rated above 8.5?¶

#### Aside: Using .sum() on a Boolean array/Series¶

• Summing a Boolean array/Series gives a count of the number of True elements. This is because Python treats True as 1 and False as 0.
• Can you use that fact here?
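A minimal sketch of the Boolean-sum trick, with made-up ratings:

```python
import pandas as pd

ratings = pd.Series([9.0, 8.2, 8.7, 7.9])

# True counts as 1 and False as 0, so summing counts the True entries.
num_high = (ratings > 8.5).sum()
```

Here `num_high` is 2, since exactly two ratings exceed 8.5.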

### Question: Which year had the longest movie titles, on average?¶

Hint: Use .str.len() on the column or index that contains the names of the movies.
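A quick sketch of the hint, using a few arbitrary titles:

```python
import pandas as pd

titles = pd.Series(['Fargo', 'Goodfellas', 'Titanic'])

# .str.len() gives the length of each string in the Series.
lengths = titles.str.len()
```

`lengths` contains 5, 10, and 7, the character counts of the three titles; averaging such lengths within each year's group answers the question.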

## Summary, next time¶

### Summary¶

• We can write queries that involve multiple conditions, as long as we:
• Put parentheses around all conditions.
• Separate conditions using & if you require all to be true, or | if you require at least one to be true.
• The method call df.groupby(column_name).agg_method() aggregates all rows with the same value for column_name into a single row in the resulting DataFrame, using agg_method() to combine values.
• Aggregation methods to know: .count(), .sum(), .mean(), .median(), .max(), and .min().

### Next time¶

A picture is worth a thousand words – it's time to visualize!