Python for Data Science – Data frames – Part 2

In this second part, we are going to talk about aggregation, merging and filtering! These are really important parts of data analysis.

I’ll use a dataset that I’m working with to learn data science, so some things may not make sense to you if you haven’t seen the earlier steps.

Still, the important thing is that you understand what is happening, because you can reuse these steps in your own analyses. Here we go!

Let’s create a data frame that will contain the average rating for each movie. For that, we’ll take the ratings data frame and group it by movie ID. We’ll also set as_index to False so that the index starts at 0 instead of using the movie ID. Finally, we’ll use the mean function to get each movie’s average rating.

  • avg_ratings – a variable that will receive a new data frame with the average ratings
  • ratings – the data frame that contains the ratings
  • groupby – a method to group the data
  • movieId – the column that will be used for the grouping
  • as_index – set to False because I want the index to restart at 0 instead of using the movie ID as the index
  • mean() – computes the mean of each column within every movie’s group
  • del avg_ratings['userId'] – removes a column that isn’t needed
  • avg_ratings.head() – returns the first five records of my new data frame
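
Here is a minimal sketch of the steps above. The tiny ratings data frame is made up just for illustration (it is not my real dataset), and I’m assuming it only has the userId, movieId and rating columns:

```python
import pandas as pd

# Hypothetical sample data standing in for the real ratings data frame.
ratings = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2],
    "movieId": [10, 10, 10, 20, 20],
    "rating": [4.0, 5.0, 3.0, 2.0, 3.0],
})

# Group by movie, keep movieId as a regular column (as_index=False),
# and average every other column within each group.
avg_ratings = ratings.groupby("movieId", as_index=False).mean()

# The average of the user IDs means nothing, so drop that column.
del avg_ratings["userId"]

# Show the first five records (here there are only two movies).
print(avg_ratings.head())
```

Because of as_index=False, movieId stays as a normal column and the index simply counts 0, 1, 2 and so on.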

Now we’ll combine these average ratings with the movie data in a new data frame called box_office.

  • box_office – a new data frame that merges the movies’ data and the averages
  • on – the column (or columns) used as the key for the merge
  • box_office.tail() – shows the last 5 records
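
A possible version of this merge looks like the sketch below. The movies data frame, its title and genres columns, and the inner join are my assumptions for the example. The sample values are invented:

```python
import pandas as pd

# Hypothetical stand-ins for the movie data and the averages computed earlier.
movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy", "Drama", "Comedy|Romance"],
})
avg_ratings = pd.DataFrame({
    "movieId": [10, 20],
    "rating": [4.0, 2.5],
})

# Inner join on movieId: only movies that have an average rating are kept,
# so movie 30 (no ratings) does not appear in the result.
box_office = movies.merge(avg_ratings, on="movieId", how="inner")

# Show the last five records (fewer if the data frame is shorter).
print(box_office.tail())
```

If you want to keep movies without any rating, how="left" is an alternative. Their rating would then show up as NaN.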

Let’s create a filter to know whether a movie is well rated, that is, whether its rating is greater than or equal to 4. After that, let’s show the last 5 using slicing.
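
As a sketch, the filter is just a comparison that returns one True/False value per row. The name is_well_rated and the sample data are hypothetical, and I’m assuming the slice is applied to the filtered rows:

```python
import pandas as pd

# Hypothetical sample of the merged data frame.
box_office = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G"],
    "rating": [4.5, 3.0, 4.0, 4.2, 5.0, 4.0, 4.8],
})

# Boolean Series: True where the average rating is 4 or higher.
is_well_rated = box_office["rating"] >= 4.0

# Keep only the well-rated rows, then slice the last five of them.
well_rated = box_office[is_well_rated][-5:]
print(well_rated)
```

The slice [-5:] counts from the end, so it works the same way as tail(5) here.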

Let’s create another filter to check whether the movie is a comedy. To do so, we’ll check whether the genre string contains ‘Comedy’. We’ll show the first five movies.
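
A minimal sketch of this check, assuming the genres are stored in a column called genres as text separated by "|" (both the column name and the sample data are assumptions for the example):

```python
import pandas as pd

# Hypothetical sample of the merged data frame.
box_office = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G"],
    "genres": ["Comedy", "Drama", "Comedy|Romance", "Action|Comedy",
               "Comedy", "Horror", "Comedy|Drama"],
})

# Boolean Series: True where the genre text contains the word 'Comedy'.
is_comedy = box_office["genres"].str.contains("Comedy")

# Keep only the comedies, then slice the first five of them.
comedies = box_office[is_comedy][:5]
print(comedies)
```

Note that str.contains is case sensitive by default, so 'comedy' in lower case would not match.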

Now let’s use the created filters to see which comedy movies are best rated:
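
One way to combine the two filters is the & operator, which keeps a row only when both conditions are True. This is a sketch with made-up data and the hypothetical names is_comedy and is_well_rated:

```python
import pandas as pd

# Hypothetical sample of the merged data frame.
box_office = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genres": ["Comedy", "Drama", "Comedy|Romance", "Action|Comedy", "Drama"],
    "rating": [4.5, 4.2, 3.0, 4.0, 2.0],
})

# The two filters, each a boolean Series with one value per row.
is_comedy = box_office["genres"].str.contains("Comedy")
is_well_rated = box_office["rating"] >= 4.0

# Element-wise AND: rows that are comedies and also rated 4 or higher.
best_comedies = box_office[is_comedy & is_well_rated]
print(best_comedies)
```

Use & (not the Python keyword and) when combining boolean Series, and wrap each comparison in parentheses if you write the conditions inline.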

One more tip for the day 😀