Cleaned an IMDB movies dataset of more than 6,000 rows in a Jupyter Notebook, using Python libraries such as NumPy and pandas: removed null values, dropped unnecessary rows and columns, and deleted duplicates so the data could be used for further analysis. Then answered various questions with Python to generate new insights.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
movies = pd.read_csv('IMDB_Movies.csv')
originaldata = movies.copy()  # keep an untouched copy for later comparison
movies
movies.shape
movies.info()
Calculating the number of null values per row and per column, and finding the percentage of null values in each column.
movies.isnull().sum(axis=0).sort_values(ascending=False)
movies.isnull().sum(axis=1).sort_values(ascending=False)
movies.isnull().sum(axis=0).sort_values(ascending=False)/len(movies) * 100
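As a minimal sketch of how these null counts behave, on a toy DataFrame (not the IMDB data):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (illustrative only)
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

nulls_per_column = df.isnull().sum(axis=0)    # a: 1, b: 2
nulls_per_row = df.isnull().sum(axis=1)       # rows: 1, 2, 0
pct_null = df.isnull().sum() / len(df) * 100  # percentage of nulls per column
```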
Columns to drop, since they have many null values or are not needed for the analysis:
color
director_facebook_likes
actor_1_facebook_likes
actor_2_facebook_likes
actor_3_facebook_likes
actor_2_name
cast_total_facebook_likes
actor_3_name
duration
facenumber_in_poster
content_rating
country
movie_imdb_link
aspect_ratio
plot_keywords
movies = movies.drop([
'color',
'director_facebook_likes',
'actor_1_facebook_likes',
'actor_2_facebook_likes',
'actor_3_facebook_likes',
'actor_2_name',
'cast_total_facebook_likes',
'actor_3_name',
'duration',
'facenumber_in_poster',
'content_rating',
'country',
'movie_imdb_link',
'aspect_ratio',
'plot_keywords'],axis=1)
movies
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100,2)
movies = movies[movies['gross'].notnull()]
movies = movies[movies['budget'].notnull()]
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100,2)
Drop rows that have more than five NaN values. Such rows are not of much use for the analysis and should be removed.
(movies.isnull().sum(axis=1).sort_values(ascending=False) > 5).sum()
movies = movies[movies.isnull().sum(axis=1) <= 5]
movies
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100,2)
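The same row filter can also be expressed with dropna's thresh parameter, which gives the minimum number of non-null values a row must have to survive. A toy sketch (not the IMDB data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 7)))
df.iloc[0, :6] = np.nan  # row 0 has 6 NaNs
df.iloc[1, :3] = np.nan  # row 1 has 3 NaNs

# Keep rows with at most 5 NaNs, i.e. at least (ncols - 5) non-null values
kept = df.dropna(thresh=df.shape[1] - 5)
# row 0 is dropped; rows 1 and 2 survive
```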
movies.language.describe()
movies.language = movies.language.fillna('English')
round(movies.isnull().sum().sort_values(ascending=False)/len(movies)*100,2)
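The notebook uses describe() to see that 'English' is the most frequent language before filling the nulls with it. The same fill can be done generically via mode(); a toy sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(['English', 'English', 'French', np.nan])

# mode()[0] is the most frequent value, the same value describe() reports as 'top'
most_common = s.mode()[0]
filled = s.fillna(most_common)
```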
movies
len(movies)/len(originaldata)* 100
# Express budget and gross in millions for readability
movies['budget'] = movies['budget']/1000000
movies['gross'] = movies['gross']/1000000
movies
Finding the movies with the highest profit and storing the top ten in a new DataFrame, top10.
movies['profit'] = movies['gross'] - movies['budget']
movies
movies.sort_values(by='profit',ascending=False)
top10 = movies.sort_values(by='profit', ascending=False).head(10)
top10
movies.drop_duplicates(keep='first',inplace=True)
movies
top10 = movies.sort_values(by='profit', ascending=False).head(10)
top10
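The profit-and-sort step above can be sketched on a toy DataFrame (hypothetical movie names and figures, not the IMDB data):

```python
import pandas as pd

df = pd.DataFrame({'movie': ['A', 'B', 'C'],
                   'gross': [300.0, 50.0, 120.0],
                   'budget': [100.0, 60.0, 20.0]})

# profit = gross - budget; sort descending and keep the top entries
df['profit'] = df['gross'] - df['budget']
top = df.sort_values(by='profit', ascending=False).head(2)
```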
Finding the top 100 movies with the highest IMDb rating and more than 25,000 voted users. Add a Rank column containing the values 1 to 100.
IMDb_Top_100 = movies[movies['num_voted_users'] > 25000].sort_values(by='imdb_score', ascending=False).head(100)
IMDb_Top_100
IMDb_Top_100['Rank'] = IMDb_Top_100['imdb_score'].rank(method='first', ascending=False)
IMDb_Top_100
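rank(method='first') assigns ranks in order of appearance, so movies tied on imdb_score still get distinct ranks 1 to 100. A toy illustration:

```python
import pandas as pd

scores = pd.Series([9.2, 9.2, 8.9])

# Ties (the two 9.2 scores) get distinct ranks in order of appearance
ranks = scores.rank(method='first', ascending=False)
```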
IMDb_Top_100[IMDb_Top_100['language']!='English']
top10director = movies.groupby('director_name').imdb_score.mean().sort_values(ascending=False).head(10)
top10director
TempGenre = movies.genres.str.split('|', expand=True).iloc[:, 0:2]
TempGenre.columns = ['genre_1', 'genre_2']
# Movies with a single genre get NaN in genre_2; fill it from genre_1
TempGenre['genre_2'] = TempGenre['genre_2'].fillna(TempGenre['genre_1'])
TempGenre
movies = pd.concat([movies, TempGenre], axis=1)
movies
movies.groupby(['genre_1','genre_2']).gross.mean().sort_values(ascending=False).head(5)
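The genre-splitting step can be sketched on a toy Series (hypothetical genre strings):

```python
import pandas as pd

genres = pd.Series(['Action|Adventure|Sci-Fi', 'Drama'])

# expand=True spreads the split parts across columns; keep the first two
split = genres.str.split('|', expand=True).iloc[:, 0:2]
split.columns = ['genre_1', 'genre_2']

# Single-genre rows have a missing genre_2; fall back to genre_1
split['genre_2'] = split['genre_2'].fillna(split['genre_1'])
```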
Meryl_Streep = movies[movies['actor_1_name'] == 'Meryl Streep']
Leo_Caprio = movies[movies['actor_1_name'] == 'Leonardo DiCaprio']
Brad_Pitt = movies[movies['actor_1_name'] == 'Brad Pitt']
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
Combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt])
Combined
Combined.groupby('actor_1_name').num_critic_for_reviews.mean()
Combined
Combined.num_user_for_reviews = Combined.num_user_for_reviews.astype('int')
Combined.num_user_for_reviews
Combined.groupby('actor_1_name').num_user_for_reviews.mean()
Combined.groupby('actor_1_name')[['num_critic_for_reviews','num_user_for_reviews']].mean()