Selecting the best director with the help of data science

Vadym Byesyedin
3 min read · Jul 8, 2020

Data science is an interdisciplinary field that focuses on making inferences from large data sets. It covers data cleaning, manipulation, analysis, visualization and presentation of findings in order to inform high-level decisions in an organization. As such, it draws on skills from computer science, mathematics, statistics, information visualization, graphic design, and business.

Big data has quickly become a vital tool for business everywhere, from Amazon’s sales to TV and drug discovery. Data scientists are responsible for breaking big data down into usable information that guides decision making. The impact of big data today cannot be overestimated: for instance, it plays an essential role in tracking the spread of COVID-19. Using GPS location and time, we can find individuals who were possibly infected by someone with a confirmed case of the disease.

In this article we will determine the best movie director as an example of a data-science-driven decision. We will work with a movie-related dataset that was obtained from IMDb and contains information about movies from 2010 to 2020.

This analysis was performed using Python and its libraries, such as Pandas (for working with dataframes) and Seaborn (for visualizing results). The analysis code was omitted from this article to keep it concise.

We begin with ETL (extract, transform, load): extracting and cleaning the data, then performing exploratory data analysis (EDA). The cleaned dataset, which was saved, included the following variables: movie_name, popularity, director and profit. After sorting by popularity we obtained the following table.

Top directors based on their most popular film from 2010 to 2020
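The sorting step can be sketched roughly as follows. This is only an illustration, not the code used for the article: the rows are made-up toy values, and only the column names (movie_name, popularity, director, profit) come from the text.

```python
import pandas as pd

# Hypothetical toy rows; the real IMDb-derived dataset is not reproduced here.
movies = pd.DataFrame({
    "movie_name": ["Film A", "Film B", "Film C", "Film D"],
    "popularity": [7.5, 9.1, 6.3, 8.2],
    "director": ["Director X", "Director Y", "Director X", "Director Z"],
    "profit": [120.0, 300.0, -15.0, 80.0],
})

# Rank movies from most to least popular and renumber the rows.
top_by_popularity = (
    movies.sort_values("popularity", ascending=False)
    .reset_index(drop=True)
)

print(top_by_popularity[["director", "movie_name", "popularity"]])
```

With real data, `.head(10)` on the sorted frame would give a top-10 table.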

But what happens when a popular movie was directed by several people (like in the table above)? How can we decide which director is the best?

If a director has worked on more than one movie, we can group the movies by director and use the mean popularity of each director's movies.
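A minimal sketch of this grouping, assuming one row per movie and director pair and made-up values (how co-directed movies are stored in the real dataset is not stated in the article):

```python
import pandas as pd

# Hypothetical toy rows, one row per (movie, director) pair.
movies = pd.DataFrame({
    "movie_name": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "popularity": [7.0, 9.0, 6.0, 8.0, 4.0],
    "director": ["Director X", "Director Y", "Director X",
                 "Director Z", "Director Z"],
})

# Mean popularity and number of films for every director.
per_director = movies.groupby("director")["popularity"].agg(
    mean_popularity="mean",
    n_films="count",
)

# Keep only directors with more than one film, best mean first.
multi_film = (
    per_director[per_director["n_films"] > 1]
    .sort_values("mean_popularity", ascending=False)
)

print(multi_film)
```

Keeping the film count next to the mean makes it easy to see how many observations each average rests on.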

Top directors based on their films mean popularity from 2010 to 2020

Interesting, huh?

Most directors made only 1 or 2 movies, so the low number of observations per director should be taken into account when comparing means.
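One way to check how many films each director has is to count rows per director and then count how often each film count occurs. The data below is a hypothetical toy example, not the article's dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per movie, labelled by director.
movies = pd.DataFrame({
    "director": ["X", "Y", "X", "Z", "W", "Y", "X"],
})

# Number of films per director.
films_per_director = movies["director"].value_counts()

# Number of directors that have 1 film, 2 films, and so on.
distribution = films_per_director.value_counts().sort_index()

print(distribution)
```

If most of the mass sits at 1 or 2 films, a per-director mean is based on very few observations and should be read with caution.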

Film popularity distribution grouped by number of director films

We can also get insights about which directors made the most profitable movies. To do this, we sort the dataset by profit (here, profit = worldwide_gross - production_budget) and obtain the following table.

Top directors based on their most profitable film from 2010 to 2020

As in the previous example, we can take the mean profit for directors who have more than one film.

Top directors based on their films mean profit from 2010 to 2020
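The profit ranking can be sketched in the same way. The column names worldwide_gross and production_budget follow the formula given above; the numbers are invented for illustration and the units are arbitrary.

```python
import pandas as pd

# Hypothetical toy rows; values are not taken from the real dataset.
movies = pd.DataFrame({
    "movie_name": ["Film A", "Film B", "Film C", "Film D"],
    "director": ["Director X", "Director Y", "Director X", "Director Y"],
    "worldwide_gross": [500.0, 150.0, 90.0, 250.0],
    "production_budget": [200.0, 50.0, 100.0, 100.0],
})

# profit = worldwide_gross - production_budget
movies["profit"] = movies["worldwide_gross"] - movies["production_budget"]

# Most profitable single films first.
top_by_profit = movies.sort_values("profit", ascending=False)

# Mean profit per director, highest first.
mean_profit = (
    movies.groupby("director")["profit"]
    .mean()
    .sort_values(ascending=False)
)

print(top_by_profit[["director", "movie_name", "profit"]])
print(mean_profit)
```

Note that a film can have a negative profit (Film C here), which pulls a director's mean down even when another of their films is the single most profitable one.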

Conclusion

In this article, we looked at one of the simplest applications of data science for obtaining insights from data. We did not even touch cool and impressive things like combinatorics, probability theory, statistical distributions, regressions, machine learning, deep learning, neural networks, etc.
