Problem
Visualize movie data to see whether there is a correlation between movie budget, popularity, and average rating.
Approach
Dashboard consisting of four main sections:
- A scatter plot: popularity vs budget. Display movie details on hover and click (a D3 sketch follows this list)
- Movie details: a section containing basic information about the movie (title, poster, budget, etc.)
- A map showing the number of movies filmed across the world. Filter by one or multiple countries by clicking the map
- A bar chart showing number of movies per genre. Filter by one or multiple genres by clicking the bar chart
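As an illustration of how the scatter plot is wired up, here is a minimal D3 sketch. The container selector, the field names (budget, popularity), and the onSelect callback that feeds the movie-details section are assumptions for illustration rather than the exact production code:

```js
import * as d3 from 'd3';

// Minimal sketch of the popularity-vs-budget scatter plot.
function drawScatter(movies, onSelect) {
  const width = 600, height = 400, margin = 40;

  const svg = d3.select('#scatter')          // assumed container element
    .append('svg')
    .attr('width', width)
    .attr('height', height);

  // Scales map raw budget/popularity values to pixel coordinates.
  const x = d3.scaleLinear()
    .domain(d3.extent(movies, d => d.budget))
    .range([margin, width - margin]);
  const y = d3.scaleLinear()
    .domain(d3.extent(movies, d => d.popularity))
    .range([height - margin, margin]);

  svg.append('g')
    .attr('transform', `translate(0, ${height - margin})`)
    .call(d3.axisBottom(x));
  svg.append('g')
    .attr('transform', `translate(${margin}, 0)`)
    .call(d3.axisLeft(y));

  // One circle per movie; hovering or clicking a point surfaces that
  // movie in the details section.
  svg.selectAll('circle')
    .data(movies)
    .join('circle')
    .attr('cx', d => x(d.budget))
    .attr('cy', d => y(d.popularity))
    .attr('r', 4)
    .on('mouseover', (event, d) => onSelect(d))
    .on('click', (event, d) => onSelect(d));
}
```

The map and the bar chart can be wired up the same way: a draw function that takes the current data plus a callback that applies the country or genre filter.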
Choice of technologies
- The visualizations are created using D3, an open-source library for building many kinds of complex, custom charts.
- Due to the simple nature of the page, I decided not to use a front-end framework and to stick with vanilla JavaScript, HTML, and Sass compiled to CSS.
- It is easier to store movie data as documents than as relational tables, so I chose MongoDB on the back end.
- Data is pulled from the back-end API: an Express application using the MongoDB driver, running on Node.js (a sketch of a typical endpoint follows this list).
- The application is built using Parcel and hosted on an AWS EC2 instance running the Nginx web server.
- Continuous integration is set up using Jenkins and GitHub webhooks.
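For context, the back-end API mentioned above boils down to a small Express application that talks to MongoDB through the official Node.js driver. A minimal sketch, assuming a local MongoDB instance and an illustrative /api/movies route with an optional genre filter:

```js
const express = require('express');
const { MongoClient } = require('mongodb');

const app = express();
const client = new MongoClient('mongodb://localhost:27017'); // assumed local DB

// Illustrative endpoint: return movies, optionally filtered by genre,
// e.g. GET /api/movies?genre=Drama
app.get('/api/movies', async (req, res) => {
  const movies = client.db('movies').collection('movies');
  const filter = req.query.genre ? { genres: req.query.genre } : {};
  const docs = await movies.find(filter).toArray();
  res.json(docs);
});

client.connect().then(() => app.listen(3000));
```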
Challenges
- Data formatting: The data were acquired from TMDB in CSV format, with individual fields wrapped in double quotation marks. Some of the fields (such as the movie plot) contained single quotes, double quotes, and commas. Additionally, object keys were wrapped in single quotes, which made the data impossible to parse as JSON as is. To solve this, I reformatted the file in three steps and then loaded it into the database (sketches of both scripts follow this list):
- Wrote a script to convert the file from CSV format to DSV, changing the field separator from comma to pipe ( | ).
- Using a regular expression, took out all double quotes that wrapped individual fields.
- Using another regular expression, replaced single quotes that wrapped object keys with double quotes, without replacing the apostrophes.
- Wrote a script to read the resulting DSV file and populate the database.
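To make the reformatting concrete, here is a condensed sketch of the conversion script. The regular expressions are approximations of the ones described above (the exact patterns depend on the TMDB export), and the file names are illustrative:

```js
const fs = require('fs');

const reformatted = fs
  .readFileSync('movies.csv', 'utf8')
  .split('\n')
  .map(line => line
    // Step 1: switch the field separator from "," to "|" (CSV -> DSV).
    .replace(/","/g, '"|"')
    // Step 2: strip the double quotes that wrap individual fields.
    .replace(/"\|"/g, '|')
    .replace(/^"|"$/g, '')
    // Step 3: turn single-quoted object keys into double-quoted ones
    // (e.g. 'name': -> "name":) without touching apostrophes in the text.
    .replace(/'(\w+)':/g, '"$1":'))
  .join('\n');

fs.writeFileSync('movies.dsv', reformatted);
```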
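The loading script is equally small. A sketch assuming the pipe-separated file from the previous step, the d3-dsv package for parsing, and a local MongoDB instance (database, collection, and field names are illustrative):

```js
const fs = require('fs');
const { dsvFormat } = require('d3-dsv');
const { MongoClient } = require('mongodb');

async function load() {
  // Parse the pipe-separated file into one object per movie.
  const rows = dsvFormat('|').parse(fs.readFileSync('movies.dsv', 'utf8'));

  // Nested fields such as genres are now valid JSON strings; parse them
  // so they are stored as real arrays/objects in MongoDB.
  const movies = rows.map(r => ({ ...r, genres: JSON.parse(r.genres) }));

  const client = await MongoClient.connect('mongodb://localhost:27017');
  await client.db('movies').collection('movies').insertMany(movies);
  await client.close();
}

load();
```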
- Performance tuning: When a large time interval was chosen and no other filters were applied, a large amount of data had to be retrieved from the DB and transferred to the front end. The initial version of the page took almost 3 seconds to load, with requests for more data taking up to 1.5 seconds. To decrease the load time, I took a number of steps, sketched after this list:
- Decreased the number of fields retrieved from the database. The plot field is not read up front even though it is needed when the user clicks on a movie to see its details; instead, it is pulled only if and when the user actually requests the details. This introduced a slight delay in retrieving the movie details, but significantly decreased page load time and improved overall responsiveness.
- Delegated data processing to the MongoDB engine. Instead of retrieving every movie filmed in each country, the movies are counted per country in the DB and only the final numbers are transferred.
- Set up indices on the DB to speed up reads: since new data is never written to the DB, there is no downside to adding several indices to the collection.
- Cached data on the front end: after each response from the back-end API, the retrieved data is saved in memory. For any subsequent request, if some or all of the data is cached, it is read from memory and only the missing part is requested.
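On the back-end side, the first three points boil down to a projection, an aggregation pipeline, and a couple of index definitions. A sketch using the Node.js MongoDB driver; the field names (release_date, production_countries, and so on), collection name, and date-range parameters are assumptions for illustration:

```js
const { MongoClient } = require('mongodb');

async function loadDashboardData(from, to) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const movies = client.db('movies').collection('movies');
  const inRange = { release_date: { $gte: from, $lte: to } };

  // Projection: the scatter plot needs only a handful of fields, so the
  // heavy plot text is excluded here and fetched separately on demand.
  const points = await movies
    .find(inRange)
    .project({ title: 1, budget: 1, popularity: 1, vote_average: 1 })
    .toArray();

  // Aggregation: count movies per production country inside MongoDB and
  // transfer only the totals instead of the raw documents.
  const perCountry = await movies.aggregate([
    { $match: inRange },
    { $unwind: '$production_countries' },
    { $group: { _id: '$production_countries', count: { $sum: 1 } } }
  ]).toArray();

  // Indices: the collection is read-only, so extra indices have no write
  // cost and keep the range query above fast (in practice they are
  // created once, not on every request).
  await movies.createIndex({ release_date: 1 });
  await movies.createIndex({ genres: 1 });

  await client.close();
  return { points, perCountry };
}
```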
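On the front end, the cache is essentially an in-memory map keyed by the request parameters. A simplified sketch that caches whole responses (the actual version also merges partially cached results so that only the missing part is requested); the /api/movies endpoint and the key scheme are illustrative:

```js
// In-memory cache keyed by the query string of each API request.
const cache = new Map();

async function fetchMovies(params) {
  const key = new URLSearchParams(params).toString();
  if (cache.has(key)) {
    return cache.get(key);          // repeated request: serve from memory
  }
  const response = await fetch(`/api/movies?${key}`);
  const data = await response.json();
  cache.set(key, data);             // remember the result for next time
  return data;
}

// Usage: the second call with the same filters never hits the network.
// fetchMovies({ genre: 'Drama', from: '2000', to: '2010' });
```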