Amazon_Vine_Analysis

Amazon Vine Analysis


Overview

This project analyzes the reviews written by members of the paid Amazon Vine program. The program is a service that allows manufacturers and publishers to receive reviews for their products. Companies like SellBy pay a small fee to Amazon and provide products to Amazon Vine members, who are then required to provide a review.

In this project, I picked one reviewed product category out of approximately 50, ranging from clothing apparel to wireless products.

For Deliverable 1, I will use PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into pgAdmin.

For Deliverable 2, I will use PySpark to read the exported vine_table, in CSV format, into a DataFrame and determine whether there is any bias toward favorable reviews from Vine members.

The results above will give SellBy information to analyze when deciding whether or not to invest in the Amazon Vine program.

Resources

GitHub Application Link

Amazon Vine Analysis

Deliverable 1: Perform ETL on Amazon Product Reviews

For my analysis, I chose the Musical Instruments product category. My family is large, and my parents encouraged us to join clubs, play sports, and join the band. All six of us kids played at least two instruments from middle school through high school, and I even played during my first year of college. So this product caught my eye right away.

Once the program was connected, I inserted the contents of each table DataFrame into its matching database table.

# Write each DataFrame to its matching table (jdbc_url, mode, and config are set before this step)
review_id_df.write.jdbc(url=jdbc_url, table='review_id_table', mode=mode, properties=config)
products_df.write.jdbc(url=jdbc_url, table='products_table', mode=mode, properties=config)
customers_df.write.jdbc(url=jdbc_url, table='customers_table', mode=mode, properties=config)
vine_df.write.jdbc(url=jdbc_url, table='vine_table', mode=mode, properties=config)
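
The write calls above rely on `jdbc_url`, `config`, and `mode` already being defined. The following is a minimal sketch of how those values might be built for a PostgreSQL database on RDS. The helper name `build_jdbc_settings`, the endpoint, the database name, and the credentials are all placeholders for illustration, not the values used in this project.

```python
def build_jdbc_settings(endpoint, database, user, password, port=5432):
    """Build a PostgreSQL JDBC URL and the properties dict passed to write.jdbc.

    All arguments are placeholders; substitute your own RDS details.
    """
    jdbc_url = f"jdbc:postgresql://{endpoint}:{port}/{database}"
    config = {
        "user": user,
        "password": password,
        # Class name of the PostgreSQL JDBC driver
        "driver": "org.postgresql.Driver",
    }
    return jdbc_url, config


# Hypothetical values for illustration only
jdbc_url, config = build_jdbc_settings(
    endpoint="example-endpoint.rds.amazonaws.com",
    database="example_db",
    user="example_user",
    password="example_password",
)
# "append" adds rows to an existing table; "overwrite" would replace its contents
mode = "append"

print(jdbc_url)
```

In practice the password would more likely come from an environment variable or a prompt than from a literal in the notebook.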

Below are queries in pgAdmin showing the data was uploaded into the AWS database.

<img src="images/review_id_table.png" width=50% height=50% />
<img src="images/products_table.png" width=50% height=50% />
<img src="images/customers_table.png" width=50% height=50% />
<img src="images/vine_table.png" width=50% height=50% />

Deliverable 2: Determine Bias of Vine Reviews

Results

+----+-------------+--------------------+------------------+
|vine|Total_Reviews|Total_5_Star_Reviews| %_5_Star_To_Total|
+----+-------------+--------------------+------------------+
|   Y|           60|                  34|56.666666666666664|
|   N|        14477|                8212| 56.72445948746287|
+----+-------------+--------------------+------------------+
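
The percentage column can be reproduced from the two count columns. This short plain-Python sketch does so using the counts from the table above; the function name `five_star_percentage` is hypothetical and is not part of the original notebook.

```python
def five_star_percentage(total_reviews, five_star_reviews):
    """Return 5-star reviews as a percentage of all reviews."""
    if total_reviews <= 0:
        raise ValueError("total_reviews must be positive")
    return five_star_reviews / total_reviews * 100


# Counts taken from the results table
vine_pct = five_star_percentage(60, 34)
non_vine_pct = five_star_percentage(14477, 8212)

print(f"Vine:     {vine_pct:.2f}%")      # 56.67%
print(f"Non-Vine: {non_vine_pct:.2f}%")  # 56.72%
print(f"Gap:      {non_vine_pct - vine_pct:.2f} percentage points")  # 0.06
```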

Summary

The results show that although far fewer reviews came from the Amazon Vine program (60, compared to 14,477 non-Vine reviews), the percentage of 5-star reviews was nearly identical for both groups, at about 56.7%. This suggests that the paid reviewers did not give out more 5-star reviews because they were being paid, either in free product or money.
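
As a rough sanity check on that comparison, the two 5-star rates can be put through a two-proportion z-test. This sketch assumes the normal approximation is reasonable for a group of only 60 reviews, and the function name `two_proportion_z_test` is hypothetical. It supplements the Spark output and is not part of the original notebook.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    # Pooled proportion under the null hypothesis that both rates are equal
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Vine: 34 of 60 reviews were 5-star; non-Vine: 8212 of 14477
z, p_value = two_proportion_z_test(34, 60, 8212, 14477)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

With these counts, z lands very close to zero and the p-value close to 1, so the test gives no evidence of a difference between the two rates. With only 60 Vine reviews, the test would also struggle to detect a modest bias if one existed.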

An additional measurement is to add verified_purchase to the analysis. This changes the percentage slightly for the non-Vine reviewers: 56.7% for Vine reviews versus 57.4% for non-Vine verified-purchase reviews. The percentages are still very close, a difference of only 0.7 percentage points.

These are findings that SellBy needs to factor into its decision.

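from pyspark.sql.functions import col, count, when

# Group by Vine status and verified-purchase flag, then count all reviews and 5-star reviews
ratings_total_df = percent_votes_df.groupBy("vine", "verified_purchase").agg(
    count(col("vine")).alias("Total_Reviews"),
    count(when(col("star_rating") == 5, True)).alias("Total_5_Star_Reviews"),
    (count(when(col("star_rating") == 5, True)) / count(col("vine")) * 100).alias("%_5_Star_To_Total"))
# show() returns None, so call it separately to keep the DataFrame in ratings_total_df
ratings_total_df.show()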

+----+-----------------+-------------+--------------------+------------------+
|vine|verified_purchase|Total_Reviews|Total_5_Star_Reviews| %_5_Star_To_Total|
+----+-----------------+-------------+--------------------+------------------+
|   Y|                N|           60|                  34|56.666666666666664|
|   N|                Y|         8610|                4940|57.375145180023225|
|   N|                N|         5867|                3272| 55.76955854780978|
+----+-----------------+-------------+--------------------+------------------+
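
For readers without a Spark session, the same grouping logic can be sketched in plain Python. The sample rows below are made up for illustration and do not come from the dataset. The function name `summarize_five_star` is hypothetical.

```python
from collections import defaultdict


def summarize_five_star(rows):
    """Group rows by (vine, verified_purchase) and compute the 5-star share."""
    counts = defaultdict(lambda: {"total": 0, "five_star": 0})
    for row in rows:
        key = (row["vine"], row["verified_purchase"])
        counts[key]["total"] += 1
        if row["star_rating"] == 5:
            counts[key]["five_star"] += 1
    # Add the percentage of 5-star reviews to each group's counts
    return {
        key: {**c, "pct_five_star": c["five_star"] / c["total"] * 100}
        for key, c in counts.items()
    }


# Made-up sample rows for illustration only
sample_rows = [
    {"vine": "Y", "verified_purchase": "N", "star_rating": 5},
    {"vine": "Y", "verified_purchase": "N", "star_rating": 3},
    {"vine": "N", "verified_purchase": "Y", "star_rating": 5},
    {"vine": "N", "verified_purchase": "Y", "star_rating": 5},
    {"vine": "N", "verified_purchase": "N", "star_rating": 4},
]

summary = summarize_five_star(sample_rows)
for key, stats in summary.items():
    print(key, stats)
```

Each key in the result corresponds to one row of the Spark table: a (vine, verified_purchase) pair with its total, 5-star count, and percentage.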

Thank you for your time and let me know if you wish to see any additional data.

Jill Hughes