The demo is run on our state-of-the-art big data cluster.
This weekend I am planning to explore machine learning on the Yelp dataset. You are welcome to take part. I will update this topic as we make progress.
We will use Spark's machine learning library (MLlib), with Scala as the programming language.
If you have lab access, the data sets are available under /data/yelp-dataset
You can also download the data set from Kaggle to your own environment: https://www.kaggle.com/yelp-dataset/yelp-dataset
- Sign up at https://www.kaggle.com using a browser
- Download the data set using the browser
- Upload the data to the environment in which you are going to use it
You should see the following CSV files once you download and unarchive the data set.
We will use yelp_review.csv for this exercise.
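As a first look at the file, here is a minimal sketch of loading yelp_review.csv into a Spark Data Frame and inspecting it. It assumes the file sits directly under /data/yelp-dataset (the directory comes from this post, the full path is a guess), that the file has a header row, and that review text may span several lines inside quoted fields. The object name LoadYelpReviews and the helper readReviews are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadYelpReviews {

  // Read a CSV file with a header row into a DataFrame.
  // multiLine/quote/escape are set on the assumption that review text
  // may contain line breaks and embedded double quotes in quoted fields.
  def readReviews(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("multiLine", "true")
      .option("quote", "\"")
      .option("escape", "\"")
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("YelpReviewExploration")
      .master("local[*]") // assumption: local run; remove when submitting to a cluster
      .getOrCreate()

    // Assumed location; pass a different path as the first argument if needed.
    val path = if (args.nonEmpty) args(0) else "/data/yelp-dataset/yelp_review.csv"

    val reviews = readReviews(spark, path)
    reviews.printSchema()                     // column names and inferred types
    reviews.show(5)                           // first rows, long text truncated
    println(s"Row count: ${reviews.count()}") // full scan of the file

    spark.stop()
  }
}
```

The same calls can be pasted into spark-shell, where a SparkSession named spark already exists, so only the body of readReviews and the three inspection lines are needed.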
Official Spark Documentation
We will be using the official Spark machine learning documentation.
Typical ML Cycle
- Understand the data
- Create a Data Frame
- Train the model using sample data
- Apply the model to the actual data
- Validate for accuracy
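The cycle above can be sketched end to end with a Spark ML Pipeline. This is only one possible shape for the exercise, not a settled approach. It assumes the review file has a numeric stars column and a free-text text column, treats 4 stars and above as a positive label (an arbitrary cutoff chosen for illustration), and uses logistic regression on hashed term frequencies as a stand-in model. The names ReviewSentimentSketch, withLabel and buildPipeline are hypothetical.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object ReviewSentimentSketch {

  // Derive a binary label from the (assumed) "stars" column:
  // 1.0 for 4 stars and above, 0.0 otherwise. Rows without text or stars are dropped.
  def withLabel(reviews: DataFrame): DataFrame =
    reviews
      .withColumn("stars", col("stars").cast("double"))
      .filter(col("text").isNotNull && col("stars").isNotNull)
      .withColumn("label", when(col("stars") >= 4.0, 1.0).otherwise(0.0))

  // Tokenize the text, hash tokens into a fixed-size feature vector,
  // then fit a logistic regression on the "features" and "label" columns.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1 << 16)
    val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("YelpReviewSentimentSketch")
      .master("local[*]") // assumption: local run; remove when submitting to a cluster
      .getOrCreate()

    // Assumed location of the review file; override with the first argument.
    val path = if (args.nonEmpty) args(0) else "/data/yelp-dataset/yelp_review.csv"

    // Create the Data Frame.
    val reviews = spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .option("quote", "\"")
      .option("escape", "\"")
      .csv(path)

    // Understand the data: schema and label balance.
    val labeled = withLabel(reviews)
    labeled.printSchema()
    labeled.groupBy("label").count().show()

    // Train on a sample, keep the rest aside for validation.
    val Array(train, test) = labeled.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = buildPipeline().fit(train)

    // Apply the model and validate for accuracy.
    val predictions = model.transform(test)
    val accuracy = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(predictions)
    println(s"Accuracy on held-out reviews: $accuracy")

    spark.stop()
  }
}
```

Keeping the feature steps and the estimator in one Pipeline means the fitted model applies the same tokenizing and hashing to any new Data Frame passed to transform, which is what the "apply the model" step relies on.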