Getting Started - Setup Data Sets

As part of this topic, we will set up the data sets used for practice.

  • Go to https://github.com/dgadiraju/data
  • Clone or download the repository onto a virtual machine created using Cloudera Quickstart or Hortonworks Sandbox
  • You can set the data up locally to practice Spark, but it is highly recommended to use HDFS, which comes out of the box with Cloudera Quickstart, Hortonworks Sandbox, or our labs
  • On the labs, the data sets are already available
  • retail_db
    • Master tables
      • customers
      • products
      • categories
      • departments
    • Transaction tables
      • orders
      • order_items
  • To get the number of records in the files, we can use the command below: wc -l /data/retail_db//
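
The steps above can be sketched as a small shell script. This is a minimal sketch, not a definitive setup: the repository URL and table names come from the list above, while TARGET_DIR, HDFS_DIR, and the function names are hypothetical choices for illustration. It also assumes each table is a directory of plain text files under retail_db, which may not match your copy of the data.

```shell
#!/usr/bin/env bash
# Sketch only. DATA_REPO is taken from the text above; TARGET_DIR, HDFS_DIR
# and all function names are assumptions made for this example.
DATA_REPO="https://github.com/dgadiraju/data"
TARGET_DIR="${TARGET_DIR:-$HOME/data}"
HDFS_DIR="${HDFS_DIR:-/user/$(whoami)/data}"

# Clone the repository unless a copy already exists at TARGET_DIR.
clone_data() {
  if [ -d "$TARGET_DIR/.git" ]; then
    echo "already cloned: $TARGET_DIR"
  else
    git clone "$DATA_REPO" "$TARGET_DIR"
  fi
}

# Optional: copy retail_db into HDFS (requires the hdfs client on the VM).
copy_to_hdfs() {
  hdfs dfs -mkdir -p "$HDFS_DIR" && hdfs dfs -put "$TARGET_DIR/retail_db" "$HDFS_DIR"
}

# Print the total number of lines across all files in one table directory.
count_records() {
  find "$1" -type f -exec cat {} + | wc -l | tr -d ' '
}

# Print "<table><TAB><line count>" for each table directory found under $1.
report_counts() {
  base="$1"
  for table in customers products categories departments orders order_items; do
    if [ -d "$base/$table" ]; then
      printf '%s\t%s\n' "$table" "$(count_records "$base/$table")"
    fi
  done
}
```

After sourcing the script, a typical run might be `clone_data && report_counts "$TARGET_DIR/retail_db"`, followed by `copy_to_hdfs` if you are working on a VM with HDFS. Counting per table rather than across the whole tree keeps files from unrelated data sets out of the totals.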
