Import com.databricks

import com.databricks not available in my login

I am in the lab, trying to do an exercise in Scala. When I try to import com.databricks, it is not available. How do I get it?

What are you trying to do with com.databricks? If you can highlight which class you are trying to use, I can direct you.

I was executing the following from the Spark file formats and compressions section:

ordersByStatus.write.avro("/user/username/orders_by_status_avro")
<console>:35: error: value avro is not a member of org.apache.spark.sql.DataFrameWriter
ordersByStatus.write.avro("/user/justycx201703/orders_by_status_avro")

I tried to import the package:
import com.databricks.spark.avro._

scala> import com.databricks.spark.avro._
<console>:32: error: object databricks is not a member of package com
import com.databricks.spark.avro._
^

@j_thomas

You can start the spark-shell with the command below and try the import again.

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1

import com.databricks.spark.avro._
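Note that the spark-avro artifact has to match the Scala version of your Spark build; spark-avro_2.10:2.0.1 assumes a Scala 2.10 build (Spark 1.x, as in the lab). On a Scala 2.11 / Spark 2.x build you would use something like:

spark-shell --packages com.databricks:spark-avro_2.11:3.2.0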

Thank you, it worked.

But now I am hitting another error.
I am trying to write to an Avro file and getting a lot of errors; the following are the first few lines:

ordersByStatus.write.avro("/user/justycx201703/orders_by_status_avro")

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[order_status#3], functions=[(count(1),mode=Final,isDistinct=false)], output=[order_status#3,count_by_status#4L])
+- TungstenExchange hashpartitioning(order_status#3,200), None

@j_thomas - Please paste your complete code to recreate the issue.

Here it is:

val sqlContext = org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import com.databricks.spark.avro._

case class Order(
order_id:Int,
order_date: String,
order_customer_id: Int,
order_status: String);

val orders = sc.textFile("/user/username/retail_db/orders").
map(rec => {
val r = rec.split(",")
Order(r(0).toInt, r(1), r(2).toInt, r(3))
}).toDF()

orders.registerTempTable("orders")

val ordersByStatus = sqlContext.sql("select order_status, count(1) count_by_status from orders group by order_status")
ordersByStatus.write.avro("/user/username/orders_by_status_avro")

@j_thomas

I tried executing the above lines and it's working fine with your id. I started spark-shell as below:
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1 --conf spark.ui.port=45263

Check the results in
hadoop fs -ls /user/justycx201703/orders_by_status_avro
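You can also read the files back in the same shell to verify the output, something like:

sqlContext.read.avro("/user/justycx201703/orders_by_status_avro").show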

Note: the output is partitioned into 200 files by default; before saving, you can use setConf to reduce the number of output files.

scala> sqlContext.getConf("spark.sql.shuffle.partitions")
res2: String = 200
scala> sqlContext.setConf("spark.sql.shuffle.partitions", "2")
scala> sqlContext.getConf("spark.sql.shuffle.partitions")
res4: String = 2
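As an alternative to changing the shuffle setting, a minimal sketch that coalesces the DataFrame just before saving (assuming the same ordersByStatus as above):

ordersByStatus.coalesce(1).write.avro("/user/justycx201703/orders_by_status_avro")

This should produce a single Avro part file instead of one per partition.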

Unfortunately, it doesn't work on my end. I am getting a big block of errors and have no idea how to fix it.

ordersByStatus.write.avro("/user/username/orders_by_status_avro")
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[order_status#3], functions=[(count(1),mode=Final,isDistinct=false)], output=[order_status#3,count_by_status#4L])
+- TungstenExchange hashpartitioning(order_status#3,2), None
   +- TungstenAggregate(key=[order_status#3], functions=[(count(1),mode=Partial,isDistinct=false)], output=[order_status#3,count#7L])
      +- Project [order_status#3]
         +- Scan ExistingRDD[order_id#0,order_date#1,order_customer_id#2,order_status#3]

I am still not able to write to the Avro file:
ordersByStatus.write.avro("/user/username/orders_by_status_avro")

Why is it that I am getting this error, and how can I complete the work?

Are you creating sqlContext again inside the spark-shell?

I start the spark-shell:
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1 --conf spark.ui.port=45263

Then I follow these steps:
val sqlContext = org.apache.spark.sql.SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import com.databricks.spark.avro._

case class Order(
order_id:Int,
order_date: String,
order_customer_id: Int,
order_status: String);

val orders = sc.textFile("/user/username/retail_db/orders").
map(rec => {
val r = rec.split(",")
Order(r(0).toInt, r(1), r(2).toInt, r(3))
}).toDF()

orders.registerTempTable("orders")

val ordersByStatus = sqlContext.sql("select order_status, count(1) count_by_status from orders group by order_status")
ordersByStatus.write.avro("/user/username/orders_by_status_avro")

sqlContext is already created when you run spark-shell, so try running your code without those two lines.
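That is, start directly from the imports and reuse the built-in sqlContext and sc, something like:

import sqlContext.implicits._
import com.databricks.spark.avro._
// then the case class, sc.textFile(...).map(...).toDF(), registerTempTable and write.avro as before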

Still getting the same error.

I am not able to display the result either; I am getting the same error:

ordersByStatus.show

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
TungstenAggregate(key=[order_status#3], functions=[(count(1),mode=Final,isDistinct=false)], output=[order_status#3,count_by_status#4L])
+- TungstenExchange hashpartitioning(order_status#3,200), None
   +- TungstenAggregate(key=[order_status#3], functions=[(count(1),mode=Partial,isDistinct=false)], output=[order_status#3,count#7L])
      +- Project [order_status#3]
         +- Scan ExistingRDD[order_id#0,order_date#1,order_customer_id#2,order_status#3]
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate.doExecute(TungstenAggregate.scala:80)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)

@j_thomas - Please paste your full code to recreate the issue.

Is this happening in the big data lab environment or the Cloudera QuickStart VM?

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1 --conf spark.ui.port=45263

import sqlContext.implicits._
import com.databricks.spark.avro._

case class Order(
order_id:Int,
order_date: String,
order_customer_id: Int,
order_status: String);

val orders = sc.textFile("/user/username/retail_db/orders").
map(rec => {
val r = rec.split(",")
Order(r(0).toInt, r(1), r(2).toInt, r(3))
}).toDF()

orders.registerTempTable("orders")

val ordersByStatus = sqlContext.sql("select order_status, count(1) count_by_status from orders group by order_status")
ordersByStatus.show
ordersByStatus.write.avro("/user/username/orders_by_status_avro")

Yes, this is in the big data lab.
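One way to narrow this down (an untested sketch) is to check whether the failure is in the aggregation or already in the scan of the input data:

orders.count // forces a full scan and parse of the input
orders.rdd.map(_.getString(3)).countByValue() // does the grouping work at the RDD level?

If orders.count already fails, the problem is likely a malformed input line (for example an empty line making r(0).toInt throw) rather than the Avro writer itself.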