Convert NESTED JSON to many rows IN SPARK dynamically


#1

HI All
I have a nested JSON and want to convert that nested json to multiple rows

for exmple

input json as below

{“name”:“Michael”, “schools”:[{“sname”:“stanford”, “year”:2010}, {“sname”:“berkeley”, “year”:2012}]}
{“name”:“Andy”, “schools”:[{“sname”:“ucsb”, “year”:2011}]}

Need a utility class that convert in to below result without using explode.It can be any json ,schema can be generated dynamically.

±------±-------±—+
| name| sname|year|
±------±-------±—+
|Michael|stanford|2010|
|Michael|berkeley|2012|
| Andy| ucsb|2011|
±------±-------±—+


Learn Spark 1.6.x or Spark 2.x on our state of the art big data labs

  • Click here for access to state of the art 13 node Hadoop and Spark Cluster


#2

// Using Dataset/ DataFrame api

val spark: SparkSession = SparkSession.builder().master(“local[*]”)
.appName(“NestedJsonParser”).getOrCreate()

val jsonDF = spark.read.option("inferSchema", "true").json(inpPath)

jsonDF.printSchema()

jsonDF.show(false)

import spark.implicits._

val outDF = jsonDF.flatMap(row => {
  row.getSeq[Row](1).map(school => {
    Student(row.getString(0), school.getString(0), school.getLong(1))
  })
})

outDF.printSchema()

outDF.show()

// Using Explode
import org.apache.spark.sql.functions.explode

val outDF = jsonDF.withColumn("schools", explode($"schools")).map(row => {
  val school = row.getStruct(1)
  Student(row.getString(0), school.getString(0), school.getLong(1))
})

outDF.printSchema()

outDF.show()

For more detailed explanation, please visit this link: https://wp.me/p9U1Uy-h


#3

Thanks for your reply.

is their way to make 1xMxN ,without knowing metadata.Here you use schools as hardcoded ,hence if I use some other JSON which doesnot have school will fail.

Problem Recap

in kafka ,you will get any kind of nested json. we want to make it as flattern to without knowing the metadata or hardcode value like school.

thanks for your early reply


#4

Hi,
I had provided two different ways of solving the problem.

One using dataset api, looping the nested array and another using explode.

Just refer to the first one and let me know if you need any further help.
row.getSeqRow.map() this one would solve the problem as there’s no hard coded colum in this.

Regarding 1mn, plz provide the input and expected output.