How to parse a dataframe containing xml strings?


#1

I have a Dataframe and it has been imported from mysql

dataframe_mysql.show()
+----+---------+-------------------------------------------------------+
|  id|accountid|                                                xmldata|
+----+---------+-------------------------------------------------------+
|1001|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1002|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1003|    12346|<AccountSetup xmlns:xsi="test"><Customers test="test...|
|1004|    12347|<AccountSetup xmlns:xsi="test"><Customers test="test...|
+----+---------+-------------------------------------------------------+

In the xmldata column there is xml tags inside, I need to parse it in a structured data in a seperate dataframe.

Previously I had the xml file alone in a text file, and loaded in a spark dataframe using “com.databricks.spark.xml”

 spark-shell --packages com.databricks:spark-xml_2.10:0.4.1, 
 com.databricks:spark-csv_2.10:1.5.0

 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 val df = sqlContext.read.format("com.databricks.spark.xml")
 .option("rowTag","Account").load("mypath/Account.xml")

the final output I got as structured one

df.show()
 +----------+--------------------+--------------------+--------------+--------------------+-------+....
    |   AcctNbr|         AddlParties|           Addresses|ApplicationInd|       Beneficiaries|ClassCd|....
    +----------+--------------------+--------------------+--------------+--------------------+-------+....
    |AAAAAAAAAA|[[Securities Amer...|[WrappedArray([D,...|             T|[WrappedArray([11...|     35|....
    +----------+--------------------+--------------------+--------------+--------------------+-------+....

Please advice how to achieve the this when I have the xml content inside a dataframe.

I tried this

    val dff1 = Seq(
    Data(1001, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"a\">d</Customers></AccountSetup>"),
    Data(1002, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"b\">e</Customers></AccountSetup>"),
    Data(1003, 12345, "<AccountSetup xmlns:xsi=\"test\"><Customers test=\"c\">f</Customers></AccountSetup>")
        ).toDF
    dff1.show()
    val reader = new XmlReader().withRowTag("AccountSetup")
    val xmlrdd = dff1.select("xmldata").map(a => a.getString(0)).rdd
    xmlrdd.toDF("newRowXml").show()
    val xmldf = reader.xmlRdd(sqlcontext, xmlrdd)
    xmldf.show()

cant able to see xmldf.show()
Please advise.


Learn Spark 1.6.x or Spark 2.x on our state of the art big data labs

  • Click here for access to state of the art 13 node Hadoop and Spark Cluster