Cant able to skip the attribute for loading xml data in Hive


#1

Hello Team,

I have an xml, I am inserting the value in hive using xmlserde. In the below xml I have an attribute
in AccountSetup and Accounts tags

<AccountSetup xmlns:xsi="test">
<Accounts xmlns="http://acct.com/institutional">
<Account>
<Id>12346</Id>
<AcctNbr>AAAAAAAAAA</AcctNbr>
<RegTypeCd>XXXX</RegTypeCd>
<ClassCd>35</ClassCd>
</Account>
</Accounts>
</AccountSetup>

and my create table query

CREATE TABLE suitability(Id STRING, AcctNbr STRING, RegTypeCd STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES
("column.xpath.Id"="AccountSetup/Accounts/Account/Id/text()",
"column.xpath.AcctNbr"="AccountSetup/Accounts/Account/AcctNbr/text()",
"column.xpath.RegTypeCd"="AccountSetup/Accounts/Account/RegTypeCd/text()"
)stored as 
inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat    'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
tblproperties 
("xmlinput.start"="<AccountSetup ", "xmlinput.end"="</AccountSetup>");

It is showing null values in hive

hive> select * from suitability;
OK
NULL    NULL    NULL

Please advise on this.


#2

Hello @raki28

I am getting same results as you are. Not sure why. Here is my setup for your XML. Please note that I have changed following line: (“xmlinput.start”="", “xmlinput.end”="");

CREATE TABLE suitability(Id STRING, AcctNbr STRING, RegTypeCd STRING,ClassCd String)
row format serde ‘com.ibm.spss.hive.serde2.xml.XmlSerDe’
WITH SERDEPROPERTIES
(“column.xpath.Id”=“AccountSetup/Accounts/Account/Id/text()”,
“column.xpath.AcctNbr”=“AccountSetup/Accounts/Account/AcctNbr/text()”,
“column.xpath.RegTypeCd”=“AccountSetup/Accounts/Account/RegTypeCd/text()”,
“column.xpath.ClassCd”=“AccountSetup/Accounts/Account/ClassCd/text()”
)stored as
inputformat ‘com.ibm.spss.hive.serde2.xml.XmlInputFormat’
outputformat ‘org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat’
tblproperties
(“xmlinput.start”="", “xmlinput.end”="");

load data local inpath ‘/home/cloudera/Desktop/account.xml’ into table suitability;

I had another example and it’s working fine for me:

XML file

<CATALOG>
<BOOK>
<TITLE>Hadoop Defnitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>

Create table

CREATE TABLE book_details(TITLE STRING, AUTHOR STRING,COUNTRY STRING,COMPANY STRING,PRICE FLOAT,YEAR INT)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.TITLE"="/BOOK/TITLE/text()",
"column.xpath.AUTHOR"="/BOOK/AUTHOR/text()",
"column.xpath.COUNTRY"="/BOOK/COUNTRY/text()",
"column.xpath.COMPANY"="/BOOK/COMPANY/text()",
"column.xpath.PRICE"="/BOOK/PRICE/text()",
"column.xpath.YEAR"="/BOOK/YEAR/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");

Load file

load data local inpath '/home/cloudera/Desktop/books.xml' into table book_details;


#4

Are you able to process same data after removing attributes from the tag? I removed xmlns:xsi=“test” and xmlns=“http://acct.com/institutional” and still could not process your XML file.


#3

Thanks for you reply. Your example is fine, still I am searching for retriving the values from the tag which has attributes, ie. for my scenario.


#5

Hi @mayank2711

Yes I can able to process the data after removing the attribute from accounts tag
xmlns=“http://acct.com/institutional”

<AccountSetup xmlns:xsi="test">
<Accounts>
<Account>
<Id>12346</Id>
<AcctNbr>AAAAAAAAAA</AcctNbr>
<RegTypeCd>XXXX</RegTypeCd>
<ClassCd>35</ClassCd>
</Account>
</Accounts>
</AccountSetup>

Create table query

CREATE TABLE suitability(Id STRING, AcctNbr STRING, RegTypeCd STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES
("column.xpath.Id"="AccountSetup/Accounts/Account/Id/text()",
"column.xpath.AcctNbr"="AccountSetup/Accounts/Account/AcctNbr/text()",
"column.xpath.RegTypeCd"="AccountSetup/Accounts/Account/RegTypeCd/text()"
)stored as 
inputformat     'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat    'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
tblproperties 
("xmlinput.start"="<AccountSetup ", "xmlinput.end"="</AccountSetup>");

output:
hive> select * from suitability;
OK
12346   AAAAAAAAAA      XXXX
Time taken: 0.423 seconds, Fetched: 1 row(s)

I am getting problem only in the second parent tag “Accounts”.