Help needed parsing a sample log file in Spark

apache-spark
scala

#1

Hi Guys,

I am trying to load a sample log file into Spark, and my requirement is to extract specific columns (_raw and _time). I did this using spark.sql; however, in my _raw column the data looks like this:
“[2018-12-02T23:59:56.965-06:00] [dfs_prod_bpel_admin] [NOTIFICATION] [DMS-50973] [oracle.dms.collector] [tid: DmsThread-2] [userId: weblogic12cagent] [ecid: 955a9a42-08f3-4f4a-81d4-e384312724b0-00025f58,0] [partition-name: DOMAIN] [tenant-name: GLOBAL] Caught Exception.[[
java.lang.IllegalArgumentException
at java.util.concurrent.ThreadPoolExecutor.setMaximumPoolSize(ThreadPoolExecutor.java:1680)
at oracle.dms.collector.Hunter._adjustThreadPoolSize(Hunter.java:263)
at oracle.dms.collector.Hunter.run(Hunter.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at oracle.dms.util.DmsThreadFactory$1$1.run(DmsThreadFactory.java:55)
at java.security.AccessController.doPrivileged(Native Method)
at oracle.dms.util.DmsThreadFactory$1.run(DmsThreadFactory.java:50)
at java.lang.Thread.run(Thread.java:748)
]]”

I need to clean this column so that I can run SQL queries on it.

Can someone help me?

This is what I have done so far:

// Read the CSV export of the log file, letting Spark infer the schema from the header
val logfilecsv = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("soa_prod_diag_10_jan.csv")
logfilecsv.show(20, false) // false = don't truncate long values such as _raw
logfilecsv.createOrReplaceTempView("logs")
// spark.sql already returns a DataFrame, so .toDF is not needed
val rawonly = spark.sql("select _raw, _time from logs")
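What I was thinking of trying next (not sure if this is the right approach) is to pull the bracketed fields out of _raw with regexp_extract and expose them as normal columns. This is only a rough sketch: it assumes every _raw value starts with the bracketed header shown above, and the column names (log_time, server, level, message_id, component) are just labels I made up.

import org.apache.spark.sql.functions.{col, regexp_extract}

// Rough sketch: each pattern skips N leading "[...] " groups and captures the next one.
// Assumes the header of _raw always looks like "[time] [server] [level] [msg-id] [component] ..."
val parsed = rawonly
  .withColumn("log_time",   regexp_extract(col("_raw"), "^\\[([^\\]]+)\\]", 1))
  .withColumn("server",     regexp_extract(col("_raw"), "^(?:\\[[^\\]]+\\] ){1}\\[([^\\]]+)\\]", 1))
  .withColumn("level",      regexp_extract(col("_raw"), "^(?:\\[[^\\]]+\\] ){2}\\[([^\\]]+)\\]", 1))
  .withColumn("message_id", regexp_extract(col("_raw"), "^(?:\\[[^\\]]+\\] ){3}\\[([^\\]]+)\\]", 1))
  .withColumn("component",  regexp_extract(col("_raw"), "^(?:\\[[^\\]]+\\] ){4}\\[([^\\]]+)\\]", 1))

parsed.createOrReplaceTempView("parsed_logs")
spark.sql("select level, component, count(*) as cnt from parsed_logs group by level, component").show()

I am also not sure whether the CSV read needs .option("multiLine", "true") so that the stack-trace part of _raw stays in the same row as its header. Is something along these lines the right way to do it, or is there a cleaner approach?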