Welcome to Discourse

This forum is primarily for IT professionals to accelerate their learning of new technologies such as Big Data, Cloud, Databases, Data Warehousing, etc. It complements other itversity platforms such as the YouTube channel, website, labs, etc.

ITVersity has a history of training thousands in Big Data. It is known for building Big Data skills as per each learner's preferences, supported by a state-of-the-art lab (https://labs.itversity.com).

Here are the salient features of our training:

  • Training thousands as per each learner's preference
  • Certification-oriented content
  • Thousands have cleared certifications
  • State-of-the-art, PoC-scale cluster for real exposure
  • Live bootcamps
  • Forum-based support
  • and many more

I am unable to copy files from local to HDFS through PuTTY. I have a Windows machine!

May I know whether "local" means Linux local or Windows local? If it means Windows local, then use WinSCP to copy the files.

Instead of PuTTY, try WinSCP to copy files from your local PC to the remote PC.
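Getting a file from a Windows machine into HDFS is typically a two-step process: first copy it to the gateway node, then put it into HDFS from there. A minimal sketch, assuming the gateway host used elsewhere in this thread and a placeholder username:

```shell
# Step 1 (run on Windows): copy the file from the local machine to the
# gateway node. pscp ships with PuTTY; WinSCP does the same thing with a GUI.
# Replace "username" with your own lab username.
pscp -scp companylist.csv username@gw02.itversity.com:/home/username/

# Step 2 (run on the gateway, inside a PuTTY SSH session): copy the file
# from the gateway's local filesystem into HDFS.
hdfs dfs -put /home/username/companylist.csv /user/username/

# Verify it arrived.
hdfs dfs -ls /user/username/
```

PuTTY itself only gives you a terminal session; the actual file transfer has to go through pscp, WinSCP, or a similar SCP/SFTP client.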

Hi, I am facing a critical connection error to itversity. I am running the command:
pscp -scp companylist.csv mrao108108@gw02.itversity.com:.
to upload the company list CSV to itversity, but I get an error:
Fatal: Network error: Connection timed out
Please advise what I should do.
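A couple of quick checks can narrow down a timeout like this. A hedged sketch, assuming a Windows machine (the hostname is taken from the command above; `Test-NetConnection` is a standard PowerShell cmdlet):

```shell
# Check that the hostname resolves and the host is reachable at all.
ping gw02.itversity.com

# "Connection timed out" from pscp usually means TCP port 22 (SSH) is
# blocked, often by a corporate firewall, VPN, or antivirus. From
# PowerShell you can test the port directly:
#   Test-NetConnection gw02.itversity.com -Port 22
```

If port 22 is blocked on your network (common on office networks), trying from a different network such as a home connection or mobile hotspot is a quick way to confirm.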

I am trying to ingest data from a SQL Server table (the Contacts table in this case) into a Spark DataFrame.
Below is the command that I am using:
val table_df = sqlContext.read.format("jdbc").option("url", "jdbc…url…").option("driver", "…driver name…").option("dbtable", "Contacts").option("inferSchema", "false").load.as("Contacts")

I want all the fields of the DataFrame to be of String type, but with the above command the fields come through with the same data types as in SQL Server.
For example, 2.0000000000 (type decimal(2,20)) in the SQL Server table comes through as 2E-10 (type decimal(2,20)) in the DataFrame.
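One way to get every field as a string is to cast the columns after the load: the JDBC source always maps column types from the database schema, and as far as I know `inferSchema` is an option of the CSV reader, not the JDBC one, so it has no effect here. A minimal sketch, assuming `table_df` is the DataFrame loaded by the command above:

```scala
import org.apache.spark.sql.functions.col

// Cast every column of table_df to string, keeping the original
// column names. This runs after the JDBC load, since the source
// itself offers no way to override the mapped types wholesale.
val stringDf = table_df.select(
  table_df.columns.map(c => col(c).cast("string").as(c)): _*
)

stringDf.printSchema()  // every field should now show as string
```

Note that casting a decimal to string preserves its current representation, so a value already read as 2E-10 will become the string "2E-10"; if the scientific notation itself is the problem, the decimal's precision/scale mapping is what needs investigating.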


I'm Himanshu from the Big Data batch. Actually, I need some help with a few scenarios.

Scenario 1. You have a data source which provides several billion complex records per day, including nested structures and approximately 15-20 primary fields. In DNS, the data may contain DNS query-response logs with IP and resolution information. What development approaches might you take to monitor that source for changes and consistency?

Scenario 2. Suppose you have several billion records a day containing domain addresses and the IP address they resolved to, i.e., pairs of (domain, IP). A threat analyst asks you to build them a way to search for a domain and return the IP addresses that belong to it. What are some of the major considerations you would take into developing a solution? What questions might you ask and what concerns would you have about scaling?

Scenario 3. A junior data scientist consults you because their Spark job "didn't work". What are some specific steps you would take to help them resolve their issue?

  1. When we submit a job using spark-submit, we specify --master yarn-client on the Hadoop cluster. However, if we have also set the master inside our project using setMaster("local"), then which one will take precedence?

  2. Also, when we build a jar using sbt package, does it by default include only the contents under the src folder?
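On the precedence question: per Spark's configuration documentation, properties set directly on the SparkConf in code take the highest precedence, then flags passed to spark-submit, then values in spark-defaults.conf. So a hard-coded setMaster("local") wins over --master yarn-client. A short sketch of the usual pattern:

```scala
import org.apache.spark.SparkConf

// Because code-level SparkConf settings override spark-submit flags,
// a hard-coded setMaster("local") would silently force local mode even
// when you submit with --master yarn-client.
//
// The common pattern is to leave the master out of the code entirely
// and supply it at submit time instead:
val conf = new SparkConf().setAppName("MyApp")  // no setMaster here

// then: spark-submit --master yarn-client --class MyApp my-app.jar
```

On the second question: sbt package bundles only your own compiled classes and the resources under src/main; it does not include library dependencies. To build a fat jar with dependencies included, a plugin such as sbt-assembly is the usual approach.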

Hi Admin,

I just signed up for the lab for 30 days to practice Spark, Kafka, and PySpark, but I am not sure how to use this virtual machine/lab for practice, as I am not able to connect to it through Cygwin.
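If your Cygwin install includes the OpenSSH client, connecting to the lab is usually just an ssh command. A sketch, assuming the gateway hostname mentioned elsewhere in this forum and a placeholder username (use the credentials from your lab signup):

```shell
# From a Cygwin (or any OpenSSH) terminal, connect to the lab gateway.
ssh username@gw02.itversity.com

# Once logged in, the usual CLIs should be available on the gateway, e.g.:
#   hdfs dfs -ls /
#   spark-shell
#   pyspark
```

If the ssh command is not found, the openssh package needs to be selected in the Cygwin installer; if the connection times out, check whether your network blocks outbound port 22.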