YARN Capacity Scheduler: Queue Priority

Capacity Scheduler is designed to run Hadoop jobs in a shared, multi-tenant cluster in a friendly manner. Its main strength is that it guarantees specific capacity for a certain group of users by supporting multiple queues and allowing users to submit their queries into their dedicated queues. Each queue is given a fraction of total … Continue reading YARN Capacity Scheduler: Queue Priority →

Hive Performance Tuning

If you have been working in Big Data, you have definitely heard of Hive. Apache Hive is the data warehouse infrastructure build on top of Hadoop. I did a presentation on how to best use Apache Hive and few tips on how to best use it for one of our clients last week that I … Continue reading Hive Performance Tuning →

Hadoop Error org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block

Like almost all Mondays, today was a very challenging one. The first thing I noticed was that our primary namenode had faced some issues over the weekend and went down. Which means secondary namenode, namenode-02, was active. I checked namenode-01 and made sure it is okay before making it active again. After that, I was made … Continue reading Hadoop Error org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block →

How to import org.apache.spark.sql.SQLContext.implicits in Spark 1.6: error “value toDF is not a member of org.apache.spark.rdd.RDD”

Note: If you're using Spark 2.2, please read this post I am doing a mini project for my company using Spark/Scala and have been stuck with the error mentioned in the title for a couple of days. Googling that error suggested to import org.apache.spark.sql.SQLContext.implicits, and that's what I did: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.sql._ import … Continue reading How to import org.apache.spark.sql.SQLContext.implicits in Spark 1.6: error “value toDF is not a member of org.apache.spark.rdd.RDD” →