From Monolithic Architecture to Microservices and Event-Driven Systems

I’m a massive fan of streaming and real time data processing and solutions. I strongly believe a lot of use cases are going to be defined and implemented around fast and streaming data in near future, especially in IoT and streaming analytics. With 5G rolling out soon and its superfast bandwidth and wide geographical coverage, … Continue reading From Monolithic Architecture to Microservices and Event-Driven Systems →

AWS Glue Part 3: Automate Data Onboarding for Your AWS Data Lake

Choosing the right approach to populate a data lake is usually one of the first decisions made by architecture teams after deciding the technology to build their data lake with. A recent trend seems to be taking over is using Spark, since it’s fast and powerful and comes with a lot of flexibilities when used … Continue reading AWS Glue Part 3: Automate Data Onboarding for Your AWS Data Lake →

AWS Glue Part 2: ETL your data and query the result in Athena

In part one of my posts on AWS Glue, we saw how Crawlers could be used to traverse data in s3 and catalogue them in AWS Athena. Glue is a serverless service that could be used to create ETL jobs, schedule and run them. In this post we'll create an ETL job using Glue, execute … Continue reading AWS Glue Part 2: ETL your data and query the result in Athena →

AWS Glue Part 1: Discover and Catalogue Data Stored in s3

Learn how to add a Crawler in AWS Glue for data that is stored in s3

Airflow & Celery on Redis: when Airflow picks up old task instances

This is going to be a quick post on Airflow. We realized that in one of our environments, Airflow scheduler picks up old task instances that were already a success (whether marked as success or completed successfully). You can verify this is actually your issue by ssh into your Airflow workers, and run: ps -ef … Continue reading Airflow & Celery on Redis: when Airflow picks up old task instances →

How to import spark.implicits._ in Spark 2.2: error “value toDS is not a member of org.apache.spark.rdd.RDD”

I wrote about how to import implicits in spark 1.6 more than 2 years ago. But things have changed in Spark 2.2: the first thing you need to do when coding in Spark 2.2 is to set up an SparkSession object. SparkSession is the entry point to programming Spark with DataSet and DataFrame. Like Spark … Continue reading How to import spark.implicits._ in Spark 2.2: error “value toDS is not a member of org.apache.spark.rdd.RDD” →

Spark Error “java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE” in Spark 1.6

RDDs are the building blocks of Spark and what make it so powerful: they are stored in memory for fast processing. RDDs are broken down into partitions (blocks) of data, a logical piece of distributed dataset. The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of the block to 2 … Continue reading Spark Error “java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE” in Spark 1.6 →

Spark Error CoarseGrainedExecutorBackend Driver disassociated! Shutting down: Spark Memory & memoryOverhead

Another common error we saw in yarn application logs was this: 17/08/31 15:58:07 WARN CoarseGrainedExecutorBackend: An unknown (datanode-022:43969) driver disconnected. 17/08/31 15:58:07 ERROR CoarseGrainedExecutorBackend: Driver 10.1.1.111:43969 disassociated! Shutting down. Googling this error suggests increasing spark.yarn.driver.memoryOverhead or spark.yarn.executor.memoryOverhead or both. That has apparently worked for a lot of people. Or at least those who were smart enough to understand … Continue reading Spark Error CoarseGrainedExecutorBackend Driver disassociated! Shutting down: Spark Memory & memoryOverhead →

Spark Error: Failed to Send RPC to Datanode

This past week we had quite few issues with users not being able to run Spark jobs running in YARN Cluster mode. Particularly a team that was on tight schedule used to get errors like this all the time: java.io.IOException: Failed to send RPC 8277242275361198650 to datanode-055: java.nio.channels.ClosedChannelException Mostly accompanied by error messages like: org.apache.spark.SparkException: Error … Continue reading Spark Error: Failed to Send RPC to Datanode →

YARN Capacity Scheduler: Queue Priority

Capacity Scheduler is designed to run Hadoop jobs in a shared, multi-tenant cluster in a friendly manner. Its main strength is that it guarantees specific capacity for a certain group of users by supporting multiple queues and allowing users to submit their queries into their dedicated queues. Each queue is given a fraction of total … Continue reading YARN Capacity Scheduler: Queue Priority →

Hive Performance Tuning

If you have been working in Big Data, you have definitely heard of Hive. Apache Hive is the data warehouse infrastructure build on top of Hadoop. I did a presentation on how to best use Apache Hive and few tips on how to best use it for one of our clients last week that I … Continue reading Hive Performance Tuning →

OBIEE RPD Design: Convert Snowflake to Star schema from multiple sources in (Combine dimensions)

As I play more with OBIEE, I learn more about what it is capable of and where its main power resides. OBIEE has 3 layers: Physical, Business Model and Mapping, and Presentation. The middle layer, BMM, is what makes OBIEE special: it is where we can define how data from different sources and tables come … Continue reading OBIEE RPD Design: Convert Snowflake to Star schema from multiple sources in (Combine dimensions) →

OBIEE: Multiple joins between same tables (Fact to Dim)

Hi all. I am finally writing a new post after more than 1 year and surprisingly it is not on SQL Server! I must confess that I am not a front-end kinda person and do not particularly enjoy doing dashboards and reports. But I recently started a new job and my first project is going … Continue reading OBIEE: Multiple joins between same tables (Fact to Dim) →

SQL Server 2014 and SSDT (AKA BIDS)

Hey folks. This is going to be a short post, just wanted to mention something that may come handy for those who are interested in play with SQL Server 2014. I downloaded SQL Server 2014 a couple of weeks ago and started exploring its new features to see what has changed/improved. After going through very … Continue reading SQL Server 2014 and SSDT (AKA BIDS) →

Drop failed for DatabaseRole : The database principal owns a schema in the schema and cannot be dropped

This error is raised when there is a schema owned by the role you are trying to drop. The most straight forward and quick fix for this error is to revert the schema ownership to the appropriate role in the database, which will make the dropping role not the owner of any schema in the database … Continue reading Drop failed for DatabaseRole : The database principal owns a schema in the schema and cannot be dropped →