Control IoT Devices Using Scala on Databricks (Based on ML Model Output)

A few weeks ago I gave a talk at AI Bootcamp here in Melbourne on how we can build a serverless solution on Azure that takes us one step closer to powering industrial machines with AI, using the same technology stack that is typically used to deliver IoT analytics use cases. I demoed a solution that received data from an IoT device, in this case a crane, compared the data with the results of a machine learning model that had run and written its predictions to a repository, in this case a CSV file, and then decided whether any action needed to be taken on the machine, e.g. slowing the crane down if the wind picks up. My solution had 3 main components:

  1- IoT Hub: The gateway to the cloud, where IoT devices connect and send their data

  2- Databricks: The brain of the solution, where the data received from the IoT device is compared with what the ML algorithm has predicted, and where it is decided whether to take any action

  3- Azure Functions: A Java function was deployed to Azure Functions to call a Direct Method on my simulated crane and instruct it to slow down. Direct methods, or device methods, provide the ability to control the behaviour of IoT devices from the cloud.

Since then, I upgraded my solution by moving the part responsible for sending direct methods to IoT devices from Azure Functions into Databricks, as I promised at the end of my talk. Doing this has a few advantages:

  • The solution will be more efficient, since the data received from IoT devices makes one less hop before the command is sent back, and there are no delays caused by Azure Functions.
  • The solution will be simpler, since fewer components are deployed to get the job done. The fewer components and services in a solution, the lower the complexity of managing and running it.
  • The solution will be cheaper. We won’t be paying for Azure Functions calls, which could add up to a considerable amount of money when there are millions of devices connecting to the cloud.

This is what the solution I will go through in this post looks like:

I must mention here that I am not going to store the incoming sensor data as part of this blog post, but I would recommend doing so in Delta tables if you’re looking for a performant and modern storage solution.

Azure IoT Hub

IoT Hub is the gateway to our cloud-based solution. Devices connect to IoT Hub and start sending their data across. I explained what needs to be done to set up IoT Hub and register IoT devices with it in my previous post. I just upgraded “SimulatedDevice.java” to resemble my simulated crane and send additional metrics to the cloud: “device_id”, “temperature”, “humidity”, “height”, “device_time”, “latitude”, “longitude”, “wind_speed”, “load_weight” and “lift_angle”. Most of these metrics will be used in my solution in one way or another. A sample record sent by the simulated crane looks like this:

Machine Learning

Those who know me are aware that I’m not a data scientist of any sort. I understand and appreciate most of the machine learning algorithms and the beautiful mathematical formulas behind them, but what I’m more interested in is how we can enable using those models and apply them to our day to day lives to solve problems in form of industry use cases.

For the purpose of this blog post, I assumed that there was an ML model that has run and made its predictions on how much the crane needs to slow down based on what is happening in the field, in terms of the metrics sent by sensors installed on the crane:

This is what the sample output from our assumed ML model looks like:

As an example to clarify what this output means, let’s look at the first row. It specifies that if the temperature is between 0 and 10 degrees Celsius, the wind speed is between 0 and 5 km/h, the load height is between 15 and 20 meters, the load weight is between 0 and 2 tons, and the lift angle is between 0 and 5 degrees, then the crane needs to slow down by 5 percent. Let’s see how we can use this result set and take actions on our crane based on the data we receive in real time.
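To make the meaning of a row concrete, here is a minimal plain-Scala sketch of the same range lookup, outside Spark. The `CraneRules` object, the `Rule` and `Reading` case classes and the `adjustmentPct` field are hypothetical names used only for illustration; the bounds come from the first row described above. Treating each range as lower-inclusive and upper-exclusive is an assumption that happens to match the join condition used later in this post.

```scala
object CraneRules {
  // Hypothetical model of one row of the ML output (field names are illustrative).
  final case class Rule(
    minTemperature: Double, maxTemperature: Double,
    minWindSpeed: Double, maxWindSpeed: Double,
    minLoadHeight: Double, maxLoadHeight: Double,
    minLoadWeight: Double, maxLoadWeight: Double,
    minLiftAngle: Double, maxLiftAngle: Double,
    adjustmentPct: Int)

  // Hypothetical model of the subset of crane telemetry the rules look at.
  final case class Reading(
    temperature: Double, windSpeed: Double, height: Double,
    loadWeight: Double, liftAngle: Double)

  // Assumption: ranges are lower-inclusive, upper-exclusive.
  private def within(value: Double, min: Double, max: Double): Boolean =
    value >= min && value < max

  // Returns the slow-down percentage of the first rule whose ranges all contain the reading.
  def adjustmentFor(reading: Reading, rules: Seq[Rule]): Option[Int] =
    rules.find { rule =>
      within(reading.temperature, rule.minTemperature, rule.maxTemperature) &&
      within(reading.windSpeed, rule.minWindSpeed, rule.maxWindSpeed) &&
      within(reading.height, rule.minLoadHeight, rule.maxLoadHeight) &&
      within(reading.loadWeight, rule.minLoadWeight, rule.maxLoadWeight) &&
      within(reading.liftAngle, rule.minLiftAngle, rule.maxLiftAngle)
    }.map(_.adjustmentPct)
}
```

A reading that falls inside every range of the first row yields `Some(5)`; a reading outside any one of the ranges yields `None`, meaning no command needs to be sent.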

Azure Key Vault

There are a couple of different connection strings that we need for our solution to work, such as the “IoT Hub event hub-compatible” and “service policy” connection strings. When building production-grade solutions, it’s important to store sensitive information such as connection strings in a secure way. In the following sections of this post I will show how we can use a plain-text connection string as well as one stored in Azure Key Vault.

Our solution needs a way of connecting to the IoT Hub to invoke direct methods on IoT devices. For that, we need to get the connection string associated with the service policy of the IoT Hub using the following Azure CLI command:

az iot hub show-connection-string --policy-name service --name <IOT_HUB_NAME> --output table

We then store it in Azure Key Vault. So go ahead and create an Azure Key Vault and then:

  • Click on Secrets on the left pane and then click on Generate/Import
  • Select Manual for Upload options
  • Specify a Name such as <IOT_HUB_NAME>-service-policy-cnn-string
  • Paste the value you got from the above Azure CLI command in the Value text box

After completing the steps above and creating the secret, we will be able to use the service connection string associated with the IoT Hub in the next stages of our solution, which will be built in Databricks.

Databricks

In my opinion, the second most important factor separating Artificial Intelligence from other kinds of intelligent systems, after the complexity and type of algorithms, is the ability to process data and respond to events in real time. If the aim of AI is to replace most human functions, AI-powered systems must be able to mimic and replicate the human brain’s ability to scan and act as events happen.

I am going to use Spark Streaming as the mechanism to process data in real time in this solution. The very first step is to set up Databricks; you can read how to do that in my previous blog post. Don’t forget to install the “azure-eventhubs-spark_2.11:2.3.6” library as instructed there. The code snippets you will see in the rest of this post are in Scala.

Load ML Model results

To be able to use the results of our ML model, we need to load them as a DataFrame. I have the file containing the sample output saved in Azure Blob Storage. What I’ll do first is mount that blob in DBFS:

dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name(if any)>",
  mountPoint = "/mnt/ml-output",
  extraConfigs = Map("fs.azure.account.key.<storage-account-name>.blob.core.windows.net" -> "<Storage_Account_Key>"))

To get Storage_Account_Key, navigate to your storage account in the Azure portal, click on Access Keys in the left pane and copy the string under Key1 -> Key.

After completing the above steps, we will be able to use the mount point in the rest of our notebook, which is much easier than having to refer to the storage blob every time. The next code snippet shows how we can create a DataFrame using the mount point we just created:

val CSV_FILE = "/mnt/ml-output/ml_results.csv"
val mlResultsDF = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(CSV_FILE)

Executing the cell above in Databricks notebook creates a DataFrame containing the fields in the sample output file:

Connect to IoT Hub and read the stream

This step is explained in my previous blog post as well, so make sure you follow the steps in the “Connect to IoT Hub and read the stream” section. For reference, below is the Scala code you would need to have in the next cell of the Databricks notebook:

import org.apache.spark.eventhubs._
import org.apache.spark.eventhubs.{ ConnectionStringBuilder, EventHubsConf, EventPosition }
import org.apache.spark.sql.functions.{ explode, split }

val connectionString = ConnectionStringBuilder("<Event hub connection string from Azure portal>")
  .setEventHubName("<Event Hub-Compatible Name>")
  .build

val eventHubsConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)
  .setConsumerGroup("<Consumer Group>")

val sensorDataStream = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()

Apply structure to incoming data

Finishing the previous step gives us a DataFrame containing the data we receive from our connected crane. If we run a display command on the DataFrame, we see that the data we received does not really resemble what is sent by the simulator code. That’s because the data sent by our code actually goes into the first column, body, in binary format. We need to apply an appropriate schema on top of that column to be able to work with the incoming data using structured mechanisms such as SQL.

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val schema = (new StructType)
  .add("device_id", StringType)
  .add("temperature", DoubleType)
  .add("humidity", DoubleType)
  .add("height", DoubleType)
  .add("device_time", MapType(
    StringType,
    new StructType()
      .add("year", IntegerType)
      .add("month", IntegerType)
      .add("day", IntegerType)
      .add("hour", IntegerType)
      .add("minute", IntegerType)
      .add("second", IntegerType)
      .add("nano", IntegerType)))
  .add("latitude", DoubleType)
  .add("longitude", DoubleType)
  .add("wind_speed", DoubleType)
  .add("load_weight", DoubleType)
  .add("lift_angle", DoubleType)

val sensorDataDF = sensorDataStream
  .select(
    ($"enqueuedTime").as("Enqueued_Time"),
    ($"systemProperties.iothub-connection-device-id").as("Device_ID"),
    (from_json($"body".cast("string"), schema).as("telemetry_json")))
  .withColumn("eventTimeString", concat(
    $"telemetry_json.device_time.date.year".cast("string"), lit("-"),
    $"telemetry_json.device_time.date.month".cast("string"), lit("-"),
    $"telemetry_json.device_time.date.day".cast("string"), lit(" "),
    $"telemetry_json.device_time.time.hour".cast("string"), lit(":"),
    $"telemetry_json.device_time.time.minute".cast("string"), lit(":"),
    $"telemetry_json.device_time.time.second".cast("string")))
  .withColumn("eventTime", to_timestamp($"eventTimeString"))
  .select("Device_ID", "telemetry_json.temperature", "telemetry_json.height", "telemetry_json.wind_speed",
    "telemetry_json.load_weight", "telemetry_json.lift_angle", "eventTimeString", "eventTime")
  .withWatermark("eventTime", "5 seconds")

The code above does the following:

  • Defines the schema matching the data we produce at source
  • Applies the schema on the incoming binary data
  • Extracts the fields in a structured format including the time the record was generated at source, eventTime
  • Uses the eventTime column to define watermarking to deal with late arriving records and drop data older than 5 seconds

The last point is important to notice. The solution we’re building here has to deal, in real time, with changes in the environment where the crane is operating. This means that the solution should wait for the data sent by the crane only for a limited time, in this case 5 seconds. The idea is that the solution shouldn’t take actions based on old data, since a lot may change in the environment in 5 seconds. Therefore, it drops late records and will not consider them. Remember to change that threshold based on the use case you’re working on.

Running display on the resulting DataFrame gives us the following:
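The idea behind the 5-second threshold can be illustrated with a small plain-Scala sketch. This is a simplified model under stated assumptions, not Spark’s implementation: Spark tracks the watermark across micro-batches and applies it inside stateful operations, whereas the hypothetical `WatermarkSketch` object below works on a single in-memory batch and simply treats the watermark as the latest event time seen minus the allowed delay.

```scala
object WatermarkSketch {
  // Hypothetical, minimal stand-in for a crane record: only the fields needed here.
  final case class Reading(deviceId: String, eventTimeMs: Long)

  // Watermark = latest event time seen so far minus the allowed delay.
  def watermark(maxEventTimeMs: Long, delayMs: Long): Long =
    maxEventTimeMs - delayMs

  // Keeps only readings whose event time is at or after the watermark.
  def dropLate(readings: Seq[Reading], delayMs: Long): Seq[Reading] =
    if (readings.isEmpty) readings
    else {
      val wm = watermark(readings.map(_.eventTimeMs).max, delayMs)
      readings.filter(_.eventTimeMs >= wm)
    }
}
```

With a delay of 5000 ms and event times of 10000, 12000 and 4000, the watermark is 7000, so the record stamped 4000 is treated as late and dropped.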

display(sensorDataDF)

Decide when to take action

Now that we have both the ML model results and IoT Device data in separate DataFrames, we are ready to code the logic that defines when our solution should send a command back to the device and command it to slow down. One point that is worth noticing here is that the mlResultsDF is a static DataFrame whereas sensorDataDF is a streaming one. You can check that by running:

println(sensorDataDF.isStreaming)

Let’s go through what needs to be done at this stage one more time: as the data streams in from the crane, we need to compare it with the results of the ML model and slow the crane down when the incoming data falls in a range defined by the ML model. This is easy to code: all we need to do is join the 2 datasets and check when this rule is met:

val joinedDF = sensorDataDF
  .join(mlResultsDF,
    $"temperature" >= $"Min_Temperature" && $"temperature" < $"Max_Temperature" &&
    $"wind_speed" >= $"Min_Wind_Speed" && $"wind_speed" < $"Max_Wind_Speed" &&
    $"load_weight" >= $"Min_Load_Weight" && $"load_weight" < $"Max_Load_Weight" &&
    $"height" >= $"Min_Load_Height" && $"height" < $"Max_Load_Height" &&
    $"lift_angle" >= $"Min_Lift_Angle" && $"lift_angle" < $"Max_Lift_Angle")

The result of running the above cell in the Databricks notebook is a streaming DataFrame which contains the records our solution needs to act upon.

Connect to IoT Hub where devices send data to

Now that we have worked out when our solution should react to incoming data, we can move to the next interesting part, which is defining what action to take on the device and how. We already discussed that an action is taken on the crane by calling a direct method. This method instructs the crane to slow down by the percentage that is passed in as the parameter. If you look into the SimulatedDevice.java file in the azure-iot-samples-java GitHub repo, there is a switch expression in the “call” function of the “DirectMethodCallback” class which defines the code to be executed based on the different direct methods called on the IoT device. I extended that to simulate the crane being slowed down, but you can work with the existing “SetTelemetryInterval” method.

So, what we need to do at this stage is to connect, from the Databricks notebook, to the IoT Hub where the device is registered, and then invoke the direct method URL, which should be in the form of “https://{iot hub}/twins/{device id}/methods?api-version=2018-06-30”.

To authenticate with IoT Hub, we need to create a SAS token in form of “SharedAccessSignature sig=<signature>&se=<expiryTime>&sr=<resourceURI>”. Remember the service level policy connection string we put in Azure Key Vault? We are going to use that to build the SAS Token. But first, we need to extract the secrets stored in Key Vault:

val vaultScope = "kv-scope-01"
val keys = dbutils.secrets.list(vaultScope)
val keysAndSecrets = collection.mutable.Map[String, String]()
for (x <- keys) {
  // Extract the secret name from the text between "(" and ")" in the metadata's string form
  val scopeKey = x.toString().substring(x.toString().indexOf("(") + 1, x.toString().indexOf(")"))
  keysAndSecrets += (scopeKey -> dbutils.secrets.get(scope = vaultScope, key = scopeKey))
}

After running the code in the above cell in the Databricks notebook, we get a map of all the secrets stored in the Key Vault in the form of (“Secret_Name” -> “Secret_Value”).

Next, we need a function to build the SAS token that will be accepted by IoT Hub when authenticating the requests. This function does the following:

  • Uses the IoT Hub name to get the service policy connection string from the Map of secrets we built previously
  • Extracts components of the retrieved connection string to build host name, resource uri and shared access key
  • Computes a Hash-based Message Authentication Code (HMAC) by using the SHA256 hash function, from the shared access key in the connection string
  • Uses an implementation of “Message Authentication Code” (MAC) algorithm to create the signature for the SAS Token
  • And finally returns SharedAccessSignature

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import java.net.URLEncoder
import java.nio.charset.StandardCharsets
import java.lang.System.currentTimeMillis

val iotHubName = "test-direct-method-scala-01"

object SASToken {
  // deviceName is not part of the signature: the token is scoped to the hub via the service policy
  def tokenBuilder(deviceName: String): String = {
    val iotHubConnectionString = keysAndSecrets(iotHubName + "-service-policy-cnn-string")
    // First component of the connection string, "HostName=<hub host name>"
    val hostName = iotHubConnectionString.substring(0, iotHubConnectionString.indexOf(";"))
    val resourceUri = hostName.substring(hostName.indexOf("=") + 1, hostName.length)
    val targetUri = URLEncoder.encode(resourceUri, String.valueOf(StandardCharsets.UTF_8))
    // Everything after "SharedAccessKey=" (16 characters) is the base64-encoded key
    val sharedAccessKey = iotHubConnectionString.substring(iotHubConnectionString.indexOf("SharedAccessKey=") + 16, iotHubConnectionString.length)
    val currentTime = currentTimeMillis()
    // Expiry in epoch seconds; note that 365*60*60*1000 ms is 365 hours (about 15 days)
    val expiresOnTime = (currentTime + (365 * 60 * 60 * 1000)) / 1000
    val toSign = targetUri + "\n" + expiresOnTime
    val keyBytes = java.util.Base64.getDecoder.decode(sharedAccessKey.getBytes("UTF-8"))
    val signingKey = new SecretKeySpec(keyBytes, "HmacSHA256")
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(signingKey)
    val rawHmac = mac.doFinal(toSign.getBytes("UTF-8"))
    val signature = URLEncoder.encode(new String(java.util.Base64.getEncoder.encode(rawHmac), "UTF-8"), "UTF-8")
    // Returned as a full HTTP header line so it can be passed straight to curl -H;
    // sr uses the same encoded host name that was signed above
    "Authorization: SharedAccessSignature sr=" + targetUri + "&sig=" + signature + "&se=" + expiresOnTime + "&skn=service"
  }
}

The code above is the part I am most proud of getting to work as part of this blog post. I had to go through literally thousands of lines of Java code to figure out how Microsoft does it, and convert it to Scala. But please let me know if you can think of a better way of doing it.
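One possible variation, offered only as a sketch, is to parse the connection string into key/value pairs instead of using substring offsets, and to pass the expiry in as a parameter so the function is pure and easy to test. The `SasTokenSketch` object and its method names are hypothetical; the token layout (`sr`, `sig`, `se`, `skn`) and the signing steps follow the function above. Unlike that function, this one returns only the token value, without the “Authorization: ” header prefix.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

object SasTokenSketch {
  // Splits "Key1=Value1;Key2=Value2" into a map. Each part is split on the first '='
  // only, because base64-encoded keys can end with '=' padding.
  def parseConnectionString(cs: String): Map[String, String] =
    cs.split(";").filter(_.contains("=")).map { part =>
      val Array(key, value) = part.split("=", 2)
      key -> value
    }.toMap

  // Builds the token value for a hub-level policy; expiry is given in epoch seconds.
  def buildSasToken(connectionString: String, expiryEpochSeconds: Long): String = {
    val parts = parseConnectionString(connectionString)
    val resourceUri = URLEncoder.encode(parts("HostName"), UTF_8.name())
    val policyName = parts("SharedAccessKeyName")
    val keyBytes = Base64.getDecoder.decode(parts("SharedAccessKey"))

    // The string to sign is the encoded resource URI, a newline, then the expiry.
    val toSign = s"$resourceUri\n$expiryEpochSeconds"
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(keyBytes, "HmacSHA256"))
    val rawHmac = mac.doFinal(toSign.getBytes(UTF_8))
    val signature = URLEncoder.encode(Base64.getEncoder.encodeToString(rawHmac), UTF_8.name())

    s"SharedAccessSignature sr=$resourceUri&sig=$signature&se=$expiryEpochSeconds&skn=$policyName"
  }
}
```

In the notebook this could be called with the secret retrieved from Key Vault and an expiry such as `System.currentTimeMillis() / 1000 + 3600` for a one-hour token; the one-hour lifetime is just an example value.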

Take action by invoking direct method on the device

All the work we did so far was to prepare for this moment: to be able to call a direct method on our simulated crane and slow it down based on what the ML algorithm dictates. And we need to be able to do so as records stream into our final DataFrame, joinedDF. Let’s define how we can do that.

We need to write a class that extends ForeachWriter. This class needs to implement 3 methods:

  • open: used when we need to open new connections, for example to a data store to write records to
  • process: the work to be done whenever a new record is added to the streaming DataFrame goes here
  • close: used to close the connection opened in the first method, if any

We don’t need to open and therefore close any connections, so let’s check the process method line by line:

  1. Extracts device name from incoming data. If you run a display on joinedDF, you’ll see that the very first column is the device name
  2. Builds “sharedAccessSignature” by calling SASToken.tokenBuilder and passing in the device name
  3. Builds “deviceDirectMethodUri” in ‘{iot hub}/twins/{device id}/methods/’ format
  4. Builds “cmdParams” to include the name of the method to be called, response timeout, and the payload. Payload is the adjustment percentage that will be sent to the crane
  5. Builds a curl command with the required parameters and SAS token
  6. Executes the curl command

import org.apache.spark.sql.{ForeachWriter, Row}
import sys.process._

class StreamProcessor() extends ForeachWriter[Row] {
  // No connection to open; returning true tells Spark to go ahead and process the rows
  def open(partitionId: Long, epochId: Long): Boolean = {
    println("Starting.. ")
    true
  }

  def process(row: Row): Unit = {
    // First column is the device name; slice drops the first and last characters of the value
    val deviceName = row(0).toString().slice(1, row(0).toString().length - 1)
    val sharedAccessSignature = SASToken.tokenBuilder(deviceName)
    val deviceDirectMethodUri = "https://test-direct-method-scala-01.azure-devices.net/twins/" + deviceName + "/methods?api-version=2018-06-30"
    // Column 18 of joinedDF (the column after the ten min/max range columns) is sent as the payload
    val cmdParams = s"""{"methodName": "setHeightIncrements","responseTimeoutInSeconds": 10,"payload":""" + row(18).toString() + "}"
    val cmd = Seq("curl", "-X", "POST", "-H", sharedAccessSignature, "-H", "Content-Type: application/json", "-d", cmdParams, deviceDirectMethodUri)
    // Run curl and wait for it to finish; the exit code is ignored
    cmd.!
  }

  def close(errorOrNull: Throwable): Unit = {
    if (errorOrNull != null) {
      println(errorOrNull.toString())
    }
  }
}
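If you would like to unit test the request that is sent to the device, one option is to keep the string building separate from the side effect of running curl. The sketch below is an illustration only: the `DirectMethodRequest` object and its method names are hypothetical, the URI format and body fields are the ones used in this post, and it assumes the method name contains no characters that need JSON escaping.

```scala
object DirectMethodRequest {
  // URI format used in this post: https://{iot hub}/twins/{device id}/methods?api-version=...
  def uri(hubHostName: String, deviceId: String, apiVersion: String = "2018-06-30"): String =
    s"https://$hubHostName/twins/$deviceId/methods?api-version=$apiVersion"

  // JSON body with the method name, the response timeout and a numeric payload
  // (here the slow-down percentage). Assumes methodName needs no JSON escaping.
  def body(methodName: String, responseTimeoutInSeconds: Int, payload: Int): String =
    s"""{"methodName":"$methodName","responseTimeoutInSeconds":$responseTimeoutInSeconds,"payload":$payload}"""
}
```

The two strings can then be handed to curl, or to any HTTP client, inside the process method.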

The very last step is to call the methods we defined in the StreamProcessor class above as the records stream in from our connected crane. This is done by calling foreach sink on writeStream:

val query =
joinedDF
.writeStream
.foreach(new StreamProcessor())
.start()

And we’re done. We have a solution that is able to control, theoretically, millions of IoT devices using only 3 services on Azure.

The next step from here would be to add security measures and mechanisms to our solution, as well as monitoring and alerting. Hopefully I’ll get time to do them soon.

AWS Glue Part 3: Automate Data Onboarding for Your AWS Data Lake

When it comes to building data lakes in AWS s3, it makes even more sense to use Spark. Why? Because you can take advantage of Glue and build ETL jobs that generate and execute Spark code for you, serverless. It means you won’t need to worry about building and maintaining EMR clusters, or scaling them up and down depending on which job runs when. Glue takes care of all of it for you.

AWS Glue

In part one and part two of my posts on AWS Glue, we saw how to create crawlers to catalogue our data and then how to develop ETL jobs to transform them. Here we’ll see how we can use Glue to automate onboarding new datasets into data lakes.

On-board New Data Sources Using Glue

On-boarding new data sources can be automated using Terraform and AWS Glue. By on-boarding I mean having them traversed and catalogued, converting data to the types that are more efficient when queried by engines like Athena, and creating tables for the transferred data.

Below is the list of what needs to be implemented. Note that Terraform doesn’t fully support AWS Glue yet, so some steps need to be implemented manually. See here for more information.

   1- Create the s3 folder structure using Terraform (resource “aws_s3_bucket_object”). There are 2 folder structures that need to be created:

      a- The structure that matches the pattern at which data lands, for example: s3://my_bucket/raw_data/data_source_name/table_name/. You can create multiple folders here, one per table that you’re on-boarding.

      b- The structure to store data after it is transferred: s3://my_bucket/processed_data/data_source_name/table_name/.

   2- Create a new database for the source being on-boarded using Terraform. You can create this database in Glue (Terraform resource “aws_glue_catalog_database”) or in Athena (resource “aws_athena_database”). I couldn’t see any difference when I tried both options.

   3- Create a new Crawler using Terraform for the new data source (Terraform doesn’t support Glue Crawlers yet, so do this step manually until this issue is closed). This is the crawler responsible for inferring the structure of the data landing in s3, cataloguing it and creating tables in Athena.

      a- The crawler should point to the data related to the source. In the example above, it should point to s3://my_bucket/raw_data/data_source_name/

      b- The crawler will create one table in the Athena database per subfolder of the location it points to in s3 (these tables will be used as the source in ETL jobs later). In other words, we’ll need multiple folders in the source folder in s3, but only one crawler in Glue.

      c- Prefix the table name to specify the table type, in this case raw, e.g. “raw_”.

      Note that tables created by this crawler are only for storing metadata. They won’t be used by users or data engineers to query data; we’ll create another set of tables for that in step 5.

   4- Create a new Glue ETL job using Terraform

      a- Specify the schedule according to the frequency at which data lands in s3

      b- The ETL job will read the data in the raw folder, convert it to Parquet (or any other columnar format like ORC), and store it in the processed folder

   5- Create a new Crawler using Terraform to catalogue the transformed data (again, you need to do this manually for now)

      a- The schedule should match that of the ETL job in step 4. This is to make sure data processed and transformed by the ETL job is available for queries as soon as possible.

      b- It will create a table in Athena in the database where the source table is

      c- Prefix the table’s name: “processed_”
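The naming convention in the steps above can be summarised in a few lines of Scala. This is only an illustration of the convention, not part of Glue or Terraform: the `OnboardingLayout` object and its names are hypothetical, and it assumes the crawlers name each table after its s3 subfolder with the configured prefix in front.

```scala
object OnboardingLayout {
  // Where the data for one table lives, and what the two crawlers are expected to call it.
  final case class TableLayout(
    rawPath: String,
    processedPath: String,
    rawTable: String,
    processedTable: String)

  // Derives the s3 locations and table names for one table of one data source.
  def layoutFor(bucket: String, dataSource: String, table: String): TableLayout =
    TableLayout(
      rawPath = s"s3://$bucket/raw_data/$dataSource/$table/",
      processedPath = s"s3://$bucket/processed_data/$dataSource/$table/",
      rawTable = s"raw_$table",
      processedTable = s"processed_$table")
}
```

For example, `OnboardingLayout.layoutFor("my_bucket", "data_source_name", "table_name")` gives the two folder paths listed in step 1 together with the table names `raw_table_name` and `processed_table_name`.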

 

By following the steps above, we have a self-evolving data on-boarding process that we can take from one environment to another in a matter of minutes. A very obvious use case would be to move from non-prod to prod after each source/table is tested and verified, just by pointing our Terraform scripts to the new environment.

Hope this post helps, and please do not hesitate to give me feedback via comments.

AWS Glue Part 2: ETL your data and query the result in Athena

In part one of my posts on AWS Glue, we saw how Crawlers could be used to traverse data in s3 and catalogue them in AWS Athena.

Glue is a serverless service that could be used to create ETL jobs, schedule and run them. In this post we’ll create an ETL job using Glue, execute the job and then see the final result in Athena. We’ll go through the details of the code generated in a later post.

For the purpose of this tutorial I am going to use Glue to flatten the JSON returned by calling the Jira API. It’s a long and complex JSON response; you can see what it looks like here. We had to do it recently at work and it took 2 analysts 2 days to understand the structure and list out all the fields. Using Glue, it’ll take 15 minutes!

Note that if your JSON file contains arrays and you want to be able to flatten the data in arrays, you can use jq to get rid of array and have all the data in JSON format. More about jq here.
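To show what “flattening” means here, below is a small plain-Scala sketch that turns nested structs into top-level columns with dotted names. It is not what Glue generates (Glue produces a Spark script for you); the `FlattenSketch` object is hypothetical, and the nested record in the usage note is made up, borrowing only the struct names fields, issuetype and project from the Jira response.

```scala
object FlattenSketch {
  // Recursively flattens nested maps into a single level, joining keys with '.'.
  // Non-map values (including lists) are kept as they are.
  def flatten(record: Map[String, Any], prefix: String = ""): Map[String, Any] =
    record.flatMap {
      case (key, nested: Map[_, _]) =>
        flatten(nested.asInstanceOf[Map[String, Any]], s"$prefix$key.")
      case (key, value) =>
        Map(s"$prefix$key" -> value)
    }
}
```

For a record such as `Map("key" -> "PROJ-1", "fields" -> Map("issuetype" -> Map("name" -> "Bug")))`, the result has the columns `key` and `fields.issuetype.name`.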

Let’s get started:

1. Navigate to the AWS Glue console and click on Jobs under ETL in the left-hand pane

2. Click on the Add job button to kick off the Add job wizard

3. Fill in the job properties. Most of them are self-explanatory:

a. Provide a name.

b. Provide a role that has full Glue access as well as access to the s3 buckets where this job is going to read data from, write results to, and save the Spark script it generates.

c. Specify whether you’re going to use the Glue interface to develop the basics of your job, have it run an existing script that is already pushed to s3, or start writing the Spark code from scratch.

In this example we’ll select option 1, to have Glue generate the script for us. We get the option to edit it later, if need be.

d. Specify the s3 buckets where your script will be saved for future use and where temporary data will be stored:

[Image: ETL job properties]

4. Select where your source data is. This section lists the tables in Athena databases that the Glue role has access to. We’ll use the table we created in part one:

[Image: ETL choose source]

5. Next step? You guessed it right, choosing the target for your ETL job. I want to store the result of my job as a new table, convert my JSON to Parquet (since it’s faster and less expensive for Athena to query data stored in columnar format) and specify where I want my result to be stored in s3:

[Image: ETL choose target]

6. Here’s the exciting part. Glue matches all the columns in the source table to columns in the target table it’s going to create for us. This is where we can see what our JSON file actually looks like and flatten it by taking the columns we’re interested in out of their respective JSON structs:

a. Expand fields, issuetype and project:

[Image: ETL map source to destination]

b. Remove all the unwanted columns by clicking on the cross button next to them on the Target side. We can add the ones that we want to have in our flattened output one by one, by clicking on Add column on the top right and then mapping columns in the source to the new ones we just created:

[Image: ETL map source to destination 2]

7. Click Finish

8. The next page you’ll see is Glue’s script editor. Here you can review the Spark script generated for you and either run it as it is or make changes to it. For now we’re going to run it as it is. Click on the Run job button. You’ll be asked to provide job parameters; put in 10 for the number of concurrent DPUs and click on Run job:

[Image: ETL run job]

Wait for the job to finish and head to the location in s3 where you stored the result. You’ll see a new file created there for you:

[Image: ETL result in s3]

Now that we have our data transformed and converted to Parquet, it’s time to make it available for SQL queries. If you went through my first post on Glue, you’d know the answer is to use Crawlers to create the table in Athena. Follow those steps, create a crawler and have your table available to be queried using SQL. I have done that and this is what my result looks like for what we did together in this post:

[Image: ETL result in Athena]

Easy, right? You don’t have to worry about provisioning servers, having the right software and version installed on them, and then competing with other applications to acquire resources. That is the power of serverless services offered by cloud providers, which I personally find very useful and a real time and cost saver.

 

AWS Glue Part 1: Discover and Catalogue Data Stored in s3

AWS Glue

Glue is a fully managed extract, transform, and load (ETL) service offered by Amazon Web Services. Glue discovers your data (stored in S3 or other databases) and stores the associated metadata (e.g. table definition and schema) in the Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.

Once your ETL job is ready, you can schedule it to run on Glue’s fully managed, scale-out Apache Spark environment. It provides a flexible scheduler with dependency resolution, job monitoring, and alerting.

Glue provides out-of-the-box integration with Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, and any Apache Hive Metastore-compatible application.

Discover Data Using Crawlers

AWS Glue is able to traverse data stores using Crawlers and populate data catalogues with one or more metadata tables. These tables could be used by ETL jobs later as source or target.

Below are the steps to add a crawler to analyse and catalogue data in an s3 bucket:

1. Sign in to the AWS Management Console and open the AWS Glue console. Choose the Crawlers tab.

2. Choose Add crawler; it’ll launch the Add crawler wizard. Follow the wizard:

a. Specify a name and description for your crawler.

b. Add a data store. Here you have options to specify an s3 bucket or a JDBC connection. After selecting s3, select the option for “Specified path in my account” and select the folder icon next to “Include path” to select where the data to be crawled is:

[Image: Crawler add data source]

c. You can add another data source, in case you want to join data from 2 different places together:

[Image: Crawler add another data source]

d. Choose an IAM role that has permissions to work with Glue. This role should have full access to run Glue jobs as well as access to the s3 buckets it reads data from and stores scripts to:

[Image: Crawler choose IAM role]

e. Create a schedule for your Crawler. You can have it run on demand or choose one of the options in the drop-down:

[Image: Crawler schedule]

f. The next step is to choose the location where the output from your crawler will be stored. This is a database in Athena, and you can prefix the names of the tables created by your crawler so they are easily distinguishable from other tables in the database:

[Image: Crawler configure output]

g. Review your crawler’s settings and click on Finish. You’ll be redirected to the main Crawlers page, where your crawler is listed.

h. Click on “Run it now?”:

[Image: Crawlers main page]

When the crawler has finished running, go to the Athena console and check that your table is there:

[Image: Athena source table]

Examine the table’s DDL. It’s an external table pointing to the location in s3 where your Crawler “crawled”. Now you can start writing queries on it. It’s the first table you created using Glue crawlers. First of many. 🙂