Recently I have been working on a platform analytics solution for veracity.com, the open industry data platform from DNV GL. Data ingestion is an essential part of the end-to-end data analytics flow.
Application Insights (AI) logs are one of the key data sources, so we need to build a streaming analytics solution for this kind of real-time data. Here we take availability log data as the example; you can also use other AI data sources (PageViews….).
Export the data
There is a nice document for enabling continuous export from Application Insights.
Create a storage account
When creating it, I suggest using “Standard-LRS“ with the Cool access tier, simply because we use the storage account only as a temporary staging area. (In production, we will move the data into a VERACITY container for storage.)
Configure the export settings
In this case, we are exporting one type:
Take a closer look at the folder content. The screenshot below shows the PageViews data between 12am and 1pm on the 28th of September.
Please note that:
- The date and time are UTC and indicate when the telemetry was deposited in the store – not when it was generated.
- The folders under Availability are stored in yyyy-mm-dd/hh/ structure.
- Each blob is a text file that contains multiple ‘\n’-separated rows.
- Each blob contains the telemetry processed over a period of roughly half a minute, which means that each hourly folder holds about 100 to 120 files.
- Each row represents a telemetry data point such as a request or page view.
- Each row is a JSON document. The detailed data structure can be found here.
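As a quick sanity check of the folder convention above, here is a small Scala sketch (plain JVM code, no Spark needed) that derives the export folder for a given UTC deposit time. The `Availability` prefix matches the export type above; the helper name `exportFolder` is my own, not part of any SDK:

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Folders are named yyyy-MM-dd/HH, using the UTC time the telemetry
// was deposited in the store (not the time it was generated).
val folderFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd/HH").withZone(ZoneOffset.UTC)

def exportFolder(depositedAt: Instant): String =
  s"Availability/${folderFormat.format(depositedAt)}"

println(exportFolder(Instant.parse("2018-09-28T12:30:00Z"))) // Availability/2018-09-28/12
```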
With the handy Azure Storage Explorer, it is pretty easy to check the content of a blob file.
Please note that Application Insights implements the same logic as IP Anonymization in Google Analytics: for every IP address collected by Application Insights, the last octet is set to 0 (you can see this highlighted in the screenshot above).
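The anonymization rule is easy to sketch in a few lines of Scala; `anonymizeIp` is a hypothetical helper name for illustration, not part of any Application Insights SDK:

```scala
// Zero the last octet of an IPv4 address, mirroring what Application
// Insights does to collected client IPs before storing them.
def anonymizeIp(ip: String): String = {
  val octets = ip.split('.')
  require(octets.length == 4, s"not an IPv4 address: $ip")
  (octets.take(3) :+ "0").mkString(".")
}

println(anonymizeIp("203.0.113.57")) // prints 203.0.113.0
```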
Read the data from blob
There are many ways to transfer data out of a storage account. Here we are going to use Azure Databricks.
Add an Azure Databricks service by following the instructions here.
Add a cluster; the basic one described here is sufficient.
Add a notebook, named for example: Streaming Data Ingest
There are two ways for Azure Databricks to read data from blob storage: read the blob storage directly through the HDFS API, or mount the blob storage container into the Databricks file system. See more about how to access Blob Storage here. We choose the second option.
Mount the storage folder into the Databricks File System (DBFS). To mount a Blob storage container, or a folder inside a container, use the following command:
dbutils.fs.mount(
  source = "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>",
  mountPoint = "/mnt/export",
  extraConfigs = Map("<conf-key>" -> "<conf-value>"))
Then you can check the folder data with command:
%fs ls /mnt
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("BlobStorageStreaming")
  .getOrCreate()

import spark.implicits._
In Spark 2.0+, Structured Streaming (the DataFrame/Dataset API) is preferred over the Spark Core API. However, the Availability log data is deeply nested JSON with several hierarchy levels, which makes it awkward to define the schema programmatically. The trick here is to first read one blob file to infer the schema, then reuse that schema when creating the structured stream.
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
  "<your key>")

val df = spark.read.json("wasbs://<your container>@<your-storage-account-name>.blob.core.windows.net/veracity_ai_prod_41f11b1f17bb43d49ba51beabf2dd0a9/Availability/2018-06-04/06")
val availabilitySchema = df.schema
print(availabilitySchema)
Using Structured Streaming to access the files in the blob folder, we emulate a stream by reading one file at a time, in the chronological order in which they were created.
// Use a file sink to store the output data into blob storage again.
val inputPath = "/mnt/export"
val outputPath = "/mnt/output"

val streamingInput = spark.readStream
  .schema(availabilitySchema)
  .option("maxFilesPerTrigger", 1)
  .json(inputPath + "/*/*/")

streamingInput.isStreaming
streamingInput.createOrReplaceTempView("logs")
Transform the data
We use SQL for the transformation, then convert the DataFrame into a Dataset; this is recommended if you need to do more analytics afterwards. In Structured Streaming, the query object is a handle to the streaming query that runs in the background.
val sqlDF = spark.sql(
  "SELECT internal.data.id as ID, availability.testTimestamp as timestamp, " +
  "availability.testName as name, availability.runLocation as location, " +
  "availability.message as message, availability.durationMetric.value as duration " +
  "FROM logs")
val sqlDS = sqlDF.as[(String, String, String, String, String, Double)]
//sqlDF.show()
//sqlDF.printSchema()

val query2 = sqlDS
  .writeStream
  .format("csv")
  .option("path", outputPath)
  .option("checkpointLocation", outputPath + "/checkpoint/")
  .queryName("tblAvailability2")
  .start()
Check the output folder in blob storage to make sure the files are handled one by one continuously (in micro-batches).
Note that at the time of writing this article, Structured Streaming does not support a JDBC sink natively, so we need to build our own JDBC sink.
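The usual workaround is to extend Spark's ForeachWriter. The sketch below is a minimal, hypothetical JDBCSink, assuming a SQL table named tblAvailability with two string columns; the connection string and column layout are assumptions for illustration, and connection handling is kept deliberately simple (one connection per partition, no batching or prepared statements):

```scala
import java.sql.{Connection, DriverManager, Statement}
import org.apache.spark.sql.{ForeachWriter, Row}

// Minimal JDBC sink sketch for Structured Streaming: open a connection
// per partition, insert each row, and close the connection at the end.
class JDBCSink(url: String, user: String, pwd: String) extends ForeachWriter[Row] {
  var connection: Connection = _
  var statement: Statement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    connection = DriverManager.getConnection(url, user, pwd)
    statement = connection.createStatement()
    true
  }

  override def process(row: Row): Unit = {
    // Assumed target table; adapt the columns to your own schema.
    statement.executeUpdate(
      s"INSERT INTO tblAvailability VALUES ('${row.getString(0)}', '${row.getString(1)}')")
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (connection != null) connection.close()
  }
}

// Wiring it into the streaming query might look like:
// val writer = new JDBCSink("jdbc:sqlserver://<server>;database=<db>", "<user>", "<pwd>")
// sqlDF.writeStream.foreach(writer).start()
```

In production you would also want batching and parameterized statements rather than string-interpolated SQL; this sketch only shows the ForeachWriter lifecycle.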