
Handling Excel Data in Azure Databricks

As of now, Spark has no built-in support for loading Excel data in the cloud (here we take Azure Databricks as the example). Based on some research, the following links are helpful:

https://stackoverflow.com/questions/44196741/how-to-construct-dataframe-from-a-excel-xls-xlsx-file-in-scala-spark

https://stackoverflow.com/questions/16888888/how-to-read-a-xlsx-file-using-the-pandas-library-in-ipython

This article describes some practices based on our recent projects.

Data Preparation

Sample MS Excel data is shown below. The workbook contains two sheets, Companies and Users; the Companies sheet looks like this:

CompanyID  CompanyName  Description
100        Comp1        Desc1
101        Comp2        Desc2
102        Comp3        Desc2

Solution 1: Read the Excel data using pandas, then convert the pandas DataFrame into a Spark DataFrame.

from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd

# Read the Companies sheet through pandas; the /dbfs prefix exposes DBFS as a local path.
filepath = '/dbfs/xxx/xxx/Samples.xlsx'
companies_pd_df = pd.read_excel(filepath, sheet_name='Companies')

# Define the schema explicitly rather than letting Spark infer it (see note 1 below).
companiesSchema = StructType([
  StructField("CompanyID", IntegerType(), False),
  StructField("CompanyName", StringType(), False),
  StructField("Description", StringType(), False)])
companies_df = spark.createDataFrame(companies_pd_df, schema=companiesSchema)
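
After the conversion, a quick sanity check with the standard Spark DataFrame API confirms the schema and contents:

companies_df.printSchema()
companies_df.show()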

Note that:

  1. When converting a pandas DataFrame into a Spark DataFrame, we need to define the schema manually; otherwise the conversion will likely fail.
  2. In some cases, the created Spark DataFrame may contain dummy data or extra unnecessary rows. In that case, we can filter those rows out, e.g. companies_df = companies_df.filter(isnan("CompanyID") != True). An alternative on the pandas side is sketched after this list.
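
As an alternative to the post-conversion filter in note 2, the dummy rows can be dropped on the pandas side before converting. A minimal sketch, reusing the Companies sheet and schema from above:

# Drop rows with a missing CompanyID before the conversion,
# so the non-nullable schema defined above is satisfied.
companies_pd_df = companies_pd_df.dropna(subset=['CompanyID'])
companies_df = spark.createDataFrame(companies_pd_df, schema=companiesSchema)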

Solution 2: Use Spark Excel.

More detailed documentation can be found in the spark-excel project on GitHub (https://github.com/crealytics/spark-excel). Code sample in Azure Databricks:

%scala

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.crealytics.spark.excel._

// Define the schema up front instead of inferring it from the sheet.
val companiesSchema = StructType(Array(
  StructField("CompanyID", IntegerType, nullable = true),
  StructField("CompanyName", StringType, nullable = true),
  StructField("Description", StringType, nullable = true)))
val filepath = "/mnt/xxx/xxx/Samples.xlsx"
val df = spark.read
    .format("com.crealytics.spark.excel")
    .option("sheetName", "Companies")  // which worksheet to read
    .option("useHeader", "true")       // treat the first row as column headers
    .schema(companiesSchema)
    .load(filepath)
df.show()

Note that:

  1. Currently the Spark Excel plugin is only available for Scala, not for Python. Hence, in a Python notebook we have to use the %scala magic command; one way to hand the result back to Python is sketched after this list.
  2. Errors/exceptions may happen with some versions. I have tested this successfully with version com.crealytics:spark-excel_2.11:0.12.0.
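
Since the plugin produces the DataFrame in a Scala cell, a common way to continue working in Python is to register it as a temporary view and read it back through the shared SparkSession. This is plain Spark functionality, not part of the plugin, so treat it as a sketch:

%scala
// Scala cell: expose the DataFrame to other languages in this notebook.
df.createOrReplaceTempView("companies")

Then, in a regular Python cell of the same notebook:

# Python cell: pick up the view through the shared SparkSession.
companies_df = spark.table("companies")
companies_df.show()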