At the time of writing, there is no built-in support for loading MS Excel data into Spark in the cloud (here we take Azure Databricks as the example). Some links found during research sounded helpful, and this article describes some practices based on our recent projects.
Data Preparation
Sample MS Excel data is shown below. The workbook contains two sheets: Companies and Users.
| CompanyID | CompanyName | Description |
| --- | --- | --- |
| 100 | Comp1 | Desc1 |
| 101 | Comp2 | Desc2 |
| 102 | Comp3 | Desc2 |
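To try the solutions below without the original file, the sample workbook can be recreated programmatically with pandas. This is a sketch only: the Users sheet columns are made up here (the article only shows the Companies sheet), and it assumes an Excel engine such as openpyxl is installed.

```python
import pandas as pd

# Companies sheet, matching the table above.
companies = pd.DataFrame({
    "CompanyID": [100, 101, 102],
    "CompanyName": ["Comp1", "Comp2", "Comp3"],
    "Description": ["Desc1", "Desc2", "Desc2"],
})

# Users sheet: hypothetical columns, only the sheet name comes from the article.
users = pd.DataFrame({
    "UserID": [1, 2],
    "CompanyID": [100, 101],
})

# Writing .xlsx requires an engine such as openpyxl (pip install openpyxl).
with pd.ExcelWriter("Samples.xlsx") as writer:
    companies.to_excel(writer, sheet_name="Companies", index=False)
    users.to_excel(writer, sheet_name="Users", index=False)
```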
Solution 1: Read the Excel data using Pandas, then convert the Pandas DataFrame into a Spark DataFrame.
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pandas as pd

filepath = '/dbfs/xxx/xxx/Samples.xlsx'
companies_pd_df = pd.read_excel(filepath, sheet_name='Companies')

companiesSchema = StructType([
    StructField("CompanyID", IntegerType(), False),
    StructField("CompanyName", StringType(), False),
    StructField("Description", StringType(), False)])

companies_df = spark.createDataFrame(companies_pd_df, schema=companiesSchema)
Note that:
- When converting a Pandas DataFrame into a Spark DataFrame, we need to define the schema manually; otherwise the conversion will probably fail.
- In some cases, the resulting Spark DataFrame may contain dummy data or extra, unwanted rows (for example, blank Excel rows read in as NaN). We can filter such rows out, e.g. companies_df = companies_df.filter(isnan("CompanyID") != True)
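The same cleanup can also be done on the pandas side before the conversion. The sketch below uses made-up in-memory data to stand in for a sheet read with a trailing blank row (which pd.read_excel surfaces as a row of NaN); dropping it first also restores the integer dtype that the IntegerType schema expects.

```python
import pandas as pd

# Made-up data mimicking an Excel sheet read in with a trailing blank row:
# every column of that row comes back as NaN.
companies_pd_df = pd.DataFrame({
    "CompanyID": [100, 101, 102, float("nan")],
    "CompanyName": ["Comp1", "Comp2", "Comp3", float("nan")],
    "Description": ["Desc1", "Desc2", "Desc2", float("nan")],
})

# Drop rows whose CompanyID is missing, then restore the integer dtype
# (the NaN forces the column to float) so it matches IntegerType in the schema.
clean_pd_df = companies_pd_df.dropna(subset=["CompanyID"]).copy()
clean_pd_df["CompanyID"] = clean_pd_df["CompanyID"].astype(int)

print(clean_pd_df)
```

With a clean frame, spark.createDataFrame(clean_pd_df, schema=companiesSchema) no longer needs the isnan() filter on the Spark side.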
Solution 2: Use Spark Excel.
More detailed documentation can be found in the spark-excel (com.crealytics) project documentation. Code sample in Azure Databricks:
%scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.crealytics.spark.excel._
val companiesSchema = StructType(Array(
  StructField("CompanyID", IntegerType, nullable = true),
  StructField("CompanyName", StringType, nullable = true),
  StructField("Description", StringType, nullable = true)))

val filepath = "/mnt/xxx/xxx/Samples.xlsx"

val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Companies")
  .option("useHeader", "true")
  .schema(companiesSchema)
  .load(filepath)

df.show()
Note that:
- Currently the Spark Excel plugin exposes only a Scala API, not a Python one, so in a Python notebook we have to call it through the %scala magic command.
- Errors or exceptions may occur with some versions of the library. I have tested this successfully with version com.crealytics:spark-excel_2.11:0.12.0.
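If the rest of the notebook is in Python, a common pattern is to hand the DataFrame read in the Scala cell over to Python through a temporary view, since both languages share the same SparkSession. A sketch of the pattern (the view name `companies` is our choice, not from the original):

```scala
%scala
// In the Scala cell: expose the DataFrame read via spark-excel as a temp view.
df.createOrReplaceTempView("companies")
```

```python
%python
# In a Python cell of the same notebook: the temp view is visible through the
# shared SparkSession and comes back as a regular PySpark DataFrame.
companies_df = spark.sql("SELECT * FROM companies")
companies_df.show()
```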