I'm trying to use PySpark to read an Avro file into a dataframe, do some transformations, and write the dataframe out to HDFS as Hive tables using the code below. The file format for the Hive tables is Parquet.
df.write.mode("overwrite").format("hive").insertInto("mytable")
# this writes a partition every day; when re-run, it overwrites that run day's partition
The problem is that when the source data has a schema change, such as an added column, the write fails with an error saying the source file structure does not match the existing table schema. How should I handle this case programmatically? Many thanks for your help.
Edit: I want the new schema changes to be reflected in the target table. I'm looking for a programmatic way to do this.
You should be able to query the system tables. You can run a comparison on these to see what changes have occurred since your last run.
I'm not sure which Spark version you're using, but if it is >= 2.2, you can use ALTER TABLE in a Spark SQL statement. For your case it would be:
spark.sql("ALTER TABLE mytable add columns (col1 string, col2 timestamp)")
You can check this document for more information about Hive DDL operations in Spark: https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html#add-columns
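If you want to do this programmatically, here's a rough sketch of the idea (not from the original answer): it assumes the incoming DataFrame is df and the target table is mytable, and that the new columns are simple types; complex types may need extra care when mapping to DDL.
# Compare the incoming schema with the target table and add any missing columns
existing_cols = {f.name.lower() for f in spark.table("mytable").schema.fields}
new_fields = [f for f in df.schema.fields if f.name.lower() not in existing_cols]

if new_fields:
    cols_ddl = ", ".join("{} {}".format(f.name, f.dataType.simpleString()) for f in new_fields)
    spark.sql("ALTER TABLE mytable ADD COLUMNS ({})".format(cols_ddl))

# Note: insertInto matches columns by position, so make sure the dataframe's
# column order matches the (now extended) table before writing.
df.write.mode("overwrite").insertInto("mytable")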
It is possible to evolve the schema of the Parquet files. In your use case this can be achieved with schema merging.
As per your requirement, a new partition is created every day and you are also using the overwrite option, which suits schema evolution well.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files
setting the global SQL option spark.sql.parquet.mergeSchema to true.
Code from the official Spark documentation:
from pyspark.sql import Row
# spark is from the previous example.
# Create a simple DataFrame, stored into a partition directory
sc = spark.sparkContext
squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
                                  .map(lambda i: Row(single=i, double=i ** 2)))
squaresDF.write.parquet("data/test_table/key=1")
# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
                                .map(lambda i: Row(single=i, triple=i ** 3)))
cubesDF.write.parquet("data/test_table/key=2")
# Read the partitioned table
mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()
# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column appeared in the partition directory paths.
# root
# |-- double: long (nullable = true)
# |-- single: long (nullable = true)
# |-- triple: long (nullable = true)
# |-- key: integer (nullable = true)
Note: the suggestion would be not to enable the setting at the global level, as it's quite a heavy operation.
Related
I have a delta table dbfs:/mnt/some_table in pyspark, which as you know is a folder with a series of .parquet files. I want to get the last modified time of that table without having to query data in the table.
If I do dbutils.fs.ls(path) for the table I get a modificationTime which seems to just be now() every time I query it. This makes me believe modificationTime doesn't work accurately on folders in pyspark.
I could just get the modificationTime of every .parquet file in the folder and use the greatest number, but I'd wonder if there is another already built in way, or more performant way to get the last modification time of a delta table.
The simplest way to do that is to query the history of the table - it's a relatively lightweight operation that reads the data from the transaction log:
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, pathToTable)
lastOperationTimestamp = deltaTable.history(1).select("timestamp").collect()[0][0]
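The same information is also available through plain Spark SQL if you prefer not to import the Delta Python API. A sketch, assuming the same pathToTable; DESCRIBE HISTORY also reads only the transaction log and returns the newest entry first:
# Equivalent Spark SQL form; first row is the most recent operation
lastOperationTimestamp = (spark.sql("DESCRIBE HISTORY delta.`{}`".format(pathToTable))
                               .select("timestamp")
                               .first()[0])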
More of a theoretical question as to the best way to set something up.
I have quite a large dataframe in pandas (roughly 330 columns) and I am hoping to transfer it into a table in SQL Server.
My current process has been to export the dataframe as a .csv and then use the Import Flat File wizard to create the table; from then on I have a direct connection set up in Python to interact with it. For smaller dataframes this has worked fine, as it has been easy enough to adjust the column types and eventually get it to work.
When doing it on the larger dataframes my problem is that I am frequently getting the following message:
TITLE: Microsoft SQL Server Management Studio
Error inserting data into table. (Microsoft.SqlServer.Import.Wizard)
The given value of type String from the data source cannot be converted to type nvarchar of the specified target column. (System.Data)
String or binary data would be truncated. (System.Data)
It doesn't tell me which specific column is causing the problem, so is there a more efficient way to get this data in than going through each column manually?
Any help would be appreciated! Thanks
This error occurs when you try to write a string value into a column and the column's size limit is exceeded. You can either increase the column size or truncate the values before inserting.
Let's say column A in df maps to a column of type varchar(500). Try the following before insertion:
df.A = df.A.apply(lambda x: str(x)[:500])
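If you'd rather find out which columns are the offenders before truncating anything, here is a small sketch; the max_lengths mapping of target column sizes is hypothetical, so fill it in from your table definition:
# Hypothetical mapping of target column name -> nvarchar length in SQL Server
max_lengths = {'A': 500, 'B': 100}

for col, limit in max_lengths.items():
    longest = int(df[col].astype(str).str.len().max())
    if longest > limit:
        print("Column {}: longest value is {} chars, target allows {}".format(col, longest, limit))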
Below is the sqlalchemy alternative for the insertion.
connect_str="mssql+pyodbc://<username>:<password>#<dsnname>"
To create a connection -
from sqlalchemy import create_engine
engine = create_engine(connect_str)
Create the table -
from sqlalchemy import Table, MetaData, Column, Integer
m = MetaData()
t = Table('example', m,
          Column('column_1', Integer),
          Column('column_2', Integer),
          # ... more columns as needed
          )
m.create_all(engine)
Once created, do the following:
df.to_sql('example', engine, if_exists='append')
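Alternatively, if you let to_sql create the table, you can size the string columns explicitly via the dtype argument so the wizard's guesses never come into play. A sketch; the lengths here are derived from the current data, so pad them if future loads may be longer:
from sqlalchemy.types import NVARCHAR

# Size each string (object) column to the longest value currently in the dataframe
dtype_map = {
    col: NVARCHAR(length=int(df[col].astype(str).str.len().max()))
    for col in df.select_dtypes(include='object').columns
}
df.to_sql('example', engine, if_exists='replace', index=False, dtype=dtype_map)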
# Start and End define the range of dates.
from datetime import date

start = date(2019, 1, 20)
end = date(2019, 1, 22)

# daterange() is a helper that yields each date between start and end
for single_date in daterange(start, end):
    query = "(SELECT ID, firstname, lastname, date FROM dbo.emp WHERE date = '%s') emp_alias" % (single_date.strftime("%Y-%m-%d %H:%M:%S"))
    df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
    df.write.format("parquet").mode("ignore").partitionBy("Date").save("/mnt/data/empData.parquet")
I have data for a number of days in a table and I need it as Parquet files partitioned by date. I have to save day by day in a loop because the data is huge and I can't put all the days (e.g. a year's worth of data) in one dataframe. I tried all the save modes. In 'Ignore' mode it saves only the first day. In 'Overwrite' mode, it saves only the last day. In 'append' mode, it adds the data. What I need is: if data is already available for a day, it should skip that day and leave what's already there, but if no data is available, it should write that day's data to the Parquet file partitioned by date. Please help.
There is currently no PySpark SaveMode that will allow you to preserve the existing partitions, while inserting the new ones, if you also want to use Hive partitioning (which is what you’re asking for, when you call the method partitionBy). Note that there is the option to do the opposite, which is to overwrite data in some partitions, while preserving the ones for which there is no data in the DataFrame (set the configuration setting "spark.sql.sources.partitionOverwriteMode" to "dynamic" and use SaveMode.Overwrite when writing datasets).
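For reference, a minimal sketch of that opposite, dynamic-overwrite behaviour (not what you asked for, but worth knowing; available from Spark 2.3):
# Overwrites only the partitions present in the dataframe, preserving all others
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").partitionBy("Date").parquet("/mnt/data/empData.parquet")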
You can still achieve what you want though, by first creating a set of all the already existing partitions. You could do that with PySpark, or using any of the libraries that will allow you to perform listing operations in file systems (like Azure Data Lake Storage Gen2) or key-value stores (like AWS S3). Once you have that list, you use it to filter the new dataset for the data you still want to write. Here’s an example with only PySpark:
In [1]: from pyspark.sql.functions import lit
...: df = spark.range(3).withColumn("foo", lit("bar"))
...: dir = "/tmp/foo"
...: df.write.mode("overwrite").partitionBy("id").parquet(dir) # initial seeding
...: ! tree /tmp/foo
...:
...:
/tmp/foo
├── id=0
│ └── part-00001-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
├── id=1
│ └── part-00002-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
├── id=2
│ └── part-00003-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
└── _SUCCESS
3 directories, 4 files
In [2]: df2 = spark.range(5).withColumn("foo", lit("baz"))
...: existing_partitions = spark.read.parquet(dir).select("id").distinct()
...: df3 = df2.join(existing_partitions, "id", how="left_anti")
...: df3.write.mode("append").partitionBy("id").parquet(dir)
...: spark.read.parquet(dir).orderBy("id").show()
...:
...:
+---+---+
|foo| id|
+---+---+
|bar| 0|
|bar| 1|
|bar| 2|
|baz| 3|
|baz| 4|
+---+---+
As you can see, only 2 partitions were added. The ones that were already existing have been preserved.
Now, getting the existing_partitions DataFrame required a read of the data. Spark won’t actually read all of the data though, just the partitioning column and the metadata. As mentioned earlier, you could get this data as well using any of the APIs relevant to where your data is stored. In my particular case as well as in yours, seeing as how you’re writing to the /mnt folder on Databricks, I could’ve simply used the built-in Python function os.walk: dirnames = next(os.walk(dir))[1], and created a DataFrame from that.
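Spelled out, that os.walk alternative could look like the following; a sketch, assuming the output path is visible to the driver as a local or DBFS-mounted directory:
import os

dir = "/tmp/foo"
dirnames = next(os.walk(dir))[1]                        # e.g. ['id=0', 'id=1', 'id=2']
existing_ids = [int(d.split("=")[1]) for d in dirnames]
existing_partitions = spark.createDataFrame([(i,) for i in existing_ids], ["id"])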
By the way, the reason you get the behaviours you’ve seen is:
ignore mode
In 'Ignore' mode it saves only the first day.
Because you’re using a for-loop and the output directory was initially probably non-existent, the first date partition will be written. In all subsequent iterations of the for-loop, the DataFrameWriter object will not write anymore, because it sees there is already some data (one partition, for the first date) there.
overwrite mode
In 'Overwrite' mode, it saves only the last day.
Actually, it saved a partition in each iteration of the for-loop, but because you’re instructing the DataFrameWriter to overwrite, it will remove all previously existing partitions in the directory. So it looks like only the last one was written.
append mode
In 'append' mode, it adds the data
This doesn’t need further explanation.
One suggestion: there’s probably no need to read from the database multiple times (using the for-loop to create multiple different queries and jdbc-connections). You could probably update the query to have WHERE date BETWEEN %(start) AND %(end), remove the for-loop altogether and enjoy an efficient write.
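A sketch of that single-query version, reusing your jdbcUrl and connectionProperties (the exact date-literal format depends on your database; combine this with the anti-join filtering shown above if some dates may already exist in the target):
query = ("(SELECT ID, firstname, lastname, date FROM dbo.emp "
         "WHERE date BETWEEN '{}' AND '{}') emp_alias").format(
            start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d"))

df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
df.write.format("parquet").mode("append").partitionBy("Date").save("/mnt/data/empData.parquet")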
I'm building an interactive browser and editor for larger-than-memory datasets which will be later processed with Pandas. Thus, I'll need to have indexes on several columns that the dataset will be interactively sorted or filtered on (database indexes, not Pandas indexing), and I'd like the dataset file format to support cheap edits without rewriting most of the file. Like a database, only I want to be able to just send the files away afterwards in a Pandas-compatible format, without exporting.
So, I wonder if any of the formats that Pandas supports:
Have an option of building database-indexes on several columns (for sorting and filtering)
Can be updated 'in-place' or otherwise cheaply without shifting the rest of the records around
Preferably both of the above
What are my options?
I'm a complete noob in Pandas, and so far it seems that most of the formats are simply serialized sequential records, just like CSV, and at most can be sorted or indexed on one column. If nothing better comes up, I'll have to either build the indexes myself externally and juggle the edited rows manually before exporting the dataset, or dump the whole dataset in and out of a database—but I'd prefer avoiding both of those.
Edit: more specifically, it appears that Parquet has upper/lower bounds recorded for each column in each data page, and I wonder if these can be used as sort-of-indexes to speed up sorting on arbitrary columns, or whether other formats have similar features.
I would argue that parquet is indeed a good format for this situation. It maps well to the tabular nature of pandas dataframes, stores most common data in efficient binary representations (with optional compression), and is a standard, portable format. Furthermore, it allows you to load only those columns or "row groups" (chunks) you require. This latter gets to the crux of your problem.
Pandas' .to_parquet() will automatically store metadata relating to the indexing of your dataframe, and create the column max/min metadata as you suggest. If you use the fastparquet backend, you can use the filters= keyword when loading to select only some of the row-groups (this does not filter within row-groups):
pd.read_parquet('split', filters=[('cat', '==', '1')],
                engine='fastparquet')
(selects only row-groups where some values of field 'cat' are equal to '1')
This can be particularly efficient, if you have used directory-based partitioning on writing, e.g.,
out.to_parquet('another_dataset.parq', partition_on=['cat'],
               engine='fastparquet', file_scheme='hive')
Some of these options are only documented in the fastparquet docs, and maybe the API of that library implements slightly more than is available via the pandas methods; and I am not sure how well such options are implemented with the arrow backend.
Note further that you may wish to read/save your dataframes using dask's to/read_parquet methods. Dask will understand the index if it is 1D and perform the equivalent of the filters= operation automatically, loading only the relevant parts of the data on disk when you filter on the index. Dask is built to deal with data that does not easily fit into memory and to do computations in parallel.
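A sketch of that dask route, assuming the partitioned dataset written above and that dask and fastparquet are installed:
import dask.dataframe as dd

# Only the row-groups / directory partitions matching the filter are read from disk
ddf = dd.read_parquet('another_dataset.parq', engine='fastparquet',
                      filters=[('cat', '==', '1')])
result = ddf.compute()   # materialize the selection as a regular pandas DataFrame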
(in answer to some of the comments above: Pandas-SQL interaction is generally not efficient, unless you can push the harder parts of the computation into a fast DB backend - in which case you don't really have a problem)
Edit: some specific notes:
parquet is not in general made for atomic record updating; but you could write to chunks of the whole (not via the pandas API - I think this is true for ALL of the writing format methods)
the "index" you speak of is not the same thing as a pandas index, but I think the information above may show that the sort of indexing parquet provides is still useful for you.
If you decide to go the database route, SQLite is perfect since it ships with Python already, the driver API is in Python's standard library, and the file format is platform independent. I use it for all my personal projects.
Example is modified from this tutorial on Pandas + sqlite3 and the pandas.io documentation:
# Code to create the db
import sqlite3
import pandas as pd
# Create a data frame
df = pd.DataFrame(dict(col1=[1,2,3], col2=['foo', 'bar', 'baz']))
df.index = ['row{}'.format(i) for i in range(df.shape[0])]
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Write the data to your table (overwrite old data)
df.to_sql('your_table', conn, if_exists='replace')
# Add more data
new_rows = pd.DataFrame(
    dict(col1=[3, 4], col2=['notbaz', 'grunt']),
    index=('row2', 'row3')
)
new_rows.to_sql('your_table', conn, if_exists='append') # `append`
This part is an aside in case you need more complex stuff:
# (oops - duplicate row 2)
# also ... need to quote "index" column name because it's a keyword.
query = 'SELECT * FROM your_table WHERE "index" = ?'
pd.read_sql(query, conn, params=('row2',))
# index col1 col2
# 0 row2 3 baz
# 1 row2 3 notbaz
# For more complex queries use pandas.io.sql
from pandas.io import sql
query = 'DELETE FROM your_table WHERE "col1" = ? AND "col2" = ?'
sql.execute(query, conn, params=(3, 'notbaz'))
conn.commit()
# close
conn.close()
When you or collaborators want to read from the database, just send them the
file data/your_database.sqlite and this code:
# Code to access the db
import sqlite3
import pandas as pd
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Load the data into a DataFrame
query = 'SELECT * FROM your_table WHERE col1 BETWEEN ? and ?'
df = pd.read_sql_query(query, conn, params=(1,3))
# index col1 col2
# 0 row0 1 foo
# 1 row1 2 bar
# 2 row2 3 baz
1) How can I clone a RDD object to another?
2) For reading a CSV file, I'm using pandas to read it and then using sc.parallelize to convert the list to an RDD object. Is that okay, or should I use some RDD method to read directly from CSV?
3) I understand that I need to convert huge data to RDDs but do I also need to convert single int values into RDDs? If I just declare an int variable, will it be distributed across nodes?
You can use spark-csv to read a CSV file in Spark.
Here is how you read it in Spark 2.x:
df = (spark.read
      .schema(my_schema)
      .option("header", "true")
      .csv("data.csv"))
In Spark < 2.0
df = (sqlContext
      .read.format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(my_schema)
      .load("data.csv"))
For more options to suit your requirements, please refer to the spark-csv documentation.
To "clone" an RDD or DataFrame, you can simply assign it to another variable; since RDDs are immutable, nothing more is needed.
Hope this helps.
I am a bit confused by your question 1). An RDD is an immutable object, so you can load your data into one RDD and then define two different transformations based on your initial RDD. Each transformation will use the original RDD and generate new RDDs.
Something like this:
# load your CSV
loaded_csv_into_rdd = sc.textFile('data.csv').map(lambda x: x.split(','))
# You could even .cache or .persist the data
# Here two new RDDs will be created based on the data that you loaded
one_rdd = loaded_csv_into_rdd.<apply one transformation>
two_rdd = loaded_csv_into_rdd.<apply another transformation>
This uses the low-level API, RDDs. It is probably better to do it with the DataFrame API (Dataset[Row]), as it can infer the schema and in general is easier to use.
On 2): if you want to use RDDs, what you are looking for is sc.textFile; you then apply a split on commas to generate lists that you can manage.
On 3): in Spark you do not declare variables. You are working in a functional style, so state is not something you need to keep. There are accumulators, which are a special case, but in general you define functions that are applied to the entire dataset, something called coarse-grained transformations. A small sketch follows.
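To make point 3) concrete: a plain Python int is simply captured in the closure of the functions you apply and shipped to the executors with each task; you only need a broadcast variable when the value is large and read-only, or an accumulator when the executors must update it. A rough sketch, reusing the loaded_csv_into_rdd defined above:
# Plain value: captured in the task closure and sent along with each task
threshold = 2
long_rows = loaded_csv_into_rdd.filter(lambda row: len(row) > threshold)

# Broadcast: a larger read-only value, cached once per executor
lookup = sc.broadcast({'a': 1, 'b': 2})
mapped = long_rows.map(lambda row: lookup.value.get(row[0], 0))

# Accumulator: the special case mentioned above, a counter the executors can add to
empty_rows = sc.accumulator(0)
loaded_csv_into_rdd.foreach(lambda row: empty_rows.add(1) if not row else None)
print(empty_rows.value)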