# Start and End define the range of dates.
start = date(2019, 1, 20)
end = date(2019, 1, 22)

for single_date in daterange(start, end):
    query = "(SELECT ID, firstname, lastname, date FROM dbo.emp WHERE date = '%s') emp_alias" % single_date.strftime("%Y-%m-%d %H:%M:%S")
    df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
    df.write.format("parquet").mode("ignore").partitionBy("Date").save("/mnt/data/empData.parquet")
I have data for a number of days in a table, and I need it as Parquet files partitioned by date. I have to save it day by day in a loop, because the data is huge and I can't put all the days (a whole year's data) into one DataFrame. I tried all the save modes. In 'ignore' mode it saves only the first day. In 'overwrite' mode it saves only the last day. In 'append' mode it just keeps adding the data. What I need is: if data is already available for a day, it should skip that day and leave the existing data alone, but if data is not available, it should write it as a Parquet file partitioned by date. Please help.
There is currently no PySpark SaveMode that will allow you to preserve the existing partitions, while inserting the new ones, if you also want to use Hive partitioning (which is what you’re asking for, when you call the method partitionBy). Note that there is the option to do the opposite, which is to overwrite data in some partitions, while preserving the ones for which there is no data in the DataFrame (set the configuration setting "spark.sql.sources.partitionOverwriteMode" to "dynamic" and use SaveMode.Overwrite when writing datasets).
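For reference, here's a minimal sketch of that opposite behaviour (dynamic partition overwrite), reusing the DataFrame, partition column and output path from your question; only the partitions present in the DataFrame get replaced, all others are left untouched:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# Only the "Date" partitions present in df are rewritten; the rest are preserved.
df.write.mode("overwrite").partitionBy("Date").parquet("/mnt/data/empData.parquet")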
You can still achieve what you want though, by first creating a set of all the already existing partitions. You could do that with PySpark, or using any of the libraries that will allow you to perform listing operations in file systems (like Azure Data Lake Storage Gen2) or key-value stores (like AWS S3). Once you have that list, you use it to filter the new dataset for the data you still want to write. Here’s an example with only PySpark:
In [1]: from pyspark.sql.functions import lit
...: df = spark.range(3).withColumn("foo", lit("bar"))
...: dir = "/tmp/foo"
...: df.write.mode("overwrite").partitionBy("id").parquet(dir) # initial seeding
...: ! tree /tmp/foo
...:
...:
/tmp/foo
├── id=0
│ └── part-00001-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
├── id=1
│ └── part-00002-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
├── id=2
│ └── part-00003-5d14d286-81e1-4eb1-969e-c0d8089712ce.c000.snappy.parquet
└── _SUCCESS
3 directories, 4 files
In [2]: df2 = spark.range(5).withColumn("foo", lit("baz"))
...: existing_partitions = spark.read.parquet(dir).select("id").distinct()
...: df3 = df2.join(existing_partitions, "id", how="left_anti")
...: df3.write.mode("append").partitionBy("id").parquet(dir)
...: spark.read.parquet(dir).orderBy("id").show()
...:
...:
+---+---+
|foo| id|
+---+---+
|bar| 0|
|bar| 1|
|bar| 2|
|baz| 3|
|baz| 4|
+---+---+
As you can see, only 2 partitions were added; the ones that already existed were preserved.
Now, getting the existing_partitions DataFrame required a read of the data. Spark won't actually read all of the data though, just the partitioning column and the metadata. As mentioned earlier, you could also get this information using any of the APIs relevant to where your data is stored. In my particular case, as well as in yours (seeing as how you're writing to the /mnt folder on Databricks), I could simply have used the standard-library function os.walk: dirnames = next(os.walk(dir))[1], and created a DataFrame from that.
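Spelled out, that os.walk alternative could look something like the following sketch; the path and the Date= prefix are assumptions based on your write path and partition column:
import os

dir = "/mnt/data/empData.parquet"
# Hive-style partition directories are named like "Date=2019-01-20".
dirnames = next(os.walk(dir))[1]
existing_dates = [d.split("=", 1)[1] for d in dirnames if d.startswith("Date=")]
existing_partitions = spark.createDataFrame([(d,) for d in existing_dates], ["Date"])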
By the way, here is why you see the behaviours you described:
ignore mode
In 'ignore' mode it saves only the first day.
Because you’re using a for-loop and the output directory probably didn’t exist initially, the first date partition gets written. In all subsequent iterations of the for-loop, the DataFrameWriter object doesn’t write anything anymore, because it sees there’s already some data (one partition, for the first date) there.
overwrite mode
In 'overwrite' mode it saves only the last day.
Actually, it saves a partition in each iteration of the for-loop, but because you’re instructing the DataFrameWriter to overwrite, it removes all previously existing partitions in the directory. So it looks like only the last one was written.
append mode
In 'append' mode it just keeps adding the data.
This doesn’t need further explanation.
One suggestion: there’s probably no need to read from the database multiple times (using the for-loop to create multiple different queries and JDBC connections). You could probably change the query to use WHERE date BETWEEN %(start)s AND %(end)s, remove the for-loop altogether and enjoy an efficient write.
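For illustration, that single-query version could look roughly like this (a sketch reusing the names from your question; combine it with the left-anti filter shown earlier if you still need to skip dates that already exist):
# One JDBC read for the whole date range instead of one read per day.
query = """(SELECT ID, firstname, lastname, date
            FROM dbo.emp
            WHERE date BETWEEN '{start}' AND '{end}') emp_alias""".format(
    start=start.strftime("%Y-%m-%d"),
    end=end.strftime("%Y-%m-%d"),
)
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
df.write.mode("append").partitionBy("Date").parquet("/mnt/data/empData.parquet")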
Related
I'm trying to use PySpark to read from an Avro file into a dataframe, do some transformations and write the dataframe out to HDFS as Hive tables using the code below. The file format for the Hive tables is Parquet.
df.write.mode("overwrite").format("hive").insertInto("mytable")
# this writes a partition every day; when re-run, it overwrites that day's partition
The problem is, when the source data has a schema change, like an added column, it fails with an error saying that the source file structure does not match the existing table schema. How should I handle this case programmatically? Many thanks for your help.
Edit: I want the new schema changes to be reflected in the target table. I'm looking for a programmatic way to do this.
You should be able to query the system tables. You can run a comparison on these to see what changes have occurred since your last run.
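A sketch of what that comparison could look like in PySpark (mytable and df reuse the names from your question; treat it as an outline, not tested code):
# Compare the incoming DataFrame's schema with the existing Hive table and
# add any missing columns before calling insertInto.
existing_cols = {f.name.lower() for f in spark.table("mytable").schema.fields}
new_fields = [f for f in df.schema.fields if f.name.lower() not in existing_cols]
if new_fields:
    cols_ddl = ", ".join("{} {}".format(f.name, f.dataType.simpleString())
                         for f in new_fields)
    spark.sql("ALTER TABLE mytable ADD COLUMNS ({})".format(cols_ddl))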
Not sure which Spark version you're using, but if you're on Spark >= 2.2 you can use ALTER TABLE in a Spark SQL statement. For your case it would be:
spark.sql("ALTER TABLE mytable add columns (col1 string, col2 timestamp)")
You can check this document for more information about Hive operations in Spark: https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-alter-table.html#add-columns
It is possible to evolve the schema of the Parquet files. In your use case this can be achieved with schema merging. As per your requirement, a new partition is created every day and you are also applying the overwrite option, which is well suited to schema evolution. The Spark documentation notes:
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files
setting the global SQL option spark.sql.parquet.mergeSchema to true.
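The second (global) option would look something like this one-line sketch; per the note at the end of this answer, the per-read option used in the documentation code below is usually preferable:
# Enables schema merging for every Parquet read in this Spark session.
spark.conf.set("spark.sql.parquet.mergeSchema", "true")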
Code from the official Spark documentation:
from pyspark.sql import Row
# spark is from the previous example.
# Create a simple DataFrame, stored into a partition directory
sc = spark.sparkContext
squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
                                  .map(lambda i: Row(single=i, double=i ** 2)))
squaresDF.write.parquet("data/test_table/key=1")
# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
                                .map(lambda i: Row(single=i, triple=i ** 3)))
cubesDF.write.parquet("data/test_table/key=2")
# Read the partitioned table
mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()
# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column appeared in the partition directory paths.
# root
# |-- double: long (nullable = true)
# |-- single: long (nullable = true)
# |-- triple: long (nullable = true)
# |-- key: integer (nullable = true)
Note: the suggestion would be not to enable the setting at the global level, as it's quite a heavy operation.
I have a database of over 500,000 records and 140 MB (when stored as CSV). Pandas takes about 1.5 seconds to load it, parsing dates included, which is not a problem at all. Now, I have a Python program that is continuously creating more records, which I want to add to the database (I also remove older records, so the database has a fairly stable size). And I'm facing a performance issue, as adding the new records takes longer than the process that creates them.
For adding these new records, I basically merge the freshly obtained Dataframe with the one that contains the database, which is loaded from a CSV file, i.e.:
# read the database
old_df = pd.read_csv('database.csv',
                     index_col=False,
                     parse_dates=['date'],
                     dtype=dtypes)
# some process produces new_df
# I merge them by just concatenating
merged = pd.concat([old_df, new_df])
This step is even faster, so no problem so far. Perhaps it's worth noting that new_df is tiny compared to old_df: typically fewer than 10 new records are added each time.
Now, a particularity of this database is that some of the new records are supposed to replace their counterparts in the database, i.e. they don't just grow it, they update it. (The details are not important for the problem, but for a bit of context: the database keeps a memory of previous failures in the column type, which can be either 'success' or 'failed', corresponding to attempts to get the file stored in the column file. This way, when a later attempt of the program succeeds, the record of the failure is replaced by the success.)
The replacement consists of grouping the database by the column file, so that each file is unique. Once grouped, I need to aggregate to define a value for type, keeping just one record per file. And my problem is that the aggregation is done through a user-defined function that has become the bottleneck of the program.
This code:
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})
runs in less than a second, whereas this:
def keep_success(x):
    """! Auxiliary function to keep `success` if it exists."""
    if (x == "success").any():
        return 'success'
    else:
        return x.iloc[-1]

merged = merged.groupby('file', as_index=False).agg({'type': keep_success})
takes more than a minute. So far I was using 'last', but a change in my program means that sometimes 'success' comes before 'failed', so I need to account for the unknown order of these two values.
TL;DR: I need a FAST way to aggregate records in a DataFrame sharing the file column, keeping just the value 'success' for the column type if there is any occurrence of this value within the group; otherwise we keep 'failed'.
EDIT to add my guess:
I think the problem is in the string comparison. The program has to go through ALL of the database making trivial/useless comparisons that are systematically not fulfilled. To replace about 10 records, we need to check the equality of over 500,000 strings. Can I work around this by taking advantage of what I know, i.e. that most records, once grouped, are unique, so we do not need to do anything with them?
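What I have in mind looks roughly like this (a rough, untested sketch, assuming merged and keep_success as defined above): run only the files that appear more than once through the slow aggregation, and pass the already-unique rows through untouched.
# Rows whose 'file' occurs more than once need the keep_success aggregation;
# the rest are already unique and can be kept as they are.
dup_mask = merged.duplicated('file', keep=False)
unique_part = merged.loc[~dup_mask, ['file', 'type']]
dup_part = (merged[dup_mask]
            .groupby('file', as_index=False)
            .agg({'type': keep_success}))
merged = pd.concat([unique_part, dup_part], ignore_index=True)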
I need to add a 'Status' column to my Pandas df that checks multiple columns of the df in order to display a message about possible options for the user. I think I could do this if the columns were static but the next time a new tool is brought on line it will show up in the data as a new column that will need to be checked. I'm trying to avoid hardcoding in specific tool names.
I figured out a way to conditionally create a column for each tool, and that works well, but it creates one column per tool. In the case above I need a single column instead of many.
Here's how I'm managing to create a new column based on all the tools that are in the dataframe:
tools = unded[unded.columns[unded.columns.str.contains("tool")]]
undedtools = tools.columns.values.tolist()

for tool in undedtools:
    unded.loc[(unded[tool] == 'Y') & (unded['lines_down'] == 'N'), tool + '_format'] = 2
    unded.loc[unded[tool] == 0, tool + '_format'] = 3
This creates a column for each tool named like "tool123_format". The columns get filled in with a number which I use for formatting a report. So now that I have a column like this for each tool I need to check all of these columns and report on the status.
I would expect it to report something like "tool123 and tool456 are open" if it finds a 2 in the format column for each of those tools. Then the next line may have no open tools so it would say "All paths are closed. Escalate to eng"
How do I get the tool names conditionally into this "Status" column for each row of the dataframe? I had previously had this whole thing in SQL but I'm getting tired of adding dozens of new lines to my CASE WHEN statement each time we add a new tool.
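Something along these lines is what I'm imagining (a rough, untested sketch; the == 2 check and the two messages just mirror what I described above):
# Collect the per-tool format columns created by the loop above.
format_cols = [c for c in unded.columns if c.endswith('_format')]

def status_for_row(row):
    # A tool is considered open when its format column holds a 2.
    open_tools = [c[:-len('_format')] for c in format_cols if row[c] == 2]
    if open_tools:
        return ' and '.join(open_tools) + ' are open'
    return 'All paths are closed. Escalate to eng'

unded['Status'] = unded.apply(status_for_row, axis=1)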
I'm building an interactive browser and editor for larger-than-memory datasets which will be later processed with Pandas. Thus, I'll need to have indexes on several columns that the dataset will be interactively sorted or filtered on (database indexes, not Pandas indexing), and I'd like the dataset file format to support cheap edits without rewriting most of the file. Like a database, only I want to be able to just send the files away afterwards in a Pandas-compatible format, without exporting.
So, I wonder if any of the formats that Pandas supports:
Have an option of building database-indexes on several columns (for sorting and filtering)
Can be updated 'in-place' or otherwise cheaply without shifting the rest of the records around
Preferably both of the above
What are my options?
I'm a complete noob in Pandas, and so far it seems that most of the formats are simply serialized sequential records, just like CSV, and at most can be sorted or indexed on one column. If nothing better comes up, I'll have to either build the indexes myself externally and juggle the edited rows manually before exporting the dataset, or dump the whole dataset in and out of a database—but I'd prefer avoiding both of those.
Edit: more specifically, it appears that Parquet has upper/lower bounds recorded for each column in each data page, and I wonder if these can be used as sort-of-indexes to speed up sorting on arbitrary columns, or whether other formats have similar features.
I would argue that parquet is indeed a good format for this situation. It maps well to the tabular nature of pandas dataframes, stores most common data in efficient binary representations (with optional compression), and is a standard, portable format. Furthermore, it allows you to load only those columns or "row groups" (chunks) you require. This latter gets to the crux of your problem.
Pandas' .to_parquet() will automatically store metadata relating to the indexing of your dataframe, and create the column max/min metadata as you suggest. If you use the fastparquet backend, you can use the filters= keyword when loading to select only some of the row-groups (this does not filter within row-groups)
pd.read_parquet('split', filters=[('cat', '==', '1')],
                engine='fastparquet')
(selects only row-groups where some values of field 'cat' are equal to '1')
This can be particularly efficient if you have used directory-based partitioning when writing, e.g.:
out.to_parquet('another_dataset.parq', partition_on=['cat'],
               engine='fastparquet', file_scheme='hive')
Some of these options are only documented in the fastparquet docs, and maybe the API of that library implements slightly more than is available via the pandas methods; and I am not sure how well such options are implemented with the arrow backend.
Note further that you may wish to read/save your dataframes using dask's to/read_parquet methods. Dask will understand the index if it is 1D and perform the equivalent of the filters= operation automatically, loading only the relevant parts of the data on disc when you do filtering operations on the index. Dask is built to deal with data that does not easily fit into memory, and to do computations in parallel.
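As a rough sketch of that route (the path and column names reuse the examples above and are illustrative only):
import dask.dataframe as dd

# Only the row-groups matching the filter need to be read from disc.
ddf = dd.read_parquet('another_dataset.parq', engine='fastparquet',
                      filters=[('cat', '==', '1')])
df = ddf.compute()  # materialise the filtered subset as a pandas DataFrame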
(in answer to some of the comments above: Pandas-SQL interaction is generally not efficient, unless you can push the harder parts of the computation into a fast DB backend - in which case you don't really have a problem)
EDIT: some specific notes:
parquet is not in general made for atomic record updating; but you could write to chunks of the whole (not via the pandas API - I think this is true for ALL of the writing format methods)
the "index" you speak on is not the same thing as a pandas index, but I am thinking that the information above may show that the sort of indexing in parquet is still useful for you.
If you decide to go the database route, SQLite is perfect since it's shipped with Python already, the driver API is in Python's standard library, and the file format is platform independent. I use it for all my personal projects.
Example is modified from this tutorial on Pandas + sqlite3 and the pandas.io documentation:
# Code to create the db
import sqlite3
import pandas as pd
# Create a data frame
df = pd.DataFrame(dict(col1=[1,2,3], col2=['foo', 'bar', 'baz']))
df.index = ('row{}'.format(i) for i in range(df.shape[0]))
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Write the data to your table (overwrite old data)
df.to_sql('your_table', conn, if_exists='replace')
# Add more data
new_rows = pd.DataFrame(
    dict(col1=[3, 4], col2=['notbaz', 'grunt']),
    index=('row2', 'row3')
)
new_rows.to_sql('your_table', conn, if_exists='append') # `append`
This part is an aside in case you need more complex stuff:
# (oops - duplicate row 2)
# also ... need to quote "index" column name because it's a keyword.
query = 'SELECT * FROM your_table WHERE "index" = ?'
pd.read_sql(query, conn, params=('row2',))
# index col1 col2
# 0 row2 3 baz
# 1 row2 3 notbaz
# For more complex queries use pandas.io.sql
from pandas.io import sql
query = 'DELETE FROM your_table WHERE "col1" = ? AND "col2" = ?'
sql.execute(query, conn, params=(3, 'notbaz'))
conn.commit()
# close
conn.close()
When you or collaborators want to read from the database, just send them the file data/your_database.sqlite and this code:
# Code to access the db
import sqlite3
import pandas as pd
# Connect to your database
conn = sqlite3.connect("data/your_database.sqlite")
# Load the data into a DataFrame
query = 'SELECT * FROM your_table WHERE col1 BETWEEN ? and ?'
df = pd.read_sql_query(query, conn, params=(1,3))
# index col1 col2
# 0 row0 1 foo
# 1 row1 2 bar
# 2 row2 3 baz
I'm having a go at using Numpy instead of Matlab, but I'm relatively new to Python.
My current challenge is importing the data from multiple files in a sensible way so that I can use and plot it. The data is organized in columns (Temperature, Pressure, Time, etc., each file being a measurement period), and I decided pandas was probably the best way to import the data. I was thinking of using a top-level descriptor for each file, and sub-descriptors for each column. I thought of doing it something like this:
Reading Multiple CSV Files into Python Pandas Dataframe
The problem is I'd like to retain and use some of the data in the header (for plotting, for instance). There are no column titles, but general info on the data measurements, something like this:
Flight ID: XXXXXX
Date: 01-27-10 Time: 5:25:19
OWNER
Release Point: xx.304N xx.060E 11 m
Serial Number xxxxxx
Surface Data: 985.1 mb 1.0 C 100% 1.0 m/s # 308 deg.
I really don't know how to extract and store the data in a way that makes sense when combined with the data frame. I thought of perhaps a dictionary, but I'm not sure how to split the data efficiently since there's no consistent divider. Any ideas?
Looks like somebody is working with radiosondes...
When I pull in my radiosonde data I usually put it in a multi-level indexed dataframe. The levels could be of various forms and orders, but something like FLIGHT_NUM, DATE, ALTITUDE, etc. would make sense. Also, when working with sonde data I too want some additional information that does not necessarily need to be stored within the dataframe, so I store that as additional attributes. If I were to parse your file and then store it I would do something along the lines of this (yes, there are modifications that can be made to "improve" this):
import pandas as pd

with open("filename.csv", 'r') as data:
    header = data.read().split('\n')[:5]  # change to match the number of your header rows
    data.seek(0)                          # rewind so read_csv sees the file from the start again
    data = pd.read_csv(data, skiprows=6, skipinitialspace=True,
                       na_values=[-999, 'Infinity', '-Infinity'])

# now you can parse your header to get out the necessary information
# continue until you have all the header info you want/need; e.g.
flight = header[0].split(': ')[1]
date = header[1].split(': ')[1].split(' ')[0]
time = header[1].split(': ')[2]

# a lot of the header information will get stored as metadata for me.
# most likely you want more than flight number and date in your metadata, but you get the point.
data.metadata = {'flight': flight,
                 'date': date}
I presume you have a date/time column (call it "dates" here) within your file, so you can use that to re-index your dataframe. If you choose to use different variables within your multi-level index then the same method applies.
new_index = [(data.metadata['flight'],r) for r in data.dates]
data.index = pd.MultiIndex.from_tuples(new_index)
You now have a multi-level indexed dataframe.
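Purely for illustration (the flight identifier here is hypothetical), you can then pull individual flights straight out of that index:
# Select every record belonging to one flight from the first index level.
data.loc['XXXXXX']
# or, keeping the flight level in the result:
data.xs('XXXXXX', level=0, drop_level=False)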
Now, regarding your "metadata". EdChum makes an excellent point that if you copy "data" you will NOT copy over the metadata dictionary. Also, if you save "data" to a dataframe via data.to_pickle you will lose your metadata (more on this later). If you want to keep your metadata you have a couple options.
Save the data on a flight-by-flight basis. This will allow you to store metadata for each individual flight's file.
Assuming you want to have multiple flights within one saved file: you can add an additional column within your dataframe that holds that information (i.e. another column for flight number, another column for surface temperature, etc.), though this will increase the size of your saved file.
Assuming you want to have multiple flights within one saved file (option 2): You can make your metadata dictionary "keyed" by flight number. e.g.
data.metadata = {FLIGHT1: {'date': date},
                 FLIGHT2: {'date': date}}
Now to store the metadata: check out my IO class on storing additional attributes within an h5 file, posted here.
Your question was quite broad, so you got a broad answer. I hope this was helpful.