I have some dataframe df in pySpark, which results from calling:
df = spark.sql("select A, B from org_table")
df = df.stuffIdo
I want to overwrite org_table at the end of my script.
Since overwriting input tables is forbidden, I checkpointed my data:
sparkContext.setCheckpointDir("hdfs:/directoryXYZ/PrePro_temp")
checkpointed = df.checkpoint(eager=True)
The lineage should be broken now and I can also see my checkpointed data with checkpointed.show() (works). What does not work is writing the table:
checkpointed.write.format('parquet')\
.option("checkpointLocation", "hdfs:/directoryXYZ/PrePro_temp")\
.mode('overwrite').saveAsTable('org_table')
This results in an error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://org_table_path/org_table/part-00081-4e9d12ea-be6a-4a01-8bcf-1e73658a54dd-c000.snappy.parquet
I have tried several things like refreshing the org_table before doing the writing etc., but I'm puzzled here. How can I solve this error?
I would be careful with operations where the transformed input becomes the new output, because you can lose your data if anything goes wrong. Imagine that your transformation logic was buggy and you generated invalid data, but you only noticed it a day later. To fix the bug you cannot use the data you have just transformed; you need the data from before the transformation. What do you do to make the data consistent again?
An alternative approach would be:
exposing a view
at each batch you're writing a new table and at the end you only replace the view with this new table
after some days you can also schedule a cleanup job that deletes the tables older than X days
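The view-based approach above could be sketched roughly as follows. This is only an illustration under assumptions: the function name publish_batch, the view name org_table_view and the batch_id suffix are hypothetical, and readers are assumed to query the view rather than a physical table.

```python
def publish_batch(spark, df, batch_id,
                  base_name="org_table", view_name="org_table_view"):
    """Write df to a fresh table and repoint the view at it.

    All names here are hypothetical; adapt them to your own catalog.
    """
    table_name = f"{base_name}_{batch_id}"

    # Each batch lands in its own table, so no input table is overwritten.
    df.write.format("parquet").mode("overwrite").saveAsTable(table_name)

    # Consumers query the view; replacing it switches them to the new data.
    spark.sql(
        f"CREATE OR REPLACE VIEW {view_name} AS SELECT * FROM {table_name}"
    )
    return table_name
```

Older tables stay available until the cleanup job removes them, so a buggy batch can be undone by pointing the view back at a previous table.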
If you want to stay with your solution, why not simply do this instead of dealing with checkpointing?
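# mode() must be set before the terminal parquet() call
df.write.mode('overwrite').parquet("hdfs:/directoryXYZ/PrePro_temp")
# read the staged copy back, so the lineage no longer depends on org_table
spark.read.parquet("hdfs:/directoryXYZ/PrePro_temp").write.format('parquet').mode('overwrite').saveAsTable('org_table')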
Of course, you will read the data twice, but it looks less hacky than the checkpoint approach. Moreover, you could store your "intermediate" data in a different directory every time, which addresses the issue I raised at the beginning. Even if you had a bug, you can still restore a valid version of the data by simply choosing a good directory and doing .write.format(...) to org_table.
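A minimal sketch of the "different directory every time" idea, assuming a timestamp is an acceptable run identifier. The function name staged_overwrite and the run_id parameter are hypothetical, not part of any Spark API:

```python
from datetime import datetime, timezone


def staged_overwrite(spark, df, table, base_dir, run_id=None):
    """Stage df under a per-run directory, then overwrite `table` from it."""
    # Assumption: a UTC timestamp is unique enough to identify a run.
    run_id = run_id or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    stage_path = f"{base_dir.rstrip('/')}/{run_id}"

    # 1) Materialize the transformed data outside the target table.
    df.write.mode("overwrite").parquet(stage_path)

    # 2) Re-read the staged files and overwrite the table from them.
    staged = spark.read.parquet(stage_path)
    staged.write.format("parquet").mode("overwrite").saveAsTable(table)
    return stage_path
```

Called as staged_overwrite(spark, df, "org_table", "hdfs:/directoryXYZ/PrePro_temp"), each run leaves its staged copy behind, so an earlier directory can be reloaded into org_table if a later run turns out to be wrong.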
I have a database of more than 500.000 records and 140 MB (when stored as CSV). Pandas takes about 1.5 seconds to load it, parsing dates included, so that is not a problem at all. Now, I have a Python program that is continuously creating more records, which I want to add to the database (I also remove older records, so the database has a fairly stable size). And I'm facing a performance issue, as adding the new records takes longer than the process creating such records.
For adding these new records, I basically merge the freshly obtained Dataframe with the one that contains the database, which is loaded from a CSV file, i.e.:
import pandas as pd

# read the database
old_df = pd.read_csv('database.csv',
                     index_col=False,
                     parse_dates=['date'],
                     dtype=dtypes)
# some process produces new_df
# I merge them by just concatenating
merged = pd.concat([old_df, new_df])
This step is even faster, so no problem so far. Perhaps it's worth noting that new_df is tiny compared to old_df. Typically fewer than 10 new records are added each time.
Now, a particularity of this database is that some of the new records are supposed to replace their counterparts in the database, i.e. they do not just grow it but update it. (The details are not important for the problem, but for a bit of context: the database keeps a memory of previous failures in the column type, which can be either 'success' or 'failed' and corresponds to attempts to get the file stored in the column file. This way, when a later attempt of the program succeeds, the record for the failure is replaced by the success.)
The replacement consists of grouping the database by the column file, so each file is unique. Once grouped, I need to aggregate to define a value for type, so I keep just one record for the given file. And my problem is that the aggregation is done through a user defined function that has become a bottleneck of the program.
This code:
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})
runs in less than a second, whereas this:
def keep_success(x):
    """! Auxiliary function to keep `success` if it exists."""
    if (x == "success").any():
        return 'success'
    else:
        return x.iloc[-1]

merged = merged.groupby('file', as_index=False).agg({'type': keep_success})
takes more than a minute. So far I was using 'last', but a change in my program means that 'success' sometimes comes before 'failed', so I need to account for the unknown order of these two values.
TL;DR; I need a FAST way to aggregate records in a Dataframe sharing the file column, and keeping just the value 'success' for the column type in case there is any occurrence of this value within the group. Otherwise we keep 'failed'
EDIT to add my guess:
I think the problem is in the string comparison. The program has to go through the WHOLE database making trivial, useless comparisons that are systematically not fulfilled. To replace about 10 records, we need to check the equality of more than 500.000 strings. Can I work around this by taking advantage of what I know, i.e. that most records, once grouped, are unique, so we do not need to do anything with them?
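One possible direction, sketched under the assumption that type only ever holds 'success' or 'failed' (as the TL;DR states): compare the whole column once, then use the built-in group-wise any() instead of a Python function called once per group. The slowdown is likely due to that per-group Python call rather than to the string comparison itself. The sample data below is made up for illustration.

```python
import pandas as pd

# Made-up sample: 'a' and 'c' have a success in either order, 'b' only fails.
merged = pd.DataFrame({
    "file": ["a", "a", "b", "c", "c"],
    "type": ["success", "failed", "failed", "failed", "success"],
})

# One vectorized comparison over the whole column.
is_success = merged["type"].eq("success")

# Built-in group-wise any(): no user-defined function per group.
any_success = is_success.groupby(merged["file"]).any()

# Map the boolean flag back to the two labels (assumes only two values exist).
result = (
    any_success.map({True: "success", False: "failed"})
    .rename("type")
    .reset_index()
)
```

Here result has one row per file, with 'success' for a and c and 'failed' for b. If type can hold other values, the fallback to the last value from keep_success would need to be added back.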
I'm trying to understand how to leverage cache() to improve my performance. Since cache() retains a DataFrame in memory "for reuse", it seems like I need to understand the conditions that eject the DataFrame from memory to better understand how to leverage it.
After defining transformations, I call an action. Is the DataFrame gone from memory after the action completes? This would imply that if I execute an action on a DataFrame but continue to do other things with the data, all the previous parts of the DAG, from the read to the action, will need to be redone.
Is this accurate?
In fact, each action you execute creates its own DAG. You can check this in the Spark UI.
In code, you can identify where one DAG ends and the next begins by looking for actions.
When reading code, you can use this simple rule:
When a function transforms one DataFrame into another DataFrame, it is not an action but a lazily evaluated transformation, even if it is a join or something else that requires shuffling.
When a function returns a value other than a DataFrame, it is an action (for example count, which returns a Long).
Let's take a look at this code (it's Scala, but the API is similar). I created this example just to show you the mechanism; this could be done better, of course :)
import org.apache.spark.sql.functions.{col, lit, format_number}
import org.apache.spark.sql.{DataFrame, Row}

val input = spark.read.format("csv")
  .option("header", "true")
  .load("dbfs:/FileStore/shared_uploads/***#gmail.com/city_temperature.csv")

val dataWithTempInCelcius: DataFrame = input
  .withColumn("avg_temp_celcius", format_number((col("AvgTemperature").cast("integer") - lit(32)) / lit(1.8), 2).cast("Double"))

val dfAfterGrouping: DataFrame = dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .groupBy(col("Year"), col("Month"))
  .max("avg_temp_celcius") // Not an action, we are still transforming one df into another df

val maxTemp: Row = dfAfterGrouping
  .orderBy(col("Year"))
  .first() // <--- first is our first action, its output is a Row and not a df

dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .count() // <--- count is our second action
Here you can see the problem. Between the first and second action I am doing a transformation that was already done in the first DAG. The intermediate result was not cached, so in the second DAG Spark cannot get the filtered DataFrame from memory, which leads to recomputation. Spark is going to fetch the data again, apply our filter, and then calculate the count.
In the Spark UI you will find two separate DAGs, and both of them read the source CSV.
If you cache the intermediate result after the first .filter(col("City") === "Warsaw") and then use this cached DataFrame for both the grouping and the count, you will still find two separate DAGs (the number of actions has not changed), but this time the plan for the second DAG contains "In memory table scan" instead of a read of the CSV file, which means that Spark is reading the data from the cache.
The plan now contains an in-memory relation. The read-CSV node is still in the DAG, but for the second action it is skipped (0 bytes read).
** I am using a Databricks cluster with Spark 3.2; the Spark UI may look different in your environment.
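For readers on PySpark, the cached variant described above might look roughly like this. It is a sketch, not the code that was actually run for this answer: the function name warsaw_stats is hypothetical, and the input is assumed to be a DataFrame that already has the City, Year, Month and avg_temp_celcius columns from the Scala example.

```python
def warsaw_stats(data_with_temp_in_celsius):
    """Run two actions on one cached intermediate DataFrame."""
    # filter() is lazy; cache() only marks the result for reuse.
    warsaw = data_with_temp_in_celsius.filter("City = 'Warsaw'").cache()

    # First action: computing it also populates the cache.
    max_temp = (
        warsaw.groupBy("Year", "Month")
        .max("avg_temp_celcius")
        .orderBy("Year")
        .first()
    )

    # Second action: can be served from the cached data instead of the CSV.
    row_count = warsaw.count()

    warsaw.unpersist()  # release the cached data once it is no longer needed
    return max_temp, row_count
```

There are still two actions and therefore two DAGs; the difference is only that the second one can read the filtered rows from memory.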
Quote...
Using cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a Spark DataFrame so they can be reused in subsequent actions.
See https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
Similar questions have been asked, but I didn't get it to work, so here goes. (This used to work fine up until recently, so I don't know whether a Python update messed this up.)
"data" is a pandas DataFrame.
data = data[data['plant name'].notnull()]
data = data[data['plant name'] != '0']
When I use my full range of data I get the "Exception: cannot handle a non-unique multi-index!". This may be related to several columns having empty names in the Excel file I read this from (in Spyder, when I look at the data, these have "NaN" for a name). I thought this would work as long as I don't have several columns with the name I'm filtering by (in this case 'plant name'). But if I reduce the data to only include the first 3 columns, I don't get the exception.
The second problem is that these 2 lines of code used to remove all rows where the 'plant name' was '0' or empty. They don't do anything anymore (even if I get them to not crash by removing the columns mentioned above).
Thank you!
This is my code:
def fun(df, file):
    symbol = df.select(df.SMBL).distinct().collect()
    for i in symbol:
        csv_data = df.filter(df.SMBL == i.SMBL)
        csv_data.write.csv('%s/'%(BUCKET_PATH))
Using collect() slows the process. How can I access the column 'SMBL' without using collect()?
As far as I understand, you are trying to write to files that are named based on the SMBL DataFrame column. I suggest writing the DataFrame with partitionBy(), in which you specify the column. You might need a user-defined function based on the SMBL column in order to get the right partition naming.
Doing this, you don't need to call collect() before the write action.
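A minimal sketch of that suggestion, assuming the goal is one output location per distinct SMBL value. The function name write_by_symbol is hypothetical; partitionBy() and csv() are the standard DataFrameWriter calls.

```python
def write_by_symbol(df, bucket_path):
    """Write df as CSV, split into one sub-directory per SMBL value."""
    # Spark does the split itself, so no collect() and no Python loop.
    # Output directories are named like <bucket_path>/SMBL=<value>/.
    df.write.partitionBy("SMBL").csv(bucket_path)
```

Note that the partition column ends up in the directory name rather than inside the CSV files, which is why a derived column may be needed if a different naming is required.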
I plan to make a 'table' class that I can use throughout my data-analysis program to store gathered data in. The objective is to make simple tables like this:
ID   Mean size   Stdv    Date measured   Relative flatness
----------------------------------------------------------------
1    133.4242    34.43   Oct 20, 2013    32093
2    239.244     34.43   Oct 21, 2012    3434
I will follow the sqlite3 suggestion from this post: python-data-structure-for-maintaing-tabular-data-in-memory, but I will still need to save it as a csv file (not as a dbase) and I want it to eat my data as we go: add columns on the fly whenever new measures become available and are deemed to be interesting. For that the class will need to be able to determine the data type of the data thrown at it.
Sqlite3 has limited datatypes: float, int, date and string. Python and numpy together have many types. Is there an easy way to quickly decide what the datatype of a variable is, so that my table class can automatically add a column when new data containing new fields is entered?
I am not too concerned about performance, the table should be fairly small.
I want to use my class like so:
dt = Table()
dt.add_record({'ID':5, 'Mean size':39.4334})
dt.add_record({'ID':5, 'Goodness of fit': 12})
In the last line, there is new data. The Table class needs to figure out what kind of data that is and then add a column to the sqlite3 table. Making it all strings seems a bit too sloppy; I still want to keep my high-precision floats correct.
Also: If something like this already exists, I'd like to know about it.
It seems that your question is: "Is there an easy way to quickly decide what the datatype is of the variable?". This is a simple question, and the answer is:
type(variable).
But the context you provide requires a more careful answer.
Since SQLite3 provides only a few data types (slightly different ones than what you said), you need to map your input variables to the types provided by SQLite3.
But you may encounter further problems: You may need to change the types of columns as you receive new records, if you do not want to require that the column type be fixed in advance.
For example, for the Goodness of fit column in your example, you get an int (12) first. But you may get a float (e.g. 10.1) the second time, which shows that both values must be interpreted as floats. And if next time you receive a string, then all of them must be strings, right? But then the exact formatting of the numbers also counts: whereas 12 and 12.0 are the same when you interpret them as floats, they are not when you interpret them as strings; and the first value may become "12.0" when you convert all of them to strings.
So either you throw an exception when the types of consecutive values for the same column do not match, or you try to convert the previous values according to the new ones; but occasionally you may need to re-read the input.
Nevertheless, once you make those decisions regarding the expected behavior, it should not be a very difficult problem to implement.
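To make the mapping and the "widening" decision concrete, here is one possible sketch. The helper names sqlite_type and widen are hypothetical, and the policy shown (INTEGER may widen to REAL, and either may widen to TEXT) is just one of the choices discussed above, not the only reasonable one.

```python
import numbers

# Widening order: a column type may move to the right, never to the left.
_ORDER = ["INTEGER", "REAL", "TEXT"]


def sqlite_type(value):
    """Map a Python value to an SQLite column type name."""
    if isinstance(value, numbers.Integral):  # int (and bool)
        return "INTEGER"
    if isinstance(value, numbers.Real):      # float
        return "REAL"
    return "TEXT"                            # str, dates, anything else


def widen(current, new):
    """Return the more general of two column types."""
    return max(current, new, key=_ORDER.index)


# The 'Goodness of fit' example: 12 arrives first, then 10.1.
column_type = sqlite_type(12)                         # "INTEGER"
column_type = widen(column_type, sqlite_type(10.1))   # "REAL"
```

A stricter Table class could raise an exception instead of calling widen() whenever the two types differ.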
Regarding your last question: I personally do not know of an existing implementation that solves this problem.