I have a database of over 500,000 records and 140 MB (when stored as CSV). Pandas takes about 1.5 seconds to load it, parsing dates included, which is not a problem at all. Now, I have a Python program that continuously creates more records, which I want to add to the database (I also remove older records, so the database has a fairly stable size). And I'm facing a performance issue: adding the new records takes longer than the process that creates them.
For adding these new records, I basically merge the freshly obtained Dataframe with the one that contains the database, which is loaded from a CSV file, i.e.:
# read the database
old_df = pd.read_csv('database.csv',
                     index_col=False,
                     parse_dates=['date'],
                     dtype=dtypes)
# some process produces new_df
# I merge them by just concatenating
merged = pd.concat([old_df, new_df])
This step is even faster, so no problem so far. Perhaps it's worth noting that new_df is tiny compared to old_df: typically fewer than 10 new records are added each time.
Now, a particularity of this database is that some of the new records are supposed to replace their counterparts in the database, i.e. they do not just grow the database but update it. (The details are not important for the problem, but for a bit of context: the database keeps a memory of previous failures in the column type, which can be either 'success' or 'failed', corresponding to attempts to get a file stored in the column file. This way, when a later attempt of the program succeeds, the record for the failure is replaced by the success.)
The replacement consists of grouping the database by the column file, so each file is unique. Once grouped, I need to aggregate to define a value for type, so I keep just one record for each file. And my problem is that the aggregation is done through a user-defined function that has become the bottleneck of the program.
This code:
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})
runs in less than a second, whereas this:
def keep_success(x):
    """! Auxiliary function to keep `success` if it exists."""
    if (x == "success").any():
        return 'success'
    else:
        return x.iloc[-1]
merged = merged.groupby('file', as_index=False).agg({'type': keep_success})
takes more than a minute. So far I was using 'last', but a change in my program means that sometimes 'success' comes before 'failed', so I need to account for the unknown order of these two values.
TL;DR: I need a FAST way to aggregate records in a Dataframe sharing the file column, keeping just the value 'success' for the column type if there is any occurrence of this value within the group, and 'failed' otherwise.
EDIT to add my guess:
I think the problem is in the string comparison. The program has to go through ALL the database making trivial/useless comparisons that systematically fail. To replace about 10 records, we need to check the equality of over 500,000 strings. Can I work around this by taking advantage of what I know, i.e. that most records, once grouped, are unique, so we do not need to do anything with them?
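For instance, one workaround I'm considering (a sketch, assuming type only ever takes the values 'failed' and 'success'): since 'success' sorts after 'failed' lexicographically, the built-in 'max' aggregation already keeps 'success' whenever a group contains one, without calling a Python function per group:
# built-in aggregation, no user-defined function:
# max('failed', 'success') == 'success', so any success in the group wins,
# and groups with only failures keep 'failed'
merged = merged.groupby('file', as_index=False).agg({'type': 'max'})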
Related
So I have a Python script that compares two dataframes and works to find any rows that are not in both dataframes. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that the iteration is the problem. However, I haven't had much luck using various pandas/NumPy methods such as merge and where.
Couple of caveats:
- The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
- I want to only use the column names from one of the dataframes.
- df_new represents new information to be checked against what is currently on file (df_current).
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without providing examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs.
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
  sum(
    when(length(col("Country")) === 2, 1).otherwise(0)
  ).alias("two_characters"),
  sum(
    when(length(col("Country")) > 2, 1).otherwise(0)
  ).alias("more_than_two_characters")
)
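Since the question uses PySpark, here is a sketch of the same idea translated to the PySpark API (assuming F is the usual pyspark.sql.functions alias already used in the question); a single select computes all counts in one Spark job:
from pyspark.sql import functions as F

# one row with one column per condition; collect() triggers a single job
counts = data.select(
    F.sum(F.when(F.length(F.col('Country')) == 2, 1).otherwise(0)).alias('two_characters'),
    F.sum(F.when(F.length(F.col('Country')) > 2, 1).otherwise(0)).alias('more_than_two_characters'),
).collect()[0]
count_short, count_long = counts['two_characters'], counts['more_than_two_characters']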
While one way is to combine multiple queries into one, the other way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid the re-evaluation each time the count() is invoked.
data.cache()
A few things to keep in mind: if you are applying multiple actions on your dataframe, there are a lot of transformations, and you are reading that data from some external source, then you should definitely cache that dataframe before you apply any action on it.
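For example, a sketch reusing the counts from the question (data and the F alias as above; cache() only marks the dataframe, the first action materializes it):
data.cache()  # mark the dataframe for caching
count_short = data.where(F.length(F.col('Country')) == 2).count()  # first action fills the cache
count_long = data.where(F.length(F.col('Country')) > 2).count()    # reuses the cached data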
The answer provided by @pasha701 works, but you will have to keep adding columns for every country-code length you want to analyse.
You can use the below code to get the count of different country codes all in one single dataframe.
# import statements
from pyspark.sql.functions import *
# sample dataframe
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])
# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))
# column list for the final output schema
outputcolumns = ["CountryLength", "RecordsCount"]
# select the CountryLength column, convert it to an RDD and do a map-reduce
# to count the occurrences of each length
countrieslength = data1.select("CountryLength").rdd \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .toDF(outputcolumns) \
    .select("CountryLength.CountryLength", "RecordsCount")
# now you can do display or show on the dataframe to see the output
display(countrieslength)
The output is a small dataframe with one row per CountryLength value and its RecordsCount.
If you want to apply multiple filter conditions on this dataframe, you can cache it and get the counts of different combinations of records based on the country code length.
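For reference, the same per-length counts can usually be obtained without dropping to the RDD API; a minimal sketch using the DataFrame groupBy on data1 from above:
counts_by_length = data1.groupBy("CountryLength").count() \
    .withColumnRenamed("count", "RecordsCount")  # one row per length with its record count
counts_by_length.show()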
I have a batch of identifiers and pairs of values that behave in the following manner within an iteration.
For example,
print(indexIDs[i], (coordinate_x, coordinate_y))
Sample output looks like
I would like to add these data into a dataframe, where I can use indexIDs[i] as the row and append incoming pairs of values with the same identifier into the next consecutive columns.
I have attempted the following code, which didn't work.
spatio_location = pd.DataFrame()
spatio_location.loc[indexIDs[i], column_counter] = (coordinate_x, coordinate_y)
It was a reasonable start to associate indexIDs[i] with the row, however I could not progress to take in incoming data without overwriting the previous dataframe. I am aware it has something to do with the second line, which uses the "=" sign and keeps overwriting the previous result over and over again. I am looking for an appropriate way to change that second line so it inserts new incoming data into the existing dataframe without overwriting what is already there.
Appreciate your time and effort, thanks.
I'm a bit confused by the nature of coordinate_x (is it a list or what?). Anyway, maybe try to use append.
You could define an empty df with three columns:
df = pd.DataFrame([], columns=['a', 'b', 'c'])
then populate it with a loop over your lists:
for i in range(TOFILL):
    df = df.append({'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}, ignore_index=True)
and finally set a column as the index:
df = df.set_index('a')
hope it helps
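One caveat: DataFrame.append was deprecated and removed in pandas 2.0, so here is a sketch of the same idea that collects the rows first and builds the dataframe once at the end (same hypothetical names as above):
rows = []
for i in range(TOFILL):
    rows.append({'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]})
# build the dataframe once instead of growing it row by row
df = pd.DataFrame(rows).set_index('a')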
I have some dataframe df in pySpark, which results from calling:
df = spark.sql("select A, B from org_table")
df = df.stuffIdo
I want to overwrite org_table at the end of my script.
Since overwriting input tables is forbidden, I checkpointed my data:
sparkContext.setCheckpointDir("hdfs:/directoryXYZ/PrePro_temp")
checkpointed = df.checkpoint(eager=True)
The lineage should be broken now and I can also see my checkpointed data with checkpointed.show() (works). What does not work is writing the table:
checkpointed.write.format('parquet')\
.option("checkpointLocation", "hdfs:/directoryXYZ/PrePro_temp")\
.mode('overwrite').saveAsTable('org_table')
This results in an error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://org_table_path/org_table/part-00081-4e9d12ea-be6a-4a01-8bcf-1e73658a54dd-c000.snappy.parquet
I have tried several things like refreshing the org_table before doing the writing etc., but I'm puzzled here. How can I solve this error?
I would be careful with such operations where transformed input is the new output. The reason is that you can lose your data in case of any error. Let's imagine that your transformation logic was buggy and you generated invalid data, but you only noticed one day later. Moreover, to fix the bug you cannot use the data you've just transformed; you need the data from before the transformation. What do you do to bring the data back to a consistent state?
An alternative approach would be:
- expose a view
- at each batch, write a new table and at the end only replace the view with this new table (sketched just below)
- after some days, you can also plan a cleaning job that will delete the tables older than X days
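A minimal sketch of the view swap in PySpark (the batch table name and view name are hypothetical; consumers would query the view instead of org_table directly):
# write this batch's result to a new, uniquely named table
df.write.mode('overwrite').saveAsTable('org_table_batch_20190101')
# repoint the view that consumers query at the freshly written table
spark.sql("CREATE OR REPLACE VIEW org_table_view AS SELECT * FROM org_table_batch_20190101")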
If you want to stay with your solution, why not simply do that instead of dealing with checkpointing?
df.write.mode('overwrite').parquet("hdfs:/directoryXYZ/PrePro_temp")
spark.read.parquet("hdfs:/directoryXYZ/PrePro_temp").write.format('parquet').mode('overwrite').saveAsTable('org_table')
Of course, you will read the data twice, but it looks less hacky than the checkpoint approach. Moreover, you could store your "intermediate" data in a different dir every time, and thanks to that you can address the issue I raised at the beginning: even if you had a bug, you can still bring back a valid version of the data by simply choosing a good directory and doing .write.format(...) to org_table.
I am building a repeat orders report in an IPython notebook using GraphLab and SFrames. I have a CSV file with roughly 100k rows of data containing user_id, user_email, user_phone. I added a new column called unique_identifier. For each row I traverse all other rows to see if user_id, user_email or user_phone matches the current record. If unique_identifier is not empty and there is a match, I assign the user_id from the current record into the unique_identifier slot of each matching record.
At the end, I get an SFrame with 4 columns, where unique_identifier contains the user_id of the oldest order for all matching orders. I am doing this via the .apply method with a lambda function. The whole process takes a few seconds on my laptop. However, after the process is done, the SFrame becomes extremely slow and unmanageable, to the point where SFrame.save seems to take forever.
It seems like my process of adding unique_identifier clogs up the memory or something like that. However, the problem occurs regardless of the SFrame size: if I limit it to just 10 rows, the problem persists. What am I doing wrong?
Here is my method
def set_unique_identifier():
    orders['unique_identifier'] = ''
    orders['unique_identifier'] = orders.apply(lambda order:
        order['unique_identifier'] if order['unique_identifier'] else
        orders[(orders['user_email'] == order['user_email']) |
               (orders['phone'] == order['user_phone'])][0]['user_id'])
Don't use apply on the entire SFrame; instead, use it on an SArray. That should speed it up a little.