I am building a repeat orders report in an IPython notebook using GraphLab and SFrames. I have a csv file with roughly 100k rows of data containing user_id, user_email, user_phone. I added a new column called unique_identifier. For each row I traverse all other rows to see whether user_id, user_email or user_phone matches the current record. If unique_identifier is not empty and there is a match, I assign the user_id from the current record into the unique_identifier slot of each matching record.
At the end, I get an SFrame with 4 columns, where unique_identifier contains the user_id of the oldest order for all matching orders. I am doing this via the .apply method with a lambda function. The whole process takes a few seconds on my laptop. However, after the process is done, the SFrame becomes extremely slow and unmanageable, to the point where SFrame.save seems to take forever.
It seems like my process of adding unique_identifier clogs up memory or something like that. However, the problem occurs regardless of the SFrame size: if I limit it to just 10 rows, the problem persists. What am I doing wrong?
Here is my method
def set_unique_identifier():
    orders['unique_identifier'] = ''
    # for each order, look up the first order sharing an email or phone number
    orders['unique_identifier'] = orders.apply(
        lambda order: order['unique_identifier'] if order['unique_identifier']
        else orders[(orders['user_email'] == order['user_email']) |
                    (orders['user_phone'] == order['user_phone'])][0]['user_id'])
Don't use apply on the entire SFrame; instead, use it on an SArray. That should speed things up a little.
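For what it's worth, here is a minimal sketch of that suggestion (untested; it assumes the GraphLab/Turi Create API, that the SFrame is sorted oldest-first, and it matches on user_email only for brevity, the user_phone lookup being analogous): build a plain Python lookup once, then apply on a single SArray instead of filtering the whole SFrame once per row.

email_to_oldest = {}
for uid, email in zip(orders['user_id'], orders['user_email']):
    email_to_oldest.setdefault(email, uid)  # first hit = oldest order

# apply on a single SArray instead of re-filtering the whole SFrame per row
orders['unique_identifier'] = orders['user_email'].apply(
    lambda email: email_to_oldest[email])
orders.materialize()  # force evaluation so SFrame.save does not re-run the lambda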
I have a database of over 500,000 records, about 140 MB when stored as CSV. Pandas takes about 1.5 seconds to load it, parsing dates included, which is not a problem at all. Now, I have a Python program that is continuously creating more records, which I want to add to the database (I also remove older records, so the database has a fairly stable size). And I'm facing a performance issue, because adding the new records takes longer than the process that creates them.
For adding these new records, I basically merge the freshly obtained Dataframe with the one that contains the database, which is loaded from a CSV file, i.e.:
import pandas as pd

# read the database
old_df = pd.read_csv('database.csv',
                     index_col=False,
                     parse_dates=['date'],
                     dtype=dtypes)

# some process produces new_df
# I merge them by just concatenating
merged = pd.concat([old_df, new_df])
This step is also fast, so no problem so far. Perhaps it's worth noting that new_df is tiny compared to old_df: typically fewer than 10 new records are added each time.
Now, a particularity of this database is that some of the new records are supposed to replace their counterparts in the database, i.e. they don't just grow it but update it. (The details are not important for the problem, but for a bit of context: the database keeps a memory of previous failures in the column type, which can be either 'success' or 'failed', corresponding to attempts to get a file stored in the column file. This way, when a later attempt of the program succeeds, the record for the failure is replaced by the success.)
The replacement consists of grouping the database by the column file, so each file is unique. Once grouped, I need to aggregate to define a value for type, so that I keep just one record for each file. And my problem is that the aggregation is done through a user-defined function that has become the bottleneck of the program.
This code:
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})
runs in less than a second, whereas this:
def keep_success(x):
    """! Auxiliary function to keep `success` if it exists."""
    if (x == "success").any():
        return 'success'
    else:
        return x.iloc[-1]

merged = merged.groupby('file', as_index=False).agg({'type': keep_success})
takes more than a minute. So far I was using 'last', but a change in my program means that sometimes 'success' comes before 'failed', so I need to account for the unknown order of these two values.
TL;DR: I need a FAST way to aggregate records in a DataFrame sharing the file column, keeping just the value 'success' for the column type if there is any occurrence of this value within the group, and 'failed' otherwise.
EDIT to add my guess:
I think the problem is in the string comparison. The program has to go through ALL of the database making trivial/useless comparisons that systematically fail. To replace about 10 records, we need to check the equality of over 500,000 strings. Can I work around this by taking advantage of what I know, i.e. that most records, once grouped, are unique, so we do not need to do anything with them?
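Not from the original post, but one way to act on that guess, assuming old_df already holds one record per file from the previous grouping: re-aggregate only the files that actually appear in new_df and leave the rest of old_df untouched, so keep_success runs on a handful of rows instead of 500,000.

# limit the expensive aggregation to the files touched by new_df
touched = old_df['file'].isin(new_df['file'])

# rows whose file does not appear in new_df need no work at all
untouched = old_df[~touched]

# resolve only the (tiny) overlapping subset with the slow UDF
subset = pd.concat([old_df[touched], new_df])
resolved = subset.groupby('file', as_index=False).agg({'type': keep_success})

merged = pd.concat([untouched, resolved], ignore_index=True)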
Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
from pyspark.sql import functions as F

count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
sum(
when(length(col("Country")) === 2, 1).otherwise(0)
).alias("two_characters"),
sum(
when(length(col("Country")) > 2, 1).otherwise(0)
).alias("more_than_two_characters")
)
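Since the question uses pyspark, the same single-select pattern translates roughly as follows (a sketch, using the column and alias names from the example above):

from pyspark.sql import functions as F

# one Spark job returns both counts at once
row = data.select(
    F.sum(F.when(F.length(F.col('Country')) == 2, 1).otherwise(0))
        .alias('two_characters'),
    F.sum(F.when(F.length(F.col('Country')) > 2, 1).otherwise(0))
        .alias('more_than_two_characters'),
).collect()[0]

count_short = row['two_characters']
count_long = row['more_than_two_characters']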
While one way is to combine multiple queries into one, another way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid re-evaluating it each time count() is invoked.
data.cache()
A few things to keep in mind: if you are applying multiple actions to your dataframe, there are a lot of transformations involved, and you are reading the data from an external source, then you should definitely cache that dataframe before you apply any action to it.
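A rough usage sketch of that advice, reusing the example from the question:

from pyspark.sql import functions as F

data = data.cache()  # mark the dataframe for caching
data.count()         # trigger an action so the cache is actually materialized

# subsequent actions reuse the cached data instead of recomputing the lineage
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()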
The answer provided by @pasha701 works, but you will have to keep adding columns for every country-code length you want to analyse.
You can use the code below to get the counts of the different country-code lengths all in one single dataframe.
# import statements
from pyspark.sql.functions import *

# sample dataframe
data = spark.createDataFrame(schema=['Country'],
                             data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])

# add a column that gives the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))

# column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]

# select the CountryLength column, convert it to an RDD, and run a map/reduce
# to count the occurrences of each length
countrieslength = (data1.select("CountryLength")
                   .rdd.map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .toDF(outputcolumns)
                   .select("CountryLength.CountryLength", "RecordsCount"))

# display or show the dataframe to see the output
display(countrieslength)
If you want to apply multiple filter conditions to this dataframe, you can cache it and then get the counts of the different combinations of records based on country-code length.
The problem
I'm working with a data set, which has been given to me as a csv file with lines of the form id,data. I would like to work with this data in a pandas dataframe, with the id as the index.
Unfortunately, somewhere along the data pipeline, my csv file has ended up with a number of rows where the ids are missing. Fortunately, the rows of my data are not fully independent, so I can recreate the missing values: each row is linked to its predecessor, and I have access to an oracle which, when given an id, can give me all of its data, including the id of its predecessor.
My question is therefore whether there's a simple way of filling in these missing values in my dataframe.
My Solution
I don't have much experience working with pandas, but after playing around for a bit I came up with the following approach. I start by reading the csv file into a dataframe without setting the index, so I end up with a RangeIndex. I then:
1. Find the location of the rows with missing ids
2. Add 1 to the index to get the children of each row
3. Ask the oracle for the parents of each child
4. Merge the parents and children on the child id
5. Subtract one from the index again, and set the parent ids
In code:
children = df.loc[df[df['id'].isna()].index + 1, 'id']
parents = pd.Series({x.id: x.parent_id for x in ask_oracle(children)},
name='parent_id')
combined = pd.merge(children, parents, left_on='id', right_index=True)
combined.set_index(combined.index - 1, inplace=True)
df.loc[combined.index, 'id'] = combined['parent_id']
This works, but I'm 95% sure it's going to look like scary black magic in a few months time.
In particular, I'm unhappy with
The way I get the location of the nan rows. Three lots of df[ in one line is just too many
The manual fiddling about with the indices I have to do to get the rows to match up.
Does anyone have any suggestions for a better way of doing things?
The format of the input data is fixed, as are the properties of the oracle, but if there's a smarter way of organising my dataframe, I'm more than happy to hear it.
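Not the original poster's code, but one possible tidying of the same steps (it assumes, as above, that ask_oracle takes the child ids and returns objects exposing id and parent_id): compute the positions of the missing ids once, then write the parents back by position instead of shifting the index back and forth.

# positions (RangeIndex labels) of the rows with a missing id
missing = df.index[df['id'].isna()]

# the row after each gap is the child; its parent is the value we want
children = df.loc[missing + 1, 'id']
parent_of = {x.id: x.parent_id for x in ask_oracle(children)}

# children and missing line up one-to-one, so assign by position
df.loc[missing, 'id'] = children.map(parent_of).to_numpy()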
I have a batch of identifiers and pairs of values that are produced in the following manner within an iteration.
For example,
print(indexIDs[i], (coordinate_x, coordinate_y))
Each iteration produces an identifier and an (x, y) coordinate pair. I would like to add these data to a dataframe, using indexIDs[i] as the row label and appending each incoming pair of values with the same identifier into the next consecutive column.
I have attempted the following code, which didn't work.
spatio_location = pd.DataFrame()
spatio_location.loc[indexIDs[i], column_counter] = (coordinate_x, coordinate_y)
Using indexIDs[i] as the row label works as intended, but I could not find a way to take in new data without overwriting the previous contents of the dataframe. I am aware the issue is in the second line, which uses the "=" sign and keeps overwriting the previous result; I am looking for a way to change it so that new incoming data is inserted into the existing dataframe instead of replacing what is already there.
Appreciate your time and effort, thanks.
I'm a bit confused about the nature of coordinate_x (is it a list or what?); anyway, maybe try to use append.
You could define an empty df with three columns:
df=pd.DataFrame([],columns=['a','b','c'])
then populate it with a loop over your lists:
for i in range(TOFILL):
    df = df.append({'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]},
                   ignore_index=True)
and finally set a column as the index:
df=df.set_index('a')
hope it helps
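One caveat on this approach (my note, not part of the answer above): DataFrame.append was deprecated and removed in pandas 2.x, and appending row by row is slow anyway, so on recent versions the same result can be built by collecting the rows first:

import pandas as pd

# build the rows in a plain list, then create the frame once
rows = [{'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}
        for i in range(TOFILL)]
df = pd.DataFrame(rows).set_index('a')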
I have an SQLite table with a constant number of rows. But as I generate values derived from some of these columns (new features), I want to add columns on the fly, alongside existing columns, without creating any new rows. I can add a column using ALTER TABLE, but calling cur.executemany("INSERT INTO...") causes the values to be appended as new rows.
I've tried:
cur.executemany("UPDATE DOS_APPENDIX SET FEATURE2=?", [(val,) for val in ["a", "b", "c"]])
For some reason this causes the "c" to be duplicated across rows 1, 2, 3 in column FEATURE2. And it's slow on a large list (~2 million).
Is there a way to bulk update? Something as graceful and fast as calling cur.executemany(INSERT INTO...)?
Do I have to update the rows one by one with a for loop?
If so, how would I do this if I don't have a WHERE condition (only row numbers)?
Note: The creation of a parallel column alongside an existing one comes with null values. These then get overwritten.
In a relational database, you probably don't want to do what you are describing, as it breaks normalization.
What I suggest is that you have a feature table where you store the features for each row:
CREATE TABLE observations (id INTEGER);
CREATE TABLE features (id INTEGER, name TEXT);
-- "values" is a reserved word in SQLite, so the value table gets another name
CREATE TABLE feature_values (row_id INTEGER, feature_id INTEGER, value FLOAT);
This way you can add new features by adding one row to the features table and all the corresponding rows to the feature_values table.
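For illustration, adding a new feature under that schema might look like the sketch below (my example: the database filename, cursor names, and sample values are made up; the tables are the ones defined above):

import sqlite3

con = sqlite3.connect("features.db")  # hypothetical database file
cur = con.cursor()

# register the new feature once
cur.execute("INSERT INTO features (id, name) VALUES (?, ?)", (2, "FEATURE2"))

# then bulk-insert one value per existing observation; no ALTER TABLE needed
rows = [(row_id, 2, value) for row_id, value in enumerate([0.1, 0.2, 0.3], start=1)]
cur.executemany(
    "INSERT INTO feature_values (row_id, feature_id, value) VALUES (?, ?, ?)",
    rows,
)
con.commit()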
If you use UPDATE tbl SET column='value', you set that value in every row of that column, which is exactly what your query does. If you want to set the value only on specific rows (or on a specific column), you should change the query accordingly, either by adding a WHERE clause such as WHERE column1='some value' or by changing the column name.
If you update a table with ~2M rows, it takes time, depending on the amount of data. If you take a look here (which is very old, and things are probably much faster now), you can see that updating 25K rows in SQLite took them 2.4 seconds (multiply that by about 80 for your case). Large updates take time.
You can use a bulk update; however, I'm not sure what exactly you are trying to do. If you want to set column2 to value2 where column1 = value1, you can use:
cur.executemany("UPDATE DOS_APPENDIX SET column2=? WHERE column1=?", [(column2_val, column1_val) for ...])
In general, saying "I don't have a WHERE condition (only row numbers)" is very problematic. You could use LIMIT if you know exactly which rows you want to update, but the order of the rows can change, so I really recommend against it. It will be much better to add an id to your rows and use it in your UPDATE query.
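As a concrete sketch of that last point (my example, not the asker's code, and it assumes the con/cur objects from the question): pair each new value with the row it belongs to and key the bulk update on SQLite's implicit rowid, or on an explicit id column if the table has one.

new_values = ["a", "b", "c"]  # assumed to be one value per row, in rowid order

# (value, rowid) pairs; executemany then runs one small keyed UPDATE per row
params = [(value, row_id) for row_id, value in enumerate(new_values, start=1)]
cur.executemany("UPDATE DOS_APPENDIX SET FEATURE2=? WHERE rowid=?", params)
con.commit()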