Updating multiple values in an SQLite column with Python

I have an SQLite table with a fixed number of rows. As I generate values derived from some of these columns (new features), I want to add columns on the fly, alongside the existing ones, without creating any new rows. I can add a column using ALTER TABLE, but calling cur.executemany("INSERT INTO ...") appends the values as new rows instead.
I've tried:
cur.executemany("UPDATE DOS_APPENDIX SET FEATURE2=?", [(val,) for val in ["a", "b", "c"]])
For some reason this leaves "c" in rows 1, 2 and 3 of column FEATURE2. And it's slow on a large list (~2 million values).
Is there a way to bulk update? Something as graceful and fast as calling cur.executemany("INSERT INTO ...")?
Do I have to update the rows one by one in a for loop?
If so, how would I do that without a WHERE condition (I only have row numbers)?
Note: adding a new column alongside an existing one initially fills it with NULL values, which then get overwritten.

In a relational database, you probably don't want to do what you are describing, as it breaks normalization.
What I suggest is that you have a feature table where you store the features for each row:
CREATE TABLE observations (id INTEGER);
CREATE TABLE features (id INTEGER, name TEXT);
CREATE TABLE "values" (row_id INTEGER, feature_id INTEGER, value FLOAT); -- VALUES is a reserved word in SQLite, so the name must be quoted
This way you can add new features by adding one row to the features table and all the corresponding rows to the values table.
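For example, registering a new feature and bulk-loading its values might look roughly like this (a sketch only; the database file, ids and numbers are made up for illustration, and the values table is quoted because VALUES is a reserved word):
import sqlite3

con = sqlite3.connect("features.db")  # hypothetical database file
cur = con.cursor()

# Register the new feature once and remember its id.
cur.execute("INSERT INTO features (id, name) VALUES (?, ?)", (2, "FEATURE2"))

# Bulk-load one value per observation; executemany takes the whole list at once.
feature_vals = [0.1, 0.2, 0.3]
cur.executemany(
    'INSERT INTO "values" (row_id, feature_id, value) VALUES (?, 2, ?)',
    list(enumerate(feature_vals, start=1)),
)
con.commit()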

If you use UPDATE tbl SET column='value', you set that value in every row of that column, which is exactly what your query does. If you only want to set the value on specific rows (or a specific column), you need to change the query accordingly (for example by adding WHERE column1='some value', or by changing the column name).
If you update a table with ~2M rows, it takes time, depending on the amount of data. If you take a look here (a very old benchmark, so things are probably much faster now), updating 25K rows in SQLite took about 2.4 seconds; multiply that by roughly 80 for 2M rows. Large updates take time.
You can use a bulk update; however, I'm not sure exactly what you are trying to do. If you want to set column2 to value2 where column1 = value1, you can use:
cur.executemany("UPDATE DOS_APPENDIX SET column2=? WHERE column1=?", [(column2_val, column1_val) for ...])
In general, saying "I don't have a WHERE condition (only row numbers)" is very problematic. You can use LIMIT if you know exactly which rows you want to update, but the order of the rows can change, so I strongly recommend against it. It is much better to add an id to your rows and use it in your UPDATE query, as in the sketch below.
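For instance, SQLite's implicit rowid can play the role of that id. A minimal sketch (assuming the table still has a rowid, i.e. it was not created WITHOUT ROWID, and that your list of new values is in the same order as the rows' rowids, which you should verify):
import sqlite3

con = sqlite3.connect("my.db")  # hypothetical database file
cur = con.cursor()

cur.execute("ALTER TABLE DOS_APPENDIX ADD COLUMN FEATURE2 TEXT")

# Pair each new value with the rowid of the row it belongs to, then update in bulk.
new_vals = ["a", "b", "c"]  # one entry per existing row, in rowid order
rowids = [r[0] for r in cur.execute("SELECT rowid FROM DOS_APPENDIX ORDER BY rowid")]
cur.executemany(
    "UPDATE DOS_APPENDIX SET FEATURE2=? WHERE rowid=?",
    zip(new_vals, rowids),
)
con.commit()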

Related

Performance issue with Pandas when aggregating using a custom defined function

I have a database of over 500,000 records and 140 MB (when stored as CSV). Pandas takes about 1.5 seconds to load it, parsing dates included, which is not a problem at all. Now, I have a Python program that is continuously creating more records, which I want to add to the database (I also remove older records, so the database has a fairly stable size), and I'm facing a performance issue: adding the new records takes longer than the process that creates them.
For adding these new records, I basically merge the freshly obtained Dataframe with the one that contains the database, which is loaded from a CSV file, i.e.:
# read the database
old_df = pd.read_csv('database.csv',
                     index_col=False,
                     parse_dates=['date'],
                     dtype=dtypes)
# some process produces new_df
# I merge them by just concatenating
merged = pd.concat([old_df, new_df])
This step is also fast, so no problem so far. It's perhaps worth noting that new_df is tiny compared to old_df; typically fewer than 10 new records are added each time.
Now, a particularity of this database is that some of the new records are supposed to replace their counterparts in the database, i.e. they don't just grow it, they update it. (The details are not important for the problem, but for a bit of context: the database keeps a memory of previous failures in the column type, which can be either 'success' or 'failed', corresponding to attempts to get a file stored in the column file. This way, when a later attempt succeeds, the record for the failure is replaced by the success.)
The replacement consists of grouping the database by the column file, so each file is unique. Once grouped, I need to aggregate to define a value for type, so I keep just one record for the given file. My problem is that the aggregation is done through a user-defined function that has become the bottleneck of the program.
This code:
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})
runs in less than a second, whereas this:
def keep_success(x):
    """Auxiliary function to keep `success` if it exists."""
    if (x == "success").any():
        return 'success'
    else:
        return x.iloc[-1]

merged = merged.groupby('file', as_index=False).agg({'type': keep_success})
takes more than a minute. So far I was using 'last', but a change in my program means that sometimes 'success' comes before 'failed', so I need to account for the unknown order of these two values.
TL;DR: I need a FAST way to aggregate records in a DataFrame sharing the file column, keeping just the value 'success' for the column type if there is any occurrence of this value within the group, and 'failed' otherwise.
EDIT to add my guess:
I think the problem is in the string comparison. The program has to go through ALL the database making trivial/useless comparisons that are systematically not fulfilled. To replace about 10 records, we need to check the equality of over 500,000 strings. Can I work around this by taking advantage of what I know, i.e. that most records, once grouped, are unique, so we do not need to do anything with them?
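One common way to avoid the per-group Python UDF (not part of the original post, just a sketch using the merged, file and type names above) is to compute a per-file "contains success" flag with vectorized operations first, so that a cheap built-in aggregation can finish the job:
# True for every row whose file group contains at least one 'success'.
has_success = (merged['type'] == 'success').groupby(merged['file']).transform('any')
# Overwrite type so that the fast built-in 'last' aggregation gives the desired result.
merged.loc[has_success, 'type'] = 'success'
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})
This only performs vectorized string comparisons and built-in groupby reductions, so it should stay close to the sub-second timing of the plain 'last' version.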

Pandas: Dealing with missing column in input dataframe

I have Python code which performs mathematical calculations on multiple columns of a dataframe. The input comes from various sources, so there is a possibility that sometimes a column is missing.
The column is missing because it is insignificant, but I need at least a null column for the code to run without errors.
I can add a null column with an if check, but there are around 120 columns and I do not want to slow down the code. Is there any other way for the code to check whether each column is present in the original dataframe and, if any column is missing, add a null column before starting execution of the actual code?
If you know that the column name is the same for every dataframe, you could do something like this without having to loop over the column names:
if col_name not in df.columns:
    df[col_name] = ''  # or whatever value you want to set it to
If speed is a big concern, which I can't tell, you could always convert the columns to a set with set(df.columns) and reduce the lookup to O(1) time, because it becomes a hashed search. You can read more about the efficiency of the in operator at this link: How efficient is Python's 'in' or 'not in' operators?
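For the ~120-column case from the question, a short loop over a list of expected names (required_cols below is a hypothetical list) only touches the columns that are actually missing, so it adds negligible overhead:
import numpy as np

required_cols = ['col_a', 'col_b', 'col_c']  # hypothetical list of the ~120 expected names
missing = [c for c in required_cols if c not in df.columns]
for c in missing:
    df[c] = np.nan  # add a null column so the downstream calculations still run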

Discard rows in dataframe if particular column values in list [duplicate]

I have a dataframe customers with some "bad" rows; the key in this dataframe is CustomerID. I know I should drop these rows. I have a list called badcu that looks like [23770, 24572, 28773, ...], where each value corresponds to a different "bad" customer.
Then I have another dataframe, let's call it sales, and I want to drop all the records for the bad customers, the ones in the badcu list.
If I do the following
sales[sales.CustomerID.isin(badcu)]
I get a dataframe with precisely the records I want to drop, but if I do
sales.drop(sales.CustomerID.isin(badcu))
it returns a dataframe with only the first row dropped (which is a legitimate order) and the rest of the rows intact (it doesn't delete the bad ones). I think I know why this happens, but I still don't know how to drop the rows with the incorrect customer ids.
You need to keep the complement of the mask (drop expects index labels, not a boolean mask):
new_df = sales[~sales.CustomerID.isin(badcu)]
You can also use query
sales.query('CustomerID not in @badcu')
I think the best way is to drop by index; try it and let me know:
sales.drop(sales[sales.CustomerID.isin(badcu)].index.tolist())
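A tiny self-contained example (with made-up data) illustrating why the boolean mask works with indexing but needs to be turned into labels for drop:
import pandas as pd

sales = pd.DataFrame({'CustomerID': [23770, 11111, 24572], 'amount': [10, 20, 30]})
badcu = [23770, 24572, 28773]

mask = sales.CustomerID.isin(badcu)   # boolean Series: True for rows to discard
kept = sales[~mask]                   # keeps only the CustomerID 11111 row
# drop() wants index labels, so pass the labels of the matching rows:
also_kept = sales.drop(sales[mask].index)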

SFrame manipulation slows down after adding a new column

I am building a repeat-orders report in an IPython notebook using GraphLab and SFrames. I have a CSV file with roughly 100k rows of data containing user_id, user_email, user_phone. I added a new column called unique_identifier. For each row I traverse all other rows to see whether user_id, user_email or user_phone matches the current record. If unique_identifier is not empty and there is a match, I assign the user_id from the current record into the unique_identifier slot of each matching record.
At the end, I get an SFrame with 4 columns, where unique_identifier contains the user_id of the oldest order for all matching orders. I am doing this via the .apply method with a lambda function. The whole process takes a few seconds on my laptop. However, after the process is done, the SFrame becomes extremely slow and unmanageable, to the point where SFrame.save seems to take forever.
It seems like my process of adding unique_identifier clogs up the memory or something like that. However, the problem occurs regardless of the SFrame's size: if I limit it to just 10 rows, the problem persists. What am I doing wrong?
Here is my method
def set_unique_identifier():
    orders['unique_identifier'] = ''
    orders['unique_identifier'] = orders.apply(
        lambda order: order['unique_identifier'] if order['unique_identifier']
        else orders[(orders['user_email'] == order['user_email']) |
                    (orders['phone'] == order['user_phone'])][0]['user_id'])
Don't use apply on the entire SFrame; instead, use it on an SArray. That should speed things up a little.
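As an illustration of that advice (a sketch only, assuming the GraphLab/Turi Create SFrame API and the column names from the question), applying a function to a single SArray hands the function one value at a time instead of a whole row dictionary:
# Row-wise apply over the whole SFrame materialises every column for every row:
#   orders.apply(lambda row: f(row['user_email']))
# Column-wise apply over one SArray only touches that column:
orders['user_email'] = orders['user_email'].apply(lambda e: e.strip().lower())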