Quick search in dataframe - python

We are inserting data into MongoDB using pymongo.
We need to insert a value into the database only if it is not already present there.
Since running find() for every record to check for duplicates takes a long time, one suggestion was to load all values for a given identifier from the database into a list and search that list instead of calling find() repeatedly.
The issue I am facing: if there are 1 million records for that identifier in the database, I load all 1 million into a list. While processing 3k new records, I check each one against the 1 million, which took around 9 minutes in total. I assume the list could be replaced by a dataframe for faster searching; can someone please help me achieve this?
This is my code snippet to bring in the records into list first:
project_keys = {"_id":0}
for key in queryString:
    project_keys[key] = 1
curr_data_arr = list(audit_collection.find({"source_name":default_attributes['source_name']}, project_keys))
Processing of new records:
df_dict = df.to_dict('records')
for row in df_dict:
    query_string={}
    for i in range(len(queryString)):
        if(queryString[i]=="source_name"):
            query_string[queryString[i]]=default_attributes["source_name"]
        else:
            query_string[queryString[i]]=row[queryString[i]]
    if(len(query_string)!=0):
        audit_data = next((sub for sub in curr_data_arr if (sub == query_string)), None)
        logger.info(f"Duplicate data check in list from mongo db, result is {audit_data}")
    else:
        audit_data = {}
    # do not insert/update if record found for audit data, log it and move to next record
    if(audit_data):
        duplicate_data_count += 1
        # logger.info(f"Duplicate record found for audit. Duplicate record being inserted: {row}")
        continue
    # Insert if no record found
    else:
        arr.append(row)
        new_data_count += 1
if(len(arr)>0):
    logger.info(f"Length of unique array is {len(arr)}")
    audit_collection.insert_many(arr)
In the snippet above, this line takes a long time to process: audit_data = next((sub for sub in curr_data_arr if (sub == query_string)), None). Example: the db has 1 million records and I am currently inserting 3k records, which took 9 minutes to check. If I remove that line and skip the duplicate check, the insertion finishes within 1 minute. How do I make this duplicate check faster? Can I load the data into a dataframe and search there instead of in a list? NOTE: The search would be based on the primary key/queryString. Would it become faster? Please provide code snippets.
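One way to avoid scanning the whole list for every new row is to index the existing records in a set of hashable keys, so each membership test takes roughly constant time. The following is only a sketch under assumptions: the helper names make_key and split_new_and_duplicate are hypothetical, the sample records stand in for curr_data_arr and df.to_dict('records'), the key values are assumed to be hashable, and source_name is read from each row rather than from default_attributes. Unlike the original loop, it also treats repeats within the new batch as duplicates.

```python
def make_key(record, key_fields):
    """Build a hashable key from the fields that define a duplicate."""
    return tuple(record.get(field) for field in key_fields)


def split_new_and_duplicate(existing_records, new_rows, key_fields):
    """Return (rows_to_insert, duplicate_count) using set lookups."""
    # One pass over the existing records builds the lookup set.
    seen = {make_key(rec, key_fields) for rec in existing_records}
    rows_to_insert = []
    duplicate_count = 0
    for row in new_rows:
        key = make_key(row, key_fields)
        if key in seen:
            duplicate_count += 1
        else:
            seen.add(key)  # also catches repeats inside the new batch
            rows_to_insert.append(row)
    return rows_to_insert, duplicate_count


# Hypothetical sample data (not taken from the question)
existing = [{"source_name": "src", "id": 1}, {"source_name": "src", "id": 2}]
incoming = [
    {"source_name": "src", "id": 2, "value": "x"},  # already in the db
    {"source_name": "src", "id": 3, "value": "y"},  # new
]
to_insert, dup_count = split_new_and_duplicate(
    existing, incoming, ["source_name", "id"]
)
print(to_insert, dup_count)
```

With this shape, the 1 million existing records are traversed once to build the set, and each of the 3k new rows costs a single hash lookup instead of a scan.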

Related

What is the best way to query a pytable column with many values?

I have a 11 columns x 13,470,621 rows pytable. The first column of the table contains a unique identifier to each row (this identifier is always only present once in the table).
This is how I select rows from the table at the moment:
my_annotations_table = h5r.root.annotations
# Loop through table and get rows that match gene identifiers (column labeled gene_id).
for record in my_annotations_table.where("(gene_id == b'gene_id_36624' ) | (gene_id == b'gene_id_14701' ) | (gene_id == b'gene_id_14702')"):
    # Do something with the data.
Now this works fine with small datasets, but I will need to routinely perform queries in which I can have many thousands of unique identifiers to match in the table's gene_id column. For these larger queries, the query string can quickly get very large and I get an exception:
File "/path/to/my/software/python/python-3.9.0/lib/python3.9/site-packages/tables/table.py", line 1189, in _required_expr_vars
cexpr = compile(expression, '<string>', 'eval')
RecursionError: maximum recursion depth exceeded during compilation
I've looked at this question (What is the PyTables counterpart of a SQL query "SELECT col2 FROM table WHERE col1 IN (val1, val2, val3...)"?), which is somewhat similar to mine, but the answers were not satisfactory.
I come from an R background where we often do these kinds of queries (i.e. my_data_frame[my_data_frame$gene_id %in% c("gene_id_1234", "gene_id_1235"),]) and was wondering if there is a comparable solution that I could use with PyTables.
Thanks very much,
Another approach to consider is combining 2 functions: Table.get_where_list() with Table.read_coordinates()
Table.get_where_list(): gets the row coordinates fulfilling the given condition.
Table.read_coordinates(): Gets a set of rows given their coordinates (in a list), and returns as a (record) array.
The code would look something like this:
my_annotations_table = h5r.root.annotations
# Byte strings, to match the b'...' values used for gene_id in the question
gene_name_list = [b'gene_id_36624', b'gene_id_14701', b'gene_id_14702']
# Loop through gene names and get rows that match gene identifiers (column labeled gene_id)
gene_row_list = []
for gene_name in gene_name_list:
    # gene_name in the condition is resolved from the local namespace
    gene_rows = my_annotations_table.get_where_list("gene_id == gene_name")
    gene_row_list.extend(gene_rows)
# Retrieve all of the data in one call
gene_data_arr = my_annotations_table.read_coordinates(gene_row_list)
Okay, I managed to do some satisfactory improvements on this.
1st: optimize the table (with the help of the documentation - https://www.pytables.org/usersguide/optimization.html)
Create table. Make sure to specify the expectedrows=<int> arg as it has the potential to increase the query speed.
table = h5w.create_table("/", 'annotations',
                         DataDescr, "Annotation table unindexed",
                         expectedrows=self._number_of_genes,
                         filters=tb.Filters(complevel=9, complib='blosc'))
# tb comes from import tables as tb ...
I also modified the input data so that the gene_id_12345 fields are simple integers (gene_id_12345 becomes 12345).
Once the table is populated with its 13,470,621 entries (i.e. rows),
I created a complete sorted index based on the gene_id column (Column.create_csindex()) and sorted it.
table.cols.gene_id.create_csindex()
table.copy(overwrite=True, sortby='gene_id', newname="Annotation table", checkCSI=True)
# Just make sure that the index is usable. Will print an empty list if not.
print(table.will_query_use_indexing('(gene_id == 57403)'))
2nd - The table is optimized, but I still can't query thousands of gene_ids at a time. So I simply separated them in chunks of 31 gene_ids (yes 31 was the absolute maximum, 32 was too much apparently).
I did not perform benchmarks, but querying ~8000 gene_ids now takes approximately 10 seconds which is acceptable for my needs.
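The chunking step described above can be sketched in plain Python. This is an illustration only: the helper names chunked and build_condition are hypothetical, the gene IDs are made-up integers, and the chunk size of 31 is simply the limit reported in this answer, not a documented PyTables constant.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_condition(gene_ids, column="gene_id"):
    """Build an OR-joined condition string for one chunk of integer IDs."""
    return " | ".join("({} == {})".format(column, int(g)) for g in gene_ids)


gene_ids = list(range(1, 101))  # hypothetical integer gene IDs
conditions = [build_condition(chunk) for chunk in chunked(gene_ids, 31)]
print(len(conditions))
print(conditions[-1])
# Each condition string could then be passed to table.where(condition) in turn.
```

Keeping each condition short avoids the very deep expression that triggered the RecursionError, at the cost of one query per chunk.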

Why is this sql statement super slow?

I am writing large amounts of data to a sqlite database. I am using a temporary dataframe to find unique values.
This sql code takes forever in conn.execute(sql)
if upload_to_db == True:
    print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
    master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
    with engine.begin() as cn:
        sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
                 SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
                 FROM tempTable t
                 WHERE NOT EXISTS
                     (SELECT 1 FROM instrumentsHistory f
                      WHERE t.datetime = f.datetime
                      AND t.instrumentSymbol = f.instrumentSymbol
                      AND t.observation = f.observation
                      AND t.observationColName = f.observationColName)"""
        print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
        cn.execute(sql)
Running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows roughly? About 15,000 at a time. Basically, it pulls data into a pandas dataframe, makes some transformations, and then writes it to a SQLite database. There are probably 600 different instruments, each having around 15,000 rows, so about 9M rows ultimately, give or take a million.
Depending on your SQL database, you could try using something like INSERT IGNORE INTO (MySQL), INSERT OR IGNORE INTO (SQLite), or MERGE (e.g. on Oracle), which would do the insert only if it would not violate a primary key or unique constraint. This assumes that such a constraint exists on the 4 columns which you are checking.
In the absence of merge, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory (datetime, instrumentSymbol, observation,
observationColName);
This index would allow for rapid lookup of each incoming record, coming from the tempTable, and so might speed up the insert process.
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table - and match four columns - until a match is found. In the worst case, there is no match and a full table scan must be completed. Therefore, the performance of the query will deteriorate as the table grows in size.
The solution, as mentioned in Tim's answer, is to create an index over the four columns so that the db can quickly determine whether a match exists.
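A minimal, self-contained sketch of the indexed version, using the standard-library sqlite3 module and an in-memory database rather than the asker's SQLAlchemy engine. The table and column names come from the question; the sample rows and the untyped column definitions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cols = "datetime, instrumentSymbol, observation, observationColName"
conn.execute(f"CREATE TABLE instrumentsHistory ({cols})")
conn.execute(f"CREATE TABLE tempTable ({cols})")
# Composite index over the four columns compared in the NOT EXISTS subquery
conn.execute(f"CREATE INDEX idx ON instrumentsHistory ({cols})")

# Hypothetical sample rows: one already stored, one new
conn.execute(
    "INSERT INTO instrumentsHistory VALUES ('2020-01-01', 'ABC', 1.0, 'close')"
)
conn.executemany(
    "INSERT INTO tempTable VALUES (?, ?, ?, ?)",
    [
        ("2020-01-01", "ABC", 1.0, "close"),  # duplicate of the stored row
        ("2020-01-02", "ABC", 2.0, "close"),  # new row
    ],
)

conn.execute(f"""
    INSERT INTO instrumentsHistory ({cols})
    SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
    FROM tempTable t
    WHERE NOT EXISTS
        (SELECT 1 FROM instrumentsHistory f
         WHERE t.datetime = f.datetime
         AND t.instrumentSymbol = f.instrumentSymbol
         AND t.observation = f.observation
         AND t.observationColName = f.observationColName)
""")
row_count = conn.execute("SELECT COUNT(*) FROM instrumentsHistory").fetchone()[0]
print(row_count)
conn.close()
```

Only the new row is inserted. On a large table, the index should let the subquery locate a candidate match without scanning every row, though the actual speed-up depends on the data.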

Getting information for multiple queries across multiple .csv files

I am currently trying to figure out a way to get information stored across multiple datasets as .csv files.
Context
For the purposes of this question, suppose I have 4 datasets: experiment_1.csv, experiment_2.csv, experiment_3.csv, and experiment_4.csv. In each dataset, there are 20,000+ rows with 80+ columns in each row. Each row represents an Animal, identified by an ID number, and each column represents various experimental data about that Animal. Assume each row's Animal ID number is unique within each dataset, but not across all datasets. For instance, ID#ABC123 can be found in experiment_1.csv and experiment_2.csv, but not in experiment_3.csv and experiment_4.csv.
Problem
Say a user wants to get info for ~100 Animals by looking up each Animal's ID # across all datasets. How would I go about doing this? I'm relatively new to programming, and I would like to improve. Here's what I have so far.
class Animal:
    def __init__(self, id_number, *other_parameters):
        self.animal_id = id_number
        self.animal_data = {}

    def store_info(self, csv_row, dataset):
        self.animal_data[dataset] = csv_row

# Main function
# ...
# Assume animal_queries = list of Animal Objects
# Iterate through each dataset csv file
for dataset in all_datasets:
    # Make a copy of the list of queries
    animal_queries_copy = animal_queries[:]
    with open(dataset, 'r', newline='') as dataset_file:
        reader = csv.DictReader(dataset_file, delimiter=',')
        # Iterate through each row in the csv file
        for row in reader:
            # Check if the list is not empty
            if animal_queries_copy:
                # Get the current row's animal id number
                row_animal_id = row['ANIMAL ID']
                # Check if the animal id number matches with a query for
                # every animal in the list
                for animal in animal_queries_copy[:]:
                    if animal.animal_id == row_animal_id:
                        # If a match is found, store the info, remove the
                        # query from the list, and exit iterating through
                        # each query
                        animal.store_info(row, dataset)
                        animal_queries_copy.remove(animal)
                        break
            # If the list is empty, all queries were found for the current
            # dataset, so exit iterating through rows in reader
            else:
                break
Discussion
Is there a more obvious approach for this? Assume that I want to use .csv files for now, and I will consider converting these .csv files to an easier-to-use format like SQL Tables later down the line (I am an absolute beginner at databases and SQL, so I need to spend time learning this).
The one thing that sticks out to me is that I have to create multiple copies of animal_queries: 1 for each dataset, and 1 for each row in a dataset (in the for loop). Since 1 row only contains 1 ID, I can exit the loop early once I find a match to an ID from animal_queries. In addition, since that ID was already found, I no longer need to search for that ID for the rest of the current dataset, so I remove it from the list, but I need to keep the original copy of the queries since I also need it to search the remaining datasets. However, I can't remove an element from a list while inside a for loop, so I need to create another copy as well. This doesn't seem optimal to me and I'm wondering if I'm approaching this in the wrong direction. Any help would be appreciated, thanks!
Well, you could greatly speed this up by using the pandas library for one thing. Ignoring the class definition for now, you could do the following:
import pandas as pd
file_names = ['ex_1.csv', 'ex_2.csv']
animal_queries = ['foo', 'bar'] #input by user
#create list of data sets
data_sets = [pd.read_csv(_file) for _file in file_names]
#create store of retrieved data
retrieved_data = [d_s[d_s['ANIMAL ID'].isin(animal_queries)] for d_s in data_sets]
#concatenate the data
final_data = pd.concat(retrieved_data)
#export to csv
final_data.to_csv('your_data')
This simplifies things a lot. The isin method selects the rows of each data frame whose ANIMAL ID is found in the list animal_queries. Incidentally, pandas will also help you work with SQL tables, so it is probably a good route for you to go down.
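If pandas is not an option, the list copies in the question can also be avoided with the standard library alone by keeping the queried IDs in a set, which supports fast membership tests and removal. This is a rough sketch, not the approach from the answer above: the function name find_animals is hypothetical, and in-memory io.StringIO objects with made-up rows stand in for the real .csv files.

```python
import csv
import io


def find_animals(dataset_files, wanted_ids, id_column="ANIMAL ID"):
    """Map each wanted ID to {dataset name: matching row}."""
    wanted = set(wanted_ids)
    found = {animal_id: {} for animal_id in wanted}
    for name, handle in dataset_files.items():
        remaining = set(wanted)  # fresh copy of the queries per dataset
        for row in csv.DictReader(handle):
            animal_id = row[id_column]
            if animal_id in remaining:
                found[animal_id][name] = row
                remaining.discard(animal_id)
                if not remaining:
                    break  # every query was found in this dataset
    return found


# Hypothetical stand-ins for the experiment files
datasets = {
    "experiment_1.csv": io.StringIO("ANIMAL ID,weight\nABC123,10\nXYZ789,12\n"),
    "experiment_2.csv": io.StringIO("ANIMAL ID,weight\nABC123,11\n"),
}
result = find_animals(datasets, ["ABC123", "XYZ789"])
print(sorted(result["ABC123"]))
```

Each row is checked against the set once, so there is no inner loop over the queries and no need to copy the list while iterating.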

aggregating and pasting

I am beginning to move from R to Python and have a stupid question.
I have been looking for close to 5 hours to find a solution to my question.
I have the following code in R, which essentially takes the dataframe df and aggregates the outdates from a hospital based on unique ids. So my original table has many UIds repeated since someone may visit a hospital many times and each time they leave the hospital they have an out date. I want the UID, and all the outdates in one row. I could do this very easily with the following code in R.
newdf= aggregate(data = df, OutDate~UID, FUN=paste, sep="," )
Can anyone pray tell me how this can be accomplished in Python?
Here's what my table looks like after using the above function in R:
-UID1, 10/20/2008, 11/30/2008, 1/1/1900, 1/1/1900
-UID2, 6/19/2010, 1/1/1900
-UID3, 11/17/2009
-UID4, 3/14/2010 , 4/20/2010, 1/1/1900, 1/1/1900
-UID5, 12/12/2008, 8/27/2009, 1/1/1900
Ignore the dates, i just made them up. But the output needs to look like above.
Previously I had multiple UID1 rows for each of the dates in the current columns.
Now how do I do this in Python?
You can do this with a defaultdict:
from collections import defaultdict
d = defaultdict(list)
# Assumes df is a pandas DataFrame, so df.values is an array of rows
for f in df.values:
    # Assuming the first value is the UID:
    d[f[0]].append(f)
Now d is a dictionary, where each key is the UID and the values are a list of rows from the dataframe. You can combine them into a string (like what you are doing with paste), like this:
for uid, values in d.items():
    for value in values:
        print('{},{}'.format(uid, ','.join(map(str, value))))
This sounds like building a dictionary where the key is the UID and you append each outdate to the key as you loop through the data. This assumes that you are getting the data in the form of a csv file where each row of data is read by csv.DictReader. I make the assumption based on what you seem to show of the data file and the separators. As a result, each entry in the row (which can include in time, out time, diagnosis, etc.) is keyed by the header row. I will also assume that you know how to read the data in with the csv module. The quick code below shows how to generate the dictionary entries from the row once you have it in.
I show the final way the data will look followed by how it was derived.
data = {UID1: [out1, out2, out3], UID2: [out3, out4]}
data = {}
for d in datarow:
    uid = d[UID]
    if uid not in data:
        # Use a list, since tuples cannot be appended to
        data[uid] = []
    out = d[OUT]
    data[uid].append(out)
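If the data is already in a pandas DataFrame, the R aggregate/paste call has a fairly direct counterpart in groupby followed by a string join. The sketch below is an assumption-laden illustration rather than part of the answers above: it assumes the columns are named UID and OutDate as in the R formula, that OutDate holds strings, and it uses made-up sample rows.

```python
import pandas as pd

# Hypothetical sample data: one row per hospital visit
df = pd.DataFrame({
    "UID": ["UID1", "UID1", "UID2", "UID3", "UID2"],
    "OutDate": ["10/20/2008", "11/30/2008", "6/19/2010", "11/17/2009", "1/1/1900"],
})

# Group the rows by UID and join each group's out dates into one string
newdf = df.groupby("UID")["OutDate"].agg(",".join).reset_index()
print(newdf)
```

If OutDate holds datetime values rather than strings, it would need converting first, for example with df["OutDate"].astype(str).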

Dealing with subsets of data using csv.DictReader

I'm parsing a big CSV file using csv.DictReader.
quotes=open( "file.csv", "rb" )
csvReader= csv.DictReader( quotes )
Then for each row I'm converting the time value in the CSV in datetime using this :
for data in csvReader:
    year = int(data["Date"].split("-")[2])
    month = strptime(data["Date"].split("-")[1],'%b').tm_mon
    day = int(data["Date"].split("-")[0])
    hour = int(data["Time"].split(":")[0])
    minute = int(data["Time"].split(":")[1])
    bars = datetime.datetime(year,month,day,hour,minute)
Now I would like to perform actions only on the rows of the same day. Would it be possible to do it in the same for loop or should I maybe save the data out per day and then perform actions? What would be an efficient way of baking the parsing?
As jogojapan has pointed out, it is important to know whether we can assume that the CSV file is sorted by date. If it is, then you could use itertools.groupby to simplify your code. For example, the for loop in this code iterates over the data one day at time:
import csv
import datetime
import itertools
with open("file.csv", "rb") as quotes:
    csvReader = csv.DictReader(quotes)
    lmb = lambda d: datetime.datetime.strptime(d["Date"], "%d-%b-%Y").date()
    for k, g in itertools.groupby(csvReader, key = lmb):
        # do stuff per day
        counts = (int(data["Count"]) for data in g)
        print "On {0} the total count was {1}".format(k, sum(counts))
I created a test "file.csv" containing the following data:
Date,Time,Count
1-Apr-2012,13:23,10
2-Apr-2012,10:57,5
2-Apr-2012,11:38,23
2-Apr-2012,15:10,1
3-Apr-2012,17:47,123
3-Apr-2012,18:21,8
and when I ran the above code I got the following results:
On 2012-04-01 the total count was 10
On 2012-04-02 the total count was 29
On 2012-04-03 the total count was 131
But remember that this will only work if the data in "file.csv" is sorted by date.
If (for some reason) you can assume that the input rows are already sorted by date, you could put them into a local container one by one as long as the date of any new row is the same as the previous one:
same_date_rows = []
prev_date = None
for data in csvReader:
    # ... your existing code
    bars = datetime.datetime(year,month,day,hour,minute)
    # Compare only the date part, ignoring hour and minute
    if bars.date() == prev_date:
        same_date_rows.append(data)
    else:
        # New date. We process all rows collected so far (if any)
        if same_date_rows:
            do_something(same_date_rows)
        # Then we start a new collection for the new date
        same_date_rows = [data]
    # Remember the date of the current row
    prev_date = bars.date()
# Finally, process the final group of rows
do_something(same_date_rows)
But if you cannot make that assumption, you will have to
Either: Put the rows in a long list, sort that by date, and then apply an algorithm like the above to the sorted list
Or: Put the rows in a dictionary, using the date as key, and a list of rows as value for each key. Then you can iterate through the keys of that dictionary to get access to all rows that share a date.
The second of these two approaches is a little more space-consuming, but it may allow you to do some of the date-specific processing in the main loop, because whenever you receive a new row for an already-existing date, you could apply some of the date-specific processing right away, possibly avoiding the need to actually store all date-specific rows explicitly. Whether that is possible depends on what kind of processing you apply to the rows.
If you are not aiming for space efficiency, an elegant solution would be to create a dictionary where the key is your day and the value is a list object in which all the information for that day is stored. Later you can perform whatever operations you want on a per-day basis.
For example
d = {}  # Initialize empty dictionary
for data in csvReader:
    Day = int(data["Date"].split("-")[0])
    try:
        d[Day].append('Some_Val')
    except KeyError:
        d[Day] = ['Some_Val']
This will either modify or create a new list object for each day. This is later easily accessible either by iterating over the dictionary or simply referring to the day as a key.
For example:
d[Some_Day]
will simply give you a list object with all the information you have stored. Given the constant average lookup time of a dictionary, it should be quite efficient in terms of time.
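The dictionary approach can also be written with collections.defaultdict, keyed by the full date rather than only the day of the month, so that rows from different months do not end up in the same list. This is a sketch in Python 3 under assumptions: the in-memory sample stands in for file.csv, and the Date and Count columns follow the test file shown in the earlier answer.

```python
import csv
import datetime
import io
from collections import defaultdict

# Hypothetical, deliberately unsorted sample standing in for file.csv
sample = io.StringIO(
    "Date,Time,Count\n"
    "2-Apr-2012,10:57,5\n"
    "1-Apr-2012,13:23,10\n"
    "2-Apr-2012,11:38,23\n"
)

rows_by_day = defaultdict(list)
for data in csv.DictReader(sample):
    day = datetime.datetime.strptime(data["Date"], "%d-%b-%Y").date()
    rows_by_day[day].append(data)  # missing keys start as an empty list

# Per-day processing can now happen in any order
totals = {
    day: sum(int(row["Count"]) for row in rows)
    for day, rows in rows_by_day.items()
}
for day in sorted(totals):
    print(day, totals[day])
```

Because the rows are grouped by key rather than by position, this works whether or not the file is sorted by date.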
