Openpyxl and Binary Search - python

The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has nearly 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to an output file. The problem isn't too difficult, but with such a large number of rows, the run time is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
|ID                |a  |b |c   |
|------------------|---|--|----|
|587727227839578000|393|24|0.43|
My current solution is:
import openpyxl

g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)

g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)

# output_file is an already-open text file (opened elsewhere)
for row in grid1_rows:
    value1 = int(row[1].value)
    print(value1)
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value1 == value2:
            new_Name = int(row[0].value)
            print("match")
            output_file.write(str(new_Name))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet), perform a binary search for that value on the other sheet, and then, just like in my current solution, copy the entire row to a new file if it matches.
If there's an even faster method to do this I'm all ears. I'm not the greatest at python so any help is appreciated.
Thanks.
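(For reference, the binary-search idea described in the question could be sketched with the standard bisect module, assuming grid2_rows has been sorted ascending by the integer ID in its first column; the answer below takes a different approach:)

import bisect

# extract the sorted key column once, outside the loop over sheet 1
ids = [int(r[0].value) for r in grid2_rows]
# O(log n) search for value1 in the sorted IDs
pos = bisect.bisect_left(ids, value1)
if pos < len(ids) and ids[pos] == value1:
    row2 = grid2_rows[pos]  # the matching row from sheet 2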

You are getting your butt kicked here because you are using an inappropriate data structure, which forces the nested loop.
The example below uses sets to match the indices from the first sheet to those in the second sheet. This assumes there are no duplicates on either sheet, which would seem odd to have given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the two sets to find the values that appear on sheet 2.
Then we have the matches, but we can do better. If we put the second sheet's row data into a dictionary with the indices as the keys, then we can hold onto the row data while we do the match, rather than having to go hunting for the matching rows after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to just construct the dictionary at the start rather than the list and the dictionary.
Book 1 and Book 2: (screenshots of the example workbooks omitted; Book 1 holds the IDs to search for, and each row of Book 2 holds an ID, a name, and a qty)
Code:
import openpyxl

g1 = openpyxl.load_workbook('Book1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:]  # exclude the header

g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:]  # exclude the header

# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
#print(search_items)

# make a dictionary (key-value pairing) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value): (idx, t) for idx, t in enumerate(grid2_rows, start=1)}
#print(lookup_dict)

# now let's intersect the set of search items and key values to get the keys of the matches...
keys = search_items & lookup_dict.keys()
#print(keys)

for key in keys:
    idx = lookup_dict[key][0]       # the row index, if needed
    row_data = lookup_dict[key][1]  # the row data
    print(f'row {idx} matched value {key} and has data:')
    print(f'  name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
  name: steak      	 qty: 3
row 1 matched value 455 and has data:
  name: dogfood    	 qty: 10
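As a follow-up to the memory note above, a minimal sketch of building the dictionary straight from the row iterator, skipping the intermediate list (same file name as above assumed):

import openpyxl

g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active

rows = iter(grid2.rows)
next(rows)  # skip the header row instead of slicing a full list
# build the lookup dict directly from the iterator; no grid2_rows list is kept
lookup_dict = {int(t[0].value): (idx, t) for idx, t in enumerate(rows, start=1)}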

Related

How to get all combinations of records in csv file using python

I want all the combinations of records in a csv file (only one column).
Here is what the column looks like: (image omitted; a single 'Chief Complaints' column)
Here is my code for what I had done
import itertools

import pandas as pd

def embeddings_similarity(sentences):
    # first we need to get data into | sentence_a | sentence_b | format
    sentence_pairs = list(itertools.combinations(sentences, 2))
    sentence_a = [pair[0] for pair in sentence_pairs]
    sentence_b = [pair[1] for pair in sentence_pairs]
    sentence_pairs_df = pd.DataFrame({'sentence_a': sentence_a, 'sentence_b': sentence_b})
    return sentence_pairs_df  # return added so the print below shows the result

corpus = []
for index, row in df.iterrows():
    corpus.append(pre_process(row['Chief Complaints']))
print(embeddings_similarity(corpus))
From the above code I was able to get fairly good output: 36 rows (6 x 6) for the input in the picture.
But it takes a long time for more records, so I was wondering whether there is any other way to obtain the combinations of all records of a single column in a csv file.
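One possible speed-up, sketched below: stream each pair straight to disk with itertools.combinations instead of materializing every pair in a list and DataFrame first (the file names here are assumptions, and pre_process from the question is left out for brevity):

import csv
import itertools

import pandas as pd

df = pd.read_csv('input.csv')                # assumed input file
sentences = df['Chief Complaints'].tolist()

# write each pair as it is generated instead of holding all
# C(n, 2) pairs in memory at once
with open('pairs.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['sentence_a', 'sentence_b'])
    for a, b in itertools.combinations(sentences, 2):
        writer.writerow([a, b])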

Dict to Pandas Dataframe

Creating my first .csv export from a dict.
My dict has the following structure:
dict_all[key] = {
    "id_ja": None,
    "id_nein": None,
    "ZUW_ja": set(),
    "ZUW_nein": set(),
    "missing_ZUW_ja": set(),
    "missing_ZUW_nein": set()
}
My .CSV should look like:
ID_yes/ID_no, ZUW (this needs to be "ZUW" in every row), missing_ZUW_yes/missing_ZUW_nein and Relation (which needs to be -1 in every row)
For missing_ZUW_yes/missing_ZUW_nein I need to write a single row for each entry in the set.
That means the other three columns need to be duplicated for each ID inside my missing_ZUW_yes/missing_ZUW_nein set.
Probably the easiest approach is to iterate over id_yes first, adding a row for each entry in missing_ZUW_yes within that loop, and once the first half is done, to continue the same way with id_no and missing_ZUW_no. Am I right about that?
My relevant dict entries look like:
dict["LM_Doctor"]= {"id_ja": 122344, "id_nein":122345, "missing_ZUW_ja": 123,132,143,12, "missing_ZUW_ja": 432,64,321}
and in the csv it should look like this (header row, then one data row per entry in missing_ZUW_yes; the second data row is identical except for 132 in Term ID 2):

|Term ID 1|ZUW|Term ID 2|RV|
|---------|---|---------|--|
|122344|ZUW|123|-1|
|122344|ZUW|132|-1|
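A minimal sketch of that loop with the csv module (the example dict and column names follow the question; this is an assumption-laden sketch, not a tested export):

import csv

# hypothetical data in the structure described above
dict_all = {"LM_Doctor": {"id_ja": 122344, "id_nein": 122345,
                          "missing_ZUW_ja": {123, 132, 143, 12},
                          "missing_ZUW_nein": {432, 64, 321}}}

with open('export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Term ID 1', 'ZUW', 'Term ID 2', 'RV'])
    for entry in dict_all.values():
        # one row per ID in missing_ZUW_ja, duplicating the other columns
        for missing_id in entry['missing_ZUW_ja']:
            writer.writerow([entry['id_ja'], 'ZUW', missing_id, -1])
        # then the same for the "nein" half
        for missing_id in entry['missing_ZUW_nein']:
            writer.writerow([entry['id_nein'], 'ZUW', missing_id, -1])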

How to identify string repetition throughout rows of a column in a Pandas DataFrame?

I'm trying to think of a way to best handle this. If I have a data frame like this:
|Module|Line Item|Formula|repetition?|What repeated|Where repeated|
|------|---------|-------|-----------|-------------|--------------|
|Module 1|Line Item 1|hello[SUM: hello2]|yes|hello[SUM: hello2]|Module 1 Line Item 2|
|Module 1|Line Item 2|goodbye[LOOKUP: blue123] + hello[SUM: hello2]|yes|hello[SUM: hello2], goodbye[LOOKUP: blue123]|Module 1 Line Item 1, Module 2 Line Item 1|
|Module 2|Line Item 1|goodbye[LOOKUP: blue123] + some other line item|yes|goodbye[LOOKUP: blue123]|Module 1 Line Item 2|
How would I go about setting up a search to locate and identify repetition in the middle of a string, on its edges, or across complete strings?
Basically I have the Module, Line Item, and Formula columns filled in, but I need to figure out some sort of search function that I can apply to fill in each of the last three columns. I'm not sure where to start with this.
I want to match any repetition that spans three or more words. For example, if a formula of 1 + 2 + 3 + 4 occurred four times in the Formula column, I'd want to give a yes in the boolean "repetition?" column, return 1 + 2 + 3 + 4 in the "What repeated" column, and list every module/line item combination where it occurred in the last column. I'm sure I can tweak it more to fit my needs once I get started.
This one was a bit messy, and there is surely a more straightforward way to do some of the steps, but it worked for your data.
Step 1: I just reset_index() (assuming index uses row numbers) to get row numbers into a column.
df.reset_index(inplace=True)
I then wrote a for loop whose aim is to check, for each given value, whether that value appears anywhere else in the given column (using the .str.contains() function), and if so, where, and then to store that information in a dictionary. Note that here I used + to split the various values you search by, as that looked to be a valid separator in your dataset, but you can adjust this accordingly.
#the dictionary will have a key containing the row number and the value we searched for
#the value will contain the module and line item values
result = {}
#create a rownumber variable so we know where in the dataset we are
rownumber = -1
#now we just iterate over every row of the Formula series
for row in df['Formula']:
    rownumber += 1
    #and also every relevant value within that cell
    for value in row.split('+'):
        #we clean the value from trailing/preceding whitespace
        value = value.strip()
        #and then we build our key and value and update our dictionary
        key = 'row:|:' + str(rownumber) + ':|:' + value
        value = df.loc[(df.Formula.str.contains(value, regex=False)) & (df.index != rownumber), ['Module', 'Line Item']]
        result.update({key: value})
We can now unpack the dictionary into lists, keeping only the entries where we had a match:

where_raw = []
what_raw = []
rows_raw = []
for key, value in zip(result.keys(), result.values()):
    if 'Empty' in str(value):
        continue
    else:
        where_raw.append(list(value['Module'] + ' ' + value['Line Item']))
        what_raw.append(key.split(':|:')[2])
        rows_raw.append(int(key.split(':|:')[1]))
tempdf = pd.DataFrame({'row': rows_raw, 'where': where_raw, 'what': what_raw})
tempdf now contains one row per match; however, we want one row per original row in the df, so we combine all matches for each main row into one:
where = []
what = []
rows = []
for row in tempdf.row.unique():
    where.append(list(tempdf.loc[tempdf.row == row, 'where']))
    what.append(list(tempdf.loc[tempdf.row == row, 'what']))
    rows.append(row)
Lastly, we get the result by merging these matches back onto our original dataframe:
result = df.merge(pd.DataFrame({'index': rows, 'where': where, 'what': what}), how='left', on='index').drop('index', axis=1)
and we can add the repeated column like this:
result['repeated'] = result['what'].notna()  # the left merge leaves NaN where nothing repeated
print(result)
Module Line Item Formula what where
Module 1 Line Item 1 hello[SUM: hello2] ['hello[SUM: hello2]'] [['Module 1 Line Item 2']]
Module 1 Line Item 2 goodbye[LOOKUP: blue123] + hello[SUM: hello2] ['goodbye[LOOKUP: blue123]', 'hello[SUM: hello2]'] [['Module 2 Line Item 1'], ['Module 1 Line Item 1']]
Module 2 Line Item 1 goodbye[LOOKUP: blue123] + some other line item ['goodbye[LOOKUP: blue123]'] [['Module 1 Line Item 2']]

Converting excel data to nested dict and list

This is almost the same as my question from yesterday. But there I took it for granted to use a unique-value list to create the nested dict & list structure. Now I have come to the question of how to build this dict & list structure (referred to below as the data structure) row by row from the excel data.
The excel files (multiple files in a folder) all look like the following:
|Category|Subcategory|Name|
|--------|-----------|----|
|Main Dish|Noodle|Tomato Noodle|
|Main Dish|Stir Fry|Chicken Rice|
|Main Dish|Soup|Beef Goulash|
|Drink|Wine|Bordeaux|
|Drink|Softdrink|Cola|
My desired structure of dict & list structure is:
data = [0:{'data':0, 'Category':[
{'name':'Main Dish', 'Subcategory':[
{'name':'Noodle', 'key':0, 'data':['key':1, 'title':'Tomato Noodle']},
{'name':'Stir Fry', 'key':1, 'data':['key':2, 'title':'Chicken Rice']},
{'name':'Soup', 'key':2, 'data':['key':3, 'title':'Beef Goulash']}]},
{'name':'Drink', 'Subcategory':[
{'name':'Wine', 'key':0, 'data':['key':1, 'title':'Bordeaux']},
{'name':'Softdrink', 'key':1, 'data':['key':2, 'title':'cola']}]}]},
1:{'data':1, 'Category':.........#Same structure as dataset 0}]
So, for each excel file, it is fine: just loop through and set {'data':0, 'Category':[]}, {'data':1, 'Category':[]} and so on. The key point is that Main Dish has three entries in excel but needs only one entry in the data structure, and Drink has two entries in excel but only one in the data structure. Each subcategory nested in the category list follows the same rule: only unique values should be nested into the category. Then each corresponding dish Name goes into the data structure according to its category and subcategory.
The issue is that I cannot find a good way to convert the data to this data structure. Plus, there are other columns after the Name column, so it is rather involved. I was thinking of first extracting the unique values from the entire Category and Subcategory columns, which simplifies the process but leads to problems when filling in the corresponding Name values. If I take a row-by-row approach instead, designing a test for whether the subcategory or category already exists, so that only unique values are kept, is somewhat difficult at my current programming skill level...
Therefore, what would be the best approach to convert this excel file into such a data structure? Thank you very much.
One way could be to read the excelfile into a dataframe using pandas, and then build on this excellent answer Pandas convert DataFrame to Nested Json
import pandas as pd

excel_file = 'path-to-your-excel.xls'

def fdrec(df):
    drec = dict()
    ncols = df.values.shape[1]
    for line in df.values:
        d = drec
        for j, col in enumerate(line[:-1]):
            if not col in d.keys():
                if j != ncols - 2:
                    d[col] = {}
                    d = d[col]
                else:
                    d[col] = line[-1]
            else:
                if j != ncols - 2:
                    d = d[col]
    return drec

df = pd.read_excel(excel_file)
print(fdrec(df))
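For the sample table above, tracing fdrec by hand suggests output along these lines (my reconstruction, not from the original answer):

{'Main Dish': {'Noodle': 'Tomato Noodle', 'Stir Fry': 'Chicken Rice', 'Soup': 'Beef Goulash'},
 'Drink': {'Wine': 'Bordeaux', 'Softdrink': 'Cola'}}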

Python data wrangling issues

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use the thousands=',' argument for numbers that contain a comma:
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check Prize_Pool is numerical
In [3]: type(d.ix[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop duplicate rows, keeping the first observed (you can also keep the last):
In [7]: d.drop_duplicates('Contest_Date_EST', take_last=False)
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
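Note that .ix and take_last come from older pandas releases; in current pandas the equivalent calls would be roughly as follows (assumed modern API, not part of the original answer):

type(d.loc[0, 'Prize_Pool'])                         # .ix was removed in pandas 1.0
d.drop_duplicates('Contest_Date_EST', keep='first')  # take_last=False became keep='first'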
Edit: Just realized you're using pandas - should have looked at that. I'll leave this here for now in case it's applicable, but if it gets downvoted I'll take it down by virtue of peer pressure :) I'll try and update it to use pandas later tonight.
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools

class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print "\nKey: " + key
            i = 1
            for value in rows[key]:
                print "\nValue {index} : {value}".format(index=i, value=value)
                i += 1

    def readCsv(self, fileName):
        with open(fileName, 'rU') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = {key.strip(): key for key in reader.fieldnames}
            # Group each row by timestamp; note that groupby() only groups
            # *consecutive* equal keys, so the CSV should already be sorted
            # by Contest_Date_EST for this to gather all duplicates together
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x: x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(v.values()) for v in g]
        return groupedRows

    def normalizeRow(self, row):
        row[1] = float(row[1].replace(',', ''))  # "Prize_Pool"
        # and so on
        return row

if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
