The problem: I have two spreadsheets. Spreadsheet 1 has about 20,000 rows. Spreadsheet 2 has nearly 1 million rows. When a value from a row in spreadsheet 1 matches a value from a row in spreadsheet 2, the entire row from spreadsheet 2 is written to an output file. The problem isn't too difficult, but with such a large number of rows, the run time is incredibly long.
Book 1 Example:
|Key |Value |
|------|------------------|
|397241|587727227839578000|
An example of book 2:
|ID                |a  |b |c   |
|------------------|---|--|----|
|587727227839578000|393|24|0.43|
My current solution is:
import openpyxl

g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)

g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)

output_file = open('output.csv', 'w')  # output destination (not shown in my original snippet)

for row in grid1_rows:
    value1 = int(row[1].value)
    print(value1)
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value1 == value2:
            new_Name = int(row[0].value)
            print("match")
            output_file.write(str(new_Name))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")
This solution works, but again the runtime is absurd. Ideally I'd like to take value1 (which comes from the first sheet), perform a binary search for that value on the other sheet, and then, just like in my current solution, copy the entire row to a new file if it matches.
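Something like this is roughly what I have in mind for the binary search (just a sketch; it assumes the sheet-2 IDs are first pulled into a sorted list, with the header row skipped):

from bisect import bisect_left

# sketch only: ids is a sorted list of the sheet-2 key column (header row skipped)
ids = sorted(int(r[0].value) for r in grid2_rows[1:])

def has_match(ids, value1):
    i = bisect_left(ids, value1)
    return i < len(ids) and ids[i] == value1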
If there's an even faster method to do this I'm all ears. I'm not the greatest at python so any help is appreciated.
Thanks.
You are getting your butt kicked here because you are using an inappropriate data structure, which requires you to use the nested loop.
The below example uses sets to match indices from the first sheet to those in the second sheet. This assumes there are no duplicates on either sheet; duplicates would seem odd given your problem description. Once we make sets of the indices from both sheets, all we need to do is intersect the two sets to find the ones that also appear on sheet 2.
Then we have the matches, but we can do better. If we put the second-sheet row data into a dictionary with the indices as the keys, then we can hold onto the row data while we do the match, rather than have to go hunting for the matching indices after intersecting the sets.
I've also put in an enumeration, which may or may not be needed to identify which rows in the spreadsheet are the ones of interest. Probably not needed.
This should execute in the blink of an eye after things are loaded. If you start to have memory issues, you may want to just construct the dictionary at the start rather than the list and the dictionary.
(Screenshots of the example workbooks Book1.xlsx and Book2.xlsx are omitted here.)
Code:
import openpyxl

g1 = openpyxl.load_workbook('Book1.xlsx', read_only=True)
grid1 = g1.active
grid1_rows = list(grid1.rows)[1:]   # exclude the header

g2 = openpyxl.load_workbook('Book2.xlsx', read_only=True)
grid2 = g2.active
grid2_rows = list(grid2.rows)[1:]   # exclude the header

# make a set of the values in Book 1 that we want to search for...
search_items = {int(t[0].value) for t in grid1_rows}
#print(search_items)

# make a dictionary (key-value pairing) for the items in the 2nd book, and
# include an enumeration so we can capture the row number
lookup_dict = {int(t[0].value): (idx, t) for idx, t in enumerate(grid2_rows, start=1)}
#print(lookup_dict)

# now let's intersect the set of search items and the dictionary keys to get the keys of the matches...
keys = search_items & lookup_dict.keys()
#print(keys)

for key in keys:
    idx = lookup_dict[key][0]       # the row index, if needed
    row_data = lookup_dict[key][1]  # the row data
    print(f'row {idx} matched value {key} and has data:')
    print(f'    name: {row_data[1].value:10s} \t qty: {int(row_data[2].value)}')
Output:
row 3 matched value 202 and has data:
name: steak qty: 3
row 1 matched value 455 and has data:
name: dogfood qty: 10
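Applied back to your own workbooks and CSV-style output, a rough sketch of the same idea might look like this (the paths and output file name are assumptions; column positions follow your original snippet):

import openpyxl

g1 = openpyxl.load_workbook('path/to/sheet/sheet1.xlsx', read_only=True)
grid1_rows = list(g1.active.rows)[1:]   # skip the header row

g2 = openpyxl.load_workbook('path/to/sheet2/sheet2.xlsx', read_only=True)
grid2_rows = list(g2.active.rows)[1:]   # skip the header row

# map sheet 1: Value -> Key, so each match can recover its name in O(1)
name_by_value = {int(r[1].value): int(r[0].value) for r in grid1_rows}

# write out every sheet-2 row whose ID appears in sheet 1
with open('matches.csv', 'w') as output_file:   # output path assumed
    for row2 in grid2_rows:
        value2 = int(row2[0].value)
        if value2 in name_by_value:             # O(1) dictionary lookup instead of an inner loop
            output_file.write(str(name_by_value[value2]))
            output_file.write(",")
            output_file.write(",".join(str(c.value) for c in row2[1:]))
            output_file.write("\n")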
There's lots of good information out there on how to read space-delimited data with missing values if the data is fixed-width.
http://jonathansoma.com/lede/foundations-2017/pandas/opening-fixed-width-files/
Reading space delimited file in Python/Pandas with missing values
ASCII table with consecutive white-spaces as separators and missing data python pandas
I'm currently trying to read Japan's Meteorological Agency typhoon history data which is supposed to have this format, but doesn't actually:
# Header rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII
# Data rows:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|::::+::::|
AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P
It is very similar to NOAA's hurricane best track data, except that that data is comma delimited and missing values were given as -999 or NaN, which simplified reading it. Additionally, Japan's data doesn't actually follow the advertised format. For example, column FFFF in the data rows doesn't always have width 4. Sometimes it has width 3.
I must say that I'm at a complete loss as to how to process this data into a dataframe. I've investigated the pd.read_fwf method, and it initially looked promising until I discovered the malformed columns and the two different row types.
My question:
How can I approach cleaning this data and getting it into a dataframe? I'd just find a different dataset, but honestly I can't find any comprehensive typhoon data anywhere else.
I went a little deep for you here, because I'm assuming you're doing this in the name of science, and if I can help someone trying to understand climate change then it's a good cause.
After looking the data over, I've noticed the issue relates to the data being stored in a de-normalized structure. There are two ways you can approach this issue off the top of my head. Re-writing the file to another file to load into pandas or dask is what I'll show, since that's probably the easiest way to think about it (but certainly not the most efficient, for those that will inevitably roast me in the comments).
Think of this like it's two separate tables, with a 1-to-many relationship: one table for typhoons and another for the data belonging to a given typhoon.
A decent, but not really efficient, way would be to rewrite it to a better nested structure, like JSON, and then load the data in using that. Note the two distinct sets of columns (one for the typhoon rows, one for the data rows).
Step 1: map the data out
There are really 2 tables in one table here. Each typhoon is going to show up as a row that appears like this:
66666 9119 150 0045 9119 0 6 MIRREILE 19920701
While the records for that typhoon are going to follow that row (think of these as rows of a separate table):
20080100 002 3 178 1107 994 035 00000 0000 30600 0200
Load the File in, reading it as raw lines. By using the .readlines() method, we can read each individual line in as an item in a list.
# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()
Now that we have that read in, we're going to need to perform some logic to separate some lines from others. It appears that every time there is a typhoon record, the line starts with '66666', so let's key off that. So, given that we look at each individual line in a horribly inefficient loop, we can write some if/else logic to have a look:
if row[:5] == '66666':
    # do stuff
else:
    # do other stuff
That's going to be a pretty solid way to separate that logic for now, which will be useful to guide splitting that up. Now, we need to write a loop that will check that for each row:
# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        pass  # do stuff
    else:
        pass  # do other stuff

# read through the lines list from .readlines(), looping sequentially
for line in lines:
    write_typhoon(line, collection)
Lastly, we're going to need to write some logic to extract that data within the if/else branches inside the write_typhoon() function. I didn't care to do a whole lot of thinking here, and opted for the simplest thing I could make: defining the fixed-width offsets myself, because "yolo":
def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        typhoon = {
            "AA": row[:5],
            "BB": row[6:11],
            "CC": row[12:15],
            "DD": row[16:20],
            "EE": row[21:25],
            "FF": row[26:27],
            "GG": row[28:29],
            "HH": row[30:50],
            "II": row[51:],
            "data": []
        }
        # clean that whitespace
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A": row[:9],
            "B": row[9:12],
            "C": row[13:14],
            "D": row[15:18],
            "E": row[19:23],
            "F": row[24:32],
            "G": row[33:40],
            "H": row[41:42],
            "I": row[42:46],
            "J": row[47:51],
            "K": row[52:53],
            "L": row[54:57],
            "M": row[58:70],
            "P": row[71:]
        }
        # clean that whitespace
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection
Okay, that took me longer than I'm willing to admit. I won't lie. Gave me PTSD flashbacks from writing COBOL programs...
Anyway, now we have a nice, nested data structure in native python types. The fun can begin!
Step 2: Load this into a usable format
To analyze it, I'm assuming you'll want it in pandas (or maybe Dask if it's too big). Here is what I was able to come up with on that front:
import pandas as pd

df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II"]
)
A great reference for that can be found in the answers for this question (particularly the second one, not the selected one)
Put it all together now:
import pandas as pd

# load the file as raw input
with open('./test.txt') as f:
    lines = f.readlines()

# initialize list of dicts
collection = []

def write_typhoon(row: str, collection: list) -> list:
    if row[:5] == '66666':
        typhoon = {
            "AA": row[:5],
            "BB": row[6:11],
            "CC": row[12:15],
            "DD": row[16:20],
            "EE": row[21:25],
            "FF": row[26:27],
            "GG": row[28:29],
            "HH": row[30:50],
            "II": row[51:],
            "data": []
        }
        for key, value in typhoon.items():
            if key != 'data':
                typhoon[key] = value.strip()
        collection.append(typhoon)
    else:
        sub_data = {
            "A": row[:9],
            "B": row[9:12],
            "C": row[13:14],
            "D": row[15:18],
            "E": row[19:23],
            "F": row[24:32],
            "G": row[33:40],
            "H": row[41:42],
            "I": row[42:46],
            "J": row[47:51],
            "K": row[52:53],
            "L": row[54:57],
            "M": row[58:70],
            "P": row[71:]
        }
        for key, value in sub_data.items():
            sub_data[key] = value.strip()
        collection[-1]['data'].append(sub_data)
    return collection

# read through file sequentially
for line in lines:
    write_typhoon(line, collection)

# load to pandas df using json_normalize
df = pd.json_normalize(
    collection,
    record_path='data',
    meta=["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH", "II"]
)

print(df.head(20))  # let's see what we've got!
There's someone who might have had the same problem and created a library for it; you can check it out here:
https://github.com/miniufo/besttracks
It also includes a quickstart notebook with loading the same dataset.
Here is how I ended up doing it. The key was realizing there are two types of rows in the data, but within each type the columns are fixed width:
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
So, here's how it went. I wrote these two functions to help me reformat the text file into CSV format:
def get_idxs(string, char):
    idxs = []
    for i in range(len(string)):
        if string[i - 1].isalpha() and string[i] == char:
            idxs.append(i)
    return idxs

def replace(string, idx, replacement):
    string = list(string)
    try:
        for i in idx: string[i] = replacement
    except TypeError:
        string[idx] = replacement
    return ''.join(string)
# test it out
header_fmt = "AAAAA BBBB CCC DDDD EEEE F G HHHHHHHHHHHHHHHHHHHH IIIIIIII"
track_fmt = "AAAAAAAA BBB C DDD EEEE FFFF GGG HIIII JJJJ KLLLL MMMM P"
header_idxs = get_idxs(header_fmt, ' ')
track_idxs = get_idxs(track_fmt, ' ')
print(replace(header_fmt, header_idxs, ','))
print(replace(track_fmt, track_idxs, ','))
Testing the function on the format strings, we see commas were put in the appropriate places:
AAAAA,BBBB, CCC,DDDD,EEEE,F,G,HHHHHHHHHHHHHHHHHHHH, IIIIIIII
AAAAAAAA,BBB,C,DDD,EEEE,FFFF, GGG, HIIII,JJJJ,KLLLL,MMMM, P
So next apply those functions to the .txt and create a .csv file with the output:
from contextlib import ExitStack
from tqdm.notebook import tqdm

with ExitStack() as stack:
    read_file = stack.enter_context(open('data/bst_all.txt', 'r'))
    write_file = stack.enter_context(open('data/bst_all_clean.txt', 'a'))
    for line in tqdm(read_file.readlines()):
        if ' ' in line[:8]:  # line is header data
            write_file.write(replace(line, header_idxs, ',') + '\n')
        else:                # line is track data
            write_file.write(replace(line, track_idxs, ',') + '\n')
The next task is to add the header data to ALL rows, so that all rows have the same format:
import pandas as pd

header_cols = ['indicator', 'international_id', 'n_tracks', 'cyclone_id', 'international_id_dup',
               'final_flag', 'delta_t_fin', 'name', 'last_revision']
track_cols = ['date', 'indicator', 'grade', 'latitude', 'longitude', 'pressure', 'max_wind_speed',
              'dir_long50', 'long50', 'short50', 'dir_long30', 'long30', 'short30', 'jp_landfall']

data = pd.read_csv('data/bst_all_clean.txt', names=track_cols, skipinitialspace=True)
data.date = data.date.astype('string')

# Get headers. Header rows have variable 'indicator', which is 5 characters long.
headers = data[data.date.apply(len) <= 5]
data[['storm_id', 'records', 'name']] = headers.iloc[:, [1, 2, 7]]

# Rearrange columns; bring identifiers to the first three columns.
cols = list(data.columns[-3:]) + list(data.columns[:-3])
data = data[cols]

# Front-fill NaNs for the header data.
data[['storm_id', 'records', 'name']] = data[['storm_id', 'records', 'name']].fillna(method='pad')

# Delete the now-extraneous header rows.
data = data.drop(headers.index)
And that yields some nicely formatted data, like this:
storm_id records name date indicator grade latitude longitude
15 5102.0 37.0 GEORGIA 51031900 2 2 67.0 1614
16 5102.0 37.0 GEORGIA 51031906 2 2 70.0 1625
17 5102.0 37.0 GEORGIA 51031912 2 2 73.0 1635
I have credit card charge data that has a column containing the description for the charge. I also created a dictionary that contains categories for different charges. For example, I have a category called grocery expenses (value) and regular expressions (Ralphs, Target). I combined my values in a string with the separator |.
I am using the Series.str.contains(pat,case=True,flags=0,na=nan,regex=True) function to see if the string in each index contains my regular expressions.
# libraries needed
# import pandas as pd
# import re
joined_string=['|'.join(value) for value in values]
the_list=joined_string
Example output: the_list=[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK"]
df['Description']='FOOD4LESS 0508 0000FULLERTON CA'
The Dataframe contains a column of different charges on your credit card
for character_sequence in the_list:
    boolean_output = df['Description'].str.contains(character_sequence, regex=True)
For some reason, the code is not going through each character sequence in my list. It only goes through one character sequence, but I need it to go through multiple character sequences.
Since there is no data to compare with, I will just present some dummy data.
import pandas as pd
names = ['Adam','Barry','Chuck','Dennis','Elon','Fridman','George','Harry']
df = pd.DataFrame(names, columns=['Names'])
# Apply regex and save to column: Regex
df['Regex'] = df.Names.str.contains('[ae]', regex=True)
df
Output:
Names Regex
0 Adam True
1 Barry True
2 Chuck False
3 Dennis True
4 Elon False
5 Fridman True
6 George True
7 Harry True
Solution with another Example akin to the Problem
First, your the_list variable is not correct. Assuming it is a typo, I present my solution here. Please note that a regex, or regular expression, when applied to a column of data, essentially means that you are trying to find some patterns. How would you, in the first place, know/check whether your pattern recognition is working fine? Well, you would need a few data points to at least validate the regex results. Since you only provided one line of data, I will make some dummy data here and test whether the regex produces the expected results.
Note: Please check the Data Preparation section to see the data, so you can replicate and test the solution.
import pandas as pd
import re
# Make regex string from the list of target keywords
regex_expression = '|'.join(the_list)
# Make dataframe from the list of descriptions
# --> see under Data section of the solution.
df = pd.DataFrame(descriptions, columns=['Description'])
# Regex search results for a subset of
# target keywords: "Gas|Internet|Water|Electricity,VONS"
df['Regex_A'] = df.Description.str.contains("Gas|Internet|Water|Electricity,VONS", regex=True)
# Regex search result of all target keywords
df['Regex_B'] = df.Description.str.contains(regex_expression, regex=True)
df
Output:
Description Regex_A Regex_B
0 FOOD4LESS 0508 0000FULLERTON CA False True
1 Electricity,VONS 0777 0123FULLERTON NY True True
2 PAVILIONS 1248 9800Ralphs MA False True
3 SPROUTS 9823 0770MARKET#WORK WI False True
4 Internet 0333 1008Water NJ True True
5 Enternet 0444 1008Wager NJ False False
Data Preparation
In a practical scenario, I would assume that for the type of problem you presented in the question, you would have a list of words that you would like to look for in the dataframe column.
So, I took the liberty to first convert your string into a list of strings.
the_list="[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK]"
the_list = the_list.replace("[","").replace("]","").split("|")
the_list
Output:
['Gas',
'Internet',
'Water',
'Electricity,VONS',
'RALPHS',
'Ralphs',
'PAVILIONS',
'FOOD4LESS',
"TRADER JOE'S",
'GROCERY OUTLET',
'FOOD 4 LESS',
'SPROUTS',
'MARKET#WORK']
Also, we make five rows of data in which we have the keywords we are looking for, and then add another row where we expect a False as the result of the regex pattern search.
descriptions = [
'FOOD4LESS 0508 0000FULLERTON CA',
'Electricity,VONS 0777 0123FULLERTON NY',
'PAVILIONS 1248 9800Ralphs MA',
'SPROUTS 9823 0770MARKET#WORK WI',
'Internet 0333 1008Water NJ',
'Enternet 0444 1008Wager NJ',
]
I need to parse some JSON files into a pandas dataframe. I want to have one column with the words present in the text, and another column with the corresponding entity: the entity will be the “Type” of the text below when the “Value” corresponds to the word; otherwise I want to assign the label ‘O’.
Below is an example.
This is the JSON file:
{"Text": "I currently use a Netgear Nighthawk AC1900. I find it reliable.",
"Entities": [
{
"Type": "ORGANIZATION ",
"Value": "Netgear"
},
{
"Type": "DEVICE ",
"Value": "Nighthawk AC1900"
}]
}
Here is what I want to get:
WORD TAG
I O
currently O
use O
a O
Netgear ORGANIZATION
Nighthawk AC1900 DEVICE
. O
I O
find O
it O
reliable O
. O
Can someone help me with the parsing? I can't use split() because sometimes the values consist of two words. Hope this is clear. Thank you!
This is a difficult problem, and the answer will depend on what data isn't in this example and on the output required. Do you have repeating data in the entity values? Is order important? Did you want repetition in the output?
There are a few tools that can be used:
- Make a trie out of the Entity values before you search the string. This is good if you have overlapping versions of the same name, like "Netgear" and "Netgear INC.", and you want the longest version (a rough sketch follows this list).
- nltk.PunktSentenceTokenizer. This one is finicky to work with around the nouns. This tutorial does a better job of explaining how to deal with them.
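For what the trie idea could look like in practice, here is a minimal sketch (TokenTrie and its methods are made-up names for this example, not an existing library; it does longest-match lookup over whitespace-separated tokens):

class TokenTrie:
    def __init__(self):
        self.root = {}

    def add(self, phrase, tag):
        # store the phrase token by token, marking the end with its tag
        node = self.root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node['_tag'] = tag

    def longest_match(self, tokens, start):
        # return (length, tag) of the longest entity starting at tokens[start], or (0, None)
        node, best = self.root, (0, None)
        for i in range(start, len(tokens)):
            if tokens[i] not in node:
                break
            node = node[tokens[i]]
            if '_tag' in node:
                best = (i - start + 1, node['_tag'])
        return best

trie = TokenTrie()
trie.add("Netgear", "ORGANIZATION")
trie.add("Nighthawk AC1900", "DEVICE")

tokens = "I currently use a Netgear Nighthawk AC1900".split()
i = 0
while i < len(tokens):
    length, tag = trie.longest_match(tokens, i)
    if length:
        print(" ".join(tokens[i:i + length]), tag)
        i += length
    else:
        print(tokens[i], "O")
        i += 1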
I don't know if what you need is strictly what you posted as the desired output.
The solution I am giving you is "dirty" (it has extra rows, and the columns are not exactly in the format you asked for).
You can manage to clean it up and put it in the format you need. As you didn't provide a piece of code to start from, you can finish it.
Eventually you will find out that the purpose of Stack Overflow is not to get people to write the code for you, but for people to help you out with the code you are trying to write.
import json
import pandas as pd

# open and read the json
with open('netgear.json', 'r') as jfile:
    data = jfile.read()
    info = json.loads(data)

# json into content
words, tags = info['Text'].split(), info['Entities']

# list to handle the Entities
prelist = []
for i in tags:
    j = list(i.values())
    # ['ORGANIZATION ', 'Netgear']
    # ['DEVICE ', 'Nighthawk AC1900']
    prelist.append(j)

# DataFrames to be merged
dft = pd.DataFrame(prelist, columns=['TAG', 'WORD'])
dfw = pd.DataFrame(words, columns=['WORD'])

# combine the DataFrames and turn NaN into 0
df = dfw.merge(dft, on='WORD', how='outer').fillna(0)
This is the output:
WORD TAG
0 I 0
1 I 0
2 currently 0
3 use 0
4 a 0
5 Netgear ORGANIZATION
6 Nighthawk 0
7 AC1900. 0
8 find 0
9 it 0
10 reliable. 0
11 Nighthawk AC1900 DEVICE
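If you want to push that "dirty" result toward the WORD/TAG layout in the question, one possible cleanup (the replacement of 0 with 'O' is an assumption about the labels you want) is:

# sketch only: tidy the merged frame from the answer above
df['TAG'] = df['TAG'].replace(0, 'O')   # use 'O' instead of 0 for untagged words
df = df[['WORD', 'TAG']]                # keep the columns in the requested order
print(df.to_string(index=False))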