I'm using openpyxl in Python, and I'm trying to run through 50k lines, grab data from each row, and place it into a file. However, I'm finding that it runs incredibly slowly the farther I get into it. The first 1k lines go super fast, less than a minute, but after that it takes longer and longer to do the next 1k lines.
I was opening a .xlsx file. I wonder if it would be faster to open a .txt file as a CSV, or to read a JSON file, or to convert the data somehow into something that reads faster?
I have 20 unique names in a given column, and each name has random values associated with it. I'm trying to build a comma-separated string of all the values for each unique name:
Value1: 1243,345,34,124,
Value2: 1243,345,34,124,
etc, etc
I'm running through the value list and checking whether a file for that name exists. If it does, I open that file and append the new value to it; if it doesn't, I create the file and then set it to append mode. I keep a dictionary of all the opened append-mode file objects, so any time I want to write something, I look the file name up in the dict and write to that file, rather than opening a new file handle every time.
The first 1k rows took less than a minute. Now I'm on the 4k to 5k records, and it has already been running five minutes. It seems to take longer as it goes up in records, and I wonder how to speed it up. It's not printing to the console at all.
writeFile = 1
theDict = {}
for row in ws.iter_rows(rowRange):
    for cell in row:
        # grabbing the value
        theStringValueLocation = "B" + str(counter)
        theValue = ws[theStringValueLocation].value
        theName = cell.value
        textfilename = theName + ".txt"
        if os.path.isfile(textfilename):
            listToAddTo = theDict[theName]
            listToAddTo.write("," + theValue)
            if counter == 1000:
                print "1000"
                st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
        else:
            writeFileName = open(textfilename, 'w')
            writeFileName.write(theValue)
            writeFileName = open(textfilename, 'a')
            theDict[theName] = writeFileName
        counter = counter + 1
I added some timestamps to the above code (they're not shown, but you can see the output below). The problem I'm seeing is that the time goes up higher and higher with each 1k run: 2 minutes the first time, then 3 minutes, then 5 minutes, then 7 minutes. By the time it hits 50k, I'm worried it's going to take an hour or more.
1000
2016-02-25 15:15:08
2000
2016-02-25 15:17:07
3000
2016-02-25 15:20:52
4000
2016-02-25 15:25:28
5000
2016-02-25 15:32:00
6000
2016-02-25 15:40:02
7000
2016-02-25 15:51:34
8000
2016-02-25 16:03:29
9000
2016-02-25 16:18:52
10000
2016-02-25 16:35:30
Some things I should make clear: I don't know the names of the values ahead of time; maybe I should run through and grab those in a separate Python script to make this go faster?
Second, I need a string of all values separated by commas; that's why I put them into a text file to grab later. I was thinking of doing it with a list as was suggested to me, but I'm wondering if that will have the same problem. I think the problem has to do with reading from Excel. If there's another way to get a comma-separated string out of it, I can do it another way.
Or maybe I could do a try/except instead of checking whether the file exists every time, and if there is an error, assume I need to create a new file? Maybe the file-exists lookup on every row is what's making it so slow?
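Something like this sketch is what I mean, with hypothetical names (open_files caches the append-mode handles; name and value come from the current row):
try:
    f = open_files[name]
except KeyError:
    # first time we see this name: create the file and cache the handle
    f = open_files[name] = open(name + ".txt", "a")
f.write("," + value)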
This question is a continuation of my original one here, and I took some suggestions from it: What is the fastest performance tuple for large data sets in python?
I think what you're trying to do is get a key out of column B of the row, and use that for the filename to append to. Let's speed it up a lot:
from collections import defaultdict

Value_entries = defaultdict(list)  # dict of lists of row data

for row in ws.iter_rows(rowRange):
    key = row[1].value
    Value_entries[key].extend([cell.value for cell in row])

# All done. Now write files:
for key in Value_entries.keys():
    with open(key + '.txt', 'w') as f:
        # coerce to str in case the cells hold numbers
        f.write(','.join(str(v) for v in Value_entries[key]))
It looks like you only want cells from the B-column. In this case you can use ws.get_squared_range() to restrict the number of cells to look at.
for row in ws.get_squared_range(min_col=2, max_col=2, min_row=1, max_row=ws.max_row):
    for cell in row:  # each row is always a sequence
        filename = cell.value
        if os.path.isfile(filename):
            …
It's not clear what's happening with the else branch of your code but you should probably be closing any files you open as soon as you have finished with them.
Based on the other question you linked to, and the code above, it appears you have a spreadsheet of name - value pairs. The name is in column A and the value is in column B. A name can appear multiple times in column A, and there can be a different value in column B each time. The goal is to create a list of all the values that show up for each name.
First, a few observations on the code above:
counter is never initialized. Presumably it is initialized to 1.
open(textfilename,...) is called twice without closing the file in between. Calling open allocates some memory to hold data related to operating on the file. The memory allocated for the first open call may not get freed until much later, maybe not until the program ends. It is better practice to close files when you are done with them (see using open as a context manager).
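For example, a with-block closes the file for you when the block ends (shown here with the names from the code above):
with open(textfilename, 'a') as f:
    # the file is closed automatically when this block exits,
    # even if an exception is raised inside it
    f.write(theValue)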
The looping logic isn't correct. Consider:
First iteration of inner loop:
for cell in row:                        # cell refers to A1
    valueLocation = "B" + str(counter)  # valueLocation is "B1"
    value = ws[valueLocation].value     # value gets contents of cell B1
    name = cell.value                   # name gets contents of cell A1
    textfilename = name + ".txt"
    ...
    # opens file with name based on contents of cell A1, and
    # writes value from cell B1 to the file
    ...
    counter = counter + 1               # counter = 2
But each row has at least two cells, so on the second iteration of the inner loop:
for cell in row:                        # cell now refers to cell B1
    valueLocation = "B" + str(counter)  # valueLocation is "B2"
    value = ws[valueLocation].value     # value gets contents of cell B2
    name = cell.value                   # name gets contents of cell B1
    textfilename = name + ".txt"
    ...
    # opens file with name based on contents of cell B1   <<<< wrong file
    # writes the value of cell B2 to the file             <<<< wrong value
    ...
    counter = counter + 1               # counter = 3 when cell B1 is processed
Repeat for each of the 50K rows. Depending on how many unique values are in column B, the program could end up holding hundreds or thousands of open files (named after the contents of cells A1, B1, A2, B2, ...), which makes it very slow or crashes the program.
iter_rows() returns a tuple of the cells in the row.
As people suggested in the other question, use a dictionary and lists to store the values and write them all out at the end, like so (I'm using Python 3.5, so you may have to adjust this if you are using 2.7). Here is a straightforward solution:
from collections import defaultdict

data = defaultdict(list)

# gather the values into lists associated with each name
# data will look like { 'name1':['value1', 'value42', ...],
#                       'name2':['value7', 'value23', ...],
#                       ... }
for row in ws.iter_rows():
    name = row[0].value
    value = row[1].value
    data[name].append(value)

for key, valuelist in data.items():
    # turn the list of values into one long comma-separated string,
    # e.g., ['value1', 'value42', ...] => 'value1,value42,...'
    # (str() in case the cells hold numbers rather than strings)
    value = ",".join(str(v) for v in valuelist)
    with open(key + ".txt", "w") as f:
        f.write(value)
I've written code to search for relevant cells in an Excel file. However, it does not work as well as I had hoped.
In pseudocode, this is it what it should do:
Ask for input excel file
Ask for input textfile containing keywords to search for
Convert input textfile to list containing keywords
For each keyword in list, scan the excelfile
If the keyword is found within a cell, write it into a new excelfile
Repeat with next word
The code works, but some keywords are not found even though they are present in the input Excel file. I think it might have something to do with the way I iterate over the list, since when I provide a single keyword to search for, it works correctly. This is my whole code: https://pastebin.com/euZzN3T3
This is the part I suspect is not working correctly. Splitting the textfile into a list works fine (I think).
#IF TEXTFILE
elif btext == True:
    #Split each line of textfile into a list
    file = open(txtfile, 'r')
    #Keywords in list
    for line in file:
        keywordlist = file.read().splitlines()
    nkeywords = len(keywordlist)
    print(keywordlist)
    print(nkeywords)
    #Iterate over each string in list, look for match in .xlsx file
    for i in range(1, nkeywords):
        nfound = 0
        ws_matches.cell(row = 1, column = i).value = str.lower(keywordlist[i-1])
        for j in range(1, worksheet.max_row + 1):
            cursor = worksheet.cell(row = j, column = c)
            cellcontent = str.lower(cursor.value)
            if match(keywordlist[i-1], cellcontent) == True:
                ws_matches.cell(row = 2 + nfound, column = i).value = cellcontent
                nfound = nfound + 1
and my match() function:
def match(keyword, content):
    """Check if the keyword is present within the cell content; return True if found, else False."""
    if content.find(keyword) == -1:
        return False
    else:
        return True
I'm new to Python, so my apologies if the way I code looks like a warzone. Can someone help me see what I'm doing wrong (or could be doing better)? Thank you for taking the time!
Splitting the textfile into a list works fine (I think).
This is something you should actually test (hint: it does but is inelegant). The best way to make easily testable code is to isolate functional units into separate functions, i.e. you could make a function that takes the name of a text file and returns a list of keywords. Then you can easily check if that bit of code works on its own. A more pythonic way to read lines from a file (which is what you do, assuming one word per line) is as follows:
with open(filename) as f:
    keywords = f.readlines()
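For instance, that isolated unit could look like the sketch below (read_keywords is a hypothetical name; it also strips the trailing newlines that readlines() keeps and lowercases up front, which matters when comparing against cell contents):
def read_keywords(filename):
    """Return one lowercased keyword per non-blank line of the file."""
    with open(filename) as f:
        return [line.strip().lower() for line in f if line.strip()]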
The rest of your code may actually work better than you expect. I'm not able to test it right now (and don't have your spreadsheet to try it on anyway), but if you're relying on nfound to give you an accurate count for all keywords, you've made a small but significant mistake: it's set to zero inside the loop, and thus you only get a count for the last keyword. Move nfound = 0 outside the loop.
In Python, the way to iterate over lists - or just about anything - is not to increment an integer and then use that integer to index the value in the list. Rather loop over the list (or other iterable) itself:
for keyword in keywordlist:
    ...
As a hint, you shouldn't need nkeywords at all.
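If you still need a column index for ws_matches, enumerate() gives it to you without a manual counter (a sketch reusing the names from your code):
# enumerate() yields the 1-based column index and the keyword together
for i, keyword in enumerate(keywordlist, start=1):
    ws_matches.cell(row=1, column=i).value = keyword.lower()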
I hope this gets you on the right track. When asking questions in future, it'd be a great help to provide more information about what goes wrong, and preferably enough to be able to reproduce the error.
Simple question: I've got the code below. I want to fetch each row with DictReader from the csv package, cast every entry to float, and put it in the data array. At the end of the scan I want to print the first 10 elements of the array. It gives me a visibility (scope) error on the data array:
with open(train, "r") as traincsv:
    trainreader = csv.DictReader(traincsv)
    for row in trainreader:
        data = [float(row['Sales'])]
print(data[:10])
If I put the print inside the for, like this:
with open(train, "r") as traincsv:
    trainreader = csv.DictReader(traincsv)
    for row in trainreader:
        data = [float(row['Sales'])]
        print(data[:10])
It prints all the entries, not just 10.
You are overwriting data every time in the for loop. This is the source of your problem.
Please upload an example input for me to try and I will, but I believe what is below will fix your problem, by appending to data instead of overwriting it.
Also, it is good practice to leave the with block as soon as possible.
# Open with block and leave immediately
with open(train, "r") as traincsv:
    trainreader = csv.DictReader(traincsv)
    # Declare data as a blank list before iterating
    data = []
    # Iterate through all of trainreader, appending instead of overwriting
    for row in trainreader:
        data.append(float(row['Sales']))
# Now it should print the first 10 results
print(data[:10])
Ways of appending to a list:
data = data + [float(row['Sales'])]
data += [float(row['Sales'])]
data.append(float(row['Sales']))
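(Note the difference: the first two wrap the value in a one-element list and concatenate, while append adds the single value directly. For one value per row, append is the usual choice.)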
I wrote a code that takes in some data, and I end up with a csv file that looks like the following:
1,Steak,Martins
2,Fish,Martins
2,Steak,Johnsons
4,Veggie,Smiths
3,Chicken,Johnsons
1,Veggie,Johnsons
where the first column is a quantity, the second column is the type of item (in this case the meal), and the third column is an identifier (in this case it is family name). I need to print this information to a text file in a specific way:
Martins
1 Steak
2 Fish
Johnsons
2 Steak
3 Chicken
1 Veggie
Smiths
4 Veggie
So what I want is the family name followed by what that family ordered. I wrote the following code to accomplish this, but it doesn't seem to be quite there.
import csv

orders = "orders.txt"
messy_orders = "needs_sorting.csv"

with open(messy_orders, 'rb') as orders_for_sorting, open(orders, 'a') as final_orders_file:
    comp = []
    reader_sorting = csv.reader(orders_for_sorting)
    for row in reader_sorting:
        test_bit = [row[2]]
        if test_bit not in comp:
            comp.append(test_bit)
            final_orders_file.write(row[2])
            for row in reader_sorting:
                if [row[2]] == test_bit:
                    final_orders_file.write(row[0], row[1])
        else:
            print "already here"
            continue
What I end up with is the following
Martins
2 Fish
Additionally, I never see it print "already here", though I think I should if it were working properly. What I suspect is happening is that the program goes through the second for loop and then exits without continuing the first loop. Unfortunately, I'm not sure how to make it go back to the original loop once it has identified and printed all instances of a given family name. The reason I have it set up this way is so that I can get the family name written as a header; otherwise I would just sort the file by family name. Please note that after running the orders through my first program, I did manage to sort everything such that each row represents the complete quantity of that type of food for that family (there are no recurring rows containing both Steak and Martins).
This is a problem that you solve with a dictionary, which will accumulate your items keyed by the last name (family name) from your file.
The second thing you have to do is accumulate a total of each type of meal - keeping in mind that the data you are reading is a string, and not an integer that you can add, so you'll have to do some conversion.
To put all that together, try this snippet:
import csv

d = dict()

with open(r'd:/file.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        # if the family name doesn't exist in our dictionary,
        # set it with a default value of a blank dictionary
        if row[2] not in d:
            d[row[2]] = dict()
        # If the meal type doesn't exist for this family, set it up
        # as a key in their dictionary with the int value of the count
        if row[1] not in d[row[2]]:
            d[row[2]][row[1]] = int(row[0])
        else:
            # Both the family and the meal already exist in the
            # dictionary, so just add the count to the total
            d[row[2]][row[1]] += int(row[0])
Once you run through that loop, d looks like this:
{'Johnsons': {'Chicken': 3, 'Steak': 2, 'Veggie': 1},
'Martins': {'Fish': 2, 'Steak': 1},
'Smiths': {'Veggie': 4}}
Now it's just a matter of printing it out:
for family, data in d.iteritems():
    print('{}'.format(family))
    for meal, total in data.iteritems():
        print('{} {}'.format(total, meal))
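(If you are on Python 3, iteritems() no longer exists; use d.items() and data.items() instead.)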
At the end of the loop, you'll have:
Johnsons
3 Chicken
2 Steak
1 Veggie
Smiths
4 Veggie
Martins
2 Fish
1 Steak
You can later improve this snippet by using defaultdict.
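For instance, a nested defaultdict removes both existence checks (a sketch, assuming the same three-column CSV as above):
import csv
from collections import defaultdict

# missing families get a fresh inner dict automatically,
# and missing meals default to 0, so no "in" checks are needed
d = defaultdict(lambda: defaultdict(int))

with open(r'd:/file.csv') as f:
    reader = csv.reader(f)
    for count, meal, family in reader:
        d[family][meal] += int(count)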
First-time replier, so here's a go. Have you considered keeping track of the orders first and then writing to a file? I tried a dict-based approach and it seems to work fine: the idea is to index by the family name and store a list of pairs containing the order quantities and types.
You may also want to consider the readability of your code - it's hard to follow and debug. However, what I think is happening is the line
for row in reader_sorting:
iterates through reader_sorting. You read the first row, extract the family name, and later iterate again over reader_sorting. This time you start at the second row, whose family name matches, and you print it successfully. The rest of the rows don't match, but you still iterate through them all. By then reader_sorting is exhausted and the inner loop finishes, even though the outer loop has read only one row.
One solution may be to create another iterator in the outer for loop, so the iterator that loop consumes isn't expended. However, you would then still need to deal with the possibility of double counting, or keep track of indices. Another way is to keep track of the orders by family as you iterate:
import csv

orders = {}

with open('needs_sorting.csv') as file:
    needs_sorting = csv.reader(file)
    for amount, meal, family in needs_sorting:
        if family not in orders:
            orders[family] = []
        orders[family].append((amount, meal))

with open('orders.txt', 'a') as file:
    for family in orders:
        file.write('%s\n' % family)
        for amount, meal in orders[family]:
            file.write('%s %s\n' % (amount, meal))
I’m really new to Python but find myself working on the travelling salesman problem with multiple drivers. Currently I handle the routes as a list of lists but I’m having trouble getting the results out in a suitable .txt format. Each sub-list represents the locations for a driver to visit, which corresponds to a separate list of lat/long tuples. Something like:
driver_routes = [[0,5,3,0],[0,1,4,2,0]]
lat_long =[(lat0,long0),(lat1,long1)...(latn,longn)]
What I would like is a separate .txt file (named “Driver(n)”) that lists the lat/long pairs for that driver to visit.
When I was just working with a single driver, the following code worked fine for me:
optimised_locs = open('Optimisedroute.txt', 'w')
for x in driver_routes:
    to_write = ','.join(map(str, lat_long[x]))
    optimised_locs.write(to_write)
    optimised_locs.write("\n")
optimised_locs.close()
So, I took the automated file naming code from Chris Gregg here (Printing out elements of list into separate text files in python) and tried to make an iterating loop for sublists:
num_drivers = 2
p = 0
while p < num_drivers:
    for x in driver_routes[p]:
        f = open("Driver"+str(p)+".txt","w")
        to_write = ','.join(map(str, lat_long[x]))
        print to_write # for testing
        f.write(to_write)
        f.write("\n")
        f.close()
    print "break" # for testing
    p += 1
The output on my screen looks exactly how I would expect it to look and I generate .txt files with the correct name. However, I just get one tuple printed to each file, not the list that I expect. It’s probably very simple but I can't see why the while loop causes this issue. I would appreciate any suggestions and thank you in advance.
You're overwriting the contents of the file f on every iteration of your for loop because you're re-opening it. You just need to modify your code as follows to open the file once per driver:
while p < num_drivers:
    f = open("Driver"+str(p)+".txt","w")
    for x in driver_routes[p]:
        to_write = ','.join(map(str, lat_long[x]))
        print to_write # for testing
        f.write(to_write)
        f.write("\n")
    f.close()
    p += 1
Note that opening f is moved to outside the for loop.
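If you also want the close to happen automatically, even when an error is raised mid-route, a with-statement version of the same fix might look like this (a sketch, not tested against your data):
for p in range(num_drivers):
    # one file per driver, closed automatically at the end of the block
    with open("Driver" + str(p) + ".txt", "w") as f:
        for x in driver_routes[p]:
            f.write(','.join(map(str, lat_long[x])) + "\n")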
There is an error in my code:
_csv.Error: sequence expected
which I believe is because I am trying to write a single value, not a list.
exRtFile = open('exchangeRate.csv')
exchReader = csv.reader(exRtFile)
exchWriter = csv.writer(exRtFile)

loop2 = 0
while loop2 == 0:
    selected = int(input("Please select an option: "))
    if selected == 1:
        change = input("What rate would you like to change: ")
        changeRt = float(input("What would you like to change the rate to: "))
        for row in exchReader:
            currency = row[0]
            if currency == change:
                crntRt = row[1]
                crntRt = changeRt
                exchWriter.writerow(crntRt)
exRtFile.close()
What would be the best way to fix this, or is there a better way to change a value in a CSV file?
csv file:
Pound Sterling,1
Euro,1.22
US Dollar,1.67
Japanese Yen,169.948
Here is some code, not tested, that will do what you want. The idea is to read the text into memory, apply the updates, then write out the results over the original file.
You can further enhance this by asking the user if they want to save their changes, and by adding new currencies instead of just telling the user they're not known.
In the real world, I would break this code into three separate functions (or even classes), one for reading, one for writing, and one for editing the list.
import csv

rates = {}

# read file into dictionary
with open('csv_file.csv', 'r') as in_file:
    rdr = csv.reader(in_file)
    for row in rdr:
        rates[row[0]] = row[1]

# ask user for updates and apply them to the dictionary
while True:
    cmd = raw_input('Enter exchange rate to adjust, or blank to exit: ')
    if cmd is None or cmd.strip() == '':
        break
    if rates.has_key(cmd):
        new_rate = float(raw_input('Enter new exchange rate: '))
        rates[cmd] = new_rate
    else:
        print 'Currency {} is not known.'.format(cmd)

# Write the updated dictionary back over the same file.
with open('csv_file.csv', 'w') as out_file:
    wrtr = csv.writer(out_file)
    wrtr.writerows(rates.items())
Answering your question: Yes, the problem is that you were trying to write only a value, while writerow expects a list.
That said... Would you consider changing a bit the way your code works?
Here's what I've done (I've tested it now, so I know it works):
First, ask the user for all the changes to make and keep them in a dict where the keys are the currency names (Euro, for instance) and the values are the new rates (5.0, for instance). The user can get out of the loop by pressing 0.
Second, open and read your exchangeRate.csv file row by row. If row[0] (the name of the currency) is among the values to change, then change it in that row.
No matter what happens (regardless of whether the row needed to be changed or not), write that row to a new temporary file, exchangeRate.csv.tmp.
When all the rows in the original file have been read, you'll have exchangeRate.csv.tmp with some rows unchanged and some rows changed. Swap (move) the .tmp file over exchangeRate.csv.
Dunno... might be too much change maybe? Here it is, anyway:
import csv
import shutil

change_rates = {}
selected = 1

while selected:
    selected = int(raw_input("Please select an option: (1 to change, 0 to exit)"))
    if selected == 1:
        change = raw_input("What rate would you like to change?: ")
        changeRt = float(raw_input("What would you like to change the rate to: "))
        change_rates[change] = changeRt

if len(change_rates) > 0:
    with open('exchangeRate.csv', 'r') as f_in,\
            open('exchangeRate.csv.tmp', 'w') as f_out:
        exchReader = csv.reader(f_in)
        exchWriter = csv.writer(f_out)
        for row in exchReader:
            if row[0] in change_rates:
                row[1] = change_rates[row[0]]
            exchWriter.writerow(row)
    shutil.move('exchangeRate.csv.tmp', 'exchangeRate.csv')
And a sample execution below:
Please select an option: (1 to change, 0 to exit)1
What rate would you like to change?: Euro
What would you like to change the rate to: 5
Please select an option: (1 to change, 0 to exit)0
borrajax@borrajax:~/Documents/Tests$ cat ./exchangeRate.csv
Pound Sterling,1
Euro,5.0
US Dollar,1.67
Japanese Yen,169.948
You can always make more optimizations, such as allowing case-insensitive searches, or checking that the currency would actually change (even if the user asks to set the Euro to 5.0, if that is already the Euro's rate, don't do anything)... Things like that.
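The case-insensitive idea could be sketched like this, reusing the names from the snippet above (this assumes you normalize the user's entries once before the copy loop):
# normalize the user's currency names once...
change_rates = {name.lower(): rate for name, rate in change_rates.items()}

# ...then match on a lowercased key while copying rows
for row in exchReader:
    if row[0].lower() in change_rates:
        row[1] = change_rates[row[0].lower()]
    exchWriter.writerow(row)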
EDIT 1:
I've just seen Larry Lustig's answer and I agree that for small files as it seems to be your case (files that you can fully load in memory) the continuous reading and writing from disk I posted is not optimal. His idea of keeping everything in memory and then do a bulk write to the same exchangeRate.csv file probably is a better fit for your needs.
EDIT 2:
To answer your questions in a comment to this answer:
what does .tmp do at the end of: exchangeRate.csv.tmp:
It's just a new name. I add the suffix .tmp to avoid a naming conflict with your original file (exchangeRate.csv). You could name it whatever you want, though (even foobar.baz)
What is the purpose of 'change' in the variable: change_rates[change] = changeRt:
change is a variable that contains the name of the currency to change. In the usage example I posted, change contains the string "Euro", because that's what the user (erm... me) typed on the console. It's just a way of accessing a dict.
What is the purpose of '[row[0]]' in: row[1] = change_rates[row[0]]?
We agreed that when reading the file, row[0] (just like that, not [row[0]]) contains the name of the currency in the file (Euro, Pound Sterling... etcetera), right? So at a certain point of the execution row[0] will contain the string "Euro", which (in my test example) is the currency the user wanted to change. That string ("Euro") is also a key in the change_rates dictionary (because the user said he wanted to change it), so you are querying the value for the item with key "Euro" in the change_rates dictionary (which will give you 5.0). It's pretty much doing change_rates["Euro"]. To see it a bit more clearly, add the line print "Currencies to change: %s" % change_rates right above if len(change_rates) > 0: (that'll show you what the dictionary looks like).
what does shutil.move('exchangeRate.csv.tmp', 'exchangeRate.csv') do?
It moves (renames) the file with the new currencies to exchangeRate.csv (see the shutil documentation).