Using Python openpyxl to write to an Excel spreadsheet (string searches)

The following is my code. I want it to read the Excel spreadsheet and use the data in the Warehouse column (i.e. search for substrings in that column's cells) to map and write a specific string to the corresponding cell in the next column, called GeneralDescription. My spreadsheet has over 50000 rows. This snippet of code works for classifying two GeneralDescriptions at this point; eventually I want to be able to easily scale this to cover all possible warehouses. The only thing that is not working, and that I need specific help with, is that when the string "WORLD WIDE DATA" appears in the Warehouse column, the code does not recognize it. I am assuming this is because it is all upper case. However, if the string "HUMANRESOURCES Toronto" appears in the Warehouse column, this code works correctly and writes "HumanResources" to the GeneralDescription column. It also recognizes "WWD" and "wwd" and correctly writes "World Wide Data" to the GeneralDescription column. I don't understand why just that one particular string is not being recognized, unless it has something to do with whitespace. Also, in the original spreadsheet there are some integers that identify warehouses. If I don't delete these, I am unable to iterate through those rows, but I need to keep those numbers in there. Any ideas on how I can make this work? Any help is much appreciated.
import openpyxl
import re
wb = openpyxl.load_workbook(filename="Trial_python.xlsx")
ws= wb.worksheets[0]
sheet = wb.active
for i in range(2, 94000):
    if(sheet.cell(row=i, column=6).value != None):
        if(sheet.cell(row=i, column=6).value.lower() == "world wide data"):
            sheet.cell(row=i, column=7).value = "World Wide Data"
        for j in re.findall(r"[\w']+", sheet.cell(row=i, column=6).value):
            if(j.lower() == "wwd" or j.lower() == "world wide data"):
                sheet.cell(row=i, column=7).value = "World Wide Data"
            if(j.lower() == "humanresources"):
                sheet.cell(row=i, column=7).value = "HumanResources"
wb.save(filename="Trial_python.xlsx")

I'd recommend creating an empty list and, as you iterate through the column, storing each of the values in it with .append(); that should help your code scale a bit better, although I'm sure there are other, more efficient solutions.
I'd also recommend using is rather than == when checking against None, since that is an identity check; keep == for ordinary value comparisons such as strings. This link goes into detail about the differences: https://dbader.org/blog/difference-between-is-and-equals-in-python
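A quick sketch of the difference (my example, not from the original answer): == compares values, while is compares object identity, which is why is not None is the idiomatic spelling of the empty-cell check.

```python
a = [1, 2, 3]
b = [1, 2, 3]

print(a == b)     # True  -- same contents (value equality)
print(a is b)     # False -- two distinct list objects (identity)
print(a is None)  # False -- `is` is the idiomatic way to test for None
```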
So your code should look like this:
...
business_list = ['world wide data', 'other_businesses', 'etc']
for i in range(2, 94000):
    if(sheet.cell(row=i, column=6).value is not None):
        if(sheet.cell(row=i, column=6).value.lower() in business_list):
            sheet.cell(row=i, column=7).value = "World Wide Data"
...
Hope that helps
Edit to answer comments below
So, to answer your question in comment 2: the business_list = [...] that we created will store anything that you want to check for, i.e. if WWD, World Wide Data, 2467, etc. appear, you can check through this list, and if a match is found (using the in operator) you can then write whatever you like into column 7 (final line of code).
If you want Machine Operations or HumanResources or any of these other strings to appear, there are a couple of ways to accomplish this. A simple way is to write a check for them like so:
...
business_list = ['world wide data', 'other_businesses', '2467',
                 'central operations', 'humanresources']
for i in range(2, 50000):
    if(sheet.cell(row=i, column=6).value is not None):
        value = str(sheet.cell(row=i, column=6).value).lower()
        if(value in business_list):
            if value == "humanresources":
                sheet.cell(row=i, column=7).value = "HumanResources"
            elif value == "machine operations":
                sheet.cell(row=i, column=7).value = "Machine Operations"
            else:
                sheet.cell(row=i, column=7).value = "World Wide Data"
...
So, to explain what is happening here: a list is created with the values that you want to check for, called business_list. You then iterate through your rows, checking that each cell is not empty with is not None. From there you do an initial check to see whether the lowercased cell value is something you even want to handle (in business_list), and if it is, you compare that value to decide what to write into the cell in column 7.
This ensures that you aren't checking for something that might not be there, by checking the list first; the values you suggested map one for one, i.e. HumanResources for humanresources, Machine Operations for machine operations.
As for scaling, it should be easy to add new checks: add the new company name to the list, then a two-line statement of "if this, then cell = this".
I use a similar system for a sheet that is roughly 1.2m entries and performance is still fast enough for production, although I don't know how complex yours is. There may be more efficient ways of doing it, but this system is simple to maintain in the future as well. Hopefully this makes a bit more sense; let me know if not and I'll help if possible.
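To sketch how this could scale further (the keyword-to-description mapping below is illustrative, not the asker's full data): a dict plus a substring check also fixes the "WORLD WIDE DATA" case, since the cell no longer has to equal the keyword exactly, and str() guards against the integer warehouse IDs mentioned in the question.

```python
# Map lowercase keywords to the GeneralDescription that should be written.
# Extend this dict to cover more warehouses.
KEYWORD_TO_DESC = {
    "world wide data": "World Wide Data",
    "wwd": "World Wide Data",
    "humanresources": "HumanResources",
    "machine operations": "Machine Operations",
}

def classify(warehouse_value):
    """Return the description for a Warehouse cell value, or None if no keyword matches."""
    text = str(warehouse_value).lower()   # str() tolerates integer warehouse IDs
    for keyword, desc in KEYWORD_TO_DESC.items():
        if keyword in text:               # substring test, so extra words don't break it
            return desc
    return None

print(classify("WORLD WIDE DATA"))         # World Wide Data
print(classify("HUMANRESOURCES Toronto"))  # HumanResources
print(classify(2467))                      # None (integer rows no longer crash)
```

Inside the openpyxl loop, classify(sheet.cell(row=i, column=6).value) would replace the per-keyword if statements; adding a warehouse then means adding one dict entry. Note that very short keywords like "wwd" can false-positive as substrings of longer words, so choose keywords with care.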
EDIT: As for your last comment, I wouldn't assume something like that without doing a check since it could lead to false positives!


Finding rows with specific words in an excel sheet [closed]

I have a table similar to this format:
(table shown as an image in the original post)
My table has 50000 rows and 800 columns; certain (tab-separated) cells contain multiple comma-separated words (e.g. L,N). I want to retain only the rows that contain one of a set of words (say A to N) in a given column (say col2) and remove the remaining rows.
Is it possible to do this with Vlookup, or is there another way? Any suggestion is appreciated.
Create a helper column, perhaps on the extreme right of a copy of your worksheet. Enter this formula in row 2 of that column. Modify the formula to replace column C with the ID of the helper column (the one in which you write the formula) and replace B with the column in which to find the word. Copy the formula all the way down.
=ISERROR(FIND(C$1,$B2))
Now enter the word you want to keep in cell(1) of the helper column (C$1 in my example). The column will fill with TRUE and FALSE.
TRUE means that the word wasn't found and the row should be deleted
FALSE means that the word exists and should be kept
Now sort the sheet on that column and delete the block with TRUE in it. 15 seconds flat, after you have done it for a few times. That's faster than any VBA or r or Python solution could be made to run.
The biggest issue I had was in conveying what I did, how, and why; it turned out there was no need to, so I deleted that explanation.
So, select any part, section, or range of your table and run the code.
You can remove the section of the code below 'Clear any existing filters' if you don't need to delete the found data and want to keep it in your sheet, still tagged for additional purposes (e.g. you could copy those rows to another file if you want).
The code below should do what you asked in your question and leave you with just the parts of the table that DON'T contain, in your selection, any of your terms, i.e. it will delete rows and move the remaining ones up according to your criteria.
And yes, Python and R can do this for you too, with less code (with dataframes in Python, for example). But this VBA code worked for my many examples. I didn't know how it would fare with 50000 rows and X columns, but it turns out it works fine.
Sub SearchTableorSelectionDeletetermsFound5()
    Dim corresspondingpartner() As Variant
    Dim rng As Range: Set rng = Selection
    Dim col As Range
    For Each col In rng.Columns
        Dim r As Range
        Dim rn As Variant
        Dim Rownum As Long
        Rownum = Selection.Rows.Count
        ReDim rm(0 To Rownum) As Variant 'size of this array needs to be equal to or bigger than your selection
        terms = Sheets("Sheet2").Cells(1, 1).CurrentRegion
        k = 1
        For rw = 0 To UBound(terms)
            ReDim Preserve corresspondingpartner(rw)
            corresspondingpartner(rw) = (k / k) 'gives each corresspondingpartner element an id of 1
            k = k + 1
        Next
        For Each r In Selection
            n = 0
            m = n
            For Each c In terms
                'Checks for each term in turn in the terms column. If it finds one,
                'it writes the corresponding partner id in that row of the tag
                'column (originally column O, now column ADU).
                If r.Offset(0, 0).Value Like "*" & c & "*" Then
                    rm(n) = corresspondingpartner(n) 'any value that the AutoFilter section
                    'below looks for when deleting would do here; it could simply have been = 1
                    Cells(r.Row, 801).Value = rm(n) / rm(n) 'tag the row in column 801 (ADU)
                End If
                n = n + 1
            Next
        Next
    Next col
    'Clear any existing filters
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
    '1. Apply filter
    ActiveSheet.Range("A1:ADU5000").AutoFilter Field:=801, Criteria1:=corresspondingpartner(n) / corresspondingpartner(n)
    '2. Delete rows
    Application.DisplayAlerts = False
    ActiveSheet.Range("A1:ADU5000").SpecialCells(xlCellTypeVisible).Delete
    Application.DisplayAlerts = True
    '3. Clear filter
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
End Sub
You might be able to see that in the beginning I was printing offset-column results from the table/selection, which took up unnecessary space and also employed a VBA Application.WorksheetFunction.HLookup to produce a final result column tagging the rows to delete, but that turned out to be unnecessary. Those earlier versions/macros worked too, but were slower, so I did it without helper columns, using arrays.
I turned to my friend, [Excel Campus - delete rows based on conditions][1], to embed and tweak the AutoFilter code at the end, which deletes the rows you don't want so you don't have to do it yourself.
It's now a "virtual" HLookup on the array matches in your selection (or data): it deletes all rows that match your specification, leaving you with the data you want.
I know, and have a huge hunch, that it could be further improved and streamlined (starting with the way I produce the arrays), but I'm happy with its functionality and potential scope for now.
[1]: https://www.youtube.com/watch?v=UwNcSZrzd_w&t=6s
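For comparison, the Python/dataframe route mentioned above could look roughly like this (a sketch; the column names and keep-list are illustrative, and a real sheet would be loaded with pd.read_excel instead of built inline):

```python
import pandas as pd

# Illustrative frame: col2 cells can hold several comma-separated words,
# as in the question (e.g. "L,N").
df = pd.DataFrame({
    "col1": ["row1", "row2", "row3"],
    "col2": ["L,N", "P,Q", "A"],
})

keep_words = {"A", "L", "N"}  # the words whose rows should survive

# Keep a row if any comma-separated word in col2 is in keep_words.
mask = df["col2"].str.split(",").apply(
    lambda words: any(w.strip() in keep_words for w in words)
)
filtered = df[mask]
print(filtered)
```

The same mask, negated with ~, would select the rows to drop instead, mirroring the delete step of the VBA macro.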

Get dummy variables from a string column full of mess

I'm a less-than-a-week beginner in Python and data science, so please forgive me if these questions seem obvious.
I've scraped data on a website, but the result is unfortunately not very well formatted and I can't use it without transformation.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy colum "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create an other dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match test with regular expressions and combine it with the pd.get_dummies method.
import re
import pandas as pd
import re
import pandas as pd

def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding out how to set pd.get_dummies's arguments to apply the test I would like on the column.
I was even wondering whether this is the best strategy, or whether it wouldn't be easier to create other parallel columns and apply a match.group() on my messy strings to populate them.
Not sure I would know how to program that anyway.
Thanks for your help
I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to string does not contain.
na=False means that if your messy column contains any null values, these will not cause an error; they will just be assumed not to meet the condition.
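The two .loc assignments per pattern can also be collapsed into one line each: str.contains already returns a boolean Series, and astype(int) turns it into the 0/1 dummy directly. A sketch with an illustrative column name:

```python
import pandas as pd

# Illustrative rows standing in for the scraped feature strings.
df = pd.DataFrame({"features": [
    "jantes aluBluetoothfermeture electrique",
    "ABS climatisation",
    None,
]})

for pattern, col in [("bluetooth", "Bluetooth"), ("climatisation", "Climatisation")]:
    # case=False makes the match case-insensitive; na=False maps nulls to 0.
    df[col] = df["features"].str.contains(pattern, case=False, na=False).astype(int)

print(df[["Bluetooth", "Climatisation"]])
```

This scales to the 5 or 6 patterns of interest by extending the list of (pattern, column) pairs.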

How do you print out lines from a file that match a certain condition, but you have many columns to check?

name played wins loses
Leeroy,19,7,12
Jenkins,19,8,11
Tyler,19,0,19
Napoleon Wilson,19,7,12
Big Boss,19,7,12
Game Dude,19,5,14
Macho Man,19,3,16
Space Pirate,19,6,13
Billy Casper,19,7,12
Otacon,19,7,12
Big Brother,19,7,12
Ingsoc,19,5,14
Ripley,19,5,14
M'lady,19,4,15
Einstein100,19,8,11
Dennis,19,5,14
Esports,19,8,11
RNGesus,19,7,12
Kes,19,9,10
Magnitude,19,6,13
Basically, this is a file called firesideResults, which I open in my code and have to check through. If the win column contains a 0 I do not print the row; if it contains a number other than zero I display it on the screen. However, I have multiple columns of numbers to deal with and I can't find out how to check only one column.
My code was going to be:
if option == ("C") or option == ("c"):
    answer = False
    file_3 = open("firesideResults.txt")
    for column in file_3:
        if ("0" not in column):
            print(column)
But unfortunately, one of the other columns also contains a 0, so I cannot do that. Thank you for your help, and if possible please list any questions I could check for help, as I have been searching for so long.
Since you have comma-separated fields, the best way would be to use csv!
import csv
with open("firesideResults.txt") as file_3:
    cr = csv.reader(file_3)
    for row in cr:
        if row[2] != "0":
            print(row)
If the third column of a row is not "0", the row is printed.
There is no substring issue: it checks the exact field.
It checks the "wins" column only, not the other ones.
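A small variation (my sketch, not part of the answer): comparing the wins field as an integer rather than against the string "0" also covers oddly formatted values such as "00" or " 0".

```python
import csv
import io

# Stand-in for the contents of firesideResults.txt.
data = """Leeroy,19,7,12
Tyler,19,0,19
Kes,19,9,10"""

winners = []
for row in csv.reader(io.StringIO(data)):
    if int(row[2]) != 0:      # wins column, compared numerically
        winners.append(row[0])
print(winners)  # ['Leeroy', 'Kes']
```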

Key word search just in one column of the file and keeping 2 words before and after key word

I love Python, and I am new to it as well. Here, with the help of the community (users like Antti Haapala), I was able to proceed to some extent, but I got stuck at the end. Please help. I have two tasks remaining before I get into my big-data POC (I'm planning to use this code on 1+ million records in a text file):
• Search for a keyword in column C#3 and keep the 2 words before and after that keyword.
• Divert the print output to a file.
• Here I don't want to touch C#1 and C#2, for referential-integrity purposes.
I really appreciate all your help.
My input file:
C#1 C#2 C#3 (these are the column headings, added just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it
Desired output file (the only change is in column 3, the last column):
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
Code I am currently using:
s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():
    if not line.strip():
        continue
    fields = line.split(None, 2)
    joined = '|'.join(fields)
    print(joined)
BTW, if I use the keyword search as-is, it also looks at my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, search only the 3rd column, and keep the 2 words before/after the keyword(s).
First, I need to warn you that using this code for 1 million records is risky. You are dealing with regular expressions, and this method is only good as long as the expressions stay regular; otherwise you might end up creating tons of cases to extract the data you want without also extracting data you don't want.
For 1 million rows you'll want pandas, as a plain for loop is too slow.
import pandas as pd
import re

df = pd.DataFrame({'C1': [12088, 12089],
                   'C2': ["CITA", "CITA"],
                   'C3': ["Hello very nice lists, better to keep those",
                          "This is great theme for lists keep it"]})
df["C3"] = df["C3"].map(lambda x:
    re.findall(r'(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*', str(x)))
df["C3"] = df["C3"].map(lambda x: x[0].strip())
which gives
df
C1 C2 C3
0 12088 CITA very nice lists, better to
1 12089 CITA theme for lists keep it
There are still some open questions about how exactly you want the keyword search to behave. One obstacle is already contained in your example: how should characters such as commas be treated? It is also not clear what to do with lines that do not contain the keyword, or what to do if there are not two words before or two words after the keyword. I guess you are a little unsure about the exact requirements yourself and did not think about all the edge cases.
Nevertheless, I have made some "blind decisions" about these questions, and here is a naive example implementation that assumes your keyword-matching rules are rather simple. I have created the function findword(), and you can adjust it to whatever you like. So maybe this example helps you pin down your own requirements.
KEYWORD = "lists"
S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """

def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None

for line in S.splitlines():
    tokens = line.split("|")
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    start = idx - 2 if idx > 1 else 0
    end = idx + 3  # slicing clamps at the end of the list, so no upper-bound check needed
    tokens[2] = " ".join(words[start:end])
    print('|'.join(tokens))
Test:
$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
PS: I hope I got the indices right for slicing. You should check, nevertheless.
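An alternative sketch using a single regular expression per line (the pattern is mine, not from the answer above): up to two whitespace-separated words are kept on each side of the keyword, and \S* lets punctuation such as the trailing comma stick to it.

```python
import re

KEYWORD = "lists"
line = "12088|CITA|{Hello very nice lists, better to keep those"

# Split off C#1 and C#2 so only the third column is searched.
c1, c2, c3 = line.split("|", 2)

# {0,2} makes both context groups optional, so keywords near the
# start or end of the column still match.
pattern = r"(?:\S+\s+){0,2}%s\S*(?:\s+\S+){0,2}" % re.escape(KEYWORD)
m = re.search(pattern, c3)
if m:
    print("|".join([c1, c2, m.group(0)]))  # 12088|CITA|very nice lists, better to
```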

Function awfully slow

I was looking for historical data from our Brazilian stock market and found it at Bovespa's website.
The problem is that the format the data is in is terrible; it is mingled with all sorts of other information about any particular stock!
So far so good! A great opportunity to test my fresh python skills (or so I thought)!
I managed to "organize/parse" pretty much all of the data with a few lines of code,
and then stumbled on a very annoying fact about the data. The very information I needed, stock prices(open, high, low, close), had no commas and was formatted like this: 0000000011200, which would be equivalent to 11 digits before the decimal comma.
So basically 0000000011200 = 112,00... You get the gist..
I wrote a few lines of code to fix that, and then the nightmare kicked in.
The whole data set is around 358K rows long, and with my current script, the deeper it runs into the list, the longer each edit takes.
Here is the code snippet I used for that:
@profile
def dataFix(datas):
    x = 0
    for entry in datas:
        for i in range(9, 16):
            data_org[datas.index(entry)][i] = entry[i][:11] + '.' + entry[i][11:]
        x += 1
        print(x)
Would anyone mind shining some light into this matter?
datas.index(entry)
There's your problem. datas.index(entry) requires Python to go through the datas list one element at a time, searching for entry. It's an incredibly slow way to do things, slower the bigger the list is, and it doesn't even work, because duplicate elements are always found at their first occurrence instead of the occurrence you're processing.
If you want to use the indices of the elements in a loop, use enumerate:
for index, entry in enumerate(datas):
...
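Applied to the function above, that could look like this (a sketch; data_org is assumed, as in the question, to be a parallel copy of datas):

```python
def data_fix(datas, data_org):
    # enumerate yields the row index directly, so there is no O(n)
    # datas.index(entry) scan on every iteration.
    for index, entry in enumerate(datas):
        for i in range(9, 16):
            data_org[index][i] = entry[i][:11] + "." + entry[i][11:]

# Tiny demo row: fields 0-8 unused, fields 9-15 are 13-digit price strings.
row = [""] * 9 + ["0000000011200"] * 7
datas = [row]
data_org = [row.copy()]
data_fix(datas, data_org)
print(data_org[0][9])  # 00000000112.00
```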
First, it is probably easier to convert the price directly to a more usable format.
For example, the Decimal format lets you do calculations without losing precision.
Secondly, I think you don't even need the index and can just use append.
Thirdly, say welcome to list comprehensions and slices :P
from decimal import Decimal
data_org = []
for entries in datas:
    data_org.append([Decimal(entry).scaleb(-2) for entry in entries[9:16]])
or even:
data_org = [[Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas]
or in a generator form:
data_org = ([Decimal(entry).scaleb(-2) for entry in entries[9:16]] for entries in datas)
or, if you want to keep the text form:
data_org = [['.'.join((entry[:-2], entry[-2:])) for entry in entries[9:16]] for entries in datas]
(replacing [:11] with [:-2] makes the code independent of the input size and takes the 2 decimals from the end)
