Finding rows with specific words in an excel sheet [closed] - python

I have a table similar to this format:
[table image]
My table has 50000 rows and 800 columns; certain cells (the file is tab-separated) contain multiple comma-separated words (e.g. L,N). I want to retain only the rows that contain at least one of a set of words (say A to N) in a given column (say col2) and remove the remaining rows.
Is it possible to do this with VLOOKUP, or is there another way to do it? Any suggestion is appreciated.

Create a helper column, perhaps on the extreme right of a copy of your worksheet. Enter this formula in row 2 of that column, modifying it so that C is the letter of the helper column (the one in which you write the formula) and B is the column in which to find the word. Copy the formula all the way down.
=ISERROR(FIND(C$1,$B2))
Now enter the word you want to keep in row 1 of the helper column (C$1 in my example). The column will fill with TRUE and FALSE.
TRUE means that the word wasn't found and the row should be deleted.
FALSE means that the word exists and the row should be kept.
Now sort the sheet on that column and delete the block with TRUE in it. Fifteen seconds flat, once you have done it a few times. That's faster than any VBA, R, or Python solution could be made to run.
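For reference, the same filter takes only a few lines in pandas. This is a minimal sketch, not a drop-in solution: the file name, the column name col2, and the word set below are placeholders for your actual data.
import pandas as pd

# Words to keep; rows whose col2 contains none of them are dropped.
words = {"A", "B", "C", "L", "N"}

df = pd.read_csv("table.txt", sep="\t")  # the 50000 x 800 tab-separated table

# Split each col2 cell on commas and keep the row if any token is in `words`.
mask = df["col2"].astype(str).apply(
    lambda cell: any(tok.strip() in words for tok in cell.split(","))
)
df[mask].to_csv("filtered.txt", sep="\t", index=False)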

The biggest issue I had was in conveying what I did, how, and why. No need to, so deleted.
So, select any part, section, or range of your table and run the code.
You can remove the section of the code below 'Clear any existing filters' if you don't need to delete the found rows and want to keep them in your sheet, still tagged, for additional purposes (for example, you could copy those rows to another file).
The code below should do what you asked in your question and leave you with just the rows of your table that DON'T contain any of your terms in your selection; i.e. it deletes rows and shifts the rest up according to your criteria.
And yes, Python and R can do this for you too, with less code (with dataframes in Python), but this VBA code worked for my many examples. Don't know how it will fare with 50000 rows and X columns, but it should be alright (post edit: it works fine).
Sub SearchSelectionDeleteTermsFound()
    ' Reads the search terms from the region around A1 on Sheet2, tags every
    ' row of the Selection whose cells contain any term by writing 1 in
    ' column 801 (ADU), then filters on that column and deletes the tagged rows.
    Dim terms As Variant
    terms = Sheets("Sheet2").Cells(1, 1).CurrentRegion.Value
    Dim r As Range
    Dim c As Variant
    For Each r In Selection
        For Each c In terms
            ' Check each term in turn; on a hit, tag the row in column ADU.
            If r.Value Like "*" & c & "*" Then
                Cells(r.Row, 801).Value = 1
            End If
        Next c
    Next r
    'Clear any existing filters
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
    '1. Apply filter on the tag column
    ActiveSheet.Range("A1:ADU50000").AutoFilter Field:=801, Criteria1:=1
    '2. Delete the visible (tagged) rows
    Application.DisplayAlerts = False
    ActiveSheet.Range("A1:ADU50000").SpecialCells(xlCellTypeVisible).Delete
    Application.DisplayAlerts = True
    '3. Clear the filter
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
End Sub
You might be able to see that in the beginning I was printing results to offset columns of the table/selection, which took up unnecessary space, and I was using Application.WorksheetFunction.HLookup in the code to build a final result column tagging the rows to delete; in the end that was unnecessary. Those earlier versions/macros worked too, but they were slower, so I dropped the helper columns and the intermediate arrays and now tag each matching row directly.
I turned to my friend, [Excel Campus - delete rows based on conditions][1], to embed and tweak the AutoFilter code at the end, which deletes the rows you don't want, so you don't have to do it yourself.
It's now a "virtual" HLOOKUP over the matches in your selection (or data): it deletes all rows that match your specification, leaving you with the data you want.
I know it could be streamlined and expanded much further (starting with how the loop scans the selection), but I'm happy with its functionality and potential scope for now.
[1]: https://www.youtube.com/watch?v=UwNcSZrzd_w&t=6s

Related

Finding duplicates across multiple groups in a column- Python

I need really serious help with some code.
I have a dataframe in which I want to find duplicates across two columns: Material Part Number and Manufacturer. The columns have null values. The way I need to find duplicates is as follows:
I first check the Part Number column for rows with no null values, as I do not want the null values to be treated as duplicates.
If two part numbers match, I then check the Manufacturer column for duplicates.
In case both the manufacturer and the part number are the same for two or more rows, I output the result into a new column called Level of Duplicacy. The output is 'High' for all the rows in which the part numbers and the manufacturers have an exact match.
However, if the part numbers match and the manufacturer doesn't match, the output is 'Moderate'.
If the part number itself doesn't match, then Level of Duplicacy is 'No Duplicate'.
Also, for rows with NA in the part number or manufacturer, put the Level of Duplicacy as 'No Duplicate' in the case of the part number and 'Moderate' in the case of the manufacturer.
This is my input table:
[input table image]
The code I have written for the same is
for i in range(len(df)):
    if pd.isnull(df.loc[i, 'Material Part Number']) == False:
        if (df['Material Part Number'].duplicated(keep=False))[i] == True:
            if pd.isnull(df.loc[i, 'Manufacturer']) == False:
                if (df['Manufacturer'].duplicated(keep=False))[i] == True:
                    df.loc[i, 'Level of Duplicacy'] = 'High'
                else:
                    df.loc[i, 'Level of Duplicacy'] = 'Moderate'
            else:
                df.loc[i, 'Level of Duplicacy'] = 'Moderate'
        else:
            df.loc[i, 'Level of Duplicacy'] = 'Not duplicate'
    else:
        df.loc[i, 'Level of Duplicacy'] = 'Not duplicate'
The output I need is:
[expected output image]
The output I'm getting is:
[current output image]
As you can see in the rows highlighted in yellow, my code isn't comparing manufacturers within one particular/unique part number; it's comparing them across all the part numbers, and I don't want it to do that. I know that .duplicated() compares over the entire column, but what if I want it to compare within each unique part number and then find a match? More of a groupby with duplicated? Can one of you help me modify the code I have written?
Thanks a lot.
Running a loop through the data frame requires an element-wise comparison of each item. I would suggest using pandas' vectorized boolean indexing to achieve this instead. Have a look below; this might be helpful.
df["Level of Duplicacy"] = "Not Duplicate"
Partdups = df.loc[df["Material Part Number"].duplicated(),"Material Part Number"].unique()
for dup in Partdups:
Nums = df.loc[df["Material Part Number"] == dup,:]
dupNums = Nums.loc[Nums["Manufacturer"].duplicated(),"Manufacturer"].unique()
for num in dupNums:
Nums.loc[Nums["Manufacturer"] == num,"Level of Duplicacy"] = "High"
Nums.loc[Nums["Manufacturer"] != num,"Level of Duplicacy"] = "Moderate"
df.iloc[Nums["Material Part Number"].index,:] = Nums
df.loc[pd.isnull(df["Material Part Number"]),"Level of Duplicacy"] = "Not Duplicate"
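Since you asked about "more of a groupby with duplicated": the same logic can be written that way too. A sketch (the column names come from the question; the data layout is assumed):
import pandas as pd

df["Level of Duplicacy"] = "Not Duplicate"

# Rows whose non-null part number appears more than once.
part_dup = df["Material Part Number"].notna() & df["Material Part Number"].duplicated(keep=False)

# Within each part-number group, flag manufacturers that repeat.
manu_dup = (
    df.groupby("Material Part Number")["Manufacturer"]
      .transform(lambda s: s.notna() & s.duplicated(keep=False))
      .fillna(False)   # rows with a null part number fall outside every group
      .astype(bool)
)

df.loc[part_dup, "Level of Duplicacy"] = "Moderate"
df.loc[part_dup & manu_dup, "Level of Duplicacy"] = "High"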

using python openpyxl to write to an excel spreadsheet (string searches)

The following is my code. I want it to read the excel spreadsheet and use the data in the Warehouse column (i.e. search for substrings in that column's cells) to map and write a specific string to the corresponding cell in the next column, called GeneralDescription. My spreadsheet has over 50000 rows. This snippet of code works for classifying two GeneralDescriptions at this point. Eventually I want to be able to easily scale this to cover all possible warehouses.
The only thing that is not working, and that I need specific help with, is that when the string "WORLD WIDE DATA" appears in the Warehouse column, the code does not recognize it. I am assuming because of the all upper case. However, if the string "HUMANRESOURCES Toronto" appears in the Warehouse column, this code works correctly and writes "HumanResources" to the GeneralDescription column. It also recognizes "WWD" and "wwd" and correctly writes "World Wide Data" to the GeneralDescription column. I don't understand why just that one particular string is not being recognized, unless it has something to do with whitespace.
Also, in the original spreadsheet there are some integers that identify warehouses. If I don't delete these, I am unable to iterate through those rows. I need to keep these numbers in there. Any ideas on how I can make this work? Any help is much appreciated.
import openpyxl
import re

wb = openpyxl.load_workbook(filename="Trial_python.xlsx")
ws = wb.worksheets[0]
sheet = wb.active
for i in range(2, 94000):
    if sheet.cell(row=i, column=6).value != None:
        if sheet.cell(row=i, column=6).value.lower() == "world wide data":
            sheet.cell(row=i, column=7).value = "World Wide Data"
        for j in re.findall(r"[\w']+", sheet.cell(row=i, column=6).value):
            if j.lower() == "wwd" or j.lower() == "world wide data":
                sheet.cell(row=i, column=7).value = "World Wide Data"
            if j.lower() == "humanresources":
                sheet.cell(row=i, column=7).value = "HumanResources"
wb.save(filename="Trial_python.xlsx")
I'd recommend creating an empty list and, as you iterate through the column, storing each of the values in there with .append(); that should help your code scale a bit better, although I'm sure there will be other more efficient solutions.
One caution on equality checks: use == for comparing strings, and reserve is for identity checks such as is not None. This link goes into detail about the differences: https://dbader.org/blog/difference-between-is-and-equals-in-python
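To make the difference concrete, a tiny sketch (not from the original answer):
a = "world wide data"
b = "WORLD WIDE DATA".lower()
print(a == b)          # True: same characters, an equality test
print(a is b)          # usually False: different objects, an identity test
print(b is not None)   # the idiomatic use of is / is not: None checks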
So your code should look like this:
...
business_list = ['world wide data', 'other_businesses', 'etc']
for i in range(2, 94000):
    if sheet.cell(row=i, column=6).value is not None:
        if sheet.cell(row=i, column=6).value.lower() in business_list:
            sheet.cell(row=i, column=7).value = "World Wide Data"
...
Hope that helps
Edit to answer comments below
So to answer your question in comment 2: the business_list = [...] that we created will store anything that you want to check for, i.e. if WWD, World Wide Data, 2467, etc. appear, you can check through this list, and if a match is found - which uses the in operator - you can then write whatever you like into column 7 (the final line of code).
If you want Machine Operations or HumanResources or any of these other strings to appear, there are a couple of ways to accomplish this. A simple way is to write a check for them like so:
...
business_list = ['world wide data', 'other_businesses', '2467',
                 'central operations', 'humanresources', 'machine operations']
for i in range(2, 50000):
    cell_value = sheet.cell(row=i, column=6).value
    if cell_value is not None:
        value = str(cell_value).lower()   # str() also copes with integer warehouse IDs
        if value in business_list:
            if value == "humanresources":
                sheet.cell(row=i, column=7).value = "HumanResources"
            elif value == "machine operations":
                sheet.cell(row=i, column=7).value = "Machine Operations"
            else:
                sheet.cell(row=i, column=7).value = "World Wide Data"
...
So to explain what is happening here: a list is created with the values that you want to check for, called business_list. You then iterate through your column and make sure the cell is not empty with is not None. From there you do an initial check to see whether the value of the cell is something you even want to act on - in business_list - and if it is, you use the matched value itself to decide what to write into the cell.
This ensures that you aren't checking for something that might not be there, by checking the list first. The values that you suggested map one for one, i.e. HumanResources for humanresources, Machine Operations for machine operations.
As for scaling, it should be easy to add new checks: add the new company name to the list, plus a two-line "if this, then write that" statement.
I use a similar system for a sheet that is roughly 1.2m entries and performance is still fast enough for production, although I don't know how complex yours is. There may be more efficient ways of doing it, but this system is simple to maintain in the future as well. Hopefully this makes a bit more sense for you; let me know if not and I'll help if possible.
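A tidier variant of the same idea (a sketch, not tested against your actual sheet): keep the term-to-description mapping in a dict, so adding a warehouse is a one-line change. The terms and descriptions below are placeholders based on the examples in this thread.
import openpyxl

# Hypothetical mapping from lowercase search terms to the description to write;
# extend it with one line per new warehouse.
DESCRIPTIONS = {
    "world wide data": "World Wide Data",
    "wwd": "World Wide Data",
    "2467": "World Wide Data",
    "humanresources": "HumanResources",
    "machine operations": "Machine Operations",
}

wb = openpyxl.load_workbook(filename="Trial_python.xlsx")
sheet = wb.active

# max_row avoids hard-coding a row limit like 94000.
for i in range(2, sheet.max_row + 1):
    value = sheet.cell(row=i, column=6).value
    if value is None:
        continue
    key = str(value).lower()   # str() copes with integer warehouse IDs
    if key in DESCRIPTIONS:
        sheet.cell(row=i, column=7).value = DESCRIPTIONS[key]

wb.save(filename="Trial_python.xlsx")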
EDIT: As for your last comment, I wouldn't assume something like that without doing a check since it could lead to false positives!

How do you print out lines from a file that match a certain condition, but you have many columns to check?

name played wins loses
Leeroy,19,7,12
Jenkins,19,8,11
Tyler,19,0,19
Napoleon Wilson,19,7,12
Big Boss,19,7,12
Game Dude,19,5,14
Macho Man,19,3,16
Space Pirate,19,6,13
Billy Casper,19,7,12
Otacon,19,7,12
Big Brother,19,7,12
Ingsoc,19,5,14
Ripley,19,5,14
M'lady,19,4,15
Einstein100,19,8,11
Dennis,19,5,14
Esports,19,8,11
RNGesus,19,7,12
Kes,19,9,10
Magnitude,19,6,13
Basically, this is a file called firesideResults, which I open in my code and have to check through. If the wins column contains a 0, I do not print the row; if it contains a number other than zero, I display it on the screen. However, I have multiple columns of numbers to deal with and I can't find how to check only one column.
My code was going to be:
if option == ("C") or option == ("c"):
    answer = False
    file_3 = open("firesideResults.txt")
    for column in file_3:
        if ("0" not in column):
            print(column)
But unfortunately, one of the other columns also contains a 0, so I cannot do that. Thank you for your help, and if possible please list any questions I could check for help, as I have been searching for so long.
Since you have comma-separated fields, the best way would be to use csv!
import csv

with open("firesideResults.txt") as file_3:
    cr = csv.reader(file_3)
    next(cr, None)  # skip the header line so row[2] always exists
    for row in cr:
        if row[2] != "0":
            print(row)
If the third column of a row (the wins column) is not 0, the row is printed.
No substring issue: it checks the exact field.
It checks only the "wins" column, not the other ones.
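If you'd rather compare the wins column numerically (so stray spaces or a value like 00 are still treated as zero), a small variation works. A sketch, assuming the header line shown in the question is present:
import csv

with open("firesideResults.txt") as file_3:
    cr = csv.reader(file_3)
    next(cr, None)              # skip the "name played wins loses" header
    for row in cr:
        if int(row[2]) != 0:    # wins column, compared as a number
            print(row)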

Python Removing Columns

I am trying to remove the last two columns from my data frame by using Python.
The issue is there are cells with values in the last two columns that we don't need, and those columns don't have headers.
Here's the code I wrote, but I'm really new to Python, and don't know how to take my original data and remove the last two columns.
import csv

with open("Filename", "rb") as source:
    rdr = csv.reader(source)
    with open("Filename_out", "wb") as result:  # output must differ from the input file
        wrt = csv.writer(result)
        for r in rdr:
            wrt.writerow((r[0], r[1], r[2], r[3], r[4], r[5],
                          r[6], r[7], r[8], r[9], r[10], r[11]))
Thanks!
The proper Pythonic way to perform something like this is through slicing:
r[start:stop:step]
start and stop are indexes, where positive indexes are counted from the front and negative ones from the end. Blank starts and stops are treated as the beginning and the end of r, respectively. step is an optional parameter that I'll explain later. Any slice returns a new list, which you can perform additional operations on or just return immediately.
In order to remove the last two values, you can use the slice
r[:-2]
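Applied to the question's loop, the slice replaces the hand-written tuple of twelve indexes. A sketch in Python 3 (the question's code is Python 2 style with "rb"/"wb"); the filenames are placeholders, and the input and output must be different files, since writing to the file you are reading would truncate it:
import csv

with open("input.csv", newline="") as source, open("output.csv", "w", newline="") as result:
    rdr = csv.reader(source)
    wrt = csv.writer(result)
    for r in rdr:
        wrt.writerow(r[:-2])  # every field except the last two columns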
Additional fun with step
Now for that step parameter. It allows you to pick every step-th value from the selected slice. With a list of, say, r = [0,1,2,3,4,5,6,7,8,9,10], you can pick every other number starting with the first (all of the even numbers) with the slice r[::2]. In order to get results in reverse order, you can make the step negative:
>>> r = [0,1,2,3,4,5,6,7,8,9,10]
>>> r[::-1]
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Increasing Values in one Column Using CSV Module and Python

All I would like to do is add .001 to each value that isn't a 0 in one column (say column 7 for example) in my csv file.
So instead of being 35, the value would be changed to 35.001, for example. I need to do this to make my ArcMap script work, because if a whole number is read first, the column is assigned as a short integer when it needs to be read as a float.
As of right now, I have:
writer.writerow([f if f.strip() =='0' else f+.001 for f in row])
This creates a concatenation error however and does not yet address the specific column I need this to work on.
Any help would be greatly appreciated.
Thank you.
The easiest thing to do is to just mutate the row in place, i.e.
if row[7].strip() != '0' and '.' not in row[7]:
    row[7] = row[7] + '.001'
writer.writerow(row)
The concatenation error is caused by trying to add a float to a string; you just have to wrap the extra decimals in quotes.
The extra condition on the if ensures that you don't accidentally end up with a number with two decimal points.
It's pretty standard for numbers of the form 35.0 to be treated as floats even though the value is whole; check whether ArcMap follows this convention, and then you can avoid reducing the accuracy of your numbers by just appending '.0'.
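Put together in a full read-modify-write loop, it looks something like the sketch below. The filenames are placeholders, and the column index 7 comes from the example in the question:
import csv

with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Nudge non-zero whole numbers in column 7 so ArcMap types the column as float.
        if row[7].strip() != '0' and '.' not in row[7]:
            row[7] = row[7] + '.001'
        writer.writerow(row)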
