Find index of duplicate rows in Openpyxl

I want to find the index of all duplicate rows in an excel file and add them to a list which will be handled later.
unwantedRows = []
Row = []
item = ""
for index, row in enumerate(ws1.iter_rows(max_col = 50), start = 1):
    for cell in row:
        if cell.value:
            item += cell.value
    if item in Row:
        unwantedRows.append(index)
    else:
        Row.append(item)
However this fails to work. It only indexes rows that are completely empty. How do I fix this?

unwantedRows = []
Rows = []
for index, row in enumerate(ws1.iter_rows(max_col = 50), start = 1):
    sublist = []
    for cell in row:
        sublist.append(cell.value)
    if sublist not in Rows:
        Rows.append(sublist)
    else:
        unwantedRows.append(index)

Without a tuple:
row_numbers_to_delete = []
rows_to_keep = []
for row in ws.rows:
    working_list = []
    for cell in row:
        working_list.append(cell.value)
    if working_list not in rows_to_keep:
        rows_to_keep.append(working_list)
    else:
        row_numbers_to_delete.append(cell.row)
# delete from the bottom up so earlier deletions do not shift later row numbers
for row in reversed(row_numbers_to_delete):
    ws.delete_rows(
        idx=row,
        amount=1
    )
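Both answers compare each row against a list of rows already seen, which gets slow on large sheets. Here is a minimal sketch of the same idea using a set of tuples instead. The function name `find_duplicate_rows` and the sample data are made up for illustration; with openpyxl the rows could come from `ws.iter_rows(values_only=True)`, which yields tuples of cell values.

```python
def find_duplicate_rows(rows, start=1):
    """Return the row numbers (counted from `start`) that repeat an earlier row."""
    seen = set()
    duplicates = []
    for number, row in enumerate(rows, start=start):
        key = tuple(row)  # tuples are hashable, lists are not
        if key in seen:
            duplicates.append(number)
        else:
            seen.add(key)
    return duplicates


# Stand-in data (hypothetical); each tuple plays the role of one worksheet row.
sample_rows = [
    ("a", 1, None),
    ("b", 2, None),
    ("a", 1, None),
    (None, None, None),
    ("b", 2, None),
]
duplicate_numbers = find_duplicate_rows(sample_rows)
print(duplicate_numbers)  # [3, 5]
```

If the duplicates are then removed with `ws.delete_rows`, going through the numbers in reverse order keeps the remaining row numbers valid.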

Related

create sublist of indices with each sublist referring to set of unique tuples from a list of tuples

I am trying to build sublists of indices from a list of tuples: the indices of tuples that share any element in the same position should be grouped together, and the indices of unique tuples should stay separate. A tuple is unique if none of its elements equals the element in the same position of any other tuple in the list.
Example: a list that groups the same company together, where "same company" means the same name, the same registration number, or the same CEO name.
company_list = [("companyA","0002","ceoX"),
                ("companyB","0002","ceoY"),
                ("companyC","0003","ceoX"),
                ("companyD","0004","ceoZ")]
The desired output would be:
[[0,1,2],[3]]
Does anyone know of a solution for this problem?

The companies form a graph. You want to create clusters from connected companies.
Try this:
company_list = [
    ("companyA",2,"ceoX"),
    ("companyB",2,"ceoY"),
    ("companyC",3,"ceoX"),
    ("companyD",4,"ceoZ")
]
# Prepare indexes
by_name = {}
by_number = {}
by_ceo = {}
for i, t in enumerate(company_list):
    if t[0] not in by_name:
        by_name[t[0]] = []
    by_name[t[0]].append(i)
    if t[1] not in by_number:
        by_number[t[1]] = []
    by_number[t[1]].append(i)
    if t[2] not in by_ceo:
        by_ceo[t[2]] = []
    by_ceo[t[2]].append(i)
# BFS to propagate group to connected companies
groups = list(range(len(company_list)))
for i in range(len(company_list)):
    g = groups[i]
    queue = [g]
    while queue:
        x = queue.pop(0)
        groups[x] = g
        t = company_list[x]
        for y in by_name[t[0]]:
            if g < groups[y]:
                queue.append(y)
        for y in by_number[t[1]]:
            if g < groups[y]:
                queue.append(y)
        for y in by_ceo[t[2]]:
            if g < groups[y]:
                queue.append(y)
# Assemble result: collect indices by group id, so members of a group
# do not have to be adjacent in the original list
by_group = {}
for i, g in enumerate(groups):
    by_group.setdefault(g, []).append(i)
result = list(by_group.values())
print(result)

Fafl's answer is definitely more performant. If you're not worried about performance, here is a brute-force solution that might be easier to read. Tried to make it clear with some comments.
def find_index(res, target_index):
    for index, sublist in enumerate(res):
        if target_index in sublist:
            # yes, it's present
            return index
    return None # not present
def main():
    company_list = [
        ('companyA', '0002', 'CEOX'),
        ('companyB', '0002', 'CEOY'),
        ('companyC', '0003', 'CEOX'),
        ('companyD', '0004', 'CEOZ'),
        ('companyE', '0004', 'CEOM'),
    ]
    res = []
    for index, company_detail in enumerate(company_list):
        # check if this `index` is already present in a sublist in `res`
        # if the `index` is already present in a sublist in `res`, then we need to add to that sublist
        # otherwise we will start a new sublist in `res`
        index_to_add_to = None
        if find_index(res, index) is None:
            # does not exist
            res.append([index])
            index_to_add_to = len(res) - 1
        else:
            # exists
            index_to_add_to = find_index(res, index)
        for c_index, c_company_detail in enumerate(company_list):
            # inner loop to compare company details with the other loop
            if c_index == index:
                # same, ignore
                continue
            if company_detail[0] == c_company_detail[0] or company_detail[1] == c_company_detail[1] or company_detail[2] == c_company_detail[2]:
                # something matches, so append
                res[index_to_add_to].append(c_index)
                res[index_to_add_to] = list(set(res[index_to_add_to])) # make it unique
    print(res)
if __name__ == '__main__':
    main()

Check this out; I put a lot of effort into it. Maybe I am missing some test cases. Performance-wise, I think it's good.
I used set() and pop the tuples that belong to the same group.
company_list = [
    ("companyA",2,"ceoX"),
    ("companyB",2,"ceoY"),
    ("companyC",3,"ceoX"),
    ("companyD",4,"ceoZ"),
    ("companyD",3,"ceoW")
]
index = {val: key for key, val in enumerate(company_list)}
res = []
while len(company_list):
    new_idx = 0
    temp = []
    val = company_list.pop(new_idx)
    temp.append(index[val])
    while new_idx < len(company_list) :
        if len(set(val + company_list[new_idx])) < 6:
            temp.append(index[company_list.pop(new_idx)])
        else:
            new_idx += 1
    res.append(temp)
print(res)
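Since the companies form a graph, the grouping can also be done with a union-find (disjoint-set) structure. This is a different technique from the answers above, not a rewrite of any of them. The sketch below is minimal, and the name `group_indices` is made up for illustration. It merges two records whenever they share a value in the same position, so groups are built transitively and their members need not be adjacent in the list.

```python
def group_indices(records):
    """Group record indices that share a value in the same position."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as root so each group is labelled
            # by its first member.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (position, value) -> index of the first record with it
    for i, record in enumerate(records):
        for position, value in enumerate(record):
            key = (position, value)
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


company_list = [
    ("companyA", 2, "ceoX"),
    ("companyB", 2, "ceoY"),
    ("companyC", 3, "ceoX"),
    ("companyD", 4, "ceoZ"),
]
grouped = group_indices(company_list)
print(grouped)  # [[0, 1, 2], [3]]
```

Comparing values per position means a CEO name that happens to equal a company name does not link two records.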

How to solve ValueError while iterating and appending to list

ValueError: 2 columns passed, passed data had 4 columns:
import pandas as pd
def customedata():
    colnum = input("How many columns do you need? ")
    colnum = int(colnum)
    rownum = input("How many rows do you need? ")
    # user input column and row
    rownum = int(rownum)
    colName = []
    rowName = []
    # create an empty list
    for col in range(0,colnum):
        colValue =input('Enter the value for column name of column %s:' %(col + 1))
        colName.append(colValue)
        for row in range(0,rownum):
            rowValue = (int(input('Enter the value of row number %s:' %(row + 1))))
            rowName.append(rowValue)
            row = row + 1
        col = col + 1
    # columns = colName[i]
    df1= pd.DataFrame([rowName],columns = colName)
    print(df1)
I tried to create a DataFrame from user-supplied rows and columns, but I keep getting a ValueError. I thought something was wrong with the nested loop, but I wasn't able to solve the problem.

I think it would be easier to create a pd.DataFrame from user input using a dictionary, as below. First you create an empty dict, then you use each colValue as a key and its rowName list as the value. Then you can use pd.DataFrame.from_dict() to turn the dict into a pd.DataFrame.
Hope it helps :)
import pandas as pd
def customedata():
    colnum = input("How many columns do you need? ")
    colnum = int(colnum)
    rownum = input("How many rows do you need? ")
    # user input column and row
    rownum = int(rownum)
    dictionary = {}
    for col in range(0, colnum):
        colValue = input('Enter the value for column name of column %s:' % (col + 1))
        # start a fresh list for each column so the columns do not share values
        rowName = []
        for row in range(0, rownum):
            rowValue = (int(input('Enter the value of row number %s:' % (row + 1))))
            rowName.append(rowValue)
        dictionary[colValue] = rowName
    df1 = pd.DataFrame.from_dict(dictionary)
    print(df1)
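The error in the question comes from a shape mismatch: all the values end up in one flat list, which pandas reads as a single row with colnum * rownum columns. As a rough sketch without the interactive prompts, a flat list can be split into one list per column like this. The helper name `build_columns` and its parameters are hypothetical and not part of pandas.

```python
def build_columns(col_names, flat_values, rows_per_column):
    """Split a flat list of values into one list per column name."""
    expected = len(col_names) * rows_per_column
    if len(flat_values) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat_values)}")
    columns = {}
    for position, name in enumerate(col_names):
        start = position * rows_per_column
        # Each column takes the next `rows_per_column` values in order.
        columns[name] = flat_values[start:start + rows_per_column]
    return columns


columns = build_columns(["x", "y"], [1, 2, 3, 4], rows_per_column=2)
print(columns)  # {'x': [1, 2], 'y': [3, 4]}
```

Passing a dictionary shaped like this to pd.DataFrame gives one column per key, each with rows_per_column rows.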

I don't understand this code. I want to split it up

I don't quite understand how this line is written.
The source code is as follows.
line = [cell.value for cell in col if cell.value != None]
I want to understand how this code works.
I tried to rewrite it with a loop, but the results were different.
for cell in col:
    if cell.value != None:
        line = cell.value

You are quite close. FYI, the one-line syntax is called a list comprehension. Here is the equivalent.
line = list()
for cell in col:
    if cell.value != None:
        line.append(cell.value)

You keep overwriting the line variable, while it should be a list:
line = []
for cell in col:
    if cell.value != None:
        line.append(cell.value)
As you see, the one-liner has two square brackets around it, so it becomes a list.

You are going in the right direction, but here line should be a list, with each value appended to it.
So the code will look like the following:
line = []
for cell in col:
    if cell.value != None:
        line.append(cell.value)

line = [cell.value for cell in col if cell.value != None]
print(line)
line = []
for cell in col:
    if cell.value != None:
        line.append(cell.value)
print(line)
line = list()
for cell in col:
    if cell.value != None:
        line.append(cell.value)
print(line)
Start with an empty list, write the loop as you did, and add each value to the list with append. I added the print(line) calls myself; you can ignore them.
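To check that the two forms really produce the same list, here is a small runnable sketch. The `col` below is a made-up stand-in for a column of openpyxl cells (only the `.value` attribute is imitated), and it uses `is not None`, the usual spelling of the `!= None` test.

```python
from types import SimpleNamespace

# Stand-in for a column of cells (hypothetical data); only .value is used.
col = [SimpleNamespace(value=v) for v in ("a", None, 0, "b", None)]

# List comprehension form.
line_comprehension = [cell.value for cell in col if cell.value is not None]

# Equivalent explicit loop: start empty, append each value that passes the test.
line_loop = []
for cell in col:
    if cell.value is not None:
        line_loop.append(cell.value)

print(line_comprehension)  # ['a', 0, 'b']
print(line_loop == line_comprehension)  # True
```

Note that 0 is kept, because the test only excludes None, not every falsy value.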

Error in Python: "IndexError: list index out of range"

I am trying to make a list out of data from an excel file in python, but I receive this whenever I run my code
row[1] = int(row[1])
IndexError: list index out of range
>>>
This is the code I have that sorts it (by minimum, maximum, and average)
import csv
f = open("Class 2.csv", "r")
csvread = csv.reader(f)
nlist = []
for row in csvread:
    filter(lambda x: 3 > 0, row)
    row[0] = int(row[0])
    row[1] = int(row[1])
    row[2] = int(row[2])
    row[3] = int(row[3])
    minimum = min(row[1:4])
    row.append(minimum)
    maximum = max(row[1:4])
    row.append(maximum)
    average = round(sum(row[1:4])/3)
    row.append(average)
    nlist.append(row[0:4])
print(nlist)
row[0] in my Excel file is a name, so I also get an error telling me that int(row[0]) cannot work because the value is not an integer. I don't know how to change the code so that I don't get this error.

The row being processed has no second element (for example, a blank line in the file). Check the row's length before indexing into it.
If all values in the list are numbers, use map to convert the strings to ints:
int_values = list(map(int, str_list))
or
int_values = list(map(int, str_list[0:4]))
But this alone will not solve your problem, because the first value in each row is a name rather than a number.
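Putting both points together, here is a minimal sketch that skips short rows and leaves the name as a string. It assumes each row holds a name followed by three integer scores, as the question suggests. The function name `summarize_scores` and the sample data are made up for illustration.

```python
import csv
import io


def summarize_scores(csv_file):
    """Return [name, min, max, average] for each well-formed row."""
    results = []
    for row in csv.reader(csv_file):
        if len(row) < 4:
            continue  # skip blank or short rows instead of raising IndexError
        name = row[0]  # the name stays a string
        try:
            scores = [int(value) for value in row[1:4]]
        except ValueError:
            continue  # skip rows whose scores are not integers, e.g. a header
        average = sum(scores) / len(scores)
        results.append([name, min(scores), max(scores), average])
    return results


# Hypothetical sample standing in for "Class 2.csv".
sample = io.StringIO("Ann,3,5,7\n\nBob,10,8,6\nCal,1\n")
summary = summarize_scores(sample)
print(summary)  # [['Ann', 3, 7, 5.0], ['Bob', 6, 10, 8.0]]
```

With a real file, the same function could be called inside `with open("Class 2.csv", newline="") as f:`.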

python xlrd string match

I couldn't find anything in the API. Is there a way to return the row number or coordinates of a cell based on a string match? For instance, you give the script a string, it scans through the .xls file, and when it finds a cell with the matching string, it returns the coordinates or row number.

def find_cell(sheet, search_value):
    for i in range(sheet.nrows):
        row = sheet.row_values(i)
        for j in range(len(row)):
            if row[j] == search_value:
                return i, j
    return None
Something like that; it's just a basic search.

You could try the following function, thank you Joran
def look4_xlrd (search_value, sheet) :
    lines = []
    columns = []
    for i in range (sheet.nrows) :
        row = sheet.row_values(i)
        for j in range(len(row)) :
            if row[j] == search_value :
                lines.append(i)
                columns.append(j)
        del row
    return lines, columns
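As a variation on the two answers, the sketch below returns every match as a (row, column) pair instead of two parallel lists. It relies only on `sheet.nrows` and `sheet.row_values(i)`, as the answers above do. The name `find_all_cells` and the `FakeSheet` class are hypothetical and exist only so the example runs without an .xls file.

```python
def find_all_cells(sheet, search_value):
    """Return (row, column) index pairs of every cell equal to search_value."""
    matches = []
    for row_index in range(sheet.nrows):
        for col_index, value in enumerate(sheet.row_values(row_index)):
            if value == search_value:
                matches.append((row_index, col_index))
    return matches


class FakeSheet:
    """Hypothetical stand-in exposing the two sheet members used above."""

    def __init__(self, rows):
        self._rows = rows
        self.nrows = len(rows)

    def row_values(self, index):
        return self._rows[index]


sheet = FakeSheet([["id", "name"], [1, "spam"], [2, "eggs"], [3, "spam"]])
found = find_all_cells(sheet, "spam")
print(found)  # [(1, 1), (3, 1)]
```

The indices are zero-based, so add 1 to each if spreadsheet-style row and column numbers are needed.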
