Efficiently removing duplicates from a list [duplicate] - python

Good evening. I have an excel file with zip codes and associated information. Those zip codes have a lot of duplicates. I'd like to figure out which zip codes I have by putting them all in a list without duplicates. This code works, but it runs very slowly (over 100 seconds), and I was wondering what I could do to improve its efficiency.
I know that checking the whole list for duplicates each time contributes a lot to the inefficiency, but I'm not sure how to fix that. I also know that going through every row is probably not the best answer, but again I am pretty new and am now stuck.
Thanks in advance.
import sys
import xlrd

loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

def findUniqueZips():
    zipsInSheet = []
    for i in range(sheet.nrows):
        if str(sheet.cell(i, 0).value) in zipsInSheet:
            pass
        else:
            zipsInSheet.append(str(sheet.cell(i, 0).value))
    print(zipsInSheet)

findUniqueZips()

If you're looking to avoid duplicates then you should definitely consider using sets in Python. See here.
What I would do is create a set and simply add all your elements to it; note that a set is an unordered collection of unique items. Once all the data has been added, you can copy the elements of the set into your list. This avoids the redundant data.
import xlrd

loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)

def findUniqueZips():
    zipsInSheet = []
    data = set()
    for i in range(sheet.nrows):
        data.add(str(sheet.cell(i, 0).value))  # set membership checks are O(1)
    # now copy all elements of the set into your list
    for zipCode in data:
        zipsInSheet.append(zipCode)
    print(zipsInSheet)

findUniqueZips()

I usually just convert the list to a set. Sets are your friend: membership tests on a set are much faster than on a list. Unless you intentionally need or want duplicates, use sets.
https://docs.python.org/3.7/tutorial/datastructures.html?highlight=intersection#sets
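As a minimal sketch (the sample values are made up for illustration), converting to a set removes duplicates, and dict.fromkeys does the same while keeping first-seen order:

zips = ['12345', '67890', '12345', '54321', '67890']

unique_zips = list(set(zips))               # order is not preserved
ordered_unique = list(dict.fromkeys(zips))  # keeps first-seen order (Python 3.7+)

print(ordered_unique)  # ['12345', '67890', '54321']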

Related

Loop function to rename dataframes

I am new to coding and currently I want to create individual dataframes from each excel tab. It works so far thanks to a search in this forum (I found a sample using a dictionary), but I need one more step which I can't figure out.
This is the code I am using:
import pandas as pd

excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
d = {}
for sheet in xls.sheet_names:
    print(sheet)
    d[f'{sheet}'] = pd.read_excel(xls, sheet_name=sheet)
Let's say I have 3 excel tabs called 'alpha', 'beta' and 'charlie'.
The code above gives me 3 dataframes, and I can call them by typing: d['alpha'], d['beta'] and d['charlie'].
What I want is to rename the dataframes so that instead of typing (for example) d['alpha'], I just need to write alpha (without any other extras).
Edit: The excel file I want to parse has 50+ tabs and it can grow.
Edit 2: Thank you all for the links and the answers! It is a great help.
Don't rename them.
I can think of two scenarios here:
1. The sheets are fundamentally different
When people ask how to dynamically assign to variable names, the usual (and best) answer is "Use a dictionary". Here's one example.
Indeed, this is the reason Pandas does it this way!
In this case, my opinion is that your best move here is to do nothing, and just use the dictionary you have.
2. The sheets are roughly the same
If the sheets are all basically the same, and only differ by one attribute (e.g. they represent monthly sales and the names of the sheets are 'May', 'June', etc), then your best move is to merge them somehow, adding a column to reflect the sheet name (month, in my example).
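A minimal sketch of that merge, reusing the 'sample.xlsx' file from the question (the column name 'sheet' is just an illustrative choice, not anything pandas requires):

import pandas as pd

xls = pd.ExcelFile('sample.xlsx')
frames = []
for name in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name=name)
    df['sheet'] = name  # record which tab each row came from
    frames.append(df)
combined = pd.concat(frames, ignore_index=True)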
Whatever you do, don't use exec or eval, no matter what anyone tells you. They are not options for beginner programmers.
I think you are looking for the built-in exec function, which executes strings.
But I do not recommend using exec; it is widely discussed why it shouldn't be used, or at least should be used cautiously.
As I do not have your data, I think it is achievable using the following code:
import pandas as pd

excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
for sheet in xls.sheet_names:
    print(sheet)
    # note the quotes around {sheet}: without them the generated code would
    # reference an undefined variable instead of passing a string
    code_to_execute = f'{sheet} = pd.read_excel(xls, sheet_name="{sheet}")'
    exec(code_to_execute)
But again, I highlight that this is not the cleanest way to do it; your approach is definitely cleaner. To be more precise, I would always use dicts for these kinds of assignments. See here for more about exec.
In general, you want to generate a string and execute it:
possible_string = 'a=10'
exec(possible_string)
print(a) # 10
You need to create variables that correspond to the three dataframes:
alpha, beta, charlie = d.values()
(This works because dicts preserve insertion order in Python 3.7+, so the values unpack in the same order the sheets were read.)
Edit:
Since you mentioned that the excel file could have 50+ tabs and could grow, you may prefer to keep your original loop. This can be done dynamically using exec:
import pandas as pd

excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
d = {}
for sheet in xls.sheet_names:
    print(sheet)
    exec(f'{sheet}' + " = pd.read_excel(xls, sheet_name=sheet)")
It might be better practice, however, to simply index your sheets and access them by index. A 50+ length collection of excel sheets is probably better organized by appending to a list and accessing by index:
d = []
for sheet in xls.sheet_names:
    print(sheet)
    d.append(pd.read_excel(xls, sheet_name=sheet))
# d[0] = alpha; d[1] = beta, and so on...

How to find duplicates in a csv with python, and then alter the row

For a little background, this is the csv file that I'm starting with (the data is nonsensical and only used for proof of concept):
Jackson,Thompson,jackson.thompson@hotmail.com,test,
Luke,Wallace,luke.wallace@lycos.com,test,
David,Wright,david.wright@hotmail.com,test,
Nathaniel,Butler,nathaniel.butler@aol.com,test,
Eli,Simpson,noah.simpson@hotmail.com,test,
Eli,Mitchell,eli.mitchell@aol.com,,test2
Bob,Test,bob.test@aol.com,test,
What I am attempting to do with this csv on a larger scale is: if the first value in a row is duplicated, I need to take the data from the second occurrence, append it to the row with the first instance of the value, and remove the duplicate row. For example, in the data above "Eli" appears twice; the first instance has "test" after the email value, while the second instance has no value there and instead has a value one index over.
I would want it to go from this:
Eli,Simpson,noah.simpson@hotmail.com,test,,
Eli,Mitchell,eli.mitchell@aol.com,,test2
To this:
Eli,Simpson,noah.simpson@hotmail.com,test,test2
I have been able to successfully import this csv into my code using what is below.
import csv

f = open(r'C:\Projects\Python\Test.csv', 'r')  # raw string avoids backslash escape issues
csv_f = csv.reader(f)
test_list = []
for row in csv_f:
    test_list.append(row[0])
print(test_list)
At this point I was able to import my csv, and put the first names into my list. I'm not sure how to compare the indexes to make the changes I'm looking for. I'm a python rookie so any help/guidance would be greatly appreciated.
If you want to use pandas you could use the pandas .drop_duplicates() method. An example would look something like this:
import pandas as pd

data = pd.read_csv(r'C:\a file with addresses')
# reassign the result: with the default inplace=False, a new dataframe is returned
data = data.drop_duplicates(subset=['thing_to_drop'], keep='first')
see the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
I am kind of a newbie in Python as well, but I would suggest using csv.DictReader and looking at the csv file as a dictionary, meaning every row is a dictionary. That way you can iterate through the names easily.
Second, I would suggest keeping a list of the names already known to you as you iterate through the file, to check whether a name has been seen before. For example:
name_list.append("eli")
Then, when you hit another row, check if "eli" in name_list: and add the key/value to the first occurrence.
I don't know if this is best practice, so don't roast me guys, but it is a simple and quick solution.
This will help you practice iterating through lists and dictionaries as well.
Here is a helpful link for reading about csv handling.
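A minimal sketch of the approach described above, assuming the column layout from the question (first name in column 0; 'Test.csv' is the question's file name):

import csv

merged = {}   # first name -> merged row
order = []    # remember first-seen order

with open('Test.csv', newline='') as f:
    for row in csv.reader(f):
        key = row[0]
        if key not in merged:
            merged[key] = row
            order.append(key)
        else:
            first = merged[key]
            # fill empty fields of the first occurrence from the duplicate row
            for i, value in enumerate(row):
                if i < len(first) and not first[i] and value:
                    first[i] = value

for key in order:
    print(','.join(merged[key]))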

Python: How to restart a FOR loop, which iterates over a csv

I am using Python 3.5 and I want to load data from a csv into several lists, but it only works exactly once with a for loop. After that, it loads nothing.
Here is the code:
import csv

f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=';')
list_f1_vorname = []
for row_f1 in csv_f1:
    list_f1_vorname.append(row_f1[2])
list_f1_name = []
for row_f1 in csv_f1:  # <-- HERE IS THE ERROR, IT DOESN'T WORK A SECOND TIME!
    list_f1_name.append(row_f1[3])
Does anybody know how to restart this thing?
Many thanks and best regards,
Saitam
csv_f1 is not a list, it is an iterator.
Either you cache csv_f1 in a list by using list(), or you just recreate the object.
I would recommend recreating the object in case your csv data gets very big; this way, the data is not loaded into RAM completely.
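Recreating the reader is as simple as rewinding the underlying file (a small sketch reusing the names from the question):

f1.seek(0)                              # rewind the open file to the start
csv_f1 = csv.reader(f1, delimiter=';')  # build a fresh reader on top of it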
The simple answer is to iterate over the csv once and store it in a list, something like this:
my_list = []
for row in csv_f1:
    my_list.append(row)
or what abukaj wrote with
csv_f1 = list(csv.reader(f1, delimiter=';'))
and then move on and iterate over that list as many times as you want.
However, if you are only trying to get certain columns, then you can simply do that in the same for loop:
list_f1_vorname = []
list_f1_name = []
for row in csv_f1:
    list_f1_vorname.append(row[2])
    list_f1_name.append(row[3])
The reason it doesn't work multiple times is that the reader is an iterator: it iterates over the values once, but does not restart at the beginning after it has been exhausted.
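You can observe the same exhaustion with any plain iterator (a tiny illustration, independent of csv):

it = iter([1, 2, 3])
print(list(it))  # [1, 2, 3]
print(list(it))  # [] -- the iterator is exhausted and yields nothing more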
Try:
csv_f1 = list(csv.reader(f1, delimiter=';'))
It is not exactly restarting the reader, but rather caching the file contents in a list, which may be iterated many times.
One thing nobody has noticed so far is that you're trying to store first names and last names in two separate lists. This is not going to be very convenient to use later on. Therefore, although the other answers show correct ways to read names and last names from the csv into two separate lists, I'm going to propose using a single list of dicts instead:
f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=";")
list_of_names = []
for row_f1 in csv_f1:
    list_of_names.append({
        "vorname": row_f1[2],
        "name": row_f1[3]
    })
Then you can iterate over this list and take the value you want. For example, to simply print the values:
for row in list_of_names:
    print(row["vorname"])
    print(row["name"])
And last but not least, you could build this list using a list comprehension (kinda more Pythonic):
list_of_names = [
    {
        "vorname": row_f1[2],
        "name": row_f1[3]
    }
    for row_f1 in csv_f1
]
As I said, I appreciate the other answers; they solve the issue of the csv reader being an iterator and not a list-like object.
Nevertheless, I see a bit of an XY problem in your question. I've seen many attempts to store entity properties (a first name and a last name are obviously related properties that together form a simple entity) in multiple lists, and it always ends up with code that is hard to read and maintain.

Unmerged Cell For Loop skipping cells

I am using openpyxl and found some code through googling about unmerging cells in an xlsx workbook.
I got the code to work, but found that it was not removing all the merged cells in a single pass. I set it up to run using a while loop and solved the issue, but I was wondering what I am doing wrong to cause the skipping in the first place. Any insight would be helpful.
Code:
import openpyxl

wb = openpyxl.load_workbook('./filename.xlsx')  # load the workbook, not just the path
ws = wb[sheetname]

def remove_merged(sheet_object):
    merged = ws.merged_cell_ranges
    while len(merged) > 0:
        for mergedRNG in merged:
            ws.unmerge_cells(range_string=mergedRNG)
        merged = ws.merged_cell_ranges
    return len(merged)

remove_merged(ws)
ws.merged_cell_ranges is mutable, so you need to be careful that it is not used directly in any for loop, because the implicit counter won't take into account that the property has been recalculated. This is a common gotcha in Python, illustrated by:
l = list(range(10))
for i in l:
    print(i)
    l.pop(0)  # anything that affects the structure of the list
The following is how to avoid this:
for rng in ws.merged_cell_ranges[:]:    # iterate over a copy of the list
    ws.unmerge_cells(range_string=rng)  # remove the range from the original
PS: just copying stuff from an internet search isn't really advisable; there are several sites with outdated or unnecessarily complex code. It's best to refer to the documentation or ask on the mailing list.

How do I make a dynamically expanding array in python

Ok I have this part of code:
def Reading_Old_File(self, Path, turn_index, SKU):
    print "Reading Old File! Turn Index = ", turn_index, "SKU= ", SKU
    lenght_of_array = 0
    array_with_data = []
    if turn_index == 1:
        reading_old_file = open(Path, 'rU')
        data = np.genfromtxt(reading_old_file, delimiter="''", dtype=None)
        for index, line_in_data in enumerate(data, start=0):
            if index < 3:
                print index, "Not Yet"
            if index >= 3:
                print ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Reading All Old Items"
                i = index - 3
                old_items_data[i] = line_in_data.split("\t")
                old_items_data[i] = [lines_old.strip() for lines_old in old_items_data]
                print old_items_data[i]
    print len(old_items_data)
So what I am doing here is reading a file. On my first turn I want to read it all and keep all the data, so it would be something like:
old_items_data[1]=['123','dog','123','dog','123','dog']
old_items_data[2]=['124','cat','124','cat','124','cat']
old_items_data[n]=['amount of list members is equal each time']
Each line of the file should be stored in a list so I can use it later for comparing: when turn_index is greater than 2, I'll compare each incoming line with the lines in every list (array) by iterating over all the lists.
So the question is: how do I do this, or is there a better way to compare lists?
I'm new to Python, so maybe someone could help me with this issue?
Thanks
You just need to use append.
old_items_data.append(line_in_data.split("\t"))
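Since a Python list grows on demand, there is no need to pre-size it or assign to indexes that don't exist yet. A tiny sketch with made-up values:

old_items_data = []
old_items_data.append(['123', 'dog'])  # the list grows with each append
old_items_data.append(['124', 'cat'])
print(len(old_items_data))  # 2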
I would use the package pandas for this. It will not only be much quicker, but also simpler. Use pandas.read_table to import the data (specifying delimiter and row-skipping can be done here by passing arguments to sep and skiprows). Then, use pandas.DataFrame.apply to apply your function to the rows of your data.
The speed gains are going to come from the fact that pandas was optimized to perform actions across lists like this (in the case of a pandas DataFrame, these would be called rows). This applies to both importing the data and applying a function to every row. The simplicity gains should hopefully be clear.
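A minimal sketch of that suggestion (the file name, the tab delimiter, and skipping three header rows are assumptions drawn from the question, not confirmed details):

import pandas as pd

# import the data, skipping the first three rows as the question's loop does
data = pd.read_table('old_file.txt', sep='\t', skiprows=3, header=None)

# apply a function to every row, e.g. strip whitespace from every field
data = data.apply(lambda row: row.astype(str).str.strip(), axis=1)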
