How to Remove Duplicates from a Lookup File Using Python?

I've seen multiple responses around this type of question, but I don't believe I've seen any for the type of list I am concerned with. Hopefully I am not duplicating anything here. Your help is much appreciated!
I have a comma-separated file that I use for data enrichment. It starts with the headers - TPCode,corporation_name - and then the list of values follows. There are around 35k rows (if that matters).
I notice that when the data from that lookup file (CSV) is displayed, there are multiple entries for the same customer. Rather than going in and manually removing them, I would like to run a Python script to remove the duplicates.
In the format of:
value,value
value,value
value,value
etc., what is the optimal way to remove the duplicates using Python? As a side note, each TPCode should be unique, but a corp name can have multiple TPCodes.
Please let me know if you need any additional information.
Thanks in advance!

Hard to tell from your question if each line should be unique. If so, you could do:
for l in sorted(set(line for line in open('ors_1202.log'))):
    print(l.rstrip())
otherwise we need more info.

The csv reader yields each row as a list, and lists are not hashable, so convert each row to a tuple before adding it to a set container. You can then loop over your rows and write only the ones you have not seen before:
import csv

seen = set()
with open('in_file.csv', 'r', newline='') as csvfile, open('out_file.csv', 'w', newline='') as csvout:
    spamreader = csv.reader(csvfile, delimiter=',')
    spamwriter = csv.writer(csvout, delimiter=',')
    for row in spamreader:
        key = tuple(row)      # tuples are hashable, lists are not
        if key not in seen:   # check before adding, otherwise nothing is written
            seen.add(key)
            spamwriter.writerow(row)
Note that membership checking in a set has O(1) complexity.
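If instead the goal is one row per TPCode (the question notes TPCodes should be unique while corp names may repeat), here is a minimal sketch keying the set on the first column only - the file names are placeholders:
import csv

seen_codes = set()
with open('lookup.csv', newline='') as src, open('lookup_dedup.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if row[0] not in seen_codes:  # keep only the first row seen for each TPCode
            seen_codes.add(row[0])
            writer.writerow(row)
The header row survives too, since the literal value "TPCode" is only ever seen once.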

Related

How to find duplicates in a csv with python, and then alter the row

For a little background, this is the csv file that I'm starting with (the data is nonsensical and only used for proof of concept):
Jackson,Thompson,jackson.thompson#hotmail.com,test,
Luke,Wallace,luke.wallace#lycos.com,test,
David,Wright,david.wright#hotmail.com,test,
Nathaniel,Butler,nathaniel.butler#aol.com,test,
Eli,Simpson,noah.simpson#hotmail.com,test,
Eli,Mitchell,eli.mitchell#aol.com,,test2
Bob,Test,bob.test#aol.com,test,
What I am attempting to do with this csv on a larger scale: if the first value in a row is duplicated, I need to take the data from the second entry, append it to the row with the first instance of the value, and remove the duplicate row. For example, in the data above "Eli" is represented twice; the first instance has "test" after the email value, while the second instance has no value there but instead has another value one index over.
I would want it to go from this:
Eli,Simpson,noah.simpson#hotmail.com,test,,
Eli,Mitchell,eli.mitchell#aol.com,,test2
To this:
Eli,Simpson,noah.simpson#hotmail.com,test,test2
I have been able to successfully import this csv into my code using what is below.
import csv

f = open(r'C:\Projects\Python\Test.csv', 'r')
csv_f = csv.reader(f)
test_list = []
for row in csv_f:
    test_list.append(row[0])
print(test_list)
At this point I was able to import my csv, and put the first names into my list. I'm not sure how to compare the indexes to make the changes I'm looking for. I'm a python rookie so any help/guidance would be greatly appreciated.
If you want to use pandas, you could use the .drop_duplicates() method. Note that it returns a new DataFrame unless you assign the result (or pass inplace=True). An example would look something like this:
import pandas as pd

csv_f = pd.read_csv(r'C:\a file with addresses')
csv_f = csv_f.drop_duplicates(subset=['thing_to_drop'], keep='first')
See the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
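To write the de-duplicated frame back to disk, a quick follow-up (the output path is illustrative):
csv_f.to_csv(r'C:\deduped.csv', index=False)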
I am kind of a newbie in Python as well, but I would suggest using DictReader and looking at the csv file as a dictionary, meaning every row is a dictionary. This way you can iterate through the names easily.
Second, I would suggest keeping a list of names already known to you as you iterate through the file, for example name_list.append("eli"). Then, on each later row, check if "eli" in name_list: and add its key/value to the first occurrence.
I don't know if this is best practice, so don't roast me guys, but this is a simple and quick solution.
This will help you practice iterating through lists and dictionaries as well; a rough sketch of the idea is below.
Here is a helpful link for reading about csv handling.
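A rough sketch of that merge idea using plain csv (file names are placeholders; the rule assumed here is that empty cells in the first occurrence get filled from later duplicate rows):
import csv

merged = {}  # first name -> the row we keep for that name
with open('Test.csv', newline='') as f:
    for row in csv.reader(f):
        first = row[0]
        if first not in merged:
            merged[first] = row
        else:
            # fill empty cells of the kept row from the duplicate row
            kept = merged[first]
            for i, value in enumerate(row):
                if i < len(kept) and not kept[i] and value:
                    kept[i] = value

with open('Test_merged.csv', 'w', newline='') as f:
    csv.writer(f).writerows(merged.values())
On the sample above, this collapses the two "Eli" rows into Eli,Simpson,noah.simpson#hotmail.com,test,test2.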

Setting fieldnames to a row other than the first for Dictreader in Python

I have a csv file where the table of data I am interested in starts at, say, row 4. When I try to skip over rows with something like yield, the DictReader still picks the first row for field names. I can get what I need by applying islice to the DictReader:
from itertools import islice
reader = islice(csv.DictReader(csv_file), 4, None)
But are there other ways that would also work?
I can’t seem to find any other way.
I am new and self learning, just trying to expand the way I approach things.
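One alternative is to skip the leading lines on the file object itself before handing it to DictReader, so that the first row it sees is the real header. A sketch, where 'data.csv' is a placeholder and the three skipped lines match a table starting at row 4:
import csv

with open('data.csv', newline='') as csv_file:
    for _ in range(3):
        next(csv_file)  # discard the lines above the real table
    reader = csv.DictReader(csv_file)  # row 4 now supplies the fieldnames
    for row in reader:
        print(row)
Note this differs slightly from the islice version, which still takes its fieldnames from row 1 and only skips data rows.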

How do I write unknown keys to a CSV in large data sets?

I'm currently working on a script that will query data from a REST API, and write the resulting values to a CSV. The data set could potentially contain hundreds of thousands of records, but it returns the data in sets of 100 entries. My goal is to include every key from each entry in the CSV.
What I have so far (this is a simplified structure for the purposes of this question):
import csv

def process_data(my_data):
    # This section should write my_data to a CSV file.
    # I know I can use csv.DictWriter as part of the solution.
    # Notice that in this example, "fieldnames" is undefined;
    # defining it is part of the question.
    with open('output.csv', 'a', newline='') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=fieldnames)
        for element in my_data:
            writer.writerow(element)

resp = client.get_list()
while resp.token:
    my_data = resp.data
    process_data(my_data)
    resp = client.get_list(resp.token)
The problem: Each entry doesn't necessarily have the same keys. A later entry missing a key isn't that big of a deal. My problem is, for example, entry 364 introducing an entirely new key.
Options that I've considered:
1. Whenever I encounter a new key, read in the output CSV, append the new key to the header, and append a comma to each previous line. This leads to a TON of file I/O, which I'm hoping to avoid.
2. Rather than writing to a CSV, write the raw JSON to a file. Meanwhile, build up a list of all known keys as I iterate over the data. Once I've finished querying the API, iterate over the JSON files that I wrote, and write the CSV using the list that I built. This leads to two total iterations over the data, and feels unnecessarily complex.
3. Hard-code the list of potential keys beforehand. This approach is impossible, for a number of reasons.
None of these solutions feel particularly elegant to me, which leads me to my question. Is there a better way for me to approach this problem? Am I overlooking something obvious?
Options 1 and 2 both seem reasonable.
Does the CSV need to be valid and readable while you're creating it? If not, you could do the append of missing columns in one pass after you've finished reading from the API (which would be like a combination of the two approaches). If you do this, you'll probably have to use the regular csv.writer in the first pass rather than csv.DictWriter, since your column definition will grow while you're writing.
One thing to bear in mind - if the overall file is expected to be large (e.g. it won't fit into memory), then your solution probably needs to use a streaming approach, which is easy with CSV but fiddly with JSON. You might also want to look into alternative formats to JSON for the intermediate data (e.g. XML, BSON, etc.). A streaming sketch of option 2 using JSON lines is below.
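A minimal sketch of option 2 with JSON lines, so both passes stream (it assumes each API entry is a flat dict and reuses the client/resp names from the question):
import csv
import json

fieldnames = []  # grows as new keys show up
with open('raw_data.jsonl', 'w') as raw:
    resp = client.get_list()
    while resp.token:
        for entry in resp.data:
            for key in entry:
                if key not in fieldnames:
                    fieldnames.append(key)
            raw.write(json.dumps(entry) + '\n')
        resp = client.get_list(resp.token)

# Second pass: stream the JSON lines back and write the CSV once,
# now that the full set of columns is known.
with open('raw_data.jsonl') as raw, open('output.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for line in raw:
        writer.writerow(json.loads(line))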

Python: How to restart a FOR loop, which iterates over a csv

I am using Python 3.5 and I want to load data from a csv into several lists, but it only works exactly once with a FOR loop; after that, nothing more is loaded.
Here is the code:
import csv

f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=';')
list_f1_vorname = []
for row_f1 in csv_f1:
    list_f1_vorname.append(row_f1[2])
list_f1_name = []
for row_f1 in csv_f1:  # <-- HERE IS THE ERROR, IT DOESN'T WORK A SECOND TIME!
    list_f1_name.append(row_f1[3])
Does anybody know how to restart this thing?
Many thanks and best regards,
Saitam
csv_f1 is not a list, it is an iterator.
Either you cache csv_f1 into a list by using list(), or you just recreate the object.
I would recommend recreating the object in case your csv data gets very big. This way, the data is not loaded into RAM completely.
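A small sketch of the recreate idea (file name and delimiter taken from the question): seek the underlying file back to the start and build a fresh reader, so nothing has to be cached.
import csv

with open("csvfile.csv", encoding="latin-1") as f1:
    csv_f1 = csv.reader(f1, delimiter=';')
    list_f1_vorname = [row[2] for row in csv_f1]

    f1.seek(0)  # rewind the file object
    csv_f1 = csv.reader(f1, delimiter=';')  # fresh reader over the same file
    list_f1_name = [row[3] for row in csv_f1]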
The simple answer is to iterate over the csv once and store it in a list. Something like:
my_list = []
for row in csv_f1:
    my_list.append(row)
or what abukaj wrote with
csv_f1 = list(csv.reader(f1, delimiter=';'))
and then move on and iterate over that list as many times as you want.
However, if you are only trying to get certain columns, then you can simply do that in the same for loop:
list_f1_vorname = []
list_f1_name = []
for row in csv_f1:
    list_f1_vorname.append(row[2])
    list_f1_name.append(row[3])
The reason it doesn't work multiple times is that csv_f1 is an iterator: it yields the values once but does not restart at the beginning after it has been exhausted.
Try:
csv_f1 = list(csv.reader(f1, delimiter=';'))
It is not exactly restarting the reader, but rather caching the file contents in a list, which may be iterated many times.
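Once cached, the list can be traversed as often as you like:
list_f1_vorname = [row[2] for row in csv_f1]
list_f1_name = [row[3] for row in csv_f1]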
One thing nobody has noticed so far is that you're trying to store first names and last names in two separate lists. This is not going to be very convenient to use later on. Therefore, although the other answers show correct possible solutions to read names and last names from the csv into two separate lists, I'm going to propose that you use a single list of dicts instead:
import csv

f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=";")
list_of_names = []
for row_f1 in csv_f1:
    list_of_names.append({
        "vorname": row_f1[2],
        "name": row_f1[3]
    })
Then you can iterate over this list and take the value you want. For example to simply print the values:
for row in list_of_names:
    print(row["vorname"])
    print(row["name"])
And last but not least, you could also build this list using a list comprehension (kinda more Pythonic):
list_of_names = [
    {
        "vorname": row_f1[2],
        "name": row_f1[3]
    }
    for row_f1 in csv_f1
]
As I said, I appreciate the other answers; they solve the issue of the csv reader being an iterator and not a list-like object.
Nevertheless, I see a little bit of an XY problem in your question. I've seen many attempts to store entity properties (name and last name are obviously related properties and together form a simple entity) in multiple lists. It always ends up with code that is hard to read and maintain.

CSV to Python Dictionary with all column names?

I'm still pretty new to using Python to program from scratch, so as an exercise I thought I'd take a file that I process using SQL and try to duplicate the functionality using Python. It seems that I want to take my (compressed, zip) csv file and create a dict of it (or maybe a dict of dicts?). When I use DictReader I get the 1st row as a key, rather than each column as its own key. E.g.:
import csv, sys, zipfile

sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(sys.argv[0])
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel'):
    pass
Yields:
>>> for key in row:
...     print 'key=%s, value=%s' % (key, row[key])
key=MLS_ACCT PARCEL_ID AREA COUNTY STREET_NUM STREET_NAME CITY ZIP STATUS PROP_TYPE LIST_PRICE LIST_DATE DOM DATE_MODIFIED BATHS_HALF BATHS_FULL BEDROOMS ACREAGE YEAR_BUILT YEAR_BUILT_DESC OWNER_NAME SOLD_DATE WITHDRAWN_DATE STATUS_DATE SUBDIVISION PENDING_DATE SOLD_PRICE,
value=492859 28-15-3-009-001.0000 200 JEFF 3828 ORLEANS RD MOUNTAIN BROOK 35243 A SFR 324900 3/3/2011 2 3/4/2011 12:04:11 AM 0 2 3 0 1968 EXIST SPARKS 3/3/2011 11:54:56 PM KNOLLWOOD
So what I'm looking for is a column for MLS_ACCT and a separate one for PARCEL_ID etc., so I can then do things like average prices for all items that contain KNOLLWOOD in the SUBDIVISION field, with a further subsection by date range, date sold, etc.
I know well how to do it with SQL, but as I said, I'm trying to gain some Python skills here.
I have been reading for the last few days but have yet to find any very simple illustrations of this sort of use case. Pointers to said docs would be appreciated. I realize I could use memory-resident SQLite, but again my desire is to learn the Python approach. I've read some on NumPy and SciPy and have Sage loaded, but still can't find useful illustrations, since those tools seem focused on arrays with only numbers as elements, and I have a lot of string matching to do as well as date range calculations and comparisons.
Eventually I'll need to substitute values in the table (since I have dirty data); I do this now by having a "translate table" which contains all dirty variants and provides a "clean" answer for final use.
Are you sure that this is a file with comma-separated values? It seems like the lines are being delimited by tabs.
If this is correct, specify a tab delimiter in the DictReader constructor.
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    for key in row:
        print 'key=%s, value=%s' % (key, row[key])
Source: http://docs.python.org/library/csv.html
Writing the operation in pure Python is certainly possible, but you'll have to choose your algorithms then. The row output you've posted above looks a whole lot like the parsing has gone wrong; in fact, it seems not to be a CSV at all, is it a TSV? Try passing delimiter='\t' or dialect=csv.excel_tab to DictReader.
Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this isn't normally the efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require data that is certainly not present in a CSV, like how dates are represented and which columns are dates.
An example of getting a column-oriented data structure (however, involving loading the whole file):
import csv

allrows = list(csv.reader(open('test.csv')))
# Extract the first row as keys for a columns dictionary
columns = dict([(x[0], x[1:]) for x in zip(*allrows)])
The intermediate steps of going to list and storing in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table.
Then extracting column two from all rows with a certain criterion in column one:
matchingrows = [rownum for (rownum, value) in enumerate(columns['one']) if value > 2]
print map(columns['two'].__getitem__, matchingrows)
When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.
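For example, a sketch of the kind of query the question describes (column names come from the posted header; the %m/%d/%Y format is a guess based on the sample value 3/3/2011):
from datetime import datetime

# Average list price for rows whose SUBDIVISION mentions KNOLLWOOD
knollwood = [i for i, sub in enumerate(columns['SUBDIVISION']) if 'KNOLLWOOD' in sub]
prices = [float(columns['LIST_PRICE'][i]) for i in knollwood]
print sum(prices) / len(prices)

# Parse LIST_DATE values into real dates for range comparisons
list_dates = [datetime.strptime(d, '%m/%d/%Y') for d in columns['LIST_DATE']]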
At first glance it seems like your input might not actually be CSV; maybe it is tab-delimited instead. Check out the docs at python.org; you can create a Dialect and use that to change the delimiter.
import csv

csv.register_dialect('exceltab', delimiter='\t')
for row in csv.DictReader(items_file, dialect='exceltab'):
    pass
