Checking if csv files have same items - python

I have two .csv files, one with info1 and one with info2. The files look like this:
File1:
20170101,,,d,4,f,SWE
20170102,a,,,d,f,r,RUS <-
File2:
20170102,a,s,w,,,,RUS <-
20170103,d,r,,,,FIN
I want to combine these two lines (marked as "<-") and make a combined line like this:
20170102,a,s,w,d,f,r,RUS
I know that I could write a script similar to this:
for row1 in csv_file1:
    for row2 in csv_file2:
        if row1[0] == row2[0] and row1[1] == row2[1]:
            # do something
Is there any other way to find out which rows share the same items at the beginning, or is this the only way? This approach is pretty slow; it takes several minutes to run on files with 100,000 rows.

Your implementation is O(n^2): it compares every line in one file with every line in the other, and it is even worse if you re-read the second file for each line of the first.
You can significantly speed this up by building an index from the content of the first file. The index can be as simple as a dictionary, with the first column of the file as key and the row as value.
Build the index in one pass over the first file, then make one pass over the second file, checking for each row whether its id is in the index. If it is, print the merged line.
index = {row[0]: row for row in csv_file1}

for row in csv_file2:
    if row[0] in index:
        # do something
Special thanks to @martineau for the dict comprehension version of building the index.
If there can be multiple items with the same id in the first file,
then the index could point to a list of those rows:
index = {}
for row in csv_file1:
    key = row[0]
    if key not in index:
        index[key] = []
    index[key].append(row)
This could be simplified a bit using defaultdict:
from collections import defaultdict

index = defaultdict(list)
for row in csv_file1:
    index[row[0]].append(row)
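Putting it together, here is a minimal end-to-end sketch. The file names and the merge rule (prefer a non-empty field from the second file, otherwise fall back to the first) are assumptions, since the question doesn't spell out how conflicting fields should combine:

import csv

# Assumed file names; the merge rule below is one plausible reading of the example.
with open("file1.csv", newline="") as f1:
    index = {row[0]: row for row in csv.reader(f1)}

with open("file2.csv", newline="") as f2, open("merged.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in csv.reader(f2):
        match = index.get(row[0])
        if match is not None:
            # Field by field, keep the value from file2 if it is non-empty,
            # otherwise fall back to file1 (zip stops at the shorter row).
            merged = [b if b else a for a, b in zip(match, row)]
            writer.writerow(merged)

For the two marked sample rows this produces 20170102,a,s,w,d,f,r,RUS.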

Need a way to take three csv files and put into one as well as remove duplicates and replace values in Python

I'm new to Python, but I need help creating a script that will take in three different csv files, combine them, remove duplicates from the first column as well as any blank rows, and then change each revenue area to a number.
The three CSV files are setup the same.
The first column is a phone number and the second column is a revenue area (city).
The first column will need all duplicates & blank values removed.
The second column will have values like "Macon", "Marceline", "Brookfield", which will need to be changed to a specific value like:
Macon = 1
Marceline = 8
Brookfield = 4
And if it doesn't match one of those values, use a default value of 9.
Welcome to Stack Overflow!
Firstly, you'll want to be using the csv library for the "reader" and "writer" functions, so import the csv module.
Then, you'll want to open the new file to be written to, and use the csv.writer function on it.
After that, you'll want to define a set (I name it seen). This will be used to prevent duplicates from being written.
Write your headers (if you need them) to the new file using the writer.
Open your first old file using the csv module's "reader". Iterate through the rows with a for loop and add each row's key to the "seen" set; if it has been seen already, simply "continue" instead of writing to the file. Repeat this for the next two files (the code below just loops over all three filenames).
To assign the values to the cities, you'll want to define a dictionary that holds the old names as the keys, and new values for the names as the values.
So, your code should look something like this:
import csv

myDict = {'Macon': 1, 'Marceline': 8, 'Brookfield': 4}
seen = set()

with open('newFile.csv', 'w', newline='') as newFile:  # newline='' prevents the writer from writing extra newlines, i.e. empty rows.
    writer = csv.writer(newFile)
    writer.writerow(['Phone Number', 'City'])  # This will write a header row for you.

    # Open each file in turn, read each row, skip empty rows, skip rows whose
    # phone number has already been seen, change the value of "City" (defaulting
    # to 9 for unlisted cities), and write to the new file.
    for filename in ('firstFile.csv', 'secondFile.csv', 'thirdFile.csv'):
        with open(filename, newline='') as inFile:
            for row in csv.reader(inFile):
                if not any(row):
                    continue
                if row[0] in seen:  # lists aren't hashable, so deduplicate on the phone number itself
                    continue
                seen.add(row[0])
                row[1] = myDict.get(row[1], 9)  # unknown cities get the default value of 9
                writer.writerow(row)
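For example, with the mapping above, a (made-up) input row like 5551234,Macon would be written out as 5551234,1, and a row with a city not in the dictionary would get the default revenue area 9.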
I have not tested this myself, but it is very similar to two different programs that I wrote, and I have attempted to combine them into one. Let me know if this helps, or if there is something wrong with it!
-JCoder96

CSV to Python Dictionary with multiple lists for one key

So I have a csv file formatted like this
data_a,dataA,data1,data11
data_b,dataB,data1,data12
data_c,dataC,data1,data13
, , ,
data_d,dataD,data2,data21
data_e,dataE,data2,data22
data_f,dataF,data2,data23
HEADER1,HEADER2,HEADER3,HEADER4
The column headers are at the bottom, and I want the third column to provide the keys. The third column holds the same value within each of the two blocks of data, and the blocks are separated by rows of empty values. I want to store the three rows of values under that one key, and also disregard some columns, such as column 4. This is my code right now:
#!/usr/bin/env python
import csv

with open("example.csv") as f:
    readCSV = csv.reader(f)
    for row in readCSV:
        # disregard separating rows
        if row[2] != '':
            myDict = {row[2]: [row[0], row[1]]}
            print(myDict)
What I basically want is that when I call
print(myDict['data2'])
I get
{[data_d,dataD][data_e,dataE][data_f,dataF]}
I tried editing my if statement to
if row[2] == 'data2':
    myDict = {'data2': [row[0], row[1]]}
and just making an if for every individual key, but I don't think this will work either way.
With your current method, you probably want a defaultdict. This is a dictionary-like object that provides a default value if the key doesn't already exist. So in your case, we set this up to be a list, and then for each row we loop through, we append the values in columns 0 and 1 to this list as a tuple, like so:
import csv
from collections import defaultdict

data = defaultdict(list)

with open("example.csv") as f:
    readCSV = csv.reader(f)
    for row in readCSV:
        # disregard separating rows
        if row[2] != '':
            data[row[2]].append((row[0], row[1]))

print(data)
With the example provided, this prints a defaultdict with the following entries:
{'data1': [('data_a', 'dataA'), ('data_b', 'dataB'), ('data_c', 'dataC')], 'data2': [('data_d', 'dataD'), ('data_e', 'dataE'), ('data_f', 'dataF')]}
I'm not a super Python geek, but I would suggest using pandas (import pandas as pd). You load the data with pd.read_csv(file, header=...); with header you can specify the row you want to use as the header, and then it's much, much easier to manipulate the dataset (e.g. dropping columns with del df['column_name'], creating dictionaries, etc.).
Here is documentation to pd.read_csv: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
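Following that suggestion, here is a rough, untested sketch for this particular file; since the real headers sit on the last line, it reads with header=None, and the cleanup of the separator and trailing header rows is an assumption:

import pandas as pd

# Read without a header row (the real headers are on the last line),
# referring to columns by position 0..3.
df = pd.read_csv("example.csv", header=None, skipinitialspace=True)
# Drop the blank separator rows and the header row at the bottom.
df = df[df[2].notna() & (df[2] != "HEADER3")]
# Build {key: [(col0, col1), ...]} grouped on the third column.
data = {key: list(zip(grp[0], grp[1])) for key, grp in df.groupby(2)}
print(data["data2"])  # [('data_d', 'dataD'), ('data_e', 'dataE'), ('data_f', 'dataF')]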

What's the most elegant, Pythonic way to deduplicate two sequential records in a CSV and retaining one record?

I'm trying to deduplicate records in a CSV. I don't consider myself new to Python or to writing ETL scripts. I've done my due diligence and searched the S.O. pages, and I don't think this problem can be reduced to using sets (like most deduplication problems).
My goal is: For all rows in which ORIG is equal to the previous row’s ORIG, among the two rows for which ORIG is equal, delete the row for which SEQ_TIME == 0.
As the Python dictum goes, "There should be one-- and preferably only one --obvious way to do it." I've written code that I believe accomplishes this, but anyone would tell you it's extremely un-Pythonic. The CSV data appears below as text (the original post also showed a separate results CSV, with the rows meeting the condition highlighted in yellow for easy comparison).
Data as CSV text:
TRAIN#,SEQ#,ORIG,DEP,DEST,ARR,SEQ_TIME
A21,9,BPK,0.582986111,X66,0.584375,2
A21,10,X66,0.584375,CNLEMOYN,0.586805556,3.5
A21,11,CNLEMOYN,0.586805556,SMT,0.590972222,6
A21,12,SMT,0.590972222,,0.590972222,0
A21,13,SMT,0.590972222,CNCANAL,0.591666667,1
A21,14,CNCANAL,0.591666667,MEWILSPR,0.594791667,4.5
A21,15,MEWILSPR,0.594791667,,0.594791667,0
A21,16,MEWILSPR,0.594791667,MELEMONT,0.6,7.5
A21,17,MELEMONT,0.6,,0.6,6.5
A21,18,MELEMONT,0.6,MELOCKPO,0.605208333,0
A21,19,MELOCKPO,0.605208333,,0.605208333,0
A21,20,MELOCKPO,0.605208333,XUD,0.60625,2.5
A21,21,XUD,0.60625,JOL,0.607638889,2
And (un-Pythonic) code that I think accomplishes the goal is below.
import csv

f = open("my_data.csv", "r")
reader = csv.reader(f, lineterminator="\n")
header = reader.next()

# Dict comprehension so we can refer to each column by index or name.
hdict = {value: index for index, value in enumerate(header)}

# Data is converted to a 2-D list, since I do other stuff with it later.
data = [row for row in reader]

# Main (un-pythonic) solution.
result = []
try:
    i = 0
    while True:
        row1 = data[i]
        row2 = data[i+1]  # Will cause an IndexError on the last row.
        if row1[hdict["ORIG"]] == row2[hdict["ORIG"]]:
            if float(row1[hdict["SEQ_TIME"]]):
                result.append(row1)
            elif float(row2[hdict["SEQ_TIME"]]):
                result.append(row2)
            else:
                raise AssertionError("Two sequential rows with equivalent ORIG cannot both have SEQ_TIME == 0.")
            i += 1  # Force-skips to row3 in the next iteration, since row1 & row2 are handled above.
        else:
            result.append(row1)
        i += 1  # I'm brute-forcing a loop with a manual index.
except IndexError:
    result.append(data[-1])  # Handle the last row.

# Write results to some other CSV.
g = open("my_results.csv", "w")
writer = csv.writer(g, lineterminator="\n")
writer.writerow(header)
for row in result:
    writer.writerow(row)
f.close()
g.close()
Although the while True: break idiom in Python is common and (I believe) sloppy coding, a try: while True: go on forever: except IndexError idiom is truly awful. Is there a simpler, more elegant way to accomplish this task, such as a simple for loop?
One idea I pursued was using an iterable to control the cursor as it iterated through each row in a for loop:
data_iterable = iter(data)
for row in data_iterable:
    row1 = row[:]
    row2 = data_iterable.next()  # Controlling the cursor here.
    if row1[hdict["ORIG"]] == row2[hdict["ORIG"]]:
        if float(row1[hdict["SEQ_TIME"]]):
            result.append(row1)
        else:
            result.append(row2)
        # The AssertionError check can be omitted.
    else:
        result.append(row1)  # If nothing unusual...
        result.append(row2)  # append both rows.
The problem here is that this code only handles even-numbered duplicates and misses the odd-numbered duplicates.
Alternatively, we could iterate through the data twice, flagging rows we want to keep in a keep_these_rows list according to some ID like SEQ#. Then on the second pass, append only those rows to the result? But this seems equally clumsy to me and 2x as slow by necessity.
Any better solutions from the crowd?
NOTES:
The hdict is an easy way to combine csv.reader and csv.DictReader capabilities, so you can refer to rows by name e.g. row[hdict["ORIG"]] or index e.g. row[2].
I read one post by @DSM mentioning the itertools.groupby function as a contender. Would it do any good for us?
Thanks!
If the groups you want to compress are all contiguous, then you're right that itertools.groupby could be useful. Assuming that (say) we want to preserve SEQ_TIME == 0 cases if they're the only member of a group or if there are three contiguous entries with a SEQ_TIME == 0, we could do something like (Python 3 csv open style):
import csv
import itertools

with open("dedup.csv", newline="") as fp_in, open("dedup_out.csv", "w", newline="") as fp_out:
    reader = csv.DictReader(fp_in)
    writer = csv.DictWriter(fp_out, reader.fieldnames)
    writer.writeheader()
    for key, group in itertools.groupby(reader, key=lambda row: row["ORIG"]):
        group = list(group)
        if len(group) == 2:
            group = [row for row in group if float(row["SEQ_TIME"]) != 0]
        writer.writerows(group)
which gives me
TRAIN#,SEQ#,ORIG,DEP,DEST,ARR,SEQ_TIME
A21,9,BPK,0.582986111,X66,0.584375,2
A21,10,X66,0.584375,CNLEMOYN,0.586805556,3.5
A21,11,CNLEMOYN,0.586805556,SMT,0.590972222,6
A21,13,SMT,0.590972222,CNCANAL,0.591666667,1
A21,14,CNCANAL,0.591666667,MEWILSPR,0.594791667,4.5
A21,16,MEWILSPR,0.594791667,MELEMONT,0.6,7.5
A21,17,MELEMONT,0.6,,0.6,6.5
A21,20,MELOCKPO,0.605208333,XUD,0.60625,2.5
A21,21,XUD,0.60625,JOL,0.607638889,2
where the group conditions can be adjusted as you need. If you know there will never be any SEQ_TIME=0 cases you want to keep, the code could get even simpler, but this should give you a place to start.
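For instance, a variant of the group condition that drops every zero-SEQ_TIME row whenever the group contains a non-zero alternative, regardless of group size, might look like this (an untested sketch along the same lines):

import csv
import itertools

with open("dedup.csv", newline="") as fp_in, open("dedup_out.csv", "w", newline="") as fp_out:
    reader = csv.DictReader(fp_in)
    writer = csv.DictWriter(fp_out, reader.fieldnames)
    writer.writeheader()
    for key, group in itertools.groupby(reader, key=lambda row: row["ORIG"]):
        group = list(group)
        # Keep only the non-zero rows, unless the whole group consists of zeros.
        nonzero = [row for row in group if float(row["SEQ_TIME"]) != 0]
        writer.writerows(nonzero if nonzero else group)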

Write last three entries per name in a file

I have the following data in a file:
Sarah,10
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
I would like to keep the last three rows for each person. The output would be:
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
In the example, the first row for Sarah was removed since there where three later rows. The rows in the output also maintain the same order as the rows in the input. How can I do this?
Additional Information
You are all amazing, thank you so much. The final code, which seems to have been deleted from this post, is:
import collections

with open("Class2.txt", mode="r", encoding="utf-8") as fp:
    count = collections.defaultdict(int)
    rev = reversed(fp.readlines())
    rev_out = []
    for line in rev:
        name, value = line.split(',')
        if count[name] >= 3:
            continue
        count[name] += 1
        rev_out.append((name, value))

out = list(reversed(rev_out))
print(out)
Since this looks like csv data, use the csv module to read and write it. As you read each line, store the rows grouped by the first column. Store the line number along with each row so that the output can maintain the same order as the input. Use a bounded deque to keep only the last three rows for each name. Finally, sort the rows and write them out.
import csv
from collections import defaultdict, deque

# Each name maps to a bounded deque holding its last three (line_number, row) pairs.
by_name = defaultdict(lambda: deque(maxlen=3))

with open('my_data.csv', newline='') as f_in:
    for i, row in enumerate(csv.reader(f_in)):
        by_name[row[0]].append((i, row))

# Sort the rows for each name by line number, then discard the number.
rows = [row for i, row in sorted(pair for pairs in by_name.values() for pair in pairs)]

with open('out_data.csv', 'w', newline='') as f_out:
    csv.writer(f_out).writerows(rows)
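Because each deque is bounded with maxlen=3, appending a fourth row for a name silently discards that name's oldest row, so no separate pruning step is needed. For the sample input above, this produces exactly the five expected rows, in the original input order.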

if row[0] in row[1] print row

I have a csv file that has 2 columns. I am simply trying to figure out if each row[0] value is in some row[1] and, if so, to print the row.
Items in csv file:
COL1, COL2
1-A, 1-A
1-B, 2-A
2-A, 1-B
2565, 2565
51Bc, 51Bc
5161, 56
811, 65
681, 11
55, 3
3, 55
Code:
import csv

doc = csv.reader(open('file.csv', 'rb'))
for row in doc:
    if row[0] in row[1]:
        print row[0]
The end result should be:
1-A
1-B
2-A
2565
51Bc
55
3
Instead, it is giving me:
1-A
2565
51Bc
It prints those values because they happen to sit side by side in the same row. What I need is to take each item in COL1 and check whether it appears anywhere in the entire COL2 column, printing it if it does, not just compare values that sit next to each other.
When you say for row in doc, each iteration only gets one pair of elements and puts them in row, so there is no possible way row[1] can hold that entire column at any point in time. You need an initial loop to collect that column as a list, then a second loop through the csv data to do the comparison. Actually, you can store both columns in separate containers and only have to open the file once:
import csv

doc = csv.reader(open('file.csv', 'rb'))

# Build the containers.
first_col = []
second_col = set()
for row in doc:
    first_col.append(row[0])
    second_col.add(row[1])

# Now actually do the comparison.
for item in first_col:
    if item in second_col:
        print item
As per abarnert's suggestion, we're using a set() for the second column. Sets are optimized for looking up values, which is all we do with second_col. A list is optimized for looping through every element, which is what we do with first_col, so a list makes more sense there.
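(As a rough rule of thumb, a membership test on a set is O(1) on average versus O(n) for a list, so for a file with n rows this keeps the whole comparison pass at O(n) rather than O(n^2).)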
