Joining rows in CSV with Python

I have a CSV (~1.5m rows) in the following format:
id, tag1, tag2, name1, value1
There are several rows with the same id. Rows with the same id always have the same tag1 and tag2, so what I want to do is append to the end of the row the name1, value1 fields, which will be different.
Example:
Original:
id,tag1,tag2,name1,value1
12,orange,car,john,32
13,green,bike,george,23
12,orange,car,elen,21
Final:
id,tag1,tag2,name1,value1
12,orange,car,john,32,elen,21
13,green,bike,george,23
The only way I can think of is a brute-force script in Python: create a dictionary keyed by id, with a list of all the other fields as the value. Each time I find an id that is already in the dictionary, I just append the last two fields to that list.
However, that is not the most efficient way to do it on such a big file. Is there any other way to do it, maybe with a library?
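For reference, the brute-force approach described above might look roughly like this (a sketch; the in.csv/out.csv filenames and the assumption of a header row are mine):
import csv
from collections import OrderedDict

# Key by id; append each additional name/value pair to the existing row.
rows = OrderedDict()
with open('in.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for id_, tag1, tag2, name, value in reader:
        if id_ in rows:
            rows[id_].extend([name, value])
        else:
            rows[id_] = [id_, tag1, tag2, name, value]

with open('out.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows.values())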

Kay's suggestion using sorted input data could look something like this:
with open('in.txt') as infile, open('out.txt', mode='w') as outfile:
    # Prime the first line
    line = infile.readline()
    # When collating lines, running_line will look like:
    # ['id,tag1,tag2', 'name1', 'value1', 'name2', 'value2', ...]
    # Prime it with just the 'id,tag1,tag2' of the first line
    running_line = [line[:-1].rsplit(',', 2)[0]]
    while line:
        curr_it12, name, value = line[:-1].rsplit(',', 2)
        if running_line[0] == curr_it12:
            # Current line's id/tag1/tag2 matches previous line's.
            running_line.extend([name, value])
        else:
            # Current line's id/tag1/tag2 doesn't match. Output the previous...
            outfile.write(','.join(running_line) + '\n')
            # ...and start a new running_line
            running_line = [curr_it12, name, value]
        # Grab the next line
        line = infile.readline()
    # Flush the last line
    outfile.write(','.join(running_line) + '\n')
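This assumes in.txt is already grouped by id/tag1/tag2. If it isn't, one way to get it into that order first (a sketch, assuming the data fits in memory, a header row, and a hypothetical unsorted.txt source file) is:
with open('unsorted.txt') as f:
    header = f.readline()
    lines = f.readlines()

# Sorting by the id column groups identical ids together,
# which is all the collating loop above needs.
lines.sort(key=lambda line: line.split(',', 1)[0])

with open('in.txt', 'w') as f:
    f.write(header)
    f.writelines(lines)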

Related

Checking if csv files have same items

I've got two .csv files, one that has info1 and one that has info2. The files look like this:
File1:
20170101,,,d,4,f,SWE
20170102,a,,,d,f,r,RUS <-
File2:
20170102,a,s,w,,,,RUS <-
20170103,d,r,,,,FIN
I want to combine these two lines (marked as "<-") and make a combined line like this:
20170102,a,s,w,d,f,r,RUS
I know that I could write a script similar to this:
for row1 in csv_file1:
    for row2 in csv_file2:
        if (row1[0] == row2[0] and row1[1] == row2[1]):
            do something
Is there any other way to find out which rows have the same items at the beginning, or is this the only way? It is a pretty slow way to find the similarities, and it takes several minutes to run on 100,000-row files.
Your implementation is O(n^2), comparing all lines in one file with all lines in another. Even worse if you re-read the second file for each line in the first file.
You could significantly speed this up by building an index from the content of the first file. The index could be as simple as a dictionary, with the first column of the file as key, and the line as value.
You can build that index in one pass over the first file, and then make one pass over the second file, checking for each line whether the id is in the index. If it is, print the merged line.
index = {row[0]: row for row in csv_file1}
for row in csv_file2:
    if row[0] in index:
        pass  # do something
Special thanks to @martineau for the dict comprehension version of building the index.
If there can be multiple items with the same id in the first file,
then the index could point to a list of those rows:
index = {}
for row in csv_file1:
    key = row[0]
    if key not in index:
        index[key] = []
    index[key].append(row)
This could be simplified a bit using defaultdict:
from collections import defaultdict
index = defaultdict(list)
for row in csv_file1:
    index[row[0]].append(row)
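The "do something" step is the actual merge. One hedged sketch (the merge_rows helper and the prefer-the-non-empty-field rule are assumptions of mine; it follows the defaultdict version above, where the index maps each id to a list of rows):
def merge_rows(row1, row2):
    # Column by column, take the field from row1 unless it is empty.
    return [a if a else b for a, b in zip(row1, row2)]

for row in csv_file2:
    for match in index.get(row[0], []):
        print(','.join(merge_rows(match, row)))
For the two example lines marked "<-", merging the File1 row with the File2 row this way gives back 20170102,a,s,w,d,f,r,RUS.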

Write data from one csv to another python

I have three CSV files with the attributes Product_ID, Name, Cost, and Description. Each file contains Product_ID. I want to combine Name (file1), Cost (file2), and Description (file3) into a new CSV file with Product_ID and all three attributes. I need efficient code, as the files contain over 130,000 rows.
After combining all the data into the new file, I have to load that data into a dictionary, with Product_ID as the key and Name, Cost, Description as the value.
It might be more efficient to read each input .csv into a dictionary before creating your aggregated result.
Here's a solution for reading in each file and storing the columns in a dictionary with Product_IDs as the keys. I assume that each Product_ID value exists in each file and that headers are included. I also assume that there are no duplicate columns across the files aside from Product_ID.
import csv
from collections import defaultdict

entries = defaultdict(list)
files = ['names.csv', 'costs.csv', 'descriptions.csv']
headers = ['Product_ID']

for filename in files:
    with open(filename, 'rU') as f:    # Open each file in files.
        reader = csv.reader(f)         # Create a reader to iterate csv lines
        heads = next(reader)           # Grab first line (headers)
        pk = heads.index(headers[0])   # Get the position of 'Product_ID' in the list of headers
        # Add the rest of the headers to the list of collected columns (skip 'Product_ID')
        headers.extend([x for i, x in enumerate(heads) if i != pk])
        for row in reader:
            # For each line, add new values (except 'Product_ID') to the
            # entries dict with the line's Product_ID value as the key
            entries[row[pk]].extend([x for i, x in enumerate(row) if i != pk])

writer = csv.writer(open('result.csv', 'wb'))  # Open file to write csv lines
writer.writerow(headers)                       # Write the headers first
for key, value in entries.items():
    # Write each Product_ID concatenated with the other collected values
    writer.writerow([key] + value)
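To then load result.csv into the dictionary the question asks for (Product_ID as key, the remaining columns as value), a minimal follow-up sketch could be:
import csv

products = {}
with open('result.csv') as f:
    reader = csv.reader(f)
    next(reader)                    # skip the header row
    for row in reader:
        products[row[0]] = row[1:]  # Product_ID -> [Name, Cost, Description]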
A general solution that produces a record (possibly incomplete) for each id it encounters while processing the three files needs a specialized data structure, which fortunately is just a list with a preassigned number of slots:
d = {id: [name, None, None] for id, name in [line.strip().split(',') for line in open(fn1)]}
for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d:
        d[id][1] = cost
    else:
        d[id] = [None, cost, None]
for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d:
        d[id][2] = desc
    else:
        d[id] = [None, None, desc]
for id in d:
    if all(d[id]):
        print ','.join([id]+d[id])
    else:
        # for this id you don't have complete info,
        # so you have to decide on your own what you want; I have to
        pass
If you are sure that you don't want to further process incomplete records, the code above can be simplified
d = {id: [name] for id, name in [line.strip().split(',') for line in open(fn1)]}
for line in open(fn2):
    id, cost = line.strip().split(',')
    if id in d: d[id].append(cost)
for line in open(fn3):
    id, desc = line.strip().split(',')
    if id in d: d[id].append(desc)
for id in d:
    if len(d[id]) == 3: print ','.join([id]+d[id])

Remove duplicate rows in CSV comparing data in only two columns with Python

There are likely many ways to go about this, but here's the gist when it comes down to it:
I have two databases full of people, both exported into csv files. One of the databases is being decommissioned. I need to compare each csv file (or a combined version of the two) and filter out all non-unique people in the soon-to-be decommissioned server. This way I can import only unique people from the decommissioned database into the current database.
I only need to compare FirstName and LastName (which are two separate columns). Part of the problem is that they are not exact duplicates: the names are all capitalized in one database and vary in the other.
Here is an example of the data when I combine the two csv files into one. The all CAPS names are from the current database (which is how the csv is currently formatted):
FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor ,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023
Would be parsed into:
Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
Parsing the other columns is irrelevant, but obviously the data is very relevant, so it must remain untouched. (There are actually dozens of other columns, not just three).
To get an idea of how many duplicates I actually had, I ran this script (taken from a previous post):
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
Too simple for my needs though.
Using a set is no good unless you want to keep one line for each recurring name rather than only the lines whose names are unique. To keep only the unique ones, you need to find the unique values by looking through the whole file first, which a Counter dict will do:
with open("test.csv", encoding="utf-8") as f, open("file_out.csv", "w") as out:
    from collections import Counter
    from csv import reader, writer
    wr = writer(out)
    header = next(f)  # get header
    # get count of each first/last name pair, lowercasing each string
    counts = Counter((a.lower(), b.lower()) for a, b, *_ in reader(f))
    f.seek(0)  # reset the file pointer
    out.write(next(f))  # write the header
    # iterate over the file again, only keeping rows which have
    # unique first and second names
    wr.writerows(row for row in reader(f)
                 if counts[row[0].lower(), row[1].lower()] == 1)
Input:
FirstName,LastName,id,id2,id3
John,Doe,123,432,645
Jacob,Smith,456,372,383
Susy,Saucy,9999,12,8r83
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
JOHN,DOE,999,888,999
SUSY,SAUCY,8373,08j,9023
file_out:
FirstName,LastName,id,id2,id3
Jacob,Smith,456,372,383
Contractor,#1,8dh,28j,153s
Testing2,Contrator,7463,99999,0283
counts records how many times each name pair appears after being lowercased. We then reset the file pointer and only write lines whose first two column values are seen exactly once in the whole file.
Or without the csv module, which may be faster if you have many columns:
with open("test.csv") as f, open("file_out.csv", "w") as out:
    from collections import Counter
    header = next(f)  # get header
    counts = Counter(tuple(map(str.lower, line.split(",", 2)[:2])) for line in f)
    f.seek(0)          # back to start of file
    next(f)            # skip the header again
    out.write(header)  # write original header
    out.writelines(line for line in f
                   if counts[tuple(map(str.lower, line.split(",", 2)[:2]))] == 1)
You could use the pandas package for this
import pandas as pd
import StringIO
Replace the StringIO objects with the paths to your csv files:
df1 = pd.read_table(StringIO.StringIO('''FirstName LastName id id2 id3
John Doe 123 432 645
Jacob Smith 456 372 383
Susy Saucy 9999 12 8r83
Contractor #1 8dh 28j 153s
Testing2 Contrator 7463 99999 0283'''), delim_whitespace=True)
df2 = pd.read_table(StringIO.StringIO('''FirstName LastName id id2 id3
JOHN DOE 999 888 999
SUSY SAUCY 8373 08j 9023'''), delim_whitespace=True)
Concatenate and uppercase the names
df1['name'] = (df1.FirstName + df1.LastName).str.upper()
df2['name'] = (df2.FirstName + df2.LastName).str.upper()
Select rows from df1 that do not match names from df2
df1[~df1.name.isin(df2.name)]
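If you then want to save the remaining rows (dropping the helper name column; the unique.csv filename is just an example), something along these lines should work:
unique_rows = df1[~df1.name.isin(df2.name)]
unique_rows.drop('name', axis=1).to_csv('unique.csv', index=False)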
You can keep the idea of using a set. Just define a function that will return what you are interested in:
def name(line):
    line = line.split(',')
    n = ' '.join(line[:2])
    return n.lower()
Without concatenating the two databases, read the names in the current database into a set.
with open('current.csv') as f:
    next(f)
    current_db = {name(line) for line in f}
Check the names in the decommissioned db and write them if not seen.
with open('decommissioned.csv') as old, open('unique.csv', 'w') as out:
    next(old)
    for line in old:
        if name(line) not in current_db:
            out.write(line)
You need to operate on a case-insensitive concatenation of the names. For instance:
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        field_list = line.split(',')
        key_name = (field_list[0] + "_" + field_list[1]).lower()
        if key_name in seen: continue  # skip duplicate
        seen.add(key_name)
        out_file.write(line)
Changed, since the data is in csv format:
from collections import defaultdict
import re

dd = defaultdict(list)
d = {}
with open("data") as f:
    for line in f:
        line = line.strip().lower()
        mobj = re.match(r'(\w+),(\w+|#\d),(.*)', line)
        firstf, secondf, rest = mobj.groups()
        key = firstf + "_" + secondf
        d[key] = rest
        dd[key].append(rest)
for k, v in d.items():
    print(k, v)
output
jacob_smith 456,372,383
testing2_contrator 7463,99999,0283
john_doe 999,888,999
susy_saucy 8373,08j,9023
contractor_#1 8dh,28j,153s
for k, v in dd.items():
    print(k, v)
output
jacob_smith ['456,372,383']
testing2_contrator ['7463,99999,0283']
john_doe ['123,432,645', '999,888,999']
susy_saucy ['9999,12,8r83', '8373,08j,9023']
contractor_#1 ['8dh,28j,153s']

Trouble with sorting list and "for" statement syntax

I need help sorting a list from a text file. I'm reading a .txt and then adding some data, then sorting it by population change %, then lastly, writing that to a new text file.
The only thing that's giving me trouble now is the sort function. I think the for statement syntax is what's giving me issues -- I'm unsure where in the code I would add the sort statement and how I would apply it to the output of the for loop statement.
The population change data I am trying to sort by is the [1] item in the list.
#Read file into script
NCFile = open("C:\filelocation\NC2010.txt")
#Save a write file
PopulationChange = open("C:\filelocation\Sorted_Population_Change_Output.txt", "w")
#Read everything into lines, except for first(header) row
lines = NCFile.readlines()[1:]
#Pull relevant data and create population change variable
for aLine in lines:
    dataRow = aLine.split(",")
    countyName = dataRow[1]
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    popChange = ((population2010-population2000)/population2000)*100
    outputRow = countyName + ", %.2f" %popChange + "%\n"
    PopulationChange.write(outputRow)
NCFile.close()
PopulationChange.close()
You can fix your issue with a couple of minor changes. Split the line as you read it in and loop over the sorted lines:
lines = [aLine.split(',') for aLine in NCFile][1:]
#Pull relevant data and create population change variable
for dataRow in sorted(lines, key=lambda row: row[1]):
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    ...
However, if this is a csv you might want to look into the csv module. In particular DictReader will read in the data as a list of dictionaries based on the header row. I'm making up the field names below, but you should get the idea. You'll notice I sort the data based on 'countyName' as it is read in:
from csv import DictReader, DictWriter

with open("C:\filelocation\NC2010.txt") as NCFile:
    reader = DictReader(NCFile)
    data = sorted(reader, key=lambda row: row['countyName'])

for row in data:
    population2000 = float(row['population2000'])
    population2010 = float(row['population2010'])
    popChange = ((population2010-population2000)/population2000)*100
    row['popChange'] = "{0:.2f}".format(popChange)

with open("C:\filelocation\Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    # extrasaction='ignore' lets us write only the two columns we care about
    writer = DictWriter(PopulationChange, fieldnames=['countyName', 'popChange'],
                        extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)
This will give you a 2-column csv of ['countyName', 'popChange']. You would need to correct this with the correct fieldnames.
You need to read all of the lines in the file before you can sort it. I've created a list called change to hold the tuple pair of the population change and the country name. This list is sorted and then saved.
with open("NC2010.txt") as NCFile:
    lines = NCFile.readlines()[1:]

change = []
for line in lines:
    row = line.split(",")
    country_name = row[1]
    population_2000 = float(row[6])
    population_2010 = float(row[8])
    pop_change = ((population_2010 / population_2000) - 1) * 100
    change.append((pop_change, country_name))

change.sort()

output_rows = ["{0}, {1:.2f}\n".format(name, pct) for pct, name in change]

with open("Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    PopulationChange.writelines(output_rows)
I used a list comprehension to generate the output rows which swaps the pair back in the desired order, i.e. country name first.

Write last three entries per name in a file

I have the following data in a file:
Sarah,10
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
I would like to keep the last three rows for each person. The output would be:
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
In the example, the first row for Sarah was removed since there were three later rows. The rows in the output also maintain the same order as the rows in the input. How can I do this?
Additional Information
You are all amazing - thank you so much. The final code, which seems to have been deleted from this post, is:
import collections

with open("Class2.txt", mode="r", encoding="utf-8") as fp:
    count = collections.defaultdict(int)
    rev = reversed(fp.readlines())
    rev_out = []
    for line in rev:
        name, value = line.split(',')
        if count[name] >= 3:
            continue
        count[name] += 1
        rev_out.append((name, value))
    out = list(reversed(rev_out))
    print(out)
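If the kept rows should also be written back to a file (the Class2_last3.txt filename here is just an example), a small follow-up could be:
with open("Class2_last3.txt", mode="w", encoding="utf-8") as out_fp:
    for name, value in out:
        # value still carries the newline from the original line
        out_fp.write(name + ',' + value)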
Since this looks like csv data, use the csv module to read and write it. As you read each line, store the rows grouped by the first column. Store the line number along with the row so that they can be written out maintaining the same order as the input. Use a bounded deque to keep only the last three rows for each name. Finally, sort the rows and write them out.
import csv
from collections import defaultdict, deque

by_name = defaultdict(lambda: deque(maxlen=3))
with open('my_data.csv') as f_in:
    for i, row in enumerate(csv.reader(f_in)):
        by_name[row[0]].append((i, row))

# sort the rows for each name by line number, discarding the number
rows = [pair[1] for pair in sorted((pair for value in by_name.values() for pair in value),
                                   key=lambda pair: pair[0])]

with open('out_data.csv', 'w', newline='') as f_out:
    csv.writer(f_out).writerows(rows)
