Python merge csv files with matching Index - python

I want to merge two CSV files based on a shared field.
The first one looks like this:
ID, field1, field2
1,a,green
2,b,white
2,b,red
2,b,blue
3,c,black
The second one looks like:
ID, field3
1,value1
2,value2
What I want to have is:
ID, field1, field2,field3
1,a,green,value1
2,b,white,value2
2,b,red,value2
2,b,blue,value2
3,c,black,''
I'm using PyDev on Eclipse. This is what I have so far:
import csv

endings0 = []
endings1 = []
with open("salaries.csv") as book0:
    for line in book0:
        endings0.append(line.split(',')[-1])
        endings1.append(line.split(',')[0])

linecounter = 0
res = open("result.csv", "w")
with open('total.csv') as book2:
    for line in book2:
        # if not header line:
        l = line.split(',')[0]
        for linecounter in range(0, endings1.__len__()):
            if( l == endings1[linecounter]):
                res.writelines(line.replace("\n", "") + ',' + str(endings0[linecounter]))
print("done")

There are a bunch of things wrong with what you're doing.
You should really be using the classes in the csv module to read and write CSV files. Importing the module isn't enough; you actually need to call its functions.
You should never find yourself typing endings1.__len__(). Use len(endings1) instead.
You should never find yourself typing for linecounter in range(0, len(endings1)).
Use either for linecounter, ending in enumerate(endings1),
or better yet for end0, end1 in zip(endings0, endings1).
A dictionary is a much better data structure for lookup than a pair of parallel lists. To quote Pike:
If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.
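To make that point concrete, here is the same ID lookup done both ways, with toy data standing in for the question's lists:

```python
# Parallel lists, as in the question: every lookup is a linear scan.
endings1 = ['1', '2', '3']          # IDs
endings0 = ['100', '200', '300']    # values keyed by position

def lookup(key):
    for i in range(len(endings1)):
        if endings1[i] == key:
            return endings0[i]
    return ''

# The same mapping as a dict: built once, then each lookup is O(1).
id_to_value = dict(zip(endings1, endings0))

assert lookup('2') == id_to_value.get('2', '') == '200'
assert lookup('9') == id_to_value.get('9', '') == ''
```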
Here's how I'd do it:
import csv

with open('second.csv') as f:
    # look, a builtin to read CSV file lines as dictionaries!
    reader = csv.DictReader(f)
    # build a mapping of ID to field3
    id_to_field3 = {row['ID']: row['field3'] for row in reader}

# you can put more than one open inside a with statement
with open('first.csv') as f, open('result.csv', 'w') as fo:
    # csv even has a class to write files!
    reader = csv.DictReader(f)
    res = csv.DictWriter(fo, fieldnames=reader.fieldnames + ['field3'])
    res.writeheader()
    for row in reader:
        # .get returns its second argument if there was no match
        row['field3'] = id_to_field3.get(row['ID'], '')
        res.writerow(row)

I have a high-level solution for you.
Deserialize your first CSV into dict1, mapping each ID to a list of [field1, field2] lists (a list of lists, since IDs can repeat).
Deserialize your second CSV into dict2, mapping ID to field3.
For each (id, rows) pair in dict1, do rows.append(dict2.setdefault(id, '')), then serialize it back into CSV using whatever serializer you were using before.
I used the dictionary's setdefault because I noticed that ID 3 is in the first CSV file but not the second.
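A minimal sketch of those steps (in-memory strings stand in for the two files, headers normalized without spaces; I look up field3 at write time rather than mutating the lists, but the setdefault idea is the same):

```python
import csv
import io

# In-memory stand-ins for the two input files.
first = "ID,field1,field2\n1,a,green\n2,b,white\n2,b,red\n2,b,blue\n3,c,black\n"
second = "ID,field3\n1,value1\n2,value2\n"

# Step 1: dict1 maps each ID to a list of [field1, field2] lists (IDs repeat).
reader = csv.reader(io.StringIO(first))
header = next(reader)
dict1 = {}
for row in reader:
    dict1.setdefault(row[0], []).append(row[1:])

# Step 2: dict2 maps ID to field3.
reader = csv.reader(io.StringIO(second))
next(reader)
dict2 = {row[0]: row[1] for row in reader}

# Step 3: look up field3 with setdefault ('' covers ID 3, which is missing
# from the second file) and serialize back to CSV.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header + ['field3'])
for id_, rows in dict1.items():
    field3 = dict2.setdefault(id_, '')
    for rest in rows:
        writer.writerow([id_] + rest + [field3])

print(out.getvalue())
```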

Related

Can I use a Dictionary to store keywords that I need to loop through a csv file to find?

I am writing a Python script that goes through a CSV file row by row looking for a keyword. Once that keyword is found, I need to write that entire row to a new .csv file. I am having trouble writing the for loop for this, and I do not understand how to write to a new .csv file. I will post what I have done so far below.
#!/usr/bin/python
# First import csv module to work on .csv file
import csv

# Lets open the files that will be used to read from and write to
infile = open('infile.csv', 'rb')
outfile = open('outfile.csv', 'w')

# Lets pass file object through csv reader method
csv_f_reader = csv.reader(infile)
writer = csv.writer(outfile, delimiter=',', quotechar='', quoting=csv.QUOTE_NONE)

# Lets create a dictionary to hold all search words as unique keys.
# The associated value will be used to keep count of how many successful
# hits the for loop makes.
search_word={'wifi':0,'WIFI':0,'wi-fi':0,'Wi-Fi':0,'Cisco':0,'cisco':0,'NETGEAR':0,'netgear':0,'Netge$

for csv_line in csv_f_reader:
    match_found = False
    for keyword in search_word.keys():
        for csv_element in csv_line:
            if keyword in csv_element:
                match_found = True
                search_word[keyword] += 1
    if match_found:
        writer.writerow(csv_line)

# Dont forget to close the files
infile.close()
outfile.close()
print search_word.keys(), search_word.values()
You really don't need to use a dictionary for your keywords here. Err... wait, you also want to keep track of how many times you see each keyword; your description didn't say that.
Anyway, you should be looping through the lines in the file and then through the keys. The loop should probably look like this:
for line in csv_f_reader:
    for keyword in search_word.keys():
        if keyword in line:
            search_word[keyword] += 1
            writer.writerow(line)
infile.close()
outfile.close()
I haven't double checked that you're using the csv module correctly, but that should give you an idea of what it should look like.
You don't need a dictionary for what you're describing (unless you're trying to count up the keyword instances). search_word.keys() gives you a list anyway which is OK.
First you want to iterate through the csv like this:
infile = open('infile.csv')
csv_f_reader = csv.reader(infile)
for csv_line in csv_f_reader:
    print csv_line
If you try that, you'll see that each line gives you a list of all the elements. You can use your list of keywords to compare against each one, writing the lines that match:
for csv_line in csv_f_reader:
    for k in search_word.keys():
        if k in csv_line:
            writer.writerow(csv_line)
In your case, the keywords aren't exactly the same as the CSV elements, they're inside them. We can deal with this by checking the elements for substrings:
for csv_line in csv_f_reader:
    match_found = False
    for k in search_word.keys():
        for csv_element in csv_line:
            if k in csv_element:
                match_found = True
    if match_found:
        writer.writerow(csv_line)
One other thing: you need to open the output file in write mode with:
outfile = open('outfile.csv', 'w')

I have two csv's .. need to compare and print result if different in python

I have two CSVs holding results for the same files, like:
File, Result
a.pdf, malicious
b.pdf, non-malicious
c.pdf, malicious
and a second CSV with results for the same files, like:
File, Result
a.pdf, non-malicious
b.pdf, malicious
c.pdf, non-malicious
I need to compare both and print out the file names that have different results... but in Python.
Use the csv module in Python to read in the files:
https://docs.python.org/2/library/csv.html
csv.DictReader is a good choice. You'll get a sequence of row dictionaries from each file, which you can iterate over and compare by keys.
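A minimal sketch of that approach (column names File/Result come from the question's headers; in-memory strings stand in for the two files, and I assume comma-separated data):

```python
import csv
import io

a_csv = "File,Result\na.pdf,malicious\nb.pdf,non-malicious\nc.pdf,malicious\n"
b_csv = "File,Result\na.pdf,non-malicious\nb.pdf,malicious\nc.pdf,non-malicious\n"

# Build {File: Result} from each CSV, then compare by key.
results_a = {row['File']: row['Result'] for row in csv.DictReader(io.StringIO(a_csv))}
results_b = {row['File']: row['Result'] for row in csv.DictReader(io.StringIO(b_csv))}

# Collect every file name whose two results disagree.
different = [f for f in results_a if results_b.get(f) != results_a[f]]
print(different)
```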
Try this one:
import csv

with open('A.csv', newline='') as fileA, open('B.csv', newline='') as fileB:
    readA = csv.DictReader(fileA)
    # materialize B's rows first; iterating the reader more than once would exhaust it
    rowsB = list(csv.DictReader(fileB))
    fields = ['File', 'Result']
    ListDiff = []
    for rowA in readA:
        for rowB in rowsB:
            if rowA[fields[0]] == rowB[fields[0]] and rowA[fields[1]] != rowB[fields[1]]:
                ListDiff.append(rowA[fields[0]])
                break
    print(ListDiff)

Converting to csv from?

I have got a file with the following lines
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f508f-e7c8-32b8-e044-0003ba298018","municipalityCode":"0766","municipalityName":"Hedensted","streetCode":"0072","streetName":"Værnegården","streetBuildingIdentifier":"13","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"8000","districtName":"Århus","presentationString":"Værnegården 13, 8000 Århus","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(553564 6179299)","x":553564,"y":6179299}]}
I want to transform every line into a csv readable file with headers. Like the following
status,message,data,addressAccessId,municipalityCode,municipalityName,streetCode,streetName,streetBuildingIdentifier,mailDeliverySublocationIdentifier,districtSubDivisionIdentifier,postCodeIdentifier,districtName,presentationString,addressSpecificCount,validCoordinates,geometryWkt,x,y
OK,OK,data:type,addressAccessType,0a3f508f-e7c8-32b8-e044-0003ba298018,0766,Hedensted,0072,Værnegården,13,,,8000,Århus,Værnegården 13, 8000 Århus,1,true,POINT553564 6179299,553564,6179299
How do I accomplish that? Code and explanation are very welcome. So far this is what I have come up with, based on this example: (How can I convert JSON to CSV?)
x = json.loads(x)
f = csv.writer(open('test.csv', 'wb+'))
# Write CSV header; if you don't need that, remove this line
f.writerow(['status', 'message', 'type', 'addressAccessId', 'municipalityCode','municipalityName','streetCode','streetName','streetBuildingIdentifier','mailDeliverySublocationIdentifier','districtSubDivisionIdentifier','postCodeIdentifier','districtName','presentationString','addressSpecificCount','validCoordinates','geometryWkt','x','y'])
for x in x:
    f.writerow([x['status'],
                x['message'],
                x['data']['type'],
                x['data']['addressAccessId'],
                x['data']['municipalityCode'],
                x['data']['municipalityName'],
                x['data']['streetCode'],
                x['data']['streetName'],
                x['data']['streetBuildingIdentifier'],
                x['data']['mailDeliverySublocationIdentifier'],
                x['data']['districtSubDivisionIdentifier'],
                x['data']['postCodeIdentifier'],
                x['data']['districtName'],
                x['data']['presentationString'],
                x['data']['addressSpecificCount'],
                x['data']['validCoordinates'],
                x['data']['geometryWkt'],
                x['data']['x'],
                x['data']['y']])
I have looked through and tried a lot of other solutions, including DictWriter, replace() and translate() to remove characters, but have not yet been able to transform the line to my needs. The purpose is to be able to select the fields that are output into a new file, and to transform x and y to a new coordinate system. But for now I'm just trying to parse the above line into a CSV file. Can anyone offer code and an explanation of their code? Thank you very much for your time.
Below are the first few lines of my addresses.txt
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5081-e039-32b8-e044-0003ba298018","municipalityCode":"0265","municipalityName":"Roskilde","streetCode":"0831","streetName":"Brønsager","streetBuildingIdentifier":"69","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Svogerslev","postCodeIdentifier":"4000","districtName":"Roskilde","presentationString":"Brønsager 69, 4000 Roskilde","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(690026 6169309)","x":690026,"y":6169309}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5089-ecab-32b8-e044-0003ba298018","municipalityCode":"0461","municipalityName":"Odense","streetCode":"9505","streetName":"Vægtens Kvarter","streetBuildingIdentifier":"271","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Holluf Pile","postCodeIdentifier":"5220","districtName":"Odense SØ","presentationString":"Vægtens Kvarter 271, 5220 Odense SØ","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(592191 6135829)","x":592191,"y":6135829}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f507c-adc3-32b8-e044-0003ba298018","municipalityCode":"0165","municipalityName":"Albertslund","streetCode":"0445","streetName":"Skyttehusene","streetBuildingIdentifier":"33","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"2620","districtName":"Albertslund","presentationString":"Skyttehusene 33, 2620 Albertslund","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(711079 6174741)","x":711079,"y":6174741}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f509c-7f57-32b8-e044-0003ba298018","municipalityCode":"0851","municipalityName":"Aalborg","streetCode":"5205","streetName":"Løvstikkevej","streetBuildingIdentifier":"36","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"9000","districtName":"Aalborg","presentationString":"Løvstikkevej 36, 9000 Aalborg","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(552407 6322490)","x":552407,"y":6322490}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5098-32a6-32b8-e044-0003ba298018","municipalityCode":"0779","municipalityName":"Skive","streetCode":"0462","streetName":"Landevejen","streetBuildingIdentifier":"52","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Håsum","postCodeIdentifier":"7860","districtName":"Spøttrup","presentationString":"Landevejen 52, 7860 Spøttrup","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(491515 6269739)","x":491515,"y":6269739}]}
Note that the data key holds a list of dictionaries. x['data']['type'] wouldn't work, but x['data'][0]['type'] would. There might be more than one such dictionary in that list, however. I'll assume you want a CSV row per x['data'] dictionary.
Next, it appears you have a UTF-8 BOM on every line; whatever wrote this was not using UTF-8 encoding correctly. We need to strip this marker, the first 3 characters.
Last, JSON strings are always Unicode data, and you have non-ASCII characters in your data, so you'll have to encode to bytestrings again before passing the data to the CSV writer object.
I'd use csv.DictWriter here, with a pre-defined list of field names:
import codecs
import csv
import json

fields = [
    'status', 'message', 'type', 'addressAccessId', 'municipalityCode',
    'municipalityName', 'streetCode', 'streetName', 'streetBuildingIdentifier',
    'mailDeliverySublocationIdentifier', 'districtSubDivisionIdentifier',
    'postCodeIdentifier', 'districtName', 'presentationString', 'addressSpecificCount',
    'validCoordinates', 'geometryWkt', 'x', 'y']

with open('test.csv', 'wb') as csvfile, open('jsonfile', 'r') as jsonfile:
    writer = csv.DictWriter(csvfile, fields)
    writer.writeheader()
    for line in jsonfile:
        if line.startswith(codecs.BOM_UTF8):
            line = line[3:]
        entry = json.loads(line)
        for item in entry['data']:
            row = dict(item, status=entry['status'], message=entry['message'])
            row = {k.encode('utf8'): unicode(v).encode('utf8') for k, v in row.iteritems()}
            writer.writerow(row)
The row dictionary is basically a copy of each of the dictionaries in the entry['data'] list, with the status and message keys copied over separately. This makes row a flat dictionary instead.
I also read your input file line by line, as you say that each line contains a separate JSON entry.
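The dict(item, status=..., message=...) flattening step can be tried in isolation (toy values taken from the sample line):

```python
# dict(d, key=value, ...) copies d and adds the keyword entries,
# flattening a nested record into one flat row dictionary.
item = {'type': 'addressAccessType', 'x': 553564, 'y': 6179299}
row = dict(item, status='OK', message='OK')
print(row['status'], row['x'])
```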
Open the output file with csv.DictWriter() and define the output header fields as you specified. Use extrasaction='ignore' and restval='' as options.
Look at Opening A large JSON file in Python with no newlines for csv conversion Python 2.6.6 for help with processing large files, as I had a similar question; also look at the question I link to there.
I built a similar type of system from a JSON using appropriate loops.
for example,
def parse_row(currdata):
    outx = {}
    # currdata is defined earlier to point to the x['data'] dictionary
    for eachx in currdata:
        outx[eachx] = currdata[eachx]
    return outx
where this is in a function with currdata as an argument and called with x['data'][row] as the input argument.
rows = len(x['data'])
for row in range(rows):
    outx = parse_row(x['data'][row])
    # process the row and create output
This should allow you to set up the parsing properly. I cannot copy the actual code into this answer but this should point you to a solution.

Write key to separate csv based on value in dictionary

[Using Python 3] I have a CSV file with two columns (an email address and a country code; the script actually makes it two columns if that's not the case in the original file, kind of) that I want to split by the value in the second column, writing the rows to separate CSV files.
eppetj#desrfpkwpwmhdc.com us ==> output-us.csv
uheuyvhy#zyetccm.com de ==> output-de.csv
avpxhbdt#reywimmujbwm.com es ==> output-es.csv
gqcottyqmy#romeajpui.com it ==> output-it.csv
qscar#tpcptkfuaiod.com fr ==> output-fr.csv
qshxvlngi#oxnzjbdpvlwaem.com gb ==> output-gb.csv
vztybzbxqq#gahvg.com us ==> output-us.csv
... ... ...
Currently my code kind of does this, but instead of writing each email address to its CSV, it overwrites the one placed there before. Can someone help me out with this?
I am very new to programming and Python and I might not have written the code in the most pythonic way, so I would really appreciate any feedback on the code in general!
Thanks in advance!
Code:
import csv

def tsv_to_dict(filename):
    """Creates a reader of a specified .tsv file."""
    with open(filename, 'r') as f:
        reader = csv.reader(f, delimiter='\t')  # '\t' implies tab
        email_list = []
        # Checks each list in the reader list and removes empty elements
        for lst in reader:
            email_list.append([elem for elem in lst if elem != ''])  # List comprehension
    # Stores the list of lists as a dict
    email_dict = dict(email_list)
    return email_dict

def count_keys(dictionary):
    """Counts the number of entries in a dictionary."""
    return len(dictionary.keys())

def clean_dict(dictionary):
    """Removes all whitespace in keys from specified dictionary."""
    return {k.strip(): v for k, v in dictionary.items()}  # Dictionary comprehension

def split_emails(dictionary):
    """Splits out all email addresses from dictionary into output csv files by country code."""
    # Creating a list of unique country codes
    cc_list = []
    for v in dictionary.values():
        if not v in cc_list:
            cc_list.append(v)
    # Writing the email addresses to a csv based on the cc (value) in dictionary
    for key, value in dictionary.items():
        for c in cc_list:
            if c == value:
                with open('output-' + str(c) + '.csv', 'w') as f_out:
                    writer = csv.writer(f_out, lineterminator='\r\n')
                    writer.writerow([key])
You can simplify this a lot by using a defaultdict:
import csv
from collections import defaultdict

emails = defaultdict(list)

with open('email.tsv', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row:
            if '#' in row[0]:
                emails[row[1].strip()].append(row[0].strip() + '\n')

for key, values in emails.items():
    with open('output-{}.csv'.format(key), 'w') as f:
        f.writelines(values)
As your separated files are not comma separated but single columns, you don't need the csv module; you can simply write the rows.
The emails dictionary contains a key for each country code, and a list of all the matching email addresses. To make sure the email addresses are written correctly, we strip any whitespace and add a line break (so we can use writelines later).
Once the dictionary is populated, it's simply a matter of stepping through the keys to create the files and then writing out the resulting lists.
The problem with your code is that it re-opens the same country output file in write mode each time it writes an entry into it, thereby overwriting whatever might already have been there.
A simple way to avoid that is to open all the output files at once for writing and store them in a dictionary keyed by country code. Likewise, you can have another dictionary that associates each country code with a csv.writer object for that country's output file.
Update: While I agree that Burhan's approach is probably superior, I feel you got the impression that my earlier answer was excessively long because of all the comments it had, so here's another version of essentially the same logic, with minimal comments, to let you better judge its reasonably short true length (even with the context manager).
import csv
from contextlib import contextmanager

@contextmanager  # to manage simultaneous opening and closing of output files
def open_country_csv_files(countries):
    csv_files = {country: open('output-' + country + '.csv', 'w')
                 for country in countries}
    yield csv_files
    for f in csv_files.values():
        f.close()

with open('email.tsv', 'r') as f:
    email_dict = {row[0]: row[1] for row in csv.reader(f, delimiter='\t') if row}

countries = set(email_dict.values())
with open_country_csv_files(countries) as csv_files:
    csv_writers = {country: csv.writer(csv_files[country], lineterminator='\r\n')
                   for country in countries}
    for email_addr, country in email_dict.items():
        csv_writers[country].writerow([email_addr])
Not a Python answer, but maybe you can use this Bash solution.
$ while read email country
> do
>     echo $email >> output-$country.csv
> done < in.csv
This reads the lines from in.csv, splits them into two parts email and country, and appends (>>) the email to the file called output-$country.csv.
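The same append-as-you-go idea works in Python too; opening in 'a' mode avoids the overwrite bug from the question. A sketch (the in.csv here is a synthetic tab-separated file created inline so the example is self-contained):

```python
import csv

# Create a tiny in.csv so the sketch is self-contained (columns: email, country).
rows = [("eppetj#desrfpkwpwmhdc.com", "us"),
        ("uheuyvhy#zyetccm.com", "de"),
        ("vztybzbxqq#gahvg.com", "us")]
with open('in.csv', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(rows)

# Append each email to output-<country>.csv, mirroring the Bash loop.
with open('in.csv', newline='') as f:
    for email, country in csv.reader(f, delimiter='\t'):
        with open('output-{}.csv'.format(country), 'a') as out:
            out.write(email + '\n')
```

Appending keeps every row, but note it also means re-running the script keeps adding to the same files.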

Writing multiple values in single cell in csv

For each user I have the list of events in which he participated, e.g.
bob : [event1, event2, ...]
I want to write it to a CSV file. I created a dictionary (key: user; value: list of events) and wrote it to CSV. The following is sample output:
username, frnds
"abc" ['event1','event2']
where username is the first column and frnds the second.
This is code
writer = csv.writer(open('eventlist.csv', 'ab'))
for key, value in evnt_list.items():
    writer.writerow([key, value])
When I read the CSV back, I am not getting the list directly. Instead I get it in the following way:
['e','v','e','n','t','1','','...]
I also tried writing the list directly to the CSV, but while reading I get the same output.
What I want is multiple values in a single cell so that when I read a column for a row I get list of all events.
e.g.
colA colB
user1,event1,event2,...
I think it's not difficult but somehow I am not getting it.
Reading:
I am reading it with the help of the following code:
reader = csv.reader(open("eventlist.csv"))
reader.next()
for row in reader:
    tmp = row[1]
    print tmp     # it is printing the whole list but
    print tmp[0]  # the output is [
    print tmp[1]  # output is 'e' it should have been 'event1'
    print tmp[2]  # output is 'v' it should have been 'event2'
You have to format your values into a single string:
with open('eventlist.csv', 'ab') as f:
    writer = csv.writer(f, delimiter=' ')
    for key, value in evnt_list.items():
        writer.writerow([key, ','.join(value)])
This exports as:
key1 val11,val12,val13
key2 val21,val22,val23
READING: Here you have to keep in mind that you converted your Python list into a formatted string, so you cannot use standard csv tools to read it back as a list:
with open("eventlist.csv") as f:
    csvr = csv.reader(f, delimiter=' ')
    csvr.next()
    for rec in csvr:
        key, values_txt = rec
        values = values_txt.split(',')
        print key, values
This works as expected.
You seem to be saying that your evnt_list is a dictionary whose keys are strings and whose values are lists of strings. If so, then the CSV-writing code you've given in your question will write a string representation of a Python list into the second column. When you read anything in from CSV, it will just be a string, so once again you'll have a string representation of your list. For example, if you have a cell that contains "['event1', 'event2']" you will be reading in a string whose first character (at position 0) is [, second character is ', third character is e, etc. (I don't think your tmp[1] is right; I think it is really ', not e.)
It sounds like you want to reconstruct the Python object, in this case a list of strings. To do that, use ast.literal_eval:
import ast
cell_string_value = "['event1', 'event2']"
cell_object = ast.literal_eval(cell_string_value)
Incidentally, the reason to use ast.literal_eval instead of just eval is safety. eval allows arbitrary Python expressions and is thus a security risk.
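A sketch of reading such a cell back from a CSV (the inline text mimics what the question's writer would have produced, with the list rendered as a quoted string):

```python
import ast
import csv
import io

# The second column holds the *string* "['event1', 'event2']";
# ast.literal_eval turns it back into a real Python list.
csv_text = """username,frnds
abc,"['event1', 'event2']"
"""

reader = csv.reader(io.StringIO(csv_text))
next(reader)          # skip the header row
for row in reader:
    events = ast.literal_eval(row[1])
    print(row[0], events)
```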
Also, what is the purpose of the CSV, if you want to get the list back as a list? Will people be reading it (in Excel or something)? If not, then you may want to simply save the evnt_list object using pickle or json, and not bother with the CSV at all.
Edit: I should have read more carefully; the data from evnt_list is being appended to the CSV, and neither pickle nor json is easily appendable. So I suppose CSV is a reasonable and lightweight way to accumulate the data. A full-blown database might be better, but that would not be as lightweight.
