Converting JSON lines to csv - python
I have got a file with lines like the following:
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f508f-e7c8-32b8-e044-0003ba298018","municipalityCode":"0766","municipalityName":"Hedensted","streetCode":"0072","streetName":"Værnegården","streetBuildingIdentifier":"13","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"8000","districtName":"Århus","presentationString":"Værnegården 13, 8000 Århus","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(553564 6179299)","x":553564,"y":6179299}]}
I want to transform every line into a CSV-readable file with headers, like the following:
status,message,data,addressAccessId,municipalityCode,municipalityName,streetCode,streetName,streetBuildingIdentifier,mailDeliverySublocationIdentifier,districtSubDivisionIdentifier,postCodeIdentifier,districtName,presentationString,addressSpecificCount,validCoordinates,geometryWkt,x,y
OK,OK,data:type,addressAccessType,0a3f508f-e7c8-32b8-e044-0003ba298018,0766,Hedensted,0072,Værnegården,13,,,8000,Århus,Værnegården 13, 8000 Århus,1,true,POINT553564 6179299,553564,6179299
How do I accomplish that? Code and explanation are very welcome. So far, this is what I have come up with, adapted from this example: (How can I convert JSON to CSV?)
import csv
import json

x = json.loads(x)

f = csv.writer(open('test.csv', 'wb+'))

# Write CSV header; if you don't need that, remove this line
f.writerow(['status', 'message', 'type', 'addressAccessId', 'municipalityCode',
            'municipalityName', 'streetCode', 'streetName', 'streetBuildingIdentifier',
            'mailDeliverySublocationIdentifier', 'districtSubDivisionIdentifier',
            'postCodeIdentifier', 'districtName', 'presentationString',
            'addressSpecificCount', 'validCoordinates', 'geometryWkt', 'x', 'y'])

for x in x:
    f.writerow([x['status'],
                x['message'],
                x['data']['type'],
                x['data']['addressAccessId'],
                x['data']['municipalityCode'],
                x['data']['municipalityName'],
                x['data']['streetCode'],
                x['data']['streetName'],
                x['data']['streetBuildingIdentifier'],
                x['data']['mailDeliverySublocationIdentifier'],
                x['data']['districtSubDivisionIdentifier'],
                x['data']['postCodeIdentifier'],
                x['data']['districtName'],
                x['data']['presentationString'],
                x['data']['addressSpecificCount'],
                x['data']['validCoordinates'],
                x['data']['geometryWkt'],
                x['data']['x'],
                x['data']['y']])
I have looked through and tried a lot of other solutions, including DictWriter, replace() and translate() to remove characters, but have not yet been able to transform the line to my needs. The purpose is to be able to select the fields that are output into a new file, and to transform x and y to a new coordinate system. But for now I'm just trying to parse the above line to a CSV file. Can anyone offer code and an explanation of their code? Thank you very much for your time.
Below are the first few lines of my addresses.txt
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5081-e039-32b8-e044-0003ba298018","municipalityCode":"0265","municipalityName":"Roskilde","streetCode":"0831","streetName":"Brønsager","streetBuildingIdentifier":"69","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Svogerslev","postCodeIdentifier":"4000","districtName":"Roskilde","presentationString":"Brønsager 69, 4000 Roskilde","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(690026 6169309)","x":690026,"y":6169309}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5089-ecab-32b8-e044-0003ba298018","municipalityCode":"0461","municipalityName":"Odense","streetCode":"9505","streetName":"Vægtens Kvarter","streetBuildingIdentifier":"271","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Holluf Pile","postCodeIdentifier":"5220","districtName":"Odense SØ","presentationString":"Vægtens Kvarter 271, 5220 Odense SØ","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(592191 6135829)","x":592191,"y":6135829}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f507c-adc3-32b8-e044-0003ba298018","municipalityCode":"0165","municipalityName":"Albertslund","streetCode":"0445","streetName":"Skyttehusene","streetBuildingIdentifier":"33","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"2620","districtName":"Albertslund","presentationString":"Skyttehusene 33, 2620 Albertslund","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(711079 6174741)","x":711079,"y":6174741}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f509c-7f57-32b8-e044-0003ba298018","municipalityCode":"0851","municipalityName":"Aalborg","streetCode":"5205","streetName":"Løvstikkevej","streetBuildingIdentifier":"36","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"","postCodeIdentifier":"9000","districtName":"Aalborg","presentationString":"Løvstikkevej 36, 9000 Aalborg","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(552407 6322490)","x":552407,"y":6322490}]}
{"status":"OK","message":"OK","data":[{"type":"addressAccessType","addressAccessId":"0a3f5098-32a6-32b8-e044-0003ba298018","municipalityCode":"0779","municipalityName":"Skive","streetCode":"0462","streetName":"Landevejen","streetBuildingIdentifier":"52","mailDeliverySublocationIdentifier":"","districtSubDivisionIdentifier":"Håsum","postCodeIdentifier":"7860","districtName":"Spøttrup","presentationString":"Landevejen 52, 7860 Spøttrup","addressSpecificCount":1,"validCoordinates":true,"geometryWkt":"POINT(491515 6269739)","x":491515,"y":6269739}]}
Note that the data key holds a list of dictionaries. x['data']['type'] wouldn't work, but x['data'][0]['type'] would. There might be more than one such dictionary in that list, however. I'll assume you want a CSV row per x['data'] dictionary.
Next, it appears you have a UTF-8 BOM on every line; whatever wrote this file was not using UTF-8 correctly. We need to strip this marker, the first 3 bytes of each line.
Last, JSON strings are always Unicode data, and you have non-ASCII characters in your data, so you'll have to encode to bytestrings again before passing the data to the CSV writer object.
I'd use csv.DictWriter here, with a pre-defined list of field names:
import codecs
import csv
import json

fields = [
    'status', 'message', 'type', 'addressAccessId', 'municipalityCode',
    'municipalityName', 'streetCode', 'streetName', 'streetBuildingIdentifier',
    'mailDeliverySublocationIdentifier', 'districtSubDivisionIdentifier',
    'postCodeIdentifier', 'districtName', 'presentationString', 'addressSpecificCount',
    'validCoordinates', 'geometryWkt', 'x', 'y']

with open('test.csv', 'wb') as csvfile, open('jsonfile', 'r') as jsonfile:
    writer = csv.DictWriter(csvfile, fields)
    writer.writeheader()
    for line in jsonfile:
        if line.startswith(codecs.BOM_UTF8):
            line = line[3:]
        entry = json.loads(line)
        for item in entry['data']:
            row = dict(item, status=entry['status'], message=entry['message'])
            row = {k.encode('utf8'): unicode(v).encode('utf8')
                   for k, v in row.iteritems()}
            writer.writerow(row)
The row dictionary is basically a copy of each of the dictionaries in the entry['data'] list, with the status and message keys copied over separately. This makes row a flat dictionary instead.
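To see what the dict(item, ...) call produces, here is a tiny illustration using values from the sample data:

item = {'type': 'addressAccessType', 'x': 553564, 'y': 6179299}
row = dict(item, status='OK', message='OK')
# row is now a flat dictionary:
# {'type': 'addressAccessType', 'x': 553564, 'y': 6179299,
#  'status': 'OK', 'message': 'OK'}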
I also read your input file line by line, as you say that each line contains a separate JSON entry.
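For what it's worth, on Python 3 the manual BOM stripping and byte encoding above largely disappear; a minimal sketch (untested, reusing the same fields list):

import csv
import json

# fields: same list as in the Python 2 version above.
# Text mode decodes each line; a per-line BOM shows up as U+FEFF,
# which we strip before parsing.
with open('jsonfile', 'r', encoding='utf-8') as jsonfile, \
        open('test.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fields)
    writer.writeheader()
    for line in jsonfile:
        entry = json.loads(line.lstrip('\ufeff'))
        for item in entry['data']:
            writer.writerow(dict(item, status=entry['status'],
                                 message=entry['message']))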
Open the output file with csv.DictWriter() and define the output header fields as you specified, using extrasaction='ignore' and restval='' as options.
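For instance, a minimal sketch of that writer setup (Python 2 style to match the rest of this thread; the field list is trimmed for brevity):

import csv

fields = ['status', 'message', 'type', 'addressAccessId', 'x', 'y']  # trimmed

with open('output.csv', 'wb') as csvfile:
    # extrasaction='ignore' drops dict keys not in the field list;
    # restval='' fills in fields missing from a row
    writer = csv.DictWriter(csvfile, fields, extrasaction='ignore', restval='')
    writer.writeheader()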
Look at Opening A large JSON file in Python with no newlines for csv conversion Python 2.6.6 for help with processing large files; I had a similar question, and the question linked from there is also worth reading.
I built a similar type of system from JSON using loops of this kind.
for example,
def parse_row(currdata):
    # currdata points to one dictionary from the x['data'] list
    outx = {}
    for eachx in currdata:
        outx[eachx] = currdata[eachx]
    return outx
This lives in a function taking currdata as an argument, called with x['data'][row] as the input argument:
rows = len(x['data'])
for row in range(rows):
    outx = parse_row(x['data'][row])
    # process the row and create output
This should allow you to set up the parsing properly. I cannot copy the actual code into this answer but this should point you to a solution.
Related
Parse pipe delimited CSV Python [duplicate]
I have a text file (.txt) which could be in tab-separated format or pipe-separated format, and I need to convert it into CSV file format. I am using Python 2.6. Can anyone suggest how to identify the delimiter in a text file, read the data, and then convert it into a comma-separated file? Thanks in advance
I fear that you can't identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR, "the Microsoft version of CSV is a textbook example of how not to design a textual file format". The delimiter needs to be escaped in some way if it can appear in fields. Without knowing how the escaping is done, automatically identifying it is difficult. Escaping could be done the UNIX way, using a backslash '\', or the Microsoft way, using quotes which then must be escaped, too. This is not a trivial task. So my suggestion is to get full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers or some variant.

Edit: Python provides csv.Sniffer, which can help you deduce the format of your DSV. If your input looks like this (note the quoted delimiter in the first field of the second row):

a|b|c
"a|b"|c|d
foo|"bar|baz"|qux

You can do this:

import csv

csvfile = open("csvfile.csv")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.DictReader(csvfile, dialect=dialect)
for row in reader:
    print row,
# => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}

# write records using other dialect
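If the sniffer guesses wrong, note that csv.Sniffer().sniff() also accepts an optional delimiters argument restricting the characters it will consider:

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters='|\t')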
Your strategy could be the following:

parse the file with BOTH a tab-separated csv reader and a pipe-separated csv reader
calculate some statistics on the resulting rows to decide which result set is the one you want to write

An idea could be counting the total number of fields in the two record sets (expecting that tab and pipe are not so common). Another one (if your data is strongly structured and you expect the same number of fields in each line) could be measuring the standard deviation of the number of fields per line and taking the record set with the smallest standard deviation.

In the following example you find the simpler statistic (total number of fields):

import csv

piperows = []
tabrows = []

# parsing | delimiter
f = open("file", "rb")
readerpipe = csv.reader(f, delimiter="|")
for row in readerpipe:
    piperows.append(row)
f.close()

# parsing TAB delimiter
f = open("file", "rb")
readertab = csv.reader(f, delimiter="\t")
for row in readertab:
    tabrows.append(row)
f.close()

# in this example we use the total number of fields as the indicator
# (but it's not guaranteed to work! it depends on the nature of your data)
totfieldspipe = reduce(lambda x, y: x + y, [len(f) for f in piperows])
totfieldstab = reduce(lambda x, y: x + y, [len(f) for f in tabrows])

if totfieldspipe > totfieldstab:
    yourrows = piperows
else:
    yourrows = tabrows

# the var yourrows contains the rows; now just write them in any format you like
Like this:

from __future__ import with_statement
import csv
import re

with open(input, "r") as source:
    with open(output, "wb") as destination:
        writer = csv.writer(destination)
        for line in source:
            writer.writerow(re.split('[\t|]', line))
I would suggest taking some of the example code from the existing answers, or perhaps better using the csv module from Python, and changing it to first assume tab-separated, then pipe-separated, producing two output files which are comma-separated. Then you visually examine both files to determine which one you want, and pick that.

If you actually have lots of files, then you need to find a way to detect which file is which. One of the examples has this:

if "|" in line:

This may be enough: if the first line of a file contains a pipe, then maybe the whole file is pipe-separated; else assume a tab-separated file. Alternatively, fix the file to contain a key field in the first line which is easily identified, or maybe the first line contains column headers which can be detected.
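A minimal sketch of that first-line heuristic (untested, Python 2 to match the question; the file names are placeholders):

import csv

with open("input.txt", "rb") as f:
    first_line = f.readline()
    # crude guess: a pipe in the first line means pipe-separated
    delimiter = "|" if "|" in first_line else "\t"
    f.seek(0)
    with open("output.csv", "wb") as out:
        writer = csv.writer(out)
        for row in csv.reader(f, delimiter=delimiter):
            writer.writerow(row)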
for line in open("file"):
    line = line.strip()
    if "|" in line:
        print ','.join(line.split("|"))
    else:
        print ','.join(line.split("\t"))
Trying to copy column1 from a csv file to another empty file using python
I'm looking for a way using python to copy the first column from a csv into an empty file. I'm trying to learn python so any help would be great!

So if this is test.csv:

A 32
D 21
C 2
B 20

I want this output:

A
D
C
B

I've tried the following commands in python but the output file is empty:

import csv

f = open("test.csv", 'r')
reader = csv.reader(f, delimiter="\t")
names = ""
for each_line in reader:
    names = each_line[0]
First, you want to open your files. A good practice is to use the with statement (which, technically speaking, introduces a context manager) so that when your code exits from the with block all the files are automatically closed:

with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:

Next you want a loop on the lines of the input file (note the indentation, we are inside the with block); line splitting is automatic when you read a text file with lines separated by newlines:

    for line in inpfile:

Each line is a string, but you think of it as two fields separated by white space. This situation is so common that strings have a method to deal with it (note again the increasing indent, we are in the for loop block):

        fields = line.split()

By default .split() splits on white space, but you can use, e.g., split(',') to split on commas, etc. That said, fields is a list of strings; for your first record it is equal to ['A', '32'], and you want to output just the first field in this list. For this purpose a file object has the .write() method, which writes a string, just a string, to the file, and fields[0] IS a string, but we have to add a newline character to it because, in this respect, .write() is different from print():

            outfile.write(fields[0]+'\n')

That's all, but if you omit my comments it's 4 lines of code:

with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')

When you are done with learning (some) Python, ask for an explanation of this:

with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))

Addendum

The csv module in such a simple case adds the additional conveniences of auto-splitting each line into a list of strings and taking care of the details of output (newlines, etc), and when learning Python it's more fruitful to see how these steps can be done using the bare language, or at least that is my opinion. The situation is different when your data file is complex, has headers, has quoted strings possibly containing quoted delimiters, etc etc; in those cases the use of csv is recommended, as it takes into account all the gory details. For complex data analysis requirements you will need other packages, not included in the standard library, e.g., numpy and pandas, but that is another story.
This answer reads the CSV file, understanding a column to be delimited by a space character. You have to add header=None, otherwise the first row will be taken to be the header / names of columns. ss is a slice: the 0th column, taking all rows as denoted by :. The last line writes the slice to a new filename.

import pandas as pd

df = pd.read_csv('test.csv', sep=' ', header=None)
ss = df.ix[:, 0]   # note: .ix was removed in modern pandas; use df.iloc[:, 0]
ss.to_csv('new_path.csv', sep=' ', index=False)
import csv

reader = csv.reader(open("test.csv", "rb"), delimiter='\t')
writer = csv.writer(open("output.csv", "wb"))
for e in reader:
    writer.writerow([e[0]])  # wrap in a list, or each character becomes its own column
The best you can do is create an empty list, append the column to it, and then write that new list into another csv. For example:

import csv

def writetocsv(l):
    # convert the set to the list
    b = list(l)
    print(b)
    with open("newfile.csv", 'w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])

adcb_list = []
f = open("test.csv", 'r')
reader = csv.reader(f, delimiter="\t")
for each_line in reader:
    adcb_list.append(each_line[0])  # keep only the first column

writetocsv(adcb_list)

hope this works for you :-)
Python merge csv files with matching Index
I want to merge two CSV files based on a field.

The 1st one looks like this:

ID, field1, field2
1,a,green
2,b,white
2,b,red
2,b,blue
3,c,black

The second one looks like:

ID, field3
1,value1
2,value2

What I want to have is:

ID, field1, field2,field3
1,a,green,value1
2,b,white,value2
2,b,red,value2
2,b,blue,value2
3,c,black,''

I'm using pydev on eclipse:

import csv

endings0 = []
endings1 = []
with open("salaries.csv") as book0:
    for line in book0:
        endings0.append(line.split(',')[-1])
        endings1.append(line.split(',')[0])

linecounter = 0
res = open("result.csv", "w")
with open('total.csv') as book2:
    for line in book2:
        # if not header line:
        l = line.split(',')[0]
        for linecounter in range(0, endings1.__len__()):
            if (l == endings1[linecounter]):
                res.writelines(line.replace("\n", "") + ',' + str(endings0[linecounter]))

print("done")
There are a bunch of things wrong with what you're doing.

You should really, really be using the classes in the csv module to read and write csv files. Importing the module isn't enough: you actually need to call its functions.

You should never find yourself typing endings1.__len__(). Use len(endings1) instead.

You should never find yourself typing for linecounter in range(0, len(endings1)). Use either for linecounter, _ in enumerate(endings1), or better yet for end1, end2 in zip(endings1, endings2).

A dictionary is a much better data structure for lookup than a pair of parallel lists. To quote Rob Pike: "If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident."

Here's how I'd do it:

import csv

with open('second.csv') as f:
    # look, a builtin to read csv file lines as dictionaries!
    reader = csv.DictReader(f)
    # build a mapping of id to field3
    id_to_field3 = {row['ID']: row['field3'] for row in reader}

# you can put more than one open inside a with statement
with open('first.csv') as f, open('result.csv', 'w') as fo:
    # csv even has a class to write files!
    reader = csv.DictReader(f)
    res = csv.DictWriter(fo, fieldnames=reader.fieldnames + ['field3'])
    res.writeheader()
    for row in reader:
        # .get returns its second argument if there was no match
        row['field3'] = id_to_field3.get(row['ID'], '')
        res.writerow(row)
I have a high-level solution for you:

Deserialize your first CSV into dict1, mapping ID to a list containing field1 and field2.
Deserialize your second CSV into dict2, mapping ID to field3.
For each (id, list) in dict1, do list.append(dict2.setdefault(id, '')).
Now serialize it back into CSV using whatever serializer you were using before.

I used the dictionary's setdefault because I noticed that ID 3 is in the first CSV file but not the second.
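A rough sketch of that recipe (assuming, for simplicity, one row per ID and the csv module for (de)serialization; file names are placeholders):

import csv

# first CSV: ID -> [field1, field2]
with open('first.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    dict1 = {row[0]: row[1:] for row in reader}

# second CSV: ID -> field3
with open('second.csv') as f:
    reader = csv.reader(f)
    next(reader)
    dict2 = {row[0]: row[1] for row in reader}

# append field3 (or '' when the ID is missing) to each row
for id_, fields in dict1.items():
    fields.append(dict2.setdefault(id_, ''))

with open('merged.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(header + ['field3'])
    for id_, fields in dict1.items():
        writer.writerow([id_] + fields)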
Writing multiple values in single cell in csv
For each user I have the list of events in which he participated, e.g. bob : [event1, event2, ...], and I want to write it in a csv file. I created a dictionary (key - user & value - list of events) and wrote it to csv. The following is the sample output:

username, frnds
"abc" ['event1','event2']

where username is the first col and frnds the 2nd col. This is the code:

writer = csv.writer(open('eventlist.csv', 'ab'))
for key, value in evnt_list.items():
    writer.writerow([key, value])

When I am reading the csv I am not getting the list directly, but I am getting it in the following way:

['e','v','e','n','t','1','','...]

I also tried to write the list directly in csv but while reading am getting the same output. What I want is multiple values in a single cell so that when I read a column for a row I get the list of all events, e.g.:

colA    colB
user1   event1,event2,...

I think it's not difficult but somehow I am not getting it.

Reading: I am reading it with the help of the following code:

reader = csv.reader(open("eventlist.csv"))
reader.next()
for row in reader:
    tmp = row[1]
    print tmp      # it is printing the whole list
    print tmp[0]   # the output is [
    print tmp[1]   # output is 'e', it should have been 'event1'
    print tmp[2]   # output is 'v', it should have been 'event2'
You have to format your values into a single string:

with open('eventlist.csv', 'ab') as f:
    writer = csv.writer(f, delimiter=' ')
    for key, value in evnt_list.items():
        writer.writerow([key, ','.join(value)])

This exports as:

key1 val11,val12,val13
key2 val21,val22,val23

Reading: here you have to keep in mind that you converted your Python list into a formatted string, therefore you cannot use standard csv tools to read it:

with open("eventlist.csv") as f:
    csvr = csv.reader(f, delimiter=' ')
    csvr.next()
    for rec in csvr:
        key, values_txt = rec
        values = values_txt.split(',')
        print key, values

This works as expected.
You seem to be saying that your evnt_list is a dictionary whose keys are strings and whose values are lists of strings. If so, then the CSV-writing code you've given in your question will write a string representation of a Python list into the second column. When you read anything in from CSV, it will just be a string, so once again you'll have a string representation of your list.

For example, if you have a cell that contains "['event1', 'event2']", you will be reading in a string whose first character (at position 0) is [, second character is ', third character is e, etc. (I don't think your tmp[1] is right; I think it is really ', not e.)

It sounds like you want to reconstruct the Python object, in this case a list of strings. To do that, use ast.literal_eval:

import ast

cell_string_value = "['event1', 'event2']"
cell_object = ast.literal_eval(cell_string_value)

Incidentally, the reason to use ast.literal_eval instead of just eval is safety. eval allows arbitrary Python expressions and is thus a security risk.

Also, what is the purpose of the CSV, if you want to get the list back as a list? Will people be reading it (in Excel or something)? If not, then you may want to simply save the evnt_list object using pickle or json, and not bother with the CSV at all.

Edit: I should have read more carefully; the data from evnt_list is being appended to the CSV, and neither pickle nor json is easily appendable. So I suppose CSV is a reasonable and lightweight way to accumulate the data. A full-blown database might be better, but that would not be as lightweight.
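Putting the two together, a minimal sketch of a read loop that rebuilds each event list (assuming the two-column layout from the question):

import ast
import csv

with open('eventlist.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for username, events_repr in reader:
        events = ast.literal_eval(events_repr)  # back to a real list
        print username, events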
Sorting a list from a file, outputting in another file
I am trying to find the min and max out of a csv file, and have it output into a text file. Currently my code outputs all data into the output file, and I am unsure of how to grab the data out of the multiple columns and have them sorted accordingly. Any guidance would be appreciated, as I don't have a good lead on how to figure this out.

read_file = open("riskfactors.csv", 'r')

def create_file():
    read_file = open("riskfactors.csv", 'r')
    write_file = open("best_and_worst.txt", "w")
    for line_str in read_file:
        read_file.readline()
        print(line_str, file=write_file)
    write_file.close()
    read_file.close()
Assuming your file is a standard .csv file containing only numbers separated by semicolons:

1;5;7;6;
3;8;1;1;

Then it's easiest to use the str.split() command, followed by a type conversion to int. You could store all values in a list (or quicker: a set) and then get the maximum and minimum:

valuelist = []
for line_str in read_file:
    for cell in line_str.strip().split(";"):
        if cell:  # skip the empty string after the trailing semicolon
            valuelist.append(int(cell))

print(max(valuelist))
print(min(valuelist))

Warning: if your file contains non-number entries you'd have to filter them out. .csv files can also have different delimiters.
import sys, csv

def cmp_risks(x, y):
    # This assumes risk factors are prioritised by key columns 1, 3
    # and that column 1 is numeric while column 3 is textual
    return cmp(int(x[0]), int(y[0])) or cmp(x[2], y[2])

l = sorted(csv.reader(sys.stdin), cmp_risks)

# Write out the first and last rows
csv.writer(sys.stdout).writerows([l[0], l[len(l)-1]])

Now, I took a shortcut and said the input and output files were sys.stdin and sys.stdout. You'd probably replace these with the file objects you created in your original question (e.g. read_file and write_file). However, in my case, I'd probably just run it (if I were using linux) with:

$ ./foo.py <riskfactors.csv >best_and_worst.txt