Converting a modified JSON to CSV using python - python

I know this question has been asked before, but never with the following caveats:
1. I'm a complete python n00b. Also a JSON noob.
2. The JSON file / string is not the same as those seen in json2csv examples.
3. The CSV file output is supposed to have standard columns.
Due to point number 1, I'm not aware of most terminologies and technologies used for this. So please bear with me.
Point number 2: Here's a single line of the supposed JSON file:
"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^
Weird, I know - it lacks braces and brackets and stuff. Which is why I'm convinced posted solutions won't work.
I'm not sure what the 0^ at the end of the line is, but I see it at the end of every line. I'm assuming the 0 is the value for "were_here_count" while the ^ is a... line terminator? EDIT: Apparently, I can just disregard it.
Of note is that the value of "parking" appears to be yet another array - I'm fine with just displaying it as is (minus the double quotes).
Point number 3: Here's the columns of the supposed CSV file output. This is the complete column set - the JSON file won't always have them all.
ID STRING,
ABOUT STRING,
ATTIRE STRING,
BAND_MEMBERS STRING,
BEST_PAGE STRING,
BIRTHDAY STRING,
BOOKING_AGENT STRING,
CAN_POST STRING,
CATEGORY STRING,
CATEGORY_LIST STRING,
CHECKINS STRING,
COMPANY_OVERVIEW STRING,
COVER STRING,
CONTEXT STRING,
CURRENT_LOCATION STRING,
DESCRIPTION STRING,
DIRECTED_BY STRING,
FOUNDED STRING,
GENERAL_INFO STRING,
GENERAL_MANAGER STRING,
GLOBAL_BRAND_PARENT_PAGE STRING,
HOMETOWN STRING,
HOURS STRING,
IS_PERMANENTLY_CLOSED STRING,
IS_PUBLISHED STRING,
IS_UNCLAIMED STRING,
LIKES STRING,
LINK STRING,
LOCATION STRING,
MISSION STRING,
NAME STRING,
PARKING STRING,
PHONE STRING,
PRESS_CONTACT STRING,
PRICE_RANGE STRING,
PRODUCTS STRING,
RESTAURANT_SERVICES STRING,
RESTAURANT_SPECIALTIES STRING,
TALKING_ABOUT_COUNT STRING,
USERNAME STRING,
WEBSITE STRING,
WERE_HERE_COUNT STRING
Here's my code so far:
import os
num = '1'
inPath = "./fb-data_input/"
outPath = "./fb-data_output/"
#Get list of Files, put them in filenameList array
fileNameList = os.listdir(path)
#Process per file in
for item in fileNameList:
    print("Processing: " + item)
    fb_inputFile = open(inPath + item, "rb").read().split("\n")
    fb_outputFile = open(outPath + "fbdata-IAB-output" + num, "wb")
    num++
    jsonString = fb_inputFile.split("\",\"")
    jsonField = jsonString[0]
    jsonValue = jsonString[1]
    jsonHash[?] = [?,?]
    #Do Code stuff here
Up until the for loop, it just loads the json file names into an array, and then processes it one by one.
Here's my logic for the rest of the code:
Split the json string by something. Perhaps the "," so that other commas won't get split.
Store it into a hashmap / 2D array (dynamic?)
Trim away the JSON fields and the first and/or last double quotes.
Add the resulting output to another hashmap, with those set columns, putting in null in a column that the JSON file does not have.
And then I output the result to a CSV.
It sounds logical in my head, but I'm pretty sure there's something I missed. And of course, I have a hard time putting it in code.
Can I have some help on this? Thanks.
P.S.
Additional information:
OS: Mac OSX
Target platform OS: Ubuntu of some sort

Here is a full solution, based on your original code:
import os
import json
from csv import DictWriter
import codecs
def get_columns():
    columns = []
    with open("columns.txt") as f:
        columns = [line.split()[0] for line in f if line.strip()]
    return columns
if __name__ == "__main__":
    in_path = "./fb-data_input/"
    out_path = "./fb-data_output/"
    columns = get_columns()
    bad_keys = ("has_added_app", "is_community_page")
    for filename in os.listdir(in_path):
        json_filename = os.path.join(in_path, filename)
        csv_filename = os.path.join(out_path, "%s.csv" % (os.path.basename(filename)))
        with open(json_filename) as f, open(csv_filename, "wb") as csv_file:
            csv_file.write(codecs.BOM_UTF8)
            csv = DictWriter(csv_file, columns)
            csv.writeheader()
            for line_number, line in enumerate(f, start=1):
                try:
                    data = json.loads("{%s}" % (line.strip().strip('^')))
                    # fix parking column
                    if "parking" in data:
                        data['parking'] = ", ".join("%s: %s" % (k, str(v)) for k, v in data['parking'].items())
                    data = {k.upper(): unicode(v).encode('utf8') for k, v in data.items() if k not in bad_keys}
                except Exception, e:
                    import traceback
                    traceback.print_exc()
                    data = {columns[0]: "Error on line %s of %s: %s" % (line_number, json_filename, e)}
                csv.writerow(data)
Edited: Full unicode support plus extended error information.

So, first off, your string is valid JSON if you just add curly braces around it. You can then deserialize it with Python's json library. Set up your csv columns as a dictionary with each of them pointing to whatever you want as a default value (None? ""? your choice). Once you've deserialized the JSON to a dict, just loop through each key there and fill in the csv_cols dict as appropriate. Then just use Python's csv module to write it out:
import json
import csv
string = '"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^'
string = '{%s}' % string[:-1]
json_dict = json.loads(string)
#make 'parking' a string. I'm assuming that's your only hash.
json_dict['parking'] = json.dumps(json_dict['parking'])
csv_cols_list = ['a','b','c'] #put your actual csv columns here
csv_cols = {col: '' for col in csv_cols_list}
for k, v in json_dict.items():
    if k in csv_cols:
        csv_cols[k] = v
#now just write to csv using Python's csv library
Note: this is a general answer that assumes that your "json" will be valid key/value pairs. Your "parking" key is a special case you'll need to deal with somehow. I left it as is because I don't know what you want with it. I'm also assuming the '^' at the end of your string was a typo.
[EDIT] Changed to account for parking and the '^' at the end. [/EDIT]
Either way, the general idea here is what you want.
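The write step left open above could be sketched as follows in Python 3. This is a minimal sketch rather than the answerer's code: the helper name line_to_row, the shortened COLUMNS list and the in-memory output buffer are assumptions made for illustration, and the sample line is an abbreviated version of the one in the question.

```python
import csv
import io
import json

# Shortened, hypothetical column list; the real one has every CSV column.
COLUMNS = ["ID", "ABOUT", "ATTIRE", "LIKES", "PARKING"]

def line_to_row(line):
    """Turn one brace-less input line into a dict keyed by upper-case column name."""
    record = json.loads("{%s}" % line.strip().rstrip("^"))
    row = {}
    for key, value in record.items():
        # Nested objects such as "parking" are kept as a JSON string.
        if isinstance(value, dict):
            value = json.dumps(value)
        row[key.upper()] = value
    return row

sample = '"id":"123456","about":"YESH","likes":48,"parking":{"lot":0,"street":0,"valet":0},"were_here_count":0^'

out = io.StringIO()  # stands in for an output file opened with newline=""
# restval fills columns the line lacks; extrasaction drops keys with no column.
writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="", extrasaction="ignore")
writer.writeheader()
writer.writerow(line_to_row(sample))
print(out.getvalue())
```

Letting DictWriter handle missing and surplus keys avoids building the default-value dictionary by hand; whether empty strings are the right placeholder for absent columns is a choice left to the reader.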

The first thing is that your input is not JSON. It's just a delimited string in which each column name and value is quoted.
Here is a solution that would work:
import csv
columns = ['ID', 'ABOUT', ... ]
with open('input_file.txt', 'r') as f, open('output_file.txt', 'w') as o:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(o, delimiter=',')
    writer.writerow(columns)
    for row in reader:
        # row is a list of 'key:value' fields; split each on its first colon
        data = {k.upper(): v for k, v in (field.split(':', 1) for field in row)}
        row = [data.get(v, '') for v in columns]
        writer.writerow(row)
In this loop, for each line we read from the input file, a dictionary is created. The key is the first value from the 'foo:bar' pair, and we convert it to upper case.
Next, for each column, we try to fetch a value from this dictionary in the order that the columns are written out. If a value for the column doesn't exist, a blank '' is returned. These values are collected in a list row. This makes sure no matter how many columns are missing, we write an equal number of columns to the output.

Related

How to extract a subset from a text file and store it in a separate file?

I am currently trying to extract information from a text file using Python. I want to extract a subset from the file and store it in a separate file from everywhere it occurs in the text file. To give you an idea of what my file looks like, here is a sample:
C","datatype":"double","value":25.71,"measurement":"Temperature","timestamp":1573039331258250},
{"unit":"%RH","datatype":"double","value":66.09,"measurement":"Humidity","timestamp":1573039331258250}]
Here, I want to extract "value" and the corresponding number beside it. I have tried various techniques but have been unsuccessful. I tried to iterate through the file and stop at where I have "value" but that did not work.
Here is a sample of the code:
with open("DOTemp.txt") as openfile:
    for line in openfile:
        for part in line.split():
            if "value" in part:
                print(part)
A simple solution to return the value marked by the "value" key:
with open("DOTemp.txt") as openfile:
    for line in openfile:
        line = line.replace('"', '')
        for part in line.split(','):
            if "value" in part:
                print(part.split(':')[1])
Note that by default str.split() splits on whitespace. In the last line, if we printed element zero of the list it would just be "value". If you wish to use this as an int or float, simply cast it as such and return it.
First split using , (comma) as the delimiter, then split the corresponding strings using : as the delimiter.
If required, trim the leading and trailing double quotes, then compare with value.
The code below will work for you:
file1 = open("untitled.txt","r")
data = file1.readlines()
#Convert to a single string
val = ""
for d in data:
    val = val + d
#split string at comma
comma_splitted = val.split(',')
#find the required float
for element in comma_splitted:
    if 'value' in element:
        out = element.split('"value":')[1]
        print(float(out))
I assume your input file is a json string(list of dictionaries) (looking at the file sample). If that's the case, perhaps you can try this.
import json
#Assuming each record is a dictionary
with open("DOTemp.txt") as openfile:
    # json.load parses the whole file; json.loads would need a single string
    records = json.load(openfile)
# convert each value to str so the lines can be joined
out_lines = [str(d.get('value')) for d in records]
with open('DOTemp_out.txt', 'w') as outfile:
    outfile.write("\n".join(out_lines))
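If the file is not complete JSON (the sample in the question starts in the middle of a record), parsing it as JSON would fail. A regular expression is one possible fallback. The following is a minimal Python 3 sketch under that assumption; the names VALUE_RE and extract_values are hypothetical, and the pattern only covers plain decimal numbers.

```python
import re

# Matches "value": followed by an optional sign and a decimal number.
VALUE_RE = re.compile(r'"value"\s*:\s*(-?\d+(?:\.\d+)?)')

def extract_values(text):
    """Return every number that follows a "value" key, as floats."""
    return [float(m) for m in VALUE_RE.findall(text)]

# The two sample lines from the question, including the truncated first record.
sample = (
    'C","datatype":"double","value":25.71,"measurement":"Temperature","timestamp":1573039331258250},\n'
    '{"unit":"%RH","datatype":"double","value":66.09,"measurement":"Humidity","timestamp":1573039331258250}]\n'
)
values = extract_values(sample)
print(values)
```

To store the result in a separate file, the list can be written out one value per line, for example with "\n".join(str(v) for v in values).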

Remove characters before inserting key value pairs into dictionary

I have a csv file which I'm parsing to read the contents into a dictionary. However, the code below is giving me brackets surrounding every value for all key: value pairs:
import csv
f = open(input('Which csv file: '))
cdata = csv.reader(f, delimiter = ';', quoting = csv.QUOTE_NONNUMERIC)
cdict = {}
for row in cdata:
    cdict[row[0]] = row[1:]
print(cdict)
f.close()
I have tried:
for row in cdata:
    row = "".join(row)
    cdict[row[0]] = row[1:]
But receive the error:
TypeError: sequence item 0: expected str instance, float found
The reason I read the contents of the csv as floats instead of strings was to remove extraneous characters in the first place. I need the final output into the dictionary to contain nothing but the actual numbers from the csv, i.e. no quotes or brackets.
Below does return what I'm looking for, but there must be a more pythonic way to do this:
for row in cdata:
    cdict[row[0]] = row.pop()
Example csv data:
Number;Value
0;13.168159
1;13.1681598889
2;13.0313661591
Since there are only two columns, I think this is more pythonic:
cdata = csv.reader(f, delimiter = ';', quoting = csv.QUOTE_NONNUMERIC)
print dict(cdata)
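A self-contained Python 3 sketch of the same idea follows. It assumes the file begins with the Number;Value header shown in the question; since that line is not numeric, it would likely fail to convert under QUOTE_NONNUMERIC, so the sketch consumes it before the reader sees it. The in-memory file is a stand-in for the real one.

```python
import csv
import io

# In-memory stand-in for the semicolon-delimited file from the question.
f = io.StringIO("Number;Value\n0;13.168159\n1;13.1681598889\n2;13.0313661591\n")

header = next(f).strip().split(";")  # consume the text header before numeric parsing
reader = csv.reader(f, delimiter=";", quoting=csv.QUOTE_NONNUMERIC)
cdict = dict(reader)  # each two-item row becomes one key/value pair
print(header)
print(cdict)
```

Note that the keys also come back as floats (0.0, 1.0, ...) because QUOTE_NONNUMERIC converts every unquoted field.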

Writing multiple values in single cell in csv

For each user I have the list of events in which he participated.
e.g. bob : [event1,event2,...]
I want to write it in csv file. I created a dictionary (key - user & value - list of events)
I wrote it in csv. The following is the sample output
username, frnds
"abc" ['event1','event2']
where username is first col and frnds 2nd col
This is code
writer = csv.writer(open('eventlist.csv', 'ab'))
for key, value in evnt_list.items():
    writer.writerow([key, value])
when I am reading the csv I am not getting the list directly. But I am getting it in following way
['e','v','e','n','t','1','','...]
I also tried to write the list directly in csv but while reading am getting the same output.
What I want is multiple values in a single cell so that when I read a column for a row I get list of all events.
e.g
colA colB
user1,event1,event2,...
I think it's not difficult but somehow I am not getting it.
###Reading
I am reading it with the help of following
codereader = csv.reader(open("eventlist.csv"))
reader.next()
for row in reader:
    tmp=row[1]
    print tmp # it is printing the whole list but
    print tmp[0] #the output is [
    print tmp[1] #output is 'e' it should have been 'event1'
    print tmp[2] #output is 'v' it should have been 'event2'
You have to format your values into a single string:
with open('eventlist.csv', 'ab') as f:
    writer = csv.writer(f, delimiter=' ')
    for key, value in evnt_list.items():
        writer.writerow([key, ','.join(value)])
exports as
key1 val11,val12,val13
key2 val21,val22,val23
READING: Here you have to keep in mind that you converted your Python list into a formatted string. The csv module alone will therefore not give you the list back; split the cell yourself:
with open("eventlist.csv") as f:
    csvr = csv.reader(f, delimiter=' ')
    csvr.next()
    for rec in csvr:
        key, values_txt = rec
        values = values_txt.split(',')
        print key, values
This works as expected.
You seem to be saying that your evnt_list is a dictionary whose keys are strings and whose values are lists of strings. If so, then the CSV-writing code you've given in your question will write a string representation of a Python list into the second column. When you read anything in from CSV, it will just be a string, so once again you'll have a string representation of your list. For example, if you have a cell that contains "['event1', 'event2']" you will be reading in a string whose first character (at position 0) is [, second character is ', third character is e, etc. (I don't think your tmp[1] is right; I think it is really ', not e.)
It sounds like you want to reconstruct the Python object, in this case a list of strings. To do that, use ast.literal_eval:
import ast
cell_string_value = "['event1', 'event2']"
cell_object = ast.literal_eval(cell_string_value)
Incidentally, the reason to use ast.literal_eval instead of just eval is safety. eval allows arbitrary Python expressions and is thus a security risk.
Also, what is the purpose of the CSV, if you want to get the list back as a list? Will people be reading it (in Excel or something)? If not, then you may want to simply save the evnt_list object using pickle or json, and not bother with the CSV at all.
Edit: I should have read more carefully; the data from evnt_list is being appended to the CSV, and neither pickle nor json is easily appendable. So I suppose CSV is a reasonable and lightweight way to accumulate the data. A full-blown database might be better, but that would not be as lightweight.
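The round trip described above (the list is written as its string representation, then rebuilt with ast.literal_eval) could look like this in Python 3. It is a minimal sketch: the in-memory buffer stands in for eventlist.csv, and the sample users and events are made up for illustration.

```python
import ast
import csv
import io

# Hypothetical sample data: user -> list of events.
evnt_list = {"bob": ["event1", "event2"], "amy": ["event3"]}

buf = io.StringIO()  # stands in for eventlist.csv
writer = csv.writer(buf)
for user, events in evnt_list.items():
    writer.writerow([user, events])  # the list is stored as its repr string

buf.seek(0)
restored = {}
for user, cell in csv.reader(buf):
    restored[user] = ast.literal_eval(cell)  # safely parse the repr back into a list

print(restored)
```

Because literal_eval only accepts Python literals, a cell containing anything else raises an error instead of being executed.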

Effective way to get part of string until token

I'm parsing a very big csv (big = tens of gigabytes) file in python and I need only the value of the first column of every line. I wrote this code, wondering if there is a better way to do it:
delimiter = ','
f = open('big.csv','r')
for line in f:
    pos = line.find(delimiter)
    id = int(line[0:pos])
Is there a more effective way to get the part of the string before the first delimiter?
Edit: I do know about the CSV module (and I have used it occasionally), but I do not need to load every line of this file into memory - I need only the first column. So let's focus on string parsing.
>>> a = '123456'
>>> print a.split('2', 1)[0]
1
>>> print a.split('4', 1)[0]
123
>>>
But, if you're dealing with a CSV file, then:
import csv
with open('some.csv') as fin:
    for row in csv.reader(fin):
        print int(row[0])
And the csv module will handle quoted columns containing quotes etc...
If the first field cannot contain an escaped delimiter (as in your case, where the first field is an integer) and there are no embedded newlines in any field (i.e., each row corresponds to exactly one physical line in the file), then the csv module is overkill and you could use your code from the question or line.split(',', 1) as suggested by @Jon Clements.
To handle occasional lines that have no delimiter in them you could use str.partition:
with open('big.csv', 'rb') as file:
    for line in file:
        first, sep, rest = line.partition(b',')
        if sep: # the line has ',' in it
            process_id(int(first)) # or `yield int(first)`
Note: s.split(',', 1)[0] silently returns a wrong result (the whole string) if there is no delimiter in the string.
'rb' file mode is used to avoid unnecessary end-of-line manipulation (and implicit decoding to Unicode on Python 3). It is safe to use if the csv file has '\n' at the end of each row, i.e., the newline is either '\n' or '\r\n'.
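The difference between str.partition and str.split on a line without a delimiter can be seen in a small Python 3 sketch. The helper name first_field is hypothetical, and it works on text strings rather than the bytes used above.

```python
def first_field(line, delimiter=","):
    """Return the text before the first delimiter, or None if there is none."""
    first, sep, _rest = line.partition(delimiter)
    # sep is empty when the delimiter was not found
    return first if sep else None

print(first_field("42,foo,bar"))
print(first_field("no delimiter here"))
# split gives no signal that the delimiter was missing:
print("no delimiter here".split(",", 1)[0])
```

Returning None makes the missing-delimiter case explicit, so the caller can skip or report such lines instead of passing the whole line to int().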
Personally, I would do it with generators:
from itertools import imap
import csv
def int_of_0(x):
    return int(x[0])
def obtain(filepath, treat):
    with open(filepath,'rb') as f:
        for i in imap(treat,csv.reader(f)):
            yield i
for x in obtain('essai.txt', int_of_0):
    pass  # instructions

Reading CSV files in numpy where delimiter is ","

I've got a CSV file with a format that looks like this:
"FieldName1", "FieldName2", "FieldName3", "FieldName4"
"04/13/2010 14:45:07.008", "7.59484916392", "10", "6.552373"
"04/13/2010 14:45:22.010", "6.55478493312", "9", "3.5378543"
...
Note that there are double quote characters at the start and end of each line in the CSV file, and the "," string is used to delimit fields within each line. The number of fields in the CSV file can vary from file to file.
When I try to read this into numpy via:
import numpy as np
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True)
all the data gets read in as string values, surrounded by double-quote characters. Not unreasonable, but not much use to me as I then have to go back and convert every column to its correct type
When I use delimiter='","' instead, everything works as I'd like, except for the 1st and last fields. As the start of line and end of line characters are a single double-quote character, this isn't seen as a valid delimiter for the 1st and last fields, so they get read in as e.g. "04/13/2010 14:45:07.008 and 6.552373" - note the leading and trailing double-quote characters respectively. Because of these redundant characters, numpy assumes the 1st and last fields are both String types; I don't want that to be the case
Is there a way of instructing numpy to read in files formatted in this fashion as I'd like, without having to go back and "fix" the structure of the numpy array after the initial read?
The basic problem is that NumPy doesn't understand the concept of stripping quotes (whereas the csv module does). When you say delimiter='","', you're telling NumPy that the column delimiter is literally a quoted comma, i.e. the quotes are around the comma, not the value, so the extra quotes you get on the first and last columns are expected.
Looking at the function docs, I think you'll need to set the converters parameter to strip quotes for you (the default does not):
import re
import numpy as np
fieldFilter = re.compile(r'^"?([^"]*)"?$')
def filterTheField(s):
    m = fieldFilter.match(s.strip())
    if m:
        return float(m.group(1))
    else:
        return 0.0 # or whatever default
#...
# Yes, sorry, you have to know the number of columns, since the NumPy docs
# don't say you can specify a default converter for all columns.
convs = dict((col, filterTheField) for col in range(numColumns))
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True,
                     converters=convs)
Or abandon np.genfromtxt() and let csv.reader (given an open file object) give you the file's contents a row at a time, as lists of strings; then you just iterate through the elements and build the matrix:
reader = csv.reader(csvfile, skipinitialspace=True)
headings = next(reader)  # the first row holds the column headings
result = np.array([[float(col) for col in row] for row in reader])
EDIT: Okay, so it looks like your file isn't all floats. In that case, you can set convs as needed in the genfromtxt case, or create a vector of conversion functions in the csv.reader case:
reader = csv.reader(csvfile, skipinitialspace=True)
headings = next(reader)  # the first row holds the column headings
# assumes: from datetime import datetime
parse_time = lambda s: datetime.strptime(s, '%m/%d/%Y %H:%M:%S.%f')
converters = [parse_time, float, int, float]
result = np.array([[conv(col) for col, conv in zip(row, converters)]
                   for row in reader])
EDIT 2: Okay, variable column count... Your data source just wants to make life difficult. Luckily, we can just use magic...
reader = csv.reader(csvfile, skipinitialspace=True)
headings = next(reader)  # the first row holds the column headings
result = np.array([[magic(col) for col in row] for row in reader])
... where magic() is just a name I got off the top of my head for a function. (Psyche!)
At worst, it could be something like:
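def magic(s):
    if '/' in s:
        # datetime() cannot parse a string; strptime can (from datetime import datetime)
        return datetime.strptime(s, '%m/%d/%Y %H:%M:%S.%f')
    elif '.' in s:
        return float(s)
    else:
        return int(s)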
Maybe NumPy has a function that takes a string and returns a single element with the right type. numpy.fromstring() looks close, but it might interpret the space in your timestamps as a column separator.
P.S. One downside I see with csv.reader is that it doesn't discard comments; then again, real csv files don't have comments.
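Put together, the csv-based route might look like the following Python 3 sketch. It uses only the standard library and leaves the final np.array(...) call out; the function name convert, the timestamp format string and the in-memory sample are assumptions based on the data shown in the question.

```python
import csv
import io
from datetime import datetime

def convert(field):
    """Guess a type from the field text: timestamp, float, or int."""
    if "/" in field:
        return datetime.strptime(field, "%m/%d/%Y %H:%M:%S.%f")
    if "." in field:
        return float(field)
    return int(field)

# In-memory stand-in for the quoted CSV file from the question.
sample = io.StringIO(
    '"FieldName1", "FieldName2", "FieldName3", "FieldName4"\n'
    '"04/13/2010 14:45:07.008", "7.59484916392", "10", "6.552373"\n'
    '"04/13/2010 14:45:22.010", "6.55478493312", "9", "3.5378543"\n'
)

# skipinitialspace makes the reader ignore the blank after each comma,
# so the following double quote is recognised as an opening quote.
reader = csv.reader(sample, skipinitialspace=True)
headings = next(reader)
rows = [[convert(field) for field in row] for row in reader]
print(headings)
print(rows[0])
```

Because the type is guessed per field, this copes with a varying number of columns; a column that mixes, say, "10" and "10.5" would end up with mixed int and float values, so a per-column converter list is safer when the layout is known.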
