I am trying to fetch data from a JSON file and store it in a CSV file. The JSON file is nested and has multiple records under the same key, i.e. 'value'.
Source File:
I tried the code below, which flattens the JSON file, but I am unable to get to the required CSV format.
import pandas as pd
import json

def flatten_json(nested_json):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out

response = json.loads(data)
df = pd.Series(flatten_json(response)).to_frame()
print(df)
Below is the output I get after executing the above code:
0
data_Value_strt_0_col1 John
data_Value_strt_0_col2 David
data_Value_strt_0_col3 Lisa
data_Value_strt_0_col4 None
data_Value_strt_0_col5 None
data_Value_strt_0_data_byValue_0_col3 dev
data_Value_strt_0_data_byValue_0_col6 None
data_Value_strt_0_data_byValue_0_col1 None
data_Value_strt_0_data_byValue_0_data_value_201... 02.22
data_Value_strt_0_data_byValue_0_data_value_2020-1 12.32
data_Value_strt_1_col1 Ram
data_Value_strt_1_col2 Shyam
data_Value_strt_1_col3 Kishore
data_Value_strt_1_col4 None
data_Value_strt_1_col5 None
data_Value_strt_1_data_byValue_0_col3 prd
data_Value_strt_1_data_byValue_0_col6 None
data_Value_strt_1_data_byValue_0_col1 None
data_Value_strt_1_data_byValue_0_data_value_2020-3 12.87
data_Value_strt_1_data_byValue_1_col3 dev-prd
data_Value_strt_1_data_byValue_1_col6 None
data_Value_strt_1_data_byValue_1_col1 None
data_Value_strt_1_data_byValue_1_data_value_201... 3.39
data_Value_strt_1_data_byValue_1_data_value_201... 9.24
I am unable to get to the required format using the above code, since there is nesting and multiple values for the key 'value'.
The following works for the data you've provided. It's possible that this may not work if there is more data you haven't shown, and the format changes:
import json
import csv

data = ...

info = json.loads(data)["data"]["Value"]["strt"]
fieldnames = ["Name1", "Name2", "Name3", "Col_4", "Col_5", "Val_Col3", "Val_Col6", "Val_Col1", "Val_Year", "Val_Month", "Value"]

with open("output.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(fieldnames)
    for d1 in info:
        for d2 in d1["data"]["byValue"]:
            for key, value in d2["data"]["value"].items():
                year, month = key.split("-")
                row = [d1["col1"], d1["col2"], d1["col3"], d1["col4"], d1["col5"], d2["col3"], d2["col6"], d2["col1"], year, month, value]
                writer.writerow(row)
This will write to a CSV file in the format you specified. None values are written to the file as empty strings by the csv.writer object. If you want to introduce whitespace into the CSV file, so that the delimiters line up, you may have to make some changes.
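As a quick illustration of that None-to-empty-string behaviour, here is a minimal sketch using only the standard library:

import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["John", None, 12.32])
print(buf.getvalue())  # John,,12.32 -- None comes out as an empty field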
Related
I am trying to make a program that can save the results of a filtered JSON file as a CSV. Right now my function only saves the keys of the JSON to the CSV file.
Ideally I want the function to take two arguments: the column (key) it is searching in, and the item (value) it is searching for.
This is my current function:
def save_csv(key, value):
    with open('db.json') as json_file:
        info = json.load(json_file)
    test = info['data']
    csv_file = open('test.csv', 'w')
    csv_writer = csv.writer(csv_file)
    count = 0
    for e in test:
        if count == 0:
            header_csv = e.keys()
            csv_writer.writerow(header_csv)
            count += 1
        for e in key:
            if e == value:
                csv_writer.writerow(e.values())
    csv_file.close()
How could I change this function to make it save the filtered results in a CSV?
No matter what changes I try to make, it will only save the keys to the header of the CSV. None of the results I am filtering for will save to the CSV.
def save_csv(key, value):
    with open('db.json') as json_file:
        info = json.load(json_file)
    test = info['data']
    with open('test.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        for n, v in enumerate(test):
            if not n:
                header_csv = v.keys()
                csv_writer.writerow(header_csv)
            if key in v and v.get(key) == value:
                csv_writer.writerow(v.values())
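A hypothetical call, assuming db.json holds something like {"data": [{"name": "Ada", "role": "dev"}, ...]} (the file contents were never shown, so the keys here are made up):

save_csv('role', 'dev')  # writes the header row plus only the records where role == 'dev'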
I am trying to take a CSV and create a list of dictionaries in Python, with the CSV coming from S3. Code is as follows:
import os
import boto3
import csv
import json
from io import StringIO
import logging
import time

s3 = boto3.resource('s3')
s3Client = boto3.client('s3', 'us-east-1')
bucket = 'some-bucket'
key = 'some-key'

obj = s3Client.get_object(Bucket=bucket, Key=key)
lines = obj['Body'].read().decode('utf-8').splitlines(True)
newl = []
for line in csv.reader(lines, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar="\\"):
    newl.append(line)
fieldnames = newl[0]
newl1 = newl[1:]
reader = csv.DictReader(newl1, fieldnames)
out = json.dumps([row for row in reader])
jlist1 = json.loads(out)
but this gives me the error:
iterator should return strings, not list (did you open the file in text mode?)
if I alter the for loop to this:
for line in csv.reader(lines, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, escapechar="\\"):
    newl.append(','.join(line))
then it works; however, some fields have commas in them, so this completely screws up the schema and shifts the data. For example:
|address1 |address2 |state|
------------------------------
|123 Main st|APT 3, Fl1|TX |
becomes:
|address1 |address2 |state|null|
-----------------------------------
|123 Main st|APT 3 |Fl1 |TX |
Where am I going wrong?
The problem is that you are building a list of lists here:
newl.append(line)
and as the error says: iterator should return strings, not list. So try to cast line to a string:
newl.append(str(line))
Hope this helps :)
I ended up changing the code to this:
obj = s3Client.get_object(Bucket=bucket, Key=key)
lines1 = obj['Body'].read().decode('utf-8').split('\n')
fieldnames = lines1[0].replace('"', '').split(',')
testls = [row for row in csv.DictReader(lines1[1:], fieldnames)]
out = json.dumps([row for row in testls])
jlist1 = json.loads(out)
And got the desired result
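For what it's worth, the round-trip through csv.reader wasn't needed at all: csv.DictReader accepts the raw lines directly and keeps quoted commas intact. A minimal sketch, reusing the address example from the question:

import csv
import json

lines = ['address1,address2,state',
         '123 Main st,"APT 3, Fl1",TX']
rows = list(csv.DictReader(lines))
print(json.dumps(rows, indent=2))  # address2 survives as "APT 3, Fl1"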
I am trying to create a generic filter that splits a file based on conditions from a YAML file.
My code works with Pandas, but the target environment does not have the Pandas module, so I am trying to achieve the same with the csv library.
When I hard-code the value of q it works, but when I pass it from the config file it does not. I also want to apply multiple checks to the same column, like ('', 'Balance'), so that 'Asset' rows go to one file and ('', 'Balance') rows to another.
import sys
import yaml
import csv

def dynamicQuery(config_file, data_file):
    """Loading Configuration file into dataframe"""
    try:
        with open(config_file) as file:
            doc = yaml.full_load(file)
    except Exception as err:
        print("Error Configuration data file: ", err)

    try:
        for k, v in doc.items():
            if k != 'column':
                filename = k
                k = doc[k]
                q = ' , '.join(f'{v} ' for q, v in k.items())
                q = '"' + q.strip() + '"'
                print(q)  # -- "Asset"
                df = csv.reader(open(data_file), delimiter=',')
                df = filter(lambda x: (x[2] == q), df)  # Not working here
                # df = filter(lambda x: x[2] == "Asset", df) --> this is working
                csv.writer(open(filename + ".txt", 'w', newline=''), delimiter=',').writerows(df)
                print("File is created for " + filename)
    except Exception as err:
        print("Error executing queries and saving output data file: ", err)

def main():
    if len(sys.argv) == 3:
        """File will be passed as parameter"""
        config_file = sys.argv[1]
        data_file = sys.argv[2]
        dynamicQuery(config_file, data_file)
    else:
        usage()

def usage():
    print("Usage: python splitGenric.py config_file data_file ")

main()
Sample file
1233,ACV,Asset,sample
1235,ACV,Asset,sample
1232,ACV,Asset,sample
1234,ACV,Asset,sample
1237,ACV,,sample
1238,ACV,,sample
1234,ACV,Balance,sample
1254,ACV,Balance,sample
1244,ACV,Balance,sample
1264,ACV,Balance,sample
Config.yaml
Asset:
    filter1: '"Asset"'
Balance:
    filter1: '"Balance"'
    filter2: '""'
The YAML configuration file format is not particularly convenient for this, and yaml is not a standard Python module; I would probably go for something like regular expressions instead of a YAML file. But just to sort out the immediate problem: you are mixing up Python syntax and literal quoting characters. You are assembling a string containing literal double quotes around Asset, for example, where your CSV file does not contain double quotes around this value; so you are effectively comparing 'Asset' == '"Asset"', which of course is False.
The following might not do exactly what you want, but should at least demonstrate a rough first cut of what I think you are trying to do here.
with open(config_file) as file:
    config = yaml.full_load(file)

filters = dict()
for k, v in config.items():
    handle = open(k + '.txt', 'w', newline='')
    writer = csv.writer(handle, delimiter=',')
    filt = {'handle': handle, 'writer': writer, 'conditions': []}
    for _, expr in v.items():
        filt['conditions'].append(expr.strip('"'))
    filters[k] = filt

with open(data_file) as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        for handle, conf in filters.items():
            for i in range(len(conf['conditions'])):
                if row[2] == conf['conditions'][i]:
                    conf['writer'].writerow(row)
                    break

for handle, conf in filters.items():
    conf['handle'].close()
I'm guessing you used pyyaml which seems to be the dominant YAML module for Python.
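For reference, loading the corrected Config.yaml with PyYAML would look roughly like this (yaml.safe_load is generally preferred over full_load when the config isn't fully trusted):

import yaml

with open('Config.yaml') as fh:
    config = yaml.safe_load(fh)
print(config)
# {'Asset': {'filter1': '"Asset"'}, 'Balance': {'filter1': '"Balance"', 'filter2': '""'}}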
I tried to use the config.yaml, but I got this error:
File "C:\Users\XXXXXX\AppData\Local\Programs\Python\Python36-32\lib\site-packages\yaml\parser.py", line 439, in parse_block_mapping_key
"expected <block end>, but found %r" % token.id, token.start_mark)
yaml.parser.ParserError: while parsing a block mapping
in "config.yml", line 5, column 5
expected <block end>, but found ','
in "config.yml", line 5, column 17
But I will pretend it worked and the content was loaded in a dictionary, as it appears to be the intention.
The dictionary is as follows:
doc = {'Asset': 'Asset', 'Balance': ['', 'Balance']}  # '' (empty) stands in for the null filter

import csv
import pandas as pd  # this answer still leans on pandas for the null handling

# load directly to dataframe
df = pd.read_csv('sample.txt', header=None)
handler = ''
for k, v in doc.items():
    kList = {k: []}  # make an empty list keyed by k
    if isinstance(v, str):  # Asset is a string
        fil = v
    else:
        for i in range(len(v)):  # Balance is a list of values
            if v[i]:
                fil = v[i]
            else:
                handler = k  # remember which file receives the null rows
    for types in df.values:
        if fil in types:
            kList[k].append(types)  # append matching rows to the corresponding list
    csv.writer(open(k + ".txt", 'a', newline='\n'), delimiter=',').writerows(kList[k])

if handler:  # there are null values
    nulls = df[df.isnull().any(axis=1)].values.tolist()
    csv.writer(open(handler + ".txt", 'a', newline='\n'), delimiter=',').writerows(nulls)
The result is two files, with the following contents:
Asset.txt:
1233,ACV,Asset,sample
1235,ACV,Asset,sample
1232,ACV,Asset,sample
1234,ACV,Asset,sample
Balance.txt:
1234,ACV,Balance,sample
1254,ACV,Balance,sample
1244,ACV,Balance,sample
1264,ACV,Balance,sample
1237,ACV,nan,sample
1238,ACV,nan,sample
I'm using the json_to_csv_converter code to convert the Yelp dataset to CSV files (available here: https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py).
This code was originally written for Python 2, but I'm using Python 3; I've made some changes so it now works with Python 3, except that I'm getting b' preceding the strings (which indicates a byte sequence).
I've added encoding='utf-8' to convert it to a string, but my CSV file still shows the b''.
Example: business_id
b'7KPBkxAOEtb3QeIL9PEErg'
What do I need to change to make it write strings instead of bytes?
Thanks
# -*- coding: utf-8 -*-
"""Convert the Yelp Dataset Challenge dataset from json format to csv.

For more information on the Yelp Dataset Challenge please visit http://yelp.com/dataset_challenge
"""
import argparse
import collections.abc  # MutableMapping lives in collections.abc on Python 3.3+
import csv
import json


def read_and_write_file(json_file_path, csv_file_path, column_names):
    """Read in the json dataset file and write it out to a csv file, given the column names."""
    with open(csv_file_path, 'w', newline='', encoding='utf-8') as fout:
        csv_file = csv.writer(fout)
        csv_file.writerow(list(column_names))
        with open(json_file_path, encoding='utf-8') as fin:
            for line in fin:
                line_contents = json.loads(line)
                csv_file.writerow(get_row(line_contents, column_names))


def get_superset_of_column_names_from_file(json_file_path):
    """Read in the json dataset file and return the superset of column names."""
    column_names = set()
    with open(json_file_path, encoding='utf-8') as fin:
        for line in fin:
            line_contents = json.loads(line)
            column_names.update(
                set(get_column_names(line_contents).keys())
            )
    return column_names


def get_column_names(line_contents, parent_key=''):
    """Return a dict of flattened key names given a dict.

    Example:

        line_contents = {
            'a': {
                'b': 2,
                'c': 3,
            },
        }

    will return: ['a.b', 'a.c']

    These will be the column names for the eventual csv file.
    """
    column_names = []
    for k, v in line_contents.items():
        column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            column_names.extend(
                get_column_names(v, column_name).items()
            )
        else:
            column_names.append((column_name, v))
    return dict(column_names)


def get_nested_value(d, key):
    """Return a dictionary item given a dictionary `d` and a flattened key from `get_column_names`.

    Example:

        d = {
            'a': {
                'b': 2,
                'c': 3,
            },
        }
        key = 'a.b'

    will return: 2
    """
    if '.' not in key:
        if key not in d:
            return None
        return d[key]
    base_key, sub_key = key.split('.', 1)
    if base_key not in d:
        return None
    sub_dict = d[base_key]
    return get_nested_value(sub_dict, sub_key)


def get_row(line_contents, column_names):
    """Return a csv compatible row given column names and a dict."""
    row = []
    for column_name in column_names:
        line_value = get_nested_value(
            line_contents,
            column_name,
        )
        if isinstance(line_value, str):
            row.append('{0}'.format(line_value.encode('utf-8')))
        elif line_value is not None:
            row.append('{0}'.format(line_value))
        else:
            row.append('')
    return row


if __name__ == '__main__':
    """Convert a yelp dataset file from json to csv."""
    parser = argparse.ArgumentParser(
        description='Convert Yelp Dataset Challenge data from JSON format to CSV.',
    )
    parser.add_argument(
        'json_file',
        type=str,
        help='The json file to convert.',
    )
    args = parser.parse_args()
    json_file = args.json_file
    csv_file = '{0}.csv'.format(json_file.split('.json')[0])
    column_names = get_superset_of_column_names_from_file(json_file)
    read_and_write_file(json_file, csv_file, column_names)
Just a guess:
if isinstance(line_value, str):
    row.append('{0}'.format(line_value.encode('utf-8')))
If the value is str you don't need to encode it in Python 3 - all strings in Python 3 are unicode. You probably should check if the value is an instance of bytes instead.
if isinstance(line_value, bytes):
    row.append('{0}'.format(line_value.decode('utf-8')))
[update]
No, that line is checking if it is string versus number... so str is correct – Luluperam
Are you sure? Let's say line_value is the string "foo":
line_value = 'foo'
Now try this:
>>> row = []
>>> if isinstance(line_value, str):
...     row.append('{0}'.format(line_value.encode('utf-8')))
...
>>> print(row)
["b'foo'"]
That is the source of your bytes literal in the CSV file. Now let's try the version I so kindly suggested before it was dismissed:
>>> line_value = b'foo'
>>> row = []
>>> if isinstance(line_value, bytes):
...     row.append('{0}'.format(line_value.decode('utf-8')))
...
>>> print(row)
['foo']
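Putting that together: since json.loads already yields str values in Python 3, the simplest fix is to drop the encode call entirely. A sketch of what get_row could look like under that reading (my suggestion, not the original converter's code):

def get_row(line_contents, column_names):
    """Return a csv-compatible row given column names and a dict (Python 3 sketch)."""
    row = []
    for column_name in column_names:
        line_value = get_nested_value(line_contents, column_name)
        if line_value is None:
            row.append('')
        else:
            # str values pass through unchanged; csv.writer formats numbers itself
            row.append('{0}'.format(line_value))
    return row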
I have a CSV file where each record is a LinkedIn contact. I have to create another CSV file containing only the contacts reached after a specific date (e.g. all the contacts connected to me after 1/04/2017).
So this is my implementation:
def import_from_csv(file):
    key_order = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            single_person = {"FirstName": row["FirstName"], "LastName": row["LastName"],
                             "EmailAddress": row["EmailAddress"], "Company": row["Company"],
                             "ConnectedOn": parser.parse(row["ConnectedOn"])}
            od = OrderedDict((k, single_person[k]) for k in key_order)
            linkedin_contacts.append(od)
    return linkedin_contacts
The first function gives me a list of ordered dicts. I don't know if the way I used to achieve the correct order is good; also, looking at some examples (like here), I'm not using the od.update method, but I don't think I need it. Is that correct?
Now I wrote a second function to filter the list:
def filter_by_date(connections):
    filtered_list = []
    target_date = parser.parse("01/04/2017")
    for row in connections:
        if row["ConnectedOn"] > target_date:
            filtered_list.append(row)
    return filtered_list
Am I doing this correctly?
Is there a way to optimize the code? Thanks
First point: you don't need the OrderedDict at all, just use a csv.DictWriter to write the filtered csv.
fieldnames = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")

with open("/path/to/final.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames)
    writer.writeheader()
    writer.writerows(filtered_contacts)
Second point: you don't need to create a new dict from the one yielded by the csv reader; just update the ConnectedOn key in place:
def import_from_csv(file):
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            row["ConnectedOn"] = parser.parse(row["ConnectedOn"])
            linkedin_contacts.append(row)
    return linkedin_contacts
And finally, if all you have to do is take the source csv, filter records on ConnectedOn, and write the result, you don't need to load the whole source in memory, build a filtered list (in memory again), and write it out; you can stream the whole operation:
def filter_csv(source_path, dest_path, date):
    fieldnames = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")
    target = parser.parse(date)
    # text mode (Python 3): the csv module wants str, not bytes
    with open(source_path, newline="") as source, open(dest_path, "w", newline="") as dest:
        reader = csv.DictReader(source)
        writer = csv.DictWriter(dest, fieldnames)
        # if you want a header line with the fieldnames - else comment it out
        writer.writeheader()
        for row in reader:
            row_date = parser.parse(row["ConnectedOn"])
            if row_date > target:
                writer.writerow(row)
And here you are, plain and simple.
NB: I don't know what parser.parse() is, but as other answers mention, you'd probably be better off using the datetime module instead.
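As an aside, parser.parse here is presumably dateutil.parser.parse (an assumption; the imports were never shown). One thing worth knowing about it: for ambiguous dates it defaults to month-first, which matters for a cutoff like 1/04/2017:

from dateutil import parser

print(parser.parse("01/04/2017"))                 # 2017-01-04 00:00:00 (January 4th!)
print(parser.parse("01/04/2017", dayfirst=True))  # 2017-04-01 00:00:00 (April 1st)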
For filtering you could use the filter() function:
def filter_by_date(connections):
    # "01/04/2017" is day-first, so the format must be '%d/%m/%Y'
    target_date = datetime.strptime("01/04/2017", '%d/%m/%Y').date()
    return list(filter(lambda x: x["ConnectedOn"] > target_date, connections))
And instead of creating a simple dict and then filling its values into an OrderedDict, you could write the values directly to the OrderedDict:
for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], '%d/%m/%Y').date()  # assumes dd/mm/yyyy in the CSV
    linkedin_contacts.append(od)
If you know the date format you don't need python_dateutil; you can use the built-in datetime.datetime.strptime() with the needed format.
Because you didn't specify the format string, use:
from datetime import datetime

format = '%d/%m/%Y'
date_text = '01/04/2017'

# inverse by datetime.strftime(format)
datetime.strptime(date_text, format)

# ....
# with format as a global
for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], format)
    linkedin_contacts.append(od)
Do:
def filter_by_date(connections, date_text):
    target_date = datetime.strptime(date_text, format)
    return [x for x in connections if x["ConnectedOn"] > target_date]
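A hypothetical end-to-end call, assuming the OrderedDict loop above lives inside the asker's import_from_csv (the file name here is made up):

linkedin_contacts = import_from_csv('Connections.csv')
recent = filter_by_date(linkedin_contacts, '01/04/2017')
print(len(recent), "contacts connected after the cutoff")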