How to convert nested JSON to CSV with pandas - Python

I have a nested json file (100k rows), which looks like this:
{"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
I am trying to create a CSV so that it can easily be loaded into an RDBMS. I am trying to use json_normalize() in pandas, but even before I get there I am getting the error below.
with open('transactions.json') as data_file:
    data = json.load(data_file)
JSONDecodeError: Extra data: line 2 column 1 (char 466)

If your problem originates in reading the JSON file itself, then I would just use:
json.loads()
and then use
pd.read_csv()
If your problem originates in the conversion from your json dict to dataframe you can use this:
test = {"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
import json
import pandas as pd

# convert json to string and read
df = pd.read_json(json.dumps(test), convert_axes=True)
# 'unpack' the dict as series and merge them with original df
df = pd.concat([df, df.Segment.apply(pd.Series)], axis=1)
# remove dict
df.drop('Segment', axis=1, inplace=True)
That would be my approach but there might be more convenient approaches.
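One such alternative, since the question mentions json_normalize(): the sketch below assumes pandas 1.0+ (older versions expose the function as pandas.io.json.json_normalize) and produces one output row per segment rather than one per transaction, with the top-level fields repeated as meta columns. The column names are taken from the sample record above.

import json
import pandas as pd

# Read the JSON-lines file: one record per line.
with open('transactions.json', encoding='utf8') as f:
    records = [json.loads(line) for line in f]

# Flatten: one row per segment, repeating the top-level fields as meta columns.
df = pd.json_normalize(
    records,
    record_path='Segment',
    meta=['UniqueId', 'TransactionDateUTC', 'Itinerary',
          'OriginAirportCode', 'DestinationAirportCode', 'OneWayOrReturn'],
)
df.to_csv('transactions.csv', index=False)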

Step one: loop over a file of records
Since your file has one JSON record per line, you need to loop over all the records in your file, which you can do like this:
with open('transactions.json', encoding="utf8") as data_file:
    for line in data_file:
        data = json.loads(line)
        # or
        df = pd.read_json(line, convert_axes=True)
        # do something with data or df
Step two: write the CSV file
Now, you can combine this with a csv.writer to convert the file into a CSV file.
with open('transactions.csv', "w", encoding="utf8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    # Loop over each record, somehow:
    # row = build list with row contents
    writer.writerow(row)
Putting it all together
I'll read the first record once to get the keys to display and output them as a CSV header, and then I'll read the whole file and convert it one record at a time:
import copy
import csv
import json

# Read the first JSON record to get the keys that we'll use as headers for the CSV file
with open('transactions.json', encoding="utf8") as data_file:
    keys = list(json.loads(next(data_file)).keys())

# Our CSV headers are going to be the keys from the first row, except for
# Segment, which we'll replace (arbitrarily) by three numbered segment column
# headings.
keys.pop()  # drop the trailing "Segment" key
base_keys = copy.copy(keys)
keys.extend(["Segment1", "Segment2", "Segment3"])

with open('transactions.csv', "w", encoding="utf8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(keys)  # Write the CSV headers
    with open('transactions.json', encoding="utf8") as data_file:
        for line in data_file:
            data = json.loads(line)
            row = [data[k] for k in base_keys] + data["Segment"]
            writer.writerow(row)
The resulting CSV file will still have a raw JSON record in each SegmentN column. If you want to format each segment differently, you could define a format_segment(segment) function and replace data["Segment"] with the list comprehension [format_segment(segment) for segment in data["Segment"]], as sketched below.
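For example, a hypothetical format_segment that reduces each segment dict to a compact "departure-arrival" string could look like this (the field names come from the sample record in the question):

def format_segment(segment):
    # Turn one segment dict into a "MUC-CPH" style string.
    return segment["DepartureAirportCode"] + "-" + segment["ArrivalAirportCode"]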


How to convert a CSV file into JSON in Python so that the CSV headers are the keys of every JSON value

I have this use case
please create a function called “myfunccsvtojson” that takes in a filename path to a csv file (please refer to attached csv file) and generates a file that contains streamable line delimited JSON.
• Expected filename will be based on the csv filename, i.e. Myfilename.csv will produce Myfilename.json or File2.csv will produce File2.json. Please show this in your code and should not be hardcoded.
• csv file has 10000 lines including the header
• output JSON file should contain 9999 lines
• Sample JSON lines from the csv file below:
CSV:
nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0043044,tt0050419,tt0053137"
nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0071877,tt0038355,tt0117057,tt0037382"
nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0057345,tt0059956,tt0049189,tt0054452"
JSON lines:
{"nconst":"nm0000001","primaryName":"Fred Astaire","birthYear":1899,"deathYear":1987,"primaryProfession":"soundtrack,actor,miscellaneous","knownForTitles":"tt0072308,tt0043044,tt0050419,tt0053137"}
{"nconst":"nm0000002","primaryName":"Lauren Bacall","birthYear":1924,"deathYear":2014,"primaryProfession":"actress,soundtrack","knownForTitles":"tt0071877,tt0038355,tt0117057,tt0037382"}
{"nconst":"nm0000003","primaryName":"Brigitte Bardot","birthYear":1934,"deathYear":null,"primaryProfession":"actress,soundtrack,producer","knownForTitles":"tt0057345,tt0059956,tt0049189,tt0054452"}
What I am not able to understand is how the header can be used as a key for every value in the JSON.
Has anyone come across this scenario and can help me out?
This is what I was trying; I know the loop is not correct, but I am still figuring it out:
with open(file_name, encoding='utf-8') as file:
    csv_data = csv.DictReader(file)
    csvreader = csv.reader(file)
    # print(csv_data)
    keys = next(csvreader)
    print(keys)
    for i, Value in range(len(keys)), csv_data:
        data[keys[i]] = Value
        print(data)
You can convert your CSV to a pandas DataFrame and output it as JSON:
import pandas as pd

df = pd.read_csv('data.csv')
df.to_json(orient='records')
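Since the task asks for line-delimited JSON and an output name derived from the input name, a hedged sketch of the requested myfunccsvtojson could look like the following. Treating \N as a missing value is an assumption, and numeric columns that contain missing values will come out as floats unless you convert them further.

import os
import pandas as pd

def myfunccsvtojson(csv_path):
    # Myfilename.csv -> Myfilename.json (derived, not hardcoded)
    json_path = os.path.splitext(csv_path)[0] + '.json'
    # Assumption: \N marks missing values and should become null in the JSON.
    df = pd.read_csv(csv_path, na_values=[r'\N'])
    # lines=True writes one JSON object per line (streamable, line-delimited JSON).
    df.to_json(json_path, orient='records', lines=True)

myfunccsvtojson('Myfilename.csv')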
import csv
import json

def csv_to_json(csv_file_path, json_file_path):
    data_dict = []
    with open(csv_file_path, encoding='utf-8') as csv_file_handler:
        csv_reader = csv.DictReader(csv_file_handler)
        for rows in csv_reader:
            data_dict.append(rows)
    with open(json_file_path, 'w', encoding='utf-8') as json_file_handler:
        json_file_handler.write(json.dumps(data_dict, indent=4))

csv_to_json("/home/devendra/Videos/stackoverflow/Names.csv", "/home/devendra/Videos/stackoverflow/Names.json")

Sorting CSV file and saving result as a CSV

I'd like to take a CSV file, sort it, and then save it as a CSV. This is what I have so far, but I can't figure out how to write the result to a CSV file:
import csv

with open('test.csv', 'r') as f:
    sample = csv.reader(f)
    sort = sorted(sample)

for eachline in sort:
    print(eachline)
You don't need pandas for something simple like this:
import csv

# Read the input file and sort it
with open('input.csv') as f:
    data = sorted(csv.reader(f))

# Write to the output file
with open('output.csv', 'w', newline='\n') as f:
    csv.writer(f).writerows(data)
Lists (and tuples) in Python sort lexicographically, meaning they sort by the first value and, if those are equal, by the second, and so on. You can supply a key function to sorted() to sort by a specific column instead; see the sketch below.
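For instance, a minimal sketch of sorting by one particular column, here (hypothetically) the third column parsed as a float, while keeping the header row in place:

import csv

with open('input.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)                                   # keep the header row aside
    rows = sorted(reader, key=lambda row: float(row[2]))    # sort by the third column

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)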
I think something like this should do the trick:
import pandas as pd

path = "C:/Your/file/path/file.csv"
df = pd.read_csv(path)
df = df.sort_values("variablename_by_which_to_sort", axis=0, ascending=True)  # or ascending=False
df.to_csv(path, index=False)  # index=False keeps pandas from adding an index column

How to convert a csv file into a json formatted file?

Most of the samples here show hard-coded columns and not an iteration. I have 73 columns I want iterated and expressed properly in the JSON.
import csv
import json

CSV_yearly = r'C:\path\yearly.csv'
JSON_yearly = r'C:\path\json_yearly.json'

with open(CSV_yearly, 'r') as csv_file:
    reader = csv.DictReader(csv_file)
    with open(JSON_yearly, 'w') as json_file:
        for row in reader:
            json_file.write(json.dumps(row) + ',' + '\n')

print "done"
Though this creates a JSON file, it does so improperly. I saw examples where an argument to the reader takes a list, but I don't want to type out 73 column names from the CSV. My guess is that the missing line of code goes between the start of the with block and the reader.
Your code creates each line in the file as a separate JSON object (sometimes called JSONL or json-lines format). Collect the rows in a list and then serialise it as JSON:
with open(CSV_yearly, 'r') as csv_file:
    reader = csv.DictReader(csv_file)
    with open(JSON_yearly, 'w') as json_file:
        rows = list(reader)
        json.dump(rows, json_file)
Note that some consumers of JSON expect an object rather than a list as an outer container, in which case your data would have to be
rows = {'data': list(reader)}
Update: questions from comments
Do you know why the result did not order my columns accordingly?
csv.DictReader uses a standard Python dictionary to create rows, so the order of keys is arbitrary in Python versions before 3.7. If key order must be preserved, try using an OrderedDict:
import csv
from collections import OrderedDict

out = []
with open('mycsv.csv', newline='') as f:
    reader = csv.reader(f)
    headings = next(reader)  # Assumes first row is headings, otherwise supply your own list
    for row in reader:
        od = OrderedDict(zip(headings, row))
        out.append(od)

# dump out to file using json module
Be aware that while this may output json with the required key order, consumers of the json are not required to respect it.
Do you also know why my values in the JSON were converted into strings rather than remaining as numbers (i.e., without quotes)?
All values read from a CSV file are strings. If you want different types, you need to perform the necessary conversions after reading from the CSV file; a small sketch follows.
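A minimal sketch of such a conversion, assuming hypothetical column names "name" and "count", where "count" should become an int:

import csv

with open('mycsv.csv', newline='') as f:
    rows = list(csv.DictReader(f))

for row in rows:
    row['count'] = int(row['count'])  # every value read from the csv starts out as a str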

Save tuple in a list into a new file [duplicate]

I have data written in a csv file in the format below:
[(789,255,25,33.0),(855,275,25,33.0)............]
I want it to be converted into a format like:
1. 789,255,25,33.0
2. 855,275,25,33.0
..............
So all I want is to write the tuples in the list to a new CSV file, with each tuple on a new line. The values in the list are strings and I want to convert them to floats as well. How do I accomplish that?
Using the csv module and enumerate.
Ex:
import csv

data = [(789, 255, 25, 33.0), (855, 275, 25, 33.0)]

with open(filename, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for i, line in enumerate(data, 1):
        writer.writerow([i] + list(line))
Using Pandas
import pandas as pd
data = [(789,255,25,33.0),(855,275,25,33.0)]
df = pd.DataFrame(data)
df.to_csv(filename, header=None)
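As for the second part of the question (the values are strings and should become floats), a small hedged sketch, assuming every field is numeric:

raw = [("789", "255", "25", "33.0"), ("855", "275", "25", "33.0")]
# Convert each string field to a float before writing the rows out.
data = [tuple(float(v) for v in row) for row in raw]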

Reading column names alone in a csv file

I have a csv file with the following columns:
id,name,age,sex
Followed by a lot of values for the above columns.
I am trying to read the column names alone and put them inside a list.
I am using DictReader and this gives the correct details:
with open('details.csv') as csvfile:
    i = ["name", "age", "sex"]
    re = csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
But what I want is for the list of columns ("i" in the above case) to be parsed automatically from the input CSV rather than hardcoded in a list.
with open('details.csv') as csvfile:
    rows = iter(csv.reader(csvfile)).next()
    header = rows[1:]
    re = csv.DictReader(csvfile)
    for row in re:
        print row
        for x in header:
            print row[x]
This gives the error
KeyError: 'name'
on the line print row[x]. Where am I going wrong? Is it possible to fetch the column names using DictReader?
Though you already have an accepted answer, I figured I'd add this for anyone else interested in a different solution.
Python's DictReader object in the CSV module (as of Python 2.6 and above) has a public attribute called fieldnames.
https://docs.python.org/3.4/library/csv.html#csv.csvreader.fieldnames
An implementation could be as follows:
import csv

with open('C:/mypath/to/csvfile.csv', 'r') as f:
    d_reader = csv.DictReader(f)

    # get fieldnames from DictReader object and store in list
    headers = d_reader.fieldnames

    for line in d_reader:
        # print value in MyCol1 for each row
        print(line['MyCol1'])
In the above, d_reader.fieldnames returns a list of your headers (assuming the headers are in the top row).
Which allows...
>>> print(headers)
['MyCol1', 'MyCol2', 'MyCol3']
If your headers are in, say the 2nd row (with the very top row being row 1), you could do as follows:
import csv

with open('C:/mypath/to/csvfile.csv', 'r') as f:
    # you can eat the first line before creating DictReader.
    # if no "fieldnames" param is passed into
    # DictReader object upon creation, DictReader
    # will read the upper-most line as the headers
    f.readline()

    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames

    for line in d_reader:
        # print value in MyCol1 for each row
        print(line['MyCol1'])
You can read the header by using the next() function, which returns the next row of the reader's iterable object as a list. Then you can add the rest of the file's content to a list.
import csv

with open("C:/path/to/.filecsv", "rb") as f:
    reader = csv.reader(f)
    i = reader.next()
    rest = list(reader)
Now i has the column's names as a list.
print i
>>>['id', 'name', 'age', 'sex']
Also note that reader.next() does not work in Python 3. Instead use the built-in next() to get the first line of the CSV immediately after reading, like so (note the file must also be opened in text mode, not "rb"):
import csv

with open("C:/path/to/.filecsv", "r") as f:
    reader = csv.reader(f)
    i = next(reader)

print(i)
>>>['id', 'name', 'age', 'sex']
The csv.DictReader object exposes an attribute called fieldnames, and that is what you'd use. Here's example code, followed by input and corresponding output:
import csv

file = "/path/to/file.csv"

with open(file, mode='r', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print([col + '=' + row[col] for col in reader.fieldnames])
Input file contents:
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
00,01,02,03,04,05,06,07,08,09
10,11,12,13,14,15,16,17,18,19
20,21,22,23,24,25,26,27,28,29
30,31,32,33,34,35,36,37,38,39
40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59
60,61,62,63,64,65,66,67,68,69
70,71,72,73,74,75,76,77,78,79
80,81,82,83,84,85,86,87,88,89
90,91,92,93,94,95,96,97,98,99
Output of print statements:
['col0=00', 'col1=01', 'col2=02', 'col3=03', 'col4=04', 'col5=05', 'col6=06', 'col7=07', 'col8=08', 'col9=09']
['col0=10', 'col1=11', 'col2=12', 'col3=13', 'col4=14', 'col5=15', 'col6=16', 'col7=17', 'col8=18', 'col9=19']
['col0=20', 'col1=21', 'col2=22', 'col3=23', 'col4=24', 'col5=25', 'col6=26', 'col7=27', 'col8=28', 'col9=29']
['col0=30', 'col1=31', 'col2=32', 'col3=33', 'col4=34', 'col5=35', 'col6=36', 'col7=37', 'col8=38', 'col9=39']
['col0=40', 'col1=41', 'col2=42', 'col3=43', 'col4=44', 'col5=45', 'col6=46', 'col7=47', 'col8=48', 'col9=49']
['col0=50', 'col1=51', 'col2=52', 'col3=53', 'col4=54', 'col5=55', 'col6=56', 'col7=57', 'col8=58', 'col9=59']
['col0=60', 'col1=61', 'col2=62', 'col3=63', 'col4=64', 'col5=65', 'col6=66', 'col7=67', 'col8=68', 'col9=69']
['col0=70', 'col1=71', 'col2=72', 'col3=73', 'col4=74', 'col5=75', 'col6=76', 'col7=77', 'col8=78', 'col9=79']
['col0=80', 'col1=81', 'col2=82', 'col3=83', 'col4=84', 'col5=85', 'col6=86', 'col7=87', 'col8=88', 'col9=89']
['col0=90', 'col1=91', 'col2=92', 'col3=93', 'col4=94', 'col5=95', 'col6=96', 'col7=97', 'col8=98', 'col9=99']
How about:
with open(csv_input_path + file, 'r') as ft:
    header = ft.readline().strip()  # read only the first line; strip the trailing newline
    header_list = header.split(',')  # returns a list of column names
I am assuming your input file is in CSV format.
If you use pandas instead, it takes more time for a big file because it loads the entire file into memory (a lighter pandas variant is sketched below).
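A sketch of that lighter variant, assuming the file is named details.csv: nrows=0 asks pandas to parse only the header row, not the data.

import pandas as pd

# nrows=0 reads just the header, so the whole file is never loaded into memory.
cols = pd.read_csv('details.csv', nrows=0).columns.tolist()
print(cols)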
I am just mentioning how to get all the column names from a CSV file, using the pandas library. First we read the file:
import pandas as pd
file = pd.read_csv('details.csv')
Then, in order to get all the column names as a list from the input file, use:
columns = list(file.head(0))
Thanking Daniel Jimenez for his perfect solution to fetch column names alone from my csv, I extend his solution to use DictReader so we can iterate over the rows using column names as indexes. Thanks Jimenez.
with open('myfile.csv') as csvfile:
    rest = []
    with open("myfile.csv", "rb") as f:
        reader = csv.reader(f)
        i = reader.next()
        i = i[1:]
    re = csv.DictReader(csvfile)
    for row in re:
        for x in i:
            print row[x]
Here is the code to print only the headers (columns) of the CSV file:
import csv
HEADERS = next(csv.reader(open('filepath.csv')))
print (HEADERS)
Another method with pandas
import pandas as pd
HEADERS = list(pd.read_csv('filepath.csv').head(0))
print (HEADERS)
import pandas as pd
data = pd.read_csv("data.csv")
cols = data.columns
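If you want a plain Python list rather than a pandas Index object, a small follow-up to the snippet above:

cols = data.columns.tolist()
print(cols)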
I literally just wanted the first row of my data, which holds the headers I need, and didn't want to iterate over all my data to get them, so I just did this:
import csv

with open(data, 'r', newline='') as csvfile:
    t = 0
    for i in csv.reader(csvfile, delimiter=',', quotechar='|'):
        if t > 0:
            break
        else:
            dbh = i
        t += 1
Using pandas is also an option. But instead of loading the full file into memory, you can retrieve only the first chunk of it to get the field names, by using an iterator.
import pandas as pd

file = pd.read_csv('details.csv', iterator=True)
column_names_full = file.get_chunk(1)
column_names = [column for column in column_names_full]
print(column_names)
