Trying to convert a big tsv file to json - python

I have a TSV file that I need to convert into a JSON file. I'm using this Python script, which exports an empty JSON file.
import json
data={}
with open('data.json', 'w') as outfile,open("data.tsv","r") as f:
    for line in f:
        sp=line.split()
        data.setdefault("data",[])
    json.dump(data, outfile)

This can be done with pandas, but I am not sure about the performance on a large file:
import pandas as pd
df = pd.read_csv('data.tsv', sep='\t')  # read your tsv file
df.to_json('data.json')  # save it as json; pass orient='values', 'columns', etc. as per your requirements

You never use the sp in your code.
To properly convert the tsv, you should read the first line separately, to get the "column names", then read the following lines and populate a list of dictionaries.
Here's what your code should look like:
import json
data = []
with open('data.json', 'w') as outfile, open("data.tsv", "r") as f:
    # the first line holds the column names
    firstline = f.readline()
    columns = firstline.rstrip("\n").split("\t")
    # readline() already consumed the header, so iterate over the remaining lines
    for line in f:
        values = line.rstrip("\n").split("\t")
        entry = dict(zip(columns, values))
        data.append(entry)
    json.dump(data, outfile)
This will output a file containing a list of tsv rows as objects.
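Since the question mentions a big file, here is a minimal sketch of a streaming variant that never holds all rows in memory. It assumes the first line of the TSV holds the column names, and it writes one JSON object per line (JSON Lines) instead of a single JSON array, which is a different output shape from the answer above. The function name tsv_to_json_lines is hypothetical.

```python
import csv
import json


def tsv_to_json_lines(tsv_path, json_path):
    """Write one JSON object per TSV row; return the number of rows written."""
    count = 0
    # newline="" lets the csv module handle line endings itself
    with open(tsv_path, newline="", encoding="utf-8") as src, \
            open(json_path, "w", encoding="utf-8") as dst:
        # DictReader takes the column names from the first line
        reader = csv.DictReader(src, delimiter="\t")
        for row in reader:
            dst.write(json.dumps(row) + "\n")
            count += 1
    return count
```

Calling tsv_to_json_lines("data.tsv", "data.json") would process the file row by row. Splitting on the tab delimiter, rather than on any whitespace, keeps values that contain spaces intact.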

Related

How to convert csv file into json in python so that the header of csv are keys of every json value

I have this use case
please create a function called “myfunccsvtojson” that takes in a filename path to a csv file (please refer to attached csv file) and generates a file that contains streamable line delimited JSON.
• Expected filename will be based on the csv filename, i.e. Myfilename.csv will produce Myfilename.json or File2.csv will produce File2.json. Please show this in your code and should not be hardcoded.
• csv file has 10000 lines including the header
• output JSON file should contain 9999 lines
• Sample JSON lines from the csv file below:
CSV:
nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0043044,tt0050419,tt0053137"
nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0071877,tt0038355,tt0117057,tt0037382"
nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,producer","tt0057345,tt0059956,tt0049189,tt0054452"
JSON lines:
{"nconst":"nm0000001","primaryName":"Fred Astaire","birthYear":1899,"deathYear":1987,"primaryProfession":"soundtrack,actor,miscellaneous","knownForTitles":"tt0072308,tt0043044,tt0050419,tt0053137"}
{"nconst":"nm0000002","primaryName":"Lauren Bacall","birthYear":1924,"deathYear":2014,"primaryProfession":"actress,soundtrack","knownForTitles":"tt0071877,tt0038355,tt0117057,tt0037382"}
{"nconst":"nm0000003","primaryName":"Brigitte Bardot","birthYear":1934,"deathYear":null,"primaryProfession":"actress,soundtrack,producer","knownForTitles":"tt0057345,tt0059956,tt0049189,tt0054452"}
What I am not able to understand is how the header can be used as the key for every value in the JSON.
Has anyone come across this scenario who can help me out?
This is what I was trying. I know the loop is not correct, but I am still figuring it out:
with open(file_name, encoding = 'utf-8') as file:
    csv_data = csv.DictReader(file)
    csvreader = csv.reader(file)
    # print(csv_data)
    keys = next(csvreader)
    print (keys)
    for i,Value in range(len(keys)), csv_data:
        data[keys[i]] = Value
        print (data)
You can load your CSV into a pandas DataFrame and output it as JSON:
import pandas as pd
df = pd.read_csv('data.csv')
df.to_json('data.json', orient='records', lines=True)  # lines=True writes one JSON object per line
import csv
import json
def csv_to_json(csv_file_path, json_file_path):
    data_dict = []
    with open(csv_file_path, encoding = 'utf-8') as csv_file_handler:
        csv_reader = csv.DictReader(csv_file_handler)
        for rows in csv_reader:
            data_dict.append(rows)
    with open(json_file_path, 'w', encoding = 'utf-8') as json_file_handler:
        json_file_handler.write(json.dumps(data_dict, indent = 4))
csv_to_json("/home/devendra/Videos/stackoverflow/Names.csv", "/home/devendra/Videos/stackoverflow/Names.json")
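The function above writes a single indented JSON array, while the question asks for line-delimited JSON with a file name derived from the input. A minimal sketch of that variant follows. The helper convert_value is hypothetical, and its rules (the \N marker becomes null, all-digit strings become integers) are assumptions read off the sample lines in the question, not a documented specification.

```python
import csv
import json
from pathlib import Path


def convert_value(value):
    """Hypothetical helper: map the null marker to None and digit strings to int."""
    if value is None or value == r"\N":
        return None
    if value.isdecimal():
        return int(value)
    return value


def myfunccsvtojson(csv_path):
    """Write <name>.json next to <name>.csv, one JSON object per data row."""
    csv_path = Path(csv_path)
    json_path = csv_path.with_suffix(".json")  # Myfilename.csv -> Myfilename.json
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(json_path, "w", encoding="utf-8") as dst:
        # DictReader uses the header row as the keys of every record
        for row in csv.DictReader(src):
            record = {key: convert_value(value) for key, value in row.items()}
            # compact separators match the sample lines in the question
            dst.write(json.dumps(record, separators=(",", ":")) + "\n")
    return json_path
```

Because csv.DictReader consumes the header line itself, a 10000-line CSV would yield 9999 output lines.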

How to convert nested json in csv with pandas

I have a nested json file (100k rows), which looks like this:
{"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
I am trying to create a CSV so that it can easily be loaded into an RDBMS. I am trying to use json_normalize() in pandas, but even before I get there I am getting the error below.
with open('transactions.json') as data_file:
    data = json.load(data_file)
JSONDecodeError: Extra data: line 2 column 1 (char 466)
If your problem originates in reading the JSON file itself, then I would just parse each line with:
json.loads()
and then build the DataFrame from the resulting dicts with
pd.DataFrame()
If your problem originates in the conversion from your json dict to dataframe you can use this:
test = {"UniqueId":"4224f3c9-323c-e911-a820-a7f2c9e35195","TransactionDateUTC":"2019-03-01 15:00:52.627 UTC","Itinerary":"MUC-CPH-ARN-MUC","OriginAirportCode":"MUC","DestinationAirportCode":"CPH","OneWayOrReturn":"Return","Segment":[{"DepartureAirportCode":"MUC","ArrivalAirportCode":"CPH","SegmentNumber":"1","LegNumber":"1","NumberOfPassengers":"1"},{"DepartureAirportCode":"ARN","ArrivalAirportCode":"MUC","SegmentNumber":"2","LegNumber":"1","NumberOfPassengers":"1"}]}
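import json
from io import StringIO
import pandas as pd
# convert the dict to a JSON string and read it (wrapping it in StringIO avoids the literal-string deprecation in recent pandas)
df = pd.read_json(StringIO(json.dumps(test)), convert_axes=True)
# 'unpack' the dicts as Series and merge them with the original df
df = pd.concat([df, df.Segment.apply(pd.Series)], axis=1)
# remove the column of dicts
df.drop('Segment', axis=1, inplace=True)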
That would be my approach but there might be more convenient approaches.
Step one: loop over a file of records
Since your file has one JSON record per line, you need to loop over all the records in your file, which you can do like this:
with open('transactions.json', encoding="utf8") as data_file:
    for line in data_file:
        data = json.loads(line)
        # or
        df = pd.read_json(line, convert_axes=True)
        # do something with data or df
Step two: write the CSV file
Now, you can combine this with a csv.writer to convert the file into a CSV file.
with open('transactions.csv', "w", encoding="utf8") as csv_file:
    writer = csv.writer(csv_file)
    # Loop for each record, somehow:
    # row = build list with row contents
    writer.writerow(row)
Putting it all together
I'll read the first record once to get the keys to display and output them as a CSV header, and then I'll read the whole file and convert it one record at a time:
import copy
import csv
import json
# Read the first JSON record to get the keys that we'll use as headers for the CSV file
with open('transactions.json', encoding="utf8") as data_file:
    keys = list(json.loads(next(data_file)).keys())
# Our CSV headers are going to be the keys from the first row, except for
# Segment, which we'll replace (arbitrarily) by three numbered segment column
# headings.
keys.remove("Segment")
base_keys = copy.copy(keys)
keys.extend(["Segment1", "Segment2", "Segment3"])
# newline="" stops the csv module from writing blank lines between rows on Windows
with open('transactions.csv', "w", encoding="utf8", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(keys) # Write the CSV headers
    with open('transactions.json', encoding="utf8") as data_file:
        for line in data_file:
            data = json.loads(line)
            row = [data[k] for k in base_keys] + data["Segment"]
            writer.writerow(row)
The resulting CSV file will still hold a whole segment record in each of the Segment1, Segment2, Segment3 columns. If you want to format each segment differently, you could define a format_segment(segment) function and replace data["Segment"] by this list comprehension: [format_segment(segment) for segment in data["Segment"]]
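As an illustration only, format_segment could look like the sketch below. The "DEP-ARR" output format is an assumption chosen for the example; the answer leaves the format open, and the sample segments are copied from the record in the question.

```python
def format_segment(segment):
    """Hypothetical formatter: turn one segment dict into 'DEP-ARR'."""
    return "{}-{}".format(
        segment["DepartureAirportCode"], segment["ArrivalAirportCode"]
    )


# The two segments from the sample record in the question
segments = [
    {"DepartureAirportCode": "MUC", "ArrivalAirportCode": "CPH",
     "SegmentNumber": "1", "LegNumber": "1", "NumberOfPassengers": "1"},
    {"DepartureAirportCode": "ARN", "ArrivalAirportCode": "MUC",
     "SegmentNumber": "2", "LegNumber": "1", "NumberOfPassengers": "1"},
]
formatted = [format_segment(segment) for segment in segments]
print(formatted)  # ['MUC-CPH', 'ARN-MUC']
```

Any other function that returns a single string or number per segment would work the same way, since csv.writer only needs one cell value per column.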

Read Data from CSV and change it to tuple using Python 3.7

I read data from a CSV file and convert each row to a tuple using the code below:
with open('data.csv') as f:
    data=[tuple(line) for line in csv.reader(f)]
output data is like below:
data=[('A1231',),('B1256',),('A4152',),('D1254',)]
I need to have data like :
data=['A1231','B1256','A4152','D1254']
Don't use tuple; take the first field of each row by index instead:
import csv
with open('data.csv') as f:
    data = [line[0] for line in csv.reader(f)]

Parsing multiple json objects from a text file using Python

I have a .json file where each line is an object. For example, first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I have tried processing using ijson lib as follows:
with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)
However, I get this error:
JSONError: Additional data
It seems I am receiving this error because the file contains multiple objects.
What is the recommended way to analyze such a JSON file in Jupyter?
Thank you in advance.
The file format is not correct if this is the complete file. Between the curly brackets there must be a comma and it should start and end with a square bracket. Like so: [{...},{...}]. For your data it would look like:
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code to clean your file:
# read the non-empty lines, dropping the trailing newlines
with open("yourfile.json","r") as f:
    lines = [line.strip() for line in f if line.strip()]
# join the objects with commas and wrap them in square brackets
with open("cleanfile.json","w") as g:
    g.write("[" + ",\n".join(lines) + "]")
To read a JSON file properly you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).
import pandas as pd
# get a pandas DataFrame object from the json file
# (for a file with one JSON object per line, pass lines=True)
df = pd.read_json("path/to/your/filename.json")
If you are not familiar with pandas, here is a quick head start on how to work with a DataFrame object:
df.head() #gives you the first rows of the dataframe
df["review_id"] # gives you the column review_id as a vector
df.iloc[1,:] # gives you the complete row with index 1
df.iloc[1,2] # gives you the item in row with index 1 and column with index 2
While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you will have to iterate over the lines and parse each one into an object.
You can aggregate these objects in one list, and from there do whatever you like with your data:
import json
with open(filename, 'r') as f:
    object_list = []
    for line in f.readlines():
        object_list.append(json.loads(line))
# object_list will contain all of your file's data
You could do it as a list comprehension to make it a little more Pythonic:
with open(filename, 'r') as f:
    object_list = [json.loads(line)
                   for line in f.readlines()]
# object_list will contain all of your file's data
You have multiple lines in your file, so that's why it's throwing errors:
import json
with open(filename, 'r') as f:
    lines = f.readlines()
    first = json.loads(lines[0])
    second = json.loads(lines[1])
That should catch both lines and load them in properly.
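For a file too large to read with readlines(), the same per-line parsing can be done lazily. This is a minimal sketch, assuming one JSON object per line and UTF-8 encoding; the name iter_json_lines is hypothetical.

```python
import json


def iter_json_lines(path):
    """Yield one parsed object per non-empty line of the file at path."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object is read lazily, line by line
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing empty line
                yield json.loads(line)
```

In a notebook you could write list(iter_json_lines(filename)) to get the full list, or loop over the generator to process one record at a time.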

CSV file to JSON file in Python

I have read quite a lot of posts here and elsewhere, but I can't seem to find the solution. And I do not want to convert it online.
I would like to convert a CSV file to a JSON file (no nesting, even though I might need it in the future) with this code I found here:
import csv
import json
f = open( 'sample.csv', 'r' )
reader = csv.DictReader( f, fieldnames = ( "id","name","lat","lng" ) )
out = json.dumps( [ row for row in reader ] )
print(out)
Awesome, simple, and it works. But I do not get a .json file, only text output that, if I copy and paste it, is one long line.
I need JSON that is readable and, ideally, saved to a .json file.
Is this possible?
To get more readable JSON, try the indent argument in dumps():
print(json.dumps(..., indent=4))
However, to look more like the original CSV file, what you probably want is to encode each row separately, and then join them all up using the JSON array syntax:
out = "[\n\t" + ",\n\t".join([json.dumps(row) for row in reader]) + "\n]"
That should give you something like:
[
{"id": 1, "name": "foo", ...},
{"id": 2, "name": "bar", ...},
...
]
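Put together with writing to a file, the one-object-per-line layout can be sketched as follows. This assumes the CSV has a header row (so no fieldnames argument is passed), and the function name csv_to_json_array is hypothetical. Note that csv.DictReader yields every value as a string, so ids come out as "1" rather than 1 unless you convert them yourself.

```python
import csv
import json


def csv_to_json_array(csv_path, json_path):
    """Write the CSV rows as a JSON array with one object per line."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        # encode each row separately so every object stays on its own line
        rows = [json.dumps(row) for row in csv.DictReader(src)]
    out = "[\n\t" + ",\n\t".join(rows) + "\n]"
    with open(json_path, "w", encoding="utf-8") as dst:
        dst.write(out)
    return out
```

The result is still a single valid JSON document, so json.load() can read it back in one call.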
If you need help writing the result to a file, try this tutorial.
If you want a more readable format of the JSON file, use it like this:
json.dump(output_value, open('filename','w'), indent=4, sort_keys=False)
Here's a full script. This script uses the comma-separated values of the first line as the keys for the JSON output. The output JSON file will be automatically created or overwritten using the same file name as the input CSV file name just with the .csv file extension replaced with .json.
Example CSV file:
id,longitude,latitude
1,32.774,-124.401
2,32.748,-124.424
4,32.800,-124.427
5,32.771,-124.433
Python script:
import csv
import json
csvfile = open('sample.csv', 'r')
jsonfile = open('sample.csv'.replace('.csv', '.json'), 'w')
jsonfile.write('{"' + 'sample.csv'.replace('.csv', '') + '": [\n') # Write JSON parent of data list
fieldnames = csvfile.readline().replace('\n','').split(',') # Get fieldnames from first line of csv
num_lines = sum(1 for line in open('sample.csv')) - 1 # Count total lines in csv minus header row
reader = csv.DictReader(csvfile, fieldnames)
i = 0
for row in reader:
    i += 1
    json.dump(row, jsonfile)
    if i < num_lines:
        jsonfile.write(',')
    jsonfile.write('\n')
jsonfile.write(']}')
# close both files so the output is flushed to disk
csvfile.close()
jsonfile.close()
