Reading data from JSON file and passing to data frame - python

I have a JSON file data.json which contains a large number of Twitter records. I am trying to load this JSON file in Jupyter and then transfer it to a pandas data frame for further analysis. So far I have written the following code:
A sample tweet: '{"_id":{"$oid":"5ec248611c9b498cdbf095a1"},"created_at":"Mon Dec 31 23:19:39 +0000 2018","id":{"$numberLong":"1079879790738325504"},"id_str":"1079879790738325504","text":"NPAF's Artist in Residence, Composer Glenn McClure is at the Park at work on his unusual sonification compostions
import json
import csv

json_file = "\\Users\\data.json"
header = ["id_str", "created_at", "lang", "text"]
tweets_processed = 0
with open(json_file, 'r') as infile:
    print("json file: ", json_file)
    for line in infile:
        tweet = json.loads(line)
        #row = [tweet['id_str'], tweet['created_at'], tweet['lang'], tweet['text']]
        #csvwriter.writerow(row)
        tweets_processed += 1
        #print("tweets processed: ", tweets_processed)
This is the code I have written so far, basically to read my JSON file. Any help on how to get the JSON data into a pandas dataframe? Thanks in advance.

You never imported the name csvwriter into the namespace.
In this instance, you likely want csv.writer(): create a writer object with csvwriter = csv.writer(outfile) and then call csvwriter.writerow() on it. Alternatively, if you are really trying to use the csvwriter package (which I doubt), then you need to add import csvwriter to the top of the file.
The takeaway is to read the docs of the package you are trying to use and to import everything into the proper namespace.
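For reference, a minimal csv.writer sketch; the output file name and the row values here are made up for illustration:

```python
import csv

# Open an output file and get a writer object from the csv module;
# newline='' prevents blank lines on Windows.
with open("tweets.csv", "w", newline="") as outfile:
    csvwriter = csv.writer(outfile)
    csvwriter.writerow(["id_str", "created_at", "lang", "text"])  # header row
    csvwriter.writerow(["1079879790738325504",
                        "Mon Dec 31 23:19:39 +0000 2018",
                        "en",
                        "sample text"])  # one data row
```

The writer object, not the csv module itself, owns writerow(); that is the missing piece in the code above.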

The easiest way to do this is to use pandas' own read_json() function:
import pandas as pd
dataset = pd.read_json('file.json')
The file must be in a format that read_json() can parse.
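For the tweet dump in the question, each line is its own JSON object (newline-delimited JSON), so lines=True is needed. A small runnable sketch, with io.StringIO standing in for the real file and made-up tweet fields:

```python
import io
import pandas as pd

# Two tweets, one JSON object per line, as in the dump above.
ndjson = io.StringIO(
    '{"id_str": "1", "lang": "en", "text": "first tweet"}\n'
    '{"id_str": "2", "lang": "fr", "text": "second tweet"}\n'
)

# lines=True tells read_json to parse one object per line.
df = pd.read_json(ndjson, lines=True)
print(df.shape)  # → (2, 3)
```

Without lines=True, read_json expects the whole file to be a single JSON document and fails on this format.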


convert excel to json file in python

I am new here and need some help with writing to a JSON file.
I have a dataframe with the values below, which was created by reading an Excel file.
I need to write this to a JSON file with the rows nested under an object key dtls.
Output:
A similar task is considered in the question:
Converting Excel into JSON using Python
Different approaches are possible to solve this problem; I hope this one works for you.
import pandas as pd
import json

df = pd.read_excel('./TfidfVectorizer_sklearn.xlsx')
df.to_json('new_file1.json', orient='records')  # excel to json

# read json and then append details to it
with open('./new_file1.json', 'r') as json_file:
    a = {}
    data = json.load(json_file)
    a['details'] = data

# write new json with details in it
with open("./new_file1.json", "w") as jsonFile:
    json.dump(a, jsonFile)
JSON Output:
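As a side note, the write-then-read-back round trip can be skipped by building the wrapper dict straight from the DataFrame. A sketch with made-up column names (the wrapper key follows the 'details' key used above):

```python
import json
import pandas as pd

# Illustrative stand-in for the DataFrame read from Excel.
df = pd.DataFrame({"term": ["alpha", "beta"], "score": [0.1, 0.9]})

# orient='records' yields a list of row dicts, the same shape
# that df.to_json(orient='records') would have written to disk.
wrapped = {"details": df.to_dict(orient="records")}

with open("new_file1.json", "w") as json_file:
    json.dump(wrapped, json_file)
```

This produces the same nested JSON in one pass instead of three file operations.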

I am trying to do analysis on my data, which is in JSON format. My code so far is below. My question is: how can I join all my data? Please help, I am new to Python.

This is what I have done so far:
import os, json
import pandas as pd
path_to_json = 'C:\\Users\\Mohammed Al kinoon\\Desktop\\Research Data\\VCDB-master\\VCDB-master\\data\\json\\validated'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files)
Output: a list of the .json file names.
You will need to put it in a format the interpreter can read:
import json

# open the file and parse it
with open('filepathhere') as f:
    data = json.load(f)

# inspect the parsed data
print(data)
For pandas, I recommend the following:
data_frames = [pd.read_json(os.path.join(path_to_json, file)) for file in json_files]
combined_df = pd.concat(data_frames).reset_index(drop=True)
(Note that os.listdir() returns bare file names, so each name has to be joined with the directory path before reading.)
This assumes all of the files follow the same format (columns/keys) and can fit in memory. If they follow different formats, you should separate them out into groups that do follow the same format.
If they are too large to fit in memory, I recommend using Spark/pyspark.
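Short of Spark, if the combined data only needs to end up in a single file rather than in memory all at once, the files can also be appended to one CSV as they are read. A runnable sketch; the directory, file names, and column are made up, and the sample files are created inline so the example is self-contained:

```python
import json
import os
import pandas as pd

path_to_json = "json_dir"  # illustrative directory of .json files
os.makedirs(path_to_json, exist_ok=True)

# Create two small sample files so the sketch runs on its own.
with open(os.path.join(path_to_json, "a.json"), "w") as f:
    json.dump([{"x": 1}, {"x": 2}], f)
with open(os.path.join(path_to_json, "b.json"), "w") as f:
    json.dump([{"x": 3}], f)

out_path = "combined.csv"
first = True
for name in sorted(os.listdir(path_to_json)):
    if not name.endswith(".json"):
        continue
    df = pd.read_json(os.path.join(path_to_json, name))
    # Write the header only for the first file, then append rows.
    df.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
    first = False
```

Only one file's DataFrame is held in memory at a time, which scales further than concatenating everything, though Spark remains the better fit for truly large collections.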

Python + Pandas JSON URL call via API: cannot save results as CSV, index problem with scalar values

I am trying to write a simple script that saves API results as CSV. My Python version is 3.8.3 and I am using Windows 10.
I am trying to use pandas: https://pandas.pydata.org
I am trying to get data results from the URL https://barcode.monster/api/3061990141101. I installed pandas to convert the JSON data to CSV, but there is an "index" problem, and none of the answers I found worked:
ValueError: If using all scalar values, you must pass an index
I looked all over Google and forums, and tried adding "index_col=0" and also "index=False".
Below is my python script :
with urllib.request.urlopen("https://barcode.monster/api/3061990141101") as url:
    data = json.loads(url.read().decode())
    print(data)

with open('data.json', encoding='utf-8-sig') as f_input:
    df = pd.read_json(f_input, index_col=0)
    df.to_csv('test.csv', encoding='utf-8', index=False)
I'm sure this is obvious for any Python dev, I'm just learning from scratch. Many thanks.
There are many ways of solving this issue. If you're going to open a text file, you'll need to convert the string to JSON; try using json.load(f). Having done that, you can build a DataFrame, but because the payload is a flat dict of scalar values, you must either pass an index or wrap the data in another object so each value is list-like.
For example:
with open('data.json', "r") as f_input:
    text = f_input.read()
    jsonData = json.loads(text)
    df = pd.DataFrame(jsonData, index=[0])
    df.to_csv('test.csv', encoding='utf-8', index=False)
Or:
with open('data.json', 'r') as f:
    data = json.load(f)
    df = pd.DataFrame({'data': data})
    df.to_csv('test.csv', encoding='utf-8', index=False)
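A third option, should the API ever return nested JSON, is pandas.json_normalize, which flattens a single dict (or a list of dicts) into one row per record. The response shape below is a made-up stand-in for the real barcode payload:

```python
import pandas as pd

# Illustrative payload shaped like a barcode API response.
data = {
    "code": "3061990141101",
    "description": "sample product",
    "company": {"name": "ACME"},
}

# Nested keys become dotted column names, e.g. company.name.
df = pd.json_normalize(data)
df.to_csv("test.csv", index=False)
print(df.columns.tolist())  # → ['code', 'description', 'company.name']
```

Because json_normalize always produces a proper row per record, the "all scalar values" error does not arise.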

How to open .ndjson file in Python?

I have an .ndjson file of 20 GB that I want to open with Python. The file is too big, so I found a way to split it into 50 pieces with an online tool: https://pinetools.com/split-files
Now I get files with the extension .ndjson.000 (and I do not know what that is).
I'm trying to open one as a JSON or as a CSV file to read it into pandas, but it does not work.
Do you have any idea how to solve this?
import json
import pandas as pd
First approach:
df = pd.read_json('dump.ndjson.000', lines=True)
Error: ValueError: Unmatched '"' when decoding 'string'
Second approach:
with open('dump.ndjson.000', 'r') as f:
    my_data = f.read()
print(my_data)
Error: json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 104925061 (char 104925060)
I think the problem is that I have some emojis in my file, so I do not know how to encode them?
ndjson is now supported out of the box with the argument lines=True (note that to_json additionally requires orient='records' when lines=True):
import pandas as pd
df = pd.read_json('/path/to/records.ndjson', lines=True)
df.to_json('/path/to/export.ndjson', orient='records', lines=True)
I think pandas.read_json cannot handle ndjson correctly on its own.
According to this issue you can do something like the following to read it:
import ujson as json
import pandas as pd
records = map(json.loads, open('/path/to/records.ndjson'))
df = pd.DataFrame.from_records(records)
P.S.: All credits for this code go to KristianHolsheimer from the GitHub issue.
The ndjson (newline delimited JSON) format is a json-lines format: each line is a complete JSON document. It is ideal for datasets lacking rigid structure ('non-SQL') where the file size is large enough to warrant multiple files.
You can use pandas:
import pandas as pd
data = pd.read_json('dump.ndjson.000', lines=True)
In case your JSON strings do not contain newlines, you can alternatively use:
import json
with open("dump.ndjson.000") as f:
    data = [json.loads(line) for line in f]
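For the original 20 GB file, read_json can also stream NDJSON in chunks, so the whole file never has to fit in memory and no splitting tool is needed. The path and chunk size below are illustrative, and a tiny sample file is written inline so the sketch is runnable:

```python
import pandas as pd

# Write a tiny NDJSON file so the example runs on its own.
with open("records.ndjson", "w") as f:
    f.write('{"a": 1}\n{"a": 2}\n{"a": 3}\n')

# chunksize (with lines=True) makes read_json return an iterator
# that yields DataFrames of at most `chunksize` rows each.
reader = pd.read_json("records.ndjson", lines=True, chunksize=2)

total_rows = 0
for chunk in reader:
    total_rows += len(chunk)  # process each chunk, then discard it
print(total_rows)  # → 3
```

In a real pipeline, each chunk would be filtered or aggregated before moving on, keeping memory use bounded by the chunk size.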

Read data of a CSV file to create a new CSV file

I have some data in a CSV file. As you can see in the code, I can read the file and print the info I need. The problem comes when I try to create a new CSV file with some of the info from the original one. I would like to save my analyzed info in a new CSV, but I don't know how to use the original info to make a new file.
Data.csv
import csv

with open('Data.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        # row values are strings, so convert before the numeric comparison
        analyzed = (row[0], row[3], float(row[3]) < 0.25)
        print(analyzed)
You probably want to use pandas when it comes to CSV files or table-like data:
import pandas as pd

df_data = pd.read_csv('Data.csv')

# Analyze
for index, row in df_data.iterrows():
    pass

df_data.to_csv('new_Data.csv')
For reading you have several options like
pandas.read_csv
pandas.read_table
(the older pandas.DataFrame.from_csv was deprecated and has been removed from current pandas in favor of read_csv)
and, as you see, use
pandas.DataFrame.to_csv
to save your transformed or newly created DataFrame.
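As a usage note, the per-row loop can usually be replaced by a vectorized expression. A small sketch with made-up column names and data, mirroring the threshold check from the question:

```python
import pandas as pd

# Illustrative stand-in for the data read from Data.csv.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [0.1, 0.5, 0.2]})

# Boolean column computed for every row at once, no explicit loop.
df["below_threshold"] = df["value"] < 0.25

# Save the analyzed result to a new CSV without the integer index.
df.to_csv("new_Data.csv", index=False)
```

Vectorized operations are both shorter and much faster than iterrows() for this kind of column-wise analysis.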
For installation run
pip install pandas
