How to read a json data into a dataframe using pandas - python

I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the json file and make a dataframe such as
Each key value pairs will be a row in the dataframe and I need to need headers "Sentence" and "Label". I tried with using lines = True but it returns all the key-value pairs in one row.
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?

you can use:
with open('json_example.json') as json_data:
data = json.load(json_data)
df=pd.DataFrame.from_dict(data,orient='index').reset_index().rename(columns={'index':'Sentence',0:'Label'})

Easy way that I remember
import pandas as pd
import json
with open("./data.json", "r") as f:
data = json.load(f)
df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
.T\
.reset_index()\
.rename(columns={"index": "Sentence", 0: "Labels"})
Explanation
A little dirty but as you probably noticed, lines=True isn't completely sufficient so the above transposes the result so that you have
(index)
0
Text1
4
Text2
1
TextN
123
So then resetting the index moves the index over to be a column named "index" and then renaming the columns.

Related

In Pandas, how can I extract certain value using the key off of a dataframe imported from a csv file?

Using Pandas, I'm trying to extract value using the key but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
dataframe from a csv file
However, when I tried to extract the value using a key (e.g. df["id"]), I'm facing an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter but hasn't done so in your case. See the read_csv documentation for more on this. Because the , the pandas dataframe has a single column, value, which has entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So, the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to read in this data is to manually process each row, though there may be slicker options.
file = 'test.dat'
f = open(file,'r')
id_vals = []
currency = []
for line in f.readlines()[1:]:
## remove obfuscating characters
for c in '"{}\n':
line = line.replace(c,'')
line = line.split(',')
## extract values to two lists
id_vals.append(line[0][3:])
currency.append(line[1][9:])
You just need to clean up the CSV file a little and you are good. Here is every step:
# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
my_csv_text = f.read()
# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:','value']
replace_str = ''
for i in find_str:
my_csv_text = re.sub(i, replace_str, my_csv_text)
# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv' # or whatever path and name you want
with open(new_csv_path, 'w') as f:
f.write(my_csv_text)
# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to extract each row of your dataframe using json.loads() or eval()
something like this:
import json
for row in df.iteritems():
print(json.loads(row.value)["id"])
# OR
print(eval(row.value)["id"])

Write Python3/Pandas dataframe to JSON with orient=records but without the array when there is only one record

I'm writing a very small Pandas dataframe to a JSON file. In fact, the Dataframe has only one row with two columns.
To build the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict(dict({'date': '2020-10-05', 'ppm': 411.1}), orient='index').T
print(df)
prints
date ppm
0 2020-10-05 411.1
The desired json output is as follows:
{
"date": "2020-10-05",
"ppm": 411.1
}
but when writing the json with pandas, I can only print it as an array with one element, like so:
[
{
"date":"2020-10-05",
"ppm":411.1
}
]
I've currently hacked my code to convert the Dataframe to a dict, and then use the json module to write the file.
import json
data = df.to_dict(orient='records')
data = data[0] # keep the only element
with open('data.json', 'w') as fp:
json.dump(data, fp, indent=2)
Is there a native way with pandas' .to_json() to keep the only dictionary item if there is only one?
I am currently using .to_json() like this, which incorrectly prints the array with one dictionary item.
df.to_json('data.json', orient='index', indent = 2)
Python 3.8.6
Pandas 1.1.3
If you want to export only one row, use iloc:
print (df.iloc[0].to_dict())
#{'date': '2020-10-05', 'ppm': 411.1}

How to return a specific data structure with inner dictionary of lists

I have a csv file (image attached) and to take the CSV file and create a dictionary of lists with the format "{method},{number},{orbital_period},{mass},{distance},{year}" .
So far I have code :
import csv
with open('exoplanets.csv') as inputfile :
reader = csv.reader(inputfile)
inputm = list(reader)
print(inputm)
but my output is coming out like ['Radial Velocity', '1', '269.3', '7.1', '77.4', '2006']
when I want it to look like :
"Radial Velocity" : {"number":[1,1,1], "orbital_period":[269.3, 874.774, 763.0], "mass":[7.1, 2.21, 2.6], "distance":[77.4, 56.95, 19.84], "year":[2006.0, 2008.0, 2011.0] } , "Transit" : {"number":[1,1,1], "orbital_period":[1.5089557, 1.7429935, 4.2568], "mass":[], "distance":[200.0, 680.0], "year":[2008.0, 2008.0, 2008.0] }
Any ideas on how I can alter my code?
Hey SKR01 welcome to Stackoverflow!
I would suggest working with the pandas library. It is meant for table like contents that you have there. What you are then looking for is a groupby on your #method column.
import pandas as pd
def remove_index(row):
d = row._asdict()
del d["Index"]
return d
df = pd.read_csv("https://docs.google.com/uc?export=download&id=1PnQzoefx-IiB3D5BKVOrcawoVFLIPVXQ")
{row.Index : remove_index(row) for row in df.groupby('#method').aggregate(list).itertuples()}
The only thing that remains is removing the nan values from the resulting dict.
If you don't want to use Pandas, maybe something like this is what you're looking for:
import csv
with open('exoplanets.csv') as inputfile :
reader = csv.reader(inputfile)
inputm = list(reader)
header = inputm.pop(0)
del header[0] # probably you don't want "#method"
# create and populate the final dictionary
data = {}
for row in inputm:
if row[0] not in data:
data[row[0]] = {h:[] for h in header}
for i, h in enumerate(header):
data[row[0]][h].append(row[i+1])
print(data)
This is a bit complex, and I'm questioning why you want the data this way, but this should get you the output format you want without requiring any external libraries like Pandas.
import csv
with open('exoplanets.csv') as input_file:
rows = list(csv.DictReader(input_file))
# Create the data structure
methods = {d["#method"]: {} for d in rows}
# Get a list of fields, trimming off the method column
fields = list(rows[1])[1:]
# Fill in the data structure
for method in methods:
methods[method] = {
# Null-trimmed version of listcomp
# f: [r[f] for r in rows if r["#method"] == method and r[f]]
f: [r[f] for r in rows if r["#method"] == method]
for f
in fields
}
Note: This could be one multi-tiered list/dict comprehension, but I've broken it apart for clarity.

Generating a dataframe using a collection of JSON objects from a file

I have a log file where every line is a log record such as:
{"log":{"identifier": "x", "message": {"key" : "value"}}}
What I'd like to do is convert this JSON collection to a single DataFrame for analysis.
Example
identifier | key
------------|-------------
x | value
Up till now, I have done the following
with open("../data/cleaned_logs_xs.json", 'r') as logfile:
for line in logfile:
jsonified = json.loads(line)
log = jsonified["log"]
df = pd.io.json.json_normalize(log)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
Read this file line by line, convert every single record to a DataFrame and append the DataFrame to a parent DataFrame. At the end of this loop, it builds the final DataFrame I need.
Now I know this is extremely hack-y and inefficient. What would be the best way to go about this?
I dont know exactly if this is what you want but there is something like this:
import json
from pandas.io.json import json_normalize
my_json = '{"log": {"identifier": "x", "message": {"key": "value"}}}'
data = json.loads(my_json)
data = json_normalize(data)
print(data)
Output:
log.identifier log.message.key
0 x value
In your case just read the json file.
At this moment, I've removed the constant appending to the parent dataframe.
I append the JSON encoded log message into an array through a loop and at the end convert the array of JSON records to a dataframe using the following:
log_messages = list()
for line in logfile:
jsonified = json.loads(line)
log = jsonified["log"]
log_messages.append(log)
log_df = pd.DataFrame.from_records(log_messages)
Can this be optimized further?

Converting a string representation of dicts to an actual dict

I have a CSV file with 100K+ lines of data in this format:
"{'foo':'bar' , 'foo1':'bar1', 'foo3':'bar3'}"
"{'foo':'bar' , 'foo1':'bar1', 'foo4':'bar4'}"
The quotes are there before the curly braces because my data came in a CSV file.
I want to extract the key value pairs in all the lines to create a dataframe like so:
Column Headers: foo, foo1, foo3, foo...
Rows: bar, bar1, bar3, bar...
I've tried implementing something similar to what's explained here ( Python: error parsing strings from text file with Ast module).
I've gotten the ast.literal_eval function to work on my file to convert the contents into a dict but now how do I get the DataFrame function to work? I am very much a beginner so any help would be appreciated.
import pandas as pd
import ast
with open('file_name.csv') as f:
for string in f:
parsed = ast.literal_eval(string.rstrip())
print(parsed)
pd.DataFrame(???)
You can turn a dictionary into a pandas dataframe using pd.DataFrame.from_dict, but it will expect each value in the dictionary to be in a list.
for key, value in parsed.items():
parsed[key] = [value]
df = pd.DataFrame.from_dict(parsed)
You can do this iteratively by appending to your dataframe.
df = pd.DataFrame()
for string in f:
parsed = ast.literal_eval(string.rstrip())
for key, value in parsed.items():
parsed[key] = [value]
df.append(pd.DataFrame.from_dict(parsed))
parsed is a dictionary, you make a dataframe from it, then join all the frames together:
df = []
with open('file_name.csv') as f:
for string in f:
parsed = ast.literal_eval(string.rstrip())
if type(parsed) != dict:
continue
subDF = pd.DataFrame(parsed, index=[0])
df.append(subDF)
df = pd.concat(df, ignore_index=True, sort=False)
Calling pd.concat on a list of dataframes is faster than calling DataFrame.append repeatedly. sort=False means that pd.concat will not sort the column names when it encounters a few one, like foo4 on the second row.

Categories