I have json data which is in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the json file and make a dataframe in which each key-value pair becomes a row, with the headers "Sentence" and "Label". I tried using lines = True, but it returns all the key-value pairs in one row:
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?
You can use:
import json
import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)

df = (pd.DataFrame.from_dict(data, orient='index')
        .reset_index()
        .rename(columns={'index': 'Sentence', 0: 'Label'}))
Easy way that I remember
import pandas as pd
import json
with open("./data.json", "r") as f:
data = json.load(f)
df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
    .T\
    .reset_index()\
    .rename(columns={"index": "Sentence", 0: "Label"})
Explanation
A little dirty, but as you probably noticed, lines=True on its own isn't sufficient: it puts all the key-value pairs in one row. So the above transposes the result, giving you
           0
Text1      4
Text2      1
TextN    123
Resetting the index then moves the keys into a column named "index", and the rename gives the two columns their final names.
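Applied to the example data, the final frame then looks like:

  Sentence  Label
0    Text1      4
1    Text2      1
2    TextN    123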
Using Pandas, I'm trying to extract a value using its key, but I keep failing. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[screenshot: dataframe created from the csv file]
However, when I try to extract a value using a key (e.g. df["id"]), I get an error message.
I'd like to see a value 1234 or 5678 using df["id"]. Which step should I take to get it done? This may be a very basic question but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect one, but hasn't done so in your case. See the read_csv documentation for more on this. Because no delimiter was detected, the pandas dataframe has a single column, value, whose cells contain entire lines from your file - the first entry is "{""id"":""1234"",""currency"":""USD""}". So the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option is to manually process each row, though there may be slicker options.
import pandas as pd

file = 'test.dat'
id_vals = []
currency = []
with open(file, 'r') as f:
    for line in f.readlines()[1:]:  # skip the 'value' header line
        ## remove obfuscating characters
        for c in '"{}\n':
            line = line.replace(c, '')
        fields = line.split(',')
        ## extract values to two lists ('id:' is 3 chars, 'currency:' is 9)
        id_vals.append(fields[0][3:])
        currency.append(fields[1][9:])

df = pd.DataFrame({'id': id_vals, 'currency': currency})
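A slicker option, as hinted above, is to let read_csv handle the quoting and then parse each cell as JSON (a minimal sketch, assuming the file is exactly the two-row CSV shown):

import json
import pandas as pd

# read_csv unescapes the doubled quotes, leaving one valid JSON string per cell
raw = pd.read_csv('test.dat')  # single column named 'value'
df = pd.DataFrame(raw['value'].apply(json.loads).tolist())
print(df['id'])  # 1234 and 5678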
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# Create new csv file and save cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# Create pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
ID Currency
0 1234 USD
1 5678 EUR
You need to parse the JSON string in each row of your dataframe using json.loads() or eval().
something like this:
import json

for _, row in df.iterrows():
    print(json.loads(row["value"])["id"])
    # OR
    print(eval(row["value"])["id"])
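If you'd rather have the parsed values as real columns instead of printing them, a short sketch (same assumption of a single 'value' column holding JSON strings):

import json

# parse each JSON string once, then expand into proper columns
parsed = df["value"].apply(json.loads)
df["id"] = parsed.apply(lambda d: d["id"])
df["currency"] = parsed.apply(lambda d: d["currency"])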
I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So everything in the text file is saved on one line; a CSV file is not available.
I would like to have it as a data frame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips are appreciated, thank you very much.
Is a pandas dataframe even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd
# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f
data = json.load(f)
# Get the first element of results list, and first element of series list
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)
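If your real data does have more than one entry in results or series (as the comment above anticipates), a hedged sketch of that loop, assuming every series carries its own "columns" and "values" keys like the sample:

frames = [
    pd.DataFrame(series["values"], columns=series["columns"])
    for result in data["results"]
    for series in result["series"]
]
df = pd.concat(frames, ignore_index=True)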
I'm trying to read my Twitter data saved in json format using the following code:
import json
import pandas as pd

with open(file, 'r') as f:
    line = f.readline()
    tweet = json.loads(line)

df1 = pd.DataFrame(tweet)
This code reads only one tweet and works, but when I try to read the whole file with:
with open(file, 'r') as f:
    for line in f:
        tweet = json.loads(line)
I receive an error:
JSONDecodeError: Expecting value: line 2 column 1 (char 1)
What should I change to read this file properly?
My main task is to find the creation dates of those tweets, and I got them using the following filters (applied to the single tweet that worked at the beginning):
df2 = df[["user"]]
df3 = df2.loc[['created_at']]
df3
Is there a better way than DataFrames to handle this data?
A more succinct way to read in (all of) your JSON file, for me, looks like:
import pandas as pd
df = pd.read_json("python.json", orient = 'records', lines = True)
You can then apply transformations to df to get the data from the columns you are interested in.
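For example, to get at the creation dates from the question (a sketch; it assumes the standard Twitter fields user, created_at and text ended up as columns):

# created_at is a top-level tweet field; the screen name is nested
# inside the user dict, so pull it out with apply
dates = df['created_at']
usernames = df['user'].apply(lambda u: u['screen_name'])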
You can do something like this:
import pandas as pd
#results is the JSON tweet data.
#Define the columns you want to extract
resultFrame = pd.DataFrame(columns=["username", "created_at", "tweet"])
print(len(results))
for i in range(len(results)):
    resultFrame.loc[i, "username"] = results[i].user.screen_name
    resultFrame.loc[i, "created_at"] = results[i].created_at
    resultFrame.loc[i, "tweet"] = results[i].text
print(resultFrame.head())
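Growing a frame row by row with .loc is slow for large result sets; a quicker pattern (same assumption that results holds tweet objects exposing user, created_at and text) is to build a list of dicts first:

rows = [
    {"username": t.user.screen_name,
     "created_at": t.created_at,
     "tweet": t.text}
    for t in results
]
resultFrame = pd.DataFrame(rows)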
I have an existing json file in the format of a list of dicts.
$cat output.json
[{'a':1, 'b':2}, {'a':2, 'b':3}]
And I have a DataFrame
df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})
I want to save "df" with to_json, appending it to the file output.json:
df.to_json('output.json', orient='records') # mode='a' not available for to_json
* There is an append mode='a' for to_csv, but no such mode for to_json.
The expected generated output.json file will be:
[{'a':1, 'b':2}, {'a':2, 'b':3}, {'a':1, 'b':3}, {'a':2, 'b':4}]
The existing file output.json can be huge (say terabytes); is it possible to append the new dataframe result without loading the whole file?
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_json.html
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_csv.html
You could do this. It writes each record/row as JSON on its own line:
with open(outfile_path, mode="a") as f:
    for chunk_df in data:
        f.write(chunk_df.to_json(orient="records", lines=True))
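The result is JSON Lines rather than a single JSON document, so you can load it back with:
pd.read_json(outfile_path, orient="records", lines=True)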
No, you can't append to a json file without re-writing the whole file using pandas or the json module. You might be able to modify the file "manually" by opening it in read/write mode, seeking to the correct position, and inserting your data. I wouldn't recommend this, though. Better to just use a file format other than json if your file is going to be larger than your RAM.
This answer also might help. It doesn't create valid json files (instead each line is a json string), but its goal is very similar to yours.
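As an aside, recent pandas versions (2.0+) did add an append mode to to_json itself, but only for this JSON-Lines style (orient='records' with lines=True), so it still won't maintain a single valid JSON array:

# pandas >= 2.0 only: append directly, one JSON document per line
df.to_json('output.jsonl', orient='records', lines=True, mode='a')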
Maybe you need to think in terms of orient='records':
def to_json_append(df, file):
    '''
    Load the file with
    pd.read_json(file, orient='records', lines=True)
    '''
    df.to_json('tmp.json', orient='records', lines=True)
    # append
    with open('tmp.json', 'r') as f:
        k = f.read()
    with open(file, 'a') as f:
        f.write('\n')  # prepare next data entry
        f.write(k)
df = pd.read_json('output.json')
# Save again as lines
df.to_json('output.json', orient='records', lines=True)
# new data
df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})
# append:
to_json_append(df, 'output.json')
To load the full data:
pd.read_json('output.json',orient='records',lines=True)
I've solved it just by using built-in pandas.DataFrame methods. Keep performance in mind in the case of huge dataframes (there are ways to deal with it).
Code:
import os
import pandas as pd

if os.path.isfile(dir_to_json_file):
    # if the file exists, open and read it
    df_read = pd.read_json(dir_to_json_file, orient='index')
    # add the data that you want to save
    df_read = pd.concat([df_read, df_to_append], ignore_index=True)
    # in case you add too much unnecessary data (if you need it)
    df_read.drop_duplicates(inplace=True)
    # save it back to the json file
    df_read.to_json(dir_to_json_file, orient='index')
else:
    df_to_append.to_json(dir_to_json_file, orient='index')
Use case: writing a big amount of data to a JSON file with a small memory footprint:
Let's say we have 1,000 dataframes, each holding about 1,000,000 lines of JSON. If each dataframe needs 100MB, the total file size would be 1000 * 100MB = 100GB.
Solution:
use a buffer to store the content of each dataframe
use pandas to dump it to text
use append mode to write the text to the end of the file
import io
import pandas as pd
from pathlib_mate import Path
n_lines_per_df = 10
n_df = 3
columns = ["id", "value"]
value = "alice#example.com"
f = Path(__file__).change(new_basename="big-json-file.json")
if not f.exists():
    for nth_df in range(n_df):
        data = list()
        for nth_line in range(
            nth_df * n_lines_per_df, (nth_df + 1) * n_lines_per_df
        ):
            data.append((nth_line, value))
        df = pd.DataFrame(data, columns=columns)
        buffer = io.StringIO()
        df.to_json(
            buffer,
            orient="records",
            lines=True,
        )
        with open(f.abspath, "a") as file:
            file.write(buffer.getvalue())
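Reading the file back doesn't have to load everything at once either; with lines=True, read_json accepts a chunksize (a sketch using the small sizes from this example):

# iterate over the file a few lines at a time instead of loading it whole
reader = pd.read_json(f.abspath, orient="records", lines=True, chunksize=10)
for chunk in reader:
    print(chunk.shape)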