Generating a DataFrame from a collection of JSON objects in a file - Python

I have a log file where every line is a log record such as:
{"log":{"identifier": "x", "message": {"key" : "value"}}}
What I'd like to do is convert this JSON collection to a single DataFrame for analysis.
Example
identifier | key
------------|-------------
x | value
Up till now, I have done the following:
import json
import pandas as pd

with open("../data/cleaned_logs_xs.json", 'r') as logfile:
    for line in logfile:
        jsonified = json.loads(line)
        log = jsonified["log"]
        df = pd.io.json.json_normalize(log)
        df.columns = df.columns.map(lambda x: x.split(".")[-1])
I read the file line by line, convert every record to a DataFrame, and append that DataFrame to a parent DataFrame. At the end of the loop, this builds the final DataFrame I need.
Now, I know this is extremely hacky and inefficient. What would be the best way to go about this?

I don't know exactly if this is what you want, but you can do something like this:
import json
from pandas.io.json import json_normalize
my_json = '{"log": {"identifier": "x", "message": {"key": "value"}}}'
data = json.loads(my_json)
data = json_normalize(data)
print(data)
Output:
  log.identifier log.message.key
0              x           value
In your case, just read the JSON file line by line.
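A minimal sketch of that idea (my addition, reusing the asker's file path; pd.json_normalize is the modern spelling of json_normalize in pandas >= 1.0):
import json
import pandas as pd

# Collect every "log" record first, then normalize them all in one call
# instead of once per line.
records = []
with open("../data/cleaned_logs_xs.json", "r") as logfile:
    for line in logfile:
        records.append(json.loads(line)["log"])

df = pd.json_normalize(records)
df.columns = df.columns.map(lambda c: c.split(".")[-1])
print(df)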

At this point, I've removed the constant appending to the parent dataframe.
Instead, I append each JSON-decoded log message to a list inside the loop, and at the end convert the list of records to a dataframe:
log_messages = list()
for line in logfile:
    jsonified = json.loads(line)
    log = jsonified["log"]
    log_messages.append(log)
log_df = pd.DataFrame.from_records(log_messages)
Can this be optimized further?
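For what it's worth, a list comprehension plus pd.json_normalize (a sketch under the same one-record-per-line assumption) trims the loop and also flattens the nested message dict, which from_records leaves as a single object column:
import json
import pandas as pd

with open("../data/cleaned_logs_xs.json", "r") as logfile:
    log_messages = [json.loads(line)["log"] for line in logfile]

# json_normalize flattens nested dicts into columns like "message.key".
log_df = pd.json_normalize(log_messages)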

Related

How to read JSON data into a dataframe using pandas

I have JSON data in the structure below:
{"Text1": 4, "Text2": 1, "TextN": 123}
I want to read the JSON file and make a dataframe where each key-value pair is a row, with the headers "Sentence" and "Label". I tried using lines=True, but it returns all the key-value pairs in one row:
data_df = pd.read_json(PATH_TO_DATA, lines = True)
What is the correct way to load such json data?
You can use:
import json
import pandas as pd

with open('json_example.json') as json_data:
    data = json.load(json_data)

df = pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'Sentence', 0: 'Label'})
Easy way that I remember:
import pandas as pd
import json

with open("./data.json", "r") as f:
    data = json.load(f)

df = pd.DataFrame({"Sentence": data.keys(), "Label": data.values()})
With read_json
To read straight from the file using read_json, you can use something like:
pd.read_json("./data.json", lines=True)\
    .T\
    .reset_index()\
    .rename(columns={"index": "Sentence", 0: "Label"})
Explanation
A little dirty, but as you probably noticed, lines=True isn't sufficient on its own, so the above transposes the result so that you have:
(index)    0
Text1      4
Text2      1
TextN    123
Resetting the index then moves the old index into a column named "index", and the final rename gives the requested headers.

In Pandas, how can I extract a certain value using a key from a dataframe imported from a csv file?

Using Pandas, I'm trying to extract a value using its key, but I keep failing to do so. Could you help me with this?
There's a csv file like below:
value
"{""id"":""1234"",""currency"":""USD""}"
"{""id"":""5678"",""currency"":""EUR""}"
I imported this file in Pandas and made a DataFrame out of it:
[screenshot: dataframe from a csv file]
However, when I try to extract a value using a key (e.g. df["id"]), I get an error message.
I'd like to see the value 1234 or 5678 via df["id"]. Which step should I take to get this done? This may be a very basic question, but I need your help. Thanks.
The csv file isn't being read in correctly.
You haven't set a delimiter; pandas can automatically detect a delimiter, but hasn't done so in your case. See the read_csv documentation for more on this. Because of this, the pandas dataframe has a single column, value, which holds entire lines from your file as individual cells - the first entry is "{""id"":""1234"",""currency"":""USD""}". So the file doesn't have a column id, and you can't select data by id.
The data aren't formatted as a pandas df, with row titles and columns of data. One option for reading in this data is to manually process each row, though there may be slicker options:
file = 'test.dat'
id_vals = []
currency = []
with open(file, 'r') as f:
    for line in f.readlines()[1:]:
        ## remove obfuscating characters
        for c in '"{}\n':
            line = line.replace(c, '')
        line = line.split(',')
        ## extract values into two lists
        id_vals.append(line[0][3:])
        currency.append(line[1][9:])
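The snippet above ends with two plain lists; a short follow-up (my addition, assuming pandas is available) assembles them into a DataFrame:
import pandas as pd

df = pd.DataFrame({'id': id_vals, 'currency': currency})
print(df)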
You just need to clean up the CSV file a little and you are good. Here is every step:
import re
import pandas as pd

# open your csv and read it as a text string
with open('My_CSV.csv', 'r') as f:
    my_csv_text = f.read()

# remove problematic strings
find_str = ['{', '}', '"', 'id:', 'currency:', 'value']
replace_str = ''
for i in find_str:
    my_csv_text = re.sub(i, replace_str, my_csv_text)

# create a new csv file and save the cleaned text
new_csv_path = './my_new_csv.csv'  # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(my_csv_text)

# create the pandas dataframe
df = pd.read_csv('my_new_csv.csv', sep=',', names=['ID', 'Currency'])
print(df)
Output df:
     ID Currency
0  1234      USD
1  5678      EUR
You need to parse each row of your dataframe with json.loads() (or eval(), though json.loads() is safer), something like this:
import json

for value in df["value"]:
    print(json.loads(value)["id"])
    # OR
    # print(eval(value)["id"])
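To get proper id and currency columns instead of printed values, here is a small sketch (my addition, not from the answer above) that parses the column once and normalizes it:
import json
import pandas as pd

# Parse each JSON string once, then expand the dicts into columns.
parsed = df["value"].apply(json.loads)
out = pd.json_normalize(parsed.tolist())
print(out["id"])        # 1234, 5678
print(out["currency"])  # USD, EUR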

How to read data (using pandas?) so that it is correctly formatted?

I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved on one line; a CSV file is not available.
I would like to have it as a DataFrame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns].
So currently everything is read in as one row, but I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips would be appreciated, thank you very much.
Is a pandas DataFrame even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd

# Read the test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)

# Get the first element of the results list, and the first element of the series list.
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']

df = pd.DataFrame(values, columns=columns)
print(df)
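Since the comment above mentions ijson for files too big for memory, here is a rough streaming sketch (my addition, untested against 2 GB files; it assumes ijson's items() API and the same results/series/values nesting):
import ijson
import pandas as pd

# Stream each row of the "values" array without loading the whole file.
rows = []
with open("test.json", "rb") as f:
    for row in ijson.items(f, "results.item.series.item.values.item"):
        rows.append(row)  # for truly huge files, process rows in batches instead

# The column names come from the "columns" array, pulled the same way.
with open("test.json", "rb") as f:
    columns = next(ijson.items(f, "results.item.series.item.columns"))

df = pd.DataFrame(rows, columns=columns)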

How can I load into a pandas DataFrame tweets from a json file?

I'm trying to read my Twitter data saved in json format using the following code:
import json
import pandas as pd

with open(file, 'r') as f:
    line = f.readline()
tweet = json.loads(line)
df1 = pd.DataFrame(tweet)
This code reads only one tweet and works, but when I try to read the whole file with:
with open(file, 'r') as f:
    for line in f:
        tweet = json.loads(line)
I receive an error:
JSONDecodeError: Expecting value: line 2 column 1 (char 1)
What should I change to read this file properly?
My main task is to find the creation dates of those tweets, and I managed it with the following filters (using just the one tweet that worked at the beginning):
df2 = df[["user"]]
df3 = df2.loc[['created_at']]
df3
Is there a better way than DataFrames to handle this data?
A more succinct way to read in (all of) your JSON file, for me, looks like:
import pandas as pd
df = pd.read_json("python.json", orient = 'records', lines = True)
You can then apply transformations to df so to get data from the columns that you are interested in.
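For instance, to get at the creation dates the question mentions (a sketch; the created_at and user fields assume a standard Twitter API payload):
import pandas as pd

df = pd.read_json("python.json", orient="records", lines=True)

# Each tweet has a top-level created_at field.
print(df["created_at"])

# The nested user object can be flattened into its own columns.
users = pd.json_normalize(df["user"].tolist())
print(users["created_at"])  # account creation date, per the question's filter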
You can do something like this:
import pandas as pd

# results is the JSON tweet data.
# Define the columns you want to extract.
resultFrame = pd.DataFrame(columns=["username", "created_at", "tweet"])
print(len(results))

for i in range(len(results)):
    resultFrame.loc[i, "username"] = results[i].user.screen_name
    resultFrame.loc[i, "created_at"] = results[i].created_at
    resultFrame.loc[i, "tweet"] = results[i].text

print(resultFrame.head())
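Growing a DataFrame row by row with .loc is slow for large result sets; a faster variant (my sketch, assuming the same tweepy-style results objects) builds a list of dicts first:
import pandas as pd

rows = [
    {
        "username": status.user.screen_name,
        "created_at": status.created_at,
        "tweet": status.text,
    }
    for status in results
]
resultFrame = pd.DataFrame(rows, columns=["username", "created_at", "tweet"])
print(resultFrame.head())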

How to export DataFrame to_json in append mode - Python Pandas?

I have an existing JSON file formatted as a list of dicts.
$cat output.json
[{'a':1, 'b':2}, {'a':2, 'b':3}]
And I have a DataFrame:
df = pd.DataFrame({'a': pd.Series([1,2], index=list('CD')),
                   'b': pd.Series([3,4], index=list('CD'))})
I want to save df with to_json, appending it to the file output.json:
df.to_json('output.json', orient='records')  # mode='a' is not available for to_json
(There is an append mode='a' for to_csv, but not for to_json.)
The expected generated output.json file will be:
[{'a':1, 'b':2}, {'a':2, 'b':3}, {'a':1, 'b':3}, {'a':2, 'b':4}]
The existing file output.json can be huge (say terabytes); is it possible to append the new dataframe result without loading the whole file?
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_json.html
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_csv.html
You could do this. It will write each record/row as JSON on a new line:
with open(outfile_path, mode="a") as f:
    for chunk_df in data:
        f.write(chunk_df.to_json(orient="records", lines=True))
No, you can't append to a JSON file without rewriting the whole file using pandas or the json module. You might be able to modify the file "manually" by opening it in read/write mode, seeking to the correct position, and inserting your data; I wouldn't recommend this, though. Better to just use a file format other than JSON if your file is going to be larger than your RAM.
This answer also might help. It doesn't create valid JSON files (instead, each line is a JSON string), but its goal is very similar to yours.
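If you do want to try the manual route, here is a rough sketch (my addition, not a recommendation): it assumes output.json is a non-empty, valid JSON array ending in "]", overwrites that closing bracket in place, and appends the new records.
import json
import os

new_records = [{"a": 1, "b": 3}, {"a": 2, "b": 4}]  # hypothetical new rows

with open("output.json", "rb+") as f:
    f.seek(-1, os.SEEK_END)        # start at the last byte
    while f.read(1) != b"]":       # scan backwards for the closing bracket
        f.seek(-2, os.SEEK_CUR)
    f.seek(-1, os.SEEK_CUR)        # step back onto the "]"
    f.truncate()                   # cut the file just before it
    payload = ", " + ", ".join(json.dumps(r) for r in new_records) + "]"
    f.write(payload.encode("utf-8"))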
Maybe you need to think in terms of orient='records':
def to_json_append(df, file):
    '''
    Load the file with
    pd.read_json(file, orient='records', lines=True)
    '''
    df.to_json('tmp.json', orient='records', lines=True)
    # append
    f = open('tmp.json', 'r')
    k = f.read()
    f.close()
    f = open(file, 'a')
    f.write('\n')  # prepare the next data entry
    f.write(k)
    f.close()

df = pd.read_json('output.json')
# Save again as lines
df.to_json('output.json', orient='records', lines=True)
# New data
df = pd.DataFrame({'a': pd.Series([1,2], index=list('CD')),
                   'b': pd.Series([3,4], index=list('CD'))})
# Append:
to_json_append(df, 'output.json')

To load the full data:
pd.read_json('output.json', orient='records', lines=True)
I've solved it just by using built-in pandas.DataFrame methods. Keep performance in mind in the case of huge dataframes (there are ways to deal with that).
Code:
import os
import pandas as pd

if os.path.isfile(dir_to_json_file):
    # if the file exists, open and read it
    df_read = pd.read_json(dir_to_json_file, orient='index')
    # add the data that you want to save
    df_read = pd.concat([df_read, df_to_append], ignore_index=True)
    # in case of adding too much unnecessary data (if you need it)
    df_read.drop_duplicates(inplace=True)
    # save it back to the json file
    df_read.to_json(dir_to_json_file, orient='index')
else:
    df_to_append.to_json(dir_to_json_file, orient='index')
Use case: write a big amount of data to a JSON file with little memory.
Let's say we have 1,000 dataframes, and each dataframe is about 1,000,000 lines of JSON. If each dataframe needs 100 MB, the total file size would be 1000 * 100 MB = 100 GB.
Solution:
use a buffer to store the content of each dataframe
use pandas to dump it to text
use append mode to write the text to the end of the file
import io
import pandas as pd
from pathlib_mate import Path

n_lines_per_df = 10
n_df = 3
columns = ["id", "value"]
value = "alice#example.com"
f = Path(__file__).change(new_basename="big-json-file.json")

if not f.exists():
    for nth_df in range(n_df):
        data = list()
        for nth_line in range(nth_df * n_lines_per_df, (nth_df + 1) * n_lines_per_df):
            data.append((nth_line, value))
        df = pd.DataFrame(data, columns=columns)
        # dump this chunk to an in-memory buffer
        buffer = io.StringIO()
        df.to_json(
            buffer,
            orient="records",
            lines=True,
        )
        # append the buffered text to the end of the file
        with open(f.abspath, "a") as file:
            file.write(buffer.getvalue())
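To read such a file back without loading it all at once, pandas can also stream JSON-lines input in chunks (a sketch; the chunk size is arbitrary and handle() is a hypothetical per-chunk processor):
import pandas as pd

# chunksize with lines=True returns an iterator of DataFrames.
reader = pd.read_json(
    "big-json-file.json", orient="records", lines=True, chunksize=10_000
)
for chunk in reader:
    handle(chunk)  # hypothetical per-chunk processing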
