How to export DataFrame to_json in append mode - Python Pandas?

I have an existing JSON file containing a list of dicts.
$cat output.json
[{"a": 1, "b": 2}, {"a": 2, "b": 3}]
And I have a DataFrame
df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})
I want to save "df" with to_json, appending it to the existing file output.json:
df.to_json('output.json', orient='records') # mode='a' not available for to_json
* to_csv supports an append mode (mode='a'), but to_json does not.
The expected output.json file would be:
[{"a": 1, "b": 2}, {"a": 2, "b": 3}, {"a": 1, "b": 3}, {"a": 2, "b": 4}]
The existing file output.json can be huge (terabytes, say); is it possible to append the new DataFrame's records without loading the whole file?
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_json.html
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.to_csv.html

You could do this. It writes each record/row as JSON on a new line:
f = open(outfile_path, mode="a")
for chunk_df in data:  # data: an iterable of DataFrames
    f.write(chunk_df.to_json(orient="records", lines=True))
f.close()

No, you can't append to a JSON file without re-writing the whole file using pandas or the json module. You might be able to modify the file "manually" by opening it in read/write mode, seeking to the correct position, and inserting your data. I wouldn't recommend this, though. Better to just use a file format other than JSON if your file is going to be larger than your RAM.
This answer might also help. It doesn't create valid JSON files (each line is a separate JSON string instead), but its goal is very similar to yours.
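For completeness, here is a minimal sketch of the "manual" approach described above: overwrite the closing ']' of the existing JSON array and splice in the new records. It assumes the file is a well-formed, non-empty JSON array and is shown for illustration only, not as a recommended solution.
import os
import pandas as pd

def append_records_to_json_array(path, df):
    # df.to_json(orient='records') yields e.g. '[{"a":1,"b":3},{"a":2,"b":4}]'
    payload = df.to_json(orient="records")
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell() - 1
        # scan backwards past trailing whitespace to the closing ']'
        while pos > 0:
            f.seek(pos)
            if f.read(1) == b"]":
                break
            pos -= 1
        f.seek(pos)
        f.truncate()  # drop the old ']' (and anything after it)
        # splice in ',{...},{...}]' -- assumes the existing array is non-empty
        f.write(b"," + payload[1:].encode("utf-8"))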

Maybe you need to think in terms of orient='records':
def to_json_append(df, file):
    '''
    Load the file back with
    pd.read_json(file, orient='records', lines=True)
    '''
    df.to_json('tmp.json', orient='records', lines=True)
    # append
    f = open('tmp.json', 'r')
    k = f.read()
    f.close()
    f = open(file, 'a')
    f.write('\n')  # prepare the next data entry
    f.write(k)
    f.close()

# One-time conversion of the existing file to JSON Lines
df = pd.read_json('output.json')
df.to_json('output.json', orient='records', lines=True)
# New data
df = pd.DataFrame({'a': pd.Series([1, 2], index=list('CD')),
                   'b': pd.Series([3, 4], index=list('CD'))})
# Append:
to_json_append(df, 'output.json')
To load the full data back:
pd.read_json('output.json', orient='records', lines=True)
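A leaner variant (a sketch under the same assumptions, i.e. the target file is already in JSON Lines format) skips the temporary file and writes the JSON Lines text straight to the file opened in append mode:
import pandas as pd

def to_json_append_direct(df, file):
    # Same effect as to_json_append above, without the tmp.json round trip
    with open(file, 'a') as f:
        f.write('\n')  # separate from the previous entry
        f.write(df.to_json(orient='records', lines=True))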

I've solved it just by using built-in pandas.DataFrame methods. Keep performance in mind in the case of huge dataframes (there are ways to deal with it).
Code:
if os.path.isfile(dir_to_json_file):
    # if the file exists, read it
    df_read = pd.read_json(dir_to_json_file, orient='index')
    # add the data that you want to save
    df_read = pd.concat([df_read, df_to_append], ignore_index=True)
    # in case of adding too much unnecessary data (if you need it)
    df_read.drop_duplicates(inplace=True)
    # save it back to the json file
    df_read.to_json(dir_to_json_file, orient='index')
else:
    df_to_append.to_json(dir_to_json_file, orient='index')
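For reference, a hypothetical usage sketch that wraps the snippet above in a function; df_to_append and dir_to_json_file are the names from the answer, and 'appended.json' is an illustrative path:
import os
import pandas as pd

def append_df_to_json(df_to_append, dir_to_json_file):
    if os.path.isfile(dir_to_json_file):
        df_read = pd.read_json(dir_to_json_file, orient='index')
        df_read = pd.concat([df_read, df_to_append], ignore_index=True)
        df_read.drop_duplicates(inplace=True)
        df_read.to_json(dir_to_json_file, orient='index')
    else:
        df_to_append.to_json(dir_to_json_file, orient='index')

# illustrative call
append_df_to_json(pd.DataFrame({'a': [1, 2], 'b': [3, 4]}), 'appended.json')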

Use case: writing a large amount of data to a JSON file with limited memory.
Say we have 1,000 dataframes, each holding about 1,000,000 lines of JSON. If each dataframe needs 100 MB, the total file size would be 1,000 * 100 MB = 100 GB.
Solution:
* use a buffer to store the content of each dataframe
* use pandas to dump it to text
* use append mode to write the text to the end of the file
import io
import pandas as pd
from pathlib_mate import Path

n_lines_per_df = 10
n_df = 3
columns = ["id", "value"]
value = "alice#example.com"

# pathlib_mate's Path.change() builds a sibling path with a new basename
f = Path(__file__).change(new_basename="big-json-file.json")

if not f.exists():
    for nth_df in range(n_df):
        data = list()
        for nth_line in range(nth_df * n_lines_per_df, (nth_df + 1) * n_lines_per_df):
            data.append((nth_line, value))
        df = pd.DataFrame(data, columns=columns)
        buffer = io.StringIO()
        df.to_json(
            buffer,
            orient="records",
            lines=True,
        )
        with open(f.abspath, "a") as file:
            file.write(buffer.getvalue())
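To read the resulting JSON Lines file back without loading everything at once, pandas can iterate it in chunks. A sketch; the path and chunk size are assumptions for illustration:
import pandas as pd

for chunk in pd.read_json("big-json-file.json", orient="records",
                          lines=True, chunksize=10000):
    print(chunk.shape)  # process each chunk here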

Related

Read BLOB object into pandas as CSV

I have a mariaDB database that contains CSVs in the form of BLOB objects. I wanted to read these into pandas, but it appears that each csv is stored as text in its own cell, like this:
Name    Data
csv1    col1, col2, ...
csv2    col1, col2, ...
How can I read the cells in the Data column as their own CSVs into a pandas dataframe?
This is what I have tried:
from io import StringIO

raw = pd.read_sql_query(query, engine)
cell_as_string = raw.to_string(index=False)
converted_string = StringIO(cell_as_string)
rawdf = pd.read_csv(converted_string, sep=',')
rawdf
However, rawdf is just the string with spaces, not a dataframe.
Here is a screenshot of what the query returns:
How can I ... read the cells ... into a pandas dataframe
Why is this even interesting? It appears you already have the answer: you are able to SELECT each item, open a file for writing, transfer the data, and then ask .read_csv for a DataFrame. But perhaps the requirement was to avoid spurious disk I/O. OK. The read_csv function accepts a file-like input, and several libraries offer such data objects. If the original question were reproducible, it would include code that started like this:
from io import BytesIO, StringIO
default = "n,square\n2,4\n3,9"
blob = do_query() or default.encode("utf-8")
assert isinstance(blob, bytes)
Then with a binary BLOB in hand it's just a matter of:
f = StringIO(blob.decode("utf-8"))
df = pd.read_csv(f)
print(df.set_index("n"))
Sticking with bytes we might prefer the equivalent:
f = BytesIO(blob)
df = pd.read_csv(f)
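If the query returns one CSV BLOB per row, a minimal sketch that parses every cell into its own DataFrame could look like this; the blob_column name "Data" is taken from the table above and is an assumption about the real schema:
import pandas as pd
from io import StringIO

def blobs_to_frames(raw, blob_column="Data"):
    # Parse each CSV stored in raw[blob_column] into its own DataFrame
    frames = []
    for blob in raw[blob_column]:
        text = blob.decode("utf-8") if isinstance(blob, bytes) else blob
        frames.append(pd.read_csv(StringIO(text)))
    return frames

# usage sketch: raw = pd.read_sql_query(query, engine); dfs = blobs_to_frames(raw)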

how to read data (using pandas?) so that it is correctly formatted?

I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved on one line. A CSV file is not available.
I would like to have it as a data frame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep=",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one row. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips are appreciated, thank you very much.
Is a pandas dataframe even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the Python code used to read it in:
import json
import pandas as pd

# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)

# Get the first element of the results list, and the first element of the series list.
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']

df = pd.DataFrame(values, columns=columns)
print(df)
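If the real file has more than one entry in "results" or "series", a hedged sketch of the loop the comment mentions, collecting everything into one DataFrame:
import json
import pandas as pd

with open("test.json", "rb") as f:
    data = json.load(f)

frames = []
for result in data["results"]:
    for series in result["series"]:
        frames.append(pd.DataFrame(series["values"], columns=series["columns"]))

df = pd.concat(frames, ignore_index=True)
print(df)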

Generating a dataframe using a collection of JSON objects from a file

I have a log file where every line is a log record such as:
{"log":{"identifier": "x", "message": {"key" : "value"}}}
What I'd like to do is convert this JSON collection to a single DataFrame for analysis.
Example
identifier | key
------------|-------------
x | value
Up till now, I have done the following
with open("../data/cleaned_logs_xs.json", 'r') as logfile:
for line in logfile:
jsonified = json.loads(line)
log = jsonified["log"]
df = pd.io.json.json_normalize(log)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
Read this file line by line, convert every single record to a DataFrame and append the DataFrame to a parent DataFrame. At the end of this loop, it builds the final DataFrame I need.
Now I know this is extremely hack-y and inefficient. What would be the best way to go about this?
I don't know exactly if this is what you want, but there is something like this:
import json
from pandas.io.json import json_normalize
my_json = '{"log": {"identifier": "x", "message": {"key": "value"}}}'
data = json.loads(my_json)
data = json_normalize(data)
print(data)
Output:
log.identifier log.message.key
0 x value
In your case just read the json file.
At this moment, I've removed the constant appending to the parent dataframe.
I append each decoded log record to a list in a loop, and at the end convert the list of records to a dataframe using the following:
log_messages = list()
for line in logfile:
    jsonified = json.loads(line)
    log = jsonified["log"]
    log_messages.append(log)
log_df = pd.DataFrame.from_records(log_messages)
Can this be optimized further?
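One further tweak, sketched here assuming pandas 1.0+ for pd.json_normalize (older versions expose it as pd.io.json.json_normalize): since "message" is nested, flattening the collected records once at the end reproduces the identifier/key columns without per-line normalization:
import json
import pandas as pd

log_messages = []
with open("../data/cleaned_logs_xs.json", "r") as logfile:
    for line in logfile:
        log_messages.append(json.loads(line)["log"])

log_df = pd.json_normalize(log_messages)  # flattens message.key etc.
log_df.columns = log_df.columns.map(lambda c: c.split(".")[-1])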

Read only specific fields from large JSON and import into a Pandas Dataframe

I have a folder with roughly 10 JSON files, each between 500 and 1000 MB.
Each file contains about 1,000,000 lines like the following:
{
"dateTime": '2019-01-10 01:01:000.0000'
"cat": 2
"description": 'This description'
"mail": 'mail#mail.com'
"decision":[{"first":"01", "second":"02", "third":"03"},{"first":"04", "second":"05", "third":"06"}]
"Field001": 'data001'
"Field002": 'data002'
"Field003": 'data003'
...
"Field999": 'data999'
}
My target is to analyze it with pandas, so I would like to save the data coming from all the files into a DataFrame.
If I loop over all the files, Python crashes because I don't have enough free resources to manage the data.
For my purpose I only need a DataFrame with two columns, cat and dateTime, from all the files, which I suppose is lighter than a whole DataFrame with all the columns. I have tried to read only these two columns with the following snippet:
Note: at the moment I am working with only one file; when I have fast reader code I will loop over all the other files (A.json, B.json, ...)
import pandas as pd
import json
import os.path
from glob import glob
cols = ['cat', 'dateTime']
df = pd.DataFrame(columns=cols)
file_name='this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        data = json.loads(line)
        lst_dict = {'cat': data['cat'], 'dateTime': data['dateTime']}
        df = df.append(lst_dict, ignore_index=True)
The code works, but it is very, very slow: it takes more than one hour for a single file, while reading the whole file and storing it into a DataFrame usually takes me 8-10 minutes.
Is there a way to read only two specific columns and append them to a DataFrame in a faster way?
I have tried to read the whole JSON file into a DataFrame and then drop all the columns but 'cat' and 'dateTime', but it seems to be too heavy for my MacBook.
I had the same problem. I found out that appending a dict to a DataFrame is very very slow. Extract the values as a list instead. In my case it took 14 s instead of 2 h.
cols = ['cat', 'dateTime']
data = []
file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        lst = [doc['cat'], doc['dateTime']]
        data.append(lst)
df = pd.DataFrame(data=data, columns=cols)
Will this help?
Step 1: read your JSON file with pandas.read_json().
Step 2: then select your 2 columns from the dataframe.
Let me know if you still face any issue.
Thanks
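If each line of the file really is one self-contained JSON object (which is what the json.loads-per-line code above assumes), a hedged sketch of the read_json route, keeping only the two needed columns and reading in chunks to stay within memory; the path and encoding mirror the question's open() call:
import pandas as pd

cols = ["cat", "dateTime"]
chunks = pd.read_json("this_is_my_path/File_A.json", lines=True,
                      encoding="latin-1", chunksize=100000)
df = pd.concat((chunk[cols] for chunk in chunks), ignore_index=True)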

Operations on a very large csv with pandas

I have been using pandas on csv files to get some values out of them. My data looks like this:
"A",23.495,41.995,"this is a sentence with some words"
"B",52.243,0.118,"More text but contains WORD1"
"A",119.142,-58.289,"Also contains WORD1"
"B",423.2535,292.3958,"Doesn't contain anything of interest"
"C",12.413,18.494,"This string contains WORD2"
I have a simple script that reads the csv and computes the frequency of each WORD by group, so the output is like:
group freqW1 freqW2
A 1 0
B 1 0
C 0 1
Then I do some other operations on the values. The problem is that now I have to deal with very large csv files (20+ GB) that can't be held in memory. I tried the chunksize=x option in pd.read_csv, but because a 'TextFileReader' object is not subscriptable, I can't do the necessary operations on the chunks.
I suspect there is some easy way to iterate through the csv and do what I want.
My code is like this:
import pandas as pd
from collections import Counter

df = pd.read_csv("csvfile.txt", sep=",", header=None,
                 names=["group", "val1", "val2", "text"])
freq = Counter(df['group'])
word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
outfile = open("csv_out.txt", "w", encoding='utf-8')
df1.to_csv(outfile, sep=",")
outfile.close()
You can specify a chunksize option in the read_csv call. See here for details
Alternatively, you could use the Python csv library and create your own csv reader or DictReader, then use that to read the data in whatever chunk size you choose.
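A sketch of the csv.DictReader route mentioned above (field names taken from the question's read_csv call), counting the words per group without holding the whole file in memory:
import csv
from collections import Counter, defaultdict

import pandas as pd

counts = defaultdict(Counter)
with open("csvfile.txt", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, fieldnames=["group", "val1", "val2", "text"])
    for row in reader:
        counts[row["group"]]["freq"] += 1
        counts[row["group"]]["freqW1"] += "WORD1" in row["text"]
        counts[row["group"]]["freqW2"] += "WORD2" in row["text"]

result = pd.DataFrame(counts).T.fillna(0)  # one row per group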
Okay, I misunderstood the chunksize parameter. I solved it by doing this:
frame = pd.DataFrame()
chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"], chunksize=1000000)
for df in chunks:
    freq = Counter(df['group'])
    word1 = df[df["text"].str.contains("WORD1")].groupby("group").size()
    word2 = df[df["text"].str.contains("WORD2")].groupby("group").size()
    df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
    frame = frame.add(df1, fill_value=0)
outfile = open("csv_out.txt", "w", encoding='utf-8')
frame.to_csv(outfile, sep=",")
outfile.close()
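If memory is still tight, read_csv's usecols parameter can drop val1/val2 before they are ever materialized. A sketch, with the same path and chunk size as above:
import pandas as pd

chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                     names=["group", "val1", "val2", "text"],
                     usecols=["group", "text"], chunksize=1000000)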
