Pandas Dataframe to JSON File with Separate Records - python

I'm attempting to dump data from a Pandas Dataframe into a JSON file to import into MongoDB. The format I require in a file has JSON records on each line of the form:
{<column 1>:<value>,<column 2>:<value>,...,<column N>:<value>}
df.to_json(,orient='records') gets close to the result but all the records are dumped within a single JSON array.
Any thoughts on an efficient way to get this result from a dataframe?
UPDATE: The best solution I've come up with is the following:
dlist = df.to_dict('records')
dlist = [json.dumps(record)+"\n" for record in dlist]
open('data.json','w').writelines(dlist)

docs here, there are several orient options you can pass, you need at least pandas 0.12
In [2]: df = DataFrame(np.random.randn(10,2),columns=list('AB'))
In [3]: df
Out[3]:
A B
0 -0.350949 -0.428705
1 -1.732226 1.895324
2 0.314642 -1.494372
3 -0.492676 0.180832
4 -0.985848 0.070543
5 -0.689386 -0.213252
6 0.673370 0.045452
7 -1.403494 -1.591106
8 -1.836650 -0.494737
9 -0.105253 0.243730
In [4]: df.to_json()
Out[4]: '{"A":{"0":-0.3509492646,"1":-1.7322255701,"2":0.3146421374,"3":-0.4926764426,"4":-0.9858476787,"5":-0.6893856618,"6":0.673369954,"7":-1.4034942394,"8":-1.8366498622,"9":-0.1052531862},"B":{"0":-0.4287054732,"1":1.8953235554,"2":-1.4943721459,"3":0.1808322313,"4":0.0705432211,"5":-0.213252257,"6":0.045451995,"7":-1.5911060576,"8":-0.4947369551,"9":0.2437304866}}'

format your data in a python dictionary to your liking and use simplejson:
json.dumps(your_dictionary)

Related

Python Pandas Dataframe from API JSON Response >>

I am new to Python, Can i please seek some help from experts here?
I wish to construct a dataframe from https://api.cryptowat.ch/markets/summaries JSON response.
based on following filter criteria
Kraken listed currency pairs (Please take note, there are kraken-futures i dont want those)
Currency paired with USD only, i.e aaveusd, adausd....
Ideal Dataframe i am looking for is (somehow excel loads this json perfectly screenshot below)
Dataframe_Excel_Screenshot
resp = requests.get(https://api.cryptowat.ch/markets/summaries) kraken_assets = resp.json() df = pd.json_normalize(kraken_assets) print(df)
Output:
result.binance-us:aaveusd.price.last result.binance-us:aaveusd.price.high ...
0 264.48 267.32 ...
[1 rows x 62688 columns]
When i just paste the link in browser JSON response is with double quotes ("), but when i get it via python code. All double quotes (") are changed to single quotes (') any idea why?. Though I tried to solve it with json_normalize but then response is changed to [1 rows x 62688 columns]. i am not sure how do i even go about working with 1 row with 62k columns. i dont know how to extract exact info in the dataframe format i need (please see excel screenshot).
Any help is much appreciated. thank you!
the result JSON is a dict
load this into a dataframe
decode columns into products & measures
filter to required data
import requests
import pandas as pd
import numpy as np
# load results into a data frame
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
# columns are encoded as product and measure. decode columns and transpose into rows that include product and measure
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product","measure"])
df = df.T
# finally filter down to required data and structure measures as columns
df.loc[df.index.get_level_values("product").str[:7]=="kraken:"].unstack("measure").droplevel(0,1)
sample output
product
price.last
price.high
price.low
price.change.percentage
price.change.absolute
volume
volumeQuote
kraken:aaveaud
347.41
347.41
338.14
0.0274147
9.27
1.77707
613.281
kraken:aavebtc
0.008154
0.008289
0.007874
0.0219326
0.000175
403.506
3.2797
kraken:aaveeth
0.1327
0.1346
0.1327
-0.00673653
-0.0009
287.113
38.3549
kraken:aaveeur
219.87
226.46
209.07
0.0331751
7.06
1202.65
259205
kraken:aavegbp
191.55
191.55
179.43
0.030559
5.68
6.74476
1238.35
kraken:aaveusd
259.53
267.48
246.64
0.0339841
8.53
3623.66
929624
kraken:adaaud
1.61792
1.64602
1.563
0.0211692
0.03354
5183.61
8366.21
kraken:adabtc
3.757e-05
3.776e-05
3.673e-05
0.0110334
4.1e-07
252403
9.41614
kraken:adaeth
0.0006108
0.00063
0.0006069
-0.0175326
-1.09e-05
590839
367.706
kraken:adaeur
1.01188
1.03087
0.977345
0.0209986
0.020811
1.99104e+06
1.98693e+06
Hello Try the below code. I have understood the structure of the Dataset and modified to get the desired output.
`
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
a=resp.json()
a['result']
#creating Dataframe froom key=result
da=pd.DataFrame(a['result'])
#using Transpose to get required Columns and Index
da=da.transpose()
#price columns contains a dict which need to be seperate Columns on the data frame
db=da['price'].to_dict()
da.drop('price', axis=1, inplace=True)
#intialising seperate Data frame for price
z=pd.DataFrame({})
for i in db.keys():
i=pd.DataFrame(db[i], index=[i])
z=pd.concat([z,i], axis=0 )
da=pd.concat([z, da], axis=1)
da.to_excel('nex.xlsx')`

Extracting value from JSON column very slow

I've got a CSV with a bunch of data. One of the columns, ExtraParams contains a JSON object. I want to extract a value using a specific key, but it's taking quite a while to get through the 60.000something rows in the CSV. Can it be sped up?
counter = 0 #just to see where I'm at
order_data['NewColumn'] = ''
for row in range(len(total_data)):
s = total_data['ExtraParams'][row]
try:
data = json.loads(s)
new_data = data['NewColumn']
counter += 1
print(counter)
order_data['NewColumn'][row] = new_data
except:
print('NewColumn not in row')
I use a try-except because a few of the rows have what I assume is messed up JSON, as they crash the program with a "expecting delimiter ','" error.
When I say "slow" I mean ~30mins for 60.000rows.
EDIT: It might be worth nothing each JSON contains about 35 key/value pairs.
You could use something like pandas and make use of the apply method. For some simple sample data in test.csv
Col1,Col2,ExtraParams
1,"a",{"dog":10}
2,"b",{"dog":5}
3,"c",{"dog":6}
You could use something like
In [1]: import pandas as pd
In [2]: import json
In [3]: df = pd.read_csv("test.csv")
In [4]: df.ExtraParams.apply(json.loads)
Out[4]:
0 {'dog': 10}
1 {'dog': 5}
2 {'dog': 6}
Name: ExtraParams, dtype: object
If you need to extract a field from the json, assuming the field is present in each row you can write a lambda function like
In [5]: df.ExtraParams.apply(lambda x: json.loads(x)['dog'])
Out[5]:
0 10
1 5
2 6
Name: ExtraParams, dtype: int64

How to read bytes as bytes from csv?

I have a csv with ~10 columns.. One of the columns has information in bytes i.e., b'gAAAA234'. But when I read this from pandas via .read_csv("file.csv"), I get it all in a dataframe and this particular column is in string rather than bytes i.e., b'gAAAA234'.
How do I simply read it as bytes without having to read it as string and then reconverting?
Currently, I'm working with this:
b = df['column_with_data_in_bytes'][i]
bb = bytes(b[2:len(b)-1],'utf-8')
#further processing of bytes
This works but I was hoping to find a more elegant/pythonic or more reliable way to do this?
You might consider parsing with ast.literal_eval:
import ast
df['column_with_data_in_bytes'] = df['column_with_data_in_bytes'].apply(ast.literal_eval)
Demo:
In [322]: df = pd.DataFrame({'Col' : ["b'asdfghj'", "b'ssdgdfgfv'", "b'asdsfg'"]})
In [325]: df
Out[325]:
Col
0 b'asdfghj'
1 b'ssdgdfgfv'
2 b'asdsfg'
In [326]: df.Col.apply(ast.literal_eval)
Out[326]:
0 asdfghj
1 ssdgdfgfv
2 asdsfg
Name: Col, dtype: object

Splitting Regex response column on python

I am receiving an object array after applying re.findall for link and hashtags on Tweets data. My data looks like
b=['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
Now I want to split it in columns, I am using following
df = pd.DataFrame(b.str.split(',',1).tolist(),columns = ['flips','row'])
But it is not working because of weird datatype I guess, I tried few other solutions as well. Nothing worked.And this is what I am expecting, two separate columns
https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
https://t.co/CJZWjaBfJU
https://t.co/4GMhoXhBQO https://t.co/0V
https://t.co/Erutsftlnq
https://t.co/86VvLJEzvG
It's not clear from your question what exactly is part of your data. (Does it include the square brackets and single quotes?). In any case, the pandas read_csv function is very versitile and can handle ragged data:
import StringIO
import pandas as pd
raw_data = """
['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
"""
# You'll probably replace the StringIO part with the filename of your data.
df = pd.read_csv(StringIO.StringIO(raw_data), header=None, names=('flips','row'))
# Get rid of the square brackets and single quotes
for col in ('flips', 'row'):
df[col] = df[col].str.strip("[]'")
df
Output:
flips row
0 https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
1 https://t.co/CJZWjaBfJU NaN
2 https://t.co/4GMhoXhBQO https://t.co/0V
3 https://t.co/Erutsftlnq NaN
4 https://t.co/86VvLJEzvG https://t.co/zCYv5WcFDS

Extract nested JSON embedded as string in Pandas dataframe

I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id' : [1, 2, 3],
'dist' : [67, 34, 98],
'loc_lat': [45.7, 46.89, 45.70],
'loc_lon': [38.9, 36.7, 31.0],
'arrival': ["Monday", "Tuesday", "NA"],
'characteristics_body_color':["red", "blue", "yellow"],
'characteristics_body_make':["sedan", "sedan", "NA"],
'characteristics_manuf_year':[2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
#Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")
#Parsing function
def parse_request(s):
sj = json.loads(s)
norm = json_normalize(sj)
return norm
#Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])
#Loop through and parse JSON in each row
for i in df_raw.json_request:
parsed = parsed.append(parse_request(i))
#Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.
Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('sample.csv') as fh:
rows = csv.reader(fh, delimiter='|')
header = next(rows)
# "transpose" the data. `data` is now a tuple of strings
# containing JSON, one for each row
idents, dists, data = zip(*rows)
data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
arrival characteristics.body.color characteristics.body.make \
0 Monday red sedan
1 Tuesday blue sedan
2 NaN yellow NaN
characteristics.manuf_year loc.lat loc.lon ids
0 2014 45.70 38.9 1
1 2014 46.89 36.7 2
2 2010 45.70 31.0 3
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
"""A fast way to flatten a dictionary,"""
res = {}
queue = [('', dct)]
while queue:
prefix, d = queue.pop()
for k, v in d.items():
key = prefix + k
if not isinstance(v, dict):
res[key] = v
else:
queue.append((key + separator, v))
return res
def records_from_json(fh):
"""Yields the records from a file object."""
rows = csv.reader(fh, delimiter='|')
header = next(rows)
for ident, dist, data in rows:
rec = flatten(json.loads(data))
rec['id'] = ident
rec['dist'] = dist
yield rec
def from_records(path):
with open(path) as fh:
return pd.DataFrame.from_records(records_from_json(fh))
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
method 1 (s) method 2 (s) original time (s)
n_rows
96 0.008217 0.002971 0.362257
192 0.014484 0.004720 0.678590
384 0.027308 0.008720 1.373918
768 0.055644 0.016175 2.791400
1536 0.105730 0.030914 5.727828
3072 0.209049 0.060105 11.877403
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.

Categories