I've got a CSV with a bunch of data. One of the columns, ExtraParams, contains a JSON object. I want to extract a value using a specific key, but it's taking quite a while to get through the 60,000-something rows in the CSV. Can it be sped up?
counter = 0  # just to see where I'm at
order_data['NewColumn'] = ''
for row in range(len(total_data)):
    s = total_data['ExtraParams'][row]
    try:
        data = json.loads(s)
        new_data = data['NewColumn']
        counter += 1
        print(counter)
        order_data['NewColumn'][row] = new_data
    except:
        print('NewColumn not in row')
I use a try-except because a few of the rows have what I assume is malformed JSON, as they crash the program with an "Expecting ',' delimiter" error.
When I say "slow" I mean ~30 minutes for 60,000 rows.
EDIT: It might be worth noting that each JSON object contains about 35 key/value pairs.
You could use something like pandas and make use of the apply method. For some simple sample data in test.csv:
Col1,Col2,ExtraParams
1,"a",{"dog":10}
2,"b",{"dog":5}
3,"c",{"dog":6}
You could use something like
In [1]: import pandas as pd
In [2]: import json
In [3]: df = pd.read_csv("test.csv")
In [4]: df.ExtraParams.apply(json.loads)
Out[4]:
0 {'dog': 10}
1 {'dog': 5}
2 {'dog': 6}
Name: ExtraParams, dtype: object
If you need to extract a field from the JSON, and assuming the field is present in each row, you can write a lambda function like
In [5]: df.ExtraParams.apply(lambda x: json.loads(x)['dog'])
Out[5]:
0 10
1 5
2 6
Name: ExtraParams, dtype: int64
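If some rows contain malformed JSON or are missing the key, as in the original question, the lambda can be swapped for a small guarded helper instead of a bare try-except around the whole loop body. A minimal sketch (extract_key and the NaN fallback are my own choices, not part of the answer above):

import json
import pandas as pd

def extract_key(s, key):
    # return NaN for malformed JSON or a missing key instead of raising
    try:
        return json.loads(s)[key]
    except (ValueError, KeyError, TypeError):
        return float('nan')

df = pd.read_csv("test.csv")
df['NewColumn'] = df.ExtraParams.apply(lambda s: extract_key(s, 'dog'))

This keeps the vectorised apply while still tolerating the occasional bad row.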
I have a JSON response in a <class 'dict'>. I want to iterate over the JSON response and form a table view. Below is the sample JSON response.
{'ResultSet': {'Rows': [{'Data': [{'VarCharValue': 'cnt'}, {'VarCharValue': 'id'}, {'VarCharValue': 'val'}]}, {'Data': [{'VarCharValue': '2000'}, {'VarCharValue': '1234'}, {'VarCharValue': 'ABC'}]},{'Data': [{'VarCharValue': '3000'}, {'VarCharValue': '5678'}, {'VarCharValue': 'DEF'}]}]}}
Expected Output format:
cnt id val
2000 1234 ABC
3000 5678 DEF
There can be only one row of data, or there can be multiple rows for the column set (the provided sample data has two rows).
I assume you want to use Pandas. Since pd.DataFrame accepts a list of dictionaries directly, you can restructure your input dictionary D as a list of dictionaries:
# column names come from the first entry in 'Rows'
cols = [next(iter(i.values())) for i in D['ResultSet']['Rows'][0]['Data']]
# the remaining entries hold the data rows
d = [{col: j['VarCharValue'] for col, j in zip(cols, i['Data'])}
     for i in D['ResultSet']['Rows'][1:]]
df = pd.DataFrame(d)
print(df)
cnt id val
0 2000 1234 ABC
1 3000 5678 DEF
You will probably want to convert at least the cnt series to numeric:
df['cnt'] = pd.to_numeric(df['cnt'])
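If other columns such as id should be numeric too (an assumption; it may be an identifier you want to keep as text), the same call can be applied per column:

for col in ('cnt', 'id'):
    df[col] = pd.to_numeric(df[col])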
I am not sure if you are using pandas, but you can easily parse your response dict into a pandas.DataFrame with the following code:
import pandas as pd
pd.DataFrame([[entr['VarCharValue'] for entr in r['Data']] for r in response['ResultSet']['Rows'][1:]],
             columns=[r['VarCharValue'] for r in response['ResultSet']['Rows'][0]['Data']])
I'm building a process to "outer join" two CSV files and export the result as a JSON object.
# read the source csv files
firstcsv = pandas.read_csv('file1.csv', names = ['main_index','attr_one','attr_two'])
secondcsv = pandas.read_csv('file2.csv', names = ['main_index','attr_three','attr_four'])
# merge them
output = firstcsv.merge(secondcsv, on='main_index', how='outer')
jsonresult = output.to_json(orient='records')
print(jsonresult)
Now, the two CSV files are like this:
file1.csv:
1, aurelion, sol
2, lee, sin
3, cute, teemo
file2.csv:
1, midlane, mage
2, jungler, melee
And I would like the resulting json to be outputted like:
[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}]
Instead, on the line with main_index = 3, I'm getting:
{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}]
So nulls are added automatically in the output. I would like to remove them; I looked around but couldn't find a proper way to do it.
Hope someone can help me out!
Since we're using a DataFrame, pandas will 'fill in' values with NaN, i.e.
>>> print(output)
main_index attr_one attr_two attr_three attr_four
0 1 aurelion sol midlane mage
1 2 lee sin jungler melee
2 3 cute teemo NaN NaN
I can't see any options in the pandas.to_json documentation to skip null values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
So the way I came up with involves re-building the JSON string. This probably isn't very performant for large datasets of millions of rows (but there are fewer than 200 champions in League, so it shouldn't be a huge issue!).
from collections import OrderedDict
import json
jsonresult = output.to_json(orient='records')
# read the json string to get a list of dictionaries
rows = json.loads(jsonresult)
# new_rows = [
#     # rebuild the dictionary for each row, only including non-null values
#     {key: val for key, val in row.items() if pandas.notnull(val)}
#     for row in rows
# ]
# to maintain order, use OrderedDict
new_rows = [
    OrderedDict([
        (key, row[key]) for key in output.columns
        if (key in row) and pandas.notnull(row[key])
    ])
    for row in rows
]
new_json_output = json.dumps(new_rows)
And you will find that new_json_output has dropped all keys that have NaN values, and kept the order:
>>> print(new_json_output)
[{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"},
{"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"},
{"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}]
I was trying to achieve the same thing and found the following solution, which I think should be pretty fast (although I haven't tested that). A bit too late to answer the original question, but maybe useful to some.
# Data
df = pd.DataFrame([
    {"main_index": 1, "attr_one": "aurelion", "attr_two": "sol", "attr_three": "midlane", "attr_four": "mage"},
    {"main_index": 2, "attr_one": "lee", "attr_two": "sin", "attr_three": "jungler", "attr_four": "melee"},
    {"main_index": 3, "attr_one": "cute", "attr_two": "teemo"}
])
gives a DataFrame with missing values.
>>> print(df)
attr_four attr_one attr_three attr_two main_index
0 mage aurelion midlane sol 1
1 melee lee jungler sin 2
2 NaN cute NaN teemo 3
To convert it to JSON, you can apply to_json() to each row of the transposed DataFrame, after filtering out empty values. Then join the per-row JSON strings, separated by commas, and wrap the result in brackets.
# To json
json_df = df.T.apply(lambda row: row[~row.isnull()].to_json())
json_wrapped = "[%s]" % ",".join(json_df)
Then
>>> print(json_wrapped)
[{"attr_four":"mage","attr_one":"aurelion","attr_three":"midlane","attr_two":"sol","main_index":1},{"attr_four":"melee","attr_one":"lee","attr_three":"jungler","attr_two":"sin","main_index":2},{"attr_one":"cute","attr_two":"teemo","main_index":3}]
I have a CSV with ~10 columns. One of the columns holds data in bytes form, e.g. b'gAAAA234'. But when I read the file with pandas via .read_csv("file.csv"), that column comes back as a string rather than bytes, i.e. the literal text "b'gAAAA234'".
How do I simply read it as bytes without having to read it as a string and then reconverting?
Currently, I'm working with this:
b = df['column_with_data_in_bytes'][i]
bb = bytes(b[2:len(b)-1], 'utf-8')  # strip the leading b' and trailing ', then re-encode
# further processing of bytes
This works, but I was hoping to find a more elegant/Pythonic or more reliable way to do this.
You might consider parsing with ast.literal_eval:
import ast
df['column_with_data_in_bytes'] = df['column_with_data_in_bytes'].apply(ast.literal_eval)
Demo:
In [322]: df = pd.DataFrame({'Col' : ["b'asdfghj'", "b'ssdgdfgfv'", "b'asdsfg'"]})
In [325]: df
Out[325]:
Col
0 b'asdfghj'
1 b'ssdgdfgfv'
2 b'asdsfg'
In [326]: df.Col.apply(ast.literal_eval)
Out[326]:
0 asdfghj
1 ssdgdfgfv
2 asdsfg
Name: Col, dtype: object
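The parsed column now holds actual bytes objects. If you eventually want text rather than bytes, the result can be decoded in one extra step (a sketch, assuming the bytes are UTF-8):

df['Col'] = df.Col.apply(ast.literal_eval).str.decode('utf-8')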
I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id': [1, 2, 3],
        'dist': [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color': ["red", "blue", "yellow"],
        'characteristics_body_make': ["sedan", "sedan", "NA"],
        'characteristics_manuf_year': [2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
# Imports (json_normalize lives in pandas.io.json on pandas 0.17)
import json
import pandas as pd
from pandas.io.json import json_normalize

# Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

# Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

# Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

# Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

# Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (it would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.
Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
   arrival characteristics.body.color characteristics.body.make  \
0   Monday                        red                     sedan
1  Tuesday                       blue                     sedan
2      NaN                     yellow                       NaN

  characteristics.manuf_year  loc.lat  loc.lon ids
0                       2014    45.70     38.9   1
1                       2014    46.89     36.7   2
2                       2010    45.70     31.0   3
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
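Usage is then simply (assuming the same sample.csv as above):

df = from_records('sample.csv')
print(df)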
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
        method 1 (s)  method 2 (s)  original time (s)
n_rows
96          0.008217      0.002971           0.362257
192         0.014484      0.004720           0.678590
384         0.027308      0.008720           1.373918
768         0.055644      0.016175           2.791400
1536        0.105730      0.030914           5.727828
3072        0.209049      0.060105          11.877403
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.
I'm attempting to dump data from a Pandas Dataframe into a JSON file to import into MongoDB. The format I require in a file has JSON records on each line of the form:
{<column 1>:<value>,<column 2>:<value>,...,<column N>:<value>}
df.to_json(orient='records') gets close to the result, but all the records are dumped within a single JSON array.
Any thoughts on an efficient way to get this result from a dataframe?
UPDATE: The best solution I've come up with is the following:
dlist = df.to_dict('records')
dlist = [json.dumps(record)+"\n" for record in dlist]
open('data.json','w').writelines(dlist)
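For what it's worth, newer pandas versions (0.19+) can produce this line-delimited format directly, which would replace the manual loop above (a sketch):

df.to_json('data.json', orient='records', lines=True)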
See the to_json docs; there are several orient options you can pass. You need at least pandas 0.12:
In [2]: df = DataFrame(np.random.randn(10,2),columns=list('AB'))
In [3]: df
Out[3]:
A B
0 -0.350949 -0.428705
1 -1.732226 1.895324
2 0.314642 -1.494372
3 -0.492676 0.180832
4 -0.985848 0.070543
5 -0.689386 -0.213252
6 0.673370 0.045452
7 -1.403494 -1.591106
8 -1.836650 -0.494737
9 -0.105253 0.243730
In [4]: df.to_json()
Out[4]: '{"A":{"0":-0.3509492646,"1":-1.7322255701,"2":0.3146421374,"3":-0.4926764426,"4":-0.9858476787,"5":-0.6893856618,"6":0.673369954,"7":-1.4034942394,"8":-1.8366498622,"9":-0.1052531862},"B":{"0":-0.4287054732,"1":1.8953235554,"2":-1.4943721459,"3":0.1808322313,"4":0.0705432211,"5":-0.213252257,"6":0.045451995,"7":-1.5911060576,"8":-0.4947369551,"9":0.2437304866}}'
Format your data in a Python dictionary to your liking and use simplejson:
json.dumps(your_dictionary)
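A minimal sketch of that idea in context, writing one JSON object per line (assuming df is the dataframe from the question; the standard json module behaves the same if simplejson isn't installed):

import json

with open('data.json', 'w') as fh:
    for record in df.to_dict(orient='records'):
        fh.write(json.dumps(record) + '\n')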