Extract nested JSON embedded as string in Pandas dataframe - python

I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id': [1, 2, 3],
        'dist': [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color': ["red", "blue", "yellow"],
        'characteristics_body_make': ["sedan", "sedan", "NA"],
        'characteristics_manuf_year': [2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
import json
import pandas as pd
from pandas.io.json import json_normalize

#Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")
#Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm
#Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])
#Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))
#Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solutions from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.

Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize
with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
   arrival characteristics.body.color characteristics.body.make  \
0   Monday                        red                     sedan
1  Tuesday                       blue                     sedan
2      NaN                     yellow                       NaN

   characteristics.manuf_year  loc.lat  loc.lon ids
0                        2014    45.70     38.9   1
1                        2014    46.89     36.7   2
2                        2010    45.70     31.0   3
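If you prefer the underscore-style column names from the question's desired dataframe, one possible follow-up (a small sketch, not part of the original benchmark) is to rename the columns afterwards:
df.columns = [c.replace('.', '_') for c in df.columns]   # loc.lat -> loc_lat, etc.
df = df.rename(columns={'ids': 'id', 'dists': 'dist'})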
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res

def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec

def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
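Usage is then just a matter of pointing it at the file; for example, with the sample.csv from the question:
df = from_records('sample.csv')
print(df.head())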
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
        method 1 (s)  method 2 (s)  original time (s)
n_rows
96          0.008217      0.002971           0.362257
192         0.014484      0.004720           0.678590
384         0.027308      0.008720           1.373918
768         0.055644      0.016175           2.791400
1536        0.105730      0.030914           5.727828
3072        0.209049      0.060105          11.877403
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.
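As a side note, if you would rather stay entirely inside pandas (as the question's attempt does), the same "normalize once" idea can be expressed without the csv module. This is only a sketch that reuses the question's df_raw and imports, and it will still be slower than the record-based method above:
# parse every JSON string once, then flatten the whole list in a single call
parsed = json_normalize(df_raw['json_request'].apply(json.loads).tolist())
df_parsed = df_raw.drop('json_request', axis=1).join(parsed)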

Related

Dask read_csv: skip periodically occurring lines

I want to use Dask to read in a large file of atom coordinates at multiple time steps. The format is called the XYZ format, and it looks like this:
3
timestep 1
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
3
timestep 2
C 9.5464696279 5.2523477968 4.4976072664
C 10.6455075132 6.0351186102 4.0196547961
C 10.2970471574 7.3880736108 3.6390228968
The first line contains the atom number, the second line is just a comment.
After that, the atoms are listed with their names and positions.
After all atoms are listed, the same is repeated for the next time step.
I would now like to load such a trajectory via dask.dataframe.read_csv.
However, I could not figure out how to skip the periodically occurring lines containing the atom number and the comment. Is this actually possible?
Edit:
Reading this format into a Pandas Dataframe is possible via:
atom_nr = 3

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

pd.read_csv(xyz_filename, skiprows=skip, delim_whitespace=True,
            header=None)
But it looks like the Dask dataframe does not support passing a function to skiprows.
Edit 2:
MRocklin's answer works! Just for completeness, I write down the full code I used.
from io import BytesIO
import pandas as pd
import dask.bytes
import dask.dataframe
import dask.delayed
atom_nr = ...
filename = ...

def skip(line_nr):
    return line_nr % (atom_nr + 2) < 2

def pandaread(data_in_bytes):
    pseudo_file = BytesIO(data_in_bytes[0])
    return pd.read_csv(pseudo_file, skiprows=skip, delim_whitespace=True,
                       header=None)

bts = dask.bytes.read_bytes(filename, delimiter=f"{atom_nr}\ntimestep".encode())
dfs = dask.delayed(pandaread)(bts)
sol = dask.dataframe.from_delayed(dfs)
sol.compute()
The only remaining question is: How do I tell dask to only compute the first n frames? At the moment it seems the full trajectory is read.
Short answer
No, neither pandas.read_csv nor dask.dataframe.read_csv offers this kind of functionality (to my knowledge).
Long Answer
If you can write code to convert some of this data into a pandas dataframe, then you can probably do this on your own with moderate effort using
dask.bytes.read_bytes
dask.dataframe.from_delayed
In general this might look something like the following:
values = read_bytes('filenames.*.txt', delimiter='...', blocksize=2**27)
dfs = [dask.delayed(load_pandas_from_bytes)(v) for v in values]
df = dd.from_delayed(dfs)
Each of the dfs corresponds to roughly blocksize bytes of your data (plus whatever it takes to reach the next delimiter). You can control how fine your partitions are with this blocksize. If you want, you can also select only a few of these dfs objects to get a smaller portion of your data:
dfs = dfs[:5] # only the first five blocks of `blocksize` data
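Regarding the follow-up question about only computing the first n frames: the same slicing idea can be applied before building the dask dataframe. A minimal sketch, assuming dfs is the list of delayed frames from the snippet above and that five blocks cover the frames you need:
import dask.dataframe as dd

first_blocks = dfs[:5]                  # hypothetical: only the first five delayed blocks
df_head = dd.from_delayed(first_blocks)
result = df_head.compute()              # only the selected blocks are read and parsed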

Python - Pandas - How to drop null values from to_json after dataframe merge

I'm building a process to "outer join" two CSV files and export the result as a JSON object.
# read the source csv files
firstcsv = pandas.read_csv('file1.csv', names = ['main_index','attr_one','attr_two'])
secondcsv = pandas.read_csv('file2.csv', names = ['main_index','attr_three','attr_four'])
# merge them
output = firstcsv.merge(secondcsv, on='main_index', how='outer')
jsonresult = output.to_json(orient='records')
print(jsonresult)
Now, the two csv files are like this:
file1.csv:
1, aurelion, sol
2, lee, sin
3, cute, teemo
file2.csv:
1, midlane, mage
2, jungler, melee
And I would like the resulting JSON to be output like this:
[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}]
Instead, on the line with main_index = 3, I'm getting:
{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}]
So nulls are added automatically in the output. I would like to remove them; I looked around but I couldn't find a proper way to do it.
Hope someone can help me out!
Since we're using a DataFrame, pandas will 'fill in' values with NaN, i.e.
>>> print(output)
   main_index  attr_one attr_two attr_three attr_four
0           1  aurelion      sol    midlane      mage
1           2       lee      sin    jungler     melee
2           3      cute    teemo        NaN       NaN
I can't see any options in the pandas.to_json documentation to skip null values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
So the way I came up with involves re-building the JSON string. This probably isn't very performant for large datasets of millions of rows (but there are fewer than 200 champs in League, so it shouldn't be a huge issue!).
from collections import OrderedDict
import json
jsonresult = output.to_json(orient='records')
# read the json string to get a list of dictionaries
rows = json.loads(jsonresult)
# new_rows = [
#     # rebuild the dictionary for each row, only including non-null values
#     {key: val for key, val in row.items() if pandas.notnull(val)}
#     for row in rows
# ]

# to maintain order, use OrderedDict
new_rows = [
    OrderedDict([
        (key, row[key]) for key in output.columns
        if (key in row) and pandas.notnull(row[key])
    ])
    for row in rows
]
new_json_output = json.dumps(new_rows)
And you will find that new_json_output has dropped all keys that have NaN values, and kept the order:
>>> print(new_json_output)
[{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"},
{"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"},
{"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}]
I was trying to achieve the same thing and found the following solution, which I think should be pretty fast (although I haven't tested that). A bit too late to answer the original question, but maybe useful to some.
# Data
df = pd.DataFrame([
    {"main_index": 1, "attr_one": "aurelion", "attr_two": "sol", "attr_three": "midlane", "attr_four": "mage"},
    {"main_index": 2, "attr_one": "lee", "attr_two": "sin", "attr_three": "jungler", "attr_four": "melee"},
    {"main_index": 3, "attr_one": "cute", "attr_two": "teemo"}
])
gives a DataFrame with missing values.
>>> print(df)
  attr_four  attr_one attr_three attr_two  main_index
0      mage  aurelion    midlane      sol           1
1     melee       lee    jungler      sin           2
2       NaN      cute        NaN    teemo           3
To convert it to JSON, you can apply to_json() to each row of the transposed DataFrame, after filtering out empty values. Then join the JSON strings, separated by commas, and wrap the result in brackets.
# To json
json_df = df.T.apply(lambda row: row[~row.isnull()].to_json())
json_wrapped = "[%s]" % ",".join(json_df)
Then
>>> print(json_wrapped)
[{"attr_four":"mage","attr_one":"aurelion","attr_three":"midlane","attr_two":"sol","main_index":1},{"attr_four":"melee","attr_one":"lee","attr_three":"jungler","attr_two":"sin","main_index":2},{"attr_one":"cute","attr_two":"teemo","main_index":3}]

Parse tsv with very specific format into python

I have a tsv file containing a network. Here's a snippet. Column 0 contains unique IDs, column 1 contains an alternative ID (not necessarily unique). Each pair of columns after that contains an 'interactor' and a score of interaction.
11746909_a_at A1CF SHPRH 0.11081568 TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185 CCDC90B 0.14495682
11724734_at ABCB8 HYKK 0.09577321 LDB3 0.09845833
11723976_at ABCC8 FAM161B 0.15087105 ID1 0.14801268
11718612_a_at ABCD4 HOXC6 0.23559235 LCMT2 0.12867001
11758217_s_at ABHD17C FZD7 0.46334574 HIVEP3 0.24272481
So for example, A1CF connects to SHPRH and TRIM10 with scores of 0.11081568 and 0.11914056 respectively. I'm trying to convert this data into a 'flat' format using pandas which would look like this:
11746909_a_at A1CF SHPRH 0.11081568
TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185
CCDC90B 0.14495682
...... and so on........ ........ ....
Note that each row can have an arbitrary number of (interactor, score) pairs.
I've tried setting columns 0 and 1 as the index, then giving the columns names with df.colnames = ['Interactor', 'Weight']*int(df.shape[1]/2), then using pandas.groupby, but so far my attempts have not been successful. Can anybody suggest a way to do this?
Producing an output dataframe like the one you specified above shouldn't be too hard:
from collections import OrderedDict
import pandas as pd
def open_network_tsv(filepath):
    """
    Read the tsv file, returning every line split by tabs
    """
    with open(filepath) as network_file:
        for line in network_file.readlines():
            line_columns = line.strip().split('\t')
            yield line_columns

def get_connections(potential_conns):
    """
    Get the connections of a particular line, grouped
    in interactor:score pairs
    """
    for idx, val in enumerate(potential_conns):
        if not idx % 2:
            if len(potential_conns) >= idx + 2:
                yield val, potential_conns[idx+1]

def create_connections_df(filepath):
    """
    Build the desired dataframe
    """
    connections = OrderedDict({
        'uniq_id': [],
        'alias': [],
        'interactor': [],
        'score': []
    })
    for line in open_network_tsv(filepath):
        uniq_id, alias, *potential_conns = line
        for connection in get_connections(potential_conns):
            connections['uniq_id'].append(uniq_id)
            connections['alias'].append(alias)
            connections['interactor'].append(connection[0])
            connections['score'].append(connection[1])

    return pd.DataFrame(connections)
Maybe you can do a dataframe.set_index(['uniq_id', 'alias']) or dataframe.groupby(['uniq_id', 'alias']) on the output afterward.
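For example, a minimal usage sketch (the file name network.tsv is just a placeholder):
df = create_connections_df('network.tsv')
# a MultiIndex view similar to the layout sketched in the question
print(df.set_index(['uniq_id', 'alias']))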

Looping at specified indexes

I have two large lists, each with about 100,000 elements (one larger than the other), that I want to iterate through. My loop looks like this:
for i in list1:
    for j in list2:
        function()
This looping currently takes too long. However, list1 needs to be checked against list2, but beyond a certain index there are no more matching entries in list2, so looping from specific indexes might be faster; the problem is I do not know how to do so.
In my project, list2 is a list of dicts that have three keys: value, name, and timestamp. list1 is a list of the timestamps in order. The function takes the value based on the timestamp and puts it into a CSV file in the appropriate name column.
This is an example of entries from list1:
[1364310855.004000, 1364310855.005000, 1364310855.008000]
This is what list2 looks like:
{"name":"vehicle_speed","value":2,"timestamp":1364310855.004000}
{"name":"accelerator_pedal_position","value":4,"timestamp":1364310855.004000}
{"name":"engine_speed","value":5,"timestamp":1364310855.005000}
{"name":"torque_at_transmission","value":-3,"timestamp":1364310855.008000}
{"name":"vehicle_speed","value":1,"timestamp":1364310855.008000}
In my final csv file, I should have something like this:
http://s000.tinyupload.com/?file_id=03563948671103920273
If you want this to be fast, you should restructure the data that you have in list2 in order to speedup your lookups:
# The following code converts list2 into a multivalue dictionary
from collections import defaultdict
list2_dict = defaultdict(list)
for item in list2:
    list2_dict[item['timestamp']].append((item['name'], item['value']))
This gives you a much faster way to look up your timestamps:
print(list2_dict)
defaultdict(<type 'list'>, {
    1364310855.008: [('torque_at_transmission', -3), ('vehicle_speed', 1)],
    1364310855.005: [('engine_speed', 5)],
    1364310855.004: [('vehicle_speed', 2), ('accelerator_pedal_position', 4)]})
Lookups will be much more efficient when using list2_dict:
for i in list1:
    for j in list2_dict[i]:
        # here j is a tuple in the form (name, value)
        function()
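The question doesn't show what function() does, so purely as a hypothetical illustration, here is one way the restructured dict could be used to build the kind of CSV described (one row per timestamp, one column per signal name); the column names and output path are assumptions:
import csv

# hypothetical column names and output path
fieldnames = ['timestamp', 'vehicle_speed', 'accelerator_pedal_position',
              'engine_speed', 'torque_at_transmission']

with open('output.csv', 'w', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=fieldnames, restval=0)
    writer.writeheader()
    for ts in list1:
        row = {'timestamp': ts}
        row.update(dict(list2_dict[ts]))   # (name, value) pairs for this timestamp
        writer.writerow(row)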
You appear to only want to use the elements in list2 that correspond to i*2 and i*2+1; that is, elements 0, 1, then 2, 3, and so on.
You only need one loop.
for i in range(len(list1)):
    j = list2[i*2]
    k = list2[i*2 + 1]
    # Process function using j and k
You will only process to the end of list1.
I think the pandas module is a perfect match for your goals...
import ujson # 'ujson' (Ultra fast JSON) is faster than the standard 'json'
import pandas as pd
filter_list = [1364310855.004000, 1364310855.005000, 1364310855.008000]
def file2list(fn):
    with open(fn) as f:
        return [ujson.loads(line) for line in f]
# Use pd.read_json('data.json') instead of pd.DataFrame(load_data('data.json'))
# if you have a proper JSON file
#
# df = pd.read_json('data.json')
df = pd.DataFrame(file2list('data.json'))
# filter DataFrame with 'filter_list'
df = df[df['timestamp'].isin(filter_list)]
# convert UNIX timestamps to readable format
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
# pivot data frame
# fill NaN's with zeroes
df = df.pivot(index='timestamp', columns='name', values='value').fillna(0)
# save data frame to CSV file
df.to_csv('output.csv', sep=',')
#pd.set_option('display.expand_frame_repr', False)
#print(df)
output.csv
timestamp,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0.0,0.0,-3.0,1.0
PS: I don't know where you got the [Latitude, Longitude] columns from, but it's pretty easy to add those columns to your result DataFrame; just add the following lines before calling df.to_csv():
df.insert(0, 'latitude', 0)
df.insert(1, 'longitude', 0)
which would result in:
timestamp,latitude,longitude,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,0,0,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0,0,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0,0,0.0,0.0,-3.0,1.0

Pandas dataframe to list of tuples

I parsed a .xlsx file into a pandas dataframe and want to convert it to a list of tuples. The pandas dataframe has two columns.
The list of tuples requires the product_ids grouped with their transaction_id. I saw a post on converting a pandas dataframe to a list of tuples, but the resulting code pairs each transaction_id with a single product_id rather than grouping the product_ids per transaction_id.
How can I get the list of tuples in the desired format shown at the bottom of the page?
import pandas as pd
import xlrd
#Import data
trans = pd.ExcelFile('/Users/Transactions.xlsx')
#parse xlsx file into dataframe
transdata = trans.parse('Orders')
#view dataframe
#print transdata
   transaction_id  product_id
0           20001       48165
1           20001       48162
2           20001       48166
3           20004       48815
4           20005       48165
#Create tuple
trans_set = [tuple(x) for x in transdata.values]
print trans_set
[(20001, 48165), (20001, 48162), (20001, 48166), (20004, 48815), (20005, 48165)]
Desired Result:
[(20001, [48165, 48162, 48166]), (20004, 48815), (20005, 48165)]
trans_set = [(key, list(grp)) for key, grp in
             transdata.groupby(['transaction_id'])['product_id']]
In [268]: trans_set
Out[268]: [(20001, [48165, 48162, 48166]), (20004, [48815]), (20005, [48165])]
This is a little different than your desired result -- note the (20004, [48815]), for example -- but I think it is more consistent. The second item in each tuple is a list of all the product_ids which are associated with the transaction_id. It might consist of only one element, but it is always a list.
To write trans_set to a CSV, you could use the csv module:
import csv
with open('/tmp/data.csv', 'wb') as f:
    writer = csv.writer(f)
    for key, grp in trans_set:
        writer.writerow([key] + grp)
yields a file, /tmp/data.csv, with content:
20001,48165,48162,48166
20004,48815
20005,48165
