Parse tsv with very specific format into Python

I have a tsv file containing a network. Here's a snippet. Column 0 contains unique IDs, column 1 contains an alternative ID (not necessarily unique). Each pair of columns after that contains an 'interactor' and a score of interaction.
11746909_a_at A1CF SHPRH 0.11081568 TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185 CCDC90B 0.14495682
11724734_at ABCB8 HYKK 0.09577321 LDB3 0.09845833
11723976_at ABCC8 FAM161B 0.15087105 ID1 0.14801268
11718612_a_at ABCD4 HOXC6 0.23559235 LCMT2 0.12867001
11758217_s_at ABHD17C FZD7 0.46334574 HIVEP3 0.24272481
So for example, A1CF connects to SHPRH and TRIM10 with scores of 0.11081568 and 0.11914056 respectively. I'm trying to convert this data into a 'flat' format using pandas which would look like this:
11746909_a_at A1CF SHPRH 0.11081568
TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185
CCDC90B 0.14495682
... and so on ...
Note that each row can have an arbitrary number of (interactor, score) pairs.
I've tried setting columns 0 and 1 as the index, then naming the remaining columns with df.columns = ['Interactor', 'Weight'] * (df.shape[1] // 2) and using pandas.groupby, but so far my attempts have not been successful. Can anybody suggest a way to do this?

Producing an output dataframe like the one you specified above shouldn't be too hard:
from collections import OrderedDict
import pandas as pd


def open_network_tsv(filepath):
    """
    Read the tsv file, returning every line split by tabs
    """
    with open(filepath) as network_file:
        for line in network_file.readlines():
            line_columns = line.strip().split('\t')
            yield line_columns


def get_connections(potential_conns):
    """
    Get the connections of a particular line, grouped
    in interactor:score pairs
    """
    for idx, val in enumerate(potential_conns):
        if not idx % 2:
            if len(potential_conns) >= idx + 2:
                yield val, potential_conns[idx + 1]


def create_connections_df(filepath):
    """
    Build the desired dataframe
    """
    connections = OrderedDict({
        'uniq_id': [],
        'alias': [],
        'interactor': [],
        'score': []
    })
    for line in open_network_tsv(filepath):
        uniq_id, alias, *potential_conns = line
        for connection in get_connections(potential_conns):
            connections['uniq_id'].append(uniq_id)
            connections['alias'].append(alias)
            connections['interactor'].append(connection[0])
            connections['score'].append(connection[1])
    return pd.DataFrame(connections)
Maybe you can do a dataframe.set_index(['uniq_id', 'alias']) or dataframe.groupby(['uniq_id', 'alias']) on the output afterward
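For example (a minimal sketch; 'network.tsv' is just an assumed filename for the snippet above), setting a hierarchical index on the flat output reproduces the layout from the question, with uniq_id/alias shown once per group:
df = create_connections_df('network.tsv')
df['score'] = df['score'].astype(float)   # scores are read in as strings
flat = df.set_index(['uniq_id', 'alias'])
print(flat)
# or aggregate per gene, e.g. the strongest interactor per (uniq_id, alias):
print(df.groupby(['uniq_id', 'alias'])['score'].max())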

Parsing a zipcode boundary geojson file based on a condition, then appending to a new json file if the condition is met

I have a geojson file of zipcode boundaries.
with open('zip_geo.json') as f:
    gj = geojson.load(f)

gj['features'][0]['properties']
Prints out:
{'STATEFP10': '36',
'ZCTA5CE10': '12205',
'GEOID10': '3612205',
'CLASSFP10': 'B5',
'MTFCC10': 'G6350',
'FUNCSTAT10': 'S',
'ALAND10': 40906445,
'AWATER10': 243508,
'INTPTLAT10': '+42.7187855',
'INTPTLON10': '-073.8292399',
'PARTFLG10': 'N'}
I also have a pandas dataframe with one of the fields being the zipcode.
I want to create a new geojson file containing only the features whose 'ZCTA5CE10' value is present in the zipcode column of my dataframe.
How would I go about doing this?
I was thinking of something like this (pseudocode):
new_dict = {}
for index, item in enumerate(gj):
    if item['features'][index]['properties']['ZCTA5CE10'] in df['zipcode']:
        new_dict += item
The syntax of the code above is obviously wrong, but I am not sure how to parse through multiple nested dictionaries and append the results to a new dictionary.
Link to the geojson file : https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/ny_new_york_zip_codes_geo.min.json
In short I want to remove all the elements relating to the zipcodes that are not there in the zipcode column in my dataframe.
Try this. Just update your ziplist to the zips in your dataframe, then you can save the new JSON to a local file:
import json
import requests

ziplist = ['12205', '14719', '12193', '12721']  # list of zips in your dataframe
url = 'https://github.com/OpenDataDE/State-zip-code-GeoJSON/raw/master/ny_new_york_zip_codes_geo.min.json'
gj = requests.get(url).json()

inziplist = []
for ft in gj['features']:
    if ft['properties']['ZCTA5CE10'] in ziplist:
        print(ft['properties']['ZCTA5CE10'])
        inziplist.append(ft)

print(len(inziplist))

new_zip_json = {}
new_zip_json['type'] = 'FeatureCollection'
new_zip_json['features'] = inziplist
new_zip_json = json.dumps(new_zip_json)
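To save the result to disk, a minimal sketch (the output filename is just an example; new_zip_json is already a JSON string at this point):
with open('ny_zips_filtered.geojson', 'w') as out:
    out.write(new_zip_json)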

Looking for a better way to accomplish dataframe to dictionary by series

Here's a portion of what the Excel file looks like. Meant to include this the first time. Thanks for the help so far.
Name Phone Number Carrier
FirstName LastName1 3410142531 Alltel
FirstName LastName2 2437201754 AT&T
FirstName LastName3 9247224091 Boost Mobile
FirstName LastName4 6548310018 Cricket Wireless
FirstName LastName5 8811620411 Project Fi
I am converting a list of names, phone numbers, and carriers to a dictionary for easy reference by other code. The idea is separate code will be able to call a name and access that person's phone number and carrier.
I got the output I need, but I'm wondering if there is an easier way I could have accomplished this task and gotten the same output. Though it's fairly concise, I'm interested in any module or built-in of which I'm not aware. My Python skills are beginner at best. I wrote this in Thonny with Python 3.6.4. Thanks!
# Imports
import pandas as pd
import math

# Assign spreadsheet filename to `file`
file = 'Phone_Numbers.xlsx'

# Load spreadsheets
xl = pd.ExcelFile(file)

# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('Sheet1', header=0)

# Put the dataframe into a dictionary to start
phone_numbers = df1.to_dict(orient='records')

# Converts PhoneNumbers.xlsx to a dictionary
x = 0
temp_dict = {}
for item in phone_numbers:
    temp_list = []
    for key in phone_numbers[x]:
        tempholder = phone_numbers[x][key]
        # Checks to see if there is a blank and if the phone number comes up as a float
        if (isinstance(tempholder, float) or isinstance(tempholder, int)) and math.isnan(tempholder) == False:
            # Converts any floats to string for use in later code
            tempholder = str(int(tempholder))
        else:
            pass
        temp_list.append(tempholder)
    # Makes the first item in the list the key and adds the rest as values
    temp_dict[temp_list[0]] = temp_list[1:]
    x += 1

print(temp_dict)
Here's the desired output:
{'FirstName LastName1': ['3410142531', 'Alltel'], 'FirstName LastName2': [2437201754, 'AT&T'], 'FirstName LastName3': [9247224091, 'Boost Mobile'], 'FirstName LastName4': [6548310018, 'Cricket Wireless'], 'FirstName LastName5': [8811620411, 'Project Fi']}
One way to do it would be to iterate through the dataframe and use a dictionary comprehension:
temp_dict = {row['Name']:[row['Phone Number'], row['Carrier']] for _, row in df.iterrows()}
where df is your original dataframe (the result of xl.parse('Sheet1', header=0)). This basically iterates through all rows in your dataframe, creating a dictionary key for each Name, with Phone Number and Carrier as its values (in a list), as you indicated in your output.
To make sure that your phone number is not null (as you did in your loop), you could add an if clause to your dict comprehension, such as this:
temp_dict = {row['Name']: [row['Phone Number'], row['Carrier']]
             for _, row in df.iterrows()
             if not math.isnan(row['Phone Number'])}
df.set_index('Name').T.to_dict('list')
should do the job. Here df is your dataframe.
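A quick sketch of what that one-liner produces on the sample above (assuming df1 is the parsed sheet and a reasonably recent pandas); note that to_dict('list') keeps the original dtypes, so cast the numbers first if you want them as strings:
lookup = df1.set_index('Name').T.to_dict('list')
print(lookup['FirstName LastName1'])   # [3410142531, 'Alltel']

# mirror the str(int(...)) conversion from the question, skipping blanks:
df1['Phone Number'] = df1['Phone Number'].apply(lambda x: str(int(x)) if pd.notna(x) else x)
lookup = df1.set_index('Name').T.to_dict('list')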

Read data in chunks and keep one row for each ID in Python

Imagine we have a big file with rows as follows
ID value string
1 105 abc
1 98 edg
1 100 aoafsk
2 160 oemd
2 150 adsf
...
Say the file is named file.txt and is separated by tab.
I want to keep the largest value for each ID. The expected output is
ID value string
1 105 abc
2 160 oemd
...
How can I read it by chunks and process the data? If I read the data in chunks, how can I make sure at the end of each chunk the records are complete for each ID?
Keep track of the data in a dictionary of this format:
data = {
    ID: [value, 'string'],
}
As you read each line from the file, see if that ID is already in the dict. If not, add it; if it is and the value you just read is bigger, replace the stored entry.
At the end, your dict will hold the biggest value for every ID.
# init to empty dict
data = {}

# open the input file
with open('file.txt', 'r') as fp:
    next(fp)  # skip the header row
    # read each line
    for line in fp:
        # grab ID, value, string
        item_id, item_value, item_string = line.split()
        # convert ID and value to integers
        item_id = int(item_id)
        item_value = int(item_value)
        # if ID is not in the dict at all, or if the value we just read
        # is bigger, use the current values
        if item_id not in data or item_value > data[item_id][0]:
            data[item_id] = [item_value, item_string]

for item_id in data:
    print(item_id, data[item_id][0], data[item_id][1])
Dictionaries don't enforce any specific ordering of their contents, so at the end of your program when you get the data back out of the dict, it might not be in the same order as the original file (i.e. you might see ID 2 first, followed by ID 1).
If this matters to you, you can use an OrderedDict, which retains the original insertion order of the elements.
(Did you have something specific in mind when you said "read by chunks"? If you meant a specific number of bytes, then you might run into issues if a chunk boundary happens to fall in the middle of a word...)
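If by "chunks" you meant pandas-style chunked reading, here is a minimal sketch (assuming a tab-separated file with a header row, as in the question); because the per-ID maximum is re-computed after every chunk, it does not matter where a chunk boundary falls relative to an ID's rows:
import pandas as pd

best = None
for chunk in pd.read_csv('file.txt', sep='\t', chunksize=100000):
    # best row per ID within this chunk
    part = chunk.sort_values('value', ascending=False).drop_duplicates('ID')
    # fold into the running result and keep the max per ID again
    best = part if best is None else (
        pd.concat([best, part])
          .sort_values('value', ascending=False)
          .drop_duplicates('ID')
    )

print(best.sort_values('ID'))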
Code
import csv
import itertools as it
import collections as ct

with open("test.csv") as f:
    reader = csv.DictReader(f, delimiter=" ")              # 1
    for k, g in it.groupby(reader, lambda d: d["ID"]):     # 2
        print(max(g, key=lambda d: float(d["value"])))     # 3

# {'value': '105', 'string': 'abc', 'ID': '1'}
# {'value': '160', 'string': 'oemd', 'ID': '2'}
Details
The with block ensures safe opening and closing of the file f. The file is iterable, allowing you to loop over it or, ideally, apply itertools.
For each line of f, csv.DictReader splits the data and maintains header-row information as key-value pairs of a dictionary, e.g. [{'value': '105', 'string': 'abc', 'ID': '1'}, ...
This data is iterable and passed to groupby, which chunks all of the data by ID. See this post for more details on how groupby works.
The max() builtin combined with a special key function returns the dicts with the largest "value". See this tutorial for more details on the max() function.

Pandas + HDF5 Panel data storage for large data

As part of my research, I am searching for a good storage design for my panel data. I am using pandas for all in-memory operations. I've had a look at the following two questions/contributions, Large Data Work flows using Pandas and Query HDF5 Pandas, as they come closest to my set-up. However, I have a couple of questions left. First, let me define my data and some requirements:
Size: I have around 800 dates, 9000 IDs and up to 200 variables. Hence, flattening the panel (along dates and IDs) corresponds to 7.2 million rows and 200 columns. This might all fit in memory or not; let's assume it does not. Disk space is not an issue.
Variables are typically calculated once, but updates/changes probably happen from time to time. Once updates occur, old versions don't matter anymore.
New variables are added from time to time, mostly one at a time only.
New rows are not added.
Querying takes place. For example, often I need to select only a certain date range like date>start_date & date<end_date. But some queries need to consider rank conditions on dates. For example, get all data (i.e. columns) where rank(var1)>500 & rank(var1)<1000, where rank is as of date.
The objective is to achieve fast reading/querying of data. Data writing is not so critical.
I thought of the following HDF5 design:
Follow the groups_map approach (of 1) to store variables in different tables. Limit the number of columns for each group to 10 (to avoid large memory loads when updating single variables, see point 3).
Each group represents one table, where I use the multi-index based on dates & ids for each table stored.
Create an update function to update variables. The function loads the table with all (10) columns into memory as a df, deletes the table on disk, replaces the updated variable in df and saves the table from memory back to disk.
Create an add function that adds var1 to a group with fewer than 10 columns, or creates a new group if required. Saving works as in 3: load the current group into memory, delete the table on disk, add the new column and save it back to disk.
Calculate ranks as of date for relevant variables and add them to disk-storage as rank_var1, which should reduce the query as of to simply rank_var1 > 500 & rank_var1<1000.
I have the following questions:
Updating HDFTable, I suppose I have to delete the entire table in order to update a single column?
When to use 'data_columns', or should I simply assign True in HDFStore.append()?
If I want to query based on the condition rank_var1 > 500 & rank_var1 < 1000, but I need columns from other groups: can I enter the index received from the rank_var1 condition into the query to get other columns based on this index (the index is a multi-index with date and ID)? Or would I need to loop over this index by date and then chunk the IDs, similar to what is proposed in 2, and repeat the procedure for each group I need? Alternatively, (a) I could add rank columns to each group's table, but that seems extremely inefficient in terms of disk storage. Note, the number of variables where rank filtering is relevant is limited (say 5). Or (b) I could simply use the df_rank received from the rank_var1 query and use in-memory operations via df_rank.merge(df_tmp, left_index=True, right_index=True, how='left'), looping through the groups (df_tmp) where I select the desired columns.
Say I have some data at different frequencies. Having different group_maps (or different storages) for different frequencies is the way to go, I suppose?
Copies of the storage might be used on win/ux systems. I assume it is perfectly compatible, anything to consider here?
I plan to use pd.HDFStore(str(self.path), mode='a', complevel=9, complib='blosc'). Any concerns regarding complevel or complib?
I've started to write up some code, once I have something to show I'll edit and add it if desired. Please, let me know if you need any more information.
EDIT: Here is a first version of my storage class; please adjust the path at the bottom accordingly. Sorry for the length of the code, comments welcome.
import pandas as pd
import numpy as np
import string


class LargeDFStorage():
    # TODO add index features to ensure correct indexes
    # index_names = ('date', 'id')

    def __init__(self, h5_path, groups_map):
        """
        Parameters
        ----------
        h5_path: str
            hdf5 storage path
        groups_map: dict
            where keys are group_names and values are dicts, with at least key
            'columns' where the value is a list of column names.
            A special group_name is reserved for group_name/key "query", which
            can be used as a querying and conditioning table when getting data,
            see :meth:`.get`.
        """
        self.path = str(h5_path)
        self.groups_map = groups_map
        self.column_map = self._get_column_map()
        # if desired, make part of arguments
        self.complib = 'blosc'
        self.complevel = 9

    def _get_column_map(self):
        """ Calc the inverse of the groups_map / ensures uniqueness of cols

        Returns
        -------
        dict: with cols as keys and group_names as values
        """
        column_map = dict()
        for g, value in self.groups_map.items():
            if len(set(column_map.keys()) & set(value['columns'])) > 0:
                raise ValueError('Columns have to be unique')
            for col in value['columns']:
                column_map[col] = g
        return column_map
    @staticmethod
    def group_col_names(store, group_name):
        """ Returns all column names of a specific group

        Parameters
        ----------
        store: pd.HDFStore
        group_name: str

        Returns
        -------
        list:
            of all column names in the group
        """
        if group_name not in store:
            return []
        # hack to get column names, straightforward way!?
        return store.select(group_name, start=0, stop=0).columns.tolist()

    @staticmethod
    def stored_cols(store):
        """ Collects all columns stored in the HDF5 store

        Parameters
        ----------
        store: pd.HDFStore

        Returns
        -------
        list:
            a list of all columns currently in the store
        """
        stored_cols = list()
        for x in store.items():
            group_name = x[0][1:]
            stored_cols += LargeDFStorage.group_col_names(store, group_name)
        return stored_cols
    def _find_groups(self, columns):
        """ Searches all groups required for covering columns

        Parameters
        ----------
        columns: list
            list of valid columns

        Returns
        -------
        list:
            of unique groups
        """
        groups = list()
        for column in columns:
            groups.append(self.column_map[column])
        return list(set(groups))
    def add_columns(self, df):
        """ Adds columns to storage for the first time. If columns should
        be updated, use :meth:`.update` instead.

        Parameters
        ----------
        df: pandas.DataFrame
            with new columns (not yet stored in any of the tables)

        Returns
        -------
        """
        store = pd.HDFStore(self.path, mode='a', complevel=self.complevel,
                            complib=self.complib)
        # check if any column has been stored already
        if df.columns.isin(self.stored_cols(store)).any():
            store.close()
            raise ValueError('Some cols are already in the store')
        # find all groups needed to store the data
        groups = self._find_groups(df.columns)
        for group in groups:
            v = self.groups_map[group]
            # select columns of current group in df
            select_cols = df.columns[df.columns.isin(v['columns'])].tolist()
            tmp = df.reindex(columns=select_cols, copy=False)
            # only the 'query' group gets data_columns (indexed for where-queries)
            dc = None
            if group == 'query':
                dc = True
            stored_cols = self.group_col_names(store, group)
            # no columns in group (group does not exist yet)
            if len(stored_cols) == 0:
                store.append(group, tmp, data_columns=dc)
            else:
                # load current disk data to memory
                df_grp = store.get(group)
                # remove data from disk
                store.remove(group)
                # add new column(s) to df_grp
                df_grp = df_grp.merge(tmp, left_index=True, right_index=True,
                                      how='left')
                # save old data with new, additional columns
                store.append(group, df_grp, data_columns=dc)
        store.close()
    def _query_table(self, store, columns, where):
        """ Selects data from the 'query' table using a where expression

        Parameters
        ----------
        store: pd.HDFStore
        columns: list
            desired data columns
        where: str
            a valid select expression

        Returns
        -------
        """
        query_cols = self.group_col_names(store, 'query')
        if len(query_cols) == 0:
            store.close()
            raise ValueError('No data to query table')
        get_cols = list(set(query_cols) & set(columns))
        if len(get_cols) == 0:
            # load only one column to minimize memory usage
            df_query = store.select('query', columns=[query_cols[0]],
                                    where=where)
            add_query = False
        else:
            # load columns which are needed anyway
            df_query = store.select('query', columns=get_cols, where=where)
            add_query = True
        return df_query, add_query
    def get(self, columns, where=None):
        """ Retrieve data from storage

        Parameters
        ----------
        columns: list/str
            list of columns to use, or use 'all' if all columns should be
            retrieved
        where: str
            a valid select statement

        Returns
        -------
        pandas.DataFrame
            with all requested columns and considering where
        """
        store = pd.HDFStore(str(self.path), mode='r')
        # get all columns stored in the HDFStore
        stored_cols = self.stored_cols(store)
        if columns == 'all':
            columns = stored_cols
        # check if all desired columns can be found in storage
        if len(set(columns) - set(stored_cols)) > 0:
            store.close()
            raise ValueError('Column(s) {} not in storage'.format(
                set(columns) - set(stored_cols)))
        # get all relevant groups (where columns are taken from)
        groups = self._find_groups(columns)
        # if a where query is defined, retrieve data from storage; eventually
        # only the index of df_query might be used
        if where is not None:
            df_query, add_df_query = self._query_table(store, columns, where)
        else:
            df_query, add_df_query = None, False
        # df collector
        df = list()
        for group in groups:
            # skip the 'query' group here if it was already handled via where
            if where is not None and group == 'query':
                continue
            # all columns which are in the group and also requested
            get_cols = list(
                set(self.group_col_names(store, group)) & set(columns))
            tmp_df = store.select(group, columns=get_cols)
            if df_query is None:
                df.append(tmp_df)
            else:
                # align query index with df index from storage
                df_query, tmp_df = df_query.align(tmp_df, join='left', axis=0)
                df.append(tmp_df)
        store.close()
        # if any data of query should be added
        if add_df_query:
            df.append(df_query)
        # combine all columns
        df = pd.concat(df, axis=1)
        return df
    def update(self, df):
        """ Updates data in storage; all columns have to be stored already in
        order to be accepted for updating (use :meth:`.add_columns` otherwise)

        Parameters
        ----------
        df: pd.DataFrame
            with index as in storage, and columns as desired

        Returns
        -------
        """
        store = pd.HDFStore(self.path, mode='a', complevel=self.complevel,
                            complib=self.complib)
        # check if all columns have been stored already
        if not df.columns.isin(self.stored_cols(store)).all():
            store.close()
            raise ValueError('Some cols have not been stored yet')
        # find all groups needed to store the data
        groups = self._find_groups(df.columns)
        for group in groups:
            dc = None
            if group == 'query':
                dc = True
            # load current disk data to memory
            group_df = store.get(group)
            # remove data from disk
            store.remove(group)
            # update with new data
            group_df.update(df)
            # save updated df back to disk
            store.append(group, group_df, data_columns=dc)
        store.close()
class DataGenerator():
    np.random.seed(1282)

    @staticmethod
    def get_df(rows=100, cols=10, freq='M'):
        """ Simulate a data frame
        """
        if cols < 26:
            col_name = list(string.ascii_lowercase[:cols])
        else:
            col_name = range(cols)
        if rows > 2000:
            freq = 'Min'
        index = pd.date_range('19870825', periods=rows, freq=freq)
        df = pd.DataFrame(np.random.standard_normal((rows, cols)),
                          columns=col_name, index=index)
        df.index.name = 'date'
        df.columns.name = 'ID'
        return df

    @staticmethod
    def get_panel(rows=1000, cols=500, items=10):
        """ Simulate panel data
        """
        if items < 26:
            item_names = list(string.ascii_lowercase[:items])
        else:
            item_names = range(items)
        panel_ = dict()
        for item in item_names:
            panel_[item] = DataGenerator.get_df(rows=rows, cols=cols)
        return pd.Panel(panel_)
def main():
    # Example with DataFrame
    path = 'D:\\fc_storage.h5'
    groups_map = dict(
        a=dict(columns=['a', 'b', 'c', 'd', 'k']),
        query=dict(columns=['e', 'f', 'g', 'rank_a']),
    )
    storage = LargeDFStorage(path, groups_map=groups_map)
    df = DataGenerator.get_df(rows=200000, cols=15)
    storage.add_columns(df[['a', 'b', 'c', 'e', 'f']])
    storage.update(df[['a']]*3)
    storage.add_columns(df[['d', 'g']])
    print(storage.get(columns=['a', 'b', 'f'], where='f<0 & e<0'))

    # Example with panel and rank condition
    path2 = 'D:\\panel_storage.h5'
    storage_pnl = LargeDFStorage(path2, groups_map=groups_map)
    panel = DataGenerator.get_panel(rows=800, cols=2000, items=24)
    df = panel.to_frame()
    df['rank_a'] = df[['a']].groupby(level='date').rank()
    storage_pnl.add_columns(df[['a', 'b', 'c', 'e', 'f']])
    storage_pnl.update(df[['a']]*3)
    storage_pnl.add_columns(df[['d', 'g', 'rank_a']])
    print(storage_pnl.get(columns=['a', 'b', 'e', 'f', 'rank_a'],
                          where='f>0 & e>0 & rank_a <100'))


if __name__ == '__main__':
    main()
It's a bit difficult to answer those questions without particular examples...
Updating HDFTable, I suppose I have to delete the entire table in
order to update a single column?
AFAIK yes, unless you are storing single columns separately, but it will be done automatically, you just have to write your DF/Panel back to the HDF Store.
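A minimal sketch of that write-back pattern (the group and column names here are placeholders, not from the question):
import pandas as pd

with pd.HDFStore('storage.h5', mode='a', complevel=9, complib='blosc') as store:
    df = store['group_a']            # load the whole table into memory
    df['var1'] = df['var1'] * 1.5    # change the single column in memory
    store.remove('group_a')          # drop the old table on disk
    store.append('group_a', df)      # write the full table back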
When to use 'data_columns', or should I simply assign True in
HDFStore.append()?
data_columns=True - will index all your columns - IMO it's a waste of resources unless you are going to use all columns in the where parameter (i.e. if all columns should be indexed).
I would specify there only those columns that will often be used for searching in the where= clause. Consider those columns as indexed columns in a database table.
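For example (a sketch; the column names are placeholders and store is an already-open HDFStore):
# index only the columns you expect to filter on
store.append('group_a', df, data_columns=['date', 'id', 'rank_var1'])

# those columns can then be used directly in a where clause
subset = store.select('group_a', where='rank_var1 > 500 & rank_var1 < 1000')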
If I want to query based on condition of rank_var1 > 500 &
rank_var1<1000, but I need columns from other groups. Can I enter the
index received from the rank_var1 condition into the query to get
other columns based on this index (the index is a multi-index with
date and ID)?
I think we would need some reproducible sample data and examples of your queries in order to give a reasonable answer...
Copies of the storage might be used on win/ux systems. I assume it is
perfectly compatible, anything to consider here?
Yes, it should be fully compatible
I plan to use pd.HDFStore(str(self.path), mode='a', complevel=9,
complib='blosc'). Any concerns regarding complevel or complib?
Test it with your data - results might depend on dtypes, number of unique values, etc. You may also want to consider the lzo complib - it might be faster in some use-cases. Check this. Sometimes a high complevel doesn't give you a better compression ratio, but will be slower (see results of my old comparison)
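A rough way to test this yourself (a sketch; 'lzo' requires the python-lzo bindings, and df stands for your own frame):
import time
import pandas as pd

def time_write(df, complib, complevel=9):
    start = time.perf_counter()
    with pd.HDFStore('bench_{}.h5'.format(complib), mode='w',
                     complevel=complevel, complib=complib) as store:
        store.append('df', df)
    return time.perf_counter() - start

for lib in ('blosc', 'lzo', 'zlib'):   # 'lzo' may not be available on every install
    print(lib, round(time_write(df, lib), 3), 'seconds')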

Extract nested JSON embedded as string in Pandas dataframe

I have a CSV where one of the fields is a nested JSON object, stored as a string. I would like to load the CSV into a dataframe and parse the JSON into a set of fields appended to the original dataframe; in other words, extract the contents of the JSON and make them part of the dataframe.
My CSV:
id|dist|json_request
1|67|{"loc":{"lat":45.7, "lon":38.9},"arrival": "Monday", "characteristics":{"body":{"color":"red", "make":"sedan"}, "manuf_year":2014}}
2|34|{"loc":{"lat":46.89, "lon":36.7},"arrival": "Tuesday", "characteristics":{"body":{"color":"blue", "make":"sedan"}, "manuf_year":2014}}
3|98|{"loc":{"lat":45.70, "lon":31.0}, "characteristics":{"body":{"color":"yellow"}, "manuf_year":2010}}
Note that not all keys are the same for all the rows.
I'd like it to produce a data frame equivalent to this:
data = {'id': [1, 2, 3],
        'dist': [67, 34, 98],
        'loc_lat': [45.7, 46.89, 45.70],
        'loc_lon': [38.9, 36.7, 31.0],
        'arrival': ["Monday", "Tuesday", "NA"],
        'characteristics_body_color': ["red", "blue", "yellow"],
        'characteristics_body_make': ["sedan", "sedan", "NA"],
        'characteristics_manuf_year': [2014, 2014, 2010]}
df = pd.DataFrame(data)
(I'm really sorry, I can't get the table itself to look sensible in SO! Please don't be mad at me, I'm a rookie :( )
What I've tried
After a lot of futzing around, I came up with the following solution:
import json

import pandas as pd
from pandas.io.json import json_normalize

# Import data
df_raw = pd.read_csv("sample.csv", delimiter="|")

# Parsing function
def parse_request(s):
    sj = json.loads(s)
    norm = json_normalize(sj)
    return norm

# Create an empty dataframe to store results
parsed = pd.DataFrame(columns=['id'])

# Loop through and parse JSON in each row
for i in df_raw.json_request:
    parsed = parsed.append(parse_request(i))

# Merge results back onto original dataframe
df_parsed = df_raw.join(parsed)
This is obviously inelegant and really inefficient (would take multiple hours on the 300K rows that I have to parse). Is there a better way?
Where I've looked
I've gone through the following related questions:
Reading a CSV into pandas where one column is a json string
(which seems to only work for simple, non-nested JSON)
JSON to pandas DataFrame
(I borrowed parts of my solution from this, but I can't figure out how to apply this solution across the dataframe without looping through rows)
I'm using Python 3.3 and Pandas 0.17.
Here's an approach that speeds things up by a factor of 10 to 100, and should allow you to read your big file in under a minute, as opposed to over an hour. The idea is to only construct a dataframe once all of the data has been read, thereby reducing the number of times memory needs to be allocated, and to only call json_normalize once on the entire chunk of data, rather than on each row:
import csv
import json
import pandas as pd
from pandas.io.json import json_normalize

with open('sample.csv') as fh:
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)

    # "transpose" the data. `data` is now a tuple of strings
    # containing JSON, one for each row
    idents, dists, data = zip(*rows)

data = [json.loads(row) for row in data]
df = json_normalize(data)
df['ids'] = idents
df['dists'] = dists
So that:
>>> print(df)
arrival characteristics.body.color characteristics.body.make \
0 Monday red sedan
1 Tuesday blue sedan
2 NaN yellow NaN
characteristics.manuf_year loc.lat loc.lon ids
0 2014 45.70 38.9 1
1 2014 46.89 36.7 2
2 2010 45.70 31.0 3
Furthermore, I looked into what pandas's json_normalize is doing, and it's performing some deep copies that shouldn't be necessary if you're just creating a dataframe from a CSV. We can implement our own flatten function which takes a dictionary and "flattens" the keys, similar to what json_normalize does. Then we can make a generator which spits out one row of the dataframe at a time as a record. This approach is even faster:
def flatten(dct, separator='_'):
    """A fast way to flatten a dictionary."""
    res = {}
    queue = [('', dct)]
    while queue:
        prefix, d = queue.pop()
        for k, v in d.items():
            key = prefix + k
            if not isinstance(v, dict):
                res[key] = v
            else:
                queue.append((key + separator, v))
    return res


def records_from_json(fh):
    """Yields the records from a file object."""
    rows = csv.reader(fh, delimiter='|')
    header = next(rows)
    for ident, dist, data in rows:
        rec = flatten(json.loads(data))
        rec['id'] = ident
        rec['dist'] = dist
        yield rec


def from_records(path):
    with open(path) as fh:
        return pd.DataFrame.from_records(records_from_json(fh))
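Usage would then look something like this (the timing is only for illustration):
import time

start = time.perf_counter()
df = from_records('sample.csv')
print(df.head())
print('parsed {} rows in {:.2f}s'.format(len(df), time.perf_counter() - start))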
And here are the results of a timing experiment where I artificially increased the size of your sample data by repeating rows. The number of lines is denoted by n_rows:
method 1 (s) method 2 (s) original time (s)
n_rows
96 0.008217 0.002971 0.362257
192 0.014484 0.004720 0.678590
384 0.027308 0.008720 1.373918
768 0.055644 0.016175 2.791400
1536 0.105730 0.030914 5.727828
3072 0.209049 0.060105 11.877403
Extrapolating linearly, the first method should read 300k lines in about 20 seconds, while the second method should take around 6 seconds.
