Parse JSON response to populate data in the form of table - python

I have JSON response in a <class 'dict'>. I want to iterate over the JSON response and form a table view. Below is the sample JSON response.
{'ResultSet': {'Rows': [{'Data': [{'VarCharValue': 'cnt'}, {'VarCharValue': 'id'}, {'VarCharValue': 'val'}]}, {'Data': [{'VarCharValue': '2000'}, {'VarCharValue': '1234'}, {'VarCharValue': 'ABC'}]},{'Data': [{'VarCharValue': '3000'}, {'VarCharValue': '5678'}, {'VarCharValue': 'DEF'}]}]}}
Expected Output format:
cnt id val
2000 1234 ABC
3000 5678 DEF
There can be only one row of data, or there can be multiple rows for the column set (the sample data above has two rows).

I assume you want to use Pandas. Since pd.DataFrame accepts a list of dictionaries directly, you can restructure your input dictionary D as a list of dictionaries:
import pandas as pd

cols = [next(iter(i.values())) for i in D['ResultSet']['Rows'][0]['Data']]
d = [{col: j['VarCharValue'] for col, j in zip(cols, i['Data'])}
     for i in D['ResultSet']['Rows'][1:]]
df = pd.DataFrame(d)
print(df)

    cnt    id  val
0  2000  1234  ABC
1  3000  5678  DEF
You will probably want to convert at least the cnt series to numeric:
df['cnt'] = pd.to_numeric(df['cnt'])

I am not sure if you are using pandas, but you can easily parse your response dict into a pandas.DataFrame with the following code:
import pandas as pd
pd.DataFrame([[entr['VarCharValue'] for entr in r['Data']] for r in response['ResultSet']['Rows'][1:]],
             columns=[r['VarCharValue'] for r in response['ResultSet']['Rows'][0]['Data']])
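If you would rather not depend on pandas at all, the same restructuring can be done with plain Python; a minimal sketch, assuming the response dict has exactly the shape of the sample above (the first row holds the column names, the remaining rows hold the values):

# Plain-Python table view of the response dict (no pandas needed).
rows = response['ResultSet']['Rows']
header = [cell['VarCharValue'] for cell in rows[0]['Data']]
records = [[cell.get('VarCharValue', '') for cell in row['Data']] for row in rows[1:]]

# Pad each column to its widest value so the table lines up.
widths = [max(len(v) for v in col) for col in zip(header, *records)]
for line in [header] + records:
    print('  '.join(value.ljust(w) for value, w in zip(line, widths)))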


Filtering a Pandas DataFrame through a list dictionary

[Movie DataFrame sample shown as an image in the original post]
I have a DataFrame that contains movie information and I'm trying to filter the rows so that if the list of dictionaries contains 'name' == 'specified genre' it will display movies containing that genre.
I have tried using a list comprehension
filter = ['Action']
expectedResult = [d for d in df if d['name'] in filter]
however I end up with an error:
TypeError: string indices must be integers
d is a column name in your code. That's why you are getting this error.
See the following example:
import pandas as pd
df = pd.DataFrame({"abc": [1,2,3], "def": [4,5,6]})
for d in df:
    print(d)
Gives:
abc
def
I think what you are trying to do could be achieved by:
df = pd.DataFrame({"genre": ["something", "soemthing else"], "abc": ["movie1", "movie2"]})
movies = df.to_dict("records")
[m["abc"] for m in movies if m["genre"] == "something"]
Which gives:
['movie1']
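If you would rather stay inside the DataFrame instead of converting it to records, you can also filter with a boolean mask built by apply; a hedged sketch, assuming the genre dictionaries live in a column (called genres here, the real column name may differ) and each cell is a list of dicts with a 'name' key:

import pandas as pd

# Hypothetical structure: each 'genres' cell is a list of {'name': ...} dicts.
df = pd.DataFrame({
    "title": ["movie1", "movie2"],
    "genres": [[{"name": "Action"}, {"name": "Drama"}], [{"name": "Comedy"}]],
})

wanted = ["Action"]
mask = df["genres"].apply(lambda gs: any(g.get("name") in wanted for g in gs))
print(df[mask])  # keeps only rows whose genre list contains a wanted name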
Your loop, for d in df, iterates over the column headings, not the rows, so d takes column names (such as 'genres') as its value.
Try running:
for d in df:
    print(d)
and you will see what d actually contains.

Parallel Processing using Multiprocessing in Python

I'm new to doing parallel processing in Python. I have a large dataframe with names and the list of countries that the person lived in. A sample dataframe is this:
I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:
def split_country(data):
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name': row['Name'],
                           'value': value})
    data = data.append(d_list, ignore_index=True)
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return data
The final output is something like this:
I'm trying to parallelize the above process by passing my dataframe (df) using the following:
import multiprocessing as mp

result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))
But the processing does not stop even with a toy dataset like the one above. I'm completely new to this, so I would appreciate any help.
multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
For a test DataFrame with 1M rows, the following code took 1.54 seconds.
First, use pandas.DataFrame.explode on the column of lists. If the column contains strings, first use ast.literal_eval to convert them to lists:
df.countries = df.countries.apply(ast.literal_eval)
If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval}).
For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name' and aggregate with .sum.
import pandas as pd
from ast import literal_eval
# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}
# create the dataframe
df = pd.DataFrame(data)
# if the countries column is strings, evaluate to lists; otherwise skip this line
df.countries = df.countries.apply(literal_eval)
# explode the lists
df = df.explode('countries')
# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()
# display(df_counts)
    name  Canada  China  UK  USA
0   Jack       0      1   1    0
1  James       1      0   0    1
2   John       0      0   1    1
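If you do still want to try multiprocessing (for instance on a much larger frame), the usual pattern is to split the DataFrame into chunks and map a function over the chunks rather than over individual rows; a rough sketch under that assumption (whether it actually beats the vectorized version depends on the data size):

import numpy as np
import pandas as pd
from multiprocessing import Pool

def count_countries(chunk):
    # Explode each row's country list and count occurrences per name.
    return (chunk.explode('countries')
                 .groupby('name')['countries']
                 .value_counts()
                 .unstack(fill_value=0))

if __name__ == '__main__':
    df = pd.DataFrame({'name': ['John', 'Jack', 'James'],
                       'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]})
    chunks = np.array_split(df, 3)   # one chunk per worker for this toy example
    with Pool(3) as pool:
        parts = pool.map(count_countries, chunks)
    result = pd.concat(parts).groupby(level=0).sum().astype(int)
    print(result)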

Extracting value from JSON column very slow

I've got a CSV with a bunch of data. One of the columns, ExtraParams, contains a JSON object. I want to extract a value using a specific key, but it's taking quite a while to get through the 60,000-odd rows in the CSV. Can it be sped up?
counter = 0  # just to see where I'm at
order_data['NewColumn'] = ''
for row in range(len(total_data)):
    s = total_data['ExtraParams'][row]
    try:
        data = json.loads(s)
        new_data = data['NewColumn']
        counter += 1
        print(counter)
        order_data['NewColumn'][row] = new_data
    except:
        print('NewColumn not in row')
I use a try-except because a few of the rows have what I assume is malformed JSON, as they crash the program with an "expecting delimiter ','" error.
When I say "slow" I mean ~30 minutes for 60,000 rows.
EDIT: It might be worth noting that each JSON object contains about 35 key/value pairs.
You could use something like pandas and make use of the apply method. For some simple sample data in test.csv
Col1,Col2,ExtraParams
1,"a",{"dog":10}
2,"b",{"dog":5}
3,"c",{"dog":6}
You could use something like
In [1]: import pandas as pd
In [2]: import json
In [3]: df = pd.read_csv("test.csv")
In [4]: df.ExtraParams.apply(json.loads)
Out[4]:
0 {'dog': 10}
1 {'dog': 5}
2 {'dog': 6}
Name: ExtraParams, dtype: object
If you need to extract a field from the JSON, and assuming the field is present in each row, you can write a lambda function like
In [5]: df.ExtraParams.apply(lambda x: json.loads(x)['dog'])
Out[5]:
0 10
1 5
2 6
Name: ExtraParams, dtype: int64
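Since a few of your rows apparently contain malformed JSON, you can keep the apply approach and just wrap the parsing in a small helper rather than looping over rows; a sketch, assuming a failed parse or a missing key should simply become None (the 'NewColumn' key name is taken from your loop above):

import json
import pandas as pd

def extract_key(raw, key='NewColumn'):
    # Parse one JSON string; return None for malformed JSON or a missing key.
    try:
        return json.loads(raw).get(key)
    except (ValueError, TypeError, AttributeError):
        return None

df = pd.read_csv('test.csv')
df['NewColumn'] = df['ExtraParams'].apply(extract_key)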

Create and instantiate python 2d dictionary

I have two python dictionaries:
ccyAr = {'AUDCAD','AUDCHF','AUDJPY','AUDNZD','AUDUSD','CADCHF','CADJPY','CHFJPY','EURAUD','EURCAD','EURCHF','EURGBP','EURJPY','EURNZD','EURUSD','GBPAUD','GBPCAD','GBPCHF','GBPJPY','GBPNZD','GBPUSD','NZDCAD','NZDCHF','NZDJPY','NZDUSD','USDCAD','USDCHF','USDJPY'}
data = {'BTrades', 'BPips', 'BProfit', 'STrades', 'SPips', 'SProfit', 'Trades', 'Pips', 'Profit', 'Won', 'WonPC', 'Lost', 'LostPC'}
I've been trying to get my head round how to most elegantly create a construct in which each of 'data' exists in each of 'ccyAr'. The following are the two I feel are closest, but the first results (I now realise) in arrays and the latter is more like pseudocode:
1.
table={ { data:[] for d in data } for ccy in ccyAr }
2.
for ccy in ccyAr:
    for d in data:
        table['ccy']['d'] = 0
I also want to set each of the entries to int 0, and I'd like to do it in one go. I'm struggling with the comprehension method as I end up creating each value of each inner dictionary member as a list instead of the value 0.
I've seen the autovivification piece but I don't want to mimic perl, I want to do it the pythonic way. Any help = cheers.
for ccy in ccyAr:
    for d in data:
        table['ccy']['d'] = 0
Is close.
table = {}
for ccy in ccyAr:
    table[ccy] = {}
    for d in data:
        table[ccy][d] = 0
Also, ccyAr and data in your question are sets, not dictionaries.
What you are searching for is a pandas DataFrame of shape data x ccyAr. I give a minimal example here:
import numpy as np
import pandas as pd

data = {'1', '2'}
ccyAr = {'a', 'b', 'c'}
df = pd.DataFrame(np.zeros((len(data), len(ccyAr))))
Then the most important step is to set both the columns and the index. If your two so-called dictionaries are in fact sets (as it seems in your code), use:
df.columns = ccyAr
df.index = data
If they are indeed dictionaries, you instead have to call their keys method:
df.columns = ccyAr.keys()
df.index = data.keys()
You can print df to see that this is actually what you wanted:
| a | c | b
-------------
1 | 0 0 0
2 | 0 0 0
And now if you try to access it via df['a'][1], it returns 0. It is the best solution to your problem.
How to do this using a dictionary comprehension:
table = {ccy:{d:0 for d in data} for ccy in ccyAr}
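For completeness, a quick usage sketch of the resulting structure, with the two sets trimmed down for brevity. Each inner dict is built independently by the comprehension, so incrementing one counter does not leak into the others (unlike sharing a single inner dict, e.g. via dict.fromkeys):

ccyAr = {'AUDCAD', 'AUDCHF'}       # trimmed sets, just for illustration
data = {'BTrades', 'BPips'}
table = {ccy: {d: 0 for d in data} for ccy in ccyAr}

# Each currency pair has its own counters.
table['AUDCAD']['BTrades'] += 1
print(table['AUDCAD']['BTrades'])  # 1
print(table['AUDCHF']['BTrades'])  # 0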

Pandas dataframe to list of tuples

I parsed a .xlsx file to a pandas dataframe and desire converting to a list of tuples. The pandas dataframe has two columns.
The list of tuples needs the product_ids grouped by transaction_id. I saw a post on converting a pandas dataframe to a list of tuples, but that code just paired each transaction_id with a single product_id instead of grouping the product_ids by transaction_id.
How can I get the list of tuples in the desired format on the bottom of the page?
import pandas as pd
import xlrd
#Import data
trans = pd.ExcelFile('/Users/Transactions.xlsx')
#parse xlsx file into dataframe
transdata = trans.parse('Orders')
#view dataframe
#print transdata
   transaction_id  product_id
0           20001       48165
1           20001       48162
2           20001       48166
3           20004       48815
4           20005       48165
#Create tuple
trans_set = [tuple(x) for x in transdata.values]
print trans_set
[(20001, 48165), (20001, 48162), (20001, 48166), (20004, 48815), (20005, 48165)]
Desired Result:
[(20001, [48165, 48162, 48166]), (20004, 48815), (20005, 48165)]
trans_set = [(key, list(grp)) for key, grp in
             transdata.groupby(['transaction_id'])['product_id']]
In [268]: trans_set
Out[268]: [(20001, [48165, 48162, 48166]), (20004, [48815]), (20005, [48165])]
This is a little different from your desired result -- note the (20004, [48815]), for example -- but I think it is more consistent. The second item in each tuple is a list of all the product_ids associated with the transaction_id. It might contain only one element, but it is always a list.
To write trans_set to a CSV, you could use the csv module:
import csv
with open('/tmp/data.csv', 'wb') as f:
    writer = csv.writer(f)
    for key, grp in trans_set:
        writer.writerow([key] + grp)
yields a file, /tmp/data.csv, with content:
20001,48165,48162,48166
20004,48815
20005,48165
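Note that opening the file in 'wb' mode for csv.writer is the Python 2 idiom (matching the print statements above). A sketch of the Python 3 equivalent, which opens the file in text mode with newline='':

import csv

# Python 3 version of the same write: text mode with newline='' instead of 'wb'.
with open('/tmp/data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for key, grp in trans_set:
        writer.writerow([key] + grp)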
