pandas summing up columns by group [duplicate] - python

I often use pandas groupby to generate stacked tables. But then I often want to output the resulting nested relations to JSON. Is there any way to extract a nested JSON field from the stacked table it produces?
Let's say I have a df like:
year  office  candidate  amount
2010  mayor   joe smith  100.00
2010  mayor   jay gould   12.00
2010  govnr   pati mara  500.00
2010  govnr   jess rapp   50.00
2010  govnr   jess rapp   30.00
I can do:
grouped = df.groupby(['year', 'office', 'candidate']).sum()
print grouped
                            amount
year office candidate
2010 mayor  joe smith          100
            jay gould           12
     govnr  pati mara          500
            jess rapp           80
Beautiful! Of course, what I'd really like to do is get nested JSON via a command along the lines of grouped.to_json(). But that feature isn't available. Any workarounds?
So, what I really want is something like:
{"2010": {"mayor": [
              {"joe smith": 100},
              {"jay gould": 12}
          ],
          "govnr": [
              {"pati mara": 500},
              {"jess rapp": 80}
          ]}
}
Don

I don't think there is anything built-in to pandas to create a nested dictionary of the data. Below is some code that should work in general for a Series with a MultiIndex, using a defaultdict.
The nesting code iterates through each level of the MultiIndex, adding layers to the dictionary until the deepest layer is assigned to the Series value.
In [99]: from collections import defaultdict
In [100]: results = defaultdict(lambda: defaultdict(dict))
In [101]: for index, value in grouped.itertuples():
     ...:     for i, key in enumerate(index):
     ...:         if i == 0:
     ...:             nested = results[key]
     ...:         elif i == len(index) - 1:
     ...:             nested[key] = value
     ...:         else:
     ...:             nested = nested[key]
In [102]: results
Out[102]: defaultdict(<function <lambda> at 0x7ff17c76d1b8>, {2010: defaultdict(<type 'dict'>, {'govnr': {'pati mara': 500.0, 'jess rapp': 80.0}, 'mayor': {'joe smith': 100.0, 'jay gould': 12.0}})})
In [106]: print json.dumps(results, indent=4)
{
    "2010": {
        "govnr": {
            "pati mara": 500.0,
            "jess rapp": 80.0
        },
        "mayor": {
            "joe smith": 100.0,
            "jay gould": 12.0
        }
    }
}
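json.dumps accepts the defaultdict directly (it subclasses dict), but if you want a plain nested dict, say for a cleaner repr, a small recursive converter does it. This helper is a sketch, not part of the original answer:

def to_plain_dict(d):
    # Recursively convert dict-like objects (including defaultdict) to plain dicts
    if isinstance(d, dict):
        return {k: to_plain_dict(v) for k, v in d.items()}
    return d

plain = to_plain_dict(results)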

I had a look at the solution above and figured out that it only works for 3 levels of nesting. This solution will work for any number of levels.
import json

levels = len(grouped.index.levels)
dicts = [{} for i in range(levels)]
last_index = None

for index, value in grouped.itertuples():
    if not last_index:
        last_index = index

    for (ii, (i, j)) in enumerate(zip(index, last_index)):
        if not i == j:
            ii = levels - ii - 1
            dicts[:ii] = [{} for _ in dicts[:ii]]
            break

    for i, key in enumerate(reversed(index)):
        dicts[i][key] = value
        value = dicts[i]

    last_index = index

result = json.dumps(dicts[-1])
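For comparison, here is a sketch of the same arbitrary-depth nesting built with dict.setdefault instead of resetting the dict list on prefix changes; like the code above, it assumes grouped has a single value column:

nested = {}
for index, value in grouped.itertuples():
    d = nested
    for key in index[:-1]:
        d = d.setdefault(key, {})  # descend, creating levels as needed
    d[index[-1]] = value           # the deepest key holds the amount
result = json.dumps(nested)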

Here is a generic recursive solution for this problem:
def df_to_dict(df):
    # A Series (ndim == 1) is the innermost level: return it as a plain dict.
    if df.ndim == 1:
        return df.to_dict()
    ret = {}
    # .unique() avoids redundant cross-sections when the first level repeats.
    for key in df.index.get_level_values(0).unique():
        sub_df = df.xs(key)
        ret[key] = df_to_dict(sub_df)
    return ret
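A usage note, hedged: called on the grouped DataFrame from the question, the deepest .xs() returns a one-row Series, so each leaf comes out as an {'amount': value} dict:

nested = df_to_dict(grouped)
# nested[2010]['mayor']['joe smith'] == {'amount': 100.0}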

I'm aware this is an old question, but I came across the same issue recently. Here's my solution. I borrowed a lot of stuff from chrisb's example (Thank you!).
This has the advantage that you can pass a lambda to get the final value from whatever enumerable you want, as well as for each group.
from collections import defaultdict

def dict_from_enumerable(enumerable, final_value, *groups):
    d = defaultdict(lambda: defaultdict(dict))
    group_count = len(groups)
    for item in enumerable:
        nested = d
        item_result = final_value(item) if callable(final_value) else item.get(final_value)
        for i, group in enumerate(groups, start=1):
            group_val = str(group(item) if callable(group) else item.get(group))
            if i == group_count:
                nested[group_val] = item_result
            else:
                nested = nested[group_val]
    return d
In the question, you'd call this function like:
dict_from_enumerable(grouped.itertuples(), 'amount', 'year', 'office', 'candidate')
The first argument can be an array of data as well, not even requiring pandas.
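For example, a sketch with plain dicts instead of pandas rows (note that the function str()s the group keys):

rows = [
    {'year': 2010, 'office': 'mayor', 'candidate': 'joe smith', 'amount': 100.0},
    {'year': 2010, 'office': 'mayor', 'candidate': 'jay gould', 'amount': 12.0},
    {'year': 2010, 'office': 'govnr', 'candidate': 'pati mara', 'amount': 500.0},
]
result = dict_from_enumerable(rows, 'amount', 'year', 'office', 'candidate')
# result['2010']['mayor']['joe smith'] == 100.0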

Related

Transform string to Pandas df

I have the string like that:
'key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47'
How can I transform it into a Pandas DataFrame like this?
   key  age
0
1
Thank you
Use:
In [919]: s = 'key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47'
In [922]: d = {}
In [927]: for i in s.split(', '):
     ...:     ele, val = i.split('=')
     ...:     if ele in d:
     ...:         d[ele].append(val)
     ...:     else:
     ...:         d[ele] = [val]
     ...:
In [930]: df = pd.DataFrame(d)
In [931]: df
Out[931]:
     key  age
0  IAfpK   58
1  WNVdi   64
2  jp9zt   47
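If you'd rather avoid the intermediate dict, here is a sketch that pairs consecutive key/age entries directly; it assumes the string strictly alternates key, age:

pairs = [item.split('=') for item in s.split(', ')]
rows = [{'key': pairs[i][1], 'age': pairs[i + 1][1]}
        for i in range(0, len(pairs), 2)]
df = pd.DataFrame(rows)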
A quick and somewhat manual way to do it would be to first build a list of dicts, one per record, and then convert that list to a DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):
import pandas as pd
keylist = []
keylist.append({"key": 'IAfpK', "age": '58'})
keylist.append({"key": 'WNVdi', "age": '64'})
keylist.append({"key": 'jp9zt', "age": '47'})
#convert the list of dictionaries into a df
key_df = pd.DataFrame(keylist, columns = ['key', 'age'])
However, this is only practical for the specific string you mentioned; if you need to work on a longer string or more data, a for loop would be more efficient.
Although I think this answers your question, there are probably more optimal ways to go about it :)
Try:
s = "key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47"
x = (
    pd.Series(s)
    .str.extractall(r"key=(?P<key>.*?),\s*age=(?P<age>.*?)(?=,|\Z)")
    .reset_index(drop=True)
)
print(x)
Prints:
     key age
0  IAfpK  58
1  WNVdi  64
2  jp9zt  47

How to split one row into multiple rows in python

I have a pandas dataframe that has one long row as a result of a flattened json list.
I want to go from the example:
{'0_id': 1, '0_name': a, '0_address': USA, '1_id': 2, '1_name': b, '1_address': UK, '1_hobby': ski}
to a table like the following:
id  name  address  hobby
1   a     USA
2   b     UK       ski
Any help is greatly appreciated :)
There you go:
import json

json_data = '{"0_id": 1, "0_name": "a", "0_address": "USA", "1_id": 2, "1_name": "b", "1_address": "UK", "1_hobby": "ski"}'
arr = json.loads(json_data)

result = {}
for k in arr:
    kk = k.split("_")
    if int(kk[0]) not in result:
        result[int(kk[0])] = {"id": "", "name": "", "hobby": ""}
    result[int(kk[0])][kk[1]] = arr[k]

for key in result:
    print("%s %s %s" % (key, result[key]["name"], result[key]["address"]))
If you want the fields to be more dynamic, you have two choices: either go through the whole array first and gather all possible names to build the template, or just check whether a key exists in the result when you return the results :)
This way only works if every column follows this pattern, but should otherwise be pretty robust.
import pandas as pd

data = {'0_id': '1', '0_name': 'a', '0_address': 'USA', '1_id': '2', '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}
df = pd.DataFrame(data, index=[0])

indexes = set(x.split('_')[0] for x in df.columns)
to_concat = []
for i in indexes:
    target_columns = [col for col in df.columns if col.startswith(i)]
    df_slice = df[target_columns]
    df_slice.columns = [x.split('_')[1] for x in df_slice.columns]
    to_concat.append(df_slice)
new_df = pd.concat(to_concat)
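Two hedged caveats on this sketch: pd.concat keeps each slice's original index, so both rows come out labeled 0 unless you reset it, and col.startswith(i) would conflate prefix '1' with '10' beyond ten records.

new_df = pd.concat(to_concat).reset_index(drop=True)
print(new_df)
# (row order may vary because indexes is a set)
#   id name address hobby
# 0  1    a     USA   NaN
# 1  2    b      UK   ski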

Comparing the elements in two JSON dicts and getting the difference as a ratio or percentage

I have a JSON object and I am working on some data manipulation. I want to get the difference as a ratio so I can more accurately rank the elements in my dict.
[{condition: functional, location:Sydney }, {condition:functional, location: Adelaide}, {condition:broken, location:Sydney}]
I can get the entries whose condition is not functional like so:
filter(lambda x: x['condition']!='functional', json_obj)
But I would like to return this as a percentage ratio.
You can try Counter and defaultdict as below:
from collections import Counter, defaultdict

d = [{'condition': 'functional', 'location': 'Sydney'},
     {'condition': 'functional', 'location': 'Adelaide'},
     {'condition': 'broken', 'location': 'Sydney'}]

cities = [j['location'] for j in d]

# initialize data
data = defaultdict(float)
for city in cities:
    data[city] = 0

# Count occurrences of each city as a Counter dictionary
counters = Counter((i['location'] for i in d))

# Do the calculation
for i in d:
    if i['condition'] == 'functional':
        inc = (counters[i['location']] * 100) / len(d)
        data[i['location']] += float(inc)
    elif i['condition'] == 'broken':
        dec = (counters[i['location']] * 100) / len(d)
        data[i['location']] -= float(dec)
    else:
        raise Exception("Error")

print {k: "{0}%".format(v) for k, v in data.items()}
Output (under Python 2's integer division):
{'Sydney': '0.0%', 'Adelaide': '33.0%'}
It's easy:
a = [{'condition': 'functional', 'location': 'Sydney'},
     {'condition': 'functional', 'location': 'Adelaide'},
     {'condition': 'broken', 'location': 'Sydney'}]
b = filter(lambda x: x['condition'] != 'functional', a)
all_locations = [item['location'] for item in b]
result = {}
for location in all_locations:
    if location not in result.keys():
        result[location] = all_locations.count(location) * 100 / float(len(all_locations))
print result
It will return the percentage for every location.
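For the sample data above this prints {'Sydney': 100.0}: after the filter only the broken Sydney record remains, so Sydney accounts for 100% of the non-functional entries.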
Is this what you want? It compares the elements in two JSON dicts and gets the difference as a ratio, as you ask for in the title. But reading the question body, it's not really clear what it is you want to do.
This assumes that both dictionaries have the same keys.
def dictionary_similarity(d1, d2):
    return sum(d1[key] == d2[key] for key in d1) / float(len(d1))

dictionary_similarity(
    {'condition': 'functional', 'location': 'Sydney'},
    {'condition': 'functional', 'location': 'Adelaide'},)
0.5

Python 3.4 list to do different tasks

I have this list:
list1 = [
    'aapl':'apple',
    'tgt':'target',
    'nke':'nike',
    'mcd':'Mc Donald',
    'googl':'google',
    'yhoo':'yahoo',
    'rl':'Polo Ralph lauren'
]
I wish to execute this
q = 0
while q < len(list1):
    code_to_be_executed
    q = q + 1
But only for the first part (aapl, tgt, nke, mcd, googl, yhoo, etc.), and then have the second part (the company name, such as google, yahoo, Polo Ralph lauren) printed to the user in something like: nke = nike
The problem is it will perform the q task on everything, even the company names, which is not what I want. I know I could separate the abbreviations and the company names into two different lists, but then how could I print it like nke = Nike? Thank you very much
What I believe you are trying to do is a basic printing of a key/value data structure (called a dict in Python):
Example:
data = {
    'aapl': 'apple', 'tgt': 'target', 'nke': 'nike', 'mcd': 'Mc Donald',
    'googl': 'google', 'yhoo': 'yahoo', 'rl': 'Polo Ralph lauren'
}

for k, v in data.items():
    print("{0:s}={1:s}".format(k, v))
Which outputs:
$ python app.py
nke=nike
tgt=target
aapl=apple
mcd=Mc Donald
rl=Polo Ralph lauren
yhoo=yahoo
googl=google
Update: If you still want to do this with a while loop with a q variant then your data structure will have to be a "list of tuples" -- Where each "tuple" is a key/value pair. e.g: [(1, 2), (3, 4)]
Example: (based on your code more or less)
data = [
    ('aapl', 'apple'),
    ('tgt', 'target'),
    ('nke', 'nike'),
    ('mcd', 'Mc Donald'),
    ('googl', 'google'),
    ('yhoo', 'yahoo'),
    ('rl', 'Polo Ralph lauren')
]

q = 0  # counter
while q < len(data):
    k, v = data[q]  # assign and unpack each (key, value) tuple
    print("{0:s}={1:s}".format(k, v))
    q += 1
NB: This data structure is mostly the same, except that you lose the benefit of O(1) lookups. Dictionaries/mappings are more suited to this kind of data, especially if you intend to perform lookups based on keys.
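For example, a quick sketch of what that O(1) lookup buys you (converting the list of tuples back to a dict):

lookup = dict(data)    # list of (key, value) tuples -> dict
print(lookup['nke'])   # nike, found in constant time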

Converting a dictionary with lists for values into a dataframe

I spent a while looking through SO and seems I have a unique problem.
I have a dictionary that looks like the following:
dict = {
    123: [2, 4],
    234: [6, 8],
    ...
}
I want to convert this dictionary that has lists for values into a 3 column data frame like the following:
time, value1, value2
123, 2, 4
234, 6, 8
...
I can run:
pandas.DataFrame(dict)
but this generates the following:
123, 234, ...
2, 6, ...
4, 8, ...
Probably a simple fix but I'm still picking up pandas
You can either preprocess the data as levi suggests, or you can transpose the data frame after creating it.
testdict = {
    123: [2, 4],
    234: [6, 8],
    456: [10, 12]
}
df = pd.DataFrame(testdict)
df = df.transpose()
print(df)
#       0   1
# 123   2   4
# 234   6   8
# 456  10  12
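Equivalently, pandas can build the row-oriented frame in one call with DataFrame.from_dict(orient='index'); a sketch, where the column names are assumptions taken from the question:

df = pd.DataFrame.from_dict(testdict, orient='index')
df.columns = ['value1', 'value2']
df.index.name = 'time'
print(df)
#       value1  value2
# time
# 123        2       4
# 234        6       8
# 456       10      12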
It may be of interest to some that Roger Fan's pandas.DataFrame(dict) method is actually pretty slow if you have a ton of indices. The faster way is to just preprocess the data into separate lists and then create a DataFrame out of these lists.
(Perhaps this was explained in levi's answer, but it is gone now.)
For example, consider this dictionary, dict1, where each value is a list. Specifically, dict1[i] = [ i*10, i*100] (for ease of checking the final dataframe).
import time
import numpy as np
import pandas as pd

keys = range(1000)
values = zip(np.arange(1000)*10, np.arange(1000)*100)
dict1 = dict(zip(keys, values))
It takes roughly 30 times as long with the pandas method. E.g.
t = time.time()
test1 = pd.DataFrame(dict1).transpose()
print time.time() - t
0.118762016296
versus:
t = time.time()
keys = []
list1 = []
list2 = []
for k in dict1:
    keys.append(k)
    list1.append(dict1[k][0])
    list2.append(dict1[k][1])
test2 = pd.DataFrame({'element1': list1, 'element2': list2}, index=keys)
print time.time() - t
0.00310587882996
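The same preprocessing can be written more compactly with zip; a sketch (keys() and values() iterate in matching order over an unmodified dict):

t = time.time()
keys = list(dict1)
list1, list2 = zip(*dict1.values())  # unzip the value pairs into two tuples
test3 = pd.DataFrame({'element1': list1, 'element2': list2}, index=keys)
print time.time() - t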
