I'm a new Python user familiar with R.
I want to calculate user-defined quantiles for groups, complete with the count of observations in each group.
In R I would do:
df_sum <- df %>% group_by(group) %>%
  dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
                   n = n())
In python I can get the grouped percentile by:
df_sum = df.groupby(['group'])['obsval'].quantile(0.85)
How do I add the group count to this?
I have tried:
df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])
Example data:
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df
Expected result:
group percentile count
A 7.4 5
B 6.55 4
You can use pandas.DataFrame.agg() to apply multiple functions in one pass.
For the quantile itself you can use numpy.quantile(); note that its default linear interpolation corresponds to R's type = 7 rather than the type = 8 in your R call, and type 7 is in fact what your expected result (7.4, 6.55) assumes.
import pandas as pd
import numpy as np
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df_sum = df.groupby(['group'])['obsval'].agg([lambda x: np.quantile(x, q=0.85), "count"])
df_sum.columns = ['percentile', 'count']
print(df_sum)
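If you would rather skip the rename step, named aggregation (pandas 0.25+) sets the column names directly; a sketch, assuming a recent pandas, with a comment on matching R's type = 8 exactly:
df_sum = (
    df.groupby('group')['obsval']
      .agg(percentile=lambda x: np.quantile(x, 0.85),  # default: linear interpolation (R type 7)
           count='count')
      .reset_index()
)
# To reproduce R's quantile(..., type = 8), NumPy >= 1.22 accepts
# np.quantile(x, 0.85, method='median_unbiased')
print(df_sum)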
I have two dataframes. Input_df1 is a concatenation of files with a 'filename' column, 'unitid', and many columns of data which I simplify as 'data' for this example. Input_df2 has 'filename' column in addition to 'group'. I'm trying to replicate the data in input_df1 per filename for each instance of filename in input_df2 and add a group column in the output dataframe showing which group it belongs to.
import pandas as pd
input_df1 = pd.DataFrame(data={
'filename' : ['A', 'A', 'B', 'C'],
'unitid' : [ 1, 2, 3, 4 ],
'data' : [11, 12, 13, 14 ]
})
input_df2 = pd.DataFrame(data={
'filename' : ['A', 'B', 'C', 'C', 'A' ],
'group' : ['g1', 'g2', 'g3', 'g4', 'g5']
})
output_df = pd.DataFrame(data={
'filename' : [ 'A', 'A', 'A', 'A', 'B', 'C', 'C', 'A', 'A', 'A', 'A'],
'unitid' : [ 1, 2, 1, 2, 3, 4, 4, 1, 2, 1, 2],
'data' : [ 11, 12, 11, 12, 13, 14, 14, 11, 12, 11, 12],
'group' : ['g1', 'g1', 'g1', 'g1', 'g2', 'g3', 'g4', 'g5', 'g5', 'g5', 'g5']
})
Output_df is what I'm trying to create: the rows of input_df1 replicated per instance of filename in input_df2, with the 'group' value added.
Another question: if I need to filter the rows of each replicated dataframe based on the group type, is it better to do that before joining or after? I was planning on filtering afterwards, since I have a better idea of how to do that, but computing on unneeded rows is inefficient when they could be dropped during the replication. I'm dealing with about 30k rows in input_df1 and 800 rows in input_df2, so potentially 24M rows in total.
Any direction on which functions I should research to achieve this would be much appreciated.
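One function worth researching is pandas.DataFrame.merge(): a many-to-many join on 'filename' pairs each input_df1 row with every input_df2 row sharing that filename, which is the replicate-and-tag operation you describe. A minimal sketch (note that a plain merge yields each matching pair exactly once, so the doubled 'A' rows in the sample output_df above would not appear):
# pair every input_df1 row with every input_df2 row of the same filename
output_df = input_df1.merge(input_df2, on='filename', how='inner')
On the filtering question: dropping unwanted groups from input_df2 before the merge means those rows are never replicated at all, which matters at the 24M-row scale:
# hypothetical filter: keep only the groups of interest, then join
wanted = input_df2[input_df2['group'].isin(['g1', 'g2'])]
output_df = input_df1.merge(wanted, on='filename', how='inner')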
I have a list of dictionaries, all with the same 10 keywords.
I'm looking for a neat way to convert it to 10 1D numpy arrays; efficiency is not important.
My current solution is 20 lines of code along these lines:
names = [x['name'] for x in fields]
names = np.asarray(names)
etc.
You could use a nested list comprehension:
[np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']]
or a dict comprehension:
{attribute:np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']}
As an example:
>>> fields = [{'name': 'A', 'age': 25, 'address' : 'NYC'}, {'name': 'B', 'age': 32, 'address' : 'LA'}]
>>> [np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']]
[array(['A', 'B'],
dtype='|S1'), array([25, 32]), array(['NYC', 'LA'],
dtype='|S3')]
>>> {attribute:np.asarray([x[attribute] for x in fields]) for attribute in ['name', 'age', 'address']}
{'age': array([25, 32]), 'name': array(['A', 'B'],
dtype='|S1'), 'address': array(['NYC', 'LA'],
dtype='|S3')}
To get the attributes in an automatic way, you could use:
>>> fields[0].keys()
['age', 'name', 'address']
Finally, a pandas DataFrame is probably the most suitable type for your data:
>>> pd.DataFrame(fields)
address age name
0 NYC 25 A
1 LA 32 B
It will be plenty fast and should allow you to do any operation you'd like to do on a list of arrays.
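And if you still want the 10 1D arrays afterwards, they can be pulled back out of the frame column by column; a small sketch:
df = pd.DataFrame(fields)
arrays = {col: df[col].values for col in df.columns}  # one 1D numpy array per keyword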
Let A be the input list of dictionaries.
For a 2D array output with each row holding data per keyword -
np.column_stack([i.values() for i in A])
Sample run -
In [217]: A # input list of 2 dictionaries, each with same 3 keywords
Out[217]:
[{'a': array([6, 8, 2]), 'b': array([7, 7, 3]), 'c': array([6, 6, 4])},
{'a': array([4, 4, 3]), 'b': array([7, 1, 6]), 'c': array([6, 1, 5])}]
In [244]: np.column_stack([i.values() for i in A])
Out[244]:
array([[6, 8, 2, 4, 4, 3], # key : a
[6, 6, 4, 6, 1, 5], # key : c
[7, 7, 3, 7, 1, 6]]) # key : b
# Get those keywords per row with `keys()` :
In [263]: A[0].keys()
Out[263]: ['a', 'c', 'b']
One more sample run -
In [245]: fields # sample from @Eric's solution
Out[245]:
[{'address': 'NYC', 'age': 25, 'name': 'A'},
{'address': 'LA', 'age': 32, 'name': 'B'}]
In [246]: np.column_stack([i.values() for i in fields])
Out[246]:
array([['25', '32'],
['A', 'B'],
['NYC', 'LA']],
dtype='|S21')
In [267]: fields[0].keys()
Out[267]: ['age', 'name', 'address']
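Note that the runs above are from Python 2, where .keys() and .values() return lists. Under Python 3 they return views, and np.column_stack needs concrete sequences, so here is a sketch of an equivalent with an explicit (and deterministic) key order:
keys = sorted(A[0])  # fix the row order up front, e.g. ['a', 'b', 'c']
out = np.column_stack([[i[k] for k in keys] for i in A])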
I am trying to sort a dictionary by a list of lists. The items in the list of lists are keys in the dictionary. I asked this before, but the answers didn't solve the issue.
My input list is:
mylist= [
['why', 'was', 'cinderella', 'late', 'for', 'the', 'ball', 'she', 'forgot', 'to', 'swing', 'the', 'bat'],
['why', 'is', 'the', 'little', 'duck', 'always', 'so', 'sad', 'because', 'he', 'always', 'sees', 'a', 'bill', 'in', 'front', 'of', 'his', 'face'],
['what', 'has', 'four', 'legs', 'and', 'goes', 'booo', 'a', 'cow', 'with', 'a', 'cold'],
['what', 'is', 'a', 'caterpillar', 'afraid', 'of', 'a', 'dogerpillar'],
['what', 'did', 'the', 'crop', 'say', 'to', 'the', 'farmer', 'why', 'are', 'you', 'always', 'picking', 'on', 'me']
]
My dictionary looks somewhat like this:
myDict = {'to': [7, 11, 17, 23, 24, 25, 26, 33, 34, 37, 39, 41, 47, 48, 53, 56],
'jam': [20], 'black': [5], 'farmer': [11],
'woodchuck': [54], 'has': [14, 16, 51], 'who': [16]
}
My code is:
def sort_by_increasing_order(mylist, myDict):
#temp = sorted(myDict, key=myDict.get)
temp = sorted(myDict, key=lambda tempKey: for tempKey in mylist, reverse=True )
return temp
Something like:
sort_by_increasing_order(['d', 'e', 'f'], {'d': [0, 1], 'e': [1, 2, 3], 'f': [4]})
result: ['f', 'e', 'd']
So for my sample input it would look like:
sort_by_increasing_order(mylist, myDict)
>> ['to','woodchuck','has','jam','who','farmer']
The commented-out line just sorts by the dictionary keys; my approach for sorting by the list is not correct. The result should be a list in increasing order of the length of the indices, as mentioned above. Any suggestions?
With reference to @doukremt's answer, and assuming you are familiar with decorators:
mydict = {'d': [0, 1], 'e': [1, 2, 3], 'f': [4]}
mylist = [['d', 'e', 'f', 'c'], ['c', 'v', 'd', 'n']]
def convert_to_set(mylist, result_set=None):
if result_set is None:
result_set = []
for item in mylist:
if isinstance(item, str):
result_set.append(item)
if isinstance(item, list):
convert_to_set(item, result_set)
return set(result_set)
def list_to_set(f):
def wrapper(mylist, mydict):
myset = convert_to_set(mylist)
result = f(myset, mydict)
return result
return wrapper
@list_to_set
def findit(mylist, mydict):
gen = ((k, mydict[k]) for k in mylist if k in mydict)
return [k for k, v in sorted(gen, key=lambda p: len(p[1]))]
print(findit(mylist, mydict))
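With the sample mylist and mydict above, this should print ['f', 'd', 'e'], since those keys' value lists have lengths 1, 2 and 3 respectively.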
def findit(mylist, mydict):
gen = ((k, mydict[k]) for k in mylist if k in mydict)
return [k for k, v in sorted(gen, key=lambda p: len(p[1]))]
>>> findit(['d', 'e', 'f'], {'d': [0, 1], 'e': [1, 2, 3], 'f': [4]})
['f', 'd', 'e']
>>> D= {'to': [7, 11, 17, 23, 24, 25, 26, 33, 34, 37, 39, 41, 47, 48, 53, 56],
... 'jam': [20], 'black': [5], 'farmer': [11],
... 'woodchuck': [54], 'has': [14, 16, 51], 'who': [16]
... }
>>>
>>> sorted(D, key=lambda k:len(D[k]), reverse=True)
['to', 'has', 'who', 'jam', 'black', 'farmer', 'woodchuck']
For the values
>>> sorted(D.values(), key=len, reverse=True)
[[7, 11, 17, 23, 24, 25, 26, 33, 34, 37, 39, 41, 47, 48, 53, 56], [14, 16, 51], [16], [20], [5], [11], [54]]
For (keys, values)
>>> sorted(D.items(), key=lambda i:len(i[1]), reverse=True)
[('to', [7, 11, 17, 23, 24, 25, 26, 33, 34, 37, 39, 41, 47, 48, 53, 56]), ('has', [14, 16, 51]), ('who', [16]), ('jam', [20]), ('black', [5]), ('farmer', [11]), ('woodchuck', [54])]
Edit: It's still not really clear what you are asking for. Your example doesn't seem to care about the length at all; otherwise "has" should come before "woodchuck". Changing len to max may be what you want:
>>> D = {'to': [7, 11, 17, 23, 24, 25, 26, 33, 34, 37, 39, 41, 47, 48, 53, 56],
... 'jam': [20], 'black': [5], 'farmer': [11],
... 'woodchuck': [54], 'has': [14, 16, 51], 'who': [16]
... }
>>>
>>> sorted(D, key=lambda k:max(D[k]), reverse=True)
['to', 'woodchuck', 'has', 'jam', 'who', 'farmer', 'black']
>>> sorted(D.values(), key=max, reverse=True)
[[7, 11, 17, 23, 24, 25, 26, 33, 34, 37, 39, 41, 47, 48, 53, 56], [54], [14, 16, 51], [20], [16], [11], [5]]
>>> sorted(D.items(), key=lambda i:max(i[1]), reverse=True)
[('to', [7, 11, 17, 23, 24, 25, 26, 33, 34, 37, 39, 41, 47, 48, 53, 56]), ('woodchuck', [54]), ('has', [14, 16, 51]), ('jam', [20]), ('who', [16]), ('farmer', [11]), ('black', [5])]