I know how to get the most frequent value of each column in a dataframe using "mode". For example:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
A
0 2
But I am unable to find the "n" most frequent values of each column of a dataframe. For example, for the dataframe above, I would like the following output for n=2:
A
0 2
1 1
Any pointers?
One way is to use pd.Series.value_counts and extract the index:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})
# A
# 0 2
# 1 1
Use value_counts and select the index values by indexing, but this works on each column separately, so you need apply or a dict comprehension with the DataFrame constructor. Casting to Series is necessary for a more general solution, in case some indices do not exist, e.g.:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})
N = 2
df = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
df = pd.DataFrame({x:pd.Series( df[x].value_counts().index[:N]) for x in df.columns})
print (df)
A B C
0 2 1.0 d
1 1 NaN e
For a more general solution, select only the numeric columns first with select_dtypes:
import numpy as np
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})
N = 2
df = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
cols = df.select_dtypes([np.number]).columns
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})
print (df)
   A    B
0  2  1.0
1  1  NaN
I am stuck with an issue on a massive pandas table. I would like to get a boolean flag showing where 2 series cross.
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': [10, 1, 2, 8]})
I would like to add one column to my dataframe to get a result like this one:
df = pd.DataFrame({'A': [1, 2, 3, 4],
'B': [10, 1, 2, 8],
'C': [0, -1, 0, 1]
})
So basically I want to get:
0 when series B does not cross series A
-1 when series B crosses below series A
1 when series B crosses above series A
I need a vectorized calculation because my real table has more than one million rows.
Thank you
You can compute the relative position of the 2 columns with lt, then convert to integer and compute the diff:
m = df['A'].lt(df['B'])
df['C'] = m.astype(int).diff().fillna(0, downcast='infer')
output:
A B C
0 1 10 0
1 2 1 -1
2 3 2 0
3 4 8 1
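To see why this works, here are the intermediate values for the sample data (a quick sketch using the variables above; the first diff value is NaN, which fillna turns into 0):
m = df['A'].lt(df['B'])  # [True, False, False, True]  -> is A below B?
m.astype(int)            # [1, 0, 0, 1]
m.astype(int).diff()     # [NaN, -1.0, 0.0, 1.0] -> sign changes mark the crossings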
I have a dataframe where I want to group rows based on a column. Some of the columns in the rows I want to sum up and the others I want to aggregate as a list.
#creating sample data
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['id'] = [1,2,1,4]
df['group'] = [[0,1,2,3] , [0,2,3,4], [1,1,1,1], 1]
df
Out[5]:
a b c d id group
0 0.850058 0.160497 0.742296 0.354296 1 [0, 1, 2, 3]
1 0.598759 0.399200 0.799157 0.908174 2 [0, 2, 3, 4]
2 0.160764 0.671702 0.414800 0.429992 1 [1, 1, 1, 1]
3 0.011089 0.581518 0.718829 0.610140 4 1
Here I want to combine row 0 and row 2 as they have the same id. When doing this, I want to sum up the values in columns a, b, c and d but for column group, I want the lists to be appended. How can I do this?
My expected output is:
a b c d id group
0 1.155671 1.670582 0.392744 0.681494 1 [0, 1, 2, 3, 1, 1, 1, 1]
1 0.598759 0.399200 0.799157 0.908174 2 [0, 2, 3, 4]
2 0.011089 0.581518 0.718829 0.610140 4 1
(When I use only the sum or df.groupby(['id'])['group'].apply(list), the other columns are dropped. )
Use groupby.aggregate
df.groupby('id').agg({k: sum for k in ['a', 'b', 'c', 'd', 'group']})
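Here the builtin sum is used as the aggregation for every listed column: for the float columns it adds the values, and for group it concatenates the lists (as the output below shows). The dict comprehension just avoids writing the mapping out by hand; it is equivalent to:
df.groupby('id').agg({'a': sum, 'b': sum, 'c': sum, 'd': sum, 'group': sum})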
A one-liner alternative would be to use the numeric_only flag. But be careful with the columns you are feeding in.
df.groupby('id').sum(numeric_only=False)
Output
a b c d group
id
1 1.488778 0.802794 0.949768 0.952676 [0, 1, 2, 3, 1, 1, 1, 1]
2 0.488390 0.512301 0.064922 0.233875 [0, 2, 3, 4]
4 0.649945 0.267125 0.229313 0.156696 1
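To illustrate that caveat (a small sketch with a hypothetical string column, not part of the original data): with numeric_only=False, any object column that supports + is also "summed", so strings are silently concatenated:
tmp = pd.DataFrame({'id': [1, 1, 2],
                    'name': ['x', 'y', 'z'],
                    'val': [1.0, 2.0, 3.0]})
print(tmp.groupby('id').sum(numeric_only=False))
#    name  val
# id
# 1    xy  3.0
# 2     z  3.0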
First Solution:
We can accomplish the task in 2 steps: the 1st step uses GroupBy.sum to get the grouped sum of the first 4 columns; the 2nd step acts on the group column only and concatenates the lists, also with GroupBy.sum.
df.groupby('id').sum().join(df.groupby('id')['group'].sum()).reset_index()
Input (Different values owing to the different random numbers generated)
a b c d id group
0 0.758148 0.781987 0.310849 0.600912 1 [0, 1, 2, 3]
1 0.694848 0.755622 0.947359 0.708422 2 [0, 2, 3, 4]
2 0.515446 0.454484 0.169883 0.697287 1 [1, 1, 1, 1]
3 0.361939 0.325718 0.143510 0.077142 4 1
Output:
id a b c d group
0 1 1.273594 1.236471 0.480732 1.298199 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.694848 0.755622 0.947359 0.708422 [0, 2, 3, 4]
2 4 0.361939 0.325718 0.143510 0.077142 1
Second Solution
We can also use GroupBy.agg with named aggregation, as follows:
df.groupby('id', as_index=False).agg(a=('a', 'sum'), b=('b', 'sum'), c=('c', 'sum'), d=('d', 'sum'), group=('group', 'sum'))
Result:
id a b c d group
0 1 1.273594 1.236471 0.480732 1.298199 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.694848 0.755622 0.947359 0.708422 [0, 2, 3, 4]
2 4 0.361939 0.325718 0.143510 0.077142 1
Does this work:
pd.merge(df.groupby('id', as_index = False).sum(), df.groupby('id')['group'].apply(sum).reset_index(), on = 'id')
id a b c d group
0 1 1.241602 0.839409 0.779673 0.639509 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.967984 0.838906 0.313017 0.498611 [0, 2, 3, 4]
2 4 0.042871 0.367209 0.676656 0.178939 1
Is it possible to sort columns in a data frame?
df = pd.DataFrame ({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2]})
# before
col_cc_7, col_aa_7, col_bb_7
0, 1, 2
0, 1, 2
0, 1, 2
# sort
custom_sort_key = ["aa", "bb", "cc"]
# ... sort codes ...
# after
col_aa_7, col_bb_7, col_cc_7
1, 2, 0
1, 2, 0
1, 2, 0
For me, your question is a little confusing.
If you only want to sort your column values, a simple Google search would do the trick; if not, I could not understand the question.
df = df.sort_values(by=['col', 'col2', 'col3'], ascending=[True, True, False])
The by= argument sets which columns to sort by (in order), and ascending is self-explanatory.
We can split out the middle token of each column name and create a dictionary of your columns, then apply a sort before we assign this back. I've added some extra columns not in your sort to show what will happen to them.
df = pd.DataFrame ({
"col_cc_7": [0, 0, 0],
"col_aa_7": [1, 1, 1],
"col_bb_7": [2, 2, 2],
"col_ee_7": [2, 2, 2],
"col_dd_7": [2, 2, 2]})
custom_sort_key = ["bb", "cc", "aa"]
col_dict = dict(zip(df.columns,[x.split('_')[1] for x in df.columns.tolist()]))
#{'col_cc_7': 'cc',
# 'col_aa_7': 'aa',
# 'col_bb_7': 'bb',
# 'col_ee_7': 'ee',
# 'col_dd_7': 'dd'}
d = {v:k for k,v in enumerate(custom_sort_key)}
# this will only work on python 3.6 +
new_cols = dict(sorted(col_dict.items(), key=lambda x: d.get(x[1], float('inf'))))
df[new_cols.keys()]
col_bb_7 col_cc_7 col_aa_7 col_ee_7 col_dd_7
0 2 0 1 2 2
1 2 0 1 2 2
2 2 0 1 2 2
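The same idea can also be written more compactly by sorting the column labels directly (a sketch under the same assumption that the middle token of each name is the sort key):
order = {k: i for i, k in enumerate(custom_sort_key)}
df = df[sorted(df.columns, key=lambda c: order.get(c.split('_')[1], float('inf')))]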
I do as below:
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
frames = [data1, data2]
data = pd.concat(frames)
data
a b
0 2 1
1 2 1
2 2 1
0 2 1
1 2 1
2 2 1
The data column order is alphabetical. Why is that, and how do I keep the original order?
You are creating DataFrames out of dictionaries. Before Python 3.7, dictionaries were unordered, which means the keys do not have a guaranteed order. So
d1 = {'key_a': 'val_a', 'key_b': 'val_b'}
and
d2 = {'key_b': 'val_b', 'key_a': 'val_a'}
are (probably) the same.
In addition to that, I assume that pandas sorts the dictionary's keys alphabetically by default (unfortunately I did not find any hint in the docs to prove that assumption), leading to the behavior you encountered.
So the basic motivation would be to resort / reorder the columns in your DataFrame. You can do this as follows:
import pandas as pd
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
frames = [data1, data2]
data = pd.concat(frames)
print(data)
cols = ['b' , 'a']
data = data[cols]
print(data)
Starting from version 0.23.0, you can prevent the concat() method from sorting the returned DataFrame. For example:
df1 = pd.DataFrame({ 'a' : [1, 1, 1], 'b' : [2, 2, 2]})
df2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
df = pd.concat([df1, df2], sort=False)
A future version of pandas will change to not sort by default.
def concat_ordered_columns(frames):
columns_ordered = []
for frame in frames:
columns_ordered.extend(x for x in frame.columns if x not in columns_ordered)
final_df = pd.concat(frames)
return final_df[columns_ordered]
# Usage
dfs = [df_a,df_b,df_c]
full_df = concat_ordered_columns(dfs)
This should work.
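For example, with the data1 and data2 frames from the question (assuming Python 3.7+, where dict insertion order is preserved, so each frame starts with columns b, a):
full_df = concat_ordered_columns([data1, data2])
print(full_df.columns.tolist())
# ['b', 'a']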
You can create the original DataFrames with OrderedDicts
from collections import OrderedDict
odict = OrderedDict()
odict['b'] = [1, 1, 1]
odict['a'] = [2, 2, 2]
data1 = pd.DataFrame(odict)
data2 = pd.DataFrame(odict)
frames = [data1, data2]
data = pd.concat(frames)
data
b a
0 1 2
1 1 2
2 1 2
0 1 2
1 1 2
2 1 2
You can also specify the order like this:
import pandas as pd
data1 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
data2 = pd.DataFrame({ 'b' : [1, 1, 1], 'a' : [2, 2, 2]})
listdf = [data1, data2]
data = pd.concat(listdf)
sequence = ['b','a']
data = data.reindex(columns=sequence)
The simplest way is to first put the columns in the same order, then concat:
df2=df2[df1.columns]
df=pd.concat((df1,df2),axis=0)
What is the best way to get a random sample of the elements of a groupby? As I understand it, a groupby is just an iterable over groups.
The standard way I would do this for an iterable, if I wanted to select N = 200 elements is:
rand = random.sample(data, N)
If you attempt the above where data is a groupby object, the elements of the resulting list are tuples (iterating over a groupby yields (key, group) pairs).
I found the example below for randomly selecting the elements of a single-key groupby; however, this does not work with a multi-key groupby. From How to access pandas groupby dataframe by key:
# create groupby object
grouped = df.groupby('some_key')
# pick N dataframes and grab their indices
sampled_df_i = random.sample(grouped.indices, N)
# grab the groups using the groupby object 'get_group' method
df_list = map(lambda df_i: grouped.get_group(df_i), sampled_df_i)
# optionally - turn it all back into a single dataframe object
sampled_df = pd.concat(df_list, axis=0, join='outer')
You can take a random sample of the unique values of df.some_key.unique(), use that to slice the df, and finally groupby on the result:
import random
df = pd.DataFrame({'some_key': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                   'val': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]})
print(df[df.some_key.isin(random.sample(list(df.some_key.unique()), 2))].groupby('some_key').mean())
               val
some_key
0         1.000000
2         3.666667
If there is more than one groupby key:
df = pd.DataFrame({'some_key1': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                   'some_key2': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'val': [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]})
gby = df.groupby(['some_key1', 'some_key2'])
print(gby.mean().loc[random.sample(list(gby.indices.keys()), 2)])
                     val
some_key1 some_key2
1         1            5
3         2            8
But if you are just going to get the values of each group, you don't even need to groupby; a MultiIndex will do:
idx = random.sample(list(set(pd.MultiIndex.from_product((df.some_key1, df.some_key2)).tolist())), 2)
print(df.set_index(['some_key1', 'some_key2']).loc[idx])
                     val
some_key1 some_key2
2         0            3
3         1            5
I feel like lower-level numpy operations are cleaner:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"some_key": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8],
}
)
ids = df["some_key"].unique()
ids = np.random.choice(ids, size=2, replace=False)
ids
# > array([3, 2])
df.loc[df["some_key"].isin(ids)]
# > some_key val
# 2 2 3
# 3 3 4
# 6 2 1
# 7 3 5
# 10 2 7
# 11 3 8
Although this question was asked and answered long ago, I think the following is cleaner:
import pandas as pd
df = pd.DataFrame(
{
"some_key1": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
"some_key2": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
"val": [1, 2, 3, 4, 1, 5, 1, 5, 1, 6, 7, 8]
}
)
# Set the number of samples by group
n_samples_by_group = 1
samples_by_group = df \
.groupby(by=["some_key1", "some_key2"]) \
.sample(n_samples_by_group)
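As a usage note, DataFrameGroupBy.sample (added in pandas 1.1) returns n_samples_by_group random rows from every (some_key1, some_key2) group, which is slightly different from sampling whole groups as in the earlier answers:
print(samples_by_group)  # one randomly chosen row per (some_key1, some_key2) group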