I have a dataframe like the one below:
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1
First, I would like to bin by age:
age
[0~4]
age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
Then sum up and count days, grouping by type:
sum count
a 6 2
b 9 3
c 0 0
d 0 0
e 0 0
f 0 0
Then I would like to apply this method to the other bins, [5~9] and [10~14].
My desired result is below
[0~4] [5~9] [10~14]
sum count sum count sum count
a 6 2 0 0 1 1
b 9 3 0 0 0 0
c 0 0 1 1 0 0
d 0 0 4 1 0 0
e 0 0 0 0 2 1
f 0 0 0 1 0 0
How can this be done?
It is very complicated for me.
Consider a pivot_table with pd.cut. This is simplest if you do not care too much about column ordering, since count and sum are not paired together under each bin; with some manipulation you can change that ordering.
df['bin'] = pd.cut(df.age, [0,4,9,14])
pvtdf = df.pivot_table(index='type', columns='bin', values='days',
                       aggfunc=['count', 'sum']).fillna(0)
# count sum
# bin (0, 4] (4, 9] (9, 14] (0, 4] (4, 9] (9, 14]
# type
# a 2.0 0.0 1.0 6.0 0.0 1.0
# b 3.0 0.0 0.0 9.0 0.0 0.0
# c 0.0 1.0 0.0 0.0 1.0 0.0
# d 0.0 1.0 0.0 0.0 4.0 0.0
# e 0.0 0.0 1.0 0.0 0.0 2.0
# f 0.0 1.0 0.0 0.0 0.0 0.0
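The manipulation mentioned above to pair count and sum under each bin can be sketched as swapping the column levels and sorting (the sample frame is rebuilt here so the snippet is self-contained):

```python
import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame({'age': [1, 2, 2, 3, 4, 6, 7, 7, 10, 14],
                   'type': list('abbabcfdea'),
                   'days': [1, 3, 4, 5, 2, 1, 0, 4, 2, 1]})
df['bin'] = pd.cut(df.age, [0, 4, 9, 14])
pvtdf = df.pivot_table(index='type', columns='bin', values='days',
                       aggfunc=['count', 'sum']).fillna(0)

# swap so the bin becomes the outer column level, then sort so that
# count and sum sit next to each other under every bin
paired = pvtdf.swaplevel(axis=1).sort_index(axis=1)
print(paired)
```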
We'll use some stacking and groupby operations to get us to the desired output.
import io
import pandas as pd
import numpy as np

string_ = io.StringIO('''age type days
1 a 1
2 b 3
2 b 4
3 a 5
4 b 2
6 c 1
7 f 0
7 d 4
10 e 2
14 a 1''')
df = pd.read_csv(string_, sep=r'\s+')
df['age_bins'] = pd.cut(df['age'], [0,4,9,14])
df_stacked = (df.groupby(['age_bins', 'type'], observed=False)
                .agg(sum=('days', 'sum'), count=('days', 'count'))
                .transpose().stack().fillna(0))
>>> df_stacked
age_bins (0, 4] (4, 9] (9, 14]
type
sum a 6.0 0.0 1.0
b 9.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 4.0 0.0
e 0.0 0.0 2.0
f 0.0 0.0 0.0
count a 2.0 0.0 1.0
b 3.0 0.0 0.0
c 0.0 1.0 0.0
d 0.0 1.0 0.0
e 0.0 0.0 1.0
f 0.0 1.0 0.0
This doesn't produce the exact output you listed, but it's similar, and I think it will be easier to index and retrieve data from. Alternatively, you could use the following to get something closer to the desired output.
>>> df_stacked.unstack(level=0)
age_bins (0, 4] (4, 9] (9, 14]
count sum count sum count sum
type
a 2.0 6.0 0.0 0.0 1.0 1.0
b 3.0 9.0 0.0 0.0 0.0 0.0
c 0.0 0.0 1.0 1.0 0.0 0.0
d 0.0 0.0 1.0 4.0 0.0 0.0
e 0.0 0.0 0.0 0.0 1.0 2.0
f 0.0 0.0 1.0 0.0 0.0 0.0
How do I print the number of rows dropped while executing the following code in Python:
df.dropna(inplace = True)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
len(new_data)
Use:
np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0,np.nan, 1], size=(10, 3)))
print (df)
0 1 2
0 NaN 0.0 NaN
1 0.0 NaN NaN
2 0.0 0.0 1.0
3 0.0 0.0 NaN
4 NaN NaN 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
7 NaN 0.0 1.0
8 1.0 1.0 NaN
9 1.0 0.0 NaN
You can count the rows containing missing values beforehand with DataFrame.isna, DataFrame.any and sum:
count = df.isna().any(axis=1).sum()
df.dropna(inplace = True)
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
Or take the difference of the DataFrame's length before and after dropna:
orig = df.shape[0]
df.dropna(inplace = True)
count = orig - df.shape[0]
print (df)
0 1 2
2 0.0 0.0 1.0
5 1.0 0.0 0.0
6 1.0 0.0 1.0
print (count)
7
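A variant of the size-difference idea that avoids inplace is to compare lengths directly (a sketch on the same seeded frame):

```python
import numpy as np
import pandas as pd

np.random.seed(2022)
df = pd.DataFrame(np.random.choice([0, np.nan, 1], size=(10, 3)))

# rows before dropna minus rows after = number of rows dropped
dropped = len(df) - len(df.dropna())
print(dropped)
```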
Having this pandas.core.frame.DataFrame:
Gorilla A T C C A G C T
Dog G G G C A A C T
Humano A T G G A T C T
Drosophila A A G C A A C C
Elefante T T G G A A C T
Mono A T G C C A T T
Unicornio A T G G C A C T
I would like to get a dataframe like this:
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Basically, what I want is to count the frequencies column by column and create the second dataframe as shown. I want this because, in the end, I would like to build a consensus string; it should be something like A T G C A A C T. Could anyone help me or give me some advice?
Try:
result = df.apply(pd.Series.value_counts).fillna(0)
col1 col2 col3 col4 col5 col6 col7 col8
A 5.0 1.0 0.0 0.0 5.0 5.0 0.0 0.0
C 0.0 0.0 1.0 4.0 2.0 0.0 6.0 1.0
G 1.0 1.0 6.0 3.0 0.0 1.0 0.0 0.0
T 1.0 5.0 0.0 0.0 0.0 1.0 1.0 6.0
You could use Series.value_counts column by column (the iloc[:, 1:] here assumes the animal names sit in the first column rather than in the index):
print(df.iloc[:, 1:].apply(pd.Series.value_counts).fillna(0))
Output
1 2 3 4 5 6 7 8
A 5.0 1.0 0.0 0.0 5.0 5.0 0.0 0.0
C 0.0 0.0 1.0 4.0 2.0 0.0 6.0 1.0
G 1.0 1.0 6.0 3.0 0.0 1.0 0.0 0.0
T 1.0 5.0 0.0 0.0 0.0 1.0 1.0 6.0
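For the consensus string the question ultimately wants, one option is to take the most frequent base per column of the frequency table with idxmax (a sketch, rebuilding the data with the animal names as the index):

```python
import pandas as pd

df = pd.DataFrame([list('ATCCAGCT'),
                   list('GGGCAACT'),
                   list('ATGGATCT'),
                   list('AAGCAACC'),
                   list('TTGGAACT'),
                   list('ATGCCATT'),
                   list('ATGGCACT')],
                  index=['Gorilla', 'Dog', 'Humano', 'Drosophila',
                         'Elefante', 'Mono', 'Unicornio'])

counts = df.apply(pd.Series.value_counts).fillna(0)
# idxmax picks the row label (base) with the highest count in each column
consensus = ''.join(counts.idxmax())
print(consensus)  # A T G C A A C T
```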
I have the following dataframe
0 0 0
1 0 0
1 1 0
1 1 1
1 1 1
0 0 0
0 1 0
0 1 0
0 0 0
How do you get a dataframe that looks like this:
0 0 0
4 0 0
4 3 0
4 3 2
4 3 2
0 0 0
0 2 0
0 2 0
0 0 0
Thank you for your help.
You may need a for loop here: use cumsum on the zero positions to build a run key, count each run with transform, and assign the counts back to the original df.
for x in df.columns:
    run_id = df[x].eq(0).cumsum()
    df.loc[df[x] != 0, x] = df[x].groupby(run_id[df[x] != 0]).transform('count')
df
Out[229]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
Or, without a for loop:
s=df.stack().sort_index(level=1)
s2=s.groupby([s.index.get_level_values(1),s.eq(0).cumsum()]).transform('count').sub(1).unstack()
df=df.mask(df!=0).combine_first(s2)
df
Out[255]:
1 2 3
0 0.0 0.0 0.0
1 4.0 0.0 0.0
2 4.0 3.0 0.0
3 4.0 3.0 2.0
4 4.0 3.0 2.0
5 0.0 0.0 0.0
6 0.0 2.0 0.0
7 0.0 2.0 0.0
8 0.0 0.0 0.0
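A self-contained version of the loop approach on the sample data (column names 0, 1, 2 are assumed here):

```python
import pandas as pd

df = pd.DataFrame({0: [0, 1, 1, 1, 1, 0, 0, 0, 0],
                   1: [0, 0, 1, 1, 1, 0, 1, 1, 0],
                   2: [0, 0, 0, 1, 1, 0, 0, 0, 0]})

for x in df.columns:
    # each zero increments the key, so every consecutive non-zero run
    # shares one key; transform('count') gives the run length
    run_id = df[x].eq(0).cumsum()
    nonzero = df[x] != 0
    df.loc[nonzero, x] = df[x][nonzero].groupby(run_id[nonzero]).transform('count')

print(df)
```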
I have two dataframes:
dayData
power_comparison final_average_delta_power calculated_power
1 0.0 0.0 0
2 0.0 0.0 0
3 0.0 0.0 0
4 0.0 0.0 0
5 0.0 0.0 0
7 0.0 0.0 0
and
historicPower
power
0 0.0
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
I'm trying to reindex the historicPower dataframe to have the same shape as the dayData dataframe (so in this example it would look like):
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
The dataframes in reality will be a lot larger, with different shapes.
I think you can use reindex if the index has no duplicates:
historicPower = historicPower.reindex(dayData.index)
print (historicPower)
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
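One thing to watch: if dayData ever contains an index label that historicPower lacks, reindex introduces NaN there; fill_value supplies a default instead. A sketch, with a hypothetical extra label 8 added to illustrate:

```python
import pandas as pd

historicPower = pd.DataFrame({'power': [0.0, 0.0, 0.0, -1.0, 0.0, 1.0, 0.0]},
                             index=[0, 1, 2, 3, 4, 5, 7])
day_index = [1, 2, 3, 4, 5, 7, 8]  # 8 is a hypothetical label missing above

# labels present in historicPower keep their values; label 8 gets 0, not NaN
aligned = historicPower.reindex(day_index, fill_value=0)
print(aligned)
```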
I have a dataframe like :
user_id category view collect
1 1 a 2 3
2 1 b 5 9
3 2 a 8 6
4 3 a 7 3
5 3 b 4 2
6 3 c 3 0
7 4 e 1 4
How do I change it into a new dataframe where each user_id appears once, the view and collect values for each category become columns, and missing data is filled with 0? Like this:
user_id a_view a_collect b_view b_collect c_view c_collect d_view d_collect e_view e_collect
1 2 3 5 9 0 0 0 0 0 0
2 8 6 0 0 0 0 0 0 0 0
3 7 3 4 2 3 0 0 0 0 0
4 0 0 0 0 0 0 0 0 1 4
The desired result can be obtained by pivoting df, with values from user_id becoming the index and values from category becoming a column level:
import numpy as np
import pandas as pd
df = pd.DataFrame({'category': ['a', 'b', 'a', 'a', 'b', 'c', 'e'],
'collect': [3, 9, 6, 3, 2, 0, 4],
'user_id': [1, 1, 2, 3, 3, 3, 4],
'view': [2, 5, 8, 7, 4, 3, 1]})
result = (df.pivot(index='user_id', columns='category')
            .swaplevel(axis=1)
            .sort_index(axis=1, level=0)
            .reindex(['view', 'collect'], axis=1, level=1)
            .fillna(0))
yields
category a b c e
view collect view collect view collect view collect
user_id
1 2.0 3.0 5.0 9.0 0.0 0.0 0.0 0.0
2 8.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0
3 7.0 3.0 4.0 2.0 3.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 1.0 4.0
Above, result has a MultiIndex. In general I think this should be preferred over a flattened single index, since it retains more of the structure of the data.
However, the MultiIndex can be flattened into a single index:
result.columns = ['{}_{}'.format(cat,col) for cat, col in result.columns]
print(result)
yields
a_view a_collect b_view b_collect c_view c_collect e_view \
user_id
1 2.0 3.0 5.0 9.0 0.0 0.0 0.0
2 8.0 6.0 0.0 0.0 0.0 0.0 0.0
3 7.0 3.0 4.0 2.0 3.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 1.0
e_collect
user_id
1 0.0
2 0.0
3 0.0
4 4.0
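One caveat: df.pivot raises an error when a (user_id, category) pair occurs more than once. pivot_table with an explicit aggfunc handles duplicates and fills the gaps in one step (a sketch on the same data):

```python
import pandas as pd

df = pd.DataFrame({'category': ['a', 'b', 'a', 'a', 'b', 'c', 'e'],
                   'collect': [3, 9, 6, 3, 2, 0, 4],
                   'user_id': [1, 1, 2, 3, 3, 3, 4],
                   'view': [2, 5, 8, 7, 4, 3, 1]})

# aggfunc='sum' collapses any duplicate (user_id, category) pairs;
# fill_value=0 replaces the NaN cells directly
result = (df.pivot_table(index='user_id', columns='category',
                         values=['view', 'collect'],
                         aggfunc='sum', fill_value=0)
            .swaplevel(axis=1).sort_index(axis=1))
print(result)
```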