Say I have a dataset like this:
is_a is_b is_c population infected
1 0 1 50 20
1 1 0 100 10
0 1 1 20 10
...
How do I reshape it to look like this?
feature 0 1
a 10/20 30/150
b 20/50 20/120
c 10/100 30/70
...
In the original dataset, I have features a, b, and c as their own separate columns. In the transformed dataset, these same variables are listed under column feature, and two new columns 0 and 1 are produced, corresponding to the values that these features can take on.
In the original dataset, where is_a is 0, sum the infected values and divide by the sum of the population values. Where is_a is 1, do the same: sum the infected values and divide by the sum of the population values. Rinse and repeat for is_b and is_c. The new dataset holds these fractions (or decimals), as shown. Thank you!
I've tried pd.pivot_table and pd.melt but nothing comes close to what I need.
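For reference, the sample data above can be built like this (a minimal sketch, using the column names from the question):
import pandas as pd

# the three example rows from the question
df = pd.DataFrame({'is_a': [1, 1, 0],
                   'is_b': [0, 1, 1],
                   'is_c': [1, 0, 1],
                   'population': [50, 100, 20],
                   'infected': [20, 10, 10]})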
After doing the wide_to_long, your question becomes clearer:
df = pd.wide_to_long(df, ['is'], ['population','infected'], j='feature', sep='_', suffix=r'\w+').reset_index()
df
population infected feature is
0 50 20 a 1
1 50 20 b 0
2 50 20 c 1
3 100 10 a 1
4 100 10 b 1
5 100 10 c 0
6 20 10 a 0
7 20 10 b 1
8 20 10 c 1
df.groupby(['feature','is']).apply(lambda x : sum(x['infected'])/sum(x['population'])).unstack()
is 0 1
feature
a 0.5 0.200000
b 0.4 0.166667
c 0.1 0.428571
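If you want the literal infected/population strings from the question rather than decimals, one possible follow-up (a sketch that reuses the reshaped df from above) is to sum both columns per group and join them as text:
# sum infected and population per (feature, is) pair, then format as "infected/population"
agg = df.groupby(['feature', 'is'])[['infected', 'population']].sum()
frac = (agg['infected'].astype(str) + '/' + agg['population'].astype(str)).unstack()
print(frac)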
I tried this on your small dataframe, but I am not sure it will work on a larger dataset.
dic_df = {}
for letter in ['a', 'b', 'c']:
    dic_da = {}
    dic_da[0] = df[df['is_' + letter] == 0].infected.sum() / df[df['is_' + letter] == 0].population.sum()
    dic_da[1] = df[df['is_' + letter] == 1].infected.sum() / df[df['is_' + letter] == 1].population.sum()
    dic_df[letter] = dic_da
dic_df
dic_df_ = pd.DataFrame(data = dic_df).T.reset_index().rename(columns= {'index':'feature'})
feature 0 1
0 a 0.5 0.200000
1 b 0.4 0.166667
2 c 0.1 0.428571
Here, DF would be your original DataFrame
Aux_NewDF = [{'feature': feature,
              0: '{}/{}'.format(DF['infected'][DF['is_{}'.format(feature.lower())]==0].sum(), DF['population'][DF['is_{}'.format(feature.lower())]==0].sum()),
              1: '{}/{}'.format(DF['infected'][DF['is_{}'.format(feature.lower())]==1].sum(), DF['population'][DF['is_{}'.format(feature.lower())]==1].sum())} for feature in ['a','b','c']]
NewDF = pd.DataFrame(Aux_NewDF)
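Assuming DF holds the three example rows from the question, NewDF should come out as:
  feature       0       1
0       a   10/20  30/150
1       b   20/50  20/120
2       c  10/100   30/70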
Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
.value_counts()
.rename_axis(['A','most_freq'])
.reset_index(name='freq')
.drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
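Note that drop_duplicates keeps the original positional index (0, 2, 4 above); if you want the clean 0, 1, 2 index from the desired output, one option is a final reset:
# optional: restore a 0..n-1 index to match the desired output exactly
df = df.reset_index(drop=True)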
How do I count the number of multicolumn (thing, cond=1) event occurrences prior to every (thing, cond=any) event?
(These could be winning games of poker by player, episodes of depression by patient, or so on.) For example, row index == 3, below, contains the pair (thing, cond) = (c,2), and shows the number of prior (c,1) occurrences, which is correctly (but manually) shown in the priors column as 0. I'm interested in producing a synthetic column with the count of prior (thing, 1) events for every (thing, event) pair in my data. My data are monotonically increasing in time. The natural index in the silly DataFrame can be taken as logical ticks, if it helps. (<Narrator>: It really doesn't.)
For convenience, below is the code for my test DataFrame and the manually generated priors column, which I cannot get pandas to usefully generate, no matter which combinations of groupby, cumsum, shift, where, etc. I try. I have googled and racked my brain for days. No SO answers seem to fit the bill. The key to reading the priors column is that its entries say things like, "Before this (a,1) or (a,2) event, there have been 2 (a,1) events."
[In]:
import pandas as pd
silly = pd.DataFrame({'thing': ['a','b','a','c','b','c','c','a','a','b','c','a'], "cond": [1,2,1,2,1,2,1,2,1,2,1,2]})
silly['priors'] = pd.Series([0,0,1,0,0,0,0,2,2,1,1,3])
silly
[Out]:
silly
thing cond priors
0 a 1 0
1 b 2 0
2 a 1 1
3 c 2 0
4 b 1 0
5 c 2 0
6 c 1 0
7 a 2 2
8 a 1 2
9 b 2 1
10 c 1 1
11 a 2 3
The closest I've come is:
[In]:
silly['priors_inc'] = silly[['thing', 'cond']].where(silly['cond'] == 1).groupby('thing').cumsum() - 1
[Out]:
silly
thing cond priors priors_inc
0 a 1 0 0.0
1 b 2 0 NaN
2 a 1 1 1.0
3 c 2 0 NaN
4 b 1 0 0.0
5 c 2 0 NaN
6 c 1 0 0.0
7 a 2 2 NaN
8 a 1 2 2.0
9 b 2 1 NaN
10 c 1 1 1.0
11 a 2 3 NaN
Note that the values that are present in the incomplete priors column are correct, but not all of the desired data are there.
Please, if at all possible, withhold any "Pythonic" answers. While my real data are small compared to most ML problems, I want to learn pandas the right way, not the toy data way with Python loops or itertools chicanery that I've seen too much of already. Thank you in advance! (And I apologize for the wall of text!)
You need to
Cumulatively count where each "cond" is 1
Do this for each "thing"
Make sure the counts are shifted by 1.
You can do this using groupby, cumsum and shift:
(df.cond.eq(1)
.groupby(df.thing)
.apply(lambda x: x.cumsum().shift())
.fillna(0, downcast='infer'))
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
Another option, to avoid the apply, is to chain two groupby calls: one performs the cumsum, the other does the shifting.
(df.cond.eq(1)
.groupby(df.thing)
.cumsum()
.groupby(df.thing)
.shift()
.fillna(0, downcast='infer'))
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 2
8 2
9 1
10 1
11 3
Name: cond, dtype: int64
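Either version can be assigned straight back to the frame and checked against the hand-built column (a quick sketch, assuming the DataFrame is named silly as in the question):
priors_calc = (silly.cond.eq(1)
                    .groupby(silly.thing)
                    .cumsum()
                    .groupby(silly.thing)
                    .shift()
                    .fillna(0, downcast='infer'))

silly['priors_calc'] = priors_calc
print(silly['priors_calc'].equals(silly['priors']))  # True for the example data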
I'm having trouble understanding how a function works:
""" the apply() method lets you apply an arbitrary function to the group
result. The function takes a DataFrame and returns a Pandas object (a df or
series) or a scalar.
For example: normalize the first column by the sum of the second"""
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x
print (df); print (df.groupby('key').apply(norm_by_data2))
(Excerpt from: "Python Data Science Handbook", Jake VanderPlas pp. 167)
Returns this:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
key data1 data2
0 A 0.000000 5
1 B 0.142857 0
2 C 0.166667 3
3 A 0.375000 3
4 B 0.571429 7
5 C 0.416667 9
For me, the best way to understand how this works is by manually calculating the values.
Can someone explain how to manually arrive at the second value of the column 'data1': 0.142857?
It's 1/7, but where do these values come from?
Thanks!
I got it!!
The sum of data2 for each group is:
A: 5 + 3 = 8
B: 0 + 7 = 7
C: 3 + 9 = 12
For example, to arrive at 0.142857, divide 1 (the data1 value in that row) by the sum for group B, which is 7: 1/7 = 0.142857.
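A quick way to check those sums in code (a sketch that rebuilds df from the printed frame above):
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0, 1, 2, 3, 4, 5],
                   'data2': [5, 0, 3, 3, 7, 9]})

# per-group sums of data2: A -> 8, B -> 7, C -> 12
print(df.groupby('key')['data2'].sum())

# row 1 belongs to group B, so its data1 value 1 is divided by 7
print(1 / 7)  # 0.142857...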
I know that we can get normalized values from value_counts() of a pandas series but when we do a group by on a dataframe, the only way to get counts is through size(). Is there any way to get normalized values with size()?
Example:
df = pd.DataFrame({'subset_product':['A','A','A','B','B','C','C'],
'subset_close':[1,1,0,1,1,1,0]})
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
df.subset_product.value_counts()
A 3
B 2
C 2
df2
Looking to get:
subset_product subset_close prod_count norm
A 0 1 1/3
A 1 2 2/3
B 1 2 2/2
C 1 1 1/2
C 0 1 1/2
Besides manually calculating the normalized values as prod_count/total, is there any way to get normalized values?
I think it is not possible with only one groupby + size, because the groupby is by the two columns subset_product and subset_close, while for normalizing you need the size by subset_product only.
Possible solutions are map or transform to get a Series with the same length as df2, then divide with div:
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
s = df.subset_product.value_counts()
df2['prod_count'] = df2['prod_count'].div(df2['subset_product'].map(s))
Or:
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
a = df2.groupby('subset_product')['prod_count'].transform('sum')
df2['prod_count'] = df2['prod_count'].div(a)
print (df2)
subset_product subset_close prod_count
0 A 0 0.333333
1 A 1 0.666667
2 B 1 1.000000
3 C 0 0.500000
4 C 1 0.500000
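If you want to keep the raw counts and add the normalized values as a separate norm column, matching the layout asked for, a small variation of the transform approach should work:
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
df2['norm'] = df2['prod_count'].div(
    df2.groupby('subset_product')['prod_count'].transform('sum'))
print(df2)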
First time asking a question here, so hopefully I will make my issue clear. I am trying to understand how to better apply a list of scenarios (via for loop) to the same dataset and summarize results. *Note that once a scenario is applied and I pull the relevant statistical data from the dataframe into the summary table, I do not need to retain the information. Iterrows is painfully slow, as I have tens of thousands of scenarios I want to run. Thank you for taking the time to review.
I have two Pandas dataframes: df_analysts and df_results:
1) df_analysts contains a specific list of factors (e.g. TB, JK, SF, PWR) and scenarios of weights (e.g. 50,50,50,50)
TB JK SF PWR
0 50 50 50 50
1 50 50 50 100
2 50 50 50 150
3 50 50 50 200
4 50 50 50 250
2) df_results holds results by date, group, and entrant, and then the ranking by each factor; finally, it has the final finish result.
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W)
0 11182017 1 1 2 1 2 1 2
1 11182017 1 2 3 2 3 2 1
2 11182017 1 3 1 3 1 3 3
3 11182017 2 1 1 2 2 1 1
4 11182017 2 2 2 1 1 2 1
3) I am using iterrows to
loop through each scenario in the df_analysts dataframe
apply the weight scenario to each factor rank (if rank = 1, then 1.0*weight; if rank = 2, then 0.68*weight; if rank = 3, then 0.32*weight). Those results go into the W1-W4 columns.
Sum the W1-W4 columns.
Rank the SUM(W) column.
Result sample below for a single scenario (e.g. 50,50,50,50)
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W) Rank
0 11182017 1 1 2 1 2 1 1 34 50 34 50 168 1
1 11182017 1 2 3 2 3 2 3 16 34 16 34 100 3
2 11182017 1 3 1 3 1 3 2 50 16 50 16 132 2
3 11182017 2 1 2 2 2 1 1 34 34 34 50 152 2
4 11182017 2 2 1 1 1 2 1 50 50 50 34 184 1
4) Finally, for each scenario, I am creating a new dataframe for the summary results (df_summary), which logs the factor / weight scenario used (from df_analysts), compares the RANK result to the Finish by date and group, and keeps a tally of where they land. Sample below (only the 50,50,50,50 scenario is shown above, which results in a 1,1).
Factors Weights Top Top2
0 (TB,JK,SF,PWR) (50,50,50,50) 1 1
1 (TB,JK,SF,PWR) (50,50,50,100) 1 0
2 (TB,JK,SF,PWR) (50,50,50,150) 1 1
3 (TB,JK,SF,PWR) (50,50,50,200) 1 0
4 (TB,JK,SF,PWR) (50,50,50,250) 1 1
You could merge your analyst and results dataframes and then perform the calculations.
def factor_rank(x, y):
    if x == 1: return y
    elif x == 2: return y*0.68
    elif x == 3: return y*0.32
df_analysts.index.name='SCENARIO'
df_analysts.reset_index(inplace=True)
df_analysts['key'] = 1
df_results['key'] = 1
df = pd.merge(df_analysts, df_results, on='key')
df.drop(['key'],axis=1,inplace=True)
df['W1'] = df.apply(lambda r: factor_rank(r['TB-R'], r['TB']), axis=1)
df['W2'] = df.apply(lambda r: factor_rank(r['JK-R'], r['JK']), axis=1)
df['W3'] = df.apply(lambda r: factor_rank(r['SF-R'], r['SF']), axis=1)
df['W4'] = df.apply(lambda r: factor_rank(r['PWR-R'], r['PWR']), axis=1)
df['SUM(W)'] = df.W1 + df.W2 + df.W3 + df.W4
df["rank"] = df.groupby(['GR','SCENARIO'])['SUM(W)'].rank(ascending=False)
You may also want to check out this question which deals with improving processing times on row based calculations:
How to apply a function to mulitple columns of a pandas DataFrame in parallel