New to Python. I'm sure there is a very simple solution to this, but I'm struggling to find it.
I have a series of positive and negative numbers. I want to know what percentage of the numbers are positive. I've accomplished that for the whole dataset, but I would like the calculation to occur on every row (a running percentage).
The dataset I'm working with is quite large but here is an example:
import pandas as pd
data = {'numbers': [100, 300, 150, -150, -75, -100]}
df = pd.DataFrame(data)
df['count'] = df['numbers'].count()
df['pct_positive'] = df.numbers[df.numbers > 0].count() / df['count']
print(df)
Here is the actual result:
numbers count pct_positive
0 100 6 0.5
1 300 6 0.5
2 150 6 0.5
3 -150 6 0.5
4 -75 6 0.5
5 -100 6 0.5
Here is my desired result:
numbers count pct_positive
0 100 1 1.0
1 300 2 1.0
2 150 3 1.0
3 -150 4 0.75
4 -75 5 0.60
5 -100 6 0.5
Note how 'count' and 'pct_positive' are calculated on each row in the desired result, whereas they are simply totals in the actual result.
In this case 'count' is redundant with your index, so you can create that column from the index (or just stick with the index). Then take the .cumsum() of a boolean Series checking > 0 and divide by 'Count' to get the running percent positive.
df['Count'] = df.index+1
df['pct_pos'] = df.numbers.gt(0).cumsum()/df.Count
numbers Count pct_pos
0 100 1 1.00
1 300 2 1.00
2 150 3 1.00
3 -150 4 0.75
4 -75 5 0.60
5 -100 6 0.50
Also, avoid naming a column 'count', since count is already a DataFrame method.
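If you'd rather not create the helper column at all, an expanding mean of the boolean mask gives the same running fraction (a minimal sketch of the same idea on the question's data):

import pandas as pd

df = pd.DataFrame({'numbers': [100, 300, 150, -150, -75, -100]})
# the mean of a boolean Series is the fraction of True values,
# so an expanding mean is the running share of positives seen so far
df['pct_pos'] = df['numbers'].gt(0).expanding().mean()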
Try:
df.numbers.gt(0).cumsum().div(df.numbers.notnull().cumsum())
Output:
0 1.00
1 1.00
2 1.00
3 0.75
4 0.60
5 0.50
Name: numbers, dtype: float64
Details:
Check whether each value of df.numbers is greater than 0 (i.e. positive), then take the cumsum of that column.
Count the values seen so far by using notnull to convert to boolean and taking its cumsum.
Divide the positive count by the total count.
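To get those pieces back onto the frame as columns, a small sketch (the column name 'n' is arbitrary, picked to avoid shadowing the count method):

import pandas as pd

df = pd.DataFrame({'numbers': [100, 300, 150, -150, -75, -100]})
df['n'] = df['numbers'].notnull().cumsum()                        # values seen so far
df['pct_positive'] = df['numbers'].gt(0).cumsum().div(df['n'])    # running share of positives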
I have three columns on a pandas df: id, hazard, probability
I want to make sure the sum of probabilities for each id, hazard combo is 1.
So I wanted to find the sum of probabilities for each id, hazard.
And also find the index of the max probability for each id, hazard combo, and add (1 - sum) to that value.
I found in stack overflow how to do these two separately, but can't find a way to combine them.
Find index of max value per group:
i = df.groupby(['id','haz'])['prob'].transform('idxmax').values
Find sum of probabilities per group:
sums= df.groupby(['id','haz'])['prob'].sum()
How can I combine these two to make sure that the sum of probabilities for each group is exactly 1?
My code so far and example df below
import pandas as pd
import numpy as np
File = 'testprob1.csv'
VF = pd.read_csv(f'{File}', sep=',', header=0, index_col=False, dtype='str')
VF = VF.astype({'id': 'str', 'haz': 'int16', 'prob': 'float64'})
i = VF.groupby(['id','haz'])['prob'].transform('idxmax').values
sums= VF.groupby(['id','haz'])['prob'].sum()
Try this -
new_proba computes, for each group, the value that should replace that group's maximum probability: the current max plus whatever is needed to bring the group's sum to exactly 1.
Then use idxmax to find the row index of each group's max and df.loc to select those rows and overwrite them with new_proba.
new_proba = df.groupby(['id','haz'])['prob'].apply(lambda x: max(x)+1-(sum(x))).values
df.loc[df.groupby(['id','haz'])['prob'].agg('idxmax').values, 'prob'] = new_proba
print(df)
id haz prob
0 1 20 0.05
1 1 20 0.05
2 1 20 0.90
3 1 30 0.98
4 1 30 0.02
5 2 30 1.00
6 2 40 0.12
7 2 40 0.78
8 2 40 0.05
9 2 40 0.05
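For completeness, here is the same in-place approach end-to-end on the sample data used in the alternate method below (a sketch; it assumes each group's maximum appears in a single row):

import pandas as pd

idd = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
haz = [20, 20, 20, 30, 30, 30, 40, 40, 40, 40]
prob = [0.05, 0.05, 0.42, 0.3, 0.02, 0.05, 0.12, 0.44, 0.05, 0.05]
df = pd.DataFrame({'id': idd, 'haz': haz, 'prob': prob})

# the value each group's max row must take so that the group sums to 1
new_proba = df.groupby(['id', 'haz'])['prob'].apply(lambda x: max(x) + 1 - sum(x)).values
# overwrite the max row of each group
df.loc[df.groupby(['id', 'haz'])['prob'].agg('idxmax').values, 'prob'] = new_proba

print(df.groupby(['id', 'haz'])['prob'].sum())   # every group now sums to 1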
Alternate method
For a custom rescaling, you can write your own function and apply it to each group, returning the new probabilities as a list; wrapping that list in a pd.Series distributes the values back to the rows much like .transform would.
idd = [1,1,1,1,1,2,2,2,2,2]
haz = [20,20,20,30,30,30,40,40,40,40]
prob = [0.05,0.05,0.42,0.3,0.02,0.05,0.12,0.44,0.05,0.05]
df = pd.DataFrame({'id':idd, 'haz':haz, 'prob':prob})
def f(l):
    # add the shortfall (1 - sum) to the group's max element, leave the rest unchanged
    return [i + (1 - sum(l)) if i == max(l) else i for i in l]

df['new_proba'] = df.groupby(['id','haz'])['prob'].apply(lambda x: pd.Series(f(x))).values
print(df)
id haz prob new_proba
0 1 20 0.05 0.05
1 1 20 0.05 0.05
2 1 20 0.42 0.90
3 1 30 0.30 0.98
4 1 30 0.02 0.02
5 2 30 0.05 1.00
6 2 40 0.12 0.12
7 2 40 0.44 0.78
8 2 40 0.05 0.05
9 2 40 0.05 0.05
Just to confirm that sum for each group is 1 -
df.groupby(['id','haz'])['new_proba'].sum()
id haz
1 20 1.0
30 1.0
2 30 1.0
40 1.0
Name: new_proba, dtype: float64
Hi, I have the following df in which I want the new column to be the result of B/A, unless B == 0, in which case take the average of C and D and divide by A, i.e. ((C+D)/2)/A.
I know how to do df["New Column"] = df["B"]/df["A"], but I am not sure how to do it the way I want. Do I need to iterate through each row of the df and use conditional if statements?
A B C D New Column Desired Column
5 3 2 4 0.6 0.6
6 2 2 3 0.333 0.333333333
8 4 3 4 0.5 0.5
9 0 3 4 0 0.388888889
14 3 3 4 0.214 0.214285714
5 0 2 4 0 0.6
Here you go:
import numpy as np
df["new Column"] = np.where(df["B"] != 0, df["B"]/df["A"], (df["C"]+df["D"])/2/df["A"])
I have two dataframes in pandas of the following form:
df1                 df2
   column              factor
0       2           0    0.00
1       4           1    0.25
2      12           2    0.50
3       5           3    0.99
4       4           4    1.00
5      15
6      32
The task is to take the sum-product of every 5 consecutive rows of df1 with df2 and put the results in df3 (my actual data has about 500 rows in df1). The results should look like this:
df3
   results    description (no need to add this column)
0    15.95    df1.iloc[0:5, 0].dot(df2)
1    24.46    df1.iloc[1:6, 0].dot(df2)
2    50.10    df1.iloc[2:7, 0].dot(df2)
Try:
import numpy as np
mx = df2.to_numpy().ravel()   # flatten to 1-D so each window's dot product is a scalar
df1.rolling(5).apply(lambda x: np.dot(x, mx), raw=True).iloc[4:]
Outputs:
column
4 15.95
5 24.46
6 50.10
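On NumPy 1.20 or newer, a sliding-window view gives the same sum-products without rolling (an alternative sketch on the question's data; 'column' and 'factor' are the column names assumed from the question):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column': [2, 4, 12, 5, 4, 15, 32]})
df2 = pd.DataFrame({'factor': [0.0, 0.25, 0.50, 0.99, 1.00]})

# each row of `windows` holds 5 consecutive values of df1
windows = np.lib.stride_tricks.sliding_window_view(df1['column'].to_numpy(), 5)
df3 = pd.DataFrame({'results': windows @ df2['factor'].to_numpy()})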
Say I have a dataset like this:
is_a is_b is_c population infected
1 0 1 50 20
1 1 0 100 10
0 1 1 20 10
...
How do I reshape it to look like this?
feature 0 1
a 10/20 30/150
b 20/50 20/120
c 10/100 30/70
...
In the original dataset, I have features a, b, and c as their own separate columns. In the transformed dataset, these same variables are listed under column feature, and two new columns 0 and 1 are produced, corresponding to the values that these features can take on.
In the original dataset, where is_a is 0, sum the infected values and divide by the sum of the population values; where is_a is 1, do the same. Rinse and repeat for is_b and is_c. The new dataset will have these fractions (or decimals) as shown. Thank you!
I've tried pd.pivot_table and pd.melt but nothing comes close to what I need.
After doing the wide_to_long, your question becomes clearer:
df = pd.wide_to_long(df, ['is'], ['population', 'infected'], j='feature', sep='_', suffix=r'\w+').reset_index()
df
population infected feature is
0 50 20 a 1
1 50 20 b 0
2 50 20 c 1
3 100 10 a 1
4 100 10 b 1
5 100 10 c 0
6 20 10 a 0
7 20 10 b 1
8 20 10 c 1
df.groupby(['feature','is']).apply(lambda x : sum(x['infected'])/sum(x['population'])).unstack()
is 0 1
feature
a 0.5 0.200000
b 0.4 0.166667
c 0.1 0.428571
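A rough equivalent using melt instead of wide_to_long, in case that reads more naturally (a sketch built on the sample data from the question; it recreates the original wide frame first):

import pandas as pd

df = pd.DataFrame({'is_a': [1, 1, 0], 'is_b': [0, 1, 1], 'is_c': [1, 0, 1],
                   'population': [50, 100, 20], 'infected': [20, 10, 10]})

# stack the is_a / is_b / is_c flags into one 'feature' column
long = df.melt(id_vars=['population', 'infected'], var_name='feature', value_name='is')
long['feature'] = long['feature'].str[3:]   # 'is_a' -> 'a'
long.groupby(['feature', 'is']).apply(lambda g: g['infected'].sum() / g['population'].sum()).unstack()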
I tried this on your small dataframe, but I am not sure it will work on a larger dataset.
dic_df = {}
for letter in ['a', 'b', 'c']:
    dic_da = {}
    dic_da[0] = df[df['is_' + letter] == 0].infected.sum() / df[df['is_' + letter] == 0].population.sum()
    dic_da[1] = df[df['is_' + letter] == 1].infected.sum() / df[df['is_' + letter] == 1].population.sum()
    dic_df[letter] = dic_da

dic_df_ = pd.DataFrame(data=dic_df).T.reset_index().rename(columns={'index': 'feature'})
dic_df_
feature 0 1
0 a 0.5 0.200000
1 b 0.4 0.166667
2 c 0.1 0.428571
Here, DF would be your original DataFrame
Aux_NewDF = [{'feature': feature,
              0: '{}/{}'.format(DF['infected'][DF['is_{}'.format(feature)] == 0].sum(),
                                DF['population'][DF['is_{}'.format(feature)] == 0].sum()),
              1: '{}/{}'.format(DF['infected'][DF['is_{}'.format(feature)] == 1].sum(),
                                DF['population'][DF['is_{}'.format(feature)] == 1].sum())}
             for feature in ['a', 'b', 'c']]
NewDF = pd.DataFrame(Aux_NewDF)
I have a dataframe of length-interval data (from boreholes) which looks something like this:
df
Out[46]:
from to min intensity
0 0 10 py 2
1 5 15 cpy 3.5
2 14 27 spy 0.7
I need to pivot this data, but also break it on the least common length interval, so that the 'min' column values become the column headers and the 'intensity' values become the cell values. The output would look like this:
df.somefunc(index=['from','to'], columns='min', values='intensity', fill_value=0)
Out[47]:
from to py cpy spy
0 0 5 2 0 0
1 5 10 2 3.5 0
2 10 14 0 3.5 0
3 14 15 0 3.5 0.7
4 15 27 0 0 0.7
so basically the "From" and "To" describe non-overlapping intervals down a borehole, where the intervals have been split by the least common denominator - as you can see the "py" interval from the original table has been split, the first (0-5m) into py:2, cpy:0 and the second (5-10m) into py:2, cpy:3.5.
The result from just a basic pivot_table function is this:
pd.pivot_table(df, values='intensity', index=['from', 'to'], columns="min", aggfunc="first", fill_value=0)
Out[48]:
min cpy py spy
from to
0 10 0 2 0
5 15 3.5 0 0
14 27 0 0 0.7
which just treats the from and to columns combined as an index. An important point is that my output cannot have overlapping from and to values (IE the subsequent 'from' value cannot be less than the previous 'to' value).
Is there an elegant way to accomplish this using Pandas? Thanks for the help!
I don't know of native interval arithmetic in Pandas, so you need to do it yourself.
Here is a way to do that, if I understand the boundary conditions correctly.
This builds a large intermediate table (roughly quadratic in the number of intervals), so it can get huge for big inputs.
# make the new bounds
bounds = np.unique(np.hstack((df["from"], df["to"])))
df2 = pd.DataFrame({"from": bounds[:-1], "to": bounds[1:]})

# find which original intervals cover each new sub-interval
isin = df.apply(lambda x:
                df2['from'].between(x['from'], x['to'] - 1)
                | df2['to'].between(x['from'] + 1, x['to']),
                axis=1).T

# data: take the intensity wherever the sub-interval is covered, else 0
data = np.where(isin, df.intensity, 0)

# result
df3 = pd.DataFrame(data,
                   pd.MultiIndex.from_arrays(df2.values.T),
                   df["min"])
For the example data above:
In [26]: df3
Out[26]:
min py cpy spy
0 5 2.0 0.0 0.0
5 10 2.0 3.5 0.0
10 14 0.0 3.5 0.0
14 15 0.0 3.5 0.7
15 27 0.0 0.0 0.7
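A hedged alternative sketch using pandas' IntervalIndex.overlaps (available in recent pandas versions); on the example data it reproduces the same table:

import numpy as np
import pandas as pd

df = pd.DataFrame({'from': [0, 5, 14], 'to': [10, 15, 27],
                   'min': ['py', 'cpy', 'spy'], 'intensity': [2, 3.5, 0.7]})

# split the boreholes on every boundary that appears in the data
bounds = np.unique(np.hstack((df['from'], df['to'])))
pieces = pd.IntervalIndex.from_arrays(bounds[:-1], bounds[1:], closed='left')
source = pd.IntervalIndex.from_arrays(df['from'], df['to'], closed='left')

# for each sub-interval, keep the intensity of every borehole interval that covers it
rows = [np.where(source.overlaps(piece), df['intensity'], 0) for piece in pieces]
df3 = pd.DataFrame(rows,
                   index=pd.MultiIndex.from_arrays([bounds[:-1], bounds[1:]], names=['from', 'to']),
                   columns=df['min'])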