I have a dataframe with two columns, score and order_amount. I want to find the score Y that represents the Xth percentile of order_amount. I.e. if I sum up all of the values of order_amount where score <= Y I will get X% of the total order_amount.
I have a solution below that works, but it seems like there should be a more elegant way with pandas.
import pandas as pd
test_data = {'score': [0.3,0.1,0.2,0.4,0.8],
'value': [10,100,15,200,150]
}
df = pd.DataFrame(test_data)
df
   score  value
0    0.3     10
1    0.1    100
2    0.2     15
3    0.4    200
4    0.8    150
# Now we can order by `score` and use `cumsum` to calculate what we want
df_order = df.sort_values('score')
df_order['percentile_value'] = 100*df_order['value'].cumsum()/df_order['value'].sum()
df_order
   score  value  percentile_value
1    0.1    100         21.052632
2    0.2     15         24.210526
0    0.3     10         26.315789
3    0.4    200         68.421053
4    0.8    150        100.000000
# Now we can find the first score whose cumulative percentile exceeds 50% (for example)
df_order[df_order['percentile_value']>50]['score'].iloc[0]
Use Series.searchsorted:
idx = df_order['percentile_value'].searchsorted(50)
print (df_order.iloc[idx, df_order.columns.get_loc('score')])
0.4
Or take the first value of the filtered Series with next and iter, which returns a default value if there is no match:
s = df_order.loc[df_order['percentile_value'] > 50, 'score']
print (next(iter(s), 'no match'))
0.4
One line solution:
out = next(iter((df.sort_values('score')
.assign(percentile_value = lambda x: 100*x['value'].cumsum()/x['value'].sum())
.query('percentile_value > 50')['score'])),'no match')
print (out)
0.4
Here is another way, starting from the original dataframe and using np.percentile:
import numpy as np
df = df.sort_values('score')
df.loc[np.searchsorted(df['value'],np.percentile(df['value'].cumsum(),50)),'score']
Or with Series.quantile:
df.loc[np.searchsorted(df['value'],df['value'].cumsum().quantile(0.5)),'score']
Or similarly with iloc, if index is not default:
df.iloc[np.searchsorted(df['value']
,np.percentile(df['value'].cumsum(),50)),df.columns.get_loc('score')]
0.4
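For reference, the cumulative-share lookup from the question can be wrapped in one small function. This is only a minimal sketch: the name `score_at_value_percentile` and its parameters are made up for illustration, and it assumes non-negative values so that the cumulative share is sorted, which `np.searchsorted` needs. It searches the cumulative share itself rather than the raw values.

```python
import numpy as np
import pandas as pd


def score_at_value_percentile(df, pct, score_col="score", value_col="value"):
    """Return the first score whose cumulative share of value exceeds pct.

    Hypothetical helper; assumes value_col holds non-negative numbers.
    """
    ordered = df.sort_values(score_col)
    # Cumulative share of the total, in percent, ordered by score
    cum_share = 100 * ordered[value_col].cumsum().to_numpy() / ordered[value_col].sum()
    # side="right" finds the first position with a share strictly above pct
    idx = np.searchsorted(cum_share, pct, side="right")
    if idx == len(ordered):
        return None  # no score reaches a share above pct
    return ordered[score_col].iloc[idx]


df = pd.DataFrame({"score": [0.3, 0.1, 0.2, 0.4, 0.8],
                   "value": [10, 100, 15, 200, 150]})
print(score_at_value_percentile(df, 50))
```

With the sample data this matches the `> 50` filter used in the question.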
I need to calculate a percent change column with respect to the MultiIndex:
import pandas as pd
import numpy as np
row_x1 = ['1','0']
row_x2 = ['1.5','.5']
row_x3 = ['3','1']
row_x4 = ['2','0']
row_x5 = ['3','.5']
index_arrays = [
np.array(['first', 'first', 'first', 'second', 'second']),
np.array(['one','two','three','one','two'])
]
df1 = pd.DataFrame(
[row_x1,row_x2,row_x3,row_x4,row_x5],
columns=['A'],
index=index_arrays,
)
print(df1)
Starting with the following data frame:
                A
first  one      1
       two    1.5
       three    3
second one      2
       two      3
The final "percentage change" column should be calculated as shown below:
                A    %
first  one      1    0
       two    1.5   .5
       three    3    1
second one      2    0
       two      3   .5
I have a large data set, and I need to do this programmatically.
Group by the first index level and calculate the percent change within each group:
df1['A'] = df1['A'].astype(float)
df1['%'] = df1.groupby(level=0)['A'].pct_change().fillna(0)
                A    %
first  one    1.0  0.0
       two    1.5  0.5
       three  3.0  1.0
second one    2.0  0.0
       two    3.0  0.5
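To make explicit what the grouped percent change computes, here is a small sketch that builds the same column by hand: each value is divided by the previous value within its first-level group, and 1 is subtracted. The frame is rebuilt with float values so the block runs on its own; the variable name `prev` is just for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([
    ["first", "first", "first", "second", "second"],
    ["one", "two", "three", "one", "two"],
])
df1 = pd.DataFrame({"A": [1.0, 1.5, 3.0, 2.0, 3.0]}, index=index)

# Previous value within each first-level group (NaN for the first row of a group)
prev = df1.groupby(level=0)["A"].shift()

# Percent change relative to the previous row; the first row of a group becomes 0
df1["%"] = (df1["A"] / prev - 1).fillna(0)
print(df1)
```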
I am trying to build an article similarity checker by comparing a list of articles that I obtained from an API against 6 baseline articles. I used cosine similarity to compare each article, one by one, with the 6 baseline articles.
My dataframe now looks like this:
id   Article     cosinesin1  cosinesin2  cosinesin3  cosinesin4  cosinesin5  cosinesin6  Similar
id1  [Article1]  0.2         0.5         0.6         0.8         0.7         0.8         True
id2  [Article2]  0.1         0.2         0.03        0.8         0.2         0.45        False
I want to add a Similar column to my dataframe that checks the values of cosinesin1 to cosinesin6 for each row and returns True if at least 3 of the 6 values are greater than 0.5, and False otherwise.
Is there any way to do this in Python?
Thanks
In Python, you can treat True and False as integers, 1 and 0, respectively.
So if you compare all the similarity metrics to 0.5, you can sum over the resulting Boolean DataFrame along the columns, to get the number of comparisons that resulted in True for each row. Comparing those numbers to 3 yields the column you want:
cos_cols = [f"cosinesin{i}" for i in range(1, 7)]
df['Similar'] = (df[cos_cols] > 0.5).sum(axis=1) >= 3
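As a quick check of the counting logic, the sketch below rebuilds the two sample rows from the question and keeps the intermediate count in a helper column. The column name `n_above` is an assumption added for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["id1", "id2"],
    "cosinesin1": [0.2, 0.1],
    "cosinesin2": [0.5, 0.2],
    "cosinesin3": [0.6, 0.03],
    "cosinesin4": [0.8, 0.8],
    "cosinesin5": [0.7, 0.2],
    "cosinesin6": [0.8, 0.45],
})

cos_cols = [f"cosinesin{i}" for i in range(1, 7)]

# Boolean frame: True where a similarity is strictly greater than 0.5
above = df[cos_cols] > 0.5

# Summing booleans along each row counts the True values
df["n_above"] = above.sum(axis=1)
df["Similar"] = df["n_above"] >= 3
print(df[["id", "n_above", "Similar"]])
```

Note that 0.5 itself does not count, because the comparison is strict.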
I have a Pandas dataframe like this:
import pandas as pd
df = pd.DataFrame(
{'gender':['F','F','F','F','F','M','M','M','M','M'],
'mature':[0,1,0,0,0,1,1,1,0,1],
'cta' :[1,1,0,1,0,0,0,1,0,1]}
)
df['gender'] = df['gender'].astype('category')
df['mature'] = df['mature'].astype('category')
df['cta'] = pd.to_numeric(df['cta'])
df
I calculated the sum (how many times people clicked) and the total (the number of messages sent). I want to calculate the percentage, defined as clicks/total, and get a dataframe as output.
temp_groupby = df.groupby('gender').agg({'cta': [('clicks','sum'),
('total','count')]})
temp_groupby
It sounds like you need the average: since cta holds only 0s and 1s, its mean equals clicks/total. Add a new tuple to the list:
temp_groupby = df.groupby('gender').agg({'cta': [('clicks','sum'),
('total','count'),
('perc', 'mean')]})
print (temp_groupby)
          cta
       clicks total perc
gender
F           3     5  0.6
M           2     5  0.4
To avoid a MultiIndex in the columns, select the column after groupby:
temp_groupby = df.groupby('gender')['cta'].agg([('clicks','sum'),
('total','count'),
('perc', 'mean')]).reset_index()
print (temp_groupby)
gender clicks total perc
0 F 3 5 0.6
1 M 2 5 0.4
Or use named aggregation:
temp_groupby = df.groupby('gender', as_index=False).agg(clicks= ('cta','sum'),
total= ('cta','count'),
perc= ('cta','mean'))
print (temp_groupby)
gender clicks total perc
0 F 3 5 0.6
1 M 2 5 0.4
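If the ratio should be spelled out rather than taken from the mean, one possible sketch is to aggregate clicks and total first and then divide the two columns. Here gender is kept as a plain string column, which is a simplification compared with the categorical column in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"],
    "cta": [1, 1, 0, 1, 0, 0, 0, 1, 0, 1],
})

# Named aggregation gives flat column names directly
out = df.groupby("gender", as_index=False).agg(
    clicks=("cta", "sum"),
    total=("cta", "count"),
)

# Explicit clicks/total; equal to the mean of cta because cta is 0 or 1
out["perc"] = out["clicks"] / out["total"]
print(out)
```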
I have a matrix as follows
0 1 2 3 ...
A 0.1 0.2 0.3 0.1
C 0.5 0.4 0.2 0.1
G 0.6 0.4 0.8 0.3
T 0.1 0.1 0.4 0.2
The data is in a dataframe as shown
Genes string
Gene1 ATGC
Gene2 GCTA
Gene3 ATCG
I need to write code that finds the score of each sequence. The score for the sequence ATGC is 0.1+0.1+0.8+0.1 = 1.1: A is in the first position, and the value for A at that position is 0.1; the remaining positions are scored the same way along the full length of the sequence (450 letters).
The output should be as follows:
Genes Score
Gene1 1.1
Gene2 1.5
Gene3 0.7
I tried using biopython but could not get it right. Can anyone please help!
Let df and genes be your DataFrames. First, let's convert df into a "tall" form:
tall = df.stack().reset_index()
tall.columns = 'letter', 'pos', 'score'
tall.pos = tall.pos.astype(int) # Need a number here, not a string!
Create a new tuple-based index for the tall DF:
tall.set_index(tall[['pos', 'letter']].apply(tuple, axis=1), inplace=True)
This function will extract the scores indexed by the tuples in the form (position,"letter") from the tall DF and sum them up:
def gene2score(gene):
    return tall.loc[list(enumerate(gene))]['score'].sum()
genes['string'].apply(gene2score)
#Genes
#Gene1 1.1
#Gene2 1.5
#Gene3 0.7
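An alternative that may scale better to long sequences is to look the scores up with NumPy integer indexing instead of a tuple index. This is a sketch under stated assumptions: the matrix columns are the positions 0, 1, 2, ... in order, every sequence contains only the letters A, C, G and T, and the names `matrix`, `weights` and `gene_score` are invented for the example.

```python
import numpy as np
import pandas as pd

matrix = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.1],
     [0.5, 0.4, 0.2, 0.1],
     [0.6, 0.4, 0.8, 0.3],
     [0.1, 0.1, 0.4, 0.2]],
    index=list("ACGT"), columns=[0, 1, 2, 3],
)
genes = pd.DataFrame(
    {"string": ["ATGC", "GCTA", "ATCG"]},
    index=pd.Index(["Gene1", "Gene2", "Gene3"], name="Genes"),
)

weights = matrix.to_numpy()


def gene_score(seq):
    # Row number of each letter (assumes every letter is in the matrix index)
    rows = matrix.index.get_indexer(list(seq))
    # Column number is simply the position within the sequence
    cols = np.arange(len(seq))
    return weights[rows, cols].sum()


scores = genes["string"].apply(gene_score).rename("Score")
print(scores.round(2))
```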
I have a pandas dataframe with columns pos, k and y. For example:
pos k y
123 0.7 0.5
124 0.4 0.1
125 0.3 0.2
126 0.4 0.1
128 0.3 0.6
130 0.4 0.9
131 0.3 0.2
I would like to sum the values of k and y like this:
123 1.1 0.6
125 0.7 0.3
128 0.3 0.6
130 0.7 1.1
So the output keeps only the first position of each pair, with the values summed over that position and the consecutive position that immediately follows it.
I tried grouping with pandas:
for k,g in df.groupby(df['pos'] - np.arange(df.shape[0])):
    u=g.ix[0:,2:].sum()
but it groups all the consecutive numbers together, which I don't want.
I also need something fast, as I have 2,611,774 rows in my data file.
Hope this will solve your problem
import pandas as pd

df = pd.DataFrame({'pos': [123, 124, 125, 126, 128, 130, 131],
                   'k': [.7, .4, .3, .4, .3, .4, .3],
                   'y': [.5, .1, .2, .1, .6, .9, .2]})
cf = pd.DataFrame(columns=['pos', 'k', 'y'])
row = 0
while row < len(df):
    # Merge this row with the next one when their positions are consecutive
    if row + 1 < len(df) and df.loc[row, 'pos'] + 1 == df.loc[row + 1, 'pos']:
        cf.loc[row] = df.loc[row] + df.loc[row + 1]
        cf.loc[row, 'pos'] = df.loc[row, 'pos']  # keep the first position
        row += 2
    else:
        # No consecutive partner (this also covers a trailing single row)
        cf.loc[row] = df.loc[row]
        row += 1
print(cf)
This may be faster than a loop, but it will not give the pairing I think you expect: it pairs each odd position with the even position that follows it (129 with 130, 131 with 132, ...), so 123 and 124 are summed but 130 and 131 are not.
df = df.set_index('pos')
df_odd = df.loc[df.index.values % 2 == 1]
df_even = df.loc[df.index.values % 2 == 0]
df_even = df_even.set_index(df_even.index.values - 1)
df_odd.add(df_even, fill_value = 0)
Result:
pos k y
123 1.1 0.6
125 0.7 0.3
127 0.3 0.6
129 0.4 0.9
131 0.3 0.2
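To pair rows by adjacency rather than by odd/even position, one possible vectorized sketch is to label runs of consecutive positions first and then split each run into pairs. It assumes pos is sorted and has no duplicates; the names `run_id` and `pair_id` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "pos": [123, 124, 125, 126, 128, 130, 131],
    "k": [0.7, 0.4, 0.3, 0.4, 0.3, 0.4, 0.3],
    "y": [0.5, 0.1, 0.2, 0.1, 0.6, 0.9, 0.2],
})

# A new run starts wherever the gap to the previous position is not exactly 1
run_id = df["pos"].diff().ne(1).cumsum().rename("run")

# Inside a run, rows 0 and 1 form the first pair, rows 2 and 3 the second, ...
pair_id = (df.groupby(run_id).cumcount() // 2).rename("pair")

out = (
    df.groupby([run_id, pair_id])
      .agg(pos=("pos", "first"), k=("k", "sum"), y=("y", "sum"))
      .reset_index(drop=True)
)
print(out)
```

On the sample data this keeps positions 123, 125, 128 and 130, with 128 left unpaired because 129 is missing.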
I have not used pandas before, but if you get the chance to use the data as a list then this should work.
def SumNext(L):
    # Sum each element with the one that follows it
    N = range(len(L)-1)
    Output = [L[i]+L[i+1] for i in N]
    return Output
This function will give you a summation of consecutive elements if you input a list.
A=[1,1,2,3,5,8,13]
SumNext(A) => [2,3,5,8,13,21]
You then just have to write the values wherever you like. With many elements, a list comprehension is much faster than an explicit while loop.
You will still need to work out how to pass the output back to your data frame.
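The same neighbour sum can also be written with NumPy slices, which avoids the Python-level loop entirely. This is only a sketch of that idea: like SumNext, it sums every overlapping pair of neighbours, not the non-overlapping pairs asked for in the question. A DataFrame column can be turned into such an array with `to_numpy()`.

```python
import numpy as np

a = np.array([1, 1, 2, 3, 5, 8, 13])

# a[:-1] drops the last element, a[1:] drops the first,
# so adding them sums each element with its right-hand neighbour
pair_sums = a[:-1] + a[1:]
print(pair_sums.tolist())
```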