I have a DataFrame that looks like this:
score num_participants
0 20
1 15
2 5
3 10
4 12
5 15
I need to find the number of participants with a score greater than or equal to the score in the current row:
score num_participants num_participants_with_score_greater_or_equal
0 20 77
1 15 57
2 5 42
3 10 37
4 12 27
5 15 15
So, I am trying to sum the current row and all rows below it. The data has around 5000 rows, so I can't set it manually by indexing. cumsum doesn't do the trick, and I am not sure if there is a simple way to do this. I have spent quite some time trying to solve this, so any help would be appreciated.
This is a reverse cumsum. Reverse the rows, cumsum, then reverse back.
df.iloc[::-1].cumsum().iloc[::-1]
score num_participants
0 15 77
1 15 57
2 14 42
3 12 37
4 9 27
5 5 15
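If you only want the running total as a new column (rather than cumulative sums of every column), a minimal sketch, assuming the column names above, restricts the reversal to num_participants:
df['num_participants_with_score_greater_or_equal'] = (
    df['num_participants'].iloc[::-1].cumsum().iloc[::-1]
)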
In case score isn't already sorted, how about
df['num_participants_with_score_greater_or_equal'] = df.sort_values('score', ascending=False).num_participants.cumsum()
to make sure score is in the right order? The assignment aligns on the index, so df keeps its original row order; if you work with the sorted frame itself, you can restore the original order with .sort_index() afterwards.
I have a list of n elements, let's say:
[5,30,60,180,240]
And a dataframe with the following characteristics
id1 id2 feat1
1 1 40
1 2 40
1 3 40
1 4 40
2 6 87
2 7 87
2 8 87
The combination of id1 + id2 is unique, but all records with a common id1 share the same feat1 value. I would like to write a function, run via groupby + apply (or whatever is faster), that creates a column called 'closest_number': for each row, the element of the list closest to its feat1 value (per id1 + id2, or just per id1, since the records share feat1).
Desired output:
id1 id2 feat1 closest_number
1 1 40 30
1 2 40 30
1 3 40 30
1 4 40 30
2 6 87 60
2 7 87 60
2 8 87 60
If this were a standard two-array lookup problem, I could do:
import numpy as np

def get_closest(array, values):
    # make sure array is a sorted numpy array (searchsorted requires sorted input)
    array = np.asarray(array)
    values = np.asarray(values)
    # get insert positions
    idxs = np.searchsorted(array, values, side="left")
    # find indexes where the previous element is closer
    prev_idx_is_less = (idxs == len(array)) | (
        np.fabs(values - array[np.maximum(idxs - 1, 0)])
        < np.fabs(values - array[np.minimum(idxs, len(array) - 1)])
    )
    idxs[prev_idx_is_less] -= 1
    return array[idxs]
And if I apply this to the columns there, I will get as output:
array([30, 60])
However, I will not get any information about which rows correspond to 30 and 60.
What would be the optimal way of doing this? As my list of elements is very small, I have created distance columns in my dataset and then selected the one with the minimum distance.
But I assume there should be a more elegant way of doing this.
Use get_closest as follows:
# the sorted list of candidate numbers
lst = [5, 30, 60, 180, 240]
# obtain the series with index id1 and values feat1
vals = df.groupby("id1")["feat1"].first().rename("closest_number")
# find the closest values and assign them back
vals[:] = get_closest(lst, vals)
# merge the series into the original DataFrame
res = df.merge(vals, right_index=True, left_on="id1", how="left")
print(res)
Output
id1 id2 feat1 closest_number
0 1 1 40 30
1 1 2 40 30
2 1 3 40 30
3 1 4 40 30
4 2 6 87 60
5 2 7 87 60
6 2 8 87 60
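Since all rows with the same id1 share feat1, a shorter sketch, assuming lst is the sorted candidate list defined above, skips the groupby and applies get_closest to the whole column:
df["closest_number"] = get_closest(lst, df["feat1"].to_numpy())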
For example, if I wanted:
df = pd.DataFrame(index=range(5000))
df['A'] = 0
df['A'][0] = 1
for i in range(len(df)):
    if i != 0:
        df['A'][i] = df['A'][i-1] * 3
Is there a way to do this without a loop?
(Your code sample had missing close brackets and invalid quotes; these are fixed above.)
If I understand what you are trying to achieve, you want to multiply the previous value by 3, with the zeroth value being 1.
Initialize the series to 3, then set the zeroth item to 1;
then it is a simple use of cumprod.
I have shortened your series, as this calculation rapidly overflows:
df = pd.DataFrame(index=range(20))
df["A"]= 3
df.loc[0,"A"] = 1
df["A"] = df["A"].cumprod()
         A
0        1
1        3
2        9
3       27
4       81
5      243
6      729
7     2187
8     6561
9    19683
10   59049
11  177147
12  531441
13  1.59432e+06
14  4.78297e+06
15  1.43489e+07
16  4.30467e+07
17  1.2914e+08
18  3.8742e+08
19  1.16226e+09
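Since the column here is just successive powers of 3 (row i holds 3**i), a closed-form sketch, assuming a default RangeIndex, avoids cumprod entirely:
import numpy as np
df["A"] = 3 ** np.arange(len(df))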
import pandas as pd
data = {'term':[2, 7,10,11,13],'pay':[22,30,50,60,70]}
df = pd.DataFrame(data)
   term  pay
0     2   22
1     7   30
2    10   50
3    11   60
4    13   70
df.loc[2] = [9, 49]
print(df)
   term  pay
0     2   22
1     7   30
2     9   49
3    11   60
4    13   70
Expected output:
   term  pay
0     2   22
1     7   30
2     9   49
3    10   50
4    11   60
5    13   70
If we run the above code, it replaces the values at index 2. I want to add a new row with the desired values to my existing dataframe without replacing the existing ones. Please suggest.
You cannot insert a new row directly by assigning to df.loc[2], as that overwrites the existing values. But you can slice the dataframe into two parts and then concat the two parts with the new row in between.
Try this:
new_df = pd.DataFrame({"pay": 49, "term": 9}, index=[2])
df = pd.concat([df.loc[:1], new_df, df.loc[2:]]).reset_index(drop=True)
print(df)
Output:
term pay
0 2 22
1 7 30
2 9 49
3 10 50
4 11 60
5 13 70
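A sketch of the same idea wrapped into a reusable helper (insert_row is a hypothetical name, not a pandas API):
def insert_row(df, i, row):
    # slice around position i, put the new row in between, renumber
    new_row = pd.DataFrame([row], index=[i])
    return pd.concat([df.iloc[:i], new_row, df.iloc[i:]]).reset_index(drop=True)

df = insert_row(df, 2, {"term": 9, "pay": 49})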
A possible way is to prepare an empty slot in the index, add the row, and sort according to the index:
df.index = list(range(2)) + list(range(3, len(df) + 1))
df.loc[2] = [9, 49]
It gives:
   term  pay
0     2   22
1     7   30
3    10   50
4    11   60
5    13   70
2     9   49
Time to sort it:
df = df.sort_index()
   term  pay
0     2   22
1     7   30
2     9   49
3    10   50
4    11   60
5    13   70
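An alternative sketch of the same index trick, starting again from the original five-row frame: give the new row a fractional label so it sorts into the gap, then renumber:
df.loc[1.5] = [9, 49]  # the index temporarily becomes float
df = df.sort_index().reset_index(drop=True)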
That happens because loc and iloc address the already existing rows of the dataframe, so assigning to an existing label overwrites it; what you would normally do is append a new row at the end.
To handle this situation, first split the dataframe, append the row you want to insert, concatenate the second split, and finally reset the index (in case you want to keep using integers):
# location you want to insert at
i = 2
# data to insert
data_to_insert = pd.DataFrame({'term': 9, 'pay': 49}, index=[i])
# split, append the data to insert, then append the rest of the original
df = df.loc[:i-1].append(data_to_insert).append(df.loc[i:]).reset_index(drop=True)
Keep in mind that this slicing works because the index holds integers, and that loc slicing includes the end label, which is why the first slice stops at i-1. (On pandas 2.0 and later, DataFrame.append has been removed; use pd.concat as in the first answer.)
I have a dataframe whose groups have different lengths.
For example,
gid val1 val2
1 3 5
1 11 15
1 12 5
1 18 6
1 8 8
1 18 7
1 18 8
2 29 21
2 27 23
....
Then, I want to perform
def func(x):
    d = {}
    # mean of the first five rows of val1 within the group
    d['first5'] = x['val1'].head(5).mean()
    return pd.Series(d)

grouped = df.groupby(['gid']).apply(func)
That way I can get the mean of the first five rows of x['val1'] within each group.
Is there any way to perform the above operation?
And is there a way to do it with the latter half of the rows in each group, rather than the first five?
Thank you in advance.
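A minimal sketch of the latter-half variant, using integer-position slicing (func_latter_half and latter_half_mean are made-up names):
def func_latter_half(x):
    # mean of the second half of val1 within the group
    half = len(x) // 2
    return pd.Series({'latter_half_mean': x['val1'].iloc[half:].mean()})

grouped = df.groupby('gid').apply(func_latter_half)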
I am new to using pandas but want to learn it better. I am currently facing a problem. I have a DataFrame looking like this:
0 1 2
0 chr2L 1 4
1 chr2L 9 12
2 chr2L 17 20
3 chr2L 23 23
4 chr2L 26 27
5 chr2L 30 40
6 chr2L 45 47
7 chr2L 52 53
8 chr2L 56 56
9 chr2L 61 62
10 chr2L 66 80
I want to get something like this:
0 1 2 3
0 chr2L 0 1 0
1 chr2L 1 2 1
2 chr2L 2 3 1
3 chr2L 3 4 1
4 chr2L 4 5 0
5 chr2L 5 6 0
6 chr2L 6 7 0
7 chr2L 7 8 0
8 chr2L 8 9 0
9 chr2L 9 10 1
10 chr2L 10 11 1
11 chr2L 11 12 1
12 chr2L 12 13 0
And so on...
So: split everything into 1-length intervals, filling the gaps with zeros and marking positions covered by the original intervals with ones (if there is an easy way to mark the boundary positions of the original intervals with 0.5 at the same time, that would also be helpful).
In the data there are multiple string values in column 0, and this should be done for each of them separately. They require different lengths of final data (the last position that should get a 0 or a 1 differs). I would appreciate your help dealing with this in pandas.
This works for most of your first paragraph and some of the second. Left as an exercise: finish inserting insideness=0 rows (see end):
import pandas as pd
# dummied-up version of your data, but with column headers for readability:
df = pd.DataFrame({'n':['a']*4 + ['b']*2, 'a':[1,6,8,5,1,5],'b':[4,7,10,5,3,7]})
# splitting up a range, translated into df row terms:
def onebyone(dfrow):
    # dfrow is an (index, Series) pair, as produced by iterrows()
    row = dfrow[1]
    a, b, n = row.a, row.b, row.n
    count = b - a
    if count >= 2:
        interior = [0.5] + [1] * (count - 2) + [0.5]
    elif count == 1:
        interior = [0.5]
    else:  # count == 0: a zero-length interval produces no rows
        interior = []
    return {'n': [n] * count,
            'a': range(a, a + count),
            'b': range(a + 1, a + count + 1),
            'insideness': interior}
Edited to use pd.concat() to combine the intermediate results:
# Into a new dataframe:
intermediate = []
for label in set(df.n):
    for row in df[df.n == label].iterrows():
        intermediate.append(pd.DataFrame(onebyone(row)))
df_onebyone = pd.concat(intermediate)
df_onebyone.index = range(len(df_onebyone))
And finally a sketch of identifying the missing rows, which you can edit to match the above for-loop when adding rows to a final dataframe:
sub = df_onebyone[df_onebyone.n == 'a']
# for times in the overall range describing 'a'
for i in range(int(sub.a.min()), int(sub.a.max())):
    # if a time isn't covered by an existing 0.5-1-0.5 range:
    if i not in sub.a.values:
        # these are the values to fill in a 0-row
        print('%d, %d, 0' % (i, i + 1))
Or, if you know the a column will be sorted for each n, you could keep track of the last end value handled by onebyone() and insert some extra rows to catch up to the next start value you pass to onebyone().
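For the remaining gap-filling step, a compact sketch for one label, assuming sub is the n == 'a' slice of df_onebyone built above: left-merge a complete 1-length grid against the existing rows and fill the holes with insideness 0:
full = pd.DataFrame({'a': range(int(sub.a.min()), int(sub.b.max()))})
full['b'] = full['a'] + 1
out = full.merge(sub[['a', 'insideness']], on='a', how='left')
out['insideness'] = out['insideness'].fillna(0)
out.insert(0, 'n', 'a')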