Cumsum until value exceeds certain number:
Say we have two data frames, A and B, that look like this:
A = pd.DataFrame({"type":['a','b','c'], "value":[100, 50, 30]})
B = pd.DataFrame({"type": ['a','a','a','a','b','b','b','c','c','c','c','c'], "value": [10,50,45,10,45,10,5,6,6,8,12,10]})
The two data frames would look like this.
>>> A
  type  value
0    a    100
1    b     50
2    c     30
>>> B
   type  value
0     a     10
1     a     50
2     a     45
3     a     10
4     b     45
5     b     10
6     b      5
7     c      6
8     c      6
9     c      8
10    c     12
11    c     10
For each group in "type" in data frame A, I would like to add up the value column in B until it reaches the number specified in the value column of A, and count the number of rows in B that were added. I've been trying to use cumsum(), but I don't know exactly how to stop the sum once the value is reached.
The output should be:
  type  value
0    a      3
1    b      2
2    c      4
Thank you,
Merging the two data frames beforehand helps:
import pandas as pd

df = pd.merge(B, A, on='type')  # value_x comes from B, value_y is the target from A
df['cumsum'] = df.groupby('type')['value_x'].cumsum()
# keep a row while the running total *before* it is still below the target;
# B and df share the same row order after the merge, so the mask can index B directly
mask = df.groupby('type')['cumsum'].shift().fillna(0) < df['value_y']
B[mask].groupby('type').count()
#       value
# type
# a         3
# b         2
# c         4
Assuming B['type'] is sorted, as in the sample case, here's a NumPy-based solution:
import numpy as np

# map each row of B to its group's position in A (both sorted by type)
IDs = np.searchsorted(A['type'], B['type'])
# per-group totals of B['value'], accumulated across groups
count_cumsum = np.bincount(IDs, B['value']).cumsum()
# per-group target, offset by everything summed in earlier groups
upper_bound = A['value'] + np.append(0, count_cumsum[:-1])
Bv_cumsum = np.cumsum(B['value'])
grp_start = np.unique(IDs, return_index=True)[1]  # index where each group starts in B
A['output'] = np.searchsorted(Bv_cumsum, upper_bound) - grp_start + 1
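For the sample frames above, this should produce the expected counts:
>>> A
  type  value  output
0    a    100       3
1    b     50       2
2    c     30       4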
I am trying to create a machine learning model and teaching myself as I go. I will be working with a large dataset, but before I get to that, I am practicing with a smaller dataset to make sure everything is working as expected. I will need to swap half of the rows of two columns in my dataset, and I am not sure how to accomplish this.
Say I have a dataframe like the below:
index  number  letter
0      1       A
1      2       B
2      3       C
3      4       D
4      5       E
5      6       F
I want to randomly swap half of the rows of the number and letter columns, so one output could look like this:
index  number  letter
0      1       A
1      B       2
2      3       C
3      D       4
4      5       E
5      F       6
Is there a way to do this in Python?
edit: thank you for all of your answers, I greatly appreciate it! :)
Here's one way to implement this.
import pandas as pd
from random import sample

df = pd.DataFrame({'index': range(6), 'number': range(1, 7), 'letter': [*'ABCDEF']}).set_index('index')
n = len(df)
idx = sample(range(n), k=n//2)               # randomly select which rows to switch
df.iloc[idx, :] = df.iloc[idx, ::-1].values  # switch those rows
An example result is
number letter
index
0 1 A
1 2 B
2 C 3
3 4 D
4 E 5
5 F 6
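A small variation, if you'd rather avoid the extra import: pandas' own sample can pick the rows. A sketch of the same idea:
idx = df.sample(frac=0.5).index            # pick half of the index labels at random
df.loc[idx, :] = df.loc[idx, ::-1].values  # reverse the column order for those rows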
Update
To randomly select rows, use np.random.choice:
import numpy as np
idx = np.random.choice(df.index, len(df) // 2, replace=False)
df.loc[idx, ['letter', 'number']] = df.loc[idx, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 2 B
2 3 C
3 D 4
4 E 5
5 F 6
Old answer
You can try:
df.loc[df.index % 2 == 1, ['letter', 'number']] = \
df.loc[df.index % 2 == 1, ['number', 'letter']].to_numpy()
print(df)
# Output
number letter
0 1 A
1 B 2
2 3 C
3 D 4
4 5 E
5 F 6
For more readability, use an intermediate variable as a boolean mask:
mask = df.index % 2 == 1
df.loc[mask, ['letter', 'number']] = df.loc[mask, ['number', 'letter']].to_numpy()
You can create a copy of your original data, sample it, and then update it in place, converting to a NumPy ndarray to prevent index alignment from occurring.
swapped_df = df.copy()
sample = swapped_df.sample(frac=0.5, random_state=0)
swapped_df.loc[sample.index, ['number', 'letter']] = sample[['letter', 'number']].to_numpy()
print(swapped_df)
number letter
index
0 1 A
1 B 2
2 C 3
3 4 D
4 E 5
5 6 F
Similar to previous answers but slightly more readable (in my opinion) if you are trying to build your sense for basic pandas operations:
rows_to_change = df.sample(frac=0.5)
rows_to_change = rows_to_change.rename(columns={'number':'letter', 'letter':'number'})
df.loc[rows_to_change.index] = rows_to_change
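If you need the same swap on every run (for example while debugging), you can seed the sampler; 42 below is an arbitrary example seed:
rows_to_change = df.sample(frac=0.5, random_state=42)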
I have the following dummy data frame:
df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]],columns = ['A','B','C'])
A B C
0 1 50 60
1 5 70 80
2 2 120 30
3 3 125 450
4 5 80 90
5 4 100 200
6 2 1000 2000
7 1 10 20
I am using a for loop in Python at the moment, and I would like to know whether a for loop can generate multiple results. I would like to break up the above data frame with a for loop: for each unique value in column A, I want a new data frame, sorted by column B and with column C multiplied by 2:
df1 =
A B C
1 10 40
1 50 120
df2 =
A B C
2 120 60
2 1000 4000
df3 =
A B C
3 125 900
df4 =
A B C
4 100 400
df5 =
A B C
5 70 160
5 80 180
I am not sure if this can be done in Python. Normally I use MATLAB, and for this I tried the following in my Python script:
def f(df):
    for i in np.unique(df['A'].values):
        df = df.sort_values(['A','B'])
        df = df['C'].assign(C=lambda x: x.C*2)
        print(df)
Of course this is wrong, since it will not generate multiple results as df1, df2, ..., df5 (it is important that these variables end in 1, 2, ..., 5 so they can be traced back to column A of the data frame). Could anyone help me please? I understand that this can easily be done without a for loop (vectorization), but I have many unique values in column A, I would like to run a for loop over them, and I would also like to learn more about loops in Python. Many thanks.
Using DataFrame.groupby is faster than iterating over Series.unique.
Optionally, you can save the data frames in a dictionary.
The advantage of a dictionary over a list is that each key matches a value in A:
df2 = df.copy()
df2['C'] = df2['C'] * 2
df2 = df2.sort_values('B')
dfs = {i: group for i, group in df2.groupby('A')}
Access the dictionary based on the value in A:
for key in dfs:
    print(f'dfs[{key}]')
    print(dfs[key])
    print('_' * 20)
dfs[1]
   A   B    C
7  1  10   40
0  1  50  120
____________________
dfs[2]
   A     B     C
2  2   120    60
6  2  1000  4000
____________________
dfs[3]
   A    B    C
3  3  125  900
____________________
dfs[4]
   A    B    C
5  4  100  400
____________________
dfs[5]
   A   B    C
1  5  70  160
4  5  80  180
Sort and multiply before chunking into pieces:
df['C'] = 2* df['C']
[group for name, group in df.sort_values(by=['A','B']).groupby('A')]
Or if you want a dict:
{name: group for name, group in df.sort_values(by=['A','B']).groupby('A')}
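For example, to pull out just the A == 3 chunk from the dict version:
chunks = {name: group for name, group in df.sort_values(by=['A','B']).groupby('A')}
print(chunks[3])
#    A    B    C
# 3  3  125  900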
I have a similar answer to Ansev's:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,50,60],[5,70,80],[2,120,30],[3,125,450],[5,80,90],[4,100,200],[2,1000,2000],[1,10,20]], columns=['A','B','C'])
A = np.unique(df['A'].values)
df_result = []
for a in A:
    df1 = df.loc[df['A'] == a]
    df1 = df1.sort_values('B')
    df1 = df1.assign(C=lambda x: x.C*2)
    df_result += [df1]
I am still unable to automate naming the results as df_result1, df_result2, ..., df_result5; what I can do is call the result of each loop iteration as df_result[0], df_result[1], ..., df_result[4].
What you want to do is group by column A and then store each resulting data frame in a dict indexed by the value of A. Code to do that would be:
df_dict = {}
for ix, gp in df.groupby('A'):
    new_df = gp.sort_values('B')
    new_df['C'] = 2 * new_df['C']
    df_dict[ix] = new_df
Then the variable df_dict contains all the resulting data frames, sorted by column B and with column C multiplied by 2. For example:
print(df_dict[1])
   A   B    C
7  1  10   40
0  1  50  120
With pandas data frames, how can I change a generic text column A into an integer column A, so that some word becomes some value? I'm not asking to calculate a value; I'm asking to replace some text with some number.
table before          table after
     A                     A
0    w                0    2
1    q                1   11
2   st                2    1
3    R                3    7
4  Prt                4    6
Replace it so that R becomes 7, st becomes 1, and so on.
Pseudo code:
df['A'] = df.convert('w'=2, 'q'=11, 'st'=1 )
You can use replace with a dictionary indicating how the replacement should be done:
import pandas as pd
df_before = pd.DataFrame({'A':['w','q','st','R','Prt']})
d = {'w':2, 'q':11, 'st':1, 'R': 7, 'Prt':6}
df_after = df_before.replace(d)
print(df_after)
Output:
A
0 2
1 11
2 1
3 7
4 6
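If every value in A is covered by the dictionary, Series.map is an alternative; note that, unlike replace, map turns unmapped values into NaN:
df_after = df_before.assign(A=df_before['A'].map(d))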
I am working with a pandas data frame and have the following:
import numpy as np
import pandas as pd

data = pd.DataFrame()
data['HomeTeam'] = ['A','B','C','D','E']
data['AwayTeam'] = ['E','D','A','B','C']
data['HomePoint'] = [1,3,0,1,3]
data['AwayPoint'] = [1,0,3,1,0]
data['Match'] = data['HomeTeam'].astype(str) + ' Vs ' + data['AwayTeam'].astype(str)

# I want to duplicate the matches
Nsims = 2
data_Dub = pd.DataFrame(np.tile(data, (Nsims, 1)))  # pd.np was removed from pandas; use numpy directly
data_Dub.columns = data.columns

# Then I will assign the stage of the match
data_Dub['SimStage'] = data_Dub.groupby('Match').cumcount()
What I want to do is sum the HomePoint and AwayPoint obtained by each team and save the result to a new data frame: HomePoint and AwayPoint are added together for the same team, per SimStage, for all 5 teams in the data frame. Can anyone advise how to do it? I used the following code and it does not work:
Point = data_Dub.groupby(['SimStage','HomeTeam','AwayTeam'])['HomePoint','AwayPoint'].sum()
Thanks.
You can aggregate sum separately for HomeTeam and AwayTeam, then use add; lastly, sort_index and reset_index to turn the MultiIndex levels into columns, rename the team column, and reorder the columns if necessary:
a = data_Dub.groupby(['AwayTeam', 'SimStage'])['AwayPoint'].sum()
b = data_Dub.groupby(['HomeTeam', 'SimStage'])['HomePoint'].sum()
s = a.add(b).rename('Point')
df = s.sort_index(level=[1, 0]).reset_index().rename(columns={'AwayTeam': 'Team'})
df = df[['Team', 'Point', 'SimStage']]
print(df)
Team Point SimStage
0 A 4 0
1 B 4 0
2 C 0 0
3 D 1 0
4 E 4 0
5 A 4 1
6 B 4 1
7 C 0 1
8 D 1 1
9 E 4 1
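As a variation on the same idea, here is a sketch that stacks the home and away records into one long frame before a single groupby (the column names mirror those above):
home = data_Dub[['HomeTeam', 'HomePoint', 'SimStage']].rename(columns={'HomeTeam': 'Team', 'HomePoint': 'Point'})
away = data_Dub[['AwayTeam', 'AwayPoint', 'SimStage']].rename(columns={'AwayTeam': 'Team', 'AwayPoint': 'Point'})
points = pd.concat([home, away]).groupby(['SimStage', 'Team'], as_index=False)['Point'].sum()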
I have a table with records which looks like this:
buyer=[name,value]
It is possible for any name to have several records in the table:
Name Value
E 10
A 2
D 4
E 10
A 5
B 3
B 10
D 10
C 4
I am trying to filter this table based on the following logic: select all records for those names whose maximum value is not larger than 5. Based on the above example, I would select all records for names A and C, because their maxima are 5 and 4 respectively:
Name Value
A 2
A 5
C 4
B, D and E will be excluded, because their maxima are 10 (for each of them).
How can I do this?
Simply with the pandas module. Let's say your data is in a test.csv file:
import pandas as pd

df = pd.read_csv("test.csv", delim_whitespace=True)
# True for every row whose group's maximum Value is at most 5
idx = df.groupby('Name')['Value'].transform('max') <= 5
print(df[idx])
The output:
Name Value
1 A 2
4 A 5
8 C 4
For details, see the documentation for pandas.DataFrame.groupby.
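An equivalent sketch with GroupBy.filter, which keeps only the groups that satisfy a predicate:
print(df.groupby('Name').filter(lambda g: g['Value'].max() <= 5))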