I have been looking for a way to do this but haven't found anything that works in Python/pandas.
I want to loop through the rows until a 1 appears again in column A, concatenating the values of column B along the way, and place the running result in a third column.
For Example:
df
'A' 'B' 'C'
1 4 4
2 3 43
3 1 431
4 2 4312
1 5 5
2 4 54
1 2 2
2 2 22
3 4 224
Here is my attempt so far (pseudocode):
if df['A'] == 1:
    df['C'] = df['B']
else:
    df['C'] = previous row's df['C'] concatenated with df['B']
If you can't tell, this is my first time trying to write a loop.
Any help generating column C from columns A and B using Python would be much appreciated.
Thank you,
James
This can achieve what you need: create a new key using cumsum, then group by that key and use cumsum again (on strings) within each group:
df.groupby(df.A.eq(1).cumsum()).B.apply(lambda x : x.astype(str).cumsum())
Out[838]:
0 4
1 43
2 431
3 4312
4 5
5 54
6 2
7 22
8 224
Name: B, dtype: object
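To put that result back into the dataframe as column C (as in the desired output), here is a minimal sketch built on the same idea; note that, depending on your pandas version, the apply result may carry a group level in its index, so assigning the raw .values sidesteps index alignment:
import pandas as pd

# sample data from the question
df = pd.DataFrame({'A': [1, 2, 3, 4, 1, 2, 1, 2, 3],
                   'B': [4, 3, 1, 2, 5, 4, 2, 2, 4]})

# A == 1 starts a new block; the cumulative sum of that boolean gives one key per block
key = df['A'].eq(1).cumsum()

# cumulative string concatenation of B within each block
c = df.groupby(key)['B'].apply(lambda x: x.astype(str).cumsum())

# assign positionally to avoid any index-alignment surprises
df['C'] = c.values
print(df)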
Is there an efficient way to change the value of a previous row whenever a condition is met in a subsequent entry? Specifically, I am wondering if there is any way to adapt pandas.where to modify the entry in the row prior to (or after) the one where the condition is met. Suppose
import pandas as pd
Data={'Energy':[12,13,14,12,15,16],'Time':[2,3,4,2,5,6]}
DF = pd.DataFrame(Data)
DF
Out[123]:
Energy Time
0 12 2
1 13 3
2 14 4
3 12 2
4 15 5
5 16 6
If I wanted to change the value of Energy to 'X' whenever Time <= 2, I could just do something like:
DF['Energy'] = DF['Energy'].where(DF['Time'] > 2, 'X')
or
DF.loc[DF['Time']<=2,'Energy']='X'
Which would output
Energy Time
0 X 2
1 13 3
2 14 4
3 X 2
4 15 5
5 16 6
But what if I want to change the value of 'Energy' in the row after Time <= 2, so that the output would actually be:
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Is there an easy modification for a vectorized approach to this?
Shift the values one row down using Series.shift and then compare:
df.loc[df['Time'].shift() <= 2, 'Energy'] = 'X'
df
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Side note: I assume 'X' is actually something else here, but FYI, mixing strings and numeric data leads to object type columns which is a known pandas anti-pattern.
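A self-contained version of the above, sketched with NaN as the marker instead of the string 'X' so the column stays numeric (substitute 'X' if the letter is really required):
import numpy as np
import pandas as pd

DF = pd.DataFrame({'Energy': [12, 13, 14, 12, 15, 16],
                   'Time':   [2, 3, 4, 2, 5, 6]})

# shift Time down one row so the test "previous row's Time <= 2"
# lines up with the row we actually want to modify
mask = DF['Time'].shift() <= 2
DF.loc[mask, 'Energy'] = np.nan   # use 'X' here if a string marker is required
print(DF)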
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df), like the following, but perform the operation in place. I want to ensure that only the new df object is kept in memory after the assignment. What would be an efficient way of doing this?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B.: This question is related to Keeping the last N duplicates in pandas.
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
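A minimal sketch of that pattern with the sample data; the trailing .copy() is optional and only a defensive assumption, in case you want to be certain the trimmed frame keeps no internal references to the original's data:
import pandas as pd

df = pd.DataFrame({'value': list('abccbxdeda'),
                   'Group': [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
                   'something': [1, 2, 4, 9, 10, 5, 3, 5, 10, 5]})

# rebind df to the trimmed result; the old frame becomes unreferenced
# and is garbage-collected by Python
df = df.groupby('Group').tail(3).copy()
print(df)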
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict

def f(s):
    # Walk the group labels from the bottom up, counting occurrences of
    # each group; yield the index of every row beyond the last 3 of its group.
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i

# Drop everything except the last 3 rows of each group, in place.
df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as said earlier in the comments, you are overwriting the existing df; to avoid that, assign the result to a new variable like below:
df_new = df.groupby('Group').tail(3)
However, out of curiosity: if you are not concerned about the groupby and only want the last N rows of the df, you can do it like below:
df[-2:]  # last 2 rows
I'm using the following dask.dataframe AID:
AID FID ANumOfF
0 1 X 1
1 1 Y 5
2 2 Z 6
3 2 A 1
4 2 X 11
5 2 B 18
I know in a pandas dataframe I could use:
AID.groupby('AID')['ANumOfF'].transform('sum')
to get:
0 6
1 6
2 36
3 36
4 36
5 36
I want to do the same with a dask.dataframe, which usually supports the same functions as a pandas dataframe, but in this instance it gives me the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'transform'
It could be one of two things: either dask doesn't support it, or it's because I'm using Python 3?
I tried the following code:
AID.groupby('AID')['ANumOfF'].sum()
but that just gives me the sum of each group like this:
AID
1 6
2 36
I need it to be like the output above, where the sum is repeated in each row. My question is: if transform isn't supported, is there another way I could achieve the same result?
I think you can use join:
s = AID.groupby('AID')['ANumOfF'].sum()
AID = AID.set_index('AID').drop('ANumOfF', axis=1).join(s).reset_index()
print (AID)
AID FID ANumOfF
0 1 X 6
1 1 Y 6
2 2 Z 36
3 2 A 36
4 2 X 36
5 2 B 36
Or a faster solution with map, using the aggregated Series (or a dict):
s = AID.groupby('AID')['ANumOfF'].sum()
#a bit faster
#s = AID.groupby('AID')['ANumOfF'].sum().to_dict()
AID['ANumOfF'] = AID['AID'].map(s)
print (AID)
AID FID ANumOfF
0 1 X 6
1 1 Y 6
2 2 Z 36
3 2 A 36
4 2 X 36
5 2 B 36
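Since the question is about dask.dataframe, the same join idea can also be expressed with a merge, which dask supports; a sketch assuming dask is installed, with an arbitrary two-partition split and an illustrative GroupSum column name:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'AID': [1, 1, 2, 2, 2, 2],
                    'FID': ['X', 'Y', 'Z', 'A', 'X', 'B'],
                    'ANumOfF': [1, 5, 6, 1, 11, 18]})
AID = dd.from_pandas(pdf, npartitions=2)

# per-group totals, then broadcast back onto every row with a merge
totals = AID.groupby('AID')['ANumOfF'].sum().to_frame('GroupSum')
result = AID.merge(totals, left_on='AID', right_index=True)
print(result.compute())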
Currently Dask supports transform, however there may be issues with indexes (depending on the original dataframe); see this PR #5327.
So your code should work:
AID.groupby('AID')['ANumOfF'].transform('sum')
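A hedged sketch of that, wrapping the question's data in a dask.dataframe; depending on your Dask version you may need to pass meta explicitly, and grouping by a column can trigger a shuffle:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'AID': [1, 1, 2, 2, 2, 2],
                    'FID': ['X', 'Y', 'Z', 'A', 'X', 'B'],
                    'ANumOfF': [1, 5, 6, 1, 11, 18]})
AID = dd.from_pandas(pdf, npartitions=2)

# meta declares the output name/dtype up front; recent Dask versions can often
# infer it, but being explicit avoids a warning
out = AID.groupby('AID')['ANumOfF'].transform('sum', meta=('ANumOfF', 'int64'))
print(out.compute())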
I realize this question is similar to join or merge with overwrite in pandas, but the accepted answer does not work for me since I want to use the on='keys' from df.join().
I have a DataFrame df which looks like this:
keys values
0 0 0.088344
1 0 0.088344
2 0 0.088344
3 0 0.088344
4 0 0.088344
5 1 0.560857
6 1 0.560857
7 1 0.560857
8 2 0.978736
9 2 0.978736
10 2 0.978736
11 2 0.978736
12 2 0.978736
13 2 0.978736
14 2 0.978736
Then I have a Series s (which is the result of some df.groupby.apply()) with the same keys:
keys
0 0.183328
1 0.239322
2 0.574962
Name: new_values, dtype: float64
Basically I want to replace the 'values' in df with the values in the Series, matched by keys, so every block with the same key gets the same new value. Currently, I do it as follows:
df = df.join(s, on='keys')
df['values'] = df['new_values']
df = df.drop('new_values', axis=1)
The obtained (and desired) result is then:
keys values
0 0 0.183328
1 0 0.183328
2 0 0.183328
3 0 0.183328
4 0 0.183328
5 1 0.239322
6 1 0.239322
7 1 0.239322
8 2 0.574962
9 2 0.574962
10 2 0.574962
11 2 0.574962
12 2 0.574962
13 2 0.574962
14 2 0.574962
That is, I add it as a new column, and by using on='keys' it gets the correct shape. Then I assign values to be new_values and remove the new_values column. This of course works perfectly, the only problem being that I find it extremely ugly.
Is there a better way to do this?
You could try something like:
df = df[df.columns[df.columns!='values']].join(s, on='keys')
Make sure s is named 'values' instead of 'new_values'.
To my knowledge, pandas doesn't have the ability to join with "force overwrite" or "overwrite with warning".
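A minimal self-contained sketch of that suggestion, with the Series renamed to 'values' before the join (data mirrors the question):
import pandas as pd

df = pd.DataFrame({'keys': [0]*5 + [1]*3 + [2]*7,
                   'values': [0.088344]*5 + [0.560857]*3 + [0.978736]*7})
s = pd.Series([0.183328, 0.239322, 0.574962],
              index=pd.Index([0, 1, 2], name='keys'), name='new_values')

# give the Series the final column name, drop the old column, then join on 'keys'
s = s.rename('values')
df = df[df.columns[df.columns != 'values']].join(s, on='keys')
print(df)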
I am trying to add several columns of data to an existing dataframe. The dataframe itself was built from a number of other dataframes, which I successfully joined on indices, which were identical. For that, I used code like this:
data = p_data.join(r_data)
I actually joined these on a multi-index, so the dataframe looks something like the following, where Name1 and Name2 are indices:
Name1 Name2 present r behavior
a 1 1 0 0
2 1 .5 2
4 3 .125 1
b 2 1 0 0
4 5 .25 4
8 1 0 1
So the Name1 index does not repeat data, but the Name2 index does (I'm using this to keep track of dyads, so that Name1 & Name2 together are only represented once). What I now want to add are 4 columns of data that correspond to Name2 data (information on the second member of the dyad). Unlike the "present" "r" and "behavior" data, these data are per individual, not per dyad. So I don't need to consider Name1 data when merging.
The problem is that while Name2 data are repeated to exhaust the dyad combos, the "Name2" column in the data I would now like to add only has one piece of data per Name2 individual:
Name2 Data1 Data2 Data3
1 80 6 1
2 61 8 3
4 45 7 2
8 30 3 6
What I would like the output to look like:
Name1 Name2 present r behavior Data1 Data2 Data3
a 1 1 0 0 80 6 1
2 1 .5 2 61 8 3
4 3 .125 1 45 7 2
b 2 1 0 0 61 8 3
4 5 .25 4 45 7 2
8 1 0 1 30 3 6
Despite reading the documentation, I am not clear on whether I can use join() or merge() for the desired outcome. If I try a join to the existing dataframe like the simple one I've used previously, I end up with the new columns but they are full of NaN values. I've also tried various combinations using Name1 and Name2 as either columns or as indices, with either join or merge (not as random as it sounds, but I'm clearly not interpreting the documentation correctly!). Your help would be very much appreciated, as I am presently very much lost.
I'm not sure if this is the best way, but you could use reset_index to temporarily make your original DataFrame indexed by Name2 only. Then you could perform the join as usual. Then use set_index to again make Name1 part of the MultiIndex:
import pandas as pd
df = pd.DataFrame({'Name1':['a','a','a','b','b','b'],
'Name2':[1,2,4,2,4,8],
'present':[1,1,3,1,5,1]})
df.set_index(['Name1','Name2'], inplace=True)
df2 = pd.DataFrame({'Data1':[80,61,45,30],
'Data2':[6,8,7,3]},
index=pd.Series([1,2,4,8], name='Name2'))
result = df.reset_index(level=0).join(df2).set_index('Name1', append=True)
print(result)
# present Data1 Data2
# Name2 Name1
# 1 a 1 80 6
# 2 a 1 61 8
# b 1 61 8
# 4 a 3 45 7
# b 5 45 7
# 8 b 1 30 3
To make the result look even more like your desired DataFrame, you could reorder the index levels and sort the index (on modern pandas use sort_index(); the older .sort(axis=0) has been removed):
print(result.reorder_levels([1,0], axis=0).sort_index())
# present Data1 Data2
# Name1 Name2
# a 1 1 80 6
# 2 1 61 8
# 4 3 45 7
# b 2 1 61 8
# 4 5 45 7
# 8 1 30 3
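On newer pandas versions (0.23+), DataFrame.join can also take an index level name in on=, which may let you skip the reset_index/set_index round-trip entirely; a hedged sketch with the same data:
import pandas as pd

df = pd.DataFrame({'Name1': ['a','a','a','b','b','b'],
                   'Name2': [1, 2, 4, 2, 4, 8],
                   'present': [1, 1, 3, 1, 5, 1]}).set_index(['Name1', 'Name2'])
df2 = pd.DataFrame({'Data1': [80, 61, 45, 30],
                    'Data2': [6, 8, 7, 3]},
                   index=pd.Series([1, 2, 4, 8], name='Name2'))

# join on the 'Name2' level of the MultiIndex; the (Name1, Name2) index is preserved
result = df.join(df2, on='Name2')
print(result)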