Pandas: How to add number of the row within grouped rows - python

So I have a DataFrame:
>>> df2
     text
0 0     a
0 1     b
0 2     c
0 3     d
1 4     e
1 5     f
1 6     g
2 7     h
2 8     i
How do I create another column that contains a counter for each row within each level=0 index group?
I have tried the following code (I need to get a df['counter'] column):
current_index = ''
for index, row in df.iterrows():
    if index[0] != current_index:
        current_index = index[0]
        df[(df.index == current_index)]['counter'] = np.arange(len(df[(df.index == current_index)].index))
and the following code as well:
df2 = pd.DataFrame()
for group, df0 in df1.groupby('level_0_column'):
    df0 = df0.sort_values(by=['level_1_column'])
    df0['counter'] = list(df0.reset_index().index.values + 1)
    df2 = df2.append(df0)
I have around 650K rows in the DataFrame... it goes into an infinite loop. Please advise.

I believe you're looking for groupby on the 0th index level plus cumcount:
df['counter'] = df.groupby(level=0).cumcount() + 1
df
     text  counter
0 0     a        1
  1     b        2
  2     c        3
  3     d        4
1 4     e        1
  5     f        2
  6     g        3
2 7     h        1
  8     i        2
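For a quick self-contained check, here is a minimal sketch that rebuilds a MultiIndex frame like the one above (index values taken from the question) and applies the cumcount approach:
import pandas as pd

# Rebuild the two-level index shown in the question.
idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (1, 6), (2, 7), (2, 8)]
)
df = pd.DataFrame({'text': list('abcdefghi')}, index=idx)

# cumcount numbers rows 0..n-1 within each level-0 group; +1 makes it 1-based.
df['counter'] = df.groupby(level=0).cumcount() + 1
print(df)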

Related

Comparing the value of a column with the previous value of a new column using Apply in Python (Pandas)

I have a dataframe with these values in column A:
df = pd.DataFrame(A, columns=['A'])
A
0 0
1 5
2 1
3 7
4 0
5 2
6 1
7 3
8 0
I need to create a new column (called B) and populate it using the following conditions:
Condition 1: If the value of A is equal to 0 then, the value of B must be 0.
Condition 2: If the value of A is not 0 then I compare its value to the previous value of B. If A is higher than the previous value of B then I take A, otherwise I take B.
The result should be this:
   A  B
0  0  0
1  5  5
2  1  5
3  7  7
4  0  0
5  2  2
6  1  2
7  3  3
8  0  0
The dataset is huge, and using loops would be too slow. I need to solve this without using loops and without the pandas loc function. Could anyone help me solve this using the apply function? I have tried different things without success.
Thanks a lot.
One way to do this, I guess, could be the following:
def do_your_stuff(row):
    global value
    # fancy stuff here
    value = row["B"]
    [...]

value = df.iloc[0]['B']
df["C"] = df.apply(lambda row: do_your_stuff(row), axis=1)
Try this:
df['B'] = df['A'].shift()
df['B'] = df.apply(lambda x:0 if x.A == 0 else x.A if x.A > x.B else x.B, axis=1)
Use .shift() to shift your column one cell down, and check whether the previous value is larger and the current one is not 0. Then use .mask() to replace the values with the previous one while the condition holds.
from io import StringIO
import pandas as pd

wt = StringIO("""A
0  0
1  2
2  3
3  1
4  2
5  7
6  0
""")
df = pd.read_csv(wt, sep=r'\s\s+', engine='python')
df
A
0 0
1 2
2 3
3 1
4 2
5 7
6 0
def func(df, col):
    df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)), other=df[col].shift(1))
    if col == 'B':
        while ((df[col].shift(1) > df[col]) & (df[col] != 0)).any():
            df['B'] = df[col].mask(cond=((df[col].shift(1) > df[col]) & (df[col] != 0)), other=df[col].shift(1))
    return df

(df.pipe(func, 'A').pipe(func, 'B'))
Output:
A B
0 0 0
1 2 2
2 3 3
3 1 3
4 2 3
5 7 7
6 0 0
Using the solution of Achille I solved it this way:
import pandas as pd

A = [0,2,3,0,2,7,2,3,2,20,1,0,2,5,4,3,1]
df = pd.DataFrame(A, columns=['A'])
df['B'] = 0

def function(row):
    global value
    global prev
    if row['A'] == 0:
        value = 0
    elif row['A'] > value:
        value = row['A']
    else:
        value = prev
    prev = value
    return value

value = df.iloc[0]['B']
prev = value
df["B"] = df.apply(lambda row: function(row), axis=1)
df
output:
A B
0 0 0
1 2 2
2 3 3
3 0 0
4 2 2
5 7 7
6 2 7
7 3 7
8 2 7
9 20 20
10 1 20
11 0 0
12 2 2
13 5 5
14 4 5
15 3 5
16 1 5
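Since the recurrence resets at every zero and otherwise keeps a running maximum (B[i] = 0 if A[i] == 0, else max(A[i], B[i-1])), a fully vectorized sketch without apply is also possible, using a cumulative max within zero-delimited segments (my own reformulation, not taken from the answers above):
import pandas as pd

df = pd.DataFrame({'A': [0, 5, 1, 7, 0, 2, 1, 3, 0]})

# Every zero in A starts a new segment; within a segment, B is the running max.
# The zero itself yields 0 because it is the first value of its segment.
segments = df['A'].eq(0).cumsum()
df['B'] = df.groupby(segments)['A'].cummax()
print(df)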

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is to get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match the A and B columns, based on the column B from df_2.
Columns A and B repeat their content after reaching 4. The order matters here, and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2,
                           left_on='idx', right_on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
               .dropna()
               .drop(labels='idx', axis='columns')
               .reset_index(drop=True))
Gets me what I want.
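For reference, a minimal self-contained sketch (frames rebuilt from the question) that confirms this merge_asof call produces the desired result:
import pandas as pd

df_1 = pd.DataFrame({'idx': range(6),
                     'A': [1, 2, 3, 4, 1, 2],
                     'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6),
                     'B': [1, 2, 4, 2, 3, 1],
                     'Y': list('HIJKLM')})

# For each df_1 row, take the closest earlier-or-equal df_2 row with the
# same key (A == B), but only if it is at most 2 positions away.
out = (pd.merge_asof(left=df_1, right=df_2, on='idx',
                     left_by='A', right_by='B',
                     direction='backward', tolerance=2)
         .dropna()
         .drop(columns='idx')
         .reset_index(drop=True))
out['B'] = out['B'].astype(int)  # B was upcast to float by the NaNs
print(out)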
IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'], right_on=['idx', 'B'])

pandas matching database with string keeping index of database

I have a database with strings and the index as below.
df0
idx name_id_code string_line_0
0 0.01 A
1 0.5 B
2 77.6 C
3 29.8 D
4 56.2 E
5 88.1000005 F
6 66.4000008 G
7 2.1 H
8 99 I
9 550.9999999 J
df1
idx string_line_1
0 A
1 F
2 J
3 G
4 D
Now I want to match df1 against df0, taking the rows where the values match, but keeping the original index of df0, as below:
df_result
   name_id_code string_line_0
0  0.01         A
5  88.1000005   F
9  550.9999999  J
6  66.4000008   G
3  29.8         D
I tried with my code, but it didn't work for strings and only matched the index:
c = df0['name_id_code'] + ' (' + df0['string_line_0'].astype(str) + ')'
out = df1[df1['string_line_1'].isin(c)]
I also tried to keep it simple and just match the last column:
c = df0['string_line_0'].astype(str) + ')'
out = df1[df1['string_line_1'].isin(c)]
but the output is blank.
Because the DataFrame being filtered is df0, the index values are not changed if you use Series.isin with df1['string_line_1']; the row order, though, stays like in the original df0:
out = df0[df0['string_line_0'].isin(df1['string_line_1'])]
print (out)
name_id_code string_line_0
idx
0 0.010000 A
3 29.800000 D
5 88.100001 F
6 66.400001 G
9 551.000000 J
Or, if you use DataFrame.merge, then to avoid losing df0.index it is necessary to add DataFrame.reset_index:
out = (df1.rename(columns={'string_line_1':'string_line_0'})
          .merge(df0.reset_index(), on='string_line_0'))
print (out)
string_line_0 idx name_id_code
0 A 0 0.010000
1 F 5 88.100001
2 J 9 551.000000
3 G 6 66.400001
4 D 3 29.800000
A similar solution, only ending up with the same values in both the string_line_0 and string_line_1 columns:
out = (df1.merge(df0.reset_index(), left_on='string_line_1', right_on='string_line_0'))
print (out)
string_line_1 idx name_id_code string_line_0
0 A 0 0.010000 A
1 F 5 88.100001 F
2 J 9 551.000000 J
3 G 6 66.400001 G
4 D 3 29.800000 D
You can do:
out = df0.loc[(df0["string_line_0"].isin(df1["string_line_1"]))].copy()
out["string_line_0"] = pd.Categorical(out["string_line_0"], categories=df1["string_line_1"].unique())
out.sort_values(by=["string_line_0"], inplace=True)
The first line filters df0 to just the rows where string_line_0 is in the string_line_1 column of df1.
The second line converts string_line_0 in the output df to a Categorical feature, which is then custom-sorted by the order of the values in df1.
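As a quick check, a minimal runnable sketch of this approach with the data from the question:
import pandas as pd

df0 = pd.DataFrame({'name_id_code': [0.01, 0.5, 77.6, 29.8, 56.2, 88.1000005,
                                     66.4000008, 2.1, 99, 550.9999999],
                    'string_line_0': list('ABCDEFGHIJ')})
df1 = pd.DataFrame({'string_line_1': list('AFJGD')})

out = df0.loc[df0['string_line_0'].isin(df1['string_line_1'])].copy()
# Ordering the categories by df1's values makes sort_values follow df1's order.
out['string_line_0'] = pd.Categorical(out['string_line_0'],
                                      categories=df1['string_line_1'].unique())
out.sort_values(by=['string_line_0'], inplace=True)
print(out)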

Find begin and end index of consecutive ones in pandas dataframe

I have the following dataframe:
A B C
0 1 1 1
1 0 1 0
2 1 1 1
3 1 0 1
4 1 1 0
5 1 1 0
6 0 1 1
7 0 1 0
of which I want to know the start and end index where the values are 1 for 3 or more consecutive rows, per column. Desired outcome:
Column From To
A 2 5
B 1 3
B 4 7
First I filter out the ones that are not part of a run of 3 or more consecutive values:
filtered_df = df.copy().apply(filter, threshold=3)
where
def filter(col, threshold=3):
    mask = col.groupby((col != col.shift()).cumsum()).transform('count').lt(threshold)
    mask &= col.eq(1)
    col.update(col.loc[mask].replace(1, 0))
    return col
filtered_df now looks as follows:
A B C
0 0 1 0
1 0 1 0
2 1 1 0
3 1 0 0
4 1 1 0
5 1 1 0
6 0 1 0
7 0 1 0
If the dataframe had only one column with zeros and ones, the result could be achieved as in How to use pandas to find consecutive same data in time series. However, I am struggling to do something similar for multiple columns at once.
Use DataFrame.pipe to apply a function to the whole DataFrame.
The first solution gets the first and last index of each run of consecutive 1s per column, appends the output to a list, and concatenates at the end:
def f(df, threshold=3):
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        g = (df[col] != df[col].shift()).cumsum()[m]
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        output = filt.groupby(col)['index'].agg(['first', 'last'])
        output.insert(0, 'col', col)
        out.append(output)
    return pd.concat(out, ignore_index=True)
Or first reshape with unstack and then apply the solution:
def f(df, threshold=3):
    df1 = df.unstack().rename_axis(('col', 'idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                     .agg(['first', 'last'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
filtered_df = df.pipe(f, threshold=3)
print (filtered_df)
col first last
0 A 2 5
1 B 0 2
2 B 4 7
filtered_df = df.pipe(f, threshold=2)
print (filtered_df)
col first last
0 A 2 5
1 B 0 2
2 B 4 7
3 C 2 3
You can use rolling to create a window over the data frame. Then you can apply all your conditions and shift the window back to its start location:
length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
mask = mask.shift(1 - length)
print(mask)
which prints:
A B C
0 False True False
1 False False False
2 True False False
3 True False False
4 False True False
5 False True False
6 NaN NaN NaN
7 NaN NaN NaN
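If the From/To table is still needed, one possible follow-up sketch (my addition, not part of the original answer) groups consecutive True values in the mask and adds the window length back to recover the end index:
import pandas as pd

length = 3
m = mask.fillna(False).astype(bool)  # the shifted window mask from above

records = []
for col in m.columns:
    s = m[col]
    run_id = (s != s.shift()).cumsum()  # label each run of equal values
    for _, block in s[s].groupby(run_id[s]):
        records.append({'Column': col,
                        'From': block.index[0],
                        'To': block.index[-1] + length - 1})

print(pd.DataFrame(records))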

sum for every row values through columns pandas

This is my DataFrame, and I want to sum the values through columns A, B, C, D for every row and append a column 'Summ':
A B C D Summ
0 1 1 0 0 1+1+0+0
1 0 0 1 1 0+0+1+1
2 0 0 1 0 0+0+1+0
3 1 1 1 1 1+1+1+1
4 1 0 1 0 1+0+1+0
df['Summ'] = df.sum(axis=1)
or better:
df.loc[:, 'Summ'] = df.sum(axis=1)
or for a subset of columns
cols = ['A','B']
df.loc[:, 'Summ'] = df[cols].sum(axis=1)
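A quick end-to-end sketch of both variants on the data from the question (the subset column name is my own choice):
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0, 1, 1],
                   'B': [1, 0, 0, 1, 0],
                   'C': [0, 1, 1, 1, 1],
                   'D': [0, 1, 0, 1, 0]})

df['Summ'] = df.sum(axis=1)            # sum over all columns
cols = ['A', 'B']
df['Summ_AB'] = df[cols].sum(axis=1)   # sum over a subset only
print(df)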
