Add a column results in difference of rows - python

Let's say I have a data frame:
A B
0 a b
1 c d
2 e f
and what I am aiming for is to find the difference between the following rows from column A
Like this:
A B Ic
0 a b (a-a)
1 c d (a-c)
2 e f (a-e)
This is what I tried:
df['dA'] = df['A'] - df['A']
But it doesn't give me the result I needed. Any help would be greatly appreciated.

Select first value by loc by index and column name or iat by column name and position and subtract:
df['Ic'] = df.loc[0,'A'] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
df['Ic'] = df['A'].iat[0] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
Detail:
print (df.loc[0,'A'])
4
print (df['A'].iat[0])
4

Related

Create a dataframe of all combinations of columns names per row based on mutual presence of columns pairs

I'm trying to create a dataframe based on other dataframe and a specific condition.
Given the pandas dataframe above, I'd like to have a two column dataframe, which each row would be the combinations of pairs of words that are different from 0 (coexist in a specific row), beginning with the first row.
For example, for this part of image above, the new dataframe that I want is like de following:
and so on...
Does anyone have some tip of how I can do it? I'm struggling... Thanks!
As you didn't provide a text example, here is a dummy one:
>>> df
A B C D E
0 0 1 1 0 1
1 1 1 1 1 1
2 1 0 0 1 0
3 0 0 0 0 1
4 0 1 1 0 0
you could use a combination of masking, explode and itertools.combinations:
from itertools import combinations
mask = df.gt(0)
series = (mask*df.columns).apply(lambda x: list(combinations(set(x).difference(['']), r=2)), axis=1)
pd.DataFrame(series.explode().dropna().to_list(), columns=['X', 'Y'])
output:
X Y
0 C E
1 C B
2 E B
3 E D
4 E C
5 E B
6 E A
7 D C
8 D B
9 D A
10 C B
11 C A
12 B A
13 A D
14 C B

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match both A and B columns, based on on the column Bfrom df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx', left_by='A', right_by='B', direction='backward', tolerance=2).dropna().drop(labels='idx', axis='columns').reset_index(drop=True)
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
left_on=['idx', 'A'], right_on=['idx', 'B'])

How to get top 5 items for each group in grouped dataframe?

df = pd.DataFrame({'Weekday':list('MMMMMMMMMMTTTTTTTTTT'),
'Items': list("AAABBCDEFGBBBCCADEFG")
})
grouped = df.groupby(['Weekday','Items'],sort=True).agg({'Items': 'count'})
Then, I get the result of grouped:
Weekday Items
M A 3
B 2
C 1
D 1
E 1
F 1
G 1
T A 1
B 3
C 2
D 1
E 1
F 1
G 1
So how to output the top 5 items for each "weekdays" (5 for 'M' and 'T'), like:
Weekday Items
M A 3
B 2
C 1
D 1
E 1
T
B 3
C 2
A 1
D 1
E 1
Anyone can help this?
df = pd.DataFrame({'Weekday':list('MMMMMMMMMMTTTTTTTTTT'),
'Item': list("AAABBCDEFGBBBCCADEFG")
})
grouped = df.groupby(['Weekday','Item'],sort=True).agg(count=('Item', 'count'))
grouped.sort_values(['Weekday','count'],ascending=False).groupby('Weekday').head(5)
count
Weekday Item
T B 3
C 2
A 1
D 1
E 1
M A 3
B 2
C 1
D 1
E 1
grouped = (df.groupby(['Weekday','Items'])
.Items.agg(counter='count')
.groupby(['Weekday'],
as_index=False))
pd.concat([group.nlargest(5,'counter') for name,group in grouped])
counter
Weekday Items
M A 3
B 2
C 1
D 1
E 1
T B 3
C 2
A 1
D 1
E 1
groupby twice, first to get the counter variable. the second groupby allows an iteration through the groups to get the top 5, using nlargest. last step is to combine the dataframes in the list into one.
vb_rise's solution should be faster as it avoids the iteration process.

Assigning incremental values based on an unique value of a column

I am trying to add an incremental value to a column based on specific values of another column in a dataframe. So that...
col A col B
A 0
B 1
C 2
A 3
A 4
B 5
Would become something like this:
col A col B
A 1
B 2
C 3
A 1
A 1
B 2
C 3
Have tried using groupby function but cant really get my head around setting incremental values on column B.
Any thoughts?
Thanks
I think need factorize:
df['col B'] = pd.factorize(df['col A'])[0] + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2
Another solution:
df['col B'] = pd.Categorical(df['col A']).codes + 1
print (df)
col A col B
0 A 1
1 B 2
2 C 3
3 A 1
4 A 1
5 B 2

pandas finds indices of rows in each group which meets certain condition and assign values to these rows

I have a df,
name_id name
1 a
2 b
2 b
3 c
3 c
3 c
now I want to groupby name_id and assign -1 to rows in the group(s), whose length is 1 or < 2;
one_occurrence_indices = df.groupby('name_id').filter(lambda x: len(x) == 1).index.tolist()
for index in one_occurrence_indices:
df.loc[index, 'name_id'] = -1
I am wondering what is the best way to do it. so the result df,
name_id name
-1 a
2 b
2 b
3 c
3 c
3 c
Use transform with loc:
df.loc[df.groupby('name_id')['name_id'].transform('size') == 1, 'name_id'] = -1
Alternative is numpy.where:
df['name_id'] = np.where(df.groupby('name_id')['name_id'].transform('size') == 1,
-1, df['name_id'])
print (df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Also if want test duplicates use duplicated:
df['name_id'] = np.where(df.duplicated('name_id', keep=False), df['name_id'], -1)
Use:
df.name_id*=(df.groupby('name_id').name.transform(len)==1).map({True:-1,False:1})
df
Out[50]:
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Using pd.DataFrame.mask:
lens = df.groupby('name_id')['name'].transform(len)
df['name_id'].mask(lens < 2, -1, inplace=True)
print(df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c

Categories