How to create a conditional pandas series/column? - python

Here is a sample df:
   A  B  C  D  E (New Column)
0  1  2  a  n  ?
1  3  3  p  d  ?
2  5  9  f  z  ?
If Column A == Column B, pick Column C's value and assign it to Column E;
otherwise pick Column D's value and assign it to Column E.
I have tried many ways but failed. I am new to pandas, so please show me how to do it. Thank you!
Note:
The value needs to be picked from Col. C or Col. D in this case, so there are no fixed values provided to fill into Col. E (this is the main difference from other similar questions).
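For reference, the sample frame above can be rebuilt like this (a minimal sketch; the values are copied from the question's table):
import pandas as pd

# Sample frame from the question; column E is what we want to create
df = pd.DataFrame({'A': [1, 3, 5],
                   'B': [2, 3, 9],
                   'C': ['a', 'p', 'f'],
                   'D': ['n', 'd', 'z']})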

Use numpy.where:
import numpy as np

df['E'] = np.where(df['A'] == df['B'],
                   df['C'],
                   df['D'])
df
   A  B  C  D  E
0  1  2  a  n  n
1  3  3  p  d  p
2  5  9  f  z  z

Try pandas Series.where:
df['E'] = df['C'].where(df['A'].eq(df['B']), df['D'])
df
   A  B  C  D  E
0  1  2  a  n  n
1  3  3  p  d  p
2  5  9  f  z  z
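A closely related variant (not from the answers above, just a sketch of the same idea) uses Series.mask, which replaces values where the condition is True:
# Keep D by default, switch to C where A equals B
df['E'] = df['D'].mask(df['A'].eq(df['B']), df['C'])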

Related

Create a dataframe of all combinations of columns names per row based on mutual presence of columns pairs

I'm trying to create a dataframe based on another dataframe and a specific condition.
Given the pandas dataframe above, I'd like to get a two-column dataframe in which each row is a pair of column names that are both different from 0 (i.e. coexist) in a given row, starting with the first row.
For example, for the part of the image above, the new dataframe I want looks like the following:
and so on...
Does anyone have a tip on how I can do it? I'm struggling... Thanks!
As you didn't provide a text example, here is a dummy one:
>>> df
A B C D E
0 0 1 1 0 1
1 1 1 1 1 1
2 1 0 0 1 0
3 0 0 0 0 1
4 0 1 1 0 0
You could use a combination of masking, explode and itertools.combinations:
from itertools import combinations
import pandas as pd

mask = df.gt(0)  # True where the cell is non-zero
# Replace True cells with their column name, then list all 2-element pairs per row
series = (mask * df.columns).apply(
    lambda x: list(combinations(set(x).difference(['']), r=2)), axis=1)
pd.DataFrame(series.explode().dropna().to_list(), columns=['X', 'Y'])
output:
X Y
0 C E
1 C B
2 E B
3 E D
4 E C
5 E B
6 E A
7 D C
8 D B
9 D A
10 C B
11 C A
12 B A
13 A D
14 C B
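Note that because the lambda builds a Python set, the order of the names inside each pair is not deterministic, which is why the pairs above look unordered. If you prefer pairs that follow the column order, a possible variant (a sketch, not part of the original answer) is:
# Index the columns with the boolean row so each pair keeps the column order
series = mask.apply(lambda row: list(combinations(df.columns[row], r=2)), axis=1)
pd.DataFrame(series.explode().dropna().to_list(), columns=['X', 'Y'])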

perform df.loc to groupby df

I have a df consisting of person, origin (O) and destination (D):
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
the df:
PersonID O D
1 A B
1 B A
2 C B
2 B A
2 A B
3 X Y
I have grouped the df with df_grouped = df.groupby(['O','D']) and matched it with another dataframe, taxi:
TaxiID O D
T1 B A
T2 A B
T3 C B
Similarly, I grouped the taxi frame by O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people:
O  D  PersonID  TaxiID
         count   count
A  B         2       1
B  A         2       1
C  B         1       1
Now, I want to use df.loc to take only those PersonID values that were counted in the merged frame. I've tried:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. What can I do?
Edit: I attach the complete code for this case using dummy data:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
After the aggregation, merged has MultiIndex columns, so select the list column by the tuple ('PersonID', 'list') and use Series.explode to get scalars out of the nested lists:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
For better performance, it is possible to use a set comprehension that flattens the nested lists:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match columns A and B, based on column B from df_2.
Columns A and B repeat their content after reaching 4. The order matters here, and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx',
                           left_by='A', right_by='B', direction='backward', tolerance=2)
               .dropna().drop(labels='idx', axis='columns').reset_index(drop=True))
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'], right_on=['idx', 'B'])
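For reference, here is a minimal runnable sketch of the merge_asof approach from the edit, with both frames rebuilt from the tables above (the frame contents are copied from the question):
import pandas as pd

df_1 = pd.DataFrame({'idx': range(6), 'A': [1, 2, 3, 4, 1, 2], 'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6), 'B': [1, 2, 4, 2, 3, 1], 'Y': list('HIJKLM')})

# Backward asof-merge on idx, grouped by A/B, within a tolerance of 2,
# so repeated values only match when their positions are close enough
df_result = (pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx',
                           left_by='A', right_by='B', direction='backward', tolerance=2)
               .dropna().drop(labels='idx', axis='columns').reset_index(drop=True))
This reproduces the four rows of df_result shown in the question (column B may come back as float because of the intermediate NaN values).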

Pandas: optimise iterating with a condition on both row and column with large file

I have the following data, and what I would like is to fill in col E with values from another row (let’s call it the target row) in col D only when the following conditions are met:
col E has no value
the string in col A of the target row is the same as that in col A
the value in col B for the target row is the same as the value in col C
   A     B  C  D  E
1  XXZ   a  d  1
2  YXXZ  b  a  2
3  YXXZ  c  b  3  2
4  YXXZ  d  c  4  5
5  XXZ   e  a  4
What I would like to get is something like this:
A     B  C  D  E
XXZ   a  d  1  1
YXXZ  b  a  2  2
YXXZ  c  b  3  2
YXXZ  d  c  4  5
XXZ   e  a  4  NaN
The answer from @ralubrusto below works, but it is clearly not efficient for large files. Is there any suggestion on how to make it faster?
missing = df.E.isna()
for id in df[missing].index:
    original = df.loc[id]
    # Second condition
    equal_A = df[df['A'] == original['A']]
    # Third condition
    the_one = equal_A[equal_A['C'] == original['B']]
    # Assigning
    if len(the_one) > 0:
        df.at[id, 'E'] = the_one.iloc[0]['D']
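For reproducibility, the sample frame can be rebuilt like this (a minimal sketch based on the tables above; NaN marks the empty E cells):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['XXZ', 'YXXZ', 'YXXZ', 'YXXZ', 'XXZ'],
                   'B': ['a', 'b', 'c', 'd', 'e'],
                   'C': ['d', 'a', 'b', 'c', 'a'],
                   'D': [1, 2, 3, 4, 4],
                   'E': [np.nan, np.nan, 2, 5, np.nan]})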
Since you have multiple and different conditions, you might wanna do something like:
# Find missing E values
missing = df.E.isna()

for id in df[missing].index:
    original = df.loc[id]
    # Second condition
    equal_A = df[df['A'] == original['A']]
    # Third condition
    the_one = equal_A[equal_A['C'] == original['B']]
    # Assigning
    if len(the_one) > 0:
        df.at[id, 'E'] = the_one.iloc[0]['D']
The answer for your example data would be:
      A  B  C  D    E
0   XXZ  a  d  1  4.0
1  YXXZ  b  a  2  3.0
2  YXXZ  c  b  3  2.0
3  YXXZ  d  c  4  5.0
4   XXZ  e  a  4  NaN
Edit: Thanks for your patience. I've tried a few different approaches to accomplish this task, and most of them are pretty inefficient, as you can see in the perfplot below (it's not a perfect plot, but you can get the general idea).
I've tried some approaches using groupby, apply, for loops (the previous answer) and finally a merge one, which is by far the fastest one.
Here is its code:
_df = (df.reset_index()
         .merge(df, left_on=['A', 'B'],
                right_on=['A', 'C'],
                how='inner',
                suffixes=['_ori', '_target']))

# Fill only the rows whose original E is missing
_df.loc[_df.E_ori.isna(), 'E_ori'] = _df.loc[_df.E_ori.isna(), 'D_target']
_df.set_index('index', inplace=True)
df.loc[_df.index, 'E'] = _df['E_ori']
It's really more efficient than the previous solution, so please try it out using your dataset and tell us if you have any further issues.
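For what it's worth, running this merge-based version against the frame reconstructed above fills E with [4.0, 3.0, 2.0, 5.0, NaN], the same result as the loop-based answer.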

Add a column results in difference of rows

Let's say I have a data frame:
A B
0 a b
1 c d
2 e f
and what I am aiming for is to find the difference between the first value of column A and the value in each following row,
Like this:
A B Ic
0 a b (a-a)
1 c d (a-c)
2 e f (a-e)
This is what I tried:
df['dA'] = df['A'] - df['A']
But it doesn't give me the result I needed. Any help would be greatly appreciated.
Select the first value either with loc (by index label and column name) or with iat (by position within the column) and subtract:
df['Ic'] = df.loc[0,'A'] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
df['Ic'] = df['A'].iat[0] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
Detail:
print (df.loc[0,'A'])
4
print (df['A'].iat[0])
4
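For reference, the numeric frame used in this answer can be rebuilt like this (a minimal sketch; the answer substitutes numbers for the letters in the question's column A so the subtraction is meaningful):
import pandas as pd

df = pd.DataFrame({'A': [4, 1, 0], 'B': ['b', 'd', 'f']})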
