I have a df consisting of person, origin, and destination:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
the df:
PersonID O D
1 A B
1 B A
2 C B
2 B A
2 A B
3 X Y
I have grouped the df with df_grouped = df.groupby(['O','D']) and matched it with another dataframe, taxi:
TaxiID O D
T1 B A
T2 A B
T3 C B
Similarly, I grouped the taxi dataframe by O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people:
O  D  PersonID  TaxiID
         count   count
A  B         2       1
B  A         2       1
C  B         1       1
Now I want to use df.loc to keep only those PersonID values that were counted in the merged dataframe. How can I do this? I've tried to use:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. What can I do to fix this?
Edit: I attach the complete code for this case using dummy data:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
Select the MultiIndex column by tuple and use Series.explode to get scalars from the nested lists:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
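The original isin call returned an empty dataframe because, after the aggregation, merged's columns are a MultiIndex, so merged['PersonID'] selects the count/list sub-columns rather than the IDs themselves. A quick check on the dummy data (the listed tuples are an assumption based on that code):

print(merged.columns.tolist())
# roughly: [('O', ''), ('D', ''), ('PersonID', 'count'), ('PersonID', 'list'),
#           ('TaxiID', 'count'), ('TaxiID', 'list')]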
For better performance, it is possible to use a set comprehension to flatten the nested lists:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
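If the end goal is simply to keep the rows of df whose O-D pair also exists in taxi (an assumption about what the count/list columns are ultimately used for), a shorter sketch is to skip the list aggregation and merge on the pair columns directly:

# filter df by the O-D pairs that appear in taxi
pairs = taxi[['O', 'D']].drop_duplicates()
seek = df.merge(pairs, on=['O', 'D'], how='inner')
print(seek)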
Here is a sample df:
A B C D E (New Column)
0 1 2 a n ?
1 3 3 p d ?
2 5 9 f z ?
If Column A == Column B, pick Column C's value and apply it to Column E;
otherwise, pick Column D's value and apply it to Column E.
I have tried many ways but failed. I am new, so please teach me how to do it. Thank you!
Note:
It needs to pick the value from Col. C or Col. D in this case, so no specific fill values are provided for Col. E (this is the main difference from other similar questions).
Use numpy.where:
import numpy as np

df['E'] = np.where(df['A'] == df['B'],
                   df['C'],
                   df['D'])
df
A B C D E
0 1 2 a n n
1 3 3 p d p
2 5 9 f z z
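If you later needed more than one condition (purely a hypothetical extension, not required for this question), numpy.select follows the same pattern with lists of conditions and choices:

import numpy as np
import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 3, 9],
                   'C': ['a', 'p', 'f'], 'D': ['n', 'd', 'z']})

# the second condition and the default label are hypothetical, only to illustrate np.select
conditions = [df['A'] == df['B'], df['A'] > df['B']]
choices = [df['C'], df['D']]
df['E'] = np.select(conditions, choices, default='no match')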
Try pandas Series.where:
df['E'] = df['C'].where(df['A'].eq(df['B']), df['D'])
df
Out[570]:
A B C D E
0 1 2 a n n
1 3 3 p d p
2 5 9 f z z
I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
My goal is to get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match the A and B columns, based on the column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here, and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2,
                           left_on='idx', right_on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
             .dropna()
             .drop(labels='idx', axis='columns')
             .reset_index(drop=True))
Gets me what I want.
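For reference, a minimal construction of the two frames (assuming idx is an ordinary integer column), so the merge_asof line above can be run as-is:

import pandas as pd

df_1 = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5],
                     'A': [1, 2, 3, 4, 1, 2],
                     'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5],
                     'B': [1, 2, 4, 2, 3, 1],
                     'Y': list('HIJKLM')})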
IIUC this should work:
df_result = df_1.merge(df_2,
left_on=['idx', 'A'], right_on=['idx', 'B'])
I have a dataframe with a column like this
Col1
1 A, 2 B, 3 C
2 B, 4 C
1 B, 2 C, 4 D
I have used .str.split(',', expand=True), and the result is like this:
0 | 1 | 2
1 A | 2 B | 3 C
2 B | 4 C | None
1 B | 2 C | 4 D
What I am trying to achieve is this:
Col A| Col B| Col C| Col D
1 A | 2 B | 3 C | None
None | 2 B | 4 C | None
None | 1 B | 2 C | 4 D
I am stuck; how can I get new columns formatted like this?
Let's try:
# split and explode
s = df['Col1'].str.split(', ').explode()
# create new multi-level index
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])
# unstack to reshape
out = s.unstack().add_prefix('Col ')
Details:
# split and explode
0 1 A
0 2 B
0 3 C
1 2 B
1 4 C
2 1 B
2 2 C
2 4 D
Name: Col1, dtype: object
# create new multi-level index
0 A 1 A
B 2 B
C 3 C
1 B 2 B
C 4 C
2 B 1 B
C 2 C
D 4 D
Name: Col1, dtype: object
# unstack to reshape
Col A Col B Col C Col D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
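Put together as a self-contained run (assuming Col1 holds the sample strings from the question):

import pandas as pd

df = pd.DataFrame({'Col1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']})

# split on ', ' and explode so each token like '1 A' gets its own row
s = df['Col1'].str.split(', ').explode()

# index each token by its original row and by its trailing letter (the future column name)
s.index = pd.MultiIndex.from_arrays([s.index, s.str.split().str[-1].tolist()])

# pivot the letter level into columns
out = s.unstack().add_prefix('Col ')
print(out)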
Most probably there are more general approaches you can use, but this worked for me. Please note that this is based on a lot of assumptions and constraints of your particular example.
test_dict = {'col_1': ['1 A, 2 B, 3 C', '2 B, 4 C', '1 B, 2 C, 4 D']}
df = pd.DataFrame(test_dict)
First, we split the df into initial columns:
df2 = df.col_1.str.split(pat=',', expand=True)
Result:
0 1 2
0 1 A 2 B 3 C
1 2 B 4 C None
2 1 B 2 C 4 D
Next (first assumption), we need to ensure that we can later use ' ' as the delimiter to extract the columns. In order to do that, we need to remove all leading and trailing spaces from each string:
func = lambda x: pd.Series([i.strip() for i in x])
df2 = df2.astype(str).apply(func, axis=1)
Next, we need to get a list of unique columns. To do that, we first extract the column name from each cell:
func = lambda x: pd.Series([i.split(' ')[1] for i in x if i != 'None'])
df3 = df2.astype(str).apply(func, axis=1)
Result:
0 1 2
0 A B C
1 B C NaN
2 B C D
Then create a list of unique columns ['A', 'B', 'C', 'D'] that are present in your DataFrame:
columns_list = pd.unique(df3[df3.columns].values.ravel('K'))
columns_list = [x for x in columns_list if not pd.isna(x)]
And create an empty base dataframe with those columns which will be used to assign the corresponding values:
result_df = pd.DataFrame(columns=columns_list)
Once the preparations are done, we can assign column values for each of the rows and use pd.concat to merge them back into one DataFrame:
result_list = []
result_list.append(result_df)  # Adding the empty base table to ensure the columns are present
for row in df2.iterrows():
    result_object = {}  # dict that will be used to represent each row in source DataFrame
    for column in columns_list:
        for value in row[1]:  # row is returned as a tuple where the first value is the row_index that we don't need
            if value != 'None':
                if value.split(' ')[1] == column:  # Checking for a correct column to assign
                    result_object[column] = [value]
    result_list.append(pd.DataFrame(result_object))  # Adding dicts per row
Once the list of DataFrames is generated, we can use pd.concat to put it together:
final_df = pd.concat(result_list, ignore_index=True) # ignore_index will rebuild the index for the final_df
And the result will be:
A B C D
0 1 A 2 B 3 C NaN
1 NaN 2 B 4 C NaN
2 NaN 1 B 2 C 4 D
I don't think this is the most elegant or efficient way to do it, but it will produce the results you need.
I have a pandas data frame with 2 columns, user1 and user2, something like this.
Now, I want to compute the transitive relation such that if A is related to B, B to C, and C to D, then I want the output as a list like "A-B-C-D" in one group and "E-F-G" in another group.
Thanks
If you have just two groups, you can do it this way. But it works only for two groups and cannot be generalized:
x = []
y = []
x.append(df['user1'][0])
x.append(df['user2'][0])
for index, i in enumerate(df['user1']):
    if df['user1'][index] in x:
        x.append(df['user2'][index])
    else:
        y.append(df['user1'][index])
        y.append(df['user2'][index])
x = set(x)
y = set(y)
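If the number of groups is not known in advance, a common alternative (not part of this answer's approach, and assuming the relation is undirected) is to treat each row as a graph edge and take connected components with networkx:

import networkx as nx
import pandas as pd

df = pd.DataFrame({'user1': ['A', 'A', 'B', 'C', 'E', 'F'],
                   'user2': ['B', 'C', 'C', 'D', 'F', 'G']})

# every row is an edge; each connected component is one transitive group
G = nx.from_pandas_edgelist(df, source='user1', target='user2')
groups = ['-'.join(sorted(component)) for component in nx.connected_components(G)]
print(groups)  # e.g. ['A-B-C-D', 'E-F-G'] (component order may vary)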
If you want to find all the transitive relationships, then you most likely need to use recursion. Perhaps the following piece of code may help:
import pandas as pd

data = {'user1': ['A', 'A', 'B', 'C', 'E', 'F'],
        'user2': ['B', 'C', 'C', 'D', 'F', 'G']}
df = pd.DataFrame(data)
print(df)

# this method is similar to a common table expression (CTE) in SQL
def cte(df_anchor, df_ref, level):
    if level == 0:
        df_anchor.insert(0, 'user_root', df_anchor['user1'])
        df_anchor['level'] = 0
        df_anchor['relationship'] = df_anchor['user1'] + '-' + df_anchor['user2']
        _df_anchor = df_anchor
    if level > 0:
        _df_anchor = df_anchor[df_anchor.level == level]
    _df = pd.merge(_df_anchor, df_ref, left_on='user2', right_on='user1',
                   how='inner', suffixes=('', '_x'))
    if not _df.empty:
        _df['relationship'] = _df['relationship'] + '-' + _df['user2_x']
        _df['level'] = _df['level'] + 1
        _df = _df[['user_root', 'user1_x', 'user2_x', 'level', 'relationship']].rename(
            columns={'user1_x': 'user1', 'user2_x': 'user2'})
        df_anchor_new = pd.concat([df_anchor, _df])
        return cte(df_anchor_new, df_ref, level + 1)
    else:
        return df_anchor

df_rel = cte(df, df, 0)
print("\nall relationship=\n", df_rel)
print("\nall relationship related to A=\n", df_rel[df_rel.user_root == 'A'])
user1 user2
0 A B
1 A C
2 B C
3 C D
4 E F
5 F G
all relationship=
user_root user1 user2 level relationship
0 A A B 0 A-B
1 A A C 0 A-C
2 B B C 0 B-C
3 C C D 0 C-D
4 E E F 0 E-F
5 F F G 0 F-G
0 A B C 1 A-B-C
1 A C D 1 A-C-D
2 B C D 1 B-C-D
3 E F G 1 E-F-G
0 A C D 2 A-B-C-D
all relationship related to A=
user_root user1 user2 level relationship
0 A A B 0 A-B
1 A A C 0 A-C
0 A B C 1 A-B-C
1 A C D 1 A-C-D
0 A C D 2 A-B-C-D
Let's say I have a data frame:
A B
0 a b
1 c d
2 e f
and what I am aiming for is to find the difference between the first value of column A and each of its rows, like this:
A B Ic
0 a b (a-a)
1 c d (a-c)
2 e f (a-e)
This is what I tried:
df['dA'] = df['A'] - df['A']
But it doesn't give me the result I needed. Any help would be greatly appreciated.
Select the first value with loc (by index label and column name) or with iat (on the column, by position), and subtract the column from it. Note that the output below uses a numeric column A (4, 1, 0), since subtraction requires numbers:
df['Ic'] = df.loc[0,'A'] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
df['Ic'] = df['A'].iat[0] - df['A']
print (df)
A B Ic
0 4 b 0
1 1 d 3
2 0 f 4
Detail:
print (df.loc[0,'A'])
4
print (df['A'].iat[0])
4
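For completeness, a runnable version of the above, assuming column A is numeric (the letters a, c, e from the question would not support subtraction):

import pandas as pd

# numeric sample matching the printed output above
df = pd.DataFrame({'A': [4, 1, 0], 'B': ['b', 'd', 'f']})

df['Ic'] = df.loc[0, 'A'] - df['A']    # first value selected by label
# or equivalently:
df['Ic'] = df['A'].iat[0] - df['A']    # first value selected by position
print(df)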