pandas: matching a dataframe against another on strings while keeping the original index - python

I have a database with strings and the index as below.
df0
idx name_id_code string_line_0
0 0.01 A
1 0.5 B
2 77.6 C
3 29.8 D
4 56.2 E
5 88.1000005 F
6 66.4000008 G
7 2.1 H
8 99 I
9 550.9999999 J
df1
idx string_line_1
0 A
1 F
2 J
3 G
4 D
Now I want to match df1 against df0, taking the rows where the string values are equal, but keeping the original index of df0, as below:
df_result name_id_code string_line_0
0 0.01 A
5 88.1000005 F
9 550.9999999 J
6 66.4000008 G
3 29.8 D
I tried with my code but it didn't work for strings and only matched on the index:
c = df0['name_id_code'] + ' (' + df0['string_line_0'].astype(str) + ')'
out = df1[df1['string_line_1'].isin(c)]
I also tried to keep it simple and match just the last column:
c = df0['string_line_0'].astype(str) + ')'
out = df1[df1['string_line_1'].isin(c)]
but got a blank output.

Because df0 is the DataFrame being filtered, its index values are unchanged when you use Series.isin with df1['string_line_1']; only the row order is like in the original df0:
out = df0[df0['string_line_0'].isin(df1['string_line_1'])]
print (out)
name_id_code string_line_0
idx
0 0.010000 A
3 29.800000 D
5 88.100001 F
6 66.400001 G
9 551.000000 J
Or, if you use DataFrame.merge, then to avoid losing df0.index it is necessary to add DataFrame.reset_index:
out = (df1.rename(columns={'string_line_1': 'string_line_0'})
          .merge(df0.reset_index(), on='string_line_0'))
print (out)
string_line_0 idx name_id_code
0 A 0 0.010000
1 F 5 88.100001
2 J 9 551.000000
3 G 6 66.400001
4 D 3 29.800000
A similar solution without renaming; it keeps both the string_line_0 and string_line_1 columns (with the same values):
out = (df1.merge(df0.reset_index(), left_on='string_line_1', right_on='string_line_0'))
print (out)
string_line_1 idx name_id_code string_line_0
0 A 0 0.010000 A
1 F 5 88.100001 F
2 J 9 551.000000 J
3 G 6 66.400001 G
4 D 3 29.800000 D

You can do:
out = df0.loc[(df0["string_line_0"].isin(df1["string_line_1"]))].copy()
out["string_line_0"] = pd.Categorical(out["string_line_0"], categories=df1["string_line_1"].unique())
out.sort_values(by=["string_line_0"], inplace=True)
The first line filters df0 to just the rows where string_line_0 is in the string_line_1 column of df1.
The second line converts string_line_0 in the output df to a Categorical, which is then custom-sorted by the order of the values in df1.
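Putting the filter and the Categorical sort together on the sample data (rebuilt here from the question) gives the requested result, with df0's index preserved and the rows in df1's order:

```python
import pandas as pd

df0 = pd.DataFrame({'name_id_code': [0.01, 0.5, 77.6, 29.8, 56.2,
                                     88.1000005, 66.4000008, 2.1, 99, 550.9999999],
                    'string_line_0': list('ABCDEFGHIJ')})
df1 = pd.DataFrame({'string_line_1': list('AFJGD')})

# keep only rows of df0 whose string appears in df1, preserving df0's index
out = df0.loc[df0["string_line_0"].isin(df1["string_line_1"])].copy()
# reorder the rows to follow df1's order via a Categorical sort key
out["string_line_0"] = pd.Categorical(out["string_line_0"],
                                      categories=df1["string_line_1"].unique())
out = out.sort_values(by=["string_line_0"])
print(out)
```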


Populate two columns based on different values of other two columns

I have a df that looks like this:
|ID|PREVIOUS |CURRENT|NEXT|
|--| --- | --- |---|
|1||A||
|1||B||
|2||C||
|2||D||
|2||E||
|2||F||
|3||G||
|4||H||
|4||I||
I want it to populate PREVIOUS and NEXT columns like this:
|ID|PREVIOUS |CURRENT|NEXT|
|--| --- | --- |---|
|1|nan|A|B|
|1|A|B|nan|
|2|nan|C|D|
|2|C|D|E|
|2|D|E|F|
|2|E|F|nan|
|3|nan|G|nan|
|4|nan|H|I|
|4|H|I|nan|
So for each unique ID I want to populate PREVIOUS and next columns based on the values of CURRENT column.
Until now I figured out how to do it if the df had only one type of ID (except the case where there is no PREVIOUS and NEXT, i.e. ID=3), but I am struggling to generalize it for more IDs.
for i in range(0, len(df)):
    if i == 0:
        df["PREVIOUS"].iloc[i] = str(np.NaN)
        df["NEXT"].iloc[i] = df["CURRENT"].iloc[i+1]
    if i == (len(df)-1):
        df["NEXT"].iloc[i] = str(np.NaN)
        df["PREVIOUS"].iloc[i] = df["CURRENT"].iloc[i-1]
    if (i > 0) and (i < (len(df)-1)):
        df["PREVIOUS"].iloc[i] = df["CURRENT"].iloc[i-1]
        df["NEXT"].iloc[i] = df["CURRENT"].iloc[i+1]
I am guessing it should employ a groupby and size(), but until now I couldn't achieve the result I wanted.
This should do what your question asks:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,1,2,2,2,2,3,4,4], 'CURRENT':list('ABCDEFGHI')})
print(df)
from collections import defaultdict
valById = defaultdict(list)
df.apply(lambda x: valById[x['ID']].append(x['CURRENT']), axis = 1)
df = pd.DataFrame([{'ID': k,
                    'PREVIOUS': v[i-1] if i else np.nan,
                    'CURRENT': v[i],
                    'NEXT': v[i+1] if i+1 < len(v) else np.nan}
                   for k, v in valById.items() for i in range(len(v))])
print(df)
Output:
ID CURRENT
0 1 A
1 1 B
2 2 C
3 2 D
4 2 E
5 2 F
6 3 G
7 4 H
8 4 I
ID PREVIOUS CURRENT NEXT
0 1 NaN A B
1 1 A B NaN
2 2 NaN C D
3 2 C D E
4 2 D E F
5 2 E F NaN
6 3 NaN G NaN
7 4 NaN H I
8 4 H I NaN
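A shorter alternative (not from the answer above, offered as a sketch): since PREVIOUS and NEXT are just the neighbouring CURRENT values within each ID group, groupby with shift does the whole job without an intermediate dict:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 2, 3, 4, 4],
                   'CURRENT': list('ABCDEFGHI')})

# shift(1) looks one row back, shift(-1) one row forward, within each ID group;
# the first/last row of each group gets NaN automatically
df['PREVIOUS'] = df.groupby('ID')['CURRENT'].shift(1)
df['NEXT'] = df.groupby('ID')['CURRENT'].shift(-1)
print(df)
```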

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match the A and B columns, based on the column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here, and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = (pd.merge_asof(left=df_1, right=df_2,
                           left_on='idx', right_on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
               .dropna()
               .drop(labels='idx', axis='columns')
               .reset_index(drop=True))
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
                       left_on=['idx', 'A'], right_on=['idx', 'B'])
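The merge_asof edit from the question can be made runnable on the sample data (rebuilt here with idx as a regular column): the by-grouping restricts matches to equal A/B values, and direction='backward' with tolerance=2 enforces the positional constraint that rules out the idx=4 / idx=5 pairing:

```python
import pandas as pd

df_1 = pd.DataFrame({'idx': range(6), 'A': [1, 2, 3, 4, 1, 2], 'X': list('ABCDEF')})
df_2 = pd.DataFrame({'idx': range(6), 'B': [1, 2, 4, 2, 3, 1], 'Y': list('HIJKLM')})

# for each left row, take the nearest earlier (backward) right row with the
# same value (left_by/right_by) and an idx gap of at most 2 (tolerance)
df_result = (pd.merge_asof(left=df_1, right=df_2,
                           left_on='idx', right_on='idx',
                           left_by='A', right_by='B',
                           direction='backward', tolerance=2)
               .dropna()
               .drop(labels='idx', axis='columns')
               .reset_index(drop=True))
print(df_result)
```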

Remove any 0 value from row, order values descending for row, for each non 0 value in row return the index, column name, and score to a new df

I'm looking for a more efficient way of doing the below (perhaps using boolean masks and vectorization).
I'm new to this forum so apologies if my first question is not quite what was expected.
# order each row by values descending
# remove any 0 value column from row
# for each non-0 value return the index, column name, and score to a new dataframe
import pandas as pd
import numpy as np

test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])
column_names = ['index_row', 'header', 'score']
# create empty df with final output columns
df_result = pd.DataFrame(columns=column_names)
row_index = list(df.index.values)
for row in row_index:
    working_row = row
    # change all 0 values to null and drop any extraneous columns
    subset_cols = (df.loc[[working_row], :]
                     .replace(0, np.nan)
                     .dropna(axis=1, how='any')
                     .columns.to_list())
    # order by score
    sub_df = df.loc[[working_row], subset_cols].sort_values(by=row, axis=1, ascending=False)
    s_cols = sub_df.columns.to_list()
    scores = sub_df.values.tolist()
    scores = scores[0]
    index_row = []
    header = []
    score = []
    for count, value in enumerate(scores):
        header.append(s_cols[count])
        score.append(value)
        index_row.append(row)
    data = {'index_row': index_row,
            'header': header,
            'score': score}
    result_frame = pd.DataFrame(data, columns=['index_row', 'header', 'score'])
    df_result = pd.concat([df_result, result_frame], ignore_index=True)
df_result
You could do it directly with melt and some additional processing:
df_result = (df.reset_index()
               .rename(columns={'index': 'index_row'})
               .melt(id_vars='index_row', var_name='header', value_name='score')
               .query("score != 0")
               .sort_values(['index_row', 'score'], ascending=[True, False])
               .reset_index(drop=True))
it gives as expected:
index_row header score
0 0 b 36
1 0 d 7
2 0 c 2
3 0 a 1
4 1 c 8
5 1 d 8
6 1 b 2
7 2 c 100
8 2 d 9
9 2 a 8
10 3 d 50
11 3 b 6
12 3 a 5
for index in df.index:
    temp_df = df.loc[index].reset_index().reset_index()
    temp_df.columns = ['index_row', 'header', 'score']
    temp_df['index_row'] = index
    temp_df.sort_values(by=['score'], ascending=False, inplace=True)
    df_result = pd.concat([df_result, temp_df[temp_df.score != 0]], ignore_index=True)
test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])
df = df.reset_index()
results = pd.melt(df, id_vars='index', var_name='header', value_name='score')
mask = results['score'] != 0
print(results[mask].sort_values(by=['index', 'score'], ascending=[True, False]))
output:
index header score
4 0 b 36
12 0 d 7
8 0 c 2
0 0 a 1
9 1 c 8
13 1 d 8
5 1 b 2
10 2 c 100
14 2 d 9
2 2 a 8
15 3 d 50
7 3 b 6
3 3 a 5
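For completeness, a sketch of the same idea with stack() instead of melt: stack keeps the row label in the index, so only renaming, filtering, and sorting remain (column names follow the question):

```python
import pandas as pd

test_data = {'a': [1, 0, 8, 5], 'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0], 'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data)

# stack() yields a Series indexed by (row, column) pairs
out = (df.stack()
         .rename_axis(['index_row', 'header'])
         .rename('score')
         .reset_index())
# drop zeros, then sort by row ascending and score descending
out = (out[out['score'] != 0]
       .sort_values(['index_row', 'score'], ascending=[True, False])
       .reset_index(drop=True))
print(out)
```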

Pandas Copy columns from one data frame to another with different name

I have to copy columns from one DataFrame A to another DataFrame B. The column names in A and B do not match.
What is the best way to do it? There are several columns like this. Do I need to write for each column like B["SO"] = A["Sales Order"] etc.
I would use pd.concat:
combined_df = pd.concat([df1, df2[['column_a', 'column_b']]], axis=1)
It also gives you the power to concat different-sized dataframes, do outer joins, etc.
Use:
df1 = pd.DataFrame({
'SO':list('abcdef'),
'RI':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
print (df1)
SO RI C
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
df2 = pd.DataFrame({
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df2)
D E F
0 1 5 a
1 3 3 a
2 5 6 a
3 7 9 b
4 1 2 b
5 0 4 b
Create a dictionary for renaming, select the matching columns, rename them by the dict, and use DataFrame.join to attach them to the original (the DataFrames are matched by index values):
d = {'SO':'Sales Order',
'RI':'Retail Invoices'}
df11 = df1[d.keys()].rename(columns=d)
print (df11)
Sales Order Retail Invoices
0 a 4
1 b 5
2 c 4
3 d 5
4 e 5
5 f 4
df = df2.join(df11)
print (df)
D E F Sales Order Retail Invoices
0 1 5 a a 4
1 3 3 a b 5
2 5 6 a c 4
3 7 9 b d 5
4 1 2 b e 5
5 0 4 b f 4
Make a dictionary of abbreviations and try this code.
Ex:
full_form_dict = {'SO':'Sales Order',
'RI':'Retail Invoices',}
A_col = list(A.columns)
B_col = [v for k,v in full_form_dict.items() if k in A_col]
# to loop over A_col
# B_col = [v for col in A_col for k,v in full_form_dict.items() if k == col]
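If the goal is just to copy several columns from A into B under new names, a rename plus join can do it in one pass. A minimal sketch, assuming hypothetical frames and a name mapping based on the question (the actual column names and index alignment may differ):

```python
import pandas as pd

# hypothetical sample frames standing in for A and B from the question
A = pd.DataFrame({'Sales Order': ['a', 'b', 'c'],
                  'Retail Invoices': [4, 5, 4],
                  'Other': [7, 8, 9]})
B = pd.DataFrame({'D': [1, 3, 5], 'E': [5, 3, 6]})

# map A's long names to the short names wanted in B
mapping = {'Sales Order': 'SO', 'Retail Invoices': 'RI'}
# select only the mapped columns, rename, and join on the shared index
B = B.join(A[list(mapping)].rename(columns=mapping))
print(B)
```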

Pandas: How to add number of the row within grouped rows

so I have DataFrame:
>>> df2
text
0 0 a
0 1 b
0 2 c
0 3 d
1 4 e
1 5 f
1 6 g
2 7 h
2 8 1
How do I create another column which contains a counter for each row within its level=0 index group?
I have tried the following code (i need to get df['counter'] column):
current_index = ''
for index, row in df.iterrows():
    if index[0] != current_index:
        current_index = index[0]
        df[(df.index == current_index)]['counter'] = np.arange(len(df[(df.index == current_index)].index))
and following code as well:
df2 = pd.DataFrame()
for group, df in df1.groupby('level_0_column'):
df0 = df0.sort_values(by=['level_1_column'])
df['counter'] = list(df.reset_index().index.values + 1)
df2 = df2.append(df0)
I have around 650K rows in the DataFrame... it goes into an infinite loop. Please advise.
I believe you're looking for groupby along the 0th column index + cumcount:
df['counter'] = df.groupby(level=0).cumcount() + 1
df
text counter
0 0 a 1
1 b 2
2 c 3
3 d 4
1 4 e 1
5 f 2
6 g 3
2 7 h 1
8 1 2
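The cumcount one-liner can be checked end-to-end on a frame rebuilt from the question (the exact MultiIndex shape here is an assumption about the df2 shown):

```python
import pandas as pd

# level 0 is the group id, level 1 the original row number
idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (1, 6), (2, 7), (2, 8)])
df = pd.DataFrame({'text': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', '1']},
                  index=idx)

# number the rows 1..n within each level-0 group
df['counter'] = df.groupby(level=0).cumcount() + 1
print(df)
```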
