I'm having a problem iterating over my dataframe. The way I'm doing it takes a long time, and I don't have that many rows (around 300k).
What am I trying to do?
Check whether one DF (A) contains the values of two columns of the other DF (B). You can think of this pair as a multiple-key field.
If True, get the index of DF.B and assign it to one column of DF.A.
If False, two steps:
a. append the two unmatched column values to DF.B
b. assign the new ID to DF.A (I couldn't get this one to work)
This is my code, where:
df is DF.A and df_id is DF.B:
SampleID and ParentID are the two columns I want to check for in both dataframes
Real_ID is the column to which I want to assign the id of DF.B (df_id)
for index, row in df.iterrows():
    # check if the key pair exists in the other dataframe
    real_id = df_id[(df_id['SampleID'] == row['SampleID']) & (df_id['ParentID'] == row['ParentID'])]
    if real_id.empty:
        # row does not exist, append to df_id
        df_id = df_id.append(row[['SampleID', 'ParentID']])
    else:
        # row exists, assign id of df_id to df
        row['Real_ID'] = real_id.index
EXAMPLE:
DF.A (df)
Real_ID SampleID ParentID Something AnotherThing
0 20 21 a b
1 10 11 a b
2 40 51 a b
DF.B (df_id)
SampleID ParentID
0 10 11
1 20 21
Result (first df, then df_id):
Real_ID SampleID ParentID Something AnotherThing
0 1 10 11 a b
1 0 20 21 a b
2 2 40 51 a b
SampleID ParentID
0 20 21
1 10 11
2 40 51
Again, this solution is very slow. I'm sure there is a better way to do this, which is why I'm asking here. Unfortunately, this is what I got after several hours...
Thanks
You can do it this way:
Data (pay attention to the index in the B DF):
In [276]: cols = ['SampleID', 'ParentID']
In [277]: A
Out[277]:
Real_ID SampleID ParentID Something AnotherThing
0 NaN 10 11 a b
1 NaN 20 21 a b
2 NaN 40 51 a b
In [278]: B
Out[278]:
SampleID ParentID
3 10 11
5 20 21
Solution:
In [279]: merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)
In [280]: merged
Out[280]:
SampleID ParentID _merge
0 10 11 both
1 20 21 both
2 40 51 left_only
In [281]: B = pd.concat([B, merged.loc[merged._merge=='left_only', cols]])
In [282]: B
Out[282]:
SampleID ParentID
3 10 11
5 20 21
2 40 51
In [285]: A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']
In [286]: A
Out[286]:
Real_ID SampleID ParentID Something AnotherThing
0 3 10 11 a b
1 5 20 21 a b
2 2 40 51 a b
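For reference, here is the whole approach as one self-contained, runnable script on a current pandas version; the frame construction is an assumed reconstruction of the example data above:

import pandas as pd

# assumed reconstruction of the example frames
A = pd.DataFrame({'Real_ID': [None, None, None],
                  'SampleID': [10, 20, 40],
                  'ParentID': [11, 21, 51],
                  'Something': ['a', 'a', 'a'],
                  'AnotherThing': ['b', 'b', 'b']})
B = pd.DataFrame({'SampleID': [10, 20], 'ParentID': [11, 21]}, index=[3, 5])

cols = ['SampleID', 'ParentID']

# flag the key pairs of A that are missing from B
merged = pd.merge(A[cols], B, on=cols, how='outer', indicator=True)

# append the missing key pairs to B
B = pd.concat([B, merged.loc[merged._merge == 'left_only', cols]])

# map every key pair of A back to B's index
A['Real_ID'] = pd.merge(A[cols], B.reset_index(), on=cols)['index']
print(A)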
Related: there are other questions on the same topic and they helped, but I have an extra twist. I have a dataframe with multiple values in some (but not all) cells.
df = pd.DataFrame({'a':["10-30-410","20-40-500","25-50"], 'b':["5-8-9","4", "99"]})
index  a          b
0      10-30-410  5-8-9
1      20-40-500  4
2      25-50      99
How can I split each cell by the dash "-" and create three new dataframes? Note that not all cells have multiple values, in which case the second and third dataframes get NA or blank (treating these as strings).
So I need df1 to be the first of those values:
index  a   b
0      10  5
1      20  4
2      25  99
And df2 would be:
index  a   b
0      30  8
1      40
2      50
And likewise for df3:
index  a    b
0      410  9
1      500
2
I got df1 with this
df1 = df.replace(r'(\d+).*(\d+).*(\d+)+', r'\1', regex=True)
But df2 doesn't quite work. I get the second values but also 4 and 99, which should be blank.
df2 = df.replace(r'(\d+)-(\d+).*', r'\2', regex=True)
index  a   b
0      30  8
1      40  4   <- should be blank
2      50  99  <- should be blank
Is this the right approach? I'm pretty good with regex but fuzzy with groups. Thank you.
Use str.split + concat + stack to get the data in a more usable format:
new_df = pd.concat(
    (df['a'].str.split('-', expand=True),
     df['b'].str.split('-', expand=True)),
    keys=('a', 'b'),
    axis=1
).stack(dropna=False).droplevel(0)
new_df:
a b
0 10 5
1 30 8
2 410 9
0 20 4
1 40 None
2 500 None
0 25 99
1 50 None
2 None None
Expandable option for n cols:
cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
Then groupby level 0 + reset_index to create a list of dataframes:
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
dfs:
[ a b
0 10 5
1 20 4
2 25 99,
a b
0 30 8
1 40 None
2 50 None,
a b
0 410 9
1 500 None
2 None None]
Complete Working Example:
import pandas as pd

df = pd.DataFrame({
    'a': ["10-30-410", "20-40-500", "25-50"],
    'b': ["5-8-9", "4", "99"]
})

cols = ['a', 'b']
new_df = pd.concat(
    (df[c].str.split('-', expand=True) for c in cols),
    keys=cols,
    axis=1
).stack(dropna=False).droplevel(0)
dfs = [g.reset_index(drop=True) for _, g in new_df.groupby(level=0)]
print(dfs)
You can also try with filter:
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')
df1 = k.filter(like='0')
df2 = k.filter(like='1')
df3 = k.filter(like='2')
NOTE: to strip the digit from the column names, use k.filter(like='0').rename(columns=lambda x: x.split('-')[0])
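Putting that together, a minimal runnable sketch (using the example df from the question) that extracts the second piece and also restores the original column names:

import pandas as pd

df = pd.DataFrame({'a': ["10-30-410", "20-40-500", "25-50"],
                   'b': ["5-8-9", "4", "99"]})

# split every column and prefix each piece number with its source column name
k = pd.concat((df[c].str.split('-', expand=True).add_prefix(c + '-')
               for c in df.columns), axis=1).fillna('')

# pick the second piece of every column and strip the digit from the names
df2 = k.filter(like='1').rename(columns=lambda x: x.split('-')[0])
print(df2)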
I'm looking for a more efficient way of doing the below (perhaps using boolean masks and vectorization).
I'm new to this forum, so apologies if my first question is not quite what was expected.
# order each row by values, descending
# remove any 0-value column from the row
# for each non-0 value, return the index, column name, and score to a new dataframe
import numpy as np
import pandas as pd

test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])

column_names = ['index_row', 'header', 'score']
# create empty df with final output columns
df_result = pd.DataFrame(columns=column_names)

row_index = list(df.index.values)

for row in row_index:
    working_row = row
    # change all 0 values to null and drop any extraneous columns
    subset_cols = df.loc[[working_row], :].replace(0, np.nan).dropna(axis=1, how='any').columns.to_list()
    # order by score
    sub_df = df.loc[[working_row], subset_cols].sort_values(by=row, axis=1, ascending=False)
    s_cols = sub_df.columns.to_list()
    scores = sub_df.values.tolist()
    scores = scores[0]
    index_row = []
    header = []
    score = []
    for count, value in enumerate(scores):
        header.append(s_cols[count])
        score.append(value)
        index_row.append(row)
    data = {'index_row': index_row,
            'header': header,
            'score': score}
    result_frame = pd.DataFrame(data, columns=['index_row', 'header', 'score'])
    df_result = pd.concat([df_result, result_frame], ignore_index=True)

df_result
You could do it directly with melt and some additional processing:
df_result = (df.reset_index()
               .rename(columns={'index': 'index_row'})
               .melt(id_vars='index_row', var_name='header', value_name='score')
               .query("score != 0")
               .sort_values(['index_row', 'score'], ascending=[True, False])
               .reset_index(drop=True))
It gives the expected result:
index_row header score
0 0 b 36
1 0 d 7
2 0 c 2
3 0 a 1
4 1 c 8
5 1 d 8
6 1 b 2
7 2 c 100
8 2 d 9
9 2 a 8
10 3 d 50
11 3 b 6
12 3 a 5
An alternative is a plain loop over the rows; note that DataFrame.append has been removed in recent pandas, so pd.concat is used instead:

df_result = pd.DataFrame(columns=['index_row', 'header', 'score'])
for index in df.index:
    temp_df = df.loc[index].reset_index().reset_index()
    temp_df.columns = ['index_row', 'header', 'score']
    temp_df['index_row'] = index
    temp_df.sort_values(by=['score'], ascending=False, inplace=True)
    df_result = pd.concat([df_result, temp_df[temp_df.score != 0]], ignore_index=True)
test_data = {'a': [1, 0, 8, 5],
             'b': [36, 2, 0, 6],
             'c': [2, 8, 100, 0],
             'd': [7, 8, 9, 50]}
df = pd.DataFrame(test_data, columns=['a', 'b', 'c', 'd'])
df = df.reset_index()
results = pd.melt(df, id_vars='index', var_name='header', value_name='score')
mask = results['score'] != 0
print(results[mask].sort_values(by=['index', 'score'], ascending=[True, False]))
output:
index header score
4 0 b 36
12 0 d 7
8 0 c 2
0 0 a 1
9 1 c 8
13 1 d 8
5 1 b 2
10 2 c 100
14 2 d 9
2 2 a 8
15 3 d 50
7 3 b 6
3 3 a 5
I have two Dataframes (df1 and df2)
df1:
A B C D
12 52 16 23
19 32 30 09
df2:
A G C D E
12 13 16 04 100
I want to create a new column in df1 called 'Compare'.
Then I want to compare columns 'A' and 'C', and if a row's values appear in df2 as well, set 'Compare' for that row to 'X'.
result = df1[df1["A"].isin(df2["A"].tolist())]
does not work.
You can chain two conditions with & (bitwise AND) or | (bitwise OR) and set the new values with numpy.where:
mask = df1["A"].isin(df2["A"]) & df1["C"].isin(df2["C"])
df1['Compare'] = np.where(mask, 'X', '')
print (df1)
A B C D Compare
0 12 52 16 23 X
1 19 32 30 9
Or use DataFrame.merge with left join and indicator=True:
s = df1[['A','C']].merge(df2[['A','C']], how='left', indicator=True)['_merge']
df1['Compare'] = np.where(s == 'both', 'X', '')
print (df1)
A B C D Compare
0 12 52 16 23 X
1 19 32 30 9
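For completeness, a runnable sketch of the isin variant; the frames are an assumed reconstruction of the example above. One caveat: isin checks each column independently, so a row can be flagged even when its 'A' and 'C' values come from different rows of df2, whereas the merge variant checks the pair together:

import numpy as np
import pandas as pd

# assumed reconstruction of the example frames
df1 = pd.DataFrame({'A': [12, 19], 'B': [52, 32], 'C': [16, 30], 'D': [23, 9]})
df2 = pd.DataFrame({'A': [12], 'G': [13], 'C': [16], 'D': [4], 'E': [100]})

mask = df1['A'].isin(df2['A']) & df1['C'].isin(df2['C'])
df1['Compare'] = np.where(mask, 'X', '')
print(df1)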
I have two dataframes
df1 = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]], index = ['a','b','c', 'a'], columns = ['d','e'])
d e
a 1 2
b 3 4
c 5 6
a 7 8
df2 = pd.DataFrame([['a', 10],['b',20],['c',30],['f',40]])
0 1
0 a 10
1 b 20
2 c 30
3 f 40
I want each row of df1 to be multiplied by the factor corresponding to its index label in df2 (e.g. 20 for b).
So my output should look like:
d e
a 10 20
b 60 80
c 150 180
a 70 80
Kindly provide a solution that scales, assuming df1 is hundreds of rows long. I could only think of looping through df1.index.
Use set_index and reindex to align df2 with df1, then mul:
In [1150]: df1.mul(df2.set_index(0).reindex(df1.index)[1], axis=0)
Out[1150]:
d e
a 10 20
b 60 80
c 150 180
a 70 80
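An equivalent way to sidestep index alignment is to look the factor up for each row label with Index.map and multiply by the resulting array; a small sketch using the same example frames:

import numpy as np

# positional lookup: one factor per row label of df1
factors = df1.index.map(df2.set_index(0)[1])
result = df1.mul(np.asarray(factors), axis=0)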
Create a mapping and call df.apply:
In [1128]: mapping = dict(df2.values)
In [1129]: df1.apply(lambda x: x * mapping[x.name], axis=1)
Out[1129]:
d e
a 10 20
b 60 80
c 150 180
a 70 80
IIUC:
In [55]: df1 * pd.DataFrame(np.tile(df2[[1]],2), columns=df1.columns, index=df2[0])
Out[55]:
d e
a 10 20
a 70 80
b 60 80
c 150 180
Helper DF:
In [57]: pd.DataFrame(np.tile(df2[[1]],2), columns=df1.columns, index=df2[0])
Out[57]:
d e
0
a 10 10
b 20 20
c 30 30
This is straightforward. You just make sure they have a common axis, then you can combine them.
Put the lookup column into the index:
df2.set_index(0, inplace=True)

    1
0
a  10
b  20
c  30
Now you can put that column into df1 very easily:
df1['multiplying_factor'] = df2[1]
Now you just want to multiply two columns:
df1['final_value'] = df1.e*df1.multiplying_factor
Now df1 looks like:
d e multiplying_factor final_value
a 1 2 10 20
b 3 4 20 80
c 5 6 30 180
a 7 8 10 80
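To scale both original columns at once (the output the question asked for), you can multiply the whole frame by the factor column; since the Series comes from df1 itself, its duplicated index lines up one-to-one:

df1[['d', 'e']].mul(df1['multiplying_factor'], axis=0)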
I'm trying to merge two DataFrames summing columns value.
>>> print(df1)
id name weight
0 1 A 0
1 2 B 10
2 3 C 10
>>> print(df2)
id name weight
0 2 B 15
1 3 C 10
I need to sum the weight values during the merge for matching values in the common columns.
merge = pd.merge(df1, df2, how='inner')
So the output should be something like the following:
id name weight
1 2 B 25
2 3 C 20
This solution also works if you want to sum more than one column. Assume the data frames:
>>> df1
id name weight height
0 1 A 0 5
1 2 B 10 10
2 3 C 10 15
>>> df2
id name weight height
0 2 B 25 20
1 3 C 20 30
You can concatenate them and group by index columns.
>>> pd.concat([df1, df2]).groupby(['id', 'name']).sum().reset_index()
id name weight height
0 1 A 0 5
1 2 B 35 30
2 3 C 30 45
In [41]: pd.merge(df1, df2, on=['id', 'name']).set_index(['id', 'name']).sum(axis=1)
Out[41]:
id name
2 B 25
3 C 20
dtype: int64
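If you want that result back as a DataFrame rather than a Series, name the summed column while resetting the index:

result = (pd.merge(df1, df2, on=['id', 'name'])
            .set_index(['id', 'name'])
            .sum(axis=1)
            .reset_index(name='weight'))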
If you set the common columns as the index, you can just sum the two dataframes, much simpler than merging:
In [30]: df1 = df1.set_index(['id', 'name'])
In [31]: df2 = df2.set_index(['id', 'name'])
In [32]: df1 + df2
Out[32]:
weight
id name
1 A NaN
2 B 25
3 C 20
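If you also want to keep keys that appear in only one frame (id 1 above comes out as NaN), DataFrame.add with a fill_value treats the missing side as zero:

In [33]: df1.add(df2, fill_value=0)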