Merge two Dataframes with same columns with overwrite - python

I have a dataframe like this:
df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val": ["111", "111", "222", "222"],
                   "qwe": ["", "11", "", "12"]})
It gives:
  flag qwe  val
0    1       111
1    0  11   111
2    1       222
3    0  12   222
Then I am filtering the first dataframe like this:
dff = df.loc[df["flag"] == "1"]
**was:**
dff.loc["qwe"] = "123"
**edited:** (setting all rows in column "qwe" to "123")
dff["qwe"] = "123"
And now I need to merge/join df and dff in such a way as to get:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
That is, take the changed 'qwe' from dff only where the value in df is empty.
Something like this:
pd.merge(df, dff, left_index=True, right_index=True, how="left")
gives
  flag_x qwe_x val_x flag_y qwe_y val_y
0      1        111      1         111
1      0    11  111    NaN   NaN   NaN
2      1        222      1         222
3      0    12  222    NaN   NaN   NaN
So after that I need to drop flag_y and val_y, rename the _x columns, and manually merge qwe_x and qwe_y. Is there any way to make this easier?

pd.merge has an on argument that you can use to join on columns with the same name in different dataframes.
Try:
pd.merge(df, dff, how="left", on=['flag', 'qwe', 'val'])
However, I don't think you need to do that at all. You can produce the same result using df.loc to conditionally assign a value (note that the blanks in qwe are empty strings rather than NaN, so the condition compares against "" instead of using isnull):
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
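A minimal runnable version of this, reusing the sample data from the top of the question:
import pandas as pd

df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val": ["111", "111", "222", "222"],
                   "qwe": ["", "11", "", "12"]})

# overwrite qwe only where flag is "1" and qwe is still an empty string
df.loc[(df["flag"] == "1") & (df["qwe"] == ""), "qwe"] = "123"
print(df)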

After the edits above, this code works for me:
c1 = dff.combine_first(df)
It produces:
flag qwe val
0 1 123 111
1 0 11 111
2 1 123 222
3 0 12 222
Which is exactly what I was looking for.
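For reference, combine_first aligns on the index and fills whatever is missing in the calling frame from the argument; since dff holds only the flag == "1" rows, rows 1 and 3 fall back to df. A self-contained version of this approach (the .copy() is an addition here, to avoid a SettingWithCopyWarning):
import pandas as pd

df = pd.DataFrame({"flag": ["1", "0", "1", "0"],
                   "val": ["111", "111", "222", "222"],
                   "qwe": ["", "11", "", "12"]})

dff = df.loc[df["flag"] == "1"].copy()
dff["qwe"] = "123"

# rows present in dff win; rows absent from dff come from df
c1 = dff.combine_first(df)
print(c1)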

Related

Merging dataframes with multiple key columns

I'd like to merge this dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,10,100],[2,20,np.nan],[3,30,300]], columns=["A","B","C"])
df1
A B C
0 1 10 100
1 2 20 NaN
2 3 30 300
with this one:
df2 = pd.DataFrame([[1,422],[10,72],[2,278],[300,198]], columns=["ID","Value"])
df2
ID Value
0 1 422
1 10 72
2 2 278
3 300 198
to get an output:
df_output = pd.DataFrame([[1,10,100,422],[1,10,100,72],[2,20,np.nan,278],[3,30,300,198]], columns=["A","B","C","Value"])
df_output
A B C Value
0 1 10 100 422
1 1 10 100 72
2 2 20 NaN 278
3 3 30 300 198
The idea is that for df2 the key column is "ID", while for df1 we have 3 possible key columns ["A","B","C"].
Please notice that the numbers in df2 are chosen to be like this for simplicity, and they can include random numbers in practice.
How do I perform such a merge? Thanks!
IIUC, you need a double merge/join.
First, melt df1 to get a single column, while keeping the index. Then merge to get the matches. Finally join to the original DataFrame.
s = (df1
     .reset_index().melt(id_vars='index')
     .merge(df2, left_on='value', right_on='ID')
     .set_index('index')['Value']
    )
# index
# 0 422
# 1 278
# 0 72
# 2 198
# Name: Value, dtype: int64
df_output = df1.join(s)
output:
A B C Value
0 1 10 100.0 422
0 1 10 100.0 72
1 2 20 NaN 278
2 3 30 300.0 198
Alternative with stack + map:
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value'))
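Note that both variants keep df1's original index, so the two matches for row 0 both carry index 0, while the desired df_output is renumbered 0..3. A final reset_index takes care of that; a minimal sketch, reusing the question's df1 and df2:
# look each stacked value of df1 up in df2, join the matches back,
# then renumber the duplicated index
s = df1.stack().droplevel(1).map(df2.set_index('ID')['Value']).dropna()
df_output = df1.join(s.rename('Value')).reset_index(drop=True)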

Update dataframe column based on another dataframe column without for loop

I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val'] / df2['val'] for matching ids. I can do that by iterating over all df2 rows; since df2 is a subset of df1, some ids may be missing from it, and I want to keep their values unchanged. This is what I have right now:
for row in df2.iterrows():
    df1.loc[df1['id'] == row[1]['id'], 'val'] /= row[1]['val']
df1:
id val
1 5
2 5
3 78
How can I achieve the same without using a for loop, to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
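The fill_value=1 is what keeps unmatched ids unchanged: ids missing from df2 map to NaN, and div treats the missing divisor as 1. A self-contained sketch of this first solution:
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "val": [25, 40, 78]})
df2 = pd.DataFrame({"id": [2, 1], "val": [8, 5]})

# build a per-row divisor by looking df1's ids up in df2;
# id 3 has no match, so its divisor is NaN and fill_value=1 leaves it as-is
divisor = df1["id"].map(df2.set_index("id")["val"])
df1["val"] = df1["val"].div(divisor, fill_value=1)
print(df1)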
Solution with merge with a left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0

Copy data from one data-frame to another and substitute the data depending on values

I have two dataframes, df1 and df2.
The head of df1 looks like:
date last_location cost_factor is_reporting
0 24/02/2014 510 1.0026 0
1 25/02/2014 498 0.9981 0
2 26/02/2014 492 0.9986 4
3 27/02/2014 489 0.9986 4
4 28/02/2014 493 0.9986 0
5 03/03/2014 485 0.9986 0
and the head of df2 looks like:
date dept
0 24/02/2014 A
1 25/02/2014 A
2 26/02/2014 B
3 27/02/2014 B
4 28/02/2014 B
5 03/03/2014 C
I would like to add the is_reporting column from df1 to df2 but instead of using the value from df1 I would like to insert a 1 if the value is anything other than 0. So the desired result would look like:
date dept is_reporting
0 24/02/2014 A 0
1 25/02/2014 A 0
2 26/02/2014 B 1
3 27/02/2014 B 1
4 28/02/2014 B 0
5 03/03/2014 C 0
I think I need to copy is_reporting from df1 into df2 and use replace, but I do not know how to replace any value greater than 0 with 1.
You need DataFrame.merge with a left join, then replace the values in column is_reporting by comparing for not-equal with Series.ne, casting to integers, and overwriting the column with DataFrame.assign:
df3 = df2.merge(df1[['date','is_reporting']], on='date', how='left')
df3 = df3.assign(is_reporting = df3['is_reporting'].ne(0).astype(int))
Or swap operations:
df1 = df1.assign(is_reporting = df1['is_reporting'].ne(0).astype(int))
df3 = df2.merge(df1[['date','is_reporting']], on='date', how='left')
print (df3)
date dept is_reporting
0 24/02/2014 A 0
1 25/02/2014 A 0
2 26/02/2014 B 1
3 27/02/2014 B 1
4 28/02/2014 B 0
5 03/03/2014 C 0
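One caveat: if df2 ever contained dates missing from df1, the left join would leave NaN in is_reporting, and NaN.ne(0) is True, so those rows would silently become 1. A defensive variant (an addition here, not part of the original answer) fills missing matches with 0 before comparing:
df3 = df2.merge(df1[['date', 'is_reporting']], on='date', how='left')
# fillna(0) makes unmatched dates count as "not reporting" instead of 1
df3['is_reporting'] = df3['is_reporting'].fillna(0).ne(0).astype(int)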

Python - One-hot-encode to single column

I'm almost zero experienced with Python, but I'm trying to learn it. I have a Pandas dataframe which came with some dummies. I want to convert them back to a single column, but I simply can't figure out a way to do that. Is there any way?
I have this:
ID var_1 var_2 var_3 var_4
231 1 0 0 0
220 0 1 0 0
303 0 0 1 0
324 0 0 0 1
I need to transform to it:
ID var
231 1
220 2
303 3
324 4
Assuming these really are one-hot encodings, use np.argmax along axis 1 (i.e. across the dummy columns of each row):
pd.DataFrame({'ID' : df['ID'], 'var' : df.iloc[:, 1:].values.argmax(axis=1) + 1})
ID var
0 231 1
1 220 2
2 303 3
3 324 4
However, if "ID" is a part of the index, use this instead:
pd.DataFrame({'ID' : df.index, 'var' : df.values.argmax(axis=1) + 1})
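For completeness, a self-contained version of the argmax approach, assuming the dummy columns are ordered var_1 through var_4 so that column position + 1 is the var number:
import pandas as pd

df = pd.DataFrame({'ID': [231, 220, 303, 324],
                   'var_1': [1, 0, 0, 0],
                   'var_2': [0, 1, 0, 0],
                   'var_3': [0, 0, 1, 0],
                   'var_4': [0, 0, 0, 1]})

# argmax returns the position of the first 1 in each row;
# + 1 turns the 0-based position into the var number
out = pd.DataFrame({'ID': df['ID'],
                    'var': df.iloc[:, 1:].to_numpy().argmax(axis=1) + 1})
print(out)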
Try something new: wide_to_long
s = pd.wide_to_long(df, ['var'], i='ID', j='Var', sep='_')
s[s['var'] == 1].reset_index().drop(columns='var')
Out[593]:
ID Var
0 231 1
1 220 2
2 303 3
3 324 4

Many DataFrames in one output, how to merge. Python

In my code:
def my_def():
    abc_all = open('file.txt')  # where file.txt is a big data file with many lines
    lines = abc_all.readlines()
    lines = lines[4:]
file.txt looks like this (about 120 characters per line, and many lines):
AAA S S SSDAS ASDJAI A 234 33 43 234 2342999 2.31 22 33....
SSS S D W2UUQ Q231WQ A 222 11 23 123 1231299 2.31 22 11....
for line in lines:
    abcd = line[5:-34]
    abcd2 = abcd[:27] + abcd[40:]
    abc = abcd2.split()
    result = pd.DataFrame(abc)
And now I want to save the results, but when I use, for example, result.to_csv(), I receive only the first line in my output file.
I have just printed result and, as I see, I am receiving each line as a separate DataFrame, so that's the reason why my output prints only the first line.
part of result:
0 1 2 3 4 5 6
0 22.2222 -11.1111 222 111 name1 1 l
0 1 2 3 4 5 6
0 33.2222 -11.1111 444 333 name2 1 c
0 1 2 3 4 5 6
0 12.1111 -11.1111 222 111 name3 1 c
How can I save this output as one DataFrame, or how can I merge all those DataFrames into one?
Thanks for the help!
What about building a list of lists first and then turning it into a DataFrame? It might be easier and faster:
results = []
for line in lines:
    abcd = line[5:-34]
    # if we know all rows have the same layout, one split is enough
    results.append(abcd.split())
df = pd.DataFrame(results)
As a one-liner even:
df = pd.DataFrame([line[5:-34].split() for line in lines])
ex:
b = [ "0 22.2222 -11.1111 222 111 name1 1 l", "0 33.2222 -11.1111 444 333 name2 1 c", "0 12.1111 -11.1111 222 111 name3 1 c"]
df = pd.DataFrame([line.split() for line in b])
print(df)
0 1 2 3 4 5 6 7
0 0 22.2222 -11.1111 222 111 name1 1 l
1 0 33.2222 -11.1111 444 333 name2 1 c
2 0 12.1111 -11.1111 222 111 name3 1 c
As stated by @Andrew L, it might be interesting to see if the file isn't already in a format supported by one of the many pandas.read_... functions.
EDIT: seems like a job for pd.read_csv('file.txt', sep=' ') and then deleting the columns you don't want / selecting the ones you want to keep.
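A hedged sketch of that suggestion, assuming the file is whitespace-delimited and the first four lines are header material to skip (the column slice below is illustrative, not taken from the question):
import pandas as pd

# read the whole whitespace-delimited file in one go, skipping the 4 header lines
df = pd.read_csv('file.txt', sep=r'\s+', skiprows=4, header=None)
# then keep only the columns of interest, e.g. a hypothetical slice:
df = df.iloc[:, 1:8]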
