Merge pandas DataFrames based on row values - Python

I have two .tsv files that look like:
ID  prop  name   size
A   x     rob    2
B   y     sally  3
C   z     debby  5
D   w     meg    6
and
ID  lst_name  area
A   sanches   4
D   smith     7
C   roberts   8
I have them loaded into pandas DataFrames and would like to merge them so I get a new DataFrame:
ID-name  prop  name   size  lst_name  area
A        x     rob    2     sanches   4
B        y     sally  3
C        z     debby  5     roberts   8
D        w     meg    6     smith     7
I have been trying to accomplish this with pd.merge() but am having issues with the following:
df = pd.DataFrame.from_csv("a.tsv", sep='\t')
df1 = pd.DataFrame.from_csv("b.tsv", sep='\t')
result = pd.merge(df, df1, how='inner',on=["ID","ID-name"])
Is it possible to accomplish a merge like this with pandas?
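(A side note on loading: pd.DataFrame.from_csv was deprecated in pandas 0.21 and removed in 1.0. The modern equivalent, which also keeps ID as a regular column rather than the index, would be:)
df = pd.read_csv("a.tsv", sep='\t')
df1 = pd.read_csv("b.tsv", sep='\t')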

What you need is a left join (or an outer join, depending on your case), since in this sample you also want to see the record for B even though it has no match in df1.
result = pd.merge(df, df1, how="left", on="ID")  # on takes the shared key column once
  ID prop   name  size lst_name  area
0  A    x    rob     2  sanches   4.0
1  B    y  sally     3      NaN   NaN
2  C    z  debby     5  roberts   8.0
3  D    w    meg     6    smith   7.0
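If b.tsv could also contain IDs missing from a.tsv, an outer join keeps unmatched rows from both sides; a minimal sketch on the same frames:
result = pd.merge(df, df1, how="outer", on="ID")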

Here's one way to do it using join
df1 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                    'prop': ['x', 'y', 'z', 'w'],
                    'name': ['rob', 'sally', 'debby', 'meg'],
                    'size': [2, 3, 5, 6]})
df2 = pd.DataFrame({'ID': ['A', 'D', 'C'],
                    'lst_name': ['sanches', 'smith', 'roberts'],
                    'area': [4, 7, 8]})
df1.set_index('ID').join(df2.set_index('ID')).reset_index()
>>>
  ID prop   name  size lst_name  area
0  A    x    rob     2  sanches   4.0
1  B    y  sally     3      NaN   NaN
2  C    z  debby     5  roberts   8.0
3  D    w    meg     6    smith   7.0
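Note that DataFrame.join defaults to how='left', which is why B survives here with NaN in lst_name and area; pass how='inner' to drop unmatched rows instead.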


Apply a EWMA rolling window function in Pandas but avoid initial NAN values

I have the following dataframe and subsequent EWMA function:
import numpy as np
import pandas as pd
from functools import partial

# Create DF
d = {'Name': ['Jim', 'Jim', 'Jim', 'Jim', 'Jim', 'Jim', 'Jim', 'Jim'],
     'col2': [5, 5, 5, 5, 5, 5, 5, 5]}
df1 = pd.DataFrame(data=d)

# EWMA 5
alpha = 1 - np.log(2) / 3
window5 = 5
weights5 = list(reversed([(1 - alpha) ** n for n in range(window5)]))
ewma5 = partial(np.average, weights=weights5)
df1['Rolling5'] = df1.groupby('Name')['col2'].transform(lambda x: x.rolling(5).apply(ewma5))
df1
Which results in the first four rows being NaN, since the window is not yet full:
  Name  col2  Rolling5
0  Jim     5       NaN
1  Jim     5       NaN
2  Jim     5       NaN
3  Jim     5       NaN
4  Jim     5       5.0
5  Jim     5       5.0
6  Jim     5       5.0
7  Jim     5       5.0
I have specified a rolling window of 5, but does anyone know how I can get the EWMA to calculate for the first through fourth rows even though there aren't 5 values yet?
E.g. for row 1, calculate for just row 1 (which would be the same value), and for row 2 calculate the EWMA of rows 1 and 2. Also open to more efficient ways of doing this!
Thanks very much!
You can use ewm and set min_periods in rolling to 1:
def f(x):
    return x.ewm(alpha=1 - np.log(2) / 3).mean().iloc[-1]

df1['Rolling5'] = df1.groupby('Name')['col2'].transform(
    lambda x: x.rolling(5, min_periods=1).apply(f))
Comparing with the original (the output below uses col2 values 1 through 8 rather than the all-fives sample, so the difference is easier to see):
df1['Rolling5_original'] = df1.groupby('Name')['col2'].transform(
    lambda x: x.rolling(5).apply(ewma5))
df1['Rolling5'] = df1.groupby('Name')['col2'].transform(
    lambda x: x.rolling(5, min_periods=1).apply(f))
df1
Output:
  Name  col2  Rolling5_original  Rolling5
0  Jim     1                NaN  1.000000
1  Jim     2                NaN  1.812315
2  Jim     3                NaN  2.736992
3  Jim     4                NaN  3.710959
4  Jim     5           4.702821  4.702821
5  Jim     6           5.702821  5.702821
6  Jim     7           6.702821  6.702821
7  Jim     8           7.702821  7.702821
You're close. If you specify min_periods=1, the windows coming out of rolling start at size 1, expand to 5, and stay there. As for the average, we pass only the matching tail of the weights to cover the windows that fall short:
weights = (1 - alpha) ** np.arange(5)[::-1]
df1["rolling_5"] = (df1.col2
                       .rolling(5, min_periods=1)
                       .apply(lambda win: np.average(win, weights=weights[-win.size:])))
to get
Name col2 rolling_5
0 Jim 5 5.0
1 Jim 5 5.0
2 Jim 5 5.0
3 Jim 5 5.0
4 Jim 5 5.0
5 Jim 5 5.0
6 Jim 5 5.0
7 Jim 5 5.0
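Since the question is also open to more efficient approaches: if truncating the history to the last five rows is not essential, pandas' built-in ewm computes an expanding EWMA directly and avoids a Python callback per window. A sketch (the ewm_expanding column name is just illustrative, and its values diverge from the truncated five-row window once more than five rows have been seen):
alpha = 1 - np.log(2) / 3
df1['ewm_expanding'] = df1.groupby('Name')['col2'].transform(
    lambda s: s.ewm(alpha=alpha, min_periods=1).mean())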

Issue with removing duplicates in pandas dataframe

Edit: This has been solved thanks to fsl. The duplicates were removed; the issue was the index, which needed to be reset.
I have this dataframe:
Ubicacion lat lon
0 a 19.28034 -99.17121
1 b 19.28333 -99.17535
2 c 19.28028 -99.16887
3 a 19.28034 -99.17121
4 b 19.28333 -99.17535
5 c 19.28028 -99.16887
6 b 19.28333 -99.17535
7 d 19.29259 -99.17757
8 d 19.29259 -99.17757
9 d 19.29259 -99.17757
And I want to remove all duplicate rows, so I use:
ubicaciones_finales = ubicaciones_finales.drop_duplicates(keep="first")
And I get this:
Ubicacion lat lon
0 a 19.28034 -99.17121
1 b 19.28333 -99.17535
2 c 19.28028 -99.16887
7 d 19.29259 -99.17757
Everything seems fine except that rows go 0, 1, 2 and then 7. So when I run:
for k, row in ubicaciones_finales.iterrows():
    print(k)
I get:
0
1
2
7
How do I solve this? By the way, I already checked the pandas documentation; its example for df.drop_duplicates() shows the same behavior:
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
The index goes from 0 to 2 without 1. Thank you for your time.
IIUC, go with reset_index or simply pass ignore_index=True:
df = df.drop_duplicates(keep='first').reset_index(drop=True)
# or
df = df.drop_duplicates(keep='first', ignore_index=True)
Output:
Ubicacion lat lon
0 a 19.28034 -99.17121
1 b 19.28333 -99.17535
2 c 19.28028 -99.16887
3 d 19.29259 -99.17757
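Note: the ignore_index parameter of drop_duplicates was added in pandas 1.0; on older versions, use the reset_index(drop=True) form.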

Insert values into database from data frame only into corresponding rows

I have a dataframe whose values vary each time the script is run, and those values are inserted directly into a database.
For example, on first run, it may have:
column1 column2
A 2
B 1
C 3
D 5
while on other run, it may have:
column1 column2
A 4
B 6
D 8
What I am able to produce for now inside the database:
column1 run1 run2
A 2 4
B 1 6
C 3 8
D 5 -
What I want instead:
column1 run1 run2
A 2 4
B 1 6
C 3 -
D 5 8
Please help me find a workaround, if not the complete code.
Set the column1 as index and concat on axis=1:
pd.concat([df1.set_index('column1'), df2.set_index('column1')], axis=1, sort=False)
# for an exact match: append .fillna('-') to the line above
   column2  column2
A        2      4.0
B        1      6.0
C        3      NaN
D        5      8.0
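To match the run1/run2 headers from the desired output, the concatenated columns can be relabeled; a sketch, with fillna('-') mirroring the dash placeholder above:
runs = pd.concat([df1.set_index('column1'), df2.set_index('column1')],
                 axis=1, sort=False).fillna('-')
runs.columns = ['run1', 'run2']  # one label per run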
I am writing this in R; you can convert it to Python:
df1 = data.frame(col_1 = c('a','b','c','d'),col_2 = c(2,1,3,5))
df2 = data.frame(col_1 = c('a','b','d'),col_2 = c(4,6,8))
finaldf= merge(df1,df2, by = 'col_1' , all = TRUE)
You will get the output below:
col_1 col_2.x col_2.y
a 2 4
b 1 6
c 3 NA
d 5 8
If you don't want NA values, replace them.
Use pd.merge (note the key column is column1, and a left merge on df1 keeps all of its rows):
pd.merge(df1, df2, how='left', on='column1')
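A left merge keeps only IDs already present in df1; if a later run can introduce new IDs, an outer merge keeps rows from both sides. A sketch using the question's column names (the suffixes labels are just illustrative, since both frames name their value column column2):
pd.merge(df1, df2, how='outer', on='column1',
         suffixes=('_run1', '_run2')).fillna('-')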

Pandas Dataframe replace values in a Series

I am trying to update my_df based on conditional selection as in:
my_df[my_df['group'] == 'A']['rank'].fillna('A+')
However, this is not persistent, e.g. my_df still has NaN or NaT values, and I am not sure how to do this in place. Please advise on how to persist the update to my_df.
Create boolean mask and assign to filtered column rank:
import numpy as np
import pandas as pd

my_df = pd.DataFrame({'group': list('AAAABC'),
                      'rank': ['a', 'b', np.nan, np.nan, 'c', np.nan],
                      'C': [7, 8, 9, 4, 2, 3]})
print(my_df)
group rank C
0 A a 7
1 A b 8
2 A NaN 9
3 A NaN 4
4 B c 2
5 C NaN 3
m = my_df['group'] == 'A'
my_df.loc[m, 'rank'] = my_df.loc[m, 'rank'].fillna('A+')
print(my_df)
group rank C
0 A a 7
1 A b 8
2 A A+ 9
3 A A+ 4
4 B c 2
5 C NaN 3
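The loc indexer selects and assigns in a single step, so there is no chained indexing and the update lands in my_df itself.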
You need to assign it back:
my_df.loc[my_df['group'] == 'A', 'rank'] = \
    my_df.loc[my_df['group'] == 'A', 'rank'].fillna('A+')
Your operations are not in-place, so you need to assign back to a variable. In addition, chained indexing is not recommended.
One option is pd.Series.mask with a Boolean series:
# data from #jezrael
df['rank'].mask((df['group'] == 'A') & df['rank'].isnull(), 'A+', inplace=True)
print(df)
C group rank
0 7 A a
1 8 A b
2 9 A A+
3 4 A A+
4 2 B c
5 3 C NaN
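One caveat on the inplace=True form: with pandas' copy-on-write behavior (optional in 2.x and slated to become the default), mask(..., inplace=True) on the df['rank'] selection may no longer propagate back to df. Assigning the result back is the safer pattern; a sketch:
df['rank'] = df['rank'].mask((df['group'] == 'A') & df['rank'].isnull(), 'A+')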

Replacing non-null values with column names

Given the following data frame:
import numpy as np
import pandas as pd

d = pd.DataFrame({'a': [1, 2, 3], 'b': [np.nan, 5, 6]})
d
a b
0 1 NaN
1 2 5.0
2 3 6.0
I would like to replace all non-null values with the column name.
Desired result:
a b
0 a NaN
1 a b
2 a b
In reality, I have many columns.
Thanks in advance!
Update to answer from root:
To perform this on a subset of columns:
d.loc[:, d.columns[3:]] = np.where(d.loc[:, d.columns[3:]].notnull(),
                                   d.columns[3:],
                                   d.loc[:, d.columns[3:]])
Using numpy.where and notnull:
d[:] = np.where(d.notnull(), d.columns, d)
The resulting output:
a b
0 a NaN
1 a b
2 a b
Edit
To select specific columns:
cols = d.columns[3:] # or whatever Index/list-like of column names
d[cols] = np.where(d[cols].notnull(), cols, d[cols])
I can think of one possibility using apply/transform:
In [1610]: d.transform(lambda x: np.where(x.isnull(), x, x.name))
Out[1610]:
a b
0 a nan
1 a b
2 a b
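(The lowercase nan here is expected: np.where returns a plain object array, so the missing value prints as a float nan inside an object column rather than pandas' NaN.)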
You could also use df.where. Note that the column labels must be tiled row-wise; np.tile does that, whereas repeat(...).reshape(...) would interleave the labels incorrectly:
In [1627]: d.where(d.isnull(), np.tile(d.columns.values, (len(d), 1)))
Out[1627]:
   a    b
0  a  NaN
1  a    b
2  a    b
