I have the following wide df1:
Area geotype type ...
1 a 2 ...
1 a 1 ...
2 b 4 ...
4 b 8 ...
And the following two-column df2:
Area geotype
1 London
4 Cambridge
And I want the following:
Area geotype type ...
1 London 2 ...
1 London 1 ...
2 b 4 ...
4 Cambridge 8 ...
So I need to match on the non-unique Area column and, only where there is a match, replace the values in the geotype column.
Apologies if this is a duplicate, I did actually search hard for a solution to this.
Use update + map:
df1.geotype.update(df1.Area.map(df2.set_index('Area').geotype))
Area geotype type
0 1 London 2
1 1 London 1
2 2 b 4
3 4 Cambridge 8
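For reference, here is the same approach as a self-contained snippet (the DataFrame constructors are an assumption based on the sample data in the question). The reason unmatched rows survive: map returns NaN for Areas missing from df2, and Series.update skips NaN values.
import pandas as pd

df1 = pd.DataFrame({'Area': [1, 1, 2, 4],
                    'geotype': ['a', 'a', 'b', 'b'],
                    'type': [2, 1, 4, 8]})
df2 = pd.DataFrame({'Area': [1, 4],
                    'geotype': ['London', 'Cambridge']})

# build an Area -> geotype lookup, map it over df1.Area, and let
# update() overwrite only the non-NaN (i.e. matched) positions
df1.geotype.update(df1.Area.map(df2.set_index('Area').geotype))
print(df1)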
I think you can use map with a Series created by set_index, and then fill the NaN values with combine_first or fillna:
df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).combine_first(df1.geotype)
#df1.geotype = df1.ID.map(df2.set_index('ID')['geotype']).fillna(df1.geotype)
print (df1)
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8
Another solution with mask and numpy.in1d:
import numpy as np

df1.geotype = df1.geotype.mask(np.in1d(df1.ID, df2.ID),
                               df1.ID.map(df2.set_index('ID')['geotype']))
print (df1)
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8
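Note that np.in1d(df1.ID, df2.ID) can be swapped for pandas' own isin, which drops the numpy dependency; a sketch of the same solution:
df1.geotype = df1.geotype.mask(df1.ID.isin(df2.ID),
                               df1.ID.map(df2.set_index('ID')['geotype']))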
EDIT by comment:
The problem is non-unique ID values in df2, like:
df2 = pd.DataFrame({'ID': [1, 1, 4], 'geotype': ['London', 'Paris', 'Cambridge']})
print (df2)
ID geotype
0 1 London
1 1 Paris
2 4 Cambridge
So map cannot choose the right value and raises an error.
The solution is to remove duplicates with drop_duplicates, which by default keeps the first value:
df2 = df2.drop_duplicates('ID')
print (df2)
ID geotype
0 1 London
2 4 Cambridge
Or, if you need to keep the last value:
df2 = df2.drop_duplicates('ID', keep='last')
print (df2)
ID geotype
1 1 Paris
2 4 Cambridge
If you cannot remove duplicates, another solution is an outer merge, but it produces duplicated rows wherever an ID is duplicated in df2:
df1 = pd.merge(df1, df2, on='ID', how='outer', suffixes=('_',''))
df1.geotype = df1.geotype.combine_first(df1.geotype_)
df1 = df1.drop('geotype_', axis=1)
print (df1)
ID type geotype
0 1 2 London
1 1 2 Paris
2 2 1 a
3 3 4 b
4 4 8 Cambridge
alternative solution:
In [78]: df1.loc[df1.ID.isin(df2.ID), 'geotype'] = df1.ID.map(df2.set_index('ID').geotype)
In [79]: df1
Out[79]:
ID geotype type
0 1 London 2
1 2 a 1
2 3 b 4
3 4 Cambridge 8
UPDATE: this answers the updated question - if you have duplicates in the Area column of the df2 DF:
In [152]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.set_index('Area').geotype)
...
skipped
...
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Get rid of the duplicates:
In [153]: df1.loc[df1.Area.isin(df2.Area), 'geotype'] = df1.Area.map(df2.drop_duplicates(subset='Area').set_index('Area').geotype)
In [154]: df1
Out[154]:
Area geotype type
0 1 London 2
1 1 London 1
2 2 b 4
3 4 Cambridge 8
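If this dedup-then-map pattern recurs, it can be bundled into a small helper. replace_from_lookup is a hypothetical name, and the sketch just wraps the steps shown above:
def replace_from_lookup(df, lookup, key, col, keep='first'):
    # map df[key] through a deduplicated lookup table, then overwrite
    # df[col] only where a match was found
    mapped = df[key].map(lookup.drop_duplicates(subset=key, keep=keep)
                               .set_index(key)[col])
    df.loc[mapped.notna(), col] = mapped
    return df

# usage: replace_from_lookup(df1, df2, key='Area', col='geotype')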
Related
I have a dataframe df1 looks like this:
id A B
0 1 10 5
1 1 11 6
2 2 10 7
3 2 11 8
And another dataframe df2:
id A
0 1 3
1 2 4
Now I want to replace A column in df1 with the value of A in df2 based on id, so the result should look like this:
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
One way is to drop column A in df1 first and merge df2 into df1 on id, like df1 = df1.drop(['A'], axis=1).merge(df2, how='left', on='id'), but if there are, say, 10 columns in df2, that gets unwieldy. Is there a more elegant way to do this?
Here is one way to do it, using DataFrame.update. However, it requires setting the index to id so the two DataFrames can be matched:
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df1.update(df2)
df1['A'] = df1['A'].astype(int)  # update() upcasts the values to float by default
df1.reset_index()
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
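As a side note, DataFrame.update aligns on both the index and the column names, so only the shared A column is overwritten and B is left alone. A non-mutating variant of the same idea, starting again from the question's original df1/df2 (a sketch):
out = df1.set_index('id')        # set_index returns a copy, df1 stays intact
out.update(df2.set_index('id'))  # overwrites only the shared 'A' column
out = out.reset_index()
out['A'] = out['A'].astype(int)  # update() upcasts to float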
Merge just the id column from df1 with df2, and then combine_first the result onto the original DataFrame:
df1 = df1[['id']].merge(df2).combine_first(df1)
print(df1)
Output:
A B id
0 3 5 1
1 3 6 1
2 4 7 2
3 4 8 2
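Note that combine_first unions and sorts the column labels, which is why the output shows A B id rather than the original order; if that matters, just reselect the columns:
df1 = df1[['id', 'A', 'B']]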
I have two dataframes with the same column order but different column names and different rows. df2 rows vary from df1 rows.
df1= col_id num name
0 1 3 linda
1 2 4 James
df2= id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam
This is the output I need: it shows rows with both the OLD and NEW values so the user can clearly see what changed between the two dataframes:
result col_id num name
0 1 was 3| now 2 was linda| now granpa
1 2 was 4| now 6 was James| now linda
2 was | now 3 was | now 7 was | now sam
Since your goal is just to compare differences, use DataFrame.compare instead of aggregating into strings.
However,
DataFrame.compare can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames
So we just need to align the row/column indexes, either via merge or reindex.
Align via merge
Outer-merge the two dfs:
merged = df1.merge(df2, how='outer', left_on='col_id', right_on='id')
# col_id num name_x id no name_y
# 0 1 3 linda 1 2 granpa
# 1 2 4 james 2 6 linda
# 2 NaN NaN NaN 3 7 sam
Divide the merged frame into left/right frames and align their columns with set_axis:
cols = df1.columns
left = merged.iloc[:, :len(cols)].set_axis(cols, axis=1)
# col_id num name
# 0 1 3 linda
# 1 2 4 james
# 2 NaN NaN NaN
right = merged.iloc[:, len(cols):].set_axis(cols, axis=1)
# col_id num name
# 0 1 2 granpa
# 1 2 6 linda
# 2 3 7 sam
Compare the aligned left/right frames (use keep_equal=True to show equal cells):
left.compare(right, keep_shape=True, keep_equal=True)
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
left.compare(right, keep_shape=True)
# col_id num name
# self other self other self other
# 0 NaN NaN 3 2 linda granpa
# 1 NaN NaN 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Align via reindex
If you are 100% sure that one df is a subset of the other, then reindex the subsetted rows.
In your example, df1 is a subset of df2, so reindex df1:
(df1.assign(id=df1.col_id)          # copy col_id (we need the original col_id after reindexing)
    .set_index('id')                # set index to the copied id
    .reindex(df2.id)                # reindex against df2's id
    .reset_index(drop=True)         # remove the copied id
    .set_axis(df2.columns, axis=1)  # align column names
    .compare(df2, keep_equal=True, keep_shape=True))
# col_id num name
# self other self other self other
# 0 1 1 3 2 linda granpa
# 1 2 2 4 6 james linda
# 2 NaN 3 NaN 7 NaN sam
Nullable integers
Normally int cannot mix with nan, so pandas converts to float. To keep the int values as int (like the examples above):
Ideally we'd convert the int columns to nullable integers with astype('Int64') (capital I).
However, there is currently a comparison bug with Int64, so just use astype(object) for now.
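A sketch of that workaround, applied to this example's column names before the merge/reindex steps above; object dtype lets the ints sit next to NaN without being upcast to float:
df1 = df1.astype({'col_id': object, 'num': object})
df2 = df2.astype({'id': object, 'no': object})
# ...then rerun the alignment and compare steps; ints stay ints alongside NaN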
If I understand correctly, you want something like this:
new_df = df1.drop(['name', 'num'], axis=1).merge(df2.rename({'id': 'col_id'}, axis=1), how='outer')
Output:
>>> new_df
col_id no name
0 1 2 granpa
1 2 6 linda
2 3 7 sam
I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 with the difference from the next lower price of the same group?
for example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, NaN, but I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price']
                            .diff(-1))
I think you can first remove duplicates across all the columns used for groupby with diff, create the new column in the filtered data, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname','beds','baths','price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back-fills the 0 values within each group, which avoids wrong output for one-row groups (where data could otherwise be pulled in from another group):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
My original dataframe looks like this (only the first rows):
categories id products
0 A 1 a
1 B 1 a
2 C 1 a
3 A 1 b
4 B 1 b
5 A 2 c
6 B 2 c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The dataframe then looks like this (I added an outlier from my DF too):
id products A B C
0 1 2 2 2 1
1 2 1 1 1 0
2 3 50 1 1 30
Now I am trying to remove the outliers in my new DF:
#remove outliers
del df2['id']
df2 = df2.loc[df2['products']<=20,[str(i) for i in df2.columns]]
What I then get is:
products A B C
0 2 NaN NaN NaN
1 1 NaN NaN NaN
It removes the outliers, but why do I get only NaNs now in the category columns?
The [str(i) for i in df2.columns] selector asks .loc for string versions of the labels; if crosstab produced non-string column labels (e.g. numeric categories), those names don't exist, and older pandas silently reindexes them as all-NaN columns. You don't need the column selector at all; just filter the rows:
df2 = df2.loc[df2['products'] <= 20]
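If you do want an explicit column subset, pass the labels as they are, without the str() conversion (the column names here assume the sample above):
df2 = df2.loc[df2['products'] <= 20, ['products', 'A', 'B', 'C']]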
I have a dataframe that looks something like this:
df = pd.DataFrame({'Name':['a','a','a','a','b','b','b'], 'Year':[1999,1999,1999,2000,1999,2000,2000], 'Name_id':[1,1,1,1,2,2,2]})
Name Name_id Year
0 a 1 1999
1 a 1 1999
2 a 1 1999
3 a 1 2000
4 b 2 1999
5 b 2 2000
6 b 2 2000
What I'd like to have is a new column 'yr_name_id' that increases for each unique Name_id-Year combination and then begins anew with each new Name_id.
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 1
5 b 2 2000 2
6 b 2 2000 2
I've tried a variety of things and looked at a few posts on groupby and enumerate.
At first I tried creating a unique dictionary after combining Name_id and Year and then using map to assign values, but when I try to combine Name_id and Year as strings via:
df['yr_name_id'] = str(df['Name_id']) + str(df['Year'])
The new column then contains the string representation of the whole Series (0 1\n1 1\n2 1\n3 1\n4 2\n5 2...) repeated in every row, which I don't really understand.
A more promising approach, where I think I just need help with the lambda, is using groupby:
df['yr_name_id'] = df.groupby(['Name_id', 'Year'])['Name_id'].transform(lambda x: )#unsure from this point
I am very unfamiliar with lambda's so any guidance on how I might do this would be greatly appreciated.
IIUC you can do it this way:
In [99]: df['yr_name_id'] = pd.Categorical(pd.factorize(df['Name_id'].astype(str) + '-' + df['Year'].astype(str))[0] + 1)
In [100]: df
Out[100]:
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 3
5 b 2 2000 4
6 b 2 2000 4
In [101]: df.dtypes
Out[101]:
Name object
Name_id int64
Year int64
yr_name_id category
dtype: object
But looking at your desired DF, it looks like you want to categorize just the Year column, not the combination of Name_id + Year:
In [102]: df['yr_name_id'] = pd.Categorical(pd.factorize(df.Year)[0] + 1)
In [103]: df
Out[103]:
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 1
5 b 2 2000 2
6 b 2 2000 2
In [104]: df.dtypes
Out[104]:
Name object
Name_id int64
Year int64
yr_name_id category
dtype: object
Use itertools.count:
from itertools import count
counter = count(1)
df['yr_name_id'] = (df.groupby(['Name_id', 'Year'])['Name_id']
                      .transform(lambda x: next(counter)))
Output:
Name Name_id Year yr_name_id
0 a 1 1999 1
1 a 1 1999 1
2 a 1 1999 1
3 a 1 2000 2
4 b 2 1999 3
5 b 2 2000 4
6 b 2 2000 4
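If the counter should restart per Name_id exactly as in the desired output (group b going back to 1, 2), a dense rank of Year within each Name_id gives that directly (a sketch):
df['yr_name_id'] = (df.groupby('Name_id')['Year']
                      .rank(method='dense')
                      .astype(int))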