How to merge two datasets based on conditions - python

I'm attempting to merge two datasets in Python based on 3 conditions: the rows have to have the same longitude, latitude, and month of a specific year. One dataset has about 16k rows and the other about 1.7k.
A simple example of the inputs and expected output is as follows:
>df1
long lat date proximity
5 8 23/06/2009 Near
6 10 05/10/2012 Far
8 6 19/02/2010 Near
3 4 30/04/2014 Near
5 8 01/06/2009 Far
>df2
long lat date mine
5 8 10/06/2009 1
8 6 24/02/2010 0
7 2 19/04/2014 1
3 4 30/04/2013 1
If any condition is not met, the merged "mine" value should be 0. How would I merge to get:
long lat date proximity mine
5 8 23/06/2009 Near 1
6 10 05/10/2012 Far 0
8 6 19/02/2010 Near 0
3 4 30/04/2014 Near 0
5 8 01/06/2009 Far 1
The date column is not necessary in the output if that makes it easier.

Here you go:
df1['year-month'] = pd.to_datetime(df1['date'], format='%d/%m/%Y').dt.strftime('%Y/%m')
df2['year-month'] = pd.to_datetime(df2['date'], format='%d/%m/%Y').dt.strftime('%Y/%m')
joined = df1.merge(df2,
                   how='left',
                   on=['long', 'lat', 'year-month'],
                   suffixes=['', '_r']).drop(columns=['date_r', 'year-month'])
joined['mine'] = joined['mine'].fillna(0).astype(int)
print(joined)
Output
long lat date proximity mine
0 5 8 23/06/2009 Near 1
1 6 10 05/10/2012 Far 0
2 8 6 19/02/2010 Near 0
3 3 4 30/04/2014 Near 0
4 5 8 01/06/2009 Far 1
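One caveat worth noting (an assumption about the data, not something stated in the question): if df2 can hold more than one row for the same long/lat/month, the left merge will duplicate the matching df1 rows. Deduplicating df2 on the key first avoids that:
df2_unique = df2.drop_duplicates(subset=['long', 'lat', 'year-month'])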

First, extract the month and year from the date column and assign them to a temporary column mon-year. Then use DataFrame.merge to left-merge df1 and df2 on long, lat and mon-year, use Series.fillna to fill the NaN values in the mine column with 0, and finally use DataFrame.drop to drop the temporary mon-year column:
df1['mon-year'] = df1['date'].str.extract(r'/(.*)')
df2['mon-year'] = df2['date'].str.extract(r'/(.*)')
# OR we can use pd.to_datetime,
# df1['mon-year'] = pd.to_datetime(df1['date'], format='%d/%m/%Y').dt.strftime('%m-%Y')
# df2['mon-year'] = pd.to_datetime(df2['date'], format='%d/%m/%Y').dt.strftime('%m-%Y')
df3 = df1.merge(
    df2.drop(columns='date'),
    on=['long', 'lat', 'mon-year'], how='left').drop(columns='mon-year')
df3['mine'] = df3['mine'].fillna(0)
Result:
# print(df3)
long lat date proximity mine
0 5 8 23/06/2009 Near 1.0
1 6 10 05/10/2012 Far 0.0
2 8 6 19/02/2010 Near 0.0
3 3 4 30/04/2014 Near 0.0
4 5 8 01/06/2009 Far 1.0

You could merge using multiple keys as follows:
df_1.merge(df_2, how='left', left_on=['long', 'lat', 'date'], right_on=['long', 'lat', 'date'])
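Since left_on and right_on are identical here, on=['long', 'lat', 'date'] alone would do. Note also that merging on the raw date only matches identical days; for the month-of-a-given-year matching asked about, the key needs to be normalised first. A period-based variant (a minimal sketch; dt.to_period('M') is just one way to express month granularity):
df1['month'] = pd.to_datetime(df1['date'], format='%d/%m/%Y').dt.to_period('M')
df2['month'] = pd.to_datetime(df2['date'], format='%d/%m/%Y').dt.to_period('M')
merged = df1.merge(df2.drop(columns='date'), on=['long', 'lat', 'month'], how='left').drop(columns='month')
merged['mine'] = merged['mine'].fillna(0).astype(int)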

How to compare two data frames and get the unmatched rows using python? [duplicate]

I have two data frames, df1 and df2. df1 contains 6 records and df2 contains 4 records. I want to get the records that don't match. When I try, I get the error ValueError: Can only compare identically-labelled DataFrame objects. I guess this is because df1 has 6 rows and df2 has 4, but how do I compare the two and get the unmatched rows?
code
df1=
a b c
0 1 2 3
1 4 5 6
2 3 5 5
3 5 6 7
4 6 7 8
5 6 6 6
df2 =
a b c
0 3 5 5
1 5 6 7
2 6 7 8
3 6 6 6
index = (df1 != df2).any(axis=1)
df3 = df1.loc[index]
which gives:
ValueError: Can only compare identically-labelled DataFrame objects
Expected output:
a b c
0 1 2 3
1 4 5 6
I know that the error is due to the length but is there any way where we can compare two data frames and get the unmatched records out of it?
Use df.merge with indicator=True and keep all rows except those marked "both":
In [173]: df = df1.merge(df2, indicator=True, how='outer').query('_merge != "both"').drop(columns='_merge')
In [174]: df
Out[174]:
a b c
0 1 2 3
1 4 5 6
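If you only want the rows of df1 that have no match in df2 (as in the expected output), a left merge with the "left_only" marker is a slight variation on the same idea:
df1.merge(df2, indicator=True, how='left').query('_merge == "left_only"').drop(columns='_merge')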
MultiIndex.from_frame + isin
We can use MultiIndex.from_frame on both df1 and df2 to create the corresponding multi-indices. Then use isin to test whether each entry of the index built from df1 is a member of the index built from df2; the resulting boolean mask can be inverted to keep the non-matching rows.
i1 = pd.MultiIndex.from_frame(df1)
i2 = pd.MultiIndex.from_frame(df2)
df1[~i1.isin(i2)]
Result
a b c
0 1 2 3
1 4 5 6
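The same mask works in the other direction if you ever need the rows of df2 that are missing from df1:
df2[~i2.isin(i1)]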

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have a pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more efficient/elegant approach? Also, is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
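If the data does need sorting first, a minimal sketch (assuming the two smallest values per id are wanted, as in the example):
df.sort_values(['id', 'value']).groupby('id').head(2).reset_index(drop=True)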
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
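For example, a short sketch of dropping that inner index level so the result is indexed by id only:
df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)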
Sometimes sorting the whole dataset ahead of time is very time-consuming. We can group by first and take the top k within each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same as the value given to nlargest: the number of rows to keep per group.
reset_index is optional and not necessary.
This works for duplicated values
If there are duplicates among the top-n values and you want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
If we want de-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed they were 24-150 times faster than those solutions.
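A rough sketch of how such a comparison could be set up in IPython (the synthetic frame is an assumption; exact timings will vary by machine and pandas version):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({'id': rng.integers(0, 8000, 100_000),
                    'value': rng.random(100_000)})

N = 2
%timeit big[big.groupby('id')['value'].rank(method='first', ascending=False) <= N]
%timeit big.groupby('id', group_keys=False).apply(lambda x: x.nlargest(N, 'value'))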
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

Pandas Dataframe Create New Column That is Row Below Current Row's Value [duplicate]

I've got a pandas dataframe. I want to 'lag' one of my columns: for example, shift the entire 'gdp' column up by one and then remove the excess row at the bottom so that all columns are of equal length again.
df =
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
df_lag =
y gdp cap
0 1 3 5
1 2 7 9
2 8 4 2
3 3 7 7
Any way to do this?
In [44]: df['gdp'] = df['gdp'].shift(-1)
In [45]: df
Out[45]:
y gdp cap
0 1 3 5
1 2 7 9
2 8 4 2
3 3 7 7
4 6 NaN 7
In [46]: df[:-1]
Out[46]:
y gdp cap
0 1 3 5
1 2 7 9
2 8 4 2
3 3 7 7
Shift the gdp column up:
df.gdp = df.gdp.shift(-1)
and then remove the last row.
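The removal step might look like this (one of several equivalent ways):
df = df.iloc[:-1]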
Time moves on, and the current pandas documentation recommends this way:
df.loc[:, 'gdp'] = df.gdp.shift(-1)
To shift by 5 values, for example, and also get rid of the NaN rows without having to keep track of how many values you shifted by:
df['gdp'] = df['gdp'].shift(-5)
df = df.dropna()
First shift the column:
df['gdp'] = df['gdp'].shift(-1)
Second, remove the last row, which contains a NaN cell:
df = df[:-1]
Third reset the index:
df = df.reset_index(drop=True)
df.gdp = df.gdp.shift(-1)            ## shift up
df.drop(df.index[-1], inplace=True)  ## removing the last row

Calculating CAGR row by row in a pandas data frame?

I am working with company data. I have a data set of roughly 1900 companies (index) and 30 variables per company (columns). These variables always come in groups of three (three periods). It basically looks like this:
df = pd.DataFrame({'id' : ['1','2','3','7'],
                   'revenue_0' : [7,2,5,4],
                   'revenue_1' : [5,6,3,1],
                   'revenue_2' : [1,9,4,8],
                   'profit_0' : [3,6,4,4],
                   'profit_1' : [4,6,9,1],
                   'profit_2' : [5,5,9,8]})
I am trying to compute the compound annual growth rate (CAGR) for e.g. revenue for each company (id), such that revenue_cagr = ((revenue_2/revenue_0)^(1/3))-1
I would like to pass a function to a set of columns row by row - at least, that is my idea.
def CAGR(start_value, end_value, periods):
    return (end_value / start_value) ** (1 / periods) - 1
Is it possible to apply this function row by row to a set of columns (maybe with for i, row in df.iterrows(): or df.apply())? Or is there a smarter way to do this?
Update
The desired outcome - examplified with the column revenue_cagr - should look as follows:
df = pd.DataFrame({'id' : ['1','2','3','7'],
                   'revenue_0' : [7,2,5,4],
                   'revenue_1' : [5,6,3,1],
                   'revenue_2' : [1,9,4,8],
                   'profit_0' : [3,6,4,4],
                   'profit_1' : [4,6,9,1],
                   'profit_2' : [5,5,9,8],
                   'revenue_cagr' : [-0.48, 0.65, -0.07, 0.26],
                   'profit_cagr' : [0.19, -0.06, 0.31, 0.26]})
You can use set_index + str.rsplit to split the column names into (metric, period) pairs first:
df1 = df.set_index('id')
df1.columns = df1.columns.str.rsplit('_', expand=True, n=1)
print (df1)
profit revenue
0 1 2 0 1 2
id
1 3 4 5 7 5 1
2 6 6 5 2 6 9
3 4 9 9 5 3 4
7 4 1 8 4 1 8
Then use xs to select the period-2 and period-0 levels, divide them with div, apply pow and sub, and add_suffix:
df1 = (df1.xs('2', axis=1, level=1)
          .div(df1.xs('0', axis=1, level=1))
          .pow(1./3)
          .sub(1)
          .add_suffix('_cagr'))
print (df1)
profit_cagr revenue_cagr
id
1 0.185631 -0.477242
2 -0.058964 0.650964
3 0.310371 -0.071682
7 0.259921 0.259921
Last join to original:
df = df.join(df1, on='id')
print (df)
id profit_0 profit_1 profit_2 revenue_0 revenue_1 revenue_2 \
0 1 3 4 5 7 5 1
1 2 6 6 5 2 6 9
2 3 4 9 9 5 3 4
3 7 4 1 8 4 1 8
profit_cagr revenue_cagr
0 0.185631 -0.477242
1 -0.058964 0.650964
2 0.310371 -0.071682
3 0.259921 0.259921
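If only the first and last period are ever needed, a direct vectorized version (a sketch using the question's column names, not part of the answer above) skips the reshaping entirely:
df['revenue_cagr'] = (df['revenue_2'] / df['revenue_0']) ** (1. / 3) - 1
df['profit_cagr'] = (df['profit_2'] / df['profit_0']) ** (1. / 3) - 1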

Pandas merge two dataframes with different columns

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 you stack the dataframes on top of each other, which I believe is what you want, producing NaN values wherever a column is absent from the respective dataframe.
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
I had this problem today using any of concat, append or merge, and I got around it by adding a sequentially numbered helper column and then doing an outer join:
helper = 1
for i in df1.index:
    df1.loc[i, 'helper'] = helper
    helper = helper + 1
for i in df2.index:
    df2.loc[i, 'helper'] = helper
    helper = helper + 1
df1.merge(df2, on='helper', how='outer')
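The helper column can also be built without the Python loops (a sketch using numpy; same numbering, just vectorized):
import numpy as np

df1['helper'] = np.arange(1, len(df1) + 1)
df2['helper'] = np.arange(len(df1) + 1, len(df1) + len(df2) + 1)
df1.merge(df2, on='helper', how='outer')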
