Pandas - Merge two df's on non-unique date (outer join) - python

I have two df's that I would like to combine in a slightly unusual way.
The df's in question:
df1:
Index colA
2012-01-02 1
2012-01-05 2
2012-01-10 3
2012-01-10 4
and then df2:
Index colB
2012-01-01 6
2012-01-05 7
2012-01-08 8
2012-01-10 9
Output:
Index colA colB
2012-01-01 NaN 6
2012-01-02 1 NaN
2012-01-05 2 7
2012-01-08 NaN 8
2012-01-10 3 9
2012-01-10 4 NaN
Happy to have the NaN output if there is no matching date between the df's.
If there is a matching date I would like to return both columns.
There can be an instance where a single date has, e.g., 20 rows in df1 and 15 rows in df2; it would match up the first 15 (I don't care about ordering) and then return NaNs in the df2 column for the remaining 5 rows.
When I try to do this myself with pd.merge() and similar, it fails because the date is obviously not unique as an index.
Any suggestions on how to get the intended behavior?
Thanks

You may need to create a helper key with cumcount:
df1 = df1.assign(key=df1.groupby('Index').cumcount())
df2 = df2.assign(key=df2.groupby('Index').cumcount())
fdf = df1.merge(df2, how='outer').drop(columns='key').sort_values('Index')
fdf
Out[104]:
Index colA colB
4 2012-01-01 NaN 6.0
0 2012-01-02 1.0 NaN
1 2012-01-05 2.0 7.0
5 2012-01-08 NaN 8.0
2 2012-01-10 3.0 9.0
3 2012-01-10 4.0 NaN
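For a self-contained run, here is a minimal sketch that rebuilds the example frames, assuming the dates sit in an ordinary column named 'Index' (which is what groupby('Index') above requires):
import pandas as pd
df1 = pd.DataFrame({'Index': pd.to_datetime(['2012-01-02', '2012-01-05', '2012-01-10', '2012-01-10']),
                    'colA': [1, 2, 3, 4]})
df2 = pd.DataFrame({'Index': pd.to_datetime(['2012-01-01', '2012-01-05', '2012-01-08', '2012-01-10']),
                    'colB': [6, 7, 8, 9]})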

Using join() should work
df1.join(df2, how='outer', sort=True)

Related

How to reshape a dataframe into a particular format

I have the following dataframe which I want to reshape:
year control Counts_annee_control
0 2014.0 0 81.0
1 2016.0 1 147.0
2 2016.0 0 168.0
3 2015.0 1 90.0
4 2016.0 1 147.0
I want the control column to be the new index, so I will have only two columns. The year values should be the new columns of the dataframe and finally the Counts_annee_control column will fill the dataframe.
I tried using groupby and transform but I don't think I'm doing it correctly.
You are probably looking for pivot_table.
If df is your DataFrame, then this
modified_df = df.pivot_table(
    values='Counts_annee_control', index='control', columns='year'
)
will result in
year 2014.0 2015.0 2016.0
control
0 81.0 NaN 168.0
1 NaN 90.0 147.0
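Note that pivot_table aggregates duplicate (control, year) pairs with the mean by default (rows 1 and 4 in the input are identical, so it makes no difference here). If you would rather add the counts up, pass aggfunc explicitly, for example:
modified_df = df.pivot_table(
    values='Counts_annee_control', index='control', columns='year', aggfunc='sum'
)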

How to drop '_merge' column in Pandas.merge

I am sorting 2 dataframes according to their accuracy. First I merge the 2 dfs with strict conditions using how='outer', indicator=True and save the result to a df called 'perfect'. Later I extract the left_only and right_only rows from the _merge column into two new dfs. Then I merge these two dfs with simpler conditions, again with how='outer', indicator=True, and save the new df as 'partial match'. But when I do this I get the error
ValueError: Cannot use name of an existing column for indicator column
because I used indicator=True again, but I need that indicator to apply to the unmatched rows (i.e. left_only and right_only) so I can put them through much simpler conditions.
How can I drop that _merge column? Or how can I avoid this ValueError?
_merge is not appearing in df.columns, so I am unable to do drop(['_merge'], axis=1) or del df._merge.
Use a string for indicator instead of True. See the docs:
indicator : bool or str, default False. If True, adds a column to output
DataFrame called “_merge” with information on the source of each row.
If string, column with information on source of each row will be added
to output DataFrame, and column will be named value of string.
Information column is Categorical-type and takes on a value of
“left_only” for observations whose merge key only appears in ‘left’
DataFrame, “right_only” for observations whose merge key only appears
in ‘right’ DataFrame, and “both” if the observation’s merge key is
found in both.
Then the second time you merge, use a different 'string' for indicator.
dfA = pd.DataFrame({'key':np.arange(0,10), 'dataA':np.arange(100,110)})
dfB = pd.DataFrame({'key':np.arange(5,15), 'dataB':np.arange(100,110)})
dfA.merge(dfB, on='key', indicator='Ind', how='outer')
Output:
key dataA dataB Ind
0 0 100.0 NaN left_only
1 1 101.0 NaN left_only
2 2 102.0 NaN left_only
3 3 103.0 NaN left_only
4 4 104.0 NaN left_only
5 5 105.0 100.0 both
6 6 106.0 101.0 both
7 7 107.0 102.0 both
8 8 108.0 103.0 both
9 9 109.0 104.0 both
10 10 NaN 105.0 right_only
11 11 NaN 106.0 right_only
12 12 NaN 107.0 right_only
13 13 NaN 108.0 right_only
14 14 NaN 109.0 right_only
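Putting the advice together, a rough sketch of the second pass (the frame names and the 'Ind2' label are just illustrative):
perfect = dfA.merge(dfB, on='key', indicator='Ind', how='outer')
# split out the unmatched rows from the first, strict merge
left_only = perfect.loc[perfect['Ind'] == 'left_only', ['key', 'dataA']]
right_only = perfect.loc[perfect['Ind'] == 'right_only', ['key', 'dataB']]
# second merge: use a different indicator name so it no longer clashes
partial_match = left_only.merge(right_only, on='key', how='outer', indicator='Ind2')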

Leading and Trailing Padding Dates in Pandas DataFrame

This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# the date field holds datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty dates". I have tried to reindex on a date_range and period_range, I have tried to merge another index. I have tried all sorts of things all day, and I have read alot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindex the data by date_range(start='2018-01-01', periods=12, freq='M')).
(Ideally I would want the months to be transposed by year across the top as columns.)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex
s=df.groupby([df['account_id'],df.index.year,df.index.month]).sum()
idx=pd.MultiIndex.from_product([s.index.levels[0],s.index.levels[1],list(range(1,13))])
s=s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
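To get the months across the top as columns (the "ideally" part of the question), a minimal sketch that unstacks the last index level of s from above:
wide = s['amount'].unstack(level=-1)
# rows are (account_id, year), columns are the months 1..12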

Group by and find sum for groups but return NaN as NaN, not 0

I have a dataframe where each unique group has 4 rows.
So I need to group by the columns that make them unique and do some aggregations such as max, min, sum and average.
But the problem is that some groups have all NaN values in a column, and then the sum returns 0. Is it possible to return NaN instead?
For example:
df
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 8 5 NaN
2018-02-11 14:00:00 1 a 12 1 NaN NaN
2018-02-11 14:00:00 1 a 12 3 7 NaN
2018-02-11 14:00:00 1 a 12 4 12 NaN
2018-02-11 14:00:00 2 a 5 NaN 5 5
2018-02-11 14:00:00 2 a 5 NaN 3 2
2018-02-11 14:00:00 2 a 5 NaN NaN 6
2018-02-11 14:00:00 2 a 5 NaN 7 NaN
So, for example, I need to group by ('id', 'el', 'conn') and find the sum of column1, column2 and column3. (In the real case I have a lot more columns that aggregations need to be performed on.)
I have tried a few ways, .sum() and .transform('sum'), but they return zero for a group with all NaN values.
Desired output:
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 16 24 NaN
2018-02-11 14:00:00 2 a 5 NaN 15 13
Any help is welcomed.
Change the parameter min_count to 1 - this works in the latest pandas version, 0.22.0:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 1. This means the sum or product of an all-NA or empty series is NaN.
df = df.groupby(['time','id', 'el', 'conn'], as_index=False).sum(min_count=1)
print (df)
time id el conn column1 column2 column3
0 2018-02-11 14:00:00 1 a 12 16.0 24.0 NaN
1 2018-02-11 14:00:00 2 a 5 NaN 15.0 13.0
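If you also need other aggregations alongside the sum (the question mentions max, min and average), one sketch is to route the sums through a lambda so min_count still applies; the column-to-function mapping below is only an example:
agg_funcs = {'column1': lambda s: s.sum(min_count=1),  # NaN-preserving sum
             'column2': 'max',
             'column3': 'mean'}
df.groupby(['time', 'id', 'el', 'conn'], as_index=False).agg(agg_funcs)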
I think it should be something like this.
df.groupby(['time','id','el','conn']).sum()
Here are a couple of groupby tutorials that I find interesting for cases like this:
https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_groups/
https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm

Elegant resample for groups in Pandas

For a given pandas data frame called full_df which looks like
index id timestamp data
------- ---- ------------ ------
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-05-01 9.0
The start and end dates (and the time delta between start and end) vary between ids.
But I need an id-wise resampled version (added rows are marked with *):
index id timestamp data
------- ---- ------------ ------ ----
1 1 2017-01-01 10.0
2 1 2017-02-01 11.0
3 1 2017-03-01 NaN *
4 1 2017-04-01 13.0
5 2 2017-02-01 1.0
6 2 2017-03-01 2.0
7 2 2017-04-01 NaN *
8 2 2017-05-01 9.0
Because the dataset is very large, I was wondering if there is a more efficient way of doing this than:
1. do full_df.groupby('id')
2. for each group df:
   df.index = pd.DatetimeIndex(df['timestamp'])
   all_days = pd.date_range(df.index.min(), df.index.max(), freq='MS')
   df = df.reindex(all_days)
3. combine all groups again with a new index
That's time consuming and not very elegant. Any ideas?
Using resample
In [1175]: (df.set_index('timestamp').groupby('id').resample('MS').asfreq()
.drop(['id', 'index'], 1).reset_index())
Out[1175]:
id timestamp data
0 1 2017-01-01 10.0
1 1 2017-02-01 11.0
2 1 2017-03-01 NaN
3 1 2017-04-01 13.0
4 2 2017-02-01 1.0
5 2 2017-03-01 2.0
6 2 2017-04-01 NaN
7 2 2017-05-01 9.0
Details
In [1176]: df
Out[1176]:
index id timestamp data
0 1 1 2017-01-01 10.0
1 2 1 2017-02-01 11.0
2 3 1 2017-04-01 13.0
3 4 2 2017-02-01 1.0
4 5 2 2017-03-01 2.0
5 6 2 2017-05-01 9.0
In [1177]: df.dtypes
Out[1177]:
index int64
id int64
timestamp datetime64[ns]
data float64
dtype: object
Edit to add: this way takes the min/max of dates over full_df, not over each per-id df. If there is wide variation in start/end dates between IDs, this will unfortunately inflate the dataframe, and #JohnGalt's method is better. Nevertheless I'll leave this here as an alternate approach, as it ought to be faster than groupby/resample for the cases where it is appropriate.
I think the most efficient approach is likely going to be with stack/unstack or melt/pivot.
You could do something like this, for example:
full_df.set_index(['timestamp','id']).unstack('id').stack('id',dropna=False)
index data
timestamp id
2017-01-01 1 1.0 10.0
2 NaN NaN
2017-02-01 1 2.0 11.0
2 4.0 1.0
2017-03-01 1 NaN NaN
2 5.0 2.0
2017-04-01 1 3.0 13.0
2 NaN NaN
2017-05-01 1 NaN NaN
2 6.0 9.0
Just add reset_index().set_index('id') if you want it to display more like how you have it above. Note in particular the use of dropna=False with stack which preserves the NaN placeholders. Without that, the stack/unstack method just leaves you back where you started.
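For example, a sketch of that full chain in the display shape mentioned above:
(full_df.set_index(['timestamp', 'id'])
        .unstack('id')
        .stack('id', dropna=False)
        .reset_index()
        .set_index('id'))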
This method automatically includes the min & max dates, and all dates present for at least one id. If there are interior timestamps missing for everyone, then you need to add a resample, like this:
full_df.set_index(['timestamp','id']).unstack('id')\
    .resample('MS').mean()\
    .stack('id', dropna=False)
