Concatenate pandas DataFrames on columns, similar to outer merge - python

I have 3 dataframes with dates in the first column of each. I would like to concat these dataframes, aligning rows by value: if the values match, they should end up on the same row; otherwise, I would expect to have a NaN.
import numpy as np
import pandas as pd
# Create the pandas DataFrame
df1 = pd.DataFrame(['2018-12-31','2019-09-30','2022-01-31'], columns = ['Date1'])
df2 = pd.DataFrame(['2019-09-30','2022-02-28'], columns = ['Date2'])
df3 = pd.DataFrame(['2019-09-30','2021-06-30','2021-11-30','2022-03-31'], columns = ['Date3'])
display(df1)
display(df2)
display(df3)
data = {'Date1': ['2018-12-31','2019-09-30',np.nan,np.nan,'2022-01-31',np.nan,np.nan],
'Date2': [np.nan,'2019-09-30',np.nan,np.nan,np.nan,'2022-02-28',np.nan],
'Date3': [np.nan,'2019-09-30','2021-06-30','2021-11-30',np.nan,np.nan,'2022-03-31']}
desired_df = pd.DataFrame(data)
desired_df
This is what I am trying to achieve.
   Date1       Date2       Date3
0  2018-12-31  NaN         NaN
1  2019-09-30  2019-09-30  2019-09-30
2  NaN         NaN         2021-06-30
3  NaN         NaN         2021-11-30
4  2022-01-31  NaN         NaN
5  NaN         2022-02-28  NaN
6  NaN         NaN         2022-03-31
My original idea was to use something like:
pd.concat([df1,df2,df3], axis=1, join="outer")
However, the above will produce something like:
        Date1       Date2       Date3
0  2018-12-31  2019-09-30  2019-09-30
1  2019-09-30  2022-02-28  2021-06-30
2  2022-01-31         NaN  2021-11-30
3         NaN         NaN  2022-03-31

We could set_index with the Dates (by setting the drop parameter to False, we don't lose the column), then concat horizontally:
out = (pd.concat([df.set_index(f'Date{i+1}', drop=False)
                  for i, df in enumerate([df1, df2, df3])], axis=1)
         .sort_index().reset_index(drop=True))
Output:
Date1 Date2 Date3
0 2018-12-31 NaN NaN
1 2019-09-30 2019-09-30 2019-09-30
2 NaN NaN 2021-06-30
3 NaN NaN 2021-11-30
4 2022-01-31 NaN NaN
5 NaN 2022-02-28 NaN
6 NaN NaN 2022-03-31
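Since the question title mentions outer merge, here is a merge-based sketch of the same idea (the temporary key column and the functools reduction are my own assumptions, not part of the answer above); it should produce the same alignment:
import functools
import pandas as pd
# give each frame a common join key taken from its own date column,
# then chain outer merges and sort by that key
dfs = [df1.assign(key=df1['Date1']),
       df2.assign(key=df2['Date2']),
       df3.assign(key=df3['Date3'])]
out = functools.reduce(lambda left, right: left.merge(right, on='key', how='outer'), dfs)
out = out.sort_values('key').drop(columns='key').reset_index(drop=True)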

Related

How to sort columns except index column in a data frame in python after pivot

So I have a data frame
testdf = pd.DataFrame({"loc" : ["ab12","bc12","cd12","ab12","bc13","cd12"], "months" :
["Jun21","Jun21","July21","July21","Aug21","Aug21"], "dept" :
["dep1","dep2","dep3","dep2","dep1","dep3"], "count": [15, 16, 15, 92, 90, 2]})
That looks like this:
    loc  months  dept  count
0  ab12   Jun21  dep1     15
1  bc12   Jun21  dep2     16
2  cd12  July21  dep3     15
3  ab12  July21  dep2     92
4  bc13   Aug21  dep1     90
5  cd12   Aug21  dep3      2
When I pivot it,
df = pd.pivot_table(testdf, values = ['count'], index = ['loc','dept'], columns = ['months'], aggfunc=np.sum).reset_index()
df.columns = df.columns.droplevel(0)
df
it looks like this (note that the month columns come out in alphabetical, not chronological, order):
months            Aug21  July21  Jun21
0      ab12 dep1    NaN     NaN   15.0
1      ab12 dep2    NaN    92.0    NaN
2      bc12 dep2    NaN     NaN   16.0
3      bc13 dep1   90.0     NaN    NaN
4      cd12 dep3    2.0    15.0    NaN
I am looking for a sort function which will sort only the months columns in sequence and not the first 2 columns, i.e. loc & dept.
When I try this:
df.sort_values(by = ['Jun21'],ascending = False, inplace = True, axis = 1, ignore_index=True)[2:]
it gives me an error.
I want the columns to be in the sequence Jun21, Jul21, Aug21.
I am looking for something which will make it dynamic, so I won't need to manually change the sequence when the month changes.
Any hint will be really appreciated.
It is quite simple if you do it using groupby:
df = testdf.groupby(['loc', 'dept', 'months']).sum().unstack(level=2)
df = df.reindex(['Jun21', 'July21', 'Aug21'], axis=1, level=1)
Output
count
months Jun21 July21 Aug21
loc dept
ab12 dep1 15.0 NaN NaN
dep2 NaN 92.0 NaN
bc12 dep2 16.0 NaN NaN
bc13 dep1 NaN NaN 90.0
cd12 dep3 NaN 15.0 2.0
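If you want the column order to adapt automatically when new months appear (a sketch, assuming the 'MonYY'-style labels used above, where 'July21' is the one non-abbreviated label), you can derive the order from the labels instead of hardcoding it:
labels = df.columns.get_level_values(1).unique()
# parse each label as a date to get chronological order; the replace() handles
# the non-standard 'July21' spelling in this data
order = sorted(labels, key=lambda m: pd.to_datetime(m.replace('July', 'Jul'), format='%b%y'))
df = df.reindex(order, axis=1, level=1)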
We can start by converting the months column to datetime, like so:
>>> testdf.months = (pd.to_datetime(testdf.months, format="%b%y", errors='coerce'))
>>> testdf
loc months dept count
0 ab12 2021-06-01 dep1 15
1 bc12 2021-06-01 dep2 16
2 cd12 2021-07-01 dep3 15
3 ab12 2021-07-01 dep2 92
4 bc13 2021-08-01 dep1 90
5 cd12 2021-08-01 dep3 2
Then, we apply your code to get the pivot:
>>> df = pd.pivot_table(testdf, values = ['count'], index = ['loc','dept'], columns = ['months'], aggfunc=np.sum).reset_index()
>>> df.columns = df.columns.droplevel(0)
>>> df
months NaT NaT 2021-06-01 2021-07-01 2021-08-01
0 ab12 dep1 15.0 NaN NaN
1 ab12 dep2 NaN 92.0 NaN
2 bc12 dep2 16.0 NaN NaN
3 bc13 dep1 NaN NaN 90.0
4 cd12 dep3 NaN 15.0 2.0
And to finish, we can reformat the column names using strftime to get the expected result:
>>> df.columns = df.columns.map(lambda t: t.strftime('%b%y') if pd.notnull(t) else '')
>>> df
months Jun21 Jul21 Aug21
0 ab12 dep1 15.0 NaN NaN
1 ab12 dep2 NaN 92.0 NaN
2 bc12 dep2 16.0 NaN NaN
3 bc13 dep1 NaN NaN 90.0
4 cd12 dep3 NaN 15.0 2.0

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and each distinct value in the variables column becomes its own column across all buckets. Each bucket is labeled with the very first timestamp that fell into it.
In order to solve this, I have tried a couple of different solutions, but can't seem to find anything that works without constant errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index column (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
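A pivot_table-based sketch of the same idea (this variant is my own assumption, not taken from either approach above): floor the timestamps to 5-minute bins first, then pivot the variables into columns, averaging duplicates within a bin.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
out = (df.assign(bucket=df['timestamp'].dt.floor('5min'))
         .pivot_table(index='bucket', columns='variables', values='value', aggfunc='mean')
         .reset_index())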

Window correlation using groupby and rolling in Pandas

I want to calculate the rolling correlation of grouped data. How can I do it in Pandas? I have created dummy data and done it with PySpark using SQL below:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
my_array = np.random.random(90).reshape(-1, 3)
groups = np.array(['a', 'b', 'c']).reshape(-1,1)
groups = np.repeat(groups, 10).reshape(-1, 1)
my_array = np.append(my_array, groups, axis = 1)
df = pd.DataFrame(my_array, columns = list('abcd'))
df['date'] = pd.to_datetime([datetime.today() + timedelta(i) for i in range(30)])
spark.createDataFrame(df).createOrReplaceTempView('df_tbl')
spark.sql("""
select *,
corr(a,b) over (partition by d order by date rows between 8 preceding and current row) as cor1,
corr(a,b) over (partition by d order by date rows between 8 preceding and current row) as cor2
from df_tbl
""").toPandas().head(10)
Use date as the index and apply rolling groupby functionality to calculate corr on a and b. Afterwards, reset_index to turn the indices back into columns, as it is hard to access the timestamp as an index.
Like this
df.set_index('date', inplace=True)
result = df.groupby(['d'])[['a','b']].rolling(8).corr()
result.reset_index(inplace=True)
Output would look like this:
d date level_2 a b
0 a 2020-03-03 21:21:29.512854 a NaN NaN
1 a 2020-03-03 21:21:29.512854 b NaN NaN
2 a 2020-03-04 21:21:29.512866 a NaN NaN
3 a 2020-03-04 21:21:29.512866 b NaN NaN
4 a 2020-03-05 21:21:29.512869 a NaN NaN
5 a 2020-03-05 21:21:29.512869 b NaN NaN
6 a 2020-03-06 21:21:29.512871 a NaN NaN
7 a 2020-03-06 21:21:29.512871 b NaN NaN
8 a 2020-03-07 21:21:29.512872 a NaN NaN
9 a 2020-03-07 21:21:29.512872 b NaN NaN
10 a 2020-03-08 21:21:29.512874 a NaN NaN
11 a 2020-03-08 21:21:29.512874 b NaN NaN
12 a 2020-03-09 21:21:29.512876 a NaN NaN
13 a 2020-03-09 21:21:29.512876 b NaN NaN
14 a 2020-03-10 21:21:29.512878 a 1.000000 -0.166854
15 a 2020-03-10 21:21:29.512878 b -0.166854 1.000000
16 a 2020-03-11 21:21:29.512880 a 1.000000 -0.095549
17 a 2020-03-11 21:21:29.512880 b -0.095549 1.000000
...
...
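If you only need the pairwise corr(a, b) per group, mirroring the cor1 column in the Spark SQL, here is a hedged sketch working on the original df from the question (note that np.append turned the numeric columns into strings, so they are cast back to float first):
df[['a', 'b', 'c']] = df[['a', 'b', 'c']].astype(float)
# rolling correlation of a against b within each group d, aligned back by index
df['cor1'] = (df.groupby('d', group_keys=False)
                .apply(lambda g: g['a'].rolling(8).corr(g['b'])))
df[['d', 'date', 'cor1']].head(10)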

Get lagged data in pandas

I want to get the lagged data from a dataset. The dataset is monthly and looks like this:
Final Profits
JCCreateDate
2016-04-30 31163371.59
2016-05-31 27512300.34
...
2019-02-28 16800693.82
2019-03-31 5384227.13
Now out of the above dataset, I've selected a window of data (the last 12 months) from which I want to subtract 3, 6, 9 and 12 months.
I've created the window dataset like this:
df_all = pd.read_csv('dataset.csv')
df = pd.read_csv('window_dataset.csv')
data_start, data_end = pd.to_datetime(df.first_valid_index()), pd.to_datetime(df.last_valid_index())
dr = pd.date_range(data_start, data_end, freq='M')
Now for the date range dr I want to subtract the months; let's suppose I subtract 3 months from dr and try to retrieve the data from df_all:
df_all.loc[dr - pd.DateOffset(months=3)]
which gives me the following output:
Final Profits
2018-01-30 NaN
2018-02-28 9240766.46
2018-03-30 NaN
2018-04-30 13250515.05
2018-05-31 12539224.15
2018-06-30 17778326.04
2018-07-31 19345671.02
2018-08-30 NaN
2018-09-30 14815607.14
2018-10-31 28979099.74
2018-11-28 NaN
2018-12-31 12395273.24
As one can see, I've got some NaNs because months like Jan and Mar have 31 days, so the subtraction searches for the wrong day of the month. How do I deal with this?
I'm not 100% sure what you are looking for, but I suspect you want shift.
# set up dataframe
index = pd.date_range(start='2016-04-30', end='2019-03-31', freq='M' )
df = pd.DataFrame(np.random.randint(5000000, 50000000, 36), index=index, columns=['Final Profits'])
# create three columns by shifting and subtracting from 'Final Profits'
df['3mos'] = df['Final Profits'] - df['Final Profits'].shift(3)
df['6mos'] = df['Final Profits'] - df['Final Profits'].shift(6)
df['9mos'] = df['Final Profits'] - df['Final Profits'].shift(9)
print(df.head(12))
Final Profits 3mos 6mos 9mos
2016-04-30 45197972 NaN NaN NaN
2016-05-31 5029292 NaN NaN NaN
2016-06-30 20310120 NaN NaN NaN
2016-07-31 10514197 -34683775.0 NaN NaN
2016-08-31 31219405 26190113.0 NaN NaN
2016-09-30 21504727 1194607.0 NaN NaN
2016-10-31 19234437 8720240.0 -25963535.0 NaN
2016-11-30 18881711 -12337694.0 13852419.0 NaN
2016-12-31 27237712 5732985.0 6927592.0 NaN
2017-01-31 21692788 2458351.0 11178591.0 -23505184.0
2017-02-28 7869701 -11012010.0 -23349704.0 2840409.0
2017-03-31 20943248 -6294464.0 -561479.0 633128.0
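To address the NaN issue in the question more directly, here is a sketch (assuming df_all is indexed by parsed month-end dates as in the original data): snap the shifted dates back to month ends with MonthEnd(0), so 2018-01-30 becomes 2018-01-31 while dates already at month end are left alone.
# shift back 3 months, then roll each date forward to its month end
lagged_idx = (dr - pd.DateOffset(months=3)) + pd.offsets.MonthEnd(0)
df_all.loc[lagged_idx]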

pandas - Extend Index of a DataFrame setting all columns for new rows to NaN?

I have time-indexed data:
from datetime import date
import pandas as pd

df2 = pd.DataFrame({'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b': pd.Series([0.22, 0.3])})
df2 = df2.set_index('day')
df2
b
day
2012-01-01 0.22
2012-01-03 0.30
What is the best way to extend this data frame so that it has one row for every day in January 2012 (say), where all columns are set to NaN (here only b) where we don't have data?
So the desired result would be:
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
...
2012-01-31 NaN
Many thanks!
Use this (current as of pandas 1.1.3):
ix = pd.date_range(start=date(2012, 1, 1), end=date(2012, 1, 31), freq='D')
df2.reindex(ix)
Which gives:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
[...]
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
For older versions of pandas replace pd.date_range with pd.DatetimeIndex.
You can convert to a daily frequency with asfreq; without specifying a fill method, missing values will be NaN-filled as you desired:
df3 = df2.asfreq('D')
df3
Out[16]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
To answer your second part, I can't think of a more elegant way at the moment:
df3 = pd.DataFrame({'day': pd.Series([date(2012, 1, 4), date(2012, 1, 31)])})
df3.set_index('day', inplace=True)
merged = pd.concat([df2, df3])  # DataFrame.append has been removed in recent pandas
merged = merged.asfreq('D')
merged
Out[46]:
b
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
This constructs a second time series, then we just concatenate the two and call asfreq('D') as before.
Here's another option:
First add a NaN record on the last day you want, then resample. This way resampling will fill the missing dates for you.
Starting Frame:
import pandas as pd
import numpy as np
from datetime import date
df2 = pd.DataFrame({ 'day': pd.Series([date(2012, 1, 1), date(2012, 1, 3)]), 'b' : pd.Series([0.22, 0.3]) })
df2= df2.set_index('day')
df2
Out:
b
day
2012-01-01 0.22
2012-01-03 0.30
Filled Frame:
df2.loc[date(2012, 1, 31), 'b'] = np.nan  # set_value has been removed in newer pandas
df2.asfreq('D')
Out:
b
day
2012-01-01 0.22
2012-01-02 NaN
2012-01-03 0.30
2012-01-04 NaN
2012-01-05 NaN
2012-01-06 NaN
2012-01-07 NaN
2012-01-08 NaN
2012-01-09 NaN
2012-01-10 NaN
2012-01-11 NaN
2012-01-12 NaN
2012-01-13 NaN
2012-01-14 NaN
2012-01-15 NaN
2012-01-16 NaN
2012-01-17 NaN
2012-01-18 NaN
2012-01-19 NaN
2012-01-20 NaN
2012-01-21 NaN
2012-01-22 NaN
2012-01-23 NaN
2012-01-24 NaN
2012-01-25 NaN
2012-01-26 NaN
2012-01-27 NaN
2012-01-28 NaN
2012-01-29 NaN
2012-01-30 NaN
2012-01-31 NaN
Mark's answer no longer seems to work on pandas 1.1.1.
However, using the same idea, the following works:
from datetime import datetime
import pandas as pd
# get start and desired end dates
first_date = df['date'].min()
today = datetime.today()
# set index
df.set_index('date', inplace=True)
# and here is where the magic happens
idx = pd.date_range(first_date, today, freq='D')
df = df.reindex(idx)
EDIT: just found out that this exact use case is in the docs:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex
import datetime

def extendframe(df, ndays):
    """
    (df, ndays) -> df that is padded by ndays at the beginning and end
    """
    # shift the existing index back and forward by ndays
    ixd = df.index - datetime.timedelta(days=ndays)
    ixu = df.index + datetime.timedelta(days=ndays)
    # reindex on the union of the original and shifted indices
    ixx = df.index.union(ixd.union(ixu))
    df_ = df.reindex(ixx)
    return df_
This isn't exactly the question, since here you know that the second index is all days in January, but suppose you have another index, say from another data frame df1, which might be disjoint and have a random frequency. Then you can do this:
ix = pd.DatetimeIndex(list(df2.index) + list(df1.index)).unique().sort_values()
df2.reindex(ix)
Converting indices to lists allows one to create a longer list in a natural way.
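Equivalently, Index.union does the combining, de-duplicating and sorting in one step (a small sketch of the same idea):
# union of two DatetimeIndexes is de-duplicated and sorted by default
ix = df2.index.union(df1.index)
df2.reindex(ix)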
