Window correlation using groupby and rolling in Pandas - python

I want to calculate the rolling correlation of grouped data. How can I do it in Pandas? I have created dummy data below and computed the result with PySpark using SQL:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
my_array = np.random.random(90).reshape(-1, 3)
groups = np.array(['a', 'b', 'c']).reshape(-1,1)
groups = np.repeat(groups, 10).reshape(-1, 1)
my_array = np.append(my_array, groups, axis = 1)
df = pd.DataFrame(my_array, columns=list('abcd'))
# np.append coerces everything to strings, so cast the numeric columns back to float
df[['a', 'b', 'c']] = df[['a', 'b', 'c']].astype(float)
df['date'] = pd.to_datetime([datetime.today() + timedelta(i) for i in range(30)])
spark.createDataFrame(df).createOrReplaceTempView('df_tbl')
spark.sql("""
select *,
corr(a,b) over (partition by d order by date rows between 8 preceding and current row) as cor1,
corr(a,b) over (partition by d order by date rows between 8 preceding and current row) as cor2
from df_tbl
""").toPandas().head(10)

Set date as the index and apply the groupby-rolling functionality to calculate the correlation between a and b. Afterwards, use reset_index to turn the index levels back into columns, since it is awkward to access the timestamp while it sits in the index.
Like this:
df.set_index('date', inplace=True)
result = df.groupby(['d'])[['a','b']].rolling(8).corr()
result.reset_index(inplace=True)
Output would look like this:
d date level_2 a b
0 a 2020-03-03 21:21:29.512854 a NaN NaN
1 a 2020-03-03 21:21:29.512854 b NaN NaN
2 a 2020-03-04 21:21:29.512866 a NaN NaN
3 a 2020-03-04 21:21:29.512866 b NaN NaN
4 a 2020-03-05 21:21:29.512869 a NaN NaN
5 a 2020-03-05 21:21:29.512869 b NaN NaN
6 a 2020-03-06 21:21:29.512871 a NaN NaN
7 a 2020-03-06 21:21:29.512871 b NaN NaN
8 a 2020-03-07 21:21:29.512872 a NaN NaN
9 a 2020-03-07 21:21:29.512872 b NaN NaN
10 a 2020-03-08 21:21:29.512874 a NaN NaN
11 a 2020-03-08 21:21:29.512874 b NaN NaN
12 a 2020-03-09 21:21:29.512876 a NaN NaN
13 a 2020-03-09 21:21:29.512876 b NaN NaN
14 a 2020-03-10 21:21:29.512878 a 1.000000 -0.166854
15 a 2020-03-10 21:21:29.512878 b -0.166854 1.000000
16 a 2020-03-11 21:21:29.512880 a 1.000000 -0.095549
17 a 2020-03-11 21:21:29.512880 b -0.095549 1.000000
...
...
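If you want a single correlation column per row, similar to cor1 in the Spark output, one possible sketch is to keep only the rows where level_2 equals 'a' and take column b, which holds corr(a, b) (column names as produced by the reset_index call above). Note also that the Spark window "rows between 8 preceding and current row" spans 9 rows, so rolling(9) would match it exactly; rolling(8) is kept here as in the answer.
cor_ab = (result[result['level_2'] == 'a']        # the 'a' row of each 2x2 correlation matrix
          .rename(columns={'b': 'cor_ab'})        # its 'b' entry is corr(a, b)
          [['d', 'date', 'cor_ab']]
          .reset_index(drop=True))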

Concatenate pandas DataFrames on columns, similar to outer merge

I have 3 dataframes, each with dates in its first column. I would like to concatenate these dataframes, aligning rows by value: if the values match they should appear on the same row; otherwise I would expect a NaN.
import numpy as np
import pandas as pd
# Create the pandas DataFrame
df1 = pd.DataFrame(['2018-12-31','2019-09-30','2022-01-31'], columns = ['Date1'])
df2 = pd.DataFrame(['2019-09-30','2022-02-28'], columns = ['Date2'])
df3 = pd.DataFrame(['2019-09-30','2021-06-30','2021-11-30','2022-03-31'], columns = ['Date3'])
display(df1)
display(df2)
display(df3)
data = {'Date1': ['2018-12-31','2019-09-30',np.nan,np.nan,'2022-01-31',np.nan,np.nan],
        'Date2': [np.nan,'2019-09-30',np.nan,np.nan,np.nan,'2022-02-28',np.nan],
        'Date3': [np.nan,'2019-09-30','2021-06-30','2021-11-30',np.nan,np.nan,'2022-03-31']}
desired_df = pd.DataFrame(data)
desired_df
This is what I am trying to achieve.
        Date1       Date2       Date3
0  2018-12-31         NaN         NaN
1  2019-09-30  2019-09-30  2019-09-30
2         NaN         NaN  2021-06-30
3         NaN         NaN  2021-11-30
4  2022-01-31         NaN         NaN
5         NaN  2022-02-28         NaN
6         NaN         NaN  2022-03-31
My original idea was to use something like:
pd.concat([df1,df2,df3], axis=1, join="outer")
However, the above will produce something like this:
        Date1       Date2       Date3
0  2018-12-31  2019-09-30  2019-09-30
1  2019-09-30  2022-02-28  2021-06-30
2  2022-01-31         NaN  2021-11-30
3         NaN         NaN  2022-03-31
We could set_index with the Dates (by setting the drop parameter to False, we don't lose the column), then concat horizontally:
out = (pd.concat([df.set_index(f'Date{i+1}', drop=False)
                  for i, df in enumerate([df1, df2, df3])], axis=1)
         .sort_index().reset_index(drop=True))
Output:
Date1 Date2 Date3
0 2018-12-31 NaN NaN
1 2019-09-30 2019-09-30 2019-09-30
2 NaN NaN 2021-06-30
3 NaN NaN 2021-11-30
4 2022-01-31 NaN NaN
5 NaN 2022-02-28 NaN
6 NaN NaN 2022-03-31
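A hedged alternative, closer to the "outer merge" in the title: give each frame a shared key column holding its dates, outer-merge on that key, then sort and drop it. This sketch assumes the dates sit in the first column of each frame.
from functools import reduce

keyed = [d.assign(key=d.iloc[:, 0]) for d in (df1, df2, df3)]           # shared merge key
out = reduce(lambda l, r: pd.merge(l, r, on='key', how='outer'), keyed)
out = out.sort_values('key').drop(columns='key').reset_index(drop=True)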

How to sort columns except index column in a data frame in python after pivot

So I have a data frame
testdf = pd.DataFrame({"loc": ["ab12", "bc12", "cd12", "ab12", "bc13", "cd12"],
                       "months": ["Jun21", "Jun21", "July21", "July21", "Aug21", "Aug21"],
                       "dept": ["dep1", "dep2", "dep3", "dep2", "dep1", "dep3"],
                       "count": [15, 16, 15, 92, 90, 2]})
That looks like this:
When I pivot it,
df = pd.pivot_table(testdf, values = ['count'], index = ['loc','dept'], columns = ['months'], aggfunc=np.sum).reset_index()
df.columns = df.columns.droplevel(0)
df
it looks like this:
I am looking for a sort function which will sort only the month columns in sequence, not the first two columns, i.e. loc and dept.
When I try this:
df.sort_values(by = ['Jun21'],ascending = False, inplace = True, axis = 1, ignore_index=True)[2:]
it gives me an error.
I want the columns to be in the sequence Jun21, Jul21, Aug21.
I am looking for something dynamic, so I won't need to change the sequence manually when the months change.
Any hint will be really appreciated.
This is quite simple to do using groupby:
df = testdf.groupby(['loc', 'dept', 'months']).sum().unstack(level=2)
df = df.reindex(['Jun21', 'July21', 'Aug21'], axis=1, level=1)
Output
count
months Jun21 July21 Aug21
loc dept
ab12 dep1 15.0 NaN NaN
dep2 NaN 92.0 NaN
bc12 dep2 16.0 NaN NaN
bc13 dep1 NaN NaN 90.0
cd12 dep3 NaN 15.0 2.0
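If you would rather not hard-code the month list passed to reindex (the question asks for something dynamic), one possible sketch is to sort the month labels by their parsed dates; it assumes every label matches either '%b%y' (Jun21) or '%B%y' (July21).
def month_key(label):
    # Try the abbreviated month name first ('Jun21'), then the full one ('July21')
    for fmt in ('%b%y', '%B%y'):
        try:
            return pd.to_datetime(label, format=fmt)
        except ValueError:
            continue
    return pd.Timestamp.max  # push anything unparseable to the end

order = sorted(df.columns.levels[1], key=month_key)
df = df.reindex(order, axis=1, level=1)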
We can start by converting the months column to datetime, like so:
>>> testdf.months = (pd.to_datetime(testdf.months, format="%b%y", errors='coerce'))
>>> testdf
loc months dept count
0 ab12 2021-06-01 dep1 15
1 bc12 2021-06-01 dep2 16
2 cd12 2021-07-01 dep3 15
3 ab12 2021-07-01 dep2 92
4 bc13 2021-08-01 dep1 90
5 cd12 2021-08-01 dep3 2
Then we apply your code to get the pivot:
>>> df = pd.pivot_table(testdf, values = ['count'], index = ['loc','dept'], columns = ['months'], aggfunc=np.sum).reset_index()
>>> df.columns = df.columns.droplevel(0)
>>> df
months NaT NaT 2021-06-01 2021-07-01 2021-08-01
0 ab12 dep1 15.0 NaN NaN
1 ab12 dep2 NaN 92.0 NaN
2 bc12 dep2 16.0 NaN NaN
3 bc13 dep1 NaN NaN 90.0
4 cd12 dep3 NaN 15.0 2.0
And to finish, we can reformat the column names using strftime to get the expected result:
>>> df.columns = df.columns.map(lambda t: t.strftime('%b%y') if pd.notnull(t) else '')
>>> df
months Jun21 Jul21 Aug21
0 ab12 dep1 15.0 NaN NaN
1 ab12 dep2 NaN 92.0 NaN
2 bc12 dep2 16.0 NaN NaN
3 bc13 dep1 NaN NaN 90.0
4 cd12 dep3 NaN 15.0 2.0

Pandas trying to make values within a column into new columns after groupby on column

My original dataframe looked like:
timestamp variables value
1 2017-05-26 19:46:41.289 inf 0.000000
2 2017-05-26 20:40:41.243 tubavg 225.489639
... ... ... ...
899541 2017-05-02 20:54:41.574 caspre 684.486450
899542 2017-04-29 11:17:25.126 tvol 50.895000
Now I want to bucket this dataset by time, which can be done with the code:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby(pd.Grouper(key='timestamp', freq='5min'))
But I also want all the different metrics to become columns in the new dataframe. For example the first two rows from the original dataframe would look like:
timestamp inf tubavg caspre tvol ...
1 2017-05-26 19:46:41.289 0.000000 225.489639 xxxxxxx xxxxx
... ... ... ...
xxxxx 2017-05-02 20:54:41.574 xxxxxx xxxxxx 684.486450 50.895000
As can be seen, the time has been bucketed into 5-minute intervals, and every distinct value of variables should become its own column, filled in for each bucket. Each bucket is labelled with the first timestamp that falls into it.
In order to solve this, I have tried a couple of different approaches, but I keep running into errors.
Try unstacking the variables column from rows to columns with .unstack(1). The parameter is 1 because we want the second index level (0 would be the first).
Then, drop the level of the multi-index you just created to make it a little bit cleaner with .droplevel().
Finally, use pd.Grouper. Since the date/time is on the index, you don't need to specify a key.
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.set_index(['timestamp','variables']).unstack(1)
df.columns = df.columns.droplevel()
df = df.groupby(pd.Grouper(freq='5min')).mean().reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-04-29 11:20:00 NaN NaN NaN NaN
2 2017-04-29 11:25:00 NaN NaN NaN NaN
3 2017-04-29 11:30:00 NaN NaN NaN NaN
4 2017-04-29 11:35:00 NaN NaN NaN NaN
... ... ... ... ...
7885 2017-05-26 20:20:00 NaN NaN NaN NaN
7886 2017-05-26 20:25:00 NaN NaN NaN NaN
7887 2017-05-26 20:30:00 NaN NaN NaN NaN
7888 2017-05-26 20:35:00 NaN NaN NaN NaN
7889 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
Another way would be to .groupby the variables as well and then .unstack(1) again:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
df = df.groupby([pd.Grouper(freq='5min', key='timestamp'), 'variables']).mean().unstack(1)
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
variables timestamp caspre inf tubavg tvol
0 2017-04-29 11:15:00 NaN NaN NaN 50.895
1 2017-05-02 20:50:00 684.48645 NaN NaN NaN
2 2017-05-26 19:45:00 NaN 0.0 NaN NaN
3 2017-05-26 20:40:00 NaN NaN 225.489639 NaN
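For completeness, a roughly equivalent sketch that pivots first and resamples afterwards, starting again from the original long-format dataframe (assuming the column names timestamp, variables and value shown in the question); like the first answer, this keeps the empty 5-minute bins:
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
wide = df.pivot_table(index='timestamp', columns='variables', values='value', aggfunc='mean')
out = wide.resample('5min').mean().reset_index()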

Select a (non-indexed) column based on text content of a cell in a python/pandas dataframe

TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?
I am relatively new to Python and data analysis (this is my first time posting a question on Stack Overflow, though I used to code regularly), and I've been hunting for an answer for a long time without any success.
I have a dataframe imported from an Excel file that doesn't have named/indexed columns. I am trying to extract data from nearly 2000 of these files, which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
etc. etc. So I'm trying to write some code that will pull out a set of columns that I define by looking for words such as "Date" or "UCVA" in the header rows. Then I plan to stitch them back together into a single dataframe with a patient identifier as an extra column, cycle through all the XLS files, and append the whole lot to a single CSV file that I can then do useful things with (like put into an Access database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky
Here is something that will hopefully get you started.
I have prepared a text.xlsx file, and I can read it as follows:
path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merge two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this
How to merge dataframes? There is a very good guide in the pandas doc
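As a concrete starting point for the "select by text content" part, once the two header rows are joined as above you can keep only the columns whose name contains a given string; the keywords below are examples, not the real column names.
# Keep the date column plus everything whose header mentions 'UCVA'
subset = pd.concat([df[[df.columns[0]]], df.filter(like='UCVA')], axis=1)

# Or with several keywords at once
wanted = [c for c in df.columns if any(k in c for k in ('Date', 'UCVA', 'Sph'))]
subset = df[wanted]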

Group by and find sum for groups but return NaN as NaN, not 0

I have a dataframe where each unique group has 4 rows.
I need to group by the columns that make each group unique and perform some aggregations such as max, min, sum and average.
The problem is that for some groups a column contains only NaN values, and the sum comes back as 0. Is it possible to get NaN instead?
For example:
df
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 8 5 NaN
2018-02-11 14:00:00 1 a 12 1 NaN NaN
2018-02-11 14:00:00 1 a 12 3 7 NaN
2018-02-11 14:00:00 1 a 12 4 12 NaN
2018-02-11 14:00:00 2 a 5 NaN 5 5
2018-02-11 14:00:00 2 a 5 NaN 3 2
2018-02-11 14:00:00 2 a 5 NaN NaN 6
2018-02-11 14:00:00 2 a 5 NaN 7 NaN
So, for example, I need to group by ('id', 'el', 'conn') and find the sum of column1, column2 and column3. (In the real case I have many more columns that need aggregation.)
I have tried a few ways, such as .sum() and .transform('sum'), but they return a zero for groups whose values are all NaN.
Desired output:
time id el conn column1 column2 column3
2018-02-11 14:00:00 1 a 12 16 24 NaN
2018-02-11 14:00:00 2 a 5 NaN 15 13
Any help is welcomed.
Change the parameter min_count to 1 - this works as of pandas version 0.22.0:
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 1. This means the sum or product of an all-NA or empty series is NaN.
df = df.groupby(['time','id', 'el', 'conn'], as_index=False).sum(min_count=1)
print (df)
time id el conn column1 column2 column3
0 2018-02-11 14:00:00 1 a 12 16.0 24.0 NaN
1 2018-02-11 14:00:00 2 a 5 NaN 15.0 13.0
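If several different aggregations are needed across many columns (as the question suggests), named aggregation (available since pandas 0.25) lets you combine the NaN-preserving sum with min, max or mean, which already return NaN for all-NaN groups. A sketch using the example's column names:
out = (df.groupby(['time', 'id', 'el', 'conn'], as_index=False)
         .agg(column1_sum=('column1', lambda s: s.sum(min_count=1)),
              column2_sum=('column2', lambda s: s.sum(min_count=1)),
              column3_sum=('column3', lambda s: s.sum(min_count=1)),
              column2_max=('column2', 'max')))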
I think it should be something like this.
df.groupby(['time','id','el','conn']).sum()
Here are some tutorials for groupby that I find interesting in these cases:
https://chrisalbon.com/python/data_wrangling/pandas_apply_operations_to_groups/
https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
