I have a DataFrame and a Series, and I would like to compute their rolling correlation as a new DataFrame.
I have 3 columns in df1, and I would like to return a new DataFrame that holds the rolling correlation of each of these columns with a Series object.
import pandas as pd
df1 = pd.read_csv('https://bpaste.net/raw/d0456d3a020b')
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index(df1['Date'])
del df1['Date']
df2 = pd.read_csv('https://bpaste.net/raw/d5cb455cb091')
df2['Date'] = pd.to_datetime(df2['Date'])
df2 = df2.set_index(df2['Date'])
del df2['Date']
pd.rolling_corr(df1, df2)
The result (https://bpaste.net/show/58b59c656ce4) gives only NaNs and 1s.
pd.rolling_corr(df1['IWM_Close'], spy, window=22)
returns the desired Series, but I did not want to loop through the columns of the DataFrame. Is there a better way to do it?
Thanks.
I believe your second input has to be a Series to be correlated with all columns in the first DataFrame.
This works:
import numpy as np
from datetime import date

index = pd.DatetimeIndex(start=date(2015, 1, 1), freq='W', periods=100)
df1 = pd.DataFrame(np.random.random((100, 3)), index=index)
df2 = pd.DataFrame(np.random.random((100, 1)), index=index)
# .squeeze() turns the single-column DataFrame into a Series
print(pd.rolling_corr(df1, df2.squeeze(), window=20).tail())
or, for the same result:
df2 = pd.Series(np.random.random(100), index=index)
print(pd.rolling_corr(df1, df2, window=20).tail())
0 1 2
2016-10-30 -0.170971 -0.039929 -0.091098
2016-11-06 -0.199441 0.000093 -0.096331
2016-11-13 -0.213728 -0.020709 -0.129935
2016-11-20 -0.075859 0.014667 -0.153830
2016-11-27 -0.114041 0.019886 -0.155472
but this doesn't work (note the missing .squeeze()); it only correlates the matching columns:
print(pd.rolling_corr(df1, df2, window=20).tail())
0 1 2
2016-10-30 0.019865 NaN NaN
2016-11-06 0.087075 NaN NaN
2016-11-13 0.011679 NaN NaN
2016-11-20 -0.004155 NaN NaN
2016-11-27 0.111408 NaN NaN
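Note that pd.rolling_corr was removed in later pandas releases; a minimal sketch of the modern equivalent, assuming the same df1 and df2 as above:
# rolling is now a method on the DataFrame itself; correlating a
# DataFrame against a Series broadcasts across all columns
print(df1.rolling(window=20).corr(df2.squeeze()).tail())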
How can I count how many 2-up and 2-dn values appear in a column at the same index date in a pandas DataFrame?
import pandas as pd

index = ['2020-01-01','2020-01-01','2020-01-01','2020-01-08','2020-01-08','2020-01-08']
df1 = pd.DataFrame(index=index)
bars = ['1-inside','2-up','2-dn','2-up','2-up','1-inside']
df1['Strat'] = bars
df1
Result should be:
2020-01-01 2-up = 1, 2-dn = 1
2020-01-08 2-up = 2, 2-dn = 0
Afterwards I would like to plot the results with matplotlib.
Use SeriesGroupBy.value_counts to count, reshape with Series.unstack, and then plot with DataFrame.plot.bar:
need = ['2-up','2-dn']
df1 = df1['Strat'].groupby(level=0).value_counts().unstack(fill_value=0)[need]
print (df1)
Strat 2-up 2-dn
2020-01-01 1 1
2020-01-08 2 0
Or you can filter before counting with Series.isin and boolean indexing:
need = ['2-up','2-dn']
df1 = (df1.loc[df1['Strat'].isin(need), 'Strat']
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0))
df1.plot.bar()
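Another possible sketch uses pd.crosstab, which counts index/value pairs directly (this assumes the original df1 and the need list from above):
import matplotlib.pyplot as plt

# crosstab counts occurrences of each 'Strat' value per index date
counts = pd.crosstab(df1.index, df1['Strat'])[need]
counts.plot.bar()
plt.show()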
When I merge two dataframes, the result keeps the columns from both the left and the right dataframe,
with _x and _y suffixes appended.
But I want a single column that 'merges' the values of the two columns such that:
when the values are the same, it just keeps that one value;
when the values are different, it keeps the value that is the 'latest' according to another column called 'date'.
I also tried concatenate; in that case the two columns are 'merged', but it just seems to 'append' the rows.
In the code below, for example, I would like to get the dataframe df_desired as output. How can I get that?
import pandas as pd
import numpy as np
np.random.seed(30)
company1 = ('comA','comB','comC','comD')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[100,200,300,400]
df1['date'] = [20191231,20191231,20191001,20190931]
print("\ndf1:")
print(df1)
company2 = ('comC','comD','comE','comF')
df2 = pd.DataFrame(columns=None)
df2['company'] = company2
df2['clv']=[300,450,500,600]
df2['date'] = [20191231,20191231,20191231,20191231]
print("\ndf2:")
print(df2)
df_desired = pd.DataFrame(columns=None)
df_desired['company'] = ('comA','comB','comC','comD','comE','comF')
df_desired['clv']=[100,200,300,450,500,600]
df_desired['date'] = [20191231,20191231,20191231,20191231,20191231,20191231]
print("\ndf_desired:")
print(df_desired)
df_merge = pd.merge(df1,df2,left_on = 'company',
right_on = 'company',how='outer')
print("\ndf_merge:")
print(df_merge)
# alternately
df_concat = pd.concat([df1, df2], ignore_index=True, sort=False)
print("\ndf_concat:")
print(df_concat)
One approach is to concat the two dataframes, then sort the concatenated dataframe on date in ascending order, and drop the duplicate entries (while keeping the latest entry) based on company:
df = pd.concat([df1, df2])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
df = df.sort_values('date', na_position='first').drop_duplicates('company', keep='last', ignore_index=True)
Result:
company clv date
0 comA 100 2019-12-31
1 comB 200 2019-12-31
2 comC 300 2019-12-31
3 comD 450 2019-12-31
4 comE 500 2019-12-31
5 comF 600 2019-12-31
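An equivalent sketch with groupby, keeping the row with the latest date per company; it assumes the concatenated df with 'date' already parsed as above (idxmax skips NaT within each group):
# select, for each company, the row whose date is the maximum
latest = df.loc[df.groupby('company')['date'].idxmax()]
latest = latest.sort_values('company').reset_index(drop=True)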
I have a set of four dataframes: df_all = [df1, df2, df3, df4]
As a sample, they look like this:
df1:
Name Dates a
Apple 5-5-15 NaN
Apple 6-5-15 42
Apple 6-5-16 36
Apple 6-5-17 36
df2:
Name Dates a
Banana 5-5-15 85
Banana 6-5-15 NaN
Banana 6-6-15 100
Banana 6-5-16 18
I want to merge on "Dates", which I achieve in the following manner:
for cols in df_all:
    cols.drop(['Name'], axis=1, inplace=True)
a = df1.merge(df2, how='left', on='Dates').merge(df3, how='left', on='Dates').merge(df4, how='left', on='Dates')
This gives me exactly what I want. However, the columns are renamed as a_x, a_y, a_x, a_y
This sample below shows what happens when I merge only df1 and df2.
Dates a_x a_y
5-5-15 NaN 85
6-5-15 42 NaN
6-6-15 NaN 100
6-5-16 36 18
6-5-17 36 NaN
Before the merge, I want to rename column 'a' based on the value in 'Name' (Apple or Banana), and I want to automate this as much as possible: every dataframe's column 'a' should be renamed to the value in its 'Name' column.
Change the column name in your for loop before you drop the 'Name' column.
for cols in df_all:
    name = cols['Name'][0]
    cols.drop(['Name'], axis=1, inplace=True)
    cols.rename(columns={'a': name}, inplace=True)
a = df1.merge(df2, how='left', on='Dates').merge(df3, how='left', on='Dates').merge(df4, how='left', on='Dates')
Or do it with concat after reshaping each dataframe:
df = pd.concat([x.set_index(['Name','Dates']).a.unstack(level=0) for x in df_all], axis=1)
Or combine them, then pivot_table:
df = pd.concat([df1, df2]).pivot_table(index='Dates', columns='Name', values='a', aggfunc='first')
Name Apple Banana
Dates
5-5-15 NaN 85.0
6-5-15 42.0 NaN
6-5-16 36.0 18.0
6-5-17 36.0 NaN
6-6-15 NaN 100.0
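If Dates should become an ordinary column again afterwards, a small follow-up:
df = df.reset_index()
df.columns.name = None  # drop the leftover 'Name' axis label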
You can do the rename before you drop the 'Name' column. Since all names are the same within a dataframe, you can get it from the first row:
for cols in df_all:
    cols.rename(columns={'a': cols.at[0, 'Name']}, inplace=True)
    cols.drop(['Name'], axis=1, inplace=True)
You can automate the process of merging the dataframes by renaming the columns before merging and then using functools.reduce to merge all dataframes in df_all:
from functools import reduce
# rename column a
df_all = [df.rename(columns={'a': df.pop('Name').iloc[0]}) for df in df_all]
# merge all dataframes
merged = reduce(lambda d1, d2: pd.merge(d1, d2, on=['Dates'], how='left'), df_all)
# print(merged)
# sample result after merging df1 & df2
Dates Apple Banana
0 5-5-15 NaN 85.0
1 6-5-15 42.0 NaN
2 6-5-16 36.0 18.0
3 6-5-17 36.0 NaN
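A join-based alternative sketch, under the same assumption that the 'a' columns were already renamed so that no column names collide:
# DataFrame.join accepts a list of frames once they all share the 'Dates' index
frames = [df.set_index('Dates') for df in df_all]
merged = frames[0].join(frames[1:], how='left').reset_index()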
I want to use Pandas' dropna function on axis=1 to drop columns, but only on a subset of columns with some thresh set. More specifically, I want to pass an argument on which columns to ignore in the dropna operation. How can I do this? Below is an example of what I've tried.
import numpy as np
import pandas as pd
df = pd.DataFrame({
'building': ['bul2', 'bul2', 'cap1', 'cap1'],
'date': ['2019-01-01', '2019-02-01', '2019-01-01', '2019-02-01'],
'rate1': [301, np.nan, 250, 276],
'rate2': [250, 300, np.nan, np.nan],
'rate3': [230, np.nan, np.nan, np.nan],
'rate4': [230, np.nan, 245, np.nan],
})
# Only retain columns with at least 3 non-missing values
df.dropna(axis=1, thresh=3)
building date rate1
0 bul2 2019-01-01 301.0
1 bul2 2019-02-01 NaN
2 cap1 2019-01-01 250.0
3 cap1 2019-02-01 276.0
# Try to do the same but only apply dropna to the subset of [building, date, rate1, and rate2],
# (meaning do NOT drop rate3 and rate4)
df.dropna(axis=1, thresh=3, subset=['building', 'date', 'rate1', 'rate2'])
KeyError: ['building', 'date', 'rate1', 'rate2']
(With axis=1, subset refers to row labels, so these column names are not found.)
# Desired subset of columns against which to apply `dropna`.
cols = ['building', 'date', 'rate1', 'rate2']
# Apply `dropna` and see which columns remain.
filtered_cols = df.loc[:, cols].dropna(axis=1, thresh=3).columns
# Use a conditional list comprehension to determine which columns were dropped.
dropped_cols = [col for col in cols if col not in filtered_cols]
# Use a conditional list comprehension to display all columns other than those that were dropped.
new_cols = [col for col in df if col not in dropped_cols]
df[new_cols]
building date rate1 rate3 rate4
0 bul2 2019-01-01 301.0 230.0 230.0
1 bul2 2019-02-01 NaN NaN NaN
2 cap1 2019-01-01 250.0 NaN 245.0
3 cap1 2019-02-01 276.0 NaN NaN
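Equivalently, the last two steps can collapse into a single DataFrame.drop call:
# drop the columns identified above; same result as df[new_cols]
df.drop(columns=dropped_cols)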
I find it easiest to first count the number of non-null values in each column of the subset and then apply your criteria; columns outside the subset are never considered for dropping:
cols = ['building', 'date', 'rate1', 'rate2']
# Count non-null values in each column of the subset
notnulls = df[cols].notnull().sum()
# Drop the subset columns with fewer than 3 non-null values
df.drop(columns=notnulls[notnulls < 3].index)
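Another compact sketch under the same assumptions: run dropna on the subset alone, then re-attach the untouched columns (column order may differ from the original):
# dropna applies only to the chosen subset of columns
kept = df[cols].dropna(axis=1, thresh=3)
result = pd.concat([kept, df.drop(columns=cols)], axis=1)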
The first and the second data frames are as below:
import pandas as pd
d = {'0': [2154, 799, 1023, 4724], '1': [27, 2981, 952, 797], '2': [4905, 569, 4767, 569]}
df1 = pd.DataFrame(data=d)
and
d = {'PART_NO': ['J661-03982','661-08913','922-8972','661-00352','661-06291'], 'PART_NO_ENCODED': [2154, 799, 1023, 27, 569]}
df2 = pd.DataFrame(data=d)
I want to get the corresponding part_no for each row in df1 so the resulting data frame should look like this:
d={'PART_NO': ['J661-03982','661-00352',''], 'PART_NO_ENCODED': [2154,27,4905]}
df3 = pd.DataFrame(data=d)
This I can achieve like this:
df2.set_index('PART_NO_ENCODED').reindex(df1.iloc[0,:]).reset_index().rename(columns={0:'PART_NO_ENCODED'})
But instead of passing one row at a time, like reindex(df1.iloc[0, :]), I want to get the corresponding PART_NO for all the rows in df1 at once. Please help?
You can use the second dataframe as a dictionary of replacements:
df3 = df1.replace(df2.set_index('PART_NO_ENCODED').to_dict()['PART_NO'])
The values that are not in df2 will not be replaced. They have to be identified and discarded:
df3 = df3[df1.isin(df2['PART_NO_ENCODED'].tolist())]
# 0 1 2
#0 J661-03982 661-00352 NaN
#1 661-08913 NaN 661-06291
#2 922-8972 NaN NaN
#3 NaN NaN 661-06291
You can later replace the missing values with '' or any other value of your choice with fillna.
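For example, a one-line follow-up on the result above:
# show missing matches as empty strings instead of NaN
df3 = df3.fillna('')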