Correlation between columns of two dataframes with matched headers - python

I have two dataframes read from Excel files, which look like the ones below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the first dataframe and the corresponding column in the second dataframe, matched on the currency (i.e. KRW, THB, USD, INR). At the moment, I loop through each column, matching by index and the corresponding header, before computing the correlation:
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more panda-ish way of doing this?

So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
# dummy data
import numpy as np
import pandas as pd

date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    np.random.randint(1, 20, (date_range.shape[0], 4)),
    index=date_range,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
                  index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like; the correlations computed on this data won't mean much since it is random.
>>> print(stock_prices.head())
           BYZ6DH BLZGSL MBT BAP
              KRW    THB USD USD
2019-02-01     15     10  19  19
2019-02-02      5      9  19   5
2019-02-03     19      7  18  10
2019-02-04      1      6   7  18
2019-02-05     11     17   6   7
>>> print(fx.head())
            KRW  THB  USD
2019-02-01   15   11   10
2019-02-02    6    5    3
2019-02-03   13    1    3
2019-02-04   19    8   14
2019-02-05    6   13    2
Use apply to calculate the correlation between columns with the same currency.
def f(x, fx):
    correlation = x.corr(fx[x.name[1]])
    return correlation

correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH KRW -0.247529
BLZGSL THB 0.043084
MBT USD -0.471750
BAP USD 0.314969
dtype: float64
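If you prefer to avoid the explicit apply, a rough alternative (sketched against the same dummy stock_prices and fx built above) is to select each stock's currency column from fx, relabel it with the stock's header, and let corrwith pair the columns up:
# repeat fx's currency columns in the order of the stock header, then give them
# the same MultiIndex labels so corrwith correlates matching columns pairwise
aligned_fx = fx[stock_prices.columns.get_level_values(1)]
aligned_fx.columns = stock_prices.columns
correlation = stock_prices.corrwith(aligned_fx)
This returns the same Series of correlations indexed by (stock, currency).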

Related

Add a row for missing period and for the corresponding period calculate the average of last 3 Months

I am trying to write code that adds missing periods to the dataframe and calculates their respective averages. Refer to the example below:
Invoice Date Amount
9 01/2020 227500
4 02/2020 56000
0 03/2020 22000
1 05/2020 25000
5 06/2020 75000
2 07/2020 27000
6 08/2020 48000
3 09/2020 35000
7 10/2020 115000
8 12/2020 85000
In the above dataframe, we see that the record for '11/2020' is missing. I am trying to add a record for the period 11/2020 and set its amount to the mean of the last three months, i.e. if 11/2020 is missing, take the amounts of 12/2020, 10/2020 and 9/2020, calculate their mean, and append it to the dataframe.
Expected output:
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 75000.00
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67
9 12/2020 85000.00
Please note that I am able to arrive at the above result with the following code:
import pandas as pd

FundAdmin = {
    'Invoice Date': ['03/2020', '05/2020', '07/2020', '09/2020', '02/2020', '04/2020',
                     '06/2020', '08/2020', '10/2020', '12/2020', '01/2020'],
    'Amount': [22000, 25000, 27000, 35000, 56000, 75000, 48000, 115000, 77000, 85000, 227500]
}
expected_dates = ['01/2020', '02/2020', '03/2020', '04/2020', '05/2020', '06/2020',
                  '07/2020', '08/2020', '09/2020', '10/2020', '11/2020', '12/2020']
df = pd.DataFrame(FundAdmin, columns=['Invoice Date', 'Amount'])
current_dates = df['Invoice Date']
missing_dates = list(set(expected_dates) - set(current_dates))
sorted_df = df.sort_values(by='Invoice Date')
for i in missing_dates:
    Top_3_Rows = sorted_df.tail(3)  # print(Top_3_Rows)
    Top_3_Rows_Amount = round(Top_3_Rows.mean(), 2)
    CalcDF = {
        'Invoice Date': i,
        'Amount': float(Top_3_Rows_Amount)
    }
    FullDF = df.append(CalcDF, ignore_index=True)
print(FullDF)
However, my code cannot handle the calculation for missing records in the middle of the dataframe. It adds the missing period to the dataframe, but it does not pick up the values of the previous 3 months; instead it assigns the same mean amount to all missing periods. Example: if the record for 4/2020 is missing, the code should add a new record for 4/2020 and assign it the mean of 1/2020, 2/2020 and 3/2020. Instead, it assigns the mean value computed for the other missing period. Please refer to the output below:
Expected Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 101833.33 <---- New record inserted for 4/2020, calculated as the mean of 3/2020, 2/2020 and 1/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <---- New record inserted for 11/2020, calculated as the mean of 12/2020, 10/2020 and 9/2020
9 12/2020 85000.00
My Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 65666.67 <--- Value same as 11/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <--- This works fine.
9 12/2020 85000.00
From my observation, my code cannot fetch the last 3 records when the missing period falls in the middle of the dataframe: because I use the tail() method, it fetches the records for 9/2020, 10/2020 and 12/2020, calculates their mean and assigns that same value to 4/2020. I am a complete beginner in Python, and any assistance in resolving the above issue is greatly appreciated.
Would this work for you?
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from random import randint

df_len = 100
df = pd.DataFrame({
    'Invoice': [randint(1, 10) for _ in range(df_len)],
    'Dates': [(datetime.today() - pd.DateOffset(months=mnths_ago)).date()
              for mnths_ago in range(df_len)],
    'Amount': [randint(1, 100000) for _ in range(df_len)],
})

# Drop 10 random rows
drop_indices = np.random.choice(df.index, 10, replace=False)
df = df.drop(drop_indices)
df
Invoice Dates Amount
0 1 2020-05-19 23797
1 6 2020-04-19 54101
2 10 2020-03-19 91522
3 5 2020-02-19 48762
4 1 2020-01-19 54497
.. ... ... ...
93 1 2012-08-19 56834
94 10 2012-07-19 21382
95 2 2012-06-19 33056
96 1 2012-05-19 93336
98 7 2012-03-19 12406
from dateutil import relativedelta

def get_prev_mean(date):
    return df[:df.loc[df.Dates == date].index[0]].tail(3)['Amount'].mean()

r = relativedelta.relativedelta(df.Dates.min(), df.Dates.max())
n_months = -(r.years * 12) + r.months
all_months = [(df.Dates.max() - pd.DateOffset(months=mnths_ago)).date() for mnths_ago in range(n_months)]
missing_months = [mnth for mnth in all_months if mnth in list(df.Dates)]
dct = {mnth: get_prev_mean(mnth) for mnth in missing_months}
to_merge = pd.DataFrame(data=dct.values(), index=dct.keys()).reset_index()
to_merge.columns = ['Dates', 'Amount']
out = pd.concat([df, to_merge], sort=False).sort_values(by='Dates').reset_index(drop=True)
out
out
Invoice Dates Amount
0 7.0 2012-03-19 12406.0
1 1.0 2012-05-19 93336.0
2 2.0 2012-06-19 33056.0
3 10.0 2012-07-19 21382.0
4 1.0 2012-08-19 56834.0
.. ... ... ...
171 10.0 2020-03-19 91522.0
172 NaN 2020-04-19 23797.0
173 6.0 2020-04-19 54101.0
174 NaN 2020-05-19 NaN
175 1.0 2020-05-19 23797.0
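For comparison, a more compact sketch against the question's original df (the MM/YYYY strings and column names come from there; the chained fill, where a freshly computed month feeds into later gaps, is my reading of the desired behaviour):
# index the amounts by month, reindex over the full range so gaps become NaN,
# then fill each gap with the mean of the three most recent known months
s = df.set_index(pd.to_datetime(df['Invoice Date'], format='%m/%Y').dt.to_period('M'))['Amount']
s = s.sort_index().reindex(pd.period_range(s.index.min(), s.index.max(), freq='M'))
for period in s.index[s.isna()]:
    s[period] = round(s.loc[:period].dropna().tail(3).mean(), 2)
full_df = pd.DataFrame({'Invoice Date': s.index.strftime('%m/%Y'), 'Amount': s.values})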

How to generate discrete data to pass into a contour plot using pandas and matplotlib?

I have two sets of continuous data that I would like to pass into a contour plot. The x-axis would be time, the y-axis would be mass, and the z-axis would be frequency (as in how many times that data point appears). However, most data points are not identical but rather very similar. Thus, I suspect it's easiest to discretize both the x-axis and y-axis.
Here's the data I currently have:
INPUT
import pandas as pd
df = pd.read_excel('data.xlsx')
df['Dates'].head(5)
df['Mass'].head(5)
OUTPUT
13 2003-05-09
14 2003-09-09
15 2010-01-18
16 2010-11-21
17 2012-06-29
Name: Date, dtype: datetime64[ns]
13 2500.0
14 3500.0
15 4000.0
16 4500.0
17 5000.0
Name: Mass, dtype: float64
I'd like to convert the data such that it groups up data points within the year (ex: all datapoints taken in 2003) and it groups up data points within different levels of mass (ex: all datapoints between 3000-4000 kg). Next, the code would count how many data points are within each of these blocks and pass that as the z-axis.
Ideally, I'd also like to be able to adjust the levels of slices. Ex: grouping points up every 100kg instead of 1000kg, or passing a custom list of levels that aren't equally distributed. How would I go about doing this?
I think the function you are looking for is pd.cut
import pandas as pd
import numpy as np
import datetime

n = 10
scale = 1e3
Min = 0
Max = 1e4
np.random.seed(6)
Start = datetime.datetime(2000, 1, 1)
Dates = np.array([Start + datetime.timedelta(days=i*180) for i in range(n)])
Mass = np.random.rand(n)*10000
df = pd.DataFrame(index=Dates, data={'Mass': Mass})
print(df)
gives you:
Mass
2000-01-01 8928.601514
2000-06-29 3319.798053
2000-12-26 8212.291231
2001-06-24 416.966257
2001-12-21 1076.566799
2002-06-19 5950.520642
2002-12-16 5298.173622
2003-06-14 4188.074286
2003-12-11 3354.078493
2004-06-08 6225.194322
If you want to group your masses by, say, 1000, or implement your own custom bins, you can do this:
Bins, Labels = np.arange(Min, Max+.1, scale), np.arange(Min, Max, scale) + scale/2
EqualBins = pd.cut(df['Mass'], bins=Bins, labels=Labels)
df.insert(1, 'Equal Bins', EqualBins)

Bins, Labels = [0, 1000, 5000, 10000], ['Small', 'Medium', 'Big']
CustomBins = pd.cut(df['Mass'], bins=Bins, labels=Labels)
df.insert(2, 'Custom Bins', CustomBins)
If you just want to show the year, month, etc., it is very simple:
df['Year'] = df.index.year
df['Month'] = df.index.month
but you can also do custom date ranges if you like:
Bins=[datetime.datetime(1999, 12, 31),datetime.datetime(2000, 9, 1),
datetime.datetime(2002, 1, 1),datetime.datetime(2010, 9, 1)]
Labels = ['Early','Middle','Late']
CustomDateBins = pd.cut(df.index,bins=Bins,labels=Labels)
df.insert(3,'Custom Date Bins',CustomDateBins)
print(df)
This yields something like what you want:
Mass Equal Bins Custom Bins Custom Date Bins Year Month
2000-01-01 8928.601514 8500.0 Big Early 2000 1
2000-06-29 3319.798053 3500.0 Medium Early 2000 6
2000-12-26 8212.291231 8500.0 Big Middle 2000 12
2001-06-24 416.966257 500.0 Small Middle 2001 6
2001-12-21 1076.566799 1500.0 Medium Middle 2001 12
2002-06-19 5950.520642 5500.0 Big Late 2002 6
2002-12-16 5298.173622 5500.0 Big Late 2002 12
2003-06-14 4188.074286 4500.0 Medium Late 2003 6
2003-12-11 3354.078493 3500.0 Medium Late 2003 12
2004-06-08 6225.194322 6500.0 Big Late 2004 6
The .groupby function is probably of interest to you as well:
yeargroup = df.groupby(df.index.year).mean()
massgroup = df.groupby(df['Equal Bins']).count()
print(yeargroup)
print(massgroup)
Mass Year Month
2000 6820.230266 2000.0 6.333333
2001 746.766528 2001.0 9.000000
2002 5624.347132 2002.0 9.000000
2003 3771.076389 2003.0 9.000000
2004 6225.194322 2004.0 6.000000
Mass Custom Bins Custom Date Bins Year Month
Equal Bins
500.0 1 1 1 1 1
1500.0 1 1 1 1 1
2500.0 0 0 0 0 0
3500.0 2 2 2 2 2
4500.0 1 1 1 1 1
5500.0 2 2 2 2 2
6500.0 1 1 1 1 1
7500.0 0 0 0 0 0
8500.0 2 2 2 2 2
9500.0 0 0 0 0 0
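To close the loop on the contour plot itself, a rough sketch (assuming the df with the 'Year' and 'Equal Bins' columns built above, plus a standard matplotlib import) would count points per year/mass-bin cell and pass the grid to contourf:
import matplotlib.pyplot as plt

# frequency of observations in each (year, mass bin) cell
counts = df.groupby(['Year', 'Equal Bins']).size().unstack('Equal Bins', fill_value=0)
X, Y = np.meshgrid(counts.columns.astype(float), counts.index)  # bin midpoints vs. years
plt.contourf(X, Y, counts.values)
plt.xlabel('Mass bin midpoint')
plt.ylabel('Year')
plt.colorbar(label='Frequency')
plt.show()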

How to convert time data to numeric value?

I have a dataframe out:
dates min max wh
0 2005-09-06 07:41:18 21:59:57 14:18:39
1 2005-09-12 14:49:22 14:49:22 00:00:00
2 2005-09-19 11:08:56 11:24:05 00:15:09
3 2005-09-21 21:19:21 21:20:15 00:00:54
4 2005-09-22 19:41:52 19:41:52 00:00:00
5 2005-10-13 11:22:07 21:05:41 09:43:34
6 2005-11-22 11:53:12 21:21:22 09:28:10
7 2005-11-23 00:07:01 14:08:50 14:01:49
8 2005-11-30 13:42:48 23:59:19 10:16:31
9 2005-12-01 00:05:16 10:24:12 10:18:56
10 2005-12-21 17:38:43 19:26:03 01:47:20
11 2005-12-22 09:20:07 11:25:40 02:05:33
12 2006-01-23 07:46:20 08:01:52 00:15:32
13 2006-04-27 16:27:54 19:29:52 03:01:58
14 2006-05-11 12:48:34 23:10:44 10:22:10
15 2006-05-15 10:14:59 22:28:12 12:13:13
16 2006-05-16 01:14:07 23:55:51 22:41:44
17 2006-05-17 01:12:45 23:57:56 22:45:11
18 2006-05-18 02:42:08 21:48:49 19:06:41
and I want the average work hours per day (represented by the column wh) per month.
out['dates'] = pd.to_datetime(out['dates'])
out['month']= pd.PeriodIndex(out.dates, freq='M')
out2=out.groupby('month')['wh'].mean().reset_index(name='wh2')
I have used this so far, but the values in wh are not numeric, so I can't compute the mean. How can I convert the whole column wh so that I can compute the mean?
The wh column was created as follows:
df = pd.read_csv("Testordner2/"+i, parse_dates=True)
df['new_time'] = pd.to_datetime(df['new_time'])
df['dates']= df['new_time'].dt.date
df['time'] = df['new_time'].dt.time
out = df.groupby(df['dates']).agg({'time': ['min', 'max']}) \
.stack(level=0).droplevel(1)
out['min_as_time_format'] = pd.to_datetime(out['min'], format="%H:%M:%S")
out['max_as_time_format'] = pd.to_datetime(out['max'], format="%H:%M:%S")
out['wh'] = out['max_as_time_format'] - out['min_as_time_format']
out['wh'].astype(str).str[-18:-10]
One possible solution is to convert the timedeltas to their native integer representation (nanoseconds), aggregate the mean, and then convert back to timedeltas:
import numpy as np

out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
out['wh'] = pd.to_timedelta(out['wh']).astype(np.int64)
out2 = pd.to_timedelta(out.groupby('month')['wh'].mean()).reset_index(name='wh2')
print (out2)
month wh2
0 2005-09 02:54:56.400000
1 2005-10 09:43:34
2 2005-11 11:15:30
3 2005-12 04:43:56.333333
4 2006-01 00:15:32
5 2006-04 03:01:58
6 2006-05 17:25:47.800000
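An alternative sketch of the same idea (assuming the same out dataframe) goes through total seconds instead of raw nanoseconds:
out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
# average duration in seconds per month, converted back to timedeltas at the end
secs = pd.to_timedelta(out['wh']).dt.total_seconds()
out2 = pd.to_timedelta(secs.groupby(out['month']).mean(), unit='s').reset_index(name='wh2')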

Pandas: how to change the data type of values of a row?

I have the following DataFrame:
actor Daily Total actor1 actor2
Day
2019-01-01 25 10 15
2019-01-02 30 15 15
Avg 27.5 12.5 15.0
How do I change the data type of 'Avg' row to integer? How do I round those values in the row?
In pandas, after adding a new row filled with floats, all columns are changed to floats.
One possible solution is to round and convert all columns:
df = df.round().astype(int)
Or add new Series converted to integer:
df = df.append(df.mean().rename('Avg').round().astype(int))
print (df)
Daily Total actor1 actor2
actor
2019-01-01 25 10 15
2019-01-02 30 15 15
Avg 28 12 15
If you want to convert only the columns whose values in that row are whole numbers:
d = dict.fromkeys(df.columns[df.loc['Avg'] == df.loc['Avg'].astype(int)], 'int')
df = df.astype(d)
print (df)
Daily Total actor1 actor2
actor
2019-01-01 25.0 10.0 15
2019-01-02 30.0 15.0 15
Avg 27.5 12.5 15
Use loc to access the row by index, then apply numpy.round:
import numpy as np
df.loc['Avg'] = df.loc['Avg'].apply(np.round)
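Note that this only rounds the values; the dtype stays float because dtypes in pandas live on columns, not rows. A small follow-up sketch (assuming every value in the frame is whole after the rounding) would be to cast the columns afterwards:
df = df.astype(int)  # dtypes are per column, so the cast has to cover the whole frame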

How to calculate rolling mean on a GroupBy object using Pandas?

How to calculate rolling mean on a GroupBy object using Pandas?
My Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index('ds')
grouped_df = df.groupby('city')
What grouped_df looks like:
I want to calculate the rolling mean on each of the groups in my GroupBy object using pandas.
I tried pd.rolling_mean(grouped_df, 3).
Here is the error I get:
AttributeError: 'DataFrameGroupBy' object has no attribute 'dtype'
Edit: Should I maybe iterate over the groups and calculate the rolling mean on each group as I go?
You could try iterating over the groups
In [39]: df = pd.DataFrame({'a':list('aaaaabbbbbaaaccccbbbccc'),"bookings":range(1,24)})
In [40]: grouped = df.groupby('a')
In [41]: for group_name, group_df in grouped:
....: print group_name
....: print pd.rolling_mean(group_df['bookings'],3)
....:
a
0 NaN
1 NaN
2 2.000000
3 3.000000
4 4.000000
10 6.666667
11 9.333333
12 12.000000
dtype: float64
b
5 NaN
6 NaN
7 7.000000
8 8.000000
9 9.000000
17 12.333333
18 15.666667
19 19.000000
dtype: float64
c
13 NaN
14 NaN
15 15
16 16
20 18
21 20
22 22
dtype: float64
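Note that pd.rolling_mean was removed in later pandas releases; as far as I know, the modern equivalent of the loop above is the rolling method on the GroupBy object itself (sketch, assuming pandas >= 0.18):
# rolling mean per group; the result carries a (group, original row) MultiIndex
df.groupby('a')['bookings'].rolling(3).mean()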
You want the dates in the left column and the city values as separate columns. One way to do this is to set the index on date and city, and then unstack. This is equivalent to a pivot table. You can then compute your rolling mean in the usual fashion.
df = pd.read_csv("example.csv", parse_dates=['ds'])
df = df.set_index(['ds', 'city']).unstack('city')
rm = pd.rolling_mean(df, 3)
I wouldn't recommend using a function, as the data for a given city can simply be returned as follows (: returns all rows):
df.loc[:, city]
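With current pandas the same pivot-then-roll idea would look roughly like this (a sketch using the ds column name from the question, since pd.rolling_mean is gone):
df = pd.read_csv("example.csv", parse_dates=['ds'])
wide = df.set_index(['ds', 'city']).unstack('city')  # cities become columns
rm = wide.rolling(3).mean()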
