I'd like to compute the correlation of the variable "hours" between two groups in a panel dataset. Specifically, I'd like to compute the correlation of hours between groups A and B with group C. So the end result would contain two numbers: corr(hours_A, hours_C) and corr(hours_B, hours_C).
I have tried:
data.groupby('group').corr()
This gave me the correlation between "hours" and "other variables" within each group, but I want the correlation of just the "hours" variable across two groups. I'm new to Python, so any help is welcome!
group  year  hours  other variables
A      2000  2784   567
A      2001  2724   567
A      2002  2715   567
B      2000  2301   567
B      2001  2612   567
B      2002  2489   567
C      2000  2190   567
C      2001  2139   567
C      2002  2159   567
Update:
Thank you for answering my question!
I eventually figured out some code of my own, but my code is not as elegant as the answers provided. For what it's worth, I'm posting it here.
df = df.set_index(['group','year'])
df = df.unstack(level=0)                          # wide format: one column per (variable, group) pair
df.index = pd.to_datetime(df.index).year
df.columns = df.columns.rename(['variables',"group"])
df.xs('hours', level="variables", axis=1).corr()  # correlation of hours across groups
Indexing year isn't necessary for the correlation, but if I want to create cross sections of the data later, it might come in handy.
Maybe it is not the best way to do it, but I believe this will get you on your way.
import pandas as pd
import numpy as np
data = data[['group', 'year', 'hours']]                          # keep only the columns we need
data_new = data.set_index(['year', 'group']).unstack(['group'])  # one column of hours per group
final_df = pd.DataFrame(data_new.to_numpy(), columns=['A', 'B', 'C'])
final_df.corr()
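A small sketch (my addition, reusing final_df from above) of pulling out just the two numbers asked for in the question:
corr_matrix = final_df.corr()
# corr(hours_A, hours_C) and corr(hours_B, hours_C)
print(corr_matrix.loc[['A', 'B'], 'C'])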
I will also leave the code that (I think) reproduces your data, for anyone who wishes to give it a try!
import pandas as pd
import numpy as np
data_str = '''A|2000|2784|567
A|2001|2724|567
A|2002|2715|567
B|2000|2301|567
B|2001|2612|567
B|2002|2489|567
C|2000|2190|567
C|2001|2139|567
C|2002|2159|567'''.split('\n')
data = pd.DataFrame([x.split('|') for x in data_str], columns=['group', 'year', 'hours', 'other_variables'])
data['hours'] = data['hours'].astype(int)
You can apply list to the groups and then convert to Series, transpose, and then call corr() on the data.
from io import StringIO
import pandas as pd
>>> data = StringIO("""group,year,hours,other variables
A,2000,2784,567
A,2001,2724,567
A,2002,2715,567
B,2000,2301,567
B,2001,2612,567
B,2002,2489,567
C,2000,2190,567
C,2001,2139,567
C,2002,2159,567""")
>>> df = pd.read_csv(data)
>>> df.groupby('group')['hours'].apply(list).apply(pd.Series).T.corr()
group         A         B         C
group
A      1.000000 -0.865940  0.867835
B     -0.865940  1.000000 -0.999993
C      0.867835 -0.999993  1.000000
How does this work? The groupby + apply(list) produces the following, which is a Series with three rows, each being a list of three items.
A [2784, 2724, 2715]
B [2301, 2612, 2489]
C [2190, 2139, 2159]
The apply(pd.Series) converts the list in each row to a series. You then have to transpose with the T operator to get the data for each group in a single column.
0 1 2
group
A 2784 2724 2715
B 2301 2612 2489
C 2190 2139 2159
Transposed, it becomes:
group A B C
0 2784 2301 2190
1 2724 2612 2139
2 2715 2489 2159
If you only want a couple of the values rather than the whole matrix, you can slice the result. For example, group A's correlations with B and with C are:
>>> df.groupby('group')['hours'].apply(list).apply(pd.Series).T.corr().iloc[1:3,0].values
array([-0.86594029, 0.86783525])
In this example, you use iloc to get the second and third rows in the first column (Python indexes are zero-based) and then the values property of a Series to return an array rather than a Series. For the two numbers asked for in the question, corr(hours_A, hours_C) and corr(hours_B, hours_C), slice the first two rows of the C column instead: .iloc[0:2, 2].values.
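As an alternative sketch (not from the original answer, just a common idiom), you can skip the list round-trip by pivoting so each group becomes its own column and then calling corr():
wide = df.pivot(index='year', columns='group', values='hours')
print(wide.corr())                        # full 3x3 correlation matrix
print(wide.corr().loc[['A', 'B'], 'C'])   # just corr(A, C) and corr(B, C)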
Related
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd
df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
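Note that the desired output in the question also keeps the last value of x2 by adding an extra row at the end. A sketch (my variation, not part of the answer above) that produces exactly that shape by adding the empty row before shifting:
import pandas as pd

df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
df = df.reindex(range(len(df) + 1))   # add one empty row at the end
df['x2'] = df['x2'].shift(1)          # shift x2 down into the new row
print(df)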
Let's define the dataframe from your example by
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could shift the second column by adding one to its index (keeping the column in an explicit variable so the change sticks)
>>> s2 = df[2]
>>> s2.index = s2.index + 1
and finally re-combine the single columns
>>> pd.concat([df[1], s2], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: In general you can shift with df[2].shift(1), as already posted; however, that cuts off the carry-over (the last value is pushed off the end).
If you don't want to lose the values you shift past the end of your dataframe, simply append the required number of empty rows first:
offset = 5
# add `offset` empty rows at the end, then shift the data down into them
DF = pd.concat([DF, pd.DataFrame(index=range(len(DF), len(DF) + offset))])
DF = DF.shift(periods=offset)
DF = DF.reset_index(drop=True)  # only needed if your original index was not already sequential
This assumes the following imports:
import pandas as pd
import numpy as np
First, append a new row of NaN values at the end of the DataFrame (df).
s1 = df.iloc[0].copy()                                       # copy 1st row to a new Series s1
s1[:] = np.nan                                               # set all values to NaN
df2 = pd.concat([df, s1.to_frame().T], ignore_index=True)    # add s1 as the last row of a new frame
This creates a new DataFrame, df2. Maybe there is a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a problem of my own, similar to yours, I found something in the pandas docs that I think answers this question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
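To illustrate what that means in practice, here is a minimal sketch (dates and values invented for the example): with freq, the index itself moves and no data is pushed off the end.
import pandas as pd

df = pd.DataFrame({'x2': [214, 234, 253, 272, 291]},
                  index=pd.date_range('2021-01-01', periods=5, freq='D'))
shifted = df.shift(periods=1, freq='D')   # index moves one day forward, data stays aligned
print(shifted.index[0])                   # 2021-01-02 00:00:00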
Hope this helps future readers with this question.
Shifting in the other direction also works: shift(-1) moves the yo column up one row and leaves NaN in its last row. Given df3:
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)
The result:
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))  # note: 'closed' is called 'inclusive' in newer pandas versions
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenating the two. I would really like to see this as a standard feature in pandas, so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the DataFrame. I expected shift() to have a flag to change this behaviour, but I couldn't find anything.
I have a csv that contains warehouses, dates and quantities (stock). I'm trying to plot the quantities by date for each warehouse (a separate plot per warehouse). I'm a beginner in Python; I've tried looking around, but I can't find anything that solves my problem.
Here's what the table looks like
csv sample
Thanks for your help !
import pandas as pd
data = pd.read_csv("path_to_your_file.csv",header=0)
by_date = data.groupby(["warehouse","date"]).agg(['mean', 'count', 'sum'])
print(by_date)
Something like this is simple and would give you your result. You will first need to install the pandas library with pip in the console:
$>pip install pandas
The pandas documentation has tutorials, walk-throughs, and a cheat sheet covering the basics.
If you want to plot the data rather than simply print it, you can do something like the following:
import pandas as pd
df = pd.read_csv("path_to_your_file.csv")
This should produce a DataFrame of the form:
Wharehouse Date Qty
0 A 4/20/2022 485
1 A 4/21/2022 642
2 A 4/22/2022 315
3 A 4/23/2022 845
4 B 4/20/2022 325
5 B 4/21/2022 156
6 B 4/22/2022 851
7 C 4/20/2022 268
8 C 4/21/2022 452
9 C 4/22/2022 265
To plot the data
df.groupby('Wharehouse').plot.bar(x= 'Date', y='Qty')
Yields the following:
Wharehouse
A AxesSubplot(0.125,0.125;0.775x0.755)
B AxesSubplot(0.125,0.125;0.775x0.755)
C AxesSubplot(0.125,0.125;0.775x0.755)
dtype: object
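If you want each warehouse in its own labelled figure, a sketch along these lines (assuming the column names shown above) also works:
import matplotlib.pyplot as plt

for name, grp in df.groupby('Wharehouse'):
    ax = grp.plot.bar(x='Date', y='Qty', legend=False, title=f'Warehouse {name}')
    ax.set_ylabel('Qty')
plt.show()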
I have two data frames that hold the historical price series of two different stocks. Applying describe() I noticed that the first series has 1291 rows while the second has 1275. The difference arises because the two securities are listed on different stock exchanges and therefore differ on some dates. What I would like to do is keep the two dataframes separate, but delete from the first dataframe all rows whose dates are not present in the second, so that the two dataframes match perfectly for the analyses. I have read that there are functions such as merge() or join(), but I have not been able to understand how to use them (if these are even the right functions). I thank those who will spend some of their time answering my question.
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1275 and the array at index 1 has size 1291"
Thank you
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as web
from scipy import stats
import seaborn as sns
pd.options.display.min_rows= None
pd.options.display.max_rows= None
tickers = ['DISW.MI','IXJ','NRJ.PA','SGOL','VDC','VGT']
wts= [0.19,0.18,0.2,0.08,0.09,0.26]
price_data = web.get_data_yahoo(tickers,
start = '2016-01-01',
end = '2021-01-01')
price_data = price_data['Adj Close']
ret_data = price_data.pct_change()[1:]
port_ret = (ret_data * wts).sum(axis = 1)
benchmark_price = web.get_data_yahoo('ACWE.PA',
start = '2016-01-01',
end = '2021-01-01')
benchmark_ret = benchmark_price["Adj Close"].pct_change()[1:].dropna()
#From now i get error
sns.regplot(benchmark_ret.values,
port_ret.values)
plt.xlabel("Benchmark Returns")
plt.ylabel("Portfolio Returns")
plt.title("Portfolio Returns vs Benchmark Returns")
plt.show()
(beta, alpha) = stats.linregress(benchmark_ret.values,
port_ret.values)[0:2]
print("The portfolio beta is", round(beta, 4))
Let's consider a toy example.
df1 consists of 6 days of data and df2 consists of 5 days of data.
As I understand it, you want df1 to also have 5 days of data, matching the dates in df2.
df1
df1 = pd.DataFrame({
'date':pd.date_range('2021-05-17', periods=6),
'px':np.random.rand(6)
})
df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
5 2021-05-22 0.127086
df2
df2 = pd.DataFrame({
'date':pd.date_range('2021-05-17', periods=5),
'px':np.random.rand(5)
})
df2
date px
0 2021-05-17 0.650976
1 2021-05-18 0.393061
2 2021-05-19 0.985700
3 2021-05-20 0.879786
4 2021-05-21 0.463206
Code
Keep only the rows of df1 whose dates also appear in df2:
df1 = df1[df1.date.isin(df2.date)]
Output df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
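The question also mentions merge(); if you prefer that route, an inner merge on the date column is an equivalent sketch that keeps only the dates present in both frames:
# keep only the rows of df1 whose date also appears in df2
df1_matched = df1.merge(df2[['date']], on='date', how='inner')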
I'm trying to take all the values from a dataframe column I have and apply a mathematical function to them.
Here's how the data looks:
Year % PM
1 2002 3
2 2002 2.3
I am trying to apply this function :
M = 100000
t = (THE PERCENTAGE FROM THE DATAFRAME)/12
n = 15*12
PM = M*((t*(1+t)**n)/(((1+t)**n)-1))
print(PM)
And my goal is to do this for all the rows and append each result to the PM column in the dF.
You can just add the formula as a column directly to the DF, creating t_div_12 as a vector from the column as below:
M = 100000
n = 15*12
t_div_12 = df["%"]/12
df["PM"] = M*((t_div_12 *(1+t_div_12 )**n)/(((1+t_div_12)**n)-1))
First, I would avoid defining constants that are not reused elsewhere in the code. You can apply this function to your dataframe with this code snippet:
dF = pd.DataFrame([[2002, 3], [2002, 2.3]], columns=["Year", "%"])
dF['PM'] = 100000*((dF["%"]/12*(1+dF["%"]/12)**(15*12))/(((1+dF["%"]/12)**(15*12))-1))
It will give you:
Year % PM
0 2002 3.0 25000.000000
1 2002 2.3 19166.666667
df['PM'] = df['%'].map(lambda t: M*(((t/12)*(1+(t/12))**n)/(((1+(t/12))**n)-1)))
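If you prefer a named function to the lambda (just a readability sketch of the same arithmetic, using M and n as defined in the question):
M = 100000
n = 15 * 12

def monthly_payment(pct):
    t = pct / 12
    return M * (t * (1 + t) ** n) / (((1 + t) ** n) - 1)

df['PM'] = df['%'].apply(monthly_payment)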
Given a pandas dataframe in the following format:
toy = pd.DataFrame({
'id': [1,2,3,
1,2,3,
1,2,3],
'date': ['2015-05-13', '2015-05-13', '2015-05-13',
'2016-02-12', '2016-02-12', '2016-02-12',
'2018-07-23', '2018-07-23', '2018-07-23'],
'my_metric': [395, 634, 165,
144, 305, 293,
23, 395, 242]
})
# Make sure 'date' has datetime format
toy.date = pd.to_datetime(toy.date)
The my_metric column contains some (random) metric of which I wish to compute a time-dependent moving average, conditional on the column id and within some time interval that I specify myself. I will refer to this time interval as the "lookback time"; it could be 5 minutes or 2 years. To determine which observations are included in the lookback calculation, we use the date column (which could be the index if you prefer).
To my frustration, I have discovered that such a procedure is not easily performed using pandas built-ins, since I need to perform the calculation conditionally on id and, at the same time, only on observations within the lookback time (checked using the date column). Hence, the output dataframe should consist of one row for each id-date combination, with the my_metric column now being the average of all observations contained within the lookback time (e.g. 2 years, including today's date).
For clarity, I have included a figure with the desired output format (apologies for the oversized figure) when using a 2-year lookback time:
I have a solution but it does not make use of specific pandas built-in functions and is likely sub-optimal (combination of list comprehension and a single for-loop). The solution I am looking for will not make use of a for-loop, and is thus more scalable/efficient/fast.
Thank you!
Calculate the lookback time (current date minus 2 years):
from dateutil.relativedelta import relativedelta
from dateutil import parser
import datetime
In [1691]: dt = '2018-01-01'
In [1695]: dt = parser.parse(dt)
In [1696]: lookback_time = dt - relativedelta(years=2)
Now, filter the dataframe on lookback time and calculate rolling average
In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)
In [1674]: toy.sort_values('id')
Out[1674]:
date id my_metric new_metric
0 2015-05-13 1 395 395.0
3 2016-02-12 1 144 144.0
6 2018-07-23 1 23 83.5
1 2015-05-13 2 634 634.0
4 2016-02-12 2 305 305.0
7 2018-07-23 2 395 350.0
2 2015-05-13 3 165 165.0
5 2016-02-12 3 293 293.0
8 2018-07-23 3 242 267.5
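For the 2-year case specifically, a more direct sketch (my assumption: approximating 2 years as a 730-day window and using pandas' time-based rolling on a date index; not part of the answer above):
out = (toy.set_index('date')
          .sort_index()
          .groupby('id')['my_metric']
          .rolling('730D')        # time-based window over the date index, per id
          .mean()
          .reset_index())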
So, after some tinkering I found an answer that generalizes adequately. I used a slightly different 'toy' dataframe (slightly more relevant to my case). For completeness' sake, here is the data:
Consider now the following code:
# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
    xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
    xt.index = xt.index.droplevel(0)
    return xt
dt='730D' # rolling average window: 730 days = 2 years
# Group by the 'id' column
g = toy.groupby('id')
# Apply the custom function
df = g.apply(rolling_average, dt=dt)
# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])
The result is as expected: