So I created a function that returns the returns of quantile portfolios as a time series.
If I call Quantile_Returns(2014), the result (DataFrame) looks like this.
Date Q1 Q2 Q3 Q4 Q5
2014-02-28 6.20 4.87 5.41 5.04 4.91
2014-03-31 -0.50 0.05 1.55 1.36 1.49
2014-04-30 -0.17 0.20 0.33 -0.26 1.76
2014-05-30 2.69 1.95 1.95 2.11 2.29
2014-06-30 3.12 3.40 2.81 1.82 2.36
2014-07-31 -2.52 -2.34 -1.92 -2.36 -1.80
2014-08-29 4.60 3.87 4.50 4.65 3.58
2014-09-30 -3.29 -3.25 -3.51 -0.96 -1.76
2014-10-31 2.55 4.63 2.37 3.60 2.10
2014-11-28 0.88 2.08 1.26 4.46 2.83
2014-12-31 0.35 0.20 -0.19 1.01 0.34
2015-01-30 -2.97 -2.63 -3.44 -2.32 -2.61
Now I want to call this function for a range of years, time_period = list(range(1960, 2021)), and get back a single time series that runs from 1960 to 2021.
I tried like this
time_period = list(range(1960, 2021))
for j in time_period:
    if j == 1960:
        Quantile = pd.DataFrame(Quantile_Returns(j))
    else:
        Quantile = pd.concat(Quantile, Quantile_Returns(j + 1))
But it did not work.
The error is:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
How can I implement this?
Thank you!
Try replacing the whole loop with
Quantile = pd.concat(Quantile_Returns(j) for j in range(1960, 2021))
pd.concat is expecting a sequence of pandas objects, and in the second pass through your loop you are giving it a DataFrame as the first argument (not a sequence of DataFrames). Also, the second argument should be an axis to concatenate on, not another DataFrame.
Here, I just passed it the sequence of all the DataFrames for different years as the first argument (using a generator expression).
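If you want to see it run end to end, here is a minimal self-contained sketch; quantile_returns_stub below is a hypothetical stand-in for the real Quantile_Returns, returning random monthly data:

```python
import numpy as np
import pandas as pd

def quantile_returns_stub(year):
    # Hypothetical stand-in for Quantile_Returns: 12 monthly rows per year.
    idx = pd.date_range(f"{year}-01-01", periods=12, freq="MS")
    return pd.DataFrame(np.random.rand(12, 5), index=idx,
                        columns=["Q1", "Q2", "Q3", "Q4", "Q5"])

# pd.concat takes an iterable of pandas objects as its FIRST argument,
# so we hand it one DataFrame per year via a generator expression.
quantile = pd.concat(quantile_returns_stub(j) for j in range(1960, 2021))
print(quantile.shape)  # (732, 5): 61 years x 12 months
```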
I have a dataframe that I'd like to export to a CSV file where the columns are stacked on top of one another. I want to use each column header, combined with the year, as a row label in this format: Allu_1_2013.
date Allu_1 Allu_2 Alluv_3 year
2013-01-01 2.00 1.45 3.54 2013
2014-01-01 3.09 2.35 9.01 2014
2015-01-01 4.79 4.89 10.04 2015
The final CSV text file should look like:
Allu_1_2013 2.00
Allu_1_2014 3.09
Allu_1_2015 4.79
Allu_2_2013 1.45
Allu_2_2014 2.35
Allu_2_2015 4.89
Allu_3_2013 3.54
Allu_3_2014 9.01
Allu_3_2015 10.04
You can use melt:
new_df = df.melt(id_vars=["date", "year"],
                 var_name="Date",
                 value_name="Value").drop(columns=['date'])
new_df['idx'] = new_df['Date'] + '_' + new_df['year'].astype(str)
new_df = new_df.drop(columns=['year', 'Date'])
   Value           idx
0   2.00   Allu_1_2013
1   3.09   Allu_1_2014
2   4.79   Allu_1_2015
3   1.45   Allu_2_2013
4   2.35   Allu_2_2014
5   4.89   Allu_2_2015
6   3.54  Alluv_3_2013
7   9.01  Alluv_3_2014
8  10.04  Alluv_3_2015
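Putting it all together as a runnable sketch, using the sample data from the question and finishing with the CSV export the question asked for (the output filename is hypothetical):

```python
from io import StringIO
import pandas as pd

txt = """date Allu_1 Allu_2 Alluv_3 year
2013-01-01 2.00 1.45 3.54 2013
2014-01-01 3.09 2.35 9.01 2014
2015-01-01 4.79 4.89 10.04 2015"""
df = pd.read_csv(StringIO(txt), sep=" ")

# Melt the wide factor columns into long format, then build the label column.
new_df = df.melt(id_vars=["date", "year"], var_name="col", value_name="Value")
new_df["idx"] = new_df["col"] + "_" + new_df["year"].astype(str)
new_df = new_df.sort_values("idx")[["idx", "Value"]]

# Write label/value pairs without index or header, as in the desired output.
new_df.to_csv("stacked.csv", index=False, header=False)  # hypothetical filename
```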
import pandas as pd
diamonds = pd.read_csv('diam.csv')
print(diamonds.head())
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
I want to print only the columns of object data type:
x=diamonds.dtypes=='object'
diamonds.where(diamonds[x]==True)
But I get this error:
unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
where aligns the boolean mask along the row axis, but your mask is indexed by column names. Use diamonds.loc[:, diamonds.dtypes == object], or the built-in select_dtypes.
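As a small self-contained illustration of both options (on a stand-in frame, since the original CSV isn't available):

```python
import pandas as pd

df = pd.DataFrame({"carat": [0.23, 0.21],
                   "cut": ["Ideal", "Premium"],
                   "color": ["E", "E"]})

# Option 1: boolean mask applied to the COLUMN axis via .loc
obj_cols = df.loc[:, df.dtypes == object]

# Option 2: the built-in helper
obj_cols2 = df.select_dtypes(include="object")

print(obj_cols.columns.tolist())  # ['cut', 'color']
```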
From your post (badly formatted) I recreated the diamonds DataFrame, getting a result like below:
   Unnamed: 0  carat      cut color clarity  depth  table  price     x     y     z quality?color
0           0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43       Ideal,E
1           1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31     Premium,E
2           2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31        Good,E
3           3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63     Premium,I
4           4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75        Good,J
When you run x = diamonds.dtypes == 'object' and print x, the result is:
Unnamed: 0 False
carat False
cut True
color True
clarity True
depth False
table False
price False
x False
y False
z False
quality?color True
dtype: bool
It is a boolean vector, answering, for each column, the question: is this column of object type?
Note that diamonds.columns.where(x).dropna().tolist() prints:
['cut', 'color', 'clarity', 'quality?color']
i.e. names of columns with object type.
So to print all columns of object type you should run:
diamonds[diamonds.columns.where(x).dropna()]
This question already has answers here:
forward fill specific columns in pandas dataframe
(6 answers)
Closed 4 years ago.
I have had a rethink of the issue and have reformulated my question.
I have a dataframe (df) which has time-series data for a number of factors. The time series for each factor can start on different days, which is fine. On some specific days there is missing data (white space) for FactorB and FactorC (in this example 07/01/2017). For these white-space days I would like to fill the holes with the value for that factor from the previous day. For example:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
In df data is missing for FactorB and FactorC on 07/01/2017. I would like the resulting df to look like:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62 5.93 5.91
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
I am wondering if I need to specifically change the white space for FactorB and FactorC on the date with the hole in it (in this example 07/01/2017) to NaN before I then apply
df= df.replace('',np.NaN).ffill()
So my intermediate output for the issue would look like:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62 NaN NaN
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
But how would I apply NaN only to days where data is legitimately missing (without changing the days before the FactorB and FactorC time series started)? Also, is there a way to do this without explicitly naming a date, since the holes could be on any date?
I have tried the following, but when I check the data the white space is still there and I feel like I'm going nowhere:
col = ['FactorB', 'FactorC']
df[col] = df[col].ffill()
I've also tried:
df.fillna(method='ffill')
and
df= df.replace('',np.NaN).ffill()
If the missing values are empty strings rather than NaN:
df = df.replace('', np.nan).ffill()
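A minimal sketch of that one-liner restricted to the target columns, assuming the blanks really are empty strings and using a three-row excerpt of the data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "FactorA": [5.94, 5.62, 6.01],
    "FactorB": [5.93, "", 6.20],
    "FactorC": [5.91, "", 6.21],
}, index=["06/01/2017", "07/01/2017", "08/01/2017"])

# Turn empty strings into NaN, then forward-fill only the chosen columns.
cols = ["FactorB", "FactorC"]
df[cols] = df[cols].replace("", np.nan).ffill()

print(df.loc["07/01/2017"].tolist())  # [5.62, 5.93, 5.91]
```

Note that ffill never fills leading NaNs (there is nothing before them to fill from), so the days before a factor's series starts are left untouched, as in the desired output.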
I have a set of stock market data, sampled below.
I would like to work out the MAX 'close' price over each 5-day period.
symbol date open high low close volume
AAU 1-Jan-07 2.25 2.25 2.25 2.25 0
AAU 2-Jan-07 2.25 2.25 2.25 2.25 0
AAU 3-Jan-07 2.32 2.32 2.26 2.26 39800
AAU 4-Jan-07 2.29 2.35 2.27 2.32 114200
AAU 5-Jan-07 2.32 2.32 2.26 2.27 113600
AAU 8-Jan-07 2.27 2.35 2.1 2.33 84500
AAU 9-Jan-07 2.31 2.31 2.21 2.23 54200
AAU 10-Jan-07 2.24 2.3 2.2 2.3 29000
AAU 11-Jan-07 2.23 2.33 2.22 2.24 21400
AAU 12-Jan-07 2.25 2.33 2.25 2.33 45200
To do this I have added a new column to calculate the end date range (+5 days):
df['1w_date'] = df['date'].shift(-6)
The df then looks like this:
symbol date open high low close volume 5d_date
AAU 1-Jan-07 2.25 2.25 2.25 2.25 0 8-Jan-07
AAU 2-Jan-07 2.25 2.25 2.25 2.25 0 9-Jan-07
AAU 3-Jan-07 2.32 2.32 2.26 2.26 39800 10-Jan-07
AAU 4-Jan-07 2.29 2.35 2.27 2.32 114200 11-Jan-07
AAU 5-Jan-07 2.32 2.32 2.26 2.27 113600 12-Jan-07
AAU 8-Jan-07 2.27 2.35 2.1 2.33 84500 15-Jan-07
AAU 9-Jan-07 2.31 2.31 2.21 2.23 54200 16-Jan-07
AAU 10-Jan-07 2.24 2.3 2.2 2.3 29000 17-Jan-07
AAU 11-Jan-07 2.23 2.33 2.22 2.24 21400 18-Jan-07
AAU 12-Jan-07 2.25 2.33 2.25 2.33 45200 19-Jan-07
Next I set the date column as the df Index:
df = df.set_index(['date'])
Then I attempt to loop through each row, using the 'date' as the start date and the '5d_date' as the end date.
for i in df:
    date_filter = df.loc[df['date']:df['5d_date']]
    df['min_value'] = min(date_filter['low'])
    df['max_value'] = max(date_filter['high'])
Unfortunately I get a KeyError: 'date'.
I have tried many different ways, but cannot figure out how to do this. Does anyone know how to fix this, or a better way of doing it?
Thanks.
After you set the index to date, you can use pd.DataFrame.rolling (shown here with .mean(); .max() and .min() work the same way):
df.rolling('7d')['close'].mean()
Out[93]:
date
2007-01-01 2.250000
2007-01-02 2.250000
2007-01-03 2.253333
2007-01-04 2.270000
2007-01-05 2.270000
2007-01-08 2.286000
2007-01-09 2.282000
2007-01-10 2.290000
2007-01-11 2.274000
2007-01-12 2.286000
Name: close, dtype: float64
or, even without doing so,
df.rolling(5)['close'].mean()
Out[94]:
date
2007-01-01 NaN
2007-01-02 NaN
2007-01-03 NaN
2007-01-04 NaN
2007-01-05 2.270
2007-01-08 2.286
2007-01-09 2.282
2007-01-10 2.290
2007-01-11 2.274
2007-01-12 2.286
Name: close, dtype: float64
depending on whether you want a week (1), or five rows of data (2).
To have either of these at the start of the range instead of the end, just add .shift(-4) to the latter, and even to the former if you really do have exactly five days per week, every week.
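And since the original goal was the MAX close over each 5-day window, the same pattern applies with .max(). A sketch using the question's close prices:

```python
from io import StringIO
import pandas as pd

txt = """date,close
2007-01-01,2.25
2007-01-02,2.25
2007-01-03,2.26
2007-01-04,2.32
2007-01-05,2.27
2007-01-08,2.33
2007-01-09,2.23
2007-01-10,2.30
2007-01-11,2.24
2007-01-12,2.33"""
df = pd.read_csv(StringIO(txt), parse_dates=["date"]).set_index("date")

# Maximum close over a 5-row window; use rolling('7d') for calendar weeks.
max5 = df["close"].rolling(5).max()
print(max5.iloc[4])  # 2.32 -> max of the first five closes
```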
In the dataframe below:
T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 -
How do I replace all the - with NaN? I do not want to specify column names, since I do not know beforehand which columns will have -.
If those are strings, then your floats are probably also strings.
Assuming your dataframe is df, I'd try
pd.to_numeric(df.stack(), 'coerce').unstack()
Deeper explanation
Pandas doesn't usually represent missing floats with '-'. Therefore, that '-' must be a string, and the dtype of any column with a '-' in it must be 'object'. That makes it highly likely that whatever parsed the data left the floats as strings.
setup
from io import StringIO
import pandas as pd
txt = """T2MN T2MX RH2M DFP2M RAIN
6.96 9.32 84.27 5.57 -
6.31 10.46 - 5.63 -
- 10.66 79.38 3.63 -
0.79 4.45 94.24 1.85 -
1.45 3.99 91.71 1.17 - """
df = pd.read_csv(StringIO(txt), delim_whitespace=True)
print(df)
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 -
1 6.31 10.46 - 5.63 -
2 - 10.66 79.38 3.63 -
3 0.79 4.45 94.24 1.85 -
4 1.45 3.99 91.71 1.17 -
What are the dtypes?
print(df.dtypes)
T2MN object
T2MX float64
RH2M object
DFP2M float64
RAIN object
dtype: object
What is the type of the first element?
print(type(df.iloc[0, 0]))
<class 'str'>
This means that any column with a '-' is like a column of strings that look like floats. You want to use pd.to_numeric with parameter errors='coerce' to force non-numeric items to np.nan. However, pd.to_numeric does not operate on a pd.DataFrame so we stack and unstack.
pd.to_numeric(df.stack(), 'coerce').unstack()
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
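In recent pandas versions you can get the same result without the stack/unstack round-trip by applying pd.to_numeric column-wise (a minor variation on the same idea, shown here on a two-row excerpt):

```python
from io import StringIO
import pandas as pd

txt = """T2MN T2MX RH2M
6.96 9.32 -
- 10.46 79.38"""
df = pd.read_csv(StringIO(txt), sep=" ")

# apply runs pd.to_numeric on each column; '-' is coerced to NaN.
clean = df.apply(pd.to_numeric, errors="coerce")
print(clean.isna().sum().sum())  # 2 missing values coerced to NaN
```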
Just replace() the string:
In [10]: df.replace('-', 'NaN')
Out[10]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
I think you want the actual numpy.nan instead of the string 'NaN', as you can then use methods such as fillna/isnull/notnull on the pandas.Series/pandas.DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([['-']*10]*10)
df = df.replace('-',np.nan)
It looks like you were reading this data from a CSV/FWF file. If so, the easiest way to get rid of '-' is to tell Pandas that it is a representation of NaN:
df = pd.read_csv(filename, na_values=['NaN', 'nan', '-'])
Test:
In [79]: df
Out[79]:
T2MN T2MX RH2M DFP2M RAIN
0 6.96 9.32 84.27 5.57 NaN
1 6.31 10.46 NaN 5.63 NaN
2 NaN 10.66 79.38 3.63 NaN
3 0.79 4.45 94.24 1.85 NaN
4 1.45 3.99 91.71 1.17 NaN
In [80]: df.dtypes
Out[80]:
T2MN float64
T2MX float64
RH2M float64
DFP2M float64
RAIN float64
dtype: object