KeyError: 'Symbols' when using a pivot table - python

I am trying to look up data in a pandas DataFrame:
import pandas as pd
import numpy as np
from statsmodels import api as sm
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016,12,2)
end = datetime.datetime.today()
df = web.get_data_yahoo(['F', '^GSPC'], start, end)
If I unstack the data here:
df.unstack()
I get the following:
Attributes Symbols Date
Adj Close F 2016-12-01 1.011866e+01
2016-12-02 9.963994e+00
2016-12-05 1.012680e+01
2016-12-06 1.022449e+01
2016-12-07 1.063152e+01
...
Volume ^GSPC 2019-11-22 3.226780e+09
2019-11-25 3.511530e+09
2019-11-26 4.595590e+09
2019-11-27 3.033090e+09
2019-11-29 1.743020e+11
Length: 9048, dtype: float64
df has the following data:
Attributes Adj Close Close High Low Open Volume
Symbols F ^GSPC F ^GSPC F ^GSPC F ^GSPC F ^GSPC F ^GSPC
Date
2015-02-11 12.216836 2068.530029 16.250000 2068.530029 16.309999 2073.479980 16.010000 2057.989990 16.080000 2068.550049 34285300.0 3.596860e+09
2015-02-12 12.299535 2088.479980 16.360001 2088.479980 16.450001 2088.530029 16.299999 2069.979980 16.340000 2069.979980 23738800.0 3.788350e+09
2015-02-13 12.254424 2096.989990 16.299999 2096.989990 16.360001 2097.030029 16.190001 2086.699951 16.330000 2088.780029 19954600.0 3.527450e+09
2015-02-17 12.111583 2100.340088 16.110001 2100.340088 16.299999 2101.300049 16.000000 2089.800049 16.209999 2096.469971 44362300.0 3.361750e+09
2015-02-18 12.186762 2099.679932 16.209999 2099.679932 16.330000 2100.229980 16.059999 2092.149902 16.160000 2099.159912 22812700.0 3.370020e+09
... ... ... ... ... ... ... ... ... ... ... ... ...
2019-11-22 8.890000 3110.290039 8.890000 3110.290039 8.900000 3112.870117 8.770000 3099.260010 8.800000 3111.409912 34966700.0 3.226780e+09
2019-11-25 9.000000 3133.639893 9.000000 3133.639893 9.010000 3133.830078 8.870000 3117.439941 8.900000 3117.439941 30580900.0 3.511530e+09
2019-11-26 9.010000 3140.520020 9.010000 3140.520020 9.020000 3142.689941 8.910000 3131.000000 8.980000 3134.850098 30093800.0 4.595590e+09
2019-11-27 9.100000 3153.629883 9.100000 3153.629883 9.150000 3154.260010 9.020000 3143.409912 9.030000 3145.489990 37396100.0 3.033090e+09
2019-11-29 9.060000 3140.979980 9.060000 3140.979980 9.100000 3150.300049 9.030000 3139.340088 9.040000 3147.179932 13096200.0 1.743020e+11
1210 rows × 12 columns
To find the data in df I am using pivot_table:
df.pivot_table(values = 'Adj Close', index = 'Date', columns = 'Symbols')
but I am getting an error:
KeyError: 'Symbols'
Why am I getting this error?

It seems you already have a MultiIndex with what you need; you don't have to pivot.
>>> df['Adj Close'].head()
Symbols F ^GSPC
Date
2016-12-01 10.297861 2191.080078
2016-12-02 10.140451 2191.949951
2016-12-05 10.306145 2204.709961
2016-12-06 10.405562 2212.229980
2016-12-07 10.819797 2241.350098
>>>
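The KeyError itself happens because pivot_table looks for 'Symbols' among the frame's columns, but here 'Symbols' is a level of the column MultiIndex, not a column. If you do want to go through pivot_table, you would first have to move that level into ordinary columns. A minimal sketch, assuming the same two-level columns as above:
# stack the 'Symbols' column level into the row index, then flatten to plain columns
long_df = df.stack(level='Symbols').reset_index()
# 'Date' and 'Symbols' are now ordinary columns, so pivot_table can find them
pivoted = long_df.pivot_table(values='Adj Close', index='Date', columns='Symbols')
This just reproduces df['Adj Close'], so the direct selection above is the simpler route.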

Related

pandas .drop(columns=[]) is returning KeyError when columns are in the csv and dataframe

I'm trying to import market data from a csv to run some backtests.
I wrote the following code:
import pandas as pd
import numpy as np
df = pd.read_csv("30mindata.csv")
df = df.drop(columns=['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'])
print(df)
I'm getting the error:
KeyError: "['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'] not found in axis"
When I remove the line of code containing drop() the dataframe prints as follows:
Date Time Open High Low Last Volume NumberOfTrades BidVolume AskVolume
0 2018/2/18 14:00:00 2734.50 2741.00 2734.00 2739.75 5304 2787 2299 3005
1 2018/2/18 14:30:00 2739.75 2741.00 2739.25 2740.25 1402 815 648 754
2 2018/2/18 15:00:00 2740.25 2743.50 2739.25 2742.00 4536 2301 2074 2462
3 2018/2/18 15:30:00 2742.25 2744.75 2742.25 2744.00 4102 1826 1949 2153
4 2018/2/18 16:00:00 2744.00 2744.25 2742.25 2742.25 2492 1113 1551 941
... ... ... ... ... ... ... ... ... ... ...
59074 2023/2/17 10:30:00 4076.25 4088.00 4076.00 4086.50 92507 54379 44917 47590
59075 2023/2/17 11:00:00 4086.50 4090.50 4079.25 4081.00 107233 67968 55784 51449
59076 2023/2/17 11:30:00 4081.00 4090.50 4079.50 4088.25 171507 92705 86022 85485
59077 2023/2/17 12:00:00 4088.00 4089.00 4085.25 4086.00 41032 17210 21176 19856
59078 2023/2/17 12:30:00 4086.25 4088.00 4085.25 4085.75 5164 2922 2818 2346
I have another file that uses this exact form of pd.read_csv() and then df.drop(columns=[]), and it works just fine. I tried df.loc[:, 'Volume'] and got the same KeyError saying 'Volume' was not found in the axis. I really don't understand how the labels aren't in the dataframe when they print correctly without the .drop() call.
It's very likely that you have leading or trailing spaces in your column names.
Try removing those spaces like this:
import pandas as pd
df = pd.read_csv("30mindata.csv")
df.columns = [col.strip() for col in df.columns]
Then try to drop the columns as before
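Equivalently, the string accessor on the columns Index does the same thing in one line, and if the stray spaces only appear right after the delimiter, read_csv's skipinitialspace=True option avoids them at parse time. A small sketch combining both, assuming the same file:
import pandas as pd
df = pd.read_csv("30mindata.csv", skipinitialspace=True)  # drop spaces right after each delimiter
df.columns = df.columns.str.strip()  # strip any remaining whitespace from the column names
df = df.drop(columns=['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'])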

Join in Pandas dataframe (Auto conversion from date to date-time?)

I tried to use the join function to combine the close prices of all 500 stocks over a 5-year period (2013-02-08 to 2018-02-07), where each column represents a stock and the index of the dataframe is the dates.
But the join function in pandas seems to automatically change the date format (index), rendering every entry in the combined dataframe NaN.
The code to import and preview the file:
import pandas as pd
df= pd.read_csv('all_stocks_5yr.csv')
df.head()
(https://i.stack.imgur.com/29Wq4.png)
# df.info()
df['Name'].unique().shape #There are 505 stock names in total
dates = pd.date_range(df['date'].min(), df['date'].max())  # check the date range
Single out the close prices:
close_prices = pd.DataFrame(index=dates)  # make the dates the index
# close_prices.head()
symbols = df['Name'].unique()  # denote the stock names as an array
So I tried to test the result for each stock using the first 3 stocks:
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    print(df_tmp)  # print the temporary dataframes
    i += 1
    if i > 3: break
And the results were as expected: a dataframe indexed by date with only one stock:
AAL
date
2013-02-08 14.75
2013-02-11 14.46
2013-02-12 14.27
2013-02-13 14.66
2013-02-14 13.99
... ...
2018-02-01 53.88
2018-02-02 52.10
2018-02-05 49.76
2018-02-06 51.18
2018-02-07 51.40
[1259 rows x 1 columns]
AAPL
date
2013-02-08 67.8542
2013-02-11 68.5614
2013-02-12 66.8428
2013-02-13 66.7156
2013-02-14 66.6556
... ...
2018-02-01 167.7800
2018-02-02 160.5000
2018-02-05 156.4900
2018-02-06 163.0300
2018-02-07 159.5400
[1259 rows x 1 columns]
...
Now here's the part I find very confusing: I checked what happens when combining the first 3 stock dataframes using the join function, with index 'date':
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    close_prices = close_prices.join(df_tmp)
    i += 1
    if i > 3: break
close_prices.head()
(https://i.stack.imgur.com/MqVDo.png)
Somehow the index changed from date to datetime format, so the join function naturally finds that none of the entries match the new index and puts NaN in every single entry.
What caused the date to change to datetime?
You can use pivot:
df = pd.read_csv('all_stocks_5yr.csv')
out = df.pivot(index='date', columns='Name', values='close')
Output:
>>> out.iloc[:, :8]
Name A AAL AAP AAPL ABBV ABC ABT ACN
date
2013-02-08 45.08 14.75 78.90 67.8542 36.25 46.89 34.41 73.31
2013-02-11 44.60 14.46 78.39 68.5614 35.85 46.76 34.26 73.07
2013-02-12 44.62 14.27 78.60 66.8428 35.42 46.96 34.30 73.37
2013-02-13 44.75 14.66 78.97 66.7156 35.27 46.64 34.46 73.56
2013-02-14 44.58 13.99 78.84 66.6556 36.57 46.77 34.70 73.13
... ... ... ... ... ... ... ... ...
2018-02-01 72.83 53.88 117.29 167.7800 116.34 99.29 62.18 160.46
2018-02-02 71.25 52.10 113.93 160.5000 115.17 96.02 61.69 156.90
2018-02-05 68.22 49.76 109.86 156.4900 109.51 91.90 58.73 151.83
2018-02-06 68.45 51.18 112.20 163.0300 111.20 91.54 58.86 154.69
2018-02-07 68.06 51.40 109.93 159.5400 113.62 94.22 58.67 155.15
[1259 rows x 8 columns]
Source 'all_stocks_5yr.csv': Kaggle
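For what it's worth, the NaNs in the original join very likely come from an index dtype mismatch rather than a real conversion: close_prices was built on a DatetimeIndex (from pd.date_range), while each df_tmp is indexed by the raw date strings from the CSV, so nothing aligns. A minimal sketch of a fix, assuming the 'date' column parses cleanly:
import pandas as pd
# parse dates up front so every derived index is datetime64
df = pd.read_csv('all_stocks_5yr.csv', parse_dates=['date'])
dates = pd.date_range(df['date'].min(), df['date'].max())
close_prices = pd.DataFrame(index=dates)
# df_sym['date'] is now datetime64 too, so each join aligns instead of producing NaN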

Python Django: Merge Dataframe performing Sum on overlapping columns

I want to merge two DataFrames with exactly the same column names, where the overlapping columns are added together. I'm having a bit of trouble because the grouping should happen on the index, called "Date", but I can't access this index by using the 'Date' name.
Actually, I just need the index (Date) and the sum of all the stocks' 'Adj Close'.
I tried:
data.join(temp, how='outer')
Returns: "ValueError: columns overlap but no suffix specified: Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')"
data = pd.concat([data, temp]).groupby([data.index, temp.index], as_index=True).sum(axis=1)
Returns: "Grouper and axis must be same length
data = pd.merge(data, temp, left_index=True, right_index=True)['Adj Close'].sum(axis=1, skipna=True).astype(np.int64)
Returns: "KeyError: 'Adj Close'"
Code
def overview(request):
    stocks = Stock.objects.all()
    data = None
    for stock in stocks:
        if data is None:
            data = yf.download(stock.ticker, start=stock.trade_date, period="ytd")
        else:
            temp = yf.download(stock.ticker, start=stock.trade_date, period="ytd")
            data.join(temp, how='outer')
DataFrame Output 1
[*********************100%***********************] 1 of 1 completed
Open High ... Adj Close Volume
Date ...
2019-09-19 55.502499 55.939999 ... 54.697304 88242400
2019-09-20 55.345001 55.639999 ... 53.897728 221652400
2019-09-23 54.737499 54.959999 ... 54.142803 76662000
2019-09-24 55.257500 55.622501 ... 53.885353 124763200
2019-09-25 54.637501 55.375000 ... 54.714626 87613600
... ... ... ... ... ...
2020-09-10 120.360001 120.500000 ... 113.489998 182274400
2020-09-11 114.570000 115.230003 ... 112.000000 180860300
2020-09-14 114.720001 115.930000 ... 115.360001 140150100
2020-09-15 118.330002 118.830002 ... 115.540001 184642000
2020-09-16 115.230003 116.000000 ... 112.129997 154679000
[251 rows x 6 columns]
Dataframe Output 2
[*********************100%***********************] 1 of 1 completed
Open High ... Adj Close Volume
Date ...
2020-09-03 1699.520020 1700.000000 ... 1629.510010 3186300
2020-09-04 1609.000000 1634.989990 ... 1581.209961 2792500
2020-09-08 1525.000000 1555.550049 ... 1523.599976 2701600
2020-09-09 1548.900024 1558.719971 ... 1547.229980 1962100
2020-09-10 1550.180054 1573.660034 ... 1526.050049 1651200
2020-09-11 1528.150024 1538.699951 ... 1515.760010 1535300
2020-09-14 1531.650024 1557.000000 ... 1508.829956 2133000
2020-09-15 1527.890015 1550.989990 ... 1535.119995 1152100
2020-09-16 1542.479980 1554.369995 ... 1512.089966 1106400
Let's say you have two DataFrames like this:
df1 = pd.DataFrame({'Adj Close':[1, 2]}, index=['2019-09-19','2019-09-20'])
df2 = pd.DataFrame({'Adj Close':[3, 4, 5]}, index=['2019-09-19','2019-09-20','2019-09-21'])
df1
Adj Close
2019-09-19 1
2019-09-20 2
df2
Adj Close
2019-09-19 3
2019-09-20 4
2019-09-21 5
Then you can concat into one df:
df = pd.concat([df1, df2])
Adj Close
2019-09-19 1
2019-09-20 2
2019-09-19 3
2019-09-20 4
2019-09-21 5
and then group by the index and sum:
result = df.groupby(df.index).sum()
Adj Close
2019-09-19 4
2019-09-20 6
2019-09-21 5
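Applied to the download loop in the question, the same pattern would look roughly like this; a sketch, assuming yf is yfinance as in the question, stocks comes from Stock.objects.all(), and each download returns a frame indexed by Date as in the outputs above:
import pandas as pd
import yfinance as yf
frames = [yf.download(stock.ticker, start=stock.trade_date, period="ytd")
          for stock in stocks]
data = pd.concat(frames)
# one value per date: the summed 'Adj Close' across all stocks
adj_close_total = data.groupby(data.index)['Adj Close'].sum()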

Pandas read_csv with different date parsers

I have a CSV file with time series data: the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format %H:%M:%S. I would like to import this CSV file into a MultiIndex DataFrame or Panel object.
With this code, it already works:
_file_data = pd.read_csv(_file,
                         sep=",",
                         header=0,
                         index_col=['Date', 'Time'],
                         thousands="'",
                         parse_dates=True,
                         skipinitialspace=True)
It returns the data in the following format:
Date Time Volume
2016-01-04 2018-04-25 09:01:29 53645
2018-04-25 10:01:29 123
2018-04-25 10:01:29 1345
....
2016-01-05 2018-04-25 10:01:29 123
2018-04-25 12:01:29 213
2018-04-25 10:01:29 123
1st question:
I would like to show the second index as a pure time object, not datetime. To do that, I have to declare two different date parsers in the read_csv function, but I can't figure out how. What is the "best" way to do that?
2nd question:
After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?
1st question:
You can create multiple converters and define the parsers in a dictionary:
import pandas as pd
temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""
def converter1(x):
    # convert to datetime and then to time
    return pd.to_datetime(x).time()

def converter2(x):
    # define the format of the datetime
    return pd.to_datetime(x, format='%Y:%m:%d')
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp),
                 index_col=['Date', 'Time'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter1, 'Date': converter2})
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
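One subtlety worth noting: values produced by converters are stored as plain Python objects, so the Date level here will typically end up with object dtype rather than datetime64; the parse_dates variant below avoids that when the format is standard.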
Sometimes it is possible to use the built-in parser, e.g. if the date format is YYYY-MM-DD:
import pandas as pd
temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""
def converter(x):
    # convert to datetime and then to time
    return pd.to_datetime(x).time()
#after testing replace 'pd.compat.StringIO(temp)' to 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp),
                 index_col=['Date', 'Time'],
                 parse_dates=['Date'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter})
print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05'],
dtype='datetime64[ns]', name='Date', freq=None)
The last possible solution is to convert the datetimes to times in the MultiIndex with set_levels, after processing:
df.index = df.index.set_levels(df.index.get_level_values(1).time, level=1)
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
2nd question:
Panel is deprecated in pandas 0.20+ and will be removed in a future version; a MultiIndex DataFrame like the one above is the usual replacement.
To represent the intraday times, you can also use pd.to_timedelta.
Ex:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print(df["Time"])
Output:
0 09:01:29
1 10:01:29
2 10:01:29
Name: Time, dtype: timedelta64[ns]
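One practical upside of the timedelta route: the values stay numeric, so time arithmetic works directly, which plain datetime.time objects don't support. A tiny sketch, reusing df from above:
# shift every intraday time forward by one hour
df["Shifted"] = df["Time"] + pd.Timedelta(hours=1)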

Seaborn timeseries plot with multiple series

I'm trying to make a time series plot with seaborn from a dataframe that has multiple series.
From this post:
seaborn time series from pandas dataframe
I gather that tsplot isn't going to work as it is meant to plot uncertainty.
So is there another Seaborn method that is meant for line charts with multiple series?
My dataframe looks like this:
print(df.info())
print(df.describe())
print(df.values)
print(df.index)
output:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 253 entries, 2013-01-03 to 2014-01-03
Data columns (total 5 columns):
Equity(24 [AAPL]) 253 non-null float64
Equity(3766 [IBM]) 253 non-null float64
Equity(5061 [MSFT]) 253 non-null float64
Equity(6683 [SBUX]) 253 non-null float64
Equity(8554 [SPY]) 253 non-null float64
dtypes: float64(5)
memory usage: 11.9 KB
None
Equity(24 [AAPL]) Equity(3766 [IBM]) Equity(5061 [MSFT]) \
count 253.000000 253.000000 253.000000
mean 67.560593 194.075383 32.547436
std 6.435356 11.175226 3.457613
min 55.811000 172.820000 26.480000
25% 62.538000 184.690000 28.680000
50% 65.877000 193.880000 33.030000
75% 72.299000 203.490000 34.990000
max 81.463000 215.780000 38.970000
Equity(6683 [SBUX]) Equity(8554 [SPY])
count 253.000000 253.000000
mean 33.773277 164.690180
std 4.597291 10.038221
min 26.610000 145.540000
25% 29.085000 156.130000
50% 33.650000 165.310000
75% 38.280000 170.310000
max 40.995000 184.560000
[[ 77.484 195.24 27.28 27.685 145.77 ]
[ 75.289 193.989 26.76 27.85 146.38 ]
[ 74.854 193.2 26.71 27.875 145.965]
...,
[ 80.167 187.51 37.43 39.195 184.56 ]
[ 79.034 185.52 37.145 38.595 182.95 ]
[ 77.284 186.66 36.92 38.475 182.8 ]]
DatetimeIndex(['2013-01-03', '2013-01-04', '2013-01-07', '2013-01-08',
'2013-01-09', '2013-01-10', '2013-01-11', '2013-01-14',
'2013-01-15', '2013-01-16',
...
'2013-12-19', '2013-12-20', '2013-12-23', '2013-12-24',
'2013-12-26', '2013-12-27', '2013-12-30', '2013-12-31',
'2014-01-02', '2014-01-03'],
dtype='datetime64[ns]', length=253, freq=None, tz='UTC')
This works (but I want to get my hands dirty with Seaborn):
df.plot()
Output:
Thank you for your time!
Update1:
df.to_dict() returned:
https://gist.github.com/anonymous/2bdc1ce0f9d0b6ccd6675ab4f7313a5f
Update2:
Using @knagaev's sample code, I've narrowed it down to this difference:
current dataframe (output of print(current_df)):
Equity(24 [AAPL]) Equity(3766 [IBM]) \
2013-01-03 00:00:00+00:00 77.484 195.2400
2013-01-04 00:00:00+00:00 75.289 193.9890
2013-01-07 00:00:00+00:00 74.854 193.2000
2013-01-08 00:00:00+00:00 75.029 192.8200
2013-01-09 00:00:00+00:00 73.873 192.3800
desired dataframe (output of print(desired_df)):
Date Company Kind Price
0 2014-01-02 IBM Open 187.210007
1 2014-01-02 IBM High 187.399994
2 2014-01-02 IBM Low 185.199997
3 2014-01-02 IBM Close 185.529999
4 2014-01-02 IBM Volume 4546500.000000
5 2014-01-02 IBM Adj Close 171.971090
6 2014-01-02 MSFT Open 37.349998
7 2014-01-02 MSFT High 37.400002
8 2014-01-02 MSFT Low 37.099998
9 2014-01-02 MSFT Close 37.160000
10 2014-01-02 MSFT Volume 30632200.000000
11 2014-01-02 MSFT Adj Close 34.960000
12 2014-01-02 ORCL Open 37.779999
13 2014-01-02 ORCL High 38.029999
14 2014-01-02 ORCL Low 37.549999
15 2014-01-02 ORCL Close 37.840000
16 2014-01-02 ORCL Volume 18162100.000000
What's the best way to reorganize the current_df to desired_df?
Update 3:
I finally got it working with the help of @knagaev:
I had to add a dummy column as well as finesse the index:
df['Datetime'] = df.index
melted_df = pd.melt(df, id_vars='Datetime', var_name='Security', value_name='Price')
melted_df['Dummy'] = 0
sns.tsplot(melted_df, time='Datetime', unit='Dummy', condition='Security', value='Price', ax=ax)
to produce:
You can try to get your hands dirty with tsplot.
You will draw your line charts with standard errors ("statistical additions").
I tried to simulate your dataset, so here are the results:
import pandas as pd
import pandas.io.data as web
from datetime import datetime
import seaborn as sns
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
start = datetime(2014,1,1)
end = datetime(2014,3,28)
f = web.DataReader(stocks, 'yahoo',start,end)
df = pd.DataFrame(f.to_frame().stack()).reset_index()
df.columns = ['Date', 'Company', 'Kind', 'Price']
sns.tsplot(df, time='Date', unit='Kind', condition='Company', value='Price')
By the way, this sample only imitates your dataset. The parameter "unit" is the "Field in the data DataFrame identifying the sampling unit (e.g. subject, neuron, etc.). The error representation will collapse over units at each time/condition observation." (from the documentation). So I used the 'Kind' field for illustrative purposes.
Ok, I made an example for your dataframe.
It has a dummy field for "noise cleaning" :)
import pandas as pd
import pandas.io.data as web
from datetime import datetime
import seaborn as sns
stocks = ['ORCL', 'TSLA', 'IBM','YELP', 'MSFT']
start = datetime(2010,1,1)
end = datetime(2015,12,31)
f = web.DataReader(stocks, 'yahoo',start,end)
df = pd.DataFrame(f.to_frame().stack()).reset_index()
df.columns = ['Date', 'Company', 'Kind', 'Price']
df_open = df[df['Kind'] == 'Open'].copy()
df_open['Dummy'] = 0
sns.tsplot(df_open, time='Date', unit='Dummy', condition='Company', value='Price')
P.S. Thanks to @VanPeer - now you can use seaborn.lineplot for this problem.
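A minimal sketch of the lineplot route, assuming the long-format melted_df built in Update 3 (columns Datetime, Security, Price):
import seaborn as sns
import matplotlib.pyplot as plt
# one line per security, no error bands needed
ax = sns.lineplot(data=melted_df, x='Datetime', y='Price', hue='Security')
plt.show()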
