pandas .drop(columns=[]) is returning KeyError when columns are in the csv and dataframe - python

I'm trying to import market data from a csv to run some backtests.
I wrote the following code:
import pandas as pd
import numpy as np
df = pd.read_csv("30mindata.csv")
df = df.drop(columns=['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'])
print(df)
I'm getting the error:
KeyError: "['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'] not found in axis"
When I remove the line of code containing drop() the dataframe prints as follows:
Date Time Open High Low Last Volume NumberOfTrades BidVolume AskVolume
0 2018/2/18 14:00:00 2734.50 2741.00 2734.00 2739.75 5304 2787 2299 3005
1 2018/2/18 14:30:00 2739.75 2741.00 2739.25 2740.25 1402 815 648 754
2 2018/2/18 15:00:00 2740.25 2743.50 2739.25 2742.00 4536 2301 2074 2462
3 2018/2/18 15:30:00 2742.25 2744.75 2742.25 2744.00 4102 1826 1949 2153
4 2018/2/18 16:00:00 2744.00 2744.25 2742.25 2742.25 2492 1113 1551 941
... ... ... ... ... ... ... ... ... ... ...
59074 2023/2/17 10:30:00 4076.25 4088.00 4076.00 4086.50 92507 54379 44917 47590
59075 2023/2/17 11:00:00 4086.50 4090.50 4079.25 4081.00 107233 67968 55784 51449
59076 2023/2/17 11:30:00 4081.00 4090.50 4079.50 4088.25 171507 92705 86022 85485
59077 2023/2/17 12:00:00 4088.00 4089.00 4085.25 4086.00 41032 17210 21176 19856
59078 2023/2/17 12:30:00 4086.25 4088.00 4085.25 4085.75 5164 2922 2818 2346
I have another file that uses this exact pattern of pd.read_csv() followed by df.drop(columns=[]), and it works just fine. I tried df.loc[:, 'Volume'] and got the same KeyError saying 'Volume' was not found in the axis. I really don't understand how the labels can be missing from the dataframe when they print correctly without the .drop() call.

It's very likely that your column names contain stray spaces.
Try removing those spaces like this:
import pandas as pd
df = pd.read_csv("30mindata.csv")
df.columns = [col.strip() for col in df.columns]
Then try to drop the columns as before
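If stray whitespace is the culprit, here are a couple of equivalent, slightly more idiomatic options (both are standard pandas calls, but verify against your actual file):
import pandas as pd
df = pd.read_csv("30mindata.csv")
df.columns = df.columns.str.strip()  # strip leading/trailing spaces from the header labels
# or let read_csv drop the space that follows each delimiter while reading
df = pd.read_csv("30mindata.csv", skipinitialspace=True)
df = df.drop(columns=['Volume', 'NumberOfTrades', 'BidVolume', 'AskVolume'])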

Related

Join in Pandas dataframe (Auto conversion from date to date-time?)

I tried to use the join function to combine the close prices of all 500 stocks over a 5-year period (2013-02-08 to 2018-02-07), where each column represents a stock and the index of the dataframe is the dates.
But the join function in pandas seems to automatically change the date format (index), rendering every entry in the combined dataframe NaN.
The code to import and preview the file:
import pandas as pd
df = pd.read_csv('all_stocks_5yr.csv')
df.head()
(screenshot of df.head() output: https://i.stack.imgur.com/29Wq4.png)
# df.info()
df['Name'].unique().shape  # there are 505 stock names in total
dates = pd.date_range(df['date'].min(), df['date'].max())  # the full date range, used as the index below
Single out the close prices:
close_prices = pd.DataFrame(index=dates)  # make the dates the index
# close_prices.head()
symbols = df['Name'].unique()  # denote the stock names as an array
So I tried to test the result for each stock using the first 3 stocks:
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    print(df_tmp)  # print the temporary dataframes
    i += 1
    if i > 3: break
And the results were as expected: a dataframe indexed by date with a single stock column:
AAL
date
2013-02-08 14.75
2013-02-11 14.46
2013-02-12 14.27
2013-02-13 14.66
2013-02-14 13.99
... ...
2018-02-01 53.88
2018-02-02 52.10
2018-02-05 49.76
2018-02-06 51.18
2018-02-07 51.40
[1259 rows x 1 columns]
AAPL
date
2013-02-08 67.8542
2013-02-11 68.5614
2013-02-12 66.8428
2013-02-13 66.7156
2013-02-14 66.6556
... ...
2018-02-01 167.7800
2018-02-02 160.5000
2018-02-05 156.4900
2018-02-06 163.0300
2018-02-07 159.5400
[1259 rows x 1 columns]
...
Now here's the part I find very confusing: I checked what happens when combining the first 3 stock dataframes using the join function, with index 'date':
i = 1
for symbol in symbols:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(), index=df_sym['date'], columns=[symbol])
    close_prices = close_prices.join(df_tmp)
    i += 1
    if i > 3: break
close_prices.head()
(screenshot of close_prices.head() output: https://i.stack.imgur.com/MqVDo.png)
Somehow the index changed from date to date-time format, so naturally the join function finds that none of the entries match the new index and puts NaN in every single entry.
What caused the date to have changed to date-time?
You can use pivot:
df = pd.read_csv('all_stocks_5yr.csv')
out = df.pivot(index='date', columns='Name', values='close')
Output:
>>> out.iloc[:, :8]
Name A AAL AAP AAPL ABBV ABC ABT ACN
date
2013-02-08 45.08 14.75 78.90 67.8542 36.25 46.89 34.41 73.31
2013-02-11 44.60 14.46 78.39 68.5614 35.85 46.76 34.26 73.07
2013-02-12 44.62 14.27 78.60 66.8428 35.42 46.96 34.30 73.37
2013-02-13 44.75 14.66 78.97 66.7156 35.27 46.64 34.46 73.56
2013-02-14 44.58 13.99 78.84 66.6556 36.57 46.77 34.70 73.13
... ... ... ... ... ... ... ... ...
2018-02-01 72.83 53.88 117.29 167.7800 116.34 99.29 62.18 160.46
2018-02-02 71.25 52.10 113.93 160.5000 115.17 96.02 61.69 156.90
2018-02-05 68.22 49.76 109.86 156.4900 109.51 91.90 58.73 151.83
2018-02-06 68.45 51.18 112.20 163.0300 111.20 91.54 58.86 154.69
2018-02-07 68.06 51.40 109.93 159.5400 113.62 94.22 58.67 155.15
[1259 rows x 8 columns]
Source 'all_stocks_5yr.csv': Kaggle
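As for why the original join produced NaN everywhere: close_prices was built with a DatetimeIndex (from pd.date_range), while each df_tmp is indexed by the raw date strings from the CSV, so no labels line up and join fills every cell with NaN. A minimal sketch of one way to make the join itself work, assuming the CSV's date column parses cleanly with pd.to_datetime:
import pandas as pd
df = pd.read_csv('all_stocks_5yr.csv')
close_prices = pd.DataFrame(index=pd.to_datetime(df['date'].unique()))
for symbol in df['Name'].unique()[:3]:
    df_sym = df[df['Name'] == symbol]
    df_tmp = pd.DataFrame(data=df_sym['close'].to_numpy(),
                          index=pd.to_datetime(df_sym['date']),  # align index types with close_prices
                          columns=[symbol])
    close_prices = close_prices.join(df_tmp)
print(close_prices.head())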

Not able to use a key from a merged dataframe

I've got two dataframes that both have a date column and an emaX column. When I merge them I get the expected result of a single date column and two emaX columns. But when I try to access the date key from the merged dataframe, it returns KeyError: date.
This is the function that returns the emaX (I have two, but they're nearly identical):
def av_get_ema_20():
    ti = TechIndicators(key=TOKEN, output_format="pandas")
    emaData20, meta_ema = ti.get_ema(symbol=SYMBOL, interval=INTERVAL, time_period=20, series_type=EMA_TYPE)
    ema20renamed = pd.DataFrame(emaData20)
    ema20renamed.rename(columns={'EMA': 'ema20'}, inplace=True)
    return ema20renamed
Then I merge the two returned dataframes:
mergedDF = pd.merge(av_get_ema_10(), av_get_ema_20(), on=["date"], how="inner")
# TEST LINE
print(mergedDF)
The dataframe that is printed out appears as I expected it to be:
ema10 ema20
date
2020-01-02 11:30:00 3226.5200 NaN
2020-01-02 12:30:00 3229.0927 NaN
2020-01-02 13:30:00 3232.0558 NaN
2020-01-02 14:30:00 3235.0839 NaN
2020-01-02 15:30:00 3239.1668 NaN
... ... ...
2020-03-26 11:30:00 2524.9545 2473.8551
2020-03-26 12:30:00 2533.1755 2483.0279
2020-03-26 13:30:00 2541.2982 2492.0586
2020-03-26 14:30:00 2551.0458 2501.8540
2020-03-26 15:30:00 2565.2866 2513.9983
But then when I attempt to use the merged dataframe (for example, iterating through it), I get KeyError: date:
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])
Am I misinterpreting the dataframe in some way or is there something else I am supposed to do prior to using the merged set (including the date)? I'm at a loss here.
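The printed output suggests that date is the index of mergedDF rather than a column, which would explain the KeyError. A small sketch of two ways around it, assuming date really is the index:
# option 1: iterrows() already yields the index label as the first loop variable
for date, row in mergedDF.iterrows():
    print(date, row["ema10"], row["ema20"])
# option 2: turn the index back into an ordinary column first
mergedDF = mergedDF.reset_index()
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])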

KeyError: 'Symbols' when using a pivot table

I am trying to look up data in a pandas dataframe:
import pandas as pd
import numpy as np
from statsmodels import api as sm
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016,12,2)
end = datetime.datetime.today()
df = web.get_data_yahoo(['F', '^GSPC'], start, end)
If I unstack the data here:
df.unstack()
I get the following:
Attributes Symbols Date
Adj Close F 2016-12-01 1.011866e+01
2016-12-02 9.963994e+00
2016-12-05 1.012680e+01
2016-12-06 1.022449e+01
2016-12-07 1.063152e+01
...
Volume ^GSPC 2019-11-22 3.226780e+09
2019-11-25 3.511530e+09
2019-11-26 4.595590e+09
2019-11-27 3.033090e+09
2019-11-29 1.743020e+11
Length: 9048, dtype: float64
df has the following data:
Attributes Adj Close Close High Low Open Volume
Symbols F ^GSPC F ^GSPC F ^GSPC F ^GSPC F ^GSPC F ^GSPC
Date
2015-02-11 12.216836 2068.530029 16.250000 2068.530029 16.309999 2073.479980 16.010000 2057.989990 16.080000 2068.550049 34285300.0 3.596860e+09
2015-02-12 12.299535 2088.479980 16.360001 2088.479980 16.450001 2088.530029 16.299999 2069.979980 16.340000 2069.979980 23738800.0 3.788350e+09
2015-02-13 12.254424 2096.989990 16.299999 2096.989990 16.360001 2097.030029 16.190001 2086.699951 16.330000 2088.780029 19954600.0 3.527450e+09
2015-02-17 12.111583 2100.340088 16.110001 2100.340088 16.299999 2101.300049 16.000000 2089.800049 16.209999 2096.469971 44362300.0 3.361750e+09
2015-02-18 12.186762 2099.679932 16.209999 2099.679932 16.330000 2100.229980 16.059999 2092.149902 16.160000 2099.159912 22812700.0 3.370020e+09
... ... ... ... ... ... ... ... ... ... ... ... ...
2019-11-22 8.890000 3110.290039 8.890000 3110.290039 8.900000 3112.870117 8.770000 3099.260010 8.800000 3111.409912 34966700.0 3.226780e+09
2019-11-25 9.000000 3133.639893 9.000000 3133.639893 9.010000 3133.830078 8.870000 3117.439941 8.900000 3117.439941 30580900.0 3.511530e+09
2019-11-26 9.010000 3140.520020 9.010000 3140.520020 9.020000 3142.689941 8.910000 3131.000000 8.980000 3134.850098 30093800.0 4.595590e+09
2019-11-27 9.100000 3153.629883 9.100000 3153.629883 9.150000 3154.260010 9.020000 3143.409912 9.030000 3145.489990 37396100.0 3.033090e+09
2019-11-29 9.060000 3140.979980 9.060000 3140.979980 9.100000 3150.300049 9.030000 3139.340088 9.040000 3147.179932 13096200.0 1.743020e+11
1210 rows × 12 columns
To find the data in df I am using a pivot_table:
df.pivot_table(values = 'Adj Close', index = 'Date', columns = 'Symbols')
but I am getting an error:
KeyError: 'Symbols'
Why am I getting this error?
It seems you already have a MultiIndex with what you need; you don't have to pivot.
>>> df['Adj Close'].head()
Symbols F ^GSPC
Date
2016-12-01 10.297861 2191.080078
2016-12-02 10.140451 2191.949951
2016-12-05 10.306145 2204.709961
2016-12-06 10.405562 2212.229980
2016-12-07 10.819797 2241.350098
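If you specifically want the pivot_table call to work, 'Symbols' has to exist as a regular column (or index level) first. A hedged sketch, assuming df has the column MultiIndex shown above with level names 'Attributes' and 'Symbols':
# move the 'Symbols' column level into the index, flatten, then pivot
long_df = df.stack(level='Symbols').reset_index()
adj_close = long_df.pivot_table(values='Adj Close', index='Date', columns='Symbols')
print(adj_close.head())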

Pandas read_csv with different date parsers

I have a csv-file with time series data, the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format '%H:%M:%S'. I would like to import this csv-file into a multiindex dataframe or panel object.
With this code, it already works:
_file_data = pd.read_csv(_file,
                         sep=",",
                         header=0,
                         index_col=['Date', 'Time'],
                         thousands="'",
                         parse_dates=True,
                         skipinitialspace=True
                         )
It returns the data in the following format:
Date Time Volume
2016-01-04 2018-04-25 09:01:29 53645
2018-04-25 10:01:29 123
2018-04-25 10:01:29 1345
....
2016-01-05 2018-04-25 10:01:29 123
2018-04-25 12:01:29 213
2018-04-25 10:01:29 123
1st question:
I would like to show the second index as a pure time object, not a datetime. To do that, I would have to declare two different date parsers in the read_csv function, but I can't figure out how. What is the "best" way to do that?
2nd question:
After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?
1st question:
You can create multiple converters and pass them as a dictionary:
import pandas as pd
temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""
def converter1(x):
    # convert to datetime and then to time
    return pd.to_datetime(x).time()
def converter2(x):
    # parse the date with an explicit format
    return pd.to_datetime(x, format='%Y:%m:%d')
# after testing, replace StringIO(temp) with 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp),
                 index_col=['Date', 'Time'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter1, 'Date': converter2})
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
Sometimes it is possible to use the built-in parser, e.g. if the dates are in YYYY-MM-DD format:
import pandas as pd
temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""
def converter(x):
    # convert to datetime and keep only the time component
    return pd.to_datetime(x).time()
# after testing, replace StringIO(temp) with 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp),
                 index_col=['Date', 'Time'],
                 parse_dates=['Date'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter})
print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05'],
dtype='datetime64[ns]', name='Date', freq=None)
A last possible solution is to convert the datetimes to times in the MultiIndex with set_levels, after reading:
df.index = df.index.set_levels(df.index.get_level_values(1).time, level=1)
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:00:00 0
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 213
2nd question:
Panel is deprecated in pandas 0.20+ and will be removed in a future version.
To convert to a time series use pd.to_timedelta.
Ex:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print(df["Time"])
Output:
0 09:01:29
1 10:01:29
2 10:01:29
Name: Time, dtype: timedelta64[ns]
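If you would rather end up with datetime.time objects (as in the converter approach above) than timedeltas, .dt.time is an alternative; a small sketch:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_datetime(df["Time"]).dt.time  # plain datetime.time values; the dtype becomes object
print(df["Time"])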

Conditional test within pandas dataframe

Can someone help me with a pandas question? I have a timeseries dataframe such as this:
GOOG AAPL
2010-12-09 16:00:00 591.50 551
2010-12-10 16:00:00 592.21 523
2010-12-13 16:00:00 594.62 578
2010-12-14 16:00:00 594.91 567
2010-12-15 16:00:00 590.30 577
...
I need to loop through each timestamp and test whether AAPL is > 570. If it is, then I want to print the date and the price of AAPL for that entry. Is this possible?
There's no need for any looping; one of the main benefits of pandas being built on numpy is that it can easily operate on whole columns. It's as simple as:
df['AAPL'][df['AAPL'] > 570]
Output:
2010-12-13 16:00:00 578
2010-12-15 16:00:00 577
Name: AAPL, dtype: int64
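If you do want to print each matching date and price explicitly, a small sketch of iterating over that filtered Series (items() pairs each index label with its value):
hits = df['AAPL'][df['AAPL'] > 570]
for date, price in hits.items():
    print(date, price)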
Ah ha I got it:
What you can do since it is built on top of numpy is this:
my_dataframe[my_dataframe.AAPL > 570]
and you're almost done.
From here you have all the rows that correspond to AAPL > 570, now it's just printing out the values you need:
valid_rows = my_dataframe[my_dataframe.AAPL > 570]
for row in valid_rows.to_records():
    print(row[1], row[2])
DataFrame.where can be used for searching the entire frame.
I had forgotten that pandas made it extremely easy to reference columns.
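For completeness, a small sketch of the where approach mentioned above; it keeps the original shape and masks non-matching entries with NaN, so dropna() is needed to reduce it to just the matching rows:
masked = df['AAPL'].where(df['AAPL'] > 570)
print(masked.dropna())  # same dates and prices as the boolean-indexing approach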
