The title says it all. The following bit of abbreviated code raises the error below:
df = pd.read_sql(query, conn, parse_dates=["datetime"],
                 index_col="datetime")
df['datetime']
I get:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\admin\.virtualenvs\EnkiForex-ey09TNOL\lib\site-packages\pandas\core\indexes\base.py", line 2656, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'datetime'
Am I misunderstanding what's going on by indexing the datetime col? I can access all the other columns normally though.
An index is not a column. Think of the index as labels for the rows of the DataFrame. index_col='datetime' makes the datetime column (in the query result) the index of df. To access the index, use df.index.
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(d)
time = pd.date_range(end='4/5/2018', periods=2)
df.index = time
df.index
The output is DatetimeIndex(['2018-04-04', '2018-04-05'], dtype='datetime64[ns]', freq='D'). In short, using df.index gets you the information that was in the index_col.
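If you want the datetime values available both as the index and as a regular column, set_index(..., drop=False) keeps the column around. A minimal sketch, with made-up data standing in for the read_sql result:

```python
import pandas as pd

# Hypothetical frame standing in for the result of read_sql
df = pd.DataFrame({'datetime': pd.to_datetime(['2018-04-04', '2018-04-05']),
                   'price': [1.10, 1.12]})

# drop=False keeps 'datetime' as a regular column *and* as the index,
# so both df.index and df['datetime'] work afterwards
df = df.set_index('datetime', drop=False)

print(df.index)        # DatetimeIndex of the two dates
print(df['datetime'])  # still accessible as a column
```

Note that keeping the same label as both index and column can make operations like reset_index ambiguous, so df.index alone is usually the cleaner route.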
I was working with a multi-index dataframe (which I find unbelievably complicated to work with). I flattened the MultiIndex into a single level with this line of code.
df_append.columns = df_append.columns.map('|'.join).str.strip('|')
Now, when I print columns, I get this.
Index(['IDRSSD', 'RCFD3531|TRDG ASSETS-US TREAS SECS IN DOM OFF',
'RCFD3532|TRDG ASSETS-US GOV AGC CORP OBLGS',
'RCFD3533|TRDG ASSETS-SECS ISSD BY ST POL SUB',
'TEXTF660|3RD ITEMIZED AMT FOR OTHR TRDG ASTS',
'Unnamed: 115_level_0|Unnamed: 115_level_1',
'Unnamed: 133_level_0|Unnamed: 133_level_1',
'Unnamed: 139_level_0|Unnamed: 139_level_1',
'Unnamed: 20_level_0|Unnamed: 20_level_1',
'Unnamed: 87_level_0|Unnamed: 87_level_1', 'file', 'schedule_code',
'year', 'qyear'],
dtype='object', length=202)
I am trying to concatenate two columns into one single column, like this.
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
Here is the error that I am seeing.
Traceback (most recent call last):
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'IDRSSD'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-153-92d2e8486595>", line 1, in <module>
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'IDRSSD'
To me, it looks like I have a column named 'IDRSSD' and a column named 'qyear', but Python disagrees. Or, perhaps I am misinterpreting the error message. Can I get these two columns concatenated into one, or is this impossible to do? Thanks everyone.
I tried the method below. It worked for me.
1) First convert the columns to strings:
df_append['IDRSSD'] = df_append['IDRSSD'].astype(str)
df_append['qyear'] = df_append['qyear'].astype(str)
2) Now join both columns into one, using '-' as the separator:
df_append['period'] = df_append[['IDRSSD', 'qyear']].apply(lambda x: '-'.join(x), axis=1)
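A vectorised alternative that avoids apply altogether is Series.str.cat. A small sketch, with made-up IDRSSD/qyear values:

```python
import pandas as pd

# Small stand-in frame; the real df_append has many more columns
df_append = pd.DataFrame({'IDRSSD': [37, 242], 'qyear': ['2020Q1', '2020Q2']})

# Cast both columns to str, then concatenate element-wise with a separator
df_append['period'] = (df_append['IDRSSD'].astype(str)
                       .str.cat(df_append['qyear'].astype(str), sep='-'))
print(df_append['period'].tolist())  # ['37-2020Q1', '242-2020Q2']
```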
You can use df_append.columns = df_append.columns.to_flat_index() to change the MultiIndex into a one dimensional array of tuples. From there you should be able to manipulate the columns more easily, or at least see what the issue is.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.to_flat_index.html
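A minimal sketch of what to_flat_index gives you, using toy two-level columns modelled loosely on the report data:

```python
import pandas as pd

# Toy two-level columns; the second tuple has an empty second level,
# like the 'IDRSSD' column in the question
cols = pd.MultiIndex.from_tuples([('RCFD3531', 'TRDG ASSETS'),
                                  ('IDRSSD', '')])
df = pd.DataFrame([[1, 2]], columns=cols)

# to_flat_index collapses the MultiIndex into an Index of tuples
flat = df.columns.to_flat_index()
print(list(flat))  # [('RCFD3531', 'TRDG ASSETS'), ('IDRSSD', '')]

# Joining each tuple (skipping empty levels) gives plain string labels
df.columns = ['|'.join(filter(None, t)) for t in flat]
print(df.columns.tolist())  # ['RCFD3531|TRDG ASSETS', 'IDRSSD']
```

Filtering out the empty levels avoids the trailing '|' that .str.strip('|') was cleaning up in the question.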
Use the apply method:
import pandas as pd

def concat(row):
    # 'in' on a row (a Series) checks its index, i.e. the column labels
    if ("col1" in row) and ("col2" in row):
        return str(row['col1']) + "-" + str(row['col2'])

df = pd.DataFrame([["1", "2"], ["1", "2"]], columns=["col1", "col2"])
df['col3'] = df.apply(concat, axis=1)
df
I have a pandas dataframe which contains a column (column name filenames) with filenames. The filenames look something like:
long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...
To filter, I do this (let's say `select_string = "0"`):
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
but I get thrown this:
Traceback (most recent call last):
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "python_file.py", line 118, in <module>
main()
File "inference.py", line 57, in main
_=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
logger=logger, select_string=select_string)
File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
return self._get_value(key)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 0
I think it does not like me chaining the splits, but I vaguely remember doing this some time ago and it worked, so I am perplexed as to why it throws this error.
PS: I do know how to solve this using .contains, but I would like to use this approach of comparing strings.
Any pointers would be great!
Here is another way, with .str.extract():
import pandas as pd
df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg']})
Now, create a boolean mask. The squeeze() method ensures we have a series, so the mask will work:
mask = (df['filename'].str.extract(r'\w+_(\d+)\.jpg')
        .astype(int)
        .eq(0)
        .squeeze())
print(df.loc[mask])
filename
0 long_file1_name_0.jpg
2 long_file3_name_0.jpg
Assuming all the filenames contain .jpg; if they don't, split on just . instead:
select_string = str(0)  # select_string should be of type str
df_fp = df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str) == select_string]
This part:
df_fp["filenames"].str.split(".jpg")[0]
returns the element at index label 0 of the Series (the whole split list for the first row), not the first element of each list.
What you are looking for is the expand parameter (it will create a new column for every element in the list after the split):
df[df['filenames'].str.split('.jpg', expand=True)[0].str.rsplit('_', n=1, expand=True)[1] == '0']
(rsplit with n=1 is used so the trailing underscore-separated piece always lands in column 1, however many underscores the name contains)
Alternatively you could do that via apply:
df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']
but contains is definitely more appropriate here.
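The difference expand makes can be seen on a tiny frame; this is a runnable sketch of the point above, with the rsplit variant included:

```python
import pandas as pd

df = pd.DataFrame({'filenames': ['long_file1_name_0.jpg',
                                 'long_file2_name_1.jpg']})

# Without expand, str.split returns a Series of lists, so [0] selects
# the row at index label 0 (the whole list), not each list's first item
print(df['filenames'].str.split('.jpg')[0])  # ['long_file1_name_0', '']

# With expand=True, each list element becomes its own column
parts = df['filenames'].str.split('.jpg', expand=True)
print(parts[0].tolist())  # ['long_file1_name_0', 'long_file2_name_1']

# rsplit with n=1 isolates the trailing digits directly
suffix = parts[0].str.rsplit('_', n=1, expand=True)[1]
print(suffix.tolist())  # ['0', '1']
```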
Consider the following code:
import numpy as np
import pandas as pd

date_index = np.array(['2019-01-01', '2019-01-02'], dtype=np.datetime64)
df = pd.DataFrame({'a': np.array([1, 2])}, index=date_index)
date_to_lookup = date_index[0]
print(df.at[date_to_lookup, 'a'])
One might expect it to work and print 1. Yet (at least in Anaconda python 3.7.3 with Pandas 0.24.2) it fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../site-packages/pandas/core/indexing.py", line 2270, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
File ".../site-packages/pandas/core/frame.py", line 2771, in _get_value
return engine.get_value(series._values, index)
File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 447, in pandas._libs.index.DatetimeEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 17897
It appears that Pandas DataFrame and Series objects always store dates as dtype 'datetime64[ns]' or 'datetime64[ns, tz]', and the issue arises because Pandas automatically converts 'datetime64[D]' dtype to 'datetime64[ns]' when creating the index, but does not do that when looking up an element in that index. I could avoid the error above by converting the key to 'datetime64[ns]'. E.g. both of the following lines successfully print 1:
print(df.at[pd.to_datetime(date_to_lookup), 'a'])
print(df.at[date_to_lookup.astype('datetime64[ns]'), 'a'])
This behavior (automatic dtype conversion when creating an index, but not when looking up an element) seems counterintuitive to me. What is the reason it was implemented this way? Is there some coding style one is expected to follow to avoid errors like this? Or is it a bug I should file?
You can avoid this by selecting by position with DataFrame.iat, using Index.get_loc to find the position of column 'a':
print(df.iat[0, df.columns.get_loc('a')])
# alternative
# print(df.iloc[0, df.columns.get_loc('a')])
1
Another idea is to use df.index for selecting, instead of date_index[0]:
print(df.at[df.index[0], 'a'])
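Both suggestions restated in self-contained, runnable form (the lookups succeed because neither one passes a raw datetime64[D] key to .at):

```python
import numpy as np
import pandas as pd

date_index = np.array(['2019-01-01', '2019-01-02'], dtype=np.datetime64)
df = pd.DataFrame({'a': np.array([1, 2])}, index=date_index)

# Positional lookup sidesteps the dtype mismatch entirely
print(df.iat[0, df.columns.get_loc('a')])   # 1

# Label lookup via df.index, whose entries are already datetime64[ns]
print(df.at[df.index[0], 'a'])              # 1
```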
I think this is a bug you found in 0.24.2; it works on my system (Python 3.7.2, pandas 0.25.3):
date_index = np.array(['2019-01-01', '2019-01-02'], dtype=np.datetime64)
df = pd.DataFrame({'a': np.array([1, 2])}, index=date_index)
date_to_lookup = date_index[0]
print(df.at[date_to_lookup, 'a'])
1
I'm using pandas to process list data. The list data is:
[(datetime.date(1992, 2, 1), 114535.0), (datetime.date(1992, 3, 1), 120025.0), (datetime.date(1992, 4, 1), 124470.0), (datetime.date(1992, 5, 1), 125822.0), (datetime.date(1992, 6, 1), 122834.0)]
I create labels and use DataFrame.from_records to read the data:
labels = ['date', 'value']
df = pd.DataFrame.from_records(listData, columns=labels)
df = df.set_index('date')
print(df.loc['1992-03-01'])
I get the following error using df.loc:
File "/usr/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 2890, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '1992-03-01'
Create a DatetimeIndex with to_datetime to convert the date objects to timestamps:
labels = ['date', 'value']
df = pd.DataFrame.from_records(listData, columns=labels)
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
print(df.loc['1992-03-01'])
value 120025.0
Name: 1992-03-01 00:00:00, dtype: float64
Or look the values up by date objects:
labels = ['date', 'value']
df = pd.DataFrame.from_records(listData, columns=labels)
df = df.set_index('date')
print(df.loc[datetime.date(1992, 3, 1)])
value 120025.0
Name: 1992-03-01, dtype: float64
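A side benefit of converting to a real DatetimeIndex is partial-string indexing, which plain date objects in an object index do not support. A sketch using the sample data:

```python
import datetime
import pandas as pd

listData = [(datetime.date(1992, 2, 1), 114535.0),
            (datetime.date(1992, 3, 1), 120025.0),
            (datetime.date(1992, 4, 1), 124470.0)]

df = pd.DataFrame.from_records(listData, columns=['date', 'value'])
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Partial-string indexing: all rows in a year, or a single month,
# without spelling out full timestamps
print(df.loc['1992'])
print(df.loc['1992-03', 'value'].iloc[0])  # 120025.0
```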
This question already has answers here:
- NumPy Error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() (4 answers)
- Solution to Error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() (2 answers)
- ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() (10 answers)
Closed 3 years ago.
I got this code from YouTube; I'm not sure why the tutor (Sentdex) is not getting the same error as I am. I have a Test.csv file with dates as the index:
Dates     A Close  B Close  DLF Close  ICICI Close
1 Jan 18  555      111      122        400
2 Jan 18  566      132      128        398
and so on...
from collections import Counter
import numpy as np
import pandas as pd

hm_days = 7

def process_data(ticker):
    df = pd.read_csv('Test.csv', index_col=0)
    tickers = df.columns.values.tolist()
    df.fillna(0, inplace=True)
    for i in range(1, hm_days + 1):
        df['{}_{}d'.format(ticker, i)] = (df[ticker].shift(-i) -
                                          df[ticker]) / df[ticker]
    df.fillna(0, inplace=True)
    return tickers, df

def buy_sell_hold(*args):
    cols = [c for c in args]
    req = 0.02
    for col in cols:
        if all(col) > req:
            return 1
        if all(col) < -req:
            return -1
    return 0

def extract_feature(ticker):
    tickers, df = process_data(ticker)
    df['{}_target'.format(ticker)] = list(map(buy_sell_hold,
                                              df[['{}_{}d'.format(ticker, i)
                                                  for i in range(1, hm_days + 1)]].values))
    vals = df['{}_target'.format(ticker)].values.tolist()
    str_vals = [str(i) for i in vals]
    print('Data spread:', Counter(str_vals))
    df.fillna(0, inplace=True)
    df = df.replace([np.inf, -np.inf], np.nan)
    df.dropna(inplace=True)
    df_vals = df[[ticker for ticker in tickers]].pct_change()
    df_vals = df_vals.replace([np.inf, -np.inf], 0)
    df_vals.fillna(0, inplace=True)
    x = df_vals.values
    y = df['{}_target'.format(ticker)].values
    return x, y, df

extract_feature('DLF Close')
This is the error I am getting:
Traceback (most recent call last):
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'DLF Close_target'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Sudipto\Dropbox\Sentdex\PPF Backup\try.py", line 48, in <module>
extract_feature('DLF Close')
File "C:\Users\Sudipto\Dropbox\Sentdex\PPF Backup\try.py", line 33, in extract_feature
vals = df['{}_target'.format(ticker)].values.tolist()
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "C:\Users\Sudipto\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'DLF Close_target'
I gather the issue is with the line:
vals = df['{}_target'.format(ticker)].values.tolist()
I checked the code twice, three times... and couldn't figure out what is wrong when I call it for "DLF Close". Can anyone help me with this?