Not able to access a column in pandas data frame - python

I have a data frame df.
df.columns gives this output:
Index([u'Talk Time\t', u'Hold Time\t', u'Work Time\t', u'Call Type'], dtype='object')
Here, the column name 'Talk Time' has a trailing "\t" character, so if I do the following, I get an error:
df['Talk Time']
Traceback (most recent call last):
File "<ipython-input-78-f2b7b9f43f59>", line 1, in <module>
old['Talk Time']
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1780, in __getitem__
return self._getitem_column(key)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\frame.py", line 1787, in _getitem_column
return self._get_item_cache(key)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\generic.py", line 1068, in _get_item_cache
values = self._data.get(item)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\internals.py", line 2849, in get
loc = self.items.get_loc(item)
File "C:\Users\Admin\Anaconda\lib\site-packages\pandas\core\index.py", line 1402, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas\index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas\index.c:3820)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3700)
File "pandas\hashtable.pyx", line 696, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12323)
File "pandas\hashtable.pyx", line 704, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12274)
KeyError: 'Talk Time'
So I modify columns to remove tab characters as follows:
for n in range(len(df.columns)):
    df.columns.values[n] = df.columns.values[n].rstrip()
The tab characters are removed, and df.columns gives the following output:
Index([u'Talk Time', u'Hold Time', u'Work Time', u'Call Type'], dtype='object')
But when I try to access the column again with
df['Talk Time']
I still see the same error. Why is this happening?

The main issue is that you replaced the values inside df.columns.values, and that part did succeed, which is why the printed names look clean. But that array is effectively just an alias: the names pandas actually uses for lookup stayed as they were. So df['Talk Time\t'] would still have worked had you tried it, though obviously that is not the result you wanted.
The solution is to assign to df.columns itself instead of modifying df.columns.values:
df.columns = [c.rstrip() for c in df.columns]
This should do what you need.
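As a minimal sketch of the same fix, the column names can also be cleaned with the string methods on the column Index. The sample frame below is made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical one-row frame that reuses the column names from the question
df = pd.DataFrame(
    [[120, 15, 30, "inbound"]],
    columns=["Talk Time\t", "Hold Time\t", "Work Time\t", "Call Type"],
)

# Assign a new, cleaned Index instead of mutating df.columns.values in place
df.columns = df.columns.str.rstrip()

talk_time = df["Talk Time"]  # the lookup now succeeds
print(list(df.columns))
```

Because a new Index is assigned, pandas rebuilds its lookup from the cleaned names.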

I can't reproduce your second error. However, you could do:
df.columns = [i.rstrip() for i in df.columns]
Maybe this will help!

Related

Adding the last line of code results in this error: "in get_loc raise KeyError(key) from err". What causes this error?

I am trying to subset the dates from 2013 to 2018, and adding the last line of the code results in this error.
Why does this happen, and can anyone tell me if there is a better way to subset the dates?
ERROR :
File "C:\Users\Dev\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Date'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\Python Projects\MTE\Fitness Tracker\Analyze Your Runkeeper Fitness Data\datasets\Fitness Data.py", line 29, in <module>
datesss=df_run[(df_run['Date'] > '01-01-2013') & (df_run['Date'] <= '31-12-2018')]
File "C:\Users\Dev\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\Dev\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'Date'
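Your code looks correct. Please check that df_run, when you create it from df_activities, has 'Date' as a column and not as the index; if it is the index, you will have to call reset_index() first.
The error suggests that df_run either does not contain a 'Date' column or has it set as the index.
A boolean mask can be used. It works with a Timestamp column; with an object (string) column the comparison is plain string comparison, so day-first strings such as '01-01-2013' will not compare chronologically unless you convert the column with pd.to_datetime first: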
Solution
mask = (df['Date'] > '01-01-2013') & (df['Date'] <= '31-12-2019')
df.loc[mask]
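The two checks above can be sketched together as follows. The frame and its Distance column are invented for illustration, and the sketch assumes 'Date' ended up as the index, which the question does not confirm.

```python
import pandas as pd

# Hypothetical data: 'Date' is the index, so df_run['Date'] would raise KeyError
df_run = pd.DataFrame(
    {"Distance": [5.0, 10.2, 7.5]},
    index=pd.Index(["2012-11-30", "2015-06-01", "2018-12-31"], name="Date"),
)

# Move 'Date' back into a regular column if it is not one already
if "Date" not in df_run.columns:
    df_run = df_run.reset_index()

# Convert to real timestamps so the comparison is chronological
df_run["Date"] = pd.to_datetime(df_run["Date"])

mask = (df_run["Date"] > pd.Timestamp("2013-01-01")) & (
    df_run["Date"] <= pd.Timestamp("2018-12-31")
)
subset = df_run.loc[mask]
print(subset)
```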

How can we concatenate two columns based on names?

I was working with a multiindex dataframe (which I find unbelievably complicated to work with). I flattened the multiindex into just level 0 with this line of code.
df_append.columns = df_append.columns.map('|'.join).str.strip('|')
Now, when I print columns, I get this.
Index(['IDRSSD', 'RCFD3531|TRDG ASSETS-US TREAS SECS IN DOM OFF',
'RCFD3532|TRDG ASSETS-US GOV AGC CORP OBLGS',
'RCFD3533|TRDG ASSETS-SECS ISSD BY ST POL SUB',
'TEXTF660|3RD ITEMIZED AMT FOR OTHR TRDG ASTS',
'Unnamed: 115_level_0|Unnamed: 115_level_1',
'Unnamed: 133_level_0|Unnamed: 133_level_1',
'Unnamed: 139_level_0|Unnamed: 139_level_1',
'Unnamed: 20_level_0|Unnamed: 20_level_1',
'Unnamed: 87_level_0|Unnamed: 87_level_1', 'file', 'schedule_code',
'year', 'qyear'],
dtype='object', length=202)
I am trying to concatenate two columns into one single column, like this.
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
Here is the error that I am seeing.
Traceback (most recent call last):
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'IDRSSD'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-153-92d2e8486595>", line 1, in <module>
df_append['period'] = df_append['IDRSSD'].astype(str) + '-' + df_append['qyear'].astype(str)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\ryans\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'IDRSSD'
To me, it looks like I have a column named 'IDRSSD' and a column named 'qyear', but Python disagrees. Or, perhaps I am misinterpreting the error message. Can I get these two columns concatenated into one, or is this impossible to do? Thanks everyone.
I tried the method below. It worked for me.
1.) First convert the columns to strings:
df_append['IDRSSD'] = df_append['IDRSSD'].astype(str)
df_append['qyear'] = df_append['qyear'].astype(str)
2.) Now join both columns into one, using '-' as the separator:
df_append['period'] = df_append[['IDRSSD', 'qyear']].apply(lambda x: '-'.join(x), axis=1)
Attached the screenshot of my approach.
You can use df_append.columns = df_append.columns.to_flat_index() to change the MultiIndex into a one dimensional array of tuples. From there you should be able to manipulate the columns more easily, or at least see what the issue is.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.to_flat_index.html
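A rough sketch of that idea follows. It assumes the KeyError comes from stray whitespace hidden in a header level, which the question does not confirm; the tiny frame and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical two-level header; note the trailing space in "IDRSSD "
cols = pd.MultiIndex.from_tuples(
    [("IDRSSD ", ""), ("RCFD3531", "TRDG ASSETS"), ("qyear", "")]
)
df_append = pd.DataFrame([[1, 2, "2020Q1"]], columns=cols)

# Flatten to tuples, then inspect them: printing the list shows each name's repr
flat = df_append.columns.to_flat_index()
print(list(flat))

# Strip every level, drop empty parts, and join what is left with '|'
df_append.columns = [
    "|".join(str(part).strip() for part in tup if str(part).strip())
    for tup in flat
]

df_append["period"] = (
    df_append["IDRSSD"].astype(str) + "-" + df_append["qyear"].astype(str)
)
```

Printing the tuples as a list makes hidden spaces or tabs visible, which the default Index display can obscure.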
Use the apply method:
import pandas as pd
def concat(row):
    # join the two values only if both columns are present in the row
    if "col1" in row and "col2" in row:
        return str(row['col1']) + "-" + str(row['col2'])
df = pd.DataFrame([["1", "2"], ["1", "2"]], columns=["col1", "col2"])
df['col3'] = df.apply(concat, axis=1)
df

pandas filter dataframe based on chained splits

I have a pandas dataframe which contains a column (column name filenames) with filenames. The filenames look something like:
long_file1_name_0.jpg
long_file2_name_1.jpg
long_file3_name_0.jpg
...
To filter, I do this (let's say select_string="0"):
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
but I get thrown this:
Traceback (most recent call last):
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2889, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 97, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1032, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1039, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "python_file.py", line 118, in <module>
main()
File "inference.py", line 57, in main
_=some_function(config_dict=config_dict, logger=logger, select_string=config_dict['global']['select_string'])
File "/file/location/dir/etc/fprint/dataloaders.py", line 31, in some_function2
logger=logger, select_string=select_string)
File "/file/location/dir/etc/fprint/preprocess.py", line 25, in df_preprocess
df_fp = df_fp[~df_fp["filenames"].str.split(".jpg")[0].split("_")[-1]==select_string]
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 882, in __getitem__
return self._get_value(key)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/series.py", line 991, in _get_value
loc = self.index.get_loc(label)
File "/file/location/dir/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
raise KeyError(key) from err
KeyError: 0
I think it does not like me chaining the splits, but I vaguely remember doing this some time ago and it worked, so I am perplexed as to why it throws this error.
PS: I do know how to solve this using .contains, but I would like to use this approach of comparing strings.
Any pointers would be great!
Here is another way, with .str.extract():
import pandas as pd
df = pd.DataFrame({'filename': ['long_file1_name_0.jpg',
                                'long_file2_name_1.jpg',
                                'long_file3_name_0.jpg',
                                'long_file3_name_33.jpg']})
Now, create a boolean mask. The squeeze() method ensures we have a series, so the mask will work:
mask = (df['filename'].str.extract(r'\w+_(\d+)\.jpg')
                      .astype(int)
                      .eq(0)
                      .squeeze())
print(df.loc[mask])
                filename
0  long_file1_name_0.jpg
2  long_file3_name_0.jpg
This assumes all filenames contain .jpg; if not, split on "." alone instead.
select_string=str(0) #select string should be of type str
df_fp=df_fp[df_fp["filenames"].apply(lambda x: x.split(".jpg")[0].split("_")[-1]).astype(str)==select_string]
This part:
df_fp["filenames"].str.split(".jpg")[0]
looks up the element of the Series whose index label is 0 (that is, a row), not the first element of each split list. Your index evidently has no label 0, hence the KeyError.
What you are looking for is the expand parameter (it creates a new column for every element of the list after the split). For names like long_file1_name_0.jpg the last piece lands in column 3:
df[df['filenames'].str.split('.jpg', expand=True)[0].str.split('_', expand=True)[3] == '0']
Alternatively you could do that via apply:
df[df['filenames'].apply(lambda x: x.split('.jpg')[0].split('_')[-1]) == '0']
but contains is definitely more appropriate here.
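If you want to keep the chained-split style, one option is to index into each row's list with the .str accessor rather than with plain brackets. This is only a sketch: the frame below is made up, and its index deliberately has no label 0 to mirror the situation in the traceback.

```python
import pandas as pd

# Hypothetical frame whose index does not contain the label 0
df_fp = pd.DataFrame(
    {"filenames": ["long_file1_name_0.jpg",
                   "long_file2_name_1.jpg",
                   "long_file3_name_0.jpg"]},
    index=[10, 11, 12],
)
select_string = "0"

# .str[0] and .str[-1] pick elements from each row's list;
# a plain [0] would instead look up the row labelled 0
suffix = df_fp["filenames"].str.split(".jpg").str[0].str.split("_").str[-1]

matches = df_fp[suffix == select_string]
print(matches)
```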

Why does pandas generate a KeyError when looking up date in date-indexed table?

Consider the following code:
date_index = np.array(['2019-01-01', '2019-01-02'], dtype=np.datetime64)
df = pd.DataFrame({'a': np.array([1, 2])}, index=date_index)
date_to_lookup = date_index[0]
print(df.at[date_to_lookup, 'a'])
One might expect it to work and print 1. Yet (at least in Anaconda python 3.7.3 with Pandas 0.24.2) it fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../site-packages/pandas/core/indexing.py", line 2270, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
File ".../site-packages/pandas/core/frame.py", line 2771, in _get_value
return engine.get_value(series._values, index)
File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 447, in pandas._libs.index.DatetimeEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 17897
It appears that Pandas DataFrame and Series objects always store dates as dtype 'datetime64[ns]' or 'datetime64[ns, tz]', and the issue arises because Pandas automatically converts 'datetime64[D]' dtype to 'datetime64[ns]' when creating the index, but does not do that when looking up an element in that index. I could avoid the error above by converting the key to 'datetime64[ns]'. E.g. both of the following lines successfully print 1:
print(df.at[pd.to_datetime(date_to_lookup), 'a'])
print(df.at[date_to_lookup.astype('datetime64[ns]'), 'a'])
This behavior (automatic dtype conversion when creating an index, but not when looking up an element) seems counterintuitive to me. What is the reason it was implemented this way? Is there some coding style one is expected to follow to avoid errors like this? Or is it a bug I should file?
You can avoid this by selecting by position with DataFrame.iat, using Index.get_loc to get the position of column a:
print(df.iat[0, df.columns.get_loc('a')])
#alternative
#print(df.iloc[0, df.columns.get_loc('a')])
1
Another idea is to use df.index for the lookup instead of date_index[0]:
print(df.at[df.index[0], 'a'])
I think this is a bug you found in 0.24.2; it works on my system with Python 3.7.2 and pandas 0.25.3:
date_index = np.array(['2019-01-01', '2019-01-02'], dtype=np.datetime64)
df = pd.DataFrame({'a': np.array([1, 2])}, index=date_index)
date_to_lookup = date_index[0]
print(df.at[date_to_lookup, 'a'])
1
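For code that has to run across pandas versions, a defensive habit is to convert the key to a pandas Timestamp before the lookup, as the question itself does with pd.to_datetime. A minimal sketch using the question's own data:

```python
import numpy as np
import pandas as pd

date_index = np.array(["2019-01-01", "2019-01-02"], dtype="datetime64[D]")
df = pd.DataFrame({"a": np.array([1, 2])}, index=date_index)

# Convert the NumPy datetime64 key to a pandas Timestamp before the lookup
key = pd.Timestamp(date_index[0])
value = df.at[key, "a"]
print(value)
```

This avoids relying on how a particular pandas version treats a raw datetime64[D] key.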

load text file with separate columns in python pandas

I have a text file that looks like this:
# Pearson correlation [n=344 #col=2]
# Name Name Value BiasCorr 2.50% 97.50% N: 2.50% N:97.50%
# --------------- --------------- -------- -------- -------- -------- -------- --------
101_DGCA3.1D[0] 101_LEC.1D[0] +0.85189 +0.85071 +0.81783 +0.87777 +0.82001 +0.87849
I have loaded it into python pandas using the following code:
import pandas as pd
data = pd.read_table('test.txt')
print data
However, I can't seem to access the different columns separately. I have tried using sep=' ' and copying the spaces between the columns in the text file, but I still don't get any column names and trying to print data[0] gives me an error:
Traceback (most recent call last):
File "cut_afni_output.py", line 3, in <module>
print data[0]
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1969, in __getitem__
return self._getitem_column(key)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 1976, in _getitem_column
return self._get_item_cache(key)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 1091, in _get_item_cache
values = self._data.get(item)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3211, in get
loc = self.items.get_loc(item)
File "/home/user/anaconda2/lib/python2.7/site-packages/pandas/core/index.py", line 1759, in get_loc
return self._engine.get_loc(key)
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:3979)
File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3843)
File "pandas/hashtable.pyx", line 668, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12265)
File "pandas/hashtable.pyx", line 676, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12216)
KeyError: 0
I haven't been able to set the header row manually because it seems like python views the whole thing as one column. How do I make the text file be read in as separate columns that I can call?
Try this:
In [33]: df = pd.read_csv(filename, comment='#', header=None, delim_whitespace=True)
In [34]: df
Out[34]:
0 1 2 3 4 5 6 7
0 101_DGCA3.1D[0] 101_LEC.1D[0] 0.85189 0.85071 0.81783 0.87777 0.82001 0.87849
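A variant of the same call, sketched with sep=r"\s+" (newer pandas releases deprecate delim_whitespace in favour of it) and with explicit column names. The names below are hypothetical labels chosen for illustration, not headers read from the file, and the text is inlined through io.StringIO only to keep the sketch self-contained.

```python
import io

import pandas as pd

# Sample content copied from the question
text = """# Pearson correlation [n=344 #col=2]
# Name Name Value BiasCorr 2.50% 97.50% N: 2.50% N:97.50%
# --------------- --------------- -------- -------- -------- -------- -------- --------
101_DGCA3.1D[0] 101_LEC.1D[0] +0.85189 +0.85071 +0.81783 +0.87777 +0.82001 +0.87849
"""

# Hypothetical column labels (assumed, not taken from the file)
names = ["name_1", "name_2", "value", "bias_corr",
         "ci_low", "ci_high", "n_ci_low", "n_ci_high"]

data = pd.read_csv(
    io.StringIO(text),
    comment="#",    # skip the commented header lines
    header=None,    # the file has no usable header row
    sep=r"\s+",     # split on runs of whitespace
    names=names,
)
print(data["value"])
```

With named columns, data["value"] works where data[0] raised a KeyError in the question.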
