Having trouble with - class 'pandas.core.indexing._AtIndexer' - python

I'm working on an ML project to predict answer times on Stack Overflow based on tags. Sample data:
   Unnamed: 0     qid        i  qs          qt                            tags   qvc  qac     aid        j  as          at
0           1  563355  62701.0   0  1235000081   php,error,gd,image-processing   220    2  563372  67183.0   2  1235000501
1           2  563355  62701.0   0  1235000081   php,error,gd,image-processing   220    2  563374  66554.0   0  1235000551
2           3  563356  15842.0  10  1235000140  lisp,scheme,subjective,clojure  1047   16  563358  15842.0   3  1235000177
3           4  563356  15842.0  10  1235000140  lisp,scheme,subjective,clojure  1047   16  563413    893.0  18  1235001545
4           5  563356  15842.0  10  1235000140  lisp,scheme,subjective,clojure  1047   16  563454  11649.0   4  1235002457
I'm stuck at the data cleaning process. I intend to create a new column named 'time_taken' which stores the difference between the at and qt columns.
Code:
import pandas as pd
import numpy as np

df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))
for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
  File "<ipython-input-39-9384be9e5531>", line 1, in <module>
    val = df.at[0]
  File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
    return super().__getitem__(key)
  File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
    return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem here lies in the indexing of df.at.
The types of df.qt and df.at are
<class 'pandas.core.series.Series'> and
<class 'pandas.core.indexing._AtIndexer'> respectively.
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.

There is, to put it mildly, an easier way to do this.
df['time_taken'] = df['at'] - df.qt
The _AtIndexer issue comes up because .at is a built-in pandas accessor (the label-based scalar indexer), so df.at resolves to that accessor instead of your column. For this reason, avoid giving columns the same name as an existing Python/pandas attribute. You can get around it by indexing with df['at'] instead of df.at.
Besides that, this operation (if I'm understanding it correctly) can be done in one short line instead of a long for loop.
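To see the name collision concretely, here is a minimal sketch on a toy frame (the numbers are made up for the example):
import pandas as pd

df = pd.DataFrame({'qt': [100, 200], 'at': [150, 260]})

print(type(df.qt))  # <class 'pandas.core.series.Series'> - plain column access
print(type(df.at))  # <class 'pandas.core.indexing._AtIndexer'> - the pandas accessor wins

# Bracket indexing always means the column, so the whole loop collapses to:
df['time_taken'] = df['at'] - df['qt']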

'Timestamp' object is not subscriptable

This is my code; I am trying to get the months for a new column:
import pandas as pd
df = pd.read_excel("..\Data.xlsx")
df.head(4)
p = df["Month"][0]
p[0:3]
I don't know what the issue is here, but it worked well for other datasets with the same attributes.
Dataset:
Month Passengers
0 1995-01-01 112
1 1995-02-01 118
2 1995-03-01 132
3 1995-04-01 129
4 1995-05-01 121
P.S.: In the Excel dataset the month values are in Jan-1995, Feb-1995 format; it changed to YYYY-MM-DD format because pandas parsed the column as timestamps.
Traceback (most recent call last):
  File "C:\Users\sreen\AppData\Local\Temp/ipykernel_27276/630478717.py", line 1, in <module>
    p[0:3]
TypeError: 'Timestamp' object is not subscriptable
Maybe you need to write p = df["Month"]? In your current code, p is the first value of the Month column, so p[0:3] is just a Timestamp, which can't be subscripted.
This should work for you:
df.rename(columns = {'Month':'Date'}, inplace = True)
df['Month'] = pd.DatetimeIndex(df['Date']).month
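If the p[0:3] in the question was meant to grab the three-letter month name, a minimal sketch using the datetime accessor (the column name month_abbr is just an example):
df['month_abbr'] = df['Date'].dt.strftime('%b')  # e.g. 'Jan', 'Feb'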

Create a True/False column in Python DataFrame(1,0) based on two columns values

I'm having trouble creating a new column based on the columns 'language_1' and 'language_2' in a pandas DataFrame. I want to create a 'bilingual' column where a 1 represents a user who speaks both English and Spanish (bilingual) and a 0 marks non-bilingual speakers. Ultimately I want to compare their average ratings to each other, but I want to categorize them first. I tried using if statements, but I'm not sure how to write one that combines multiple conditions into a single value. Thank you for any help.
name    language_1  language_2  rating  bilingual
Kevin   English     Null        4.25
Miguel  English     Spanish     4.56
Carlos  English     Spanish     4.61
Aaron   Null        Spanish     4.33
Here is the code I've tried to use to append the new column to my dataframe.
def label_bilingual(row):
    if row('language_english') == row['English'] and row('language_spanish') == 'Spanish':
        val = 1
    else:
        val = 0

df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
Here is the error I'm getting.
----> 1 df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
'Series' object is not callable
You have a few issues with your function: one is causing your error, and a few more will cause problems after it.
1 - You have tried to call the column with row('name'), but a Series is not callable:
df('row')
Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    df('row')
TypeError: 'DataFrame' object is not callable
2 - You have tried to compare row['column'] to row['English'], which will not work, as a column named English does not exist:
KeyError: 'English'
3 - You do not return any values; the function only assigns to val:
val = 1
val = 0
You need to modify your function as below to resolve these errors.
def label_bilingual(row):
    if row['language_1'] == 'English' and row['language_2'] == 'Spanish':
        return 1
    else:
        return 0
Output
>>> df['bilingual'] = df.apply(label_bilingual, axis=1)
>>> df
     name language_1 language_2  rating  bilingual
0   Kevin    English       Null    4.25          0
1  Miguel    English    Spanish    4.56          1
2  Carlos    English    Spanish    4.61          1
3   Aaron       Null    Spanish    4.33          0
To make it simpler, I'd suggest recording missing values in either column as numpy.nan. For example, if missing values were recorded as np.nan:
bilingual = np.where(df[['language_1', 'language_2']].isna().any(axis=1), 0, 1)
df['bilingual'] = bilingual
Here np.where evaluates the condition, which checks whether the value in either language column is missing. If true, the person is not bilingual and gets a 0; otherwise a 1.
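Alternatively, a one-line vectorized check of the two columns avoids apply entirely. A minimal sketch, assuming the literal values 'English' and 'Spanish' as in the sample data:
df['bilingual'] = ((df['language_1'] == 'English') &
                   (df['language_2'] == 'Spanish')).astype(int)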

Is there a way to use Lists as values in a DataFrame?

I'm dealing with the famous Kaggle challenge "House prices".
I want to train my Dataset with sklearn.linear_model LinearRegression
After reading the following article:
https://developers.google.com/machine-learning/crash-course/representation/feature-engineering
I wrote a function converting all string values in my train DataFrame into lists.
For example, an original feature's values might look like [Ex, Gd, Ta, Po], and after the conversion they look like [1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1].
When I try to train my data I get the following Error:
Traceback (most recent call last):
  File "C:/Users/Owner/PycharmProjects/HousePrices/main.py", line 27, in <module>
    linereg.fit(train_df, target)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\linear_model\base.py", line 458, in fit
    y_numeric=True, multi_output=True)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 756, in check_X_y
    estimator=estimator)
  File "C:\Users\Owner\PycharmProjects\HousePrices\venv\lib\site-packages\sklearn\utils\validation.py", line 567, in check_array
    array = array.astype(np.float64)
ValueError: setting an array element with a sequence.
This only happens when I convert some columns as I explained.
Is there any way to train a Linear-Regression model with vectors as values?
This is my conversion function:
def feature_to_boolean_vector(df, feature_name, new_name):
    vectors_list = []  # each tuple will represent an option
    feature_options = df[feature_name].unique()
    feature_options_length = len(feature_options)
    # creating a list the size of feature_options_length, all 0's
    list_to_be_vector = [0 for i in range(feature_options_length)]
    for i in range(feature_options_length):
        list_to_be_vector[i] = 1  # inserting 1 representing option number i
        vectors_list.append(list_to_be_vector.copy())
        list_to_be_vector[i] = 0
    mapping = dict(zip(feature_options, vectors_list))  # dict from values to vectors
    df[new_name] = df[feature_name].map(mapping)
    df.drop([feature_name], axis=1, inplace=True)
And this is my train attempt (after pre-processing):
linereg = LinearRegression()
linereg.fit(train_df, target)
Thank you in advance.
LinearRegression does not support lists as feature values. I saw you're using one-hot encoding; you can use each dimension of the vector as its own feature column. For that, you can use the simpler pd.get_dummies method in pandas.
print(df['feature'])
0    Ex
1    Gd
2    Ta
3    Po
Name: feature, dtype: object

df = pd.get_dummies(df['feature'])
print(df)
   Ex  Gd  Po  Ta
0   1   0   0   0
1   0   1   0   0
2   0   0   0   1
3   0   0   1   0
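Putting it together, a minimal sketch of the full flow (the column names and target values here are invented for illustration):
import pandas as pd
from sklearn.linear_model import LinearRegression

train_df = pd.DataFrame({'quality': ['Ex', 'Gd', 'Ta', 'Po'],
                         'area': [120, 80, 60, 45]})
target = [300000, 210000, 150000, 90000]  # made-up prices

# Expand the string column into one 0/1 column per category,
# so every cell holds a plain number instead of a list.
train_df = pd.get_dummies(train_df, columns=['quality'])

linereg = LinearRegression()
linereg.fit(train_df, target)  # no list-valued cells remain, so this fits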

Pandas ExcelFile.parse() reading file in as dict instead of dataframe

I am new to python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an excel file as a dataframe. The file admittedly is pretty... ugly. There is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues)
I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:
xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df
Results:
{'Security exposure - 21 day lag': Percent of Total Holdings \
0 KMNFC vs. 3 Month LIBOR AUD
1 04-OCT-16
2 Australian Dollar
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 Long/Short Net Exposure
9 Total
10 NaN
11 Long
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
(This goes on for 20-30 more rows and 5-6 more columns)
I am using Anaconda, and Spyder, which has a 'Variable Explorer'. It shows the variable df to be a dict of the DataFrame type.
However, I cannot use iloc:
df.iloc[:,1]
Traceback (most recent call last):
  File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
    df.iloc[:,1]
AttributeError: 'dict' object has no attribute 'iloc'
Any thoughts? What am I missing?
EDIT:
To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around, this seems not to be a very popular way to do things, or not the 'correct' way. I understand why indexing by column names, or keys, is better, but in this situation I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.
EDIT (2):
Per a suggestion, I tried 'read_excel', with the same results:
df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):
  File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
    df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'

df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):
  File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
    df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
The problem was here:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
which returned a single-element list. I changed it to the following:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]
which returns a string, and the code then performs as expected.
All thanks to ayhan for pointing this out.
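For reference, a minimal sketch of the underlying behavior (the workbook name is hypothetical): a list of sheet names yields a dict of DataFrames, while a single string yields one DataFrame.
import pandas as pd

xl = pd.ExcelFile("holdings.xlsx")  # hypothetical file
names = [s for s in xl.sheet_names if 'security exposure' in s.lower()]

as_dict = xl.parse(names)  # list argument -> {sheet_name: DataFrame}
df = xl.parse(names[0])    # string argument -> a single DataFrame

df.iloc[:, 0]  # first column by position, like R's df[, 1]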

pandas str.split with .tolist() produced a float?

I'm having a hard time debugging my code, which worked fine when tested on a small subset of the entire data. I could double-check types to be sure, but the error message is already informative enough: the list I made ended up being a float. But how?
The last three lines which ran:
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments = all_treatments.drop(['DIAGNOS','INDATUMA','date'], axis=1)
all_treatments['tobacco'] = tobacco(diagnoses)
The error:
Traceback (most recent call last):
  File "treatments2_noiopro.py", line 97, in <module>
    all_treatments['tobacco'] = tobacco(diagnoses)
  File "treatments2_noiopro.py", line 13, in tobacco
    for codes in codes_column]
TypeError: 'float' object is not iterable
FWIW, the function itself is:
def tobacco(codes_column):
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if codes else False
            for codes in codes_column]
I am using versions pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0 under Linux.
You can use str.split on the series and apply a function to the result:
def tobacco(codes):
    return any(['C30' <= code < 'C40' or 'F17' <= code < 'F18' for code in codes])
data = [('C35 C50'), ('C36'), ('C37'), ('C50 C51'), ('F1 F2'), ('F17'), ('F3 F17'), ('')]
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df
DIAGNOS
0 C35 C50
1 C36
2 C37
3 C50 C51
4 F1 F2
5 F17
6 F3 F17
7
df.DIAGNOS.str.split(' ').apply(tobacco)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 False
dtype: bool
edit:
Seems like using str.contains is significantly faster than both methods.
tobacco_codes = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
data = [('C35 C50'), ('C36'), ('C37'), ('C50 C51'), ('F1 F2'), ('F17'), ('F3 F17'), ('C3')]
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df.DIAGNOS.str.contains(tobacco_codes)
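A related caveat: if the original error came from missing diagnoses (a NaN is a float, which is how a column of split strings ends up containing floats), str.contains can absorb that directly. A minimal sketch, assuming some rows may be NaN:
import numpy as np
import pandas as pd

tobacco_codes = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
df = pd.DataFrame({'DIAGNOS': ['C35 C50', 'F17', np.nan]})

# na=False turns missing values into False instead of propagating NaN
df['tobacco'] = df.DIAGNOS.str.contains(tobacco_codes, na=False)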
I guess diagnoses is a generator, and since you drop columns in line 2 of your code, this changes the generator. I can't test anything right now, but let me know if it works when you comment out line 2 of your code.
