Pandas ExcelFile.parse() reading file in as dict instead of dataframe - python

I am new to Python and even newer to pandas, but relatively well versed in R. I am using Anaconda, with Python 3.5 and pandas 0.18.1. I am trying to read in an Excel file as a DataFrame. The file admittedly is pretty... ugly: there is a lot of empty space, missing headers, etc. (I am not sure if this is the source of any issues.)
I create the file object, then find the appropriate sheet, then try to read that sheet as a dataframe:
xl = pd.ExcelFile(allFiles[i])
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
df = xl.parse(sName)
df
Results:
{'Security exposure - 21 day lag': Percent of Total Holdings \
0 KMNFC vs. 3 Month LIBOR AUD
1 04-OCT-16
2 Australian Dollar
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 Long/Short Net Exposure
9 Total
10 NaN
11 Long
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
(This goes on for 20-30 more rows and 5-6 more columns)
I am using Anaconda and Spyder, which has a 'Variable Explorer'; it shows the variable df to be a dict of the DataFrame type.
However, I cannot use iloc:
df.iloc[:,1]
Traceback (most recent call last):
File "<ipython-input-77-d7b3e16ccc56>", line 1, in <module>
df.iloc[:,1]
AttributeError: 'dict' object has no attribute 'iloc'
Any thoughts? What am I missing?
EDIT:
To be clear, what I am really trying to do is reference the first column of the df. In R this would be df[,1]. Looking around, this seems to be not a very popular (or 'correct') way to do things. I understand why indexing by column names, or keys, is better, but in this situation I really just need to index the dataframes by column numbers. Any working method of doing that would be greatly appreciated.
EDIT (2):
Per a suggestion, I tried 'read_excel', with the same results:
df = pd.ExcelFile(allFiles[i]).parse(sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-90-fc40aa59bd20>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'
df = pd.read_excel(allFiles[i], sheetname = sName)
df.loc[1]
Traceback (most recent call last):
File "<ipython-input-91-72b8405c6c42>", line 2, in <module>
df.loc[1]
AttributeError: 'dict' object has no attribute 'loc'

The problem was here:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()]
which returned a single element list. I changed it to the following:
sName = [s for s in xl.sheet_names if 'security exposure' in s.lower()][0]
which returns a string, and the code then performs as expected.
All thanks to ayhan for pointing this out.
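For future readers, a minimal sketch of the difference (the file name below is hypothetical):

import pandas as pd

xl = pd.ExcelFile('holdings.xlsx')  # hypothetical file name

# Passing a list of sheet names returns a dict mapping sheet name -> DataFrame:
dfs = xl.parse(['Security exposure - 21 day lag'])
type(dfs)  # dict

# Passing a single sheet name (a string) returns one DataFrame:
df = xl.parse('Security exposure - 21 day lag')
type(df)   # pandas.core.frame.DataFrame

# Positional column indexing, the equivalent of R's df[,1]:
first_col = df.iloc[:, 0]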

Related

'Timestamp' object is not subscriptable

This is my code; I am trying to get months for a new column:
import pandas as pd
df = pd.read_excel("..\Data.xlsx")
df.head(4)
p = df["Month"][0]
p[0:3]
I don't know what the issue is here, but it was working well for other datasets with the same attributes.
Dataset:
Month Passengers
0 1995-01-01 112
1 1995-02-01 118
2 1995-03-01 132
3 1995-04-01 129
4 1995-05-01 121
P.S.: In the Excel dataset the month values are in Jan-1995, Feb-1995 format; pandas changed them to YYYY-MM-DD format.
Traceback (most recent call last):
File "C:\Users\sreen\AppData\Local\Temp/ipykernel_27276/630478717.py", line 1, in <module>
p[0:3]
TypeError: 'Timestamp' object is not subscriptable
Maybe you need to write p = df["Month"]? In your current code, p is the first value of the Month column, so p[0:3] is just a Timestamp, which can't be subscripted.
This should work for you:
df.rename(columns = {'Month':'Date'}, inplace = True)
df['Month'] = pd.DatetimeIndex(df['Date']).month
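Alternatively, a small sketch (the frame below is a hypothetical stand-in for the question's data): the .dt accessor works on the whole column, so there is no need to index a single Timestamp.

import pandas as pd

# Hypothetical frame mirroring the question's data
df = pd.DataFrame({'Month': pd.to_datetime(['1995-01-01', '1995-02-01', '1995-03-01']),
                   'Passengers': [112, 118, 132]})

df['month_abbr'] = df['Month'].dt.strftime('%b')  # 'Jan', 'Feb', 'Mar'
df['month_num'] = df['Month'].dt.month            # 1, 2, 3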

Having trouble with - class 'pandas.core.indexing._AtIndexer'

I'm working on an ML project to predict answer times on Stack Overflow based on tags. Sample data:
Unnamed: 0 qid i qs qt tags qvc qac aid j as at
0 1 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563372 67183.0 2 1235000501
1 2 563355 62701.0 0 1235000081 php,error,gd,image-processing 220 2 563374 66554.0 0 1235000551
2 3 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563358 15842.0 3 1235000177
3 4 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563413 893.0 18 1235001545
4 5 563356 15842.0 10 1235000140 lisp,scheme,subjective,clojure 1047 16 563454 11649.0 4 1235002457
I'm stuck at the data cleaning stage. I intend to create a new column named 'time_taken' which stores the difference between the 'at' and 'qt' columns.
Code:
import pandas as pd
import numpy as np
df = pd.read_csv("answers.csv")
df['time_taken'] = 0
print(type(df.time_taken))
for i in range(0, 263541):
    val = df.qt[i]
    qtval = val.item()
    val = df.at[i]
    atval = val.item()
    df.time_taken[i] = qtval - atval
I'm getting this error:
Traceback (most recent call last):
File "<ipython-input-39-9384be9e5531>", line 1, in <module>
val = df.at[0]
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2080, in __getitem__
return super().__getitem__(key)
File "D:\Softwares\Anaconda\lib\site-packages\pandas\core\indexing.py", line 2027, in __getitem__
return self.obj._get_value(*key, takeable=self._takeable)
TypeError: _get_value() missing 1 required positional argument: 'col'
The problem here lies in the indexing of df.at.
The types of df.at and df.qt are
<class 'pandas.core.indexing._AtIndexer'>
<class 'pandas.core.series.Series'> respectively.
I'm an absolute beginner in data science and do not have enough experience with pandas and numpy.
There is, to put it mildly, an easier way to do this.
df['time_taken'] = df['at'] - df.qt
The AtIndexer issue comes up because .at is a pandas indexer, not a column accessor. For this reason you want to avoid giving columns the same names as existing Python/pandas attributes; you can get around the collision by indexing with df['at'] instead of df.at.
Besides that, this operation (if I'm understanding it) can be done with one short line instead of a long for loop.
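A short demonstration of the name collision, using a hypothetical two-row frame:

import pandas as pd

# 'at' as a column name collides with the built-in .at indexer
df = pd.DataFrame({'qt': [1235000081, 1235000140],
                   'at': [1235000501, 1235000177]})

type(df.at)     # pandas.core.indexing._AtIndexer -- the indexer wins
type(df['at'])  # pandas.core.series.Series -- bracket syntax reaches the column

df['time_taken'] = df['at'] - df['qt']  # vectorized, no loop needed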

Create a True/False column in Python DataFrame(1,0) based on two columns values

I'm having trouble creating a new column based on the 'language_1' and 'language_2' columns in a Python DataFrame. I want to create a 'bilingual' column where a 1 represents a user who speaks both English and Spanish (bilingual) and a 0 a non-bilingual speaker. Ultimately I want to compare their average ratings to each other, but I want to categorize them first. I tried using if statements, but I'm not sure how to write one that combines multiple conditions to produce a single value. Thank you for any help.
name    language_1  language_2  rating  bilingual
Kevin   English     Null        4.25
Miguel  English     Spanish     4.56
Carlos  English     Spanish     4.61
Aaron   Null        Spanish     4.33
Here is the code I've tried to use to append the new column to my dataframe.
def label_bilingual(row):
    if row('language_english') == row['English'] and row('language_spanish') == 'Spanish':
        val = 1
    else:
        val = 0

df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
Here is the error I'm getting.
----> 1 df_doc_1['bilingual'] = df_doc_1.apply(label_bilingual, axis=1)
'Series' object is not callable
You have a few issues with your function: one is causing your error, and a few more will cause problems after.
1 - You have tried to call the column with row('name'), which is not callable.
df('row')
Traceback (most recent call last):
File "<pyshell#30>", line 1, in <module>
df('row')
TypeError: 'DataFrame' object is not callable
2 - You have tried to compare row['column'] to row['English'], which will not work, as a column named English does not exist:
KeyError: 'English'
3 - You do not return any values
val = 1
val = 0
You need to modify your function as below to resolve these errors.
def label_bilingual(row):
    if row['language_1'] == 'English' and row['language_2'] == 'Spanish':
        return 1
    else:
        return 0
Output
>>> df['bilingual'] = df.apply(label_bilingual, axis=1)
>>> df
name language_1 language_2 rating bilingual
0 Kevin English Null 4.25 0
1 Miguel English Spanish 4.56 1
2 Carlos English Spanish 4.61 1
3 Aaron Null Spanish 4.33 0
To make it simpler, I'd suggest recording missing values in either column as numpy.nan. For example, if missing values were recorded as np.nan:
bilingual = np.where(df[['language_1', 'language_2']].isnull().any(axis=1), 0, 1)
df['bilingual'] = bilingual
Here np.where evaluates the condition inside, which in turn checks whether a value in either of the language columns is missing. If true, then the person is not bilingual and gets a 0, otherwise a 1.
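Since the sample data actually stores the string 'Null' rather than NaN, here is a vectorized sketch for that case (the frame below reconstructs the question's table):

import pandas as pd

df = pd.DataFrame({'name': ['Kevin', 'Miguel', 'Carlos', 'Aaron'],
                   'language_1': ['English', 'English', 'English', 'Null'],
                   'language_2': ['Null', 'Spanish', 'Spanish', 'Spanish'],
                   'rating': [4.25, 4.56, 4.61, 4.33]})

# Boolean mask cast to int: the vectorized equivalent of the apply() answer
df['bilingual'] = ((df['language_1'] == 'English') &
                   (df['language_2'] == 'Spanish')).astype(int)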

Unable to load csv file into Dataframe using Surprise in Python

The Scenario
The dataset to be imported has a considerable number of NaN values in it. For this I'm using the Surprise package (written by Nicolas Hug) in Python rather than pandas, the reason being that the package's method of predicting NaN values is good.
The Problem
Dataset post_df1.csv is as mentioned below:
uid iid rat
1 303.0 785.0 3.000000
2 291.0 1042.0 4.000000
3 234.0 1184.0 2.000000
4 102.0 768.0 2.000000
5 181.0 1081.0 1.000000
...
194 944.0 110.0 NaN
195 944.0 111.0 NaN
196 944.0 112.0 NaN
197 944.0 113.0 NaN
198 944.0 114.0 5.000000
199 944.0 115.0 5.000000
Importing it using Surprise
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1, 5))
df = Dataset.load_from_file('post_df1.csv', reader=reader)
returns error:
Traceback (most recent call last):
File "<input>", line 3, in <module>
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 173, in load_from_file
return DatasetAutoFolds(ratings_file=file_path, reader=reader)
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 306, in __init__
self.raw_ratings = self.read_ratings(self.ratings_file)
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 205, in read_ratings
itertools.islice(f, self.reader.skip_lines, None)]
File "/home/x/.local/lib/python2.7/site-packages/surprise/dataset.py", line 455, in parse_line
return uid, iid, float(r) + self.offset, timestamp
ValueError: could not convert string to float:
I'm unable to figure out where the string is, since post_df1.csv, when read using pandas, returns this:
post_df1.dtypes
uid float64
iid float64
rat float64
dtype: object
Questions
1. What is the possibility that this package treats the entire data as strings when reading it?
2. I noticed in the error that Dataset.py returns an offset and a timestamp along with the float. How can I limit the return value to uid, iid, rat (floats) only?
return uid, iid, float(r) + self.offset, timestamp
References
Surprise Package Docs
EDIT #1
So, here's how post_df1 & post_df2 are formed. Also, for post_df1 I tried to take values from row 1 onwards, in case the 0th row is a header.
# PRE PROCESSED CLUSTER 0 -- Named to POST DataFrame1
if flag1 == 1:
    print pre_df01
    post_df1 = pre_df01.iloc[1:, :]
elif flag1 == 2:
    print pre_df02
    post_df1 = pre_df02.iloc[1:, :]
elif flag1 == 3:
    print pre_df03
    post_df1 = pre_df03.iloc[1:, :]

# PRE PROCESSED CLUSTER 1 -- Named to POST DataFrame2
if flag2 == 1:
    print pre_df11
    post_df2 = pre_df11
elif flag2 == 2:
    print pre_df12
    post_df2 = pre_df12
elif flag2 == 3:
    print pre_df13
    post_df2 = pre_df13
Here, I've already tried removing header and index to avoid any string type in it.
# EXPORT TO CSV & LOAD AGAIN IN PROGRAM
post_df1.to_csv("post_df1.csv", sep='\t', index=False, header=False)
post_df2.to_csv("post_df2.csv", sep='\t', index=False, header=False)
Since importing is the issue, I looked at the csv file using a spreadsheet application; clearly it is without headers.
It seems this error arises because of the header row in post_df1.csv, whose column names are strings. When you remove the first row with the column names from the csv file, your snippet of code should work.
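One more thing worth checking, as a hedged sketch (the frame below is a made-up stand-in for post_df1): pandas' to_csv writes NaN as an empty field, and float('') raises exactly "could not convert string to float:" with nothing after the colon, so the NaN ratings may also need to be dropped or filled before exporting.

import pandas as pd
from surprise import Dataset, Reader

# Hypothetical frame mirroring post_df1, including a NaN rating
post_df1 = pd.DataFrame({'uid': [303.0, 291.0, 944.0],
                         'iid': [785.0, 1042.0, 110.0],
                         'rat': [3.0, 4.0, float('nan')]})

# NaN becomes an empty field in the csv, which float(r) cannot parse;
# drop (or impute) those rows before exporting
post_df1.dropna(subset=['rat']).to_csv('post_df1.csv', sep='\t',
                                       index=False, header=False)

reader = Reader(line_format='user item rating', sep='\t', rating_scale=(1, 5))
data = Dataset.load_from_file('post_df1.csv', reader=reader)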

pandas str.split with .tolist() produced a float?

I'm having a hard time bug-fixing my code, which worked fine in testing on a small subset of the entire data. I could double-check types to be sure, but the error message is already informative enough: the list I made ended up containing a float. But how?
The last three lines which ran:
diagnoses = all_treatments['DIAGNOS'].str.split(' ').tolist()
all_treatments = all_treatments.drop(['DIAGNOS','INDATUMA','date'], axis=1)
all_treatments['tobacco'] = tobacco(diagnoses)
The error:
Traceback (most recent call last):
File "treatments2_noiopro.py", line 97, in <module>
all_treatments['tobacco'] = tobacco(diagnoses)
File "treatments2_noiopro.py", line 13, in tobacco
for codes in codes_column]
TypeError: 'float' object is not iterable
FWIW, the function itself is:
def tobacco(codes_column):
    return [any('C30' <= code < 'C40' or
                'F17' <= code < 'F18'
                for code in codes) if codes else False
            for codes in codes_column]
I am using versions pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0, and python 2.7.10 0 under Linux.
You can use str.split on the series and apply a function to the result:
def tobacco(codes):
    return any(['C30' <= code < 'C40' or 'F17' <= code < 'F18' for code in codes])
data = [('C35 C50'), ('C36'), ('C37'), ('C50 C51'), ('F1 F2'), ('F17'), ('F3 F17'), ('')]
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df
DIAGNOS
0 C35 C50
1 C36
2 C37
3 C50 C51
4 F1 F2
5 F17
6 F3 F17
7
df.DIAGNOS.str.split(' ').apply(tobacco)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 False
dtype: bool
Edit:
Seems like using str.contains is significantly faster than both methods.
tobacco_codes = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
data = [('C35 C50'), ('C36'), ('C37'), ('C50 C51'), ('F1 F2'), ('F17'), ('F3 F17'), ('C3')]
df = pd.DataFrame(data=data, columns=['DIAGNOS'])
df.DIAGNOS.str.contains(tobacco_codes)
I guess diagnoses is a generator, and since you drop something in line 2 of your code, this changes the generator. I can't test anything right now, but let me know if it works when you comment out line 2 of your code.
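For what it's worth, the float is most likely NaN: .str.split() leaves missing DIAGNOS entries as NaN (a float), so the list from .tolist() contains floats among the lists. A minimal sketch, assuming that is the situation:

import numpy as np
import pandas as pd

s = pd.Series(['C35 C50', np.nan, 'F17'])
s.str.split(' ').tolist()  # [['C35', 'C50'], nan, ['F17']] -- nan is a float

# Guarding against missing values keeps every element iterable:
safe = s.fillna('').str.split(' ').tolist()  # [['C35', 'C50'], [''], ['F17']]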
