ImportCSV problems with Pandas - python

I imported a CSV into pandas. However, when I try to use various modeling packages, they cannot coerce one of the columns to float. When I try to do it manually, I cannot coerce it either. When I try to check the types of all of my columns, I get the error message below. Any idea what's going on?
values = pd.read_csv(".../train_values.csv")
values.dtypes()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'Series' object is not callable

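dtypes is an attribute of the DataFrame, not a method. It returns a Series, so adding parentheses tries to call that Series, which is what raises the TypeError. You can access the types by running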
values.dtypes
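Once the dtypes are visible, the column that refuses to convert can be coerced explicitly. The following is only a minimal sketch: the inline CSV text and the column names id and amount are made up for illustration, since the real train_values.csv is not shown. It uses pd.to_numeric with errors="coerce", which turns unparseable entries into NaN instead of raising.

```python
import io

import pandas as pd

# Hypothetical stand-in for train_values.csv; "unknown" is the kind of
# stray text that prevents a column from being parsed as float.
csv_text = "id,amount\n1,10.5\n2,unknown\n3,7\n"
values = pd.read_csv(io.StringIO(csv_text))

# dtypes is an attribute, so no parentheses.
print(values.dtypes)

# Entries that cannot be parsed become NaN rather than raising an error.
values["amount"] = pd.to_numeric(values["amount"], errors="coerce")
print(values.dtypes)
print(values[values["amount"].isna()])  # rows that held non-numeric text
```

Inspecting the rows that became NaN is usually the quickest way to find what was blocking the conversion.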

Related

Pandas version 0.22.0 - drop_duplicates() got an unexpected keyword argument 'keep'

I am trying to drop duplicates in my dataframe using drop_duplicates(subset=[''], keep=False). It works fine in my Jupyter Notebook, but when I execute it from the terminal as a .py file I get the error below:
Traceback (most recent call last):
File "/home/source/fork/PySpark_Analytics/Notebooks/Krish/beryllium_pandas.py", line 54, in <module>
dffsamelname = dffsameflname.drop_duplicates(subset=['INDIVIDUAL_LASTNAME'], keep=False)
File "/var/webeng/opensource/aetna-anaconda/lib/python2.7/site-packages/pandas/util/decorators.py", line 88, in wrapper
return func(*args, **kwargs)
TypeError: drop_duplicates() got an unexpected keyword argument 'keep'
I checked that the pandas version is >0.18, since keep=False was introduced around then.
# Trying to drop both the records with same last name
dffsamelname = dffsameflname.drop_duplicates(subset=['INDIVIDUAL_LASTNAME'], keep=False)
I want to drop both of the duplicated records, so keep=False is necessary.
It works fine if I remove keep=False.
Your object may not be a native pandas DataFrame but a PySpark DataFrame instead.
According to http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.drop_duplicates, subset seems to be the only argument it accepts. Can you add your imports and the lines where you create the dataframe?
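For comparison, here is a small sketch of how keep=False behaves on a native pandas DataFrame, together with a type check that would show whether the object is a pandas DataFrame at all. The sample rows and the ID column are invented for illustration; only the INDIVIDUAL_LASTNAME column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name is taken from the question.
df = pd.DataFrame(
    {
        "INDIVIDUAL_LASTNAME": ["Smith", "Jones", "Smith", "Lee"],
        "ID": [1, 2, 3, 4],
    }
)

# Confirm what kind of object this is before calling pandas-only options.
print(type(df))
print(isinstance(df, pd.DataFrame))

# keep=False drops every row whose last name occurs more than once.
unique_lastnames = df.drop_duplicates(subset=["INDIVIDUAL_LASTNAME"], keep=False)
print(unique_lastnames)
```

If the type printed in the failing script is a pyspark.sql DataFrame rather than a pandas one, the keep argument is not available on that object.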

Error processing a csv file in Keras

I'm working on building an LSTM recurrent neural network that processes a set of texts and uses them to predict the author of new texts. I have a CSV file containing a single column of long text entries that is comma separated like this:
"sample text here","more text here","extra text written here"
This goes on for a few thousand entries. I'm trying to load this so I can feed it through the Keras Tokenizer and then use it in training my model but I'm stuck on an error originating on the first call to the Tokenizer where it kicks back with
Traceback (most recent call last):
File "test.py", line 35, in <module>
t.fit_on_texts(X_train)
File "I:\....text.py",
line 175, in fit_on_texts
self.split)
File "I:\....text.py",
line 47, in text_to_word_sequence
text = text.translate(translate_map)
AttributeError: 'numpy.ndarray' object has no attribute 'translate'
I'm very new to python, but as far as I can tell the issue is that the Tokenizer is expecting strings, but it's getting passed an ndarray instead. What I can't seem to manage is finding a way to pass it the correct thing, and I would really appreciate any advice. I've been working on this for a couple days now and it's just not coming to me.
Here's the relevant section of my code:
X_train = pandas.read_csv('I:\\xTrain.csv', sep=",", header=None, error_bad_lines=False).as_matrix()
t = Tokenizer(lower=False)
t.fit_on_texts(X_train)
t.texts_to_matrix(X_train, mode='count', lower=False)
I've tried reading it in a variety of ways, including using numpy.loadtxt. The error has varied a bit with the methods, but it's always that I'm trying to feed the wrong kind of input to the Tokenizer and I can't seem to work out how to get the right kind. What am I missing here? Thanks for taking the time to read!
Update
With help from furas, I discovered that my array was two columns wide and have successfully removed the second empty column. Unfortunately, this seems to have simply changed the error I'm getting slightly. It now reads:
Traceback (most recent call last):
File "test.py", line 36, in <module>
t.fit_on_texts(X_train)
File "I:\....text.py",
line 175, in fit_on_texts
self.split)
File "I:\....text.py",
line 47, in text_to_word_sequence
text = text.translate(translate_map)
AttributeError: 'numpy.int64' object has no attribute 'translate'
The only change is that numpy.ndarray is now numpy.int64. It looks to me like this is an int array now, even though it contains strings of text, so I'm attempting to find a way to convert it into a string array.
del X_train[1]
X_train[0] = Y_train[0].apply(str)
Is the code I've tried so far. The first line strips the extra column, but the second line seems to do nothing. I'm still trying to figure out how to get this data into the proper format.
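One possible direction, offered as a sketch rather than a confirmed fix: fit_on_texts is generally given a flat list of strings, so the DataFrame can be flattened into such a list before it reaches the Tokenizer. The inline CSV text below is a made-up stand-in for xTrain.csv (assumed here to have a trailing comma that produces the empty second column), and Keras itself is not imported.

```python
import io

import pandas as pd

# Hypothetical stand-in for xTrain.csv: one text per row plus a trailing
# comma, which yields an empty second column.
csv_text = 'sample text here,\n"more text, with a comma",\n'
frame = pd.read_csv(io.StringIO(csv_text), sep=",", header=None)

# Flatten every cell into one list, skip empty cells, and force str so
# the tokenizer never receives an ndarray or a numpy integer.
texts = [str(cell) for cell in frame.to_numpy().ravel() if pd.notna(cell)]
print(texts)

# texts could then be passed on, e.g. t.fit_on_texts(texts)
```

Checking type(texts[0]) before tokenizing is a cheap way to confirm that plain strings are being passed.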

AttributeError: 'module' object has no attribute 'TimeSeries' after python Validation.py

Just starting Computational Investing by Tucker Balch. I'm using virtualbox and installed Ubuntu. After installing QSTK, I ran python Validation.py (Step 7). I keep getting an:
AttributeError: 'module' object has no attribute 'TimeSeries'
There are many similar questions, so I believe the problem is a file somewhere in the code that shares its name with a module. I was wondering if anyone had a solution specific to this class and QSTK.
The full error is:
Traceback (most recent call last):
File "Validation.py", line 122 in <module>
import QSTK.qstkutil.tsutil as tsu
File "usr/local/lib/python2.7/dist-packages/QSTK-0.2.8 py2.7.egg/QSTK/qstkutil/tsutil.py", line 19, in <module>
from QSTK.qstkutil import qsdateutil
File "usr/local/lib/python2.7/dist-packages/QSTK-0.2.8-py2.7.egg/QSTK/qstkutil/qsdateutil.py", line 38, in <module>
GTS_DATES = _cache_dates()
File "usr/local/lib/python2.7/dist-packages/QSTK-0.2.8-py2.7.egg/QSTK/qstkutil/qsdateutil.py", line 36, in _cache_dates
return pd.TimeSeries(index=dates, data=dates)
AttributeError: 'module' object has no attribute 'TimeSeries'
I encountered this issue too. It is caused by the pandas library. Go to the directory where QSTK's qstkutil modules are located (my path is /Library/Python/2.7/site-packages/QSTK/qstkutil), open the file named in the traceback (qsdateutil.py), and change every 'TimeSeries' in it to 'Series'.
You can also get some insights from https://github.com/QuantSoftware/QuantSoftwareToolkit/issues/73
Corley is spot on. You can solve the problem by changing 2 occurrences of "TimeSeries" to "Series" in /usr/local/lib/python2.7/dist-packages/QSTK-0.2.8-py2.7.egg/QSTK/qstkutil/qsdateutil.py. "TimeSeries" also appears once in /usr/local/lib/python2.7/dist-packages/QSTK-0.2.8-py2.7.egg/QSTK/qstkutil/tsutil.py but I haven't encountered an error yet due to it.
Changing TimeSeries to Series corrects the issue for me.
Seems that
import pandas as pd;
pd.TimeSeries = pd.Series
should work, but did not for me.
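If the alias approach is tried, it can only take effect when it runs in the same Python process before QSTK is imported, because QSTK calls pd.TimeSeries at import time. A minimal sketch under that assumption follows. The QSTK import is left as a comment since QSTK may not be installed, and the dates are made up.

```python
import datetime as dt

import pandas as pd

# Restore the old name only if this pandas version no longer provides it.
if not hasattr(pd, "TimeSeries"):
    pd.TimeSeries = pd.Series

# The QSTK import would have to come after the alias, for example:
# import QSTK.qstkutil.tsutil as tsu

# Same call shape as the line that fails in qsdateutil.py.
dates = [dt.datetime(2012, 1, 3), dt.datetime(2012, 1, 4)]
series = pd.TimeSeries(index=dates, data=dates)
print(series)
```

Running the alias in an interactive session and then launching python Validation.py separately would not help, since the script starts a new process.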

How to read HDF table from pandas?

I have a my_file.h5 file that, presumably, contains data in HDF5 format (PyTables). I try to read this file using pandas:
import pandas as pd
store = pd.HDFStore('my_file.h5')
Then I try to use the store object:
print store
As a result I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.7/pandas/io/pytables.py", line 133, in __repr__
kind = v._v_attrs.pandas_type
File "/usr/lib/python2.7/dist-packages/tables/attributeset.py", line 302, in __getattr__
(name, self._v__nodePath)
AttributeError: Attribute 'pandas_type' does not exist in node: '/data'
Does anybody know what I am doing wrong? Could the problem be that my *.h5 file is not really what I think it is (not data in HDF5 format)?
In your /usr/lib/pymodules/python2.7/pandas/io/pytables.py, line 133
kind = v._v_attrs.pandas_type
In my pytables.py I see
kind = getattr(n._v_attrs,'pandas_type',None)
By using getattr, if there is no pandas_type attribute, then kind is set to None. I'm guessing my version of Pandas
In [7]: import pandas as pd
In [8]: pd.__version__
Out[8]: '0.10.0'
is newer than yours. If so, the fix is to upgrade your pandas.
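The difference between the two lines can be shown without pandas or PyTables. In this sketch NodeAttrs is a hypothetical stand-in for a node's attribute set that carries no pandas metadata.

```python
class NodeAttrs:
    """Hypothetical stand-in for an attribute set written without pandas."""

    title = "data"


attrs = NodeAttrs()

# Plain attribute access raises when the attribute is missing.
try:
    kind = attrs.pandas_type
except AttributeError as exc:
    print("direct access failed:", exc)

# getattr with a default returns None instead of raising.
kind = getattr(attrs, "pandas_type", None)
print(kind)
```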
I had an h5 table made with PyTables, independent of pandas, and needed to turn it into a list of tuples and then import that into a DataFrame. This worked nicely because it lets me use my PyTables index to run a "where" on the input, which saves me from reading all the rows.

How to convert from Python date to Excel date using xlrd (attribute xlrd.xldate_from_date_tuple does not exist)

How do I convert a Python date to an Excel date using the xlrd module? The question "How to convert a python datetime.datetime to excel serial date number" suggests a 'manual' solution; I wonder whether it is the best way.
The xlrd documentation suggests using xlrd.xldate_from_date_tuple,
but
>>> import xlrd
>>> xlrd.xldate_from_date_tuple
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'xldate_from_date_tuple'
Could you help? Thanks.
Use xlrd.xldate.xldate_from_date_tuple
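For reference, the 'manual' route mentioned in the question can be sketched with the standard library alone. This is not xlrd's implementation. It assumes the 1900-based date system, and the helper name date_to_excel_serial is made up for illustration. Dates before 1900-03-01 are rejected because Excel's 1900 system counts a nonexistent 29 February 1900, which makes the simple subtraction disagree with Excel there.

```python
from datetime import date

# Day zero that matches Excel's 1900 date system for dates on or after
# 1900-03-01.
EXCEL_EPOCH_1900 = date(1899, 12, 30)


def date_to_excel_serial(d):
    """Return the Excel serial number (1900 system) for a date.

    Hypothetical helper, not part of xlrd.
    """
    if d < date(1900, 3, 1):
        raise ValueError("dates before 1900-03-01 are not handled by this sketch")
    return (d - EXCEL_EPOCH_1900).days


serial = date_to_excel_serial(date(2013, 1, 1))
print(serial)  # 41275
```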
