OverflowError: size does not fit in an int - python

I am writing a Python script to use in AzureML. My dataset is quite big and has two columns, ID (int) and DataType (text). I would like to concatenate these values into a single text column that contains both the ID and the DataType, separated by a comma.
How can I avoid getting an error when I do this? Are there any mistakes in my code?
When I run this code I get the following error:
Error 0085: The following error occurred during script evaluation, please view the output log for more information:
---------- Start of error message from Python interpreter ----------
data:text/plain,Caught exception while executing function: Traceback (most recent call last):
File "C:\server\invokepy.py", line 167, in batch
idfs.append(rutils.RUtils.RFileToDataFrame(infile))
File "C:\server\RReader\rutils.py", line 15, in RFileToDataFrame
rreader = RReaderFactory.construct_from_file(filename, compressed)
File "C:\server\RReader\rreaderfactory.py", line 25, in construct_from_file
return _RReaderFactory.construct_from_stream(stream)
File "C:\server\RReader\rreaderfactory.py", line 46, in construct_from_stream
return RReader(BinaryReader(RFactoryConstants.big_endian, stream.read()))
File "C:\pyhome\lib\gzip.py", line 254, in read
self._read(readsize)
File "C:\pyhome\lib\gzip.py", line 313, in _read
self._add_read_data( uncompress )
File "C:\pyhome\lib\gzip.py", line 329, in _add_read_data
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
OverflowError: size does not fit in an int
My code is as below:
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1):
    import pandas as pd
    dataframe1['SignalID,DataType'] = dataframe1['ID'] + " , " + dataframe1['DataType']
    dataframe1 = dataframe1.drop('DataType')
    dataframe1 = dataframe1.drop('ID')
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1
I get the same error when I run the default python code in AzureML. So I am pretty sure my data just does not fit in the data frame.
The default script is the following:
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
#
# The entry point function can contain up to two input arguments:
# Param<dataframe1>: a pandas.DataFrame
# Param<dataframe2>: a pandas.DataFrame
def azureml_main(dataframe1 = None, dataframe2 = None):
    # Execution logic goes here
    print('Input pandas.DataFrame #1:\r\n\r\n{0}'.format(dataframe1))
    # If a zip file is connected to the third input port is connected,
    # it is unzipped under ".\Script Bundle". This directory is added
    # to sys.path. Therefore, if your zip file contains a Python file
    # mymodule.py you can import it using:
    # import mymodule
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1,

If you need to concatenate the integer ID column and the string DataType column into a new column SignalID, cast ID to a string with astype first. You can then drop the DataType and ID columns by passing axis=1 to drop:
import pandas as pd
def azureml_main(dataframe1):
    # Cast ID to str first; the parentheses let the expression span several lines
    dataframe1['SignalID'] = (dataframe1['ID'].astype(str)
                              + " , "
                              + dataframe1['DataType'])
    dataframe1 = dataframe1.drop(['DataType', 'ID'], axis=1)
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1,
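To see the cast and the column drop in action outside AzureML, here is a minimal sketch run on a two-row frame. The sample values are made up for illustration, and the trailing comma in the return follows the default script's convention of returning a sequence.

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Cast the integer ID to str so it can be joined with the text column
    dataframe1['SignalID'] = dataframe1['ID'].astype(str) + " , " + dataframe1['DataType']
    # axis=1 tells drop to look for column labels rather than row labels
    dataframe1 = dataframe1.drop(['DataType', 'ID'], axis=1)
    # One-element tuple, as in the default script
    return dataframe1,

# Hypothetical sample data, not taken from the question
sample = pd.DataFrame({'ID': [1, 2], 'DataType': ['temp', 'speed']})
result, = azureml_main(sample)
print(result)
```

Note that this only addresses the concatenation itself. The OverflowError in the traceback is raised while the input file is being read, before azureml_main is called.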

Related

Deleting rows from Python Dataframe with condition

I'm trying to delete some rows from a huge dataset in pandas. I decided to use the iterrows() function to find the indexes to delete (since I know that deleting while iterating is a bad idea).
Right now it looks like this:
list_to_delete = []
rows_to_delete = {}
for index, row in train.iterrows():
    if <some conditions>:
        list_to_delete.append(int(index))
        rows_to_delete[int(index)] = row
train = train.drop([train.index[i] for i in list_to_delete])
It gives me this error:
Traceback (most recent call last):
File "C:/Users/patka/PycharmProjects/PARSER/getStatistics.py", line 115, in <module>
train = train.drop([train.index[i] for i in list_to_delete])
File "C:/Users/patka/PycharmProjects/PARSER/getStatistics.py", line 115, in <listcomp>
train = train.drop([train.index[i] for i in list_to_delete])
File "C:\Users\patka\PycharmProjects\PARSER\venv\lib\site-packages\pandas\core\indexes\base.py", line 3958, in __getitem__
return getitem(key)
IndexError: index 25378 is out of bounds for axis 0 with size 25378
How is this possible?
Before that, I created a copy of the dataset and tried to delete the chosen rows from the copy (with inplace=True) while iterating through the original one. Unfortunately, that raised an error saying that a NoneType object has no attribute 'drop'.
I would appreciate your help very much.
An example row looks like this:
resolution Done
priority Major
created 2000-07-04T13:13:52.000+0200
status Resolved
Team XBee
changelog {'Team" : {'from':...
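A likely cause of the IndexError is that iterrows() yields index labels, while train.index[i] treats them as positions; once the index is not a contiguous 0..n-1 range, a label can exceed the frame's length. The sketch below sidesteps the bookkeeping with a boolean mask. The column names are borrowed from the example row, but the data and the condition itself are hypothetical.

```python
import pandas as pd

# Hypothetical data; the index is deliberately non-contiguous
train = pd.DataFrame(
    {
        'status': ['Resolved', 'Open', 'Resolved'],
        'priority': ['Major', 'Minor', 'Minor'],
    },
    index=[1, 2, 5],
)

# Hypothetical condition standing in for <some conditions>
mask = (train['status'] == 'Resolved') & (train['priority'] == 'Minor')

rows_deleted = train[mask]   # keep the removed rows if they are needed later
train = train[~mask]         # keep only the rows that do not match
print(train)
```

If the loop is kept instead, passing the collected labels directly, as in train.drop(list_to_delete), avoids the positional lookup. Also note that drop(..., inplace=True) returns None, which would explain the NoneType error if its result was assigned back to a variable.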

Why is pandas returning an empty dataframe from this SQL query when a WHERE clause is used?

I have a Python program that uses pandas to write query results to a CSV file, and I use one of the column values as the file folder name. One of the columns holds dates that are CAST to a plain date in the format 2018-10-31, and I need the program to generate files for today only. When I add an AND clause with CAST(GETDATE() AS DATE) or CONVERT(DATE, GETDATE()), the query returns an empty DataFrame. When I remove the AND clause it works fine, but it builds a CSV containing the rows for all dates in the table.
My error:
Empty DataFrame
Columns: [ ScheduleDate, Vendor, note, WOGFileFolder]
Index: []
Traceback (most recent call last):
File "testing.py", line 85, in <module>
WOG_folder = pickups_df.at[0, 'WOGFileFolder']
File "File Path Redacted", line 2141, in __getitem__
key = self._convert_key(key)
File "File Path Redacted", line 2227, in _convert_key
raise ValueError("At based indexing on an non-integer "
ValueError: At based indexing on an non-integer index can only have non-integer indexers
My code:
stmt_vendor = """
    SELECT DISTINCT
        Vendor
    FROM SOME_DB..SOME_TABLE;"""
cur.execute(stmt_vendor)
partners = cur.fetchall()
for partner in partners:
    stmt_vendor_pickup_list = """
        SELECT
            ScheduleDate
            , Vendor
            , note
            , WOGFileFolder
        FROM SOME_DB..SOME_TABLE
        WHERE Vendor = ?
        AND ScheduleDate = CAST(GETDATE() AS DATE)
        ORDER BY ScheduleDate;"""
    pickups_df = pd.read_sql_query(stmt_vendor_pickup_list, conn, params=partner)
    logger.info(pickups_df)
    WOG_folder = pickups_df.at[0, 'WOGFileFolder']
    vendor_dir = 'FILE_PATH_REDACTED' + WOG_folder
    if not os.path.exists(vendor_dir):
        os.mkdir(vendor_dir)
        logger.info("CREATED DIR FOR: " + WOG_folder)
    else:
        logger.info("DIR FOR " + WOG_folder + " ALREADY EXISTS")
    pickup_file_name = WOG_folder + '_testing.csv'
The program then continues on to build the CSV using this info.
Does it work if you put the "?" placeholder in quotes?
Note that if the backend is PostgreSQL, you must use single quotes.
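Whatever the reason the query returns no rows, the traceback itself comes from calling .at[0, 'WOGFileFolder'] on the empty frame. A minimal sketch of a guard follows; the helper name first_folder and the sample values are hypothetical, and in the loop above a None result would mean skipping that vendor.

```python
import pandas as pd

def first_folder(pickups_df):
    """Return the first WOGFileFolder value, or None when the query returned no rows."""
    if pickups_df.empty:
        return None
    # iloc[0] is positional, so it does not depend on the index labels
    return pickups_df['WOGFileFolder'].iloc[0]

# Hypothetical frames standing in for the read_sql_query results
empty = pd.DataFrame(columns=['ScheduleDate', 'Vendor', 'note', 'WOGFileFolder'])
filled = pd.DataFrame({
    'ScheduleDate': ['2018-10-31'],
    'Vendor': ['A'],
    'note': [''],
    'WOGFileFolder': ['vendor_a'],
})

print(first_folder(empty))
print(first_folder(filled))
```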

Python Pandas: creating a dataframe using a function for one of the fields

I am trying to create a dataframe where one of the fields is calculated using a function. To do this I use the following code:
import pandas as pd
def didSurvive(sex):
    return int(sex == "female")
titanic_df = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": titanic_df["PassengerId"],
    "Survived": didSurvive(titanic_df["Sex"])
})
submission.to_csv('titanic-predictions.csv', index=False)
When I run this code I get the following error:
D:\Documents\kaggle\titanic>python predictor.py
File "predictor.py", line 3
def didSurvive() {
^
SyntaxError: invalid syntax
D:\Documents\kaggle\titanic>python predictor.py
D:\Documents\kaggle\titanic>python predictor.py
D:\Documents\kaggle\titanic>python predictor.py
Traceback (most recent call last):
File "predictor.py", line 10, in <module>
"Survived": didSurvive(titanic_df["Sex"])
File "predictor.py", line 4, in didSurvive
return int(sex == "female")
File "C:\Python34\lib\site-packages\pandas\core\series.py", line 92, in wrapper
"{0}".format(str(converter)))
TypeError: cannot convert the series to <class 'int'>
D:\Documents\kaggle\titanic>
I think what is happening is I'm trying to run the int() on a series of booleans instead of an individual boolean. How do I go about fixing this?
To convert the data type of a Series, you can use the astype() function. This should work:
def didSurvive(sex):
    return (sex == "female").astype(int)
You can also convert the values while importing the CSV file. A converter is called on each cell value rather than on the whole Series, so it needs the scalar form of the function:
titanic_df = pd.read_csv("test.csv", converters={'Sex': lambda sex: int(sex == "female")})
submission = titanic_df[['PassengerId', 'Sex']].rename(columns={'Sex': 'Survived'})
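As a quick check of the element-wise conversion, the sketch below applies astype(int) to a small in-memory frame instead of test.csv. The three rows and the snake_case function name are made up for illustration.

```python
import pandas as pd

def did_survive(sex):
    # The comparison yields a boolean Series; astype(int) maps True/False to 1/0
    return (sex == "female").astype(int)

# Hypothetical rows standing in for the contents of test.csv
titanic_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

submission = pd.DataFrame({
    "PassengerId": titanic_df["PassengerId"],
    "Survived": did_survive(titanic_df["Sex"]),
})
print(submission)
```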

Error in scikit code

I am new to machine learning and am trying the Titanic problem from Kaggle. I have written the code below, which uses a decision tree to make predictions from the data. It raises an error that I am unable to get rid of.
Code :
#!/usr/bin/env python
from __future__ import print_function
import pandas as pd
import numpy as np
from sklearn import tree
train_uri = './titanic/train.csv'
test_uri = './titanic/test.csv'
train = pd.read_csv(train_uri)
test = pd.read_csv(test_uri)
# print(train[train["Sex"] == 'female']["Survived"].value_counts(normalize=True))
train['Child'] = float('NaN')
train['Child'][train['Age'] < 18] = 1
train['Child'][train['Age'] >= 18] = 0
# print(train[train['Child'] == 1]['Survived'].value_counts(normalize=True))
# print(train['Embarked'][train['Embarked'] == 'C'].value_counts())
# print(train.shape)
## Fill empty 'Embarked' values with 'S'
train['Embarked'] = train['Embarked'].fillna('S')
## Convert Embarked classes to integers
train["Embarked"][train["Embarked"] == "S"] = 0
train['Embarked'][train['Embarked'] == "C"] = 1
train['Embarked'][train['Embarked'] == "Q"] = 2
train['Sex'][train['Sex'] == 'male'] = 0
train['Sex'][train['Sex'] == 'female'] = 1
target = train['Survived'].values
features_a = train[['Pclass', 'Sex', 'Age', 'Fare']].values
tree_a = tree.DecisionTreeClassifier()
##### Line With Error #####
tree_a = tree_a.fit(features_a, target)
# print(tree_a.feature_importances_)
# print(tree_a.score(features_a, target))
Error:
Traceback (most recent call last):
File "titanic.py", line 40, in <module>
tree_a = tree_a.fit(features_a, target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 122, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This error does not occur when I run the code on the DataCamp server, but it does when I run it locally. I don't understand why it comes up: I have checked the data, and neither features_a nor target seems to contain NaN or very large values.
Try each feature one by one and you will probably find that one of them has some nulls. I note that you do not check whether Sex has nulls.
Also, coding each categorical variable by hand makes it easy to introduce an error, for example by misspelling one of the categories. Instead, you can use df = pd.get_dummies(df), which encodes all the categorical variables automatically, with no need to specify each category manually.
You can also try the pandas dropna() function to drop all rows that contain invalid values such as NaN.
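The three suggestions above can be combined into a short check. This is only a sketch on a made-up four-row frame that reuses the question's column names; which columns actually contain nulls in the real train.csv has to be verified on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN values are planted in Age and Fare on purpose
train = pd.DataFrame({
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, np.nan],
    'Survived': [0, 1, 1, 0],
})
feature_cols = ['Pclass', 'Sex', 'Age', 'Fare']

# 1. Count the nulls in each feature column
null_counts = train[feature_cols].isnull().sum()
print(null_counts)

# 2. Drop the rows that have a null in any feature column
clean = train.dropna(subset=feature_cols)

# 3. Encode the categorical columns automatically
features = pd.get_dummies(clean[feature_cols])
target = clean['Survived'].values
```

Dropping rows discards data; filling the gaps (for example with fillna) is an alternative when too many rows would be lost.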

Appending DataFrame to List in Pandas, Python

I have a file of data and want to select a specific State. From there I need to return the result as a list, but some years will have no data, so I need to fill in the missing values.
I am having an issue with my code; most likely something is slightly off in my for loop:
def stateCountAsList(filepath,state):
    import pandas as pd
    pd.set_option('display.width',200)
    import numpy as np
    dataFrame = pd.read_csv(filepath,header=0,sep='\t')
    df = dataFrame.iloc[0:638,:]
    dfState = df[df['State'] == state]
    yearList = range(1999,2012)
    countsList = []
    for dfState['Year'] in yearList:
        countsList = dfState['Count']
    else:
        countsList.append(np.nan)
    return countsList
    print countsList.tolist()
stateCountAsList(filepath, state)
state = 'California'
Traceback:
C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py:59: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
for dfState['Year'] in yearList:
Traceback (most recent call last):
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 67, in <module>
stateCountAsList(filepath, state)
File "C:\Users\Michael\workspace\UCIIntrotoPythonDA\src\Michael_Madani_week3.py", line 62, in stateCountAsList
countsList.append(np.nan)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\core\series.py", line 1466, in append
verify_integrity=verify_integrity)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 754, in concat
copy=copy)
File "C:\Users\Michael\Anaconda\lib\site-packages\pandas\tools\merge.py", line 805, in __init__
raise TypeError("cannot concatenate a non-NDFrame object")
TypeError: cannot concatenate a non-NDFrame object
Your code has at least two separate issues.
The warning "A value is trying to be set on a copy of a slice from a DataFrame." is triggered by for dfState['Year'] in yearList (line 59 in your code). In this line you mean to loop over a range of years (1999 to 2011), but instead you implicitly assign each year value to dfState['Year']. pandas raises the warning because dfState was obtained by slicing another DataFrame (df = dataFrame.iloc[0:638,:] followed by a boolean selection), so the assignment may land on a view or on a copy (http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy).
As mentioned, you don't want to assign a value to the DataFrame here, only loop over the years. So the for loop should look like:
for year in range(1999,2012):
    ...
The second issue is in line 62. Here you try to append np.nan to your "list" countsList, but countsList is no longer a list: two lines earlier you assign a pd.Series to it (countsList = dfState['Count']), which changes its type.
Series.append expects another Series, which gives you the TypeError: cannot concatenate a non-NDFrame object
With this information you should be able to correct your loop.
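One possible shape for the corrected loop is sketched below. It assumes each State/Year pair appears at most once; the function takes an already loaded DataFrame rather than a file path, and the sample rows are made up, so treat it as an illustration rather than the asker's final code.

```python
import numpy as np
import pandas as pd

def state_counts_as_list(df, state, years=range(1999, 2012)):
    """Return one count per year for `state`, with NaN for years that have no row."""
    df_state = df[df['State'] == state]
    counts = []  # stays a plain Python list throughout
    for year in years:
        match = df_state.loc[df_state['Year'] == year, 'Count']
        # Take the first matching count, or NaN when the year is missing
        counts.append(match.iloc[0] if not match.empty else np.nan)
    return counts

# Hypothetical sample data
sample = pd.DataFrame({
    'State': ['California', 'California', 'Texas'],
    'Year': [1999, 2001, 1999],
    'Count': [10, 12, 7],
})
counts = state_counts_as_list(sample, 'California', years=range(1999, 2003))
print(counts)
```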
As an alternative, you can get the desired result using the pandas query method (http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method-experimental):
def stateCountAsList(filepath,state):
    import pandas as pd
    dataFrame = pd.read_csv(filepath,header=0,sep='\t')
    df = dataFrame.iloc[0:638,:]
    # @state refers to the local Python variable; adjust the year bounds as needed
    stateList = df.query("(State == @state) & (Year > 1999) & (Year < 2005)")['Count'].tolist()
    return stateList
