Python is returning the following error message. Would anyone happen to know how to fix it?
TypeError: putmask() argument 1 must be numpy.ndarray, not numpy.int64
I got this error when trying to use pd.cut to classify my data into different bins, as seen below:
df['Credit Score'].fillna(0, inplace=True)
df['Credit Score'] = df['Credit Score'].astype(int)
bins = [0, 500, 550, 600, 650]
Summary_Scores = df.groupby(pd.cut('Credit Score', bins))
The data is being read from a csv file:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5729 entries, 0 to 252032
Data columns (total 5 columns):
Loan # 5729 non-null int64
Amount 5729 non-null float64
Issue Date 5729 non-null datetime64[ns]
Purposes 5661 non-null object
Credit Score 5729 non-null int32
dtypes: datetime64[ns](1), float64(1), int32(1), int64(1), object(1)
memory usage: 246.2+ KB
None
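One thing that stands out (an assumption, since the full traceback isn't shown): pd.cut is being passed the column name as a string rather than the column itself. A minimal sketch with the Series passed in:
import pandas as pd

# df is the frame from the question above
df['Credit Score'] = df['Credit Score'].fillna(0).astype(int)
bins = [0, 500, 550, 600, 650]
# Pass the actual Series to pd.cut, not the string 'Credit Score'
Summary_Scores = df.groupby(pd.cut(df['Credit Score'], bins))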
Related
I'd like to create a DataFrame from a csv with one datetime-typed column.
Following the article, the code below should create the needed DataFrame:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I do exactly the same steps, I get an object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried a couple of different ways to parse the datetime column and still can't get a DataFrame with a datetime dtype. So how do I parse a column as datetime (not object)?
When loading the csv, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format=True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)
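If the column still comes back as object after that, a hedged follow-up sketch is to coerce it explicitly and inspect the result (errors='coerce' turns unparseable values into NaT instead of leaving the whole column as object):
import pandas as pd

df = pd.read_csv(path)
# Convert after loading; values that don't parse become NaT rather than blocking the cast
df['published_at'] = pd.to_datetime(df['published_at'], utc=True, errors='coerce')
print(df['published_at'].dtype)          # datetime64[ns, UTC]
print(df['published_at'].isna().sum())   # how many values failed to parse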
For those of you working in Foundry's environment: I'm trying to build a pipeline in "Code repositories" to process a raw dataset (from an Excel file) into a clean one that I'll analyse in "Contour" later.
To that end I used Python, except that the pipeline seems to be using PySpark, and at some point I must convert the dataset I've cleaned with pandas into a PySpark one; that's where I'm stuck.
I've looked at several posts on Stack Overflow about converting a pandas DF to a PySpark DF, but none of the solutions seem to be working so far.
When I try to run the transform, there's always a datatype that fails to be converted, even though I forced a schema.
The Python code section has been tested successfully in Spyder (importing and exporting as an Excel file) and gives the expected result. It's only when I need to convert to PySpark that it somehow fails.
@transform_pandas(
    Output("/MDM_OUT_OF_SERVICE_EVENTS_CLEAN"),
    OOS_raw=Input("/MDM_OUT_OF_SERVICE_EVENTS"),
)
def DA_transform(OOS_raw):
    ''' Code Section in Python '''
    mySchema = StructType([
        StructField(OOS_dup.columns[0], IntegerType(), True),
        StructField(OOS_dup.columns[1], StringType(), True),
        ...])
    OOS_out = sqlContext.createDataFrame(OOS_dup, schema=mySchema, verifySchema=False)
    return OOS_out
I got this error message at some point:
AttributeError: 'unicode' object has no attribute 'toordinal'.
According to this post: What is causing 'unicode' object has no attribute 'toordinal' in pyspark?
it's because PySpark fails to convert the data into DateType, but the data is datetime64[ns] in pandas. I've tried converting these columns to string and integer, and it fails as well.
Here are the datatypes returned by pandas once the dataset has been cleaned:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4972 entries, 0 to 4971
Data columns (total 51 columns):
OOS_ID 4972 non-null int64
OPERATOR_CODE 4972 non-null object
ATA_CAUSE 4972 non-null int64
EVENT_CODE 3122 non-null object
AC_MODEL 4972 non-null object
AC_SN 4972 non-null int64
OOS_DATE 4972 non-null datetime64[ns]
AIRPORT_CODE 4915 non-null object
RTS_DATE 4972 non-null datetime64[ns]
EVENT_TYPE 4972 non-null object
CORRECTIVE_ACTION 417 non-null object
DD_HOURS_OOS 4972 non-null float64
EVENT_DESCRIPTION 4972 non-null object
EVENT_CATEGORY 4972 non-null object
ATA_REPORTED 324 non-null float64
TOTAL_CAUSES 4875 non-null float64
EVENT_NUMBER 3117 non-null float64
RTS_TIME 4972 non-null object
OOS_TIME 4972 non-null object
PREV_REPORTED 4972 non-null object
FERRY_IND 4972 non-null object
REPAIR_STN_CODE 355 non-null object
MAINT_DOWN_TIME 4972 non-null float64
LOGBOOK_RECORD_IDENTIFIER 343 non-null object
RTS_IND 4972 non-null object
READY_FOR_USE 924 non-null object
DQ_COMMENTS 2 non-null object
REVIEWED 5 non-null object
DOES_NOT_MEET_SPECS 4 non-null object
CORRECTED 12 non-null object
EDITED_BY 4972 non-null object
EDIT_DATE 4972 non-null datetime64[ns]
OUTSTATION_INDICATOR 3801 non-null object
COMMENT_TEXT 11 non-null object
ATA_CAUSE_CHAPTER 4972 non-null int64
ATA_CAUSE_SECTION 4972 non-null int64
ATA_CAUSE_COMPONENT 770 non-null float64
PROCESSOR_COMMENTS 83 non-null object
PARTS_AVAIL_AT_STATION 4972 non-null object
PARTS_SHIPPED_AT_STATION 4972 non-null object
ENGINEER_AT_STATION 4972 non-null object
ENGINEER_SENT_AT_STATION 4972 non-null object
SOURCE_FILE 4972 non-null object
OOS_Month 4972 non-null float64
OOS_Hour 4972 non-null float64
OOS_Min 4972 non-null float64
RTS_Month 4972 non-null float64
RTS_Hour 4972 non-null float64
RTS_Min 4972 non-null float64
OOS_Timestamp 4972 non-null datetime64[ns]
RTS_Timestamp 4972 non-null datetime64[ns]
dtypes: datetime64[ns](5), float64(12), int64(5), object(29)
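One thing worth checking before forcing a schema (an assumption based on the linked post): the toordinal error typically appears when a column declared as DateType/TimestampType in the schema still holds strings. A minimal sketch that makes the pandas dtypes explicit and lets Spark infer the schema:
import pandas as pd

# OOS_dup is the cleaned pandas frame; make the date columns explicit datetimes
OOS_dup['OOS_DATE'] = pd.to_datetime(OOS_dup['OOS_DATE'])
OOS_dup['RTS_DATE'] = pd.to_datetime(OOS_dup['RTS_DATE'])

# Let Spark infer the schema from the pandas dtypes instead of forcing one
OOS_out = sqlContext.createDataFrame(OOS_dup)
OOS_out.printSchema()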
In case it might help some of you, I found this in the official Foundry documentation on how to properly transition between pandas and PySpark DFs.
OOS_dup is my pandas dataframe that I want to convert back to Spark.
# Extract the name of each column with its data type in pandas
col = OOS_dup.columns
col_type = list()
for c in col:
    t = OOS_dup[c].dtype.name
    col_type.append(t)
df_schema = pd.DataFrame({"field": col, "data_type": col_type})

# Convert the cleaned pandas dataframe to a Spark dataframe first,
# so that the Spark column functions below can be applied to it
OOS_dup = sqlContext.createDataFrame(OOS_dup)

# Define a function to replace missing (NaN) cells with null
def replace_missing(df, col_names):
    for col in col_names:
        df = df.withColumn(col, F.when(df[col] == "NaN", None).otherwise(df[col]))
    return df

# Replace missing values
OOS_dup = replace_missing(OOS_dup, col)

# Define a function to change column types to the proper type in Spark
def change_type(df, col_names, dtypes):
    for col, dtype in zip(col_names, dtypes):
        if dtype == "float64":
            df = df.withColumn(col, df[col].cast("double"))
        elif dtype == "int64":
            df = df.withColumn(col, df[col].cast("int"))
        elif dtype == "datetime64[ns]":
            df = df.withColumn(col, df[col].cast("date"))
        else:
            df = df.withColumn(col, df[col].cast("string"))
    return df

# Cast each column to the proper data type
OOS_dup = change_type(OOS_dup, df_schema["field"], df_schema["data_type"])
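A quick sanity check after the conversion (a sketch; the column names come from the listing above, and whether sqlContext is available as shown depends on the Foundry runtime):
# Confirm the casts landed as expected before returning the dataframe
OOS_dup.select('OOS_DATE', 'RTS_DATE', 'MAINT_DOWN_TIME').show(5)
print(OOS_dup.dtypes)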
First I import the whole file and get a memory consumption of 1002.0+ KB
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
)
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 32063 entries, 0 to 32062
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1002.0+ KB
# None
Then I drop the NaN rows, run the script again, and get a memory consumption of 1.2+ MB:
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
).dropna(how="all")
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 32062 entries, 0 to 32061
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1.2+ MB
# None
Since I'm dropping one row, I would expect memory consumption to go down, or at least remain the same, not this.
Does anybody know why this is happening, how to fix it, or whether this is a bug?
EDIT: chicago.csv
The change comes from the fact that your index changed from a RangeIndex to an Int64Index, which takes more memory.
You can "fix" this by resetting the index after the dropna(), but this will have the side effect of changing the row index (which you may not care about).
Here is an illustrative example:
First create a sample DataFrame:
df = pd.DataFrame({"a": range(10000)})
df.loc[1000, "a"] = None
Print the info:
print(df.info())
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 10000 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
Drop the na values:
print(df.dropna().info())
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 9999 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 156.2 KB
Reset (and drop) the index:
df.dropna().reset_index(drop=True).info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 9999 entries, 0 to 9998
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
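To see where the difference comes from, the index can report its own footprint (a quick sketch using the same toy frame; an Int64Index stores one 8-byte label per row, while a RangeIndex only stores start/stop/step):
print(df.dropna().index.memory_usage())                         # roughly 8 bytes * 9999 rows
print(df.dropna().reset_index(drop=True).index.memory_usage())  # a small constant (around 128 bytes)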
This is not a bug; it is working as intended. You are loading the file in, so it takes that amount of memory as before, plus a bit more, because you are then searching through the dataframe and removing the rows that have NaN, which adds memory usage.
I'm trying to read the following file using pandas. The code that I'm using is the following:
df = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';', nrows=5)
The df.info() is giving the correct output.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 9 columns):
Date 5 non-null object
Time 5 non-null object
Global_active_power 5 non-null float64
Global_reactive_power 5 non-null float64
Voltage 5 non-null float64
Global_intensity 5 non-null float64
Sub_metering_1 5 non-null float64
Sub_metering_2 5 non-null float64
Sub_metering_3 5 non-null float64
dtypes: float64(7), object(2)
memory usage: 440.0+ bytes
But when I try to read the entire data set using the same code, just without nrows:
df_all = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';')
the column types become object:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
Global_active_power object
Global_reactive_power object
Voltage object
Global_intensity object
Sub_metering_1 object
Sub_metering_2 object
Sub_metering_3 float64
dtypes: float64(1), object(6)
memory usage: 126.7+ MB
Can anyone please tell me why this is happening? And how to resolve it?
Thanks!
My guess would be that when you read the full data set in, there are values in the additional rows that are being interpreted as different data types (for example, floats interpreted as integers). You can specify the data types explicitly using the dtype argument in read_csv - see the docs here.
Alternatively you could try to force the data types after loading the data; e.g. like so:
df["Global_active_power"] = df["Global_active_power"].astype(float)
I have a dataframe df_F1
df_F1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 7 columns):
class_energy 2 non-null object
ACT_TIME_AERATEUR_1_F1 2 non-null float64
ACT_TIME_AERATEUR_1_F3 2 non-null float64
dtypes: float64(6), object(1)
memory usage: 128.0+ bytes
df_F1.head()
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
low 5.875550 431
medium 856.666667 856
I am trying to create a dataframe Ratio which contains, for each class_energy, the energy value of each ACT_TIME_AERATEUR_1_Fx divided by the sum of energy over all class_energy values for that ACT_TIME_AERATEUR_1_Fx. For example:
ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
low 5.875550/(5.875550 + 856.666667) 431/(431+856)
medium 856.666667/(5.875550+856.666667) 856/(431+856)
Can you please help me resolve this?
Thank you in advance.
Best regards
you can do this:
In [20]: df.set_index('class_energy').apply(lambda x: x/x.sum()).reset_index()
Out[20]:
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
0 low 0.006812 0.334887
1 medium 0.993188 0.665113
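An equivalent way to spell the same idea, in case the apply version is unclear (a sketch that rebuilds the small frame from the question):
import pandas as pd

df = pd.DataFrame({
    'class_energy': ['low', 'medium'],
    'ACT_TIME_AERATEUR_1_F1': [5.875550, 856.666667],
    'ACT_TIME_AERATEUR_1_F3': [431, 856],
})

num = df.set_index('class_energy')
# Each column is divided by its own column sum
Ratio = (num / num.sum()).reset_index()
print(Ratio)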