Error when summarizing data using pandas_profiling - python

I am trying to use the pandas_profiling ProfileReport method:
from pandas_profiling import ProfileReport
ProfileReport(data)
I have tried updating Jupyter, pandas and Python through conda, but I am still getting the following error:
BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
this is the structure of the data:
RangeIndex: 32593 entries, 0 to 32592
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 code_module 32593 non-null object
1 code_presentation 32593 non-null object
2 id_student 32593 non-null int64
3 gender 32593 non-null object
4 region 32593 non-null object
5 highest_education 32593 non-null object
6 imd_band 31482 non-null object
7 age_band 32593 non-null object
8 num_of_prev_attempts 32593 non-null int64
9 studied_credits 32593 non-null int64
10 disability 32593 non-null object
11 final_result 32593 non-null object
dtypes: int64(3), object(9)
memory usage: 3.0+ MB

Related

'<' not supported between instances of 'str' and 'int' in Python

When I try to create a new variable in the dataframe Call08q1_09q1 by adding two float variables
Call08q1_09q1['MBS']=Call08q1_09q1['RCFD8639']+Call08q1_09q1['RCFD2170']
the error below shows up:
'<' not supported between instances of 'str' and 'int' in Python
However, I don't have any strings in my dataframe.
Call08q1_09q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39675 entries, 0 to 39674
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RSSD9001 39675 non-null float64
1 RSSD9999 39675 non-null float64
2 RCFD2170 39673 non-null float64
3 RCFD8639 38166 non-null float64
4 RCFD8641 38166 non-null float64
5 RCFD8639 38166 non-null float64
6 RCFD0211 38166 non-null float64
7 RCFD1287 38166 non-null float64
8 RCON3531 1107 non-null float64
9 RCFD1289 38166 non-null float64
10 RCFD1294 38166 non-null float64
11 RCFD1293 38166 non-null float64
12 RCFD1298 38166 non-null float64
13 RCON3532 1111 non-null float64
14 RCFD3210 38443 non-null float64
15 RIAD4230 38398 non-null float64
16 RIAD4340 38441 non-null float64
17 RCFD2122 39644 non-null float64
18 RCFD2125 249 non-null float64
19 RCFD1600 52 non-null float64
dtypes: float64(20)
You have loads of nulls in your columns as the printout tells you. How are those represented? Can you add these nulls with ints? I suggest you debug by inspecting these null values and taking appropriate action to fill them, drop them, or otherwise transform them into something useful.
The error did not occur in the line of code below, since it does not contain any comparison operator (<, >, ...).
Call08q1_09q1['MBS']= Call08q1_09q1['RCFD8639'] + Call08q1_09q1['RCFD2170']
The error is raised when a string is compared with a number (int), as in the scenario below:
s="1"
n= 3
s < n
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [1], line 3
1 s="1"
2 n= 3
----> 3 s<n
TypeError: '<' not supported between instances of 'str' and 'int'
To fix that, cast the string to a number:
int(s) < n
#True
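One more thing that may be worth checking, given that the info() printout above lists RCFD8639 twice: when a column label is duplicated, selecting it by name returns a DataFrame rather than a Series, so the addition no longer operates on two plain float columns. A minimal sketch on a hypothetical three-column frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame in which the label "RCFD8639" appears twice.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    columns=["RCFD8639", "RCFD2170", "RCFD8639"],
)

# A duplicated label selects every matching column, so this is a DataFrame.
selected = df["RCFD8639"]
print(type(selected).__name__, selected.shape)

# Keep only the first occurrence of each label, then select again.
deduped = df.loc[:, ~df.columns.duplicated()]
print(type(deduped["RCFD8639"]).__name__)

# With unique labels the sum is an ordinary element-wise Series addition.
deduped = deduped.assign(MBS=deduped["RCFD8639"] + deduped["RCFD2170"])
print(deduped["MBS"].tolist())
```

Whether dropping the second copy is appropriate depends on whether the two columns really hold the same data, which the printout alone does not show.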

Why is match.columns.get_loc returning a boolean array, not an index?

I'm trying to identify the index position of a particular column name in Python. I used this exact same method previously on the same dataframe and it returned the number of the index position of the column name. However, in this case it doesn't seem to be working. Here is the relevant code:
The dataframe:
match.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25979 entries, 0 to 25978
Data columns (total 68 columns):
id_x 25979 non-null int64
country_id 25979 non-null int64
league_id 25979 non-null int64
season 25979 non-null object
stage 25979 non-null int64
date 25979 non-null object
match_api_id 25979 non-null int64
home_team_api_id 25979 non-null int64
away_team_api_id 25979 non-null int64
home_team_goal 25979 non-null int64
away_team_goal 25979 non-null int64
home_player_1 24755 non-null float64
home_player_2 24664 non-null float64
home_player_3 24698 non-null float64
home_player_4 24656 non-null float64
home_player_5 24663 non-null float64
home_player_6 24654 non-null float64
home_player_7 24752 non-null float64
home_player_8 24670 non-null float64
home_player_9 24706 non-null float64
home_player_10 24543 non-null float64
home_player_11 24424 non-null float64
away_player_1 24745 non-null float64
away_player_2 24701 non-null float64
away_player_3 24686 non-null float64
away_player_4 24658 non-null float64
away_player_5 24644 non-null float64
away_player_6 24666 non-null float64
away_player_7 24744 non-null float64
away_player_8 24638 non-null float64
away_player_9 24651 non-null float64
away_player_10 24538 non-null float64
away_player_11 24425 non-null float64
goal 14217 non-null object
shoton 14217 non-null object
shotoff 14217 non-null object
foulcommit 14217 non-null object
card 14217 non-null object
cross 14217 non-null object
corner 14217 non-null object
possession 14217 non-null object
BSA 14161 non-null float64
Home Team 25979 non-null object
Away Team 25979 non-null object
name_x 25979 non-null object
name_y 25979 non-null object
home_player_1 24755 non-null object
home_player_2 24664 non-null object
home_player_3 24698 non-null object
home_player_4 24656 non-null object
home_player_5 24663 non-null object
home_player_6 24654 non-null object
home_player_7 24752 non-null object
home_player_8 24670 non-null object
home_player_9 24706 non-null object
home_player_10 24543 non-null object
home_player_11 24424 non-null object
away_player_1 24745 non-null object
away_player_2 24701 non-null object
away_player_3 24686 non-null object
away_player_4 24658 non-null object
away_player_5 24644 non-null object
away_player_6 24666 non-null object
away_player_7 24744 non-null object
away_player_8 24638 non-null object
away_player_9 24651 non-null object
away_player_10 24538 non-null object
away_player_11 24425 non-null object
dtypes: float64(23), int64(9), object(36)
Rest of code:
#remove rows that dont contain player names
column_start = match.columns.get_loc("home_player_1")
column_start
column_end = match.columns.get_loc("away_player_11")
columns = match.columns[column_start:column_end]
#match.dropna(axis=columns)
This causes the following error:
TypeError: only integer scalar arrays can be converted to a scalar index
The problem is that both columns are duplicated: home_player_1 and also away_player_11 (and many other columns too).
So if the duplicated columns hold the same values, you can remove the duplicates with:
match = match.loc[:, ~match.columns.duplicated()]
Or you can deduplicate the column names with:
s = match.columns.to_series()
match.columns = (match.columns +
s.groupby(s).cumcount().astype(str).radd('_').str.replace('_0',''))
You have to check whether the index is monotonic, because if it is not (and the label is duplicated), get_loc will return a boolean array instead of the index number.
print(df.index.is_monotonic_increasing)
If you don't want to modify the index, you can add a step like:
df.index[matchArray == True].tolist()
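The behaviour described in these answers can be reproduced on a small Index. The sketch below uses made-up labels and shows that get_loc returns an integer for a unique label and a boolean mask for a label that is duplicated in a non-monotonic Index. Using np.flatnonzero to turn the mask into integer positions is one option, not something the original answers prescribe.

```python
import numpy as np
import pandas as pd

# Unique labels: get_loc returns a single integer position.
unique_cols = pd.Index(["id", "home_player_1", "away_player_11"])
loc_unique = unique_cols.get_loc("home_player_1")
print(loc_unique)

# Duplicated label in a non-monotonic Index: get_loc returns a boolean mask.
dup_cols = pd.Index(["home_player_1", "goal", "away_player_11", "home_player_1"])
mask = dup_cols.get_loc("home_player_1")
print(mask)

# Convert the mask to integer positions and pick the first occurrence.
positions = np.flatnonzero(mask)
first_position = int(positions[0])
print(positions.tolist(), first_position)
```

An integer such as first_position can then be used for slicing, which a boolean mask cannot.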

Select columns in an indexed dataframe

I've imported a csv file with pd.read_csv, using parse_dates and index_col.
This results in the following dataframe:
DatetimeIndex: 195972 entries, 2018-02-01 to 2019-10-25
Data columns (total 19 columns):
account_manager 195972 non-null object
article_des 195896 non-null object
article_n 195972 non-null object
article_o 195972 non-null object
budget_code 195972 non-null object
budget_naam 195972 non-null object
country 195972 non-null object
currency 195972 non-null object
customer 195972 non-null object
industrie 195972 non-null object
klantnaam 195972 non-null object
month 195972 non-null int64
revenue 195972 non-null float64
revenue_local 195972 non-null float64
sap_code 195972 non-null object
volume 195972 non-null float64
week 195972 non-null int64
weight 195972 non-null float64
year 195972 non-null int64
dtypes: float64(4), int64(3), object(12)
memory usage: 20.9+ MB
None
I've tried every possible way to select only one column (weight) from this dataframe into a new dataframe. None of them work. What's the trick to selecting columns in an indexed dataframe?
If I import the csv without an index_col I can make any selection I want.
df['weight'] will return a series while df[['weight']] will return a DataFrame.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print(df)
# a b
# 0 1 2
# 1 3 4
print(type(df['a']))
# <class 'pandas.core.series.Series'>
print(type(df[['a']]))
# <class 'pandas.core.frame.DataFrame'>
To select multiple columns, you have to pass a list of column names.
print(df[['a', 'b']])
# a b
# 0 1 2
# 1 3 4
So to select a single column as a DataFrame, pass a list of one element.
print(df[['a']])
# a
# 0 1
# 1 3
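The same selection works when the frame has a DatetimeIndex created through index_col. The sketch below is an illustration under assumptions: the CSV text and the column names date, weight and volume are made up, standing in for the real file.

```python
import io
import pandas as pd

# Made-up CSV standing in for the real file; column names are illustrative.
csv_text = """date,weight,volume
2018-02-01,10.5,3.0
2018-02-02,11.0,4.0
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

weight_series = df["weight"]    # Series, indexed by date
weight_frame = df[["weight"]]   # one-column DataFrame, same DatetimeIndex

# The column passed as index_col is now the index, not a regular column.
print("date" in df.columns)
print(type(weight_frame.index).__name__)
```

Note that the column used as index_col can no longer be selected with df['date']; it is reachable as df.index instead.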

DataFrame filtering by datetime column is off

I have set up two dataframes and attempted to filter the results by moving a datetime column to the index and using .last('7D') to pull the entries whose datetime is stamped within the last seven days. It worked for the first dataframe, but not the second. I have tried several variations of the filter to get what I need, but cannot get accurate output. I'm at a loss! This has been built iteratively as well, so if you see any refactoring opportunities, let me know.
Original DataFrame: engagements
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 15 columns):
REQ_NAME 2572 non-null object
REQ_ID 2572 non-null object
STATUS 2572 non-null object
full_name 2572 non-null object
BIZ_UNIT 2572 non-null object
COMPLEXITY 2378 non-null object
PRIORITY 2390 non-null object
OPEN_DATE 2572 non-null datetime64[ns]
REQ_DATE 2572 non-null object
REQ_CAT 2572 non-null object
REQ_NOTE 2572 non-null object
CostCenter 2572 non-null int64
TargetCompletionDate 2572 non-null object
UpdateDTTM 2514 non-null datetime64[ns]
age 2572 non-null timedelta64[ns]
dtypes: datetime64[ns](2), int64(1), object(11), timedelta64[ns](1)
memory usage: 301.5+ KB
Separating DataFrame:
active_engagements = engagements[engagements['STATUS'].isin(active_status)]
comp_engagements = engagements[engagements['STATUS'].isin(comp_status)]
First Filter:
act_eng_open_lw = active_engagements.set_index('OPEN_DATE')
act_eng_open_lw = act_eng_open_lw.last('7D')
Output is the 10 rows of data I expect to see
Problem Child DataFrame:
act_eng_comp_lw = comp_engagements.set_index('UpdateDTTM')
act_eng_comp_lw = act_eng_comp_lw.last('7D')
Output is 105 rows, where I would expect 32
Info calls on both filtered DFs: act_eng_open_lw:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 2019-12-20 to 2019-12-26
Data columns (total 14 columns):
REQ_NAME 10 non-null object
REQ_ID 10 non-null object
STATUS 10 non-null object
full_name 10 non-null object
BIZ_UNIT 10 non-null object
COMPLEXITY 5 non-null object
PRIORITY 5 non-null object
REQ_DATE 10 non-null object
REQ_CAT 10 non-null object
REQ_NOTE 10 non-null object
CostCenter 10 non-null int64
TargetCompletionDate 10 non-null object
UpdateDTTM 5 non-null datetime64[ns]
age 10 non-null timedelta64[ns]
dtypes: datetime64[ns](1), int64(1), object(11), timedelta64[ns](1)
memory usage: 1.2+ KB
act_eng_comp_lw
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 105 entries, 2019-12-26 to 2019-11-27
Data columns (total 14 columns):
REQ_NAME 105 non-null object
REQ_ID 105 non-null object
STATUS 105 non-null object
full_name 105 non-null object
BIZ_UNIT 105 non-null object
COMPLEXITY 102 non-null object
PRIORITY 104 non-null object
OPEN_DATE 105 non-null datetime64[ns]
REQ_DATE 105 non-null object
REQ_CAT 105 non-null object
REQ_NOTE 105 non-null object
CostCenter 105 non-null int64
TargetCompletionDate 105 non-null object
age 105 non-null int64
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 12.3+ KB
Question: Using the same filter, why is one Datetime column filtering properly with .last and the other is not?
I ended up changing the method I was using to catch the last 7 days, versus .last:
act_eng_open_lw = act_eng_open_lw[act_eng_open_lw.index > dt.datetime.now() - pd.to_timedelta("7day")]
This method works on both of my dataframes effectively.
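The info output above shows the second frame's index running from 2019-12-26 to 2019-11-27, so it is not sorted in ascending order, and the pandas documentation describes .last as intended for a sorted DatetimeIndex, which may explain the difference. A boolean mask does not depend on row order. The sketch below uses made-up rows and measures the window from the newest timestamp in the index rather than from the current time, purely so the example is reproducible.

```python
import pandas as pd

# Made-up rows with a deliberately unsorted DatetimeIndex.
df = pd.DataFrame(
    {"REQ_ID": ["a", "b", "c", "d"]},
    index=pd.to_datetime(["2019-12-26", "2019-11-27", "2019-12-21", "2019-12-10"]),
)

# Seven-day window ending at the newest timestamp (an assumption for this
# example; the answer above measures from dt.datetime.now() instead).
cutoff = df.index.max() - pd.Timedelta(days=7)

# The mask is evaluated row by row, so the order of the index does not matter.
recent = df[df.index > cutoff]
print(recent["REQ_ID"].tolist())
```

Calling sort_index() before filtering is another option if the rows should also come out in chronological order.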

Datatype errors when converting a dataframe from pandas to PySpark in Foundry

For those of you working in Foundry's environment: I'm trying to build a pipeline in "Code repositories" to process a raw dataset (from an Excel file) into a clean one that I'll analyse in "Contour" later.
To that end I used Python, except that the pipeline seems to be using PySpark, so at some point I must convert the dataset I've cleaned with pandas into a PySpark one, and that's where I'm stuck.
I've looked at several posts on Stack Overflow about converting a pandas DataFrame to a PySpark DataFrame, but none of the solutions has worked so far.
When I try to run the transform, there's always a datatype that fails to convert, even though I forced a schema.
The Python code section has been tested successfully in Spyder (importing and exporting as an Excel file) and gives the expected result. It's only when I need to convert to PySpark that it fails.
@transform_pandas(
    Output("/MDM_OUT_OF_SERVICE_EVENTS_CLEAN"),
    OOS_raw=Input("/MDM_OUT_OF_SERVICE_EVENTS"),
)
def DA_transform(OOS_raw):
    ''' Code Section in Python '''
    mySchema = StructType([
        StructField(OOS_dup.columns[0], IntegerType(), True),
        StructField(OOS_dup.columns[1], StringType(), True),
        ...])
    OOS_out = sqlContext.createDataFrame(OOS_dup, schema=mySchema, verifySchema=False)
    return OOS_out
I got this error message at some point:
AttributeError: 'unicode' object has no attribute 'toordinal'.
According to this post: What is causing 'unicode' object has no attribute 'toordinal' in pyspark?
it's because PySpark fails to convert the data into DateType,
but the data is datetime64[ns] in pandas. I've tried converting these columns to string and integer, and that fails as well.
Here is a picture of the output dataset from Python:
Here are the datatypes returned by pandas once the dataset has been cleaned:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4972 entries, 0 to 4971
Data columns (total 51 columns):
OOS_ID 4972 non-null int64
OPERATOR_CODE 4972 non-null object
ATA_CAUSE 4972 non-null int64
EVENT_CODE 3122 non-null object
AC_MODEL 4972 non-null object
AC_SN 4972 non-null int64
OOS_DATE 4972 non-null datetime64[ns]
AIRPORT_CODE 4915 non-null object
RTS_DATE 4972 non-null datetime64[ns]
EVENT_TYPE 4972 non-null object
CORRECTIVE_ACTION 417 non-null object
DD_HOURS_OOS 4972 non-null float64
EVENT_DESCRIPTION 4972 non-null object
EVENT_CATEGORY 4972 non-null object
ATA_REPORTED 324 non-null float64
TOTAL_CAUSES 4875 non-null float64
EVENT_NUMBER 3117 non-null float64
RTS_TIME 4972 non-null object
OOS_TIME 4972 non-null object
PREV_REPORTED 4972 non-null object
FERRY_IND 4972 non-null object
REPAIR_STN_CODE 355 non-null object
MAINT_DOWN_TIME 4972 non-null float64
LOGBOOK_RECORD_IDENTIFIER 343 non-null object
RTS_IND 4972 non-null object
READY_FOR_USE 924 non-null object
DQ_COMMENTS 2 non-null object
REVIEWED 5 non-null object
DOES_NOT_MEET_SPECS 4 non-null object
CORRECTED 12 non-null object
EDITED_BY 4972 non-null object
EDIT_DATE 4972 non-null datetime64[ns]
OUTSTATION_INDICATOR 3801 non-null object
COMMENT_TEXT 11 non-null object
ATA_CAUSE_CHAPTER 4972 non-null int64
ATA_CAUSE_SECTION 4972 non-null int64
ATA_CAUSE_COMPONENT 770 non-null float64
PROCESSOR_COMMENTS 83 non-null object
PARTS_AVAIL_AT_STATION 4972 non-null object
PARTS_SHIPPED_AT_STATION 4972 non-null object
ENGINEER_AT_STATION 4972 non-null object
ENGINEER_SENT_AT_STATION 4972 non-null object
SOURCE_FILE 4972 non-null object
OOS_Month 4972 non-null float64
OOS_Hour 4972 non-null float64
OOS_Min 4972 non-null float64
RTS_Month 4972 non-null float64
RTS_Hour 4972 non-null float64
RTS_Min 4972 non-null float64
OOS_Timestamp 4972 non-null datetime64[ns]
RTS_Timestamp 4972 non-null datetime64[ns]
dtypes: datetime64[ns](5), float64(12), int64(5), object(29)
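A pandas object column can hold values of several Python types at once, and info() does not show which. Before forcing a schema, it may help to list the types actually present in each object column. The sketch below is only a debugging aid under assumptions: the helper name python_types_per_object_column and the sample values are hypothetical, not taken from the real dataset.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample: RTS_TIME mixes time objects, strings and missing values.
pdf = pd.DataFrame({
    "OOS_ID": [1, 2, 3],
    "RTS_TIME": pd.Series([dt.time(8, 30), "09:15", None], dtype=object),
})

def python_types_per_object_column(frame):
    """Return {column: sorted names of the Python types found} for object columns."""
    return {
        name: sorted({type(value).__name__ for value in frame[name]})
        for name in frame.columns
        if frame[name].dtype == object
    }

result = python_types_per_object_column(pdf)
print(result)
```

A column that reports more than one type is a candidate for explicit cleaning in pandas before the conversion to Spark.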
In case it might help some of you, I found this in the official Foundry documentation on how to properly transition between pandas and PySpark DataFrames.
OOS_dup is my Pandas dataframe I want to convert back to Spark.
# Extract the name of each column with its data type in pandas
col = OOS_dup.columns
col_type = list()
for c in col:
    t = OOS_dup[c].dtype.name
    col_type.append(t)
df_schema = pd.DataFrame({"field": col, "data_type": col_type})
# Define a function to replace missing (NaN) cells with Null
def replace_missing(df, col_names):
    for col in col_names:
        df = df.withColumn(col, F.when(df[col] == "NaN", None).otherwise(df[col]))
    return df
# Replace missing values
OOS_dup = replace_missing(OOS_dup, col)
# Define a function to change column types to the proper type in Spark
def change_type(df, col_names, dtypes):
    for col in col_names:
        df = df.withColumn(
            col,
            F.when(dtypes == "float64", df[col].cast("double"))
            .when(dtypes == "int64", df[col].cast("int"))
            .when(dtypes == "datetime64[ns]", df[col].cast("date"))
            .otherwise(df[col].cast("string")),
        )
    return df
# Cast each column to the proper data type
OOS_dup = change_type(OOS_dup, df_schema["field"], df_schema["data_type"])
OOS_dup = sqlContext.createDataFrame(OOS_dup)
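The type mapping used in change_type can also be written once as a dictionary keyed by the pandas dtype name and checked in pandas alone, before any Spark call. This is a sketch under assumptions: the names SPARK_TYPE_BY_PANDAS_DTYPE and spark_type_names are hypothetical, the sample frame is made up, and the mapping only mirrors the casts shown above.

```python
import pandas as pd

# Mapping taken from the casts in change_type above; anything else becomes "string".
SPARK_TYPE_BY_PANDAS_DTYPE = {
    "float64": "double",
    "int64": "int",
    "datetime64[ns]": "date",
}

def spark_type_names(pdf):
    """Return {column name: Spark type name} for a pandas DataFrame."""
    return {
        name: SPARK_TYPE_BY_PANDAS_DTYPE.get(pdf[name].dtype.name, "string")
        for name in pdf.columns
    }

# Hypothetical sample with one column of each kind.
sample = pd.DataFrame({
    "OOS_ID": pd.Series([1, 2], dtype="int64"),
    "DD_HOURS_OOS": pd.Series([1.5, 2.0], dtype="float64"),
    "OOS_DATE": pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"])).astype("datetime64[ns]"),
    "EVENT_TYPE": ["A", "B"],
})

type_names = spark_type_names(sample)
print(type_names)
```

The resulting dictionary gives one target type per column, which makes it easy to see in advance which columns would fall through to string.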
