I tried to sum multiple values in the format 'hh:mm:ss' held in a pandas Series. The column dtype is object and the values are datetime.time, but summing raises an error. Could you please point me to the best way to do this?
Here is the code:
import pandas as pd
school_diesel =pd.read_excel(r'*********************** Diesel Log 23-09-2019 17_59_05.xlsx',heading=[1,2])
school_running = pd.read_excel(r'*************Daily Log 23-09-2019 18_09_41.xlsx',0)
school_diesel.columns = school_diesel.iloc[0] # replace headings with next row values
school_running.columns = school_running.iloc[0] # replace headings with next row values
school_running.columns
school_diesel.columns
school_diesel.drop(school_diesel.head(1).index, inplace=True) #drop first row of the table- as this repeated heading
school_running.drop(school_running.head(1).index, inplace=True) #drop first row of the table- as this repeated heading
The data types of each field are:
input: school_running.info()
output: <class 'pandas.core.frame.DataFrame'>
Int64Index: 12469 entries, 1 to 12469
Data columns (total 25 columns):
Sno 12468 non-null object
City 12467 non-null object
Zone 12467 non-null object
Branch 12467 non-null object
Building Code 12383 non-null object
Branch Type 12305 non-null object
AC or Non AC 11405 non-null object
Student Strength 12467 non-null object
Company Name 12381 non-null object
Gen SNo 12467 non-null object
Capacity KVA 12381 non-null object
Fuel Capacity 12467 non-null object
Last Diesel Purchase 12467 non-null object
Purchase Qty 12467 non-null object
Amount 12467 non-null object
Last Fuel Filled 12467 non-null object
Filling Qty 12467 non-null object
Diesel Opening Qty 12467 non-null object
Generator On Date 12467 non-null object
Generator Off Date 12467 non-null object
Running Hours 12466 non-null object
Consumed Units 12466 non-null object
Diesel Consumed 12466 non-null object
Diesel Balance Qty 12466 non-null object
Remarks 6267 non-null object
dtypes: object(25)
memory usage: 1.3+ MB
The error occurred at this line:
school_running['Running Hours'].sum()
The error is:
----> 1 school_running['Running Hours'].sum()
TypeError: unsupported operand type(s) for +: 'datetime.time' and 'datetime.time'
The expected output is the sum of the total Running Hours.
The time data is:
school_running['Running Hours'].head(10)
1 00:00:00
2 00:00:00
3 00:25:00
4 00:00:00
5 00:00:00
6 00:00:00
7 00:00:00
8 00:00:00
9 00:00:00
10 01:20:00
Name: Running Hours, dtype: object
You have to convert them into Timedeltas. str() of a datetime.time value is already 'HH:MM:SS', which pd.to_timedelta can parse directly. The following gives you the total in seconds:
df['Running Hours'] = pd.to_timedelta(df['Running Hours'].astype(str), errors='coerce')
df['Running Hours'].sum().total_seconds()
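To make the conversion concrete, here is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the spreadsheet). It assumes every non-missing cell is a datetime.time; anything that cannot be parsed becomes NaT and is skipped by sum().

```python
import datetime as dt

import pandas as pd

# Hypothetical sample mimicking the 'Running Hours' column (object dtype)
running = pd.Series(
    [dt.time(0, 0), dt.time(0, 25), dt.time(1, 20), None], dtype=object
)

# str(datetime.time) is 'HH:MM:SS', which to_timedelta understands;
# errors="coerce" turns unparseable cells into NaT
as_delta = pd.to_timedelta(running.astype(str), errors="coerce")

total = as_delta.sum()                    # NaT entries are ignored
seconds = total.total_seconds()
hours = total / pd.Timedelta(hours=1)     # total expressed in hours

print(total, seconds, hours)
```

Dividing by pd.Timedelta(hours=1) is handy when the report needs decimal hours rather than seconds.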
When I try to create a new variable in the dataframe Call08q1_09q1 by adding two float columns:
Call08q1_09q1['MBS']=Call08q1_09q1['RCFD8639']+Call08q1_09q1['RCFD2170']
the error below shows up:
'<' not supported between instances of 'str' and 'int' in Python
However, I don't have any strings in my dataframe.
Call08q1_09q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39675 entries, 0 to 39674
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RSSD9001 39675 non-null float64
1 RSSD9999 39675 non-null float64
2 RCFD2170 39673 non-null float64
3 RCFD8639 38166 non-null float64
4 RCFD8641 38166 non-null float64
5 RCFD8639 38166 non-null float64
6 RCFD0211 38166 non-null float64
7 RCFD1287 38166 non-null float64
8 RCON3531 1107 non-null float64
9 RCFD1289 38166 non-null float64
10 RCFD1294 38166 non-null float64
11 RCFD1293 38166 non-null float64
12 RCFD1298 38166 non-null float64
13 RCON3532 1111 non-null float64
14 RCFD3210 38443 non-null float64
15 RIAD4230 38398 non-null float64
16 RIAD4340 38441 non-null float64
17 RCFD2122 39644 non-null float64
18 RCFD2125 249 non-null float64
19 RCFD1600 52 non-null float64
dtypes: float64(20)
You have loads of nulls in your columns as the printout tells you. How are those represented? Can you add these nulls with ints? I suggest you debug by inspecting these null values and taking appropriate action to fill them, drop them, or otherwise transform them into something useful.
The error did not occur in the line of code below, since it does not contain any comparison operator (<, >, ...).
Call08q1_09q1['MBS']= Call08q1_09q1['RCFD8639'] + Call08q1_09q1['RCFD2170']
The error most likely occurred in a line where a string is compared with a number (int), as in the scenario below:
s="1"
n= 3
s < n
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [1], line 3
1 s="1"
2 n= 3
----> 3 s<n
TypeError: '<' not supported between instances of 'str' and 'int'
To fix that, you need to cast the string to a number:
int(s) < n
#True
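One detail in the info() printout of the question may also be worth checking: RCFD8639 is listed twice (positions 3 and 5). When a column name is duplicated, selecting it returns a DataFrame rather than a Series, and adding another column to it then aligns on labels instead of rows, which may be what leads pandas to compare string labels with integer ones. The sketch below uses made-up column names and values to show how to detect and drop duplicated names; whether this is the actual cause here is an assumption.

```python
import pandas as pd

# Hypothetical frame with the column name "x" appearing twice
df = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], columns=["x", "y", "x"]
)

# Names that occur more than once
dup_names = df.columns[df.columns.duplicated()].tolist()

# With a duplicated name, this selection is a DataFrame, not a Series
selected = df["x"]

# Keep only the first occurrence of each column name
deduped = df.loc[:, ~df.columns.duplicated()]
deduped = deduped.assign(total=deduped["x"] + deduped["y"])

print(dup_names, type(selected).__name__)
print(deduped)
```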
I have a df ("data") full of information from IMDB, and I'm trying to replace the string of text for unavailable synopses with a NaN. I wanted to start by simply counting them, using this code:
data[data['synopsis'=='\\nIt looks like we don\'t have a Synopsis for this title yet. Be the first to contribute! Just click the "Edit page" button at the bottom of the page or learn more in the Synopsis submission guide.\\n']].count()
But, I get a key error. I have a hunch it's because of the dtype?
I've tried to convert the synopsis column from object into string, to no avail, using this code:
data['synopsis'] = data['synopsis'].apply(str)
and this code:
pd.Series('synopsis').astype('str')
But when I look at the info, nothing changes. I was able to convert startYear to datetime, though.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 27007 entries, 0 to 31893
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 tconst 27007 non-null object
1 titleType 27007 non-null object
2 primaryTitle 27007 non-null object
3 originalTitle 27007 non-null object
4 isAdult 27007 non-null int64
5 startYear 27007 non-null datetime64[ns]
6 endYear 27007 non-null object
7 runtimeMinutes 27007 non-null object
8 genres 27007 non-null object
9 storyline 20362 non-null object
10 synopsis 27007 non-null object
11 countries_of_origin 26640 non-null object
12 budget 11295 non-null object
13 opening_weekend 771 non-null object
14 production_company 19478 non-null object
15 rating 13641 non-null float64
16 number_of_votes 13641 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(14)
memory usage: 4.7+ MB
I'm new to all this--what am I doing wrong?
You've got the bracket in the wrong spot in your filter line. The inner bracket has to close right after the column name, before the comparison: data[data['synopsis'] == ...].
And I believe that pandas uses object as the dtype for strings, so it is correct for it to say object.
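As a small illustration of the corrected filter, the sketch below counts the placeholder rows and then replaces them with NaN. The placeholder text and the sample rows are hypothetical stand-ins for the long IMDB message in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical short stand-in for the long "no synopsis" message
PLACEHOLDER = "No synopsis yet."

data = pd.DataFrame(
    {"synopsis": ["A heist goes wrong.", PLACEHOLDER, PLACEHOLDER]}
)

# The comparison happens on the column; the result is a boolean mask
mask = data["synopsis"] == PLACEHOLDER
n_missing = int(mask.sum())          # number of placeholder rows

# Replace the placeholder text with NaN in place
data.loc[mask, "synopsis"] = np.nan

print(n_missing)
print(data)
```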
I've imported a csv file with pd.read_csv, using parse_dates and index_col.
This results in the following dataframe:
DatetimeIndex: 195972 entries, 2018-02-01 to 2019-10-25
Data columns (total 19 columns):
account_manager 195972 non-null object
article_des 195896 non-null object
article_n 195972 non-null object
article_o 195972 non-null object
budget_code 195972 non-null object
budget_naam 195972 non-null object
country 195972 non-null object
currency 195972 non-null object
customer 195972 non-null object
industrie 195972 non-null object
klantnaam 195972 non-null object
month 195972 non-null int64
revenue 195972 non-null float64
revenue_local 195972 non-null float64
sap_code 195972 non-null object
volume 195972 non-null float64
week 195972 non-null int64
weight 195972 non-null float64
year 195972 non-null int64
dtypes: float64(4), int64(3), object(12)
memory usage: 20.9+ MB
None
I've tried every way I could think of to select only one column (weight) from this dataframe into a new dataframe. None of them work. What's the trick to selecting columns in an indexed dataframe?
If I import the csv without an index_col I can make any selection I want.
df['weight'] will return a series while df[['weight']] will return a DataFrame.
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print(df)
# a b
# 0 1 2
# 1 3 4
print(type(df['a']))
# <class 'pandas.core.series.Series'>
print(type(df[['a']]))
# <class 'pandas.core.frame.DataFrame'>
To select multiple columns, you have to pass a list of column names.
print(df[['a', 'b']])
# a b
# 0 1 2
# 1 3 4
So to select a single column as a DataFrame, pass a list of one element.
print(df[['a']])
# a
# 0 1
# 1 3
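The same selection rules apply when the frame has a DatetimeIndex. The sketch below uses made-up data with a date index (the column names weight and volume come from the question; the values are invented). The one difference worth knowing is that the column promoted to the index is no longer available under df[...].

```python
import pandas as pd

# Hypothetical data; "date" plays the role of the index_col from read_csv
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2018-02-01", "2018-02-02"]),
        "weight": [1.5, 2.5],
        "volume": [10.0, 20.0],
    }
).set_index("date")

weight_df = df[["weight"]]     # one-column DataFrame, index preserved
weight_series = df["weight"]   # Series, index preserved

print(weight_df)
print("date" in df.columns)    # the index column is not a regular column
```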
I have set up two dataframes and attempted to filter the results by moving a datetime column to the index and using .last('7D') to pull the entries whose datetime falls within the last seven days. It worked for the first dataframe, but not the second. I have tried several variations to filter the df to get what I need, but cannot get accurate output. I'm at a loss! This has been built iteratively as well, so if you see some refactoring opportunities, let me know.
Original DataFrame: engagements
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 15 columns):
REQ_NAME 2572 non-null object
REQ_ID 2572 non-null object
STATUS 2572 non-null object
full_name 2572 non-null object
BIZ_UNIT 2572 non-null object
COMPLEXITY 2378 non-null object
PRIORITY 2390 non-null object
OPEN_DATE 2572 non-null datetime64[ns]
REQ_DATE 2572 non-null object
REQ_CAT 2572 non-null object
REQ_NOTE 2572 non-null object
CostCenter 2572 non-null int64
TargetCompletionDate 2572 non-null object
UpdateDTTM 2514 non-null datetime64[ns]
age 2572 non-null timedelta64[ns]
dtypes: datetime64[ns](2), int64(1), object(11), timedelta64[ns](1)
memory usage: 301.5+ KB
Separating DataFrame:
active_engagements = engagements[engagements['STATUS'].isin(active_status)]
comp_engagements = engagements[engagements['STATUS'].isin(comp_status)]
First Filter:
act_eng_open_lw = active_engagements.set_index('OPEN_DATE')
act_eng_open_lw = act_eng_open_lw.last('7D')
Output is the 10 rows of data I expect to see
Problem Child DataFrame:
act_eng_comp_lw = comp_engagements.set_index('UpdateDTTM')
act_eng_comp_lw = act_eng_comp_lw.last('7D')
Output is 105 rows, where I would expect 32
Info calls on both filtered DFs: act_eng_open_lw:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 2019-12-20 to 2019-12-26
Data columns (total 14 columns):
REQ_NAME 10 non-null object
REQ_ID 10 non-null object
STATUS 10 non-null object
full_name 10 non-null object
BIZ_UNIT 10 non-null object
COMPLEXITY 5 non-null object
PRIORITY 5 non-null object
REQ_DATE 10 non-null object
REQ_CAT 10 non-null object
REQ_NOTE 10 non-null object
CostCenter 10 non-null int64
TargetCompletionDate 10 non-null object
UpdateDTTM 5 non-null datetime64[ns]
age 10 non-null timedelta64[ns]
dtypes: datetime64[ns](1), int64(1), object(11), timedelta64[ns](1)
memory usage: 1.2+ KB
act_eng_comp_lw
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 105 entries, 2019-12-26 to 2019-11-27
Data columns (total 14 columns):
REQ_NAME 105 non-null object
REQ_ID 105 non-null object
STATUS 105 non-null object
full_name 105 non-null object
BIZ_UNIT 105 non-null object
COMPLEXITY 102 non-null object
PRIORITY 104 non-null object
OPEN_DATE 105 non-null datetime64[ns]
REQ_DATE 105 non-null object
REQ_CAT 105 non-null object
REQ_NOTE 105 non-null object
CostCenter 105 non-null int64
TargetCompletionDate 105 non-null object
age 105 non-null int64
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 12.3+ KB
Question: Using the same filter, why is one Datetime column filtering properly with .last and the other is not?
I ended up replacing .last with a different way of catching the last 7 days:
act_eng_open_lw = act_eng_open_lw[act_eng_open_lw.index > dt.datetime.now() - pd.to_timedelta("7day")]
This method works on both of my dataframes effectively.
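A likely reason for the difference: .last works backwards from the final index label, so it presumes a sorted index, and the info() output for act_eng_comp_lw shows an index running from 2019-12-26 to 2019-11-27, which is not ascending. (.last is also deprecated in recent pandas releases.) A boolean comparison against a cutoff does not depend on row order. The sketch below uses invented rows and a fixed "now" so the result is reproducible; it is an illustration, not the original data.

```python
import pandas as pd

# Fixed reference time so the example always gives the same rows
now = pd.Timestamp("2019-12-27")

# Deliberately unsorted index, including one missing timestamp
idx = pd.to_datetime(
    ["2019-12-26", "2019-11-27", "2019-12-21", "2019-12-19", None]
)
df = pd.DataFrame({"REQ_ID": ["a", "b", "c", "d", "e"]}, index=idx)

cutoff = now - pd.Timedelta(days=7)

# Comparison is element-wise, so order does not matter; NaT compares False
recent = df[df.index > cutoff]

print(recent)
```

Calling sort_index() before any order-dependent selection is another option if the rows need to stay in time order anyway.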
For those of you working in Foundry's environment: I'm trying to build a pipeline in "Code repositories" to process a raw dataset (from an Excel file) into a clean one that I'll analyse in "Contour" later.
To that end I used Python, except that the pipeline seems to be using pyspark, so at some point I must convert the dataset I've cleaned with pandas into a pyspark one, and that's where I'm stuck.
I've looked at several posts on Stack Overflow about converting a pandas DF to a pyspark DF, but none of the solutions seems to be working so far.
When I try to run the transform, there's always a data type that fails to convert, even though I forced a schema.
The Python code section has been tested successfully in Spyder (importing and exporting as an Excel file) and gives the expected result. It's only when I need to convert to pyspark that it fails.
@transform_pandas(
    Output("/MDM_OUT_OF_SERVICE_EVENTS_CLEAN"),
    OOS_raw=Input("/MDM_OUT_OF_SERVICE_EVENTS"),
)
def DA_transform(OOS_raw):
    ''' Code Section in Python '''
    mySchema = StructType([
        StructField(OOS_dup.columns[0], IntegerType(), True),
        StructField(OOS_dup.columns[1], StringType(), True),
        ...])
    OOS_out = sqlContext.createDataFrame(OOS_dup, schema=mySchema, verifySchema=False)
    return OOS_out
At some point I got this error message:
AttributeError: 'unicode' object has no attribute 'toordinal'.
According to this post: What is causing 'unicode' object has no attribute 'toordinal' in pyspark?
it happens because pyspark fails to convert the data into DateType,
but the data is datetime64[ns] in pandas. I've tried converting these columns into string and integer, and that fails as well.
Here is a picture of the output dataset from Python :
Here are the data types returned by pandas once the dataset has been cleaned:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4972 entries, 0 to 4971
Data columns (total 51 columns):
OOS_ID 4972 non-null int64
OPERATOR_CODE 4972 non-null object
ATA_CAUSE 4972 non-null int64
EVENT_CODE 3122 non-null object
AC_MODEL 4972 non-null object
AC_SN 4972 non-null int64
OOS_DATE 4972 non-null datetime64[ns]
AIRPORT_CODE 4915 non-null object
RTS_DATE 4972 non-null datetime64[ns]
EVENT_TYPE 4972 non-null object
CORRECTIVE_ACTION 417 non-null object
DD_HOURS_OOS 4972 non-null float64
EVENT_DESCRIPTION 4972 non-null object
EVENT_CATEGORY 4972 non-null object
ATA_REPORTED 324 non-null float64
TOTAL_CAUSES 4875 non-null float64
EVENT_NUMBER 3117 non-null float64
RTS_TIME 4972 non-null object
OOS_TIME 4972 non-null object
PREV_REPORTED 4972 non-null object
FERRY_IND 4972 non-null object
REPAIR_STN_CODE 355 non-null object
MAINT_DOWN_TIME 4972 non-null float64
LOGBOOK_RECORD_IDENTIFIER 343 non-null object
RTS_IND 4972 non-null object
READY_FOR_USE 924 non-null object
DQ_COMMENTS 2 non-null object
REVIEWED 5 non-null object
DOES_NOT_MEET_SPECS 4 non-null object
CORRECTED 12 non-null object
EDITED_BY 4972 non-null object
EDIT_DATE 4972 non-null datetime64[ns]
OUTSTATION_INDICATOR 3801 non-null object
COMMENT_TEXT 11 non-null object
ATA_CAUSE_CHAPTER 4972 non-null int64
ATA_CAUSE_SECTION 4972 non-null int64
ATA_CAUSE_COMPONENT 770 non-null float64
PROCESSOR_COMMENTS 83 non-null object
PARTS_AVAIL_AT_STATION 4972 non-null object
PARTS_SHIPPED_AT_STATION 4972 non-null object
ENGINEER_AT_STATION 4972 non-null object
ENGINEER_SENT_AT_STATION 4972 non-null object
SOURCE_FILE 4972 non-null object
OOS_Month 4972 non-null float64
OOS_Hour 4972 non-null float64
OOS_Min 4972 non-null float64
RTS_Month 4972 non-null float64
RTS_Hour 4972 non-null float64
RTS_Min 4972 non-null float64
OOS_Timestamp 4972 non-null datetime64[ns]
RTS_Timestamp 4972 non-null datetime64[ns]
dtypes: datetime64[ns](5), float64(12), int64(5), object(29)
In case it helps some of you, I found this in the official Foundry documentation on how to properly transition between pandas and pyspark DataFrames.
OOS_dup is the pandas dataframe I want to convert back to Spark.
from pyspark.sql import functions as F

# Extract the name of each column with its data type in pandas
col = list(OOS_dup.columns)
col_type = [OOS_dup[c].dtype.name for c in col]

# withColumn is a Spark DataFrame method, so the pandas DataFrame is converted first
OOS_dup = sqlContext.createDataFrame(OOS_dup)

# Define a function to replace missing ("NaN") cells with null
def replace_missing(df, col_names):
    for c in col_names:
        df = df.withColumn(c, F.when(df[c] == "NaN", None).otherwise(df[c]))
    return df

# Replace missing values
OOS_dup = replace_missing(OOS_dup, col)

# Define a function to change column types to the proper type in Spark
def change_type(df, col_names, dtypes):
    spark_types = {"float64": "double", "int64": "int", "datetime64[ns]": "date"}
    # Pick the cast per column from its pandas dtype; anything else becomes a string
    for c, t in zip(col_names, dtypes):
        df = df.withColumn(c, df[c].cast(spark_types.get(t, "string")))
    return df

# Cast each column to the proper data type
OOS_dup = change_type(OOS_dup, col, col_type)
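The pandas side of this hand-over can be checked without a Spark session. The sketch below is an illustration under assumptions, not the Foundry-documented method: it builds the dtype-to-Spark-type plan used by the casts above, and turns object columns into plain strings with None for missing cells, on the assumption that mixed-type object columns are what trip up the conversion. The helper names and sample data are hypothetical.

```python
import numpy as np
import pandas as pd

# Mapping mirrors the casts above; any other dtype falls back to "string"
SPARK_CAST = {"float64": "double", "int64": "int", "datetime64[ns]": "date"}


def spark_cast_plan(pdf):
    """Return {column name: Spark type name} based on pandas dtypes."""
    return {c: SPARK_CAST.get(pdf[c].dtype.name, "string") for c in pdf.columns}


def stringify_object_columns(pdf):
    """Copy the frame, turning object cells into str and missing cells into None."""
    out = pdf.copy()
    for c in out.columns:
        if out[c].dtype == object:
            out[c] = out[c].map(lambda v: None if pd.isna(v) else str(v))
    return out


# Hypothetical sample with a mixed-type object column
pdf = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, np.nan, 1.5], "x": ["A", 5, np.nan]})

plan = spark_cast_plan(pdf)
clean = stringify_object_columns(pdf)

print(plan)
print(clean)
```

The resulting plan can then drive the per-column casts once the frame is in Spark.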