When I try to create a new variable in the dataframe Call08q1_09q1 by adding two float columns:
Call08q1_09q1['MBS']=Call08q1_09q1['RCFD8639']+Call08q1_09q1['RCFD2170']
the error below shows up:
'<' not supported between instances of 'str' and 'int'
However, I don't have any string columns in my dataframe.
Call08q1_09q1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39675 entries, 0 to 39674
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RSSD9001 39675 non-null float64
1 RSSD9999 39675 non-null float64
2 RCFD2170 39673 non-null float64
3 RCFD8639 38166 non-null float64
4 RCFD8641 38166 non-null float64
5 RCFD8639 38166 non-null float64
6 RCFD0211 38166 non-null float64
7 RCFD1287 38166 non-null float64
8 RCON3531 1107 non-null float64
9 RCFD1289 38166 non-null float64
10 RCFD1294 38166 non-null float64
11 RCFD1293 38166 non-null float64
12 RCFD1298 38166 non-null float64
13 RCON3532 1111 non-null float64
14 RCFD3210 38443 non-null float64
15 RIAD4230 38398 non-null float64
16 RIAD4340 38441 non-null float64
17 RCFD2122 39644 non-null float64
18 RCFD2125 249 non-null float64
19 RCFD1600 52 non-null float64
dtypes: float64(20)
You have loads of nulls in your columns as the printout tells you. How are those represented? Can you add these nulls with ints? I suggest you debug by inspecting these null values and taking appropriate action to fill them, drop them, or otherwise transform them into something useful.
The line of code below does not itself contain a comparison operator (<, >, ...), so the error is not raised by an explicit comparison in it.
Call08q1_09q1['MBS']= Call08q1_09q1['RCFD8639'] + Call08q1_09q1['RCFD2170']
This error is raised whenever a string is compared with a number (int), as in the scenario below:
s="1"
n= 3
s < n
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [1], line 3
1 s="1"
2 n= 3
----> 3 s<n
TypeError: '<' not supported between instances of 'str' and 'int'
To fix that, cast the string to a number:
int(s) < n
# True
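One more thing worth checking: the info() output in the question lists RCFD8639 twice (rows 3 and 5). With a duplicated label, selecting that column returns a DataFrame rather than a Series, and adding a DataFrame to a Series makes pandas line up the Series' integer row labels against the DataFrame's string column labels. That alignment could be where a 'str' and an 'int' get compared, although this is an assumption and not something the traceback confirms. A minimal sketch on a made-up three-column frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the frame above: the label RCFD8639 appears twice.
df = pd.DataFrame(
    [[1.0, 10.0, 2.0], [3.0, 30.0, 4.0]],
    columns=["RCFD8639", "RCFD2170", "RCFD8639"],
)

# Selecting a duplicated label gives back both columns as a DataFrame.
selected = df["RCFD8639"]
print(type(selected).__name__)

# List the labels that occur more than once.
print(df.columns[df.columns.duplicated()].tolist())

# Keep only the first occurrence of each label, then add the two Series.
deduped = df.loc[:, ~df.columns.duplicated()]
deduped = deduped.assign(MBS=deduped["RCFD8639"] + deduped["RCFD2170"])
print(deduped["MBS"].tolist())
```

Keeping the first occurrence is only safe if the repeated columns really hold the same data; otherwise rename one of them instead.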
I am trying to use the pandas_profiling ProfileReport method:
from pandas_profiling import ProfileReport
ProfileReport(data)
I have tried updating Jupyter, pandas and Python through conda, but I am still getting the following error:
BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
This is the structure of the data:
RangeIndex: 32593 entries, 0 to 32592
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 code_module 32593 non-null object
1 code_presentation 32593 non-null object
2 id_student 32593 non-null int64
3 gender 32593 non-null object
4 region 32593 non-null object
5 highest_education 32593 non-null object
6 imd_band 31482 non-null object
7 age_band 32593 non-null object
8 num_of_prev_attempts 32593 non-null int64
9 studied_credits 32593 non-null int64
10 disability 32593 non-null object
11 final_result 32593 non-null object
dtypes: int64(3), object(9)
memory usage: 3.0+ MB
I'm trying to identify the index position of a particular column name in Python. I used this exact same method previously on the same dataframe and it returned the number of the index position of the column name. However, in this case it doesn't seem to be working. Here is the relevant code:
The dataframe:
match.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25979 entries, 0 to 25978
Data columns (total 68 columns):
id_x 25979 non-null int64
country_id 25979 non-null int64
league_id 25979 non-null int64
season 25979 non-null object
stage 25979 non-null int64
date 25979 non-null object
match_api_id 25979 non-null int64
home_team_api_id 25979 non-null int64
away_team_api_id 25979 non-null int64
home_team_goal 25979 non-null int64
away_team_goal 25979 non-null int64
home_player_1 24755 non-null float64
home_player_2 24664 non-null float64
home_player_3 24698 non-null float64
home_player_4 24656 non-null float64
home_player_5 24663 non-null float64
home_player_6 24654 non-null float64
home_player_7 24752 non-null float64
home_player_8 24670 non-null float64
home_player_9 24706 non-null float64
home_player_10 24543 non-null float64
home_player_11 24424 non-null float64
away_player_1 24745 non-null float64
away_player_2 24701 non-null float64
away_player_3 24686 non-null float64
away_player_4 24658 non-null float64
away_player_5 24644 non-null float64
away_player_6 24666 non-null float64
away_player_7 24744 non-null float64
away_player_8 24638 non-null float64
away_player_9 24651 non-null float64
away_player_10 24538 non-null float64
away_player_11 24425 non-null float64
goal 14217 non-null object
shoton 14217 non-null object
shotoff 14217 non-null object
foulcommit 14217 non-null object
card 14217 non-null object
cross 14217 non-null object
corner 14217 non-null object
possession 14217 non-null object
BSA 14161 non-null float64
Home Team 25979 non-null object
Away Team 25979 non-null object
name_x 25979 non-null object
name_y 25979 non-null object
home_player_1 24755 non-null object
home_player_2 24664 non-null object
home_player_3 24698 non-null object
home_player_4 24656 non-null object
home_player_5 24663 non-null object
home_player_6 24654 non-null object
home_player_7 24752 non-null object
home_player_8 24670 non-null object
home_player_9 24706 non-null object
home_player_10 24543 non-null object
home_player_11 24424 non-null object
away_player_1 24745 non-null object
away_player_2 24701 non-null object
away_player_3 24686 non-null object
away_player_4 24658 non-null object
away_player_5 24644 non-null object
away_player_6 24666 non-null object
away_player_7 24744 non-null object
away_player_8 24638 non-null object
away_player_9 24651 non-null object
away_player_10 24538 non-null object
away_player_11 24425 non-null object
dtypes: float64(23), int64(9), object(36)
Rest of code:
# remove rows that don't contain player names
column_start = match.columns.get_loc("home_player_1")
column_start
column_end = match.columns.get_loc("away_player_11")
columns = match.columns[column_start:column_end]
#match.dropna(axis=columns)
This causes the following error:
TypeError: only integer scalar arrays can be converted to a scalar index
The problem is that both column names are duplicated: home_player_1 and away_player_11 each appear twice (as do many other columns), so get_loc cannot return a single integer position.
If the duplicated columns hold the same values, you can remove the duplicates with:
match = match.loc[:, ~match.columns.duplicated()]
Or you can deduplicate the column names with:
s = match.columns.to_series()
match.columns = (match.columns +
s.groupby(s).cumcount().astype(str).radd('_').str.replace('_0',''))
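As a rough end-to-end sketch of the renaming idea, the example below uses a plain Python counter instead of the groupby expression above (a swapped-in spelling of the same technique, chosen for readability), then repeats the slicing from the question. The tiny `match` frame and its values are made up for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical miniature of `match`: player columns occur twice (ids, then names).
match = pd.DataFrame(
    [[1, 11.0, 22.0, "Ann", "Bob"], [2, 12.0, None, "Cid", None]],
    columns=["id_x", "home_player_1", "away_player_11",
             "home_player_1", "away_player_11"],
)

# Rename repeats to name_1, name_2, ... so that every label is unique.
seen = Counter()
unique_names = []
for name in match.columns:
    unique_names.append(name if seen[name] == 0 else f"{name}_{seen[name]}")
    seen[name] += 1
match.columns = unique_names

# With unique labels, get_loc returns plain integers.
column_start = match.columns.get_loc("home_player_1")
column_end = match.columns.get_loc("away_player_11")

# Slices exclude the end position, so add 1 to include away_player_11.
player_columns = list(match.columns[column_start:column_end + 1])

# Drop rows that are missing a value in any of the selected columns.
cleaned = match.dropna(subset=player_columns)
print(cleaned)
```

Note that dropna takes the column labels through `subset`; `axis` only chooses between dropping rows and dropping columns.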
You have to check whether your index is monotonic, because for a label that occurs more than once in a non-monotonic index, get_loc does not return the index position but a boolean array.
print(df.index.is_monotonic_increasing)
If you don't want to modify the index, you can add a step that turns the boolean array (matchArray here) back into labels:
df.index[matchArray == True].tolist()
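To make the different return types concrete, here is a small sketch on two made-up indexes. It shows what get_loc gives back for a repeated label when the index is sorted versus unsorted, and one way (np.flatnonzero, my choice here) to turn the boolean array into integer positions.

```python
import numpy as np
import pandas as pd

# Two hypothetical indexes that both contain the label "b" twice.
sorted_dupes = pd.Index(["a", "b", "b", "c"])
unsorted_dupes = pd.Index(["a", "b", "c", "b"])

print(sorted_dupes.is_monotonic_increasing)
print(unsorted_dupes.is_monotonic_increasing)

# Repeated label in a monotonic index: a slice of positions.
as_slice = sorted_dupes.get_loc("b")
print(as_slice)

# Repeated label in a non-monotonic index: a boolean mask.
mask = unsorted_dupes.get_loc("b")
print(mask)

# Convert the mask into integer positions.
positions = np.flatnonzero(mask).tolist()
print(positions)

# A label that occurs only once still yields a plain integer.
single = unsorted_dupes.get_loc("a")
print(single)
```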
I've imported a csv file with pd.read_csv, using parse_dates and index_col.
This results in the following dataframe:
DatetimeIndex: 195972 entries, 2018-02-01 to 2019-10-25
Data columns (total 19 columns):
account_manager 195972 non-null object
article_des 195896 non-null object
article_n 195972 non-null object
article_o 195972 non-null object
budget_code 195972 non-null object
budget_naam 195972 non-null object
country 195972 non-null object
currency 195972 non-null object
customer 195972 non-null object
industrie 195972 non-null object
klantnaam 195972 non-null object
month 195972 non-null int64
revenue 195972 non-null float64
revenue_local 195972 non-null float64
sap_code 195972 non-null object
volume 195972 non-null float64
week 195972 non-null int64
weight 195972 non-null float64
year 195972 non-null int64
dtypes: float64(4), int64(3), object(12)
memory usage: 20.9+ MB
None
I've tried every possible way to select only one column (weight) from this dataframe into a new dataframe. None of them work. What's the trick to selecting columns in an indexed dataframe?
If I import the csv without an index_col I can make any selection I want.
df['weight'] will return a series while df[['weight']] will return a DataFrame.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print(df)
# a b
# 0 1 2
# 1 3 4
print(type(df['a']))
# <class 'pandas.core.series.Series'>
print(type(df[['a']]))
# <class 'pandas.core.frame.DataFrame'>
To select multiple columns, you have to pass a list of column names.
print(df[['a', 'b']])
# a b
# 0 1 2
# 1 3 4
So to select a single column as a DataFrame, pass a list of one element.
print(df[['a']])
# a
# 0 1
# 1 3
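The same selection should work when the frame has a DatetimeIndex, as in the question. Below is a minimal sketch that reads a small in-memory CSV (the column names `date`, `weight` and `volume` and the values are made up) with parse_dates and index_col, then selects one column as a DataFrame.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "date,weight,volume\n2018-02-01,1.5,10\n2018-02-02,2.5,20\n"

# Parse the date column and use it as the index, as in the question.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# A list with one name returns a one-column DataFrame that keeps the index.
weight_only = df[["weight"]]
print(type(weight_only).__name__)
print(weight_only)
```

The column used as index_col is no longer a regular column, so it cannot be selected by name this way; every other column can.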
I have set up two dataframes and attempted to filter the results by moving a datetime column to the index and using .last('7D') to pull the entries whose datetime is stamped within the last seven days. It worked for the first dataframe, but not the second. I have tried a number of variations to filter the df to get what I need, but cannot get accurate output. I'm at a loss! This has been built iteratively as well, so if you see some refactoring opportunities, let me know.
Original DataFrame: engagements
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 15 columns):
REQ_NAME 2572 non-null object
REQ_ID 2572 non-null object
STATUS 2572 non-null object
full_name 2572 non-null object
BIZ_UNIT 2572 non-null object
COMPLEXITY 2378 non-null object
PRIORITY 2390 non-null object
OPEN_DATE 2572 non-null datetime64[ns]
REQ_DATE 2572 non-null object
REQ_CAT 2572 non-null object
REQ_NOTE 2572 non-null object
CostCenter 2572 non-null int64
TargetCompletionDate 2572 non-null object
UpdateDTTM 2514 non-null datetime64[ns]
age 2572 non-null timedelta64[ns]
dtypes: datetime64[ns](2), int64(1), object(11), timedelta64[ns](1)
memory usage: 301.5+ KB
Separating DataFrame:
active_engagements = engagements[engagements['STATUS'].isin(active_status)]
comp_engagements = engagements[engagements['STATUS'].isin(comp_status)]
First Filter:
act_eng_open_lw = active_engagements.set_index('OPEN_DATE')
act_eng_open_lw = act_eng_open_lw.last('7D')
Output is the 10 rows of data I expect to see
Problem Child DataFrame:
act_eng_comp_lw = comp_engagements.set_index('UpdateDTTM')
act_eng_comp_lw = act_eng_comp_lw.last('7D')
Output is 105 rows, where I would expect 32
Info calls on both filtered DFs:
act_eng_open_lw
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 2019-12-20 to 2019-12-26
Data columns (total 14 columns):
REQ_NAME 10 non-null object
REQ_ID 10 non-null object
STATUS 10 non-null object
full_name 10 non-null object
BIZ_UNIT 10 non-null object
COMPLEXITY 5 non-null object
PRIORITY 5 non-null object
REQ_DATE 10 non-null object
REQ_CAT 10 non-null object
REQ_NOTE 10 non-null object
CostCenter 10 non-null int64
TargetCompletionDate 10 non-null object
UpdateDTTM 5 non-null datetime64[ns]
age 10 non-null timedelta64[ns]
dtypes: datetime64[ns](1), int64(1), object(11), timedelta64[ns](1)
memory usage: 1.2+ KB
act_eng_comp_lw
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 105 entries, 2019-12-26 to 2019-11-27
Data columns (total 14 columns):
REQ_NAME 105 non-null object
REQ_ID 105 non-null object
STATUS 105 non-null object
full_name 105 non-null object
BIZ_UNIT 105 non-null object
COMPLEXITY 102 non-null object
PRIORITY 104 non-null object
OPEN_DATE 105 non-null datetime64[ns]
REQ_DATE 105 non-null object
REQ_CAT 105 non-null object
REQ_NOTE 105 non-null object
CostCenter 105 non-null int64
TargetCompletionDate 105 non-null object
age 105 non-null int64
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 12.3+ KB
Question: Using the same filter, why is one Datetime column filtering properly with .last and the other is not?
I ended up replacing .last with a different way of catching the last 7 days:
# requires: import datetime as dt
act_eng_open_lw = act_eng_open_lw[act_eng_open_lw.index > dt.datetime.now() - pd.to_timedelta("7day")]
This method works on both of my dataframes.
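A likely reason for the difference: .last is documented for a sorted DatetimeIndex, and the info output for the second frame runs from 2019-12-26 to 2019-11-27, so that index is not in ascending order (UpdateDTTM also contains nulls in the original frame). A boolean comparison against a cutoff does not depend on row order. The sketch below uses made-up rows and a fixed reference time instead of dt.datetime.now() so that it is reproducible.

```python
import pandas as pd

# Hypothetical sample: the index is deliberately unsorted and contains a missing stamp.
stamps = pd.to_datetime(
    ["2019-12-26", "2019-11-27", None, "2019-12-21", "2019-12-10"]
)
frame = pd.DataFrame({"REQ_ID": ["a", "b", "c", "d", "e"]}, index=stamps)

# Fixed reference time (assumption: replace with pd.Timestamp.now() in real use).
now = pd.Timestamp("2019-12-27")
cutoff = now - pd.Timedelta(days=7)

# The comparison is evaluated row by row; missing stamps compare as False.
recent = frame[frame.index > cutoff]
print(recent)
```

Sorting first with sort_index() is another option if the index has to stay ordered for later steps.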
I tried to sum multiple values in 'hh:mm:ss' format held in a pandas Series whose data type is datetime.time (object), but I get an error. Could you please guide me to the best way to do this?
Here is the code:
import pandas as pd
school_diesel =pd.read_excel(r'*********************** Diesel Log 23-09-2019 17_59_05.xlsx',heading=[1,2])
school_running = pd.read_excel(r'*************Daily Log 23-09-2019 18_09_41.xlsx',0)
school_diesel.columns = school_diesel.iloc[0] # replace headings with next row values
school_running.columns = school_running.iloc[0] # replace headings with next row values
school_running.columns
school_diesel.columns
school_diesel.drop(school_diesel.head(1).index, inplace=True) #drop first row of the table- as this repeated heading
school_running.drop(school_running.head(1).index, inplace=True) #drop first row of the table- as this repeated heading
The data types of each field are:
input: school_running.info()
output: <class 'pandas.core.frame.DataFrame'>
Int64Index: 12469 entries, 1 to 12469
Data columns (total 25 columns):
Sno 12468 non-null object
City 12467 non-null object
Zone 12467 non-null object
Branch 12467 non-null object
Building Code 12383 non-null object
Branch Type 12305 non-null object
AC or Non AC 11405 non-null object
Student Strength 12467 non-null object
Company Name 12381 non-null object
Gen SNo 12467 non-null object
Capacity KVA 12381 non-null object
Fuel Capacity 12467 non-null object
Last Diesel Purchase 12467 non-null object
Purchase Qty 12467 non-null object
Amount 12467 non-null object
Last Fuel Filled 12467 non-null object
Filling Qty 12467 non-null object
Diesel Opening Qty 12467 non-null object
Generator On Date 12467 non-null object
Generator Off Date 12467 non-null object
Running Hours 12466 non-null object
Consumed Units 12466 non-null object
Diesel Consumed 12466 non-null object
Diesel Balance Qty 12466 non-null object
Remarks 6267 non-null object
dtypes: object(25)
memory usage: 1.3+ MB
The error occurred at this line:
school_running['Running Hours'].sum()
The error is:
----> 1 school_running['Running Hours'].sum()
TypeError: unsupported operand type(s) for +: 'datetime.time' and 'datetime.time'
The expected output is the sum of the total Running Hours.
The time data is:
school_running['Running Hours'].head(10)
1 00:00:00
2 00:00:00
3 00:25:00
4 00:00:00
5 00:00:00
6 00:00:00
7 00:00:00
8 00:00:00
9 00:00:00
10 01:20:00
Name: Running Hours, dtype: object
You have to convert them to Timedeltas. The code below gives the total in seconds:
# datetime.time values render as 'hh:mm:ss', which pd.to_timedelta can parse
df['Running Hours'] = df['Running Hours'].astype(str)
pd.to_timedelta(df['Running Hours']).sum().total_seconds()
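For a self-contained check of the conversion, here is a small sketch on three made-up datetime.time values taken from the pattern in the question's sample (00:00, 00:25 and 01:20). It assumes every entry is a datetime.time; other value types in the column would need separate handling.

```python
import datetime

import pandas as pd

# Hypothetical stand-in for school_running['Running Hours'].
running = pd.Series(
    [datetime.time(0, 0), datetime.time(0, 25), datetime.time(1, 20)],
    name="Running Hours",
)

# time objects cannot be added, so go through their 'hh:mm:ss' text form.
as_timedelta = pd.to_timedelta(running.astype(str))

total = as_timedelta.sum()
print(total)
print(total.total_seconds())
```

The sum is a single pd.Timedelta, so it can also be reported in hours by dividing total_seconds() by 3600.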