My goal is to convert columns from type 'object' to 'float' in the KDD 99 dataset.
This is the dataset's info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494020 entries, 0 to 494019
Data columns (total 42 columns):
duration 494020 non-null int64
protocol_type 494020 non-null object
service 494020 non-null object
src_bytes 494020 non-null object
dst_bytes 494020 non-null int64
flag 494020 non-null int64
land 494020 non-null int64
wrong_fragment 494020 non-null int64
urgent 494020 non-null int64
hot 494020 non-null int64
num_failed_logins 494020 non-null int64
logged_in 494020 non-null int64
num_compromised 494020 non-null int64
root_shell 494020 non-null int64
su_attempted 494020 non-null int64
num_root 494020 non-null int64
num_file_creations 494020 non-null int64
num_shells 494020 non-null int64
num_access_files 494020 non-null int64
num_outbound_cmds 494020 non-null int64
is_hot_login 494020 non-null int64
is_guest_login 494020 non-null int64
count 494020 non-null int64
serror_rate 494020 non-null int64
rerror_rate 494020 non-null float64
same_srv_rate 494020 non-null float64
diff_srv_rate 494020 non-null float64
srv_count 494020 non-null float64
srv_serror_rate 494020 non-null float64
srv_rerror_rate 494020 non-null float64
srv_diff_host_rate 494020 non-null float64
dst_host_count 494020 non-null int64
dst_host_srv_count 494020 non-null int64
dst_host_same_srv_rate 494020 non-null float64
dst_host_diff_srv_rate 494020 non-null float64
dst_host_same_src_port_rate 494020 non-null float64
dst_host_srv_diff_host_rate 494020 non-null float64
dst_host_serror_rate 494020 non-null float64
dst_host_srv_serror_rate 494020 non-null float64
dst_host_rerror_rate 494020 non-null float64
dst_host_srv_rerror_rate 494020 non-null float64
class 494020 non-null object
dtypes: float64(15), int64(23), object(4)
memory usage: 158.3+ MB
There are 4 object columns that need to be converted to float:
1. protocol_type: 'tcp', 'udp', 'icmp'
2. service: 'http', 'auth', 'http_443', etc.
3. src_bytes: 'OTH', 'REJ', 'SF', etc.
4. class: 'normal', 'neptune', 'smurf', etc.
I tried:
model('protocol_type').astype(float)
But I got this error:
TypeError: 'DataFrame' object is not callable
I hope someone can help me fix this problem.
Thank you :)
First of all, as @thecruisy pointed out, you should use square brackets instead of parentheses, which leads to:
model['protocol_type'].astype(float)
However, since the column holds strings (dtype object), that will raise a ValueError:
ValueError: could not convert string to float: 'tcp'
What you should do instead is encode them. You can either use pandas' categorical codes:
model['protocol_type'].astype('category').cat.codes.astype(float)
#                                                   ^^^^^^^^^^^^^^
#                                                   This may be redundant, though
Or use sklearn.preprocessing.LabelEncoder
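For example, a minimal sketch with LabelEncoder (assuming the same model dataframe from the question):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform maps each distinct string to an integer code
model['protocol_type'] = le.fit_transform(model['protocol_type']).astype(float)
# le.classes_ keeps the original strings, so the mapping can be inverted later

Note that both approaches assign arbitrary integer codes; for estimators that would treat those codes as ordered quantities, one-hot encoding (e.g. pd.get_dummies) may be the better fit.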
I am trying to use the pandas_profiling ProfileReport method:
from pandas_profiling import ProfileReport
ProfileReport(data)
I have tried updating Jupyter, pandas, and Python through conda, but I am still getting the following error:
BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
This is the structure of the data:
RangeIndex: 32593 entries, 0 to 32592
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 code_module 32593 non-null object
1 code_presentation 32593 non-null object
2 id_student 32593 non-null int64
3 gender 32593 non-null object
4 region 32593 non-null object
5 highest_education 32593 non-null object
6 imd_band 31482 non-null object
7 age_band 32593 non-null object
8 num_of_prev_attempts 32593 non-null int64
9 studied_credits 32593 non-null int64
10 disability 32593 non-null object
11 final_result 32593 non-null object
dtypes: int64(3), object(9)
memory usage: 3.0+ MB
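The error comes from the multiprocessing pool that pandas-profiling spins up while computing the report. One workaround to try (a sketch, assuming pandas-profiling 2.x, where extra keyword arguments update the report configuration; pool_size is its setting for the number of workers):

from pandas_profiling import ProfileReport

# pool_size=1 keeps the computation in a single worker, sidestepping the
# pickling step that raises BrokenProcessPool (pool_size as a keyword is an
# assumption based on the pandas-profiling configuration options)
report = ProfileReport(data, pool_size=1)
report.to_file("report.html")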
I'm trying to identify the index position of a particular column name in Python. I used this exact same method previously on the same dataframe and it returned the number of the index position of the column name. However, in this case it doesn't seem to be working. Here is the relevant code:
The dataframe:
match.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25979 entries, 0 to 25978
Data columns (total 68 columns):
id_x 25979 non-null int64
country_id 25979 non-null int64
league_id 25979 non-null int64
season 25979 non-null object
stage 25979 non-null int64
date 25979 non-null object
match_api_id 25979 non-null int64
home_team_api_id 25979 non-null int64
away_team_api_id 25979 non-null int64
home_team_goal 25979 non-null int64
away_team_goal 25979 non-null int64
home_player_1 24755 non-null float64
home_player_2 24664 non-null float64
home_player_3 24698 non-null float64
home_player_4 24656 non-null float64
home_player_5 24663 non-null float64
home_player_6 24654 non-null float64
home_player_7 24752 non-null float64
home_player_8 24670 non-null float64
home_player_9 24706 non-null float64
home_player_10 24543 non-null float64
home_player_11 24424 non-null float64
away_player_1 24745 non-null float64
away_player_2 24701 non-null float64
away_player_3 24686 non-null float64
away_player_4 24658 non-null float64
away_player_5 24644 non-null float64
away_player_6 24666 non-null float64
away_player_7 24744 non-null float64
away_player_8 24638 non-null float64
away_player_9 24651 non-null float64
away_player_10 24538 non-null float64
away_player_11 24425 non-null float64
goal 14217 non-null object
shoton 14217 non-null object
shotoff 14217 non-null object
foulcommit 14217 non-null object
card 14217 non-null object
cross 14217 non-null object
corner 14217 non-null object
possession 14217 non-null object
BSA 14161 non-null float64
Home Team 25979 non-null object
Away Team 25979 non-null object
name_x 25979 non-null object
name_y 25979 non-null object
home_player_1 24755 non-null object
home_player_2 24664 non-null object
home_player_3 24698 non-null object
home_player_4 24656 non-null object
home_player_5 24663 non-null object
home_player_6 24654 non-null object
home_player_7 24752 non-null object
home_player_8 24670 non-null object
home_player_9 24706 non-null object
home_player_10 24543 non-null object
home_player_11 24424 non-null object
away_player_1 24745 non-null object
away_player_2 24701 non-null object
away_player_3 24686 non-null object
away_player_4 24658 non-null object
away_player_5 24644 non-null object
away_player_6 24666 non-null object
away_player_7 24744 non-null object
away_player_8 24638 non-null object
away_player_9 24651 non-null object
away_player_10 24538 non-null object
away_player_11 24425 non-null object
dtypes: float64(23), int64(9), object(36)
Rest of code:
# remove rows that don't contain player names
column_start = match.columns.get_loc("home_player_1")
column_start
column_end = match.columns.get_loc("away_player_11")
columns = match.columns[column_start:column_end]
#match.dropna(axis=columns)
This causes the following error:
TypeError: only integer scalar arrays can be converted to a scalar index
The problem is that both columns are duplicated, home_player_1 and also away_player_11 (and many other columns too), so get_loc returns a boolean array instead of a single integer.
If the duplicated columns contain the same values, you can remove the duplicates with:
match = match.loc[:, ~match.columns.duplicated()]
Or you can deduplicate the column names with:
s = match.columns.to_series()
# suffix each duplicate with its cumulative count ('_1', '_2', ...),
# leaving the first occurrence of each name unsuffixed ('_0' is stripped)
match.columns = (match.columns +
                 s.groupby(s).cumcount().astype(str).radd('_').str.replace('_0', ''))
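A small demonstration of the renaming, using a hypothetical two-column frame with one duplicated name:

import pandas as pd

demo = pd.DataFrame([[1.0, 'a']], columns=['home_player_1', 'home_player_1'])
s = demo.columns.to_series()
demo.columns = (demo.columns +
                s.groupby(s).cumcount().astype(str).radd('_').str.replace('_0', ''))
print(list(demo.columns))                     # ['home_player_1', 'home_player_1_1']
print(demo.columns.get_loc('home_player_1'))  # 0 -- an integer again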
You also have to check whether your index is monotonic, because if it is not, get_loc will not return the index number but a boolean array:
print(df.index.is_monotonic)
Alternatively, if you don't want to modify the index, you can add a step like:
df.index[matchArray == True].tolist()
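Equivalently, when get_loc returns a boolean mask (which it does for non-adjacent duplicate labels), the integer positions can be pulled out with numpy (a sketch, using the question's match frame):

import numpy as np

mask = match.columns.get_loc("home_player_1")  # boolean mask for duplicated labels
positions = np.flatnonzero(mask).tolist()      # integer positions of every match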
I have set up two dataframes and attempted to filter the results by moving a datetime column to the index and using .last('7D') to pull the entries whose datetime stamp is within the last seven days. It worked for the first dataframe, but not the second. I have tried a variety of variations to filter the df, but cannot get accurate output. I'm at a loss! This has also been built iteratively, so if you see some refactoring opportunities, let me know.
Original DataFrame: engagements
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 15 columns):
REQ_NAME 2572 non-null object
REQ_ID 2572 non-null object
STATUS 2572 non-null object
full_name 2572 non-null object
BIZ_UNIT 2572 non-null object
COMPLEXITY 2378 non-null object
PRIORITY 2390 non-null object
OPEN_DATE 2572 non-null datetime64[ns]
REQ_DATE 2572 non-null object
REQ_CAT 2572 non-null object
REQ_NOTE 2572 non-null object
CostCenter 2572 non-null int64
TargetCompletionDate 2572 non-null object
UpdateDTTM 2514 non-null datetime64[ns]
age 2572 non-null timedelta64[ns]
dtypes: datetime64[ns](2), int64(1), object(11), timedelta64[ns](1)
memory usage: 301.5+ KB
Separating the DataFrame:
active_engagements = engagements[engagements['STATUS'].isin(active_status)]
comp_engagements = engagements[engagements['STATUS'].isin(comp_status)]
First Filter:
act_eng_open_lw = active_engagements.set_index('OPEN_DATE')
act_eng_open_lw = act_eng_open_lw.last('7D')
The output is the 10 rows of data I expect to see.
Problem Child DataFrame:
act_eng_comp_lw = comp_engagements.set_index('UpdateDTTM')
act_eng_comp_lw = act_eng_comp_lw.last('7D')
The output is 105 rows, where I would expect 32.
Info calls on both filtered DFs: act_eng_open_lw:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 2019-12-20 to 2019-12-26
Data columns (total 14 columns):
REQ_NAME 10 non-null object
REQ_ID 10 non-null object
STATUS 10 non-null object
full_name 10 non-null object
BIZ_UNIT 10 non-null object
COMPLEXITY 5 non-null object
PRIORITY 5 non-null object
REQ_DATE 10 non-null object
REQ_CAT 10 non-null object
REQ_NOTE 10 non-null object
CostCenter 10 non-null int64
TargetCompletionDate 10 non-null object
UpdateDTTM 5 non-null datetime64[ns]
age 10 non-null timedelta64[ns]
dtypes: datetime64[ns](1), int64(1), object(11), timedelta64[ns](1)
memory usage: 1.2+ KB
act_eng_comp_lw
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 105 entries, 2019-12-26 to 2019-11-27
Data columns (total 14 columns):
REQ_NAME 105 non-null object
REQ_ID 105 non-null object
STATUS 105 non-null object
full_name 105 non-null object
BIZ_UNIT 105 non-null object
COMPLEXITY 102 non-null object
PRIORITY 104 non-null object
OPEN_DATE 105 non-null datetime64[ns]
REQ_DATE 105 non-null object
REQ_CAT 105 non-null object
REQ_NOTE 105 non-null object
CostCenter 105 non-null int64
TargetCompletionDate 105 non-null object
age 105 non-null int64
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 12.3+ KB
Question: Using the same filter, why does one datetime column filter properly with .last while the other does not?
I ended up switching from .last to a different method of catching the last 7 days:
act_eng_open_lw = act_eng_open_lw[act_eng_open_lw.index > dt.datetime.now() - pd.to_timedelta("7day")]
This method works on both of my dataframes effectively.
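For completeness, the likely reason .last('7D') misbehaved: it assumes a sorted, monotonically increasing DatetimeIndex and selects rows within the offset of the last index value, and the info output shows the second frame's index running backwards (2019-12-26 down to 2019-11-27). A sketch of the sort-first alternative, assuming the frames from the question:

# sorting the DatetimeIndex first should make .last('7D') select the true
# trailing seven days of data on both frames
act_eng_comp_lw = comp_engagements.set_index('UpdateDTTM').sort_index()
act_eng_comp_lw = act_eng_comp_lw.last('7D')

Note that .last also measures from the newest timestamp in the data rather than from today, so the explicit now() comparison above remains the safer choice when "last 7 days" means calendar time.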
In Python 3 and pandas, I have two dataframes, "doacoes_cnpjs" and "te":
doacoes_cnpjs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22811 entries, 0 to 47353
Data columns (total 19 columns):
UF 22811 non-null object
Partido_x 22811 non-null object
Cargo_x 22811 non-null object
Nome_candidato_x 22811 non-null object
CPF_candidato 22811 non-null int64
CPF_CNPJ_doador 22811 non-null float64
Nome_doador 22811 non-null object
Nome_doador_Receita 22811 non-null object
Valor 22811 non-null float64
CPF_CNPJ_doador_originario 22811 non-null object
Nome_doador_originario 22811 non-null object
Nome_doador_originario_Receita 22811 non-null object
Estado 22811 non-null object
Cargo_y 22811 non-null object
Nome_candidato_y 22811 non-null object
CPF 22811 non-null int64
Nome_urna 22811 non-null object
Partido_y 22811 non-null object
Situacao 22811 non-null object
dtypes: float64(2), int64(2), object(15)
memory usage: 3.5+ MB
te.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5541 entries, 0 to 5664
Data columns (total 13 columns):
DATA_LS 4118 non-null object
DATA_INCLUS 2957 non-null object
Proprietario 5541 non-null object
Nome_propriedade 5541 non-null object
Municipio 5525 non-null object
Estado 5533 non-null object
CNPJ_CPF_CEI 5541 non-null object
CNPJ_CPF_CEI_limpo 5541 non-null float64
Trab_Envolv 4529 non-null float64
Ramo_atividade 2840 non-null object
Localizacao 2734 non-null object
Cod_ativ 2975 non-null object
Tipo_lista 5541 non-null object
dtypes: float64(2), object(11)
memory usage: 606.0+ KB
The dataframes have two columns with the same type of code, "CPF_CNPJ_doador" and "CNPJ_CPF_CEI_limpo". They are integer codes with 13 or 14 digits.
Example: "6158959000136", "78141843000103", "46991295000106", "5351494000172" ...
I want to create a new dataframe from a comparison of "doacoes_cnpjs" and "te", using the columns "CPF_CNPJ_doador" and "CNPJ_CPF_CEI_limpo". But it cannot be an ordinary merge.
I want to compare only the first eight digits of the codes. Example: from "6158959000136" use only "61589590" and compare it with "78141843" from code "78141843000103", and so on for all rows.
Is there a way to do this? Or is it best to turn the codes into strings before extracting the first few characters?
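A minimal sketch of the string approach (assuming the frames above; the helper column name codigo8 is made up for illustration): cast the float codes to integers to drop the decimal part, convert them to strings, slice the first eight characters, and merge on the result.

# build a hypothetical 'codigo8' key holding the first eight digits of each code
doacoes_cnpjs['codigo8'] = (doacoes_cnpjs['CPF_CNPJ_doador']
                            .astype('int64').astype(str).str[:8])
te['codigo8'] = (te['CNPJ_CPF_CEI_limpo']
                 .astype('int64').astype(str).str[:8])

# an inner merge keeps only the rows whose codes share the same first eight digits
matches = doacoes_cnpjs.merge(te, on='codigo8')

Beware that codes stored as float may have lost leading zeros; if the codes should all be 14 digits, zero-padding with str.zfill(14) before slicing would be needed.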
I get the following error:
exportStore.append(key, hdfStoreLocal, index = False, data_columns = True)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 911, in append
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 1270, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3605, in write
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3293, in create_axes
raise e
ValueError: invalid itemsize in generic type tuple
Any ideas on why this would happen? It's a rather large project, so I'm not sure what code I can offer, but this happens on the first append. Any help would be very much appreciated.
EDIT:
pd.show_versions() result:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.1
nose: None
Cython: 0.20.2
numpy: 1.8.1
scipy: 0.13.3
statsmodels: None
IPython: 1.2.1
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
Info result:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 61500 entries, 0 to 61499
Data columns (total 48 columns):
Sequential_Code_1 61500 non-null float64
Age_1 61500 non-null float64
Sex_1 61500 non-null object
Race_1 61500 non-null object
Ethnicity_1 61500 non-null object
Principal_Code_1 61500 non-null object
Admitting_Code_1 61500 non-null object
Principal_Code_2 61500 non-null object
Other_Codes_1 61500 non-null object
Other_Codes_2 61500 non-null object
Other_Codes_3 61500 non-null object
Other_Codes_4 61500 non-null object
Other_Codes_5 61500 non-null object
Other_Codes_6 61500 non-null object
Other_Codes_7 61500 non-null object
Other_Codes_8 61500 non-null object
Other_Codes_9 61500 non-null object
Other_Codes_10 61500 non-null object
Other_Codes_11 61500 non-null object
Other_Codes_12 61500 non-null object
Other_Codes_13 61500 non-null object
Other_Codes_14 61500 non-null object
Other_Codes_15 61500 non-null object
Other_Codes_16 61500 non-null object
Other_Codes_17 61500 non-null object
Other_Codes_18 61500 non-null object
Other_Codes_19 61500 non-null object
Other_Codes_20 61500 non-null object
Other_Codes_21 61500 non-null object
Other_Codes_22 61500 non-null object
Other_Codes_23 61500 non-null object
Other_Codes_24 61500 non-null object
External_Code_1 61500 non-null object
Place_Code_1 61500 non-null object
Head:
Sequential_Number_1 Age_1 Sex_1 Race_1 \
1128 2.000000e+13 73 F 01
2185 2.000000e+13 52 M 01
2202 2.000000e+13 64 M 01
2283 2.000000e+13 72 F 01
4471 2.000000e+13 62 F 01
The problem is that you need to specify min_itemsize; see the min_itemsize notes in the pandas HDFStore docs.
This controls how big the column is for string-like columns. It takes the biggest length of the passed values to figure out what size the column needs to be, so if none of the values have any length (e.g. they are all empty strings), it fails (the error message could probably be better).
The reason to specify this yourself is that, say, you are appending in multiple chunks. Chunk 2 could have a longer string, which means the column should be at least that size, but looking only at chunk 1 doesn't tell you this.
Further, I would pre-process this data to not have zero-length strings; instead use np.nan as the missing value, which HDFStore / pandas handle properly.
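A sketch of both fixes, reusing the call from the traceback (the width of 50 is an arbitrary assumption; pick one that covers the longest string you expect across all chunks):

import numpy as np

# replace zero-length strings with np.nan so they are stored as missing values
hdfStoreLocal = hdfStoreLocal.replace('', np.nan)

# min_itemsize reserves room for strings up to 50 characters; an int applies
# to all string columns, a dict ({'column': size}) targets specific ones
exportStore.append(key, hdfStoreLocal, index=False, data_columns=True,
                   min_itemsize=50)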