In Python 3 with pandas I have two dataframes, "doacoes_cnpjs" and "te":
doacoes_cnpjs.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 22811 entries, 0 to 47353
Data columns (total 19 columns):
UF 22811 non-null object
Partido_x 22811 non-null object
Cargo_x 22811 non-null object
Nome_candidato_x 22811 non-null object
CPF_candidato 22811 non-null int64
CPF_CNPJ_doador 22811 non-null float64
Nome_doador 22811 non-null object
Nome_doador_Receita 22811 non-null object
Valor 22811 non-null float64
CPF_CNPJ_doador_originario 22811 non-null object
Nome_doador_originario 22811 non-null object
Nome_doador_originario_Receita 22811 non-null object
Estado 22811 non-null object
Cargo_y 22811 non-null object
Nome_candidato_y 22811 non-null object
CPF 22811 non-null int64
Nome_urna 22811 non-null object
Partido_y 22811 non-null object
Situacao 22811 non-null object
dtypes: float64(2), int64(2), object(15)
memory usage: 3.5+ MB
te.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5541 entries, 0 to 5664
Data columns (total 13 columns):
DATA_LS 4118 non-null object
DATA_INCLUS 2957 non-null object
Proprietario 5541 non-null object
Nome_propriedade 5541 non-null object
Municipio 5525 non-null object
Estado 5533 non-null object
CNPJ_CPF_CEI 5541 non-null object
CNPJ_CPF_CEI_limpo 5541 non-null float64
Trab_Envolv 4529 non-null float64
Ramo_atividade 2840 non-null object
Localizacao 2734 non-null object
Cod_ativ 2975 non-null object
Tipo_lista 5541 non-null object
dtypes: float64(2), object(11)
memory usage: 606.0+ KB
Both dataframes have a column holding the same kind of code: "CPF_CNPJ_doador" and "CNPJ_CPF_CEI_limpo". The codes are whole numbers with 13 or 14 digits (both columns are stored as float64).
Example: "6158959000136", "78141843000103", "46991295000106", "5351494000172" ...
I want to create a new dataframe by comparing "doacoes_cnpjs" and "te" on the columns "CPF_CNPJ_doador" and "CNPJ_CPF_CEI_limpo". But it cannot be an ordinary merge on the full codes.
I want to compare only the first eight digits of each code. Example: from "6158959000136" use only "61589590" and compare it with "78141843" from the code "78141843000103", and so on for every row.
Is there a way to do this? Or is it better to convert the codes to strings first and then extract the first few characters?
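One way to approach this, as a minimal sketch rather than a definitive answer: convert the float codes to integers, then to strings, slice the first eight characters, and merge on that prefix. The sample rows, the helper `first_eight` and the key column `raiz` are hypothetical names made up for illustration. The sketch follows the example in the question and does not zero-pad; if the 13-digit codes are really 14-digit CNPJs that lost a leading zero when stored as numbers, you may want `.str.zfill(14)` before slicing.

```python
import pandas as pd

# Hypothetical sample rows; only the two code columns come from the question.
doacoes_cnpjs = pd.DataFrame({
    "CPF_CNPJ_doador": [6158959000136.0, 78141843000103.0],
    "Valor": [100.0, 250.0],
})
te = pd.DataFrame({
    "CNPJ_CPF_CEI_limpo": [6158959000217.0, 46991295000106.0],
    "Proprietario": ["A", "B"],
})


def first_eight(series):
    """Return the first eight digits of each numeric code as a string."""
    # float -> int64 drops the trailing ".0"; str slicing then takes the prefix.
    return series.astype("int64").astype(str).str[:8]


# "raiz" is a hypothetical name for the temporary join key.
doacoes_cnpjs["raiz"] = first_eight(doacoes_cnpjs["CPF_CNPJ_doador"])
te["raiz"] = first_eight(te["CNPJ_CPF_CEI_limpo"])

# Inner merge keeps only the rows whose eight-digit prefixes match.
matches = doacoes_cnpjs.merge(te, on="raiz", how="inner")
print(matches)
```

Both columns are fully non-null in the `info()` output above, so the `astype("int64")` step should not hit missing values; with NaNs present it would need a `dropna()` first.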
I'm trying to identify the index position of a particular column name in Python. I used this exact same method previously on the same dataframe and it returned the number of the index position of the column name. However, in this case it doesn't seem to be working. Here is the relevant code:
The dataframe:
match.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25979 entries, 0 to 25978
Data columns (total 68 columns):
id_x 25979 non-null int64
country_id 25979 non-null int64
league_id 25979 non-null int64
season 25979 non-null object
stage 25979 non-null int64
date 25979 non-null object
match_api_id 25979 non-null int64
home_team_api_id 25979 non-null int64
away_team_api_id 25979 non-null int64
home_team_goal 25979 non-null int64
away_team_goal 25979 non-null int64
home_player_1 24755 non-null float64
home_player_2 24664 non-null float64
home_player_3 24698 non-null float64
home_player_4 24656 non-null float64
home_player_5 24663 non-null float64
home_player_6 24654 non-null float64
home_player_7 24752 non-null float64
home_player_8 24670 non-null float64
home_player_9 24706 non-null float64
home_player_10 24543 non-null float64
home_player_11 24424 non-null float64
away_player_1 24745 non-null float64
away_player_2 24701 non-null float64
away_player_3 24686 non-null float64
away_player_4 24658 non-null float64
away_player_5 24644 non-null float64
away_player_6 24666 non-null float64
away_player_7 24744 non-null float64
away_player_8 24638 non-null float64
away_player_9 24651 non-null float64
away_player_10 24538 non-null float64
away_player_11 24425 non-null float64
goal 14217 non-null object
shoton 14217 non-null object
shotoff 14217 non-null object
foulcommit 14217 non-null object
card 14217 non-null object
cross 14217 non-null object
corner 14217 non-null object
possession 14217 non-null object
BSA 14161 non-null float64
Home Team 25979 non-null object
Away Team 25979 non-null object
name_x 25979 non-null object
name_y 25979 non-null object
home_player_1 24755 non-null object
home_player_2 24664 non-null object
home_player_3 24698 non-null object
home_player_4 24656 non-null object
home_player_5 24663 non-null object
home_player_6 24654 non-null object
home_player_7 24752 non-null object
home_player_8 24670 non-null object
home_player_9 24706 non-null object
home_player_10 24543 non-null object
home_player_11 24424 non-null object
away_player_1 24745 non-null object
away_player_2 24701 non-null object
away_player_3 24686 non-null object
away_player_4 24658 non-null object
away_player_5 24644 non-null object
away_player_6 24666 non-null object
away_player_7 24744 non-null object
away_player_8 24638 non-null object
away_player_9 24651 non-null object
away_player_10 24538 non-null object
away_player_11 24425 non-null object
dtypes: float64(23), int64(9), object(36)
Rest of code:
#remove rows that dont contain player names
column_start = match.columns.get_loc("home_player_1")
column_start
column_end = match.columns.get_loc("away_player_11")
columns = match.columns[column_start:column_end]
#match.dropna(axis=columns)
This causes the following error:
TypeError: only integer scalar arrays can be converted to a scalar index
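The problem is that both column names are duplicated: home_player_1 and away_player_11 each appear twice (as do many other columns), so get_loc does not return a single integer position for them.
If the duplicated columns hold the same values, you can drop the duplicates with: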
match = match.loc[:, ~match.columns.duplicated()]
Or you can make the column names unique by appending a counter:
s = match.columns.to_series()
match.columns = (match.columns +
s.groupby(s).cumcount().astype(str).radd('_').str.replace('_0',''))
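To illustrate the behaviour described above, here is a small sketch on a made-up four-column frame (the data is hypothetical; only the column names come from the question). With a duplicated, non-sorted label, `get_loc` returns a boolean mask, which cannot be used as a slice bound; after dropping the duplicated columns it returns a plain integer.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one duplicated column name.
match = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=["home_player_1", "goal", "home_player_1", "away_player_11"],
)

# Duplicated label: get_loc returns a boolean mask, not an integer.
loc_dup = match.columns.get_loc("home_player_1")
print(type(loc_dup), loc_dup)

# Keep only the first occurrence of each column name.
deduped = match.loc[:, ~match.columns.duplicated()]

# Unique label: get_loc now returns an integer position.
loc_unique = deduped.columns.get_loc("home_player_1")
print(loc_unique)
```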
Also check whether the index is monotonic: if a label is duplicated and the index is not monotonic, get_loc returns a boolean array instead of an integer position.
print(df.index.is_monotonic_increasing)
If you don't want to modify the index, you can add a step that turns the boolean array into a list of labels:
df.index[matchArray == True].tolist()
I have set up two dataframes and tried to filter them by moving a datetime column to the index and using .last('7D') to pull the entries whose datetime falls within the last seven days. It worked for the first dataframe, but not the second. I have tried several variations of the filter to get what I need, but cannot get accurate output. I'm at a loss! This has been built iteratively as well, so if you see some refactoring opportunities, let me know.
Original DataFrame: engagements
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2572 entries, 0 to 2571
Data columns (total 15 columns):
REQ_NAME 2572 non-null object
REQ_ID 2572 non-null object
STATUS 2572 non-null object
full_name 2572 non-null object
BIZ_UNIT 2572 non-null object
COMPLEXITY 2378 non-null object
PRIORITY 2390 non-null object
OPEN_DATE 2572 non-null datetime64[ns]
REQ_DATE 2572 non-null object
REQ_CAT 2572 non-null object
REQ_NOTE 2572 non-null object
CostCenter 2572 non-null int64
TargetCompletionDate 2572 non-null object
UpdateDTTM 2514 non-null datetime64[ns]
age 2572 non-null timedelta64[ns]
dtypes: datetime64[ns](2), int64(1), object(11), timedelta64[ns](1)
memory usage: 301.5+ KB
Separating DataFrame:
active_engagements = engagements[engagements['STATUS'].isin(active_status)]
comp_engagements = engagements[engagements['STATUS'].isin(comp_status)]
First Filter:
act_eng_open_lw = active_engagements.set_index('OPEN_DATE')
act_eng_open_lw = act_eng_open_lw.last('7D')
Output is the 10 rows of data I expect to see
Problem Child DataFrame:
act_eng_comp_lw = comp_engagements.set_index('UpdateDTTM')
act_eng_comp_lw = act_eng_comp_lw.last('7D')
Output is 105 rows, where I would expect 32
Info calls on both filtered DFs: act_eng_open_lw:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10 entries, 2019-12-20 to 2019-12-26
Data columns (total 14 columns):
REQ_NAME 10 non-null object
REQ_ID 10 non-null object
STATUS 10 non-null object
full_name 10 non-null object
BIZ_UNIT 10 non-null object
COMPLEXITY 5 non-null object
PRIORITY 5 non-null object
REQ_DATE 10 non-null object
REQ_CAT 10 non-null object
REQ_NOTE 10 non-null object
CostCenter 10 non-null int64
TargetCompletionDate 10 non-null object
UpdateDTTM 5 non-null datetime64[ns]
age 10 non-null timedelta64[ns]
dtypes: datetime64[ns](1), int64(1), object(11), timedelta64[ns](1)
memory usage: 1.2+ KB
act_eng_comp_lw
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 105 entries, 2019-12-26 to 2019-11-27
Data columns (total 14 columns):
REQ_NAME 105 non-null object
REQ_ID 105 non-null object
STATUS 105 non-null object
full_name 105 non-null object
BIZ_UNIT 105 non-null object
COMPLEXITY 102 non-null object
PRIORITY 104 non-null object
OPEN_DATE 105 non-null datetime64[ns]
REQ_DATE 105 non-null object
REQ_CAT 105 non-null object
REQ_NOTE 105 non-null object
CostCenter 105 non-null int64
TargetCompletionDate 105 non-null object
age 105 non-null int64
dtypes: datetime64[ns](1), int64(2), object(11)
memory usage: 12.3+ KB
Question: Using the same filter, why is one Datetime column filtering properly with .last and the other is not?
I ended up changing the method I was using to catch the last 7 days, versus .last:
act_eng_open_lw = act_eng_open_lw[act_eng_open_lw.index > dt.datetime.now() - pd.to_timedelta("7day")]
This method works on both of my dataframes effectively.
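A possible explanation, offered as an assumption rather than a diagnosis: .last() selects rows relative to the final index entry and expects a sorted DatetimeIndex, and the info output above shows the UpdateDTTM index running from 2019-12-26 to 2019-11-27, which suggests it is not sorted. A boolean mask on the column itself does not depend on row order. The sketch below uses made-up rows and a fixed reference time (`now` is pinned only so the result is reproducible; in practice it would be the current time).

```python
import pandas as pd

# Hypothetical rows in non-chronological order, including one missing timestamp.
comp_engagements = pd.DataFrame({
    "REQ_ID": ["A", "B", "C", "D"],
    "UpdateDTTM": pd.to_datetime(["2019-12-26", "2019-11-27", "2019-12-22", None]),
})

# Fixed reference time for the example; use pd.Timestamp.now() in real code.
now = pd.Timestamp("2019-12-27")
cutoff = now - pd.Timedelta(days=7)

# Comparison against NaT is False, so rows without a timestamp are dropped.
recent = comp_engagements[comp_engagements["UpdateDTTM"] > cutoff]
print(recent)
```

This also avoids the set_index step, since the comparison works directly on the datetime column.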
For those of you working in the Foundry environment: I'm trying to build a pipeline in "Code repositories" to process a raw dataset (from an Excel file) into a clean one that I'll analyse in "Contour" later.
To that end I used Python, except that the pipeline seems to use pyspark, so at some point I must convert the dataset I've cleaned with pandas into a pyspark one, and that's where I'm stuck.
I've looked at several posts on Stack Overflow about converting a pandas DF to a pyspark DF, but none of the solutions has worked so far.
When I try to run the transform, there's always a data type that fails to convert even though I forced a schema.
The Python code section has been tested successfully in Spyder (importing from and exporting to an Excel file) and gives the expected result. It only fails when I need to convert to pyspark.
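from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from transforms.api import transform_pandas, Input, Output

@transform_pandas(
    Output("/MDM_OUT_OF_SERVICE_EVENTS_CLEAN"),
    OOS_raw=Input("/MDM_OUT_OF_SERVICE_EVENTS"),
)
def DA_transform(OOS_raw):
    ''' Code Section in Python '''
    # OOS_dup is the cleaned pandas dataframe produced by the section above
    mySchema = StructType([
        StructField(OOS_dup.columns[0], IntegerType(), True),
        StructField(OOS_dup.columns[1], StringType(), True),
        ...])
    OOS_out = sqlContext.createDataFrame(OOS_dup, schema=mySchema, verifySchema=False)
    return OOS_out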
At some point I got this error message:
AttributeError: 'unicode' object has no attribute 'toordinal'.
According to this post: What is causing 'unicode' object has no attribute 'toordinal' in pyspark?
it's because pyspark fails to convert the data into DateType,
but the data is datetime64[ns] in pandas. I've tried converting these columns to string and to integer, and that fails as well.
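Since toordinal is a method of datetime.date objects, the error suggests that a field declared as a date received a text value. One way to narrow this down, sketched here under that assumption, is to list the Python types actually stored in each object column of the pandas dataframe before converting. The helper `python_types_per_column` and the sample rows are hypothetical.

```python
import datetime

import pandas as pd


def python_types_per_column(df):
    """Map each object-dtype column to the sorted names of the Python types it holds."""
    result = {}
    for name in df.columns:
        if df[name].dtype == object:
            # Missing values are skipped so only real payload types are reported.
            result[name] = sorted({type(v).__name__ for v in df[name].dropna()})
    return result


# Hypothetical sample: one column mixes a date object with a date-like string.
sample = pd.DataFrame({
    "OOS_ID": [1, 2, 3],
    "OOS_DATE": [datetime.date(2019, 1, 5), "2019-01-06", None],
})

types_found = python_types_per_column(sample)
print(types_found)
```

A column that reports more than one type, or a type that does not match the Spark field declared for it, is a likely candidate for the failing conversion.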
Here are the data types returned by pandas once the dataset has been cleaned:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4972 entries, 0 to 4971
Data columns (total 51 columns):
OOS_ID 4972 non-null int64
OPERATOR_CODE 4972 non-null object
ATA_CAUSE 4972 non-null int64
EVENT_CODE 3122 non-null object
AC_MODEL 4972 non-null object
AC_SN 4972 non-null int64
OOS_DATE 4972 non-null datetime64[ns]
AIRPORT_CODE 4915 non-null object
RTS_DATE 4972 non-null datetime64[ns]
EVENT_TYPE 4972 non-null object
CORRECTIVE_ACTION 417 non-null object
DD_HOURS_OOS 4972 non-null float64
EVENT_DESCRIPTION 4972 non-null object
EVENT_CATEGORY 4972 non-null object
ATA_REPORTED 324 non-null float64
TOTAL_CAUSES 4875 non-null float64
EVENT_NUMBER 3117 non-null float64
RTS_TIME 4972 non-null object
OOS_TIME 4972 non-null object
PREV_REPORTED 4972 non-null object
FERRY_IND 4972 non-null object
REPAIR_STN_CODE 355 non-null object
MAINT_DOWN_TIME 4972 non-null float64
LOGBOOK_RECORD_IDENTIFIER 343 non-null object
RTS_IND 4972 non-null object
READY_FOR_USE 924 non-null object
DQ_COMMENTS 2 non-null object
REVIEWED 5 non-null object
DOES_NOT_MEET_SPECS 4 non-null object
CORRECTED 12 non-null object
EDITED_BY 4972 non-null object
EDIT_DATE 4972 non-null datetime64[ns]
OUTSTATION_INDICATOR 3801 non-null object
COMMENT_TEXT 11 non-null object
ATA_CAUSE_CHAPTER 4972 non-null int64
ATA_CAUSE_SECTION 4972 non-null int64
ATA_CAUSE_COMPONENT 770 non-null float64
PROCESSOR_COMMENTS 83 non-null object
PARTS_AVAIL_AT_STATION 4972 non-null object
PARTS_SHIPPED_AT_STATION 4972 non-null object
ENGINEER_AT_STATION 4972 non-null object
ENGINEER_SENT_AT_STATION 4972 non-null object
SOURCE_FILE 4972 non-null object
OOS_Month 4972 non-null float64
OOS_Hour 4972 non-null float64
OOS_Min 4972 non-null float64
RTS_Month 4972 non-null float64
RTS_Hour 4972 non-null float64
RTS_Min 4972 non-null float64
OOS_Timestamp 4972 non-null datetime64[ns]
RTS_Timestamp 4972 non-null datetime64[ns]
dtypes: datetime64[ns](5), float64(12), int64(5), object(29)
In case it helps some of you, I found this in the official Foundry documentation on how to properly transition between pandas and pyspark dataframes.
OOS_dup is the pandas dataframe I want to convert back to Spark.
from pyspark.sql import functions as F

# Record the name of each column together with its pandas dtype
col = list(OOS_dup.columns)
col_type = [OOS_dup[c].dtype.name for c in col]
df_schema = pd.DataFrame({"field": col, "data_type": col_type})

# Convert to Spark first: withColumn() below is a Spark method, not a pandas one
OOS_dup = sqlContext.createDataFrame(OOS_dup)

# Define a function to replace missing ("NaN") cells with null
def replace_missing(df, col_names):
    for c in col_names:
        df = df.withColumn(c, F.when(df[c] == "NaN", None).otherwise(df[c]))
    return df

# Replace missing values
OOS_dup = replace_missing(OOS_dup, col)

# Define a function to cast each column to the matching Spark type
def change_type(df, col_names, dtypes):
    spark_types = {"float64": "double", "int64": "int", "datetime64[ns]": "date"}
    for c, t in zip(col_names, dtypes):
        # Any dtype not listed above falls back to string
        df = df.withColumn(c, df[c].cast(spark_types.get(t, "string")))
    return df

# Cast each column to the proper data type
OOS_dup = change_type(OOS_dup, df_schema["field"], df_schema["data_type"])
The goal is to convert the columns of type 'object' to 'float' in the KDD 99 dataset.
This is the information on the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494020 entries, 0 to 494019
Data columns (total 42 columns):
duration 494020 non-null int64
protocol_type 494020 non-null object
service 494020 non-null object
src_bytes 494020 non-null object
dst_bytes 494020 non-null int64
flag 494020 non-null int64
land 494020 non-null int64
wrong_fragment 494020 non-null int64
urgent 494020 non-null int64
hot 494020 non-null int64
num_failed_logins 494020 non-null int64
logged_in 494020 non-null int64
num_compromised 494020 non-null int64
root_shell 494020 non-null int64
su_attempted 494020 non-null int64
num_root 494020 non-null int64
num_file_creations 494020 non-null int64
num_shells 494020 non-null int64
num_access_files 494020 non-null int64
num_outbound_cmds 494020 non-null int64
is_hot_login 494020 non-null int64
is_guest_login 494020 non-null int64
count 494020 non-null int64
serror_rate 494020 non-null int64
rerror_rate 494020 non-null float64
same_srv_rate 494020 non-null float64
diff_srv_rate 494020 non-null float64
srv_count 494020 non-null float64
srv_serror_rate 494020 non-null float64
srv_rerror_rate 494020 non-null float64
srv_diff_host_rate 494020 non-null float64
dst_host_count 494020 non-null int64
dst_host_srv_count 494020 non-null int64
dst_host_same_srv_rate 494020 non-null float64
dst_host_diff_srv_rate 494020 non-null float64
dst_host_same_src_port_rate 494020 non-null float64
dst_host_srv_diff_host_rate 494020 non-null float64
dst_host_serror_rate 494020 non-null float64
dst_host_srv_serror_rate 494020 non-null float64
dst_host_rerror_rate 494020 non-null float64
dst_host_srv_rerror_rate 494020 non-null float64
class 494020 non-null object
dtypes: float64(15), int64(23), object(4)
memory usage: 158.3+ MB
Four columns of type object need to be converted to float:
1. protocol type : 'tcp' , 'udp' , 'icmp'
2. service : 'http' , 'auth' , 'http_443' , etc
3. src_bytes : 'OTH' 'REJ' , 'SF' , etc
4. class : 'normal' , 'neptune' , 'smurf' , etc
model('protocol_type').astype(float)
But I got this error:
TypeError: 'DataFrame' object is not callable
I hope that someone can help me to fix this problem.
Thank you :)
First of all, as @thecruisy pointed out, you should use square brackets instead of parentheses, which gives:
model['protocol_type'].astype(float)
However, since the column has dtype object (it holds strings), that will raise a ValueError:
ValueError: could not convert string to float: 'tcp'
What you should do instead is encode the values. You can either use pandas category codes:
model['protocol_type'].astype('category').cat.codes.astype(float)
# ^^^^^^^^^^^^^^
# This may be redundant, though
Or use sklearn.preprocessing.LabelEncoder
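As a minimal sketch of the pandas route applied to a whole column, on made-up rows (the sample frame and the `columns_to_encode` list are hypothetical; only the column name `protocol_type` comes from the question). The codes follow the sorted order of the distinct values, so they are arbitrary labels, not meaningful magnitudes.

```python
import pandas as pd

# Hypothetical sample of the KDD 99 frame.
model = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "duration": [0, 1, 2, 3],
})

# Columns to encode; extend this list with the other object columns as needed.
columns_to_encode = ["protocol_type"]

encoded = model.copy()
for name in columns_to_encode:
    # Each distinct string becomes an integer code, then the codes are cast to float.
    encoded[name] = encoded[name].astype("category").cat.codes.astype(float)

print(encoded)
```

Whether integer-style codes are appropriate depends on the model that consumes them; for some models a one-hot encoding such as `pd.get_dummies` may be a better fit.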
I am using pandas to create a dataframe from a CSV file, and one column has dtype datetime. This works as expected with smaller datasets. If the dataset is large, the operations I perform on this column change it to object instead of datetime. Is there any way to preserve the dtypes? I tried using iloc or ix with the dataframe but that didn't work. Below is some of my code and where the problem lies.
twitterDataFrame['CreatedAt'] = twitterDataFrame['CreatedAt'].map(lambda x: pandas.to_datetime(x))
twitterDataFrame['CreatedAtForCalculations'] = twitterDataFrame['CreatedAt']
The problem appears on line 3 of the next block of code. It complains that tweetsByEachUser['CreatedAtForCalculations'].first() and tweetsByEachUser['CreatedAtForCalculations'].last() are strings and that it cannot subtract one string from another.
# Frequency of Tweets
twitterDataFrame = twitterDataFrame.set_index(['CreatedAt'])
tweetsByEachUser = twitterDataFrame.groupby('UserID')
numberOfHoursBetweenFirstAndLastTweet = (tweetsByEachUser['CreatedAtForCalculations'].first() - tweetsByEachUser['CreatedAtForCalculations'].last()).astype('timedelta64[h]')
I have tried
twitterDataFrame.ix['CreatedAtForCalculations':].dtypes
but this does not work either. Would anyone know a solution for this?
Sample of data from df.info
Int64Index: 21836 entries, 0 to 21835
Data columns (total 17 columns):
CreatedAt 21836 non-null object
ActualTweet 21836 non-null object
InReplyToStatusID 21836 non-null bool
InReplyToUserID 21836 non-null bool
UserID 21836 non-null object
RetweetCount 21836 non-null object
FavouriteCount 21836 non-null object
Hashtags 21836 non-null bool
URL 21836 non-null bool
MediaURL 21836 non-null bool
MediaType 21836 non-null object
UserMentionID 21836 non-null bool
PossiblySensitive 21836 non-null object
Language 21836 non-null object
Classifier 21836 non-null object
TweetLength 21836 non-null object
CreatedAtForCalculations 21836 non-null object
dtypes: bool(6), object(11)None
Thanks :)
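One thing that may be worth trying, sketched here on the assumption that some rows in the larger file contain values that do not parse as dates: convert the whole column in one call with `errors="coerce"`, so unparseable entries become NaT and the column keeps a datetime dtype instead of falling back to object. The sample rows are made up, and the sketch subtracts first from last (the reverse of the code above) so the result is positive.

```python
import pandas as pd

# Hypothetical sample with one value that is not a valid date.
twitterDataFrame = pd.DataFrame({
    "UserID": ["u1", "u1", "u2", "u2"],
    "CreatedAt": [
        "2014-01-01 10:00:00",
        "2014-01-01 13:00:00",
        "2014-01-02 08:00:00",
        "not a date",
    ],
})

# Vectorised conversion; invalid entries become NaT rather than staying strings.
twitterDataFrame["CreatedAt"] = pd.to_datetime(
    twitterDataFrame["CreatedAt"], errors="coerce"
)

# first()/last() skip missing values within each group.
grouped = twitterDataFrame.groupby("UserID")["CreatedAt"]
hours = (grouped.last() - grouped.first()) / pd.Timedelta(hours=1)
print(hours)
```

Counting the NaT values afterwards (`twitterDataFrame["CreatedAt"].isna().sum()`) shows how many rows failed to parse.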