Python, pandas exclude outliers function - python

I tried to exclude a few outliers from a pandas DataFrame, but the function just returns the same table without any difference. I can't figure out why.
# excluding outliers
def exclude_outliers(DataFrame, col_name):
    interval = 2.5 * DataFrame[col_name].std()
    mean = DataFrame[col_name].mean()
    m_i = mean + interval
    DataFrame = DataFrame[DataFrame[col_name] < m_i]

outlier_column = ['util_linhas_inseguras', 'idade', 'vezes_passou_de_30_59_dias',
                  'razao_debito', 'salario_mensal', 'numero_linhas_crdto_aberto',
                  'numero_vezes_passou_90_dias', 'numero_emprestimos_imobiliarios',
                  'numero_de_vezes_que_passou_60_89_dias', 'numero_de_dependentes']

for col in outlier_column:
    exclude_outliers(df_train, col)

df_train.describe()

As written, your function doesn't return anything and, as a result, your for loop is not making any changes to the DataFrame. Try the following:
At the end of your function, add the following line:
def exclude_outliers(DataFrame, col_name):
    ...  # Function filters the DataFrame
    # Add this line to return the filtered DataFrame
    return DataFrame
And then modify your for loop to update the df_train:
for col in outlier_column:
    # Now we update the DataFrame on each iteration
    df_train = exclude_outliers(df_train, col)
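Putting the two fixes together, a minimal runnable sketch (with a small made-up DataFrame standing in for df_train, and a single made-up column name):

```python
import pandas as pd

def exclude_outliers(frame, col_name):
    # Keep only rows whose value lies below mean + 2.5 standard deviations.
    interval = 2.5 * frame[col_name].std()
    m_i = frame[col_name].mean() + interval
    return frame[frame[col_name] < m_i]

# Made-up example data: nine ordinary values and one extreme outlier.
df_train = pd.DataFrame({'a': [1, 2, 3, 2, 1, 3, 2, 1, 3, 1000]})

for col in ['a']:
    df_train = exclude_outliers(df_train, col)
# The row containing 1000 is dropped; the nine small values remain.
```

The key point is that the function returns a new, filtered DataFrame and the loop reassigns it on every iteration; nothing is modified in place.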

How to return a non empty data frame among multiple newly created data frames inside a function in python?

I am new to pandas and have a question about returning a DataFrame from a function. I have a function that creates three new DataFrames based on the parameters given to it; the function has to return only the DataFrames that are non-empty. How do I do that?
my code:
def df_r(df, colname, t1, t2, t3):
    t1_df = pd.DataFrame()
    t2_df = pd.DataFrame()
    t3_df = pd.DataFrame()
    if t1:
        for colname in df:
            # some code
            # some code
            t1_df = some data
    if t2:
        for colname in df:
            # some code
            # some code
            t2_df = some data
    if t3:
        for colname in df:
            # some code
            # some code
            t3_df = some data
    list = [t1_df, t2_df, t3_df]
Now it should return only t1_df, since only the t1 parameter was given. So I have put all three into a list:
list = [t1_df, t2_df, t3_df]
How do I check which DataFrame is non-empty and return it?
Just check the empty attribute of each DataFrame, e.g.:
df = pd.DataFrame()
if df.empty:
    print("DataFrame is empty")
output:
DataFrame is empty
df.empty returns True if the DataFrame is empty, and False otherwise.
This works even if column names are present but the data is still missing.
So to answer specific to your case
list = [t1_df, t2_df, t3_df]
for df in list:
    if not df.empty:
        return df
assuming your case has only one non-empty DataFrame:
if not t1_df.empty:
    return t1_df
elif not t2_df.empty:
    return t2_df
else:
    return t3_df
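A small runnable sketch of the same idea, using a hypothetical helper name first_non_empty and made-up frames in place of the question's t1_df/t2_df/t3_df:

```python
import pandas as pd

def first_non_empty(frames):
    # Return the first DataFrame in the list that actually contains rows.
    for frame in frames:
        if not frame.empty:
            return frame
    return None  # all frames were empty

t1_df = pd.DataFrame({'x': [1, 2]})
t2_df = pd.DataFrame()
t3_df = pd.DataFrame(columns=['x'])  # has columns but no rows -> still empty

result = first_non_empty([t1_df, t2_df, t3_df])
```

Note that t3_df demonstrates the point above: a frame with column names but no data is still considered empty.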

Change Column values in pandas applying another function

I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes such a string ('P1Y4M1D') and returns an integer. I am wondering how it is possible to change all the column values to parsed values using that function?
import re
import pandas as pd

def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
                          names=['oci', 'citing', 'cited', 'creation', 'timespan',
                                 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # remove the first row; iloc selects data by row number
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan

def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i += 1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0, 0, 0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to change the whole column to the new parsed values; the final goal is to write a new function that returns the average ['timespan'] of documents created in a particular year. Since I need the parsed values, I thought it would be easier to change the whole column and work with the new data frame.
Also, I am curious what a way would be to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
    x = parse(str(x))
How can I get a column with the new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You would only need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd

def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters

# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22],
                   'Name': ['John', 'Joe', 'Carla'],
                   'Postal Code': ['FF_222', 'AA_555', 'BB_111']})

# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))

# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
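Applying the same pattern to the question's timespan strings, a sketch of a reworked parse function (here named parse_timespan, with a tiny made-up DataFrame in place of my_ocan) could look like this; it reuses the question's regex and its 365-days-per-year, 30-days-per-month approximation:

```python
import re
import pandas as pd

def parse_timespan(value):
    # Convert an ISO-8601-style duration such as 'P1Y4M1D' into an
    # approximate number of days (365 days/year, 30 days/month).
    is_negative = value.startswith('-')
    if is_negative:
        value = value[1:]
    match = re.match(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
    year, month, day = ([int(num) if num else 0 for num in match.groups()]
                        if match else (0, 0, 0))
    days = year * 365 + month * 30 + day
    return -days if is_negative else days

# Made-up stand-in for my_ocan with only the relevant column.
df = pd.DataFrame({'timespan': ['P2Y', 'P1Y4M1D', '-P3M']})
df['timespan_days'] = df['timespan'].apply(parse_timespan)
```

Because the function takes one string and returns one number, apply maps it over the whole column and the result can be stored in a new column, leaving the original untouched.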

How to use a faster loop than iterrows for the following code?

I have a function that calculates the return of prices on a slice of the return_df dataframe, and a separate loop that calculates standard deviations, again on a sliced dataframe. I can run both and calculate the result right now with the iterrows method. I have tried to apply the function to speed it up, and have read about pandas/numpy vectorization, but I get NaN values instead. Is there a way to run this code faster? Any help is appreciated.
# sigma calculation function
def sigma_calculation(sample_trade):
    trade_index = sample_trade.index[0]
    sample_return_lst = []
    eurusd_return_lst = []
    for i, row in return_df[trade_index-28:trade_index:1].iterrows():
        sample_return_lst.append(return_df.iloc[i, 0] * sample_trade.iloc[0, 1])
        eurusd_return_lst.append(return_df.iloc[i, 1] * sample_trade.iloc[0, 1])
    stats_df = pd.DataFrame(
        {'Asset Total Return': sample_return_lst,
         'Reference Total Return': eurusd_return_lst})
    sigma_trade = stats_df['Asset Total Return'].std()
    sigma_eurusd = stats_df['Reference Total Return'].std()
    d_lev = sigma_trade / sigma_eurusd
    return d_lev

# running the sigma_calculation function for 45 rows from the return_df
# dataframe and creating a separate df from it
d_lev_lst = []
for i, row in return_df[4433::1].iterrows():
    d_lev_lst.append(sigma_calculation(return_df.iloc[[i]]))
d_lev_df = pd.DataFrame({'D-Leverage': d_lev_lst})
This is what I tried for the second part (running sigma_calculation on 45 rows):
d_lev_df = pd.DataFrame(
    {'D-Leverage': return_df['Asset Return'].iloc[4433::1].apply(
        lambda row: sigma_calculation(return_df.iloc[[row]]))})
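One observation that suggests a vectorized sketch: the per-trade factor sample_trade.iloc[0, 1] multiplies both windows, so it cancels in the ratio sigma_trade / sigma_eurusd, and the whole loop reduces to a pair of rolling standard deviations. The column names and random data below are made up to stand in for return_df:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up stand-in for return_df: two return columns.
return_df = pd.DataFrame({
    'Asset Return': rng.normal(size=100),
    'Reference Return': rng.normal(size=100),
})

# Rolling 28-period standard deviations replace the per-row Python loop.
# The constant multiplier applied to both columns cancels in the ratio.
rolling_std = return_df.rolling(28).std()
d_lev = rolling_std['Asset Return'] / rolling_std['Reference Return']
```

Note that rolling includes the current row in its window; to match the original slice return_df[trade_index-28:trade_index] exactly, shift the result by one row with d_lev.shift(1).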

How do I fix the For Loop to return a certain character from a DataFrame?

I have imported an Excel file, made it into a DataFrame, and iterated over a column called "Title" to pick out titles with certain keywords. I have the list of matching titles as match_titles. What I want to do now is create a for loop that returns the value of the column before "Title" for each title in match_titles. I'm not sure why the code is not working. Any help would be appreciated.
import pandas as pd

data = pd.read_excel(r'C:\Users\bryanmccormack\Downloads\asin_list.xlsx')
df = pd.DataFrame(data, columns=['Track', 'Asin', 'Title'])

excludes = ["Chainsaw", "Diaper pail", "Leaf Blower"]
my_excludes = [set(key_word.lower().split()) for key_word in excludes]
match_titles = [e for e in df.Title if
                any(keywords.issubset(e.lower().split()) for keywords in my_excludes)]

a = []
for i in match_titles:
    a.append(df['Asin'])
print(a)
In your for loop you are appending the entire unfiltered column df['Asin'] to your list a once for every value in match_titles; df itself is never filtered.
One solution is to add a boolean match column to df, then filter on that column and return Asin:
# make a function to perform your match analysis.
def is_match(title, excludes=["Chainsaw", "Diaper pail", "Leaf Blower"]):
    my_excludes = [set(key_word.lower().split()) for key_word in excludes]
    if any(keywords.issubset(title.lower().split()) for keywords in my_excludes):
        return True
    return False

# Make a new boolean column for the matches. This applies your
# function to each value in df['Title'] and puts the output in
# the new column.
df['match_titles'] = df['Title'].apply(is_match)

# Filter the df to only matches and return the column you want.
# Because the match_titles column is boolean it can be used as
# an index.
result = df[df['match_titles']]['Asin']
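The same approach on a few made-up rows (standing in for the Excel file) shows the boolean mask selecting the matching Asin values:

```python
import pandas as pd

def is_match(title, excludes=("Chainsaw", "Diaper pail", "Leaf Blower")):
    # True when every word of one of the exclude phrases appears in the title.
    exclude_sets = [set(phrase.lower().split()) for phrase in excludes]
    return any(words.issubset(title.lower().split()) for words in exclude_sets)

# Made-up rows standing in for the Excel file.
df = pd.DataFrame({
    'Track': [1, 2, 3],
    'Asin': ['B001', 'B002', 'B003'],
    'Title': ['Heavy Duty Chainsaw', 'Garden Hose', 'Quiet Leaf Blower 3000'],
})

df['match_titles'] = df['Title'].apply(is_match)
result = df[df['match_titles']]['Asin']
# result holds the Asin values for the chainsaw and leaf-blower rows.
```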

How to create a function to append data to a dataframe?

import pandas as pd

columns_list = ['sport', 'home', 'away', 'selection', 'odds', 'tipster', 'stake',
                'is_won', 'profit', 'bookie']
df = pd.DataFrame(columns=columns_list)

def bet(sport, home, away, selection, odds, tipster, stake, profit, bookie):
    if profit > 0:
        is_won = True
    elif profit < 0:
        is_won = False
    temp_df = pd.DataFrame([[sport, home, away, selection, odds, tipster, stake,
                             is_won, profit, bookie]], columns=columns_list)
    df.append(temp_df, ignore_index=True)
    #print(smemels has been inserted)
    #add market

bet(sport="football", home="sporting", away="porto", selection="sporting",
    odds=2.7, tipster="me", stake=500, profit=500, bookie="marathon")
I am trying to create an empty DataFrame and then append new rows via a function, so that I only have to pass the values and it inserts them automatically. When I run bet(...) it doesn't actually append the data.
def bet(**kwargs):
    '''place your code here'''

bet(sport="football", home="sporting", away="porto", selection="sporting",
    odds=2.7, tipster="me", stake=500, profit=500, bookie="marathon")
See http://book.pythontips.com/en/latest/args_and_kwargs.html
kwargs allows you to accept keyword arguments, like you're trying to do.
Just add "global df" below your function header and assign the result of append back to df:
def bet(sport, home, away, selection, odds, tipster, stake, profit, bookie):
    global df
    if profit > 0:
        is_won = True
    elif profit < 0:
        is_won = False
    temp_df = pd.DataFrame([[sport, home, away, selection, odds, tipster, stake,
                             is_won, profit, bookie]], columns=columns_list)
    df = df.append(temp_df, ignore_index=True)
The global statement tells the function to use the df variable defined outside the function instead of creating a new local one. The reassignment is needed because DataFrame.append does not modify the frame in place; it returns a new object, which was being discarded in your version.
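Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea can be sketched with pd.concat. Treating profit == 0 as a loss below is an added assumption; the original if/elif left is_won undefined in that case:

```python
import pandas as pd

columns_list = ['sport', 'home', 'away', 'selection', 'odds', 'tipster', 'stake',
                'is_won', 'profit', 'bookie']
df = pd.DataFrame(columns=columns_list)

def bet(sport, home, away, selection, odds, tipster, stake, profit, bookie):
    global df
    is_won = profit > 0  # assumption: treats profit == 0 as a loss
    row = pd.DataFrame([[sport, home, away, selection, odds, tipster, stake,
                         is_won, profit, bookie]], columns=columns_list)
    # pd.concat returns a new DataFrame, so reassign it to the global df.
    df = pd.concat([df, row], ignore_index=True)

bet(sport="football", home="sporting", away="porto", selection="sporting",
    odds=2.7, tipster="me", stake=500, profit=500, bookie="marathon")
```

The reassignment pattern is the same as with append: the function rebinds the global df to the new, longer frame on every call.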
