How to create a function to append data to a dataframe? - python

import pandas as pd
columns_list=['sport','home','away','selection','odds','tipster','stake',
              'is_won','profit','bookie']
df = pd.DataFrame(columns=columns_list)
def bet(sport,home,away,selection,odds,tipster,stake,profit,bookie):
    if profit > 0:
        is_won = True
    elif profit < 0:
        is_won = False
    temp_df = pd.DataFrame([[sport,home,away,selection,odds,tipster,stake,
                             is_won,profit,bookie]], columns=columns_list)
    df.append(temp_df, ignore_index=True)
    #print(smemels has been inserted)
    #add market
bet(sport="football",home="sporting",away="porto",selection="sporting",
    odds=2.7,tipster="me",stake=500,profit=500,bookie="marathon")
I am trying to create an empty DataFrame and then append new rows, by creating a function, so I only have to put the values and it inserts automatically. When I run the code bet(...) it doesn't really append the data.

def bet(**kwargs):
    '''place your code here'''
bet(sport="football",home="sporting",away="porto",selection="sporting",
    odds=2.7,tipster="me",stake=500,profit=500,bookie="marathon")
See http://book.pythontips.com/en/latest/args_and_kwargs.html
**kwargs lets a function accept arbitrary keyword arguments, as you are trying to do.

Just add "global df" below your function header and assign the result back to df. DataFrame.append has no inplace argument and always returns a new object; it was also removed in pandas 2.0, so pd.concat is used here:
def bet(sport,home,away,selection,odds,tipster,stake,profit,bookie):
    global df
    if profit > 0:
        is_won = True
    elif profit < 0:
        is_won = False
    temp_df = pd.DataFrame([[sport,home,away,selection,odds,tipster,stake,
                             is_won,profit,bookie]], columns=columns_list)
    df = pd.concat([df, temp_df], ignore_index=True)
The first change tells the function to use the global (module-level) df variable instead of creating a new local one. The second rebinds df to the new DataFrame that pd.concat returns, so the added row is kept.
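A variant that avoids the global is to pass the DataFrame in and return the new one. This is only a minimal sketch: the name add_bet, the COLUMNS constant, and the rule that a profit of 0 counts as not won are assumptions, not part of the question.

```python
import pandas as pd

COLUMNS = ['sport', 'home', 'away', 'selection', 'odds', 'tipster',
           'stake', 'is_won', 'profit', 'bookie']

def add_bet(df, **bet):
    """Return a new DataFrame with one bet added (hypothetical helper)."""
    # Assumption: a profit of exactly 0 is recorded as not won.
    row = dict(bet, is_won=bet['profit'] > 0)
    new_row = pd.DataFrame([row], columns=COLUMNS)
    if df.empty:
        # Nothing to concatenate with yet; the new row is the whole table.
        return new_row
    return pd.concat([df, new_row], ignore_index=True)

bets = pd.DataFrame(columns=COLUMNS)
bets = add_bet(bets, sport="football", home="sporting", away="porto",
               selection="sporting", odds=2.7, tipster="me", stake=500,
               profit=500, bookie="marathon")
bets = add_bet(bets, sport="football", home="benfica", away="braga",
               selection="braga", odds=3.1, tipster="me", stake=100,
               profit=-100, bookie="marathon")
print(bets)
```

Because the function returns the result instead of mutating shared state, the caller decides which variable holds the table.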

Related

Filtering dataframe in a loop with use of config file values

I have the following toy dataset
data = {"Subject":["1","2","3","3","4","5","5"],
"date": ["2020-05-01 16:54:25","2020-05-03 10:31:18","2020-05-08 10:10:40","2020-05-08 10:10:42","2020-05-06 09:30:40","2020-05-07 12:46:30","2020-05-07 12:55:10"],
"Accept": ["True","False","True","True","False","True","True"],
"Amount" : [150,30,32,32,300,100,50],
"accept_1": ["True","False","True","True","False","True","True"],
"amount_1" : [20,30,32,32,150,100,30],
"Transaction":["True","True","False","False","True","True","False"],
"Label":["True","True","True","False","True","True","True"]}
data = pd.DataFrame(data)
and a small toy config file
config = [{"colname": "Accept","KeepValue":"True","RemoveTrues":"True"},
{"colname":"Transaction","KeepValue":"False","RemoveTrues":"False"}]
I want to loop through the dataset and apply these filters. After I have applied the first filter,
I want to apply the following filter on the filtered data and so on.
I run the following code and it seems it applies the filter on the data the first time and then, it applies the second filter on the original data, not the filtered one.
for i in range(len(config)):
    filtering = config[i]
    if filtering["RemoveTrues"] == "True":
        col = filtering["colname"]
        test = data[data[col] == filtering["KeepValue"]]
        print(test)
    else:
        col = filtering["colname"]
        test = data[(data[col]== filtering["KeepValue"]) | data["Label"]]
        print(test)
How can I apply the first filter on the data, then the second filter on the filtered data and so on ? I need to use a loop since I have to get the filters from the configuration file.
As I understand it, you want each filter to apply to the result of the previous one. In your loop every iteration filters the original data, so the earlier filter is lost. Copy the DataFrame into a new variable, test, filter test instead of data, and assign the result back to test so the next iteration starts from the already filtered rows:
test = data.copy() # copy the original dataframe so the loop never modifies it
for i in range(len(config)):
    filtering = config[i]
    if filtering["RemoveTrues"] == "True":
        col = filtering["colname"]
        test = test[test[col] == filtering["KeepValue"]] # filter test and reassign it so the next iteration uses the filtered rows
        print(test)
    else:
        col = filtering["colname"]
        test = test[(test[col]== filtering["KeepValue"]) | test["Label"]] # filter test and reassign it so the next iteration uses the filtered rows
        print(test)
I'd suggest changing your True/False strings to booleans. You can just assign a new value to df that will persist during the loop (don't create an extra test variable).
df = pd.DataFrame(data)
config = [{"colname": "Accept","KeepValue":"True","RemoveTrues":"True"},
{"colname":"Transaction","KeepValue":"False","RemoveTrues":"False"}]
for conf in config:
    if conf["RemoveTrues"] == "True":
        df = df[df[conf['colname']] == conf['KeepValue']]
        print(df)
    else:
        df = df[(df[conf['colname']]== conf["KeepValue"]) | df["Label"]]
        print(df)
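The suggestion to switch from strings to booleans could look like the sketch below. It uses a trimmed copy of the toy data, and the variable names bool_cols and filtered are assumptions for illustration. With real booleans the "| Label" part becomes an ordinary element-wise OR of two boolean columns.

```python
import pandas as pd

data = pd.DataFrame({
    "Subject": ["1", "2", "3", "3", "4", "5", "5"],
    "Accept": ["True", "False", "True", "True", "False", "True", "True"],
    "Transaction": ["True", "True", "False", "False", "True", "True", "False"],
    "Label": ["True", "True", "True", "False", "True", "True", "True"],
})

# Convert the "True"/"False" strings into real booleans, column by column.
bool_cols = ["Accept", "Transaction", "Label"]
for col in bool_cols:
    data[col] = data[col] == "True"

# Same config as in the question, but with boolean values.
config = [
    {"colname": "Accept", "KeepValue": True, "RemoveTrues": True},
    {"colname": "Transaction", "KeepValue": False, "RemoveTrues": False},
]

filtered = data
for conf in config:
    mask = filtered[conf["colname"]] == conf["KeepValue"]
    if not conf["RemoveTrues"]:
        # Also keep rows whose Label is True, as in the original else branch.
        mask = mask | filtered["Label"]
    # Reassign so the next filter runs on the already filtered rows.
    filtered = filtered[mask]

print(filtered)
```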

Python, pandas exclude outliers function

I tried to exclude a few outliers from a pandas dataframe, but the function just returns the same table without any difference. I can't figure out why.
# excluding outliers
def exclude_outliers(DataFrame, col_name):
    interval = 2.5*DataFrame[col_name].std()
    mean = DataFrame[col_name].mean()
    m_i = mean + interval
    DataFrame = DataFrame[DataFrame[col_name] < m_i]
outlier_column = ['util_linhas_inseguras', 'idade', 'vezes_passou_de_30_59_dias', 'razao_debito', 'salario_mensal', 'numero_linhas_crdto_aberto',
                  'numero_vezes_passou_90_dias', 'numero_emprestimos_imobiliarios', 'numero_de_vezes_que_passou_60_89_dias', 'numero_de_dependentes']
for col in outlier_column:
    exclude_outliers(df_train, col)
df_train.describe()
As written, your function doesn't return anything, so your for loop makes no changes to df_train: the assignment inside the function only rebinds the local name DataFrame. Return the filtered DataFrame at the end of your function:
def exclude_outliers(DataFrame, col_name):
    ... # Function filters the DataFrame
    # Add this line to return the filtered DataFrame
    return DataFrame
And then modify your for loop to update df_train:
for col in outlier_column:
    # Now we update the DataFrame on each iteration
    df_train = exclude_outliers(df_train, col)
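Put together, the fix can be tried on a small frame. The sketch below keeps the question's rule (drop rows above mean + 2.5 standard deviations, upper tail only); the column names a and b and their values are made up for illustration.

```python
import pandas as pd

def exclude_outliers(frame, col_name):
    """Drop rows whose value in col_name is at or above mean + 2.5 * std."""
    interval = 2.5 * frame[col_name].std()
    upper_limit = frame[col_name].mean() + interval
    # Return the filtered frame so the caller can keep it.
    return frame[frame[col_name] < upper_limit]

# Toy data: column "a" has one extreme value, column "b" has none.
df_train = pd.DataFrame({
    "a": [10] * 20 + [1000],
    "b": list(range(21)),
})

for col in ["a", "b"]:
    # Reassign on every iteration so each filter builds on the previous one.
    df_train = exclude_outliers(df_train, col)

print(df_train.describe())
```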

How to return a non empty data frame among multiple newly created data frames inside a function in python?

I am new to pandas, I have a doubt in returning a data frame from a function. I have a function which creates three new data frames based on the parameters given to it, the function has to return only the data frames which are non-empty. How do I do that?
my code:
def df_r(df,colname,t1):
    t1_df = pd.DataFrame()
    t2_df = pd.DataFrame()
    t3_df = pd.DataFrame()
    if t1 :
        for colname in df:
            some code
            some code
            t1_df = some data
    if t2 :
        for colname in df:
            some code
            some code
            t2_df = some data
    if t3 :
        for colname in df:
            some code
            some code
            t3_df = some data
    list = [t1_df,t2_df,t3_df]
Now it should return only t1_df, because only the t1 parameter was given. So I have put all three into a list:
list = [t1_df,t2_df,t3_df]
How do I check which df is non-empty and return it?
Just check the empty attribute of each DataFrame, e.g.
df = pd.DataFrame()
if df.empty:
    print("DataFrame is empty")
output:
DataFrame is empty
df.empty returns True if the DataFrame is empty, else it returns False.
This works even if column names are present but there are no rows.
So, to answer your specific case:
list = [t1_df,t2_df,t3_df]
for df in list:
    if not df.empty:
        return df
This assumes only one of the DataFrames is non-empty. Written out explicitly:
if not t1_df.empty:
    return t1_df
elif not t2_df.empty:
    return t2_df
else:
    return t3_df
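If more than one DataFrame can be non-empty, a list comprehension over the empty attribute keeps all of them. A small sketch; non_empty_frames is a hypothetical helper name and the three frames are made-up stand-ins for t1_df, t2_df and t3_df:

```python
import pandas as pd

def non_empty_frames(frames):
    """Return only the DataFrames that contain at least one row."""
    return [frame for frame in frames if not frame.empty]

t1_df = pd.DataFrame({"x": [1, 2]})
t2_df = pd.DataFrame()                 # no columns, no rows
t3_df = pd.DataFrame(columns=["x"])    # columns but no rows: still empty

result = non_empty_frames([t1_df, t2_df, t3_df])
print(len(result))
```

The caller can then take result[0] when exactly one frame is expected, or handle an empty list when none qualified.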

calculate a value basis some conditions and assign it to a new variable

I am calculating a value of net balance from a condition and I want to store it in a new variable altogether. The new variable should store this calculated value as integer or a float and not as an array
I have tried the following code:
#variable = something if condition else something_else
mar_final_bal = x_start_bal+df2['credit_line']+df2['Net_Balance'] if
df2['month' == 'March-2016']
apr_final_bal = mar_final_bal+df2['credit_line']+df2['Net_Balance'] if
df2['month' == 'Apr-2016']
mar_final_bal and apr_final_bal are my two variables that I want to create using the conditions on the right side
It is evident that you are new to using Pandas. The syntax looks more pseudo-like than pandas code. IIUC, this is what you meant:
mar_final_bal = x_start_bal+df2.loc[df2['month'] == 'March-2016', 'credit_line'].sum() + df2.loc[df2['month'] == 'March-2016', 'Net_Balance'].sum()
apr_final_bal = mar_final_bal+df2.loc[df2['month'] == 'Apr-2016', 'credit_line'].sum() + df2.loc[df2['month'] == 'Apr-2016', 'Net_Balance'].sum()
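The result is a single number because the boolean mask inside .loc selects the matching rows and .sum() collapses them to a scalar. A minimal sketch with made-up data: the values in df2 and x_start_bal, and the helper name month_total, are assumptions for illustration only.

```python
import pandas as pd

df2 = pd.DataFrame({
    "month": ["March-2016", "March-2016", "Apr-2016"],
    "credit_line": [100, 50, 20],
    "Net_Balance": [10, 5, 3],
})
x_start_bal = 1000

def month_total(frame, month):
    """Sum credit_line and Net_Balance over the rows of one month."""
    mask = frame["month"] == month
    total = frame.loc[mask, "credit_line"].sum() + frame.loc[mask, "Net_Balance"].sum()
    # Convert the NumPy scalar to a plain Python float.
    return float(total)

mar_final_bal = x_start_bal + month_total(df2, "March-2016")
apr_final_bal = mar_final_bal + month_total(df2, "Apr-2016")
print(mar_final_bal, apr_final_bal)
```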

How can i save a Panda dataframe to a django model?

I am trying to read a CSV file using pandas, parse it, and then upload the results to my Django database. For now I am converting each DataFrame column to a list and then iterating over the lists to save the rows in the DB. But my solution is inefficient when the lists are really big. How can I make it better?
fileinfo = pd.read_csv(csv_file, sep=',',
                       names=['Series_reference', 'Period', 'Data_value', 'STATUS',
                              'UNITS', 'Subject', 'Group', 'Series_title_1', 'Series_title_2',
                              'Series_title_3','Series_tile_4','Series_tile_5'],
                       skiprows = 1)
# serie = fileinfo[fileinfo['Series_reference']]
s = fileinfo['Series_reference'].values.tolist()
p = fileinfo['Period'].values.tolist()
d = fileinfo['Data_value'].values.tolist()
st = fileinfo['STATUS'].values.tolist()
u = fileinfo['UNITS'].values.tolist()
sub = fileinfo['Subject'].values.tolist()
gr = fileinfo['Group'].values.tolist()
stt= fileinfo['Series_title_1'].values.tolist()
while count < len(s):
    b = Testdata(
        Series_reference = s[count],
        Period = p[count],
        Data_value = d[count],
        STATUS = st[count],
        UNITS = u[count],
        Subject = sub[count],
        Group = gr[count],
        Series_title_1 = stt[count]
    )
    b.save()
    count = count + 1
You can use the pandas apply function. Pass axis=1 to apply a given function to every row:
df.apply(
    creational_function,  # Function that creates your structure
    axis=1,               # Apply to every row
    args=(arg1, arg2)     # Additional args to creational_function
)
In creational_function the first argument received is the row, whose columns you can access by name just as in the original DataFrame:
def creational_function(row, arg1, arg2):
    s = row['Series_reference']
    # For brevity I skip the other arguments...
    # Create TestData
    # Save
Note that arg1 and arg2 are the same for every row.
If you want to do something more with your created TestData objects, you can change creational_function to return a value; df.apply will then return a Series containing the values returned by the passed function, one per row.
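The row-wise apply can be sketched without a Django project by using a stand-in class. Everything named here is an assumption for illustration: RecordStub replaces the real Testdata model, make_record plays the role of creational_function, and the two rows of data are made up.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class RecordStub:
    """Hypothetical stand-in for the Django model Testdata."""
    Series_reference: str
    Period: str
    Data_value: float

def make_record(row):
    # row is one line of the DataFrame; columns are accessed by name.
    return RecordStub(
        Series_reference=row["Series_reference"],
        Period=row["Period"],
        Data_value=row["Data_value"],
    )

fileinfo = pd.DataFrame({
    "Series_reference": ["A1", "A2"],
    "Period": ["2020.03", "2020.06"],
    "Data_value": [1.5, 2.5],
})

# axis=1 calls make_record once per row and collects the returned objects.
records = fileinfo.apply(make_record, axis=1).tolist()
print(records)

# With a real Django model, the collected objects could be inserted in one
# query with Testdata.objects.bulk_create(records) instead of calling
# save() on each one.
```

Collecting the objects first and saving them in one batch is usually what removes the per-row database round trip; apply by itself still visits every row in Python.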
