How to pass parameters from one function to the next in Python? - python

I have a basic question about how to structure my code.
I'm creating a simple gui to search and return my company's financial data. This data exists in a series of excel files, and I use pandas to merge, filter, and return tables or values. My present code is quite inefficient, whereby I import relevant Excel files each time I call a search. I would rather import these Excel files upon launch and commit them to the program's memory while the program runs.
I believe that my attempt fails because I don't know how to pass arguments from one function to the next. I'm sure that I'm using this "self" operator incorrectly. Looking for best practices here, and a Pythonic approach. Thank you in advance!
import pandas as pd
def getData(self):
    self.Excel1 = pd.read_excel(r'asdf')
    self.Excel2 = pd.read_excel(r'fdsa')
def func1():
    df1 = getData.Excel1
    df2 = getData.Excel2
    df3 = df1 + df2
    return df3
func1()

Functions can be passed as arguments in Python, and GeeksforGeeks has an article on that technique, which is also what decorators build on. Link below:
https://www.geeksforgeeks.org/passing-function-as-an-argument-in-python/
However, could you perhaps just combine the two functions into one? For example:
def getData():
    d1 = pd.read_excel(r'asdf')
    d2 = pd.read_excel(r'fdsa')
    d3 = d1 + d2
    return d3
I think the advantage of doing this is that d1 and d2 are local to the function, so Python can release them once it returns and only d3 stays in memory. The disadvantage is that you can no longer access d1 or d2 afterwards.
I hope this helps; I can't think of anything else based on the information in the question.
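To address the original goal of reading the Excel files only once, one option is to load them in a single function, return the DataFrames, and pass them as arguments to whatever needs them. The following is a minimal sketch, not a definitive design: load_data, combine, the dictionary keys, and fake_reader are all hypothetical names, and fake_reader stands in for pd.read_excel so the example runs without real files.

```python
import pandas as pd


def load_data(paths, reader=pd.read_excel):
    """Read every file once and return the DataFrames keyed by a short name."""
    return {name: reader(path) for name, path in paths.items()}


def combine(data):
    """Work on the already-loaded frames instead of re-reading the files."""
    return data["excel1"] + data["excel2"]


def fake_reader(path):
    # Hypothetical stand-in for pd.read_excel so the sketch needs no files.
    return pd.DataFrame({"value": [1, 2, 3]})


# Load once at program start, then pass `data` to every search function.
data = load_data({"excel1": "asdf", "excel2": "fdsa"}, reader=fake_reader)
result = combine(data)
```

In a GUI, the dictionary returned by load_data could be created at launch and handed to each callback, so no search has to touch the disk again.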

Related

Snowpark-Python Dynamic Join

I have searched through a large amount of documentation to try to find an example of what I'm trying to do. I admit that the bigger issue may be my lack of Python expertise, so I'm reaching out here in hopes that someone can point me in the right direction. I am trying to create a Python function that dynamically queries tables based on the function's parameters. Here is an example of what I'm trying to do:
from snowflake.snowpark.functions import col
def validateData(_ses, table_name, sel_col, join_col, data_state, validation_state):
    sdf_t1 = _ses.table(table_name).select(sel_col).filter(col('state') == data_state)
    sdf_t2 = _ses.table(table_name).select(sel_col).filter(col('state') == validation_state)
    df_join = sdf_t1.join(sdf_t2, [sdf_t1[i] == sdf_t2[i] for i in join_col], 'full')
    return df_join.to_pandas()
This would be called like this:
df = validateData(ses,'table_name',[col('c1'),col('c2')],[col('c2'),col('c3')],'AZ','TX')
The issue I'm having is with the join line of the function:
df_join = sdf_t1.join(sdf_t2, [col(sdf_t1[i]) == col(sdf_t2[i]) for i in join_col],'full')
I know that code is incorrect, but I'm hoping it explains what I'm trying to do. If anyone has any advice on whether this is possible, and how, I would greatly appreciate it.
Instead of joining the data frames, I think it's easier to run the SQL directly, pull the result into a Snowpark data frame, and convert that to a pandas data frame.
from snowflake.snowpark import Session
import pandas as pd
# Assumes `session` is an already-created snowflake.snowpark.Session.
# Create the Snowpark DataFrame using SQL
data = session.sql("select t1.col1, t2.col2, t2.col2 from mytable t1 full outer join mytable2 t2 on t1.id=t2.id where t1.col3='something'")
# Convert the Snowpark DataFrame to a pandas DataFrame. You can use this pandas data frame.
data = pd.DataFrame(data.collect())
Essentially, what you need is to build a Python expression from two lists of variables. I don't have a better idea than using eval.
Maybe try eval(" & ".join(f"(sdf_t1['{i}'] == sdf_t2['{i}'])" for i in join_col)), assuming join_col holds the column names as strings. Be mindful that I have not tested this; it is only meant to toss out an idea.
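The same combined condition can be built without eval by folding the per-column comparisons together with `&`. The sketch below is an untested assumption as far as Snowpark is concerned: build_join_condition is a hypothetical helper, and it assumes join_col is a list of column-name strings (not col() objects) so that sdf_t1[name] yields a column. The helper itself is plain Python, so it is demonstrated here with dictionaries as stand-ins.

```python
from functools import reduce
from operator import and_


def build_join_condition(left, right, names):
    """AND together `left[name] == right[name]` for every name in `names`."""
    return reduce(and_, (left[name] == right[name] for name in names))


# Stand-in demonstration with plain dicts (hypothetical data, not Snowpark).
same = build_join_condition({"c2": 1, "c3": 2}, {"c2": 1, "c3": 2}, ["c2", "c3"])
different = build_join_condition({"c2": 1, "c3": 2}, {"c2": 1, "c3": 9}, ["c2", "c3"])

# Assumed Snowpark usage inside validateData (not executed here):
# condition = build_join_condition(sdf_t1, sdf_t2, join_col)
# df_join = sdf_t1.join(sdf_t2, condition, 'full')
```

Because the condition is assembled from real objects rather than from a string, a misspelled column name fails at the lookup instead of inside eval.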

Loop function to rename dataframes

I am new to coding, and I currently want to create an individual dataframe from each Excel tab. This works so far thanks to a search in this forum (I found a sample using a dictionary), but I need one more step that I can't figure out.
This is the code i am using:
import pandas as pd
excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
d = {}
for sheet in xls.sheet_names:
    print(sheet)
    d[f'{sheet}'] = pd.read_excel(xls, sheet_name=sheet)
Let's say I have 3 Excel tabs called 'alpha', 'beta' and 'charlie'.
The code above will give me 3 dataframes, and I can access them by typing d['alpha'], d['beta'] and d['charlie'].
What I want is to rename the dataframes so that instead of typing (for example) d['alpha'], I just need to write alpha (without any extras).
Edit: The Excel file I want to parse has 50+ tabs, and it can grow.
Edit 2: Thank you all for the links and the answers! They are a great help.
Don't rename them.
I can think of two scenarios here:
1. The sheets are fundamentally different
When people ask how to dynamically assign to variable names, the usual (and best) answer is "Use a dictionary".
Indeed, this is the reason Pandas does it this way!
In this case, my opinion is that your best move here is to do nothing, and just use the dictionary you have.
2. The sheets are roughly the same
If the sheets are all basically the same, and only differ by one attribute (e.g. they represent monthly sales and the names of the sheets are 'May', 'June', etc), then your best move is to merge them somehow, adding a column to reflect the sheet name (month, in my example).
Whatever you do, don't use exec or eval, no matter what anyone tells you. They are not options for beginner programmers.
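For the second scenario, one possible way to merge the sheets is pd.concat, which accepts a dictionary and uses its keys as an outer index level. This is only a sketch under assumptions: the dictionary d below is hypothetical stand-in data shaped like the one built in the question, and the column names month and sales are invented for illustration.

```python
import pandas as pd

# Stand-in for the dictionary of sheets from the question (hypothetical data).
d = {
    "May": pd.DataFrame({"sales": [10, 20]}),
    "June": pd.DataFrame({"sales": [30]}),
}

# Passing a dict to concat uses its keys as an outer index level;
# names= labels the levels, and reset_index turns "month" into a column.
combined = (
    pd.concat(d, names=["month", "row"])
    .reset_index(level="month")
    .reset_index(drop=True)
)
```

The result is a single dataframe with a month column, so adding a 51st tab requires no new variable name at all.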
I think you are looking for the built-in exec function, which executes strings as code.
But I do not recommend using exec; it is widely discussed why it shouldn't be used, or at least should be used cautiously.
As I do not have your data, I think it is achievable using the following code:
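import pandas as pd
excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
for sheet in xls.sheet_names:
    print(sheet)
    # {sheet!r} inserts the quoted sheet name; without the quotes the generated
    # code would look up an undefined variable. Sheet names must also be valid
    # Python identifiers for the assignment to work.
    code_to_execute = f'{sheet} = pd.read_excel(xls, sheet_name={sheet!r})'
    exec(code_to_execute)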
But again, I want to highlight that this is not the cleanest way to do it. Your approach is definitely cleaner; to be more precise, I would always use dicts for those kinds of assignments.
In general, you want to generate a string.
possible_string = 'a=10'
exec(possible_string)
print(a) # 10
You need to create variables which correspond to the three dataframes:
alpha, beta, charlie = d.values()
Edit:
Since you mentioned that the Excel file could have 50+ tabs and could grow, you may prefer to do it in your original loop. This can be done dynamically using exec:
import pandas as pd
excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
d = {}
for sheet in xls.sheet_names:
    print(sheet)
    exec(f'{sheet}' + " = pd.read_excel(xls, sheet_name=sheet)")
It might be better practice, however, to simply index your sheets and access them by index. A 50+ length collection of excel sheets is probably better organized by appending to a list and accessing by index:
d = []
for sheet in xls.sheet_names:
    print(sheet)
    d.append(pd.read_excel(xls, sheet_name=sheet))
# d[0] = alpha; d[1] = beta, and so on...

variable dataframe name - loop works by itself, but not inside of function

I have dataframes that follow name syntax of 'df#' and I would like to be able to loop through these dataframes in a function. In the code below, if function "testing" is removed, the loop works as expected. When I add the function, it gets stuck on the "test" variable with keyerror = "iris1".
import statistics
import seaborn as sns
iris1 = sns.load_dataset('iris')
iris2 = sns.load_dataset('iris')
def testing():
    rows = []
    for i in range(2):
        test = vars()['iris' + str(i + 1)]
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
testing()
The reason this will be valuable is because I am subsetting my dataframe df multiple times to create quick visualizations. So in Jupyter, I have one cell where I create visualizations off of df1,df2,df3. In the next cell, I overwrite df1,df2,df3 based on different subsetting rules. This is advantageous because I can quickly do this by calling a function each time, so the code stays quite uniform.
Store the datasets in a dictionary and pass that to the function.
import statistics
import seaborn as sns
datasets = {'iris1': sns.load_dataset('iris'), 'iris2': sns.load_dataset('iris')}
def testing(data):
    rows = []
    for i in range(1, 3):
        test = data[f'iris{i}']
        rows.append([
            statistics.mean(test['sepal_length']),
            statistics.mean(test['sepal_width'])
        ])
    return rows  # return the collected means so the caller can use them
testing(datasets)
No...
You should NEVER have to write a sentence like "I have dataframes that follow name syntax of 'df#'".
What you have is a list of dataframes, or a dict of dataframes, depending on how you want to index them...
Here I would say a list.
Then you can forget about vars(); trust me, you don't need it... :)
EDIT:
And with a list comprehension, your code could fit in three lines:
import statistics
import seaborn as sns
list_iris = [sns.load_dataset('iris'), sns.load_dataset('iris')]
rows = [
    (statistics.mean(test['sepal_length']), statistics.mean(test['sepal_width']))
    for test in list_iris
]
Storing the dataframes as a list or dictionary allowed me to create the function. There is still the problem that the number of dataframes in the list varies. It would be nice to be able to pass an argument n specifying how many objects are in the list (I guess I could just add a bunch of if statements to define the list based on such an argument). EDIT: I am changing my code so that I don't use the df# syntax and instead put the dataframes directly into a list.
The problem I was experiencing is still perplexing. I can't for the life of me figure out why the "test" variable behaves as expected outside of a function but fails inside of one. I'm going to go the route of creating a list of dataframes, but I am still curious to understand why it fails inside the function.
I agree with @Icarwiz that it might not be the best way to go about it, but you can make it work with eval:
test = eval('iris' + str(i + 1))
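On why the lookup fails inside the function: called with no argument, vars() behaves like locals(), so inside testing() it only sees the function's own local names, not the module-level iris1 and iris2. At the top level of a notebook cell the local and global namespaces are the same, which is why the loop worked there. The sketch below illustrates this with the standard library only; names_seen_by_vars, summarize, and the dictionaries standing in for dataframes are hypothetical. It also shows that iterating over the list directly handles any number of datasets, so no n argument is needed.

```python
import statistics


def names_seen_by_vars():
    local_value = 1
    # vars() with no argument acts like locals(): only this function's
    # own names are visible, never the module-level ones.
    return sorted(vars())


def summarize(datasets):
    """Return (mean sepal_length, mean sepal_width) for each dataset given."""
    return [
        (statistics.mean(data["sepal_length"]), statistics.mean(data["sepal_width"]))
        for data in datasets
    ]


# Plain dicts stand in for dataframes (hypothetical values); the list can
# hold any number of them.
datasets = [
    {"sepal_length": [5.0, 6.0], "sepal_width": [3.0, 3.5]},
    {"sepal_length": [4.0, 4.0], "sepal_width": [2.0, 3.0]},
]
rows = summarize(datasets)
```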

Pythonic way to return pandas dataframe with helper function and copy

I was wondering what the most Pythonic way to return a pandas dataframe from a class would be. I am curious whether I need to include .copy() when returning, and I would generally like to understand the pitfalls of not including it. I am using a helper function because the dataframe is accessed multiple times, and I don't want to return it from the manipulate_dataframe method.
My questions are:
Do I need to place .copy() after df when assigning it to a new object?
Do I need to place .copy() after self.final_df when returning it with the get_df helper function?
class df:
    def __init__(self):
        pass
    def manipulate_dataframe(self, df):
        """ DATAFRAME MANIPULATION """
        self.final_df = df
    def get_df(self):
        return self.final_df
Question: do you want the thing referred to as self.final_df to directly alter the original data frame, or do you want it to live its own life? Recall that in Python, assignment never copies an object; it only binds another name to it, in the sense that
y = x
creates two names for the same object, and if you modify the object through y, the change is also visible through x. Example:
>>> x = [3, 4, 6]
>>> y = x
>>> y[0] = 77
>>> x
[77, 4, 6]
The pandas df is just another example of the same Python fact.
Making a copy can be time consuming, and it disconnects your self.final_df from the input data frame. People often do that because pandas issues warnings (SettingWithCopyWarning) when you assign values into what may be a view of a data frame. The right place to start reading on this is the pandas documentation: https://pandas.pydata.org/docs/user_guide/indexing.html?highlight=copy%20versus%20view. There are also many blog posts about it; I prefer the Real Python one, https://realpython.com/pandas-settingwithcopywarning, but you will find others, such as https://www.dataquest.io/blog/settingwithcopywarning. A common way careless users silence that warning is to copy the data frame, which effectively "disconnects" the new copy from the old data frame.
I don't know why you might want to create a class that simply holds a pandas df. You could probably just write a function that does all of those data manipulations. If those commands are supposed to alter the original df, don't make a copy.
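The difference between sharing and copying can be seen in a few lines. This is a small illustrative sketch with made-up data; the names original, alias, and independent are hypothetical.

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2, 3]})

alias = original               # no copy: a second name for the same object
independent = original.copy()  # a separate dataframe with its own data

alias.loc[0, "a"] = 77         # visible through `original` as well
independent.loc[1, "a"] = -1   # does not touch `original`

print(original["a"].tolist())     # [77, 2, 3]
print(independent["a"].tolist())  # [1, -1, 3]
```

Applied to the class in the question, self.final_df = df shares the caller's dataframe, while self.final_df = df.copy() would give the class its own. The same choice applies to what get_df returns.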

Dataframe Generator Based On Conditions Pandas

I have manually created a bunch of dataframes to later concatenate back together, based on a list of bigrams I have (my reason for doing this is outside the scope of this question). The problem is that I want to set this code to run daily or weekly, and the manually created dataframes will no longer work if the data has changed once refreshed. For instance, looking at the code below, what if "data_science" is no longer a bigram being pulled by my code next week, and I have another bigram like "hello_world" that is not listed below? I need to set up one function that will do all of this for me. I am making about 50 dataframes from my real data, so even apart from the automation, it would be a huge time saver to get a function going for this. One key point is that I am grabbing all of these bigrams from a list and naming a dataframe after each one of them. That is what my function below, with the list input, is meant for.
```
data_science = df[df['column_name'].str.contains("data") &
                  df['column_name'].str.contains("science")]
data_science['bigram'] = "(data_science)"
p_value = df[df['column_name'].str.contains("p") &
             df['column_name'].str.contains("value")]
p_value['bigram'] = "(p_value)"
ab_testing = df[df['column_name'].str.contains("ab") &
                df['column_name'].str.contains("testing")]
ab_testing['bigram'] = "(ab_testing)"
```
I am trying something like this code below but have not figured out how to make it work yet.
```
def df_creator(a, b, my_list):
    for a, b in my_list:
        a_b = df[df['Message_stop'].str.contains(a) &
                 df['Message_stop'].str.contains(b)]
        a_b['bigram'] = "(a_b)"
```
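One possible shape for such a function is to loop over the bigrams, filter once per pair, label the rows, and concatenate the pieces, so no per-bigram variable names are needed. This is a sketch under assumptions, not a definitive answer: frames_for_bigrams is a hypothetical name, the bigrams are assumed to arrive as (first, second) pairs, the matching is plain substring matching as in the manual code, and the sample data is made up.

```python
import pandas as pd


def frames_for_bigrams(df, bigrams, column="column_name"):
    """Return one dataframe holding the rows that match each bigram pair."""
    frames = []
    for first, second in bigrams:
        # Plain substring tests (regex=False); na=False treats missing text
        # as "no match" instead of propagating NaN into the mask.
        mask = (
            df[column].str.contains(first, regex=False, na=False)
            & df[column].str.contains(second, regex=False, na=False)
        )
        # assign() returns a new dataframe, so the filtered slice of df is
        # never modified in place.
        frames.append(df[mask].assign(bigram=f"({first}_{second})"))
    return pd.concat(frames, ignore_index=True)


# Made-up sample data for illustration.
df = pd.DataFrame(
    {"column_name": ["data science rocks", "p value small", "ab testing of data science"]}
)
result = frames_for_bigrams(df, [("data", "science"), ("p", "value")])
```

Because the bigram list is an argument, next week's list can contain different pairs without any change to the code. If the individual pieces are still needed, a dictionary keyed by bigram would serve in place of the final concat.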
