Snowpark-Python Dynamic Join

I have searched through a large amount of documentation trying to find an example of what I'm attempting to do. I admit that the bigger issue may be my lack of Python expertise, so I'm reaching out here in hopes that someone can point me in the right direction. I am trying to create a Python function that dynamically queries tables based on its parameters. Here is an example of what I'm trying to do:
def validateData(_ses, table_name, sel_col, join_col, data_state, validation_state):
    sdf_t1 = _ses.table(table_name).select(sel_col).filter(col('state') == data_state)
    sdf_t2 = _ses.table(table_name).select(sel_col).filter(col('state') == validation_state)
    df_join = sdf_t1.join(sdf_t2, [sdf_t1[i] == sdf_t2[i] for i in join_col], 'full')
    return df_join.to_pandas()
This would be called like this:
df = validateData(ses, 'table_name', [col('c1'), col('c2')], [col('c2'), col('c3')], 'AZ', 'TX')
The issue I'm having is with the join line of the function:
df_join = sdf_t1.join(sdf_t2, [col(sdf_t1[i]) == col(sdf_t2[i]) for i in join_col],'full')
I know that code is incorrect, but I'm hoping it explains what I'm trying to do. If anyone has any advice on whether this is possible, or how to do it, I would greatly appreciate it.

Instead of joining DataFrames, I think it's easier to use direct SQL to pull the data into a Snowpark DataFrame and then convert it to a pandas DataFrame.
from snowflake.snowpark import Session
import pandas as pd

# Snowpark DataFrame creation using SQL
data = session.sql("select t1.col1, t2.col2, t2.col2 from mytable t1 full outer join mytable2 t2 on t1.id=t2.id where t1.col3='something'")

# Convert the Snowpark DataFrame to a pandas DataFrame, which you can then use.
data = pd.DataFrame(data.collect())
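Note that a Snowpark DataFrame also exposes to_pandas() (the question itself uses it), which runs the query and returns a pandas DataFrame with the query's column names in a single call. A minimal sketch, reusing the hypothetical session and table names from the snippet above:

# 'session', 'mytable', and 'mytable2' are assumed to exist as above.
df = session.sql(
    "select t1.col1, t2.col2 "
    "from mytable t1 full outer join mytable2 t2 on t1.id = t2.id "
    "where t1.col3 = 'something'"
).to_pandas()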

Essentially what you need is to build a Python expression from the two lists of columns. I don't have a better idea than using eval.
Maybe try eval(" & ".join([f"(sdf_t1['{c}'] == sdf_t2['{c}'])" for c in join_col])), assuming join_col holds column-name strings. Be mindful that I have not fully tested this; it's just an idea to toss out.
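For what it's worth, eval can be avoided entirely: Snowpark column expressions can be combined with &, so the list of equality conditions can be folded into a single join condition. A minimal sketch, assuming join_col is a list of column-name strings such as ['c2', 'c3']:

from functools import reduce

def build_join_condition(sdf_t1, sdf_t2, join_col):
    # One equality expression per join column; join_col is assumed to
    # hold column-name strings, e.g. ['c2', 'c3'].
    conditions = [sdf_t1[c] == sdf_t2[c] for c in join_col]
    # AND all of the conditions together into one Column expression.
    return reduce(lambda a, b: a & b, conditions)

# Hypothetical usage inside the question's function:
# df_join = sdf_t1.join(sdf_t2, build_join_condition(sdf_t1, sdf_t2, ['c2', 'c3']), 'full')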

Related

Filtering with multiple conditions and creating new csv

Look at the variations of code I tried here
I'm trying to use Pandas to filter rows with multiple conditions and create a new csv file with only those rows. I've tried several different ways and then commented out each of those attempts (sometimes I only tried one condition for simplicity but it still didn't work). When the csv file is created, the filters weren't applied.
This is my updated code
I got it to work for condition #1, but I'm not sure how to add/apply condition #2. I tried a lot of different combinations. I know the code I put in the linked image wouldn't work for applying the 2nd condition because all I did was assign the variable, but it seemed too cumbersome to try to show all the ways I tried to do it. Any hints on that part?
import pandas as pd

df = pd.read_csv(excel_file_path)

# condition #1
is_report_period = (df["Report Period"]=="2015-2016") | \
                   (df["Report Period"]=="2016-2017") | \
                   (df["Report Period"]=="2017-2018") | \
                   (df["Report Period"]=="2018-2019")

# condition #2
is_zip_code = (df["Zip Code"]<"14800")

new_df = df[is_report_period]
You can easily achieve this by combining the conditions with '&':
new_df = df[is_report_period & is_zip_code]
You can also make your code more readable, and easier to change when you adjust the filtering, by using this method:
Periods = ["2015-2016","2016-2017","2017-2018","2018-2019"]
is_report_period = df["Report Period"].isin(Periods)
This way you can easily alter the filter when needed, and it's easier to maintain.
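Putting the pieces together, since the goal was to write only the filtered rows to a new csv: a minimal sketch, assuming the same column names, hypothetical file paths, and that the zip code comparison is meant to be numeric (the string comparison < "14800" can misbehave when values differ in length):

import pandas as pd

df = pd.read_csv("input.csv")  # hypothetical input path

# Condition #1: keep only the listed report periods.
periods = ["2015-2016", "2016-2017", "2017-2018", "2018-2019"]
is_report_period = df["Report Period"].isin(periods)

# Condition #2: zip code below 14800, compared numerically.
is_zip_code = pd.to_numeric(df["Zip Code"], errors="coerce") < 14800

# Apply both conditions and write only the matching rows to a new csv.
new_df = df[is_report_period & is_zip_code]
new_df.to_csv("filtered.csv", index=False)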

Why is data lost when multi-label encoding in Python pandas, and how to solve it

Download the Data Here
Hi, I have some data like the sample below, and I would like to multi-label it.
I want it to look something like this: target
But the problem is that data is lost when I multi-label it, something like below:
issue
using this code:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(sparse_output=True)
df_enc = df.drop('movieId', 1).join(df.movieId.str.join('|').str.get_dummies())
Someone can help me, feel free to download the dataset, thank you.
That column, when read in with pandas, will be stored as a string, so first we need to convert it to an actual list.
From there, use .explode() to expand the list into a Series (where the index will match the index it came from, and the values will be the values in the list).
Then crosstab that Series, so that each row lines up with the original index and each column is one of the values.
Then join that back up with the dataframe on the index values.
Keep in mind that when you do one-hot encoding with high cardinality, your table will blow up into a huge, wide table. I just did this on the first 20 rows and ended up with 233 columns; with the 225,000+ rows it'll take a while (maybe a minute or so) to process, and you'll end up with close to 1300 columns. This may be too complex for machine learning to do anything useful with (although it might work with deep learning). You could still try it and see what you get. What I would suggest is to find a way to simplify it a bit to make it less complex. Perhaps find a way to combine movie ids into a set number of genres or something like that? Then test whether simplifying it improves your model/performance.
import pandas as pd
from ast import literal_eval

df = pd.read_csv('ratings_action.csv')

# The list-like column is read in as a string; parse it into an actual list.
df.movieId = df.movieId.apply(literal_eval)

# Expand each list into one row per element, keeping the original index.
s = df['movieId'].explode()

# Cross-tabulate index vs. value to one-hot encode, then join back on the index.
df = df[['userId']].join(pd.crosstab(s.index, s))

How to get data from object in Python

I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this :
import pandas as pd
df = pd.read_json(path_to_your/file.json)
The output will be a DataFrame, a table-like structure in which the JSON attributes become the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized for processing time.
Here is the official documentation, take a look.
Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
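If the object is plain parsed JSON (nested dictionaries) rather than an object exposing attribute access, the same path is walked with dictionary lookups instead. A minimal sketch with a hypothetical payload shaped like the path above, using .get() so a missing key yields None instead of an exception:

import json

# Hypothetical payload mirroring the attribute path described above.
raw = '{"json_data": {"attributes": {"social_connections": {"discord": {"user_id": "1234567890"}}}}}'
my_object = json.loads(raw)

# Walk the nested dicts; .get() returns None if any level is missing.
user_id = (
    my_object.get("json_data", {})
             .get("attributes", {})
             .get("social_connections", {})
             .get("discord", {})
             .get("user_id")
)
print(user_id)  # -> 1234567890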

How to pass parameters from one function to the next in Python?

I have a basic question about how to structure my code.
I'm creating a simple gui to search and return my company's financial data. This data exists in a series of excel files, and I use pandas to merge, filter, and return tables or values. My present code is quite inefficient, whereby I import relevant Excel files each time I call a search. I would rather import these Excel files upon launch and commit them to the program's memory while the program runs.
I believe my attempt fails because I don't know how to pass values from one function to the next. I'm sure that I'm using self incorrectly. I'm looking for best practices here and a Pythonic approach. Thank you in advance!
import pandas as pd

def getData(self):
    self.Excel1 = pd.read_excel(r'asdf')
    self.Excel2 = pd.read_excel(r'fdsa')

def func1():
    df1 = getData.Excel1
    df2 = getData.Excel2
    df3 = df1 + df2
    return df3

func1()
There are ways to pass a function as an argument, and GeeksforGeeks has a great article on 'decorators' that does exactly that. Link below:
https://www.geeksforgeeks.org/passing-function-as-an-argument-in-python/
However, could you perhaps just combine the two functions into one? i.e.:
import pandas as pd

def getData():
    d1 = pd.read_excel(r'asdf')
    d2 = pd.read_excel(r'fdsa')
    d3 = d1 + d2
    return d3
I think the advantage of doing this is that you reduce the number of things Python needs to hold in memory; the disadvantage is that you won't be able to access d1 or d2 individually afterwards.
I hope this helps; I can't think of anything else based on the information in the question.
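Since the goal is to load the Excel files once at launch and reuse them across searches (which is exactly what self is for), a class is the idiomatic place for that state. A minimal sketch, with hypothetical file paths:

import pandas as pd

class FinancialData:
    def __init__(self):
        # Load the Excel files once, when the object is created.
        self.excel1 = pd.read_excel(r'path/to/first.xlsx')   # hypothetical path
        self.excel2 = pd.read_excel(r'path/to/second.xlsx')  # hypothetical path

    def combined(self):
        # Later searches reuse the frames already held in memory.
        return self.excel1 + self.excel2

# Load once at launch, then call methods as often as needed
# without re-reading the files.
data = FinancialData()
df3 = data.combined()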

How to optimise the performance of this code?

I am trying to run the code below. It works fine for small data sizes, but for larger ones it takes almost a day.
Can anyone help optimise the code, or suggest a better approach? Can we use apply/lambda to solve the issue?
for index in df.index:
    for i in df.index:
        if ((df.loc[index, "cityId"] == df.loc[i, "cityId"]) &
            (df.loc[index, "landingPagePath"] == df.loc[i, "landingPagePath"]) &
            (df.loc[index, "exitPagePath"] == df.loc[i, "exitPagePath"]) &
            (df.loc[index, "campaign"] == df.loc[i, "campaign"]) &
            (df.loc[index, "pagePath"] == df.loc[i, "previousPagePath"]) &
            ((df.loc[index, "dateHourMinute"] + timedelta(minutes=math.floor(df.loc[index, "timeOnPage"] / 60)) == df.loc[i, "dateHourMinute"]) |
             (df.loc[index, "dateHourMinute"] == df.loc[i, "dateHourMinute"]) |
             ((df.loc[index, "dateHourMinute"] + timedelta(minutes=math.floor(df.loc[index, "timeOnPage"] / 60)) + timedelta(minutes=1)) == df.loc[i, "dateHourMinute"]))):
            if df.loc[i, "sess"] == 0:
                df.loc[i, 'sess'] = df.loc[index, 'sess']
            elif df.loc[index, "sess"] > df.loc[i, "sess"]:
                df.loc[index, 'sess'] = df.loc[i, 'sess']
            elif df.loc[index, "sess"] == 0:
                df.loc[index, 'sess'] = df.loc[i, 'sess']
            elif df.loc[index, "sess"] < df.loc[i, "sess"]:
                x = df.loc[i, "sess"]
                for q in df.index:
                    if df.loc[q, "sess"] == x:
                        df.loc[q, "sess"] = df.loc[index, 'sess']
        else:
            if df.loc[index, "sess"] == 0:
                df.loc[index, 'sess'] = max(df["sess"]) + 1
It looks like you're trying to do a database "join" manually. Pandas exposes this functionality as a merge, and using it would go a long way towards solving your issue.
I'm having trouble following all of your branches, but you should be able to get most of the way there if you use a merge and then do some post-processing / filtering to get the final answer.
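A minimal sketch of the merge idea, assuming the same column names as the loop above: self-merge the frame on the exact-match columns so only candidate pairs remain, then apply the time condition to those pairs instead of comparing every row against every row in Python:

import math
from datetime import timedelta

import pandas as pd

# Self-merge on the columns the loop compares for exact equality;
# pagePath on the left must equal previousPagePath on the right.
pairs = df.merge(
    df,
    left_on=["cityId", "landingPagePath", "exitPagePath", "campaign", "pagePath"],
    right_on=["cityId", "landingPagePath", "exitPagePath", "campaign", "previousPagePath"],
    suffixes=("_a", "_b"),
)

# Reproduce the time condition on the much smaller set of candidate pairs.
shifted = pairs["dateHourMinute_a"] + pairs["timeOnPage_a"].apply(
    lambda t: timedelta(minutes=math.floor(t / 60))
)
mask = (
    (shifted == pairs["dateHourMinute_b"])
    | (pairs["dateHourMinute_a"] == pairs["dateHourMinute_b"])
    | (shifted + timedelta(minutes=1) == pairs["dateHourMinute_b"])
)
matches = pairs[mask]

# The session-numbering branches would then run over 'matches' only.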
