PySpark - how to convert pandas code to PySpark - nested when

how are you?
So, I need to convert some code written in pandas to PySpark, but so far I'm still having some problems. I've tried using PySpark's when function, but the result is wrong. Can you help me? This is very simple and I'm just starting with PySpark.
df = df.assign(new_col=np.where(df.col1.isnull(),
                   np.where(df.col2.isnull(), df.col3,
                       np.where(df.col3.isnull(), df.col2, ((df.col2 + df.col3 / 2))
                   ), df.col1)))
The main goal here is to understand how to use nested when.
I thank you in advance!!

Without sample input and output it's hard to know what to expect, especially since your code does not run in pandas either: the parentheses don't line up, leaving one np.where call with four arguments, and np.where takes exactly three. Chained when expressions replace the nested np.where statement.
Code below
from pyspark.sql.functions import col, when
df.withColumn('newcol1', when(col('col1').isNull() & col('col2').isNull(), col('col3'))) \
  .withColumn('newcol1', when(col('col3').isNull() & col('col1').isNotNull(), col('col2')).otherwise(col('col3'))) \
  .withColumn('newcol1', when(col('col3').isNull() & col('col2').isNull(), col('col1')).otherwise(col('newcol1'))) \
  .show()
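Since the stated goal is to understand nested when, the same logic can also be written as one genuinely nested expression. A minimal sketch, assuming the intended average in the pandas code was (col2 + col3) / 2:
from pyspark.sql.functions import col, when

df = df.withColumn(
    'new_col',
    when(
        col('col1').isNull(),
        # The value branch is itself a when chain, mirroring the inner np.where calls
        when(col('col2').isNull(), col('col3'))
        .when(col('col3').isNull(), col('col2'))
        .otherwise((col('col2') + col('col3')) / 2)
    ).otherwise(col('col1'))
)
Here the second argument of the outer when is itself a when expression, which is the direct translation of nesting np.where calls.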

Related

Converting print output into a dataframe

I am currently working on a World Bank data project and would like to convert the following output into a simple pandas DataFrame.
import pandas as pd
# If you don't already have wbgapi, pip install wbgapi in a cmd prompt
import wbgapi
print(wbgapi.economy.info())
Note: you may need to pip install wbgapi if you do not already have it.
I wish to convert the output of this print statement, which is essentially a table, so that I can then use pandas functions (sort_values(), etc.) on it.
I tried what I was 99.9% sure wouldn't work, since the output is of course a string and not a dict:
economies = wbgapi.economy.info()
df = pd.DataFrame(economies)
As for approaches such as converting a string to a list: I cannot figure out how to handle the string in question (the output of print(wbgapi.economy.info())), given its spacing and column-like layout rather than being a block of text.
Any help would be greatly appreciated.
Try this:
df = wbgapi.economy.DataFrame()
I would suggest having a look at the documentation either way:
https://pypi.org/project/wbgapi/
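For instance, a quick sketch of what that returns (the 'name' column is an assumption about the metadata wbgapi exposes):
import wbgapi
df = wbgapi.economy.DataFrame()  # one row per economy, as a pandas DataFrame
df = df.sort_values('name')      # 'name' is an assumed column; adjust to the columns you see
print(df.head())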
To convert the output of wbgapi.economy.info() into a pandas DataFrame, you can use pandas' json_normalize() function (pd.json_normalize() in current pandas; the old pandas.io.json import path is deprecated). Here's an example code snippet that should work:
import pandas as pd
import wbgapi as wb

economies = wb.economy.info()
df = pd.json_normalize(economies)
In this code, we first import the necessary modules. Then we use wb.economy.info() to get the data we want to convert, and finally we use pd.json_normalize() to turn that data into a DataFrame.
Note that json_normalize() expects a list of dictionaries, so we don't need to do any additional string or list manipulation. We can pass the economies data directly to json_normalize().
Once you have the DataFrame, you can use Pandas functions like sort_values() and others to manipulate and analyze the data.

Snowpark-Python Dynamic Join

I have searched through a large amount of documentation to try to find an example of what I'm trying to do. I admit that the bigger issue may be my lack of Python expertise, so I'm reaching out here in hopes that someone can point me in the right direction. I am trying to create a Python function that dynamically queries tables based on the function's parameters. Here is an example of what I'm trying to do:
def validateData(_ses, table_name, sel_col, join_col, data_state, validation_state):
    sdf_t1 = _ses.table(table_name).select(sel_col).filter(col('state') == data_state)
    sdf_t2 = _ses.table(table_name).select(sel_col).filter(col('state') == validation_state)
    df_join = sdf_t1.join(sdf_t2, [sdf_t1[i] == sdf_t2[i] for i in join_col], 'full')
    return df_join.to_pandas()
This would be called like this:
df = validateData(ses, 'table_name', [col('c1'), col('c2')], [col('c2'), col('c3')], 'AZ', 'TX')
The issue I'm having is with the join line of the function:
df_join = sdf_t1.join(sdf_t2, [col(sdf_t1[i]) == col(sdf_t2[i]) for i in join_col],'full')
I know that code is incorrect, but I hope it explains what I'm trying to do. If anyone has any advice on whether this is possible, or how to do it, I would greatly appreciate it.
Instead of joining DataFrames, I think it's easier to use direct SQL, pull the data into a Snowpark DataFrame, and convert that to a pandas DataFrame.
from snowflake.snowpark import Session
import pandas as pd

# Snowpark DataFrame creation using SQL
data = session.sql("select t1.col1, t2.col2, t2.col2 from mytable t1 full outer join mytable2 t2 on t1.id=t2.id where t1.col3='something'")
# Convert the Snowpark DataFrame to a pandas DataFrame you can work with
data = pd.DataFrame(data.collect())
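Since the question's own function already calls to_pandas(), the collect() round-trip can also be skipped, which keeps the column names:
# Convert directly, preserving column names
data = session.sql("select t1.col1, t2.col2, t2.col2 from mytable t1 full outer join mytable2 t2 on t1.id=t2.id where t1.col3='something'").to_pandas()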
Essentially what you need is to create a Python expression from two lists of variables. I don't have a better idea than using eval.
Maybe try eval(" & ".join(["(col(sdf_t1[i]) == col(sdf_t2[i]))" for i in join_col])). Be mindful that I have not completely tested this; it's just to toss out an idea.
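An eval-free alternative is to fold the per-column equality predicates together with functools.reduce. A sketch under the assumption that join_col is passed as a list of column names (strings) rather than col() expressions:
from functools import reduce
from snowflake.snowpark.functions import col

def validateData(_ses, table_name, sel_col, join_col, data_state, validation_state):
    sdf_t1 = _ses.table(table_name).select(sel_col).filter(col('state') == data_state)
    sdf_t2 = _ses.table(table_name).select(sel_col).filter(col('state') == validation_state)
    # AND the per-column equality checks into a single boolean Column
    join_cond = reduce(lambda acc, c: acc & c, [sdf_t1[c] == sdf_t2[c] for c in join_col])
    return sdf_t1.join(sdf_t2, join_cond, 'full').to_pandas()

# Called with plain column names instead of col() expressions:
# df = validateData(ses, 'table_name', ['c1', 'c2'], ['c2', 'c3'], 'AZ', 'TX')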

How to get data from object in Python

I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything, and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json("path_to_your/file.json")
The output will be a DataFrame, i.e. a matrix in which the JSON attributes become the names of the columns. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized in terms of processing time.
Here is the official documentation, take a look.
Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
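For illustration, a hypothetical sketch of the same lookup on the raw JSON parsed into a dict (the file name and key path are assumptions based on the attribute path above):
import json

# 'pledge.json' is a hypothetical file holding the object from the question
with open('pledge.json') as f:
    data = json.load(f)

# Walk the assumed key path down to the Discord user id
user_id = data['attributes']['social_connections']['discord']['user_id']
print(user_id)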

Applying for loops on dataframe?

I am applying a for loop to a column in Python, but I am not able to execute it; it produces an error. I want the square of a column. Please see where I am making a mistake. I know I can do this with a lambda, but I want to do it the traditional way.
import pandas as pd

output = []
for i in pd.read_csv("infy.csv"):
    output.append(i['Close']**2)
print(output)
The whole point of pandas is not to loop:
output = pd.read_csv("infy.csv")['Close']**2
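If you do want the traditional loop, the error comes from the fact that iterating over a DataFrame yields its column names (strings), so i['Close'] fails. A sketch that loops over the column's values instead:
import pandas as pd

df = pd.read_csv("infy.csv")
output = []
# Iterate over the values of the 'Close' column, not over the DataFrame itself
for value in df['Close']:
    output.append(value ** 2)
print(output)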

Using Faker with Pandas (Python 3.5)

I have imported a CSV into a DataFrame and want to use Faker to mask the first and last names.
I am using the following code to call the function (list comprehension):
MasterDE['FirstName'] = [fake.first_name for i in range(MasterDE.FirstName.size)]
MasterDE['LastName'] = [fake.last_name for i in range(MasterDE.FirstName.size)]
When I try to inspect the data, I get bound-method descriptions instead of names.
I would like to have the actual values, not this description. I will appreciate pointers on this.
Figured out what was wrong: I missed the parentheses when calling the method.
MasterDE['LastName'] = [fake.last_name() for i in range(MasterDE.LastName.size)]
works fine.
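For completeness, a minimal sketch of masking both columns, assuming MasterDE already holds the CSV data:
from faker import Faker

fake = Faker()
# Note the parentheses: fake.first_name() returns a value,
# while fake.first_name is the bound method itself
MasterDE['FirstName'] = [fake.first_name() for _ in range(len(MasterDE))]
MasterDE['LastName'] = [fake.last_name() for _ in range(len(MasterDE))]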
