I need to convert the following pandas code into PySpark:
df['GRN_ratio'] = df['GRN Quantity'] / df.groupby(['File No','Delivery Note Number'])['GRN Quantity'].transform(sum)
For that I am using the following PySpark code, but I am not getting the expected output.
df.groupby(['File No','Delivery Note Number']).agg(F.sum('GRN Quantity').alias('GRN_Sum')) \
  .withColumn("GRN_ratio", F.col("GRN Quantity")/F.col("GRN_Sum"))
You can use a window function instead of a group by:
from pyspark.sql import functions as F, Window
df2 = df.withColumn('GRN_ratio',
    F.col('GRN Quantity') /
    F.sum('GRN Quantity').over(Window.partitionBy('File No', 'Delivery Note Number'))
)
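Your original groupBy attempt fails because after the aggregation the per-row GRN Quantity column is gone, so the sums would have to be joined back onto the original rows. If you prefer that route, here is a minimal sketch assuming the same column names:

from pyspark.sql import functions as F

grn_sums = df.groupBy('File No', 'Delivery Note Number') \
    .agg(F.sum('GRN Quantity').alias('GRN_Sum'))

df2 = df.join(grn_sums, ['File No', 'Delivery Note Number'], 'left') \
    .withColumn('GRN_ratio', F.col('GRN Quantity') / F.col('GRN_Sum'))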
I have a script which currently runs on a sample dataframe. Here's my code:
fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count().select('msisdn', *parameter_cut.columns).fillna(0)
After pivoting, some of the columns in parameter_cut may not be present in fd_orion_apps, and the select fails with an error like this:
Py4JJavaError: An error occurred while calling o1381.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`8602`' given input columns: [7537, 7011, 2658, 3582, 12120, 31049, 35010, 16615, 10003, 15067, 1914, 1436, 6032, 422, 10636, 10388, 877,...
You can separate the select into a different step. Then you will be able to use a conditional expression together with a list comprehension:
from pyspark.sql import functions as F
fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count()
fd_orion_apps = fd_orion_apps.select(
    'msisdn',
    *[c if c in fd_orion_apps.columns else F.lit(0).alias(c) for c in parameter_cut.columns]
)
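If you still want the nulls produced by the pivot replaced with zeros, as the .fillna(0) in your original line did, you can chain it on afterwards:

fd_orion_apps = fd_orion_apps.fillna(0)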
I am trying to do some source-to-target testing in PySpark. The first part is a column-by-column count comparison using a Lean Six Sigma method, making sure there are fewer than 3 discrepancies per 1,000,000 in each column. When I run this, though, the if statement throws a:
TypeError: Invalid argument, not a string or column: -276244 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Could anyone help?
import pyspark.sql.functions as f
from pyspark.sql.types import *

good_fields = []
bad_fields = {}
count_issues = {}
columns = list(spark.sql('show columns from tu_historical').toPandas()['col_name'])

for col in columns:
    print(col)
    df = spark.sql(f'select pid,fnum,{col} from historical_clean')
    df1 = spark.sql(f'select pid,fnum,{col} from historical1')
    # count issue testing
    if abs(df1.count()-df.count()) > df1.count()*.000003:
        count_issues[col] = df1.count()-df.count()
    test_df = df.join(df1,(df.num == df1.file) & (df1.pid == df.pid),'left').filter(df1[col]!=df[col])
It seems like your columns list has a funky value in it.
You may want to use this to get the column names instead:
columns = spark.sql('select * from tu_historical limit 0').columns
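If tu_historical is registered as a table or view, an equivalent (assuming just the table name) that avoids the round trip through pandas is:

columns = spark.table('tu_historical').columns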
I need to subtract 10 days from run_date and apply a filter on the dataframe. However, running the code below gives this error:
AnalysisException: "cannot resolve '2020-01-10' given input columns: [cust, activity_day];;\n'Filter (to_date(activity_day#1341, Some(YYYY-MM-DD)) > date_sub(cast(to_date('2020-01-10, Some(YYYY-MM-DD)) as date), 10))\n+- LogicalRDD [cust#1340L, activity_day#1341]\n"
Data:
df = spark.createDataFrame(
    [
        (123,"2020-01-01"),
        (123,"2020-01-01"),
        (123,"2019-01-01")
    ],
    ("cust", "activity_day")
)
Code:
from pyspark.sql.functions import *
subtract_days=10
run_date = to_date("2020-01-10","YYYY-MM-DD").cast("date")
df.filter(to_date(df["activity_day"],"YYYY-MM-DD") > date_sub(run_date,subtract_days)).show()
Can anyone help?
The to_date function only works on column expressions. Hence you need to turn the run_date literal value into a column using the lit function:
from pyspark.sql.functions import *
subtract_days=10
run_date = to_date(lit("2020-01-10"),"YYYY-MM-DD")
df.filter(to_date(df["activity_day"],"YYYY-MM-DD") > date_sub(run_date,subtract_days)).show()
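One caveat, independent of the lit fix: Spark date patterns are case-sensitive, and 'yyyy-MM-dd' is the conventional pattern for this data ('YYYY' means week-based year and 'DD' day-of-year, which newer Spark versions may reject outright). A sketch of the same filter with the lower-case pattern:

from pyspark.sql.functions import to_date, date_sub, lit

subtract_days = 10
run_date = to_date(lit("2020-01-10"), "yyyy-MM-dd")
df.filter(to_date(df["activity_day"], "yyyy-MM-dd") > date_sub(run_date, subtract_days)).show()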
I am new to Spark.
I have a DataFrame and I used the following command to group it by 'userid':
def test_groupby(df):
    return list(df)

high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
    lambda row: row.userid).mapValues(test_groupby)
It gives an RDD with the following structure:
(326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])
326033430 is the group key (the userid).
My question is: how can I convert this RDD back into a DataFrame? If I cannot do that, how can I get the values out of the Row objects?
Thank you.
You should just
from pyspark.sql.functions import *
high_volumn = self.df\
    .filter(self.df.outmoney >= 1000)\
    .groupBy('userid').agg(collect_list('col'))
and in the .agg method pass whatever you want to do with the rest of the data.
Follow this link : http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg
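If you do want to stay with the RDD version from your question, the grouped values are just lists of Row objects, so you can read fields by attribute. A minimal sketch, assuming the (userid, [Row(...), ...]) structure shown above and an active SparkSession:

# total outmoney per userid from the grouped RDD
totals = high_volumn.mapValues(lambda rows: sum(r.outmoney for r in rows))

# or flatten the groups back out and rebuild a DataFrame from the Rows
flat_df = high_volumn.flatMap(lambda kv: kv[1]).toDF()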
I'm trying to write a function to swap a dictionary of targets with results in a pandas dataframe. I'd like to match a tuple of values and swap in new values. I tried building it as follows, but the row select isn't working. I feel like I'm missing some critical function here.
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
target=("Mammals","Birds")
swapVals={("Cats","Parrots"):("Rats","Canaries")}
for x in swapVals:
    #Attempt 1:
    #testData.loc[x,target]=swapVals[x]
    #Attempt 2:
    testData[testData.loc[:,target]==x,target]=swapVals[x]
This was written in Python 2, but the basic idea should work for you. It uses the apply function:
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
swapVals={("Cats","Parrots"):("Rats","Canaries")}
target=["Mammals","Birds"]
def swapper(in_row):
    temp = tuple(in_row.values)
    if temp in swapVals:
        return list(swapVals[temp])
    else:
        return in_row

testData[target] = testData[target].apply(swapper, axis=1)
testData
Note that if you loaded the other keys into the dict, you could do the apply without the swapper function:
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
swapVals={("Cats","Parrots"):("Rats","Canaries"), ("Dogs","Cockatiels"):("Dogs","Cockatiels")}
target=["Mammals","Birds"]
testData[target] = testData[target].apply(lambda x: list(swapVals[tuple(x.values)]), axis=1)
testData
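Depending on your pandas version, the lists returned inside apply may not be written back into the two target columns by default. One hedged adjustment (same names as above) is to expand the result explicitly and assign the underlying values, so column-name alignment doesn't get in the way:

testData[target] = testData[target].apply(lambda x: list(swapVals[tuple(x.values)]), axis=1, result_type='expand').values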