Fill columns that are not available in a PySpark dataframe with zeros - python

I have a script which currently runs on a sample dataframe. Here's my code:
fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count().select('msisdn', *parameter_cut.columns).fillna(0)
After pivoting, some of the columns in parameter_cut may not be present in the dataframe fd_orion_apps, and it will give an error like this:
Py4JJavaError: An error occurred while calling o1381.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`8602`' given input columns: [7537, 7011, 2658, 3582, 12120, 31049, 35010, 16615, 10003, 15067, 1914, 1436, 6032, 422, 10636, 10388, 877,...

You can separate the select into a different step. Then you will be able to use a conditional expression together with a list comprehension.
from pyspark.sql import functions as F
fd_orion_apps = fd_orion_apps.groupBy('msisdn', 'apps_id').pivot('apps_id').count()
fd_orion_apps = fd_orion_apps.select(
    'msisdn',
    *[c if c in fd_orion_apps.columns else F.lit(0).alias(c) for c in parameter_cut.columns]
)
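For illustration, here is a minimal, self-contained sketch of the same pattern on toy data; the subscriber ids and app ids are made up and only show how columns missing after the pivot get backfilled with zeros:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy data: hypothetical subscribers and app ids
df = spark.createDataFrame(
    [('a', 'app1'), ('a', 'app2'), ('b', 'app2')],
    ['msisdn', 'apps_id'],
)

pivoted = df.groupBy('msisdn', 'apps_id').pivot('apps_id').count()

# columns expected downstream, including 'app3', which the pivot did not produce
expected = ['app1', 'app2', 'app3']

result = pivoted.select(
    'msisdn',
    *[c if c in pivoted.columns else F.lit(0).alias(c) for c in expected]
).fillna(0)  # fillna(0) also zeroes the nulls left by the pivot, as in the original code
result.show()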

Related

Get only the name of a DataFrame - Python - Pandas

I'm actually working on an ETL project with crappy data I'm trying to get right.
For this, I'm trying to create a function that would take the names of my DFs and export them to CSV files that would be easy for me to deal with in Power BI.
I've started with a function that will take my DFs and clean the dates:
df_liste = []

def facture(x):
    x = pd.DataFrame(x)
    for s in x.columns.values:
        if s.__contains__("Fact"):
            x.rename(columns={s: 'periode_facture'}, inplace=True)
            x['periode_facture'] = x[['periode_facture']].apply(
                lambda x: pd.to_datetime(x, format='%Y%m'))
If I don't set 'x' as a DataFrame, it doesn't work but that's not my problem.
As you can see, I have set a list variable which I would like to increment with the names of the DFs, and the names only. Unfortunately, after a lot of tries, I haven't succeeded yet so... There it is, my first question on Stack ever!
Just in case, this is the first version of the function I would like to have:
def export(x):
    for df in x:
        df.to_csv(f'{df}.csv', encoding='utf-8')
You'd have to set the name of your dataframe first using df.name (probably, when you are creating them / reading data into them)
Then you can access the name like a normal attribute
import pandas as pd
df = pd.DataFrame(data=[1, 2, 3])
df.name = 'my df'
and then you can use
df.to_csv(f'{df.name}.csv', encoding='utf-8')
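Note that .name here is just a plain Python attribute, so it is not preserved by most pandas operations and has to be set (or re-set) before exporting. Assuming each dataframe in your list has had .name set as above, the export function from the question could then look roughly like this (a sketch, not tested against your data; the name is hypothetical):
import pandas as pd

def export(dfs):
    # dfs is expected to be a list of DataFrames whose .name was set beforehand
    for df in dfs:
        df.to_csv(f'{df.name}.csv', encoding='utf-8')

df_a = pd.DataFrame(data=[1, 2, 3])
df_a.name = 'factures_2021'  # hypothetical name
export([df_a])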

Cleaning column names in pandas

I have a Dataframe I receive from a crawler that I am importing into a database for long-term storage.
The problem I am running into is that many of the dataframes have column names with uppercase letters and whitespace.
I have a fix for it but I was wondering if it can be done any cleaner than this:
def clean_columns(dataframe):
    for column in dataframe:
        dataframe.rename(columns={column: column.lower().replace(" ", "_")},
                         inplace=True)
    return dataframe
print(dataframe.columns)
Index(['Daily Foo', 'Weekly Bar'])
dataframe = clean_columns(dataframe)
print(dataframe.columns)
Index(['daily_foo', 'weekly_bar'])
You can try via the columns attribute:
df.columns = df.columns.str.lower().str.replace(' ', '_')
OR
via the rename() method:
df = df.rename(columns=lambda x: x.lower().replace(' ', '_'))
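A quick check with the column names from the question (a minimal sketch on an empty frame):
import pandas as pd

df = pd.DataFrame(columns=['Daily Foo', 'Weekly Bar'])
df.columns = df.columns.str.lower().str.replace(' ', '_')
print(df.columns)
# Index(['daily_foo', 'weekly_bar'], dtype='object')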

How to divide a column by few other sub columns in pyspark?

I need to convert the following python code into pyspark.
df['GRN_ratio'] = df['GRN Quantity'] / df.groupby(['File No', 'Delivery Note Number'])['GRN Quantity'].transform(sum)
For that I am using the following PySpark code, but I am not getting the expected output:
df.groupby(['File No', 'Delivery Note Number']).agg(F.sum('GRN Quantity').alias('GRN_Sum'))\
  .withColumn("GRN_ratio", F.col("GRN Quantity") / F.col("GRN_Sum"))
You can use a window function instead of a group by:
from pyspark.sql import functions as F, Window
df2 = df.withColumn('GRN_ratio',
                    F.col('GRN Quantity') /
                    F.sum('GRN Quantity').over(Window.partitionBy('File No', 'Delivery Note Number'))
)
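A minimal sketch on made-up data (the values are hypothetical) showing that each row's quantity is divided by the total for its ('File No', 'Delivery Note Number') group:
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('F1', 'D1', 10.0), ('F1', 'D1', 30.0), ('F2', 'D2', 5.0)],
    ['File No', 'Delivery Note Number', 'GRN Quantity'],
)

w = Window.partitionBy('File No', 'Delivery Note Number')
df2 = df.withColumn('GRN_ratio', F.col('GRN Quantity') / F.sum('GRN Quantity').over(w))
df2.show()
# the two F1/D1 rows get ratios 0.25 and 0.75; the single F2/D2 row gets 1.0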

KeyError: "['C18orf17', 'UHRF1', 'OLR1', 'TBC1D2', 'AXUD1'] not in index"

I have seen this error here. But my problem is not that.
I am trying to extract some columns of a large dataframe:
dfx = df1[["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", 'C18orf17','UHRF1', "CEBPD",
'OLR1', 'TBC1D2', 'AXUD1',"TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]]
It throws an error as follows:
KeyError: "['C18orf17', 'UHRF1', 'OLR1', 'TBC1D2', 'AXUD1'] not in index"
After removing the above columns, it started working fine:
dfx = df1[["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", "TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]]
But I want to ignore this error by skipping the column names that are not present and keeping only the ones that overlap. Any help appreciated.
Use Index.intersection to select only the columns from the list that exist:
L = ["THRSP", "SERHL2", "TARP", "ADH1C", "KRT4",
"SORD", "SERHL", 'C18orf17','UHRF1', "CEBPD",
'OLR1', 'TBC1D2', 'AXUD1',"TSC22D3",
"ADH1A", "VIPR1", "LRFN2", "ANKRD22"]
dfx = df1[df1.columns.intersection(L, sort=False)]
Or filter them with Index.isin; then you need DataFrame.loc with : first to select all rows, and the boolean mask to select the columns:
dfx = df1.loc[:, df1.columns.isin(L)]
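A minimal sketch with a small dummy frame (made-up columns) showing that both approaches keep only the columns that actually exist:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]], columns=['THRSP', 'SERHL2', 'ADH1A'])
L = ['THRSP', 'SERHL2', 'C18orf17', 'ADH1A']  # 'C18orf17' is not in df1

dfx1 = df1[df1.columns.intersection(L, sort=False)]
dfx2 = df1.loc[:, df1.columns.isin(L)]
print(list(dfx1.columns))  # ['THRSP', 'SERHL2', 'ADH1A']
print(list(dfx2.columns))  # ['THRSP', 'SERHL2', 'ADH1A']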

pyspark RDD to DataFrame

I am new to Spark.
I have a DataFrame and I used the following command to group it by 'userid'
def test_groupby(df):
    return list(df)

high_volumn = self.df.filter(self.df.outmoney >= 1000).rdd.groupBy(
    lambda row: row.userid).mapValues(test_groupby)
It gives an RDD with the following structure:
(326033430, [Row(userid=326033430, poiid=u'114233866', _mt_datetime=u'2017-06-01 14:54:48', outmoney=1127.0, partner=2, paytype=u'157', locationcity=u'\u6f4d\u574a', locationprovince=u'\u5c71\u4e1c\u7701', location=None, dt=u'20170601')])
326033430 is the big group.
My question is: how can I convert this RDD back to a DataFrame structure? If I cannot do that, how can I get values from the Row term?
Thank you.
You should just
from pyspark.sql.functions import *
high_volumn = self.df\
    .filter(self.df.outmoney >= 1000)\
    .groupBy('userid').agg(collect_list('col'))
and in the .agg method pass whatever you want to do with the rest of the data.
Follow this link: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.agg
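As a rough, self-contained sketch of that approach (with made-up values, collecting the poiid column as an example of "the rest of the data"):
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(326033430, '114233866', 1127.0), (326033430, '114233867', 2000.0), (1, 'x', 5.0)],
    ['userid', 'poiid', 'outmoney'],
)

high_volumn = df.filter(df.outmoney >= 1000) \
    .groupBy('userid') \
    .agg(collect_list('poiid').alias('poiids'))

high_volumn.show(truncate=False)
# the result stays a DataFrame the whole time, so there is nothing to convert back from an RDD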
