Trying to replicate a SQL statement in PySpark, getting "Column is not iterable" - python

Using PySpark to transform data in a DataFrame. The old extract used this SQL line:
case when location_type = 'SUPPLIER' then SUBSTRING(location_id,1,length(location_id)-3)
I brought in the data and loaded it into a DF, then was trying to do the transform using:
df = df.withColumn("location_id", F.when(df.location_type == "SUPPLIER",
                                         F.substring(df.location_id, 1, length(df.location_id) - 3))
                                   .otherwise(df.location_id))
The substring method takes an int as the third argument, but the length() method returns a Column object. I had no luck trying to cast it and haven't found a method that would accept the Column. I also tried using the expr() wrapper but could not make it work.
The supplier IDs look like 12345-01. The transform needs to strip the -01.

As you mention, you can use expr to call substring with indices that come from other columns, like this:
df = df.withColumn("location_id",
                   F.when(df.location_type == "SUPPLIER",
                          F.expr("substring(location_id, 1, length(location_id) - 3)")
                   ).otherwise(df.location_id)
)
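For reference, a minimal self-contained sketch (the session setup and sample rows are made up for illustration, not part of the question) showing the expr approach end to end:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data mirroring the question's ID format
df = spark.createDataFrame(
    [("12345-01", "SUPPLIER"), ("99999", "WAREHOUSE")],
    ["location_id", "location_type"],
)

df = df.withColumn(
    "location_id",
    F.when(
        df.location_type == "SUPPLIER",
        F.expr("substring(location_id, 1, length(location_id) - 3)"),
    ).otherwise(df.location_id),
)

df.show()
# SUPPLIER rows lose the trailing "-01"; other rows are unchanged.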

Related

Trying to Pass Pandas DataFrame to a Function and Return a Modified DataFrame

I'm trying to pass different pandas dataframes to a function that does some string modification (usually a str.replace operation on columns based on mapping tables stored in CSV files) and return the modified dataframes. I'm encountering errors, especially with handling the dataframe as a parameter.
The mapping table in CSV is structured as follows:
From(Str)     To(Str)   Regex(True/False)
A             A2
B             B2
CD (.*) FG    CD FG     True
My code looks something like this:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for index in range(df_mt.shape[0]):
        # If regex is true
        if df_mt.iloc[index][2] is True:
            # perform regex replacing
            df_p[p_df_col_name] = df_p[p_df_col_name].replace(to_replace=df_mt.iloc[index][0], value=df_mt.iloc[index][1], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(df_mt.iloc[index][0], df_mt.iloc[index][1])
    return df_p
df_new1 = apply_mapping_table1(df_old1, 'Target_Column1', 'MappingTable1.csv')
df_new2 = apply_mapping_table2(df_old2, 'Target_Column2', 'MappingTable2.csv')
I'm getting 'IndexError: single positional indexer is out-of-bounds' for 'df_mt.iloc[index][2]' and haven't even gotten to the portion where the actual replacement happens. Any suggestions to make it work, or even a better way to do the dataframe string replacements based on mapping tables?
You can use the .iterrows() function to iterate through the lookup table rows. Generally, .iterrows() is slow, but in this case the lookup table should be a small, manageable table, so it will be completely fine.
You can adapt your given function as in the following snippet:
def apply_mapping_table(p_df, p_df_col_name, p_mt_name):
    df_mt = pd.read_csv(p_mt_name)
    for _, row in df_mt.iterrows():
        # If regex is true
        if row['Regex(True/False)']:
            # perform regex replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(to_replace=row['From(Str)'], value=row['To(Str)'], regex=True)
        else:
            # perform normal string replacing
            p_df[p_df_col_name] = p_df[p_df_col_name].replace(row['From(Str)'], row['To(Str)'])
    return p_df
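For illustration, a small hedged sketch of calling the function above (the in-memory CSV and column name are invented for the example; pd.read_csv accepts file-like objects, so the StringIO stands in for 'MappingTable1.csv', and the Regex column is filled in explicitly with True/False):

import io
import pandas as pd

# Hypothetical mapping table supplied as an in-memory CSV
mapping_csv = io.StringIO(
    "From(Str),To(Str),Regex(True/False)\n"
    "A,A2,False\n"
    "B,B2,False\n"
    "CD (.*) FG,CD FG,True\n"
)

df_old = pd.DataFrame({'Target_Column1': ['A', 'B', 'CD 123 FG']})
df_new = apply_mapping_table(df_old, 'Target_Column1', mapping_csv)
print(df_new)
#   Target_Column1
# 0             A2
# 1             B2
# 2          CD FG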

How to get the date when using date_add in pyspark

I am trying to get the result of the date_add function in PySpark, but when I use the function it always returns a Column type. To see the actual result I have to add the result as a column to a dataframe, but I want the result to be stored in a variable. How can I store the resulting date?
df = spark.createDataFrame([('2015-04-08',)], ['dt'])
r = date_add(df.dt, 1)
print(r)
output:- Column<'date_add(dt, 1)'>
But I want output like below
output:- date.time(2015,04,09)
or
'2015-04-09'
date_add has to be used within a withColumn. If you want the desired output, consider a non-Spark approach using datetime and timedelta.
Alternatively, if your use case requires Spark, use the collect method like so:
r = df.withColumn('new_col', date_add(col('dt'), 1)).select('new_col').collect()
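A minimal sketch of both options, assuming the '2015-04-08' input from the question (collect returns a list of Row objects, so the value is pulled out by index):

from datetime import date, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_add

# Non-Spark approach: plain Python date arithmetic
r = date(2015, 4, 8) + timedelta(days=1)
print(r)  # 2015-04-09

# Spark approach: compute the column, then collect the value back to the driver
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('2015-04-08',)], ['dt'])
rows = df.withColumn('new_col', date_add(col('dt'), 1)).select('new_col').collect()
r = rows[0]['new_col']  # datetime.date(2015, 4, 9)
print(r)  # 2015-04-09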

Python: mapper function argument doesn't take effect when passing the Rename method to a DataFrame

I am a relative beginner with the Python programming language.
I have a dataframe defined as the following:
Record1 = pd.Series({'Type':'Fork', 'Material':'metal'})
Record2 = pd.Series({'Type':'Knife','Material':'Plastic'})
Record3 = pd.Series({'Type':'Spoon','Material':'wood'})
S = pd.DataFrame([Record1,Record2,Record3])
I try to apply the rename method to it, with a mapper function intended to convert all the characters to uppercase:
S.rename(mapper = lambda x:x.upper(), axis = 1, inplace = True)
I expected to get the following dataframe with all its strings transformed to uppercase:
    TYPE MATERIAL
0   Fork    metal
1  Knife  Plastic
2  Spoon     wood
But when I type S after having applied the mapper function, I still get it with lowercase characters.
Can anyone help me?
Your rename() lambda expression is manipulating the column headers, but not the values. Check this by using S.columns before and after the call to rename().
S.applymap(lambda x : x.capitalize())
This will manipulate the values, not the column headers.
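A short sketch contrasting the two calls on the S dataframe from the question (using upper() rather than capitalize(), since the stated goal is uppercase; applymap applies the function element-wise and returns a new dataframe):

import pandas as pd

Record1 = pd.Series({'Type': 'Fork', 'Material': 'metal'})
Record2 = pd.Series({'Type': 'Knife', 'Material': 'Plastic'})
Record3 = pd.Series({'Type': 'Spoon', 'Material': 'wood'})
S = pd.DataFrame([Record1, Record2, Record3])

S.rename(mapper=lambda x: x.upper(), axis=1, inplace=True)  # uppercases the headers only
S = S.applymap(lambda x: x.upper())                         # uppercases every value
print(S)
#     TYPE MATERIAL
# 0   FORK    METAL
# 1  KNIFE  PLASTIC
# 2  SPOON     WOOD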

Select all rows in Python pandas

I have a function that aims at printing the sum along a column of a pandas DataFrame after filtering on some rows (to be defined), and the percentage this quantity makes up of the same sum without any filter:
def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that doesn't apply any filter (i.e. keeps all rows), so I can keep using my function (which is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would do: df[filter_f1] = df and could be used with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object, i.e. slice(None), acts as an object that selects all indexes in an indexable object. So df[slice(None)] selects all rows in the DataFrame, just like df[:]. You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None)  # initialize to select all rows
...  # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
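As a quick illustration (with a made-up DataFrame and column name), the slice default behaves like the no-op filter the question asks for:

import numpy as np
import pandas as pd

def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))

df = pd.DataFrame({'price': [100, 600, 800]})

my_function(df, slice(None), 'price')        # 1500 and 1.0  (keeps every row)
my_function(df, df['price'] > 500, 'price')  # 1400 and ~0.93 (a real boolean filter)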
This is a way to select all rows:
df[range(0, len(df))]
So is this:
df[:]
But I haven't figured out a way to pass : as an argument.
There's a function called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
#Filter can be something like df['price']>500 or df['name'] == 'Brian'
#basically something that for each row returns a boolean
total = df2['ColumnToSum'].sum()
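A concrete instance of that pattern, using invented price and name columns:

import pandas as pd

df = pd.DataFrame({'price': [100, 600, 800], 'name': ['Ann', 'Brian', 'Cara']})

df2 = df.loc[df['price'] > 500]   # boolean condition evaluated per row
total = df2['price'].sum()
print(total)  # 1400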

How to self-reference column in pandas Data Frame?

In Python's pandas, I am using the DataFrame as such:
drinks = pandas.read_csv(data_url)
where data_url is a string URL to a CSV file.
When indexing the frame for all "light drinkers", where a light drinker is defined as having 1 drink, the following is written:
drinks.light_drinker[drinks.light_drinker == 1]
Is there a more DRY-like way to self-reference the "parent"? I.e. something like:
drinks.light_drinker[self == 1]
You can now use query or assign depending on what you need:
drinks.query('light_drinker == 1')
or to mutate the df:
df.assign(strong_drinker = lambda x: x.light_drinker + 100)
Old answer
Not at the moment, but an enhancement along the lines of your idea is being discussed here. For simple cases, where might be enough. The new API might look like this:
df.set(new_column=lambda self: self.light_drinker*2)
In the most current version of pandas, .where() also accepts a callable!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where
So, the following is now possible:
drinks.light_drinker.where(lambda x: x == 1)
which is particularly useful in method-chains. However, this will return only the Series (not the DataFrame filtered based on the values in the light_drinker column). This is consistent with your question, but I will elaborate for the other case.
To get a filtered DataFrame, use:
drinks.where(lambda x: x.light_drinker == 1)
Note that this will keep the shape of self (meaning you will have rows where all entries are NaN, because the condition failed for the light_drinker value at that index).
If you don't want to preserve the shape of the DataFrame (i.e you wish to drop the NaN rows), use:
drinks.query('light_drinker == 1')
Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, meaning that you don't have to reference self.
I don't know of any way to reference parent objects like self or this in pandas, but perhaps another way of doing what you want that could be considered more DRY is where():
drinks.where(drinks.light_drinker == 1, inplace=True)
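For concreteness, a small sketch with made-up data (the real data comes from a CSV URL in the question) showing the query form and the Series-level where form:

import pandas as pd

# Hypothetical stand-in for pandas.read_csv(data_url)
drinks = pd.DataFrame({'light_drinker': [1, 3, 1], 'country': ['FR', 'DE', 'IT']})

print(drinks.query('light_drinker == 1'))
#    light_drinker country
# 0              1      FR
# 2              1      IT

print(drinks.light_drinker.where(lambda x: x == 1))
# 0    1.0
# 1    NaN
# 2    1.0
# Name: light_drinker, dtype: float64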
