I've got a scenario where I have to take the results of a group by and create new columns.
For example, say I have this data:
| Tool | Category | Price |
| --- | --- | --- |
| Hammer | Hand Tool | 25.00 |
| Drill | Power Tool | 56.33 |
| Screw Driver | Hand Tool | 4.99 |
My output should look like:
| Tool | Hand Tool | Power Tool |
| --- | --- | --- |
| Hammer | 25.00 | NULL |
| Drill | NULL | 56.33 |
| Screw Driver | 4.99 | NULL |
I'm not sure how to get this output. I'm trying something like the snippet below, but it blows up with "Column is not iterable".
def get_tool_info():
    return tool_table.groupBy('Category').pivot('Price', 'Category')
What is the best way to dynamically generate these new columns and assign the price values?
Try this:
from pyspark.sql.types import StructType, StructField, StringType, FloatType
import pyspark.sql.functions as F

schema = StructType([StructField("Tool", StringType()),
                     StructField("Category", StringType()),
                     StructField("Price", FloatType())])
data = [["Hammer", "Hand Tool", 25.00], ["Drill", "Power Tool", 56.33], ["Screw Driver", "Hand Tool", 4.99]]
df = spark.createDataFrame(data, schema)

# Group by Tool, pivot Category into columns, keep the first Price per cell
df.groupby("Tool").pivot("Category").agg(F.first("Price")).show()
Output:
+------------+---------+----------+
| Tool|Hand Tool|Power Tool|
+------------+---------+----------+
| Drill| null| 56.33|
|Screw Driver| 4.99| null|
| Hammer| 25.0| null|
+------------+---------+----------+
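As a side note: if the category names are known up front, they can also be passed to pivot() explicitly, which spares Spark the extra pass it otherwise needs to discover the distinct values. A minimal sketch:

# Same pivot, but with the pivot values listed explicitly
df.groupby("Tool").pivot("Category", ["Hand Tool", "Power Tool"]).agg(F.first("Price")).show()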
I am trying to convert this back into a PySpark DataFrame. I am currently trying to calculate the 2-gram distribution. I am using:
from pyspark.sql.functions import explode

new_df = sample_df_updated.select(['ngrams'])
new_df.select(explode(new_df.ngrams)).show(truncate=False)
+------------------+
|col |
+------------------+
|the project |
|project gutenberg |
|gutenberg ebook |
|ebook of |
|of alice’s |
|alice’s adventures|
|adventures in |
|in wonderland, |
|wonderland, by |
|by lewis |
|lewis carroll |
|this ebook |
|ebook is |
|is for |
|for the |
|the use |
|use of |
|of anyone |
|anyone anywhere |
|anywhere at |
+------------------+
I am trying to use code like this:
df2 = new_df.select(explode(new_df.ngrams)).show(truncate=False)
df2.groupBy('col').count().show()
But it results in the error:
'NoneType' object has no attribute 'show'
How can I turn this into a DataFrame?
The .show() call returns None, so df2 is not a DataFrame; show() only prints the result.
Try:
df2 = new_df.select(explode(new_df.ngrams))
df2.show(truncate=False)
df2.groupBy('col').count().show()
It can also be helpful to rename the exploded column for clarity:
df2 = new_df.select(explode(new_df.ngrams).alias('exploded_ngrams'))
df2.show(truncate=False)
df2.groupBy('exploded_ngrams').count().show()
What about just exploding the column in place on the DataFrame?
new_df.withColumn("ngrams", explode("ngrams")).show(truncate=False)
For a project with tabular features, I am trying to create a new table with pivot_table.
Small problem, however: one of my columns contains an array. Here is an example:
| House | Job |
| ----- | --- |
| Gryffindor | ["Head of Auror Office", "Minister for Magic"] |
| Gryffindor | ["Auror"] |
| Slytherin | ["Auror", "Student"] |
Ideally, I would like to use a pivot table to create a table that looks like this:
| House | Head of Auror Office | Minister for Magic | Auror | Student |
|:----- |:--------------------:|:------------------:|:-----:|:-------:|
| Gryffindor | 1 | 1| 1| 0|
| Slytherin | 0 | 0| 1| 1|
Of course an array can contain 2, 3, or 4 values, so its length is not fixed. Anyone have a solution? Maybe pivot_table is not the best approach here.
Suppose your table is a DataFrame df with the two columns above:
(df.explode('Job')                                       # one row per (House, Job) pair
   .groupby(['House', 'Job']).size().reset_index(name='count')
   .pivot(index='House', columns='Job', values='count')
   .fillna(0))
The code first expands the lists into rows, then counts each (House, Job) pair, and finally pivots the counts into the table.
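As a possible alternative (just a sketch, under the same assumption that Job holds actual Python lists rather than strings), pd.crosstab on the exploded frame produces the count table directly:

import pandas as pd

# Explode the Job lists into rows, then cross-tabulate House against Job
exploded = df.explode('Job')
print(pd.crosstab(exploded['House'], exploded['Job']))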
I have a pyspark dataframe that looks like this.
+--------------------+-------+--------------------+
| ID |country| attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2,3,4...] |
|3de27656 | US|[1,7,2,4...] |
|75ce4e58 | US|[1,2,1,4...] |
|908df65c | US|[1,8,3,0...] |
|f0503257 | US|[1,2,3,2...] |
|2tBxD6j | US|[1,2,3,4...] |
|33811685 | US|[1,5,3,5...] |
|aad21639 | US|[7,8,9,4...] |
|e3d9e3bb | US|[1,10,9,4...] |
|463f6f69 | US|[12,2,13,4...] |
+--------------------+-------+--------------------+
I also have a set that looks like this:
reference_set = (1,2,100,500,821)
What I want to do is create a new list as a column in the DataFrame, using maybe a list comprehension like [attr for attr in attrs if attr in reference_set].
So my final DataFrame should be something like this:
+--------------------+-------+--------------------+
| ID |country| filtered_attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2] |
|3de27656 | US|[1,2] |
|75ce4e58 | US|[1,2] |
|908df65c | US|[1] |
|f0503257 | US|[1,2] |
|2tBxD6j | US|[1,2] |
|33811685 | US|[1] |
|aad21639 | US|[] |
|e3d9e3bb | US|[1] |
|463f6f69 | US|[2] |
+--------------------+-------+--------------------+
How can I do this? As I'm new to PySpark, I can't work out the logic.
Edit: I posted an approach below; if there's a more efficient way of doing this, please let me know.
You can use the built-in function array_intersect.
from pyspark.sql.functions import split, lit, array_intersect

# Sample dataframe
df = spark.createDataFrame([('ffae10af', 'US', [1, 2, 3, 4])], ["ID", "Country", "attrs"])
reference_set = {1, 2, 100, 500, 821}

# Turn the set into a string, then into an array column on the dataframe
set_to_string = ",".join([str(x) for x in reference_set])

df.withColumn('reference_set', split(lit(set_to_string), ',').cast('array<bigint>')) \
  .withColumn('filtered_attrs', array_intersect('attrs', 'reference_set')) \
  .show(truncate=False)
+--------+-------+------------+---------------------+--------------+
|ID |Country|attrs |reference_set |filtered_attrs|
+--------+-------+------------+---------------------+--------------+
|ffae10af|US |[1, 2, 3, 4]|[1, 2, 100, 500, 821]|[1, 2] |
+--------+-------+------------+---------------------+--------------+
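A small variant (just a sketch) builds the reference array directly from literals with array(), which avoids the intermediate string/split column:

from pyspark.sql import functions as F

# Build the reference values as a literal array column and intersect with attrs
ref_array = F.array(*[F.lit(x) for x in reference_set])
df.withColumn('filtered_attrs', F.array_intersect('attrs', ref_array)).show(truncate=False)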
I managed to use the filter function paired with a UDF to make this work.
from pyspark.sql.functions import udf, col

def filter_items(item):
    # Keep only the attributes that appear in reference_set
    return item in reference_set

custom_udf = udf(lambda attributes: list(filter(filter_items, attributes)))
processed_df = df.withColumn('filtered_attrs', custom_udf(col('attrs')))
This gives me the required output.
I have a pandas dataframe named idf with data from 4/19/21 to 5/19/21 for 4675 tickers, with the following columns: symbol, date, open, high, low, close, vol.
|index |symbol |date |open |high |low |close |vol |EMA8|EMA21|RSI3|RSI14|
|-------|-------|-----------|-------|-------|-----------|-------|-------|----|-----|----|-----|
|0 |AACG |2021-04-19 |2.85 |3.03 |2.8000 |2.99 |173000 | | | | |
|1 |AACG |2021-04-20 |2.93 |2.99 |2.7700 |2.85 |73700 | | | | |
|2 |AACG |2021-04-21 |2.82 |2.95 |2.7500 |2.76 |93200 | | | | |
|3 |AACG |2021-04-22 |2.76 |2.95 |2.7200 |2.75 |56500 | | | | |
|4 |AACG |2021-04-23 |2.75 |2.88 |2.7000 |2.84 |277700 | | | | |
|... |... |... |... |... |... |... |... | | | | |
|101873 |ZYXI |2021-05-13 |13.94 |14.13 |13.2718 |13.48 |413200 | | | | |
|101874 |ZYXI |2021-05-14 |13.61 |14.01 |13.2200 |13.87 |225200 | | | | |
|101875 |ZYXI |2021-05-17 |13.72 |14.05 |13.5500 |13.82 |183600 | | | | |
|101876 |ZYXI |2021-05-18 |13.97 |14.63 |13.8300 |14.41 |232200 | | | | |
|101877 |ZYXI |2021-05-19 |14.10 |14.26 |13.7700 |14.25 |165600 | | | | |
I would like to use ta-lib to calculate several technical indicators like EMA of length 8 and 21, and RSI of 3 and 14.
I have been doing this with the following code after uploading the file and creating a dataframe named idf:
ind = pd.DataFrame()
tind = pd.DataFrame()
for ticker in idf['symbol'].unique():
    tind['rsi3'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 3).round(2)
    tind['rsi14'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 14).round(2)
    tind['ema8'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 8).round(2)
    tind['ema21'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 21).round(2)
    ind = ind.append(tind)
    tind = tind.iloc[0:0]
idf = pd.merge(idf, ind, left_index=True, right_index=True)
Is this the most efficient way to doing this?
If not, what is the easiest and fastest way to calculate indicator values and get those calculated indicator values into the dataframe idf?
Prefer to avoid a for loop if possible.
Any help is highly appreciated.
import talib

rsi = lambda x: talib.RSI(idf.loc[x.index, "close"], 14)
idf['rsi(14)'] = idf.groupby(['symbol']).apply(rsi).reset_index(0, drop=True)
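For the other indicators, a sketch that extends the same groupby idea (assuming, as in the question's own code, that ta.RSI and ta.EMA accept a pandas Series of closes) could look like this:

import talib as ta

def add_indicators(g):
    # Compute the indicators on this symbol's close prices
    g = g.copy()
    g['RSI3'] = ta.RSI(g['close'], 3).round(2)
    g['RSI14'] = ta.RSI(g['close'], 14).round(2)
    g['EMA8'] = ta.EMA(g['close'], 8).round(2)
    g['EMA21'] = ta.EMA(g['close'], 21).round(2)
    return g

idf = idf.groupby('symbol', group_keys=False).apply(add_indicators)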
I would like to expand one column containing dict values into separate columns, as follows:
+-------+--------------------------------------------+
| Idx| value |
+-------+--------------------------------------------+
| 123|{'country_code': 'gb','postal_area': 'CR'} |
| 456|{'country_code': 'cn','postal_area': 'RS'} |
| 789|{'country_code': 'cl','postal_area': 'QS'} |
+-------+--------------------------------------------+
Then I would like to get something like this:
display(df)
+-------+--------------+-------------+
|    Idx| country_code | postal_area |
+-------+--------------+-------------+
|    123| gb           | CR          |
|    456| cn           | RS          |
|    789| cl           | QS          |
+-------+--------------+-------------+
I tried to do this for just one line, something like this:
# PySpark code
import json

sc = spark.sparkContext
dict_lst = {'country_code': 'gb', 'postal_area': 'CR'}
rdd = sc.parallelize([json.dumps(dict_lst)])
df = spark.read.json(rdd)
display(df)
and I got:
+-------------+-------------+
|country_code | postal_area |
+-------------+-------------+
|          gb |          CR |
+-------------+-------------+
So maybe I have part of the solution here. Now I would like to know how I can combine this with the original DataFrame to get the result.
Well, after trying, the best solution is getting the values with PySpark's regexp_extract function:
from pyspark.sql.functions import regexp_extract

df = df.withColumn("country_code", regexp_extract('value', "(?<=.country_code.:\s.)(.*?)(?=\')", 0)) \
       .withColumn("postal_area", regexp_extract('value', "(?<=.postal_area.:\s.)(.*?)(?=\')", 0))
Hope this helps with future questions about getting values from a string dictionary.
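Another possible route (just a sketch; the schema and the intermediate json_value column below are illustrative, and it assumes value is a string that only differs from valid JSON by its single quotes) is to parse the column with from_json instead of regular expressions:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("country_code", StringType()),
    StructField("postal_area", StringType()),
])

parsed = (df
          .withColumn("json_value", F.regexp_replace("value", "'", '"'))  # turn the string into valid JSON
          .withColumn("parsed", F.from_json("json_value", schema))
          .select("Idx", "parsed.country_code", "parsed.postal_area"))
parsed.show()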