I'm new to Spark with Python and I'm trying to do some basic things to get an understanding of both.
I have a file like the one below:
empid||deptid||salary
1||10||500
2||10||200
3||20||300
4||20||400
5||20||100
I want to write a small PySpark program that reads this file and prints the count of employees in each department.
I've been working with databases, and this is quite simple in SQL, but I'm trying to do it using Python and Spark. I don't have any code to share as I'm completely new to both, but I wanted to understand how it works using a simple hands-on example.
I've installed pyspark and did some quick reading here https://spark.apache.org/docs/latest/quick-start.html
From my understanding there are DataFrames on which one can perform SQL-like operations such as group by, but I'm not sure how to write the proper code.
You can read the text file as a DataFrame using:
rdd = sc.textFile("path/to/my/file")
header = rdd.first()
df = spark.createDataFrame(
    rdd.filter(lambda l: l != header).map(lambda l: l.split("||")),
    ["empid", "deptid", "salary"]
)
textFile loads each line of the file as an RDD with a single string column. We drop the header line, split each remaining line on "||" through a map, and convert the result to a DataFrame.
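Alternatively (just a sketch, assuming Spark 3.0 or later, where the CSV reader accepts multi-character separators, and an active SparkSession named spark), the file can be read directly and the header handled for you:
# the CSV reader parses the "||"-delimited file, keeps the header and infers column types
df = spark.read.csv("path/to/my/file", sep="||", header=True, inferSchema=True)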
Starting from a python list of lists:
df = spark.createDataFrame(
    sc.parallelize([[1, 10, 500],
                    [2, 10, 200],
                    [3, 20, 300],
                    [4, 20, 400],
                    [5, 20, 100]]),
    ["empid", "deptid", "salary"]
)
df.show()
+-----+------+------+
|empid|deptid|salary|
+-----+------+------+
| 1| 10| 500|
| 2| 10| 200|
| 3| 20| 300|
| 4| 20| 400|
| 5| 20| 100|
+-----+------+------+
Now, to count the number of employees in each department, we use a groupBy followed by the count aggregation function:
df_agg = df.groupBy("deptid").count()
df_agg.show()
+------+-----+
|deptid|count|
+------+-----+
| 10| 2|
| 20| 3|
+------+-----+
For the maximum count across departments:
import pyspark.sql.functions as psf
df_agg.agg(psf.max("count")).show()
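Since you mention a SQL background, here is the same aggregation written as a Spark SQL query, as a sketch (it assumes the SparkSession is named spark, as in the snippets above):
# Register the DataFrame as a temporary view and query it with plain SQL
df.createOrReplaceTempView("employees")
spark.sql("""
    SELECT deptid, COUNT(*) AS cnt
    FROM employees
    GROUP BY deptid
""").show()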
I have a df like this one:
df = spark.createDataFrame(
    [("1", "Apple", "cat"), ("2", "2.", "house"), ("3", "<strong>text</strong>", "HeLlo 2.5")],
    ["id", "text1", "text2"])
+---+---------------------+---------+
| id| text1| text2|
+---+---------------------+---------+
| 1| Apple| cat|
| 2| 2.| house|
| 3|<strong>text</strong>|HeLlo 2.5|
+---+---------------------+---------+
and multiple functions to clean the text, like:
from lxml import html, etree  # assumes lxml is used for the HTML parsing
import re

def remove_html_tags(text):
    document = html.fromstring(text)
    return " ".join(etree.XPath("//text()")(document))

def lowercase(text):
    return text.lower()

def remove_wrong_dot(text):
    return re.sub(r'(?<!\d)[.,;:]|[.,;:](?!\d)', ' ', text)
and a list of columns to clean
COLS = ["text1", "text2"]
I would like to apply the functions to the columns in the list and also keep the original text
+---+---------------------+-----------+---------+-----------+
| id| text1|text1_clean| text2|text2_clean|
+---+---------------------+-----------+---------+-----------+
| 1| Apple| apple| cat| cat|
| 2| 2.| 2| house| house|
| 3|<strong>text</strong>| text|HeLlo 2.5| hello 2.5|
+---+---------------------+-----------+---------+-----------+
I already have an approach using a UDF, but it is not very efficient. I've been trying something like:
rdds = []
for col in COLS:
    rdd = df.rdd.map(lambda x: (x[col], lowercase(x[col])))
    rdds.append(rdd.collect())
My idea would be to join all the RDDs in the list, but I don't know how efficient this would be or how to apply more functions.
I appreciate any ideas or suggestions.
EDIT: Not all transformations can be done with regexp_replace. For example, the text can include nested HTML tags, in which case a simple replace wouldn't work, and I don't want to replace all dots, only those at the beginning or end of substrings.
Spark's built-in functions can do all the transformations you want:
from pyspark.sql import functions as F

cols = ["text1", "text2"]

for c in cols:
    df = (df
          .withColumn(f'{c}_clean', F.lower(c))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', '<[^>]+>', ''))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', r'(?<!\d)[.,;:]|[.,;:](?!\d)', '')))
+---+--------------------+---------+-----------+-----------+
| id| text1| text2|text1_clean|text2_clean|
+---+--------------------+---------+-----------+-----------+
| 1| Apple| cat| apple| cat|
| 2| 2.| house| 2| house|
| 3|<strong>text</str...|HeLlo 2.5| text| hello 2.5|
+---+--------------------+---------+-----------+-----------+
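If you prefer to avoid the repeated withColumn calls, a single select can produce the original and cleaned columns in one pass. This is only a sketch; the clean helper below is a hypothetical name that bundles the same three transformations:
from pyspark.sql import functions as F

cols = ["text1", "text2"]

def clean(c):
    # lowercase, strip HTML tags, then drop punctuation not surrounded by digits
    cleaned = F.lower(F.col(c))
    cleaned = F.regexp_replace(cleaned, '<[^>]+>', '')
    cleaned = F.regexp_replace(cleaned, r'(?<!\d)[.,;:]|[.,;:](?!\d)', '')
    return cleaned.alias(f'{c}_clean')

# interleave each original column with its cleaned version
df = df.select('id', *[e for c in cols for e in (F.col(c), clean(c))])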
I'm trying to replicate the following SAS code in PySpark:
PROC RANK DATA = aud_baskets OUT = aud_baskets_ranks GROUPS=10 TIES=HIGH;
BY customer_id;
VAR expenditure;
RANKS basket_rank;
RUN;
The idea is to rank all expenditures under each customer_id block. The data would look like this:
+-----------+--------------+-----------+
|customer_id|transaction_id|expenditure|
+-----------+--------------+-----------+
| A| 1| 34|
| A| 2| 90|
| B| 1| 89|
| A| 3| 6|
| B| 2| 8|
| B| 3| 7|
| C| 1| 96|
| C| 2| 9|
+-----------+--------------+-----------+
In PySpark, I tried this:
from pyspark.sql.functions import col, ntile
from pyspark.sql.window import Window
spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = aud_baskets_ranks.withColumn('basket_rank', ntile(10).over(spendWindow))
The problem is that PySpark doesn't let the user change the way it handles ties, like SAS does (as far as I know). I need to set this behavior in PySpark so that values are moved up to the next tier each time one of those edge cases occurs, as opposed to dropping them to the rank below.
Or is there a way to custom-write this behavior?
Use dense_rank: it gives the same rank in case of ties, and the next rank is not skipped. The ntile function splits the records in each partition into n parts (10 in your case).
from pyspark.sql.functions import col, dense_rank
from pyspark.sql.window import Window
spendWindow = Window.partitionBy('customer_id').orderBy(col('expenditure').asc())
aud_baskets = aud_baskets_ranks.withColumn('basket_rank', dense_rank().over(spendWindow))
Try the following code. It was generated by an automated tool called SPROCKET. It should take care of ties.
from pyspark.sql.functions import asc, expr, rank
from pyspark.sql.window import Window

df = aud_baskets
for (colToRank, rankedName) in zip(['expenditure'], ['basket_rank']):
    wA = Window.orderBy(asc(colToRank))
    df_w_rank = df.withColumn('raw_rank', rank().over(wA))
    # rows that share a raw rank are ties
    ties = df_w_rank.groupBy('raw_rank').count().filter("count > 1")
    # push tied rows up to the highest rank in their group (TIES=HIGH)
    df_w_rank = (df_w_rank.join(ties, ['raw_rank'], 'left')
                 .withColumn(rankedName, expr("""case when count is not null
                                                 then (raw_rank + count - 1)
                                                 else raw_rank end""")))
    rankedNameGroup = rankedName
    n = df_w_rank.count()
    # bucket the adjusted ranks into 10 groups, like GROUPS=10
    df_with_rank_groups = df_w_rank.withColumn(
        rankedNameGroup,
        expr("FLOOR({rankedName}*{k}/({n}+1))".format(k=10, n=n, rankedName=rankedName)))
    df = df_with_rank_groups

aud_baskets_ranks = df_with_rank_groups.drop('raw_rank', 'count')
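As another sketch (not SPROCKET output): cume_dist counts tied rows at their highest position, so flooring it into ten buckets approximates PROC RANK with GROUPS=10 and TIES=HIGH within each customer_id:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('customer_id').orderBy('expenditure')

# cume_dist() gives tied rows the fraction of rows <= them (their high rank),
# so floor(cume_dist * 10) buckets them the way TIES=HIGH would; cap the top row at 9
aud_baskets = aud_baskets_ranks.withColumn(
    'basket_rank',
    F.least(F.floor(F.cume_dist().over(w) * 10), F.lit(9)).cast('int'))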
My DataFrame looks like this:
+------+-------+
|cat_id|counter|
+------+-------+
| 12| 61060|
| 1| 542118|
| 13| 164700|
| 3| 406622|
| 5| 54902|
| 10| 118281|
| 11| 13658|
| 14| 72229|
| 2| 131206|
+------+-------+
The query to get the above data frame is:
from pyspark.sql.functions import count
grouped_data = dataframe.groupBy("cat_id").agg(count("*").alias("counter"))
Now I need to read the values for the different cat_ids to save them in another database.
The way I can get it done is by using a for loop over my ids:
for cat_id in cat_ids_map:
    statsCount = grouped_data.select("counter").filter("cat_id = " + cat_id).collect()[0].counter
But I think there must be a better way to read the counters without a for loop. Any suggestions would be helpful!
Thanks
If you need to iterate through the entire dataframe, the usual way to do it is with the .foreach function.
So you would do:
grouped_data.foreach(lambda x: f(x))
where f is your function that will do whatever you want with each element of the dataframe.
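If all you need is each cat_id's counter on the driver side, another option (a sketch, reusing cat_ids_map from your question) is to collect the small aggregated result once and index it, instead of filtering and collecting inside the loop:
# grouped_data has one row per cat_id, so a single collect() is cheap
counters = {row['cat_id']: row['counter'] for row in grouped_data.collect()}

for cat_id in cat_ids_map:
    stats_count = counters.get(cat_id)
    # ... save stats_count to the other database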
First of all, I want to let you know that I am still very new to Spark and getting used to the lazy-evaluation concept.
Here is my issue:
I have two Spark DataFrames that I load from reading CSV.GZ files.
What I am trying to do is to merge both tables in order to split the first table according to keys that I have in the second one.
For example:
Table A
+----------+---------+--------+---------+------+
| Date| Zone| X| Type|Volume|
+----------+---------+--------+---------+------+
|2019-01-16|010010000| B| A| 684|
|2019-01-16|010020000| B| A| 21771|
|2019-01-16|010030000| B| A| 7497|
|2019-01-16|010040000| B| A| 74852|
Table B
+----+---------+
|Dept| Zone|
+----+---------+
| 01|010010000|
| 02|010020000|
| 01|010030000|
| 02|010040000|
Then when I merge both tables I have:
+---------+----------+--------+---------+------+----+
| Zone| Date| X| Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010010000|2019-01-16| B| A| 684| 01|
|010020000|2019-01-16| B| A| 21771| 02|
|010030000|2019-01-16| B| A| 7497| 01|
|010040000|2019-01-16| B| A| 74852| 02|
So what I want to do is to split this table in Y disjointed tables, where Y is the number of different 'Dept' values that I find on my merged table.
So for example:
Result1:
+---------+----------+--------+---------+------+----+
| Zone| Date| X| Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010010000|2019-01-16| B| A| 684| 01|
|010030000|2019-01-16| B| A| 7497| 01|
Result2:
+---------+----------+--------+---------+------+----+
| Zone| Date| X| Type|Volume|Dept|
+---------+----------+--------+---------+------+----+
|010020000|2019-01-16| B| A| 21771| 02|
|010040000|2019-01-16| B| A| 74852| 02|
My code looks like this:
sp_df_A = spark.read.csv(file_path_A, header=True, sep=';', encoding='cp1252')
sp_df_B = spark.read.csv(file_path_B, header=True, sep=';', encoding='cp1252')
sp_merged_df = sp_df_A.join(sp_df_B, on=['Zone'], how='left')

# list of unique 'Dept' values in the merged DataFrame
unique_buckets = [x.__getitem__('Dept') for x in sp_merged_df.select('Dept').distinct().collect()]

# Iterate over all 'Dept' values found
for zone_bucket in unique_buckets:
    print(zone_bucket)
    bucket_dir = os.path.join(output_dir, 'Zone_%s' % zone_bucket)
    if not os.path.exists(bucket_dir):
        os.mkdir(bucket_dir)
    # Filter target 'Dept'
    tmp_df = sp_merged_df.filter(sp_merged_df['Dept'] == zone_bucket)
    # write result
    tmp_df.write.format('com.databricks.spark.csv').option('codec', 'org.apache.hadoop.io.compress.GzipCodec').save(bucket_dir, header='true')
The thing is that this very simple code is taking too much time to write a result. So my guess is that the lazy evaluation is loading, merging and filtering on every cycle of the loop.
Can this be the case?
Your guess is correct. Your code reads, joins and filters all the data for each of the buckets. This is indeed caused by Spark's lazy evaluation.
Spark defers every data transformation until an action is performed. When an action is called, Spark looks at all the transformations and creates a plan for how to efficiently get the result of the action. While Spark executes this plan, the program waits. Once Spark is done, the program continues, and Spark "forgets" everything it has done until the next action is called.
In your case Spark "forgets" the joined dataframe sp_merged_df, and each time a .collect() or .save() is called it reconstructs it.
If you want Spark to "remember" an RDD or DataFrame, you can .cache() it (see the docs).
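As a sketch of both options: cache the joined DataFrame so the read and join are only done once, or skip the Python loop entirely and let a partitioned write create one sub-directory per Dept (this assumes the built-in CSV writer is acceptable instead of com.databricks.spark.csv):
# Option 1: cache the join result so each filter/save in the loop reuses it
sp_merged_df = sp_df_A.join(sp_df_B, on=['Zone'], how='left').cache()

# Option 2: one gzip-compressed write partitioned by 'Dept', no loop needed;
# it creates output_dir/Dept=01/, output_dir/Dept=02/, ...
(sp_merged_df
    .write
    .partitionBy('Dept')
    .option('compression', 'gzip')
    .csv(output_dir, header=True))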
I have a dataframe in pyspark:
import json

ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: json.loads(l))
)
ratings.show()
+--------+-------------------+------------+----------+-------------+-------+
|click_id| created_at| ip|product_id|product_price|user_id|
+--------+-------------------+------------+----------+-------------+-------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3|
+--------+-------------------+------------+----------+-------------+-------+
ratings.registerTempTable("transactions")
final_df = sqlContext.sql("select * from transactions");
I want to add a new column called status to this data frame, and then update the status column based on created_at and user_id.
The created_at and user_id are read from the given table transactions and passed to a function get_status(user_id, created_at) which returns the status. This status needs to be put into the transactions table as a new column for the corresponding user_id and created_at.
Can I run alter and update commands in pyspark? How can this be done using pyspark?
It's not clear what you want to do exactly. You should check out window functions; they allow you to compare, sum, etc. rows within a frame.
For instance
import pyspark.sql.functions as psf
from pyspark.sql import Window
w = Window.partitionBy("user_id").orderBy(psf.desc("created_at"))
ratings.withColumn(
    "status",
    psf.when(psf.row_number().over(w) == 1, "active").otherwise("inactive")
).sort("click_id").show()
+--------+-------------------+------------+----------+-------------+-------+--------+
|click_id| created_at| ip|product_id|product_price|user_id| status|
+--------+-------------------+------------+----------+-------------+-------+--------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|inactive|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|inactive|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1| active|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|inactive|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|inactive|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2| active|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|inactive|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|inactive|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3| active|
+--------+-------------------+------------+----------+-------------+-------+--------+
It gives you each user's last click.
If you want to use a UDF to create a new column from two existing ones, say you have a function that takes the user_id and created_at as arguments:
from pyspark.sql.types import *

def get_status(user_id, created_at):
    ...

# StringType(), or whichever data type your function returns
get_status_udf = psf.udf(get_status, StringType())

ratings.withColumn("status", get_status_udf("user_id", "created_at"))