Use enumerate to get partition columns from dataframe - python

I am trying to get all columns and their datatypes into one variable, and only the partition columns into another variable of list type, in Python. I am getting the details from describe extended.
df = spark.sql("describe extended schema_name.table_name")
+----------------------------+-----------------------------+
|col_name                    |data_type                    |
+----------------------------+-----------------------------+
|col1                        |string                       |
|col2                        |int                          |
|col3                        |string                       |
|col4                        |int                          |
|col5                        |string                       |
|# Partition Information     |                             |
|# col_name                  |data_type                    |
|col4                        |int                          |
|col5                        |string                       |
|                            |                             |
|# Detailed Table Information|                             |
|Database                    |schema_name                  |
|Table                       |table_name                   |
|Owner                       |owner.name                   |
Converting result into a list.
des_list=df.select(df.col_name,df.data_type).rdd.map(lambda x:(x[0],x[1])).collect()
Here is how I am trying to get all the columns (all items before '# Partition Information').
all_cols_name_type = []
for index, item in enumerate(des_list):
    if item[0] == '# Partition Information':
        all_cols_name_type.append(des_list[:index])
For the partitions, I would like to get everything between the item '# col_name' and the blank line just before '# Detailed Table Information'.
Any help to achieve this is appreciated.

You can try the following in Scala:
val (partitionCols, dataCols) = spark.catalog.listColumns("schema_name.table_name")
  .collect()
  .partition(c => c.isPartition)

val parCols = partitionCols.map(c => (c.name, c.dataType))
val datCols = dataCols.map(c => (c.name, c.dataType))
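Since the question is about Python, a rough PySpark sketch of the same catalog approach (assuming the table is registered in the catalog) could look like this:
cols = spark.catalog.listColumns("table_name", dbName="schema_name")
partition_cols = [(c.name, c.dataType) for c in cols if c.isPartition]
data_cols = [(c.name, c.dataType) for c in cols if not c.isPartition]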
If the table is not defined in the catalog (e.g. reading a parquet dataset directly from S3 using spark.read.parquet("s3://path/...")), then you can use the following snippet in Scala:
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2ScanRelation, FileScan}
import org.apache.spark.sql.types.StructType

val (partitionSchema, dataSchema) = df.queryExecution.optimizedPlan match {
  case LogicalRelation(hfs: HadoopFsRelation, _, _, _) =>
    (hfs.partitionSchema, hfs.dataSchema)
  case DataSourceV2ScanRelation(_, scan: FileScan, _) =>
    (scan.readPartitionSchema, scan.readDataSchema)
  case _ => (StructType(Seq()), StructType(Seq()))
}

val parCols = partitionSchema.map(f => (f.name, f.dataType))
val datCols = dataSchema.map(f => (f.name, f.dataType))

There is a trick to do this: you can use monotonically_increasing_id to give each row a number, find the row that has '# col_name', and take everything after that index. Something like this:
My sample table
from pyspark.sql import functions as F

df = spark.sql('describe data')
df = df.withColumn('id', F.monotonically_increasing_id())
df.show()
+--------------------+---------+-------+---+
| col_name|data_type|comment| id|
+--------------------+---------+-------+---+
| c1| int| null| 0|
| c2| string| null| 1|
|# Partition Infor...| | | 2|
| # col_name|data_type|comment| 3|
| c2| string| null| 4|
+--------------------+---------+-------+---+
tricky part
idx = df.where(F.col('col_name') == '# col_name').first()['id']
# 3
partition_cols = [r['col_name'] for r in df.where(F.col('id') > idx).collect()]
# ['c2']
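If you prefer to stick with the collected des_list of (col_name, data_type) tuples from the question, a plain-Python sketch (assuming the blank separator row has an empty col_name, as in the output above) could be:
part_info_idx = next(i for i, (name, _) in enumerate(des_list)
                     if name == '# Partition Information')
col_name_idx = next(i for i, (name, _) in enumerate(des_list)
                    if name == '# col_name')
blank_idx = next(i for i, (name, _) in enumerate(des_list)
                 if i > col_name_idx and name.strip() == '')

all_cols_name_type = des_list[:part_info_idx]   # every column with its data type
partition_cols = [name for name, _ in des_list[col_name_idx + 1:blank_idx]]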

Related

PySpark - Transform multiple columns without using udf

I have a df like this one:
df = spark.createDataFrame(
    [("1", "Apple", "cat"), ("2", "2.", "house"), ("3", "<strong>text</strong>", "HeLlo 2.5")],
    ["id", "text1", "text2"])
+---+---------------------+---------+
| id| text1| text2|
+---+---------------------+---------+
| 1| Apple| cat|
| 2| 2.| house|
| 3|<strong>text</strong>|HeLlo 2.5|
+---+---------------------+---------+
I have multiple functions to clean the text, like:
import re
from lxml import etree, html

def remove_html_tags(text):
    document = html.fromstring(text)
    return " ".join(etree.XPath("//text()")(document))

def lowercase(text):
    return text.lower()

def remove_wrong_dot(text):
    return re.sub(r'(?<!\d)[.,;:]|[.,;:](?!\d)', ' ', text)
and a list of columns to clean
COLS = ["text1", "text2"]
I would like to apply the functions to the columns in the list and also keep the original text
+---+---------------------+-----------+---------+-----------+
| id| text1|text1_clean| text2|text2_clean|
+---+---------------------+-----------+---------+-----------+
| 1| Apple| apple| cat| cat|
| 2| 2.| 2| house| house|
| 3|<strong>text</strong>| text|HeLlo 2.5| hello 2.5|
+---+---------------------+-----------+---------+-----------+
I already have an approach using UDF but it is not very efficient. I've been trying something like:
rdds = []
for col in COLS:
    rdd = df.rdd.map(lambda x: (x[col], lowercase(x[col])))
    rdds.append(rdd.collect())
return df
My idea would be to join all the rdds in the list, but I don't know how efficient this would be or how to include more functions. I appreciate any ideas or suggestions.
EDIT: Not all transformations can be done with regexp_replace. For example, the text can include nested HTML tags, in which case a simple replace wouldn't work. Also, I don't want to replace all dots, only those at the beginning or end of substrings.
Spark built-in functions can do all the transformations you want:
from pyspark.sql import functions as F

cols = ["text1", "text2"]
for c in cols:
    df = (df
          .withColumn(f'{c}_clean', F.lower(c))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', '<[^>]+>', ''))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', r'(?<!\d)[.,;:]|[.,;:](?!\d)', ''))
          )
+---+--------------------+---------+-----------+-----------+
| id| text1| text2|text1_clean|text2_clean|
+---+--------------------+---------+-----------+-----------+
| 1| Apple| cat| apple| cat|
| 2| 2.| house| 2| house|
| 3|<strong>text</str...|HeLlo 2.5| text| hello 2.5|
+---+--------------------+---------+-----------+-----------+
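The same chain can also be written as a single select instead of the withColumn loop; a sketch, assuming the same df and cols as above:
from pyspark.sql import functions as F

# build one cleaned expression per column, then apply them all in a single select
clean_cols = [
    F.regexp_replace(
        F.regexp_replace(F.lower(F.col(c)), '<[^>]+>', ''),
        r'(?<!\d)[.,;:]|[.,;:](?!\d)', ''
    ).alias(f'{c}_clean')
    for c in cols
]
df = df.select('*', *clean_cols)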

pyspark replace lowercase characters in column with 'x'

I'm trying to do the following, but for a column in PySpark, with no luck. Any idea on isolating just the lowercase characters in a column of a Spark df?
''.join('x' if x.islower() else 'X' if x.isupper() else x for x in text)
You can directly use regexp_replace to substitute the lowercase characters with any desired value.
In your case you will have to chain regexp_replace calls to get the final output.
Data Preparation
import pandas as pd
from pyspark.sql import functions as F

inp_string = """
lRQWg2IZtB
hVzsJhPVH0
YXzc4fZDwu
qRyOUhT5Hn
b85O0H41RE
vOxPLFPWPy
fE6o5iMJ6I
918JI00EC7
x3yEYOCwek
m1eWY8rZwO
""".strip().split()

df = pd.DataFrame({
    'value': inp_string
})

# sql is an existing SparkSession
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
Regex Replace
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value", r'[a-z]', "x"))
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value_modified", r'[A-Z]', "X"))
sparkDF.show()
+----------+--------------+
| value|value_modified|
+----------+--------------+
|lRQWg2IZtB| xXXXx2XXxX|
|hVzsJhPVH0| xXxxXxXXX0|
|YXzc4fZDwu| XXxx4xXXxx|
|qRyOUhT5Hn| xXxXXxX5Xx|
|b85O0H41RE| x85X0X41XX|
|vOxPLFPWPy| xXxXXXXXXx|
|fE6o5iMJ6I| xX6x5xXX6X|
|918JI00EC7| 918XX00XX7|
|x3yEYOCwek| x3xXXXXxxx|
|m1eWY8rZwO| x1xXX8xXxX|
+----------+--------------+
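An alternative sketch that avoids regex entirely is F.translate, which maps characters one-to-one (assuming the same sparkDF as above):
import string
from pyspark.sql import functions as F

# replace every ASCII lowercase letter with 'x', then every uppercase letter with 'X'
sparkDF = sparkDF.withColumn(
    'value_modified',
    F.translate(
        F.translate('value', string.ascii_lowercase, 'x' * 26),
        string.ascii_uppercase, 'X' * 26
    )
)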
Using the following dataframe as an example
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
You can use a pyspark.sql function called regexp_replace to isolate the lowercase letters in the column with the following code:
from pyspark.sql import functions

df = (df.withColumn("value",
                    functions.regexp_replace("value", r'[A-Z]|[0-9]|[,.;##?!&$]', "")))
df.show()
+-----+
|value|
+-----+
| lgt|
| hzsh|
|zcfwu|
| qyhn|
| b|
| vxy|
| foi|
| |
|xywek|
| merw|
+-----+

Creating pyspark dataframe from list of dictionaries

I have below list of dictionaries
results = [
    {
        "type": "check_datatype",
        "kwargs": {
            "table": "cars", "column_name": "vin", "d_type": "string"
        }
    },
    {
        "type": "check_emptystring",
        "kwargs": {
            "table": "cars", "column_name": "vin"
        }
    },
    {
        "type": "check_null",
        "kwargs": {
            "table": "cars", "columns": ["vin", "index"]
        }
    }
]
I want to create two different PySpark dataframes with the below schema -
The args_id column in the results table will be the same when we have a unique pair of (type, kwargs). This JSON has to be run on a daily basis, and hence if the same pair of (type, kwargs) appears again, it should get the same args_id value.
Till now, I have written this code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
check_type_results = [[elt['type']] for elt in results]
checkColumns = ['type']
spark = SparkSession.builder.getOrCreate()
checkResultsDF = spark.createDataFrame(data=check_type_results, schema=checkColumns)
checkResultsDF = checkResultsDF.withColumn("time", F.current_timestamp())
checkResultsDF = checkResultsDF.withColumn("args_id", F.row_number().over(Window.orderBy(F.monotonically_increasing_id())))
checkResultsDF.printSchema()
Now, with my code, I always get args_id in increasing order, which is correct for the first run. But if I run the JSON again on the next day (or even on the same day) and it contains some pair of (type, kwargs) that has already appeared before, I should use the same args_id for that pair.
If some pair (type, kwargs) has no entry in the Arguments table, only then will I insert into the Arguments table; if the pair (type, kwargs) already exists there, no insert should happen.
Once these two dataframes are filled properly, I want to load them into separate delta tables.
The hashcode column in the Arguments table is a unique identifier for each "kwargs".
Issues
Your schema is a bit incomplete. A more detailed schema will allow you to take advantage of more Spark features. See the solutions below using spark-sql and pyspark. Instead of window functions that require ordered partitions, you may take advantage of a few of the table-generating array functions available in spark-sql, such as explode and posexplode. As for writing to your delta tables, you may see the Delta Batch Writes reference at the end of this answer.
Solution 1 : Using Spark SQL
Setup
from pyspark.sql.types import ArrayType,StructType, StructField, StringType, MapType
from pyspark.sql import Row, SparkSession
sparkSession = SparkSession.builder.appName("Demo").getOrCreate()
Schema Definition
Your sample record is an array of structs/objects where kwargs is a MapType with optional keys. NB: the True indicates nullable and should assist when there are missing keys or entries with different formats.
schema = StructType([
    StructField("entry", ArrayType(
        StructType([
            StructField("type", StringType(), True),
            StructField("kwargs", MapType(StringType(), StringType()), True)
        ])
    ), True)
])
Reproducible Example
result_entry = [
    {
        "type": "check_datatype",
        "kwargs": {
            "table": "cars", "column_name": "vin", "d_type": "string"
        }
    },
    {
        "type": "check_emptystring",
        "kwargs": {
            "table": "cars", "column_name": "vin"
        }
    },
    {
        "type": "check_null",
        "kwargs": {
            "table": "cars", "columns": ["vin", "index"]
        }
    }
]
df_results = sparkSession.createDataFrame([Row(entry=result_entry)],schema=schema)
df_results.createOrReplaceTempView("df_results")
df_results.show()
Results
+--------------------+
| entry|
+--------------------+
|[{check_datatype,...|
+--------------------+
Results Table Generation
I've used current_date to capture the current date; however, you may change this based on your pipeline.
results_table = sparkSession.sql("""
WITH raw_results as (
SELECT
posexplode(entry),
current_date as time
FROM
df_results
)
SELECT
col.type as Type,
time,
pos as arg_id
FROM
raw_results
""")
results_table.show()
Results
+-----------------+----------+------+
| Type| time|arg_id|
+-----------------+----------+------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+------+
Arguments Table Generation
args_table = sparkSession.sql("""
WITH raw_results as (
SELECT
posexplode(entry)
FROM
df_results
),
raw_arguments AS (
SELECT
explode(col.kwargs),
pos as args_id
FROM
raw_results
),
raw_arguments_before_array_check AS (
SELECT
args_id,
key as bac_key,
value as bac_value
FROM
raw_arguments
),
raw_arguments_after_array_check AS (
SELECT
args_id,
bac_key,
bac_value,
posexplode(split(regexp_replace(bac_value,"[\\\[\\\]]",""),","))
FROM
raw_arguments_before_array_check
)
SELECT
args_id,
bac_key as key,
col as value,
CASE
WHEN bac_value LIKE '[%' THEN pos
ELSE NULL
END as list_index,
abs(hash(args_id, bac_key,col,pos)) as hashcode
FROM
raw_arguments_after_array_check
""")
args_table.show()
Results
+-------+-----------+------+----------+----------+
|args_id| key| value|list_index| hashcode|
+-------+-----------+------+----------+----------+
| 0| d_type|string| null| 216841494|
| 0|column_name| vin| null| 502458545|
| 0| table| cars| null|1469121505|
| 1|column_name| vin| null| 604007568|
| 1| table| cars| null| 784654488|
| 2| columns| vin| 0|1503105124|
| 2| columns| index| 1| 454389776|
| 2| table| cars| null| 858757332|
+-------+-----------+------+----------+----------+
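On the "insert only if the pair is new" requirement: with the delta-spark package, one hedged sketch is a merge keyed on hashcode that only inserts unmatched rows (the table name arguments_table below is hypothetical):
from delta.tables import DeltaTable

# assumes a Delta table named "arguments_table" already exists with a hashcode column
target = DeltaTable.forName(sparkSession, "arguments_table")

(target.alias("t")
    .merge(args_table.alias("s"), "t.hashcode = s.hashcode")
    .whenNotMatchedInsertAll()
    .execute())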
Solution 2: Using UDF
You may also define user-defined functions with your already-implemented Python logic and apply them with Spark.
Setup
We will define our functions to create our results and arguments table here. I have chosen to create generator type functions but this is optional.
result_entry = [
    {
        "type": "check_datatype",
        "kwargs": {
            "table": "cars", "column_name": "vin", "d_type": "string"
        }
    },
    {
        "type": "check_emptystring",
        "kwargs": {
            "table": "cars", "column_name": "vin"
        }
    },
    {
        "type": "check_null",
        "kwargs": {
            "table": "cars", "columns": ["vin", "index"]
        }
    }
]
import json
result_entry_str = json.dumps(result_entry)
result_entry_str
def extract_results_table(entry, current_date=None):
    if current_date is None:
        from datetime import date
        current_date = str(date.today())
    if type(entry) == str:
        import json
        entry = json.loads(entry)
    for arg_id, arg in enumerate(entry):
        yield {
            "Type": arg["type"],
            "time": current_date,
            "args_id": arg_id
        }

def extract_arguments_table(entry):
    if type(entry) == str:
        import json
        entry = json.loads(entry)
    for arg_id, arg in enumerate(entry):
        if "kwargs" in arg:
            for arg_entry in arg["kwargs"]:
                orig_key, orig_value = arg_entry, arg["kwargs"][arg_entry]
                if type(orig_value) == list:
                    for list_index, value in enumerate(orig_value):
                        yield {
                            "args_id": arg_id,
                            "key": orig_key,
                            "value": value,
                            "list_index": list_index,
                            "hash_code": hash((arg_id, orig_key, value, list_index))
                        }
                else:
                    yield {
                        "args_id": arg_id,
                        "key": orig_key,
                        "value": orig_value,
                        "list_index": None,
                        "hash_code": hash((arg_id, orig_key, orig_value, "null"))
                    }
Pyspark Setup
from pyspark.sql.functions import udf, col, explode
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

results_table_schema = ArrayType(StructType([
    StructField("Type", StringType(), True),
    StructField("time", StringType(), True),
    StructField("args_id", IntegerType(), True)
]), True)

arguments_table_schema = ArrayType(StructType([
    StructField("args_id", IntegerType(), True),
    StructField("key", StringType(), True),
    StructField("value", StringType(), True),
    StructField("list_index", IntegerType(), True),
    StructField("hash", StringType(), True)
]), True)

extract_results_table_udf = udf(lambda entry, current_date=None: [*extract_results_table(entry, current_date)], results_table_schema)
extract_arguments_table_udf = udf(lambda entry: [*extract_arguments_table(entry)], arguments_table_schema)

# this is useful if you intend to use your functions in spark-sql
sparkSession.udf.register('extract_results_table', extract_results_table_udf)
sparkSession.udf.register('extract_arguments_table', extract_arguments_table_udf)
Spark Data Frame
df_results_1 = sparkSession.createDataFrame([Row(entry=result_entry_str)],schema="entry string")
df_results_1.createOrReplaceTempView("df_results_1")
df_results_1.show()
Extracting Results Table
# Using Spark SQL
sparkSession.sql("""
WITH results_table AS (
select explode(extract_results_table(entry)) as entry FROM df_results_1
)
SELECT entry.* from results_table
""").show()
# Just python
df_results_1.select(
    explode(extract_results_table_udf(df_results_1.entry)).alias("entry")
).selectExpr("entry.*").show()
Output
+-----------------+----------+-------+
| Type| time|args_id|
+-----------------+----------+-------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+-------+
+-----------------+----------+-------+
| Type| time|args_id|
+-----------------+----------+-------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+-------+
Extracting Arguments Table
# Using spark sql
sparkSession.sql("""
WITH arguments_table AS (
select explode(extract_arguments_table(entry)) as entry FROM df_results_1
)
SELECT entry.* from arguments_table
""").show()
# Just python
df_results_1.select(
    explode(extract_arguments_table_udf(df_results_1.entry)).alias("entry")
).selectExpr("entry.*").show()
Output
+-------+-----------+------+----------+----+
|args_id| key| value|list_index|hash|
+-------+-----------+------+----------+----+
| 0| table| cars| null|null|
| 0|column_name| vin| null|null|
| 0| d_type|string| null|null|
| 1| table| cars| null|null|
| 1|column_name| vin| null|null|
| 2| table| cars| null|null|
| 2| columns| vin| 0|null|
| 2| columns| index| 1|null|
+-------+-----------+------+----------+----+
+-------+-----------+------+----------+----+
|args_id| key| value|list_index|hash|
+-------+-----------+------+----------+----+
| 0| table| cars| null|null|
| 0|column_name| vin| null|null|
| 0| d_type|string| null|null|
| 1| table| cars| null|null|
| 1|column_name| vin| null|null|
| 2| table| cars| null|null|
| 2| columns| vin| 0|null|
| 2| columns| index| 1|null|
+-------+-----------+------+----------+----+
Reference
Spark SQL Functions
Delta Batch Writes

PySpark "explode" dict in column

I have a column 'true_recoms' in spark dataframe:
-RECORD 17-----------------------------------------------------------------
item | 20380109
true_recoms | {"5556867":1,"5801144":5,"7397596":21}
I need to 'explode' this column to get something like this:
item | 20380109
recom_item | 5556867
recom_cnt | 1
..............
item | 20380109
recom_item | 5801144
recom_cnt | 5
..............
item | 20380109
recom_item | 7397596
recom_cnt | 21
I've tried to use from_json but it doesn't work:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema_json = StructType(fields=[
    StructField("item", StringType()),
    StructField("recoms", StringType())
])
df.select(col("true_recoms"), from_json(col("true_recoms"), schema_json)).show(5)
+--------+--------------------+------+
| item| true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...| [,]|
|17359322|{"17359392":1,"17...| [,]|
|31480894|{"31480598":1,"31...| [,]|
| 7265665|{"7265891":1,"503...| [,]|
|31350949|{"32218698":1,"31...| [,]|
+--------+--------------------+------+
only showing top 5 rows
The schema is incorrectly defined. You declare it to be a struct with two string fields,
item
recoms
while neither field is present in the document.
Unfortunately from_json can return only structs or arrays of structs, so redefining the schema as
MapType(StringType(), LongType())
is not an option.
Personally I would use a udf:
from pyspark.sql.functions import udf, explode
import json

@udf("map<string, bigint>")
def parse(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        pass
which can be applied like this
df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)
df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# | item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548| 5801144| 5|
# |31746548| 7397596| 21|
# |31746548| 5556867| 1|
# +--------+----------+---------+
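As a side note, depending on your Spark version (2.2+), from_json does accept a MapType schema, so a udf-free sketch along the same lines could be:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import LongType, MapType, StringType

df.select(
    "item",
    explode(from_json(col("true_recoms"), MapType(StringType(), LongType())))
    .alias("recom_item", "recom_cnt")
).show()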

Adding column to dataframe and updating in pyspark

I have a dataframe in pyspark:
import json

ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: json.loads(l))
)
ratings.show()
+--------+-------------------+------------+----------+-------------+-------+
|click_id| created_at| ip|product_id|product_price|user_id|
+--------+-------------------+------------+----------+-------------+-------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3|
+--------+-------------------+------------+----------+-------------+-------+
ratings.registerTempTable("transactions")
final_df = sqlContext.sql("select * from transactions");
I want to add a new column called status to this dataframe and then update the status column based on created_at and user_id.
The created_at and user_id are read from the given transactions table and passed to a function get_status(user_id, created_at) which returns the status. This status needs to be put into the transactions table as a new column for the corresponding user_id and created_at.
Can I run ALTER and UPDATE commands in PySpark?
How can this be done using PySpark?
It's not clear what you want to do exactly. You should check out window functions; they allow you to compare, sum, etc. rows within a frame.
For instance
import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.partitionBy("user_id").orderBy(psf.desc("created_at"))
ratings.withColumn(
    "status",
    psf.when(psf.row_number().over(w) == 1, "active").otherwise("inactive")
).sort("click_id").show()
+--------+-------------------+------------+----------+-------------+-------+--------+
|click_id| created_at| ip|product_id|product_price|user_id| status|
+--------+-------------------+------------+----------+-------------+-------+--------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|inactive|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|inactive|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1| active|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|inactive|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|inactive|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2| active|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|inactive|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|inactive|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3| active|
+--------+-------------------+------------+----------+-------------+-------+--------+
It gives you each user's last click
If you want to pass a UDF to create a new column from two existing ones, say you have a function that takes user_id and created_at as arguments:
from pyspark.sql.types import *

def get_status(user_id, created_at):
    ...

get_status_udf = psf.udf(get_status, StringType())
StringType() or whichever datatype your function outputs
ratings.withColumn("status", get_status_udf("user_id", "created_at"))
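To make that runnable end to end, here is a purely hypothetical get_status (the real business rule is whatever your function implements):
import pyspark.sql.functions as psf
from pyspark.sql.types import StringType

# hypothetical rule: flag clicks from 2017 onwards as "active"
def get_status(user_id, created_at):
    return "active" if created_at >= "2017-01-01 00:00:00" else "inactive"

get_status_udf = psf.udf(get_status, StringType())
ratings.withColumn("status", get_status_udf("user_id", "created_at")).show()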
