Use enumerate to get partition columns from dataframe - python

I am trying to get all columns and their datatypes into one variable, and only the partition columns into another variable of list type, in Python. I am getting the details from describe extended.
df = spark.sql("describe extended schema_name.table_name")
+----------------------------+-----------------------------+
|col_name                    |data_type                    |
+----------------------------+-----------------------------+
|col1                        |string                       |
|col2                        |int                          |
|col3                        |string                       |
|col4                        |int                          |
|col5                        |string                       |
|# Partition Information     |                             |
|# col_name                  |data_type                    |
|col4                        |int                          |
|col5                        |string                       |
|                            |                             |
|# Detailed Table Information|                             |
|Database                    |schema_name                  |
|Table                       |table_name                   |
|Owner                       |owner.name                   |
Converting result into a list.
des_list=df.select(df.col_name,df.data_type).rdd.map(lambda x:(x[0],x[1])).collect()
Here is how I am trying to get all the columns (all items before '# Partition Information').
all_cols_name_type = []
for index, item in enumerate(des_list):
    if item[0] == '# Partition Information':
        all_cols_name_type.append(des_list[:index])
For the partitions, I would like to get everything between the item '# col_name' and the blank line just before '# Detailed Table Information'.
Any help to achieve this is appreciated.

You can try the following in Scala:
val (partitionCols, dataCols) = spark.catalog.listColumns("schema_name.table_name")
  .collect()
  .partition(c => c.isPartition)

val parCols = partitionCols.map(c => (c.name, c.dataType))
val datCols = dataCols.map(c => (c.name, c.dataType))
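Since the question is about Python, a rough PySpark sketch of the same catalog approach (assuming the table is registered in the catalog) could look like this:
cols = spark.catalog.listColumns("table_name", dbName="schema_name")
partition_cols = [(c.name, c.dataType) for c in cols if c.isPartition]
data_cols = [(c.name, c.dataType) for c in cols if not c.isPartition]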
If the table is not defined in the catalog (e.g. reading a parquet dataset directly from S3 using spark.read.parquet("s3://path/...")), then you can use the following snippet in Scala:
import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2ScanRelation, FileScan}
import org.apache.spark.sql.types.StructType

val (partitionSchema, dataSchema) = df.queryExecution.optimizedPlan match {
  case LogicalRelation(hfs: HadoopFsRelation, _, _, _) =>
    (hfs.partitionSchema, hfs.dataSchema)
  case DataSourceV2ScanRelation(_, scan: FileScan, _) =>
    (scan.readPartitionSchema, scan.readDataSchema)
  case _ => (StructType(Seq()), StructType(Seq()))
}

val parCols = partitionSchema.map(f => (f.name, f.dataType))
val datCols = dataSchema.map(f => (f.name, f.dataType))

There is a trick to do this: you can use monotonically_increasing_id to give each row a number, find the row that has '# col_name', and take everything after that index. Something like this:
My sample table
from pyspark.sql import functions as F

df = spark.sql('describe data')
df = df.withColumn('id', F.monotonically_increasing_id())
df.show()
+--------------------+---------+-------+---+
| col_name|data_type|comment| id|
+--------------------+---------+-------+---+
| c1| int| null| 0|
| c2| string| null| 1|
|# Partition Infor...| | | 2|
| # col_name|data_type|comment| 3|
| c2| string| null| 4|
+--------------------+---------+-------+---+
tricky part
idx = df.where(F.col('col_name') == '# col_name').first()['id']
# 3
partition_cols = [r['col_name'] for r in df.where(F.col('id') > idx).collect()]
# ['c2']
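If you prefer to stick with the collected des_list of (col_name, data_type) tuples from the question, a plain-Python sketch (assuming the blank separator row has an empty col_name, as in the output above) could be:
part_info_idx = next(i for i, (name, _) in enumerate(des_list)
                     if name == '# Partition Information')
col_name_idx = next(i for i, (name, _) in enumerate(des_list)
                    if name == '# col_name')
blank_idx = next(i for i, (name, _) in enumerate(des_list)
                 if i > col_name_idx and name.strip() == '')

all_cols_name_type = des_list[:part_info_idx]   # every column with its data type
partition_cols = [name for name, _ in des_list[col_name_idx + 1:blank_idx]]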

Related

PySpark - Transform multiple columns without using udf

I have a df like this one:
df = spark.createDataFrame(
    [("1", "Apple", "cat"), ("2", "2.", "house"), ("3", "<strong>text</strong>", "HeLlo 2.5")],
    ["id", "text1", "text2"])
+---+---------------------+---------+
| id| text1| text2|
+---+---------------------+---------+
| 1| Apple| cat|
| 2| 2.| house|
| 3|<strong>text</strong>|HeLlo 2.5|
+---+---------------------+---------+
I have multiple functions to clean the text, like:
import re
from lxml import etree, html

def remove_html_tags(text):
    document = html.fromstring(text)
    return " ".join(etree.XPath("//text()")(document))

def lowercase(text):
    return text.lower()

def remove_wrong_dot(text):
    return re.sub(r'(?<!\d)[.,;:]|[.,;:](?!\d)', ' ', text)
and a list of columns to clean
COLS = ["text1", "text2"]
I would like to apply the functions to the columns in the list and also keep the original text
+---+---------------------+-----------+---------+-----------+
| id| text1|text1_clean| text2|text2_clean|
+---+---------------------+-----------+---------+-----------+
| 1| Apple| apple| cat| cat|
| 2| 2.| 2| house| house|
| 3|<strong>text</strong>| text|HeLlo 2.5| hello 2.5|
+---+---------------------+-----------+---------+-----------+
I already have an approach using UDF but it is not very efficient. I've been trying something like:
rdds = []
for col in COLS:
    rdd = df.rdd.map(lambda x: (x[col], lowercase(x[col])))
    rdds.append(rdd.collect())
return df
My idea would be to join all the rdds in the list, but I don't know how efficient this would be or how to include more functions. I appreciate any ideas or suggestions.
EDIT: Not all transformations can be done with regexp_replace. For example, the text can include nested HTML tags, in which case a simple replace wouldn't work. Also, I don't want to replace all dots, only those at the beginning or end of substrings.
Spark built-in functions can do all the transformations you want:
from pyspark.sql import functions as F

cols = ["text1", "text2"]
for c in cols:
    df = (df
          .withColumn(f'{c}_clean', F.lower(c))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', '<[^>]+>', ''))
          .withColumn(f'{c}_clean', F.regexp_replace(f'{c}_clean', r'(?<!\d)[.,;:]|[.,;:](?!\d)', ''))
          )
+---+--------------------+---------+-----------+-----------+
| id| text1| text2|text1_clean|text2_clean|
+---+--------------------+---------+-----------+-----------+
| 1| Apple| cat| apple| cat|
| 2| 2.| house| 2| house|
| 3|<strong>text</str...|HeLlo 2.5| text| hello 2.5|
+---+--------------------+---------+-----------+-----------+
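The same chain can also be written as a single select instead of the withColumn loop; a sketch, assuming the same df and cols as above:
from pyspark.sql import functions as F

# build one cleaned expression per column, then apply them all in a single select
clean_cols = [
    F.regexp_replace(
        F.regexp_replace(F.lower(F.col(c)), '<[^>]+>', ''),
        r'(?<!\d)[.,;:]|[.,;:](?!\d)', ''
    ).alias(f'{c}_clean')
    for c in cols
]
df = df.select('*', *clean_cols)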

pyspark replace lowercase characters in column with 'x'

I'm trying to do the following, but for a column in PySpark, with no luck. Any idea on isolating just the lowercase characters in a column of a Spark df?
''.join('x' if x.islower() else 'X' if x.isupper() else x for x in text)
You can directly use regexp_replace to substitute the lowercase characters with any desired value.
In your case you will have to chain regexp_replace calls to get the final output.
Data Preparation
import pandas as pd
from pyspark.sql import functions as F

inp_string = """
lRQWg2IZtB
hVzsJhPVH0
YXzc4fZDwu
qRyOUhT5Hn
b85O0H41RE
vOxPLFPWPy
fE6o5iMJ6I
918JI00EC7
x3yEYOCwek
m1eWY8rZwO
""".strip().split()

df = pd.DataFrame({
    'value': inp_string
})

# sql is an existing SparkSession
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
Regex Replace
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value", r'[a-z]', "x"))
sparkDF = sparkDF.withColumn('value_modified',F.regexp_replace("value_modified", r'[A-Z]', "X"))
sparkDF.show()
+----------+--------------+
| value|value_modified|
+----------+--------------+
|lRQWg2IZtB| xXXXx2XXxX|
|hVzsJhPVH0| xXxxXxXXX0|
|YXzc4fZDwu| XXxx4xXXxx|
|qRyOUhT5Hn| xXxXXxX5Xx|
|b85O0H41RE| x85X0X41XX|
|vOxPLFPWPy| xXxXXXXXXx|
|fE6o5iMJ6I| xX6x5xXX6X|
|918JI00EC7| 918XX00XX7|
|x3yEYOCwek| x3xXXXXxxx|
|m1eWY8rZwO| x1xXX8xXxX|
+----------+--------------+
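An alternative sketch that avoids regex entirely is F.translate, which maps characters one-to-one (assuming the same sparkDF as above):
import string
from pyspark.sql import functions as F

# replace every ASCII lowercase letter with 'x', then every uppercase letter with 'X'
sparkDF = sparkDF.withColumn(
    'value_modified',
    F.translate(
        F.translate('value', string.ascii_lowercase, 'x' * 26),
        string.ascii_uppercase, 'X' * 26
    )
)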
Using the following dataframe as an example
+----------+
| value|
+----------+
|lRQWg2IZtB|
|hVzsJhPVH0|
|YXzc4fZDwu|
|qRyOUhT5Hn|
|b85O0H41RE|
|vOxPLFPWPy|
|fE6o5iMJ6I|
|918JI00EC7|
|x3yEYOCwek|
|m1eWY8rZwO|
+----------+
You can use a pyspark.sql function called regexp_replace to isolate the lowercase letters in the column with the following code:
from pyspark.sql import functions

df = (df.withColumn("value",
                    functions.regexp_replace("value", r'[A-Z]|[0-9]|[,.;##?!&$]', "")))
df.show()
+-----+
|value|
+-----+
| lgt|
| hzsh|
|zcfwu|
| qyhn|
| b|
| vxy|
| foi|
| |
|xywek|
| merw|
+-----+

Creating pyspark dataframe from list of dictionaries

I have below list of dictionaries
results = [
    {
        "type": "check_datatype",
        "kwargs": {
            "table": "cars", "column_name": "vin", "d_type": "string"
        }
    },
    {
        "type": "check_emptystring",
        "kwargs": {
            "table": "cars", "column_name": "vin"
        }
    },
    {
        "type": "check_null",
        "kwargs": {
            "table": "cars", "columns": ["vin", "index"]
        }
    }
]
I want to create two different PySpark dataframes with the below schema -
The args_id column in the results table will be the same when we have a unique pair of (type, kwargs). This JSON has to be run on a daily basis, and hence if the same pair of (type, kwargs) appears again, it should get the same args_id value.
Till now, I have written this code:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window
check_type_results = [[elt['type']] for elt in results]
checkColumns = ['type']
spark = SparkSession.builder.getOrCreate()
checkResultsDF = spark.createDataFrame(data=check_type_results, schema=checkColumns)
checkResultsDF = checkResultsDF.withColumn("time", F.current_timestamp())
checkResultsDF = checkResultsDF.withColumn("args_id", F.row_number().over(Window.orderBy(F.monotonically_increasing_id())))
checkResultsDF.printSchema()
Now, with my code, I always get args_id in increasing order, which is correct for the first run. But if I run the JSON again on the next day (or even on the same day) and it contains some pair of (type, kwargs) that has already appeared before, I should use the same args_id for that pair.
If some pair (type, kwargs) has no entry in the Arguments table, only then will I insert into the Arguments table; if the pair (type, kwargs) already exists there, no insert should happen.
Once these two dataframes are filled properly, I want to load them into separate delta tables.
The hashcode column in the Arguments table is a unique identifier for each "kwargs".
Issues
Your schema is a bit incomplete. A more detailed schema will allow you to take advantage of more Spark features. See the solutions below using spark-sql and pyspark. Instead of window functions that require ordered partitions, you may take advantage of a few of the table-generating array functions available in spark-sql, such as explode and posexplode. As for writing to your delta tables, you may see the Delta Batch Writes reference at the end of this answer.
Solution 1 : Using Spark SQL
Setup
from pyspark.sql.types import ArrayType,StructType, StructField, StringType, MapType
from pyspark.sql import Row, SparkSession
sparkSession = SparkSession.builder.appName("Demo").getOrCreate()
Schema Definition
Your sample record is an array of structs/objects where kwargs is a MapType with optional keys. NB: the True indicates nullable and should assist when there are missing keys or entries with different formats.
schema = StructType([
    StructField("entry", ArrayType(
        StructType([
            StructField("type", StringType(), True),
            StructField("kwargs", MapType(StringType(), StringType()), True)
        ])
    ), True)
])
Reproducible Example
result_entry = [
    {
        "type": "check_datatype",
        "kwargs": {
            "table": "cars", "column_name": "vin", "d_type": "string"
        }
    },
    {
        "type": "check_emptystring",
        "kwargs": {
            "table": "cars", "column_name": "vin"
        }
    },
    {
        "type": "check_null",
        "kwargs": {
            "table": "cars", "columns": ["vin", "index"]
        }
    }
]
df_results = sparkSession.createDataFrame([Row(entry=result_entry)],schema=schema)
df_results.createOrReplaceTempView("df_results")
df_results.show()
Results
+--------------------+
| entry|
+--------------------+
|[{check_datatype,...|
+--------------------+
Results Table Generation
I've used current_date to capture the current date; however, you may change this based on your pipeline.
results_table = sparkSession.sql("""
WITH raw_results as (
SELECT
posexplode(entry),
current_date as time
FROM
df_results
)
SELECT
col.type as Type,
time,
pos as arg_id
FROM
raw_results
""")
results_table.show()
Results
+-----------------+----------+------+
| Type| time|arg_id|
+-----------------+----------+------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+------+
Arguments Table Generation
args_table = sparkSession.sql("""
WITH raw_results as (
SELECT
posexplode(entry)
FROM
df_results
),
raw_arguments AS (
SELECT
explode(col.kwargs),
pos as args_id
FROM
raw_results
),
raw_arguments_before_array_check AS (
SELECT
args_id,
key as bac_key,
value as bac_value
FROM
raw_arguments
),
raw_arguments_after_array_check AS (
SELECT
args_id,
bac_key,
bac_value,
posexplode(split(regexp_replace(bac_value,"[\\\[\\\]]",""),","))
FROM
raw_arguments_before_array_check
)
SELECT
args_id,
bac_key as key,
col as value,
CASE
WHEN bac_value LIKE '[%' THEN pos
ELSE NULL
END as list_index,
abs(hash(args_id, bac_key,col,pos)) as hashcode
FROM
raw_arguments_after_array_check
""")
args_table.show()
Results
+-------+-----------+------+----------+----------+
|args_id| key| value|list_index| hashcode|
+-------+-----------+------+----------+----------+
| 0| d_type|string| null| 216841494|
| 0|column_name| vin| null| 502458545|
| 0| table| cars| null|1469121505|
| 1|column_name| vin| null| 604007568|
| 1| table| cars| null| 784654488|
| 2| columns| vin| 0|1503105124|
| 2| columns| index| 1| 454389776|
| 2| table| cars| null| 858757332|
+-------+-----------+------+----------+----------+
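On the "insert only if the pair is new" requirement: with the delta-spark package, one hedged sketch is a merge keyed on hashcode that only inserts unmatched rows (the table name arguments_table below is hypothetical):
from delta.tables import DeltaTable

# assumes a Delta table named "arguments_table" already exists with a hashcode column
target = DeltaTable.forName(sparkSession, "arguments_table")

(target.alias("t")
    .merge(args_table.alias("s"), "t.hashcode = s.hashcode")
    .whenNotMatchedInsertAll()
    .execute())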
Solution 2: Using UDF
You may also define user-defined functions with your already-implemented Python logic and apply them with Spark.
Setup
We will define our functions to create our results and arguments table here. I have chosen to create generator type functions but this is optional.
result_entry = [
    {
        "type": "check_datatype",
        "kwargs": {
            "table": "cars", "column_name": "vin", "d_type": "string"
        }
    },
    {
        "type": "check_emptystring",
        "kwargs": {
            "table": "cars", "column_name": "vin"
        }
    },
    {
        "type": "check_null",
        "kwargs": {
            "table": "cars", "columns": ["vin", "index"]
        }
    }
]
import json
result_entry_str = json.dumps(result_entry)
result_entry_str
def extract_results_table(entry, current_date=None):
    if current_date is None:
        from datetime import date
        current_date = str(date.today())
    if type(entry) == str:
        import json
        entry = json.loads(entry)
    for arg_id, arg in enumerate(entry):
        yield {
            "Type": arg["type"],
            "time": current_date,
            "args_id": arg_id
        }

def extract_arguments_table(entry):
    if type(entry) == str:
        import json
        entry = json.loads(entry)
    for arg_id, arg in enumerate(entry):
        if "kwargs" in arg:
            for arg_entry in arg["kwargs"]:
                orig_key, orig_value = arg_entry, arg["kwargs"][arg_entry]
                if type(orig_value) == list:
                    for list_index, value in enumerate(orig_value):
                        yield {
                            "args_id": arg_id,
                            "key": orig_key,
                            "value": value,
                            "list_index": list_index,
                            "hash_code": hash((arg_id, orig_key, value, list_index))
                        }
                else:
                    yield {
                        "args_id": arg_id,
                        "key": orig_key,
                        "value": orig_value,
                        "list_index": None,
                        "hash_code": hash((arg_id, orig_key, orig_value, "null"))
                    }
Pyspark Setup
from pyspark.sql.functions import udf, col, explode
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

results_table_schema = ArrayType(StructType([
    StructField("Type", StringType(), True),
    StructField("time", StringType(), True),
    StructField("args_id", IntegerType(), True)
]), True)

arguments_table_schema = ArrayType(StructType([
    StructField("args_id", IntegerType(), True),
    StructField("key", StringType(), True),
    StructField("value", StringType(), True),
    StructField("list_index", IntegerType(), True),
    StructField("hash", StringType(), True)
]), True)

extract_results_table_udf = udf(lambda entry, current_date=None: [*extract_results_table(entry, current_date)], results_table_schema)
extract_arguments_table_udf = udf(lambda entry: [*extract_arguments_table(entry)], arguments_table_schema)

# this is useful if you intend to use your functions in spark-sql
sparkSession.udf.register('extract_results_table', extract_results_table_udf)
sparkSession.udf.register('extract_arguments_table', extract_arguments_table_udf)
Spark Data Frame
df_results_1 = sparkSession.createDataFrame([Row(entry=result_entry_str)],schema="entry string")
df_results_1.createOrReplaceTempView("df_results_1")
df_results_1.show()
Extracting Results Table
# Using Spark SQL
sparkSession.sql("""
WITH results_table AS (
select explode(extract_results_table(entry)) as entry FROM df_results_1
)
SELECT entry.* from results_table
""").show()
# Just python
df_results_1.select(
    explode(extract_results_table_udf(df_results_1.entry)).alias("entry")
).selectExpr("entry.*").show()
Output
+-----------------+----------+-------+
| Type| time|args_id|
+-----------------+----------+-------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+-------+
+-----------------+----------+-------+
| Type| time|args_id|
+-----------------+----------+-------+
| check_datatype|2021-03-31| 0|
|check_emptystring|2021-03-31| 1|
| check_null|2021-03-31| 2|
+-----------------+----------+-------+
Extracting Arguments Table
# Using spark sql
sparkSession.sql("""
WITH arguments_table AS (
select explode(extract_arguments_table(entry)) as entry FROM df_results_1
)
SELECT entry.* from arguments_table
""").show()
# Just python
df_results_1.select(
    explode(extract_arguments_table_udf(df_results_1.entry)).alias("entry")
).selectExpr("entry.*").show()
Output
+-------+-----------+------+----------+----+
|args_id| key| value|list_index|hash|
+-------+-----------+------+----------+----+
| 0| table| cars| null|null|
| 0|column_name| vin| null|null|
| 0| d_type|string| null|null|
| 1| table| cars| null|null|
| 1|column_name| vin| null|null|
| 2| table| cars| null|null|
| 2| columns| vin| 0|null|
| 2| columns| index| 1|null|
+-------+-----------+------+----------+----+
+-------+-----------+------+----------+----+
|args_id| key| value|list_index|hash|
+-------+-----------+------+----------+----+
| 0| table| cars| null|null|
| 0|column_name| vin| null|null|
| 0| d_type|string| null|null|
| 1| table| cars| null|null|
| 1|column_name| vin| null|null|
| 2| table| cars| null|null|
| 2| columns| vin| 0|null|
| 2| columns| index| 1|null|
+-------+-----------+------+----------+----+
Reference
Spark SQL Functions
Delta Batch Writes

PySpark "explode" dict in column

I have a column 'true_recoms' in spark dataframe:
-RECORD 17-----------------------------------------------------------------
item | 20380109
true_recoms | {"5556867":1,"5801144":5,"7397596":21}
I need to 'explode' this column to get something like this:
item | 20380109
recom_item | 5556867
recom_cnt | 1
..............
item | 20380109
recom_item | 5801144
recom_cnt | 5
..............
item | 20380109
recom_item | 7397596
recom_cnt | 21
I've tried to use from_json but it doesn't work:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

schema_json = StructType(fields=[
    StructField("item", StringType()),
    StructField("recoms", StringType())
])
df.select(col("true_recoms"), from_json(col("true_recoms"), schema_json)).show(5)
+--------+--------------------+------+
| item| true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...| [,]|
|17359322|{"17359392":1,"17...| [,]|
|31480894|{"31480598":1,"31...| [,]|
| 7265665|{"7265891":1,"503...| [,]|
|31350949|{"32218698":1,"31...| [,]|
+--------+--------------------+------+
only showing top 5 rows
The schema is incorrectly defined. You declare it to be a struct with two string fields,
item
recoms
while neither field is present in the document.
Unfortunately from_json can return only structs or arrays of structs, so redefining the schema as
MapType(StringType(), LongType())
is not an option.
Personally I would use a udf:
from pyspark.sql.functions import udf, explode
import json

@udf("map<string, bigint>")
def parse(s):
    try:
        return json.loads(s)
    except json.JSONDecodeError:
        pass
which can be applied like this
df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)
df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# | item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548| 5801144| 5|
# |31746548| 7397596| 21|
# |31746548| 5556867| 1|
# +--------+----------+---------+
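As a side note, depending on your Spark version (2.2+), from_json does accept a MapType schema, so a udf-free sketch along the same lines could be:
from pyspark.sql.functions import col, explode, from_json
from pyspark.sql.types import LongType, MapType, StringType

df.select(
    "item",
    explode(from_json(col("true_recoms"), MapType(StringType(), LongType())))
    .alias("recom_item", "recom_cnt")
).show()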

Adding column to dataframe and updating in pyspark

I have a dataframe in pyspark:
import json

ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: json.loads(l))
)
ratings.show()
+--------+-------------------+------------+----------+-------------+-------+
|click_id| created_at| ip|product_id|product_price|user_id|
+--------+-------------------+------------+----------+-------------+-------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3|
+--------+-------------------+------------+----------+-------------+-------+
ratings.registerTempTable("transactions")
final_df = sqlContext.sql("select * from transactions");
I want to add a new column called status to this dataframe and then update the status column based on created_at and user_id.
The created_at and user_id are read from the given transactions table and passed to a function get_status(user_id, created_at) which returns the status. This status needs to be put into the transactions table as a new column for the corresponding user_id and created_at.
Can I run ALTER and UPDATE commands in PySpark?
How can this be done using PySpark?
It's not clear what you want to do exactly. You should check out window functions; they allow you to compare, sum, etc. rows within a frame.
For instance
import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.partitionBy("user_id").orderBy(psf.desc("created_at"))
ratings.withColumn(
    "status",
    psf.when(psf.row_number().over(w) == 1, "active").otherwise("inactive")
).sort("click_id").show()
+--------+-------------------+------------+----------+-------------+-------+--------+
|click_id| created_at| ip|product_id|product_price|user_id| status|
+--------+-------------------+------------+----------+-------------+-------+--------+
| 123|2016-10-03 12:50:33| 10.10.10.10| 98373| 220.5| 1|inactive|
| 124|2017-02-03 11:51:33| 10.13.10.10| 97373| 320.5| 1|inactive|
| 125|2017-10-03 12:52:33| 192.168.2.1| 96373| 20.5| 1| active|
| 126|2017-10-03 13:50:33|172.16.11.10| 88373| 220.5| 2|inactive|
| 127|2017-10-03 13:51:33| 10.12.15.15| 87373| 320.5| 2|inactive|
| 128|2017-10-03 13:52:33|192.168.1.10| 86373| 20.5| 2| active|
| 129|2017-08-03 14:50:33| 10.13.10.10| 78373| 220.5| 3|inactive|
| 130|2017-10-03 14:51:33| 12.168.1.60| 77373| 320.5| 3|inactive|
| 131|2017-10-03 14:52:33| 10.10.30.30| 76373| 20.5| 3| active|
+--------+-------------------+------------+----------+-------------+-------+--------+
It gives you each user's last click
If you want to pass a UDF to create a new column from two existing ones, say you have a function that takes user_id and created_at as arguments:
from pyspark.sql.types import *

def get_status(user_id, created_at):
    ...

get_status_udf = psf.udf(get_status, StringType())
StringType() or whichever datatype your function outputs
ratings.withColumn("status", get_status_udf("user_id", "created_at"))
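To make that runnable end to end, here is a purely hypothetical get_status (the real business rule is whatever your function implements):
import pyspark.sql.functions as psf
from pyspark.sql.types import StringType

# hypothetical rule: flag clicks from 2017 onwards as "active"
def get_status(user_id, created_at):
    return "active" if created_at >= "2017-01-01 00:00:00" else "inactive"

get_status_udf = psf.udf(get_status, StringType())
ratings.withColumn("status", get_status_udf("user_id", "created_at")).show()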
