How to convert an array to string efficiently in PySpark / Python

I have a df with the following schema:
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
in which one of the columns, col2 is an array [1#b, 2#b, 3#c]. I want to convert this to the string format 1#b,2#b,3#c.
I am currently doing this with the following snippet:
df2 = df1.select("*", explode("col2")).drop("col2")
df2.groupBy("col1").agg(concat_ws(",", collect_list("col")).alias("col2"))
While this gets the job done, it is taking time and also seems inefficient.
Is there a better alternative?

You can call concat_ws directly on the array column, avoiding the explode/groupBy round trip entirely:
df1.withColumn('col2', concat_ws(',', 'col2'))
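As a quick sanity check of the expected result (plain Python, not Spark; the sample array is the one from the question), concat_ws(',', ...) on an array of strings behaves per row like str.join:

```python
# Local illustration of what concat_ws(",", "col2") produces for one row.
arr = ["1#b", "2#b", "3#c"]  # sample array value from the question
joined = ",".join(arr)
print(joined)  # 1#b,2#b,3#c
```

This also avoids the shuffle that the explode/groupBy version triggers, since concat_ws is evaluated row by row.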

Related

How to append dataframes together in pyspark?

I have a pyspark dataframe that is the output of machine learning predictions like this:
predictions = model.transform(test_data)
+------------+------------+-----+------------------+-------+--------------------+--------------------+----------+
|col1_imputed|col2_imputed|label|          features|row_num|       rawPrediction|         probability|prediction|
+------------+------------+-----+------------------+-------+--------------------+--------------------+----------+
|   -0.002353|      0.9762|    0|[-0.002353,0.9762]|      1|[-0.8726465863653...|[0.29470390100153...|       1.0|
|    -0.08637|     0.06524|    0|[-0.08637,0.06524]|      3|[-0.6029409441836...|[0.35367114067727...|          |
+------------+------------+-----+------------------+-------+--------------------+--------------------+----------+
root
|-- col1_imputed: double (nullable = true)
|-- col2_imputed: double (nullable = true)
|-- label: integer (nullable = true)
|-- features: vector (nullable = true)
|-- row_num: integer (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
I convert the probability column to keep only the positive-class probability from each vector, but when I try to append this new column to the dataframe above (or replace the current probability column with it), I get errors.
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

secondelement = udf(lambda v: float(v[1]), FloatType())
pos_prob = predictions.select(secondelement('probability'))  # selects second element in probability column

# trying to add the new pos_prob column, named 'prob', to the dataframe:
df = predictions.withColumn('prob', predictions.select(secondelement('probability'))).collect()
AssertionError: col should be Column
I have also tried wrapping it in lit() after reading similar questions, but this gives another error:
df = predictions.withColumn('prob', lit(predictions.select(secondelement('probability')))).collect()
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
You can use the UDF with withColumn, e.g.
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
secondelement = udf(lambda v: float(v[1]), FloatType())
df = predictions.withColumn('prob', secondelement('probability'))
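The UDF body itself is plain Python, so it can be checked locally without a Spark session (the sample vector below is made up for illustration):

```python
# The UDF wraps this function: take the second (positive-class) element
# of a two-class probability vector and return it as a float.
def extract_second(v):
    return float(v[1])

probability_row = [0.2947, 0.7053]  # hypothetical two-class probability vector
print(extract_second(probability_row))  # 0.7053
```

On Spark 3+, pyspark.ml.functions.vector_to_array followed by getItem(1) achieves the same without a Python UDF, which is usually faster.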

converting RDD to dataframe fails on string to date conversion

I am working on extracting some data from xml. My overall workflow, which might be inefficient, is:
1. Read the xml into a dataframe ('df_individual')
2. Filter unwanted columns
3. Make the target schema (shared below)
4. Convert the dataframe to an RDD
5. Create a dataframe using the schema and RDD from steps 3 and 4
I created the RDD like below:
rddd = df_individual.rdd.map(tuple)
'df_individual' is the original dataframe read from the xml.
Below is the schema:
schema = types.StructType([
    types.StructField('applicaion_id', types.StringType()),
    types.StructField('cd_type', types.StringType()),
    types.StructField('cd_title', types.StringType()),
    types.StructField('firstname', types.StringType()),
    types.StructField('middlename', types.StringType()),
    types.StructField('nm_surname', types.StringType()),
    types.StructField('dt_dob', types.DateType()),
    types.StructField('cd_gender', types.StringType()),
    types.StructField('cd_citizenship', types.StringType())
])
It fails on
df_result = spark.createDataFrame(rddd, schema)
The error is
TypeError: field dt_dob: DateType can not accept object '1973-02-19' in type <class 'str'>
The main purpose of creating the 'df_result' dataframe is to have a predefined schema and implicitly cast all the columns where the RDD and the dataframe differ. This is my first time working with RDDs and I couldn't find a straightforward casting mechanism for such a case.
If you can help with solving the casting error or share a better workflow, that would be great.
Thanks
If your aim is only to get your data into the right schema and transform some string columns into date columns, I would use a select combined with to_date.
df.select('applicaion_id', 'cd_type', 'cd_title', 'firstname', 'middlename', 'nm_surname',
          F.to_date('dt_dob').alias('dt_bob'),
          'cd_gender', 'cd_citizenship') \
    .printSchema()
prints
root
|-- applicaion_id: string (nullable = true)
|-- cd_type: string (nullable = true)
|-- cd_title: string (nullable = true)
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- nm_surname: string (nullable = true)
|-- dt_bob: date (nullable = true)
|-- cd_gender: string (nullable = true)
|-- cd_citizenship: string (nullable = true)
with the column dt_bob having a date datatype.
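The string-to-date conversion itself can be sanity-checked in plain Python: to_date with its default format parses ISO 'yyyy-MM-dd' strings, which is exactly the shape of the value from the error message:

```python
from datetime import date, datetime

# The failing value from the TypeError, parsed with the same ISO
# format that Spark's to_date uses by default (yyyy-MM-dd).
dob = datetime.strptime('1973-02-19', '%Y-%m-%d').date()
print(dob)                       # 1973-02-19
print(dob == date(1973, 2, 19))  # True
```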

How to access dataframe returned by UDF inside column in Spark Streaming

I'm writing code to perform step detection on streaming sensor data. To do so I separate all incoming values as sensor parameters. After that I detect the steps with a UDF. Inside this UDF I create a dict to store two lists of important timestamps per determined step.
@udf(StructType())
def detect_steps(timestamp, x, y, z):
    ...
    d = dict()
    d['timestamp_ic'] = times_ic
    d['timestamp_to'] = timestamp_to
    ...
    return d
In the Spark main function I created a dataframe that calculates all these steps in a sliding window like so:
stepData = LLLData \
    .withWatermark("time", "10 seconds") \
    .groupBy(
        window("time", windowDuration="5 seconds", slideDuration="1 second"),
        "sensor"
    ) \
    .agg(
        collect_list("time").alias("time_window"),
        collect_list(sensorData.Acceleration_x).alias("Acceleration_x_window"),
        collect_list(sensorData.Acceleration_y).alias("Acceleration_y_window"),
        collect_list(sensorData.Acceleration_z).alias("Acceleration_z_window"),
    ) \
    .select(
        "window",
        "sensor",
        detect_steps("time_window", "Acceleration_x_window", "Acceleration_y_window", "Acceleration_z_window")
    )
Now, when I print the df schema it looks like this:
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- sensor: string (nullable = true)
|-- detect_steps("time_window", "Acceleration_x_window", "Acceleration_y_window", "Acceleration_z_window"): string (nullable = true)
While I want this:
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- sensor: string (nullable = true)
|-- timestamp_ic: string (nullable = true)
|-- timestamp_to: string (nullable = true)
However, I cannot perform a select on the UDF column in stepData; it fails with the error Column is not iterable.
When I try to alter the root dataframe afterwards, for example by parsing the ic column into a spark dataframe like so:
df_stepData = spark.createDataFrame(data=stepData.select("ic"))
it gives me TypeError: data is already a DataFrame.
Looking at the dataframe schema, however, ic is typed as string.
I have also tried to read ic as a json file, but that gives the following error:
TypeError: path can be only string, list or RDD
I could work around this by calling the detect_steps UDF twice, once returning timestamp_ic and once returning timestamp_to, to get two columns, but I'm sure there is a better, more efficient way.

New to Pyspark - importing a CSV and creating a parquet file with array columns

I am new to Pyspark and I've been pulling my hair out trying to accomplish something I believe is fairly simple. I am trying to do an ETL process where a csv file is converted to a parquet file. The CSV file has a few simple columns, but one column is a delimited array of integers that I want to expand/unzip into a parquet file. This parquet file is actually used by a .net core micro service which uses a Parquet Reader to do calculations downstream. To keep this question simple, the structure of the column is:
"geomap" 5:3:7|4:2:1|8:2:78 -> this represents an array of 3 items; it is split at the "|" and then a tuple is built from the values: (5,3,7), (4,2,1), (8,2,78)
I have tried various processes and schemas and I can't get this correct. Via UDF I am creating either a list of lists or a list of tuples, but I can't get the schema correct or unzip/explode the data into the parquet write operation. I either get nulls, an error, or other problems. Do I need to approach this differently? Relevant code is below. I am just showing the problem column for simplicity since I have the rest working. This is my first Pyspark attempt, so apologies if I'm missing something obvious:
def convert_geo(geo):
    return [tuple(x.split(':')) for x in geo.split('|')]

compression_type = 'snappy'

schema = ArrayType(StructType([
    StructField("c1", IntegerType(), False),
    StructField("c2", IntegerType(), False),
    StructField("c3", IntegerType(), False)
]))

spark_convert_geo = udf(lambda z: convert_geo(z), schema)

source_path = '...path to csv'
destination_path = 'path for generated parquet file'

df = spark.read.option('delimiter', ',').option('header', 'true').csv(source_path) \
    .withColumn("geomap", spark_convert_geo(col('geomap')).alias("geomap"))
df.write.mode("overwrite").format('parquet').option('compression', compression_type).save(destination_path)
EDIT: Per request, adding the printSchema() output. I'm not sure what's wrong here either; I still can't seem to get the string split values to show up or render properly. This contains all the columns. I do see the c1, c2 and c3 struct names...
root
|-- lrsegid: integer (nullable = true)
|-- loadsourceid: integer (nullable = true)
|-- agencyid: integer (nullable = true)
|-- acres: float (nullable = true)
|-- sourcemap: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- geomap: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: integer (nullable = false)
| | |-- c2: integer (nullable = false)
| | |-- c3: integer (nullable = false)
The problem is that the convert_geo function returns a list of tuples with string elements rather than the ints specified in the schema. If you modify it as follows it will work:
def convert_geo(geo):
    return [tuple([int(y) for y in x.split(':')]) for x in geo.split('|')]
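A quick local check of the corrected function (plain Python, no Spark needed), using the sample value from the question:

```python
def convert_geo(geo):
    # Split on '|' into items, then on ':' into fields,
    # casting each field to int so it matches the IntegerType schema.
    return [tuple(int(y) for y in x.split(':')) for x in geo.split('|')]

print(convert_geo("5:3:7|4:2:1|8:2:78"))  # [(5, 3, 7), (4, 2, 1), (8, 2, 78)]
```

The integer tuples now line up with the c1/c2/c3 IntegerType fields, so the UDF's ArrayType(StructType(...)) return schema is satisfied.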

How to convert column with string type to int form in pyspark data frame?

I have a dataframe in pyspark. Some of its numerical columns contain nan, so when I read the data and check the schema of the dataframe, those columns come through as string type.
How can I change them to int type? I replaced the nan values with 0 and checked the schema again, but it still shows string type for those columns. I am using the code below:
data_df = sqlContext.read.format("csv").load('data.csv', header=True, inferSchema=True)
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()
Here the columns Plays and drafts contain integer values, but because of the nan values present in these columns they are treated as string type.
from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You can run a loop over each column, but this is the simplest way to convert a string column into an integer one.
You could cast to int after replacing NaN with 0, e.g.
data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))
Another way to do it is using StructField, if you have multiple fields that need to be modified.
Ex:
from pyspark.sql.types import StructField, IntegerType, StructType, StringType

newDF = [StructField('CLICK_FLG', IntegerType(), True),
         StructField('OPEN_FLG', IntegerType(), True),
         StructField('I1_GNDR_CODE', StringType(), True),
         StructField('TRW_INCOME_CD_V4', StringType(), True),
         StructField('ASIAN_CD', IntegerType(), True),
         StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)]
finalStruct = StructType(fields=newDF)
df = spark.read.csv('ctor.csv', schema=finalStruct)
Output:
Before
root
|-- CLICK_FLG: string (nullable = true)
|-- OPEN_FLG: string (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)
After:
root
|-- CLICK_FLG: integer (nullable = true)
|-- OPEN_FLG: integer (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)
This is a slightly longer procedure for casting, but the advantage is that all the required fields can be handled at once.
Note that if only the required fields are assigned a data type in the schema, the resultant dataframe will contain only those fields.
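The root cause can be reproduced without Spark: a column whose raw CSV values include the literal text 'nan' cannot be parsed wholesale as integers, which is why schema inference falls back to string (a plain-Python sketch with made-up values):

```python
# Raw CSV values for one column: mostly integers, but one 'nan' entry.
values = ["3", "7", "nan", "0"]

def try_int(s):
    # Mirrors why type inference gives up: int() rejects 'nan'.
    try:
        int(s)
        return True
    except ValueError:
        return False

print([try_int(v) for v in values])  # [True, True, False, True]
```

Once a single value fails to parse, the whole column has to stay string, which is also why calling fillna(0) afterwards does not change the already-inferred schema.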
