This is the structure of my dataframe, as given by df.columns:
['LastName',
'FirstName',
'Stud. ID',
'10 Relations',
'Related to Politics',
'3NF',
'Documentation & Scripts',
'SQL',
'Data (CSV, etc.)',
'20 Relations',
'Google News',
'Cheated',
'Sum',
'Delay Factor',
'Grade (out of 2)']
I have transformed this dataframe in PySpark using
assembler = VectorAssembler(inputCols=['10 Relations',
                                       'Related to Politics',
                                       '3NF'], outputCol='features')
and output = assembler.transform(df). The result now contains Row objects with the following schema (this is what I get when I run output.printSchema()):
root
|-- LastName: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- Stud. ID: integer (nullable = true)
|-- 10 Relations: integer (nullable = true)
|-- Related to Politics: integer (nullable = true)
|-- 3NF: integer (nullable = true)
|-- Documentation & Scripts: integer (nullable = true)
|-- SQL: integer (nullable = true)
|-- Data (CSV, etc.): integer (nullable = true)
|-- 20 Relations: integer (nullable = true)
|-- Google News: integer (nullable = true)
|-- Cheated: integer (nullable = true)
|-- Sum: integer (nullable = true)
|-- Delay Factor: double (nullable = true)
|-- Grade (out of 2): double (nullable = true)
|-- features: vector (nullable = true)
For each row, the assembler decides whether to make the features vector sparse or dense (for memory reasons). This is a big problem for me, because I want to use the transformed data to build a linear regression model, so I'm looking for a way to make VectorAssembler always produce a dense vector.
Any idea?
Note: I have read this post, but the problem is that since the Row class is a subclass of tuple, I cannot change a Row object after it is created.
SparseVector and DenseVector both inherit from pyspark.ml.linalg.Vector, so both vector types share the .toArray() method. You can convert either one into a NumPy array and then into a dense vector with a simple UDF.
from pyspark.ml.linalg import DenseVector, SparseVector, Vectors, VectorUDT
from pyspark.sql import functions as F
from pyspark.sql.types import *
v = Vectors.dense([1, 3])  # dense vector
u = SparseVector(2, {})    # sparse vector
# toDense converts either vector type into a DenseVector
toDense = lambda v: Vectors.dense(v.toArray())
toDense(u), toDense(v)
Results:
DenseVector([0.0, 0.0]), DenseVector([1.0, 3.0])
Then you can create a UDF from this function.
df = sqlContext.createDataFrame([
    (v,),
    (u,)
], ['feature'])

toDense = lambda v: Vectors.dense(v.toArray())
toDenseUdf = F.udf(toDense, VectorUDT())
df.withColumn('feature', toDenseUdf('feature')).show()
Results:
+---------+
| feature|
+---------+
|[1.0,3.0]|
|[0.0,0.0]|
+---------+
Now you have a single vector type in the column.
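For the original question, a minimal sketch applying the same UDF to the assembled features column (variable names taken from the question, assuming toDenseUdf is defined as above):
output = output.withColumn('features', toDenseUdf('features'))
output.printSchema()  # 'features' is still a vector column, but every row now holds a DenseVector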
I have a dataframe with the schema below:
root
|-- array_column: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
|-- label_info: array (nullable = true)
| |-- element: string (containsNull = true)
|-- extras: string (nullable = true)
How can I find out programmatically whether my schema has a column that is an array of strings or an array of structs? The above is just a sample schema; in practice I will have a dynamic schema.
So far I could do something like this:
if isinstance(df.schema["array_column"].dataType, ArrayType):
But this only tells me that the column is of ArrayType.
When your column is an array column, you can access the data type of its elements with elementType. Then you can check the type of those elements like this:
if isinstance(df.schema["array_column"].dataType.elementType, StringType):
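Since the schema is dynamic, here is a small sketch (variable name df assumed) that walks every field and classifies the array columns by their element type:
from pyspark.sql.types import ArrayType, StructType, StringType

for field in df.schema.fields:
    if isinstance(field.dataType, ArrayType):
        element_type = field.dataType.elementType
        if isinstance(element_type, StringType):
            print(field.name, "is an array of strings")
        elif isinstance(element_type, StructType):
            print(field.name, "is an array of structs")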
I have a pyspark dataframe that is the output of machine learning predictions like this:
predictions = model.transform(test_data)
+-----------------+-----------------+-----+------------------+-------+--------------------+--------------------+----------+
|col1_imputed |col2_imputed |label| features|row_num| rawPrediction| probability|prediction|
+-----------------+-----------------+-----+------------------+-------+--------------------+--------------------+----------+
| -0.002353| 0.9762| 0|[-0.002353,0.9762]| 1|[-0.8726465863653...|[0.29470390100153...| 1.0|
| -0.08637| 0.06524| 0|[-0.08637,0.06524]| 3|[-0.6029409441836...|[0.35367114067727...|
root
|-- col1_imputed: double (nullable = true)
|-- col2_imputed: double (nullable = true)
|-- label: integer (nullable = true)
|-- features: vector (nullable = true)
|-- row_num: integer (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)
I convert the probability column to select only the positive-class probability from each vector, but I want to append this new column to the dataframe above (or replace the current probability column with this new one containing only the positive probabilities), and I'm getting errors when trying this.
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
secondelement = udf(lambda v: float(v[1]), FloatType())
pos_prob = predictions.select(secondelement('probability'))  # selects the second element of the probability vector
# trying to add the new pos_prob column, named 'prob', to the dataframe:
df = predictions.withColumn('prob', predictions.select(secondelement('probability'))).collect()
AssertionError: col should be Column
I have also tried wrapping lit() around it after reading similar questions, but this gives another error:
df = predictions.withColumn('prob', lit(predictions.select(secondelement('probability')))).collect()
AttributeError: 'DataFrame' object has no attribute '_get_object_id'
withColumn expects a Column as its second argument, whereas select(...) returns a DataFrame, which is what causes both errors above. Pass the UDF result directly to withColumn, e.g.
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
secondelement = udf(lambda v: float(v[1]), FloatType())
df = predictions.withColumn('prob', secondelement('probability'))
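If you are on Spark 3.0 or later, an alternative sketch (not part of the original answer) avoids the Python UDF entirely by converting the vector into an array column and indexing it:
from pyspark.ml.functions import vector_to_array

df = predictions.withColumn('prob', vector_to_array('probability')[1])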
I'm writing code to perform step detection on streaming sensor data. To do so, I separate all incoming values into sensor parameters and then detect the steps with a UDF. Inside this UDF I create a dict that stores two lists of important timestamps per detected step.
@udf(StructType())
def detect_steps(timestamp, x, y, z):
    .....
    d = dict()
    d['timestamp_ic'] = times_ic
    d['timestamp_to'] = timestamp_to
    ......
    return d
In the Spark main function I created a dataframe that calculates all these steps in a sliding window like so:
stepData = LLLData \
.withWatermark("time", "10 seconds") \
.groupBy(
window("time", windowDuration="5 seconds",slideDuration="1 second"),
"sensor"
) \
.agg(
collect_list("time").alias("time_window"),
collect_list(sensorData.Acceleration_x).alias("Acceleration_x_window"),
collect_list(sensorData.Acceleration_y).alias("Acceleration_y_window"),
collect_list(sensorData.Acceleration_z).alias("Acceleration_z_window"),
) \
.select(
"window",
"sensor",
detect_steps("time_window", "Acceleration_x_window", "Acceleration_y_window", "Acceleration_z_window")
)
Now, when I print the df schema it looks like this:
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- sensor: string (nullable = true)
|-- detect_steps("time_window", "Acceleration_x_window", "Acceleration_y_window", "Acceleration_z_window"): string (nullable = true)
While I want this:
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- sensor: string (nullable = true)
|-- timestamp_ic: string (nullable = true)
|-- timestamp_to: string (nullable = true)
However, I cannot perform a select statement on the UDF column in stepData; it raises the error Column is not iterable.
When I try to alter the root dataframe afterwards, for example by passing the ic column into a new Spark dataframe like so:
df_stepData = spark.createDataFrame(data=stepData.select("ic"))
it gives me TypeError: data is already a DataFrame.
Looking at the dataframe schema, however, ic is typed as string.
I've also tried to read ic as a JSON file, but that gives the following error:
TypeError: path can be only string, list or RDD
I could work around the problem by calling the detect_steps UDF twice, once returning timestamp_ic and once returning timestamp_to, to get the two columns, but I'm sure there is a better, more efficient way.
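One possible direction, offered here only as a hedged sketch rather than a tested answer: declare an explicit StructType return schema for the UDF so Spark treats the dict keys as struct fields, alias the resulting struct column, and then flatten it with an ordinary select. The field types below (arrays of strings) are an assumption based on the lists described in the question.
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

step_schema = StructType([
    StructField("timestamp_ic", ArrayType(StringType()), True),
    StructField("timestamp_to", ArrayType(StringType()), True),
])

@udf(step_schema)
def detect_steps(timestamp, x, y, z):
    # ... step detection logic elided, as in the question ...
    return {"timestamp_ic": times_ic, "timestamp_to": timestamp_to}

Then alias the struct in the aggregation select and expand it afterwards:
.select(
    "window",
    "sensor",
    detect_steps("time_window", "Acceleration_x_window",
                 "Acceleration_y_window", "Acceleration_z_window").alias("steps")
)
stepData.select("window", "sensor", "steps.timestamp_ic", "steps.timestamp_to")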
I have a Dataframe with the following schema:
root
|-- id: long (nullable = true)
|-- ... (other columns)
|-- my_array_col: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- col_a: string (nullable = true)
| | |-- col_b: date (nullable = true)
How can I change the type of col_b to a StringType?
You can cast arbitrarily deep array and struct columns with cast by spelling out the full target type as a DDL string, like so:
from pyspark.sql.functions import col

df = df.withColumn(
    "my_array_col",
    col("my_array_col").cast("array<struct<col_a: string, col_b: string>>")
)
I have a dataframe in PySpark. Some of its numerical columns contain nan, so when I read the data and check the schema of the dataframe, those columns have string type.
How can I change them to int type? I replaced the nan values with 0 and checked the schema again, but it still shows string type for those columns. I am using the code below:
data_df = sqlContext.read.format("csv").load('data.csv',header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()
In my data, the columns Plays and drafts contain integer values, but because of the nan values present in these columns they are treated as string type.
from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))
You can run a loop over the columns, but this is the simplest way to convert a string column into an integer column.
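For reference, a short sketch of the loop variant mentioned above (the column list is an assumption; adjust it to the columns that need casting):
from pyspark.sql.types import IntegerType

for c in ["Plays", "drafts"]:
    data_df = data_df.withColumn(c, data_df[c].cast(IntegerType()))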
You could use cast('int') after replacing nan with 0, e.g.:
data_df = data_df.withColumn("Plays", data_df["Plays"].cast('int'))
Another way to do it is by using StructField, if you have multiple fields that need to be modified.
Ex:
from pyspark.sql.types import StructField,IntegerType, StructType,StringType
newDF=[StructField('CLICK_FLG',IntegerType(),True),
StructField('OPEN_FLG',IntegerType(),True),
StructField('I1_GNDR_CODE',StringType(),True),
StructField('TRW_INCOME_CD_V4',StringType(),True),
StructField('ASIAN_CD',IntegerType(),True),
StructField('I1_INDIV_HHLD_STATUS_CODE',IntegerType(),True)
]
finalStruct=StructType(fields=newDF)
df=spark.read.csv('ctor.csv',schema=finalStruct)
Output:
Before
root
|-- CLICK_FLG: string (nullable = true)
|-- OPEN_FLG: string (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)
After:
root
|-- CLICK_FLG: integer (nullable = true)
|-- OPEN_FLG: integer (nullable = true)
|-- I1_GNDR_CODE: string (nullable = true)
|-- TRW_INCOME_CD_V4: string (nullable = true)
|-- ASIAN_CD: integer (nullable = true)
|-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)
This is a slightly longer procedure for casting, but the advantage is that all the required fields can be handled at once.
Note that if only the required fields are assigned a data type in the schema, the resulting dataframe will contain only those fields.