work with result of pyspark.sql.SQLContext in pyspark [duplicate] - python

Let's say I have spark dataframe
+--------+-----+
| letter|count|
+--------+-----+
| a| 2|
| b| 2|
| c| 1|
+--------+-----+
Then I wanted to find mean. So, I did
df = df.groupBy().mean('letter')
which give a dataframe
+------------------+
| avg(letter)|
+------------------+
|1.6666666666666667|
+------------------+
how can I hash it to get only value 1.6666666666666667 like df["avg(letter)"][0] in Pandas dataframe? Or any workaround to get 1.6666666666666667
Note: I need a float returned. Not a list nor dataframe.
Thank you

Take first:
>>> df.groupBy().mean('letter').first()[0]

Related

How to do a cummsum in a lambda call using PySpark

I am trying to replicate a code in Python using PySpark, and I found myself in a problem. So this is the code I am trying to replicate:
df_act = (df_act.assign(n_cycles = (lambda x: (x.cycles_bol != x.cycles_bol.shift(1)).cumsum())))
Keep in mind that I am working with a dataframe, and that cycles_bol is a column of dataframe "df_act".
and I simply can't. The closest I think I have gotten to the solution is the following:
df_act=df_act.withColumn(
"grp",
when(df_act['cycles_bol'] == lead("cycles_bol").over(Window.partitionBy("user_id").orderBy("timestamp")),0).otherwise(1).over(Window.orderBy("timestamp"))
).drop("grp").show()
Can anyone please help me?
Thanks in advance!
You dindt give much information
You have to orderBy,use lag to check if cycles_bol consecutives are the same and conditionally add. Use an existing column to orderBy if it wont change the order of cycles_bol. If you dont have such a column, generate one using monotonically_increasing function like I did.
df_act.withColumn('id', monotonically_increasing_id()).withColumn('n_cycles',sum(when(lag('cycles_bol').over(Window.orderBy('id'))!=col('cycles_bol'),1).otherwise(0)).over(Window.orderBy('id'))).drop('id').show()
+----------+--------+
|cycles_bol|n_cycles|
+----------+--------+
| A| 0|
| B| 1|
| B| 1|
| A| 2|
| B| 3|
| A| 4|
| A| 4|
| B| 5|
| C| 6|
+----------+--------+

Is there a way to add a column with range of values to a Spark Dataframe?

I have a spark dataframe: df1 as below:
age = spark.createDataFrame(["10","11","13"], "string").toDF("age")
age.show()
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
I have a requirement of adding a row number column to the dataframe to make it:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
None of the columns in my dataframe contains unique values.
I tried to use F.monotonically_increasing_id()) but it is just producing random numbers in increasing order.
>>> age = spark.createDataFrame(["10","11","13"], "string").toDF("age").withColumn("rowId1", F.monotonically_increasing_id())
>>> age
DataFrame[age: string, rowId1: bigint]
>>> age.show
<bound method DataFrame.show of DataFrame[age: string, rowId1: bigint]>
>>> age.show()
+---+-----------+
|age| rowId1|
+---+-----------+
| 10|17179869184|
| 11|42949672960|
| 13|60129542144|
+---+-----------+
Since I don't have any column with unique data, I am worried about using windowing functions and generate row_numbers.
So, is there a way I can add a column with row_count to the dataframe that gives:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
If windowing function is the only way to implement, how can I make sure all the data comes under a single partition ?
or if there is a way to implement the same without using windowing functions, how to implement it ?
Any help is appreciated.
Use zipWithIndex.
I could not find code I did myself in the past yesterday as I was busy working on issues, but here is a good post that explains it. https://sqlandhadoop.com/pyspark-zipwithindex-example/
pyspark different to Scala.
Other answer not good for performance - going to single Executor. zipWithIndex is narrow transformation so it works per partition.
Here goes, you can tailor accordingly:
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType
from pyspark.sql.types import StringType, LongType
import pyspark.sql.functions as F
df1 = spark.createDataFrame([ ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4') ], StringType())
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1.show()
returns:
+-----+-----+
|value|index|
+-----+-----+
| abc| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| abc| 4|
| 2| 5|
| 3| 6|
| 4| 7|
| abc| 8|
| 2| 9|
| 3| 10|
| 4| 11|
+-----+-----+
Assumption: This answer is based on the assumption that the order of col_id should depend on the age column. If the assumption does not hold true the other suggested solution is the in the questions comments mentioned zipWithIndex. An example usage of zipWithIndex can be found in this answer.
Proposed solution:
You can use a window with an empty partitionBy and the the row number to get the expected numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy().orderBy(F.col('age').asc())
age = age.withColumn(
'col_id',
F.row_number().over(windowSpec)
)
[EDIT] Add assumption of requirements and reference to alternative solution.

how to select first n row items based on multiple conditions in pyspark

Now I have data like this:
+----+----+
|col1| d|
+----+----+
| A| 4|
| A| 10|
| A| 3|
| B| 3|
| B| 6|
| B| 4|
| B| 5.5|
| B| 13|
+----+----+
col1 is StringType, d is TimestampType, here I use DoubleType instead.
I want to generate data based on conditions tuples.
Given a tuple[(A,3.5),(A,8),(B,3.5),(B,10)]
I want to have the result like
+----+---+
|col1| d|
+----+---+
| A| 4|
| A| 10|
| B| 4|
| B| 13|
+----+---+
That is for each element in the tuple, we select from the pyspark dataframe the first 1 row that d is larger than the tuple number and col1 is equal to the tuple string.
What I've already written is:
df_res=spark_empty_dataframe
for (x,y) in tuples:
dft=df.filter(df.col1==x).filter(df.d>y).limit(1)
df_res=df_res.union(dft)
But I think this might have efficiency problem, I do not know if I were right.
A possible approach avoiding loops can be creating a dataframe from the tuple you have as input:
t = [('A',3.5),('A',8),('B',3.5),('B',10)]
ref=spark.createDataFrame([(i[0],float(i[1])) for i in t],("col1_y","d_y"))
Then we can join on the input dataframe(df) on condition and then group on the keys and values of tuple which will be repeated to get the first value on each group, then drop the extra columns:
(df.join(ref,(df.col1==ref.col1_y)&(df.d>ref.d_y),how='inner').orderBy("col1","d")
.groupBy("col1_y","d_y").agg(F.first("col1").alias("col1"),F.first("d").alias("d"))
.drop("col1_y","d_y")).show()
+----+----+
|col1| d|
+----+----+
| A|10.0|
| A| 4.0|
| B| 4.0|
| B|13.0|
+----+----+
Note, if order of the dataframe is important , you can assign an index column with monotonically_increasing_id and include them in the aggregation then orderBy the index column.
EDIT another way instead of ordering and get first directly with min:
(df.join(ref,(df.col1==ref.col1_y)&(df.d>ref.d_y),how='inner')
.groupBy("col1_y","d_y").agg(F.min("col1").alias("col1"),F.min("d").alias("d"))
.drop("col1_y","d_y")).show()
+----+----+
|col1| d|
+----+----+
| B| 4.0|
| B|13.0|
| A| 4.0|
| A|10.0|
+----+----+

Looking to convert String Column to Integer Column in PySpark. What happens to strings that can't be converted?

I'm trying to convert a column in a dataframe to IntegerType. Here is an example of the dataframe:
+----+-------+
|From| To|
+----+-------+
| 1|1664968|
| 2| 3|
| 2| 747213|
| 2|1664968|
| 2|1691047|
| 2|4095634|
+----+-------+
I'm using the following code:
exploded_df = exploded_df.withColumn('From', exploded_df['To'].cast(IntegerType()))
However, I wanted to know what happens to strings that are not digits, for example, what happens if I have a string with several spaces? The reason is that I want to filter the dataframe in order to get the values of the column 'From' that don't have numbers in column 'To'.
Is there a simpler way to filter by this condition without converting the columns to IntegerType?
Thank you!
Values which cannot be cast are set to null, and the column will be considered a nullable column of that type. Here's a simple example:
from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
df = sql_context.createDataFrame([("1",),
("2",),
("3",),
("4",),
("hello world",)], schema=['id'])
print(df.show())
df = df.withColumn("id", F.col("id").astype(IntegerType()))
print(df.show())
Output:
+-----------+
| id|
+-----------+
| 1|
| 2|
| 3|
| 4|
|hello world|
+-----------+
+----+
| id|
+----+
| 1|
| 2|
| 3|
| 4|
|null|
+----+
And to verify the schema is correct:
print(df.printSchema())
Output:
None
root
|-- id: integer (nullable = true)
Hope this helps!
We can use regex to check does To column have some alphabets,spaces in the data, Using .rlike funtion in spark to filter out the matching rows.
Example:
df=spark.createDataFrame([("1","1664968"),("2","3"),("2","742a7"),("2"," "),("2","a")],["From","To"])
df.show()
#+----+-------+
#|From| To|
#+----+-------+
#| 1|1664968|
#| 2| 3|
#| 2| 742a7|
#| 2| |
#| 2| a|
#+----+-------+
#get the rows which have space or word in them
df.filter(col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-----+
#|From|To |
#+----+-----+
#|2 |742a7|
#|2 | |
#|2 |a |
#+----+-----+
#to get rows which doesn't have any space or word in them.
df.filter(~col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-------+
#|From|To |
#+----+-------+
#|1 |1664968|
#|2 |3 |
#+----+-------+

PySpark: Add a new column with a tuple created from columns

Here I have a dateframe created as follow,
df = spark.createDataFrame([('a',5,'R','X'),('b',7,'G','S'),('c',8,'G','S')],
["Id","V1","V2","V3"])
It looks like
+---+---+---+---+
| Id| V1| V2| V3|
+---+---+---+---+
| a| 5| R| X|
| b| 7| G| S|
| c| 8| G| S|
+---+---+---+---+
I'm looking to add a column that is a tuple consisting of V1,V2,V3.
The result should look like
+---+---+---+---+-------+
| Id| V1| V2| V3|V_tuple|
+---+---+---+---+-------+
| a| 5| R| X|(5,R,X)|
| b| 7| G| S|(7,G,S)|
| c| 8| G| S|(8,G,S)|
+---+---+---+---+-------+
I've tried to use similar syntex as in Python but it didn't work:
df.withColumn("V_tuple",list(zip(df.V1,df.V2,df.V3)))
TypeError: zip argument #1 must support iteration.
Any help would be appreciated!
I'm coming from scala but I do believe that there's a similar way in python :
Using sql.functions package mehtod :
If you want to get a StructType with this three column use the struct(cols: Column*): Column method like this :
from pyspark.sql.functions import struct
df.withColumn("V_tuple",struct(df.V1,df.V2,df.V3))
but if you want to get it as a String you can use the concat(exprs: Column*): Column method like this :
from pyspark.sql.functions import concat
df.withColumn("V_tuple",concat(df.V1,df.V2,df.V3))
With this second method you may have to cast the columns into Strings
I'm not sure about the python syntax, Just edit the answer if there's a syntax error.
Hope this help you. Best Regards
Use struct:
from pyspark.sql.functions import struct
df.withColumn("V_tuple", struct(df.V1,df.V2,df.V3))

Categories