I currently have a PySpark dataframe that has many columns populated by integer counts. Many of these columns have counts of zero. I would like to find a way to sum how many columns have counts greater than zero.
In other words, I would like an approach that sums values across a row, where all the columns for a given row are effectively boolean (although the datatype conversion may not be necessary). Several columns in my table are datetime or string, so ideally I would have an approach that first selects the numeric columns.
Current Dataframe example and Desired Output
+----+--------+----------+----------+---+----------------------------+
|USER|    DATE|COUNT_COL1|COUNT_COL2|...|DESIRED COLUMN              |
+----+--------+----------+----------+---+----------------------------+
|   b|7/1/2019|        12|         1|   | 2 (2 columns are non-zero) |
|   a|6/9/2019|         0|         5|   | 1                          |
|   c|1/1/2019|         0|         0|   | 0                          |
+----+--------+----------+----------+---+----------------------------+
Pandas: As an example, in pandas this can be accomplished by selecting the numeric columns, converting them to bool and summing with axis=1. I am looking for a PySpark equivalent.
import numpy as np
test_cols = list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)
For numeric columns, you can do it by creating an array of all the integer-typed columns (using df.dtypes), and then using higher-order functions. In this case I used filter to get rid of all the 0s, and then size to get the number of non-zero elements per row (Spark 2.4+).
from pyspark.sql import functions as F
df.withColumn("arr", F.array(*[F.col(i[0]) for i in df.dtypes if i[1] in ['int','bigint']]))\
.withColumn("DESIRED COLUMN", F.expr("""size(filter(arr,x->x!=0))""")).drop("arr").show()
#+----+--------+----------+----------+--------------+
#|USER| DATE|COUNT_COL1|COUNT_COL2|DESIRED COLUMN|
#+----+--------+----------+----------+--------------+
#| b|7/1/2019| 12| 1| 2|
#| a|6/9/2019| 0| 5| 1|
#| c|1/1/2019| 0| 0| 0|
#+----+--------+----------+----------+--------------+
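For completeness, here is a sketch of my own (not part of the original answer) that avoids higher-order functions and therefore also works before Spark 2.4: build the same list of integer-typed columns, turn each non-zero check into a 0/1 flag, and add the flags up.
from functools import reduce
from pyspark.sql import functions as F

num_cols = [c for c, t in df.dtypes if t in ('int', 'bigint')]
flags = [(F.col(c) != 0).cast('int') for c in num_cols]   # 1 if the count is non-zero, else 0
df.withColumn("DESIRED COLUMN", reduce(lambda a, b: a + b, flags)).show()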
Let's say you have the df below:
df.show()
df.printSchema()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| a| 1| 2| 3|
| a| 0| 2| 1|
| a| 0| 0| 1|
| a| 0| 0| 0|
+---+---+---+---+
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
Using a case-when expression, you can check whether each column value is numeric and, if it is, whether it is larger than 0. In the next step f.size returns the count, thanks to f.array_remove which keeps only the entries with a True value.
from pyspark.sql import functions as f
# non-numeric values -> False; numeric values -> True only when the value is > 0
cols = [f.when(f.length(f.regexp_replace(f.col(x), '\\d+', '')) > 0, False)
         .otherwise(f.col(x).cast('int') > 0) for x in df.columns]
df.select("*", f.size(f.array_remove(f.array(*cols), False)).alias("count")).show()
+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|count|
+---+---+---+---+-----+
| a| 1| 2| 3| 3|
| a| 0| 2| 1| 2|
| a| 0| 0| 1| 1|
| a| 0| 0| 0| 0|
+---+---+---+---+-----+
I have two dataframes: one is the main dataframe and the other is a lookup dataframe. I need to produce the third one, in a customized form, using PySpark. I need to check the values in the column list_IDs against the lookup dataframe and mark the count in the final dataframe. I have tried array_intersect and array lookup but it is not working.
Main dataframe:
df = spark.createDataFrame([(123, [75319, 75317]), (212, [136438, 25274]), (215, [136438, 75317])], ("ID", "list_IDs"))
df.show()
+---+---------------+
| ID| list_IDs|
+---+---------------+
|123| [75319, 75317]|
|212|[136438, 25274]|
|215|[136438, 75317]|
+---+---------------+
Lookup Dataframe:
df_2 = spark.createDataFrame([(75319, "Wheat", 20), (75317, "Rice", 10), (136438, "Jowar", 30), (25274, "Rajma", 40)], ("ID", "Material", "Count"))
df_2.show()
+------+--------+-----+
|    ID|Material|Count|
+------+--------+-----+
| 75319|   Wheat|   20|
| 75317|    Rice|   10|
|136438|   Jowar|   30|
| 25274|   Rajma|   40|
+------+--------+-----+
Need the resultant dataframe as:
+---+---------------+-----+-----+-----+-----+
| ID|       list_IDs|Wheat| Rice|Jowar|Rajma|
+---+---------------+-----+-----+-----+-----+
|123| [75319, 75317]|   20|   10|    0|    0|
|212|[136438, 25274]|    0|    0|   30|   40|
|215|[136438, 75317]|    0|   10|   30|    0|
+---+---------------+-----+-----+-----+-----+
You can join the two dataframes and then pivot:
import pyspark.sql.functions as F
result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material').agg(F.first('Count')).fillna(0)
result.show()
+---+---------------+-----+-----+----+-----+
| ID| list_IDs|Jowar|Rajma|Rice|Wheat|
+---+---------------+-----+-----+----+-----+
|212|[136438, 25274]| 30| 40| 0| 0|
|215|[136438, 75317]| 30| 0| 10| 0|
|123| [75319, 75317]| 0| 0| 10| 20|
+---+---------------+-----+-----+----+-----+
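An alternative sketch of my own (not from the answer): explode list_IDs first so the join becomes a plain equality; item_ID is just a column name introduced here.
import pyspark.sql.functions as F

(df.withColumn("item_ID", F.explode("list_IDs"))   # one row per element of list_IDs
   .join(df_2, F.col("item_ID") == df_2.ID)        # equality join on the exploded id
   .groupBy(df.ID, "list_IDs")
   .pivot("Material")
   .agg(F.first("Count"))
   .fillna(0)
   .show())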
I'm trying to uniquely label consecutive rows with equal values in a PySpark dataframe. In Pandas, one could do this quite simply with:
s = pd.Series([1,1,1,2,2,1,1,3])
s.ne(s.shift()).cumsum()
0 1
1 1
2 1
3 2
4 2
5 3
6 3
7 4
dtype: int64
How could this be done in PySpark? Setup:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
mySchema = StructType([StructField("col1", IntegerType(), True)])
df_sp = spark.createDataFrame(s.to_frame(), schema=mySchema)
I've found slightly related questions such as this one, but none of them is about this same scenario.
I'm thinking a good starting point could be to find the first differences, as in this answer.
I've come up with a solution. The idea is similar to what is done in Pandas. We start by adding a unique identifier column, over which we compute the lagged column (using over here is necessary since it is a window function).
We then compare the column of interest with the lagged column and take the cumulative sum of the result cast to int:
from pyspark.sql import functions as f, Window

win = Window.orderBy("id")
df_sp = (df_sp.withColumn("id", f.monotonically_increasing_id())
              .withColumn("col1_shift", f.lag("col1", offset=1, default=0).over(win))
              .withColumn("col1_shift_ne", (f.col("col1") != f.col("col1_shift")).cast("int"))
              .withColumn("col1_shift_ne_cumsum", f.sum("col1_shift_ne").over(win))
              .drop(*['id', 'col1_shift', 'col1_shift_ne']))
df_sp.show()
+----+--------------------+
|col1|col1_shift_ne_cumsum|
+----+--------------------+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 2|
| 2| 2|
| 1| 3|
| 1| 3|
| 3| 4|
+----+--------------------+
Another way of solving this would be to use rangeBetween with an unbounded-preceding sum after comparing against the lag:
from pyspark.sql import functions as F, Window as W
w1 = W.orderBy(F.monotonically_increasing_id())
w2 = W.orderBy(F.monotonically_increasing_id()).rangeBetween(W.unboundedPreceding,0)
cond = F.col("col1") != F.lag("col1").over(w1)
df_sp.withColumn("col1_shift_ne_cumsum",F.sum(F.when(cond,1).otherwise(0)).over(w2)+1).show()
+----+--------------------+
|col1|col1_shift_ne_cumsum|
+----+--------------------+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 2|
| 2| 2|
| 1| 3|
| 1| 3|
| 3| 4|
+----+--------------------+
Now I have data like this:
+----+----+
|col1| d|
+----+----+
| A| 4|
| A| 10|
| A| 3|
| B| 3|
| B| 6|
| B| 4|
| B| 5.5|
| B| 13|
+----+----+
col1 is StringType and d is TimestampType; here I use DoubleType instead.
I want to generate data based on condition tuples.
Given the tuples [(A, 3.5), (A, 8), (B, 3.5), (B, 10)],
I want to have the result like
+----+---+
|col1| d|
+----+---+
| A| 4|
| A| 10|
| B| 4|
| B| 13|
+----+---+
That is, for each tuple we select from the PySpark dataframe the first row where d is larger than the tuple's number and col1 equals the tuple's string.
What I've already written is:
df_res = spark_empty_dataframe
for (x, y) in tuples:
    dft = df.filter(df.col1 == x).filter(df.d > y).limit(1)
    df_res = df_res.union(dft)
But I think this might have efficiency problems; I do not know if I am right.
A possible approach that avoids loops is to create a dataframe from the tuples you have as input:
t = [('A',3.5),('A',8),('B',3.5),('B',10)]
ref=spark.createDataFrame([(i[0],float(i[1])) for i in t],("col1_y","d_y"))
Then we can join with the input dataframe (df) on the condition, group on the tuple keys and values (which will be repeated), take the first value in each group, and drop the extra columns:
from pyspark.sql import functions as F

(df.join(ref, (df.col1 == ref.col1_y) & (df.d > ref.d_y), how='inner').orderBy("col1", "d")
   .groupBy("col1_y", "d_y").agg(F.first("col1").alias("col1"), F.first("d").alias("d"))
   .drop("col1_y", "d_y")).show()
+----+----+
|col1| d|
+----+----+
| A|10.0|
| A| 4.0|
| B| 4.0|
| B|13.0|
+----+----+
Note: if the order of the dataframe is important, you can assign an index column with monotonically_increasing_id, include it in the aggregation, and then orderBy the index column, as sketched below.
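A minimal sketch of that note (the idx column name is one I am introducing here; df and ref as defined above):
from pyspark.sql import functions as F

df_idx = df.withColumn("idx", F.monotonically_increasing_id())   # remember the original row order
(df_idx.join(ref, (df_idx.col1 == ref.col1_y) & (df_idx.d > ref.d_y), how='inner')
       .groupBy("col1_y", "d_y")
       .agg(F.min("idx").alias("idx"), F.min("col1").alias("col1"), F.min("d").alias("d"))
       .orderBy("idx")
       .drop("col1_y", "d_y", "idx")).show()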
EDIT: another way, instead of ordering and taking the first value, is to get the minimum directly with min:
(df.join(ref,(df.col1==ref.col1_y)&(df.d>ref.d_y),how='inner')
.groupBy("col1_y","d_y").agg(F.min("col1").alias("col1"),F.min("d").alias("d"))
.drop("col1_y","d_y")).show()
+----+----+
|col1| d|
+----+----+
| B| 4.0|
| B|13.0|
| A| 4.0|
| A|10.0|
+----+----+
I'm trying to convert a column in a dataframe to IntegerType. Here is an example of the dataframe:
+----+-------+
|From| To|
+----+-------+
| 1|1664968|
| 2| 3|
| 2| 747213|
| 2|1664968|
| 2|1691047|
| 2|4095634|
+----+-------+
I'm using the following code:
exploded_df = exploded_df.withColumn('From', exploded_df['To'].cast(IntegerType()))
However, I wanted to know what happens to strings that are not digits, for example, what happens if I have a string with several spaces? The reason is that I want to filter the dataframe in order to get the values of the column 'From' that don't have numbers in column 'To'.
Is there a simpler way to filter by this condition without converting the columns to IntegerType?
Thank you!
Values which cannot be cast are set to null, and the column will be considered a nullable column of that type. Here's a simple example:
from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
df = sql_context.createDataFrame([("1",),
                                  ("2",),
                                  ("3",),
                                  ("4",),
                                  ("hello world",)], schema=['id'])
df.show()

df = df.withColumn("id", F.col("id").astype(IntegerType()))
df.show()
Output:
+-----------+
| id|
+-----------+
| 1|
| 2|
| 3|
| 4|
|hello world|
+-----------+
+----+
| id|
+----+
| 1|
| 2|
| 3|
| 4|
|null|
+----+
And to verify the schema is correct:
df.printSchema()
Output:
root
 |-- id: integer (nullable = true)
Hope this helps!
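Building on that behaviour, here is a hedged sketch of the filter the question actually asks for, reusing the question's exploded_df / From / To names: cast 'To' and keep the rows where the cast comes back null.
from pyspark.sql import functions as F

# non-numeric 'To' values become null after the cast, so isNull() selects exactly those rows
non_numeric = exploded_df.filter(F.col('To').cast('int').isNull())
non_numeric.select('From').show()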
We can use a regex to check whether the To column has any alphabetic characters or spaces in the data, using the .rlike function in Spark to filter out the matching rows.
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([("1","1664968"),("2","3"),("2","742a7"),("2"," "),("2","a")], ["From","To"])
df.show()
#+----+-------+
#|From| To|
#+----+-------+
#| 1|1664968|
#| 2| 3|
#| 2| 742a7|
#| 2| |
#| 2| a|
#+----+-------+
#get the rows which have space or word in them
df.filter(col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-----+
#|From|To |
#+----+-----+
#|2 |742a7|
#|2 | |
#|2 |a |
#+----+-----+
#to get rows which don't have any space or word in them.
df.filter(~col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-------+
#|From|To |
#+----+-------+
#|1 |1664968|
#|2 |3 |
#+----+-------+
I'm working in PySpark to deal with big CSV files larger than 50 GB.
Now I need to find the number of distinct values between two references to the same value.
For example,
input dataframe:
+----+
|col1|
+----+
| a|
| b|
| c|
| c|
| a|
| b|
| a|
+----+
output dataframe:
+----+-----+
|col1|col2 |
+----+-----+
| a| null|
| b| null|
| c| null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+-----+
I've been struggling with this for the past week. I tried window functions and many other things in Spark, but couldn't get anywhere. It would be a great help if someone knows how to fix this. Thank you.
Comment if you need any clarification on the question.
I am providing a solution, with some assumptions.
It assumes the previous reference can be found within at most the previous 'n' rows; if 'n' is reasonably small, I think this is a good solution.
Here I assumed you can find the previous reference within the previous 5 rows.
from pyspark.sql import Window
from pyspark.sql.functions import udf, col, lag, array, monotonically_increasing_id
from pyspark.sql.types import IntegerType

def get_distincts(lagged_values, current_value):
    # count the distinct values seen before the previous occurrence of current_value
    cnt = {}
    flag = False
    for i in lagged_values:
        if current_value == i:
            flag = True
            break
        else:
            cnt[i] = "some_value"
    if flag:
        return len(cnt)
    else:
        return None

get_distincts_udf = udf(get_distincts, IntegerType())

df = spark.createDataFrame([["a"], ["b"], ["c"], ["c"], ["a"], ["b"], ["a"]]).toDF("col1")

# You can replace this if you already have a unique id column
df = df.withColumn("seq_id", monotonically_increasing_id())
window = Window.orderBy("seq_id")
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, 6)]))
df = df.withColumn("col2", get_distincts_udf(col('list'), col('col1'))).drop('seq_id', 'list')
df.show()
which results in:
+----+----+
|col1|col2|
+----+----+
| a|null|
| b|null|
| c|null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+----+
You can try the following approach:
add a monotonically_increasing_id column id to keep track of the order of rows
find prev_id for each col1 and save the result to a new df
for the new DF (alias 'd1'), make a LEFT JOIN to the DF itself (alias 'd2') with the condition (d2.id > d1.prev_id) & (d2.id < d1.id)
then groupBy('d1.col1', 'd1.id') and aggregate on countDistinct('d2.col1')
The code based on the above logic and your sample data is shown below:
from pyspark.sql import functions as F, Window
df1 = spark.createDataFrame([ (i,) for i in list("abccaba")], ["col1"])
# create a WinSpec partitioned by col1 so that we can find the prev_id
win = Window.partitionBy('col1').orderBy('id')
# set up id and prev_id
df11 = df1.withColumn('id', F.monotonically_increasing_id())\
          .withColumn('prev_id', F.lag('id').over(win))
# check the newly added columns
df11.sort('id').show()
# +----+---+-------+
# |col1| id|prev_id|
# +----+---+-------+
# | a| 0| null|
# | b| 1| null|
# | c| 2| null|
# | c| 3| 2|
# | a| 4| 0|
# | b| 5| 1|
# | a| 6| 4|
# +----+---+-------+
# let's cache the new dataframe
df11.persist()
# do a self-join on id and prev_id and then do the aggregation
df12 = df11.alias('d1') \
    .join(df11.alias('d2'),
          (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')), how='left') \
    .select('d1.col1', 'd1.id', F.col('d2.col1').alias('ids')) \
    .groupBy('col1', 'id') \
    .agg(F.countDistinct('ids').alias('distinct_values'))
# display the result
df12.sort('id').show()
# +----+---+---------------+
# |col1| id|distinct_values|
# +----+---+---------------+
# | a| 0| 0|
# | b| 1| 0|
# | c| 2| 0|
# | c| 3| 0|
# | a| 4| 2|
# | b| 5| 2|
# | a| 6| 1|
# +----+---+---------------+
# release the cached df11
df11.unpersist()
Note that you will need to keep this id column to sort the rows; otherwise the resulting rows will be completely scrambled each time you collect them.
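If you also want null rather than 0 for the very first occurrence of each value (as in the question's expected output), one hedged tweak is to join prev_id back from df11 and blank out those rows:
df13 = (df12.join(df11.select('id', 'prev_id'), on='id')
            .withColumn('distinct_values',
                        F.when(F.col('prev_id').isNull(), F.lit(None))   # first occurrence -> null
                         .otherwise(F.col('distinct_values')))
            .drop('prev_id'))
df13.sort('id').show()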
# Plain-Python reference logic for reuse distance and stack distance.
# "block" is each value read from the CSV column being analysed.
reuse_distance = []   # rows of [block, stack_dist, reuse_dist]
block_dict = {}       # block -> access counter at its previous reference
stack_list = []       # distinct blocks, ordered by most recent access
counter_reuse = 0     # total number of accesses so far
counter_stack = 0     # number of distinct blocks seen so far

for block in blocks:  # blocks: iterable of the values read from the CSV
    stack_dist = -1   # distinct values since the previous reference (-1 if none)
    reuse_dist = -1   # total accesses since the previous reference (-1 if none)
    if block in block_dict:
        reuse_dist = counter_reuse - block_dict[block] - 1
        block_dict[block] = counter_reuse
        counter_reuse += 1
        stack_dist_ind = stack_list.index(block)
        stack_dist = counter_stack - stack_dist_ind - 1
        del stack_list[stack_dist_ind]
        stack_list.append(block)
    else:
        block_dict[block] = counter_reuse
        counter_reuse += 1
        counter_stack += 1
        stack_list.append(block)
    reuse_distance.append([block, stack_dist, reuse_dist])
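As a quick sanity check (my own trace, not part of the original snippet), feeding the question's sample column through the loop reproduces the desired col2 in the stack-distance slot:
blocks = list("abccaba")   # the sample col1 values from the question
# After the loop above, reuse_distance holds:
# [['a', -1, -1], ['b', -1, -1], ['c', -1, -1],
#  ['c', 0, 0], ['a', 2, 3], ['b', 2, 3], ['a', 1, 1]]
# The second entry of each row (stack distance) matches the desired col2,
# with -1 where the expected output shows null.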