+---+---------+-------------------------+
| id|    texts|                   vector|
+---+---------+-------------------------+
|  0|[a, b, c]|(3,[0,1,2],[1.0,1.0,1.0])|
|  1|[a, b, c]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------+-------------------------+
This is my Spark dataframe above; I want to convert it to something like below:
+---+-----+------+
| id|texts|list_2|
+---+-----+------+
|  0|    a|   1.0|
|  0|    b|   1.0|
|  0|    c|   1.0|
|  1|    a|   2.0|
|  1|    b|   2.0|
|  1|    c|   1.0|
+---+-----+------+
from pyspark.sql.types import *
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import *

def to_array_(v):
    return v.toArray().tolist()

def to_vector_(v):
    return Vectors.dense(v)

to_array = udf(lambda z: to_array_(z), ArrayType(DoubleType()))  # watch your return type
to_vector = udf(lambda z: to_vector_(z), VectorUDT())  # helper to make an example for your question
getFeatureVector = udf(lambda v: v[2], VectorUDT())  # this should work on your feature vector, but I'm too lazy to contrive an example with vectors of vectors
getFeatureVectorExample = udf(lambda v: v[2], FloatType())  # this works for this example and gives the general idea of how to access vectors

schema = ["id", "texts", "vector"]
data = [
    (0, ['a', 'b', 'c'], [1.0, 1.0, 1.0]),  # small cheat
    (1, ['a', 'b', 'c'], [2.0, 2.0, 1.0]),
]
df = spark.createDataFrame(data, schema)
df = df.withColumn("vector", to_vector(df.vector))  # convert the array to a vector so I can prove this works
# DataFrame[id: bigint, texts: array<string>, vector: vector]
This may make you ask: how do I access an element of the vector to turn it into an array? (We use another UDF that does the translation for us.)
df.select(col('*'), getFeatureVectorExample( df.vector ) ).show()
+---+---------+-------------+----------------+
| id| texts| vector|<lambda>(vector)|
+---+---------+-------------+----------------+
| 0|[a, b, c]|[1.0,1.0,1.0]| 1.0|
| 1|[a, b, c]|[2.0,2.0,1.0]| 1.0|
+---+---------+-------------+----------------+
OK, so now we know how to get the element we're interested in. The rest of this example shows how to convert a vector into an array and then explode it.
# I use withColumn because I'm lazy; note you can't have two explodes in one select, so don't try to do that.
df.withColumn('text', explode(df.texts)) \
  .withColumn('feature', explode(to_array(df.vector))) \
  .drop('texts', 'vector') \
  .show()  # drop() is bookkeeping to clean up columns you don't want
+---+----+-------+
| id|text|feature|
+---+----+-------+
| 0| a| 1.0|
| 0| a| 1.0|
| 0| a| 1.0|
| 0| b| 1.0|
| 0| b| 1.0|
| 0| b| 1.0|
| 0| c| 1.0|
| 0| c| 1.0|
| 0| c| 1.0|
| 1| a| 2.0|
| 1| a| 2.0|
| 1| a| 1.0|
| 1| b| 2.0|
| 1| b| 2.0|
| 1| b| 1.0|
| 1| c| 2.0|
| 1| c| 2.0|
| 1| c| 1.0|
+---+----+-------+
To further clarify: if you wish to access elements of a vector, you can create a static function.
This function pulls the last element (index 2) out of a vector and returns it as a vector, but it gives a hint as to how to access other elements:
getFeatureVector=udf(lambda v:v[2],VectorUDT())
If the elements are of different types, you will need to write extra logic to handle them and the return type.
Here's an example that accesses the first element (index 0) of a vector and returns it as a FloatType:
getFeatureVectorExample=udf(lambda v:v[0],FloatType())
You can of course combine these elements and return a more complex structure that may suit your needs. I suggest returning them as a struct, since you can use 'column_name.*' to turn the struct's fields into columns, or struct_column.field_name to access individual elements and return them as columns. See this example for how to build out the return type.
Here is a further example using multiple elements in a struct and turning them into columns:
def structExample(v):
    return (
        float(v[0]),
        float(v[0])
    )

getstructExample = udf(structExample, StructType([StructField("flt", FloatType(), False), StructField("array", FloatType())]))

df.select(col('*'), getstructExample(df.vector).alias("struct")).select(col("struct.*")).show()
+---+-----+
|flt|array|
+---+-----+
|1.0| 1.0|
|2.0| 2.0|
+---+-----+
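As mentioned above, you can also pull a single field out of the struct column by name instead of expanding everything with 'struct.*'. A minimal sketch reusing the getstructExample UDF from this example (the alias flt_only is just illustrative):

# access one field of the struct by name instead of expanding all fields
df.select(getstructExample(df.vector).alias("struct")) \
  .select(col("struct.flt").alias("flt_only")) \
  .show()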
Related
I have two dataframes: one is the main dataframe and the other is a lookup dataframe. I need to produce the third one in the customized form shown below using PySpark. I need to check the values in the column list_IDs, look for a match in the lookup dataframe, and mark the count in the final dataframe. I have tried array_intersect and array lookups, but it is not working.
Main dataframe:
df = spark.createDataFrame([(123, [75319, 75317]), (212, [136438, 25274]), (215, [136438, 75317])], ("ID", "list_IDs"))
df.show()
+---+---------------+
| ID| list_IDs|
+---+---------------+
|123| [75319, 75317]|
|212|[136438, 25274]|
|215|[136438, 75317]|
+---+---------------+
Lookup Dataframe:
df_2 = spark.createDataFrame([(75319, "Wheat", 20), (75317, "Rice", 10), (136438, "Jowar", 30), (25274, "Rajma", 40)], ("ID", "Material", "Count"))
df_2.show()
+------+--------+-----+
| ID|Material|Count|
+------+--------+-----+
| 75319|   Wheat|   20|
| 75317|    Rice|   10|
|136438|   Jowar|   30|
| 25274|   Rajma|   40|
+------+--------+-----+
Needed resultant dataframe:
+---+---------------+-----+----+-----+-----+
| ID|       list_IDs|Wheat|Rice|Jowar|Rajma|
+---+---------------+-----+----+-----+-----+
|123| [75319, 75317]|   20|  10|    0|    0|
|212|[136438, 25274]|    0|   0|   30|   40|
|215|[136438, 75317]|    0|  10|   30|    0|
+---+---------------+-----+----+-----+-----+
You can join the two dataframes and then pivot:
import pyspark.sql.functions as F
result = df.join(
    df_2,
    F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material').agg(F.first('Count')).fillna(0)

result.show()
+---+---------------+-----+-----+----+-----+
| ID| list_IDs|Jowar|Rajma|Rice|Wheat|
+---+---------------+-----+-----+----+-----+
|212|[136438, 25274]| 30| 40| 0| 0|
|215|[136438, 75317]| 30| 0| 10| 0|
|123| [75319, 75317]| 0| 0| 10| 20|
+---+---------------+-----+-----+----+-----+
Now I have data like this:
+----+----+
|col1| d|
+----+----+
| A| 4|
| A| 10|
| A| 3|
| B| 3|
| B| 6|
| B| 4|
| B| 5.5|
| B| 13|
+----+----+
col1 is StringType and d is TimestampType; here I use DoubleType instead.
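For reference, a minimal sketch of how the sample dataframe above could be built (assuming an active SparkSession named spark, and using DoubleType for d as mentioned):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

data = [("A", 4.0), ("A", 10.0), ("A", 3.0), ("B", 3.0),
        ("B", 6.0), ("B", 4.0), ("B", 5.5), ("B", 13.0)]
schema = StructType([StructField("col1", StringType()), StructField("d", DoubleType())])
df = spark.createDataFrame(data, schema)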
I want to generate data based on condition tuples.
Given the list of tuples [(A, 3.5), (A, 8), (B, 3.5), (B, 10)],
I want to have a result like:
+----+---+
|col1| d|
+----+---+
| A| 4|
| A| 10|
| B| 4|
| B| 13|
+----+---+
That is, for each tuple we select from the PySpark dataframe the first row whose d is larger than the tuple's number and whose col1 equals the tuple's string.
What I've already written is:
df_res = spark_empty_dataframe
for (x, y) in tuples:
    dft = df.filter(df.col1 == x).filter(df.d > y).limit(1)
    df_res = df_res.union(dft)
But I think this might have an efficiency problem; I do not know if I'm right.
A possible approach that avoids loops is to create a dataframe from the tuples you have as input:
t = [('A',3.5),('A',8),('B',3.5),('B',10)]
ref=spark.createDataFrame([(i[0],float(i[1])) for i in t],("col1_y","d_y"))
Then we can join with the input dataframe (df) on that condition, group on the tuple's keys and values (which are repeated) to get the first value in each group, and then drop the extra columns:
from pyspark.sql import functions as F

(df.join(ref, (df.col1 == ref.col1_y) & (df.d > ref.d_y), how='inner').orderBy("col1", "d")
   .groupBy("col1_y", "d_y").agg(F.first("col1").alias("col1"), F.first("d").alias("d"))
   .drop("col1_y", "d_y")).show()
+----+----+
|col1| d|
+----+----+
| A|10.0|
| A| 4.0|
| B| 4.0|
| B|13.0|
+----+----+
Note: if the order of the dataframe is important, you can assign an index column with monotonically_increasing_id, include it in the aggregation, and then orderBy that index column.
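A minimal sketch of that idea (assuming the same df and ref dataframes as above; the index column name idx is just illustrative):

from pyspark.sql import functions as F

# keep track of the original row order before joining
df_idx = df.withColumn("idx", F.monotonically_increasing_id())

(df_idx.join(ref, (df_idx.col1 == ref.col1_y) & (df_idx.d > ref.d_y), how='inner')
       .orderBy("col1", "d")
       .groupBy("col1_y", "d_y")
       .agg(F.first("idx").alias("idx"),
            F.first("col1").alias("col1"),
            F.first("d").alias("d"))
       .orderBy("idx")                 # restore the original order
       .drop("col1_y", "d_y", "idx")
       .show())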
EDIT: another way, instead of ordering and taking the first, is to get the minimum directly with min:
(df.join(ref,(df.col1==ref.col1_y)&(df.d>ref.d_y),how='inner')
.groupBy("col1_y","d_y").agg(F.min("col1").alias("col1"),F.min("d").alias("d"))
.drop("col1_y","d_y")).show()
+----+----+
|col1| d|
+----+----+
| B| 4.0|
| B|13.0|
| A| 4.0|
| A|10.0|
+----+----+
I currently have a PySpark dataframe that has many columns populated by integer counts. Many of these columns have counts of zero. I would like to find a way to sum how many columns have counts greater than zero.
In other words, I would like an approach that sums values across a row, where all the columns for a given row are effectively boolean (although the datatype conversion may not be necessary). Several columns in my table are datetime or string, so ideally I would have an approach that first selects the numeric columns.
Current Dataframe example and Desired Output
+----+--------+----------+----------+---+--------------+
|USER|    DATE|COUNT_COL1|COUNT_COL2|...|DESIRED COLUMN|
+----+--------+----------+----------+---+--------------+
|   b|7/1/2019|        12|         1|...|             2|  (2 columns are non-zero)
|   a|6/9/2019|         0|         5|...|             1|
|   c|1/1/2019|         0|         0|...|             0|
+----+--------+----------+----------+---+--------------+
Pandas: As an example, in pandas this can be accomplished by selecting the numeric columns, converting to bool, and summing with axis=1. I am looking for a PySpark equivalent.
test_cols=list(pandas_df.select_dtypes(include=[np.number]).columns.values)
pandas_df[test_cols].astype(bool).sum(axis=1)
For numeric columns, you can do it by creating an array of all the columns with integer values (using df.dtypes) and then using higher-order functions. In this case I used filter to get rid of all 0s, and then size to get the number of non-zero elements per row (Spark 2.4+).
from pyspark.sql import functions as F
df.withColumn("arr", F.array(*[F.col(i[0]) for i in df.dtypes if i[1] in ['int','bigint']]))\
.withColumn("DESIRED COLUMN", F.expr("""size(filter(arr,x->x!=0))""")).drop("arr").show()
#+----+--------+----------+----------+--------------+
#|USER| DATE|COUNT_COL1|COUNT_COL2|DESIRED COLUMN|
#+----+--------+----------+----------+--------------+
#| b|7/1/2019| 12| 1| 2|
#| a|6/9/2019| 0| 5| 1|
#| c|1/1/2019| 0| 0| 0|
#+----+--------+----------+----------+--------------+
Let's say you have below df:
df.show()
df.printSchema()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| a| 1| 2| 3|
| a| 0| 2| 1|
| a| 0| 0| 1|
| a| 0| 0| 0|
+---+---+---+---+
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
Using a case/when statement, you can check whether a column is numeric and then whether it is larger than 0. In the next step, f.size returns the count, thanks to f.array_remove, which leaves only the columns with a True value.
from pyspark.sql import functions as f
cols = [f.when(f.length(f.regexp_replace(f.col(x), '\\d+', '')) > 0, False).otherwise(f.col(x).cast('int') > 0) for x in df.columns]
df.select("*", f.size(f.array_remove(f.array(*cols), False)).alias("count")).show()
+---+---+---+---+-----+
|_c0|_c1|_c2|_c3|count|
+---+---+---+---+-----+
| a| 1| 2| 3| 3|
| a| 0| 2| 1| 2|
| a| 0| 0| 1| 1|
| a| 0| 0| 0| 0|
+---+---+---+---+-----+
I'm working with PySpark to deal with big CSV files of more than 50 GB.
Now I need to find the number of distinct values between two references to the same value.
For example,
input dataframe:
+----+
|col1|
+----+
| a|
| b|
| c|
| c|
| a|
| b|
| a|
+----+
output dataframe:
+----+-----+
|col1|col2 |
+----+-----+
| a| null|
| b| null|
| c| null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+-----+
I've been struggling with this for the past week. I tried window functions and many other things in Spark, but couldn't get anything to work. It would be a great help if someone knows how to fix this. Thank you.
Comment if you need any clarification on the question.
I am providing a solution with some assumptions.
Assuming the previous reference can be found within at most the previous 'n' rows: if 'n' is a reasonably small value, I think this is a good solution.
I assumed you can find the previous reference within 5 rows.
from pyspark.sql import Window
from pyspark.sql.functions import udf, col, lag, array, monotonically_increasing_id
from pyspark.sql.types import IntegerType

def get_distincts(values, current_value):
    cnt = {}
    flag = False
    for i in values:
        if current_value == i:
            flag = True
            break
        else:
            cnt[i] = "some_value"
    if flag:
        return len(cnt)
    else:
        return None

get_distincts_udf = udf(get_distincts, IntegerType())

df = spark.createDataFrame([["a"], ["b"], ["c"], ["c"], ["a"], ["b"], ["a"]]).toDF("col1")
# You can replace this if you have some unique id column
df = df.withColumn("seq_id", monotonically_increasing_id())
window = Window.orderBy("seq_id")
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, 6)]))
df = df.withColumn("col2", get_distincts_udf(col('list'), col('col1'))).drop('seq_id', 'list')
df.show()
which results in:
+----+----+
|col1|col2|
+----+----+
| a|null|
| b|null|
| c|null|
| c| 0|
| a| 2|
| b| 2|
| a| 1|
+----+----+
You can try the following approach:
add a monotonically_increasing_id column id to keep track of the order of rows
find prev_id for each col1 and save the result to a new df
for the new DF (alias 'd1'), make a LEFT JOIN to the DF itself (alias 'd2') with a condition (d2.id > d1.prev_id) & (d2.id < d1.id)
then groupby('d1.col1', 'd1.id') and aggregate on the countDistinct('d2.col1')
The code based on the above logic and your sample data is shown below:
from pyspark.sql import functions as F, Window
df1 = spark.createDataFrame([ (i,) for i in list("abccaba")], ["col1"])
# create a WinSpec partitioned by col1 so that we can find the prev_id
win = Window.partitionBy('col1').orderBy('id')
# set up id and prev_id
df11 = df1.withColumn('id', F.monotonically_increasing_id())\
.withColumn('prev_id', F.lag('id').over(win))
# check the newly added columns
df11.sort('id').show()
# +----+---+-------+
# |col1| id|prev_id|
# +----+---+-------+
# | a| 0| null|
# | b| 1| null|
# | c| 2| null|
# | c| 3| 2|
# | a| 4| 0|
# | b| 5| 1|
# | a| 6| 4|
# +----+---+-------+
# let's cache the new dataframe
df11.persist()
# do a self-join on id and prev_id and then do the aggregation
df12 = df11.alias('d1') \
.join(df11.alias('d2')
, (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')), how='left') \
.select('d1.col1', 'd1.id', F.col('d2.col1').alias('ids')) \
.groupBy('col1','id') \
.agg(F.countDistinct('ids').alias('distinct_values'))
# display the result
df12.sort('id').show()
# +----+---+---------------+
# |col1| id|distinct_values|
# +----+---+---------------+
# | a| 0| 0|
# | b| 1| 0|
# | c| 2| 0|
# | c| 3| 0|
# | a| 4| 2|
# | b| 5| 2|
# | a| 6| 1|
# +----+---+---------------+
# release the cached df11
df11.unpersist()
Note: you will need to keep this id column to sort the rows, otherwise the resulting rows will be totally messed up each time you collect them.
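For example, a minimal sketch of reading the results back in their original order using the kept id column (column names as in the code above):

# sort by the id column kept through the pipeline, then read the values off in order
rows = df12.sort('id').collect()
ordered = [(r['col1'], r['distinct_values']) for r in rows]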
Here is a plain-Python way to compute the same kind of reuse distances (cleaned up; the loop over the blocks read from the CSV is assumed):
reuse_distance = []      # will hold [block, stack_dist, reuse_dist] per access
block_dict = {}
stack_dict = {}
counter_reuse = 0
counter_stack = 0
reuse_list = []
stack_list = []

# Here `block` is nothing but the character you want to read and search for from the CSV;
# the iterable `blocks` is assumed to yield those values in order.
for block in blocks:
    stack_dist = -1
    reuse_dist = -1
    if block in block_dict:
        reuse_dist = counter_reuse - block_dict[block] - 1
        block_dict[block] = counter_reuse
        counter_reuse += 1
        stack_dist_ind = stack_list.index(block)
        stack_dist = counter_stack - stack_dist_ind - 1
        del stack_list[stack_dist_ind]
        stack_list.append(block)
    else:
        block_dict[block] = counter_reuse
        counter_reuse += 1
        counter_stack += 1
        stack_list.append(block)
    reuse_distance.append([block, stack_dist, reuse_dist])
I am currently trying to find efficient ways of grouping low-occurrence levels of categorical columns (of StringType()). I want to do this based on a percentage threshold, i.e. replace all values that occur in less than z% of the rows. Also, it is important that we can recover the mapping between the numerical values (after applying StringIndexer) and the original values.
So basically with a threshold of 25%, this dataframe:
+---+---+---+---+
| x1| x2| x3| x4|
+---+---+---+---+
| a| a| a| a|
| b| b| a| b|
| a| a| a| c|
| b| b| a| d|
| c| a| a| e|
+---+---+---+---+
Should become this:
+------+------+------+------+
|x1_new|x2_new|x3_new|x4_new|
+------+------+------+------+
| a| a| a| other|
| b| b| a| other|
| a| a| a| other|
| b| b| a| other|
| other| a| a| other|
+------+------+------+------+
where c has been replaced with other in column x1, and all values have been replaced with other in column x4, because they occur in less than 25% of the rows.
I was hoping to use a regular StringIndexer and make use of the fact that its values are ordered by frequency. We can calculate how many values to keep and replace all others with e.g. -1. The issue with this approach is that it raises errors later within IndexToString, I assume because the metadata is lost.
My questions: is there a good way to do this? Are there built-in functions that I might be overlooking? Is there a way to keep the metadata?
Thanks in advance!
df = pd.DataFrame({'x1' : ['a','b','a','b','c'], # a: 0.4, b: 0.4, c: 0.2
'x2' : ['a','b','a','b','a'], # a: 0.6, b: 0.4, c: 0.0
'x3' : ['a','a','a','a','a'], # a: 1.0, b: 0.0, c: 0.0
'x4' : ['a','b','c','d','e']}) # a: 0.2, b: 0.2, c: 0.2, d: 0.2, e: 0.2
df = sqlContext.createDataFrame(df)
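For illustration, a rough sketch of the StringIndexer-based attempt described above (my own reconstruction, hardcoded to column x1 and a 25% threshold; the commented-out IndexToString step is where it breaks, since the remapped column no longer matches the original ml_attr metadata):

from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer, IndexToString

total = df.count()
indexed = StringIndexer(inputCol="x1", outputCol="ix_x1").fit(df).transform(df)

# number of levels that occur in at least 25% of the rows (StringIndexer orders labels by frequency)
n_to_keep = (df.groupBy("x1").count()
               .filter(F.col("count") / total >= 0.25)
               .count())

# lump all remaining levels into -1 -- this is the step that invalidates the metadata
indexed = indexed.withColumn(
    "ix_x1", F.when(F.col("ix_x1") >= n_to_keep, F.lit(-1.0)).otherwise(F.col("ix_x1")))

# this is where the approach fails: IndexToString cannot map -1 back to a label
# IndexToString(inputCol="ix_x1", outputCol="x1_new").transform(indexed).show()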
I did some further investigation and stumbled upon this post about adding metadata to a column in pyspark. Based on this, I was able to create a function called group_low_freq that I think is quite efficient; it uses the StringIndexer only once, and then modifies this column and the metadata to bin all elements that occur in less than x% of the rows into a separate group called "other". Since we also modify the metadata, we are able to retrieve the strings later with IndexToString. The function and an example are given below:
Code:
import findspark
findspark.init()
import pyspark as ps
from pyspark.sql import SQLContext, Column
import pandas as pd
import numpy as np
from pyspark.sql.functions import col, count as sparkcount, when, lit
from pyspark.sql.types import StringType
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml import Pipeline
import json
try:
    sc
except NameError:
    sc = ps.SparkContext()
sqlContext = SQLContext(sc)
from pyspark.sql.functions import col
def withMeta(self, alias, meta):
    sc = ps.SparkContext._active_spark_context
    jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
    return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))
def group_low_freq(df, inColumns, threshold=.01, group_text='other'):
    """
    Index string columns and group all observations that occur in less than threshold% of the rows in df per column.
    :param df: A pyspark.sql.dataframe.DataFrame
    :param inColumns: String columns that need to be indexed
    :param group_text: String to use as replacement for the observations that need to be grouped.
    """
    total = df.count()
    for string_col in inColumns:
        # Apply string indexer
        pipeline = Pipeline(stages=[StringIndexer(inputCol=string_col, outputCol="ix_" + string_col)])
        df = pipeline.fit(df).transform(df)

        # Calculate the number of unique elements to keep
        n_to_keep = df.groupby(string_col).agg((sparkcount(string_col) / total).alias('perc')).filter(col('perc') > threshold).count()

        # If elements occur below (threshold * number of rows), replace them with n_to_keep.
        this_meta = df.select('ix_' + string_col).schema.fields[0].metadata
        if n_to_keep != len(this_meta['ml_attr']['vals']):
            this_meta['ml_attr']['vals'] = this_meta['ml_attr']['vals'][0:(n_to_keep + 1)]
            this_meta['ml_attr']['vals'][n_to_keep] = group_text
            df = df.withColumn('ix_' + string_col, when(col('ix_' + string_col) >= n_to_keep, lit(n_to_keep)).otherwise(col('ix_' + string_col)))

        # add the new column with correct metadata, remove original.
        df = df.withColumn('ix_' + string_col, withMeta(col('ix_' + string_col), "", this_meta))
    return df
# SAMPLE DATA -----------------------------------------------------------------
df = pd.DataFrame({'x1' : ['a','b','a','b','c'], # a: 0.4, b: 0.4, c: 0.2
'x2' : ['a','b','a','b','a'], # a: 0.6, b: 0.4, c: 0.0
'x3' : ['a','a','a','a','a'], # a: 1.0, b: 0.0, c: 0.0
'x4' : ['a','b','c','d','e']}) # a: 0.2, b: 0.2, c: 0.2, d: 0.2, e: 0.2
df = sqlContext.createDataFrame(df)
# TEST THE FUNCTION -----------------------------------------------------------
df = group_low_freq(df,df.columns,0.25)
ix_cols = [x for x in df.columns if 'ix_' in x]
for string_col in ix_cols:
    idx_to_string = IndexToString(inputCol=string_col, outputCol=string_col[3:] + 'grouped')
    df = idx_to_string.transform(df)

df.show()
Output with a threshold of 25% (so each group had to occur in at least 25% of the rows):
+---+---+---+---+-----+-----+-----+-----+---------+---------+---------+---------+
| x1| x2| x3| x4|ix_x1|ix_x2|ix_x3|ix_x4|x1grouped|x2grouped|x3grouped|x4grouped|
+---+---+---+---+-----+-----+-----+-----+---------+---------+---------+---------+
| a| a| a| a| 0.0| 0.0| 0.0| 0.0| a| a| a| other|
| b| b| a| b| 1.0| 1.0| 0.0| 0.0| b| b| a| other|
| a| a| a| c| 0.0| 0.0| 0.0| 0.0| a| a| a| other|
| b| b| a| d| 1.0| 1.0| 0.0| 0.0| b| b| a| other|
| c| a| a| e| 2.0| 0.0| 0.0| 0.0| other| a| a| other|
+---+---+---+---+-----+-----+-----+-----+---------+---------+---------+---------+