I have performed the data cleaning of my dataframe with pyspark, including the removal of the Stop-Words.
Removing the Stop-Word produces a list for each line, containing words that are NOT Stop-Words.
Now I would like to count all the words left in that column, to make the Word-Cloud or the Word-Frequency.
This is my pyspark dataframe:
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
| content|score|label|classWeigth| words| filtered| terms_stemmed|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
|absolutely love d...| 5| 1| 0.48|[absolutely, love...|[absolutely, love...|[absolut, love, d...|
|absolutely love t...| 5| 1| 0.48|[absolutely, love...|[absolutely, love...|[absolut, love, g...|
|absolutely phenom...| 5| 1| 0.48|[absolutely, phen...|[absolutely, phen...|[absolut, phenome...|
|absolutely shocki...| 1| 0| 0.52|[absolutely, shoc...|[absolutely, shoc...|[absolut, shock, ...|
|accept the phone ...| 1| 0| 0.52|[accept, the, pho...|[accept, phone, n...|[accept, phone, n...|
+--------------------+-----+-----+-----------+--------------------+--------------------+--------------------+
terms_stemmed is the final column, from which I would like to get a new data frame like the following:
+-------------+--------+
|terms_stemmed| count |
+-------------+--------+
|app | 592059|
|use | 218178|
|good | 187671|
|like | 155304|
|game | 149941|
|.... | .... |
Someone can help me?
One option is to use explode
import pyspark.sql.functions as F
new_df = df\
.withColumn('terms_stemmed', F.explode('terms_stemmed'))\
.groupby('terms_stemmed')\
.count()
Example
import pyspark.sql.functions as F
df = spark.createDataFrame([
(1, ["Apple", "Banana"]),
(2, ["Banana", "Orange", "Banana"]),
(3, ["Orange"])
], ("id", "terms_stemmed"))
df.show(truncate=False)
+---+------------------------+
|id |terms_stemmed |
+---+------------------------+
|1 |[Apple, Banana] |
|2 |[Banana, Orange, Banana]|
|3 |[Orange] |
+---+------------------------+
new_df = df\
.withColumn('terms_stemmed', F.explode('terms_stemmed'))\
.groupby('terms_stemmed')\
.count()
new_df.show()
+-------------+-----+
|terms_stemmed|count|
+-------------+-----+
| Banana| 3|
| Apple| 1|
| Orange| 2|
+-------------+-----+
Related
I have two dataframes one is the main and another one is the lookup dataframe. I need to achieve the third one in the customized form using pyspark. I need check the values in the column list_ids and check the match in the lookup dataframe and mark the count in the final dataframe. I have tried array intersect and array lookup but it is not working.
Main dataframe:
df = spark.createDataFrame([(123, [75319, 75317]), (212, [136438, 25274]), (215, [136438, 75317])], ("ID", "list_IDs"))
df.show()
+---+---------------+
| ID| list_IDs|
+---+---------------+
|123| [75319, 75317]|
|212|[136438, 25274]|
|215|[136438, 75317]|
+---+---------------+
Lookup Dataframe:
df_2 = spark.createDataFrame([(75319, "Wheat", 20), (75317, "Rice", 10), (136438, "Jowar", 30), (25274, "Rajma", 40)], ("ID", "Material", "Count"))
df_2.show()
+------+--------+-----+
| ID|Material|Count|
+------+--------+-----+
| 75319| Wheat| A|
| 75317| Rice| B|
|136438| Jowar| C|
| 25274| Rajma| D|
+------+--------+-----+
Need Resultant dataframe as
+---+---------------+------+------+-------+------+
| ID| list_IDs|Wheat | Rice | Jowar | Rajma|
+---+---------------+------+------+-------+------+
|123| [75319, 75317]| A| B| 0 | 0|
|212|[136438, 25274]| 0| 0| C | D|
|215|[136438, 75317]| 0| B| C | 0 |
+---+---------------+------+------+-------+------+
You can join the two dataframes and then pivot:
import pyspark.sql.functions as F
df2 = df.join(
df_2,
F.array_contains(df.list_IDs, df_2.ID)
).groupBy(df.ID, 'list_IDs').pivot('Material').agg(F.first('Count')).fillna(0)
result.show()
+---+---------------+-----+-----+----+-----+
| ID| list_IDs|Jowar|Rajma|Rice|Wheat|
+---+---------------+-----+-----+----+-----+
|212|[136438, 25274]| 30| 40| 0| 0|
|215|[136438, 75317]| 30| 0| 10| 0|
|123| [75319, 75317]| 0| 0| 10| 20|
+---+---------------+-----+-----+----+-----+
I'm trying to convert a column in a dataframe to IntegerType. Here is an example of the dataframe:
+----+-------+
|From| To|
+----+-------+
| 1|1664968|
| 2| 3|
| 2| 747213|
| 2|1664968|
| 2|1691047|
| 2|4095634|
+----+-------+
I'm using the following code:
exploded_df = exploded_df.withColumn('From', exploded_df['To'].cast(IntegerType()))
However, I wanted to know what happens to strings that are not digits, for example, what happens if I have a string with several spaces? The reason is that I want to filter the dataframe in order to get the values of the column 'From' that don't have numbers in column 'To'.
Is there a simpler way to filter by this condition without converting the columns to IntegerType?
Thank you!
Values which cannot be cast are set to null, and the column will be considered a nullable column of that type. Here's a simple example:
from pyspark import SQLContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
df = sql_context.createDataFrame([("1",),
("2",),
("3",),
("4",),
("hello world",)], schema=['id'])
print(df.show())
df = df.withColumn("id", F.col("id").astype(IntegerType()))
print(df.show())
Output:
+-----------+
| id|
+-----------+
| 1|
| 2|
| 3|
| 4|
|hello world|
+-----------+
+----+
| id|
+----+
| 1|
| 2|
| 3|
| 4|
|null|
+----+
And to verify the schema is correct:
print(df.printSchema())
Output:
None
root
|-- id: integer (nullable = true)
Hope this helps!
We can use regex to check does To column have some alphabets,spaces in the data, Using .rlike funtion in spark to filter out the matching rows.
Example:
df=spark.createDataFrame([("1","1664968"),("2","3"),("2","742a7"),("2"," "),("2","a")],["From","To"])
df.show()
#+----+-------+
#|From| To|
#+----+-------+
#| 1|1664968|
#| 2| 3|
#| 2| 742a7|
#| 2| |
#| 2| a|
#+----+-------+
#get the rows which have space or word in them
df.filter(col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-----+
#|From|To |
#+----+-----+
#|2 |742a7|
#|2 | |
#|2 |a |
#+----+-----+
#to get rows which doesn't have any space or word in them.
df.filter(~col("To").rlike('([a-z]|\\s+)')).show(truncate=False)
#+----+-------+
#|From|To |
#+----+-------+
#|1 |1664968|
#|2 |3 |
#+----+-------+
I have 2 DataFrame like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foo|
| b| bar|
| c| egg|
| d| fog|
+--+-----------+
and this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| hoi|
| b| hei|
| c| hai|
| e| hui|
+--+-----------+
I want to join them to be like this:
+--+-----------+
|id|some_string|
+--+-----------+
| a| foohoi|
| b| barhei|
| c| egghai|
| d| fog|
| e| hui|
+--+-----------+
so, the column some_string from the first dataframe is concantenated to the column some_string from the second dataframe. If I am using
df_join = df1.join(df2,on='id',how='outer')
it would return
+--+-----------+-----------+
|id|some_string|some_string|
+--+-----------+-----------+
| a| foo| hoi|
| b| bar| hei|
| c| egg| hai|
| d| fog| null|
| e| null| hui|
+--+-----------+-----------+
Is there any way to do it?
You need to use when in order to achieve proper concatenation. Other than that the way you were using outer join was almost correct.
You need to check if anyone of these two columns is Null or not Null and then do the concatenation.
from pyspark.sql.functions import col, when, concat
df1 = sqlContext.createDataFrame([('a','foo'),('b','bar'),('c','egg'),('d','fog')],['id','some_string'])
df2 = sqlContext.createDataFrame([('a','hoi'),('b','hei'),('c','hai'),('e','hui')],['id','some_string'])
df_outer_join=df1.join(df2.withColumnRenamed('some_string','some_string_x'), ['id'], how='outer')
df_outer_join.show()
+---+-----------+-------------+
| id|some_string|some_string_x|
+---+-----------+-------------+
| e| null| hui|
| d| fog| null|
| c| egg| hai|
| b| bar| hei|
| a| foo| hoi|
+---+-----------+-------------+
df_outer_join = df_outer_join.withColumn('some_string_concat',
when(col('some_string').isNotNull() & col('some_string_x').isNotNull(),concat(col('some_string'),col('some_string_x')))
.when(col('some_string').isNull() & col('some_string_x').isNotNull(),col('some_string_x'))
.when(col('some_string').isNotNull() & col('some_string_x').isNull(),col('some_string')))\
.drop('some_string','some_string_x')
df_outer_join.show()
+---+------------------+
| id|some_string_concat|
+---+------------------+
| e| hui|
| d| fog|
| c| egghai|
| b| barhei|
| a| foohoi|
+---+------------------+
Considering you want to perform an outer join you can try the following:
from pyspark.sql.functions import concat, col, lit, when
df_join= df1.join(df2,on='id',how='outer').when(isnull(df1.some_string1), ''). when(isnull(df2.some_string2),'').withColumn('new_column',concat(col('some_string1'),lit(''),col('some_string2'))).select('id','new_column')
(Please note that the some_string1 and 2 refer to the some_string columns from the df1 and df2 dataframes. I would advise you to name them differently instead of giving the same name some_string, so that you can call them)
This question already has answers here:
Create a group id over a window in Spark Dataframe
(3 answers)
Closed 4 years ago.
I would like to assign each group in a groupby a unique id number starting from 0 or 1 and incrementing by 1 for each group using pyspark.
I have done this previously using pandas with python with the command:
df['id_num'] = (df
.groupby('column_name')
.grouper
.group_info[0])
A toy example of the input and desired output is:
Input
+------+
|object|
+------+
|apple |
|orange|
|pear |
|berry |
|apple |
|pear |
|berry |
+------+
output:
+------+--+
|object|id|
+------+--+
|apple |1 |
|orange|2 |
|pear |3 |
|berry |4 |
|apple |1 |
|pear |3 |
|berry |4 |
+------+--+
I am not sure if the order is important. If not you can use dense_rank window function in this case
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> df.show()
+------+
|object|
+------+
| apple|
|orange|
| pear|
| berry|
| apple|
| pear|
| berry|
+------+
>>>
>>> df.withColumn("id", F.dense_rank().over(Window.orderBy(df.object))).show()
+------+---+
|object| id|
+------+---+
| apple| 1|
| apple| 1|
| berry| 2|
| berry| 2|
|orange| 3|
| pear| 4|
| pear| 4|
+------+---+
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
values = [('apple',),('orange',),('pear',),('berry',),('apple',),('pear',),('berry',)]
df = sqlContext.createDataFrame(values,['object'])
#Creating a column of distinct elements and converting them into dictionary with unique indexes.
df1 = df.distinct()
distinct_list = list(df1.select('object').toPandas()['object'])
dict_with_index = {distinct_list[i]:i+1 for i in range(len(distinct_list))}
#Applying the mapping of dictionary.
mapping_expr = create_map([lit(x) for x in chain(*dict_with_index.items())])
df=df.withColumn("id", mapping_expr.getItem(col("object")))
df.show()
+------+---+
|object| id|
+------+---+
| apple| 2|
|orange| 1|
| pear| 3|
| berry| 4|
| apple| 2|
| pear| 3|
| berry| 4|
+------+---+
I have 2 pyspark dataframes,
i
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 2| 456|
| 3| 111|
| 4| 678|
+---+-----+
j
+----+-----+
|ID_B|COL_B|
+----+-----+
| 2| 456|
| 3| 111|
| 4| 876|
+----+-----+
I'm trying to subtract i from j based on values of a particular column i.e., values present in COL_A of i should not be present in COL_B of j.
Expected output should be,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 1| 123|
| 4| 678|
+---+-----+
This is my code,
common = i.join(j.withColumnRenamed('COL_B', 'COL_A'), ['COL_A'], 'leftsemi')
diff = i.subtract(common)
diff.show()
But the output is coming wrong,
diff
+---+-----+
| ID|COL_A|
+---+-----+
| 2| 456|
| 1| 123|
| 4| 678|
| 3| 111|
+---+-----+
Am I doing something wrong here? Thanks in advance.
Try:
left_join = i.join(j, j.COL_B == i.COL_A,how='left')
left_join.filter(left_join.COL_A.isNull()).show()
If you are having column names as args, you can do like:
left_join = i.join(j, j[colb] == i[cola],how='left')
left_join.filter(left_join[cola].isNull()).show()