How to split JSON objects into words in PySpark - Python

I am loading a dataframe of tweets as JSON objects in PySpark.
I am trying to split the text into individual words, and then select all the words that include a #. I want to avoid regular Python functions and stick with what is available inside PySpark.
I am running the code in a Jupyter notebook, but this is the code overall:
import findspark
findspark.init()
from pyspark.sql import SQLContext, SparkSession

spark = SparkSession \
    .builder \
    .appName("Jupyter Spark shell") \
    .getOrCreate()
sc = spark.sparkContext

folder = 'tweet-id-text-345'
tweets = spark.read.format("json").option("delimeter", "\t").load(folder)
tweets.count()
I am very unsure how to do this. The desired result would be an array of all the different words, and another one for the words that include a #. These would be two separate lists.
Here is what the content looks like
+------------------------+
| text|
+------------------------+
| โปรทุนน้อย สุดประ...|
| RT #sOLehOXClj1XE...|
|RT #rkayama: 論文「関...|
| SixTONES OneSTのグッ...|
| मुख्यमंत्री #mlkh...|
+------------------------+
only showing top 5 rows

Assuming that the JSON follows this format, the text of the tweet is stored in a field called text.
The text column is split into single words and the resulting array is filtered using rlike:
from pyspark.sql import functions as f

df = spark.read.option("multiline", "true").json(<...>).select("text")
df.withColumn("all_words", f.split("text", " ")) \
    .withColumn("only_hash", f.expr("filter(all_words, w -> rlike(w, '.*#.*'))")) \
    .show(truncate=False)
If the original text was hello #world how a#re you today#, the output would be:
+--------------------------------+---------------------------------------+----------------------+
|text |all_words |only_hash |
+--------------------------------+---------------------------------------+----------------------+
|hello #world how a#re you today#|[hello, #world, how, a#re, you, today#]|[#world, a#re, today#]|
+--------------------------------+---------------------------------------+----------------------+
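If you actually need the two results as plain Python lists rather than array columns, one possible follow-up is to explode the words and collect them on the driver. This is only a sketch: it assumes the df from above with its text column, Spark 2.x or later, and a dataset small enough to collect; words_df is just an illustrative name.

from pyspark.sql import functions as f

# one row per word across all tweets
words_df = df.select(f.explode(f.split("text", " ")).alias("w"))

# every word, pulled back to the driver as a plain Python list
all_words = [r["w"] for r in words_df.collect()]
# only the words containing a '#', filtered on the Spark side before collecting
hash_words = [r["w"] for r in words_df.filter(f.col("w").contains("#")).collect()]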

Related

Trim String Characters in Pyspark dataframe

Suppose I have a dataframe with values in a column like:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters, and drop the trailing _ABZ if the value ends with it.
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods, but they did not work:
from pyspark.sql import functions as f

new_df = df.withColumn("new_column",
                       f.when((condition on some column),
                              f.substring('Existing_COL', 4, f.length(f.col("Existing_COL")))))
Can anyone please tell me which function I can use in PySpark? trim only removes whitespace and tab-like characters.
Based on your input and expected output, see the logic below:
from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",),
          ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)],
    schema=["values"]
)

(df.withColumn("new_vals",
               when(col('values').rlike("(_ABZ$)"),
                    regexp_replace(col('values'), r'(_ABZ$)', '')).otherwise(col('values')))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
| values| new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
| XYZ7394949| XYZ7394949| 7394949|
| XYZ3898302| XYZ3898302| 3898302|
| PQR3799_ABZ| PQR3799| 3799|
| MGE8983_ABZ| MGE8983| 8983|
+--------------+--------------+-----------+
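The two steps can also be collapsed into a single regexp_replace that strips the three-character prefix and the optional _ABZ suffix in one pass. This is a sketch, assuming the same df with a values column as above:

from pyspark.sql.functions import regexp_replace, col

# '^.{3}' drops the first three characters, '_ABZ$' drops a trailing _ABZ if present
df.withColumn("final_vals",
              regexp_replace(col("values"), r"^.{3}|_ABZ$", "")).show()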
If I understand you correctly, and if you don't insist on using the PySpark substring or trim functions, you can define a regular Python function and use it as a UDF in Spark:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def mysub(word):
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]

udf1 = udf(lambda x: mysub(x), StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id| label| new_label|
+---+--------------+-----------+
| 1|ABC00909083888|00909083888|
| 2|ABC93890380380|93890380380|
| 3| XYZ7394949| 7394949|
| 4| XYZ3898302| 3898302|
| 5| PQR3799_ABZ| 3799|
| 6| MGE8983_ABZ| 8983|
+---+--------------+-----------+
Please let me know if I misunderstood any of the cases.
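If performance matters on large data, the same logic can also be written as a vectorized pandas UDF instead of a row-at-a-time UDF. This is only a sketch, assuming Spark 3.x with pandas and PyArrow available; mysub_vec is a hypothetical name.

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def mysub_vec(words: pd.Series) -> pd.Series:
    # drop a trailing '_ABZ' if present, then drop the first three characters
    return words.str.replace(r"_ABZ$", "", regex=True).str[3:]

df.withColumn("new_label", mysub_vec("label")).show()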

How to create a DataFrame from a text file in PySpark?

I am new to PySpark and I want to convert a txt file into a DataFrame. I am trying to make tidy data in PySpark. Any help? Thanks.
I've already tried converting it to an RDD and then into a DataFrame, but that is not working for me, so I decided to convert the txt file into a DataFrame directly.
I was trying this, but it has not worked yet:
# read input text file to RDD
lines = sc.textFile("/home/h110-3/workspace/spark/weather01.txt")

# collect the RDD to a list
llist = lines.collect()

# print the list
for line in llist:
    print(line)
I have not been able to convert it into a DataFrame. Help please.
You can, via the text reader. Example here:
! cat sample.txt
hello there
loading line by line
via apache spark
text df api
print(spark.version)
df = spark.read.text("sample.txt")
df.printSchema()
df.show()
df.selectExpr("split(value, ' ') as rows").show(3, False)
2.4.3
root
|-- value: string (nullable = true)
+--------------------+
| value|
+--------------------+
| hello there|
|loading line by line|
| via apache spark|
| text df api|
+--------------------+
+-------------------------+
|rows |
+-------------------------+
|[hello, there] |
|[loading, line, by, line]|
|[via, apache, spark] |
+-------------------------+
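Since the goal is tidy data, the split pieces can also be promoted to real columns. A sketch, assuming each line has the same number of space-separated fields (here three, which is an assumption about the weather file); the col_0/col_1/col_2 names are hypothetical:

from pyspark.sql import functions as f

df = spark.read.text("sample.txt")

# split each line on spaces and expose the pieces as named columns
parts = f.split(f.col("value"), " ")
tidy = df.select(
    parts.getItem(0).alias("col_0"),
    parts.getItem(1).alias("col_1"),
    parts.getItem(2).alias("col_2"),
)
tidy.show(truncate=False)

# if the file is really delimiter-separated, the csv reader with a custom separator is simpler
tidy2 = spark.read.option("sep", " ").csv("sample.txt")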

Remove column names from spark dataframe while storing it as textfile

My dataframe output is as below,
DF.show(2)
+--------------+
|col1|col2|col3|
+--------------+
| 10| 20| 30|
| 11| 21| 31|
+--------------+
after saving it as textfile - DF.rdd.saveAsTextFile("path")
Row(col1=u'10', col2=u'20', col3=u'30')
Row(col1=u'11', col2=u'21', col3=u'31')
The dataframe has millions of rows and 20 columns. How can I save it as a text file as below, i.e., without column names and without the Python unicode prefixes?
10|20|30
11|21|31
While creating the initial RDD I used the code below to remove the unicode prefixes, though I am still getting them:
data = sc.textFile("file.txt")
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))
Thanks in advance !
I think you can do just
.map(lambda l: (l[0] + '|' + l[1] + '|' + l[2])).saveAsTextFile(...)
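With 20 columns, spelling out each index gets awkward; a more general variant of the same rdd route (a sketch that assumes every column should simply be stringified and pipe-joined; the output path is hypothetical) could be:

# join every field of each Row with '|', regardless of how many columns there are
DF.rdd.map(lambda row: "|".join(str(c) for c in row)) \
      .saveAsTextFile("output/path")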
In spark 2.0 you can write dataframes out directly to csv, which is all I think you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
So in your case, you could just do something like
df.write.option("sep", "|").option("header", "false").csv("some/path/")
There is a databricks plugin that provides this functionality in spark 1.x
https://github.com/databricks/spark-csv
As far as converting your unicode strings to ascii, see this question: Convert a Unicode string to a string in Python (containing extra symbols)

Write spark dataframe to file using python and '|' delimiter

I have constructed a Spark dataframe from a query. What I wish to do is print the dataframe to a text file with all information delimited by '|', like the following:
+-------+----+----+----+
|Summary|col1|col2|col3|
+-------+----+----+----+
|row1 |1 |14 |17 |
|row2 |3 |12 |2343|
+-------+----+----+----+
How can I do this?
You can try writing to CSV, choosing a delimiter of |:
df.write.option("sep","|").option("header","true").csv(filename)
This would not be 100% the same but would be close.
Alternatively you can collect to the driver and do it yourself, e.g.:
myprint(df.collect())
or
myprint(df.take(100))
df.collect and df.take return a list of rows.
Lastly, you can collect to the driver using toPandas() and use pandas tools.
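A sketch of the toPandas() route, assuming the result fits in driver memory (the output file name is hypothetical):

# pull the whole dataframe to the driver and let pandas write the pipe-delimited file
pdf = df.toPandas()
pdf.to_csv("output.txt", sep="|", index=False)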
In Spark 2.0+, you can use the built-in CSV writer. The delimiter is , by default, and you can set it to |:
df.write \
    .format('csv') \
    .options(delimiter='|') \
    .save('target/location')

How to get row_number in a PySpark dataframe

In order to rank, I need to get the row_number in a PySpark dataframe. I saw that there is a row_number function among the window functions of PySpark, but this requires using HiveContext.
I tried to replace the sqlContext with HiveContext
import pyspark
self.sc = pyspark.SparkContext()
#self.sqlContext = pyspark.sql.SQLContext(self.sc)
self.sqlContext = pyspark.sql.HiveContext(self.sc)
But now it throws the exception TypeError: 'JavaPackage' object is not callable
Can you help me either get the HiveContext working or get the row number in a different way?
Example of data:
I want to first rank by my prediction and then calculate a loss function (NDCG) based on this ranking. In order to calculate the loss function I will need the ranking (i.e. the position of the prediction in the sorted order).
So the first step is to sort the data by pred, but then I need a running counter over the sorted data.
+-----+--------------------+
|label|                pred|
+-----+--------------------+
| 1.0|[0.25313606997906...|
| 0.0|[0.40893413256608...|
| 0.0|[0.18353492079000...|
| 0.0|[0.77719741215204...|
| 1.0|[0.62766290642569...|
| 1.0|[0.40893413256608...|
| 1.0|[0.63084085591913...|
| 0.0|[0.77719741215204...|
| 1.0|[0.36752166787523...|
| 0.0|[0.40893413256608...|
| 1.0|[0.25528507573737...|
| 1.0|[0.25313606997906...|
Thanks.
You don't need to create the HiveContext if your data is not in Hive. You can just carry on with your sqlContext.
There is no row_number for your dataframe unless you create one. pyspark.sql.functions.row_number is for a different purpose: it only works over a window partition.
What you may need is to create a new column as the row_id using monotonically_increasing_id and then query it later.
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import Row

data = sc.parallelize([
    Row(key=1, val='a'),
    Row(key=2, val='b'),
    Row(key=3, val='c'),
]).toDF()

data = data.withColumn(
    'row_id',
    monotonically_increasing_id()
)
data.collect()
data.collect()
Out[8]:
[Row(key=1, val=u'a', row_id=17179869184),
Row(key=2, val=u'b', row_id=42949672960),
Row(key=3, val=u'c', row_id=60129542144)]
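Since the original goal was a running counter over the data sorted by pred, the windowed row_number mentioned above can also give that directly; with a Spark 2.x SparkSession this does not need a separate HiveContext. A sketch, assuming a dataframe df with a sortable numeric pred column (if pred is an ML probability vector, extract the relevant element first):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

# no partitionBy: this is a global ordering, so all rows pass through a single partition
w = Window.orderBy(f.col("pred").desc())
ranked = df.withColumn("rank", f.row_number().over(w))
ranked.show()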
