Write spark dataframe to file using python and '|' delimiter - python

I have constructed a Spark dataframe from a query. What I wish to do is print the dataframe to a text file with all information delimited by '|', like the following:
+-------+----+----+----+
|Summary|col1|col2|col3|
+-------+----+----+----+
|row1 |1 |14 |17 |
|row2 |3 |12 |2343|
+-------+----+----+----+
How can I do this?

You can try writing to CSV, choosing | as the delimiter:
df.write.option("sep","|").option("header","true").csv(filename)
This would not be 100% the same, but it would be close.
Alternatively, you can collect to the driver and do it yourself, e.g.:
myprint(df.collect())
or
myprint(df.take(100))
df.collect and df.take return a list of Rows.
Lastly, you can collect to the driver using toPandas() and use pandas tools.
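For example, a minimal sketch of the toPandas() route (only viable if the dataframe fits in driver memory; the output path is a placeholder):
pdf = df.toPandas()                              # collect everything to the driver
pdf.to_csv('output.txt', sep='|', index=False)   # pandas writes the |-delimited file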

In Spark 2.0+, you can use the built-in CSV writer. The delimiter is , by default, and you can set it to |:
df.write \
.format('csv') \
.options(delimiter='|') \
.save('target/location')

Related

How to split JSON objects into words in pyspark

I am loading a dataframe of tweets as JSON objects in pyspark.
I am trying to split the text into individual words, and then select all the words that include a #. I want to avoid using regular Python functions and try to stick with what is available inside of pyspark.
I am running the code in a Jupyter notebook, but this is the code overall:
import findspark
findspark.init()
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession \
.builder \
.appName("Jupyter Spark shell") \
.getOrCreate()
sc = spark.sparkContext
folder = 'tweet-id-text-345'
tweets = spark.read.format("json").option("delimeter", "\t").load(folder)
tweets.count()
I am very unsure how to do this. The desired result would be a sort of array of all the different words, and another for the words that include a #. These would be two separate lists.
Here is what the content looks like
+------------------------+
| text|
+------------------------+
| โปรทุนน้อย สุดประ...|
| RT #sOLehOXClj1XE...|
|RT #rkayama: 論文「関...|
| SixTONES OneSTのグッ...|
| मुख्यमंत्री #mlkh...|
+------------------------+
only showing top 5 rows
Assuming that the JSON follows this format, the text of the tweet is stored in a field called text.
The text column is split into single words and the resulting array is filtered using rlike:
from pyspark.sql import functions as f
df=spark.read.option("multiline", "true").json(<...>).select("text")
df.withColumn("all_words", f.split("text", " "))\
.withColumn("only_hash", f.expr("filter(all_words, w -> rlike(w, '.*#.*'))")) \
.show(truncate=False)
If the original text was hello #world how a#re you today#, the output would be:
+--------------------------------+---------------------------------------+----------------------+
|text |all_words |only_hash |
+--------------------------------+---------------------------------------+----------------------+
|hello #world how a#re you today#|[hello, #world, how, a#re, you, today#]|[#world, a#re, today#]|
+--------------------------------+---------------------------------------+----------------------+
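If you actually want the two separate lists the question asks for (all words, and only the words containing #), here is a rough sketch on top of the above; it collects to the driver, so it is only suitable for small results:
from pyspark.sql import functions as f

words = df.withColumn("all_words", f.split("text", " ")) \
    .withColumn("only_hash", f.expr("filter(all_words, w -> rlike(w, '.*#.*'))"))

# flatten the per-tweet arrays into two plain Python lists on the driver
all_words = [w for row in words.select("all_words").collect() for w in row[0]]
hash_words = [w for row in words.select("only_hash").collect() for w in row[0]]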

Databricks pyspark, Difference in result of Dataframe.count() and Display(Dataframe) while using header='false'

I am reading a CSV file (present on Azure Data Lake Store) into a dataframe with the following code:
df = spark.read.load(filepath, format="csv", schema = mySchema, header="false", mode="DROPMALFORMED");
The file at filepath contains 100 rows plus a header. I want to ignore the header while reading, so I set header="false" (as sometimes the file comes with a header and sometimes not).
After reading, when I display the dataframe with display(df) I get all the data, 100 rows, which is correct. But when I check the count with df.count() it reports 101 rows. Does the dataframe count include the header, or am I missing something?
mySchema and filepath are already defined in separate cells.
You have mode="DROPMALFORMED" while reading the CSV file.
When there are malformed records, Spark drops them in df.show() but still counts them in df.count().
In your case, since header is false and a schema is specified, Spark reads the data according to your specified types; records that don't match (such as the header row) are simply not shown.
Example:
#sample data
#cat employee.csv
#id,name,salary,deptid
#1,a,1000,101
#2,b,2000,201
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
ss=StructType([StructField("id",IntegerType()),StructField("name",StringType()),StructField("salary",StringType()),StructField("deptid",StringType())])
df=spark.read.load("employee.csv",format="csv",schema=ss,header="false",mode="DROPMALFORMED")
df.show()
#+---+----+------+------+
#| id|name|salary|deptid|
#+---+----+------+------+
#| 1| a| 1000| 101|
#| 2| b| 2000| 201|
#+---+----+------+------+
#issue with df.count
df.count()
#3, but it should be 2 (the dropped header row is still counted)
To fix:
Add an isNotNull filter while reading the dataframe.
from pyspark.sql.functions import col
df=spark.read.load("employee.csv",format="csv",schema=ss,header="false",mode="DROPMALFORMED").filter(col("id").isNotNull())
df.show()
#+---+----+------+------+
#| id|name|salary|deptid|
#+---+----+------+------+
#| 1| a| 1000| 101|
#| 2| b| 2000| 201|
#+---+----+------+------+
#fixed count
df.count()
#2
To view the malformed data, remove the mode option:
spark.read.load("employee.csv",format="csv",schema= mySchema,header="false").show(100,False)
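Another way to inspect the bad rows (a sketch, assuming Spark 2.3+): read in PERMISSIVE mode and capture malformed lines in a _corrupt_record column instead of dropping them.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

ss_debug = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", StringType()),
    StructField("deptid", StringType()),
    StructField("_corrupt_record", StringType())  # holds the raw malformed line
])
bad = spark.read.load("employee.csv", format="csv", schema=ss_debug,
                      header="false", mode="PERMISSIVE",
                      columnNameOfCorruptRecord="_corrupt_record")
bad.cache()  # cache before querying the corrupt-record column on its own
bad.filter(bad["_corrupt_record"].isNotNull()).show(truncate=False)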
As per the pyspark documentation,
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
header – uses the first line as names of columns. If None is set, it uses the default value, false.
You might need to standardize the way you get your data, either with headers or without headers, and then set the flag.
If you set header=False, then the spark engine will simply read the first row as a data row.
To answer your question, Dataframe count does not count header.
I would recommend reading the data first and then dropping the headers for debugging purposes.
Also, display(df) is a notebook utility provided by IPython/Databricks, not a Spark operation; I would use dataframe.show(), which is a Spark-provided utility, for debugging purposes.
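A sketch of the "read first, then drop the header row" approach suggested above (the _c0 name is what Spark assigns when no header is used, and "id" stands in for whatever your header's first field is):
from pyspark.sql.functions import col

raw = spark.read.load(filepath, format="csv", header="false")  # all columns read as strings
clean = raw.filter(col("_c0") != "id")   # drop the header row if one happens to be present
clean.count()                            # now matches what display(df) shows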

Create one dataframe from multi csv files with different headers in Spark

In Spark, with pyspark, I want to create one dataframe (where the path is actually a folder in S3) from multiple CSV files that have some common columns and some different columns.
To say it more simply, I want only one dataframe from multiple CSV files with different headers.
For example, I can have a file with the header "raw_id, title, civility", and another file with the header "raw_id, first_name, civility".
This is my code in Python 3:
df = spark.read.load(
    s3_bucket + 'data/contacts/normalized' + '/*/*/*/*',
    format = 'csv',
    delimiter = '|',
    encoding = 'utf-8',
    header = 'true',
    quote = ''
)
This is an example of file_1.csv :
|raw_id|title|civility|
|1 |M |male |
And an example of file2.csv :
|raw_id|first_name|civility|
|2 |Tom |male |
The result i expect in my dataframe is :
|raw_id|first_name|title|civility|
|1 | |M |male |
|2 |Tom | |male |
But what is happening is that I get all the columns united; however, after the first file the data is not in the right place.
Do you know how to do this?
Thank you very much in advance.
You need to load each of them in a different dataframe and join them together on the raw_id column.
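A minimal sketch of that idea (file paths are hypothetical; the join is on the columns the two files share, raw_id and civility, using a full outer join so rows present in only one file are kept, with nulls for the missing columns):
df1 = spark.read.csv(s3_bucket + 'data/contacts/normalized/file_1.csv',
                     sep='|', header=True)
df2 = spark.read.csv(s3_bucket + 'data/contacts/normalized/file2.csv',
                     sep='|', header=True)
merged = df1.join(df2, on=['raw_id', 'civility'], how='full_outer')
merged.select('raw_id', 'first_name', 'title', 'civility').show()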

Filtering Spark Dataframe

I've created a dataframe as:
ratings = imdb_data.sort('imdbRating').select('imdbRating').filter('imdbRating is NOT NULL')
Upon doing ratings.show() as shown below, I can see that
the imdbRating field has a mix of data such as random strings, movie titles, movie URLs and actual ratings. The dirty data looks like this:
+--------------------+
| imdbRating|
+--------------------+
|Mary (TV Episode...|
| Paranormal Activ...|
| Sons (TV Episode...|
| Spion (2011)|
| Winter... und Fr...|
| and Gays (TV Epi...|
| grAs - Die Serie...|
| hat die Wahl (2000)|
| 1.0|
| 1.3|
| 1.4|
| 1.5|
| 1.5|
| 1.5|
| 1.6|
| 1.6|
| 1.7|
| 1.9|
| 1.9|
| 1.9|
+--------------------+
only showing top 20 rows
Is there any way I can filter out the unwanted strings and just get the ratings? I tried using a UDF as:
ratings_udf = udf(lambda imdbRating: imdbRating if isinstance(imdbRating, float) else None)
and tried calling it as:
ratings = imdb_data.sort('imdbRating').select('imdbRating')
filtered = ratings.withColumn('imdbRating',ratings_udf(ratings.imdbRating))
The problem with the above is that the UDF receives every value as a string (the column type is string), so the isinstance(..., float) check never passes and None is returned for all the values.
Is there any straightforward way to filter out this data?
Any help will be much appreciated. Thank you.
Finally, I was able to resolve it. The problem was that there was some corrupt data with not all fields present. First, I tried using pandas, reading the CSV file in pandas as:
pd_frame = pd.read_csv('imdb.csv', error_bad_lines=False)
This skipped/dropped the corrupt rows which had fewer columns than expected. I then tried to read the above pandas dataframe, pd_frame, into Spark using:
imdb_data= spark.createDataFrame(pd_frame)
but got an error because of a mismatch while inferring the schema. It turns out the Spark CSV reader has a similar option which drops the corrupt rows:
imdb_data = spark.read.csv('imdb.csv', header='true', mode='DROPMALFORMED')
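Another option (a sketch, independent of DROPMALFORMED): cast imdbRating to float, which turns every non-numeric string into null, and filter the nulls out.
from pyspark.sql.functions import col

ratings = (imdb_data
    .select(col('imdbRating').cast('float').alias('imdbRating'))  # non-numeric -> null
    .filter(col('imdbRating').isNotNull())
    .sort('imdbRating'))
ratings.show()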

Remove column names from spark dataframe while storing it as textfile

My dataframe output is as below,
DF.show(2)
+--------------+
|col1|col2|col3|
+--------------+
| 10| 20| 30|
| 11| 21| 31|
+--------------+
After saving it as a text file with DF.rdd.saveAsTextFile("path"), the output looks like:
Row(col1=u'10', col2=u'20', col3=u'30')
Row(col1=u'11', col2=u'21', col3=u'31')
The dataframe has millions of rows and 20 columns. How can I save it as a text file like below, i.e. without column names and Python unicode markers?
10|20|30
11|21|31
While creating the initial RDD I used the code below to remove the unicode markers, though I am still getting them:
data = sc.textFile("file.txt")
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))
Thanks in advance !
I think you can just do
DF.rdd.map(lambda l: l[0] + '|' + l[1] + '|' + l[2]).saveAsTextFile(...)
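Since the dataframe actually has 20 columns, a generic version of the same idea (a sketch; values are converted with str() before joining, which is fine for plain ASCII data):
DF.rdd.map(lambda row: '|'.join(str(c) for c in row)).saveAsTextFile("path")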
In Spark 2.0 you can write dataframes out directly to CSV, which is all I think you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
So in your case, you could just do something like
df.write.option("sep", "|").option("header", "false").csv("some/path/")
There is a Databricks package that provides this functionality in Spark 1.x:
https://github.com/databricks/spark-csv
As far as converting your unicode strings to ascii, see this question: Convert a Unicode string to a string in Python (containing extra symbols)
