Remove column names from spark dataframe while storing it as textfile - python

My dataframe output is as below:
DF.show(2)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|  10|  20|  30|
|  11|  21|  31|
+----+----+----+
After saving it as a text file with DF.rdd.saveAsTextFile("path"), the output looks like:
Row(col1=u'10', col2=u'20', col3=u'30')
Row(col1=u'11', col2=u'21', col3=u'31')
The dataframe has millions of rows and 20 columns. How can I save it as a text file like the one below, i.e., without column names and Python unicode markers?
10|20|30
11|21|31
While creating the initial RDD I used the code below to remove the unicode markers, but I am still getting them:
data = sc.textFile("file.txt")
trans = data.map(lambda x: x.encode("ascii", "ignore").split("|"))
Thanks in advance!

I think you can just do
.map(lambda l: l[0] + '|' + l[1] + '|' + l[2]).saveAsTextFile(...)
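With 20 columns it is easier to join all the fields instead of indexing them one by one; a minimal sketch of the same idea, reusing the "path" from your question (str() assumes the values are plain ASCII, as in your sample):
# Join every field of each Row with '|' instead of listing columns by hand.
# Converting each field with str() also drops the u'' markers from the text output.
DF.rdd.map(lambda row: "|".join(str(c) for c in row)).saveAsTextFile("path")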

In Spark 2.0 you can write dataframes out directly to CSV, which is all I think you need here. See: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html
So in your case, you could just do something like
df.write.option("sep", "|").option("header", "false").csv("some/path/")
There is a Databricks plugin that provides this functionality in Spark 1.x:
https://github.com/databricks/spark-csv
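With that plugin, a write along these lines should work (a sketch, assuming the spark-csv package is available on your cluster; the path mirrors the 2.0 example above):
DF.write \
  .format("com.databricks.spark.csv") \
  .option("delimiter", "|") \
  .option("header", "false") \
  .save("some/path/")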
As far as converting your unicode strings to ascii, see this question: Convert a Unicode string to a string in Python (containing extra symbols)

Related

How to create new rows in spark dataframe containing special characters?

I am trying to insert test data into a PySpark dataframe (Spark 1.6 and Python 2.7), but one of the columns needs to have "special characters" (such as 'ç' and 'ã'). The issue is that even after setting the default encoding and creating the dataframe as follows:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
columns = df_origin.columns
values = [('SOLICITAÇÃO','0', True, 'CAN'),('SOLICITAÇÃO','0', False, 'CAN')]
df_teste = _hiveContext.createDataFrame(values, columns)
It still creates the dataframe with wrong encoding:
+--------------------+---------------+----------+-----------------+
|          ds_evento_|ds_tipo_evento_|in_evento_|cd_painel_evento_|
+--------------------+---------------+----------+-----------------+
|    SOLICITA����O...|              0|      true|              CAN|
|    SOLICITA����O...|              0|     false|              CAN|
+--------------------+---------------+----------+-----------------+
I also tried:
df_teste = df_teste.withColumn('ds_evento_', encode('ds_evento_','iso-8859-1'))
df_teste = df_teste.withColumn('ds_evento_', encode('ds_evento_','utf-8'))
But the error persisted. How can I manually create test rows when special characters are needed?
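One possible approach, assuming the root cause is that Python 2.7 byte-string literals are being decoded with the wrong codec, is to pass unicode literals instead; a minimal sketch reusing the question's names:
# -*- coding: utf-8 -*-
# Sketch: in Python 2.7, plain 'SOLICITAÇÃO' is a byte string whose encoding Spark has to guess;
# u'...' unicode literals remove that guess. The rest mirrors the question's code.
columns = df_origin.columns
values = [(u'SOLICITAÇÃO', u'0', True, u'CAN'),
          (u'SOLICITAÇÃO', u'0', False, u'CAN')]
df_teste = _hiveContext.createDataFrame(values, columns)
df_teste.show(truncate=False)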

Trim String Characters in Pyspark dataframe

Suppose I have a dataframe in which a column contains values like:
ABC00909083888
ABC93890380380
XYZ7394949
XYZ3898302
PQR3799_ABZ
MGE8983_ABZ
I want to trim these values: remove the first 3 characters, and drop the trailing _ABZ where it appears, so that they become:
00909083888
93890380380
7394949
3898302
3799
8983
I tried some methods, but they did not work:
from pyspark.sql import functions as f
new_df = df.withColumn("new_column",
    f.when((condition on some column),
           f.substring('Existing_COL', 4, f.length(f.col("Existing_COL")))))
Can anyone please tell me which function I can use in PySpark? trim only removes whitespace and tab characters.
Based upon your input and expected output, see the logic below:
from pyspark.sql.functions import *

df = spark.createDataFrame(data=[("ABC00909083888",), ("ABC93890380380",), ("XYZ7394949",),
                                 ("XYZ3898302",), ("PQR3799_ABZ",), ("MGE8983_ABZ",)],
                           schema=["values"])

(df.withColumn("new_vals", when(col('values').rlike("(_ABZ$)"),
                                regexp_replace(col('values'), r'(_ABZ$)', '')).otherwise(col('values')))
   .withColumn("final_vals", expr("substring(new_vals, 4, length(new_vals))"))
).show()
Output
+--------------+--------------+-----------+
|        values|      new_vals| final_vals|
+--------------+--------------+-----------+
|ABC00909083888|ABC00909083888|00909083888|
|ABC93890380380|ABC93890380380|93890380380|
|    XYZ7394949|    XYZ7394949|    7394949|
|    XYZ3898302|    XYZ3898302|    3898302|
|   PQR3799_ABZ|       PQR3799|       3799|
|   MGE8983_ABZ|       MGE8983|       8983|
+--------------+--------------+-----------+
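As an aside, the same result can be obtained with a single regexp_replace that strips the leading three characters and the trailing _ABZ in one pass (a sketch based on the sample data above):
from pyspark.sql.functions import regexp_replace, col

# '^.{3}' removes the first three characters; '_ABZ$' removes the suffix when present.
df.withColumn("final_vals", regexp_replace(col("values"), r"^.{3}|_ABZ$", "")).show()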
If I understand you correctly, and if you don't insist on using the PySpark substring or trim functions, you can easily define a function that does what you want and then use it as a UDF in Spark:
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def mysub(word):
    if word.endswith('_ABZ'):
        word = word[:-4]
    return word[3:]

udf1 = udf(lambda x: mysub(x), StringType())
df.withColumn('new_label', udf1('label')).show()
The output will be like:
+---+--------------+-----------+
| id|         label|  new_label|
+---+--------------+-----------+
|  1|ABC00909083888|00909083888|
|  2|ABC93890380380|93890380380|
|  3|    XYZ7394949|    7394949|
|  4|    XYZ3898302|    3898302|
|  5|   PQR3799_ABZ|       3799|
|  6|   MGE8983_ABZ|       8983|
+---+--------------+-----------+
Please let me know if I got you wrong in some cases.

Python - Write data from list into specific Excel column

I have a list of data which I want to write into a specific column (column B, starting at cell B2) in Excel.
Input example:
mydata =[12,13,14,15]
Desired Output in Excel:
A2| B2|
| 12|
| 13|
| 14|
| 15|
I have tried using openpyxl to access the specific sheet (which works fine) and the specific cell (B2), but it throws an error when writing to the Excel file because the value is a list. It works fine if I assign a single value, as in the code extract below:
mydata= my_wb['sheet2']['B2'] = 4
Can anyone point me in the right direction?
Iterate over the list and paste each value into the desired row in column B:
for i, n in enumerate(mydata):
    my_wb["sheet2"].cell(i+2, 2).value = n
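For completeness, a minimal end-to-end sketch (the workbook filename is an assumption; the sheet name comes from the question, and nothing reaches disk until save is called):
from openpyxl import load_workbook

mydata = [12, 13, 14, 15]

wb = load_workbook("myfile.xlsx")   # hypothetical workbook name
ws = wb["sheet2"]                   # sheet name taken from the question

# Write the list down column B, starting at row 2 (cell B2).
for i, n in enumerate(mydata):
    ws.cell(row=i + 2, column=2, value=n)

wb.save("myfile.xlsx")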

Databricks pyspark, Difference in result of Dataframe.count() and Display(Dataframe) while using header='false'

I am reading a CSV file (stored on Azure Data Lake Store) into a dataframe with the following code:
df = spark.read.load(filepath, format="csv", schema = mySchema, header="false", mode="DROPMALFORMED");
The file at filepath contains 100 data rows plus a header. I want to ignore the header while reading, so I set header="false" (as the file sometimes comes with a header and sometimes not).
After reading it into a dataframe, display(df) shows all the data, 100 rows, which is correct. But df.count() returns 101. Does the dataframe count include the header, or am I missing something?
mySchema and filepath are already defined in separate cells.
You have mode="DROPMALFORMED" while reading the CSV file.
When there are malformed records, Spark drops them in df.show() but still counts them in df.count().
In your case, because header is false and a schema is specified, Spark parses every line according to your specified types; lines with issues (such as the header line) are not shown.
Example:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# sample data
# cat employee.csv
# id,name,salary,deptid
# 1,a,1000,101
# 2,b,2000,201

ss = StructType([StructField("id", IntegerType()),
                 StructField("name", StringType()),
                 StructField("salary", StringType()),
                 StructField("deptid", StringType())])

df = spark.read.load("employee.csv", format="csv", schema=ss,
                     header="false", mode="DROPMALFORMED")
df.show()
#+---+----+------+------+
#| id|name|salary|deptid|
#+---+----+------+------+
#|  1|   a|  1000|   101|
#|  2|   b|  2000|   201|
#+---+----+------+------+

# issue in df.count
df.count()
#3  (should be 2)
To fix:
Add an isNotNull filter on a required column while reading the dataframe (col is from pyspark.sql.functions).
df = spark.read.load("employee.csv", format="csv", schema=ss, header="false",
                     mode="DROPMALFORMED").filter(col("id").isNotNull())
df.show()
#+---+----+------+------+
#| id|name|salary|deptid|
#+---+----+------+------+
#|  1|   a|  1000|   101|
#|  2|   b|  2000|   201|
#+---+----+------+------+
# fixed count
df.count()
#2
To view the malformed data, remove the mode option:
spark.read.load("employee.csv", format="csv", schema=mySchema, header="false").show(100, False)
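If you are on a recent Spark 2.x, another way to see exactly which lines are being dropped (an assumption that your version supports columnNameOfCorruptRecord for CSV) is to keep the default PERMISSIVE mode and capture bad lines in an extra column:
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col

# Extend the example schema with a column that receives the raw text of unparseable lines.
ss_debug = StructType(ss.fields + [StructField("_corrupt_record", StringType())])

raw = spark.read.load("employee.csv", format="csv", schema=ss_debug,
                      header="false",
                      columnNameOfCorruptRecord="_corrupt_record").cache()
# cache() sidesteps the Spark 2.3+ restriction on querying only the corrupt-record column.
raw.filter(col("_corrupt_record").isNotNull()).show(truncate=False)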
As per the pyspark documentation,
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader
header – uses the first line as names of columns. If None is set, it uses the default value, false.
You might need to standardize the way you get your data, either with headers or without headers, and then set the flag.
If you set header=False, then the Spark engine will simply read the first row as a data row.
To answer your question, the dataframe count does not treat the header specially; with header=False the header line is just read (and counted) as another data row.
I would recommend reading the data first and then dropping the headers for debugging purposes.
Also, display(df) is a notebook operation provided by the IPython/Databricks environment; I would use dataframe.show(), which is a Spark-provided utility, for debugging purposes.
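If the files genuinely arrive both with and without a header, one pragmatic sketch is to peek at the first line before setting the flag (the check against the first column name is an assumption about your data):
# Read the first line of the file as plain text and guess whether it is a header row.
first_line = spark.read.text(filepath).first()[0]
has_header = first_line.lower().startswith("id,")   # hypothetical first column name

df = spark.read.load(filepath, format="csv", schema=mySchema,
                     header=str(has_header).lower(), mode="DROPMALFORMED")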

Write spark dataframe to file using python and '|' delimiter

I have constructed a Spark dataframe from a query. What I wish to do is print the dataframe to a text file with all information delimited by '|', like the following:
+-------+----+----+----+
|Summary|col1|col2|col3|
+-------+----+----+----+
|row1   |1   |14  |17  |
|row2   |3   |12  |2343|
+-------+----+----+----+
How can I do this?
You can try writing to CSV, choosing | as the delimiter:
df.write.option("sep","|").option("header","true").csv(filename)
This would not be 100% the same but would be close.
Alternatively, you can collect to the driver and do it yourself, e.g.:
myprint(df.collect())
or
myprint(df.take(100))
df.collect and df.take return a list of Rows.
Lastly, you can collect to the driver using toPandas and use pandas tools.
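A sketch of that last option, which is fine when the data fits in driver memory (the output filename is hypothetical):
# Collect to the driver and let pandas write the '|'-delimited text file.
pdf = df.toPandas()
pdf.to_csv("output.txt", sep="|", index=False)   # index=False drops pandas' row index column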
In Spark 2.0+, you can use the built-in CSV writer. The delimiter is , by default, and you can set it to |:
df.write \
  .format('csv') \
  .options(delimiter='|') \
  .save('target/location')
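Note that save() writes a directory of part files rather than a single file; if the data is small enough to fit in one partition, a coalesce(1) sketch like this is a common way to end up with a single part file:
# Optional: coalesce to one partition so the output directory holds a single part file.
df.coalesce(1).write \
  .format('csv') \
  .options(delimiter='|') \
  .save('target/location')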
