Converting zip compressed csv to parquet using pyspark - python

I have a zip-compressed CSV stored on S3. I would like to convert this file to Parquet format, partitioned on a specific column in the CSV. When I try the following (using Python 3.6.5 and PySpark 2.7.14):
from pyspark.sql import SQLContext, SparkSession
spark = (SparkSession.builder
    .appName("Python Spark SQL basic example")
    .config('spark.hadoop.fs.s3a.access.key', '<my access key>')
    .config('spark.hadoop.fs.s3a.secret.key', '<my secret key>')
    .getOrCreate())
df = spark.read.csv("s3a://mybucket/path/myfile.zip")
df.show(n=10)
This is the output:
+--------------------+
| _c0|
+--------------------+
|PK- 4*PM<ȏ...|
|����W����lq��...|
|jk�ulE����
Uձ�...|
|U횵Сc�=t�kd�0z...|
|T�;t��gն>t�:�y...|
|ݵK!뼠PT���DЉ*�...|
|�}�B��h)t����H!k?...|
| ��y�B֧|
|��� �1�NTȞB(�...|
+--------------------+
only showing top 10 rows
When I convert to parquet using:
df.write.partitionBy("column").parquet("s3a://otherbucket/path/myfile_partitioned",mode="overwrite")
The results in S3 do not match the actual column values in the source files.
I have also tried using:
sqlctx = SQLContext(spark)
df = sqlctx.read.csv("s3a://cta-ridership/seeds/nm45dayall_2.zip")
But the results are the same. Is something wrong with my CSV? I'm new to PySpark, so I might be missing something basic.
UPDATE: Per @Prazy's help, I've updated my code to:
spark = (SparkSession.builder
    .appName("Python Spark SQL basic example")
    .config('spark.hadoop.fs.s3a.access.key', '<my key>')
    .config('spark.hadoop.fs.s3a.secret.key', '<my secret key>')
    .getOrCreate())
sc = spark.sparkContext
rdd = sc.textFile("s3a://mybucket/mypath/myfile.zip")
print(rdd.take(10))
But this still returns:
['PK\x03\x04-\x00\x00\x00\t\x004*PM<ȏ\x1f��������\x13\x00\x14\x00nm45dayall_2017.csv\x01\x00\x10\x00� �\x07\x00\x00\x00�_J�\x00\x00\x00\x00����㺳���;�7Q�;�"%R�a�{;ܶg��3��9���\x0b\x17I��<Y��of��ڿU��\x19�\x01A\x10\x04\x12�?��\x7f�\x1f������/����������\x7f��\x7f�����?���������?��\x7f�������\x7f����?�����\x7f���������������������\x1f��\x7f����_����\x7f�\x7f��n\x7f������?�����?��������_�\x7f\x1b����\x7f���������g�\\�i�]���\x7f�����3���������ǟ��<_����}���_�������?�\x7f�n�1h��t��.5�Z\x05ͺk��1Zߎ��\x16���ڿE�A��\x1fE�<UAs\x11���z�\\�n��s��\x1ei�XiĐ\x1ej\x0cݪ������`�yH�5��8Rh-+�K�\x11z�N�)[�v̪}', "���\x10�W�\x07���\x12l\x10q��� �qq,i�6ni'��\x10\x14�h\x08��V\x04]��[P�!h�ڢ���GwF\x04=�.���#��>����h", 'jk�\x1culE\x15����\x0cUձ\x7f���#\x1d��\x10Tu���o����\x0eݎ\x16�E\x0f\x11r�q\x08Ce[�\x0c\x0e�s\x10z�?Th\x1aj��O\x1f�\x0f�\x10A��X�<�HC�Y�=~;���!', 'U횵Сc�=t�k\x15d�0\x14z\x16\x1d��R\x05M��', 'T�;t��\x10\x11gն>t�\x01:�y:�c�U��\x1d\x7ff�Т�a', 'ݵ\x19K!뼠PT�\x11��DЉ*\x10\u2d2e�<d� Й��\x08AQ\x03\x04AQ�� {��P����\x1e��Z\x7f���AG�3�b\x19T�E�%;�"ޡ�El�rס�}��qg���qg|������7�8�k\x1e:j�\x7f���c�Bv���\\t�[�ܚ�nz��PU���(\x14��\x08�����CϢc�=|\x14���Ⱥ', ')d]�\x10Z�o\x0e:�v����\x0er�oѣj��\x06DA%b�>', '�}�B��h)t����H!k?R�zf)���5k�B��h?�h���Ao}�S��\x17i\x14�\x1eU', '��y�B֧', '��\x16� �1�NT\x1b1ȞB(�\x16�k\x7f�B!�d��m\x0c:�\x03��˵\x1f�����ޥa�\x16#� ���V"Ա�k']
UPDATE 2:
Again, thanks to Prazy for their help. I'm trying to convert the RDD to a dataframe using:
spark = (SparkSession.builder
    .appName("Python Spark SQL basic example")
    .config('spark.hadoop.fs.s3a.access.key', '<mykey>')
    .config('spark.hadoop.fs.s3a.secret.key', '<myotherkey>')
    .getOrCreate())
sc = spark.sparkContext
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

schema = StructType([
    StructField("YYYYMMDD", IntegerType(), True),
    StructField("ENTRANCE_ID", IntegerType(), True),
    StructField("FARE_MEDIA_TYPE", IntegerType(), True),
    StructField("TRANS_EVENT", IntegerType(), True),
    StructField("HALFHOUR", FloatType(), True),
    StructField("RIDES", IntegerType(), True)])
rdd = sc.textFile("s3a://mybucket/path/myfile.zip")
df = spark.createDataFrame(rdd, schema)
df.show(n=10)

Zip files are not directly supported by Spark. You can follow the links here and here to try the workaround. Use gzip or another natively supported compression format if possible.
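One common workaround is a sketch along these lines: read the archive with binaryFiles, decompress it in Python with the standard zipfile module, and hand Spark the decompressed lines. The helper name zip_bytes_to_lines is hypothetical, and the Spark calls in the trailing comment assume a live session:

```python
import io
import zipfile

def zip_bytes_to_lines(content):
    """Yield decoded text lines from every member of a zip archive,
    given the archive's raw bytes."""
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                for line in io.TextIOWrapper(member, encoding="utf-8"):
                    yield line.rstrip("\n")

# With Spark (binaryFiles yields (path, bytes) pairs):
#   lines = sc.binaryFiles("s3a://mybucket/path/myfile.zip") \
#             .flatMap(lambda kv: zip_bytes_to_lines(kv[1]))
#   df = spark.read.csv(lines)  # Spark 2.2+ accepts an RDD of CSV strings
```

Note this pulls each whole archive onto one executor, so it only suits zip files that fit in executor memory.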

Related

filter on the pyspark dataframe schema to get new dataframe with columns having specific type

I want to create a generic function in PySpark that takes a dataframe and a datatype as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python, and I am stuck at the point where I cannot figure out how to do that.
I have a scala representation of the code that does the same thing.
//sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0)).toDF("Id","City","Salary","Increase").withColumn("RefID",$"Id")
import org.apache.spark.sql.functions.col
def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols:_*)
}
val res = selectByType(IntegerType, df)
res is the dataframe that has only integer columns, in this case the salary column, and we have dropped all the other columns that have different types, dynamically.
I want the same behaviour in PySpark, but I am not able to accomplish that.
This is what I have tried
# sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("raise", DoubleType(), True)
])
data2 = [("James","","Smith","36636","M",3000,2.5),
("Michael","Rose","","40288","M",4000,4.7),
("Robert","","Williams","42114","M",4000,8.9),
("Maria","Anne","Jones","39192","F",4000,0.0),
("Jen","Mary","Brown","","F",-1,-1.2)
]
df = spark.createDataFrame(data=data2,schema=schema)
# getting the column list from the schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType, DoubleType]  # column datatypes that I want
out = filter(lambda x: x.dataType.isin(datatypes), pschema)  # gives invalid syntax error
Can someone help me figure out what I am doing wrong? The Scala code only passes a single datatype, but per my use case I want to handle the scenario where we can pass multiple datatypes and get back a dataframe with only the columns of those specified datatypes.
If someone can give me an idea of how to make it work for a single datatype, I can then try to do the same for multiple datatypes.
Note: the sample data for Scala and PySpark is different because I copied the PySpark sample data from somewhere just to speed things up; I am only concerned about the final output requirement.

How to read in JSON so each element of dict/hash is a new row in dataframe?

I'm attempting to read a large dataset written in JSON into a dataframe.
A minimal working example of this dataframe:
{"X":{"sex":"Male","age":57,"BMI":"19.7"},"XX":{"BMI":"30.7","age":44,"sex":"Female"},"XXX":{"age":18,"sex":"Female","BMI":"22.3"},"XXXX":{"sex":"Male","age":29,"BMI":"25.7"},"ZZZ":{"sex":"Male","age":61,"BMI":"40.5"}}
However, the dataset is not being read correctly, as it should have about 10,999 elements, and I'm only getting 1.
The JSON is a hash/dict where each element should be a new row.
I've tried
df = spark.read.option.json("dbfs:/FileStore/shared_uploads/xyz/data.json")
df = spark.read.option("multiline", "true").json("dbfs:/FileStore/shared_uploads/xyz/data.json")
I've also tried inferSchema, but this doesn't interpret the schema even close to correctly: I still get 1 row.
and made a custom schema, where each field is a sub-key of each row.
e.g.
custom_schema = StructType([
    StructField('Admission_Date', StringType(), True),
    StructField('BMI', StringType(), True),
    StructField('age', IntegerType(), True),
    StructField('latest_date', StringType(), True),
    ...
    StructField('sex', StringType(), True)
])
and then load with the custom schema:
df = spark.read.option("multiline", "true").schema(custom_schema).json("dbfs:/FileStore/shared_uploads/xyz/data.json")
but this again yields a single row.
How can I load this JSON so that every key is considered a single row?
You can create an array column from all the dataframe columns, explode it, and star-expand the resulting struct column:
from pyspark.sql import functions as F
df1 = df.select(
    F.explode(F.array(*df.columns)).alias("rows")
).select("rows.*")
df1.show()
#+----+---+------+
#| BMI|age| sex|
#+----+---+------+
#|19.7| 57| Male|
#|30.7| 44|Female|
#|22.3| 18|Female|
#|25.7| 29| Male|
#|40.5| 61| Male|
#+----+---+------+

pyspark corrupt_record while reading json file

I have a JSON file which can't be read by Spark (spark.read.json("xxx").show()):
{'event_date_utc': None,'deleted': False, 'cost':1 , 'name':'Mike'}
The problem seems to be that None and False are not in single quotes, and Spark can't default them to boolean, null, or even string.
I tried giving my Spark read a schema instead of having it inferred, forcing those 2 columns to be string, and got the same error.
It feels to me like Spark tries to read the data first, then applies the schema, and fails in the read part.
Is there a way to tell Spark to read those values without modifying the input data? I am using Python.
Your input isn't valid JSON, so you can't read it using spark.read.json. Instead, you can load it as a text DataFrame with spark.read.text and parse the stringified dict into JSON using a UDF:
import ast
import json
from pyspark.sql import functions as F
from pyspark.sql.types import *
schema = StructType([
    StructField("event_date_utc", StringType(), True),
    StructField("deleted", BooleanType(), True),
    StructField("cost", IntegerType(), True),
    StructField("name", StringType(), True)
])

dict_to_json = F.udf(lambda x: json.dumps(ast.literal_eval(x)))

df = spark.read.text("xxx") \
    .withColumn("value", F.from_json(dict_to_json("value"), schema)) \
    .select("value.*")
df.show()
#+--------------+-------+----+----+
#|event_date_utc|deleted|cost|name|
#+--------------+-------+----+----+
#|null |false |1 |Mike|
#+--------------+-------+----+----+
The JSON doesn't look good: field values need to be quoted.
You can eval the lines first, since they look like they're in Python dict format.
df = spark.createDataFrame(
    sc.textFile('true.json').map(eval),
    'event_date_utc string, deleted boolean, cost int, name string'
)
df.show()
+--------------+-------+----+----+
|event_date_utc|deleted|cost|name|
+--------------+-------+----+----+
| null| false| 1|Mike|
+--------------+-------+----+----+

How to Read a parquet file , change datatype and write to another Parquet file in Hadoop using pyspark

My source parquet file has everything as string. My destination parquet file needs to convert this to different datatype like int, string, date etc. How do I do this?
You may want to apply a user-defined schema to speed up data loading.
There are two ways to do that:
Using a DDL-formatted string:
spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")
Using a StructType schema:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")
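If the schema is long, a tiny helper (hypothetical, shown only to illustrate the DDL format Spark accepts) can build the DDL string from an ordered name-to-type mapping:

```python
def ddl_from_mapping(mapping):
    """Build a DDL-formatted schema string, e.g. 'a INT, b STRING',
    from an ordered {column_name: type_name} mapping."""
    return ", ".join(f"{name} {dtype.upper()}" for name, dtype in mapping.items())

# Usage with a reader (assumes a live session):
#   spark.read.schema(ddl_from_mapping({"a": "int", "b": "string", "c": "double"})) \
#        .parquet("test.parquet")
```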
You should read the file, then typecast the columns as required, and save them:
from pyspark.sql.functions import col

df = spark.read.parquet('/path/to/file')
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
Data file:
| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |
Script:
def createPrqtFParqt(datPath, parquetPath, inpustJsonSchema, outputdfSchema):
    print("## Parsing " + datPath)
    df = ssc.read.schema(outputdfSchema).parquet(datPath)
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)
Output:
An error occurred while calling Parquet.
Column: Alien_Dollardiff. Expected double, found BINARY.

What should I add to the code to avoid 'exceeds max allowed bytes' error using pyspark?

I have a dataframe with 4 million rows and 10 columns. I am trying to write this to a table in hdfs from the Cloudera Data Science Workbench using pyspark. I am running into an error when trying to do this:
[Stage 0:> (0 + 1) /
2]19/02/20 12:31:04 ERROR datasources.FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 318690577 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
I can break up the dataframe into 3 dataframes and perform the Spark write 3 separate times, but I would like to do this just once if possible, perhaps by adding something to the Spark code like coalesce.
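The error message itself points at one knob: raising spark.rpc.message.maxSize (in MiB, default 128) on the session builder. This is a config sketch, not a guaranteed fix, since the serialized pandas dataframe must still fit in a single RPC message:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName('Arin_Network')
    .config("spark.rpc.message.maxSize", "512")  # MiB; default is 128
    .getOrCreate())
```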
import pandas as pd
df=pd.read_csv('BulkWhois/2019-02-20_Arin_Bulk/Networks_arin_db_2-20-2019_parsed.csv')
'''PYSPARK'''
from pyspark.sql import SQLContext
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark import SparkContext
spark = SparkSession.builder.appName('Arin_Network').getOrCreate()
schema = StructType([
    StructField('NetHandle', StringType(), False),
    StructField('OrgID', StringType(), True),
    StructField('Parent', StringType(), True),
    StructField('NetName', StringType(), True),
    StructField('NetRange', StringType(), True),
    StructField('NetType', StringType(), True),
    StructField('Comment', StringType(), True),
    StructField('RegDate', StringType(), True),
    StructField('Updated', StringType(), True),
    StructField('Source', StringType(), True)])
dataframe = spark.createDataFrame(df, schema)
dataframe.write \
    .mode("append") \
    .option("path", "/user/hive/warehouse/bulkwhois_analytics.db/arin_network") \
    .saveAsTable("bulkwhois_analytics.arin_network")
User10465355 mentioned that I should use Spark directly. Doing this is simpler and is the correct way to accomplish it.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Networks').getOrCreate()
dataset = spark.read.csv('Networks_arin_db_2-20-2019_parsed.csv', header=True, inferSchema=True)
dataset.show(5)
dataset.write \
.mode("append") \
.option("path", "/user/hive/warehouse/analytics.db/arin_network") \
.saveAsTable("analytics.arin_network")
