pyspark dataframe is saved in s3 bucket with junk data - python

While trying to save a pyspark DataFrame as CSV directly to an S3 bucket,
the files get saved but they contain junk data, and every file is only 1 B in size.
Please help me figure out where I am going wrong.
Python code:
df.write.options("header","true").csv("s3a://example/csv")
I also tried this code:
df.coalesce(1).write.format("csv").option("header", "true").option("path", "s3://example/test.csv").save()
But I still do not get a proper CSV in the S3 bucket; the file contains junk data.

I think you are saving your DataFrame as Parquet, which is the default format.
(
    df.write.format("csv")
    .option("header", "true")
    .option("encoding", "UTF-8")
    .option("path", "s3a://example/csv")
    .save()
)
Note: the method is also option, not options.
Update
As @Samkart mentioned, you should check whether your encoding is correct. I have updated my answer to include the encoding option. You can check here for the encoding options in pyspark.
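A quick way to confirm that the write worked is to read the same prefix back with Spark and inspect a few rows. This is a minimal sketch, assuming an active SparkSession named spark with working s3a credentials and the example path from the question:
# Sketch: read the CSV back from the same S3 prefix to verify the contents.
# The path is the example prefix from the question, not a real bucket.
check = spark.read.option("header", "true").csv("s3a://example/csv")
check.printSchema()   # column names should come from the header row
check.show(5)         # rows should be readable text, not junk bytes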

Related

Pandas can not read S3 excel file. Error: Excel file format cannot be determined

I am reading Excel files (.xls) from S3 with pandas. The code works properly for a few files, but not for the rest. The files are received daily with different values per day (the Excel file structure is the same, so we can consider the files identical).
The error is:
ValueError: Excel file format cannot be determined, you must specify an engine manually.
at this line:
pd.read_excel(io.BytesIO(excel), sheet_name=sheet, index_col=None, header=[0])
I have tried all the solutions mentioned on the internet: specifying engine='openpyxl' gives the error:
zipfile.BadZipFile: File is not a zip file
and specifying the engine='xlrd' gives the error:
expected str, bytes or os.PathLike object, not NoneType
I am using boto3 to connect to S3 resource.
Once again, for a few files my code works fine.
What can be the cause of this different behaviour for Excel files that look identical?
My question is very similar to Excel file format cannot be determined with Pandas, randomly happening, but that one does not have a proper answer yet.
It's always possible that the files you are reading have mislabeled extensions, bad data, etc.
It's also not clear how you are getting 'excel' in io.BytesIO(excel)
See if something like this will work. This reads a .xls file; I'm able to return the contents of Sheet1 to a dataframe.
import boto3
import pandas as pd

bucket = 'your bucket'
key = 'test.xls'

# Download the object and hand the raw bytes to pandas
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=key)
pd.read_excel(obj['Body'].read(), sheet_name='Sheet1', index_col=None, header=0)
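Since a mislabeled extension is a common cause of this error, one hedged check is to look at the first bytes of the object before picking an engine. This sketch reuses the s3_client, bucket and key from the snippet above (the key name is illustrative) and relies on the standard file signatures: .xlsx files are ZIP archives starting with PK, legacy .xls files start with the OLE2 header.
import io

# Sketch: inspect the magic bytes to see what the ".xls" file really is.
body = s3_client.get_object(Bucket=bucket, Key=key)['Body'].read()

if body[:4] == b'PK\x03\x04':
    engine = 'openpyxl'   # actually an .xlsx (ZIP container) despite the extension
elif body[:8] == b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1':
    engine = 'xlrd'       # genuine legacy .xls (OLE2 compound file)
else:
    raise ValueError(f"Unexpected file signature: {body[:8]!r}")   # likely CSV/HTML renamed to .xls

df = pd.read_excel(io.BytesIO(body), sheet_name='Sheet1', engine=engine)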

Python : How to convert a CSV file stored in Byte stream to a List?

I am trying to get a CSV file from Azure Data Lake Gen2 and then perform some operations on each row. However, the requirement is not to download the file to a physical location, and hence I am using file_client.download_file().readall() to get the file as a byte stream.
However, I am unable to split the file rows/columns and get them into a list.
x = file_client.download_file()
bystream = x.readall()
# what should I do with this bystream?
I am, however, able to do this with a downloaded file by using with open(...) as f and then passing that file object to csv.reader().
Can someone please help with handling this byte stream?
A late update: I was able to resolve this issue by converting the downloaded stream to text I/O. (I didn't need to convert it to a list, as a pandas DataFrame was the better option.)
Here is the code snippet that worked:
import io
import pandas as pd

stream = io.StringIO(file_client.download_file().readall().decode("utf-8"))
dataframe1 = pd.read_csv(stream, sep="|")
Here, file_client is a connection to the Azure Data Lake where the CSV file is stored.
The code downloads the file as an in-memory stream and loads it into a dataframe (no need to write it to a local file location).
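If a plain list of rows is still needed rather than a DataFrame, the same in-memory text can be fed to csv.reader. This is a sketch under the same assumption that file_client points at the pipe-delimited CSV in the Data Lake:
import csv
import io

# Sketch: parse the decoded bytes into a list of rows instead of a DataFrame.
text = file_client.download_file().readall().decode("utf-8")
rows = list(csv.reader(io.StringIO(text), delimiter="|"))

header, data = rows[0], rows[1:]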

I am facing a problem with .CSV format in pandas

I will explain in detail:
I have an Excel file, and my client is using a tool that reads only .csv files.
I open the Excel file in Excel and save it into .CSV format using the Save As option; let's call this File_1.
I also wrote Python code using the pandas module and converted that Excel file into a CSV; let's call this File_2.
My client's tool is able to read File_1 but not File_2. Why? What could be the problem?
My observations:
When I read File_1 in pandas (the one converted to .CSV manually), I have to specify encoding="ISO-8859-1", otherwise it raises a Unicode error.
Ex: pd.read_csv("File_1.csv", encoding="ISO-8859-1")
But when I read File_2 in pandas, it reads fine without any error.
Ex: pd.read_csv("File_2.csv")
So what could be the reason the client tool cannot read File_2? Is it an encoding problem or something else?
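The observations above suggest File_1 is encoded as ISO-8859-1 (what Excel typically produces on Windows), while pandas writes UTF-8 by default. One hedged thing to try is writing the converted CSV with the encoding the tool appears to expect. This is only a sketch: the source file name is a placeholder, and the encoding choice is an assumption based on the ISO-8859-1 hint in the question, not a confirmed fix.
import pandas as pd

# Sketch: convert the Excel file to CSV with an explicit encoding so that
# File_2 matches the encoding of the manually saved File_1.
df = pd.read_excel("source.xlsx")
df.to_csv("File_2.csv", index=False, encoding="ISO-8859-1")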

How to fix uploading a CSV file to BigQuery using Python

While uploading a CSV file to BigQuery through Cloud Storage, I am getting the error below:
CSV table encountered too many errors, giving up. Rows: 5; errors: 1. Please look into the error stream for more details.
In the schema, I set every field as STRING.
The CSV file contains the data below:
It's Time. Say "I Do" in my style.
I am not able to upload a CSV file containing the above sentence to BigQuery.
Does the CSV file have the exact same structure as the dataset schema? Both must match for the upload to be successful.
If your CSV file has only one sentence in the first row of the first column, then your schema must have a table with exactly one field of type STRING. If there is content in the second column of the CSV, the schema must then have a second field for it, and so on. Conversely, if your schema has, say, 2 fields set as STRING, there must be data in the first two columns of the CSV.
The data location must also match: if your BigQuery dataset is in the US, then your Cloud Storage bucket must be in the US too for the upload to work.
Check here for details of uploading CSV into BigQuery.
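Since the sample row contains double quotes, it is also worth loading with an explicit CSV configuration. This is a minimal sketch with the google-cloud-bigquery client, where the project, dataset, table, bucket URI, and the single STRING column are placeholders rather than values from the question:
from google.cloud import bigquery

# Sketch: load a CSV from Cloud Storage into BigQuery with explicit CSV options.
client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[bigquery.SchemaField("sentence", "STRING")],
    skip_leading_rows=0,
    quote_character='"',          # default, shown explicitly because the data contains quotes
    allow_quoted_newlines=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/file.csv", table_id, job_config=job_config
)
load_job.result()  # wait for the job to finish and raise on errors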
Thanks to all for the responses.
Here is my solution to this problem:
with open('/path/to/csv/file', 'r') as f:
    text = f.read()

converted_text = text.replace('"', "'")
print(converted_text)

with open('/path/to/csv/file', 'w') as f:
    f.write(converted_text)
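Replacing every double quote with a single quote changes the data itself. An alternative, not from the original answers, is to rewrite the file with Python's csv module so embedded quotes are escaped the way BigQuery's CSV parser expects (doubled quotes inside quoted fields). This is a sketch that assumes the original file is otherwise valid CSV; the path is the placeholder from the snippet above.
import csv

# Sketch: re-quote the CSV instead of replacing the quote characters.
# csv.writer doubles embedded quotes by default, which the BigQuery CSV
# loader accepts with the default quote character.
with open('/path/to/csv/file', 'r', newline='') as src:
    rows = list(csv.reader(src))

with open('/path/to/csv/file', 'w', newline='') as dst:
    csv.writer(dst, quoting=csv.QUOTE_ALL).writerows(rows)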

How to read a gz compressed file in pyspark

I have line data in .gz compressed format. I have to read it in pyspark.
Following is the code snippet:
rdd = sc.textFile("data/label.gz").map(func)
But I could not read the above file successfully. How do I read a gz compressed file? I have found a similar question here, but my current version of Spark is different from the version in that question. I expect there should be some built-in function, as in Hadoop.
The Spark documentation clearly specifies that you can read gz files automatically:
All of Spark’s file-based input methods, including textFile, support
running on directories, compressed files, and wildcards as well. For
example, you can use textFile("/my/directory"),
textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
I'd suggest running the following and checking the result:
rdd = sc.textFile("data/label.gz")
print(rdd.take(10))
Assuming that Spark finds the file data/label.gz, it will print the first 10 rows of the file.
Note that the default location for a relative path like data/label.gz is the HDFS home folder of the Spark user. Is the file there?
You can load compressed files directly into dataframes through the Spark session; you just need the compression extension (.gz) in the path:
df = spark.read.csv("filepath/part-000.csv.gz")
You can also optionally specify whether a header is present or whether a schema needs to be applied:
df = spark.read.csv("filepath/part-000.csv.gz", header=True, schema=schema)
You didn't write the error message you got, but it's probably not going well for you because gzipped files are not splittable. You need to use a splittable compression codec, like bzip2.
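Because a gzip file cannot be split, it is read into a single partition. A common follow-up, shown here as a sketch rather than part of the original answers, is to repartition right after loading so later stages can run in parallel; the path comes from the question and the partition count is illustrative.
# Sketch: a .gz text file arrives as one partition, so repartition after reading
# to spread the work across executors.
rdd = sc.textFile("data/label.gz")
print(rdd.getNumPartitions())       # typically 1 for a single gzip file

rdd = rdd.repartition(8).map(func)  # now map() runs across 8 partitions in parallel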
