Reading empty CSV in PySpark

I have a process that reads CSV files and does some processing in PySpark. At times I might get a zero-byte empty file. In such cases, when I use the code below
df = spark.read.csv('/path/empty.txt', header=False)
it fails with the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o139.csv.
: java.lang.UnsupportedOperationException: empty collection
Since it's an empty file, I tried to read it as JSON and that worked fine:
df = spark.read.json('/path/empty.txt')
When I manually add a header to the empty CSV, the code also reads it fine:
df = spark.read.csv('/path/empty.txt', header=True)
In a few places I read suggestions to use the Databricks CSV package, but
I don't have the Databricks CSV package available, as those jars are not present in my environment.
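One workaround, sketched here as an assumption rather than taken from the thread, is to pass an explicit schema so Spark never has to infer column types from a zero-byte file; the column names below are hypothetical placeholders.
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema matching the expected CSV layout
expected_schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
])

# With an explicit schema, Spark returns an empty DataFrame for an empty file
# instead of failing while trying to infer a schema from zero bytes.
df = spark.read.csv('/path/empty.txt', header=False, schema=expected_schema)
df.rdd.isEmpty() can then be used to detect the empty case and skip further processing.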

Related

Check if Excel is empty

I have a bunch of Excel files automatically generated by a process. However, some of them are empty because the process stopped before actually writing anything. These files do not even contain any columns, so they are just an empty sheet.
I'm now running some scripts on each of the Excel files, so I would like to check whether a file is empty and, if so, skip it.
I have tried:
pandas.DataFrame.empty
But I still get the message: EmptyDataError: No columns to parse from file
How can I perform this check?
Why not use a try/except:
try:
    # try reading the excel file
    df = pd.read_excel(…)  # or pd.read_csv(…)
except pd.errors.EmptyDataError:
    # do something else if this fails
    df = pd.DataFrame()
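A minimal sketch of how this could look in a loop over the files, assuming (as the answer above does) that pandas raises EmptyDataError for these empty files; the file names are hypothetical:
import pandas as pd

files = ['report1.xlsx', 'report2.xlsx']  # hypothetical list of generated files

for path in files:
    try:
        df = pd.read_excel(path)
    except pd.errors.EmptyDataError:
        # the process stopped before writing anything; skip this file
        continue
    if df.empty:
        # the sheet has columns but no rows; skip it as well
        continue
    # ... run the scripts on df here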

Pandas can not read S3 excel file. Error: Excel file format cannot be determined

I am reading Excel files (.xls) from S3 with pandas. The code works properly for a few files, but not for the rest. The files are received daily with different values per day (the Excel files' structure is the same, so we can consider the files identical).
The error is:
ValueError: Excel file format cannot be determined, you must specify an engine manually.
at this line:
pd.read_excel(io.BytesIO(excel), sheet_name=sheet, index_col=None, header=[0])
I have tried all the solutions mentioned on the internet: specifying engine='openpyxl' gives the error:
zipfile.BadZipFile: File is not a zip file
and specifying engine='xlrd' gives the error:
expected str, bytes or os.PathLike object, not NoneType
I am using boto3 to connect to the S3 resource.
Once again, for a few files my code works fine.
What could be the cause of this different behaviour for Excel files that look identical?
My question is very similar to Excel file format cannot be determined with Pandas, randomly happening, but that one doesn't have a proper answer yet.
It's always possible that the files you are reading have mislabeled extensions, bad data, etc.
It's also not clear how you are getting excel in io.BytesIO(excel).
See if something like this will work; this is reading a .xls file. I'm able to return the contents of Sheet1 to a DataFrame:
import boto3
import pandas as pd

bucket = 'your bucket'
key = 'test.xls'
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=bucket, Key=key)
df = pd.read_excel(obj['Body'].read(), sheet_name='Sheet1', index_col=None, header=0)
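If the suspicion about mislabeled extensions is right, one way to check, sketched here as an assumption rather than part of the original answer, is to look at the first bytes of the object before choosing an engine:
import io
import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='your bucket', Key='test.xls')
data = obj['Body'].read()

if data.startswith(b'PK'):                    # .xlsx files are zip archives
    df = pd.read_excel(io.BytesIO(data), engine='openpyxl')
elif data.startswith(b'\xd0\xcf\x11\xe0'):    # OLE2 signature of legacy .xls
    df = pd.read_excel(io.BytesIO(data), engine='xlrd')
else:
    raise ValueError('Object does not look like an Excel file (mislabeled extension?)')
Files that fail both checks are often CSV or HTML exports that were merely renamed to .xls, which would explain why only some of the daily files break.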

Python : How to convert a CSV file stored in Byte stream to a List?

I am trying to get a CSV file from Azure Data Lake Gen2 and then perform some operations on each row. However, the requirement is not to download the file to a physical location, and hence I am using file_client.download_file().readall() to get the file as a byte stream.
However, I am unable to split the file into rows/columns and get them into a list.
x = file_client.download_file()
bystream = x.readall()
What do I do with this bytestream?
I am, however, able to do this with a downloaded file using with open(...) as f and then passing that file object to csv.reader().
Can someone please help with handling this bytestream?
A late update: I was able to resolve this issue by converting the downloaded stream to text I/O. (I didn't need to convert it to a list, as a pandas DataFrame was a better option.)
Here is the code snippet that worked:
import io
import pandas as pd

stream = io.StringIO(file_client.download_file().readall().decode("utf-8"))
dataframe1 = pd.read_csv(stream, sep="|")
Here, file_client is a connection to the Azure Data Lake where the CSV file is stored.
The code downloads the file as an in-memory stream and loads it into a DataFrame. (No need to write it to a local file location.)
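If a plain list of rows is still wanted rather than a DataFrame, the same decoded stream can be fed to csv.reader; this is a sketch assuming the same file_client and pipe-delimited file as above:
import csv
import io

stream = io.StringIO(file_client.download_file().readall().decode("utf-8"))
rows = list(csv.reader(stream, delimiter="|"))  # each element is a list of column values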

Unable to read modified csv file with pandas

I have successfully exported a 9-column DataFrame to a CSV file using pandas' .to_csv method, and I can read the created file back with .read_csv with no errors whatsoever, using the following code:
dfBase = pd.read_csv('C:/Users/MyUser/Documents/Scripts/Base.csv',
                     sep=';', decimal=',', index_col=0, parse_dates=True,
                     encoding='utf-8', engine='python')
However, upon modifying the same CSV file manually using Notepad (which also extends to simply opening the file and saving it without making any actual alterations), pandas won't read it anymore, giving the following error message:
ParserError: Expected 2 fields in line 2, saw 9
In the case of the modified CSV, if the index_col=0 parameter is removed from the code, pandas is able to read the DataFrame again, but the first 8 columns become the index (as a tuple) and only the last column is read as a field.
Could anyone point out why I am unable to read the DataFrame after modifying it? Also, why does removing index_col make it readable again, with nearly all the columns as the index?
Have you tried opening and saving the file with some other text editor? Notepad really isn't that great: it is probably adding some special characters when it saves the file, or the file already contains characters that Notepad doesn't let you see, so pandas can't parse it correctly.
Try Notepad++ or a more advanced editor/IDE like Atom, VS Code, or PyCharm.
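To see what actually changed, one option, sketched here as an assumption rather than part of the original answer, is to inspect the raw bytes of the saved file for a BOM, altered line endings, or stray quoting:
# Print the raw bytes of the first two lines; b'\xef\xbb\xbf' at the start
# would indicate a UTF-8 BOM added on save, and b'\r\n' shows Windows line endings.
with open('C:/Users/MyUser/Documents/Scripts/Base.csv', 'rb') as f:
    for _ in range(2):
        print(f.readline())
If a BOM turns out to be the culprit, re-reading with encoding='utf-8-sig' (keeping the other parameters from the question) strips it.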

Problem loading csv into DataFrame in PySpark

I'm trying to aggregate a bunch of CSV files into one and output it to S3 in ORC format using an ETL job in AWS Glue. My aggregated CSV looks like this:
header1,header2,header3
foo1,foo2,foo3
bar1,bar2,bar3
I have a string representation of the aggregated CSV called aggregated_csv, whose content is header1,header2,header3\nfoo1,foo2,foo3\nbar1,bar2,bar3.
I've read that PySpark has a straightforward way to convert CSV files into DataFrames (which I need so that I can leverage Glue's ability to easily output in ORC). Here is a snippet of what I've tried:
def f(glueContext, aggregated_csv, schema):
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
        #agg_file.seek(0)
        df = glueContext.read.csv(agg_file, schema=schema, header="true")
        df.show()
I've tried it both with and without seek. When I don't call seek(), the job completes successfully but df.show() doesn't display any data other than the headers. When I do call seek(), I get the following exception:
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-48-255.us-west-2.compute.internal:8020/user/root/header1,header2,header3\n;'
Since seek seems to change the behavior, and since the headers in my CSV are part of the exception string, I'm assuming that the problem is somehow related to where the file cursor is when I pass the file to glueContext.read.csv(), but I'm not sure how to resolve it. If I uncomment the seek(0) call and add an agg_file.read() command, I can see the entire contents of the file as expected. What do I need to change so that I'm able to successfully read a CSV file that I've just written into a Spark DataFrame?
I think you're passing the wrong argument to the csv function. I believe GlueContext.read.csv() resolves to DataFrameReader.csv(), whose signature takes a file path as its first argument, while what you're passing is a file-like object.
def f(glueContext, aggregated_csv, schema):
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
        #agg_file.seek(0)
        df = glueContext.read.csv('somefile', schema=schema, header="true")
        df.show()
BUT, if all you want is to write an ORC file, and you already have the data as aggregated_csv, you can create a DataFrame directly out of a list of tuples:
df = spark.createDataFrame([('foo1','foo2','foo3'), ('bar1','bar2','bar3')], ['header1', 'header2', 'header3'])
Then, if you need a Glue DynamicFrame, use the fromDF function:
from awsglue.dynamicframe import DynamicFrame
dynF = DynamicFrame.fromDF(df, glueContext, 'myFrame')
ONE MORE BUT: you don't need Glue to write ORC - Spark is totally capable of it. Just use the DataFrameWriter.orc() function:
df.write.orc('s3://path')
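Another option worth noting, a sketch under the assumption that the underlying SparkSession is available as spark and that aggregated_csv is the in-memory string from the question, is to let Spark parse the CSV rows directly from an RDD of lines (DataFrameReader.csv also accepts an RDD of strings), which avoids the local temp file entirely:
# aggregated_csv = 'header1,header2,header3\nfoo1,foo2,foo3\nbar1,bar2,bar3'
lines = spark.sparkContext.parallelize(aggregated_csv.splitlines())
df = spark.read.csv(lines, header=True)  # parse the rows without touching local disk
df.write.orc('s3://path')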