I'm trying to aggregate a bunch of CSV files into one and output it to S3 in ORC format using an ETL job in AWS Glue. My aggregated CSV looks like this:
header1,header2,header3
foo1,foo2,foo3
bar1,bar2,bar3
I have a string representation of the aggregated CSV called aggregated_csv; its content is header1,header2,header3\nfoo1,foo2,foo3\nbar1,bar2,bar3.
I've read that pyspark has a straightforward way to convert CSV files into DataFrames (which I need so that I can leverage Glue's ability to easily output in ORC). Here is a snippet of what I've tried:
def f(glueContext, aggregated_csv, schema):
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
        #agg_file.seek(0)
        df = glueContext.read.csv(agg_file, schema=schema, header="true")
        df.show()
I've tried it both with and without seek. When I don't call seek(), the job completes successfully but df.show() doesn't display any data other than the headers. When I do call seek(), I get the following exception:
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-172-31-48-255.us-west-2.compute.internal:8020/user/root/header1,header2,header3\n;'
Since seek seems to change the behavior, and since the headers from my csv appear in the exception string, I'm assuming the problem is somehow related to where the file cursor is when I pass the file to glueContext.read.csv(), but I'm not sure how to resolve it. If I uncomment the seek(0) call and add an agg_file.read() call, I can see the entire contents of the file as expected. What do I need to change so that I can successfully read a csv file I've just written into a Spark DataFrame?
I think you're passing the wrong argument to the csv function. I believe GlueContext.read.csv() ends up calling DataFrameReader.csv(), whose signature takes a path as its first argument, while what you're passing is a file-like object.
def f(glueContext, aggregated_csv, schema):
    with open('somefile', 'a+') as agg_file:
        agg_file.write(aggregated_csv)
    # read after the with-block so the file has been flushed and closed
    df = glueContext.read.csv('somefile', schema=schema, header="true")
    df.show()
BUT, if all you want is to write an ORC file, and you already have the data in aggregated_csv, you can create a DataFrame directly out of a list of tuples.
df = spark.createDataFrame([('foo1','foo2','foo3'), ('bar1','bar2','bar3')], ['header1', 'header2', 'header3'])
then, if you need a Glue DynamicFrame, use the DynamicFrame.fromDF function:
from awsglue.dynamicframe import DynamicFrame
dynF = DynamicFrame.fromDF(df, glueContext, 'myFrame')
ONE MORE BUT: you don't need Glue to write ORC - Spark is totally capable of it. Just use the DataFrameWriter.orc() function:
df.write.orc('s3://path')
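Putting those pieces together, here is a minimal end-to-end sketch, assuming spark is the SparkSession (in a Glue job it is available via glueContext.spark_session) and the S3 path is a placeholder:

rows = [tuple(line.split(',')) for line in aggregated_csv.splitlines()[1:]]  # skip the header row
df = spark.createDataFrame(rows, ['header1', 'header2', 'header3'])
df.write.orc('s3://my-bucket/aggregated-orc/')  # placeholder bucket/prefix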
I'm trying to write some data to an Excel spreadsheet. Whenever I use the DataFrame.to_excel method with a file path instead of an ExcelWriter object as the first argument (e.g. pd.DataFrame([1, 2, 3]).to_excel("test.xlsx")), it works fine except that it rewrites the whole file every time. I want to append data, and I don't see an option in the documentation that lets you set something like an append mode. So I'm using an ExcelWriter object, because that seems to have an append mode if you initialise it as follows (documentation):
writer = ExcelWriter("test.xlsx", mode='a', if_sheet_exists="overlay").
Then, if I understand correctly, you should be able to pass that object into the to_excel function like this:
pd.DataFrame([1, 2, 3]).to_excel(writer) and it shouldn't rewrite the whole file.
But when I use an ExcelWriter to create or modify the file, Excel gives me the error:
"Excel cannot open the file 'test.xlsx' because the file format or file extension is not valid. Verify that the file has not been corrupted and that the file extension matches the format of the file"
I have tried initialising the ExcelWriter with only the first argument, writer = ExcelWriter("test.xlsx"), and that produces the same error when opening the file.
I think the writer is writing corrupted Excel files; does anyone know a fix?
Fixed it: I wasn't closing the ExcelWriter with
writer.close()
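For anyone hitting the same error, here is a minimal sketch of the fix, assuming pandas >= 1.4 with openpyxl installed and that test.xlsx already exists:

import pandas as pd

# Using the writer as a context manager closes (and finalizes) the workbook
# automatically, which is equivalent to calling writer.close() yourself.
with pd.ExcelWriter("test.xlsx", mode="a", if_sheet_exists="overlay") as writer:
    pd.DataFrame([1, 2, 3]).to_excel(writer)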
I'm parsing a csv file that was sent via POST FormData() and then converting it to JSON. The problem appears when I use a package to validate the csv before passing it to pandas. The validator function does its job, and then the normal read with pandas gives the error ValueError: I/O operation on closed file.
if request.method == 'POST':
    content = request.form
    data_header = json.loads(content.get('JSON'))
    filename = data_header['data'][0]['name']

    # Here! starts the problem
    # validator = validCSV(validator={'header': ["id","type","name","subtype","tag","block","latitude","longitude","height","max_alt","min_alt","power","tia","fwl"]})
    # print(validator.verify_header(request.files[filename]))
    # then pseudo-code: if returned false, will abort(404)

    try:
        df = pd.read_csv(request.files[filename], dtype='object')
        dictObj = df.to_dict(orient='records')
If we follow the issue into that package, this is what we see:
def verify_header(self, inputfile):
    with TextIOWrapper(inputfile, encoding="utf-8") as wrapper:
        header = next(csv.reader(wrapper))
It seems that once the file has been opened and closed by TextIOWrapper, pandas is no longer able to open it with read_csv(). Making a copy of the file seems wasteful just to read a header, and I like the idea of using csv.reader() because other examples showed it reading a csv file more efficiently than pandas.
What can be done to prevent the I/O error after another package has opened the file? Or is there a simple and efficient way to validate the csv without pulling in the heavier pandas read?
The solution was to seek() the pointer back to the beginning of the file after reading the first line. The reading process is almost the same as what pandas does. The only apparent advantage is that it does not depend on importing/installing pandas.
wrapper = StringIO(inputfile.readline().decode('utf-8'))
header = next(csv.reader(wrapper, delimiter=','))
inputfile.seek(0,0)
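For context, the adjusted helper might look like this in full. This is only a sketch: it assumes inputfile is the werkzeug file object from request.files (which supports seek()) and that self.validator['header'] holds the expected column list.

import csv
from io import StringIO

def verify_header(self, inputfile):
    # Wrap only the first line in StringIO so the underlying upload stream is never closed.
    wrapper = StringIO(inputfile.readline().decode('utf-8'))
    header = next(csv.reader(wrapper, delimiter=','))
    inputfile.seek(0, 0)  # rewind so pandas can read the whole file afterwards
    return header == self.validator['header']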
I'm not sure how to word my question exactly, and I have seen some similar questions asked but not exactly what I'm trying to do. If there already is a solution please direct me to it.
Here is what I'm trying to do:
At my work, we have a few pkgs we've built to handle various data types. One I am working with is reading in a csv file into a std_io object (std_io is our all-purpose object class that reads in any type of data file).
I am trying to connect this to another pkg I am writing, so I can make an object in the new pkg and convert it to a std_io object.
The problem is, the std_io object is meant to read an actual file, not take in an object. To get around this, I can basically write my data to a temp.csv file and then read it into a std_io object.
I am wondering if there is a way to eliminate this step of writing the temp.csv file.
Here is my code:
x #my object
df = x.to_df() #object class method to convert to a pandas dataframe
df.to_csv('temp.csv') #write data to a csv file
std_io_obj = std_read('temp.csv') #read csv file into a std_io object
Is there a way to basically pass what the output of writing the csv file would be directly into std_read? Does this make sense?
The only reason I want to do this is to avoid having to code additional functionality into either of the pkgs to directly accept an object as input.
Hope this was clear, and thanks to anyone who contributes.
For those interested, or who may have this same kind of issue/objective, here's what I did to solve this problem.
I basically just created a temporary named file, linked a .csv filename to this temp file, then passed it into my std_read function which requires a csv filename as an input.
This basically tricks the function into thinking it's taking the name of a real file as an input, and it just opens it as usual and uses csvreader to parse it up.
This is the code:
import tempfile
import os

x  # my object I want to convert to a std_io object
text = x.to_df().to_csv()  # object class method to convert to a pandas dataframe, then generate the 'text' of a csv file

filename = 'temp.csv'
with tempfile.NamedTemporaryFile(dir=os.path.dirname('.')) as f:
    f.write(text.encode())
    f.flush()  # make sure the csv text is on disk before std_read opens the linked name
    os.link(f.name, filename)
    stdio_obj = std_read(filename)
    os.unlink(filename)
FYI - the std_read function essentially just opens the file the usual way and passes it into csv.reader:
with open(filename, 'r') as f:
    rdr = csv.reader(f)
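If all std_read really needs is a path that ends in .csv, a small variant of the same trick (a sketch under that assumption) is to give the temporary file the suffix directly and skip the hard link:

import os
import tempfile

text = x.to_df().to_csv()
# delete=False lets std_read reopen the file after the with-block has closed it.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as f:
    f.write(text.encode())
    tmp_name = f.name  # the generated path already ends in .csv

try:
    stdio_obj = std_read(tmp_name)
finally:
    os.remove(tmp_name)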
I have a process to read csv files and do some processing in pyspark. At times I might get a zero byte empty file. In such cases when I use the below code
df = spark.read.csv('/path/empty.txt', header = False)
It fails with the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o139.csv.
: java.lang.UnsupportedOperationException: empty collection
Since it's an empty file, I tried to read it as JSON and that worked fine:
df = spark.read.json('/path/empty.txt')
When I manually add a header to the empty csv, the code reads fine:
df = spark.read.csv('/path/empty.txt', header = True)
In a few places I read suggestions to use the databricks csv package, but I don't have that option as those jars are not available in my environment.
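One workaround worth trying (a sketch, not from this thread, and behavior may vary by Spark version): pass an explicit schema so Spark skips schema inference, which is the step that raises the 'empty collection' error on a zero-byte file. The read should then just return an empty DataFrame.

from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical column names; replace them with the real layout of the csv files.
schema = StructType([
    StructField('col1', StringType(), True),
    StructField('col2', StringType(), True),
])

df = spark.read.csv('/path/empty.txt', schema=schema, header=False)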
I am attempting to create an upload tool that takes an .xls file and then converts it to a pandas dataframe before finally saving it as a csv file to be processed and analyzed. After the file comes out of this code:
def xls_to_csv(data):
    # Formats into pandas dataframe. Index removes first column of .xls file.
    formatted_file = pd.read_excel(data, index_col=0)
    # Converts the formatted file into a csv file and saves it.
    final_file = formatted_file.to_csv('out.csv')
It saves properly and in the right location, however when I attempt to plug the resulted file into other functions that contain loops, it raises
TypeError: 'NoneType' object is not iterable.
The file is saved as 'out.csv' and I am able to open it manually, however the open command won't even work without this error being raised.
I'm using Python 3.6.
to_csv returns None, which is why you got that error.
To keep working with the data in formatted_file, you could try this:
final_file=formatted_file.copy()
or
final_file=pd.read_csv('out.csv')
Pandas' to_csv function saves the file but does not return anything. To loop through the csv file later, you'll have to change the code to look like this:
formatted_file.to_csv('out.csv')
final_file = open('out.csv', 'r')
def xls_to_csv(data):
    # Read the excel file as a dataframe object
    formatted_dataframe = pd.read_excel(data, index_col=0)
    # Save the dataframe to a csv file on disk. The method returns None.
    formatted_dataframe.to_csv('out.csv')
    # The dataframe object is still here
    final_dataframe = formatted_dataframe
    # The final file NAME
    final_filename = 'out.csv'
Your variable names are misleading:
Your formatted_file is in fact a dataframe object.
Your final_file: it is unclear to me whether you want the filename or the dataframe.
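If it's the dataframe you want to keep iterating over, one possible way to restructure the helper (a sketch; the names and return shape are just illustrative) is to hand both the dataframe and the saved filename back to the caller:

import pandas as pd

def xls_to_csv(data, out_path='out.csv'):
    # Read the Excel file into a dataframe; index_col=0 drops the first column.
    df = pd.read_excel(data, index_col=0)
    # Persist the csv. to_csv returns None, so return the dataframe and the path instead.
    df.to_csv(out_path)
    return df, out_path

final_dataframe, final_filename = xls_to_csv('input.xls')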