How to read a gz compressed file with pyspark - python

I have line data in .gz compressed format and I have to read it in pyspark.
Following is the code snippet:
rdd = sc.textFile("data/label.gz").map(func)
But I could not read the above file successfully. How do I read a gz compressed file? I have found a similar question here, but my current version of Spark is different from the version in that question. I expect there should be some built-in function as in Hadoop.

The Spark documentation clearly specifies that you can read gz files automatically:
All of Spark’s file-based input methods, including textFile, support
running on directories, compressed files, and wildcards as well. For
example, you can use textFile("/my/directory"),
textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
I'd suggest running the following command and seeing the result:
rdd = sc.textFile("data/label.gz")
print(rdd.take(10))
Assuming that Spark finds the file data/label.gz, it will print the first 10 rows from the file.
Note that the default location for a relative path like data/label.gz is the HDFS home directory of the Spark user. Is it there?
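If the file still isn't being picked up, spelling out the filesystem scheme can help narrow down where Spark is actually looking. A minimal sketch with hypothetical paths (adjust them to your layout):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# With no scheme, a relative path is resolved against the default filesystem,
# which on a cluster is usually the Spark user's HDFS home directory.
rdd_default = sc.textFile("data/label.gz")

# To force a local file or an explicit HDFS location, spell out the scheme.
rdd_local = sc.textFile("file:///home/user/data/label.gz")
rdd_hdfs = sc.textFile("hdfs:///user/spark/data/label.gz")
print(rdd_local.take(10))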

You can load compressed files directly into dataframes through the spark session; you just need to include the compression extension in the path:
df = spark.read.csv("filepath/part-000.csv.gz")
You can also optionally specify whether a header is present or whether a schema needs to be applied:
df = spark.read.csv("filepath/part-000.csv.gz", header=True, schema=schema)
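For completeness, here is a hedged sketch of what the schema part can look like; the column names and types below are made-up placeholders, so adjust them to your file:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical column layout for the gzipped csv.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("label", StringType(), True),
])

# Compression is inferred from the .gz extension, so no extra option is needed.
df = spark.read.csv("filepath/part-000.csv.gz", header=True, schema=schema)
df.show(5)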

You didn't include the error message you got, but it's probably not going well for you because gzipped files are not splittable: the whole file ends up being read by a single task. If you need the read to be parallelized, use a splittable compression codec, like bzip2.
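A small sketch of what that looks like in practice (assuming the same data/label.gz path from the question): the gzipped input lands in one partition, so you either repartition after loading or switch to a splittable codec such as bzip2.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# gzip is not splittable, so the whole file is read by a single task.
rdd = sc.textFile("data/label.gz")
print(rdd.getNumPartitions())  # typically 1 for a single .gz file

# Either repartition after loading so later stages run in parallel...
rdd = rdd.repartition(8)

# ...or re-compress the source with a splittable codec and read that instead.
rdd_bz2 = sc.textFile("data/label.bz2")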

Related

pyspark dataframe is saved in s3 bucket with junk data

While trying to save a pyspark DataFrame to csv directly to an s3 bucket,
the file gets saved but it contains junk data, and all the file sizes are 1 B.
Please help me figure out where I am going wrong.
Python code:
df.write.options("header","true").csv("s3a://example/csv")
I tried this code also:
df.coalesce(1).write.format("csv").option("header", "true").option("path", "s3://example/test.csv").save()
But I am not getting a proper csv in the s3 bucket; the csv file contains junk data.
I think you are saving your dataframe as parquet, which is the default format.
df.write.format("csv")
.option("header", "true")
.option("encoding", "UTF-8")
.option("path", "s3a://example/csv")
.save()
Note: the syntax is also option, not options.
Update
As @Samkart mentioned, you should check whether your encoding is correct. I have updated my answer to include the encoding option. You can check here for the encoding options in pyspark.
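A quick way to confirm the fix, as a sketch (assuming the same s3a://example/csv path): read the written files back as csv and check that the header and encoding survived.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If this shows readable rows rather than binary noise, the output is plain csv.
check = spark.read.csv("s3a://example/csv", header=True, encoding="UTF-8")
check.show(5)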

Python: How to convert a CSV file stored in a byte stream to a list?

I am trying to get a csv file from Azure Data Lake Gen2 and then perform some operations on each row. However, the requirement is not to download the file to a physical location, and hence I am using file_client.download_file().readall() to get the file as a byte stream.
However, I am unable to split the file rows/columns and get them into a list.
x = file_client.download_file()
bystream = x.readall()
What should I do with this bytestream?
I am, however, able to do this with a downloaded file using with open(...) as csvfile and then passing that stream to csv.reader().
Can someone please help with handling this bytestream?
A late update: I was able to resolve this issue by converting the downloaded stream to text I/O. (I didn't need to convert it to a list, as a pandas DataFrame was the better option.)
Here is the code snippet that worked:
import io
import pandas as pd
stream = io.StringIO(file_client.download_file().readall().decode("utf-8"))
dataframe1 = pd.read_csv(stream, sep="|")
Here, file_client is a connection to the Azure Data Lake where the csv file is stored.
The code downloads the file as an in-memory stream and loads it into a dataframe (no need to write it to a local file location).
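If you do still want a plain list of rows rather than a DataFrame, the same in-memory stream can be handed to the standard csv module. A sketch assuming the same file_client and the same "|" delimiter:
import csv
import io

# Same in-memory text stream as above, no local file involved.
stream = io.StringIO(file_client.download_file().readall().decode("utf-8"))

# csv.reader accepts any text stream and yields each row as a list of strings.
reader = csv.reader(stream, delimiter="|")
rows = list(reader)
print(rows[:5])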

How to read text files in gzip format without unzipping them and write the lines to Excel using Python?

Problem statement:
I have a directory with gzip files, and each gzip file contains a text file.
I have written code that unzips all the gzip files, reads each unzipped text file, combines the output into one text file, and then applies a condition; if the condition is met, it writes to Excel.
The above process is a bit tedious and lengthy.
Can anyone please help me write code that reads the data directly from the gzipped txt file and writes its contents to Excel?
IIUC you can use pandas, starting with read_csv:
import pandas as pd
df = pd.read_csv('yourfile.gz', compression='gzip')
then apply your conditions to df and write the dataframe back to Excel using to_excel:
df.to_excel('output.xlsx')
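Putting it together for a whole directory, here is a hedged sketch; the directory path, the tab separator, the column name, and the condition are all assumptions to adapt:
import glob
import pandas as pd

frames = []
for path in glob.glob("your_directory/*.gz"):
    # pandas decompresses gzip transparently based on the extension.
    frames.append(pd.read_csv(path, compression="gzip", sep="\t"))

combined = pd.concat(frames, ignore_index=True)

# Hypothetical condition: keep only rows where some_column exceeds a threshold.
filtered = combined[combined["some_column"] > 0]

# Writing .xlsx needs an Excel engine such as openpyxl installed.
filtered.to_excel("output.xlsx", index=False)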

How to open a .data file extension

I am working on a side project where the data provided is in a .data file. How do I open a .data file to see what the data looks like, and how do I read from a .data file programmatically through Python? I am on Mac OS X.
NOTE: The data I am working with is for one of the KDD Cup challenges.
Kindly try using Notepad or Gedit to check the delimiters in the file (.data files are often just text files). After you have confirmed this, you can use the read_csv method from the pandas library in Python.
import pandas as pd
file_path = "~/AI/datasets/wine/wine.data"
# above .data file is comma delimited
wine_data = pd.read_csv(file_path, delimiter=",")
It largely depends on what is in it: it could be a binary file or a text file.
If it is a text file, then you can open it in the same way you open any file: f = open(filename, "r").
If it is a binary file, you can just add a "b" to the open mode: open(filename, "rb"). There is an example here:
Reading binary file in Python and looping over each byte
Depending on the type of data in there, you might want to try passing it through a csv reader (the csv Python module) or an XML parsing library (an example of which is lxml).
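A small sketch of that first check, assuming a hypothetical file name: peek at the raw bytes to guess whether the file is binary, and if it looks like text, let csv.Sniffer guess the delimiter.
import csv

path = "example.data"  # hypothetical file name

with open(path, "rb") as f:
    sample = f.read(4096)

if b"\x00" in sample:
    print("Looks binary; parse it according to its format specification.")
else:
    text = sample.decode("utf-8", errors="replace")
    # Sniffer raises csv.Error if it cannot work out a delimiter.
    dialect = csv.Sniffer().sniff(text)
    print("Detected delimiter:", repr(dialect.delimiter))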
After further info from the above and looking at the page, the format is:
Data Format
The datasets use a format similar to the text export format from relational databases:
One header line with the variable names
One line per instance
Tab separators between the values
Missing values appear as consecutive tabs
Therefore see this answer:
parsing a tab-separated file in Python
I would advise processing one line at a time rather than loading the whole file, but if you have the RAM, why not...
I suspect it doesn't open in Sublime Text because the file is huge, but that is just a guess.
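A minimal line-at-a-time sketch for that tab-separated layout (the file name is hypothetical and the per-row processing is just a placeholder):
path = "example.data"  # hypothetical file name

with open(path, "r", encoding="utf-8") as f:
    header = f.readline().rstrip("\n").split("\t")
    for line in f:
        values = line.rstrip("\n").split("\t")
        # Consecutive tabs show up as empty strings, i.e. missing values.
        row = dict(zip(header, values))
        missing = sum(1 for v in values if v == "")
        print(row, "missing values:", missing)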
To get a quick overview of what the file may contain, you could do this in a terminal, using strings or cat, for example:
$ strings file.data
or
$ cat -v file.data
In case you forget to pass the -v option to cat and it is a binary file, you could mess up your terminal and need to reset it:
$ reset
I was just dealing with this issue myself, so I thought I would share my answer. I have a .data file and was unable to open it by simply right-clicking it. macOS recommended I open it using Xcode, so I tried that, but it did not work.
Next I tried opening it using a program named "Brackets". It is a text editor primarily used for HTML and CSS. Brackets did work.
I also tried PyCharm, as I am a Python programmer. PyCharm worked as well, and I was also able to read from the file using the following lines of code:
inf = open("processed-1.cleveland.data", "r")
lines = inf.readlines()
for line in lines:
    print(line, end="")
It works for me.
import pandas as pd
# define your file path here
your_data = pd.read_csv(file_path, sep=',')
your_data.head()
I mean, just treat it as a csv file if it is separated with ','.
Solution from @mustious.

Batch convert json to csv python

Similar to this question: batch process text to csv using python
I've got a batch of json files that need to be converted to csv so that they can be imported into Tableau.
The first step was to get json2csv ( https://github.com/evidens/json2csv ) working, which I did. I can successfully convert a single file via the command line.
Now I need an operation that goes through the files in a directory and converts each in a single batch operation using that json2csv script.
TIA
I actually created a jsontocsv python script to run myself. It basically reads the json file in chunks, and then goes through determining the rows and columns of the csv file.
Check out Opening A large JSON file in Python with no newlines for csv conversion Python 2.6.6 for the details of what was done and how it built the .csv from the json. The actual conversion would depend on your actual json format.
A json parse function with a chunk size of 0x800000 was used to read in the json data.
If the data becomes available at specific times, you can set this up using crontab.
I used
from optparse import OptionParser
to get the input and output files as arguments, as well as to set the various options required for the analysis and mapping.
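For reference, a sketch of that kind of optparse setup; the option names and the chunk-size default are assumptions, not the original script. (Newer code would usually reach for argparse, but optparse still works.)
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-i", "--input", dest="input", help="input json file")
parser.add_option("-o", "--output", dest="output", help="output csv file")
parser.add_option("-c", "--chunk-size", dest="chunk_size", type="int",
                  default=0x800000, help="bytes to read per chunk")
(options, args) = parser.parse_args()

print(options.input, options.output, options.chunk_size)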
You can also use a shell script in the given directory:
for f in *.json; do
  mybase=$(basename "$f" .json)
  json2csv "$f" -o "${mybase}.csv"
done
Alternatively, use find with the -exec {} option.
If you want all the json files to go into a single .csv file, you can use:
json2csv *.json -o myfile.csv
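If you would rather drive the batch from Python than from the shell, here is a sketch along the same lines (the directory path is an assumption, and the json2csv invocation simply mirrors the shell loop above):
import glob
import subprocess

for json_path in glob.glob("input_dir/*.json"):
    csv_path = json_path[:-len(".json")] + ".csv"
    # Call the json2csv command-line script once per file.
    subprocess.run(["json2csv", json_path, "-o", csv_path], check=True)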
