Deleting files from Hadoop with pyspark (Query) - python

I'm using Hadoop to store my data - for some of the data I use partitions, for some I don't.
I'm saving the data in parquet format using the pyspark DataFrame class, like this:
df = sql_context.read.parquet('/some_path')
df.write.mode("append").parquet(parquet_path)
I want to write a script with pyspark that deletes old data in a similar way (I need to query this old data by filtering on the data frame). I haven't found anything in the pyspark documentation...
Is there a way to achieve this?

Pyspark is predominantly a processing engine. The deletion can be handled by the subprocess module of plain Python itself.
import subprocess
some_path = ...
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
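If the old data lives in its own partition directories, a minimal sketch of the same idea (reusing df and parquet_path from the question, and assuming a hypothetical date partition column and cutoff value) would first collect the partition values to drop and then remove each directory recursively:

import subprocess

# Hypothetical: the table is partitioned by a `date` column under parquet_path.
# Collect the distinct partition values older than the cutoff...
old_dates = [
    r["date"]
    for r in df.filter(df["date"] < "2020-01-01").select("date").distinct().collect()
]

# ...then delete each partition directory recursively (-r) from HDFS.
for d in old_dates:
    subprocess.call(["hadoop", "fs", "-rm", "-r", "-f",
                     "{}/date={}".format(parquet_path, d)])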

Related

Creating a spark Dataframe within foreach() while using autoloader with BinaryFile option in databricks

I am using Autoloader with the BinaryFile option to decode .proto based files in Databricks. I am able to decode the proto file and write it in csv format using foreach() and the pandas library, but I am having trouble writing it in delta format. At the end of the day, I want to write in delta format and avoid one more hop in storage, i.e. storing in csv.
There are a few ways I could think of, but each has challenges:
Convert the pandas dataframe to a spark dataframe. I have to use sparkContext to createDataFrame, but I can't broadcast sparkContext to the worker nodes.
Avoid using a pandas DF; I still need to create a dataframe, which is not possible within foreach() (since the load is distributed across workers).
Other ways like a UDF, where I decode and explode the string returned from the decode. But that's not applicable here because we are getting a file format that is not native to spark, i.e. proto.
I also came across a few blogs, but they were not helpful for the foreach() and BinaryFile option.
https://github.com/delta-io/delta-rs/tree/main/python - This is not stable yet in python
https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_delta.html - This points us to the challenge 2 mentioned above.
Any leads on this are much appreciated.
Below is the skeleton code snippet for reference:
cloudfile_options = {
    "cloudFiles.subscriptionId": subscription_ID,
    "cloudFiles.connectionString": queue_connection_string,
    "cloudFiles.format": "BinaryFile",
    "cloudFiles.tenantId": tenant_ID,
    "cloudFiles.clientId": client_ID,
    "cloudFiles.clientSecret": client_secret,
    "cloudFiles.resourceGroup": storage_resource_group,
    "cloudFiles.useNotifications": "true"
}

reader_df = spark.readStream.format("cloudFiles") \
    .options(**cloudfile_options) \
    .load("some_storage_input_path")

def decode_proto(row):
    with open(row['path'], 'rb') as f:
        # Do decoding
        # convert decoded string to Json and write to storage using pandas df
        pass

write_stream = reader_df.select("path") \
    .writeStream \
    .foreach(decode_proto) \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .start()
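One direction that sidesteps the sparkContext problem (not from the original post, just a sketch under assumptions) is foreachBatch: each micro-batch arrives as an ordinary Spark DataFrame, so the binary content column can be decoded with a UDF and the result written in delta format directly. The decode_to_json helper and the output path below are hypothetical placeholders.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical decoder: takes the raw proto bytes and returns a JSON string.
def decode_to_json(content_bytes):
    # proto decoding logic would go here
    return "{}"

decode_udf = udf(decode_to_json, StringType())

def write_batch(batch_df, batch_id):
    # batch_df is a regular DataFrame, so the usual writers are available.
    decoded = batch_df.withColumn("json", decode_udf("content"))
    decoded.select("path", "json").write.format("delta") \
        .mode("append").save("some_delta_output_path")

write_stream = reader_df.writeStream \
    .foreachBatch(write_batch) \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(once=True) \
    .start()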

Writing a big Spark Dataframe into a csv file

I'm using Spark 2.3 and I need to save a Spark Dataframe into a csv file, and I'm looking for a better way to do it. Looking over related/similar questions, I found this one, but I need something more specific:
If the DataFrame is too big, how can I avoid using Pandas? I used the toCSV() function (code below) and it produced an Out Of Memory error (could not allocate memory).
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
Is using spark write and then hadoop getmerge better than using coalesce in terms of performance?
def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """get spark_df from hadoop and save to a csv file

    Parameters
    ----------
    spark_df: incoming dataframe
    n: number of rows to get
    save_csv=None: filename for exported csv

    Returns
    -------
    """
    # use the more robust method
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)
    # write sparkdf to hadoop, get n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # get merged file from hadoop
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # read into pandas df, remove tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)
    # re-write the csv file with header!
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)
If the DataFrame is too big, how can I avoid using Pandas?
You can just save the file to HDFS or S3 or whichever distributed storage you have.
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
If you mean saving the file to local storage - it will still cause an OOM exception, since you would need to move all the data into memory on the local machine to do it.
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
In this case you will have only 1 file (since you do coalesce(1)), so you don't need to care about headers. Instead, you should care about memory on the executors - you might get an OOM on the executor, since all the data will be moved to that executor.
Is using spark write and then hadoop getmerge better than using coalesce in terms of performance?
Definitely better (but don't use coalesce()). Spark will efficiently write the data to storage, HDFS will replicate it, and after that getmerge will be able to efficiently read the data from the nodes and merge it.
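As a sketch of that approach (the paths and the header handling are assumptions, not part of the original answer, and spark_df is the dataframe from the question's function): write the csv without per-file headers, merge the part files with getmerge, and prepend the header once.

import subprocess

# Hypothetical paths for illustration.
hdfs_dir = "/tmp/mycsv_parts"
merged_file = "mycsv_body.csv"
final_file = "mycsv.csv"

# Write the part files without headers so no headers end up in the middle of the merged file.
spark_df.write.option("header", "false").csv(hdfs_dir, sep=",")

# Merge the part files from HDFS into a single local file.
subprocess.call(["hdfs", "dfs", "-getmerge", hdfs_dir, merged_file])

# Prepend the header line once, streaming so the big file never sits in memory.
with open(final_file, "w") as out:
    out.write(",".join(spark_df.columns) + "\n")
    with open(merged_file, "r") as body:
        for chunk in iter(lambda: body.read(1024 * 1024), ""):
            out.write(chunk)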
We used the databricks spark-csv library. It works fine:
df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))
Library:
<!-- spark df to csv -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.3.0</version>
</dependency>

Converting JSON file to SQLITE or CSV

I'm attempting to convert a JSON file to an SQLite or CSV file so that I can manipulate the data with python. Here is where the data is housed: JSON File.
I found a few converters online, but those couldn't handle the quite large JSON file I was working with. I tried using a Python module called sqlbiter, but again, like the others, it was never really able to output or convert the file.
I'm not sure where to go now. If anyone has any recommendations or insights on how to get this data into a database, I'd really appreciate it.
Thanks in advance!
EDIT: I'm not looking for anyone to do it for me, I just need to be pointed in the right direction. Are there other methods I haven't tried that I could learn?
You can utilize the pandas module for this data processing task as follows:
First, you need to read the JSON file using with, open and json.load.
Second, you need to change the format of your file a bit by changing the large dictionary that has a main key for every airport into a list of dictionaries instead.
Third, you can now utilize some pandas magic to convert your list of dictionaries into a DataFrame using pd.DataFrame(data=list_of_dicts).
Finally, you can utilize pandas's to_csv function to write your DataFrame as a CSV file into disk.
It would look something like this:
import pandas as pd
import json

with open('./airports.json.txt', 'r') as f:
    j = json.load(f)

l = list(j.values())
df = pd.DataFrame(data=l)
df.to_csv('./airports.csv', index=False)
You need to load your json file and parse it so that all the fields are available, or load the contents into a dictionary. Then you could use pyodbc to write these fields to the database, or write them to a csv if you import csv first.
But this is just a general idea; you need to study Python and how to do every step.
For instance, for writing to the database you could do something like:
for i in range(0, max_len):
    sql_order = "UPDATE MYTABLE SET MYTABLE.MYFIELD ...."
    cursor1.execute(sql_order)
    cursor1.commit()
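For the csv route mentioned above, a minimal sketch with the standard csv module could look like this (the records list and its fields are hypothetical placeholders for the parsed JSON):

import csv

# Hypothetical: records is the list of dictionaries parsed from the JSON file.
records = [{"name": "Example Airport", "code": "EXA"}]

with open('airports.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=sorted(records[0].keys()))
    writer.writeheader()       # write the header row once
    writer.writerows(records)  # one csv row per dictionary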

Ingesting 150 csv's into one data source

Hello, I am completely new to handling big data but comfortable in Python.
I have 150 csv's, each of size 70MB, which I have to integrate into one source to get basic stats like unique counts, unique names and so on.
Could anyone suggest how I should go about it?
I came across a package 'pyelastic search' in Python; how feasible is it for me to use in Enthought Canopy?
Suggestions needed!
Try using the pandas package.
Reading a single csv would be:
import pandas as pd
df = pd.read_csv('filelocation.csv')
In case of multiple files, just concat them. Let's say ls is a list of file locations; then:
df = pd.concat([pd.read_csv(f) for f in ls])
And then to write them as a single file, do:
df.to_csv('output.csv')
Of course, all this is valid only for in-memory operations (70 MB x 150 = ~10.5 GB of RAM). If that's not possible, consider building an incremental process or using dask dataframes (see the sketch below).
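If the data does not fit in memory, a rough dask sketch (the file pattern and column name are assumptions) would be:

import dask.dataframe as dd

# Lazily read all 150 csv files matching a glob pattern (path is hypothetical).
ddf = dd.read_csv('data/*.csv')

# Basic stats, computed out of core.
print(ddf['name'].nunique().compute())  # unique count of one column
print(len(ddf))                         # total number of rows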
One option if you are in AWS:
Step 1 - move the data to S3 (AWS native file storage)
Step 2 - create a table for each data structure in Redshift
Step 3 - run the COPY command to move the data from S3 to Redshift (AWS native DW)
The COPY command loads data in bulk and detects file name patterns (see the sketch below).
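A rough sketch of such a COPY statement, issued from Python (the table name, bucket prefix, and IAM role are placeholders, not real values):

# Hypothetical names throughout: table, bucket prefix, and IAM role are placeholders.
copy_sql = """
    COPY my_table
    FROM 's3://my-bucket/csv-prefix/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV
    IGNOREHEADER 1;
"""
# Execute it with any Redshift connection, e.g. a psycopg2 cursor:
# cursor.execute(copy_sql)
# connection.commit()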

How to save a table in pyspark sql?

I want to save my resulting table into a csv, a textfile, or similar to be able to perform visualization with RStudio.
I am using pyspark.sql to perform some queries in a hadoop setup. I want to save my result in hadoop and then copy the result into my local drive.
myTable = sqlContext.sql("SOME QUERIES")
myTable.show() # Show my result
myTable.registerTempTable("myTable") # Save as table
myTable.saveAsTextFile("SEARCH PATH") # Saving result in my hadoop
This returns this:
AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile'
This is how I usually do it when using only pyspark i.e. not pyspark.sql.
And then I copy to local drive with
hdfs dfs -copyToLocal SEARCH PATH
Can anyone help me?
You can use DataFrameWriter with one of the supported formats. For example for JSON:
myTable.write.json(path)
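Since the goal here is a csv for RStudio, a sketch using the same DataFrameWriter (the HDFS output path is a hypothetical placeholder) would be:

# Write the result as csv with a header into hadoop (path is a placeholder).
myTable.write.option("header", "true").csv("/user/me/myTable_result")

and then copy it to the local drive as in the question:

hdfs dfs -copyToLocal /user/me/myTable_result local_dir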
