I want to save my resulting table into a CSV, text file or similar to be able to perform visualization with RStudio.
I am using pyspark.sql to perform some queries in a Hadoop setup. I want to save my result in Hadoop and then copy the result to my local drive.
myTable = sqlContext.sql("SOME QUERIES")
myTable.show() # Show my result
myTable.registerTempTable("myTable") # Register as a temp table
myTable.saveAsTextFile("SEARCH PATH") # Saving result in my hadoop
This returns:
AttributeError: 'DataFrame' object has no attribute 'saveAsTextFile'
This is how I usually do it when using plain pyspark, i.e. not pyspark.sql.
And then I copy to local drive with
hdfs dfs -copyToLocal SEARCH PATH
Can anyone help me?
You can use DataFrameWriter with one of the supported formats. For example, for JSON:
myTable.write.json(path)
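Since the goal is to load the result into RStudio, CSV is probably more convenient. A rough sketch, assuming Spark 2.x (on older versions you would need the spark-csv package) and a placeholder HDFS output path:

# Placeholder output path; Spark writes a directory of part files, not a single CSV
myTable.write.option("header", "true").csv("hdfs:///user/me/myTable_csv")

You can then pull it down to your local drive as before with hdfs dfs -copyToLocal and read the part files into RStudio.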
Related
I'm creating a Snowflake procedure using the Snowpark (Python) package, executing a query into a Snowflake dataframe, and I would like to export that into Excel. How can I accomplish that? Is there a better approach to do this? The end goal is to export the query results into Excel. It needs to be in a Snowflake procedure since we already have other "parent" procedures. Thanks!
CREATE OR REPLACE PROCEDURE EXPORT_SP()
RETURNS string not null
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'run'
AS
$$
import pandas

def run(snowpark_session):
    ## Execute the query into a Snowflake dataframe
    results_df = snowpark_session.sql('''
        SELECT * FROM
        MY TABLES
        ;
    ''').collect()
    return results_df
$$
;
In general, you can do this by:
1. "Unloading" the data from the table using the COPY INTO <location> command.
2. Using the GET command to copy the data to your local filesystem.
3. Opening the file with Excel! If you used the CSV format and the appropriate format options in step 1, you should be able to easily open the resulting data with Excel.
Snowpark directly supports step 1 in the DataFrameWriter.copy_into_location method. An instance of DataFrameWriter is contained in the DataFrame.write attribute, as described here.
Snowpark also directly supports step 2 in the FileOperation.get method. As per the example in that documentation page, you can access this method using the .file attribute of your Snowpark session object.
Putting this all together, you should be able to do something like this in Snowpark to save a single exported file into the current working directory:
source_table = "my_table"
unload_location = "@my_stage/export.csv"

def run(session):
    df = session.table(source_table)
    df.write.copy_into_location(
        unload_location,
        file_format_type="csv",
        format_type_options=dict(
            compression="none",
            field_delimiter="\t",
        ),
        single=True,
        header=True,
    )
    session.file.get(unload_location, ".")
You can of course use session.sql() instead of session.table() as needed. You might also want to consider unloading the data to the stage associated with the source data instead of creating a separate stage, i.e. if the data is from table my_table then you would unload to the table stage @%my_table.
For more details, refer to the documentation pages I linked, which contain important reference information as well as several examples.
Note that I am not sure if session.file is accessible from inside a stored procedure; you will have to experiment to see what works in your specific situation.
As always, remember that this is untested code written by an unpaid volunteer. Always triple-check and test any code that is provided here. Please do ask questions in the comments if anything is still unclear.
I want to save a DataFrame (pyspark.pandas.DataFrame) as an Excel file on Azure Data Lake Gen2 using Azure Databricks in Python.
I've switched to pyspark.pandas.DataFrame because it is the recommended one since Spark 3.2.
There's a method called to_excel (here the doc) that allows saving a file to a container in ADLS, but I'm facing problems with the file system access protocols.
From the same class I use the methods to_csv and to_parquet with abfss, and I'd like to do the same for the Excel file.
So when I try to save it using:
import pyspark.pandas as ps
# Omit the df initialization
file_name = "abfss://CONTAINER@SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx"
sheet = "test"
df.to_excel(file_name, sheet_name=sheet)
I get the error from fsspec:
ValueError: Protocol not known: abfss
Can someone please help me?
Thanks in advance!
The pandas-style dataframe does not support the abfss protocol. It seems that on Databricks you can only access and write files on abfss via a Spark dataframe. So, the solution is to write the file locally and then move it to abfss manually. See this answer here.
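A minimal sketch of that workaround on Databricks, assuming dbutils is available in the notebook, an Excel engine such as openpyxl is installed, and reusing the placeholder names from the question:

# Write the (small) dataframe to the driver's local disk first
local_path = "/tmp/FILE.xlsx"
df.to_excel(local_path, sheet_name="test")

# Then copy the local file to the lake; dbutils understands the abfss protocol
dbutils.fs.cp("file:" + local_path, "abfss://CONTAINER@SERVICEACCOUNT.dfs.core.windows.net/FILE.xlsx")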
You cannot save it directly, but you can write it to a temp location first and then move it to your directory. My code piece is:
import xlsxwriter
import pandas as pd1

workbook = xlsxwriter.Workbook('data_checks_output.xlsx')
worksheet = workbook.add_worksheet('top_rows')

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd1.ExcelWriter('data_checks_output.xlsx', engine='xlsxwriter')

row_number = 0  # starting row for the output
output = dataset.limit(10)  # 'dataset' is the Spark DataFrame being exported
output = output.toPandas()
output.to_excel(writer, sheet_name='top_rows', startrow=row_number)
writer.save()
After writer.save(), run the code below, which simply moves the file from its temp location to your designated location.
%sh
sudo mv file_name.xlsx /dbfs/mnt/fpmount/
I want to save a dataframe that pulls data using SQLContext into a .csv file on the C drive. I am using Zeppelin to run my code.
The code below runs but I can't see the file in the location specified. The select query in SQLContext is pulling data from the Hive DB.
%spark.pyspark
df = sqlContext.sql("SELECT * from TEST")
df.write.format("csv").mode("overwrite").save("\Users\testuser\testfolder\test.csv")
z.show(df)
You're on Windows, if I'm getting it correctly. In that case you need to add the required prefix to your path. Your path will be something like C:\Users\testuser\testfolder\test.csv
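As a sketch of what that can look like, assuming the file:/// scheme is needed because your Spark default filesystem is not the local disk, and remembering that Spark writes a directory of part files rather than a single test.csv:

# Full Windows path with the file:/// scheme; the folder name is a placeholder
df.write.format("csv").option("header", "true").mode("overwrite").save("file:///C:/Users/testuser/testfolder/test_csv")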
I'm using Hadoop for storing my data: for some of the data I use partitions, and for some I don't.
I'm saving the data with parquet format using the pyspark DataFrame class, like this:
df = sql_context.read.parquet('/some_path')
df.write.mode("append").parquet(parquet_path)
I want to write a pyspark script that deletes old data in a similar way (I need to query this old data by filtering on the data frame). I haven't found anything in the pyspark documentation...
Is there a way to achieve this?
Pyspark is predominantly a processing engine. The deletion can be handled by the subprocess module of plain Python itself.
import subprocess

some_path = ...  # path to delete; add "-r" below if it is a directory
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
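For example, if the data under the parquet root is partitioned by a date column, a sketch for dropping one old partition directory (the path layout and partition value are assumptions) could be:

import subprocess

parquet_path = "/some_path"        # hypothetical parquet root
old_partition = "date=2020-01-01"  # hypothetical partition directory to delete

# -r is required because a partition is a directory of part files
subprocess.call(["hadoop", "fs", "-rm", "-r", "-f", parquet_path + "/" + old_partition])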
I'm using AWS Glue to move multiple files to an RDS instance from S3. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of multiple records being inserted, I want Glue to try and update that record if it notices a field has changed; each record has a unique id. Is this possible?
I followed a similar approach to the one suggested as the 2nd option by Yuriy: get the existing data as well as the new data, then do some processing to merge the two of them and write with overwrite mode. The following code would help you get an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Get your source data (src_db, src_tbl, etc. are placeholders)
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# Get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# Merge the two data frames and drop exact duplicate rows
# (deduplicate on your unique id column instead if values may have changed)
merged_df = dst_df.union(src_df).dropDuplicates()

# Finally save the data to the destination with OVERWRITE mode
merged_df.write.format('jdbc').options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl,
).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift you could use postactions to implement a Redshift merge operation. However, that's not possible for other JDBC sinks (afaik).
Alternatively, in your ETL script you can load the existing data from the database to filter out existing records before saving. However, if your DB table is big, the job may take a while to process it.
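A rough sketch of that filtering approach, reusing src_df/dst_df from the answer above and assuming the unique key column is called id (a placeholder):

# Keep only records whose id does not already exist in the destination
existing_ids = dst_df.select("id")
new_records_df = src_df.join(existing_ids, on="id", how="left_anti")

# Append just the new records
new_records_df.write.format("jdbc").options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable=dest_tbl,
).mode("append").save()

Note that this only inserts brand-new ids; it does not update existing records whose values changed.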
Another approach is to write into a staging table with mode 'overwrite' first (replace existing staging data) and then make a call to a DB via API to copy new records only into a final table.
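A minimal sketch of that staging-table idea, assuming a MySQL-compatible RDS instance, a pre-created staging_table, and placeholder connection variables:

import pymysql

# 1. Replace the staging table contents with the latest extract
src_df.write.format("jdbc").options(
    url=dest_jdbc_url,
    user=dest_user_name,
    password=dest_password,
    dbtable="staging_table",
).mode("overwrite").save()

# 2. Fold the staging rows into the final table on the database side
#    (REPLACE needs a primary/unique key and deletes-then-reinserts matching rows)
conn = pymysql.connect(host=db_host, user=dest_user_name, password=dest_password, database=db_name)
with conn.cursor() as cur:
    cur.execute("REPLACE INTO final_table SELECT * FROM staging_table")
conn.commit()
conn.close()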
I have used INSERT INTO table ... ON DUPLICATE KEY for UPSERTs into an Aurora RDS instance running the MySQL engine. Maybe this would be a reference for your use case. We cannot do this through the JDBC writer since only the APPEND, OVERWRITE, and ERROR modes are currently supported.
I am not sure which RDS database engine you are using; the following is an example of MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO TABLE ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
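For illustration, a hedged sketch of that pattern from a PySpark/Glue job, with placeholder table, column, and connection names, and assuming the pymysql package is available to the job:

import pymysql

# Collect the (small) batch of records to upsert; 'df' is the incoming Spark DataFrame
rows = [(r["id"], r["value"]) for r in df.select("id", "value").collect()]

conn = pymysql.connect(host=db_host, user=db_user, password=db_password, database=db_name)
with conn.cursor() as cur:
    # Insert new rows; update 'value' where the unique id already exists
    cur.executemany(
        "INSERT INTO my_table (id, value) VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE value = VALUES(value)",
        rows,
    )
conn.commit()
conn.close()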