Pyspark external table compression does not work - python

I am trying to save an external table from PySpark in Parquet format, and I need it to be compressed. The PySpark version I am using is 2.4.7. After the initial creation I update the table by appending data to it in a loop.
So far I have set the following options:
.config("spark.sql.parquet.compression.codec", "snappy")

df.write.mode("append").format("parquet").option("compression", "snappy").saveAsTable(...)
df.write.mode("overwrite").format("parquet").option("compression", "snappy").saveAsTable(...)
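Put together, a stripped-down version of what I am running looks like this (the input path, output location and table name are placeholders):

# Stripped-down reproduction; paths and table name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("compressed-external-table")
    # Session-level default codec for Parquet output.
    .config("spark.sql.parquet.compression.codec", "snappy")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.parquet("/staging/input")  # placeholder source

(
    df.write
    .mode("append")
    .format("parquet")
    .option("compression", "snappy")         # per-write codec
    .option("path", "/warehouse/my_table")   # external location, so saveAsTable creates an external table
    .saveAsTable("my_table")
)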
Is there anything else that I need to set or am I doing something wrong?
Thank you

Related

.Parquet to .Hyper file conversion for any schema

I want to convert a parquet file to the hyper file format using Python. There is the following GitHub sample for this: https://github.com/tableau/hyper-api-samples/blob/main/Community-Supported/parquet-to-hyper/create_hyper_file_from_parquet.py.
But in this case the parquet format/schema is known beforehand. What should I do if I want it to work for any parquet file, irrespective of the schema?
About me: I mostly work in analytics and data science with Python, but wanted to work on this project to make some files accessible to Tableau. Thank you in advance, and please let me know if you want any more information.
If you do not wish to define a schema when creating a .hyper file from a parquet file you can use the CREATE TABLE command instead of the COPY command.
To use the CREATE TABLE command you can skip the schema and table definition like this:
# Requires the tableauhyperapi package.
from tableauhyperapi import Connection, CreateMode, HyperProcess, Telemetry

hyper_database_path = "products.hyper"  # output file path (placeholder)

# Start the Hyper process.
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    # Open a connection to the Hyper process. This will also create the new Hyper file.
    # The `CREATE_AND_REPLACE` mode causes the file to be replaced if it already exists.
    with Connection(endpoint=hyper.endpoint,
                    database=hyper_database_path,
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        connection.execute_command(
            "CREATE TABLE products AS (SELECT * FROM external('products.parquet'))"
        )
From Tableau's official Github: https://github.com/tableau/hyper-api-samples/blob/main/Community-Supported/parquet-to-hyper/create_hyper_file_from_parquet.py
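If you want to sanity-check the result, a row count can be run inside the same with Connection(...) block; this check is an addition and is not part of the linked sample:

        # Optional check (not in the sample): count the rows copied from the parquet file.
        row_count = connection.execute_scalar_query("SELECT COUNT(*) FROM products")
        print(f"Loaded {row_count} rows into {hyper_database_path}")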

Best practice to handle and analyze RRD/RRA files with Python or SQL

I have a server with about 2000 RRD files that I need to analyze and visualize, so these RRD files need to be converted into some sort of format that I can query, whether that is a pandas DataFrame, an SQL database, or something else. They should preferably end up in a single database so that it's possible to cross-analyze the data.
I'm a complete newbie when it comes to working with RRD files. So the question is: how do I convert the RRD files into a pandas DataFrame? Or is there a better practice for handling and working with RRD files?
Here is a similar question by another user: Advice on manipulating RRD/XML files to use with pandas/python on windows
I've tried to export the files into an InfluxDB database. I'm still working on this, as I'm getting a lot of different errors that I'm working through; also, InfluxDB doesn't handle NaN values, and they are important for my research. I'm also not sure whether InfluxDB will even provide me with the tools that I need.
I tried to export XMLs with rrdtool, but the format is weird and a lot of data goes missing when importing it into a pandas DataFrame, even when using etree.
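One possible route (not from the original thread) is to read each file with the python-rrdtool bindings and concatenate everything into a single DataFrame; a minimal sketch, assuming the bindings are installed, the files live under /var/rrd, and each RRD has an AVERAGE consolidation function:

# Minimal sketch using the python-rrdtool bindings; the directory and the
# consolidation function ("AVERAGE") are assumptions.
import glob
import pandas as pd
import rrdtool

frames = []
for path in glob.glob("/var/rrd/*.rrd"):
    # fetch() returns ((start, end, step), data-source names, data rows).
    (start, end, step), names, rows = rrdtool.fetch(path, "AVERAGE")
    timestamps = [start + (i + 1) * step for i in range(len(rows))]
    df = pd.DataFrame(rows, columns=names,
                      index=pd.to_datetime(timestamps, unit="s"))
    df["source_file"] = path
    frames.append(df)

# NaN values are preserved here, unlike in InfluxDB.
combined = pd.concat(frames)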

DataFrame.write.parquet - Parquet-file cannot be read by HIVE or Impala

I wrote a DataFrame with PySpark into HDFS with this command:
from pyspark.sql.functions import col

df.repartition(col("year")) \
  .write.option("maxRecordsPerFile", 1000000) \
  .parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')
When taking a look at HDFS I can see that the files are sitting there as expected. However, when I try to read the table with Hive or Impala, the table cannot be found.
What's going wrong here, am I missing something?
Interestingly, df.write.format('parquet').saveAsTable("tablename") works properly.
It's expected behaviour from Spark:
df.write...parquet("...") writes the data to an HDFS location and won't create any table in the Hive metastore,
but df.write...saveAsTable("...") creates the table in Hive and writes the data to it.
In the case the table already exists, behavior of this function
depends on the save mode, specified by the mode function (default to
throwing an exception). When mode is Overwrite, the schema of the
DataFrame does not need to be the same as that of the existing table.
That's the reason why you are not able to find the table in Hive after performing df.write...parquet("...").
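If you want Hive/Impala to see files that were already written with .parquet(path), one option is to register an external table over that location and repair its partitions. This is a sketch, not part of the original answer; the column names and types are placeholders and must match the data that was actually written:

# Sketch: register existing parquet files as an external Hive table.
# The schema below is a placeholder and must match the written data.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
        id BIGINT,
        value DOUBLE
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION '/path/tablename'
""")

# Pick up the year=... partition directories that already exist on HDFS.
spark.sql("MSCK REPAIR TABLE tablename")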

Google Cloud Dataflow (Python): function to join multiple files

I am new to Google Cloud and know enough Python to write a few scripts; I am currently learning Cloud Functions and BigQuery.
My question:
I need to join a large CSV file with multiple lookup files and replace values from the lookup files.
I learnt that Dataflow can be used to do ETL, but I don't know how to write the code in Python.
Can you please share your insights?
Appreciate your help.
Rather than joining the data in Python, I suggest you separately extract and load the CSV and lookup data. Then run a BigQuery query that joins the data and writes the result to a permanent table. You can then delete the separately imported data.
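A rough sketch of that approach with the google-cloud-bigquery client; the bucket, project, dataset, table and column names are all placeholders:

# Sketch only: load two CSVs into staging tables, then join them into a
# permanent table. All resource names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Load the large CSV and the lookup CSV into staging tables.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
    skip_leading_rows=1,
)
client.load_table_from_uri(
    "gs://my-bucket/main.csv", "my_project.staging.main_data", job_config=load_config
).result()
client.load_table_from_uri(
    "gs://my-bucket/lookup.csv", "my_project.staging.lookup", job_config=load_config
).result()

# Join the staged tables and write the result to a permanent table.
query_config = bigquery.QueryJobConfig(
    destination="my_project.analytics.joined",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    """
    SELECT m.*, l.replacement_value
    FROM `my_project.staging.main_data` AS m
    LEFT JOIN `my_project.staging.lookup` AS l
      ON m.code = l.code
    """,
    job_config=query_config,
).result()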

How to_csv in Bluemix

We have a dataframe we are working with in an IPython notebook. Granted, it would be ideal if one could save a dataframe in such a way that the whole group could access it through their notebooks, and I'd love to know how to do that. However, could you help with the following specific problem?
When we do df.to_csv("Csv file name") it appears that the file is located in the exact same place as the files we placed in object storage to use in the IPython notebook. However, when one goes to Manage Files, it's nowhere to be found.
When one runs pd.DataFrame.to_csv(df), the text of the csv file is apparently given. However, when one copies that into a text editor (e.g. Sublime Text), saves it as a csv, and attempts to read it into a dataframe, the expected dataframe is not yielded.
How does one export a dataframe to csv format, and then access it?
I'm not familiar with Bluemix, but it sounds like you're trying to save a pandas dataframe in a way that all of your collaborators can access, and have it look the same way for everyone.
Maybe saving and reading from CSVs is messing up the formatting of your dataframe. Have you tried using pickling? Since pickling is based around python, it should give consistent results.
Try this:
import pandas as pd
pd.to_pickle(df, "/path/to/pickle/My_pickle")
and on the read side:
df_read = pd.read_pickle("/path/to/pickle/My_pickle")
