I want to convert a parquet file to the Hyper file format using Python. There is the following GitHub sample for this - https://github.com/tableau/hyper-api-samples/blob/main/Community-Supported/parquet-to-hyper/create_hyper_file_from_parquet.py.
But in that sample the parquet format/schema is known beforehand. What should I do if I want it to work for any parquet file, irrespective of the schema?
About me, I mostly work in analytics and data science with Python but wanted to work on this project to make some files accessible to Tableau. Thank you in advance and please let me know if you want any more information.
If you do not want to define a schema up front when creating a .hyper file from a parquet file, you can use the CREATE TABLE command instead of the COPY command.
With CREATE TABLE ... AS SELECT, Hyper derives the columns from the parquet file itself, so you can skip the explicit schema and table definition like this:
from tableauhyperapi import HyperProcess, Telemetry, Connection, CreateMode

# Start the Hyper process.
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
    # Open a connection to the Hyper process. This will also create the new Hyper file.
    # The `CREATE_AND_REPLACE` mode causes the file to be replaced if it
    # already exists.
    with Connection(endpoint=hyper.endpoint,
                    database=hyper_database_path,
                    create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
        # The table's columns are derived from the parquet file itself.
        connection.execute_command("CREATE TABLE products AS (SELECT * FROM external('products.parquet'))")
From Tableau's official Github: https://github.com/tableau/hyper-api-samples/blob/main/Community-Supported/parquet-to-hyper/create_hyper_file_from_parquet.py
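As a follow-up, here is a minimal sketch of how the same idea can be wrapped up so it works for any parquet file without knowing its schema; the function, its parameters and the table name "extract" are my own assumptions, only the CREATE TABLE ... external(...) pattern comes from the sample above:

from tableauhyperapi import (HyperProcess, Telemetry, Connection, CreateMode,
                             TableName, escape_string_literal)

def parquet_to_hyper(parquet_path, hyper_database_path, table_name="extract"):
    # Hypothetical helper: Hyper infers the columns from the parquet file,
    # so no schema has to be declared for any input file.
    with HyperProcess(telemetry=Telemetry.DO_NOT_SEND_USAGE_DATA_TO_TABLEAU) as hyper:
        with Connection(endpoint=hyper.endpoint,
                        database=hyper_database_path,
                        create_mode=CreateMode.CREATE_AND_REPLACE) as connection:
            connection.execute_command(
                f"CREATE TABLE {TableName(table_name)} AS "
                f"(SELECT * FROM external({escape_string_literal(parquet_path)}))")

parquet_to_hyper("any_file.parquet", "any_file.hyper")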
Related
I have an old Delphi program that exports electrical measurement data from a device as .log files, which are a database dump. I need to convert them to .csv or any other easily accessible data format.
As I have no access to the source code, my options for finding out which database format was actually used are limited. I know the author was using UniDAC and that the password for database access is "admin".
When I try to open the .log files in Python's binary read mode, they don't make much sense to me, as I don't know the correct encoding. Here is an example line:
b'\xa0=\xb1\xfc\xaa<\xb0\xd6\x8e=\xbbr\xd7<\xde\x92\x1a=\x82\xab"=|5%>o\xa4\xbe=\xce:\xe3>\xdf\xb9\xe2=~\x14\x8b=\xbb\xac(=-\x8bE=g\r\x03=\xeb\xc9-=\xcb\x81\xbf</(\x86=\xb8\x07\xad<\xaf\xd0)=\x94\x18\xa6<j#\x05=N\xd0\xa7<\xfe\x92\xbd=\xf0I\xfa<\x9dl\xec<\x08t\xba<9(\xf5<R}\xc8<E\xecI=?\x17\xaa=\xd9\xf4\x01=\xeet\x92=}\xb8\x82<>D\xe1<\x83w\x99<\xab\xd7\xe9<\x92\n'
As you can see, it is a mixture of hex escapes and ASCII characters, sometimes even containing comprehensible text headers.
I have also tried opening the database with pyodbc:
import pyodbc as db
conn = db.connect('DRIVER={SQL Server};DBQ={link_to_.log};UID=admin;PWD=admin')
But as I am not sure which database format UniDAC was using, this doesn't work.
I am not used to working with databases, so even basic information is welcome. Which formats do I need to try with pyodbc when I want to open UniDAC databases? Do I need to find the correct encoding first? Are there other possibilities for making the .log files accessible?
I can upload an example .log if someone is interested in this problem.
I am trying to save an external table from PySpark in parquet format and I need to compress it. The PySpark version I am using is 2.4.7. After the initial creation I update the table by appending data in a loop.
So far I have set the following options:
.config("spark.sql.parquet.compression.codec", "snappy") df.write.mode("append").format("parquet").option("compression","snappy").saveAsTable(...) df.write.mode("overwrite").format("parquet").option("compression","snappy").saveAsTable(...)
Is there anything else that I need to set or am I doing something wrong?
Thank you
I wrote a DataFrame with pySpark into HDFS with this command:
df.repartition(col("year"))\
.write.option("maxRecordsPerFile", 1000000)\
.parquet('/path/tablename', mode='overwrite', partitionBy=["year"], compression='snappy')
When I take a look into HDFS I can see that the files are properly lying there. However, when I try to read the table with Hive or Impala, the table cannot be found.
What's going wrong here, am I missing something?
Interestingly, df.write.format('parquet').saveAsTable("tablename") works properly.
This is expected behaviour from Spark:
df.write...parquet(...) writes the data to the HDFS location and won't create any table in the Hive metastore,
but df.write...saveAsTable(...) creates the table in Hive and writes the data to it.
In the case the table already exists, behavior of this function
depends on the save mode, specified by the mode function (default to
throwing an exception). When mode is Overwrite, the schema of the
DataFrame does not need to be the same as that of the existing table.
That's the reason why you are not able to find the table in Hive after performing df.write...parquet(...).
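If you want to keep the df.write...parquet(...) approach, one option (a sketch based on my assumptions about your path and table name, not part of the answer above) is to register the existing parquet directory in the metastore afterwards:

# Hypothetical follow-up: the files already sit at /path/tablename (the path from
# the question); register them as an unmanaged table so Hive can see them,
# then recover the existing year=... partition directories.
# `spark` is a SparkSession created with Hive support enabled.
spark.catalog.createTable("tablename", path="/path/tablename", source="parquet")
spark.sql("MSCK REPAIR TABLE tablename")  # discover the existing partitions

Impala may additionally need an INVALIDATE METADATA tablename before the new table shows up.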
I have a very large CSV file (let's say 1TB) that I need to get from GCS onto BQ. While BQ does have a CSV loader, the CSV files that I have are pretty non-standard and don't end up loading properly into BQ without formatting them first.
Normally I would download the csv file onto a server to 'process it' and save it either directly to BQ or to an avro file that can be ingested easily by BQ. However, the file(s) are quite large and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for using Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts to do so would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
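To give a feel for how small such a pipeline can be, here is a rough Python/Beam sketch; the bucket, table, schema and the parse_line logic are placeholders for your own formatting rules, not something taken from the repo linked above:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Replace this with whatever cleanup your non-standard CSV actually needs.
    fields = line.split(",")
    return {"id": fields[0], "name": fields[1]}

options = PipelineOptions(
    runner="DataflowRunner",           # or DirectRunner for a local test
    project="my-project",              # placeholder
    region="us-central1",              # placeholder
    temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/huge.csv")
     | "Clean" >> beam.Map(parse_line)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",
           schema="id:STRING,name:STRING",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))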
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
As per the BQ documentation, BigQuery tries to parallelize large CSV loads into tables. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file to be loaded from GCS into BQ is 5 TB, which is way above your requirement. I think you should be good with this.
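For completeness, the same load can also be started from Python with the google-cloud-bigquery client if a script fits your workflow better than the CLI; the dataset, table and URI below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,                    # or pass an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)

load_job = client.load_table_from_uri(
    "gs://bucket/file.csv", "my_dataset.my_table", job_config=job_config)
load_job.result()  # wait for the load job to finish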
Everything explained above is in the context of an ETL process. I have a git repository full of SQL files. I need to put all those SQL files (once pulled) into a SQL table with 2 columns, name and query, so that I can access each file later on using a SQL query instead of loading it from the file path. How can I do this? I am free to use whatever tool I want, but I only know Python and Pentaho.
Maybe my assumption that this method would take less time than simply reading the pulled file from the hard drive is wrong. In that case, let me know.
You can first define the table you're interested in with something along the lines of the following (you did not mention which database you are using):
CREATE TABLE queries (
name TEXT PRIMARY KEY,
query TEXT
);
After creating the table, you can use something like os.walk to iterate through the files in your repository and insert both the name of each file and its contents (e.g. file.read()) into the table you created previously, as in the sketch below.
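A rough sketch of that loading step, assuming SQLite for concreteness and a local checkout at repo_path (both assumptions; adapt the connection and placeholder syntax to your actual database):

import os
import sqlite3

repo_path = "path/to/repo"               # hypothetical location of the pulled repository
conn = sqlite3.connect("queries.db")
conn.execute("CREATE TABLE IF NOT EXISTS queries (name TEXT PRIMARY KEY, query TEXT)")

for root, _dirs, files in os.walk(repo_path):
    for filename in files:
        if filename.endswith(".sql"):
            with open(os.path.join(root, filename), encoding="utf-8") as f:
                conn.execute("INSERT OR REPLACE INTO queries (name, query) VALUES (?, ?)",
                             (filename, f.read()))
conn.commit()
conn.close()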
It sounds like you're trying to solve a different problem though. It seems like you're interested in speeding up some process, because you asked about whether accessing queries using a table would be faster than opening a file on disk. To investigate that (separate!) question further, see this.
I would recommend that you profile the existing process you are trying to speed up, using profiling tools, and check whether I/O really is your bottleneck; otherwise you may do all of this work without any benefit. A quick starting point is sketched below.
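For example (run_queries is a hypothetical stand-in for whatever function currently loads and runs your SQL files):

import cProfile

def run_queries():
    ...  # the code path you want to measure

cProfile.run("run_queries()", sort="cumulative")  # prints time per call, slowest first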
As a side note, if you are looking up queries in this way, it may indicate that you need to rearchitect your application. Please consider that possibility as well.