How to create a managed Hive Table using pyspark - python

I am facing the problem that every table I create using pyspark ends up with type EXTERNAL_TABLE in Hive, but I want to create managed tables and don't know what I am doing wrong. I have tried different ways of creating those tables. For instance:
spark.sql('CREATE TABLE dev.managed_test(key int, value string) STORED AS PARQUET')
spark.read.csv('xyz.csv').write.saveAsTable('dev.managed_test2')
In both cases the resulting table is an EXTERNAL_TABLE. When I describe the table in Apache Hue or Beeline, I also find the property TRANSLATED_TO_EXTERNAL set to true.
Does anyone have an idea what could be wrong, or what I could do instead of the two options shown above? Maybe I am missing some configuration parameter?
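For reference, the table type Spark itself records can be checked directly from PySpark (a minimal check, assuming a Hive-enabled SparkSession; the 'Type' row and the tableType field are what distinguish MANAGED from EXTERNAL):
spark.sql('DESCRIBE FORMATTED dev.managed_test').show(100, truncate=False)
print([t.tableType for t in spark.catalog.listTables('dev') if t.name == 'managed_test'])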
Thank you!

Related

Is there a way of creating BigLake tables through Python?

In the documentation I see no reference to BigLake tables. I wonder if there's a way of setting ExternalDataConfiguration to use them.
Found it out: if you provide a connection ID, it will be used to set the table up as a BigLake table (see https://stackoverflow.com/a/73987775/9944075 for how to create the connection).
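A minimal sketch of what that looks like with the google-cloud-bigquery client (project, dataset, bucket and connection names below are placeholders; this assumes a client version in which ExternalConfig exposes connection_id):
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig('CSV')                  # source format of the external data
external_config.source_uris = ['gs://my-bucket/data/*.csv']
external_config.connection_id = 'my-project.us.my-connection'     # providing a connection makes it a BigLake table

table = bigquery.Table('my-project.my_dataset.my_biglake_table')
table.external_data_configuration = external_config
client.create_table(table)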

Relational DB - separate joined tables

Is there any way to join tables from a relational database and then separate them again?
I'm working on a project that involves modifying the data after having joined it. Unfortunately, modifying the data before the join is not an option. I would then want to separate the data according to the original schema.
I'm stuck at the separating part. I have metadata (python dictionary) with the information on the tables (primary keys, foreign keys, fields, etc.).
I'm working with Python, so a Python solution would be greatly appreciated. An SQL solution would also help.
Edit: Perhaps the question was unclear. To summarize, I would like to create a new database with a schema identical to the old one. I do not want to make any modifications to the original database. The data that makes up the new database must originally be in a single table (the result of a join of the old tables), because the operations that need to be run must be run on a single table; I cannot run these operations on individual tables, as the outcome would not be as desired.
I would like to know if this is possible and, if so, how I can achieve it.
Thanks!
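One possible approach, sketched with pandas (purely illustrative; the metadata layout, the key names, and the assumption that joined columns keep their original names are all assumptions):
import pandas as pd

# joined: the single modified table; metadata: dict describing the original schema,
# e.g. {'customers': {'fields': ['customer_id', 'name'], 'primary_key': ['customer_id']}, ...}
def split_joined(joined: pd.DataFrame, metadata: dict) -> dict:
    tables = {}
    for table_name, info in metadata.items():
        cols = info['fields']
        pk = info['primary_key']
        # project onto the original columns and drop the row duplication introduced by the join
        tables[table_name] = (
            joined[cols]
            .drop_duplicates(subset=pk)
            .reset_index(drop=True)
        )
    return tables
Each resulting frame could then be written to the new database, e.g. with DataFrame.to_sql.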

How to read hive partitioned table via pyspark

I'm new to Spark programming and have a question about how to read partitioned tables using pyspark.
Let us say we have a table partitioned as below:
~/$table_name/category=$category/year=$year/month=$month/day=$day
Now, I want to read data from all the categories, but want to restrict the data by time period. Is there any way to specify this with wildcards rather than writing out all the individual paths?
Something to the effect of
table_path = ["~/$table_name/category=*/year=2019/month=03",
              "~/$table_name/category=*/year=2019/month=04"]
table_df_raw = spark.read.option("basePath", "~/$table_name").parquet(*table_path)
Also, as a bonus, is there a more pythonic way to specify time ranges that may fall in different years, rather than listing the paths individually?
Edit: To clarify a few things, I don't have access to the Hive metastore for this table and hence can't access it with just a SQL query. Also, the size of the data doesn't allow filtering after conversion to a dataframe.
You can try this
Wildcards can also be used to specify a range of months or days:
table_df_raw = spark.read \
    .option("basePath", "~/$table_name") \
    .parquet("~/$table_name/category=*/year=2019/month={3,4,8}")
Or
table_df_raw = spark.read \
    .option("basePath", "~/$table_name") \
    .parquet("~/$table_name/category=*/year=2019/month=[3-4]")
Are you using a Hortonworks HDP cluster? If so, try the Hive Warehouse Connector. It allows Spark to access the Hive catalog, and after that you can run any Spark SQL command over Hive tables: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html
If you aren't using Hortonworks, I suggest you look at this link: https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql
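Very roughly, usage looks like this (a sketch based on the HDP documentation; the query is illustrative):
from pyspark_llap import HiveWarehouseSession

# build a session that talks to Hive through the Hive Warehouse Connector
hive = HiveWarehouseSession.session(spark).build()
df = hive.executeQuery("SELECT * FROM db.partitioned_table WHERE year = 2019 AND month IN (3, 4)")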

Python, SQLAlchemy, MySQL: Insert data between existing records

Unfortunately I couldn't find any useful information on this topic. I have an existing database with existing tables and also existing data in it. Now I have to add new data in between the existing data. My code would look something like this, but it doesn't work:
INSERT INTO table_name(data) VALUES('xyz')
WHERE DATETIME(datetime) > DATETIME('2017-01-01 02:00:00');
I have created an image for a better understanding of my question.
Please note that I need the primary key to adapt to the changes, as you can see in the picture. My tools are Python, SQLAlchemy and MySQL. I look forward to any help.
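For what it's worth, plain INSERT has no WHERE clause. A sketch of one way to open a gap in the key and fill it via SQLAlchemy (this assumes an integer primary key named id and a datetime column named datetime, and that renumbering the key is really acceptable, which it usually is not once other tables reference it):
from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@localhost/mydb')  # placeholder credentials

with engine.begin() as conn:
    # shift later rows up by one id, highest id first so the unique key is never violated
    conn.execute(
        text("UPDATE table_name SET id = id + 1 WHERE datetime > :cutoff ORDER BY id DESC"),
        {"cutoff": "2017-01-01 02:00:00"},
    )
    # insert the new row into the id gap that was just opened
    conn.execute(
        text("INSERT INTO table_name (id, datetime, data) "
             "SELECT MIN(id) - 1, :dt, :data FROM table_name WHERE datetime > :cutoff"),
        {"dt": "2017-01-01 02:30:00", "data": "xyz", "cutoff": "2017-01-01 02:00:00"},
    )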

Python or SQL: Populating an excel form (multiple times and saving outputs) from another table

Problem: A customer has requested that we fill out a form (Excel) for each item we provide them. Since we supply them a large number of parts, I would like to figure out a way to automate this as much as possible.
Idea: Create a table ('Data') with each part number and the relevant information in the columns. Use Python to read the 'Data' table, open the blank customer form, populate it, and then save the filled-in form.
Questions:
Can SQL accomplish this task as well? In relation to this task, I've only really created flat table outputs with SQL. Not really sure how this would work.
Recommended Python packages / documentation?
Similar example with code available? Just helps me learn being able to walk through something.
Any other ideas? Maybe I am tackling this issue the wrong way.
I am just unsure of my best path of action.
You could create a simple table on your SQL system (PostgreSQL, MySQL), so you can easily add and modify your items.
Then you can export your table to a CSV file (which Excel can open) with:
Copy (Select * From foo) To '/tmp/test.csv' With CSV DELIMITER ',';
You can also do it with Python, but I think it's more complicated to update items that way; with an SQL system you could create an HTML/PHP front-end page, making it more customizable.
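On the Python side, a minimal sketch of the fill-and-save loop with openpyxl (file names, the sheet name, cell addresses and the three-column layout are placeholders for whatever the customer's form and the 'Data' export actually contain):
import csv
from openpyxl import load_workbook

# read the exported 'Data' table, one row per part
with open('/tmp/test.csv', newline='') as f:
    parts = list(csv.reader(f))

for part_number, description, quantity in parts:
    wb = load_workbook('customer_form_template.xlsx')  # blank form supplied by the customer
    ws = wb['Sheet1']
    ws['B2'] = part_number                             # cell addresses depend on the form layout
    ws['B3'] = description
    ws['B4'] = quantity
    wb.save(f'forms/{part_number}.xlsx')               # one filled-in form per part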
