I have tables in different Impala databases, stored as Parquet files and structured as below. I'm trying to figure out a good way to scan all the table names and column names across all databases; from there I hope to check whether the table or column names contain certain values, and if so, read those values.
I understand that there are Impala queries like DESCRIBE database.tablename, but with all the other processing involved, I'd like to do this in a Spark job. Could someone please shed some light on this? Many thanks.
database1.tableOne
database1.tableTwo
database2.tableThree
....
You can connect to Impala from Spark through the JDBC data source, passing a DESCRIBE statement via the query option. Reading the DataFrame returned by the JDBC source then gives you the column information.
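A rough sketch of that approach (the host, port, and driver class are placeholders for your environment, and since Spark wraps the query option in a sub-select, you should verify that your Impala JDBC driver accepts a DESCRIBE statement this way):

columns_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:impala://your-impala-host:21050/default")  # placeholder host and port
    .option("driver", "com.cloudera.impala.jdbc41.Driver")          # class name depends on your driver jar
    .option("query", "DESCRIBE database1.tableOne")
    .load()
)
columns_df.show()  # one row per column: name, type, comment

Looping this over the results of SHOW DATABASES and SHOW TABLES would let you scan every table and then filter on the name columns.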
Is there any way to join tables from a relational database and then separate them again?
I'm working on a project that involves modifying the data after having joined it. Unfortunately, modifying the data before the join is not an option. I would then want to separate the data according to the original schema.
I'm stuck at the separating part. I have metadata (python dictionary) with the information on the tables (primary keys, foreign keys, fields, etc.).
I'm working with Python, so if you have a solution in Python, it would be greatly appreciated. If an SQL solution works, that also helps.
Edit: Perhaps the question was unclear. To summarize, I would like to create a new database with an identical schema to the old one. I do not want to make any modifications to the original database. The data that makes up the new database must originally be in a single table (the result of a join of the old tables), as the operations that need to be run must be run on a single table; I cannot run these operations on individual tables, because the outcome would not be as desired.
I would like to know if this is possible and, if so, how can I achieve this?
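For concreteness, the separation step I have in mind would look roughly like the sketch below (done with pandas; the table and column names are made up, but my metadata dictionary has this shape, with the fields and primary key of each table):

import pandas as pd

# Made-up metadata with the same shape as mine.
metadata = {
    "customers": {"primary_key": "customer_id", "fields": ["customer_id", "name"]},
    "orders": {"primary_key": "order_id", "fields": ["order_id", "customer_id", "amount"]},
}

def split_joined(joined, metadata):
    # Project the joined frame back onto each table's columns and de-duplicate on its primary key.
    tables = {}
    for table_name, info in metadata.items():
        tables[table_name] = (
            joined[info["fields"]]
            .drop_duplicates(subset=info["primary_key"])
            .reset_index(drop=True)
        )
    return tables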
Thanks!
I am facing the problem that every table I create using PySpark has type EXTERNAL_TABLE in Hive, but I want to create managed tables and don't know what I am doing wrong. I have tried different ways to create those tables. For instance:
spark.sql('CREATE TABLE dev.managed_test(key int, value string) STORED AS PARQUET')
spark.read.csv('xyz.csv').write.saveAsTable('dev.managed_test2')
In both cases the resulting table is an EXTERNAL_TABLE. When I describe the table in Apache Hue or Beeline, I also find that the property TRANSLATED_TO_EXTERNAL is true.
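For reference, the table type can also be checked from PySpark itself (DESCRIBE FORMATTED is standard Spark SQL; the table is the one created in the first snippet):

spark.sql('DESCRIBE FORMATTED dev.managed_test') \
    .filter("col_name = 'Type'") \
    .show(truncate=False)
# The 'Type' row shows how the catalog registered the table (MANAGED vs. EXTERNAL).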
Does anyone have an idea what could be wrong, or what I could do instead of the two options shown above? Maybe I am missing some configuration parameter?
Thank you!
I have some large (500+ MB) .CSV files that I need to import into a PostgreSQL database.
I am looking for a script or tool that helps me to:
Generate the SQL CREATE TABLE code for the columns, ideally taking into account the data in the .CSV file in order to choose the optimal data type for each column.
Use the headers of the .CSV as the column names.
It would be perfect if such functionality existed in PostgreSQL or could be added as an add-on.
Thank you very much
You can use the open source tool pgfutter (available on GitHub) to create a table from your CSV file.
PostgreSQL also has COPY functionality; however, COPY expects that the table already exists.
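If you would rather stay in Python, a rough sketch of the same idea (the connection string, file name, and table name are placeholders; pandas infers the column types from a sample of the rows and takes the column names from the CSV header):

import pandas as pd
from sqlalchemy import create_engine

# Placeholders: adjust the connection string, file name, and table name to your setup.
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
csv_file = 'large_file.csv'

# Infer column types from a sample so the whole 500+ MB file is not loaded at once,
# and print the CREATE TABLE statement pandas would generate from the header.
sample = pd.read_csv(csv_file, nrows=10000)
print(pd.io.sql.get_schema(sample, 'my_table', con=engine))

# Load the full file in chunks, appending to the table.
for chunk in pd.read_csv(csv_file, chunksize=100000):
    chunk.to_sql('my_table', engine, if_exists='append', index=False)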
I'm new to Spark programming and have a question about how to read partitioned tables using PySpark.
Let us say we have a table partitioned as below:
~/$table_name/category=$category/year=$year/month=$month/day=$day
Now, I want to read data from all the categories, but restrict the data by time period. Is there any way to specify this with wildcards rather than writing out all the individual paths?
Something to the effect of
table_path = ["~/$table_name/category=*/year=2019/month=03",
"~/$table_name/category=*/year=2019/month=04"]
table_df_raw = spark.read.option(
"basePath", "~/$table_name").parquet(*table_path)
Also, as a bonus, is there a more Pythonic way to specify time ranges that may fall in different years, rather than listing the paths individually?
Edit: To clarify a few things, I don't have access to the Hive metastore for this table and hence can't access it with just a SQL query. Also, the size of the data doesn't allow filtering after conversion to a DataFrame.
You can try this. Wildcards and glob patterns can also be used to select a set of months (the month values below are zero-padded to match the month=03 / month=04 directory names in the question):
table_df_raw = spark.read \
    .option("basePath", "~/$table_name") \
    .parquet("~/$table_name/category=*/year=2019/month={03,04,08}")
Or
table_df_raw = spark.read \
    .option("basePath", "~/$table_name") \
    .parquet("~/$table_name/category=*/year=2019/month=0[3-4]")
Are you using a Hortonworks HDP cluster? If yes, try the HiveWarehouse connector. It allows Spark to access the Hive catalog; after that, you can run any Spark SQL command over Hive tables: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html
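A minimal sketch of how the connector is typically used from PySpark, based on the pattern in the linked article (the module and method names are the ones documented for HDP 3 and may differ between connector versions; some_db.some_table is a placeholder):

from pyspark_llap import HiveWarehouseSession

# Build a Hive session on top of the existing SparkSession (requires the HWC jar/zip to be configured on the cluster).
hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
df = hive.executeQuery("SELECT * FROM some_db.some_table LIMIT 10")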
If you aren't using Hortonworks, I suggest you look at this link: https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql
So I am building a database for a larger program and do not have much experience in this area of coding (mostly embedded systems programming). My task is to import a large Excel file into Python. Since it is large, I'm assuming I must convert it to a CSV, then truncate it by parsing and partitioning, and then import it, to avoid my computer crashing. Once the file is imported I must be able to extract/search specific information based on the column titles. There are other user-interactive aspects that are simply string based, so not very difficult. As for the rest, I am getting the picture but would like a more efficient and specific design. Can anyone offer me guidance on this?
An Excel or CSV file can be read into Python using pandas. The data is stored as rows and columns in a structure called a DataFrame. To import data this way, import pandas first and then read the CSV or Excel file into a DataFrame.
import pandas as pd
df1 = pd.read_csv('excelfilename.csv')
This DataFrame structure is similar to a database table, and you can join different DataFrames, group the data, and so on.
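For example (the second file and the column names here are made up, just to illustrate the join and group operations):

df2 = pd.read_csv('other_file.csv')                    # hypothetical second file
joined = df1.merge(df2, on='id', how='inner')          # SQL-style join on a shared 'id' column
summary = joined.groupby('category')['amount'].sum()   # total 'amount' per 'category'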
I am not sure if this is what you need, let me know if you need any further clarifications.
I would recommend actually loading it into a proper database such as MariaDB or PostgreSQL. This will allow you to access the data from other applications, and it takes the load of writing a database off of you. You can then use an ORM if you would like to interact with the data, or simply use plain SQL via Python.
Read the CSV:
import sqlite3
import pandas as pd
df = pd.read_csv('sample.csv')
Connect to a database:
conn = sqlite3.connect("Any_Database_Name.db")  # if the db does not exist, this creates an Any_Database_Name.db file in the current directory
Store your table in the database:
df.to_sql('Some_Table_Name', conn)
Read a SQL query out of your database and into a pandas DataFrame:
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)