How to read hive partitioned table via pyspark - python

I'm new to Spark programming and have a question about how to read partitioned tables using PySpark.
Let us say we have a table partitioned as below:
~/$table_name/category=$category/year=$year/month=$month/day=$day
Now, I want to read data from all the categories, but want to restrict data by time period. Is there any way to specify this with wild cards rather than writing out all the individual paths?
Something to the effect of
table_path = ["~/$table_name/category=*/year=2019/month=03",
"~/$table_name/category=*/year=2019/month=04"]
table_df_raw = spark.read.option(
"basePath", "~/$table_name").parquet(*table_path)
Also, as a bonus, is there a more Pythonic way to specify time ranges that may span different years, rather than listing the paths individually?
Edit: To clarify a few things, I don't have access to the hive metastore for this table and hence can't access with just a SQL query. Also, the size of the data doesn't allow filtering post conversion to dataframe.

You can try this
Wildcards can also be used to specify a set or range of months:
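table_df_raw = (
    spark.read
    .option("basePath", "~/$table_name")
    # {a,b,c} matches any listed value; zero-padded to match the month=03 layout in the question
    .parquet("~/$table_name/category=*/year=2019/month={03,04,08}")
)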
Or
table_df_raw = (
    spark.read
    .option("basePath", "~/$table_name")
    # [3-4] matches a single character in that range, so this covers month=03 and month=04
    .parquet("~/$table_name/category=*/year=2019/month=0[3-4]")
)
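For the bonus question about ranges that cross a year boundary, one option is to generate the month paths in plain Python and pass them to the reader. This is a minimal sketch, not a Spark feature: the helper name month_partition_paths is hypothetical, and it assumes the zero-padded month=03 layout shown in the question.

```python
from datetime import date


def month_partition_paths(base_path, start, end):
    """Return one glob path per calendar month from start to end, inclusive."""
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # category=* keeps every category; the month is zero-padded (assumed layout)
        paths.append(f"{base_path}/category=*/year={year}/month={month:02d}")
        # Advance one month, rolling over into the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


table_paths = month_partition_paths("~/$table_name", date(2018, 11, 1), date(2019, 2, 1))

# With an active SparkSession named `spark`, the list is used like the paths above:
# table_df_raw = spark.read.option("basePath", "~/$table_name").parquet(*table_paths)
```

Because only the matching directories are listed, Spark never scans the other months, which is the point when the data is too large to filter after loading.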

Are you using a Hortonworks HDP cluster? If yes, try the HiveWarehouse connector. It allows Spark to access the Hive catalog. After that, you can run any Spark SQL command over Hive tables: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html
If you aren't using Hortonworks, I suggest you look at this link: https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql

Related

Export Bigquery table to gcs bucket into multiple folders/files corresponding to clusters

Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the BigQuery Python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (it is stored in a column of the BigQuery table), and then create the multiple files inside each folder.
The table is huge (1+ TB). It might fit in RAM, but that's not the idea, so I cannot naively loop over it with Python.
I read quite a lot of documentation and went through the parameters, but I can't find a clean solution. Did I miss something, or is there no Google solution?
My plan B is to use Apache Beam and Dataflow, but I don't have the skills yet, and I would like to avoid this solution as much as possible for simplicity and maintenance.
You have 2 solutions:
Create one export query per value. If you have 100 nomenclature values, query the table 100 times and export each result to its target directory. The issue is the cost: you pay for processing the table 100 times.
Use Apache Beam to extract the data and sort it. Then, with a dynamic destination, you can create all the GCS paths you want. The issue is that it requires Apache Beam skills.
There is an extra solution, similar to the second one, that uses Spark, and especially serverless Spark. If you are more skilled in Spark than in Apache Beam, it could be more efficient.
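The first solution can be sketched with BigQuery's EXPORT DATA statement, one statement per nomenclature value. This is only an illustration under assumptions: the table, column, and bucket names are placeholders, the helper build_export_statements is hypothetical, and the quoting is naive. Each statement is billed as its own query, which is the cost issue mentioned above.

```python
def build_export_statements(table, column, values, bucket):
    """Build one EXPORT DATA statement per value of `column`."""
    statements = []
    for value in values:
        # One folder per value; the single * lets BigQuery shard the output files
        uri = f"gs://{bucket}/{column}={value}/part-*.csv"
        # Naive quoting for illustration only; real values may need escaping
        statements.append(
            f"EXPORT DATA OPTIONS(uri='{uri}', format='CSV', overwrite=true) AS "
            f"SELECT * FROM `{table}` WHERE {column} = '{value}'"
        )
    return statements


# Placeholder names, not taken from the question
statements = build_export_statements(
    "my_project.my_dataset.my_table", "nomenclature", ["A", "B"], "my-bucket"
)

# Each statement could then be submitted with the Python client, for example:
# from google.cloud import bigquery
# client = bigquery.Client()
# for sql in statements:
#     client.query(sql).result()
```

If the table is clustered or partitioned on the nomenclature column, the filter may reduce the bytes scanned per statement, but that depends on the table design.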

Using Impala to select multiple tables with wildcard pattern and concatenate them

I'm starting with Impala SQL and Hadoop and have a (probably simple) question.
I have a Hadoop database with hundreds of tables that share the same schema and naming convention (e.g. process_1, process_2, process_3 and so on). How would I query all the tables and concatenate them into one big table or dataframe? Is it possible to do so using just Impala SQL that returns one dataframe in Python?
Something like:
SELECT * FROM 'process_*';
Or do I need to run SHOW TABLES 'process_*', loop in Python, and query each table separately?
If you are looking for a purely Impala solution, one approach would be to create a view on top of all of the tables, something like the following:
create view process_view_all_tables as
select * from process_1
union all
select * from process_2
union all
...
select * from process_N;
The disadvantages of this approach are as follows:
You need to union multiple tables together. Union is an expensive operation in terms of memory utilisation. It works fine for a small number of tables, say 2 to 5.
You need to add all the tables manually. If you add a new process table in the future, you need to ALTER the view to include it. This is a maintenance headache.
The view assumes that all the PROCESS tables are of the same schema.
In the second approach, as you said, you could get the list of tables from Impala using SHOW TABLES LIKE 'process*' and write a small program that iterates over the list and writes each table's rows to a file.
Once you have the file generated, you could copy it back to HDFS and create a table on top of it.
The only disadvantage of the second approach is that every iteration issues an Impala database request, which is particularly costly in a multi-tenant database environment.
In my opinion, you should try the second approach.
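A middle ground between the two approaches is to fetch the table list once and generate the UNION ALL text in Python, so that only one further query is sent. The sketch below only builds the SQL strings. The helper build_union_all_query is hypothetical, the table names are the examples from the question, and it assumes the names come from a trusted source such as SHOW TABLES, since they are interpolated directly.

```python
def build_union_all_query(table_names):
    """Combine same-schema tables into a single UNION ALL query."""
    if not table_names:
        raise ValueError("at least one table name is required")
    selects = [f"SELECT * FROM {name}" for name in table_names]
    return "\nUNION ALL\n".join(selects)


# In practice this list would come from SHOW TABLES LIKE 'process*'
tables = ["process_1", "process_2", "process_3"]
query = build_union_all_query(tables)

# The same text can also regenerate the view whenever a table is added
view_ddl = f"CREATE VIEW process_view_all_tables AS\n{query}"
```

The memory cost of the union itself does not go away, so this mainly addresses the manual maintenance of the view and the number of round trips.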
Hope this helps :)

Best way to read in part of a huge table to AWS GLUE

I'm having some trouble loading a large file from my data lake (currently stored in postgres) into AWS GLUE. It is 4.3 Billion rows.
In testing, I've found that the table is too large to be fully read in.
Here's how I'm loading the data frame:
large_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="DBNAME",
    table_name="TABLENAME",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="TABLECTX",
)
Important Factors
I don't need the whole data frame! I'll ultimately filter based on a couple of attributes and join with smaller tables.
I've already tried using a push_down_predicate, but that required the data to be stored in S3 using a specific folder organization and unfortunately I don't get to choose the pre-existing format of this table.
I've also tried reading in the table and simply re-organizing it into the S3 folder layout that push_down_predicate needs, but the process ends with "exit code 1" after 5 hours of running.
Primary Question
How can I read in part of a table without using a pushdown predicate?
You can also use pure spark/pyspark code in Glue and take advantage of its read methods.
Their documentation shows how to read from Redshift or, in general, any SQL DB through JDBC. You can even read data from a query, as in the following example:
# Read data from a query
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .load()
I have found that AWS Glue implements only a small fraction of Spark's functionality, so I recommend going with Spark/PySpark when you have something complex to work on.
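Since the table in the question lives in Postgres, the same idea can be sketched with Spark's generic JDBC source, pushing the filter into a subquery so that only the needed rows leave the database. This is a sketch under assumptions: the host, the filter, and the partition column id are placeholders, the helper jdbc_read_options is hypothetical, and a Postgres JDBC driver must be available to the job.

```python
def jdbc_read_options(url, query, partition_column, lower_bound, upper_bound, num_partitions):
    """Options for a partitioned Spark JDBC read of a filtered subquery."""
    return {
        "url": url,
        # Spark does not allow "query" together with "partitionColumn",
        # so the filter is wrapped as an aliased subquery in "dbtable"
        "dbtable": f"({query}) AS filtered",
        # Must be a numeric, date, or timestamp column returned by the subquery
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


# Placeholder connection details and filter, not taken from the question
options = jdbc_read_options(
    url="jdbc:postgresql://dbhost:5432/DBNAME",
    query="SELECT * FROM TABLENAME WHERE some_attribute = 'x'",
    partition_column="id",
    lower_bound=1,
    upper_bound=4_300_000_000,
    num_partitions=64,
)

# With an active SparkSession named `spark` (credentials omitted here):
# df = spark.read.format("jdbc").options(**options).load()
```

The bounds only decide how the range is split across the parallel reads; they do not filter rows, so the WHERE clause in the subquery is what limits the data.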
Unfortunately predicate pushdown works only for S3 as you figured out already.
Regarding the "exit code 1": is your data in S3 in raw CSV format? Can you try creating multi-part bzip2 or lz4 files? In that case, the load will be shared by multiple workers.
How many DPUs have you allocated for the task? This article gives a nice overview of DPU capacity planning.
Or you can create a view in Postgres and use that as source.
Please let me know if that helped.

Zeppelin: What is the best way to query data with SQL and work with it?

I want to use Zeppelin to query databases. I currently see two possibilities but none of them is sufficient for me:
Configure a database connection as "interpreter", name it e.g. "sql1", use it in a paragraph, run a sql query and use the inbuilt nice plotting tools. It seems that all the tutorials and tips deal with it but then the documentation suddenly stops! But I want to do more with the data: I want to filter and process. If I want to plot it again (with other limitations), I have to do the query (that may last some seconds or minutes) again (see my other question Zeppelin SQL: reuse data of query without another interpreter or a new query)
Use Spark with Python, Scala or similar. But the documentation only seems to show loading CSV data, putting it into a dataframe, and then accessing this dataframe with SQL. There is no example of accessing the data with SQL in the first place. What is the best way to access the SQL data? Can I use an already configured "interpreter" (database connection)?
You can use Zeppelin API to retrieve paragraph data:
val buffer = scala.io.Source.fromURL("http://XXXXX:9995/api/notebook/2CN2QP93H/paragraph/20170713-092810_1633770798").mkString
val df = sqlContext.read.json(sc.parallelize(buffer :: Nil)).select("body.text")
df.first.getAs[String](0)
These Spark Scala lines retrieve the SQL query used by a paragraph. I think you could do the same thing to get the results.
I cannot find a solution for 1. But I have put together a short solution for 2. that works within Zeppelin with Python (2.7), sqlalchemy (SQL wrapper), mysqldb (MySQL implementation) and pandas (make sure that you have these packages installed; all of them are in Debian 9). I wonder why I have not found such a solution before...
%python
from sqlalchemy import create_engine
import pandas as pd
sql = "select col1, col2 from table limit 10"
df = pd.read_sql(sql,
    create_engine('mysql+mysqldb://user:password@host:3306/database').connect())
z.show(df)
If you want to connect to another database like db2 or oracle, you have to use other python packages and adjust the first part in the create_engine string.
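The "first part" of the create_engine string is the dialect and driver, and the rest follows the pattern dialect+driver://user:password@host:port/database. Here is a small sketch of building such strings, written for Python 3 rather than the 2.7 used above. The helper make_db_url is hypothetical, the credentials are placeholders, and the driver packages named (mysqldb, psycopg2) have to be installed separately.

```python
from urllib.parse import quote_plus


def make_db_url(dialect_driver, user, password, host, port, database):
    """Assemble a SQLAlchemy-style database URL."""
    # Percent-encode the password so characters like '@' or '/' don't break the URL
    return f"{dialect_driver}://{user}:{quote_plus(password)}@{host}:{port}/{database}"


# Placeholder credentials; only the dialect+driver part differs per database
mysql_url = make_db_url("mysql+mysqldb", "user", "p@ss/word", "host", 3306, "database")
postgres_url = make_db_url("postgresql+psycopg2", "user", "secret", "host", 5432, "database")

# Either string can then be passed to sqlalchemy.create_engine(...)
```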

Multiple pandas users connecting to SQL DB

New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 gb of data in ~6gb .csv chunks.
School advised syncing data via dropbox, but this seemed impractical for a 4-person team.
So, current solution is AWS EC2 & RDS instance (MySQL, I think it'll be, 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the database. Once you read the data in, it sits in local memory as a copy, so the underlying records are not locked or frozen for other users. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing SQL tables, see the pandas documentation.
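The snapshot behaviour can be seen in a small experiment. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL RDS instance, with a made-up records table, so it illustrates the pandas side only and not MySQL's own locking.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the shared database (an assumption for illustration)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT)")
con.executemany("INSERT INTO records VALUES (?, ?)", [(1, "a"), (2, "b")])
con.commit()

# User A loads a snapshot into memory
df = pd.read_sql("SELECT * FROM records", con)

# User B changes a row in the database afterwards
con.execute("UPDATE records SET value = 'changed' WHERE id = 1")
con.commit()

# A's dataframe still holds the old value; the database holds the new one
snapshot_value = df.loc[df["id"] == 1, "value"].iloc[0]
db_value = con.execute("SELECT value FROM records WHERE id = 1").fetchone()[0]

# If A edits locally and writes the whole frame back, B's update is overwritten
df.loc[df["id"] == 2, "value"] = "edited locally"
df.to_sql("records", con, if_exists="replace", index=False)
final_rows = con.execute("SELECT value FROM records ORDER BY id").fetchall()
```

Writing back whole tables with if_exists="replace" is what silently discards the other user's change here, so a team would need to agree on who writes what, or write back only the rows they changed.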
