Using pyspark to connect to hive tables - python

I am trying to query a Hive table from pyspark.
I am using the below statements:
from pyspark.sql import HiveContext
HiveContext(sc).sql('from `dbname.tableName` select `*`')
I am very new to Hadoop systems.
I need to understand the correct way to pull some data from a Hive table into a DataFrame so that I can write the rest of my program against it.

sqlCtx.sql has access to Hive tables. You can use it the following way:
my_dataframe = sqlCtx.sql("Select * from employees")
my_dataframe.show()
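If you are on Spark 2.x or newer, a minimal sketch of the same idea uses a SparkSession with Hive support enabled (dbname.tableName below is just the placeholder name from the question):
from pyspark.sql import SparkSession

# Build (or reuse) a session that can see the Hive metastore
spark = SparkSession.builder \
    .appName('hive-read-example') \
    .enableHiveSupport() \
    .getOrCreate()

# Run the query against the Hive table and keep the result as a DataFrame
my_dataframe = spark.sql("SELECT * FROM dbname.tableName")
my_dataframe.show()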

Related

Run hive queries on empty dataframe using pyspark

I want to test my complex Hive queries beforehand by executing them on empty DataFrames using pyspark or pandas. How can I do this? I don't want to create a Hive connection; I just want to mock the tables as DataFrames and then execute the queries against them.
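One possible sketch (the table name, columns and types below are assumptions, not taken from the question): build empty DataFrames with explicit schemas and register them as temporary views, so spark.sql can parse and execute the query without any Hive connection.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('query-dry-run').getOrCreate()

# Hypothetical schema mirroring the real Hive table you want to mock
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# Empty DataFrame exposed under the table name the query expects
spark.createDataFrame([], schema).createOrReplaceTempView("employees")

# The query is parsed, analyzed and executed against the empty view;
# it returns zero rows but fails fast on syntax or column errors
spark.sql("SELECT id, name FROM employees WHERE id > 10").show()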

Local Spark config

I have created a local Spark environment in Docker. I intend to use this as part of a CI/CD pipeline for unit testing code executed in the Spark environment. I have two scripts which I want to use: one will create a set of persistent Spark databases and tables, and the other will read those tables. Even though the tables should be persistent, they only persist in that specific Spark session. If I create a new Spark session, I cannot access the tables, even though they are visible in the file system. Code examples are below:
Create db and table
Create_script.py
from pyspark.sql import SparkSession
def main():
    spark = SparkSession.builder.appName('Example').getOrCreate()
    columns = ["language", "users_count"]
    data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
    rdd = spark.sparkContext.parallelize(data)
    df = rdd.toDF(columns)
    spark.sql("create database if not exists schema1")
    df.write.mode("ignore").saveAsTable('schema1.table1')
Load Data
load_data.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('example').getOrCreate()
df = spark.sql("select * from schema1.table1")
I know there is a problem because when I run print(spark.catalog.listDatabases()) it can only find the database default. But if I import Create_script.py, then it will find the schema1 db.
How do I make persistent tables across all spark sessions?
The files in /repo/test/spark-warehouse are only the data of the tables, without the meta info of databases/tables/columns.
If you don't enable Hive, Spark uses an InMemoryCatalog, which is ephemeral, intended only for testing, and only available in the same Spark context. This InMemoryCatalog doesn't provide any function to load databases/tables from the file system.
So there are two ways:
Columnar Format
Use df.write.orc() (or Parquet) to write the data in ORC/Parquet format in your Create_script.py script; the ORC/Parquet format stores the column info alongside the data.
Then use spark.read.orc() to read it back, and createOrReplaceTempView if you need to use it in SQL.
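A rough pyspark sketch of this first option, reusing the data from the question (the output path is an assumption, adjust it to your repo layout):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Example').getOrCreate()

# In Create_script.py: write the DataFrame as ORC files instead of saveAsTable
df = spark.createDataFrame([("Java", "20000"), ("Python", "100000"), ("Scala", "3000")],
                           ["language", "users_count"])
df.write.mode("overwrite").orc("/repo/test/data/table1_orc")

# In load_data.py: read the ORC files back; the schema travels with the data
df2 = spark.read.orc("/repo/test/data/table1_orc")
df2.createOrReplaceTempView("table1")
spark.sql("select * from table1").show()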
Use Embedded Hive
You don't need to install Hive; Spark can work with an embedded Hive in just two steps:
Add the spark-hive dependency. (I'm using Java, which uses pom.xml to manage dependencies; I don't know how to do it in pyspark.)
SparkSession.builder().enableHiveSupport()
Then the data will be in /repo/test/spark-warehouse/schema1.db and the meta info will be in /repo/test/metastore_db, which contains the files of a Derby db. You can read or write tables across all Spark sessions.
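In pyspark the equivalent is roughly the sketch below; the pre-built Spark/pyspark distributions usually already ship the Hive classes, so the dependency step is often unnecessary there:
from pyspark.sql import SparkSession

# Enable the embedded Hive metastore; tables created with saveAsTable now persist
# across sessions because their metadata lives in the Derby-backed metastore_db
spark = (SparkSession.builder
         .appName('Example')
         .enableHiveSupport()
         .getOrCreate())

spark.sql("create database if not exists schema1")
print(spark.catalog.listDatabases())   # schema1 stays visible in new sessions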

python dataframe to postgresql

I imported a table from a SQL Server database as a DataFrame, and I am trying to export it as a PostgreSQL table.
This is what I am doing:
from sqlalchemy import create_engine
import psycopg2
engine = create_engine('postgresql://postgres:000000@localhost:5432/sinistrePY')
df.to_sql('table_name3', engine)
and this is the result:
the data integration is working fine, but
I get the table with read-only privileges
the data types are not as they should be
there is no primary key
I don't need the index column
How can I fix that and control how I want my table to be, either from my notebook or directly from the PostgreSQL server if needed? Thanks.
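A hedged sketch of one way to get more control over the generated table: pass dtype= and index=False to to_sql, then add the primary key with a plain ALTER TABLE afterwards (the column names and SQLAlchemy types below are assumptions):
from sqlalchemy import create_engine, types, text

engine = create_engine('postgresql://postgres:000000@localhost:5432/sinistrePY')

# df is the DataFrame imported from SQL Server in the question
df.to_sql(
    'table_name3',
    engine,
    if_exists='replace',
    index=False,                        # drop the pandas index column
    dtype={
        'id': types.Integer(),          # assumed column
        'amount': types.Numeric(12, 2), # assumed column
    },
)

# to_sql does not create a primary key, so add it afterwards
with engine.begin() as conn:
    conn.execute(text('ALTER TABLE table_name3 ADD PRIMARY KEY (id)'))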

How to use Spark BigQuery connector to join multiple tables and then fetch the data into dataframe?

I have to read three different BigQuery tables and then join them to get some data, which will be stored in a GCS bucket. I am using the Spark BQ connector.
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', bq_dataset + bq_table) \
    .load()
bqdf.createOrReplaceTempView('bqdf')
This reads the entire table data into a DataFrame. I know that I can apply a filter on the tables and also select only the required columns, then create three DataFrames and join them to get the output.
Is there any equivalent way to achieve this?
I have the option of using the BigQuery client API (https://googleapis.dev/python/bigquery/latest/index.html) and importing it from the pyspark script. However, if I can achieve that through the Spark BQ connector, I don't want to use the API call from the Python script.
Please help.
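A hedged sketch of the approach described above with the Spark BigQuery connector (the table names, columns, join key and bucket are assumptions): apply the filters and column selection at read time so they are pushed down to BigQuery, then join in Spark and write to GCS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('bq-join-example').getOrCreate()

# Filters and column selection applied here are pushed down by the connector
orders = (spark.read.format('bigquery')
          .option('table', 'my_dataset.orders')        # assumed table
          .load()
          .filter("order_date >= '2020-01-01'")
          .select('order_id', 'customer_id', 'amount'))

customers = (spark.read.format('bigquery')
             .option('table', 'my_dataset.customers')  # assumed table
             .load()
             .select('customer_id', 'name'))

regions = (spark.read.format('bigquery')
           .option('table', 'my_dataset.regions')      # assumed table
           .load()
           .select('customer_id', 'region'))

# Join in Spark and write the result to the GCS bucket (assumed path)
result = orders.join(customers, 'customer_id').join(regions, 'customer_id')
result.write.mode('overwrite').parquet('gs://my-bucket/output/')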

Zeppelin: What the best way to query data with SQL and work with it?

I want to use Zeppelin to query databases. I currently see two possibilities, but neither of them is sufficient for me:
Configure a database connection as an "interpreter", name it e.g. "sql1", use it in a paragraph, run a SQL query and use the built-in plotting tools. It seems that all the tutorials and tips deal with this, but then the documentation suddenly stops! I want to do more with the data: I want to filter and process it. If I want to plot it again (with other restrictions), I have to run the query (which may take some seconds or minutes) again (see my other question Zeppelin SQL: reuse data of query without another interpreter or a new query).
Use Spark with Python, Scala or similar. But the documentation only seems to load CSV data, put it into a dataframe and then access this dataframe with SQL; there is no example of accessing the SQL data in the first place. What is the best way to access the SQL data? Can I use an already configured "interpreter" (database connection)?
You can use Zeppelin API to retrieve paragraph data:
val buffer = scala.io.Source.fromURL("http://XXXXX:9995/api/notebook/2CN2QP93H/paragraph/20170713-092810_1633770798").mkString
val df = sqlContext.read.json(sc.parallelize(buffer :: Nil)).select("body.text")
df.first.getAs[String](0)
These Spark Scala lines will retrieve the SQL query used by a paragraph. You could do the same thing to get the results, I think.
I cannot find a solution for 1., but I have made a short solution for 2. that works within Zeppelin with Python (2.7), sqlalchemy (SQL wrapper), mysqldb (MySQL implementation) and pandas (make sure you have these packages installed; all of them are available in Debian 9). I wonder why I have not found such a solution before...
%python
from sqlalchemy import create_engine
import pandas as pd
sql = "select col1, col2 from table limit 10"
df = pd.read_sql(sql,
    create_engine('mysql+mysqldb://user:password@host:3306/database').connect())
z.show(df)
If you want to connect to another database like DB2 or Oracle, you have to use other Python packages and adjust the first part of the create_engine string.
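For example, a PostgreSQL connection (a sketch; user, password, host and database are placeholders, and the psycopg2 package must be installed) only changes the dialect+driver prefix of the engine string:
%python
from sqlalchemy import create_engine
import pandas as pd
sql = "select col1, col2 from table limit 10"
df = pd.read_sql(sql,
    create_engine('postgresql+psycopg2://user:password@host:5432/database').connect())
z.show(df)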
