PySpark: Not able to read Hive ORC table using spark.sql

I used Spark to write a DataFrame to HDFS:
df.write.partitionBy("date").mode("append").format("ORC").save("/tmp/table1")
I created a Hive external table (default.table1) on top of this location.
I am able to read this table using beeline:
select * from default.table1; --> works fine
I am also able to read this folder directly with Spark:
spark.read.orc("/tmp/table1").show() # --> works fine
However, when I use Spark to read this Hive table, I get an error:
spark.sql("select * from default.table1").show() # --> error
Also, when I take a count of this table, it works fine:
spark.sql("select count(*) from default.table1").show() # --> works fine
Also, when I write the Spark DataFrame to HDFS as CSV, I have no issues reading it back through spark.sql (Hive).
Following is the error message:
"Py4JJavaError: An error occurred while calling 0192.showString."

I managed to fix the issue by fixing the Hive DDL.
The DDL must include the line
"WITH SERDEPROPERTIES ..." to make sure the HiveContext is able to read the data in Spark.

Related

Can't associate temp view with database in spark session

I'm trying to create a temp view in Spark from a CSV file.
To reproduce my production scenario, I need to test my script locally; however, in production I'm using AWS Glue Jobs, where there are databases and tables.
In the code below, I create a database in my Spark session and use it; after that, I create a temp view.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pulsar_data").getOrCreate()

df = spark.read.format('csv') \
    .options(inferSchema=True) \
    .options(header=True) \
    .load('pulsar_stars.csv')

spark.sql('CREATE DATABASE IF NOT EXISTS MYDB')
spark.sql('USE MYDB')

df.createOrReplaceTempView('MYDB.TB_PULSAR_STARS')

spark.catalog.listTables()
spark.sql('SELECT * FROM MYDB.TB_PULSAR_STARS').show()
However, when I try to select db.table, Spark can't find the relation between my temp view and my database and throws the following error:
*** pyspark.sql.utils.AnalysisException: Table or view not found: MYDB.TB_PULSAR_STARS; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [MYDB, TB_PULSAR_STARS], [], false
Debugging my code with pdb, I listed my Spark session catalog, where I can see that my table is in fact associated with the database:
(Pdb) spark.catalog.listTables()
[Table(name='tb_pulsar_stars', database='MYDB', description=None, tableType='TEMPORARY', isTemporary=True)]
How can I make this relationship work?
The temporary view name associated with a DataFrame can only be a single segment. This is explicitly checked in the Spark source code. I would expect your code to throw AnalysisException: CREATE TEMPORARY VIEW or the corresponding Dataset APIs only accept single-part view names, but got: MYDB.TB_PULSAR_STARS; I'm not sure why it behaves a bit differently in your case.
Anyway, use:
df.createOrReplaceTempView('TB_PULSAR_STARS')
spark.sql('SELECT * FROM TB_PULSAR_STARS').show()
And if you need to actually write this data to a table, create it using:
spark.sql("CREATE TABLE MYDB.TB_PULSAR_STARS AS SELECT * FROM TB_PULSAR_STARS")

spark.read parquet into a dataframe gives null values

I'm new to PySpark and, long story short:
I have a Parquet file and I am trying to read it and use it with Spark SQL, but currently I can only:
Read the file with a schema, but it gives NULL values (spark.read.format)
Read the file without a schema, where the first row's values appear as column names (read_parquet)
I have a Parquet file "locations.parquet" and this schema:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

location_schema = StructType([
    StructField("loc_id", IntegerType()),
    StructField("descr", StringType())
])
I have been trying to read the parquet file:
location_df = spark.read.format('parquet') \
    .options(header='false') \
    .schema(location_schema) \
    .load("data/locations.parquet")
Then I put everything into a temp table to run queries:
location_df.registerTempTable("location")
But after trying to run a query:
query = spark.sql("select * from location")
query.show(100)
It gives me NULL values.
The Parquet file itself is correct, since this runs successfully:
great = pd.read_parquet('data/locations.parquet', engine='auto')
But the problem with read_parquet (from my understanding) is that I cannot set a schema like I did with spark.read.format.
If I use spark.read.format with CSV, it also runs successfully and returns data.
Any advice is greatly appreciated, thanks.
location_df.schema gives me : StructType(List(StructField(862,LongType,true),StructField(Animation,StringType,true)))
Your file does not contain the columns you are expecting. The inferred schema shows two fields named 862 (LongType) and Animation (StringType), which don't match the loc_id/descr schema you supplied; that explains why your first example returns nulls (the names and types don't line up). It would also throw errors if you started selecting loc_id or descr individually.
You can download a parquet-tools utility separately to inspect the file data and print the file schema without Spark. Start debugging there.
If you don't give the Spark reader a schema, that information is already inferred from the file itself, since Parquet files embed their schema (in the footer of the file, not a header row).
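A minimal sketch of that approach, assuming the file really does contain the two flat columns shown in the inferred schema; the target names are just the ones the asker said they expected:

# Let Spark infer the schema from the Parquet footer, then inspect it
location_df = spark.read.parquet("data/locations.parquet")
location_df.printSchema()

# If the columns are really named "862" and "Animation", rename/cast them as a stopgap
from pyspark.sql.functions import col
renamed_df = location_df.select(
    col("862").cast("int").alias("loc_id"),
    col("Animation").alias("descr"),
)
renamed_df.createOrReplaceTempView("location")
spark.sql("select * from location").show(100)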

MySQL table definition has changed error when reading from a table that has been written to by PySpark

I am currently working on a data pipeline with PySpark. As part of the pipeline, I write a Spark DataFrame to MySQL using the following function:
def jdbc_insert_overwrite_table(df, mysql_user, mysql_pass, mysql_host, mysql_port,
                                mysql_db, num_executors, table_name, logger):
    mysql_url = "jdbc:mysql://{}:{}/{}?characterEncoding=utf8".format(mysql_host, mysql_port, mysql_db)
    logger.warn("JDBC Writing to table " + table_name)
    df.write.format('jdbc') \
        .options(
            url=mysql_url,
            driver='com.mysql.cj.jdbc.Driver',
            dbtable=table_name,
            user=mysql_user,
            password=mysql_pass,
            truncate=True,
            numpartitions=num_executors,
            batchsize=100000
        ).mode('Overwrite').save()
This works with no issue. However, later on in the pipeline (within the same PySpark app/Spark session), this table is a dependency for another transformation, and I try reading from it using the following function:
def read_mysql_table_in_session_df(spark, mysql_conn, query_str, query_schema):
    cursor = mysql_conn.cursor()
    cursor.execute(query_str)
    records = cursor.fetchall()
    df = spark.createDataFrame(records, schema=query_schema)
    return df
And I get this MySQL error: Error 1412: Table definition has changed, please retry transaction.
I've been able to resolve this by closing the connection and calling ping(reconnect=True) on it, but I don't like this solution; it feels like a band-aid.
Any ideas why I'm getting this error? I've confirmed that writing to the table does not change the table definition (schema-wise, at least).
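For reference, the band-aid the asker describes could look roughly like the sketch below; it assumes mysql_conn comes from mysql-connector-python, whose connections support ping(reconnect=True) (the original post doesn't show how the connection is created):

# Sketch of the asker's own workaround: force a reconnect so the read sees the new table definition
def read_mysql_table_with_reconnect(spark, mysql_conn, query_str, query_schema):
    mysql_conn.ping(reconnect=True)  # reopen the connection, discarding the stale state
    cursor = mysql_conn.cursor()
    cursor.execute(query_str)
    records = cursor.fetchall()
    cursor.close()
    return spark.createDataFrame(records, schema=query_schema)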

Python PySpark: write DF to .csv and store it on the local C drive

I want to save a DataFrame that pulls data using SQLContext into a .csv file on the C drive. I am using Zeppelin to run my code.
The code below runs, but I can't see the file in the specified location. The select query in SQLContext is pulling data from the Hive DB.
%spark.pyspark
df = sqlContext.sql("SELECT * from TEST")
df.write.format("csv").mode("overwrite").save("\Users\testuser\testfolder\test.csv")
z.show(df)
You're on Windows, if I'm reading this correctly. In that case, you need to add the required prefix to your path. Your path will be something like C:\Users\testuser\testfolder\test.csv
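Under that assumption, the corrected call might look like the sketch below. Note that Spark writes a directory of part files at that path rather than a single test.csv:

%spark.pyspark
df = sqlContext.sql("SELECT * FROM TEST")
# Forward slashes avoid backslash-escape issues in the Python string literal;
# prepend "file:///" if the cluster's default filesystem is HDFS
df.write.format("csv").mode("overwrite").save("C:/Users/testuser/testfolder/test.csv")
z.show(df)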

Python pandas.to_sql error when trying to upload many columns into Microsoft SQL Server

I have been attempting to use Python to upload a table into Microsoft SQL Server. I have had great success with smaller tables, but start to get errors when there is a large number of columns or rows. I don't believe it is the filesize that is the issue, but I may be mistaken.
The same error comes up whether the data is from an Excel file, csv file, or query.
When I run the code, it does create a table in SQL Server, but the table contains only the column headers (the rest is blank).
This is the code that I am using, which works for smaller files but gives me the below error for the larger ones:
import pyodbc
#import cx_Oracle
import pandas as pd
from sqlalchemy import create_engine
connstr_Dev = ('DSN='+ODBC_Dev+';UID='+SQLSN+';PWD='+SQLpass)
conn_Dev = pyodbc.connect(connstr_Dev)
cursor_Dev=conn_Dev.cursor()
engine_Dev = create_engine('mssql+pyodbc://'+ODBC_Dev)
upload_file= "M:/.../abc123.xls"
sql_table_name='abc_123_sql'
pd.read_excel(upload_file).to_sql(sql_table_name, engine_Dev, schema='dbo', if_exists='replace', index=False, index_label=None, chunksize=None, dtype=None)
conn_Dev.commit()
conn_Dev.close()
This gives me the following error:
ProgrammingError: (pyodbc.ProgrammingError) ('The SQL contains -13854
parameter markers, but 248290 parameters were supplied', 'HY000') .......
(Background on this error at: http://sqlalche.me/e/f405)
The error log in the provided link doesn't give me any ideas on troubleshooting.
Anything I can tweak in the code to make this work?
Thanks!
Upgrading to pandas 0.23.4 solved it for me. What is your version?
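If upgrading isn't an option, one commonly used mitigation (a hedged sketch, not part of the original answer) is to pass a chunksize so each batch stays under the driver's parameter limit; the value below is only illustrative:

# Sketch: insert in smaller batches so rows-per-statement times columns stays below the parameter limit
pd.read_excel(upload_file).to_sql(
    sql_table_name,
    engine_Dev,
    schema='dbo',
    if_exists='replace',
    index=False,
    chunksize=50,  # illustrative value; tune based on the number of columns
)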
