I have some files stored in a Google Cloud Storage bucket. These are my settings, as suggested here.
spark = SparkSession.builder.\
master("local[*]").\
appName("TestApp").\
config("spark.serializer", KryoSerializer.getName).\
config("spark.jars", "/usr/local/.sdkman/candidates/spark/2.4.4/jars/gcs-connector-hadoop2-2.1.1.jar").\
config("spark.kryo.registrator", GeoSparkKryoRegistrator.getName).\
getOrCreate()
#Recommended settings for using GeoSpark
spark.conf.set("spark.driver.memory", 6)
spark.conf.set("spark.network.timeout", 1000)
#spark.conf.set("spark.driver.maxResultSize", 5)
spark.conf.set
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
# This is required if you are using a service account; set it to 'true' in that case,
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'false')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "myJson.json")
path = 'mBucket-c892b51f8579.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = path
client = storage.Client()
name = 'https://console.cloud.google.com/storage/browser/myBucket/'
bucket_id = 'myBucket'
bucket = client.get_bucket(bucket_id)
I can read them simply using the following:
df = pd.read_csv('gs://myBucket/myFile.csv.gz', compression='gzip')
df.head()
time_zone_name province_short
0 America/Chicago US.TX
1 America/Chicago US.TX
2 America/Los_Angeles US.CA
3 America/Chicago US.TX
4 America/Los_Angeles US.CA
I am trying to read the same file with PySpark:
myTable = spark.read.format("csv").schema(schema).load('gs://myBucket/myFile.csv.gz', compression='gzip')
but I get the following error:
Py4JJavaError: An error occurred while calling o257.load.
: java.lang.NoClassDefFoundError: Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Welcome to Hadoop dependency hell!
1. Use packages rather than jars
Your configuration is basically correct, but when you add the gcs-connector as a local jar you also need to manually ensure that all of its dependencies are available on the JVM classpath.
It's usually easier to add the connector as a package and let Spark deal with the dependencies, so instead of config("spark.jars", "/usr/local/.sdkman/candidates/spark/2.4.4/jars/gcs-connector-hadoop2-2.1.1.jar") use config('spark.jars.packages', 'com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.1')
2. Manage ivy2 dependencies resolution issues
When you do as above, Spark will likely complain that it can't download some dependencies, due to resolution differences between Maven (used for publication) and ivy2 (used by Spark for dependency resolution).
You can usually fix this by simply asking Spark to ignore the unresolved dependencies via spark.jars.excludes, so add a new config line such as:
config('spark.jars.excludes','androidx.annotation:annotation,org.slf4j:slf4j-api')
3. Manage classpath conflicts
When this is done, the SparkSession will start, but the filesystem access will still fail because the standard distribution of pyspark packages an old version of the Guava library that doesn't implement the API the gcs-connector relies on.
You need to ensure that the gcs-connector finds its expected version first, by using the following configs: config('spark.driver.userClassPathFirst','true') and config('spark.executor.userClassPathFirst','true')
4. Manage dependency conflicts
Now you may think everything is OK, but it isn't: the default pyspark distribution bundles version 2.7.3 of the Hadoop libraries, while gcs-connector version 2.1.1 relies on APIs only available in Hadoop 2.8+.
Now your options are:
use a custom build of Spark with a newer Hadoop (or the package with no built-in Hadoop libraries)
use an older version of gcs-connector (version 1.9.17 works fine)
5. A working config at last
Assuming you want to stick with the latest PyPI or Anaconda distribution of pyspark, the following config should work as expected.
I've included only the GCS-relevant configs, moved the Hadoop settings directly into the Spark config, and assumed you are correctly setting GOOGLE_APPLICATION_CREDENTIALS:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
master("local[*]").\
appName("TestApp").\
config('spark.jars.packages',
'com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.17').\
config('spark.jars.excludes',
'javax.jms:jms,com.sun.jdmk:jmxtools,com.sun.jmx:jmxri').\
config('spark.driver.userClassPathFirst','true').\
config('spark.executor.userClassPathFirst','true').\
config('spark.hadoop.fs.gs.impl',
'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem').\
config('spark.hadoop.fs.gs.auth.service.account.enable', 'false').\
getOrCreate()
Note that gcs-connector version 1.9.17 has a different set of excludes than 2.1.1 because why not...
PS: You also need to ensure you're using a Java 1.8 JVM because Spark 2.4 doesn't work on newer JVMs.
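Once the session above comes up cleanly, the original read should work; here's a quick sketch using the path from the question (header/inferSchema are assumptions on my part, substitute your explicit schema if you have one):
# Spark picks up the gzip codec from the .gz extension, no compression option needed
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("gs://myBucket/myFile.csv.gz")
df.show(5)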
In addition to rluta's great answer, you can also replace the userClassPathFirst lines by putting the Guava jars specifically on extraClassPath:
spark.driver.extraClassPath=/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar:/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar:/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
spark.executor.extraClassPath=/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar:/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar:/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
It's a bit hackish, as you need the exact local ivy2 paths, although you can also download/copy the jars to somewhere more permanent.
But this reduces other potential dependency conflicts, such as with Livy, which throws java.lang.NoClassDefFoundError: org.apache.livy.shaded.json4s.jackson.Json4sModule if gcs-connector's Jackson dependencies come first on the classpath.
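For reference, a minimal sketch of setting those two properties through the SparkSession builder (the ivy2 jar paths are assumed to be the same as above):
from pyspark.sql import SparkSession

# Assumed locations of the Guava jars pulled by ivy2 (see the paths above)
guava_jars = ":".join([
    "/root/.ivy2/jars/com.google.guava_guava-27.0.1-jre.jar",
    "/root/.ivy2/jars/com.google.guava_failureaccess-1.0.1.jar",
    "/root/.ivy2/jars/com.google.guava_listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar",
])

spark = SparkSession.builder.\
    master("local[*]").\
    appName("TestApp").\
    config("spark.driver.extraClassPath", guava_jars).\
    config("spark.executor.extraClassPath", guava_jars).\
    getOrCreate()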
I'm trying to get file creation metadata.
The file is in: Azure Storage
Accessing the data through: Databricks
Right now I'm using:
file_path = my_storage_path
dbutils.fs.ls(file_path)
but it returns
[FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940)]
I do not have any information about creation time. Is there a way to get that information?
Other solutions on Stack Overflow refer to files that are already in Databricks, e.g.:
Does databricks dbfs support file metadata such as file/folder create date or modified date
In my case we access the data from Databricks, but the data lives in Azure Storage.
It really depends on the version of Databricks Runtime (DBR) that you're using. For example, the modification timestamp is available if you use DBR 10.2 (I didn't test with 10.0/10.1, but it's definitely not available on 9.1):
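For illustration, on such a runtime each FileInfo entry should expose a modificationTime field (a sketch; the path below is hypothetical and the field is only present on recent DBR versions):
# Hypothetical container path; modificationTime is epoch milliseconds
for f in dbutils.fs.ls("wasbs://container@account.blob.core.windows.net/my_path"):
    print(f.name, f.size, f.modificationTime)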
If you need to get that information, you can use the Hadoop FileSystem API via the Py4j gateway, like this:
URI = sc._gateway.jvm.java.net.URI
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
fs = FileSystem.get(URI("/tmp"), Configuration())
status = fs.listStatus(Path('/tmp/'))
for fileStatus in status:
    print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={fileStatus.getModificationTime()}")
When using the pandas.to_hdf function to save data to an HDF5 file, I'm getting the following warning:
C:\{my path to conda environment}\lib\site-packages\tables\file.py:426:
UserWarning: a closed node found in the registry: ``/{my object key}/meta/{my column name}/meta/_i_table``
warnings.warn("a closed node found in the registry: "
I wasn't able to google what this warning could mean and looking into the relevant PyTables docs didn't help.
Does anybody have an idea?
Details:
Python 3.8.2, PyTables 3.6.1, pandas 1.0.3
The saved data seem to be alright.
I'm calling the pandas.to_hdf function as follows: to_hdf(filename, key, mode = mode, complib='blosc', format = 'table', append = True), where mode is either 'w' or 'a' (the warning is raised in both cases).
For what it's worth, I managed to overcome the warning by installing tables 3.6.1 from:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#pytables
using the +gpl version for Python 3.9.
Nevertheless, I do not know the details of this choice and I cannot justify why it works.
I observed the same issue. For me, this happens (from time to time, I do not see the logic) for category columns in the DataFrame.
Solution:
transform category columns to object
run to_hdf(...)
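A minimal sketch of that workaround (the DataFrame and column names here are hypothetical):
import pandas as pd

df = pd.DataFrame({"city": pd.Categorical(["NY", "LA", "NY"]), "value": [1, 2, 3]})

# 1. transform category columns to object
cat_cols = df.select_dtypes(include="category").columns
df[cat_cols] = df[cat_cols].astype(object)

# 2. run to_hdf as before
df.to_hdf("data.h5", key="my_key", mode="a", complib="blosc", format="table", append=True)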
The ultimate goal is to be able to read the data in my Azure container into a PySpark DataFrame.
Steps until now
The steps I have followed so far:
Written this code
spark = SparkSession(SparkContext())
spark.conf.set(
"fs.azure.account.key.%s.blob.core.windows.net" % AZURE_ACCOUNT_NAME,
AZURE_ACCOUNT_KEY
)
spark.conf.set(
"fs.wasbs.impl",
"org.apache.hadoop.fs.azure.NativeAzureFileSystem"
)
container_path = "wasbs://%s#%s.blob.core.windows.net" % (
AZURE_CONTAINER_NAME, AZURE_ACCOUNT_NAME
)
blob_folder = "%s/%s" % (container_path, AZURE_BLOB_NAME)
df = spark.read.format("text").load(blob_folder)
print(df.count())
Set public access and anonymous access to my Azure container.
Added two jars hadoop-azure-2.7.3.jar and azure-storage-2.2.0.jar to the path.
Problem
But now I am stuck with this error: Caused by: com.microsoft.azure.storage.StorageException: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual UNSPECIFIED..
I have not been able to find anything which talks about / resolves this issue. The closest I have found is this which does not work / is outdated.
EDIT
I found that the azure-storage-2.2.0.jar did not support APPEND_BLOB. I upgraded to azure-storage-4.0.0.jar and it changed the error from Expected BLOCK_BLOB, actual UNSPECIFIED. to Expected BLOCK_BLOB, actual APPEND_BLOB.. Does anyone know how to pass the correct type to expect?
Can someone please help me resolve this?
I have minimal expertise in working with Azure, but I don't think it should be this difficult to read the data and create a Spark DataFrame from it. What am I doing wrong?
I'm writing a Spark job using PySpark; it should simply read from a MongoDB collection and print the content to the screen. The code is the following:
import pyspark
from pyspark.sql import SparkSession
my_spark = SparkSession.builder.appName("myApp").config("spark.mongodb.input.uri", "mongodb://127.0.0.1/marco.weather_test").config("spark.mongodb.output.uri", "mongodb://127.0.0.1/marco.out").getOrCreate()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/marco.weather_test").load()
#df.printSchema()
df.show()
The problem is that when I print the schema the job works, but when I print the content of the DataFrame with the show() function I get the following error:
#java.lang.NoSuchFieldError: DECIMAL128
the command I use is:
#bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.3 /home/erca/Scrivania/proveTesi/test_batch.py
I got the same error because the wrong jar was being used for mongo-java-driver.
The NoSuchFieldError is raised from org.bson.BsonType.Decimal128, and the Decimal128 field was added to the BsonType class after mongo-java-driver 3.4. While org.mongodb.spark:mongo-spark-connector_2.11:2.2.4 contains mongo-java-driver 3.6.2, an existing mongo-java-driver jar with version 3.2.1 was located in driverExtraClassPath.
Just start the spark-shell with verbose:
spark-shell --verbose
e.g. the output:
***
***
Parsed arguments:
master local[*]
deployMode null
executorMemory 1024m
executorCores 1
totalExecutorCores null
propertiesFile /etc/ecm/spark-conf/spark-defaults.conf
driverMemory 1024m
driverCores null
driverExtraClassPath /opt/apps/extra-jars/*
driverExtraLibraryPath /usr/lib/hadoop-current/lib/native
***
***
Pay attention to driverExtraClassPath and driverExtraLibraryPath.
Check these paths, and remove the mongo-java-driver jar if it exists in any of them.
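A small sketch of that check from Python (the directory comes from the verbose output above; adjust it to your environment):
import glob

# Look for a stray mongo-java-driver jar in the extra classpath directory
for directory in ["/opt/apps/extra-jars"]:
    for jar in glob.glob(f"{directory}/mongo-java-driver*.jar"):
        print("conflicting jar found:", jar)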
Importing the library explicitly might solve the issue or give you a better error message for debugging.
import org.bson.types.Decimal128
I added a forced version for the BSON jar and it worked
(in Gradle):
compile group: 'org.mongodb', name: 'bson', version: '3.10.2', 'force': "true"
I would like to test a simple Spark row count job on a test Cassandra table with only four rows, just to verify that everything works.
I can quickly get this working from Java:
JavaSparkContext sc = new JavaSparkContext(conf);
SparkContextJavaFunctions sparkContextJavaFunctions = CassandraJavaUtil.javaFunctions(sc);
CassandraJavaRDD<CassandraRow> table = sparkContextJavaFunctions.cassandraTable("demo", "playlists");
long count = table.count();
Now, I'd like to get the same thing working in Python. The Spark distribution comes with a set of unbundled PySpark source code to use Spark from Python. It uses a library called py4j to launch a Java server and marshal java commands through a TCP gateway. I'm using that gateway directly to get this working.
I specify the following extra jars to the Java SparkSubmit host via the --driver-class-path option:
spark-cassandra-connector-java_2.11-1.2.0-rc1.jar
spark-cassandra-connector_2.11-1.2.0-rc1.jar
cassandra-thrift-2.1.3.jar
cassandra-clientutil-2.1.3.jar
cassandra-driver-core-2.1.5.jar
libthrift-0.9.2.jar
joda-convert-1.2.jar
joda-time-2.3.jar
Here is the core Python code to do the row count test:
from pyspark.java_gateway import launch_gateway
jvm_gateway = launch_gateway()
sc = jvm_gateway.jvm.org.apache.spark.api.java.JavaSparkContext(conf)
spark_cass_functions = jvm_gateway.jvm.com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions(sc)
table = spark_cass_functions.cassandraTable("demo", "playlists");
On this last line, I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o5.cassandraTable.
: com.datastax.spark.connector.util.ConfigCheck$ConnectorConfigurationException: Invalid Config Variables
Only known spark.cassandra.* variables are allowed when using the Spark Cassandra Connector.
spark.cassandra.connection.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
spark.cassandra.auth.conf.factory is not a valid Spark Cassandra Connector variable.
No likely matches found.
at com.datastax.spark.connector.util.ConfigCheck$.checkConfig(ConfigCheck.scala:38)
at com.datastax.spark.connector.rdd.CassandraRDD.<init>(CassandraRDD.scala:18)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.<init>(CassandraTableScanRDD.scala:59)
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$.apply(CassandraTableScanRDD.scala:182)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:88)
at com.datastax.spark.connector.japi.SparkContextJavaFunctions.cassandraTable(SparkContextJavaFunctions.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Clearly, there is some configuration or setup issue. I'm not sure how to reasonably debug or investigate or what I could try. Can anyone with more Cassandra/Python/Spark expertise provide some advice? Thank you!
EDIT: A coworker had set up a spark-defaults.conf file that was the root of this. I don't fully understand why this caused problems from Python and not from Java, but it doesn't matter. I didn't want that conf file, and removing it resolved my issue.
That is a known bug in the Spark Cassandra Connector in 1.2.0-rc1 and 1.2.0-rc2; it will be fixed in rc3.
Relevant Tickets
https://datastax-oss.atlassian.net/browse/SPARKC-102
https://datastax-oss.atlassian.net/browse/SPARKC-105
https://datastax-oss.atlassian.net/browse/SPARKC-108
You could always try out pyspark_cassandra. It's built against 1.2.0-rc3 and is probably a lot easier to use when working with Cassandra from pyspark.
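For example, a minimal sketch of the same row count with pyspark_cassandra (assuming the package is available to the driver and that the connection host below matches your cluster):
from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

# Assumed local Cassandra node; keyspace/table names mirror the question
conf = SparkConf() \
    .setAppName("PlaylistCount") \
    .set("spark.cassandra.connection.host", "127.0.0.1")

sc = CassandraSparkContext(conf=conf)
print(sc.cassandraTable("demo", "playlists").count())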