I'm writing a Spark job using PySpark; it should only read from a MongoDB collection and print the content on the screen. The code is the following:
import pyspark
from pyspark.sql import SparkSession
my_spark = SparkSession.builder.appName("myApp").config("spark.mongodb.input.uri", "mongodb://127.0.0.1/marco.weather_test").config("spark.mongodb.output.uri", "mongodb://127.0.0.1/marco.out").getOrCreate()
df = my_spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/marco.weather_test").load()
#df.printSchema()
df.show()
The problem is that printing the schema works, but when I try to print the content of the DataFrame with the show() function I get the following error:
java.lang.NoSuchFieldError: DECIMAL128
The command I use is:
bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.3 /home/erca/Scrivania/proveTesi/test_batch.py
I got the same error because the wrong jar was being used for mongo-java-driver.
The NoSuchFieldError is raised from org.bson.BsonType.Decimal128, and the Decimal128 field was only added to the BsonType class after mongo-java-driver 3.4. While org.mongodb.spark:mongo-spark-connector_2.11:2.2.4 contains mongo-java-driver 3.6.2, an existing jar "mongo-java-driver" with version 3.2.1 was located in driverExtraClassPath.
Just start spark-shell with the verbose flag:
spark-shell --verbose
and check the output, e.g.:
***
***
Parsed arguments:
master local[*]
deployMode null
executorMemory 1024m
executorCores 1
totalExecutorCores null
propertiesFile /etc/ecm/spark-conf/spark-defaults.conf
driverMemory 1024m
driverCores null
driverExtraClassPath /opt/apps/extra-jars/*
driverExtraLibraryPath /usr/lib/hadoop-current/lib/native
***
***
and pay attention to driverExtraClassPath and driverExtraLibraryPath.
Check these paths, and remove the mongo-java-driver jar if it exists within these paths.
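If it is not obvious which copy of the class wins, a quick check from the PySpark driver can confirm where it is loaded from. This is a small sketch (it relies on the internal py4j gateway via _jvm and assumes the my_spark session from the question) that prints the jar org.bson.BsonType actually came from:
# Ask the driver JVM where org.bson.BsonType was loaded from
# (my_spark is the SparkSession from the question; _jvm is an internal attribute)
jvm = my_spark.sparkContext._jvm
bson_type = jvm.java.lang.Class.forName("org.bson.BsonType")
print(bson_type.getProtectionDomain().getCodeSource().getLocation().toString())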
Importing the library explicitly might solve the issue or give you a better error message for debugging.
import org.bson.types.Decimal128
I pinned the BSON jar to an explicit version and forced it, and it worked (in Gradle):
compile group: 'org.mongodb', name: 'bson', version: '3.10.2', force: true
Related
I am using this code in my catalog.yml file:
equipment_data:
  type: pandas.ExcelDataSet
  filepath: data\01_raw\Equipment Profile.xlsb
  layer: raw
I'm getting an error after executing the kedro run command:
`
kedro.io.core.DataSetError: Failed while loading data from data set ExcelDataSet(filepath=C:/Users/Akshay Salvi/Desktop/Bizmetrics/kedro-environment/petrocaeRepo/data/01_raw/2. Cycle data (per trip)-20210113T042557Z-001/2. Cycle data (per trip)/CycleData,2020.xlsb, load_args={'engine': xlrd}, protocol=file, save_args={'index': False}, writer_args={'engine': xlsxwriter}).
Excel 2007 xlsb file; not supported
`
pandas.ExcelDataSet simply calls pandas underneath, so hopefully you can have luck following the example from another thread, where the engine provided by another package (installed with pip install pyxlsb) is used to parse the file: simply provide the engine parameter via load_args in your YAML catalog.
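A rough sketch of what the catalog entry could then look like (this assumes pyxlsb is installed and a pandas version that supports the pyxlsb engine; the filepath is kept from your question):
equipment_data:
  type: pandas.ExcelDataSet
  filepath: data\01_raw\Equipment Profile.xlsb
  layer: raw
  load_args:
    engine: pyxlsb  # passed through to pandas.read_excel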
I'm working on a project where I am trying to query an SSAS data source we have at work through Python. The connection currently lives inside Excel files, but I am trying to reverse engineer the process with Python to automate part of the analysis I do day to day... I use the pyadomd library to connect to the data source; here's my code:
clr.AddReference(r"C:\Program Files (x86)\Microsoft Office\root\vfs\ProgramFilesX86\Microsoft.NET\ADOMD.NET\130\Microsoft.AnalysisServices.AdomdClient.dll")
clr.AddReference('Microsoft.AnalysisServices.AdomdClient')
from Microsoft.AnalysisServices.AdomdClient import AdomdConnection , AdomdDataAdapter
from sys import path
path.append(r'C:\Program Files (x86)\Microsoft Office\root\vfs\ProgramFilesX86\Microsoft.NET\ADOMD.NET\130\Microsoft.AnalysisServices.AdomdClient.dll')
import pyadomd
from pyadomd import Pyadomd
from pyadomd._type_code import adomd_type_map, convert
constr= "connection string"
with Pyadomd(constr) as conn:
    with conn.cursor().execute(query) as cur:
        print(cur.fetchall())
This works, in part; I seem to be able to connect to the SSAS data source. If I do conn = Pyadomd(constr), it returns no error (it did before, but no longer does). The issue is that when I try to execute the query with the cursor, it returns this error:
File "C:\Users\User\Anaconda3\lib\site-packages\pyadomd\pyadomd.py", line 71, in execute
adomd_type_map[self._reader.GetFieldType(i).ToString()].type_name
KeyError: 'System.Object'
By doing a bit of research, I found that a KeyError means the code is trying to access a key that isn't present in a dictionary. By digging through my variables and going through the code, I realized that the line:
from pyadomd._type_code import adomd_type_map
creates a dictionary containing these keys: System.Boolean, System.DateTime, System.Decimal, System.Double, System.Int64, System.String. I figured that the "KeyError: 'System.Object'" was referring to that dictionary. My issue is: how can I add this System.Object key to that dictionary? From which library/module/IronPython clr reference can I get it?
What I tried:
clr.AddReference("System.Object")
This gave me an error message saying "Unable to find assembly 'System.Object'. at Python.Runtime.CLRModule.AddReference(String name)"
I also tried:
from System import Object #no error but didn't work
from System import System.Object #error saying invalid syntax
I think it has to do with some clr.AddReference IronPython thing that I am missing, but I've been looking everywhere and can't find it.
Thanks!
Glad that the newer version solved the problem.
A few comments on the code snippet above: it can be done a bit more concisely 😊
Pyadomd will import the necessary classes from the AdomdClient, which means that the following lines can be left out:
clr.AddReference(r"C:\Program Files (x86)\MicrosoftOffice\root\vfs\ProgramFilesX86\Microsoft.NET\ADOMD.NET\130\Microsoft.AnalysisServices.AdomdClient.dll")
clr.AddReference('Microsoft.AnalysisServices.AdomdClient')
from Microsoft.AnalysisServices.AdomdClient import AdomdConnection , AdomdDataAdapter
Your code will then look like this:
import pandas as pd
from sys import path
path.append(r'C:\Program Files (x86)\Microsoft Office\root\vfs\ProgramFilesX86\Microsoft.NET\ADOMD.NET\130')
from pyadomd import Pyadomd
constr= "constring"
query = "query"
with Pyadomd(constr) as con:
    with con.cursor().execute(query) as cur:
        DF = pd.DataFrame(cur.fetchone(), columns=[i.name for i in cur.description])
The most important thing is to add the AdomdClient.dll to your path before importing the pyadomd package.
Furthermore, the package is mainly meant to be used with CPython version 3.6 and 3.7.
Well, big problems require big solutions..
After endlessly searching the web, I went to https://pypi.org/project/pyadomd/ and directly contacted the author of the package (SCOUT). I emailed him the same question, and apparently there was a bug in the code, which he fixed overnight, producing a new version of the package, going from 0.0.5 to 0.0.6. In his words:
[Hi,
Thanks for writing me 😊
I investigated the error, and you are correct, the type map doesn’t support converting System.Object.
That is a bug!
I have uploaded a new version of the Pyadomd package to Pypi which should fix the bug – Pyadomd will now just pass a System.Object type through as a .net object. Because Pyadomd doesn’t know the specifics of the System.Object type at runtime, you will then be responsible yourself to convert to a python type if necessary.
Please install the new version using pip.]
So after running a little pip install pyadomd --upgrade, I restarted Spyder and retried the code, and it now works: I can query my SSAS cube!! Hopefully this can help others.
Snippet of the code:
import pandas as pd
import clr
clr.AddReference(r"C:\Program Files (x86)\MicrosoftOffice\root\vfs\ProgramFilesX86\Microsoft.NET\ADOMD.NET\130\Microsoft.AnalysisServices.AdomdClient.dll")
clr.AddReference('Microsoft.AnalysisServices.AdomdClient')
from Microsoft.AnalysisServices.AdomdClient import AdomdConnection , AdomdDataAdapter
from sys import path
path.append(r'C:\Program Files (x86)\Microsoft Office\root\vfs\ProgramFilesX86\Microsoft.NET\ADOMD.NET\130\Microsoft.AnalysisServices.AdomdClient.dll')
import pyadomd
from pyadomd import Pyadomd
constr= "constring"
query = "query"
and then as indicated on his package website:
with Pyadomd(constr) as con:
    with con.cursor().execute(query) as cur:
        DF = pd.DataFrame(cur.fetchone(), columns=[i.name for i in cur.description])
and bam! A 10795-row by 39-column DataFrame. I haven't precisely timed it yet, but it's looking good so far considering the amount of data.
I'm using BigQuery and Dataproc in Google Cloud. Both are in the same project, let's call it "project-123". I use Composer (Airflow) to run my code.
I have a simple Python script, test_script.py, that uses PySpark to read data from a table in the BigQuery public dataset:
import logging
import warnings

from pyspark.sql import SparkSession

log = logging.getLogger(__name__)

if __name__ == "__main__":
    # Create the Spark session
    try:
        spark = SparkSession.builder.appName("test_script").getOrCreate()
        log.info("Created a SparkSession")
    except ValueError:
        warnings.warn("SparkSession already exists in this scope")

    df = (
        spark.read.format("bigquery")
        .option("project", "project-123")
        .option("dataset", "bigquery-public-data")
        .option("table", "crypto_bitcoin.outputs")
        .load()
    )
I run the script using the DataProcPySparkOperator in Airflow:
# This task corresponds to the ""
test_script_task = DataProcPySparkOperator(
    task_id="test_script",
    main="./test_script.py",
    cluster_name="test_script_cluster",
    arguments=[],
    # Since we are using BigQuery, we need to explicitly add the connector jar
    dataproc_pyspark_jars="gs://spark-lib/bigquery/spark-bigquery-latest.jar",
)
However, every time I try I get the following error:
Invalid project ID '/tmp/test_script_20200304_407da59b/test_script.py'. Project IDs must contain 6-63 lowercase letters, digits, or dashes. Some project IDs also include domain name separated by a colon. IDs must start with a letter and may not end with a dash.
Where is this project ID coming from? It's obviously not being overwritten by my .option("project", "project-123"). My guess is that Composer is storing my spark job script at the location /tmp/test_script_20200304_407da59b/test_script.py. If that's the case, how can I overwrite the project ID?
Any help is much appreciated
I'm afraid you are mixing up the parameters: project is the project the table belongs to, and bigquery-public-data is a project rather than a dataset. Please try the following call:
df = (
    spark.read.format("bigquery")
    .option("parentProject", "project-123")
    .option("project", "bigquery-public-data")
    .option("table", "crypto_bitcoin.outputs")
    .load()
)
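As a side note (this is a sketch based on the connector's fully qualified table naming rather than part of the original answer), the whole table reference can also go in a single option; parentProject remains the project that is billed for the read:
df = (
    spark.read.format("bigquery")
    .option("parentProject", "project-123")
    .option("table", "bigquery-public-data.crypto_bitcoin.outputs")
    .load()
)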
I'm trying to import a CSV into MySQL using odo but am getting a datashape error.
My understanding is that datashape takes the format:
var * {
    column: type
    ...
}
where var means a variable number of rows. I'm getting the following error:
AssertionError: datashape must be Record type, got 0 * {
    tod: ?string,
    interval: ?string,
    iops: float64,
    mb_per_sec: float64
}
I'm not sure where that 0 number of rows is coming from. I've tried explicitly setting the datashape using dshape(), but continue to get the same error.
Here's a stripped down version of the code that recreates the error:
from odo import odo
odo('test.csv', mysql_database_uri)
I'm running Ubuntu 16.04 and Python 3.6.1 using Conda.
Thanks for any input.
I had this error; I needed to specify the table:
# error
odo('data.csv', 'postgresql://usr:pwd@ip/db')
# works
odo('data.csv', 'postgresql://usr:pwd@ip/db::table')
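For the MySQL URI from the question, the same fix would look roughly like this (a sketch; tablename and the driver part of the URI are placeholders for your setup):
from odo import odo
odo('test.csv', 'mysql+pymysql://usr:pwd@host/db::tablename')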
Try replacing
odo('test.csv', mysql_database_uri)
with
odo(pandas.read_csv('test.csv'), mysql_database_uri)
Odo seems to be buggy and discontinued. As an alternative, you can use d6tstack, which has fast pandas-to-SQL functionality because it uses native DB import commands. It supports Postgres, MySQL and MS SQL:
import glob
import d6tstack.combine_csv

cfg_uri_mysql = 'mysql+mysqlconnector://testusr:testpwd@localhost/testdb'
# apply_after_read= can optionally pre-process each CSV with pandas before loading
d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv')).to_mysql_combine(cfg_uri_mysql, 'table')
It is also particularly useful for importing multiple CSVs with data schema changes and/or pre-processing with pandas before writing to the db; see further down in the examples notebook.
I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, so that should be no problem since my file is only 300 MB. However, when I try to convert the Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:
serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I tried to fix this by changing the Spark config file, but I still get the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.
You can set spark.driver.maxResultSize parameter in the SparkConf object:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf()
        .set("spark.driver.maxResultSize", "2g"))
# Create new context
sc = SparkContext(conf=conf)
You should probably create a new SQLContext as well:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g can also be used to increase the max result size.
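On newer Spark versions (a sketch for reference; the question is about Spark 1.4, where the SparkConf approach above applies), the same setting can go directly on the session builder:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("myApp")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate()
)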
Tuning spark.driver.maxResultSize is good practice considering the running environment. However, it is not the solution to your problem, as the amount of data may change from time to time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, then you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:
Do not turn on spark.sql.parquet.binaryAsString. String objects take more space
Use spark.rdd.compress to compress RDDs when you collect them
Try to collect it using pagination (code in Scala, from another answer, "Scala: How to get a range of rows in a dataframe"); a rough PySpark equivalent is sketched after the Scala snippet below.
var remaining = df
var count = remaining.count()
val limit = 50
while (count > 0) {
  val df1 = remaining.limit(limit)
  df1.show()  // will print 50 rows, then the next 50, etc.
  remaining = remaining.except(df1)
  count = count - limit
}
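And the rough PySpark equivalent mentioned above (a sketch; subtract() on a large DataFrame is not cheap, so use it with care):
count = df.count()
limit = 50
remaining = df
while count > 0:
    page = remaining.limit(limit)
    page.show()  # prints 50 rows, then the next 50, etc.
    remaining = remaining.subtract(page)
    count -= limit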
It looks like you are collecting the RDD, so it will definitely collect all the data to the driver node; that's why you are facing this issue.
You should avoid collecting the data if it is not required, or, if it is necessary, specify spark.driver.maxResultSize. There are two ways of defining this variable:
1 - create the Spark config by setting this variable, as in
conf.set("spark.driver.maxResultSize", "3g")
2 - or set this variable in the spark-defaults.conf file in the conf folder of Spark, like
spark.driver.maxResultSize 3g
and restart Spark.
When starting the job or the shell, you can use
--conf spark.driver.maxResultSize="0"
to remove the limit (0 means unlimited).
There is also a Spark bug
https://issues.apache.org/jira/browse/SPARK-12837
that gives the same error
serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize
even though you may not be pulling data to the driver explicitly.
SPARK-12837 addresses a Spark bug where, prior to Spark 2, accumulators and broadcast variables were pulled to the driver unnecessarily, causing this problem.
You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:
pyspark --conf "spark.driver.maxResultSize=2g"
This allows 2 GB for spark.driver.maxResultSize.