How to update an RRD database in Python

I am new to programming and I am working from already created scripts. I am trying to update my RRD database in Python. I have managed to create the code below, which doesn't come back with any errors, but when I try to generate a graph it doesn't contain any data.
#!/usr/bin/python
#modules
import sys
import os
import time
import rrdtool
import Adafruit_DHT as dht
#assign data
h,t = dht.read_retry(dht.DHT22, 22)
#display data
print 'Temp={0:0.1f}*C'.format(t, h)
print 'Humidity={1:0.1f}%'.format(t,h)
#update database
data = "N:h:t"
ret = rrdtool.update("%s/humidity.rrd" % (os.path.dirname(os.path.abspath(__file__))),data)
if ret:
    print rrdtool.error()
    time.sleep(300)
Below is my database specification:
#! /bin/bash
rrdtool create humidity.rrd \
--start "01/01/2015" \
--step 300 \
DS:th_dht22:GAUGE:1200:-40:100 \
DS:hm_dht22:GAUGE:1200:-40:100 \
RRA:AVERAGE:0.5:1:288 \
RRA:AVERAGE:0.5:6:336 \
RRA:AVERAGE:0.5:24:372 \
RRA:AVERAGE:0.5:144:732 \
RRA:MIN:0.5:1:288 \
RRA:MIN:0.5:6:336 \
RRA:MIN:0.5:24:372 \
RRA:MIN:0.5:144:732 \
RRA:MAX:0.5:1:288 \
RRA:MAX:0.5:6:336 \
RRA:MAX:0.5:24:372 \
RRA:MAX:0.5:144:732 \

rrdtool will silently ignore updates that are either too far apart or lie outside the predefined input range. I would add a logging feature to your code to see what you are trying to feed to rrdtool.
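A minimal sketch of what that logging could look like, reusing the update pattern from the question (the log file name is an assumption). Logging the update string also makes it obvious whether real numbers are being fed in: as written, data = "N:h:t" sends the literal letters h and t rather than the measured values, which is worth confirming in the log.
#!/usr/bin/python
import os
import logging
import rrdtool
import Adafruit_DHT as dht
# log file location is an assumption -- put it wherever suits you
logging.basicConfig(filename='humidity_update.log', level=logging.INFO)
h, t = dht.read_retry(dht.DHT22, 22)
if h is None or t is None:
    logging.error("sensor read failed, nothing to update")
else:
    # interpolate the readings; the order has to match the DS definitions
    # in humidity.rrd (th_dht22 = temperature, hm_dht22 = humidity)
    data = "N:%.1f:%.1f" % (t, h)
    logging.info("feeding rrdtool: %s", data)
    ret = rrdtool.update(
        "%s/humidity.rrd" % os.path.dirname(os.path.abspath(__file__)), data)
    if ret:
        logging.error("rrdtool.update failed: %s", rrdtool.error())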

Related

How to set up Kafka as a dependency when using Delta Lake in PySpark?

This is the code to set up Delta Lake as part of a regular Python script, according to their documentation:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
The official docs for Kafka integration in Spark show how to set up Kafka when using a spark-submit command (through the --packages parameter), but not in Python.
Digging around, it turns out that you can also include this parameter when building the Spark session:
import pyspark
from delta import *
packages = [
"org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
]
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", ",".join(packages))
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
However, when I try to stream to Kafka using the spark session created above I still get the following error:
Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
I'm using Delta 2.1.0 and PySpark 3.3.0.
It turns out that Delta overwrites any packages provided in spark.jars.packages if you're using configure_spark_with_delta_pip (source). The proper way is to use the extra_packages parameter when setting up your Spark session:
import pyspark
from delta import *
packages = [
"org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
]
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder, extra_packages=packages).getOrCreate()
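For reference, a hedged sketch of the kind of Kafka write the fixed session should now support; the Delta path, column names, brokers, topic, and checkpoint location are all placeholders rather than anything from the original post.
# Hypothetical usage of the session built above: stream a Delta table to Kafka.
stream = (
    spark.readStream.format("delta")
    .load("/tmp/delta/events")                                # placeholder Delta path
    .selectExpr("CAST(id AS STRING) AS key",                  # assumes an 'id' column exists
                "to_json(struct(*)) AS value")                # Kafka sink requires a 'value' column
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")      # placeholder brokers
    .option("topic", "events")                                # placeholder topic
    .option("checkpointLocation", "/tmp/checkpoints/events")  # required for streaming writes
    .start()
)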

flink run -py wordcount.py caused NullPointerException

I want to process data with Flink's Python API on Windows. But when I use the command below to submit a job to the local cluster, it throws a NullPointerException.
bin/flink run -py D:\workspace\python-test\flink-test.py
flink-test.py:
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, DataTypes, BatchTableEnvironment
from pyflink.table.descriptors import Schema, OldCsv, FileSystem
exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)
t_env.connect(FileSystem().path('D:\\workspace\\python-test\\data.txt')) \
    .with_format(OldCsv()
                 .line_delimiter(' ')
                 .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())) \
    .register_table_source('mySource')

t_env.connect(FileSystem().path('D:\\workspace\\python-test\\result.txt')) \
    .with_format(OldCsv()
                 .field_delimiter('\t')
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
                 .field('word', DataTypes.STRING())
                 .field('count', DataTypes.BIGINT())) \
    .register_table_sink('mySink')

t_env.scan('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')

t_env.execute("tutorial_job")
Does anyone know why?
I have solved this problem. I read the source code that the error message pointed to.
The NullPointerException is caused by flinkOptPath being empty. I used flink.bat to submit the job, and flink.bat doesn't set flinkOptPath, so I added the missing code to flink.bat myself. flink.bat is still incomplete for now; Flink should really be run on Linux.

Python generated file does not execute properly

I have a problem with files I generated with Python. I have some .sh files that I want to create dynamically. The files themselves can be executed properly and do what they are supposed to.
The Python-generated files, however, are IDENTICAL to the hand-written ones: the diff command on Linux gives an empty result. But when I execute the generated .sh scripts, they give me random errors.
Here is my normal file (for example):
OBJDUMP=`which riscv32-unknown-elf-objdump`
OBJCOPY=`which riscv32-unknown-elf-objcopy`
COMPILER=`which riscv32-unknown-elf-gcc`
RANLIB=`which riscv32-unknown-elf-ranlib`
VSIM=`which vsim`
echo $VSIM
TARGET_C_FLAGS="-O3 -m32 -g"
#TARGET_C_FLAGS="-O2 -g -falign-functions=16 -funroll-all-loops"
# if you want to have compressed instructions, set this to 1
RVC=0
# if you are using zero-riscy, set this to 1, otherwise it uses RISCY
USE_ZERO_RISCY=0
# set this to 1 if you are using the Floating Point extensions for riscy only
RISCY_RV32F=0
# zeroriscy with the multiplier
ZERO_RV32M=0
# zeroriscy with only 16 registers
ZERO_RV32E=0
# riscy with PULPextensions, it is assumed you use the ETH GCC Compiler
GCC_MARCH="IMXpulpv2"
#compile arduino lib
ARDUINO_LIB=1
PULP_GIT_DIRECTORY=../../
SIM_DIRECTORY="$PULP_GIT_DIRECTORY/vsim"
#insert here your post-layout netlist if you are using IMPERIO
PL_NETLIST=""
cmake "$PULP_GIT_DIRECTORY"/sw/ \
-DPULP_MODELSIM_DIRECTORY="$SIM_DIRECTORY" \
-DCMAKE_C_COMPILER="$COMPILER" \
-DVSIM="$VSIM" \
-DRVC="$RVC" \
-DRISCY_RV32F="$RISCY_RV32F" \
-DUSE_ZERO_RISCY="$USE_ZERO_RISCY" \
-DZERO_RV32M="$ZERO_RV32M" \
-DZERO_RV32E="$ZERO_RV32E" \
-DGCC_MARCH="$GCC_MARCH" \
-DARDUINO_LIB="$ARDUINO_LIB" \
-DPL_NETLIST="$PL_NETLIST" \
-DCMAKE_C_FLAGS="$TARGET_C_FLAGS" \
-DCMAKE_OBJCOPY="$OBJCOPY" \
-DCMAKE_OBJDUMP="$OBJDUMP"
And here is the Python-generated one.
OBJDUMP=`which riscv32-unknown-elf-objdump`
OBJCOPY=`which riscv32-unknown-elf-objcopy`
COMPILER=`which riscv32-unknown-elf-gcc`
RANLIB=`which riscv32-unknown-elf-ranlib`
VSIM=`which vsim`
TARGET_C_FLAGS="-O3 -m32 -g"
RVC=0
USE_ZERO_RISCY=0
RISCY_RV32F=0
ZERO_RV32M=0
ZERO_RV32E=0
GCC_MARCH="IMXpulpv2"
ARDUINO_LIB=1
PULP_GIT_DIRECTORY=../../
SIM_DIRECTORY="$PULP_GIT_DIRECTORY/vsim"
PL_NETLIST=""
cmake "$PULP_GIT_DIRECTORY"/sw/ \
-DPULP_MODELSIM_DIRECTORY="$SIM_DIRECTORY" \
-DCMAKE_C_COMPILER="$COMPILER" \
-DVSIM="$VSIM" \
-DRVC="$RVC" \
-DRISCY_RV32F="$RISCY_RV32F" \
-DUSE_ZERO_RISCY="$USE_ZERO_RISCY" \
-DZERO_RV32M="$ZERO_RV32M" \
-DZERO_RV32E="$ZERO_RV32E" \
-DGCC_MARCH="$GCC_MARCH" \
-DARDUINO_LIB="$ARDUINO_LIB" \
-DPL_NETLIST="$PL_NETLIST" \
-DCMAKE_C_FLAGS="$TARGET_C_FLAGS" \
-DCMAKE_OBJCOPY="$OBJCOPY" \
-DCMAKE_OBJDUMP="$OBJDUMP"
I don't know how much this will help you, but that's how it is.
Now I execute these scripts with ./script. The first script runs fine, but the second one gives me the error:
CMake Error: The source directory "../sw/build/ " does not exist.
And the path where the script resides is exactly ../sw/build/
What's going on here?
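The quoted CMake error shows a stray character after ../sw/build/, so one thing worth double-checking is whether the generated script contains invisible characters (such as carriage returns or a BOM) that don't show up on screen. A minimal sketch of a byte-level comparison in Python, with the file names as placeholders:
# Compare the two scripts byte-for-byte and print any lines that differ,
# using repr() so invisible characters (e.g. '\r') become visible.
with open('compile_original.sh', 'rb') as f1, open('compile_generated.sh', 'rb') as f2:
    for lineno, (a, b) in enumerate(zip(f1.readlines(), f2.readlines()), 1):
        if a != b:
            print(lineno, repr(a), repr(b))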

AWS Glue - Truncate destination postgres table prior to insert

I am trying to truncate a Postgres destination table prior to insert and, in general, trying to fire external functions utilizing the connections already created in Glue.
Has anyone been able to do so?
I've tried the DROP/TRUNCATE scenario, but was not able to do it with the connections already created in Glue; instead I used a pure Python PostgreSQL driver, pg8000:
1. Download the tar of pg8000 from PyPI
2. Create an empty __init__.py in the root folder
3. Zip up the contents & upload to S3
4. Reference the zip file in the Python lib path of the job
5. Set the DB connection details as job params (make sure to prepend all key names with --). Tick the "Server-side encryption" box.
Then you can simply create a connection and execute SQL.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pg8000
args = getResolvedOptions(sys.argv, [
'JOB_NAME',
'PW',
'HOST',
'USER',
'DB'
])
# Create Spark & Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# ...
config_port = 5432
conn = pg8000.connect(
database=args['DB'],
user=args['USER'],
password=args['PW'],
host=args['HOST'],
port=config_port
)
query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
After following step (4) of #thenaturalist's response,
sc.addPyFile("/home/glue/downloads/python/pg8000.zip")
import pg8000
worked for me in a development endpoint (Zeppelin notebook).
More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
To clarify #thenaturalist's instructions for the zip, as I still struggled with this:
Download the tar.gz of pg8000 from pypi.org and extract.
Zip the contents so you have the structure below:
pg8000-1.15.3.zip
|
|-- pg8000 <dir>
     |-- __init__.py
     |-- _version.py <optional>
     |-- core.py
Upload to s3 and then you should be able to just do a simple import pg8000.
NOTE: scramp is also required at the moment so follow the same procedure as above to include the scramp module. But you don't need to import it.
data=spark.sql(sql)
conf = glueContext.extract_jdbc_conf("jdbc-commerce")
data.write \
.mode('overwrite') \
.format("jdbc") \
.option("url", conf['url']) \
.option("database", 'Pacvue_Commerce') \
.option("dbtable", "dbo.glue_1") \
.option("user", conf['user']) \
.option('truncate','true') \
.option("password", conf['password']) \
.save()
The Glue API does not support this, but the Spark API does.
jdbc-commerce is your connection name from the crawler.
Use extract_jdbc_conf to get the URL, username and password.

What is wrong with my boto elastic mapreduce jar jobflow parameters?

I am using the boto library to create a job flow in Amazon's Elastic MapReduce web service (EMR). The following code should create a step:
step2 = JarStep(name='Find similiar items',
jar='s3n://recommendertest/mahout-core/mahout-core-0.5-SNAPSHOT.jar',
main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
step_args=['s3n://bucket/output/' + run_id + '/aggregate_watched/',
's3n://bucket/output/' + run_id + '/similiar_items/',
'SIMILARITY_PEARSON_CORRELATION'
])
When I run the job flow, it always fails throwing this error:
java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/JobContext
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.JobContext
This is the line in the EMR logs invoking the java code:
2011-01-24T22:18:54.491Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java \
-cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-0.18-core.jar:/home/hadoop/hadoop-0.18-tools.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* \
-Xmx1000m \
-Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 \
-Dhadoop.log.file=syslog \
-Dhadoop.home.dir=/home/hadoop \
-Dhadoop.id.str=hadoop \
-Dhadoop.root.logger=INFO,DRFA \
-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp \
-Djava.library.path=/home/hadoop/lib/native/Linux-i386-32 \
org.apache.hadoop.mapred.JobShell \
/mnt/var/lib/hadoop/steps/3/mahout-core-0.5-SNAPSHOT.jar \
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob \
s3n://..../output/job_2011-01-24_23:09:29/aggregate_watched/ \
s3n://..../output/job_2011-01-24_23:09:29/similiar_items/ \
SIMILARITY_PEARSON_CORRELATION
What is wrong with the parameters? The Java class definition can be found here:
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html
I found the solution for the problem:
You need to specify hadoop version 0.20 in the jobflow parameters
You need to run the JAR step with mahout-core-0.5-SNAPSHOT-job.jar, not with the mahout-core-0.5-SNAPSHOT.jar
If you have an additional streaming step in your jobflow, you need to fix a bug in boto:
Open boto/emr/step.py
Change line 138 to "return '/home/hadoop/contrib/streaming/hadoop-streaming.jar'"
Save and reinstall boto
This is how the run_jobflow function should be invoked to run with Mahout:
jobid = emr_conn.run_jobflow(name = name,
log_uri = 's3n://'+ main_bucket_name +'/emr-logging/',
enable_debugging=1,
hadoop_version='0.20',
steps=[step1,step2])
The fix to boto described in step #2 above (i.e. using the non-versioned hadoop-streaming.jar file) has been incorporated into the GitHub master in this commit:
https://github.com/boto/boto/commit/a4e8e065473b5ff9af554ceb91391f286ac5cac7
For some reference, here is how to do this from boto:
import boto.emr.connection as botocon
import boto.emr.step as step
con = botocon.EmrConnection(aws_access_key_id='', aws_secret_access_key='')
step = step.JarStep(name='Find similar items',
                    jar='s3://mahout-core-0.6-job.jar',
                    main_class='org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob',
                    action_on_failure='CANCEL_AND_WAIT',
                    step_args=['--input', 's3://', '--output', 's3://',
                               '--similarityClassname', 'SIMILARITY_PEARSON_CORRELATION'])
con.add_jobflow_steps('jflow', [step])
Obviously you need to upload the mahout-core-0.6-job.jar to an accessible S3 location, and the input and output locations have to be accessible as well.
