Spark streaming not reading from local directory - python

I am trying to write a spark streaming application using Spark Python API.
The application should read text files from local directory and send it to Kafka cluster.
When submitting the python script to spark engine, nothing sent to kafka at all.
I tried to print the events instead of send it to Kafka and found that there is nothing read.
Here is the code of the script.
#!/usr/lib/python
# -*- coding: utf-8 -*-
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from kafka import KafkaProducer
import sys
import time
reload(sys)
sys.setdefaultencoding('utf8')
producer = KafkaProducer(bootstrap_servers="kafka-b01.css.org:9092,kafka-b02.css.org:9092,kafka-b03.css.org:9092,kafka-b04.css.org:9092,kafka-b05.css.org:9092")
def send_to_kafka(rdd):
tweets = rdd.collect()
print ("--------------------------")
print (tweets)
print "--------------------------"
#for tweet in tweets:
# producer.send('test_historical_job', value=bytes(tweet))
if __name__ == "__main__":
conf = SparkConf().setAppName("TestSparkFromPython")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)
tweetsDstream = ssc.textFileStream("/tmp/historical/")
tweetsDstream.foreachRDD(lambda rdd: send_to_kafka(rdd))
ssc.start()
ssc.awaitTermination()
I am submitting the script using this command
./spark-submit --master spark://spark-master:7077 /apps/historical_streamer.py
The output of the print statement is an empty list.
--------------------------
[]
--------------------------
EDIT
based on this question I changed the path of the data directory from "/tmp/historical/" to "file:///tmp/historical/".
I tried to run the job first and then move files to the directory but unfortunately it did not work also.

File stream based sources like fileStream or textFileStream expect data files to be:
be created in the dataDirectory by atomically moving or renaming them into the data directory.
If there are no new files in a given window there is nothing to proces so pre-existing files (it seems to be the case here) won't be read on won't show on the output.

Your function:
def send_to_kafka(rdd):
tweets = rdd.collect()
print ("--------------------------")
print (tweets)
print "--------------------------"
#for tweet in tweets:
# producer.send('test_historical_job', value=bytes(tweet))
will collect all the rdd, but it won't print the content of the rdd. To do so, you should use the routine:
tweets.foreach(println)
that will, for every element in the RDD, give as output the elements. As explained in the Spark Documentation
Hope this will help

Related

Data loading on MongoDB server works from jupyter notebook but not from script

I have a dataset stored in a json file and I need to upload it to a MongoDB server. Everything works fine if I upload the data using a Jupyter notebook, but not if I use a python script instead. The code is exactly the same. How do you suggest fixing this?
Here is the code:
import pandas as pd
import pymongo
from pymongo import MongoClient
import json
import DNS
# Function to upload the dialogue lines to MongoDB Server
def prepare_dataframe():
dialogue_edited = pd.read_json("5lines1response_random100from880_cleaned.json")
dialogue_edited.reset_index(inplace=True)
data_dict = dialogue_edited.to_dict("records")# Insert collection
# To communicate with the MongoDB Server
cluster = MongoClient()
db = cluster['DebuggingSystem']
collection = db['MCS_dialogue']
collection.insert_many(data_dict)
return collection
if __name__ == '__main__':
collection = prepare_dataframe()
Here is a screenshot of the python script and of the jupyter notebook. I'm running the notebook using Visual Studio.
Replace
if __name__ == '__main__':
collection = prepare_dataframe()
with
collection = prepare_dataframe()
and try runnning your script. __main__ explained here pretty well.

Cannot ingest data using snowpipe more than once

I am using the sample program from the Snowflake document on using Python to ingest the data to the destination table.
So basically, I have to execute put command to load data to the internal stage and then run the Python program to notify the snowpipe to ingest the data to the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;
create or replace pipe exampledb.dbschema.example_pipe
as copy into exampledb.dbschema.example_table
from
(
select
t.*
from
#exampledb.dbschema.example_stage t
)
file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
put command:
put file://E:\\example\\data\\a.csv #exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging
logging.basicConfig(
filename='/tmp/ingest.log',
level=logging.DEBUG)
logger = getLogger(__name__)
# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
return '<private_key_passphrase>'
with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
pemlines = pem_in.read()
private_key_obj = load_pem_private_key(pemlines,
get_private_key_passphrase().encode(),
default_backend())
private_key_text = private_key_obj.private_bytes(
Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')
# Assume the public key has been registered in Snowflake:
# private key in PEM format
# List of files in the stage specified in the pipe definition
file_list=['a.csv.gz']
ingest_manager = SimpleIngestManager(account='<account_identifier>',
host='<account_identifier>.snowflakecomputing.com',
user='<user_login_name>',
pipe='exampledb.dbschema.example_pipe',
private_key=private_key_text)
# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
staged_file_list.append(StagedFile(file_name, None))
try:
resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
# HTTP error, may need to retry
logger.error(e)
exit(1)
# This means Snowflake has received file and will start loading
assert(resp['responseCode'] == 'SUCCESS')
# Needs to wait for a while to get result in history
while True:
history_resp = ingest_manager.get_history()
if len(history_resp['files']) > 0:
print('Ingest Report:\n')
print(history_resp)
break
else:
# wait for 20 seconds
time.sleep(20)
hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')
print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file in the internal stage:
REMOVE #exampledb.dbschema.example_stage;
The code works as expected for the first time but when I truncate the data on that table and run the code again, the table on snowflake doesn't have any data in it.
Do I miss something here? How can I make this code can run multiple times?
Update:
I found that if I use a file with a different name each time I run, the data can load to the snowflake table.
So how can I run this code without changing the data filename?
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax.
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information have a look here

Post beanstalk queue data to Mssql using python script

Guys I'm currently developing an offline ALPR solution.
So far I've used OpenAlpr software running on Ubuntu. By using a python script I found on StackOverlFlow I'm able to read the beanstalk queue data (plate number & meta data) of the ALPR but I need to send this data from the beanstalk queue to a mssql database. Does anyone know how to export beanstalk queue data or JSON data to the database? The code below is for local-host, how do i modify it to automatically post data to the mssql database? The data in the beanstalk queue is in JSON format [key=value].
The read & write csv was my addition to see if it can save the json data as csv on localdisk
import beanstalkc
import json
from pprint import pprint
beanstalk = beanstalkc.Connection(host='localhost', port=11300)
TUBE_NAME='alprd'
text_file = open('output.csv', 'w')
# For diagnostics, print out a list of all the tubes available in Beanstalk.
print beanstalk.tubes()
# For diagnostics, print the number of items on the current alprd queue.
try:
pprint(beanstalk.stats_tube(TUBE_NAME))
except beanstalkc.CommandFailed:
print "Tube doesn't exist"
# Watch the "alprd" tube; this is where the plate data is.
beanstalk.watch(TUBE_NAME)
while True:
# Wait for a second to get a job. If there is a job, process it and delete it from the queue.
# If not, return to sleep.
job = beanstalk.reserve(timeout=5000)
if job is None:
print "No plates yet"
else:
plates_info = json.loads(job.body)
# Do something with this data (e.g., match a list, open a gate, etc.).
# if 'data_type' not in plates_info:
# print "This shouldn't be here... all OpenALPR data should have a data_type"
# if plates_info['data_type'] == 'alpr_results':
# print "Found an individual plate result!"
if plates_info['data_type'] == 'alpr_group':
print "Found a group result!"
print '\tBest plate: {} ({:.2f}% confidence)'.format(
plates_info['candidates'][0]['plate'],
plates_info['candidates'][0]['confidence'])
make_model = plates_info['vehicle']['make_model'][0]['name']
print '\tVehicle information: {} {} {}'.format(
plates_info['vehicle']['year'][0]['name'],
plates_info['vehicle']['color'][0]['name'],
' '.join([word.capitalize() for word in make_model.split('_')]))
elif plates_info['data_type'] == 'heartbeat':
print "Received a heartbeat"
text_file.write('Best plate')
# Delete the job from the queue when it is processed.
job.delete()
text_file.close()
AFAIK there is no way to directly export data from beanstalkd.
What you have makes sense, that is streaming data out of a tube into a flat file or performing a insert into the DB directly
Given the IOPS beanstalkd can be produced, it might still be a reasonable solution (depends on what performance you are expecting)
Try asking https://groups.google.com/forum/#!forum/beanstalk-talk as well

How to handle BigQuery insert errors in a Dataflow pipeline using Python?

I'm trying to create a streaming pipeline with Dataflow that reads messages from a PubSub topic to end up writing them on a BigQuery table. I don't want to use any Dataflow template.
For the moment I just want to create a pipeline in a Python3 script executed from a Google VM Instance to carry out a loading and transformation process of every message that arrives from Pubsub (parsing the records that it contains and adding a new field) to end up writing the results on a BigQuery table.
Simplifying, my code would be:
#!/usr/bin/env python
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import pubsub_v1,
import apache_beam as beam
import apache_beam.io.gcp.bigquery
import logging
import argparse
import sys
import json
from datetime import datetime, timedelta
def load_pubsub(message):
try:
data = json.loads(message)
records = data["messages"]
return records
except:
raise ImportError("Something went wrong reading data from the Pub/Sub topic")
class ParseTransformPubSub(beam.DoFn):
def __init__(self):
self.water_mark = (datetime.now() + timedelta(hours = 1)).strftime("%Y-%m-%d %H:%M:%S.%f")
def process(self, records):
for record in records:
record["E"] = self.water_mark
yield record
def main():
table_schema = apache_beam.io.gcp.bigquery.parse_table_schema_from_json(open("TableSchema.json"))
parser = argparse.ArgumentParser()
parser.add_argument('--input_topic')
parser.add_argument('--output_table')
known_args, pipeline_args = parser.parse_known_args(sys.argv)
with beam.Pipeline(argv = pipeline_args) as p:
pipe = ( p | 'ReadDataFromPubSub' >> beam.io.ReadStringsFromPubSub(known_args.input_topic)
| 'LoadJSON' >> beam.Map(load_pubsub)
| 'ParseTransform' >> beam.ParDo(ParseTransformPubSub())
| 'WriteToAvailabilityTable' >> beam.io.WriteToBigQuery(
table = known_args.output_table,
schema = table_schema,
create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
write_disposition = beam.io.BigQueryDisposition.WRITE_APPEND)
)
result = p.run()
result.wait_until_finish()
if __name__ == '__main__':
logger = logging.getLogger().setLevel(logging.INFO)
main()
(For example) The messages published in the PubSub topic use to come as follows:
'{"messages":[{"A":"Alpha", "B":"V1", "C":3, "D":12},{"A":"Alpha", "B":"V1", "C":5, "D":14},{"A":"Alpha", "B":"V1", "C":3, "D":22}]}'
If the field "E" is added in the record, then the structure of the record (dictionary in Python) and the data type of the fields is what the BigQuery table expects.
The problems that a I want to handle are:
If some messages come with an unexpected structure I want to fork the pipeline flatten and write them in another BigQuery table.
If some messages come with an unexpected data type of a field, then in the last level of the pipeline when they should be written in the table an error will occur. I want to manage this type of error by diverting the record to a third table.
I read the documentation found on the following pages but I found nothing:
https://cloud.google.com/dataflow/docs/guides/troubleshooting-your-pipeline
https://cloud.google.com/dataflow/docs/guides/common-errors
By the way, if I choose the option to configure the pipeline through the template that reads from a PubSubSubscription and writes into BigQuery I get the following schema which turns out to be the same one I'm looking for:
Template: Cloud Pub/Sub Subscription to BigQuery
You can't catch the errors that occur in the sink to BigQuery. The message that you write into bigquery must be good.
The best pattern is to perform a transform which checks your messages structure and fields type. In case of error, you create an error flow and you write this issue flow in a file (for example, or in a table without schema, you write in plain text your message)
We do the following when errors occur at the BigQuery sink.
send a message (without stacktrace) to GCP Error Reporting, for developers to be notified
log the error to StackDriver
stop the pipeline execution (the best place for messages to wait until a developer has fixed the issue, is the incomming pubSub subscription)

how to read mp3 data from google cloud using python

I am trying to read mp3/wav data from google cloud and trying to implement audio diarization technique. Issue is that I am not able to read the result which has passed by google api in variable response.
below is my python code
speech_file = r'gs://pp003231/a4a.wav'
config = speech.types.RecognitionConfig(
encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
language_code='en-US',
enable_speaker_diarization=True,
diarization_speaker_count=2)
audio = speech.types.RecognitionAudio(uri=speech_file)
response = client.long_running_recognize(config, audio)
print response
result = response.results[-1]
print result
Output displayed on console is
Traceback (most recent call last):
File "a1.py", line 131, in
print response.results
AttributeError: 'Operation' object has no attribute 'results'
Can you please share your expert advice about what I am doing wrong?
Thanks for your help.
Its too late for the author of this thread. However, posting the solution for someone in future as I too had similar issue.
Change
result = response.results[-1]
to
result = response.result().results[-1]
and it will work fine
Do you have access to the wav file in your bucket? also, this is the entire code? It seems that the sample_rate_hertz and the imports are missing. Here you have the code copy/pasted from the google docs samples, but I edited it to have just the diarization function.
#!/usr/bin/env python
"""Google Cloud Speech API sample that demonstrates enhanced models
and recognition metadata.
Example usage:
python diarization.py
"""
import argparse
import io
def transcribe_file_with_diarization():
"""Transcribe the given audio file synchronously with diarization."""
# [START speech_transcribe_diarization_beta]
from google.cloud import speech_v1p1beta1 as speech
client = speech.SpeechClient()
audio = speech.types.RecognitionAudio(uri="gs://<YOUR_BUCKET/<YOUR_WAV_FILE>")
config = speech.types.RecognitionConfig(
encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
sample_rate_hertz=8000,
language_code='en-US',
enable_speaker_diarization=True,
diarization_speaker_count=2)
print('Waiting for operation to complete...')
response = client.recognize(config, audio)
# The transcript within each result is separate and sequential per result.
# However, the words list within an alternative includes all the words
# from all the results thus far. Thus, to get all the words with speaker
# tags, you only have to take the words list from the last result:
result = response.results[-1]
words_info = result.alternatives[0].words
# Printing out the output:
for word_info in words_info:
print("word: '{}', speaker_tag: {}".format(word_info.word,
word_info.speaker_tag))
# [END speech_transcribe_diarization_beta]
if __name__ == '__main__':
transcribe_file_with_diarization()
To run the code just name it diarization.py and use the command:
python diarization.py
Also, you have to install the latest google-cloud-speech library:
pip install --upgrade google-cloud-speech
And you need to have the credentials of your service account in a json file, you can check more info here

Categories