I have a Python bolt which parses information from a file. The bolt in question receives a file path, parses the file and then emits a number of tuples from within a for loop.
The problem is that when it runs, only two tuples are emitted and then it hangs. In the logs I can see that the correct number of keys has been parsed from the file and the first two tuples have been emitted, but after this there are no further logs related to the bolt (only metrics logs):
38640 [Thread-19] INFO backtype.storm.task.ShellBolt - ShellLog
pid:14644, name:ParseFileBolt Number of keys = 1373
38870 [Thread-21] INFO backtype.storm.daemon.task - Emitting:
ParseFileBolt default ["177328623"]
38870 [Thread-21] INFO backtype.storm.daemon.task - Emitting:
ParseFileBolt default ["177328532"]
Here is a simplified version of the code which produces the issue.
As noted in the code, if I manually enter a number of keys instead of parsing them from the file, they all get emitted successfully.
import gzip
import storm

class ParseFileBolt(storm.BasicBolt):
    def process(self, tup):
        file_path = tup.values[0]
        # If I parse keys from a file only two get emitted
        keys = get_keys(file_path)
        # e.g keys = {'393548331', '177329025', '123456789'}
        # If I manually enter the keys they all get emitted
        # keys = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
        storm.logInfo("Number of keys = {0}".format(len(keys)))
        for key in keys:
            storm.emit([key])

def get_keys(file_name):
    with gzip.open(file_name, 'rt') as file:
        key_set = set()
        for line in file:
            if line.startswith("#"):
                continue
            else:
                columns = line.split("|")
                key = columns[0].strip(' \t\n\r')
                key_set.add(key)
        return key_set

ParseFileBolt().run()
The file being parsed is a .gz file containing a header row starting with #, followed by rows of '|'-separated data.
# Header Row
177328623|columns1|column2|column3
177328532|columns1|column2|column3
123456789|columns1|column2|column3
...
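For reference, get_keys can be exercised outside Storm against a tiny file in the same format (a minimal sketch; the test file name below is made up) to confirm that the parsing itself is not the problem:

import gzip

# Build a small test file in the same format and run get_keys on it.
# 'test_keys.csv.gz' is just an example name, not part of the real setup.
with gzip.open('test_keys.csv.gz', 'wt') as f:
    f.write("# Header Row\n")
    f.write("177328623|columns1|column2|column3\n")
    f.write("177328532|columns1|column2|column3\n")

print(get_keys('test_keys.csv.gz'))  # expected: {'177328623', '177328532'}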
I'm using apache-storm-0.9.4 on Windows.
The issue occurs on both local and remote clusters.
Any thoughts on what the issue could be would be greatly appreciated.
I have a lot of YAML files with a similar structure but different data. I need to parse out selected data and put it into a single CSV (Excel) file as three columns.
But I am facing an issue with an empty key, which always gives me a "KeyError: 'port'".
My YAML file example:
base:
  server: 10.100.80.47
  port: 3306
  namePrefix: well
  user: user1
  password: kjj&%$
base:
  server: 10.100.80.48
  port:
  namePrefix: done
  user: user2
  password: fhfh#$%
In the second block I have an empty "port", and my script gets stuck at that point.
I need it so that whenever an empty key is found, nothing is written for it.
from asyncio.windows_events import NULL
from queue import Empty
import yaml
import csv
import glob

yaml_file_names = glob.glob('./*.yaml')
rows_to_write = []

for i, each_yaml_file in enumerate(yaml_file_names):
    print("Processing file {} of {} file name: {}".format(
        i+1, len(yaml_file_names), each_yaml_file))
    with open(each_yaml_file) as file:
        data = yaml.safe_load(file)
        for v in data:
            if "port" in v == "":
                data['base']['port'] = ""
        rows_to_write.append([data['base']['server'], data['base']['port'], data['server']['host'], data['server']['contex']])

with open('output_csv_file.csv', 'w', newline='') as out:
    csv_writer = csv.writer(out)
    csv_writer.writerow(["server", "port", "hostname", "contextPath"])
    csv_writer.writerows(rows_to_write)
print("Output file output_csv_file.csv created")
You are trying to access the key by indexing it, e.g.:
data['base']['port']
But what you want is to access it with the get method like so:
data['base'].get('port')
This way, if the key does not exist, it returns None by default, and you can even change the default value to whatever you want by passing it as the second parameter.
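For example (a rough sketch using the data variable from your loop; only server and port are shown):

# dict.get avoids the KeyError when the key is missing entirely;
# the second argument is the default returned in that case.
port = data['base'].get('port', "")
# If the key exists but its value was left empty in the YAML, the loader
# gives None rather than "", so normalise that as well before writing the row.
if port is None:
    port = ""
rows_to_write.append([data['base'].get('server', ""), port])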
In PyYAML, an empty element is returned as None, not an empty string.
if data['base']['port'] is None:
    data['base']['port'] = ""
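A quick way to see this behaviour (a minimal standalone sketch, independent of your files):

import yaml

print(yaml.safe_load("port: 3306"))  # {'port': 3306}
print(yaml.safe_load("port:"))       # {'port': None}  <- empty value becomes None
print(yaml.safe_load("port: ''"))    # {'port': ''}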
Your yaml file is invalid. In a yaml file, whenever you have a key (like port: in your example) you must provide a value; you cannot leave it empty and go to the next line. Unless the value is the next bunch of keys of course, but in that case you need to indent the following lines one step more, which is obviously not what you intend to do here.
This is likely why you cannot parse the file as expected with the Python yaml module. If you are the creator of those yaml files, you really need to put a value in the file, like port: None, if you don't want to provide a value for the port, or even better just not provide any port key at all.
If they are provided to you by someone else, ask them to provide valid yaml files.
Then the other solutions posted should work.
I have a CSV file that has 1800+ addresses. I need to compare the distance of every single one of them with a specific address. I wrote code that does that, but only if I add the addresses manually.
I want to run this code on every line of the CSV file and print the distance in km and in minutes. How can I do that?
This is my code:
# Needed to read json and to use the endpoint request
import urllib.request
import json

# Google Maps Directions API endpoint
endpoint = 'https://maps.googleapis.com/maps/api/directions/json?'
api_key = 'add api'

# Give the original work address and lists of addresses.
# Format has to be (Number Street Name City Province)
# So for example 1280 Main Street Hamilton ON
origin = ('add the one address to calculate distance with the other').replace(' ', '+')
destinations = ['address1', 'address2', 'address3']
distances = []

# Goes through the array of addresses and calculates each of their distances
for i in range(len(destinations)):
    # Replaces the spaces with + so that it can properly work with the google maps api url
    currentDestination = destinations[i].replace(' ', '+')
    # Building the URL for the request
    nav_request = 'origin={}&destination={}&key={}'.format(origin, currentDestination, api_key)
    # Builds the request to be sent
    request = endpoint + nav_request
    # Sends the request and reads the response.
    response = urllib.request.urlopen(request).read()
    # Loads response as JSON
    directions = json.loads(response)
    # Gets the distance from the address in the array to the origin address
    distance = directions["routes"][0]["legs"][0]["distance"]["text"]
    # Adds it to the list of distances found from each address
    distances.append(distance)

# print distances
print(*distances, sep="\n")
Instead of having a hard-coded destinations list, it should loop through the addresses in the CSV file.
Considering that your file has just one column with addresses and no quotes at the beginning/end, the task is simply reading the lines of the file into a list. This can be done the following way:
with open("addresses.txt","r") as f:
addresses = [i.rstrip("\n") for i in f]
print(addresses[:20]) # this will show at most 20 entries, which should allow check if it works as intended
Please run the code above after replacing addresses.txt with the name of your file and write back whether it works as intended. with open ... is used to make sure that the file is closed properly after it was used; .rstrip is used to remove newlines from the ends of lines.
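A rough sketch of how that list could then replace the hard-coded destinations in your original loop (the file name addresses.csv and the one-address-per-row layout are assumptions):

import csv
import json
import urllib.request

endpoint = 'https://maps.googleapis.com/maps/api/directions/json?'
api_key = 'add api'
origin = 'add the one address to calculate distance with the other'.replace(' ', '+')

# Read the destination addresses from the CSV file (first column of each row)
with open('addresses.csv', newline='') as f:
    destinations = [row[0] for row in csv.reader(f) if row]

results = []
for destination in destinations:
    nav_request = 'origin={}&destination={}&key={}'.format(
        origin, destination.replace(' ', '+'), api_key)
    response = urllib.request.urlopen(endpoint + nav_request).read()
    directions = json.loads(response)
    leg = directions["routes"][0]["legs"][0]
    # distance (e.g. "12.3 km") and duration (e.g. "18 mins") as returned by the API
    results.append((destination, leg["distance"]["text"], leg["duration"]["text"]))

for destination, distance, duration in results:
    print("{}: {} / {}".format(destination, distance, duration))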
I am using the sample program from the Snowflake documentation on using Python to ingest data into the destination table.
So basically, I have to execute a PUT command to load data into the internal stage and then run the Python program to notify Snowpipe to ingest the data into the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;

create or replace pipe exampledb.dbschema.example_pipe
as copy into exampledb.dbschema.example_table
from
(
  select
    t.*
  from
    @exampledb.dbschema.example_stage t
)
file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
The PUT command:
put file://E:\\example\\data\\a.csv @exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging

logging.basicConfig(
    filename='/tmp/ingest.log',
    level=logging.DEBUG)
logger = getLogger(__name__)

# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
    return '<private_key_passphrase>'

with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
    pemlines = pem_in.read()
    private_key_obj = load_pem_private_key(pemlines,
                                           get_private_key_passphrase().encode(),
                                           default_backend())

private_key_text = private_key_obj.private_bytes(
    Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')
# Assume the public key has been registered in Snowflake:
# private key in PEM format

# List of files in the stage specified in the pipe definition
file_list = ['a.csv.gz']
ingest_manager = SimpleIngestManager(account='<account_identifier>',
                                     host='<account_identifier>.snowflakecomputing.com',
                                     user='<user_login_name>',
                                     pipe='exampledb.dbschema.example_pipe',
                                     private_key=private_key_text)

# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
    staged_file_list.append(StagedFile(file_name, None))

try:
    resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
    # HTTP error, may need to retry
    logger.error(e)
    exit(1)

# This means Snowflake has received file and will start loading
assert(resp['responseCode'] == 'SUCCESS')

# Needs to wait for a while to get result in history
while True:
    history_resp = ingest_manager.get_history()

    if len(history_resp['files']) > 0:
        print('Ingest Report:\n')
        print(history_resp)
        break
    else:
        # wait for 20 seconds
        time.sleep(20)

hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')

print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file from the internal stage:
REMOVE @exampledb.dbschema.example_stage;
The code works as expected the first time, but when I truncate the data in that table and run the code again, the table in Snowflake doesn't have any data in it.
Am I missing something here? How can I make this code run multiple times?
Update:
I found that if I use a file with a different name each time I run it, the data does load into the Snowflake table.
So how can I run this code without changing the data file name?
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax.
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information have a look here
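If recreating the pipe on each run is not desirable, the other option, which your test with a different file name already hints at, is simply to stage every run under a new name. A minimal sketch (the timestamped-copy naming scheme is just an illustration, not part of the Snowflake sample):

import datetime
import shutil

def timestamped_copy(path):
    # e.g. a.csv -> a_20240101T120000.csv ; illustrative naming only
    stamp = datetime.datetime.utcnow().strftime('%Y%m%dT%H%M%S')
    base, _, ext = path.rpartition('.')
    new_path = '{}_{}.{}'.format(base, stamp, ext)
    shutil.copyfile(path, new_path)
    return new_path

# PUT this copy instead of a.csv, then list the matching compressed name
# (e.g. a_20240101T120000.csv.gz) in file_list for the ingest manager.
print(timestamped_copy('E:\\example\\data\\a.csv'))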
Using Python 2 (atm) and ruamel.yaml 0.13.14 (RedHat EPEL).
I'm currently writing some code to load YAML definitions, but they are split up in multiple files. The user-editable part contains e.g.:
users:
  xxxx1:
    timestamp: '2018-10-22 11:38:28.541810'
    << : *userdefaults
  xxxx2:
    << : *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
the defaults are stored in another file, which is not editable:
userdefaults: &userdefaults
# Default values for user settings
fileCountQuota: 1000
diskSizeQuota: "300g"
I can process these together by loading both and concatenating the strings, and then running them through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)), which correctly resolves everything. (When not using RoundTripLoader I get errors that the references cannot be resolved, which is normal.)
Now, I want to do some updates via Python code (e.g. update the timestamp), and for that I need to write back just the user part. And that's where things get hairy. I so far haven't found a way to write just that YAML document, not both.
First of all, unless there are multiple documents in your defaults file, you
don't have to use load_all, as you don't concatenate two documents into a
multiple-document stream. If you had, by using a format string with a document-end
marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}"),
your aliases would not carry over from one document to another, as per the
YAML specification:
It is an error for an alias node to use an anchor that does not
previously occur in the document.
The anchor has to be in the document, not just in the stream (which can consist of multiple
documents).
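A small illustration of that rule, using the same old-style ruamel.yaml API as the rest of this answer (a sketch only):

from ruamel import yaml
from ruamel.yaml.composer import ComposerError

one_document = "defaults: &d\n  x: 1\nuser:\n  a: *d\n"
print(yaml.load(one_document, Loader=yaml.RoundTripLoader))  # alias resolves within the document

two_documents = "defaults: &d\n  x: 1\n---\nuser:\n  a: *d\n"
try:
    list(yaml.load_all(two_documents, Loader=yaml.RoundTripLoader))
except ComposerError as err:
    print(err)  # undefined alias 'd': anchors do not cross document boundaries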
I tried some hocus pocus, pre-populating the already represented dictionary
of anchored nodes:
import sys
import datetime
from ruamel import yaml


def load():
    with open('defaults.yaml') as fp:
        defaults_data = fp.read()
    with open('user.yaml') as fp:
        user_data = fp.read()
    merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                            Loader=yaml.RoundTripLoader)
    return merged_data


class MyRTDGen(object):
    class MyRTD(yaml.RoundTripDumper):
        def __init__(self, *args, **kw):
            pps = kw.pop('pre_populate', None)
            yaml.RoundTripDumper.__init__(self, *args, **kw)
            if pps is not None:
                for pp in pps:
                    try:
                        anchor = pp.yaml_anchor()
                    except AttributeError:
                        anchor = None
                    node = yaml.nodes.MappingNode(
                        u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
                    self.represented_objects[id(pp)] = node

    def __init__(self, pre_populate=None):
        assert isinstance(pre_populate, list)
        self._pre_populate = pre_populate

    def __call__(self, *args, **kw):
        kw1 = kw.copy()
        kw1['pre_populate'] = self._pre_populate
        myrtd = self.MyRTD(*args, **kw1)
        return myrtd


def update(md, file_name):
    ud = md.pop('userdefaults')
    MyRTD = MyRTDGen([ud])
    yaml.dump(md, sys.stdout, Dumper=MyRTD)
    with open(file_name, 'w') as fp:
        yaml.dump(md, fp, Dumper=MyRTD)


md = load()
md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
update(md, 'user.yaml')
Since the PyYAML-based API requires a class instead of an object, you need to
use a class generator that actually adds the data elements to pre-populate on
the fly from within yaml.load().
But this doesn't work, as a node only gets written out with an anchor once it is
determined that the anchor is used (i.e. there is a second reference). So actually the
first merge key gets written out as an anchor. And although I am quite familiar
with the code base, I could not get this to work properly in a reasonable amount of time.
So instead, I would just rely on the fact that there is only one key that matches
the first key of user.yaml at the root level of the dump of the combined updated
file, and strip anything before that.
import sys
import datetime
from ruamel import yaml

with open('defaults.yaml') as fp:
    defaults_data = fp.read()
with open('user.yaml') as fp:
    user_data = fp.read()
merged_data = yaml.load("{}\n{}".format(defaults_data, user_data),
                        Loader=yaml.RoundTripLoader)

# find the key
for line in user_data.splitlines():
    line = line.split('# ')[0].rstrip()  # end of line comment, not checking for strings
    if line and line[-1] == ':' and line[0] != ' ':
        split_key = line
        break

merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
buf = yaml.compat.StringIO()
yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
document = split_key + buf.getvalue().split('\n' + split_key)[1]
sys.stdout.write(document)
which gives:
users:
  xxxx1:
    <<: *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'
  xxxx2:
    <<: *userdefaults
    timestamp: '2018-10-23 09:59:13.829978'
I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14.
That version is from the time I was still young (I won't claim to have been innocent).
There have been over 85 releases of the library since then.
I can understand that you might not be able to run anything but
Python 2 at the moment and cannot compile/use a newer version. But what
you really should do is install virtualenv (can be done using EPEL, but also without
further "polluting" your system installation), make a virtualenv for the
code you are developing and install the latest version of ruamel.yaml (and
your other libraries) in there. You can also do that if you need
to distribute your software to other systems; just install virtualenv there as well.
I have all my utilities under /opt/util, managed with
virtualenvutils, a wrapper around virtualenv.
For writing the user part, you will have to manually split the multi-document output of yaml.dump() and write the appropriate part back to the users YAML file.
import datetime
import StringIO
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='rt')
data = None

with open('defaults.yaml', 'r') as defaults:
    with open('users.yaml', 'r') as users:
        raw = "{}\n{}".format(''.join(defaults.readlines()), ''.join(users.readlines()))
        data = list(yaml.load_all(raw))

data[0]['users']['xxxx1']['timestamp'] = datetime.datetime.now().isoformat()

with open('users.yaml', 'w') as outfile:
    sio = StringIO.StringIO()
    yaml.dump(data[0], sio)
    out = sio.getvalue()
    outfile.write(out.split('\n\n')[1])  # write the second part here as this is the contents of users.yaml
I have a Python function that will open a YAML file and read the data. The YAML file contains two API keys and a domain. I want to return each value in a dictionary so they can be used in the program. However, I get the error
"list indices must be integers, not str".
Should I just make the variables global, so it doesn't have to return anything?
The code is:
def ImportConfig():
    with open("config.yml", 'r') as ymlfile:
        config = yaml.load(ymlfile)

    darksky_api = config['darksky']['api_key']
    gmaps_api = ['gmaps']['api_key']
    gmaps_domain = ['gmaps']['domain']

    return {'darksky_api_key': darksky_api, 'gmaps_api_key': gmaps_api, 'gmaps_domain': gmaps_domain}
What does it mean that the list indices must be integers? I thought curly brackets indicated a dictionary? Also is there a better way to do this?
Independent of your YAML file: if you type ['xy'] at the Python prompt you create a list with one element, and if you then index that with another string:
['xy']['abc']
you'll get that error.
You are missing config in lines 5 and 6 of your program:
def ImportConfig():
    with open("config.yml", 'r') as ymlfile:
        config = yaml.safe_load(ymlfile)

    darksky_api = config['darksky']['api_key']
    gmaps_api = config['gmaps']['api_key']
    gmaps_domain = config['gmaps']['domain']

    return {'darksky_api_key': darksky_api, 'gmaps_api_key': gmaps_api, 'gmaps_domain': gmaps_domain}
Please note that using load() in PyYAML is a security risk; for your data you should use safe_load().
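A call site would then look something like this (a sketch; the key names are the ones returned by the function above):

config = ImportConfig()
print(config['darksky_api_key'])
print(config['gmaps_domain'])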