How to get data from an API using Apache Beam (Dataflow)? - python

I have done some Python programming, but I'm not a seasoned developer by any stretch of the imagination. We have a Python ETL program that was set up as a Cloud Function, but it is timing out because there is too much data to load, and we are looking to rewrite it to run in Dataflow.
The code at the moment simply connects to an API, which returns newline-delimited JSON, and then the data is loaded into a new table in BigQuery.
This is our first time using Dataflow and we are just trying to get to grips with how it works. It seems pretty easy to get the data into BigQuery; the stumbling block we are hitting is how to get the data out of the API. It's not clear to us how we can make this work. Do we need to go down the route of developing a new I/O connector, as per [Develop IO Connector], or is there another option? Developing a new connector seems complex.
We've done a lot of googling, but haven't found anything obvious to help.
Here is a sample of our code, but we are not 100% sure it's on the right track. The code doesn't work, and we think it needs to start with an I/O read rather than a ParDo, but we aren't quite sure where to go with that. Some guidance would be much appreciated!
class callAPI(beam.DoFn):
    def __init__(self, input_header, input_uri):
        self.headers = input_header
        self.remote_url = input_uri

    def process(self):
        try:
            res = requests.get(self.remote_url, headers=self.headers)
            res.raise_for_status()
        except HTTPError as message:
            logging.error(message)
            return
        return res.text

with beam.Pipeline() as p:
    data = (
        p
        | 'Call API ' >> beam.ParDo(callAPI(HEADER, REMOTE_URI))
        | beam.Map(print))
Thanks in advance.

You are on the right track, but there are a couple of things to fix.
As you point out, the root of a pipeline needs to be a read of some kind. The ParDo operation processes a set of elements (ideally in parallel), but needs some input to process. You could do
p | beam.Create(['a', 'b', 'c']) | beam.ParDo(SomeDoFn())
in which SomeDoFn will be passed a, b, and c to its process method. There is a special p | beam.Impulse() operation that will produce a single None element if there's no reasonable input and you want to ensure your DoFn is called just once. You can also read elements from a file (or similar). Note that your process method takes both self and the element to be processed, and returns an iterable (to allow zero or more outputs; there are also beam.Map and beam.FlatMap, which encapsulate the simpler patterns). So you could do something like
import logging

import apache_beam as beam
import requests
from requests.exceptions import HTTPError


class CallAPI(beam.DoFn):
    def __init__(self, input_header):
        self.headers = input_header

    def process(self, input_uri):
        try:
            res = requests.get(input_uri, headers=self.headers)
            res.raise_for_status()
        except HTTPError as message:
            logging.error(message)
            return  # emit nothing for a failed request
        yield res.text


with beam.Pipeline() as p:
    data = (
        p
        | beam.Create([REMOTE_URI])
        | 'Call API ' >> beam.ParDo(CallAPI(HEADER))
        | beam.Map(print))
which would allow you to read from more than one URI (in parallel) in the same pipeline.
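As a side note, since process here just wraps a single request, the same thing can be written with beam.FlatMap and a plain function. A minimal sketch, assuming the same HEADER and REMOTE_URI placeholders and the imports above:

def call_api(uri, headers):
    # Fetch one URI; yield nothing on failure so the element is simply dropped.
    try:
        res = requests.get(uri, headers=headers)
        res.raise_for_status()
    except HTTPError as message:
        logging.error(message)
        return
    yield res.text

with beam.Pipeline() as p:
    data = (
        p
        | beam.Create([REMOTE_URI])
        | 'Call API' >> beam.FlatMap(call_api, HEADER)
        | beam.Map(print))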
You could write a full IO connector if your source is such that it can be split (ideally dynamically) rather than only read in one huge request.

Can you share the code from your cloud function?
Is this a scheduled task or triggered by an event? If it is a scheduled task, Apache Airflow may be a better option; you could use Dataflow Python operators and BigQuery operators to do what you're looking for.
Apache Airflow https://airflow.apache.org/
DataFlowPythonOperator https://airflow.apache.org/docs/apache-airflow/1.10.6/_api/airflow/contrib/operators/dataflow_operator/index.html#airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator
BigQueryOperator https://airflow.apache.org/docs/apache-airflow/1.10.14/_api/airflow/contrib/operators/bigquery_operator/index.html
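For illustration only, a rough sketch of what scheduling the Beam pipeline could look like in an Airflow 1.10-era DAG (the GCS path, project and schedule are made-up placeholders, not a tested setup):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG(dag_id='api_to_bigquery',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@daily',
         catchup=False) as dag:

    run_pipeline = DataFlowPythonOperator(
        task_id='run_beam_pipeline',
        py_file='gs://my-bucket/pipeline.py',  # hypothetical path to the Beam pipeline file
        options={'project': 'my-project'},     # hypothetical Dataflow options
    )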

Related

simple example of working with neo4j python driver?

Is there a simple example of working with the neo4j python driver?
How do I just pass cypher query to the driver to run and return a cursor?
If I'm reading, for example, this, it seems the demo has a class wrapper with a private member function that I pass to session.write_transaction,
session.write_transaction(self._create_and_return_greeting, ...
That then gets called with a transaction as its first parameter...
def _create_and_return_greeting(tx, message):
that in turn runs the cypher
result = tx.run("CREATE (a:Greeting) "
This seems 10X more complicated than it needs to be.
I did just try a simpler:
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        try:
            result = session.run(query, **kwargs)
            return result.data()
But this results in a socket error on the query, probably because the session goes out of scope?
[dfcx/__init__] ERROR | Underlying socket connection gone (_ssl.c:2396)
[dfcx/__init__] ERROR | Failed to write data to connection IPv4Address(('neo4j-core-8afc8558-3.production-orch-0042.neo4j.io', 7687)) (IPv4Address(('34.82.120.138', 7687)))
Also I can't return a cursor/iterator, just the data()
When the session goes out of scope, the query result seems to die with it.
If I manually open and close a session, then I'd have the same problems?
Python must be the most popular language this DB is used with, does everyone use a different driver?
Py2neo seems cute, but it's completely lacking ORM wrapper functions for most of the Cypher language features, so you have to drop down to raw Cypher anyway. And I'm not sure it supports **kwargs argument interpolation in the same way.
I guess that big raise should help iron out some kinks :D
Slightly longer version trying to get a working DB wrapper:
def neo_connect() -> Union[neo4j.BoltDriver, neo4j.Neo4jDriver]:
    global raw_driver
    if raw_driver:
        # print('reuse driver')
        return raw_driver
    neoconfig = NEOCONFIG
    raw_driver = neo4j.GraphDatabase.driver(
        neoconfig['url'], auth=(
            neoconfig['user'], neoconfig['pass']))
    if raw_driver is None:
        raise BaseException("cannot connect to neo4j")
    else:
        return raw_driver

def raw_query(query, **kwargs):
    # just get data, no cursor
    neodriver = neo_connect()
    session = neodriver.session()
    # logging.info('neoquery %s', query)
    # with neodriver.session() as session:
    try:
        result = session.run(query, **kwargs)
        data = result.data()
        return data
    except neo4j.exceptions.CypherSyntaxError as err:
        logging.error('neo error %s', err)
        logging.error('failed query: %s', query)
        raise err
    # finally:
    #     logging.info('close session')
    #     session.close()
Update: someone pointed me to this example, which is another way to use the tx wrapper.
https://github.com/neo4j-graph-examples/northwind/blob/main/code/python/example.py#L16-L21
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        try:
            result = session.run(query, **kwargs)
            return result.data()
This is perfectly fine and works as intended on my end.
The error you're seeing is stating that there is a connection problem, so there must be something going on between the server and the driver that's outside of the driver's influence.
Also, please note, that there is a difference between all of these ways to run a query:
# 1. Auto-commit transaction
with driver.session() as session:
    result = session.run("<SOME CYPHER>")

# 2. Managed transaction
def work(tx):
    result = tx.run("<SOME CYPHER>")

with driver.session() as session:
    session.write_transaction(work)
The latter one might be three lines longer, and the team working on the drivers has collected some feedback regarding this. However, there are more things to consider. Firstly, changing the API surface is something that needs careful planning and cannot be done in, say, a patch release. Secondly, there are technical hurdles to overcome. Here are the semantics, anyway:
Auto-commit transaction. Runs only that query as one unit of work.
If you run a new auto-commit transaction within the same session, the previous result will buffer all available records for you (depending on the query, this will consume a lot of memory). This can be avoided by calling result.consume(). However, if the session goes out of scope, the result will be consumed automatically. This means you cannot extract further records from it. Lastly, any error will be raised and needs handling in the application code.
Managed transaction. Runs whatever unit of work you want within that function. A transaction is implicitly started and committed (unless you rollback explicitly) around the function.
If the transaction ends (end of function or rollback), the result will be consumed and become invalid. You'll have to extract all records you need before that.
This is the recommended way of using the driver because it will not raise all errors but handles some internally (where appropriate) and retries the work function (e.g. if the server is only temporarily unavailable). Since the function might be executed multiple times, you must make sure it's idempotent.
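To tie this back to the question, here is a minimal sketch (not an official recipe) of raw_query rewritten as a managed transaction, reusing the neo_connect helper from above; note that the records are extracted inside the transaction function, before the result is consumed:

def raw_query(query, **kwargs):
    # Managed transaction: the driver may retry `work`, so keep it idempotent.
    def work(tx):
        result = tx.run(query, **kwargs)
        return result.data()  # extract records before the transaction ends

    driver = neo_connect()  # cached driver from the question
    with driver.session() as session:
        return session.read_transaction(work)  # use write_transaction for writes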
Closing thoughts:
Please remember that Stack Overflow is monitored on a best-effort basis, and what can be perceived as hasty comments may get in the way of getting helpful answers to your questions.

GCP Dataflow + Apache Beam - caching question

I am new-ish to GCP, Dataflow, Apache Beam, Python, and OOP in general. I come from the land of functional javascript, for context.
Right now I have a streaming pipeline built with the Apache Beam python sdk, and I deploy it to GCP's Dataflow. The pipeline's source is a pubsub subscription, and the sink is a datastore.
The pipeline picks up a message from a pubsub subscription, makes a decision based on a configuration object + the contents of the message, and then puts it in the appropriate spot in the datastore depending on what decision it makes. This is all working presently.
Now I am in a situation where the configuration object, which is currently hardcoded, needs to be more dynamic. By that I mean: instead of just hardcoding the configuration object, we are now instead going to make an API call that will return the configuration. That way, we can update the configuration without having to redeploy the pipeline. This also works presently.
But! We are anticipating heavy traffic, so it is NOT ideal to fetch the configuration for every single message that comes in. So we are moving the fetch to the beginning, right before the actual pipeline starts. But this means we immediately lose the value in having it come from an API call, because the API call only happens one time when the pipeline starts up.
Here is what we have so far (stripped out irrelevant parts for clarity):
def run(argv=None):
    options = PipelineOptions(
        streaming=True,
        save_main_session=True
    )
    configuration = get_configuration()  # api call to fetch config

    with beam.Pipeline(options=options) as pipeline:
        # read incoming messages from pubsub
        incoming_messages = (
            pipeline
            | "Read Messages From PubSub"
            >> beam.io.ReadFromPubSub(subscription=f"our subscription here", with_attributes=True))

        # make a decision based off of the message + the config
        decision_messages = (
            incoming_messages
            | "Create Decision Messages" >> beam.FlatMap(create_decision_message, configuration)
        )
create_decision_message takes in the incoming message from the stream + the configuration file and then, you guessed it, makes a decision. It is pretty simple logic. Think "if the message is apples, and the configuration says we only care about oranges, then do nothing with the message". We need to be able to update it on the fly to say "nevermind, we care about apples too now suddenly".
I need to figure out a way to let the pipeline know it needs to re-fetch that configuration file every 15 minutes. I'm not totally sure what is the best way to do that with the tools I'm using. If it were javascript, I would do something like:
(please forgive the pseudo-code, not sure if this would actually run but you get the idea)
let fetch_time = Date.now() // initialized when app starts
let expiration = 900 // 900 seconds = 15 mins
let config = getConfigFromApi() // fetch config right when app starts

function fetchConfig(now){
    if (fetch_time + expiration < now) {
        // if fetch_time + expiration is less than the current time, we need to re-fetch the config
        config = getConfigFromApi() // assign new value to config var
        fetch_time = now // assign new value to fetch_time var
    }
    return config
}
...
const someLaterTime = Date.now() // later in the code, within the pipeline, I need to use the config object
const validConfig = fetchConfig(someLaterTime) // i pass in the current time and get back either the memory-cached config, or a just-recently-fetched config
I'm not really sure how to translate this concept to python, and I'm not really sure if I should. Is this a reasonable thing to try to pull off? Or is this type of behavior not congruent with the stack I'm using? I'm in a position where I'm the only one on my team working on this, and it is a greenfield project, so there are no examples anywhere of how it's been done in the past. I am not sure if I should try to figure this out, or if I should say "sorry bossman, we need another solution".
Any help is appreciated, no matter how small... thank you!
I think there are multiple ways to implement what you want to achieve; the most straightforward is probably through stateful processing, in which you record your config in state in a stateful DoFn and set a looping timer to refresh the record.
You can read more about stateful processing here https://beam.apache.org/blog/timely-processing/
more from the beam programming guide about the state and timer: https://beam.apache.org/documentation/programming-guide/#types-of-state.
I would imagine that you can define your processing logic, which requires the config, in a ParDo like:
import json

import apache_beam as beam
from apache_beam import DoFn, coders
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec, TimerSpec, on_timer
from apache_beam.utils.timestamp import Duration, Timestamp


class MakeDecision(beam.DoFn):
    # Note: state and timers in the Python SDK are scoped per key (and window),
    # so the input PCollection needs to be keyed before this DoFn.
    CONFIG = ReadModifyWriteStateSpec('config', coders.StrUtf8Coder())
    REFRESH_TIMER = TimerSpec('output', TimeDomain.REAL_TIME)

    def process(self,
                element,
                config=DoFn.StateParam(CONFIG),
                timer=DoFn.TimerParam(REFRESH_TIMER)):
        valid_config = {}
        if config.read():
            valid_config = json.loads(config.read())
        else:  # config is None and hasn't been fetched before.
            valid_config = fetch_config()  # your own fetch function.
            config.write(json.dumps(valid_config))
            timer.set(Timestamp.now() + Duration(seconds=900))
        # Do whatever you need with the config.
        ...

    @on_timer(REFRESH_TIMER)
    def refresh_config(self,
                       config=DoFn.StateParam(CONFIG),
                       timer=DoFn.TimerParam(REFRESH_TIMER)):
        valid_config = fetch_config()
        config.write(json.dumps(valid_config))
        timer.set(Timestamp.now() + Duration(seconds=900))
You can then process your messages with the stateful DoFn. Keep in mind that state in the Python SDK is per key, so the PubSub messages typically need to be keyed first (for example with a beam.Map(lambda msg: ('config', msg)) step) before the stateful ParDo.
with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Read Messages From PubSub"
         >> beam.io.ReadFromPubSub(subscription=f"our subscription here", with_attributes=True)
     | "Make decision" >> beam.ParDo(MakeDecision()))

Python3.8 Asyncio - Return Results from List of Dictionaries

I am really, really struggling to figure out how to use asyncio to return a bunch of results from a bunch of AWS Lambda calls. Here is my example.
My team owns a bunch of AWS accounts. For the sake of time, I want to invoke an AWS Lambda function for each account to process that account's information and return the results. I'm trying to understand how I can fire off a whole bunch of accounts quickly rather than doing it one at a time. Here is my example code.
def call_lambda(acct):
    aws_lambda = boto3.client('lambda', region_name='us-east-2')
    aws_payload = json.dumps(acct)
    response = aws_lambda.invoke(
        FunctionName='MyLambdaName',
        Payload=aws_payload,
    )
    return json.loads(response['Payload'].read())

def main():
    scan_time = datetime.datetime.utcnow()
    accounts = []
    scan_data = []
    account_data = account_parser()
    for account_info in account_data:
        account_info['scan_time'] = scan_time
    for account in account_data:
        scan_data.append(call_lambda(account))
I am struggling to figure out how to do this in an asyncio style. I originally managed to pull it off using a concurrent.futures ThreadPoolExecutor, but I ran into some performance issues. Here is what I had.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=50)
sg_data = executor.map(call_lambda, account_data)
So this worked, but not well, and I was told to use asyncio instead. I read the following articles but I am still lost as to how to make this work. I know AWS Lambda itself is asynchronous and should work fine without a coroutine.
The tl;dr is that I want to kick off call_lambda(acct) for every single dict in my list (account_data is a list of dictionaries) and then return all the results as one big list of dicts again. (This eventually gets written into a CSV; company policy issues are why it's not going into a database.)
I have read the following, still confused...
https://stackabuse.com/python-async-await-tutorial/
Lambda invocations are synchronous (RequestResponse) by default, so you have to set the InvocationType to Event. By doing this, though, you don't get your response back to collate the account info you desire. Hence the desire to use async or something similar.
response = client.invoke(
    FunctionName='string',
    InvocationType='Event'|'RequestResponse'|'DryRun',
    LogType='None'|'Tail',
    ClientContext='string',
    Payload=b'bytes'|file,
    Qualifier='string'
)
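If you want to keep RequestResponse invocations (so each call still returns the account's payload) and just fan them out concurrently, one option is to run the blocking boto3 calls in threads from asyncio. A rough sketch, reusing call_lambda and account_data from the question; the concurrency limit is an arbitrary choice:

import asyncio

async def scan_all(account_data, concurrency=50):
    loop = asyncio.get_running_loop()
    sem = asyncio.Semaphore(concurrency)  # cap concurrent Lambda invocations

    async def invoke(acct):
        async with sem:
            # boto3 is blocking, so run each call in the default thread pool.
            return await loop.run_in_executor(None, call_lambda, acct)

    return await asyncio.gather(*(invoke(acct) for acct in account_data))

# scan_data = asyncio.run(scan_all(account_data))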
I've not implemented async in Lambda, but as an alternative solution, if you can create an S3 file and have each invocation of FunctionName='MyLambdaName' just update that file, you might get what you need that way.

Optimizing speed of python code that tests results of an API

I'm trying to test a publicly available web page that takes a GET request and returns a different JSON file depending on the GET argument.
The API looks like
https://www.example.com/api/page?type=check&code=[Insert string here]
I made a program to check the results of all possible 4-letter strings on this API. My code looks something like this (with the actual URL replaced):
import time, urllib.request

for a in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    for b in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
            for d in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
                # a, b, c, d = "J", "A", "K", "E"  # hard-coding a single code like this would override the loop
                test = urllib.request.urlopen("https://www.example.com/api/page?type=check&code=" + a + b + c + d).read()
                if test != b'{"result":null}':
                    print(a + b + c + d)
                    f = open("codes", "a")
                    f.write(a + b + c + d + ",")
                    f.close()
This code is completely functional and works as expected. However, there is a problem. Because the program can't progress until it receives a response, this method is very slow. If the ping time for the API is 100ms, then it will take 100ms for each check. When I modified this code so that it tested half of the results in one instance and half in another, I noticed that the speed doubled.
Because of this, I'm led to believe that the ping time of the site is the limiting factor in this script. What I want to do is basically check each code, and then immediately check the next one without waiting for a response.
That would be the equivalent of opening up the page a few thousand times in my browser. It could load many tabs at the same time, since each page is less than a kilobyte.
I looked into using threading to do this, but I'm not sure if it's relevant or helpful.
Use a worker pool, as described here: https://docs.python.org/3.7/library/multiprocessing.html
from multiprocessing import Pool

def test_url(code):
    ''' insert code to test URL '''
    pass

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(test_url, [code1, code2, code3]))
Just be aware that the website might be rate-limiting the amount of requests you are making.
To be more specific with your example, I would split it up into two phases: (1) generate the test codes, and (2) test the URL given one test code. Once you have the list of codes generated, you can apply the above strategy, running the verifier over each generated code with a worker pool.
To generate the test codes, you can use itertools:
codes_to_test = [''.join(i) for i in itertools.product(string.ascii_uppercase, repeat=4)]  # 4-letter codes, as in the question
You have a better understanding of how to test a URL given one test code, so I assume you can write a function test_url(test_code) that will make the appropriate URL request and verify the result as necessary. Then you can call:
with Pool(5) as p:
    print(p.map(test_url, codes_to_test))
On top of this, I would suggest two things: (1) make sure codes_to_test is not enormous at first (for example, take a sublist of the generated codes) so you can check that your code is working correctly, and (2) play with the size of the worker pool so you don't overwhelm your machine or the API.
Alternatively you can use asyncio (https://docs.python.org/3/library/asyncio.html) to keep everything in a single process.
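For reference, a minimal asyncio sketch using the third-party aiohttp package (assuming it is installed; the endpoint and the null-result check mirror the question, and the concurrency limit is arbitrary):

import asyncio
import aiohttp  # third-party HTTP client: pip install aiohttp

BASE_URL = "https://www.example.com/api/page?type=check&code="

async def check_code(session, code):
    # Return the code if the API reports a match, otherwise None.
    async with session.get(BASE_URL + code) as resp:
        text = await resp.text()
        return code if text != '{"result":null}' else None

async def check_all(codes, concurrency=50):
    sem = asyncio.Semaphore(concurrency)  # don't overwhelm the API
    async with aiohttp.ClientSession() as session:
        async def bounded(code):
            async with sem:
                return await check_code(session, code)
        results = await asyncio.gather(*(bounded(c) for c in codes))
    return [c for c in results if c is not None]

# hits = asyncio.run(check_all(codes_to_test))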

Python Mock Process for Unit Testing

Background:
I am currently writing a process monitoring tool (Windows and Linux) in Python and implementing unit test coverage. The process monitor hooks into the Windows API function EnumProcesses on Windows and monitors the /proc directory on Linux to find current processes. The process names and process IDs are then written to a log which is accessible to the unit tests.
Question:
When I unit test the monitoring behavior I need a process to start and terminate. I would love if there would be a (cross-platform?) way to start and terminate a fake system process that I could uniquely name (and track its creation in a unit test).
Initial ideas:
I could use subprocess.Popen() to open any system process but this runs into some issues. The unit tests could falsely pass if the process I'm using to test is run by the system as well. Also, the unit tests are run from the command line and any Linux process I can think of suspends the terminal (nano, etc.).
I could start a process and track it by its process ID but I'm not exactly sure how to do this without suspending the terminal.
These are just thoughts and observations from initial testing and I would love it if someone could prove me wrong on either of these points.
I am using Python 2.6.6.
Edit:
Get all Linux process IDs:
try:
    processDirectories = os.listdir(self.PROCESS_DIRECTORY)
except IOError:
    return []
return [pid for pid in processDirectories if pid.isdigit()]
Get all Windows process IDs:
import ctypes, ctypes.wintypes

Psapi = ctypes.WinDLL('Psapi.dll')
EnumProcesses = self.Psapi.EnumProcesses
EnumProcesses.restype = ctypes.wintypes.BOOL

count = 50
while True:
    # Build arguments to EnumProcesses
    processIds = (ctypes.wintypes.DWORD * count)()
    size = ctypes.sizeof(processIds)
    bytes_returned = ctypes.wintypes.DWORD()
    # Call EnumProcesses to find all processes
    if self.EnumProcesses(ctypes.byref(processIds), size, ctypes.byref(bytes_returned)):
        if bytes_returned.value < size:
            return processIds
        else:
            # We weren't able to get all the processes so double our size and try again
            count *= 2
    else:
        print "EnumProcesses failed"
        sys.exit()
Windows code is from here
edit: this answer is getting long :), but some of my original answer still applies, so I leave it in :)
Your code is not so different from my original answer. Some of my ideas still apply.
When you are writing Unit Test, you want to only test your logic. When you use code that interacts with the operating system, you usually want to mock that part out. The reason being that you don't have much control over the output of those libraries, as you found out. So it's easier to mock those calls.
In this case, there are two libraries that are interacting with the system: os.listdir and EnumProcesses. Since you didn't write them, we can easily fake them to return what we need, which in this case is a list.
But wait, in your comment you mentioned:
"The issue I'm having with it however is that it really doesn't test
that my code is seeing new processes on the system but rather that the
code is correctly monitoring new items in a list."
The thing is, we don't need to test the code that actually monitors the processes on the system, because it's third-party code. What we need to test is that your code logic handles the returned processes, because that's the code you wrote. The reason why we are testing over a list is that that's what your logic is doing. os.listdir and EnumProcesses return a list of pids (numeric strings and integers, respectively) and your code acts on that list.
I'm assuming your code is inside a Class (you are using self in your code). I'm also assuming that they are isolated inside their own methods (you are using return). So this will be sort of what I suggested originally, except with actual code :) Idk if they are in the same class or different classes, but it doesn't really matter.
Linux method
Now, testing your Linux process function is not that difficult. You can patch os.listdir to return a list of pids.
def getLinuxProcess(self):
    try:
        processDirectories = os.listdir(self.PROCESS_DIRECTORY)
    except IOError:
        return []
    return [pid for pid in processDirectories if pid.isdigit()]
Now for the test.
import unittest
from fudge import patched_context
import os
import LinuxProcessClass  # class that contains getLinuxProcess method

def _raise_ioerror(*args):
    # Helper: a lambda cannot contain a raise statement.
    raise IOError

def test_LinuxProcess(self):
    """Test the logic of our getLinuxProcess.

    We patch os.listdir and return our own list, because os.listdir
    returns a list. We do this so that we can control the output
    (we test *our* logic, not a built-in library's functionality).
    """
    # Test we can parse our pids
    fakeProcessIds = ['1', '2', '3']
    with patched_context(os, 'listdir', lambda x: fakeProcessIds):
        myClass = LinuxProcessClass()
        ....
        result = myClass.getLinuxProcess()

        expected = ['1', '2', '3']
        self.assertEqual(result, expected)

    # Test we can handle IOError
    with patched_context(os, 'listdir', _raise_ioerror):
        myClass = LinuxProcessClass()
        ....
        result = myClass.getLinuxProcess()

        expected = []
        self.assertEqual(result, expected)

    # Test we only get pids
    fakeProcessIds = ['1', '2', '3', 'do', 'not', 'parse']
    .....
Windows method
Testing your Window's method is a little trickier. What I would do is the following:
def prepareWindowsObjects(self):
    """Create and set up objects needed to get the Windows processes."""
    ...
    Psapi = ctypes.WinDLL('Psapi.dll')
    EnumProcesses = Psapi.EnumProcesses
    EnumProcesses.restype = ctypes.wintypes.BOOL
    self.EnumProcesses = EnumProcesses
    ...

def getWindowsProcess(self):
    count = 50
    while True:
        ....  # Build arguments to EnumProcesses and call EnumProcesses
        if self.EnumProcesses(ctypes.byref(processIds), ...
            ..
        else:
            return []
I separated the code into two methods to make it easier to read (I believe you are already doing this). Here is the tricky part: EnumProcesses is using pointers, and they are not easy to play with. Another thing is that I don't know how to work with pointers in Python, so I couldn't tell you of an easy way to mock that out =P
What I can tell you is to simply not test it. Your logic there is very minimal. Besides increasing the size of count, everything else in that function is creating the space EnumProcesses pointers will use. Maybe you can add a limit to the count size but other than that, this method is short and sweet. It returns the windows processes and nothing more. Just what I was asking for in my original comment :)
So leave that method alone. Don't test it. Make sure, though, that anything that uses getWindowsProcess and getLinuxProcess gets mocked out as per my original suggestion.
Hopefully this makes more sense :) If it doesn't let me know and maybe we can have a chat session or do a video call or something.
original answer
I'm not exactly sure how to do what you are asking, but whenever I need to test code that depends on some outside force (external libraries, popen or in this case processes) I mock out those parts.
Now, I don't know how your code is structured, but maybe you can do something like this:
def getWindowsProcesses(self, ...):
    '''Call the Windows API function EnumProcesses and
    return the list of processes.
    '''
    # ... call EnumProcesses ...
    return listOfProcesses

def getLinuxProcesses(self, ...):
    '''Look in the /proc dir and return the list of processes.'''
    # ... look in /proc ...
    return listOfProcesses
These two methods only do one thing, get the list of processes. For Windows, it might just be a call to that API and for Linux just reading the /proc dir. That's all, nothing more. The logic for handling the processes will go somewhere else. This makes these methods extremely easy to mock out since their implementations are just API calls that return a list.
Your code can then easy call them:
def getProcesses(...):
    '''Get the processes running.'''
    isLinux = # ... logic for determining OS ...
    if isLinux:
        processes = getLinuxProcesses(...)
    else:
        processes = getWindowsProcesses(...)
    # ... do something with processes, write to log file, etc ...
In your test, you can then use a mocking library such as Fudge. You mock out these two methods to return what you expect them to return.
This way you'll be testing your logic since you can control what the result will be.
from fudge import patched_context
...

def test_getProcesses(self, ...):
    monitor = MonitorTool(..)

    # Patch the method that gets the processes. Whenever it gets called, return
    # our predetermined list.
    originalProcesses = [....pids...]
    with patched_context(monitor, "getLinuxProcesses", lambda x: originalProcesses):
        monitor.getProcesses()
        # ... assert logic is right ...

    # Let's "add" some new processes and test that our logic realizes new
    # processes were added.
    newProcesses = [...]
    updatedProcesses = originalProcesses + newProcesses
    with patched_context(monitor, "getLinuxProcesses", lambda x: updatedProcesses):
        monitor.getProcesses()
        # ... assert logic caught new processes ...

    # Let's "kill" our new processes and test that our logic can handle it
    with patched_context(monitor, "getLinuxProcesses", lambda x: originalProcesses):
        monitor.getProcesses()
        # ... assert logic caught processes were 'killed' ...
Keep in mind that if you test your code this way, you won't get 100% code coverage (since your mocked methods won't be run), but this is fine. You're testing your code and not third party's, which is what matters.
Hopefully this might be able to help you. I know it doesn't answer your question, but maybe you can use this to figure out the best way to test your code.
Your original idea of using subprocess is a good one. Just create your own executable and name it something that identifies it as a testing thing. Maybe make it do something like sleep for a while.
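A minimal sketch of that idea, using the Python interpreter itself as the fake process so nothing ties up the terminal; the sleep length and the marker argument are arbitrary:

import sys
import subprocess

# Spawn a python process that just sleeps; the extra argument acts as a unique
# marker you can look for when the monitor reports process names.
fake = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(30)", "my-unittest-marker"])

# ... run the monitoring code and assert that fake.pid shows up ...

fake.terminate()
fake.wait()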
Alternately, you could actually use the multiprocessing module. I've not used python in windows much, but you should be able to get process identifying data out of the Process object you create:
p = multiprocessing.Process(target=time.sleep, args=(30,))
p.start()
pid = p.pid
