Run some tasks in parallel and others sequentially using Python

I am new to Python. I have a requirement to run a single function in parallel for some processes, while some processes depend on others.
My dataset looks like this:
[
    {'process_id': 1, 'dependency_id': 0},
    {'process_id': 2, 'dependency_id': 0},
    {'process_id': 3, 'dependency_id': 0},
    {'process_id': 4, 'dependency_id': 2},
    {'process_id': 5, 'dependency_id': 2},
    {'process_id': 6, 'dependency_id': 1}
]
Here,
process_id 1, 2 and 3 should run in parallel, because their dependency_id is 0.
process_id 4 has dependency_id 2, so it should wait until process 2 is done.
process_id 5 also has dependency_id 2, so it should also wait until process 2 is done. Processes 4 and 5 should then run in parallel after 2.
process_id 6 has dependency_id 1, so it should wait until process 1 is done.
I have tried the following:
from multiprocessing import Process
from time import sleep

process = []
process_list = [
    {'process_id': 1, 'dependency_id': 0},
    {'process_id': 2, 'dependency_id': 0},
    {'process_id': 3, 'dependency_id': 0},
    {'process_id': 4, 'dependency_id': 2},
    {'process_id': 5, 'dependency_id': 2},
    {'process_id': 6, 'dependency_id': 1}
]

def process_fn(process_id):
    print(process_id)

def main():
    for item in process_list:
        p = Process(target=process_fn, args=(item['process_id'],))
        process.append(p)
        p.start()
But I don't know how to include the dependency logic with multiprocessing. Can anyone please help me with a solution?
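One possible approach (a sketch, not a definitive solution): group the processes by their dependency_id, start each group in parallel and join it, and only launch a group once the process it depends on has finished. Note that this joins a whole group before starting its dependents, which is slightly stricter than the stated requirements but still satisfies them.

from multiprocessing import Process

process_list = [
    {'process_id': 1, 'dependency_id': 0},
    {'process_id': 2, 'dependency_id': 0},
    {'process_id': 3, 'dependency_id': 0},
    {'process_id': 4, 'dependency_id': 2},
    {'process_id': 5, 'dependency_id': 2},
    {'process_id': 6, 'dependency_id': 1},
]

def process_fn(process_id):
    print(process_id)

def run_group(process_ids, by_dependency):
    # Start every process in this group in parallel, then wait for all of them.
    procs = [Process(target=process_fn, args=(pid,)) for pid in process_ids]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Once a process has finished, start the processes that were waiting on it.
    for pid in process_ids:
        if pid in by_dependency:
            run_group(by_dependency[pid], by_dependency)

def main():
    # Map each dependency_id to the list of processes that wait for it.
    by_dependency = {}
    for item in process_list:
        by_dependency.setdefault(item['dependency_id'], []).append(item['process_id'])
    # dependency_id 0 means "no dependency", so that group runs first.
    run_group(by_dependency.get(0, []), by_dependency)

if __name__ == '__main__':
    main()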


Django Choice Queries with annotate

I am stuck on a Django annotate query.
Here is the model I have been working with:
class Claim(models.Model):
    CLAIMS_STATUS_CHOICES = (
        (2, 'PROCESSING'),
        (1, 'ACCEPTED'),
        (0, 'REJECTED'),
    )
    status = models.CharField(max_length=10, choices=CLAIMS_STATUS_CHOICES)
The problem is that I don't want to count the processing choice; I just want the individual status counts for accepted and rejected.
Here is what I tried:
claim = Claim.objects.filter(Q(status=1) | Q(status=0))
total_status_count = claim.count()
status_counts = Claim.objects.filter(Q(status=1) |Q(status=0)).annotate(count=Count('status')).values('count', 'status')
but I am getting duplicate accepted and rejected entries.
This is the output I got:
[
    {
        "total_claims": 3,
        "status_count": [
            {"status": "1", "count": 1},
            {"status": "0", "count": 1},
            {"status": "1", "count": 1}
        ]
    }
]
This is what I want:
[
    {
        "total_claims": 3,
        "status_count": [
            {"status": "1", "count": 2},
            {"status": "0", "count": 1}
        ]
    }
]
Any help regarding this?
Call .values('status') before .annotate() so the counts are grouped by status, and exclude the processing choice:
Claim.objects.exclude(status=2).values('status').annotate(count=Count('status'))
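For example, to build the response shape from the question with that query (a sketch; Claim and the integer status values are taken from the model above):

from django.db.models import Count

status_counts = (
    Claim.objects.exclude(status=2)       # drop PROCESSING
    .values('status')                     # group by status
    .annotate(count=Count('status'))      # count per group
)
response_data = {
    'total_claims': sum(row['count'] for row in status_counts),
    'status_count': list(status_counts),
}
# e.g. {'total_claims': 3, 'status_count': [{'status': '1', 'count': 2}, {'status': '0', 'count': 1}]}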
I have done a similar task in my project, where I had to count the total completed and in-progress projects. In my case, every project has a status.
In the project model, status is a choice field with the choices uploaded, inprogress, inreview, and completed.
I'm posting the query here in case someone finds it helpful. In an API view, write the following query:
from django.db.models import Count, Q

project_counts = Project.objects.filter(user=self.request.user).aggregate(
    total=Count('id'),
    inprogress=Count('id', filter=Q(status=Project.ProjectStatus.IN_PROGRESS)),
    completed=Count('id', filter=Q(status=Project.ProjectStatus.COMPLETED)))
and then return the response like this:
return response.Response({
    'total': project_counts['total'] or 0,
    'inprogress': project_counts['inprogress'] or 0,
    'completed': project_counts['completed'] or 0
})

Show commit size for each commit based on lines of code while finding coding days

I am trying to find the coding days (active days) along with the number of commits and each commit's size per coding day. This is what I mean:
{
    "2014-05-2": {
        "commit_count": "1",
        "commit": [{'commit_hash': {'lines_added': 10, 'lines_removed': 4}}]
    },
    "2014-05-3": {
        "commit_count": "2",
        "commit": [
            {'commit_hash': {'lines_added': 10, 'lines_removed': 4}},
            {'commit_hash': {'lines_added': 14, 'lines_removed': 0}},
        ]
    }
}
For now, I can only find the coding days and the number of commits, like this:
async def get_coding_days(self, author, before, after):
    cmd = (
        f'git log {self.conf.get("branch", "master")} --author="{author}" --date=short'
    )
    cmd += ' --pretty=format:"%ad %an"'
    if before:
        cmd += f' --before="{before}"'
    if after:
        cmd += f' --after="{after}"'
    results = get_proc_out([cmd, "sort", "uniq -c"]).splitlines()
    np_results = np.array(results)
    stripped_array = np.char.strip(np_results)
    for result in stripped_array:
        second_space_pos = result.find(" ", 2)
        if second_space_pos > 2:
            count_with_time = result[0 : second_space_pos - 1]
            [commit_count, coding_day] = count_with_time.split(" ")
            author = result[second_space_pos + 1 :]
            # default_dict[author].append(
            #     {"commit_count": commit_count, "coding_day": coding_day}
            # )
            if author not in self.coding_days:
                self.coding_days[author] = []
            self.coding_days[author].append(
                {coding_day: {"commit_count": commit_count}}
            )
    return self.coding_days
How can I show the commit size for each commit?
Use --shortstat to get the stats for each commit:
git log --date=short --after="e5e16518" --pretty="%h %an %ad" --shortstat
to get
cb807c0d clmno 2021-06-15
4 files changed, 71 insertions(+), 15 deletions(-)
2f8de5fb clmno 2021-06-15
4 files changed, 76 insertions(+), 23 deletions(-)
92a0e1e3 clmno 2021-06-15
2 files changed, 9 insertions(+)
Now you can simply parse these to generate the JSON you need.
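For example, here is a sketch (not the original code) that parses that --shortstat output into the structure from the question. It shells out with subprocess instead of get_proc_out, and assumes the git output is in the default English locale:

import subprocess
from collections import defaultdict

def get_commit_sizes(branch='master', author=None):
    # One header line per commit ('<hash> <author> <date>') followed by a --shortstat line.
    cmd = ['git', 'log', branch, '--date=short', '--pretty=format:%h %an %ad', '--shortstat']
    if author:
        cmd.append(f'--author={author}')
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    days = defaultdict(lambda: {'commit_count': 0, 'commit': []})
    current = None
    for line in out.splitlines():
        line = line.strip()
        if not line:
            continue
        if 'file changed' in line or 'files changed' in line:
            # e.g. '4 files changed, 71 insertions(+), 15 deletions(-)'
            added = removed = 0
            for part in line.split(', '):
                if 'insertion' in part:
                    added = int(part.split()[0])
                elif 'deletion' in part:
                    removed = int(part.split()[0])
            commit_hash, date = current
            days[date]['commit'].append({commit_hash: {'lines_added': added, 'lines_removed': removed}})
        else:
            tokens = line.split()
            commit_hash, date = tokens[0], tokens[-1]
            current = (commit_hash, date)
            days[date]['commit_count'] += 1
    return dict(days)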

Get array from file which also includes strings

I'm writing an automation script in Python that makes use of another library. The output I'm given contains the array I need; however, it also includes irrelevant log messages as plain strings.
For my script to work, I need to retrieve only the array from the file.
Here's an example of the output I'm getting:
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
    {
        "action": {
            "type": "block"
        },
        "trigger": {
            "url-filter": "/adservice\\.",
            "unless-domain": [
                "adservice.io"
            ]
        }
    }
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
How would I get only the array from this file?
FWIW, I'd rather not base the logic on the strings outside the array, as they could be subject to change.
UPDATE: Script I'm getting the data from is here: https://github.com/brave/ab2cb/tree/master/ab2cb
My full code is here:
def pipe_in(process, filter_lists):
    try:
        for body, _, _ in filter_lists:
            process.stdin.write(body)
    finally:
        process.stdin.close()

def write_block_lists(filter_lists, path, expires):
    block_list = generate_metadata(filter_lists, expires)
    process = subprocess.Popen(('ab2cb'),
                               cwd=ab2cb_dirpath,
                               stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    threading.Thread(target=pipe_in, args=(process, filter_lists)).start()
    result = process.stdout.read()
    with open('output.json', 'w') as destination_file:
        destination_file.write(result)
        destination_file.close()
    if process.wait():
        raise Exception('ab2cb returned %s' % process.returncode)
Ideally the output will be modified in stdout and written to the file later, as I still need to modify the data within the array mentioned above.
You can use a regex too:
import re
input = """
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
asd
asd
"""
regex = re.compile(r"\[(.|\n)*(?:^\]$)", re.M)
x = re.search(regex, input)
print(x.group(0))
EDIT
re.M turns on multiline matching, so ^ and $ match at the start and end of each line (which the ^\]$ part of the pattern relies on).
https://repl.it/repls/InfantileDopeyLink
I have written a library for this purpose. It's not often that I get to plug it!
from jsonfinder import jsonfinder
logs = r"""
Split /adclix.$~image into 2 rules
Split /mediahosting.engine$document,script into 2 rules
[
{
"action": {
"type": "block"
},
"trigger": {
"url-filter": "/adservice\\.",
"unless-domain": [
"adservice.io"
]
}
}
]
Generated a total of 1 rules (1 blocks, 0 exceptions)
Something else that looks like JSON: [1, 2]
"""
for start, end, obj in jsonfinder(logs):
    if (
        obj
        and isinstance(obj, list)
        and isinstance(obj[0], dict)
        and {"action", "trigger"} <= obj[0].keys()
    ):
        print(obj)
Demo: https://repl.it/repls/ImperfectJuniorBootstrapping
Library: https://github.com/alexmojaki/jsonfinder
Install with pip install jsonfinder.
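If you'd rather avoid a regex or an extra dependency, a standard-library sketch using json.JSONDecoder.raw_decode can also pull the embedded JSON out of mixed log text (the function name here is just for illustration):

import json

def extract_json_arrays(text):
    # Yield every JSON array embedded in the text, skipping the surrounding log lines.
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find('[', pos)
        if start == -1:
            return
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1
            continue
        yield obj
        pos = end

# Usage: for rules in extract_json_arrays(logs): print(rules)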

Parsing dictionary and grouping output with Python

Let's say I have a dictionary:
dictionary = {
    'host_type': {
        'public_ip': ['ip_address', 'ip_address', 'ip_address'],
        'private_dns': ['dns_name', 'dns_name', 'dns_name']
    }
}
There are several host types; let's say there are 3: master, slave and backup.
The dictionary can contain a different number of hosts for each host type. For example, for 2 masters, 6 slaves and 2 backups the dictionary would look like this:
dictionary = {
    'master': {
        'public_ip': ['ip_address', 'ip_address'],
        'private_dns': ['dns_name', 'dns_name']
    },
    'slave': {
        'public_ip': ['ip_address', 'ip_address', 'ip_address', 'ip_address', 'ip_address', 'ip_address'],
        'private_dns': ['dns_name', 'dns_name', 'dns_name', 'dns_name', 'dns_name', 'dns_name']
    },
    'backup': {
        'public_ip': ['ip_address', 'ip_address'],
        'private_dns': ['dns_name', 'dns_name']
    }
}
Now I want to parse the dictionary and group the hosts in such a way that I always have 1 master, 1 backup and 3 slaves. How can I parse such a dictionary to achieve an effect like this:
master,public_ip,private_dns
backup,public_ip,private_dns
slave,public_ip,private_dns
slave,public_ip,private_dns
slave,public_ip,private_dns
master,public_ip,private_dns
backup,public_ip,private_dns
slave,public_ip,private_dns
slave,public_ip,private_dns
slave,public_ip,private_dns
d = {
    'master': {
        'public_ip': ['ip_address0M', 'ip_address1M'],
        'private_dns': ['dns_name', 'dns_name']
    },
    'slave': {
        'public_ip': ['ip_address0s', 'ip_address1s', 'ip_address2s', 'ip_address3s', 'ip_address4s', 'ip_address5s'],
        'private_dns': ['dns_name', 'dns_name', 'dns_name', 'dns_name', 'dns_name', 'dns_name']
    },
    'backup': {
        'public_ip': ['ip_address0b', 'ip_address1b'],
        'private_dns': ['dns_name', 'dns_name']
    }
}

masterCount = 0
slaveCount = 0
backupCount = 0
result = list()
while (masterCount + 1 <= len(d['master']['public_ip'])
       and slaveCount + 3 <= len(d['slave']['public_ip'])
       and backupCount + 1 <= len(d['backup']['public_ip'])):
    tempList = [d['master']['public_ip'][masterCount],
                d['slave']['public_ip'][slaveCount:slaveCount + 3],
                d['backup']['public_ip'][backupCount]]
    result.append(tempList)
    masterCount += 1
    slaveCount += 3
    backupCount += 1
print(result)
Now result is of the format:
result[index][0] is the master IP
result[index][1] is the list of 3 slave IPs
result[index][2] is the backup IP
[EDIT]
You can do something similar to add the private DNS names. I have not added it, as you mentioned you only wanted the general direction.
Output:
[['ip_address0M', ['ip_address0s', 'ip_address1s', 'ip_address2s'], 'ip_address0b'], ['ip_address1M', ['ip_address3s', 'ip_address4s', 'ip_address5s'], 'ip_address1b']]
m1 = d['master']['public_ip']
m2 = d['master']['private_dns']
b1 = d['backup']['public_ip']
b2 = d['backup']['private_dns']
s1 = d['slave']['public_ip']
s2 = d['slave']['private_dns']
zip(zip(m1, m2), zip(b1, b2), zip(*[iter(zip(s1, s2))] * 3))
There's probably a better solution for resolving all the lists from the dictionary, but this should work.
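For example, to print the rows in the master,public_ip,private_dns format from the question, building on the same d dictionary (a sketch):

masters = zip(d['master']['public_ip'], d['master']['private_dns'])
backups = zip(d['backup']['public_ip'], d['backup']['private_dns'])
# Chunk the slaves into groups of 3 (ip, dns) pairs.
slaves = zip(*[iter(zip(d['slave']['public_ip'], d['slave']['private_dns']))] * 3)

for master, backup, slave_group in zip(masters, backups, slaves):
    print('master', *master, sep=',')
    print('backup', *backup, sep=',')
    for slave in slave_group:
        print('slave', *slave, sep=',')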

Celery Worker sleep not working correctly

I have the following problem: I'm using a process in Python that must wait X number of seconds. The process by itself works correctly; the problem appears when I run it as a Celery task.
When the worker tries to do the time.sleep(X) in one task, it pauses all the other tasks in that worker. For example:
I have worker A, which can do 4 tasks at the same time (q, w, e and r). Task r has a sleep of 1800 seconds. The worker starts the 4 tasks at the same time, but when task r hits the sleep, the worker stops q, w and e too.
Is this normal? Do you know how I can solve this problem?
EDIT:
This is an example of my celery.py with the beat schedule and queues:
app.conf.update(
    CELERY_DEFAULT_QUEUE='default',
    CELERY_QUEUES=(
        Queue('search', routing_key='search.#'),
        Queue('tests', routing_key='tests.#'),
        Queue('default', routing_key='tasks.#'),
    ),
    CELERY_DEFAULT_EXCHANGE='tasks',
    CELERY_DEFAULT_EXCHANGE_TYPE='topic',
    CELERY_DEFAULT_ROUTING_KEY='tasks.default',
    CELERY_TASK_RESULT_EXPIRES=10,
    CELERYD_TASK_SOFT_TIME_LIMIT=1800,
    CELERY_ROUTES={
        'tests.tasks.volume': {
            'queue': 'tests',
            'routing_key': 'tests.volume',
        },
        'tests.tasks.summary': {
            'queue': 'tests',
            'routing_key': 'tests.summary',
        },
        'search.tasks.links': {
            'queue': 'search',
            'routing_key': 'search.links',
        },
        'search.tasks.urls': {
            'queue': 'search',
            'routing_key': 'search.urls',
        },
    },
    CELERYBEAT_SCHEDULE={
        # heavy one
        'each-hour-summary': {
            'task': 'tests.tasks.summary',
            'schedule': crontab(minute='0', hour='*/1'),
            'args': (),
        },
        'each-hour-volume': {
            'task': 'tests.tasks.volume',
            'schedule': crontab(minute='0', hour='*/1'),
            'args': (),
        },
        'links-each-cuarter': {
            'task': 'search.tasks.links',
            'schedule': crontab(minute='*/15'),
            'args': (),
        },
        'urls-each-ten': {
            'schedule': crontab(minute='*/10'),
            'task': 'search.tasks.urls',
            'args': (),
        },
    }
)
tests/tasks.py:
@app.task
def summary():
    execute_summary()  # heavy task, ~1 hour approx

@app.task
def volume():
    execute_volume()  # not important, less than 5 minutes
and search/tasks.py:
@app.task
def links():
    free = search_links()  # returns a boolean
    if free:
        process_links()
    else:
        time.sleep(1080)  # <-------- the sleep I have problems with
        process_links()

@app.task
def urls():
    execute_urls()  # not important, less than 1 minute
Well, I have 2 workers: A for the search queue and B for tests and default.
The problem is with A: when it takes the "links" task and executes the time.sleep(), it stops the other tasks that the worker is doing.
Because worker B works correctly, I think the problem is the time.sleep() call.
If you only have one process/thread, a call to sleep() will block it. This means that no other task will run...
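If the sleep is really just "try again later", a common non-blocking alternative (a sketch using Celery's standard retry mechanism, not the original code) is to let the task re-queue itself instead of sleeping; note that it re-checks search_links() on each retry:

@app.task(bind=True, max_retries=None)
def links(self):
    free = search_links()
    if free:
        process_links()
    else:
        # Re-queue this task to run again in 1080 seconds instead of
        # blocking the worker process with time.sleep().
        raise self.retry(countdown=1080)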
You set CELERYD_TASK_SOFT_TIME_LIMIT=1800 but your sleep is 1080.
Only one or two tasks can run within this time interval.
Set CELERYD_TASK_SOFT_TIME_LIMIT > (1080 + (work time)) * 3.
Set a higher --concurrency (> 4) when starting the celery worker.
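For example, to run the search worker with more concurrent processes (a sketch; the proj application name is a placeholder):
celery -A proj worker -Q search --concurrency=8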
