I'm trying to make a pool that writes data to a file.
def get_and_print_something(url):
    with open('file.txt', 'a') as f:
        f.write(get_line(url))

pool = Pool(50)
for url in urls:
    pool.apply_async(get_and_print_something, args=(url,))
The problem is that it sometimes writes corrupted data, because two workers manipulate the same file at the same time. Is it possible to make a worker wait until the file can be modified?
Example of the txt:
This is a correct line.
This is a correct line.
orrect line.
This is a correct line.
...
You can take the example from e.g. this site:
http://effbot.org/zone/thread-synchronization.htm#locks, or
https://pymotw.com/2/threading/
which basically boils down to:
import threading

lock = threading.Lock()

def get_and_print_something(url):
    # Not yet in critical section because we want this to happen concurrently:
    line = get_line(url)

    lock.acquire()  # Will wait if necessary until any other thread has finished its file access.
    # In critical section now. Only one thread may run this at any one time.
    try:
        with open('file.txt', 'a') as f:
            f.write(line)
    finally:
        lock.release()  # Release lock, so that other threads can access the file again.
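Since a Lock is also a context manager, the same critical section can be written more compactly with a with statement. A minimal sketch (get_line is the helper assumed by the question):

import threading

lock = threading.Lock()

def get_and_print_something(url):
    line = get_line(url)   # get_line comes from the question; assumed to exist
    with lock:             # acquires the lock, and releases it even if write() raises
        with open('file.txt', 'a') as f:
            f.write(line)

Note that this assumes the workers are threads (e.g. a multiprocessing.dummy.Pool or multiprocessing.pool.ThreadPool); a plain threading.Lock is not shared between separate worker processes.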
So this is the first time I am playing around with threading, so please bear with me here. In my main application (which I will implement this into), I need to add multithreading to my script. The script will read account info from a text file, then log in and do some tasks with that account. I need to make sure that threads aren't reading the same line from the accounts text file, since that would screw everything up, and I'm not quite sure how to do that.
from multiprocessing import Queue, Process
from threading import Thread
from time import sleep

urls_queue = Queue()
max_process = 10

def dostuff():
    with open('acc.txt', 'r') as accounts:
        for account in accounts:
            account.strip()
            split = account.split(":")
            a = {
                'user': split[0],
                'pass': split[1],
                'name': split[2].replace('\n', ''),
            }
            sleep(1)
            print(a)
    for i in range(max_process):
        urls_queue.put("DONE")

def doshit_processor():
    while True:
        url = urls_queue.get()
        if url == "DONE":
            break

def main():
    file_reader_thread = Thread(target=dostuff)
    file_reader_thread.start()
    procs = []
    for i in range(max_process):
        p = Process(target=doshit_processor)
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
    print('all done')
    # wait for all tasks in the queue
    file_reader_thread.join()

if __name__ == '__main__':
    main()
So at the moment I don't think the threading is even working, because it's printing one account per second even with 10 threads. It should be printing 10 accounts per second, which it isn't, and that has me confused. Also, I am not sure how to make sure that threads won't pick the same account line. Help by a big brain is much appreciated.
The problem is that you create a single thread to generate the data for your processes but then don't post that data to the queue. You sleep in that single thread so you see one item generated per second and then... nothing because the item isn't queued. It seems that all you are doing is creating a process pool and the inbuilt multiprocessing.Pool should work for you.
I've set pool "chunk size" low so that workers are only given 1 work item at a time. This is good for workflows where processing time can vary for each work item. By default, pool tries to optimize for the case where processing time is roughly equivalent and instead tries to reduce interprocess communication time.
Your data looks like a colon-separated file and you can use csv to cut down the processing there too. This smaller script should do what you want.
import multiprocessing as mp
from time import sleep
import csv

max_process = 10

def doshit_processor(row):
    sleep(1)  # if you want to simulate work
    print(row)

def main():
    with open('acc.txt', newline='') as accounts:
        table = list(csv.DictReader(accounts, fieldnames=('user', 'pass', 'name'),
                                    delimiter=':'))
    with mp.Pool(max_process) as pool:
        pool.map(doshit_processor, table, chunksize=1)
    print('all done')

if __name__ == '__main__':
    main()
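For reference, acc.txt is assumed to contain one colon-separated account per line, matching the fieldnames above; a made-up example:

alice:secret1:Alice Smith
bob:secret2:Bob Jones

Each worker then receives one row as a dict, e.g. {'user': 'alice', 'pass': 'secret1', 'name': 'Alice Smith'}.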
I have the below code that shows how a queue would always be cleared even with multiple threads adding to the queue. It's using recursion, but a while loop could work as well. Is this bad practice, or is there a scenario where the queue might hold an item that won't get pulled until something else gets added to the queue?
The primary purpose of this is to have a queue that ensures order of execution without the need to continually poll or block with q.get()
import queue
import threading

lock = threading.RLock()
q = queue.Queue()

def execute():
    with lock:
        if not q.empty():
            text = q.get()
            print(text)
            execute()

def add_to_queue(text):
    q.put(text)
    execute()

# Assume multiple threads can call add_to_queue
add_to_queue("Hello")
This is one solution that uses a timeout on the .get function: one thread pushes to the queue and one reads from it. You could have multiple readers and writers.
import queue
import threading

q = queue.Queue()

def read():
    try:
        while True:
            text = q.get(timeout=1)
            print(text)
    except queue.Empty:
        print("exiting")

def write():
    q.put("Hello")
    q.put("There")
    q.put("My")
    q.put("Friend")

writer = threading.Thread(target=write)
reader = threading.Thread(target=read)
writer.start()
reader.start()

reader.join()
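You could also have several readers and writers sharing the same queue; a minimal sketch of that variant (the thread counts and messages are made up for illustration):

import queue
import threading

q = queue.Queue()

def read(name):
    try:
        while True:
            text = q.get(timeout=1)   # queue.Empty is raised after 1s with no data
            print(name, "got", text)
    except queue.Empty:
        print(name, "exiting")

def write(messages):
    for m in messages:
        q.put(m)

writers = [threading.Thread(target=write, args=(["Hello", "There"],)),
           threading.Thread(target=write, args=(["My", "Friend"],))]
readers = [threading.Thread(target=read, args=("reader-%d" % i,)) for i in range(3)]

for t in writers + readers:
    t.start()
for t in writers + readers:
    t.join()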
import threading
from queue import Queue

print_lock = threading.Lock()

def job(worker):
    with print_lock:
        with open('messages.txt') as f:
            for line in f:
                print(line)

def reader():
    while True:
        worker = q.get()
        job(worker)
        q.task_done()

q = Queue()

for x in range(10):
    t = threading.Thread(target=reader)
    t.daemon = True
    t.start()

for worker in range(1):
    q.put(worker)

q.join()
So what I want is for each thread to read a different message.
Queue is thread safe, so the threading lock is not needed.
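A minimal sketch of that idea, assuming the lines of messages.txt are queued once and every worker thread picks up a different line (the file name is taken from the question):

import threading
from queue import Queue

q = Queue()

def reader():
    while True:
        line = q.get()       # each get() returns a different line; no lock needed
        print(threading.current_thread().name, line.strip())
        q.task_done()

for _ in range(10):
    threading.Thread(target=reader, daemon=True).start()

with open('messages.txt') as f:
    for line in f:           # queue the lines once, instead of opening the file in every thread
        q.put(line)

q.join()                     # wait until every queued line has been processed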
You're trying to learn too many things in the same code snippet: 1) multi-threading, 2) the queue data structure, 3) thread synchronization mechanisms, 4) locking, etc.
Let me answer regarding multi-threading only.
In your case, every thread reads all the messages because the target function "job" opens the file and reads all of its data, and every thread calls that target function.
Let me simplify things a bit.
You want each line of the file to be read by a different thread.
So, instead of opening the file in every thread and reading it, we will open the file once and put its lines into a list.
Now, every thread will take one line from the list, print it, and remove that printed line from the list.
Once all the data has been printed and a thread still tries to read, we handle the resulting exception.
Code:
import threading
import sys

# Global variable list for reading file data
global file_data
file_data = []

# Create lock
lock = threading.Lock()

def reader():
    while len(file_data) != 0:
        print threading.currentThread().getName() + " --- "
        try:
            lock.acquire()
            # Get one line from list and print it
            a = file_data.pop()
            print a
        except:
            # Once data is not present, let's print exception message
            print "------------------------"
            print "No data present in file"
            sys.exit()
        lock.release()

# Read data from file and put it into list
with open("messages.txt") as fh:
    file_data = fh.readlines()

for x in range(2):
    name = "Thread_" + str(x)
    t = threading.Thread(name=name, target=reader)
    t.start()
Output:
C:\Users\dinesh\Desktop>python demo.py
Thread_0 --- Thread_1 ---
Each thread read each message
Thread_1 --- I am great
Thread_0 --- How Are you ?
Thread_1 --- Grey
Thread_0 --- Hey
Thread_1 --- Dinesh
Thread_0 --- Hello
------------------------
No data present in file
C:\Users\dinesh\Desktop>
NOTE: I know use of a global is not recommended, but for learning purposes it is good.
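If you want to avoid the global, one option is to pass the list into each thread instead. A minimal sketch of that variation (not the code above, just the same idea with an argument):

import threading

lock = threading.Lock()

def reader(lines):
    while True:
        with lock:
            if not lines:      # nothing left to print
                return
            item = lines.pop()
        print(item)

with open("messages.txt") as fh:
    data = fh.readlines()

threads = [threading.Thread(name="Thread_%d" % i, target=reader, args=(data,))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()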
I'm currently writing a Python program to receive data from either a TCP or UDP socket and then write the data to a file. Right now, my program is I/O bound because it writes each datagram to the file as it comes in (I'm doing this for very large files, so the slowdown is considerable). With that in mind, I've decided I'd like to try receiving the data from the socket in one thread and writing that data in a different thread. So far, I've come up with the following rough draft. At the moment, it only writes a single data chunk (512 bytes) to a file.
f = open("t1.txt", "wb")

def write_to_file(data):
    f.write(data)

def recv_data():
    dataChunk, addr = sock.recvfrom(buf)  # THIS IS THE DATA THAT GETS WRITTEN
    try:
        w = threading.Thread(target=write_to_file, args=(dataChunk,))
        threads.append(w)
        w.start()
        while(dataChunk):
            sock.settimeout(4)
            dataChunk, addr = sock.recvfrom(buf)
    except socket.timeout:
        print "Timeout"
        sock.close()
        f.close()

threads = []
r = threading.Thread(target=recv_data)
threads.append(r)
r.start()
I imagine I'm doing something wrong; I'm just not sure what the best way to use threading is. Right now, my issue is that I have to supply an argument when I create my thread, but the value of that argument doesn't change to reflect the new data chunks that come in. However, if I put the line w = threading.Thread(target=write_to_file, args=(dataChunk,)) inside the while(dataChunk) loop, wouldn't I be creating a new thread on each iteration?
Also, for what it's worth, this is just my small proof-of-concept for using separate receive and write threads. This is not the larger program that should ultimately make use of this concept.
You need to have a buffer that the reading thread writes to, and the writing thread reads from. A deque from the collections module is perfect, as it allows append/pop from either side without performance degradation.
So, don't pass dataChunk to your thread(s), but the buffer.
import collections  # for the buffer
import time         # to ease polling
import threading

def write_to_file(path, buffer, terminate_signal):
    with open(path, 'wb') as out_file:  # close file automatically on exit
        while not terminate_signal.is_set() or buffer:  # go on until end is signaled
            try:
                data = buffer.pop()  # pop from RIGHT end of buffer
            except IndexError:
                time.sleep(0.5)  # wait for new data
            else:
                out_file.write(data)  # write a chunk

def read_from_socket(sock, buffer, terminate_signal):
    sock.settimeout(4)
    try:
        while True:
            data, _ = sock.recvfrom(buf)
            buffer.appendleft(data)  # append to LEFT of buffer
    except socket.timeout:
        print "Timeout"
        terminate_signal.set()  # signal writer that we are done
        sock.close()

buffer = collections.deque()          # buffer for reading/writing
terminate_signal = threading.Event()  # shared signal

threads = [
    threading.Thread(target=read_from_socket, kwargs=dict(
        sock=sock,
        buffer=buffer,
        terminate_signal=terminate_signal,
    )),
    threading.Thread(target=write_to_file, kwargs=dict(
        path="t1.txt",
        buffer=buffer,
        terminate_signal=terminate_signal,
    )),
]

for t in threads:  # start both threads
    t.start()

for t in threads:  # wait for both threads to finish
    t.join()
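The snippet above assumes sock and buf already exist from your own setup, as in the question; for a quick test they might be created like this (the port and chunk size are made up):

import socket

buf = 512                                                # chunk size, as in the question
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP socket
sock.bind(("0.0.0.0", 9999))                             # listen on an arbitrary test port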
I have a python program that I have written. This python program calls a function within a module I have also written and passes it some data.
program:
def Response(Response):
    Resp = Response

def main():
    myModule.process_this("hello")  # Send string to myModule process_this function
    # Should wait around here for Resp to contain the Response
    print Resp
That function processes it and passes it back as a response to function Response in the main program.
myModule:
def process_this(data):
    # process data
    program.Response(data)
I checked and all the data is being passed correctly. I have left out all the imports and the data processing to keep this question as concise as possible.
I need to find some way of having Python wait for Resp to actually contain the response before proceeding with the program. I've been looking at threading and using semaphores, or using the Queue module, but I'm not 100% sure how I would incorporate either into my program.
Here's a working solution with queues and the threading module. Note: if your tasks are CPU bound rather than IO bound, you should use multiprocessing instead
import threading
import Queue

def worker(in_q, out_q):
    """ threadsafe worker """
    abort = False
    while not abort:
        try:
            # make sure we don't wait forever
            task = in_q.get(True, .5)
        except Queue.Empty:
            abort = True
        else:
            # process task
            response = task
            # return result
            out_q.put(response)
            in_q.task_done()

# one queue to pass tasks, one to get results
task_q = Queue.Queue()
result_q = Queue.Queue()

# start threads
t = threading.Thread(target=worker, args=(task_q, result_q))
t.start()

# submit some work
task_q.put("hello")

# wait for results
task_q.join()
print "result", result_q.get()