I have been trying to understand what is wrong with my code, with no success...
I have two .py files: logs.py, which contains functions that are supposed to write input to a file, and monitor_mode.py, which uses those functions.
When running logs.py as main, everything works fine and the file is created and written to. However, when trying to use the same functions from monitor_mode.py, nothing seems to be written to the files and I have no idea why.
I did try to debug, and the code is directed to the right function and everything goes as expected, except that no file is created and no data is written.
Thanks for any help.
logs.py
import datetime

serviceList = 'serviceList.txt'
statusLog = 'statusLog.txt'

def print_to_file(file_name, input):
    with open(file_name, 'a+') as write_obj:
        write_obj.write(input + '\n')
        write_obj.close()

def add_timestamp(input):
    timestamp = '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' + datetime.datetime.now().strftime(
        "%Y-%m-%d %H:%M:%S") + '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
    input = timestamp + '\n' + input
    return input

if __name__ == "__main__":
    import services
    for i in range(3):
        proc = services.list_of_process()
        proc = add_timestamp(proc)
        print_to_file(serviceList, proc)
monitor_mode.py
import logs
import services

serviceList = 'serviceList.txt'
statusLog = 'statusLog.txt'

def updates_log():
    proc = services.list_of_process()
    proc = logs.add_timestamp(proc)
    logs.print_to_file(serviceList, proc)
    print('Updates Logs\n' + proc)

if __name__ == "__main__":
    for i in range(3):
        updates_log()
EDIT 1.1
The above code is running on Ubuntu 16.8; when running the code on a Win10 machine it works just fine.
services.list_of_process() returns a string.
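One thing that may be worth ruling out (an assumption, not a confirmed cause): relative file names such as 'serviceList.txt' are resolved against the current working directory of whatever process runs the script, so if monitor_mode.py is started from a different directory than logs.py was, the files may be created somewhere other than where you are looking. A minimal sketch of anchoring the paths to the location of logs.py:

import os

# directory that contains logs.py, independent of the caller's working directory
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

serviceList = os.path.join(BASE_DIR, 'serviceList.txt')
statusLog = os.path.join(BASE_DIR, 'statusLog.txt')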
I have to parse 30 days of access logs from the server based on client IP and accessed hosts, and I need to know the top 10 accessed sites. The log file will be around 10-20 GB in size, which takes a lot of time for a single-threaded execution of the script. Initially, I wrote a script which was working fine, but it was taking a lot of time due to the large log file size. Then I tried to implement the multiprocessing library for parallel processing, but it is not working. It seems the multiprocessing implementation is repeating the work instead of doing it in parallel. I am not sure what is wrong in the code. Can someone please help with this? Thank you so much in advance for your help.
Code:
from datetime import datetime, timedelta
import commands
import os
import string
import sys
import multiprocessing

def ipauth(slave_list, static_ip_list):
    file_record = open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'a')
    count = 1
    while (count <= 30):
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave = string[0]
                proxy = string[1]
                log_path = "/LOGS/%s/%s" % (slave, yr_month)
                try:
                    os.path.exists(log_path)
                    file_read = os.path.join(log_path, file_name)
                    with open(file_read) as log:
                        for log_line in log:
                            log_line = log_line.strip()
                            if proxy in log_line:
                                file_record.write(log_line + '\n')
                except IOError:
                    pass
        count = count + 1

    file_log = open('/home/access/top10_domain_accessed/logs/ipauth_logs.txt', 'a')
    with open(static_ip_list) as ip:
        for line in ip:
            with open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'r') as f:
                for content in f:
                    log_split = content.split()
                    client_ip = log_split[7]
                    if client_ip in line:
                        content = str(content).strip()
                        file_log.write(content + '\n')
    return

if __name__ == '__main__':
    slave_list = sys.argv[1]
    static_ip_list = sys.argv[2]
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ipauth, args=(slave_list, static_ip_list))
        jobs.append(p)
        p.start()
        p.join()
UPDATE AFTER CONVERSATION WITH OP, PLEASE SEE COMMENTS
My take: Split the file into smaller chunks and use a process pool to work on those chunks:
import itertools
import multiprocessing

def chunk_of_lines(fp, n):
    # read n lines from the file at a time and yield them as a list
    while True:
        chunk = list(itertools.islice(fp, n))
        if not chunk:
            return
        yield chunk

def process(lines, static_ip_list):
    pass  # do stuff with a chunk of lines

p = multiprocessing.Pool()
fp = open(slave_list)
for chunk in chunk_of_lines(fp, 10):
    p.apply_async(process, [chunk, static_ip_list])
p.close()
p.join()  # Wait for all child processes to finish.
There are many ways to implement the chunk_of_lines method: you could iterate over the file's lines with a simple for loop, or do something more advanced like calling fp.read().
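For illustration, here is a sketch of the simple for-loop variant (my own code, not from the original answer): it accumulates lines into a list and yields the batch whenever it reaches n entries.

def chunk_of_lines(fp, n):
    # accumulate lines until we have n of them, then yield the batch
    chunk = []
    for line in fp:
        chunk.append(line)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:  # yield any leftover lines at the end of the file
        yield chunk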
I'm trying to write a script to extract data from a number of files in a directory with the extension ".tp6" and then write all of that data to a single text file.
It's able to get the data from each file correctly and print it to the terminal, but I haven't been able to 'pass' each data point to another function that writes it to a text file.
Any ideas? Thank you!
import glob
import os
import Tkinter
import tkFileDialog

root = Tkinter.Tk()
root.withdraw()
dir_path = tkFileDialog.askdirectory()
os.chdir(dir_path)

def main():
    for file_path in glob.glob('*.tp6'):
        uncovext(file_path)

def main2():
    for file_path in glob.glob('*.tp6'):
        totext(uncovext)

#find and print data from each .tp6 file - this part works correctly
def uncovext(file_path):
    for line in open(file_path):
        if line.startswith(' UNCONVOLVED INTEGRATED RADIANCE'):
            text = line[36:47]
            number = float(text) * 10000
            print('%.3f' % number)

def totext(uncovext):
    with open("output.txt", "a") as f:
        f.write(uncovext)
        f.close()

if __name__ == '__main__':
    main()
    main2()
I think it was a matter of naming: if you change the input parameter of the totext function to p_uncovext, for example, it should work. You also need to call the totext function inside your loop.
import glob
import os
import Tkinter
import tkFileDialog

root = Tkinter.Tk()
root.withdraw()
dir_path = tkFileDialog.askdirectory()
os.chdir(dir_path)

def main():
    for file_path in glob.glob('*.tp6'):
        uncovext(file_path)

#find and print data from each .tp6 file - this part works correctly
def uncovext(file_path):
    for line in open(file_path):
        if line.startswith(' UNCONVOLVED INTEGRATED RADIANCE'):
            text = line[36:47]
            number = float(text) * 10000
            totext('%.3f' % number)

def totext(p_uncovext):
    with open("output.txt", "a") as f:
        f.write(p_uncovext)
        f.close()

if __name__ == '__main__':
    main()
You have a couple of problems. First, uncovext doesn't save the data it parses from the input file; after printing to the screen, it is just thrown away. You could collect it into a list and return that for further processing. Second, you call the writer in a second function, and there is no way for main to let main2 know what the data is.
An easy fix is a single function that calls uncovext and uses its result to call totext.
import glob
import os
import Tkinter
import tkFileDialog

root = Tkinter.Tk()
root.withdraw()
dir_path = tkFileDialog.askdirectory()
os.chdir(dir_path)

def main():
    for file_path in glob.glob('*.tp6'):
        totext(uncovext(file_path))

#find and print data from each .tp6 file - this part works correctly
def uncovext(file_path):
    output = []
    for line in open(file_path):
        if line.startswith(' UNCONVOLVED INTEGRATED RADIANCE'):
            text = line[36:47]
            number = float(text) * 10000
            output.append('%.3f\n' % number)
    return output

def totext(uncovext):
    with open("output.txt", "a") as f:
        f.writelines(uncovext)

if __name__ == '__main__':
    main()
You could also rewrite your parser as a generator and write code that I find more self-explanatory (that's just me though...)
def main():
    with open('output.txt', 'a') as f:
        for file_path in glob.glob('*.tp6'):
            f.writelines(uncovext(file_path))

#find and print data from each .tp6 file - this part works correctly
def uncovext(file_path):
    for line in open(file_path):
        if line.startswith(' UNCONVOLVED INTEGRATED RADIANCE'):
            text = line[36:47]
            number = float(text) * 10000
            yield '%.3f\n' % number
I'm using the following code for parallel CSV processing:
#!/usr/bin/env python

import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp

def init_worker(x):
    sleep(.5)
    print "(%s,%s)" % (x[0], x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rt")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        results = p.map(init_worker, csvReader, chunksize = 10)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()
        p.terminate()

    # CLOSE FH FOR READING INPUT
    inputFH.close()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # WRITE RESULTS TO OUTPUT FILE
    [csvWriter.writerow(row) for row in results]

    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()

    print pp(results)
    # print len(results)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
I would like to somehow measure the progress of the script (just plain text, not any fancy ASCII art). The one option that comes to my mind is to compare the lines that were successfully processed by init_worker to all lines in input.csv, and print the actual state, e.g. every second. Can you please point me to the right solution? I've found several articles with a similar problem, but I was not able to adapt them to my needs because none of them used the Pool class and the map method. I would also like to ask about the p.close(), p.join(), and p.terminate() methods: I've seen them mainly with the Process class, not Pool. Are they necessary with the Pool class, and have I used them correctly? The use of p.terminate() was intended to kill the process with Ctrl+C, but that is a different story which does not have a happy ending yet. Thank you.
PS: My input.csv looks like this, if it matters:
0,0
1,3
2,6
3,9
...
...
48,144
49,147
PPS: As I said, I'm a newbie in multiprocessing and the code I've put together just works. The one drawback I can see is that the whole CSV is stored in memory, so if you have a better idea, do not hesitate to share it.
Edit
in reply to #J.F.Sebastian
Here is my actual code based on your suggestions:
#!/usr/bin/env python

import csv
from time import sleep
from multiprocessing import Pool
from multiprocessing import cpu_count
from multiprocessing import current_process
from pprint import pprint as pp
from tqdm import tqdm

def do_job(x):
    sleep(.5)
    # print "(%s,%s)" % (x[0],x[1])
    x.append(int(x[0])**2)
    return x

def parallel_csv_processing(inputFile, outputFile, header=["Default", "header", "please", "change"], separator=",", skipRows = 0, cpuCount = 1):
    # OPEN FH FOR READING INPUT FILE
    inputFH = open(inputFile, "rb")
    csvReader = csv.reader(inputFH, delimiter=separator)

    # SKIP HEADERS
    for skip in xrange(skipRows):
        csvReader.next()

    # OPEN FH FOR WRITING OUTPUT FILE
    outputFH = open(outputFile, "wt")
    csvWriter = csv.writer(outputFH, lineterminator='\n')

    # WRITE HEADER TO OUTPUT FILE
    csvWriter.writerow(header)

    # PARALLELIZE COMPUTING INTENSIVE OPERATIONS - CALL FUNCTION HERE
    try:
        p = Pool(processes = cpuCount)
        # results = p.map(do_job, csvReader, chunksize = 10)
        for result in tqdm(p.imap_unordered(do_job, csvReader, chunksize=10)):
            csvWriter.writerow(result)
        p.close()
        p.join()
    except KeyboardInterrupt:
        p.close()
        p.join()

    # CLOSE FH FOR READING INPUT
    inputFH.close()

    # CLOSE FH FOR WRITING OUTPUT
    outputFH.close()

    print pp(result)
    # print len(result)

def main():
    inputFile = "input.csv"
    outputFile = "output.csv"
    parallel_csv_processing(inputFile, outputFile, cpuCount = cpu_count())

if __name__ == '__main__':
    main()
Here is the output of tqdm:
1 [elapsed: 00:05, 0.20 iters/sec]
What does this output mean? On the page you referred to, tqdm is used in a loop in the following way:
>>> import time
>>> from tqdm import tqdm
>>> for i in tqdm(range(100)):
... time.sleep(1)
...
|###-------| 35/100 35% [elapsed: 00:35 left: 01:05, 1.00 iters/sec]
That output makes sense, but what does my output mean? Also, it does not seem that the Ctrl+C problem is fixed: after hitting Ctrl+C, the script throws a traceback; if I hit Ctrl+C again, I get a new traceback, and so on. The only way to kill it is to send it to the background (Ctrl+Z) and then kill it (kill %1).
To show the progress, replace pool.map with pool.imap_unordered:
from tqdm import tqdm  # $ pip install tqdm

for result in tqdm(pool.imap_unordered(init_worker, csvReader, chunksize=10)):
    csvWriter.writerow(result)
The tqdm part is optional; see Text Progress Bar in the Console.
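If you want plain-text progress without tqdm at all (the question says plain text is enough), a minimal sketch is to count results as they arrive from imap_unordered and print a status line every so many rows. This is my own illustration; the names pool, init_worker, csvReader and csvWriter are assumed to be the same ones used above.

processed = 0
for result in pool.imap_unordered(init_worker, csvReader, chunksize=10):
    csvWriter.writerow(result)
    processed += 1
    if processed % 100 == 0:  # print a status line every 100 rows
        print "processed %d rows" % processed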
As a side effect, switching from map to imap_unordered also fixes your "whole csv is stored in memory" and "KeyboardInterrupt is not raised" problems.
Here's a complete code example:
#!/usr/bin/env python
import itertools
import logging
import multiprocessing
import time

def compute(i):
    time.sleep(.5)
    return i**2

if __name__ == "__main__":
    logging.basicConfig(format="%(asctime)-15s %(levelname)s %(message)s",
                        datefmt="%F %T", level=logging.DEBUG)
    pool = multiprocessing.Pool()
    try:
        for square in pool.imap_unordered(compute, itertools.count(), chunksize=10):
            logging.debug(square)  # report progress by printing the result
    except KeyboardInterrupt:
        logging.warning("got Ctrl+C")
    finally:
        pool.terminate()
        pool.join()
You should see the output in batches every .5 * chunksize seconds. If you press Ctrl+C, you should see KeyboardInterrupt raised in the child processes and in the main process. In Python 3, the main process exits immediately. In Python 2, the KeyboardInterrupt is delayed until the next batch should have been printed (a bug in Python).
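To make the timing concrete (my own arithmetic, using the values in this example): with sleep(.5) and chunksize=10, that works out to 0.5 * 10 = 5 seconds of work per chunk, so each worker hands back a batch of ten results roughly every five seconds.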
I've got a small script which monitors when files are added to or removed from a directory. The next step is to get the script to execute the files (Windows batch files) once they've been added to the directory. I'm struggling to understand how to use a variable with subprocess.call (if this is the best way this can be achieved). Could anyone help me, please? Many thanks. The code looks like this so far:
import sys
import time
import os

inputdir = 'c:\\test\\'
os.chdir(inputdir)
contents = os.listdir(inputdir)
count = len(inputdir)
dirmtime = os.stat(inputdir).st_mtime

while True:
    newmtime = os.stat(inputdir).st_mtime
    if newmtime != dirmtime:
        dirmtime = newmtime
        newcontents = os.listdir(inputdir)
        added = set(newcontents).difference(contents)
        if added:
            print "These files added: %s" % (" ".join(added))
            import subprocess
            subprocess.call(%,shell=True)
        removed = set(contents).difference(newcontents)
        if removed:
            print "These files removed: %s" % (" ".join(removed))
        contents = newcontents
    time.sleep(15)
This should do what you wanted; I cleaned it up a little.
import sys
import time
import os
import subprocess

def monitor_execute(directory):
    dir_contents = os.listdir(directory)
    last_modified = os.stat(directory).st_mtime

    while True:
        time.sleep(15)
        modified = os.stat(directory).st_mtime
        if last_modified == modified:
            continue
        last_modified = modified

        current_contents = os.listdir(directory)
        new_files = set(current_contents).difference(dir_contents)
        if new_files:
            print 'Found new files: %s' % ' '.join(new_files)
            for new_file in new_files:
                subprocess.call(new_file, shell=True)

        lost_files = set(dir_contents).difference(current_contents)
        if lost_files:
            print 'Lost these files: %s' % ' '.join(lost_files)

        dir_contents = current_contents
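The function above is only defined, never called. A minimal usage sketch (my addition, reusing the c:\test\ directory from the question): since subprocess.call with a relative file name resolves against the current working directory, changing into the watched directory first keeps the behaviour of the original script.

if __name__ == '__main__':
    watch_dir = 'c:\\test\\'   # directory from the question; adjust as needed
    os.chdir(watch_dir)        # so the batch files can be called by their bare names
    monitor_execute(watch_dir)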