I asked a few weeks ago about a solution to a problem with my Python script.
I just started my project again, and I still have a problem.
My Arduino is working fine, and the command sudo screen /dev/ttyACM0 works perfectly. I'm getting:
T: 52.80% 23.80 15% 92% N
T: 52.80% 23.80 15% 92% N
T: 52.80% 23.80 15% 92% N
T: marks the start of each row
The first number is humidity
The second is temperature
The third is the photoresistor reading
The next one is soil moisture
and the last one is the fan state (N - not working, Y - working)
I would like to use a Python script with cron to write a text file with the results for each individual sensor.
For example I'll use cron to save 4 text files (temp.txt, humi.txt, soil.txt, photo.txt) every 5 minutes, 30 minutes, 1 hour, 3 hours, 12 hours, 24 hours.
Next I'll use a php script to show data as diagrams on my website.
But the problem is with my python script. I've got a solution here, and at the moment I'm using the following script (temperature's example):
#!/usr/bin/python
import serial
import time
buffer = bytes()
ser = serial.Serial('/dev/ttyACM0',9600, timeout=10)
while buffer.count('T:') < 2:
    buffer += ser.read(30)
ser.close();
# Now we have at least one complete datum. Isolate it.
start = buffer.index('T:')
end = buffer.index('T:', start+1)
items = buffer[start:end].strip().split()
print time.strftime("%Y-%m-%d %H:%M:%S"), items[2]
But in my text file I've got incorrect info, which looks like:
2013-05-10 19:47:01 12%
2013-05-10 19:48:01
2013-05-10 19:49:01 N
2013-05-10 19:50:01 24.10
2013-05-10 19:51:01 24.10
2013-05-10 19:52:01 7%
2013-05-10 19:53:01 24.10
but it should be 2013-05-10 19:47:01 24.10 all the time.
What's wrong with it?
I suspect that instead of
items = buffer[start:end].strip().split()
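you want
items = buffer[start:end].split()
Note that split() with no arguments already discards leading and trailing whitespace, and that .split().strip() would fail, because split() returns a list and lists have no strip() method.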
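One way to make the script robust against partial or garbled reads is to validate each record before using it. This is a minimal sketch, not the original poster's code. It assumes a valid record always has exactly six whitespace-separated fields, as in the screen output shown in the question. parse_reading is a hypothetical helper name, not part of pyserial.

```python
def parse_reading(line):
    """Parse one record such as 'T: 52.80% 23.80 15% 92% N'.

    Returns a dict, or None if the record is incomplete or garbled.
    """
    if isinstance(line, bytes):
        line = line.decode("ascii", errors="replace")
    fields = line.split()
    # A complete record has the 'T:' marker plus five values (assumption).
    if len(fields) != 6 or fields[0] != "T:" or fields[5] not in ("Y", "N"):
        return None
    try:
        return {
            "humidity": float(fields[1].rstrip("%")),
            "temperature": float(fields[2]),
            "photo": float(fields[3].rstrip("%")),
            "soil": float(fields[4].rstrip("%")),
            "fan_on": fields[5] == "Y",
        }
    except ValueError:
        # A field that should be numeric was cut off or corrupted.
        return None


print(parse_reading("T: 52.80% 23.80 15% 92% N"))
print(parse_reading("80 15% 92% N"))  # truncated record, rejected
```

If the Arduino ends each record with a newline (an assumption), the records could come from ser.readline() in a loop, skipping every None until a valid record arrives. That way a truncated first line never reaches the text file.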
I'm extracting the execution time of a Linux process using subprocess and ps. I'd like to put it in a datetime object, to perform datetime arithmetic. However, I'm a little concerned about the output ps returns for the execution time:
1-01:12:23 // 1 day, 1 hour, 12 minutes, 23 seconds
05:39:03 // 5 hours, 39 minutes, 3 seconds
15:06 // 15 minutes, 6 seconds
Notice there is no zero padding before the day. And it doesn't include months/years, whereas technically something could run for that long.
Consequently I'm unsure what format string to use to convert it to a timedelta, because I don't want it to break if one process has run for months and another has only run for hours.
UPDATE
Mozway has given a very smart answer. However, I'm taking a step back and wondering if I can get the execution time another way. I'm currently using ps to get the time, but it means I also have the pid. Is there something else I can do with the pid, to get the execution time in a simpler format?
(Can only use official Python libraries)
UPDATE2
It's actually colons between the hours, mins and seconds.
You should use a timedelta
Here is a suggestion on how to convert from your string:
import datetime
s = '1-01-12-23'
out = datetime.timedelta(**dict(zip(['days', 'hours', 'minutes', 'seconds'],
                                    map(int, s.split('-')))))
Output:
datetime.timedelta(days=1, seconds=4343)
If the string can have more or fewer units, and assuming the smallest units are always present, you can take advantage of the fact that zip stops at the end of the shortest iterable. Just reverse both inputs:
s = '12-23'
units = ['days', 'hours', 'minutes', 'seconds']
out = datetime.timedelta(**dict(zip(reversed(units),
                                    map(int, reversed(s.split('-'))))))
Output:
datetime.timedelta(seconds=743)
As a function
Using re.split to handle the 1-01:23:45 format
import re
def to_timedelta(s):
    units = ['days', 'hours', 'minutes', 'seconds']
    return datetime.timedelta(**dict(zip(reversed(units),
                                         map(int, reversed(re.split('[-:]', s))))))
to_timedelta('1-01:12:23')
# datetime.timedelta(days=1, seconds=4343)
to_timedelta('05:39:03')
# datetime.timedelta(seconds=20343)
to_timedelta('15:06')
# datetime.timedelta(seconds=906)
I know this question has been answered several times in different places, but I'm trying to find things to do in parallel. I came across this answer from "Python: how to determine if a list of words exist in a string", answered by @Aaron Hall. It works perfectly, but when I run the same snippet in parallel using ProcessPoolExecutor or ThreadPoolExecutor it is very slow. Normal execution takes 0.22 seconds to process 119288 lines, but with ProcessPoolExecutor it takes 93 seconds. I don't understand the problem; the code snippet is here.
def multi_thread_execute(): # this takes 93 seconds
    lines = get_lines()
    print("got {} lines".format(len(lines)))
    futures = []
    my_word_list = ['banking', 'members', 'based', 'hardness']
    with ProcessPoolExecutor(max_workers=10) as pe:
        for line in lines:
            ff = pe.submit(words_in_string,my_word_list, line)
            futures.append(ff)
    results = [f.result() for f in futures]
The single-threaded version takes 0.22 seconds.
my_word_list = ['banking', 'members', 'based', 'hardness']
lines = get_lines()
for line in lines:
    result = words_in_string(my_word_list, line)
I have a single 50GB+ file (Google 5-gram files). Reading lines in parallel works very well, but the multi-process version above is far too slow. Is this a problem with the GIL? How can I improve performance?
sample format of file (single file with 50+GB, total data is 3 TB)
n.p. : The Author , 2005 1 1
n.p. : The Author , 2006 7 2
n.p. : The Author , 2007 1 1
n.p. : The Author , 2008 2 2
NP if and only if 1977 1 1
NP if and only if 1980 1 1
NP if and only if 1982 3 2
Python is known to be a language that generally does not have strong use cases for multithreading; more can be read about why in this StackOverflow question.
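Beyond the GIL, a likely cause of the 93-second run is per-task overhead. Each pe.submit call pickles its arguments, ships them to a worker process, and pickles the result back, which costs far more than one tiny string check. The sketch below batches the lines so that each task does a meaningful amount of work. words_in_string is a simplified stand-in for the function from the linked answer (an assumption, since that code is not shown here). chunked, search_batch, parallel_search and the default batch size are hypothetical names and values.

```python
from concurrent.futures import ProcessPoolExecutor


def words_in_string(word_list, a_string):
    # Simplified stand-in (assumption): which of the words occur in the line?
    return set(word_list).intersection(a_string.split())


def chunked(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def search_batch(args):
    # One task handles a whole batch, so the pickling cost is paid per batch
    # rather than per line.
    word_list, batch = args
    return [words_in_string(word_list, line) for line in batch]


def parallel_search(lines, word_list, batch_size=10000, max_workers=4):
    tasks = [(word_list, batch) for batch in chunked(lines, batch_size)]
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as pe:
        # map() preserves input order, so results line up with `lines`.
        for batch_result in pe.map(search_batch, tasks):
            results.extend(batch_result)
    return results
```

Call parallel_search from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. For a job that takes only 0.22 seconds in a single process, even the batched version may not win. Parallelism is more likely to pay off across the many large files than within one list of lines.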
I posted a similar question a few days ago but without any code, now I created a test code in hopes of getting some help.
Code is at the bottom.
I have a dataset with a bunch of large files (~100), and I want to extract specific lines from those files very efficiently (both in memory and in speed).
My code gets a list of relevant files. It opens each file with [line 1], then maps the file to memory with [line 2]. For each file I receive a list of indices, and going over the indices I retrieve the relevant information (10 bytes for this example) like so: [line 3-4]. Finally I close the handles with [line 5-6].
binaryFile = open(path, "r+b")
binaryFile_mm = mmap.mmap(binaryFile.fileno(), 0)
for INDEX in INDEXES:
    information = binaryFile_mm[(INDEX):(INDEX)+10].decode("utf-8")
binaryFile_mm.close()
binaryFile.close()
This code runs in parallel, with thousands of indices for each file, and does so continuously, several times a second, for hours.
Now to the problem - The code runs well when I limit the indices to be small (meaning - when I ask the code to get information from the beginning of the file). But! when I increase the range of the indices, everything slows down to (almost) a halt AND the buff/cache memory gets full (I'm not sure if the memory issue is related to the slowdown).
So my question is why does it matter if I retrieve information from the beginning or the end of the file and how do I overcome this in order to get instant access to information from the end of the file without slowing down and increasing buff/cache memory use.
PS - some numbers and sizes: so I got ~100 files each about 1GB in size, when I limit the indices to be from the 0%-10% of the file it runs fine, but when I allow the index to be anywhere in the file it stops working.
Code - tested on linux and windows with python 3.5, requires 10 GB of storage (creates 3 files with random strings inside 3GB each)
import os, errno, sys
import random, time
import mmap
def create_binary_test_file():
    print("Creating files with 3,000,000,000 characters, takes a few seconds...")
    test_binary_file1 = open("test_binary_file1.testbin", "wb")
    test_binary_file2 = open("test_binary_file2.testbin", "wb")
    test_binary_file3 = open("test_binary_file3.testbin", "wb")
    for i in range(1000):
        if i % 100 == 0 :
            print("progress - ", i/10, " % ")
        # efficiently create random strings and write to files
        tbl = bytes.maketrans(bytearray(range(256)),
                              bytearray([ord(b'a') + b % 26 for b in range(256)]))
        random_string = (os.urandom(3000000).translate(tbl))
        test_binary_file1.write(str(random_string).encode('utf-8'))
        test_binary_file2.write(str(random_string).encode('utf-8'))
        test_binary_file3.write(str(random_string).encode('utf-8'))
    test_binary_file1.close()
    test_binary_file2.close()
    test_binary_file3.close()
    print("Created binary file for testing.The file contains 3,000,000,000 characters")
# Opening binary test file
try:
    binary_file = open("test_binary_file1.testbin", "r+b")
except OSError as e: # this would be "except OSError, e:" before Python 2.6
    if e.errno == errno.ENOENT: # errno.ENOENT = no such file or directory
        create_binary_test_file()
        binary_file = open("test_binary_file1.testbin", "r+b")
## example of use - perform 100 times, in each iteration: open one of the binary files and retrieve 50,000 sample strings
## (if code runs fast and without a slowdown - increase the k or other numbers and it should reproduce the problem)
## Example 1 - getting information from start of file
print("Getting information from start of file")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1,100000-1000), k=50000)
    sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9 :
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()
## Example 2 - getting information from all of the file
print("Getting information from all of the file")
binary_file = open("test_binary_file1.testbin", "r+b")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1,3000000000-1000), k=50000)
    sampled_data = [[binary_file_mm[v:v+1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9 :
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()
My results: (The average time of getting information from all across the file is almost 4 times slower than getting information from the beginning, with ~100 files and parallel computing this difference gets much bigger)
Getting information from start of file
Iter 9 Average time - 0.14790
Iter 19 Average time - 0.14590
Iter 29 Average time - 0.14456
Iter 39 Average time - 0.14279
Iter 49 Average time - 0.14256
Iter 59 Average time - 0.14312
Iter 69 Average time - 0.14145
Iter 79 Average time - 0.13867
Iter 89 Average time - 0.14079
Iter 99 Average time - 0.13979
Getting information from all of the file
Iter 9 Average time - 0.46114
Iter 19 Average time - 0.47547
Iter 29 Average time - 0.47936
Iter 39 Average time - 0.47469
Iter 49 Average time - 0.47158
Iter 59 Average time - 0.47114
Iter 69 Average time - 0.47247
Iter 79 Average time - 0.47881
Iter 89 Average time - 0.47792
Iter 99 Average time - 0.47681
The basic reason why you have this time difference is that you have to seek to where you need in the file. The further from position 0 you are, the longer it's going to take.
What might help is since you know the starting index you need, seek on the file descriptor to that point and then do the mmap. Or really, why bother with mmap in the first place - just read the number of bytes that you need from the seeked-to position, and put that into your result variable.
To determine if you're getting adequate performance, check the memory available for the buffer/page cache (free in Linux), I/O stats - the number of reads, their size and duration (iostat; compare with the specs of your hardware), and the CPU utilization of your process.
[edit] Assuming that you read from a locally attached SSD (without having the data you need in the cache):
When reading in a single thread, you should expect your batch of 50,000 reads to take more than 7 seconds (50000*0.000150). Probably longer because the 50k accesses of a mmap-ed file will trigger more or larger reads, as your accesses are not page-aligned - as I suggested in another Q&A I'd use simple seek/read instead (and open the file with buffering=0 to avoid unnecessary reads for Python buffered I/O).
With more threads/processes reading simultaneously, you can saturate your SSD's throughput (how many 4KB reads/s it can do; this can be anywhere from 5,000 to 1,000,000), and then the individual reads will become even slower.
[/edit]
The first example only accesses 3*100KB of the files' data, so as you have much more than that available for the cache, all of the 300KB quickly end up in the cache, so you'll see no I/O, and your python process will be CPU-bound.
I'm 99.99% sure that if you test reading from the last 100KB of each file, it will perform as well as the first example - it's not about the location of the data, but about the size of the data accessed.
The second example accesses random portions from 9GB, so you can hope to see similar performance only if you have enough free RAM to cache all of the 9GB, and only after you preload the files into the cache, so that the testcase runs with zero I/O.
In realistic scenarios, the files will not be fully in the cache - so you'll see many I/O requests and much lower CPU utilization for python. As I/O is much slower than cached access, you should expect this example to run slower.
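The plain seek/read approach suggested above (open with buffering=0, then seek and read only the bytes needed) can be sketched as follows. read_slices is a hypothetical helper. Sorting the offsets is an extra assumption of this sketch: it may improve locality, but it is not something the answers above claim. With the answer's figure of about 150 µs per uncached read, 50,000 reads would cost roughly 50000 * 0.000150 = 7.5 seconds in a single thread.

```python
import os
import tempfile


def read_slices(path, offsets, size):
    """Return {offset: bytes} for each requested offset, using plain seek/read."""
    result = {}
    # buffering=0 gives a raw, unbuffered file object, so Python itself does
    # not read ahead beyond the requested bytes (the OS still works in pages).
    with open(path, "rb", buffering=0) as f:
        # Assumption: visiting offsets in ascending order may help locality.
        for offset in sorted(offsets):
            f.seek(offset)
            result[offset] = f.read(size)
    return result


# Tiny demonstration on a temporary file (a stand-in for a real data file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcdefghijklmnopqrstuvwxyz" * 100)
    demo_path = tmp.name

print(read_slices(demo_path, [52, 0, 2595], 5))
os.remove(demo_path)
```

Whether this beats mmap depends on how much of the data is already in the page cache. Measuring with iostat while the real workload runs, as suggested above, is the reliable way to tell.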
I'm working on a huffman encoder/decoder in Python, and am experiencing some unexpected (at least for me) behavior in my code. Encoding the file is fine, the problem occurs when decoding the file. Below is the associated code:
import codecs
import json

def decode(cfile):
    with open(cfile,"rb") as f:
        enc = f.read()
    len_dkey = int(bin(ord(enc[0]))[2:].zfill(8) + bin(ord(enc[1]))[2:].zfill(8),2) # length of dictionary
    pad = ord(enc[2]) # number of padding zeros at end of message
    dkey = { int(k): v for k,v in json.loads(enc[3:len_dkey+3]).items() } # dictionary
    enc = enc[len_dkey+3:] # actual message in bytes
    com = []
    for b in enc:
        com.extend([ bit=="1" for bit in bin(ord(b))[2:].zfill(8)]) # actual encoded message in bits (True/False)
    cnode = 0 # current node for tree traversal
    dec = "" # decoded message
    for b in com:
        cnode = 2 * cnode + b + 1 # array implementation of tree
        if cnode in dkey:
            dec += dkey[cnode]
            cnode = 0
    with codecs.open("uncompressed_"+cfile,"w","ISO-8859-1") as f:
        f.write(dec)
The first with open(cfile,"rb") as f call runs very quickly for all file sizes (tested sizes are 1.2MB, 679KB, and 87KB), but the part that slows down the code significantly is the for b in com loop. I've done some timing and I honestly don't know what's going on.
I've timed the whole decode function on each file, as shown below:
87KB 1.5 sec
679KB 6.0 sec
1.2MB 384.7 sec
First of all, I don't even know what complexity to assign to this. Next, I timed a single pass through the problematic loop, and found that the line cnode = 2*cnode + b + 1 takes 2e-6 seconds while the if cnode in dkey line takes 0.0 seconds (according to time.clock() on OSX). So it seems as if the arithmetic is slowing down my program significantly...? Which doesn't make sense to me.
I actually have no idea what is going on, and any help at all would be super welcome
I found a solution to my problem, but I am still left with confusion afterwards. I solved the problem by changing the dec from "" to [], and then changing the dec += dkey[cnode] line to dec.append(dkey[cnode]). This resulted in the following times:
87KB 0.11 sec
679KB 0.21 sec
1.2MB 1.01 sec
As you can see, this has immensely cut down the time, so in that aspect, this was a success. However, I am still confused as to why python's string concatenation seems to be the problem here.
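A likely explanation: Python strings are immutable, so dec += piece may have to copy the whole accumulated string every time, which makes the loop quadratic in the length of the output. CPython can sometimes extend a string in place, but that is an implementation detail and not guaranteed. Collecting the pieces in a list and joining once at the end is linear. Below is a sketch of the decoding loop in that style. decode_bits and the toy dkey are hypothetical, chosen to mirror the array-based tree from the question.

```python
def decode_bits(bits, dkey):
    """Walk the array-encoded tree and return the decoded text."""
    pieces = []                     # list.append is amortized O(1)
    cnode = 0                       # start at the root
    for b in bits:
        cnode = 2 * cnode + b + 1   # left child for 0/False, right for 1/True
        if cnode in dkey:           # reached a leaf
            pieces.append(dkey[cnode])
            cnode = 0
    return "".join(pieces)          # one final copy instead of one per symbol


# Toy tree: node 1 (bit 0) is 'a'; nodes 5 and 6 (bits 1,0 and 1,1) are 'b' and 'c'.
dkey = {1: "a", 5: "b", 6: "c"}
print(decode_bits([0, 1, 0, 1, 1, 0], dkey))  # abca
```

The jump from 6 seconds to 384 seconds between the 679KB and 1.2MB files fits this picture: once in-place growth stops working, every concatenation copies a string that keeps getting longer.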
Any links for me to convert datetime to filetime using python?
Example: 13 Apr 2011 07:21:01.0874 (UTC) FILETIME=[57D8C920:01CBF9AB]
Got the above from an email header.
My answer in the duplicate question got deleted, so I'll post it here:
Surfing around I found this link: http://cboard.cprogramming.com/windows-programming/85330-hex-time-filetime.html
After that, everything became simple:
>>> import struct
>>> ft = "57D8C920:01CBF9AB"
>>> # switch parts
>>> h2, h1 = [int(h, base=16) for h in ft.split(':')]
>>> # rebuild
>>> ft_dec = struct.unpack('>Q', struct.pack('>LL', h1, h2))[0]
>>> ft_dec
129471528618740000L
>>> # use function from iceaway's comment
>>> print filetime_to_dt(ft_dec)
2011-04-13 07:21:01
Tuning it up is up to you.
Well, here is the solution I ended up with:
import struct
from datetime import datetime
parm3 = 0x57D8C920; parm4 = 0x01CBF9AB
# Int32x32To64
ft_dec = struct.unpack('>Q', struct.pack('>LL', parm4, parm3))[0]
EPOCH_AS_FILETIME = 116444736000000000; HUNDREDS_OF_NANOSECONDS = 10000000
dt = datetime.fromtimestamp((ft_dec - EPOCH_AS_FILETIME) // HUNDREDS_OF_NANOSECONDS)
print(dt)
Output will be:
2011-04-13 09:21:01 (GMT +1)
13 Apr 2011 07:21:01.0874 (UTC)
Based on David Buxton's 'filetimes.py'.
^- Note that there's a difference in the hours.
I changed two things:
fromtimestamp() fits somewhat better than utcfromtimestamp() since I'm dealing with file times here.
FAT time resolution is 2 seconds, so I don't care about the 100 ns remainder that gets dropped.
(Actually, since the resolution is 2 seconds, there will normally be no remainder when dividing by HUNDREDS_OF_NANOSECONDS.)
Besides the order of the parameters, note that struct.pack('>LL' is for unsigned 32-bit ints!
If you have signed ints, simply change it to struct.pack('>ll' for signed 32-bit ints!
(See the struct.pack documentation for more info.)
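For reference, the same conversion can be sketched in Python 3 with integer arithmetic only, which keeps the sub-second part and covers both directions. The function names are hypothetical. Treating naive datetimes as UTC and reading the header value as LOW:HIGH hexadecimal words are assumptions taken from the example above.

```python
from datetime import datetime, timedelta, timezone

EPOCH_AS_FILETIME = 116444736000000000  # 1970-01-01 00:00:00 UTC as FILETIME
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_header_filetime(text):
    """Turn 'LOW:HIGH' hex words (as in the mail header) into a 64-bit integer."""
    low, high = (int(part, 16) for part in text.split(":"))
    return (high << 32) | low


def filetime_to_datetime(ft):
    # FILETIME counts 100 ns ticks; 10 ticks make one microsecond.
    return UNIX_EPOCH + timedelta(microseconds=(ft - EPOCH_AS_FILETIME) // 10)


def datetime_to_filetime(dt):
    # Assumption: naive datetimes are taken to be UTC.
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    delta = dt - UNIX_EPOCH
    seconds = delta.days * 86400 + delta.seconds
    return EPOCH_AS_FILETIME + seconds * 10_000_000 + delta.microseconds * 10


ft = parse_header_filetime("57D8C920:01CBF9AB")
print(ft)                        # 129471528618740000
print(filetime_to_datetime(ft))  # 2011-04-13 07:21:01.874000+00:00
```

Working with an explicit UTC timezone sidesteps the hour difference noted above, which comes from fromtimestamp() converting to local time.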