I want to read as many 24 bit chunks as possible from a file.
How can I do this using bitstring's ConstBitStream
when I don't know how many chunks there are?
Currently I do this:
eventList = ConstBitStream(filename='events.dat')
for i in range(1000):
    packet = eventList.read(24)
(here I have to calculate the number of events beforehand)
You could read until a ReadError exception is raised:
from bitstring import ReadError

try:
    while True:
        packet = eventList.read(24)
except ReadError:
    pass
Catching the ReadError is a perfectly good answer, but another way is to instead use the cut method, which returns a generator for bitstrings of a given length, so just
for packet in eventList.cut(24):
should work.
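For completeness, a minimal self-contained sketch of the cut approach (process() here is just a stand-in for whatever you do with each 24-bit chunk):
from bitstring import ConstBitStream

eventList = ConstBitStream(filename='events.dat')
for packet in eventList.cut(24):
    # each packet is a 24-bit bitstring
    process(packet)  # hypothetical handler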
Related
I am trying to get all the info of a .wav file by interpreting it as a text file, using the following code:
import wave
w = wave.open('C:/Users/jorge/Desktop/Programas/Python/Datos/Si_Canciones/NSYNC - Its Gonna Be Me.wav', 'r') # :P
for i in range(5000):  # w.getnframes()
    frame = w.readframes(i)
    print(frame)
It prints everything I want, but at the end I get something like this:
00\x00\x00\x00\x00\x00\x00\x0
b''
b''
b''
b''
#And the b''s continue for a while
I would like to add something like this inside the for loop, so I can get rid of those b''s:
if (something):
    break
But I don't know what that "something" could be. Can someone help me with it? :/
(I stay tuned to your answers and wish you a nice week)
The most obvious answer would be
if frame == b"":
    break
But as stated in the docs, there is also a method that gives you the number of frames, so you might want to use that instead; it lets you iterate over only the frames that actually exist. I'm not familiar with the module, though.
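A minimal sketch of that idea (the path is a stand-in for yours, and the chunk size of 1024 frames is my own arbitrary choice):
import wave

w = wave.open('song.wav', 'rb')
total = w.getnframes()    # total number of frames in the file
chunk = 1024              # frames per read
read = 0
while read < total:
    frames = w.readframes(chunk)
    read += chunk
    print(frames)
w.close()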
I have a number of files where I want to replace all instances of a specific string with another one.
I currently have this code:
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# Open file for substitution
replaceFile = open('file', 'r+')
# read in all the lines
lines = replaceFile.readlines()
# seek to the start of the file and truncate
# (this is because I want to do an "inline" replace)
replaceFile.seek(0)
replaceFile.truncate()
# Loop through each line from file
for line in lines:
    # Loop through each key in the mappings dict
    for i in mappings.keys():
        # if the key appears in the line
        if i in line:
            # do replacement
            line = line.replace(i, mappings[i])
    # Write the line to the file and move to next line
    replaceFile.write(line)
This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.
For instance, in the "mappings" dict there are 60728 key value pairs.
I need to process up to 50 files and replace all instances of "key" with the corresponding value, and each of the 50 files is approximately 250000 lines.
There are also many cases where several keys need to be replaced on the same line, hence I can't just find the first match and then move on.
So my question is:
Is there a faster way to do the above?
I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
If you need more info, let me know.
If this is still too slow, you'll have to find something fancy, because it's just about all running at C level:
for filename in filenames:
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)
Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
Apparently str.replace is faster than regex.sub.
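If you'd rather stay inside Python than launch instances from the shell, here is a rough sketch of the same one-process-per-chunk-of-files idea with multiprocessing.Pool (assuming mappings and filenames are defined at module level so the workers can see them):
import multiprocessing as mp

def replace_in_file(filename):
    # same replace-everything-in-memory approach as above, one file per call
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)

if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    pool.map(replace_in_file, filenames)
    pool.close()
    pool.join()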
So I got to thinking about this a bit more: suppose you have a really huge mappings dict, so huge that the likelihood of any one key in mappings being found in your files is very low. In this scenario, all the time will be spent doing the searching (as pointed out by @abarnert).
Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).
So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:
import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into roughly equal pieces
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top level, or
    # it can't be pickled (multiprocessing uses pickling)
    # 'data' arrives as a manager list of characters; join it back into
    # a string once so substring tests work
    text = ''.join(data)
    queue.put([k for k in keys if k in text])

def mass_replace(data, mappings):
    manager = mp.Manager()
    queue = mp.Queue()
    # Data will be SHARED (not duplicated for each process)
    d = manager.list(data)
    # Split the MAPPINGS KEYS up into multiple LISTS,
    # same number as CPUs
    key_batches = split_seq(mappings.keys(), mp.cpu_count())
    # Start the key detections
    processes = []
    for i, keys in enumerate(key_batches):
        p = mp.Process(target=detect_active_keys, args=(keys, d, queue))
        # This is non-blocking
        p.start()
        processes.append(p)
    # Consume the output from the queues
    active_keys = []
    for p in processes:
        # We expect one result (a list of keys) per process exactly
        # (this is blocking)
        active_keys.extend(queue.get())
    # Wait for the processes to finish
    for p in processes:
        # Note that you MUST only call join() after
        # calling queue.get()
        p.join()
    # Same as original submission, now with MUCH fewer keys
    for key in active_keys:
        data = data.replace(key, mappings[key])
    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from
    # here, due to how multiprocessing works
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)
Some notes:
I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid python. It should not be difficult to get it to run.
Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster with the same code. The docs for multiprocessing show how to work with machines on a network.
This code is still pretty simple. I would love to know whether it improves your speed at all.
There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.
According to http://pravin.paratey.com/posts/super-quick-find-replace, regex is the fastest way to go in Python (building a trie data structure would be fastest in C++):
import re

class Regex:
    # Regex implementation of find/replace for a massive word list.
    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, matchObj):
        key = matchObj.group(0)
        if key in self._mappings:
            return self._mappings[key]
        else:
            return key

    def replace_all(self, filename):
        with open(filename, 'r') as fp:
            text = fp.read()
        text = re.sub("[a-zA-Z]+", self.replace_func, text)
        with open(filename, 'w') as fp:
            fp.write(text)
# mapping dictionary of find -> replace pairs
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# initialize Regex class with the mapping dictionary
r = Regex(mappings)
# replace in file
r.replace_all('file')
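One caveat with the snippet above: the [a-zA-Z]+ pattern only matches runs of letters, so keys like 'original-1' would never be matched whole. A sketch of a variant that builds one alternation pattern directly from the mapping keys themselves (my own sketch; whether it beats plain str.replace at ~60000 keys is something you'd have to measure):
import re

# Build one alternation pattern from the mapping keys, longest first so
# longer matches win, with re.escape applied in case keys contain
# regex metacharacters.
pattern = re.compile("|".join(
    re.escape(k) for k in sorted(mappings, key=len, reverse=True)))

def replace_all_keys(text):
    # Look up each match in the dict and substitute its replacement.
    return pattern.sub(lambda m: mappings[m.group(0)], text)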
The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched for all M substrings at each position, instead of making M passes over the whole string, might offer some cache/paging benefits, but it would be a lot more complicated, probably for only a small benefit.
So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)
The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so that, e.g., "original-1" and "original-2" are both branches off the same "original-" subtree and don't need to be handled separately until the very last character.
The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)
There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and package filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.
Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.
Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, which you can then run against your data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:
import ahocorasick

tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()

# target holds the file contents; replacing from the end keeps the
# earlier match offsets valid
for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]
This uses a with block to prevent leaking file descriptors. str.replace ensures that all instances of each key get replaced within the text.
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
with open('file', 'r+') as fd:
    # read in all the data
    text = fd.read()
    # seek to the start of the file and truncate so the file is edited in place
    fd.seek(0)
    fd.truncate()
    for key in mappings.keys():
        text = text.replace(key, mappings[key])
    fd.write(text)
I'm having a problem with sockets in python.
I have a TCP server and client that send each other data in a while 1 loop.
It packages up 2 shorts with the struct module (struct.pack("hh", mousex, mousey)). But sometimes when recv'ing the data on the other computer, it seems like 2 messages have been glued together. Is this Nagle's algorithm?
What exactly is going on here? Thanks in advance.
I agree with the other posters that "TCP just does that". TCP guarantees that your bytes arrive in the right order, but makes no guarantees about the sizes of the chunks they arrive in. I would add that TCP is also allowed to split a single send into multiple recvs, or even, for example, to split aabb, ccdd into aab, bcc, dd.
I put together this module for dealing with the relevant issues in python:
http://stromberg.dnsalias.org/~strombrg/bufsock.html
It's under an open-source license and is owned by UCI. It's been tested on CPython 2.x, CPython 3.x, PyPy and Jython.
HTH
To be sure I'd have to see actual code, but it sounds like you are expecting a send of n bytes to show up on the receiver as exactly n bytes all the time, every time.
TCP streams don't work that way. It's a "streaming" protocol, as opposed to a "datagram" (record-oriented) one like UDP, SCTP, or RDS.
For fixed-data-size protocols (or any where the next chunk size is predictable in advance), you can build your own "datagram-like receiver" on a stream socket by simply recv()ing in a loop until you get exactly n bytes:
def recv_n_bytes(socket, n):
    "attempt to receive exactly n bytes; return what we got"
    data = []
    while True:
        have = sum(len(x) for x in data)
        if have >= n:
            break
        want = n - have
        got = socket.recv(want)
        if got == '':
            break
        data.append(got)
    return ''.join(data)
(untested; python 2.x code; not necessarily efficient; etc).
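For the asker's fixed-size struct.pack("hh", ...) packets, usage could look something like this (my own sketch, in the same Python 2 string style; conn stands for the connected socket):
import struct

PACKET_SIZE = struct.calcsize("hh")   # 4 bytes for two shorts

raw = recv_n_bytes(conn, PACKET_SIZE)
if len(raw) == PACKET_SIZE:
    mousex, mousey = struct.unpack("hh", raw)
else:
    pass  # short read: the peer closed the connection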
You may not assume that data will become available for reading from the local socket in the same size pieces it was provided for sending at the other end. As you have seen, this might often appear to be true, but it is by no means reliable. Rather, what TCP guarantees is that what goes in one end will eventually come out the other, in order, with nothing missing; and if that cannot be achieved by the means built into the protocol, such as retries, then the whole thing breaks with an error.
Nagle is one possible cause, but not the only one.
I need to send an array of namedtuples by a socket.
To create the array of namedtuples I use the following:
listaPeers = []
for i in range(200):
    ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')
    ipPuerto.ip = "121.231.334.22"
    ipPuerto.puerto = "8988"
    listaPeers.append(ipPuerto)
Now that it is filled, I need to pack "listaPeers[200]".
How can I do it?
Something like?:
packedData = struct.pack('XXXX',listaPeers)
First of all you are using namedtuple incorrectly. It should look something like this:
# ipPuerto is a type
ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
# theTuple is a tuple object
theTuple = ipPuerto("121.231.334.22", "8988")
As for packing, it depends what you want to use on the other end. If the data will be read by Python, you can just use Pickle module.
import cPickle as Pickle
pickledTuple = Pickle.dumps(theTuple)
You can pickle whole array of them at once.
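A minimal round-trip sketch of that, in the same Python 2 style (note that the namedtuple class has to be defined with the same name, and be importable, on both ends for unpickling to work):
import collections
import cPickle as Pickle

ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')
listaPeers = [ipPuerto("121.231.334.22", "8988") for _ in range(200)]

payload = Pickle.dumps(listaPeers, Pickle.HIGHEST_PROTOCOL)  # bytes to send
# ... send payload over the socket, then on the receiving side:
restored = Pickle.loads(payload)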
It is not that simple. Yes, for integers and simple numbers it is possible to pack straight from named tuples to the data format provided by the struct module.
However, you are holding your data as strings, not as numbers. Converting the port is simple, as it is a plain integer, but the IP requires some juggling:
import struct

def ipv4_from_str(ip_str):
    # Pack the four dotted-quad parts into a single 32-bit integer
    parts = ip_str.split(".")
    result = 0
    for part in parts:
        result <<= 8
        result += int(part)
    return result

def ip_puerto_gen(list_of_ips):
    # Yield ip, port, ip, port, ... as plain integers
    for ip_puerto in list_of_ips:
        yield ipv4_from_str(ip_puerto.ip)
        yield int(ip_puerto.puerto)

def pack(list_of_ips):
    return struct.pack(">" + "II" * len(list_of_ips),
                       *ip_puerto_gen(list_of_ips))
You then use the "pack" function from here to pack your structure as you seem to want.
But first, attend to the fact that you are creating your "listaPeers" incorrectly (your example code will simply fail with an IndexError): use an empty list, and the append method on it, to insert new namedtuples with ip/port pairs as each element:
listaPeers = []
ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')
for x in range(200):
    new_element = ipPuerto("123.123.123.123", "8192")
    listaPeers.append(new_element)

data = pack(listaPeers)
ISTR that pickle is considered insecure in server processes, if the server process is receiving pickled data from untrusted clients.
You might want to come up with some sort of separator character(s) for the records and fields (perhaps \0 and \001 or \376 and \377). Then putting together a message is kind of like a text file broken up into records and fields separated by spaces and newlines. Or for that matter, you could use spaces and newlines, if your normal data doesn't include these.
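A rough sketch of that separator idea, assuming neither byte ever occurs in the actual data:
# '\x00' between fields, '\x01' between records
message = "\x01".join("\x00".join((p.ip, p.puerto)) for p in listaPeers)

# ... send message over the socket, then on the receiving side:
received = [tuple(record.split("\x00")) for record in message.split("\x01")]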
I find this module very valuable for framing data in socket-based protocols:
http://stromberg.dnsalias.org/~strombrg/bufsock.html
It lets you do things like "read up until the next null byte" or "read the next 10 characters" - without needing to worry about the complexities of IP aggregating or splitting packets.
I have a socket open and I'd like to read some JSON data from it. The problem is that the json module from the standard library can only parse from strings (load just reads the whole file and calls loads inside); it even looks like everything inside the module depends on the parameter being a string.
This is a real problem with sockets, since you can never read it all into a string and you don't know how many bytes to read before you actually parse it.
So my questions are: Is there a (simple and elegant) workaround? Is there another JSON library that can parse data incrementally? Is it worth writing it myself?
Edit: It is the XBMC JSON-RPC API. There are no message envelopes, and I have no control over the format. Each message may be on a single line or on several lines.
I could write a simple parser that needs only a getc function in some form and feed it using s.recv(1), but this doesn't seem like a very Pythonic solution and I'm a little lazy to do that :-)
Edit: given that you aren't defining the protocol, this isn't useful, but it might be useful in other contexts.
Assuming it's a stream (TCP) socket, you need to implement your own message framing mechanism (or use an existing higher level protocol that does so). One straightforward way is to define each message as a 32-bit integer length field, followed by that many bytes of data.
Sender: take the length of the JSON packet, pack it into 4 bytes with the struct module, send it on the socket, then send the JSON packet.
Receiver: Repeatedly read from the socket until you have at least 4 bytes of data, use struct.unpack to unpack the length. Read from the socket until you have at least that much data and that's your JSON packet; anything left over is the length for the next message.
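A rough sketch of that framing for a blocking TCP socket (the send_msg/recv_msg names are just mine for illustration):
import json
import struct

def send_msg(sock, obj):
    # 4-byte big-endian length prefix, then the JSON payload
    payload = json.dumps(obj).encode('utf-8')
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_exactly(sock, n):
    # Keep calling recv until we have exactly n bytes
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed mid-message")
        buf += chunk
    return buf

def recv_msg(sock):
    (length,) = struct.unpack('!I', recv_exactly(sock, 4))
    return json.loads(recv_exactly(sock, length).decode('utf-8'))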
If at some point you're going to want to send messages that consist of something other than JSON over the same socket, you may want to send a message type code between the length and the data payload; congratulations, you've invented yet another protocol.
Another, slightly more standard, method is DJB's Netstrings protocol; it's very similar to the system proposed above, but with text-encoded lengths instead of binary; it's directly supported by frameworks such as Twisted.
If you're getting the JSON from an HTTP stream, use the Content-Length header to get the length of the JSON data. For example:
import httplib
import json

h = httplib.HTTPConnection('graph.facebook.com')
h.request('GET', '/19292868552')
response = h.getresponse()
content_length = int(response.getheader('Content-Length', '0'))

# Read data until we've read Content-Length bytes or the socket is closed
data = ''
while len(data) < content_length or content_length == 0:
    s = response.read(content_length - len(data))
    if not s:
        break
    data += s

# We now have the full data -- decode it
j = json.loads(data)
print j
What you want(ed) is ijson, an incremental json parser.
It is available here: https://pypi.python.org/pypi/ijson/ . Usage should be as simple as (adapted from that page; ijson.items takes a prefix argument, and 'item' matches the elements of a top-level JSON array):
import ijson.backends.python as ijson

for item in ijson.items(file_obj, 'item'):
    # ... do something with each parsed item
(for those who prefer something self-contained - in the sense that it relies only on the standard library: I wrote yesterday a small wrapper around json - but just because I didn't know about ijson. It is probably much less efficient.)
EDIT: since I found out that in fact (a cythonized version of) my approach was much more efficient than ijson, I have packaged it as an independent library - see here also for some rough benchmarks: http://pietrobattiston.it/jsaone
Do you have control over the json? Try writing each object as a single line. Then do a readline call on the socket as described here.
infile = sock.makefile()
while True:
    line = infile.readline()
    if not line:
        break
    # ...
    result = json.loads(line)
Skimming the XBMC JSON RPC docs, I think you want an existing JSON-RPC library - you could take a look at:
http://www.freenet.org.nz/dojo/pyjson/
If that's not suitable for whatever reason, it looks to me like each request and response is contained in a JSON object (rather than a loose JSON primitive that might be a string, array, or number), so the envelope you're looking for is the '{ ... }' that defines a JSON object.
I would, therefore, try something like (pseudocode):
while not dead:
    read from the socket and append it to a string buffer
    set a depth counter to zero
    walk each character in the string buffer:
        if you encounter a '{':
            increment depth
        if you encounter a '}':
            decrement depth
            if depth is zero:
                remove what you have read so far from the buffer
                pass that to json.loads()
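A rough Python sketch of that pseudocode; it is naive in the same way (it ignores braces inside JSON string values), and sock and handle stand for your connected socket and whatever you do with each decoded object:
import json

def extract_objects(buf):
    # Scan buf for balanced top-level {...} spans; return (objects, leftover).
    objs, depth, start, last_end = [], 0, 0, 0
    for i, ch in enumerate(buf):
        if ch == '{':
            if depth == 0:
                start = i
            depth += 1
        elif ch == '}' and depth > 0:
            depth -= 1
            if depth == 0:
                objs.append(json.loads(buf[start:i + 1]))
                last_end = i + 1
    return objs, buf[last_end:]

buf = ""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    buf += chunk.decode('utf-8')
    messages, buf = extract_objects(buf)
    for msg in messages:
        handle(msg)  # hypothetical handler for each decoded object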
You may find JSON-RPC useful for this situation. It is a remote procedure call protocol that should allow you to call the methods exposed by the XBMC JSON-RPC. You can find the specification on Trac.
res = str(s.recv(4096), 'utf-8') # Getting a response as string
res_lines = res.splitlines() # Split the string to an array
last_line = res_lines[-1] # Normally, the last one is the json data
pair = json.loads(last_line)
https://github.com/A1vinSmith/arbitrary-python/blob/master/sockets/loopHost.py