How to speed up language-tool-python library use case - python

I have a pandas dataframe with 3 million rows of social media comments. I'm using the language-tool-python library to find the number of grammatical errors in each comment. As far as I know, the language-tool library by default sets up a local LanguageTool server on your machine and queries responses from it.
Getting the number of grammatical errors just consists of creating an instance of the language tool object and calling its .check() method with the string you want to check as a parameter.
>>> tool = language_tool_python.LanguageTool('en-US')
>>> text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
>>> matches = tool.check(text)
>>> len(matches)
2
So the method I used is df['body_num_errors'] = df['body'].apply(lambda row: len(tool.check(row))). Now I am pretty sure this works. It's quite straightforward. This single line of code has been running for the past hour.
Running the above example took 10-20 seconds, so with 3 million instances it might as well take virtually forever.
Is there any way I can cut my losses and speed this process up? Would iterating over every row and putting the whole thing inside a ThreadPoolExecutor help? Intuitively it makes sense to me, as it's an I/O-bound task.
I am open to any suggestions on how to speed up this process, and if the above method works, I would appreciate it if someone could show me some sample code.
edit - Correction.
It takes 10-20 seconds including the instantiation; calling the method itself is almost instantaneous.

I'm the creator of language_tool_python. First, none of the comments here make sense. The bottleneck is in tool.check(); there is nothing slow about using pd.DataFrame.map().
LanguageTool is running on a local server on your machine. There are at least two major ways to speed this up:
Method 1: Initialize multiple servers
servers = []
for i in range(100):
    servers.append(language_tool_python.LanguageTool('en-US'))
Then make calls to each server from a different thread. Or, alternatively, initialize each server within its own thread.
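For illustration, here is a minimal sketch of that idea using concurrent.futures - the pool size, the round-robin routing, and the df['body'] column name are assumptions for the example, not part of the library's API:

import concurrent.futures
import language_tool_python

N_SERVERS = 4  # illustrative; tune to your machine

# One local LanguageTool server per worker thread
servers = [language_tool_python.LanguageTool('en-US') for _ in range(N_SERVERS)]

def count_errors(indexed_text):
    index, text = indexed_text
    tool = servers[index % N_SERVERS]  # route each comment to one of the servers
    return len(tool.check(text))

with concurrent.futures.ThreadPoolExecutor(max_workers=N_SERVERS) as executor:
    df['body_num_errors'] = list(executor.map(count_errors, enumerate(df['body'])))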
Method 2: Increase the thread count
LanguageTool takes a maxCheckThreads option – see the LT HTTPServerConfig documentation – so you could also try playing around with that? From a glance at LanguageTool's source code, it looks like the default number of threads in a single LanguageTool server is 10.
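If you want to experiment with that from Python, the option can presumably be passed through the same config dict shown in the next answer - a hedged sketch, since the exact key name and accepted values should be verified against the LT HTTPServerConfig documentation:

import language_tool_python

# Assumes 'maxCheckThreads' is accepted as a server config key (verify in the LT docs).
tool = language_tool_python.LanguageTool('en-US', config={'maxCheckThreads': 20})
matches = tool.check('A sentence with a error.')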

In the documentation, we can see that language-tool-python has the configuration option maxSpellingSuggestions.
However, despite the name of the variable and the default value being 0, I have noticed that the code runs noticeably faster (almost 2 times faster) when this parameter is actually set to 1.
I don't know where this discrepancy comes from, and the documentation does not mention anything specific about the default behavior. It is a fact, however, that this setting improves performance (at least for my own dataset, which I don't think affects the running time much).
Example initialization:
import language_tool_python
language_tool = language_tool_python.LanguageTool('en-US', config={'maxSpellingSuggestions': 1})

If you are worried about scaling up with pandas, switch to Dask instead. It integrates with pandas and will use multiple cores on your CPU (which I am assuming you have) instead of the single core that pandas uses. This helps parallelize the 3 million instances and can speed up your execution time. You can read more about Dask here or see an example here.
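As a rough sketch of what that might look like (the partition count, the 'body' column name, and creating one LanguageTool server per partition are assumptions to adapt to your setup):

import dask.dataframe as dd
import language_tool_python

def count_errors_partition(partition):
    # One LanguageTool server per partition, since the tool object is not
    # easily shared across worker processes.
    tool = language_tool_python.LanguageTool('en-US')
    try:
        return partition['body'].apply(lambda text: len(tool.check(text)))
    finally:
        tool.close()

ddf = dd.from_pandas(df, npartitions=8)  # 8 is illustrative
counts = ddf.map_partitions(count_errors_partition, meta=('body_num_errors', 'int64'))
df['body_num_errors'] = counts.compute()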

Related

How can I store and restore random state in numpy.random.generator instances?

I did a quick search and the only relevant questions I found talk about the old numpy.random interface. I am trying to understand how to use the new interface. I would like to be able to run some simulation for a given amount of time. Then I want to store the random number generator state information to a file so that I can continue the simulation at a later time.
I have found one way to accomplish this, but it seems to me to be a bad idea since it isn't documented in the API anywhere. I'm wondering if there is a simple way that I have somehow overlooked.
Let's say that I start a simulation with the following code.
from numpy.random import Generator, PCG64
rg = Generator(PCG64(12345))
rg.standard_normal(1024)
save_to_file('state.txt', rg.bit_generator.state)
print(rg.standard_normal(8))
Here, save_to_file saves the dictionary returned by rg.bit_generator.state to state.txt. Now, if I want to continue processing the simulation where I saved it at a later time, I can do so by using the following.
from numpy.random import Generator, PCG64
rg = Generator(PCG64())
rg.bit_generator.state = load_from_file('state.txt')
print(rg.standard_normal(8))
This works; the same 8 numbers are printed for me. I figured out how to do this by inspecting the bit_generator object in the Python console. I am using Python 3.6.8 and NumPy 1.18.4. The documentation here and here on the bit_generator object is extremely sparse and doesn't have any suggestions for this common (at least in my work) scenario.
This answer to a similar question about the older interface seems to suggest that it is quite difficult to do this for the Mersenne Twister (MT19937), but I am using the PCG64 algorithm, which seems not to have as much internal state - at least judging by the success of the code I have provided. Is there a better way to accomplish this? One that is either documented or condoned by the community at large? Something that won't break without warning if I one day decide to update NumPy.
Accessing the bit generator through rg is the same as declaring pg = PCG64() and then accessing pg.state. There's nothing wrong with accessing it via rg.bit_generator. The docs are a bit scarce, but the docs for BitGenerator state that BitGenerator.state allows you to get and set the state of the bit generator you chose.
https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.PCG64.state.html?highlight=pcg64
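Since rg.bit_generator.state is a plain dictionary of built-in types, one way to implement the save_to_file / load_from_file placeholders from the question is with json - a minimal sketch, assuming the state dict stays JSON-serializable (it is for PCG64, whose state consists of strings and integers):

import json
from numpy.random import Generator, PCG64

def save_to_file(path, state):
    # PCG64's state dict contains only strings and (arbitrarily large) ints,
    # which json handles fine.
    with open(path, 'w') as f:
        json.dump(state, f)

def load_from_file(path):
    with open(path) as f:
        return json.load(f)

rg = Generator(PCG64(12345))
rg.standard_normal(1024)
save_to_file('state.txt', rg.bit_generator.state)
expected = rg.standard_normal(8)

rg2 = Generator(PCG64())
rg2.bit_generator.state = load_from_file('state.txt')
assert (rg2.standard_normal(8) == expected).all()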

Strategies for speeding up string searches in Python

I need some help. I've been working on a file searching app as I learn Python, and it's been a very interesting experience so far, learned a lot and realized how little that actually is.
So, it's my first app, and it needs to be fast! I am unsatisfied with (among other things) the speed of finding matches for sparse searches.
The app caches file and folder names as dbm keys, and the search is basically running search words past these keys.
The GUI is in Tkinter, and to try not to get it jammed, I've put my search loop in a thread. The thread receives queries from the GUI via a queue, then passes results back via another queue.
This is how the code looks:
def TMakeSearch(fdict, squeue=None, rqueue=None):
    '''Circumventing StopIteration(), did not see speed advantage'''
    RESULTS_PER_BATCH=50
    if whichdb(DB)=='dbhash' or 'dumb' in whichdb(DB):
        '''iteration is not implemented for gdbm and (n)dbm, forced to
        pop the keys out in advance for "for key in fdict:" '''
        fdict=fdict
    else:
        # 'dbm.gnu', 'gdbm', 'dbm.ndbm', 'dbm'
        fdict=fdict.keys()

    search_list=None
    while True:
        query=None
        while not squeue.empty():
            #more items may get in (or not?) while condition is checked
            query=squeue.get()
        try:
            search_list=query.lower().encode(ENCODING).split()
            if Tests.is_query_passed:
                print(search_list)
        except:
            #No new query, or a new database has been created and needs to be synced
            sleep(0.1)
            continue
        else:
            is_new_query=True
            result_batch=[]
            for key in fdict:
                separator='*'.encode(ENCODING)  #Python 3, yaaay
                filename=key.split(separator)[0].lower()
                #Add key if matching
                for token in search_list:
                    if not token in filename:
                        break
                else:
                    #Loop hasn't ended abruptly
                    result_batch.append(key)
                if len(result_batch)>=RESULTS_PER_BATCH:
                    #Time to send off a batch
                    rqueue.put((result_batch, is_new_query))
                    if Tests.is_result_batch:
                        print(result_batch, len(result_batch))
                        print('is_result_batch: results on queue')
                    result_batch=[]
                    is_new_query=False
                    sleep(0.1)
                if not squeue.empty():
                    break
            #Loop ended naturally, with some batch<50
            rqueue.put((result_batch, is_new_query))
Once there are only a few matching results, the results cease to arrive in real time and instead take a few seconds, and that's on my smallish 120GB hard disk.
I believe it can be faster, and wish to make the search real-time.
What approaches exist to make the search faster?
My current ideas all involve ramping up the machinery that I use - use multiprocessing somehow, use Cython, or perhaps somehow use ctypes to make the searches circumvent the Python runtime.
However, I suspect there are simpler things that can be done to make it work, as I am not savvy with Python and optimization.
Assistance please!
I wish to stay within the standard library if possible, as a proof of concept and for portability (currently I only use scandir as an external library, on Python < 3.5), so for example ctypes would be preferable to Cython.
If it's relevant/helpful, the rest of the code is here -
https://github.com/h5rdly/Jiffy
EDIT:
This is the heart of the function, give or take a few pre-arrangements:
for key in fdict:
    for token in search_list:
        if not token in key:
            break
    else:
        result_batch.append(key)
where search_list is a list of strings, and fdict is a dictionary or a dbm (didn't see a speed difference trying both).
This is what I wish to make faster, so that results arrive in real-time, even when there are only few keys containing my search words.
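For reference, a compact equivalent of that loop using the all() builtin (same behavior, minus the batching, with fdict and search_list as defined above):

result_batch = [key for key in fdict
                if all(token in key for token in search_list)]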
EDIT 2:
On #hpaulj's advice, I've put the dbm keys in a (frozen) set, to gain a noticeable improvement on Windows/Python27 (dbhash):
I have some caveats though -
For my ~50 GB in use, the frozenset takes 28 MB, as measured by pympler.asizeof. So for the full 1 TB, I suspect it'll take a nice share of RAM.
On Linux, for some reason, the conversion not only doesn't help, but the query itself also stops getting updated in real time for the duration of the search, making the GUI look unresponsive.
On Windows, this is almost as fast as I want, but still not warp-immediate.
So this comes around to this addition:
if 'win' in sys.platform:
    try:
        fdict=frozenset(fdict)
    except:
        fdict=frozenset(fdict.keys())
Since it would take a significant amount of RAM for larger disks, I think I'll add it as an optional faster search for now, "Scorch Mode".
I wonder what to do next. I thought that perhaps, if I could somehow export the keys/filenames to a datatype that ctypes can pass along, I could then pop a relevant C function to do the searches.
Also, perhaps learn the Python bytecode and do some lower-level optimization.
I'd like this to be as fast as Python would let me, please advise.

What is the difference between random.normalvariate() and random.gauss() in python?

What is the difference between random.normalvariate() and random.gauss()?
They take the same parameters and return the same value, performing essentially the same function.
I understand from a previous answer that random.gauss() is not thread-safe, but what does this mean in this context? Why should a programmer care about this? Alternatively posed, why were both a thread-safe and a non-thread-safe version included in Python's 'random'?
This is an interesting question. In general, the best way to know the difference between two python implementations is to inspect the code yourself:
import inspect, random
str_gauss = inspect.getsource(random.gauss)
str_nv=inspect.getsource(random.normalvariate)
and then print each of the strings to see how the sources differ. A quick look at the code shows that not only do they behave differently with respect to multithreading, but also that the algorithms are not the same; for example, normalvariate uses something called the Kinderman and Monahan method, as per the following comments in str_nv:
# Uses Kinderman and Monahan method. Reference: Kinderman,
# A.J. and Monahan, J.F., "Computer generation of random
# variables using the ratio of uniform deviates", ACM Trans
# Math Software, 3, (1977), pp257-260.
Thread-safe pieces of code must account for possible race conditions during execution. This introduces overhead as a result of synchronization schemes like mutexes, semaphores, etc.
However, if your code is only ever called from a single thread, no race conditions arise, which essentially means you can get away with a non-reentrant implementation that executes a bit faster. I guess this is why random.gauss() was introduced, since the Python docs say it's faster than the thread-safe version.
I'm not entirely sure about this, but the Python documentation says that random.gauss is slightly faster, so if you're OK with it not being thread-safe then you can go a little faster.
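If you want to check the speed difference on your own machine, a quick hedged measurement with timeit (absolute numbers will vary by Python version and hardware):

import timeit

t_gauss = timeit.timeit('random.gauss(0, 1)', setup='import random', number=1000000)
t_normalvariate = timeit.timeit('random.normalvariate(0, 1)', setup='import random', number=1000000)
print('gauss:        ', t_gauss)
print('normalvariate:', t_normalvariate)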
In a multi-threaded program, two threads calling random.gauss at nearly the same time can run its internal code concurrently, potentially before the first call has had a chance to return. Because gauss caches a second generated value between calls, that shared internal state may not be updated consistently, which can corrupt the output.
random.normalvariate keeps no such state between calls, so concurrent calls do not interfere with each other in this way.
The advantage of random.gauss is therefore that it is slightly faster, but under concurrent use it may produce erroneous output.

Benchmarking run times in Python

I have to benchmark JSON serialization time and compare it to thrift and Google's protocol buffer's serialization time. Also it has to be in Python.
I was planning on using the Python profilers.
http://docs.python.org/2/library/profile.html
Would the profiler be the best way to find function runtimes? Or would outputting a timestamp before and after the function call be the better option?
Or is there an even better way?
From the profile docs that you linked to:
Note The profiler modules are designed to provide an execution profile for a given program, not for benchmarking purposes (for that, there is timeit for reasonably accurate results). This particularly applies to benchmarking Python code against C code: the profilers introduce overhead for Python code, but not for C-level functions, and so the C code would seem faster than any Python one.
So, no, you do not want to use profile to benchmark your code. What you want to use profile for is to figure out why your code is too slow, after you already know that it is.
And you do not want to output a timestamp before and after the function call, either. There are just way too many things you can get wrong that way if you're not careful (using the wrong timestamp function, letting the GC run a cycle collection in the middle of your test run, including test overhead in the loop timing, etc.), and timeit takes care of all of that for you.
Something like this is a common way to benchmark things:
import timeit

for impl in 'mycode', 'googlecode', 'thriftcode':
    t = timeit.timeit('serialize(data)',
                      setup='''from {} import serialize
with open('data.txt') as f: data=f.read()
'''.format(impl),
                      number=10000)
    print('{}: {}'.format(impl, t))
(I'm assuming here that you can write three modules that wrap the three different serialization tools in the same API, a single serialize function that takes a string and does something or other with it. Obviously there are different ways to organize things.)
You should be careful when you are profiling Python code based on a timestamp taken at the start and end of your program. This does not take into account other processes that might also be running concurrently.
Instead, you should consider looking at
Is there any simple way to benchmark python script?

Counting number of symbols in Python script

I have a Telit module which runs [Python 1.5.2+](http://www.roundsolutions.com/techdocs/python/Easy_Script_Python_r13.pdf). There are certain restrictions on the number of variable, module and method names I can use (< 500), the size of each variable (16k) and the amount of RAM (~1MB). Refer to pages 113 and 114 for details. I would like to know how to get the number of symbols being generated, the size in RAM of each variable, and memory usage (stack and heap usage).
I need something similar to a map file that gets generated with gcc after the linking process which shows me each constant / variable, symbol, its address and size allocated.
Python is an interpreted and dynamically-typed language, so generating that kind of output is very difficult, if it's even possible. I'd imagine that the only reasonable way to get this information is to profile your code on the target interpreter.
If you're looking for a true memory map, I doubt such a tool exists since Python doesn't go through the same kind of compilation process as C or C++. Since everything is initialized and allocated at runtime as the program is parsed and interpreted, there's nothing to say that one interpreter will behave the same as another, especially in a case such as this where you're running on such a different architecture. As a result, there's nothing to say that your objects will be created in the same locations or even with the same overall memory structure.
If you're just trying to determine memory footprint, you can do some manual checking with sys.getsizeof(object[, default]), provided that it is supported by Telit's libs. I don't think they're using a straight implementation of CPython. Even then, this doesn't always work and will raise a TypeError when an object's size cannot be determined, if you don't specify the default parameter.
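For a rough idea of how sys.getsizeof behaves (this is on a desktop CPython interpreter; the Telit interpreter may differ or lack it entirely):

import sys

print(sys.getsizeof(42))            # size in bytes of a small int object
print(sys.getsizeof('hello'))       # size of a short string
print(sys.getsizeof(object(), -1))  # the second argument is returned if the size cannot be determined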
You might also get some interesting results by studying the output of the dis module's bytecode disassembly, but that assumes that dis works on your interpreter, and that your interpreter is actually implemented as a VM.
If you just want a list of symbols, take a look at this recipe. It uses reflection to dump a list of symbols.
Good manual testing is key here. Your best bet is to set up the module's CMUX (COM port MUXing), and watch the console output. You'll know very quickly if you start running out of memory.
This post makes me recall my pain with Telit GM862-GPS modules. My code was exactly at the point where the number of variables, strings, etc. added up to the limit. Of course, I didn't know this fact at the time. I added one innocent line and my program did not work any more. It drove me really crazy for two days until I looked at the datasheet and found this fact.
What you are looking for might not have a good answer, because the Python interpreter on the module is not a full-fledged version. What I did was reuse the same local variable names as much as possible. I also deleted doc strings for functions (those count too) and replaced them with # comments.
In the end, I want to say that this module is good for small applications. The Python interpreter does not support threads or interrupts, so your program must be a super loop. When your application gets bigger, each iteration will take longer. Eventually, you might want to switch to a faster platform.
