I have a Python script that counts the occurrences of every k-long substring in a very large text. This is how it does it, after having stored and deduplicated the substrings:
counts = {}
for s in all_substrings:
    counts[s] = full_text.count(s)
I was surprised to see that this script uses only 4% CPU on average. I have a 4-core, 8-thread CPU, but no core is used at more than single-digit percentages. I would have expected the script to use 100% of one core, since it doesn't do IO.
Why does it use so little computing power, and how can I improve that?
Your snippet is likely a memory-bandwidth-limited program: you do next to no computation in that loop. However, most CPU monitoring programs report 100% for a program that just accesses memory, so I am puzzled too.
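Independent of the CPU-usage mystery: the loop rescans full_text once per substring. If all_substrings are simply the k-grams of full_text, a single sliding-window pass touches the text only once - a minimal sketch, assuming a fixed k (the value 8 is just illustrative). Note this counts overlapping occurrences, whereas str.count counts non-overlapping ones, so results differ for self-overlapping substrings:
from collections import Counter

k = 8  # illustrative; use your actual substring length
counts = Counter(full_text[i:i + k] for i in range(len(full_text) - k + 1))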
If you want to see more of what is happening, I recommend you play with the excellent perf tool on Linux.
You could start by looking at page faults:
# Sample page faults with stack traces, until Ctrl-C:
perf record -e page-faults -ag
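Afterwards, perf report summarizes the recorded samples, including the call stacks that triggered the faults:
# Browse the recorded perf.data:
perf report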
I was a victim of the BD PROCHOT bug in Dell's XPS 15, which constantly limited the CPU to 800 MHz. Disabling BD PROCHOT with ThrottleStop fixed the problem, and my script now uses 100% of the CPU cores.
Related
I am currently working on a project where a Python program is supposed to run for several days, essentially in an endless loop until a user intervenes.
I have observed that the RAM usage (as shown in the Windows Task Manager) rises slowly but steadily, for example from ~80 MB at program start to ~120 MB after one day. To get a closer look at this, I started to log the allocated memory with
tracemalloc.get_traced_memory() at regular intervals throughout the program execution. The output was written to a time series DB (see image below).
tracemalloc output for one day runtime
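The sampling itself is essentially this (a minimal sketch; the interval and the helper that writes to the time series DB are illustrative stand-ins):
import time
import tracemalloc

tracemalloc.start()
while True:
    # Bytes currently allocated by Python, and the peak since start().
    current, peak = tracemalloc.get_traced_memory()
    log_to_timeseries_db(current, peak)  # hypothetical helper writing to the DB
    time.sleep(60)                       # illustrative sampling interval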
To me it looks like the memory needed by the program does not accumulate over time. How does this fit with the output of the Windows Task Manager? Should I go through my program to search for growing data structures?
Thank you very much in advance!
Okay, turns out the answer is: no, this is not proper behaviour; the RAM usage can stay absolutely stable. I have tested this for three weeks now and the RAM usage never exceeded 80 MB.
The problem was in the usage of the InfluxDB v2 client.
You need to close both the write_api (implicitly done with the "with ... as write_api:" statement) and the client itself (explicitly done via the "client.close()" in the example below).
In my previous version, which had increasing memory usage, I only closed the write_api and not the client.
import time

import influxdb_client
from influxdb_client.client.write_api import SYNCHRONOUS

client = influxdb_client.InfluxDBClient(url=self.url, token=self.token, org=self.org)
with client.write_api(write_options=SYNCHRONOUS) as write_api:
    # force datatypes, because influx does not do fluffy ducktyping
    datapoint = influxdb_client.Point("TaskPriorities")\
        .tag("task_name", str(task_name))\
        .tag("run_uuid", str(run_uuid))\
        .tag("task_type", str(task_type))\
        .field("priority", float(priority))\
        .field("process_time_h", float(process_time))\
        .time(time.time_ns())
    answer = write_api.write(bucket=self.bucket, org=self.org, record=datapoint)
client.close()
I downloaded the full Wikipedia archive (14.9 GB) and I am running this line of code:
from gensim.corpora import WikiCorpus  # import shown for completeness
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
My code doesn't seem to be getting past here and it has been running for an hour now. I understand that the target file is massive, but I was wondering how I can tell it is working, and what the expected time for it to complete is?
You can often use an OS-specific monitoring tool, such as top on Linux/Unix/MacOS systems, to get an idea whether your Python process is intensely computing, using memory, or continuing with IO.
Even the simple vocabulary scan done when first instantiating WikiCorpus may take a long time, to both decompress and tokenize/tally, so I wouldn't be surprised by a runtime longer than an hour. (And if it's relying on any virtual memory/swapping during this simple operation, as may be clear from the output of top or similar monitoring, that would slow things down even more.)
As a comparative baseline, you could time how long decompression-only takes with a shell command like:
% time bzcat enwiki-latest-pages-articles.xml.bz2 | wc
(A quick test on my MacBook Pro suggests 15GB of BZ2 data might take 30-minutes-plus just to decompress.)
In some cases, turning on Python logging at the INFO level will display progress information from gensim modules, though I'm not sure WikiCorpus shows anything until it finishes. Enabling INFO-level logging can be as simple as:
import logging
logging.basicConfig(level=logging.INFO)
I'm trying to improve my own GDB pretty printers using the GDB python API.
Currently I'm testing them with a core.
I'm trying to get info for some QMap and QList contents, but they have so many elements that printing them with their contents is really slow (minutes).
So, I would like to know if there is any known way to profile which parts are slower.
I've already checked the Python profile manual and google-perftools, but I don't know how to use them in the GDB execution cycle.
gdbcommands.txt:
source /home/user/codigo/git/kde-dev-scripts/gdb/load-qt5printers.py
source /home/user/codigo/myownprinters.py
core ../../core.QThread.1493215378
bt
p longQMapofQList
Link to load-qt5-printers.py content:
Then I launch gdb to automatically run those commands:
gdb-multiarch --command=~/bin/gdbcommands.txt
they have so many elements that printing them with their contents is really slow (minutes).
Do the Qt pretty-printers respect the print-elements limit? Is your limit set too high?
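For reference, the limit can be checked and lowered from the GDB prompt (these are standard GDB settings, not specific to the Qt printers):
(gdb) show print elements
(gdb) set print elements 20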
Profiling is unlikely to help here: if you are printing a list with (say) 1000 elements, it's likely that GDB will need to execute 10,000 or more ptrace calls, and ~all the time is spent waiting for these calls.
You can run gdb under strace -c and observe how many ptraces are executed while printing a list of 10 elements, and 100 elements.
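For example (a sketch; -c prints a syscall summary at exit, -f follows the processes GDB forks, and the command file is the one shown above):
strace -c -f gdb-multiarch --command=~/bin/gdbcommands.txt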
If the increase is linear, you can try to optimize the pretty printer to do fewer accesses. If the increase is quadratic, the pretty printer may do unnecessary pointer chasing (and that would certainly explain why printing long lists takes minutes).
I am using pyspark to aggregate and group a largish CSV on a low-end machine: 4 GB RAM and 2 CPU cores. This is done to check the memory limits for the prototype. After aggregation I need to store the RDD in Cassandra, which is running on another machine.
I am using the DataStax Cassandra Python driver. First I used rdd.toLocalIterator, iterated through the RDD, and used the driver's synchronous API session.execute. I managed to insert about 100,000 records in 5 minutes - very slow. Investigating this, I found, as explained here (python driver cpu bound), that when running the nload network monitor on the Cassandra node, the data put out by the Python driver arrives at a very slow rate, causing the slowness.
So I tried session.execute_async, and I could see the network transfer at very high speed, and insertion time was also very fast.
This would have been a happy story but for the fact that, using session.execute_async, I am now running out of memory while inserting into a few more tables (with different primary keys).
Since rdd.toLocalIterator is said to need memory equal to a partition, I shifted the write to the Spark workers using rdd.foreachPartition(x), but I am still going out of memory.
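For context, the per-partition write I am describing looks roughly like this (a sketch: the contact point, keyspace, table and column names are illustrative, aggregated_rdd stands for the RDD after aggregation, and the DataStax Python driver is assumed):
from cassandra.cluster import Cluster

def write_partition(rows):
    # One connection per partition; host and keyspace are hypothetical.
    cluster = Cluster(['cassandra-host'])
    session = cluster.connect('my_keyspace')
    insert = session.prepare(
        "INSERT INTO my_table (key, value) VALUES (?, ?)")  # illustrative schema
    futures = [session.execute_async(insert, (row[0], row[1])) for row in rows]
    # All futures for the partition are held in memory until we wait on them.
    for future in futures:
        future.result()
    cluster.shutdown()

aggregated_rdd.foreachPartition(write_partition)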
I suspect that it is not the RDD iteration that causes this, but the fast serialization(?) done by execute_async in the Python driver (which uses Cython).
Of course I can shift to a node with more RAM and try; but it would be sweet to solve this problem on this node; maybe I will also try multiprocessing next; but if there are better suggestions, please reply.
The memory error I am getting is from the JVM or the OS (out of memory):
6/05/27 05:58:45 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 183 bytes
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fdea10cc000, 12288, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ec2-user/hs_err_pid3208.log
I tried the execution on a machine with bigger RAM - 16 GB; this time I am able to avert the out-of-memory scenario above.
However, this time I changed the insert a bit, to insert into multiple tables.
So even with session.execute_async I am finding that the Python driver is CPU bound (and, I guess due to the GIL, not able to make use of all CPU cores), and what goes out on the network is a trickle.
So I am not able to attain case 2; planning to change to Scala now.
Case 1: Very little output to the network - write speed is fast but there is little to write.
Case 2: Ideal case - inserts being IO bound: Cassandra writes very fast.
I'm trying to reduce the processor time consumed by a Python application, and after profiling it, I've found a small chunk of code consuming more processor time than it should:
class Stats(DumpableObject):
    members_offsets = [
        ('blkio_delay_total', 40),
        ('swapin_delay_total', 56),
        ('read_bytes', 248),
        ('write_bytes', 256),
        ('cancelled_write_bytes', 264)
    ]

    [...other code here...]

    def accumulate(self, other_stats, destination, coeff=1):
        """Update destination from operator(self, other_stats)"""
        dd = destination.__dict__
        sd = self.__dict__
        od = other_stats.__dict__
        for member, offset in Stats.members_offsets:
            dd[member] = sd[member] + coeff * od[member]
Why is this so expensive? How can I improve the efficiency of this code?
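For reference, the method can be timed in isolation with stand-in objects (a self-contained sketch; the attribute values and call count are made up):
import timeit

members = ['blkio_delay_total', 'swapin_delay_total', 'read_bytes',
           'write_bytes', 'cancelled_write_bytes']

class Fake:
    """Stand-in carrying the same attributes accumulate touches."""
    def __init__(self):
        for m in members:
            setattr(self, m, 1)

def accumulate(self, other_stats, destination, coeff=1):
    # Same body as Stats.accumulate above, without the offsets.
    dd = destination.__dict__
    sd = self.__dict__
    od = other_stats.__dict__
    for member in members:
        dd[member] = sd[member] + coeff * od[member]

a, b, dest = Fake(), Fake(), Fake()
# Total seconds for one million calls; numbers are machine-dependent.
print(timeit.timeit(lambda: accumulate(a, b, dest), number=1_000_000))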
Context:
One of my favourite Linux tools, iotop, uses far more processor time than I think is appropriate for a monitoring tool - quickly consuming minutes of processor time. Using the built-in --profile option, total function calls approached 4 million when running for only 20 seconds. I've observed similar behaviour on other systems, across reboots, and on multiple kernels. pycallgraph highlighted accumulate as one of a few time-consuming functions.
After studying the code for a full week, I think that dictionaries are the best choice of data structure here, as a large number of threads to update will require many lookups, but I don't understand why this code is expensive. Extensive searching has failed to enlighten me. I don't understand the curses, socket, and struct libraries well enough to ask a self-contained question. I'm not asking for code as lightweight as the pure C of i7z.
I'd post images & other data, but I don't have the reputation.
The iotop git repository: http://repo.or.cz/w/iotop.git/tree (The code in question is in data.py, beginning line 73)
The system in question runs Ubuntu 13.04 on an Intel E6420 with 2 GB of RAM, kernel 3.8.0-35-generic.
(I wish that Guillaume Chazarain had written more docstrings!)