python BeautifulSoup/threading very slow on xubuntu, not sure why

I wrote a program on Windows that runs perfectly fine there, but it is very slow on Xubuntu and I'm not sure why. I have narrowed the problem down to either Beautiful Soup or threading, so I wrote a small test program and ran it on both Windows and Xubuntu.
from testThread import myThread
thread1 = myThread("thread1","http://www.youtube.com")
thread2 = myThread("thread2","http://www.youtube.com")
thread1.start()
thread2.start()
thread1.join()
thread2.join()
from threading import Thread
import time
from bs4 import BeautifulSoup
import requests
class myThread(Thread):
    def __init__(self, name, url):
        Thread.__init__(self)
        self.t0 = time.time()
        self.name = name
        self.url = url
        self.r = requests.get(url)

    def run(self):
        start_time = time.time() - self.t0
        soup = BeautifulSoup(self.r._content)
        end_time = time.time() - self.t0
        print self.name, "Time: %s" % str(end_time - start_time)
Essentially, the program creates two threads, each of which prints how long BeautifulSoup takes to parse the content already fetched from a URL.
Xubuntu output:
thread1 Time: 6.88162994385
thread2 Time: 6.92221403122
Windows output:
thread1 Time: 0.524999856949
thread2 Time: 0.542999982834
As you can see, Windows was significantly faster, and I have absolutely no idea why.
Xubuntu specs: Intel(R) Core(TM) i7, 2.6 GHz, 16 GB memory, Ubuntu 15.04
Windows specs: Intel(R) Core(TM) Duo CPU 2.2 GHz, 2 GB memory, Windows 7
So why is Xubuntu so slow compared to Windows at this task? I've been looking all over for a solution, with no luck. Any help would be much appreciated, thank you.
EDIT: I should note that I have tried explicitly specifying the parser for BeautifulSoup to use. I have tried
soup = BeautifulSoup(self.r._content,"lxml")
Using lxml as the parser had no effect on the timings.
EDIT: I have isolated the problem even further and determined that the issue occurs inside Beautiful Soup's dammit.py module.
Code in BeautifulSoup's dammit.py:
import codecs
from htmlentitydefs import codepoint2name
import re
import logging
import string
# Import a library to autodetect character encodings.
chardet_type = None
try:
    # First try the fast C implementation.
    # PyPI package: cchardet
    import cchardet
    def chardet_dammit(s):
        return cchardet.detect(s)['encoding']
except ImportError:
    try:
        # Fall back to the pure Python implementation
        # Debian package: python-chardet
        # PyPI package: chardet
        import chardet
        def chardet_dammit(s):
            return chardet.detect(s)['encoding']
        #import chardet.constants
        #chardet.constants._debug = 1
    except ImportError:
        # No chardet available.
        def chardet_dammit(s):
            return None
On Windows it executes the "return None" in
    except ImportError:
        # No chardet available.
        def chardet_dammit(s):
            return None
On Xubuntu, however, it executes "return chardet.detect(s)['encoding']" in
    try:
        # Fall back to the pure Python implementation
        # Debian package: python-chardet
        # PyPI package: chardet
        import chardet
        def chardet_dammit(s):
            return chardet.detect(s)['encoding']
        #import chardet.constants
        #chardet.constants._debug = 1
The code that Xubuntu executes takes much longer. The issue seems to be related to the libraries that Beautiful Soup imports. I'm not sure why Windows hits the nested ImportError handler and Xubuntu doesn't, but it seems odd that Beautiful Soup gets so much slower when it successfully imports these libraries. I would appreciate it if someone who better understands what's going on here could explain. Thank you. I'm also not sure whether it matters, but I am using PyCharm on both Xubuntu and Windows.
EDIT: Found the problem. On Xubuntu I had the chardet library installed, whereas on Windows I did not. BeautifulSoup used this library for encoding detection, which slowed the program down considerably. After I uninstalled it, everything ran smoothly.
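A quick way to compare two machines for this kind of difference is to check which of the optional detection libraries are importable on each. The following is only a minimal sketch in Python 3 (the question's code is Python 2); `available_detectors` is a hypothetical helper name, and the candidate list simply mirrors the two imports attempted in the dammit.py excerpt above.

```python
import importlib.util


def available_detectors(candidates=("cchardet", "chardet")):
    """Return the names in `candidates` that can be imported here.

    Hypothetical helper: the default names mirror the imports that the
    dammit.py excerpt tries, in the same order.
    """
    found = []
    for name in candidates:
        # find_spec returns None when no top-level module of that name exists.
        if importlib.util.find_spec(name) is not None:
            found.append(name)
    return found


if __name__ == "__main__":
    # An empty list would correspond to the "return None" branch in the excerpt.
    print(available_detectors())
```

If chardet has to stay installed for other reasons, passing the documented `from_encoding` argument to the BeautifulSoup constructor should let Beautiful Soup try a known encoding before it falls back to detection; whether that removes the slowdown in this exact setup is an assumption that would need to be measured.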

Related

Running two python code in parallel from two different directory using Multiprocessing

Below is my code for running two Python scripts in parallel using multiprocessing:
defs.py
import os
def pro(process):
    #print(process)
    os.system('python {}'.format(process))
Multiprocessing.py
import os
from multiprocessing import Pool
import multiprocessing as mp
import defs
import datetime
import pandas as pd
processes = ('python_code1.py','python_code2.py')
if __name__ == '__main__':
    pool = Pool(processes=4)
    start = datetime.datetime.now()
    print('Start:',start)
    pool.map(defs.pro, processes)
    end = datetime.datetime.now()
    print('End :',end)
    total = end-start
    print('Total :', end-start)
This code runs perfectly fine. However, I need to run 'python_code1.py' and 'python_code2.py' from two different directories,
so I made the following changes in Multiprocessing.py:
path1 = r'C:\Users\code1\python_code1.py'
path2 = r'C:\Users\code2\python_code2.py'
processes = (path1,path2)
but this is not working for me.
My Multiprocessing.py and defs.py are kept in the directory C:\Users\Multiprocessing\

An elegant solution is to use asyncio. It is the foundation for several Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. It also has both high-level and low-level APIs to accommodate many kinds of problems, and you might find the syntax easier, as I do:
import os
import asyncio
import functools

def background(f):
    def wrapped(*args, **kwargs):
        # run_in_executor accepts positional arguments only, so bind
        # keyword arguments with functools.partial first.
        call = functools.partial(f, *args, **kwargs)
        return asyncio.get_event_loop().run_in_executor(None, call)
    return wrapped

@background
def pro(process):
    #print(process)
    os.system('python {}'.format(process))

processes = (r'C:\Users\code1\python_code1.py',r'C:\Users\code2\python_code2.py')
for process in processes:
    pro(process)
A detailed answer on parallelizing a for loop may also be useful.
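The question does not say how the two-directory version fails. One common cause is a script that relies on relative paths, so here is a rough sketch, under that assumption, that starts each script with its own directory as the working directory. `run_script` and `run_all` are hypothetical names, and the sketch uses `subprocess` with a thread pool rather than the `multiprocessing.Pool` from the question.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_script(script_path):
    """Run one script with its own directory as the working directory."""
    script = Path(script_path).resolve()
    completed = subprocess.run(
        [sys.executable, str(script)],  # same interpreter as the caller
        cwd=str(script.parent),         # relative paths resolve next to the script
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stdout


def run_all(script_paths):
    """Start every script concurrently and collect (returncode, stdout) pairs."""
    # Threads are enough here: each one only waits on a child process.
    with ThreadPoolExecutor(max_workers=max(1, len(script_paths))) as pool:
        return list(pool.map(run_script, script_paths))
```

With the paths from the question, the call would look like `run_all([r'C:\Users\code1\python_code1.py', r'C:\Users\code2\python_code2.py'])`. Passing the command as a list also avoids quoting problems when a path contains spaces.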

All threads in my python application appear as "python3" [duplicate]

When I set the name for a Python thread, it doesn't show up in htop or ps. The ps output only shows python as the thread name. Is there any way to set a thread name so that it shows up in system reports like these?
from threading import Thread
import time
def sleeper():
    while True:
        time.sleep(10)
        print "sleeping"
t = Thread(target=sleeper, name="Sleeper01")
t.start()
t.join()
ps -T -p {PID} output:
  PID  SPID TTY          TIME CMD
31420 31420 pts/30   00:00:00 python
31420 31421 pts/30   00:00:00 python

First install the prctl module. (On Debian/Ubuntu, just run sudo apt-get install python-prctl.)
from threading import Thread
import time
import prctl
def sleeper():
    prctl.set_name("sleeping tiger")
    while True:
        time.sleep(10)
        print "sleeping"
t = Thread(target=sleeper, name="Sleeper01")
t.start()
t.join()
This prints
$ ps -T
  PID  SPID TTY          TIME CMD
22684 22684 pts/29   00:00:00 bash
23302 23302 pts/29   00:00:00 python
23302 23303 pts/29   00:00:00 sleeping tiger
23304 23304 pts/29   00:00:00 ps
Note: python3 users may wish to use pyprctl.

The prctl module is nice and provides many features, but it depends on the libcap-dev package. libcap2 is most likely already installed because it is a dependency of many packages (systemd, for example). So if you only need to set the thread name, call libcap2 through ctypes.
See the improved version of Grief's answer below.
import ctypes
import threading

LIB = 'libcap.so.2'
try:
    libcap = ctypes.CDLL(LIB)
except OSError:
    print(
        'Library {} not found. Unable to set thread name.'.format(LIB)
    )
else:
    def _name_hack(self):
        # PR_SET_NAME = 15
        libcap.prctl(15, self.name.encode())
        threading.Thread._bootstrap_original(self)

    # Keep the original bootstrap so the patched version can delegate to it.
    threading.Thread._bootstrap_original = threading.Thread._bootstrap
    threading.Thread._bootstrap = _name_hack

On Python 2, I use the following monkey patch to propagate the Thread's name to the system if prctl is installed:
import threading

try:
    import prctl
    def set_thread_name(name): prctl.set_name(name)

    def _thread_name_hack(self):
        set_thread_name(self.name)
        threading.Thread.__bootstrap_original__(self)

    threading.Thread.__bootstrap_original__ = threading.Thread._Thread__bootstrap
    threading.Thread._Thread__bootstrap = _thread_name_hack
except ImportError:
    # log() is assumed to be the application's own logging helper.
    log('WARN: prctl module is not installed. You will not be able to see thread names')
    def set_thread_name(name): pass
After this code has run, you can set a thread's name as usual:
threading.Thread(target=some_target, name='Change monitor', ...)
This means that if you already set names for your threads, you don't need to change anything. I cannot guarantee that this is 100% safe, but it works for me.

I was confused by this as well, then found a tool, py-spy, that shows a Python process's threads while it is running.
Install: pip3 install -i https://pypi.doubanio.com/simple/ py-spy
Usage: py-spy dump --pid process-number
For example, py-spy dump --pid 1234 shows the stack, name, and id of every thread in Python process 1234.

An alternative solution (admittedly a dirty one, since it sets the process name, not the thread name) is to use the setproctitle module from PyPI.
You can install it with pip install setproctitle and use it as follows:
import setproctitle
import threading
import time
def a_loop():
    setproctitle.setproctitle(threading.current_thread().name)
    # you can otherwise explicitly declare the name:
    # setproctitle.setproctitle("A loop")
    while True:
        print("Looping")
        time.sleep(99)

t = threading.Thread(target=a_loop, name="ExampleLoopThread")
t.start()

https://pypi.org/project/namedthreads/ provides a way to patch threading.Thread.start to call pthread_setname_np with the Python Thread.name.
It is compatible with Python 2.7 & 3.4+ (I've tested it with 3.10)
To activate it,
import namedthreads
namedthreads.patch()
Note that thread names in Python are unlimited in length, but pthreads limits names to 15 characters (16 bytes including the terminating null byte), so the Python name will be truncated.

I attempted to follow the answers here and install python-prctl or pyprctl. However, neither could be installed because they need gcc, which we don't have.
After some digging, I found that Python issue 15500 (https://bugs.python.org/issue15500) offers a nice solution. Here is what I came up with based on it:
import ctypes, os, threading

def set_thread_name_np(the_name):
    the_lib_path = "/lib/libpthread-2.42.so"
    if not os.path.isfile(the_lib_path):
        return None
    try:
        libpthread = ctypes.CDLL(the_lib_path)
    except:
        return None
    if hasattr(libpthread, "pthread_setname_np"):
        pthread_setname_np = libpthread.pthread_setname_np
        pthread_setname_np.argtypes = [ctypes.c_void_p,
                                       ctypes.c_char_p]
        pthread_setname_np.restype = ctypes.c_int
        if isinstance(the_name, str):
            the_name = the_name.encode('ascii', 'replace')
        if type(the_name) is not bytes:
            return None
        the_thread = threading.current_thread()
        ident = getattr(the_thread, "ident", None)
        if ident is not None:
            pthread_setname_np(ident, the_name[:15])
            return True
    return None
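The hard-coded library path above is specific to one system. As a variation, and only a sketch under assumptions, the library can be located with `ctypes.util.find_library` instead. `set_current_thread_name` is a hypothetical name, the code is Python 3, and it is restricted to Linux because the two-argument form of `pthread_setname_np` is the glibc one.

```python
import ctypes
import ctypes.util
import sys
import threading


def set_current_thread_name(name):
    """Try to set the OS-level name of the calling thread; return True on success."""
    if not sys.platform.startswith("linux"):
        return False  # other platforms use a different pthread_setname_np signature
    # Newer glibc versions ship pthread functions in libc, so fall back to it.
    lib_name = ctypes.util.find_library("pthread") or ctypes.util.find_library("c")
    if lib_name is None:
        return False
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return False
    func = getattr(lib, "pthread_setname_np", None)
    if func is None:
        return False
    func.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
    func.restype = ctypes.c_int
    ident = threading.current_thread().ident
    if ident is None:
        return False
    # Names longer than 15 bytes are rejected by the call, so truncate first.
    return func(ident, name.encode("ascii", "replace")[:15]) == 0
```

The function has to be called from inside the thread that should be renamed, for example as the first statement of the thread's target function.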

Python's threads block on IO operation

I have the following problem. Whenever a child thread wants to perform an IO operation (writing to a file, downloading a file), the program hangs. In the following example the program hangs on opener.retrieve. If I execute python main.py, the program blocks in the retrieve function. If I execute python ./src/tmp.py, everything is fine. I don't understand why. Can anybody explain what is happening?
I am using Python 2.7 on a Linux system (kernel 3.5.0-27).
File layout:
main.py
./src
    __init__.py
    tmp.py
main.py
import src.tmp
tmp.py
import threading
import urllib

class DownloaderThread(threading.Thread):
    def __init__(self, pool_sema, i):
        threading.Thread.__init__(self)
        self.pool_sema = pool_sema
        self.daemon = True
        self.i = i

    def run(self):
        try:
            opener = urllib.FancyURLopener({})
            opener.retrieve("http://www.greenteapress.com/thinkpython/thinkCSpy.pdf", "/tmp/" + str(self.i) + ".pdf")
        finally:
            self.pool_sema.release()

class Downloader(object):
    def __init__(self):
        maxthreads = 1
        self.pool_sema = threading.BoundedSemaphore(value=maxthreads)

    def download_folder(self):
        for i in xrange(20):
            self.pool_sema.acquire()
            print "Downloading", i
            t = DownloaderThread(self.pool_sema,i)
            t.start()

d = Downloader()
d.download_folder()

I managed to get it to work by hacking urllib.py. If you inspect it, you will see many import statements scattered through the code, i.e. it imports modules 'on the fly' rather than only when the module loads.
So the real reason is still unknown, and probably not worth investigating: most likely some deadlock in Python's import system. You just shouldn't run nontrivial code during an import; that's asking for trouble.
If you insist, you can get it to work by moving all these scattered import statements to the top of urllib.py.
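One way to follow the advice about not running nontrivial code during an import is to keep the module free of side effects and start the threads from a function that the importing script calls explicitly. The following is a rough Python 3 sketch of that layout, not the asker's code: `run_limited` and `main` are hypothetical names, and `done.append` is a stand-in for the real download.

```python
import threading


def run_limited(tasks, worker, max_threads=1):
    """Run worker(task) in threads, at most max_threads at a time."""
    sema = threading.BoundedSemaphore(value=max_threads)
    threads = []

    def guarded(task):
        try:
            worker(task)
        finally:
            sema.release()  # free the slot even if the worker raised

    for task in tasks:
        sema.acquire()  # blocks while max_threads workers are running
        t = threading.Thread(target=guarded, args=(task,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


def main():
    # Nothing above runs at import time; the caller decides when to start.
    done = []
    run_limited(range(5), done.append, max_threads=2)
    return sorted(done)


if __name__ == "__main__":
    print(main())
```

With this shape, main.py would import the module and then call its `main()` function, so the threads start only after the import has finished.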

Strange problems when using requests and multiprocessing

Please check this python code:
#!/usr/bin/env python
import requests
import multiprocessing
from time import sleep, time
from requests import async

def do_req():
    r = requests.get("http://w3c.org/")

def do_sth():
    while True:
        sleep(10)

if __name__ == '__main__':
    do_req()
    multiprocessing.Process( target=do_sth, args=() ).start()
When I press Ctrl-C (after waiting about 2 seconds so the Process starts), the program doesn't stop. When I change the import order to:
from requests import async
from time import sleep, time
it stops after Ctrl-C. Why doesn't it stop in the first example?
Is it a bug or a feature?
Notes:
Yes, I know that I don't use async in this code; it is a stripped-down version to simplify my question. In the real code I do use it.
After pressing Ctrl-C there is a new (child) process running. Why?
multiprocessing.__version__ == 0.70a1, requests.__version__ == 0.11.2, gevent.__version__ == 0.13.7

The requests async module uses gevent. If you look at gevent's source code, you will see that it monkey-patches many of Python's standard library functions, including sleep.
During import, the requests.async module executes:
from gevent import monkey as curious_george
# Monkey-patch.
curious_george.patch_all(thread=False, select=False)
Looking at the monkey.py module of gevent you can see:
https://bitbucket.org/denis/gevent/src/f838056c793d/gevent/monkey.py#cl-128
def patch_time():
    """Replace :func:`time.sleep` with :func:`gevent.sleep`."""
    from gevent.hub import sleep
    import time
    patch_item(time, 'sleep', sleep)
See the code in gevent's repository for details.
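Why the import order matters can be illustrated without gevent. A name imported with `from time import sleep` is bound to whatever `time.sleep` is at that moment, so patching the module afterwards does not change that name. The sketch below is plain Python 3 and only demonstrates this binding behaviour; `fake_sleep` is a hypothetical stand-in for gevent's replacement, not gevent code.

```python
import time
from time import sleep as early_sleep  # bound before any patching happens


def fake_sleep(seconds):
    """Hypothetical stand-in for the function a monkey patch would install."""
    return "patched"


original_sleep = time.sleep
time.sleep = fake_sleep  # what a monkey patch does to the module attribute
try:
    from time import sleep as late_sleep  # bound after the patch

    # The early name still points at the original function ...
    early_is_original = early_sleep is original_sleep
    # ... while the late name picked up the replacement.
    late_is_patched = late_sleep is fake_sleep
finally:
    time.sleep = original_sleep  # undo the patch so nothing else is affected
```

In the question's first import order, `sleep` is imported before `requests.async` applies its patch, so `do_sth` would be calling the unpatched function; in the second order it would get the patched one. How that difference leads to the observed Ctrl-C behaviour is not something this sketch shows.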

Windows Service code gives threading error when using Pywin32 / PyInstaller

I'm getting an error when using a Python exe generated by PyInstaller to create a Windows service. The error message may be innocuous, and it doesn't seem to affect the operation of the service, but I'm not sure whether other problems are going on behind the scenes. I'm using the pywin32 libraries to install the application as a Windows service. I should note that I don't get this error when installing from the Python script itself, using PythonService.exe from pywin32, only from the executable generated with PyInstaller.
When using pyinstaller, I can generate the exe from my windows service code and install it, no problem. I can also start the service, no problem. I can even stop the service and the application appears to shut down properly. However, once I've initiated the stop, I get the following error on the console while running the win32traceutil.py:
"Exception KeyError: KeyError(2244,) in <module 'threading' from '2\build\pyi.win32\agentservice\outPYZ1.pyz/threading'> ignored"
No errors are recorded to the event log. I've been able to trace it back to the python logging module. Simply importing the logging module seems to cause my problem. Commenting out the import eliminates the error. It seems pretty clear to me that this is causing the problem, but I find it strange that pyinstaller would have issues with a module in the standard library. Has anyone else run into this?
I'm running Python 2.6.6, Pyinstaller 1.5.1, Build 217 of Pywin32. I'm on Windows XP.
Here is a stripped-down version of my code:
import win32service
import win32serviceutil
import win32event
import win32evtlogutil
import win32traceutil
import servicemanager
import sys
import os
import time

class myservice(win32serviceutil.ServiceFramework):
    _svc_name_ = "5"
    _svc_display_name_ = "5"
    _svc_deps_ = ["EventLog"]

    def __init__(self, args):
        self.isAlive = True
        win32serviceutil.ServiceFramework.__init__(self, args)
        self.hWaitStop = win32event.CreateEvent(None, 0, 0, None)

    def event_log(self, msg):
        servicemanager.LogInfoMsg(str(msg))

    def SvcStop(self):
        # tell Service Manager we are trying to stop (required)
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        print "svcstop: stopping service, setting event"
        # set the event to call
        win32event.SetEvent(self.hWaitStop)
        print "svcstop: ending svcstop"

    def SvcDoRun(self):
        print "we are starting the service..."
        self.event_log("Starting %s" % self._svc_name_)
        ############ IF LOGGING IS COMMENTED OUT, THE ERROR GOES AWAY################
        import logging
        print "svcdorun: waiting for object"
        win32event.WaitForSingleObject(self.hWaitStop,win32event.INFINITE)
        print "svcdorun: return from function"

if __name__ == '__main__':
    if len(sys.argv)==1:
        import win32traceutil
        print "service is starting..."
        #servicemanager.Initialize()
        servicemanager.Initialize('backup service', None)
        servicemanager.PrepareToHostSingle(myservice)
        # Now ask the service manager to fire things up for us...
        servicemanager.StartServiceCtrlDispatcher()
        print "service done!"
    else:
        win32serviceutil.HandleCommandLine(myservice)
