Batch processing in Python

I am just wondering: is there any way to process multiple videos in one go, maybe using a cluster, and specifically using Python? For example, say I have 50+ videos in a folder and I have to analyze each one for movement-related activity. Assume I have code written in Python and I have to run that one particular piece of code on each video. What I want is, instead of analyzing the videos one by one (i.e., in a loop), to analyze them in parallel. Is there any way I can implement that?

You can do this with multiprocessing or threading. The details can be a bit involved, and since you didn't give any in your question, pointing you at those modules is about all the help I can offer.
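For a single machine, a minimal sketch with multiprocessing.Pool might look like the following; analyze_video() is a hypothetical stand-in for your existing per-video code, and videos/*.mp4 is an assumed folder layout:

    import glob
    from multiprocessing import Pool

    def analyze_video(path):
        # placeholder: call your existing movement-analysis code here
        # and return whatever result you need for this video
        return path, "done"

    if __name__ == "__main__":
        videos = glob.glob("videos/*.mp4")      # your folder of 50+ videos
        with Pool(processes=8) as pool:         # roughly one worker per CPU core
            results = pool.map(analyze_video, videos)
        for path, result in results:
            print(path, result)

Spreading the same work over a whole cluster is a separate step (a job scheduler or a distributed task queue), but the per-video function stays the same, so the Pool version is a reasonable place to start.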

Related

How do I include a data-extraction module into my python project?

I am currently starting a somewhat larger project in Python and I am unsure about how best to structure it. Or, to put it in different terms, how to build it in the most "pythonic" way. Let me try to explain the main functionality:
It is supposed to be a tool or toolset by which to extract data from different sources, at the moment mainly SQL-databases, in the future maybe also data from files stored on some network locations. It will probably consist of three main parts:
A data model which will hold all the data extracted from files / SQL. This will be some combination of classes / instances thereof. No big deal here
One or more scripts, which will control everything (Should the data be displayed? Output to another file? Which data exactly needs to be fetched? etc.) Also pretty straightforward
And some module/class (or multiple modules) which will handle the actual data extraction. This is where I mainly struggle
So for the actual questions:
Should I place the classes of the data model and the "extractor" into one folder/package and access them from outside the package via my "control script"? Or should I place everything together?
How should I build the "extractor"? I already tried three different approaches for a SqlReader module/class:
I tried making it just a simple module, not a class, but I didn't really find a clean way of deciding how and where to initialize it (the SQL connection needs to be set up).
I tried making it a class and creating one instance, but then I need to pass that instance around into the different classes of the data model, because each needs to be able to extract data.
And I tried making it a static class (defining everything as a @classmethod), but again, I didn't like setting it up and it also kind of felt wrong.
Should the main script "know" about the extractor module? Or should it just interact with the data model itself? If not, the question again is where, when, and how to initialize the SqlReader.
And last but not least, how do I make sure I close the SQL connection whenever my script ends, even if it ends through an error? I am using cx_Oracle, by the way.
I am happy about any hints / suggestions / answers etc. :)
For this project you will need the basic data science toolkit: pandas, Matplotlib, and maybe NumPy. You will also need sqlite3 (built-in) or another SQL module to work with the databases.
Pandas: used to extract, manipulate, and analyze data.
Matplotlib: visualize data and make human-readable graphs for further analysis.
NumPy: build fast, stable arrays of data that are much faster than Python's lists.
Now, this is just a guideline; you will need to dig deeper into their documentation and then use what you need in your project.
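As a rough sketch of how those pieces can fit together, assuming a hypothetical example.db with a products table; the with-block also takes care of closing the connection even if the script ends through an error, and the same pattern works for a cx_Oracle connection object:

    import sqlite3
    from contextlib import closing

    import pandas as pd

    # closing() guarantees conn.close() runs even if an exception is raised
    with closing(sqlite3.connect("example.db")) as conn:
        df = pd.read_sql("SELECT name, price FROM products", conn)

    print(df.describe())    # quick pandas summary of the extracted data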
Hope that this is what you were looking for!
Cheers

Scraping web sites with computer vision

I have been given the task to scrape a high number of websites. All of them represent (visually speaking) the data I'm interested in in a similar way. Each one of those websites has a product-details-view (so to call it). And all of the views contain the same information: a product title, price, maybe some images, a description, etc...
If I had to scrape 10 sites, I'd write 10 if/else or case branches to handle them, but I'm afraid the number of websites is quite a bit bigger, and that gets me into a whole other problem.
Then I figured I'd use "computer vision" and "machine learning". That sounds reasonable in the sense of having almost identical websites and "teaching" an algorithm how to "see" the data I'm interested in.
My strategy, so far, is to render each product-detail-view in a headless chrome (controlled with selenium), take a screenshot and split the visual representation of the website into chunks: left column, main, right column. Then split the "main" part into several chunks: title, breadcrumb, content, etc...
Unfortunately I'm not really sure how to actually split the screenshot into chunks. I have been looking at OpenCV's docs, but I'm not sure it's suited for that concrete purpose (or is it?).
Are there any other libraries that would be a better fit for what I'm trying to do? Also, does my strategy sound good or are there better ways of approaching this problem?
PS: Diffbot, Import.io and similar are not an option. Please don't suggest them.
You can try to solve the problem with a more engineering-oriented approach instead of machine learning. I mean having one piece of code for all the websites, but a different config for each of them. An example of such a config:
title: '#title_id',
description: '#description_id',
price: '#price_id'
Such an approach will need some maintenance in the future, because the markup can change, but it can be a good way to start for now.
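A minimal sketch of that idea, using requests and BeautifulSoup with a couple of hypothetical site configs (the hostnames and selectors are assumptions, not real sites):

    import requests
    from bs4 import BeautifulSoup

    # one config per website; only the selectors differ, the scraping code stays the same
    CONFIGS = {
        "shop-a.example.com": {"title": "#title_id", "description": "#description_id", "price": "#price_id"},
        "shop-b.example.com": {"title": "h1.product", "description": "div.desc", "price": "span.price"},
    }

    def scrape(url):
        host = url.split("/")[2]
        selectors = CONFIGS[host]
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        result = {}
        for field, selector in selectors.items():
            element = soup.select_one(selector)
            # select_one() returns None if the markup changed and the selector no longer matches
            result[field] = element.get_text(strip=True) if element else None
        return result

    print(scrape("https://shop-a.example.com/product/123"))

If the pages need JavaScript to render, the same config dictionary can drive Selenium instead; only the fetching step changes.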

Tracking the progress of a moviepy process on the front end

I'm making a web application that uses moviepy functions, but I am only able to track progress at a basic level (video uploading, video processes happening, etc., not a percentage of progress). The moviepy processes track progress on the server. How might I be able to get this progress information and give it to the client?
The ability to pass a function for write_videofile() to call on each iteration has been an often-requested feature. It is currently not possible using the main branch of moviepy. It is a feature that we are looking into implementing properly, and you can see all the thoughts about it on our project page here.
There are a few different pull requests submitted by different people with the same feature implemented in different ways. They may be exactly what you are looking for, so feel free to clone them and try them out here, here, here. The first one is much newer, so you're probably better off starting with that one.
I'll update this answer as soon as we decide on something to add.
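Until then, one workaround (a sketch, not an official moviepy API) is to intercept what write_videofile() prints while encoding and parse a percentage out of it; depending on the moviepy version the progress bar may be written to stderr rather than stdout, so adjust which stream you wrap:

    import re
    import sys

    from moviepy.editor import VideoFileClip

    class ProgressCapture:
        """Wraps a stream, passes writes through, and remembers the last percentage seen."""

        def __init__(self, wrapped):
            self.wrapped = wrapped
            self.percent = 0            # poll this value from your web layer

        def write(self, text):
            match = re.search(r"(\d{1,3})%", text)
            if match:
                self.percent = int(match.group(1))
            return self.wrapped.write(text)

        def flush(self):
            self.wrapped.flush()

    capture = ProgressCapture(sys.stdout)
    sys.stdout = capture                # or wrap sys.stderr if the bar goes there
    try:
        VideoFileClip("in.mp4").write_videofile("out.mp4")
    finally:
        sys.stdout = capture.wrapped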

What are some of the Artificial Intelligence (AI) related techniques one would use for parsing a webpage?

I would like to scrape several different discussion forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of learning algorithm that could identify the different messages (i.e. structures) on each page and individually parse them, while simultaneously ignoring all the extraneous crap (i.e., ads and other nonsense). Could someone please point me to some references or sample code for work that's already been carried out in this area?
Moreover, does anyone know of pseudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
Build a solution that:
takes some sample webpages with the same structure (e.g. forum threads)
analyzes the DOM tree of each to find which parts are the same / different
where they differ is the dynamic content you are after (posts, user names, etc.)
This technique is known as wrapper induction.
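A very crude sketch of that idea with BeautifulSoup, assuming two hypothetical thread URLs that share the same template: collect the text blocks of each page and keep the positions where the text differs, since the shared boilerplate (navigation, ads) is identical on both:

    import requests
    from bs4 import BeautifulSoup

    def text_blocks(url):
        """Return the visible text of the page's block elements, keyed by position."""
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        blocks = {}
        for i, element in enumerate(soup.find_all(["p", "div", "td", "li"])):
            text = element.get_text(" ", strip=True)
            if text:
                blocks[i] = text
        return blocks

    # two pages built from the same template (hypothetical URLs)
    a = text_blocks("https://forum.example.com/thread/1")
    b = text_blocks("https://forum.example.com/thread/2")

    # positions whose text differs hold the dynamic content: posts, user names, dates...
    for i in (k for k in a if k in b and a[k] != b[k]):
        print(a[i][:80])

Real wrapper-induction systems align the DOM trees properly instead of relying on element order, but the diff-the-samples idea is the same.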
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
It may not be exactly what you need, but there's an O'Reilly book called 'Programming Collective Intelligence' that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in Python :)

Writing a Faster Python Spider

I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help making it optimized for speed.
What I need to do is examine each page for a certain number and, if it is found, record the link to the page. The spider is very simple; it just needs to sort through a lot of pages.
I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.
You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.
The traditional line of thought is: write your program in standard Python, and if you find it is too slow, profile it and optimize the specific slow spots. You can make these slow spots faster by dropping down to C, using C/C++ extensions or even ctypes.
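For example, profiling one fetch-and-check pass with the standard library's cProfile shows where the time actually goes; process_page() and sample.html are hypothetical stand-ins here:

    import cProfile
    import pstats

    def process_page(html):
        # placeholder for your real page-checking code
        return "123456" in html

    cProfile.run("process_page(open('sample.html').read())", "profile.out")
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)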
If you are spidering just one site, consider using wget -r (an example).
Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.
As you are new to Python, I think the following may be helpful for you :)
If you are writing regexes to search for a certain pattern in the pages, compile your regex wherever you can and reuse the compiled object (see the sketch below).
BeautifulSoup is an HTML/XML parser that may be of some use for your project.
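A small sketch combining both points, assuming the thing you are searching for is a hypothetical six-digit number and that sample.html is a page saved to disk:

    import re
    from bs4 import BeautifulSoup

    # compiled once at module level and reused for every page
    NUMBER_RE = re.compile(r"\b\d{6}\b")

    def page_has_number(html):
        text = BeautifulSoup(html, "html.parser").get_text()
        return NUMBER_RE.search(text) is not None

    with open("sample.html", encoding="utf-8") as f:
        if page_has_number(f.read()):
            print("found it")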
Spidering somebody's site with millions of requests isn't very polite. Can you instead ask the webmaster for an archive of the site? Once you have that, it's a simple matter of text searching.
You waste a lot of time waiting for network requests when spidering, so you'll definitely want to make your requests in parallel. I would probably save the result data to disk and then have a second process looping over the files searching for the term. That phase could easily be distributed across multiple machines if you needed extra performance.
What Adam said. I did this once to map out Xanga's network. The way I made it faster was by having a thread-safe set containing all the usernames I had to look up. Then I had 5 or so threads making requests at the same time and processing them. You're going to spend way more time waiting for the page to download than you will processing any of the text (most likely), so just find ways to increase the number of requests you can make at the same time.
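A compact version of that pattern using the standard library's ThreadPoolExecutor for the "5 or so threads" part; the target number and the URL list are assumptions:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    TARGET = "123456"       # the number you are searching for
    urls_to_check = ["https://example.com/page/%d" % i for i in range(100)]

    def check(url):
        try:
            return url if TARGET in requests.get(url, timeout=10).text else None
        except requests.RequestException:
            return None

    # the workers spend most of their time waiting on the network,
    # which is why running several requests at once gives the speedup
    with ThreadPoolExecutor(max_workers=5) as pool:
        hits = [url for url in pool.map(check, urls_to_check) if url]

    print(hits)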
