How to structure a web scraper project? - python

I have a project to collect posts from several second-hand vehicle websites using BeautifulSoup and store them in a database. My client has also asked that this functionality be built on top of a content management system he is familiar or semi-familiar with, such as WordPress.
Can this be done with WordPress without making a big mess of it? If not, how would you suggest structuring the project, and which CMS should I use?

WordPress seems to support only MySQL and MariaDB, according to their site: https://codex.wordpress.org/Using_Alternative_Databases. Those are your only database options if you want to stay compatible with WordPress.
From there, it comes down to whichever is easier to access from your Python code.
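One way to keep the scraper independent of whichever CMS ends up in front of it is to split it into a parse step and a store step. Here is a minimal sketch under stated assumptions: the HTML snippet, the `listing` class name, and the `posts` table are all made up for illustration, and the standard-library `html.parser` and `sqlite3` stand in for BeautifulSoup and MySQL so the example runs without extra installs.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup; each real site would need its own parser.
SAMPLE_HTML = """
<ul>
  <li class="listing"><a href="/cars/1">Ford Focus 2012</a></li>
  <li class="listing"><a href="/cars/2">Audi A4 2015</a></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects (title, url) pairs from <li class="listing"> items."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_listing = False
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self._in_listing = attrs.get("class") == "listing"
        elif tag == "a" and self._in_listing:
            self._href = attrs.get("href")

    def handle_data(self, data):
        # The first non-blank text after a listing link is taken as its title.
        if self._href and data.strip():
            self.posts.append((data.strip(), self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_listing = False


def parse_posts(html):
    parser = ListingParser()
    parser.feed(html)
    return parser.posts


def store_posts(conn, posts):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (title TEXT, url TEXT UNIQUE)"
    )
    # INSERT OR IGNORE skips URLs that were already stored on an earlier run.
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?)", posts)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_posts(conn, parse_posts(SAMPLE_HTML))
rows = conn.execute("SELECT title, url FROM posts ORDER BY url").fetchall()
```

With this split, moving to a MySQL driver should only require changes inside `store_posts`, and each additional website gets its own parser that returns the same `(title, url)` pairs.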

Related

Linking python file into functioning HTML/CSS website

I'm giving myself a project to better learn these languages. I already know a lot about each; it's syncing them together that I need to get better at. The project is a pretty basic "SIM" game: generate some animals into your profile, with login/logout. So far the HTML/CSS side is done and working with all the pages I currently need, all served from localhost on my desktop. Now I'm moving on to Python, and possibly some PHP, to handle login/logout and to generate a new animal into your account.
Everything I've done with Python so far has been in IDLE. How do I link my Python file to my HTML document, the way you would with CSS? If that's not possible, how do I connect the two so that Python can interact with the HTML/CSS I've created? I'm guessing I'll need MySQL for the database, but I'd like to see how much I can do on a simple localhost setup without hosting online.
If you want to set up a localhost with PHP and MySQL, I can recommend XAMPP (https://www.apachefriends.org/). For your web app to talk to your Python scripts, you will either need to use Flask or Django to create a Python web server, or use PHP to run the Python scripts. Either way, you will need to make AJAX requests to an API to get this done.
Edit: Forgot to mention this, but you will need JavaScript in order to do this.
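To make the API idea concrete, here is a minimal sketch of a Python endpoint that the page's JavaScript could call. It uses the standard-library WSGI interface rather than Flask or Django, purely so it runs without installs; the `/api/animal` path, the animal names, and the response shape are made up for illustration.

```python
import json
import random
from wsgiref.util import setup_testing_defaults

ANIMALS = ["fox", "owl", "otter"]  # hypothetical game data


def app(environ, start_response):
    """WSGI app: /api/animal returns a random animal as JSON."""
    if environ.get("PATH_INFO") == "/api/animal":
        body = json.dumps({"animal": random.choice(ANIMALS)}).encode("utf-8")
        status = "200 OK"
    else:
        body = json.dumps({"error": "not found"}).encode("utf-8")
        status = "404 Not Found"
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(path):
    """Invoke the app in-process, without opening a network socket."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining WSGI keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


status, payload = call("/api/animal")
```

To serve it locally, `wsgiref.simple_server.make_server("localhost", 8000, app).serve_forever()` would expose it at http://localhost:8000/api/animal, which the page could then request with `fetch` or XMLHttpRequest. Flask or Django would replace the hand-written routing once the project grows.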

Open Source Search UI projects that can consume REST services

I'm trying to find a solution to my current problem. Let me explain: I need a search UI that can consume a REST service of my choice and is highly configurable. I've searched the web and found Blacklight (written in Ruby), a search UI for Solr. I've also looked at Haystack (for Django), which seems more promising because the docs say you can link Haystack to a custom search engine. Out of the box Haystack supports Solr, Xapian and two others which I can't remember now.
What I'm trying to find is a UI written in Java, PHP (last resort!) or Python that lets me specify the endpoints for my APIs and, with some configuration (I'm not expecting it to run out of the box), can query the APIs and return results.
If that is not possible, could somebody suggest something that gets close to what I described and lets me write my own backend code to link to the APIs? A Haystack example will also do...
Thanks
I'm interested in this topic as well. I know about the SESAT framework supporting FAST, Solr, Yahoo!, generic XML and more, but it is old and not well maintained, and also tries to do much more than a simple front-end.
You also have AJAX-Solr which obviously only supports Solr.
I have forwarded your question on Twitter, hope others will fill in as well.

How can I update a plone page via a script?

I have a large number of automatically generated HTML files that I would like to push to my Plone website with a script. Currently I generate the files, log into Plone, click edit on each individual page, and copy and paste the HTML into the editor. I'd like to automate this. It would be nice to retain the Plone versioning, attach an auto-generated comment to each edit, and have the edits come from a specific user.
I've read about and tried WebDAV, with little luck getting it to work consistently. I know there is a way to connect to Plone via FTP, but I haven't tried it. I'm not sure whether these are the methods I need.
My Google searches aren't leading me to anything useful. Any ideas on where to start looking for a solution? Or any tips on implementing it?
You can script anything in Plone via the following methods:
Through-the-web via API calls (e.g. XML-RPC, wsapi, etc.)
The bin/instance run script provided by plone.recipe.zope2instance (See charm for an example of this).
You can also use a migration framework like:
collective.transmogrifier
which allows you to write migration code, and trigger it via GenericSetup or Browser view. Additionally, there are applications written on top of Transmogrifier aimed roughly at what you are describing, the most popular of which is:
funnelweb
I would recommend that you consider using or writing a Transmogrifier "blueprint(s)" to do your import, and execute the pipeline with a tool that makes that easy:
mr.migrator
You can find blueprints by searching PyPI for "transmogrify". One popular set of blueprints is:
quintagroup.transmogrifier
One of the main attractions to the Transmogrifier approach, aside from getting the job done, is the ability to share useful blueprints with others.
I think Transmogrifier is the best tool for this job, but this will definitely be a programming task no matter how you do it. It's used for many migration jobs of this kind, such as migrating from Drupal.
There's an add-on, wsapi4plone.core, started by pumazi at WebLion, that provides web services for portals which you can then hook into. You can create, modify, and delete content via XML-RPC calls. The only caveat is that it doesn't yet work with Collections (criteria specifically).
project: http://pypi.python.org/pypi/wsapi4plone.core
docs: http://packages.python.org/wsapi4plone.core/
You can also do it programmatically by hooking into the ZODB via Python (zopepy or some other method).
These should get you started:
http://plone.org/documentation/kb/manipulating-plone-objects-programmatically/reading-and-writing-field-values - this should give you an understanding of accessors and mutators (getters and setters). In your case you will most likely be working with obj.Text (getter) and obj.setText (setter).
https://weblion.psu.edu/trac/weblion/wiki/AutomatingObjectCreation - lots of examples (slightly outdated but still relevant)
http://plone.org/documentation/faq/upload-images-files
Try enabling WebDAV or FTP in Plone; then you can access Plone via WebDAV or FTP clients and push the HTML files. Plone (Zope) will recognise the HTML files as Pages.
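For the WebDAV route, pushing a page amounts to an authenticated HTTP PUT. Below is a rough sketch that only builds the request and does not send it. The site URL, page id, and credentials are placeholders, and it assumes WebDAV has been enabled on the Zope instance and that it accepts basic authentication.

```python
import base64
import urllib.request


def build_put(base_url, page_id, html, user, password):
    """Build (but do not send) a WebDAV PUT request for one HTML page."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{page_id}",
        data=html.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "text/html; charset=utf-8",
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values, not a real site or account.
req = build_put("http://example.org/Plone", "report-1",
                "<p>Generated report</p>", "editor", "secret")
# urllib.request.urlopen(req) would send it once the values are real.
```

Looping over the generated files and calling `build_put` for each would replace the manual copy and paste. Whether edits made this way keep Plone's versioning and edit comments is worth checking on a test page first.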

How to use python for a webservice

I am really new to Python; I've just played around with the Scrapy framework, which is used to crawl websites and extract data.
My question is: how do I pass parameters to a Python script that is hosted somewhere online?
E.g. I make the following request: mysite.net/rest/index.py
Now I want to pass some parameters, similar to PHP's *.php?id=...
Yes, that would work, although you would need to write handlers in index.py to extract the URL parameters. Try importing the cgi module for this.
Please note that there are several robust Python-based web frameworks available (e.g. Django, Pylons) which automatically parse your URL and form a dictionary of all its parameters. They also do much more, such as session management and user authentication. I would highly recommend using them for faster code turnaround and fewer maintenance hassles.
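As a sketch of what such a handler has to do: under CGI, the web server passes everything after the `?` in the `QUERY_STRING` environment variable. The example below parses it with `urllib.parse.parse_qs` instead of the `cgi` module, which was deprecated in Python 3.11 and removed in 3.13. The `get_param` helper and the sample query string are illustrative, not part of any framework.

```python
import os
from urllib.parse import parse_qs


def get_param(name, default=None, environ=os.environ):
    """Return the first value of a query-string parameter, or a default."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    values = params.get(name)
    return values[0] if values else default


# Simulate a request for index.py?id=42&tag=a&tag=b
fake_env = {"QUERY_STRING": "id=42&tag=a&tag=b"}
post_id = get_param("id", environ=fake_env)

# A CGI script replies on stdout: headers, a blank line, then the body.
print("Content-Type: text/plain")
print()
print(f"id = {post_id}")
```

Values always arrive as strings, so something like `int(post_id)` is still needed before using the id as a number. A framework's request object does the same parsing for you.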

How to build an interactive search engine web interface using python

I have built a static web interface for searching data from some tables in my PostgreSQL database. The query page consists of a simple text field for entering the search term; the results page presents the results as a simple HTML table. The server-side code for searching the PostgreSQL database and returning the results is written in Python using psycopg2.
Now I would like to add some interactive "Ajax features" to my search engine. When entering the search term I would like to be able to see a list of possible search terms like Google does it. On the results page, I would like to be able to sort the table showing the results.
What would be the easiest/recommended way to implement these features for my search engine web site?
I have not had to build a search outside of Django, but Haystack http://haystacksearch.org/ makes things very easy.
If you don't want to get into Django you could look at Whoosh. http://bitbucket.org/mchaput/whoosh/wiki/Home
What you call "Ajax features" is technically known as auto-suggest. Unless you want to reinvent the wheel, I would highly recommend indexing your DB tables using Apache Solr. It comes with auto-suggest, faceted filtering (like on most e-commerce sites) and spell-check, and since it is HTTP-based you can hook into it from Python very easily using its RESTful API.
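To illustrate the two requested features independently of any particular engine, here is a minimal sketch of prefix-based auto-suggest over a sorted term list, plus server-side sorting of result rows. It is a stand-in, not how Solr or Haystack work internally; the terms, rows, and function names are made up. In the asker's setup the terms would come from PostgreSQL and the functions would sit behind an endpoint the page calls via Ajax.

```python
from bisect import bisect_left

TERMS = sorted(["postgres", "postgresql", "psycopg2", "python", "solr"])


def suggest(prefix, terms=TERMS, limit=5):
    """Return up to `limit` terms starting with `prefix` (terms must be sorted)."""
    prefix = prefix.lower()
    # Binary search finds the first term that is >= prefix; all matches
    # are contiguous from there because the list is sorted.
    start = bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


def sort_rows(rows, column, descending=False):
    """Sort result rows (dicts) by one column, as a sortable table would."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)


rows = [{"name": "b", "hits": 3}, {"name": "a", "hits": 7}]
print(suggest("post"))
print([r["name"] for r in sort_rows(rows, "hits", descending=True)])
```

For small result sets, sorting the table in the browser avoids a round trip; the server-side version here fits better when results are paginated.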
