What is the easiest way to compare two web pages using python? - python

Hello, I want to compare two webpages using a Python script.
How can I achieve it? Thanks in advance!

First, you want to retrieve both webpages. You can use wget, urlretrieve, etc.:
wget Vs urlretrieve of python
Second, you want to "compare" the pages. You can use a "diff" tool as Chinmay noted. You can also do a keyword analysis of the two pages:
Parse all keywords from each page, e.g. How do I extract keywords used in text?
Optionally take the "stem" of the words with something like:
http://pypi.python.org/pypi/stemming/1.0
Use some math to compare the two pages' keywords, e.g. term frequency–inverse document frequency (tf-idf): http://en.wikipedia.org/wiki/Tf%E2%80%93idf with some of the Python tools out there like these: http://wiki.python.org/moin/InformationRetrieval
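As a minimal, standard-library-only sketch of those steps: fetch both pages, count keywords, and compare the two count vectors with cosine similarity (plain term frequency rather than full tf-idf, which needs a larger corpus); the example.com URLs are placeholders:
import math
import re
import urllib.request
from collections import Counter

def fetch_text(url):
    """Download a page and return its raw text (assumes roughly UTF-8 content)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def keyword_counts(html):
    """Very rough keyword extraction: strip tags, lowercase, count words of 3+ letters."""
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"[a-z]{3,}", text.lower())
    return Counter(words)

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0 = unrelated, 1 = identical)."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# the example.com URLs below are placeholders for the two pages you want to compare
page1 = keyword_counts(fetch_text("https://example.com/page1"))
page2 = keyword_counts(fetch_text("https://example.com/page2"))
print(cosine_similarity(page1, page2))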

What do you mean by compare? If you just want to find the differences between two files, try difflib, which is part of the standard Python library.
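A quick sketch of that approach, assuming you have already saved the two pages to local files (the filenames are placeholders):
import difflib

# page_a.html and page_b.html are placeholder names for the saved pages
with open("page_a.html") as f:
    a = f.readlines()
with open("page_b.html") as f:
    b = f.readlines()

# print a unified diff of the two sources
for line in difflib.unified_diff(a, b, fromfile="page_a.html", tofile="page_b.html"):
    print(line, end="")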

Related

tabula-py get total number of pages

I am using tabula-py to extract some text from a pdf.
For my program I need to know the total number of pages. Is it possible to know this with tabula-py, or do I need to use another module? If so, can you suggest the easiest method, possibly without any additional module or with a built-in one?
Note: no, I don't need to read all of the pages, so I am not using pages='all'.
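tabula-py itself does not appear to expose a page count, so one common workaround (it does need one extra package, contrary to the built-in-module wish) is to ask a PDF library such as pypdf; a minimal sketch with a placeholder filename:
from pypdf import PdfReader  # pip install pypdf

reader = PdfReader("document.pdf")  # placeholder filename
total_pages = len(reader.pages)     # number of pages in the PDF
print(total_pages)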

How to diff between two HTML codes?

I need to run a diff mechanism on two HTML page sources to kick out all the generated data (like user session, etc.).
I'm wondering if there is a Python module that can do that diff and return the element that contains the difference (so I can kick it out in the rest of my code for the other sources).
You can use the difflib module. It's available as part of the standard Python library.
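If you want the differing chunks themselves rather than a printed diff, a rough sketch with difflib.SequenceMatcher (before.html and after.html are placeholder filenames) would be:
import difflib

# before.html and after.html are placeholder names for the two page sources
html_a = open("before.html").read().splitlines()
html_b = open("after.html").read().splitlines()

matcher = difflib.SequenceMatcher(None, html_a, html_b)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        # these line ranges differ between the two sources
        print(tag, html_a[i1:i2], html_b[j1:j2])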

Collect calls and save them to csv

Is it possible, if I have a list of URLs, to parse them in Python and capture the server calls' key/values without needing to open any browser manually, and save them to a local file?
The only library I found for CSV is pandas, but nothing for the first part. Any example would be perfect for me.
You can investigate the use of one of the built-in or third-party libraries that let Python perform the browser-like operations and record the results, filter them, and then use the built-in csv library to output the results.
You will probably need one of the lower level libraries:
urllib/urllib2/urllib3
And you may need to override one or more of the methods to record the transaction data that you are looking for.
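A minimal sketch of that approach, assuming Python 3's urllib.request is enough for the calls you need to record; the URLs and the chosen response fields are just illustrative placeholders:
import csv
import urllib.request

# placeholder list; replace with the URLs you want to record
urls = ["https://example.com/a", "https://example.com/b"]

with open("calls.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "status", "content_type", "length"])
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            # record a few key/value pairs from each server response
            writer.writerow([url, resp.status, resp.headers.get("Content-Type"),
                             resp.headers.get("Content-Length")])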

Crawling all wikipedia pages for phrases in python

I need to design a program that finds certain four or five word phrases across the entire Wikipedia collection of articles (yes, I know it's a lot of pages, and I don't need answers calling me an idiot for doing this).
I haven't programmed much stuff like this before, so there are two issues that I would greatly appreciate some help with:
First, how would I be able to get the program to crawl through all of the pages (i.e. NOT hardcoding each one of the millions of pages; I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder)?
EDIT - I have all the Wikipedia articles on my hard drive.
The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
Your help on either of the issues is greatly appreciated!
Instead of crawling pages manually, which is slower and can get you blocked, you should download the official data dump. These don't contain images, so the second problem is also solved.
EDIT: I see that you have all the articles on your computer, so this answer might not help much.
The snapshots of the pages have pictures and tables in them. How would I extract solely the main text of the article?
If you are okay with finding the phrases within the tables, you could try using regular expressions directly, but the better choice would be to use a parser and remove all the markup. You could use Beautiful Soup to do this (you will need lxml too):
from bs4 import BeautifulSoup

# markup_from_file holds the contents of one article file read from disk;
# the 'xml' parser requires lxml to be installed.
# stripped_strings is an iterable generator that returns the text of each tag in turn
gen = BeautifulSoup(markup_from_file, 'xml').stripped_strings
list_of_strings = [x for x in gen]  # list comprehension generates the list
text = ' '.join(list_of_strings)
BeautifulSoup produces unicode text, so if you need to change the encoding, you can just do:
list_of_strings = [x.encode('utf-8') for x in list_of_strings]
Plus, Beautiful Soup can help you to better navigate and select from each document. If you know the encoding of the data dump, that will definitely help it go faster. The author also says that it runs faster on Python 3.
Bullet point 1: Python has a module just for the task of recursively iterating over every file and directory at a path: os.walk.
Point 2: what you seem to be asking here is how to distinguish files that are images from files that are text. The magic module, available from the cheese shop (PyPI), provides Python bindings for the standard Unix utility of the same name (usually invoked as file(1)).
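A rough sketch combining the two points, assuming the python-magic bindings (other packages named magic expose a different API); the dump path is a placeholder:
import os
import magic  # python-magic bindings for libmagic; other "magic" packages differ

root = "/path/to/wikipedia/dump"  # placeholder path to the downloaded articles

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        mime = magic.from_file(path, mime=True)  # e.g. 'text/html', 'image/png'
        if mime.startswith("text/"):
            # only text files would get searched; images and other binaries are skipped
            print(path, mime)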
You asked:
I have downloaded all the articles onto my hard drive, but I'm not sure how I can tell the program to iterate through each one in the folder
Assuming all the files are in a directory tree structure, you could use os.walk (link to Python documentation and example) to visit every file and then search each file for the phrase(s) using something like:
for line in open("filename"):
    if "search_string" in line:
        print(line)
Of course, this solution won't be featured on the cover of "Python Perf" magazine, but I'm new to Python so I'll pull the n00b card. There is likely a better way to grep within a file using Python's pre-baked modules.
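For completeness, a hedged sketch tying the two pieces together (os.walk to visit every file, the same substring test on every line); the root folder and the phrase are placeholders:
import os

root = "/path/to/wikipedia/articles"     # placeholder folder with the downloaded articles
search_string = "some four word phrase"  # placeholder phrase

for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        path = os.path.join(dirpath, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if search_string in line:
                    print(f"{path}:{lineno}: {line.strip()}")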

text extraction project - best tool for extracting only specific rows / items out of a PDF?

I'm working on a project that is going to extract specified text from a pdf document. I have no experience with this type of extraction. One issue is that we don't just want a dump of all the text in the document. Rather, is there a way to extract only certain fields in the pdf? Is there a notion of pdf templates that could be used for something like this?
I'm trying to use Apple's Automator - this is able to get all the text but not specified text. Ideally, I would like someone working in Pages to have, for example, 30 discrete rows of text, have 20 of those rows specified as 'catalog item', and have our Automator script take ONLY those twenty lines.
Any ideas on best workflow / extraction tools for this? I would prefer only consumer level items be used such as Apple Pages, Automator, and ruby or python as a scripting language.
thx
Edit #1: it looks like tagged PDFs might be one way to do this - not sure how well supported this is in Apple Pages.
With python, the best choice would probably be PDFMiner. It can extract the coordinates for every text string, so you can work out the rectangles in your form on your own and pick out what falls within them. It's all pretty low level, but PDF is unfortunately a pretty low level format.
Be warned that unless you already know a lot about the structure of PDF, you'll find the API and documentation rather scanty. Look around for usage examples, including here on SO.
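As a rough illustration of that approach (not a drop-in solution), here is a sketch using the pdfminer.six fork's high-level API; the filename and the bounding-box coordinates are placeholders you would work out for your own form:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

REGION = (50, 100, 550, 600)  # placeholder rectangle (x0, y0, x1, y1) in PDF points

def inside(bbox, region):
    """True if a text box falls entirely within the target rectangle."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1

for page_layout in extract_pages("catalog.pdf"):  # placeholder filename
    for element in page_layout:
        if isinstance(element, LTTextContainer) and inside(element.bbox, REGION):
            print(element.get_text().strip())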
For Ruby you might try pdf-reader for parsing a PDF and accessing both metadata and content. Extracting the specific items you're interested in is another story, but how to go about doing that depends highly on what format of data you're expecting.
You can use Origami in Ruby, a framework designed to parse, analyze, and forge PDF documents, or the Python equivalent: Origapy, a simple Python interface for the Ruby-based Origami.
