I need to scrape many different HTML data sources. Each source contains a table with many rows, for example country name, phone number, and price per minute.
I would like to build a semi-automatic scraper which will try to:
find the right table in the HTML page automatically,
-- probably by searching the text for some sample data and trying to find the common HTML element which contains both
extract the rows
-- by looking at the above two elements and selecting the same pattern
identify which column contains what
-- by using some fuzzy algorithm to best guess which column is what.
export it to some Python / other list
-- cleaning everything.
Does this look like a good design? What tools would you choose to do it with if you program in Python?
Does this look like a good design?
No.
What tools would you choose to do it with if you program in Python?
Beautiful Soup
find the right table in the HTML page automatically, -- probably by searching the text for some sample data and trying to find the common HTML element which contains both
Bad idea. A better idea is to write a short script to find all tables, dump the table and the XPath to the table. A person looks at the table and copies the XPath into a script.
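A minimal sketch of such a "find all tables and dump them" script is below. It uses only the standard library's `html.parser` rather than Beautiful Soup, so that it runs without extra installs; the `TableProfiler` class, the `profile_tables` function and the sample page are all made up for illustration. It assumes reasonably well-formed markup, and the positional XPath it prints reflects the raw source, which may differ from what a browser shows after it repairs the page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TableProfiler(HTMLParser):
    """Collect a positional XPath and the cell text of every <table>."""

    def __init__(self):
        super().__init__()
        self.path = []         # open elements as (tag, position) pairs
        self.counts = [{}]     # per-depth tally of child tags seen so far
        self.tables = []       # finished profiles
        self.open_tables = []  # tables still being read
        self.cell = None       # text fragments of the current <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.path.append((tag, seen[tag]))
        self.counts.append({})
        if tag == "table":
            xpath = "".join(f"/{t}[{i}]" for t, i in self.path)
            self.open_tables.append({"xpath": xpath, "rows": []})
        elif tag == "tr" and self.open_tables:
            self.open_tables[-1]["rows"].append([])
        elif tag in ("td", "th") and self.open_tables:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag not in [t for t, _ in self.path]:
            return  # stray closing tag; ignore it
        # Pop up to the matching tag, closing any unclosed children on the way.
        while True:
            closed, _ = self.path.pop()
            self.counts.pop()
            self._finish(closed)
            if closed == tag:
                break

    def _finish(self, tag):
        if tag in ("td", "th") and self.cell is not None and self.open_tables:
            rows = self.open_tables[-1]["rows"]
            if rows:
                rows[-1].append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "table" and self.open_tables:
            self.tables.append(self.open_tables.pop())


def profile_tables(html, max_rows=3):
    """Return (xpath, first rows) for every table found in the page."""
    parser = TableProfiler()
    parser.feed(html)
    parser.close()
    return [(t["xpath"], t["rows"][:max_rows]) for t in parser.tables]


# Made-up sample page: one layout table and one data table.
SAMPLE = """
<html><body>
<table><tr><td>navigation</td></tr></table>
<div>
  <table>
    <tr><th>Country</th><th>Phone</th><th>Price per minute</th></tr>
    <tr><td>France</td><td>+33</td><td>0.05</td></tr>
  </table>
</div>
</body></html>
"""

for xpath, rows in profile_tables(SAMPLE):
    print(xpath)
    for row in rows:
        print("   ", row)
```

A person then reads the output, decides which table is the real one, and copies its XPath into the processing script.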
extract the rows -- by looking at the above two elements and selecting the same pattern
Bad idea. A better idea is to write a short script to find all tables, dump the table with the headings. A person looks at the table and configures a short block of Python code to map the table columns to data elements in a namedtuple.
identify which column contains what -- by using some fuzzy algorithm to best guess which column is what.
A person can do this trivially.
export it to some Python / other list -- cleaning everything.
Almost a good idea.
A person picks the right XPath to the table. A person writes a short snippet of code to map column names to a namedtuple. Given these parameters, a Python script can get the table, map the data and produce some useful output.
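To make the "short snippet of code" concrete, here is one possible shape for it. This is only a sketch: the `Rate` namedtuple, the `SOURCE_CONFIG` dictionary, its keys and the sample rows are all hypothetical, and the rows are assumed to have already been pulled out of the table as lists of strings.

```python
from collections import namedtuple

# Hypothetical record type for the example in the question.
Rate = namedtuple("Rate", ["country", "phone_prefix", "price_per_minute"])

# Hypothetical per-source configuration, written by a person after
# looking at the page. Every value here is made up for illustration.
SOURCE_CONFIG = {
    "xpath": "/html[1]/body[1]/div[1]/table[1]",  # picked by a person
    "skip_rows": 1,                               # heading rows to drop
    "columns": {"country": 0, "phone_prefix": 1, "price_per_minute": 2},
    "converters": {"price_per_minute": float},
}


def map_rows(rows, config):
    """Turn raw table rows (lists of strings) into Rate records."""
    converters = config.get("converters", {})
    records = []
    for row in rows[config.get("skip_rows", 0):]:
        values = {}
        for field, index in config["columns"].items():
            raw = row[index].strip()
            # Fields without a converter stay as plain strings.
            values[field] = converters.get(field, str)(raw)
        records.append(Rate(**values))
    return records


# Made-up rows, standing in for what was extracted from the chosen table.
rows = [
    ["Country", "Phone", "Price per minute"],
    ["France", "+33", "0.05"],
    ["Japan", "+81", "0.12"],
]

for record in map_rows(rows, SOURCE_CONFIG):
    print(record)
```

Keeping the per-source knowledge in a small block like `SOURCE_CONFIG` means the processing code stays the same for every source, and a bad page only breaks its own configuration.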
Why include a person?
Because web pages are filled with notoriously bad errors.
After having spent the last three years doing this, I'm pretty sure that fuzzy logic and magical "trying to find" and "selecting the same pattern" aren't a good idea and don't work.
It's easier to write a simple script to create a "data profile" of the page.
It's easier to write a simple script that reads a configuration file and does the processing.
I cannot see a better solution.
It is convenient to use XPath to find the right table.
Related
I have a school project to write a Python program for a hotel management system. One part requires producing a cancellation slip in the following format.
Refer to format 3
Please tell me how to produce a format like this. I tried making a table with the prettytable module but didn't get the desired result. I am new to Python, so please suggest an easy way. Specifically, I want to know how to build a table by hand in a short way, with some rows and columns merged as in the image (for example, two cells merged into one). The table doesn't need to be fancy; I just want the strings properly aligned so it looks neat and clean.
I am looking for a kind of database which can search in separate files (e.g. pdf, xls, doc) that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check different data about it. The file containing the part number must then be opened with the part number marked. If there are multiple hits, the database should display a list of the various files containing the searched item number. The list should act as links that open the file with the item number selected when selecting one from the list.
Does this already exist or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used vba in connection with Excel, so maybe it's too complicated for me. But is it possible for a programmer without spending 1000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag. I'm sure Python could handle this as well, although it seems more like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
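Since the Python tag came up, here is a rough sketch of the same one-to-many idea using Python's built-in `sqlite3` module instead of Access. The table names, column names, file paths and part numbers are all made up, and the sketch assumes something else has already extracted the part numbers from each supplier file; it only shows how one part number maps to many file locations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to keep the index
conn.executescript("""
    CREATE TABLE part (
        part_number TEXT PRIMARY KEY
    );
    -- One part can occur in many files: the "many" side of the relation.
    CREATE TABLE occurrence (
        id          INTEGER PRIMARY KEY,
        part_number TEXT NOT NULL REFERENCES part(part_number),
        file_path   TEXT NOT NULL,
        location    TEXT
    );
""")


def record_occurrence(part_number, file_path, location=None):
    """Register that a part number appears at a location in a file."""
    conn.execute("INSERT OR IGNORE INTO part (part_number) VALUES (?)",
                 (part_number,))
    conn.execute(
        "INSERT INTO occurrence (part_number, file_path, location) "
        "VALUES (?, ?, ?)",
        (part_number, file_path, location),
    )


def files_for(part_number):
    """List every (file, location) where the part number was found."""
    cur = conn.execute(
        "SELECT file_path, location FROM occurrence "
        "WHERE part_number = ? ORDER BY file_path",
        (part_number,),
    )
    return cur.fetchall()


# Made-up entries standing in for data pulled from supplier files.
record_occurrence("AB-1001", "supplier_b/prices.xls", "Sheet1!A7")
record_occurrence("AB-1001", "supplier_a/catalog.pdf", "page 12")
record_occurrence("ZZ-2000", "supplier_a/catalog.pdf", "page 40")

for file_path, location in files_for("AB-1001"):
    print(file_path, location)
```

The list returned by `files_for` is what would be shown as links; opening the file at the stored location is a separate step that depends on the file type and viewer.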
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!
I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following on from @nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; that is definitely worth pursuing.
I have several SQLite databases ranging in size from 1 to 150 MB some with as many as 30,000 rows. The data being searched is basic HTML. I'm looking for the quickest way to search the HTML text while compensating for any HTML tags.
For instance, if I am searching for "the sky is blue" and a record in a database has an italics tag (i.e. "the <i>sky</i> is blue"), I need it to find it.
Obviously a straight search,
SELECT * FROM dictionary WHERE definition LIKE "%the sky is blue%"
won't work.
So I tried a search for all the individual words in a record in any order and then filter them with a regular expression. This works but is slow. It delivers too many false records that must be scanned by the regex. Especially if there are common words in the search string.
I tried searching for the individual words in order (LIKE "%the%sky%is%blue%") but this would sometimes cause the SQL search to hang with the larger records for some reason. I think it is because of the short common strings ("is", "at", etc.) that produce 1000s of hits.
An SQL regex search is also too slow for my purposes.
One option is to make another table with the data in all the records stripped of HTML tags and search that instead, but this nearly doubles the size of the database.
What other options are there to compensate for the tags?
As you have discovered, relational systems weren't designed for this kind of searching, and there's very little you can do to fix that. The best answer is indeed to store a pre-stripped version of the text purely for searching purposes. Even a 300MB file would be considered small in today's terms, so unless space is a real constraint I wouldn't bother too much about that.
There's no real need for another table, though - that would only complicate things. I'd recommend that you simply add the stripped text as an additional column on your existing table.
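As a rough illustration of the extra-column approach, the sketch below uses Python's `sqlite3` module. The `dictionary` table and `definition` column come from the question; the `definition_plain` column name, the `strip_tags` helper and the sample rows are made up. The regex-based stripper is deliberately crude and assumes simple markup like the italics example.

```python
import re
import sqlite3
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Crude stripper: drop tags, decode entities, collapse whitespace.

    Tags are replaced by nothing, so inline tags such as <i> leave the
    words intact; adjacent block elements may end up glued together.
    """
    text = unescape(TAG_RE.sub("", html))
    return " ".join(text.split())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dictionary (id INTEGER PRIMARY KEY, definition TEXT)")
conn.executemany(
    "INSERT INTO dictionary (definition) VALUES (?)",
    [("Today the <i>sky</i> is blue.",), ("The sea is <b>green</b>.",)],
)

# One-off migration: add the plain-text column and fill it in.
conn.execute("ALTER TABLE dictionary ADD COLUMN definition_plain TEXT")
rows = conn.execute("SELECT id, definition FROM dictionary").fetchall()
conn.executemany(
    "UPDATE dictionary SET definition_plain = ? WHERE id = ?",
    [(strip_tags(definition), row_id) for row_id, definition in rows],
)


def search(phrase):
    """Search the stripped text but return the original HTML."""
    cur = conn.execute(
        "SELECT id, definition FROM dictionary "
        "WHERE definition_plain LIKE ?",
        (f"%{phrase}%",),
    )
    return cur.fetchall()


print(search("the sky is blue"))
```

The stripped column has to be refreshed whenever `definition` changes, and any `%` or `_` in the search phrase still acts as a LIKE wildcard unless it is escaped.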
I'm working on a project that is going to extract specified text from a pdf document. I have no experience with this type of extraction. One issue is that we don't just want a dump of all the text in the document. Rather, is there a way to extract only certain fields in the pdf? Is there a notion of pdf templates that could be used for something like this?
I'm trying to use Apple's Automator - this is able to get all the text but not specified text. Ideally, I would like someone in Pages to have, for example, 30 discrete rows of text, have 20 of those rows be specified as 'catalog item', and have our Automator script take ONLY those twenty lines.
Any ideas on best workflow / extraction tools for this? I would prefer only consumer level items be used such as Apple Pages, Automator, and ruby or python as a scripting language.
thx
edit #1
Looks like tagged PDFs might be one way to do this - not sure how well supported this is in Apple Pages.
With python, the best choice would probably be PDFMiner. It can extract the coordinates for every text string, so you can work out the rectangles in your form on your own and pick out what falls within them. It's all pretty low level, but PDF is unfortunately a pretty low level format.
Be warned that unless you already know a lot about the structure of PDF, you'll find the API and documentation rather scanty. Look around for usage examples, including here on SO.
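A small sketch of the "pick out what falls within the rectangles" idea follows. It assumes the maintained fork, pdfminer.six, whose `extract_pages` and `LTTextContainer` expose a bounding box per text box; the region coordinates, the region name and the sample boxes are invented. The selection logic is kept separate from the PDF reading so it can be tried without a PDF at hand.

```python
def text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text box in the PDF."""
    # Imported here so the rest of the sketch runs even where
    # pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()


def inside(bbox, region):
    """True if bbox (x0, y0, x1, y1) lies entirely within region.

    PDF coordinates are in points with the origin at the bottom-left.
    """
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1


def pick_fields(boxes, regions):
    """Map each named region to the text of the boxes that fall in it."""
    found = {name: [] for name in regions}
    for _page, bbox, text in boxes:
        for name, region in regions.items():
            if inside(bbox, region):
                found[name].append(text)
    return found


# Hypothetical rectangle, worked out by hand for one particular form.
REGIONS = {"catalog_item": (50, 600, 300, 700)}

# Made-up boxes in the same shape that text_boxes() yields.
sample_boxes = [
    (1, (72.0, 650.0, 200.0, 662.0), "Widget, blue"),
    (1, (72.0, 100.0, 200.0, 112.0), "Page footer"),
]

print(pick_fields(sample_boxes, REGIONS))
# With a real file: pick_fields(text_boxes("some.pdf"), REGIONS)
```

This ignores the page number when matching, so a form whose layout changes from page to page would need one set of regions per page.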
For Ruby you might try pdf-reader for parsing a PDF and accessing both metadata and content. Extracting the specific items you're interested in is another story, and how to go about it depends heavily on what format of data you're expecting.
You can use Origami in Ruby, a framework designed to parse, analyze, and forge PDF documents, or the Python equivalent: Origapy, a simple Python interface for the Ruby-based Origami.