Best method to extract information from unstructured text - python

My aim is to extract information from old scanned reports and store it in a structured database. I have already extracted the text from these reports using Solr.
All of these are scientific reports, and each has a different structure in terms of content, but they all contain similar information. I want to create a structured database from these reports, with fields such as the name of the company involved in the report, the name of the software involved, the location, the date of the experiment, etc. For each of these fields, I have some keywords to be used for extraction. For example, for the location information: Location, Place of experiment, Place, Facility, etc. What is the best way to proceed in this direction?
Also, in some of these files there are no sentences to process; the information is given in a form-like structure, for example:
Location: Canada
Date of the experiment: 1985-05-01.
Which techniques will be best to extract the information? Also, which software and libraries should I use?
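For the form-like files, a minimal sketch of the keyword-driven, line-based approach the question describes (the field names and keyword lists here are hypothetical placeholders):

```python
# Hypothetical keyword lists per target field
FIELD_KEYWORDS = {
    "location": ["location", "place of experiment", "place", "facility"],
    "date": ["date of the experiment", "date"],
}

def parse_form_lines(text):
    """Pull 'Keyword: value' pairs out of form-like report text."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        for field, keywords in FIELD_KEYWORDS.items():
            if key in keywords and field not in record:
                record[field] = value.strip().rstrip(".")
    return record

print(parse_form_lines("Location: Canada\nDate of the experiment: 1985-05-01."))
# {'location': 'Canada', 'date': '1985-05-01'}
```

For the files with free-running sentences, combining the same keyword lists with a named-entity recognizer (e.g. NLTK or spaCy) is a common next step.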

Related

Extract data from csv in RASA

I am making a chatbot in RASA which helps high-school graduates find a university according to their desired location. I have all my data stored in a CSV file. Is there any way to extract specific data from that CSV?
Example: if a user asks to show the universities available in a certain location, how do I extract the specific data from the CSV, i.e. the names of the universities matching the location given by the user?
It seems like you will need to train the model with a location entity. Create a story that links an intent carrying the location entity to a custom action.
A sample story could look something like this:
## story 1
* ask_university{"location": "New York"}
  - action_get_universities
In the custom action action_get_universities, you will then need to query the CSV based on the location entity that the model detected. Pandas should work just fine.
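A minimal sketch of such a custom action, assuming a Rasa SDK project and a hypothetical universities.csv with "name" and "location" columns:

```python
import pandas as pd
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

class ActionGetUniversities(Action):
    def name(self):
        return "action_get_universities"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker, domain):
        location = tracker.get_slot("location")
        # Hypothetical CSV with "name" and "location" columns
        df = pd.read_csv("universities.csv")
        matches = df[df["location"].str.lower() == str(location).lower()]
        if matches.empty:
            dispatcher.utter_message(text=f"No universities found in {location}.")
        else:
            dispatcher.utter_message(
                text=f"Universities in {location}: " + ", ".join(matches["name"])
            )
        return []
```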
Have fun exploring!

Export PDF with pre-filled and unfilled editable fields

I am trying to print custom commercial invoices based on a known set of data, while allowing for an unknown set of data. My known data includes addresses, general contact information, etc.
What I want is to be able to use a PDF template of a "commercial invoice" and have the known data auto-populated into the form where available. The user can then download the (incomplete) PDF and fill in the empty/optional form fields with their own information - things like tax ID, recipient care-of names, internal tracking IDs, etc.
How can I use JSON/XML + Python + HTML + a PDF template to auto-fill some fields and leave others empty, in an editable PDF form?
Thanks!
You essentially want server-side filling of the form.
There are several possible approaches.
An industrial-strength approach would be to use a dedicated application that can be called via the command line (FDFMerge by Appligent comes to mind; it is very easy to integrate, as all you have to do is assemble the FDF data and then the command string).
Another approach is to use one of the PDF libraries out there (iText, PDFlib or Adobe's PDF Library come to mind here). In this case, you have considerably more programming effort, but may get somewhat more flexibility.
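For a pure-Python route in the same spirit as the library approach (not mentioned in the answer above), a minimal sketch with a recent pypdf; the template path and field names are hypothetical:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("commercial_invoice_template.pdf")  # hypothetical template
writer = PdfWriter()
writer.append(reader)

# Pre-fill the known data; fields not listed here stay empty and editable
writer.update_page_form_field_values(
    writer.pages[0],
    {"sender_address": "123 Example St", "contact_email": "ops@example.com"},
)

# Ask PDF viewers to regenerate field appearances so filled values display
writer.set_need_appearances_writer(True)

with open("invoice_prefilled.pdf", "wb") as f:
    writer.write(f)
```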

Information Extraction from Text into Structured Data with Python

I'm almost a total outsider to programming, just interested in it.
I work at a shipbroking company and need to match positions (which ship will be open where and when) against orders (what kind of ship will be needed where, when, and for what kind of employment).
We send and receive such info (positions and orders) by email, to and from our principals and co-brokers.
There are thousands of such emails each day.
We do the matching by reading the emails manually.
I want to build an app to do the matching for us.
One important part of this app will be extracting the information from the email text.
==> My question is: how do I use Python to extract unstructured info into structured data?
Sample email of an order [annotations in brackets are mine and not included in the email]:
Email Subject: 20k dwt requirement, 20-30/mar, Santos-Conti
Content:
Acct ABC [Account Name]
Abt 20,000 MT Deadweight [Size of Ship Needed]
Delivery to make Santos [Delivery Point/Range, Owners will deliver the ship to Charterers here]
Laycan 20-30/Mar [Laycan (the time spread in which delivery can be accepted)]
1 time charter with grains [What kind of Employment/Trade, Cargo]
Duration about 35 days [Duration]
Redelivery 1 safe port Continent [Redelivery Point/Range, Charterers will redeliver the ship back to Owners here.]
Broker name/email/phone...
End Email
The same email can be written in many different ways - some write it all in one line, some use l/c instead of laycan...
And there are emails for positions, with the ship's name, open port, date range, deadweight and other specs.
How can I extract the info and put it into structured data, with Python?
Let's say I have put all email contents into text files.
Thanks.
Below is a possible approach:
Step 1: Classify the mails into categories using the subject and/or body of the mail.
As you stated, one category is mails requesting a position and the other is mails carrying an order.
Machine learning can be used for the classification, with a set of previous mails as the training corpus. You might consider using NLTK (Natural Language Toolkit) for Python; its book has a chapter on text classification.
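A minimal sketch of such a classifier with NLTK's Naive Bayes (the two training mails here are hypothetical stand-ins for a real corpus of labelled emails):

```python
import nltk

# nltk.download("punkt")  # tokenizer models, needed once

def features(text):
    # Simple bag-of-words features over lowercased tokens
    return {word: True for word in nltk.word_tokenize(text.lower())}

# Hypothetical training corpus: (email text, label) pairs from past mails
train = [
    ("acct ABC abt 20,000 mt deadweight laycan 20-30/mar", "order"),
    ("mv example open santos 25/mar 28,000 dwt", "position"),
    # ... many more labelled examples
]

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train]
)
print(classifier.classify(features("20k dwt requirement 20-30/mar")))
```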
Step 2: Once an email is identified as an order mail, process it to fetch the details (account name, size, time spread etc.). As you mentioned, the challenge here is that there is no fixed format for these data. To solve this, you might consider preparing an exhaustive list of synonyms for each label (for account, the list could be ['acct', 'a/c', 'account', 'acnt']). This should be done once, by going through a fixed volume of previous mails.
To make the solution more effective, you could consider implementing active learning: prompt the user whenever a mail contains a label that is not found in any list. For example, if "accnt" is used in a mail, it won't be resolved, so the user should be prompted to say which category it falls into.
Once a label is identified, you can use basic string operations to parse the email and fetch the relevant data in a structured format; a sketch follows below.
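A minimal sketch of the synonym-list extraction, assuming one "label value" or "label: value" phrase per field and per line (the synonym lists and sample text are hypothetical):

```python
import re

# Hypothetical synonym lists mapping each field to its label variants
SYNONYMS = {
    "account": ["acct", "a/c", "account", "acnt"],
    "laycan": ["laycan", "l/c"],
}

def extract_fields(text):
    record = {}
    for field, labels in SYNONYMS.items():
        for label in labels:
            # Match "label <value>" or "label: <value>" up to the end of the line
            m = re.search(rf"\b{re.escape(label)}\b[:\s]+(.+)", text, re.IGNORECASE)
            if m:
                record[field] = m.group(1).strip()
                break
    return record

print(extract_fields("Acct ABC\nLaycan 20-30/Mar"))
# {'account': 'ABC', 'laycan': '20-30/Mar'}
```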
You can refer to this discussion for a better understanding.

Extract statistical information from Wikipedia article

I'm currently extracting data from DBpedia articles using SPARQLWrapper for Python, but I can't seem to find out how to get the number of watchers (and other statistical information) for a given article.
Is there an easy way to achieve this? I don't mind whether it's through DBpedia or directly through Wikipedia (using wget, for example).
Thanks for any advice.
Getting the number of watchers for an arbitrary article is deliberately prohibited, as it would be a security leak if everyone could find unwatched pages. For example, only privileged users have access to Special:UnwatchedPages. There is a toolserver tool (which has access to the DB) showing the number of watchers, but for the same reason it is restricted to pages with more than 30 watchers - at least for unauthenticated users.
The MediaWiki query API mostly exposes content and status information about articles, though you can also query and evaluate the public logs or revision histories to get statistical data about (public) user actions. For more stats about the Wikimedia sites you may have a look at Meta:Statistics, where various data sources (mostly http://stats.wikimedia.org/) and visualisations of them are listed.
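As an illustration of the query API route, a sketch using requests; newer MediaWiki versions expose a watcher count via inprop=watchers, but only when the page is above the watcher threshold described above:

```python
import requests

# Query the public MediaWiki API for page info; "watchers" only appears
# in the response when the page has enough watchers to be disclosed.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "titles": "Python (programming language)",
        "prop": "info",
        "inprop": "watchers",
        "format": "json",
    },
)
for page in resp.json()["query"]["pages"].values():
    print(page["title"], page.get("watchers", "not disclosed"))
```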

Performing multiple searches of a term in Twitter

I have little working knowledge of Python. I know there is something called the Twitter Search API, but I'm not really sure what I'm doing. I know what I need to do:
I need point data for a class. I thought I would pull up a map of the world in a GIS application, select cities that have population x or larger, then export those selections to a new table. That table would have a key and a city name.
Next, I randomly select 100 of those cities. Then I perform a search for a certain term (in this case, Gaddafi) for each of those 100 cities. All I need to know is how many posts there were on a certain day (or over a few days, depending on the volume of tweets).
I just have a feeling something already exists that does this, and I'm having a hard time finding it. I've downloaded and installed python-twitter but have no idea how to get this search done. Does anyone know where I can find, or how I can make, this tool? Any suggestions would really help. Thanks!
A tweet itself can carry a geo tag, but that is a new feature and the majority of tweets do not have one. So it is not possible to search for all tweets containing "Gaddafi" from a given city by city name.
What you could do is the reverse: search for "Gaddafi" first (regardless of geo location) using the Search API. Then, for each tweet, find the location of the poster (either through the RESTful API or with some sort of web scraping).
So basically, you can classify the collected tweets according to the location of the poster.
I think only tweepy has access to both the Twitter Search API and the RESTful API.
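A minimal sketch of that search-then-classify flow, assuming an older tweepy (pre-4.x, where api.search still exists) and hypothetical credentials:

```python
import tweepy

# Hypothetical credentials
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

counts = {}
for tweet in tweepy.Cursor(api.search, q="Gaddafi").items(500):
    # The poster's free-text profile location, not a reliable geo tag
    city = (tweet.user.location or "unknown").strip().lower()
    counts[city] = counts.get(city, 0) + 1

# Tally tweets per (self-reported) poster location
for city, n in sorted(counts.items(), key=lambda kv: -kv[1])[:10]:
    print(city, n)
```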
