I'm nearly a total outsider to programming, just interested in it.
I work in a shipbrokering company and need to match positions (which ship will be open where and when) with orders (what kind of ship is needed, where, when, and for what kind of employment).
And we send and receive such info (positions and orders) by emails to and from our principals and co-brokers.
There are thousands of such emails each day.
We do the matching by reading the emails manually.
I want to build an app to do the matching for us.
One important part of this app will do the information extraction from email text.
==> My question is: how do I use Python to extract this unstructured info into structured data?
Sample email of an order [annotations in brackets are not part of the email]:
Email Subject: 20k dwt requirement, 20-30/mar, Santos-Conti
Content:
Acct ABC [Account Name]
Abt 20,000 MT Deadweight [Size of Ship Needed]
Delivery to make Santos [Delivery Point/Range, Owners will deliver the ship to Charterers here]
Laycan 20-30/Mar [Laycan, the time spread in which delivery can be accepted]
1 time charter with grains [What kind of Employment/Trade, Cargo]
Duration about 35 days [Duration]
Redelivery 1 safe port Continent [Redelivery Point/Range, Charterers will redeliver the ship back to Owners here.]
Broker name/email/phone...
End Email
The same email above can be written in many different ways - some write it in one line, some use l/c instead of laycan...
And there are emails for positions with ship's name, open port, date range, ship's deadweight and other specs.
How can I extract the info and put it into structured data, with Python?
Let's say I have put all email contents into text files.
Thanks.
Below is a possible approach:
Step 1: Classify the mails into categories using the subject and/or body of the mail.
As you stated, one category is position mails and the other is order mails.
Machine learning can be used for the classification. You can use a set of previous mails as the training corpus. You might consider using NLTK (Natural Language Toolkit) for Python. Here is the link on text classification using NLTK.
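As a rough illustration of that first step, here is a minimal Naive Bayes sketch with NLTK; the labels and training pairs below are made up, and you would feed it your own past mails:

import nltk

# Made-up training data: (subject line, category) pairs taken from old mails
train = [
    ("20k dwt requirement, 20-30/mar, Santos-Conti", "order"),
    ("MV Example open Santos 25/mar, 28k dwt", "position"),
    # ... many more labelled examples ...
]

def features(subject):
    # Simple bag-of-words features over the lowercased subject line
    return {word: True for word in subject.lower().split()}

train_set = [(features(subj), label) for subj, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(classifier.classify(features("abt 30k dwt needed ex Santos, 5-10/apr")))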
Step 2: Once an email is identified as an order mail, process it to fetch the details (account name, size, time spread, etc.). As you mentioned, the challenge here is that there is no fixed format for these data. To solve this problem, you might consider preparing an exhaustive list of synonyms for each label (for account, the list could be ['acct', 'a/c', 'account', 'acnt']). This should be done once, by going through a fixed volume of previous mails.
To make the solution more effective, you could consider implementing an option for active learning
(i.e., prompt the user when a label is found in a mail that is not in any list; e.g., if "accnt" is used, it won't be resolved, so the user should be prompted to say which field it belongs to).
Once a label is identified, you can use basic string operations to parse the email and fetch the relevant data in a structured format, as in the sketch below.
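A rough sketch of that synonym lookup, just to make the idea concrete (the field names and synonym lists are placeholders you would build from your own mails):

import re

# Placeholder synonym lists for each field; extend these from previous mails
FIELD_SYNONYMS = {
    "account": ["acct", "a/c", "account", "acnt"],
    "laycan": ["laycan", "l/c"],
    "duration": ["duration", "period"],
}

def extract_fields(email_text):
    # Return {field: value} for every line that starts with a known label
    result = {}
    for line in email_text.splitlines():
        for field, synonyms in FIELD_SYNONYMS.items():
            for syn in synonyms:
                m = re.match(r"\s*%s\b[:\s]*(.+)" % re.escape(syn), line, re.IGNORECASE)
                if m and field not in result:
                    result[field] = m.group(1).strip()
    return result

print(extract_fields("Acct ABC\nLaycan 20-30/Mar\nDuration about 35 days"))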
You can refer to this discussion for a better understanding.
Related
Problem: I’m a nurse, and part of my job is to pull up a list of “unsigned services”. Then of course, take these charts, and send them to the right person.
The person before me did not do this, leaving me THOUSANDS of charts to pull up by patient name and DOB, select the right document, and send to the right person.
I have figured out how to use selenium with python to automate logging in, using input to send keys to search the correct patient, and even to pull up the correct document that needs signed.
How do I have the program do this for every chart? How do I have Python work down the list of names and DOBs without my having to put them in manually?
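To make the question concrete, something like the loop below is roughly what I am picturing; the CSV layout and the locator are placeholders for the parts I already have working:

import csv
from selenium import webdriver

driver = webdriver.Chrome()
# ... log in here, exactly as already automated ...

# Hypothetical CSV with one chart per row: patient name, DOB
with open("unsigned_charts.csv", newline="") as f:
    for name, dob in csv.reader(f):
        # Reuse the existing search/pull-up steps, but fed from the file
        search_box = driver.find_element("name", "patient_search")  # placeholder locator
        search_box.clear()
        search_box.send_keys(name + " " + dob)
        # ... select the right document and send it to the right person, as before ...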
Anything I look for on my own is just examples of applying a basic function to a list of numbers and that isn’t my goal.
Thanks for your help!
I’m using Twitter’s real-time filtered stream API (for the first time!) in Python, and I’m basically trying to recreate my real-time timeline in-terminal: I want to get a response every time someone I follow tweets, retweets, etc. I’m doing this in two steps:
GETting an id list of people I follow (my “friends”) from api.twitter.com/1.1/friends/ids.json?screen_name=my_username
Iterating over the returned list of ids and creating a "from:" follow rule for each:
for f_id in friends_list_response['ids']:
    rule_val = "from:" + str(f_id)
    filter_rules.append({"value": rule_val, "tag": str(f_id)})
I have this working for up to 10 test friends—they tweet, I get terminal printout in real-time. However I very quickly hit a cap, since I follow 500+ accounts, and am thus creating 500+ distinct from: rules for the filtered stream.
(HTTP 422): {"detail":"Rule creation request exceeds account's current rule limits. Please remove or update existing rules to proceed.","title":"RulesCapExceeded","type":"https://api.twitter.com/2/problems/rule-cap"}
I don’t really understand what to do about hitting my cap (that "type" URL doesn't offer guidance) - whether I need higher permissions for API v2, or whether I need to do something with PowerTrack. How can I increase my rule limit, ideally to track 1000+ accounts? I believe I have basic developer authorization. Or, is there a workaround to a filter rule for tweets from my friends? I wish there was an endpoint that filtered tweets by everyone followed by a user.
Again, I'm new to this API, so I appreciate any insight!
Nevermind, Tweepy's "home_timeline" function is exactly what I needed, almost a direct answer to "I wish there was an endpoint that filtered tweets by everyone followed by a user."
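For anyone landing here, a minimal sketch of that (the credentials are placeholders; on older Tweepy versions the auth class is tweepy.OAuthHandler instead of OAuth1UserHandler):

import tweepy

# Placeholder credentials
auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# home_timeline returns recent tweets from the accounts you follow
for tweet in api.home_timeline(count=50):
    print("@{}: {}".format(tweet.user.screen_name, tweet.text))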
In Twitter API v2, you can use the filtered stream to follow a list of users. You build the filtered stream from rules, and the limits on rules are as follows:
Your rules will be limited depending on which product track you are using.
If you are using the Standard product track at the Basic level of access, you are able to add 25 concurrent rules to your stream, and each rule can be 512 characters long.
If you are using the Academic Research product track, you are able to add 1000 concurrent rules to your stream, and each rule can be 1024 characters long.
So, you can add 25 concurrent rules with 512 characters each. If the only filter will be user accounts or IDs, you need to calculate how many users fit within those character limits.
The character length of a Twitter account ID is between 8 (older accounts) and 18 (newest accounts). Note that you will need to add ' from:' (a space plus "from:") for each user, which means 6 more characters per account.
Assuming an average ID length of 13 and adding those 6 extra characters:
512 limit characters / 19 avg characters per user ID = ~26 ID per rule
If we have 25 rules:
25 rules * 26 IDs per rule = ~650 IDs in total
Note that this is approximate and you can use the account name instead of the ID.
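If you go that route, a sketch of packing IDs into as few rules as possible could look like this (I believe clauses need an explicit OR between them, which costs a few more characters than the estimate above, so treat the numbers as approximate):

MAX_RULE_LENGTH = 512  # 1024 on the Academic Research track

def build_rules(user_ids):
    # Pack as many "from:ID" clauses as fit into each rule, OR-ed together
    rules, current = [], ""
    for uid in user_ids:
        clause = "from:{}".format(uid)
        candidate = clause if not current else current + " OR " + clause
        if len(candidate) <= MAX_RULE_LENGTH:
            current = candidate
        else:
            rules.append({"value": current})
            current = clause
    if current:
        rules.append({"value": current})
    return rules

# e.g. filter_rules = build_rules(friends_list_response['ids'])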
My aim is to extract information from old scanned reports and store it in a structured database. I have already extracted the text from these reports using Solr.
All of these are scientific reports with different structures in terms of content, but they all contain similar information. I want to create a structured database from these reports, with fields such as the name of the company involved in the report, the name of the software involved, the location, the date of the experiment, etc. For each of these fields I have some keywords which shall be used for extraction; for example, for the location information: Location, Place of experiment, Place, Facility, etc. What will be the best way to proceed in this direction?
Also, in some of these files there are no sentences to process; the information is given in a form-like structure, for example:
Location: Canada
Date of the experiment: 1985-05-01.
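For those form-like lines, a keyword/regex lookup along these lines is what I have in mind (the field names and keyword lists are just examples):

import re

# Example keyword lists per field; these would be extended from the real reports
FIELD_KEYWORDS = {
    "location": ["location", "place of experiment", "place", "facility"],
    "date": ["date of the experiment", "date"],
}

# One compiled pattern per field, matching "keyword: value" or "keyword - value"
FIELD_PATTERNS = {
    field: re.compile(r"(?:%s)\s*[:\-]\s*(.+)" % "|".join(map(re.escape, kws)), re.IGNORECASE)
    for field, kws in FIELD_KEYWORDS.items()
}

def extract_form_fields(text):
    found = {}
    for field, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        if m:
            found[field] = m.group(1).strip().rstrip(".")
    return found

print(extract_form_fields("Location: Canada\nDate of the experiment: 1985-05-01."))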
Which techniques will be best to extract the information? Also, which software and libraries should I use?
One of the core functions of my messaging app is allowing users to find friends who are also on the service based on their phone numbers. Apps like WhatsApp and Snapchat have the same kind of mechanism.
I'm struggling to find a solution that returns a good number of results. I'm wondering how most other apps approach this pretty widely implemented feature.
My current implementation is that I have a User model and a PhoneUser model. The PhoneUser model is keyed on the user's phone number, which has been converted into the standardized E164 format. It has a KeyProperty to link it to the respective user.
class PhoneUser(ndb.Model):
    # id is the phone number in E164 format
    user = ndb.KeyProperty(kind='User', required=True)
When a user signs up for the service and grants access to their phone contacts, the app can get a large number of phone numbers from the user's phone book. 1,000 numbers is not impossible. I convert all these numbers into the standardized E164 format and then build keys for each (ie. ndb.Key('PhoneUser', PHONE_NUMBER)). With that list of PhoneUser keys, I can use ndb.get_multi(list_of_phoneuser_keys). This lets me avoid querying for 1,000 numbers.
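As a sketch of that lookup (assuming the App Engine ndb API and numbers already normalized to E164):

from google.appengine.ext import ndb

def find_registered_users(e164_numbers):
    # Batch-fetch PhoneUser entities for a list of already-normalized numbers
    keys = [ndb.Key('PhoneUser', number) for number in e164_numbers]
    # get_multi returns None for numbers that are not registered
    return [pu for pu in ndb.get_multi(keys) if pu is not None]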
This theoretically works well under the assumption that users enter their phone numbers with country code correctly so that the python phonenumbers library can parse it.
However, that is often not the case, yet this approach requires it because getting entities by key requires exact matches.
This was just one approach I had thought of and it has its drawbacks. This seems like a very common function in apps and I was wondering if there was a better approach.
In any case you'll need to normalize phone numbers to a common format (E164). We use libphonenumber, which works pretty well. You might check out the Python port.
We replace missing country codes in friends' phone numbers with the country code of the user doing the search. Rationale: if a user does not have a country code entered for a contact, then the contact is probably from the same country.
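That fallback is easy to express with the phonenumbers port: pass the searching user's country as the default region, which is only used when the number carries no country code (the sample numbers below are made up):

import phonenumbers

def normalize(raw_number, searcher_region):
    # Parse a contact's number, falling back to the searching user's country
    try:
        parsed = phonenumbers.parse(raw_number, searcher_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

print(normalize("(212) 555-0100", "US"))    # locally formatted number -> "+12125550100"
print(normalize("+44 20 7946 0958", "US"))  # explicit country code wins -> "+442079460958"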
Hint: things will get interesting when you want to implement reverse search - notifying existing users that one of their friends has shown up on the network.
The right way would be not to rely on users entering city/country codes and/or prefixes. I don't know anyone who saves their numbers with country codes unless they travel overseas frequently.
You will need to parse and correct the numbers. You can use the current geolocation to try to add missing city/country codes and also to remove unneeded prefixes. This probably involves some research into your target country audiences.
The sales folk in my start-up send and receive a bunch of mail on a daily basis from vendors, dealers and customers.
But they tend to lose track of these mails quite often... as to whether they have responded/followed up or not, and they waste a lot of time figuring this out.
Expecting them to use a mail tool like MailChimp is even more painful, and a ticketing tool is not a good fit for the job.
Hence, I am trying to build an app that can create a report of the total Email IDs interacted with in a particular date range. The only goal is to create a report, in a csv file or to dump the data into Google spreadsheets.
The report for the period entered by the user would look as below:
Email ID - all emails lying in "Sent Items" AND "Inbox" for the particular date range
Name - If present
Status
The "Status" would be:
Received not responded by sales person
Sent but not responded by recipient
Responded by Sales Person
...and so on
I am wary of running the script directly on the mail server and am not sure if Outlook Exchange would allow something like this.
I would prefer if it could be an application that runs directly on the sales person's machine.
A few use Macs and the others Windows. I would be focusing on the Macs first.
The mail tool used is Outlook for Mac 2011 and the machines run either Lion or Snow Leopard.
Mail is on Outlook Exchange.
I must confess to not being much of a coder, but I blunder/Google my way through it.
I had some time on my hand with the holiday season coming up, hence thought of taking this project up.
I am moderately comfortable with Python.
But this project, from what I have read, appears to be a job for AppleScript.
Before starting my blundering, I wanted to seek the advice of the SO community on the following:
Is AppleScript the best bet here? If yes, could you share the best resources to read up on it? I have a copy of "AppleScript: The Comprehensive Guide to Scripting and Automation on Mac OS X", but it is almost 6 years old.
Could it be done just using Python? I want to dump the reports into Google Spreadsheets, so it would be easier to have Python involved here.
Are there any similar applications that are already out there?
Or am I completely off track?
Sorry for the ramble, but I'm really looking forward to some assistance on this.
I liked:
AppleScript 1-2-3
and
AppleScript: The Definitive Guide
There are also some good tutorials here: MacScripter
That being said, you should consider the cost/benefit of learning AppleScript to accomplish one task at your company. You may be better off simply hiring someone to write the script for you and focus on growing your business instead.