Python - parse inconsistently delimited data table

I have a folder of emails that contain data that I need to extract and put into a Database. These emails are all formatted differently so I've grouped them by how similar their formats are. The following two emails bodies are examples of what I am trying to parse right now:
1) [first example email body omitted]
2) [second example email body omitted]
So in my attempts to extract the valuable data (the fish stock, the weight, the price, the sector, the date) I have tried several methods. I have a list of all possible 30+ stocks, and I run a RegEx on the entire email.
fishy_re = re.compile(r'({}).*?([\d,]+)'.format('|'.join(stocks)), re.IGNORECASE|re.MULTILINE|re.DOTALL)
This RegEx, I was told, will search for any occurrence of a fish, capture the next number that follows, and group the two together, and it does that job perfectly. But when I tried adding an additional .*?([\d,]+) chunk to capture the NEXT number (the price, as seen in email 2), it fails to do so.
Is my RegEx that tries to grab the price wrong?
Also, in trying to deal with emails that have a Package deal (email 1), I again tried using RegEx to search for any line that has the word Package and then capture the next number that follows on that line.
word = ['package']
package_re = re.compile(r'({}).*?([\d,]+)'.format('|'.join(word)), re.IGNORECASE|re.MULTILINE|re.DOTALL)
But that produces nothing, even when doing something as simple as:
with open(file_path) as f:
    for line in f:
        for match in package2_re.finditer(f.read()):
            print("yes")
It fails to print yes.
So is there a more effective way to extract the Package price information?
Thanks.

I created my own test email and parsed it like so:
import bs4 # BeautifulSoup html parsing
import email # built-in Python mail-parsing library
FNAME = "c:/users/Stephen/mail/test.eml" # full path to saved email
# load message
with open(FNAME) as in_f:
    msg = email.message_from_file(in_f)
# message is multipart/MIME - payload 0 is text, 1 is html
html_msg = msg.get_payload(1)
# grab the body of the message
body = html_msg.get_payload(decode=True)
# convert from bytes to unicode
html = body.decode()
# now parse to get table
table = bs4.BeautifulSoup(html, "html.parser").find("table")
data = [[cell.text.strip() for cell in row.find_all("td")] for row in table.find_all("tr")]
which returns something like
[
    ['', 'LIVE WGT', ''],
    ['BGE COD', '746', ''],
    ['GBW CODE', '13,894', ''],
    ['GOM COD', '60', 'Package deal $52,500.00'],
    # etc
]
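From there, the rows in data can be matched against the question's stocks list and scanned for a package note. A rough sketch (the column layout is assumed from the sample output above, so adjust as needed):

import re

records = []
package_price = None
for row in data:
    if len(row) < 2 or not row[0]:
        continue  # skip the header and blank rows
    name, weight = row[0], row[1]
    extra = row[2] if len(row) > 2 else ''
    if any(stock.lower() in name.lower() for stock in stocks):
        records.append((name, weight))
    # a note like "Package deal $52,500.00" sometimes rides along in the last column
    match = re.search(r'package.*?\$?\s*([\d,]+(?:\.\d{2})?)', extra, re.IGNORECASE)
    if match:
        package_price = match.group(1)

print(records)        # e.g. [('BGE COD', '746'), ('GBW CODE', '13,894'), ('GOM COD', '60')]
print(package_price)  # e.g. '52,500.00'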


Detecting and replacing xml from string in python

I have a file that contains text as well as some xml content dumped into it. It looks something like this:
The authentication details : <id>70016683</id><password>password#123</password>
The next step is to send the request.
The request : <request><id>90016133</id><password>password#3212</password></request>
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
I am using a python program to parse this file. I would like to replace the xml part with a placeholder: xml_obj. The output should look something like this:
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
At the same time I would also like to extract the replaced xml text and store it in a list. The list should contain None if the line doesn't have an xml object.
I have tried using regex for this purpose:
xml_tag = re.search(r"<\w*>", line)
if xml_tag:
    start_position = xml_tag.start()
    xml_word = xml_tag.group()[:1] + '/' + xml_tag.group()[1:]
    xml_pattern = r'{}'.format(xml_word)
    stop_position = re.search(xml_pattern, line).stop()
But this code retrieves the start and stop positions of only one xml tag and its content for the first line, and the entire format for the last line (in the input file). I would like to get all xml content irrespective of the xml structure and also replace it with 'xml_obj'.
Any advice would be helpful. Thanks in advance.
Edit :
I also want to apply the same logic to files that look like this :
The authentication details : ID <id>70016683</id> Password <password>password#123</password> Authentication details complete
The next step is to send the request.
The request : <request><id>90016133</id><password>password#3212</password></request> Request successful
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
The above files may have more than one xml object in a line.
They may also have some plain text after the xml part.
The following is a little convoluted, but assuming that your actual text is correctly represented by the sample in your question, try this:
txt = """[your sample text above]"""

lines = txt.splitlines()
entries = []
new_txt = ''
for line in lines:
    entry = line.replace(' <', ' xxx<', 1).split('xxx')
    if len(entry) == 2:
        entries.append(entry[1])
        entry[1] = "xml_obj"
        line = ''.join(entry)
    else:
        entries.append('none')
    new_txt += line + '\n'

for entry in entries:
    print(entry)
print('---')
print(new_txt)
Output:
<id>70016683</id><password>password#123</password>
none
<request><id>90016133</id><password>password#3212</password></request>
<Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
---
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
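For the edited examples, where a line can hold several xml objects plus trailing plain text, a regex with a backreference is one option. A rough sketch, reusing txt from above and assuming each xml object is well formed and stays on one line:

import re

# <(\w+)> grabs an opening tag, .*? spans any nested children, and the
# backreference \1 requires the matching closing tag
xml_chunk = re.compile(r'<(\w+)>.*?</\1>')

entries = []      # per line: list of xml strings found there, or None
new_lines = []
for line in txt.splitlines():
    matches = [m.group(0) for m in xml_chunk.finditer(line)]
    entries.append(matches if matches else None)
    new_lines.append(xml_chunk.sub('xml_obj', line))
new_txt = '\n'.join(new_lines)

print(new_txt)

With this, each xml object on a line becomes its own xml_obj placeholder, so a line like "ID <id>...</id> Password <password>...</password> Authentication details complete" turns into "ID xml_obj Password xml_obj Authentication details complete".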

How to make automatic requests to a website in python

I need to get data from this website.
It is possible to request information about parcels using a URL pattern, e.g. https://uldk.gugik.gov.pl/?request=GetParcelById&id=260403_4.0001.186/2.
The result for this example will look like this:
0
0103000020840800000100000005000000CBB2062D8C6F224110297D382512144128979BC870702241200E9D7C57161441CFC255973F702241C05EAADB7D161441C7AF26C2606F2241A0AD0EFB67121441CBB2062D8C6F224110297D3825121441
This is wkb format with information about geometry of the parcel.
The problem is:
I have an Excel spreadsheet with hundreds of parcel ids. How can I get each id from the Excel file, make a request as described at the beginning, and write the result to a file (for example, to Excel)?
Use the xlrd library to read the Excel file and process the parcel ids.
For each of the parcel ids you can access the URL and extract the required information. The following code does this job for the given URL:
import requests
r = requests.get('https://uldk.gugik.gov.pl/?request=GetParcelById&id=260403_4.0001.186/2')
result = str(r.content, 'utf-8').split()
# ['0', '0103000020840800000100000005000000CBB2062D8C6F224110297D382512144128979BC870702241200E9D7C57161441CFC255973F702241C05EAADB7D161441C7AF26C2606F2241A0AD0EFB67121441CBB2062D8C6F224110297D3825121441']
As you have several hundred of those ids, I'd write a function to do exactly this job:
import requests
def get_parcel_info(parcel_id):
    url = f'https://uldk.gugik.gov.pl/?request=GetParcelById&id={parcel_id}'
    r = requests.get(url)
    return str(r.content, 'utf-8').split()
get_parcel_info('260403_4.0001.186/2')
# ['0', '0103000020840800000100000005000000CBB2062D8C6F224110297D382512144128979BC870702241200E9D7C57161441CFC255973F702241C05EAADB7D161441C7AF26C2606F2241A0AD0EFB67121441CBB2062D8C6F224110297D3825121441']
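To run that over the spreadsheet, something like the sketch below could work. It uses pandas (read_excel needs openpyxl or xlrd installed) instead of calling xlrd directly, and it assumes a hypothetical parcels.xlsx with the ids in a column named id; adjust names and paths to your file:

import pandas as pd
import requests

def get_parcel_info(parcel_id):
    url = f'https://uldk.gugik.gov.pl/?request=GetParcelById&id={parcel_id}'
    r = requests.get(url)
    # first token is the status code, last token is the wkb geometry
    return str(r.content, 'utf-8').split()[-1]

df = pd.read_excel('parcels.xlsx')            # assumed input file name
df['wkb'] = df['id'].apply(get_parcel_info)   # assumed id column name
df.to_excel('parcels_with_geometry.xlsx', index=False)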

Feedparser not parsing searched for description

I'm trying to use RSS to get automatic notifications for specific security vulnerabilities I may be concerned with. I have it working for keywords in the title and URL of feed entries, but it seems to ignore the RSS description.
I've verified that the description field exists within the feed (I originally started with summary in place of description before discovering this), but I don't understand why it's not working (I'm relatively new to Python). Is it possibly a sanitization issue, or am I missing something about how the search is performed?
#!/usr/bin/env python3.6
import feedparser

# Keywords to search for in the rss feed
key_words = ['Chrome', 'Tomcat', 'linux', 'windows']

# get the urls we have seen prior
f = open('viewed_urls.txt', 'r')
urls = f.readlines()
urls = [url.rstrip() for url in urls]
f.close()

# Returns true if keyword is in string
def contains_wanted(in_str):
    for wrd in key_words:
        if wrd.lower() in in_str:
            return True
    return False

# Returns true if url result has not been seen before
def url_is_new(urlstr):
    # returns true if the url string does not exist
    # in the list of strings extracted from the text file
    if urlstr in urls:
        return False
    else:
        return True

# actual parsing phase
feed = feedparser.parse('https://nvd.nist.gov/feeds/xml/cve/misc/nvd-rss.xml')
for key in feed["entries"]:
    title = key['title']
    url = key['links'][0]['href']
    description = key['description']
    # formats and outputs the specified rss fields
    if contains_wanted(title.lower()) and contains_wanted(description.lower()) and url_is_new(url):
        print('{} - {} - {}\n'.format(title, url, description))
        # appends reoccurring rss feeds in the viewed_urls file
        with open('viewed_urls.txt', 'a') as f:
            f.write('{}\n'.format(title, url))
This helped. I was unaware of the conjunction logic, but I have resolved it: I omitted contains_wanted(title.lower()), since contains_wanted(description.lower()) fulfills the title check's purpose as well as its own, and I am now getting proper output.
Thank you pbn.
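For reference, the filter that ends up doing the work then looks roughly like this (the title check is dropped from the if in the script above; an or between the two checks would keep title-only matches as well):

# revised filter inside the entries loop: match on the description only
if contains_wanted(description.lower()) and url_is_new(url):
    print('{} - {} - {}\n'.format(title, url, description))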

scrape text from webpage using python 2.7

I'm trying to scrape data from this website:
Death Row Information
I'm having trouble scraping the last statements of all the executed offenders in the list because the last statement is located on another HTML page. The URL is built like this: http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname].html. I can't think of a way to scrape the last statements from these pages and put them in an SQLite database.
All the other info (except for "offender information", which I don't need) is already in my database.
Can anyone give me a pointer on getting started with this in Python?
Thanks
Edit2: I got a little bit further:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string

URLS = []
Lastwords = {}

conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()

# Make some fresh tables using executescript()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( link1 text, link2 text, Execution text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()

csvfile = open("prisonfile.csv", "rb")
creader = csv.reader(csvfile, delimiter=",")
for t in creader:
    cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t)

for column in cur.execute("SELECT LastName, Firstname FROM prison"):
    lastname = column[0].lower()
    firstname = column[1].lower()
    name = lastname + firstname
    CleanName = name.translate(None, ",.!-#'#$")
    CleanName2 = CleanName.replace(" ", "")
    Url = "http://www.tdcj.state.tx.us/death_row/dr_info/"
    Link = Url + CleanName2 + "last.html"
    URLS.append(Link)

for URL in URLS:
    try:
        page = urllib2.urlopen(URL)
    except URLError, e:
        if e.code == 404:
            continue
    soup = BeautifulSoup(page.read())
    statements = soup.findAll('p', {"class": "Last Statement:"})
    print statements

csvfile.close()
conn.commit()
conn.close()
The code is messy, I know. Once everything works I will clean it up. One problem though. I'm trying to get all the statements by using soup.findall, but I cannot seem to get the class right. The relevant part of the page source looks like this:
<p class="text_bold">Last Statement:</p>
<p>I don't have anything to say, you can proceed Warden Jones.</p>
However, the output of my program:
[]
[]
[]
...
What could be the problem exactly?
I will not write code that solves the problem, but will give you a simple plan for how to do it yourself:
You know that each last statement is located at the URL:
http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html
You say you already have all the other information. This presumably includes the list of executed prisoners. So you should generate a list of names in your python code. This will allow you to generate the URL to get to each page you need to get to.
Then make a For loop that iterates over each URL using the format I posted above.
Within the body of this for loop, write code to read the page and get the last statement. The last statement is in the same format on each page, so you can use parsing to capture the part that you want:
<p class="text_bold">Last Statement:</p>
<p>D.J., Laurie, Dr. Wheat, about all I can say is goodbye, and for all the rest of you, although you don’t forgive me for my transgressions, I forgive yours against me. I am ready to begin my journey and that’s all I have to say.</p>
Once you have your list of last statements, you can push them to SQL.
So your code will look like this:
import urllib2

# Make a list of names ('Last1First1','Last2First2','Last3First3',...)
names = #some_call_to_your_database

# Make a list of URLs to each inmate's last words page
# ('URL...Last1First1last.html', 'URL...Last2First2last.html', ...)
URLS = ()  # made from the 'names' list above

# Create a dictionary to hold all the last words:
LastWords = {}

# Iterate over each individual page
for eachURL in URLS:
    response = urllib2.urlopen(eachURL)
    html = response.read()

    ## Some prisoners had no last words, so those URLs will 404.
    if ...:  # Handle those 404s here

    ## Code to parse the response, hunting specifically
    ## for the code block I mentioned above. Once you have the
    ## last words as a string, save to dictionary:
    LastWords['LastFirst'] = "LastFirst's last words."

# Now LastWords is a dictionary with all the last words!
# Write some more code to push the content of LastWords
# to your SQL database.
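For the parsing step inside that loop, one hedged option (shown with bs4 rather than the older BeautifulSoup import from the question) keys off the markup shown above, where the statement sits in the <p> immediately after the "Last Statement:" label:

import urllib2
from bs4 import BeautifulSoup

def get_last_statement(url):
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    paragraphs = soup.find_all('p')
    for i, p in enumerate(paragraphs):
        # the label paragraph reads "Last Statement:"; the statement follows it
        if 'Last Statement' in p.get_text() and i + 1 < len(paragraphs):
            return paragraphs[i + 1].get_text(strip=True)
    return None  # page had no last statement block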

how to search protein data bank using author's full name (no initials)

I am trying to search the protein data bank by author's name, but the only option is to use the full last name and initials, so there are some false hits. Is there a way to do this with python? Below is the code I used:
import urllib2
#http://www.rcsb.org/pdb/software/rest.do#search
url = 'http://www.rcsb.org/pdb/rest/search'
queryText = """
<?xml version="1.0" encoding="UTF-8"?>
<orgPdbQuery>
<queryType>org.pdb.query.simple.AdvancedAuthorQuery</queryType>
<description>Author Name: Search type is All Authors and Author is Wang, R. and Exact match is true</description>
<searchType>All Authors</searchType>
<audit_author.name>Wang, R. </audit_author.name>
<exactMatch>true</exactMatch>
</orgPdbQuery>
"""
print "query:\n", queryText
print "querying PDB...\n"
req = urllib2.Request(url, data=queryText)
f = urllib2.urlopen(req)
result = f.read()
if result:
    print "Found number of PDB entries:", result.count('\n')
    print result
else:
    print "Failed to retrieve results"
PyPDB can perform an advanced search of the RCSB Protein Data Bank by author, keyword, or subject area. The repository is here but it can also be found on PyPI:
pip install pypdb
For your application, I'd suggest first doing a general keyword search for PDB IDs using the author's name, and then searching the resulting list of PDBs for entries containing the author's name in the metadata.
The keyword search, using the author's name as the query:
from pypdb import *
author_name = 'J.A. Doudna'
search_dict = make_query(author_name)
found_pdbs = do_search(search_dict)
Now iterate through the results looking for the author's name
matching_results = list()
for pdb_id in found_pdbs:
    desc_pdb = describe_pdb(pdb_id)
    if author_name in desc_pdb['citation_authors']:
        matching_results.append(pdb_id)
You can imagine using fancier regular expressions to handle slight variations in the way an author's name or initials are written. There might also be a nicer way to write this code that bundles the requests.
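As a rough illustration of that last point, here is a small hypothetical helper that tolerates a couple of common orderings of surname and initials when checking citation_authors (the formatting of that field is an assumption, so test against real entries):

import re

def author_matches(citation_authors, last_name, initials):
    # accept "Doudna, J.A.", "Doudna, J. A." or "J.A. Doudna" style entries
    init = r'\.?\s*'.join(re.escape(ch) for ch in initials) + r'\.?'
    pattern = r'\b{last},?\s*{init}|\b{init}\s+{last}\b'.format(
        last=re.escape(last_name), init=init)
    return re.search(pattern, citation_authors, re.IGNORECASE) is not None

print(author_matches('Jinek, M., Doudna, J.A.', 'Doudna', 'JA'))  # True
print(author_matches('J.A. Doudna, M. Jinek', 'Doudna', 'JA'))    # True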
