How to remove html commands

I have a .htm document. By using text_content(), I extracted the text from the document.
Here is the text:
'PART II \xa0\r\nThe Company\x92s common stock is traded on the over-the-counter market and is quoted on the NASDAQ Global Select Market under the symbol\r\nAAPL and on the Frankfurt Stock Exchange under the symbol APCD. Price Range of Common Stock The price range per share of common stock presented below represents the highest and lowest sales prices for the Company\x92s common stock\r\non the NASDAQ Global Select Market during each quarter of the two most recent years. \xa0\r\nHolders As of October\xa016, 2009, there were 30,573 shareholders of record. Dividends\r\n The Company did not declare or pay cash dividends in either 2009 or 2008. The Company anticipates that for the foreseeable\r\nfuture it will retain any earnings for use in the operation of its business. Purchases of Equity Securities by the Issuer and Affiliated\r\nPurchasers None. \xa0\r\n 33 '
With this text, I need to remove a heading which is preceded and followed by a blank line. Thus, lines of the following form should be removed:
\n
some text here\n
\n
I have code that does this for the .txt version of the document. However, in the text from the .htm document I noticed odd sequences like \xa0\r\n, which seem to be used for formatting (for example, to make words capital). Is there any way to remove all of these and correctly remove only the headings?
Here is the function that does remove the heading:
def clean_text_passage(a_text_string):
    """REMOVE /n: take a list of strings (some passage of text)
    and remove noise which is defined as lines that are preceded
    by a blank line and followed by a blank line that is lines of
    this form will not be in the output
    \n
    some text here\n
    \n
    """
    new_passage=[]
    p=[line+'\n' for line in a_text_string.split('\n')]
    passage = [w.lower().replace('</b>\n', '\n') for w in p]
    if len(passage[0].strip())>0:
        if len(passage[1].strip())>0:
            new_passage.append(passage[0])
    for counter, text_line in enumerate(passage[:-1]):
        len_line_before=len(passage[counter-1].strip())
        len_line_after=len(passage[counter+1].strip())
        if len_line_before==len_line_after==0:
            continue
        if len(text_line.strip())!=0:
            new_passage.append(text_line)
    if len(passage[-2].strip())!=0:
        if len(passage[-1].strip())!=0:
            new_passage.append(passage[-1])
    return new_passage
I guess the key is to identify the heading in htm document.
Thank you so much for your time and help.
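One possible starting point is to normalize the extracted text before looking for headings. The sketch below is only an illustration under some assumptions: that \xa0 is a non-breaking space, that \x92 is the Windows-1252 right single quotation mark that was decoded as Latin-1, and that headings really do end up on their own line between two blank lines after normalization (which depends on how the text was extracted). The function names normalize_extracted_text and drop_isolated_lines are made up for this example.

```python
def normalize_extracted_text(raw):
    """Replace extraction debris with plain equivalents (assumed meanings)."""
    text = raw.replace('\r\n', '\n').replace('\r', '\n')  # unify line endings
    text = text.replace('\xa0', ' ')   # non-breaking space -> ordinary space
    text = text.replace('\x92', "'")   # assumed cp1252 right single quote
    # Strip each line so lines holding only spaces count as blank.
    return '\n'.join(line.strip() for line in text.split('\n'))


def drop_isolated_lines(text):
    """Drop non-blank lines that sit between two blank lines."""
    lines = text.split('\n')
    kept = []
    for i, line in enumerate(lines):
        before_blank = i > 0 and not lines[i - 1].strip()
        after_blank = i < len(lines) - 1 and not lines[i + 1].strip()
        if line.strip() and before_blank and after_blank:
            continue  # treated as a heading
        kept.append(line)
    return '\n'.join(kept)


sample = 'Intro line\xa0\r\n\r\nHeading\r\n\r\nThe Company\x92s stock\r\nmore text'
cleaned = drop_isolated_lines(normalize_extracted_text(sample))
print(cleaned)
```

Unlike clean_text_passage above, this sketch never looks at index -1 for the first line, so the first and last lines are always kept.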

Related

How to extract string that contains specific characters in Python

I'm trying to extract ONLY one string that contains $ character. The input based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000

You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']

Since this is very simple, you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']

I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This works only if there is exactly one occurrence of '$', as you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
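If a regex is preferred after all, a small sketch along these lines might do. It assumes a price is written as a '$' immediately followed by digits, optionally with thousands separators and cents; the helper name extract_dollar_amounts is hypothetical, and the listing string is shortened from the question.

```python
import re

def extract_dollar_amounts(text):
    # '$' followed directly by digits; a lone '$' is skipped.
    return re.findall(r'\$(\d+(?:,\d{3})*(?:\.\d+)?)', text)

listing = ("it was several hundred $ to purchase after the fact.\n"
           "Asking $2000 obo. PayPal shipped. CONUS.")
print(extract_dollar_amounts(listing))  # ['2000']
```

Because the pattern requires a digit right after the '$', the stand-alone '$' in "several hundred $" does not produce a match.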

How to extract specific information from multi-line string

I have extracted some invoice related information from email body to Python strings, my next task is to extract the Invoice numbers from the string.
The format of emails could vary, hence it is getting difficult to find invoice number from the text. I also tried "Named Entity Recognition" from SpaCy but since in most of the cases the Invoice number is coming in next line from the heading 'Invoice' or 'Invoice#',the NER doesn't understand the relation and returns incorrect details.
Below are 2 examples of the text extracted from mail body:
Example - 1.
Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.
Example - 2.
Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19
My problem is that if I convert this entire text to a single string then this becomes something like this:
Invoice Date Purchase Order Due Date Balance 8754321 8/17/17
7200016508 9/16/18 140.72
As you can see, the invoice number (8754321 in this case) has changed its position and no longer follows the keyword "Invoice", which makes it more difficult to find.
My desired output is something like this:
Output Example - 1 -
8754321
5245344
Output Example - 2 -
7651234
9872341
I don't know how to retrieve the text just under the keyword "Invoice" or "Invoice#", which is the invoice number.
Please let me know if further information is required. Thanks!!
Edit: The invoice number doesn't have any pre-defined length, it can be 7 digit or can be more than that.

Code per my comments.
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
index = -1
# Find the table header row, print it, and record the column index of 'Invoice'
for line in email.split('\n'):
    if all(x != x.lower() for x in line.split()) and ('Invoice' in line) and len(line) > 0:
        print('--->', line, ' --- index of Invoice:', line.find('Invoice'))
        index = line.find('Invoice')
This uses the heuristic that every word in the column header row is capitalized or all capitals (ID). It would fail if, say, a heading was exactly 'Account no.' rather than 'Account No.'
# Print every number that starts at that column index
for line in email.split('\n'):
    words = line[index:].split()
    if words == []: continue
    word = words[0]
    try:
        print(int(word))
    except ValueError:
        continue
Reliability here depends on the data. In my code the Invoice column must be the first column of the table header, i.e. you can't have 'Invoice Date' before 'Invoice'. Obviously this would need fixing.
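The same idea can be wrapped in a function that returns the numbers instead of printing them. This is only a sketch under the same limitation stated above (the Invoice column is the first column of the table) plus one more assumption: the table rows directly follow the header row and the table ends at the first row that does not start with a number. The name invoice_numbers is made up for illustration.

```python
def invoice_numbers(email_text):
    """Collect the leading numbers of the rows under an 'Invoice' header."""
    numbers = []
    in_table = False
    for line in email_text.splitlines():
        tokens = line.split()
        if not in_table:
            # Header row: first word is 'Invoice' or 'Invoice#'.
            if tokens and tokens[0].rstrip('#') == 'Invoice':
                in_table = True
            continue
        if tokens and tokens[0].isdigit():
            numbers.append(tokens[0])   # any length, not just 7 digits
        else:
            break                       # first non-numeric row ends the table
    return numbers


email1 = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = '''Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19'''
print(invoice_numbers(email1))
print(invoice_numbers(email2))
```

Since it keys on the header row rather than on a digit count, it does not depend on the invoice number having exactly 7 digits.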

Going off what Andrew Allen was saying, as long as these 2 assumptions are true:
Invoice numbers are always exactly 7 numerical digits
Invoice numbers are always following a whitespace and followed by a whitespace
Using regex should work. Something along the lines of:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
invoices = re.findall(r'\s(\d\d\d\d\d\d\d)\s', email)
invoices in this case is a list of 2 strings, ['8754321', '5245344']

Using Regex. re.findall
Ex:
import re
email = '''Dear Customer:
The past due invoices listed below are still pending. This includes the
following:
Invoice Date Purchase Order Due Date Balance
8754321 8/17/17 7200016508 9/16/18 140.72
5245344 11/7/17 4500199620 12/7/18 301.54
We would appreciate quick payment of these invoices.'''
email2 = """Hi - please confirm the status of below two invoices.
Invoice# Amount Invoice Date Due Date
7651234 $19,579.06 29-Jan-19 28-Apr-19
9872341 $47,137.20 27-Feb-19 26-Apr-19 """
for eml in [email, email2]:
    print(re.findall(r"\b\d{7}\b", eml, flags=re.DOTALL))
Output:
['8754321', '5245344']
['7651234', '9872341']
\b - regex boundaries
\d{7} - get 7 digit number

A Python regex to find soccer team fixtures in string

I am using the Requests module to access the HTML from my target website and then using Beautiful Soup to select a specific element on the website. The element in question is a table that contains the results thus far of the English Premier League 2016/2017 season. The table contains the match date, the teams involved, the full-time score and the half-time score. I want to use Python to parse the HTML of the table element and extract the fixtures listed on there. The teams are always listed as:
Team A - Team B
A team name can be 1-3 separate strings (e.g. Burnley, Manchester United, West Ham United).
My attempt so far is:
import re
teamsRegex = re.compile(r'((\w+\s)+-(\s\w+)+)')
My logic here is that the first team can be 1-3 separate strings in length and each string is always followed by a white space. Therefore, the pattern (\w+\s)+ represents a string of any length followed by a white space and can be repeated 1 or many times. The second team name will always begin with a white space following the "-" character and again can be a string of any length, repeated 1 or many times (\s\w+)+.
I'm sort of achieving the desired results but the above is not entirely correct. I am returned a list with my desired result at index 0 followed by the first string of index 0 as index 1, and the last string in index 0 as index 2.
Example string:
'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
Regex finds:
[('Burnley - Swansea City', 'Burnley ', ' City'), ('0 - 1', '0 ', ' 1')]
I would just like it to find [('Burnley - Swansea City')]
Many thanks in anticipation of any help!

r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+'
Here you have two non-capturing groups, written (?:...), to match the teams' names; because nothing is captured, you get only the full match. I chose to use letters explicitly, so the expression only matches words beginning with capital letters and excludes digits. You should change that if the teams' names can contain digits (like "BVB 09").
Depending on the HTML file's content, one could add a final lookahead (?= align) to increase specificity.
Edit:
To match up to three capitals and optional '&'s, try this :
r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+'
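As a quick check, here is the first pattern applied to the example string from the question. With no capturing groups in the pattern, re.findall returns the full matches as plain strings rather than tuples; the variable names are just for illustration.

```python
import re

teams_regex = re.compile(r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+')
row = 'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
fixtures = teams_regex.findall(row)
print(fixtures)  # ['Burnley - Swansea City']
```

The score '0 - 1' is skipped because the pattern only accepts words that start with a capital letter.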

Python regex to parse financial data

I am relatively new to regex (always struggled with it for some reason)...
I have text that is of this form:
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
Parsing the text reveals the following structure:
1. Two or more words beginning the sentence, and before the first comma, is the name of the person involved in the transaction
2. One or more words before ('sold'|'bought'|'exercised'|'sold post-exercise') is the title of the person
3. Presence of either one of these: ('sold'|'bought'|'exercised'|'sold post-exercise') AFTER the title, identifies the transaction type
4. The first numeric string following the transaction type ('sold'|'bought'|'exercised'|'sold post-exercise') denotes the size of the transaction
5. 'price of ' precedes a numeric string, which specifies the price at which the deal was struck.
My question is:
How can I use this knowledge (and regex), to write a function that parses similar text to return the variables of interest (listed 1 - 5 above)?
Pseudo code for the function I want to write ..
def grok_directors_dealings_text(text_input):
    name, title, transaction_type, lot_size, price = (None, None, None, None, None)
    ....
    name = ...
    title = ...
    transaction_type = ...
    lot_size = ...
    price = ...
    pass
How would I use regex to implement the functions to return the variables of interest when passed in text that conforms to the structure I have identified above?
[[Edit]]
For some reason, I have seemed to struggle with regex for a while, if I am to learn from the correct answer here on S.O, it will be much better, if an explanation is offered as to why the magical expression (sorry, regexpr) actually works.
I want to actually learn this stuff instead of copy pasting expressions ...

You can use the following regex:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
DEMO
Python:
import re
financialData = """
David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...
Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...
Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...
"""
print(re.findall(r'(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)',financialData))
Output:
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00p'), ('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75p'), ('Albert Ellis', 'CEO', 'bought', '262', '52.00p')]
EDIT 1
To understand what each part means, follow the DEMO link; at the top right there is a block explaining every character of the expression.
Debuggex also helps you step through the string by showing which group matches which characters.
Here's a Debuggex demo for your particular case:
(.*?),\s(.*)\s(sold(?: post-exercise)?|bought|exercised)\s*([\d,]*).*price of\s*(\d*.\d+?p)
Debuggex Demo

I came up with this regex:
([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p
Debuggex Demo
Basically, we use parentheses to capture the information you want, so let's go through each group:
([\w ]+): \w matches any word character [a-zA-Z0-9_]; together with the space, repeated one or more times, this gives us the name of the person;
([\w ]+): another one of these, after a comma and a space, to get the title;
(sold post-exercise|sold|bought|exercised): then we search for our transaction types. Notice I put sold post-exercise before sold so that it tries to match the longer alternative first;
([\d,\.]+): then we try to find the numbers, which are made of digits (\d) and commas, and a dot may probably appear as well;
([\d\.,]+): then we need to get the price, which has basically the same pattern as the size of the transaction.
The parts of the regex that connect the groups are pretty basic as well.
If you try it on regex101 it provides some explanation about the regex and generates this code in python to use:
import re
p = re.compile(r'([\w ]+), ([\w ]+) (sold post-exercise|sold|bought|exercised) ([\d,\.]+).*price of ([\d\.,]+)p')
test_str = u"David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...\n\nMark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...\n\nAlbert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by..."
re.findall(p, test_str)

You can use the following regex that just looks for characters surrounding the delimiters:
(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p
The parts in parentheses will be captured as groups.
>>> import re
>>> l = ['''David Meredith, Financial Director sold post-exercise 15,000 shares in the company on YYYY-mm-dd at a price of 1044.00p. The Director now holds 6,290 shares representing 0.01% of the...''', '''Mark Brookes, Non Executive Director bought 811 shares in the company on YYYY-mm-dd at a price of 76.75p. The Director now holds 189,952 shares representing 0.38% of the shares in...''', '''Albert Ellis, CEO bought 262 shares in the company on YYYY-mm-dd at a price of 52.00p. The Director now holds 465,085 shares. NOTE: Purchased through Co's SIP Story provided by...''']
>>> for s in l:
...     print(re.findall(r'(.*?), (.*?) (sold post-exercise|bought|exercised|sold) (.*?) shares .*? price of (.*?)p', s))
...
[('David Meredith', 'Financial Director', 'sold post-exercise', '15,000', '1044.00')]
[('Mark Brookes', 'Non Executive Director', 'bought', '811', '76.75')]
[('Albert Ellis', 'CEO', 'bought', '262', '52.00')]

This is the regex that will do it:
(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d,]+).*?price of ([\d.]+)
You use it like this:
import re
def get_data(line):
    pattern = r"(.*?),(.*?)(sold post-exercise|sold|bought|exercised).*?([\d,]+).*?price of ([\d.]+)"
    m = re.match(pattern, line)
    return m.groups()
For the first line this will return
('David Meredith', ' Financial Director ', 'sold post-exercise', '15,000', '1044.00')
EDIT:
Adding an explanation. This regex works as follows:
the first characters, (.*?), mean: take the string until the next match (which is the ,)
. means any character
* means that it can occur many times (many characters and not just 1)
? means don't be greedy, which means that it will stop at the first ',' and not a later one (if there are many ',')
after that there is (.*?) again
again, take the characters until the next thing to match (which is one of the constant words)
after that there is (sold post-exercise|sold|bought|exercised), which means: find one of the words (separated by |)
after that there is a .*?, which again means take all text until the next match (this time it is not surrounded by (), so it won't be selected as a group and won't be part of the output)
([\d,]+) means take a digit (\d) or a comma; the + stands for one or more times
again .*? like before
'price of ' finds the actual string 'price of '
and last, ([\d.]+) means again take a digit or a dot, one or more times (inside [] the . is a literal dot, not the regex 'any character')
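To tie this back to the function skeleton in the question, here is one way it might be filled in. It is a sketch, not a definitive implementation: it assumes every input has the five-part structure listed in the question, uses named groups with re.VERBOSE so each part can carry a comment, and returns a dict (or None when nothing matches). Converting lot_size to int and price to float is an extra choice made here, not something the question asked for.

```python
import re

DEALING_RE = re.compile(
    r"""
    (?P<name>[^,]+),\s+                  # 1. name: everything before the first comma
    (?P<title>.+?)\s+                    # 2. title: shortest text before the verb
    (?P<transaction_type>sold\ post-exercise|sold|bought|exercised)\s+  # 3. verb
    (?P<lot_size>[\d,]+)\s+shares        # 4. first number after the verb
    .*?price\ of\s+                      # skip ahead to the literal 'price of'
    (?P<price>\d+(?:\.\d+)?)p            # 5. price in pence, without the 'p'
    """,
    re.VERBOSE,
)

def grok_directors_dealings_text(text_input):
    m = DEALING_RE.search(text_input)
    if m is None:
        return None                      # text does not follow the structure
    fields = m.groupdict()
    fields["lot_size"] = int(fields["lot_size"].replace(",", ""))
    fields["price"] = float(fields["price"])
    return fields

line = ("David Meredith, Financial Director sold post-exercise 15,000 shares "
        "in the company on YYYY-mm-dd at a price of 1044.00p. "
        "The Director now holds 6,290 shares representing 0.01% of the...")
print(grok_directors_dealings_text(line))
```

In verbose mode whitespace in the pattern is ignored, which is why the literal spaces in 'sold post-exercise' and 'price of' are escaped with a backslash.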

BeautifulSoup: when pulling text from a section, <emph> and other tags are ignored causing adjacent words to be pushed together

I have an XML document. I want to pull all the text between all <p>..</p> tags. Below is an example of the text. The problem is that in a sentence like:
"Because the <emph>raspberry</emph> and.."
the output is "Because theraspberryand...". The emph tags are being dropped (which is good), but in a way that pushes the adjacent words together.
Here is the relevant code I am using:
xml = BeautifulSoup(xml, convertEntities=BeautifulSoup.HTML_ENTITIES)
for para in xml.findAll('p'):
    text = text + " " + para.text + " "
Here is the start of the text, in case the full text helps:
<!DOCTYPE art SYSTEM "keton.dtd">
<art jid="PNAS" aid="1436" vid="94" iss="14" date="07-08-1997" ppf="7349" ppl="7355">
<fm>
<doctopic>Developmental Biology</doctopic>
<dochead>Inaugural Article</dochead>
<docsubj>Biological Sciences</docsubj>
<atl>Suspensor-derived polyembryony caused by altered expression of
valyl-tRNA synthetase in the <emph>twn2</emph>
mutant of <emph>Arabidopsis</emph></atl>
<prs>This contribution is part of the special series of Inaugural
Articles by members of the National Academy of Sciences elected on
April 30, 1996.</prs>
<aug>
<au><fnm>James Z.</fnm><snm>Zhang</snm></au>
<au><fnm>Chris R.</fnm><snm>Somerville</snm></au>
<fnr rid="FN150"><aff>Department of Plant Biology, Carnegie Institution of Washington,
290 Panama Street, Stanford CA 94305</aff>
</fnr></aug>
<acc>May 9, 1997</acc>
<con>Chris R. Somerville</con>
<pubfront>
<cpyrt><date><year>1997</year></date>
<cpyrtnme><collab>The National Academy of Sciences of the USA</collab></cpyrtnme></cpyrt>
<issn>0027-8424</issn><extent>7</extent><price>2.00/0</price>
</pubfront>
<fn id="FN150"><p>To whom reprint requests should be addressed. e-mail:
<email>crs#andrew.stanford.edu</email>.</p>
</fn>
<abs><p>The <emph>twn2</emph> mutant of <emph>Arabidopsis</emph>
exhibits a defect in early embryogenesis where, following one or two
divisions of the zygote, the decendents of the apical cell arrest. The
basal cells that normally give rise to the suspensor proliferate
abnormally, giving rise to multiple embryos. A high proportion of the
seeds fail to develop viable embryos, and those that do, contain a high
proportion of partially or completely duplicated embryos. The adult
plants are smaller and less vigorous than the wild type and have a
severely stunted root. The <emph>twn2-1</emph> mutation, which is the
only known allele, was caused by a T-DNA insertion in the 5′
untranslated region of a putative valyl-tRNA synthetase gene,
<it>valRS</it>. The insertion causes reduced transcription of the
<it>valRS</it> gene in reproductive tissues and developing seeds but
increased expression in leaves. Analysis of transcript initiation sites
and the expression of promoter–reporter fusions in transgenic plants
indicated that enhancer elements inside the first two introns interact
with the border of the T-DNA to cause the altered pattern of expression
of the <it>valRS</it> gene in the <emph>twn2</emph> mutant. The
phenotypic consequences of this unique mutation are interpreted in the
context of a model, suggested by Vernon and Meinke &lsqbVernon, D. M. &
Meinke, D. W. (1994) <emph>Dev. Biol.</emph> 165, 566–573&rsqb, in
which the apical cell and its decendents normally suppress the
embryogenic potential of the basal cell and its decendents during early
embryo development.</p>
</abs>
</fm>

I think the problem here is that you're trying to write bs4 code with bs3.
The obvious fix is to use bs4 instead.
But in bs3, the docs show two ways to get all of the text recursively from all contents of a soup:
''.join(e for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
''.join(soup.findAll(text=True))
You can obviously change either one of those to explicitly strip whitespace off the edges and add exactly one space between each node instead of relying on whatever space might be there:
' '.join(e.strip() for e in soup.recursiveChildGenerator() if isinstance(e, unicode))
' '.join(map(unicode.strip, soup.findAll(text=True)))
I wouldn't want to guarantee that this will be exactly the same as the bs4 text property… but I think it's what you want here.
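For comparison, a minimal bs4 sketch, assuming the beautifulsoup4 package is installed and using the standard-library "html.parser" backend. The sample string is made up for illustration. In bs4, get_text() joins the text nodes without stripping them by default, so the spaces around the emph tags should survive; the split/join step only collapses runs of whitespace such as line breaks.

```python
from bs4 import BeautifulSoup

snippet = ("<abs><p>Because the <emph>raspberry</emph> and the\n"
           "<emph>twn2</emph> mutant</p></abs>")
soup = BeautifulSoup(snippet, "html.parser")

# One cleaned string per <p>; inner tags are dropped, their text is kept.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
print(paragraphs)  # ['Because the raspberry and the twn2 mutant']
```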
