Parsing fixed-format data embedded in HTML in python - python

I am using google's appengine api
from google.appengine.api import urlfetch
to fetch a webpage. The result of
result = urlfetch.fetch("http://www.example.com/index.html")
is a string of the html content (in result.content). The problem is the data that I want to parse is not really in HTML form, so I don't think using a python HTML parser will work for me. I need to parse all of the plain text in the body of the html document. The only problem is that urlfetch returns a single string of the entire HTML document, removing all newlines and extra spaces.
EDIT:
Okay, I tried fetching a different URL and apparently urlfetch does not strip the newlines, it was the original webpage I was trying to parse that served the HTML file that way...
END EDIT
If the document is something like this:
<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A 288 AAA
</body></html>
result.content will be this, after urlfetch fetches it:
'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987 2009-01-02 JSE...A4A 288 AAA</body></html>'
Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expresions to parse my data, but as you can see the last part of one line gets combined with the first part of the next line, and I don't know how to split it. I tried
result.content.split('\n')
and
result.content.split('\r')
but the resulting list was all just 1 element. I don't see any options in google's urlfetch function to not remove newlines.
Any ideas how I can parse this data? Maybe I need to fetch it differently?
Thanks in advance!

I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.
I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like
import re
data = re.findall('<body>([^\<]*)</body>', result)[0]
then, it should be as easy as:
start = 0
end = 5
while (end<len(data)):
print data[start:end]
start = end+1
end = end+5
print data[start:]
(note: I did not check this code against boundary cases, and I do expect it to fail. It is only here to show the generic idea)

Only suggestion I can think of is to parse it as if it has fixed width columns. Newlines are not taken into consideration for HTML.
If you have control of the source data, put it into a text file rather than HTML.

Once you have the body text as a single, long string, you can break it up as follows.
This presumes that each record is 26 characters.
body= "AAA 123 888 2008-10-30 ABCBBB 987 2009-01-02 JSE...A4A 288 AAA"
for i in range(0,len(body),26):
line= body[i:i+26]
# parse the line

EDIT: Reading comprehension is a desirable thing. I missed the bit about the lines being run together with no separator between them, which would kinda be the whole point of this, wouldn't it? So, nevermind my answer, it's not actually relevant.
If you know that each line is 5 space-separated columns, then (once you've stripped out the html) you could do something like (untested):
def generate_lines(datastring):
while datastring:
splitresult = datastring.split(' ', 5)
if len(splitresult) >= 5:
datastring = splitresult[5]
else:
datastring = None
yield splitresult[:5]
for line in generate_lines(data):
process_data_line(line)
Of course, you can change the split character and number of columns as needed (possibly even passing them into the generator function as additional parameters), and add error handling as appropriate.

Further suggestions for splitting the string s into 26-character blocks:
As a list:
>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
'BBB 987 2009-01-02 JSE',
'A4A 288 AAA']
As a generator:
>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print line
AAA 123 888 2008-10-30 ABC
BBB 987 2009-01-02 JSE
A4A 288 AAA
Replace range() with xrange() in Python 2.x if s is very long.

Related

Parse Output for Python

My software outputs these two types of output:
-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\\_WiE10-18.0.100-77.iso
-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log
I would like to get the file names from both outputs:
E:\\_WiE10-18.0.100-77.iso, for the first one
E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log, for the second one
If i use something like the code below, it won't work if the second parameter has spaces in it. It works if there aren't any spaces in the Domain Username.
for item in outputs:
outputs.extend(item.split())
for item2 in [' '.join(outputs[6:])]:
new_list.append(item2)
How can I get all the parameters individually, including the filenames?
If regex is an option:
text = """-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\\_WiE10-18.0.100-77.iso
-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log"""
import re
for h in re.findall(r"^.*?\d\d:\d\d:\d\d (.*)",text,flags=re.MULTILINE):
print(h)
Output:
E:\_WiE10-18.0.100-77.iso
E:\program files\cluster groups\sql server (mssqlserver)\logs\progress-MOD-1523883344023-3001-Windows.log
Pattern explained:
The pattern r"^.*?\d\d:\d\d:\d\d (.*)" looks for linestart '^' + as less anythings as possible '.*?' + the time-stamp '\d\d:\d\d:\d\d ' followed by a space and captures all behind it till end of line into a group.
It uses the re.MULTILINE flag for that.
Edit:
Capturing the individual things needs some more capturing groups:
import re
for h in re.findall(r"^([rwexXst-]+) ([^0-9]+) +\d+.+? +(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)",text,flags=re.MULTILINE):
# ^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^
# flags grpName datetime filename
for k in h:
print(k)
print("")
Output:
-rwx------
Administrators/Domain Users
2018-04-16 16:04:40
E:\_WiE10-18.0.100-77.iso
-rwxrwx---
Administrators/unknown
2018-04-17 01:33:23
E:\program files\cluster groups\sql server (mssqlserver)\logs\progress-MOD-1523883344023-3001-Windows.log
You could use a regular expression like
\b[A-Z]:\\\\.+
Aside from using regex, you can try something similar to this.
output = '-rwx------ ... 2018-04-16 16:04:40 E:\\\\_WiE10-18.0.100-77.iso'
drive_letter_start = output.find(':\\\\')
filename = output[drive_letter_start - 1:]
It looks for the first occurrence of ':\\'and gets the drive letter before the substring (i.e. ':\\') and the full file path after the substring.
EDIT
Patrick Artner's answer is better and completely answers OP's question compared to this answer. This only encompasses capturing the file path. I am leaving this answer here should anyone find it useful.

Extracting #mentions from tweets using findall python (Giving incorrect results)

I have a csv file something like this
text
RT #CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in #CellCellPress htp://.co/HrjDwbm7NN
RT #gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT #sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT #MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via #nucAmbiguous htp://…
I want to extract all the mentions (starting with '#') from the tweet text. So far I have done this
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
result = re.findall("(^|[^#\w])#(\w{1,25})", str(X.iloc[:i,:]))
print(result);
There are two problems here:
First: at str(X.iloc[:1,:]) it gives me ['CritCareMed'] which is not ok as it should give me ['CellCellPress'], and at str(X.iloc[:2,:]) it again gives me ['CritCareMed'] which is of course not fine again. The final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mentions in 2nd row and both two mentions in last row.
What I want should look something like this:
How can I achieve these results? this is just a sample data my original data has lots of tweets so is the approach ok?
You can use str.findall method to avoid the for loop, use negative look behind to replace (^|[^#\w]) which forms another capture group you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![#\w])#(\w{1,25})').apply(','.join)
df
# text mention
#0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
#1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
#2 RT #gvwilson: Where's the theory for software ... gvwilson
#3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
#4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
Also X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell, to extract the actual string from the text column, you can use X.text.iloc[0], or a better way to iterate through a column, use iteritems:
import re
for index, s in df.text.iteritems():
result = re.findall("(?<![#\w])#(\w{1,25})", s)
print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
While you already have your answer, you could even try to optimize the whole import process like so:
import re, pandas as pd
rx = re.compile(r'#([^:\s]+)')
with open("test.txt") as fp:
dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
df = pd.DataFrame(dft, columns = ['text', 'mention'])
print(df)
Which yields:
text mention
0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
2 RT #gvwilson: Where's the theory for software ... gvwilson
3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
This might be a bit faster as you don't need to change the df once it's already constructed.
mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
#.*? carries out a non-greedy match for a word starting
with a hashtag
(?=\s|$) look-ahead for the end of the word or end of the sentence
(?:(?<=\s)|(?<=^)) look-behind to ensure there are no false positives if a # is used in the middle of a word
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.

Python regex for UK number

Below given are the UK phone numbers need to fetch from text file:
07791523634
07910221698
But it only print 0779152363, 0791022169 skipping the 11th character.
Also it produce unnecessary values like ('')
Ex : '', '07800 854536'
Below is the regex I've used:
phnsrch = re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{5}|\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|/^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$/|')
Need help to fetch the complete set of 11 numbers without any unnecessary symbols
Finally figured out the solution for matching the UK numbers below:
07540858798
0113 2644489
02074 735 217
07512 850433
01942 896007
01915222200
01582 492734
07548 021 475
020 8563 7296
07791523634
re.compile(r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})')
Thanks to those who helped me with this issue.
I think your regex is too long and can be more easier, try this regex instead:
^(07\d{8,12}|447\d{7,11})$

Why won't BeautifulSoup find text in a table in Python?

I'm trying to check whether a numerical value is found in a table. Why would this code not find the numerical text "699" in this table? The print statement gives a value of "None."
html = """
<table>
December 31, 1997 1996 1995 1994 1993
Allowance for credit losses--loans 699 773
Allowance for credit losses--
trading assets 285 190
Allowance for credit losses--
other liabilities 13 10
- --------------------------------------------------------------------------------
Total $ 997 $ 973 $ 992 $1,252 $1,324
================================================================================
</table>
"""
soup = BeautifulSoup(''.join(html))
table = soup.find('table')
test = table.find(text='699')
print test
table.find() will search all tags inside the table, but there are no tags inside the table. There is just a string, which happens to be an ASCII table which is in no way formatted as HTML.
If you want to use BeautifulSoup to parse the table, you need to convert it into an HTML table first. Otherwise you can use table.string to get the string itself and parse that with regex.
If you pass a string as an argument into a Beautiful Soup find() method, Beautiful Soup looks for that exact string. Passing in text='699' will find the string "699", but not a longer string that includes "699".
To find strings that contain a substring, you can use a custom function or a regular expression:
import re
table.find(text=re.compile('699')
table.find(text=lambda x: '699' in x)

regular expression search in python

I am trying to parse some data and just started reading up on regular Expressions so I am pretty new to it. This is the code I have so far
String = "MEASUREMENT 3835 303 Oxygen: 235.78 Saturation: 90.51 Temperature: 24.41 DPhase: 33.07 BPhase: 29.56 RPhase: 0.00 BAmp: 368.57 BPot: 18.00 RAmp: 0.00 RawTem.: 68.21"
String = String.strip('\t\x11\x13')
String = String.split("Oxygen:")
print String[1]
String[1].lstrip
print String[1]
What I am trying to do is to do is remove the oxygen data (235.78) and put it in its own variable using an regular expression search. I realize that there should be an easy solution but I am trying to figure out how regular expressions work and they are making my head hurt. Thanks for any help
Richard
re.search( r"Oxygen: *([\d.]+)", String ).group( 1 )
import re
string = "blabla Oxygen: 10.10 blabla"
regex_oxygen = re.compile('''Oxygen:\W+([0-9.]*)''')
result = re.findall(regex_oxygen,string)
print result
What for?
print String.split()[4]
For general parsing of lists like this one could
import re
String = "MEASUREMENT 3835 303 Oxygen: 235.78 Saturation: 90.51"
String = String.replace(':','')
value_list=re.split("MEASUREMENT\W+[0-9]+\W+[0-9]+\W",String)[1].rstrip().split()
values = dict(zip(value_list[::2],map(float,value_list[1::2])))
I believe the answer to you specific problem has been posted. However I wanted to show you a few ressource for regular expression for python. The python documentation on regular expression is the place to start.
O'reilly also has many good books on the subject, either if you want to understand regular expression deep down or just enough to make things work.
Finally regular-expressions.info is a good ressource for regular expression among mainstream languages. You can even test your regular expression on the website.
I would like to share my ?is this an email? regex expresion, just to inspire you. :)
9 emailregex = "^[a-zA-Z.a-zA-Z]+#mycompany.org$"
10
11 def validateEmail(email):
12 """returns 1 if is an email, 0 if not """
13 # len(x.y#mycompany.org) = 17
14 if len(email)>=17:
15 if re.match(emailregex,email)!= None:
16 return 1
17 return 0

Categories