Strip off characters from output - python

I have the following structure generated by bs4, python.
['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu']
['L10038779', '9551154555', ',', ',']
['R10831945', '9150000747, 9282109134, 9043728565', ',', ',']
['B10750123', '9952946340', '', 'Dealer', 'Bala']
['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']
I want to strip characters off so that I get something like the following
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340 , Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
I am using print re.findall("'([a-zA-Z0-9,\s]*)'", eachproperty['onclick'])
So basically I want to remove the "[]" and "''" and "," and the random ID at the start.
Update
onclick="try{appendPropertyPosition(this,'Y10765227','9884877926, 9283183326','','Dealer','Rgmuthu');jsb9onUnloadTracking();jsevt.stopBubble(event);}catch(e){};"
So I am scraping from this onclick attribute to get the above mentioned data.

You can use a combination of str.join and str.translate here:
>>> from string import punctuation, whitespace
>>> lis = [['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu'],
...        ['L10038779', '9551154555', ',', ','],
...        ['R10831945', '9150000747, 9282109134, 9043728565', ',', ','],
...        ['B10750123', '9952946340', '', 'Dealer', 'Bala'],
...        ['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']]
>>> for item in lis:
...     print ", ".join(x for x in item[1:]
...                     if x.translate(None, punctuation + whitespace))
...
9884877926, 9283183326, Dealer, Rgmuthu
9551154555
9150000747, 9282109134, 9043728565
9952946340, Dealer, Bala
9841280752, 9884797013, Dealer, Senthil
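Note that str.translate(None, chars) and the print statement above are Python 2 only; in Python 3 the same deletion table can be built with str.maketrans, roughly:

```python
from string import punctuation, whitespace

lis = [['Y10765227', '9884877926, 9283183326', '', 'Dealer', 'Rgmuthu'],
       ['L10038779', '9551154555', ',', ','],
       ['R10831945', '9150000747, 9282109134, 9043728565', ',', ','],
       ['B10750123', '9952946340', '', 'Dealer', 'Bala'],
       ['R10763559', '9841280752, 9884797013', '', 'Dealer', 'Senthil']]

# a translation table whose third argument lists the characters to delete
delete_table = str.maketrans('', '', punctuation + whitespace)

for item in lis:
    # keep a field only if something remains after deleting punctuation/whitespace
    print(", ".join(x for x in item[1:] if x.translate(delete_table)))
```

This prints the same five lines as the Python 2 version above.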

Related

splitting text further while preserving line breaks

I am splitting a text paragraph while preserving the line breaks \n, using the following:
from nltk import SpaceTokenizer
para="\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*"
sent=SpaceTokenizer().tokenize(para)
Which gives me the following
print(sent)
['\n[STUFF]\n', '', 'comma,', '', 'with', 'period.', 'the', 'new', 'question?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
My goal is to get the following output
['\n[STUFF]\n', '', 'comma', ',', '', 'with', 'period', '.', 'the', 'new', 'question', '?', '\n\nthe\n', '', '\nline\n', 'new', 'char*']
That is to say, I would like to split 'comma,' into 'comma', ','; split 'period.' into 'period', '.'; and split 'question?' into 'question', '?', all while preserving the \n.
I have tried word_tokenize, which does split off 'comma', ',', and so on, but it does not preserve the \n.
What can I do to further split sent as shown above while preserving \n?
https://docs.python.org/3/library/re.html#re.split is probably what you want.
From the looks of your desired output however, you're going to need to process the string a bit more than just applying a single function to it.
I would start by replacing all of the \n with a string like new_line_goes_here before splitting the string up, and then replacing new_line_goes_here with \n once it's all split up.
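A sketch of that placeholder idea (the marker name is arbitrary; it only needs to be a run of word characters that cannot occur in the text):

```python
import re

para = "\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*"

# protect the newlines behind a marker, split on non-word runs, then restore them
marker = "new_line_goes_here"
protected = para.replace("\n", marker)
tokens = [t.replace(marker, "\n") for t in re.split(r"(\W+)", protected) if t]
print(tokens)
```

Because the marker is made of word characters, runs such as "\n\nthe\n" survive as single tokens while "comma," still splits into "comma" and ", ".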
Per @randy's suggestion to look at https://docs.python.org/3/library/re.html#re.split:
import re
para = re.split(r'(\W+)', '\n[STUFF]\n comma, with period. the new question? \n\nthe\n \nline\n new char*')
print(para)
Output (close to what I am looking for)
['', '\n[', 'STUFF', ']\n ', 'comma', ', ', 'with', ' ', 'period', '. ', 'the', ' ', 'new', ' ', 'question', '? \n\n', 'the', '\n \n', 'line', '\n ', 'new', ' ', 'char', '*', '']

python re.split(): how to save some of the delimiters (instead of all the delimiter by using bracket)

For the sentences:
"I am very hungry, so mum brings me a cake!"
I want it split by delimiters, and I want all the delimiters except spaces to be saved as well. So the expected output is:
"I" "am" "very" "hungry" "," "so", "mum" "brings" "me" "a" "cake" "!" "\n"
What I am currently doing is re.split(r'([!:''".,(\s+)\n])', text), which splits the whole sentence but also saves a lot of space characters that I don't want. I've also tried the regular expression \s|([!:''".,(\s+)\n]), which somehow gives me a lot of None values.
search or findall might be more appropriate here than split:
import re
s = "I am very hungry, so mum brings me a !#$## cake!"
print(re.findall(r'[^\w\s]+|\w+', s))
# ['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', '!#$##', 'cake', '!']
The pattern [^\w\s]+|\w+ means: a sequence of symbols which are neither alphanumeric nor whitespace OR a sequence of alphanumerics (that is, a word)
That is because your regular expression contains a capture group; because of that capture group, the captured delimiters are also included in the result. But this is likely what you want.
The only challenge is to filter out the Nones (and other values whose truthiness is False) that appear when a branch does not match, which we can do with:
def tokenize(text):
    return filter(None, re.split(r'[ ]+|([!:''".,\s\n])', text))
For your given sample text, this produces:
>>> list(tokenize("I am very hungry, so mum brings me a cake!\n"))
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']
One approach is to surround the special characters (,!.\n) with space and then split on space:
import re
def tokenize(t, pattern="([,!.\n])"):
    return [e for e in re.sub(pattern, r" \1 ", t).split(' ') if e]
s = "I am very hungry, so mum brings me a cake!\n"
print(tokenize(s))
Output
['I', 'am', 'very', 'hungry', ',', 'so', 'mum', 'brings', 'me', 'a', 'cake', '!', '\n']

Python Regex not returning phone numbers

Given the following code:
import re
file_object = open("all-OANC.txt", "r")
file_text = file_object.read()
pattern = "(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.findall(pattern, file_text):
    print match
I get output that stretches like this:
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
('', '', '', '-', '-')
I'm trying to find phone numbers, and I am one hundred percent sure there are numbers in the file. When I search with the same expression in an online regex tester, for example, I get matches.
Here is a snippet where the expression is found outside of python:
"Slate on Paper," our
specially formatted print-out version of Slate, is e-mailed to readers
Friday around midday. It also can be downloaded from our
site. Those services are free. An actual paper edition of "Slate on Paper"
can be mailed to you (call 800-555-4995), but that costs money and can take a
few days to arrive."
I want output that at least recognizes the presence of a number.
It's your capture groups that are being displayed. Display the whole match:
text = '''"Slate on Paper," our specially formatted print-out version of Slate, is e-mailed to readers Friday around midday. It also can be downloaded from our site. Those services are free. An actual paper edition of "Slate on Paper" can be mailed to you (call 800-555-4995), but that costs money and can take a few days to arrive."'''
pattern = "(\+?1-)?(\()?[0-9]{3}(\))?(-|.)[0-9]{3}(-|.)[0-9]{4}"
for match in re.finditer(pattern, text):
    print(match.group())
Output:
800-555-4995
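If you would rather keep re.findall, a sketch of the same idea with every group made non-capturing (and the bare dot, which matches any character, tightened into a [-.] class):

```python
import re

text = ('An actual paper edition of "Slate on Paper" can be mailed to you '
        '(call 800-555-4995), but that costs money.')

# (?:...) groups do not capture, so findall returns the whole match
pattern = r"(?:\+?1-)?(?:\()?[0-9]{3}(?:\))?[-.][0-9]{3}[-.][0-9]{4}"
print(re.findall(pattern, text))  # ['800-555-4995']
```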

Python regex: tokenizing English contractions

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of "shouldn't" would be ["should", "n't"].
However, the nltk module does not seem to be up to the task, as:
"I wouldn't've done that."
tokenizes as:
['I', "wouldn't", "'ve", 'done', 'that', '.']
where the desired tokenization of "wouldn't've" was: ['would', "n't", "'ve"]
After examining common English contractions, I am trying to write a regex to do the job but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:
n't, 've, 'd, 'll, 's, 'm, 're
But the token "'ve" can also follow other contractions such as:
'd've, n't've, and (conceivably) 'll've
At the moment, I am trying to wrangle this regex:
\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b
However, this pattern also matches the badly formed:
"wouldn't've've"
It seems the problem is that the third apostrophe qualifies as a word boundary so that the final "'ve" token matches the whole regex.
I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to advice for alternative strategies.
Also, I am curious if there is any way to include the word boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace and there doesn't seem to be a way around this.
EDIT:
Here's the output:
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've done that.")
>>> print matches
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]
I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
You can use the following list of patterns, joined into one regex:
import re
patterns_list = [r'\s',r'(n\'t)',r'\'m',r'(\'ll)',r'(\'ve)',r'(\'s)',r'(\'re)',r'(\'d)']
pattern=re.compile('|'.join(patterns_list))
s="I wouldn't've done that."
print [i for i in pattern.split(s) if i]
result :
['I', 'would', "n't", "'ve", 'done', 'that.']
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
EDIT: \2 is the match, \3 is the first group, \4 the second and \5 the third.
You can use this regex to tokenize the text:
(?:(?!.')\w)+|\w?'\w+|[^\s\w]
Usage:
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']
so:
>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']
Here is a simple one:
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
.replace("'s ", ' is ').replace("'m ", ' am ') \
.replace("'ll ", ' will ').replace("'d ", ' would ') \
.replace("'re ", ' are ').replace("'ve ", ' have ')
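The chain of replace calls can also be written as a single re.sub pass over a mapping (the mapping here is illustrative, not exhaustive, and like the chain above it lowercases the text and will mangle possessive 's):

```python
import re

# "won't" is listed before the generic "n't" so it wins the alternation
contractions = {
    "won't": "will not",
    "n't": " not",
    "'s": " is",
    "'m": " am",
    "'ll": " will",
    "'d": " would",
    "'re": " are",
    "'ve": " have",
}
pattern = re.compile("|".join(re.escape(k) for k in contractions))

def expand(text):
    # each match is looked up in the mapping and replaced in one pass
    return pattern.sub(lambda m: contractions[m.group()], text.lower())

print(expand("I won't do it, she'll see"))  # i will not do it, she will see
```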

Python: Regex outputs 12_34 - I need 1234

So I have input coming in as follows: 12_34 5_6_8_2 4_____3 1234
and the output I need from it is: 1234, 5682, 43, 1234
I'm currently working with r'[0-9]+[0-9_]*'.replace('_',''), which, as far as I can tell, successfully rejects any input which is not a combination of numeric digits and under-scores, where the underscore cannot be the first character.
However, replacing the _ with the empty string causes 12_34 to come out as 12 and 34.
Is there a better method than 'replace' for this? Or could I adapt my regex to deal with this problem?
EDIT: I was responding to questions in the comments below and realised it might be better specified up here.
So, the broad aim is to take a long input string, a small example being:
"12_34 + 'Iamastring#' I_am_an_Ident"
and return:
('NUMBER', 1234), ('PLUS', '+'), ('STRING', 'Iamastring#'), ('IDENT', 'I_am_an_Ident')
I didn't want to go through all that because I've got it all working as specified, except for number.
The solution code looks something like:
tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE',
          'IDENT', 'STRING', 'NUMBER')
t_PLUS = "+"
t_MINUS = '-'
and so on, down to:
t_NUMBER = ###code goes here
I'm not sure how to put multi-line processes into t_NUMBER.
I'm not sure what you mean or why you need a regex, but maybe this helps:
In [1]: ins = '12_34 5_6_8_2 4_____3 1234'
In [2]: for x in ins.split(): print x.replace('_', '')
1234
5682
43
1234
EDIT in response to the edited question:
I'm still not quite sure what you're doing with tokens there, but I'd do something like this (at least it makes sense to me):
input_str = "12_34 + 'Iamastring#' I_am_an_Ident"
tokens = ('NUMBER', 'SIGN', 'STRING', 'IDENT')
data = dict(zip(tokens, input_str.split()))
This would give you
{'IDENT': 'I_am_an_Ident',
'NUMBER': '12_34',
'SIGN': '+',
'STRING': "'Iamastring#'"}
Then you could do
data['NUMBER'] = int(data['NUMBER'].replace('_', ''))
and anything else you like.
P.S. Sorry if it doesn't help, but I really don't see the point of having tokens = ('PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'IDENT', 'STRING', 'NUMBER'), etc.
a='12_34 5_6_8_2 4___3 1234'
>>> a.replace('_','').replace(' ',', ')
'1234, 5682, 43, 1234'
>>>
The phrasing of your question is a little bit unclear. If you don't care about input validation, the following should work:
input = '12_34 5_6_8_2 4_____3 1234'
re.sub('\s+', ', ', input.replace('_', ''))
If you need to actually strip out all characters which are not either digits or whitespace and add commas between the numbers, then:
re.sub('\s+', ', ', re.sub('[^\d\s]', '', input))
...should accomplish the task. Of course, it would probably be more efficient to write a function that only has to walk through the string once rather than using multiple re.sub() calls.
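A sketch of such a single-pass function (assuming the input contains only digits, underscores, and whitespace):

```python
def normalize(s):
    out = []
    pending_sep = False  # saw whitespace since the last digit
    for ch in s:
        if ch.isdigit():
            if pending_sep and out:
                out.append(', ')
            out.append(ch)
            pending_sep = False
        elif ch.isspace():
            pending_sep = True
        # underscores (and anything else) are simply dropped
    return ''.join(out)

print(normalize('12_34 5_6_8_2 4_____3 1234'))  # 1234, 5682, 43, 1234
```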
You seem to be doing something like:
>>> data = '12_34 5_6_8_2 4_____3 1234'
>>> pattern = '[0-9]+[0-9_]*'
>>> re.findall(pattern, data)
['12_34', '5_6_8_2', '4_____3', '1234']
>>> re.findall(pattern.replace('_', ''), data)
['12', '34', '5', '6', '8', '2', '4', '3', '1234']
The issue is that pattern.replace isn't a signal to re to remove the _s from the matches; it changes your regex to '[0-9]+[0-9]*'. What you want is to do the replace on the results rather than on the pattern - e.g.,
>>> [match.replace('_', '') for match in re.findall(pattern, data)]
['1234', '5682', '43', '1234']
Also note that your regex can be simplified slightly; I will leave out the details of how since this is homework.
Well, if you really have to use re and only re, you could do this:
import re
def replacement(match):
    separator_dict = {
        '_': '',
        ' ': ',',
    }
    for sep, repl in separator_dict.items():
        if all(char == sep for char in match.group(2)):
            return match.group(1) + repl + match.group(3)

def rec_sub(s):
    """
    Recursive so it works with any number of numbers separated by underscores.
    """
    new_s = re.sub('(\d+)([_ ]+)(\d+)', replacement, s)
    if new_s == s:
        return new_s
    else:
        return rec_sub(new_s)
But that epitomizes the concept of overkill.