Iterate and replace words in lines of a tuple python - python

I want to iterate through this list and, for each line, iterate through the words to find and replace some of them (internet addresses, to be precise) using regex, while keeping each line intact.
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a l... T.CO/CHPBRVH9WKk",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant T.CO?BUXLVZFDWQ",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
This code below almost does that, but it breaks the list into words separated by lines:
import re
i = 0
while i < len(aList):
    for line in aList[i].split():
        line = re.sub(r"^[http](.*)\/(.*)$", "", line)
        print(line)
    i += 1
I'd love to get the same list back, with the internet addresses removed from each line:
[
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
Thanks

From this:
re.sub(r"^[http](.*)\/(.*)$", "", line)
it looks to me as if you expect that all your URLs will be at the end of the line. In that case, try:
[re.sub('http://.*', '', s) for s in aList]
Here, http:// matches the literal text http://, and .* matches everything that follows it up to the end of the line (. does not match a newline by default).
Example
Here is your list with some URLs added:
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a http://example.com/CHPBRVH9WKk",
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant http://example.com?BUXLVZFDWQ",
"want to go on holiday again, missing the sun"
]
Here is the result:
>>> [re.sub('http://.*', '', s) for s in aList]
['being broken changes people, \nand rn im missing the old me',
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
'#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ',
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
'This was my ring tone, before I decided change was good and missing a call was insignificant ',
'want to go on holiday again, missing the sun']
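If an address can appear in the middle of a line rather than only at the end, a narrower pattern may be preferable. The following is a minimal sketch, assuming every address starts with http:// or https:// and contains no spaces; the sample lines and the names url_pattern and cleaned are made up for illustration:

```python
import re

# Hypothetical sample lines; the second URL sits in the middle of its line
lines = [
    "no address in this line",
    "shutoff threat over missing 10 cents http://example.com/CHPBRVH9WKk - more text",
    "missing a call was insignificant https://example.com?BUXLVZFDWQ",
]

# \S+ stops at the first whitespace, so only the address itself is removed;
# the leading \s* also drops the whitespace in front of it
url_pattern = re.compile(r"\s*https?://\S+")

cleaned = [url_pattern.sub("", s) for s in lines]
for s in cleaned:
    print(s)
```

Unlike http://.*, this keeps any text that comes after the address on the same line.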

Your question is a little unclear, but I think I see what you're going for:
newlist = [re.sub(r"{regex}", "", line) for line in aList]
This list comprehension iterates through the list of strings and replaces any text that matches your regex pattern with an empty string ({regex} is a placeholder for your pattern).
Side note:
Looking closer at your regex, it is not doing what you think it is doing. [http] is a character class that matches a single character (h, t, or p), not the string http.
I would look at this Stack Overflow post about matching URLs with a regex:
Regex to find urls in string in Python

Related

Need a Regex that adds a space after a period, but can account for abbreviations such as U.S. or D.C

Here is what I have so far:
text = re.sub(r'(?<=\.)(?=[A-Z])', ' ', text)
This already avoids numbers and it gets around non-capital letters, but I need it to account for the edge case where initials are separated by periods.
An example sentence where I wouldn't want to add a space would be:
The U.S. health care is more expensive than U.K health care.
Currently, my regex makes it like:
The U. S. health care is more expensive than U. K health care.
But I want it to look exactly like the first sentence without the spaces separating U.S and U.K
I'm not sure how to do this, any advice would be appreciated!
EDIT:
(?<=\.)(?=[A-Z][a-z]{1,})
makes it so that it avoids one word abbreviations.
I think that this does what you want. We find periods that are not preceded by a capital letter and not followed by whitespace or the end of the text.
import re
text="The U.S. health care is more expensive than U.K health care.The end."
text = re.sub(r'((?<![A-Z])\.(?!\s|$))',r'\1 ', text)
print('<',text,'>')
Output (with '<' and '>' to show the beginning and end of the text more clearly):
< The U.S. health care is more expensive than U.K health care. The end. >
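A stricter variant is also possible if decimal numbers such as 3.5 must be left alone. This is only a sketch, assuming a space is wanted solely where a period is directly followed by a capital letter; the function name space_after_periods and the sample sentence are made up for illustration:

```python
import re

def space_after_periods(text):
    # Match a period that is NOT preceded by a capital letter (so U.S. is skipped)
    # and IS immediately followed by a capital letter (so 3.5 is skipped).
    return re.sub(r"(?<![A-Z])\.(?=[A-Z])", ". ", text)

sample = "The U.S. health care costs 3.5 times more.The end."
result = space_after_periods(sample)
print(result)
```

One limitation of this approach: a sentence that ends with an abbreviation and runs straight into the next one (for example "U.S.The") is not split, because the period is preceded by a capital letter.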

How to extract string that contains specific characters in Python

I'm trying to extract ONLY the one string that contains the $ character. The input is the text that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple, you don't need a regex solution; this should suffice (where text is your paragraph):
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This works only if there is a single occurrence of $, as you said. In the sample text above, the lone $ in "several hundred $ to purchase" comes first, so this would return '$'.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
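A capture group can pull out the digits directly and skip a lone $ in one step. This is a minimal sketch, assuming the price is written as $ immediately followed by digits (optionally with commas or a decimal part); the shortened text and the variable names are only for illustration:

```python
import re

# Shortened version of the post: one lone "$" and one real price
text = (
    "it was several hundred $ to purchase after the fact.\n"
    "Asking $2000 obo. PayPal shipped. CONUS."
)

# \$ must be followed directly by a digit, so the lone "$" is ignored;
# the parentheses capture only the number, without the dollar sign
match = re.search(r"\$(\d[\d,]*(?:\.\d+)?)", text)
price = match.group(1) if match else None
print(price)
```

Using re.search with a None check avoids the IndexError that indexing an empty findall result would raise when no price is present.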

Is there a shorter way to replace words in a string? [duplicate]

This is my task
journey = """Just a small tone girl
Leaving in a lonely whirl
She took the midnight tray going anywhere
Just a seedy boy
Bored and raised in South Detroit or something
He took the midnight tray going anywhere"""
Gross. Okay, so for this exercise, your job is to use Python's string replace method to fix this string up and print the new version out to the console.
This is what I did
journey = """ just a small tone girl
Leaving in a lonely whirl
she took a midnight tray going anywhere
Just a seedy boy
bored and raised in south detroit or something
He took the midnight tray going anywhere"""
journeyEdit = (journey.replace("tone", "town")
               .replace("tray", "train")
               .replace("seedy", "city")
               .replace("Leaving", "living")
               .replace("bored", "born")
               .replace("whirl", "world")
               .replace("or something", " "))
print (journeyEdit)
Here is one way to replace words in a text using Python's re package.
The code below shows the idea:
import re
journey = """ just a small tone girl Leaving in a lonely whirl she took a
midnight tray going anywhere Just a seedy boy bored and raised in south
detroit or something He took the midnight tray going anywhere"""
# define desired replacements here
journeydict = {"tone" : "town",
"tray":"train",
"seedy":"city",
"Leaving": "living",
"bored":"born",
"whirl":"world"
}
# escape the keys so they are matched literally, and join them into one alternation pattern
pattern = re.compile("|".join(re.escape(k) for k in journeydict))
# replace each match with its value from the dictionary
text = pattern.sub(lambda m: journeydict[m.group(0)], journey)
print(journey)
print(text)
This is probably a longer way than the one you gave ;-).
As given at How to replace multiple substrings of a string?:
import re
journey = """ just a small tone girl Leaving in a lonely whirl she took a
midnight tray going anywhere Just a seedy boy bored and raised in south
detroit or something He took the midnight tray going anywhere"""
rep = {"tone": "town",
"tray": "train",
"seedy":"city",
"Leaving": "living",
"bored":"born",
"whirl":"world",
"or something": " "}
# use these three lines to do the replacement
rep = dict((re.escape(k), v) for k, v in rep.items())
# (on Python 2, use rep.iteritems() instead of rep.items())
pattern = re.compile("|".join(rep.keys()))
journeyEdit = pattern.sub(lambda m: rep[re.escape(m.group(0))], journey)
print(journeyEdit)
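Since the exercise asks for the string replace method specifically, a plain loop over a dictionary of corrections is another option that avoids the long chain. This is a sketch only; the shortened lyrics and the names fixes and fixed are made up for illustration:

```python
# Shortened version of the lyrics from the exercise
journey = (
    "Just a small tone girl\n"
    "Leaving in a lonely whirl\n"
    "She took the midnight tray going anywhere"
)

# Each wrong word mapped to its correction
fixes = {
    "tone": "town",
    "Leaving": "Living",
    "whirl": "world",
    "tray": "train",
}

fixed = journey
for wrong, right in fixes.items():
    # str.replace returns a new string, so reassign on every pass
    fixed = fixed.replace(wrong, right)

print(fixed)
```

Note that the replacements run one after another, so a later pass can also change text produced by an earlier one; the single-pass regex versions above do not have that effect.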

Iterate over a text and find the distance between predefined substrings

I decided I wanted to take a text and find how close some labels were in the text. Basically, the idea is to check if two persons are less than 14 words apart and if they are we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi, w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi+1, wi+x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi], ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure what is the best way to proceed. Any hints are welcome.
You could first preprocess your text so that every name in the text is replaced with a single-word id. The ids would have to be strings that you would not expect to appear as other words in the text. As you preprocess the text, you could keep a mapping of ids to names so that you know which name corresponds to which id. This would allow you to keep your current algorithm as is.
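The preprocessing idea can be sketched roughly as follows. This is an illustration rather than a definitive implementation: the id scheme (__P0__, __P1__, ...), the function names, and the short sample sentence are all assumptions, and the ids are assumed never to occur in the real text.

```python
import re

def tokenize_with_names(text, names):
    """Replace every name with a single-word id, then split into words."""
    # Hypothetical id scheme, assumed not to appear in the text itself
    id_to_name = {f"__P{i}__": name for i, name in enumerate(names)}
    name_to_id = {name: pid for pid, name in id_to_name.items()}
    # Collapse line breaks so a name wrapped across two lines is still found
    text = " ".join(text.split())
    # Longest names first, so a long name wins over a shorter one it contains
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")
    replaced = pattern.sub(lambda m: name_to_id[m.group(0)], text)
    return replaced.split(), id_to_name

def related_pairs(text, names, window=14):
    """Return (name, name) pairs that occur within `window` words of each other."""
    words, id_to_name = tokenize_with_names(text, names)
    pairs = []
    for i, word in enumerate(words):
        if word not in id_to_name:
            continue
        # Same range as the original loop: range(wi + 1, wi + x)
        for other in words[i + 1:i + window]:
            if other in id_to_name:
                pairs.append((id_to_name[word], id_to_name[other]))
    return pairs

sample = "Robert met Lady\nBurlesdon and later Rose"
people = ["Robert", "Rose", "Lady Burlesdon"]
print(related_pairs(sample, people))
```

Because each name becomes one token, a multi-word name counts as a single word when distances are measured.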

Split by regex of new line and capital letter

I've been struggling to split my string by a regex expression in Python.
I have a text file which I load that is in the format of:
"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line"
I'd like to get the following output:
['Peter went to the gym; he worked out for two hours','Kyle ate lunch
at Kate's house. He went home at 9.', 'Some other sentence here',
'\u2022Here's a bulleted line']
I'm looking to split my string by a new line and a capital letter or a bullet point in Python.
I've tried tackling the first half of the problem, splitting my string by just a new line and capital letter.
Here's what I have so far:
print re.findall(r'\n[A-Z][a-z]+',str,re.M)
This just gives me:
[u'\nKyle', u'\nSome']
which is just the first word. I've tried variations of that regex expression but I don't know how to get the rest of the line.
I assume that to also split by the bullet point, I would just include an OR regex expression that is in the same format as the regex of splitting by a capital letter. Is this the best way?
I hope this makes sense and I'm sorry if my question is in any way unclear. :)
You can use this split function:
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)
[u'Peter went to the gym; \nhe worked out for two hours ',
u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
u'Some other sentence here',
u"\u2022Here's a bulleted line"]
You can split at a \n followed by a capital letter or the bullet character:
import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))
Output:
['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]
Or, without using the symbol for the bullet character:
new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
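The desired output in the question also has the remaining line breaks inside each piece removed. A Python 3 sketch of that extra step, assuming any run of whitespace inside a piece may be collapsed to a single space (the names parts and cleaned are made up for illustration):

```python
import re

text = (
    "Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch "
    "at Kate's house. Kyle went home at 9. \nSome other sentence "
    "here\n\u2022Here's a bulleted line"
)

# Split at a newline only when a capital letter or a bullet follows it
parts = re.split(r"\n(?=[A-Z\u2022])", text)

# Collapse leftover newlines and repeated spaces inside each piece
cleaned = [" ".join(part.split()) for part in parts]
print(cleaned)
```

The newline before "he worked out" is not a split point because a lowercase letter follows it, so that line stays attached to the first sentence.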
