Replace characters in a string with whitespaces

Replace characters in a string with whitespaces - python

I am writing a simple Python script that retrieves the latest tweet of any twitter user (in this case BBC) and uses the integrated text-to-speech system on Mac to read out the content of that particular tweet.
Everything is running as it should, but there are certain things I want to improve. For instance, if a tweet contains the character "#", the computer will speak this as "number". E.g., if the tweet were to read "#BBC covers the latest news", the computer speaks "number BBC covers the latest news".
I have declared a string to hold the content of the tweet, and wish to find a way to replace unwanted characters with white spaces. So far, I have the following:
for char in data_content:  # data_content is the string holding the tweet
    if char in "#&/":  # does not replace #
        mod_data = data_content.replace(char, '')
print(mod_data)
system('say ' + mod_data)
This seems to be working correctly with the "/" character, but does not replace the "#" character. So, any help on this matter is very much appreciated!
P.S. I have tried replacing the "#" character alone, in which case I get the desired result. However, when I try to provide a series of characters to replace, it only replaces the "/" character.
Thanks!

Your loop always calls replace() on the original data_content and assigns the result to mod_data, so each replacement discards the previous one and you only ever see the last change.
Say your string is "#BBC covers the latest issues with G&F. See bbc.co.uk/gf"
The first character from your list to be found is the #, so:
mod_data = "BBC covers the latest issues with G&F. See bbc.co.uk/gf"
Next the & is found but it is found in data_content so the changes you made earlier are ignored and you get:
mod_data = "#BBC covers the latest issues with GF. See bbc.co.uk/gf"
The same happens when the / is found and you get:
mod_data = "#BBC covers the latest issues with G&F. See bbc.co.ukgf"
That's why it looks like it is only working for the /.
You can simply do what you want using regular expressions like this:
import re
from os import system
string = "#BBC covers the latest issues with G&F. See bbc.co.uk/gf"
mod_data = re.sub(r"[#&/]", " ", string)
print(mod_data)
system('say ' + mod_data)

I have an additional suggestion. Since replace() works for all occurrences of the character in the string, you don't need that outer loop, so you could change your code to something like this:
mod_data = data_content
for char in "#&/":
    mod_data = mod_data.replace(char, '')
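Another option, if you want every unwanted character turned into a space in a single pass, is str.translate. This is only a minimal sketch: the tweet text is the sample string from the answer above, and the name unwanted is just an illustrative variable.

```python
# Sample tweet text, reused from the example above.
data_content = "#BBC covers the latest issues with G&F. See bbc.co.uk/gf"

# Build a translation table that maps each unwanted character to one space.
unwanted = "#&/"
table = str.maketrans(unwanted, " " * len(unwanted))

# translate() applies all the replacements in one pass over the string.
mod_data = data_content.translate(table)
print(mod_data)
```

Because the table is built once, adding another character to strip only means extending the unwanted string.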

Related

Why is re.findall behaving like this? (python regex)

I made a small program in Python that searches through a music website and collects music data. The music has a format of [artist] - [music name] [music file format]. At first I used re.search to find a certain artist (I used regex because there are some other characters and irregularities in the music info above, and the only indicator for finding the artist was the - following the artist).
Somehow it didn't work, so I changed it to re.findall just in case, but it still didn't work. Since I'm a beginner at Python I thought I did something wrong, so I wrote some test code to study what was wrong. And this is what I got.
When I changed the x string (which would be the music info) and ran re.findall again, it gave me a different result (none). I 100% thought the result would be the same. Why is it behaving like this? And could this be the reason why my original code's re.search and re.findall weren't working?
I've included the code just in case. (used selenium)
idx = 1
while True:
    try:
        hxp1 = "(//h3[@class='entry-title td-module-title']/a)[" + str(idx) + "]"
        text = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, hxp1)))
        # info = eg) 'Michael Jackson - Beat it [FLAC, MP3, WAV]'
        info = text.get_attribute('title')  # get 'info' as string
        # ARTIST = eg) 'Michael Jackson'
        regex = ARTIST + ' - '
        match = re.findall(regex, info)  # or use re.search
        # do something with 'match'...
        idx += 1
    except:
        # do something...
        break

It seems you need to make sure you match
any Unicode whitespaces (i.e. \s in Python 3.x, or (?u)\s in Python 2.x, see re documentation: "Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).")
any Unicode hyphens (see Searching for all Unicode variation of hyphens in Python).
Combining all that into your regex:
Minami\s[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]\s
In your case, if you just need to support en-dash/em-dash/hyphen chars and any Unicode whitespace chars, you can use
Minami\s[-—–]\s
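To see the difference in practice, here is a small sketch. The ARTIST and info values are made up for illustration (the title deliberately contains a non-breaking space and an en dash), and re.escape is added on the assumption that artist names may contain regex metacharacters.

```python
import re

# Hypothetical values; in the question these come from the scraped page.
ARTIST = "Michael Jackson"
info = "Michael Jackson\u00a0\u2013 Beat it [FLAC, MP3, WAV]"  # NBSP + en dash

# re.escape protects against metacharacters in the artist name.
# \s matches any Unicode whitespace in Python 3; the character class
# accepts a hyphen-minus, an en dash (U+2013) or an em dash (U+2014).
pattern = re.compile(re.escape(ARTIST) + r"\s[-\u2013\u2014]\s")

naive = re.findall(ARTIST + " - ", info)  # plain space and plain hyphen only
robust = pattern.findall(info)

print(naive)
print(robust)
```

The naive pattern finds nothing here because the title does not contain an ASCII space followed by an ASCII hyphen, even though it looks that way on screen.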

"# this is a string", How python identifies it as a string but not a comment?

I really want to know how Python identifies a # inside quotes as part of a string and a bare # as the start of a comment.
I mean, how does the code that tells these apart actually work? Does Python read a line, and how does it exclude the string when looking for the comment?
"# this is a string" # this is a comment
How is the comment identified? Does Python exclude the string, and if so, how?
How can we write code that does the same, for example to design a compiler for our own language in Python?
I am a newbie, please help.

You need to know that whether something is a string or a comment can be determined from just one single character. That is the job of the scanner (or lexical analyzer if you want to sound fancy).
If it starts with a ", it's a string. If it starts with #, it's a comment.
In the code that makes up Python itself, there's probably a loop that goes something like this:
# While there is still source code to read
while not done:
    # Get the current character
    current = source[pos]
    # If the current character is a pound sign
    if current == "#":
        # While we are not at the end of the line
        while current != "\n":
            # Get the next character
            pos += 1
            current = source[pos]
    elif current == '"':
        # Code to read a string omitted for brevity...
        pass
    else:
        done = True
In the real Python lexer, there are probably dozens more of those if statements, but I hope you have a better idea of how it works now. :)
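If you want something you can actually run, here is a toy scanner along the same lines. It is only a sketch under simplifying assumptions (double-quoted strings only, no escape sequences, no triple quotes), and the function name scan is made up; it is not how CPython's real tokenizer is written.

```python
def scan(source):
    """Split source into ('string', text) and ('comment', text) tokens.

    Simplified on purpose: only double-quoted strings, no escapes,
    no triple quotes. Every other character is skipped.
    """
    tokens = []
    pos = 0
    while pos < len(source):
        current = source[pos]
        if current == "#":
            # A comment runs to the end of the line (or end of input).
            end = source.find("\n", pos)
            if end == -1:
                end = len(source)
            tokens.append(("comment", source[pos:end]))
            pos = end
        elif current == '"':
            # A string runs to the next double quote. Any # inside it is
            # consumed here and never reaches the comment branch.
            end = source.find('"', pos + 1)
            if end == -1:
                raise SyntaxError("unterminated string")
            tokens.append(("string", source[pos:end + 1]))
            pos = end + 1
        else:
            pos += 1
    return tokens


print(scan('"# this is a string" # this is a comment'))
```

The key point is that the decision is made on the first character of each token: once the scanner is inside a string, it keeps consuming characters until the closing quote, so the # never gets a chance to start a comment.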

Because of the quotes
# This is a comment
x = "# this is a string"
x = '# this is also a string'
x = """# this string
spans
multiple
lines"""

"# this is a string" # this is a comment
In simple terms, the interpreter sees the first ", then it takes everything that follows as part of the string until it finds the matching " which terminates the string. Then it sees the subsequent # and interprets everything to follow as a comment. The first # is ignored because it is between the two quotes, and hence is taken as part of the string.
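You can check this yourself with the standard library's tokenize module, which exposes the tokens Python sees. A short sketch (the variable names are just for illustration):

```python
import io
import tokenize

source = '"# this is a string" # this is a comment\n'

# generate_tokens() takes a readline callable that returns str lines.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in (tokenize.STRING, tokenize.COMMENT)
]
print(tokens)
```

The first # ends up inside the STRING token and only the second one starts a COMMENT token.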

Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase)  # output is ####
j = input('> ')  # ##whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k)  # whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the REGEX \W in the middle of keyphrase, to no avail. Also weird combinations of \$\$ and quotes
So, my question, is how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.

You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
How does it work? (1) We state the constant head and tail strings (##) on either side of capture group 1, which is delimited by the parentheses (). (2) Inside the group, .*? matches any characters lazily (non-greedy), so each capture stops at the nearest closing ## instead of running on to the last one in the string.
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
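To make the lazy-versus-greedy point concrete, here is a small comparison on a shortened version of the sentence from the question (the variable names are only illustrative):

```python
import re

text = "a ##comment## does something like ##this##"

# Greedy: .* grabs as much as possible, so one match spans both key phrases.
greedy = re.findall(r"##(.*)##", text)

# Lazy: .*? stops at the nearest closing ##, giving one match per key phrase.
lazy = re.findall(r"##(.*?)##", text)

print(greedy)  # ['comment## does something like ##this']
print(lazy)    # ['comment', 'this']
```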

Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
    print('yay! You submitted "', keyphrase_content, '" to the bot!')
else:
    # Bonus tip: Use double quotes to make a string containing apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginner's guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time.
The regex ##(.*)## basically means: "Search through the string until you find two '#' signs. Right after those, start capturing zero or more of any character, and because * is greedy, grab as many as possible. Then back off just far enough that the capture is followed by another '##'. Return successfully and hold onto the entire match (from the first '#' to the last '#'). Also, hold onto the captured characters in case the programmer asks for just them."
EDIT: Props to @ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default
# whatever_bot.py
import re
keyphrase_double_hash = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_hash):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
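For anyone who wants a starting point for that exercise, one possible sketch uses re.escape so the tag is always treated literally. The helper name make_keyphrase_regex is made up for this example and is not part of the re module.

```python
import re


def make_keyphrase_regex(tag):
    """Build a pattern that captures the text between two copies of `tag`.

    Hypothetical helper for illustration only.
    """
    escaped = re.escape(tag)  # neutralizes metacharacters such as * and ?
    return re.compile(escaped + r"(.*?)" + escaped)


print(make_keyphrase_regex("##").findall("try ##this## and ##that##"))
print(make_keyphrase_regex("??").findall("is ??this?? literal"))
```

Letting re.escape do the escaping avoids having to remember which characters are special.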

If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))

You're not getting any of the info between the #'s in your example because you're effectively looking for the literal string '####' in whatever input you give it. Unless the input happens to contain 4 #'s in a row, that test will never succeed.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for 2 leading #s, then one or more word characters (\w+: letters, digits and underscores), then 2 trailing #s. Note that the uppercase \W means the opposite (non-word characters), and that re.match only matches at the start of the string; use re.search to find the key phrase anywhere in the input. From there, your strip code looks fine and you should be able to grab it.

find substrings and replace them but get their information [python]

I want to do something like this to a text (This is just an example to show the problem):
new_text = re.sub(r'\[(?P<index>[0-9]+)\]',
'(Found pattern the ' + index + ' time', text)
Where text is my original text. I want to find any substring like this: [3] or [454]. But this isn't the hard part. The hard part is to get the number in there. I want to use the number to use a method called add_link(number) which expects a number(instead of the string I'm building with "Found pattern..." - that's just an example). (In a database it has stored links matched to IDs where it finds the links.)
Python tells me it doesn't know the local variable index. How can I make it known?
Edit: I have been told I didn't ask clearly. (I already have an answer, but maybe someone is going to read this in the future.) The question was how to get the text matched by [0-9]+ into a local variable. I guessed it would be something like this: (?P<index>[0-9]+), and it was.
Thanks in advance, Asqiir

You can reference a named group in the replacement string with the syntax \g<field name>. So your code should be written as:
new_text = re.sub(r'\[(?P<index>[0-9]+)\]', r'(Found pattern the \g<index> time', text)
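If you need the number as a real value so you can pass it to a function such as add_link(number), re.sub also accepts a function as the replacement. A minimal sketch, assuming add_link returns the replacement text; the add_link body and the LINKS dictionary here are stand-ins invented for the example, since the real one reads from a database.

```python
import re

# Hypothetical stand-in for the asker's database of links keyed by ID.
LINKS = {3: "https://example.com/three", 454: "https://example.com/454"}


def add_link(number):
    # Stand-in implementation: the real add_link looks the link up elsewhere.
    return LINKS.get(number, "[unknown link]")


def replace_reference(match):
    # The named group arrives as a string, so convert it before the call.
    return add_link(int(match.group("index")))


text = "See [3] and [454]."
new_text = re.sub(r"\[(?P<index>[0-9]+)\]", replace_reference, text)
print(new_text)
```

re.sub calls replace_reference once per match and substitutes whatever string it returns.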

using \b in regex

--SOLVED--
I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
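For future readers, the multiline fix can be sketched roughly like this. The sample text is modeled on the data shown further down, and the pattern is an assumed line-anchored variant of the zip code regex, not the exact one from my code.

```python
import re

# Sample data modeled on the test file in the original post.
text_content = "Zip:\n----\n98839-0111\n34332\n"

# Anchored pattern: a whole line must be a 5-digit zip, optionally ZIP+4.
pattern = r"^\d{5}(?:-\d{4})?$"

# Without the flag, ^ and $ only match at the start and end of the whole text.
without_flag = re.findall(pattern, text_content)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
with_flag = re.findall(pattern, text_content, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['98839-0111', '34332']
```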
--EDIT--
My code:
import re
import test_regex

def regex_content(text_content, regex_dictionary):
    #text_content = text_content.lower()
    regex_matches = []
    # Search sanitized text (markup removed) for DLP theme keywords
    for key, value in regex_dictionary.items():
        # Get configuration settings
        min_matches = value.get('min_matches', 1)
        risk = value.get('risk', 1)
        enabled = value.get('enabled', False)
        regex_str = value.get('regex', '')
        # Fast compute True/False hit for each DLP theme word
        if enabled:
            print "Searching for key : %s" % (key)
            my_regex = re.compile(value.get('regex'))
            hits = my_regex.findall(text_content)
            if len(hits) > 0:
                regex_matches.append((key, risk, len(hits), hits))
    # Return array of results (key, risk, number of hits, regex matches)
    return regex_matches

def main():
    #print defaults.test_regex.dlp_regex
    text_content = ""
    for line in open('testData.txt'):
        text_content += line
    for match in regex_content(text_content, test_regex.dlp_regex):
        print "\nFound %s : %s" % (match[0], match[3])
    print "\n"

if __name__ == '__main__':
    main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I precede my regex with the 'r' flag, I can find the zip codes I'm looking for, but as well as every other 5 digit number in my document I am searching through. From my understanding this is because it ignored the \b characters. Without the r flag though, it cannot find any zip codes. It works perfectly fine in regexr, but not in my code. I haven't had any luck making \b characters work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What is it that I am misunderstanding about these special characters?
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
According to the online regex debugger Regexr, my regex should correctly catch 5 digit zip codes, such as 34332. However, I have two problems:
1. This regex is not working in my actual code for finding any zip codes, but it does work when I don't have the boundary (\b) characters. The exact code I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this also doesn't find any matches, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
It can now correctly identify it. I guess this means that once a string is pulled out by one of my regexes, it can no longer be found by any subsequent regexes? Why is this?

It is because of the alternation: at each position the regex engine tries the alternatives from left to right, and once the first one matches it stops checking the rest.
Try this regex
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, it is functionally the same as \b\d{5}(?:-\d{4})?\b).
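Here is a quick way to check the effect of the alternation order in Python. This is only an illustrative sketch; the variable names are made up.

```python
import re

text = "98839-0111"

# Short branch first: it succeeds at position 0 with "98839", so the
# hyphenated branch is never tried at that position.
short_first = re.findall(r"\b\d{5}\b|98839-0111", text)

# Longer alternative first: the full code is matched.
long_first = re.findall(r"98839-0111|\b\d{5}\b", text)

# Optional suffix instead of alternation: the order problem disappears.
optional_suffix = re.findall(r"\b\d{5}(?:-\d{4})?\b", text)

print(short_first)      # ['98839']
print(long_first)       # ['98839-0111']
print(optional_suffix)  # ['98839-0111']
```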

Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest using this pattern, which solves the problem:
\b\d{5}(?:-\d{1,4})?\b
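In Python, write this pattern as a raw string. In a normal string literal "\b" is the backspace character, so the regex engine never sees a word boundary, which would explain why nothing matched without the r prefix. A small sketch using sample text modeled on the question:

```python
import re

text_content = "Zip:\n----\n98839-0111\n34332\n"

# "\b" in a normal literal is one character: backspace (0x08).
print("\b" == "\x08")  # True
# r"\b" is two characters: a backslash and the letter b.
print(len(r"\b"))      # 2

# Same pattern twice; only the handling of \b differs.
backspace_pattern = "\b\\d{5}(?:-\\d{1,4})?\b"  # looks for literal backspaces
boundary_pattern = r"\b\d{5}(?:-\d{1,4})?\b"    # real word boundaries

no_hits = re.findall(backspace_pattern, text_content)
hits = re.findall(boundary_pattern, text_content)

print(no_hits)  # []
print(hits)     # ['98839-0111', '34332']
```

Note that the raw-string version will still match any standalone 5-digit number, since \b only checks for a word boundary and knows nothing about zip codes.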

You can use this regex:
\b(\d{5}-\d{1,4}|\d{5})\b
