How to remove words containing only numbers in python? - python

I have some text in Python that is made up of numbers and letters. Something like this:
s = "12 word word2"
From the string s, I want to remove all the words that contain only digits.
So I want the result to be:
s = "word word2"
This is the regex I have, but it also affects the letters, i.e. it replaces each letter with a space.
re.sub('[\ 0-9\ ]+', ' ', line)
Can someone help in telling me what is wrong? Also, is there a more time-efficient way to do this than regex?
Thanks!

You can use this regex:
>>> import re
>>> s = "12 word word2"
>>> print(re.sub(r'\b[0-9]+\b\s*', '', s))
word word2
\b matches a word boundary, and \s* removes zero or more whitespace characters after the number-only word.
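One caveat with \b: it also sees a boundary next to punctuation, so a token such as 3.14 would be partly removed, and a number at the end of the string leaves a trailing space. A minimal sketch of a stricter variant, assuming words are separated by whitespace only; the name drop_numeric_words is hypothetical and not part of the answer above:

```python
import re

# Hypothetical helper, not from the original answer.
# (?<!\S) and (?!\S) require whitespace (or the string edge) on both sides,
# so digit runs inside tokens such as "3.14" or "word2" are left alone.
_NUMERIC_TOKEN = re.compile(r'(?<!\S)[0-9]+(?!\S)\s*')

def drop_numeric_words(text):
    # strip() removes the space left behind when the last word was a number
    return _NUMERIC_TOKEN.sub('', text).strip()

print(drop_numeric_words("12 word word2"))  # word word2
print(drop_numeric_words("word 12"))        # word
print(drop_numeric_words("pi 3.14 x"))      # pi 3.14 x
```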

Using a regex is probably overkill here, depending on whether you need to preserve whitespace:
s = "12 word word2"
s2 = ' '.join(word for word in s.split() if not word.isdigit())
# 'word word2'
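On the time-efficiency question, the two approaches can be measured rather than guessed. A rough sketch using timeit from the standard library; with_regex and with_split are hypothetical names, the input size is arbitrary, and the timings will vary by machine and Python version:

```python
import re
import timeit

# Arbitrary test input: the question's string repeated many times
s = "12 word word2 " * 1000

def with_regex(text):
    # regex approach: remove digit-only words and the whitespace after them
    return re.sub(r'\b[0-9]+\b\s*', '', text)

def with_split(text):
    # split/join approach: keep every word that is not made of digits only
    return ' '.join(w for w in text.split() if not w.isdigit())

# Print the total time in seconds for 100 runs of each approach
print("regex:", timeit.timeit(lambda: with_regex(s), number=100))
print("split:", timeit.timeit(lambda: with_split(s), number=100))
```

Note that the two functions are not exact equivalents: the split/join version collapses all whitespace to single spaces, while the regex version keeps whatever whitespace is not next to a removed number.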

Without importing anything, you could do:
stringToFormat = "12 word word2"
words = ""
for word in stringToFormat.split(" "):
    try:
        # int() succeeds only for number-only words, which are skipped
        int(word)
    except ValueError:
        # not a number, so keep the word
        words += "{} ".format(word)
print(words)

Related

How can I extract a word after # symbol in Python String

Given a sentence, e.g. "Im SHORTING #RSR here", I need to extract the word that follows the "#" symbol (starting after the "#", which is not included, up to the next space).
Obviously, the "#" symbol can be anywhere in the string.
Thanks.
You could use:
sentence = "Im SHORTING #RSR here"
words = [word.lstrip('#') for word in sentence.split() if word.startswith('#')]
The result will contain all hashtagged words in the sentence.
If there's always exactly one, just use words[0].
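The same extraction can also be sketched with a regular expression. This is only an illustration, assuming a hashtag always starts a whitespace-separated token; the variable name tags is made up for the example:

```python
import re

sentence = "Im SHORTING #RSR here"
# (?<!\S) requires the '#' to start a token;
# (\S+) captures everything after it up to the next whitespace
tags = re.findall(r'(?<!\S)#(\S+)', sentence)
print(tags)  # ['RSR']
```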
Try this:
phrase = 'Im SHORTING #RSR here'
# split the input on spaces
words = phrase.split(' ')
# init empty list
comments = []
# iterate through each word
for word in words:
    # check if the word starts with '#' (also safe for empty strings)
    if word.startswith('#'):
        # add the word, without the leading '#', to the list
        comments.append(word[1:])
# let's see what we have!
print(comments)

Extract words from sentence that are containing substring

I want to extract the full phrase (one or multiple words) that contains a specific substring. The substring can span one or multiple words, and it can start or end in the middle of a word in test_string, but the desired output is the full phrase/word from test_string. For example:
test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
substring1 = 'he text th'
substring2 = 'amp'
if substring1 in test_string:
    print("substring1 found")
if substring2 in test_string:
    print("substring2 found")
My desired output is:
[the text that]
[example, amplifier, lamp]
FYI
Substring can be at the beginning of the word, middle or end...it does not matter.
If you want something robust, I would do something like this:
re.findall(r"((?:\w+)?" + re.escape(substring2) + r"(?:\w+)?)", test_string)
This way you can have whatever you want in substring.
Explanation of the regex:
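'(?:\w+)' a non-capturing group of one or more word characters
'?' zero or one occurrence of that group
I put this optional group both before and after your substring, since the substring can start or end in the middle of a word and the missing part can be on either side.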
To answer the latest comment about how to get the punctuation as well, I would do something like this using string.punctuation:
import re
import string
pattern = r"(?:[" + r"\w" + re.escape(string.punctuation) + r"]+)?"
re.findall("(" + pattern + re.escape(substring2) + pattern + ")",
           test_string)
Doing so will match any punctuation attached to the beginning and the end of the word, like: [I love you.., I love you!!, I love you!?, ?I love you!, ...]
This is a job for regex. You could do:
import re
substring2 = 'amp'
test_string = 'this is an example of the text that I have'
# \w* (rather than \w+) also matches when the substring starts or ends a word
print("matches for substring 1:", re.findall(r"(\w*he text th\w*)", test_string))
print("matches for substring 2:", re.findall(r"(\w*amp\w*)", test_string))
Output:
matches for substring 1: ['the text that']
matches for substring 2: ['example']
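To avoid hardcoding the substring in the pattern, the idea can be wrapped in a small function. A minimal sketch, assuming "word" means a run of \w characters; phrases_containing is a hypothetical name. It uses \w* on both sides so that the substring may also sit at the very start or end of a word, and re.escape so that special characters in the substring are matched literally:

```python
import re

def phrases_containing(substring, text):
    # \w* extends the match to the rest of the word on either side;
    # re.escape treats the substring as literal text
    pattern = r"\w*" + re.escape(substring) + r"\w*"
    return re.findall(pattern, text)

test_string = ('this is an example of the text that I have, '
               'and I want to by amplifier and lamp')
print(phrases_containing('he text th', test_string))
# ['the text that']
print(phrases_containing('amp', test_string))
# ['example', 'amplifier', 'lamp']
```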

Python how to find a substring in a string and print the whole string containing the substring

I am struggling to find a solution to print a string which contains a particular substring. So e.g. I have a string
mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
I want to search for "mon" in the string above and print "mon_this_is_monday", but I am not sure how to do it.
I tried doing
import re
pattern = re.compile('mon_')
try:
    match = re.search(pattern, mystr).group(0)
    print(match)
except AttributeError:
    print('No match')
but this just gives mon_ as the output for match. How do I get the whole string "mon_this_is_monday" as output?
We could try using re.findall with the pattern \b\w*mon\w*\b:
import re
mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
matches = re.findall(r'\b\w*mon\w*\b', mystr)
print(matches)
This prints:
['mon_this_is_monday']
The regex pattern matches:
\b a word boundary (i.e. the start of the word)
\w* zero or more word characters (letters, numbers, or underscore)
mon the literal text 'mon'
\w* zero or more word characters, again
\b another word boundary (the end of the word)
print([string for string in mystr.split(" ") if "mon" in string])
You can also do a search with a regex:
import re
mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
abc = re.search(r"\b(\w*mon\w*)\b",mystr)
print(abc.group(0))
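Note that re.search returns None when nothing matches, so calling .group(0) directly raises AttributeError in that case. A small sketch that checks for this first; find_word_containing is a hypothetical helper name, not something from the answers above:

```python
import re

def find_word_containing(fragment, text):
    # Build \b\w*<fragment>\w*\b, escaping the fragment so it is matched literally
    match = re.search(r'\b\w*' + re.escape(fragment) + r'\w*\b', text)
    # re.search returns None when there is no match
    return match.group(0) if match else None

mystr = "<tag> name = mon_this_is_monday value = 10 </tag>"
print(find_word_containing("mon", mystr))  # mon_this_is_monday
print(find_word_containing("xyz", mystr))  # None
```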

How to remove abnormal words from string like '0xd46b6c46a37f4578' or 'jrLJW PUNtTLrQGZ25X4DA ' - python

I have a string which contains proper words and some gibberish that does not make any sense to the reader. I want to remove those abnormal words from the string. Please note that these are just sample words; there are tons of them in the string.
Example:
0xe933b1dfab45d591 0xe7d363050cec0146
0xf5e4005d43867c48 0x1e0b75e9dff872f5
0xa46406ec8a4e6cdc 0x3ea14cfd28ccf8fe
0x750b065d3715b1c8 0x6bb50ebe411dd5da
0xd46b6c46a37f4578 0x15b9290f631cded2
0xafcfd4f9daa2187e 0x9dcc5dbad77c926a
AEj_0IB_BpqtlN76JnAdUQ0gWWYXEzVQrFBrGQ
0ahUKEwjj09PGppLeAhXUZSsKHZltBc8Q61gI1QIoBzAF
I removed extra characters like +, -, ' using the following:
text = re.sub(r'[^\w]', ' ', text)
but I could not find a way to remove these abnormal words. Help needed.
Thanks.
Does this work for you?
newtext = ""
for word in text.split():
    # keep the word unless it contains both a digit and a letter
    if not (any(char.isdigit() for char in word) and any(char.isalpha() for char in word)):
        newtext += word + " "
This checks whether each whitespace-separated word in your text contains both letters and digits. If it doesn't, the word is added to the new string.
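The same check can be sketched as a reusable function that joins the kept words, which avoids the trailing space. The name drop_mixed_tokens is hypothetical. The heuristic has limits worth keeping in mind: it also drops legitimate words such as "word2", and it keeps letter-only gibberish such as "jrLJW".

```python
def drop_mixed_tokens(text):
    kept = []
    for word in text.split():
        has_digit = any(ch.isdigit() for ch in word)
        has_alpha = any(ch.isalpha() for ch in word)
        # a word with both letters and digits is treated as gibberish
        if has_digit and has_alpha:
            continue
        kept.append(word)
    return ' '.join(kept)

print(drop_mixed_tokens("hello 0xe933b1dfab45d591 world 42"))
# hello world 42
```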

Python: delete all characters before the first letter in a string

After a thorough search I could find how to delete all characters before a specific letter but not before any letter.
I am trying to turn a string from this:
" This is a sentence. #contains symbol and whitespace
To this:
This is a sentence. #No symbols or whitespace
I have tried the following code, but strings such as the first example still appear.
for ch in ['\"', '[', ']', '*', '_', '-']:
    if ch in sen1:
        sen1 = sen1.replace(ch,"")
Not only does this fail to delete the double quote in the example for some unknown reason, but it also wouldn't work for deleting the leading whitespace, as it would delete all of the whitespace.
Thank you in advance.
Instead of just removing whitespace, to remove any character before the first letter, do this:
# s is your string
pos = len(s)  # fallback if s contains no letter
for i, x in enumerate(s):
    if x.isalpha():  # True if it's a letter
        pos = i  # first letter position
        break
new_str = s[pos:]
import re
s = " sthis is a sentence"
r = re.compile(r'.*?([a-zA-Z].*)')
print(r.findall(s)[0])
Strip all leading whitespace and punctuation:
>>> import string
>>> text.lstrip(string.punctuation + string.whitespace)
'This is a sentence. #contains symbol and whitespace'
Or, as an alternative, find the first character that is an ASCII letter. For example:
>>> pos = next(i for i, x in enumerate(text) if x in string.ascii_letters)
>>> text[pos:]
'This is a sentence. #contains symbol and whitespace'
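One thing to watch with next() on a generator: if the text contains no letter at all, it raises StopIteration. A minimal sketch that passes a default instead; strip_to_first_letter is a hypothetical name, and it uses str.isalpha(), which also accepts non-ASCII letters, unlike string.ascii_letters:

```python
def strip_to_first_letter(text):
    # Index of the first letter, or len(text) if there is none,
    # in which case the slice below yields an empty string
    pos = next((i for i, ch in enumerate(text) if ch.isalpha()), len(text))
    return text[pos:]

print(strip_to_first_letter('" This is a sentence.'))  # This is a sentence.
print(repr(strip_to_first_letter('123 !!')))           # ''
```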
This is a very basic version; i.e. it uses syntax that beginners in Python will easily understand.
your_string = "1324 $$ '!' '' # this is a sentence."
while len(your_string) > 0 and your_string[0] not in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz":
    your_string = your_string[1:]
print(your_string)
#prints "this is a sentence."
Pros: Simple, no imports
Cons: The while loop could be avoided if you feel comfortable using list comprehensions.
Also, the string that you're comparing to could be simpler using regex.
Drop everything up to the first alpha character.
import itertools as it
s = " - .] * This is a sentence. #contains symbol and whitespace"
"".join(it.dropwhile(lambda x: not x.isalpha(), s))
# 'This is a sentence. #contains symbol and whitespace'
Alternatively, iterate over the string and test whether each character is in a blacklist. If it is, strip the character; otherwise, short-circuit.
def lstrip(s, blacklist=" "):
    for c in s:
        if c in blacklist:
            s = s.lstrip(c)
            continue
        return s
    return s  # every character was in the blacklist
lstrip(s, blacklist='\"[]*_-. ')
# 'This is a sentence. #contains symbol and whitespace'
You can use re.sub
import re
text = " This is a sentence. #contains symbol and whitespace"
# anchor at the start so that only the leading non-letters are removed
re.sub(r"^[^a-zA-Z]+", "", text)
# general form: re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)
