find the specific part of string between special characters - python

I am trying to extract specific parts of a string using regex or something similar.
For example:
string = "hi i am *hadi* and i have &18& year old"
name = regex.find("query")
age = regex.find("query")
print(name,age)
Result:
hadi 18
I need the 'hadi' and the '18'.
Note: the string is different each time. I need the sentence or
words between the * characters and between the & characters.

Try:
import re
string = "hi i am *hadi* and i have &18& year old"
pattern = r'(?:\*|&)(\w+)(?:\*|&)'
print(re.findall(pattern, string))
Outputs:
['hadi', '18']
You can assign the result of re.findall(pattern, string) to a variable; it is an ordinary Python list, so you can index into it to get the individual values.
Regex demo:
https://regex101.com/r/vIg7lU/1
The \w+ in the regex can be changed to .*? if there is more than numbers and letters. Example: (?:\*|&)(.*?)(?:\*|&) and demo: https://regex101.com/r/RIqLuI/1
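One thing to keep in mind with the combined pattern is that it accepts either delimiter on either side, so it cannot tell you which value was the name and which was the age. A minimal sketch of one way to pull them out separately is below. The helper name between is made up for illustration, and it assumes each delimiter is a single character that appears in pairs.

```python
import re

def between(text, delimiter):
    """Return every substring enclosed by a pair of `delimiter` characters.

    `between` is a hypothetical helper, not part of the re module.
    """
    d = re.escape(delimiter)  # escape so that '*' is treated literally
    # Lazy match: stop at the first closing delimiter.
    return re.findall(f"{d}(.*?){d}", text)

string = "hi i am *hadi* and i have &18& year old"
name = between(string, "*")
age = between(string, "&")
print(name[0], age[0])  # hadi 18
```

Because the delimiter is passed through re.escape, the same helper should work for characters such as ^ or + that have a special meaning in a regex.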

This is how I solved my problem:
import re
string = "hello. my name is *hadi* and i am ^18^ years old."
name = re.findall(r"\*(.+)\*", string)
age = re.findall(r"\^(.+)\^", string)
print(name[0], age[0])
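A note on the .+ used here: it is greedy, so it works when each marker pair appears once, but it will swallow everything up to the last marker if a string contains several pairs. The sketch below uses a made-up sentence to show the difference between the greedy form and the lazy .+? form.

```python
import re

# Hypothetical input with two starred names, for illustration only.
text = "my name is *hadi* and my friend is *sara*"

greedy = re.findall(r"\*(.+)\*", text)   # runs from the first * to the last *
lazy = re.findall(r"\*(.+?)\*", text)    # stops at the nearest closing *

print(greedy)  # ['hadi* and my friend is *sara']
print(lazy)    # ['hadi', 'sara']
```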

Related

Remove String between two characters for all occurrences

I am looking for help on string manipulation in Python 3.
Input String
s = "ID bigint,FIRST_NM string,LAST_NM string,FILLER1 string"
Desired Output
s = "ID,FIRST_NM,LAST_NM,FILLER1"
Basically, the objective is to remove anything between space and comma at all occurrences in the input string.
Any help is much appreciated
Using a simple regex:
import re
s = "ID bigint,FIRST_NM string,LAST_NM string,FILLER1 string"
# remove each whitespace character together with the word that follows it
res = re.sub(r'\s\w+', '', s)
print(res)
# output: ID,FIRST_NM,LAST_NM,FILLER1
You can use a regex with a lookahead to keep only the words that are followed by a space and another word:
import re
s = "ID bigint,FIRST_NM string,LAST_NM string,FILLER1 string"
s = ','.join(re.findall(r'\w+(?= \w+)', s))
print(s)
Output:
ID,FIRST_NM,LAST_NM,FILLER1
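If every field really has the shape "name type", the same result can probably be had without a regex at all. This is only a sketch and assumes that no field is empty and that the column name always comes first.

```python
s = "ID bigint,FIRST_NM string,LAST_NM string,FILLER1 string"

# Split into fields on the comma, then keep the first whitespace-separated
# token of each field (the column name).
names = [field.split()[0] for field in s.split(",")]
result = ",".join(names)
print(result)  # ID,FIRST_NM,LAST_NM,FILLER1
```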

Python: Splitting year number in brackets in a string using regex

Suppose I have a string like s = "The Invisible Man (2020)". In Python, I want to split it into a list containing the title and the year (the year is always in parentheses at the end of the string), like below:
['The Invisible Man', '2020']
How can I achieve this goal using a regular expression in Python?
Here's one way using re.split, which works for this specific string structure:
import re
s = "The Invisible Man (2020)"
re.split(r'\s+\((\d+)\)', s)[:2]
# ['The Invisible Man', '2020']
Here is one way using a regex with named groups. The longest run of letters and spaces that is followed by a space and an opening parenthesis is captured as name. The four-digit number inside the parentheses is captured as year.
Finally, build a list as requested in the question.
import re
r = re.compile(r'(?P<name>([a-zA-Z ]*)) \((?P<year>\d\d\d\d)\)')
m = r.match("The Invisible Man (2020)")
l = [m.group('name'), m.group('year')]
You can write a regex for the whole string, then call re.search and use the groups() method of the returned match object to get the title and year out of the string:
import re
s = "The Invisible Man (2020)"
regex = r"(.+) \((\d+)\)"
title, year = re.search(regex, s).groups()
print('title = "{}", year = "{}"'.format(title, year))
Output:
title = "The Invisible Man", year = "2020"
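All of the approaches above assume the string really does end in a parenthesised year. If that is not guaranteed, re.search and re.match return None and the attribute access fails. A small sketch that guards against this is below. The function name split_title_year and the four-digit assumption are mine, not from the question.

```python
import re

# Title (lazy), optional whitespace, then a four-digit year in parentheses
# at the very end of the string.
TITLE_YEAR = re.compile(r"(.+?)\s*\((\d{4})\)\s*$")

def split_title_year(text):
    """Return [title, year], or None if the text has no trailing (YYYY)."""
    match = TITLE_YEAR.match(text)
    if match is None:
        return None
    return list(match.groups())

print(split_title_year("The Invisible Man (2020)"))  # ['The Invisible Man', '2020']
print(split_title_year("No year here"))              # None
```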

Removing all spaces, punctuation, and capitalization [duplicate]

This question already has answers here:
Best way to replace multiple characters in a string?
(16 answers)
Closed 3 years ago.
We are working on the Vigenere cipher in my computer science class and one of the first steps our teacher wants us to take is to delete all whitespace, punctuation, and capitalization from a string.
# pre-process - removing spaces, punctuation, and capitalization
def pre_process(s):
    str = s.lower()
    s = (str.replace(" ", "") + str.replace("!?'.", ""))
    return s
print(pre_process("We're having a surprise birthday party for Eve!"))
What I want the output to be is "werehavingasurprisebirthdaypartyforeve" but what I'm actually getting is "we'rehavingasurprisebirthdaypartyforeve!we're having a surprise birthday party for eve!"
You should use regex instead of string replace. Try this code.
import re
mystr="We're having a surprise birthday party for Eve!"
# list every punctuation character you want removed inside the character class (the space is in there too)
result=re.sub("[.'!#$%&\'()*+,-./:;<=>?#[\\]^ `{|}~]","",mystr)
print(result.lower())
str.replace("!?'.", "") replaces only the exact substring !?'., not each of the four characters on its own.
You need to use a separate replace call for each character, or otherwise use regular expressions.
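The "separate replace call for each character" idea can be written as a loop. This is a rough sketch that keeps the asker's function name; the unwanted parameter and its default value are assumptions for illustration.

```python
def pre_process(text, unwanted=" !?'."):
    """Lowercase `text` and delete every character listed in `unwanted`."""
    cleaned = text.lower()
    for ch in unwanted:
        # One replace call per character, not one call for the whole string.
        cleaned = cleaned.replace(ch, "")
    return cleaned

print(pre_process("We're having a surprise birthday party for Eve!"))
# werehavingasurprisebirthdaypartyforeve
```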
Your solution does not work because it attempts to remove the literal string "!?'." rather than each character individually.
One way to accomplish this would be the following:
import re
regex = re.compile('[^a-zA-Z]')
s = "We're having a surprise birthday party for Eve!"
s = regex.sub('', s).lower()
import re
def preprocess(s):
    return re.sub(r'[\W_]', '', s).lower()
re.sub removes every character that is not a letter or a digit (\W matches non-word characters, and _ is listed explicitly because \w includes it).
lower() removes capitalization.
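Note that [\W_] keeps digits. For a Vigenere cipher you may want letters only, so here is a small comparison. The letters_only variant is my own sketch and assumes plain ASCII input.

```python
import re

def preprocess(s):
    # Drops everything except letters and digits, then lowercases.
    return re.sub(r"[\W_]", "", s).lower()

def letters_only(s):
    # Hypothetical stricter variant: lowercases first, then keeps a-z only.
    return re.sub(r"[^a-z]", "", s.lower())

sample = "Eve_2 is 21!"
print(preprocess(sample))    # eve2is21
print(letters_only(sample))  # eveis
```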
An approach without using regex:
>>> import string
>>> s
"We're having a surprise birthday party for Eve!"
>>> s.lower().translate(str.maketrans("", "", string.punctuation)).replace(" ", "")
'werehavingasurprisebirthdaypartyforeve'
Change your code as below:
def pre_process(s):
    s = s.lower()
    s = s.replace(" ", "")
    s = s.replace("!", "")
    s = s.replace("'", "")
    return s
print(pre_process("We're having a surprise birthday party for Eve!"))
str.translate is also an option. You can create a translation table using str.maketrans, where the characters in the first argument (ascii_uppercase) are translated to the corresponding characters in the second (ascii_lowercase). The third argument (punctuation + whitespace) is a string of the characters you want deleted:
from string import ascii_lowercase, ascii_uppercase, punctuation, whitespace
table = str.maketrans(ascii_uppercase, ascii_lowercase, punctuation + whitespace)
s = "We're having a surprise birthday party for Eve!"
print(s.translate(table))
# werehavingasurprisebirthdaypartyforeve
Once the table is initialized, every subsequent string can be converted by applying:
s.translate(table)
You could use re:
>>> import re
>>> x
"We're having a surprise birthday party for Eve!"
>>> re.sub(r'[^a-zA-Z0-9]', '', x).lower()  # negated character class; the faster of the two
'werehavingasurprisebirthdaypartyforeve'
or a generator expression:
>>> import string
>>> ''.join(y for y in x if y in string.ascii_letters).lower()
'werehavingasurprisebirthdaypartyforeve'
A quick benchmark:
>>> import timeit
>>> timeit.timeit("''.join(y for y in x if y in string.ascii_letters).lower()", setup='import string;x = "We\'re having a surprise birthday party for Eve!"')
7.747261047363281
>>> timeit.timeit("re.sub(r'[^a-zA-Z0-9]', '', x).lower()", setup='import re;x = "We\'re having a surprise birthday party for Eve!"')
2.912994146347046

python search string for numbers, put brackets around them

I'm trying to search a string for numbers, and when finding them, wrap some chars around them, e.g.
a = "hello, i am 8 years old and have 12 toys"
a = method(a)
print(a)
"hello, i am \ref{8} years old and have \ref{12} toys"
I've looked at the re (regular expression) library, but cannot seem to find anything helpful... any cool ideas?
This is pretty basic usage of the .sub method:
import re
numbers = re.compile(r'(\d+)')
a = numbers.sub(r'\\ref{\1}', a)
The parentheses around the \d+ number pattern create a group, and the \1 reference is replaced with the contents of the group. The backslash before ref is doubled because a lone \r in the replacement template would be read as a carriage return.
>>> import re
>>> a = "hello, i am 8 years old and have 12 toys"
>>> numbers = re.compile(r'(\d+)')
>>> a = numbers.sub(r'\\ref{\1}', a)
>>> print(a)
hello, i am \ref{8} years old and have \ref{12} toys
You need to use the re.sub function along these lines:
re.sub(r"(\d+)", my_sub_func, text)  # matches runs of digits (integers only, not decimals)
where my_sub_func is defined like this:
def my_sub_func(match_obj):
    text = match_obj.group(0)  # the matched digits
    new_text = "\\ref{" + text + "}"  # wrap them in \ref{...}
    return new_text
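Put together as a runnable sketch, the function-based approach could look like this. The name wrap_number and the second, decimal-aware pattern are my own additions for illustration; the string returned by the function is inserted as-is, so no extra escaping of the backslash is needed beyond the normal Python string literal.

```python
import re

def wrap_number(match_obj):
    # group(0) is the whole matched text, here the number itself.
    return "\\ref{" + match_obj.group(0) + "}"

text = "hello, i am 8 years old and have 12 toys"
result = re.sub(r"\d+", wrap_number, text)
print(result)  # hello, i am \ref{8} years old and have \ref{12} toys

# Assumed extension: also treat an optional fractional part as one number.
priced = re.sub(r"\d+(?:\.\d+)?", wrap_number, "i paid 3.50 for 2 toys")
print(priced)  # i paid \ref{3.50} for \ref{2} toys
```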

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and extract specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take the substring that lies between two known characters, using them as the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString = 'type="NN" span="123..145" confidence="1.0" '
pat = re.compile('"([^"]*)"')
strings = []  # collected matches
while True:
    mat = pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    inputString = inputString[mat.end():]  # continue searching after this match
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
Splitting on the quote character also works; the quoted values end up at the odd indices:
fields = inputString.split('"')
print(fields[1], fields[3], fields[5])
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The parentheses form a capture group around the part you are interested in, which is what group(1) returns.
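If you also want to know which value belongs to which key, two capture groups can be collected into a dict. This is just a sketch and assumes the keys are plain word characters and the values contain no embedded double quotes.

```python
import re

input_string = 'type="NN" span="123..145" confidence="1.0" '

# Each match yields a (key, value) tuple: the word before = and the quoted text.
pairs = re.findall(r'(\w+)="([^"]*)"', input_string)
attributes = dict(pairs)

print(attributes)
# {'type': 'NN', 'span': '123..145', 'confidence': '1.0'}
print(attributes["span"])  # 123..145
```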
