How to remove certain characters from a string? - python

My current string is "Spam, Eggs ( S & E)".
In our database, people are entered as "First, Last". Sometimes people add nicknames in the form ("Nickname"), so an example string would be "William, Smith (Will)". In every case I only want the first and last name.
Is there a quick solution to this?

import re
string1 = "Spam, Eggs ( S & E)"
string2 = re.sub(r'\(.*\)', "", string1).strip()
print(string2)
This results in the output you want. Repl here:
https://repl.it/repls/UselessColorlessArchitect
The regex matches an opening parenthesis, followed by any character (any number of times), followed by a closing parenthesis.
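As a minimal sketch of the same idea, the substitution can be wrapped in a small helper. The name strip_nickname is hypothetical, and the \([^)]*\) variant is an assumption rather than the answer's pattern: it stops at the first closing parenthesis, which only matters if a string holds more than one parenthesised group.

```python
import re

def strip_nickname(name):
    # Hypothetical helper: drop every "( ... )" group, then trim leftover spaces.
    # [^)]* stops at the first ")" instead of running on to the last one.
    return re.sub(r'\([^)]*\)', '', name).strip()

print(strip_nickname("Spam, Eggs ( S & E)"))
print(strip_nickname("William, Smith (Will)"))
```

With the two sample strings from the question this leaves only the "First, Last" part.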

Related

How to extract string information from these two strings?

I want to write a single regular expression code to extract the string from these two strings:
string1 = '#HISEQ:625:HC2T5BCXY:1:1101:1177:2101'
string2 = '#SRR7216015.1 HISEQ:630:HC2VKBCXY:1:1101:1177:2073/1'
I want to extract the string right after the # until it hits the end or a space, to get
HISEQ:625:HC2T5BCXY:1:1101:1177:2101 from string1
or
SRR7216015.1 from string2
So, how can I do it? I've tested a bunch of regular expressions but couldn't get it to work.
Below is the code I tried:
string1 = '#HISEQ:625:HC2T5BCXY:1:1101:1177:2101'
string2 = '#SRR7216015.1 HISEQ:630:HC2VKBCXY:1:1101:1177:2073/1'
pattern1 = re.compile(r'#(\w*.*:*\d*:*\w*:*\d*:*\d*[$|\s])')
print(pattern1.search(string1).group(1))
Thanks in advance!
Just use
#(\S+)
and take the first group. Lookarounds or alternations - as suggested in other answers - are expensive.
You could use this regex for that:
(?<=#).*?(?= |$)
This uses lookarounds. (?<=#) checks for a # sign before the match, (?= |$) requires a space or the end of the string after it, and .*? lazily matches everything in between.
https://regex101.com/r/p7kI2O/1
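Both suggestions can be tried side by side. This is only a sketch on the two strings from the question; the helper names extract_id and extract_id_lookaround are made up for illustration.

```python
import re

string1 = '#HISEQ:625:HC2T5BCXY:1:1101:1177:2101'
string2 = '#SRR7216015.1 HISEQ:630:HC2VKBCXY:1:1101:1177:2073/1'

def extract_id(line):
    # Capture the run of non-whitespace characters right after "#".
    match = re.search(r'#(\S+)', line)
    return match.group(1) if match else None

def extract_id_lookaround(line):
    # Same result using a lookbehind for "#" and a lookahead for space or end.
    match = re.search(r'(?<=#).*?(?= |$)', line)
    return match.group(0) if match else None

for s in (string1, string2):
    print(extract_id(s), extract_id_lookaround(s))
```

Both helpers return None when the line contains no #, so the caller can decide how to handle such input.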

Remove spaces from string after and before letter

I have quite a few strings that look like this: "a name / another name / something else".
I want to get to this: "a name/another name/something else".
Basically removing the spaces before and after the forward slashes only (not between the words themselves).
I know nothing about programming but I looked and found that this can be done with Python and Regex. I was a bit overwhelmed though with the amount of information I found.
You can use the pattern:
(?:(?<=\/) | (?=\/))
(?: Non capturing group.
(?<=\/) Lookbehind for /.
| OR
(?=\/) Positive lookahead for /.
) Close non capturing group.
You can try it live here.
Python snippet:
import re
text = 'a name / another name / something else'  # avoid shadowing the built-in str
print(re.sub(r'(?:(?<=\/) | (?=\/))', '', text))
Prints:
a name/another name/something else
There's no need for regex here, since you're simply replacing a string of literals.
text = "a name / another name / something else"  # avoid shadowing the built-in str
print(text.replace(" / ", "/"))
Here is an answer without using regex that I feel is easier to understand
string = "a name / another name / something else"
edited = "/".join([a.strip() for a in string.split("/")])
print(edited)
output:
a name/another name/something else
.join() joins elements of a sequence by a given separator, docs
.strip() removes beginning and trailing whitespace, docs
.split() splits the string into tokens by character, docs
This pattern matches any amount of whitespace surrounding a / so that it can be removed. I think the regex is relatively easy to understand:
\s*([\/])\s*
It has a capturing group that matches the forward slash (that's what the middle part is). The \s* parts match whitespace (zero or more characters).
You can then replace these matched strings with just a / to get rid of all the whitespace.
str1 being your string:
re.sub(" / ", "/" ,str1)
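A short sketch of the whitespace-tolerant substitution described above, assuming a hypothetical helper name tighten_slashes. Unlike the literal " / " replacement, it also handles missing or repeated spaces around the slash.

```python
import re

def tighten_slashes(text):
    # \s* matches zero or more whitespace characters on either side of "/",
    # so the whole run is replaced by a bare slash.
    return re.sub(r'\s*/\s*', '/', text)

print(tighten_slashes("a name / another name / something else"))
```

Spaces between the words themselves are untouched, because only whitespace adjacent to a slash is part of a match.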
Use the following code to remove all spaces before and after the / character:
import re
text = 'a name / another name / something else'  # avoid shadowing the built-in str
text = re.sub(r'(?:(?<=\/)\s*|\s*(?=\/))','', text)
Check this document for more information.

Python regex match only if standalone

Using re in python3, I want to match appearances of percentages in text, and substitute them with a special token (e.g. substitute "A 30% increase" by "A #percent# increase").
I only want to match if the percent expression is a standalone item. For example, it should not match "The product's code is A322%n43%". However, it should match when a line contains only one percentage expression like "89%".
I've tried using delimiters in my regex like \b, but because % is itself a non-alphanumeric character, it doesn't catch the end of the expression. Using \s makes it impossible to catch expressions standing by themselves on a line.
At the moment, I have the code:
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "1,211.21%")
' #percent# '
which still matches if the expression is followed by letters or other text (like the product code example above).
>>> re.sub(r"[+-]?[.,;]?(\d+[.,;']?)+%", ' #percent# ', "EEE1,211.21%asd")
'EEE #percent# asd'
What would you recommend?
Looks like a perfect job for Negative Lookbehind and Negative Lookahead:
re.sub(r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
'#percent#', string, flags=re.VERBOSE)
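(?<![^\s]) means "no non-whitespace character is allowed immediately before the current position", so the match must start at the beginning of the string or right after whitespace (add more allowed characters to the class if you need).
(?![^\s.,;!?'"]) means "the next character, if there is one, must be whitespace or one of . , ; ! ? ' "", so the match must be followed by the end of the string, whitespace, or that punctuation.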
Demo: https://regex101.com/r/khV7MZ/1.
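To see the lookarounds at work, here is a small sketch that compiles the same pattern once and runs it over the strings from the question. The names PERCENT and mask_percentages are illustrative only.

```python
import re

# Same pattern as in the answer; re.VERBOSE makes the spaces between parts insignificant.
PERCENT = re.compile(
    r'''(?<![^\s]) [+-]?[.,;]? (\d+[.,;']?)+% (?![^\s.,;!?'"])''',
    re.VERBOSE,
)

def mask_percentages(text):
    # Replace each standalone percentage with the special token.
    return PERCENT.sub('#percent#', text)

for sample in ("A 30% increase", "89%", "The product's code is A322%n43%"):
    print(mask_percentages(sample))
```

The product code is left alone because every candidate match in it is preceded by a non-whitespace character, which the lookbehind rejects.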
Try wrapping the "first" capture group in a "second" one, followed by \b.
original: r"[+-]?[.,;]?(\d+[.,;']?)+%"
suggested: r"[+-]?[.,;]?((\d+[.,;']?)+%)\b"

Print the line between specific pattern

I want to print the lines between specific string, my string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string I want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can I do this?
I tried to find the lines between two ##start and search for main.
Try something like:
def get_mains(my_string):
    # Remember the most recent "##start" header while scanning line by line.
    section = ''
    for line in my_string.split('\n'):
        if line.startswith("##start"):
            section = line
            continue
        if 'main' in line:
            yield '/'.join([section, line])

for main in get_mains(my_string):
    print(main)
There is a way to do this with regular expressions (regex for short), which Python provides through the re module.
Basically, regex is a whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it matches the regex pattern 'llo, Wor', because it contains an ell followed by an ell followed by an o followed by a comma and a space and a capital double-you and so on. On the surface it just looks like a substring test.
The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special sequence that stands for any letter in the alphabet (plus a few extras: digits and the underscore). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these 'special characters' and they are all very useful.
To actually answer your question,
I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double the backslash in \n, see the linked webpage). Then, everything in parentheses becomes a group. Groups are pieces of text that we want to be able to recall later. There are two groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*?, which means 'match any number of complete lines, but as few as possible'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1]  # don't want the trailing \n
    string += '/'
    string += match[1]
    print(string)
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
So Regex is pretty cool and other languages have it too. It is a very powerful tool, and you should learn how to use it here.
Also just a side note, I use the .format function, because I think it looks much cleaner and easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') just becomes evaluated to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world
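The sample output of the regex answer pairs version/info/main with ##start/new, because the lazy {line}*? is free to run across a later ##start header. One possible refinement, sketched here as an assumption rather than the answerer's method, is to forbid ##start lines between the header and the main line with a negative lookahead. The names SECTION_MAIN and find_mains, the use of re.MULTILINE, and the trimmed sample data (without the elided lines) are illustrative choices.

```python
import re

my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
'''

SECTION_MAIN = re.compile(
    r'^(##start.*)\n'            # the header line of a section
    r'(?:(?!##start).*\n)*?'     # as few non-header lines as possible
    r'(?!##start)(.*main)$',     # a non-header line that ends in "main"
    re.MULTILINE,
)

def find_mains(text):
    # Each findall result is a (header, main_line) pair; join them with "/".
    return ['/'.join(pair) for pair in SECTION_MAIN.findall(text)]

for joined in find_mains(my_string):
    print(joined)
```

On this sample the ##start/new section produces nothing, since it has no line ending in main before the next header.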

python regular expression match comma

In the following string,how to match the words including the commas
--
process_str = "Marry,had ,a,alittle,lamb"
import re
re.findall(r".*",process_str)
['Marry,had ,a,alittle,lamb', '']
--
process_str="192.168.1.43,Marry,had ,a,alittle,lamb11"
import re
ip_addr = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",l)
re.findall(ip_addr,process_str1)
How to find the words after the IP address, excluding only the first comma,
i.e. the output is again expected to be Marry,had ,a,alittle,lamb11
Also, in the second example above, how to find out whether the string ends with a digit?
In the second example, you just need to capture (using ()) everything that follows the ip:
import re
s = "192.168.1.43,Marry,had ,a,alittle,lamb11"
text = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3},(.*)", s)[0]
# text now holds the string Marry,had ,a,alittle,lamb11
To find out if the string ends with a digit, you can use the following:
re.match(r".*\d$", process_str)
That is, you match the entire string (.*), and then backtrack to test if the last character (using $, which matches the end of the string) is a digit.
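Put together, the two steps might look like the sketch below. The helper names text_after_ip and ends_with_digit are hypothetical, and the IP pattern is the same loose one as in the question (it does not check that each number is at most 255).

```python
import re

# Four dot-separated groups of 1-3 digits, a comma, then capture the rest.
IP_PREFIX = re.compile(r"\d{1,3}(?:\.\d{1,3}){3},(.*)")

def text_after_ip(line):
    # Return everything after "ip," or None if the line does not start with an IP.
    match = IP_PREFIX.match(line)
    return match.group(1) if match else None

def ends_with_digit(line):
    # re.search with \d$ is enough; there is no need to match the whole string.
    return re.search(r"\d$", line) is not None

s = "192.168.1.43,Marry,had ,a,alittle,lamb11"
print(text_after_ip(s))
print(ends_with_digit(s))
```

Using re.search for the trailing digit avoids the leading .* and the backtracking it implies.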
Find the words including the commas, that's how I understand this sentence:
>>> re.findall(r"\w+,*", process_str)
['Marry,', 'had', 'a,', 'alittle,', 'lamb']
Ending with a digit:
"[0-9]+$"
Hmm. The examples are not quite clear, but it seems in example #2, you want to only match text , commas, space-chars, and ignore digits? How about this:
re.findall('(?i)([a-z, ]+)', process_str)
I didn't quite understand the "if the string is ending with a digit". Does that mean you ONLY want to match 'Mary...' IF it ends with a digit? Then that would look like this:
re.findall(r'(?i)([a-z, ]+)\d+', process_str)
