My question is quite simple
I'm trying to come up with a regex to select any word or phrase that sits between two delimiters.
For example, the strings can look like this:
') as whatever '
and it can also look like
') as whatever\r\n'
So I need to extract 'whatever' from this string.
The Regex I came up with is this :
\)\sas\s(.*?)\s
It works fine and extracts 'whatever', but it only works for the first example, not the second. What should I do for the second string?
I'm basically looking for an OR condition kind of thing!
Any help would be appreciated
Thanks in advance
The question is not very clear but maybe the regular expression syntax you are looking for might be something like this:
\)\sas\s(.*?)[ \r\n]
This says that the text you are interested in is followed by a space or by one of the other listed characters.
EDIT
As an example, take the following Python 2 code. Inside square brackets you list the alternative characters directly; the '|' OR operator only works outside the brackets (inside them it is a literal pipe). The classes below catch strings whose next character is a space, '\r', '\n', a '.' or a 'd'.
import re
a = ') as whatever '
b = ') as whatever\r\n'
c = ') as whatever.'
d = ') as whateverd'
a_res = re.findall(r'\)\sas\s(.*?)[ \r\n]', a)[0] #ending with space, \r or new line char
b_res = re.findall(r'\)\sas\s(.*?)[ \r\n]', b)[0]
c_res = re.findall(r'\)\sas\s(.*?)[ \r\n.]', c)[0] #ending with space, \r, new line char or .
d_res = re.findall(r'\)\sas\s(.*?)[ \r\n.d]', d)[0] #ending with space, \r, new line char, . or d
print(a_res, len(a_res))
print(b_res, len(b_res))
print(c_res, len(c_res))
print(d_res, len(d_res))
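For an actual OR between alternatives, a non-capturing group with | (outside square brackets) can be used instead of a character class. The sketch below is illustrative only: the PATTERN name and the extra end-of-string alternative are assumptions, not part of the answer above.

```python
import re

# Hypothetical pattern: the captured word ends at whitespace, a literal dot,
# or the end of the string. (?:...) groups the alternatives without capturing.
PATTERN = re.compile(r'\)\sas\s(.*?)(?:\s|\.|$)')

samples = [') as whatever ', ') as whatever\r\n', ') as whatever.', ') as whatever']
results = [PATTERN.search(s).group(1) for s in samples]
print(results)
```

Because \s already covers the space, '\r' and '\n', only the dot and the end of the string need their own alternatives here.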
Your regex already works as intended for both strings, because \s also matches \r and \n. Please check it:
import re
a =') as whatever '
b=') as whatever\r\n'
print(re.findall(r'\)\sas\s(.*?)\s', a)[0])
print(re.findall(r'\)\sas\s(.*?)\s', b)[0])
This outputs:
whatever
whatever
Related
I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the below code snippet I would like to split line by commas except the commas in-between two $s.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
    print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, (?<=\$) is a lookbehind: it matches a position immediately after a $.
Similarly, (?=\$) is a lookahead: it matches a position immediately before a $.
Neither consumes any characters. So to match only a comma that has a $ on both sides you can use:
expression = r"(?<=\$),(?=\$)"
Full program:
import re
expression = r"(?<=\$),(?=\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
One solution (maybe not the most elegant, but it will work) is to replace the string $,$ with something like $,,$ and then split on ,,. Something like this:
output = line.replace('$,$','$,,$').split(',,')
Using regex like mousetail suggested is the more elegant and robust solution but requires knowing regex (not that anyone KNOWS regex)
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
    print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$
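Since the data comes from a CSV file, another option may be to let the standard csv module treat $ as the quote character. This is only a sketch under that assumption; note that, unlike the output requested in the question, the reader strips the surrounding $ from each field. The sample_file name is a stand-in for the real file.

```python
import csv
import io

# Stand-in for the real CSV file; io.StringIO keeps the sketch self-contained.
sample_file = io.StringIO("$abc,def$,$ghi$,$jkl,mno$\n")

# quotechar='$' makes the reader treat $...$ as one quoted field,
# so commas inside the dollar signs are not used as separators.
rows = list(csv.reader(sample_file, quotechar='$'))
for field in rows[0]:
    print(field)
```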
I am trying to use a regex to look through a specific part of a string and take what is in between, but I can't get the right pattern for this.
My biggest issue is with trying to form a Regex pattern for this. I've tried a bunch of variations close to the example listed. It should be close.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = (re.search("(?<=Link Created)(.+?)(?=Research Done)", text).lower())
# Gets rid of whitespace in case they move the []/[x] around
result = result.replace(" ", "")
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")
Happy Path:
I take string (text) and use Regex expression to get the substring between Link Created and Research Done.
Then make the result lowercase and get rid of whitespace just in case they move the []/[x]s. Then it looks at the string (result) for '[]' or '[x]' and print.
Actual Output:
At the moment all I keep getting is None because the regex syntax is off...
If you want . to match newlines, you have to use the re.S option.
Also, it would be a better idea to check whether the regex matched before making further calls. Your call to lower() gave me an error because the regex didn't match and re.search returned None, so calling result.group(0).lower() only when result evaluates as true is safer.
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# Regex to search between parameters and make result lowercase if there are any uppercase Chars
result = (re.search("(?<=Link Created)(.+?)(?=Research Done)", text, re.S))
if result:
    # Gets rid of whitespace in case they move the []/[x] around
    result = result.group(0).lower().replace(" ", "")
    if any(x in result for x in toFind):
        print("Exists")
    else:
        print("Doesn't Exist")
else:
    print("re did not match")
PS: all the re options are documented in the re module documentation. Search for re.DOTALL for the details on re.S (they're synonyms). If you want to combine options, use bitwise OR. E.g., re.S|re.I will have . match newline and do case-insensitive matching.
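As a small illustration of combining flags with bitwise OR, the sketch below searches with lower-case markers and relies on re.I to match the mixed-case text, while re.S lets . cross the newline. The lower-case pattern and the variable names are assumptions made for the example.

```python
import re

text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "

# re.S lets . match the newline; re.I makes the lower-case markers
# match "Link Created" and "Research Done" in the text.
match = re.search(r"link created(.+?)research done", text, re.S | re.I)
between = match.group(1) if match else None
print(repr(between))
```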
I believe it's the \n newline characters giving issues. You can get around this using [\s\S]+ as such:
import re
toFind = ['[]', '[x]']
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
# New regex to match text between
# Remove all newlines, tabs, whitespace and column separators
result = re.search(r"Link Created([\s\S]+)Research Done", text).group(1)
result = re.sub(r"[\n\t\s\|]*", "", result)
if any(x in result for x in toFind):
    print("Exists")
else:
    print("Doesn't Exist")
Seems like regex is overkill for this particular job unless I am missing something (also not clear to me why you need the step that removes the whitespace from the substring). You could just split on "Link Created" and then split the following string on "Research Done".
text = "| Completed?|\n|------|:---------:|\n|Link Created | [] |\n|Research Done | [X] "
s = text.split("Link Created")[1].split("Research Done")[0].lower()
if "[]" in s or "[x]" in s:
    print("Exists")
else:
    print("Doesn't Exist")
# Exists
I was trying to get a nice and clean representation of a string. My desired version would be ['Course Number: CLASSIC 10A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4']
However, the current output is ['Course Number: CLASSIC\xa010A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4'].
Something (\xa0) is getting in the way in the first element. I will attach the relevant part of the code below. Thanks in advance for helping me out.
all_tds = [get_tds(scrollable) for scrollable in scrollables]
def num_name_unit(list, index):
    all_rows = []
    num = list[index][0].get_text(strip=True)
    name = str.isalnum, list[index][1].get_text(strip=True)
    unit = list[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows
c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
print(c)
As @melpomene commented, '\xa0' is a single character: a non-breaking space. One way to deal with it is to replace every run of unwanted characters with a plain space, using a regex:
import re
re.sub('[^A-Za-z0-9-|:]+', ' ', text)
This is generally my preferred way of removing special characters and formatting, but how does it work? Look at the first argument, '[^A-Za-z0-9-|:]+'. The leading ^ negates the character class, so it matches anything that is not listed. A-Z covers the capital letters, a-z the lower-case letters and 0-9 the digits; the lone - is a literal hyphen, and |: adds pipes and colons. The trailing + matches one or more such characters in a row, and each run is replaced by a single space. Let's test this with a simple script:
import re
vals = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|:'
print(vals == re.sub('[^A-Za-z0-9-|:]+', ' ', vals))
I would recommend running this code yourself to try it out but the answer you get back is True.
Incorporating this into your script would be as simple as:
import re
all_tds = [get_tds(scrollable) for scrollable in scrollables]
def num_name_unit(list, index):
    all_rows = []
    num = list[index][0].get_text(strip=True)
    name = list[index][1].get_text(strip=True)
    unit = list[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows
c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
# c is a list of strings, so clean each row separately
print([re.sub('[^A-Za-z0-9-|:]+', ' ', row) for row in c])
If you encounter any other characters you wish to keep in your string, simply add them to the end of ^A-Za-z0-9-|:. For example, if you wished to keep underscores as well you would use '[^A-Za-z0-9-|:_]+'
Hope this helped. To read more go to the regex how to section of the python3 docs.
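If only the non-breaking space needs to go, a plain str.replace or Unicode NFKC normalization may be enough. The comparison below is a sketch using the string from the question; the variable names are illustrative, and which option fits the scraped data best is left open.

```python
import re
import unicodedata

raw = 'Course Number: CLASSIC\xa010A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4'
expected = 'Course Number: CLASSIC 10A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4'

# Option 1: replace just the non-breaking space.
by_replace = raw.replace('\xa0', ' ')
# Option 2: NFKC normalization maps U+00A0 to an ordinary space.
by_nfkc = unicodedata.normalize('NFKC', raw)
# Option 3: the regex from the answer, which also drops any other unlisted character.
by_regex = re.sub('[^A-Za-z0-9-|:]+', ' ', raw)

print(by_replace == by_nfkc == by_regex == expected)
```

The first two options leave all other punctuation untouched, whereas the regex turns every character outside its whitelist into a space.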
I have the following input:
"auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"
I need to store them in a list with $, <, and > serving as delimiters.
Expected output:
['auth-server', '$na me$', '$1n ame$', '[position', '[$pr io$]]', 'xxxx', '[match-fqdn', '[[$fq dn$]', '[all]]]']
How can I do this?
What you could do is split it on the spaces, then go through each substring and check if it starts with one of the special delimiters. If it does, start a new string and append subsequent strings until you get to the end delimiter. Then remove those substrings and replace them with the new one.
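The split-and-merge idea can be sketched roughly as follows. As a simplifying assumption, instead of checking how each piece starts, the sketch keeps appending pieces until the number of $ signs collected so far is even; the function name is hypothetical.

```python
def split_keeping_dollar_groups(s):
    """Split on spaces, but glue pieces back together while a $...$ pair is open."""
    result = []
    pending = []  # pieces of the token currently being assembled
    for piece in s.split(' '):
        pending.append(piece)
        candidate = ' '.join(pending)
        # An even number of $ means every opening $ has been closed.
        if candidate.count('$') % 2 == 0:
            result.append(candidate)
            pending = []
    if pending:  # unbalanced $ at the end: keep whatever is left
        result.append(' '.join(pending))
    return result


data = "auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"
tokens = split_keeping_dollar_groups(data)
print(tokens)
```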
I think what you want is
import re
re.split(r"(?<=\]) | (?=\$|\[)", "auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]")
This yields
['auth-server', '$na me$', '$1n ame$', '[position', '[$pr io$]]', 'xxxx', '[match-fqdn', '[[$fq dn$]', '[all]]]']
Note however that this is not exactly what you described, but what matches your example. It seems that you want to split on spaces when they are preceded by ] or followed by $ or [.
Try re.split with a regex that could make someone cry blood:
import re
print(re.split(r'(\$[^\$]+\$|\[\S+([^\]]+\]\])?|[-0-9a-zA-Z]+)',"auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"))
consider using pyparsing:
from pyparsing import *
enclosed = Forward()
nestedBrackets = nestedExpr('[', ']')
enclosed << ( Combine(Group(Optional('$') + Word(alphas) + Optional('$'))) | nestedBrackets )
print enclosed.parseString(data).asList()
output:
[['auth-server', '$na', 'me$', '$1n', 'ame$', ['position', ['$pr', 'io$']], 'xxxx',
['match-fqdn', [['$fq', 'dn$'], ['all']]]]]
Not quite a full answer, but I used a regex search...
import re
a = "auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"
m = re.search(r'\$.*\$', a)
Combine this with a.split() and we can do the math...
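One possible way to finish this idea without a separate a.split() step is a single re.findall whose first alternative grabs a $...$ pair together with any brackets attached to it, and whose second alternative falls back to an ordinary whitespace-delimited word. This is a sketch of that assumption, checked only against the sample input.

```python
import re

a = "auth-server $na me$ $1n ame$ [position [$pr io$]] xxxx [match-fqdn [[$fq dn$] [all]]]"

# First alternative: optional non-space prefix, a $...$ pair (which may contain
# spaces), optional non-space suffix. Second alternative: any other run of non-spaces.
tokens = re.findall(r'\S*\$[^$]*\$\S*|\S+', a)
print(tokens)
```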
I'm trying to make a simple regex that would recognize micro dvd format:
{52}{118}some text
{123}{202}some text
{203}{259}some text
{261}{309}some text
My code looks like the following. match_obj is None and I don't know why:
import re
my_re = r"\{([0-9]*)\}\{[0-9]\}(.*)"
f = open('abc.txt')
match_obj = re.match(my_re, f.readline())
I have tried also:
match_obj = re.match(my_re, f.readline(), re.M|re.I)
with the same results.
You're very close - you're just missing a repeat symbol in the second number section. Your regex should look like this:
my_re = r"\{([0-9]*)\}\{[0-9]*\}(.*)"
Notice the added asterisk after the second [] block.
\{([0-9]*)\}\{[0-9] \}(.*)
                  /|\
                   |
You're missing a repeater in your second number character class.
I'm not sure about the rules of movie subtitles, but I would assume the brackets can not be empty.
A stricter regex would then be (albeit, probably not needed in your case):
\{([0-9]+)\}\{[0-9]+\}(.*)
The + repeater means 1 or more. The * repeater means 0 or more.
Are you only interested in the first number?
Is the text meant to be optional?
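Putting these points together, a line parser might look like the sketch below. It assumes both frame numbers are wanted (so the second one gets its own capturing group) and that the text may be empty; the MICRODVD_LINE and parse_line names are made up for the example.

```python
import re

# + requires at least one digit in each pair of braces; (.*) allows empty text.
MICRODVD_LINE = re.compile(r"\{([0-9]+)\}\{([0-9]+)\}(.*)")


def parse_line(line):
    """Return (start_frame, end_frame, text), or None if the line does not match."""
    match = MICRODVD_LINE.match(line)
    if match is None:
        return None
    start, end, text = match.groups()
    return int(start), int(end), text


parsed = parse_line("{52}{118}some text\n")
rejected = parse_line("{}{}some text\n")
print(parsed, rejected)
```

Because . does not match a newline by default, the trailing line break from f.readline() is left out of the captured text.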