Strike-Down Markdown text between two symbols - python

I'm trying to emulate GitHub's strike-through Markdown in Python, and I have it half working. One problem remains: the pattern I'm using doesn't replace text that contains symbols, and I can't figure out why. I hope someone can help.
text = "This is a ~~test?~~"
match = re.findall(r"(?<![.+?])(~{2})(?!~~)(.+?)(?<!~~)\1(?![.+?])", text)  # Finds all the text between ~~ symbols
if match:
    for _, m in match:  # Iterates through the matches. The first variable (_) holds the ~~ delimiter and the second one (m) holds the text I want to replace
        text = re.sub(f"~~{m}~~", "\u0336".join(m) + "\u0336", text)  # Should replace ~~test?~~ with t̶e̶s̶t̶?̶ but it fails

The problem is in the pattern you pass to re.sub. With m equal to test?, the pattern ~~{m}~~ becomes ~~test?~~, and in a regex ? is a quantifier that makes the preceding t optional. That pattern matches ~~tes~~ or ~~test~~ but never the literal text ~~test?~~, so nothing gets replaced. Use re.escape(m) instead of m so that metacharacters are escaped and treated as literals.
Here is your code with that change:
import re
text = "This is a ~~test?~~"
match = re.findall(r"(?<![.+?])(~{2})(?!~~)(.+?)(?<!~~)\1(?![.+?])", text)  # Finds all the text between ~~ symbols
if match:
    for _, m in match:  # The first variable (_) holds the ~~ delimiter and the second one (m) holds the text to replace
        print(m)
        text = re.sub(f"~~{re.escape(m)}~~", "\u0336".join(m) + "\u0336", text)  # Replaces ~~test?~~ with t̶e̶s̶t̶?̶
print(text)
This performs the replacement you expected. It prints the matched text first, then the result:
test?
This is a t̶e̶s̶t̶?̶
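As a side note, one way to avoid building a pattern from the matched text at all is to pass a replacement function to re.sub. The following is only a minimal sketch, not the asker's original approach: the pattern ~~(.+?)~~ is a simplified stand-in for the lookaround-heavy pattern above, and strike and strikethrough are hypothetical names chosen for illustration.

```python
import re

def strike(match):
    # Append the combining long-stroke overlay (U+0336) to every character.
    return "".join(ch + "\u0336" for ch in match.group(1))

def strikethrough(text):
    # Simplified pattern (an assumption): the shortest text between two ~~ pairs.
    return re.sub(r"~~(.+?)~~", strike, text)

# Why the unescaped pattern failed: "?" makes the preceding "t" optional,
# so the pattern never matches a literal question mark.
print(re.search("~~test?~~", "This is a ~~test?~~"))  # None
print(re.escape("test?"))                             # test\?
print(strikethrough("This is a ~~test?~~"))
```

Because the replacement is computed from the match object, the matched text is never interpreted as a regex, so no escaping is needed.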

Related

How to extract value from re?

import re
cc = 'test 5555555555555555/03/22/284 test'
cc = re.findall('[0-9]{15,16}\/[0-9]{2,4}\/[0-9]{2,4}\/[0-9]{3,4}', cc)
print(cc)
['5555555555555555/03/22/284']
This code works fine, but if I put 5555555555555555|03|22|284 in the cc variable, I get this output:
[]
I want it to handle both cases: if the value contains '|', the output should be 5555555555555555|03|22|284, and if it contains '/', the output should be 5555555555555555/03/22/284.
Just replace all the /s in your regex (which incidentally don't need to be backslashed) with [/|], which matches either a / or a |. Or if you want backslashes, too, as in your comment on Zain's answer, [/|\\]. (You should always use raw strings r'...' for regexes since they have their own interpretation of backslashes; in a regular string, [/|\\] would have to be written [/|\\\\].)
match = re.findall(
    r'[0-9]{15,16}[/|\\][0-9]{2,4}[/|\\][0-9]{2,4}[/|\\][0-9]{3,4}',
    cc)
Any other characters you want to include, like colons, can likewise be added between the square brackets.
If you want to accept repeated characters – and treat them as a single delimiter – you can add + to accept "1 or more" of any of the characters:
match = re.findall(
    r'[0-9]{15,16}[:/|\\]+[0-9]{2,4}[:/|\\]+[0-9]{2,4}[:/|\\]+[0-9]{3,4}',
    cc)
But that will accept, for example, 555555555555555:/|\\03::|::22\\//284 as valid. If you want to be pickier you can replace the character class with a set of alternates, which can be any length. Just separate the options via | – note that outside of the square brackets, a literal | needs a backslash – and put (?:...) around the whole thing: (?:/|\\|\||:|...) whatever, in place of the square-bracketed expressions up there.
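The alternation described above can be sketched as follows. This is an illustration under assumptions rather than a definitive pattern: SEP, PATTERN and find_cards are hypothetical names, and the list of separators is just the four mentioned so far (/, \, | and :).

```python
import re

# Exactly one separator: "/", "\", "|" or ":".
# Outside square brackets a literal "|" must be written as "\|".
SEP = r"(?:/|\\|\||:)"

PATTERN = re.compile(
    r"[0-9]{15,16}" + SEP + r"[0-9]{2,4}" + SEP + r"[0-9]{2,4}" + SEP + r"[0-9]{3,4}"
)

def find_cards(text):
    # The group is non-capturing, so findall returns the whole match.
    return PATTERN.findall(text)

print(find_cards("test 5555555555555555|03|22|284 test"))
print(find_cards("test 5555555555555555://03/22/284 test"))  # repeated separators are rejected
```

Since each alternative is a full sub-pattern, a multi-character delimiter could be added to SEP as another option, which a character class cannot express.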
I don't recommend assigning the result of the findall back to the original cc variable; for one thing, it's a list, not a string. (You can get the string with e.g. new_cc = match[0]).
Better to create a new variable so (1) you still have the original value in case you need it and (2) when you use the new value in later code, it's clear that it's different.
In fact, if you're going to the trouble of matching this pattern, you might as well go ahead and extract all the components of it at the same time. Just put (...) around the bits you want to keep, and they'll be put in a tuple as the result of that match:
import re
pat = re.compile(r'([0-9]{15,16})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{2,4})[:/|\\]+([0-9]{3,4})')
cc = 'test 5555555555555555/03/22/284 test'
match, = pat.findall(cc)
print(match)
Which outputs this:
('5555555555555555', '03', '22', '284')
Allow both separators in the regex so that it works with either form. For example, the following pattern accepts both "/" and "|" between the groups of digits:
import re
cc = 'test 5555555555555555/03/22/284 test'
#cc = 'test 5555555555555555|03|22|284 test'
cc = re.findall(r'[0-9]{15,16}[/|][0-9]{2,4}[/|][0-9]{2,4}[/|][0-9]{3,4}', cc)
print(cc)

Remove "function calls" with alphanumeric names and backslashes in a string with Python

I'm reading some strings from a file such as this one:
s = r"Ab [word] 123 \test[abc] hi \abc [] a \command123[there\hello[www]]!"
which should be transformed into
"Ab [word] 123 abc hi \abc [] a therewww!"
Another example is
s = r"\ human[[[rr] \[A] r \B[] r p\[]q \A[x\B[C]!"
which should be transformed into
"\ human[[[rr] A r r pq \A[xC!"
How can you generalize this to all similar "functions" with alphanumeric names? By "function" I mean a pattern such as \name[arg] where name is a (possibly empty) alphanumeric string and arg is a (possibly empty) arbitrary string.
Update: After reading kcsquared's comments, I looked through the input files and found stray brackets and backslashes, so I've updated my examples accordingly. The previous regex solution (see below) breaks completely for these special cases:
s = re.sub(r'\\command123\[([^}]*)\]', ' \\1', s)
s = re.sub(r'\test\[([^}]*)\]', ' \\1', s) # Fails if this substitution is executed first
s = " ".join(s.split())
Don't use a regex for this. Scan the string character by character and interpret each one as you go, using a list as a stack: push when a function opens and pop when it closes.
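The stack idea above could be sketched like this. It is one possible reading of the rules, not a definitive implementation: strip_calls is a hypothetical name, a ] is assumed to always close the innermost open call, calls that are never closed are emitted verbatim, and the final whitespace normalization is borrowed from the question's own attempt.

```python
def strip_calls(s):
    # Each frame is [prefix, parts]: the literal "\name[" that opened the call
    # and the pieces of argument text collected so far.
    # The bottom frame (empty prefix) collects the top-level output.
    stack = [["", []]]
    i, n = 0, len(s)
    while i < n:
        ch = s[i]
        if ch == "\\":
            # Read the (possibly empty) alphanumeric name after the backslash.
            j = i + 1
            while j < n and s[j].isalnum():
                j += 1
            if j < n and s[j] == "[":
                stack.append([s[i:j + 1], []])  # a call opens: push
                i = j + 1
            else:
                stack[-1][1].append(s[i:j])     # not a call: keep it literally
                i = j
        elif ch == "]" and len(stack) > 1:
            _, parts = stack.pop()              # a call closes: keep only its argument
            stack[-1][1].append("".join(parts))
            i += 1
        else:
            stack[-1][1].append(ch)             # ordinary character, stray brackets included
            i += 1
    # Calls that were never closed are restored exactly as written.
    while len(stack) > 1:
        prefix, parts = stack.pop()
        stack[-1][1].append(prefix + "".join(parts))
    return "".join(stack[0][1])

s = r"\ human[[[rr] \[A] r \B[] r p\[]q \A[x\B[C]!"
print(" ".join(strip_calls(s).split()))
```

Raw strings are used for the inputs so that the backslashes stay literal, as they would when the text is read from a file.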

get all the text between two newline characters(\n) of a raw_text using python regex

I have several examples of raw text from which I need to extract the characters after 'Terms'. The common pattern is that the word 'Terms' is followed by a '\n', and there is another '\n' at the end. I want to extract all the characters (words, numbers, symbols) between these two \n, but only after the keyword 'Terms'.
Some examples of text are given below:
1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n
The code I have written is given below:
def get_term_regex(s):
    raw_text = s
    term_regex1 = r'(TERMS\s*\\n(.*?)\\n)'
    try:
        if ('TERMS' or 'Terms') in raw_text:
            pattern1 = re.search(term_regex1,raw_text)
            #print(pattern1)
            return pattern1
    except:
        pass
But I am not getting any output, as there is no match.
The expected output is:
1) Direct deposit; Routing #256078514, acct. #160935
2) Due on receipt
3) NET 30 DAYS
Any help would be really appreciated.
Try the following:
import re
text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines
for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
    print(m.group(2))
Note that your regex could not deal with the third 'line' because there is a colon : after TERMS. So I replaced \s with \W.
('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is exactly the same as 'TERMS' in raw_text: when Python evaluates the parenthesized part, both 'TERMS' and 'Terms' are truthy, and or returns the first truthy operand, i.e., 'TERMS'. As a result, 'Terms' is never checked by that part!
So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
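To make the or behaviour concrete, and to show how the pieces of this answer might fit together, here is a small sketch. The helper name get_terms is hypothetical, and it assumes the \n in the input are real newline characters, as in the example above.

```python
import re

# "or" returns its first truthy operand, so the parenthesized part is just 'TERMS'.
print('TERMS' or 'Terms')                      # TERMS
print(('TERMS' or 'Terms') in '\nTerms\nDue')  # False

def get_terms(raw_text):
    # Return the line that follows a TERMS/Terms heading, or None if there is none.
    m = re.search(r'(?:TERMS|Terms)\W*\n(.*?)\n', raw_text)
    return m.group(1) if m else None

print(get_terms('\nTERMS: \nNET 30 DAYS\n'))   # NET 30 DAYS
```

Putting both spellings in the pattern makes the separate membership test unnecessary, since a failed search already signals that the keyword is absent.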

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, the operation I wanted has not been performed and the whole string is printed.
I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
The problem with your first regex is that . does not match newlines as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
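To see the effect of re.DOTALL in isolation, here is a small sketch that runs the same pattern with and without the flag on the sample text from the question. The variable names without_flag and with_flag are made up for this illustration.

```python
import re

string = " something\nother things\netc. */ extra text.\n"

# Without DOTALL, "." stops at the first newline, so "^.*\*/" cannot reach "*/".
without_flag = re.sub(r"^.*\*/", "", string)

# With DOTALL, "." also matches newlines, so everything up to "*/" is removed.
with_flag = re.sub(r"^.*\*/", "", string, flags=re.DOTALL)

print(without_flag == string)  # True
print(repr(with_flag))         # ' extra text.\n'
```

Note that .* is greedy, so if the text contained several */ markers this would remove everything up to the last one.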
Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub(r"[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.
To keep the text up to that symbol instead, you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.
Another option is to split on */ and rejoin everything after the first occurrence:
string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives output as
' extra text. \n'

How to extract function name python regex

Hello, I am trying to extract a function name in Python using a regex. I am new to Python and nothing seems to be working for me. For example, given the string "def myFunction(s): ....", I want to return just myFunction.
import re
def extractName(s):
    string = []
    regexp = re.compile(r"\s*(def)\s+\([^\)]*\)\s*{?\s*")
    for m in regexp.finditer(s):
        string += [m.group()]
    return string
Assumption: You want the name myFunction from "...def myFunction(s):..."
I find something missing in your regex and the way it is structured.
\s*(def)\s+\([^\)]*\)\s*{?\s*
Let's look at it step by step:
\s*: matches zero or more whitespace characters.
(def): matches the word def.
\s+: matches one or more whitespace characters.
\([^\)]*\): matches a parenthesized group, ( ... ), with no ) inside it.
\s*: matches zero or more whitespace characters.
What comes after that doesn't much matter if you only want the name of the function. The real problem is that nothing between \s+ and \( matches the name itself, so the regex never matches the thing you actually want.
You can try this regex if you are interested in doing it by regex:
\s*(def)\s([a-zA-Z]*)\([a-zA-Z]*\)
With the regex structured this way, group 0 is def myFunction(s), group 1 is def, and group 2 is myFunction. So you can use the following code to get your result:
import re
def extractName(s):
    string = ""
    regexp = re.compile(r"(def)\s([a-zA-Z]*)\([a-zA-Z]*\)")
    for m in regexp.finditer(s):
        string += m.group(2)
    return string
You can check your regex live by going on this site.
Hope it helps!
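The pattern above only accepts purely alphabetic names and a single alphabetic parameter. A slightly more permissive variant might look like the sketch below. It is an assumption-laden illustration rather than a full parser: DEF_NAME and extract_names are hypothetical names, and it does not handle async def or text inside strings and comments.

```python
import re

# A name is a letter or underscore followed by word characters.
# Only the opening parenthesis is required, so any parameter list is accepted.
DEF_NAME = re.compile(r"^\s*def\s+([A-Za-z_]\w*)\s*\(", re.MULTILINE)

def extract_names(source):
    # With a single capture group, findall returns just the names, one per definition.
    return DEF_NAME.findall(source)

print(extract_names("def myFunction(s): ...."))  # ['myFunction']
```

Returning a list rather than one concatenated string keeps the names separate when the input contains several function definitions.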
