Regex pattern in Python for special characters

I asked a similar question on here a few days ago and it was a great help! A new challenge I wanted to tackle is to further develop the regex pattern to look for specific formats in this iteration. I thought I had solved it, using regex101 to build and test the pattern, but when I applied it in Python I received 'pattern contains no capture groups'. Below is a test df, along with the code that was provided via Stack Overflow and worked with digits only (the original post also included an image of what the results should look like).
df = pd.DataFrame([
    ["{1} | | Had a Greeter welcome clients {1.0} | | Take measures to ensure a safe and organized distribution {1.000} | | Protected confidentiality of clients (on social media, pictures, in conversation, own congregation members receiving assistance, etc.)",
     "{1.00} | | Chairs for clients to sit in while waiting {1.0000} | | Take measures to ensure a safe and organized distribution"],
    ["{1 } | Financial literacy/budgeting {1 } | | Monetary/Bill Support {1} | | Mental Health Services/Counseling",
     "{1}| | Clothing Assistance {1 } | | Healthcare {1} | | Mental Health Services/Counseling {1} | | Spiritual Support {1} | | Job Skills Training"]
], columns=['CF1', 'CF2'])
Here is the iteration of the code that worked with digits only. I changed the search pattern to my new regex pattern and it did not work.
Original code: (df.stack().str.extractall('(\d+)')[0] .groupby(level=[0,1]).sum().unstack())
New Code (Failed to recognize pattern): (df.stack().str.extractall(r'(?<=\{)[\d+\.\ ]+(?=\})')[0].astype(int) .groupby(level=[0,1]).sum().unstack())
In the test df you will see that I only want to capture the numbers between the braces {}; there is a mixture of decimals and spaces following the numbers I want to capture and sum. The new pattern did not work in application, so any help would be great!

Your (?<=\{)[\d+\.\ ]+(?=\}) regex contains no capturing groups while Series.str.extractall requires at least one capturing group to output a value.
You need to use
(df.stack().str.extractall(r'\{\s*(\d+(?:\.\d+)?)\s*}')[0].astype(float) .groupby(level=[0,1]).sum().unstack())
Output:
   CF1  CF2
0  3.0  2.0
1  3.0  5.0
The \{\s*(\d+(?:\.\d+)?)\s*} regex matches
\{ - a { char
\s* - zero or more whitespaces
(\d+(?:\.\d+)?) - Group 1 (note: this group's captured value is what the extractall method outputs, which is why at least one capturing group is required): one or more digits, then an optional occurrence of a . and one or more digits
\s* - zero or more whitespaces
} - a } char.
See the regex demo.
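As a quick standalone check of what the pattern captures and sums (the sample string is the row-1 CF1 value from the test df above):
import re

s = "{1 } | Financial literacy/budgeting {1 } | | Monetary/Bill Support {1} | | Mental Health Services/Counseling"
nums = re.findall(r'\{\s*(\d+(?:\.\d+)?)\s*}', s)
print(nums)                   # ['1', '1', '1']
print(sum(map(float, nums)))  # 3.0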

You can use '\{([\d.]+)\}':
(df.stack().str.extractall(r'\{([\d.]+)\}')[0]
.astype(float).groupby(level=[0,1]).sum().unstack())
output:
   CF1  CF2
0  3.0  2.0
1  1.0  4.0
as int only:
(df.stack().str.extractall(r'\{(\d+)(?:\.\d+)?\}')[0]
.astype(int).groupby(level=[0,1]).sum().unstack())
output:
   CF1  CF2
0    3    2
1    1    4

Related

Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'
Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical, which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-'])
The resulting alignments - pretty close already:
>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
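As a rough sketch of that idea (this assumes pairwise2's callback-scoring variant globalcx and uses fuzzywuzzy's fuzz.ratio as the word-level similarity; treat it as a starting point rather than tuned settings):
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
from fuzzywuzzy import fuzz

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

def word_score(a, b):
    # Fuzzy similarity of two words, scaled to 0..1.
    return fuzz.ratio(a, b) / 100.0

alignments = pairwise2.align.globalcx(string1.split(),
                                      string2.split(),
                                      word_score,
                                      gap_char=['-'])
print(format_alignment(*alignments[0]))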
Previous answers offer biology-based alignment methods; there are NLP-based alignment methods as well. The most standard is the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular, it is possible to mix alignment at the level of words and characters, as well as different measures (e.g. SoftTFIDF, see this answer).
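For instance, here is a minimal word-level sketch using only the standard library's difflib (a Ratcliff/Obershelp matcher rather than true Levenshtein, but it illustrates the word-level idea and already produces the kind of mapping asked for above):
from difflib import SequenceMatcher

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

words1, words2 = string1.split(), string2.split()
alignment = {}
for tag, i1, i2, j1, j2 in SequenceMatcher(None, words1, words2).get_opcodes():
    if tag == 'equal':
        # Identical runs of words map one-to-one.
        for a, b in zip(words1[i1:i2], words2[j1:j2]):
            alignment[a] = b
    else:
        # Replaced/deleted runs: map every word to the whole opposing span (or a gap).
        target = ' '.join(words2[j1:j2]) or '<GAP>'
        for a in words1[i1:i2]:
            alignment[a] = target

for word, target in alignment.items():
    print(f"alignment[{word!r}] = {target!r}")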
The Needleman-Wunsch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT:   A A A C C C T T A G
       + + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match the word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The following two alignments have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments separately instead of computing them only once.
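For completeness, here is a minimal word-level Needleman-Wunsch sketch in plain Python (the scoring of +1 match / -1 mismatch / -1 gap is an arbitrary choice for illustration; a real implementation would plug in a proper word-similarity score):
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_char='<GAP>'):
    # score[i][j] = best score for aligning the first i words of a with the first j words of b.
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append(gap_char); i -= 1
        else:
            top.append(gap_char); bottom.append(b[j - 1]); j -= 1
    return top[::-1], bottom[::-1]

# Aligning the two example sentences from the table above.
top, bottom = needleman_wunsch(
    "The reason that I went to the store was to buy some food".split(),
    "reason I went 2 te store wuz 2 buy fud".split())
print(top)
print(bottom)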

Is there a way to add a new line after every ']' in Python?

I originally had a string containing BBCode that I wanted to reformat so it would be readable.
I had something like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG][b]\\u25ba osu! Mapping Theory[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]Linear Momentum[\\/url] | [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]Linear Momentum 2[\\/url] | [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]Angular Momentum and Circular Flow[\\/url] | [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]Active and Passive Mapping[\\/url]\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]Slider Flow[\\/url] | [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]Stream Flow[\\/url] | [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]Slider Mechanics[\\/url] | [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]Aesthetics by Symmetry[\\/url] | [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]Aesthetics by Complexity[\\/url] | [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]Defining Flow[\\/url]\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]Flow and Aesthetics[\\/url] | [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]Angle Emphasis[\\/url] | [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]Strain[\\/url] | [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]Pressure[\\/url] | [url=https:\\/\\/youtu.be\\/jm43HilQhYk]Tension[\\/url] | [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]Song Choice[\\/url] | [url=https:\\/\\/youtu.be\\/BNjVu8xq4os]Song Length[\\/url]\\n\\n[url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG6t5MCwGnq87iYZnE5G7aZL][b]\\u25ba osu! Rambling[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/-Beeh7dKyTk]Storyboards[\\/url] | [url=https:\\/\\/youtu.be\\/i6zzHMzwIzU]Why[\\/url]\\n\\n[url=https:\\/\\/youtu.be\\/_sBP7ttRQog]0 BPM[\\/url] | [url=https:\\/\\/youtu.be\\/UgtR6WnuTT8]ppv2 Pt.1[\\/url] | [url=https:\\/\\/youtu.be\\/Bx14u5tltyE]ppv2 Pt.2[\\/url] | [url=https:\\/\\/youtu.be\\/-095yuSLE4Y]Super high star rating[\\/url][\\/notice][url=https:\\/\\/amo.s-ul.eu\\/oApvJHWA][b]Skin v3.4[\\/b][\\/url]\\n[size=85]Personal edit of [url=https:\\/\\/osu.ppy.sh\\/forum\\/t\\/481314]Re:m skin by Choilicious[\\/url][\\/size]\\n\\n[img]http:\\/\\/puu.sh\\/qqv6C\\/0aaca52f51.jpg[\\/img][url=https:\\/\\/osu.ppy.sh\\/u\\/Satellite][img]http:\\/\\/puu.sh\\/qqv6K\\/94681bed3f.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Sellenite][img]http:\\/\\/puu.sh\\/qqv6T\\/c943ed1703.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Morinaga][img]http:\\/\\/puu.sh\\/qqv70\\/cfbdb2a242.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/-Mo-][img]http:\\/\\/puu.sh\\/qqv77\\/ca489f2d00.jpg[\\/img][\\/url]\\n[notice]I don\'t really do nomination stuff often anymore. \\nHowever, please do show me your map if it\'s any of the following:[box=][b]Bounty[\\/b]\\n[size=50]High priority modding for these artists\\/songs (maybe a GD, just ask).\\nPreferably non-cut versions and songs that have no ranked maps yet.[\\/size]\\n\\nYuuhei Satellite\\nYuuhei Catharsis\\nShoujo Fractal\\nHoneyWorks (non-vocaloid)\\nTrySail\\nClariS\\n\\nClariS - CLICK (Asterisk DnB Remix), [size=85]either version.[\\/size]\\nfhana - Outside of Melancholy, [size=85]a version that isn\'t cut pls[\\/size]\\nAny cover of \\u7832\\u96f7\\u6483\\u6226\\u3001\\u59cb\\u3081![\\/box]I also do storyboard checks for any map.\\n\\nPMs are open for anything. Ask me anything. 
\\nAsk me what my favourite colour is if you really want even.[\\/notice][box=Guests][b]Ranked[\\/b]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1575100][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Terasareru kurai no Shiawase [Lunatic][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1794557][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Arehateta Chijou no Uta [Collab Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1592915][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] Tanaka Hirokazu - C-TYPE [TetriS-TYPE] [S-TYPE][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1490130][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] TrySail - adrenaline!!! [Insane][\\/url] [size=85]Slightly ruined version.[\\/size]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1401096][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Shunkan Everlasting [Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/795269][img]
Basically unreadable currently.
I tried making it look like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG]
[b]
\\u25ba osu! Mapping Theory[\\/b]
[\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]
Linear Momentum[\\/url]
| [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]
Linear Momentum 2[\\/url]
| [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]
Angular Momentum and Circular Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]
Active and Passive Mapping[\\/url]
\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]
Slider Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]
Stream Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]
Slider Mechanics[\\/url]
| [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]
Aesthetics by Symmetry[\\/url]
| [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]
Aesthetics by Complexity[\\/url]
| [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]
Defining Flow[\\/url]
\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]
Flow and Aesthetics[\\/url]
| [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]
Angle Emphasis[\\/url]
| [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]
Strain[\\/url]
| [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]
Pressure[\\/url]
| [url=https:\\/\\/youtu.be\\/jm43HilQhYk]
Tension[\\/url]
| [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]
Song Choice[\\/url]
|
Where there's a new line after every ']'
I've tried tweaking re.sub("[\(\[].*?[\)\]]", "", str(x)) to fit what I need but it just deletes everything inside of them. (I have no idea how regex works)
How can I go about this?
There's no need for a regular expression; just use the simple str.replace() function.
x = x.replace(']', ']\n')
It really depends on exactly what you want your output to look like.
I interpreted your output as wanting a newline after each url= tag, which would require the following regex:
output = re.sub(r"(\[url.*?\])", r"\1\n", input)
The brackets () form a capture group, which is then used in the replacement as \1 since it's the first unnamed capture group.
You can change the regex to your will but just keep the stuff you want to keep within the capture group.
If you want to experiment with regex you can use https://regexr.com/ which is an amazing resource when fiddling around with regex.
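For illustration, a minimal sketch of both approaches on a short made-up BBCode snippet (the URLs here are just placeholders, not taken from the string above):
import re

# A short, made-up stand-in for the long BBCode string in the question.
snippet = "[url=https://example.com/a][b]Title[/b][/url] | [url=https://example.com/b]Other[/url]"

# 1) Plain string replace: a newline after every ']'.
print(snippet.replace(']', ']\n'))

# 2) Capture-group approach: a newline only after each [url=...] tag.
print(re.sub(r"(\[url.*?\])", r"\1\n", snippet))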

Regex to extract unique string to new column, getting error "look-behind requires fixed-width pattern"

I need help extracting unique strings into a separate column.
df = pd.DataFrame({'File Name': ['90.12.21 / 02.05 / XO3 File Name Type',
                                 '10.22.43 / X.89 / XO20G9992 Document Internal Only',
                                 'Phase 3',
                                 '22.32.42.12 / 99.23 / XO2 Location Site 3: Park Triangle',
                                 '38.23.99.22 / X.23 / XO28W9998 Block 4 Beach/Dock Camp',
                                 '39.24.32.49 / 37.29 / Blue-print/Register Info Site (RISs)',
                                 '23.21.53.32 / Q.21 / XO R9924 Location Place 5: Drive Place (Active)',
                                 ' 33.51.63.33 / X.21 / XO20W8812 Area Place 1: Beach Drive']})
Here's what the dataframe currently looks like:
| File Name |
|----------------------------------------------------------------------|
| 90.12.21 / 02.05 / XO3 File Name Type |
| 10.22.43 / X.89 / XO20G9992 Document Internal Only |
| Phase 3 |
| 22.32.42.12 / 99.23 / XO2 Location Site 3: Park Triangle |
| 38.23.99.22 / X.23 / XO28W9998 Block 4 Beach/Dock Camp |
| 39.24.32.49 / 37.29 / Blue-print/Register Info Site (RISs) |
| 23.21.53.32 / Q.21 / XO R9924 Location Place 5: Drive Place (Active) |
| 33.51.63.33 / X.21 / XO20W8812 Area Place 1: Beach Drive |
Here's what I need it to look like:
| File Name |
|----------------------------------------|
| File Name Type |
| Document Internal Only |
| |
| Location Site 3: Park Triangle |
| Block 4 Beach/Dock Camp |
| Blue-print/Register Info Site (RISs) |
| Location Place 5: Drive Place (Active) |
| Area Place 1: Beach Drive |
Here's my attempted solution:
I know that str.extract(r'') will extract a regex match into a new column. I also know that in regex, a "positive lookbehind" will let me select everything I want from the end of the string. So I created a positive lookbehind regex expression that captures most of the strings I want: https://regexr.com/4t4ll. It's still not a perfect solution.
But even when I try extracting my selections using this line of code: df['File Name'].str.extract(r'((?<=\/ XO\d |XO\d[0-9]\w\d\d\d\d | XO \w\d\d\d\d ).*)'), I get an error message: "look-behind requires fixed-width pattern."
I need help figuring out how to make my regex expression work in str.extract(r'') and how to make it capture all the strings that appear at the end of each entry.
Python's re lookbehinds must be fixed-width, and the alternatives inside your lookbehind have different lengths, which is what triggers that error. You can avoid the lookbehind altogether and use
.*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)
See the regex demo.
Details
.* - 0+ chars other than line break chars, as many as possible
\s - a whitespace
/ - a / char
(?:\s+XO[A-Z0-9\s]*\b)? - an optional pattern:
\s+ - 1+ whitespaces
XO - XO
[A-Z0-9\s]* - 0+ uppercase letters, digits or whitespaces followed with
\b - a word boundary
\s+ - 1+ whitespaces
(.+) - Group 1 (what str.extract will return): any 1+ chars other than line break chars, as many as possible
In Pandas, use
df['Result'] = df['File Name'].str.extract(r'.*\s/(?:\s+XO[A-Z0-9\s]*\b)?\s+(.+)', expand=False).fillna('')
Result:
Result
0 File Name Type
1 Document Internal Only
2
3 Location Site 3: Park Triangle
4 Block 4 Beach/Dock Camp
5 Blue-print/Register Info Site (RISs)
6 Location Place 5: Drive Place (Active)
7 Area Place 1: Beach Drive

Search strings with python

I would like to know how I can search for specific strings with Python. I opened a markdown file which contains a table like the one below:
| --------- | -------- | --------- |
|**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. |
|**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|
|**unscrewed**| - | -Slowly and very carefully he unscrewed the ink bottle, dipped his quill into it, and began to write,|
|**downtrodden**| - | -For years, Aunt Petunia and Uncle Vernon had hoped that if they kept Harry as downtrodden as possible, they would be able to squash the magic out of him.|
|**sheets,**| - | -As long as he didn’t leave spots of ink on the sheets, the Dursleys need never know that he was studying magic by night.|
|**flinch**| - | -But he hoped she’d be back soon — she was the only living creature in this house who didn’t flinch at the sight of him.|
And I have to get the strings in each line that are decorated with |** **|, like:
propped
Pointless
unscrewed
downtrodden
sheets
flinch
I tried to use a regular expression but failed to extract them.
import re
y = r'(?<=\|\*{2}).+?(?=,{0,1}\*{2}\|)'
reg = re.compile(y)
a = '| --------- | -------- | --------- | |**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. | |**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|'
reg.findall(a)
Regex(y) above explained:
(?<=\|\*{2}) - Matches if the current position in the string is preceded by a match for \|\*{2} i.e. |**
.+? - Will try to find anything (except for a newline) repeated 1 or more times. Adding ? after the quantifier makes it perform the match in a non-greedy or minimal fashion; as few characters as possible will be matched.
(?=,{0,1}\*{2}\|) - a lookahead; it asserts that the match is immediately followed by ,{0,1}\*{2}\|, i.e. zero or one ,, then two * and a closing |.
Try using the following regex :
(?<=\|)(?!\s).*?(?!\s)(?=\|)
see demo / explanation
If the asterisks are in the text you are searching and you do not want the comma after sheets, the pattern would be a pipe, followed by two asterisks, then anything that is not an asterisk or a comma.
\|\*{2}([^*,]+)
If you can live with the comma or if there might be commas you want to catch
\|\*{2}([^*]+)
Use either pattern with re.findall or re.finditer to capture the text you want.
If using the second pattern, you would need to run through the groups and strip any unwanted commas.
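For example, a quick sketch of the first pattern with re.findall (the rows are shortened versions of the table rows above):
import re

# Two shortened rows from the table above (the sentence text is truncated here).
rows = [
    "|**propped**| - | -a flashlight in one hand and a large leather-bound book |",
    "|**sheets,**| - | -As long as he didn't leave spots of ink on the sheets |",
]

pattern = re.compile(r"\|\*{2}([^*,]+)")  # the first pattern: also drops the trailing comma
for row in rows:
    print(pattern.findall(row))           # ['propped'] then ['sheets']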
I wrote the program below to achieve the required output. I created a file string_test into which I copied all the raw strings:
import re

a = re.compile(r"^\|\*\*([^*,]+)")
with open("string_test", "r") as file1:
    for i in file1.readlines():
        match = a.search(i)
        if match:
            print(match.group(1))

PCRE Regex (*COMMIT) equivalent for Python

The below pattern took me a long time to find. When I finally found it, it turns out that it doesn't work in Python. Does anyone know if there is an alternative?
(*COMMIT) Defined: Causes the whole match to fail outright if the rest of the pattern does not match.
(*FAIL) doesn't work in Python either. But this can be replaced by (?!).
Pattern: ^.*park(*COMMIT)(*FAIL)|dog

+---------------------------------------+---------+
| Subject                               | Matches |
+---------------------------------------+---------+
| The dog and the man play in the park. | FALSE   |
| Man I love that dog!                  | TRUE    |
| I'm dog tired                         | TRUE    |
| The dog park is no place for man.     | FALSE   |
| park next to this dog's man.          | FALSE   |
+---------------------------------------+---------+
Example taken from:
regex match substring unless another substring matches
This might not be a generic replacement, but for your case you can work with lookaheads to assert that dog is matched but park is not: ^(?=.*dog)(?!.*park).*$
Your samples on regex101
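As a quick check, the lookahead pattern run over the sample subjects from the table above:
import re

pattern = re.compile(r'^(?=.*dog)(?!.*park).*$')
subjects = [
    "The dog and the man play in the park.",
    "Man I love that dog!",
    "I'm dog tired",
    "The dog park is no place for man.",
    "park next to this dog's man.",
]
for s in subjects:
    # True only when 'dog' occurs and 'park' does not, matching the table above.
    print(bool(pattern.match(s)), s)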
