I'm looking to find out if there is a way of automating this process. Basically, I have 300,000 rows of data that I need to download on a daily basis. There are a couple of rows that need to be edited before the data can be uploaded to SQL.
Jordan || Michael | 23 | Bulls | Chicago
Bryant | Kobe ||| 8 || LA
What I want to accomplish is to have just 4 vertical bars per row. Normally, I would search for a keyword, edit it manually, then save. These two are the only anomalies in my data.
Find "Jordan", then remove the one excess vertical bar "|" right after it.
Find "Kobe", then remove the two excess vertical bars "|" right after it.
Correct format is below -
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA
Not sure if this can be done in vbscript or Python.
Any help would be appreciated. Thanks!
Python or VBScript could be used, but they are overkill for something this simple. Try sed:
$ sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA
To save to a new file:
sed -E 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file >newfile
Or, to change the existing file in-place:
sed -Ei.bak 's/(Jordan *)\|/\1/g; s/(Kobe *)\| *\|/\1/g' file
How it works
sed reads and processes a file line by line. In our case, we need only the substitute command which has the form s/old/new/g where old is a regular expression and, if it is found, it is replaced by new. The optional g at the end of the command tells sed to perform the substitution command 'globally', meaning not just once but as many times as it appears on the line.
s/(Jordan *)\|/\1/g
This tells sed to look for Jordan followed by zero or more spaces followed by a vertical bar and remove the vertical bar.
In more detail, the parens in (Jordan *) tell sed to save the string Jordan followed by zero or more spaces as a group. In the replacement side, we reference that group as \1.
s/(Kobe *)\| *\|/\1/g
Similarly, this tells sed to look for Kobe followed by zero or more spaces, then a vertical bar, optional spaces, and a second vertical bar, and to remove the two vertical bars.
Using Python
Using the same logic as above, here is a Python program:
$ cat kobe.py
import re
with open('file') as f:
    for line in f:
        line = re.sub(r'(Jordan *)\|', r'\1', line)
        line = re.sub(r'(Kobe *)\| *\|', r'\1', line)
        print(line.rstrip('\n'))
$ python kobe.py
Jordan | Michael | 23 | Bulls | Chicago
Bryant | Kobe | 8 || LA
To save that to a new file:
python kobe.py >newfile
I wrote a code snippet in Python 3.5 as follows.
# -*- coding: utf-8 -*-
rows = ["Jordan||Michael|23|Bulls|Chicago",
"Bryant|Kobe|||8||LA"]
keywords = ["Jordan", "Kobe"]
def get_keyword(row, keywords):
for word in keywords:
if word in row:
return word
else:
return None
for line in rows:
num_bars = line.count('|')
num_bars_del = num_bars - 4 # Number of bars to be deleted
kw = get_keyword(line, keywords)
if kw: # this line contains a keyword
# Split the line by the keyword
first, second = line.split(kw)
second = second.lstrip()
result = "%s%s%s"%(first, kw, second[num_bars_del:])
print(result)
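If the rows live in a file rather than a hard-coded list, the same helpers can be wrapped around a read/write loop. This is only a rough sketch mirroring the snippet above; 'input.txt' and 'output.txt' are placeholder names, and it reuses get_keyword and keywords as defined there:
# Sketch only: 'input.txt' and 'output.txt' are placeholder file names.
with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        line = line.rstrip('\n')
        excess = line.count('|') - 4          # bars beyond the expected 4
        kw = get_keyword(line, keywords)
        if kw and excess > 0:                 # only touch the known anomalies
            first, second = line.split(kw)
            line = "%s%s%s" % (first, kw, second.lstrip()[excess:])
        dst.write(line + '\n')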
Related
I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'
Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-'])
The resulting alignments - pretty close already:
>>> print(format_alignment(*alignments[0]))
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
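For example, one way to plug in a fuzzier matcher (a rough sketch, untested here; it assumes pairwise2's callback variant globalcx and uses the standard library's difflib in place of fuzzywuzzy):
import difflib
from Bio import pairwise2

def fuzzy_match(a, b):
    # Similarity in [0, 1]; identical words score 1.0, unrelated words near 0.
    return difflib.SequenceMatcher(None, a, b).ratio()

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

# globalcx: the match score comes from a callback, with no gap penalties.
alignments = pairwise2.align.globalcx(string1.split(),
                                      string2.split(),
                                      fuzzy_match,
                                      gap_char=['-'])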
Previous answers offer biology-based alignment methods; there are NLP-based alignment methods as well. The most standard would be the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular, it's possible to mix alignment at the level of words and characters, as well as different measures (e.g. SoftTFIDF, see this answer).
The Needleman-Wunsch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT:   A A A C C C T T A G
       + + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match the missing word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then the order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The two alignments above have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments twice instead of computing them only once.
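For reference, here is a minimal sketch of the classic Needleman-Wunsch dynamic program over word lists (not any particular library's code; the match, mismatch, and gap scores are arbitrary illustrative values):
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_token="<GAP>"):
    # Globally align two word lists; gaps are filled with gap_token.
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align the two words
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(gap_token); i -= 1
        else:
            out_a.append(gap_token); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

aligned1, aligned2 = needleman_wunsch(
    "The reason that I went to the store".split(),
    "reason I went 2 te store".split())
print(list(zip(aligned1, aligned2)))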
I asked a similar question a few days ago on here and it was a great help! A new challenge I wanted to build on is to further develop the regex pattern to look for specific formats in this iteration. I thought I had solved it by using regex101 to build/test the regex, but when applied in Python I received a 'pattern contains no capture groups' error. Below is a test df, an image of what the results should look like, and the code that was provided via StackOverflow that worked with digits only.
df = pd.DataFrame([["{1} | | Had a Greeter welcome clients {1.0} | | Take measures to ensure a safe and organized distribution {1.000} | | Protected confidentiality of clients (on social media, pictures, in conversation, own congregation members receiving assistance, etc.)",
"{1.00} | | Chairs for clients to sit in while waiting {1.0000} | | Take measures to ensure a safe and organized distribution"],
["{1 } | Financial literacy/budgeting {1 } | | Monetary/Bill Support {1} | | Mental Health Services/Counseling",
"{1}| | Clothing Assistance {1 } | | Healthcare {1} | | Mental Health Services/Counseling {1} | | Spiritual Support {1} | | Job Skills Training"]
] , columns = ['CF1', 'CF2'])
Here is the iteration code that worked with digits only. I changed the pattern in the search to my new regex pattern and it did not work.
Original code: (df.stack().str.extractall('(\d+)')[0] .groupby(level=[0,1]).sum().unstack())
New Code (Failed to recognize pattern): (df.stack().str.extractall(r'(?<=\{)[\d+\.\ ]+(?=\})')[0].astype(int) .groupby(level=[0,1]).sum().unstack())
**In the test df you will see I want to only capture the numbers between "{}", and there's a mixture of decimals and spaces following the number I want to capture and sum. The new pattern did not work in application, so any help would be great!**
Your (?<=\{)[\d+\.\ ]+(?=\}) regex contains no capturing groups while Series.str.extractall requires at least one capturing group to output a value.
You need to use
(df.stack().str.extractall(r'\{\s*(\d+(?:\.\d+)?)\s*}')[0].astype(float) .groupby(level=[0,1]).sum().unstack())
Output:
CF1 CF2
0 3.0 2.0
1 3.0 5.0
The \{\s*(\d+(?:\.\d+)?)\s*} regex matches
\{ - a { char
\s* - zero or more whitespaces
(\d+(?:\.\d+)?) - Group 1 (note: this group's captured value will be the output of the extractall method, which requires at least one capturing group): one or more digits, then an optional occurrence of a . and one or more digits
\s* - zero or more whitespaces
} - a } char.
See the regex demo.
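As a quick sanity check outside pandas, plain re.findall with the same pattern pulls just the bracketed numbers out of one of the cells from the test df above:
import re

pattern = r'\{\s*(\d+(?:\.\d+)?)\s*}'
cell = "{1 } | Financial literacy/budgeting {1 } | | Monetary/Bill Support {1} | | Mental Health Services/Counseling"
print(re.findall(pattern, cell))                    # ['1', '1', '1']
print(sum(map(float, re.findall(pattern, cell))))   # 3.0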
You can use '\{([\d.]+)\}':
(df.stack().str.extractall(r'\{([\d.]+)\}')[0]
.astype(float).groupby(level=[0,1]).sum().unstack())
output:
CF1 CF2
0 3.0 2.0
1 1.0 4.0
as int only:
(df.stack().str.extractall(r'\{(\d+)(?:\.\d+)?\}')[0]
.astype(int).groupby(level=[0,1]).sum().unstack())
output:
CF1 CF2
0 3 2
1 1 4
I originally had a string containing BBCode that I wanted to format better so it would be readable.
I had something like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG][b]\\u25ba osu! Mapping Theory[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]Linear Momentum[\\/url] | [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]Linear Momentum 2[\\/url] | [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]Angular Momentum and Circular Flow[\\/url] | [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]Active and Passive Mapping[\\/url]\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]Slider Flow[\\/url] | [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]Stream Flow[\\/url] | [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]Slider Mechanics[\\/url] | [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]Aesthetics by Symmetry[\\/url] | [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]Aesthetics by Complexity[\\/url] | [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]Defining Flow[\\/url]\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]Flow and Aesthetics[\\/url] | [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]Angle Emphasis[\\/url] | [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]Strain[\\/url] | [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]Pressure[\\/url] | [url=https:\\/\\/youtu.be\\/jm43HilQhYk]Tension[\\/url] | [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]Song Choice[\\/url] | [url=https:\\/\\/youtu.be\\/BNjVu8xq4os]Song Length[\\/url]\\n\\n[url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG6t5MCwGnq87iYZnE5G7aZL][b]\\u25ba osu! Rambling[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/-Beeh7dKyTk]Storyboards[\\/url] | [url=https:\\/\\/youtu.be\\/i6zzHMzwIzU]Why[\\/url]\\n\\n[url=https:\\/\\/youtu.be\\/_sBP7ttRQog]0 BPM[\\/url] | [url=https:\\/\\/youtu.be\\/UgtR6WnuTT8]ppv2 Pt.1[\\/url] | [url=https:\\/\\/youtu.be\\/Bx14u5tltyE]ppv2 Pt.2[\\/url] | [url=https:\\/\\/youtu.be\\/-095yuSLE4Y]Super high star rating[\\/url][\\/notice][url=https:\\/\\/amo.s-ul.eu\\/oApvJHWA][b]Skin v3.4[\\/b][\\/url]\\n[size=85]Personal edit of [url=https:\\/\\/osu.ppy.sh\\/forum\\/t\\/481314]Re:m skin by Choilicious[\\/url][\\/size]\\n\\n[img]http:\\/\\/puu.sh\\/qqv6C\\/0aaca52f51.jpg[\\/img][url=https:\\/\\/osu.ppy.sh\\/u\\/Satellite][img]http:\\/\\/puu.sh\\/qqv6K\\/94681bed3f.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Sellenite][img]http:\\/\\/puu.sh\\/qqv6T\\/c943ed1703.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Morinaga][img]http:\\/\\/puu.sh\\/qqv70\\/cfbdb2a242.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/-Mo-][img]http:\\/\\/puu.sh\\/qqv77\\/ca489f2d00.jpg[\\/img][\\/url]\\n[notice]I don\'t really do nomination stuff often anymore. \\nHowever, please do show me your map if it\'s any of the following:[box=][b]Bounty[\\/b]\\n[size=50]High priority modding for these artists\\/songs (maybe a GD, just ask).\\nPreferably non-cut versions and songs that have no ranked maps yet.[\\/size]\\n\\nYuuhei Satellite\\nYuuhei Catharsis\\nShoujo Fractal\\nHoneyWorks (non-vocaloid)\\nTrySail\\nClariS\\n\\nClariS - CLICK (Asterisk DnB Remix), [size=85]either version.[\\/size]\\nfhana - Outside of Melancholy, [size=85]a version that isn\'t cut pls[\\/size]\\nAny cover of \\u7832\\u96f7\\u6483\\u6226\\u3001\\u59cb\\u3081![\\/box]I also do storyboard checks for any map.\\n\\nPMs are open for anything. Ask me anything. 
\\nAsk me what my favourite colour is if you really want even.[\\/notice][box=Guests][b]Ranked[\\/b]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1575100][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Terasareru kurai no Shiawase [Lunatic][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1794557][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Arehateta Chijou no Uta [Collab Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1592915][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] Tanaka Hirokazu - C-TYPE [TetriS-TYPE] [S-TYPE][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1490130][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] TrySail - adrenaline!!! [Insane][\\/url] [size=85]Slightly ruined version.[\\/size]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1401096][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Shunkan Everlasting [Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/795269][img]
Basically unreadable currently.
I tried making it look like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG]
[b]
\\u25ba osu! Mapping Theory[\\/b]
[\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]
Linear Momentum[\\/url]
| [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]
Linear Momentum 2[\\/url]
| [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]
Angular Momentum and Circular Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]
Active and Passive Mapping[\\/url]
\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]
Slider Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]
Stream Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]
Slider Mechanics[\\/url]
| [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]
Aesthetics by Symmetry[\\/url]
| [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]
Aesthetics by Complexity[\\/url]
| [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]
Defining Flow[\\/url]
\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]
Flow and Aesthetics[\\/url]
| [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]
Angle Emphasis[\\/url]
| [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]
Strain[\\/url]
| [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]
Pressure[\\/url]
| [url=https:\\/\\/youtu.be\\/jm43HilQhYk]
Tension[\\/url]
| [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]
Song Choice[\\/url]
|
Where there's a new line after every ']'
I've tried tweaking re.sub("[\(\[].*?[\)\]]", "", str(x)) to fit what I need but it just deletes everything inside of them. (I have no idea how regex works)
How can I go about this?
There's no need for a regular expression, just use the simple str.replace() function.
x = x.replace(']', ']\n')
It really depends on exactly what you want your output to look like.
I interpreted your output as wanting a newline after each url= tag, which would require the following regex:
output = re.sub(r"(\[url.*?\])", r"\1\n", input)
The parentheses () form a capture group, which is then referenced in the replacement as \1 since it's the first unnamed capture group.
You can change the regex to your will but just keep the stuff you want to keep within the capture group.
If you want to experiment with regex you can use https://regexr.com/ which is an amazing resource when fiddling around with regex.
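To illustrate on a short fragment trimmed (and unescaped) from the string above, treating it as toy input:
import re

fragment = ("[url=https://youtu.be/0uGeZzyobSY]Linear Momentum[/url] | "
            "[url=https://youtu.be/zOzi8Q655vs]Linear Momentum 2[/url]")
print(re.sub(r"(\[url.*?\])", r"\1\n", fragment))
# [url=https://youtu.be/0uGeZzyobSY]
# Linear Momentum[/url] | [url=https://youtu.be/zOzi8Q655vs]
# Linear Momentum 2[/url]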
I want to extract the length of a dress from a pandas dataframe. A row of that dataframe looks like this:
A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4
As you can see, the size is contained between About and shoulder, but in some cases shoulder is replaced by waist, hem, etc. Below is my Python script that finds the length, but it fails when, let's say, there is a comma after About, since I am slicing the matched string.
import re
def regexfinder(string_var):
    res = ''
    x = re.search(r"(?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips])", string_var).group(0)
    tohave = int(x[1:3])
    if tohave >= 16 and tohave <= 36:
        res = "Mini"
        return res
    if tohave > 36 and tohave < 40:
        res = "Above the Knee"
        return res
    if tohave >= 40 and tohave <= 46:
        res = "Knee length"
        return res
    if tohave > 46 and tohave < 49:
        res = "Mid/Tea length"
        return res
    if tohave >= 49 and tohave <= 59:
        res = "Long/Maxi length"
        return res
    if tohave > 59:
        res = "Floor Length"
        return res
Your regex (?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips]) uses a character class for the words shoulder,waist,hem,bust,neck,bust,top,hips.
I think you want to put them in a non-capturing group using alternation |.
Try it like this, with an optional comma ,?:
(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))
The size is in the first capturing group.
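A quick check with re.search, on a row shortened here to keep only the relevant pieces:
import re

pattern = r'(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))'
row = 'A-line dress with darting at front and back | About 23" from shoulder to hem | Imported'
m = re.search(pattern, row)
print(m.group(1))  # 23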
import re
s = """A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4"""
q = """'Velvet dress featuring mesh front, back and sleeves | Crewneck | Long bell sleeves | Self-tie closure at back cutout | About, 31" from shoulder to hem | Viscose/nylon | Hand wash | Imported | Model shown is 5\'10" (177cm) wearing a size Small.'1"""
def getSize(stringVal, strtoCheck):
    for i in stringVal.split("|"):               # Split string by "|"
        if i.strip().startswith(strtoCheck):     # Check if the segment starts with "About"
            val = i.strip()
            return re.findall(r"\d+", val)[0]    # Extract the first int
print(getSize(s, "About"))
print(getSize(q, "About"))
Output:
23
31
I would like to know how I can search for specific strings with Python. I opened a markdown file which contains a sheet like the one below:
| --------- | -------- | --------- |
|**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. |
|**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|
|**unscrewed**| - | -Slowly and very carefully he unscrewed the ink bottle, dipped his quill into it, and began to write,|
|**downtrodden**| - | -For years, Aunt Petunia and Uncle Vernon had hoped that if they kept Harry as downtrodden as possible, they would be able to squash the magic out of him.|
|**sheets,**| - | -As long as he didn’t leave spots of ink on the sheets, the Dursleys need never know that he was studying magic by night.|
|**flinch**| - | -But he hoped she’d be back soon — she was the only living creature in this house who didn’t flinch at the sight of him.|
And I have to get the strings from each line that are decorated with |** **|, like:
propped
Pointless
unscrewed
downtrodden
sheets
flinch
I tried to use the regular expression but failed to extract it.
import re
y = '(?<=\|\*{2}).+?(?=,{0,1}\*{2}\|)'
reg = re.compile(y)
a = '| --------- | -------- | --------- | |**propped**| - | -a flashlight in one hand and a large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow. | |**Pointless**| - | -“Witch Burning in the Fourteenth Century Was Completely Pointless — discuss.”|'
reg.findall(a)
Regex (y) above explained:
(?<=\|\*{2}) - Matches if the current position in the string is preceded by a match for \|\*{2}, i.e. |**
.+? - Will try to find anything (except a newline) repeated 1 or more times. Adding ? after the quantifier makes it perform the match in a non-greedy or minimal fashion; as few characters as possible will be matched.
(?=,{0,1}\*{2}\|) - A lookahead: matches only if the current position is followed by ,{0,1}\*{2}\|, which here means zero or one ,, then two *, and an ending |.
Try using the following regex:
(?<=\|)(?!\s).*?(?!\s)(?=\|)
see demo / explanation
If the asterisks are in the text you are searching and you do not want the comma after sheets, the pattern would be a pipe followed by two asterisks, then one or more characters that are not an asterisk or a comma:
\|\*{2}([^*,]+)
If you can live with the comma, or if there might be commas you want to catch:
\|\*{2}([^*]+)
Use either pattern with re.findall or re.finditer to capture the text you want.
If using the second pattern, you would need to run through the groups and strip any unwanted commas.
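For example, with the first pattern (the lines are trimmed from the table above):
import re

lines = [
    "|**propped**| - | -a flashlight in one hand ... |",
    "|**sheets,**| - | -As long as he didn't leave spots of ink on the sheets ... |",
]
for line in lines:
    print(re.findall(r"\|\*{2}([^*,]+)", line))
# ['propped']
# ['sheets']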
I wrote the program below to achieve the required output. I created a file string_test into which I copied all the raw strings:
import re

a = re.compile(r"^\|\*\*([^*,]+)")
with open("string_test", "r") as file1:
    for i in file1.readlines():
        match = a.search(i)
        if match:
            print(match.group(1))