Regular Expression_How to extract several matching patterns from a line? - python

I have a .csv document consisting of several lines. Each line holds tab-separated pieces of information such as
name_1:ayse \t name_2:fatma \t birth_date_1:24 \t birth_date_2:august \t birth_date_3:2018 \t death_date:2100 \t location:turkey.
The order of these fields may differ from line to line, and each line contains many fields like this.
What I am trying to do is extract the parts of the string that contain "birth_date" information.
So far I have managed to get only the three birth-date keys,
['birth_date_1', 'birth_date_2', 'birth_date_3']
with the code below.
inputfile = open('ornek_data.csv','r',encoding="utf-8")
for rownum, line in enumerate(inputfile):
    pattern_birth = re.compile(r"\w*birth_date\w*",re.IGNORECASE)
    if pattern_birth.search(line) is not None:
        a = re.findall("\w*birth_date\w*", line)
        print(a)
However, what I actually want is to extract the list below as output and write it to another document for each line.
['birth_date_1:24', 'birth_date_2:august', 'birth_date_3:2018']
I tried several other regular expressions, such as the one below, but I couldn't get it to work. I would be glad if anyone could help me with this problem.
for rownum, line in enumerate(inputfile):
    pattern_birth = re.compile(r"\w*birth_date\w*",re.IGNORECASE)
    if pattern_birth.search(line) is not None:
        a = re.findall("\w*birth_date.*?:$", line)
        print(a)

I would not use a regex here.
Split on '\t' and check whether each resulting piece contains 'birth_date'. Simple:
s = 'name_1:ayse \t name_2:fatma \t birth_date_1:24 \t birth_date_2:august \t birth_date_3:2018 \t death_date:2100 \t location:turkey.'
print([x.strip() for x in s.split('\t') if 'birth_date' in x])
# ['birth_date_1:24', 'birth_date_2:august', 'birth_date_3:2018']
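The question also asks to write the result for each line to another file. A minimal sketch of how the split approach could be wrapped for that is shown here; the helper names `birth_fields` and `extract_to_file`, and the choice to join the matches with tabs, are illustrative assumptions and not part of the original answer.

```python
def birth_fields(line, key="birth_date"):
    """Return the tab-separated fields of `line` that contain `key`."""
    return [field.strip() for field in line.split("\t") if key in field]


def extract_to_file(in_path, out_path, key="birth_date"):
    """Write the matching fields of each input line to `out_path`, one row per line."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            fields = birth_fields(line, key)
            if fields:  # skip lines without any matching field
                dst.write("\t".join(fields) + "\n")


sample = "name_1:ayse \t birth_date_1:24 \t birth_date_2:august \t death_date:2100"
print(birth_fields(sample))
```

Calling `extract_to_file('ornek_data.csv', 'birth_dates.txt')` would then process the whole file; the output file name here is hypothetical.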

Use r"\w*birth_date.*?\s" or r"birth_date_\d:.*?\s"
Example:
import re
line = "name_1:ayse \t name_2:fatma \t birth_date_1:24 \t birth_date_2:august \t birth_date_3:2018 \t death_date:2100 \t location:turkey."
print(re.findall(r"\w*birth_date.*?\s", line))
Output:
['birth_date_1:24 ', 'birth_date_2:august ', 'birth_date_3:2018 ']

Your regex doesn't match what you are trying to extract, so you need to extend it.
As an aside, you should only re.compile once - the point of compilation is to avoid needing to parse the regex again.
There is also no need to check for no matches separately. Just loop over all the matches; if there are none, the loop will execute zero times.
pat = re.compile(r"\bbirth_date_\d+:\w+",re.IGNORECASE)
with open('ornek_data.csv','r',encoding="utf-8") as inputfile:
    for rownum, line in enumerate(inputfile):
        for a in pat.findall(line):
            print(rownum, a)
The \w* wasn't doing anything useful (if you don't care if it's there or not, as the * quantifier does, why search for it?) whereas \b requires the match to occur at a word boundary (so adjacent to whitespace or punctuation, or beginning or end of line). \d matches a digit, : simply matches itself, and \w+ matches the value after the colon, which may be digits (24) or letters (august).
If this is a well-formed CSV file, maybe instead use a CSV reader and print the fields which match startswith('birth_date_')
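A minimal sketch of the CSV-reader suggestion, assuming the file is tab-delimited as described in the question. The function name `birth_fields_csv` is hypothetical, and an in-memory `io.StringIO` stands in for the real file so the example is self-contained.

```python
import csv
import io


def birth_fields_csv(stream):
    """Yield (row_number, fields) for each row that has birth_date_ fields."""
    reader = csv.reader(stream, delimiter="\t")
    for rownum, row in enumerate(reader):
        # Fields are padded with spaces in the sample data, so strip first.
        fields = [f.strip() for f in row if f.strip().startswith("birth_date_")]
        if fields:
            yield rownum, fields


# Stand-in for the file; with a real file, pass
# open(path, newline="", encoding="utf-8") instead.
data = io.StringIO(
    "name_1:ayse \t birth_date_1:24 \t birth_date_2:august\n"
    "location:turkey\n"
)
for rownum, fields in birth_fields_csv(data):
    print(rownum, fields)
```

Unlike the substring test, `startswith` will not pick up a field that merely mentions birth_date somewhere in its value.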

Related

How to insert commas amongst any letter and following any digit using regex

import re

with open('INFILE.txt', 'r') as fileinput:
    fileoutput = fileinput.read()
replace = re.sub(r'([A-Za-z]),([A-Za-z])', r'\1\2', fileoutput)
print(replace)
with open('OUTFILE.txt', 'w') as replaceout:
    replaceout.write(replace)
The code above deletes commas between any two letters, upper or lower case. How can I insert a comma between a letter and a following digit? I tried
replace = re.sub(r"([a-z])([0-9])", r",\1", fileoutput)
but it does not work. Any suggestions on how to insert a comma between any letter and any digit?
This may help you understand how to add the comma and refer back to what you captured. The parentheses around a pattern capture the text it matches so you can reuse it in the replacement. The first captured group is referenced as \1, the second as \2, and so on.
The square brackets tell the regex which characters are allowed, and without a quantifier the bracket expression matches a single character. So the code below puts a comma after each character.
import re
test = "123frogger"
replace = re.sub(r'([A-Za-z0-9])', r'\1,', test)
which produces the output
1,2,3,f,r,o,g,g,e,r,
Here's an update based on one of your comments above about the content of what you are trying to adjust.
import re
test = "Vilniausnuoma483,NuomaVilniuiiraplinkVilniu"
replace = re.sub(r'([A-Za-z])([0-9].*)', r'\1,\2', test)
It will output the following.
Vilniausnuoma,483,NuomaVilniuiiraplinkVilniu
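The pattern above consumes the rest of the string with `.*`, so it only handles the first letter-digit boundary. If every such boundary should get a comma, a lookahead is one option. This is a sketch under that assumption; the function name is made up for illustration, and it only covers a letter followed by a digit, not a digit followed by a letter.

```python
import re


def comma_between_letters_and_digits(text):
    """Insert a comma wherever a letter is directly followed by a digit."""
    # (?=[0-9]) is a lookahead: it checks for the digit without consuming it,
    # so the digit is still available for later matches.
    return re.sub(r"([A-Za-z])(?=[0-9])", r"\1,", text)


print(comma_between_letters_and_digits("abc123,def456"))
```

Because the digit is never consumed, the replacement only needs to re-emit the captured letter followed by the comma.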

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that come two lines below it, in one capture (I can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
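To see what the two groups contain, the first pattern can be run against a small sample. The sample text below is invented for illustration and assumes plain \n line endings.

```python
import re

pattern = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

# Two records: a title line, an empty line, then one or more data lines.
text = "title one\n\nAAAA\nBBBB\n\ntitle two\n\nCCCC\n"

records = []
for match in pattern.finditer(text):
    title = match.group(1)
    # group(2) still contains the newline before each data line; drop them.
    sequence = match.group(2).replace("\n", "")
    records.append((title, sequence))

print(records)
```

Each data line is matched as a newline followed by `.+`, which is why group 2 starts with a newline and stops at the next empty line.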
This will work:
>>> import re
>>> rx_sequence = re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)", re.MULTILINE)
>>> rx_blanks = re.compile(r"\W+")  # to remove blanks and newlines
>>> text = """Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("", sequence)
...     print("Title:", title)
...     print("Sequence:", sequence)
...     print()
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK

Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (any character except a newline) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest
    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
# NOTE: can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print('Name: %s\nSequence:%s' % (m[0], m[1]))
It can sometimes be convenient to specify the flag directly inside the pattern string, as an inline flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
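A short sketch comparing the inline flag with the `re.MULTILINE` argument; the sample text is made up for illustration.

```python
import re

text = "first line\nA complete line\nlast line"

# (?m) at the start of the pattern has the same effect as passing re.MULTILINE.
inline = re.findall(r"(?m)^A complete line$", text)
flagged = re.findall(r"^A complete line$", text, re.MULTILINE)
# Without the flag, ^ and $ only anchor at the start and end of the whole string.
without_flag = re.findall(r"^A complete line$", text)

print(inline, flagged, without_flag)
```

The inline form is handy wherever an API accepts only a pattern string and offers no separate flags argument.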
My preference.
lineIter= iter(aFile)
for line in lineIter:
    if line.startswith( ">" ):
        someVaryingText= line
        break
assert len( next(lineIter).strip() ) == 0
acids= []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append( line )
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

Python regex to replace a double newline delimited paragraph containing a string

Define a paragraph as a multi-line string delimited on both sides by double newlines ('\n\n'). If there is a paragraph which contains a certain string ('BAD'), I want to replace that paragraph (i.e. any text containing BAD up to the closest preceding and following double newlines) with some other token ('GOOD'). This should be done with a Python 3 regex.
I have text such as:
dfsdf\n
sdfdf\n
\n
blablabla\n
blaBAD\n
bla\n
\n
dsfsdf\n
sdfdf
should be:
dfsdf\n
sdfdf\n
\n
GOOD\n
\n
dsfsdf\n
sdfdf
Here you are:
/\n\n(?:[^\n]|\n(?!\n))*BAD(?:[^\n]|\n(?!\n))*/g
OK, to break it down a little (because it's nasty looking):
\n\n matches two literal line breaks.
(?:[^\n]|\n(?!\n))* is a non-capturing group that matches either a single non-line break character, or a line break character that isn't followed by another. We repeat the entire group 0 or more times (in case BAD appears at the beginning of the paragraph).
BAD will match the literal text you want. Simple enough.
Then we use the same construction as above, to match the rest of the paragraph.
Then, you just replace it with \n\nGOOD, and you're off to the races.
Demo on Regex101
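In Python, the pattern goes into `re.sub` without the `/.../g` delimiters, since `re.sub` already replaces every match. A minimal sketch using the question's sample text follows; the variable names are illustrative.

```python
import re

# Same pattern as above: a double newline, then a paragraph body containing BAD.
# The body is built from non-newline characters and single newlines only.
paragraph_with_bad = re.compile(r"\n\n(?:[^\n]|\n(?!\n))*BAD(?:[^\n]|\n(?!\n))*")

text = "dfsdf\nsdfdf\n\nblablabla\nblaBAD\nbla\n\ndsfsdf\nsdfdf"
replaced = paragraph_with_bad.sub("\n\nGOOD", text)

print(repr(replaced))
```

Note that the pattern requires a double newline before the paragraph, so a paragraph at the very start of the text would not be matched by it.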
Firstly, you're mixing actual newlines and '\n' characters in your example; I assume you meant only one of the two. Secondly, let me challenge your assumption that you need regex for this:
inp = '''dfsdf
sdadf

blablabla
blaBAD
bla

dsfsdf
sdfdf'''
replaced = '\n\n'.join(['GOOD' if 'BAD' in k else k for k in inp.split('\n\n')])
The result is
print(repr(replaced))
'dfsdf\nsdadf\n\nGOOD\n\ndsfsdf\nsdfdf'

How do I remove spaces before and after commas in a csv file using python?

Currently one row of my csv file looks like this :
314523, 165538, 76255, 335416, 416827 1250536:1 1744638:1 298526:1 1568238:1
I need it to look like this :
314523,165538,76255,335416,416827 1250536:1 1744638:1 298526:1 1568238:1
I only want to remove the spaces before or after commas and leave the other spaces as they are.
How can I do this in python?
Note : I am a beginner in python
I would recommend using the replace function. You enter the pattern you want to replace. In your example, the pattern is comma space (', ') and space comma (' ,'). Then say what you want to replace the pattern with (',').
line=line.replace(', ', ',').replace(' ,',',')
You can use regex to do this, for a string:
import re
outputstring = re.sub(r'\s*,\s*', ',', inputstring)
This regex matches the whitespace surrounding a comma and the comma, and replaces it with just a comma. For a file, just do this for each line.
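A small sketch of the regex approach applied to the row from the question; the helper name `strip_spaces_around_commas` is an assumption made for illustration.

```python
import re


def strip_spaces_around_commas(line):
    """Remove whitespace on either side of every comma; other spaces are kept."""
    return re.sub(r"\s*,\s*", ",", line)


row = "314523, 165538, 76255, 335416, 416827 1250536:1 1744638:1 298526:1 1568238:1"
print(strip_spaces_around_commas(row))
```

For a file, the helper can be called on each line while iterating over the open file and the result written to an output file. One caveat: `\s` also matches a newline, so a line that ends with a comma would lose its line break; `[ \t]*,[ \t]*` avoids that if it matters.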

copy required data from one file to another file in python

I am new to Python and am stuck on this. I have a file a.txt which contains 10-15 lines of HTML code and text. I want to copy the data that matches my regular expression from a.txt to b.txt. Suppose I have the line Hello "World" How "are" you and I want to copy the data between double quotes, i.e. World and are, to the new file.
This is what I have done.
if x in line:
    p = re.compile("\"*\"")
    q = p.findall(line)
    print(q)
But this displays only " " (double quotes) as output. I think there is a mistake in my regular expression.
Any help is greatly appreciated.
Thanks.
Your regex (which translates to "*" without all the string escaping) matches zero or more quotes, followed by a quote.
You want
p = re.compile(r'"([^"]*)"')
Explanation:
"        # Match a quote
(        # Match and capture the following:
  [^"]*  # 0 or more characters except quotes
)        # End of capturing group
"        # Match a quote
This assumes that you never have to deal with escaped quotes, e.g.
He said: "The board is 2\" by 4\" in size"
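If escaped quotes do occur, one common variation is to let the group accept a backslash followed by any character. This is a sketch of that variation, not part of the answer above, and it assumes backslash is the only escape mechanism in the data.

```python
import re

# Inside the quotes, accept either a character that is neither a quote nor a
# backslash, or a backslash followed by any single character.
quoted = re.compile(r'"((?:[^"\\]|\\.)*)"')

line = r'He said: "The board is 2\" by 4\" in size" and "left"'
matches = quoted.findall(line)

print(matches)
```

The captured text keeps the backslashes; they would still need to be removed if the unescaped value is wanted.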
Capture the group you're interested in (i.e. the text between quotes), extract the matches from each line, then write them one per line to the new file, e.g.:
import re
with open('input') as fin, open('output', 'w') as fout:
    for line in fin:
        matches = re.findall('"(.*?)"', line)
        fout.writelines(match + '\n' for match in matches)
