python: substitute a substring with one character less

I need to process lines with a syntax similar to Markdown (http://daringfireball.net/projects/markdown/syntax), where header lines in my case are something like:
=== a sample header ===
===== a deeper header =====
and I need to change their depth, i.e. reduce it (or increase it), so:
== a sample header ==
==== a deeper header ====
My small knowledge of Python regexes is not enough to understand how to replace a number n of '=' signs with (n-1) '=' signs.

You could use backreferences and two negative lookarounds to find two corresponding sets of = characters.
output = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', input)
That will also work if you have a longer string that contains multiple headers (and will change all of them).
What does the regex do?
(?<!=) # make sure there is no preceding =
= # match a literal =
( # start capturing group 1
=+ # match one or more =
) # end capturing group 1
( # start capturing group 2
.*? # match zero or more characters, but as few as possible (due to ?)
) # end capturing group 2
= # match a =
\1 # match exactly what was matched with group 1 (i.e. the same amount of =)
(?!=) # make sure there is no trailing =
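For instance, a small demo of the substitution (the variable is named text here to avoid shadowing the input builtin):
import re

text = "=== a sample header ===\nsome body text\n===== a deeper header ====="
print(re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', text))
# == a sample header ==
# some body text
# ==== a deeper header ====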

No need for regexes. I would go very simple and direct:
import sys

for line in sys.stdin:
    trimmed = line.strip()
    if len(trimmed) >= 2 and trimmed[0] == '=' and trimmed[-1] == '=':
        print(trimmed[1:-1])
    else:
        print(line.rstrip())
The initial strip is useful because in Markdown people sometimes leave blank spaces at the end of a line (and maybe the beginning). Adjust accordingly to meet your requirements.

I think it can be as simple as replacing '=(=+)' with \1.
Is there any reason for not doing so?
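It does seem to work on the sample headers, since the global re.sub shortens each run of = independently (a quick check; the one caveat is that it would also shorten any other run of = elsewhere on the line):
import re

for line in ['=== a sample header ===', '===== a deeper header =====']:
    print(re.sub(r'=(=+)', r'\1', line))
# == a sample header ==
# ==== a deeper header ====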

How about a simple solution?
lines = ['=== a sample header ===', '===== a deeper header =====']
new_lines = []
for line in lines:
    if line.startswith('==') and line.endswith('=='):
        new_lines.append(line[1:-1])
results:
['== a sample header ==', '==== a deeper header ====']
or in one line:
new_lines = [line[1:-1] for line in lines if line.startswith('==') and line.endswith('==')]
The logic here is that if a line starts and ends with '==' then it has at least two '=' on each side, so when we trim one character from each side, we are left with at least one '=' on each side.
This will work as long as each line starts and ends with its run of '=' signs; if you are using these as headers, they will, as long as you strip the newlines off. For increasing the depth instead, see the sketch below.
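The question also asks about increasing the depth; with the same check, that direction is just adding one '=' to each side (a quick sketch, reusing the lines list from above):
deeper_lines = ['=' + line + '=' for line in lines
                if line.startswith('==') and line.endswith('==')]
# ['==== a sample header ====', '====== a deeper header ======']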

For either the first header or the second, you can just use string replace, like this:
s = "=== a sample header ==="
s = s.replace("= ", " ").replace(" =", " ")
Note that str.replace returns a new string rather than modifying s in place, so the result has to be assigned back. The deeper header works the same way.
BTW, you could also use the sub function of the re module, but it's not necessary.


Parsing String by regular expression in python

How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried split(' '), but it is not clear how many spaces there are between the sub-strings, and the last sub-string may itself contain spaces, so that doesn't work.
I need a regular expression.
If you do not provide a separator, Python's split(sep=None, maxsplit=-1) (see the docs) will treat runs of consecutive whitespace as a single separator and split on them. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6)  # don't give a split char; use 6 splits at most
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This will work as long as the first field does not contain any whitespace.
If the first field may contain whitespace, you can use/refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use e.g. https://regex101.com to check the regex against your other data (follow the link; it applies the above regex to sample data).
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify {15,45} according to your needs.
Maxsplit works with re.split(), too:
import re
re.split(r"\s+",text,maxsplit=6)
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and don't have to rely on the number of parts:
re.split(r"\s+(?=\d)|(?<=\d)\s+", s)
We cut the string where a space is followed by a digit or vice versa using lookahead and lookbehind.
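A quick check of that split on the sample line:
import re

s = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
print(re.split(r"\s+(?=\d)|(?<=\d)\s+", s))
# ['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
#  '66.8962', 'Entire grid contents are set to missing data']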
It is hard to answer your question as the requirements are not very precise. I think I would split the line with the split() function and then join the items when their content has no digits. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' ' + lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)

Python regex: re.search() is extremely slow on large text files

My code does the following:
Take a large text file (e.g. a legal document that is 300 pages as a PDF).
Find a certain keyword (e.g. "small").
Return n words to the left and n words to the right of the keyword.
NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.
Here is my code:
import re

fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()

def search(searchText, doc, n):
    # Searches for text, and retrieves n words either side of the text,
    # which are returned separately
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround * n, searchText, surround * n), doc).groups()
    return groups[:n], groups[n:]
Here is the nasty culprit:
print search("\$27.5 million", document, 10)
Here's how you can test this code:
Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.
How do I speed up my code?
How about using re.search (or even string.find if you're only searching for fixed strings) to locate the keyword, without any surrounding capturing groups? Then use the position and length of the match (.start() and .end() on a re match object, or the return value of find() plus the length of the search string). Take the substring before the match and run \s*(\S*)\s*\Z etc. on it, and take the substring after the match and run \A\s*(\S*)\s* etc. on it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt two consecutive \s* up against each other like you do. I think r'\S+'.join([r'\s+'] * n) would give the right pattern for capturing n previous words (but my Python is rusty, so check that).
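A hedged sketch of that idea: locate the keyword first with no capturing groups at all, then collect words from the flanking substrings, which is a linear scan with nothing to backtrack over.
import re

def search(search_text, doc, n):
    # Find the keyword alone; no surrounding groups to backtrack through.
    m = re.search(search_text, doc)
    if m is None:
        return None
    before, after = doc[:m.start()], doc[m.end():]
    # n words immediately to the left and right of the match.
    return re.findall(r'\S+', before)[-n:], re.findall(r'\S+', after)[:n]

t = "The world is a small place, we $.205% try to take care of it."
print(search(r"\$\.205", t, 3))
# (['small', 'place,', 'we'], ['%', 'try', 'to'])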
I see several problems here. The first, and probably worst, is that everything in your "surround" regex is not just optional but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:
regex: \s* \S* \s* tritani
offset 0: '' 'Lorem' ' ' FAIL
'' 'Lorem' '' FAIL
'' 'Lore' '' FAIL
'' 'Lor' '' FAIL
'' 'Lo' '' FAIL
'' 'L' '' FAIL
'' '' '' FAIL
...then it bumps ahead one position and starts over:
offset 1: '' 'orem' ' ' FAIL
'' 'orem' '' FAIL
'' 'ore' '' FAIL
'' 'or' '' FAIL
'' 'o' '' FAIL
'' '' '' FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I'm sure you can see where this is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.
Finally, you may have noticed the \s*\s* in the auto-generated regex above; butting two \s* up against each other like that just multiplies the ways the engine can split the same stretch of whitespace.
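Applied to the original function, the fix is a sketch like this (note the changed contract: the keyword now needs whitespace and a full word on each side, n times, since nothing is optional any more):
import re

def search(search_text, doc, n):
    # Required whitespace, required words: the engine never tries empty slots.
    before = r"(\S+)\s+" * n
    after = r"\s+(\S+)" * n
    m = re.search(before + search_text + after, doc)
    if m is None:
        return None
    return m.groups()[:n], m.groups()[n:]

text = "Lorem ipsum tritani impedit civibus ei pri"
print(search("tritani", text, 2))
# (('Lorem', 'ipsum'), ('impedit', 'civibus'))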
You could try using mmap and appropriate regex flags, e.g. (untested):
import re
import mmap

with open('your file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(your_re, mf, flags=re.DOTALL):
        print match.group()  # do something with your match
This'll only keep memory usage lower though...
The alternative is to have a sliding window of words (simple example of just single word before and after)...:
import re
import mmap
from itertools import islice, tee, izip_longest

with open('testingdata.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = (m.group() for m in re.finditer('\w+', mf, flags=re.DOTALL))
    grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
    for group in izip_longest(*grouped, fillvalue=''):
        if group[1] == 'something':  # check criteria for group
            print group
I think you are going about this completely backwards (I'm a little confused as to what you are doing in the first place!)
I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox
with re_search you could solve this problem with something like:
from cloudtb import textools

data_list = textools.re_search('my match', pdf_text_str)  # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):
        words = textools.re_search('\w+', regpart)
        # do stuff with words
    else:
        pass  # I think you are ignoring these? Not totally sure
Here is a link on how to use and how it works:
http://cloudformdesign.com/?p=183
In addition to this, your regular expressions would also be printed out in a more readable format.
You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.

How to substitute specific patterns in python

I want to replace all occurrences of integers which are greater than 2147483647 and are followed by ^^<int> with the first 3 digits of the number. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find numbers greater than 2147483647, I replace them with the first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647 e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 Terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
regex = r"""
( # Capture in group #1
"[\w\s]+" # Three sequences of quoted letters and white space characters
\s+ # followed by one or more white space characters
"[\w\s]+"
\s+
"[\w\s]+"
\s+
)
"(\d{10,})" # Match a quoted set of at least 10 integers into group #2
(^^\s+\.\s+) # Match by two circumflex characters, whitespace and a period
# into group #3
(.*) # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory - you'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall('\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                line = line.replace(found, found[:3] + found[-3:])  # keep the trailing '"^^'
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least.
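One way to avoid the inner loop entirely is a single pass with re.sub and a replacement function (a sketch; it assumes the number is always quoted and directly followed by ^^<int>, as in the sample data):
import re

pattern = re.compile(r'"(\d{10,})"\^\^<int>')

def shorten(m):
    # Only numbers that do not fit in a signed 32-bit int are shortened.
    number = m.group(1)
    if int(number) > 2147483647:
        return '"' + number[:3] + '"^^<int>'
    return m.group(0)

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        outfile.write(pattern.sub(shorten, line))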

python regex mediawiki section parsing

I have text similar to the following:
==Mainsection1==
Some text here
===Subsection1.1===
Other text here
==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here.
In the above text, Mainsection1 and Mainsection2 have different names, which can be anything the user wants. The same goes for the subsections.
What I want to do with a regex is get the text of a mainsection including its subsection (if there is one).
Yes, this is from a wiki page. All mainsection names start with == and end with ==.
All subsections have more than the 2 == in their name.
regex =re.compile('==(.*)==([^=]*)', re.MULTILINE)
regex.findall(text)
But the above returns each separate section.
Meaning it perfectly returns a mainsection but treats a subsection as a section of its own.
I hope someone can help me with this, as it's been bugging me for some time.
edit:
The result should be:
[('Mainsection1', 'Some text here\n===Subsection1.1===\nOther text here\n'),
 ('Mainsection2', 'Text goes here\n===Subsecttion2.1===\nOther text goes here.\n')]
Edit 2:
I have rewritten my code to not use a regex. I came to the conclusion that it's easy enough to just parse it myself. Which makes it a bit more readable for me.
So here is my code:
def createTokensFromText(text):
    sections = []
    cur_section = None
    cur_lines = []
    for line in text.split('\n'):
        line = line.strip()
        if line.startswith('==') and not line.startswith('==='):
            if cur_section:
                sections.append((cur_section, '\n'.join(cur_lines)))
                cur_lines = []
            cur_section = line
            continue
        if cur_section:
            cur_lines.append(line)
    if cur_section:
        sections.append((cur_section, '\n'.join(cur_lines)))
    return sections
Thanks everyone for the help!
All the answers provided have helped me a lot!
First, it should be known, I know a little about Python, but I have never programmed formally in it... Codepad said this works, so here goes! :D -- Sorry the expression is so complex:
(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
This does what you asked for, I believe! On Codepad, this code:
import re
wikiText = """==Mainsection1==
Some text here
===Subsection1.1===
Other text here
==Mainsection2==
Text goes here
===Subsecttion2.1===
Other text goes here. """
outputArray = re.findall('(?<!=)==([^=]+)==(?!=)([\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))', wikiText)
print outputArray
Produces this result:
[('Mainsection1', '\nSome text here\n===Subsection1.1===\nOther text here\n\n'), ('Mainsection2', '\nText goes here\n===Subsecttion2.1===\nOther text goes here. ')]
EDIT: Broken down, the expression essentially says:
01 (?<!=) # First, look behind to assert that there is not an equals sign
02 == # Match two equals signs
03 ([^=]+) # Capture one or more characters that are not an equals sign
04 == # Match two equals signs
05 (?!=) # Then verify that there are no equals signs following this
06 ( # Start a capturing group
07 [\s\S]*? # Match zero or more of ANY character (even CrLf), but BE LAZY
08 (?= # Look ahead to verify that either...
09 $ # this is the end of the source text
10 | # -OR-
11 (?<!=) # when I look behind there is no equals sign
12 == # then there are two equals signs
13 [^=]+ # then one or more characters that are not equals signs
14 == # then two equals signs
15 (?!=) # then verify that there are no equals signs following this
16 ) # End look-ahead group
17 ) # End capturing group
Line 03 and Line 06 specify the capturing groups for the Main Section Title and the Main Section Content, respectively.
Line 07 begs for a lot of explanation if you're not pretty fluent in Regex...
The \s and \S inside a character class [] will match anything that is whitespace or is not whitespace (i.e. ANYTHING WHATSOEVER) - one alternative to this is using the . operator, but depending upon your compiler options (or ability to specify options) this might or might not match CrLf (or Carriage-Return/Line-Feed). Since you want to match multiple lines, this is the easiest way to ensure a match.
The *? at the end means that it will match zero or more instances of the "anything" character class, but BE LAZY ABOUT IT - "lazy" quantifiers (sometimes called "reluctant") are the opposite of the default "greedy" quantifier (without the ? following it), and will not consume a source character unless the source that follows it cannot be matched by the part of the expression that follows the lazy quantifier. In other words, this will consume any characters until it finds either the end of the source text OR another main section which is specified by exactly two and only two equals signs on either side of one or more characters that are not an equals sign (including whitespace). Without the lazy operator, it would try to consume the entire source text then "backtrack" until it could match one of the things after it in the expression (end of source or a section header)
Line 08 is a "look-ahead" that specifies that the expression follwoing should be ABLE to be matched, but should not be consumed.
END EDIT
AFAIK, it has to be this complex in order to properly exclude the subsections... If you want to match the Section Name and Section Content into named groups, you can try this:
(?<!=)==(?P<SectionName>[^=]+)==(?!=)(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))
If you'd like, I can break it down for you! Just ask! (See the breakdown in the edit above.)
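For instance, reusing the wikiText string from the Codepad snippet above, the named groups read back like this (a small demo; it prints a (name, content) tuple per section):
import re

pattern = re.compile(r'(?<!=)==(?P<SectionName>[^=]+)==(?!=)'
                     r'(?P<SectionContent>[\s\S]*?(?=$|(?<!=)==[^=]+==(?!=)))')
for m in pattern.finditer(wikiText):
    print((m.group('SectionName'), m.group('SectionContent')))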
The problem here is that ==(.*)== matches ==(=Subsection=)==, so the first thing to do is to make sure there is no = inside the title: ==([^=]*)==([^=]*).
We then need to make sure that there is no = before the beginning of the match; otherwise, the first = of the three is ignored and the subtitle is matched. This will do the trick: (?<!=)==([^=]*)==([^=]*). It means "matches if not preceded by ...".
We can also do this at the end to make sure, which gives as a final result (?<!=)==([^=]*)==(?!=)([^=]*).
>>> re.findall('(?<!=)==([^=]*)==(?!=)([^=]*)', x,re.MULTILINE)
[('Mainsection1', '\nSome text here\n'),
('Mainsection2', '\nText goes here\n')]
You could also remove the check at the end of the title and replace it with a newline. That may be better if you are sure there is a new line at the end of each title.
>>> re.findall('(?<!=)==([^=]*)==\n([^=]*)', x,re.MULTILINE)
[('Mainsection1', 'Some text here\n'), ('Mainsection2', 'Text goes here\n')]
EDIT:
section = re.compile(r"(?<!=)==([^=]*)==(?!=)")
result = []
mo = section.search(x)
previous_end = 0
previous_section = None
while mo is not None:
    start = mo.start()
    if previous_section:
        result.append((previous_section, x[previous_end:start]))
    previous_section = mo.group(0)
    previous_end = mo.end()
    mo = section.search(x, previous_end)
result.append((previous_section, x[previous_end:]))
print result
It's simpler than it looks: repeatedly, we search for a section title after the previous one, and we add to the result the previous title together with the text between the end of that title and the beginning of the new one. Adjust it to suit your style and your needs. The result is:
[('==Mainsection1==',
' \nSome text here \n===Subsection1.1=== \nOther text here \n\n'),
('==Mainsection2==',
' \nText goes here \n===Subsecttion2.1=== \nOther text goes here. ')]

python regex for repeating string

I want to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print('matched.groups()', matched.groups())
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected, and fixed a string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ;, and then use a regular expression to return all matches, not just a single match: re.findall('c?[0-9]+', text).
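A minimal sketch of that two-step approach (assuming the codes always sit between start: and the first ;):
import re

s = "start: c12354, c3456, 34526; other stuff that I don't care about"
m = re.match(r'start:(.*?);', s)
if m:
    print(re.findall(r'c?[0-9]+', m.group(1)))
# ['c12354', 'c3456', '34526']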
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[7:-1].split(', ') # will give you a list of tokens separated by the string ", "
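Putting those pieces together (a sketch; parse_codes is just a name made up here, and it assumes the codes sit between 'start:' and the first ';'):
def parse_codes(s):
    # Verify the expected frame before parsing anything.
    if not (s.startswith("start:") and ";" in s):
        return None
    return s[len("start:"):s.index(";")].strip().split(", ")

print(parse_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
# ['c12354', 'c3456', '34526']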
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string

code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")

# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
