Source:
[This] is some text with [some [blocks that are nested [in a [variety] of ways]]]
Resultant text:
[This] is some text with
From looking at the threads on Stack Overflow, I don't think you can do this with a regex.
Is there a simple way to do this, or must one reach for pyparsing (or another parsing library)?
Here's an easy way that doesn't require any dependencies: scan the text and keep a counter for the brackets you pass over. Increment the counter each time you see a "["; decrement it each time you see a "]".
As long as the counter is at zero or one, put the text you see onto the output string.
Otherwise, you are in a nested block, so don't put the text onto the output string.
If the counter doesn't finish at zero, the string is malformed; you have an unequal number of opening and closing brackets. (If it's greater than zero, you have that many excess "["s; if it's less than zero, you have that many excess "]"s.)
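A minimal sketch of that counter approach (my code, not part of the original answer; note that, exactly as described, it keeps the depth-1 text of a block even when the block contains deeper nesting, so the OP's sample becomes "[This] is some text with [some ]"):

def strip_nested(text):
    # Sketch of the counter approach described above (assumption:
    # keep everything at bracket depth 0 or 1, skip deeper text).
    out = []
    depth = 0
    for ch in text:
        if ch == '[':
            depth += 1
        if depth <= 1:
            out.append(ch)
        if ch == ']':
            depth -= 1
    if depth != 0:
        raise ValueError("malformed string: counter finished at %d" % depth)
    return ''.join(out)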
Taking the OP's example as normative (any block including further nested blocks must be removed), what about...:
import itertools

x = '''[This] is some text with [some [blocks that are nested [in a [variety]
of ways]]] and some [which are not], and [any [with nesting] must go] away.'''

def nonest(txt):
    pieces = []
    d = 0
    level = []
    for c in txt:
        if c == '[': d += 1
        level.append(d)
        if c == ']': d -= 1
    for k, g in itertools.groupby(zip(txt, level), lambda x: x[1] > 0):
        block = list(g)
        if max(d for c, d in block) > 1: continue
        pieces.append(''.join(c for c, d in block))
    print(''.join(pieces))

nonest(x)
This emits
[This] is some text with and some [which are not], and away.
which, under the normative hypothesis, would seem to be the desired result.
The idea is to compute, in level, a parallel list of counts: "how nested are we at this point", i.e., how many opened and not yet closed brackets we have met so far. Then, with groupby, segment the zip of txt and level into alternating blocks with zero nesting and nesting > 0. For each block, the maximum nesting within it is computed (it stays at zero for blocks with zero nesting; more generally, it is just the maximum of the nesting levels throughout the block), and if that maximum is <= 1, the corresponding block of text is preserved. Note that we need to make the group g into a list block because we want to perform two iteration passes (one to get the max nesting, one to rejoin the characters into a block of text); to do it in a single pass we would need to keep some auxiliary state in the nested loop, which is a bit less convenient in this case.
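To make the level bookkeeping concrete, here is the parallel list that the first loop computes for a small input (an illustration only, reusing the same loop as above):

txt = "x [a [b] c] y"
d, level = 0, []
for c in txt:
    if c == '[': d += 1
    level.append(d)
    if c == ']': d -= 1
print(list(zip(txt, level)))
# [('x', 0), (' ', 0), ('[', 1), ('a', 1), (' ', 1), ('[', 2), ('b', 2),
#  (']', 2), (' ', 1), ('c', 1), (']', 1), (' ', 0), ('y', 0)]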
You will be better off writing a parser, especially if you use a parser generator like pyparsing. It will be more maintainable and extendable.
In fact pyparsing already implements the parser for you, you just need to write the function that filters the parser output.
I took a couple of passes at writing a single parser expression that could be used with expression.transformString(), but I had difficulty distinguishing between nested and un-nested []'s at parse time. In the end I had to open up the loop in transformString and iterate over the scanString generator explicitly.
To address the question of whether [some] should be included or not based on the original question, I explored this by adding more "unnested" text at the end, using this string:
src = """[This] is some text with [some [blocks that are
nested [in a [variety] of ways]] in various places]"""
My first parser follows the original question's lead, and rejects any bracketed expression that has any nesting. My second pass takes the top level tokens of any bracketed expression, and returns them in brackets - I didn't like this solution so well, as we lose the information that "some" and "in various places" are not contiguous. So I took one last pass, and had to make a slight change to the default behavior of nestedExpr. See the code below:
from pyparsing import nestedExpr, ParseResults, CharsNotIn

# 1. scan the source string for nested [] exprs, and take only those that
#    do not themselves contain [] exprs
out = []
last = 0
for tokens, start, end in nestedExpr("[", "]").scanString(src):
    out.append(src[last:start])
    if not any(isinstance(tok, ParseResults) for tok in tokens[0]):
        out.append(src[start:end])
    last = end
out.append(src[last:])
print("".join(out))

# 2. scan the source string for nested [] exprs, and take only the toplevel
#    tokens from each
out = []
last = 0
for t, s, e in nestedExpr("[", "]").scanString(src):
    out.append(src[last:s])
    topLevel = [tok for tok in t[0] if not isinstance(tok, ParseResults)]
    out.append('[' + " ".join(topLevel) + ']')
    last = e
out.append(src[last:])
print("".join(out))

# 3. scan the source string for nested [] exprs, and take only the toplevel
#    tokens from each, keeping each group separate
out = []
last = 0
for t, s, e in nestedExpr("[", "]", CharsNotIn('[]')).scanString(src):
    out.append(src[last:s])
    for tok in t[0]:
        if isinstance(tok, ParseResults): continue
        out.append('[' + tok.strip() + ']')
    last = e
out.append(src[last:])
print("".join(out))
Giving:
[This] is some text with
[This] is some text with [some in various places]
[This] is some text with [some][in various places]
I hope one of these comes close to the OP's question. But if nothing else, I got to explore nestedExpr's behavior a little further.
Related
The problem: given a string S and an integer k < len(S), we need to find the lexicographically greatest string obtainable by removing k characters from S while maintaining the relative order of the remaining characters.
This is what I have so far:
def allPossibleCombinations(k, s, strings):
    if k == 0:
        strings.append(s)
        return strings
    for i in range(len(s)):
        new_str = s[:i] + s[i+1:]
        strings = allPossibleCombinations(k - 1, new_str, strings)
    return strings

def stringReduction(k, s):
    strings = []
    combs = allPossibleCombinations(k, s, strings)
    return sorted(combs)[-1]
This is working for a few test cases, but it says that I have too many recursive calls for other test cases. I don't know the test cases.
This should get you started -
from itertools import combinations

def all_possible_combinations(k=0, s=""):
    yield from combinations(s, len(s) - k)
Now for a given k=2, and s="abcde", we show all combinations of s with k characters removed -
for c in all_possible_combinations(2, "abcde"):
    print("".join(c))
# abc
# abd
# abe
# acd
# ace
# ade
# bcd
# bce
# bde
# cde
it says that I have too many recursive calls for other test cases
I'm surprised that it failed on recursive calls before it failed on taking too long to come up with an answer. The recursion depth is the same as k, so k would have had to reach 1000 for default Python to choke on it. However, your code takes 4 minutes to solve what appears to be a simple example:
print(stringReduction(8, "dermosynovitis"))
The amount of time is a function of k and string length. The problem as I see it, recursively, is this code:
for i in range(len(s)):
    new_str = s[:i] + s[i+1:]
    strings = allPossibleCombinations(k - 1, new_str, strings)
Once we've removed the first character say, and done all the combinations without it, there's nothing stopping the recursive call that drops out the second character from again removing the first character and trying all the combinations. We're (re)testing too many strings!
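To quantify that duplicated work (my illustration, not part of the original answer; math.comb and math.perm need Python 3.8+): removing k characters one at a time explores every ordering of the same k removals, so the naive recursion produces factorially many duplicates of each distinct result:

from math import comb, perm

n, k = len("dermosynovitis"), 8   # n = 14
print(comb(n, k))    # 3003 distinct results of length n - k
print(perm(n, k))    # 121080960 leaf calls in the naive recursion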
The basic problem is that you need to prune (i.e. avoid) candidate strings as you test, rather than generate all possibilities and test them. If a candidate's first letter is less than that of the best string you've seen so far, no manipulation of the remaining characters in that candidate is going to improve it.
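A sketch of one way to prune down to no duplicate work at all (my addition, not from the answer above): greedily keep a stack of chosen characters, dropping a smaller kept character whenever a larger one arrives and removals remain. This is the classic technique for picking the lexicographically greatest subsequence:

def string_reduction(k, s):
    keep = len(s) - k
    stack = []
    for c in s:
        # Dropping a smaller kept character in favour of a larger,
        # later one always improves dictionary order.
        while k and stack and stack[-1] < c:
            stack.pop()
            k -= 1
        stack.append(c)
    # Any unused removals come off the tail, which by this point
    # is non-increasing, so trailing characters are the cheapest.
    return "".join(stack[:keep])

print(string_reduction(8, "dermosynovitis"))   # yvitis, instantly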
Question from Daily Coding Problem 11 as reproduced below:
Implement an autocomplete system. That is, given a query string s and a set of all possible query strings, return all strings in the set that have s as a prefix.
For example, given the query string de and the set of strings [dog, deer, deal], return [deer, deal].
Hint: Try preprocessing the dictionary into a more efficient data structure to speed up queries.
I have come up with a working (hopefully) solution, but in the course of doing so I came across something I could not understand. Question below.
def autocomplete(word):
    words = []
    ## Set up word import wordDict
    with open('11_word_list.txt', 'r') as f:
        for line in f:
            words += line.split()

    ## Optional: filter out words whose first letter differs from the search
    ## word's, then ensure that the remaining words are at least as long as
    ## the search word
    words = filter(lambda x: x[0] == word[0], words)
    words = filter(lambda x: len(x) >= len(word), words)

    ## Strictly speaking, this is the only required line that can still
    ## make this solution work
    words = filter(lambda x: x.startswith(word), words)

    ####################
    ## Works in progress
    ####################

    ## Suppose that the contents of words are already as long as, or longer
    ## than, the search term. How come only Option B seems to work, but when
    ## shortened into Option A as a more generic form, it does not?
    ##
    ## Put simply, it seems words is not updated after every run of the for loop??

    ## Option A
    ## for i in range(1, len(word)):
    ##     words = filter(lambda x: x[i] == word[i], words)

    ## Option B
    ## words = filter(lambda x: x[1] == word[1], words)
    ## words = filter(lambda x: x[2] == word[2], words)
    ## words = filter(lambda x: x[3] == word[3], words)

    return list(words)
The question is as mentioned in the commented block.
Option B was written first to test the concept, and it worked, although it is hardcoded. Option A was an attempt to generalise Option B, but it seems the words variable is not being updated despite the assignment at the front; it keeps reading from the original words list.
There are two issues that combine to cause the problem. Fixing either one would prevent the larger problem.
The first issue is that filter is a lazy iterator. It doesn't actually process its input immediately. It only skips over bad values as you iterate on its output iterator. That's a good thing if you might quit iterating early, but it's problematic here. You could avoid your problem by using a non-lazy approach to filtering, like a list comprehension (or you could do list(filter(...))):
for i in range(1, len(word)):
    words = [x for x in words if x[i] == word[i]]
The reason the lazy filter is problematic is our second issue. This is that your lambda functions are closures. They're reading the i variable from the enclosing namespace, not from their own namespace. Unfortunately, since i keeps changing in the outer namespace as the loop goes on, they all end up seeing the last i value, rather than the value i had when they were defined. There's a fix for this, which is to use i as an argument in the lambda, with its current value (in the outer namespace) as the default value:
for i in range(1, len(word)):
    words = filter(lambda x, i=i: x[i] == word[i], words)
Note that neither of these fixes is going to result in very efficient prefix matching. For a really efficient search, you probably want to load the data file just once, and build a data structure that can be efficiently searched for prefixes. A trie does that by default, but if you don't want to build one yourself (or use a library), a basic dictionary mapping from prefixes to full strings might be reasonable if you don't have too many words.
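As a minimal sketch of that dictionary-mapping idea (my code, with hypothetical names; a trie scales better in memory for very large word lists): precompute every prefix once, then each query becomes a single lookup:

from collections import defaultdict

def build_prefix_index(words):
    # Map every prefix of every word to the words sharing that prefix.
    index = defaultdict(list)
    for w in words:
        for i in range(1, len(w) + 1):
            index[w[:i]].append(w)
    return index

index = build_prefix_index(["dog", "deer", "deal"])
print(index["de"])   # ['deer', 'deal']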
Today I realized that Python's list.index can also take an optional start (and even end) parameter.
I was wondering whether or not this is efficiently implemented and which of these two is better:
pattern = "qwertyuytresdftyuioknn"
words_list = ['queen', 'quoin']

for word in words_list:
    i = 1
    for character in word:
        try:
            i += pattern[i:].index(character)
        except ValueError:
            break
    else:
        yield word
or
pattern = "qwertyuytresdftyuioknn"
words_list = ['queen', 'quoin']

for word in words_list:
    i = 1
    for character in word:
        try:
            i = pattern.index(character, i)
        except ValueError:
            break
    else:
        yield word
So basically i += pattern[i:].index(character) vs i = pattern.index(character, i).
Searching for this on generic_search_machine returns nothing helpful, except a lot of beginner tutorials trying to teach me what a list is.
Background:
This code tries to find all words from words_list which match pattern. pattern is a list of characters a user entered by swiping over the keyboard, like on most modern mobile device's keyboards.
In the actual implementation there is the additional requirement that the returned word should be longer than 5 characters and the first and last character have to exactly match. These lines are omitted here for brevity, since they are trivial to implement.
This calls a built-in function implemented in C:
i = pattern.index(character, i)
Even without looking at the source code, you can always assume that the underlying implementation is smart enough to implement that efficiently, i.e. that it does not look at the first i values in the list.
As a rule of thumb, using a built-in functionality is always faster than (or at least as fast as) the best thing you can implement yourself.
The attempt to make it better:
i += pattern[i:].index(character)
This is definitely worse. It makes a copy of pattern[i:] and then looks for character in it.
So, in the worst case, if you have a pattern of 1 GB and i=1, this copies 1 GB of data in memory in an attempt to skip the first element (which would have been skipped anyway).
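A quick way to see the cost of that copy (my illustration; absolute numbers vary by machine):

import timeit

pattern = "a" * 10_000_000 + "b"

# The slice copies the ~10 MB tail before searching it.
t_slice = timeit.timeit(lambda: pattern[1:].index("b"), number=100)
# The start offset searches in place, with no copy.
t_start = timeit.timeit(lambda: pattern.index("b", 1), number=100)

print(t_slice, t_start)   # the sliced version is consistently slower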
Given a string, let's say "TATA__", I need to find the total number of differences between adjacent characters in that string, i.e. there is a difference between T and A, but not between A and A, or between _ and _.
My code more or less tells me this. But when a string such as "TTAA__" is given, it doesn't work as planned.
I need to take a character in that string, and check if the character next to it is not equal to the first character. If it is indeed not equal, I need to add 1 to a running count. If it is equal, nothing is added to the count.
This what I have so far:
def num_diffs(state):
    count = 0
    for char in state:
        if char != state[char2]:
            count += 1
        char2 += 1
    return count
When I run it using num_diffs("TATA__") I get 4 as the response. When I run it with num_diffs("TTAA__") I also get 4. Whereas the answer should be 2.
If any of that makes sense at all, could anyone help in fixing it/pointing out where my error lies? I have a feeling it has to do with state[char2]. Sorry if this seems like a trivial problem; I'm totally new to the Python language.
import operator

def num_diffs(state):
    return sum(map(operator.ne, state, state[1:]))
To open this up a bit, it maps != (operator.ne) over state and state beginning at the 2nd character. The map function accepts multiple iterables as arguments and passes elements from them one by one as positional arguments to the given function, until one of the iterables is exhausted (state[1:] will stop first in this case).
The map results in an iterable of boolean values, but since bool in python inherits from int you can treat it as such in some contexts. Here we are interested in the True values, because they represent the points where the adjacent characters differed. Calling sum over that mapping is an obvious next step.
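To see the intermediate values (an illustration, not part of the answer):

import operator

state = "TTAA__"
print(list(map(operator.ne, state, state[1:])))
# [False, True, False, True, False]  -> sum(...) == 2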
Apart from the string slicing the whole thing runs using iterators in python3. It is possible to use iterators over the string state too, if one wants to avoid slicing huge strings:
import operator
from itertools import islice

def num_diffs(state):
    return sum(map(operator.ne,
                   state,
                   islice(state, 1, len(state))))
There are a couple of ways you might do this.
First, you could iterate through the string using an index, and compare each character with the character at the previous index.
Second, you could keep track of the previous character in a separate variable. The second seems closer to your attempt.
def num_diffs(s):
    count = 0
    prev = None
    for ch in s:
        if prev is not None and prev != ch:
            count += 1
        prev = ch
    return count
prev is the character from the previous loop iteration. You assign ch (the current character) to it at the end of each iteration so that it will be available in the next.
You might want to investigate Python's groupby function which helps with this kind of analysis.
from itertools import groupby

def num_diffs(seq):
    return len(list(groupby(seq))) - 1

for test in ["TATA__", "TTAA__"]:
    print(test, num_diffs(test))
This would display:
TATA__ 4
TTAA__ 2
The groupby() function works by grouping identical consecutive entries together. For each group it returns a key and a group iterator: the key is the repeated entry, and the group yields the matching entries. So each new group marks a point where adjacent characters differ, which is why the number of groups minus one is the answer.
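For example (my illustration):

from itertools import groupby

for key, group in groupby("TTAA__"):
    print(key, list(group))
# T ['T', 'T']
# A ['A', 'A']
# _ ['_', '_']
# 3 groups -> 3 - 1 == 2 differences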
Trying to make as little modifications to your original code as possible:
def num_diffs(state):
    count = 0
    for char2 in range(1, len(state)):
        if state[char2 - 1] != state[char2]:
            count += 1
    return count
One of the problems with your original code was that the char2 variable was never initialized within the body of the function; because char2 += 1 makes it a local name, the first read of state[char2] raises an UnboundLocalError.
However, working with indices is not the most Pythonic way, and it is error prone (see comments for a mistake that I made). You may want to rewrite the function in such a way that it does one loop over a pair of strings, a pair of characters at a time:
def num_diffs(state):
    count = 0
    for char1, char2 in zip(state[:-1], state[1:]):
        if char1 != char2:
            count += 1
    return count
Finally, that very logic can be written much more succinctly — see #Ilja's answer.
I love Python, though I am new to it as well. With the help of the community here (users like Antti Haapala) I was able to get part of the way, but I got stuck at the end. Please help. I have two tasks remaining before I get into my big-data POC (I plan to use this code on 1+ million records in a text file):
• Search for a keyword in a column (C#3) and keep the 2 words before and after that keyword.
• Divert the print output to a file.
• Here I don't want to touch C#1 or C#2, for referential integrity purposes.
I really appreciate all your help.
My input file:
C#1  C#2  C#3  (these are column headings, included just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it
Desired output file: (only change in Column 3 or last column)
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
Code I am currently using:
s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """

for line in s.splitlines():
    if not line.strip():
        continue
    fields = line.split(None, 2)
    joined = '|'.join(fields)
    print(joined)
BTW, if I use a keyword search, it also matches against my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, search only the 3rd column, and keep the 2 words before and after the keyword(s).
First I need to warn you that using this code for 1 million records is dangerous. You are dealing with regular expressions, and this method is good only as long as the patterns really are regular; otherwise you might end up creating tons of cases to extract the data you want without also extracting data you don't want.
For 1 million records you'll want pandas, as a plain Python for loop is too slow.
import pandas as pd
import re

df = pd.DataFrame({'C1': [12088, 12089],
                   'C2': ["CITA", "CITA"],
                   'C3': ["Hello very nice lists, better to keep those",
                          "This is great theme for lists keep it"]})

df["C3"] = df["C3"].map(
    lambda x: re.findall(r'(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*',
                         str(x)))
df["C3"] = df["C3"].map(lambda x: x[0].strip())
which gives
df
C1 C2 C3
0 12088 CITA very nice lists, better to
1 12089 CITA theme for lists keep it
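Note that the lookbehind pattern above is hardcoded to the sample text ("Hello", "great"). A more generic sketch, assuming the rule is simply "keep up to two whitespace-separated words on each side of the keyword" (my assumption, not stated in the question):

import re

KEYWORD = "lists"
pat = re.compile(r'(?:\S+\s+){0,2}%s,?(?:\s+\S+){0,2}' % re.escape(KEYWORD))

print(pat.search("Hello very nice lists, better to keep those").group())
# very nice lists, better to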
There are still some questions left about how exactly you strive to perform your keyword search. One obstacle is already contained in your example: how to deal with characters such as commas? Also, it is not clear what to do with lines that do not contain the keyword. Also, what to do if there are not two words before or two words after the keyword? I guess that you yourself are a little unsure about the exact requirements and did not think about all edge cases.
Nevertheless, I have made some "blind decisions" about these questions, and here is a naive example implementation that assumes your keyword-matching rules are rather simple. I have put the matching into the function findword(), which you can adjust however you like. So maybe this example helps you find your own requirements.
KEYWORD = "lists"

S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """

def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None

for line in S.splitlines():
    tokens = line.split("|")
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    # Keep the two words before and after the keyword. Python slicing
    # clamps out-of-range indices, so no further bounds checks are needed.
    start = max(idx - 2, 0)
    end = idx + 3
    tokens[2] = " ".join(words[start:end])
    print('|'.join(tokens))
Test:
$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
PS: Python slicing clamps out-of-range indices, which is what keeps the start/end arithmetic above so simple. You should still check the boundaries against your own edge cases, nevertheless.