I have 2 lists:
a = [
    'Okay. ', 'Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ',
    "it's ", 'set ', 'up ', 'just ', 'one ', 'and ', "we're ", 'like ',
    'next ', 'to ', 'each ', 'other '
]
b = [
    'Okay. ', 'Yeah. ', 'Everything ', 'as ', 'normal ', 'as ', 'possible. ',
    'Yeah. ', 'Yeah. ', 'Okay. ', 'Is ', 'that ', 'better? ', 'Yeah. ',
    'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when '
]
Each list is slightly different. However, there will be moments when a stretch of continuous elements in a will match a stretch of continuous elements in b.
For example:
The first 2 elements in both lists match. The matching list would be ['Okay. ', 'Yeah. ']. This is only 2 elements long.
There is a longer stretch of matching words. You can see that each contains the following continuous set:
['Yeah. ','So ','my ','thinking ','is, ','so ','when ']
This continuous matching sequence has 7 elements. This is the longest sequence.
I want the index of where this sequence starts for each list. For a, this should be 1 and for b this should be 13.
I understand that I could generate every possible contiguous sequence in a, starting with the longest, and check for a match in b, stopping once I get a match. However, this seems inefficient.
How I would solve this:
from difflib import SequenceMatcher

# Calling find_longest_match() with no arguments requires Python 3.9+;
# on older versions pass the bounds explicitly: find_longest_match(0, len(a), 0, len(b))
match = SequenceMatcher(None, a, b).find_longest_match()
print(a[match.a:match.a + match.size])
print(b[match.b:match.b + match.size])
You get:
['Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ']
['Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ']
So, we start from the top of a and search through b to find the longest match. Since the comparison only continues as long as elements keep matching, it isn't terribly inefficient.
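If you want the start indices themselves (which is what the question asks for), the Match object returned above already carries them; a minimal sketch:
# match.a and match.b are the start indices of the common run in a and b,
# and match.size is its length.
print(match.a, match.b, match.size)  # -> 1 13 7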
a = [
    'Okay. ', 'Yeah. ', 'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when ',
    "it's ", 'set ', 'up ', 'just ', 'one ', 'and ', "we're ", 'like ',
    'next ', 'to ', 'each ', 'other '
]
b = [
    'Okay. ', 'Yeah. ', 'Everything ', 'as ', 'normal ', 'as ', 'possible. ',
    'Yeah. ', 'Yeah. ', 'Okay. ', 'Is ', 'that ', 'better? ', 'Yeah. ',
    'So ', 'my ', 'thinking ', 'is, ', 'so ', 'when '
]
start = None
maxlen = 0
for i in range(len(a)):
    for j in range(len(b)):
        # count how many consecutive elements match starting at a[i] and b[j]
        k = 0
        while k < min(len(a) - i, len(b) - j) and a[i + k] == b[j + k]:
            k += 1
        if k > maxlen:
            start = (i, j)
            maxlen = k
print(start, maxlen)
Output:
(1, 13) 7
I want to split a string into a list of words (here "word" means an arbitrary sequence of non-whitespace characters), but also keep the groups of consecutive whitespace characters that were used as separators (because the number of whitespace characters is significant in my data). For this simple task, I know that the following regex would do the job (I use Python as an illustrative language, but the code can easily be adapted to any language that supports regexes):
import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b cc dd! :ee "))
produces the expected output:
['', 'aa', ' ', 'b+b', ' ', 'cc', ' ', 'dd!', ' ', ':ee', ' ']
Now the hard part: when a word includes an opening parenthesis, all the whitespaces encountered until the matching closing parenthesis should not be considered as word separators. In other words:
regexB.split("aa b+b cc(dd! :ee (ff gg) hh) ii ")
should produce:
['', 'aa', ' ', 'b+b', ' ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', ' ']
Using
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')
works for a single pair of parentheses, but fails when there are inner parentheses. How could I improve the regex to correctly skip inner parentheses?
And the final question: in my data, only words starting with % should be tested for the "parenthesis rule" (regexB), the other words should be treated by regexA. I have no idea how to combine two regexes in a single split.
Any hint is warmly welcome...
The PCRE regex engine supports subroutine calls, so a recursive pattern can handle balanced nested parentheses:
(?m)\s+(?=[^()]*(\([^()]*(?1)?[^()]*\))*[^()]*$)
Demo, in which (?1) means calling subroutine 1, i.e. the group (\([^()]*(?1)?[^()]*\)); the pattern is recursive because that group contains the call (?1) to itself.
But Python's re module does not support subroutine patterns (see the sketch after the output below for a third-party alternative).
So I first replaced every ( and ) with another distinctive character (# in this example), applied a regex to split, and finally turned each # back into ( or ) in the Python script below.
Regex for splitting:
(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)
Demo, in which I changed your separator \S+ to runs of whitespace \s+, because #, ( and ) are all non-whitespace characters matched by \S.
The Python script may look like this:
import re

ss = """aa b+b cc(dd! :ee ((ff gg)) hh) ii """
ss = re.sub(r"\(|\)", "#", ss)  # replace every `(`, `)` with `#`
regx = re.compile(r"(?m)(\s+)(?=[^#]*(?:(?:#[^#]*){2})*$)")
m = regx.split(ss)
for i in range(len(m)):  # turn `#` back into `(` or `)` respectively
    n = m[i].count('#')
    if n < 2:
        continue
    else:
        for j in range(n // 2):        # the first n//2 `#`s become `(`
            k = m[i].find('#')
            m[i] = m[i][:k] + '(' + m[i][k+1:]
        m[i] = m[i].replace("#", ')')  # the remaining `#`s become `)`
print(m)
Output is
['aa', ' ', 'b+b', ' ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', ' ', '']
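As an aside, and purely as a sketch of my own (the original answer did not test this): if a third-party package is acceptable, the regex module on PyPI supports recursive subpatterns such as (?1), so the nested parentheses can be handled directly without the # substitution trick:
import regex  # third-party module: pip install regex

s = "aa b+b cc(dd! :ee ((ff gg)) hh) ii "
# A token is either a run of whitespace, or a word in which each parenthesised
# group (group 1, defined recursively) protects its inner whitespace.
pat = regex.compile(r"\s+|(?:[^\s()]+|(\((?:[^()]+|(?1))*\)))+")
print([m.group(0) for m in pat.finditer(s)])
# ['aa', ' ', 'b+b', ' ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', ' ']
Note that this tokenises the whole string rather than splitting it, so there is no leading empty string in the result.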
Finally, after testing several ideas based on the answers proposed by Wiktor Stribiżew and Thm Lee, I came up with a bunch of solutions dealing with different levels of complexity. To reduce dependencies, I wanted to stick to the re module from the Python standard library, so here is the code:
import re

text = "aa b%b( %cc(dd! (:ee ff) gg) %hh ii) "

# Solution 1: don't process parentheses at all
regexA = re.compile(r'(\S+)')
print(regexA.split(text))

# Solution 2: works for non-nested parentheses
regexB = re.compile(r'(%[^(\s]*\([^)]*\)|\S+)')
print(regexB.split(text))

# Solution 3: works for one level of nested parentheses
regexC = re.compile(r'(%[^(\s]*\((?:[^()]*\([^)]*\))*[^)]*\)|\S+)')
print(regexC.split(text))

# Solution 4: works for arbitrary levels of nested parentheses
n, words = 0, []
for word in regexA.split(text):
    if n: words[-1] += word    # still inside parentheses: glue onto the previous token
    else: words.append(word)
    if n or (word and word[0] == '%'):
        n += word.count('(') - word.count(')')  # track the nesting depth
print(words)
Here is the generated output:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
As stated in the OP, for my specific data the whitespace inside parentheses only has to be protected for words starting with %; other parentheses (e.g. the word b%b( in my example) are not considered special. If you want to protect the whitespace inside any pair of parentheses, simply remove the % character from the regexes (and the corresponding % test in Solution 4). Here is the result with that modification:
Solution 1: ['', 'aa', ' ', 'b%b(', ' ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 2: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 3: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', ' ']
Solution 4: ['', 'aa', ' ', 'b%b( %cc(dd! (:ee ff) gg) %hh ii)', ' ']
I have a list that I'm trying to strip all punctuation and the character "·" from, and then return that list without any of the above. However, when I try to return the list, only the first word of the list appears and I'm not sure where I went wrong.
Here is the list I'm trying to strip punctuation from:
['in·vis·i·ble', 'in·vis·i·bil·i·ty, ', 'in·vis·i·ble·ness, ', 'in·vis·i·bly, ', 'qua·si-in·vis·i·ble, ', 'qua·si-in·vis·i·bly, ', 'inˌvisiˈbility, ', 'inˈvisibleness, ', 'inˈvisibly, ']
Here's what I'm getting: ['invisible']
Here is a portion of my code (it's part of a larger function):
syl = []
for words in span:
    if words not in syl:
        syl.append(words)
for text in syl:
    drop_sep = re.sub(r'·', '', text)
    return drop_sep
Use a list comprehension where each element of the resulting list is a string with all occurrences of the separator '·' replaced by the empty string '':
[word.replace('·', '') for word in words]
Example
>>> words = ['in·vis·i·ble',
... 'in·vis·i·bil·i·ty, ',
... 'in·vis·i·ble·ness, ',
... 'in·vis·i·bly, ',
... 'qua·si-in·vis·i·ble, ',
... 'qua·si-in·vis·i·bly, ',
... 'inˌvisiˈbility, ',
... 'inˈvisibleness, ',
... 'inˈvisibly, ']
>>>
>>> from pprint import pprint
>>> pprint([word.replace('·', '') for word in words])
['invisible',
'invisibility, ',
'invisibleness, ',
'invisibly, ',
'quasi-invisible, ',
'quasi-invisibly, ',
'inˌvisiˈbility, ',
'inˈvisibleness, ',
'inˈvisibly, ']
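If you also need to drop ordinary punctuation (the question mentions stripping all punctuation, not just '·'), one possibility, assuming ASCII punctuation is what you want removed, is str.translate; note that this would also strip the hyphen in 'quasi-invisible', so adjust the character set if you want to keep it:
import string

# Translation table that deletes ASCII punctuation plus the '·' separator.
table = str.maketrans('', '', string.punctuation + '·')
cleaned = [word.translate(table) for word in words]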
I have this list,
last_names = [
    'Hag ', 'Hag ', 'Basmestad ', 'Grimlavaag ', 'Kleivesund ',
    'Fintenes ', 'Svalesand ', 'Molteby ', 'Hegesen ']
and I want to print it reversed, so 'Hegesen' comes first, then 'Molteby', and 'Hag' at the end.
I have tried last_names.reverse(), but that returns None...
Any help?
.reverse returns None because it reverses in-place:
>>> last_names = [
... 'Hag ', 'Hag ', 'Basmestad ', 'Grimlavaag ', 'Kleivesund ',
... 'Fintenes ', 'Svalesand ', 'Molteby ', 'Hegesen ']
>>> last_names.reverse()
>>> last_names
['Hegesen ', 'Molteby ', 'Svalesand ', 'Fintenes ', 'Kleivesund ', 'Grimlavaag ', 'Basmestad ', 'Hag ', 'Hag ']
To do this in an expression, do last_names[::-1].
As stated before, .reverse reverses the list in place. A more Pythonic way to reverse a list and get the result back as a new sequence is to use reversed:
>>> list(reversed([1,2,3]))
[3, 2, 1]
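Since the goal here is just printing, you can also iterate over the reversed view directly without building a new list; a minimal sketch:
for name in reversed(last_names):
    print(name)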
I have this program that takes two csv files into consideration. It looks at "testclaims" (one column, many rows) and checks whether any word in "masterlist" (one column, many rows) appears within the rows of "testclaims". If a row in "testclaims" contains any word in "masterlist", it writes it to a new .csv file called "output". This part of the program works great.
The part that I can't seem to figure out is how to output all the remaining rows in "testclaims" that don't contain ANY words in "masterlist" into another csv called "output2". I would think that the last two lines of my code should get this to work, but it's not outputting what I want. I hope I've explained this clearly enough. Here's my code:
import csv

with open("testclaims.csv") as file1, open("masterlist.csv") as file2, \
     open("stopwords.csv") as file3, \
     open("output.csv", "wb+") as file4, open("output2.csv", "wb+") as file5:
    writer = csv.writer(file4)
    writer2 = csv.writer(file5)
    key_words = [word.strip() for word in file2.readlines()]
    stop_words = [word.strip() for word in file3.readlines()]
    internal_stop_words = [' a ', ' an ', ' and ', 'as ', ' at ', ' be ', 'ed ',
        'ers ', ' for ', ' he ', ' if ', ' in ', ' is ', ' it ', ' of ', ' on ',
        ' to ', 'her ', 'hers ', ' do ', ' did ', ' a ', ' b ', ' c ', ' d ',
        ' e ', ' f ', ' g ', ' h ', ' i ', ' j ', ' k ', ' l ', ' m ', 'n ',
        ' n', ' nc ', ' o ', ' p ', ' q ', ' r ', ' s ', ' t ', ' u ', ' v ',
        ' w ', ' x ', ' y ', 'z ', ',', '"', 'ers ', ' th ', ' gc ', ' so ',
        ' ot ', ' ft ', ' ow ', ' ir ', ' ho ', ' er ']
    for row in file1:
        row = row.strip()
        row = row.lower()
        for stop in stop_words:
            if stop in row:
                row = row.replace(stop, " ")
        for stopword in internal_stop_words:
            if stopword in row:
                row = row.replace(stopword, " ")
        for key in key_words:
            if key in row:
                writer.writerow([key, row])
            elif key not in row:
                writer2.writerow([row])
What output2 is outputting is every row in "testclaims" repeated multiple times.
For example if "testclaims" contains this one column:
Happy
Sad
Angry
Dog
Cat
"output2" is outputting a csv that prints this one column:
Happy
Happy
Happy
Happy
Happy
Sad
Sad
Sad
Sad
Angry
Angry
Angry
Angry
Angry
Dog
Dog
Dog
Dog
Dog
Cat
Cat
Cat
Cat
Cat
And it doesn't output the same number of each row either.
You have a nested for loop, and you write the row once for every keyword it fails to contain, but you only want it written at most once per row.
You should adjust your last lines:
for row in file1:
    ...
    for key in key_words:
        if key in row:
            writer.writerow([key, row])
    if not any(key in row for key in key_words):
        writer2.writerow([row])
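Equivalently, you could collect the matches once and branch on the result, which avoids scanning key_words a second time (just a sketch against the same variables, not tested on the original files):
for row in file1:
    # ... same row cleanup as above ...
    matched_keys = [key for key in key_words if key in row]
    if matched_keys:
        for key in matched_keys:
            writer.writerow([key, row])
    else:
        writer2.writerow([row])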