Does pyparsing know the depth of the recursive expression at parse time? - python

I love the ability to define a parseAction with pyparsing, but I've run into a roadblock for a particular use case. Take this input string and the following simple grammar:
from pyparsing import *
line = "[[one [two [three] two [three [four]]] one] zero]"
token = Word(alphas)
# Define the simple recursive grammar
grammar = Forward()
nestedBrackets = nestedExpr('[', ']', content=grammar)
grammar << (token | nestedBrackets)
P = grammar.parseString(line)
print(P)
I'd like the results to be:
[[('one',1), [('two',2), [('three',3)], ('two',2), [('three',3), [('four',4)]]], ('one',1)], ('zero',0)]
i.e. parse each token and return a tuple with the token and its depth. I know that this can be done post-parse, but I want to know if it is possible to do with a parseAction. Here was my incorrect attempt with a global variable:
# Try to count the depth
counter = 0

def action_token(x):
    global counter
    counter += 1
    return (x[0], counter)

token.setParseAction(action_token)

def action_nest(x):
    global counter
    counter -= 1
    return x[0]

nestedBrackets.setParseAction(action_nest)
Giving:
[('one', 1), ('two', 2), ('three', 3), ('two', 3), ('three', 4), ('four', 5), ('one', 3), ('zero', 3)]

Do this (leaving the rest as you have it):
def openB(s, l, t):
    global count
    count += 1

def closeB(s, l, t):
    global count
    count -= 1

opener = Literal("[").setParseAction(openB)
closer = Literal("]").setParseAction(closeB)
nestedBrackets = nestedExpr(opener, closer, content=grammar)
The issue is that the nesting level depends not on the number of nested groups matched, but on how many opening brackets have been matched versus how many closing brackets. Therefore, you need to adjust the count when you parse the open and close brackets, not when you parse the group; in other words, set the parseAction on the group delimiters, not on the group itself.
Also, your example has the nesting off by one level (at least to my eyes). The 'zero' should really be one, since it is inside one level of brackets, and likewise everything else should be shifted up by one. If you really want that outermost 'zero' to have level zero, and so on, you need to initialize count to -1.
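Putting the question's grammar together with this fix, a minimal end-to-end sketch might look like the following (the token parse action and variable names are mine; the exact grouping of the printed ParseResults may differ slightly from the hand-written target above):
from pyparsing import Forward, Literal, Word, alphas, nestedExpr

line = "[[one [two [three] two [three [four]]] one] zero]"

count = -1  # start at -1 so the outermost 'zero' is tagged 0; use 0 for 1-based depths

def openB(s, l, t):
    global count
    count += 1

def closeB(s, l, t):
    global count
    count -= 1

def tag_token(s, l, t):
    # attach the current bracket depth to each word
    return (t[0], count)

token = Word(alphas).setParseAction(tag_token)

grammar = Forward()
opener = Literal("[").setParseAction(openB)
closer = Literal("]").setParseAction(closeB)
nestedBrackets = nestedExpr(opener, closer, content=grammar)
grammar << (token | nestedBrackets)

print(grammar.parseString(line))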

Related

How can I store my result in a tuple in Python

Write a function called word_freq(text) which takes one string
argument. This string will not have any punctuation. Perform a count
of the number of 'n' character words in this string and return a list
of tuples of the form [(n, count), (n-1, count) ...] in descending
order of the counts. For example:
Example: word_freq('a aaa a aaaa')
Result: [(4, 1), (3, 1), (1, 2)]
Note that this does not show anything for the 2-character words. I started from these fragments:
str1 = 'a aaa a aaa'
str.split(str1)
str.count(str1)

def word_freq(str):
    # Python code to find frequency of each word
I tried this
text = 'a aaa a aaaa'

def word_freq(str):
    tuple = ()
    count = {}
    for x in str:
        if x in count.keys():
            count[x] += 1
        else:
            count[x] = 1
    print(count)

def count_letters(word):
    char = "a"
    count = 0
    for c in word:
        if char == c:
            count += 1
    return count

word_freq(text)
The code below does what you want; here is how it works. Before anything else, it creates a dictionary called wc that will hold the count of each n-character word in the sentence. The function receives a string, uses split() to turn it into a list of words, and then checks the length of each word: 2-character words are skipped, and for every other word it adds 1 to the count for that word length in the dictionary.
After every word has been checked, wc.items() turns the dictionary into a list of tuples. Each tuple has two elements: the first is the number of characters of a word and the second is the number of times words of that length appeared in the sentence. All that remains is to sort this list by character count in reverse (from high to low), which sorted() does with key=lambda x: x[0] and reverse=True. Finally, the function returns this list of tuples, which you can print.
If anything is unclear, let me know. You can also add print() statements after each line to better understand what is happening.
Here's the code; I hope it helps:
inp = input("Enter your text: ")

def word_count(inp_str):
    wc = {}
    for item in inp_str.strip().split():
        if len(item) == 2:
            continue
        wc[len(item)] = wc.get(len(item), 0) + 1
    return sorted(wc.items(), key=lambda x: x[0], reverse=True)

print(word_count(inp))
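As a quick check against the example from the question, calling the function directly instead of going through input():
print(word_count('a aaa a aaaa'))  # [(4, 1), (3, 1), (1, 2)]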

Check if strings are anagrams

I want to create a Python function that checks whether two given strings are anagrams. This code works if there is only one character that doesn't match, i.e. if the strings are 'bad' and 'dad' it returns 'b' and 'd', but if the strings are 'zippo' and 'hipps' it just returns 'z' and 'h'. How should I modify the code to return all the values that do not match?
def anagram(str_1, str_2):
    '''
    This function check if two string are anagram,
    if yes then prints yes otherwise it checks
    for the words that need to be deleted to make it an anagram.
    '''
    if sorted(str_1) == sorted(str_2):
        return "The given strings are anagrams"
    else:
        # words_needed_to_removed = []
        zipped_strings = zip(str_1, str_2)
        for (i, j) in zipped_strings:
            if i != j:
                return i, j
                # words_needed_to_removed.append((i,j))
        # return f"The words needed to be removed to make the strings an anagram are:{words_needed_to_removed}"
You have two options:
Instead of return, use yield.
Instead of returning as soon as you find unequal values, store the values in a list as tuples and then return the list at the end.
Another suggestion: instead of sorting the strings, count each character in each string and compare the counts, as sketched below.
Sorting takes O(n log n) whereas counting takes O(n), though at the cost of some extra space.
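A minimal sketch of that counting idea, using collections.Counter purely to check whether two strings are anagrams (my own illustration, not code from the question):
from collections import Counter

def is_anagram(str_1, str_2):
    # Counter builds a character -> count map in O(n); equal maps mean anagrams
    return Counter(str_1) == Counter(str_2)

print(is_anagram("bad", "dab"))      # True
print(is_anagram("zippo", "hipps"))  # False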
To return everything, add it to a list and then return the list at the end of the loop.
def anagram(str_1, str_2):
    if sorted(str_1) == sorted(str_2):
        return "The given strings are anagrams"
    else:
        zipped_strings = zip(str_1, str_2)
        all_letters = []
        for (i, j) in zipped_strings:
            if i != j:
                all_letters.append((i, j))
        return all_letters
However, your code won't work for cases where an extra letter shifts the remaining letters out of alignment, so zip pairs matching letters against the wrong positions. For example: anagram("ohipzs", "hipzas") gives [('o', 'h'), ('h', 'i'), ('i', 'p'), ('p', 'z'), ('z', 'a')], even though the only mismatch is 'o' and 'a'.
Instead, you could first make a counter containing the counts of all letters in the first word:
counter = dict()
for letter in word1:
    counter[letter] = counter.get(letter, 0) + 1
Next, subtract all letters of the second word from this counter
for letter in word2:
    counter[letter] = counter.get(letter, 0) - 1
Finally, return all keys of counter that do not have a value of zero
mismatch_letters_1 = []
mismatch_letters_2 = []
for letter, count in counter.items():
    if count == 0:
        # Counts match perfectly, skip this letter
        continue
    elif count > 0:
        append_to_list = mismatch_letters_1
    else:
        append_to_list = mismatch_letters_2
    for _ in range(abs(count)):
        append_to_list.append(letter)
answer = [(i, j) for i, j in zip(mismatch_letters_1, mismatch_letters_2)]
Putting this into a function and testing gives:
anagram2("bad", "dad") -> [('b', 'd')]
anagram2("zippo", "hipps") -> [('z', 'h'), ('o', 's')]
anagram2("ohipzs", "hipzas") -> [('o', 'a')]

Reconstruct input string given ngrams of that string

Given a string, e.g. i am a string.
I can generate the n-grams of this string like so, using the nltk package, where n is variable as per a specified range.
from nltk import ngrams

s = 'i am a string'
for n in range(1, 3):
    for grams in ngrams(s.split(), n):
        print(grams)
Gives the output:
('i',)
('am',)
('a',)
('string',)
('i', 'am')
('am', 'a')
('a', 'string')
Is there a way to 'reconstruct' the original string using combinations of the generated ngrams? Or, in the words of the below commenter, is there a way to divide the sentence into consecutive word sequences where each sequence has a maximum length of k (in this case k is 2).
[('i'), ('am'), ('a'), ('string')]
[('i', 'am'), ('a'), ('string')]
[('i'), ('am', 'a'), ('string')]
[('i'), ('am'), ('a', 'string')]
[('i', 'am'), ('a', 'string')]
The question is similar to this one, though with an additional layer of complexity.
Working solution - adapted from here.
I have a working solution, but it's really slow for longer strings.
import itertools
from nltk import ngrams

def get_ngrams(s, min_=1, max_=4):
    token_lst = []
    for n in range(min_, max_):
        for idx, grams in enumerate(ngrams(s.split(), n)):
            token_lst.append(' '.join(grams))
    return token_lst

def to_sum_k(s):
    for len_ in range(1, len(s.split()) + 1):
        for i in itertools.permutations(get_ngrams(s), r=len_):
            if ' '.join(i) == s:
                print(i)

to_sum_k('a b c')
EDIT:
This answer was based on the assumption that the question was to reconstruct an unknown unique string based on its ngrams. I'll leave it up for anyone interested in that problem. The actual answer for the problem as clarified in the comments can be found here.
EDIT END
In general no. Consider e.g. the case n = 2 and s = "a b a b". Then your ngrams would be
[("a"), ("b"), ("a", "b"), ("b", "a")]
The set of strings that generate this set of ngrams in this case however would be all that may be generated by
(ab(a|(ab)*a?))|(ba(b|(ba)*b?))
Or n = 2, s = "a b c a b d a", where "c" and "d" may be arbitrarily ordered within the generating strings. E.g. "a b d a b c a" would also be a valid string. In addition the same issue as above arises and an arbitrary number of strings can generate the set of ngrams.
That being said there exists a way to test whether a set of ngrams uniquely identifies a string:
Consider your set of strings as a description of a non-deterministic state-machine. Each ngram can be defined as a chain of states where the single characters are transitions. As an example for the ngrams [("a", "b", "c"), ("c", "d"), ("a", "d", "b")] we would build the following state-machine:
0 ->(a) 1 ->(b) 2 ->(c) 3
0 ->(c) 3 ->(d) 4
0 ->(a) 1 ->(d) 5 ->(b) 6
Now perform a determinization of this state-machine. Iff there exists a unique string that can be reconstructed from the ngrams, the state-machine will have a longest transition-chain that doesn't contain any cycles and contains all ngrams we built the original state-machine from. In this case the original string is simply the individual state-transitions of this path joined back together. Otherwise there exist multiple strings that can be built from the provided ngrams.
While my previous answer assumed that the problem was to find an unknown string based on its ngrams, this answer will deal with the problem of finding all ways to construct a given string using its ngrams.
Assuming repetitions are allowed the solution is fairly simple: Generate all possible number sequences summing up to the length of the original string with no number larger than n and use these to create the ngram-combinations:
import numpy

def generate_sums(l, n, intermediate):
    if l == 0:
        yield intermediate
    elif l < 0:
        return
    else:
        for i in range(1, n + 1):
            yield from generate_sums(l - i, n, intermediate + [i])

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = numpy.cumsum(c)
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]
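For example, running this on the sentence from the question with n = 2 yields the five partitions listed earlier, each as a list of word lists rather than tuples:
for split in combinations('i am a string', 2):
    print(split)
# [['i'], ['am'], ['a'], ['string']]
# [['i'], ['am'], ['a', 'string']]
# [['i'], ['am', 'a'], ['string']]
# [['i', 'am'], ['a'], ['string']]
# [['i', 'am'], ['a', 'string']]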
EDIT:
As pointed out by @norok2 (thanks for the work) in the comments, it seems to be faster to use alternative cumsum implementations instead of the one provided by numpy for this use case.
END EDIT
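The referenced comments aren't reproduced here, but as one plain-Python possibility (my own illustration, not necessarily the variant benchmarked there), itertools.accumulate can stand in for numpy.cumsum:
from itertools import accumulate

def combinations(s, n):
    words = s.split(' ')
    for c in generate_sums(len(words), n, [0]):
        cs = list(accumulate(c))  # cumulative sums without numpy
        yield [words[l:u] for (l, u) in zip(cs, cs[1:])]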
If repetitions are not allowed things become a little bit more tricky. In this case we can use a non-deterministic finite automaton as defined in my previous answer and build our sequences based on traversals of the automaton:
def build_state_machine(s, n):
    next_state = 1
    transitions = {}
    for ng in ngrams(s.split(' '), n):
        state = 0
        for word in ng:
            if (state, word) not in transitions:
                transitions[(state, word)] = next_state
                next_state += 1
            state = transitions[(state, word)]
    return transitions
def combinations(s, n):
    transitions = build_state_machine(s, n)
    states = [(0, set(), [], [])]
    for word in s.split(' '):
        new_states = []
        for state, term_visited, path, cur_elem in states:
            if state not in term_visited:
                new_states.append((0, term_visited.union({state}), path + [tuple(cur_elem)], []))
            if (state, word) in transitions:
                new_states.append((transitions[(state, word)], term_visited, path, cur_elem + [word]))
        states = new_states
    return [path + [tuple(cur_elem)] if state != 0 else path
            for (state, term_visited, path, cur_elem) in states
            if state not in term_visited]
As an example the following state machine would be generated for the string "a b a":
Red connections indicate a switch to the next ngram and need to be handled separately (second if in the loop), since they can only be traversed once.
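For a concrete view of that machine, the raw transition table produced by build_state_machine for "a b a" with n = 2 comes out as follows (my own quick check):
print(build_state_machine('a b a', 2))
# {(0, 'a'): 1, (1, 'b'): 2, (0, 'b'): 3, (3, 'a'): 4}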

Way to pass multiple parameters in a for loop?

My code:
seperated = startContent.split(' ')
seperatedNum = len(seperated)

# Ask for user input
for word in seperated and for i in seperatedNum:
    if word == 'ADJECTIVE':
        seperated[i] = input('Enter an adjective:')
    elif word == 'NOUN':
        seperated[i] = input('Enter a noun:')
    elif word == 'ADVERB':
        seperated[i] = input('Enter an adverb:')
    elif word == 'VERB':
        seperated[i] = input('Enter a verb:')
Basically, I'm asking for user input each time the loop runs into one of these words (there can be multiple of each).
I get my sentence, split it into a list with split(), and run the loop for each word. I then want to edit the list using the list[x] = 'replacement' approach.
The word in seperated part returns the list item, so I need another argument passed to it, e.g. i in len(list), to get the accurate index of the word. I can't use list.index(str) because it returns the first index of the string when the text occurs multiple times.
You're looking for a way to pass multiple parameters in a for loop: there is nothing special about a for loop in this regard. The loop iterates over a given sequence and, on each iteration, assigns the current element to the given left-hand side.
for LEFT_HAND_SIDE in SEQUENCE
Python also supports "automatic" unpacking of sequences during assignments, as you can see in the following example:
>>> a, b = (4, 2)
>>> a
4
>>> b
2
In conclusion, you can just combine multiple variables on the left-hand side in your for loop with a sequence of sequences:
>>> for a, b in [(1, 2), (3, 4)]:
... print(a)
... print(b)
...
1
2
3
4
That for loop had two assignments a, b = (1, 2) and a, b = (3, 4).
In your specific case, you want to combine the value of an element in a sequence with its index. The built-in function enumerate comes in handy here:
>>> enumerate(["x", "y"])
<enumerate object at 0x7fc72f6685a0>
>>> list(enumerate(["x", "y"]))
[(0, 'x'), (1, 'y')]
So you could write your for loop like this:
for i, word in enumerate(seperated):
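Applied to the loop from the question, that becomes (a sketch keeping the original variable names and prompts):
for i, word in enumerate(seperated):
    if word == 'ADJECTIVE':
        seperated[i] = input('Enter an adjective:')
    elif word == 'NOUN':
        seperated[i] = input('Enter a noun:')
    elif word == 'ADVERB':
        seperated[i] = input('Enter an adverb:')
    elif word == 'VERB':
        seperated[i] = input('Enter a verb:')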

Convert decimal to Roman numerals

d_hsp={"1":"I","2":"II","3":"III","4":"IV","5":"V","6":"VI","7":"VII","8":"VIII",
"9":"IX","10":"X","11":"XI","12":"XII","13":"XIII","14":"XIV","15":"XV",
"16":"XVI","17":"XVII","18":"XVIII","19":"XIX","20":"XX","21":"XXI",
"22":"XXII","23":"XXIII","24":"XXIV","25":"XXV"}
HSP_OLD['tryl'] = HSP_OLD['tryl'].replace(d_hsp, regex=True)
HSP_OLD is a dataframe, tryl is one column of HSP_OLD, and here are some example values in tryl:
SAF/HSP: Secondary diagnosis E code 1
SAF/HSP: Secondary diagnosis E code 11
I use a dictionary for the replacement. It works for 1-10, but 11 becomes "II" and 12 becomes "III".
Sorry, I didn't notice that you're not merely updating the field but actually want to replace a number at the end. Even so, it's much better to properly convert your number to Roman numerals than to map every possible occurrence (what would happen with your code if there were a number larger than 25?). So, here's one way to do it:
ROMAN_MAP = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'), (100, 'C'), (90, 'XC'),
             (50, 'L'), (40, 'XL'), (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]

def romanize(data):
    if not data or not isinstance(data, str):  # we know how to work with strings only
        return data
    data = data.rstrip()  # remove potential extra whitespace at the end
    space_pos = data.rfind(" ")  # find the last space before the number
    if space_pos != -1:
        try:
            number = int(data[space_pos + 1:])  # get the number at the end
            roman_number = ""
            for i, r in ROMAN_MAP:  # loop-reduce substitution based on the ROMAN_MAP
                while number >= i:
                    roman_number += r
                    number -= i
            return data[:space_pos + 1] + roman_number  # put everything back together
        except (TypeError, ValueError):
            pass  # couldn't extract a number
    return data
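As a quick check on a single string before touching the DataFrame (result shown as a comment):
print(romanize("SAF/HSP: Secondary diagnosis E code 11"))
# SAF/HSP: Secondary diagnosis E code XI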
So now if we create your data frame as:
import pandas as pd

HSP_OLD = pd.DataFrame({"tryl": ["SAF/HSP: Secondary diagnosis E code 1",
                                 None,
                                 "SAF/HSP: Secondary diagnosis E code 11",
                                 "Something else without a number at the end"]})
We can now easily apply our function over the whole column with:
HSP_OLD['tryl'] = HSP_OLD['tryl'].apply(romanize)
Which results in:
tryl
0 SAF/HSP: Secondary diagnosis E code I
1 None
2 SAF/HSP: Secondary diagnosis E code XI
3 Something else without a number at the end
Of course, you can adapt the romanize() function to your needs to search for any number within your string and turn it into Roman numerals - this is just an example of how to quickly find the number at the end of the string.
You need to keep the order of the items and start searching with the longest substring.
You can use an OrderedDict here. To initialize it, use a list of tuples. You may reverse it right here when initializing, but you can also do it later.
import collections
import pandas as pd

# My test data
HSP_OLD = pd.DataFrame({'tryl': ['1. Text', '11. New Text', '25. More here']})

d_hsp_lst = [("1", "I"), ("2", "II"), ("3", "III"), ("4", "IV"), ("5", "V"),
             ("6", "VI"), ("7", "VII"), ("8", "VIII"), ("9", "IX"), ("10", "X"),
             ("11", "XI"), ("12", "XII"), ("13", "XIII"), ("14", "XIV"), ("15", "XV"),
             ("16", "XVI"), ("17", "XVII"), ("18", "XVIII"), ("19", "XIX"), ("20", "XX"),
             ("21", "XXI"), ("22", "XXII"), ("23", "XXIII"), ("24", "XXIV"), ("25", "XXV")]

d_hsp = collections.OrderedDict(d_hsp_lst)  # Creating the OrderedDict
d_hsp = collections.OrderedDict(reversed(d_hsp.items()))  # Here, reversing
>>> HSP_OLD['tryl'] = HSP_OLD['tryl'].replace(d_hsp, regex=True)
>>> HSP_OLD
tryl
0 I. Text
1 XI. New Text
2 XXV. More here
