Python2 tokenization and add to dictionary - python

I have some texts that I need to split into tokens on whitespace. I also need to remove all punctuation, as well as everything inside double brackets [[...]] (including the brackets themselves).
Each token will become a key in a dictionary whose value is a list.
I have tried a regex to remove the double-bracket patterns, as well as if/else chains, but I can't find a solution that works. At the moment I have:
tokenDic = dict()
splittedWords = re.findall(r'\[\[\s*([^][]*?)]]', docs[doc], re.IGNORECASE)
tokenStr = splittedWords.split()
for token in tokenStr:
    tokenDic[token].append(value)

Is this what you're looking for?
import re
value_list = []
inp_str = 'blahblah[[blahblah]]thi ng1[[junk]]hmm'
tokenDic = dict()
# remove everything in double brackets
bracket_stuff_removed = re.sub(r'\[\[[^]]*\]\]', '', inp_str)
# function to keep only letters and digits
clean_func = lambda x: 97 <= ord(x.lower()) <= 122 or 48 <= ord(x) <= 57
for token in bracket_stuff_removed.split(' '):
    cleaned_token = ''.join(filter(clean_func, token))
    tokenDic[cleaned_token] = list(value_list)
print(tokenDic)
Output:
{'blahblahthi': [], 'ng1hmm': []}
As for appending to the list, I don't have enough info right now to tell you the best way in your situation.
If you want to set the value when you're adding the key, do this:
tokenDic[cleaned_token] = [val1, val2, val3]
If you want to set the values after the key has been added, do this:
val_to_add = "something"
if tokenDic.get(cleaned_token, -1) == -1:
    print('ERROR', cleaned_token, 'does not exist in dict')
else:
    tokenDic[cleaned_token].append(val_to_add)
If you want to append directly to the dict in both cases, use defaultdict(list) (from the collections module) instead of dict. Then, if the key does not exist yet, the dict creates it with an empty list as its value before your value is appended.
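The defaultdict(list) behaviour described above can be sketched as follows. The (token, value) pairs and the name token_dic are invented for illustration; they are not from the question.

```python
from collections import defaultdict

# Hypothetical (token, value) pairs, invented for illustration only.
pairs = [("spam", 1), ("eggs", 2), ("spam", 3)]

token_dic = defaultdict(list)
for token, value in pairs:
    # A missing key is created with an empty list before append runs,
    # so no existence check is needed.
    token_dic[token].append(value)

print(dict(token_dic))
```

Note that merely reading token_dic[some_key] also creates the key, so use `in` or .get() when you only want to test for membership.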

To remove everything inside [[...]] you can use re.sub. Your regex already matches the right text, so just do this:
import re
x = "[[hello]]w&o%r*ld^$"
y = re.sub(r"\[\[\s*([^][]*?)]]", "", x)
z = re.sub(r"[^a-zA-Z\s]", "", y)
print(z)
This prints "world".

Related

how to recursively create nested list from string input

So, I would like to convert my string input
'f(g,h(a,b),a,b(g,h))'
into the following list
['f',['g','h',['a','b'],'a','b',['g','h']]]
Essentially, I would like to replace every '(' with [ and every ')' with ].
I have unsuccessfully tried to do this recursively. My idea was to iterate through the characters of my word and, when I hit a '(', create a new list and start extending the values into that newest list. When I hit a ')', I would stop extending the newest list and append it to the closest outer list. But I am very new to recursion, so I am struggling to think of how to do it.
word='f(a,f(a))'
empty=[]
def newlist(word):
    listy=[]
    for i, letter in enumerate(word):
        if letter=='(':
            return newlist([word[i+1:]])
        if letter==')':
            listy.append(newlist)
        else:
            listy.extend(letter)
    return empty.append(listy)
Assuming your input is something like this:
a = 'f,(g,h,(a,b),a,b,(g,h))'
We start by splitting it into primitive parts ("tokens"). Since your tokens are always a single symbol, this is rather easy:
tokens = list(a)
Now we need two functions to work with the list of tokens: next_token tells us which token we're about to process and pop_token marks a token as processed and removes it from the list:
def next_token():
    return tokens[0] if tokens else None
def pop_token():
    tokens.pop(0)
Your input consist of "items", separated by a comma. Schematically, it can be expressed as
items = item ( ',' item )*
In the python code, we first read one item and then keep reading further items while the next token is a comma:
def items():
    result = [item()]
    while next_token() == ',':
        pop_token()
        result.append(item())
    return result
An "item" is either a sublist in parentheses or a letter:
def item():
    return sublist() or letter()
To read a sublist, we check if the token is a '(', then use items above to read the content, and finally check for the ')' and panic if it is not there:
def sublist():
    if next_token() == '(':
        pop_token()
        result = items()
        if next_token() == ')':
            pop_token()
            return result
        raise SyntaxError()
letter simply returns the next token. You might want to add some checks here to make sure it's indeed a letter:
def letter():
    result = next_token()
    pop_token()
    return result
You can organize the above code like this: have one function parse that accepts a string and returns a list and put all functions above inside this function:
def parse(input_string):
    def items():
        ...
    def sublist():
        ...
    ...etc
    tokens = list(input_string)
    return items()
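Put together, the pieces above might look like the sketch below. It assumes the comma-separated input form used in this answer. Two details are additions that the answer does not contain: item() checks for '(' explicitly instead of relying on `sublist() or letter()`, and letter() rejects anything that is not alphanumeric.

```python
def parse(input_string):
    """Parse e.g. 'f,(g,h)' into ['f', ['g', 'h']]."""
    tokens = list(input_string)  # every character is one token

    def next_token():
        return tokens[0] if tokens else None

    def pop_token():
        tokens.pop(0)

    def items():
        # items = item ( ',' item )*
        result = [item()]
        while next_token() == ',':
            pop_token()
            result.append(item())
        return result

    def item():
        # Explicit check (an addition to the answer's code).
        if next_token() == '(':
            return sublist()
        return letter()

    def sublist():
        pop_token()  # consume '('
        result = items()
        if next_token() != ')':
            raise SyntaxError("expected ')'")
        pop_token()  # consume ')'
        return result

    def letter():
        result = next_token()
        # Extra validation (an addition to the answer's code).
        if result is None or not result.isalnum():
            raise SyntaxError("expected a letter, got %r" % (result,))
        pop_token()
        return result

    parsed = items()
    if tokens:
        raise SyntaxError("unexpected trailing input")
    return parsed


print(parse('f,(g,h,(a,b),a,b,(g,h))'))
# ['f', ['g', 'h', ['a', 'b'], 'a', 'b', ['g', 'h']]]
```

Because tokens.pop(0) is linear in the list length, a collections.deque or an index into the string would scale better for long inputs.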
Quite an interesting question, and one I originally misinterpreted. But now this solution works accordingly. Note that I have used list concatenation + operator for this solution (which you usually want to avoid) so feel free to improve upon it however you see fit.
Good luck, and I hope this helps!
# set some global values, I prefer to keep it
# as a set in case you need to add functionality
# eg if you also want {{a},b} or [ab<c>ed] to work
OPEN_PARENTHESIS = set(["("])
CLOSE_PARENTHESIS = set([")"])
SPACER = set([","])
def recursive_solution(input_str, index):
    # base case A: when index exceeds or equals len(input_str)
    if index >= len(input_str):
        return [], index
    char = input_str[index]
    # base case B: when we reach a closed parenthesis stop this level of recursive depth
    if char in CLOSE_PARENTHESIS:
        return [], index
    # do the next recursion, return its value and the index it stops at
    recur_val, recur_stop_i = recursive_solution(input_str, index + 1)
    # with an open parenthesis, we want to continue the recursion after its associated
    # closed parenthesis. and also the recur_val should be within a new dimension of the list
    if char in OPEN_PARENTHESIS:
        continued_recur_val, continued_recur_stop_i = recursive_solution(input_str, recur_stop_i + 1)
        return [recur_val] + continued_recur_val, continued_recur_stop_i
    # for spacers eg "," we just ignore it
    if char in SPACER:
        return recur_val, recur_stop_i
    # and finally with normal characters, we just extend it
    return [char] + recur_val, recur_stop_i
You can get the expected answer using the following code but it's still in string format and not a list.
import re
a='(f(g,h(a,b),a,b(g,h))'
ans=[]
sub=''
def rec(i,sub):
    if i>=len(a):
        return sub
    if a[i]=='(':
        if i==0:
            sub=rec(i+1,sub+'[')
        else:
            sub=rec(i+1,sub+',[')
    elif a[i]==')':
        sub=rec(i+1,sub+']')
    else:
        sub=rec(i+1,sub+a[i])
    return sub
b=rec(0,'')
print(b)
b=re.sub(r"([a-z]+)", r"'\1'", b)
print(b,type(b))
Output
[f,[g,h,[a,b],a,b,[g,h]]
['f',['g','h',['a','b'],'a','b',['g','h']] <class 'str'>
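One possible way to end up with a real list rather than a string is to build a valid Python literal and pass it to ast.literal_eval. This is only a sketch under stated assumptions: the parentheses are balanced, the input does not start with '(', and names consist of letters only. to_nested_list is a hypothetical name chosen for illustration.

```python
import ast
import re


def to_nested_list(expr):
    # Quote every run of letters: f(g) -> 'f'('g')
    quoted = re.sub(r"([A-Za-z]+)", r"'\1'", expr)
    # Turn each "(" into ",[" and each ")" into "]": 'f'('g') -> 'f',['g']
    bracketed = quoted.replace("(", ",[").replace(")", "]")
    # Wrap in an outer list and let Python parse the literal safely.
    return ast.literal_eval("[" + bracketed + "]")


print(to_nested_list('f(g,h(a,b),a,b(g,h))'))
# ['f', ['g', 'h', ['a', 'b'], 'a', 'b', ['g', 'h']]]
```

ast.literal_eval only accepts literals, so unlike eval it will not execute arbitrary code if the input is malformed; it raises an exception instead.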

How do I find the predominant letters in a list of strings

For each position in the strings, I want to find the character that appears most often at that position. If several characters have the same frequency, keep the first one. All strings in the list are guaranteed to be of identical length.
I tried the following way:
print(max(((letter, strings.count(letter)) for letter in strings), key=lambda x:[1])[0])
But I get: mistul or qagic
And I can not figure out what's wrong with my code.
My list of strings looks like this:
Input: strings = ['mistul', 'aidteh', 'mhfjtr', 'zxcjer']
Output: mister
Explanation: In the first position, m appears twice. In the second, i appears twice. In the third, there is no predominant character, so we choose the first, that is, s. In the fourth position, we have t twice and j twice, but t is seen first, so we keep it. In the fifth position we have e twice, and in the last, r twice.
More examples:
Input: ['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih', 'mbpzu', 'pbghn', 'mzsev', 'saqbl', 'myead']
Output: magic
Input: ['sacbkt', 'tnqaex', 'vhcrhl', 'obotnq', 'vevleg', 'rljnlv', 'jdcjrk', 'zuwtee', 'xycbvm', 'szgczt', 'imhepi', 'febybq', 'pqkdfg', 'swwlds', 'ecmrut', 'buwruy', 'icjwet', 'gebgbq', 'djtfzr', 'uenleo']
Expected Output: secret
Some help?
Finally a use case for zip() :-)
If you like cryptic code, it could even be done in one statement:
def solve(strings):
    return ''.join([max([(letter, letters.count(letter)) for letter in letters], key=lambda x: x[1])[0] for letters in zip(*strings)])
But I prefer a more readable version:
def solve(strings):
    result = ''
    # "zip" the strings, so in the first iteration `letters` would be a list
    # containing the first letter of each word, the second iteration it would
    # be a list of all second letters of each word, and so on...
    for letters in zip(*strings):
        # Create a list of (letter, count) pairs:
        letter_counts = [(letter, letters.count(letter)) for letter in letters]
        # Get the first letter with the highest count, and append it to result:
        result += max(letter_counts, key=lambda x: x[1])[0]
    return result
# Test function with input data from question:
assert solve(['mistul', 'aidteh', 'mhfjtr', 'zxcjer']) == 'mister'
assert solve(['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih', 'mbpzu', 'pbghn',
              'mzsev', 'saqbl', 'myead']) == 'magic'
assert solve(['sacbkt', 'tnqaex', 'vhcrhl', 'obotnq', 'vevleg', 'rljnlv',
              'jdcjrk', 'zuwtee', 'xycbvm', 'szgczt', 'imhepi', 'febybq',
              'pqkdfg', 'swwlds', 'ecmrut', 'buwruy', 'icjwet', 'gebgbq',
              'djtfzr', 'uenleo']) == 'secret'
UPDATE
@dun suggested a smarter way of using the max() function, which makes the one-liner actually quite readable :-)
def solve(strings):
    return ''.join([max(letters, key=letters.count) for letters in zip(*strings)])
Using collections.Counter() is a nice strategy here. Here's one way to do it:
from collections import Counter
def most_freq_at_index(strings, idx):
    chars = [s[idx] for s in strings]
    char_counts = Counter(chars)
    return char_counts.most_common(n=1)[0][0]
strings = ['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih',
           'mbpzu', 'pbghn', 'mzsev', 'saqbl', 'myead']
result = ''.join(most_freq_at_index(strings, idx) for idx in range(5))
print(result)
## 'magic'
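The Counter approach can also be combined with zip(*strings), which avoids hard-coding the string length. A minimal sketch, assuming Python 3.7 or later (where Counter returns elements with equal counts in first-seen order); predominant is a hypothetical name used for illustration.

```python
from collections import Counter


def predominant(strings):
    # zip(*strings) yields one tuple of characters per position.
    # most_common(1) returns the top (char, count) pair; among equal
    # counts, the character encountered first wins on Python 3.7+.
    return ''.join(
        Counter(column).most_common(1)[0][0] for column in zip(*strings)
    )


print(predominant(['mistul', 'aidteh', 'mhfjtr', 'zxcjer']))
# mister
```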
If you want something more manual without the magic of Python libraries you can do something like this:
def f(strings):
    dic = {}
    for string in strings:
        for i in range(len(string)):
            word_dic = dic.get(i, { string[i]: 0 })
            word_dic[string[i]] = word_dic.get(string[i], 0) + 1
            dic[i] = word_dic
    largest_string = max(strings, key = len)
    result = ""
    for i in range(len(largest_string)):
        result += max(dic[i], key = lambda x : dic[i][x])
    return result
strings = ['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih', 'mbpzu', 'pbghn', 'mzsev', 'saqbl', 'myead']
f(strings)
'magic'

obtaining substring from square bracket in a sentence

As a Python beginner, I would like to extract the strings inside square brackets, preferably without importing any modules (though importing one is okay if necessary).
For example,
def find_tags(text):
    ...  # do some codes
x = find_tags('Hi[Pear]')
print(x)
it will return
1-Pear
If there is more than one pair of brackets, for example,
x = find_tags('[apple]and[orange]and[apple]again!')
print(x)
it will return
1-apple,2-orange,3-apple
I would greatly appreciate it if someone could help me out. Thanks!
Here, I tried solving it. Here is my code :
bracket_string = '[apple]and[orange]and[apple]again!'
def find_tags(string1):
    start = False
    data = ''
    data_list = []
    for i in string1:
        if i == '[':
            start = True
        if i != ']' and start == True:
            if i != '[':
                data += i
        else:
            if data != '':
                data_list.append(data)
            data = ''
            start = False
    return(data_list)
x = find_tags(bracket_string)
print(x)
The function will return a list of items that were between brackets of a given string parameter.
Any advice will be appreciated.
If your pattern is consistent like [sometext]sometext[sometext]... you can implement your function like this:
import re
def find_tags(expression):
    r = re.findall(r'(\[[a-zA-Z]+\])', expression)
    return ",".join([str(index + 1) + "-" + item.replace("[", "").replace("]", "") for index, item in enumerate(r)])
By the way, you can also use a stack data structure (LIFO) to solve this problem.
You can solve this using a simple for loop over all characters of your text.
You have to remember whether you are inside or outside a tag. If you are inside, you add the letter to a temporary list; when you encounter the end of a tag, you add the whole temporary list as a word to a return list.
You can solve the numbering using enumerate(iterable, start=1) of the list of words:
def find_tags(text):
    inside_tag = False
    tags = []  # list of all tag-words
    t = []     # list to collect all letters of a single tag
    for c in text:
        if not inside_tag:
            inside_tag = c == "["  # we are inside as soon as we encounter [
        elif c != "]":
            t.append(c)  # happens only if inside a tag and not tag ending
        else:
            tags.append(''.join(t))  # construct tag from t and set inside back to false
            inside_tag = False
            t = []  # clear temporary list
    if t:
        tags.append(''.join(t))  # in case we have leftover tag characters ( "[tag" )
    return list(enumerate(tags, start=1))  # create enumerated list
x = find_tags('[apple]and[orange]and[apple]again!')
# x is a list of tuples (number, tag):
for nr, tag in x:
    print("{}-{}".format(nr, tag), end = ", ")
The end=", " argument makes print emit a comma and a space after each entry instead of a newline, so all entries appear on one line (with a trailing separator after the last one).
x looks like: [(1, 'apple'), (2, 'orange'), (3, 'apple')]
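To get exactly the 1-apple,2-orange,3-apple string from the question, without imports and without a trailing separator, something like the following sketch may work. It assumes brackets are not nested and silently ignores an unmatched '['.

```python
def find_tags(text):
    tags = []
    start = text.find('[')
    while start != -1:
        end = text.find(']', start + 1)
        if end == -1:
            break  # unmatched '[': ignore the rest of the text
        tags.append(text[start + 1:end])
        start = text.find('[', end + 1)
    # Number the tags from 1 and join them with commas.
    return ','.join('{}-{}'.format(n, tag) for n, tag in enumerate(tags, start=1))


print(find_tags('[apple]and[orange]and[apple]again!'))
# 1-apple,2-orange,3-apple
```

str.join places the separator only between items, which is why no trailing comma appears.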

How to delete repeating letters in a string?

I am trying to write a function which will return me the string of unique characters present in the passed string. Here's my code:
def repeating_letters(given_string):
    counts = {}
    for char in given_string:
        if char in counts:
            return char
        else:
            counts[char] = 1
            if counts[char] > 1:
                del(char)
            else:
                return char
I am not getting the expected results with it. How can I get the desired result?
Here when I am passing this string as input:
sample_input = "abcadb"
I am expecting the result to be:
"abcd"
However my code is returning me just:
nothing
def repeating_letters(given_string):
    seen = set()
    ret = []
    for c in given_string:
        if c not in seen:
            ret.append(c)
            seen.add(c)
    return ''.join(ret)
Here we add each letter to the set seen the first time we see it, at the same time adding it to a list ret. Then we return the joined list.
If the order of characters in the resulting string matters, here is a one-liner that achieves this using set with sorted:
>>> my_str = 'abcadbgeg'
>>> ''.join(sorted(set(my_str),key=my_str.index))
'abcdge'
Here sorted will sort the characters in the set based on the first index of each in the original string, resulting in ordered list of characters.
However, if the order in the resulting string doesn't matter, you may simply do the following (set iteration order is arbitrary, so your output may differ):
>>> ''.join(set(my_str))
'acbedg'
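Another order-preserving option is dict.fromkeys, sketched below. It assumes Python 3.7 or later, where dicts are guaranteed to keep insertion order; unique_chars is a hypothetical name used for illustration.

```python
def unique_chars(text):
    # dict keys are unique and keep insertion order on Python 3.7+,
    # so each character survives only at its first occurrence.
    return ''.join(dict.fromkeys(text))


print(unique_chars('abcadb'))
# abcd
```

Unlike sorted(set(s), key=s.index), this makes a single pass over the string instead of calling index for every distinct character.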

How to collect defined items in lists python

I have to find the signs "a,...,z", "A,...,Z", space, "." and "," in some data.
I have tried this code:
fh = codecs.open("mydata.txt", encoding = "utf-8")
text = fh.read()
fh1 = unicode(text)
dic_freq_signs = dict(Counter(fh1.split()))
All_freq_signs = dic_freq_signs.items()
List_signs = dic_freq_signs.keys()
List_freq_signs = dic_freq_signs.values()
But it gives me ALL signs, not just the ones I am looking for.
Can anyone help?
(And it has to be unicode)
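Filter the dictionary items while iterating over them, for example:
All_freq_signs = [item for item in dic_freq_signs.items() if item[0] == "somevalue"]
or use a separate predicate function that receives each (key, value) pair:
def criteria(item):
    key, value = item
    return value % 2 == 0
All_freq_signs = [item for item in dic_freq_signs.items() if criteria(item)]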
Make sure you import the string module; with it you can get the character ranges a to z and A to Z easily:
import string
from collections import Counter
Counter(any_string) gives the count of each character in the string. With split(), the counter instead returns the count of each word in the string, which contradicts your requirement. So I have assumed that you need character counts.
dic_all_chars = dict(Counter(fh1))  # this gives counts of all characters in the string
signs = string.lowercase + string.uppercase + ' .,'  # these are the characters you want to check
# using dict comprehension and checking if the key is in the characters you want
dic_freq_signs = {key: value for key, value in dic_all_chars.items()
                  if key in signs}
dic_freq_signs would only have the signs that you want to count as keys and their counts as values.
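The code above targets Python 2 (string.lowercase and unicode() do not exist in Python 3). A rough Python 3 equivalent might look like the sketch below; count_signs and the sample text are invented for illustration, and string.ascii_letters covers exactly a-z and A-Z.

```python
import string
from collections import Counter


def count_signs(text):
    # The characters of interest: ASCII letters, space, period and comma.
    wanted = set(string.ascii_letters + ' .,')
    # Counter(text) counts every character; keep only the wanted ones.
    return {ch: n for ch, n in Counter(text).items() if ch in wanted}


sample = "Hi, you. Hi!"  # invented sample text
print(count_signs(sample))
# {'H': 2, 'i': 2, ',': 1, ' ': 2, 'y': 1, 'o': 1, 'u': 1, '.': 1}
```

In Python 3 every str is already Unicode, so reading the file with open("mydata.txt", encoding="utf-8") and passing its contents to the function should be sufficient.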
