What I would like to do is to separate each word of a string and dynamically create variables that I can use independently afterwards, like:
String = 'Jurassic Park 3'
Variable1= 'jurassic'
Variable2= 'park'
Variable3= '3'
The thing is, the string can be very long. So it has to be able to handle a sentence containing many words.
I already have the beginning of some code:
import re
input = str(d["input_text"])
l = []
regex = re.compile(r'(\d+|\s+)')
s = regex.split(input)
l = []
for elem in s:
    if elem == '':
        continue
    else:
        l.append(elem)
ret_dict = {}
ret_dict["text_list"] = l
ret_dict["returncode"] = 0 # set to 1 to output on the FAILURE output
return ret_dict
With that code, at the moment I have this:
input variable = input_text
output variable = text_list
I would like
input variable = input_text
output variable1 = variable1
output variable2 = variable2
output variable3 = variable3
output variable4 = variable4
etc
When having to define an unknown amount of variables, I like to use Dictionaries. It could be done like this:
string = 'Jurassic Park 3' # Original string
string_list = string.split() # Returns: ['Jurassic', 'Park', '3']
dic = {}
for i in range(len(string_list)):
    var_name = 'Variable{}'.format(i+1) # Define name of variable, start with 'Variable1'
    dic[var_name] = string_list[i] # Insert variable name as key and list entry as value
Printing the dictionary will return:
{'Variable1': 'Jurassic', 'Variable2': 'Park', 'Variable3': '3'}
To access e.g. Variable2, you could do:
dic['Variable2']
which returns
'Park'
If the number of variables becomes large, I think having them collected in a dictionary is easier to handle than having the variables defined individually like your question suggests.
If you had e.g. 100 variables but were unsure of the count, it would be easy to check the size of the Dictionary. It would probably be a little harder to keep track of all those variables when they are scattered around and not collected in a bunch.
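As a rough sketch, the same dictionary idea can be wrapped in a function. The name words_to_variables and the prefix parameter are made up for illustration, and the lowercasing is only there to match the lowercase values shown in the question:

```python
def words_to_variables(text, prefix="Variable"):
    """Map 'Variable1', 'Variable2', ... to the lowercased words of text."""
    # enumerate(..., start=1) numbers the words from 1, like Variable1 in the question
    return {
        "{}{}".format(prefix, i): word.lower()
        for i, word in enumerate(text.split(), start=1)
    }


variables = words_to_variables("Jurassic Park 3")
print(variables["Variable2"])  # the second word, lowercased
print(len(variables))          # number of words found
```

This works for a sentence of any length, and len(variables) tells you how many words were found.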
First you can split your string in a list like this
string = "Jurassic Park 3"
string_list = string.split()
printing this will output:
['Jurassic', 'Park', '3']
Then we iterate through the list like this:
for i in range(0, len(string_list)):
    exec("word%d = %s" % (i + 1, repr(string_list[i])))
This goes through the list of words and assigns each one to its own variable (word1, word2, word3, ...) for as many words as the string contains.
Hope this helps
I found something else, and it does work. I just need to do one more thing with it: if an element of the list is a space, add it to the beginning of the next element.
import re
input = str(d["Jurassic Park 3"])
l = []
regex = re.compile(r'(\d+|\s+)')
s = regex.split(input)
l = []
for elem in s:
    if elem == '':
        continue
    else:
        l.append(elem)
ret_dict = {}
ret_dict["text_list"] = l
ret_dict["returncode"] = 0 # set to 1 to output on the FAILURE output
return ret_dict
This returns:
["Jurassic"," ","Park"," ","3"]
and I would like to have:
["Jurassic"," Park"," 3"]
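One possible way to get there, as a sketch only: keep the same regex split, then glue every whitespace-only element onto the front of the element that follows it. The function name split_keep_leading_space is an assumption, not something from the original code:

```python
import re


def split_keep_leading_space(text):
    """Split on digits/whitespace, attaching each whitespace run to the next token."""
    # Same split as in the question; drop the empty strings re.split produces
    tokens = [t for t in re.split(r'(\d+|\s+)', text) if t]
    merged = []
    pending = ''  # whitespace waiting to be prepended to the next token
    for tok in tokens:
        if tok.isspace():
            pending += tok
        else:
            merged.append(pending + tok)
            pending = ''
    if pending:
        # trailing whitespace has no following token, so keep it as its own element
        merged.append(pending)
    return merged


print(split_keep_leading_space("Jurassic Park 3"))
```

For "Jurassic Park 3" this gives the three elements 'Jurassic', ' Park' and ' 3'.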
As the title suggests, I'm trying to parse a piece of code into a tree or a list.
First off, I would like to thank you for any contribution and time spent on this.
So far my code is doing what I expect, yet I am not sure that this is the optimal / most generic way to do this.
Problem
1. I want to have a more generic solution, since in the future I am going to need further analysis of this syntax.
2. I am unable right now to separate the operators like '=' or '>=', as you can see below in the output I share.
In the future I might change the content of the list / tree from strings to tuples so I can identify the kind of operator (parameter, comparison like = or >= ....). But this is not a real need right now.
Research
My first attempt was parsing the text character by character, but my code was getting too messy and barely readable, so I assumed that I was doing something wrong there (I don't have that code to share here anymore).
So I started looking around at how people were doing it and found some approaches that didn't necessarily fulfil the requirements of being simple and generic.
I would share the links to the sites but I didn't keep track of them.
The Syntax of the code
The syntax is pretty simple; after all, I'm not interested in types or any further detail, just the functions and parameters.
Strings are defined as 'my string', variables as !variable, and numbers as in any other language.
Here is a sample of code:
db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)
My Output
Here my output is partially correct, since I'm still unable to separate the "= '3'" part (of course I have to separate it, because in this case it's a comparison operator and not part of a string).
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]
Desired Output
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]
My code so far
The parseRecursive method is the entry point.
import re
class FileParser:
    #order is important to avoid miss splits
    COMPARATOR_SIGN = {
        '#='
        ,'#<>'
        ,'<>'
        ,'>='
        ,'<='
        ,'='
        ,'>'
        ,'<'
    }
    def __init__(self):
        pass
    def __charExistsInOccurences(self,current_needle, needles, text):
        """
        check if other needles are present in text
        current_needle : string -> the current needle being evaluated
        needles : list -> list of needles
        text : string/list<string> -> a string or a list of string to evaluate
        """
        #if text is a string convert it to list of strings
        text = text if isinstance(text, list) else [text]
        exists = False
        for t in text:
            #check if needle is inside text value
            for needle in needles:
                #dont check the same key
                if needle != current_needle:
                    regex_search_needle = split_regex = '\s*'+'\s*'.join(needle) + '\s*'
                    #list of 1's and 0's . 1 if another character is found in the string.
                    found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                    if sum(found) > 0:
                        exists = True
                        break
        return exists
    def findOperator(self, needles, haystack):
        """
        split parameters from operators
        needles : list -> list of operators
        haystack : string
        """
        string_open = haystack.find("'")
        #if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0
        occurences = []
        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            #regex to ignore the possible spaces between characters of the needle
            split_regex = '\s*'+'\s*'.join(needle) + '\s*'
            #parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure+1:])
            #check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)
            #if the operator has been found merge the results with the occurences and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend(before_string)
                occurences.extend([haystack[string_open:string_closure+1]])
                occurences.extend(after_string)
                operator = needle
        #filter blank spaces generated
        occurences = list(filter(lambda x: len(x.strip())>0,occurences))
        result_check = [1 if x==haystack else 0 for x in occurences]
        #if the haystack was originaly a simple string like '1' the occurences list is going to be filled with the same character over and over due to the before string an after string part
        if len(result_check) == sum(result_check):
            occurences= [haystack]
            operator = ''
        return operator, occurences
    def parseRecursive(self,text):
        """
        parse a block of text
        text : string
        """
        assert(len(text) < 1, "text is empty")
        function_open = text.find('(')
        accumulated_params = []
        if function_open > -1:
            #there is another function nested
            text_prev_function = text[0:function_open]
            #find last space coma or equal to retrieve the function name
            last_space = -1
            for j in range(len(text_prev_function)-1, 0 , -1):
                if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
                    last_space = j
                    break
            func_name = ''
            if last_space > -1:
                #there is something else behind the function name
                func_name = text_prev_function[last_space+1:]
                #no parentesis before so previous characters from function name are parameters
                text_prev_func_params = list(filter(lambda x: len(x.strip())>0,text_prev_function[:last_space+1].split(',')))
                text_prev_func_params = [x.strip() for x in text_prev_func_params]
                #debug here
                #accumulated_params.extend(text_prev_func_params)
                for itext_prev in text_prev_func_params:
                    operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN,itext_prev)
                    if operator == '':
                        accumulated_params.extend(text_prev_operator)
                    else:
                        text_prev_operator.append(operator)
                        accumulated_params.extend(text_prev_operator)
                        #accumulated_params.extend(text_prev_operator)
            else:
                #function name is the start of the string
                func_name = text_prev_function[0:].strip()
            #find the closure of parentesis
            function_close = text.rfind(')')
            #parse the next function and extend the current list of parameters
            next_func = text[function_open+1:function_close]
            func_params = {func_name : self.parseRecursive(next_func)}
            accumulated_params.append(func_params)
            #
            # parameters after the function
            #
            new_text = text[function_close+1:]
            accumulated_params.extend(self.parseRecursive(new_text))
        else:
            #there is no other function nested
            split_text = text.split(',')
            current_func_params = list(filter(lambda x: len(x.strip())>0,split_text))
            current_func_params = [x.strip() for x in current_func_params]
            accumulated_params.extend(current_func_params)
        #accumulated_params = list(filter(lambda x: len(x.strip())>0,accumulated_params))
        return accumulated_params
text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))
You can use pyparsing to deal with such a case.
* pyparsing can be installed by pip install pyparsing
Code:
import pyparsing as pp
# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)
# A recursive function to transform a pyparsing result into your desirable format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack
# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)
# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
# Show the result
print(result)
Output:
[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
Note:
If there is an unbalanced parenthesis inside () (for example a(b(c), a(b)c), etc), an unexpected result is obtained or an IndexError is raised. So be careful in such cases.
So far only a single sample is available to build the parsing pattern from, so if you run into a parsing error, add more examples to your question.
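If an extra dependency is not an option, a similar result can be sketched with only the standard library: first tokenize with one regex, then build the nested structure with a stack. This is just a rough sketch based on the single sample in the question. The names TOKEN_RE, tokenize and build are made up here, and the operator list is taken from COMPARATOR_SIGN in the question:

```python
import re

# One alternative per token kind; leading whitespace is skipped before each token.
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<string>'[^']*')
      | (?P<op>\#<>|\#=|<>|>=|<=|=|>|<)
      | (?P<punct>[(),])
      | (?P<word>[^'(),=<>\#]+)
    )
""", re.VERBOSE)


def tokenize(text):
    """Return a list of (kind, value) pairs; operators come out as separate tokens."""
    text = text.strip()
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at position {}".format(pos))
        # words such as '!Element Structure' may carry trailing spaces, so strip them
        tokens.append((m.lastgroup, m.group(m.lastgroup).strip()))
        pos = m.end()
    return tokens


def build(tokens):
    """Turn the flat token list into nested lists/dicts like the desired output."""
    root = []
    stack = [root]  # stack[-1] is the argument list currently being filled
    for kind, value in tokens:
        if kind == 'punct' and value == ',':
            continue  # commas only separate arguments
        if kind == 'punct' and value == '(':
            name = stack[-1].pop()  # the word just before '(' is the function name
            args = []
            stack[-1].append({name: args})
            stack.append(args)
        elif kind == 'punct' and value == ')':
            stack.pop()
        else:
            stack[-1].append(value)
    return root


text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
print(build(tokenize(text)))
```

Since every token carries its kind ('string', 'op', 'word'), switching from plain strings to tuples later only means appending (kind, value) instead of value in build. Like the pyparsing version, this sketch does not check for unbalanced parentheses.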
I want to write a python function that takes 2 parameters:
List of words and
Ending letters
I want my function to work in such a way that it modifies the original list of words and removes the words which end with the "ending letters" specified.
For example:
list_words = ["hello", "jello","whatsup","right", "cello", "estello"]
ending = "ello"
my_func(list_words, ending)
This should give the following output:
list_words = ["whatsup","right"]
It should pop off all the strings that end with the ending letters given in the second argument of the function.
I can code this function using the .endswith method but I am not allowed to use it. How else can I do this using a loop?
Try:
def my_func(list_words, ending):
    return [word for word in list_words if word[len(word)-len(ending):] != ending]
def filter_words(list_words, ending):
    return [*filter(lambda x: x[-len(ending):] != ending , list_words)]
Not allowed to use endswith? Not a problem :-P
def my_func(list_words, ending):
    list_words[:] = [word for word in list_words
                     if not word[::-1].startswith(ending[::-1])]
    return list_words
Loopholes ftw.
(Adapted to your insistence on modifying the given list. You should probably really decide whether to modify or return, though, not do both, which is rather unusual in Python.)
You can easily check the last 4 characters of a string using string[-4:].
So you can use the code below:
list_words = ["hello", "jello","whatsup","right", "cello", "estello"]
ending = "ello"
def my_func(wordsArray, endingStr):
    endLen = len(endingStr)
    output = []
    for x in wordsArray:
        if not x[-endLen:] == endingStr:
            output.append(x)
    return output
list_words = my_func(list_words, ending)
You can shorten the function with some list comprehension like this:
def short_func(wordsArray, endingStr):
    endLen = len(endingStr)
    output = [x for x in wordsArray if x[-endLen:] != endingStr]
    return output
list_words = short_func(list_words, ending)
It is generally better not to modify the existing list. Instead, you can build a new list that leaves out the words with the specified ending, as below. If you want it as a function, you can write it in the following manner and then assign the filtered list back to list_words.
def format_list(words, ending):
    new_list = []
    n = len(ending)
    for word in words:
        if len(word) >= n and n > 0:
            if not word[-n:] == ending:
                new_list.append(word)
        else:
            new_list.append(word)
    return new_list
list_words = format_list(list_words, ending)
print(list_words)
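Since the question asks for a loop that modifies the original list, here is a sketch of that variant. The function name remove_words_ending_with is made up. It walks the list backwards so that deleting an item does not shift the indexes still to be visited, and it compares a slice instead of calling .endswith:

```python
def remove_words_ending_with(list_words, ending):
    """Delete, in place, every word whose last len(ending) characters equal ending."""
    n = len(ending)
    # Go from the last index down to 0 so deletions never skip an element
    for i in range(len(list_words) - 1, -1, -1):
        word = list_words[i]
        if n <= len(word) and word[len(word) - n:] == ending:
            del list_words[i]


list_words = ["hello", "jello", "whatsup", "right", "cello", "estello"]
remove_words_ending_with(list_words, "ello")
print(list_words)
```

The function returns nothing on purpose: it only mutates the list it was given, which follows the usual Python convention of either modifying or returning, not both.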
I have some texts that I need to split into tokens on spaces. Furthermore, I need to remove all punctuation, and I also need to remove everything inside double brackets [[...]] (including the double brackets).
Each token I will put on a dictionary as the key that will have a list of values.
I have tried regexes and if-elses to remove these double-bracket patterns, but I can't find a solution that works. For the moment I have:
tokenDic = dict()
splittedWords = re.findall(r'\[\[\s*([^][]*?)]]', docs[doc], re.IGNORECASE)
tokenStr = splittedWords.split()
for token in tokenStr:
    tokenDic[token].append(value)
Is this what you're looking for?
import re
value_list = []
inp_str = 'blahblah[[blahblah]]thi ng1[[junk]]hmm'
tokenDic = dict()
#remove everything in double brackets
bracket_stuff_removed = re.sub(r'\[\[[^]]*\]\]', '', inp_str)
#function to keep only letters and digits
clean_func = lambda x: 97 <= ord(x.lower()) <= 122 or 48 <= ord(x) <= 57
for token in bracket_stuff_removed.split(' '):
    cleaned_token = ''.join(filter(clean_func, token))
    tokenDic[cleaned_token] = list(value_list)
print(tokenDic)
Output:
{'blahblahthi': [], 'ng1hmm': []}
As for appending to the list, I don't have enough info right now to tell you the best way in your situation.
If you want to set the value when you're adding the key, do this:
tokenDic[cleaned_token] = [val1, val2, val3]
If you want to set the values after the key has been added, do this:
val_to_add = "something"
if tokenDic.get(cleaned_token, -1) == -1:
    print('ERROR', cleaned_token, 'does not exist in dict')
else:
    tokenDic[cleaned_token].append(val_to_add)
If you want to append directly to the dict in both cases, you'll need to use defaultdict(list) instead of dict. Then, if the key does not exist in the dict, it will create it, make the value an empty list, and then add your value.
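A minimal sketch of the defaultdict(list) approach follows. The question does not say what the values are, so this assumes, purely for illustration, that docs maps a document id to its text and that each token collects the ids of the documents it appears in. The name index_tokens is made up:

```python
import re
from collections import defaultdict


def index_tokens(docs):
    """Map each cleaned token to the list of document ids it occurs in."""
    token_dic = defaultdict(list)  # missing keys start out as an empty list
    for doc_id, text in docs.items():
        # drop [[...]] blocks, brackets included (non-greedy, may span lines)
        text = re.sub(r'\[\[.*?\]\]', '', text, flags=re.DOTALL)
        # drop punctuation: anything that is neither a word character nor whitespace
        text = re.sub(r'[^\w\s]', '', text)
        for token in text.split():
            token_dic[token].append(doc_id)  # no "key exists" check needed
    return token_dic


docs = {"d1": "hello [[junk]] world!", "d2": "world, again"}
print(dict(index_tokens(docs)))
```

Note that \w also keeps underscores and non-ASCII letters. If only ASCII letters and digits should survive, a character class such as [^a-zA-Z0-9\s] could be used instead.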
To remove everything inside [[ ]] you can use re.sub, and you already have the correct regex, so just do this:
import re
x = "[[hello]]w&o%r*ld^$"
y = re.sub(r"\[\[\s*([^][]*?)]]", "", x)
z = re.sub(r"[^a-zA-Z\s]", "", y)
print(z)
This prints "world".
As a Python beginner, I would like to extract the strings inside square brackets, preferably without importing any modules. If that is not possible, that's okay.
For example,
def find_tags(text):
    # do some code
    ...
x = find_tags('Hi[Pear]')
print(x)
it will return
1-Pear
if there are more than one brackets for example,
x = find_tags('[apple]and[orange]and[apple]again!')
print(x)
it will return
1-apple,2-orange,3-apple
I would greatly appreciate if someone could help me out thanks!
Here, I tried solving it. Here is my code :
bracket_string = '[apple]and[orange]and[apple]again!'
def find_tags(string1):
    start = False
    data = ''
    data_list = []
    for i in string1:
        if i == '[':
            start = True
        if i != ']' and start == True:
            if i != '[':
                data += i
        else:
            if data != '':
                data_list.append(data)
            data = ''
            start = False
    return(data_list)
x = find_tags(bracket_string)
print(x)
The function will return a list of items that were between brackets of a given string parameter.
Any advice will be appreciated.
If your pattern is consistent like [sometext]sometext[sometext]... you can implement your function like this:
import re
def find_tags(expression):
    r = re.findall('(\[[a-zA-Z]+\])', expression)
    return ",".join([str(index + 1) + "-" + item.replace("[", "").replace("]", "") for index, item in enumerate(r)])
Btw, you can also use a stack data structure (LIFO) to solve this problem.
You can solve this using a simple for loop over all characters of your text.
You have to remember whether you are inside a tag or outside a tag. If inside, you add the letter to a temporary list; when you encounter the end of a tag, you add the whole temporary list as a word to a return list.
You can solve the numbering using enumerate(iterable, start=1) of the list of words:
def find_tags(text):
    inside_tag = False
    tags = [] # list of all tag-words
    t = [] # list to collect all letters of a single tag
    for c in text:
        if not inside_tag:
            inside_tag = c == "[" # we are inside as soon as we encounter [
        elif c != "]":
            t.append(c) # happens only if inside a tag and not tag ending
        else:
            tags.append(''.join(t)) # construct tag from t and set inside back to false
            inside_tag = False
            t = [] # clear temporary list
    if t:
        tags.append(''.join(t)) # in case we have leftover tag characters ( "[tag" )
    return list(enumerate(tags,start=1)) # create enumerated list
x = find_tags('[apple]and[orange]and[apple]again!')
# x is a list of tuples (number, tag):
for nr, tag in x:
    print("{}-{}".format(nr, tag), end = ", ")
Then you specify ',' as delimiter after each print-command to get your output.
x looks like: [(1, 'apple'), (2, 'orange'), (3, 'apple')]
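If you want the function itself to return the exact string from the question (no trailing separator, no imports), one possible sketch uses str.find to locate each pair of brackets and str.join to build the result. This is an alternative to the character loop above, not a replacement for it:

```python
def find_tags(text):
    """Return the bracketed tags as a string like '1-apple,2-orange'."""
    tags = []
    start = text.find('[')
    while start != -1:
        end = text.find(']', start + 1)
        if end == -1:
            break  # opening bracket without a closing one: stop here
        tags.append(text[start + 1:end])
        start = text.find('[', end + 1)
    # number the tags from 1 and join them with commas
    return ','.join('{}-{}'.format(n, tag) for n, tag in enumerate(tags, start=1))


print(find_tags('Hi[Pear]'))
print(find_tags('[apple]and[orange]and[apple]again!'))
```

Because join only puts the comma between items, there is no dangling ", " at the end of the output.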
Hi there, I am looking to build this Python function using only simple things like def, find etc. So far I only know how to write the first part of the code.
Given a string such as "HELLODOGMEMEDOGPAPA", I will need to return a list that gives me three things:
Everything before the word dog which i will denote as before_dog
The word dog until dog appears again dog_todog
Everything after the second time dog appears will be denoted by after_todog
The list will be in the form [before_dog,dog_todog,after_todog].
so for example given ("HELLODOGMEMEDOGPAPADD") this will return the list
("HELLO","DOGMEME","DOGPAPADD")
another example would be ("HEYHELLOMANDOGYDOGDADDY") this would return the list
("HEYHELLOMAN","DOGY","DOGDADDY")
but if I have ("HEYHELLODOGDADDY")
the output will be ("HEYHELLO","DOGDADDY","")
Also, if dog never appears ("HEYHELLOYO"), then the output will be ("HEYHELLOYO","","")
This is what I have so far:
def split_list(words):
    # declare the list
    lst = []
    # find the first position
    first_pos=words.find("DOG")
    # find the first_pos
    before_dog = words [0:first_pos]
    lst.append(before_dog)
    return lst
Funny function split_2_dogs() with re.findall() function:
import re
def split_2_dogs(s):
    if s.count('DOG') == 2: # assuring 2 dogs are "walking" there
        return list(re.findall(r'^(.*)(DOG.*)(DOG.*)$', s)[0])
print(split_2_dogs("HELLODOGMEMEDOGPAPADD"))
print(split_2_dogs("HEYHELLOMANDOGYDOGDADDY"))
The output:
['HELLO', 'DOGMEME', 'DOGPAPADD']
['HEYHELLOMAN', 'DOGY', 'DOGDADDY']
Alternative solution with str.index() and str.rfind() functions:
def split_2_dogs(s):
    if 'DOG' not in s:
        return [s, '', '']
    pos1, pos2 = s.index('DOG'), s.rfind('DOG')
    if pos1 == pos2: # only one DOG, so the last part is empty
        pos2 = len(s)
    return [s[0:pos1], s[pos1:pos2], s[pos2:]]
This is pretty easy to do using the split function. For example, you can split any string by a delimiter, like dog, as so:
>>> chunks = 'HELLODOGMEMEDOGPAPA'.split('DOG')
>>> print(chunks)
['HELLO', 'MEME', 'PAPA']
You could then use the output of that in a list comprehension, like so:
>>> dog_chunks = chunks[:1] + ["DOG" + chunk for chunk in chunks[1:]]
>>> print(dog_chunks)
['HELLO', 'DOGMEME', 'DOGPAPA']
The only slightly tricky bit is making sure you don't prepend dog to the first string in the list, hence the little bits of slicing.
Split the string at 'DOG' and use conditions to get the desired result
s = 'HELLODOGMEMEDOGPAPADD'
l = s.split('DOG')
dl = ['DOG'+i for i in l[1:]]
[l[0]]+dl if l[0] else dl
Output:
['HELLO', 'DOGMEME', 'DOGPAPADD']
Splitting at DOG is the key! This code will work for all the cases that you have mentioned.
from itertools import zip_longest
words = 'HEYHELLODOGDADDY'
words = words.split("DOG")
words = ['DOG'+j if i>0 else j for i,j in enumerate(words)]
# words = ['HEYHELLO', 'DOGDADDY']
ans = ['','','']
# stitch words and ans together
ans = [m+n for m,n in zip_longest(words,ans,fillvalue='')]
print(ans)
Output :
['HEYHELLO', 'DOGDADDY', '']
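For completeness, here is a sketch that sticks to def and find as the question asks, and covers all four cases listed there (two dogs, one dog, no dog). The marker parameter is an addition for illustration; the question only ever uses "DOG":

```python
def split_list(words, marker="DOG"):
    """Return [before_dog, dog_todog, after_todog] for the first two markers."""
    first = words.find(marker)
    if first == -1:
        # marker never appears: everything goes into the first slot
        return [words, "", ""]
    # look for the second marker only after the end of the first one
    second = words.find(marker, first + len(marker))
    if second == -1:
        # only one marker: the third slot stays empty
        return [words[:first], words[first:], ""]
    return [words[:first], words[first:second], words[second:]]


print(split_list("HELLODOGMEMEDOGPAPADD"))
print(split_list("HEYHELLOMANDOGYDOGDADDY"))
print(split_list("HEYHELLODOGDADDY"))
print(split_list("HEYHELLOYO"))
```

Checking the return value of find against -1 matters here: slicing with -1 (as in words[0:first_pos] when DOG is missing) would silently drop the last character instead of signalling that nothing was found.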