I'm trying to create a list from a text file that I am reading into Python. The text file contains a bunch of square brackets throughout the file [some text here]. What I am trying to do is first count how many of those square bracket pairings I have [] and then add whatever text is inside of them into a list.
Here is a super simplified version of the text file I am trying to use with the brackets:
"[name] is going to the store! It's going to be at [place] on [day-of-the-week]."
Here is what I have:
bracket_counter = 0
file_name = "example.txt"
readFile = open(file_name)
text_lines = readFile.readlines()
for line in text_lines:
    val = line.split('[', 1)[1].split(']')[0]
    print(val)
    if line has [ and ]:
        bracket_counter += 1
I'm super new to Python. I don't know if I should be using regular expressions for this or if that is overcomplicating things.
Thanks for your help!
You can certainly use regular expressions for that. The following example extracts all content within square brackets and stores it in the variable words.
words = re.findall(r'\[([^\]]+)\]', line)
Don't forget to import re at the top of your program.
Explanation of the regex:
\[ and \] match literal square brackets. Because square brackets have a special meaning in regex, you have to escape them with a backslash.
(...) is a capturing group; whatever the pattern inside the parentheses matches is returned as a result (e.g. as an item in the list words).
[^\]]+ matches one or more characters other than ].
To put it all together:
This regex looks for an opening square bracket, then matches all characters until a closing square bracket appears and returns the matches within a list.
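To tie this back to the question, here is a minimal sketch that both counts the bracket pairs and collects their contents. It runs on the sample sentence from the question instead of reading example.txt, and it reuses the asker's bracket_counter name; the count is simply the length of the list that re.findall returns.

```python
import re

# Sample text from the question; in practice this would be read from the file.
text = "[name] is going to the store! It's going to be at [place] on [day-of-the-week]."

# Capture everything between each "[" and the next "]".
words = re.findall(r'\[([^\]]+)\]', text)

# One list entry per bracket pair, so the count is the list length.
bracket_counter = len(words)

print(bracket_counter)  # 3
print(words)            # ['name', 'place', 'day-of-the-week']
```

Because the pattern uses +, an empty pair like [] is not counted; changing + to * would include it as an empty string.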
If there's a lonely soul out there like me who wondered how this could be done without using regex, I guess this could be a solution also:
results = []
with open("example.txt") as read:
    for line in read:
        start = None; end = None
        for i, v in enumerate(line):
            if v == "[": start = i
            if v == "]": end = i
            if start is not None and end is not None:
                results.append(line[start+1:end])
                start = None; end = None
print(results)
Result:
['name', 'place', 'day-of-the-week']
Python doesn't have a has keyword.
You're probably looking for something like:
('[' in line) or (']' in line)
which will evaluate to True if the line includes [ or ]. To require both brackets, as in your pseudocode, use and instead of or.
Related
I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns as a whole word if they are separated by space.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this only gives the pairs, not the single ones. I want both kinds.
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_='NNP'):
    res = []
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x: x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res

join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. Here is a very fast solution using only regex; the code is explained in the comments:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean all target words, word/NNP to word.
# You could use str.replace, but this shows a technique:
# how to use a back-reference to a group with \index_of_group
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining REGEX:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+) : a capturing group wrapping a non-capturing subgroup. The subgroup (?:(?:\s)?\w+(?=\s+|$)) matches each word that is followed by spaces or the end of the line, and the outer group captures the whole run of such words. Without the outer group, the match would return only the first word.
(?:\s+|$) : remove trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep a sequence of word/NNP tokens in a single group. Something like (word)/NNP (word)/NNP returns two elements in one group, not a single text. After removing the tag, the text becomes word word, so the regex ((?:\w+\s)+) captures the sequence of words. It is not quite as simple as that, because we must only capture words that do not have /sequence_of_letters at the end. This way there is no need to loop over the matched groups and concatenate elements to build the text.
NOTE: both solutions work fine if all words are in the format word/sequence_of_letters. If you have words that are not in this format, you need to fix those first: to keep them, add /NNP at the end of each word; to drop them, add /DUMMY.
Using re.split, which is slower because I'm using a list comprehension to clean up the result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']
There are many, many questions surrounding this, some using regex, some using with open, and others, but I have found none that fit my requirements.
I am opening an XML file which contains strings, 1 per line, e.g.
<string name="AutoConf_5">setup is in progress…</string>
I want to iterate over each line in the file and search each line for exact matches of words in a list. The current code seems to work and prints out matches but it doesn't do exact matches, e.g 'pass' finds 'passed', 'pro' finds 'provide', 'process', 'proceed' etc
def stringRun(self,file):
    str_file = ['admin','premium','pro','paid','pass','password','api']
    with open(file, 'r') as sf:
        for s in sf:
            if any(x in str(s) for x in str_file):
                self.progressBox.AppendText(s)
Instead of using the "in" operator, which matches any substring in the line, you should use the regex function re.search.
I haven't checked it with python so minor syntax errors might have slipped in but this is the general idea, replace the if in your code with this:
if any(re.search(x, str(s)) for x in str_file):
Then you can use the power of regex to search for the words in the list with word boundaries. You need to add '\b' to the beginning and end of each search string, or add to all in the condition:
if any(re.search(r'\b' + x + r'\b', str(s)) for x in str_file):
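A small sketch of the word-boundary check in isolation, so it can be tried without the GUI code. The helper name has_exact_word and the sample lines are made up for illustration. It also wraps each word in re.escape, which the condition above does not do; that is an extra precaution in case a search word ever contains regex special characters.

```python
import re

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']

def has_exact_word(line, words):
    """Return True if any word occurs in line as a whole word."""
    # \b matches at a word boundary, so 'pass' no longer matches 'passed'.
    # re.escape keeps special characters in a word from being read as regex syntax.
    return any(re.search(r'\b' + re.escape(w) + r'\b', line) for w in words)

print(has_exact_word('<string name="a">it has passed</string>', str_file))    # False
print(has_exact_word('<string name="b">enter your pass</string>', str_file))  # True
```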
If you want an exact match, IMO, the best way is to prepare the strings to match and then search each string in each line.
For instance, you can prepare a mapping between the tagged strings and the strings you want to match:
tagged = {'<string name="AutoConf_5">{0}</string>'.format(s): s
          for s in str_file}
This dict is an association between the tagged string you want to match and the actual string.
You can use it like this:
for line in sf:
    line = line.strip()
    if line in tagged:
        self.progressBox.AppendText(tagged[line])
Note: if any of your string contains "&", "<" or ">", you need to escape those characters, like this:
from xml.sax.saxutils import escape
tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(s)): s
          for s in str_file}
Another solution is to use lxml to parse your XML tree and find nodes which match a given xpath expression.
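As a rough illustration of the parsing approach, here is a sketch that uses the standard library's xml.etree.ElementTree instead of lxml, so it runs without extra packages. The <resources> root element, the AutoConf_6 and AutoConf_7 names, and the sample strings are assumptions made up for the demo; only the <string name="..."> shape comes from the question.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document; the <resources> root and the extra entries are assumptions.
xml_text = """<resources>
<string name="AutoConf_5">setup is in progress</string>
<string name="AutoConf_6">I pass my exam, I am a pro</string>
<string name="AutoConf_7">it has passed</string>
</resources>"""

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
# One alternation of all words, anchored on word boundaries.
word_re = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in str_file) + r")\b")

root = ET.fromstring(xml_text)
matches = []
for node in root.iter("string"):
    # node.text is already unescaped by the parser; it is None for empty elements.
    found = word_re.findall(node.text or "")
    if found:
        matches.append((node.get("name"), found))

print(matches)  # [('AutoConf_6', ['pass', 'pro'])]
```

A side effect of parsing is that the XML entities are decoded for you, so no manual unescaping is needed.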
EDIT: match at least one word (from a word list)
You have a list of words. To match the XML content that contains at least one word from this list, you can use a regular expression.
You may encounter two difficulties:
XML content, parsed as a text file, can contain escaped "&", "<" or ">" characters, so you need to unescape the XML content.
some words from your word list may contain regex special characters (like "[" or "("), which must be escaped.
First, you can prepare a regex (and a function) to find all occurrences of a word in a string. To do that, you can use "\b", which matches the empty string, but only at the beginning or end of a word:
str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
re_any_word = r"\b(?:" + r"|".join(re.escape(e) for e in str_file) + r")\b"
find_any_word = re.compile(re_any_word, flags=re.DOTALL).findall
For instance:
>>> find_any_word("Time has passed")
[]
>>> find_any_word("I pass my exam, I'm a pro")
['pass', 'pro']
To extract the content of an XML fragment, you can also use a regex (even if that is not recommended in the general case, it is worth it here).
The following regex (and function) matches a "<string>...</string>" fragment and selects the content in the first group:
re_string = r'<string[^>]*>(.*?)</string>'
match_string = re.compile(re_string, flags=re.DOTALL).match
For instance:
>>> match_string('<string name="AutoConf_5">setup is in progress…</string>').group(1)
setup is in progress…
Now, all you have to do is to parse your file, line by line.
For the demo, I used a list of strings:
from xml.sax import saxutils

lines = [
    '<string name="AutoConf_5">setup is in progress…</string>\n',
    '<string name="AutoConf_5">it has passed</string>\n',
    '<string name="AutoConf_5">I pass my exam, I am a pro</string>\n',
]
for line in lines:
    line = line.strip()
    mo = match_string(line)
    if mo:
        # decode XML entities before searching for words
        content = saxutils.unescape(mo.group(1))
        words = find_any_word(content)
        if words:
            print(line + " => " + ", ".join(words))
You get:
<string name="AutoConf_5">I pass my exam, I am a pro</string> => pass, pro
I tried to search but the information that I am getting seems to be kinda overwhelming and far from what I need. I can't seem to get it to work.
The requirement is to get the function that starts with "meta" and its parentheses.
input:
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
output:
[ metaOmph , (uno) ]
[ metaAsdf, (dos) ]
[ metaPoil, (tres)]
The one that I currently have just gets the entire line if it starts with "meta", so I get the entire "one meta<>" if it's a match. Would it be possible to do what I'm aiming for?
Edit: It's one input/line at a time.
I'd love to post what I did earlier but I closed repl.it due to my frustration. I'll keep it in mind on my next post. (quite new here)
import re
s = """one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)"""
print(re.findall(r".+(meta\w+)(\(\w+\))", s))
Outputs:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
re.findall() approach with valid regex pattern:
import re
s = '''
one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)
'''
result = re.findall(r'\b(meta\w+)(\([^()]+\))', s)
print(result)
The output:
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
If you are going to pass a multiline string, it would seem simple to use the module level re.findall function.
text = '''one metaOmph(uno)
one metaAsdf(dos)
one metaPoil(tres)'''
r = re.findall(r'\b(meta.*?)(\(.*?\))', text, re.M)
print(r)
[('metaOmph', '(uno)'), ('metaAsdf', '(dos)'), ('metaPoil', '(tres)')]
If you are going to be passing 1-line strings as input to a loop, it might make more sense to compile the pattern beforehand, using re.compile and re.search inside a function:
pat = re.compile(r'\b(meta.*?)(\(.*?\))')
def find(text):
    return pat.search(text)

for text in list_of_texts:  # assuming you're passing in your strings from a list, or elsewhere
    m = find(text)
    if m:
        print(list(m.groups()))
['metaOmph', '(uno)']
['metaAsdf', '(dos)']
['metaPoil', '(tres)']
Note that m might be a match object or None depending on whether a match was found. You'll want to check the return value; otherwise you'll receive an AttributeError: 'NoneType' object has no attribute 'groups', or something along those lines.
Alternatively, if you want to append the result to a list, you might instead use:
r_list = []
for text in list_of_texts:
    m = find(text)
    if m:
        r_list.append(list(m.groups()))
print(r_list)
[['metaOmph', '(uno)'], ['metaAsdf', '(dos)'], ['metaPoil', '(tres)']]
Regex Details
\b           # word boundary (thought to add this in thanks to Roman's answer)
(
    meta     # literal 'meta'
    .*?      # non-greedy match-all
)
(
    \(       # literal opening parenthesis (escaped)
    .*?      # non-greedy match-all
    \)       # literal closing parenthesis (escaped)
)
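The annotated layout above can also be used directly as the pattern. A minimal sketch, assuming the re.VERBOSE flag, which makes the regex engine ignore whitespace and # comments inside the pattern; the variable names are made up for the demo.

```python
import re

# Same pattern as above, written with inline comments thanks to re.VERBOSE.
pat = re.compile(r"""
    \b          # word boundary
    (meta.*?)   # group 1: 'meta' plus as few characters as possible
    (\(.*?\))   # group 2: parenthesised argument, parentheses included
""", re.VERBOSE)

m = pat.search("one metaOmph(uno)")
groups = list(m.groups()) if m else []
print(groups)  # ['metaOmph', '(uno)']
```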
I am new to regex. I am attempting to use regex with python to find a line in a file and extract all of the subsequent words separated by tab stops. My line looks like this.
#position 4450 4452 4455 4465 4476 4496 D110 D111 D112 D114 D116 D118 D23 D24 D27 D29 D30 D56 D59 D69 D85 D88 D90 D91 JW1 JW10 JW15 JW22 JW28 JW3 JW35 JW39 JW43 JW45 JW47 JW49 JW5 JW52 JW54 JW56 JW57 JW59 JW66 JW7 JW70 JW75 JW77 JW9 REF_OR74A
I have identified that the base of this expression involves the positive lookbehind.
(?<=#position).*
I do not expect this to separate the matches by tabstop. However, it does find my line in the file:
import re
file = open('src.txt','r')
f = list(file)
file.close()
pattern = '(?<=#position).*'
regex = re.compile(pattern)
regex.findall(''.join(f))
['\t4450\t4452\t4455\t4465\t4476\t4496\tD110\tD111\tD112\tD114\tD116\tD118\tD23\tD24\tD27\tD29\tD30\tD56\tD59\tD69\tD85\tD88\tD90\tD91\tJW1\tJW10\tJW15\tJW22\tJW28\tJW3\tJW35\tJW39\tJW43\tJW45\tJW47\tJW49\tJW5\tJW52\tJW54\tJW56\tJW57\tJW59\tJW66\tJW7\tJW70\tJW75\tJW77\tJW9\tREF_OR74A']
With some kludge and list slicing / string methods, I can manipulate this and get my data out. What I'd really like to do is have findall yield a list of just these entries. What would the regular expression look like to do that?
Do you need to use regex? List slicing and string methods don't appear to be as much of a kludge as you say.
something like:
f = open('src.txt','r')
for line in f:
    if line.startswith("#position"):
        l = line.split()  # with no arguments it splits on all whitespace characters
        l = l[1:]  # get rid of the "#position" tag
        break
and further manipulate from there?
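If a regex is still preferred, one possible sketch is to let the pattern locate the line and capture everything after the tag, then split the capture on tabs. The sample text is a shortened, made-up stand-in for src.txt, and it assumes the fields are separated by single tab characters as in the question's output.

```python
import re

# Shortened stand-in for the file contents; the other lines are invented.
text = "some header\n#position\t4450\t4452\tD110\tJW1\tREF_OR74A\nother\tline\n"

# With re.MULTILINE, ^ and $ anchor at line starts and ends,
# so the capture stops at the end of the #position line.
m = re.search(r'^#position\t(.*)$', text, flags=re.MULTILINE)
fields = m.group(1).split('\t') if m else []

print(fields)  # ['4450', '4452', 'D110', 'JW1', 'REF_OR74A']
```

The split step is still a string method; a single findall that both finds the line and separates the fields is awkward to write with Python's re, which is part of why the plain string approach above is reasonable.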
I have a string like so: "sometext #Syrup #nshit #thebluntislit"
and I want to get a list of all terms starting with '#'
I used the following code:
import re
line = "blahblahblah #Syrup #nshit #thebluntislit"
ht = re.search(r'#\w*', line)
ht = ht.group(0)
print(ht)
and i get the following:
#Syrup
I was wondering if there is a way that I could instead get a list like:
[#Syrup,#nshit,#thebluntislit]
for all terms starting with '#' instead of just the first term.
Regular expressions are not needed with a good programming language like Python:
hashed = [ word for word in line.split() if word.startswith("#") ]
You can use
compiled = re.compile(r'#\w*')
compiled.findall(line)
Output:
['#Syrup', '#nshit', '#thebluntislit']
But there is a problem. If you search the string like 'blahblahblah #Syrup #nshit #thebluntislit beg#end', the output will be ['#Syrup', '#nshit', '#thebluntislit', '#end'].
This problem may be addressed by using positive lookbehind:
compiled = re.compile(r'(?<=\s)#\w*')
(It's not possible to use \b (word boundary) here, since # is not among the \w characters [0-9a-zA-Z_] that make up a word, so there is no word boundary between a space and #.)
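One thing to watch with the lookbehind above: (?<=\s) requires a whitespace character before the #, so a hashtag at the very start of the string is skipped. A possible variation, not from the answer itself, is a negative lookbehind (?<!\S), which reads as "not preceded by a non-space character" and therefore also accepts the start of the string. The sample strings are made up for the demo.

```python
import re

# Negative lookbehind: '#' must not be preceded by a non-whitespace character.
# \w+ (instead of \w*) also skips a bare '#'.
hashtag_re = re.compile(r'(?<!\S)#\w+')

mid_line = hashtag_re.findall('blahblahblah #Syrup #nshit #thebluntislit beg#end')
at_start = hashtag_re.findall('#first word #second')

print(mid_line)  # ['#Syrup', '#nshit', '#thebluntislit']
print(at_start)  # ['#first', '#second']
```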
Looks like re.findall() will do what you want.
matches = re.findall(r'#\w*', line)