Search file for exact match of word list

There are many questions on this topic, some using regex, some using with open, and others, but none that I have found fit my requirements.
I am opening an XML file which contains strings, one per line, e.g.
<string name="AutoConf_5">setup is in progress…</string>
I want to iterate over each line in the file and search each line for exact matches of words in a list. The current code works and prints matches, but the matches are not exact: 'pass' finds 'passed', and 'pro' finds 'provide', 'process', 'proceed', etc.
def stringRun(self, file):
    str_file = ['admin','premium','pro','paid','pass','password','api']
    with open(file, 'r') as sf:
        for s in sf:
            if any(x in str(s) for x in str_file):
                self.progressBox.AppendText(s)

Instead of the in operator, which matches any substring of the line, use the regex function re.search (this needs import re).
I haven't checked this in Python, so minor syntax errors might have slipped in, but the general idea is to replace the if in your code with this:
if any(re.search(x, str(s)) for x in str_file):
Then you can use the power of regex to search for the words in the list with word boundaries. Add '\b' to the beginning and end of each search string, or add it to every word in the condition:
if any(re.search(r'\b' + x + r'\b', str(s)) for x in str_file):
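The word-boundary idea above can be sketched as a single precompiled pattern. This is a minimal sketch, not the asker's actual code: the names word_re and has_exact_word are made up for illustration, and re.escape is added on the assumption that list entries might one day contain regex metacharacters.

```python
import re

# Word list taken from the question.
str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']

# One alternation wrapped in word boundaries. re.escape protects against
# list entries that contain regex metacharacters (an added precaution).
word_re = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in str_file) + r')\b')


def has_exact_word(line):
    """Return True if the line contains any listed word as a whole word."""
    return word_re.search(line) is not None


print(has_exact_word('<string name="a">the test passed</string>'))      # False
print(has_exact_word('<string name="a">enter your password</string>'))  # True
```

Compiling once avoids rebuilding a pattern for every word on every line, which is what the any(...) form does.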

If you want an exact match, IMO the best way is to prepare the strings to match and then look each line up among them.
For instance, you can prepare a mapping between the tagged strings and the strings you want to match:
tagged = {'<string name="AutoConf_5">{0}</string>'.format(s): s
          for s in str_file}
This dict associates each tagged string you want to match with the actual string.
You can use it like this:
for line in sf:
    line = line.strip()
    if line in tagged:
        self.progressBox.AppendText(tagged[line])
Note: if any of your strings contains "&", "<" or ">", you need to escape those characters, like this:
from xml.sax.saxutils import escape
tagged = {'<string name="AutoConf_5">{0}</string>'.format(escape(s)): s
          for s in str_file}
Another solution is to use lxml to parse your XML tree and find the nodes which match a given XPath expression.
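The parser-based route might look roughly like the sketch below. It swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra packages. The <resources> root element and the second and third sample strings are assumptions, since the question only shows one line of the file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document: the real file's root element is not shown
# in the question, so <resources> is an assumption.
xml_text = """<resources>
<string name="AutoConf_5">setup is in progress…</string>
<string name="AutoConf_6">enter your password</string>
<string name="AutoConf_7">Tom &amp; Jerry go pro</string>
</resources>"""

str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
word_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, str_file)) + r')\b')

root = ET.fromstring(xml_text)
# iter('string') walks every <string> element; .text arrives already unescaped.
matches = [(el.get('name'), el.text)
           for el in root.iter('string')
           if el.text and word_re.search(el.text)]
print(matches)
```

Because the parser unescapes entities, no separate escape or unescape step is needed here.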
EDIT: match at least one word (from a word list)
You have a list of words. To match the XML content which contains at least one word of this list, you can use a regular expression.
You may encounter two difficulties:
XML content, read as a text file, can contain escaped characters ("&amp;", "&lt;" or "&gt;"), so you need to unescape the XML content.
Some words from your word list may contain regex special characters (like "[" or "(") which must be escaped.
First, you can prepare a regex (and a function) to find all occurrences of any listed word in a string. To do that, you can use "\b", which matches the empty string, but only at the beginning or end of a word:
str_file = ['admin', 'premium', 'pro', 'paid', 'pass', 'password', 'api']
re_any_word = r"\b(?:" + r"|".join(re.escape(e) for e in str_file) + r")\b"
find_any_word = re.compile(re_any_word, flags=re.DOTALL).findall
For instance:
>>> find_any_word("Time has passed")
[]
>>> find_any_word("I pass my exam, I'm a pro")
['pass', 'pro']
To extract the content of an XML fragment, you can also use a regex (even if it is not recommended in the general case, it is worth it here).
The following regex (and function) matches a "<string>...</string>" fragment and captures the content in the first group:
re_string = r'<string[^>]*>(.*?)</string>'
match_string = re.compile(re_string, flags=re.DOTALL).match
For instance:
>>> match_string('<string name="AutoConf_5">setup is in progress…</string>').group(1)
setup is in progress…
Now, all you have to do is to parse your file, line by line.
For the demo, I used a list of strings:
from xml.sax import saxutils  # needed for saxutils.unescape below
lines = [
    '<string name="AutoConf_5">setup is in progress…</string>\n',
    '<string name="AutoConf_5">it has passed</string>\n',
    '<string name="AutoConf_5">I pass my exam, I am a pro</string>\n',
]
for line in lines:
    line = line.strip()
    mo = match_string(line)
    if mo:
        content = saxutils.unescape(mo.group(1))
        words = find_any_word(content)
        if words:
            print(line + " => " + ", ".join(words))
You get:
<string name="AutoConf_5">I pass my exam, I am a pro</string> => pass, pro

Related

Split string with multiple possible delimiters to get substring

I am trying to make a simple Discord bot that responds to some user input, and I am having difficulty parsing the response for the info I need. I am trying to get the user's "gamertag"/username, but the format is sometimes a little different.
So my idea was to make a list of delimiter words to look for (different versions of the word gamertag, such as Gamertag:, Gamertag -, username, etc.).
Then, look line by line for a line that contains any of those delimiters.
Split the string on the first matching delimiter and strip non-alphanumeric characters.
I had it kind of working for a single line, then realized some people don't put it on the first line, so I added a line-by-line check and broke it (on line 19, I just realized). I also thought there must be a better way than this. Please advise; some kind-of-working code is copied below:
testString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
applicationString = testString
gamertagSplitList = [ "gamertag", "Gamertag","Gamertag:", "gamertag:"]
#splWord = 'Gamertag'
lineNum = 0
for line in applicationString.partition('\n'):
    print(line)
    if line in gamertagSplitList:
        applicationString = line
        break
#get first line
#applicationString = applicationString.partition('\n')[0]
res = ""
#split on word, want to split on first occurrence of list of words
for splitWord in gamertagSplitList:
    if splitWord in applicationString:
        res = applicationString.split(splitWord)
        break
splitString = res[1]
#res = test_string.split(spl_word, 1)
#splitString = res[1]
#get rid of non alphaNum characters
finalString = "" #define string for output
for character in splitString:
    if(character.isalnum()):
        # if character is alphanumeric concat to finalString
        finalString = finalString + character
print(finalString)

I don't know if this will work with all your different inputs, but you can tweak it to get what you want:
import re
gamertagSplitList = ["gamertag", "Gamertag", "Gamertag:", "gamertag:"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
gamer_tag = ''  # stays empty if no line contains a tag
for line in applicationString.split('\n'):
    line = line.replace(' ', '')
    for tag in gamertagSplitList:
        if tag in line:
            gamer_tag = line.replace(tag, '', 1)
            break
print(re.sub(r'\W+', '', gamer_tag))
Output :
testGamertag

You can do it without any loops with a single regex:
import re
gamertagSplitList = ["gamertag", "Gamertag"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
print(re.search(r'(' + '|'.join(gamertagSplitList) + r')\s*[:-]?\s*(\w+)\s*', applicationString)[2])
If all values in gamertagSplitList differ just by casing, you can simplify that even further:
print(re.search(r'gamertag\s*[:-]?\s*(\w+)\s*', applicationString, re.IGNORECASE)[1])
Let's take a closer look at this regex:
gamertag will match a string 'gamertag'
\s* will match any (including none) whitespace characters (space, newline, tab, etc.)
[:-]? will match either none or a single character which is either : or -
(\w+) will match 1 or more alphanumeric characters (or underscores). The parentheses here denote a group: a specific substring that we can extract later from the match.
By using re.IGNORECASE we make matching case-insensitive, so that a separator such as GaMeRtAg will also be recognised by this pattern.
The indexing part [1] means that we're interested in the first group in our pattern (remember the parentheses). The group with index 0 is always the full match, and groups from index 1 upwards represent substrings that match the subexpressions in parentheses (ordered by the position of their opening ( in the regex).
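The pattern described above can be wrapped in a small helper that also copes with input that has no label at all. This is a sketch under assumptions: the function name extract_gamertag is invented, and "username" is included only because the question mentions it as a possible label.

```python
import re

# Label variants joined into a non-capturing alternation, so the gamertag
# itself stays in group 1. Extend the alternation as needed.
GAMERTAG_RE = re.compile(r'(?:gamertag|username)\s*[:-]?\s*(\w+)', re.IGNORECASE)


def extract_gamertag(text):
    """Return the first word after a gamertag label, or None if no label is found."""
    match = GAMERTAG_RE.search(text)
    return match.group(1) if match else None


application = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""

print(extract_gamertag(application))            # testGamertag
print(extract_gamertag("USERNAME- other_tag"))  # other_tag
print(extract_gamertag("no label here"))        # None
```

Checking the match object before indexing avoids the TypeError that re.search(...)[1] raises when nothing matches.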

How to split a string with multiple delimiters without deleting delimiters in Python?

I currently have a list of filenames in a txt file and I am trying to sort them. The first thing I am trying to do is split them into a list, since they are all on a single line. There are 3 file types in the list. I am able to split the list, but I would like to keep the delimiters in the end result and I have not been able to find a way to do this. I am splitting the file as follows:
import re
def breakLines():
    unsorted_list = []
    file_obj = open("index.txt", "rt")
    file_str = file_obj.read()
    unsorted_list.append(re.split('.txt|.mpd|.mp4', file_str))
    print(unsorted_list)
breakLines()
I found DeepSpace's answer to be very helpful here Split a string with "(" and ")" and keep the delimiters (Python), but that only seems to work with single characters.
EDIT:
Sample input:
file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4
Expected output:
file_name1234.mp4
file_name1235.mp4
file_name1236.mp4
file_name1237.mp4

In re.split, the key is to parenthesise the split pattern so it's kept in the result of re.split. Your attempt is:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.split('.txt|.mpd|.mp4', s)
['file_name1234', 'file_name1235', 'file_name1236', 'file_name1237', '']
OK, that drops the extensions (and the dots would need escaping to match only a literal dot, as a real extension requires), so let's try:
>>> re.split(r'(\.txt|\.mpd|\.mp4)', s)
['file_name1234',
'.mp4',
'file_name1235',
'.mp4',
'file_name1236',
'.mp4',
'file_name1237',
'.mp4',
'']
This works, but it splits the extensions from the filenames and leaves an empty string at the end, which is not what you want (unless you want some ugly post-processing). Plus, this is a duplicate question: In Python, how do I split a string and keep the separators?
But you don't want re.split, you want re.findall:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(\w*?(?:\.txt|\.mpd|\.mp4))', s)
['file_name1234.mp4',
'file_name1235.mp4',
'file_name1236.mp4',
'file_name1237.mp4']
The expression matches word characters (basically digits, letters and underscores), followed by the extension. To be able to create an OR, I created a non-capturing group inside the main group.
If you have more exotic file names, you can't use \w anymore, but the approach still works reasonably well (you may need some str.strip post-processing to remove leading/trailing blanks, which are likely not part of the filenames):
>>> s = " file name1234.mp4file-name1235.mp4 file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(.*?(?:\.txt|\.mpd|\.mp4))', s)
[' file name1234.mp4',
'file-name1235.mp4',
' file_name1236.mp4',
'file_name1237.mp4']
So sometimes you think re.split when you need re.findall, and the reverse is also true.
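For comparison, the post-processing that the re.split attempt would need can be sketched as follows. It is shown only to illustrate the trade-off, not as a recommendation over re.findall. The variable names parts and files are illustrative.

```python
import re

s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"

# With a capturing group, re.split returns names and delimiters alternately:
# [name, ext, name, ext, ..., ''].
parts = re.split(r'(\.txt|\.mpd|\.mp4)', s)

# Pair each name (even index) with the extension that follows it (odd index).
# zip stops at the shorter slice, so the trailing empty string is dropped.
files = [name + ext for name, ext in zip(parts[0::2], parts[1::2])]
print(files)
```

The re.findall version reaches the same list in one step, which is why it is the better fit here.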

How to remove words after certain character in a line in python [duplicate]

I have a string. How do I remove all text after a certain character? (In this case ...)
The text after ... will change, so that's why I want to remove all characters after a certain one.

Split on your separator at most once, and take the first piece:
sep = '...'
stripped = text.split(sep, 1)[0]
You didn't say what should happen if the separator isn't present. Both this and Alex's solution will return the entire string in that case.

Assuming your separator is '...', but it can be any string.
>>> text = 'some string... this part will be removed.'
>>> head, sep, tail = text.partition('...')
>>> print(head)
some string
If the separator is not found, head will contain all of the original string.
The partition function was added in Python 2.5.
S.partition(sep) -> (head, sep, tail)
Searches for the separator sep in S, and returns the part before it,
the separator itself, and the part after it. If the separator is not
found, returns S and two empty strings.
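One practical difference is that partition also reports whether the separator was present at all, which split(sep, 1)[0] does not. A small sketch (the helper name before_separator is made up for illustration):

```python
def before_separator(text, sep='...'):
    """Return (text before sep, whether sep was found)."""
    head, found, _tail = text.partition(sep)
    # partition returns an empty string in the middle slot when sep is absent.
    return head, bool(found)


print(before_separator('some string... this part will be removed.'))
# ('some string', True)
print(before_separator('no separator here'))
# ('no separator here', False)
```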

If you want to remove everything after the last occurrence of separator in a string I find this works well:
<separator>.join(string_to_split.split(<separator>)[:-1])
For example, if string_to_split is a path like root/location/child/too_far.exe and you only want the folder path, you can use "/".join(string_to_split.split("/")[:-1]) and you'll get
root/location/child
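The same result is available from str.rsplit and str.rpartition, which split from the right. The sketch below compares the three on the example path and on a string without the separator, where they behave differently. The variable names are illustrative.

```python
path = "root/location/child/too_far.exe"

joined = "/".join(path.split("/")[:-1])
rsplit_head = path.rsplit("/", 1)[0]
rpartition_head = path.rpartition("/")[0]
print(joined, rsplit_head, rpartition_head)
# root/location/child root/location/child root/location/child

# Without any separator, only rsplit keeps the original string.
name = "too_far.exe"
joined_missing = "/".join(name.split("/")[:-1])
rsplit_missing = name.rsplit("/", 1)[0]
rpartition_missing = name.rpartition("/")[0]
print(repr(joined_missing), repr(rsplit_missing), repr(rpartition_missing))
# '' 'too_far.exe' ''
```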

Without a regular expression (which I assume is what you want):
def remafterellipsis(text):
    where_ellipsis = text.find('...')
    if where_ellipsis == -1:
        return text
    # + 3 keeps the '...' itself; use text[:where_ellipsis] to drop it as well
    return text[:where_ellipsis + 3]
or, with a regular expression:
import re
def remwithre(text, there=re.compile(re.escape('...')+'.*')):
    # note: unlike the function above, this also removes the '...'
    return there.sub('', text)

import re
test = "This is a test...we should not be able to see this"
res = re.sub(r'\.\.\..*',"",test)
print(res)
Output: "This is a test"

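The method index returns the position of a character in a string (find does the same, but returns -1 instead of raising ValueError when the character is absent). Then, if you want to remove everything from that character onward, do this: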
mystring = "123⋯567"
mystring[ 0 : mystring.index("⋯")]
>> '123'
If you want to keep the character, add 1 to the character position.

From a file:
sep = '...'
with open("requirements.txt") as file_in:
    for line in file_in:
        res = line.split(sep, 1)[0]
        print(res)

This works for me in Python 3.7.
In my case I need to remove everything after the dot in my string variable fees:
fees = "45.05"  # must be a string; a float has no split method
split_string = fees.split(".", 1)
substring = split_string[0]
print(substring)

Yet another way to remove all characters after the last occurrence of a character in a string (assume that you want to remove all characters after the final '/').
path = 'I/only/want/the/containing/directory/not/the/file.txt'
while path and path[-1] != '/':  # the "path and" guard avoids an IndexError when there is no '/'
    path = path[:-1]

Another easy way, using re:
import re
text = 'some string... this part will be removed.'
text = re.search(r'(\A.*)\.\.\..+', text, re.DOTALL|re.IGNORECASE).group(1)
# text == 'some string'

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the Python regular expressions to get the strings I want?
Thanks.

Assuming the code block you provided is one long string, here stored in a variable called input_string:
import re
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind. It checks the text just before the main string, so that only strings preceded by name=' are returned. The \ in front of = and ' escapes them; neither character is special here, so the backslashes are harmless but not required. name=' itself is not part of the result; we just know that every result we get is preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean a literal . and not the regex command represented by an unescaped period. Putting these in [] means we accept any of the characters listed, so we're saying that we'll accept any alphanumeric character, _, or a period. The + afterwards means at least one of the previous thing, i.e. one or more characters matched by [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to get the smallest possible group that meets these specifications, since + could otherwise go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): a lookahead. It checks the text just after the main string, so that only strings followed by ' are returned. The \ is again an escape, although ' needs none inside a double-quoted raw string. This final ' is not returned with our results; we just know that in the original string it followed any result we end up getting.
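An alternative to lookarounds is to match the surrounding text and put only the wanted part in a capturing group. The sketch below is a swapped-in technique, not the answer's original method. It assumes the whole dump is held in one string, and the variable names are illustrative.

```python
import re

input_string = """package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'"""

# group(1) is whatever sits between the quotes after the key.
name = re.search(r"\bname='([^']*)'", input_string).group(1)
version_name = re.search(r"\bversionName='([^']*)'", input_string).group(1)

# With exactly one group in the pattern, findall returns that group's text.
permissions = re.findall(r"uses-permission:'android\.permission\.([A-Z_]+)'", input_string)

print(name)          # jp.tjkapp.droid1lwp
print(version_name)  # 1.1
print(permissions)
```

Both styles give the same strings here; capturing groups tend to be easier to read when the delimiters are fixed text.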

You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>

Here is some example code:
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
    if line.startswith("package"):
        words = line.split()
        string1 = words[1].split("=")[1].replace("'","")
        string2 = words[3].split("=")[1].replace("'","")
The test.txt file contains the input data you mentioned earlier.

python regex for repeating string

I want to verify and then parse this string:
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following Python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print(' matched.groups()', matched.groups())
I have tried different variations, but I can get either the first or the last code, not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.

In Python, this isn’t possible with a single regular expression: each capture of a group overwrites the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then use a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
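Both points can be seen in a short sketch. The first pattern is a variant of the question's regex, adjusted (as an assumption) to allow the space before each code so that it matches the sample string. The variable names are illustrative.

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# A repeated group keeps only its final capture.
repeated = re.match(r'start:(?: (c?[0-9]+),?)+;', text)
last_only = repeated.groups()
print(last_only)  # ('34526',)

# Two steps: check the frame and capture what lies between "start:" and ";",
# then collect every code from that span.
outer = re.match(r'start:([^;]*);', text)
codes = re.findall(r'c?[0-9]+', outer.group(1)) if outer else []
print(codes)  # ['c12354', 'c3456', '34526']
```

An empty list from the second step doubles as the verification result: it means the string did not have the start: ... ; frame.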

You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns True if it starts with this string
s.endswith(";") # returns True if it ends with this string
s[6:-1].strip().split(', ') # gives a list of tokens separated by the string ", " (strip removes the space after "start:")

This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need for str.split), and ignores any extra characters in the line, just like your example.

import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(1))  # group(1) is the part between "start:" and ";"
results in
['12354', '3456', '34526']
