Regex Doesn't Match Beyond 1st Result - python

I'm using Python to match against a list (array), but I'm sure the problem lies in the regex itself.
Assuming I have the following:
foo.html
bar.html
teep.html
And I use the following regex: .*(?=.html)
where .* will match anything and (?=.html) requires that .html be present, but does not include it in the result.
Thus, I should just be left with whatever comes before .html.
When I check, it only matches the first item in the array (in this case foo), but why not the others?
import os
import re

my_regex = re.compile('.html$')
r2 = re.compile('.*(?=.html)')

start = '/path/to/folder'
os.chdir(start)
a = os.listdir(start)

for item in a:
    if my_regex.search(item) != None and os.path.isdir(item):
        print 'FOLDER MATCH: ' + item  # this is a folder and not a file
        starterPath = os.path.abspath(item)
        outer_file = starterPath + '/index.html'
        outer_js = starterPath + '/outliner.js'
        if r2.match(item) != None:
            filename = r2.match(item).group()  # should give me everything before .html
            makePage(outer_file, outer_js, filename)  # self defined function
    else:
        print item + ': no'

filename = r2.match(item).group()
should be
filename = r2.match(item).groups() # plural !
According to the documentation, group() returns one or more subgroups of the match, whereas groups() returns a tuple of all of them.
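For reference, a quick check of what the two calls return for the pattern in the question; since that pattern defines no capturing groups, .groups() comes back empty while .group() returns the whole match (which is already the part before .html):

import re

r2 = re.compile('.*(?=.html)')
m = r2.match('foo.html')
print m.group()   # foo -- the whole match, i.e. everything before .html
print m.groups()  # ()  -- the pattern defines no capturing groups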

Figured out the problem. In my function, I changed directories but never changed back. So when the function ended and control went back to the for loop, it was now looking for the folder name in the wrong location. The fix is as simple as:
def makePage(arg1, arg2, arg3):
    os.chdir('path/to/desktop')
    # write file to new location
    os.chdir(start)  # go back to start and continue original search
    return
Also, .group() worked for me and returned everything in the folder name before the string .html, whereas .groups() just returned ().
The code in the original post stayed the same. Something so simple, causing all this headache.
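As a side note, a minimal sketch of one way to avoid the directory problem altogether: build absolute paths up front and never call os.chdir inside the loop (makePage is the question's own function; the regex is unchanged):

import os
import re

start = '/path/to/folder'
r2 = re.compile('.*(?=.html)')

for item in os.listdir(start):
    full = os.path.join(start, item)          # absolute path, so the cwd never matters
    if os.path.isdir(full) and item.endswith('.html'):
        filename = r2.match(item).group()     # everything before .html
        makePage(os.path.join(full, 'index.html'),
                 os.path.join(full, 'outliner.js'),
                 filename)                     # makePage as defined in the question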

Related

Can you spot the problem with this REGEX statement?

I'm running .txt files through a for loop which should slice out keywords and .append() them into lists. For some reason my regex statements are returning really odd results.
My first statement, which iterates through the full filenames and slices out the keyword, works well.
import os
import re

# Creates a workflow list of file names within target directory for further iteration
stack = os.listdir(
    "/Users/me/Documents/software_development/my_python_code/random/countries"
)

# declares list, to be filled, and their associated regular expression, to be used,
# in the primary loop
names = []
name_pattern = r"-\s(.*)\.txt"

# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # extraction of country name from file name into `names` list
    name_match = re.search(name_pattern, entry)
    name = name_match.group(1)
    names.append(name)
This works fine and creates the list that I expect
However, once I move on to a similar process with the actual contents of files, it no longer works.
religions = []
reli_pattern = r"religion\s=\s(.+)."

# PRIMARY LOOP
for entry in stack:
    if entry == ".DS_Store":
        continue
    # opens and reads file within `contents` variable
    file_path = (
        "/Users/me/Documents/software_development/my_python_code/random/countries" + "/" + entry
    )
    selection = open(file_path, "rb")
    contents = str(selection.read())
    # extraction of religion type and placement into `religions` list
    reli_match = re.search(reli_pattern, contents)
    religion = reli_match.group(1)
    religions.append(religion)
The results should be something like: "therevada", "catholic", "sunni" etc.
Instead I'm getting seemingly random pieces of text from the document which have nothing to do with my regex, like ruler names and stat values that do not contain the word "religion".
To try and figure this out I isolated some of the code in the following way:
contents = "religion = catholic"
reli_pattern = r"religion\s=\s(.*)\s"
reli_match = re.search(reli_pattern, contents)
print(reli_match)
And None is printed to the console so I am assuming the problem is with my REGEX. What silly mistake am I making which is causing this?
Your regular expression (religion\s=\s(.*)\s) requires that there be a trailing whitespace (the last \s there). Since your string doesn't have one, it doesn't find anything when searching thus re.search returns None.
You should either:
Change your regex to be r"religion\s=\s(.*)" or
Change the string you're searching to have a trailing whitespace (i.e. change 'religion = catholic' to 'religion = catholic ')
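A quick check of the first option against the isolated test string from the question:

import re

contents = "religion = catholic"
reli_pattern = r"religion\s=\s(.*)"      # no trailing \s required
reli_match = re.search(reli_pattern, contents)
print(reli_match.group(1))               # prints: catholic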

How to pass string variable into search function?

I am having issues passing a string variable into a search function.
Here is what I'm trying to accomplish:
I have a file full of values and I want to check the file to make sure a specific matching line exists before I proceed. I want to ensure that the line <endSW=UNIQUE-DNS-NAME-HERE<> exists if a valid <begSW=UNIQUE-DNS-NAME-HERE<> exists and is reachable.
Everything works fine until I call if searchForString(searchString,fileLoc): which always returns false. If I assign the variable 'searchString' a direct value and pass it, it works, so I know it must be something with the way I'm combining the strings, but I can't seem to figure out what I'm doing wrong.
If I examine the data that 'searchForString' is using I see what seems to be valid values:
values in fileLines list:
['<begSW=UNIQUE-DNS-NAME-HERE<>', ' <begPortType=UNIQUE-PORT-HERE<>', ' <portNumbers=80,443,22<>', ' <endPortType=UNIQUE-PORT-HERE<>', '<endSW=UNIQUE-DNS-NAME-HERE<>']
value of searchVar:
<endSW=UNIQUE-DNS-NAME-HERE<>
An example of the entry in the file is:
<begSW=UNIQUE-DNS-NAME-HERE<>
<begPortType=UNIQUE-PORT-HERE<>
<portNumbers=80,443,22<>
<endPortType=UNIQUE-PORT-HERE<>
<endSW=UNIQUE-DNS-NAME-HERE<>
Here is the code in question:
def searchForString(searchVar, readFile):
    with open(readFile) as findMe:
        fileLines = findMe.read().splitlines()
        print fileLines
        print searchVar
        if searchVar in fileLines:
            return True
        return False
    findMe.close()
fileLoc = '/dir/folder/file'
fileLoc.lstrip()
fileLoc.rstrip()
with open(fileLoc, 'r') as switchFile:
    for line in switchFile:
        #declare all the vars we need
        lineDelimiter = '#'
        endLine = '<>\n'
        begSWLine = '<begSW='
        endSWLine = '<endSW='
        begPortType = '<begPortType='
        endPortType = '<endPortType='
        portNumList = '<portNumbers='
        #skip over commented lines -(REMOVE THIS)
        if line.startswith(lineDelimiter):
            pass
        #checks the file for a valid switch name
        #checks to see if the host is up and reachable
        #checks to see if there is a file input is valid
        if line.startswith(begSWLine):
            #extract switch name from file
            switchName = line[7:-3]
            #check to make sure switch is up
            if pingCheck(switchName):
                print 'Ping success. Host is reachable.'
                searchString = endSWLine + switchName + '<>'
                #THIS PART IS SUCKING, WORKS WITH DIRECT STRING PASS
                #WONT WORK WITH A VARIABLE
                if searchForString(searchString, fileLoc):
                    print 'found!'
                else:
                    print 'not found'
Any advice or guidance would be extremely helpful.
Hard to tell without the file's contents, but I would try
switchName = line[7:-2]
So that would look like
>>> '<begSW=UNIQUE-DNS-NAME-HERE<>'[7:-2]
'UNIQUE-DNS-NAME-HERE'
Additionally, you could look into regex searches to make your cleanup more versatile.
import re
# re.findall(search_expression, string_to_search)
>>> re.findall('\=(.+)(?:\<)', '<begSW=UNIQUE-DNS-NAME-HERE<>')[0]
'UNIQUE-DNS-NAME-HERE'
>>> re.findall('\=(.+)(?:\<)', ' <portNumbers=80,443,22<>')[0]
'80,443,22'
I found "how to recursively iterate over XML tags in Python using ElementTree?" and used the methods detailed there to parse an XML file instead of using a TXT file.
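For context, a minimal sketch of that recursive-iteration approach; the file name and layout here are hypothetical, and ElementTree's iter() simply walks every element in the tree:

import xml.etree.ElementTree as ET

tree = ET.parse('switches.xml')    # hypothetical file name
root = tree.getroot()
for elem in root.iter():           # iter() visits every element recursively
    print elem.tag, elem.text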

Use of regular expression to exclude characters in file rename, Python?

I am trying to rename files so that they contain an ID followed by a -(int). The files generally come to me in this way but sometimes they come as 1234567-1(crop to bottom).jpg.
I have been trying to use the following code but my regular expression doesn't seem to be having any effect. The reason for the walk is that we have to handle large directory trees with many images.
def fix_length():
    for root, dirs, files in os.walk(path):
        for fn in files:
            path2 = os.path.join(root, fn)
            filename_zero, extension = os.path.splitext(fn)
            re.sub("[^0-9][-]", "", filename_zero)
            os.rename(path2, filename_zero + extension)

fix_length()
I have inserted print statements for filename_zero before and after the re.sub line and I am getting the same result (i.e. 1234567-1(crop to bottom), not what I wanted).
This raises an exception as the rename is trying to create a file that already exists.
I thought perhaps adding the [-] in the regex was the issue but removing it and running again I would then expect 12345671.jpg but this doesn't work either. My regex is failing me or I have failed the regex.
Any insight would be greatly appreciated.
As a follow up, I have taken all the wonderful help and settled on a solution to my specific problem.
import os
import re
import shutil

path = r'C:\Archive'
errors = r'C:\Test\errors'
num_files = []

def best_sol():
    num_files = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            filename_zero, extension = os.path.splitext(fn)
            path2 = os.path.join(root, fn)
            ID = re.match(r'^\d{1,10}', fn).group()
            if len(ID) <= 7:
                if ID not in num_files:
                    num_files = []
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
                else:
                    num_files.append(ID)
                    suffix = str(len(num_files))
                    os.rename(path2, os.path.join(root, ID + '-' + suffix + extension))
            else:
                shutil.copy(path2, errors)
                os.remove(path2)
This code creates an ID based upon (up to) the first 10 numeric characters in the filename. I then use a list that stores the instances of this ID and use the length of the list to append a suffix. The first file will have a -1, the second a -2, etc.
I am only interested in IDs with a length of 7 (or that is all they should be), but I read up to 10 to allow for human error in labelling. All files with an ID longer than 7 are moved to a folder where we can investigate.
Thanks for pointing me in the right direction.
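For what it's worth, a shorter sketch of the same suffix-numbering idea using collections.Counter instead of a reset list; it keeps the question's assumption that each filename starts with the numeric ID (path and errors as above, the function name here is made up):

import os
import re
import shutil
from collections import Counter

path = r'C:\Archive'
errors = r'C:\Test\errors'
seen = Counter()   # how many files have used each ID so far

def rename_with_counter():
    for root, dirs, files in os.walk(path):
        for fn in files:
            path2 = os.path.join(root, fn)
            filename_zero, extension = os.path.splitext(fn)
            ID = re.match(r'^\d{1,10}', fn).group()
            if len(ID) <= 7:
                seen[ID] += 1      # first file gets -1, the second -2, and so on
                os.rename(path2, os.path.join(root, ID + '-' + str(seen[ID]) + extension))
            else:
                shutil.copy(path2, errors)   # park over-long IDs for investigation
                os.remove(path2)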
re.sub() returns the altered string, but you ignore the return value.
You want to re-assign the result to filename_zero:
filename_zero = re.sub("[^\d-]", "", filename_zero)
I've corrected your regular expression as well; this removes anything that is not a digit or a dash from the base filename:
>>> re.sub(r'[^\d-]', '', '1234567-1(crop to bottom)')
'1234567-1'
Remember, strings are immutable, you cannot alter them in-place.
If all you want is the leading digits, plus optional dash-digit suffix, select the characters to be kept, rather than removing what you don't want:
filename_zero = re.match(r'^\d+(?:-\d)?', filename_zero).group()
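Applied to the example name from the question:

>>> re.match(r'^\d+(?:-\d)?', '1234567-1(crop to bottom)').group()
'1234567-1'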
new_filename = re.sub(r'^([0-9]+)-([0-9]+)', r'\g<1>-\g<2>', filename_zero)
Try using this regular expression instead, I hope this is how regular expressions work in Python, I don't use it often. You also appear to have forgotten to assign the value returned by the re.sub call to the filename_zero variable.

Trouble returning variable in python function

I'm simply trying to modify a string and return the modified string; however, I'm getting "None" returned when I print the variable.
def AddToListTwo(self, IndexPosition):
    filename = RemoveLeadingNums(self, str(self.listbox1.get(IndexPosition)))  # get the filename, remove the leading numbers if there are any
    print filename  # this prints None
    List2Contents = self.listbox2.get(0, END)
    if(filename not in List2Contents):  # make sure the file isn't already in list 2
        self.listbox2.insert(0, filename)

def RemoveLeadingNums(self, words):
    if(isinstance(words, str)):
        match = re.search(r'^[0-9]*[.]', words)
        if match:  # if there is a match, remove it, send it through to make sure there aren't repeating numbers
            RemoveLeadingNums(self, re.sub(r'^[0-9]*[.]', "", str(words)).lstrip())
        else:
            print words  # this prints the value correctly
            return words
    if(isinstance(words, list)):
        print "list"
edit - multiple people have commented saying I'm not returning the value if there is a match. I don't want to return it if there is. It could be repeating (ex: 1.2. itema). So, I wanted to essentially use recursion to remove it, and THEN return the value.
There are multiple conditions where RemoveLeadingNums returns None. e.g. if the if match: branch is taken. Perhaps that should be:
if match:
    return RemoveLeadingNums(...
You also return None if you have any datatype that isn't a string passed in.
You're not returning anything in the case of a match. It should be:
return RemoveLeadingNums( ... )
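A minimal corrected sketch, keeping the recursion from the question but returning its result (shown as a standalone function, so the self parameter is dropped):

import re

def RemoveLeadingNums(words):
    if isinstance(words, str):
        match = re.search(r'^[0-9]*[.]', words)
        if match:
            # return the recursive result instead of discarding it
            return RemoveLeadingNums(re.sub(r'^[0-9]*[.]', "", words).lstrip())
        return words
    if isinstance(words, list):
        print "list"

print RemoveLeadingNums("1.2. itema")   # prints: itema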

Pyparsing: How can I parse data and then edit a specific value in a .txt file?

My data is located in a .txt file (no, I can't change it to a different format) and it looks like this:
variablename = value
something = thisvalue
youget = the_idea
Here is my code so far (taken from the examples in Pyparsing):
from pyparsing import Word, alphas, alphanums, Literal, restOfLine, OneOrMore, \
    empty, Suppress, replaceWith

input = open("text.txt", "r")
src = input.read()

# simple grammar to match #define's
ident = Word(alphas + alphanums + "_")
macroDef = ident.setResultsName("name") + "= " + ident.setResultsName("value") + Literal("#") + restOfLine.setResultsName("desc")

for t, s, e in macroDef.scanString(src):
    print t.name, "=", t.value
So how can I tell my script to edit a specific value for a specific variable?
Example:
I want to change the value of variablename, from value to new_value.
So essentially variable = (the data we want to edit).
I probably should make it clear that I don't want to go directly into the file and change the value by changing value to new_value but I want to parse the data, find the variable and then give it a new value.
Even though you have already selected another answer, let me answer your original question, which was how to do this using pyparsing.
If you are trying to make selective changes in some body of text, then transformString is a better choice than scanString (although scanString or searchString are fine for validating your grammar expression by looking for matching text). transformString will apply token suppression or parse action modifications to your input string as it scans through the text looking for matches.
# alphas + alphanums is unnecessary, since alphanums includes all alphas
ident = Word(alphanums + "_")

# I find this shorthand form of setResultsName is a little more readable
macroDef = ident("name") + "=" + ident("value")

# define values to be updated, and their new values
valuesToUpdate = {
    "variablename": "new_value"
}

# define a parse action to apply value updates, and attach to macroDef
def updateSelectedDefinitions(tokens):
    if tokens.name in valuesToUpdate:
        newval = valuesToUpdate[tokens.name]
        return "%s = %s" % (tokens.name, newval)
    else:
        raise ParseException("no update defined for this definition")

macroDef.setParseAction(updateSelectedDefinitions)

# now let transformString do all the work!
print macroDef.transformString(src)
Gives:
variablename = new_value
something = thisvalue
youget = the_idea
For this task you do not need a special utility or module.
What you need is to read the lines and split them into a list, so the first index is the left side and the second index is the right side.
If you need these values later you might want to store them in a dictionary.
Well, here is a simple way, for somebody new to Python. Uncomment the lines with print to use them for debugging.
f = open("conf.txt", "r")
txt = f.read()  # all text is in txt
f.close()

fwrite = open("modified.txt", "w")

splitedlines = txt.splitlines()
#print splitedlines
for line in splitedlines:
    #print line
    conf = line.split('=')
    #conf[0] is what is on the left and conf[1] is what is on the right
    #print conf
    if conf[0].strip() == "youget":  # strip() handles the spaces around '='
        #we get this
        conf[1] = " the_super_idea"  #the_idea is now the_super_idea
    #join conf with '=' and write
    newline = '='.join(conf)
    #print newline
    fwrite.write(newline + "\n")
fwrite.close()
Actually, you should have a look at the ConfigParser module, which parses exactly your syntax (you only need to add a [section] at the beginning).
If you insist on your implementation, you can create a dictionary:
dictt = {}
for t, s, e in macroDef.scanString(src):
    dictt[t.name] = t.value
dictt[variable] = new_value
ConfigParser
import ConfigParser
config = ConfigParser.RawConfigParser()
config.read('example.txt')
variablename = config.get('variablename', 'float')
It'll yell at you if you don't have a [section] header, though, but it's ok, you can fake one.
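A hedged sketch of how "faking one" could look: prepend a dummy section header before handing the text to the parser (the [defaults] section name here is made up):

import ConfigParser
import StringIO

with open('example.txt') as f:
    faked = StringIO.StringIO('[defaults]\n' + f.read())

config = ConfigParser.RawConfigParser()
config.readfp(faked)                          # readfp accepts any file-like object
print config.get('defaults', 'variablename')  # prints: value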
