There are probably several ways to solve this problem, so I'm open to any ideas.
I have a file, and within that file is the string "D133330593". Note: I know the exact position of this string within the file, but I don't know if that helps.
Following this string there are 6 digits, and I need to replace these 6 digits with 6 other digits.
This is what I have so far:
def editfile():
    f = open(filein,'r')
    filedata = f.read()
    f.close()
    #This is the line that needs help
    newdata = filedata.replace( -TOREPLACE- ,-REPLACER-)
    #Basically what I need is something that lets me say "D133330593******"
    #->"D133330593123456" Note: The following 6 digits don't need to be
    #anything specific, just different from the original 6
    f = open(filein,'w')
    f.write(newdata)
    f.close()
Use the re module to define your pattern and then use the sub() function to substitute occurrences of that pattern with your own string.
import re
...
pat = re.compile(r"D133330593\d{6}")
# sub() returns a new string; the original filedata is left unchanged
newdata = re.sub(pat, "D133330593abcdef", filedata)
The above defines a pattern as your string ("D133330593") followed by six decimal digits. The next line then replaces ALL occurrences of this pattern with the replacement string (the prefix followed by "abcdef" in this case), if that is what you want.
If you want a different replacement string for each occurrence of the pattern, you could use the count keyword argument of sub(), which limits the number of replacements made; for example, count=1 replaces only the first occurrence.
Check out this library for more info - https://docs.python.org/3.6/library/re.html
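As a rough sketch of how this could be wrapped up, the helper below keeps the prefix through a capturing group and swaps only the six digits. The name replace_digits and its parameters are made up for illustration and are not part of the re module:

```python
import re

# Hypothetical helper: replace the six digits that follow a fixed prefix.
def replace_digits(text, prefix="D133330593", new_digits="123456", count=0):
    # re.escape guards against regex metacharacters in the prefix.
    pattern = re.compile("(" + re.escape(prefix) + r")\d{6}")
    # A function replacement avoids ambiguous backreferences such as "\1123456".
    # count=0 means "replace every occurrence".
    return pattern.sub(lambda m: m.group(1) + new_digits, text, count=count)

print(replace_digits("zshisjD133330593090909fdjgsl"))
```

The result could then be written back to the file in the same way as newdata in the question.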
Let's simplify your problem to you having a string:
s = "zshisjD133330593090909fdjgsl"
and you want to replace the 6 characters after "D133330593" with "123456" to produce:
"zshisjD133330593123456fdjgsl"
To achieve this, we first need to find the index of "D133330593". This is done by just using str.index:
i = s.index("D133330593")
Then replace the next 6 characters, but for this, we should first calculate the length of our string that we want to replace:
l = len("D133330593")
then do the replace:
s[:i+l] + "123456" + s[i+l+6:]
which gives us the desired result of:
'zshisjD133330593123456fdjgsl'
I am sure that you can now integrate this into your code to work with a file, but this is how you can do the heart of your problem.
Note that using variables as above is the right thing to do, as it is more efficient than recalculating the values each time. Nevertheless, if your file isn't too long (i.e. efficiency isn't much of a concern), you can do the whole process outlined above in one line:
s[:s.index("D133330593")+len("D133330593")] + "123456" + s[s.index("D133330593")+len("D133330593")+6:]
which gives the same result.
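The same slicing steps could also be collected into a small function. This is only a sketch; replace_after is a hypothetical name, and it uses str.find rather than str.index so that a missing marker leaves the string unchanged instead of raising ValueError:

```python
# Hypothetical helper built from the slicing steps above.
def replace_after(s, marker, replacement):
    i = s.find(marker)  # -1 when the marker is absent
    if i == -1:
        return s
    start = i + len(marker)
    # Overwrite exactly len(replacement) characters after the marker.
    return s[:start] + replacement + s[start + len(replacement):]

print(replace_after("zshisjD133330593090909fdjgsl", "D133330593", "123456"))
```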
I'm trying to find from an external file a user-inputted expression and the 5 words (as flexible as possible) around it. However, the regex to find 5 words is taking far too long to complete
'(?:(.+)?\w+(.+)? ){5}'
So to create the expression I'm using:
exp='(?:(.+)?\w+(.+)? ){5}'
find=input("What would you like to find?")
exp+=find
exp+='(?:(.+)?\w+(.+)? ){5}'
I know the problem isn't with the actual code because using an expression like .20{} works fine.
It would be much faster to first find the line that contains the word, and then get the surrounding words once you've found that line.
Currently the regex has to be matched against a much longer string because of the 5-word requirement.
So just find the word, then parse the surrounding elements (you can even use the regex here if you need to).
Instead of a regex, you should use normal string operations.
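# Assumes userInput actually occurs in fileContent (find returns -1 otherwise).
wordPos = fileContent.find(userInput)
wordAmount = 5
# Walk left past wordAmount + 1 spaces. rfind(sub, start, end) searches
# backwards below `end`, so the end bound has to shrink on every pass.
extractionBegin = wordPos
for i in range(wordAmount + 1):
    extractionBegin = fileContent.rfind(' ', 0, max(extractionBegin, 0))
# Walk right past wordAmount + 1 spaces, starting after the matched text.
extractionEnd = wordPos + len(userInput) - 1
for i in range(wordAmount + 1):
    extractionEnd = fileContent.find(' ', extractionEnd + 1)
    if extractionEnd == -1:  # ran out of text
        extractionEnd = len(fileContent)
        break
print(fileContent[extractionBegin + 1:extractionEnd])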
I am new to Python and I am trying to get some contents from a file using regex. I upload a file, load it into memory and then run this regular expression. I want to take the names from the file, but it also needs to work with names that have spaces, like "Marie Anne". So imagine that the array of names has these values:
all_names = [{"name": "Marie Anne", "id": 1}, {"name": "Johnathan", "id": 2}, {"name": "Marie", "id": 3}, {"name": "Anne", "id": 4}, {"name": "John", "id": 5}]
And the string that I am searching might have multiple occurrences, and it's multiline.
print all_names # this is an array of id and name, in descending order of name length
textToStrip = stdout.decode('ascii', 'ignore').lower()
for i in range(len(all_names)):
    print all_names[i]
    m = re.search(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', textToStrip)
    if m:
        textToStrip = re.sub(r'\W' + re.escape(unicode(all_names[i]['name'].lower())) + r'\W', "", textToStrip, 100)
        print "found " + all_names[i]['name']
print textToStrip
The script finds the names, and the re.sub line removes each found name from the text so that "Marie Anne" and "Marie" are not both taken from the same occurrence. However, it also removes the extra characters, like "," or ".", before or after the name.
Any help would be much appreciated... or if you have a better solution for this problem, even better.
The characters on both sides are deleted because you have \W included in the re.sub() regexp, and re.sub replaces everything the regexp matches when called the way you call it.
There's an alternate way to do this. If you wrap the parts that you want to keep in the matched regexp with grouping parens, and if you call re.sub with a callable (a function) instead of the new string, that function can extract the group values from the match object passed to it and assemble a return value that preserves them.
Read documentation for re.sub for details.
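A minimal sketch of that idea follows, assuming the text has already been lowercased as in the question. The function name remove_name is invented for this example:

```python
import re

# Hypothetical helper: delete `name` but keep the non-word characters around it.
def remove_name(text, name):
    # Group 1 and group 2 capture the boundary characters on each side.
    pattern = re.compile(r"(\W)" + re.escape(name) + r"(\W)")
    # The callable receives the match object and returns only the boundaries.
    return pattern.sub(lambda m: m.group(1) + m.group(2), text)

print(remove_name("hi, marie anne, and marie.", "marie anne"))
```

Because \W has to consume a character, a name at the very start or end of the text is not matched by this pattern.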
I have the following string:
fname="VDSKBLAG00120C02 (10).gif"
How can I extract the value 10 from the string fname (using re)?
A simpler regex is \((\d+)\):
regex = re.compile(r'\((\d+)\)')
value = int(re.search(regex, fname).group(1))
regex = re.compile(r"(?<=\()\d+(?=\))")
value = int(re.search(regex, fname).group(0))
Explanation:
(?<=\() # Assert that the previous character is a (
\d+ # Match one or more digits
(?=\)) # Assert that the next character is a )
Personally, I'd use this regex:
^.*\((\d+)\)(?:\.[^().]+)?$
With this, I can pick the last number in parentheses (captured in group 1), just before the extension (if any). It won't pick some random number in parentheses from the middle of the file name. For example, it should correctly pick out 2 from SomeFilmTitle.(2012).RippedByGroup (2).avi. The only drawback is that it can't tell whether a number right before the extension is meant as a counter or is part of the name: SomeFilmTitle (2012).avi.
I assume that the extension of the file, if any, does not contain ( or ).
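To try the anchored pattern out, here is a small sketch with a capturing group around the digits so that the number can be read from group 1. The names LAST_PAREN_NUMBER and last_number are made up for this example:

```python
import re

# Anchored pattern: the parenthesised number must sit right before the
# optional extension at the end of the name.
LAST_PAREN_NUMBER = re.compile(r"^.*\((\d+)\)(?:\.[^().]+)?$")

def last_number(fname):
    m = LAST_PAREN_NUMBER.match(fname)
    # None signals that no trailing "(digits)" was found.
    return int(m.group(1)) if m else None

print(last_number("VDSKBLAG00120C02 (10).gif"))
print(last_number("SomeFilmTitle.(2012).RippedByGroup (2).avi"))
```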
So essentially I am trying to read lines from multiple files in a directory and use a regex to find the beginnings of a sort of timestamp. I also want to place a list of months within the regex and then keep a counter for each month based on how many times it appears. I have some code below, but it is still a work in progress. I know I closed off date_parse, but that's why I'm asking. Please also suggest a more efficient method if you can think of one. Thanks.
months = ['Jan','Feb','Mar','Apr','May','Jun',\
          'Jul','Aug','Sep','Oct','Nov','Dec']
date_parse = re.compile('[Date:\s]+[[A-Za-z]{3},]+[[0-9]{1,2}\s]')
counter=0
for line in sys.stdin:
    if date_parse.match(line):
        for month in months in line:
            print '%s %d' % (month, counter)
In a regular expression, you can have a list of alternative patterns, separated using vertical bars.
http://docs.python.org/library/re.html
import re
import sys
from collections import defaultdict
date_parse = re.compile(r'Date:\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)')
c = defaultdict(int)  # maps month name -> number of matching lines
for line in sys.stdin:
    m = date_parse.match(line)
    if m is None:
        # pattern did not match
        # could handle error or log it here if desired
        continue # skip to handling next input line
    month = m.group(1)
    c[month] += 1
Some notes:
I recommend you use a raw string (with r'' or r"") for a pattern, so that backslashes will not become string escapes. For example, inside a normal string, \s is not an escape and you will get a backslash followed by an 's', but \n is an escape and you will get a single character (a newline).
In a regular expression, when you enclose a series of characters in square brackets, you get a "character class" that matches any of the characters. So when you put [Date:\s]+ you would match Date: but you would also match taD:e or any other combination of those characters. It's perfectly okay to just put in a string that should match itself, like Date:.
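The difference between the character class and the literal text can be checked with a short sketch like this one (the variable names are only for illustration):

```python
import re

char_class = re.compile(r"[Date:\s]+")  # any run of D, a, t, e, ':' or whitespace
literal = re.compile(r"Date:\s+")       # the exact text "Date:" then whitespace

# The character class accepts a scrambled prefix; the literal does not.
print(bool(char_class.match("taD:e Jan")))
print(bool(literal.match("taD:e Jan")))
print(bool(literal.match("Date: Jan")))
```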
I need to create a regexp to match strings like this 999-123-222-...-22
The string can be finished by &Ns=(any number) or without this... So valid strings for me are
999-123-222-...-22
999-123-222-...-22&Ns=12
And following are not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but did not manage to solve it, so I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
/^[\d-]+(&Ns=\d+)?$/
/^999-123-222-\.\.\.-22(&Ns=\d+)?$/
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
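The two patterns above are written as regex literals with / delimiters. As a sketch, they could be tried in Python like this, where ANY_DIGITS and LITERAL are names chosen just for this example:

```python
import re

# Same patterns as above, without the surrounding / delimiters.
ANY_DIGITS = re.compile(r"^[\d-]+(&Ns=\d+)?$")
LITERAL = re.compile(r"^999-123-222-\.\.\.-22(&Ns=\d+)?$")

for candidate in ["999-123-222-...-22",
                  "999-123-222-...-22&Ns=12",
                  "999-123-222-...-22&N=1"]:
    # bool() turns the match object (or None) into True/False.
    print(candidate, bool(LITERAL.match(candidate)))
```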
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you would be better off using a string function.
If you want to allow any numbers between - you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If the numbers must be of only 3 digits and the last number of only 2 digits you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number with extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=') # splits on Ns and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked to see if it is numeric with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()