I have an HTML file that I am reading the line below from. I would like to grab only the number that appears after the ':' and before the ',' using a regex. Thanks in advance.
"totalPages":15,"bloodhoundHtml"
"totalPages":([0-9]*),
Then the Python code is:
import re
p = re.compile(r'"totalPages":([0-9]*),')
# findall returns only the captured group, i.e. the digits
print(p.findall('"totalPages":15,"bloodhoundHtml"'))
You can try :\d+, to get ':15,'
and then trim the leading ':' and the trailing ',' to get just the number.
I don't know whether Python supports named groups in a regex. I'm a C# programmer, and in C# I can use :(?<id>\d+), to match this string and get the number directly with result.Groups["id"].
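For what it's worth, Python's re module does support named groups, written (?P<name>...). Here is a minimal sketch using the sample line from the question; the group name pages is only an illustrative choice:

```python
import re

line = '"totalPages":15,"bloodhoundHtml"'

# (?P<pages>\d+) is a named group that captures the digits between ':' and ','
match = re.search(r'"totalPages":(?P<pages>\d+),', line)

# Guard against a missing match before converting to int
total_pages = int(match.group("pages")) if match else None
print(total_pages)
```

This avoids the separate trimming step, since the group holds only the digits.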
:\d{1,},
Also works for parsing the line you gave. According to this post, you might run into some trouble parsing the HTML
I have this line in my .txt file:
2016CT1021
I want to make it like this:
2016-CT-1021
I tried to use Python's re.sub:
data = re.sub(r'\d\d+(?:\w\w\d\d\d\d)', r'\d\d+(?:-\w\w-\d\d\d\d)', data)
But it didn't produce the result I wanted. Could someone please help me? Thank you!
For the current example, this will work:
re.sub(r'(\d\d+)(\w\w)(\d\d\d\d)', r'\1-\2-\3', data)
You should group the parts with parentheses and use the group numbers in the replacement expression.
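The key point is that the second argument of re.sub is a replacement template, not another pattern, so the pieces you want to keep have to be captured and referred to by number. A small runnable sketch, assuming the value is always four digits, two letters, and four digits (that stricter pattern is my assumption, not something stated in the question):

```python
import re

data = "2016CT1021"

# Three capturing groups: four digits, two letters, four digits
pattern = r'(\d{4})([A-Za-z]{2})(\d{4})'

# \1, \2 and \3 in the replacement refer back to the captured groups
result = re.sub(pattern, r'\1-\2-\3', data)
print(result)
```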
I need to extract only a particular pattern from a string:
import re
string='/x/eng/wcov/Job148666--rollup_generic/Job148674--ncov_aggregate/Job148678--run_command/Job148678.info: devN_180107_2035'
line2=re.findall(r'(?:/\w*)' ,string)
print(line2)
I'm getting the output below:
['/x', '/eng', '/wcov', '/Job148666', '/Job148674', '/Job148678', '/Job148678']
But the output I actually need is:
/x/eng/wcov/Job148666--rollup_generic/Job148674--ncov_aggregate/Job148678--run_command/Job148678.info
Try using the split() function:
string='/x/eng/wcov/Job148666--rollup_generic/Job148674--ncov_aggregate/Job148678--run_command/Job148678.info: devN_180107_2035'
sp=string.split(':')[0]
Does the part you want always end at the first ':'? Then use this:
string.split(":", 1)[0]
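Here is a short sketch that runs the split on the string from the question, next to a regex that does the same job. It assumes the wanted part never contains a ':' itself:

```python
import re

string = ('/x/eng/wcov/Job148666--rollup_generic/Job148674--ncov_aggregate/'
          'Job148678--run_command/Job148678.info: devN_180107_2035')

# Split once on the first ':' and keep what comes before it
path = string.split(":", 1)[0]

# Regex alternative: match everything up to (not including) the first ':'
match = re.match(r'[^:]+', string)
path_from_regex = match.group(0) if match else None

print(path)
print(path_from_regex)
```

The original findall attempt split the path apart because \w does not match '-' or '.', so each match stopped at the first such character.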
I have a string that's
/path/to/file?_subject_ID_SOMEOTHERSTRING
The path/to/file part changes depending on the situation, and _subject_ID is always there. I am trying to write a regex that extracts only the file part of the string. Matching on ?_subject_ID is safe, but I don't know how to reliably get the file name.
My current regex looks like (.*[\/]).*\?_subject_ID
import re
url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'
file_re = re.compile(r'(.*[\/]).*\?_subject_ID')
file_re.search(url)
This finds the right string, but I still can't extract the file name.
Printing _.group(1) gives me /path/to/. What's the next step to get the actual file name?
As for your '(.*[\/]).*\?_subject_ID' regex approach, you just need to add a capturing group around the second .*. You could use r'(.*/)(.*)\?_subject_ID' (then, there will be .group(1) and .group(2) parts captured), but it is not the most appropriate way to parse URLs in Python.
You may use a non-regex approach here. This snippet shows how to leverage urlparse and os.path to parse a URL like yours:
# Python 3; in Python 2 the function lived in the urlparse module
from urllib.parse import urlparse
import os.path
# .path drops the query string that follows '?'
path = urlparse('/path/to/file?_subject_ID_SOMEOTHERSTRING').path
print(os.path.split(path)[1]) # => file
print(os.path.split(path)[0]) # => /path/to
It's pretty simple, really. Just match everything after the last / and before ?_subject_ID:
([^/?]*)\?_subject_ID
The [^/?]* (as opposed to .*) is because otherwise it'd match the part before, too. The ? in the character class
If you want to get both the path and the file, you can do much the same thing, but also grab the part up to and including the last /:
([^?]*/)([^/?]*)\?_subject_ID
It's basically the same as the one before but with the first bit captured instead of ignored.
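A quick check of this idea against the URL from the question. The pattern below is a sketch that assumes the marker is written ?_subject_ID (with the underscore, as in the question) and that the final slash belongs to the first group:

```python
import re

url = '/path/to/file?_subject_ID_SOMEOTHERSTRING'

# Group 1: everything up to and including the last '/'
# Group 2: the file name, i.e. characters that are neither '/' nor '?'
match = re.search(r'([^?]*/)([^/?]*)\?_subject_ID', url)

directory = match.group(1) if match else None
filename = match.group(2) if match else None
print(directory)
print(filename)
```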
I am parsing a bunch of HTML and am encountering a lot of "\n" and "\t" inside the code. So I am using
"something\t\n here".replace("\t","").replace("\n","")
This works, but I'm using it often. Is there a way to define a string method, along the lines of replace itself (or find, index, format, etc.), that will tidy up my code a little? Something like
"something\t\n here".noTabsOrNewlines()
I tried
class str:
    def noTabNewline(self):
        self.replace("\t","").replace("\n","")
but that was no good. Thanks for any help.
While you could do something along these lines (https://stackoverflow.com/a/4698550/1867876), the more Pythonic thing to do would be:
myString = "something\t\n here"
' '.join(myString.split())
You can see this thread for more information:
Strip spaces/tabs/newlines - python
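Since the built-in str type cannot be given new methods, a plain helper function is the usual way to get the shorter call. A minimal sketch; both function names are hypothetical, picked just for this example:

```python
def no_tabs_or_newlines(text):
    """Return text with every tab and newline character removed."""
    return text.replace("\t", "").replace("\n", "")


def collapse_whitespace(text):
    """Replace each run of whitespace with a single space and trim the ends."""
    return " ".join(text.split())


cleaned = no_tabs_or_newlines("some\tthing\n here")
collapsed = collapse_whitespace("some\tthing\n here")
print(repr(cleaned))
print(repr(collapsed))
```

Note that the two are not equivalent: the first deletes tabs and newlines outright, while the split/join version turns them into a single space, so pick whichever matches the HTML you are cleaning.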
You can try encoding='utf-8'. Otherwise, in my opinion, there is no way other than replacing it. Python also shows those spaces as '\xa0', so either way you have to replace them. Or you can read the file line by line with readline() instead of reading it all at once with read().
I am using a simple '.replace()' call on a string to replace some text with nothing, like so:
.replace("('ws-stage-stat', ", '')
I have also tried using a regex to do this, like so:
match3a = re.sub("\(\'ws-stage-stat\', ", "", match3a)
This string is extracted from the source code for the following webpage at line 684:
http://www.whoscored.com/Regions/252/Tournaments/26
I have extracted and cleaned up the rest of the code into some usable data, but this one last bit won't co-operate and stubbornly refuses to be replaced. This seems like a very straightforward problem, but it just won't work for me.
Any ideas?
Thanks
The first replacement should work. Make sure that you're assigning the result of the replacement somewhere, for example:
mystring = mystring.replace("('ws-stage-stat', ", '')
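The reason is that Python strings are immutable: replace() returns a new string and leaves the original untouched. A small sketch of the difference; the sample text is made up for illustration and is not taken from the page in question:

```python
# Hypothetical sample text, standing in for the extracted line
text = "('ws-stage-stat', 123)"

# The return value is discarded here, so text stays exactly as it was
text.replace("('ws-stage-stat', ", '')
after_bare_call = text

# Assigning the result keeps the cleaned string
cleaned = text.replace("('ws-stage-stat', ", '')

print(after_bare_call)
print(cleaned)
```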
I think you aren't escaping the regex correctly.
This is the code my "Patterns" app spit out:
re.sub("\\(\\'ws-stage-stat\\', ", "", match3a)
A quick test showed me that it works correctly.
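If you would rather not escape the literal text by hand, re.escape can build the pattern for you. A minimal sketch, again with a made-up sample string rather than the real page source:

```python
import re

prefix = "('ws-stage-stat', "

# Hypothetical sample text, standing in for the extracted line
sample = "('ws-stage-stat', 684)"

# re.escape backslash-escapes the characters that are special in a regex
stripped = re.sub(re.escape(prefix), "", sample)
print(stripped)
```

As with replace(), the result of re.sub has to be assigned to a name to have any effect.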