Unable to populate a list using re.findall() - python

I'm working on a simple personal project that requires me to learn regular expressions. I have successfully used findall() once before in my program:
def getStats():
    playername = input("Enter your OSRS name: ")
    try:
        with urllib.request.urlopen("https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=" + playername) as response:
            page = str(response.read())
            player.levels = re.findall(r',(\d\d),', page)
This worked fine and populated the list exactly as I wanted. I'm now trying to do something similar with a text file.
The text file contains a string, followed by a lot of digits, and then another string followed by a lot of digits, etc. I just want to populate a list with the text and ignore the digits, but I get no matches (the list is empty):
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r',(\D\D),', q)
            print(questList)
Pythex link: https://pythex.org/?regex=%5CD%5CD&test_string=Desert%20Treasure%2C0%2C0%2C0%2C12%0AContact!%2C0%2C0%2C11%2C0%2C0%2C0%2C5%0ACook%27s%20Assistant%2C0%2C0%2C0%2C0%0AHorror%20from%20the%20Deep%2C0%2C0%2C13&ignorecase=0&multiline=0&dotall=0&verbose=0
I've gotten some help with the pattern and edited accordingly, but the list is still printing empty:
def getQuests():
    try:
        with open("quests.txt") as file:
            q = file.read()
            questList = re.findall(r'^(\D+),', q)

Your pattern is incorrect. First, the demo site you linked displays adjacent matches as one single match, which is misleading: \D\D matches exactly two non-digit characters. Also, the pattern in your demo does not include the commas that the pattern in your code has. Anyway, here is a correct pattern:
^(\D+),
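With the re.M (multiline) flag, ^ matches at the start of every line; without it, ^ only matches at the very start of the string. The pattern then matches at least one non-digit character, followed by a comma. The first group contains the string you want to match.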
Demo: https://regex101.com/r/pViF0h/2
In code:
import re
text = '''Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13'''
print(re.findall(r'^(\D+),', text, re.M))
# ['Desert Treasure', 'Contact!', "Cook's Assistant", 'Horror from the Deep']
If you want the first field no matter what it contains (digits included), you can also use:
^(.+?),
Also, for files like this, it is usually a much better idea to read them as CSV and extract what you need that way.
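As a rough sketch of the CSV approach, assuming the file has one quest per line with the name in the first column (as in the sample data above), the standard csv module can pull out the names without any regex. The helper name read_quest_names is made up for illustration, and the data is read from an in-memory string rather than the real quests.txt.

```python
import csv
import io

# Sample data copied from the answer above; in the real program this
# would come from open("quests.txt", newline="") instead.
SAMPLE = """Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13
"""

def read_quest_names(file_obj):
    """Return the first column of every non-empty CSV row (hypothetical helper)."""
    return [row[0] for row in csv.reader(file_obj) if row]

quest_names = read_quest_names(io.StringIO(SAMPLE))
print(quest_names)
```

Unlike \D+, this keeps working if a quest name ever contains a digit.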

Your TypeError solution is correct.
Without knowing what that webpage looks like, I do see one problem. In your working example you use ',(\d\d),', but in the failing one you use ',(\D\D),'. \d matches any digit character, whereas \D matches any non-digit character, so ',(\D\D),' only finds fields that are exactly two non-digit characters long.

Related

Find n characters after a specific string in Python

I have a webpage's source. It's just a ton of random numbers, letters and function names, saved as a string in Python 3. I want to find the text "followerCount": in this string, but I also want the n characters that follow it, which should contain the piece of text I'm looking for. Can I search for a specific part of a string and the n characters that follow it in Python 3?
Use .find() to get the position:
html = "... lots of html source ..."
position = html.find('"followerCount":')
Then use string slicing to extract that part of the string:
n = 50 # or however many characters you want
print(html[position:position+n])
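One thing to watch for: str.find() returns -1 when the text is absent, and slicing from -1 silently returns the wrong characters. A minimal sketch that guards against this is below; the helper name text_after is hypothetical, and unlike the slice above it skips past the marker itself and returns only the characters after it.

```python
def text_after(source, marker, n):
    """Return up to n characters following marker, or None if marker is absent."""
    position = source.find(marker)
    if position == -1:  # str.find signals "not found" with -1
        return None
    start = position + len(marker)
    return source[start:start + n]

html = 'a bunch of random_characters/"followerCount":123_more_junk'
print(text_after(html, '"followerCount":', 3))  # 123
print(text_after(html, '"missing":', 3))        # None
```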
A standard way of looking for text based on a pattern is a regex. For example, here you can ask for any three characters following "followerCount":
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'(?<="followerCount":).{3}', s)
if match:
    print(match.group(0))
# prints '123'
Alternatively you can make a regex without the lookbehind and capture the three characters in a group:
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'"followerCount":(.{3})', s)
if match:
    print(match.group(1))
# prints '123'

use dynamic int variable inside regex pattern python

I'm in my early days of learning Python, so sorry if this question has already been asked.
I'm writing here because the existing answers didn't help me. My requirement is to read a file and print all the URLs in it. Inside a for loop, the regex pattern I used was [^https://][\w\W]*, and it worked fine. But I wanted to know whether I can dynamically pass the length of the line after https:// and use that number of occurrences instead of *.
I tried [^https://][\w\W]{var}} where var=len(line)-len(https://).
These are some other patterns I tried:
pattern = '[^https://][\w\W]{'+str(int(var))+'}'
pattern = r'[^https://][\w\W]{{}}'.format(var)
pattern = r'[^https://][\w\W]{%s}'%var
I might be misunderstanding your question, but if you know that the URL always starts with https://, then that prefix is the first eight characters. You can then slice off everything after it once you have found the URLs:
# Example of list containing urls - you should fill that with your for loop
list_urls = ['https://stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://stackoverflow.com']
for url in list_urls:
    print(url[8:])
Output:
stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
stackoverflow.com
Instead of a for loop, you could find all URLs using re.findall:
import re
# Rough pattern: scheme, dotted host name, then an optional path.
url_pattern = r"((https://)([\w-]+\.)+[\w-]+([\w%/~+#.-]*))"
# text refers to your document, which should be read before this
urls = re.findall(url_pattern, text)
# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))
# print the urls
print(unique_urls)
In your pattern you use [^https://], which is a negated character class: [^...] matches any single character except the ones listed (here h, t, p, s, : and /). It does not match the literal string https://.
One option is to make use of literal string interpolation. Assuming your links do not contain spaces, you could use \S instead of [\w\W] as the latter variant will match any character including spaces and newlines.
\bhttps://\S{{{var}}}(?!\S)
The assertion (?!\S) at the end is a whitespace boundary to prevent partial matches and the word boundary \b will prevent http being part of a larger word.
For example
import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"
var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"
print(re.findall(pattern, lines))
Output
['https://www.test.com', 'https://thisisatestt']
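As a small illustration of why the brace-based attempts in the question behave differently, here is a sketch of how the quantifier string gets built. The value 12 is just an assumed example length (that of "www.test.com"); in str.format and f-strings, doubled braces produce literal braces, which is why {{}} never inserts the value.

```python
import re

var = 12  # assumed example length, e.g. len("www.test.com")

# "{{" and "}}" are escapes for literal braces, so "{{}}" yields "{}"
# and the argument passed to format() is simply ignored.
attempt = r"\S{{}}".format(var)
print(attempt)  # \S{}

# Three ways that do build the quantifier {12}:
by_concat = r"\S{" + str(var) + "}"
by_percent = r"\S{%s}" % var
by_fstring = rf"\S{{{var}}}"  # {{ -> {, {var} -> 12, }} -> }
print(by_concat, by_percent, by_fstring)

print(re.fullmatch(by_fstring, "www.test.com") is not None)  # True
```

So the concatenation and %-formatting attempts did produce a valid quantifier; the part that needed changing there was the [^https://] character class.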

Finding last word in tweepy tweet response python

I am receiving a stream of tweets with python and would like to extract the last word or know where to reference it.
for example in
NC don’t like working together www.linktowtweet.org
get back
together
I am not familiar with tweepy, so I am presuming you have the data in a Python string; there may be a better answer.
However, given a string in Python, it is simple to extract the last word.
Solution 1
Use str.rfind(' '). The idea here is to find the space preceding the last word. Here is an example.
text = "NC don’t like working together"
text = text.rstrip() # Remove any whitespace at the end that would otherwise confuse the algorithm.
last_word = text[text.rfind(' ')+1:] # Output every character *after* the space.
print(last_word)
Note: If a string is given with no words, last_word will be a blank string.
Now this presumes that all of the words are separated by spaces. To handle newlines and tabs as well, use str.replace to turn them into spaces. The whitespace characters in Python are ' \t\n\r\x0b\x0c', but I presume only newlines and tabs will be found in Twitter messages.
Also see: string.whitespace
So a complete example (wrapped as a function) would be
def last_word(text):
    text = text.replace('\n', ' ') # Replace newlines with spaces.
    text = text.replace('\t', ' ') # Replace tabs with spaces.
    text = text.rstrip(' ') # Remove trailing spaces.
    return text[text.rfind(' ')+1:]
print(last_word("NC don’t like working together")) # Outputs "together".
This may still be the best solution for basic parsing. There is something better for larger problems.
Solution 2
Regular Expressions
These are a much more flexible way to handle strings in Python. Regexes, as they are often called, use their own language to specify a portion of text.
For example, .*\s(\S+) specifies the last word in a string.
Here it is again with a longer explanation.
.* # Match as many characters as possible.
\s # Until a whitespace ("\t\n\x0b\x0c\r ")
( # Remember the next section for the answer.
\S+ # Match a "word": as many non-whitespace characters as possible.
) # End saved section.
So then, in python you would use this as follows.
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was not a last word to this text.
        return ''
    return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together")) # Outputs "together".
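The annotated breakdown of the pattern can also live in the code itself: with the re.VERBOSE flag, whitespace and # comments inside the pattern are ignored. A minimal sketch follows; the names LAST_WORD_VERBOSE and last_word_verbose are made up for this illustration.

```python
import re

LAST_WORD_VERBOSE = re.compile(r"""
    .*      # match as many characters as possible
    \s      # up to a whitespace character
    (       # start of the captured group
      \S+   # one or more non-whitespace characters (the word)
    )       # end of the captured group
""", re.DOTALL | re.VERBOSE)

def last_word_verbose(text):
    m = LAST_WORD_VERBOSE.match(text)
    return m.group(1) if m else ''

print(last_word_verbose("NC don’t like working together"))  # together
print(repr(last_word_verbose("together")))  # '' (no whitespace before the word)
```

Note the second call: because the pattern requires a whitespace character before the word, a single-word string produces no match, whereas the rfind version returns the word itself.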
Now, even though this method is a lot less obvious, it has a couple of advantages. First off, it is a lot more customizable. If you wanted to match the final word, but not links, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" would match the last word, but ignore a link if that was the last thing.
Example:
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
def last_word(text):
    m = LAST_WORD_PATTERN.match(text)
    if not m: # If there was not a last word to this text.
        return ''
    return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".
The second advantage to this method is speed.
In a timing comparison (the Try it online! demo), the regex approach was almost as fast as the string manipulation, and faster in some cases. (On my machine the regex actually ran about 0.2 µs faster than in the demo.)
Either way, the regex executes very quickly even in the simple case, and it is likely to be faster than a more complex string algorithm implemented in pure Python. So using the regex can also speed up the code.
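To check the speed claim on your own machine, a minimal timing sketch could look like the following. last_word_regex and last_word_rfind are just renamed copies of the two functions shown earlier, and the numbers will vary with hardware and Python version, so no expected timings are given.

```python
import re
import timeit

LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)

def last_word_regex(text):
    m = LAST_WORD_PATTERN.match(text)
    return m.group(1) if m else ''

def last_word_rfind(text):
    text = text.replace('\n', ' ').replace('\t', ' ').rstrip(' ')
    return text[text.rfind(' ') + 1:]

sample = "NC don’t like working together"
# Both approaches should agree before we bother timing them.
assert last_word_regex(sample) == last_word_rfind(sample) == "together"

for func in (last_word_regex, last_word_rfind):
    seconds = timeit.timeit(lambda: func(sample), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s per 100,000 calls")
```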
EDIT
Changed the url avoiding regex from
re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
to
re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
So that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.
To see how this regex works, look at https://regex101.com/r/sdwpqB/2.
Simple. If your text is:
import re
text = "NC don’t like working together www.linktowtweet.org"
text = re.sub(r'(?:https?://|www\.)\S+', '', text) # remove any URL, with or without a scheme
words = text.split() # split the sentence into words on whitespace
last_word = words[-1]
So there you go! Now you'll get the last word, "together".

How do I create a regex with regular variable and some fixed text in Python?

In my code, I only want to fetch a variable name from a C file where it is used in an if condition.
The following is the regex code snippet:
fieldMatch = re.findall(itemFieldList[i]+"=", codeline, re.IGNORECASE);
Here I can find the variable itemFieldList[i] in the file.
But when I try to add if as shown below, nothing is extracted, even though the variable exists in an if condition in the C code.
fieldMatch = re.findall(("^(\w+)if+[(](\w+)("+itemFieldList[i]+")="), codeline, re.IGNORECASE|re.MULTILINE);
Can anyone suggest how to create a regex for this scenario?
Sample Input :
IF(WORK.env_flow_ind=="R")
OR
IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")
here itemFieldList[i] = WORK.env_flow_ind
I don't have enough reputation to make this a comment, which it should be, and I can't say that I fully understand the question. But to point out a few things:
If it's about adding variables to your regex, then you should use string templates to make it more understandable for us and your future self.
"^{}".format(variable)
Doing that will allow you to create a dynamic regex that searches for what you want.
Secondly, I don't think that is your problem. I think that your regex is malformed. I don't know exactly what you are trying to search for, but I recommend reading the Python regex documentation and testing your regex on a resource like regex101 to make sure that you're capturing what you intend to. From what I can see, you are a bit confused about groups. When you put parentheses around a pattern, you are identifying it as a group. You were on the right track trying to match the literal parenthesis by surrounding it with square brackets, but it's simpler and cleaner to escape it.
If you are trying to capture this statement:
if(someCondition == fries)
and you want to extract the keyword fries, a valid pattern is:
(?=if\((?:[\w=\s])+(fries)\))
Since you want this to be dynamic you would replace the string fries with your string template, and you'll get code that ends up something like this:
p = re.compile(r"(?=if\((?:[\w=\s])+({})\))".format(re.escape(search)), re.IGNORECASE)
p.findall(string)
You can build the regex pattern as:
pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"
This will look for the word “if” (in lowercase, unless you pass re.IGNORECASE), then optional whitespace, then an opening parenthesis, then as few characters as possible, and then your search term, delimited by word boundaries at both ends.
So if variablename is "WORK.env_flow_ind", then re.findall(pattern, textfile) will match the following lines:
if(blabla & WORK.env_flow_ind == "a")
if (WORK.env_flow_ind == "b")
if(WORK.env_flow_ind == "b")
if( WORK.env_flow_ind == "b")
and these won't match:
if (WORK.env_bla == "c")
if (WORK.env_flow_ind2 == "d")
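Here is a small runnable sketch of that pattern applied to the sample input from the question. The re.IGNORECASE flag is an assumption added because the sample spells the keyword as IF; the list of lines is made up from the examples in this thread.

```python
import re

variablename = "WORK.env_flow_ind"
pattern = re.compile(
    r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b",
    re.IGNORECASE,  # assumption: the sample input uses upper-case IF
)

lines = [
    'IF(WORK.env_flow_ind=="R")',
    'IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")',
    'if (WORK.env_bla == "c")',
    'if (WORK.env_flow_ind2 == "d")',
]

# Keep only the lines in which the variable appears inside an if(...).
matching = [line for line in lines if pattern.search(line)]
print(matching)
```

re.escape matters here because the variable name contains a dot, which would otherwise match any character.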

Searching and capturing a character using regular expressions Python

While going through one of the problems in Python Challenge, I am trying to solve it as follows:
Read the input in a text file with characters as follows:
DQheAbsaMLjTmAOKmNsLziVMenFxQdATQIjItwtyCHyeMwQTNxbbLXWZnGmDqHhXnLHfEyvzxMhSXzd
BEBaxeaPgQPttvqRvxHPEOUtIsttPDeeuGFgmDkKQcEYjuSuiGROGfYpzkQgvcCDBKrcYwHFlvPzDMEk
MyuPxvGtgSvWgrybKOnbEGhqHUXHhnyjFwSfTfaiWtAOMBZEScsOSumwPssjCPlLbLsPIGffDLpZzMKz
jarrjufhgxdrzywWosrblPRasvRUpZLaUbtDHGZQtvZOvHeVSTBHpitDllUljVvWrwvhpnVzeWVYhMPs
kMVcdeHzFZxTWocGvaKhhcnozRSbWsIEhpeNfJaRjLwWCvKfTLhuVsJczIYFPCyrOJxOPkXhVuCqCUgE
luwLBCmqPwDvUPuBRrJZhfEXHXSBvljqJVVfEGRUWRSHPeKUJCpMpIsrV.......
What I need is to go through this text file and pick all lower-case letters that are enclosed by exactly three upper-case letters on each side.
The python script that I wrote to do the above is as follows:
import re
pattern = re.compile("[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]")
f = open('/Users/Dev/Sometext.txt','r')
for line in f:
    result = pattern.search(line)
    if result:
        print(result.groups())
f.close()
The script above, instead of returning the captured characters (a list of lower-case letters), returns all the text blocks that meet the regular expression criteria, like
aXCSdFGHj
vCDFeTYHa
nHJUiKJHo
.........
.........
Can somebody tell me what exactly I am doing wrong here? And instead of looping through the entire file, is there an alternate way to run the regular expression search on the entire file?
Thanks
Change result.groups() to result.group(1) and you will get just the single letter match.
A second problem with your code is that it will not find multiple results on one line. So instead of using re.search you'll need re.findall or re.finditer. findall will return strings or tuples of strings, whereas finditer returns match objects.
Here's where I approached the same problem:
import re
import urllib.request
pat = re.compile('[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')
with urllib.request.urlopen(
        "http://www.pythonchallenge.com/pc/def/equality.html") as response:
    print(''.join(pat.findall(response.read().decode())))
Note that re.findall and re.finditer return non-overlapping results. So when using the above pattern with re.findall on the string 'aBBBcDDDeFFFg', your only match will be 'c', but not 'e'. Fortunately, this Python Challenge problem contains no such examples.
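To illustrate the difference in return types between re.findall and re.finditer mentioned above, here is a minimal sketch on a made-up test string (not data from the challenge).

```python
import re

pat = re.compile(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')
text = 'xAAAbCCCy--qDDDeFFFz'  # made-up sample with two separate matches

# findall returns just the captured group of each match, as strings.
print(pat.findall(text))  # ['b', 'e']

# finditer yields match objects, which also carry position information.
for m in pat.finditer(text):
    print(m.group(1), m.span())
# b (0, 9)
# e (11, 20)
```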
I'd suggest using lookaround:
(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])
This will have no problem with overlapping matches.
Explanation:
(?<=[A-Z]{3}) # assert that there are 3 uppercase letters before the current position
(?<![A-Z].{3}) # assert that there is no uppercase letter 4 characters before the current position
([a-z]) # match a lowercase character (all characters in the example are ASCII)
(?=[A-Z]{3}) # assert that there are 3 uppercase letters after the current position
(?!.{3}[A-Z]) # assert that there is no uppercase letter 4 characters after the current position
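As a quick check of the overlap behaviour, the sketch below runs both the original consuming pattern and the lookaround pattern on the string 'aBBBcDDDeFFFg' discussed earlier. The variable names consuming and lookaround are chosen here for illustration only.

```python
import re

# Original pattern: consumes the surrounding letters, so matches cannot overlap.
consuming = re.compile(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]')

# Lookaround pattern: only the single lowercase letter is consumed.
lookaround = re.compile(
    r'(?<=[A-Z]{3})(?<![A-Z].{3})([a-z])(?=[A-Z]{3})(?!.{3}[A-Z])'
)

s = 'aBBBcDDDeFFFg'
print(consuming.findall(s))   # ['c']
print(lookaround.findall(s))  # ['c', 'e']
```

One difference to keep in mind: the lookaround version only asserts that the fourth character on each side is not an uppercase letter, so it also accepts a line boundary or a non-letter there, whereas the original pattern requires a lowercase letter.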
import re
with open('/Users/Dev/Sometext.txt','r') as f:
    tokens = re.findall(r'[a-z][A-Z]{3}([a-z])[A-Z]{3}[a-z]', f.read())
for token in tokens:
    print(token)
What findall does:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Maybe the most useful function in the re module.
The read() function reads the whole file into one big string. This is especially useful if you need to match a regular expression against the whole file.
Warning: Depending on the size of the file, you may prefer iterating over the file line by line as you did in your first approach.
