Find n characters after a specific string in Python - python

I have a webpage's source. It's just a ton of random numbers and letters and function names, saved as a string in python3. I want to find the text that says \"followerCount\": in the source code of this string, but I also want to find a little bit of the text that follows it (n characters). This would hopefully have the piece of text I'm looking for. Can I search for a specific part of a string and the n characters that follow it in python3?

Use .find() to get the position:
html = "... lots of html source ..."
position = html.find('"followerCount":')
Then use string slicing to extract that part of the string:
n = 50 # or however many characters you want
print(html[position:position+n])

A standard way of looking for text based on a pattern is a regex. For example here you can ask for any three characters following "followerCount:"
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'(?<="followerCount":).{3}', s)
if match:
print(match.group(0))
#prints '123'
Alternatively you can make a regex without the lookbehind and capture the three characters in a group:
import re
s = 'a bunch of randoms_characters/"followerCount":123_more_junk'
match = re.search(r'"followerCount":(.{3})', s)
if match:
print(match.group(1))
#prints '123'

Related

use dynamic int variable inside regex pattern python

I'm in my initial days of learning python, sorry if this question is already been asked.
I'm writing here as those didn't help me, my requirement is reading a file and printing all the URL's inside in it.Inside a for loop the regex pattern i had used is [^https://][\w\W]*, it worked fine. But I wanted to know if can I dynamically pass the length of line which is after https:// and get the output with occurrences instead of *
I had tried [^https://][\w\W]{var}} where var=len(line)-len(https://)
These are some other patterns I had tried like
pattern = '[^https://][\w\W]{'+str(int(var))+'}'
pattern = r'[^https://][\w\W]{{}}'.format(var)
pattern = r'[^https://][\w\W]{%s}'%var
I might be misunderstanding your question, but if you know that the url is always starting with https:// then that would be the first eight characters. Then you can get the length after finding the urls:
# Example of list containing urls - you should fill that with your for loop
list_urls = ['https://stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python', 'https://google.com', 'https://stackoverflow.com']
for url in list_urls:
print(url[8:])
Out
stackoverflow.com/questions/61006253/use-dynamic-int-variable-inside-regex-pattern-python
google.com
stackoverflow.com
Instead of a for loop you could find all urls using re.findall
import re
url_pattern = "((https:\/\/)([\w-]+\.)+[\w-]+[.+]+([\w%\/~\+#]*))"
# text referes to your document, that should be read before this
urls = re.findall(url_pattern, text)
# Using list comprehensions
# Get the unique urls by using set
# Only get text after https:// using [8:]
# Only parse the first element of the group that is returned by re.findall using [0]
unique_urls = list(set([x[0][8:] for x in urls]))
# print the urls
print(unique_urls)
In your pattern you use [^https://] which is a negated character class [^ which will match any char except the listed.
One option is to make use of literal string interpolation. Assuming your links do not contain spaces, you could use \S instead of [\w\W] as the latter variant will match any character including spaces and newlines.
\bhttps://\S{{{var}}}(?!\S)
Regex demo
The assertion (?!\S) at the end is a whitespace boundary to prevent partial matches and the word boundary \b will prevent http being part of a larger word.
Python demo
For example
import re
line = "https://www.test.com"
lines = "https://www.test.com https://thisisatestt https://www.dontmatchme"
var=len(line)-len('https://')
pattern = rf"\bhttps://\S{{{var}}}(?!\S)"
print(re.findall(pattern, lines))
Output
['https://www.test.com', 'https://thisisatestt']

Unable to populate a list using re.findall()

I'm working on a simple personal project that's required I learn to use regular expressions. I have successfully used findall() once before in my program:
def getStats():
playername = input("Enter your OSRS name: ")
try:
with urllib.request.urlopen("https://secure.runescape.com/m=hiscore_oldschool/index_lite.ws?player=" + playername) as response:
page = str(response.read())
player.levels = re.findall(r',(\d\d),', page)
This worked fine and populated the list exactly as I wanted. I'm now trying to do something similar with a text file.
The text file contains a string, followed by a lot of digits, and then another string followed by a lot of digits, etc. I just want to populate a list with the text and ignore the digits, but I get no matches (the list is empty):
def getQuests():
try:
with open("quests.txt") as file:
q = file.read()
questList = re.findall(r',(\D\D),', q)
print(questList)
Pythex link: https://pythex.org/?regex=%5CD%5CD&test_string=Desert%20Treasure%2C0%2C0%2C0%2C12%0AContact!%2C0%2C0%2C11%2C0%2C0%2C0%2C5%0ACook%27s%20Assistant%2C0%2C0%2C0%2C0%0AHorror%20from%20the%20Deep%2C0%2C0%2C13&ignorecase=0&multiline=0&dotall=0&verbose=0
I've gotten some help with the pattern and edited accordingly, but the list is still printing empty
def getQuests():
try:
with open("quests.txt") as file:
q = file.read()
questList = re.findall(r'^(\D+),', q)
Your pattern is incorrect. Firstly, in the demo you linked, the website is not very well designed and shows adjacent matches as one single match. \D\D matches exactly 2 non-digit characters. Also, you didn't include the commas you have in your pattern in the code. Anyway, here is a correct pattern:
^(\D+),
It matches the start of the line, then at least one non-digit character, then a comma. The first group contains the string you want to match.
Demo: https://regex101.com/r/pViF0h/2
In code:
import re
text = '''Desert Treasure,0,0,0,12
Contact!,0,0,11,0,0,0,5
Cook's Assistant,0,0,0,0
Horror from the Deep,0,0,13'''
print(re.findall(r'^(\D+),', text, re.M))
# ['Desert Treasure', 'Contact!', "Cook's Assistant", 'Horror from the Deep']
If the first entry is what you want no matter what, you can also use:
^(.+?),
Also, for these files, it is usually a much better idea to read it as a CSV and extract what you need that way.
Your TypeError solution is correct.
Without knowing what that webpage looks like, I do see one problem. In your working example, you use ',(\d\d),', but in the problem one you use ,(\D\D),. \d Matches any digit characters, but \D matches any non-digits.

Finding last word in tweepy tweet response python

I am receiving a stream of tweets with python and would like to extract the last word or know where to reference it.
for example in
NC don’t like working together www.linktowtweet.org
get back
together
I am not familiar with tweepy, so I am presuming you have the data in a python string, so maybe there is a better answer.
However, given a string in python, it simple to extract the last word.
Solution 1
Use str.rfind(' '). The idea here is to find the space, preceding the last word. Here is an example.
text = "NC don’t like working together"
text = text.rstrip() # To any spaces at the end, that would otherwise confuse the algorithm.
last_word = text[text.rfind(' ')+1:] # Output every character *after* the space.
print(last_word)
Note: If a string is given with no words, last_word will be a blank string.
Now this presumes that all of the words are separated by spaces. To handle newlines and spaces, use str.replace to turn them into strings. Whitespaces in python are \t\n\x0b\x0c\r, but I presume only newlines and tabs will be found in twitter messages.
Also see: string.whitespace
So a complete example (wrapped as a function) would be
def last_word(text):
text = text.replace('\n', ' ') # Replace newlines with spaces.
text = text.replace('\t', ' ') # Replace tabs with spaces.
text = text.rstrip(' ') # Remove trailing spaces.
return text[text.rfind(' ')+1:]
print(last_word("NC don’t like working together")) # Outputs "together".
This may still be the best situation for basic parsing. There is something better for larger problems.
Solution 2
Regular Expressions
These are a way to handle strings in python, that is a lot more flexible. REGEX, as they are often called, use there own language to specify a portion of text.
For example, .*\s(\S+) specifies the last word in a string.
Here is it again with a longer explanation.
.* # Match as many characters as possible.
\s # Until a whitespace ("\t\n\x0b\x0c\r ")
( # Remember the next section for the answer.
\S+ # Match a ~word~ (not whitespace) as possible.
) # End saved section.
So then, in python you would use this as follows.
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s(\S+)", re.DOTALL)
def last_word(text):
m = LAST_WORD_PATTERN.match(text)
if not m: # If there was not a last word to this text.
return ''
return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together")) # Outputs "together".
Now, even though this method is a lot less obvious, it has a couple of advantages. First off, it is a lot more customizable. If you wanted to match the final word, but not links, the regex r".*\s([^.:\s]+(?!\.\S|://))\b" would match the last word, but ignore a link if that was the last thing.
Example:
import re # Import the REGEX library.
# Compile the code, (DOTALL makes . match \n).
LAST_WORD_PATTERN = re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
def last_word(text):
m = LAST_WORD_PATTERN.match(text)
if not m: # If there was not a last word to this text.
return ''
return m.group(1) # Otherwise return the last word.
print(last_word("NC don’t like working together www.linktowtweet.org")) # Outputs "together".
The second advantage to this method is speed.
As you can Try it online! here, the regex approach is almost as fast as the string manipulation, if not faster in some cases. (I actually found that regex execute .2 usec faster on my machine that in the demo.)
Either way, the regex execution is extremely fast, even in the simple case, and there is no question that the regex is faster then any more complex string algorithm implemented in python. So using the regex can also speed up the code.
EDIT
Changed the url avoiding regex from
re.compile(r".*\s([^.\s]+(?!\.\S))\b", re.DOTALL)
to
re.compile(r".*\s([^.:\s]+(?!\.\S|://))\b", re.DOTALL)
So that calling last_word("NC don’t like working together http://www.linktowtweet.org") returns together and not http://.
To so how this regex works, look at https://regex101.com/r/sdwpqB/2.
Simple, so if your text is:
text = "NC don’t like working together www.linktowtweet.org"
text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE) #to remove any URL
text = text.split() #splits sentence into words with delimiter=" "
last_word = text[-1]
So there you go!! Now you'll get the last word "together".

how do i read a string from a file up to a set character (for example reading the words "hello world ¬ this is a string" up to the¬) in python 3

sorry if my English is off, i am making something in python and need help with fixing a problem i have encountered. the problem im having is i need to be able to take information in a txt file up to a point signalled by a key character such as ¬, and then need to be able to take the next part of the string after the 1st ¬ to the next ¬ and so on. the reason for this is because all the strings will be of various lengths that can and will change, so if i have the string
'znPbB t7<)!\oWk_feGTIT:7{.¬ZO9?S9$v9vpd}Z#EMKC¬'
in a note pad file i need it to come out as
'znPbB t7<)!\oWk_feGTIT:7{.'
and when i want the 2nd one, it should come out as
'ZO9?S9$v9vpd}Z#EMKC'
I would use split:
s = 'znPbB t7<)!\oWk_feGTIT:7{.¬ZO9?S9$v9vpd}Z#EMKC¬'
s.split('¬')
# returns
['znPbB t7<)!\\oWk_feGTIT:7{.', 'ZO9?S9$v9vpd}Z#EMKC', '']
One solution is to split the string. Another is to use the re module. I changed the ¬ character to an ö when testing:
import re
text = 'znPbB t7<)!\oWk_feGTIT:7{.öZO9?S9$v9vpd}Z#EMKCöjndIJ%349HBhslö'
regex_result = re.findall(r"(?!ö).*?(?=ö)", text)
split_result = text.split("ö")
The difference between the results of the two is that str.split includes an empty string if the character (in my example "ö") is last.
split_result[:-1] == regex_result # <--- This is True
The regular expression can be divided into three parts. (?!ö) is a negative lookahead which excludes "ö" from the results. .*? matches anything. (?=ö) is a lookahead which tells us that "ö" is needed but not to be included in the match.

How to get only searched word as a result python regex

How can I get only the words that match my regex in python? Because everything I tried also prints the full line where the string was found.
The regex is the following:
\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2})\b
It matched IP + CIDR (e.g 12.0.0.0/8)
The text in which I am searching this is as follows:
04/30","172.18.186.0/24","172.18.185.0/24","172.18.177.16/28","dwefwf-1.RI-nc_wefwfwefwefpat_intweb_fe","172.18.176.16/28","edefwfwf
t_pat_infwef_fe","172.18.178.16/28","dwefwefwef-wefwffwefwefwef_dr_efwefeb_fe","172.18.176.80/28","DSwefwfH2.
RI-nc_rat_dr_fweweb_fe","172.18.178.48/28","172.18.177.208/28","wefwef
wefwtfweapp_fe","172.18.176.208/28","wfwfwefwefwefH2.RI-nwefwefdr_app_fe","172.18.177.192/28","de1dfwwf-1.wefewf","172.18.176.1
92/28","
You should modify your regex as follows:
\b(([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2}))\b
and then extract the first matched group: \1
Demo: http://repl.it/R0W/1 (It takes a while to run)
I think your regexp work correctly. If you want to get matched string use group function. Like this:
import re
regexp = r'\b([1-9][0-9]{1,2})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\/([0-9]{1,2})\b'
text = '''04/30","172.18.186.0/24","172.18.185.0/24","172.18.177.16/28","dwefwf-1.RI-nc_wefwfwefwefpat_intweb_fe","172.18.176.16/28","edefwfwf
t_pat_infwef_fe","172.18.178.16/28","dwefwefwef-wefwffwefwefwef_dr_efwefeb_fe","172.18.176.80/28","DSwefwfH2.
RI-nc_rat_dr_fweweb_fe","172.18.178.48/28","172.18.177.208/28","wefwef
wefwtfweapp_fe","172.18.176.208/28","wfwfwefwefwefH2.RI-nwefwefdr_app_fe","172.18.177.192/28","de1dfwwf-1.wefewf","172.18.176.1
92/28","'''
for i in re.finditer(regexp, text):
print i.group(0)

Categories