Trouble using regex to locate special characters - python

I am using beautifulsoup and selenium to collect some data from a page. After narrowing the data down to the string that I want, it gives me 'First Blood○○○○○●○○○○'. My goal is to determine the position of the filled-in dot (so 5 in this case if we are counting from 0).
I started by trying to remove all of the non-special characters using:
test = re.sub(r'[a-z]+', '', collectStatistics[5], re.I)
Which gave me 'F B○○○○○●○○○○' so I am guessing F B are also special characters. I have no clue how to go about writing a regex that will detect the filled in circle so any advice would be appreciated.
Thanks in advance :)

I think regexes (regices?) are overkill here.
First, cut off everything after the filled dot:
line = line.split('●')[0] # Split on filled dots, then take only the first part
Now, count the empty dots:
result = line.count('○') # Count occurrences
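As a minimal sketch, the two steps above can be wrapped in a helper that also covers the case where the string has no filled dot at all. The name filled_dot_position and the -1 return value are assumptions for illustration, not part of the question.

```python
def filled_dot_position(text, filled='●', empty='○'):
    """Return the 0-based position of the filled dot among the dots, or -1 if absent."""
    if filled not in text:
        return -1
    # Keep everything before the first filled dot, then count only the empty dots in it
    return text.split(filled)[0].count(empty)


print(filled_dot_position('First Blood○○○○○●○○○○'))  # 5
```

Counting only the empty dots means any leading text such as "First Blood" does not affect the result.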

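It finds F and B because your regex only matches lowercase letters. The fourth positional argument of re.sub is count, not flags, so re.I is never applied as a flag (it is treated as a replacement limit instead). Either pass it as flags=re.I or change the regex to [a-zA-Z]+:
import re
collectStatistics = "First Blood○○○○○●○○○○"
test = re.sub(r'[a-zA-Z]+', '', collectStatistics).strip()  # strip() drops the space left between the two words
print(test)
OUTPUT :
○○○○○●○○○○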

Related

Matching a String in Python using regex

I have a string say like this:
ARAN22 SKY BYT and TRO_PAN
In the above string the first letter can be A, S, T or N, and the two digits after RAN can be any two digits. However, the rest will always be the same, and the string will always end with _PAN.
So the few possibilities of the string are :
SRAN22 SK BYT and TRO_PAN
TRAN25 SK BYT and TRO_PAN
NRAN25 SK BYT and TRO_PAN
So I was trying to extract the string every time in python using regex as follows:
import re
pattern = "([ASTN])RAN" + "\w+\s+" +"_PAN"
pat_check = re.compile(pattern, flags=re.IGNORECASE)
sample_test_string = 'NRAN28 SK BYT and TRO_PAN'
re.match(pat_check, sample_test_string)
here string can be anything like the above examples I gave there.
But it's not working: I am not getting the string (the sample test string) back as a match, which I should. Not sure what I am doing wrong. Any help will be very much appreciated.
You are using \w+\s+, which will match one or more word (0-9A-Za-z_) characters, followed by one or more space characters. So it will match the two digits and space after RAN but then nothing more. Since the next characters are not _PAN, the match will fail. You need to use [\w\s]+ instead:
pattern = r"([ASTN])RAN" + r"[\w\s]+" + "_PAN"
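A short sketch of the corrected pattern run against the sample strings from the question. The list name samples and the non-matching XRAN22 string are assumptions added for illustration.

```python
import re

# Raw string so \w and \s reach the regex engine unchanged
pat_check = re.compile(r"([ASTN])RAN[\w\s]+_PAN", flags=re.IGNORECASE)

samples = [
    "ARAN22 SKY BYT and TRO_PAN",
    "NRAN28 SK BYT and TRO_PAN",
    "XRAN22 SK BYT and TRO_PAN",  # hypothetical string that should not match
]

for sample in samples:
    match = pat_check.match(sample)
    # group(0) is the whole matched text; group(1) is the leading letter
    print(match.group(0) if match else None)
```

Because [\w\s]+ is greedy, it first consumes everything up to the end of the string and then backtracks just far enough for _PAN to match.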

Finding most common occurrence of a character that follows another

I'm currently working on a small piece of code and I seem to have run into a roadblock. I was wondering if it's possible to (because I cannot, for the life of me, figure it out) find the most common occurrence of a character that follows a specific character or string?
For example, say I have the following sentence:
"this is a test sentence that happens to be short"
How could I determine, for example, the most common character that occurs after the letter h?
In this specific example, doing it by hand, I get something like this:
{"i": 1, "a": 2, "o": 1}
I'd then like to be able to get the key of the highest value--in this case, a.
Using Counter from collections, I've been able to find the most common occurrence of a specific word or character, but I'm not sure how to do this specific implementation of doing the most common occurrence after. Any help would be greatly appreciated, thanks!
(The code I wrote to find the most common occurrence of a letter in a file:
Counter(text).most_common(1), which does include white spaces )
EDIT:
How would this be done with words? For example, if I had the sentence: "whales are super neat, but whales don't make good pets. whales are cool."
How would I find the most common character that occurs after the word whales?
In this instance, removing white spaces, the most common character would be a
Just split them by your character and then get the letter after it
import collections
sentence = "this is a test sentence that happens to be short"
character = 'h'
letters_after_some_character = [part[0] for part in sentence.split(character)[1:] if part and part[0].isalpha()]
print(collections.Counter(letters_after_some_character).most_common())
If you want a solution without using regex:
import collections
sentence = "this is a test sentence that happens to be short"
characters = [sentence[i] for i in range(1,len(sentence)) if sentence[i-1] == 'h']
most_common_char = collections.Counter(characters).most_common(1)
Using the Counter class we can try:
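import collections
import re
s = "this is a test sentence that happens to be short"
s = re.sub(r'^.*n|\s*', '', s)
print(collections.Counter(s).most_common(1)[0])
The above prints ('s', 2). It strips everything up to the last n, plus all whitespace, before counting, so it counts every remaining character (stobeshort, where s, t and o tie at two occurrences and most_common returns the first one seen) rather than only the characters that directly follow a given character.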
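For the word case from the EDIT, here is one possible sketch using re.findall. The helper name most_common_after is hypothetical, and skipping whitespace before the following character is an assumption taken from the "removing white spaces" remark in the question.

```python
import re
from collections import Counter


def most_common_after(text, prefix):
    """Return the most common non-space character following prefix, or None."""
    # re.escape treats prefix literally; \s* skips spaces; (\S) captures the next character
    followers = re.findall(re.escape(prefix) + r'\s*(\S)', text)
    return Counter(followers).most_common(1)[0][0] if followers else None


sentence = "this is a test sentence that happens to be short"
whales = "whales are super neat, but whales don't make good pets. whales are cool."

print(most_common_after(sentence, 'h'))     # a
print(most_common_after(whales, 'whales'))  # a
```

The same function handles both a single character and a whole word, since the prefix is just inserted into the pattern.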

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. the number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)
You can solve it using regex, e.g. like this:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) Is getting your key_words list and creating a non-capturing group with all the words inside this list.
\s+? Add a lazy quantifier so it will get all spaces after any of the former occurrences up to the next character which isn't a space
([^\s]+) Will capture the text right after your key_words until a next space is found
Note: if you run this many times, e.g. inside a loop, you should call re.compile on the regex string once beforehand to improve performance.
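A minimal sketch of the precompiled variant. Wrapping each keyword in re.escape is an addition not present in the answer; it is assumed to be useful when keywords may contain regex metacharacters. The names alternation and next_word are illustrative.

```python
import re

text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy", "coding"]

# re.escape guards against keywords that contain regex metacharacters
alternation = '|'.join(re.escape(word) for word in key_words)

# Compile once, reuse many times
next_word = re.compile(r'(?:{0})\s+(\S+)'.format(alternation))

expected_output = next_word.findall(text_string)
print(expected_output)  # ['yes_no!?.', 'without', 'yes']
```

Note that this pattern also matches a keyword that appears inside a longer word; adding \b before the group would be one way to restrict it to whole words.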
We will use Python's re module to split your string on whitespace.
Then, the idea is to go over each word, and look if that word is part of your keywords. If yes, we set take_it to True, so that next time the loop is processed, the word will be added to taken which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']

Regex to match strings within braces

I am trying to write a regex to a string that has the following format
12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123
Currently, I have something like this
\((?P<expression>\S+)\)
But with this, I can capture only the strings within square brackets.
Is there any way I can capture the integers before the square brackets and also the id at the end, along with the strings within the square brackets?
The number of strings enclosed within small brackets will not be the same. I could also have a string that looks like this
10(3,2) [abc (a1b2c3)] myId1
I know that I can write a simple regex for the above expression using brute force. But could anyone please help me write one when the number of strings within the square bracket keeps changing.
Thanks in advance
You can capture the information by using ^ and $, which mean start and end respectively:
((?P<front>^\d+)|\((?P<expression>\S+)\)|(?P<id>[a-zA-Z0-9]+)$)
Regex101:
https://regex101.com/r/PoA5k4/1
To make the result more usable, I'd turn it into a dictionary:
import re
myStr = "12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123"
di = {}
# Each findall tuple is (whole match, front, expression, id)
for find in re.findall(r"((?P<front>^\d+)|\((?P<expression>\S+)\)|(?P<id>[a-zA-Z0-9]+)$)", myStr):
    if find[1] != "":
        di["starter"] = find[1]
    elif find[3] != "":
        di["id"] = find[3]
    else:
        di.setdefault("expression", []).append(find[2])
print(di)
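An alternative sketch, not the approach above: match the overall shape of the line once, then pull the parenthesised strings out of the bracketed part with a second findall. The function name parse_line, the dictionary keys, and the assumption that the line always has the form number(args) [body] id are all illustrative.

```python
import re

# Overall shape: digits, (args), [body], trailing id
LINE_RE = re.compile(
    r'(?P<front>\d+)\((?P<args>[^)]*)\)\s*\[(?P<body>.*)\]\s*(?P<id>\w+)$'
)


def parse_line(line):
    """Split a line into its leading number, arguments, bracketed strings and id."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return {
        "front": match.group("front"),
        "args": match.group("args").split(","),
        # Any number of (...) groups inside the square brackets
        "expressions": re.findall(r'\(([^)]*)\)', match.group("body")),
        "id": match.group("id"),
    }


print(parse_line("10(3,2) [abc (a1b2c3)] myId1"))
print(parse_line("12740(34,12) [abc (a1b2c3) (a2b3c4)......] myId123"))
```

Splitting the work in two keeps the numbers in front of the brackets separate from the expressions inside them, regardless of how many expressions there are.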

regular expressions to extract phone numbers

I am new to regular expressions and I am trying to write a pattern of phone numbers, in order to identify them and be able to extract them. My doubt can be summarized to the following simple example:
I first try to identify whether the string contains something like (+34), which should be optional:
prefixsrch = re.compile(r'(\(?\+34\)?)?')
that I test in the following string in the following way:
line0 = "(+34)"
print prefixsrch.findall(line0)
which yields the result:
['(+34)','']
My first question is: why does it find two occurrences of the pattern? I guess that this is related to the fact that the prefix thing is optional but I do not completely understand it. Anyway, now for my big doubt
If we do a similar thing searching for a pattern of 9 digits we get the same:
numsrch = re.compile(r'\d{9}')
line1 = "971756754"
print numsrch.findall(line1)
yields something like:
['971756754']
which is fine. Now what I want to do is identify a 9 digits number, preceded or not, by (+34). So to my understanding I should do something like:
phonesrch = re.compile(r'(\(?\+34\)?)?\d{9}')
If I test it in the following strings...
line0 = "(+34)971756754"
line1 = "971756754"
print phonesrch.findall(line0)
print phonesrch.findall(line1)
this is, to my surprise, what I get:
['(+34)']
['']
What I was expecting to get is ['(+34)971756754'] and ['971756754']. Does anybody have any insight into this? Thank you very much in advance.
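Your capturing group is the problem. When a pattern contains a capturing group, findall returns the content of that group instead of the whole match. Put the country code in a non-capturing group and wrap the entire expression in the capturing group: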
>>> line0 = "(+34)971756754"
>>> line1 = "971756754"
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line0)
['(+34)971756754']
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line1)
['971756754']
My first question is: why does it find two occurrences of the pattern?
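This is because ? means 0 or 1 repetitions, so the empty string is also a valid match: after matching (+34), findall tries again at the end of the string and the pattern matches the empty string there.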
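Two other ways to get the whole number, sketched under the assumption that only the full match is needed: drop the capturing group entirely, or keep the original pattern and read group(0) from each match object. The names lines and grouped are illustrative.

```python
import re

lines = ["(+34)971756754", "971756754"]

# No capturing group at all: findall returns the whole match
phonesrch = re.compile(r'(?:\(?\+34\)?)?\d{9}')
for line in lines:
    print(phonesrch.findall(line))

# Original pattern with its capturing group: ask each match object for group(0)
grouped = re.compile(r'(\(?\+34\)?)?\d{9}')
for line in lines:
    print([m.group(0) for m in grouped.finditer(line)])
```

Both loops print ['(+34)971756754'] and then ['971756754'].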
