Regex inside findall vs regex inside count


This is a follow-up question to "How to count characters in a string?" and to "Find out how many times a regex matches in a string in Python".
I want to count all alphabetic characters in the string:
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
The str.count() method allows for counting a specific letter. How would one do that for counting any letter in the entire alphabet in a string, using the count method?
I am trying to use a regex inside the count method, but it returns 0 instead of 83. The code I am using is:
import re
spam_data['text'][0].count((r'[a-zA-Z]'))
When I use:
len(re.findall((r'[a-zA-Z]'), spam_data['text'][0])) it returns a length of 83.
Why does count return a 0 here?

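You should use the pandas Series.str.count method instead of the built-in str.count; the pandas version treats its argument as a regex. Note that \w also matches digits and underscores, so use '[a-zA-Z]' if you want letters only.
spam_data['text'].str.count(r'\w')
0    83
Name: text, dtype: int64
To access the first value use:
spam_data['text'].str.count(r'\w')[0]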
83

How would one do that for counting any letter in the entire alphabet in a string, using the count method?
wrd = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
>>> count = sum([''.join({_ for _ in wrd if _.isalpha()}).count(w) for w in wrd])
>>> count
83
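Explanation: build a string of the unique letters in wrd (via a set), then for each character of wrd count how often it occurs in that string (1 for a letter, 0 for anything else) and sum the results.
A more direct loop that gives the same total: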
count = []
set_w = set()
for w in wrd:
    if w.isalpha():
        set_w.add(w)
for w in set_w:
    count.append(wrd.count(w))
print(sum(count))
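As a rough sketch of the same idea using only str.count, one could loop over the ASCII letters directly instead of building a set first. This assumes "letter" means a-z and A-Z; the name count_ascii_letters is made up for this illustration.

```python
import string

def count_ascii_letters(text):
    # Hypothetical helper: add up str.count() for each ASCII letter.
    return sum(text.count(letter) for letter in string.ascii_letters)

wrd = ('Go until jurong point, crazy.. Available only in bugis n great '
       'world la e buffet... Cine there got amore wat...')

print(count_ascii_letters(wrd))         # 83
# isalpha() gives the same total here, but it also accepts non-ASCII letters.
print(sum(ch.isalpha() for ch in wrd))  # 83
```

Each str.count call scans the whole string, so this makes 52 passes; for a one-off count on a short string that is unlikely to matter.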

In this one:
spam_data['text'][0].count((r'[a-zA-Z]'))
count takes a plain substring, not a regex, which is why it returns 0.
Use your second example instead.

Short answer: you did not use a regex but a raw string literal, and thus counted occurrences of the string '[a-zA-Z]'.
A string of the form r'..' is not a regex; it is a raw string literal. If you write r'\n', you get a string with two characters, a backslash and an n, not a newline. Raw strings are useful in the context of regexes because regexes use a lot of backslash escaping as well.
For example:
>>> r'\n'
'\\n'
>>> type(r'\n')
<class 'str'>
So here you count the number of times the string '[a-zA-Z]' occurs, and unless your spam_data['text'][0] literally contains a square bracket [ followed by a, etc., the count will be zero. Or, as specified in the documentation of str.count [Python-doc]:
string.count(s, sub[, start[, end]])
Return the number of (non-overlapping) occurrences of substring sub in string s[start:end]. Defaults for start and end and interpretation of negative values are the same as for slices.
In case the string is rather large, and you do not want to construct a list of matches, you can count the number of elements with:
sum(1 for _ in re.finditer('[a-zA-Z]', 'mystring'))
It is however typically faster to simply use re.findall(..) and then calculate the number of elements.
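To make the difference concrete, here is a small sketch that runs all three calls on the sentence from the question. It uses a plain string in place of spam_data['text'][0], which is an assumption made to keep the example self-contained.

```python
import re

text = ('Go until jurong point, crazy.. Available only in bugis n great '
        'world la e buffet... Cine there got amore wat...')

# str.count looks for the literal eight-character substring "[a-zA-Z]".
literal_hits = text.count(r'[a-zA-Z]')

# re.findall interprets the same characters as a character class.
regex_hits = len(re.findall(r'[a-zA-Z]', text))

# re.finditer counts lazily without building a list of matches.
lazy_hits = sum(1 for _ in re.finditer(r'[a-zA-Z]', text))

print(literal_hits, regex_hits, lazy_hits)  # 0 83 83
```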

Related

search and count specific phrases with special characters in text files

I have a list of search phrases where some are single words, some are multiple words, some have a hyphen in between them, and others may have both parentheses and hyphens. I'm trying to process a directory of text files and search for 100+ of these phrases, and then count occurrences.
The code below seems to work in Python 2.7.x until it hits the hyphenated search phrases. I observed unexpected counts in some text files for at least one of the hyphenated search phrases.
kwlist = ['phraseone', 'phrase two', 'phrase-three', 'phrase four (a-b-c) abc', 'phrase five abc', 'phrase-six abc abc']
for kws in kwlist:
    s_str = kws
    kw = re.findall(r"\b" + s_str + r"\b", ltxt)
    count = 0
    for c in kw:
        if c == s_str:
            count += 1
    output.write(str(count))
Is there a better way to handle the range of phrases in the search, or any improvements I can make to my algorithm?

You could achieve this with what I would call a pythonic one-liner.
We don't need to bother with a regex, as we can use the built-in .count() method, which, according to the documentation, does the following:
string.count(s, sub[, start[, end]])
Return the number of (non-overlapping) occurrences of substring sub in string s[start:end]. Defaults for start and end and interpretation of negative values are the same as for slices.
So all we need to do is sum up the occurrences of each keyword in kwlist in the string ltxt. This can be done with a list-comprehension:
output.write(str(sum([ltxt.count(kws) for kws in kwlist])))
Update
As pointed out in @voiDnyx's comment, the above solution writes the sum of all the counts, not the count for each individual keyword.
If you want the count for each keyword, you can write each one individually from the list:
counts = [ltxt.count(kws) for kws in kwlist]
for cnt in counts:
    output.write(str(cnt))
This will work, but if you wanted to get silly and put it all in one-line, you could potentially do:
[output.write(str(ltxt.count(kws))) for kws in kwlist]
It's up to you, hope this helps! :)
If you need to match word boundaries, then yes, the only way to do so would be to use \b in a regex. This doesn't mean that you can't still do it in one line:
[output.write(str(len(re.findall(r'\b' + re.escape(kws) + r'\b', ltxt)))) for kws in kwlist]
Note that re.escape is necessary, as the keyword may contain characters that are special in a regex, such as parentheses.
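A slightly longer sketch of the escaped-pattern approach, returning one count per phrase. It swaps \b for the lookarounds (?<!\w) and (?!\w), because \b only matches between a word character and a non-word character and so can fail next to a phrase that starts or ends with punctuation such as ")". The function name count_phrases and the sample text are invented for this illustration.

```python
import re

def count_phrases(text, phrases):
    """Return {phrase: count} for whole-phrase matches in text."""
    counts = {}
    for phrase in phrases:
        # re.escape makes '-', '(' and ')' match literally.
        # (?<!\w) and (?!\w) require that the phrase is not glued to
        # another word character on either side.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        counts[phrase] = len(re.findall(pattern, text))
    return counts

sample = "phrase-three and phrase four (a-b-c) abc, not xphrase-three."
result = count_phrases(sample, ["phrase-three", "phrase four (a-b-c) abc"])
print(result)
```

In the sample, "xphrase-three" is not counted because the phrase is preceded by a word character.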

Return a string of country codes from an argument that is a string of prices

So here's the question:
Write a function that will return a string of country codes from an argument that is a string of prices (containing dollar amounts following the country codes). Your function will take as an argument a string of prices like the following: "US$40, AU$89, JP$200". In this example, the function would return the string "US, AU, JP".
Hint: You may want to break the original string into a list, manipulate the individual elements, then make it into a string again.
Example:
> testEqual(get_country_codes("NZ$300, KR$1200, DK$5"),
>           "NZ, KR, DK")
As of now, I'm clueless as to how to separate the $ and the numbers. I'm very lost.

I would advise looking up regular expressions:
https://docs.python.org/2/library/re.html
re.findall returns a list of all matching strings, so you can use a pattern like [A-Z]{2}(?=\$) to find every two-letter uppercase code that is followed by a dollar sign. The $ must be escaped, because an unescaped $ anchors the end of the string.
After that you can just create a string from the resulting list.
Let me know if that is not clear.
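A minimal sketch of that suggestion, assuming every country code is exactly two uppercase ASCII letters immediately followed by a dollar sign. The name codes_from_prices is hypothetical.

```python
import re

def codes_from_prices(prices):
    # [A-Z]{2} matches two uppercase letters; (?=\$) requires a literal
    # dollar sign right after them without including it in the match.
    return ", ".join(re.findall(r"[A-Z]{2}(?=\$)", prices))

print(codes_from_prices("US$40, AU$89, JP$200"))  # US, AU, JP
```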

def test(string):
    # Split into "NZ$300"-style items, keep the part before the "$".
    return ", ".join([item.split("$")[0] for item in string.split(", ")])
string = "NZ$300, KR$1200, DK$5"
print(test(string))

Use a regular expression pattern and join the matches into a string. (\w{2})\$ matches exactly 2 word characters followed by a $.
import re
def get_country_codes(string):
    # The capture group keeps the two characters before each "$".
    matches = re.findall(r"(\w{2})\$", string)
    return ", ".join(matches)

Define a function that returns the number of non-overlapping occurrences of a sub string in a string

Q> Suppose the count function for a string didn't exist. Define a function that returns the number of non-overlapping occurrences of a sub string in a string.
I think this problem means that if I type the string "abcd" then the result is 10?
I guess the substrings would be:
a
b
c
d
ab
bc
cd
abc
bcd
abcd
So the result is 10. Is it right?

Given that the count function for a string counts the occurrences of a specific substring, it seems the answer should mimic that rather than enumerate every possible substring. You can use the re module to count occurrences of a substring. The findall function returns a list of non-overlapping matches, and its length gives the count:
import re
x ='thetheitem1thetheitem2'
len(re.findall(r'the',x))
4
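If the exercise also rules out the re module, one possible sketch is a loop around str.find that jumps past each match so that matches cannot overlap. The name count_nonoverlapping is made up here, and rejecting an empty substring is an assumption of this sketch rather than the behaviour of the built-in str.count.

```python
def count_nonoverlapping(text, sub):
    # Assumption: an empty substring is treated as an error.
    if not sub:
        raise ValueError("sub must be a non-empty string")
    count = 0
    start = 0
    while True:
        idx = text.find(sub, start)
        if idx == -1:
            return count
        count += 1
        # Resume after the end of this match so matches never overlap.
        start = idx + len(sub)

print(count_nonoverlapping('thetheitem1thetheitem2', 'the'))  # 4
print(count_nonoverlapping('aaaa', 'aa'))                     # 2
```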

Confusion with string split method in python

Consider the following example
a= 'Apple'
b = a.split(',')
print(b)
Output is ['Apple'].
I don't understand why it returns a list even though there is no ',' character in 'Apple'.
There may be cases where we call split expecting more than one element in the list, but because the separator is not present in the string, we get only one. Wouldn't it be better if this mistake were caught by the split method itself?

The behaviour of a.split(',') when no commas are present in a is perfectly consistent with the way it behaves when there are a positive number of commas in a.
a.split(',') says to split string a into a list of substrings that are delimited by ',' in a; the delimiter is not preserved in the substrings.
If 1 comma is found you get 2 substrings in the list, if 2 commas are found you get 3 substrings in the list, and in general, if n commas are found you get n+1 substrings in the list. So if 0 commas are found you get 1 substring in the list.
If you want 0 substrings in the list, then you'll need to supply a string with -1 commas in it. Good luck with that. :)
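A short sketch of the n+1 rule, followed by one way a caller could turn a missing separator into an error. The helper split_required is hypothetical and not part of the standard library.

```python
def split_required(text, sep):
    # Hypothetical helper: refuse to split when the separator is missing.
    if sep not in text:
        raise ValueError("separator %r not found in %r" % (sep, text))
    return text.split(sep)

for s in ["Apple", "Apple,Pear", "Apple,Pear,Fig"]:
    parts = s.split(",")
    # n separators always give n + 1 parts, including the n == 0 case.
    print(s.count(","), len(parts), parts)

print(split_required("Apple,Pear", ","))  # ['Apple', 'Pear']
```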

The docstring of that method says:
Return a list of the words in the string S, using sep as the delimiter string.
The delimiter is used to separate multiple parts of the string; having only one part is not an error.

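That's the way the split() method works. If you do not want that behaviour, you can write your own my_split() function as follows. Note that it returns the original string, not a list, when the delimiter is absent, so callers have to handle both return types:
def my_split(s, d=' '):
    return s.split(d) if d in s else s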

find last occurrence of multiple characters in a string in Python

I would like to find the last occurrence of a number of characters in a string.
str.rfind() will give the index of the last occurrence of a single character in a string, but I need the index of the last occurrence of any of a number of characters. For example if I had a string:
test_string = '([2+2])-[3+4])'
I would want a function that returns the index of the last occurrence of (, [, or {, similar to
test_string.rfind('(', '[', '{')
Which would ideally return 8. What is the best way to do this?
max(test_string.rfind('('), test_string.rfind('['), test_string.rfind('{'))
seems clunky and not Pythonic.

You can use a generator expression to do this in a Pythonic way.
max(test_string.rfind(i) for i in "([{")
This iterates over the characters you want to check, calls rfind() for each one, and returns the maximum of the resulting indices. If none of the characters occur, every rfind() call returns -1, so the result is -1.

This is pretty concise, and will do the trick.
max(map(test_string.rfind, '([{'))

You can use reversed to scan from the end of the string and stop at the first match. Subtracting the reversed index i from len(test_string) - 1 gives the index counted from the start, and at worst this makes a single pass over the string:
test_string = '([2+2])-[3+4])'
st = {"[", "(", "{"}
print(next((len(test_string) - 1 - i
            for i, s in enumerate(reversed(test_string)) if s in st), -1))
8
If there is no match, you will get -1 as the default value. This is a lot more efficient when you have a large number of characters to search for than doing an O(n) rfind for each one and then taking the max of all the results.
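The same backwards scan can be wrapped in a small function. This is only a sketch: the name last_index_of_any is invented, and it handles single characters only, unlike the rfind-based versions, which also accept multi-character substrings.

```python
def last_index_of_any(text, chars):
    # Set membership keeps each per-character check O(1) on average.
    wanted = set(chars)
    # Walk from the last index down to 0 and stop at the first hit.
    for i in range(len(text) - 1, -1, -1):
        if text[i] in wanted:
            return i
    return -1  # same "not found" convention as str.rfind

test_string = '([2+2])-[3+4])'
print(last_index_of_any(test_string, '([{'))  # 8
print(last_index_of_any('abc', '([{'))        # -1
```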

>>> def last_of_many(string, findees):
...     return max(string.rfind(s) for s in findees)
...
>>> test_string = '([2+2])-[3+4])'
>>> last_of_many(test_string, '([{')
8
>>> last_of_many(test_string, ['+4', '+2'])
10
>>>
