compute character frequencies in Python strings

I was wondering if there is a way in Python 3.5 to check if a string contains a certain symbol. Also I'd like to know if there is a way to check the amount the string contains. For example, if I want to check how many times the character '$' appears in this string...
^$#%#$$,
how would I do that?

You can use the in operator to check whether the symbol is in the string:
if '$' in your_str:
    print(your_str.count('$'))
(Testing your_str.split('$') does not work for this, because split always returns a non-empty list.)
You can also use re.findall:
import re
print(len(re.findall(r'\$', your_str)))
The length is 0 if the symbol does not occur in the string; otherwise it is the number of occurrences.
But the easiest way is to call count directly, with no separate check:
print(your_str.count('$'))
It returns 0 if nothing is found.
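Putting the options above side by side, here is a minimal sketch. The variable names are illustrative, and collections.Counter is an addition not mentioned in the answer; it is handy when you want the frequency of every character at once.

```python
from collections import Counter

your_str = "^$#%#$$"

# Membership test: True if the symbol occurs at least once.
has_dollar = "$" in your_str

# Number of non-overlapping occurrences of the symbol.
dollar_count = your_str.count("$")

# Frequencies of every character in a single pass.
frequencies = Counter(your_str)

print(has_dollar, dollar_count, frequencies["$"])
```

A Counter returns 0 for characters that never occur, so no separate membership check is needed before looking one up.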

These are covered by the in operator and the built-in string methods index and count. You can find full documentation at the official site. Please get used to doing the research on your own; the first step is to get familiar with the names of the language elements. Note that index raises ValueError when the symbol is absent, so use in for the membership check:
if '$' in my_str:
    # Found a dollar sign
    print(my_str.count('$'))

Related

How to call another function's results

def most_frequency_occ(chars,inputString):
    count = 0
    for ind_char in inputString:
        ind_char = ind_char.lower()
        if chars == ind_char:
            count += 1
    return count
def general(inputString):
    maxOccurences = 0
    for chars in inputString:
        most_frequency_occ(chars, inputString)
This is my current code. I'm trying to find the most frequently occurring letter in general. I created another function called most_frequency_occ that counts how often a specific character occurs in the string, but how do I generalize it to find the most frequent letter in a string without specifying a particular character, using only loops and no built-in string functions?
For example:
print(general('aqweasdaza'))
should print 4 as "a" occurs the most frequently, occurring 4 times.

If I understand your task, a dictionary will be more convenient for you.
# initializing string (named text to avoid shadowing the built-in str)
text = "Hello world"
# initializing dict of freq
freq = {}
for i in text:
    if i in freq:
        freq[i] += 1
    else:
        freq[i] = 1
# Now, you have the count of every char in this string.
# If you want to extract the max count, this step will do it for you:
max_freq = max(freq.values())
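The dictionary above gives the highest count but not the character it belongs to. A small sketch that returns both follows; the function name most_frequent is hypothetical, and the code assumes a non-empty string (max raises ValueError on an empty dictionary).

```python
def most_frequent(text):
    """Return (character, count) for the most frequent character in text."""
    freq = {}
    for ch in text:
        # dict.get supplies 0 the first time a character is seen.
        freq[ch] = freq.get(ch, 0) + 1
    # Pick the key whose stored count is largest.
    best = max(freq, key=freq.get)
    return best, freq[best]


print(most_frequent("aqweasdaza"))
```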

There are multiple ways to find the most common letter in a string.
One easy-to-understand approach that carries over to other languages would be:
Initialize an array of 26 integers set to 0.
Go over your string one letter at a time and increment the array element for that letter; for example, if the letter is a B (B=2), increment the second value of the array.
Find the largest value in your array and return the corresponding letter.
Since you are using Python, you could use a dictionary instead, since it would be less work to implement.
A word of caution: it sounds like you are doing a school assignment. If your school has a plagiarism checker that checks the internet, you might be caught for academic dishonesty if you copy and paste code from the internet.
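The 26-counter approach described above can be sketched as follows. The function name most_common_letter is hypothetical, and the sketch assumes only the ASCII letters a to z should be counted; everything else is skipped.

```python
def most_common_letter(text):
    """Return (letter, count) using a fixed array of 26 counters."""
    counts = [0] * 26
    for ch in text.lower():
        index = ord(ch) - ord("a")  # 'a' -> 0, 'b' -> 1, ...
        if 0 <= index < 26:  # skip anything that is not a-z
            counts[index] += 1

    # Linear scan for the largest counter.
    best_index = 0
    for i in range(1, 26):
        if counts[i] > counts[best_index]:
            best_index = i
    return chr(ord("a") + best_index), counts[best_index]


print(most_common_letter("aqweasdaza"))
```

If the text contains no letters at all, this sketch returns ('a', 0); a caller may want to treat a count of 0 as "nothing found".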

The other answers have suggested alternative ways of counting the letters in a string, some of which may be better than what you've come up with on your own. But I think it may be worth answering your question about how to call your most_frequency_occ function from your general function even if the algorithm isn't great, since you'll need to understand how functions work in other contexts.
The thing to understand about function calls is that the call expression will be evaluated to the value returned by the function. In this case, that's the count. Often you may want to assign the return value to a variable so you can reference it multiple times. Here's what that might look like:
count = most_frequency_occ(chars, inputString)
Now you can do a comparison between the count and the previous best count to see if you've just checked the most common letter so far:
maxOccurences = 0
for chars in inputString:
    count = most_frequency_occ(chars, inputString)
    if count > maxOccurences: # check if chars is more common than the previous best
        maxOccurences = count
return maxOccurences
One final note: Some of your variable and function names are a bit misleading. That often happens when you're changing your code around from one design to another, but not changing the variable names at the same time. You may want to occasionally reread your code and double check to make sure that the variable names still match what you're doing with them. If not, you should "refactor" your code by renaming the variables to better match their actual uses.
To be specific, your most_frequency_occ function isn't actually finding the most frequent character itself, it's only doing a small step in that process, counting how often a single character occurs. So I'd call it count_char or something similar. The general function might be named something more descriptive like find_most_frequent_character.
And the variable chars (which exists in both functions) is also misleading since it represents a single character, but the name chars implies something plural (like a list or a string that contains several characters). Renaming it to char might be better, as that seems more like a singular name.
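Applying the renaming suggestions above, the two functions might look like the sketch below. The snake_case parameter names are my own choice; the logic is unchanged from the question, including the call to lower() on each character of the input.

```python
def count_char(char, input_string):
    """Count how often char occurs in input_string (input is lowercased)."""
    count = 0
    for current in input_string:
        if current.lower() == char:
            count += 1
    return count


def find_most_frequent_character(input_string):
    """Return the highest occurrence count of any single character."""
    max_occurrences = 0
    for char in input_string:
        count = count_char(char, input_string)
        if count > max_occurrences:  # more common than the previous best
            max_occurrences = count
    return max_occurrences


print(find_most_frequent_character("aqweasdaza"))  # prints 4
```

This calls count_char once per character, so it rescans the whole string each time; that is fine for short inputs but slower than the dictionary approach for long ones.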

Find and replace semi-common strings in dataframe?

I am attempting to find a semi-common occurring string and remove all other data in the column. Pandas and Re have been imported. For instance, I have dataframe...
>>>df
COLUMN COUNT DATA
1 this row RA-123: data 8b43a
2 here RA-5372: data 94h63c
I need to keep just the RA-'number that follows' and remove everything before and after. The numbers that follow are not always the same length and the 'RA-' string does not always occur in the same position. There is a colon after every instance that can be used as a delimiter.
I tried this (a friend wrote the regex search piece for me because I am not familiar with it).
df.assign(DATA= df['DATA'].str.extract(re.search('RA[^:]+')))
But python returned
TypeError: search() missing 1 required positional argument: 'string'
What am I missing here? Thanks in advance!

You should use a capturing group with extract:
df['DATA'].str.extract(r'(RA-\d+)')
Here, (RA-\d+) is a capturing group matching RA, then a hyphen and then one or more digits.
You may use your own pattern, but you still need to wrap it with capturing parentheses, r'(RA[^:]+)'.

Looking at the docs, you don't need the re.search method. You just call df['DATA'] = df['DATA'].str.extract(r'(RA[^:]+)', expand=False). Note that extract requires a capturing group in the pattern.

As I mentioned earlier, there is no need for re here.
Other answers addressed well how to use extract directly. However, to answer your question specifically: if you really want to use re, the way to go is to use re.compile instead of re.search (the pattern in regex_str still needs a capturing group).
df.assign(DATA= df['DATA'].str.extract(re.compile(regex_str)))
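The TypeError in the question arises because re.search needs both a pattern and a string to search, whereas str.extract only wants the pattern. To try the pattern outside pandas first, here is a sketch with plain re; the two row strings are my assumption about what the DATA column holds.

```python
import re

# Hypothetical DATA values, modelled on the sample rows in the question.
rows = ["this row RA-123: data 8b43a", "here RA-5372: data 94h63c"]

# The capturing group marks the part to keep.
pattern = re.compile(r"(RA-\d+)")

extracted = []
for row in rows:
    match = pattern.search(row)  # search takes the string to scan
    extracted.append(match.group(1) if match else None)

print(extracted)
```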

Beginner with regular expressions; need help writing a specific query - space, followed by 1-3 numbers, followed by any number of letters

I'm working with some poorly formatted HTML and I need to find every instance of a certain type of pattern. The issue is as follows:
a space, followed by a 1 to 3 digit number, followed by letters (a word, usually). Here are some examples of what I mean.
hello 7Out
how 99In
are 123May
So I would be looking for the expression to get the "7Out", "99In", "123May", etc. The initial space does not need to be included. I hope this is descriptive enough, as I am literally just starting to expose myself to regular expressions and am still struggling a bit. In the end, I will want to count the total number of these instances and add the total count to a df that already exists, so if you have any suggestions on how to do that I would be open to that as well. Thanks for your help in advance!

Your regular expression will be: r'\w\s(\d{1,3}[a-zA-Z]+)'
To get the count, call len() on the list returned by findall. The code will be
import re
string='hello 70qwqeqwfwe123 12wfgtr123 34wfegr123 dqwfrgb'
result=re.findall(r'\w\s(\d{1,3}[a-zA-Z]+)',string)
print("result =", result) # this gives you all the found occurrences as a list
print("len(result) =", len(result)) # this gives you the total number of occurrences
The result will be:
result = ['70qwqeqwfwe', '12wfgtr', '34wfegr']
len(result) = 3
Hint: findall evaluates the regular expression and returns results based on its groups. I'm using that to solve this problem.
Try these:
re.findall(r'(\w\s((\d{1,3})[a-zA-Z]+))',string)
re.findall(r'\w\s((\d{1,3})[a-zA-Z]+)',string)
To learn more about regular expressions, refer to the Python re documentation and the Tutorials Point tutorial; an online regex tester is handy for experimenting with the matching characters.
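As a variation on the answer above, here is a sketch run against the examples from the question. It swaps the leading \w\s for a lookbehind, (?<=\s), so the pattern needs no capturing group; that substitution is mine, not the answerer's.

```python
import re

text = "hello 7Out how 99In are 123May"

# (?<=\s) requires whitespace just before the digits without consuming it.
pattern = re.compile(r"(?<=\s)\d{1,3}[a-zA-Z]+")

matches = pattern.findall(text)
total = len(matches)

print(matches, total)
```

Because the lookbehind pins the match to the position right after the space, a run of four or more digits does not match at all.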

regex does not match only upper case letters, despite being instructed to do so

I'm making a script to crawl through a web page and find all upper case names, equalling a number (ex. DUP_NB_FUNC=8). The part where my regular expression has to match only upper case letters however, does not seem to be working properly.
value = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", input)
|tc_apb_conf_00.v:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Desired output should look something like the above. However, I am getting:
|tc_apb_conf_00.v:-:=1" name="viewport"/>
|:-:DUP_NB_FUNC=2
|:-:DUP_NB_FUNC=4
|:-:DUP_NB_FUNC=5
|tc_apb_conf_01.v:-:DUP_NB_FUNC=8
Based on the input, I can see it's finding a match starting at =1. However, I don't understand why, as I've put only A-Z in the regex range. I'd really appreciate some help clearing this up.

This should help:
[A-Z0-9_]+(?==\d).{2,}
or
\b[A-Z0-9_]*(?==\d).{2,}\b
(the second one keeps the * quantifier, so it can still match an empty name wherever = directly follows a word character).
But your regex is rather unusual anyway; given your requirement above, I suggest this:
[A-Z0-9_]+=\d+
Instead of using
(?==\d).{2,}: a lookahead requiring = followed by one digit, then any two or more characters,
you can just use
=\d+

Try this.
value = re.findall(r"[A-Z0-9_]+(?==\d).{2,}", input)
You want the case sensitive match to match at least once, which means you want the + quantifier, not the * quantifier, that matches between zero and unlimited times.
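To see the difference between the two quantifiers, here is a small sketch. The sample text is hypothetical, reconstructed from the unwanted output shown in the question, and the stricter pattern is the [A-Z0-9_]+=\d+ form suggested in another answer.

```python
import re

# Hypothetical input: one line of HTML residue and one real assignment.
sample = 'initial-scale=1" name="viewport"/>\nDUP_NB_FUNC=8'

# With *, the name part may be empty, so the match can start at "=1".
loose = re.findall(r"[A-Z0-9_]*(?==\d).{2,}", sample)

# With +, at least one upper-case letter, digit or underscore is required.
strict = re.findall(r"[A-Z0-9_]+=\d+", sample)

print(loose)
print(strict)
```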

I suggest you compile your pattern once and check each input line against it:
value = re.compile(r"[A-Z0-9_:.-]+=\d+")  # hyphen placed last so it is literal, not a range
for i in tlist:
    jee = value.match(i)
    if jee is not None:
        print(i)
tlist contains your input

regex extraction 2 groups resulting only in one match

New to regex.
Consider you have the following text structure:
"hello_1:45||hello_2:67||bye_1:45||bye_5:89||.....|| bye_last:100" and so on
I want to build a dictionary out of it taking the string value as a key, and the decimal number as the dict value.
I was trying to check my concept using this nice tool
I wrote my regex expression:
(\w+):(\d+)
And got only one match ->the first in the string : hello_1:45
I tried also something like:
.*(\w+):(\d+).*
But also not good, any ideas?

In an online regex tester, you should use the g (global) modifier to get all the matches instead of stopping at the first one. In Python you can use the re.findall function to get all the matches.
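A minimal sketch of the re.findall route, using the pattern from the question. Converting the values with int is my assumption, based on the sample numbers being whole.

```python
import re

s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"

# With two groups, findall returns a list of (key, value) tuples.
pairs = re.findall(r"(\w+):(\d+)", s)

# Build the dictionary, converting each value to an integer.
result = {key: int(value) for key, value in pairs}

print(result)
```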

You can achieve this using only the split function.
s = "hello_1:45||hello_2:67||bye_1:45||bye_5:89"
print({i.split(':')[0]:i.split(':')[1] for i in s.split('||')})
Try this if you want to convert the value part to int.
print({i.split(':')[0]:int(i.split(':')[1]) for i in s.split('||')})
or
print({i.split(':')[0]:float(i.split(':')[1]) for i in s.split('||')})
