Counting by partial string in Python - python

So, I have a list of strings (upper case letters).
list = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01', 'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
How can I group and count occurrence in the list?
Expected output:

You may try using the Counter library here:
from collections import Counter
import re
list = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01', 'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
list = [re.sub(r'\d+$', '', x) for x in list]
print(Counter(list))
This prints:
Counter({'HORSE': 4, 'CAT': 3, 'DOG': 2})
Note that the above approach simply strips off the number endings of each list element, then does an aggregation on the alpha names only.

you can also use dictionary
list= ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01',
'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
dic={}
for i in list:
i=i[:-2]
if i in dic:
dic[i]=dic[i]+1
else:
dic[i]=1
print(dic)

Related

How to create tuples from a dictionary of words and values with the value first

I have a dictionary file, here are a few example lines:
acquires,1.09861228867
acquisition,1.09861228867
acquisitions,1.60943791243
acquisitive,0.69314718056
acridine,0.0
acronyms,1.09861228867
acrylics,0.69314718056
actual,1.60943791243
words = [acquires, acrylics, actual, acridine]
I need the output to be:
word_tuples = ((1.09861228867,acquires),(0.69314718056,acrylics), (1.60943791243,actual),
(0.0,acridine))
I tried doing,
sorted_list[]
word_tuples = [(key,value) for key, value in dict]
if words in word_tuples:
sorted_list.append(word_tuples[value])
You can do something like this:
dict = {"acquires":1.09861228867, "acquisition":1.09861228867, "acquisitions":1.60943791243,
"acquisitive":0.69314718056, "acridine":0.0, "acronyms":1.09861228867, "acrylics":0.69314718056," actual":1.60943791243}
words = ["acquires", "acrylics", "actual", "acridine"]
tuple_list = list()
for key, value in dict.items():
if key in words:
tuple_list.append((value, key))
print(tuple_list)
You could do it by using a passing a generator expression to the tuple constructor:
from operator import itemgetter
my_dict = {
'acquires': 1.09861228867,
'acquisition': 1.09861228867,
'acquisitions': 1.60943791243,
'acquisitive': 0.69314718056,
'acridine': 0.0,
'acronyms': 1.09861228867,
'acrylics': 0.69314718056,
'actual': 1.60943791243,
}
words = {'acquires', 'acrylics', 'actual', 'acridine'}
word_tuples = tuple((value, word) for word, value in
sorted(my_dict.items(), key=itemgetter(0)) if word in words)
Note that I made words a set of strings instead of a list because doing so greatly speeds-up the if word in words membership check.
I would consider iterating over the words List and checking the dictionary for the element. The reason for not doing it the other way around is that searching for an element in a list has a complexity of O(n) while checking the dictionary will have a complexity of O(1) and is therefore much faster.
Here is my solution:
my_dict = {"acquires":1.09861228867, "acquisition":1.09861228867, "acquisitions":1.60943791243,"acquisitive":0.69314718056, "acridine":0.0, "acronyms":1.09861228867, "acrylics":0.69314718056,"actual":1.60943791243}
words = ["acquires", "acrylics", "actual", "acridine"]
word_tuples = list()
for word in words:
if word in my_dict:
word_tuples.append((word, my_dict[word]))
print(word_tuples)

coupling str elements from a list to a tuple list

I have the following list:
lines
['line_North_Mid', 'line_South_Mid',
'line_North_South', 'line_Mid_South',
'line_South_North','line_Mid_North' ]
I would like to couple them in a tuple list as follows, with respect to their names:
tuple_list
[('line_Mid_North', 'line_North_Mid'),
('line_North_South', 'line_South_North'),
('line_Mid_South', 'line_South_Mid')]
I thought maybe I could do a string search in the elements of the lines but it wont be efficient. Is there a better way to order lines elements in a way which would look like tuple_list
Paring Criteria:
If the both elements have the same Area_name: ('North', 'Mid', 'South')
E.g.: 'line_North_Mid' should be coupled with 'line_Mid_North'
Try this:
from itertools import combinations
tuple_list = [i for i in combinations(lines,2) if i[0].split('_')[1] == i[1].split('_')[2] and i[0].split('_')[2] == i[1].split('_')[1]]
or I think this is better:
[i for i in combinations(lines,2) if i[0].split('_')[1:] == i[1].split('_')[1:][::-1]]
An order-agnostic O(n) solution is possible using collections.defaultdict. The idea is to use as our dictionary keys the last 2 components of your strings delimited by '_', appending values from your input list. Then extract values and convert to a list of tuples.
from collections import defaultdict
L = ['line_North_Mid', 'line_South_Mid',
'line_North_South', 'line_Mid_South',
'line_South_North', 'line_Mid_North']
dd = defaultdict(list)
for item in L:
dd[frozenset(item.rsplit('_', maxsplit=2)[1:])].append(item)
res = list(map(tuple, dd.values()))
# [('line_North_Mid', 'line_Mid_North'),
# ('line_South_Mid', 'line_Mid_South'),
# ('line_North_South', 'line_South_North')]
You can use the following list comprehension:
lines = ['line_Mid_North', 'line_North_Mid',
'line_North_South', 'line_South_North',
'line_Mid_South', 'line_South_Mid']
[(j,i) for i in lines for j in lines if j not in i
if set(j.split('_')[1:]) < set(i.split('_'))][::2]
[('line_Mid_North', 'line_North_Mid'),
('line_North_South', 'line_South_North'),
('line_Mid_South', 'line_South_Mid')]
I suggest you have a function that returns the same key for string that are supposed to be together (a grouping-key).
def key(s):
# ignore first part and sort other 2 parts, so they will always be in same order
_, part_1, part_2 = s.split('_')
return tuple(sorted([part_1, part_2]))
The you have to use some grouping method; I used defaultdict for example:
import collections
lines = [
'line_North_Mid', 'line_South_Mid',
'line_North_South', 'line_Mid_South',
'line_South_North','line_Mid_North',
]
dd = collections.defaultdict(list)
for s in lines:
dd[key(s)].append(s) # those with same key get grouped
print(list(tuple(v) for v in dd.values()))
# [
# ('line_North_Mid', 'line_Mid_North'),
# ('line_South_Mid', 'line_Mid_South'),
# ('line_North_South', 'line_South_North'),
# ]

Counting only the frequency of letters in a string

I am trying to make my program count everything but numbers in a string, and store it in a dictionary.
So far I have this:
string = str(input("Enter a string: "))
stringUpper = string.upper()
dict = {}
for n in stringUpper:
keys = dict.keys()
if n in keys:
dict[n] += 1
else:
dict[n] = 1
print(dict)
I just want the alphabetical numbers quantified, but I cannot figure out how to exclude the non-alphabetical characters.
Basically there are multiple steps involved:
Getting rid of the chars that you don't want to count
Count the remaining
You have several options available to do these. I'll just present one option, but keep in mind that there might be other (and better) alternatives.
from collections import Counter
the_input = input('Enter something')
Counter(char for char in the_input.upper() if char.isalpha())
For example:
Enter something: aashkfze3f8237rhbjasdkvjuhb
Counter({'A': 3,
'B': 2,
'D': 1,
'E': 1,
'F': 2,
'H': 3,
'J': 2,
'K': 2,
'R': 1,
'S': 2,
'U': 1,
'V': 1,
'Z': 1})
So it obviously worked. Here I used collections.Counter to count and a generator expression using str.isalpha as condition to get rid of the unwanted characters.
Note that there are several bad habits in your code that will make your life more complicated than it needs to be:
dict = {} will shadow the built-in dict. So it's better to choose a different name.
string is the name of a built-in module, so here a different name might be better (but not str which is a built-in name as well).
stringUpper = string.upper(). In Python you generally don't use camelCase but use _ to seperate word (i.e. string_upper) but since you only use it to loop over you might as well use for n in string.upper(): directly.
Variable names like n aren't very helpful. Usually you can name them char or character when iterating over a string or item when iterating over a "general" iterable.
You can use re to replace all non-alphabetical characters before doing any manipulation:
regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', stringUpper )
string = str(input("Enter a string: "))
stringUpper = string.upper()
dict = {}
for n in stringUpper:
if n not in '0123456789':
keys = dict.keys()
if n in keys:
dict[n] += 1
else:
dict[n] = 1
print(dict)
for n in stringUpper:
if n.isalpha()
dict[n] += 1
else:
dict[n] = 1
print(dict)
You can check string for alphanumeric
n.isalnum()
for aphabetic:
n.isalpha()
So your code will be like:
dict = {}
for n in stringUpper:
if n.isalpha():
keys = dict.keys()
if n in keys:
dict[n] += 1
else:
dict[n] = 1
print(dict)
else:
#do something....
While iterating, check if the lower() and upper() is the same for a character. If they are different from each other, then it is an alphabetical letter.
if n.upper() == n.lower():
continue
This should do it.

How to count elements on each position in lists

I have a lot of lists like:
SI821lzc1n4
MCap1kr01lv
All of them have the same length. I need to count how many times each symbol appears on each position. Example:
abcd
a5c1
b51d
Here it'll be a5cd
One way is to use zip to associate characters in the same position. We can then send all of the characters from each position to a Counter, then use Counter.most_common to get the most common character
from collections import Counter
l = ['abcd', 'a5c1', 'b51d']
print(''.join([Counter(z).most_common(1)[0][0] for z in zip(*l)]))
# a5cd
from statistics import mode
[mode([x[i] for x in y]) for i in xrange(len(y[0]))]
where y is your list.
Python 3.4 and up
You could use combination of zip and Counter
a = ("abcd")
b = ("a5c1")
c = ("b51d")
from collections import Counter
zippedList = list(zip(a,b,c))
print("zipped: {}".format(zippedList))
final = ""
for x in zippedList:
countLetters = Counter(x)
print(countLetters)
final += countLetters.most_common(3)[0][0]
print("output: {}".format(final))
output:
zipped: [('a', 'a', 'b'), ('b', '5', '5'), ('c', 'c', '1'), ('d', '1', 'd')]
Counter({'a': 2, 'b': 1})
Counter({'5': 2, 'b': 1})
Counter({'c': 2, '1': 1})
Counter({'d': 2, '1': 1})
output: a5cd
This all depends on where your list is. Is your list coming from another file or is it an actual array? At the end of the day, the best way to do this simply is going to be to use a dictionary and a for loop.
new_dict = {}
for i in range(len(line)):
if i in new_dict:
new_dict[i].append(line[i])
else:
new_dict[i] = [line[i]]
Then after that I'm assuming that you'd like to output the four most common element appearances. For that I'd recommend importing statistics and using the mode method...
from statistics import mode
new_line = ""
for key in new_dict:
x = mode(new_dict[key])
new_line = new_line + x
However, your question is quite vague, please elaborate more next time.
P.s. I'm a newbie so all you experienced programmers plz don't hate :)
I would use a combination of defaultdict, enumerate, and Counter:
>>> from collections import Counter, defaultdict
>>> data = '''abcd
a5c1
b51d
'''
>>> poscount = defaultdict(Counter)
>>> for line in data.split():
for i, character in enumerate(line):
poscount[i][character] += 1
>>> ''.join([poscount[i].most_common(1)[0][0] for i in sorted(poscount)])
'a5cd'
Here's how it works:
The defaultdict() creates new entries when it sees a new key.
The enumerate() function returns both the character and its position in the line.
The Counter counts the occurences of individual characters
Combining the three makes a defaultdict whose keys are the column positions and whose values are character counters. That gives you one character counter per column.
The most_common() method returns the highest frequency (character, count) pair for that counter.
The [0][0] extracts the character from the list of (character, count) tuples.
The str.join() method combines the results back together.

Creating and rearranging a dictionary

I am new to python! I have created a code which successfully opens my text file and sorts my list of 100's of words. I then have put these in a list labelled stimuli_words, which consists of no duplicates words, all lower case etc.
However I now want to convert this list into a dictionary, where the keys are all possible 3 letter endings in my list of words, and the values are the words that correspond to those endings.
For instance 'ing: going, hiring...', but I only want the words in which have more than 40 words corresponding to the last two characters. So far I have this code:
from collections import defaultdict
fq = defaultdict( int )
for w in stimuli_list:
fq[w] += 1
print fq
However it is just returning a dictionary with my words and how many times they occur which is obviously once. e.g 'going': 1, 'hiring': 1, 'driving': 1.
Really would appreciate some help!! Thank You!!
You could do something like this:
dictionary = {}
words = ['going', 'hiring', 'driving', 'letter', 'better', ...] # your list or words
# Creating words dictionary
for word in words:
dictionary.setdefault(word[-3:], []).append(word)
# Removing lists that contain less than 40 words:
for key, value in dictionary.copy().items():
if len(value) < 40:
del dictionary[key]
print(dictionary)
Output:
{ # Only lists that are longer than 40 words
'ing': ['going', 'hiring', 'driving', ...],
'ter': ['letter', 'better', ...],
...
}
Since you're counting the words (because your key is the word), you only get 1 count per word.
You could create a key of the 3 last characters (and use Counter instead):
import collections
wordlist = ["driving","hunting","fishing","drive","a"]
endings = collections.Counter(x[-3:] for x in wordlist)
print(endings)
result:
Counter({'ing': 3, 'a': 1, 'ive': 1})
Create DemoData:
import random
# seed the same for any run
random.seed(10)
# base lists for demo data
prae = ["help","read","muck","truck","sleep"]
post= ["ing", "biothign", "press"]
# lots of data
parts = [ x+str(y)+z for x in prae for z in post for y in range(100,1000,100)]
# shuffle and take on ever 15th
random.shuffle(parts)
stimuli_list = parts[::120]
Creation of dictionary from stimuli_list
# create key with empty lists
dic = dict(("".join(e[len(e)-3:]),[]) for e in stimuli_list)
# process data and if fitting, fill list
for d in dic:
fitting = [x for x in parts if x.endswith(d)] # adapt to only fit 2 last chars
if len(fitting) > 5: # adapt this to have at least n in it
dic[d] = fitting[:]
for d in [x for x in dic if not dic[x]]: # remove keys with empty lists
dic.remove(d)
print()
print(dic)
Output:
{'ess': ['help400press', 'sleep100press', 'sleep600press', 'help100press', 'muck400press', 'muck900press', 'muck500press', 'help800press', 'muck100press', 'read300press', 'sleep400press', 'muck800press', 'read600press', 'help200press', 'truck600press', 'truck300press', 'read700press', 'help900press', 'truck400press', 'sleep200press', 'read500press', 'help600press', 'truck900press', 'truck800press', 'muck200press', 'truck100press', 'sleep700press', 'sleep500press', 'sleep900press', 'truck200press', 'help700press', 'muck300press', 'sleep800press', 'muck700press', 'sleep300press', 'help500press', 'truck700press', 'read400press', 'read100press', 'muck600press', 'read900press', 'read200press', 'help300press', 'truck500press', 'read800press']
, 'ign': ['truck200biothign', 'muck500biothign', 'help800biothign', 'muck700biothign', 'help600biothign', 'truck300biothign', 'read200biothign', 'help500biothign', 'read900biothign', 'read700biothign', 'truck400biothign', 'help300biothign', 'read400biothign', 'truck500biothign', 'read800biothign', 'help700biothign', 'help400biothign', 'sleep600biothign', 'sleep500biothign', 'muck300biothign', 'truck700biothign', 'help200biothign', 'sleep300biothign', 'muck100biothign', 'sleep800biothign', 'muck200biothign', 'sleep400biothign', 'truck100biothign', 'muck800biothign', 'read500biothign', 'truck900biothign', 'muck600biothign', 'truck800biothign', 'sleep100biothign', 'read300biothign', 'read100biothign', 'help900biothign', 'truck600biothign', 'help100biothign', 'read600biothign', 'muck400biothign', 'muck900biothign', 'sleep900biothign', 'sleep200biothign', 'sleep700biothign']
}

Categories