How to collect defined items in lists python - python

I have to find the characters "a..z", "A..Z", space, "." and "," in some data.
I have tried this code:
import codecs
from collections import Counter
fh = codecs.open("mydata.txt", encoding="utf-8")
text = fh.read()
fh1 = unicode(text)
dic_freq_signs = dict(Counter(fh1.split()))
All_freq_signs = dic_freq_signs.items()
List_signs = dic_freq_signs.keys()
List_freq_signs = dic_freq_signs.values()
But it gives me counts for everything, not just the characters I am looking for.
Can anyone help?
(And it has to be unicode.)

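Check dictionary iteration. Each item is a (key, value) tuple, so you can filter the items with a list comprehension:
All_freq_signs = [item for item in dic_freq_signs.items() if item[0] == "somevalue"]
Or move the test into a function and apply it to the value:
def criteria(value):
    return value % 2 == 0
All_freq_signs = [item for item in dic_freq_signs.items() if criteria(item[1])]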

Make sure you import the string module; with it you can get the character ranges a to z and A to Z easily.
import string
from collections import Counter
Counter(any_string) gives the count of each character in the string. With split(), the Counter would instead return the count of each word in the string, which is not what you asked for. So I have assumed that you need character counts.
dic_all_chars = dict(Counter(fh1))  # counts of all characters in the string
signs = string.ascii_lowercase + string.ascii_uppercase + ' .,'  # the characters you want to check
# use a dict comprehension and keep a key only if it is one of the wanted characters
dic_freq_signs = {key: value for key, value in dic_all_chars.items()
                  if key in signs}
dic_freq_signs will then have only the characters you want to count as keys, with their counts as values.
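A self-contained sketch of the same idea for Python 3, where str is already Unicode. The sample text and the helper name count_wanted_chars are made up for illustration, not taken from the question:

```python
import string
from collections import Counter

# Characters to keep: a-z, A-Z, space, '.' and ','
WANTED = set(string.ascii_letters + " .,")

def count_wanted_chars(text):
    """Count only the characters listed in WANTED (hypothetical helper)."""
    return Counter(ch for ch in text if ch in WANTED)

freq = count_wanted_chars("Hi, you. Hi!")
print(freq["H"], freq[" "], freq["!"])  # prints: 2 2 0
```

Filtering before counting means unwanted characters such as "!" never get an entry in the result.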

Related

How can I convert a string to dictionary in a manner where I have to extract key and value from same string

I need to convert string to dictionary in a manner for example
str1 = "00001000-0009efff : a 00100000-656b2fff : b"
Output what I require is
dict1 = {'a':['00001000','0009efff'], 'b':['00100000','656b2fff']}
Note: str1 can have many more such c, d, e with range.
You can do it with a regex:
import re
pattern = r'([\w\-]+) : ([\w\.]+)'
out = {m[1]: m[0].split('-') for m in re.findall(pattern, str1)}
Explanation of the regex:
[\w\-]+ matches a run of alphanumeric characters and dashes (the range),
followed by a space, a colon and a space,
followed by [\w\.]+, a run of alphanumeric characters and dots (the key).
The two groups capture the relevant parts: re.findall returns them as a tuple, so m[0] is the range and m[1] is the key.
Assuming you only have a single key letter for each value:
str1 = str1.replace(" : ", ":").split(" ")
output = {}
for s in str1:
    output[s[-1]] = s[:-2].split("-")
This code will work in general:
str1 = "00001000-0009efff : a 00100000-656b2fff : b"
needed_dictionary = dict()
split_string = str1.split()
for i in range(len(split_string)):
    if split_string[i] == ":":
        needed_dictionary[split_string[i+1]] = split_string[i-1].split("-")
print(needed_dictionary)
But if the values or keys contain "-" or ":", this will fail.
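A stricter sketch, assuming every range is two hexadecimal numbers joined by a dash and every key is a single word. The pattern, the group names and the function name parse_ranges are illustrative assumptions, not part of the answers above:

```python
import re

# Assumed format: "<hex>-<hex> : <key>", repeated and separated by whitespace
PATTERN = re.compile(
    r"(?P<start>[0-9a-fA-F]+)-(?P<end>[0-9a-fA-F]+)\s*:\s*(?P<key>\w+)"
)

def parse_ranges(text):
    """Map each key to its [start, end] pair (hypothetical helper)."""
    return {m.group("key"): [m.group("start"), m.group("end")]
            for m in PATTERN.finditer(text)}

str1 = "00001000-0009efff : a 00100000-656b2fff : b"
print(parse_ranges(str1))
# prints: {'a': ['00001000', '0009efff'], 'b': ['00100000', '656b2fff']}
```

Named groups make it explicit which part of each match ends up as the key and which as the value.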

Python2 tokenization and add to dictionary

I have some texts that I need to split into tokens on spaces. I also need to remove all punctuation, and everything inside double brackets [[...]] (including the brackets themselves).
Each token will become a key in a dictionary whose value is a list.
I have tried regexes to remove these double-bracket patterns, and if-elses, but I can't find a solution that works. For the moment I have:
tokenDic = dict()
splittedWords = re.findall(r'\[\[\s*([^][]*?)]]', docs[doc], re.IGNORECASE)
tokenStr = splittedWords.split()
for token in tokenStr:
    tokenDic[token].append(value);
Is this what you're looking for?
import re
value_list = []
inp_str = 'blahblah[[blahblah]]thi ng1[[junk]]hmm'
tokenDic = dict()
#remove everything in double brackets
bracket_stuff_removed = re.sub(r'\[\[[^]]*\]\]', '', inp_str)
#function to keep only letters and digits
clean_func = lambda x: 97 <= ord(x.lower()) <= 122 or 48 <= ord(x) <= 57
for token in bracket_stuff_removed.split(' '):
    cleaned_token = ''.join(filter(clean_func, token))
    tokenDic[cleaned_token] = list(value_list)
print(tokenDic)
Output:
{'blahblahthi': [], 'ng1hmm': []}
As for appending to the list, I don't have enough info right now to tell you the best way in your situation.
If you want to set the value when you're adding the key, do this:
tokenDic[cleaned_token] = [val1, val2, val3]
If you want to set the values after the key has been added, do this:
val_to_add = "something"
if cleaned_token not in tokenDic:
    print('ERROR', cleaned_token, 'does not exist in dict')
else:
    tokenDic[cleaned_token].append(val_to_add)
If you want to append directly in both cases, use collections.defaultdict(list) instead of dict. If the key does not exist yet, the defaultdict creates it with an empty list as its value, and your value is then appended to that list.
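A minimal sketch of the defaultdict approach; the (token, value) pairs are made-up sample data:

```python
from collections import defaultdict

# Missing keys are created on first access with an empty list as value
token_dic = defaultdict(list)

# Hypothetical (token, value) pairs standing in for the real data
pairs = [("foo", 1), ("bar", 2), ("foo", 3)]
for token, value in pairs:
    token_dic[token].append(value)

print(dict(token_dic))  # prints: {'foo': [1, 3], 'bar': [2]}
```

No existence check is needed before append, because looking up a missing key inserts the empty list first.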
To remove everything inside [[ ]] you can use re.sub. You already have the correct regex, so just do this:
import re
x = "[[hello]]w&o%r*ld^$"
y = re.sub(r"\[\[\s*([^][]*?)]]", "", x)
z = re.sub(r"[^a-zA-Z\s]", "", y)
print(z)
This prints "world".

How do I find the predominant letters in a list of strings

I want to find, for each position in the strings, the character that appears most often at that position. If several characters are tied for the highest frequency, keep the first one. All strings in the list are guaranteed to have the same length.
I tried the following way:
print(max(((letter, strings.count(letter)) for letter in strings), key=lambda x:[1])[0])
But I get: mistul or qagic
And I can not figure out what's wrong with my code.
My list of strings looks like this:
Input: strings = ['mistul', 'aidteh', 'mhfjtr', 'zxcjer']
Output: mister
Explanation: In the first position, m appears twice. In the second, i appears twice. In the third, there is no predominant character, so we choose the first one, s. In the fourth position, t and j each appear twice, but t comes first, so we keep t. In the fifth position e appears twice, and in the last position r appears twice.
More examples:
Input: ['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih', 'mbpzu', 'pbghn', 'mzsev', 'saqbl', 'myead']
Output: magic
Input: ['sacbkt', 'tnqaex', 'vhcrhl', 'obotnq', 'vevleg', 'rljnlv', 'jdcjrk', 'zuwtee', 'xycbvm', 'szgczt', 'imhepi', 'febybq', 'pqkdfg', 'swwlds', 'ecmrut', 'buwruy', 'icjwet', 'gebgbq', 'djtfzr', 'uenleo']
Expected Output: secret
Some help?
Finally a use case for zip() :-)
If you like cryptic code, it could even be done in one statement:
def solve(strings):
    return ''.join([max([(letter, letters.count(letter)) for letter in letters], key=lambda x: x[1])[0] for letters in zip(*strings)])
But I prefer a more readable version:
def solve(strings):
    result = ''
    # "zip" the strings, so in the first iteration `letters` is a tuple
    # containing the first letter of each word, in the second iteration
    # a tuple of all second letters, and so on...
    for letters in zip(*strings):
        # Create a list of (letter, count) pairs:
        letter_counts = [(letter, letters.count(letter)) for letter in letters]
        # Get the first letter with the highest count, and append it to result:
        result += max(letter_counts, key=lambda x: x[1])[0]
    return result
# Test function with input data from question:
assert solve(['mistul', 'aidteh', 'mhfjtr', 'zxcjer']) == 'mister'
assert solve(['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih', 'mbpzu', 'pbghn',
              'mzsev', 'saqbl', 'myead']) == 'magic'
assert solve(['sacbkt', 'tnqaex', 'vhcrhl', 'obotnq', 'vevleg', 'rljnlv',
              'jdcjrk', 'zuwtee', 'xycbvm', 'szgczt', 'imhepi', 'febybq',
              'pqkdfg', 'swwlds', 'ecmrut', 'buwruy', 'icjwet', 'gebgbq',
              'djtfzr', 'uenleo']) == 'secret'
UPDATE
@dun suggested a smarter way of using the max() function, which makes the one-liner actually quite readable :-)
def solve(strings):
    return ''.join([max(letters, key=letters.count) for letters in zip(*strings)])
Using collections.Counter() is a nice strategy here. Here's one way to do it:
from collections import Counter
def most_freq_at_index(strings, idx):
    chars = [s[idx] for s in strings]
    char_counts = Counter(chars)
    return char_counts.most_common(n=1)[0][0]
strings = ['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih',
           'mbpzu', 'pbghn', 'mzsev', 'saqbl', 'myead']
result = ''.join(most_freq_at_index(strings, idx) for idx in range(5))
print(result)
## 'magic'
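A sketch that combines zip() with Counter so the word length is not hard-coded. It assumes Python 3.7 or later, where most_common() lists elements with equal counts in the order they were first encountered, which gives the "keep the first one" tie-break. The function name predominant_letters is illustrative:

```python
from collections import Counter

def predominant_letters(strings):
    """Most frequent letter per position; ties go to the first seen."""
    # zip(*strings) yields one tuple of letters per position
    return ''.join(Counter(column).most_common(1)[0][0]
                   for column in zip(*strings))

print(predominant_letters(['mistul', 'aidteh', 'mhfjtr', 'zxcjer']))
# prints: mister
```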
If you want something more manual without the magic of Python libraries you can do something like this:
def f(strings):
    dic = {}
    for string in strings:
        for i in range(len(string)):
            word_dic = dic.get(i, {string[i]: 0})
            word_dic[string[i]] = word_dic.get(string[i], 0) + 1
            dic[i] = word_dic
    largest_string = max(strings, key=len)
    result = ""
    for i in range(len(largest_string)):
        result += max(dic[i], key=lambda x: dic[i][x])
    return result
strings = ['qagic', 'cafbk', 'twggl', 'kaqtc', 'iisih', 'mbpzu', 'pbghn', 'mzsev', 'saqbl', 'myead']
f(strings)
'magic'

How to delete repeating letters in a string?

I am trying to write a function which will return me the string of unique characters present in the passed string. Here's my code:
def repeating_letters(given_string):
counts = {}
for char in given_string:
if char in counts:
return char
else:
counts[char] = 1
if counts[char] > 1:
del(char)
else:
return char
I am not getting the expected results with it. How can I get the desired result?
Here when I am passing this string as input:
sample_input = "abcadb"
I am expecting the result to be:
"abcd"
However my code is returning me just:
nothing
def repeating_letters(given_string):
    seen = set()
    ret = []
    for c in given_string:
        if c not in seen:
            ret.append(c)
            seen.add(c)
    return ''.join(ret)
Here we add each letter to the set seen the first time we see it, at the same time adding it to a list ret. Then we return the joined list.
Here's a one-liner that keeps the characters in order of first appearance, using set with sorted:
>>> my_str = 'abcadbgeg'
>>> ''.join(sorted(set(my_str),key=my_str.index))
'abcdge'
Here sorted orders the characters in the set by the index of their first occurrence in the original string, giving an ordered list of characters.
However, if the order in the resulting string doesn't matter, you can simply do:
>>> ''.join(set(my_str))
'acbedg'
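Another order-preserving option, not from the answers above, is dict.fromkeys. This is a sketch that assumes Python 3.7 or later, where dicts keep insertion order; the function name unique_chars is illustrative:

```python
def unique_chars(s):
    """Drop repeated characters, keeping the first occurrence of each."""
    # dict keys are unique and, from Python 3.7 on, keep insertion order
    return ''.join(dict.fromkeys(s))

print(unique_chars("abcadb"))  # prints: abcd
```

This makes a single pass over the string, whereas sorting by my_str.index searches the string again for every distinct character.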

Python string replace

I have this code:
ALPHABET1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key = "TES"
ALPHABET2 = key + ALPHABET1
count_result = ALPHABET2.count("T")
if (count_result > 1):
    ALPHABET3 = ALPHABET1.replace("T","")
    ALPHABET2 = key + ALPHABET3
print(ALPHABET2)
I want to be able to put the keyword at the start of the alphabet string to create a new string without repeating the letters in the keyword. I'm having some problems doing this though. I need the keyword to work for all letters as it will be user input in my program. Any suggestions?
Two things:
You don't need to write out the alphabet yourself; import string and use string.ascii_uppercase.
You can use a for loop to work through the characters in your key.
To illustrate the latter:
alphabet = string.ascii_uppercase
for c in key:
    alphabet = alphabet.replace(c, "")
Better yet, build a list of the remaining letters and put the key's letters in front:
alpha = [c for c in string.ascii_uppercase if c not in key]
alpha = list(key) + alpha  # keyword first, then the remaining letters
It's easy and clean to do this with a regex:
import re
ALPHABET1 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
key = "TES"
newalphabet = key.upper() + re.sub('|'.join(key.upper()), '', ALPHABET1)
or with a list comprehension like @jonrsharpe suggested.
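The snippets above assume the keyword has no repeated letters. A sketch that also handles a keyword such as "SECRET", assuming letters outside A to Z should be ignored; the function name keyed_alphabet is hypothetical:

```python
import string

def keyed_alphabet(key, alphabet=string.ascii_uppercase):
    """Keyword letters first (each once), then the rest of the alphabet."""
    seen = set()
    result = []
    for c in key.upper() + alphabet:
        # Skip characters outside the alphabet and letters already used
        if c in alphabet and c not in seen:
            seen.add(c)
            result.append(c)
    return ''.join(result)

print(keyed_alphabet("TES"))     # prints: TESABCDFGHIJKLMNOPQRUVWXYZ
print(keyed_alphabet("secret"))  # prints: SECRTABDFGHIJKLMNOPQUVWXYZ
```

Walking over the keyword and the alphabet in one loop keeps the deduplication rule in a single place.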
