Detect substrings using regex in python [duplicate] - python

This question already has answers here:
Identify strings while removing substrings in python
(3 answers)
Closed 5 years ago.
I have a dictionary as follows.
myfood = {'yummy tim tam': 1, 'tasty chips': 3, 'yummy': 10, 'a loaf of bread': 5}
I also have a set as follows.
myset = {'yummy', 'a', 'tasty', 'of', 'delicious', 'yum'}
Now I want to identify the elements of myset in substrings of myfood and remove them. Hence, my final myfood dictionary should look as follows.
myfood = {'tim tam': 1, 'chips': 3, 'yummy': 10, 'loaf bread':5}
NOTE: I do not want to remove myset elements if they are full strings. E.g., 'yummy': 10 in myfood is not removed as it is not a substring, but a full string.
My current code is as follows.
for word in myfood.keys():
if word in myset:
#Do nothing
else:
######Find the substring part and remove it
Please help me.

Use re.sub to replace only keys that are substrings:
pat = re.compile(r'|'.join([r'(\s|\b){}\b'.format(x) for x in myset]))
dct = {}
for k, v in myfood.items():
if k not in myset: # exclude full strings
k = pat.sub('', k).strip()
dct[k] = v
print(dct)
# {'yummy': 10, 'loaf bread': 5, 'tim tam': 1, 'chips': 3}

Related

Counting by partial string in Python

So, I have a list of strings (upper case letters).
list = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01', 'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
How can I group and count occurrence in the list?
Expected output:
You may try using the Counter library here:
from collections import Counter
import re
list = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01', 'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
list = [re.sub(r'\d+$', '', x) for x in list]
print(Counter(list))
This prints:
Counter({'HORSE': 4, 'CAT': 3, 'DOG': 2})
Note that the above approach simply strips off the number endings of each list element, then does an aggregation on the alpha names only.
you can also use dictionary
list= ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01',
'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
dic={}
for i in list:
i=i[:-2]
if i in dic:
dic[i]=dic[i]+1
else:
dic[i]=1
print(dic)

How to assign same values in dictionary to another list of strings

I am trying to convert strings to numbers and then assign same values for same words in another list of strings. Assume I have string A like below. I converted to values using dictionary like the code below. Now I need to assign same values to same string in list B and the output should be like res_B
A='hello world how are you doing'`
res_A = [1, 2, 3, 4, 5, 6]
B=['hello world how', 'hello are' ,'hello', 'hello are you doing']
res_B = [[1,2,3],[1,4],[1],[1,4,5,6]]
A='hello world how are you doing'
d = {}
res_A = [d.setdefault(word, len(d)+1) for word in A.lower().split()]
# map words from A into indices 1..N
mapping = {k: v for v, k in enumerate(A.split(), 1)}
# find mappings of words in B
B_res = [[mapping[word] for word in s.split()] for s in B]
Using a list comprehension again, and the same dictionary from before:
res_B = [
[d[word] for word in phrase.lower().split()]
for phrase in B
]
Here's a step-by-step and functional way to do it
# Create a lookup dictionary
lookup = {word: index for word, index in zip(A.split(' '), res_A)}
# Map every sentence to be replaced with lookup values per word
res_B = [list(map(lambda x: lookup[x], sentence.split(' '),
sentence)) for sentence in B]

Creating and rearranging a dictionary

I am new to python! I have created a code which successfully opens my text file and sorts my list of 100's of words. I then have put these in a list labelled stimuli_words, which consists of no duplicates words, all lower case etc.
However I now want to convert this list into a dictionary, where the keys are all possible 3 letter endings in my list of words, and the values are the words that correspond to those endings.
For instance 'ing: going, hiring...', but I only want the words in which have more than 40 words corresponding to the last two characters. So far I have this code:
from collections import defaultdict
fq = defaultdict( int )
for w in stimuli_list:
fq[w] += 1
print fq
However it is just returning a dictionary with my words and how many times they occur which is obviously once. e.g 'going': 1, 'hiring': 1, 'driving': 1.
Really would appreciate some help!! Thank You!!
You could do something like this:
dictionary = {}
words = ['going', 'hiring', 'driving', 'letter', 'better', ...] # your list or words
# Creating words dictionary
for word in words:
dictionary.setdefault(word[-3:], []).append(word)
# Removing lists that contain less than 40 words:
for key, value in dictionary.copy().items():
if len(value) < 40:
del dictionary[key]
print(dictionary)
Output:
{ # Only lists that are longer than 40 words
'ing': ['going', 'hiring', 'driving', ...],
'ter': ['letter', 'better', ...],
...
}
Since you're counting the words (because your key is the word), you only get 1 count per word.
You could create a key of the 3 last characters (and use Counter instead):
import collections
wordlist = ["driving","hunting","fishing","drive","a"]
endings = collections.Counter(x[-3:] for x in wordlist)
print(endings)
result:
Counter({'ing': 3, 'a': 1, 'ive': 1})
Create DemoData:
import random
# seed the same for any run
random.seed(10)
# base lists for demo data
prae = ["help","read","muck","truck","sleep"]
post= ["ing", "biothign", "press"]
# lots of data
parts = [ x+str(y)+z for x in prae for z in post for y in range(100,1000,100)]
# shuffle and take on ever 15th
random.shuffle(parts)
stimuli_list = parts[::120]
Creation of dictionary from stimuli_list
# create key with empty lists
dic = dict(("".join(e[len(e)-3:]),[]) for e in stimuli_list)
# process data and if fitting, fill list
for d in dic:
fitting = [x for x in parts if x.endswith(d)] # adapt to only fit 2 last chars
if len(fitting) > 5: # adapt this to have at least n in it
dic[d] = fitting[:]
for d in [x for x in dic if not dic[x]]: # remove keys with empty lists
dic.remove(d)
print()
print(dic)
Output:
{'ess': ['help400press', 'sleep100press', 'sleep600press', 'help100press', 'muck400press', 'muck900press', 'muck500press', 'help800press', 'muck100press', 'read300press', 'sleep400press', 'muck800press', 'read600press', 'help200press', 'truck600press', 'truck300press', 'read700press', 'help900press', 'truck400press', 'sleep200press', 'read500press', 'help600press', 'truck900press', 'truck800press', 'muck200press', 'truck100press', 'sleep700press', 'sleep500press', 'sleep900press', 'truck200press', 'help700press', 'muck300press', 'sleep800press', 'muck700press', 'sleep300press', 'help500press', 'truck700press', 'read400press', 'read100press', 'muck600press', 'read900press', 'read200press', 'help300press', 'truck500press', 'read800press']
, 'ign': ['truck200biothign', 'muck500biothign', 'help800biothign', 'muck700biothign', 'help600biothign', 'truck300biothign', 'read200biothign', 'help500biothign', 'read900biothign', 'read700biothign', 'truck400biothign', 'help300biothign', 'read400biothign', 'truck500biothign', 'read800biothign', 'help700biothign', 'help400biothign', 'sleep600biothign', 'sleep500biothign', 'muck300biothign', 'truck700biothign', 'help200biothign', 'sleep300biothign', 'muck100biothign', 'sleep800biothign', 'muck200biothign', 'sleep400biothign', 'truck100biothign', 'muck800biothign', 'read500biothign', 'truck900biothign', 'muck600biothign', 'truck800biothign', 'sleep100biothign', 'read300biothign', 'read100biothign', 'help900biothign', 'truck600biothign', 'help100biothign', 'read600biothign', 'muck400biothign', 'muck900biothign', 'sleep900biothign', 'sleep200biothign', 'sleep700biothign']
}

NLTK Brown Corpus Tags

When I print nltk.corpus.brown.tagged_words() it prints about 1161192 tuples with words and their associated tags.
I want to distinguish different distinct words having different distinct tags. One word can have more than one tag.
Append list items by number of hyphens available I tried every code with this thread but I am not getting any word more than 3 tags. As far as I know, there are words with even 8 or 9 tags also.
Where my approach is wrong? How to resolve this? I have two different questions:
How to figure out the count of different words of the corpus under different distinct tags? the number of distinct words in the corpus having let's say 8 distinct tags.
Again, I want to know word with the greatest number of distinct tags.
And, I have interest with words only. I am removing punctuations.
Use a defaultdict(Counter) to keep track of words and their POS. Then sort the dictionary by the keys' len(Counter):
from collections import defaultdict, Counter
from nltk.corpus import brown
# Keeps words and pos into a dictionary
# where the key is a word and
# the value is a counter of POS and counts
word_tags = defaultdict(Counter)
for word, pos in brown.tagged_words():
word_tags[word][pos] +=1
# To access the POS counter.
print 'Red', word_tags['Red']
print 'Marlowe', word_tags['Marlowe']
print
# Greatest number of distinct tag.
word_with_most_distinct_pos = sorted(word_tags, key=lambda x: len(word_tags[x]), reverse=True)[0]
print word_with_most_distinct_pos
print word_tags[word_with_most_distinct_pos]
print len(word_tags[word_with_most_distinct_pos])
[out]:
Red Counter({u'JJ-TL': 49, u'NP': 21, u'JJ': 3, u'NN-TL': 1, u'JJ-TL-HL': 1})
Marlowe Counter({u'NP': 4})
that
Counter({u'CS': 6419, u'DT': 1975, u'WPS': 1638, u'WPO': 135, u'QL': 54, u'DT-NC': 6, u'WPS-NC': 3, u'CS-NC': 2, u'WPS-HL': 2, u'NIL': 1, u'CS-HL': 1, u'WPO-NC': 1})
12
To get words with X no. of distinct POS:
# Words with 8 distinct POS
word_with_eight_pos = filter(lambda x: len(word_tags[x]) == 8, word_tags.keys())
for i in word_with_eight_pos:
print i, word_tags[i]
print
# Words with 9 distinct POS
word_with_nine_pos = filter(lambda x: len(word_tags[x]) == 9, word_tags.keys())
for i in word_with_nine_pos:
print i, word_tags[i]
[out]:
a Counter({u'AT': 21824, u'AT-HL': 40, u'AT-NC': 7, u'FW-IN': 4, u'NIL': 3, u'FW-IN-TL': 1, u'AT-TL': 1, u'NN': 1})
: Counter({u':': 1558, u':-HL': 138, u'.': 46, u':-TL': 22, u'IN': 20, u'.-HL': 8, u'NIL': 1, u',': 1, u'NP': 1})
The NLTK provides the perfect tool to index all tags used for each word:
wordtags = nltk.ConditionalFreqDist(nltk.corpus.brown.tagged_words())
Or if you want to case-fold the words as you go:
wordtags = nltk.ConditionalFreqDist((w.lower(), t) for w, t in brown.tagged_words())
We now have an index of the tags belonging to each word (plus their frequencies, which the OP didn't care about):
>>> print(wordtags["clean"].items())
dict_items([('JJ', 48), ('NN-TL', 1), ('RB', 1), ('VB-HL', 1), ('VB', 18)])
To find the words with the most tags, fall back on general Python sorting:
>>> wtlist = sorted(wordtags.items(), key=lambda x: len(x[1]), reverse=True)
>>> for word, freqs in wtlist[:10]:
print(word, "\t", len(freqs), list(freqs))
that 15 ['DT', 'WPS-TL', 'CS-NC', 'DT-NC', 'WPS-NC', 'WPS', 'NIL', 'CS-HL', 'WPS-HL',
'WPO-NC', 'DT-TL', 'DT-HL', 'CS', 'QL', 'WPO']
a 13 ['NN-TL', 'AT-NC', 'NP', 'AT', 'AT-TL-HL', 'NP-HL', 'NIL', 'AT-TL', 'NN',
'NP-TL', 'AT-HL', 'FW-IN-TL', 'FW-IN']
(etc.)
You can use itertools.groupby to achieve what you want. Do note that the following code is just quickly bashed together and most likely not the most efficient way to achieve your goal (I'll leave it up to you to optimise it), however it does the job...
import itertools
import operator
import nltk
for k, g in itertools.groupby(sorted(nltk.corpus.brown.tagged_words()), key=operator.itemgetter(0)):
print k, set(map(operator.itemgetter(1), g))
Output:
...
yonder set([u'RB'])
yongst set([u'JJT'])
yore set([u'NN', u'PP$'])
yori set([u'FW-NNS'])
you set([u'PPSS-NC', u'PPO', u'PPSS', u'PPO-NC', u'PPO-HL', u'PPSS-HL'])
you'd set([u'PPSS+HVD', u'PPSS+MD'])
you'll set([u'PPSS+MD'])
you're set([u'PPSS+BER'])
...
A two-line way to find the word with the greatest number of distinct tags (along with its tags):
word2tags = nltk.Index(set(nltk.corpus.brown.tagged_words()))
print(max(word2tags.items(), key=lambda wt: len(wt[1])))

String editing in python to find anagrams [duplicate]

This question already has an answer here:
Python Anagram Finder from given File
(1 answer)
Closed 9 years ago.
Given the string...
able\nacre\nbale\nbeyond\nbinary\nboat\nbrainy\ncare\ncat\ncater\ncrate\nlawn\nlist\nrace\nreact\nsheet\nsilt\nslit\ntrace\n
I am trying to figure out how to assign each word in the string to a variable, and then sort each word alphabetically which will allow me to compare them to see which ones are anagrams and which ones are not. I have around a month of Python experience so dumb everything WAY down if you could.
Instead of saving each word to a variable, you should save them all to a list. Here is how I would approach the complete problem:
from itertools import groupby
from operator import itemgetter
s = 'able\nacre\nbale\nbeyond\nbinary\nboat\nbrainy\ncare\ncat\ncater\ncrate\nlawn\nlist\nrace\nreact\nsheet\nsilt\nslit\ntrace\n'
words = s.strip().split()
sorted_words = (''.join(sorted(line)) for line in words)
grouped = sorted((v, i) for i, v in enumerate(sorted_words))
anagrams = [[words[i] for v, i in g] for k, g in groupby(grouped, itemgetter(0))]
Result:
>>> import pprint
>>> pprint.pprint(anagrams)
[['able', 'bale'],
['binary', 'brainy'],
['boat'],
['acre', 'care', 'race'],
['cater', 'crate', 'react', 'trace'],
['cat'],
['lawn'],
['beyond'],
['sheet'],
['list', 'silt', 'slit']]
In [27]: s = 'able\nacre\nbale\nbeyond\nbinary\nboat\nbrainy\ncare\ncat\ncater\ncrate\nlawn\nlist\nrace\nreact\nsheet\nsilt\nslit\ntrace\n'
In [28]: words = s.split()
In [29]: [''.join(sorted(w)) for w in words]
Out[29]:
['abel',
'acer',
'abel',
'bdenoy',
'abinry',
'abot',
'abinry',
...
You can do yourstring.split('whattosplitat'). In this case, that would be
l='able\nacre\nbale\nbeyond\nbinary\nboat\nbrainy\ncare\ncat\ncater\ncrate\nlawn\nlist\nrace\nreact\nsheet\nsilt\nslit\ntrace\n'.split('\n')
Then you can do l.sort() which will sort your list alphabetically.
s = 'able\nacre\nbale\nbeyond\nbinary\nboat\nbrainy\ncare\ncat\ncater\ncrate\nlawn\nlist\nrace\nreact\nsheet\nsilt\nslit\ntrace\n'
words = sorted(s.split('\n')[:-1]) # the last one will be '', so you want to get rid of that
To test whether or not a string is an anagram of another string:
def isAnagram(a, b):
aLtrs = sorted(list(a)) # if a='test', aLtrs=['e', 's', 't', 't']
bLtrs = sorted(list(a)) # same as above
return True if aLtrs==bLtrs else False

Categories