Update a DataFrame based on Counter values - python

I have corpus data stored as a list of lists of strings.
Based on this data I have the following variables:
from collections import Counter

vocab_dict = Counter()
for text in data_words:
    temp_count = Counter(text)
    vocab_dict.update(temp_count)
vocab = list(sorted(vocab_dict.keys()))
Now, I want to create a pandas DataFrame in which each column represents a word from vocab whose value in vocab_dict is higher than 3.
To do so, I have the following code:
def get_occurrence_df(data):
    vocab_words = [word for word in vocab if vocab_dict[word] > 3]
    occurrence_df = pd.DataFrame(0, index=np.arange(len(data)), columns=vocab_words)
    for i, text in enumerate(data):
        text_count = Counter(text)
        for word in text_count.keys():
            occurrence_df.loc[i, word] = text_count[word]
    return occurrence_df
However, running the function get_occurrence_df() takes a very long time. Is there a way to get the same df faster?

This should work a bit faster. It's not in functional form, but it should be straightforward to refactor:
from collections import Counter
import pandas as pd
data_words = [["abc", "def", "abc"], ["xyz", "xyz", "xyz", "def"]]
# create a list of dictionaries with counts
temp_list = [
    {k: v for k, v in Counter(words).items() if v >= 2}
    for words in data_words
]
occurrence_df = pd.DataFrame(temp_list).fillna(0)
Note that it's better to filter out infrequent words right away: there will be many of them, and it's wasteful to clog memory with objects that will not be used downstream.
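For completeness, here is one way to put the same idea back into a function like the question's get_occurrence_df — a minimal sketch (min_total is an assumed parameter name), this time filtering on the global counts the question uses rather than per-text counts:

from collections import Counter
import pandas as pd

def get_occurrence_df(data, min_total=3):
    # global word counts across all texts, like vocab_dict in the question
    totals = Counter()
    per_text = []
    for text in data:
        counts = Counter(text)
        per_text.append(counts)
        totals.update(counts)
    keep = {w for w, n in totals.items() if n > min_total}
    rows = [{w: n for w, n in counts.items() if w in keep} for counts in per_text]
    return pd.DataFrame(rows, columns=sorted(keep)).fillna(0).astype(int)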

Related

How to create tuples from a dictionary of words and values with the value first

I have a dictionary file, here are a few example lines:
acquires,1.09861228867
acquisition,1.09861228867
acquisitions,1.60943791243
acquisitive,0.69314718056
acridine,0.0
acronyms,1.09861228867
acrylics,0.69314718056
actual,1.60943791243
words = ["acquires", "acrylics", "actual", "acridine"]
I need the output to be:
word_tuples = ((1.09861228867,acquires),(0.69314718056,acrylics), (1.60943791243,actual),
(0.0,acridine))
I tried doing,
sorted_list = []
word_tuples = [(key, value) for key, value in dict]
if words in word_tuples:
    sorted_list.append(word_tuples[value])
You can do something like this (the dictionary is named my_dict to avoid shadowing the built-in dict):
my_dict = {"acquires": 1.09861228867, "acquisition": 1.09861228867, "acquisitions": 1.60943791243,
           "acquisitive": 0.69314718056, "acridine": 0.0, "acronyms": 1.09861228867,
           "acrylics": 0.69314718056, "actual": 1.60943791243}
words = ["acquires", "acrylics", "actual", "acridine"]
tuple_list = list()
for key, value in my_dict.items():
    if key in words:
        tuple_list.append((value, key))
print(tuple_list)
You could do it by passing a generator expression to the tuple constructor:
from operator import itemgetter
my_dict = {
    'acquires': 1.09861228867,
    'acquisition': 1.09861228867,
    'acquisitions': 1.60943791243,
    'acquisitive': 0.69314718056,
    'acridine': 0.0,
    'acronyms': 1.09861228867,
    'acrylics': 0.69314718056,
    'actual': 1.60943791243,
}
words = {'acquires', 'acrylics', 'actual', 'acridine'}
word_tuples = tuple((value, word) for word, value in
                    sorted(my_dict.items(), key=itemgetter(0)) if word in words)
Note that I made words a set of strings instead of a list, because doing so greatly speeds up the if word in words membership check.
I would consider iterating over the words list and looking each element up in the dictionary. The reason for not doing it the other way around is that searching for an element in a list has a complexity of O(n), while a dictionary lookup has a complexity of O(1) and is therefore much faster.
Here is my solution:
my_dict = {"acquires":1.09861228867, "acquisition":1.09861228867, "acquisitions":1.60943791243,"acquisitive":0.69314718056, "acridine":0.0, "acronyms":1.09861228867, "acrylics":0.69314718056,"actual":1.60943791243}
words = ["acquires", "acrylics", "actual", "acridine"]
word_tuples = list()
for word in words:
if word in my_dict:
word_tuples.append((word, my_dict[word]))
print(word_tuples)

How can I find the most repeated word and how often it is repeated [duplicate]

I am using Python 3.3
I need to create two lists, one for the unique words and the other for the frequencies of the word.
I have to sort the unique word list based on the frequencies list so that the word with the highest frequency is first in the list.
I have the design in text but am uncertain how to implement it in Python.
The methods I have found so far use either Counter or dictionaries which we have not learned. I have already created the list from the file containing all the words but do not know how to find the frequency of each word in the list. I know I will need a loop to do this but cannot figure it out.
Here's the basic design:
original list = ["the", "car", ...]
newlst = []
frequency = []
for word in the original list:
    if word not in newlst:
        newlst.append(word)
        set frequency = 1
    else:
        increase the frequency
sort newlst based on frequency list
Use this:
from collections import Counter
list1=['apple','egg','apple','banana','egg','apple']
counts = Counter(list1)
print(counts)
# Counter({'apple': 3, 'egg': 2, 'banana': 1})
You can use:
from collections import Counter
It is supported from Python 2.7 onwards; read more information here.
1.
>>> c = Counter('abracadabra')
>>> c.most_common(3)
[('a', 5), ('r', 2), ('b', 2)]
Use a dict:
>>> d = {1: 'one', 2: 'one', 3: 'two'}
>>> c = Counter(d.values())
>>> c.most_common()
[('one', 2), ('two', 1)]
But you have to read the file first and convert it to a dict.
2.
This is the example from the Python docs, using re and Counter:
# Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
words = open("test.txt", "r").read().split()  # read the words into a list
uniqWords = sorted(set(words))  # remove duplicate words and sort
for word in uniqWords:
    print(words.count(word), word)
Pandas answer:
import pandas as pd
original_list = ["the", "car", "is", "red", "red", "red", "yes", "it", "is", "is", "is"]
pd.Series(original_list).value_counts()
If you wanted it in ascending order instead, it is as simple as:
pd.Series(original_list).value_counts().sort_values(ascending=True)
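If you need the two separate lists the question asks for (unique words and their frequencies, highest first), they fall out of the same Series; this step is an addition of mine, not part of the original answer:

counts = pd.Series(original_list).value_counts()
unique_words = list(counts.index)  # words, most frequent first
frequencies = list(counts.values)  # the matching counts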
Yet another solution, with a different algorithm and without using collections:
def countWords(A):
    dic = {}
    for x in A:
        if x not in dic:  # Python 2.7: if not dic.has_key(x):
            dic[x] = A.count(x)
    return dic

dic = countWords(['apple', 'egg', 'apple', 'banana', 'egg', 'apple'])
sorted_items = sorted(dic.items())  # if you want it sorted
One way would be to make a list of lists, with each sub-list in the new list containing a word and a count:
list1 = []  # this is your original list of words
list2 = []  # this is a new list of [word, count] pairs
for word in list1:
    words_so_far = [pair[0] for pair in list2]
    if word in words_so_far:
        list2[words_so_far.index(word)][1] += 1
    else:
        list2.append([word, 1])  # start the count at 1 for the first occurrence
Or, using try/except instead of a membership test:
for word in list1:
    try:
        words_so_far = [pair[0] for pair in list2]
        list2[words_so_far.index(word)][1] += 1
    except ValueError:
        list2.append([word, 1])
This would be less efficient than using a dictionary, but it uses more basic concepts.
You can use reduce() for a functional approach:
from functools import reduce  # required in Python 3

words = "apple banana apple strawberry banana lemon"
reduce(lambda d, c: d.update([(c, d.get(c, 0) + 1)]) or d, words.split(), {})
returns:
{'strawberry': 1, 'lemon': 1, 'apple': 2, 'banana': 2}
Using Counter would be the best way, but if you don't want to do that, you can implement it yourself this way.
# The list you already have
word_list = ['words', ..., 'other', 'words']
# Get a set of unique words from the list
word_set = set(word_list)
# create your frequency dictionary
freq = {}
# iterate through them, once per unique word
for word in word_set:
    freq[word] = word_list.count(word) / float(len(word_list))
freq will end up with the frequency of each word in the list you already have.
You need float() in there to convert one of the integers to a float, so the resulting value is a float (this matters in Python 2, where dividing two integers performs integer division).
Edit:
If you can't use a dict or set, here is another less efficient way:
# The list you already have
word_list = ['words', ..., 'other', 'words']
unique_words = []
for word in word_list:
    if word not in unique_words:
        unique_words += [word]
word_frequencies = []
for word in unique_words:
    word_frequencies += [float(word_list.count(word)) / len(word_list)]
for i in range(len(unique_words)):
    print(unique_words[i] + ": " + str(word_frequencies[i]))
The indices of unique_words and word_frequencies will match.
The ideal way is to use a dictionary that maps a word to its count. But if you can't use that, you might want to use two lists: one storing the words, and the other storing the counts of those words. Note that the order of words and counts matters here. Implementing this would be hard and not very efficient.
Try this:
words = []
freqs = []
for line in sorted(original list):  # takes all the lines in a text and sorts them
    line = line.rstrip()  # strips trailing whitespace
    if line not in words:  # checks whether line is already in words
        words.append(line)  # if not, adds it to the end of words
        freqs.append(1)  # and appends 1 to the end of freqs
    else:
        index = words.index(line)  # otherwise finds its position in words
        freqs[index] += 1  # and adds 1 to the matching index in freqs
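The question also asks to sort the unique-word list by frequency; a minimal sketch of that final step (my addition, assuming the words and freqs lists built above):

pairs = sorted(zip(freqs, words), reverse=True)  # highest frequency first
sorted_words = [w for _, w in pairs]
sorted_freqs = [f for f, _ in pairs]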
Here is code to support your question: is_word() validates a string so that only such strings are counted, and the hashmap is a dictionary in Python.
def is_word(word):
    cnt = 0
    for c in word:
        if 'a' <= c <= 'z' or 'A' <= c <= 'Z' or '0' <= c <= '9' or c == '$':
            cnt += 1
    if cnt == len(word):
        return True
    return False

def words_freq(s):
    d = {}
    for i in s.split():
        if is_word(i):
            if i in d:
                d[i] += 1
            else:
                d[i] = 1
    return d

print(words_freq('the the sky$ is blue not green'))
words_dict = {}
for word in original_list:
    words_dict[word] = words_dict.get(word, 0) + 1
sorted_dt = {key: value for key, value in sorted(words_dict.items(), key=lambda item: item[1], reverse=True)}
keys = list(sorted_dt.keys())
values = list(sorted_dt.values())
print(keys)
print(values)
Simple way:
d = {}
l = ['Hi', 'Hello', 'Hey', 'Hello']
for a in l:
    d[a] = l.count(a)
print(d)
Output: {'Hi': 1, 'Hello': 2, 'Hey': 1}
Word and frequency, if you need both:
def counter_(input_list_):
    lu = []
    for v in input_list_:
        ele = (v, input_list_.count(v) / len(input_list_))  # remove the /len(...) if you want raw counts instead of proportions
        if ele not in lu:
            lu.append(ele)
    return lu

counter_(['a', 'n', 'f', 'a'])
output:
[('a', 0.5), ('n', 0.25), ('f', 0.25)]
The best thing to do is:
def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist, wordfreq))
Then try:
wordListToFreqDict(originallist)

Filter a list of sets with specific criteria

I have a list of sets:
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
Expected outcome:
I want to look at each individual string, count its occurrences, and keep everything with a count >= 2, in the original format:
a = [{'foo','cpu'}, {'foo'}, {'cpu'}]
Here's what I have so far, but I'm stuck on the last part where I need to append to the new list:
from collections import Counter

counter = Counter()
for a_set in a:
    # Created a counter to count the occurrences of a word
    counter.update(a_set)
result = []
for a_set in a:
    for word in a_set:
        if counter[word] >= 2:
            # Not sure how I should append my new set below.
            result.append(a_set)
            break
print(result)
You are just appending the original set, so you should instead create a new set with only the words that occur at least twice.
result = []
for a_set in a:
    new_set = {
        word for word in a_set
        if counter[word] >= 2
    }
    if new_set:  # check if the new set is not empty
        result.append(new_set)
Instead, use the following short approach based on sets intersection:
from collections import Counter
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
c = Counter([i for s in a for i in s])
valid_keys = {k for k,v in c.items() if v >= 2}
res = [s & valid_keys for s in a if s & valid_keys]
print(res) # [{'cpu', 'foo'}, {'foo'}, {'cpu'}]
Here's what I ended up doing:
Build a counter, then iterate over the original list of sets filtering out items with fewer than 2 counts, then filter out any empty sets:
from itertools import chain
from collections import Counter
a = [{'foo','cpu','phone'},{'foo','mouse'}, {'dog','cat'}, {'cpu'}]
c = Counter(chain.from_iterable(map(list, a)))
res = list(filter(None, ({item for item in s if c[item] >= 2} for s in a)))
print(res)
Out: [{'foo', 'cpu'}, {'foo'}, {'cpu'}]

Creating and rearranging a dictionary

I am new to Python! I have created code which successfully opens my text file and sorts my list of hundreds of words. I then put these into a list labelled stimuli_words, which consists of no duplicate words, all lower case, etc.
However, I now want to convert this list into a dictionary where the keys are all possible 3-letter endings in my list of words, and the values are the words that correspond to those endings.
For instance 'ing': going, hiring..., but I only want to keep endings with more than 40 corresponding words. So far I have this code:
from collections import defaultdict

fq = defaultdict(int)
for w in stimuli_list:
    fq[w] += 1
print(fq)
However, it is just returning a dictionary with my words and how many times they occur, which is obviously once each, e.g. 'going': 1, 'hiring': 1, 'driving': 1.
Really would appreciate some help!! Thank You!!
You could do something like this:
dictionary = {}
words = ['going', 'hiring', 'driving', 'letter', 'better', ...]  # your list of words

# Creating the words dictionary
for word in words:
    dictionary.setdefault(word[-3:], []).append(word)

# Removing lists that contain fewer than 40 words:
for key, value in dictionary.copy().items():
    if len(value) < 40:
        del dictionary[key]

print(dictionary)
Output:
{ # Only lists that are longer than 40 words
'ing': ['going', 'hiring', 'driving', ...],
'ter': ['letter', 'better', ...],
...
}
Since you're counting the words (because your key is the word), you only get a count of 1 per word.
You could instead use the last 3 characters as the key (and use Counter):
import collections
wordlist = ["driving","hunting","fishing","drive","a"]
endings = collections.Counter(x[-3:] for x in wordlist)
print(endings)
result:
Counter({'ing': 3, 'a': 1, 'ive': 1})
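If you also need the words themselves grouped under each ending rather than just the counts, here is a minimal sketch using defaultdict (my addition, reusing wordlist from above and the question's threshold of 40):

from collections import defaultdict

groups = defaultdict(list)
for word in wordlist:
    groups[word[-3:]].append(word)
# keep only endings shared by more than 40 words, as in the question
frequent = {ending: ws for ending, ws in groups.items() if len(ws) > 40}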
Create demo data:
import random

# seed the same for any run
random.seed(10)
# base lists for demo data
prae = ["help", "read", "muck", "truck", "sleep"]
post = ["ing", "biothign", "press"]
# lots of data
parts = [x + str(y) + z for x in prae for z in post for y in range(100, 1000, 100)]
# shuffle and take every 120th element
random.shuffle(parts)
stimuli_list = parts[::120]
Creation of the dictionary from stimuli_list:
# create keys (the 3-letter endings) with empty lists
dic = dict((e[-3:], []) for e in stimuli_list)
# process the data and, if fitting, fill the lists
for d in dic:
    fitting = [x for x in parts if x.endswith(d)]  # adapt to match only the last 2 chars if needed
    if len(fitting) > 5:  # adapt this to require at least n words
        dic[d] = fitting[:]
# remove keys with empty lists
for d in [x for x in dic if not dic[x]]:
    del dic[d]

print()
print(dic)
Output:
{'ess': ['help400press', 'sleep100press', 'sleep600press', 'help100press', 'muck400press', 'muck900press', 'muck500press', 'help800press', 'muck100press', 'read300press', 'sleep400press', 'muck800press', 'read600press', 'help200press', 'truck600press', 'truck300press', 'read700press', 'help900press', 'truck400press', 'sleep200press', 'read500press', 'help600press', 'truck900press', 'truck800press', 'muck200press', 'truck100press', 'sleep700press', 'sleep500press', 'sleep900press', 'truck200press', 'help700press', 'muck300press', 'sleep800press', 'muck700press', 'sleep300press', 'help500press', 'truck700press', 'read400press', 'read100press', 'muck600press', 'read900press', 'read200press', 'help300press', 'truck500press', 'read800press']
, 'ign': ['truck200biothign', 'muck500biothign', 'help800biothign', 'muck700biothign', 'help600biothign', 'truck300biothign', 'read200biothign', 'help500biothign', 'read900biothign', 'read700biothign', 'truck400biothign', 'help300biothign', 'read400biothign', 'truck500biothign', 'read800biothign', 'help700biothign', 'help400biothign', 'sleep600biothign', 'sleep500biothign', 'muck300biothign', 'truck700biothign', 'help200biothign', 'sleep300biothign', 'muck100biothign', 'sleep800biothign', 'muck200biothign', 'sleep400biothign', 'truck100biothign', 'muck800biothign', 'read500biothign', 'truck900biothign', 'muck600biothign', 'truck800biothign', 'sleep100biothign', 'read300biothign', 'read100biothign', 'help900biothign', 'truck600biothign', 'help100biothign', 'read600biothign', 'muck400biothign', 'muck900biothign', 'sleep900biothign', 'sleep200biothign', 'sleep700biothign']
}

Check text/string for occurence of predefined list elements

I have several text files which I want to compare against a vocabulary list consisting of expressions and single words. The desired output should be a dictionary containing all elements of that list as keys and their respective frequency in the text file as values. To construct the vocabulary list I need to merge two lists:
list1 = ['accounting',..., 'yields', 'zero-bond']
list2 = ['accounting', 'actual cost', ..., 'zero-bond']
vocabulary_list = ['accounting', 'actual cost', ..., 'yields', 'zero-bond']
sample_text = "Accounting experts predict an increase in yields for zero-bond and yields for junk-bonds."
desired_output = {'accounting': 1, 'actual cost': 0, ..., 'yields': 2, 'zero-bond': 1}
What I tried:
def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 for each word
    ct = Counter(dict((w, 0) for w in words))
    file_words = (word for line in fileobj for word in line)
    filtered_words = (word for word in file_words if word in words)
    return Counter(filtered_words)

def print_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    with open(filepath[:-4] + '_dict' + '.txt', mode='w') as outfile:
        outfile.write('{0}\n{1}\n{2}\n\n'.format(filepath, ', '.join(words), ', '.join(counts)))
    return outfile
Is there any way to do this in Python? I figured out how to manage it with a vocabulary list of single words (1 token) but couldn't figure out a solution for the multiple-word case.
If you want to consider words ending with punctuation, you will also need to clean the text, e.g. to match both 'yields' and 'yields!':
from collections import Counter
import re

vocabulary_list = ['accounting', 'actual cost', 'yields', 'zero-bond']
d = {k: 0 for k in vocabulary_list}
sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
splitted = sample_text.split()
c = Counter()
c.update(splitted)  # get the count of every word (a list, not a set, so repeats are counted)
for k in d:
    spl = k.split()
    ln = len(spl)
    # if we have multiple words we cannot match on a single split token
    if ln > 1:
        check = re.findall(r'\b{0}\b'.format(k), sample_text)
        if check:
            d[k] += len(check)
    # else we are looking for a single word
    elif k in splitted:
        d[k] += c[k]
print(d)
To chain all the lists into a single vocab dict:
from collections import Counter
from itertools import chain
import re

c = Counter()
l1, l2 = ['accounting', 'actual cost'], ['yields', 'zero-bond']
vocabulary_dict = {k: 0 for k in chain(l1, l2)}
print(vocabulary_dict)
sample_text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()
splitted = sample_text.split()
c.update(splitted)
for k in vocabulary_dict:
    spl = k.split()
    ln = len(spl)
    if ln > 1:
        check = re.findall(r'\b{0}\b'.format(k), sample_text)
        if check:
            vocabulary_dict[k] += len(check)
    elif k in splitted:
        vocabulary_dict[k] += c[k]
print(vocabulary_dict)
You could create two dicts, one for phrases and the other for single words, and do a pass over each.
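A minimal sketch of that two-dict idea (the split into single words and phrases is my assumption, not code from the answer above):

from collections import Counter
import re

vocabulary_list = ['accounting', 'actual cost', 'yields', 'zero-bond']
text = "Accounting experts predict actual costs an increase in yields for zero-bond and yields for junk-bonds.".lower()

# split the vocabulary into single words and multi-word phrases
single = [w for w in vocabulary_list if ' ' not in w]
phrases = [p for p in vocabulary_list if ' ' in p]

# one Counter pass covers all single words
word_counts = Counter(re.findall(r'[\w-]+', text))
result = {w: word_counts[w] for w in single}
# one regex scan per phrase
for p in phrases:
    result[p] = len(re.findall(r'\b{0}\b'.format(re.escape(p)), text))
print(result)  # {'accounting': 1, 'yields': 2, 'zero-bond': 1, 'actual cost': 0}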
