Group nltk.FreqDist output by first word (python)

I'm an amateur with basic coding skills in Python. I'm working on a data frame that has a column as below; the intent is to group the output of nltk.FreqDist by the first word.
What I have so far
t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)
# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
for key in sorted(filter_words):
print("%s: %s" % (key, filter_words[key]))
sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1
I have 10000+ rows in my output.
My Expected Output
I would like to group the output by the first word and extract it as a dataframe.
What I have tried among other solutions
I have tried adapting the solutions given here and here, but with no satisfactory results.
Any help/guidance appreciated.

Try the following (documentation is inside the code):
import itertools

import nltk

# I assume the input, t_words, is a list of strings (each containing multiple words)
t_words = ...

# This creates a counter mapping each string to its number of occurrences
input_frequencies = nltk.FreqDist(t_words)

# Take inputs only if they appear more than 3 times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m), where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
    input_str: count
    for input_str, count in input_frequencies.items()
    if count > 3
}

# We will apply this function to each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
    # You can replace this with a better implementation from nltk
    return value.split(' ')[0]

# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
# Note that groupby only groups consecutive items, so the strings are sorted
# by their first word before grouping.
first_word_to_inputs = itertools.groupby(
    # Take the strings from the above dictionary, sorted by first word
    sorted(frequent_inputs.keys(), key=first_word),
    # And key by the first word
    first_word)

# If you also want to keep the count of each string, we can map from the
# first word to a list of (string, count) pairs:
first_word_to_inputs_and_counts = itertools.groupby(
    # Pairs of string and count, sorted by the first word of the string
    sorted(frequent_inputs.items(), key=lambda pair: first_word(pair[0])),
    # Extract the string from the pair, and then take the first word
    lambda pair: first_word(pair[0])
)
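If the end goal is a dataframe, note that itertools.groupby yields lazy (key, group) pairs, so you would materialize them yourself. A minimal sketch, assuming the first_word_to_inputs_and_counts variable from the snippet above:
import pandas as pd

# Materialize the lazy groupby result into rows of (first word, list of messages).
rows = [
    (word, [message for message, count in group])
    for word, group in first_word_to_inputs_and_counts
]
grouped_df = pd.DataFrame(rows, columns=['first_word', 'messages'])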

I managed to do it like below. There could be an easier implementation. But for now, this gives me what I had expected.
temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())
# Removing empty rows
filter = temp["word"] != ""
dfNew = temp[filter]
# Splitting off the first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
# New column with the sentences split without the first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
# Subsetting required columns
dfNew = dfNew[['first_word', 'rest_words']]
# Grouping by first word
dfNew = dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
# Transpose
dfNew.T
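One small note on the last line: .T returns a new transposed frame rather than changing dfNew in place, so if you want to keep the transposed layout you would assign it back:
dfNew = dfNew.T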
Sample Output

Related

Finding the last data number recorded in the text using Python

I have a txt file of data that is recorded daily. The program runs every day, records the data it receives from the user, and assigns a number to each entry,
like this:
#1
data number 1
data
data
data
-------------
#2
data number 2
text
text
-------------
#3
data number 3
-------------
My problem is numbering the data. When I run the program to record a new entry in the txt file, it should find the number of the last recorded entry, add one to it, and record my data under that number.
But I can't work out how to find the last data number.
I tried this:
Find "#" in the text, list all the numbers that follow the hashtags, and take the biggest one, which should be the number of the last recorded entry.
text_file = open(r'test.txt', 'r')
line = text_file.read().splitlines()
record_list = []
for Number in line:
    hashtag = Number[Number.find('#')]
    if hashtag == '#':
        hashtag = Number[Number.find('#')+1]
        hashtag = int(hashtag)
        record_list.append(hashtag)
last_number = max(record_list)
But when I use hashtag = Number[Number.find('#')], even in the lines where there is no hashtag, it returns the first or last letters in that line as a hashtag.
And if the text file is empty, it gives the following error:
hashtag = Number[Number.find('#')]
~~~~~~^^^^^^^^^^^^^^^^^^
IndexError: string index out of range
How can I find the number of the last data and use it in saving the next data?
Consider:
>>> s = "hello world"
>>> s[s.find('#')]
'd'
>>> s.find('#')
-1
If '#' is not in the line, find returns -1, which, when used as an index, returns the last character.
We can use regular expressions and a list comprehension as one approach to solving this. Iterate over the lines, selecting only those which match the pattern of a numbered line. We'll match the number part, converting that to an int. We select the last one, which should be the highest number.
import re

with open('test.txt', 'r') as text_file:
    next_number = [
        int(m.group(1))
        for x in text_file.read().splitlines()
        if (m := re.match(r'^\s*#(\d+)\s*$', x))
    ][-1] + 1
Or we can pass a generator expression to max to ensure we get the highest number.
with open('test.txt', 'r') as text_file:
    next_number = max(
        int(m.group(1))
        for x in text_file.read().splitlines()
        if (m := re.match(r'^\s*#(\d+)\s*$', x))
    ) + 1
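Note that both snippets raise an exception when the file is empty (an IndexError from [-1] in the first, a ValueError from max() in the second), so you may want to fall back to 1 in that case. As a minimal sketch of then saving the next entry, assuming the record layout from the question and the re import above (new_lines is a hypothetical list of lines the user entered):
try:
    with open('test.txt', 'r') as text_file:
        next_number = max(
            int(m.group(1))
            for x in text_file.read().splitlines()
            if (m := re.match(r'^\s*#(\d+)\s*$', x))
        ) + 1
except (FileNotFoundError, ValueError):
    # Empty or missing file: start numbering at 1.
    next_number = 1

new_lines = ['data number {}'.format(next_number), 'data']
with open('test.txt', 'a') as text_file:
    text_file.write('#{}\n'.format(next_number))
    text_file.write('\n'.join(new_lines) + '\n')
    text_file.write('-------------\n')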

Identify lines of speech which contain words from a list using pandas

I have the following dataframe:
test = pd.DataFrame(columns = ['Line No.','Person','Speech'])
test['Person'] = ['A','B','A','B','A','B']
test['Line No.'] = [1,2,3,4,5,6]
test['Speech'] = ['hello. how was your assessment day? i heard it went very well.',
'The beginning was great and the rest of the day was kinda ok.',
'why did things go from great to ok?',
'i was positive at the beginning and went right with all my answers but then i was not feeling well.',
"that's very unfortunate. if there's anything i can help you with please let me know how.",
'Will do.']
And the following list which contains keywords:
keywords = ['hello','day','great','well','happy','right','ok','why','positive']
I would like to generate an output which shows both the speaker and the line no. associated with them each time their speech contains at least 3 words from the keywords list. I have tried iterating through each line in the dataframe to see if there were at least 3 keywords present, however my code only returns the last line. Below is the code I used:
def identify_line_numebr(dataframe, keywords:list, thresh:int=3):
    is_person = False
    keyword_match_list = []
    for index, row in dataframe.iterrows():
        if is_person == False:
            # Pulling out the speech
            line = row['Speech']
            for token in line:
                # Checking if each line of speech contains key words
                if token in keywords:
                    keyword_match_list.append(token)
            print(index, is_person, row['Line No.'], row['Person'])
            print(len(keyword_match_list))
            if len(keyword_match_list) == thresh:
                is_person == True
            else:
                break
    return {row['Person'], row['Line No.']}
The expected output for this particular case should be in a similar format:
output = [{1, 'A'},{2, 'B'},{3, 'A'},{5, 'A'}]
where the number is the Line No. whose speech contains at least 3 keywords and the letter is the person.
The problem is that you stop the iteration over the rows as soon as you find a line containing at least three keywords. Instead, you should iterate over all lines and add the person and line number to a list if the threshold count is met:
def identify_line_numbers(dataframe, keywords, thresh=3):
    person_line = []  # will contain sets of {Line No., Person}
    for line_index, line in enumerate(dataframe.Speech):
        # check if each word is in the current line
        words_in_speech = [word in line for word in keywords]
        # add person and line number to our list if the threshold count is met
        if sum(words_in_speech) >= thresh:
            person_line.append(
                {dataframe.Person[line_index], dataframe['Line No.'][line_index]}
            )
    return person_line
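For example, with the test dataframe and keywords list from the question, you would call it like this (a usage sketch; note that word in line is a substring check, so counts can differ slightly from strict word matching):
matches = identify_line_numbers(test, keywords)
print(matches)  # a list of {Line No., Person} sets, e.g. starting [{1, 'A'}, {2, 'B'}, ...]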

create list of organisms based on pattern matching of sequence to a genome

I have a dataframe with two columns: the first contains names of organisms and the second contains their sequence, which is a string of letters. I am trying to create an algorithm to see whether an organism's sequence is contained in a larger genome, also made up of a string of letters. If it is in the genome, I want to add the name of the organism to a list. So, for example, if flu's sequence is in the genome below, I want 'flu' to be added to a list.
dict_1 = {'organisms': ['flu', 'cold', 'stomach bug'],
          'seq_list': ['HTIDIJEKODKDMRM', 'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
df=pd.DataFrame(dict_1)
organisms seq_list
0 flu HTIDIJEKODKDMRM
1 cold AGGTTTEFGFGEERDDTER
2 stomach bug EGHDGGEDCGRDSGRDCFD
genome='TLTPSRDMEDHTIDIJEKODKDMRM'
This first function finds the index of a match if there is one, where p is the pattern (the organism's sequence) and t is the text (the genome). The second portion is the one I am having trouble with. I am trying to use a for loop to search each entry in the df, but if I get a match I am not sure how to reference the first column in the df to add the name to the empty list. Thank you for your help!
def naive(p, t):
    occurences = []
    for i in range(len(t) - len(p) + 1):
        match = True
        for j in range(len(p)):
            if t[i+j] != p[j]:
                match = False
                break
        if match:
            occurences.append(i)
    return occurences

Organisms_that_matched = []
for x in df:
    matches = naive(genome, x)
    if len(matches) > 0:
        #add name of organism to Organisms_that_matched list
I'm not sure if you are learning about different ways to traverse a list and apply custom logic, but you can use list comprehensions:
import pandas as pd
dict_1 = {
    'organisms': ['flu', 'cold', 'stomach bug'],
    'seq_list': ['HTIDIJEKODKDMRM', 'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
df = pd.DataFrame(dict_1)
genome = 'TLTPSRDMEDHTIDIJEKODKDMRM'
organisms_that_matched = [
    dict_1['organisms'][index]
    for index, x in enumerate(dict_1['seq_list'])
    if x in genome
]
print(organisms_that_matched)
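If you prefer to stay within the dataframe itself rather than going back to the original dict, a boolean mask over the seq_list column does the same thing; a small sketch assuming the df and genome defined above:
# Mark rows whose sequence occurs in the genome, then pull out the organism names.
mask = df['seq_list'].apply(lambda seq: seq in genome)
organisms_that_matched = df.loc[mask, 'organisms'].tolist()
print(organisms_that_matched)  # ['flu'] for the example genome above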

Join the results of two MapReduce jobs together

I am trying to join the results I get from two MapReduce jobs. The first job returns the 5 most influential papers. Below is the code for the first reducer.
import sys
import operator

current_word = None
current_count = 0
word = None
topFive = {}

# input comes from stdin
for line in sys.stdin:
    line = line.strip()
    # parse the input we got from mapper.py
    word, check = line.split('\t')
    if check != None:
        count = 1
        if current_word == word:
            current_count += count
        else:
            if current_word:
                topFive.update({current_word: current_count})
                #print(current_word, current_count)
            current_count = count
            current_word = word

if current_word == word:
    print(current_word, current_count)

t = sorted(topFive.iteritems(), key=lambda x: -x[1])[:6]
print("Top five most cited papers")
count = 1
for x in t:
    if x[0] != 'nan' and count <= 5:
        print("{0}: {1}".format(*x))
        count = count + 1
The second job finds the 5 most influential authors, and its code is more or less the same as the code above. I want to take the results from these two jobs and join them so that I can determine, for each author, the average number of citations of their 3 most influential papers. I cannot figure out how to do this; it seems I need to somehow join the results?
So far you will end up with two output directories, one for the authors and one for the papers.
Now you want to do a JOIN operation (in DB lingo) on the two files. The MapReduce way is to add a third job that performs this operation on the two output files.
JOIN operations in Hadoop are well studied. One way to do it is the reducer-side join pattern. In this pattern the mapper creates a composite key out of two subkeys (the original key plus a flag specifying whether the record came from table 0 or table 1).
Before the data reaches the reducers you need a custom partitioner that partitions on the original key only (ignoring the table flag), so that each reducer receives all the records with the same key from both tables.
Let me know if you need further clarification; I wrote this one pretty fast.
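As a rough sketch of the mapper side of such a join (not your exact job: it assumes both previous jobs wrote tab-separated key/value lines, the 'papers' substring used for tagging is hypothetical, and the environment variable name varies across Hadoop streaming versions):
#!/usr/bin/env python
import os
import sys

# Hadoop streaming exposes the current input file path in an environment
# variable (mapreduce_map_input_file on recent versions, map_input_file on
# older ones); use it to tag which dataset each record came from.
input_file = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))
tag = '0' if 'papers' in input_file else '1'

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t', 1)
    # Emit a composite key: join key plus source tag. The partitioner should
    # partition on the join key only, so both tagged record types for the
    # same key meet in the same reducer.
    print('{}\t{}\t{}'.format(key, tag, value))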

Create dictionary and see if key always has same value

If I had a file of lines starting with a number followed by some text, how could I see if the numbers are always followed by different text? For example:
0 Brucella abortus Brucellaceae
0 Brucella ceti Brucellaceae
0 Brucella canis Brucellaceae
0 Brucella ceti Brucellaceae
So here, I'd like to know that 0 is followed by 3 different "types" of text.
Ideally I could read a file into a python script that would have output something like this:
1:250
2:98
3:78
4:65
etc.
The first number would be the number of different "texts", and the number after the : would be how many numbers have this occurring.
I have the following script that calculates how many times each "text" is found under different numbers, so I'm wondering how to reverse it so that I know how many different texts each number has, and how many numbers have each count of texts. This script reads the file of numbers and "text" into a dictionary, but I'm unsure how to manipulate this dictionary to get what I want.
#!/usr/bin/env python
# Dictionary of broken-out species, genus, family
fileIn = 'usearchclusternumgenus.txt'
d = {}
with open(fileIn, "r") as f:
    for line in f:
        clu, gen, spec, fam = line.split()
        d.setdefault(clu, []).append((spec))

# Iterate through and find out how many times each key occurs
vals = {}  # A dictionary to store how often each value occurs.
for i in d.values():
    for j in set(i):  # Convert to a set to remove duplicates
        vals[j] = 1 + vals.get(j, 0)  # If we've seen this value, iterate the count;
                                      # otherwise we get the default of 0 and iterate it
#print vals

# Iterate through each possible frequency and find how many values have that count.
counts = {}  # A dictionary to store the final frequencies.
# We will iterate from 0 (which is a valid count) to the maximum count
for i in range(0, max(vals.values()) + 1):
    # Find all values that have the current frequency, count them
    # and add them to the frequency dictionary
    counts[i] = len([x for x in vals.values() if x == i])

for key in sorted(counts.keys()):
    if counts[key] > 0:
        print key, ":", counts[key]
Use a collections.defaultdict() object with a set as the factory to track different lines, then print out the sizes of the collected sets:
from collections import defaultdict

unique_clu = defaultdict(set)
with open(fileIn) as infh:
    for line in infh:
        clu, gen, spec, rest = line.split(None, 3)
        unique_clu[clu].add(spec)

for key in sorted(unique_clu):
    count = len(unique_clu[key])
    if count:
        print '{}:{}'.format(key, count)
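If you also want the distribution described in the question (how many numbers have 1 distinct text, how many have 2, and so on), a Counter over the set sizes gives it. A small follow-up sketch assuming the unique_clu dictionary from above:
from collections import Counter

# Count how many keys share each number of distinct texts.
distribution = Counter(len(specs) for specs in unique_clu.values())
for n_texts in sorted(distribution):
    print '{}:{}'.format(n_texts, distribution[n_texts])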
