Vectorizing trigrams with all possible 3-grams - Python

I'm trying to create a 3-gram model so that I can apply machine learning techniques.
Basically, I'm trying the following:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import itertools
my_array = ['worda', 'wordb']
vector = CountVectorizer(analyzer=nltk.trigrams,ngram_range=(3,3))
vector.fit_transform(my_array)
My vocabulary:
{('o', 'r', 'd'): 0,
('r', 'd', 'a'): 1,
('r', 'd', 'b'): 2,
('w', 'o', 'r'): 3}
None of my words have spaces or special characters.
So when I run this:
tr_test = vector.transform(['word1'])
print(tr_test)
print(tr_test.shape)
I get this return:
(0, 0) 1
(0, 1) 1
(0, 3) 1
(1, 4) #this is the shape
I think this is right... at least makes sense...
But I would like to represent each word with a matrix containing all 3-gram possibilities, so each word would be represented by a (1x17576) matrix.
Right now I'm using a 1x4 matrix (in this particular case), because my vocabulary is built from my data.
17576 (26^3) is the number of all 3-letter combinations in the alphabet (aaa, aab, aac, etc.).
I tried to set my vocabulary to an array with all 3-grams possibilities, like this:
from string import ascii_lowercase
#This creates an array with all 3-letter combinations
#['aaa', 'aab', 'aac', ...]
keywords = [''.join(i) for i in itertools.product(ascii_lowercase, repeat = 3)]
vector = CountVectorizer(analyzer=nltk.trigrams,ngram_range=(3,3), vocabulary=keywords)
This didn't work... Can someone figure out how to do this?
Thanks!!!

I tried to change the analyzer to 'char', and it seems to work now:
keywords = [''.join(i) for i in itertools.product(ascii_lowercase, repeat = 3)]
vector = CountVectorizer(analyzer='char', ngram_range=(3,3), vocabulary=keywords)
tr_test = vector.transform(['word1'])
print(tr_test)
And the output is:
(0, 9909) 1
(0, 15253) 1
Just as a check:
test = vector.transform(['aaa aab'])
print(test)
The output:
(0, 0) 1
(0, 1) 1
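The column numbers in the output above follow from the order of the keywords list: itertools.product emits the trigrams alphabetically, so 'aaa' is column 0, 'aab' is column 1, and so on. As a rough pure-Python sketch of that numbering (this is not how scikit-learn is implemented, and the helper name trigram_columns is made up for illustration):

```python
from itertools import product
from string import ascii_lowercase

# Same construction as the `keywords` list above: 26**3 trigrams in
# alphabetical order, so each trigram's list position is its column index.
keywords = [''.join(p) for p in product(ascii_lowercase, repeat=3)]
index_of = {gram: i for i, gram in enumerate(keywords)}


def trigram_columns(word):
    """Return {column index: count} for the 3-grams of `word` found in `keywords`.

    Hypothetical helper for illustration only.
    """
    counts = {}
    for start in range(len(word) - 2):
        gram = word[start:start + 3]
        if gram in index_of:  # grams containing digits or spaces are skipped
            col = index_of[gram]
            counts[col] = counts.get(col, 0) + 1
    return counts


print(len(keywords))               # 17576
print(trigram_columns('word1'))    # {15253: 1, 9909: 1}
print(trigram_columns('aaa aab'))  # {0: 1, 1: 1}
```

'wor' and 'ord' land in columns 15253 and 9909, matching the output above, while 'rd1' is dropped because it is not in the vocabulary.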

Related

outputting dictionary values to text file: uncoupling lists and strings

I have a dictionary called shared_double_lists which is made of 6 keys called [(0, 1), (1, 2), (1, 3), (2, 3), (0, 3), (0, 2)]. The values for all the keys are lists.
I am trying to output the values for key (0, 1) to a file. Here is my code:
output = open('test_output.txt', 'w')
counter = 0
for locus in shared_double_lists[(0, 1)]:
    for value in locus:
        output.write(str(shared_double_lists[(0, 1)][counter]))
        output.write("\t")
    output.write("\n")
    counter += 1
output.close()
This almost works, the output looks like this:
['ACmerged_contig_10464', '1259', '.', 'G', 'C', '11.7172', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41', 'GT:PL', '1/1:41,3,0']
['ACmerged_contig_10464', '1260', '.', 'A', 'T', '11.7172', '.', 'DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41', 'GT:PL', '1/1:41,3,0']
Whereas I want it to look like this:
ACmerged_contig_10464 1259 . G C 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41 GT:PL 1/1:41,3,0
ACmerged_contig_10464 1260 . A T 11.7172 . DP=1;SGB=-0.379885;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=41 GT:PL 1/1:41,3,0
i.e. not have the lines of text in list format in the file, but have each item of each list separated by a tab
You can simply join a list of strings into a single string with str.join():
my_string = '\t'.join(my_list)
'\t' joins the items with a tab, but you can use any separator you want there.
In this example:
output.write('\t'.join(shared_double_lists[(0, 1)][counter]))
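Putting the join into a complete loop might look like the sketch below. It assumes the value stored under key (0, 1) is a list of lists; the helper name write_rows, the shortened sample rows, and the temporary output directory are illustrative choices that are not part of the question.

```python
import os
import tempfile


def write_rows(rows, path):
    """Write each inner list as one tab-separated line (hypothetical helper)."""
    with open(path, 'w') as output:
        for row in rows:
            # str() guards against non-string items such as ints or floats
            output.write('\t'.join(str(item) for item in row) + '\n')


# Shortened stand-in for shared_double_lists[(0, 1)]
rows = [
    ['ACmerged_contig_10464', '1259', '.', 'G', 'C'],
    ['ACmerged_contig_10464', '1260', '.', 'A', 'T'],
]

path = os.path.join(tempfile.mkdtemp(), 'test_output.txt')
write_rows(rows, path)

with open(path) as written:
    print(written.read())
```

Iterating over the rows directly removes the need for the counter variable, and the with block closes the file even if a write fails.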

Python - Finding most occurring words in a CSV row

I want to find the most frequently occurring substrings in a CSV row, either on their own or by looking them up in a list of keywords.
I've found a way to get the top 5 most frequent words in each row of a CSV file using Python, based on existing answers, but that doesn't solve my problem. It gives me results like:
[(' Trojan.PowerShell.LNK.Gen.2', 3),
(' Suspicious ZIP!lnk', 2),
(' HEUR:Trojan-Downloader.WinLNK.Powedon.a', 2),
(' TROJ_FR.8D496570', 2),
('Trojan.PowerShell.LNK.Gen.2', 1),
(' Trojan.PowerShell.LNK.Gen.2 (B)', 1),
(' Win32.Trojan-downloader.Powedon.Lrsa', 1),
(' PowerShell.DownLoader.466', 1),
(' malware (ai score=86)', 1),
(' Probably LNKScript', 1),
(' virus.lnk.powershell.a', 1),
(' Troj/LnkPS-A', 1),
(' Trojan.LNK', 1)]
Instead, I would want something like 'Trojan', 'Downloader', 'Powershell' ... as the top results.
The matching words can be a substring of a value (cell) in the CSV, or a combination of two or more words. Can someone help fix this, either with a keyword list or without one?
Thanks!
Let my_values = ['A', 'B', 'C', 'A', 'Z', 'Z' ,'X' , 'A' ,'X','H','D' ,'A','S', 'A', 'Z'] be your list of words to count.
Create a dictionary that will store the number of occurrences of every word:
count_dict = {}
Populate the dictionary with the appropriate values:
for i in my_values:
    if count_dict.get(i) is None:  # Not in the dictionary yet, so this is the first occurrence of the value
        count_dict[i] = 1
    else:
        count_dict[i] = count_dict[i] + 1  # Seen before, so increment its count
Now sort the items of the dict by their number of occurrences:
import operator
sorted_items = sorted(count_dict.items(), key=operator.itemgetter(1), reverse=True)
Now you have your expected results!
The 3 most frequent values are:
print(sorted_items[:3])
Output:
[('A', 5), ('Z', 3), ('X', 2)]
The 2 most frequent values are:
print(sorted_items[:2])
Output:
[('A', 5), ('Z', 3)]
and so on.
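Counting whole cells like this still leaves labels such as 'Trojan.PowerShell.LNK.Gen.2' intact. One possible way to get at words like 'trojan' is to split each cell into lowercase alphabetic tokens first, or to count how many cells contain each keyword. The sketch below is only an assumption about what "word" should mean here; the function names top_tokens and keyword_counts, the regular expression, and the minimum token length are all illustrative.

```python
import re
from collections import Counter


def top_tokens(cells, n=3, min_len=3):
    """Count alphabetic tokens across all cells, ignoring very short ones."""
    counts = Counter()
    for cell in cells:
        for token in re.findall(r'[a-z]+', cell.lower()):
            if len(token) >= min_len:
                counts[token] += 1
    return counts.most_common(n)


def keyword_counts(cells, keywords):
    """Count in how many cells each keyword appears as a substring."""
    return Counter({
        kw: sum(kw.lower() in cell.lower() for cell in cells)
        for kw in keywords
    })


cells = [
    'Trojan.PowerShell.LNK.Gen.2',
    'Suspicious ZIP!lnk',
    'HEUR:Trojan-Downloader.WinLNK.Powedon.a',
    'Win32.Trojan-downloader.Powedon.Lrsa',
]

print(top_tokens(cells, n=1))                        # [('trojan', 3)]
print(keyword_counts(cells, ['lnk', 'powershell']))  # lnk: 3, powershell: 1
```

The token approach needs no keyword list but treats 'winlnk' as its own word, while the keyword approach also finds 'lnk' inside 'WinLNK'.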

Name of Algorithm(s) for Looping over a Variable Number of Independent Lists

Please excuse the vagueness of my question, I don't have any formal training in CS. I'm pretty sure a solution already exists, but I can't find an answer because I don't know what question to ask.
Essentially, I'm looking for the name of an algorithm, or group of algorithms, to find all combinations of several lists where each list contains the possibilities for a single position. That is, some function that can perform the mapping:
((a,b,c), (1,2), (z,g,h), (7)) ->
((a,1,z,7), (a,1,g,7), (a,1,h,7), (a,2,z,7), ... (c,2,h,7))
such that the result can be used to iterate over all possible combinations of the per-position lists in order.
The number of lists is variable, and the size of each list is variable and independent of the other lists. All the example solutions are missing at least one of those criteria.
It would also be awesome if anyone knew of pseudocode or example implementations for any of those algorithms, or a Python package that can handle this.
I can and have solved the problem previously, but I would like to know if there are better solutions out there before I implement my design again.
Thanks for your time.
For those curious, my previous solution looked something like what is described in this paper, which is the only example solution I've found. However, it doesn't name the problem, nor does it give me a jumping off point for further research. Here is an (untested) Python listing of my general solution:
def nloop(*args):
    lists = args
    n = len(lists) # Number of lists
    # Exit if no lists were passed
    if n <= 0:
        return
    i = [0] * n # Current index of each list
    l = [len(L) for L in lists] # Length of each list
    # Exit if any list is zero-length
    if min(l) <= 0:
        return
    while True:
        # Create and yield a tuple using the current indices
        yield tuple( lists[x][ i[x] ] for x in range(n) )
        # Increment the indices for the next loop
        # Move to the next item in the last list
        i[-1] += 1
        # Check the lists in reverse order to carry any
        # indices that have wrapped; the first index is never
        # reset, so the check below can see when it wraps
        for x in reversed(range(1, n)):
            if i[x] >= l[x]:
                i[x] = 0
                i[x-1] += 1
        # If the first list has wrapped, we're done
        if i[0] >= l[0]:
            break
You are looking for itertools.product. It is equivalent to the nested for loop of arbitrary depth that you describe.
For your particular example (with appropriate syntactic modifications):
>>> from itertools import product
>>> list(product(('a', 'b', 'c'), (1, 2), ('z', 'g', 'h'), (7,)))
[('a', 1, 'z', 7),
('a', 1, 'g', 7),
('a', 1, 'h', 7),
('a', 2, 'z', 7),
('a', 2, 'g', 7),
('a', 2, 'h', 7),
('b', 1, 'z', 7),
('b', 1, 'g', 7),
('b', 1, 'h', 7),
('b', 2, 'z', 7),
('b', 2, 'g', 7),
('b', 2, 'h', 7),
('c', 1, 'z', 7),
('c', 1, 'g', 7),
('c', 1, 'h', 7),
('c', 2, 'z', 7),
('c', 2, 'g', 7),
('c', 2, 'h', 7)]
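Because the number of lists in the question is variable, the lists will usually arrive in a container rather than as separate arguments. A small sketch, assuming they are held in a list named lists (a made-up name), unpacks them with the * operator:

```python
from itertools import product

# Any number of per-position lists, each of any length
lists = [('a', 'b', 'c'), (1, 2), ('z', 'g', 'h'), (7,)]

# product(*lists) is lazy, so combinations can be consumed one at a time
combos = list(product(*lists))

print(len(combos))  # 18, i.e. 3 * 2 * 3 * 1
print(combos[0])    # ('a', 1, 'z', 7)
print(combos[-1])   # ('c', 2, 'h', 7)
```

The operation itself is the Cartesian product of the input lists, which is a useful term to search for.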

Apriori create 3 set of word from 2 set

I'm implementing the Apriori algorithm, and at the moment I'm stuck on creating the 3-item sets.
Suppose I have a list of 2-item sets like this:
FI2 = [('a','b'),('a','c'),('a','d'),('b','d'),('b','e'),('e','f')];
My first approach was to collect all distinct elements and use itertools.combinations of size 3, which is computationally expensive and not the right approach, since the result should be derived from C2 (the 2-item sets).
The result should look like this:
C3 = [('a','b','c'),('a','b','d'),('a','c','d'),('b','d','e')]
I'm not sure how to approach this problem. I would appreciate some guidance on how to do it.
Any chance C3 is missing some values, such as ('b','e','f') and ('a','b','e')?
I'm sure it's not the best way, but it's a start:
from itertools import combinations
FI2 = [('a','b'),('a','c'),('a','d'),('b','d'),('b','e'),('e','f')]
# check if two tuples have at least one var in common
check_intersection = (lambda c: len(set(c[0]).intersection(set(c[1]))) > 0)
# run on all FI2 pairs combinations
# if two tuples have at least one var in common, a merged tuple is added
# remove the duplicates tuples from the new list
C3 = list(set([tuple(sorted(set(c[0] + c[1])))for c in combinations(FI2,2) if check_intersection(c)]))
print(C3)
#=> [('b', 'd', 'e'), ('a', 'b', 'e'), ('b', 'e', 'f'), ('a', 'b', 'd'), ('a','c','d'), ('a', 'b', 'c')]
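For comparison, the C3 listed in the question is what the usual Apriori join step gives: two sorted k-item sets are merged only when they share their first k-1 items. The prune step then drops any candidate that has a k-item subset missing from the input. The sketch below assumes every itemset is a sorted tuple; the names join_step and prune_step are illustrative.

```python
from itertools import combinations

FI2 = [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'd'), ('b', 'e'), ('e', 'f')]


def join_step(itemsets):
    """Merge pairs of sorted k-itemsets that share their first k-1 items."""
    candidates = []
    for x, y in combinations(sorted(itemsets), 2):
        if x[:-1] == y[:-1]:
            candidates.append(x + (y[-1],))
    return candidates


def prune_step(candidates, itemsets):
    """Keep candidates whose subsets of size len(c) - 1 all appear in itemsets."""
    known = set(itemsets)
    return [c for c in candidates
            if all(sub in known for sub in combinations(c, len(c) - 1))]


C3 = join_step(FI2)
print(C3)
# [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'c', 'd'), ('b', 'd', 'e')]
print(prune_step(C3, FI2))
# [('a', 'b', 'd')]
```

Only ('a', 'b', 'd') survives pruning here, because ('b', 'c'), ('c', 'd') and ('d', 'e') are not in FI2.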

How to split up a list of numbers and insert it into another list

Currently I have a file with 6 rows of numbers and each row containing 9 numbers. The point is to test each row of numbers in the file if it completes a magic square. So for example, say a row of numbers from the file is 4 3 8 9 5 1 2 7 6. The first three numbers need to be the first row in a matrix. The next three numbers need to be the second row, and same for the third.
Therefore you would need to end up with a matrix of:
[['4','3','8'],['9','5','1'],['2','7','6']]
I need to test the matrix to see if it is a valid magic square (Rows add up to 15, columns add to 15, and diagonals add to 15).
My code is currently:
def readfile(fname):
    """Return a list of lines from the file"""
    f = open(fname, 'r')
    lines = f.read()
    lines = lines.split()
    f.close()
    return lines

def assignValues(lines):
    magicSquare = []
    rows = 3
    columns = 3
    for row in range(rows):
        magicSquare.append([0] * columns)
    for row in range(len(magicSquare)):
        for column in range(len(magicSquare[row])):
            magicSquare[row][column] = lines[column]
    return magicSquare

def main():
    lines = readfile(input_fname)
    matrix = assignValues(lines)
    print(matrix)
Whenever I run my code to test it, I'm getting:
[['4', '3', '8'], ['4', '3', '8'], ['4', '3', '8']]
So as you can see I am only getting the first 3 numbers into my matrix.
Finally, my question is: how would I go about filling the rest of my matrix with the remaining 6 numbers of the line? I'm not sure if it is something I can do in my loop, if I am splitting my lines wrong, or if I am completely on the wrong track.
Thanks.
To test if each row in your input file contains magic square data you need to re-organize the code slightly. I've used a different technique to Francis to fill the matrix. It might be a bit harder to understand how zip(*[iter(seq)] * size) works, but it's a very useful pattern. Please let me know if you need an explanation for it.
My code uses a list of tuples for the matrix, rather than a list of lists, but tuples are more suitable here anyway, since the data in the matrix doesn't need to be modified. Also, I convert the input data from str into int, since you need to do arithmetic on the numbers to test if matrix is a magic square.
#! /usr/bin/env python

def make_square(seq, size):
    # list() keeps the result a list of tuples on Python 3 as well,
    # where zip() returns an iterator
    return list(zip(*[iter(seq)] * size))

def main():
    fname = 'mydata'
    size = 3
    with open(fname, 'r') as f:
        for line in f:
            nums = [int(s) for s in line.split()]
            matrix = make_square(nums, size)
            print(matrix)
            #Now call the function to test if the data in matrix
            #really is a magic square.
            #test_square(matrix)

if __name__ == '__main__':
    main()
Here's a modified version of make_square() that returns a list of lists instead of a list of tuples, but please bear in mind that a list of tuples is actually better than a list of lists if you don't need the mutability that lists give you.
def make_square(seq, size):
    square = zip(*[iter(seq)] * size)
    return [list(t) for t in square]
I suppose I should mention that there's actually only one possible 3 x 3 magic square that uses all the numbers from 1 to 9, not counting rotations and reflections. But I guess there's no harm in doing a brute-force demonstration of that fact. :)
Also, I have Python code that I wrote years ago (when I was first learning Python) which generates magic squares of size n x n for odd n >= 5. Let me know if you'd like to see it.
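The test_square() call is only mentioned in a comment above and is never defined in this answer. A minimal sketch of such a check, written for Python 3 under the hypothetical name is_magic_square, could look like this. It assumes a "normal" magic square that uses each number from 1 to n*n exactly once, so the required sum is n*(n*n + 1)/2, which is 15 for n = 3.

```python
def make_square(seq, size):
    # list() so the result can be indexed and iterated more than once
    return list(zip(*[iter(seq)] * size))


def is_magic_square(matrix):
    """Return True if matrix is a normal magic square (hypothetical helper)."""
    size = len(matrix)
    target = size * (size * size + 1) // 2  # 15 when size is 3
    values = sorted(n for row in matrix for n in row)
    if values != list(range(1, size * size + 1)):
        return False  # a number is missing, repeated, or out of range
    rows_ok = all(sum(row) == target for row in matrix)
    cols_ok = all(sum(col) == target for col in zip(*matrix))
    diag_ok = sum(matrix[i][i] for i in range(size)) == target
    anti_ok = sum(matrix[i][size - 1 - i] for i in range(size)) == target
    return rows_ok and cols_ok and diag_ok and anti_ok


print(is_magic_square(make_square([4, 3, 8, 9, 5, 1, 2, 7, 6], 3)))  # True
print(is_magic_square(make_square([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)))  # False
```

zip(*matrix) transposes the square, so the column check can reuse the same sum() test as the row check.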
zip and iterator objects
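Here's some code that briefly illustrates what the zip() and iter() functions do. It uses Python 2 syntax, where print is a statement and zip() returns a list; in Python 3, zip() returns an iterator, so wrap it in list() to see the same output.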
''' Fun with zip '''
numbers = [1, 2, 3, 4, 5, 6]
letters = ['a', 'b', 'c', 'd', 'e', 'f']
#Using zip to create a list of tuples containing pairs of elements of numbers & letters
print zip(numbers, letters)
#zip works on other iterable objects, including strings
print zip(range(1, 7), 'abcdef')
#zip can handle more than 2 iterables
print zip('abc', 'def', 'ghi', 'jkl')
#zip can be used in a for loop to process two (or more) iterables simultaneously
for n, l in zip(numbers, letters):
    print n, l
#Using zip in a list comprehension to make a list of lists
print [[l, n] for n, l in zip(numbers, letters)]
#zip stops if one of the iterables runs out of elements
print [[n, l] for n, l in zip((1, 2), letters)]
print [(n, l) for n, l in zip((3, 4), letters)]
#Turning an iterable into an iterator object using the iter function
iletters = iter(letters)
#When we take some elements from an iterator object it remembers where it's up to
#so when we take more elements from it, it continues from where it left off.
print [[n, l] for n, l in zip((1, 2, 3), iletters)]
print [(n, l) for n, l in zip((4, 5), iletters)]
#This list will just contain a single tuple because there's only 1 element left in iletters
print [(n, l) for n, l in zip((6, 7), iletters)]
#Rebuild the iletters iterator object
iletters = iter('abcdefghijkl')
#See what happens when we zip multiple copies of the same iterator object.
print zip(iletters, iletters, iletters)
#It can be convenient to put multiple copies of an iterator object into a list
iletters = iter('abcdefghijkl')
gang = [iletters] * 3
#The gang consists of 3 references to the same iterator object
print gang
#We can pass each iterator in the gang to zip as a separate argument
#by using the "splat" syntax
print zip(*gang)
#A more compact way of doing the same thing:
print zip(* [iter('abcdefghijkl')]*3)
Here's the same code running in the interactive interpreter so you can easily see the output of each statement.
>>> numbers = [1, 2, 3, 4, 5, 6]
>>> letters = ['a', 'b', 'c', 'd', 'e', 'f']
>>>
>>> #Using zip to create a list of tuples containing pairs of elements of numbers & letters
... print zip(numbers, letters)
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f')]
>>>
>>> #zip works on other iterable objects, including strings
... print zip(range(1, 7), 'abcdef')
[(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e'), (6, 'f')]
>>>
>>> #zip can handle more than 2 iterables
... print zip('abc', 'def', 'ghi', 'jkl')
[('a', 'd', 'g', 'j'), ('b', 'e', 'h', 'k'), ('c', 'f', 'i', 'l')]
>>>
>>> #zip can be used in a for loop to process two (or more) iterables simultaneously
... for n, l in zip(numbers, letters):
...     print n, l
...
1 a
2 b
3 c
4 d
5 e
6 f
>>> #Using zip in a list comprehension to make a list of lists
... print [[l, n] for n, l in zip(numbers, letters)]
[['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5], ['f', 6]]
>>>
>>> #zip stops if one of the iterables runs out of elements
... print [[n, l] for n, l in zip((1, 2), letters)]
[[1, 'a'], [2, 'b']]
>>> print [(n, l) for n, l in zip((3, 4), letters)]
[(3, 'a'), (4, 'b')]
>>>
>>> #Turning an iterable into an iterator object using the iter function
... iletters = iter(letters)
>>>
>>> #When we take some elements from an iterator object it remembers where it's up to
... #so when we take more elements from it, it continues from where it left off.
... print [[n, l] for n, l in zip((1, 2, 3), iletters)]
[[1, 'a'], [2, 'b'], [3, 'c']]
>>> print [(n, l) for n, l in zip((4, 5), iletters)]
[(4, 'd'), (5, 'e')]
>>>
>>> #This list will just contain a single tuple because there's only 1 element left in iletters
... print [(n, l) for n, l in zip((6, 7), iletters)]
[(6, 'f')]
>>>
>>> #Rebuild the iletters iterator object
... iletters = iter('abcdefghijkl')
>>>
>>> #See what happens when we zip multiple copies of the same iterator object.
... print zip(iletters, iletters, iletters)
[('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'h', 'i'), ('j', 'k', 'l')]
>>>
>>> #It can be convenient to put multiple copies of an iterator object into a list
... iletters = iter('abcdefghijkl')
>>> gang = [iletters] * 3
>>>
>>> #The gang consists of 3 references to the same iterator object
... print gang
[<iterator object at 0xb737eb8c>, <iterator object at 0xb737eb8c>, <iterator object at 0xb737eb8c>]
>>>
>>> #We can pass each iterator in the gang to zip as a separate argument
... #by using the "splat" syntax
... print zip(*gang)
[('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'h', 'i'), ('j', 'k', 'l')]
>>>
>>> #A more compact way of doing the same thing:
... print zip(* [iter('abcdefghijkl')]*3)
[('a', 'b', 'c'), ('d', 'e', 'f'), ('g', 'h', 'i'), ('j', 'k', 'l')]
>>>
It always gets only the first 3 numbers because of this line:
magicSquare[row][column] = lines[column]
Thus:
def assignValues(lines):
    rows = 3
    columns = 3
    squares = []
    # Since the input is already split, the size of 'lines' divided by 9
    # is equal to the number of rows of numbers in the file
    for line in range(len(lines) // 9):
        magicSquare = []
        for row in range(rows):
            magicSquare.append([0] * columns)
        for row in range(rows):
            for column in range(columns):
                # int() so the values can be summed later
                magicSquare[row][column] = int(lines[(9*line)+(3*row)+column])
        squares.append(magicSquare)
    return squares
Note that (3*row)+column moves 3 positions to the right for every row of the square,
and that (9*line)+(3*row)+column moves 9 positions (a whole line of the file) to the right for every line.
Once you have this, you are ready to test each matrix for a magic square:
def testMagicSquare(matrix):
    # 'matrix' is a list of 3 x 3 squares, one per line of the file
    for a in range(len(matrix)):
        test1 = 1  # stays 1 only if every row adds up to 15
        test2 = 1  # stays 1 only if every column adds up to 15
        test3 = 0  # set to 1 if both diagonals add up to 15
        for b in range(3):
            if sum(matrix[a][b]) != 15:
                test1 = 0
            if (matrix[a][0][b] + matrix[a][1][b] + matrix[a][2][b]) != 15:
                test2 = 0
        if ((matrix[a][0][0] + matrix[a][1][1] + matrix[a][2][2]) == 15 and
                (matrix[a][0][2] + matrix[a][1][1] + matrix[a][2][0]) == 15):
            test3 = 1
        if test1 > 0 and test2 > 0 and test3 > 0:
            print('line ' + str(a) + ' is a magic square')
        else:
            print('line ' + str(a) + ' is not a magic square')
