Finding a word in a matrix - Python

I have a matrix file (which Python reads as a list of lists), and I need to tell whether a word from a different file appears in that matrix in a given direction.
For example, given this matrix:
c,a,T,e
o,a,t,s
w,o,t,e
n,o,l,e
the words:
caT, cow, own, cat
and the directions:
downright (diagonal)
I expect an output:
cat:1
and for the direction:
down, right
I expect:
cow:1
own:1
cat:1
My function signature is:
def word_finder(word_list, matrix, directions):
What I find hard to do is iterate over a list of lists along a given direction, for example horizontally or diagonally :(
Thanks for the help.

There already seem to be several partial answers to your question. While it would not be the most efficient approach, with some simple parsing of the directions you could chain together the following separate solutions to come up with an answer to your problem.
Diagonal Traversal: In Python word search, searching diagonally, printing result of where word starts and ends
Linear Traversal: How to find words in a matrix - Python

Try this:
from itertools import chain
from collections import defaultdict

matrix = [['c', 'a', 'T', 'e'],
          ['o', 'a', 't', 's'],
          ['w', 'o', 't', 'e'],
          ['n', 'o', 'l', 'e']]
words = ['caT', 'cow', 'own', 'cat']

d = defaultdict(int)
for line in chain(matrix, zip(*matrix)):  # zip(*matrix) yields the columns (down direction)
    for word in words:
        if word in ''.join(line):
            d[word] += 1
The dict d then holds your expected output: {'caT': 1, 'cow': 1, 'own': 1}
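The answer above only covers the across and down directions. For the down-right (diagonal) direction the question also asks about, one way to extend it is to collect every down-right diagonal as a string as well. This is a sketch; the `diagonals` helper name is mine, not from the original answer:

```python
def diagonals(matrix):
    """Yield every down-right diagonal of a rectangular matrix as a string."""
    rows, cols = len(matrix), len(matrix[0])
    # A down-right diagonal starts at each cell of the top row and of the left column.
    starts = [(0, c) for c in range(cols)] + [(r, 0) for r in range(1, rows)]
    for r, c in starts:
        cells = []
        while r < rows and c < cols:
            cells.append(matrix[r][c])
            r, c = r + 1, c + 1
        yield ''.join(cells)

matrix = [['c', 'a', 'T', 'e'],
          ['o', 'a', 't', 's'],
          ['w', 'o', 't', 'e'],
          ['n', 'o', 'l', 'e']]
print(list(diagonals(matrix)))  # ['cate', 'ate', 'Ts', 'e', 'ool', 'wo', 'n']
```

Feeding these strings into the same counting loop gives cat: 1 for the down-right direction, matching the expected output.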

Related

generate list of tuples to import to pandas df

I have found a list of consonant clusters with the following code:
import re

list_2 = ['financial', 'disastrous', 'accuracy', 'important', 'numbers']
reg = r'[bdðfghjklmnprstvxþ]+'
d = []
largest = []
for w in list_2:
    d.append(re.findall(reg, w, re.IGNORECASE))
print(d)
[['f', 'n', 'n', 'l'], ['d', 's', 'str', 's'], ['r'], ['mp', 'rt', 'nt'], ['n', 'mb', 'rs']]
I need to get the largest consonant count for each word to import as list (of tuples) to a pandas dataframe. I have tried various things but without success.
This should give you the int you are looking for:
def largest_cluster(cons_list):
    return max(len(c) for c in cons_list)
Then you can get the tuples with:
tuples = [(w, largest_cluster(cons)) for w, cons in zip(list_2, d)]
Works fine, now to import it to pandas with column labels. That will hopefully be OK.
If the word has no consonants I get ValueError: max() arg is an empty sequence. How do I fix that?
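Since that last comment is easy to hit in practice: max() accepts a `default` keyword (Python 3.4+) that is returned when the iterable is empty, so vowel-only words can be given a cluster length of 0. A small sketch:

```python
def largest_cluster(cons_list):
    # default=0 avoids "ValueError: max() arg is an empty sequence"
    # for words that have no consonant clusters at all
    return max((len(c) for c in cons_list), default=0)

print(largest_cluster(['d', 's', 'str', 's']))  # 3
print(largest_cluster([]))                      # 0
```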

Filter words from a file based on their number of syllables

I need to identify complex words in a .txt file.
I am trying to use nltk, but I get an error that no such module exists.
Complex words are words in the text that contain more than two syllables.
I would use Pyphen. This module has a Pyphen class used for hyphenation. One of its methods, positions(), returns the number of places in a word where it can be split:
>>> from pyphen import Pyphen
>>> p = Pyphen(lang='en_US')
>>> p.positions('exclamation')
[2, 5, 7]
If the word "exclamation" can be split in three places, it has four syllables, so you just need to filter all words with more than one split place.
. . .
But I noticed you tagged this as an nltk question. I'm not experienced with NLTK myself, but the question suggested by Jules has a nice suggestion in this respect: use the cmudict corpus. It gives you a list of pronunciations of a word in American English:
>>> from nltk.corpus import cmudict
>>> d = cmudict.dict()
>>> pronunciations = d['exasperation']
>>> pronunciations
[['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']]
Luckily, our first word has only one pronunciation. It is represented as a list of strings, each one representing a phoneme:
>>> phonemes = pronunciations[0]
>>> phonemes
['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']
Note that vowel phonemes have a number at the end, indicating stress:
Vowels are marked for stress (1=primary, 2=secondary, 0=no stress). E.g.: NATURAL 1 N AE1 CH ER0 AH0 L
So, we just need to count the number of phonemes with digits at the end:
>>> vowels = [ph for ph in phonemes if ph[-1].isdigit()]
>>> vowels
['EH2', 'AE2', 'ER0', 'EY1', 'AH0']
>>> len(vowels)
5
. . .
Not sure which is the best option but I guess you can work your problem out from here.
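Putting the cmudict approach together, a filter for complex words might look like the sketch below. The `count_syllables` helper and the tiny hard-coded pronunciation dictionary are stand-ins for illustration; a real script would use `cmudict.dict()` and read the words from the .txt file:

```python
def count_syllables(phonemes):
    # In cmudict, vowel phonemes end in a stress digit (0, 1 or 2),
    # so counting them counts syllables.
    return sum(1 for ph in phonemes if ph[-1].isdigit())

# Stand-in for cmudict.dict(): word -> list of pronunciations.
d = {
    'exasperation': [['EH2', 'K', 'S', 'AE2', 'S', 'P', 'ER0', 'EY1', 'SH', 'AH0', 'N']],
    'cat': [['K', 'AE1', 'T']],
    'natural': [['N', 'AE1', 'CH', 'ER0', 'AH0', 'L']],
}

words = ['exasperation', 'cat', 'natural']
# "Complex" = more than two syllables in at least one pronunciation.
complex_words = [w for w in words
                 if any(count_syllables(p) > 2 for p in d[w])]
print(complex_words)  # ['exasperation', 'natural']
```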

Remove stop words from list using only Numpy in Python

I am working on removing stop words in Python using only numpy. The stopwords file is imported as a list. Here is what I came up with:
Method 1: I try to loop through the stop words list and remove each one from tw_line.
# loop through the stop words list, and remove each one from the split line list
for line in stopwords:
    if line in words:
        words.remove(line)
        continue
print(tw_line)
Result: NO stop words are removed.
0 my whole body feels itchy and like its on fire
Method 2: I try to check each word of the line against the stopwords list,
# loop through the line, and check with stop words; if not in stop words, add to clean_line
clean_line = []
tw_line.split(" ")
for line in tw_line:
    if line in stopwords:
        clean_line.append(line)
print(clean_line)
Result: All words are broken into characters
['m', 'y', 'w', 'h', 'o', 'l', 'e', 'b', 'o', 'd', 'y', 'f', 'e', 'e', 'l', 's', 'i', 'c', 'h', 'y', 'a', 'n', 'd', 'l', 'i', 'k', 'e', 'i', 's', 'o', 'n', 'f', 'i', 'r', 'e']
Any help?
Try applying this:
>>> str1 = "my whole body feels itchy and like its on fire"
>>> str1.split()
['my', 'whole', 'body', 'feels', 'itchy', 'and', 'like', 'its', 'on', 'fire']
>>>
And then remove words which are in stopwords. BTW, I don't see any numpy here.
You should print words, not tw_line, since words is the list you removed the stopwords from:
for line in stopwords:
    if line in words:
        words.remove(line)
print(words)
Method 2 is clearly what you want to do. However, there are some things you can improve:
As Paul Panzer stated, split doesn't work in place, so you need to do:
tw_list = tw_line.split(" ")
You could use a list comprehension rather than an explicit loop (or even a generator expression if you intend to join afterward):
clean_line = [word for word in tw_list if word not in stopwords]
I saw from your code comment that stopwords is a list. You might want to make it a set for efficiency reasons ( https://wiki.python.org/moin/TimeComplexity).
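Combining those fixes, a minimal working sketch (variable names follow the question; the `stopwords` set here is a small hard-coded sample, not the asker's imported file):

```python
tw_line = "my whole body feels itchy and like its on fire"
stopwords = {'my', 'and', 'its', 'on'}   # a set makes membership tests O(1)

tw_list = tw_line.split(" ")             # split() returns a new list
clean_line = [word for word in tw_list if word not in stopwords]
print(' '.join(clean_line))  # whole body feels itchy like fire
```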

Partial Substring Matching in Python

I'm interested in creating a program that will search for a certain string (known henceforth as string A) in a large library of other strings. Basically, if string A exists in the library it is discarded and the next string's existence is checked for within the library. The program would then give me a final list of strings that did not exist as substrings within the large library. I was able to make a program that finds EXACT matches, but I need to add a module that allows the substring search to permit partial matches; namely, one or two mismatched characters in the substring would be alright. The list of string A's (all 4^7 permutations of a, t, g, c in a 7-letter string) has difficulties with highly diverse libraries.
My initial thought was to use regex and perhaps a Hamming-distance algorithm to find all those partial matches. This first attempt allows me to put a "?" wildcard into any position (1-7) of the string A in question, but I can only get it into the first position. The wildcard then lets me search for partial matches of the particular string A. If this is the wrong way to approach the problem, I'd gladly change it up. I used fnmatch as per a suggestion on another question. This is what I have so far:
from Bio import SeqIO
import fnmatch
import random
import itertools

# Define a string-splitting generator
def split_by_n(seq, n):
    while seq:
        yield seq[:n]
        seq = seq[n:]

# Import all combinations/permutations from the fasta file, 4^7
my_combinations = []
fasta_sequences = SeqIO.parse(open("Combinations/base_combinations_7.fasta"), 'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    x = sequence.lower()
    my_combinations.append(x)

primer = "tgatgag"
final = []

# List to build wildcard permutations
wildCard = ['?']
i = list(split_by_n(primer, 1))
for letter in i:
    wildCard.append(letter)
del wildCard[1]
final.append(''.join(wildCard))

# Search for the wildcard pattern
for entry in final:
    filtered = fnmatch.filter(my_combinations, entry)
This is my desired output:
primer = "tgatgag"
['?', 'g', 'a', 't', 'g', 'a', 'g']
['t', '?', 'a', 't', 'g', 'a', 'g']
['t', 'g', '?', 't', 'g', 'a', 'g']
['t', 'g', 'a', '?', 'g', 'a', 'g']
['t', 'g', 'a', 't', '?', 'a', 'g']
['t', 'g', 'a', 't', 'g', '?', 'g']
['t', 'g', 'a', 't', 'g', 'a', '?']
['agatgag', 'tgatgag', 'cgatgag', 'ggatgag']
['taatgag', 'ttatgag', 'tcatgag', 'tgatgag']
['tgatgag', 'tgttgag', 'tgctgag', 'tggtgag']
['tgaagag', 'tgatgag', 'tgacgag', 'tgaggag']
['tgataag', 'tgattag', 'tgatcag', 'tgatgag']
['tgatgag', 'tgatgtg', 'tgatgcg', 'tgatggg']
['tgatgaa', 'tgatgat', 'tgatgac', 'tgatgag']
Here's an example solution for two-element replacement:
import itertools

primer = 'cattagc'
bases = ['a', 'c', 'g', 't']
# generator of all possible ordered index pairs
p = itertools.permutations(range(len(primer)), 2)
# list of all possible base-pair combinations
c = list(itertools.combinations_with_replacement(bases, 2))
results = []
for i1, i2 in p:
    for c1, c2 in c:
        temp = list(primer)
        temp[i1], temp[i2] = c1, c2
        results.append(''.join(temp))
This will create all possible replacements for swapping out any two elements of the original primer.
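As an alternative to generating every wildcard pattern up front, the Hamming-distance idea from the question can be applied directly: keep only the library strings within the allowed number of mismatches of the query. A sketch (the function names are mine):

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def partial_matches(query, library, max_mismatches=2):
    """Return the library entries within max_mismatches of the query."""
    return [s for s in library
            if len(s) == len(query) and hamming(query, s) <= max_mismatches]

library = ['tgatgag', 'tgatgaa', 'tcatgtg', 'aaaaaaa']
print(partial_matches('tgatgag', library))  # ['tgatgag', 'tgatgaa', 'tcatgtg']
```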

How do I get words from rows and columns of letters?

I'm doing an assignment and am stuck on this problem:
def board_contains_word(board, word):
    '''(list of list of str, str) -> bool
    Return True if and only if word appears in board.

    Precondition: board has at least one row and one column.

    >>> board_contains_word([['A', 'N', 'T', 'T'], ['X', 'S', 'O', 'B']], 'ANT')
    True
    '''
    return word in board
but I am getting False.
Thanks in advance
The python in operator works a bit differently from how you're using it. Here are some examples:
>>> 'laughter' in 'slaughter'
True
>>> 1 in [1,6,5]
True
>>> 'eta' in ['e','t','a']
False
>>> 'asd' in ['asdf','jkl;']
False
>>>
As you can see, it's got two major uses: testing to see if a string can be found in another string, and testing to see if an element can be found in an array. Also note that the two uses can't be combined.
Now, about solving your problem. You'll need some sort of loop for going through all of the rows one by one. Once you've picked out a single row, you'll need some way to join all of the array elements together. After that, you can figure out if the word is in the board.
Note: this only solves the problem of searching horizontally. Dunno if that's the whole assignment. You can adapt this method to searching vertically using the zip function.
Here's something to get you unstuck:
def board_contains_word(board, word):
    # check across
    for row in board:
        if word in ''.join(row):
            return True
    # try with board's rows and columns transposed
    for row in zip(*board):
        if word in ''.join(row):
            return True
    return False

print(board_contains_word([['A', 'N', 'T', 'T'], ['X', 'S', 'O', 'B']], 'ANT'))
print(board_contains_word([['A', 'N', 'T', 'T'], ['X', 'S', 'O', 'B']], 'TO'))
Hint: You could simplify things by using the any() function.
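Following that hint, the same function can be condensed with any() and itertools.chain; a sketch:

```python
from itertools import chain

def board_contains_word(board, word):
    # Check every row, then every column (zip(*board) transposes the board).
    return any(word in ''.join(line) for line in chain(board, zip(*board)))

board = [['A', 'N', 'T', 'T'], ['X', 'S', 'O', 'B']]
print(board_contains_word(board, 'ANT'))  # True
print(board_contains_word(board, 'TO'))   # True
print(board_contains_word(board, 'XYZ'))  # False
```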
