I'm trying to write code to help me with crossword puzzles, and I'm running into the following problems.
1. When I use the much larger text file as my word list I get no output; only the small 3-string word list works.
2. The match tests positive for the first two strings of my test word list. I need it to only test true for entire words in my word list. [ SOLVED: solution in the code below ]
lex.txt contains
dad
add
test
I call the code using the following.
./cross.py dad
[ SOLVED SOLUTION ] This works, but it is really slow.
#!/usr/bin/env python
import itertools, sys, re

sys.dont_write_bytecode = True

original_string = str(sys.argv[1])
length_of_string = len(original_string)
string_to_tuple = tuple(original_string)

with open('wordsEn.txt', 'r') as inF:
    for line in inF:
        for a in set(itertools.permutations(string_to_tuple, length_of_string)):
            joined_characters = "".join(a)
            if re.search('\\b' + joined_characters + '\\b', line):
                print joined_characters
Let's take a look at your code. You take the input string, create all possible permutations of it, and then look for these permutations in the dictionary.
The most significant speed impact, from my point of view, is that you create the permutations of the word over and over again, once for every word in your dictionary. This is very time-consuming.
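As an aside (this is not the fix recommended below, just an illustration), hoisting the permutation set out of the file loop already removes that repeated work. A minimal sketch using the original variables:

# Build the permutation set once, before touching the dictionary file.
candidates = set("".join(p) for p in itertools.permutations(original_string))

with open('wordsEn.txt', 'r') as inF:
    for line in inF:
        word = line.strip()
        if word in candidates:
            print word

Even so, generating all permutations is factorial in the word length, so it only helps up to a point.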
Besides that, you don't even need the permutations: two words can be "converted" into each other by permutation exactly when they have the same letters. So your piece of code can be reimplemented as follows:
import itertools, sys, re
import time
from collections import Counter

sys.dont_write_bytecode = True

original_string = str(sys.argv[1]).strip()
length_of_string = len(original_string)
string_to_tuple = tuple(original_string)

def original_impl():
    to_return = []
    with open('wordsEn.txt', 'r') as inF:
        for line in inF:
            for a in set(itertools.permutations(string_to_tuple, length_of_string)):
                joined_characters = "".join(a)
                if re.search('\\b' + joined_characters + '\\b', line):
                    to_return.append(joined_characters)
    return to_return

def new_impl():
    to_return = []
    stable_counter = Counter(original_string)
    with open('wordsEn.txt', 'r') as inF:
        for line in inF:
            l = line.strip()
            c = Counter(l)
            if c == stable_counter:
                to_return.append(l)
    return to_return

t1 = time.time()
result1 = original_impl()
t2 = time.time()
result2 = new_impl()
t3 = time.time()

assert result1 == result2
print "Original impl took ", t2 - t1, ", new impl took ", t3 - t2, "i.e. new impl is ", (t2 - t1) / (t3 - t2), " faster"
For a dictionary with 100 words of 8 letters, the output is:
Original impl took 42.1336319447 , new impl took 0.000784158706665 i.e. new impl is 53731.0006081 faster
The time consumed by the original implementation for 10000 records in the dictionary is unbearable.
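If you run many queries against the same word list, a further refinement (a sketch of a common technique, not part of the measurement above) is to index the file once by each word's sorted letters, so that every later lookup is a single dict access:

from collections import defaultdict

# Build the index once: sorted letters -> all words made of exactly those letters.
anagram_index = defaultdict(list)
with open('wordsEn.txt', 'r') as inF:
    for line in inF:
        word = line.strip()
        anagram_index[''.join(sorted(word))].append(word)

def lookup(query):
    # One sort of the query, then a constant-time dictionary lookup.
    return anagram_index.get(''.join(sorted(query)), [])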
Hello, I am trying to filter the bad words out of a list. The lists I feed this script are usually 5 to 10 million lines of words. I tried threading to make it fast, but after the first 20k words it gets slower and slower. Why is that? Would it be faster if I used multiprocessing instead?
I run this script on Ubuntu with 48 CPU cores and 200 GB of RAM.
from tqdm import tqdm
import queue
import threading

a = input("The List: ") + ".txt"
thr = input('Threads: ')
c = input("clear old[y]: ")
inputQueue = queue.Queue()

if c == 'y' or c == 'Y':  # clear the old output file
    open("goodWord.txt", 'w').close()

s = ["bad_word"]  # bad words list

class myclass:
    def dem(self, my_word):
        for key in s:
            if key in my_word:
                return 1
        return 0

    def chk(self):
        while 1:
            old = open("goodWord.txt", "r", encoding='utf-8', errors='ignore').readlines()
            my_word = inputQueue.get()
            if my_word not in old:
                rez = self.dem(my_word)
                if rez == 0:
                    sav = open("goodWord.txt", "a+")
                    sav.write(my_word + "\n")
                    sav.close()
                    self.pbar.update(1)
            else:
                self.pbar.update(1)
            inputQueue.task_done()

    def run_thread(self):
        for y in tqdm(open(a, 'r', encoding='utf-8', errors='ignore').readlines()):
            inputQueue.put(y)
        tqdm.write("All in the Queue")
        self.pbar = tqdm(total=inputQueue.qsize(), unit_divisor=1000)
        for x in range(int(thr)):
            t = threading.Thread(target=self.chk)
            t.setDaemon(True)
            t.start()
        inputQueue.join()

try:
    open("goodWord.txt", "a")
except:
    open("goodWord.txt", "w")

old = open("goodWord.txt", "r", encoding='utf-8', errors='ignore').readlines()

myclass = myclass()
myclass.run_thread()
For the sake of curiosity and education, I wrote a virtually identical (in function) program:
import pathlib
from tqdm import tqdm

# check_words_file_path = pathlib.Path(input("Enter the path of the file which contains the words to check: "))
check_words_file_path = pathlib.Path("/Users/****/Documents/Projects/AdHoc/resources/temp/check_words.txt")
good_words_file_path = pathlib.Path("/Users/****/Documents/Projects/AdHoc/resources/temp/good_words.txt")

bad_words = {"abadword", "anotherbadword"}

# load the list of good words
with open(good_words_file_path) as good_words_file:
    stripped_lines = (line.rstrip() for line in good_words_file)
    good_words = set(stripped_line for stripped_line in stripped_lines if stripped_line)

# check each word to see if it is one of the bad words
# if it isn't, add it to the good words
with open(check_words_file_path) as check_words_file:
    for curr_word in tqdm(check_words_file):
        curr_word = curr_word.rstrip()
        if curr_word not in bad_words:
            good_words.add(curr_word)

# write the new/expanded list of good words back to file
with open(good_words_file_path, "w") as good_words_file:
    for good_word in good_words:
        good_words_file.write(good_word + "\n")
It is based on my understanding of the original program, which, as I already mentioned, I find far too complex.
I hope that this one is clearer, and it is almost certainly much faster. In fact, this might be fast enough that there is no need to consider things like multiprocessing.
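One difference worth noting: the original dem method treats the bad-word list as substrings (if key in my_word), while the set lookup above is an exact match. If substring matching is really what's wanted, a hedged sketch of the same single-pass idea (reusing check_words_file_path, bad_words and good_words from the version above) could look like this:

from tqdm import tqdm

def is_bad(word, bad_words):
    # True if any bad word occurs anywhere inside `word` (substring match).
    return any(bad in word for bad in bad_words)

# Drop-in replacement for the checking loop above; names are assumed from that code.
with open(check_words_file_path) as check_words_file:
    for curr_word in tqdm(check_words_file):
        curr_word = curr_word.rstrip()
        if curr_word and not is_bad(curr_word, bad_words):
            good_words.add(curr_word)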
This is my code below. I would like to write new columns into my original CSV; the columns should contain the values of each dictionary created during my code, and for the last dictionary, since it contains 3 values, I would like each value inserted into its own column. The code that writes to the CSV is at the end, but maybe there is a way to write the values each time I produce a new dictionary.
My code for the CSV route: I cannot figure out how to add columns without deleting the content of the original file.
# -*- coding: UTF-8 -*-
import codecs
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd

try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")

from collections import defaultdict

# load the sentiment lexicon
pickle_in = open("dict_pickle", "rb")
dico_lexique = pickle.load(pickle_in)

# extract the "verbatim" column
d_verbatim = {}
with open(sys.argv[1], 'r', encoding='cp1252') as csv_file:
    csv_file.readline()
    for line in csv_file:
        token = line.split(';')
        try:
            d_verbatim[token[0]] = token[1]
        except:
            print(line)
#print(d_verbatim)

# tagging with TreeTagger
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
d_tag = {}
for key, val in d_verbatim.items():
    newvalues = tagger.tag_text(val)
    d_tag[key] = newvalues
#print(d_tag)

# lemmatisation
d_lemma = defaultdict(list)
for k, v in d_tag.items():
    for p in v:
        parts = p.split('\t')
        try:
            if parts[2] == '':
                d_lemma[k].append(parts[0])
            else:
                d_lemma[k].append(parts[2])
        except:
            print(parts)
#print(d_lemma)

stopWords = set(stopwords.words('french'))
d_filtered_words = {k: [w for w in l if w not in stopWords and w.isalpha()] for k, l in d_lemma.items()}
print(d_filtered_words)

d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        if word in dico_lexique:
            print(word, dico_lexique[word])
Your edit seemed to make things worse; you've ended up deleting a lot of relevant context. I think I've pieced together what you are trying to do: the core of it seems to be a routine that performs sentiment analysis on text.
I'd start by creating a class that keeps track of this, e.g.:
class Sentiment:
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )
This will allow you to replace convoluted bits of code like [a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])] with score += sentiment in the function below, and lets us refer to the various values by name.
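For instance (hypothetical values, just to show what the += shorthand does):

score = Sentiment()
score += Sentiment(positive=1)   # += falls back to __add__ and rebinds score
score += Sentiment(negative=2)
print(score)                     # <Sentiment 1 0 2>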
I'd then suggest preprocessing your pickled data, so you don't have to convert things to ints in the middle of unrelated code, e.g.:
with open("dict_pickle", "rb") as fd:
dico_lexique = {}
for word, (pos, neu, neg) in pickle.load(fd):
dico_lexique[word] = Sentiment(int(pos), int(neu), int(neg))
This puts them directly into the above class and seems to match up with other constraints in your code, but I don't have your data, so I can't check.
After pulling apart all your comprehensions and loops, we are left with a single, fairly clean routine for processing a single piece of text:
def process_text(text):
    """process the specified text

    returns (words, filtered words, total sentiment score)
    """
    words = []
    filtered = []
    score = Sentiment()
    for tag in make_tags(tagger.tag_text(text)):
        word = tag.lemma
        words.append(word)
        if word not in stopWords and word.isalpha():
            filtered.append(word)
            sentiment = dico_lexique.get(word)
            if sentiment is not None:
                score += sentiment
    return words, filtered, score
We can then put this into a loop that reads rows from the input and writes them to an output file:
filename = sys.argv[1]
tempname = filename + '~'

with open(filename) as fdin, open(tempname, 'w') as fdout:
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')

    # get the header, and blindly append our column names
    header = next(inp)
    out.writerow(header + [
        'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
    ])

    for row in inp:
        # assume that the second item contains the text we want to process
        words, filtered, score = process_text(row[1])
        extra_values = [
            words, filtered,
            score.positive, score.neutral, score.negative,
        ]
        # add the values and write out
        assert len(row) == len(header), "code needed to pad the columns out"
        out.writerow(row + extra_values)

# only replace the original if everything succeeds
os.rename(tempname, filename)
We write out to a different file and only rename it on success, which means that a crash won't leave a partially written file behind. That said, I'd discourage working like this and tend to make my scripts read from stdin and write to stdout, so that I can run:
$ python script.py < input.csv > output.csv
when all is OK, but also lets me run as:
$ head input.csv | python script.py
if I just want to test with the first few lines of input, or:
$ python script.py < input.csv | less
if I want to check the output as it's generated.
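For illustration, switching the loop above to that style only changes how the reader and writer are built; a sketch under the same assumptions (process_text and the extra column names from above):

import csv
import sys

# Same processing as above, but reading from stdin and writing to stdout.
inp = csv.reader(sys.stdin, delimiter=';')
out = csv.writer(sys.stdout, delimiter=';')

header = next(inp)
out.writerow(header + [
    'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
])

for row in inp:
    words, filtered, score = process_text(row[1])
    out.writerow(row + [words, filtered, score.positive, score.neutral, score.negative])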
Note that none of this code has been run, so there are probably bugs in it, but laid out like this I can actually see what the code is trying to do. Comprehensions and 'functional'-style code are great, but they can easily become unreadable if you're not careful.
Hey everyone, I know this has been asked a couple of times here already, but I am having a hard time finding document frequency using Python. I am trying to find TF-IDF and then compute the cosine scores between the documents and a query, but I am stuck at finding document frequency. This is what I have so far:
# imports
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

# command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

# read in the directory of the files
path = sys.argv[1]

# read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

# counts the total number of documents in the directory
doccounter = len(glob.glob1(path, "*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    # this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_IDF = re.findall(r'\w+', open(filename).read().lower())
        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]
        word_IDF = doc_IDF

    # pseudocode!!
    """
    for key in word_idf:
        if key in word_idf:
            word_idf[key] += 1
        else:
            word_idf[key] = 1
    print word_IDF
    """

    # goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words_TF = re.findall(r'\w+', open(filename).read().lower())
        # scans each document for words greater than or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]

        # this assigns values to each term; this is my TF for each vector
        TFvec = Counter(doc_TF)

        # weighting the TF with a log function
        for key in TFvec:
            TFvec[key] = 1 + math.log10(TFvec[key])

        # placed here so I don't get a command line full of text
        print TFvec

# error checker
else:
    print "That path does not exist"
I am using Python 2, and so far I don't really have any idea how to count how many documents a term appears in. I can find the total number of documents, but I am stuck on finding the number of documents each term appears in. My plan was to build one large dictionary holding all of the terms from all of the documents, which could be consulted later when a query needs those terms. Thank you for any help you can give me.
DF for a term x is the number of documents in which x appears. In order to find that, you need to iterate over all documents first. Only then can you compute IDF from DF.
You can use a dictionary for counting DF:
1. Iterate over all documents.
2. For each document, retrieve the set of its words (without repetitions).
3. Increase the DF count for each word from step 2. That way the count goes up by exactly one, regardless of how many times the word occurred in the document.
Python code could look like this:
from collections import defaultdict
import math

DF = defaultdict(int)
for filename in glob.glob(os.path.join(path, '*.txt')):
    words = re.findall(r'\w+', open(filename).read().lower())
    for word in set(words):
        if len(word) >= 3 and word.isalpha():
            DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part.

# Now you can compute IDF.
IDF = dict()
for word in DF:
    IDF[word] = math.log(doccounter / float(DF[word]))  # Don't forget that Python 2 uses integer division.
P.S. It's good for learning to implement things manually, but if you ever get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).
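Since the question also mentions cosine scores, here is a hedged sketch (not part of the answer above; tfidf and cosine are hypothetical helpers) of how the TFvec and IDF values could be combined and compared against the query vector:

import math

def tfidf(tf_counts, idf):
    # Weight each term's (log-scaled) TF by its IDF; terms missing from IDF get 0.
    return {term: tf * idf.get(term, 0.0) for term, tf in tf_counts.items()}

def cosine(vec_a, vec_b):
    # Standard cosine similarity between two sparse term -> weight dicts.
    dot = sum(vec_a[t] * vec_b[t] for t in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Example usage, assuming TFvec (one document) and Query_vec from the code above:
# score = cosine(tfidf(TFvec, IDF), tfidf(Query_vec, IDF))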
I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:
ABCDEF
would return 2, as we have both ABCDE and BCDEF, but
AAAAAA
would return 1. My code:
control_var = 0
f = open("input.txt", "r")
list_of_substrings = []

while f.read(5) != "":
    f.seek(control_var)
    aux = f.read(5)
    if aux not in list_of_substrings:
        list_of_substrings.append(aux)
    control_var += 1

f.close()
print len(list_of_substrings)
Would another approach be faster (instead of comparing the strings directly from the file)?
Depending on what your definition of a legal substring is, here is a possible solution:
import re

regex = re.compile(r'(?=(\w{5}))')

with open('input.txt', 'r') as fh:
    input = fh.read()

print len(set(re.findall(regex, input)))
Of course, you may replace \w with whatever you see fit to qualify as a legal character in your substring. [A-Za-z0-9], for example will match all alphanumeric characters.
Here is an execution example:
>>> import re
>>> regex = re.compile(r'(?=(\w{5}))')
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])
EDIT: Following your comment above that all characters in the file are valid except the last one (which is \n), it seems there is no real need for regular expressions here and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):
import timeit
import re

FILE_NAME = r'input.txt'

def re_approach():
    return len(set(re.findall(r'(?=(.{5}))', input[:-1])))

def iter_approach():
    return len(set(input[i:i+5] for i in xrange(len(input) - 5)))

with open(FILE_NAME, 'r') as fh:
    input = fh.read()

# verify that the output of both approaches is identical
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set(input[i:i+5] for i in xrange(len(input) - 5))

print timeit.repeat(stmt=re_approach, number=500)
print timeit.repeat(stmt=iter_approach, number=500)
15MB doesn't sound like a lot. Something like this probably would work fine:
from collections import Counter
import re

contents = open('input.txt', 'r').read()
# the lookahead makes the matches overlap, so every 5-character window is counted
counter = Counter(re.findall(r'(?=(.{5}))', contents))
print len(counter)
Update
I think user590028 gave a great solution, but here is another option:
contents = open('input.txt', 'r').read()
print set(contents[start:start+5] for start in range(0, len(contents) - 4))
# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()
You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.
So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.
Reading all at once is more I/O efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:
fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
    key = buf[x:x+5]
    fives[key] = 1

# count the distinct substrings
print len(fives)
I was trying to create a Python program that reads the FASTA file "seqs.fa" and sorts the sequences by name.
The Fasta file looks like this:
>seqA - human
GCTGACGTGGTGAAGTCAC
>seqC - gorilla
GATGACAA
GATGAAGTCAG
>seqB - chimp
GATGACATGGTGAAGTAAC
My program looks like this:
import sys
inFile = open(sys.argv[1], 'r')
a = inFile.readlines()
a.sort()
seq = ''.join(a[0:])
seq = seq.replace('\n', "\n")
print seq
The expected result:
>seqA - human
GCTGACGTGGTGAAGTCAC
>seqB - chimp
GATGACATGGTGAAGTAAC
>seqC - gorilla
GATGACAAGATGAAGTCAG
My result:
>seqA - human
>seqB - chimp
>seqC - gorilla
GATGACAA
GATGAAGTCAG
GATGACATGGTGAAGTAAC
GCTGACGTGGTGAAGTCAC
The last four lines are the gorilla, chimp, and human sequences, with the gorilla sequence split over the first two lines.
Can anyone give me some tips on how to sort it or a way to fix the problem?
Don't implement a FASTA reader yourself! As in most cases, some smart people have already done this for you. Use BioPython, for example. Like this:
from Bio import SeqIO

handle = open("seqs.fa", "rU")
l = SeqIO.parse(handle, "fasta")
sortedList = sorted(l, key=lambda x: x.id)
for s in sortedList:
    print s.description
    print str(s.seq)
There are some problems with your code. The main one is that in the list returned by readlines() your descriptions and sequences are all separate lines, so when you sort the list, they are detached from each other. Also, all descriptions go before sequences because they have '>' in the beginning.
Second, a[0:] is the same as a.
Third, seq.replace('\n', "\n") won't do anything. Single and double quotes mean the same thing. You replace a newline character with itself.
Reading fasta files is not a very complex task for Python, but still I hope I'll be excused for offering to use the package I work on - pyteomics.
Here's the code I'd use:
In [1]: from pyteomics import fasta

In [2]: with fasta.read('/tmp/seqs.fa') as f:
   ...:     fasta.write(sorted(f))
   ...:
>seqA - human
GCTGACGTGGTGAAGTCAC
>seqB - chimp
GATGACATGGTGAAGTAAC
>seqC - gorilla
GATGACAAGATGAAGTCAG
To save this to a new file, give its name to fasta.write as argument:
fasta.write(sorted(f), 'newfile.fa')
Generally, pyteomics.fasta is for protein sequences, not DNA, but it does the job. Maybe you can use the fact that it returns descriptions and sequences in tuples.
file = open("seqs.fa")
a = file.readlines()
i = 0
ar = []
while True:
l1=file.readline()
l2=file.readline()
if not (l1 and l2):
break;
l = l1.strip('\n') + '////////' + l2
ar.append(l)
ar = ar.sort()
for l in ar:
l1 = l.split('////////')[0]+'\n'
print l1
l2 = l.split('////////')[1]
print l2
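Note that the snippet above assumes every record is exactly two lines, which is not true for the gorilla sequence in the question. A plain-Python sketch (no external libraries, hypothetical variable names) that also joins multi-line sequences:

records = {}
with open("seqs.fa") as f:
    header = None
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('>'):
            header = line
            records[header] = []
        elif header is not None:
            records[header].append(line)

# Sort by the header line and print each record with its sequence joined.
for header in sorted(records):
    print header
    print ''.join(records[header])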