Counting keyword in the file - python

I am trying to count the keywords in a .py file but the code I wrote is also counting keywords which occur in strings.
How can I differentiate between actual keywords and the ones in strings? For example: is, with and in are keywords, but you can also spot those in comments and user input strings. This is what I have tried:
from collections import Counter
import keyword
count = {}
scode = input("Enter the name of Python source code file name :")
with open(scode,'r') as f:
for line in f:
words = line.split()
for i in words:
if(keyword.iskeyword(i)):
count[i]= count.get(i,0)+1
print(count)

You can use ast.parse to parse the code, create a ast.NodeTransformer subclass to clear all the string nodes (no need to clear comments because comments are automatically ignored by ast.parse already), install the astunparse package to turn the node back to source code, and then count the keywords:
import ast
import astunparse
import keyword
import re
class clear_strings(ast.NodeTransformer):
def visit_Str(self, node):
node.s = ''
return node
n = ast.parse('''
a = 'True'
assert False
# [[] for _ in range(9)]
"""if"""
''')
clear_strings().visit(n)
print(sum(map(keyword.iskeyword, re.findall(r'\w+', astunparse.unparse(n)))))
This outputs: 2 (because only assert and False are counted as keywords)

Related

Is there any way to obtain a random word from PyEnchant?

Is there a way to obtain a random word from PyEnchant's dictionaries?
I've tried doing the following:
enchant.Dict("<language>").keys() #Type 'Dict' has no attribute 'keys'
list(enchant.Dict("<language>")) #Type 'Dict' is not iterable
I've also tried looking into the module to see where it gets its wordlist from but haven't had any success.
Using the separate "Random-Words" module is a workaround, but as it doesn't follow the same wordlist as PyEnchant, not all words will match. It is also quite a slow method. It is, however, the best alternative I've found so far.
Your question really got me curious so I thought of some way to make a random word using enchant.
import enchant
import random
import string
# here I am getting hold of all the letters
letters = string.ascii_lowercase
# crating a string with a random length with random letters
word = "".join([random.choice(letters) for _ in range(random.randint(3, 8))])
d = enchant.Dict("en_US")
# using the `enchant` to suggest a word based on the random string we provide
random_word = d.suggest(word)
Sometimes the suggest method will not return any suggestion so you will need to make a loop to check if random_word has any value.
With the help of #furas this question has been resolved.
Using the dict-en text file in furas' PyWordle, I wrote a short code that filters out invalid words in pyenchant's wordlist.
import enchant
wordlist = enchant.Dict("en_US")
baseWordlist = open("dict-en.txt", "r")
lines = baseWordlist.readlines()
baseWordlist.close()
newWordlist = open("dict-en_NEW.txt", "w") #write to new text file
for line in lines:
word = line.strip("\n")
if wordList.check(word) == True: #if word exists in pyenchant's dictionary
print(line + " is valid.")
newWordlist.write(line)
else:
print(line + " is invalid.")
newWordlist.close()
Afterwards, calling the text file will allow you to gather the information in that line.
validWords = open("dict-en_NEW", "r")
wordList = validWords.readlines()
myWord = wordList[<line>]
#<line> can be any int (max is .txt length), either a chosen one or a random one.
#this will return the word located at line <line>.

multiple modification to a list at once

I have a text file of some ip's and Mac's. The format of the Mac's are xxxx.xxxx.xxxx, I need to change all the MAC's to xx:xx:xx:xx:xx:xx
I am already reading the file and putting it into a list. Now I am looping through each line of the list and I need to make multiple modification. I need to remove the IP's and then change the MAC format.
The problem I am running into is that I cant seem to figure out how to do this in one shot unless I copy the list to a newlist for every modification.
How can I loop through the list once, and update each element on the list with all my modification?
count = 0
output3 = []
for line in output:
#print(line)
#removes any extra spaces between words in a string.
output[count] = (str(" ".join(line.split())))
#create a new list with just the MAC addresses
output3.append(str(output[count].split(" ")[3]))
#create a new list with MAC's using a ":"
count += 1
print(output3)
It appears you are trying to overthink the problem, so that may be where your frustration is spinning you around a bit.
First, you should always consider if you need a count variable in python. Usually you do not, and the enumerate() function is your friend here.
Second, there is no need to process data multiple times in python. You can use variables to your advantage and leverage python's expressiveness, rather than trying to hide your problem from the language.
PSA an implementation example that may help you think through your approach. Good luck on solving your harder problems, and I hope python will help you out with them!
#! /usr/bin/env python3
import re
from typing import Iterable
# non-regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: this assumes a source with '.' separators only
# reformat_mac = lambda _: ':'.join(_ for _ in _.split('.') for _ in (_[:2], _[2:]))
# regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: Only requires at least two hex digits adjacent at a time
reformat_mac = lambda _: ":".join(re.findall(r"(?i)[\da-f]{2}", _))
def generate_output3(output: Iterable[str]) -> Iterable[str]:
for line in output:
col1, col2, col3, mac, *cols = line.split()
mac = reformat_mac(mac)
yield " ".join((col1, col2, col3, mac, *cols))
if __name__ == "__main__":
output = [
"abc def ghi 1122.3344.5566",
"jklmn op qrst 11a2.33c4.55f6 uv wx yz",
"zyxwu 123 next 11a2.33c4.55f6 uv wx yz",
]
for line in generate_output3(output):
print(line)
Solution
You can use the regex (regular expression) module to extract any pattern that matches that of the
mac-ids: "xxxx:xxxx:xxxx" and then process it to produce the expected output ("xx-xx-xx-xx-xx-xx")
as shown below.
Note: I have used a dummy data file (see section: Dummy Data below) to make this answer
reproducible. It should work with your data as well.
# import re
filepath = "input.txt"
content = read_file(filepath)
mac_ids = extract_mac_ids(content, format=True) # format=False --> "xxxx:xxxx:xxxx"
print(mac_ids)
## OUTPUT:
#
# ['a0-b1-ff-33-ac-d5',
# '11-b9-33-df-55-f6',
# 'a4-d1-e7-33-ff-55',
# '66-a1-b2-f3-b9-c5']
Code: Convenience Functions
How does the regex work? see this example
def read_file(filepath: str):
"""Reads and returns the content of a file."""
with open(filepath, "r") as f:
content = f.read() # read in one attemp
return content
def format_mac_id(mac_id: str):
"""Returns a formatted mac_id.
INPUT FORMAT: "xxxxxxxxxxxx"
OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
"""
mac_id = list(mac_id)
mac_id = ''.join([ f"-{v}" if (i % 2 == 0) else v for i, v in enumerate(mac_id)])[1:]
return mac_id
def extract_mac_ids(content: str, format: bool=True):
"""Extracts and returns a list of formatted mac_ids after.
INPUT FORMAT: "xxxx:xxxx:xxxx"
OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
"""
import re
# pattern = "(" + ':'.join([r"\w{4}"]*3) + "|" + ':'.join([r"\w{2}"]*6) + ")"
# pattern = r"(\w{4}:\w{4}:\w{4}|\w{2}:\w{2}:\w{2}:\w{2}:\w{2}:\w{2})"
pattern = r"(\w{4}:\w{4}:\w{4})"
pat = re.compile(pattern)
mac_ids = pat.findall(content) # returns a list of all mac-ids
# Replaces the ":" with "" and then formats
# each mac-id as: "xx-xx-xx-xx-xx-xx"
if format:
mac_ids = [format_mac_id(mac_id.replace(":", "")) for mac_id in mac_ids]
return mac_ids
Dummy Data
The following code block creates a dummy file with some sample mac-ids.
filepath = "input.txt"
s = """
a0b1:ff33:acd5 ghwvauguvwi ybvakvi
klasilvavh; 11b9:33df:55f6
haliviv
a4d1:e733:ff55
66a1:b2f3:b9c5
"""
# Create dummy data file
with open(filepath, "w") as f:
f.write(s)

Replace "*" (asterics) in HTML file with increasing number with python

I have a HTML file that has a series of * (asterics) in it and would like to replace it with numbers starting from 0 and on until it replaces all * (asterics) with a number.
I am unsure if this is possible in python or if another methods would be better.
Edit 2
Here is a short snippet from the TXT file that I am working on
<td nowrap>4/29/2011 14.42</td>
<td align="center">*</td></tr>
I made a file just containing those lines to test out the code.
And here is the code that I am attempting to use to change the asterics:
number = 0
with open('index.txt', 'r+') as inf:
text = inf.read()
while "*" in text:
print "I am in the loop"
text = text.replace("*", str(number), 1)
number += 1
I think that is as much detail as I can go into. Please let me know if I should just add this edit as another comment or keep it as an edit.
And thanks for all the quick responses so far~!
Use the re.sub() function, this lets you produce a new value for each replacement by using a function for the repl argument:
from itertools import count
with open('index.txt', 'r') as inf:
text = inf.read()
text = re.sub(r'\*', lambda m, c=count(): str(next(c)), text)
with open('index.txt', 'w') as outf:
outf.write(text)
The count is taken care of by itertools.count(); each time you call next() on such an object the next value in the series is produced:
>>> import re
>>> from itertools import count
>>> sample = '''\
... foo*bar
... bar**foo
... *hello*world
... '''
>>> print(re.sub(r'\*', lambda m, c=count(): str(next(c)), sample))
foo0bar
bar12foo
3hello4world
Huapito's approach would work too, albeit slowly, provided you limit the number of replacements and actually store the result of the replacement:
with open('index.txt', 'r') as inf:
text = inf.read()
while "*" in text:
text = text.replace("*", str(number), 1)
number += 1
Note the third argument to str.replace(); that tells the method to only replace the first instance of the character.
html = 'some string containing html'
new_html = list(html)
count = 0
for char in range(0, len(new_html)):
if new_html[char] == '*':
new_html[char] = count
count += 1
new_html = ''.join(new_html)
This would replace each asteric with the numbers 1 to one less than the number of asterics, in order.
You need to iterate over each char, you can write to a tempfile and then replace the original with shutil.move using itertools.count to assign a number incrementally each time you find an asterix:
from tempfile import NamedTemporaryFile
from shutil import move
from itertools import count
cn = count()
with open("in.html") as f, NamedTemporaryFile("w+",dir="",delete=False) as out:
out.writelines((ch if ch != "*" else str(next(cn))
for line in f for ch in line ))
move(out.name,"in.html")
using a test file with:
foo*bar
bar**foo
*hello*world
Will output:
foo1bar
bar23foo
4hello5world
It is possible. Have a look at the docs. You should use something like a 'while' loop and 'replace'
Example:
number=0 # the first number
while "*" in text: #repeats the following code until this is false
text = text.replace("*", str(number), maxreplace=1) # replace with 'number'
number+=1 #increase number
Use fileinput
import fileinput
with fileinput.FileInput(fileToSearch, inplace=True) as file:
number=0
for line in file:
print(line.replace("*", str(number))
number+=1

Faster operation reading file

I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:
ABCDEF
would return 2, as we have both ABCDE and BCDEF, but
AAAAAA
would return 1. My code:
control_var = 0
f=open("input.txt","r")
list_of_substrings=[]
while(f.read(5)!=""):
f.seek(control_var)
aux = f.read(5)
if(aux not in list_of_substrings):
list_of_substrings.append(aux)
control_var += 1
f.close()
print len(list_of_substrings)
Would another approach be faster (instead of comparing the strings direct from the file)?
Depending on what your definition of a legal substring is, here is a possible solution:
import re
regex = re.compile(r'(?=(\w{5}))')
with open('input.txt', 'r') as fh:
input = fh.read()
print len(set(re.findall(regex, input)))
Of course, you may replace \w with whatever you see fit to qualify as a legal character in your substring. [A-Za-z0-9], for example will match all alphanumeric characters.
Here is an execution example:
>>> import re
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])
EDIT: Following your comment above, that all character in the file are valid, excluding the last one (which is \n), it seems that there is no real need for regular expressions here and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):
import timeit
import re
FILE_NAME = r'input.txt'
def re_approach():
return len(set(re.findall(r'(?=(.{5}))', input[:-1])))
def iter_approach():
return len(set([input[i:i+5] for i in xrange(len(input[:-6]))]))
with open(FILE_NAME, 'r') as fh:
input = fh.read()
# verify that the output of both approaches is identicle
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set([input[i:i+5] for i in xrange(len(input[:-6]))])
print timeit.repeat(stmt = re_approach, number = 500)
print timeit.repeat(stmt = iter_approach, number = 500)
15MB doesn't sound like a lot. Something like this probably would work fine:
import Counter, re
contents = open('input.txt', 'r').read()
counter = Counter.Counter(re.findall('.{5}', contents))
print len(counter)
Update
I think user590028 gave a great solution, but here is another option:
contents = open('input.txt', 'r').read()
print set(contents[start:start+5] for start in range(0, len(contents) - 4))
# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()
You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.
So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.
Reading all at once is more i/o efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:
fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
key = buf[x:x+5]
fives[key] = 1
for keys in fives.keys():
print keys

Parse fasta sequence to the dictionary

I need most trivial solution to convert fasta.txt containing multiple nucleotide sequences like
>seq1
TAGATTCTGAGTTATCTCTTGCATTAGCAGGTCATCCTGGTCAAACCGCTACTGTTCCGG
CTTTCTGATAATTGATAGCATACGCTGCGAACCCACGGAAGGGGGTCGAGGACAGTGGTG
>seq2
TCCCTCTAGAGGCTCTTTACCGTGATGCTACATCTTACAGGTATTTCTGAGGCTCTTTCA
AACAGGTGCGCGTGAACAACAACCCACGGCAAACGAGTACAGTGTGTACGCCTGAGAGTA
>seq3
GGTTCCGCTCTAAGCCTCTAACTCCCGCACAGGGAAGAGATGTCGATTAACTTGCGCCCA
TAGAGCTCTGCGCGTGCGTCGAAGGCTCTTTTCGCGATATCTGTGTGGTCTCACTTTGGT
to the dictionary(name,value) object where name will be the >header and value will be assigned to corresponded sequence.
Below you can find my failed attempt do it via 2 lists (does not work for long sequence containing >1 line )
f = open('input2.txt', 'r')
list={}
names=[]
seq=[]
for line in f:
if line.startswith('>'):
names.append(line[1:-1])
elif line.startswith('A') or line.startswith('C') or line.startswith('G') or line.startswith('T'):
seq.append(line)
list = dict(zip(names, seq))
I'll be thankful if you provide me with the solution of how fix it and example how to do it via separate function.
Thanks for help,
Gleb
It is better to use biopython library
from Bio import SeqIO
input_file = open("input.fasta")
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
a simple correction to your code:
from collections import defaultdict #this will make your life simpler
f = open('input2.txt','r')
list=defaultdict(str)
name = ''
for line in f:
#if your line starts with a > then it is the name of the following sequence
if line.startswith('>'):
name = line[1:-1]
continue #this means skips to the next line
#This code is only executed if it is a sequence of bases and not a name.
list[name]+=line.strip()
UPDATE:
Since I've got a notification that this old answer was upvoted, I've decided to present what I now think is the proper solution using Python 3.7. Translation to Python 2.7 only requires removing the typing import line and the function annotations:
from collections import OrderedDict
from typing import Dict
NAME_SYMBOL = '>'
def parse_sequences(filename: str,
ordered: bool=False) -> Dict[str, str]:
"""
Parses a text file of genome sequences into a dictionary.
Arguments:
filename: str - The name of the file containing the genome info.
ordered: bool - Set this to True if you want the result to be ordered.
"""
result = OrderedDict() if ordered else {}
last_name = None
with open(filename) as sequences:
for line in sequences:
if line.startswith(NAME_SYMBOL):
last_name = line[1:-1]
result[last_name] = []
else:
result[last_name].append(line[:-1])
for name in result:
result[name] = ''.join(result[name])
return result
Now, I realize that the OP asked for the "most trivial solution", however since they are working with genome data, it seems fair to assume that each sequence could potentially be very large. In that case it makes sense to optimize a little bit by collecting the sequence lines into a list, and then to use the str.join method on those lists at the end to produce the final result.

Categories