How to Check if a RE Substitution Was Performed in Python - python

I'm trying to check whether a regular expression substitution was performed on a specific line of the opened document and, if so, increment a count variable by 1. If the count exceeds 2, I want it to stop. The code below is what I have so far.
for line in book:
    if count<=2:
        reg1 = re.sub(r'Some RE',r'Replaced with..',line)
        f.write(reg1)
        "if reg1 was Performed add to count variable by 1"

The best way to do this is to use re.subn() instead of re.sub().
re.subn() returns a tuple (new_string, number_of_changes_made), so it fits this case well:
for line in book:
    if count<=2:
        reg1, num_of_changes = re.subn(r'Some RE',r'Replaced with..',line)
        f.write(reg1)
        if num_of_changes > 0:
            count += 1
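To make the return value of re.subn() concrete, here is a small self-contained sketch. The pattern 'cat', the replacement 'dog', and the sample lines are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample data standing in for the lines of the real document.
lines = ["one cat", "no match here", "cat and cat"]

# re.subn() returns (new_string, number_of_substitutions) for each line.
results = [re.subn(r"cat", "dog", line) for line in lines]

# Count the lines that were changed, not the individual substitutions.
count = sum(1 for _, num_of_changes in results if num_of_changes > 0)

print(results)
print(count)
```

Note that this counts changed lines; summing num_of_changes instead would count every individual substitution.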

If the idea is to determine whether a substitution changed the line, it is fairly simple:
count = 0
for line in book:
    if count<=2:
        reg1 = re.sub(r'Some RE',r'Replaced with..',line)
        f.write(reg1)
        count += int(reg1 != line)  # adds 1 only when the line was changed

You can pass a function to re.sub as the replacement value. This lets you do things like the following (though a simple search-then-sub approach, while slower, would be easier to reason about):
import re
class Counter(object):
    def __init__(self, start=0):
        self.value = start
    def incr(self):
        self.value += 1
book = """This is some long text
with the text 'Some RE' appearing twice:
Some RE see?
"""
def countRepl(replacement, counter):
    # Build a replacement function that bumps the counter on every match
    def replacer(matchobject):
        counter.incr()
        return replacement
    return replacer
counter = Counter(0)
print(re.sub(r'Some RE', countRepl('Replaced with..', counter), book))
print(counter.value)
This produces the following output:
This is some long text
with the text 'Replaced with..' appearing twice:
Replaced with.. see?
2

You could compare it to the original string to see if it changed:
for line in book:
    if count<=2:
        reg1 = re.sub(r'Some RE',r'Replaced with..',line)
        f.write(reg1)
        if line != reg1:
            count += 1

subn will tell you how many substitutions were made in the line and the count parameter will limit the number of substitutions that will be attempted. Put them together and you have code that will stop after two substitutions, even if there are multiple subs on a single line.
look_count = 2
for line in book:
    reg1, sub_count = re.subn(r'Some RE', r'Replaced with..', line,count=look_count)
    f.write(reg1)
    look_count -= sub_count
    if not look_count:
        break
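One thing to be aware of: the loop above stops at the break, so any lines after the last substitution are not written. If the rest of the document should be copied unchanged, a sketch along the following lines may help. The helper name replace_first_n and the sample data are hypothetical, chosen only for illustration.

```python
import io
import re

def replace_first_n(lines, pattern, repl, limit):
    """Yield every line, performing at most `limit` substitutions in total."""
    remaining = limit
    for line in lines:
        if remaining > 0:
            # Guard needed because count=0 means "no limit" for re.subn().
            line, made = re.subn(pattern, repl, line, count=remaining)
            remaining -= made
        yield line

# In-memory stand-ins for the input document and the output file.
book = io.StringIO("a a\nb\na a a\n")
out = io.StringIO()
out.writelines(replace_first_n(book, r"a", "X", 3))
print(out.getvalue())
```

Writing the function as a generator keeps the substitution budget in one place and leaves the caller free to write the lines wherever it likes.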

Related

Questions about split() and for each loops

I am given a file that looks like this, with many more lines than I am showing.
4
5 r begin
20 wr Dark tunnel
I have created a class to handle each part of the line that I am trying to split using the split() operation. To do this I am splitting based off of spaces, but for example in the 3rd line that says "Dark tunnel" I am splitting this apart as well, but need it to read as "Dark tunnel".
The other question that I have is with the for each loop, I want to perform the same operation on each line, except for the first line that is just the number 4, where I need to multiply that by itself minus 1 (4 * (4-1))
I created a class that takes the split line and assigns each part that is split. I have also made the for each loop, but as of now it performs the same operation on every line including the first one.
class point:
    def __init__(self, val, route, title):
        self.value = val
        self.route = route
        self.title = title
I want to split the lines properly and perform a different operation on the first line than on the rest.
For split, you could do:
parts = s.split()
val, route, title = parts[0], parts[1], ' '.join(parts[2:])
For the for loop, you could do:
for index, line in enumerate(lines):
    if index == 0:
        result = int(line)*(int(line)-1)
    else:
        pass  # do something else
All together:
for index, line in enumerate(lines):
    if index == 0:
        result = int(line)*(int(line)-1)
    else:
        parts = line.split()
        val, route, title = parts[0], parts[1], ' '.join(parts[2:])
        p = point(val, route, title)
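As an alternative to splitting everything and re-joining the tail, str.split() accepts a maximum number of splits, which leaves "Dark tunnel" intact. The sketch below uses the three sample lines from the question; the class name Point and the conversion of the first field to int are assumptions made for illustration.

```python
class Point:
    """Mirrors the asker's class; the capitalised name is an assumption."""
    def __init__(self, val, route, title):
        self.value = val
        self.route = route
        self.title = title

# Sample lines from the question, already stripped of newlines.
lines = ["4", "5 r begin", "20 wr Dark tunnel"]

# The first line is handled separately: n * (n - 1).
first = int(lines[0])
result = first * (first - 1)

points = []
for line in lines[1:]:
    # Split on whitespace at most twice, so the title keeps its spaces.
    val, route, title = line.split(None, 2)
    points.append(Point(int(val), route, title))

print(result)
print([(p.value, p.route, p.title) for p in points])
```

A line with fewer than three fields would raise a ValueError in the unpacking, so real input may need a check first.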

how to use the def function

I am new to Python, so I hope you can help me out. Currently I am only using imports to get my output, but I would like to use def (a function) in this code, which counts the top 10 most frequent words, and I can't figure out how. Thanks in advance!
import collections
import re
file = open('partA', 'r')
file = file.read()
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['it', 'is']))
wordcount = collections.defaultdict(int)
"""
the next paragraph does all the counting and is the main point of difference from the original article. More on this is explained later.
"""
pattern = r"\W"
for word in file.lower().split():
    word = re.sub(pattern, '', word)
    if word not in stopwords:
        wordcount[word] += 1
to_print = int(input("How many top words do you wish to print?"))
print(f"The most common {to_print} words are:")
mc = sorted(wordcount.items(), key=lambda k_v: k_v[1], reverse=True)[:to_print]
for word, count in mc:
    print(word, ":", count)
The output:
How many top words do you wish to print?30
The most common 30 words are:
hey : 1
there : 1
this : 1
joey : 1
how : 1
going : 1
'def' is used to create a user defined function to later call in a script. For example:
def printme(text):
    print(text)
printme('Hello')
I've now created a function called 'printme' that I call later to print the string. This is obviously a pointless function, as it just does what the 'print' function does, but I hope this clears up what the purpose of 'def' is!
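Applied to the word-count script in the question, the counting logic could be wrapped in a function roughly as follows. This is only a sketch: the function name top_words, its parameters, and the sample sentence are assumptions, and it takes the text as a string instead of reading 'partA' so that it runs on its own.

```python
import collections
import re

def top_words(text, stopwords=(), n=10):
    """Return the n most frequent words in text, skipping stopwords."""
    # Lower-case, split on whitespace, and strip non-word characters.
    words = (re.sub(r"\W", "", w) for w in text.lower().split())
    counts = collections.Counter(w for w in words if w and w not in stopwords)
    return counts.most_common(n)

# Hypothetical sample text standing in for the contents of the file.
sample = "Hey there, this is Joey. Hey Joey, how is it going?"
for word, count in top_words(sample, stopwords={"it", "is"}, n=2):
    print(word, ":", count)
```

Once the logic lives in a function, the file reading and the input() prompt can stay outside it, and the same function can be called with different texts or values of n.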

Replace "*" (asterisks) in HTML file with increasing number with python

I have an HTML file that contains a series of * (asterisks), and I would like to replace them with numbers, starting from 0 and counting up until every * has been replaced.
I am unsure whether this is possible in Python or whether another method would be better.
Edit 2
Here is a short snippet from the TXT file that I am working on
<td nowrap>4/29/2011 14.42</td>
<td align="center">*</td></tr>
I made a file just containing those lines to test out the code.
And here is the code that I am attempting to use to change the asterisks:
number = 0
with open('index.txt', 'r+') as inf:
    text = inf.read()
    while "*" in text:
        print("I am in the loop")
        text = text.replace("*", str(number), 1)
        number += 1
I think that is as much detail as I can go into. Please let me know if I should just add this edit as another comment or keep it as an edit.
And thanks for all the quick responses so far~!
Use the re.sub() function; it lets you produce a new value for each replacement by passing a function as the repl argument:
import re
from itertools import count
with open('index.txt', 'r') as inf:
    text = inf.read()
text = re.sub(r'\*', lambda m, c=count(): str(next(c)), text)
with open('index.txt', 'w') as outf:
    outf.write(text)
The count is taken care of by itertools.count(); each time you call next() on such an object the next value in the series is produced:
>>> import re
>>> from itertools import count
>>> sample = '''\
... foo*bar
... bar**foo
... *hello*world
... '''
>>> print(re.sub(r'\*', lambda m, c=count(): str(next(c)), sample))
foo0bar
bar12foo
3hello4world
Huapito's approach would work too, albeit slowly, provided you limit the number of replacements and actually store the result of the replacement:
number = 0
with open('index.txt', 'r') as inf:
    text = inf.read()
while "*" in text:
    text = text.replace("*", str(number), 1)
    number += 1
Note the third argument to str.replace(); that tells the method to only replace the first instance of the character.
html = 'some string containing html'
new_html = list(html)
count = 0
for char in range(0, len(new_html)):
    if new_html[char] == '*':
        new_html[char] = str(count)  # join() below needs strings, not ints
        count += 1
new_html = ''.join(new_html)
This would replace each asterisk with the numbers 0 to one less than the number of asterisks, in order.
You need to iterate over each character. You can write to a temporary file and then replace the original with shutil.move, using itertools.count to assign an incrementing number each time you find an asterisk:
from tempfile import NamedTemporaryFile
from shutil import move
from itertools import count
cn = count()
with open("in.html") as f, NamedTemporaryFile("w+",dir="",delete=False) as out:
    out.writelines((ch if ch != "*" else str(next(cn))
                    for line in f for ch in line))
move(out.name,"in.html")
Using a test file with:
foo*bar
bar**foo
*hello*world
the output will be (count() starts at 0):
foo0bar
bar12foo
3hello4world
It is possible. Have a look at the docs. You should use something like a 'while' loop and 'replace'.
Example:
number=0 # the first number
while "*" in text: #repeats the following code until this is false
    text = text.replace("*", str(number), 1) # replace only the first '*' with 'number'
    number+=1 #increase number
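Use fileinput:
import fileinput
number=0
with fileinput.FileInput(fileToSearch, inplace=True) as file:
    for line in file:
        # Replace one asterisk at a time so that each one gets its own number
        while "*" in line:
            line = line.replace("*", str(number), 1)
            number+=1
        print(line, end="")  # the line already ends with a newline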

Python: Counting a specific set of character occurrences in lines of a file

I am struggling with a small program in Python which aims at counting the occurrences of a specific set of characters in the lines of a text file.
As an example, if I want to count '!' and '#' from the following lines
hi!
hello#gmail.com
collection!
I'd expect the following output:
!;2
#;1
So far I got a functional code, but it's inefficient and does not use the potential that Python libraries have.
I have tried using collections.Counter, with limited success. The efficiency blocker I found is that I couldn't restrict Counter.update() to a specific set of characters; all the other characters found were counted as well. I would then have to filter out the characters I am not interested in, which adds another loop...
I also considered regular expressions, but I can't see an advantage in this case.
This is the functional code I have right now (the simplest idea I could imagine), which looks for special characters in file's lines. I'd like to see if someone can come up with a neater Python-specific idea:
def count_special_chars(filename):
    special_chars = list('!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ ')
    dict_count = dict(zip(special_chars, [0] * len(special_chars)))
    with open(filename) as f:
        for passw in f:
            for c in passw:
                if c in special_chars:
                    dict_count[c] += 1
    return dict_count
thanks for checking
Why not count over the whole file at once? You should avoid looping through the string for each line of the file. Use str.count instead.
from pprint import pprint
# Better coding style: put constant out of the function
SPECIAL_CHARS = '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ '
def count_special_chars(filename):
    with open(filename) as f:
        content = f.read()
    return dict([(i, content.count(i)) for i in SPECIAL_CHARS])
pprint(count_special_chars('example.txt'))
example output:
{' ': 0,
'!': 2,
'.': 1,
'#': 1,
'[': 0,
'~': 0
# the remaining keys with a value of zero are ignored
...}
Eliminating the extra counts from collections.Counter is probably not significant either way, but if it bothers you, do it during the initial iteration:
from collections import Counter
special_chars = '''!"#$%&'()*+,-./:;<=>?#[\\]^_`{|}~ '''
found_chars = [c for c in open(yourfile).read() if c in special_chars]
counted_chars = Counter(found_chars)
There is no need to process the file contents line by line; reading the whole file at once avoids the nested loops, which add complexity to the program.
To count character occurrences in a string, first loop over the entire string once to build an occurrence dict. You can then look up the count of any character in that dict. This reduces the complexity of the program.
When building the occurrence dict, defaultdict initializes the count values for you.
A refactored version of the program is as below:
from collections import defaultdict
special_chars = list('!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ ')
dict_count = defaultdict(int)
# filename is assumed to hold the path of the file to scan
with open(filename) as f:
    for c in f.read():
        dict_count[c] += 1
for c in special_chars:
    print('{0};{1}'.format(c, dict_count[c]))
ref. defaultdict Examples: https://docs.python.org/3.4/library/collections.html#defaultdict-examples
I did something like this, which does not need collections.Counter. I used it to count all the special characters together, but you can adapt it to store the counts in a dict.
import re
def countSpecial(passwd):
    # special_chars is assumed to be defined as in the question
    specialcount = 0
    for special in set(special_chars):  # set() avoids counting a repeated character twice
        # re.escape() makes sure the character is matched literally
        length = len(re.findall(re.escape(special), passwd))
        if length > 0:
            specialcount = length + specialcount
    return specialcount
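For the Counter-based route mentioned in the question, the filtering can happen inside the generator expression that feeds the Counter, so no second loop is needed. This is a minimal sketch that uses only the two characters and three lines from the question's example, read from an in-memory file; the function name and its parameters are illustrative.

```python
from collections import Counter
import io

# Only the two characters from the example in the question.
SPECIAL_CHARS = set("!#")

def count_special_chars(lines, special_chars=SPECIAL_CHARS):
    """Count only the characters of interest, in a single pass."""
    return Counter(c for line in lines for c in line if c in special_chars)

# In-memory stand-in for the text file from the question.
sample = io.StringIO("hi!\nhello#gmail.com\ncollection!\n")
counts = count_special_chars(sample)
for char in sorted(counts):
    print("{0};{1}".format(char, counts[char]))
```

Using a set for the characters of interest keeps each membership test cheap, and characters that never occur are simply absent from the result (looking them up returns 0).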

Python: Parallely Fast Dictionary search of a given word list + List Enumeration + Do Something()

So I thought this title would produce good search results. Anyway, given the following code:
It takes one yielded word at a time from text_file_reader_gen() and iterates in a while loop until the StopIteration exception is raised (is there a better way than try/except?), and the interlock function just mixes the words up.
def wordparser():
    #word_freq={}
    word=text_file_reader_gen()
    word.next()
    wordlist=[]
    index=0
    while True: #for word in ftext:
        try:
            #print 'entered try'
            current=next(word)
            wordlist.append(current) #Keep adding new words
            #word_freq[current]=1
            if len(wordlist)>2:
                while index < len(wordlist)-1:
                    #print 'Before: len(wordlist)-1: %s || index: %s' %(len(wordlist)-1, index)
                    new_word=interlock_2(wordlist[index],wordlist[index+1]) #this can be any do_something() function, irrelevant and working fine
                    new_word2=interlock_2(wordlist[index+1],wordlist[index])
                    print new_word,new_word2
                    '''if new_word in word_freq:
                        correct_interlocked_words.append(new_word)
                    if new_word2 in word_freq:
                        correct_interlocked_words.append(new_word2)'''
                    index+=1
                    #print 'After: len(wordlist)-1: %s || index: %s' %(len(wordlist)-1, index)
            '''if w not in word_freq:
                word_freq[w]=1
            else:
                word_freq[w]=+1'''
        except StopIteration,e:
            #print 'entered except'
            #print word_freq
            break
    #return word_freq
text_file_reader_gen() code:
def text_file_reader_gen():
    path=str(raw_input('enter full file path \t:'))
    fin=open(path,'r')
    ftext=(x.strip() for x in fin)
    for word in ftext:
        yield word
Q1. Is it possible for word to be iterated and at the same time appending those word to the dictionary word_freq while at the same time enumerating over for key in word_freq where keys are words & are still being added, while the for loop runs and new words are mixed using the interlock function so that most of these iterations happen at one go- something like
while word.next() is not StopIteration:
    word_freq[ftext.next()]+=1 if ftext not in word_freq #and
    for i,j in word_freq.keys():
        new_word=interlock_2(j,wordlist[i+1])
I just wanted a very simple thing and a hash dict search, like really very fast because the txt file from where it is taking words is a-z very long, it may have duplicates as well.
Q2. Ways to improve this existing code?
Q3. Is there a way to 'for i,j in enumerate(dict.items())' so that i can reach dict[key] & dict[next_key] at the same time, although they are unordered, but that's also irrelevant.
UPDATE: After reviewing the answers here, this is what I came up with. It's working, but I have a question regarding the following code:
def text_file_reader_gen():
    path=str(raw_input('enter full file path \t:'))
    fin=open(path,'r')
    ftext=(x.strip() for x in fin)
    return ftext #yield?
def wordparser():
    wordlist=[]
    index=0
    for word in text_file_reader_gen():
works but instead if I use yield ftext, it doesn't.
Q4. What is the basic difference and why does that happen?
As far as I understand your example code, you're simply counting words. Take the following examples as ideas on which you can build on.
Q1. Yes and no. Running things in parallel is not trivial. You could use threading (GIL won't allow you true parallelism) or multiprocessing, but I don't see why you'd need to do this.
Q2. I don't understand the need for the text_file_reader_gen() function. Generators are iterators, you achieve the same thing by reading for line in file.
def word_parser():
    path = raw_input("enter full file path\t: ")
    words = {}
    with open(path, "r") as f:
        for line in f:
            for word in line.split():
                try:
                    words[word] += 1
                except KeyError:
                    words[word] = 1
    return words
The above goes through the file line by line, splits each line at whitespace and counts the word. It does not handle punctuation.
If your input files are natural language, you might want to take a look at the NLTK library. Here's another example that uses the collections library.
import collections
import string
def count_words(your_input):
    # A Counter adds counts on update(); a plain dict would overwrite them.
    result = collections.Counter()
    translate_tab = string.maketrans("","")
    with open(your_input, "r") as f:
        for line in f:
            result.update(collections.Counter(x.translate(translate_tab, string.punctuation) for x in line.split()))
    return dict(result)
# Test.txt contains 5 paragraphs of Lorem Ipsum from some online generator
In [61]: count_words("test.txt")
Out[61]:
{'Aenean': 1,
'Aliquam': 1,
'Class': 1,
'Cras': 1,
'Cum': 1,
'Curabitur': 2,
'Donec': 1,
'Duis': 1,
'Etiam': 2,
'Fusce': 1,
'In': 1,
'Integer': 1,
'Lorem': 1,
......
}
The function goes through the file line by line, creates a collections.Counter object – basically a sub-class of dict – splits each line by anything resembling whitespace, removes punctuation with string.translate and finally updates result dictionary with the Counter-dict. The Counter does all the ...counting.
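One detail worth keeping in mind when accumulating per-line counts: update() on a plain dict replaces existing values, whereas update() on a collections.Counter adds to them. The sketch below shows the difference on two made-up "lines" of words; the variable names are illustrative only.

```python
import collections

# Two hypothetical lines, already split into words.
lines = [["lorem", "ipsum", "lorem"], ["lorem", "dolor"]]

plain = {}
tally = collections.Counter()
for words in lines:
    per_line = collections.Counter(words)
    plain.update(per_line)   # dict.update replaces existing values
    tally.update(per_line)   # Counter.update adds to existing counts

print(plain["lorem"])  # count from the last line that contained the word
print(tally["lorem"])  # total across all lines
```

So for totals across a whole file, the accumulator itself should be a Counter (or the counts should be added explicitly).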
Q3. Don't know why or how you'd achieve that.
Q3. Is there a way to 'for i,j in enumerate(dict.items())' so that i can reach dict[key] & dict[next_key] at the same time
You can get the next item in the iterable. So you can write a function to pair the current item with the next
Like this:
def with_next(thing):
    prev = next(thing)
    while True:
        try:
            cur = next(thing)
        except StopIteration:
            # There's no sane next item at the end of the iterable, so
            # use None.
            yield (prev, None)
            # End the generator with return; raising StopIteration inside a
            # generator is an error on Python 3.7+ (PEP 479).
            return
        yield (prev, cur)
        prev = cur
As the comment says, it's not obvious what to do at the end of the list (where there is no "next key"), so it just yields None as the next item.
For example:
for curitem, nextitem in with_next(iter(['mouse', 'cat', 'dog', 'yay'])):
    print "%s (next: %s)" % (curitem, nextitem)
Outputs this:
mouse (next: cat)
cat (next: dog)
dog (next: yay)
yay (next: None)
It'll work for any iterable (e.g dict.iteritems(), dict.iterkeys(), enumerate etc):
mydict = {'mouse': 'squeek', 'cat': 'meow', 'dog': 'woof'}
for cur_key, next_key in with_next(mydict.iterkeys()):
    print "%s (next: %s)" % (cur_key, next_key)
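On Python 3, the same current/next pairing can also be sketched with the standard library instead of a hand-written generator. This is an alternative technique, not the answer's original code: it uses itertools.tee and itertools.zip_longest (named izip_longest on Python 2), and the function name with_next is reused here only for comparison.

```python
from itertools import tee, zip_longest

def with_next(iterable):
    """Pair each item with the one after it; the last item is paired with None."""
    current, following = tee(iterable)
    next(following, None)  # advance the second copy by one item
    return zip_longest(current, following)

pairs = list(with_next(["mouse", "cat", "dog", "yay"]))
for curitem, nextitem in pairs:
    print("%s (next: %s)" % (curitem, nextitem))
```

Because tee() accepts any iterable, this version also works directly on a dict or on enumerate(...), and an empty input simply produces no pairs.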
Regarding your update:
def text_file_reader_gen():
    path=str(raw_input('enter full file path \t:'))
    fin=open(path,'r')
    ftext=(x.strip() for x in fin)
    return ftext #yield?
Q4. What is the basic difference [between yield and return] and why does that happen?
yield and return are very different things.
return returns a value from the function, and then the function terminates.
yield turns the function into a "generator function". Instead of returning a single object and ending, a generator function outputs a series of objects, one each time a yield statement is executed.
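Applied to the update in the question, this explains why "return ftext" works while "yield ftext" does not: the latter yields the whole generator expression as a single item, so the for loop sees one object instead of the individual words. A minimal sketch, with a list of strings standing in for the open file and with hypothetical function names:

```python
def returns_generator(lines):
    # Plain function: hands back the generator expression itself.
    return (x.strip() for x in lines)

def yields_generator(lines):
    # Generator function that yields ONE item: the whole inner generator.
    yield (x.strip() for x in lines)

def yields_words(lines):
    # Generator function that yields each stripped line in turn.
    for x in lines:
        yield x.strip()

data = ["alpha\n", "beta\n"]  # stand-in for the lines of a file
a = list(returns_generator(data))
b = list(yields_generator(data))
c = list(yields_words(data))

print(a)
print(len(b))
print(c)
```

Here a and c contain the two words, while b contains a single element, which is itself the generator that would produce them.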
Here are a bunch of good pages explaining generators:
http://docs.python.org/2/tutorial/classes.html#generators
http://zetcode.com/lang/python/itergener/
What does the "yield" keyword do in Python?
http://www.python.org/dev/peps/pep-0255/
The return statement works like it does in many other programming languages. Things like the official tutorial should explain it.
