Find multiple instances of a string in a text file - python

I'm trying to find multiple instances of a string in a text file, but I can only work out a way to find the first instance. I've tried various while and for loops to no avail and I'm stuck. What is the most efficient way to do this in Python?
movinf = open("movinf.txt", "a")
match = re.search('"string":([^,]+)', name)
if match:
    result = match.group(1)
    movinf.write(result + "\n")
else:
    pass
movinf.close()

You are only running the block once, so you only get one result. Use re.findall instead:
match = re.findall('"string":([^,]+)', name)
if match:
    # end with a newline so later appends start on a fresh line
    movinf.write("\n".join(match) + "\n")
movinf.close()

You could try re.findall() on a compiled pattern:
p = re.compile('"string":([^,]+)')
print p.findall(name)
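For comparison, here is a minimal Python 3 sketch of the same idea (the thread itself uses Python 2). The contents of `name` are never shown in the question, so the sample string below is purely an assumption for illustration.

```python
import re

# Hypothetical sample input; the thread never shows what `name` contains.
name = '"string":alpha,"other":1,"string":beta,"string":gamma'

pattern = re.compile(r'"string":([^,]+)')

# findall returns every captured group, not just the first one.
matches = pattern.findall(name)
print(matches)

# finditer yields match objects, useful when the positions are needed too.
positions = [(m.start(), m.group(1)) for m in pattern.finditer(name)]
print(positions)
```

findall is enough when only the captured text matters; finditer is worth the extra line when you also need to know where each match starts.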

The following program uses simple file and list operations:
str1 = raw_input("Enter the string you want to search : ")
with open("C:\\Users\\priyank\\Desktop\\movinf.txt","r") as movinf:
    listp = movinf.readlines()
count = 0
for i in range(0, len(listp)):
    # counts lines that contain the string, not individual occurrences
    if str1 in listp[i]:
        # do something
        count = count + 1
print "number of matching lines in file :" + str(count)
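The loop above counts matching lines, so a line that contains the string twice still counts once. If every occurrence matters, a sketch along these lines (Python 3; the function name and the sample lines are made up for illustration) uses str.count per line:

```python
def count_occurrences(lines, needle):
    """Return (total occurrences, number of lines containing needle)."""
    total = 0
    lines_hit = 0
    for line in lines:
        n = line.count(needle)  # non-overlapping occurrences in this line
        total += n
        if n:
            lines_hit += 1
    return total, lines_hit


# Hypothetical sample standing in for the lines read from the file.
sample = ["spam and spam\n", "eggs\n", "more spam\n"]
result = count_occurrences(sample, "spam")
print(result)
```

Any iterable of strings works here, including an open file object, so the readlines() call is not strictly needed.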

Related

Python - line split with spaces?

I'm sure this is a basic question, but I have spent about an hour on it already and can't quite figure it out. I'm parsing smartctl output, and here is a sample of the data I'm working with:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-39-pve] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA MD04ACA500
Serial Number: Y9MYK6M4BS9K
LU WWN Device Id: 5 000039 5ebe01bc8
Firmware Version: FP2A
User Capacity: 5,000,981,078,016 bytes [5.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Jul 2 11:24:08 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
What I'm trying to achieve is pulling out the device model (for some devices it's a single word; for others, such as this one, it's two words), serial number, time, and a couple of other fields. I assume it would be easiest to capture all data after the colon, but how do I eliminate the variable amount of spaces?
Here is the relevant code I have come up with so far:
deviceModel = ""
serialNumber = ""
lines = infoMessage.split("\n")
for line in lines:
    parts = line.split()
    if str(parts):
        if parts[0] == "Device Model: ":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number: ":
            serialNumber = parts[1]
vprint(3, "Device model: %s" %deviceModel)
vprint(3, "Serial number: %s" %serialNumber)
The error I keep getting is:
File "./tester.py", line 152, in parseOutput
if parts[0] == "Device Model: ":
IndexError: list index out of range
I get what the error is saying (kinda), but I'm not sure what else the range could be, or if I'm even attempting this in the right way. Looking for guidance to get me going in the right direction. Any help is greatly appreciated.
Thanks!
The IndexError occurs when you index past the end of the list that split() returns. Here the failing expression is parts[0], so the list is empty, which happens when there is nothing to split (an empty line).
No need for regular expressions:
deviceModel = ""
serialNumber = ""
lines = infoMessage.split("\n")
for line in lines:
    if line.startswith("Device Model:"):
        deviceModel = line.split(":")[1].strip()
    elif line.startswith("Serial Number:"):
        serialNumber = line.split(":")[1].strip()
print("Device model: %s" %deviceModel)
print("Serial number: %s" %serialNumber)
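One caveat with line.split(":")[1]: a value that itself contains colons, such as the Local Time field, gets cut at its first colon. A minimal Python 3 sketch using str.partition, which splits only at the first colon, might look like this. The function name parse_info and the shortened sample are assumptions for illustration, not part of the original script.

```python
def parse_info(text):
    """Map each 'Key: value' line to a dict entry; skip lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # sep is empty when the line has no colon (blank lines, banners)
            fields[key.strip()] = value.strip()
    return fields


# Shortened, hypothetical excerpt of the smartctl output from the question.
sample = """=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MD04ACA500
Serial Number:    Y9MYK6M4BS9K

Local Time is:    Thu Jul  2 11:24:08 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""
info = parse_info(sample)
print(info["Device Model"])
print(info["Local Time is"])
```

Two things to keep in mind with this approach: repeated keys such as "SMART support is" overwrite each other, so only the last value survives, and any other line containing a colon (the copyright banner in the full output, for instance) would produce a stray entry.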
I guess your problem is the empty line in the middle. Because,
>>> '\n'.split()
[]
You can do something like,
>>> f = open('a.txt')
>>> lines = f.readlines()
>>> deviceModel = [line for line in lines if 'Device Model' in line][0].split(':')[1].strip()
# 'TOSHIBA MD04ACA500'
>>> serialNumber = [line for line in lines if 'Serial Number' in line][0].split(':')[1].strip()
# 'Y9MYK6M4BS9K'
Try using regular expressions:
import re
r = re.compile("^[^:]*:\s+(.*)$")
m = r.match("Device Model: TOSHIBA MD04ACA500")
print m.group(1) # Prints "TOSHIBA MD04ACA500"
Not sure what version you're running, but on 2.7, line.split() is splitting the line by word, so
>>> parts = line.split()
parts = ['Device', 'Model:', 'TOSHIBA', 'MD04ACA500']
You can also try line.startswith() to find the lines you want https://docs.python.org/2/library/stdtypes.html#str.startswith
The way I would debug this is by printing out parts at every iteration. Try that and show us what the list is when it fails.
Edit: Your problem is most likely what @jonrsharpe said. parts is probably an empty list when it gets to an empty line, and str(parts) will just return '[]', which is a non-empty string and therefore truthy. Try testing for that.
I think it would be far easier to use regular expressions here.
import re
for line in lines:
    # Splits the string into at most two parts
    # at the first colon which is followed by one or more spaces
    parts = re.split(':\s+', line, 1)
    if parts:
        if parts[0] == "Device Model":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number":
            serialNumber = parts[1]
Mind you, if you only care about the two fields, startswith might be better.
When you split the blank line, parts is an empty list.
You try to accommodate that by checking for an empty list, but you first convert the list to a string, and the non-empty string '[]' makes your conditional True.
>>> s = []
>>> bool(s)
False
>>> str(s)
'[]'
>>> bool(str(s))
True
>>>
Change if str(parts): to if parts:.
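Many would say that using a try/except block is idiomatic Python:
for line in lines:
    parts = line.split()
    try:
        # parts[0] raises IndexError on a blank line, which is caught below.
        # Note: split() breaks on whitespace, so parts[0] is 'Device' and
        # these comparisons never match; this only silences the error.
        if parts[0] == "Device Model: ":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number: ":
            serialNumber = parts[1]
    except IndexError:
        pass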

Python - How to make sure that a line being read from a file contain only a given string and nothing else

In order to make sure I start and stop reading a text file exactly where I want to, I am providing 'start1'<->'end1', 'start2'<->'end2' as tags in between the text file and providing that to my python script. In my script I read it as:
start_end = ['start1','end1']
line_num = []
with open(file_path) as fp1:
    for num, line in enumerate(fp1, 1):
        for i in start_end:
            if i in line:
                line_num.append(num)
fp1.close()
print '\nLine number: ', line_num
fp2 = open(file_path)
for k, line2 in enumerate(fp2):
    for x in range(line_num[0], line_num[1] - 1):
        if k == x:
            header.append(line2)
fp2.close()
This works well until I reach start10 <-> end10 and beyond. For example, it checks whether "start2" is in the line and so also matches text that contains "start21"; the same happens with the end tag. So providing "start1, end1" as input also reads "start10, end10". If I replace the line:
if i in line:
with
if i == line:
it throws an error.
How can I make sure that the script reads the line that contains ONLY "start1" and not "start10"?
import re
prog = re.compile('start1$')
if prog.match(line):
    print line
That should return None if there is no match and return a regex match object if the line matches the compiled regex. The '$' at the end of the regex says that's the end of the line, so 'start1' works but 'start10' doesn't.
Or another way:
import re
def test(line):
    prog = re.compile('start1$')
    return prog.match(line) is not None
>>> test('start1')
True
>>> test('start10')
False
Since your markers are always at the end of the line, change:
start_end = ['start1','end1']
to:
start_end = ['start1\n','end1\n']
You probably want to look into regular expressions. The Python re library has some good regex tools. It would let you define a string to compare your line to and it has the ability to check for start and end of lines.
If you can control the input file, consider adding an underscore (or any non-number character) to the end of each tag.
'start1_'<->'end1_'
'start10_'<->'end10_'
The regular expression solution presented in other answers is more elegant, but requires using regular expressions.
You can do this with find():
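for num, line in enumerate(fp1, 1):
    for i in start_end:
        pos = line.find(i)
        if pos != -1:
            # make sure the character after the tag isn't a digit
            # (slicing avoids an IndexError when the tag ends the line)
            if not line[pos + len(i):pos + len(i) + 1].isdigit():
                line_num.append(num)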
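If each tag sits on a line of its own, as the question title suggests, the substring test can be replaced by an exact comparison after stripping the trailing newline. This is a minimal Python 3 sketch under that assumption; find_tag_lines and the sample lines are hypothetical names and data.

```python
def find_tag_lines(lines, start_tag, end_tag):
    """Return 1-based line numbers of lines that consist solely of a tag."""
    found = {}
    for num, line in enumerate(lines, 1):
        stripped = line.strip()  # drops the trailing newline and any padding
        if stripped in (start_tag, end_tag):
            found[stripped] = num
    return found


# Hypothetical file contents with both start1/end1 and start10/end10 tags.
sample = ["start1\n", "a\n", "b\n", "end1\n", "start10\n", "c\n", "end10\n"]
tag_lines = find_tag_lines(sample, "start1", "end1")
print(tag_lines)
```

Comparing the stripped line also explains why `if i == line:` never matched in the original script: each line read from a file still carries its trailing newline.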

python write can take 2 arguments

I have a question about writing an "output.txt" file.
I would like to write both the word and prob (l.19) results into "output.txt".
When I write "model_file.write(word, prob)", the terminal complains with
"TypeError: function takes exactly 1 argument (2 given)".
I tried adding more arguments, but that didn't work.
Could anybody help me with this?
This is the word-count script:
import sys
counts = {}
total_count = 0
train_file = open(sys.argv[1],"r")
for line in train_file:
    words = line.strip().split(" ")
    words.append("</s>")
    for word in words:
        counts[word] = counts.get(word, 0) + 1
        total_count = total_count + 1
model_file = open('output.txt',"w")
for word, count in sorted(counts.items(),reverse=True):
    prob = counts[word]*1.0/total_count
    print "%s --> %f" % (word, prob)
# this is the call that raises the TypeError
model_file.write(word, prob)
model_file.close()
Just simply replace
model_file.write(word, prob)
with
model_file.write(word+' '+str(prob)+'\n')
Be aware that write() takes exactly one string argument, so you have to convert prob to a string (with str()) and concatenate it with word using the + operator, so that you pass a single string.
P.S.: Though you didn't ask this, if you are going to write each word and its probability, you should put model_file.write(word+' '+str(prob)+'\n') inside the for statement. If you insist on calling it outside the for statement for some reason, then you should assign word and prob outside the for statement too; otherwise it can cause another error (for example, a NameError if the loop body never runs).
You could use the print statement to do this:
print >>model_file, word, prob
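In Python 3 the same two options look slightly different. The sketch below is only an illustration: the counts dictionary is made-up data, and io.StringIO stands in for the real output file so the result can be inspected.

```python
import io

# Hypothetical counts standing in for the data built from the training file.
counts = {"a": 3, "b": 1}
total_count = sum(counts.values())

model_file = io.StringIO()  # stands in for open('output.txt', 'w')
for word, count in sorted(counts.items(), reverse=True):
    prob = count / total_count
    # write() takes exactly one string, so build the whole line first.
    model_file.write("%s %f\n" % (word, prob))
    # print(word, prob, file=model_file) is the Python 3 counterpart
    # of the Python 2 form print >>model_file, word, prob.

output = model_file.getvalue()
print(output)
```

Formatting the line with % (or str.format) keeps control over the number of decimals, which plain str(prob) does not.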
I wanted to create a kind of description of my df, so I wrote this:
# Open the text file in append mode (it is created if missing)
f = open(os.path.join(pathlib.Path().absolute(),'folder','florder','name.txt'), "a")
# Write a kind of header
f.write('text'+'\n')
f.write('text'+'\n')
f.write("""
-------------------
""")
f.write('text:'+ '\n')
f.write("""
""")
for c in range(0, len(df.columns)):
    campo = df.columns[c]
    if df[df.columns[c]].dtype== 'object':
        text= 'Tex'
    outfile = open('name.txt','w')
    f.write('str:'+"'"+str(xxx)+"'"'\n')
    f.write('str:'+ str(str)+'\n')
    f.write('\n')
f.close()

Python RegEx Woes

I'm not sure why this isn't working:
import re
import csv
def check(q, s):
    match = re.search(r'%s' % q, s, re.IGNORECASE)
    if match:
        return True
    else:
        return False
tstr = []
# test strings
tstr.append('testthisisnotworking')
tstr.append('This is a TEsT')
tstr.append('This is a TEST mon!')
f = open('testwords.txt', 'rU')
reader = csv.reader(f)
for type, term, exp in reader:
    for i in range(2):
        if check(exp, tstr[i]):
            print exp + " hit on " + tstr[i]
        else:
            print exp + " did NOT hit on " + tstr[i]
f.close()
testwords.txt contains this line:
blah, blah, test
So essentially 'test' is the RegEx pattern. Nothing complex, just a simple word. Here's the output:
test did NOT hit on testthisisnotworking
test hit on This is a TEsT
test hit on This is a TEST mon!
Why does it NOT hit on the first string? I also tried \s*test\s* with no luck. Help?
The csv module by default returns blank spaces around words in the input (this can be changed by using a different "dialect"). So exp contains " test" with a leading space.
A quick way to fix this would be to add:
exp = exp.strip()
after you read from the CSV file.
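As an alternative to stripping each field by hand, csv.reader accepts a skipinitialspace option that drops whitespace immediately following the delimiter. A minimal Python 3 sketch, with io.StringIO standing in for testwords.txt:

```python
import csv
import io

# Same single line as testwords.txt in the question.
data = io.StringIO("blah, blah, test\n")

default_rows = list(csv.reader(data))
data.seek(0)  # rewind so the second reader sees the same content
trimmed_rows = list(csv.reader(data, skipinitialspace=True))

print(default_rows)
print(trimmed_rows)
```

Note that skipinitialspace only removes whitespace after the delimiter; trailing whitespace before the next comma is kept, so strip() is still the more thorough fix.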
Adding a print repr(exp) at the top of the first for loop shows that exp is ' test'; note the leading space.
This isn't that surprising, since csv.reader() splits on commas and keeps the whitespace that follows them. Try changing your code to the following:
for type, term, exp in reader:
    exp = exp.strip()
    for s in tstr:
        if check(exp, s):
            print exp + " hit on " + s
        else:
            print exp + " did NOT hit on " + s
Note that in addition to the strip() call, which removes the leading and trailing whitespace, I changed your second for loop to loop directly over the strings in tstr instead of over a range. There was actually a bug in your code: tstr contains three values, but you only checked the first two, because for i in range(2) only gives you i=0 and i=1.

Python RegEx nested search and replace

I need to do a RegEx search and replace of all commas found inside quote blocks.
i.e.
"thing1,blah","thing2,blah","thing3,blah",thing4
needs to become
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
my code:
inFile = open(inFileName,'r')
inFileRl = inFile.readlines()
inFile.close()
p = re.compile(r'["]([^"]*)["]')
for line in inFileRl:
    pg = p.search(line)
    # found comment block
    if pg:
        q = re.compile(r'[^\\],')
        # found comma within comment block
        qg = q.search(pg.group(0))
        if qg:
            # Here I want to reconstitute the line and print it with the replaced text
            #print re.sub(r'([^\\])\,',r'\1\,',pg.group(0))
I need to filter only the columns I want based on a RegEx, filter further,
then do the RegEx replace, then reconstitute the line back.
How can I do this in Python?
The csv module is perfect for parsing data like this as csv.reader in the default dialect ignores quoted commas. csv.writer reinserts the quotes due to the presence of commas. I used StringIO to give a file like interface to a string.
import csv
import StringIO
s = '''"thing1,blah","thing2,blah","thing3,blah"
"thing4,blah","thing5,blah","thing6,blah"'''
source = StringIO.StringIO(s)
dest = StringIO.StringIO()
rdr = csv.reader(source)
wtr = csv.writer(dest)
for row in rdr:
    wtr.writerow([item.replace('\\,',',').replace(',','\\,') for item in row])
print dest.getvalue()
result:
"thing1\,blah","thing2\,blah","thing3\,blah"
"thing4\,blah","thing5\,blah","thing6\,blah"
General Edit
There was
"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4
in the question, and now it is not there anymore.
Moreover, I hadn't noticed r'[^\\],'.
So I have completely rewritten my answer.
"thing1,blah","thing2,blah","thing3,blah",thing4
and
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
being displays of strings (I suppose)
import re
ss = '"thing1,blah","thing2,blah","thing3\,blah",thing4 '
regx = re.compile('"[^"]*"')
# ri matches a comma that is not already preceded by a backslash
def repl(mat, ri = re.compile('(?<!\\\\),') ):
    # replace each such comma with a backslash followed by the comma
    return ri.sub('\\\\,',mat.group())
print ss
print repr(ss)
print
print regx.sub(repl, ss)
print repr(regx.sub(repl, ss))
result
"thing1,blah","thing2,blah","thing3\,blah",thing4
'"thing1,blah","thing2,blah","thing3\\,blah",thing4 '
"thing1\,blah","thing2\,blah","thing3\,blah",thing4
'"thing1\\,blah","thing2\\,blah","thing3\\,blah",thing4 '
You can try this regex.
>>> re.sub('(?<!"),(?!")', r"\\,",
...        '"thing1,blah","thing2,blah","thing3,blah",thing4')
# Gives "thing1\,blah","thing2\,blah","thing3\,blah",thing4
The logic behind this is to substitute a , with \, only if it is neither immediately preceded nor immediately followed by a ". Note that this also escapes commas between unquoted fields and commas that are already escaped.
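The lookaround pattern only inspects the characters next to each comma, so it cannot tell whether a comma actually sits inside a quoted block. A sketch that first isolates each quoted block and then escapes commas inside it could look like this (Python 3). The names below are made up for illustration, and it assumes quoted blocks never contain an escaped double quote.

```python
import re

QUOTED = re.compile(r'"[^"]*"')        # one double-quoted block
BARE_COMMA = re.compile(r'(?<!\\),')   # a comma not already escaped


def escape_quoted_commas(line):
    """Escape commas only inside double-quoted blocks."""
    return QUOTED.sub(lambda m: BARE_COMMA.sub(r'\\,', m.group()), line)


# Hypothetical input: one plain block, one already-escaped block,
# and two unquoted fields separated by a comma.
line = '"thing1,blah","thing2\\,blah",thing4,thing5'
escaped = escape_quoted_commas(line)
print(escaped)
```

With this input the comma between thing4 and thing5 is left alone and the already-escaped comma is not escaped a second time, which is where this version differs from the single lookaround substitution.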
I came up with an iterative solution using several regex functions:
finditer(), findall(), group(), start() and end()
There's a way to turn all this into a recursive function that calls itself.
Any takers?
outfile = open(outfileName,'w')
p = re.compile(r'["]([^"]*)["]')
q = re.compile(r'([^\\])(,)')
for line in outfileRl:
    pg = p.finditer(line)
    pglen = len(p.findall(line))
    if pglen > 0:
        mpgstart = 0;
        mpgend = 0;
        for i,mpg in enumerate(pg):
            if i == 0:
                outfile.write(line[:mpg.start()])
            qg = q.finditer(mpg.group(0))
            qglen = len(q.findall(mpg.group(0)))
            if i > 0 and i < pglen:
                outfile.write(line[mpgend:mpg.start()])
            if qglen > 0:
                for j,mqg in enumerate(qg):
                    if j == 0:
                        outfile.write( mpg.group(0)[:mqg.start()] )
                    outfile.write( re.sub(r'([^\\])(,)',r'\1\\\2',mqg.group(0)) )
                    if j == (qglen-1):
                        outfile.write( mpg.group(0)[mqg.end():] )
            else:
                outfile.write(mpg.group(0))
            if i == (pglen-1):
                outfile.write(line[mpg.end():])
            mpgstart = mpg.start()
            mpgend = mpg.end()
    else:
        outfile.write(line)
outfile.close()
Have you looked into str.replace()?
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old
replaced by new. If the optional argument count is given, only the
first count occurrences are replaced.
Here is some documentation.
Hope this helps.
