How to count & print specific strings from a .txt file in python?

I'm having some trouble with the output I am receiving on this problem. Basically, I have a text file (https://www.py4e.com/code3/mbox.txt) and I am attempting to first have python print how many email addresses are found in it and then print each of those addresses on subsequent lines. A sample of my output is looking like this:
Received: (from apache#localhost)
There were 22003 email addresses in mbox.txt
for source#collab.sakaiproject.org; Thu, 18 Oct 2007 11:31:49 -0400
There were 22004 email addresses in mbox.txt
X-Authentication-Warning: nakamura.uits.iupui.edu: apache set sender to zach.thomas#txstate.edu using -f
There were 22005 email addresses in mbox.txt
What am I doing wrong here? Here's my code
fhand = open('mbox.txt')
count = 0
for line in fhand:
    line = line.rstrip()
    if '#' in line:
        count = count + 1
        print('There were', count, 'email addresses in mbox.txt')
    if '#' in line:
        print(line)

The following modifies your code to use a regular expression to find emails in text lines.
import re
# Pattern for email
# (see https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/)
pattern = re.compile(r'\S+#\S+')
with open('mbox.txt') as fhand:
    emails = []
    for line in fhand:
        # Detect all emails in line using regex pattern
        found_emails = pattern.findall(line)
        if found_emails:
            emails.extend(found_emails)
    print('There were', len(emails), 'email addresses in mbox.txt')
    if emails:
        print(*emails, sep="\n")
Output
There were 44018 email addresses in mbox.txt
stephen.marquard#uct.ac.za
<postmaster#collab.sakaiproject.org>
<200801051412.m05ECIaH010327#nakamura.uits.iupui.edu>
<source#collab.sakaiproject.org>;
<source#collab.sakaiproject.org>;
<source#collab.sakaiproject.org>;
apache#localhost)
source#collab.sakaiproject.org;
stephen.marquard#uct.ac.za
source#collab.sakaiproject.org
....
....
...etc...

Can you make it clearer what your expected output is compared to your actual output?
You have two if '#' in line statements that should be combined; there's no reason to ask the same question twice.
You count the number of lines that contain a # symbol and then, for each such line, print the current count.
If you want to only print the count once, then put it outside (after) your for loop.
If you want to print the email addresses and not the whole lines that contain them, then you'll need to do some more string processing to extract the email from the line.
Don't forget to close your file when you've finished with it.
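The points above can be sketched as follows. This is a minimal sketch rather than the asker's final code: the helper name find_addresses and the in-memory sample lines are assumptions for illustration, and the pattern matches on '#' only because that is how addresses appear in the samples on this page.

```python
import re

# Assumption: addresses look like the samples shown above, with '#' as the separator.
ADDRESS = re.compile(r'[\w.+-]+#[\w.-]+')

def find_addresses(lines):
    """Return every address-like token found in an iterable of text lines."""
    found = []
    for line in lines:
        found.extend(ADDRESS.findall(line))
    return found

# Hypothetical sample; with a real file, pass the open file object instead:
#     with open('mbox.txt') as fhand:   # 'with' closes the file for you
#         addresses = find_addresses(fhand)
sample = [
    "Received: (from apache#localhost)",
    "for source#collab.sakaiproject.org; Thu, 18 Oct 2007 11:31:49 -0400",
    "Date: Thu, 18 Oct 2007 11:31:49 -0400",
]
addresses = find_addresses(sample)

# Print the count once, after the loop has finished, then one address per line.
print('There were', len(addresses), 'email addresses')
for address in addresses:
    print(address)
```

Collecting the addresses first and printing afterwards keeps the count line from being repeated for every match.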

Related

Beginner: Index Error - What's wrong with this code?

As mentioned, I am a beginner and am trying to do short exercises. Unfortunately my online tutor is not able to or unwilling to help me with this (keeps suggesting other ways of doing it).
My task is to check if the first word for the line is 'From ' in which case I need to print the next word (the email address).
For example the file has series of lines like the following
From stephen.marquard#uct.ac.za Sat Jan 5 09:14:16 2008
Return-Path: <postmaster#collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
From louis#media.berkeley.edu Fri Jan 4 18:10:48 2008
Return-Path: <postmaster#collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.97])
The code should result with the following output:
stephen.marquard#uct.ac.za
louis#media.berkeley.edu
I have written the following code to do this:
fname = "mbox-short.txt"
f = open(fname,'r')
lines = f.readlines()
i = 0
count = len(lines)
while i < count :
    test = lines[i].split()
    if test[0] == "From " :
        print(test[1])
    i += 1
I keep getting the following error:
Traceback (most recent call last):
File "C:\Users\38775\Desktop\py4e\Project 2\email.py", line 10, in <module>
if test[0] == "From " :
IndexError: list index out of range
I just want to understand why this is happening and how I can correct it. Please don't spend time suggesting alternative approaches.
Thanks

An IndexError indicates that you're trying to access some part of a list which doesn't exist (for example, trying to find the 5th value of [1, 2, 3]).
It would be good to know a small part of the contents of your file or an example input and your desired output so we can figure out exactly what's going wrong.

So here is a modification of what you have that works:
fname = "mbox-short.txt"
f = open(fname,'r')
lines = f.readlines()
i = 0
count = len(lines)
while i < count :
    test = lines[i].strip().split()
    # A blank line splits into an empty list, so check it before indexing
    if test and test[0] == "From":
        print(test[1])
    i += 1
".split()" breaks the line on whitespace, so a line such as "From email#email.com Sat Jan 5 09:14:16 2008" becomes a list that begins
['From', 'email#email.com', ...].
Now if test[0] == "From" (if the first word is From) you can print test[1], which is the second word (the email).
Comparing with "From " (with a trailing space) was a mistake: ".split()" removes the spaces, so no item in the list can ever end with one.
Hope this helps!

Thanks everyone!
Turns out the issue was that there were some blank lines in the file, so I needed to nest an if statement to skip them and keep the loop moving.
Thanks!
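For anyone landing here with the same error, the blank-line check can be sketched like this. It is only an illustration under stated assumptions: the function name first_senders and the in-memory sample are hypothetical, and the sample lines are taken from the question above.

```python
def first_senders(lines):
    """Return the second word of each line whose first word is 'From'."""
    senders = []
    for line in lines:
        words = line.split()      # a blank line gives an empty list
        if not words:             # so skip it before touching words[0]
            continue
        if words[0] == "From" and len(words) > 1:
            senders.append(words[1])
    return senders

# Hypothetical sample that includes a blank line, like the real file.
sample = [
    "From stephen.marquard#uct.ac.za Sat Jan 5 09:14:16 2008\n",
    "Return-Path: <postmaster#collab.sakaiproject.org>\n",
    "\n",
    "From louis#media.berkeley.edu Fri Jan 4 18:10:48 2008\n",
]
print(first_senders(sample))
```

The guard matters because "".split() and "\n".split() both return [], and indexing an empty list is what raises the IndexError.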

How to extract a part of data, that we get from website using url.open()

I wrote a program that connects to this website:
http://mbox.dr-chuck.net/sakai.devel/1/2
I need to parse the page and get the email addresses from it.
import urllib
url = 'http://mbox.dr-chuck.net/sakai.devel/1/2'
data = urllib.urlopen(url).read()
for line in data:
    templine = line.strip()
    print templine
But it prints individual letters instead of words.
For example, when I try to print a particular line from it, I get:
F
r
o
m
n
e
w
s
How can I fix this? Please help me.
I need my program to print whole lines.
Sorry about my language; this is my first question here.

If you are using python3, you can do something like this:
from urllib.request import urlopen
data = urlopen("http://mbox.dr-chuck.net/sakai.devel/1/2").read().decode("utf8").split("\n")
for k in data:
    print(k)
Update:
If you want to print only the second line from the given url, you can do something like this:
print(data[1])
>>> 'From: "Glenn R. Golden" <ggolden#umich.edu>'
Otherwise, if you want to print all the lines which start with From or From:, you can do something like this:
for k in data:
    if k.split(" ")[0] == "From" or k.split(" ")[0] == "From:":
        print(k)
Output:
From news#gmane.org Tue Mar 04 03:33:20 200
From: "Glenn R. Golden" <ggolden#umich.edu>

import urllib  # Python 2; in Python 3 use urllib.request.urlopen and decode the bytes
url = 'http://mbox.dr-chuck.net/sakai.devel/1/2'
data = urllib.urlopen(url).readlines()
for line in data:
    if line.startswith('From'):
        print (line)
out:
From news#gmane.org Tue Mar 04 03:33:20 2003
From: "Glenn R. Golden" <ggolden#umich.edu>
use readlines() to get each line in the file
use startswith() to get line which starts with From
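The underlying cause in the question is that read() returns one long string, and looping over a string visits one character at a time. A minimal Python 3 sketch of the fix follows; the function name lines_starting_with is hypothetical, and the raw bytes are an invented stand-in for what urlopen(url).read() would return, so that the example runs without a network connection.

```python
def lines_starting_with(raw_bytes, prefix="From"):
    """Decode a downloaded payload and keep the lines that start with prefix."""
    text = raw_bytes.decode("utf-8")
    return [line for line in text.splitlines() if line.startswith(prefix)]

# Hypothetical stand-in for urlopen(url).read(); the first and last lines are invented.
raw = (
    b"X-Note: example\n"
    b"From news#gmane.org Tue Mar 04 03:33:20 2003\n"
    b'From: "Glenn R. Golden" <ggolden#umich.edu>\n'
    b"Subject: hello\n"
)

print(list("From"))  # looping over a string visits single characters
for line in lines_starting_with(raw):
    print(line)
```

splitlines() is used instead of split("\n") so that a trailing newline does not leave an empty last item.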

Python - line split with spaces?

I'm sure this is a basic question, but I have spent about an hour on it already and can't quite figure it out. I'm parsing smartctl output, and here is a sample of the data I'm working with:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-39-pve] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA MD04ACA500
Serial Number: Y9MYK6M4BS9K
LU WWN Device Id: 5 000039 5ebe01bc8
Firmware Version: FP2A
User Capacity: 5,000,981,078,016 bytes [5.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Jul 2 11:24:08 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
What I'm trying to achieve is pulling out the device model (some devices it's just one string, other devices, such as this one, it's two words), serial number, time, and a couple other fields. I assume it would be easiest to capture all data after the colon, but how to eliminate the variable amounts of spaces?
Here is the relevant code I currently came up with:
deviceModel = ""
serialNumber = ""
lines = infoMessage.split("\n")
for line in lines:
    parts = line.split()
    if str(parts):
        if parts[0] == "Device Model: ":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number: ":
            serialNumber = parts[1]
vprint(3, "Device model: %s" %deviceModel)
vprint(3, "Serial number: %s" %serialNumber)
The error I keep getting is:
File "./tester.py", line 152, in parseOutput
if parts[0] == "Device Model: ":
IndexError: list index out of range
I get what the error is saying (kinda), but I'm not sure what else the range could be, or if I'm even attempting this in the right way. Looking for guidance to get me going in the right direction. Any help is greatly appreciated.
Thanks!

The IndexError occurs when the split returns a list of length one or zero and you access the second element. This happens when it isn't finding anything to split (empty line).
No need for regular expressions:
deviceModel = ""
serialNumber = ""
lines = infoMessage.split("\n")
for line in lines:
    if line.startswith("Device Model:"):
        deviceModel = line.split(":")[1].strip()
    elif line.startswith("Serial Number:"):
        serialNumber = line.split(":")[1].strip()
print("Device model: %s" %deviceModel)
print("Serial number: %s" %serialNumber)
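If more fields are needed (the question also mentions the time), the same idea can be generalised. The sketch below is one possible approach, not part of the answer above: parse_info is a hypothetical helper, and it uses str.partition so that only the first colon splits the line, which keeps values such as a time of day intact.

```python
def parse_info(text):
    """Map each 'Key: value' line to a dict entry, splitting at the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # lines without a colon (including blank lines) are ignored
            # setdefault keeps the first value when a key repeats
            fields.setdefault(key.strip(), value.strip())
    return fields

# Hypothetical excerpt based on the smartctl sample in the question.
sample = """=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MD04ACA500
Serial Number:    Y9MYK6M4BS9K

Local Time is:    Thu Jul 2 11:24:08 2015 EDT
"""

info = parse_info(sample)
print(info["Device Model"])
print(info["Local Time is"])
```

strip() removes the variable run of spaces after the colon, so no special handling of the padding is needed.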

I guess your problem is the empty line in the middle. Because,
>>> '\n'.split()
[]
You can do something like,
>>> f = open('a.txt')
>>> lines = f.readlines()
>>> deviceModel = [line for line in lines if 'Device Model' in line][0].split(':')[1].strip()
# 'TOSHIBA MD04ACA500'
>>> serialNumber = [line for line in lines if 'Serial Number' in line][0].split(':')[1].strip()
# 'Y9MYK6M4BS9K'

Try using regular expressions:
import re
r = re.compile(r"^[^:]*:\s+(.*)$")
m = r.match("Device Model: TOSHIBA MD04ACA500")
print(m.group(1)) # Prints "TOSHIBA MD04ACA500"

Not sure what version you're running, but on 2.7, line.split() is splitting the line by word, so
>>> parts = line.split()
>>> parts
['Device', 'Model:', 'TOSHIBA', 'MD04ACA500']
You can also try line.startswith() to find the lines you want https://docs.python.org/2/library/stdtypes.html#str.startswith

The way I would debug this is by printing out parts at every iteration. Try that and show us what the list is when it fails.
Edit: Your problem is most likely what @jonrsharpe said. parts is probably an empty list when it gets to an empty line, and str(parts) will just return '[]', which is a non-empty string and therefore evaluates as True. Try to test that.

I think it would be far easier to use regular expressions here.
import re
for line in lines:
    # Splits the string into at most two parts
    # at the first colon which is followed by one or more spaces
    parts = re.split(r':\s+', line, maxsplit=1)
    # re.split always returns at least one item, so check that a value exists
    if len(parts) == 2:
        if parts[0] == "Device Model":
            deviceModel = parts[1]
        elif parts[0] == "Serial Number":
            serialNumber = parts[1]
Mind you, if you only care about the two fields, startswith might be better.

When you split the blank line, parts is an empty list.
You try to accommodate that by checking for an empty list, but you convert the list to a string first, and the string '[]' is not empty, so your conditional is always True.
>>> s = []
>>> bool(s)
False
>>> str(s)
'[]'
>>> bool(str(s))
True
>>>
Change if str(parts): to if parts:.
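Many would say that using a try/except block would be idiomatic for Python:
for line in lines:
    parts = line.split()
    try:
        # split() yields single words, so compare word by word
        if parts[0] == "Device" and parts[1] == "Model:":
            deviceModel = " ".join(parts[2:])
        elif parts[0] == "Serial" and parts[1] == "Number:":
            serialNumber = " ".join(parts[2:])
    except IndexError:
        # Blank or one-word line: nothing to extract
        pass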

Merging Words into a Line

I am currently using Python v2.6 and trying to merge words into a line. My code is supposed to read data from a text file in which each line holds two strings (which I call row 1 and row 2 data). It takes the second string from every line; these are the words of sentences, and the sentences are separated by delimiter strings, like this:
Inside the .txt:
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
"delimiter_string"
"row_1_data" "row_2_data"
"row_1_data" "row_2_data"
...
Those "row_2_data" will add-up to a sentence later. Sorry for the long introduction btw.
Here is my code:
import sys
import re
newLine = ''
for line in sys.stdin:
    word = line.split(' ')[1]
    if word == '<S>+BSTag':
        continue
    elif word == '</S>+ESTag':
        print newLine
        newLine = ''
        continue
    else:
        w = re.sub('\[.*?]', '', word)
        if newLine == '':
            newLine += w
        else:
            newLine += ' ' + w
"BSTag" is the tag for "Sentence Begins" and "ESTag" is for "Sentence Ends": the so called "delimiters". "re.sub" is used for a special purpose and it works as far as I checked.
The problem is that, when I execute this python script from the command line in linux with the following command: $ cat file.txt | script.py | less, I can not see any output, but just a blank file.
For those who are not familiar with linux, I guess the problem has nothing to do with terminal execution, thus you can neglect that part. Simply, the code does not work as intended and I can not find a single mistake.
Any help will be appreciated, and thanks for reading the long post :)
Ok, the problem is solved, which was actually a corpus error instead of a coding one. A very odd entry was detected in the text file, which was causing problems. Removing it solved it. You can use both of these approaches: mine and the one presented by "snurre" if you want a similar text processing.
Cheers.

import sys

def foo(lines):
    output = []
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        # Use the second column when present, otherwise the only word on the line
        word = words[1] if len(words) > 1 else words[0]
        if word == '</S>+ESTag':
            yield ' '.join(output)
            output = []
        elif word != '<S>+BSTag':
            output.append(word)

for sentence in foo(sys.stdin):
    print sentence
Your regex is a little funky. From what I can tell, it's replacing anything between (and including) a pair of [ and ] with '', so it ends up printing empty strings.
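To see what the substitution actually does, here is a small Python 3 sketch that combines the generator approach with the question's re.sub call. Everything in it is illustrative: merge_sentences is a hypothetical name, and the sample lines (including the bracketed annotations) are invented, since the question does not show real data.

```python
import re

def merge_sentences(lines):
    """Yield one joined sentence per <S>...</S> block of two-column lines."""
    words = []
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue                      # blank or malformed line
        word = columns[1]
        if word == '<S>+BSTag':
            words = []                    # sentence begins
        elif word == '</S>+ESTag':
            yield ' '.join(words)         # sentence ends
            words = []
        else:
            # Drop any [bracketed] annotation, as the question's re.sub does.
            words.append(re.sub(r'\[.*?\]', '', word))

# Invented sample in the two-column layout described in the question.
sample = [
    "<S> <S>+BSTag",
    "The the[Det]",
    "cat cat[Noun]",
    "</S> </S>+ESTag",
]
for sentence in merge_sentences(sample):
    print(sentence)
```

The pattern removes only the bracketed part, so a word is emptied only when the whole token sits inside brackets.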

I think the problem is that the script isn't being executed (unless you just excluded the shebang in the code you posted)
Try this
cat file.txt | python script.py | less

Python - go to two lines above match

In a text file like this:
First Name last name #
secone name
Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
First Name last name #
....same as above...
I need to match the string 'Work Phone:', then go two lines up and insert the character '|' at the beginning of that line. So the pseudocode would be:
if "Work Phone:" in line:
    go up two lines:
        write | + line
write rest of the lines.
The file is about 10 MB and there are about 1000 paragraphs like this.
Then I need to write it to another file. So the desired result would be:
First Name last name #
secone name
|Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
thanks for any help.

This solution doesn't read whole file into memory
p=""
q=""
for line in open("file"):
    line=line.rstrip()
    if "Work Phone" in line:
        p="|"+p
    if p: print p
    p,q=q,line
print p
print q
output
$ python test.py
First Name last name #
secone name
|Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:

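You can use this regex
((?:.*\n){2})(Work Phone:)
and replace the matches with
|\1\2
The outer group matters: a repeated group such as (.*\n){2} only keeps its last repetition, so it would drop the first of the two lines.
You don't even need Python; you can do such a thing in any modern text editor, like Vim.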

Something like this?
lines = text.splitlines()
for i, line in enumerate(lines):
    if 'Work Phone:' in line:
        lines[i-2] = '|' + lines[i-2]
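A slightly fuller version of the list-based idea, as a sketch under stated assumptions: mark_two_above is a hypothetical helper, the sample is the paragraph from the question, and a 10 MB file is assumed to fit comfortably in memory. The i >= 2 check is there because a negative index such as lines[-1] would silently wrap around to the end of the list.

```python
def mark_two_above(lines, needle="Work Phone:", marker="|"):
    """Return a copy of lines with marker prefixed two lines above each match."""
    result = list(lines)
    for i, line in enumerate(result):
        # Skip matches in the first two lines: i - 2 would wrap to the list's end.
        if needle in line and i >= 2:
            result[i - 2] = marker + result[i - 2]
    return result

# Sample paragraph from the question.
sample = [
    "First Name last name #",
    "secone name",
    "Address Line 1",
    "Address Line 2",
    "Work Phone:",
    "Home Phone:",
    "Status:",
]
marked = mark_two_above(sample)
print("\n".join(marked))

# To write the result to another file (file name is hypothetical):
#     with open("output.txt", "w") as out:
#         out.write("\n".join(marked) + "\n")
```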
