Dictionary to count email addresses from a maillog file in Python [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Experts, I am trying to count the email addresses and the number of their repetitions in a maillog file. I am able to do this with re.search() or re.match(), but I would like to accomplish it with re.findall(), which I am currently dabbling with. I would appreciate any suggestions.
1) Current code:
# cat maillcount31.py
#!/usr/bin/python
import re

counts = {}
fmt = " %-32s %-15s"
log = open('kkmail', 'r')
for line in log.readlines():
    myre = re.search(r'.*from=<(.*)>,\ssize', line)
    if myre:
        name = myre.group(1)
        if name not in counts:
            counts[name] = 0
        counts[name] += 1
for key in counts:
    print fmt % (key, counts[key])
2) Output from the current code:
# python maillcount31.py
root@MyServer1.myinc.com 13
User01@MyServer1.myinc.com 14

Hope this helps.
import re
from collections import Counter

with open('kkmail') as log:
    emails = re.findall(r'.*from=<(.*)>,\ssize', log.read())  # adjust the regex to your file/line pattern
result = Counter(emails)  # type is <class 'collections.Counter'>
dict(result)  # convert to a regular dict
re.findall() returns a list. Looking at How can I count the occurrences of a list item in Python?, there are other ways to count the items in the returned list.
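For example, a minimal sketch with a plain dict, assuming emails is the list returned by findall() above:
counts = {}
for addr in emails:
    counts[addr] = counts.get(addr, 0) + 1  # same tallies as Counter(emails)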
By the way, here are some interesting features of Counter:
>>> tmp1 = Counter(re.findall('from=<([^\s]*)>', "from=<usr1@gmail.com>, from=<usr2@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,") )
>>> tmp1
Counter({'usr1@gmail.com': 4, 'usr2@gmail.com': 1})
>>> tmp2 = Counter(re.findall('from=<([^\s]*)>', "from=<usr2@gmail.com>, from=<usr3@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,") )
>>> dict(tmp1+tmp2)
{'usr2@gmail.com': 2, 'usr1@gmail.com': 7, 'usr3@gmail.com': 1}
So, if the file is very large, we can count each line separately and combine the per-line results with the help of Counter.
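A minimal sketch of that idea, assuming the same 'kkmail' file as in the question:
import re
from collections import Counter

total = Counter()
with open('kkmail') as log:
    for line in log:
        # update() with an iterable adds one count per element
        total.update(re.findall(r'from=<([^\s]*)>', line))
print dict(total)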

Have you considered using pandas? It can give you a nice table of results without the need for regex commands.
import pandas as pd

emails = pd.Series(email_list)  # email_list: the addresses pulled from the log
individual_emails = emails.unique()
# make a table with one row per address and a zeroed tally column
tally = pd.DataFrame({'email': individual_emails, 'count': 0})
for item in tally.index:
    address = tally.loc[item, 'email']
    tally.loc[item, 'count'] = (emails == address).sum()
print tally
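For what it's worth, the loop above can be collapsed into a single call, since value_counts() tallies a Series directly:
print emails.value_counts()  # one row per unique address, sorted by count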

I hope the code at the bottom helps.
However, here are three things to generally note:
Use with when opening files
When iterating over dictionaries, use iteritems()
When working with containers, the collections module is your best friend
#!/usr/bin/python
import re
from collections import Counter

fmt = " %-32s %-15s"
filename = 'kkmail'

# Extract the email addresses
email_list = []
with open(filename, 'r') as log:
    for line in log.readlines():
        _re = re.search(r'.*from=<(.*)>,\ssize', line)
        if _re:
            name = _re.group(1)
            email_list.append(name)

# Count the email addresses
counts = dict(Counter(email_list))  # list to dict of counts: {'a': 3, 'b': 7, ...}
for key, val in counts.iteritems():
    print fmt % (key, val)

Related

How do I print this list in a readable form?

I have written a short Python script to search for URLs with a given HTTP status code in a logfile. The script works as intended and counts how often a URL is requested in combination with a certain HTTP status code. The dictionary with the results is unsorted, so I sorted the data afterwards using the values in the dictionary. This part of the script works as intended and I get a sorted list with the URLs and the counts. The list looks like:
([('http://example1.com"', 1), ('http://example2.com"', 5), ('http://example3.com"', 10)])
I just want to make it more readable and print the list in rows:
http://example1.com 1
http://example2.com 5
http://example3.com 10
I started with Python only two weeks ago and I can't find a solution. I tried several solutions I found here on Stack Overflow but nothing works. My current solution prints all URLs in separate rows but does not show the count. I can't use a comma as a separator because some URLs in my logfile contain commas. I'm sorry for my bad English and the stupid question. Thank you in advance.
from operator import itemgetter
from collections import OrderedDict

d = dict()
with open("access.log", "r") as f:
    for line in f:
        line_split = line.split()
        list = line_split[5], line_split[8]
        url = line_split[8]
        string = '407'
        if string in line_split[5]:
            if url in d:
                d[url] += 1
            else:
                d[url] = 1
sorted_d = OrderedDict(sorted(d.items(), key=itemgetter(1)))
for element in sorted_d:
    parts = element.split(') ')
    print(parts)
for url, count in sorted_d.items():
    print(f'{url} {count}')
Replace your last for loop with the above.
To explain: we unpack the url, count pairs in sorted_d in the for loop, and then use an f-string to print the url and count separated by a space.
First, if you're already importing from the collections library, why not import a Counter?
from collections import Counter

d = Counter()
with open("access.log", "r") as f:
    for line in f:
        line_split = line.split()
        url = line_split[8]
        string = '407'
        if string in line_split[5]:
            d[url] += 1
for key, value in d.most_common():  # or reversed(d.most_common())
    print(f'{key} {value}')
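For reference, most_common() yields (element, count) pairs sorted from most to least common:
>>> from collections import Counter
>>> Counter('aaabbc').most_common()
[('a', 3), ('b', 2), ('c', 1)]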
There are many good tutorials on how to format strings in Python, such as this one.
Here is some example code for printing a dictionary. I set the width of the columns with the variables c1 and c2.
c1 = 34; c2 = 10
printstr = '\n|%s|%s|' % ('-' * c1, '-' * c2)
for key in sorted(d.keys()):
    val_str = str(d[key])
    printstr += '\n|%s|%s|' % (str(key).ljust(c1), val_str.rjust(c2))
printstr += '\n|%s|%s|\n\n' % ('-' * c1, '-' * c2)
print(printstr)
The string method ljust() pads a string to the width passed as an argument, with the content left-justified; rjust() is its right-justified counterpart.
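A quick illustration of both (the trailing pipe just makes the padding visible):
>>> 'abc'.ljust(6) + '|'
'abc   |'
>>> '42'.rjust(6) + '|'
'    42|'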

Python: Counting a specific set of character occurrences in lines of a file

I am struggling with a small program in Python which aims at counting the occurrences of a specific set of characters in the lines of a text file.
As an example, if I want to count '!' and '@' from the following lines
hi!
hello@gmail.com
collection!
I'd expect the following output:
!;2
@;1
So far I have functional code, but it's inefficient and does not use the potential of Python's libraries.
I have tried using collections.Counter, with limited success. The efficiency blocker I found is that I couldn't select a specific set of characters with counter.update(); all the other characters found were counted as well, so afterwards I would have to filter out the characters I am not interested in, which adds another loop...
I also considered regular expressions, but I can't see an advantage in this case.
This is the functional code I have right now (the simplest idea I could imagine), which looks for special characters in the file's lines. I'd like to see if someone can come up with a neater Python-specific idea:
def count_special_chars(filename):
    special_chars = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ')
    dict_count = dict(zip(special_chars, [0] * len(special_chars)))
    with open(filename) as f:
        for passw in f:
            for c in passw:
                if c in special_chars:
                    dict_count[c] += 1
    return dict_count
Thanks for checking.
Why not count the whole file all at once? You should avoid looping through the string for each line of the file; use str.count instead.
from pprint import pprint

# Better coding style: put the constant outside the function
SPECIAL_CHARS = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '

def count_special_chars(filename):
    with open(filename) as f:
        content = f.read()
    return dict([(i, content.count(i)) for i in SPECIAL_CHARS])

pprint(count_special_chars('example.txt'))
example output:
{' ': 0,
 '!': 2,
 '.': 1,
 '@': 1,
 '[': 0,
 '~': 0,
 # the remaining keys with a value of zero are ignored
 ...}
Eliminating the extra counts from collections.Counter is probably not significant either way, but if it bothers you, do it during the initial iteration:
from collections import Counter
special_chars = '''!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~ '''
found_chars = [c for c in open(yourfile).read() if c in special_chars]
counted_chars = Counter(found_chars)
There is no need to process the file contents line by line
Avoid nested loops, which increase the complexity of your program
If you want to count character occurrences in a string, first loop over the entire string once to construct an occurrence dict. Then you can look up the count for any character in the dict. This reduces the complexity of the program.
When constructing the occurrence dict, defaultdict helps you initialize the count values.
A refactored version of the program is below:
from collections import defaultdict

special_chars = list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ')
dict_count = defaultdict(int)
with open(filename) as f:
    for c in f.read():
        dict_count[c] += 1
for c in special_chars:
    print('{0};{1}'.format(c, dict_count[c]))
ref. defaultdict Examples: https://docs.python.org/3.4/library/collections.html#defaultdict-examples
I did something like this, where you do not need the Counter class. I used it to count all the special characters, but you can adapt it to put the counts in a dict.
import re

special_chars = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ '

def countSpecial(passwd):
    specialcount = 0
    for special in special_chars:
        # re.escape() keeps regex metacharacters like '*' or '[' literal
        length = len(re.findall(re.escape(special), passwd))
        if length > 0:
            #print length, special
            specialcount += length
    return specialcount

Faster operation reading file

I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:
ABCDEF
would return 2, as we have both ABCDE and BCDEF, but
AAAAAA
would return 1. My code:
control_var = 0
f = open("input.txt", "r")
list_of_substrings = []
while f.read(5) != "":
    f.seek(control_var)
    aux = f.read(5)
    if aux not in list_of_substrings:
        list_of_substrings.append(aux)
    control_var += 1
f.close()
print len(list_of_substrings)
Would another approach be faster (instead of comparing the strings directly from the file)?
Depending on what your definition of a legal substring is, here is a possible solution:
import re

regex = re.compile(r'(?=(\w{5}))')
with open('input.txt', 'r') as fh:
    input = fh.read()
print len(set(re.findall(regex, input)))
Of course, you may replace \w with whatever you see fit to qualify as a legal character in your substring. [A-Za-z0-9], for example, will match all alphanumeric characters.
Here is an execution example:
>>> import re
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])
EDIT: Following your comment above that all characters in the file are valid except the last one (which is \n), it seems there is no real need for regular expressions here, and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):
import timeit
import re

FILE_NAME = r'input.txt'

def re_approach():
    return len(set(re.findall(r'(?=(.{5}))', input[:-1])))

def iter_approach():
    return len(set([input[i:i+5] for i in xrange(len(input) - 5)]))

with open(FILE_NAME, 'r') as fh:
    input = fh.read()

# verify that the output of both approaches is identical
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set([input[i:i+5] for i in xrange(len(input) - 5)])

print timeit.repeat(stmt=re_approach, number=500)
print timeit.repeat(stmt=iter_approach, number=500)
15MB doesn't sound like a lot. Something like this probably would work fine:
import re
from collections import Counter

contents = open('input.txt', 'r').read()
# the lookahead is needed so overlapping substrings are counted too
counter = Counter(re.findall(r'(?=(.{5}))', contents))
print len(counter)
Update
I think user590028 gave a great solution, but here is another option:
contents = open('input.txt', 'r').read()
print set(contents[start:start+5] for start in range(0, len(contents) - 4))
# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()
You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.
So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.
Reading all at once is more I/O efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:
fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
    key = buf[x:x+5]
    fives[key] = 1
for keys in fives.keys():
    print keys

parsing repeated lines of string based on initial characters [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I am working with lists and strings in Python. I have the following lines of strings:
ID abcd
AC efg
RF hij
ID klmno
AC p
RF q
I want the output as:
abcd, efg, hij
klmno, p, q
This output is based on the first two characters in the line. How can I achieve it in an efficient way?
I'm looking to output the second part of the line for every entry between the ID tags.
I'm having a little trouble parsing the question, but according to my best guess, this should do what you're looking for:
all_data = " ".join([line for line in file]).split("ID")
return [", ".join([item.split(" ")[::2] for item in all_data])]
Basically what you're doing here is first joining together all of your data (removing the newlines), then splitting on your key phrase "ID".
After that, if I'm correctly interpreting the question, you're looking to get the second value of each pair. These pairs are space delimited (as is everything in each item, due to the " ".join in the first line), so we just step through each list grabbing every other item.
In general, slices have a little more syntactic sugar than is usually used; the full syntax is [start:end:step], so [::2] just returns every other item.
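A quick illustration of that step syntax:
>>> letters = ['a', 'b', 'c', 'd', 'e', 'f']
>>> letters[::2]
['a', 'c', 'e']
>>> letters[1::2]
['b', 'd', 'f']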
You could use the following, which takes order into account so that transposing the dict's values makes more sense...
from collections import OrderedDict

items = OrderedDict()
with open('/home/jon/sample_data.txt') as fin:
    lines = (line.strip().partition(' ')[::2] for line in fin)
    for key, value in lines:
        items.setdefault(key[0], []).append(value)

res = [', '.join(el) for el in zip(*items.values())]
# ['abcd, efg, hij', 'klmno, p, q']
Use a defaultdict:
from collections import defaultdict

result = defaultdict(list)
for line in lines:
    split_line = line.split(' ')
    result[split_line[0]].append(split_line[1])
This will give you a dictionary result that stores all the values that have the same key in a list. To get all the strings that were in a line starting with e.g. ID:
print result['ID']
Based on your answers in comments, this should work (if I understand what you're looking for):
data = None
for line in lines:
    fields = line.split(None, 1)  # split into at most two fields
    if fields[0] == "ID":
        # new set of data
        if data is not None:
            # output the last set of data
            print ", ".join(data)
        data = []
    data.append(fields[1])
if data is not None:
    # output the final data set
    print ", ".join(data)
It's pretty straightforward: you're just collecting the second field of each line into data until you see the start of the next data set, at which point you output the previous data set.
I think using itertools.groupby is best for this kind of parsing (do something until next token X)
import itertools

class GroupbyHelper(object):
    def __init__(self):
        self.state = None
    def __call__(self, row):
        if self.state is None:
            self.state = True
        else:
            if row[0] == 'ID':
                self.state = not self.state
        return self.state

# assuming you read data from 'stream'
for _, data in itertools.groupby((line.split() for line in stream), GroupbyHelper()):
    print ','.join(c[1] for c in data)
output:
$ python groupby.py
abcd,efg,hij
klmno,p,q
It looks like you would like to subgroup your data whenever 'ID' is present as your key. A groupby solution could work wonders here, if you know how to group your data. Here is one such implementation that might work for you:
>>> data = [e.split() for e in data.splitlines()]
>>> def new_key(key):
...     toggle = [0, 1]
...     def helper(e):
...         if e[0] == key:
...             toggle[:] = toggle[::-1]
...         return toggle[0]
...     return helper
...
>>> from itertools import groupby
>>> for k, v in groupby(data, key=new_key('ID')):
...     for e in v:
...         print e[-1],
...     print
...
abcd efg hij
klmno p q
If lines is equal to
['ID abcd', 'AC efg', 'RF hij']
then
[line.split()[1] for line in lines]
Edit: Added everything below after the downvotes.
I am not sure why this was downvoted. I thought that code was the simplest way to get started with the information provided at the time. Perhaps this is a better explanation of what I thought/think the data was/is?
If the input is a list of strings in a repeated sequence, called alllines:
alllines = [  # a list of repeated lines of string based on initial characters
    'ID abcd',
    'AC efg',
    'RF hij',
    'ID klmno',
    'AC p',
    'RF q'
]
then the code is:
[[line.split()[1] for line in lines] for lines in
 [[alllines.pop(0) for i in range(3)] for o in range(len(alllines)/3)]]
This basically says: for every three strings in the whole list, create a sublist of the three split()[1] values.
and the output is:
[['abcd', 'efg', 'hij'], ['klmno', 'p', 'q']]
Edit 8-6-13: Here is an even better one, without pop():
zip(*[iter([line.split()[1] for line in alllines])]*3)
with a slightly different output:
[('abcd', 'efg', 'hij'), ('klmno', 'p', 'q')]

Analysing a text file in Python

I have a text file that needs to be analysed. Each line in the file is of this form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3
I need to skip the timestamp and the (slbfd), and only keep a count of the lines with IN and OUT. Further, I need to keep a separate count for each name in quotes: increase it if the line has OUT and decrease it otherwise. How would I go about doing this in Python?
The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:
S = '''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''
from pyparsing import *
from collections import defaultdict
# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))
# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
for x in P:
    if x.flag == "IN": counts[x.name] += 1
    if x.flag == "OUT": counts[x.name] -= 1
for key in counts:
    print key, counts[key]
This gives as output:
lq_viz_server 1
OFM32 -1
Which would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (e.g. grab and parse the timestamp, pull the email address, parse error codes...). The idea is that you write the grammar independent of the query: you simply convert the raw text to a computer-friendly format, abstracting the parsing implementation away from its usage.
Assuming the file is divided into lines (I don't know if that's true), you can apply the split() function to each line. You will get this:
['7:06:32', '(slbfd)', 'IN:', '"lq_viz_server"', 'aqeela@nabltas1']
And then you can apply whatever logic you need by comparing the values.
I made some wild assumptions about your specification; here is some sample code to help you start:
objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects)  # for debug only; not advisable to print a huge number of names
You have two options:
Use the .split() function of the string (as pointed out in the comments)
Use the re module for regular expressions.
I would suggest using the re module and creating a pattern with named groups.
Recipe:
first create a pattern with re.compile() containing named groups
do a for loop over the file to get the lines, and use .match() of the compiled pattern object on each line
use .groupdict() of the returned match object to access your values of interest
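A minimal sketch of that recipe, assuming the log format from the question (the group names flag and name, and the filename, are illustrative):
import re

# named groups let .groupdict() return {'flag': ..., 'name': ...}
pattern = re.compile(r'\d+:\d+:\d+ \(slbfd\) (?P<flag>\w+): "(?P<name>\w+)"')

with open('data.txt') as f:  # hypothetical filename
    for line in f:
        match = pattern.match(line)
        if match:
            values = match.groupdict()
            print values['flag'], values['name']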
In the mode of just get 'er done with the standard distribution, this works:
import re
from collections import Counter

# open your file as inF...
count = Counter()
for line in inF:
    match = re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
    if match:
        if match.group(1) == 'IN': count[match.group(2)] += 1
        elif match.group(1) == 'OUT': count[match.group(2)] -= 1
print(count)
Prints:
Counter({'lq_viz_server': 1, 'OFM32': -1})
