I have a file with two columns, MAIL and ID (20 million lines):
000000@0000.com 0xE618EF6B90AG
000000@0000.com 0xE618EF6B90AF
00000@00000.com 0xE618EFBCC83D
00000@00000.com 0xE618EFBCC83C
@000000000 0xE618F02C223E432CEA
00000@0000.com 0x01010492A
0000@00000.com 0x52107A
@ 0xE618F032F829432CE04343307C570906A
00000@0000.com 0xE618F032F829432CEB
000000@000.com 0xE618F032FE7B432CEC
000000@000.com 0xE618F032FE7B432CED
@hotmail.com 0x41970588
@ 0x52087617
I need to map IDs to the email they are registered under, so we can find which IDs are registered to a given email. An email may have several IDs registered to it.
Here is the function I wrote, but it turns out that I need to exclude invalid emails like @.com, @, etc.
The first version of the script worked almost perfectly, with one little problem: my parser breaks if the email has a space somewhere between the symbols.
So I added a regexp to check the email value, but now I get an error I don't know how to handle:
import re

def duplicates(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
        query = (line.replace('\n', '') for line in lines)
        split_query = (line.split(' ') for line in query)
        result_mail = {}
        for line in split_query:
            # added if statement to validate email, remove to check w/o
            if re.match(r"[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+", line[0]):
                if line[0] not in result_mail:
                    result_mail[line[0]] = []
            result_mail[line[0]].append(line[1])
        for mail, ids in result_mail.iteritems():
            if len(ids) > 1:
                with open('MAIL_ids.txt', 'a') as r_mail:
                    r_mail.write(str(mail) + '\n')
                    r_mail.write(str(ids) + '\n')

if __name__ == '__main__':
    import sys
    filename = sys.argv[1]
    duplicates(filename)
After running the script I get an error about KeyError: ''. Why is this happening?
File ".\dup_1.2.py", line 44, in <module>
duplicates(filename)
File ".\dup_1.2.py", line 32, in duplicates
result_mail[line[0]].append(line[1])
KeyError: ''
I would also like to rewrite the part where I add keys and values to the dictionary, using defaultdict(), something like:
result_mail = defaultdict(list)
for line in lines:
    if line[0] not in result_mail:
        result_mail[line[0]].append(line[1])
It seems you just put the line result_mail[line[0]].append(line[1]) at the wrong level of indentation, so it is executed even when the if re.match condition does not apply.
Also, you might want to use collections.defaultdict to get rid of that if line[0] not in result_mail check.
result_mail = collections.defaultdict(list)
for (mail, id_) in split_query:
    if re.match(r"[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+", mail):
        result_mail[mail].append(id_)
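Putting both fixes together, the whole script might look like this (a sketch, keeping the question's regex and the two-column MAIL ID layout; the split() call and the length check are my additions to survive stray spaces):

import collections
import re
import sys

def duplicates(filename):
    result_mail = collections.defaultdict(list)
    with open(filename, 'r') as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip lines broken by stray spaces
            mail, id_ = parts
            if re.match(r"[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+", mail):
                result_mail[mail].append(id_)
    with open('MAIL_ids.txt', 'a') as r_mail:
        for mail, ids in result_mail.items():
            if len(ids) > 1:
                r_mail.write(mail + '\n')
                r_mail.write(str(ids) + '\n')

if __name__ == '__main__':
    duplicates(sys.argv[1])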
Given a data stream, how do I extract the information (between \x15 and \x15\n) that comes right after a *PAR?
Here is the data stream.
'%gra:bla bla bla\n',
'*PAR:\tthe cat wants\n',
'\tcookies . \x159400_14000\x15\n',
'%mor:\tdet:art|the adj|cat\n',
'\tpro:rel|which aux|be&3S part|tip-PRESP coord|and pro:sub|he\n',
'\tv|want-3S n|cookie-PL .\n',
'%gra:\t1|3|DET 2|3|MOD 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|7|DET 7|5|POBJ 8|10|LINK\n',
'\t9|10|AUX 10|7|CMOD 11|13|LINK 12|13|SUBJ 13|10|CJCT 14|13|OBJ 15|16|INF\n',
'\t16|14|XMOD 17|16|JCT 18|19|DET 19|17|POBJ 20|4|PUNCT\n',
'*PAR:\tcookies biscuit\n',
'\tis eating a cookie . \x1514000_18647\x15\n',
My output should be:
"9400_14000"
"14000_18647"
...
Go over the data line by line and look for the desired pattern only on lines that follow a *PAR line:
import re
[re.search('\x15(.*)\x15\n', line).groups()[0]
for i, line in enumerate(data) if '*PAR' in data[i - 1]]
This code will throw an exception if the pattern cannot be matched on a line that follows *PAR. To get all valid matches use:
[match
for i, line in enumerate(data) if '*PAR' in data[i - 1]
for match in re.findall('\x15(.*)\x15\n', line)]
If you expect more than a single pair of \x15 on a line you can use this regex instead to find the shortest match:
'\x15(.*?)\x15\n'
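For instance, on a trimmed version of the sample stream (a runnable sketch; the \x15 here are actual control characters):

import re

data = [
    '*PAR:\tthe cat wants\n',
    '\tcookies . \x159400_14000\x15\n',
    '%mor:\tdet:art|the adj|cat\n',
    '*PAR:\tcookies biscuit\n',
    '\tis eating a cookie . \x1514000_18647\x15\n',
]

matches = [match
           for i, line in enumerate(data) if '*PAR' in data[i - 1]
           for match in re.findall('\x15(.*?)\x15\n', line)]
print(matches)  # ['9400_14000', '14000_18647']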
I like to use this function:
import re

def get_between(string: str, start: str, end: str, with_brackets=True):
    # re.escape escapes all characters that need to be escaped
    new_start = re.escape(start)
    new_end = re.escape(end)
    pattern = f"{new_start}.*?{new_end}"
    res = re.findall(pattern, string)
    if with_brackets:
        return res  # -> this is with the brackets
    else:
        return [x[len(start):-len(end)] for x in res]  # -> this is without the brackets
To use it in your example do this:
result = []
for i, string in enumerate(data_stream):
    if i > 0 and "*PAR" in data_stream[i - 1]:
        result += get_between(string, "\x15", "\x15\n", False)
print(result)  # ['9400_14000', '14000_18647']
I don't know the type of the data stream, so here is a generator:
import re

def generator(data_stream):
    pattern = r"\x15([^\x15]+)\x15\n"
    search_next = False
    for line in data_stream:
        if search_next:
            for out in re.findall(pattern, line):
                yield out
            search_next = False
        if line.find("*PAR") > -1:
            search_next = True
If it can be converted to a list, you can use this:
[x for x in generator(data)]
You could use the newer regex module with:
(?:\G(?!\A)|\*PAR)      # fast forward to *PAR (or continue after a match)
(?:(?!\*PAR).)+?\\x15\K # do not overrun another *PAR, fast forward to \x15
.+?                     # start matching...
(?=\\x15|\z)            # ... up to either \x15 or the end
See a demo on regex101.com (and mind the singleline and verbose modifiers!).
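In Python, that could look like the following sketch (using the third-party regex package; assuming data holds the lines from the question, with \x15 as actual control characters rather than the literal backslash text used in the regex101 demo):

import regex

text = ''.join(data)
pattern = regex.compile(r'''
    (?:\G(?!\A)|\*PAR)      # fast forward to *PAR (or continue after a match)
    (?:(?!\*PAR).)+?\x15\K  # do not overrun another *PAR, fast forward to \x15
    .+?                     # start matching...
    (?=\x15|\Z)             # ... up to either \x15 or the end
''', regex.DOTALL | regex.VERBOSE)
print(pattern.findall(text))  # ['9400_14000', '14000_18647']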
I have a very large CSV file (50k+ lines).
This file contains IRC logs and here's the data format:
1st column: Message type (1 for message, 2 for system)
2nd column: Timestamps (numbers of seconds since a precise date)
3rd column: Username of the one writing the message
4th column: Message
Here's an example of the data:
1,1382445487956,"bob","i don't know how to do such a task"
1,1382025765196,"alice","bro ask stackoverflow"
1,1382454875476,"_XxCoder_killerxX_","I'm pretty sure it can be done with python, bob"
2,1380631520410,"helloman","helloman_ join the chan."
For example, _XxCoder_killerxX_ mentioned bob.
So, knowing all of this, I want to find out which pair of usernames mentioned each other the most.
Only messages should be counted, so I only need to work on lines starting with the number "1" (there are also lines starting with "2" and other irrelevant numbers).
I know it can be done with the csv Python module, but I've never worked with such large files, so I really don't know how to start.
You should perform two passes of the CSV: one to capture all sender usernames, the second to find sender usernames mentioned in messages.
import csv

users = set()
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    for line in reader:
        if line[0] != "1":  # only message lines (type 1) count
            continue
        users.add(line[2])

mentions = {}
with open("test.csv", "r") as file:
    reader = csv.reader(file)
    for line in reader:
        if line[0] != "1":  # only message lines (type 1) count
            continue
        sender, message = line[2], line[3]
        for recipient in users:
            if recipient == sender:
                continue  # can't mention yourself
            if recipient in message:
                key = (sender, recipient)
                mentions[key] = mentions.get(key, 0) + 1

for mention, times in mentions.items():
    print(f"{mention[0]} mentioned {mention[1]} {times} time(s)")
To total the two directions of each pair:
totals = {}
for mention, times in mentions.items():
    key = tuple(sorted(mention))
    totals[key] = totals.get(key, 0) + times

for names, times in totals.items():
    print(f"{names[0]} and {names[1]} mentioned each other {times} time(s)")
This example is naive, as it's performing simple substring matches. So, if there's someone named "foo" and someone mentions "food" in a message, it will indicate a match.
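One way to tighten the matching (my suggestion, not part of the original answer) is to require the username to appear as a whole word, using a word-boundary regex instead of a plain substring test:

import re

def mentions_user(message, username):
    # Match the username only as a whole word, so 'foo' does not match
    # inside 'food'. This works because IRC-style usernames consist of
    # word characters (letters, digits, underscores).
    return re.search(r'\b' + re.escape(username) + r'\b', message) is not None

print(mentions_user("I'm pretty sure it can be done with python, bob", "bob"))  # True
print(mentions_user("nobody wants food", "foo"))  # False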
Here is a solution using pandas and sets. The use of pandas significantly simplifies the import and manipulation of csv data, and the use of sets allows one to count {'alice', 'bob'} and {'bob', 'alice'} as two occurrences of the same combination.
import pandas as pd

df = pd.read_csv('sample.csv', header=None)
df.columns = ['id', 'timestamp', 'username', 'message']

lst = []
for name in df.username:
    for i, m in enumerate(df.message):
        if name in m:
            author = df.iloc[i, 2]
            lst.append({author, name})

most_freq = max(lst, key=lst.count)
print(most_freq)
# {'bob', '_XxCoder_killerxX_'}
Experts, I am trying to count the email addresses and the number of their repetitions in a maillog file. I have managed to do this with re.search (or re.match), but I would like to accomplish it with re.findall, which I am currently dabbling with. I would appreciate any suggestions.
1) Code:
# cat maillcount31.py
#!/usr/bin/python
import re

mydic = {}
counts = mydic
fmt = " %-32s %-15s"
log = open('kkmail', 'r')
for line in log.readlines():
    myre = re.search('.*from=<(.*)>,\ssize', line)
    if myre:
        name = myre.group(1)
        if name not in mydic.keys():
            mydic[name] = 0
        mydic[name] += 1
for key in counts:
    print fmt % (key, counts[key])
2) Output from the current code:
# python maillcount31.py
root@MyServer1.myinc.com 13
User01@MyServer1.myinc.com 14
Hope this helps...
import re
from collections import Counter

# Modify the regex according to your file or line pattern. If you call
# findall() on each line, the returned lists should be combined.
emails = re.findall('.*from=<(.*)>,\ssize', line)
result = Counter(emails)  # type is <class 'collections.Counter'>
dict(result)  # convert to a regular dict
re.findall() will return a list. Looking at "How can I count the occurrences of a list item in Python?", there are other ways to count the words in the returned list.
By the way, Counter has some interesting functionality:
>>> tmp1 = Counter(re.findall('from=<([^\s]*)>', "from=<usr1@gmail.com>, from=<usr2@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,"))
>>> tmp1
Counter({'usr1@gmail.com': 4, 'usr2@gmail.com': 1})
>>> tmp2 = Counter(re.findall('from=<([^\s]*)>', "from=<usr2@gmail.com>, from=<usr3@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,"))
>>> dict(tmp1+tmp2)
{'usr2@gmail.com': 2, 'usr1@gmail.com': 7, 'usr3@gmail.com': 1}
So, if the file is very large, we can count each line separately and combine the per-line results with Counter.
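For example (a sketch, assuming the maillog is named kkmail as in the question):

import re
from collections import Counter

# Accumulate a Counter one line at a time, so the whole file
# never has to sit in memory at once.
total = Counter()
with open('kkmail') as log:
    for line in log:
        total.update(re.findall(r'from=<([^\s]*)>', line))
print(dict(total))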
Have you considered using pandas? It can give you a nice table of results without the need for regex commands.
import pandas as pd

emails = pd.Series(email_list)
individual_emails = emails.unique()

# make a table with emails and a zeroed tally
tally = pd.DataFrame({'email': individual_emails,
                      'count': [0] * len(individual_emails)})

for item in tally.index:
    address = tally.loc[item, 'email']
    tally.loc[item, 'count'] = (emails == address).sum()

print tally
I hope the code at the bottom helps.
However, here are three things to generally note:
Use the with statement when opening files
When iterating over dictionaries, use iteritems()
When working with containers, collections are your best friend
#!/usr/bin/python
import re
from collections import Counter

fmt = " %-32s %-15s"
filename = 'kkmail'

# Extract the email addresses
email_list = []
with open(filename, 'r') as log:
    for line in log.readlines():
        _re = re.search('.*from=<(.*)>,\ssize', line)
        if _re:
            name = _re.group(1)
            email_list.append(name)

# Count the email addresses
counts = dict(Counter(email_list))  # list to dict of counts: {'a': 3, 'b': 7, ...}
for key, val in counts.iteritems():
    print fmt % (key, val)
I have a file with unusable data in it, and I want to clean it up with Python.
The lines have the form:
Xac:0.01660156#,Yac:0.02343750?,Zac:1.00683593*
I want to delete Xac:, Yac:, and Zac:, and also the characters at the ends of the numbers (#, ?, *), to leave only the numbers.
I also want to delete some trash lines in the file, like:
!Data Logger Accelerometer] ,
Initializing...
Lines like those are trash to me, and I need to delete them to leave a clean file with only numbers in three columns. (These numbers are accelerometer readings on the x, y, and z axes, but the file contains unusable data like the lines shown above.)
How can I achieve this?
You need to parse the data file.
First, skip invalid lines:
if not line.startswith('Xac:'):
    return None
Second, split by non-number chars:
parts = re.split('[,Xac:YZ#?*]', line)
Third, filter out empty strings:
parts = filter(lambda x: bool(x), parts)
Fourth, convert each string to float:
parts = map(lambda x: float(x), parts)
Finally, convert the list to a tuple:
return tuple(parts)
The full example is like this:
import re

def parse_line(line):
    """line -> (float, float, float), None if invalid"""
    if not line.startswith('Xac:'):
        return None
    parts = re.split('[,Xac:YZ#?*]', line)
    parts = filter(lambda x: bool(x), parts)
    parts = map(lambda x: float(x), parts)
    return tuple(parts)

output = []
with open('input.txt') as f:
    for line in iter(f.readline, ''):
        axes = parse_line(line.strip())
        if axes:
            output.append(axes)

print output
Input file input.txt:
!Data Logger Accelerometer] ,
Initializing...
Xac:0.01660156#,Yac:0.02343750?,Zac:1.00683593*
OUTPUT:
[(0.01660156, 0.0234375, 1.00683593)]
You can use Python regular expressions:
import re

x = 'Xac:0.01660156#,Yac:0.02343750?,Zac:1.00683593*'
print re.findall('(\d*\.?\d+)', x)  # ['0.01660156', '0.02343750', '1.00683593']
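To produce the clean three-column file the question asks for, the same regex can be combined with a line filter (a sketch; input.txt and clean.txt are hypothetical file names):

import re

with open('input.txt') as src, open('clean.txt', 'w') as dst:
    for line in src:
        nums = re.findall(r'(\d*\.?\d+)', line)
        # keep only real accelerometer lines with exactly three readings
        if line.startswith('Xac:') and len(nums) == 3:
            dst.write(' '.join(nums) + '\n')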
I have a text file with this format:
2014-04-10
Arjun 22 Class 10 60
Anil 23 Class 09 85
2013-03-10
Jhon 21 Class 10 78
What should the code look like if I want the dictionary shown below?
{'2014_Arjun': ['22', 'Class 10', '60'], '2014_Anil': ['23', 'Class 09', '85'], '2013_Jhon': ['21', 'Class 10', '78']}
The idea is to iterate over the file's lines and try parsing each line into a datetime via strptime(). If that succeeds, remember the year of the date; if not, parse the line via a regular expression and write to the data dict:
from datetime import datetime
import re

data = {}
pattern = re.compile('(\w+)\s+(\d+)\s+(\w+\s\d+)\s+(\d+)')
with open('input.txt') as f:
    for line in f:
        try:
            year = datetime.strptime(line.strip(), '%Y-%m-%d').year
        except ValueError:
            item = pattern.match(line.strip()).groups()
            data[str(year) + "_" + item[0]] = item[1:]

print data
prints:
{'2013_Jhon': ('21', 'Class 10', '78'),
'2014_Arjun': ('22', 'Class 10', '60'),
'2014_Anil': ('23', 'Class 09', '85')}
Make sure you understand what is going on here; if not, feel free to ask in the comments. Note that the values come out as tuples, not lists; wrap item[1:] in list() if you need lists as in your expected output.
This is the simplest solution I can imagine, if you really are using the TSV file format (Tab-Separated Values):
PATH = r"C:\text.txt"
reader = open(PATH, 'rb')
result = {}
for line in reader:
    line = line.rstrip("\r\n")
    if line.count("\t") == 0:
        year = line.split("-")[0]
    else:
        name, age, class_no, mark = line.split("\t")
        key = year + "_" + name
        value = [age, class_no, mark]
        result[key] = value
reader.close()
The "result" dictionary is what you asked for :)
I'm not going to write this for you, but the following should help put you on the right path.
If the format of your file is going to consistently be
YYYY-MM-DD
Name ## Class ## ##
Then the following is fairly simple.
You can do the following: check each line to see whether it contains 'Class'.
If it doesn't (which implies the line contains YYYY-MM-DD), you have a new dictionary key prefix and can split on '-' to pull out the year.
If it does contain 'Class', you can complete your dictionary key (YYYY_Name) and assign the remaining values as a list, with d["YYYY_Name"] as the key, as sketched below.
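A minimal sketch of that approach (assuming the exact layout shown in the question, saved as input.txt):

d = {}
with open('input.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        if 'Class' not in line:
            # a date line: YYYY-MM-DD -> keep the year as the key prefix
            year = line.split('-')[0]
        else:
            # a record line: Name Age Class NN Mark
            name, age, word, class_no, mark = line.split()
            d[year + '_' + name] = [age, word + ' ' + class_no, mark]
print(d)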