Match lines from a file and parse them in Python - python

I have this file with different lines, and I want to take only some information from each line (not the whole of it). Here is a sample of what the file looks like:
18:10:12.960404 IP 132.227.127.62.12017 > 134.157.0.129.53: 28192+ A? safebrowsing-cache.google.com. (47)
18:10:12.961114 IP 134.157.0.129.53 > 132.227.127.62.12017: 28192 12/4/4 CNAME safebrowsing.cache.l.google.com., A 173.194.40.102, A 173.194.40.103, A 173.194.40.104, A 173.194.40.105, A 173.194.40.110, A 173.194.40.96, A 173.194.40.97, A 173.194.40.98, A 173.194.40.99, A 173.194.40.100, A 173.194.40.101 (394)
18:13:46.206371 IP 132.227.127.62.49296 > 134.157.0.129.53: 47153+ PTR? b._dns-sd._udp.upmc.fr. (40)
18:13:46.206871 IP 134.157.0.129.53 > 132.227.127.62.49296: 47153 NXDomain* 0/1/0 (101)
18:28:57.253746 IP 132.227.127.62.54232 > 134.157.0.129.53: 52694+ TXT? time.apple.com. (32)
18:28:57.254647 IP 134.157.0.129.53 > 132.227.127.62.54232: 52694 1/8/8 TXT "ntp minpoll 9 maxpoll 12 iburst" (381)
.......
.......
It is actually the output of a DNS request, so from it I want to extract these elements:
[timestamp], [src ip], [src port], [dst ip], [dst port], [domain (if it exists)], [related IP addresses]
After looking through the site's old topics, I found that re.match() is a great and helpful way to do that, but since, as you can see, every line is different from the others, I am kind of lost. Some help would be great; here is the code I wrote so far:
import re
import sys

def extractDNS(filename):
    objList = []
    obj = {}
    with open(filename) as fi:
        for line in fi:
            line = line.lower().strip()
            # 18:09:29.960404
            m = re.match(r"(\d+):(\d+):(\d+\.\d+)", line)
            if m:
                obj = {}  # New object detected
                hou = int(m.group(1))
                min = int(m.group(2))
                sec = float(m.group(3))
                obj["time"] = (hou * 3600) + (min * 60) + sec
                objList.append(obj)
            # ip 132.227.127.62.12017  (the line was lowercased above)
            m = re.search(r"ip\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.(\d+)", line)
            if m:
                obj["dnssrcip"] = m.group(1)
                obj["dnssrcport"] = m.group(2)
            # > 134.157.0.129.53:
            m = re.search(r">\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.(\d+):", line)
            if m:
                obj["dnsdstip"] = m.group(1)
                obj["dnsdstport"] = m.group(2)
    tstFile3 = open("outputFile", "w+")
    tstFile3.write("%s\n" % objList)
    tstFile3.close()

extractDNS(sys.argv[1])
I know I have to make if/else statements after this, because what comes after them is different every time. I showed above the 3 cases I generally get in every DNS output file, which are:
- A? followed by CNAME, the exact domain, and the IP addresses,
- PTR? followed by NXDomain, meaning the domain is non-existent, so I will just ignore this line,
- TXT? followed by a domain, but it only gives words, so I will ignore this one too.
I only want the requests whose responses contain IP addresses, which in this case are the A? ones.

If you know the first 5 columns are always present, why don't you just split the line up and handle those fields directly (use datetime for the timestamp, and manually parse the IP addresses/ports)? Then you could use your regular expression to match only the CNAME records and the contents you are interested in from that one field. There is no need to have a regular expression scan over the different possibilities if you aren't going to actually use the output. So, if it doesn't match the CNAME form, you don't care how it would be handled. At least, that's what it sounds like.
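For illustration, a minimal sketch of that split-based idea (the helper name and field positions are mine, read off the sample lines above):
def parse_fixed_fields(line):
    # <timestamp> IP <src_ip.src_port> > <dst_ip.dst_port>: ...
    fields = line.split()
    timestamp = fields[0]                  # e.g. 18:10:12.960404
    src = fields[2]                        # e.g. 132.227.127.62.12017
    dst = fields[4].rstrip(':')            # e.g. 134.157.0.129.53
    src_ip, src_port = src.rsplit('.', 1)  # the last dot separates the port
    dst_ip, dst_port = dst.rsplit('.', 1)
    return timestamp, src_ip, src_port, dst_ip, dst_port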

As user632657 said above, there's no need to care about the lines you, well, don't care about. Just use one regular expression for that line, and if it doesn't match, ignore that line:
pattern = re.compile( r'(\d{2}):(\d{2}):(\d{2}\.\d+)\s+IP\s+(\d+\.\d+\.\d+\.\d+)\.(\d+)\s+>\s+(\d+\.\d+\.\d+\.\d+)\.(\d+):\s+(\d+).*?CNAME\s+([^,]+),\s+(.*?)\s+\(\d+\)' )
That will match the CNAME records only. You only need to define that once, outside your loop. Then, within the loop:
try:
    hour, minute, seconds, source_ip, source_port, dst_ip, dst_port, query_id, domain, records = pattern.match( line ).groups()
except AttributeError:  # pattern.match() returned None: not a CNAME line
    continue
records = [ r.split() for r in records.split( ', ' ) ]
This will pull all the fields you've asked about into the relevant variables, and parse the associated IPs into a list of pairs (class, IP), which I figure will be useful :P
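Putting the pieces together, a minimal sketch of the whole loop (the log file name is an assumption):
import re

pattern = re.compile( r'(\d{2}):(\d{2}):(\d{2}\.\d+)\s+IP\s+(\d+\.\d+\.\d+\.\d+)\.(\d+)\s+>\s+(\d+\.\d+\.\d+\.\d+)\.(\d+):\s+(\d+).*?CNAME\s+([^,]+),\s+(.*?)\s+\(\d+\)' )

results = []
with open('dns.log') as f:  # hypothetical input file
    for line in f:
        m = pattern.match(line)
        if m is None:
            continue  # not an A?/CNAME response line
        hour, minute, seconds, src_ip, src_port, dst_ip, dst_port, query_id, domain, records = m.groups()
        ips = [r.split()[1] for r in records.split(', ')]  # keep just the addresses
        results.append((src_ip, src_port, dst_ip, dst_port, domain, ips))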

Related

Sort with re.search() - Python

I have some problems solving the following problem.
I have two *.txt files; in both files are cities from Austria. In the first file, "cities1", the cities are ordered by population.
The first file (cities1.txt) is looking like this:
1.,Vienna,Vienna,1.840.573
2.,Graz,Styria,273.838
3.,Linz,Upper Austria,198.181
4.,Salzburg,Salzburg,148.420
5.,Innsbruck,Tyrol,126.851
The second file (cities2.txt) is looking like this:
"Villach","Carinthia",60480,134.98,501
"Innsbruck","Tyrol",126851,104.91,574
"Graz","Styria",273838,127.57,353
"Dornbirn","Vorarlberg",47420,120.93,437
"Vienna","Vienna",1840573,414.78,151
"Linz","Upper Austria",198181,95.99,266
"Klagenfurt am Woerthersee","Carinthia",97827,120.12,446
"Salzburg","Salzburg",148420,65.65,424
"Wels","Upper Austria",59853,45.92,317
"Sankt Poelten","Lower Austria",52716,108.44,267
What I'd like to do, or in other words what I should do, is this: the first file, cities1.txt, is already sorted, and I only need the second element of every line, that is, the name of the city. For example, from the line 2.,Graz,Styria,273.838, I only need Graz.
Then, second, I should print out the area of the city, which is the fourth element of every line in cities2.txt. That means, for example, from the third line "Graz","Styria",273838,127.57,353, I only need 127.57.
At the end the console should display the following:
Vienna,414.78
Graz,127.57
Linz,95.99
Salzburg,65.65
Innsbruck,104.91
So, my problem now is: how can I do this if I am only allowed to use the re.search() method? Because the second *.txt file is not in the same order, I have to bring the cities into the same order as in the first file for this to work, right?
I know it would be much easier to use re.split(), because then you could compare the list elements from both files. But I'm not allowed to do that.
I hope someone can help me, and sorry for the long text.
Here's an implementation based on my earlier comment:
with open('cities2.txt') as c:
    D = {}
    for line in c:
        t = line.split(',')
        cn = t[0].strip('"')
        D[cn] = t[-2]

with open('cities1.txt') as d:
    for line in d:
        t = line.split(',')
        print(f'{t[1]},{D[t[1]]}')
Note that this may not be robust: if there's a city name in cities1.txt that does not exist in cities2.txt, you'll get a KeyError.
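If that matters, dict.get with a default avoids the exception:
print(f'{t[1]},{D.get(t[1], "unknown")}')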
This is just a hint, it's your university assignment after all.
import re

TEST = '2.,Graz,Styria,273.838'
RE = re.compile('^[^,]*,([^,]*),')
if match := RE.search(TEST):
    print(match.group(1))  # 'Graz'
Let's break down the regexp:
^      - start of line
[^,]*  - any character except a comma, repeated 0 or more times
         (this is the first field)
,      - one comma character
         (this is the field separator)
(      - start capturing, we are interested in this field
[^,]*  - any character except a comma, repeated 0 or more times
         (this is the second field)
)      - stop capturing
,      - one comma character
(don't care about the rest of the line)
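Building on that hint, a minimal sketch of the whole exercise using re.search() only (file names are from the question; the pattern for cities2.txt is mine):
import re

# "Graz","Styria",273838,127.57,353  ->  capture the city name and the area
CITY_AREA_RE = re.compile(r'^"([^"]*)","[^"]*",[^,]*,([^,]*),')
# 2.,Graz,Styria,273.838  ->  capture the second field (the city name)
NAME_RE = re.compile(r'^[^,]*,([^,]*),')

areas = {}
with open('cities2.txt') as f2:
    for line in f2:
        if m := CITY_AREA_RE.search(line):
            areas[m.group(1)] = m.group(2)

with open('cities1.txt') as f1:
    for line in f1:
        if m := NAME_RE.search(line):
            print(f'{m.group(1)},{areas[m.group(1)]}')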

How to get the same name with multiple value get unique results in Python

I have a large CSV file whose URLs I compare against the ones in my txt file.
How can I get unique results when the same name has multiple values, and is there a faster way to compare the two files? The CSV file is at least 1 GB.
file1.csv
[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
file2.txt
1 www.amazon.com shop
1 wakers.com shop
script:
import csv

with open("file1.csv", 'r') as f:
    reader = csv.reader(f)
    for k in reader:
        ko = set()
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        ko.add((war, srcip))
        for to in ko:
            with open("file2.txt", "r") as f:
                all_val = set()
                for i in f:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                    print(ki)
my output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')
If the URL is the same, how do I group all of its values together, keeping only unique ones? How do I get results like this?
amazon.com    102.12.14.22
              167.27.14.62
              197.99.94.32
              157.87.34.72
wakers.com    192.10.77.95
Short answer: you can't do this directly. Well, you can, but with poor performance.
CSV is a good storage format, but if you want to do something like this you might want to store everything in another, custom data file. You could first parse your file so that it holds unique IDs instead of long strings (like amazon = 0, wakers = 1 and so on) to perform better and reduce comparison cost.
The thing is, those operations are pretty bad for a variable-length CSV; memory mapping, or building a database from your CSV, might also be a good idea (you'd make the changes on the database and only dump to CSV when you need to).
Look at How do quickly search through a .csv file in Python for a more complete answer.
Problem solution
import csv
import re

def possible_urls(filename, category, category_position, url_position):
    # Here we will read a txt file to create a list of domains that could correspond to shops
    domains = []
    with open(filename, "r") as file:
        file_content = file.read().splitlines()
        for line in file_content:
            info_in_line = line.split(" ")
            # Here I use a regular expression to parse the domain from the url.
            domain = re.sub('www.', '', info_in_line[url_position])
            if info_in_line[category_position] == category:
                domains.append(domain)
    return domains

def read_from_csv(filename, ip_position, url_position, possible_domains):
    # Here we will create a dictionary holding all the ips
    # that each domain can have.
    # The dictionary will look like this:
    # {domain_name: [list of possible ips]}
    domain_ip = {domain: [] for domain in possible_domains}
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        for line in reader:
            if len(line) < max(ip_position, url_position):
                print(f'Not enough items in line {line}, to obtain url or ip')
                continue
            ip = line[ip_position]
            url = line[url_position]
            # Using a Python regular expression to get a domain name
            # from the url.
            domain = re.search('//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
            if domain in domain_ip.keys():
                domain_ip[domain].append(ip)
    return domain_ip

def print_fomatted_result(result):
    # Prints formatted result
    for shop_domain in result.keys():
        print(f'{shop_domain}: ')
        for shop_ip in result[shop_domain]:
            print(f'    {shop_ip}')

def create_list_of_shops():
    # Function that first creates a list of possible domains, and
    # then reads the ips for those domains from the csv
    possible_domains = possible_urls('file2.txt', 'shop', 2, 1)
    shop_domains_with_ip = read_from_csv('file1.csv', 2, 6, possible_domains)
    # Display the result we got in the previous operations
    print(shop_domains_with_ip)
    print_fomatted_result(shop_domains_with_ip)

create_list_of_shops()
Output
A dictionary of IPs where domains are keys, so you can get all possible IPs for a domain by giving the name of that domain:
{'amazon.com': ['102.12.14.22', '167.27.14.62', '167.27.14.62', '197.99.94.32', '157.87.34.72'], 'wakers.com': ['192.10.77.95']}
amazon.com:
    102.12.14.22
    167.27.14.62
    167.27.14.62
    197.99.94.32
    157.87.34.72
wakers.com:
    192.10.77.95
Regular expressions
A very useful thing you can learn from this solution is regular expressions. Regular expressions are tools that allow you to filter or retrieve information from lines in a very convenient way. They also greatly reduce the amount of code, which makes the code more readable and safe.
Let's consider your code for removing ports from strings and think about how we can replace it with a regex:
lines = url.replace(":443", "").replace(":8080", "")
Replacing ports this way is fragile, because you can never be sure which port numbers will actually appear in a url. What if port number 5460 appears, or port number 1022, etc.? For each such port you would add a new replace, and soon your code would look something like this:
lines = url.replace(":443", "").replace(":8080", "").replace(":5460","").replace(":1022","")...
Not very readable. But with a regular expression you can describe a pattern. And the great news is that we actually know the pattern for url port numbers: they all look like :some_digits. So, knowing the pattern, we can describe it with a regular expression and tell Python to find everything that matches it and replace it with the empty string '':
re.sub(':\d+', '', url)
It tells the Python regular expression engine: look for all digits in the string url that come after : and replace them with the empty string. This solution is shorter, safer and much more readable than the chain of replaces, so I suggest you read about regular expressions a little. A great resource for learning regular expressions is this site. Here you can test your regex.
Explanation of the regular expressions in the code
re.sub('www.', '', info_in_line[url_position])
Look for www. in the string info_in_line[url_position] and replace it with the empty string.
re.search('//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
Let's split it into parts:
// - the domain starts after the double slash
[w]?[w]?[w]?\.? - an optional www. prefix, kept out of the capture
[^/] - here could be anything except /
(.[^/]*) - here I used a match group; it tells the engine which part of the match we are interested in
[:|/] - characters that could stand in that place; long story short: after the capturing group there could be : or(|) /
So, summarizing, the regex can be expressed in words as follows:
find the substring that starts after // (skipping an optional www.) and ends with : or /, and return everything that stands between them.
group(1) - means get the first match group.
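For instance, a quick check of that pattern on two URLs from the sample file:
import re

pattern = r'//[w]?[w]?[w]?\.?(.[^/]*)[:|/]'
print(re.search(pattern, 'http://www.google.com:443').group(1))        # google.com
print(re.search(pattern, 'http://www.amazon.com/asdd/asd/').group(1))  # amazon.com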
Hope the answer is helpful!
If you used the URL as the key in a dictionary, and had your IP address sets as the elements of the dictionary, would that achieve what you intended?
my_dict = {
    'www.amazon.com': {
        '102.12.14.22',
        '167.27.14.62',
        '197.99.94.32',
        '157.87.34.72',
    },
    'www.wakers.com': {'192.10.77.95'},
}
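If so, a minimal sketch of building such a mapping from the (domain, ip) pairs the question's loop already produces (a defaultdict of sets is my choice here):
from collections import defaultdict

# e.g. pairs as printed by the question's code; duplicates collapse in the set
pairs = [('www.amazon.com', '102.12.14.22'),
         ('www.amazon.com', '167.27.14.62'),
         ('www.wakers.com', '192.10.77.95'),
         ('www.amazon.com', '167.27.14.62')]

my_dict = defaultdict(set)
for domain, ip in pairs:
    my_dict[domain].add(ip)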
## I have used your code to get your desired output
## Copy-paste the code & execute to get the result
import csv

url_dict = {}

## STEP 1: Open file2.txt to get url names
with open("file2.txt", "r") as f:
    for i in f:
        val = i.strip().split(" ")[1]
        url_dict[val] = []

## STEP 2: 2.1 Open csv file 'file1.csv' to extract url name & ip address
##         2.2 Check if the url from file2.txt is contained in the url extracted from 'file1.csv'
##         2.3 Fill the dictionary with each matched url & its ip addresses
##         2.4 Remove duplicate ip addresses for the same url
with open("file1.csv", 'r') as f:  ## 2.1
    reader = csv.reader(f)
    for k in reader:
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        for key, value in url_dict.items():
            if key in war:  ## 2.2
                url_dict[key].append(srcip)  ## 2.3

## 2.4
for key, value in url_dict.items():
    url_dict[key] = list(set(value))

## STEP 3: Print dictionary output to a .TXT file
file3 = open('output_text.txt', 'w')
for key, value in url_dict.items():
    file3.write('\n' + key + '\n')
    for item in value:
        file3.write(' ' * 15 + item + '\n')
file3.close()

regex to find match in element of list

I'm new to Python and have compiled a list of items from a file; each entry is an element that appeared in the file paired with its frequency in the file, like this:
('95.108.240.252', 9)
It's mostly IP addresses I'm gathering. I'd like to output the address and frequency like this instead:
IP              Frequency
95.108.240.252  9
I'm trying to do this by regexing the list item and printing that, but when I try it returns the following error: TypeError: expected string or bytes-like object.
This is the code I'm using to do all this now:
import re

ips = []  # IP address list
for line in f:
    match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)  # Get all IPs line by line
    if match:
        ips.append(match.group())  # if found add to list

from collections import defaultdict
freq = defaultdict(int)
for i in ips:
    freq[i] += 1  # get frequency of IPs

print("IP\t\t Frequency")  # Print header
freqsort = sorted(freq.items(), reverse=True, key=lambda item: item[1])  # sort in descending frequency
for c in range(0, 4):  # print the 4 most frequent IPs
    # print(freqsort[c])  # This line prints the item like ('95.108.240.252', 9)
    m1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c])  # This is the line returning errors - trying to parse the IP on its own from the list
    print(m1.group())  # Then print it
Not trying to even parse the frequency yet, just wanted the IPs as a starting point
The second parameter to re.search() should be a string, and you are passing a tuple. That is why it generates an error saying it expected a string or buffer.
NOTE: You also need to make sure that there are at least 4 elements in freqsort, otherwise there will be an index-out-of-bounds error.
Delete the last two lines and use this instead:
print(freqsort[c][0])
If you want to stick to your format, you can use the following, but it is of no use:
m1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c][0])
print(m1.group())
Use a bytes pattern instead (this applies if the file is read in binary mode, so that line is a bytes object):
# notice the `b` before the quotes.
match = re.search(b'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)
Try a regex with positive lookbehind and lookahead, applied to the string form of the tuple:
(?<=\(\')(.*)(?=\').*(\d+)
The first captured group will be your IP and the second the frequency.
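A quick demonstration of that idea; note that it relies on applying the pattern to the str() form of the tuple, which is the key assumption:
import re

item = ('95.108.240.252', 9)
m = re.search(r"(?<=\(')(.*)(?=').*(\d+)", str(item))
print(m.group(1), m.group(2))  # 95.108.240.252 9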
You can use the ipaddress module and Counter from the stdlib to assist with this...
from collections import Counter
from ipaddress import ip_address

with open('somefile.log') as fin:
    ips = Counter()
    for line in fin:
        ip, rest_of_line = line.partition(' ')[::2]
        try:
            ips[ip_address(ip)] += 1
        except ValueError:
            pass

print(ips.most_common(4))
This'll also handle both IPv4 and IPv6 style addresses and make sure they're technically correct, not just that they "look" correct. Using a collections.Counter also gives you a .most_common() method to automatically sort by the most frequent and limit the output to n items.

Find and list items from different lists from a file in python

I want to know how to retrieve multiple items from a list if I give the program a specific item of the list. This is how my lists look:
["randomname", "188.xx.xx.xx", "uselessinfo", "2013-09-04 12:03:18"]
["saelyth", "189.xx.xx.xx", "uselessinfoalso", "2013-09-04 12:03:23"]
["randomname2", "121.xxx.xxx.x", "uselessinfoforstackoverflow", "2013-09-04 12:03:25"]
This is intended for a chat bot. The first item is the user name, the second one is the IP, and what I need is to find all names associated with the same IP and then print them or send them to the chat. This is as far as I got:
if message.body.startswith("!Track"):
    vartogetname = vartogetip = None
    filename = "listas\datosdeusuario.txt"
    for line in open(filename, 'r'):
        retrieved = json.loads(line)
        if retrieved[0] == targettotrack:
            vartogetname = retrieved[0]
            vartogetip = retrieved[1]
            break
    # So far it has opened the file, checked my target and got the right IP to track; no issues until here.
    if not vartogetip == None:  # So if it has found an IP for the target...
        print("Tracking "+targettotrack+": Found "+str(vartogetip)+"...")
        for line in open(filename, 'r'):
            retrieved2 = json.loads(line)
            if retrieved2[1] == vartogetip:  # If the IP is found
                if not retrieved2[0] == targettotrack:  # But the name is different...
                    print("I found "+retrieved2[0]+" with the same ip")  # Did work, but printed several separate lines.
                    # HERE I'm lost; read the text and comments after this code.
                    sendtochat("I found "+retrieved2[0]+" with the same ip")  # Only sent 1 name, not all of them :(
    else:
        sendtochat("IP Not found")
Where I said #HERE is where I'd need code to collect the names found into another list (I guess?) so I could then use it in the sendtochat command; however, I must be really tired, because I can't remember how to do it.
I am working with Python 3.3.2 IDLE, and the lists in the file are saved with json, with a \n added at the end of each line for easy reading.
You need to collect your matches in a list, then send that list of matches to your chatbot:
if vartogetip is not None:
    matches = []
    for line in open(filename, 'r'):
        name, ip = json.loads(line)[:2]
        if ip == vartogetip and name != targettotrack:
            matches.append(name)
    if matches:  # matches have been found, the list is not empty
        sendtochat("I found {} with the same ip".format(', '.join(matches)))
The ', '.join(matches) call joins the found names together with commas to make for a nicer and more readable format for the names.
"This is intended for a chat bot. The first item is the user name, the second one is the IP, and what i need is to find all names associated to the same IP and then print them or send it to chat, this is as far as i did get."
Seems like what you want is a dict which maps IP address strings to a list of usernames associated with that IP address. Maybe try something like this:
user_infos = [
    ["randomname", "188.xx.xx.xx", 'data'],
    ["saelyth", "189.xx.xx.xx", 'data'],
    ["randomname2", "121.xxx.xxx.x", 'data'],
    ["saelyth2", "189.xx.xx.xx", 'data']
]

# Mapping of IP address to a list of usernames :)
ip_dict = {}

# Scan through all the users, assign their usernames to the IP address
for u in user_infos:
    ip_addr = u[1]
    if ip_addr in ip_dict:
        ip_dict[ip_addr].append(u[0])
    else:
        ip_dict[ip_addr] = [u[0]]

# We want to query an IP address "189.xx.xx.xx"
# Let's find all usernames with this IP address
for username in ip_dict["189.xx.xx.xx"]:
    print(username)  # Have your chatbot say these names.
That will print both 'saelyth' and 'saelyth2' since they have the same IP address.

Parsing text files using Python

I am very new to Python and am looking to use it to parse a text file. The file has between 250-300 lines of the following format:
---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
---- Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----
I need to store the following information into another file (excel or text) for all the entries from this file
UserName/ID Previous Status New Status Date Time
So my result file should look like this for the above entries:
Mark Grey/mark.grey@gmail.com Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo@gmail.com NaN Available 14/07/2010 16:32:39
Thanks in advance; any help would be really appreciated.
To get you started:
import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+@\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)

with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous")
                # etc.
            ])
        else:
            pass  # Match attempt failed
will get you an array of the parts of the match. I'd then suggest you use the csv module to store the results in a standard format.
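A sketch of that csv step, writing the result rows collected above (the column names are an assumption):
import csv

with open("result.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["UserName/ID", "Previous Status", "New Status", "Date", "Time"])
    writer.writerows(result)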
import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")
with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)
This makes assumptions about whitespace and also assumes that every line conforms to the pattern. You might want to add error checking (such as checking that pat.match() doesn't return None) if you want to handle dirty input gracefully.
The two RE patterns of interest seem to be...:
p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) @ (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) @ (\S+) (\S+) ----$'
so I'd do:
import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)
r1 = re.compile(p1)
r2 = re.compile(p2)
data = []

with open('somefile.txt') as f:
    for line in f:
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow('UserName/ID Previous Status New Status Date Time'.split())
    w.writerows(data)
If the two patterns I described are not general enough, they may need to be tweaked, of course, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatic ad hoc text processing.
Maybe the dislike is explained by others wanting to use REs for absurd purposes such as ad hoc parsing of CSV, HTML, XML, ... and many other kinds of structured text formats for which perfectly good parsers exist! And also for other tasks well beyond REs' "comfort zone", which instead require solid general parser systems like pyparsing. Or, at the other extreme, for super-simple tasks done perfectly well with plain strings (e.g., I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!).
But for the reasonably broad swathe of tasks (excluding the very simplest ones at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend that all programmers learn at least REs' basics.
Alex mentioned pyparsing and so here is a pyparsing approach to your same problem:
from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR, RPAR, AT = map(Suppress, "()@")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
           (statechange1 | statechange2) +
           AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d, mon, yr = map(int, tokens.date.split('/'))
    h, m, s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)
linefmt.setParseAction(convertFields)

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print "%(name)s/%(email)s %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields
prints:
Mark Grey/mark.grey@gmail.com Busy Available 2010-07-14 16:32:36
Silvia Pablo/spablo@gmail.com NULL Available 2010-07-14 16:32:39
pyparsing allows you to attach names to the result fields (just like the named groups in Tom Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed fields - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for processing after parsing with no additional muss nor fuss.
Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:
for line in text.splitlines():
    fields = linefmt.parseString(line)
    print fields.dump()
prints:
['Mark Grey ', 'mark.grey@gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey@gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo@gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo@gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available
I suspect that as you continue to work on this problem, you will find other variations on the format of the input text specifying how the user's state changed. In this case, you would just add another definition like statechange1 or statechange2, and insert it into linefmt with the others. I feel that pyparsing's structuring of the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.
Well, if I were to approach this problem, I'd probably start by splitting each entry into its own separate string. This looks like it might be line-oriented, so an inputfile.split('\n') is probably adequate. From there I would probably craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.
Thanks very much for all your comments, they were very useful. I wrote my code using the directory functionality. What it does is read through the files and create an output file for each user with all their status updates. Here is the code, pasted below.
#Script to extract info from individual data files and print out a data file combining info from these files

import os
import commands

dataFileDir="data/";

#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};

#User id to user name mapping to check for duplicate names
usrId2Name={};

#Store info: keys are user ids and values a dictionary with time stamp keys and status message values
infoDict={};

#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):
    userName="";
    for i in range(mailInd-1,0,-1):
        if info[i].endswith("-") or info[i].endswith("+"):
            break;
        userName=info[i]+" "+userName;
    userName=userName.strip();
    userName=userName.replace("  "," ");
    userName=userName.replace(" ","_");
    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
    timeStamp="";
    for i in range(timeStartInd+1,len(info)):
        timeStamp=timeStamp+" "+info[i];
    timeStamp=timeStamp.replace("-","");
    timeStamp=timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
    msg="";
    for i in range(startInd,endInd):
        msg=msg+" "+info[i];
    msg=msg.strip();
    msg=msg.replace(" ","_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):
    print line;
    info=line.split(" ");
    mailInd=-1;userId="-NONE-";
    timeStartInd=-1;timeStamp="-NONE-";
    becameInd=-1;
    statusMsg="-NONE-";
    #Find indices of email id and "@" char indicating start of timestamp
    for i in range(0,len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("@in.ibm.com)")):
            mailInd=i;
        if(info[i]=="@"):
            timeStartInd=i;
        if(info[i]=="became"):
            becameInd=i;
    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""
    #Extract IBM user id and name for lines with ibm id
    if(mailInd>=0):
        userId=info[mailInd].replace("(","");
        userId=userId.replace(")","");
        userName=getUserName(info,mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle @ 15/07/2010 16:30:18"
    elif(becameInd>0):
        userName=getUserName(info,becameInd);
    #Time stamp info
    if(timeStartInd>=0):
        timeStamp=getTimeStamp(info,timeStartInd);
        if(mailInd>=0):
            statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
        elif(becameInd>0):
            statusMsg=getStatusMsg(info,becameInd,timeStartInd);
    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";
    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
        usrName2Id[userName]=userId;
    #Store status messages keyed by user email ids
    timeDict={};
    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId=usrName2Id[userName];
    #For valid user ids, store the status message in the dict-within-dict data structure
    if not(userId=="-NONE-"):
        if not(userId in infoDict.keys()):
            infoDict[userId]={};
        timeDict=infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp]=statusMsg;
        else:
            timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;

#Print for each user a file containing status
def printStatusFiles(dataFileDir):
    volNum=0;
    for userName in usrName2Id:
        volNum=volNum+1;
        filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename,"w");
        print "Printing output file name: "+filename;
        print volNum,userName,usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");
        timeDict=infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");

#Read and store data from individual data files
def readDataFiles(dataFileDir):
    #Process each datafile
    files=os.listdir(dataFileDir)
    files.sort();
    for i in range(0,len(files)):
    #for i in range(0,1):
        file=files[i];
        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue
        print "Processing data file: "+file
        dataFile=dataFileDir+str(file);
        inpFile=open(dataFile,"r");
        lines=inpFile.readlines();
        #Process lines
        for line in lines:
            #Clean lines
            line=line.strip();
            line=line.replace("/India/Contr/IBM","");
            line=line.strip();
            #Skip the header line of the file and the user's sign in/sign out times
            if(line.startswith("System log for account") or line.find("signed")>-1):
                continue;
            extractLineInfo(line);

print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");
