Parsing text files using Python - python

I am very new to Python and am looking to use it to parse a text file. The file has between 250-300 lines of the following format:
---- Mark Grey (mark.grey#gmail.com) changed status from Busy to Available # 14/07/2010 16:32:36 ----
---- Silvia Pablo (spablo#gmail.com) became Available # 14/07/2010 16:32:39 ----
I need to store the following information into another file (excel or text) for all the entries from this file
UserName/ID Previous Status New Status Date Time
So my result file should look like this for the above entried
Mark Grey/mark.grey#gmail.com Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo#gmail.com NaN Available 14/07/2010 16:32:39
Thanks in advance,
Any help would be really appreciated

To get you started:
result = []
regex = re.compile(
r"""^-*\s+
(?P<name>.*?)\s+
\((?P<email>.*?)\)\s+
(?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
(?P<new>.*?)\s+#\s+
(?P<date>\S+)\s+
(?P<time>\S+)\s+
-*$""", re.VERBOSE)
with open("inputfile") as f:
for line in f:
match = regex.match(line)
if match:
result.append([
match.group("name"),
match.group("email"),
match.group("previous")
# etc.
])
else:
# Match attempt failed
will get you an array of the parts of the match. I'd then suggest you use the csv module to store the results in a standard format.

import re
pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) # (.*?) ----\s*")
with open("data.txt") as f:
for line in f:
(name, email, prev, curr, date) = pat.match(line).groups()
print "{0}/{1} {2} {3} {4}".format(name, email, prev or "NaN", curr, date)
This makes assumptions about whitespace and also assumes that every line conforms to the pattern. You might want to add error checking (such as checking that pat.match() doesn't return None) if you want to handle dirty input gracefully.

The two RE patterns of interest seem to be...:
p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) (\S+) (\S+) ----$'
so I'd do:
import csv, re, sys
# assign p1, p2 as above (or enhance them, etc etc)
r1 = re.compile(p1)
r2 = re.compile(p2)
data = []
with open('somefile.txt') as f:
for line in f:
m = p1.match(line)
if m:
data.append(m.groups())
continue
m = p2.match(line)
if not m:
print>>sys.stderr, "No match for line: %r" % line
continue
listofgroups = m.groups()
listofgroups.insert(2, 'NaN')
data.append(listofgroups)
with open('result.csv', 'w') as f:
w = csv.writer(f)
w.writerow('UserName/ID Previous Status New Status Date Time'.split())
w.writerows(data)
If the two patterns I described are not general enough, they may need to be tweaked, of course, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatical ad hoc text processing.
Maybe the dislike is explained by others wanting to use REs for absurd uses such as ad hoc parsing of CSV, HTML, XML, ... -- and many other kinds of structured text formats for which perfectly good parsers exist! And also, other tasks well beyond REs' "comfort zone", and requiring instead solid general parser systems like pyparsing. Or at the other extreme super-simple tasks done perfectly well with simple strings (e.g. I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!-).
But for the reasonably broad swathe of tasks (excluding the very simplest ones at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend to all programmers to learn at least REs' basics.

Alex mentioned pyparsing and so here is a pyparsing approach to your same problem:
from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime
DASHES = Word('-').suppress()
LPAR,RPAR,AT = map(Suppress,"()#")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")
statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
(statechange1 | statechange2) +
AT + date('date') + time('time') + DASHES)
def convertFields(tokens):
if 'fromstate' not in tokens:
tokens['fromstate'] = 'NULL'
tokens['name'] = tokens.name.strip()
tokens['email'] = tokens.email.strip()
d,mon,yr = map(int, tokens.date.split('/'))
h,m,s = map(int, tokens.time.split(':'))
tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)
linefmt.setParseAction(convertFields)
for line in text.splitlines():
fields = linefmt.parseString(line)
print "%(name)s/%(email)s %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields
prints:
Mark Grey/mark.grey#gmail.com Busy Available 2010-07-14 16:32:36
Silvia Pablo/spablo#gmail.com NULL Available 2010-07-14 16:32:39
pyparsing allows you to attach names to the results fields (just like the named groups in Tom Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed actions - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for processing after parsing with no additional muss nor fuss.
Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:
for line in text.splitlines():
fields = linefmt.parseString(line)
print fields.dump()
prints:
['Mark Grey ', 'mark.grey#gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey#gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo#gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo#gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available
I suspect that as you continue to work on this problem, you will find other variations on the format of the input text specifying how the user's state changed. In this case, you would just add another definition like statechange1 or statechange2, and insert it into linefmt with the others. I feel that pyparsing's structuring of the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.

Well, if i were to approach this problem, probably I'd start by splitting each entry into its own, separate string. This looks like it might be line oriented, so a inputfile.split('\n') is probably adequate. From there I would probably craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.

thanks very much for all your comments. They were very useful. I wrote my code using the directory functionality. What it does is it reads through the file and creates an output file for each of the user with all his status updates. Here is the code pasted below.
#Script to extract info from individual data files and print out a data file combining info from these files
import os
import commands
dataFileDir="data/";
#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};
#User id to user name mapping to check for duplicate names
usrId2Name={};
#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict={};
#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):
userName="";
for i in range(mailInd-1,0,-1):
if info[i].endswith("-") or info[i].endswith("+"):
break;
userName=info[i]+" "+userName;
userName=userName.strip();
userName=userName.replace(" "," ");
userName=userName.replace(" ","_");
return userName;
#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
timeStamp="";
for i in range(timeStartInd+1,len(info)):
timeStamp=timeStamp+" "+info[i];
timeStamp=timeStamp.replace("-","");
timeStamp=timeStamp.strip();
return timeStamp;
#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
msg="";
for i in range(startInd,endInd):
msg=msg+" "+info[i];
msg=msg.strip();
msg=msg.replace(" ","_");
return msg;
#Extract and store info from each line in the datafile
def extractLineInfo(line):
print line;
info=line.split(" ");
mailInd=-1;userId="-NONE-";
timeStartInd=-1;timeStamp="-NONE-";
becameInd="-1";
statusMsg="-NONE-";
#Find indices of email id and "#" char indicating start of timestamp
for i in range(0,len(info)):
#print (str(i)+" "+info[i]);
if(info[i].startswith("(") and info[i].endswith("#in.ibm.com)")):
mailInd=i;
if(info[i]=="#"):
timeStartInd=i;
if(info[i]=="became"):
becameInd=i;
#Debug print of mail and time stamp start inds
"""print "\n";
print "Index of mail id: "+str(mailInd);
print "Index of time start index: "+str(timeStartInd);
print "\n";"""
#Extract IBM user id and name for lines with ibm id
if(mailInd>=0):
userId=info[mailInd].replace("(","");
userId=userId.replace(")","");
userName=getUserName(info,mailInd);
#Lines with no ibm id are of the form "Suraj Godar Mr became idle # 15/07/2010 16:30:18"
elif(becameInd>0):
userName=getUserName(info,becameInd);
#Time stamp info
if(timeStartInd>=0):
timeStamp=getTimeStamp(info,timeStartInd);
if(mailInd>=0):
statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
elif(becameInd>0):
statusMsg=getStatusMsg(info,becameInd,timeStartInd);
print userId;
print userName;
print timeStamp
print statusMsg+"\n";
if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
usrName2Id[userName]=userId;
#Store status messages keyed by user email ids
timeDict={};
#Retrieve user id corresponding to user name
if userName in usrName2Id:
userId=usrName2Id[userName];
#For valid user ids, store status message in the dict within dict data str arrangement
if not(userId=="-NONE-"):
if not(userId in infoDict.keys()):
infoDict[userId]={};
timeDict=infoDict[userId];
if not(timeStamp in timeDict.keys()):
timeDict[timeStamp]=statusMsg;
else:
timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;
#Print for each user a file containing status
def printStatusFiles(dataFileDir):
volNum=0;
for userName in usrName2Id:
volNum=volNum+1;
filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
file = open(filename,"w");
print "Printing output file name: "+filename;
print volNum,userName,usrName2Id[userName]+"\n";
file.write(userName+" "+usrName2Id[userName]+"\n");
timeDict=infoDict[usrName2Id[userName]];
for time in sorted(timeDict.keys()):
file.write(time+" "+timeDict[time]+"\n");
#Read and store data from individual data files
def readDataFiles(dataFileDir):
#Process each datafile
files=os.listdir(dataFileDir)
files.sort();
for i in range(0,len(files)):
#for i in range(0,1):
file=files[i];
#Do not process other non-data files lying around in that dir
if not file.endswith(".txt"):
continue
print "Processing data file: "+file
dataFile=dataFileDir+str(file);
inpFile=open(dataFile,"r");
lines=inpFile.readlines();
#Process lines
for line in lines:
#Clean lines
line=line.strip();
line=line.replace("/India/Contr/IBM","");
line=line.strip();
#Skip header line of the file and L's sign in sign out times
if(line.startswith("System log for account") or line.find("signed")>-1):
continue;
extractLineInfo(line);
print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");

Related

How to get the same name with multiple value get unique results in Python

I have a large csv file that compares the URLs of my txt files
How to get the same name with multiple value get unique results in Python and Is there a way to better compare the speed of two files? because it has a minimum large csv file of 1 gb
file1.csv
[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
file2.txt
1 www.amazon.com shop
1 wakers.com shop
script:
import csv
with open("file1.csv", 'r') as f:
reader = csv.reader(f)
for k in reader:
ko = set()
srcip = k[2]
url = k[6]
lines = url.replace(":443", "").replace(":8080", "")
war = lines.split("//")[-1].split("/")[0].split('?')[0]
ko.add((war,srcip))
for to in ko:
with open("file2.txt", "r") as f:
all_val = set()
for i in f:
val = i.strip().split(" ")[1]
if val in to[0]:
all_val.add(to)
for ki in all_val:
print(ki)
my output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')
how to get if the url is the same, get the total value with a unique value
how to get results like this?
amazon.com 102.12.14.22
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com 192.10.77.95
Short answer: you can't directly do so. Well you can but with low performances.
CSV is a good storing format but if you want to do something like that you might want to store everything in another custom data file. you could first parse your file to have only Unique IDs instead of long strings (like amazon = 0, wakers = 1 and so on) to perform better and reduce compare cost.
The thing is, those thing are pretty bad for variable csv, memory mapping or building a database from your csv might also be great though (and making the changes on the database, only dumping the csv when you need to)
look at: How do quickly search through a .csv file in Python for a more complete answer.
Problem solution
import csv
import re
def possible_urls(filename, category, category_position, url_position):
# Here we will read a txt file to create a list of domains, that could correspond to shops
domains = []
with open(filename, "r") as file:
file_content = file.read().splitlines()
for line in file_content:
info_in_line = line.split(" ")
# Here i use a regular expression, to prase domain from url.
domain = re.sub('www.', '', info_in_line[url_position])
if info_in_line[category_position] == category:
domains.append(domain)
return domains
def read_from_csv(filename, ip_position, url_position, possible_domains):
# Here we will create a dictionary, where will
# all ips that this domain can have.
# Dictionary will look like this:
# {domain_name: [list of possible ips]}
domain_ip = {domain: [] for domain in possible_domains}
with open(filename, 'r') as f:
reader = csv.reader(f)
for line in reader:
if len(line) < max(ip_position, url_position):
print(f'Not enough items in line {line}, to obtain url or ip')
continue
ip = line[ip_position]
url = line[url_position]
# Using python regular expression to get a domain name
# from url.
domain = re.search('//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
if domain in domain_ip.keys():
domain_ip[domain].append(ip)
return domain_ip
def print_fomatted_result(result):
# Prints formatted result
for shop_domain in result.keys():
print(f'{shop_domain}: ')
for shop_ip in result[shop_domain]:
print(f' {shop_ip}')
def create_list_of_shops():
# Function that first creates a list of possible domains, and
# then read ip for that domains from csv
possible_domains = possible_urls('file2.txt', 'shop', 2, 1)
shop_domains_with_ip = read_from_csv('file1.csv', 2, 6, possible_domains)
# Display result, we get in previous operations
print(shop_domains_with_ip)
print_fomatted_result(shop_domains_with_ip)
create_list_of_shops()
Output
Dictionary of ip's where domains are keys, so you can get all possible ip's for domain by giving a name of that domain:
{'amazon.com': ['102.12.14.22', '167.27.14.62', '167.27.14.62', '197.99.94.32', '157.87.34.72'], 'wakers.com': ['192.10.77.95']}
amazon.com:
102.12.14.22
167.27.14.62
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com:
192.10.77.95
Regular expressions
A very useful thing you can learn from the solution is regular expressions. Regular expressions are tools that allow you to filter or retrieve information from lines in a very convenient way. It also greatly reduces the amount of code, which makes the code more readable and safe.
Let's consider your code of removing ports from strings and think how we can replace it with regex.
lines = url.replace(":443", "").replace(":8080", "")
Replacing of ports in such way is vulnerable, because you never can be sure, what port numbers can actually be in url. What if there will appear port number 5460, or port number 1022, etc. For each of such ports you will add new replaces and soon your code will look something like this
lines = url.replace(":443", "").replace(":8080", "").replace(":5460","").replace(":1022","")...
Not very readable. But with regular experssion you can describe a pattern. And the great news is that we actually know pattern for url with port numbers. They all looking like this:
:some_digits. So if we know pattern we can describe it with regular expression, and tell python to find everything, that match it and replace with empty string '':
re.sub(':\d+', '', url)
It tells to python regular expression engine:
Look for all digits in string url, that goes after : and replace them with empty string. This solution is shorter, safer and a way more readable then solution with replace chain, so I suggest you to read about them a little. Great resource to learn about regular expressions is
this site. Here you can test your regex.
Explanation of Regular expressions in code
re.sub('www.', '', info_in_line[url_position])
Look for all www. in string info_in_line[url_position] and replace it with empty string.
re.search('www.(.[^/]*)[:|/]', url).group(1)
Let's split it on parts:
[^/] - here could be everything except /
(.[^/]*) - Here i used match group. It tells to engine where solution we intersted in will be.
[:|/] - it means characters that could stay on that place. Long story short: after capturing group could be : or(|) /.
So summarizing. Regex can be expressed in words as follows:
Find all substrings, that starts with www., and ends with : or \ and return me everything that stadns between them.
group(1) - means get the first match.
Hope answer will be helpful!
If you used the URL as the key in a dictionary, and had your IP address sets as the elements of the dictionary, would that achieve what you intended?
my_dict = {
'www.amazon.com' = {
'102.12.14.22',
'167.27.14.62',
'197.99.94.32',
'157.87.34.72',
},
'www.wakers.com' = {'192.10.77.95'},
}
## I have used your code & Pandas to get your desired output
## Copy paste the code & execute to get the result
import csv
url_dict = {}
## STEP 1: Open file2.txt to get url names
with open("file2.txt", "r") as f:
for i in f:
val = i.strip().split(" ")[1]
url_dict[val] = []
## STEP 2: 2.1 Open csv file 'file1.csv' to extract url name & ip address
## 2.2 Check if url from file2.txt is available from the extracted url from 'file1.csv'
## 2.3 Create a dictionary with the matched url & its ip address
## 2.4 Remove duplicates in ip addresses from same url
with open("file1.csv", 'r') as f: ## 2.1
reader = csv.reader(f)
for k in reader:
#ko = set()
srcip = k[2]
#print(srcip)
url = k[6]
lines = url.replace(":443", "").replace(":8080", "")
war = lines.split("//")[-1].split("/")[0].split('?')[0]
for key, value in url_dict.items():
if key in war: ## 2.2
url_dict[key].append(srcip) ## 2.3
## 2.4
for key, value in url_dict.items():
url_dict[key] = list(set(value))
## STEP 3: Print dictionary output to .TXT file
file3 = open('output_text.txt', 'w')
for key, value in url_dict.items():
file3.write('\n' + key + '\n')
for item in value:
file3.write(' '*15 + item + '\n')
file3.close()

Parsing mixed text file with a lot header info

I'm an amateur astronomer (and retired) and just messing around with an idea. I want to scrape data from a NASA website that is a text file and extract specific values to help me identify when to observe. The text file automatically updates every 60 seconds. The text file has some header information before the actual rows and columns of data that I need to process. The actual data is numerical. Example:
Prepared by NASA
Please send comments and suggestions to xxx.com
Date Time Year
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
I want to access the string numerical data and convert it into a double
From what I can see the file is space delimited
I want to poll the website every 60 seconds and if Value 1 and Value 2 are above a specific threshold that will trigger a PyAutoGUI to automate a software application to take an image.
After reading the file from the website I tried converting the file into a dictionary thinking that I could then map keys to values, but I can't predict the exact location that I need. I thought once I extract the values I need I would write the file then try to convert the string into a double or float
I have tried to use
import re
re.split
to read each line and split out info but I get a huge mess because of the header information
I wanted to use a simple approach to open the file and this works
import urllib
import csv
data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
print (data)
I found this on Stack overflow but I don't understand how I would use this
file = open('abc.txt','r')
while 1:
a = file.readline()
if a =='': break
a = a.split() #This creates a list of the input
name = a[0]
value = int(a[1]) # or value=float(a[1]) whatever you want
#use the name and value howsoever
f.close()
What I want is it to extract Value 1 and Value 2 as a double or float than in Part II (which I have not yet even started) I will compare Value 1 and Value 2 and if they are about a specific threshold this would trigger a PyAutoGUI to interact with my imaging software and trigger taking an image.
Here's a simple example of using regular expressions. This assumes you'd read the whole file into memory with a single f.read() rather that bothering to process individual lines, which with regular expressions, is often the simpler way to go (and I'm lazy and didn't want to have to create a test file):
import re
data = """
blah
blah
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
blah
"""
pattern = re.compile(r"(\d+) (\d+) (\d+) (\d+) (\d+) ([^\s]+) ([^\s]+)")
def main():
m = pattern.search(data)
if m:
# Do whatever processing you want to do here. You have access to all 7 input
# fields via m.group(1-7)
d1 = float(m.group(6))
d2 = float(m.group(7))
print(">{}< >{}<".format(d1, d2))
else:
print("no match")
main()
Output:
>6e-09< >1e-09<
You'd want to tweak this a bit if I've made a wrong assumption about the input data, but this gives you the general idea anyway.
This should handle just about anything else that exists in the input as long as nothing else does that looks like that one line you're interested in.
UPDATE:
I can't leave well enough alone. Here's code that pulls the data from the URL you provide and processes all the matching lines:
import re
import urllib
pattern = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)")
def main():
data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
pos = 0
while True:
m = pattern.search(data, pos)
if not m:
break
pos = m.end()
# Do whatever processing you want to do here. You have access to all 8 input
# fields via m.group(1-8)
f1 = float(m.group(7))
f2 = float(m.group(8))
print(">{}< >{}<".format(f1, f2))
main()
Result:
>9.22e-09< >1e-09<
>1.06e-08< >1e-09<
...
>8.99e-09< >1e-09<
>1.01e-08< >1e-09<
This was a fun little challenge, I've pulled all the data out of the table for you, mapped it to class and converted the data to int and Decimal as appropriate. Once it's populated you can read all the data you want from it.
To get the data I've used the requests library, instead of urllib, that's merely personal preference. You could use pip install requests, if you wanted to use it too. It has a method iter_lines that can traverse the rows of data.
This may be overkill for what you need, but as I wrote it anyway I thought I'd post it for you.
import re
from datetime import datetime
from decimal import Decimal
import requests
class SolarXrayFluxData:
def __init__(
self,
year,
month,
day,
time,
modified_julian_day,
seconds_of_the_day,
short,
long
):
self.date = datetime(
int(year), int(month), int(day), hour=int(time[:2]), minute=int(time[2:])
)
self.modified_julian_day = int(modified_julian_day)
self.seconds_of_the_day = int(seconds_of_the_day)
self.short = Decimal(short)
self.long = Decimal(long)
class GoesXrayFluxPrimary:
def __init__(self):
self.created_at = ''
self.data = []
def extract_data(self, url):
data = requests.get(url)
for i, line in enumerate(data.iter_lines(decode_unicode=True)):
if line[0] in [':', '#']:
if i is 1:
self.set_created_at(line)
continue
row_data = re.findall(r"(\S+)", line)
self.data.append(SolarXrayFluxData(*row_data))
def set_created_at(self, line):
date_str = re.search(r'\d{4}\s\w{3}\s\d{2}\s\d{4}', line).group(0)
self.created_at = datetime.strptime(date_str, '%Y %b %d %H%M')
if __name__ == '__main__':
goes_xray_flux_primary = GoesXrayFluxPrimary()
goes_xray_flux_primary.extract_data('https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt')
print("Created At: %s" % goes_xray_flux_primary.created_at)
for row in goes_xray_flux_primary.data:
print(row.date)
print("%.12f, %.12f" % (row.short, row.long))
The intention of the SolarXrayFluxData class is to store each items data and to make sure it is in a nice usable format. While the GoesXrayFluxPrimary class is used populate a list of SolarXrayFluxData and to store any other data that you might want to pull out. For example I've grabbed the Created date and time. You could also get the Location and Source from the header data.

Counting date occurrences in python?

I'm currently trying to count the number of times a date occurs within a chat log for example the file I'm reading from may look something like this:
*username* (mm/dd/yyyy hh:mm:ss): *message here*
However I need to split the date from the time as I currently treat them as one. Im currently struggling to solve my problem so any help is appreciated. Down below is some sample code that I'm currently using to try get the date count working. Im currently using a counter however I'm wondering if there are other ways to count dates.
filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt") ,))
mtxtr = filename.read()
date = []
number = []
occurences = Counter(date)
mtxtformat = mtxtr.split("\r\n")
print 'The Dates in the chat are as follows'
print "--------------------------------------------"
for mtxtf in mtxtformat:
participant = mtxtf.split("(")[0]
date = mtxtf.split("(")[-1]
message = date.split(")")[0]
date.append(date1.strip())
for item in date:
if item not in number:
number.append(item)
for item in number:
occurences = date.count(item)
print("Date Occurences " + " is: " + str(occurences))
Easiest way would be to use regex and take the count of the date pattern you have in the log file. It would be faster too.
If you know the date and time are going to be enclosed in parentheses at the start of the message (i.e. no parentheses (...): will be seen before the one containing the date and time):
*username* (mm/dd/yyyy hh:mm:ss): *message here*
Then you can extract based on the parens:
import re
...
parens = re.compile(r'\((.+)\)')
for mtxtf in mtxtformat:
match = parens.search(mtxtf)
date.append(match.group(1).split(' ')[0])
...
Note: If the message itself contains parens, this may match more than just the needed (mm/dd/yyyy hh:mm:ss). Doing match.group(1).split(' ')[0] would still give you the information you are looking for assuming there is no information enclosed in parens before your date-time information (for the current line).
Note2: Ideally enclose this in a try-except to continue on to the next line if the current line doesn't contain useful information.

Match lines from a file and parse them on Python

I have this file with different lines, and I want to take only some information from each line (not the whole of it) here is a sample of how the file looks like:
18:10:12.960404 IP 132.227.127.62.12017 > 134.157.0.129.53: 28192+ A? safebrowsing-cache.google.com. (47)
18:10:12.961114 IP 134.157.0.129.53 > 132.227.127.62.12017: 28192 12/4/4 CNAME safebrowsing.cache.l.google.com., A 173.194.40.102, A 173.194.40.103, A 173.194.40.104, A 173.194.40.105, A 173.194.40.110, A 173.194.40.96, A 173.194.40.97, A 173.194.40.98, A 173.194.40.99, A 173.194.40.100, A 173.194.40.101 (394)
18:13:46.206371 IP 132.227.127.62.49296 > 134.157.0.129.53: 47153+ PTR? b._dns-sd._udp.upmc.fr. (40)
18:13:46.206871 IP 134.157.0.129.53 > 132.227.127.62.49296: 47153 NXDomain* 0/1/0 (101)
18:28:57.253746 IP 132.227.127.62.54232 > 134.157.0.129.53: 52694+ TXT? time.apple.com. (32)
18:28:57.254647 IP 134.157.0.129.53 > 132.227.127.62.54232: 52694 1/8/8 TXT "ntp minpoll 9 maxpoll 12 iburst" (381)
.......
.......
It is actually the output of a DNS request, so from it I want to extract these elements:
[timestamp], [srcip], [src prt], [dst ip], [dst prt], [domaine (if existed)], [related ips addresses]
After looking in the website old topics, I found that the re.match() is a great and helpful way to do that, but since as you see every line is different of the other, I am kind of lost, some help would be great, here is the code I wrote so far and it is correct:
def extractDNS(filename):
objList = []
obj = {}
with open(filename) as fi:
for line in fi:
line = line.lower().strip()
#18:09:29.960404
m = re.match("(\d+):(\d+):(\d+.\d+)",line)
if m:
obj = {} #New object detected
hou = int(m.group(1))
min = int(m.group(2))
sec = float(m.group(3))
obj["time"] = (hou*3600)+(min*60)+sec
objList.append(obj)
#IP 134.157.0.129.53
m=re.match("IP\s+(\d{1,3}\.\d{1,3}\.\d{1,3}.\d{1,3}).(\d+)",bb)
if m:
obj["dnssrcip"] = m.group(1)
obj["dnssrcport"] = m.group(2)
# > 134.157.0.129.53:
m = re.match("\s+>\s+(\d{1,3}\.\d{1,3}\.\d{1,3}.\d{1,3}).(\d+):",line)
if m:
obj["dnsdstip"] = m.group(1)
obj["dnsdstport"] = m.group(2)
tstFile3=open("outputFile","w+")
tstFile3.write("%s\n" %objList)
tstFile3.close()
extractDNS(sys.argv[1])
I know I have to make if else statements after this, because what comes after them is different every time, and I showed in the 3 cases I get generaly in every dns output file, which are :
- A? followed by CNAME, the exact domain, and the IP addresses,
- PTR? followed by a NXDOmain, means the domain is non-existant, so I will just ignore this line,
- TXT? followed by a domain, but it only gives words, so I ll ignore this one two
I only want the request that their responses contain IP addresses, which are in this case the A?
If you know the first 5 columns are always present, why don't you just split the line up and handle those directly (use datetime for the timestamp, and manually parse the IP addresses/ports). Then, you could use your regular expression to match only the CNAME records and the contents you are interested in from that one field. There is no need to have a regular expression scan over the different possibilities if you aren't going to actually use the output. So, if it doesn't match the CNAME form, then you don't care how it would be handled. At least, that's what it sounds like.
As user632657 said above, there's no need to care about the lines you, well, don't care about. Just use one regular expression for that line, and if it doesn't match, ignore that line:
pattern = re.compile( r'(\d{2}):(\d{2}):(\d{2}\.\d+)\s+IP\s+(\d+\.\d+\.\d+\.\d+)\.(\d+)\s+>\s+(\d+\.\d+\.\d+\.\d+)\.(\d+):\s+(\d+).*?CNAME\s+([^,]+),\s+(.*?)\s+\(\d+\)' )
That will match the CNAME records only. You only need to define that once, outside your loop. Then, within the loop:
try:
hour, minute, seconds, source_ip, source_port, dst_ip, dst_port, domain, records = pattern.match( line ).groups()
except:
continue
records = [ r.split() for r in records.split( ', ' ) ]
This will pull all the fields you've asked about in the relevant variables, and parse the associated IPs into a list of pairs (class, IP), which I figure will be useful :P

Analysing a text file in Python

I have a text file that needs to be analysed. Each line in the file is of this form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3
I need to skip the timestamp and the (slbfd) and only keep a count of the lines with the IN and OUT. Further, depending on the name in quotes, I need to increase a variable count for different variables if a line starts with OUT and decrease the variable count otherwise. How would I go about doing this in Python?
The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:
S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela#nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj#nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj#nabwmps3'''
from pyparsing import *
from collections import defaultdict
# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))
# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
for x in P:
if x.flag=="IN": counts[x.name] += 1
if x.flag=="OUT": counts[x.name] -= 1
for key in counts:
print key, counts[key]
This gives as output:
lq_viz_server 1
OFM32 -1
Which would look more impressive if your sample log file was longer. The beauty of a pyparsing solution is the ability to adapt to a more complex query in the future (ex. grab and parse the timestamp, pull email address, parse error codes...). The idea is that you write the grammar independent of the query - you simply convert the raw text to a computer friendly format, abstracting away the parsing implementation away from it's usage.
If I consider that the file is divided into lines (I don't know if it's true) you have to apply split() function to each line. You will have this:
["7:06:32", "(slbfd)", "IN:", "lq_viz_server", "aqeela#nabltas1"]
And then I think you have to be capable of apply any logic comparing the values that you need.
i made some wild assumptions about your specification and here is a sample code to help you start:
objects = {}
with open("data.txt") as data:
for line in data:
if "IN:" in line or "OUT:" in line:
try:
name = line.split("\"")[1]
except IndexError:
print("No double quoted name on line: {}".format(line))
name = "PARSING_ERRORS"
if "OUT:" in line:
diff = 1
else:
diff = -1
try:
objects[name] += diff
except KeyError:
objects[name] = diff
print(objects) # for debug only, not advisable to print huge number of names
You have two options:
Use the .split() function of the string (as pointed out in the comments)
Use the re module for regular expressions.
I would suggest using the re module and create a pattern with named groups.
Recipe:
first create a pattern with re.compile() containing named groups
do a for loop over the file to get the lines use .match() od the
created pattern object on each line use .groupdict() of the
returned match object to access your values of interest
In the mode of just get 'er done with the standard distribution, this works:
import re
from collections import Counter
# open your file as inF...
count=Counter()
for line in inF:
match=re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
if match:
if match.group(1) == 'IN': count[match.group(2)]+=1
elif match.group(1) == 'OUT': count[match.group(2)]-=1
print(count)
Prints:
Counter({'lq_viz_server': 1, 'OFM32': -1})

Categories