I want to process a log file that contains event entries, but only today's lines.
The log file looks like this:
Aug 23 07:23:05 iZk1a211s8hkb4hkecu7w1Z sshd[19569]: Invalid user test from 10.148.0.13 port 48382
...
Sep 20 07:23:06 iZk1a211s8hkb4hkecu7w1Z sshd[19569]: Failed password for invalid user test from 10.148.0.13 port 48382 ssh2
...
Aug 23 07:23:07 iZk1a211s8hkb4hkecu7w1Z sshd[19564]: Failed password for invalid user sysadm from 10.148.0.13 port 48380 ssh2
...
Oct 15 07:23:09 iZk1a211s8hkb4hkecu7w1Z sshd[19573]: Invalid user sinusbot from 10.148.0.13 port 48384
...
Sep 08 07:23:11 iZk1a211s8hkb4hkecu7w1Z sshd[19573]: Failed password for invalid user sinusbot from 10.148.0.13 port 48384 ssh2
...
Nov 01 07:23:16 iZk1a211s8hkb4hkecu7w1Z sshd[19587]: Invalid user smkim from 10.148.0.13 port 48386
...
Nov 12 07:23:18 iZk1a211s8hkb4hkecu7w1Z sshd[19587]: Failed password for invalid user smkim from 10.148.0.13 port 48386 ssh2
How do I grab today's lines from the log?
I've tried this and got stuck on finding the pattern:
from datetime import date

today = date.today()
today = today.strftime("%B %d")
with open('file.log', 'r') as f:
    for line in f:
        date = line.find("*idk I'm stuck at this point*")
        if date == today:
            *run my process script*
Does anyone have any suggestions?
You need to extract the part of the line containing the date, parse it as a datetime and convert it to a date. Note that the log uses the abbreviated month name, so the format is `%b %d %H:%M:%S` (not the `%B` your code tried), and the missing year has to be filled in:
from datetime import date, datetime

today = date.today()
with open('file.log', 'r') as f:
    for line in f:
        line_date = datetime.strptime(line[:15], "%b %d %H:%M:%S").date().replace(year=today.year)
        if line_date == today:
            ...  # *run my process script*
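For context on the `replace(year=...)` step: `strptime` fills in 1900 when the format string has no year field, so comparing the parsed date directly against `date.today()` would never match. A quick demonstration:

```python
from datetime import datetime, date

# Parsing a syslog-style timestamp without a year defaults the year to 1900
parsed = datetime.strptime("Aug 23 07:23:05", "%b %d %H:%M:%S")
print(parsed.year)  # 1900

# Substituting the current year makes an equality test against today's date possible
fixed = parsed.date().replace(year=date.today().year)
print(fixed.month, fixed.day)  # 8 23
```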
Related
When I run fabric.operations.sudo to get info from a remote VM (its kernel is 4.14.35 EL7.6), such as "date +%s", the expected result would be "1549853543", but in my test it's "Last login: Mon Feb 11 02:53:18 UTC 2019 on pts/0\r\n1549853543".
I have run the command "ssh user@vm 'date +%s'", and the result is normal (only the number).
Does anyone know the reason? I have also set "PrintLastLog" to "no" in /etc/ssh/sshd_config.
result = sudo('date +%s').stdout.strip()
run_time = int(result) => exception occurs
Expected: 1549853543
Actual: invalid literal for int() with base 10: 'Last login: Mon Feb 11 02:53:18 UTC 2019 on pts/0\r\n1549853543'
Fix these two places and the last-login info should disappear:
1: /etc/pam.d/system-auth:
session required pam_lastlog.so silent showfailed
2: /etc/ssh/sshd_config:
# Per CCE-80225-6: Set PrintLastLog yes in /etc/ssh/sshd_config
PrintLastLog no
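If changing the server configuration isn't an option, a client-side workaround is to keep only the last line of the captured output before converting it. This is a sketch, assuming the epoch timestamp is always the final line of what sudo() returns:

```python
# Captured output reproduced from the question; sudo('date +%s').stdout would return this string
raw = 'Last login: Mon Feb 11 02:53:18 UTC 2019 on pts/0\r\n1549853543'

# Keep only the last non-empty line, which holds the `date +%s` result
run_time = int(raw.strip().splitlines()[-1])
print(run_time)  # 1549853543
```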
I have a file full of hundreds of un-separated tweets all formatted like so:
{"text": "Just posted a photo @ Navarre Conference Center", "created_at": "Sun Nov 13 01:52:03 +0000 2016", "coordinates": [-86.8586, 30.40299]}
I am trying to split them up so I can assign each part to a variable.
The text
The timestamp
The location coordinates
I was able to split the tweets up using .split('{}') but I don't really know how to split the rest into the three things that I want.
My basic idea that didn't work:
file = open('tweets_with_time.json', 'r')
line = file.readline()
for line in file:
    line = line.split(',')
    message = line[0]
    timestamp = line[1]
    position = line[2]
    # just to test if it's working
    print(position)
Thanks!
I just downloaded your file; it's not as bad as you said. Each tweet is on a separate line. It would be nicer if the file were a JSON list, but we can still parse it fairly easily, line by line. Here's an example that extracts the first 10 tweets.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
Unfortunately, I can't show the output of this script: Stack Exchange won't allow me to put those shortened URLs into a post.
Here's a modified version that cuts off each message at the URL.
import json

fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line to a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Remove the URL from the message
        idx = message.find('https://')
        if idx != -1:
            message = message[:idx]
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
output
1
Message: Just posted a photo @ Navarre Conference Center
Timestamp: Sun Nov 13 01:52:03 +0000 2016
Position: [-86.8586, 30.40299]
2
Message: I don't usually drink #coffee, but I do love a good #Vietnamese drip coffee with condense milk…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-123.04437109, 49.26211779]
3
Message: #bestcurry [emoji] #johanvanaarde #kauai #rugby #surfing…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-159.4958861, 22.20321232]
4
Message: #thatonePerezwedding [emoji] @ Scenic Springs
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-98.68685568, 29.62182898]
5
Message: Miami trends now: Heat, Wade, VeteransDay, OneLetterOffBands and TheyMightBeACatfishIf.
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-80.19240081, 25.78111669]
6
Message: Thank you family for supporting my efforts. I love you all!…
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-117.83012, 33.65558157]
7
Message: If you're looking for work in #HONOLULU, HI, check out this #job:
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-157.7973653, 21.2868901]
8
Message: Drinking a L'Brett d'Apricot by @CrookedStave @ FOBAB
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-87.6455, 41.8671]
9
Message: Can you recommend anyone for this #job? Barista (US) -
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-121.9766823, 38.350109]
10
Message: He makes me happy @ Frank and Bank
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-75.69360487, 45.41268776]
It looks like well-formatted JSON data. Try the following:
import json
from pprint import pprint
file_ptr = open('tweets_with_time.json', 'r')
data = json.load(file_ptr)
pprint(data)
It should parse your data into a nice Python dictionary. You can access the elements by their names like:
# Return the first 'coordinates' data point as a list of floats
data[0]["coordinates"]
# Return the 5th 'text' data point as a string
data[4]["text"]
This script displays how many attacks occur per hour, per day. I want it to also count by IP address, so it shows the attacking IP addresses per hour, per day.
from itertools import groupby

# Open the auth.log for reading
myAuthlog = open('auth.log', 'r')
# Go through the log file line by line, keeping only lines containing 'Failed password for'
myAuthlog = (line for line in myAuthlog if "Failed password for" in line)
# Group consecutive lines that share the same date and hour
for key, group in groupby(myAuthlog, key=lambda x: x[:9]):
    month, day, hour = key[0:3], key[4:6], key[7:9]
    # Print the results in an understandable format: date, time, then number of attacks
    print("On%s-%s at %s:00 There was %d attacks" % (day, month, hour, len(list(group))))
The log file looks like this:
Feb  3 13:34:05 j4-be02 sshd[676]: Failed password for root from 85.17.188.70 port 48495 ssh2
Feb  3 21:45:18 j4-be02 sshd[746]: Failed password for invalid user test from 62.45.87.113 port 50636 ssh2
Feb  4 08:39:46 j4-be02 sshd[1078]: Failed password for root from 1.234.51.243 port 60740 ssh2
An example of the output from the code I have:
On 3-Feb at 21:00 There was 1 attacks
On 4-Feb at 08:00 There was 15 attacks
On 4-Feb at 10:00 There was 60 attacks
from itertools import groupby
import re

myAuthlog = open('dict.txt', 'r')
myAuthlog = (line for line in myAuthlog if "Failed password for" in line)
for key, group in groupby(myAuthlog, key=lambda x: x[:9] + re.search('from(.+?) port', x).group(1)):
    month, day, hour, ip = key[0:3], key[4:6], key[7:9], key[10:]
    print("On%s-%s at %s:00 There was %d attacks FROM IP %s" % (day, month, hour, len(list(group)), ip))
Log file:
Feb  3 13:34:05 j4-be02 sshd[676]: Failed password for root from 85.17.188.70 port 48495 ssh2
Feb  3 21:45:18 j4-be02 sshd[746]: Failed password for invalid user test from 62.45.87.113 port 50636 ssh2
Feb  4 08:39:46 j4-be02 sshd[1078]: Failed password for root from 1.234.51.243 port 60740 ssh2
Feb  4 08:53:46 j4-be02 sshd[1078]: Failed password for root from 1.234.51.243 port 60740 ssh2
output:
On 3-Feb at 13:00 There was 1 attacks FROM IP 85.17.188.70
On 3-Feb at 21:00 There was 1 attacks FROM IP 62.45.87.113
On 4-Feb at 08:00 There was 2 attacks FROM IP 1.234.51.243
Since you already know how to get the log lines per hour per day, use the following to count attacks per IP. This is not a complete solution, just the counting part.
from collections import defaultdict
import re

ip_count = defaultdict(int)
with open('logfile') as data:
    for line in data:
        ip_count[re.findall(r'.*from (.*) port.*', line)[0]] += 1

for ip, count in ip_count.items():
    print(ip, count)
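To get what the asker ultimately wants (a count per IP per hour per day, matching the groupby output above), one option is to key a Counter on a (month, day, hour, ip) tuple. A sketch, with the sample log lines from the question inlined; in practice the lines would come from auth.log:

```python
import re
from collections import Counter

# Sample lines from the question; in practice, iterate over open('auth.log')
log_lines = [
    "Feb  3 13:34:05 j4-be02 sshd[676]: Failed password for root from 85.17.188.70 port 48495 ssh2",
    "Feb  3 21:45:18 j4-be02 sshd[746]: Failed password for invalid user test from 62.45.87.113 port 50636 ssh2",
    "Feb  4 08:39:46 j4-be02 sshd[1078]: Failed password for root from 1.234.51.243 port 60740 ssh2",
    "Feb  4 08:53:46 j4-be02 sshd[1078]: Failed password for root from 1.234.51.243 port 60740 ssh2",
]

attacks = Counter()
for line in log_lines:
    if "Failed password for" not in line:
        continue
    month, day, time = line.split()[:3]  # e.g. 'Feb', '3', '13:34:05'
    hour = time[:2]
    ip = re.search(r'from (\S+) port', line).group(1)
    attacks[(month, day, hour, ip)] += 1

for (month, day, hour, ip), count in sorted(attacks.items()):
    print("On %s-%s at %s:00 there were %d attacks from IP %s" % (day, month, hour, count, ip))
```

Unlike groupby, a Counter doesn't require the log to be sorted, since it accumulates over the whole file.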
In Python, with TwitterSearch, I'm able to get the timestamp of the tweet in UTC time, in the following format :
Thu Mar 19 12:37:15 +0000 2015
However, I would like to obtain it automatically in the EST timezone (UTC - 4), in this format :
2015-03-19 08:37:15
Here is a sample of my code. What should I change in it for an automatic conversion?
for tweet in ts.search_tweets_iterable(tso):
    lat = None
    long = None
    user = tweet['user']['screen_name']
    user_creation = tweet['user']['created_at']
    created_at = tweet['created_at']  # UTC time when Tweet was created
    favorite = tweet['favorite_count']
    retweet = tweet['retweet_count']
    id_status = tweet['id']
    in_reply_to = tweet['in_reply_to_screen_name']
    followers = tweet['user']['followers_count']  # number of followers
    statuses_count = tweet['user']['statuses_count']  # number of statuses posted
    location = tweet['user']['location']  # the user's stated location
    tweet_text = tweet['text'].strip()  # these two lines remove unnecessary whitespace
    tweet_text = ''.join(tweet_text.splitlines())
    print i, created_at, user_creation, user, tweet_text
    if tweet['geo'] and tweet['geo']['coordinates'][0]:
        lat, long = tweet['geo']['coordinates'][:2]
        print u'@%s: %s' % (user, tweet_text), lat, long
    else:
        print u'@%s: %s' % (user, tweet_text)
    print favorite, retweet, id_status, in_reply_to, followers, statuses_count, location
    writer.writerow([user.encode('utf8'), user_creation.encode('utf8'), created_at.encode('utf8'),
                     tweet_text.encode('utf8'), favorite, retweet, id_status, in_reply_to,
                     followers, statuses_count, location.encode('utf8'), lat, long])
    i += 1
    if i > max:
        return
Thank you in advance!
Florent
If EST is your local timezone then you could do it using only stdlib:
#!/usr/bin/env python
from datetime import datetime
from email.utils import parsedate_tz, mktime_tz
timestamp = mktime_tz(parsedate_tz('Thu Mar 19 12:37:15 +0000 2015'))
s = str(datetime.fromtimestamp(timestamp))
# -> '2015-03-19 08:37:15'
It supports non-UTC input timezones too.
Or you could specify the destination timezone explicitly:
import pytz # $ pip install pytz
dt = datetime.fromtimestamp(timestamp, pytz.timezone('US/Eastern'))
s = dt.strftime('%Y-%m-%d %H:%M:%S')
# -> '2015-03-19 08:37:15'
You could put it in a function:
#!/usr/bin/env python
from datetime import datetime
from email.utils import parsedate_tz, mktime_tz
def to_local_time(tweet_time_string):
"""Convert rfc 5322 -like time string into a local time
string in rfc 3339 -like format.
"""
timestamp = mktime_tz(parsedate_tz(tweet_time_string))
return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
time_string = to_local_time('Thu Mar 19 12:37:15 +0000 2015')
# use time_string here..
Remove the +0000 from the date sent by twitter and do something like:
from datetime import datetime
import pytz

local = 'Europe/London'  # or whatever timezone the twitter date is coming from
dt = datetime.strptime("Thu Mar 19 12:37:15 2015", "%a %b %d %H:%M:%S %Y")
dt = pytz.timezone(local).localize(dt)
est_dt = dt.astimezone(pytz.timezone('EST'))
print(est_dt.strftime("%Y-%m-%d %H:%M:%S"))
Output:
2015-03-19 07:37:15
Alternatively you can do something like this (in this case you don't need to remove the +0000 timezone info):
from dateutil import parser
import pytz

dt = parser.parse("Thu Mar 19 12:37:15 +0000 2015")
est_dt = dt.astimezone(pytz.timezone('EST'))
print(est_dt.strftime("%Y-%m-%d %H:%M:%S"))
Output
2015-03-19 07:37:15
By the way, EST is UTC-4 or UTC-5?
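To answer that aside: EST proper is UTC-5; it is Eastern daylight time (EDT) that is UTC-4, which is why the outputs above show 07:37 rather than the asker's expected 08:37. On Python 3.9+, the stdlib zoneinfo module picks the correct offset for the given date automatically; a sketch:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib since Python 3.9

# Parse the Twitter timestamp, offset included, then convert to US Eastern time;
# zoneinfo applies EST or EDT depending on the date (March 19, 2015 falls in EDT)
dt = datetime.strptime("Thu Mar 19 12:37:15 +0000 2015", "%a %b %d %H:%M:%S %z %Y")
eastern = dt.astimezone(ZoneInfo("America/New_York"))
print(eastern.strftime("%Y-%m-%d %H:%M:%S"))  # 2015-03-19 08:37:15
```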
I'm a Python beginner trying to extract data from email headers. I have thousands of email messages in a single text file, and from each message I want to extract the sender's address, the recipient(s), and the date, and write them to a single semicolon-delimited line in a new file.
This is ugly, but it's what I've come up with:
import re

emails = open("demo_text.txt", "r")   # opens the file to analyze
results = open("results.txt", "w")    # creates a new file for search results
resultsList = []
for line in emails:
    if "From - " in line:  # recognizes the beginning of an email message and adds a line break
        newMessage = re.findall(r'\w\w\w\s\w\w\w.*', line)
        if newMessage:
            resultsList.append("\n")
    if "From: " in line:
        address = re.findall(r'[\w.-]+@[\w.-]+', line)
        if address:
            resultsList.append(address)
            resultsList.append(";")
    if "To: " in line:
        if "Delivered-To:" not in line:  # avoids confusion with the 'Delivered-To:' tag
            address = re.findall(r'[\w.-]+@[\w.-]+', line)
            if address:
                for person in address:
                    resultsList.append(person)
                    resultsList.append(";")
    if "Date: " in line:
        date = re.findall(r'\w\w\w\,.*', line)
        resultsList.append(date)
        resultsList.append(";")
for result in resultsList:
    results.writelines(result)
emails.close()
results.close()
and here's my 'demo_text.txt':
From - Sun Jan 06 19:08:49 2013
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Delivered-To: somebody_1@hotmail.com
Received: by 10.48.48.3 with SMTP id v3cs417003nfv;
Mon, 15 Jan 2007 10:14:19 -0800 (PST)
Received: by 10.65.211.13 with SMTP id n13mr5741660qbq.1168884841872;
Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Return-Path: <nobody@hotmail.com>
Received: from bay0-omc3-s21.bay0.hotmail.com (bay0-omc3-s21.bay0.hotmail.com [65.54.246.221])
by mx.google.com with ESMTP id e13si6347910qbe.2007.01.15.10.13.58;
Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Received-SPF: pass (google.com: domain of nobody@hotmail.com designates 65.54.246.221 as permitted sender)
Received: from hotmail.com ([65.54.250.22]) by bay0-omc3-s21.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2668);
Mon, 15 Jan 2007 10:13:48 -0800
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC;
Mon, 15 Jan 2007 10:13:47 -0800
Message-ID: <BAY115-F12E4E575FF2272CF577605A1B50#phx.gbl>
Received: from 65.54.250.200 by by115fd.bay115.hotmail.msn.com with HTTP;
Mon, 15 Jan 2007 18:13:43 GMT
X-Originating-IP: [200.122.47.165]
X-Originating-Email: [nobody@hotmail.com]
X-Sender: nobody@hotmail.com
From: =?iso-8859-1?B?UGF1bGEgTWFy7WEgTGlkaWEgRmxvcmVuemE=?=
<nobody@hotmail.com>
To: somebody_1@hotmail.com, somebody_2@gmail.com, 3_nobodies@yahoo.com.ar
Bcc:
Subject: fotos
Date: Mon, 15 Jan 2007 18:13:43 +0000
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_NextPart_000_d98_1c4f_3aa9"
X-OriginalArrivalTime: 15 Jan 2007 18:13:47.0572 (UTC) FILETIME=[E68D4740:01C738D0]
Return-Path: nobody@hotmail.com
The output is:
somebody_1@hotmail.com;somebody_2@gmail.com;3_nobodies@yahoo.com.ar;Mon, 15 Jan 2007 18:13:43 +0000;
This output would be fine, except there's a line break in the 'From:' field in my demo_text.txt (line 24), so I miss 'nobody@hotmail.com'.
I'm not sure how to tell my code to skip the line break and still find the email address in the From: tag.
More generally, I'm sure there are many more sensible ways to go about this task. If anyone could point me in the right direction, I'd sure appreciate it.
Your demo text is practically the mbox format, which can be processed perfectly with the appropriate class in the mailbox module:
from mailbox import mbox
import re

PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+@[0-9A-Za-z._-]+")

mymbox = mbox("demo.txt")
for email in mymbox.values():
    from_address = PAT_EMAIL.findall(email["from"])
    to_address = PAT_EMAIL.findall(email["to"])
    date = [email["date"]]
    print(";".join(from_address + to_address + date))
In order to skip newlines, you can't read the file line by line. Instead, try loading the whole file and using your keywords (From, To, etc.) as boundaries: when you search for 'From -', the other keywords delimit the portion you extract, so a folded (multi-line) header stays inside it.
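A sketch of that keyword-boundary idea, using a minimal two-message sample (the real input would come from the text file; note the folded From: header in the first message):

```python
import re

# Minimal sample with a folded (multi-line) From: header in the first message
raw = """From - Sun Jan 06 19:08:49 2013
From: Paula
 <nobody@hotmail.com>
To: somebody_1@hotmail.com
Date: Mon, 15 Jan 2007 18:13:43 +0000
From - Mon Jan 07 10:00:00 2013
From: <other@example.com>
To: someone@example.org
Date: Tue, 16 Jan 2007 09:00:00 +0000
"""

EMAIL = re.compile(r"[\w.-]+@[\w.-]+")

rows = []
for message in raw.split("From - ")[1:]:
    # Searching the whole message between keywords (not line by line) survives folded headers
    from_block = re.search(r"From:(.*?)\nTo:", message, re.DOTALL).group(1)
    to_block = re.search(r"To:(.*?)\n", message).group(1)
    date = re.search(r"Date: (.*)", message).group(1)
    rows.append(";".join(EMAIL.findall(from_block) + EMAIL.findall(to_block) + [date]))

for row in rows:
    print(row)
```

The re.DOTALL flag lets `.` match newlines, which is what allows the From: pattern to reach across the fold.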
Also, mentioning this because you said you're a beginner: the "Pythonic" way of naming non-class variables is with underscores, so resultsList should be results_list.