An issue occurs when I try to find a date in a .txt file using datefinder. I have the feeling I am unnecessarily switching between data types to obtain the result I desire.
Underneath is a MWE which results in generator object, which in turn is empty when changed to a list. I would like to obtain a datetime in the format %d-%m-%Y.
MWE:
import datefinder
f = ['this is text', 'this is a date', '* Model creation date: Sun Apr 25 08:52:06 2021']
for line in f:
if "creation date" in line:
date_line = str(line)
rev_date = datefinder.find_dates(_date_line)
dateutil's parser seems to do a better job:
import dateutil
f = ['this is text', 'this is a date', '* Model creation date: Sun Apr 25 08:52:06 2021']
dates = []
for line in f:
try:
dates.append(dateutil.parser.parse(line, fuzzy=True))
except dateutil.parser.ParserError:
pass
print(dates)
# [datetime.datetime(2021, 4, 25, 8, 52, 6)]
For the specific use-case:
for line in f:
if "* Model creation date:" in line:
rev_date = dateutil.parser.parse(line, fuzzy=True)
break
print(rev_date)
# 2021-04-25 08:52:06
Seems datefinder.find_dates works based on :. If you can remove : character after creation date get right result.
If always your string include creation date: you can remove this substring after if statement:
import datefinder
f = ['this is text', 'this is a date', '* Model creation date: Sun Apr 25 08:52:06 2021']
for line in f:
if "creation date" in line:
date_line = line.replace('creattion date:', '')
rev_date = datefinder.find_dates(date_line)
Related
I have txt file like this;
name lastname 17 189cm
How do I get it to be like this?
name lastname, 17, 189cm
Using str.strip and str.split:
>>> my_string = 'name lastname 17 189cm'
>>> s = list(map(str.strip, my_string.split()))
>>> ', '.join([' '.join(s[:2]), *s[2:] ])
'name lastname, 17, 189cm'
You can use regex to replace multiple spaces (or tabs) with a comma:
import re
text = 'name lastname 17 189cm'
re.sub(r'\s\s+|\t', ', ', text)
text = 'name lastname 17 189cm'
out = ', '.join(text.rsplit(maxsplit=2)) # if sep is not provided then any consecutive whitespace is a separator
print(out) # name lastname, 17, 189cm
You could use re.sub:
import re
s = "name lastname 17 189cm"
re.sub("[ ]{2,}",", ", s)
PS: for the first problem you proposed, I had the following solution:
s = "name lastname 17 189cm"
s[::-1].replace(" ",",", 2)[::-1]
I have a text file having multiple lines one of the line contains field description, and that field has multiple combination or notation of dates surrounded by other strings like colasas|04/18/2017|NXP , FTP Permanent|09|10|2012|FTP, and Project|16 July 2005|Design. from which I want to parse the dates only, One way I found is to use dateutil module which looks to be complicated and lot of manipulation for this purpose.
So, while going through the examples test , it works for certain combinations..
>>> from dateutil.parser import parse
>>> test_cases = ['04/30/2009', '06/20/95', '8/2/69', '1/25/2011', '9/3/2002', '4-13-82', 'Mar-02-2009', 'Jan 20, 1974',
... 'March 20, 1990', 'Dec. 21, 2001', 'May 25 2009', '01 Mar 2002', '2 April 2003', '20 Aug. 2004',
... '20 November, 1993', 'Aug 10th, 1994', 'Sept 1st, 2005', 'Feb. 22nd, 1988', 'Sept 2002', 'Sep 2002',
... 'December, 1998', 'Oct. 2000', '6/2008', '12/2001', '1998', '2002']
>>> for date_string in test_cases:
... print(date_string, parse(date_string).strftime("%Y%m%d"))
...
04/30/2009 20090430
06/20/95 19950620
8/2/69 19690802
----- etc --------
However, I have the below data combination which I need to parse but while opting for above solution it fails to get the results..
As description is optional as it may be missing at certain point so , I considered using (?:description:* (.*))? .
description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP
I have re-formated the Question Just allow to more visibility to get more inputs.
Below is the actula script which is having three diffrent matches dn , ftpuser and the last description which i'm looking for the solution.
Below script is working for all the matches but the last feild which is description having the mixed and raw data from which i need the dates only
and the dates are encapsulated between PIPES"|".
#!/usr/bin/python3
# ./dataparse.py
from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import re
with open('test2', 'r') as f:
for line in f:
line = line.strip()
data = f.read()
regex = (r"dn:(.*?)\nftpuser: (.*)\ndescription:* (.*)")
matchObj = re.findall(regex, data)
for index in matchObj:
#print(index)
index_str = ' '.join(index)
new_str = re.sub(r'[=,]', ' ', index_str)
new_str = new_str.split()
print("{0:<30}{1:<20}{2:<50}".format(new_str[1],new_str[8],new_str[9]))
Resulted output:
$ ./dataparse.py
ab02 disabled_5Mar07 Remedy
mela Y ROYALS|none|customer
ab01 Y VGVzdGluZyA
tt#regg.com T REG-JP|7-31-05|REG-JP
The parse method you're using accepts a keyword argument to allow ignoring irrelevant parts of the string.
:param fuzzy:
Whether to allow fuzzy parsing, allowing for string like "Today is
January 1, 2047 at 8:21:00AM".
Demo:
>>> parse('colasas|04/18/2017|NXP', fuzzy=True)
datetime.datetime(2017, 4, 18, 0, 0)
There is another one to also return tuples including the parts of the string that were ignored:
>>> parse('colasas|04/18/2017|NXP', fuzzy_with_tokens=True)
(datetime.datetime(2017, 4, 18, 0, 0), ('colasas|', '|NXP'))
This method won't work perfectly with all of your input strings, but it should get you most of the way there. You may have to do some pre-processing for the stranger ones.
Using some string manipulation
Demo:
s = """description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP"""
from dateutil.parser import parse
for i in s.split("\n"):
val = i.split("|", 1) #Split by first "|"
if len(val) > 1: #Check if Date in string.
val = val[1].rpartition("|")[0] #Split by right "|"
print( parse(val, fuzzy=True) )
Output:
2017-04-18 00:00:00
2017-04-18 00:00:00
2012-07-03 00:00:00
2004-07-03 00:00:00
2010-12-12 00:00:00
2005-07-16 00:00:00
2010-10-10 00:00:00
2005-07-31 00:00:00
2010-07-03 00:00:00
Regarding your datetime error remove from datetime import datetime
Demo:
import re
import datetime
strh = "description: colasas|04/18/2017|NXP"
match = re.search(r'\d{2}/\d{2}/\d{4}', strh)
date = datetime.datetime.strptime(match.group(), '%m/%d/%Y').date()
print(date)
text="""
description: colasas|04/18/2017|NXP
description: colasas|04/18/2017|NXP
description: Remedy Tkt 01212152 Orcad move
description: FTP Permanent|09|10|2012|FTP
description: Remedy Tkt 01212152 Orcad move
description: TDA Drop12 Account|July 2004|TDA Drop12 Account
description: ftp|121210|ftp
description: Design Foundry Project|16 July 2005|Design Foundry Project
description: FTP Permanent|10/10/2010|FTP
description: WFS-JP|7-31-05|WFS-JP
description: FTP Permanent|10|11|2010|FTP
"""
import re
reg=re.compile(r"(?ms)\|(\d\d)(\d\d)(\d\d)\||\|(\d{1,2})[\|/\-](\d{1,2})[\|/\-](\d{2,4})\||\|(\d*)\s*(\w+)\s*(\d{4})\|")
dates= [ t[:3] if t[1] else t[3:6] if t[4] else t[6:] for t in reg.findall(text) ]
print(dates)
"""
regexp for |121210| ---> \|(\d\d)(\d\d)(\d\d)\|
for |16 July 2005| ---> \|(\d*)\s*(\w+)\s*(\d{4})\|
for the others ---> \|(\d{1,2})[\|/\-](\d{1,2})[\|/\-](\d{2,4})\|
"""
Output: [('04', '18', '2017'), ('04', '18', '2017'), ('09', '10', '2012'), ('', 'July', '2004'), ('12', '12', '10'), ('16', 'July', '2005'), ('10', '10', '2010'), ('7', '31', '05'), ('10', '11', '2010')]
Get the date as it is:
reg=re.compile(r"(?ms)\|(\d{6})\||\|(\d{1,2}[\|/\-]\d{1,2}[\|/\-]\d{2,4})\||\|(\d*\s*\w+\s+\d{4})\|")
dates= [ t[0] or t[1] or t[2] for t in reg.findall(text) ]
print(dates)
Output:
['04/18/2017', '04/18/2017', '09|10|2012', 'July 2004', '121210', '16 July 2005', '10/10/2010', '7-31-05', '10|11|2010']
I achieved it through regex considering the values between pipes as follows:
"(?:description:* .*\|([0-9]{1,2}[-/]+[0-9]{1,2}[-/]+[0-9]{2,4})\|.*)?"
I wanted to extract the date from the given string on the basis of tag.
My string is -
DATE: 7/25/2017 DATE OPENED: 7/25/2017 RETURN DATE: 7/26/2017
NUMBER: 201707250008754 RATE: 10.00
I want something like this -
If I give "DATE" it should return 7/25/2017 only
if I give "RETURN DATE" it should return 7/26/2017
if I give the "NUMBER" it should return 201707250008754
and so on.
How we can achieve this in Python 2.7 (Note: Dates and numbers are always random in string"
You can create a dictionary from the string's contents with re:
import re
s = 'DATE: 7/25/2017 DATE OPENED: 7/25/2017 RETURN DATE: 7/26/2017 NUMBER: 201707250008754 RATE: 10.00'
results = re.findall('[a-zA-Z\s]+(?=:)|[\d/\.]+', s)
d = dict([re.sub('^\s+', '', results[i]), results[i+1]] for i in range(0, len(results), 2))
for i in ['DATE', 'RETURN DATE', 'NUMBER']:
print(d[i])
Output:
7/25/2017
7/26/2017
201707250008754
Use dict to map key (eg: 'DATE' ) to its value.
import re
s = '''DATE: 7/25/2017 DATE OPENED: 7/25/2017 RETURN DATE: 7/26/2017 NUMBER: 201707250008754 RATE: 10.00'''
items = re.findall('\s*(.*?)\:\s*([0-9/.]*)',s)
#[('DATE', '7/25/2017'), ('DATE OPENED', '7/25/2017'), ('RETURN DATE', '7/26/2017'), ('NUMBER', '201707250008754'), ('RATE', '10.00')]
info = dict(items)
#{'DATE': '7/25/2017', 'DATE OPENED': '7/25/2017', 'RETURN DATE': '7/26/2017', 'NUMBER': '201707250008754', 'RATE': '10.00'}
for key in ['DATE', 'RETURN DATE', 'NUMBER']:
print(info[key])
So I have several log files, they are structured like this:
Sep 9 12:42:15 apollo sshd[25203]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=189.26.255.11
Sep 9 12:42:15 apollo sshd[25203]: pam_succeed_if(sshd:auth): error retrieving information about user ftpuser
Sep 9 12:42:17 apollo sshd[25203]: Failed password for invalid user ftpuser from 189.26.255.11 port 44061 ssh2
Sep 9 12:42:17 apollo sshd[25204]: Received disconnect from 189.26.255.11: 11: Bye Bye
Sep 9 19:12:46 apollo sshd[30349]: Did not receive identification string from 199.19.112.130
Sep 10 03:29:48 apollo unix_chkpwd[4549]: password check failed for user (root)
Sep 10 03:29:48 apollo sshd[4546]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=221.12.29.170 user=root
Sep 10 03:29:51 apollo sshd[4546]: Failed password for root from 221.12.29.170 port 56907 ssh2
There are more dates and times, But this is an example. I was wondering how I would calculate the total time that the file covers. I've tried a few things, and have had about 5 hours of no success.
I tried this first, and it was close, but it didn't work like I wanted it to, it kept repeating dates:
with open(filename, 'r') as file1:
lines = file1.readlines()
for line in lines:
linelist = line.split()
date2 = int(linelist[1])
time2 = linelist[2]
print linelist[0], linelist[1], linelist[2]
if date1 == 0:
date1 = date2
dates.append(linelist[0] + ' ' + str(linelist[1]))
if date1 < date2:
date1 = date2
ttimes.append(datetime.strptime(str(ltime1), FMT) - datetime.strptime(str(time1), FMT))
time1 = '23:59:59'
ltime1 = '00:00:00'
dates.append(linelist[0] + ' ' + str(linelist[1]))
if time2 < time1:
time1 = time2
if time2 > ltime1:
ltime1 = time2
If the entries are in a chronological order, you can just look at the first and at the last entry:
entries = lines.split("\n")
first_date = entries[0].split("apollo")[0]
last_date = entries[len(entries)-1].split("apollo")[0]
We don't have the year, so I took the current year. Read all the lines, convert the month to month index, and parse each date.
Then sort it (so works even if logs mixed) and take first & last item. Substract. Enjoy.
from datetime import datetime
months = ["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
current_year = datetime.now().year
dates = list()
with open(filename, 'r') as file1:
for line in file1:
linelist = line.split()
if linelist: # filter out possible empty lines
linelist[0] = str(months.index(linelist[0])) # convert 3-letter months to index
date2 = int(linelist[1])
z=datetime.strptime(" ".join(linelist[0:3])+" "+str(current_year),"%m %d %H:%M:%S %Y") # compose & parse the date
dates.append(z) # store in list
dates.sort() # sort the list
first_date = dates[0]
last_date = dates[-1]
# print report & compute time span
print("start {}, end {}, time span {}".format(first_date,last_date,last_date-first_date))
result:
start 2016-09-09 12:42:15, end 2016-09-10 03:29:51, time span 14:47:36
Note that it won't work properly between december 31st and january the 1st because of the missing year info. I suppose we could make a guess if we find January & December in the log then assume that it's january from the next year. Unsupported yet.
I have a CSV file with each line containing information pertaining to a particular tweet (i.e. each line contains Lat, Long, User_ID, tweet and so on). I need to read the file and organize the tweets by the User_ID. I am trying to end up with a given User_ID attached to all of the tweets with that specific ID.
Here is what I want:
user_id: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
user_id2: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
: 'lat', 'long', 'tweet'
and so on...
This is a snip of my code that reads in the CSV file and creates a list:
UID = []
myID = []
ID = []
f = None
with open(csv_in,'rU') as f:
myreader = csv.reader(f, delimiter=',')
for row in myreader:
# Assign columns in csv to variables.
latitude = row[0]
longitude = row[1]
user_id = row[2]
user_name = row[3]
date = row[4]
time = row[5]
tweet = row[6]
flag = row[7]
compound = row[8]
Vote = row[9]
# Read variables into separate lists.
UID.append(user_id + ', ' + latitude + ', ' + longitude + ', ' + user_name + ', ' + date + ', ' + time + ', ' + tweet + ', ' + flag + ', ' + compound)
myID = ', '.join(UID)
ID = myID.split(', ')
I'd suggest you use pandas for this. It will allow you not only to list your tweets by user_id, as in your question, but also to do many other manipulations quite easily.
As an example, take a look at this python notebook from NLTK. At the end of it, you see an operation very closed to yours, reading a csv file containing tweets,
In [25]:
import pandas as pd
tweets = pd.read_csv('tweets.20150430-223406.tweet.csv', index_col=2, header=0, encoding="utf8")
You can also find a simple operation: looking for the tweets of a certain user,
In [26]:
tweets.loc[tweets['user.id'] == 557422508]['text']
Out[26]:
id
593891099548094465 VIDEO: Sturgeon on post-election deals http://...
593891101766918144 SNP leader faces audience questions http://t.c...
Name: text, dtype: object
For listing the tweets by user_id, you would simply do something like the following (this is not in the original notebook),
In [9]:
tweets.set_index('user.id')[0:4]
Out[9]:
created_at favorite_count in_reply_to_status_id in_reply_to_user_id retweet_count retweeted text truncated
user.id
107794703 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #KirkKus: Indirect cost of the UK being in ... False
557422508 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False VIDEO: Sturgeon on post-election deals http://... False
3006692193 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #LabourEoin: The economy was growing 3 time... False
455154030 Thu Apr 30 21:34:06 +0000 2015 0 NaN NaN 0 False RT #GregLauder: the UKIP east lothian candidat... False
Hope it helps.