Dig out information with Python re

I want to dig out information from the log files and wrote the script below:
import re
file = '''Date,Time,Type,User,Message
Thu Jul 18, 2019 14:18:41.945,EFM,201202 ,Robot picked
Thu Jul 18, 2019 14:18:51.486,DS ,201202 ,Module 1
Thu Jul 18, 2019 14:19:07.747,DS ,201202 ,Door opened
Thu Jul 18, 2019 14:20:08.231,EFM,203204205206,Robot picked
Thu Jul 18, 2019 14:20:08.231,DS ,203204 ,Module 2
Thu Jul 18, 2019 14:20:10.282,DS ,203204 ,Door opened
...
'''
p1 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),EFM,(\d+?\s*?),Robot picked')
p2 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),DS ,(\d+?\s*?),Module 1')
p3 = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d.\d{3}),DS ,(\d+?\s*?),Door opened')
w_file = r'D:\sample.txt'
lines = file.splitlines()  # `file` is a string here, so split it into lines
t_file = open(w_file, 'w')
info = ['User', 'Time1', 'Time2', 'Time3']
t_file.write('{}\n'.format(','.join(item for item in info)))
for line in lines:
    p1_line = re.findall(p1, line.strip())
    p2_line = re.findall(p2, line.strip())
    p3_line = re.findall(p3, line.strip())
    if p1_line and p2_line and p3_line:
        if p1_line[0][1][:3] == p2_line[0][1][:3] and p1_line[0][1][:3] == p3_line[0][1][:3]:
            t_file.write('{},{},{},{}\n'.format(p1_line[0][1].strip(), p1_line[0][0], p2_line[0][0], p3_line[0][0]))
t_file.close()
When I open the sample.txt file, there is only the 'User,Time1,Time2,Time3' header row. Can anyone find what's wrong with my script?
What I want is like below:
User,Time1,Time2,Time3
201202,14:18:41.945,14:18:51.486,14:19:07.747
203204205206,14:20:08.231,14:20:08.231,14:20:10.282

The issue with your script is that you try to match all three regular expressions against the same line and then combine the results in an and condition, which of course fails.
Each regular expression works, but only for its specific kind of line, so two of the three calls return [], which evaluates to False.
For example, given:
line = 'Thu Jul 18, 2019 14:18:41.945,EFM,201202 ,Robot picked'
You will have:
p1_line = [('14:18:41.945', '201202 ')] # match
p2_line = [] # no match
p3_line = [] # no match
Once you combine these three values with and, the condition evaluates to False, and for this reason nothing is written to the file:
if p1_line and p2_line and p3_line: # this evaluates to False
So, depending on the exact logic you want to implement, you may have to store past matches as you read the lines and build on them.
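For example, here is a minimal sketch of that idea (my code, not yours: it assumes, as in your sample, that each group's events arrive in the order Robot picked, then Module, then Door opened, and it matches users on their first three digits, just like your [:3] comparison does):
import re
from collections import OrderedDict

# One pattern for all three event lines; the groups are time, type, user, message.
pattern = re.compile(r'\w{3} \w{3} \d\d, \d{4} (\d\d:\d\d:\d\d\.\d{3}),(EFM|DS) ?,(\d+)\s*,(.+)')

times = OrderedDict()  # user[:3] prefix -> [full user, time1, time2, time3]
for line in file.splitlines():  # `file` is the log string from the question
    m = pattern.match(line.strip())
    if not m:
        continue  # skips the CSV header and anything else that does not match
    time, _, user, message = m.groups()
    key = user[:3]
    if message == 'Robot picked':
        times[key] = [user, time]  # a 'Robot picked' line starts a new group
    elif key in times:
        times[key].append(time)  # later events just contribute their time

print('User,Time1,Time2,Time3')
for entry in times.values():
    if len(entry) == 4:  # the user plus all three event times were seen
        print(','.join(entry))
On your sample this prints the two rows from your expected output.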

Related

How to parse log file by regex grouping

I am trying to parse based on grouping; below is the input file to parse.
I am not able to aggregate the multiple groups from my regex into the expected output. I need some recommendations on how to print the data in the expected format. (Note: Group 2 can contain various other strings in the actual log file.)
# Parse out the date/time stamp Jan 20 03:25:08 to capture two groups
Example groups
1.) Jan 20 03:25 2.) logrotate
1.) Jan 20 05:03 2.) ntpd
logfile= """Jan 20 03:25:08 fakehost logrotate: ALERT exited abnormally with [1]
Jan 20 03:25:08 fakehost run-parts(/etc/cron.daily)[20447]: finished logrotate
Jan 20 03:26:21 fakehost anacron[28969]: Job 'cron.daily' terminated
Jan 20 03:26:21 fakehost anacron[28969]: Normal exit (1 job run)
Jan 20 03:30:01 fakehost CROND[31462]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 20 03:30:01 fakehost CROND[31461]: (root) CMD (/var/system/bin/sys-cmd -F
Jan 20 05:03:03 fakehost ntpd[3705]: synchronized to time.faux.biz, stratum 2
"""
Expected output:
minute,total_count,logrotate,CROND,ntpd,anacron,run-parts
Jan 20 03:25,2,1,0,0,0,1
Jan 20 03:26,2,0,2,0,1,1
Jan 20 03:30,2,0,2,0,0,0
Jan 20 05:03,1,0,0,1,0,0
This is my code:
import re

output = {}
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                print(match.groups())
                # Stuck here trying to arrange the data
                output[match.group(1)]['total_count'] += 1
                output[match.group(1)][match.group(2)] += 1
for k, v in output.items():
    print('{0} {1}'.format(k, v))
import re

output = []
regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
with open("logfile.txt", "r+") as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            if match.group(1) and match.group(2):
                dataDict = {'minute': match.group(1), 'total_count': 1}
                dataDict[match.group(2)] = 1
                lastInsertedIndex = len(output) - 1
                if (len(output) > 0):  # data exists, check if same-minute data exists or not
                    # same minute, update existing data
                    if (output[lastInsertedIndex]['minute'] == match.group(1)):
                        lastInsertedIndexDict = output[lastInsertedIndex]
                        if (match.group(2) in lastInsertedIndexDict):
                            lastInsertedIndexDict[match.group(2)] = lastInsertedIndexDict[match.group(2)] + 1  # updating group(2)
                        else:
                            lastInsertedIndexDict[match.group(2)] = 1
                        # updating total count
                        lastInsertedIndexDict['total_count'] = lastInsertedIndexDict['total_count'] + 1
                        output[lastInsertedIndex] = lastInsertedIndexDict
                    else:  # new minute, simply append
                        output.append(dataDict)
                else:  # output list is empty
                    output.append(dataDict)
for data in output:
    print(data)
The idea here: after we have match.groups(), create a dictionary with 'minute' set to match.group(1) and 'total_count' set to 1, then set a key for match.group(2) with value 1 as well.
Since the data is in increasing order of time, we only need to check whether the previously inserted entry is for the same minute or a different one.
If it is the same minute, increase the dictionary's total_count and match.group(2) counts by 1.
If it is a different minute, simply append the new dictionary to the output list.
Currently the output list prints keys and values. In case you want to print only the values, change print(data) in the last line to print(data.values()).
Just to mention, I have assumed that you are not facing any issue with the regex and that the regex you provided fulfills your requirement.
In case you face any issue with the regex, or need help with it, do let me know in a comment.
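For reference, here is a more compact sketch of the same counting idea using collections.Counter, with the column order hard-coded from your expected output (and your regex reused as-is, per the assumption above):
import re
from collections import Counter, OrderedDict

regex = re.compile(r'^(\w+ \d+ \d+:\d+):\d+ \w+ (\w+).*$')
counts = OrderedDict()  # minute -> Counter of program names
with open('logfile.txt') as myfile:
    for log_line in myfile:
        match = regex.match(log_line)
        if match:
            minute, program = match.groups()
            counts.setdefault(minute, Counter())[program] += 1

# Column order taken from the expected output in the question.
programs = ['logrotate', 'CROND', 'ntpd', 'anacron', 'run-parts']
print('minute,total_count,' + ','.join(programs))
for minute, counter in counts.items():
    row = [minute, str(sum(counter.values()))] + [str(counter[p]) for p in programs]
    print(','.join(row))
A Counter returns 0 for missing keys, which gives the zero-filled columns without any special-casing.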

I am getting an error with my code to read Twitter data sets from JSON using Python 3.6.1

Here is the code:
import json
import re
emoticons_str = r"""
(?:
[:=;] # Eyes
[oO\-]? # Nose (optional)
[D\)\]\(\]/\\OpP] # Mouth
)"""
regex_str = [
emoticons_str,
r'<[^>]+>', # HTML tags
r'(?:@[\w_]+)', # @-mentions
r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
r'http[s]?://(?:[a-z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
r'(?:[\w_]+)', # other words
r'(?:\S)' # anything else
]
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

with open('mytweets.json', mode='r', encoding='utf-8') as f:
    for line in f:
        #line = f.readline()
        tweet = json.loads(line)
        print(preprocess(tweet['text']))
After running it, I get an error (the question included only a screenshot of the traceback, not reproduced here).
What is the solution to this problem? How can I successfully read the data and tokenize the tweets from the JSON format?
Here are some samples of mytweets.json
{"created_at":"Thu Jun 22 21:50:18 +0000 2017","id":878007261674602496,"id_str":"878007261674602496","text":"RT #wreckitroy: Well, I like dick, so I don't see this as a possibility, but thanks for trying to reach that far up my ass to try t\u2026 ","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":632645991,"id_str":"632645991","name":"meche","screen_name":"mercedessreyes","location":null,"url":null,"description":"I mean, really it's same me, it's old me \u2022 FSU '21 \u2022 https:\/\/vsco.co\/onlymeche","protected":false,"verified":false,"followers_count":1039,"friends_count":352,"listed_count":6,"favourites_count":21860,"statuses_count":21676,"created_at":"Wed Jul 11 04:06:28 +0000 2012","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"FCEBB6","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/762423763\/6c7d56ca20260816f75c10759208b283.png","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/762423763\/6c7d56ca20260816f75c10759208b283.png","profile_background_tile":true,"profile_link_color":"CE7834","profile_sidebar_border_color":"F0A830","profile_sidebar_fill_color":"78C0A8","profile_text_color":"5E412F","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/876886584087502848\/9WSQDm8F_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/876886584087502848\/9WSQDm8F_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/632645991\/1497147929","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Wed Jun 21 02:57:42 +0000 2017","id":877359845074018304,"id_str":"877359845074018304","text":"Well, I like dick, so I don't see this as a possibility, but thanks for trying to reach that far up my ass to try t\u2026 https:\/\/t.co\/lUJzY60Sn8","display_text_range":[0,140],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2341390003,"id_str":"2341390003","name":"roy","screen_name":"wreckitroy","location":"Fresno, CA","url":null,"description":"She said I'm looking like a bad man, smooth criminal. 
\ud83c\udf43 \/ snapchat\/instagram: thericharrow","protected":false,"verified":false,"followers_count":4831,"friends_count":1103,"listed_count":23,"favourites_count":79829,"statuses_count":1012,"created_at":"Thu Feb 13 04:30:59 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/876941549874978816\/eTGFmh8u_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/876941549874978816\/eTGFmh8u_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2341390003\/1498157548","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"quoted_status_id":877359034621468672,"quoted_status_id_str":"877359034621468672","quoted_status":{"created_at":"Wed Jun 21 02:54:29 +0000 2017","id":877359034621468672,"id_str":"877359034621468672","text":"When you trying so hard to getvout the friend zone\ud83d\ude02\ud83d\ude02 https:\/\/t.co\/i8yFNbGDNn","display_text_range":[0,52],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":844510650,"id_str":"844510650","name":"\u3164","screen_name":"DaddyGunPlay","location":null,"url":null,"description":"One of the best Contoller players dont #. 
Bo2 is surperior #JellyFam\ud83c\udf47","protected":false,"verified":false,"followers_count":325,"friends_count":276,"listed_count":3,"favourites_count":1795,"statuses_count":5009,"created_at":"Mon Sep 24 23:51:03 +0000 2012","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"3B94D9","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/874005327045414913\/NUPA2rvD_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/874005327045414913\/NUPA2rvD_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/844510650\/1496174936","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"quoted_status_id":877210813462740992,"quoted_status_id_str":"877210813462740992","is_quote_status":true,"retweet_count":45,"favorite_count":138,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/i8yFNbGDNn","expanded_url":"https:\/\/twitter.com\/wreckitroy\/status\/877210813462740992","display_url":"twitter.com\/wreckitroy\/sta\u2026","indices":[53,76]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":true,"extended_tweet":{"full_text":"Well, I like dick, so I don't see this as a possibility, but thanks for trying to reach that far up my ass to try to find the truth. \ud83d\ude09 https:\/\/t.co\/fv4Kqvv2sb","display_text_range":[0,134],"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/fv4Kqvv2sb","expanded_url":"https:\/\/twitter.com\/daddygunplay\/status\/877359034621468672","display_url":"twitter.com\/daddygunplay\/s\u2026","indices":[135,158]}],"user_mentions":[],"symbols":[]}},"retweet_count":2496,"favorite_count":12594,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/lUJzY60Sn8","expanded_url":"https:\/\/twitter.com\/i\/web\/status\/877359845074018304","display_url":"twitter.com\/i\/web\/status\/8\u2026","indices":[117,140]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"quoted_status_id":877359034621468672,"quoted_status_id_str":"877359034621468672","quoted_status":{"created_at":"Wed Jun 21 02:54:29 +0000 2017","id":877359034621468672,"id_str":"877359034621468672","text":"When you trying so hard to getvout the friend zone\ud83d\ude02\ud83d\ude02 https:\/\/t.co\/i8yFNbGDNn","display_text_range":[0,52],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":844510650,"id_str":"844510650","name":"\u3164","screen_name":"DaddyGunPlay","location":null,"url":null,"description":"One of the best Contoller players dont #. 
Bo2 is surperior #JellyFam\ud83c\udf47","protected":false,"verified":false,"followers_count":325,"friends_count":276,"listed_count":3,"favourites_count":1795,"statuses_count":5009,"created_at":"Mon Sep 24 23:51:03 +0000 2012","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"3B94D9","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/874005327045414913\/NUPA2rvD_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/874005327045414913\/NUPA2rvD_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/844510650\/1496174936","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"quoted_status_id":877210813462740992,"quoted_status_id_str":"877210813462740992","is_quote_status":true,"retweet_count":45,"favorite_count":138,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/i8yFNbGDNn","expanded_url":"https:\/\/twitter.com\/wreckitroy\/status\/877210813462740992","display_url":"twitter.com\/wreckitroy\/sta\u2026","indices":[53,76]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"is_quote_status":true,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"","expanded_url":null,"indices":[133,133]}],"user_mentions":[{"screen_name":"wreckitroy","name":"roy","id":2341390003,"id_str":"2341390003","indices":[3,14]}],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1498168218426"}
{"created_at":"Thu Jun 22 21:50:18 +0000 2017","id":878007262320754692,"id_str":"878007262320754692","text":"It makes me feel some type of way now bree got another lil boy friend","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":47587983,"id_str":"47587983","name":"Kee Gotti","screen_name":"_BadGalKee","location":"Columbus, OH","url":null,"description":"\u2022 Instagram|_badgalkee \u2022 SnapChat| kbabiy","protected":false,"verified":false,"followers_count":1107,"friends_count":639,"listed_count":12,"favourites_count":1160,"statuses_count":28359,"created_at":"Tue Jun 16 09:46:12 +0000 2009","utc_offset":-18000,"time_zone":"Quito","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"131516","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme14\/bg.gif","profile_background_tile":true,"profile_link_color":"009999","profile_sidebar_border_color":"EEEEEE","profile_sidebar_fill_color":"EFEFEF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/850590447261167616\/MuywFrn8_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/850590447261167616\/MuywFrn8_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/47587983\/1487216863","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1498168218580"}
{"created_at":"Thu Jun 22 21:50:18 +0000 2017","id":878007263310393344,"id_str":"878007263310393344","text":"I liked a #YouTube video https:\/\/t.co\/Znu4govqDi My Friend is in LOVE ...","source":"\u003ca href=\"http:\/\/www.google.com\/\" rel=\"nofollow\"\u003eGoogle\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":42287518,"id_str":"42287518","name":"David","screen_name":"iceman120","location":"FT LAUDERDALE, FL","url":"http:\/\/www.youtube.com\/iceman120dl","description":"\ue10e\ue10eOH YOU WANT SOME OF THIS\ue12f\ue12f\ue12f\ue12f\ue10e\ue10e","protected":false,"verified":false,"followers_count":4667,"friends_count":361,"listed_count":69,"favourites_count":134,"statuses_count":69716,"created_at":"Sun May 24 21:43:04 +0000 2009","utc_offset":-14400,"time_zone":"Eastern Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/53704022\/ahamericanflag72.br.jpg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/53704022\/ahamericanflag72.br.jpg","profile_background_tile":false,"profile_link_color":"D60000","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"1C1939","profile_text_color":"777777","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/511261204120363008\/DuNoXOXB_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/511261204120363008\/DuNoXOXB_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/42287518\/1375147278","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/Znu4govqDi","expanded_url":"http:\/\/youtu.be\/up6u1hzWHHc?a","display_url":"youtu.be\/up6u1hzWHHc?a","indices":[25,48]}],"user_mentions":[{"screen_name":"YouTube","name":"YouTube","id":10228272,"id_str":"10228272","indices":[10,18]}],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1498168218816"}
You have posted samples, and as far as I can see you just need to skip empty lines.
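For example, a minimal sketch of that fix, reusing the json import and the preprocess() function from your code:
with open('mytweets.json', mode='r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip empty lines between the dumped tweets
        tweet = json.loads(line)
        print(preprocess(tweet['text']))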
OLD ANSWER BELOW
You should parse json this way:
...
with open('mytweets.json', mode='r', encoding='utf-8') as f:
    tweet = json.load(f)
...
json.load() accepts a file-like object as its first argument.
What you're currently doing is reading the file line by line and trying to parse each line as a separate JSON string; if the file is pretty-printed, no single line contains a complete JSON document.
You might want to iterate over the list of tweets in your file (if my guess about its structure is correct), not over text lines, and call print(preprocess(...)) inside the loop.
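A sketch of that guess (only valid if the whole file is a single JSON array of tweet objects, which your samples suggest it is not):
with open('mytweets.json', mode='r', encoding='utf-8') as f:
    tweets = json.load(f)  # parses the entire file at once
for tweet in tweets:
    print(preprocess(tweet['text']))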

how to ignore a field while comparing 2 files in python

Input files are as below, with field schema Mode|Date|Count|timestamp|status|insertTimeStamp
test1.txt:
HR|06/08/2016|3000|Thu Jun 09 2016|Complete|20160627020300
HR|06/08/2016|2000|Thu Jun 09 2016|Complete|20160627020400
HR|06/08/2016|1000|Thu Jun 09 2016|Complete|20160627020500
test2.txt:
HR|06/08/2016|3010|Thu Jun 09 2016|Complete|20160627070300
HR|06/08/2016|2000|Fri Jun 09 2016|Complete|20160627080300
HR|06/08/2016|1500|Thu Jun 09 2016|Complete|20160627090300
Now my requirement is to find the lines that differ between the two files, but the comparison should ignore the insertTimeStamp field (the last column).
I tried the code below. It works fine, but it compares the full lines. Could someone please suggest how my code can skip the insertTimeStamp field during the comparison?
Thanks in advance for helping me.
import difflib
import sys

with open('/tmp/test1.txt', 'r') as hosts0:
    with open('/tmp/test2.txt', 'r') as hosts1:
        diff = difflib.unified_diff(
            hosts0.readlines(),
            hosts1.readlines(),
            fromfile='hosts0',
            tofile='hosts1',
            n=0,
        )
        for line in diff:
            for prefix in ('---', '+++', '@@'):
                if line.startswith(prefix):
                    break
            else:
                sys.stdout.write(line[1:])
You could potentially just slice off the last element in each line before passing them into the diff function:
diff = difflib.unified_diff(
    ['|'.join(x.split('|')[:-1]) for x in hosts0.readlines()],
    ['|'.join(x.split('|')[:-1]) for x in hosts1.readlines()],
    fromfile='hosts0',
    tofile='hosts1',
    n=0,
)
Line-by-line comparison w/o using difflib:
with open('/tmp/test1.txt', 'r') as fh:
    hosts1 = fh.readlines()
with open('/tmp/test2.txt', 'r') as fh:
    hosts2 = fh.readlines()

for h1, h2 in zip(hosts1, hosts2):
    if h1.split('|')[:-1] != h2.split('|')[:-1]:
        print 'Lines are not the same!'
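If you also want to see which line pairs differ rather than just a message, a small Python 3 variant of the same zip-based comparison could be:
with open('/tmp/test1.txt') as fh1, open('/tmp/test2.txt') as fh2:
    for h1, h2 in zip(fh1, fh2):
        if h1.split('|')[:-1] != h2.split('|')[:-1]:  # still ignores insertTimeStamp
            print('- ' + h1.rstrip('\n'))
            print('+ ' + h2.rstrip('\n'))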

Read and select specific rows from text file regex Python

I have a large number of text files to read from in Python. Each file is structured as the following sample:
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is a multiline abstract of the paper)
blablabla
blablabla
\\
I would like to automatically extract and store (e.g., as a list) the Title, Authors, and abstract (the text between the second and third \\ - note that it starts with an indent) from each text file. Also note that the white line between Date (revised) and Title is really there (it is not a typo that I introduced).
My attempts so far have involved (I am showing the steps for a single text file, say the first file in the list):
import os
import pandas as pd

filename = os.listdir(path)[0]
test = pd.read_csv(filename, header=None, delimiter="\t")
Which gives me:
0
0 ----------------------------------------------...
1 \\
2 Paper: some_integer
3 From: <some_email_address>
4 Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
5 Date (revised v2): Tue, 8 May 2001 10:39:33 G...
6 Title: some_title...
7 Authors: name_1, name_2
8 Comments: 28 pages, JHEP latex
9 Report-no: DUKE-CGTP-00-01
10 \\
11 blabla...
12 blabla...
13 blabla...
14 \\
I can then select a given row (e.g., the one featuring the title) with:
test[test[0].str.contains("Title")].to_string()
But it is truncated, it is not a clean string (some attributes show up) and I find this entire pandas-based approach quite tedious actually... There must be an easier way to directly select the rows of interest from the text file using regex. At least I hope so...
You could process the file line by line.
import re

data = {}
temp_s = match = ''
with open('myfile.txt', 'r') as infile:
    for line in infile:
        if ":" in line:
            line = line.split(':', 1)  # split on the first colon only; the date values contain colons too
            data[line[0]] = line[1]
        elif re.search(r'.*\w+', line):
            match = re.search(r'(\w.*)', line)
            match = match.group(1)
            temp_s += match
            while 1:
                line = next(infile)
                if re.search(r'.*\w+', line):
                    match = re.search(r'(\w.*)', line)
                    temp_s += match.group(1)
                else:
                    break
            data['abstract'] = temp_s
How about iterating over each line in the file and splitting it on the first ": " if it is present in the line, collecting the results of the split in a dictionary:
with open("input.txt") as f:
    data = dict(line.strip().split(": ", 1) for line in f if ": " in line)
As a result, the data would contain:
{
    'Comments': '28 pages, JHEP latex',
    'Paper': 'some_integer',
    'From': '<some_email_address>',
    'Date (revised v2)': 'Tue, 8 May 2001 10:39:33 GMT (27kb)',
    'Title': 'some_title',
    'Date': 'Wed, 4 Apr 2001 12:08:13 GMT (27kb)',
    'Authors': 'name_1, name_2'
}
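The abstract has no "key: value" form, so it needs a separate step. Here is a sketch under the assumption, stated in the question, that the abstract sits between the second and third \\ separator lines:
with open("input.txt") as f:
    parts = f.read().split("\\\\")  # the Python literal "\\\\" is the two-character separator \\
abstract = " ".join(parts[2].split())  # parts[2] is the block between the 2nd and 3rd separators
print(abstract)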
If your files really always have the same structure, you could come up with:
# -*- coding: utf-8 -*-
import re

string = """
------------------------------------------------------------------------------
\\
Paper: some_integer
From: <some_email_address>
Date: Wed, 4 Apr 2001 12:08:13 GMT (27kb)
Date (revised v2): Tue, 8 May 2001 10:39:33 GMT (27kb)
Title: some_title
Authors: name_1, name_2
Comments: 28 pages, JHEP latex
\\
blablabla (this is the abstract of the paper)
\\
"""

rx = re.compile(r"""
    ^Title:\s(?P<title>.+)[\n\r]      # Title at the beginning of a line
    Authors:\s(?P<authors>.+)[\n\r]   # Authors: ...
    Comments:\s(?P<comments>.+)[\n\r] # ... and so on ...
    .*[\n\r]
    (?P<abstract>.+)""",
    re.MULTILINE | re.VERBOSE)  # so that the caret matches any line
                                # + verbose for this explanation

for match in rx.finditer(string):
    print match.group('title'), match.group('authors'), match.group('abstract')
    # some_title name_1, name_2 blablabla (this is the abstract of the paper)
This approach takes Title as the anchor (beginning of a line) and skims the text afterwards. The named groups may not really be necessary but make the code easier to understand. The pattern [\n\r] looks for newline characters.
See a demo on regex101.com.
This pattern will get you started:
\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\
Assume 'txtfile.txt' has the format shown at the top. If using Python 2.7.x:
import re

with open('txtfile.txt', 'r') as f:
    input_string = f.read()

p = r'\\[^\\].*[^\\]+Title:\s+(\S+)\s+Authors:\s+(.*)[^\\]+\\+\s+([^\\]*)\n\\'
print re.findall(p, input_string)
Output:
[('some_title', 'name_1, name_2', 'blablabla (this is a multiline abstract of the paper)\n blablabla\n blablabla')]

Python: how to parse the output of a linux command using compile and split

I need to write a script using re.compile and split to take in the command output and print the IP address (last column) and the date and time, converting the date and time to epoch time.
I was using just re.compile, but it was suggested to me that split would make it easier. I am just looking for some guidance.
This is what the output looks like:
host:~ # last -a -F | egrep -v "boot|wtmp|tty"
root pts/2 Fri Jun 19 10:32:13 2015 still logged in xx.x.xx.xx
root pts/0 Fri Jun 19 08:22:29 2015 still logged in xx.xx.xx.xx
root pts/5 Thu Jun 18 10:09:30 2015 - Thu Jun 18 17:20:52 2015 (07:11) xx.xx.xx.xx
root pts/4 Thu Jun 18 09:53:33 2015 - Thu Jun 18 17:04:53 2015 (07:11) xx.xx.xx.xx
import re
import commands

last_re = re.compile(r'(?P<user>\S+)\s+(?P<pts>\/.+)\s(?P<day>\S+)\s+(?P<month>)\s+(?P<date>\d+)\s+(?P<stime>(\d\:\d)\s+(?P<hyphen>(\s|-)\s+(?P<endtime>(\d\:\d)\s+(?P<user>)\s+(?P<duration>(\(\d\:\d\))\s+(?P<ipaddress>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$)')

cmd = 'last -a -F | egrep -v "boot|wtmp|tty"'
try:
    status, output = commands.getstatusoutput(cmd)
    print last_re
    if not status:
        output_lines = output.split('\n')
        m = last_re.search(output_lines[1])
        if m:
            print "<day='%s' month='%s' time='%s' external_ip='%s'/>" % (m.group('day'), m.group('month'), m.group('stime'), m.group('ipaddress'))
except Exception as e:
    print e
Try this; no need for Python.
last -a -F | egrep -v "boot|wtmp|tty" | awk '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/{print $0}'
split() might be a bit difficult with the spacing, so here's an example with regex. It looks behind for a '-' followed by whitespace, captures everything (non-greedily) after that up to and including four digits (the year), skips everything up to a ')' and more whitespace, and then captures the first two octets of an IP separated by '.' along with the rest of the IP up to the end of the line.
import re
import time
str = "root pts/4 Thu Jun 18 09:53:33 2015 - Thu Jun 18 17:04:53 2015 (07:11) 192.168.0.10"
rx = re.compile(r'(?<=-)\s+(.*?\d{4}).*?(?<=\))\s+(\d{1,3}\.\d{1,3}.*)$')
date, ip = rx.search(str).group(1,2)
epoch = int(time.mktime(time.strptime(date.strip(), "%a %b %d %H:%M:%S %Y")))
print(ip, epoch)
Output:
192.168.0.10 1434668693
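Since the question also asked about split(), here is a sketch of that route; it assumes, as in the sample, that last -a puts the host/IP in the last column and that the login timestamp sits in columns 3 through 7:
import time

line = "root pts/4 Thu Jun 18 09:53:33 2015 - Thu Jun 18 17:04:53 2015 (07:11) 192.168.0.10"
fields = line.split()
ip = fields[-1]                     # `last -a` prints the host/IP as the last column
login_date = ' '.join(fields[2:7])  # e.g. 'Thu Jun 18 09:53:33 2015'
epoch = int(time.mktime(time.strptime(login_date, "%a %b %d %H:%M:%S %Y")))
print(ip, epoch)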
