Parse Output for Python

My software outputs these two types of output:
-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\\_WiE10-18.0.100-77.iso
-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log
I would like to get the file names from both outputs:
E:\\_WiE10-18.0.100-77.iso, for the first one
E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log, for the second one
If I use something like the code below, it doesn't work when the second field has spaces in it. It only works if there aren't any spaces in the domain username.
for item in outputs:
    outputs.extend(item.split())
for item2 in [' '.join(outputs[6:])]:
    new_list.append(item2)
How can I get all the parameters individually, including the filenames?

If regex is an option:
text = """-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\\_WiE10-18.0.100-77.iso
-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log"""
import re
for h in re.findall(r"^.*?\d\d:\d\d:\d\d (.*)",text,flags=re.MULTILINE):
    print(h)
Output:
E:\_WiE10-18.0.100-77.iso
E:\program files\cluster groups\sql server (mssqlserver)\logs\progress-MOD-1523883344023-3001-Windows.log
Pattern explained:
The pattern r"^.*?\d\d:\d\d:\d\d (.*)" looks for the line start '^', then as little of anything as possible '.*?', then the timestamp '\d\d:\d\d:\d\d' followed by a space, and captures everything after that up to the end of the line into a group.
It uses the re.MULTILINE flag for that.
Edit:
Capturing the individual fields needs some more capturing groups:
import re
for h in re.findall(r"^([rwexXst-]+) ([^0-9]+) +\d+.+? +(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)",text,flags=re.MULTILINE):
# ^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^
# flags grpName datetime filename
for k in h:
print(k)
print("")
Output:
-rwx------
Administrators/Domain Users
2018-04-16 16:04:40
E:\_WiE10-18.0.100-77.iso
-rwxrwx---
Administrators/unknown
2018-04-17 01:33:23
E:\program files\cluster groups\sql server (mssqlserver)\logs\progress-MOD-1523883344023-3001-Windows.log

You could use a regular expression like
\b[A-Z]:\\\\.+
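For example, applied with re.findall (a minimal sketch; the sample text is taken verbatim from the question, so the doubled backslashes in the paths are what the \\\\ in the pattern matches):
import re
text = r"""-rwx------ Administrators/Domain Users 456220672 0% 2018-04-16 16:04:40 E:\\_WiE10-18.0.100-77.iso
-rwxrwx--- Administrators/unknown 6677 0% 2018-04-17 01:33:23 E:\\program files\\cluster groups\\sql server (mssqlserver)\\logs\\progress-MOD-1523883344023-3001-Windows.log"""
for path in re.findall(r"\b[A-Z]:\\\\.+", text):
    print(path)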

Aside from using regex, you can try something similar to this.
output = '-rwx------ ... 2018-04-16 16:04:40 E:\\\\_WiE10-18.0.100-77.iso'
drive_letter_start = output.find(':\\\\')
filename = output[drive_letter_start - 1:]
It looks for the first occurrence of ':\\' and takes everything from the drive letter just before that substring to the end of the string, i.e. the full file path.
EDIT
Patrick Artner's answer is better and completely answers OP's question compared to this answer. This only encompasses capturing the file path. I am leaving this answer here should anyone find it useful.

Related

How do I grab specific text in between other text?

I need help grabbing just K334-76A9 from this string:
b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n
Please help, I have tried so many things but none have worked.
Sorry if my question is bad :/
If you want to find anything in the format xxxx-xxxx, no matter what string you have, you can do it like this:
import re
b = '\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
splitString = b.split()
r = re.compile('.{4}-.{4}')
for string in splitString:
    if r.match(string):
        print(string)
Output:
K334-76A9
Here's code that grabs everything after "Serial Number is " up to the next whitespace character.
import re
data = b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
pat = re.compile(r"Serial Number is ([^\s]+)")
match = pat.search(data.decode("ASCII"))
if match:
    print(match.group(1))
Result:
K334-76A9
You can adjust the regular expression per your needs. Regular expressions are Da Bomb! This one's really simple, but you can do amazingly complex things with them.

How can I capture all sentences in a file with the format of (name): (sentence)\n(name):

I have files of transcripts where the format is
(name): (sentence)\n (<-- There can be multiples of this pattern)
(name): (sentence)\n
(sentence)\n
and so on. I need all of the sentences. So far I have gotten it to work by hard-coding the names in the file, but I need it to be generic.
utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)
Python 3.6 using re. Or if anyone knows how to do this using spacy, that would be a great help, thanks.
I want to just grab the \n after an empty statement, and put it in its own string. And I suppose I will just have to grab the tape information given at the end of this, for example, since I can't think of a way to distinguish if the line is part of someone's speech or not.
Also, sometimes there's more than one word between the start of the line and the colon.
Mock data:
CRO: How far are you from the World Trade Center, how many blocks, about? Three or
four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
You can use a lookahead expression that looks for the same pattern of a name at the beginning of a line and is followed by a colon:
s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)
This outputs:
[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
('CALLER', ''),
('CRO', "You're welcome. Thank you.\n"),
('OPERATOR', 'Bye.\n'),
('CRO', 'Bye.\n'),
('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
('OPERATOR NEWELL', 'blah blah.\n'),
('GUY IN DESK', 'I speak words!')]
You never gave us mock data, so I used the following for testing purposes:
name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.
We can try matching using the following pattern:
^\S+:\s+((?:(?!^\S+:).)+)
This can be explained as:
^\S+:\s+ match the name, followed by colon, followed by one or more space
((?:(?!^\S+:).)+) then match and capture everything up until the next name
Note that this handles the edge case of the final sentence, because the negative lookahead used above just would not be true, and hence all remaining content would be captured.
Code sample:
import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)
Output:
['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']

Extracting #mentions from tweets using findall python (Giving incorrect results)

I have a csv file something like this
text
RT #CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in #CellCellPress htp://.co/HrjDwbm7NN
RT #gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT #sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT #MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via #nucAmbiguous htp://…
I want to extract all the mentions (starting with '#') from the tweet text. So far I have done this
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
    result = re.findall("(^|[^#\w])#(\w{1,25})", str(X.iloc[:i,:]))
    print(result);
There are two problems here:
First: with str(X.iloc[:1,:]) it gives me ['CritCareMed'], which is not OK, as it should give me ['CellCellPress']; and with str(X.iloc[:2,:]) it again gives me ['CritCareMed'], which of course is not right either. The final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mention in the 2nd row or the two mentions in the last row.
What I want should look something like this:
How can I achieve these results? This is just sample data; my original data has lots of tweets, so is the approach OK?
You can use the str.findall method to avoid the for loop, and use a negative lookbehind to replace (^|[^#\w]), which forms another capture group you don't need in your regex:
df['mention'] = df.text.str.findall(r'(?<![#\w])#(\w{1,25})').apply(','.join)
df
# text mention
#0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
#1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
#2 RT #gvwilson: Where's the theory for software ... gvwilson
#3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
#4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
Also, X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0]; a better way to iterate through a column is iteritems:
import re
for index, s in df.text.iteritems():
    result = re.findall("(?<![#\w])#(\w{1,25})", s)
    print(','.join(result))
#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous
While you already have your answer, you could even try to optimize the whole import process like so:
import re, pandas as pd
rx = re.compile(r'#([^:\s]+)')
with open("test.txt") as fp:
    dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines())
    df = pd.DataFrame(dft, columns=['text', 'mention'])
print(df)
Which yields:
text mention
0 RT #CritCareMed: New Article: Male-Predominant... CritCareMed
1 #CRISPR Inversion of CTCF Sites Alters Genome ... CellCellPress
2 RT #gvwilson: Where's the theory for software ... gvwilson
3 RT #sciencemagazine: What’s killing off the se... sciencemagazine
4 RT #MHendr1cks: Eve Marder describes a horror ... MHendr1cks,nucAmbiguous
This might be a bit faster as you don't need to change the df once it's already constructed.
mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
#.*? carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) look-ahead for the end of the word or end of the sentence
(?:(?<=\s)|(?<=^)) look-behind to ensure there are no false positives if a # is used in the middle of a word
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
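A quick sanity check of that pattern on a couple of made-up rows (a small sketch, assuming pandas is available; the sample strings are shortened versions of the question's data):
import pandas as pd
mydata = pd.DataFrame({'text': ["RT #CritCareMed: New Article htp://…",
                                "RT #MHendr1cks: blah via #nucAmbiguous htp://…"]})
mentions = mydata['text'].str.findall(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')
print(mentions)
# Row 0 yields ['#CritCareMed:'] and row 1 yields ['#MHendr1cks:', '#nucAmbiguous'];
# each match runs up to the next whitespace, so a trailing ':' is kept when present.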

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> & </temp>); however, other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything lying between the two <temp> tags (because of the .*?).
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, on may appear later in the string, not between <temp></temp>, and I wouldn't want to replace that occurrence.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True))
Output:
<temp>The sale happened {replace} February 22nd</temp>

Find all IPs on an HTML Page

I want to get an HTML page with python and then print out all the IPs from it.
I will define an IP as the following:
x.x.x.x:y
Where:
x = a number between 0 and 256.
y = a number with < 7 digits.
Thanks.
Right. The only part I can't do is the regular expression one. If someone shows me that, I will be fine. – das
import re
ip = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?):\d{1,6}\b")
junk = " 1.1.1.1:123 2.2.2.2:321 312.123.1.12:123 "
print ip.findall(junk)
# outputs ['1.1.1.1:123', '2.2.2.2:321']
Here is a complete example:
import re, urllib2
f = urllib2.urlopen("http://www.samair.ru/proxy/ip-address-01.htm")
junk = f.read()
ip = re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?):\d{1,6}\b")
print ip.findall(junk)
# ['114.30.47.10:80', '118.228.148.83:80', '119.70.40.101:8080', '12.47.164.114:8888',
#  '121.17.161.114:3128', '122.152.183.103:80', '122.224.171.91:3128', '123.234.32.27:8080',
#  '124.107.85.115:80', '124.247.222.66:6588', '125.76.228.201:808', '128.112.139.75:3128',
#  '128.208.004.197:3128', '128.233.252.11:3124', '128.233.252.12:3124']
The basic approach would be (a rough sketch follows below):
Use urllib2 to download the contents of the page
Use a regular expression to extract IPv4-like addresses
Validate each match according to the numeric constraints on each octet
Print out the list of matches
Please provide a clearer indication of what specific part you are having trouble with, along with evidence to show what it is you've tried thus far.
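In the meantime, a rough sketch of those steps (a hedged example rather than a drop-in answer: it uses Python 3's urllib.request instead of the urllib2 used elsewhere in this thread, reuses the example URL from the answer above, and the helper name looks_like_ip is purely illustrative):
import re
from urllib.request import urlopen

url = "http://www.samair.ru/proxy/ip-address-01.htm"
html = urlopen(url).read().decode("utf-8", errors="replace")

# Loose pattern first: candidate x.x.x.x:y strings.
candidates = re.findall(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,6}\b", html)

def looks_like_ip(candidate):
    # Illustrative helper: validate each octet numerically (0-255)
    # instead of encoding the range in the regex.
    address, _, _port = candidate.partition(":")
    return all(0 <= int(octet) <= 255 for octet in address.split("."))

print([c for c in candidates if looks_like_ip(c)])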
Not to turn this into a who's-a-better-regex-author-war but...
(\d{1,3}\.){3}\d{1,3}\:\d{1,6}
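A hedged illustration of how this looser pattern behaves on the sample string from the answer above (finditer is used because the pattern contains a capturing group, so findall would return only that group):
import re
junk = " 1.1.1.1:123 2.2.2.2:321 312.123.1.12:123 "
print([m.group(0) for m in re.finditer(r"(\d{1,3}\.){3}\d{1,3}\:\d{1,6}", junk)])
# ['1.1.1.1:123', '2.2.2.2:321', '312.123.1.12:123'] -- the out-of-range 312 also gets through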
Try:
re.compile(r"\d?\d?\d\.\d?\d?\d\.\d?\d?\d\.\d?\d?\d:\d+").findall(urllib2.urlopen(url).read())
In action:
\b(?: # A.B.C in A.B.C.D:port
(?:
25[0-5]
| 2[0-4][0-9]
| 1[0-9][0-9]
| [1-9]?[0-9]
)\.
){3}
(?: # D in A.B.C.D:port
25[0-5]
| 2[0-4][0-9]
| 1[0-9][0-9]
| [1-9]?[0-9]
)
:[1-9]\d{0,5} # port number any number in (0,999999]
\b
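A minimal sketch of how that verbose pattern might be compiled with re.VERBOSE and run against the small sample string from the earlier answer in this thread:
import re

ip_port = re.compile(r"""
    \b(?:                      # A.B.C in A.B.C.D:port
        (?:
            25[0-5]
          | 2[0-4][0-9]
          | 1[0-9][0-9]
          | [1-9]?[0-9]
        )\.
    ){3}
    (?:                        # D in A.B.C.D:port
        25[0-5]
      | 2[0-4][0-9]
      | 1[0-9][0-9]
      | [1-9]?[0-9]
    )
    :[1-9]\d{0,5}              # port number, any number in (0, 999999]
    \b
    """, re.VERBOSE)

junk = " 1.1.1.1:123 2.2.2.2:321 312.123.1.12:123 "
print(ip_port.findall(junk))   # ['1.1.1.1:123', '2.2.2.2:321']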
