Suggestion for python regex and selecting columns [duplicate] - python

This question already has answers here:
Split string on whitespace in Python [duplicate]
(4 answers)
Closed 8 years ago.
How can I use a regex to select the first 2 columns of each row in a file with 3, 4 or X columns separated by whitespace (not a fixed number of spaces, but a varying number on each line)?
My files consist of : IP [SPACES] Subnet_Mask [SPACES] NEXT_HOP_IP [NEW LINE]
All rows use that format. How can I extract only the first 2 columns? (IP & Subnet mask)
Here is an example on which to try your regex:
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
Ignore the specific IPs. I know the second column does not contain valid address masks; it's just an example.
I already tried:
(?P<IP_ADD>\s*[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(?P<space>\s*)(?P<MASK>[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\s+|\D*))
But it doesn't quite work...

With a regular expression:
If you want to get the first 2 columns, whatever they contain and however much space separates them, you can use \S (matches any non-whitespace character) together with a space-or-tab separator, anchored to the start of each line with ^ and re.MULTILINE:
import re
lines = """
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
"""
# ^ with re.MULTILINE anchors each match to the start of a line, so only
# the first two columns of every row are captured (without the anchor,
# findall would pair up tokens across columns and lines).
regex = re.compile(r'^(\S+)[ \t]+(\S+)', re.MULTILINE)
regex.findall(lines)
Result:
[('10.97.96.0', '10.97.97.128'),
('47.73.4.128', '47.73.7.6'),
('47.73.15.0', '47.73.40.0'),
('85.205.9.164', '85.205.14.44'),
('172.17.103.8', '172.17.103.48'),
('172.17.103.96', '172.17.103.100'),
('172.17.103.140', '172.17.104.44'),
('172.17.105.32', '172.17.105.220')]
Without a regular expression
If you don't want to use a regex but still need to handle multiple spaces, you could also do:
while '  ' in lines:  # notice the two-space string
    lines = lines.replace('  ', ' ')
columns = [line.split(' ')[:2] for line in lines.split('\n') if line]
Pros and cons:
The advantage of using a regex is that it would also parse the data properly if the separators include tabs, which wouldn't be the case with the 2nd solution.
On the other hand, regular expressions require more computing than a simple string splitting, which could make a difference on very large data sets.
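A small sketch of the tab caveat, assuming a made-up sample that mixes spaces and a tab (the sample text and both helper names are hypothetical). It also shows that str.split() with no argument splits on any run of whitespace, so it copes with tabs just as the regex does:

```python
import re

# Hypothetical sample mixing multiple spaces and a tab as separators.
sample = "10.97.96.0   10.97.97.128\t47.73.1.0\n47.73.4.128 47.73.7.6  47.73.8.0\n"

def first_two_regex(text):
    """First two whitespace-separated columns of each line, via a regex."""
    return re.findall(r'^(\S+)[ \t]+(\S+)', text, flags=re.MULTILINE)

def first_two_split(text):
    """Same columns using str.split() with no argument (any whitespace run)."""
    return [tuple(line.split()[:2]) for line in text.splitlines() if line.strip()]

print(first_two_regex(sample))
print(first_two_split(sample))
```

Both calls should give the same pairs here; which one is faster on large files is something to measure rather than assume.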

One liner it is:
[s.split()[:2] for s in string.split('\n')]
Example
string = """10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224"""
print([s.split()[:2] for s in string.split('\n')])
Outputs
[['10.97.96.0', '10.97.97.128'],
['47.73.4.128', '47.73.7.6'],
['47.73.15.0', '47.73.40.0'],
['85.205.9.164', '85.205.14.44'],
['172.17.103.8', '172.17.103.48'],
['172.17.103.96', '172.17.103.100'],
['172.17.103.140', '172.17.104.44'],
['172.17.105.32', '172.17.105.220']]

Since you need "some sort of one-liner", there are many ways that do not involve Python.
Maybe:
| awk '{print $1,$2}'
with anything that produces your input on stdout.

Edited to perform space match with any number of spaces.
You can accomplish this with python regular expressions like this as an option if you know it's going to be the first 2 space separated values.
A good regex cheat sheet will also help you find some shortcuts. Specific token classes such as word characters, whitespace, and digits have shorthand escapes.
import re
line = "10.97.96.0 10.97.97.128 47.73.1.0"
result = re.split(r"\s+", line)[0:2]
result
['10.97.96.0', '10.97.97.128']

Related

How to split a string with multiple delimiters without deleting delimiters in Python?

I currently have a list of filenames in a txt file and I am trying to sort them. The first thing I am trying to do is split them into a list, since they are all on a single line. There are 3 file types in the list. I am able to split the list, but I would like to keep the delimiters in the end result and I have not been able to find a way to do this. The way that I am splitting the files is as follows:
import re

def breakLines():
    unsorted_list = []
    file_obj = open("index.txt", "rt")
    file_str = file_obj.read()
    unsorted_list.append(re.split('.txt|.mpd|.mp4', file_str))
    print(unsorted_list)

breakLines()
I found DeepSpace's answer to be very helpful here Split a string with "(" and ")" and keep the delimiters (Python), but that only seems to work with single characters.
EDIT:
Sample input:
file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4
Expected output:
file_name1234.mp4
file_name1235.mp4
file_name1236.mp4
file_name1237.mp4
In re.split, the key is to parenthesise the split pattern so it's kept in the result of re.split. Your attempt is:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.split('.txt|.mpd|.mp4', s)
['file_name1234', 'file_name1235', 'file_name1236', 'file_name1237', '']
Okay, that doesn't work (and the dots would need escaping to match a literal dot, as a real extension requires), so let's try:
>>> re.split(r'(\.txt|\.mpd|\.mp4)', s)
['file_name1234',
'.mp4',
'file_name1235',
'.mp4',
'file_name1236',
'.mp4',
'file_name1237',
'.mp4',
'']
Works, but this splits the extensions from the filenames and leaves an empty string at the end, which is not what you want (unless you want some ugly post-processing). Plus this is a duplicate question: In Python, how do I split a string and keep the separators?
But you don't want re.split, you want re.findall:
>>> s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(\w*?(?:\.txt|\.mpd|\.mp4))', s)
['file_name1234.mp4',
'file_name1235.mp4',
'file_name1236.mp4',
'file_name1237.mp4']
The expression matches word characters (basically digits, letters & underscores), followed by the extension. To be able to create an OR, I created a non-capturing group inside the main group.
If you have more exotic file names, you can't use \w anymore but it still reasonably works (you may need some str.strip post-processing to remove leading/trailing blanks which are likely not part of the filenames):
>>> s = " file name1234.mp4file-name1235.mp4 file_name1236.mp4file_name1237.mp4"
>>> re.findall(r'(.*?(?:\.txt|\.mpd|\.mp4))', s)
[' file name1234.mp4',
'file-name1235.mp4',
' file_name1236.mp4',
'file_name1237.mp4']
So sometimes you think re.split when you need re.findall, and the reverse is also true.
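As a rough sketch of the re.findall approach wrapped in a reusable helper (the function name split_filenames and its extensions parameter are made up for illustration), the alternation can be built from a list of extensions, with re.escape guarding any regex metacharacters:

```python
import re

def split_filenames(text, extensions=("txt", "mpd", "mp4")):
    """Return each run of text that ends in one of the given extensions."""
    # re.escape guards against regex metacharacters inside an extension.
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    pattern = re.compile(r".*?\.(?:%s)" % alternatives)
    # strip() drops blanks that are unlikely to belong to a filename.
    return [name.strip() for name in pattern.findall(text)]

s = "file_name1234.mp4file_name1235.mp4file_name1236.mp4file_name1237.mp4"
print(split_filenames(s))
```

Note that a name containing an extension in the middle (say, "notes.txt.mp4") would be cut at the first extension, so this only suits inputs where that cannot happen.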

Parsing String by regular expression in python

How can I parse this string in python?
Input String:
someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data
to this
Output array:
['someplace','2018:6:18:0','25.0114','95.2818','2.71164','66.8962','Entire grid contents are set to missing data']
I have already tried with split(' ') but as it is not clear how many spaces are between the sub-strings and inside the last sub-string there may be spaces so this doesn't work.
I need the regular expression.
If you do not provide a separator, Python's str.split(sep=None, maxsplit=-1) (docs) treats consecutive whitespace as a single separator and splits on it. You can limit the number of splits by providing a maxsplit value:
data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
spl = data.split(None, 6)  # no separator given, at most 6 splits
print(spl)
Output:
['someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164',
'66.8962', 'Entire grid contents are set to missing data']
This will work as long as the first text does not contain any whitespace.
If the first text may contain whitespace, you can use/refine this regex solution:
import re
reg = re.findall(r"([^\d]+?) +?([\d:]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +?([\d.]+) +(.*)$",data)[0]
print(reg)
Output:
('someplace', '2018:6:18:0', '25.0114', '95.2818', '2.71164', '66.8962', 'Entire grid contents are set to missing data')
Use e.g. https://regex101.com to check the regex against your other data (follow the link, it uses the above regex on sample data)
[A-Z]{1}[a-zA-Z ]{15,45}|[\w|:|.]+
You can test it here https://pythex.org/
Modify 15,45 according to your needs.
Maxsplit works with re.split(), too:
import re
re.split(r"\s+",text,maxsplit=6)
Out:
['someplace',
'2018:6:18:0',
'25.0114',
'95.2818',
'2.71164',
'66.8962',
'Entire grid contents are set to missing data']
EDIT:
If the first and last text parts don't contain digits, we don't need maxsplit and do not have to rely on number of parts with consecutive spaces:
re.split(r"\s+(?=\d)|(?<=\d)\s+", s)
We cut the string where a space is followed by a digit or vice versa using lookahead and lookbehind.
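A minimal worked example of the lookahead/lookbehind split, assuming the sample line from the question with some extra spaces added for illustration. It relies on the stated condition that the text fields contain no digits:

```python
import re

# Sample line from the question, with a few extra spaces added.
data = ("someplace   2018:6:18:0  25.0114 95.2818   2.71164 66.8962  "
        "Entire grid contents are set to missing data")

# Split where whitespace is followed by a digit, or preceded by one.
parts = re.split(r"\s+(?=\d)|(?<=\d)\s+", data)
print(parts)
```

The spaces inside the trailing message are never next to a digit, so that field stays in one piece.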
It is hard to answer your question as the requirements are not very precise. I think I would split the line with the split() function and then join consecutive items that contain no digits. Here is a snippet that works with your single sample:
def containsNumbers(s):
    return any(c.isdigit() for c in s)

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
lst = data.split()
lst2 = []
i = 0
agg = ''  # accumulates consecutive items that contain no digits
while i < len(lst):
    if containsNumbers(lst[i]):
        if agg != '':
            lst2.append(agg)
            agg = ''
        lst2.append(lst[i])
    else:
        agg += ' '+lst[i]
        agg = agg.strip()
        if i == len(lst) - 1:
            lst2.append(agg)
    i += 1
print(lst2)
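The same split-then-rejoin idea can be sketched more compactly with itertools.groupby. This is a different technique from the loop above, not a transcription of it, and the names has_digit and merge_text_runs are made up for illustration:

```python
from itertools import groupby

def has_digit(token):
    """True if the token contains at least one digit."""
    return any(ch.isdigit() for ch in token)

def merge_text_runs(line):
    """Split on whitespace, then re-join consecutive tokens that contain no digit."""
    result = []
    for numeric, group in groupby(line.split(), key=has_digit):
        tokens = list(group)
        if numeric:
            result.extend(tokens)            # keep each numeric token separate
        else:
            result.append(" ".join(tokens))  # glue a run of words back together
    return result

data = "someplace 2018:6:18:0 25.0114 95.2818 2.71164 66.8962 Entire grid contents are set to missing data"
print(merge_text_runs(data))
```

As with the loop version, re-joining with a single space discards the original spacing inside the text fields.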

Regular Expression in Python 3

I am new here and have just started using regular expressions in my Python code. I have a string which contains 6 commas. One of the commas falls between two quotation marks. I want to get rid of the quotation marks and the last comma.
The input:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
I want this output:
string = 'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The output of my code:
string = 'Fruits,Pear,**CherryApple**,Orange,Cherry'
here is my code in python:
if (re.search('"', string)):
    matches = re.findall(r'\"(.+?)\"',string);
    matches1 = re.sub(",", "", matches[0]);
    string = re.sub(matches[0],matches1,string);
    string = re.sub('"','',string);
My problem is, I want to give a condition that the code only works for the last bit ("Cherry,") but unfortunately it affects other words in the middle (Cherry,Apple), which has the same text as the one between the quotation marks! That results in reducing the number of commas (from 6 to 4) as it merges two fields (Cherry,Apple) and I want to be left with 5 commas.
fullString = '2000-04-24 12:32:00.000,22186CBD0FDEAB049C60513341BA721B,0DDEB5,COMP,Cherry Corp.,DE,100,0.57,100,31213C678CC483768E1282A9D8CB524C,365.00000,business,acquisitions-mergers,acquisition-bid,interest,acquiree,fact,,,,,,,,,,,,,acquisition-interest-acquiree,Cherry Corp. Gets Buyout Offer From Chairman President,FULL-ARTICLE,B5569E,Dow Jones Newswires,0.04,-0.18,0,0,1,0,0,0,0,1,1,5,RPA,DJ,DN20000424000597,"Cherry Corp. Gets Buyout Offer From Chairman President,"\n'
Many Thanks in advance
For your task you don't need regular expressions, just use replace:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
new_string = string.replace('"', '').strip(',')
The best way would be to use the newer regex module where (*SKIP)(*FAIL) is supported:
import regex as re
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
# parts
rx = re.compile(r'"[^"]+"(*SKIP)(*FAIL)|,')
def cleanse(match):
    rxi = re.compile(r'[",]+')
    return rxi.sub('', match)
parts = [cleanse(match) for match in rx.split(string)]
print(parts)
# ['Fruits', 'Pear', 'Cherry', 'Apple', 'Orange', 'Cherry']
Here you match anything between double quotes and throw it away afterwards, thus only commas outside quotes are used for the split operation. The rest is a list comprehension with a cleaning function.
See a demo on regex101.com.
Why not simply use this:
>>>ans_string=string.replace('"','')[0:-1]
Output
>>>ans_string
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
For the sake of simplicity and algorithmic complexity.
You might consider using the csv module to do this.
Example:
import csv
s='Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
>>> ','.join([e.replace(',','') for row in csv.reader([s]) for e in row])
Fruits,Pear,Cherry,Apple,Orange,Cherry
The csv module will strip the quotes but keep the commas on each quoted field. Then you can just remove that comma that was kept.
This will take care of any modifications desired (remove , for example) on a field by field basis. The fields with quotes and commas could be any field in the string.
If your content is in a csv file, you would do something like this (in pseudo code)
with open(file, newline='') as csv_fo:  # Python 3: text mode with newline=''
    # modify(string) stands for what you want to do to each field...
    for row in csv.reader(csv_fo):
        new_row = [modify(field) for field in row]
        # now do what you need with that row
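A runnable Python 3 sketch of the csv approach, under a few assumptions: clean_rows is a hypothetical helper name, io.StringIO stands in for a real file, and the per-field change is str.strip(','), which removes commas only at the ends of a field rather than every comma as replace(',', '') would:

```python
import csv
import io

def clean_rows(fileobj):
    """Yield each CSV row with stray commas stripped from the ends of fields."""
    for row in csv.reader(fileobj):
        yield [field.strip(",") for field in row]

# io.StringIO stands in for a file opened with open(path, newline="").
sample = io.StringIO('Fruits,Pear,Cherry,Apple,Orange,"Cherry,"\n')
for row in clean_rows(sample):
    print(",".join(row))
```

Because csv.reader handles the quoting, the unquoted Cherry,Apple fields in the middle are left alone and only the quoted field loses its trailing comma.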

How to convert a multiline string into a list of lines?

In Sikuli I get a multiline string from the clipboard like this...
Names = App.getClipboard();
So Names =
#corazona
#Pebleo00
#cofriasd
«paflio
and I have used this regex to delete the first character of each line if it is outside the \x00-\x7F range, is not a word character, or is a digit
import re
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", Names)
So now Names =
corazona
Pebleo00
cofriasd
paflio
But, I am having trouble with the second regex that converts "Names" into the items of a sequence. I would like to convert "Names" into...
'corazona', 'Pebleo00', 'cofriasd', 'paflio'
or
'corazona', 'Pebleo00', 'cofriasd', 'paflio',
So sikuli can then recognize it as a List (I've found that Sikuli is able to recognize it even with those last "comma" and "space" in the end) by using...
NamesAsList = eval(Names)
How could I do this in python? is it necessary to use regex, or there is other way to do this in python?
I have already done this but using .Net regex, I just don't know how to do it in python, I have googled it with no result.
This is how I did it using .Net regex
Text to find:
(.*[^$])(\r\n|\z)
Replace with:
'$1',%" "%
Thanks in advance.
A couple of one liners. Your question isn't completely clear - but I am assuming - you want to split a given string delimited by 'newline' and then generate a list of strings by removing the first character if it's not alpha numeric. Here's how I'd go about it
import re
r = re.compile(r'^[^a-zA-Z0-9]')  # match a leading character that is not alphanumeric
s = '#abc\ndef\nghi'
l = [r.sub('', x) for x in s.split()]
# join this list with comma (if that's required else you got the list already)
','.join(l)
Hope that's what you want.
If Names is a string before you "convert" it, in which each name is separated by a new line ('\n'), then this will work:
NamesAsList = Names.split('\n')
See this question for other options.
You could use splitlines()
import re
clipBoard = App.getClipboard();
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", clipBoard)
# Replace the end of a line with a comma.
singleNames = ', '.join(Names.splitlines())
print(singleNames)
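If the goal is an actual list rather than a comma-joined string, splitlines() already provides one and eval is not needed. A minimal sketch, assuming the clipboard text is an ordinary string (the helper name names_to_list is made up, and the pattern is a simplified variant of the one in the question that drops a single leading non-word character or digit):

```python
import re

def names_to_list(text):
    """Split text into lines and drop one leading non-word character or digit."""
    # \W also matches non-ASCII punctuation such as the guillemet in the sample.
    return [re.sub(r"^[\W\d]", "", line) for line in text.splitlines() if line.strip()]

# Stand-in for the clipboard contents shown in the question.
clipboard_text = "#corazona\n#Pebleo00\n#cofriasd\n\u00abpaflio"
print(names_to_list(clipboard_text))
```

This was written against Python 3's re semantics; under Sikuli's Jython the handling of non-ASCII characters by \W may differ, so treat that part as an assumption to verify.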

Python Regular Expression Extract Chunk of Data From Binary File

I have a binary file from which I need to extract a few chunks of data using a Python regular expression.
I need to extract the sets of non-null characters that sit between runs of null characters.
For example this is the main character set:
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56
The regex should extract below character sets from above master set:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32 and
\x56\x65\x00\x35\x56
One important point: a run of null bytes should be treated as a separator only if it contains more than 5 consecutive null bytes; otherwise the null bytes belong to the surrounding chunk. As you can see in the example, a few null characters are also present in the extracted character sets.
If this doesn't make sense, please let me know and I will try to explain it better.
Thanks in advance,
You could split on \x00{5,}
This matches 5 or more zeros. It's the delimiter you specified.
In Perl, it's something like this
Perl test case
$strLangs = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";
# Remove leading zeros (5 or more)
$strLangs =~ s/^\x00{5,}//;
# Split on 5 or more zeros
@Alllangs = split /\x00{5,}/, $strLangs;
# Print each language's characters
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
        printf( "%x,", ord($_));
    }
    print ">\n";
}
Output >>
<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>
You can use split and lstrip with list comprehension as:
s='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
sp=s.split('\x00\x00\x00\x00\x00')
print [i.lstrip('\x00\\') for i in sp if i != ""]
Output:
['\xff\xfe\xfe\x00\x00#A', 'AIW\x00\x00\x00\x002AIW\x00\x00\x00\x002', 'Ve\x005V']
Split the entire data on runs of 5 nul values.
In the resulting list, strip any nuls from the start of each element (this handles a variable number of leftover nuls at the start).
Here's how to do it in Python. I had to str.strip() off any leading and trailing nulls so that re.split() would not return an extra empty string at the beginning of the list of results.
import re
data = ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
'\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
'\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
chunks = re.split(r'\000{6,}', data.strip('\x00'))
# display results
print ',\n'.join(''.join('\\x'+ch.encode('hex_codec') for ch in chunk)
for chunk in chunks),
Output:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32,
\x56\x65\x00\x35\x56
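The snippets above use Python 2 idioms (the print statement, 'hex_codec'). A rough Python 3 sketch of the same split, assuming the data has been read as bytes (for example from a file opened in 'rb' mode), could look like this:

```python
import re

# Same bytes as the sample in the question, written compactly.
data = (b'\x00' * 10 + b'\xff\xfe\xfe\x00\x00\x23\x41'
        + b'\x00' * 8 + b'\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32'
        + b'\x00' * 15 + b'\x56\x65\x00\x35\x56')

# Runs of more than 5 null bytes (6 or more) act as separators; shorter runs
# stay inside a chunk. Stripping first avoids empty chunks at either end.
chunks = re.split(rb'\x00{6,}', data.strip(b'\x00'))

# Iterating over bytes yields ints in Python 3, so format each one as hex.
for chunk in chunks:
    print(''.join('\\x%02x' % byte for byte in chunk))
```

One limitation of any split-based approach: a chunk that legitimately starts or ends with null bytes cannot be told apart from the separator next to it.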
