Python String stripping and splitting - python

I'm working with image metadata and able to extract a string that looks like this
Cube1[visible:true, mode:Normal]{r:Cube1.R, g:Cube1.G, b:Cube1.B, a:Cube1.A},
Ground[visible:true, mode:Normal]{r:Ground.R, g:Ground.G, b:Ground.B, a:Ground.A},
Cube3[visible:true, mode:Normal]{r:Cube3.R, g:Cube3.G, b:Cube3.B, a:Cube3.A},
Cube4[visible:true, mode:Normal]{r:Cube4.R, g:Cube4.G, b:Cube4.B, a:Cube4.A},
Sphere[visible:true, mode:Normal]{r:Sphere.R, g:Sphere.G, b:Sphere.B, a:Sphere.A},
OilTank[visible:true, mode:Normal]{r:OilTank.R, g:OilTank.G, b:OilTank.B, a:OilTank.A},
Cube2[visible:true, mode:Normal]{r:Cube2.R, g:Cube2.G, b:Cube2.B, a:Cube2.A}
I what convert that large mess to only the layer names. I also need for the order to stay the same. So, in this case it would be:
Cube1
Ground
Cube3
Cube4
Sphere
OilTank
Cube2
I've tried using "split" and "slice". I'm assuming there is a hierarchy here but I'm not sure where to go next.

If the data is indeed formated like that:
import re
i = [the listed string]
names = [j.strip('[') for j in re.findall("\w+\[\.*", i)]
Output:
['Cube1', 'Ground', 'Cube3', 'Cube4', 'Sphere', 'OilTank', 'Cube2']

If you just need the left-most portion, I would use:
name, _ = line.split("[", 1)
If you need something more complex, I'd look into using regular expressions with the re module… Let me know and I can suggest something.

>>> mess = 'Cube1[visible:true, mode:Normal]{r:Cube1.R, g:Cube1.G, b:Cube1.B, a:Cube1.A},\nGround[visible:true, mode:Normal]{r:Ground.R, g:Ground.G, b:Ground.B, a:Ground.A},\nCube3[visible:true, mode:Normal]{r:Cube3.R, g:Cube3.G, b:Cube3.B, a:Cube3.A},\nCube4[visible:true, mode:Normal]{r:Cube4.R, g:Cube4.G, b:Cube4.B, a:Cube4.A},\nSphere[visible:true, mode:Normal]{r:Sphere.R, g:Sphere.G, b:Sphere.B, a:Sphere.A},\nOilTank[visible:true, mode:Normal]{r:OilTank.R, g:OilTank.G, b:OilTank.B, a:OilTank.A},\nCube2[visible:true, mode:Normal]{r:Cube2.R, g:Cube2.G, b:Cube2.B, a:Cube2.A}'
>>> names = "\n".join(line.split("[", 1)[0] for line in mess.split("\n"))
>>> print names
Cube1
Ground
Cube3
Cube4
Sphere
OilTank
Cube2

I don't know a lot about python, but my thoughts in terms of logic would be this:
Split on the comma character
Loop on the resulting array and cut off everything after the first '[' using substring(indexOf) or similar python manipulation.
Then loop though the array again to concatenate the strings back together.
Sorry I don't know the specific commands for doing this. Hope it helps!

Regexes are unecessary, assuming that really is the exact format of your data.
[i.split('[', 1)[0] for i in lst]

With string split:
names = [ x.split('[')[0] for x in your_text.split('\n') ]
With regular expressions:
import re
names = re.findall(r'^\w+', your_text, re.MULTILINE)

Related

Is there a way I can split this list?

Here's my string:
"1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473|
3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
The end-goal is to use the function .split("|") and have it all nicely arranged. The problem is that some of the movies don't have a "Gross:".
I want my string to look like this:
"1.Movie1|Votes:2,147,833| Gross:$28.34M|2.Movie2|Votes:42,473|3.Movie3|
Votes:23,439| Gross:$0,27M|4.Movie4|Votes:20,940|"
I used .replace to add more "|" to format it easier. I also tried using .split("M "), but since some movies don't have a gross, it would put 2 movies in one line.
I hope this would do it
string = "1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473| 3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
string = string.replace(' ','|').replace('||G','| G').replace('||','|')
this is considering you need an extra space between '|' and 'G' of Gross, if this extra space is not required in the output then you may remove the replace('||G','| G') part from the code to get the desired result
I would strongly recommend you to use dictionaries than lists especially in this case.But if you are so interested in parsing the data your way, then I found a workaround for your data, Maynot work for all the data but it works perfectly well in the given sample data.
a="1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473| 3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
c="|".join(a.split(" "))+"|"
print(c)
Hope this helps
you can do it using split and list comprehension like below
s = "1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473| 3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
"|".join(["|".join([sp2 for sp2 in sp1.split("|") if sp2!=''])
for sp1 in s.split(" ") if sp1!='']) + "|"
Result
'1.Movie1|Votes:2,147,833|Gross:$28.34M|2.Movie2|Votes:42,473|3.Movie3|Votes:23,439|Gross:$0,27M|4.Movie4|Votes:20,940|'

How to escape null characters .i.e [' '] while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = filter(None, str_list)
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
If the timestamps are always after the second _ then you can use str.split and str.strip:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue- your list of matches suddenly have zero-length strings in it and you don't want that.
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[10:-4].split('-')
['0111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

How to split a string and keeping the pattern

This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}\n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with with re.split() -- the key is putting parens around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient nor most Pythonic way to achieve what you want. I.e. I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:
for item_with_delimiter in output:
...
It might be useful to look up os.linesep if you're not sure what the line ending will be. os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. It depends where input data is from, and how flexible your code needs to be across environments.
Adapted from Slice a string after a certain phrase?, you can combine find and slice to get the first part of the string and retain }/n}.
str = "012za}/n}ddfsdfk"
str[:str.find("}/n}")+4]
Will result in 012za}/n}

How to convert a multiline string into a list of lines?

In sikuli I've get a multiline string from clipboard like this...
Names = App.getClipboard();
So Name =
#corazona
#Pebleo00
#cofriasd
«paflio
and I have use this regex to delete the first character if it is not in x00-x7f hex range or is not a word, or is a digit
import re
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", Names)
So now Names =
corazona
Pebleo00
cofriasd
paflio
But, I am having trouble with the second regex that converts "Names" into the items of a sequence. I would like to convert "Names" into...
'corazona', 'Pebleo00', 'cofriasd', 'paflio'
or
'corazona', 'Pebleo00', 'cofriasd', 'paflio',
So sikuli can then recognize it as a List (I've found that Sikuli is able to recognize it even with those last "comma" and "space" in the end) by using...
NamesAsList = eval(Names)
How could I do this in python? is it necessary to use regex, or there is other way to do this in python?
I have already done this but using .Net regex, I just don't know how to do it in python, I have googled it with no result.
This is how I did it using .Net regex
Text to find:
(.*[^$])(\r\n|\z)
Replace with:
'$1',%" "%
Thanks Advanced.
A couple of one liners. Your question isn't completely clear - but I am assuming - you want to split a given string delimited by 'newline' and then generate a list of strings by removing the first character if it's not alpha numeric. Here's how I'd go about it
import re
r = re.compile(r'^[a-zA-Z0-9]') # match # beginning anything that's not alpha numeric
s = '#abc\ndef\nghi'
l = [r.sub('', x) for x in s.split()]
# join this list with comma (if that's required else you got the list already)
','.join(l)
Hope that's what you want.
If Names is a string before you "convert" it, in which each name is separated by a new line ('\n'), then this will work:
NamesAsList = '\n'.split(Names)
See this question for other options.
You could use splitlines()
import re
clipBoard = App.getClipboard();
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", clipBoard)
# Replace the end of a line with a comma.
singleNames = ', '.join(Names.splitlines())
print(singleNames)

Regex issue in python

I have a regex "value=4020a345-f646-4984-a848-3f7f5cb51f21"
if re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x ):
x = re.search( "value=\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*", x )
m = x.group(1)
m only gives me 4020a345, not sure why it does not give me the entire "4020a345-f646-4984-a848-3f7f5cb51f21"
Can anyone tell me what i am doing wrong?
try out this regex, looks like you are trying to match a GUID
value=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
This should match what you want, if all the strings are of the form you've shown:
value=((\w*\d*\-?)*)
You can also use this website to validate your regular expressions:
http://regex101.com/
The below regex works as you expect.
value=([\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*\-\w*|\d*]+)
You are trying to match on some hex numbers, that is why this regex is more correct than using [\w\d]
pattern = "value=([0-9a-fA-F]{8}-([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12})"
data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
res = re.search(pattern, data)
print(res.group(1))
If you dont care about the regex safety, aka checking that it is correct hex, there is no reason not to use simple string manipulation like shown below.
>>> data = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
>>> print(data[7:])
020a345-f646-4984-a848-3f7f5cb51f21
>>> # or maybe
...
>>> print(data[7:].replace('-',''))
020a345f6464984a8483f7f5cb51f21
You can get the subparts of the value as a list
txt = "value=4020a345-f646-4984-a848-3f7f5cb51f21"
parts = re.findall('\w+', txt)[1:]
parts is ['4020a345', 'f646', '4984', 'a848', '3f7f5cb51f21']
if you really want the entire string
full = "-".join(parts)
A simple way
full = re.findall("[\w-]+", txt)[-1]
full is 4020a345-f646-4984-a848-3f7f5cb51f21
value=([\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*\-[\w\d]*)
Try this.Grab the capture.Your regex was not giving the whole as you had used | operator.So if regex on left side of | get satisfied it will not try the latter part.
See demo.
http://regex101.com/r/hQ1rP0/45

Categories