Here's my string:
"1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473|
3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
The end-goal is to use the function .split("|") and have it all nicely arranged. The problem is that some of the movies don't have a "Gross:".
I want my string to look like this:
"1.Movie1|Votes:2,147,833| Gross:$28.34M|2.Movie2|Votes:42,473|3.Movie3|
Votes:23,439| Gross:$0,27M|4.Movie4|Votes:20,940|"
I used .replace to add more "|" to format it easier. I also tried using .split("M "), but since some movies don't have a gross, it would put 2 movies in one line.
I hope this will do it:
string = "1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473| 3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
string = string.replace(' ','|').replace('||G','| G').replace('||','|')
This assumes you need an extra space between the '|' and the 'G' of Gross; if that extra space is not required in the output, you may remove the .replace('||G','| G') part from the code to get the desired result.
I would strongly recommend using dictionaries rather than lists, especially in this case. But if you are set on parsing the data your way, then I found a workaround for your data. It may not work for all data, but it works perfectly well on the given sample.
a="1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473| 3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
c="|".join(a.split(" "))+"|"
print(c)
Hope this helps
You can do it using split and a list comprehension like below:
s = "1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473| 3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940"
"|".join(["|".join([sp2 for sp2 in sp1.split("|") if sp2 != ''])
          for sp1 in s.split(" ") if sp1 != '']) + "|"
Result
'1.Movie1|Votes:2,147,833|Gross:$28.34M|2.Movie2|Votes:42,473|3.Movie3|Votes:23,439|Gross:$0,27M|4.Movie4|Votes:20,940|'
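For completeness, a regex sketch of the same idea (assuming, as in the sample, that every entry starts with digits followed by a dot): replace the space before each entry marker with '|', collapse any doubled bar left where '| ' already preceded the marker, and append the trailing '|':

```python
import re

s = ("1.Movie1|Votes:2,147,833| Gross:$28.34M 2.Movie2|Votes:42,473| "
     "3.Movie3|Votes:23,439| Gross:$0,27M 4.Movie4|Votes:20,940")

# Replace the space before each "N." entry marker with "|";
# where "| " already preceded the marker this leaves "||", so collapse it
result = re.sub(r' (?=\d+\.)', '|', s).replace('||', '|') + '|'
print(result)
```

The lookahead leaves the space in "| Gross" untouched, since "Gross" does not start with digits and a dot.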
I want to test if certain characters are in a line of text. The condition is simple but characters to be tested are many.
Currently I am using \ at the end of each line for easy viewing, but it feels clumsy. What's the way to make the lines look nicer?
text = "Tel+971-2526-821 Fax:+971-2526-821"
if "971" in text or \
   "(84)" in text or \
   "+66" in text or \
   "(452)" in text or \
   "19 " in text:
    print "foreign"
Why not extract the phone numbers from the string and run your tests on those?
text = "Tel:+971-2526-821 Fax:+971-2526-821"
tel, fax = text.split()
tel_prefix, *_ = tel.split(':')[-1].split('-')
fax_prefix, *_ = fax.split(':')[-1].split('-')
if tel_prefix in ("971", "(84)"):
    print("Foreigner")
For Python 2.x:
tel_prefix = tel.split(':')[-1].split('-')[0]
fax_prefix = fax.split(':')[-1].split('-')[0]
Enlightened by @Patrick Haugh in the comments, we can do:
text = "Tel+971-2526-821 Fax:+971-2526-821"
if any(x in text for x in ("971", "(84)", "+66", "(452)", "19 ")):
    print "foreign"
You can use the any builtin function to check whether any one of the tokens exists in the text. If you would like to check that all the tokens exist in the string, you can replace any below with the all function. Cheers!
text = 'Hello your number is 19 '
tokens = ('971', '(84)', '+66', '(452)', '19 ')
if any(token in text for token in tokens):
    print('Foreign')
Output:
Foreign
Existing comments mention that you can't really have multiple or statements like you intend, but using generators/comprehensions and the any() function you are able to come up with a serviceable option, such as the snippet if any(x in text for x in ('971', '(84)', '+66', '(452)', '19 ')): that @Patrick Haugh recommended.
I would recommend using regular expressions instead as a more versatile and efficient way of solving the problem. You could either generate the pattern dynamically, or for the purpose of this problem, the following snippet would work (don't forget to escape parentheses):
import re
text = 'Tel:+971-2526-821 Fax:+971-2526-821'
pattern = r'(971|\(84\)|66|\(452\)|19)'
prog = re.compile(pattern)
if prog.search(text):
    print 'foreign'
If you are searching many lines of text or large bodies of text for multiple possible substrings, this approach will be faster and more reusable. You only have to compile prog once, and then you can use it as often as you'd like.
As far as dynamic generation of a pattern is concerned, a naive implementation might do something like this:
match_list = ['971', '(84)', '66', '(452)', '19']
pattern = '|'.join(map(lambda s: s.replace('(', r'\(').replace(')', r'\)'), match_list)).join(['(', ')'])
The variable match_list could then be updated and modified as needed. There is a slight inefficiency in running two passes of replace(), and @Andrew Clark has a good trick for fixing that, but I don't want this answer to be too long and cumbersome.
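The trick alluded to is presumably re.escape, which escapes every regex metacharacter (parentheses included) in a single pass; a sketch:

```python
import re

match_list = ['971', '(84)', '66', '(452)', '19']

# re.escape handles parentheses and any other metacharacters in one pass
pattern = '(' + '|'.join(map(re.escape, match_list)) + ')'
prog = re.compile(pattern)

print(bool(prog.search('Tel:+971-2526-821')))  # True
```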
You can construct a lambda function that checks if a value is in the text, and then map this function to all of the values:
text = "Tel:+971-2526-821 Fax:+971-2526-821"
print any(map((lambda x: x in text), ["971", "(84)", "+66", "(452)", "19 "]))
The result is True, which means at least one of the values is in text.
I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = filter(None, str_list)  # on Python 3, wrap in list(): list(filter(None, str_list))
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
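A quick sketch of that named-group variant with groupdict(), using the group names from the pattern above:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Named groups let you retrieve the timestamps by name instead of position
m = re.search(r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.', f)
print(m.groupdict())
# {'groupA': '20111007T084734', 'groupB': '20111008T023142'}
```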
If the timestamps are always after the second _ then you can use str.split and str.strip:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
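One caveat about the snippet above: str.strip(".txt") removes any of the characters '.', 't', 'x' from both ends of the string, not the literal suffix, which can silently eat extra characters on other filenames. On Python 3.9+, str.removesuffix removes exactly the literal suffix:

```python
strs = "000014_L_20111007T084734-20111008T023142.txt"

# removesuffix (Python 3.9+) strips the literal ".txt" suffix only,
# unlike strip(".txt"), which removes any of '.', 't', 'x' from both ends
result = strs.removesuffix(".txt").split("_", 2)[-1].split("-")
print(result)  # ['20111007T084734', '20111008T023142']
```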
Since this came up on Google, and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it, and you don't want that.
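A sketch of the re.findall approach on the same filename (the token pattern \d+T\d+ is an assumption about the timestamp shape: digits, a 'T', more digits):

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# findall returns only the matching tokens, so no empty strings appear
time_info = re.findall(r'\d+T\d+', f)
print(time_info)  # ['20111007T084734', '20111008T023142']
```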
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']
This is how the string splitting works for me right now:
output = string.encode('UTF8').split('}/n}')[0]
output += '}\n}'
But I am wondering if there is a more pythonic way to do it.
The goal is to get everything before this '}/n}' including '}/n}'.
This might be a good use of str.partition.
string = '012za}/n}ddfsdfk'
parts = string.partition('}/n}')
# ('012za', '}/n}', 'ddfsdfk')
''.join(parts[:-1])
# 012za}/n}
Or, you can find it explicitly with str.index.
repl = '}/n}'
string[:string.index(repl) + len(repl)]
# 012za}/n}
This is probably better than using str.find since an exception will be raised if the substring isn't found, rather than producing nonsensical results.
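To see the failure mode the answer alludes to, a quick sketch assuming the marker is absent from the input:

```python
s = '012za'      # input without the '}/n}' marker
repl = '}/n}'

# find returns -1 when the substring is absent...
print(s.find(repl))                   # -1
# ...so the slice silently truncates instead of failing
print(s[:s.find(repl) + len(repl)])   # '012'
# s.index(repl) would raise ValueError here, surfacing the problem
```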
It seems like anything "more elegant" would require regular expressions.
import re
re.search('(.*?}/n})', string).group(0)
# 012za}/n}
It can be done with re.split() -- the key is putting parens around the split pattern to preserve what you split on:
import re
output = "".join(re.split(r'(}/n})', string.encode('UTF8'))[:2])
However, I doubt that this is either the most efficient or the most Pythonic way to achieve what you want. I.e., I don't think this is naturally a split sort of problem. For example:
tag = '}/n}'
encoded = string.encode('UTF8')
output = encoded[:encoded.index(tag)] + tag
or if you insist on a one-liner:
output = (lambda string, tag: string[:string.index(tag)] + tag)(string.encode('UTF8'), '}/n}')
or returning to regex:
output = re.match(r".*}/n}", string.encode('UTF8')).group(0)
>>> string_to_split = 'first item{\n{second item'
>>> sep = '{\n{'
>>> output = [item + sep for item in string_to_split.split(sep)]
NOTE: output = ['first item{\n{', 'second item{\n{']
then you can use the result:
for item_with_delimiter in output:
...
It might be useful to look up os.linesep if you're not sure what the line ending will be. os.linesep is whatever the line ending is under your current OS, so '\r\n' under Windows or '\n' under Linux or Mac. It depends where input data is from, and how flexible your code needs to be across environments.
Adapted from Slice a string after a certain phrase?, you can combine find and slice to get the first part of the string and retain }/n}.
s = "012za}/n}ddfsdfk"
s[:s.find("}/n}") + len("}/n}")]
Will result in 012za}/n}
In a file with 3, 4 or X columns separated by spaces (not a constant number of spaces, but multiple spaces on each line), how can I select the first 2 columns of each row with a regex?
My files consist of : IP [SPACES] Subnet_Mask [SPACES] NEXT_HOP_IP [NEW LINE]
All rows use that format. How can I extract only the first 2 columns? (IP & Subnet mask)
Here is an example on which to try your regex:
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
Don't look at the specific IPs. I know the second column is not formed of valid address masks. It's just an example.
I already tried:
(?P<IP_ADD>\s*[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(?P<space>\s*)(?P<MASK>[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\s+|\D*))
But it doesn't quite work...
With a regular expression:
If you want to get the 2 first columns, whatever they contain, and whatever amount of space separates them, you can use \S (matches anything but whitespaces) and \s (matches whitespaces only) to achieve that:
import re
lines = """
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
"""
regex = re.compile(r'(\S+)\s+(\S+)')
regex.findall(lines)
Result:
[('10.97.96.0', '10.97.97.128'),
('47.73.1.0', '47.73.4.128'),
('47.73.7.6', '47.73.8.0'),
('47.73.15.0', '47.73.40.0'),
('47.73.41.0', '85.205.9.164'),
('85.205.14.44', '172.17.103.0'),
('172.17.103.8', '172.17.103.48'),
('172.17.103.56', '172.17.103.96'),
('172.17.103.100', '172.17.103.136'),
('172.17.103.140', '172.17.104.44'),
('172.17.105.28', '172.17.105.32'),
('172.17.105.220', '172.17.105.224')]
Without a regular expression
If you didn't want to use a regex, and still be able to handle multiple spaces, you could also do:
while '  ' in lines:  # notice the two-spaces string
    lines = lines.replace('  ', ' ')
columns = [line.split(' ')[:2] for line in lines.split('\n') if line]
Pros and cons:
The advantage of using a regex is that it would also parse the data properly if the separators include tabs, which wouldn't be the case with the 2nd solution.
On the other hand, regular expressions require more computing than a simple string splitting, which could make a difference on very large data sets.
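As a middle ground, str.split() with no argument already collapses any run of whitespace (spaces or tabs), avoiding both the regex and the replace loop; a sketch on two of the sample rows:

```python
lines = """10.97.96.0     10.97.97.128    47.73.1.0
47.73.4.128    47.73.7.6       47.73.8.0"""

# split() with no separator treats any whitespace run as one delimiter
columns = [line.split()[:2] for line in lines.splitlines() if line.strip()]
print(columns)
# [['10.97.96.0', '10.97.97.128'], ['47.73.4.128', '47.73.7.6']]
```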
One liner it is:
[s.split()[:2] for s in string.split('\n')]
Example
string = """10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224"""
print [s.split()[:2] for s in string.split('\n')]
Outputs
[['10.97.96.0', '10.97.97.128'],
 ['47.73.4.128', '47.73.7.6'],
 ['47.73.15.0', '47.73.40.0'],
 ['85.205.9.164', '85.205.14.44'],
 ['172.17.103.8', '172.17.103.48'],
 ['172.17.103.96', '172.17.103.100'],
 ['172.17.103.140', '172.17.104.44'],
 ['172.17.105.32', '172.17.105.220']]
Since you need "some sort of one-liner", there are many ways that do not involve Python.
Maybe:
| awk '{print $1,$2}'
with anything that produces your input on stdout.
Edited to perform space match with any number of spaces.
If you know it's going to be the first 2 space-separated values, you can accomplish this with Python regular expressions as one option.
A nice regex cheat sheet will also help you find some shortcuts: specific token classes like words, spaces, and numbers have these little shortcuts.
import re
line = "10.97.96.0 10.97.97.128 47.73.1.0"
result = re.split(r"\s+", line)[0:2]
result
['10.97.96.0', '10.97.97.128']
I'm working with image metadata and I'm able to extract a string that looks like this:
Cube1[visible:true, mode:Normal]{r:Cube1.R, g:Cube1.G, b:Cube1.B, a:Cube1.A},
Ground[visible:true, mode:Normal]{r:Ground.R, g:Ground.G, b:Ground.B, a:Ground.A},
Cube3[visible:true, mode:Normal]{r:Cube3.R, g:Cube3.G, b:Cube3.B, a:Cube3.A},
Cube4[visible:true, mode:Normal]{r:Cube4.R, g:Cube4.G, b:Cube4.B, a:Cube4.A},
Sphere[visible:true, mode:Normal]{r:Sphere.R, g:Sphere.G, b:Sphere.B, a:Sphere.A},
OilTank[visible:true, mode:Normal]{r:OilTank.R, g:OilTank.G, b:OilTank.B, a:OilTank.A},
Cube2[visible:true, mode:Normal]{r:Cube2.R, g:Cube2.G, b:Cube2.B, a:Cube2.A}
I want to convert that large mess to only the layer names. I also need the order to stay the same. So, in this case it would be:
Cube1
Ground
Cube3
Cube4
Sphere
OilTank
Cube2
I've tried using "split" and "slice". I'm assuming there is a hierarchy here but I'm not sure where to go next.
If the data is indeed formatted like that:
import re
i = [the listed string]
names = [j.strip('[') for j in re.findall(r"\w+\[", i)]
Output:
['Cube1', 'Ground', 'Cube3', 'Cube4', 'Sphere', 'OilTank', 'Cube2']
If you just need the left-most portion, I would use:
name, _ = line.split("[", 1)
If you need something more complex, I'd look into using regular expressions with the re module… Let me know and I can suggest something.
>>> mess = 'Cube1[visible:true, mode:Normal]{r:Cube1.R, g:Cube1.G, b:Cube1.B, a:Cube1.A},\nGround[visible:true, mode:Normal]{r:Ground.R, g:Ground.G, b:Ground.B, a:Ground.A},\nCube3[visible:true, mode:Normal]{r:Cube3.R, g:Cube3.G, b:Cube3.B, a:Cube3.A},\nCube4[visible:true, mode:Normal]{r:Cube4.R, g:Cube4.G, b:Cube4.B, a:Cube4.A},\nSphere[visible:true, mode:Normal]{r:Sphere.R, g:Sphere.G, b:Sphere.B, a:Sphere.A},\nOilTank[visible:true, mode:Normal]{r:OilTank.R, g:OilTank.G, b:OilTank.B, a:OilTank.A},\nCube2[visible:true, mode:Normal]{r:Cube2.R, g:Cube2.G, b:Cube2.B, a:Cube2.A}'
>>> names = "\n".join(line.split("[", 1)[0] for line in mess.split("\n"))
>>> print names
Cube1
Ground
Cube3
Cube4
Sphere
OilTank
Cube2
I don't know a lot about python, but my thoughts in terms of logic would be this:
Split on the comma character
Loop on the resulting array and cut off everything after the first '[' using substring(indexOf) or similar python manipulation.
Then loop through the array again to concatenate the strings back together.
Sorry I don't know the specific commands for doing this. Hope it helps!
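A sketch of that logic in Python, on a two-entry sample of the data; note the split is on '},' rather than a bare comma, since commas also appear inside the bracketed and braced sections:

```python
data = ("Cube1[visible:true, mode:Normal]{r:Cube1.R, g:Cube1.G, b:Cube1.B, a:Cube1.A},"
        "Ground[visible:true, mode:Normal]{r:Ground.R, g:Ground.G, b:Ground.B, a:Ground.A}")

# Split on '},' (a bare ',' also occurs inside the brackets/braces),
# then cut each piece at its first '[' to keep only the layer name
names = [part.strip().split('[', 1)[0] for part in data.split('},') if part.strip()]
print(names)  # ['Cube1', 'Ground']
```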
Regexes are unnecessary, assuming that really is the exact format of your data.
[i.split('[', 1)[0] for i in lst]
With string split:
names = [ x.split('[')[0] for x in your_text.split('\n') ]
With regular expressions:
import re
names = re.findall(r'^\w+', your_text, re.MULTILINE)