Extract strings between brackets and nested brackets - python

So I have a file of text and titles, (titles indicated with the starting ";")
;star/stellar_(class(ification))_(chart)
Hertz-sprussels classification of stars is shows us . . .
What I want to do is split it by "_" into
['star/stellar','(class(ification))','(chart)'], then iterate through the parts and extract what's in the brackets, e.g. '(class(ification))' becomes {'class':'ification'} and '(chart)' becomes just ['chart'].
All I've done so far is the splitting part:
for ln in open(file,"r").read().split("\n"):
    if ln.startswith(";"):
        keys=ln[1:].split("_")
I have ways to extract bits in brackets, but I have had trouble finding a way that supports nested brackets in order.
I've tried things like re.findall('\(([^)]+)',ln) but that returns ['class(ification', 'chart']. Any ideas?

You can do this with splits. If you separate the string using '_(' instead of only '_', each part from the second onward will be an enclosed keyword. You can strip the closing parentheses and split those parts on the '(' to get either one component (if there was no nested parenthesis) or two components. You then form either a one-element list or a dictionary depending on the number of components.
line = ";star/stellar_(class(ification))_(chart)"
if line.startswith(";"):
    parts = [ part.rstrip(")") for part in line.split("_(")[1:]]
    parts = [ part.split("(",1) for part in parts ]
    parts = [ part if len(part)==1 else dict([part]) for part in parts ]
    print(parts)
[{'class': 'ification'}, ['chart']]
Note that I assumed that the first part of the string is never included in the process and that there can only be one nested group at the end of the parts. If that is not the case, please update your question with relevant examples and expected output.
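The splitting approach above can be wrapped in a small function so it can be applied to every title line in a file. This is only a sketch under the same assumptions as the answer (the first part is skipped, and each part has at most one nested group); parse_title is a hypothetical name, not something from the original answer.

```python
def parse_title(line):
    """Return the bracketed parts of a ';'-prefixed title line."""
    if not line.startswith(";"):
        return None  # not a title line
    result = []
    # Splitting on "_(" drops the leading un-bracketed part.
    for part in line.split("_(")[1:]:
        pieces = part.rstrip(")").split("(", 1)
        if len(pieces) == 1:
            result.append(pieces)                  # e.g. ['chart']
        else:
            result.append({pieces[0]: pieces[1]})  # e.g. {'class': 'ification'}
    return result


print(parse_title(";star/stellar_(class(ification))_(chart)"))
```

Returning None for non-title lines lets a caller skip body text with a simple check.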

You can split (again) on the parentheses then do some cleaning:
x = ['star/stellar','(class(ification))','(chart)']
for v in x:
    y = v.split('(')
    y = [a.replace(')','') for a in y if a != '']
    if len(y) > 1:
        print(dict([y]))
    else:
        print(y)
Gives:
['star/stellar']
{'class': 'ification'}
['chart']

If all of the title lines have the same format, that is they all have these three parts ;some/title_(some(thing))_(something), then you can catch the different parts to separate variables:
first, second, third = ln.split("_")
From there, you know that:
for the first item you need to drop the ;:
first = first[1:]
for the second item, you want to extract the stuff in the parentheses and then merge it into a dict (this needs import re):
k, v = filter(bool, re.split('[()]', second))
second = {k: v}
for the third item, you want to drop the surrounding parentheses
third = third[1:-1]
Then you just need to put them all together again:
[first, second, third]
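Put together, the steps above might look like the sketch below. It assumes every title line has exactly three "_"-separated parts in the format shown; split_title is a hypothetical name used only for illustration.

```python
import re


def split_title(ln):
    # Assumes exactly three "_"-separated parts, as in the answer above.
    first, second, third = ln.split("_")
    first = first[1:]                                # drop the leading ";"
    k, v = filter(bool, re.split(r"[()]", second))   # 'class', 'ification'
    return [first, {k: v}, third[1:-1]]              # strip the outer parentheses


print(split_title(";star/stellar_(class(ification))_(chart)"))
```

A line with a different number of parts raises ValueError at the unpacking step, which makes format mismatches easy to spot.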

Related

Operate on part of sequence while returning whole sequence

I want to shorten a Python class name by truncating all but the last part, i.e. module.path.to.Class => mo.pa.to.Class.
This could be accomplished by splitting the string, storing the list in a variable, operating on all but the last part, and joining the parts back together.
I would like to know if there is a way to do this in one step, i.e.:
split to parts
create two copies of sequence (tee ?)
apply truncation to one sequence and not the other
join selected parts of sequence
Something like:
'.'.join( [chain(map(lambda x: x[:2], foo[:-1]), bar[-1]) for foo, bar in tee(name.split('.'))] )
But I'm unable to figure out working with ...foo, bar in tee(...
If you want to do it by splitting, you can split once on the last dot, then split the first part again to get the package names, shorten each to its first two characters, and finally join everything back together. If you insist on doing it inline:
name = "module.path.to.Class"
short = ".".join([[x[:2] for x in p.split(".")] + [n] for p, n in [name.rsplit(".", 1)]][0])
print(short) # mo.pa.to.Class
This creates unnecessary lists just so that everything fits inside a single list comprehension. In reality it probably ends up being slower than just doing it in a normal, procedural fashion:
def shorten_path(source):
    indices = source.split(".")
    return ".".join(x[:2] for x in indices[:-1]) + "." + indices[-1]
name = "module.path.to.Class"
print(shorten_path(name)) # mo.pa.to.Class
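One edge case worth noting: for a name that contains no dot, shorten_path as written returns the name with a leading dot, because the empty prefix is still concatenated with ".". A minimal variant, assuming undotted names should come back unchanged (shorten_path_safe is a hypothetical name):

```python
def shorten_path_safe(source):
    # Star-unpacking separates the package names from the final class name.
    *packages, cls = source.split(".")
    # Joining one combined list avoids a stray leading dot when there are no packages.
    return ".".join([p[:2] for p in packages] + [cls])


print(shorten_path_safe("module.path.to.Class"))
print(shorten_path_safe("Class"))
```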
You could do this in one line with a regular expression:
>>> re.sub(r'(\b\w{2})\w*(\.)', r'\1\2', 'module.path.to.Class')
'mo.pa.to.Class'
The pattern r'(\b\w{2})\w*(\.)' captures two groups: the first two letters of a word, and the dot at the end of the word.
The substitution pattern r'\1\2' concatenates the two captured groups - the first two letters of the word and the dot.
No count parameter is passed to re.sub so all occurrences of the pattern are substituted.
The final word - the class name - is not truncated because it isn't followed by a dot, so it doesn't match the pattern.

Dot notation string manipulation

Is there a way to manipulate a string in Python using the following ways?
For any string that is stored in dot notation, for example:
s = "classes.students.grades"
Is there a way to change the string to the following:
"classes.students"
Basically, remove the last period and everything after it. So "restaurants.spanish.food.salty" would become "restaurants.spanish.food".
Additionally, is there any way to identify what comes after the last period? The reason I want to do this is that I want to use isdigit().
So, if it was classes.students.grades.0, could I grab the 0 somehow, so I could use an if statement with isdigit and say: if the part of the string after the last period (so 0 in this case) is a digit, remove it, otherwise leave it.
You can use split and join together:
s = "classes.students.grades"
print('.'.join(s.split('.')[:-1]))
Splitting the string on . gives you a list of strings; join then puts the list elements back into a single string, separated by .
[:-1] picks all the elements of the list except the last one.
To check what comes after the last .:
s.split('.')[-1]
Another way is to use rsplit. It works the same way as split, but if you provide the maxsplit parameter it splits the string starting from the end:
rest, last = s.rsplit('.', 1)
# rest is 'classes.students'
# last is 'grades'
You can also use re.sub to substitute the part after the last . with an empty string:
re.sub(r'\.[^.]+$', '', s)
As for the last part of your question, wrapping the words in [], I would recommend using format and a generator expression:
''.join("[{}]".format(e) for e in s.split('.'))
It'll give you the desired output:
[classes][students][grades]
The best way to do this is using the rsplit method and pass in the maxsplit argument.
>>> s = "classes.students.grades"
>>> before, after = s.rsplit('.', maxsplit=1) # in Python 2.x, pass it positionally: s.rsplit('.', 1)
>>> before
'classes.students'
>>> after
'grades'
You can also use the rfind() method with normal slice operation.
To get everything before last .:
>>> s = "classes.students.grades"
>>> last_index = s.rfind('.')
>>> s[:last_index]
'classes.students'
Then everything after last .
>>> s[last_index + 1:]
'grades'
If '.' is in s, s.rpartition('.') finds the last dot in s
and returns (before_last_dot, dot, after_last_dot):
s = "classes.students.grades"
s.rpartition('.')[0]
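When the string has no dot, str.rpartition returns two empty strings followed by the whole string, so taking index 0 alone would give an empty result. A small sketch that falls back to the original string in that case (drop_last_component is a hypothetical name, and returning the input unchanged is an assumption about the desired behavior):

```python
def drop_last_component(s):
    before, dot, after = s.rpartition(".")
    # With no dot, rpartition returns ('', '', s), so fall back to s itself.
    return before if dot else s


print(drop_last_component("classes.students.grades"))
print(drop_last_component("classes"))
```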
If your goal is to get rid of a final component that's just a single digit, start and end with re.sub():
s = re.sub(r"\.\d$", "", s)
This will do the job, and leave other strings alone. No need to mess with anything else.
If you do want to know about the general case (separate out the last component, no matter what it is), then use rsplit to split your string once:
>>> "hel.lo.there".rsplit(".", 1)
['hel.lo', 'there']
If there's no dot in the string you'll just get a one-element list containing the entire string.
You can do it very simply with rsplit (str.rsplit([sep[, maxsplit]])), which returns a list by splitting the string along the given separator.
You can also specify how many splits should be performed:
>>> s = "res.spa.f.sal.786423"
>>> s.rsplit('.',1)
['res.spa.f.sal', '786423']
So the final function that you describe is:
def dimimak_cool_function(s):
    if '.' not in s:
        return s
    start, end = s.rsplit('.', 1)
    return start if end.isdigit() else s
>>> dimimak_cool_function("res.spa.f.sal.786423")
'res.spa.f.sal'
>>> dimimak_cool_function("res.spa.f.sal")
'res.spa.f.sal'

how to split records with non-standard delimiters

In my csv file I have the following records, each enclosed in brackets and separated by a comma:
(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)
How do I split the data into a list so that I get something more like this:
a1,a2,a3
b1,b2,b3
c1,c2,c3
d1,d2,d3
Currently my python code looks like this:
dump = open('sample_dump.csv','r').read()
splitdump = dump.split('\n')
print(splitdump)
You could do something along the lines of:
Remove first and last brackets
Split by ),( character sequence
To split by a custom string, just add it as a parameter to the split method, e.g.:
line.split("),(")
It's a bit hacky, so you'll have to generalize based on any expected variations in your input data format (e.g. will your first/last chars always be brackets?).
Try this: split first by ")", then join and split again by "(" to leave the tuples without brackets:
_line = dump.split(")")
_line = ''.join(_line).split("(")
print(_line)
# ['', 'a1,a2,a3,', 'b1,b2,b3,', 'c1,c2,c3,', 'd1,d2,d3']
# drop the first, empty element
_line.pop(0)
print(_line)
# ['a1,a2,a3,', 'b1,b2,b3,', 'c1,c2,c3,', 'd1,d2,d3']
First you need to work out the steps you have to perform to get your result. Here's a hacky solution:
remove first and last brackets
use the ),( as the group separator, split
split each group by ,
line = '(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)'
[group.split(',') for group in line[1:-1].split('),(')]
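To get the one-record-per-line output shown in the question, the same idea can be applied without the inner split. This sketch assumes the whole input is a single line that starts with "(" and ends with ")", possibly followed by a newline; split_records is a hypothetical name.

```python
def split_records(dump):
    # strip() removes a trailing newline; the slice drops the outer brackets.
    return dump.strip()[1:-1].split("),(")


for record in split_records("(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)\n"):
    print(record)
```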

How to remove brackets from python string?

I know from the title you might think that this is a duplicate but it's not.
for id,row in enumerate(rows):
    columns = row.findall("td")
    teamName = columns[0].find("a").text, # Lag
    playedGames = columns[1].text, # S
    wins = columns[2].text,
    draw = columns[3].text,
    lost = columns[4].text,
    dif = columns[6].text, # GM-IM
    points = columns[7].text, # P - last column
    dict[divisionName].update({id :{"teamName":teamName, "playedGames":playedGames, "wins":wins, "draw":draw, "lost":lost, "dif":dif, "points":points }})
This is what my Python code looks like. Most of the code is removed, but essentially I am extracting some information from a website and saving it as a dictionary. When I print the dictionary, every value has brackets around it, like ["blbal"], which causes trouble in my iPhone application. I know that I can convert the variables to strings, but I want to know if there is a way to get the information DIRECTLY as a string.
That looks like you have a string inside a list:
["blbal"]
To get the string, just index the list:
l = ["blbal"]
print(l[0])  # blbal
If it is a string, use str.strip, '["blbal"]'.strip("[]"), or slicing, '["blbal"]'[1:-1], if the brackets are always present.
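Judging only from the snippet in the question, another possible cause is the trailing comma after each assignment: in Python, x = value, builds a one-element tuple, and the json module serializes a tuple as an array, which would show up as ["blbal"]. This is an assumption about the asker's code path, since the question does not show how the dictionary reaches the app; the sketch below only demonstrates the language behavior.

```python
import json

team_with_comma = "blbal",   # trailing comma: this is the tuple ('blbal',)
team_plain = "blbal"         # no comma: a plain string

print(json.dumps({"teamName": team_with_comma}))  # {"teamName": ["blbal"]}
print(json.dumps({"teamName": team_plain}))       # {"teamName": "blbal"}
```

If that is what is happening, removing the trailing commas would give plain strings without any post-processing.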
You can also use replace to swap the text or symbols that you don't want for the empty string.
text = ["blbal","test"]
strippedText = str(text).replace('[','').replace(']','').replace('\'','').replace('\"','')
print(strippedText)
import re
text = "some (string) [another string] in brackets"
re.sub(r"\(.*?\)", "", text)
# 'some  [another string] in brackets'
# This handles (); for [] use the pattern r"\[.*?\]" instead.
The pattern \(.*?\) matches a pair of parentheses with text of any length between them, and \[.*?\] does the same for square brackets.
The output will contain neither the matched brackets nor the text inside them.
To match only one kind of bracket, use the pattern for that kind on its own.
To match both () and [] in one go, join the two patterns with the | character: (\(.*?\)|\[.*?\]).
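As a quick illustration of the combined pattern, here is a minimal sketch. The whitespace clean-up at the end is an addition not present in the answer, included because removing a bracketed group leaves the spaces that surrounded it.

```python
import re

text = "some (string) [another string] in brackets"
# Remove both (...) and [...] groups with a single alternation.
stripped = re.sub(r"\(.*?\)|\[.*?\]", "", text)
# Collapse the leftover runs of spaces into single spaces.
cleaned = " ".join(stripped.split())
print(cleaned)  # some in brackets
```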

How to differentiate lines with one dot and two dot?

I want to extract a specific part of a sentence. My problem is that I have a list of sentences that each have different formats. For instance:
X.y.com
x.no
x.com
y.com
z.co.uk
s.com
b.t.com
How can I split these lines based on the number of dots they have? I want the second part of the sentences with two dots and the first part of the sentences with one dot.
You want the part directly preceding the last dot; just split on the dots and take the last-but-one part:
for line in data:
    if '.' not in line:
        continue
    elem = line.strip().split('.')[-2]
For your input, that gives:
>>> for line in data:
...     print(line.strip().split('.')[-2])
...
y
x
x
y
co
s
t
To answer your question, you could use count to count the number of times '.' appears and then do whatever you need.
>>> 't.com'.count('.')
1
>>> 'x.t.com'.count('.')
2
You could use that in a loop:
for s in string_list:
    dots = s.count('.')
    if dots == 1:
        pass  # do something here
    elif dots == 2:
        pass  # do something else
    else:
        pass  # another piece of code
A more Pythonic way to solve your problem:
def test_function(s):
    """
    >>> test_function('b.t.com')
    't'
    >>> test_function('x.no')
    'x'
    >>> test_function('z')
    'z'
    """
    actions = {0: lambda x: x,
               1: lambda x: x.split('.')[0],
               2: lambda x: x.split('.')[1]}
    return actions[s.count('.')](s)
I would follow this logic:
For each line:
remove any spaces at beginning and end
split the line by dots
take the last-but-one part of the split list
This should give you the part of the sentence you're looking for.
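The steps above can be sketched as a small function. Returning the line unchanged when it contains no dot is an assumption, since the question does not say what should happen in that case; part_before_last_dot is a hypothetical name.

```python
def part_before_last_dot(line):
    # Remove surrounding whitespace, then split on dots.
    parts = line.strip().split(".")
    # With no dot there is only one part, so return it as is (an assumption).
    return parts[-2] if len(parts) > 1 else parts[0]


data = ["X.y.com", "x.no", "z.co.uk", "b.t.com"]
print([part_before_last_dot(line) for line in data])
```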
Simply use the split function.
a = 'x.com'
b = a.split('.')
This will make a list of 2 items in b. If you have two dots, the list will contain 3 items. The function actually splits the string based on the given character.
