I want to extract a specific part of a sentence. My problem is that I have a list of sentences that each have different formats. For instance:
X.y.com
x.no
x.com
y.com
z.co.uk
s.com
b.t.com
How can I split these lines based on the number of dots they have? I want the second part of the sentences with two dots and the first part of the sentences with one dot.
You want the part directly preceding the last dot; just split on the dots and take the second-to-last part:
for line in data:
    if '.' not in line:
        continue
    elem = line.strip().split('.')[-2]
For your input, that gives:
>>> for line in data:
...     print line.strip().split('.')[-2]
...
y
x
x
y
co
s
t
To answer your question, you could use count to count the number of times '.' appears and then do whatever you need.
>>> 't.com'.count('.')
1
>>> 'x.t.com'.count('.')
2
You could use that in a loop:
for s in string_list:
    dots = s.count('.')
    if dots == 1:
        pass  # do something here
    elif dots == 2:
        pass  # do something else
    else:
        pass  # another piece of code
A more Pythonic way to solve your problem:
def test_function(s):
    """
    >>> test_function('b.t.com')
    't'
    >>> test_function('x.no')
    'x'
    >>> test_function('z')
    'z'
    """
    actions = {0: lambda x: x,
               1: lambda x: x.split('.')[0],
               2: lambda x: x.split('.')[1]}
    return actions[s.count('.')](s)
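Note that a lookup table keyed on the dot count raises KeyError for any count it does not list (three or more dots here). A minimal sketch of one way around that, using dict.get with a default; it maps each dot count to an index rather than to a lambda. The name pick_part and the fallback rule are assumptions made for illustration, not part of the answer above.

```python
def pick_part(s):
    """Pick a part of s based on how many dots it contains."""
    parts = s.strip().split('.')
    dots = len(parts) - 1
    # Same choices as the table above, stored as the index of the part to return.
    index_for = {0: 0, 1: 0, 2: 1}
    # Assumption: with three or more dots, take the part before the last dot.
    return parts[index_for.get(dots, -2)]


for name in ['b.t.com', 'x.no', 'z', 'a.b.c.d']:
    print(name, '->', pick_part(name))
```

Whether the fallback should return a part, return None, or raise depends on what the unlisted formats mean in your data.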
I would follow this logic:
For each line:
remove any spaces at beginning and end
split the line by dots
take the second-to-last part of the resulting list
This should give you the part of the sentence you're looking for.
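The steps above could be sketched roughly like this in Python 3. The function name part_before_last_dot and the choice to return None for a line without dots are assumptions made for the example.

```python
def part_before_last_dot(line):
    """Strip the line, split it on dots, and return the second-to-last part."""
    parts = line.strip().split('.')
    if len(parts) < 2:
        # Assumption: a line without any dot has no "part before the last dot".
        return None
    return parts[-2]


lines = ['X.y.com\n', '  x.no  ', 'z.co.uk', 'nodots']
print([part_before_last_dot(line) for line in lines])
# ['y', 'x', 'co', None]
```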
Simply use the split function.
a = 'x.com'
b = a.split('.')
This will make a list of 2 items in b. If you have two dots, the list will contain 3 items. The function actually splits the string based on the given character.
I have a file of text and titles (titles are indicated by a leading ";"):
;star/stellar_(class(ification))_(chart)
Hertz-sprussels classification of stars is shows us . . .
What I want to do is split each title on "_" into
['star/stellar','(class(ification))','(chart)'], then iterate through the parts and extract what's in the brackets, e.g. '(class(ification))' becomes {'class':'ification'} and '(chart)' becomes just ['chart'].
All I've done so far is the splitting part:
for ln in open(file,"r").read().split("\n"):
    if ln.startswith(";"):
        keys=ln[1:].split("_")
I have ways to extract bits in brackets, but I have had trouble finding a way that supports nested brackets in order.
I've tried things like re.findall('\(([^)]+)',ln) but that returns ['star/stellar', '(class', 'chart']. Any ideas?
You can do this with splits. If you separate the string on '_(' instead of only '_', every part from the second onward will be an enclosed keyword. You can strip the closing parentheses and split those parts on '(' to get either one component (if there was no nested parenthesis) or two components. You then form either a one-element list or a dictionary, depending on the number of components.
line = ";star/stellar_(class(ification))_(chart)"
if line.startswith(";"):
    parts = [part.rstrip(")") for part in line.split("_(")[1:]]
    parts = [part.split("(", 1) for part in parts]
    parts = [part if len(part) == 1 else dict([part]) for part in parts]
    print(parts)
[{'class': 'ification'}, ['chart']]
Note that I assumed that the first part of the string is never included in the process and that there can only be one nested group at the end of the parts. If that is not the case, please update your question with relevant examples and expected output.
You can split (again) on the parentheses then do some cleaning:
x = ['star/stellar','(class(ification))','(chart)']
for v in x:
    y = v.split('(')
    y = [a.replace(')', '') for a in y if a != '']
    if len(y) > 1:
        print(dict([y]))
    else:
        print(y)
Gives:
['star/stellar']
{'class': 'ification'}
['chart']
If all of the title lines have the same format, that is, they all have these three parts ;some/title_(some(thing))_(something), then you can unpack the different parts into separate variables:
first, second, third = ln.split("_")
From there, you know that:
For the first item, you need to drop the leading ;:
first = first[1:]
For the second item, you want to extract the text in the parentheses and put it into a dict (this needs import re):
k, v = filter(bool, re.split('[()]', second))
second = {k: v}
For the third item, you want to drop the surrounding parentheses:
third = third[1:-1]
Then you just need to put them all together again:
[first, second, third]
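Put together, those steps might look like the sketch below. It assumes every title line has exactly the three-part shape described above; any other shape makes the unpacking raise ValueError. The name parse_title is made up for the example.

```python
import re


def parse_title(ln):
    """Parse ';some/title_(some(thing))_(something)' into [str, dict, str]."""
    # Assumption: exactly three "_"-separated parts.
    first, second, third = ln.split("_")
    first = first[1:]  # drop the leading ";"
    # Splitting on either parenthesis leaves empty strings; filter removes them.
    k, v = filter(bool, re.split(r"[()]", second))
    return [first, {k: v}, third[1:-1]]


print(parse_title(";star/stellar_(class(ification))_(chart)"))
# ['star/stellar', {'class': 'ification'}, 'chart']
```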
This is my first SO post, so go easy! I have a script that counts how many matches occur in a string named postIdent for the substring ff. Based on this it then iterates over postIdent and extracts all of the data following it, like so:
substring = 'ff'
global occurences
occurences = postIdent.count(substring)
x = 0
while x <= occurences:
    for i in postIdent.split("ff"):
        rawData = i
        required_Id = rawData[-8:]
        x += 1
To explain further, if we take the string "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff", it is clear there are 3 instances of ff. I need to get the 8 preceding characters at every instance of the substring ff, so for the first instance this would be 909a9090.
With the rawData, I essentially need to offset the variable required_Id by -1 when I get the data out of the split() method, as I am currently getting the last 8 characters of the current string, not the string I have just split. Another way of doing it could be to pass the current required_Id to the next iteration, but I've not been able to do this.
The split method gets everything after the matching string ff.
Using the partition method can get me the data I need, but does not allow me to iterate over the string in the same way.
Get the last 8 characters of each split piece using a slice operation in a list comprehension:
s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print([x[-8:] for x in s.split('ff') if x])
# ['909a9090', '90434390', 'sdfs9000']
Not a difficult problem, but tricky for a beginner.
If you split the string on 'ff' then you appear to want the eight characters at the end of every substring but the last. The last eight characters of string s can be obtained using s[-8:]. All but the last element of a sequence x can similarly be obtained with the expression x[:-1].
Putting both those together, we get
subject = '090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff'
for x in subject.split('ff')[:-1]:
    print(x[-8:])
This should print
909a9090
90434390
sdfs9000
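The two split-based approaches agree on this input because the string ends with 'ff'. They can differ when it does not: filtering out empty pieces keeps whatever follows the last 'ff', while dropping the last piece with [:-1] does not. A small sketch to show the difference; the function name chars_before and the sample string are made up for the example.

```python
def chars_before(text, marker, width=8):
    """Return up to `width` characters preceding each occurrence of `marker`."""
    # The piece after the final marker precedes nothing, so drop it.
    pieces = text.split(marker)[:-1]
    return [piece[-width:] for piece in pieces]


sample = "0123456789ffabcdefghijffTAIL"
print(chars_before(sample, "ff"))
# ['23456789', 'cdefghij']
print([x[-8:] for x in sample.split("ff") if x])
# ['23456789', 'cdefghij', 'TAIL']
```

Both versions only look at the text since the previous marker, so a piece shorter than eight characters yields fewer than eight characters.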
I wouldn't do this with split myself, I'd use str.find. This code isn't fancy but it's pretty easy to understand:
fullstr = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
search = "ff"
found = None  # offset of the next match
last = 0      # where the next search starts
l = 8         # number of preceding characters to extract
print(fullstr)
while True:
    found = fullstr.find(search, last)
    if found == -1:
        break
    # max() keeps the slice start from going negative near the beginning
    preceding = fullstr[max(0, found - l):found]
    print("At position {} found preceding characters '{}' ".format(found, preceding))
    last = found + len(search)
Overall I like Austin's answer more; it's a lot more elegant.
I want to find the index of two substrings in a string of characters given like this:
find_start = '1L'
find_end = 'L'
>>> blah = 'A1LELST5W'
>>> blah.index('1L')
1
>>> blah.index('L')
2 # i want it to give me 4
If I use the index method, it gives me the "L" that's the third character in the string. But I want it to treat "1L" and "L" as separate strings and give me the fifth character instead.
Is there a simple way of doing this? Or would I have to store everything except find_start in a new string and then try to index through that? (But that would mess with the position of everything inside the string).
The str.index method has start and end arguments that allow you to constrain the search. So you just need to start the second search where the first one ends:
>>> find_start = '1L'
>>> find_end = 'L'
>>> blah = 'A1LELST5W'
>>> first = blah.index('1L')
>>> first
1
>>> blah.index('L', first + len(find_start))
4
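The same idea wrapped in a small helper, as a sketch. The name index_after is made up for the example; str.index raises ValueError if either marker is missing, so use str.find instead if you would rather get -1.

```python
def index_after(text, start_marker, end_marker):
    """Return the index of start_marker and of the first end_marker after it."""
    start = text.index(start_marker)
    # Begin the second search just past the end of the first match.
    end = text.index(end_marker, start + len(start_marker))
    return start, end


print(index_after('A1LELST5W', '1L', 'L'))
# (1, 4)
```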
The input is exactly as below:
"dinem-5554\tlee"
I need to perform a regex match to get the value before \tlee, that is, dinem-5554. This is what I've tried:
m = re.findall(r'(\tlee)',a)[0]
if m:
    print m
else:
    print "none"
You don't need to use a regex. Use the built-in split method of str.
my_string = "dinem-5554\tlee"
groups = my_string.split('\tlee', 1)
# split always returns at least one item, so check for a second one
if len(groups) > 1:
    print groups[0]
else:
    print 'none'
Or if you mean to split at the tab character:
groups = my_string.split('\t', 1)
Note that the second argument determines the number of times to split. If my_string contained multiple tab characters, it would only be split at the first one.
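As an alternative sketch in Python 3, str.partition splits at the first occurrence only and also reports whether the separator was found. The helper name value_before is an assumption made for the example.

```python
def value_before(text, sep='\t'):
    """Return the text before the first sep, or None if sep does not occur."""
    head, found, _tail = text.partition(sep)
    # partition returns an empty separator when sep is not in the string.
    return head if found else None


print(value_before("dinem-5554\tlee"))
# dinem-5554
print(value_before("no tab here"))
# None
print("a\tb\tc".split('\t', 1))
# ['a', 'b\tc']
```

The last line shows the effect of the second argument to split: only the first tab is used as a split point.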
I have a string 'A1T1730'.
From this I need to extract the second character and the last four characters. For example, from 'A1T1730' I need to extract '1' and '1730'. I'm not sure how to do this in Python.
I have the following right now, which extracts every character from the string separately. Can someone please help me update it to do the above?
list = ['A1T1730']
for letter in list[0]:
    print letter
This gives me the result A, 1, T, 1, 7, 3, 0.
my_string = "A1T1730"
my_string = my_string[1] + my_string[-4:]
print my_string
Output
11730
If you want to extract them to different variables, you can just do
first, last = my_string[1], my_string[-4:]
print first, last
Output
1 1730
Using filter with str.isdigit (in its unbound method form); in Python 2, filter on a string returns a string:
>>> filter(str.isdigit, 'A1T1730')
'11730'
>>> ''.join(filter(str.isdigit, 'A1T1730')) # In Python 3.x
'11730'
If you want to get the numbers separately, use a regular expression (see re.findall):
>>> import re
>>> re.findall(r'\d+', 'A1T1730')
['1', '1730']
Use thefourtheye's solution if the positions of digits are fixed.
BTW, don't use list as a variable name. It shadows the built-in list type.
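To see why the choice between fixed positions and digit runs matters, here is a small Python 3 sketch comparing the two on the original string and on a made-up second one. The function names and the 'A12T1730' input are assumptions for illustration only.

```python
import re


def by_position(code):
    """Second character and last four characters, by fixed position."""
    return code[1], code[-4:]


def by_digit_runs(code):
    """Every run of consecutive digits, wherever it appears."""
    return re.findall(r'\d+', code)


for code in ['A1T1730', 'A12T1730']:
    print(code, by_position(code), by_digit_runs(code))
# A1T1730 ('1', '1730') ['1', '1730']
# A12T1730 ('1', '1730') ['12', '1730']
```

The two agree only while the first number is a single digit sitting at index 1.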
Well, you could do it like this:
# second character
_2nd = list[0][1]
# last 4 characters
numbers = list[0][-4:]
You can use the isdigit() method. It returns True if the character is a digit and False otherwise:
list = ['A1T1730']
for letter in list[0]:
    if letter.isdigit():
        print letter,  # the trailing comma keeps the output on the same line
I hope this is useful.