Need to find element just before regex match element via python - python

The input is exactly as below:
"dinem-5554\tlee"
I need to perform a regex match to get the value before \tlee, that is, dinem-5554. This is what I've tried:
m = re.findall(r'(\tlee)',a)[0]
if m:
    print m
else:
    print "none"

You don't need to use a regex. Use the builtin split method of str.
my_string = "dinem-5554\tlee"
groups = my_string.split('\tlee', 1)
if len(groups) > 1:  # more than one piece means the separator was found
    print groups[0]
else:
    print 'none'
Or if you mean to split at the tab character:
groups = my_string.split('\t', 1)
Note that the second argument sets the maximum number of splits. If my_string contained multiple tab characters, it would only be split at the first one.
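As a small sketch of the same idea in Python 3 syntax (the helper name value_before is made up for illustration), str.partition also reports whether the separator was found, so the "none" case can be handled explicitly:

```python
def value_before(text, sep="\t"):
    """Return the part of text before the first sep, or None if sep is absent."""
    head, found, _tail = text.partition(sep)
    # partition returns an empty string as its middle element when sep is missing
    return head if found else None

print(value_before("dinem-5554\tlee"))           # dinem-5554
print(value_before("dinem-5554\tlee", "\tlee"))  # dinem-5554
print(value_before("no separator here"))         # None
```

Unlike split, this never builds a list and keeps the "found or not" decision in one place.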

Related

How to split a list item based on digits in item

I am currently parsing this huge rpt file. In each item there is a value in parentheses. For example, "item_number_one(3.14)". How could I extract that 3.14 using the split function in python? Or is there another way to do that?
#Splits all items by comma
items = line.split(',')
#splits items within comma, just gives name
name_only = [i.split('_')[0] for i in items]
# print(name_only)
#splits items within comma, just gives full name
full_name= [i.split('(')[0] for i in items]
# print(full_Name)
#splits items within comma, just gives value in parentheses
parenth_value = [i.split('0-9')[0] for i in items]
# parenth_value = [int(s) for s in items.split() if s.isdigit()]
print(parenth_value)
For a more general way of extracting numbers from strings, you should read about regular expressions.
For this very specific case, you can split by ( and then by ) to get the value in between them, like this:
line = "item_number_one(3.14)"
num = line.split('(')[1].split(')')[0]
print(num)
You could simply find starting index of parentheses and ending parentheses, and get the area between them:
start_paren = line.index('(')
end_paren = line.index(')')
item = line[start_paren + 1:end_paren]
# item = '3.14'
Alternatively, you could use regex, which offers an arguably more elegant solution:
import re
...
# anything can come before the parentheses, anything can come afterwards.
# We have to escape the parentheses and put a group inside them
# (this is notated with its own parentheses inside the pair that is escaped)
item = re.match(r'.*\(([0-9.-]*)\).*', line).group(1)
# item = '3.14'
You can also use a regex, something like this:
import re
sentence = "item_number_one(3.14)"
re.findall(r'\d+\.\d+', sentence)
You could get the decimal value by using the following regular expression:
import re
text = 'item_number_one(3.14)'
re.findall(r'\d+\.\d+', text)
Output: ['3.14']
Explanation:
"\d" - Matches any decimal digit; this is equivalent to the class [0-9].
"+" - Matches one or more repetitions of the preceding element.
"\." - Matches a literal dot (an unescaped "." would match any character).
In the same way, you can parse the rpt file, split its lines, and fetch the value in the parentheses.
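To make that last step concrete, here is a hedged sketch in Python 3. It assumes each line of the rpt file is a comma-separated list of items shaped like name(number), which is only a guess based on the question; the names ITEM_RE and parse_items are invented for this example.

```python
import re

# Assumed item format: a name followed by a number in parentheses.
ITEM_RE = re.compile(r'\s*([^(),]+)\((-?\d+(?:\.\d+)?)\)\s*')

def parse_items(line):
    """Map each item name on the line to the numeric value in its parentheses."""
    values = {}
    for item in line.split(','):
        m = ITEM_RE.fullmatch(item)
        if m:  # items that do not fit the assumed format are skipped
            values[m.group(1)] = float(m.group(2))
    return values

print(parse_items("item_number_one(3.14),item_number_two(42)"))
# {'item_number_one': 3.14, 'item_number_two': 42.0}
```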

Python: How to move the position of an output variable using the split() method

This is my first SO post, so go easy! I have a script that counts how many matches occur in a string named postIdent for the substring ff. Based on this it then iterates over postIdent and extracts all of the data following it, like so:
substring = 'ff'
global occurences
occurences = postIdent.count(substring)
x = 0
while x <= occurences:
    for i in postIdent.split("ff"):
        rawData = i
        required_Id = rawData[-8:]
    x += 1
To explain further, if we take the string "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff", it is clear there are 3 instances of ff. I need to get the 8 preceding characters at every instance of the substring ff, so for the first instance this would be 909a9090.
With the rawData, I essentially need to offset the variable required_Id by -1 when I get the data out of the split() method, as I am currently getting the last 8 characters of the current string, not the string I have just split. Another way of doing it could be to pass the current required_Id to the next iteration, but I've not been able to do this.
The split method gets everything after the matching string ff.
Using the partition method can get me the data I need, but does not allow me to iterate over the string in the same way.
Get the last 8 digits of each split using a slice operation in a list-comprehension:
s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print([x[-8:] for x in s.split('ff') if x])
# ['909a9090', '90434390', 'sdfs9000']
Not a difficult problem, but tricky for a beginner.
If you split the string on 'ff' then you appear to want the eight characters at the end of every substring but the last. The last eight characters of string s can be obtained using s[-8:]. All but the last element of a sequence x can similarly be obtained with the expression x[:-1].
Putting both those together, we get
subject = '090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff'
for x in subject.split('ff')[:-1]:
    print(x[-8:])
This should print
909a9090
90434390
sdfs9000
I wouldn't do this with split myself, I'd use str.find. This code isn't fancy but it's pretty easy to understand:
fullstr = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
search = "ff"
found = None  # offset of the next match
last = 0  # where the next search starts
l = 8  # number of preceding characters to keep
print(fullstr)
while True:
    found = fullstr.find(search, last)
    if found == -1:
        break
    preceding = fullstr[max(0, found - l):found]  # clamp so a match near the start cannot wrap around
    print("At position {} found preceding characters '{}' ".format(found, preceding))
    last = found + len(search)
Overall I like Austin's answer more; it's a lot more elegant.
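If a regular expression is acceptable, a lookahead can express "the eight characters right before each ff" directly. This is a sketch in Python 3 rather than something taken from the answers above; it assumes every occurrence of ff has at least eight characters in front of it that do not overlap the previous match.

```python
import re

s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"

# .{8} consumes eight characters; (?=ff) requires "ff" to follow without consuming it.
preceding = re.findall(r'.{8}(?=ff)', s)
print(preceding)  # ['909a9090', '90434390', 'sdfs9000']
```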

How to write regular expression to find combination of characters, but each can only appear once in python

I would like to check whether a string contains a combination of "x" and "y" followed by "b". The "xy" part is optional, and each character can appear only once. For example:
import re
def findpat(texts, pat):
    for t in texts:
        if re.search(pat, t):
            print re.search(pat, t).group()
        else:
            print None
pat = re.compile(r'[xy]*?b')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# it prints
# xyb
# xb
# yb
# yxb
# b
# xyxb
For the last one, my desired output is "yxb".
How should I modify my regex? Many thanks
You may use the following approach: match and capture the two groups, ([xy]*)(b). Then, once a match is found, check if the length of the value in Group 1 is the same as the number of unique chars in this value. If not, remove the chars from the start of the group value until you get a string with the length of the number of unique chars.
Something like:
def findpat(texts, pat):
    for t in texts:
        m = re.search(pat, t)  # Find a match
        if m:
            tmp = set(m.group(1))  # Get the unique chars
            if len(tmp) == len(m.group(1)):  # If Group 1 length is the same
                print m.group()  # Report a whole match value
            else:
                res = m.group(1)
                while len(tmp) < len(res):  # While the length of the string is not
                    res = res[1:]  # equal to the number of unique chars, truncate from the left
                print "{}{}".format(res, m.group(2))  # Print the result
        else:
            print None  # Else, no match
pat = re.compile(r'([xy]*)(b)')
text = ['xyb', 'xb', 'yb', 'yxb','b', 'xyxb']
findpat(text, pat)
# => [xyb, xb, yb, yxb, b, yxb]
See the Python demo
You can use this pattern
r'(x?y?|yx)b'
To break down, the interesting part x?y?|yx will match:
empty string
only x
only y
xy
and on the alternative branch, yx
As a piece of advice: when you aren't very comfortable with regex and the number of scenarios is small, you could simply brute-force the pattern. It's ugly, but it makes clear what your cases are:
r'b|xb|yb|xyb|yxb'
Part 2.
For a generic solution, that will do the same, but for any number of characters instead of just {x, y}, the following regex style can be used:
r'(?=[^x]*x?[^x]*b)(?=[^y]*y?[^y]*b)(?=[^z]*z?[^z]*b)[xyz]*b'
I'll explain it a bit:
By using lookaheads you advance the regex cursor and for each position, you just "look ahead" and see if what follows respects a certain condition. By using this technique, you may combine several conditions into a single regex.
For a cursor position, we test each character from our set to appear at most once from the position, until we match our target b character. We do this with this pattern [^x]*x?[^x]*, which means match not-x if there are any, match at most one x, then match any number of not x
Once the test conditions are met, we start advancing the cursor and matching all the characters from our needed set, until we find a b. At this point we are guaranteed that we won't match any duplicates, because we performed our lookahead tests.
Note: I strongly suspect that this has poor performance, because it does backtracking. You should only use it for small test strings.
Test it.
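The lookahead construction described above is regular enough to be generated. The following Python 3 sketch builds the same pattern for an arbitrary set of characters; the function name build_once_each_pattern and its parameters are hypothetical, and the performance caveat from the note above still applies.

```python
import re

def build_once_each_pattern(chars, terminator):
    """Build a regex in which each character of chars appears at most once before terminator."""
    t = re.escape(terminator)
    # One lookahead per character: not-c*, at most one c, not-c*, then the terminator.
    lookaheads = "".join(
        "(?=[^{0}]*{0}?[^{0}]*{1})".format(re.escape(c), t) for c in chars
    )
    char_class = "[" + "".join(re.escape(c) for c in chars) + "]"
    return lookaheads + char_class + "*" + t

pattern = build_once_each_pattern("xyz", "b")
print(pattern)
print(re.search(pattern, "xyxb").group())  # yxb
```

For the characters x, y, z and the terminator b this reproduces the hand-written pattern shown earlier.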
Well, the regexp that literally passes your test cases is:
pat = re.compile(r'(x|y|xy|yx)?b$')
where the "$" anchors the string at the end and thereby ensures it's the last match found.
However it's a little more tricky to use the regexp mechanism(s) to ensure that only one matching character from the set is used ...
From Wiktor Stribiżew's comment & demo, I got my answer.
pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')
Thank you all!
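For reference, a quick check of that final pattern against the test strings from the question, written as a Python 3 sketch (the helper find_all is made up for this example and collects results instead of printing them):

```python
import re

# Pattern from the answer above: \1 refers to the first (optional) character,
# and (?!\1) forbids the second character from repeating it.
pat = re.compile(r'([xy]?)(?:(?!\1)[xy])?b')

def find_all(texts, pattern):
    """Return the first match in each text, or None where nothing matches."""
    results = []
    for t in texts:
        m = pattern.search(t)
        results.append(m.group() if m else None)
    return results

print(find_all(['xyb', 'xb', 'yb', 'yxb', 'b', 'xyxb'], pat))
# ['xyb', 'xb', 'yb', 'yxb', 'b', 'yxb']
```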

How to extract certain letters from a string using Python

I have a string 'A1T1730'
From this I need to extract the second letter and the last four letters. For example, from 'A1T1730' I need to extract '1' and '1730'. I'm not sure how to do this in Python.
I have the following right now which extracts every character from the string separately so can someone please help me update it as per the above need.
list = ['A1T1730']
for letter in list[0]:
    print letter
Which gives me the result of A, 1, T, 1, 7, 3, 0
my_string = "A1T1730"
my_string = my_string[1] + my_string[-4:]
print my_string
Output
11730
If you want to extract them to different variables, you can just do
first, last = my_string[1], my_string[-4:]
print first, last
Output
1 1730
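A slightly more defensive variant of the slicing approach, as a Python 3 sketch. The function name extract_parts is hypothetical, and the length check is an added assumption: the question only shows seven-character strings, so it is not clear how shorter input should be treated.

```python
def extract_parts(code):
    """Return (second character, last four characters) of code."""
    if len(code) < 5:
        # Assumption: shorter strings are treated as invalid input.
        raise ValueError("expected at least 5 characters, got %r" % code)
    return code[1], code[-4:]

second, last_four = extract_parts("A1T1730")
print(second, last_four)  # 1 1730
```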
Using filter with str.isdigit (as unbound method form):
>>> filter(str.isdigit, 'A1T1730')
'11730'
>>> ''.join(filter(str.isdigit, 'A1T1730')) # In Python 3.x
'11730'
If you want to get numbers separated, use regular expression (See re.findall):
>>> import re
>>> re.findall(r'\d+', 'A1T1730')
['1', '1730']
Use thefourtheye's solution if the positions of digits are fixed.
BTW, don't use list as a variable name. It shadows builtin list function.
Well, you could do it like this:
# second character
_2nd = list[0][1]
# last 4 characters
numbers = list[0][-4:]
You can use the isdigit() method. It returns True if the character is a digit and False otherwise:
list = ['A1T1730']
for letter in list[0]:
    if letter.isdigit():
        print letter,  # the trailing comma keeps the output on the same line
I hope this is useful.

How to differentiate lines with one dot and two dot?

I want to extract a specific part of a sentence. My problem is that I have a list of sentences that each have different formats. For instance:
X.y.com
x.no
x.com
y.com
z.co.uk
s.com
b.t.com
How can I split these lines based on the number of dots they have? I want the second part of the sentences with two dots and the first part of the sentences with one dot.
You want the part directly preceding the last dot; just split on the dots and take the second-to-last part:
for line in data:
    if '.' not in line:
        continue
    elem = line.strip().split('.')[-2]
For your input, that gives:
>>> for line in data:
...     print line.strip().split('.')[-2]
...
y
x
x
y
co
s
t
To answer your question, you could use count to count the number of times '.' appears and then do whatever you need.
>>> 't.com'.count('.')
1
>>> 'x.t.com'.count('.')
2
You could use that in a loop:
for s in string_list:
    dots = s.count('.')
    if dots == 1:
        pass  # do something here
    elif dots == 2:
        pass  # do something else
    else:
        pass  # another piece of code
A more Pythonic way to solve your problem:
def test_function(s):
    """
    >>> test_function('b.t.com')
    't'
    >>> test_function('x.no')
    'x'
    >>> test_function('z')
    'z'
    """
    actions = {0: lambda x: x,
               1: lambda x: x.split('.')[0],
               2: lambda x: x.split('.')[1]}
    return actions[s.count('.')](s)
I would follow this logic:
For each line:
remove any spaces at beginning and end
split the line by dots
take the second-to-last part of the resulting list
This should give you the part of the sentence you're looking for.
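The steps above can be sketched in Python 3 as follows. The function name part_before_last_dot is invented for the example, and returning the whole line when it contains no dot is an assumption, since the question does not cover that case.

```python
def part_before_last_dot(line):
    """Return the piece of line that directly precedes its last dot."""
    parts = line.strip().split('.')
    if len(parts) < 2:
        # Assumption: a line without any dot is returned unchanged.
        return parts[0]
    return parts[-2]

data = ["X.y.com", "x.no", "x.com", "y.com", "z.co.uk", "s.com", "b.t.com"]
print([part_before_last_dot(line) for line in data])
# ['y', 'x', 'x', 'y', 'co', 's', 't']
```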
Simply use the split function.
a = 'x.com'
b = a.split('.')
This will make a list of 2 items in b. If you have two dots, the list will contain 3 items. The function actually splits the string based on the given character.
