I have a list of strings, and I want to find all the strings that end with _1234, where 1234 can be any 4-digit number. Ideally I'd find all the matching elements and what the digits actually are, or at least return the first matching element and its 4 digits.
For example, I have
['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
I want to get
['1024', '0510']
So far I have worked out that _\d{4}$ will match _1234 and return a match object, and that match_object.group(0) is the actual matched string. But is there a better way to look for _\d{4}$ and return only the \d{4} part, without the _?
Use re.search():
import re
lst = ['A', 'BB_1024', 'CQ_2', 'x_0510']
newlst = []
for item in lst:
    match = re.search(r'_(\d{4})\Z', item)
    if match:
        newlst.append(match.group(1))
print(newlst) # ['1024', '0510']
As for the regex, the pattern matches an underscore and exactly 4 digits at the end of the string, capturing only the digits (note the parens). The captured group is then accessible via match.group(1) (remember that group(0) is the entire match).
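The same loop can be written more compactly. This is a minimal sketch, assuming the pattern is compiled once and reused; the helper name four_digit_suffixes is made up for illustration:

```python
import re

# Underscore followed by exactly four digits at the very end of the string.
SUFFIX = re.compile(r'_(\d{4})\Z')

def four_digit_suffixes(items):
    """Return the captured four digits of every item that ends in _NNNN."""
    matches = (SUFFIX.search(item) for item in items)
    return [m.group(1) for m in matches if m]

lst = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
print(four_digit_suffixes(lst))  # ['1024', '0510']

# First matching element only, or None if nothing matches.
first = next((item for item in lst if SUFFIX.search(item)), None)
print(first)  # BB_1024
```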
import re
src = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765', 'AB2421', 'D3&1345']
res = []
p = re.compile(r'.*\D(\d{4})$')
for s in src:
    m = p.match(s)
    if m:
        res.append(m.group(1))
print(res)  # ['1024', '0510', '2421', '1345']
This works. \D matches any non-digit character, so the pattern also matches 'AB2421', 'D3&1345' and so on.
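How the two patterns differ is easiest to see side by side. A small sketch, assuming the same src list as above; the names strict and loose are made up here:

```python
import re

src = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765', 'AB2421', 'D3&1345']

strict = re.compile(r'_(\d{4})$')   # the four digits must follow an underscore
loose = re.compile(r'\D(\d{4})$')   # the four digits may follow any non-digit

strict_hits = [m.group(1) for m in map(strict.search, src) if m]
loose_hits = [m.group(1) for m in map(loose.search, src) if m]

print(strict_hits)  # ['1024', '0510']
print(loose_hits)   # ['1024', '0510', '2421', '1345']
```

Which one is right depends on whether the underscore is actually required in your data.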
Please show some code next time you ask a question here, even if it doesn't work at all. It makes it easier for people to help you.
If you're interested in a solution without any regex, here's a way with list comprehensions:
>>> data = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
>>> endings = [text.split('_')[-1] for text in data]
>>> endings
['A', '1024', '2', '0510', '98765']
>>> [x for x in endings if x.isdigit() and len(x)==4]
['1024', '0510']
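One caveat: split('_')[-1] returns the whole string when there is no underscore, so an item such as '1234' would pass the filter too. A sketch that insists on the underscore, using str.rpartition (the sample item '1234' is added here for illustration and is not in the original data):

```python
data = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765', '1234']

result = []
for text in data:
    # rpartition splits on the LAST underscore; sep is '' if there is none.
    head, sep, tail = text.rpartition('_')
    if sep and len(tail) == 4 and tail.isdigit():
        result.append(tail)

print(result)  # ['1024', '0510']
```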
Try this:
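[s[-4:] for s in lst if len(s) > 4 and s[-5] == '_' and s[-4:].isdigit()]
This checks that the last four characters are digits and that the character before them is an underscore. Without the underscore check, 'y_98765' would yield '8765'.
The len(s) > 4 check was added to correct the mistake Joran pointed out.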
Try this code:
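import re
mylist = ['A', 'BB_1024', 'CQ_2', 'x_0510', 'y_98765']
r = re.compile(r".*_([0-9]{4})$")
newlist = list(filter(r.match, mylist))  # keeps the whole matching strings
print(newlist)  # ['BB_1024', 'x_0510']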
Related
I am using Python to pull data from a table that changes often, and the method I am using is not ideal. What I would like is a way to pull all strings that contain at most one letter and leave out anything with two or more.
An example of data I might get:
115
19A6
HYS8
568
In this example, I would like to pull 115, 19A6, and 568.
Currently I am using the isdigit() method to check whether a value is all digits, but this also filters out the values that contain one letter. That works for some purposes, but is less than ideal.
Try this:
string_list = ["115", "19A6", "HYS8", "568"]
output_list = []
for item in string_list:  # go through the list of strings
    letter_counter = 0
    for letter in item:  # go through the characters of one string
        if not letter.isdigit():  # count every character that is not a digit
            letter_counter += 1
    if letter_counter < 2:  # strings with more than one letter are left out
        output_list.append(item)
print(output_list)
Output:
['115', '19A6', '568']
Here is a one-liner with a regular expression:
import re
data = ["115", "19A6", "HYS8", "568"]
out = [string for string in data if len(re.sub(r"\d", "", string)) < 2]
print(out)
Output:
['115', '19A6', '568']
This is an excellent case for regular expressions (regex), which are available in the built-in re library.
The code below follows this logic:
1. Define the dataset. Two examples have been added to show that a string containing two alphabetic characters is rejected.
2. Compile a character pattern to be matched. In this case: zero or more digits, followed by zero or one upper-case letter, followed by zero or more digits.
3. Use the filter function to detect matches in the data list and output the result as a list.
For example:
import re
data = ['115', '19A6', 'HYS8', '568', 'H', 'HI']
rexp = re.compile(r'^\d*[A-Z]{0,1}\d*$')
result = list(filter(rexp.match, data))
print(result)
Output:
['115', '19A6', '568', 'H']
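Two details of that pattern are worth knowing: [A-Z] accepts upper-case letters only, and because every part is optional it also matches the empty string. Here is a sketch that allows either case and rejects empty strings, assuming that is the behaviour wanted (the extra sample items are made up for illustration):

```python
import re

data = ['115', '19a6', 'HYS8', '568', 'H', 'HI', '']

# (?=.) demands at least one character; [A-Za-z]? allows at most one letter.
rexp = re.compile(r'(?=.)\d*[A-Za-z]?\d*')

result = [s for s in data if rexp.fullmatch(s)]
print(result)  # ['115', '19a6', '568', 'H']
```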
Another solution, without re using str.maketrans/str.translate:
lst = ["115", "19A6", "HYS8", "568"]
d = str.maketrans(dict.fromkeys(map(str, range(10)), ""))
out = [i for i in lst if len(i.translate(d)) < 2]
print(out)
Prints:
['115', '19A6', '568']
z = False
a = str(a)
for i in range(len(a)):
    if a[i].isdigit():
        z = True
        break
else:
    z = "no digit"
print(z)
I have a list of strings:
str_list = ['123_456_789_A1', '678_912_000_B1', '980_210_934_A1', '632_210_464_B1']
And I basically want another list:
output_list = ['789', '000', '934', '464']
It is always going to be the third group of numbers, which is always followed by _A or _B.
So far I have:
import re
m = re.search('_(.+?)_A', text)
if m:
    found = m.group(1)
But I keep getting something like: 456_789
Just use a simple list comprehension for this:
ans = [i.split("_")[-2] for i in str_list]
If you only want to match digits followed by an underscore and an uppercase char, you can match the digits and assert the underscore and uppercase char directly to the right.
To match only A or B, use [AB]; to match any uppercase letter, use [A-Z].
\d+(?=_[AB])
You can use re.search to find the first occurrence in the string.
import re
str_list = ['123_456_789_A1', '678_912_000_B1', '980_210_934_A1', '632_210_464_B1']
str_list = [re.search(r'\d+(?=_[AB])', s).group() for s in str_list]
print(str_list)
Output
['789', '000', '934', '464']
Or use a capturing group instead, also matching the leading _ to be a bit more precise, since your pattern matched it as well:
str_list = [re.search(r'_(\d+)_[AB]', s).group(1) for s in str_list]
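Both comprehensions assume that every string matches. When there is no match, re.search returns None, and calling .group() on it raises AttributeError. Here is a sketch that skips non-matching items, assuming Python 3.8+ for the := operator (the item 'no_match_here' is made up for illustration):

```python
import re

str_list = ['123_456_789_A1', '678_912_000_B1', 'no_match_here', '632_210_464_B1']

pattern = re.compile(r'_(\d+)_[AB]')

# Keep the captured digits only for items where the pattern is found.
output_list = [m.group(1) for s in str_list if (m := pattern.search(s))]
print(output_list)  # ['789', '000', '464']
```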
I have an input list:
list_1 = ['29','560001','08067739333','560037002','29AAACC0462F1Z0','55XX1XXX19','07S23X09','98561XXX1X9']
I have tried:
output_list = [i for i in list_1 if 'X' in i or i.isnumeric()==True]
This gives an extra element, '07S23X09', which is wrong:
output_list = ['29','560001','08067739333','560037002','55XX1XXX19','07S23X09','98561XXX1X9']
The expected output is the list of elements that consist only of digits, or of digits and the specific character X; all other elements should be discarded:
output_list = ['29','560001','08067739333','560037002','55XX1XXX19','98561XXX1X9']
You may use
import re
list_1 = ['29','560001','08067739333','560037002','29AAACC0462F1Z0','55XX1XXX19','07S23X09','98561XXX1X9']
rx = re.compile('[0-9X]+')
print([x for x in list_1 if rx.fullmatch(x)])
# => ['29', '560001', '08067739333', '560037002', '55XX1XXX19', '98561XXX1X9']
With re.fullmatch('[0-9X]+', x), you only keep the items that fully consist of digits or X chars.
The equivalent anchored pattern is ^[0-9X]+$.
NOTE: If there must be at least one digit in the string, i.e. if you want to fail and thus discard all items that are just XXX, you may use
^X*[0-9][0-9X]*$
Or ^(?=X*[0-9])[0-9X]+$.
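To see what the stricter pattern changes, here is a small sketch, assuming a few extra items such as an all-X string 'XXX' (these samples are not part of the original list):

```python
import re

items = ['29', '55XX1XXX19', '07S23X09', 'XXX', 'X7']

any_digits_or_x = re.compile(r'[0-9X]+')
needs_a_digit = re.compile(r'X*[0-9][0-9X]*')

loose = [x for x in items if any_digits_or_x.fullmatch(x)]
strict = [x for x in items if needs_a_digit.fullmatch(x)]

print(loose)   # ['29', '55XX1XXX19', 'XXX', 'X7']
print(strict)  # ['29', '55XX1XXX19', 'X7']
```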
How about:
output_list = [i for i in list_1 if i.replace('X', '').isnumeric()]
You seem to want all the numeric ones, but are OK with an 'X' in there. So if you remove the X's and then check for numeric, that does the trick.
This could be done with a Python regex like the following:
import re
list_1 = ['29','560001','08067739333','560037002','29AAACC0462F1Z0','55XX1XXX19','07S23X09','98561XXX1X9']
# Delete the unwanted characters: all lowercase letters and every uppercase letter except X.
l2 = re.sub('[a-zA-WY-Z]', "", str(list_1))
print(l2)
Output (a single string; note that the letters are stripped from each item rather than the item being discarded):
['29', '560001', '08067739333', '560037002', '29046210', '55XX1XXX19', '0723X09', '98561XXX1X9']
Say I have a list of file names of the form *.1243.*, and I wish to obtain everything before these 4 digits. How do I do this efficiently?
An ugly, inefficient example of working code is:
names = []
for file in file_list:
    words = file.split('.')
    for i, word in enumerate(words):
        if word.isdigit():
            if int(word)>999 and int(word)<10000:
                names.append(' '.join(words[:i]))
                break
print(names)
Obviously though, this is far from ideal and I was wondering about better ways to do this.
You may want to use regular expressions for this.
import re
names = []
for file in file_list:
    m = re.match(r'^(.+?)\.\d{4}\.', file)
    if m:
        names.append(m.group(1))
Using a regular expression, this becomes simpler:
import re
names = ['hello.1235.sas','test.5678.hai']
for fn in names:
    myreg = r'(.*)\.(?:\d{4})\..*'
    output = re.findall(myreg,fn)
    print(output)
Output:
['hello']
['test']
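The two regex answers differ in one detail: (.+?) is lazy and stops at the first four-digit component, while (.*) is greedy and runs to the last one. Here is a sketch showing the difference, using a made-up file name with two such components:

```python
import re

lazy = re.compile(r'(.+?)\.\d{4}\.')
greedy = re.compile(r'(.*)\.\d{4}\.')

fn = 'run.1234.part.5678.log'  # hypothetical name with two 4-digit parts

lazy_name = lazy.match(fn).group(1)
greedy_name = greedy.match(fn).group(1)

print(lazy_name)    # run
print(greedy_name)  # run.1234.part
```

For names with a single four-digit component, both patterns return the same prefix.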
If you know that all entries have the same format, here is a list comprehension approach:
[parts[0] for parts in filter(lambda parts: len(parts[1]) == 4, (item.split('.') for item in file_list))]
To be fair, I also like the solution provided by @James. Note that the downside of this list comprehension is that it chains three steps:
1. Splitting all items
2. Filtering the items that match
3. Returning the result.
With a regular for loop it can be written more simply:
output = []
for item in file_list:
    beginning, digits, end = item.split('.')
    if len(digits) == 4:
        output.append(beginning)
This does everything in one explicit loop, which is easier to follow.
You can use a positive lookahead, (?=(\.\d{4})):
import re
pattern=r'(.*)(?=(\.\d{4}))'
text=['*hello.1243.*','*.1243.*','hello.1235.sas','test.5678.hai','a.9999']
print(list(map(lambda x:re.search(pattern,x).group(0),text)))
output:
['*hello', '*', 'hello', 'test', 'a']
Is there a way to filter a list of strings, keeping those that contain, for example, 3 equal characters in a row? I wrote a function that does this, but I'm curious whether there is a more Pythonic, more efficient, or simpler way to do it.
list_of_strings = []
def check_3_in_row(string):
    for ch in set(string):
        if ch*3 in string:
            return True
    return False
new_list = [x for x in list_of_strings if check_3_in_row(x)]
EDIT:
I've just found another solution:
new_list = [x for x in set(keywords) if any(ch*3 in x for ch in x)]
But I'm not sure which way is faster - regexp or this.
You can use a regular expression, like this:
>>> import re
>>> list_of_strings = ["aaa", "dasdas", "aaafff", "afff", "abbbc"]
>>> [x for x in list_of_strings if re.search(r'(.)\1{2}', x)]
['aaa', 'aaafff', 'afff', 'abbbc']
Here, . matches any character, and it is captured in a group ((.)). We then check whether the same captured character appears two more times: the backreference \1 refers to the first captured group, and {2} means two times.
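As for which approach is faster, the simplest way to find out is to measure on your own data. Here is a rough sketch using timeit with a made-up word list; the function names are hypothetical and the timings will vary by machine and input, so none are shown:

```python
import re
import timeit

words = ["aaa", "dasdas", "aaafff", "afff", "abbbc"] * 1000  # made-up sample

triple = re.compile(r'(.)\1{2}')  # compiled once, reused for every string

def with_regex(strings):
    return [s for s in strings if triple.search(s)]

def with_any(strings):
    return [s for s in strings if any(ch * 3 in s for ch in set(s))]

# Both approaches must agree before their timings are worth comparing.
assert with_regex(words) == with_any(words)

print("regex:", timeit.timeit(lambda: with_regex(words), number=20))
print("any():", timeit.timeit(lambda: with_any(words), number=20))
```

Note that . does not match a newline by default, so the two versions can disagree on strings containing repeated newlines unless re.DOTALL is passed.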