Parsing String with Python - python

How can I parse a string ['FED590498'] in Python so that I can get the numeric part 590498 and the letters FED separately?
Some Samples:
['ICIC889150']
['FED889150']
['MFL541606']
The [ ] is not part of the string.

If the number of letters is variable, it's easiest to use a regular expression:
import re
characters, numbers = re.search(r'([A-Z]+)(\d+)', inputstring).groups()
This assumes that:
The letters are uppercase ASCII
There is at least 1 character, and 1 digit in each input string.
You can lock the pattern down further by using {3,4} instead of + to limit repetition to just 3 or 4 instead of at least 1, etc.
Demo:
>>> import re
>>> inputstring = 'FED590498'
>>> characters, numbers = re.search(r'([A-Z]+)(\d+)', inputstring).groups()
>>> characters
'FED'
>>> numbers
'590498'
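Both assumptions matter in practice: re.search returns None when the input has no letters-then-digits run, and calling .groups() on None raises AttributeError. A minimal sketch of a guarded version follows. The helper name split_id is hypothetical, and using fullmatch (so nothing else may surround the letters and digits) is an assumption about the inputs, not something stated in the question.

```python
import re

# Hypothetical helper, not part of the original answer.
ID_PATTERN = re.compile(r'([A-Z]+)(\d+)')

def split_id(value):
    """Return (letters, digits), or None if value is not letters followed by digits."""
    match = ID_PATTERN.fullmatch(value)
    if match is None:
        return None
    return match.groups()

print(split_id('FED590498'))  # ('FED', '590498')
print(split_id('590498'))     # None
```

Returning None leaves it to the caller to decide whether a malformed ID should be skipped or reported.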

Given the requirement that there are always 3 or 4 letters you can use:
import re
characters, numbers = re.findall(r'([A-Z]{3,4})(\d+)', 'FED590498')[0]
characters, numbers
#('FED', '590498')
Or even:
ids = ['ICIC889150', 'FED889150', 'MFL541606']
[re.search(r'([A-Z]{3,4})(\d+)', id).groups() for id in ids]
#[('ICIC', '889150'), ('FED', '889150'), ('MFL', '541606')]
As suggested by Martijn, search is the preferred way.

Related

How to format a number with comma every four digits in Python?

I have a number 12345 and I want the result '1,2345'. I tried the following code, but failed:
>>> n = 12345
>>> f"{n:,}"
'12,345'
Regex will work for you:
import re
def format_number(n):
    return re.sub(r"(\d)(?=(\d{4})+(?!\d))", r"\1,", str(n))
>>> format_number(123)
'123'
>>> format_number(12345)
'1,2345'
>>> format_number(12345678)
'1234,5678'
>>> format_number(123456789)
'1,2345,6789'
Explanation:
Match:
(\d) Match a digit...
(?=(\d{4})+(?!\d)) ...that is followed by one or more groups of exactly 4 digits.
Replace:
\1, Replace the matched digit with itself and a ,
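The same lookahead can be parameterised by group width, which makes it easier to see what each part of the pattern does. This is only a sketch: group_digits, size and sep are hypothetical names, and it assumes integer input (a float such as 1234.56789 would also get separators in its fractional part).

```python
import re

def group_digits(n, size=4, sep=","):
    """Insert sep between groups of `size` digits, counted from the right."""
    # (\d) is the digit that gets a separator after it; the lookahead requires
    # one or more complete groups of `size` digits up to the end of the digit run.
    pattern = r"(\d)(?=(?:\d{%d})+(?!\d))" % size
    return re.sub(pattern, lambda m: m.group(1) + sep, str(n))

print(group_digits(12345))            # 1,2345
print(group_digits(1234567, size=3))  # 1,234,567
print(group_digits(-12345678))        # -1234,5678
```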
Sounds like a locale thing(*). This prints 12,3456,7890 (Try it online!):
import locale
n = 1234567890
locale._override_localeconv["thousands_sep"] = ","
locale._override_localeconv["grouping"] = [4, 0]
print(locale.format_string('%d', n, grouping=True))
That's admittedly a hackish way, based on this answer. The other answer there talks about using babel, which may be a cleaner way to achieve it.
(*) Quick googling found this talking about Chinese grouping four digits, and OP's name seems somewhat Chinese, so...
Using babel:
>>> from babel.numbers import format_decimal
>>> format_decimal(1234, format="#,####", locale="en")
'1234'
>>> format_decimal(12345, format="#,####", locale="en")
'1,2345'
>>> format_decimal(1234567890, format="#,####", locale="en")
'12,3456,7890'
This format syntax is specified in UNICODE LOCALE DATA MARKUP LANGUAGE (LDML). Some light bedtime reading there.
Using stdlib only (hackish):
>>> from textwrap import wrap
>>> n = 12345
>>> ",".join(wrap(str(n)[::-1], width=4))[::-1]
'1,2345'
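The reverse, wrap, reverse trick works because the groups have to be counted from the right. The same idea can be sketched with plain slicing, which avoids reversing the string twice. The name group_from_right is hypothetical, and the sketch assumes a non-negative integer.

```python
def group_from_right(n, size=4, sep=","):
    """Group the digits of a non-negative integer in chunks of `size` from the right."""
    digits = str(n)
    # Length of the leading chunk, so that every later chunk has exactly `size` digits.
    head = len(digits) % size or size
    chunks = [digits[:head]]
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return sep.join(chunks)

print(group_from_right(12345))      # 1,2345
print(group_from_right(123456789))  # 1,2345,6789
```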
You can break your number into chunks of 10000 using modulus and integer division, then str.join the chunks with ',' delimiters:
def commas(n):
    s = []
    while n >= 10000:
        n, chunk = divmod(n, 10000)
        s.append(f"{chunk:04d}")  # zero-pad every group except the leading one
    s.append(str(n))  # leading group, no padding
    return ','.join(reversed(s))
>>> commas(123456789)
'1,2345,6789'
>>> commas(123)
'123'

How to test a string that only contains alphabets and numbers?

I am trying to test whether a string contains only letters or numbers. The following statement should return False, but it doesn't. What am I doing wrong?
bool(re.match('[A-Z\d]', '2ae12'))
Just use the string method isalnum(), it does exactly what you want.
While not regex, you can use the very concise str.isalnum():
s = "sdfsdfq34sd"
print(s.isalnum())
Output:
True
However, if you do want a pure regex solution:
import re
if re.findall('^[a-zA-Z0-9]+$', s):
    pass  # string contains only letters and digits
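To see why the pattern in the question returns True while an anchored pattern does not, it helps to compare re.match with re.fullmatch on the same input. The variable names below are only illustrative.

```python
import re

s = '2ae12'

# re.match succeeds as soon as the pattern matches at the start of the string;
# here the single character '2' is enough.
prefix_only = bool(re.match(r'[A-Z\d]', s))

# re.fullmatch requires the whole string to match the pattern.
upper_and_digits = bool(re.fullmatch(r'[A-Z\d]+', s))
any_case_and_digits = bool(re.fullmatch(r'[A-Za-z\d]+', s))

print(prefix_only, upper_and_digits, any_case_and_digits)  # True False True
```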
Using a dataframe solution, courtesy of @Wen:
import pandas as pd
df = pd.DataFrame({'col1': ["sdfsdfq34sd", "sdfsdfq###34sd", "sdfsdf!q34sd", "sdfsdfq34s#d"]})
df.col1.apply(lambda x: x.isalnum())
Pandas answer: Consider this df
col
0 2ae12
1 2912
2 da2ae12
3 %2ae12
4 #^%6f
5 &^$*
You can select the rows that contain only alphabets or numbers using
df[~df.col.str.contains(r'\W')]
You get
col
0 2ae12
1 2912
2 da2ae12
If you just want a boolean column, use
~df.col.str.contains(r'\W')
0 True
1 True
2 True
3 False
4 False
5 False
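One caveat with the \W test: \w includes the underscore, so a value such as snake_case is not flagged. The sketch below uses the plain re module on a few strings instead of a dataframe so the difference is easy to see. The function names and the extra sample 'snake_case' are hypothetical.

```python
import re

def has_no_nonword_char(text):
    # True when no \W character occurs; underscores still pass.
    return re.search(r'\W', text) is None

def is_letters_and_digits(text):
    # Stricter: only ASCII letters and digits, at least one character.
    return re.fullmatch(r'[A-Za-z0-9]+', text) is not None

for sample in ('2ae12', '%2ae12', 'snake_case'):
    print(sample, has_no_nonword_char(sample), is_letters_and_digits(sample))
# 2ae12 True True
# %2ae12 False False
# snake_case True False
```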
If you are looking to return True if the string is either all digits or all letters, you can do:
for case in ('abcdefg', '12345', '2ae12'):
    print(case, case.isalpha() or case.isdigit())
Prints:
abcdefg True
12345 True
2ae12 False
If you want the same logic with a regex, you would do:
import re
for case in ('abcdefg', '12345', '2ae12'):
    print(case, bool(re.search(r'^(?:[a-zA-Z]+|\d+)$', case)))
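Your regex only matches a single character: re.match just checks whether the start of the string matches, and the first character '2' already satisfies [A-Z\d], so the result is True. The \d itself is not the problem: Python leaves the unrecognised escape \d in place, so the regex engine still sees the digit class, although a raw string makes that explicit.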
If you really want to use a regex, here's how I would do it:
import re
def isalphanum(test_str):
    alphanum_re = re.compile(r"[0-9A-Z]+", re.I)
    # fullmatch requires the whole string to match, not just a prefix
    return bool(alphanum_re.fullmatch(test_str))
Let's focus on the alphanum regex. I compiled it from a raw string literal, indicated by the 'r' prefix. A raw string does not process backslash escapes, meaning r"\n" is interpreted as a backslash followed by an n instead of a newline. This is helpful when writing regexes, and some text editors even change the syntax highlighting of a raw string to highlight features of the regex. The re.I flag makes matching case-insensitive, so [A-Z] will match A through Z in either upper or lower case.
The simpler, Zen of Python solution is to call the isalnum method of the string:
test_str = "abc123"
test_str.isalnum()
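Two details of str.isalnum() are worth knowing before relying on it: it returns False for the empty string, and it accepts non-ASCII letters and digits. If only ASCII letters and digits should pass, a small wrapper can be sketched as follows. The name is_ascii_alnum is hypothetical, and str.isascii() requires Python 3.7 or later.

```python
def is_ascii_alnum(text):
    # isascii() rejects characters such as 'é'; isalnum() rejects '' and punctuation.
    return text.isascii() and text.isalnum()

for sample in ("abc123", "", "abc 123", "é1", "abc_123"):
    print(repr(sample), sample.isalnum(), is_ascii_alnum(sample))
# 'abc123' True True
# '' False False
# 'abc 123' False False
# 'é1' True False
# 'abc_123' False False
```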
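You need to check if the string is made up of either letters only or digits only. The alternation has to be grouped so that both anchors apply to both branches, and the pattern has to go through the .str accessor:
df['some_column'].str.match(r'^(?:[A-Za-z]+|\d+)$')
As dawg has suggested, you can also use isalpha and isdigit, combined element-wise with | rather than or:
df['some_column'].str.isalpha() | df['some_column'].str.isdigit()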

How to extract the first numbers in a string - Python

How do I remove all the numbers before the first letter in a string? For example,
myString = "32cl2"
I want it to become:
"cl2"
I need it to work for any length of number, so 2h2 should become h2, 4563nh3 becomes nh3 etc.
EDIT:
This has numbers without spaces between so it is not the same as the other question and it is specifically the first numbers, not all of the numbers.
If you were to solve it without regular expressions, you could have used itertools.dropwhile():
>>> from itertools import dropwhile
>>>
>>> ''.join(dropwhile(str.isdigit, "32cl2"))
'cl2'
>>> ''.join(dropwhile(str.isdigit, "4563nh3"))
'nh3'
Or, using re.sub(), replacing one or more digits at the beginning of a string:
>>> import re
>>> re.sub(r"^\d+", "", "32cl2")
'cl2'
>>> re.sub(r"^\d+", "", "4563nh3")
'nh3'
Use lstrip:
myString.lstrip('0123456789')
or
import string
myString.lstrip(string.digits)
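The three approaches above agree on the examples from the question, which the following sketch checks side by side. The wrapper function names are hypothetical. One difference to keep in mind: string.digits contains only the ASCII digits 0-9, whereas str.isdigit and \d on a str pattern also accept digits from other scripts.

```python
import re
import string
from itertools import dropwhile

def with_dropwhile(s):
    return ''.join(dropwhile(str.isdigit, s))

def with_sub(s):
    return re.sub(r'^\d+', '', s)

def with_lstrip(s):
    return s.lstrip(string.digits)

for sample in ('32cl2', '2h2', '4563nh3', 'cl2', '123'):
    # A set with one element means all three functions returned the same string.
    results = {f(sample) for f in (with_dropwhile, with_sub, with_lstrip)}
    print(repr(sample), '->', results)
```

A string made up only of digits, such as '123', becomes the empty string with all three.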

re.match is returning true on two different strings

I am using re.match function of python to compare two strings by ignoring few characters like this:
import re
url = "/ChessBoard_x16_y16.bmp/xyz"
if re.match( '/ChessBoard_x.._y..\.bmp', url ):
    print("true")
else:
    print("false")
Problem #1: the output is true, but I want false here because the url has something extra after .bmp.
Problem #2: I have used two dots here to ignore the value 16 (x16 & y16), but in fact this value can contain any number of digits, like x8, x16, x256 etc. So what should I do to ignore this complete value consisting of any number of digits?
Try the regex
'/ChessBoard_x[\d]+_y[\d]+\.bmp$'
A small demo (Also try on Regex101)
>>> import re
>>> pat = re.compile(r'/ChessBoard_x[\d]+_y[\d]+\.bmp$')
>>> url = "/ChessBoard_x162_y162.bmp"
>>> pat.match(url).group()
'/ChessBoard_x162_y162.bmp'
>>> url = "/ChessBoard_x16_y16.bmp/xyz"
>>> print(pat.match(url))  # no match, so match() returns None
None
Problem 1: You need to specify that you want the string to terminate at the end of the regex. The $ operator does that:
re.match("/ChessBoard_x.._y..\.bmp$", url)
Problem 2: What you want is one or more digits. The \d character class matches digits, and + matches one or more of them, so I replace each pair of dots with \d+:
re.match("/ChessBoard_x\d+_y\d+\.bmp$", url)
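If the x and y values are needed afterwards, the digit runs can be captured with groups. The sketch below also uses re.fullmatch instead of a trailing $, because $ still matches just before a final newline. The function name parse_board_url is hypothetical, and converting the captures to int is an assumption about how the values will be used.

```python
import re

BOARD_PATTERN = re.compile(r'/ChessBoard_x(\d+)_y(\d+)\.bmp')

def parse_board_url(url):
    """Return (x, y) as integers, or None if the url has any other shape."""
    match = BOARD_PATTERN.fullmatch(url)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

print(parse_board_url('/ChessBoard_x16_y16.bmp'))      # (16, 16)
print(parse_board_url('/ChessBoard_x8_y256.bmp'))      # (8, 256)
print(parse_board_url('/ChessBoard_x16_y16.bmp/xyz'))  # None
```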

Python : UTF-8 : How to count number of words in UTF-8 string?

I need to count number of words in UTF-8 string. ie I need to write a python function which takes "एक बार,एक कौआ, बहुत प्यासा, था" as input and returns 7 ( number of words ).
I tried the regular expression "\b" as shown below, but the results are inconsistent.
wordCntExp=re.compile(ur'\b',re.UNICODE);
sen='एक बार,एक कौआ, बहुत प्यासा, था';
print len(wordCntExp.findall(sen.decode('utf-8'))) >> 1;
12
Any interpretation of the above answer or any other approaches to solve the above problem are appreciated.
try to use:
import re
words = re.split(ur"[\s,]+",sen, flags=re.UNICODE)
count = len(words)
It will split words divided by whitespaces and commas. You can add other characters into first argument that are not considered as characters belonging to a word.
inspired by this
python re documentation
I don't know anything about your language's structure, but can't you simply count the spaces?
>>> len(sen.split()) + 1
7
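Note the + 1: split() only breaks on whitespace, so "बार,एक" counts as one token and the + 1 makes up for that missing split; it would overcount a sentence in which every comma is followed by a space. [edited to split on arbitrary length spaces - thanks @Martijn Pieters]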
Using regex:
>>> import regex
>>> sen = 'एक बार,एक कौआ, बहुत प्यासा, था'
>>> regex.findall(ur'\w+', sen.decode('utf-8'))
[u'\u090f\u0915', u'\u092c\u093e\u0930', u'\u090f\u0915', u'\u0915\u094c\u0906', u'\u092c\u0939\u0941\u0924', u'\u092a\u094d\u092f\u093e\u0938\u093e', u'\u0925\u093e']
>>> len(regex.findall(ur'\w+', sen.decode('utf-8')))
7
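The snippets above are Python 2 (ur'' literals, decode on a byte string). In Python 3, str is already Unicode, so the splitting approach can be sketched as below. The function name count_words is hypothetical, and the sketch assumes that words are separated only by whitespace and commas. It splits on separators instead of matching \w+ because, as far as I know, the standard re module does not count Devanagari vowel signs as word characters, which is why the answer above reaches for the third-party regex module.

```python
import re

def count_words(text):
    """Count tokens separated by runs of whitespace and/or commas."""
    # re.split can yield empty strings at the edges; filter them out.
    return len([word for word in re.split(r'[\s,]+', text) if word])

sen = 'एक बार,एक कौआ, बहुत प्यासा, था'
print(count_words(sen))  # 7
```

Other punctuation that should separate words can be added to the character class.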
