How can I strip unwanted characters from a string in python? - python

I have the following string:
text = 'adsfklaiin2007daf adf adflkajf;j 2008afadfkjkj'
I want to return:
2007 2008
Any way to do this in Python?

This is a classic case for regular expressions. Using the re python library you get:
re.findall('\d{4}', "yourStringHere")
This will return a list of all four digit items found in the string. Simply adjust your regex as needed.

import re
num = re.compile('[\d]*')
numbers = [number for number in num.findall(text) if number]
['2007', '2008']

>>> import re
>>> text = 'adsfklaiin2007daf adf adflkajf;j 2008afadfkjkj'
>>> re.sub("[^0-9]"," ",text)
' 2007 2008 '
I will leave it to you to format the output.

str.translate
text.translate(None, ''.join(chr(n) for n in range(0xFF) if chr(n) not in ' 01234567890')
You can probably construct a better table of characters to skip and make it prettier, but that's the general idea.

Related

Getting word from string

How can i get word example from such string:
str = "http://test-example:123/wd/hub"
I write something like that
print(str[10:str.rfind(':')])
but it doesn't work right, if string will be like
"http://tests-example:123/wd/hub"
You can use this regex to capture the value preceded by - and followed by : using lookarounds
(?<=-).+(?=:)
Regex Demo
Python code,
import re
str = "http://test-example:123/wd/hub"
print(re.search(r'(?<=-).+(?=:)', str).group())
Outputs,
example
Non-regex way to get the same is using these two splits,
str = "http://test-example:123/wd/hub"
print(str.split(':')[1].split('-')[1])
Prints,
example
You can use following non-regex because you know example is a 7 letter word:
s.split('-')[1][:7]
For any arbitrary word, that would change to:
s.split('-')[1].split(':')[0]
many ways
using splitting:
example_str = str.split('-')[-1].split(':')[0]
This is fragile, and could break if there are more hyphens or colons in the string.
using regex:
import re
pattern = re.compile(r'-(.*):')
example_str = pattern.search(str).group(1)
This still expects a particular format, but is more easily adaptable (if you know how to write regexes).
I am not sure why do you want to get a particular word from a string. I guess you wanted to see if this word is available in given string.
if that is the case, below code can be used.
import re
str1 = "http://tests-example:123/wd/hub"
matched = re.findall('example',str1)
Split on the -, and then on :
s = "http://test-example:123/wd/hub"
print(s.split('-')[1].split(':')[0])
#example
using re
import re
text = "http://test-example:123/wd/hub"
m = re.search('(?<=-).+(?=:)', text)
if m:
print(m.group())
Python strings has built-in function find:
a="http://test-example:123/wd/hub"
b="http://test-exaaaample:123/wd/hub"
print(a.find('example'))
print(b.find('example'))
will return:
12
-1
It is the index of found substring. If it equals to -1, the substring is not found in string. You can also use in keyword:
'example' in 'http://test-example:123/wd/hub'
True

python getting each from decimal(10,2)

i want to seperate and get each one from following word
decimal(10,2)
i want:
decimal
10
2
each of these.how can we do that using regex?
",2" is optional,like in the case of char(10).The Precision value may or may not be there.
Try Regex: (\w+)(?:\((\d+)(?:,(\d+))?\))?
Demo
Simple string substitution:
import re
string = "decimal(10,2)"
string = re.sub(r'(\w+)\((\d+),(\d+)\)', r'\1\n\2\n\3', string)
print string

python regex to replace all single word characters in string

I am trying to remove all the single characters in a string
input: "This is a big car and it has a spacious seats"
my output should be:
output: "This is big car and it has spacious seats"
Here I am using the expression
import re
re.compile('\b(?<=)[a-z](?=)\b')
This matches with first single character in the string ...
Any help would be appreciated ...thanks in Advance
Edit: I have just seen that this was suggested in the comments first by Wiktor Stribiżew. Credit to him - I had not seen when this was posted.
You can also use re.sub() to automatically remove single characters (assuming you only want to remove alphabetical characters). The following will replace any occurrences of a single alphabetical character:
import re
input = "This is a big car and it has a spacious seats"
output = re.sub(r"\b[a-zA-Z]\b", "", input)
>>>
output = "This is big car and it has spacious seats"
You can learn more about inputting regex expression when replacing strings here: How to input a regex in string.replace?
Here's one way to do it by splitting the string and filtering out single length letters using len and str.isalpha:
>>> s = "1 . This is a big car and it has a spacious seats"
>>> ' '.join(i for i in s.split() if not (i.isalpha() and len(i)==1))
'1 . This is big car and it has spacious seats'
re.sub(r' \w{1} |^\w{1} | \w{1}$', ' ', input)
EDIT:
You can use:
import re
input_string = "This is a big car and it has a spacious seats"
str_without_single_chars = re.sub(r'(?:^| )\w(?:$| )', ' ', input_string).strip()
or (which as was brought to my attention, doesn't meet the specifications):
input_string = "This is a big car and it has a spacious seats"
' '.join(w for w in input_string.split() if len(w)>3)
The fastest way to remove words, characters, strings or anything between two known tags or two known characters in a string is by using a direct and Native C approach using RE along with a Common as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything and works faster, better and cleaner than Beautiful Soup.
Batch files are where the "" got there beginnings and were only borrowed for use with batch and html from native C". When using all Pythonic methods with regular expressions you have to realize that Python has not altered or changed much from all regular expressions used by Machine Language so why iterate many times when a single loop can find it all as one chunk in one iteration? Do the same individually with Characters also.
var = re.sub('\[', '<!--', var)
var = re.sub('\]', '-->', var)
And finally
var = re.sub('<!--.*?-->', '' var)# wipes it all out from between along with.
And you do not need Beautiful Soup. You can also scalp data using them if you understand how this works.

Python : UTF-8 : How to count number of words in UTF-8 string?

I need to count number of words in UTF-8 string. ie I need to write a python function which takes "एक बार,एक कौआ, बहुत प्यासा, था" as input and returns 7 ( number of words ).
I tried regular expression "\b" as shown below. But result are inconsistent.
wordCntExp=re.compile(ur'\b',re.UNICODE);
sen='एक बार,एक कौआ, बहुत प्यासा, था';
print len(wordCntExp.findall(sen.decode('utf-8'))) >> 1;
12
Any interpretation of the above answer or any other approaches to solve the above problem are appreciated.
try to use:
import re
words = re.split(ur"[\s,]+",sen, flags=re.UNICODE)
count = len(words)
It will split words divided by whitespaces and commas. You can add other characters into first argument that are not considered as characters belonging to a word.
inspired by this
python re documentation
I don't know anything about your language's structure, but can't you simply count the spaces?
>>> len(sen.split()) + 1
7
note the + 1 because there are n - 1 spaces. [edited to split on arbitrary length spaces - thanks #Martijn Pieters]
Using regex:
>>> import regex
>>> sen = 'एक बार,एक कौआ, बहुत प्यासा, था'
>>> regex.findall(ur'\w+', sen.decode('utf-8'))
[u'\u090f\u0915', u'\u092c\u093e\u0930', u'\u090f\u0915', u'\u0915\u094c\u0906', u'\u092c\u0939\u0941\u0924', u'\u092a\u094d\u092f\u093e\u0938\u093e', u'\u0925\u093e']
>>> len(regex.findall(ur'\w+', sen.decode('utf-8')))
7

Breaking up substrings in Python based on characters

I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to take a Substring of a string within two characters to know the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
pat=re.compile('"([^"]*)"')
while True:
mat=pat.search(inputString)
if mat is None:
break
strings.append(mat.group(1))
inputString=inputString[mat.end():]
print strings
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print strings
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print fields[1], fields[3], fields[5]
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print input_string_split
[ 'type="NN"', 'span="123..145"', 'confidence="1.0"' ]
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
match_obj = search(pattern, substring)
print match_obj.group(1)
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.

Categories