How to remove slash, letters and numbers from a string?

I am trying to get a clean representation of a string. The desired output is ['Course Number: CLASSIC 10A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4']
However, the current output is ['Course Number: CLASSIC\xa010A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4'].
Something (\xa0) is getting in the way in the first element. The relevant part of the code is below. Thanks in advance for helping me out.
all_tds = [get_tds(scrollable) for scrollable in scrollables]
def num_name_unit(list, index):
    all_rows = []
    num = list[index][0].get_text(strip=True)
    name = list[index][1].get_text(strip=True)
    unit = list[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows
c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
print(c)

As @melpomene commented, '\xa0' is a single character: a non-breaking space. One way to get rid of it is to replace every run of characters you do not want to keep with a plain space, using a regex:
import re
re.sub('[^A-Za-z0-9-|:]+', ' ', str)
This is generally my preferred way of removing special characters and formatting, but how does it work? Look at the pattern '[^A-Za-z0-9-|:]+'. The square brackets define a character set, and the leading ^ negates it, so the pattern matches any character that is not in the set. Inside the set, A-Z covers the uppercase letters, a-z the lowercase letters and 0-9 the digits; the remaining -, | and : are a literal hyphen, pipe and colon. The trailing + matches one or more such characters in a row, so each run of unwanted characters is replaced by a single space. Let's test this with a simple script:
import re
vals = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|:'
print(vals == re.sub('[^A-Za-z0-9-|:]+', ' ', vals))
I would recommend running this code yourself to try it out, but the answer you get back is True, because vals contains only characters from the set, so nothing is replaced.
Incorporating this into your script would be as simple as:
import re
all_tds = [get_tds(scrollable) for scrollable in scrollables]
def num_name_unit(list, index):
    all_rows = []
    num = list[index][0].get_text(strip=True)
    name = list[index][1].get_text(strip=True)
    unit = list[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows
c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
# c is a list of strings, so clean each row separately
print([re.sub('[^A-Za-z0-9-|:]+', ' ', row) for row in c])
If you encounter any other characters you wish to keep in your string, simply add them to the end of the set [^A-Za-z0-9-|:]. For example, if you wished to keep underscores as well, you would use '[^A-Za-z0-9-|:_]+'
Hope this helped. To read more, see the Regular Expression HOWTO in the Python 3 docs.
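If the non-breaking space is the only unwanted character, a narrower fix may be enough. The following is a minimal sketch, assuming a row string like the one in the question (the variable names are illustrative). It shows two standard-library options: replacing '\xa0' directly, or normalizing with unicodedata.normalize, which maps compatibility characters such as the non-breaking space to their plain equivalents.

```python
import unicodedata

row = ('Course Number: CLASSIC\xa010A | Course Name: '
       'Introduction to Greek Civilization1 | Course Unit: 4')

# Option 1: replace the non-breaking space (U+00A0) with a regular space.
replaced = row.replace('\xa0', ' ')

# Option 2: NFKC normalization turns U+00A0 into a regular space.
# It also rewrites other compatibility characters, so use it deliberately.
normalized = unicodedata.normalize('NFKC', row)

print(replaced)
print(normalized)
```

Unlike the regex approach, neither option touches other punctuation in the string.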

Related

Is there a more efficient way of replacing certain regex matches with elements?

I am required to use regex module.
I have coded this little program to replace certain regex matches such as orange with the length of orange in # signs, for example, if orange is in the string then it will be replaced with ######.
If a string has been changed it will add " !! This string has been changed !!" to the end of the string.
If a string has not been changed but has a # in it then it will not add " !! This string has been changed !!".
Is there a more efficient way of coding this, using regex functions and more idiomatic Python?
orange = re.compile(r'\borange\b', re.IGNORECASE)
frog = re.compile(r'\bfrog\b', re.IGNORECASE)
cat = re.compile(r'\bcat\b', re.IGNORECASE)
num = 0
if re.search(orange, s):
    s = re.sub(orange, "!!!!!!", s)
    num +=1
if re.search(frog, s):
    s = re.sub(frog, "!!!!", s)
    num +=1
if re.search(cat, s):
    s = re.sub(cat, "!!!", s)
    num +=1
if num > 0:
    return s + " !! This string has been changed !!"
else:
    return s

Assuming a line can contain 'orange', 'frog' and 'cat' at the same time, one possible solution is to build a single regex pattern that matches any of the words, iterate over the matches, and replace each match with as many 'x' characters as the matched word is long. The string is then printed, whether it was modified or not.
The code is:
import re
string = "orange frog cat test"
#string = "one two tree testing stackoverflow"
regex_pattern = re.compile(r"\b(orange|frog|cat)\b", re.IGNORECASE)
total_matches = regex_pattern.finditer(string)
# If the pattern matches at least once, changes will be made
changes_done = regex_pattern.search(string)
for match in total_matches:
    element_find = match.group(0)
    # count=1: replace only the first remaining match
    string = regex_pattern.sub("x" * len(element_find), string, 1)
if changes_done:
    print(string + " | changes were made")
else:
    print(string + " | no changes made")
The key to this solution is the third parameter of sub, count, which limits the number of replacements made. As I said, this is one particular solution to your problem.
The output generated for the replacement is xxxxxx xxxx xxx test | changes were made
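The search, the replacement and the "was anything changed" check can also be folded into one call. The sketch below is one way to do that, not the asker's or the answerer's code: it uses re.subn, which returns both the new string and the number of substitutions, with a replacement function that builds the mask from each match. The function name mask_words is made up for illustration.

```python
import re

PATTERN = re.compile(r"\b(orange|frog|cat)\b", re.IGNORECASE)

def mask_words(s):
    # subn returns (new_string, number_of_substitutions).
    masked, count = PATTERN.subn(lambda m: "#" * len(m.group(0)), s)
    if count:
        masked += " !! This string has been changed !!"
    return masked

print(mask_words("an Orange eaten by a frog and a cat"))
print(mask_words("nothing # to change"))
```

Because the notice depends on the substitution count rather than on the presence of '#', a string that already contains '#' but none of the words is returned unchanged.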

I guess you're using this code inside a function, since you're returning some values.
Anyway, without the num counter:
import re
pattern = r"\b(orange|frog|cat)\b"
s = "an orange eaten by a frog and a cat"
rgx_matches = re.findall(pattern, s, flags=re.IGNORECASE)
for rgx_match in rgx_matches:
    # replace each matched word with the same number of # signs
    s = re.sub(r"\b%s\b" % rgx_match, "#" * len(rgx_match), s)
if rgx_matches:
    s += " !! This string has been changed !!"
print(s)

Complex string filtering with python

I have a long string that is a phylogenetic tree and I want to do a very specific filtering.
(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically, every x#y is a species#gene_id pair. What I am trying to do is trim this down so that only x remains instead of x#y:
(Esy, Aar,(Spa,Cpl))...
I tried splitting the string first, but the problem is that the string has different 'split points' for what I want to achieve, i.e. some x#y parts end with a , and others with a ). I searched for a solution and found regular expression operations, but I am new to Python and I am not sure whether that is what I should be focusing on. I also thought about strip(), but it seems I would need to specify the characters to be stripped.
The main problem is that there is no obvious 'pattern' for me to tell Python to follow. The only regularity is that all species IDs are 3 letters long and come right before a # character.
Is there a method that can do what I want? I will be really glad if you can help me out with my problem. Thanks in advance.

Give this a try:
import re
pat = re.compile(r'(\w{3})#')
txt = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(txt)
Result:
['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']
If you need the structure intact, we can try to remove the unnecessary parts instead:
pat = re.compile(r'(#|:)[^/),]*')
pat.sub('', txt).replace(',', ', ')
Result:
'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'

How about this kind of function:
def parse_string(string):
    new_string = ''
    skip = False
    for char in string:
        if char == '#':
            skip = True
        if char == ',':
            skip = False
        if not skip or char in ['(', ')']:
            new_string += char
    return new_string
Calling it on your string:
string = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'

You can use a regex:
import re
s = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = r"...?(?=#)|\(|\)"
result = re.findall(p, s)
This gives you the result as a list, which you can join into a string or process further.
How the pattern p works:
. matches any single character
...?(?=#) matches two or three characters (the third is optional because of the ?) that are immediately followed by a #; (?=#) is a lookahead, so the # itself is not part of the match. Since every species ID is three letters long, this picks up the three characters before each #
| is the "or" operator, used here to add further alternatives
\( and \) match the literal parentheses ( and )

Try this regex if you need the brackets in the output:
import re
# Branch lengths are only removed when preceded by ':', so a species ID
# ending in 'e' (such as Hse) is left intact.
regex = r"#[A-Za-z0-9_\.:]+|:[0-9\.e-]+|;"
phylogenetic_tree = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
print(re.sub(regex,"",phylogenetic_tree))
Output:
(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))

Because you are trying to parse a phylogenetic tree, I strongly suggest letting BioPython do the heavy lifting for you.
You can easily parse and display a phylogenetic tree with Bio.Phylo. Then it is just a matter of iterating over all tree elements and splitting the names at the '#' sign.
Because Phylo expects the input to be a file, we create an in-memory file-like object with io.StringIO. Getting the complete tree is then as easy as
Phylo.read(io.StringIO(s), 'newick')
In order to check if the parsed tree looks sane, I print it once with print(tree).
Now we want to change all node names that contain a '#'. With tree.find_elements we get access to all nodes. Some nodes don't have a name and some might not contain a '#'. So to be extra careful, we first check if n.name and '#' in n.name. Only then do we split each node's name at the '#' and take just the first part (index 0) of it:
n.name = n.name.split('#')[0]
In order to recreate the initial string representation, we use Phylo.write:
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Again, write wants to get a file argument - if we just want to get a string, we can use a StringIO object again.
Full code:
import io
from Bio import Phylo
if __name__ == '__main__':
    s = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
    tree = Phylo.read(io.StringIO(s), 'newick')
    print(' before '.center(20, '='))
    print(tree)
    for n in tree.find_elements():
        if n.name and '#' in n.name:
            n.name = n.name.split('#')[0]
    print(' result '.center(20, '='))
    out = io.StringIO()
    Phylo.write(tree, out, "newick")
    print(out.getvalue())
Output:
====== before ======
Tree(rooted=False, weight=1.0)
    Clade(branch_length=0.0129090235079)
        Clade(branch_length=0.0726396855636, name='Esy#ESY15_g64743_DN3_SP7_c0')
        Clade(branch_length=0.137507902808, name='Aar#AA_maker7399_1')
        Clade(branch_length=0.0129090235079)
            Clade(branch_length=9.05326020871e-05)
                Clade(branch_length=0.0318934795022, name='Spa#Tp2g18720')
                Clade(branch_length=0.0273465005242, name='Cpl#CP2_g48793_DN3_SP8_c')
            Clade(branch_length=0.00328120860999)
                Clade(branch_length=0.00859075940423)
                    Clade(branch_length=0.0340484449097)
                        Clade(branch_length=0.0332592496158, name='Bst#Bostr_13083s0053_1')
                        Clade(branch_length=0.0150356382287)
                            Clade(branch_length=0.0205924636564)
                                Clade(branch_length=0.0328569260951, name='Aly#AL8G21130_t1')
                                Clade(branch_length=0.0391706378372, name='Ath#AT5G48370_1')
                            Clade(branch_length=0.00998579652059)
                                Clade(branch_length=0.0954469923893, name='Chi#CARHR183840_1')
                                Clade(branch_length=0.0570981548016, name='Cru#Carubv10026342m')
                    Clade(branch_length=0.0372829371381)
                        Clade(branch_length=0.0206478928557)
                            Clade(branch_length=0.0144626717872)
                                Clade(branch_length=0.00823215335663, name='Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
                                Clade(branch_length=0.0085462978729, name='Hlo#DN13684_c0_g1_i1_p1')
                            Clade(branch_length=0.0225079453622, name='Hla#DN22821_c0_g1_i1_p1')
                        Clade(branch_length=0.048590776459, name='Hse#DN23412_c0_g1_i3_p1')
                Clade(branch_length=1.00000050003e-06)
                    Clade(branch_length=0.0378509854703, name='Esa#Thhalv10004228m')
                    Clade(branch_length=0.0712272454125, name='Aal#Aa_G102140_t1')
====== result ======
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;
The default format of Phylo uses fewer digits than your original tree. To keep the numbers unchanged, just override the branch length format string with '%s':
Phylo.write(tree, out, "newick", format_branch_length="%s")

Parsing code can be hard to follow. TatSu lets you write readable parsing code by combining grammars and Python:
text = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
import tatsu
# raw string, so the backslashes in the regex rules reach TatSu unchanged
grammar = r"""
    start = things ';'
        ;
    things = thing [ ',' things ]
        ;
    thing = x '#' y ':' number
        | '(' things ')' ':' number
        ;
    x = /\w+/
        ;
    y = /\w+/
        ;
    number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
        ;
"""
class Semantics:
    def x(self, ast):
        # the method name matches the rule name
        print('X =', ast)
        return ast
parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)

Python Regex matcher until two characters like a OR condition

My question is quite simple.
I'm trying to come up with a regex that selects any word or statement between two delimiters.
For example, the strings look something like this:
') as whatever '
but they can also look like this:
') as whatever\r\n'
So I need to extract 'whatever' from these strings.
The regex I came up with is this:
\)\sas\s(.*?)\s
It works fine and extracts 'whatever', but it seems to work only for the first example, not the second. What should I do in the second case?
I'm basically looking for an OR condition of some kind!
Any help would be appreciated.
Thanks in advance.

The question is not very clear, but the regular expression syntax you are looking for might be something like this:
\)\sas\s(.*?)[\s\r\n]
which says that the string you are interested in is followed by a space or one of the other listed characters.
EDIT
As an example, take the following Python code. Inside square brackets the listed characters are already alternatives, so no '|' is needed there (a '|' inside brackets would match a literal pipe). Note that \s already covers '\r' and '\n'; they are listed only for clarity. The sets below catch strings whose next character is whitespace, '\r', '\n', a '.' or 'd'.
import re
a = ') as whatever '
b = ') as whatever\r\n'
c = ') as whatever.'
d = ') as whateverd'
a_res = re.findall(r'\)\sas\s(.*?)[\s\r\n]', a)[0] # ending with space, \r or newline
b_res = re.findall(r'\)\sas\s(.*?)[\s\r\n]', b)[0]
c_res = re.findall(r'\)\sas\s(.*?)[\s\r\n.]', c)[0] # ending with space, \r, newline or .
d_res = re.findall(r'\)\sas\s(.*?)[\s\r\n.d]', d)[0] # ending with space, \r, newline, . or d
print(a_res, len(a_res))
print(b_res, len(b_res))
print(c_res, len(c_res))
print(d_res, len(d_res))

Your regex already works as intended, because \s also matches '\r' and '\n'. Please check:
import re
a = ') as whatever '
b = ') as whatever\r\n'
print(re.findall(r'\)\sas\s(.*?)\s', a)[0])
print(re.findall(r'\)\sas\s(.*?)\s', b)[0])
This outputs:
whatever
whatever
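For a true OR condition, regex alternation goes inside a group rather than inside square brackets. The sketch below is an illustration, not code from the question: it uses (?:\s|$) to accept either a whitespace character or the end of the string as the terminator, which also covers a string with nothing after 'whatever' (a case added here as an assumption).

```python
import re

samples = [') as whatever ', ') as whatever\r\n', ') as whatever']

# (?:\s|$) is an alternation: a whitespace character OR the end of the string.
pattern = re.compile(r'\)\sas\s(.*?)(?:\s|$)')

results = [pattern.search(text).group(1) for text in samples]
print(results)
```

The non-capturing group (?:...) keeps the terminator out of the captured text, so group(1) holds only the word itself.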

Right justify not working [Python]

I'm working on code to take a user input business name and print out reviews for it. When I run my final loop, I tell python to right justify the reviews by four spaces, yet nothing happens. I've tried multiple solutions and am honestly at a loss.
(Problem area is the very last line)
import json
import textwrap
import sys
f = open('reviews.json')
f1= open('businesses.json')
line1= f1.readline()
business= json.loads(line1)
line = f.readline()
review = json.loads(line)
idlist=[]
reviewlist=[]
bizname= raw_input('Enter a business name => ')
print bizname
for line in f1:
    business= json.loads(line)
    if bizname == business['name']:
        idlist.append(business['business_id'])
if len(idlist)==0:
    print 'This business is not found'
    sys.exit()
for line in f:
    review = json.loads(line)
    for item in idlist:
        if item == review['business_id']:
            reviewlist.append(review['text'])
if len(reviewlist)==0:
    print 'No reviews for this business are found'
    sys.exit()
for i in range(len(reviewlist)):
    w = textwrap.TextWrapper(replace_whitespace=False)
    print 'Review',str(i+1)+':'
    print w.fill(reviewlist[i] , ).rjust(4,' ')

I suggest you check the output of w.fill(reviewlist[i]).
Its length is probably greater than 4, so rjust() appears to do nothing. For example:
>>> 'abcdef'.rjust(4, ' ')
'abcdef'
>>> 'abcdef'.rjust(20, ' ')
'              abcdef'
https://docs.python.org/2/library/string.html#string.rjust

"Right justify by 4 spaces" doesn't make sense, so it's unclear what you really want. The first argument to .rjust() is the total width of the field, and if the string is already at least that long, nothing at all is done. Some examples:
>>> "abcde".rjust(4, " ") # nothing done: 5 > 4
'abcde'
>>> "abcd".rjust(4, " ") # nothing done: 4 == 4
'abcd'
>>> "abc".rjust(4, " ") # extends to 4 with 1 blank on left
' abc'
>>> "ab".rjust(4, " ") # extends to 4 with 2 blanks on left
'  ab'
>>> "a".rjust(4, " ") # extends to 4 with 3 blanks on left
'   a'
>>> "".rjust(4, " ") # extends to 4 with 4 blanks
'    '

Assuming that you actually want to indent the text, you can do it with the TextWrapper object:
indent = ' ' * 4
w = textwrap.TextWrapper(replace_whitespace=False, initial_indent=indent, subsequent_indent=indent)
Demo
>>> indent = ' ' * 4
>>> w = textwrap.TextWrapper(width=20, replace_whitespace=False, initial_indent=indent, subsequent_indent=indent)
>>> print(w.fill('A longish paragraph to demonstrate indentation with TextWrapper objects.'))
    A longish
    paragraph to
    demonstrate
    indentation with
    TextWrapper
    objects.
Note that the indent is included in the line width, so you might want to adjust the width accordingly:
>>> w = textwrap.TextWrapper(width=20+len(indent), replace_whitespace=False, initial_indent=indent, subsequent_indent=indent)
>>> print(w.fill('A longish paragraph to demonstrate indentation with TextWrapper objects.'))
    A longish paragraph
    to demonstrate
    indentation with
    TextWrapper objects.

Most likely it doesn't work because fill() returns a single string that is much longer than 4 characters. Example:
'hello'.rjust(3, '*')
output:
'hello'
While, if you do:
'hello'.rjust(10, '*')
Output:
'*****hello'
So, if I understand what you are trying to do, you may need to split the wrapped string and then apply the right justification to each string in the list, while you print it:
wrapped = w.fill(reviewlist[i])
for line in wrapped.split('\n'):
    print line.rjust(4, ' ')
Although I am not sure that justifying on a width of only four characters is really what you need.

There's a couple of problems you're facing here:
.rjust(4,' ') says you want the result to be 4 characters wide, not that you want to indent the line by 4 spaces.
.rjust() just looks at the length of the string, and after you've run it through textwrap it has a bunch of newlines in it that make the length of the string different than the width it prints out to.
You don't want to right justify, really, you want to indent.
The solution given above about indents is correct.
Formatting text through fixed space fonts is very old school, but also very fragile. Perhaps you could think about generating HTML output in a subsequent revision of your application. HTML tables work well for this and are appropriate for tabular data. Alternatively, consider doing a CSV file, and then you can import the result into Excel.
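The difference between justifying and indenting can be shown with a short sketch. It assumes Python 3.3 or later, where textwrap.indent is available (the question's code is Python 2, where the TextWrapper indent options are the alternative), and the review text is a made-up placeholder.

```python
import textwrap

review = ('This is a long review which should be wrapped to a fixed width '
          'and then indented by four spaces on every line.')

# Wrap first, then prefix every resulting line with four spaces.
wrapped = textwrap.fill(review, width=40)
indented = textwrap.indent(wrapped, ' ' * 4)

print(indented)
```

Here the wrap width applies to the text only, so the printed lines can be up to 44 characters wide including the indent.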

import textwrap
import re
# A sample input string.
inputStr = 'This is a long string which I want to right justify by 70 chars with four spaces on left'
# Wrap by 70 (default) chars. This would result in a multi-line string
w = textwrap.fill(inputStr, )
# Using RegEx read the lines and right justify for 75 chars.
m = re.sub(r"^(\w+.*)$", lambda g: g.group(0).rjust(75), w, flags=re.MULTILINE)
# Print the result
print(m)

How can I use regex to search inside sentence -not a case sensitive

I'm a newbie to regular expressions in Python.
I have a list of strings, and I want to find the ones that contain an employee name.
The employee name can be:
at the beginning, followed by a space
followed by Â®
OR followed by a space
OR at the end, with a space before it
The match should not be case sensitive.
ListSentence = ["SteveÂ®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
The output from the ListSentence is:
["SteveÂ®", "Rob spring", "Car Daniel", "Done daniel"]

First take all your employee names and join them with a | character and wrap the string so it looks like:
(?:^|\s)((?:Steve|Rob|Daniel)(?:Â®)?)(?=\s|$)
By first joining all the names together you avoid the performance overhead of a nested set of for loops.
I don't know Python well enough to offer a Python example; however, in PowerShell I'd write it like this:
[array]$names = @("Steve", "Rob", "daniel")
[array]$ListSentence = @("SteveÂ®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel")
# build the regex, and insert the names as a "|" delimited string
$Regex = "(?:^|\s)((?:" + $($names -join "|") + ")(?:Â®)?)(?=\s|$)"
# use case insensitive match to find any matching array values
$ListSentence -imatch $Regex
Yields
SteveÂ®
Rob spring
Car Daniel
Done daniel
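A rough Python equivalent of the PowerShell snippet above might look like the sketch below. It is an illustration rather than the answerer's code: it assumes Python 3, reuses the list names from the question, and adds re.escape as a precaution in case a name contains regex metacharacters.

```python
import re

ListSentence = ["SteveÂ®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel", "Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]

# Join the (escaped) names into one alternation, as in the PowerShell version.
names = "|".join(re.escape(name) for name in ListEmployee)
regex = re.compile(r"(?:^|\s)((?:" + names + r")(?:Â®)?)(?=\s|$)", re.IGNORECASE)

# Keep every sentence in which the pattern matches, ignoring case.
matches = [sentence for sentence in ListSentence if regex.search(sentence)]
print(matches)
```

Compiling the pattern once and reusing it for every sentence keeps the work to a single pass over the list.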

Why do you want to use regular expressions? I'd generally recommend avoiding them in Python - you can use string methods instead.
For example:
def string_has_employee_name_in_it(test_string):
    test_string = test_string.lower() # case insensitive
    for name in ListEmployee:
        name = name.lower()
        if name == test_string:
            return True
        # lowercase the suffix too, since test_string was lowercased as a whole
        elif (name + 'Â®').lower() == test_string:
            return True
        elif test_string.endswith(' ' + name):
            return True
        elif test_string.startswith(name + ' '):
            return True
        elif (' ' + name + ' ') in test_string:
            return True
    return False
final_list = []
for string in ListSentence:
    if string_has_employee_name_in_it(string):
        final_list.append(string)
final_list is the list you want. This is longer than a regex, but it's also a lot easier to parse and maintain. You can make it a lot shorter in various ways (e.g. combining the tests in the function, and using a list comprehension instead of a loop), but as you're starting out with Python it's a good idea to be clear with what's going on.

I don't think you need to check for all of those scenarios. I think all you need to do is check for word breaks.
You can join the ListEmployee list with | to make an either or regex (also, lowercase it for case-insensitivity), surrounded by \b for word breaks, and that should work:
regex = '|'.join(ListEmployee).lower()
import re
[l for l in ListSentence if re.search(r'\b(%s)\b' % regex, l.lower())]
Should output:
['Steve\xb6\xa9', 'Rob spring', 'Car Daniel', 'Done daniel']

If you're just looking for strings containing a space, as your example indicates, it should be something like this:
[i for i in ListSentence if i.endswith('Â®') or (' ' in i)]

A possible solution:
import re
ListSentence = ["SteveÂ®", "steveHotel", "Rob spring", "Car Daniel", "CarDaniel","Done daniel"]
ListEmployee = ["Steve", "Rob", "daniel"]
def findEmployees(employees, sentence):
    retval = []
    for employee in employees:
        expr = re.compile(r'(^%(employee)s(Â®)?(\s|$))|((^|\s)%(employee)s(Â®)?(\s|$))|((^|\s)%(employee)s(Â®)?$)'
                          % {'employee': employee},
                          re.IGNORECASE)
        for part in sentence:
            if expr.search(part):
                retval.append(part)
    return retval
findEmployees(ListEmployee, ListSentence)
>> Returns ['Steve\xc3\x82\xc2\xae', 'Rob spring', 'Car Daniel', 'Done daniel']
