This question already has answers here:
In Python, how do I split a string and keep the separators?
(19 answers)
Closed 9 years ago.
This code almost does what I need it to:
for line in all_lines:
    s = line.split('>')
Except it removes all the '>' delimiters.
So,
<html><head>
Turns into
['<html','<head']
Is there a way to use the split() method but keep the delimiter, instead of removing it?
With these results..
['<html>','<head>']
d = ">"
for line in all_lines:
    # re-append the delimiter to every non-empty piece
    s = [e+d for e in line.split(d) if e]
If you are parsing HTML with splits, you are most likely doing it wrong, except if you are writing a one-shot script aimed at a fixed and secure content file. If it is supposed to work on any HTML input, how will you handle something like <a title='growth > 8%' href='#something'>?
Anyway, the following works for me:
>>> import re
>>> re.split('(<[^>]*>)', '<body><table><tr><td>')[1::2]
['<body>', '<table>', '<tr>', '<td>']
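For the attribute case raised above (a '>' inside a quoted value), here is a minimal sketch using the standard library's html.parser instead of splitting. The names TagCollector and collect_tags are hypothetical, chosen only for this illustration:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect (tag, attributes) pairs for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


def collect_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


tags = collect_tags("<body><a title='growth > 8%' href='#something'>")
print(tags)
```

The parser treats the quoted attribute value as a unit, so the '>' inside it does not end the tag.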
How about this:
import re
s = '<html><head>'
re.findall('[^>]+>', s)
Just split it, then for each element in the array/list (apart from the last one) add a trailing ">" to it.
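A minimal sketch of that idea, assuming a hypothetical helper named split_keep (not part of the standard library):

```python
def split_keep(text, delimiter=">"):
    """Split text on delimiter, keeping the delimiter at the end of each piece."""
    parts = text.split(delimiter)
    # Every piece except the last was followed by a delimiter in the original text.
    kept = [part + delimiter for part in parts[:-1]]
    # The last piece had no delimiter after it; keep it only if it is non-empty.
    if parts[-1]:
        kept.append(parts[-1])
    return kept


print(split_keep("<html><head>"))
```

Unlike appending the delimiter to every piece, this does not add a '>' to trailing text that never had one.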
This question already has answers here:
How to input a regex in string.replace?
(7 answers)
Closed 2 years ago.
I'm extremely new to coding and python so bear with me.
I want to remove all text that is in parentheses from a text file. There are multiple sets of parentheses with varying numbers of characters inside. From another similar post on here, I found
re.sub(r'\([^()]*\)', '', "sample.txt")
which is supposed to remove characters between () but does absolutely nothing. It runs but I get no error code.
I've also tried
intext = 'C:\\Users\\S--\\PycharmProjects\\pythonProject1\\sample.txt'
outtext = 'C:\\Users\\S--\\PycharmProjects\\pythonProject1\\EDITEDsample.txt'
with open("sample.txt", 'r') as f, open(outtext, 'w') as fo:
    for line in f:
        fo.write(line.replace('\(.*?\)', '').replace('(', " ").replace(')', " "))
which successfully removes the parentheses but nothing in between them.
How do I get the characters between the parentheses out?
EDIT: I was asked for a sample of sample.txt, these are its contents:
Example sentence (first), end of sentence. Example Line (second), end
of sentence (end).
The function re.sub does not take a filename as its third argument; it takes the text to work on. Your call therefore searched the literal string "sample.txt", which contains no parentheses, so nothing changed.
>>> re.sub(r'\([^()]*\)', '', "123(456)789")
'123789'
As for your second attempt, notice that str.replace does not accept regular expressions, only literal strings.
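Putting both points together, a sketch might look like the following. The helper names strip_parenthesized and strip_file are hypothetical, and the file paths are whatever you pass in:

```python
import re

PAREN_GROUP = re.compile(r'\([^()]*\)')


def strip_parenthesized(text):
    """Remove every innermost (...) group, parentheses included."""
    return PAREN_GROUP.sub('', text)


def strip_file(in_path, out_path):
    # Read the whole file so a group that spans a line break is matched too.
    with open(in_path) as src:
        text = src.read()
    with open(out_path, 'w') as dst:
        dst.write(strip_parenthesized(text))


sample = ("Example sentence (first), end of sentence. Example Line (second), end\n"
          "of sentence (end).")
print(strip_parenthesized(sample))
```

Note that the pattern only removes groups with no parentheses inside them, so nested parentheses would need more than one pass.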
This question already has answers here:
Why are empty strings returned in split() results?
(9 answers)
Closed 2 years ago.
I'm using split to parse http requests and came across something that I do not like but don't know a better way.
Imagine I have this GET : /url/hi
I'm splitting the url simply like so:
fields = request['url'].split('/')
It's simple and it works, but the first element of the resulting list is an empty string. I know this is expected behavior.
The question is: can I change the call to split to account for this, or do I just live with it?
If you always want to remove the first entry from the list, you could just do this:
fields = request['url'].split('/')[1:]
If you want to remove every empty string from the list, follow your initial call with this (fields.remove('') would remove only the first one and raise ValueError if there is none):
fields = [f for f in fields if f]
Hope it helps!
OK, if you are sure your string starts with '/',
you can skip the first character like this:
url = request['url']
fields = url[1:].split('/')  # slice from index 1 to the end
If you're not sure, simply check first:
url = request['url']
if url.startswith('/'):
    url = url[1:]
fields = url.split('/')
Happy coding 😎
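As a sketch of the "drop the empty pieces" approach in one place, assuming a hypothetical helper named path_segments:

```python
def path_segments(url_path):
    """Return the non-empty segments of a slash-separated path."""
    return [segment for segment in url_path.split('/') if segment]


print(path_segments('/url/hi'))
```

This also copes with a trailing slash or doubled slashes, at the cost of no longer being able to tell those cases apart.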
This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 3 years ago.
I have a string like this:
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''
I am trying to get anything that appears as title (title="Anything here")
I have already tried this but it does not work correctly.
re.findall(r'title=\"(.*)\"',string)
I think your regex is too greedy. You can try something like this:
re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
As #Austin and #Plato77 said in the comments, there is a better way to parse HTML in python. See other SO Answers for more context. There are a few common tools for this like:
https://docs.python.org/3/library/html.parser.html
https://www.simplifiedpython.net/parsing-html-in-python/
https://github.com/psf/requests-html / Get html using Python requests?
If you would like to read more on performance testing of different python HTML parsers you can learn more here
As #Austin and #Plato77 said in the comments, there is a better way to parse HTML in python. I stand by this too, but if you want to get it done through regex this may help
c = re.finditer(r'title=[\"]([a-zA-Z0-9\s]+)[\" ]', string)
for i in c:
    print(i.group(1))
The problem here is that .* is greedy: the next " symbol is matched as an ordinary character and becomes part of the (.*) group of your RE, so the match runs on to the last " on the line. For your use case, you can restrict the group to letters, digits, and whitespace.
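If the titles may contain punctuation as well, one alternative is a negated character class that stops at the first closing quote. This is only a sketch; the sample string below is made up for illustration and is not the asker's exact input:

```python
import re

# Capture everything up to, but not including, the next double quote.
TITLE_ATTR = re.compile(r'title="([^"]*)"')

sample = ('<img height="233" src="monline/" title="email example" width="500" '
          'title="second example title" width="600">')

titles = TITLE_ATTR.findall(sample)
greedy = re.findall(r'title="(.*)"', sample)

print(titles)
print(greedy)
```

The greedy pattern returns a single match that swallows everything up to the last quote, while the negated class returns each title separately.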
This question already has answers here:
Replacing instances of a character in a string
(17 answers)
Closed 5 years ago.
I tried to replace vowels by adding "il" in front of them using this code, but it didn't work. Help!
line=input("What do you want to say?\n")
line = line.replace('e', 'ile')
line = line.replace('o', 'ilo')
line = line.replace('a', 'ila')
line = line.replace('i', 'ili')
line = line.replace('u', 'ilu')
line = line.replace('y', 'ily')
print (line)
But if you type a long sentence, it stops working correctly.
Could someone please help me?
I want to print "Hello world".
It prints:
Hililellililo wililorld
when it should print Hilellilo Wilorld
Try replacing every occurrence of the letters you want with a regex. For example:
import re
re.sub(r'[eE]', r'i\g<0>', "Hello World")
You can replace any letters you want by putting them inside the square brackets.
Additionally, in r'i\g<0>' the i is a literal character and \g<0> stands for the letter that was matched. (Python does not understand $0 in a replacement string.)
"Hello world".replace('e', 'ie')
But your question is not very clear; maybe you mean something different.
Whenever you do multiple replacements after each other, you always need to be careful with the order in which you do them.
In your case put this replacement first:
line = line.replace('i', 'ili')
Otherwise it replaces the i's in the replacements that have been done before.
When you need to do many replacements it is often better to use an approach that avoids these problems.
One option is regular expressions, as already proposed. Another is to scan the text from start to end, replace each item as you find it, and continue scanning after the replacement.
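A single-pass version with re.sub could be sketched like this. The function name add_il and the choice of vowels (taken from the question's replace calls, lowercase only) are assumptions:

```python
import re

VOWELS = re.compile(r'[aeiouy]')


def add_il(text):
    # Each vowel is matched once in the original text, so the inserted
    # "il" prefixes are never scanned again.
    return VOWELS.sub(lambda match: 'il' + match.group(0), text)


print(add_il("Hello world"))
```

Because the text is scanned only once, the order of the vowels in the character class does not matter.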
This question already has answers here:
How can I remove a trailing newline?
(27 answers)
Closed 8 years ago.
I had a list that was read from a text file into an array, but now every item has "\n" on the end. Obviously you don't see it when you print it because it just starts a new line. I want to remove it because it is causing me some hassle.
database = open("database.txt", "r")
databaselist = database.readlines()
That's the code I used to read from the file. I am a total noob, so please don't use crazy technical talk, otherwise it will go straight over my head.
"string with or without newline\n".rstrip('\n')
Using rstrip with \n avoids any unwanted side-effect except that it will remove multiple \n at the end, if present.
Otherwise, you need to use this less elegant function:
def rstrip1(s, c):
    # s[-1:] is '' for an empty string, so this does not raise IndexError
    return s[:-1] if s[-1:] == c else s
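To make the difference concrete, here is a small sketch comparing the two behaviours on a made-up string that ends in two newlines:

```python
line = "value\n\n"

# rstrip removes every trailing newline.
all_removed = line.rstrip("\n")
# Slicing after an endswith check removes at most one.
one_removed = line[:-1] if line.endswith("\n") else line

print(repr(all_removed))
print(repr(one_removed))
```

For lines read from a file, each line ends in at most one newline, so either approach gives the same result there.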
Use str.rstrip to remove the newline character at the end of each line:
databaselist = [line.rstrip("\n") for line in database.readlines()]
However, I recommend that you make three more changes to your code to improve efficiency:
1. Remove the call to readlines. Iterating over a file object yields its lines one at a time.
2. Remove the "r" argument to open since the function defaults to read-mode. This will not improve the speed of your code, but it will make it less redundant.
3. Most importantly, use a with-statement to open the file. This will ensure that it is closed automatically when you are done.
In all, the new code will look like this:
with open("database.txt") as database:
    databaselist = [line.rstrip("\n") for line in database]