I'm using the win32com module to edit Word documents automatically with Python, and I've hit an annoying problem that you have probably seen before. I use the Find and Replace function of the module to insert paragraphs into a template that I have, but sometimes I'd like to insert several paragraphs at once, separated by line breaks. The Python string for these paragraphs looks something like this: text = "First paragraph.\nSecond paragraph."
The problem is that when I use Find and Replace with that kind of string, it doesn't produce a line break. Instead I get something like First paragraph Second paragraph, which is obviously not what I want.
Does anyone have an idea how to deal with this?
Thanks for the help!
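One thing that may be worth trying: Word's Find and Replace has its own special codes for the "Replace with" text, where ^p stands for a paragraph mark and ^^ for a literal caret. The sketch below only converts a Python string into that notation. The helper name to_word_replacement and the {{BODY}} placeholder are made up for illustration, and the commented-out win32com call is an assumption about how the template is being filled, not tested code.

```python
def to_word_replacement(text):
    """Convert Python newlines into Word's '^p' replacement code."""
    # Escape literal carets first, since '^' introduces special codes in Word.
    escaped = text.replace("^", "^^")
    # Normalise Windows and old Mac line endings, then map each newline to ^p.
    normalised = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.replace("\n", "^p")


text = "First paragraph.\nSecond paragraph."
print(to_word_replacement(text))

# Hypothetical usage with win32com (assumes `doc` is an open Word document
# and that the template contains the made-up placeholder {{BODY}}):
# doc.Content.Find.Execute(FindText="{{BODY}}",
#                          ReplaceWith=to_word_replacement(text),
#                          Replace=2)  # 2 = wdReplaceAll
```

If a soft line break inside one paragraph is wanted instead of a new paragraph, ^l is the corresponding code.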
Related
I am writing a piece of code to get lyrics from genius.com.
I have managed to extract the code from the website but it comes out in a format where all the text is on one line.
I have used regex to add a space but cannot figure out how to add a new line. Here is my code so far:
text_container = re.sub(r"(\w)([A-Z])", r"\1 \2", text_container.text)
This adds a space before the capital letter, but I cannot figure out how to add a new line.
It is returning [Verse 1]Leaves are fallin' down on the beautiful ground I heard a story from the man in red He said, "The leaves are fallin' down
I would like to add a new line before "He" in the command line.
Any help would be greatly appreciated.
Thanks :)
If genius.com doesn't somehow provide a separator, it will be very hard to find a way to know what to look for.
In your example, I made a regex searching for " [A-Z]", which will find " He...". But it will also find every place where a word such as " I..." appears in the middle of a line. Sometimes a new line really will start with "I...", but the regex might also create new lines where there shouldn't be any.
TL;DR - genius.com needs to provide some sort of separator so we know when there should be a new line.
Disclaimer: Unless I missed something in your description/example
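To make the limitation concrete, here is a small sketch of the " [A-Z]" idea applied to the text from the question. The function name break_before_capitals is just for illustration.

```python
import re

lyrics = ("[Verse 1]Leaves are fallin' down on the beautiful ground "
          "I heard a story from the man in red "
          "He said, \"The leaves are fallin' down")


def break_before_capitals(text):
    # Replace every space that is followed by a capital letter with a newline.
    return re.sub(r" (?=[A-Z])", "\n", text)


print(break_before_capitals(lyrics))

# The same rule also splits in the middle of an ordinary sentence:
print(break_before_capitals("Then I saw him"))
```

The second call breaks the line before "I" even though no new line belongs there, which is why a real separator from the page is needed.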
A quick skim of the view-source for a genius lyrics page suggests that you're stripping all the HTML markup which would otherwise contain the info about linebreaks etc.
You're probably better off posting that code (likely as a separate question) and asking how to correctly extract not just the text nodes, but also enough of the <span> structure to format it as necessary.
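As a rough sketch of what keeping the line-break information could look like, the standard library's html.parser can collect the text nodes while turning <br> tags into newlines. The sample markup below is invented for illustration; the real structure of a Genius page may differ, so treat the assumption that lines are separated by <br> as unverified.

```python
from html.parser import HTMLParser


class LyricsExtractor(HTMLParser):
    """Collect text nodes and emit a newline for every <br> tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both end up here (the latter via handle_startendtag).
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = LyricsExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# Invented sample markup, not copied from genius.com.
sample = "<div>[Verse 1]<br>Leaves are fallin' down<br/><span>He said</span></div>"
print(html_to_text(sample))
```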
Looking around, I found a Python package for pulling lyrics from Genius.com. Here's the link to its documentation:
https://lyricsgenius.readthedocs.io/en/master/
Just follow the instructions and it should have what you need. With more info on the problem, I could provide a more detailed response.
I'm not sure about using regex. Try this method:
text = lyrics
new_text = ''
for i, letter in enumerate(text):
    if i and letter.isupper():
        new_text += '\n'
    new_text += letter
print(new_text)
However, as oscillate123 has explained, it will create a new line for every capital letter regardless of the context.
I wrote a Python script for a custom HTML page that finds a word within a string/line and highlights just that word using the following tags, where instance is the word that is searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive) let's say "port" within a string that can be port, Port, SUPPORT, Support, support etc, which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However my strings often contain 2 or more instances in single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing cases.
The problem with my approach is that I am attempting to iterate over each of the instances found with findall (exact match),
while multiple identical matches can also be found within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking, I would be able to avoid this if I was able to find out exact part of the string that the pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches (case insensitive), then I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
Okay, here are two ways I put together quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces each match with the lowercase search term.
import re

word = "port"

# Read the lines once: a file object is exhausted after one pass,
# so looping over it a second time would print nothing.
with open("testing.txt", "r") as f:
    lines = f.readlines()

# THIS LOOP IS CASE SENSITIVE
for line in lines:
    newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
    print(newline)

# THIS LOOP IS CASE INSENSITIVE
pattern = re.compile(word, re.IGNORECASE)
for line in lines:
    newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
    print(newline)
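If the original case needs to be kept, one option is to refer back to the matched text in the replacement with \g<0> instead of inserting the search term itself. This is a minimal sketch; the function name highlight is made up, and re.escape is added on the assumption that the search word should be treated as literal text.

```python
import re


def highlight(word, text):
    # re.escape keeps characters such as '.' or '+' in the word literal.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # \g<0> is whatever was actually matched, so its case is preserved.
    return pattern.sub(r'<b><font color="red">\g<0></font></b>', text)


print(highlight("port", "Port and SUPPORT"))
```

Because all matches are replaced in a single pass, already inserted tags are never matched again, which avoids the nested <b><font> output from the question.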
I found many programs online to replace text in a string or file with words prescribed in a dictionary. For example, https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python
But I was wondering how to get the program to ignore certain parts of the text. For instance, I would like it to ignore parts that are ensconced within say % signs (%Please ignore this%). Better still, how do I get it to ignore the text within but remove the % sign at the end of the run.
Thank you.
This could very easily be done with regular expressions, although they may not be supported by any online programs you find. You will probably need to write something yourself and then use regexes as your dict's search keys.
A good place to start playing around with regex is: http://regexr.com
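A rough sketch of how the regex approach might look: match the %...% sections first and return their contents untouched (minus the % signs), and only replace dictionary words everywhere else. The names replace_outside_percent and word_dic are hypothetical, and the sketch assumes a non-empty dictionary and that % signs always come in pairs.

```python
import re


def replace_outside_percent(text, word_dic):
    # Longest keys first so that longer words win over their prefixes.
    keys = sorted(word_dic, key=len, reverse=True)
    words = "|".join(re.escape(k) for k in keys)
    # Group 1: text protected by % signs. Group 2: a word to replace.
    pattern = re.compile(r"%([^%]*)%|\b(" + words + r")\b")

    def repl(match):
        if match.group(1) is not None:
            return match.group(1)          # keep the text, drop the % signs
        return word_dic[match.group(2)]    # ordinary dictionary replacement

    return pattern.sub(repl, text)


print(replace_outside_percent("the booster and %the booster%",
                              {"booster": "rooster"}))
```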
Well, in the replacement dictionary, just add an extra entry for any word you want ignored: for example, have teh replaced with the, but %teh% replaced with teh. For the program in the link you could have
wordDic = {
    'booster': 'rooster',
    '%booster%': 'booster'
}
Please don't get me wrong on this, but I'm just curious whether I can take a text file, find out how many lines within that text file have been written on, and then use that number to print selected data from every few lines. Also, could I use Python to find specific words within the text file that are evenly spaced apart? For example, if everything in the text file was written like this
name:> Ben
Score:> 2
name:> Ethan
Score:> 8
name:> James
Score:> 0
would it be possible for me to search the text file for the string 'name:>' (and then save whatever comes in front of it, if possible, to a variable)? Or, seeing as they're all equally spaced, could I save the specific score of one person to a variable with their name (as everything in front would be equally spaced), without having to open the txt file at all?
If all of this sounds completely impossible, or if any of you have even a vague idea of what I'm talking about (in which case I'm in awe of your powers of comprehension, given this badly worded example), please give me any thoughts or ideas on how to format text files to create variables.
If all of the above seems too complex, could someone please just tell me whether it's possible to work out how many lines within a text file have been written on? From there I've got a vague idea of how to create my program.
You can use a regular expression (RE) to search the text file as a string, then find where the existing value you want to change is in the text file and write it.
https://docs.python.org/2/library/re.html
To do what you are asking, I would personally use the built-in re module, as follows:
import re

with open("foo.txt", "r") as foo:
    contents = foo.read()

# re.search returns None when nothing matches, so check before calling .group()
match = re.search("foo-bar", contents)
if match:
    print(match.group())
That should do what you are looking for.
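For the name/score layout in the question, a sketch along these lines might be closer to what is being asked: it counts the lines that have something written on them and pairs each name with the score that follows it. The function name parse_scores is made up, and the text is passed in as a string here; in practice it would come from reading the file (it does have to be opened to be read).

```python
def parse_scores(text):
    # Lines that have been "written on" are the non-blank ones.
    written = [line for line in text.splitlines() if line.strip()]
    scores = {}
    current_name = None
    for line in written:
        key, sep, value = line.partition(":>")
        if not sep:
            continue  # not a "key:> value" line
        key, value = key.strip().lower(), value.strip()
        if key == "name":
            current_name = value
        elif key == "score" and current_name is not None:
            scores[current_name] = int(value)
    return len(written), scores


sample = "name:> Ben\nScore:> 2\nname:> Ethan\nScore:> 8\nname:> James\nScore:> 0\n"
line_count, scores = parse_scores(sample)
print(line_count, scores)
```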
First of all I'd like to say this is my first post on SO, which has been of great help for years to me, so thank you all!
Now onto my question:
I have a string of characters containing unicode text, html tags and bbcode tags (which is obviously extracted from a forum).
Sample:
This is my sample text.
It may contain HTML tags,
[b]BBCode[b],
or even [b][u]both[/u] intricated[/b]!
I also have a list of keywords which may appear in the text described above, and for each of these words I have an associated URL.
Sample:
kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}
As you can see I'm currently using Python because I'm used to the language, but I can be flexible.
My goal is to detect which word(s) in my keyword list is present in the sample text, and to "decorate" the matching word(s) with a link (preferably in bbcode) to the corresponding URL, without altering the rest of the string (just like for Wikis).
Taking the examples above further, I'd like to obtain:
This is my [url=http://www.sample.fr]sample[/url] text.
It may contain HTML tags,
[b][url=http://www.bbcode.sp]BBCode[/url][b],
or even [b][u]both[/u] intricated[/b]!
The main problem here is that sometimes, one of the keywords in my list appears inside a tag, which I do not want to "decorate" with a link for obvious reasons.
In other words, the text I'd like to replace can be located only outside the anchor tags:
**HERE** <not here>[not here] **HERE** [/not here]</not here> **HERE**
Also, I've already tried using BeautifulSoup (along with PostMarkup to convert BBCode to HTML before parsing with BeautifulSoup) but it doesn't allow me to keep the initial string...
Remark: "real" text can never actually be placed between brackets (neither angle nor square) due to the general usage of my forum, so this simplifies the problem quite a bit.
I'm sorry for my very long question, I hope everything is clear!
Any help appreciated, thanks to everyone in advance!
Update: Casimir's solution in Python (see below) works just great. Thank you Casimir et Hippolyte!
To do that, the approach is always the same: you must first match what you want to avoid.
Example:
(?s)               # dotall mode
(                  # capture everything you want to avoid
    <!--.*?-->     # html comment
  |
    <[^>]+>        # html tag
  |
    \[[^\]]+\]     # bbcode
)
|                  # OR
kw1|kw2|kw3|...
Then you must use a function as the replacement. Inside the function, when capture group 1 is defined, you return the match unchanged; otherwise you return the corresponding replacement string for the keyword.
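In Python, the idea above could be sketched roughly like this, using the kw dictionary from the question. The function name link_keywords is made up, and the \b word boundaries and the longest-first ordering of keywords are additions that the pattern above does not spell out.

```python
import re

kw = {'sample': 'http://www.sample.fr', 'BBCode': 'http://www.bbcode.sp'}


def link_keywords(text, kw):
    # Longest keywords first, each escaped so it is matched literally.
    keywords = "|".join(re.escape(k) for k in sorted(kw, key=len, reverse=True))
    pattern = re.compile(
        r"(<!--.*?-->|<[^>]+>|\[[^\]]+\])"   # group 1: comment, HTML tag, BBCode
        r"|\b(" + keywords + r")\b",          # group 2: a keyword
        re.DOTALL,
    )

    def repl(match):
        if match.group(1) is not None:
            return match.group(1)             # leave tags untouched
        word = match.group(2)
        return "[url={}]{}[/url]".format(kw[word], word)

    return pattern.sub(repl, text)


text = ("This is my sample text.\nIt may contain HTML tags,\n"
        "[b]BBCode[b],\nor even [b][u]both[/u] intricated[/b]!")
print(link_keywords(text, kw))
```

Since tags are consumed by group 1 before the keyword alternative gets a chance, a keyword appearing inside <...> or [...] is never decorated.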