How to add a new line before a capital letter? - python

I am writing a piece of code to get lyrics from genius.com.
I have managed to extract the code from the website but it comes out in a format where all the text is on one line.
I have used regex to add a space but cannot figure out how to add a new line. Here is my code so far:
text_container = re.sub(r"(\w)([A-Z])", r"\1 \2", text_container.text)
This adds a space before the capital letter, where I would like a new line instead.
It is returning [Verse 1]Leaves are fallin' down on the beautiful ground I heard a story from the man in red He said, "The leaves are fallin' down
I would like to add a new line before "He" in the command line.
Any help would be greatly appreciated.
Thanks :)

If genius.com doesn't somehow provide a separator, it will be very hard to find a way to know what to look for.
In your example, a regex searching for " [A-Z]" will find " He...". But it will also match every place where " I..." appears. Sometimes a new line really does start with "I...", but the regex will also insert line breaks where there shouldn't be any.
TL;DR - genius.com needs to provide some sort of separator so we know when there should be a new line.
Disclaimer: Unless I missed something in your description/example

A quick skim of the view-source for a genius lyrics page suggests that you're stripping all the HTML markup which would otherwise contain the info about linebreaks etc.
You're probably better off posting that code (likely as a separate question) and asking how to correctly extract not just the text nodes, but also enough of the <span> structure to format it as necessary.
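As a rough sketch of the idea, the snippet below keeps line-break information while stripping tags, using only the standard library. It assumes the line breaks are marked with <br> tags, which is a guess about the page's markup and not something confirmed here; the class name LineBreakKeepingExtractor and the function html_to_text are made up for illustration.

```python
from html.parser import HTMLParser


class LineBreakKeepingExtractor(HTMLParser):
    """Collects text nodes and turns every <br> into a newline.

    Assumption: the page marks line breaks with <br> tags.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> and <br/> both end up here (self-closing tags are
        # routed through handle_starttag by default).
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = LineBreakKeepingExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


snippet = "<div>[Verse 1]<br/>Leaves are fallin' down<br>He said</div>"
print(html_to_text(snippet))
```

If the real markup uses other elements to separate lines, the same approach applies with a different tag check.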

Looking around, I found a Python package for pulling lyrics from Genius.com. Here's the link to its documentation:
https://lyricsgenius.readthedocs.io/en/master/
Just follow the instructions and it should have what you need. With more info on the problem I could provide a more detailed response.

I'm not sure about using regex. Try this method:
text = lyrics
new_text = ''
for i, letter in enumerate(text):
    if i and letter.isupper():
        new_text += '\n'
    new_text += letter
print(new_text)
However, as oscillate123 has explained, it will create a new line for every capital letter regardless of the context.
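For comparison, here is a minimal sketch that stays with the regex from the question and only swaps the inserted space for a newline. It assumes the raw text has the lines glued together (for example "groundI heard"), so a capital directly after a word character marks a line start; the function name break_before_capitals is just for illustration.

```python
import re


def break_before_capitals(text):
    # A word character directly followed by a capital letter is treated
    # as a glued line boundary; "\n" in the replacement is a real newline.
    return re.sub(r"(\w)([A-Z])", r"\1\n\2", text)


glued = "fallin' down on the beautiful groundI heard a story from the man in redHe said"
print(break_before_capitals(glued))
```

Capitals that follow a space (such as a mid-line " I") are left alone, but a capital glued to punctuation like "]" is not matched, and all-caps words would be split, so this is only as reliable as the glued-capital assumption.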

How to find exactly "\n" in Python IDLE replace dialog?

I'm a beginner in Python and one of the first programs I've made is an RPG, so there's a lot of text in strings being printed. Before I learned how to "word wrap", I used to test every string and put an "\n" in the right places, so the story would be easier to read in the console.
But now I don't need those "\n" anymore, and it's been really laborious to replace each one of them using the Replace Dialog of Python IDLE. One of the problems is that I want to ignore double new lines ("\n\n"), because they do make the texts more presentable.
So if I just search for "\n" it finds them, but I want to ignore all the "\n\n".
I tried using the "Regular expression" option and did some research on regex, but with no success, since I'm completely new to this area. I tried some things like "^\n$" because, if I understood it right, the ^ and the $ limit the search to what's between them.
I think it's clear what I need, but will write an example anyways:
print("Here's the narrator telling some things to the player. Of course I could do some things but\nnow it's time to ask for help!\n\nProbably it's a simple thing, but it's been lots of time in research and no\nsuccess...")
I want to find and replace those two "\n" with one empty space (" ") and totally ignore the "\n\n".
Can you guys help? Thanks in advance.
You need
re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
See the regex demo.
Details
(?<!\n) - no newline allowed immediately on the left
\n - a newline
(?!\n) - no newline allowed immediately on the right
See Python demo:
import re
text = "Here's the narrator telling some things to the player. Of course I could do some things but\nnow it's time to ask for help!\n\nProbably it's a simple thing, but it's been lots of time in research and no\nsuccess..."
print(re.sub(r'(?<!\n)\n(?!\n)', ' ', text))
Output:
Here's the narrator telling some things to the player. Of course I could do some things but now it's time to ask for help!

Probably it's a simple thing, but it's been lots of time in research and no success...
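The demo above works on the string at run time, where "\n" is a real newline character. In the source file itself, each "\n" is two characters (a backslash and an n), so the same lookaround idea needs the backslash escaped. The sketch below applies that variant to source text; the name unwrap_source is made up for illustration, and whether the IDLE dialog accepts the identical pattern is an assumption worth checking on a copy of the file first.

```python
import re

# A literal backslash-n that is neither preceded nor followed by
# another literal backslash-n, i.e. not part of a "\n\n" pair.
SINGLE_ESCAPED_NEWLINE = re.compile(r"(?<!\\n)\\n(?!\\n)")


def unwrap_source(source):
    # Replace each lone escaped newline with a single space.
    return SINGLE_ESCAPED_NEWLINE.sub(" ", source)


line = r'print("some things but\nnow it is time!\n\nProbably no\nsuccess...")'
print(unwrap_source(line))
```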

Pyparsing for Paragraphs

I have run into a slight problem with pyparsing that I can't seem to solve. I'd like to write a rule that will parse a multiline paragraph for me. The end goal is to end up with a recursive grammar that will parse something like:
Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this
Into something like HTML (of course, with a parse tree I can transform this into whatever format I like):
<Heading class="awesome">
    <p> This is a paragraph and then a line break is inserted and then we have more text </p>
    <p> but this is also a different line with more lines attached</p>
    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>
    <p> But then we can keep writing at the old level and get this</p>
</Heading>
Progress
I have managed to get to the stage where I can parse the heading row, and an indented block using pyparsing. But I can't:
Define a paragraph as multiple lines that should be joined
Allow a paragraph to be indented
An Example
Following from here, I can get the paragraphs to output to a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.
I believe a paragraph should be:
words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd
But this doesn't seem to work for me. Any ideas would be awesome :)
I managed to solve this, so here it is for anybody who stumbles upon it in the future. You can define the paragraph as shown below, although it is certainly not ideal and doesn't exactly match the grammar that I described. The relevant code is:
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)
Where join_lines is defined as:
def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
That should point you in the right direction if this matches your needs :) I hope that helps!
A Better Empty Line
The definition of empty line given above is definitely not ideal, and it can be improved dramatically. The best way I've found is the following:
empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")
This allows you to have empty lines that are filled with spaces, without breaking the match.
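To see what the rules above are aiming for without running pyparsing, here is a plain-Python stand-in (not the pyparsing grammar itself): it treats a line that is empty or contains only spaces as a paragraph separator and reuses join_lines to merge each paragraph. The helper name split_paragraphs is made up for illustration, and the sketch ignores headings and nested indentation entirely.

```python
import re


def join_lines(tokens):
    stripped = [t.strip() for t in tokens]
    return " ".join(stripped)


def split_paragraphs(text):
    # One or more lines that are empty or hold only spaces separate
    # paragraphs, mirroring the intent of the empty_line rule.
    blocks = re.split(r"\n(?:[ ]*\n)+", text.strip())
    return [join_lines(block.split("\n")) for block in blocks]


sample = (
    "This is a paragraph and then\n"
    "a line break is inserted\n"
    "   \n"
    "but this is also a different line\n"
    "  with more lines attached\n"
)
for paragraph in split_paragraphs(sample):
    print(paragraph)
```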

find substrings and replace them but get their information [python]

I want to do something like this to a text (This is just an example to show the problem):
new_text = re.sub(r'\[(?P<index>[0-9]+)\]',
'(Found pattern the ' + index + ' time', text)
Where text is my original text. I want to find any substring like [3] or [454]. That isn't the hard part; the hard part is getting at the number inside. I want to pass that number to a method called add_link(number), which expects a number (the string I'm building with "Found pattern..." is just an example). (It looks the links up in a database that stores them by ID.)
Python tells me it doesn't know the local variable index. How can I make it known?
Edit: I have been told I didn't ask clearly. (I already have an answer, but maybe someone will read this in the future.) The question was how to capture the part of the match described by [0-9]+ so it can be used like a local variable. I guessed it would be something like (?P<index>[0-9]+), and it was.
Thanks in advance, Asqiir
You can reference a named group in the replacement string with the syntax \g<name>. Use a raw string for the replacement so the backslash reaches re.sub intact. So your code should be written as:
new_text = re.sub(r'\[(?P<index>[0-9]+)\]', r'(Found pattern the \g<index> time', text)
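Since the goal in the question is to call add_link(number) with the captured number, another option is to pass a function as the replacement: re.sub calls it once per match and uses its return value. In this sketch add_link is a hypothetical stand-in (the real one is the asker's and is not shown), and replace_refs is a name made up for illustration.

```python
import re


def add_link(number):
    # Hypothetical stand-in for the asker's add_link(number);
    # the real one presumably looks the link up by its ID.
    return "<link #{}>".format(number)


def replace_refs(text):
    def repl(match):
        # The named group is available on the match object.
        return add_link(int(match.group("index")))

    return re.sub(r"\[(?P<index>[0-9]+)\]", repl, text)


print(replace_refs("see [3] and [454]"))
```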

PyEnchant 'correcting' words in dictionary to words not in dictionary

I'm attempting to take large amounts of natural language from a web forum and correct the spelling with PyEnchant. The text is often informal, and about medical issues, so I have created a text file "test.pwl" containing relevant medical words, chat abbreviations, and so on. In some cases, little bits of html, urls, etc do unfortunately remain in it.
My script is designed to use both the en_US dictionary and the PWL to find all misspelled words and correct them to the first suggestion of d.suggest totally automatically. It prints a list of misspelled words, then a list of words that had no suggestions, and writes the corrected text to 'spellfixed.txt':
import enchant
import codecs
def spellcheckfile(filepath):
    d = enchant.DictWithPWL("en_US","test.pwl")
    try:
        f = codecs.open(filepath, "r", "utf-8")
    except IOError:
        print "Error reading the file, right filepath?"
        return
    textdata = f.read()
    mispelled = []
    words = textdata.split()
    for word in words:
        # if spell check failed and the word is also not in
        # mis-spelled list already, then add the word
        if d.check(word) == False and word not in mispelled:
            mispelled.append(word)
    print mispelled
    for mspellword in mispelled:
        #get suggestions
        suggestions=d.suggest(mspellword)
        #make sure we actually got some
        if len(suggestions) > 0:
            # pick the first one
            picksuggestion=suggestions[0]
        else: print mspellword
        #replace every occurence of the bad word with the suggestion
        #this is almost certainly a bad idea :)
        textdata = textdata.replace(mspellword,picksuggestion)
    try:
        fo=open("spellfixed.txt","w")
    except IOError:
        print "Error writing spellfixed.txt to current directory. Who knows why."
        return
    fo.write(textdata.encode("UTF-8"))
    fo.close()
    return
The issue is that the output often contains 'corrections' for words that were in either the dictionary or the pwl. For instance, when the first portion of the input was:
My NEW doctor feels that I am now bi-polar . This , after 9 years of being considered majorly depressed by everyone else
I got this:
My NEW dotor feels that I am now bipolar . This , aftER 9 years of being considERed majorly depressed by evERyone else
I could handle the case changes, but doctor --> dotor is no good at all. When the input is much shorter (for example, the above quotation is the entire input), the result is desirable:
My NEW doctor feels that I am now bipolar . This , after 9 years of being considered majorly depressed by everyone else
Could anybody explain to me why? In very simple terms, please, as I'm very new to programming and newer to Python. A step-by-step solution would be greatly appreciated.
I think your problem is that you're replacing letter sequences inside words. "ER" might be a valid spelling correction for "er", but that doesn't mean that you should change "considered" to "considERed".
You can use regexes instead of simple text replacement to ensure that you replace only full words. "\b" in a regex means "word boundary":
>>> "considered at the er".replace( "er", "ER" )
'considERed at the ER'
>>> import re
>>> re.sub( "\\b" + "er" + "\\b", "ER", "considered at the er" )
'considered at the ER'
#replace every occurence of the bad word with the suggestion
#this is almost certainly a bad idea :)
You were right, that is a bad idea. This is what's causing "considered" to be replaced by "considERed". Also, you're doing a replacement even when you don't find a suggestion. Move the replacement to the if len(suggestions) > 0 block.
As for replacing every instance of the word, what you want to do instead is save the positions of the misspelled words along with the text of the misspelled words (or maybe just the positions and you can look the words up in the text later when you're looking for suggestions), allow duplicate misspelled words, and only replace the individual word with its suggestion.
I'll leave the implementation details and optimizations up to you, though. A step-by-step solution won't help you learn as much.
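For readers who want to see the whole-word idea in isolation, here is a small sketch without PyEnchant. The corrections dictionary is a hypothetical stand-in for "misspelled word -> first suggestion", containing only words that actually had a suggestion; replace_whole_words is a name made up for illustration.

```python
import re


def replace_whole_words(text, corrections):
    # corrections: hypothetical mapping of misspelled word -> suggestion.
    for bad, good in corrections.items():
        # re.escape keeps characters like "." or "-" in the word literal;
        # \b restricts the match to whole words.
        pattern = r"\b" + re.escape(bad) + r"\b"
        # A function replacement avoids backslash handling in `good`.
        text = re.sub(pattern, lambda match, good=good: good, text)
    return text


print(replace_whole_words("considered at the er", {"er": "ER"}))
```

Note that \b only behaves as expected when the misspelled token starts and ends with a letter, digit or underscore; tokens with leading or trailing punctuation would need different handling.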

Spell check program in python

Exercise problem: "given a word list and a text file, spell check the
contents of the text file and print all (unique) words which aren't
found in the word list."
I didn't get solutions to the problem, so can somebody tell me how I went and what the correct answer should be?
As a disclaimer none of this parses in my python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
oh yeh...I'm very sure I've indented everything correctly but I don't know how to format my question here so that it doesn't come out as a block like it has. sorry.
What do you think?
There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
# The filename needs to be a string value - put "C:\..." in quotes!
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right
I don't know what you are trying to iterate over, but why don't you first split the text into words and then, for every word, check whether or not it is in the word list (which is in the variable a, I guess)?
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.
It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
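A minimal sketch of the set-difference approach, for reference. It assumes words are runs of letters and apostrophes and that comparison should ignore case; both choices, and the function name misspelled_words, are assumptions not taken from the exercise text.

```python
import re


def misspelled_words(text, word_list):
    # Lower-case both sides so "The" and "the" compare equal.
    known = {word.lower() for word in word_list}
    # Assumption: a word is a run of letters and apostrophes.
    found = set(re.findall(r"[a-z']+", text.lower()))
    # Set difference leaves the unique words missing from the list.
    return sorted(found - known)


word_list = ["the", "quick", "brown", "fox", "jumps"]
print(misspelled_words("The quick brwn fox. The fox jmps!", word_list))
```

To check a file, read it into a string first, passing the path to open() as a (raw) string, and hand that string to the function.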
