Python splitting to the newline character - python

I have an HTML file from which I am retrieving just the body text, and I would like to print one single line of it.
Right now I am printing:
for line in newName.body(text=True):
    print line
This gives me everything in the body. What I would like is something like:
for line in newName.body(text=True):
    print line[257:_____] # this is where I need help
Instead of filling in the blank with another number for the end, I want the slice to run to the newline character, so it looks like:
for line in newName.body(text=True):
    print line[257:'\n']
However, that doesn't work. How can I make it work?
The text I am working with is located in:
<body>
<pre>
the text i want
</pre>
</body>

You could use .partition() method to get the first line:
first_line = newName.body.getText().partition("\n")[0]
assuming newName is a BeautifulSoup object. It is usually named soup.
To get text from the first <pre> tag in the html:
text = soup.pre.string
To get a list of lines in the text:
list_of_lines = text.splitlines()
If you want to keep end of line markers in the text:
list_of_lines = text.splitlines(True)
To get i-th line from the list:
ith_line = list_of_lines[i]
note: zero-based indexing e.g., i = 2 corresponds to the 3rd line.
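A minimal sketch of the same string methods on a plain string, so it runs without BeautifulSoup. The sample `text` is made up for illustration; in the question it would come from `soup.pre.string`.

```python
# Hypothetical sample standing in for the text extracted from the <pre> tag.
text = "first line\nsecond line\nthird line\n"

# partition() splits at the first newline only: (before, separator, after).
first_line = text.partition("\n")[0]

# splitlines() drops the line endings; splitlines(True) keeps them.
lines = text.splitlines()
lines_with_endings = text.splitlines(True)

print(first_line)
print(lines[2])                  # zero-based, so this is the 3rd line
print(repr(lines_with_endings[0]))
```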

There is no guarantee that your HTML file has more than one line. The web page may be laid out in lines, but the structure of the page doesn't have to match the structure of the markup and vice versa.
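Just to be sure, try this (get_text() returns the body text as one string; body(text=True) returns a list of strings, which has no split() method):
print(len(newName.body.get_text().split('\n')))
If the value is >1, then you should be able to get the line you need like:
newName.body.get_text().split('\n')[257]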
Maybe not the most graceful way, but it works, if there are in fact multiple lines.

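Is it that you want line[257:line.find('\n', 257)]? If you are sure the text starts at 257, then equally you must be sure there is a \n after it: find() returns -1 when there is none, and the slice would then silently drop the last character.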
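A small sketch of slicing from a fixed offset up to the next newline with `str.find()`, including the case where no newline follows. The helper name `slice_to_newline` and the sample string are assumptions made for illustration.

```python
def slice_to_newline(line, start):
    """Return line[start:] up to, but not including, the next newline."""
    end = line.find("\n", start)   # search only from `start` onwards
    if end == -1:                  # no newline after `start`: take the rest
        end = len(line)
    return line[start:end]

# Hypothetical sample; in the question the offset would be 257.
sample = "header\nthe text i want\nfooter"
print(slice_to_newline(sample, 7))
```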


parsing file and appending string python

Let's say I have a file:
This is the first line
Ages = ["young*","old*"] //This was the second line, I put a * on purpose
This is the third line
The scenario is as:
I know there is the "Ages" array inside the file, but I don't have any idea about its elements.
I now want to append a specific string, say "test*" after each element, the file would become:
This is the first line
Ages = ["young*test*","old*test*"] //This was the second line, I put a * on purpose
This is the third line
Any help?
First you need to open the file with read and write mode and read it all to a string, then I think your best bet would be to use regular expressions to group what you want to replace and replace it. Then write it back to the file. It would be something similar to below:
import re
pattern = re.compile(r'') # this would be your pattern
with open("filename.txt", "r+") as file:
    content = file.read() # all the content as a string
    replaced = pattern.sub('the match group with your addition', content)
    file.seek(0) # the seek is necessary to return to the start of the file
    file.write(replaced) # write the modified content to the file
    file.truncate() # truncate in case any trailing parts remain from before
I think this is homework, so I won't complete the regular expression pattern here, but I will give a hint:
"Ages", followed by zero or more spaces, then "=", followed by zero or more spaces, then the array's opening character, with a match group for the inner array elements and another inner group that marks the end of each element. Then you can replace accordingly.
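One way the hint could be filled in, as a sketch rather than the answerer's own pattern. It assumes the elements are double-quoted strings that contain no `"` or `]`; the names `ARRAY`, `ELEMENT` and `append_to_ages` are made up for illustration.

```python
import re

# Matches the whole `Ages = [ ... ]` array; group 2 is the list body.
ARRAY = re.compile(r'(Ages\s*=\s*\[)(.*?)(\])')
# Matches one double-quoted element; group 1 is its content.
ELEMENT = re.compile(r'"([^"]*)"')

def append_to_ages(content, suffix):
    """Append `suffix` to every element of the Ages array in `content`."""
    def fix_array(match):
        # A function replacement avoids backslash surprises in `suffix`.
        body = ELEMENT.sub(lambda m: '"' + m.group(1) + suffix + '"',
                           match.group(2))
        return match.group(1) + body + match.group(3)
    return ARRAY.sub(fix_array, content)

line = 'Ages = ["young*","old*"] //This was the second line'
print(append_to_ages(line, "test*"))
```

The result of `append_to_ages(content, "test*")` can be passed to `file.write()` in place of `replaced` in the snippet above.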

How to split one very long line in Pycharm into multiple lines?

After parsing, I've got a lot of URLs that have unfortunately been joined together on one line. Re-parsing would take a long time, so is there a way to turn one long line of URLs into multiple lines, one URL per line?
What I have:
'https:// url1.com/bla1','https:// url1.com/bla2',..thousands of urls..,'https:// url999.com/blaN'
What I need:
'https:// url1.com/bla-1',
'https:// url1.com/bla-2',
etc
'https:// url999.com/bla-N'
I've already tried unchecking Line breaks under Python - Wrapping and Braces and checking Ensure right margin is not exceeded, but that doesn't work.
So how can I fix it?
Yes.
First set Code->Style->Wrapping and Braces->Method parameters/Method call arguments to wrap if long or chop down if long.
After that simply call reformat code on the line (Command+Alt+L).
Let's try a simple method, if I understand your query correctly. Read the first file, replace commas with newline character, and write the result to the same file.
with open('test1.txt', 'r') as urlsfile: # in case you are getting the data from file itself
    urls = urlsfile.readline()
newlines = urls.replace(",", "\n") # otherwise replace newlines with the variable name that you are trying to write to the file
with open('test1.txt', 'w') as newfile:
    newfile.write(newlines)
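A variation on the replace approach, sketched under the assumption that every URL is wrapped in single quotes and no URL itself contains the sequence `','`. It breaks only at the boundary between two URLs and keeps the trailing commas, as in the desired output. The sample URLs are made up.

```python
# Hypothetical sample of the joined line.
urls = "'https://url1.com/bla1','https://url1.com/bla2','https://url999.com/blaN'"

# Break only at the quote-comma-quote boundary between two URLs,
# so commas inside a URL (e.g. in a query string) are left alone.
one_per_line = urls.replace("','", "',\n'")
print(one_per_line)
```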

Removing an imported text file (Python)

I'm trying to remove a couple of lines from a text file that I imported from my Kindle. The text looks like:
Shall I come to you?
Nicholls David, One Day, loc. 876-876
Dexter looked up at the window of the flat where Emma used to live.
Nicholls David, One Day, loc. 883-884
I want to grab the bin bag and do a forensics
Sophie Kinsella, I've Got Your Number, loc. 64-64
The complete file is longer; this is just a piece of the document. The aim of my code is to remove all lines that contain "loc. " so that just the extracts remain. My target can also be seen as removing the line just before each blank line.
My code so far looks like this:
f = open('clippings_export.txt','r', encoding='utf-8')
message = f.read()
line=message[0:400]
f.close()
key=["l","o","c","."," "]
for i in range(0,len(line)-5):
    if line[i]==key[0]:
        if line[i+1]==key[1]:
            if line[i + 2]==key[2]:
                if line[i + 3]==key[3]:
                    if line[i + 4]==key[4]:
                        pass  # "loc. " starts at index i; what next?
The last if finds exactly the position (index) where each "loc. " is located in the file. However, after this stage I do not know how to go back along the line so that the code finds where the line starts and the line can be removed completely. What could I do next? Would you recommend another way to remove this line?
Thanks in advance!
I think that the question might be a bit misleading!
Anyway, if you simply want to remove those lines, you need to check whether they contain the "loc." substring. Probably the easiest way is to use the in operator.
Instead of getting the whole file from the read() function, read the file line by line (using the readlines() function, for example). You can then check whether each line contains your key and omit it if it does.
Since the result is now a list of strings, you might want to merge it with str.join().
Here I used another list to store the desired lines; you can also use the "more pythonic" filter() or a list comprehension (example in the similar question I mention below).
f = open('clippings_export.txt','r', encoding='utf-8')
lines = f.readlines()
f.close()
filtered_lines = []
for line in lines:
    if "loc." in line:
        continue
    else:
        filtered_lines.append(line)
result = ""
result = result.join(filtered_lines)
By the way, I thought it might be a duplicate - Here's question about the opposite (that is wanting lines which contain the key).
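The list-comprehension variant mentioned above could look like the following sketch. It works on an in-memory string so it runs without the Kindle export; the `clippings` sample is abbreviated from the question, and the blank line between entries is an assumption based on the asker's description.

```python
# Abbreviated sample; the blank line between entries is assumed.
clippings = (
    "Shall I come to you?\n"
    "Nicholls David, One Day, loc. 876-876\n"
    "\n"
    "Dexter looked up at the window of the flat where Emma used to live.\n"
    "Nicholls David, One Day, loc. 883-884\n"
)

# splitlines(True) keeps the line endings, so join() needs no separator.
kept = [line for line in clippings.splitlines(True) if "loc. " not in line]
result = "".join(kept)
print(result)
```

For a real file, `clippings.splitlines(True)` would be replaced by the list returned from `f.readlines()`.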

Search for every blank line in document, and pop first line after that?

What I am trying to do is go through a document line by line, find each blank line, keep traversing until I hit the next line of text, and pop that line.
So for example, what I want to do is this:
Paragraph 1
This is a line.
This is another line.

Here is a line after a space, which I want to pop!
Here is the next line, which I want to keep.

Here is another line I want to pop.
So it will go through any number of blank lines until it hits the next sentence, pop that sentence only, and then continue on. I am thinking I should use re.split('\n'), but I am not sure.
I am sorry I have no code to post, but I really don't know where to start.
Any help would be much appreciated, thank you!
This is part of a larger program, which I've worked on for days and have figured out up to this point, so I have done the bulk of the work.
If you do for line in filehandle: it will iterate over each line. If you have a flag that is true when the previous line is blank you can skip the next line then reset the flag.
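A sketch of the flag approach just described, assuming "blank" means empty or whitespace-only and that blank lines themselves should be kept. The function name `drop_line_after_blank` and the sample list are made up; any iterable of lines, including an open file, should work the same way.

```python
def drop_line_after_blank(lines):
    """Yield every line except the first non-blank line after a blank one."""
    after_blank = False
    for line in lines:
        if not line.strip():       # blank line: keep it and raise the flag
            after_blank = True
            yield line
        elif after_blank:          # first text line after a blank: skip it
            after_blank = False
        else:
            yield line

document = [
    "Paragraph 1", "This is a line.", "This is another line.", "",
    "Here is a line after a space, which I want to pop!",
    "Here is the next line, which I want to keep.", "",
    "Here is another line I want to pop.",
]
kept = list(drop_line_after_blank(document))
print("\n".join(kept))
```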
The easiest novice solution by far is probably the way Steve suggested: Just iterate the lines, and use a flag to keep track of whether the last line was a blank line.
But if you want a higher-level solution, you need to rethink the problem at a higher level. What you're actually trying to specify is the first line of every paragraph but the first, where "paragraphs" are things divided by empty lines. Right?
So, how could you do that? Well, you can split on '\n\n' just as easily as on \n. So:
paragraphs = document.split('\n\n')
first_lines = [paragraph.partition('\n')[0] for paragraph in paragraphs]
popped_lines = first_lines[1:]
(I used partition instead of split here both because it splits only at the first '\n', leaving the rest alone, and because it handles one-line paragraphs right—which paragraph.split('\n', 1) would not.)
But you don't want a list of the popped lines, you want a list of everything but the popped lines, right?
paragraphs = document.split('\n\n')
first, rest = paragraphs[0], paragraphs[1:]
rest_edited = [paragraph.partition('\n')[2] for paragraph in rest]
And if you want to turn that back into a document:
all_edited = [first] + rest_edited
document_edited = '\n\n'.join(all_edited)
You can shorten that a bit by using slice assignment, although I'm not sure it's quite as readable:
paragraphs = document.split('\n\n')
paragraphs[1:] = [paragraph.partition('\n')[2] for paragraph in paragraphs[1:]]
document_edited = '\n\n'.join(paragraphs)
As J.F. Sebastian points out, the question is a little ambiguous. Does "blank lines" mean "empty lines", or "lines with nothing but whitespace in them"? If it's the latter, things are a bit more complicated, and the easiest solution probably is a simple regex (r'\n\s*\n') for the splitting into paragraphs.
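A sketch of the regex variant for the "lines with nothing but whitespace" reading. The sample `document` is made up. Note that `\s*` also swallows runs of several blank lines, so each run is treated as a single paragraph break and is written back as one empty line.

```python
import re

# Hypothetical sample: one whitespace-only line and one run of two blank lines.
document = "Paragraph 1\nA line.\n \t\nPop me!\nKeep me.\n\n\nPop me too."

# Split wherever two newlines are separated only by whitespace.
paragraphs = re.split(r'\n\s*\n', document)

# Drop the first line of every paragraph except the first one.
paragraphs[1:] = [p.partition('\n')[2] for p in paragraphs[1:]]
document_edited = '\n\n'.join(paragraphs)
print(document_edited)
```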
Meanwhile, if what you have is a sequence of lines (and note that a file is a sequence of lines!) rather than one big string, you can do this without split at all, in a few different ways.
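For example, paragraphs are groups of non-blank lines, right? So you can use the groupby function to get them, keyed on whether each line is empty (this assumes the trailing newlines have already been stripped from the lines):
groups = itertools.groupby(lines, lambda line: not line)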
Or, if "blank" doesn't mean "empty":
groups = itertools.groupby(lines, lambda line: not line.strip())
Note that this gives you (False, <sequence of lines>) for each paragraph, and (True, <sequence of blank lines>) for each blank line. If you want to preserve blank lines as-is, you can—but if you're happy just replacing each run of blank lines with a single empty line (which you obviously are if "blank" does mean "empty"), it's probably easier to throw away the blank paragraphs:
paragraphs = [list(group) for (key, group) in groups if not key] # lists, because groupby discards a group's items once it moves on
Then you can remove the first element from all but the first group, and finally chain the groups back together into one big sequence:
first, rest = paragraphs[0], paragraphs[1:]
edited_paragraphs = (paragraph[1:] for paragraph in rest)
edited_document = itertools.chain(first, *edited_paragraphs)
Finally, what if you have runs of multiple blank lines in a row? Well, first you have to decide how to deal with them. If you have two blank lines, do you remove the second? If so, do you remove the first line of the next paragraph (because it was originally after a blank line), or not (because the blank line it was after was already removed)? What if you have three in a row? Splitting on '\n\n' will do one thing, splitting on '\n\s*\n' a different thing, and groupby yet another… but until you know what you want, it's impossible to say which is "right" or how to "fix" the others, of course.
I assume the original poster (OP) wants to remove those lines in place, meaning removing them from the file itself. Here is a revised solution (my previous solution was off the mark; thank you, J.F. Sebastian, for telling me).
import fileinput
def remove_line_after_blank(filename, in_place_edit=False):
    previous_line = ''
    for line in fileinput.input(filename, inplace=in_place_edit):
        # with inplace=True, printed output is written back to the file
        if not (previous_line == '\n' and line != '\n'):
            print(line.rstrip())
        previous_line = line
if __name__ == '__main__':
    remove_line_after_blank('data.txt', in_place_edit=True)
Discussion
If you do not want to modify the original data file, remove , in_place_edit=True.
Use re.findall to match all occurrences in a string:
>>> import re
>>> text = """Paragraph 1
This is a line.
This is another line.

Here is a line after a space, which I want to pop!
Here is the next line, which I want to keep.

Here is another line I want to pop."""
>>> re.findall("\n\n+(.+)", text)
['Here is a line after a space, which I want to pop!', 'Here is another line I want to pop.']
>>> re.findall("\n\n+(.+)$", text, re.MULTILINE)
['Here is a line after a space, which I want to pop!', 'Here is another line I want to pop.']
The easiest way would be to split the text on newlines:
lines = your_string.split("\n")
That would break it up into an array (stored in lines), where each element of the array is a separate line of text. (As noted in the comments, if you have a file object already, you can just loop through that.)
Then you could go through each line of lines, checking for an empty string (a blank line). If you find one, you could "pop" out the next one. (I don't know what you mean by pop, so I just have the code printing out the lines you want.)
print_next_line = False
for line in lines:
    if print_next_line:
        print(line)
        print_next_line = False
    if line == "":
        print_next_line = True

Removing selected characters from text file

I have a long text file where each line looks something like /MM0001 (Table(12,)) or /MM0015 (Table(11,)). I want to keep only the four-digit number next to /MM. If it weren't for the "Table(12,)" part I could just strip all the non-numeric characters, but I don't know how to extract the four-digit numbers only. Any advice on getting started?
If it's exactly that format, you could just print out line[3:7]
You could parse text line by line and then use 4th to 7th char of every line.
ln[3:7]
import re
R=re.compile(r'/MM(\d+)')
for line in file:
    L=R.match(line)
    if L:
        print(L.group(1))
or, more succinctly...
lines=[R.match(line).group(1) for line in file] # works only if every line is guaranteed to start with /MM
This should give you only the integers following a /MM and should work no matter how long the strings of integers are. If they're guaranteed to be a certain length, then you're better off with one of the other examples (which don't use regex).
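A sketch of a slightly more defensive variant: it uses `search()` so the marker may appear anywhere in the line, requires exactly four digits, and skips lines without a match instead of raising `AttributeError`. The names `MM_NUMBER` and `extract_numbers` and the sample list are assumptions for illustration.

```python
import re

# Exactly four digits directly after the /MM marker.
MM_NUMBER = re.compile(r'/MM(\d{4})')

def extract_numbers(lines):
    """Collect the four-digit number following /MM from each matching line."""
    numbers = []
    for line in lines:
        match = MM_NUMBER.search(line)
        if match:                      # skip lines without a /MM number
            numbers.append(match.group(1))
    return numbers

sample = ["/MM0001 (Table(12,))", "/MM0015 (Table(11,))", "no number here"]
print(extract_numbers(sample))
```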
If each line starts with /MM, then just go through the file and print out line[3:7], e.g.
for line in file:
    print(line[3:7])
