Python - how to separate paragraphs from text?

I need to separate a text into paragraphs and be able to work with each of them. How can I do that? Between any two paragraphs there can be at least one empty line. Like this:
Hello world,
this is an example.
Let's program something.
Creating new program.
Thanks in advance.

This should work:
text.split('\n\n')

Try
result = list(filter(lambda x : x != '', text.split('\n\n')))
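A quick demonstration of that split-and-filter idea (the sample text below is invented); adding a strip() per piece also cleans up leftover newlines when paragraphs are separated by more than one blank line:

```python
text = "Hello world,\nthis is an example.\n\nLet's program something.\n\n\nCreating new program."

# Split on blank lines, then drop empty pieces and strip leftover newlines
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
print(paragraphs)
# ['Hello world,\nthis is an example.', "Let's program something.", 'Creating new program.']
```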

Not an entirely trivial problem, and the standard library doesn't seem to have any ready solutions.
Paragraphs in your example are separated by at least two newlines, which unfortunately makes a plain text.split("\n\n") insufficient. I think that splitting with regular expressions is a workable strategy instead:
import fileinput
import re

NEWLINES_RE = re.compile(r"\n{2,}")  # two or more "\n" characters

def split_paragraphs(input_text=""):
    no_newlines = input_text.strip("\n")  # remove leading and trailing "\n"
    split_text = NEWLINES_RE.split(no_newlines)  # regex splitting
    # p + "\n" ensures that all lines in the paragraph end with a newline
    # p.strip() is truthy if the paragraph has characters other than whitespace
    paragraphs = [p + "\n" for p in split_text if p.strip()]
    return paragraphs

# sample code, to split all script input files into paragraphs
text = "".join(fileinput.input())
for paragraph in split_paragraphs(text):
    print(f"<<{paragraph}>>\n")
Edited to add:
It is probably cleaner to use a state machine approach. Here's a fairly simple example using a generator function, which has the added benefit of streaming through the input one line at a time, and not storing complete copies of the input in memory:
import fileinput

def split_paragraph2(input_lines):
    paragraph = []  # store current paragraph as a list
    for line in input_lines:
        if line.strip():  # True if line is non-empty (apart from whitespace)
            paragraph.append(line)
        elif paragraph:  # on an empty line, yield the current paragraph (if any)
            yield "".join(paragraph)
            paragraph = []
    if paragraph:  # after end of input, yield the final paragraph (if any)
        yield "".join(paragraph)

# sample code, to split all script input files into paragraphs
for paragraph in split_paragraph2(fileinput.input()):
    print(f"<<{paragraph}>>\n")

I usually split then filter out the '' and strip. ;)
a = '''
Hello world,
this is an example.
Let's program something.
Creating new program.
'''
data = [content.strip() for content in a.splitlines() if content]
print(data)

This worked for me:
text = "".join(text.splitlines())
text.split('something that is almost always used to separate sentences (i.e. a period, question mark, etc.)')
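For instance, a sketch of that idea using re.split on common sentence-ending punctuation (the sample string is made up):

```python
import re

text = "Hello world, this is an example. Let's program something! Creating a new program?"
# Split on '.', '!' or '?' and drop empty/whitespace-only pieces
sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
print(sentences)
# ['Hello world, this is an example', "Let's program something", 'Creating a new program']
```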

Easier. I had the same problem.
Just replace the "\n\n" separator with a character that you seldom see in the text (here ¾):
a ='''
Hello world,
this is an example.
Let's program something.
Creating new program.'''
a = a.replace("\n\n" , "¾")
splitted_text = a.split('¾')
print(splitted_text)

Related

Want to fetch the spacing between the words line by line from a PDF using python

I want to implement a code that can perform one simple task: Fetch the spacing between the words (line by line). The user input should be a PDF from which the lines should be recognized by the code. The PDF can contain different kinds of spacing and patterns.
There is the usage of isspace() in Python, but I don't think that would work in this scenario. Any kind of help would be very much appreciated.
Generally it will not be easy, as there is no single answer. Look at this page saved as a PDF: the gap between letters is not a fixed value; this is called kerning.
Each font letter is effectively standalone, so the last letter of one word can be at any spacing from the start of the next word. Usually the font metrics are needed: non-proportional letters one inch wide would sit at one-inch intervals, but "void" needs to be a little more than one inch apart to account for the word space. And then again, it may be kerned to a different value. With kerning, justification and obliques, the spacing takes on much more complex values, such that you will often see unsuitable spaces.
Basically, every word space can be different on every page and every line of a page, unlike here in HTML.
So, after a week my friend and I managed to solve the problem in a way that gets the job done, though not perfectly. If anyone finds this problem interesting, I'm sharing the code. Open to any suggestions. Thank you.
import re
import pdftotext
from glob import glob

st = glob('Tampered.pdf')
for i in st:
    with open(i, "rb") as f:  # note: the original had open(1, ...), a typo for the filename variable
        pdf = pdftotext.PDF(f)
ls = []
text = ""
for j in range(len(pdf)):
    ls.append(pdf[j])
text = text.join(ls)
text = re.sub('Page [0-9]*', '', text)
text = re.sub(r'(\r\n)+|\r+|\n+|\t+', '', text)
text = re.sub('TAMPERED.*', '', text)
# text = re.sub(' +', '', text)
text = text.strip()

def Spaces(input_list):
    s = [i for i in input_list if i != '']
    s_1 = s[:]
    s = [input_list.index(s[j]) for j in range(len(s))]
    print('Spaces between :- ')
    for i in range(len(s)):
        if i + 1 < len(s):
            print(f"\t'{s_1[i]}' and '{s_1[i+1]}' : {s[i+1] - s[i]}")

input_list = text.split(" ")
Spaces(input_list)

Using a keyword to print a sentence in Python

Hello, I am writing a Python program that reads through a given .txt file and looks for keywords. Once I have found my keyword (for example 'data'), I would like to print out the entire sentence the word is associated with.
I have read in my input file and used the split() method to get rid of spaces, tabs and newlines, and put all the words into an array.
Here is the code I have thus far.
text_file = open("file.txt", "r")
lines = text_file.read().split()
keyword = 'data'
for token in lines:
    if token == keyword:
        # I have found my keyword; what methods can I use to
        # print out the words before and after the keyword?
        # I have a feeling I want to use '.' as a marker for sentences
        print(sentence)  # prints the entire sentence
file.txt reads as follows:
Welcome to SOF! This website securely stores data for the user.
desired output:
This website securely stores data for the user.
We can just split the text on characters that represent line endings, then loop through those lines and print the ones that contain our keyword.
To split the text on multiple characters (for example, a line ending can be marked with !, ? or .) we can use a regex:
import re

keyword = "data"
line_end_chars = "!", "?", "."
example = "Welcome to SOF! This website securely stores data for the user?"
regexPattern = '|'.join(map(re.escape, line_end_chars))
line_list = re.split(regexPattern, example)
# line_list looks like this:
# ['Welcome to SOF', ' This website securely stores data for the user', '']
# Now we just need to see which lines have our keyword
for line in line_list:
    if keyword in line:
        print(line)
But keep in mind that if keyword in line: matches a sequence of characters, not necessarily a whole word - for example, 'data' in 'datamine' is True. If you only want to match whole words, you ought to use regular expressions:
source explanation with example
Source for regex delimiters
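A minimal sketch of the whole-word matching that the quote recommends, using \b word boundaries (the sample strings are invented):

```python
import re

keyword = "data"
# \b marks a word boundary, so "datamine" will not match
pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")

print(bool(pattern.search("This website securely stores data for the user")))  # True
print(bool(pattern.search("This site runs a datamine")))                       # False
```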
My approach is similar to Alberto Poljak's but a little more explicit.
The motivation is to realise that splitting on words is unnecessary - Python's in operator will happily find a word in a sentence. What is necessary is the splitting of sentences. Unfortunately, sentences can end with ., ? or ! and Python's split function does not allow multiple separators. So we have to get a little complicated and use re.
re requires us to put a | between each delimiter and escape some of them, because both . and ? have special meanings by default. Alberto's solution used re itself to do all this, which is definitely the way to go. But if you're new to re, my hard-coded version might be clearer.
The other addition I made was to put each sentence's trailing delimiter back on the sentence it belongs to. To do this I wrapped the delimiters in (), which captures them in the output. I then used zip to put them back on the sentence they came from. The 0::2 and 1::2 slices will take every even index (the sentences) and concatenate them with every odd index (the delimiters). Uncomment the print statement to see what's happening.
import re

lines = "Welcome to SOF! This website securely stores data for the user. Another sentence."
keyword = "data"
sentences = re.split(r'(\.|!|\?)', lines)
sentences_terminated = [a + b for a, b in zip(sentences[0::2], sentences[1::2])]
# print(sentences_terminated)
for sentence in sentences_terminated:
    if keyword in sentence:
        print(sentence)
        break
Output:
This website securely stores data for the user.
This solution uses a fairly simple regex in order to find your keyword in a sentence, with words that may or may not be before and after it, and a final period character. It works well with spaces and it's only one execution of re.search().
import re
text_file = open("file.txt", "r")
text = text_file.read()
keyword = 'data'
match = re.search(r"\s?(\w+\s)*" + keyword + r"\s?(\w+\s?)*\.", text)
print(match.group().strip())
Another Solution:
def check_for_stop_punctuation(token):
    stop_punctuation = ['.', '?', '!']
    for i in range(len(stop_punctuation)):
        if token.find(stop_punctuation[i]) > -1:
            return True
    return False

text_file = open("file.txt", "r")
lines = text_file.read().split()
keyword = 'data'
sentence = []
stop_punctuation = ['.', '?', '!']
i = 0
while i < len(lines):
    token = lines[i]
    sentence.append(token)
    if token == keyword:
        found_stop_punctuation = check_for_stop_punctuation(token)
        while not found_stop_punctuation:
            i += 1
            token = lines[i]
            sentence.append(token)
            found_stop_punctuation = check_for_stop_punctuation(token)
        print(sentence)
        sentence = []
    elif check_for_stop_punctuation(token):
        sentence = []
    i += 1

Python string split and do not use middle part

I am reading a file in my Python script which looks like this:
#im a useless comment
this is important
I wrote a script to read and split the "this is important" part and ignore the comment lines that start with #.
I only need the first and the last word (In my case "this" and "important").
Is there a way to tell Python that I don't need certain parts of a split?
In my example I have what I want and it works.
However, if the string were longer and I had, say, 10 unused variables, I guess that is not how programmers would do it.
Here is my code:
#!/usr/bin/python3
import re
filehandle = open("file")
for line in filehandle:
    if re.search("#", line):
        continue  # skip comment lines
    else:
        a, b, c = line.split(" ")
        print(a)
        print(c)
filehandle.close()
Another possibility would be:
a, *_, b = line.split()
print(a, b)
# <a> <b>
Note that starred unpacking like *_ is not backwards compatible with Python 2; it was added in Python 3.0 by PEP 3132 (extended iterable unpacking).
On line 8, instead of
a, b, c = line.split(" ")
use:
splitLines = line.split(" ")
a, b, c = splitLines[0], splitLines[1:-1], splitLines[-1]
Negative indexing in Python counts from the end. More info
I think python negative indexing can solve your problem
import re

filehandle = open("file")
for line in filehandle:
    if re.search("#", line):
        continue  # skip comment lines
    else:
        split_word = line.split()
        print(split_word[0])   # first word
        print(split_word[-1])  # last word
filehandle.close()
Read more about Python Negative Index
You can save the result to a list, and get the first and last elements:
res = line.split(" ")
# res[0] and res[-1]
If you want every 3rd element, you can use:
res[::3]
Otherwise, if you don't have a specific pattern, you'll need to manually extract elements by their index.
See the split documentation for more details.
If I've understood your question, you can try this:
s = "this is a very very very veeeery foo bar bazzed looong string"
splitted = s.split() # splitted is a list
splitted[0] # first element
splitted[-1] # last element
str.split() returns a list of the words in the string, using sep as the delimiter string. ... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
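The difference is easy to see on a string with runs of whitespace (example string invented):

```python
s = "  this \t is   important  "

print(s.split())     # ['this', 'is', 'important'] - whitespace runs collapse, no empty strings
print(s.split(" "))  # keeps an empty string for every extra single-space boundary
```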
In that way you can get the first and the last words of your string.
For multiline text (with re.search() function):
import re

with open('yourfile.txt', 'r') as f:
    result = re.search(r'^(\w+).+?(\w+)$', f.read(), re.M)
    a, b = result.group(1), result.group(2)
    print(a, b)
The output:
this important

Parsing a huge dictionary file with python. Simple task I can't get my head around

I just got a giant 1.4M-line dictionary for other programming uses, and I'm sad to see Notepad++ is not powerful enough to do the parsing job for this problem. The dictionary contains three types of lines:
<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>
and I want to extract every word from it into a list of words without duplicates. Let's start with my code.
f = open('dic.txt')
p = open('parsed_dic.txt', 'r+')
lines = f.readlines()
for line in lines:
    # <ar><k> lines
    # <kref> lines
    # lines ending with ";"
    for word in listofwordsfromaline:
        p.write(word + "\n")
f.close()
p.close()
I'm not particularly asking you how to do this whole thing, but anything would be helpful. A link to a tutorial or a parsing method for one type of line would be highly appreciated.
For the first two cases you can see that every word starts and ends with a specific tag; if we look closely, every word must have a ">-" string preceding it and a "</" string following it:
# First and second cases
start = line.find(">-") + 1  # +1 keeps the leading "-" of the word
end = line.find("</")
required_word = line[start:end]
In the last case you can use the split method:
word_list = line.split(";")
ans = []
for word in word_list:
    start = word.find("-")
    ans.append(word[start:])
ans = set(ans)
First find what defines a word for you.
Make a regular expression to capture those matches. For example, the word-boundary escape '\b' matches word boundaries (transitions between word and non-word characters).
https://docs.python.org/2/howto/regex.html
If the word definition in each type of line is different - then if statements to match the line first, then corresponding regular expression match for the word, and so on.
Match groups in Python
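As a rough sketch of that suggestion, applied to the first sample line from the question (treating ar, k and kref as markup tag names to discard is an assumption about the format):

```python
import re

line = "<ar><k>-aaltoiseen</k>"
tags = {"ar", "k", "kref"}  # assumed markup tag names, not dictionary words

# \w+ grabs runs of word characters; the optional leading "-" keeps hyphenated prefixes intact
words = {w for w in re.findall(r"-?\w+", line) if w not in tags}
print(words)
# {'-aaltoiseen'}
```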

Comparing multiple strings

Hey, I am new and I need some help with comparing strings.
My assignment is to make a chatbot, one that reads from a text file containing possible things to input and what the resulting output will be.
My problem is that it asks to choose the most suited entry from the text file, easy yeah? But you also must save variables at the same time.
Ok an example is one of the lines of the rules is:
you <w1> <w2> <w3> <w4> me | What makes you think I <w1> <w2> <w3> <w4> you?
You must save the <w1> and so on to a variable.
AND the input can be like "did you know that you are really nice to me", so you have to adjust the code for that as well.
Also, we can't make the code just for this text file; it is supposed to adjust to anything that is put into the text file.
Can someone help me ?
This is what I'm up to:
import string
import sys
import difflib

# File path:
rules = open("rules.txt", "rU")
# Set some var's:
currentField = 0
fieldEnd = 0
questions = []
responses = []
Input = ""
run = True
# Check if we are not at the end of the file:
for line in rules:
    linem = line.split(" | ")
    question = linem[0]
    response = linem[1]
    questions.append(question.replace("\n", ""))
    responses.append(response.replace("\n", ""))
print questions
print responses
for q in questions:
    qwords.appendq.split()
while run = True:
    Input = raw_input('> ').capitalize()
    for char in Input:
        for quest in questions:
            if char in quest:
                n += 1
            else:
                if "<" in i:
                    n += 1
    closestQuestion = questions.index(q)
    print response
I would prefer pyparsing over any regex-based approach to tackle this task. It's easier to construct a readable parser even for more involved and complex grammars.
As a quick-and-stupid solution, parse the input file and store the entries in a list. Each entry should contain a dynamically-compiled "matching regex" (e.g. r'(?i)you (\w+) (\w+) (\w+) (\w+) me') and a "replacement string" (e.g. r'What makes you think I \1 \2 \3 \4 you?'). For each incoming request, the chat bot should match the text against the regex list, find the appropriate entry, and then call regex.sub() with the "replacement string".
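A minimal sketch of that approach, hard-coding the single rule from the question (a real bot would compile these pairs from the rules file):

```python
import re

# Rule "you <w1> <w2> <w3> <w4> me | What makes you think I <w1> <w2> <w3> <w4> you?"
# compiled by hand: each <wN> becomes a capture group, reused as \1..\4
pattern = re.compile(r"(?i)you (\w+) (\w+) (\w+) (\w+) me")
replacement = r"What makes you think I \1 \2 \3 \4 you?"

user_input = "did you know that you are really nice to me"
match = pattern.search(user_input)
if match:
    print(pattern.sub(replacement, match.group(0)))
# What makes you think I are really nice to you?
```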
But first of all, read some beginner's tutorial on Python. Your code is un-pythonic and just wrong in many ways.
