This is the code I have, but it prints the whole paragraph. How can I print only the first sentence, up to the first dot?
from bs4 import BeautifulSoup
import urllib.request,time
article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html,'lxml')
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())
This code prints:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial. The brain is the only kind of object
capable of understanding that the cosmos is even there, or why there
are infinitely many prime numbers, or that apples fall because of the
curvature of space-time, or that obeying its own inborn instincts can
be morally wrong, or that it itself exists. Nor are its unique
abilities confined to such cerebral matters. The cold, physical fact
is that it is the only kind of object that can propel itself into
space and back without harm, or predict and prevent a meteor strike on
itself, or cool objects to a billionth of a degree above absolute
zero, or detect others of its kind across galactic distances.
BUT I ONLY want it to print:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial.
Thanks for the help.
Split the text on that dot; for a single split, using str.partition() is faster than str.split() with a limit:
text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
If you only need to process the first <p> element, use soup.find() instead:
text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
For your given URL, however, the sample text is found as the second paragraph:
>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
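Putting it together, a revised print_intro() might look like this (a minimal sketch, assuming the article text really is in the second <p>, as shown above):

def print_intro():
    text = soup.find_all('p')[1].get_text()   # second <p> holds the article body here
    if len(text) > 100:
        text = text.partition('.')[0] + '.'   # keep only the first sentence
    print(text)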
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        paragraph = soup.find_all('p')[0].get_text()
        phrase_list = paragraph.split('.')
        print(phrase_list[0])
Split the paragraph at the first period. The argument 1 specifies the maxsplit and saves you from unnecessary extra splitting.
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        my_paragraph = soup.find_all('p')[0].get_text()
        my_list = my_paragraph.split('.', 1)
        print(my_list[0])
You can use find('.'); it returns the index of the first occurrence of what you're looking for.
So if the paragraph is stored in a variable called paragraph
sentence_index = paragraph.find('.')
sentence_index += 1  # add 1 so the slice includes the '.'
print(paragraph[0:sentence_index])
Obviously the error handling is missing here, such as checking whether the string in the paragraph variable actually contains a '.'. In any case, find() returns -1 if it does not find the substring you're looking for.
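Putting that together with the missing check, a small sketch using the paragraph variable from above:

sentence_index = paragraph.find('.')
if sentence_index == -1:                    # no '.' found, keep the whole paragraph
    print(paragraph)
else:
    print(paragraph[0:sentence_index + 1])  # + 1 so the slice includes the '.'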
Related
How do I separate words from a text file into single letters?
I'm given a text for which I have to calculate the frequency of the letters. However, I can't seem to figure out how to separate the words into single letters so I can count the unique elements and, from there, determine their frequency.
I apologize for not having the text in a text file, but the following is the text I'm given:
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book,' thought alice without pictures or conversation?'
so she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy- chain would be worth the trouble of getting up and picking the daisies, when suddenly a white rabbit with pink eyes ran close by her.
there was nothing so very remarkable in that; nor did alice think it so very much out of the way to hear the rabbit say to itself, `oh dear! oh dear! i shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the rabbit actually took a watch out of its waistcoat- pocket, and looked at it, and then hurried on, alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
in another moment down went alice after it, never once considering how in the world she was to get out again.
the rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
I'm supposed to separate the text into 26 variables a-z and then determine their frequencies (the expected frequencies were given in the assignment).
I tried making the following code so far:
# Check where the file you are currently working in is saved.
import os
os.getcwd()
#print(os.getcwd())
# 1. Change the current working directory to the place where you have saved the file.
os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles')
os.getcwd()
#print(os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles'))
# 2. Listing the content of current working directory type
os.listdir(os.getcwd())
#print(os.listdir(os.getcwd()))
#importing the file
filein = open("small_text.txt", "r") #opens the file for reading
lines = filein.readlines() #reads all lines into an array
smalltxt = "".join(lines) #Joins the lines into one big string.
import numpy as np
def letterFrequency(filename):
    # counts the frequency of letters in a text
    unique_elems, counts = np.unique(separate_words, return_counts=True)
    return unique_elems
I just don't know how to separate the letters in the text, so I can count the unique elements.
You can use collections.Counter to get your frequencies directly from the text.
Then just select the 26 keys you are interested in, because the counter will also include whitespace and other characters.
from collections import Counter
[...]
with open("small_text.txt", "r") as file:
text = file.read()
keys = "abcdefghijklmnopqrstuvwxyz"
c = Counter(text.lower())
# initialize occurrence with zeros to have all keys present.
occurrence = dict.fromkeys(keys, 0)
occurrence.update({k:v for k,v in c.items() if k in keys})
total = sum(occurrence.values())
frequency = {k:v/total for k,v in occurrence.items()}
[...]
To handle upper case, str.lower might be useful as well.
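As a small usage sketch, assuming the snippet above has already built frequency, you could then print the letters ordered by how often they occur:

for letter, freq in sorted(frequency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{letter}: {freq:.2%}")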
"how I separate the words into single letters" since you want to calculate the count of the characters you can implement python counter in collections.
For example
import collections
import pprint
...
...
file_input = input('File_Name: ')
with open(file_input, 'r') as info:
    count = collections.Counter(info.read().upper())  # read the file and count characters
value = pprint.pformat(count)
print(value)
...
...
Reading your file this way will output the count of the characters present.
I'm writing code for Tacotron 2 that gets transcripts from YouTube and formats them into a file. Unfortunately, the data it receives from YT doesn't specify where sentences end. I tried adding a full stop at the end of each line, but most lines aren't full sentences. So how can I make it add full stops only at the end of a sentence? The only other data it receives is timestamps.
# Batch file for Tacotron 2
from youtube_transcript_api import YouTubeTranscriptApi
transcript_txt = YouTubeTranscriptApi.get_transcript('DY0ekRZKtm4')
def write_transcript():
    with open('transcript.txt', 'a+') as transcript_object:
        transcript_object.seek(0)
        subtitles = transcript_object.read(100)
        if len(subtitles) > 0:
            transcript_object.write('\n')
        for i in transcript_txt:
            ii = i['text']
            if ii[-1] != '.':
                iii = ii + '.'
            else:
                iii = ii
            print(iii)
            transcript_object.write(iii + '\n')
        transcript_object.close()
write_transcript()
Here's an example:
What it saves:
sometimes it was possible to completely.
fall.
out of the world if the lag was bad.
enough.
What I want:
sometimes it was possible to completely
fall
out of the world if the lag was bad
enough.
There is no easy solution. The lowest-effort approach I can think of is to set up spaCy, run nlp over the whole transcript, and hope for the best. It's not trained on data without punctuation, though, so don't expect perfect results, but it will detect some sentence boundaries (based mostly on syntax).
import spacy
nlp = spacy.load('en_core_web_trf')
text = """sometimes it was possible to completely
fall
out of the world if the lag was bad
enough
we solved that by
adding more test data"""
doc = nlp(text)
for s in doc.sents:
    print(f"'{s}'")
Output:
'sometimes it was possible to completely
fall
out of the world if the lag was bad
enough
'
'we solved that by
adding more test data'
So in this case, it worked. Once you have that, you could do some additional processing, add punctuation manually, etc.
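For example, one way to continue from there (a sketch reusing the doc from above) is to collapse each detected sentence onto one line and put a period after it before writing it out:

punctuated = []
for s in doc.sents:
    sent = ' '.join(s.text.split())            # collapse the line breaks inside a sentence
    if not sent.endswith(('.', '!', '?')):
        sent += '.'
    punctuated.append(sent)

print('\n'.join(punctuated))
# sometimes it was possible to completely fall out of the world if the lag was bad enough.
# we solved that by adding more test data.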
So I am using a Magtek USB reader that reads card information. As of right now, when I swipe a card, I get a long string of information that goes into a Tkinter Entry textbox and looks like this:
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
All of the data has been randomized, but that's the format.
I've got a tkinter button (it gets the text from the entry box in the format I included above and runs this):
def printCD(self):
    print(self.carddata.get())
    self.card_data_get = self.carddata.get()
    self.creditnumber = self.card_data_get[self.card_data_get.find("B")+1:
                                           self.card_data_get.find("^")]
    print(self.creditnumber)
    print(self.card_data_get.count("^"))
This outputs:
%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?
8954756016548963
This yields no issues, but if I wanted to get the next two variables, firstname and lastname, I would need to reuse .find("^"), because in the format it appears both before LAST and after INITIAL.
So far, when I've tried to do this, I haven't been able to reuse "^".
Any takers on how I can split that string of text up into individual variables:
Card Number
First Name
Last Name
Expiration Date
Regex will work for this. I didn't capture everything because you didn't detail what's what, but here's an example of capturing the name:
import re
data = "%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?"
matches = re.search(r"\^(?P<name>.+)\^", data)
print(matches.group('name'))
# LAST/FIRST INITIAL
If you aren't familiar with regex, here's a way of testing pattern matching: https://regex101.com/r/lAARCP/1 and an intro tutorial: https://regexone.com/
But basically, I'm searching for one or more of anything (.+) between two carets (^).
Actually, since you mentioned having first and last separate, you'd use this regex:
\^(?P<last>.+)/(?P<first>.+)\^
This question may also interest you regarding finding something twice: Finding multiple occurrences of a string within a string in Python
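If you want all four fields at once, you can extend the same idea. Here is a sketch that assumes your sample follows the usual track-1 layout (card number after %B, then LAST/FIRST, then a four-digit expiration right after the second caret, which looks like YYMM here); check that against your reader's documentation:

import re

data = "%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?"

pattern = re.compile(
    r"%B(?P<number>\d+)"                    # card number after %B
    r"\^(?P<last>[^/]+)/(?P<first>[^^]+)"   # LAST / FIRST up to the next caret
    r"\^(?P<exp>\d{4})"                     # assumed expiration (YYMM) after the second caret
)

m = pattern.search(data)
if m:
    print(m.group('number'))   # 8954756016548963
    print(m.group('last'))     # LAST
    print(m.group('first'))    # FIRST INITIAL
    print(m.group('exp'))      # 1809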
If you find regex difficult you can divide the problem into smaller pieces and attack one at a time:
data = '%B8954756016548963^LAST/FIRST INITIAL^180912345678912345678901234?;8954756016548963=180912345678912345678901234?'
pieces = data.split('^') # Divide in pieces, one of which contains name
for piece in pieces:
    if '/' in piece:
        last, the_rest = piece.split('/')
        first, initial = the_rest.split()
        print('Name:', first, initial, last)
    elif piece.startswith('%B'):
        print('Card no:', piece[2:])
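If you also need the expiration date, the piece after the second '^' appears to start with it (1809 in this sample, presumably YYMM, but that is an assumption about your reader's format):

exp_piece = data.split('^')[2]     # everything after the second '^'
expiration = exp_piece[:4]         # first four digits, e.g. '1809' (assumed YYMM)
print('Expires:', expiration)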
I have a task to search for a group of specific terms (around 138,000 terms) in a table made of 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, where each column might contain more than one term.
I should end up with a csv table with the id where a term has been found and the term itself. What could be the best and the fastest way to do so?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            final_dict.setdefault(id, [])
            if not phrase in final_dict[id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive: it will only work when the terms in tumorabs.txt and neocl.xml are in exactly the same case. If you can't change your data, then make the changes below.
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
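For illustration only, here is a minimal sketch of the lower-casing idea (not the website's actual script); literalhash and the example entries in it are hypothetical, standing in for the coded dictionary built from neocl.xml:

# hypothetical stand-in for the coded dictionary keyed by lowercase terms
literalhash = {"heart disease": "C1", "diabetes mellitus": "C2"}

title = ("A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate "
         "on Ischemic Heart Disease After Drug-eluting Stent Implantation")

words = title.lower().split()                   # lower-case once, up front
found = set()
for start in range(len(words)):
    for end in range(start + 1, len(words) + 1):
        phrase = ' '.join(words[start:end])
        if phrase in literalhash:
            found.add(phrase)

print(found)   # {'heart disease'}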
To clarify the problem: we are running a small scientific project where we need to extract all text parts containing particular keywords. We have used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor, with only a little knowledge of Python.
If you need some more code, I can post it.
I use this code to split data into a list with three sublists, splitting where there is a * or a -. But it also splits at the \n\n * sequences inside the text, and I don't know why. I don't want it to split there; can someone tell me what I'm doing wrong?
This is the data:
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
I use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
but the resulting list is not quite what I want to get.
The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [item.strip() for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [[subitem.replace("\\n", "\n").strip() for subitem in item] for item in newlist]
open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.
This is going to split on any asterisk you have in the text as well.
A better implementation would be to do something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):
        lines.append([line.strip()])    # start a new sublist with this heading line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
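For the sample data above, that loop should produce one sublist per '*' heading, roughly (abbreviated):

[['*Quote of the Day', '-Education is the ability to listen ...', '-Education is what survives ...'],
 ['*Fact of the Day', '-Fractals, an important part ...', '-According to a recent survey ...', '-With the growing concerns ...'],
 ['*Pet of the Day', '-Scottish Terrier', '-Land Shark', '-Hamster', '-Tse Tse Fly']]

The END line starts with neither '*' nor '-', so it is simply skipped.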
The following solves your problem, I believe:
result = [ [subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
for item in open('data.txt').read().split('\n*') ]
# now let's pretty-print the result
for i in result:
    print('***', i[0], '***')
    for j in i[1:]:
        print('\t--', j)
    print()
Note that I split on newline + * or -, so it won't split on dashes inside the text. I also replace the textual character sequence \n\n (r'\n\n') with a real newline character '\n'. The one-liner is a list comprehension, a way to construct lists in one gulp without multiple .append() calls or concatenation.