I use this code to split some data into a list with three sublists,
splitting wherever there is a * or a -. But it also picks up the \n\n * sequences inside the text, and I don't know why.
I don't want those to be treated as separators. Can someone tell me what I'm doing wrong?
This is the data:
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
I use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
This gives a list that is close to what I need, but it is wrong.
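The "\n\n" is part of the input data, so it is preserved in Python. Add a strip() to remove it. Because newlist is a list of lists, strip each string inside each sublist:
finallist = [[part.strip() for part in item] for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT: if the file contains the literal two-character sequence \n rather than real newlines, convert it first:
finallist = [[part.replace("\\n", "\n").strip() for part in item] for item in newlist]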
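The difference between a real newline and the literal two-character sequence (backslash followed by n) is easy to miss. A minimal sketch, using a made-up sample string rather than the real file, that shows why strip() alone may appear to do nothing:

```python
# Hypothetical sample: the text holds a literal backslash followed by "n",
# not a real newline character (note the raw-string prefix).
raw = r"-Of the people polled:\n\n * 18% thought haggis was an animal.\n\n"

# strip() only removes real whitespace, so the literal sequences survive.
print(raw.strip() == raw)

# Turn the literal sequences into real newlines first, then strip.
cleaned = raw.replace("\\n", "\n").strip()
print(repr(cleaned))
```

If the file already contains real newlines, the replace() call simply finds nothing to change.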
open("data.dat").read() reads every character in the file, not only the ones you want.
If you don't need '\n' you can try contents.replace("\n", ""), or read the file line by line (not the whole content) and strip the trailing '\n' from each line.
This is going to split on any asterisk you have in the text as well.
A better implementation would be something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):
        lines.append([line.strip()])  # start a new sublist with the header line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())  # add the item to the current sublist
For more homework, research what's happening when you use the open() function in this way.
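As a hedged illustration of the line-by-line approach: iterating over a file object yields one line at a time, so the grouping logic can be written against any iterable of lines. The function name group_sections and the sample lines are assumptions made for this sketch, not part of the original code.

```python
def group_sections(lines):
    """Group lines into [header, item, item, ...] lists.

    A line starting with "*" opens a new section; a line starting with "-"
    is added to the current section. Other lines (such as "END") are skipped.
    """
    sections = []
    for line in lines:
        text = line.strip()
        if text.startswith("*"):
            sections.append([text[1:].strip()])
        elif text.startswith("-") and sections:
            # Convert literal "\n" sequences in the data to real newlines.
            sections[-1].append(text[1:].replace("\\n", "\n").strip())
    return sections


sample = [
    "*Pet of the Day\n",
    "-Scottish Terrier\n",
    "-Land Shark\n",
    "END\n",
]
print(group_sections(sample))
# A file object can be passed the same way:
#     with open("data.dat") as handle:
#         sections = group_sections(handle)
```

Using a with block also closes the file when the loop finishes, which a bare open() call inside a for statement does not guarantee.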
The following solves your problem, I believe:
result = [[subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
          for item in open('data.dat').read().split('\n*')]
# now let's pretty print the result
for i in result:
    print('***', i[0], '***')
    for j in i[1:]:
        print('\t--', j)
    print()
Note that I split on newline + * or -, so it won't split on dashes inside the text. I also replace the literal character sequence backslash-n backslash-n (r'\n\n') with a newline character '\n'. The one-liner expression is a list comprehension, a way to construct a list in one go, without multiple .append() or + calls.
Related
I have a question about input
description = input('add description: ')
I'm adding a text using Ctrl+C and Ctrl+V.
For example:
"The short story is a crafted form in its own right. Short stories
make use of plot, resonance, and other dynamic components as in a
novel, but typically to a lesser degree. While the short story is
largely distinct from the novel or novella/short novel, authors
generally draw from a common pool of literary techniques.
Determining what exactly separates a short story from longer fictional
formats is problematic. A classic definition of a short story is that
one should be able to read it in one sitting, a point most notably
made in Edgar Allan Poe's essay "The Philosophy of Composition"
(1846)"
Result is:
description = "The short story is a crafted form in its own right. Short stories make use of plot, resonance, and other dynamic components as in a novel, but typically to a lesser degree. While the short story is largely distinct from the novel or novella/short novel, authors generally draw from a common pool of literary techniques."
Whereas I want description to hold the entire text I copied.
Normally the input() function stops reading at the end of a line (\n). I would suggest using a setup like this:
lines = []
while True:
    line = input()
    if line == "EOF":
        break
    else:
        lines.append(line)
text = ' '.join(lines)
This reads input and adds each line to a list until you type "EOF" on its own line and hit Enter. This should solve the multi-line problem.
The problem you're facing here is that input() ends as soon as Enter is hit or (in this case) the next line starts. Within a single input() call, the only way to mark a line break is to type \n instead of actually starting a new paragraph; note that this arrives as the two characters backslash and n, so you would still need to convert it to a real newline yourself. If you want to get around this issue, though, I highly recommend learning the Tkinter module, since it is one of the best modules if you want to build any kind of front end for an app. Here is a link to get you started: https://www.tutorialspoint.com/python/python_gui_programming.htm
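The sentinel approach can be wrapped in a small function, which also makes it testable without a keyboard. This is only a sketch: the name read_until_sentinel and its parameters are assumptions for illustration, and the paste is simulated with an iterator.

```python
def read_until_sentinel(read_line=input, sentinel="EOF"):
    """Collect lines from read_line() until the sentinel line or end of input."""
    lines = []
    while True:
        try:
            line = read_line()
        except EOFError:
            break  # end of input, e.g. Ctrl+D on Unix or Ctrl+Z then Enter on Windows
        if line == sentinel:
            break
        lines.append(line)
    return " ".join(lines)


# Simulated paste: each call returns the next pasted line.
pasted = iter(["The short story is a crafted form", "in its own right.", "EOF"])
print(read_until_sentinel(read_line=lambda: next(pasted)))
```

Called with no arguments it reads from the keyboard through input(). An alternative that needs no sentinel is sys.stdin.read(), which returns everything typed or pasted up to the end-of-input key combination.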
I'm trying to clean the following data:
from sklearn import datasets
data = datasets.fetch_20newsgroups(categories=['rec.autos', 'rec.sport.baseball', 'soc.religion.christian'])
texts, targets = data['data'], data['target']
Here texts is a list of articles and targets is a vector containing the index of the category to which each article belongs.
I need to clean all articles. The cleaning task means:
Remove headers
Remove punctuation
Remove parentheses
Remove consecutive blank spaces
Remove emails and tokens with length 1
Remove line breaks
I'm quite new at Python but I've tried to remove all punctuation and everything using replace(). However, I think that an easy way to do this task must exist.
def clean_articles (article):
return ' '.join([x for x in article[article.find('\n\n'):].replace('.','').replace('[','')
clean_articles(data['data'][1])
For the following article:
print(data['data'][1])
Uncleaned Article:
'From: aas7#po.CWRU.Edu (Andrew A. Spencer)\nSubject: Re: Too fast\nOrganization: Case Western Reserve University, Cleveland, OH (USA)\nLines: 25\nReply-To: aas7#po.CWRU.Edu (Andrew A. Spencer)\nNNTP-Posting-Host: slc5.ins.cwru.edu\n\n\nIn a previous article, wrat#unisql.UUCP (wharfie) says:\n\n>In article <1qkon8$3re#armory.centerline.com> jimf#centerline.com (Jim Frost) writes:\n>>larger engine. That\'s what the SHO is -- a slightly modified family\n>>sedan with a powerful engine. They didn\'t even bother improving the\n>>brakes.\n>\n>\tThat shows how much you know about anything. The brakes on the\n>SHO are very different - 9 inch (or 9.5? I forget) discs all around,\n>vented in front. The normal Taurus setup is (smaller) discs front, \n>drums rear.\n\none i saw had vented rears too...it was on a lot.\nof course, the sales man was a fool..."titanium wheels"..yeah, right..\nthen later told me they were "magnesium"..more believable, but still\ncrap, since Al is so m uch cheaper, and just as good....\n\n\ni tend to agree, tho that this still doesn\'t take the SHO up to "standard"\nfor running 130 on a regular basis. The brakes should be bigger, like\n11" or so...take a look at the ones on the Corrados.(where they have\nbraking regulations).\n\nDREW\n'
Cleaned Article:
In previous article UUCP wharfie says In article centerline com com Jim Frost writes larger engine That's what the SHO is slightly modified family sedan with powerful engine They didn't even bother improving the *brakes That shows how much you know about anything The brakes on the SHO are very different inch or forget discs all around vented in front The normal Taurus setup is smaller discs front drums rear one saw had vented rears too it was on lot of course the sales man was fool titanium wheels yeah right then later told me they were magnesium more believable but still crap since Al is so uch cheaper and just as good tend to agree tho that this still doesn't take the SHO up to standard for running 130 on regular basis The brakes should be bigger like 11 or so take look at the ones on the Corrados where they have braking regulations DREW
Note: this is not a complete answer, but the following will at least get you halfway there:
remove punctuation
remove line breaks
remove consecutive white space
remove parentheses
import re
s = ';\n(a b.,'
print('before:', s)
s = re.sub(r'[.,;\n(){}\[\]]', '', s)  # raw string avoids invalid-escape warnings
s = re.sub(r'\s+', ' ', s)
print('after:', s)
this will print:
before: ;
(a b.,
after: a b
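Building on the two substitutions above, here is a rough sketch of a fuller cleaning pass. It assumes that the header block ends at the first blank line, that apostrophes should be kept (as in "didn't"), and that "tokens with length 1" means single-character words; it does not try to detect email addresses. The function name clean_article and the sample text are made up for illustration.

```python
import re


def clean_article(article):
    """Rough cleaning pass: drop the header block, punctuation, and extra whitespace."""
    # Headers end at the first blank line; keep only what follows it.
    _, _, body = article.partition("\n\n")
    # Replace anything that is not a word character, whitespace, or an
    # apostrophe with a space (this covers punctuation and parentheses).
    body = re.sub(r"[^\w\s']", " ", body)
    # Collapse runs of whitespace, including line breaks, into single spaces.
    body = re.sub(r"\s+", " ", body).strip()
    # Drop single-character tokens.
    return " ".join(tok for tok in body.split() if len(tok) > 1)


sample = (
    "From: someone (A. Person)\nSubject: Re: Too fast\n\n"
    "The brakes (on the SHO) are\nvery different."
)
print(clean_article(sample))
```

Note that an article without any blank line would come back empty under the header assumption, so real data may need a fallback.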
Here is my pattern:
pattern_1a = re.compile(r"(?:```|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)", flags = re.I|re.S)
Why does it not match text like the following? What's wrong?
"""
Item 1A.
Risk
Factors
If we
are unable to commercialize
ADVEXIN
therapy in various markets for multiple indications,
particularly for the treatment of recurrent head and neck
cancer, our business will be harmed.
under which we may perform research and development services for
them in the future.
42
Table of Contents
We believe the foregoing transactions with insiders were and are
in our best interests and the best interests of our
stockholders. However, the transactions may cause conflicts of
interest with respect to those insiders.
Item 1B.
"""
Here is one solution that will match your actual text. Putting ( ) around your strings will solve a lot of issues. See the solution below.
pattern_1a = re.compile(r"(?:```|\n)(Item 1A)[.\n]{0,50}(Risk Factors)([\n]|.)*(\nItem 1B.)(?!u)", flags = re.I|re.S)
Match evidence:
https://regexr.com/41ejq
The problem is that Risk Factors is spread over two lines. It is actually: Risk\nFactors
Using general whitespace \s or a newline \n instead of a literal space matches the text.
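A short sketch of that fix, keeping the pattern from the question and changing only the space between the two words to \s+. The sample text is a shortened, made-up stand-in for the filing, and the three-backtick alternative is built from a FENCE constant purely for readability here.

```python
import re

FENCE = "`" * 3  # the three-backtick alternative from the original pattern

# Same pattern as in the question, except "Risk Factors" becomes
# "Risk\s+Factors", so the words may be separated by a space or a line break.
pattern_1a = re.compile(
    r"(?:" + FENCE + r"|\n)Item *1A\.?.{0,50}Risk\s+Factors.*?(?:\n)Item *1B(?!u)",
    flags=re.I | re.S,
)

sample = (
    "\nItem 1A.\nRisk\nFactors\nIf we\nare unable to commercialize\n"
    "our business will be harmed.\nItem 1B.\n"
)

match = pattern_1a.search(sample)
print(match is not None)
```

With re.S the dot already crosses line breaks, so only the literal space needed to change.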
This is the code I have, but it prints the whole paragraph. How to print the first sentence only, up to the first dot?
from bs4 import BeautifulSoup
import urllib.request,time
article = 'https://www.theguardian.com/science/2012/\
oct/03/philosophy-artificial-intelligence'
req = urllib.request.Request(article, headers={'User-agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html,'lxml')
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        print(soup.find_all('p')[0].get_text())
This code prints:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial. The brain is the only kind of object
capable of understanding that the cosmos is even there, or why there
are infinitely many prime numbers, or that apples fall because of the
curvature of space-time, or that obeying its own inborn instincts can
be morally wrong, or that it itself exists. Nor are its unique
abilities confined to such cerebral matters. The cold, physical fact
is that it is the only kind of object that can propel itself into
space and back without harm, or predict and prevent a meteor strike on
itself, or cool objects to a billionth of a degree above absolute
zero, or detect others of its kind across galactic distances.
BUT I ONLY want it to print:
To state that the human brain has capabilities that are, in some
respects, far superior to those of all other known objects in the
cosmos would be uncontroversial.
Thanks for help
Split the text on that dot; for a single split, using str.partition() is faster than str.split() with a limit:
text = soup.find_all('p')[0].get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
If you only need to process the first <p> element, use soup.find() instead:
text = soup.find('p').get_text()
if len(text) > 100:
    text = text.partition('.')[0] + '.'
print(text)
For your given URL, however, the sample text is found as the second paragraph:
>>> soup.find_all('p')[1]
<p><span class="drop-cap"><span class="drop-cap__inner">T</span></span>o state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial. The brain is the only kind of object capable of understanding that the cosmos is even there, or why there are infinitely many prime numbers, or that apples fall because of the curvature of space-time, or that obeying its own inborn instincts can be morally wrong, or that it itself exists. Nor are its unique abilities confined to such cerebral matters. The cold, physical fact is that it is the only kind of object that can propel itself into space and back without harm, or predict and prevent a meteor strike on itself, or cool objects to a billionth of a degree above absolute zero, or detect others of its kind across galactic distances.</p>
>>> text = soup.find_all('p')[1].get_text()
>>> text.partition('.')[0] + '.'
'To state that the human brain has capabilities that are, in some respects, far superior to those of all other known objects in the cosmos would be uncontroversial.'
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        paragraph = soup.find_all('p')[0].get_text()
        phrase_list = paragraph.split('.')
        print(phrase_list[0])
Split the paragraph at the first period. The second argument, 1, specifies maxsplit and saves time by avoiding unnecessary extra splitting.
def print_intro():
    if len(soup.find_all('p')[0].get_text()) > 100:
        my_paragraph = soup.find_all('p')[0].get_text()
        my_list = my_paragraph.split('.', 1)
        print(my_list[0])
You can use find('.'); it returns the index of the first occurrence of what you're looking for.
So if the paragraph is stored in a variable called paragraph:
sentence_index = paragraph.find('.')
# add 1 so the slice includes the '.'
sentence_index += 1
print(paragraph[0:sentence_index])
Obviously this is missing the checks, such as whether the string in paragraph contains a '.' at all. Note that find() returns -1 if it does not find the substring you're looking for.
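The missing check can be folded in by using str.partition(), which returns an empty separator when no period is present. A minimal sketch; the function name first_sentence and the sample strings are assumptions for illustration:

```python
def first_sentence(paragraph):
    """Return the text up to and including the first period."""
    head, dot, _ = paragraph.partition(".")
    # partition() returns an empty separator when no period is found,
    # in which case the whole paragraph comes back unchanged.
    return head + dot


print(first_sentence("To state this would be uncontroversial. The brain is unique."))
print(first_sentence("No period here"))
```

Like the other answers, this treats every period as a sentence end, so text with abbreviations such as "B. F. Skinner" would be cut early.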
I have a task to search for a group of specific terms (around 138000 terms) in a table made of 4 columns and 187000 rows. The column headers are id, title, scientific_title and synonyms, and each column might contain more than one term.
I should end up with a csv table with the id where a term has been found and the term itself. What could be the best and the fastest way to do so?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            # use the same key (trial_id) for setdefault and append
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
The code on the website you link to is case-sensitive - it will only work when the terms in tumorabs.txt and neocl.xml are the exact same case. If you can't change your data then change:
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
To clarify the problem: we are running a small scientific project where we need to extract all text parts containing particular keywords. We have used the coded dictionary and the Python script posted on http://www.julesberman.info/coded.htm, but it seems that something is not working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor with only a little knowledge of Python.
If you need more code, I can post it online.
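Judging only from the snippet shown in the question, every phrase is sliced from index 0, so only phrases that begin at the first word of the title are ever tested; that may be one reason "Heart Disease" is missed, alongside the case issue. Below is a hedged sketch of one way to test every contiguous phrase against a lower-cased set of terms. The names find_terms and max_words and the tiny term set are assumptions for illustration, not the code from the linked site.

```python
def find_terms(text, terms, max_words=6):
    """Return the known terms that occur as whole-word phrases in text."""
    words = text.lower().split()
    found = []
    for start in range(len(words)):
        # Try every phrase that begins at this word, up to max_words long.
        for end in range(start + 1, min(start + max_words, len(words)) + 1):
            phrase = " ".join(words[start:end])
            if phrase in terms and phrase not in found:
                found.append(phrase)
    return found


# Hypothetical term set, stored in lower case so lookups ignore case.
terms = {"heart disease", "diabetes mellitus"}
title = (
    "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate "
    "on Ischemic Heart Disease After Drug-eluting Stent Implantation in "
    "Patients With Diabetes Mellitus or Renal Impairment"
)
print(find_terms(title, terms))
```

Keeping the terms in a set makes each lookup constant time on average, so the cost per row depends on the number of words in the row rather than on the 138000 terms. Punctuation handling is left to a preparation step such as the string_preparation function in the question.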