Hello guys, I'm trying to write a program that counts the total words and the total unique words in a file. When I run the two parts of the code together, only the unique-word counter works; when I delete the unique-word part, the total word counter works normally.
Here is the full code:
f = open('icuhistory.txt','r')
wordCount = 0
text = f.read()
for line in f:
    lin = line.rstrip()
    wds = line.split()
    wordCount += len(wds) #this section alone works fine
text = text.lower() #when I start writing this one the first one will stop working
words = text.split()
words = [word.strip('.,!;()[]') for word in words]
words = [word.replace("'s", '') for word in words]
unique = []
for word in words:
    if word not in unique:
        unique.append(word)
unique.sort()
print("number of words: ",wordCount)
print("number of unique words: ",len(unique))
Here are the contents of the file:
in the fall of 1945 just weeks after the end
of world war ii a group of japanese christian educators
initiated a move to establish a university based on christian
principles the foreign missions conference of north america and the
us education mission both visiting japan at the time
gave their wholehearted support conveying this plan to people in
the us amidst the post-war yearning for reconciliation and
world peace americans supported this project with great enthusiasm in
1948 the japan international christian university foundation jicuf was
established in new york to coordinate fund-raising efforts in the
us people in japan also found hope in a
cause dedicated toworld peace organizations firms and individuals made donations
to this ambitious undertaking regardless of their religious orientation anddespite
the often destitute circumstances in the immediate post-war years bank
of japan governor hisato ichimada headed the supporting organization to
lead the national fund raising drive icu has been unique from
its inception with its endowment procured through good will transcending
national borders
on june 15 1949 japanese and north american christian leaders
convened at the gotemba ymca camp to establish international christian
university with the inauguration of the board of trustees and
the board of councillors the founding principles and a fundamental
educational plan were laid down establishing an interdenominational christian university
had been a dream of japanese and american christians for
half a century the gotemba conference had finally realized their
aspirations
in 1950 icu purchased a spacious site in mitaka city
on the outskirts of tokyo with the donations it received
within japan the campus was dedicated on april 29 1952
with the language institute set up in the first year
in march 1953 the japanese ministry of education authorized icu
as an incorporated educational institution the college of liberal arts
opening on april 1 as the first four-year liberal arts
college in japan
the university celebrated its 50th anniversary in 1999 with diverse
events and projects during the commemorative five year period leading to
march 2004 in 2003 the ministry of education culture sports
science and technology selected icu s research and education
for peace security and conviviality for the 21st century center
of excellence program and its liberal arts to nurture
responsible global citizens for the distinctive university education support program
good practice
in 2008 an academic reform was enforced in the college
of liberal arts which replaced the system of six divisions
with a new organization of the division of arts
and sciences and a system of academic majors as of
april 2008 all new students simply start as college of
liberal arts students making their choice of major from 31
areas by the end of their sophomore year students now
have more time to make a decision while they study
diverse subjects through general education and foundation courses mext chose
icu for its fiscal year 2007 distinctive university education support
program educational support for liberal arts to nurture international
learning from academic advising to academic planning in acknowledgement of
the university s efforts for educational improvement in 2010 the
graduate school also conducted a reform and integrated the four
divisions into a new school of arts and sciences
icu is continually working to reconfirm its responsibilities and fulfill
its mission for the changing times
The entire file content appears to be lowercase so it's as easy as this:
result = {}
with open('icuhistory.txt') as icu:
    for word in icu.read().split():
        word = word.strip('.,!;()[]').replace("'s", "")
        result[word] = result.get(word, 0) + 1
print(f'Number of words = {sum(result.values())}')
print(f'Number of unique words = {len(result)}')
Output:
Number of words = 547
Number of unique words = 273
Take a look at the text = f.read() line. Is it at the right place?
Also, the Python script you pasted does not have consistent indenting. Are you able to clean it up so that it looks just like the original?
Also curious if you have explored the set type in Python? It is a little like a list, but you may find it applicable in your scenario.
Explanation:
A file object returned by open() behaves like a stream. If you are more familiar with iterators, think of f = open('icuhistory.txt','r') as an iterator.
You can go through it only once (unless you tell it to reset).
text = f.read()
Will go through it once, then f is at the end of the file.
for line in f:
Now tries to continue where f currently is... at the end of the file.
So this loop will try to loop over the 0 lines left at the end.
As there is nothing left to iterate over it will not enter the for loop.
Solutions:
You could reset it with f.seek(0), which tells the file object to go back to the start of the file.
A better option is either to do both counts inside the one loop (more memory friendly) or to work only with the string returned by text = f.read().
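To make the "only once" behaviour concrete, here is a minimal sketch. It uses io.StringIO as a stand-in for the opened file (an assumption made so the example runs without icuhistory.txt); a real text file object behaves the same way for read(), iteration, and seek(0).

```python
import io

# io.StringIO stands in for open('icuhistory.txt') so the sketch is self-contained.
f = io.StringIO("one two\nthree\n")

text = f.read()               # consumes the whole stream
lines_after_read = list(f)    # nothing left to iterate over

f.seek(0)                     # rewind to the start of the stream
lines_after_seek = list(f)    # the lines are available again

word_count = sum(len(line.split()) for line in lines_after_seek)
print(lines_after_read, lines_after_seek, word_count)
```

The first list is empty for the same reason the for loop in the question never runs: read() has already moved the position to the end.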
There's no need to read line by line, as you are only counting words. Also avoid sorting unless it's needed, as it can be expensive. Converting a list to a set removes duplicates, and you can chain string methods.
with open('icuhistory.txt','r') as f:
    text = f.read().lower()
words = [word.strip('.,!;()[]').replace("'s", '') for word in text.split()]
unique_words = set(words)
print("number of words: ", len(words))
print("number of unique words: ", len(unique_words))
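If the file were too large to read in one go, both counts could instead be collected in a single pass over the lines. The sketch below is one possible way to do that with collections.Counter; the function name count_words and the sample text are made up for illustration, and io.StringIO again stands in for the real file.

```python
from collections import Counter
import io

def count_words(lines):
    """Return (total words, unique words) from an iterable of lines."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            # Same clean-up as above: strip punctuation, drop possessive 's.
            counts[word.strip('.,!;()[]').replace("'s", '')] += 1
    return sum(counts.values()), len(counts)

# Hypothetical sample input; pass an open file object in real use.
sample = io.StringIO("The cat sat.\nThe cat's hat!\n")
total, unique = count_words(sample)
print("number of words: ", total)
print("number of unique words: ", unique)
```

Only one line is held in memory at a time, at the cost of a slightly longer loop.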
This is my first time posting so I apologize if I omit necessary information!
I am trying to extract a paragraph of text in Python that always follows a line starting with "Item 5.02". There is a line space between "Item 5.02" and the paragraph that I am trying to extract. I need the text between the "Item 5.02" line and the next section (in this case the next section starts at "Item 9.01"). Please let me know if I need to clarify anything. I have been tinkering with regular expressions but haven't had much luck. I'm pretty new to them. Thanks for the help!
I would like to extract the following:
On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company. Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP. As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001). A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.
From the below text:
Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangement of Certain Officers.
On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company.
Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP.
As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001).
A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.
Item 9.01 Financial Statements and Exhibits.
You could split it by double newlines, find the piece which contains Item 5.02, then take the next one:
def extractPassage(text):
    lines = text.split("\n\n")
    for i,line in enumerate(lines):
        if line.startswith("Item 5.02"):
            return lines[i+1]
    raise Exception("No line found starting with Item 5.02")
I can't tell from the post formatting if there are any tabs or spaces before Item 5.02 on that line. If so, include them in the startswith call.
To get all text between 5.02 and 9.01, we can append lines to a string, starting after the one starting with 5.02, and ending when we see 9.01:
def extractPassage(text):
    lines = text.split("\n\n")
    output = ""
    for i, line in enumerate(lines):
        if line.startswith("Item 5.02"):
            j = i + 1
            # Collect paragraphs until the next section, or until the text runs out
            # (the bounds check avoids an IndexError when "Item 9.01" is missing).
            while j < len(lines) and not lines[j].startswith("Item 9.01"):
                output += lines[j]
                j += 1
            return output
    raise Exception("No line found starting with Item 5.02")
The following regex will match the word Item followed by a space, one digit, one period, and then two more digits.
import re
re.split(r'Item \d\.\d\d', text)
To explain the regex: \d matches any single digit, and to match a literal period we have to escape it as \. (the pattern is written as a raw string, r'...', so Python leaves the backslashes alone).
If you would rather accept either 1 or 2 digits after the period, you would use the regex r'Item \d\.\d{1,2}'
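Building on that pattern, a single regex can also capture everything between one "Item" heading and the next. This is only a sketch: the function name extract_item is hypothetical, and it assumes every section heading starts at the beginning of a line with "Item", a number, a period, and more digits.

```python
import re

def extract_item(text, item="5.02"):
    """Return the text between the 'Item <item>' heading line and the next Item heading."""
    pattern = re.compile(
        r"^Item " + re.escape(item) + r"[^\n]*\n"  # the heading line itself
        r"(.*?)"                                   # the body, as short as possible
        r"(?=^Item \d+\.\d+|\Z)",                  # stop at the next heading or end of text
        re.MULTILINE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None

# Hypothetical miniature filing used only to exercise the function.
sample = (
    "Item 5.02 Departure of Directors.\n\n"
    "First paragraph.\n\n"
    "Second paragraph.\n\n"
    "Item 9.01 Financial Statements and Exhibits.\n"
)
print(extract_item(sample))
```

The lookahead means the next heading is not consumed, so the same function can be called again for "9.01" on the same text.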
I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
ie.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but am relatively new with regex in python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]
for url in urls:
    bits = url.split('/') # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit] # [1]
    print (bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
1. More than one bit might contain a hyphen, where:
   a. both contain only dictionary words (see the first and fourth outputs), or
   b. one of them is "clearly" not a headline (see the second and third outputs from the bottom).
2. There are spurious string fragments at the end of the real headline, e.g. "13721842.php", "revenues.asp", "210002719.html".
3. Characters other than '-' may also need to be replaced with a space (see the fourth output, "General+News").
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re
regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)
for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
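The same idea can be wrapped in a reusable function. The sketch below is a variation, not the code above verbatim: it looks only at the URL path (via urllib.parse.urlsplit) so that a hyphenated domain name is never mistaken for a headline, and it returns None when no path segment contains a hyphen. The name headline_from_url is made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Trailing "-<hex>" groups and a file extension, as in the regex above.
TRAILING_JUNK = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

def headline_from_url(url):
    """Pick the path segment with the most hyphens and turn it into words."""
    segments = [s for s in urlsplit(url).path.split('/') if '-' in s]
    if not segments:
        return None
    trimmed = [TRAILING_JUNK.sub('', s) for s in segments]
    best = max(trimmed, key=lambda s: s.count('-'))
    return best.replace('-', ' ')

print(headline_from_url(
    'https://www.sfgate.com/business/article/'
    'Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php'
))
```

Note that the hex trimming also removes a trailing word made up only of the letters a to f (for example "-bed"), so some headlines may lose their last word.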
The URLs do not follow one consistent pattern: the first and third URLs are structured differently from the rest, so they are handled as special cases.
Using rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '': # For case 1 and 3rd url
        if url.rsplit('/', 2)[1].isdigit(): # For 3rd case url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1]) # except 1st and 3rd case urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision
I'm new to NLP and to Python.
I'm trying to use object standardization to replace abbreviations with their full meaning. I found code online and altered it to test it on a Wikipedia excerpt, but all the code does is print out the original text. Can anyone help out a newbie in need?
Here's the code:
import nltk
lookup_dict = {'EC': 'European Commission', 'EU': 'European Union', "ECSC": "European Coal and Steel Community",
               "EEC": "European Economic Community"}
def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    print(new_text)
    return new_text
_lookup_words(
"The High Authority was the supranational administrative executive of the new European Coal and Steel Community ECSC. It took office first on 10 August 1952 in Luxembourg. In 1958, the Treaties of Rome had established two new communities alongside the ECSC: the eec and the European Atomic Energy Community (Euratom). However their executives were called Commissions rather than High Authorities")
Thanks in advance, any help is appreciated!
In your case, the lookup dict contains abbreviations such as ECSC and EEC that do appear in your input sentence. Calling split() splits the input on whitespace, but your sentence contains ECSC. and ECSC: (with the punctuation attached), so those are the tokens you get after splitting, not ECSC, and they never match a key. Note also that the dict keys are uppercase while the code looks up word.lower(), so the case has to be made consistent as well. I would suggest stripping the punctuation, matching the case of the keys, and running it again.
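A minimal sketch of that suggestion follows. It strips punctuation from each token with string.punctuation and looks up the uppercased token, because the dictionary keys are uppercase. The function name expand_abbreviations is made up, and uppercasing every token is an assumption that could wrongly expand an ordinary lowercase word that happens to spell an abbreviation.

```python
import string

lookup_dict = {
    "EC": "European Commission",
    "EU": "European Union",
    "ECSC": "European Coal and Steel Community",
    "EEC": "European Economic Community",
}

def expand_abbreviations(text, lookup=lookup_dict):
    new_words = []
    for word in text.split():
        core = word.strip(string.punctuation)      # "ECSC:" -> "ECSC"
        replacement = lookup.get(core.upper())     # keys are uppercase
        if core and replacement:
            # Swap only the abbreviation, keeping the attached punctuation.
            word = word.replace(core, replacement, 1)
        new_words.append(word)
    return " ".join(new_words)

print(expand_abbreviations("alongside the ECSC: the eec and the (Euratom)."))
```

Keeping the surrounding punctuation means the sentence structure of the original text survives the replacement.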
What is the fastest way to remove items in the list that matches substrings in the set?
For example,
the_list =
['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
'Trumps career',
'branding efforts',
'personal life',
'and outspoken manner have made him a celebrity.',
'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
'While still attending college he worked for his fathers firm',
'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
'and in 1971 was given control, renaming the company The Trump Organization.',
'Since then he has built hotels',
'casinos',
'golf courses',
'and other properties',
'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']
The list is actually a lot longer than this (millions of string elements) and I'd like to remove whatever elements that contain the strings in the set, for example,
{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"}
What will be the fastest way? Is Looping through the fastest?
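The Aho-Corasick algorithm was designed for exactly this task. It searches for all the substrings in a single pass, so its running time grows roughly linearly with the total length of the substrings plus the total length of the text being searched (plus the number of matches). Nested loops, by contrast, test every substring against every list element: O(n*m) checks for n substrings and m list elements.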
There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
Use a list comprehension if you have your strings already in memory:
new = [line for line in the_list if not any(item in line for item in set_of_words)]
If you don't want to build the whole result list in memory, a generator expression produces the filtered lines lazily:
new = (line for line in the_list if not any(item in line for item in set_of_words))
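Another option that stays within the standard library is to compile all the substrings into one regular expression, so each line is scanned by a single search instead of one "in" test per substring. This is not Aho-Corasick, just a plain alternation, and whether it is actually faster on millions of lines is something to measure rather than assume. The function name remove_matching is made up for illustration.

```python
import re

def remove_matching(lines, substrings):
    """Keep only the lines that contain none of the given substrings."""
    if not substrings:
        return list(lines)  # an empty pattern would match every line
    # re.escape makes characters such as "." in "D.J. Trump" match literally.
    pattern = re.compile("|".join(re.escape(s) for s in substrings))
    return [line for line in lines if not pattern.search(line)]

set_of_words = {"Donald Trump", "Trump Organization", "Donald J. Trump",
                "D.J. Trump", "dump", "dd"}
sample = ["renaming the company The Trump Organization.", "casinos", "golf courses"]
print(remove_matching(sample, set_of_words))
```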
I have a long string (multiple paragraphs) which I need to split into a list of line strings. The determination of what makes a "line" is based on:
The number of characters in the line is less than or equal to X (where X is a fixed number of columns per line)
OR, there is a newline in the original string (which forces a new "line" to be created).
I know I can do this algorithmically but I was wondering if python has something that can handle this case. It's essentially word-wrapping a string.
And, by the way, the output lines must be broken on word boundaries, not character boundaries.
Here's an example of input and output:
Input:
"Within eight hours of Wilson's outburst, his Democratic opponent, former-Marine Rob Miller, had received nearly 3,000 individual contributions raising approximately $100,000, the Democratic Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a strong national defense and reining in the size of government, won a special election to the House in 2001, succeeding the late Rep. Floyd Spence, R-S.C. Wilson had worked on Spence's staff on Capitol Hill and also had served as an intern for Sen. Strom Thurmond, R-S.C."
Output:
"Within eight hours of Wilson's outburst, his"
"Democratic opponent, former-Marine Rob Miller,"
" had received nearly 3,000 individual "
"contributions raising approximately $100,000,"
" the Democratic Congressional Campaign Committee"
" said."
""
"Wilson, a conservative Republican who promotes a "
"strong national defense and reining in the size "
"of government, won a special election to the House"
" in 2001, succeeding the late Rep. Floyd Spence, "
"R-S.C. Wilson had worked on Spence's staff on "
"Capitol Hill and also had served as an intern"
" for Sen. Strom Thurmond, R-S.C."
EDIT
What you are looking for is textwrap, but that's only part of the solution not the complete one. To take newline into account you need to do this:
from textwrap import wrap
'\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
>>> print('\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()]))
Within eight hours of Wilson's outburst, his
Democratic opponent, former-Marine Rob Miller, had
received nearly 3,000 individual contributions
raising approximately $100,000, the Democratic
Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a
strong national defense and reining in the size of
government, won a special election to the House in
2001, succeeding the late Rep. Floyd Spence,
R-S.C. Wilson had worked on Spence's staff on
Capitol Hill and also had served as an intern for
Sen. Strom Thurmond
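Since the question asks for a list of line strings rather than one joined string, the same approach can be sketched as a small function. The name wrap_lines is made up; it relies on textwrap.wrap returning an empty list for an empty line, which is replaced by a single empty string so that blank lines in the input survive.

```python
from textwrap import wrap

def wrap_lines(text, width=50):
    """Word-wrap text to at most `width` columns, keeping existing newlines."""
    lines = []
    for block in text.splitlines():
        # wrap("") returns [], so substitute [""] to preserve blank lines.
        lines.extend(wrap(block, width=width) or [""])
    return lines

print(wrap_lines("aaa bbb ccc\n\nddd", width=7))
```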
You probably want to use the textwrap module in the standard library:
http://docs.python.org/library/textwrap.html