For my upcoming project, I am supposed to take a text file that has been scrambled and unscramble it into a specific format.
Each line in the scrambled file contains a line of text from a work, the line number, and a three-letter code that identifies the work. Each of these items is separated by the | character. For example,
it ran away when it saw mine coming!"|164|ALC
cried to the man who trundled the barrow; "bring up alongside and help|27|TRI
"Of course he's stuffed," replied Dorothy, who was still angry.|46|WOO
My task is to write a program that reads each line in the scrambled file, separates the fields, unscrambles the lines, and collects some basic data about each work. For each work, I have to determine
its longest line (and the corresponding line number),
its shortest line (and corresponding line number), and
the average length of the lines in the entire work.
The summaries should be sorted by three-letter code and should be formatted as follows:
ALC
Longest Line (107): "No, I didn’t," said Alice: "I don’t think it’s at all a pity. I said
Shortest Line (148): to."
Average Length: 59

WOO
Longest Line (66): of my way. Whenever I’ve met a man I’ve been awfully scared; but I just
Shortest Line (71): go."
Average Length: 58
Then I have to make another file, which should contain the three-letter code for each work followed by its text. The lines must all be included, in order, and should not include line numbers or three-letter codes. The works should be separated by a line of five dashes. The result should look like the following:
ALC
A large rose-tree stood near the entrance of the garden: the roses growing on it were white, but there were three gardeners at it, busily painting them red. Alice thought this a very curious thing, and she went nearer to watch them, and just as she came up to them she heard one of them say, "Look out now, Five! Don’t go splashing paint over me like that!" "I couldn’t help it," said Five, in a sulky tone; "Seven jogged my elbow." On which Seven looked up and said, "That’s right, Five! Always lay the blame on others!"
-----
TRI
SQUIRE TRELAWNEY, Dr. Livesey, and the rest of these gentlemen having asked me to write down the whole particulars about Treasure Island, from the beginning to the end, keeping nothing back but the bearings of the island, and that only because there is still treasure not yet lifted, I take up my pen in the year of grace 17__ and go back to the time when my father kept the Admiral Benbow inn and the brown old seaman with the sabre cut first took up his lodging under our roof. I remember him as if it were yesterday, as he came plodding to the inn door, his sea-chest following behind him in a hand-barrow--a tall, strong, heavy, nut-brown man, his tarry pigtail falling over the
-----
WOO
All this time Dorothy and her companions had been walking through the thick woods. The road was still paved with yellow brick, but these were much covered by dried branches and dead leaves from the trees, and the walking was not at all good. There were few birds in this part of the forest, for birds love the open country where there is plenty of sunshine. But now and then there came a deep growl from some wild animal hidden among the trees. These sounds made the little girl’s heart beat fast, for she did not know what made them; but Toto knew, and he walked close to Dorothy’s side, and did not even bark in return.
My question is: what tools or methods (lists or any other data structures Python has) would be best for this project, where I have to move lines of text around and unscramble their order? I would greatly appreciate some advice or help with the code.
The code you posted works, except that when the program tries to find the data for the summaries file, it only brings back:
TTL
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0

WOO
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0

ALG
Longest Line (59): subscribe to our email newsletter to hear about new eBooks.
Shortest Line (59): subscribe to our email newsletter to hear about new eBooks.
Average Length: 0
Is there something you can see that's causing the longest line, the shortest line, and the average to not print out correctly?
Here is a download link to the starting text file. https://drive.google.com/file/d/1Dwnk0ziqovEEuaC7r7YzZdkI5_bh7wvG/view?usp=sharing
EDIT*****
It is working properly now, but is there a way to change the code so it outputs the line number where the longest and shortest lines are found, instead of the character count?
TTL
Longest Line (82): *** END OF THE PROJECT GUTENBERG EBOOK TWENTY THOUSAND LEAGUES UNDER THE SEAS ***
Shortest Line (1): N
Average Length: 58
WOO
Longest Line (74): Section 5. General Information About Project Gutenberg-tm electronic works
Shortest Line (3): it.
Average Length: 58
ALG
Longest Line (76): 2. Alice through Q.’s 3d (_by railway_) to 4th (_Tweedledum and Tweedledee_)
Shortest Line (1): 1
Average Length: 54
Above, next to the longest line it shows (76) because that's the number of characters in the line, but is there a way to have it be the line number instead?
EDIT****
It looks like my summary and unscrambled files are coming out in non-alphabetical order. Is there a way to make them come out alphabetical instead?
I suggest using pandas for this. You can load your data as a dataframe with read_csv once you've added a newline character at the right positions, which can be done with a regex:
import pandas as pd
import io
import re

data = '''it ran away when it saw mine coming!"|164|ALC cried to the man who trundled the barrow; "bring up alongside and help|27|TRI "Of course he's stuffed," replied Dorothy, who was still angry.|46|WOO'''
data = re.sub(r'(?<=[A-Z]{3})\s', '\n', data)  # replace the whitespace after a run of three capital letters with a newline character
df = pd.read_csv(io.StringIO(data), sep='|', names=['text', 'line', 'book'])
This will output the following dataframe:
   text                                                                     line  book
0  it ran away when it saw mine coming!"                                     164   ALC
1  cried to the man who trundled the barrow; "bring up alongside and help     27   TRI
2  Of course he's stuffed, replied Dorothy, who was still angry.              46   WOO
Now you can process the data as you like, for example by getting the number of characters in the lines and printing the desired statistics:
df['length'] = df['text'].str.len()
print('longest string:', df[df['length']==df['length'].max()])
print('shortest string:', df[df['length']==df['length'].min()])
print('average string length:', df['length'].mean())
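If you want the line number (rather than the character count) shown next to the longest and shortest lines, idxmax/idxmin on the length column give you the matching rows; a minimal sketch, assuming the dataframe built above:

# rows holding the longest and shortest text, looked up via the length column
longest = df.loc[df['length'].idxmax()]
shortest = df.loc[df['length'].idxmin()]

print(f"Longest Line ({longest['line']}): {longest['text']}")
print(f"Shortest Line ({shortest['line']}): {shortest['text']}")
print(f"Average Length: {int(df['length'].mean())}")

For per-work summaries you would apply the same lookups within df.groupby('book'), which also iterates the groups in sorted order, so the summaries come out alphabetically by code.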
Or get the full texts of the books by sorting by line number, grouping the data by book and joining the lines per book:
full_texts = df.sort_values(['line']).groupby('book', as_index=False).agg({'text': ' '.join})
print('\n\n-----\n\n'.join(full_texts['book']+' '+full_texts['text']))
Result:
ALC it ran away when it saw mine coming!"
-----
TRI cried to the man who trundled the barrow; "bring up alongside and help
-----
WOO Of course he's stuffed, replied Dorothy, who was still angry.
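Your assignment also wants the unscrambled text written to a file; a sketch of that last step, reusing full_texts from above (the file name unscrambled.txt is just an assumption):

# write each book code followed by its text, with works separated by five dashes
with open('unscrambled.txt', 'w') as f:
    f.write('\n-----\n'.join(full_texts['book'] + '\n' + full_texts['text']))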
If you aren't allowed to use any (third-party) imports, this approach might help you.
First of all, we need to parse your scrambled file, where
Each line [...] contains a line of text from a work, the line number, and a three-letter code that identifies the work. Each of these items is separated by the | character.
INPUT_FILE = "text.txt"
SUMMARIES_FILE = "summaries.txt"
UNSCRAMBLED_FILE = "unscrambled.txt"

books = {}
with open(INPUT_FILE, "r") as f:
    for l in f:
        l = l.strip().split("|")
        text, line, book = l
        texts = books.get(book, [])
        texts.append((line, text))
        books[book] = texts
The dictionary books will now look like this:
{
    'ALC': [('164', 'it ran away when it saw mine coming!"')],
    'TRI': [('27', 'cried to the man who trundled the barrow; "bring up alongside and help')],
    'WOO': [('46', '"Of course he\'s stuffed," replied Dorothy, who was still angry.')]
}
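As a side note, collections.defaultdict is part of the standard library, so it stays within the "no third-party imports" constraint and removes the get/reassign dance; a sketch of the same parsing loop:

from collections import defaultdict

books = defaultdict(list)
with open(INPUT_FILE, "r") as f:
    for raw in f:
        text, line, book = raw.strip().split("|")
        # a missing key springs into existence as an empty list
        books[book].append((line, text))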
Now we can proceed to the processing of each line (please notice the comments in the code):
with open(SUMMARIES_FILE, "w") as summaries_file, open(UNSCRAMBLED_FILE, "w") as unscrambled_file:
    summary = ""
    unscrambled = ""
    # iterate over all books, in alphabetical order of their three-letter codes
    for book, texts in sorted(books.items()):
        # sort the lines by line number
        texts = sorted(texts, key=lambda k: int(k[0]))
        unscrambled += f"{book}\n"
        total_len = 0
        longest = shortest = None
        # iterate over all (sorted) lines of the book
        for line, text in texts:
            unscrambled += text + "\n"
            length = len(text)
            # keep track of the total length of the lines (we need that to calculate the average)
            total_len += length
            # check whether the current line is the longest one yet
            longest = longest if longest is not None and len(longest[1]) > length else (line, text)
            # check whether the current line is the shortest one yet
            shortest = shortest if shortest is not None and len(shortest[1]) < length else (line, text)
        unscrambled += "-----\n"
        # calculate the average length of the lines
        average_len = total_len // len(texts)
        summary += f"{book}\n" \
                   f"Longest Line ({longest[0]}): {longest[1]}\n" \
                   f"Shortest Line ({shortest[0]}): {shortest[1]}\n" \
                   f"Average Length: {average_len}\n\n"
    # write the results to the appropriate files
    summaries_file.write(summary)
    unscrambled_file.write(unscrambled)
summaries.txt will contain:
ALC
Longest Line (164): it ran away when it saw mine coming!"
Shortest Line (164): it ran away when it saw mine coming!"
Average Length: 37

TRI
Longest Line (27): cried to the man who trundled the barrow; "bring up alongside and help
Shortest Line (27): cried to the man who trundled the barrow; "bring up alongside and help
Average Length: 70

WOO
Longest Line (46): "Of course he's stuffed," replied Dorothy, who was still angry.
Shortest Line (46): "Of course he's stuffed," replied Dorothy, who was still angry.
Average Length: 63
unscrambled.txt will contain:
ALC
it ran away when it saw mine coming!"
-----
TRI
cried to the man who trundled the barrow; "bring up alongside and help
-----
WOO
"Of course he's stuffed," replied Dorothy, who was still angry.
-----
However, for large inputs this solution might not be as fast as pandas' vectorized operations.
Related
I have a set of text files with blurbs of text, and I need to search these for a particular keyword such that a set of words before and/or after the keyword (i.e. a phrase) is returned, along with a count of each phrase across the files. For example, the contents of a few of the files are:
File 1: This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!
File 2: Having a beautiful green park close to your house is great.
File 3: I visited a green park today. My friend also visited a green park today.
So if I search for the keyword park, I'm looking for the output to be a set of phrases (let's say one word before and after park), ranked by how many times each phrase occurs across the files. So in this example, the output should be:
green park today: 2
green park close: 1
Is there a way I can achieve this in Python, maybe using some NLP libraries or even without them? I have some code in my post here, but it doesn't solve the problem (I'll perhaps delete that post once I get a response to this one).
Thank you
Based on your expected output above, it looks like you only want to add one to the count for each phrase per file (even if it appears several times in the same file). Below is an example of how you can do this without any special NLP libraries, just defining "words" as chains of non-space characters delimited by spaces (I'm assuming you know how to read text from a file, so I'm leaving that part out).
from collections import Counter

str1 = "This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!"
str2 = "Having a beautiful green park close to your house is great."
str3 = "I visited a green park today. My friend also visited a green park today."

str1_words = ["START"] + str1.split(" ") + ["END"]
str2_words = ["START"] + str2.split(" ") + ["END"]
str3_words = ["START"] + str3.split(" ") + ["END"]

all_phrases = []
SEARCH_WORD = "park"
for words in [str1_words, str2_words, str3_words]:
    phrases = []
    for i in range(1, len(words) - 1):
        if words[i] == SEARCH_WORD:
            phrases.append(" ".join(words[i-1:i+2]))
    # Only count each phrase once for this text
    phrases = set(phrases)
    all_phrases.extend(phrases)

phrase_count = Counter(all_phrases)
print(phrase_count.most_common())
The output is:
[('green park today', 1), ('green park close', 1), ('green park today.', 1)]
This perfectly demonstrates the problem with the definition of a "word" above: punctuation is treated as part of the word. For a better way to do it, look into the NLTK library, specifically methods for "word tokenization".
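For reference, a minimal sketch of that route (assuming nltk is installed and the punkt tokenizer data has been downloaded once):

import nltk
# nltk.download('punkt')  # one-time download of the tokenizer data

words = nltk.word_tokenize("I visited a green park today.")
print(words)  # ['I', 'visited', 'a', 'green', 'park', 'today', '.']

With the period split off into its own token, 'green park today' and 'green park today.' collapse into the same phrase, which matches the count of 2 in your expected output.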
Hopefully the above gives you an idea of how to get started on this.
I have a text file that contains paragraphs separated by an empty line, such as below. I am trying to create a list, where each element is a paragraph in the text file.
The Yakutia region, or Sakha Republic, where the Siberian wildfires are mainly
taking place is one of the most remote parts of Russia.
The capital city, Yakutsk, recorded one of the coldest temperatures on Earth in
February 1891, of minus 64.4 degrees Celsius (minus 83.9 degrees Fahrenheit); but
the region saw record high temperatures this winter.
The Siberian Times reported in mid-July that residents were breathing smoke from more than
300 separate wildfires, but that only around half of the forest blazes were being tackled
by firefighters — including paratroopers flown in by the Russian military — because
the rest were thought to be too dangerous.
The wildfires have grown in size since then and have engulfed an estimated 62,300 square
miles (161,300 square km) since the start of the year.
So in the above example there would be 4 elements in the list, one for each paragraph.
I can easily combine the paragraphs into a single string using the following code,
mystr = " ".join([line.strip() for line in lines])
but I have no idea how to use the empty line between the paragraphs as a delimiter to make a list out of the text file. I have tried,
with open('texr.txt', encoding='utf8') as f:
    lines = [line for line in f]
hoping that I could convert every line into a list element, and then combine everything between empty lines into one string. But that doesn't seem to work. I must be missing something very fundamental here...
Thanks
Try:
with open('textr.txt') as fp:
    lst = [p.strip() for p in fp.read().split('\n\n')]
>>> lst
['The Yakutia region, or Sakha Republic, where the Siberian wildfires are mainly \ntaking place is one of the most remote parts of Russia.',
'The capital city, Yakutsk, recorded one of the coldest temperatures on Earth in \nFebruary 1891, of minus 64.4 degrees Celsius (minus 83.9 degrees Fahrenheit); but \nthe region saw record high temperatures this winter.',
'The Siberian Times reported in mid-July that residents were breathing smoke from more than \n300 separate wildfires, but that only around half of the forest blazes were being tackled \nby firefighters — including paratroopers flown in by the Russian military — because \nthe rest were thought to be too dangerous.',
'The wildfires have grown in size since then and have engulfed an estimated 62,300 square \nmiles (161,300 square km) since the start of the year.']
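One caveat: this assumes the separating lines are completely empty. If they might contain stray spaces or tabs, splitting with a regex is more forgiving; a sketch of the same idea:

import re

with open('textr.txt') as fp:
    # split on blank lines even if they contain whitespace, and drop empty chunks
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', fp.read()) if p.strip()]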
I'm trying to extract ONLY one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over the lines, and over each word within a line, check whether the word starts with '$', and extract it:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
IF there is only 1 occurrence like you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
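If you only want the digits and can assume the amount directly follows the $ sign, a capturing group does the extraction and cleanup in one step; a sketch under that assumption:

import re

# s is the paragraph from the question; group(1) captures just the digits
match = re.search(r'\$(\d+)', s)
if match:
    print(match.group(1))  # 2000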
I decided I wanted to take a text and find how close certain labels are in the text. Basically, the idea is to check whether two persons are less than 14 words apart; if they are, we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
            'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi, w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi + 1, wi + x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi], ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure what is the best way to proceed. Any hints are welcome.
You could first preprocess your text so that all the names in the text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as other words in the text. As you preprocess the text, you could keep a mapping of ids to names to know which name corresponds to which id. This would allow you to keep your current algorithm as is; a sketch follows.
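A minimal sketch of that preprocessing, using the text and involved variables from the question (the @@PERSON<n>@@ id format is made up; any token that cannot collide with a real word will do):

# replace longer names first, so that e.g. 'Rudolf the Third' is substituted
# before any shorter name that might overlap with it
ids = {}
for n, name in enumerate(sorted(involved, key=len, reverse=True)):
    token = f"@@PERSON{n}@@"
    ids[token] = name
    text = text.replace(name, token)

In the scan itself you would then test ws[i] in ids instead of ws[i] in involved, and print ids[ws[wi]] and ids[ws[i]] to recover the original names.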
There are two sentences in "test_tweet1.txt"
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:
import re

popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()

array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print "\n", count1, line
    ltext1 = line.split(" ")
    for i, text in enumerate(ltext1):
        if text in rpopular_person:
            print text
    text2 = ' '.join(ltext1)
The results from the code showed:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I tried to match words from "test_tweet1.txt" with "Personal.txt".
Expected result:
Tony
Romo
Any suggestion?
Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get a value of True, because the characters 't', 'o' appear in sequence. I am assuming that the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split(), with no arguments. This splits on all whitespace--newlines, tabs, and so on. This way you get everything separately like you intend. Or, if you don't actually want this (since this will give you results of "The" all the time based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
You're not getting "Romo" because that's not a word in your tweet. The word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue). Rather than looking at each word in your tweet, look at each name in your Personal.txt file, and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    people = p.read().split('\n')  # Get full names rather than partial names

with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print person
You need to split rpopular_person to get it to match words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
Tony
The
The reason Romo isn't showing up is that after your line split you have "Romo." with a trailing period. Maybe you should look for the entries of popular_person in the lines, instead of the other way around. Maybe something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")

array = []
for count1, line in enumerate(file1):
    print "\n", count1, line
    for person in popular_person:
        if person in line:
            print person