Error: match word in file - python

There are two sentences in "test_tweet1.txt"
#francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
#mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
In "Personal.txt"
The Game (rapper)
The Notorious B.I.G.
The Undertaker
Thor
Tiësto
Timbaland
T.I.
Tom Cruise
Tony Romo
Trajan
Triple H
My code:
import re

popular_person = open('C:/Users/Personal.txt')
rpopular_person = popular_person.read()
file1 = open("C:/Users/test_tweet1.txt").readlines()
array = []
count1 = 0
for line in file1:
    array.append(line)
    count1 = count1 + 1
    print("\n", count1, line)
    ltext1 = line.split(" ")
    for i, text in enumerate(ltext1):
        if text in rpopular_person:
            print(text)
    text2 = ' '.join(ltext1)
The code produced this output:
1 #francesco_con40 2nd worst QB. DEFINITELY Tony Romo. The man who likes to share the ball with everyone. Including the other team.
Tony
The
man
to
the
the
2 #mariakaykay aga tayo tomorrow ah. :) Good night, Ces. Love you! >:D<
aga
I tried to match each word from "test_tweet1.txt" against the names in "Personal.txt".
Expected result:
Tony
Romo
Any suggestions?

Your problem seems to be that rpopular_person is just a single string. Therefore, when you ask things like 'to' in rpopular_person, you get a value of True, because the characters 't', 'o' appear in sequence. I am assuming that the same goes for 'the' elsewhere in Personal.txt.
What you want to do is split up Personal.txt into individual words, the way you're splitting your tweets. You can also make the resulting list of words into a set, since that'll make your lookup much faster. Something like this:
people = set(popular_person.read().split())
It's also worth noticing that I'm calling split() with no arguments. This splits on all whitespace (newlines, tabs, and so on), so you get every word separately, as you intend. Or, if you don't actually want that (since it will give you results of "The" all the time based on your edited contents of Personal.txt), make it:
people = set(popular_person.read().split('\n'))
This way you split on newlines, so you only look for full name matches.
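To see the difference concretely, here is a minimal sketch (with a few inline names standing in for Personal.txt) comparing the raw substring check against set membership at the word and full-name level:

```python
# Inline stand-in for the contents of Personal.txt (hypothetical sample data)
rpopular_person = "Tony Romo\nTiësto\nTom Cruise"

# Substring check: 'to' appears inside "Tiësto", so this is True
print('to' in rpopular_person)

# Word-level check: split on all whitespace, then test set membership
people = set(rpopular_person.split())
print('to' in people)      # False
print('Tony' in people)    # True

# Full-name check: split on newlines instead
full_names = set(rpopular_person.split('\n'))
print('Tony' in full_names)       # False; only full names are kept
print('Tony Romo' in full_names)  # True
```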
You're not getting "Romo" because that's not a word in your tweet. The word in your tweet is "Romo." with a period. This is very likely to remain a problem for you, so what I would do is rearrange your logic (assuming speed isn't an issue). Rather than looking at each word in your tweet, look at each name in your Personal.txt file, and see if it's in your full tweet. This way you don't have to deal with punctuation and so on. Here's how I'd rewrite your functionality:
with open("Personal.txt") as p:
    people = p.read().split('\n')  # get full names rather than partial names
with open("test_tweet1.txt") as tweets:
    for tweet in tweets:
        for person in people:
            if person in tweet:
                print(person)

You need to split rpopular_person to make it match whole words instead of substrings:
rpopular_person = open('C:/Users/Personal.txt').read().split()
this gives:
Tony
The
The reason "Romo" isn't showing up is that after your line split you have "Romo." with a period. Maybe you should look for the names from popular_person in the lines, instead of the other way around. Maybe something like this:
popular_person = open('C:/Users/Personal.txt').read().split("\n")
file1 = open("C:/Users/test_tweet1.txt")
array = []
for count1, line in enumerate(file1):
    print("\n", count1, line)
    for person in popular_person:
        if person in line:
            print(person)

Related

How to extract only sentences from some texts in Python?

I have a text that is in the following form:
document = "Hobby: I like going to the mountains.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
I want to extract only the sentences from this text, without the "Hobby:", "Something (where):" and "The reason:" labels. Only the sentences. For example, "To Everest mountain" would not be a sentence, since it is not a full sentence.
The idea is that I need to get rid of those words followed by ":" (Hobby:, The reason:); it doesn't matter what is written before the ":", the point is to remove it when it appears at the beginning of the line, and to extract only the sentences from what remains.
I'd appreciate any ideas.
You can just use the split() method. First split the text on "\n"; then, if ":" is in the line, append the part after ": " to the final list. Here is the code:
document = "Hobby: I like going to the mountains.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
sentences = []
for element in document.split("\n"):
    if ":" in element:
        sentences.append(element.split(": ")[1])
print(*sentences, sep="\n")
And the output will be:
I like going to the mountains.
To Everest mountain.
I want to go because I like nature.
I'd like to go hiking and admiring the beauty of the nature.
But if a sentence can itself contain ": ", you should use the following code:
document = "Hobby: I like go: ing to the mountain: s.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
sentences = []
for element in document.split("\n"):
    if ":" in element:
        sentences.append(element.split(": ")[1:])
for line in sentences:
    print(": ".join(line))
Output:
I like go: ing to the mountain: s.
To Everest mountain.
I want to go because I like nature.
I'd like to go hiking and admiring the beauty of the nature.
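An equivalent way to handle embedded ": " is split's maxsplit argument, which stops splitting after the first separator and avoids the slice-and-join step:

```python
document = "Hobby: I like go: ing to the mountain: s.\n Something (where): To Everest mountain.\n\n The reason: I want to go because I like nature.\n Activities: I'd like to go hiking and admiring the beauty of the nature. "
sentences = []
for element in document.split("\n"):
    if ":" in element:
        # maxsplit=1 splits only at the first ": ", leaving later colons intact
        sentences.append(element.split(": ", 1)[1])
print(*sentences, sep="\n")
```

This prints the same four sentences, with "I like go: ing to the mountain: s." kept whole.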
Hope that helped!
If the text file is structured so that each sentence is separated by a newline character, parsing it with a regex is feasible.
As the other answer mentioned, use the split() method to separate the lines:
lines = document.split("\n")
With that you can apply regex to each line:
import re

sentences = []
for line in lines:
    # note the class is [a-zA-Z], not [a-zA-z], which would also match [\]^_`
    result = re.search(r"^[a-zA-Z0-9\s]*:(.*)", line)
    if not result:
        continue
    sentences.extend(result.groups())
print(sentences)
To check out what the regex does, visit a website such as https://regex101.com/
In short: it matches alphanumeric characters and whitespace up to the first ":" symbol, then grabs everything after it. The ^ symbol is crucial here, as it anchors the match at the start of the line; this way you won't accidentally match at a later ":".
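For example (a small sketch using the same pattern), the anchor is what prevents a match from starting in the middle of a line:

```python
import re

pattern = r"^[a-zA-Z0-9\s]*:(.*)"

labeled = "The reason: I want to go because I like nature."
print(re.search(pattern, labeled).group(1))  # ' I want to go because I like nature.'

# Here the first colon is preceded by a comma, which the label class does not
# allow, so the anchored pattern finds no match at all
unlabeled = "I said, wait: not yet."
print(re.search(pattern, unlabeled))  # None

# Dropping the '^' lets the search restart after the comma and match anyway
print(re.search(pattern[1:], unlabeled).group(1))  # ' not yet.'
```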

Extract phrase count from text files based on a keyword

I have a set of text files with blurbs of text and I need to search these for a particular keyword such that a set of words before and/or after the keyword (i.e. phrases) are returned along with a count of the phrases across the files. For example, contents of a few of files are:
File 1: This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!
File 2: Having a beautiful green park close to your house is great.
File 3: I visited a green park today. My friend also visited a green park today.
So if I search for the keyword park, I'm looking for the output to be a set of phrases (let's say one word before & after park), ranked based on how many times the phrase occurs across files. So in this example, the output should be:
green park today: 2
green park close: 1
Is there a way I can achieve this in Python, maybe using some NLP libraries or even without them? I have some code in my post here but that doesn't serve the purpose (I'll perhaps delete that post once I get a response to this one).
Thank you
Based on your expected output above, it looks like you only want to add one to the count for a single phrase per file (even if it appears several times in the same file). Below is an example of how you can do this without any special NLP libraries, just defining "words" as chains of non-space characters delimited by spaces (I'm assuming you know how to read text from a file so leaving that part out).
from collections import Counter

str1 = "This is a great day. I wish I could go to a beautiful green park today but unfortunately, we are in a lockdown!"
str2 = "Having a beautiful green park close to your house is great."
str3 = "I visited a green park today. My friend also visited a green park today."

# Pad with sentinels so a keyword at either end still has neighbours
str1_words = ["START"] + str1.split(" ") + ["END"]
str2_words = ["START"] + str2.split(" ") + ["END"]
str3_words = ["START"] + str3.split(" ") + ["END"]

all_phrases = []
SEARCH_WORD = "park"
for words in [str1_words, str2_words, str3_words]:
    phrases = []
    for i in range(1, len(words) - 1):
        if words[i] == SEARCH_WORD:
            phrases.append(" ".join(words[i-1:i+2]))
    # Only count each phrase once for this text
    phrases = set(phrases)
    all_phrases.extend(phrases)

phrase_count = Counter(all_phrases)
print(phrase_count.most_common())
The output is:
[('green park today', 1), ('green park close', 1), ('green park today.', 1)]
This perfectly demonstrates the problem with the definition of a "word" above: punctuation is treated as part of the word. For a better way to do it, look into the NLTK library, specifically its methods for "word tokenization".
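If pulling in NLTK is too heavy, a rough approximation is to strip punctuation off the edges of each token with the standard library (a sketch, not a real tokenizer; it won't handle contractions or hyphenation):

```python
import string

def tokenize(text):
    # Remove leading/trailing punctuation from each whitespace-delimited chunk
    return [w.strip(string.punctuation) for w in text.split()]

str3 = "I visited a green park today. My friend also visited a green park today."
print(tokenize(str3))
```

With this, "today." becomes "today", so "green park today." and "green park today" collapse into one phrase and are counted together.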
Hopefully the above gives you an idea of how to get started on this.

Check if there are numbers around a keyword in a text file

I have a text file 'Filter.txt' which contains the keyword 'D&O insurance'. I would like to check whether there are numbers in the sentence containing that keyword, as well as in the two sentences before and after it.
For example, I have a long paragraph like this:
"International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered? "
The target word is "D&O insurance." If I wanted to extract the target sentence (D&O insurance grants cover on a claims-made basis.) as well as the preceding and following sentences (Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. and How much is enough?), what would be a good approach?
This is what I'm trying so far. However, I don't really know how to check the whole sentence and the ones around it.
import re

for line in open('Filter.txt'):
    match = re.search(r'D&O insurance(\d+)', line)
    if match:
        print(match.group(1))
I'm new to programming, so I'm looking for the possible solutions for that purpose.
Thank you for your help!
Okay, I'm going to take a stab at this. Assume string is the entire contents of your .txt file (you may need to clean the '\n's out).
You're going to want to make a list of potential sentence endings, use that list to find the index positions of the sentence endings, and then use THAT list to make a list of the sentences in the file.
string = "International insurance programs necessary for companies with global subsidiaries and offices. Coverage is usually for current, future and past directors and officers of a company and its subsidiaries. D&O insurance grants cover on a claims-made basis. How much is enough? What and who is covered – and not covered?"
endings = ['! ', '? ', '. ']

def pos_find(string):
    # Return the index of the earliest sentence ending; min([]) raises
    # ValueError when no endings remain, which sort_sentences relies on
    lst = []
    for ending in endings:
        i = string.find(ending)
        if i != -1:
            lst.append(i)
    return min(lst)

def sort_sentences(string):
    sentences = []
    while True:
        try:
            i = pos_find(string)
            sentences.append(string[0:i+1])
            string = string[i+2:]
        except ValueError:
            sentences.append(string)
            break
    return sentences

sentences = sort_sentences(string)
Once you have the list of sentences (I got a little weary here, so forgive the spaghetti code; the functionality is there), you will need to comb through that list to find characters that could be integers (this is how I'm checking for numbers, but you could do it differently).
for i in range(len(sentences)):
    sentence = sentences[i]
    match = sentence.find('D&O insurance')
    if match >= 0:
        # The target sentence plus its immediate neighbours
        lst = [sentences[i-1], sentence, sentences[i+1]]
        for j in range(len(lst)):
            sen = lst[j]
            for char in sen:
                try:
                    int(char)
                    print(f'Found {char} in "{sen}", index {j}')
                except ValueError:
                    pass
Note that you will have to make some modifications to capture multi-digit numbers. This will just print something for each digit in the full number (i.e. it will print a statement for 1, 0, and 0 if it finds 100 in the file). You will also need to catch the two edge cases where the D&O insurance substring is found in the first or last sentence, since the neighbouring indices then fall outside (or wrap around) the sentences list.
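To pick up whole numbers rather than single digits, one option is re.findall with \d+ over each neighbouring sentence; a small sketch with hypothetical sentences:

```python
import re

sentences = [
    "Coverage is usually for current, future and past directors and officers.",
    "D&O insurance grants cover of up to 100 million on a claims-made basis.",
]

for j, sen in enumerate(sentences):
    # \d+ matches maximal digit runs, so 100 is reported once, not as 1, 0, 0
    for number in re.findall(r'\d+', sen):
        print(f'Found {number} in "{sen}", index {j}')
```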

How to extract string that contains specific characters in Python

I'm trying to extract ONLY one string that contains the $ character. The input is based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should look like this:
2000
You don't need a regex. Instead, you can iterate over the lines, and over each word in a line, check whether it starts with '$', and extract the rest of the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like this (provided input is the string you wrote above):
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
IF there is only 1 occurrence like you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')

How to create a list of strings using a special char to understand where to split

I have a text file that consists of the songs from all the albums of Pink Floyd, and it looks like this:
#The Piper At The Gates Of Dawn::1967
*Lucifer Sam::Syd Barrett::03:07::Lucifer Sam, Siam cat
Always sitting by your side
Always by your side
... ( The lyrics of the song )
*Matilda mother::Syd Barrett::03:07::There was a king who ruled the land
His majesty was in command
With silver eyes the scarlet eagle
... ( The lyrics of the song )
#Another album
*another song
song's lyrics
I would like to create a list of strings out of it, using each album (indicated by #) as one string, and all the songs in it as another string after that, and so on, like this:
["album\n", "*song's name\nlyrics\n*song's name\nlyrics ..."]
Thanks a lot! :D
Edit: I noticed that my explanation is a bit clumsy, so I will rephrase it.
What I want to do is convert the given text into a list which holds each album and its data as separate strings, so I would have something like this:
["album's name", "(everything between the album's name and the next one)", "album's name", ...]
and so on.
The albums have # before them, and I need to use that somehow to separate each album from its songs.
I tried a for loop which finds each # and the first \n after it to build the list, but it went into ashes :(
IMPORTANT! CLEAR EXPLANATION: consider you have a string that looks like this:
#Hello
Whatever
#Hello
More Whatever
I want to separate each #Hello with its Whatever, so I would have something like this:
["Hello", "Whatever", "Hello", "Whatever"]
I'm really sorry for my bad explanation abilities; this is the easiest way I can think of to explain it to you :D
Not super efficient, but works:
f = "filepath"
txt = "".join([line + "#" if line.startswith("#") else line for line in open(f)])
data = [x for x in txt.split("#")][1:]
data
['The Piper At The Gates Of Dawn::1967\n',
'*Lucifer Sam::Syd Barrett::03:07::Lucifer Sam, Siam cat\nAlways sitting by your side\nAlways by your side\n... ( The lyrics of the song )\n*Matilda mother::Syd Barrett::03:07::There was a king who ruled the land\nHis majesty was in command\nWith silver eyes the scarlet eagle\n... ( The lyrics of the song )\n',
'Another album\n',
"*another song\nsong's lyrics\n"]
You could do it using regular expressions (re module), consider following example, lets say that you have file songs.txt as follows:
#Song 1
First line
Second line
#Song 2
First line of second
Last line
You could do:
import re

with open('songs.txt', 'r') as f:
    data = f.read()
songs = re.findall(r'(#.+?\n)([^#]+)', data)
# now songs is a list of 2-tuples with song name and "song body"
songs = list(sum(songs, ()))  # flatten the list of tuples
print(songs)  # ['#Song 1\n', 'First line\nSecond line\n', '#Song 2\n', 'First line of second\nLast line\n']
The pattern (the 1st argument of re.findall) contains two groups denoted by parentheses; the first is for the title and the second for the lyrics. The first group must be a # followed by 1 or more characters that are not newlines, ending in a newline (\n). The second group is simply 1 or more characters that are not #.
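To see what the two groups capture, here is the same pattern run on a short inline snippet, followed by the flattening step:

```python
import re

data = "#Album One\nline a\nline b\n#Album Two\nline c\n"
songs = re.findall(r'(#.+?\n)([^#]+)', data)
print(songs)
# [('#Album One\n', 'line a\nline b\n'), ('#Album Two\n', 'line c\n')]

flat = list(sum(songs, ()))
print(flat)
# ['#Album One\n', 'line a\nline b\n', '#Album Two\n', 'line c\n']
```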
