Grab a keyword and the text between keywords in Python

First thing I'd like to say is this place has helped me more than I could ever repay. I'd like to say thanks to all that have helped me in the past :).
I am trying to divide up some text from a specific style of message. It is formatted like this:
DATA|1|TEXT1|STUFF: some random text|||||
DATA|2|TEXT1|THINGS: some random text and|||||
DATA|3|TEXT1|some more random text and stuff|||||
DATA|4|TEXT1|JUNK: crazy randomness|||||
DATA|5|TEXT1|CRAP: such random stuff I cant believe how random|||||
I have code shown below that combines the text adding a space between words and adds it to a string named "TEXT" so it looks like this:
STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random
I need it formatted like this:
DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|CRAP: |||||
DATA|9|NEWTEXT|such random stuff I cant believe how random|||||
The line numbers are easy; I have that done, as well as the carriage returns. What I need is to grab "CRAP" and change the part that says "TEXT1" to "NEWTEXT".
My code scans the string looking for keywords then adds them to their own line then adds text below them followed by the next keyword on its own line etc. Here is my code I have so far:
#this combines all text to one line and adds to a string
while current_segment.move_next('DATA'):
    TEXT = TEXT + " " + current_segment.field(4).value
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
#this splits the words up to search through
TEXT_list = TEXT.split(' ')
#this searches for the first few keywords then stops at the unwanted one
for word in TEXT_list:
    if word in KEYWORD_LIST:
        my_output = my_output + word
    elif word in KEYWORD_LIST1:
        break
    else:
        my_output = my_output + ' ' + word
#this searches for the unwanted keywords leaving the output blank until it reaches the wanted keyword
for word1 in TEXT_list:
    if word1 in KEYWORD_LIST:
        my_output1 = ''
    elif word1 in KEYWORD_LIST1:
        my_output1 = my_output1 + word1 + '\n'
    else:
        my_output1 = my_output1 + ' ' + word1
#my_output is formatted back the way I want dividing up the text into 65 or less character lines
MAX_LENGTH = 65
my_wrapped_output = wrap(my_output,MAX_LENGTH)
my_wrapped_output1 = wrap(my_output1,MAX_LENGTH)
my_output_list = my_wrapped_output.split('\n')
my_output_list1 = my_wrapped_output1.split('\n')
for phrase in my_output_list:
    if phrase == "":
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|TEXT| |||||"
    else:
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|TEXT|" + phrase + "|||||"
for phrase2 in my_output_list1:
    if phrase2 == "":
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|NEWTEXT| |||||"
    else:
        SetID +=1
        output = output + "DATA|" + str(SetID) + "|NEWTEXT|" + phrase2 + "|||||"
#this populates the fields I need
value = output
Then I format "my_output" and "my_output1", adding the word "NEWTEXT" where it goes. This code runs through each line looking for the keyword, then puts that keyword and a carriage return in. Once it gets to the other list, "KEYWORD_LIST1", it stops and drops the rest of the text, then starts the next loop. My problem is that the above code gives me this:
DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|crazy randomness|||||
DATA|9|NEWTEXT|CRAP: |||||
DATA|10|NEWTEXT|such random stuff I cant believe how random|||||
It is grabbing the text from before "KEYWORD_LIST1" and adding it into the NEWTEXT section. I know there is a way to make groups from the keyword and the text after it, but I am unclear on how to implement it. Any help would be much appreciated.
Thanks.
This is what I had to do to get it to work for me:
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
def text_to_message(text):
    result=[]
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result=[]
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)
def format_messages(messages):
    title='TEXT1'
    for message in messages:
        if message in KEYWORD_LIST:
            title='TEXT1'
        elif message in KEYWORD_LIST1:
            title='NEWTEXT'
        my_wrapped_output = wrap(message,MAX_LENGTH)
        my_output_list = my_wrapped_output.split('\n')
        for line in my_output_list:
            if line == '':
                yield title + '|'
            else:
                yield title + '|' + line
for line in format_messages(text_to_message(TEXT)):
    if line == '':
        SetID +=1
        output = "DATA|" + str(SetID) + "|"
    else:
        SetID +=1
        output = "DATA|" + str(SetID) + "|" + line
#this is needed instead of print(line)
value = output

General tip: Don't try to build up strings accretively like this:
my_output = my_output + ' ' + word
instead, make my_output a list, append word to the list, and then, at the very end, do a single join: my_output = ' '.join(my_output). (See text_to_message code below for an example.)
Using join is the right way to build strings. Delaying the creation of the string is useful because processing lists of substrings is more pleasant than splitting and unsplitting strings, and having to add spaces and carriage returns here and there.
Study generators. They are easy to understand, and can help you a lot when processing text like this.
import textwrap
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
def text_to_message(text):
    result=[]
    for word in text.split():
        if word in KEYWORD_LIST or word in KEYWORD_LIST1:
            if result:
                yield ' '.join(result)
                result=[]
            yield word
        else:
            result.append(word)
    if result:
        yield ' '.join(result)
def format_messages(messages):
    title='TEXT1'
    num=1
    for message in messages:
        if message in KEYWORD_LIST:
            title='TEXT1'
        elif message in KEYWORD_LIST1:
            title='NEWTEXT'
        for line in textwrap.wrap(message,width=65):
            yield 'DATA|{n}|{t}|{l}'.format(n=num,t=title,l=line)
            num+=1
TEXT='''STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random'''
for line in format_messages(text_to_message(TEXT)):
    print(line)
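
To make the general tip about join concrete, here is a minimal sketch comparing the two ways of building the string. The variable names accreted, pieces, and joined are made up for this illustration and do not come from the question. Note that the accretive version also leaves a stray leading space, which is one of the small annoyances join avoids.

```python
words = "STUFF: some random text".split()

# Accretive version: a new string is built on every iteration
accreted = ""
for word in words:
    accreted = accreted + " " + word

# List-and-join version: collect the pieces, then join once at the end
pieces = []
for word in words:
    pieces.append(word)
joined = " ".join(pieces)

print(repr(accreted))  # ' STUFF: some random text'
print(repr(joined))    # 'STUFF: some random text'
```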

Related

Preventing removal of linebreaks

I have a function that replaces offensive words with a star, but in running text through this, it strips out linebreaks. Any thoughts on how to prevent this?
def replace_words(text, exclude_list):
    words = text.split()
    for i in range(len(words)):
        if words[i].lower() in exclude_list:
            words[i] = "*"
    return ' '.join(words)

Don't call .split() with no argument on the entire input string: it discards the line breaks, so you lose the information about where to put them back in the result string.
You could first split the input into lines and then process each line separately in the same way as you now process the whole input.

credit to mkrieger1
def replace_words(text, exclude_list):
    paragraphs = text.split('\n')
    new_paragraphs = []
    for p in paragraphs:
        words = p.split()
        for i in range(len(words)):
            if words[i].lower() in exclude_list:
                words[i] = "*"
        new_paragraphs.append(' '.join(words))
    # join with line breaks, without adding a stray one at the start
    return '\n'.join(new_paragraphs)
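
A different approach from the one in the answers above, offered only as a sketch: instead of splitting and rejoining, re.sub can replace words in place, which leaves every whitespace character (line breaks, tabs, repeated spaces) exactly as it was. The function name replace_words_keep_whitespace and the helper mask are hypothetical names for this example.

```python
import re

def replace_words_keep_whitespace(text, exclude_list):
    """Replace excluded words with '*' while leaving all whitespace untouched."""
    excluded = {w.lower() for w in exclude_list}

    def mask(match):
        word = match.group(0)
        return "*" if word.lower() in excluded else word

    # \S+ matches each run of non-whitespace characters,
    # i.e. the same tokens that str.split() would produce
    return re.sub(r"\S+", mask, text)

sample = "this is Darn\nbad  text"
print(repr(replace_words_keep_whitespace(sample, ["darn"])))
# 'this is *\nbad  text'
```

Like the original function, this compares whole whitespace-separated tokens, so a word with attached punctuation such as "darn," is not matched.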

You can use \n to create a new line or .split()

How to reverse the words of a string considering the punctuation?

Here is what I have so far:
def reversestring(thestring):
    words = thestring.split(' ')
    rev = ' '.join(reversed(words))
    return rev
stringing = input('enter string: ')
print(reversestring(stringing))
I know I'm missing something because I need the punctuation to also follow the logic.
So let's say the user enters Do or do not, there is no try. The result should come out as .try no is there , not do or Do, but I only get try. no is there not, do or Do. I use a straightforward implementation which reverses all the characters in the string, then checks all the words and reverses the characters again, but only for the ones with ASCII values of letters.

Try this (explanation in comments of code):
s = "Do or do not, there is no try."
o = []
for w in s.split(" "):
    puncts = [".", ",", "!"] # change according to needs
    for c in puncts:
        # if a punctuation mark is in the word, take the punctuation and add it to the rest of the word, in the beginning
        if c in w:
            w = c + w[:-1] # w[:-1] gets everything before the last char
    o.append(w)
o = reversed(o) # reversing list to reverse sentence
print(" ".join(o)) # printing it as sentence
#output: .try no is there ,not do or Do

Your code does exactly what it should; splitting on spaces doesn't separate a dot or comma from a word.
I'd suggest you use re.findall to get all the words and all the punctuation that interests you:
import re
def reversestring(thestring):
    words = re.findall(r"\w+|[.,]", thestring)
    rev = ' '.join(reversed(words))
    return rev
reversestring("Do or do not, there is no try.") # ". try no is there , not do or Do"

You can use regular expressions to parse the sentence into a list of words and a list of separators, then reverse the word list and combine them together to form the desired string. A solution to your problem would look something like this:
import re
def reverse_it(s):
    t = "" # result, empty string
    words = re.findall(r'(\w+)', s) # just the words
    not_s = re.findall(r'(\W+)', s) # everything else
    j = len(words)
    k = len(not_s)
    words.reverse() # reverse the order of word list
    if re.match(r'(\w+)', s): # begins with a word
        for i in range(k):
            t += words[i] + not_s[i]
        if j > k: # and ends with a word
            t += words[k]
    else: # begins with punctuation
        for i in range(j):
            t += not_s[i] + words[i]
        if k > j: # ends with punctuation
            t += not_s[j]
    return t #result
def check_reverse(p):
    q = reverse_it(p)
    print("\"%s\", \"%s\"" % (p, q) )
check_reverse('Do or do not, there is no try.')
Output
"Do or do not, there is no try.", "try no is there, not do or Do."
It is not a very elegant solution but sure does work!

Why does str.capitalize() not work as I expect?

Please, let me know if I'm not providing enough information. The goal of the program is to capitalize the first letter of every sentence.
usr_str = input()
def fix_capitalization(usr_str):
    list_of_sentences = usr_str.split(".")
    list_of_sentences.pop() #remove last element: ""
    new_str = ''
    for sentence in list_of_sentences:
        new_str += sentence.capitalize() + "."
    return new_str
print(fix_capitalization(usr_str))
For instance, if I input "hi. hello. hey." I expect it to output "Hi. Hello. Hey." but instead, it outputs "Hi. hello. hey."

An alternative would be to build a list of strings then concatenate them:
def fix_capitalization(usr_str):
    list_of_sentences = usr_str.split(".")
    output = []
    for sentence in list_of_sentences:
        new_sentence = sentence.strip().capitalize()
        # If empty, don't bother
        if new_sentence:
            output.append(new_sentence)
    # Finally, join everything
    return ". ".join(output) +"."

You've entered the sentences with spaces between them. When you split the string at the '.' character, those spaces remain at the start of each piece, and capitalize() only upper-cases the first character, which is then a space. I checked what the elements in the list were after the split, and the result was this:
'''
['hi', ' hello', ' hey', '']
'''
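
A quick way to see the effect for yourself is the following small sketch, which uses the input from the question and one of the pieces from the list above. Stripping the leading space before capitalizing is one possible fix, not the only one.

```python
sentences = "hi. hello. hey.".split(".")
print(sentences)  # ['hi', ' hello', ' hey', '']

# capitalize() upper-cases only the first character, which here is a space
print(repr(" hello".capitalize()))          # ' hello'

# Stripping the whitespace first makes the letter the first character
print(repr(" hello".strip().capitalize()))  # 'Hello'
```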

Faster de-merge of all hashtags

I would like to de-merge hashtags from a Twitter dataset. For instance: "#sunnyday" would be "sunny day".
I have found the following code:
The code finds the hashtags and looks for matching words in a file called "wordlist.txt", which is a huge text file containing a lot of words.
The text file can be downloaded here:
http://www-personal.umich.edu/~jlawler/wordlist
Source: Term split by hashtag of multiple words
I modified it a bit to make sure that it works if a sentence is empty: " "
# Returns a list of common english terms (words)
def initialize_words():
    content = None
    with open('wordlist.txt') as f: # A file containing common english words
        content = f.readlines()
    return [word.rstrip('\n') for word in content]
def parse_sentence(sentence, wordlist):
    new_sentence = "" # output
    # MODIFICATION: If the sentence is not empty
    if sentence != '':
        terms = sentence.split(' ')
        for term in terms:
            # MODIFICATION: If the term is not empty
            if term != '':
                if term[0] == '#': # this is a hashtag, parse it
                    new_sentence += parse_tag(term, wordlist)
                else: # Just append the word
                    new_sentence += term
                new_sentence += " "
    return new_sentence
def parse_tag(term, wordlist):
    words = []
    # Remove hashtag, split by dash
    tags = term[1:].split('-')
    for tag in tags:
        word = find_word(tag, wordlist)
        while word != None and len(tag) > 0:
            words.append(word)
            if len(tag) == len(word): # Special case for when eating rest of word
                break
            tag = tag[len(word):]
            word = find_word(tag, wordlist)
    return " ".join(words)
def find_word(token, wordlist):
    i = len(token) + 1
    while i > 1:
        i -= 1
        if token[:i] in wordlist:
            return token[:i]
    return None
The problem is that it takes forever to run!
How can I make it faster?

Use a set instead of a list for your wordlist variable.
This will be a massive performance improvement because with a list you need to (potentially) scan the entire word list, so each lookup is O(n). With a set, it's O(1) on average because membership is checked by calculating a hash of the item and using that as an index into the backing storage.
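
As a rough sketch of that change, only the function that builds the word list needs to return a set; find_word can stay logically the same. To keep the example runnable without wordlist.txt, initialize_words here takes an iterable of lines (an assumption made for this sketch; the original reads the file itself), and the three sample words are made up.

```python
def initialize_words(lines):
    """Build a set of words; `lines` stands in for the lines of wordlist.txt."""
    return {line.rstrip('\n') for line in lines}

def find_word(token, wordlist):
    # Try the longest prefix first, shrinking until a known word is found
    for i in range(len(token), 0, -1):
        if token[:i] in wordlist:  # average O(1) when wordlist is a set
            return token[:i]
    return None

# Hypothetical stand-in for the file contents
wordlist = initialize_words(["sunny\n", "sun\n", "day\n"])
print(find_word("sunnyday", wordlist))  # sunny
print(find_word("day", wordlist))       # day
print(find_word("xyz", wordlist))       # None
```

With the real file, passing the open file object (for example `with open('wordlist.txt') as f: wordlist = initialize_words(f)`) should give the same result, since iterating a file yields its lines.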

Separate words into list, except for symbols

I'm creating a project where I'll receive a list of tweets (Twitter), and then check whether their words are in a dictionary, which maps words to certain values. I've gotten my code to take the words, but I don't know how to eliminate symbols like: , . ":
Here's the code:
def getTweet(tweet, dictionary):
    score = 0
    seperate = tweet.split(' ')
    print(seperate)
    print("------")
    if(len(tweet) > 0):
        for item in seperate:
            if item in dictionary:
                print(item)
                score = score + int(dictionary[item])
        print("here's the score: " + str(score))
        return score
    else:
        print("you haven't tweeted a tweet")
        return 0
Here's the parameter/tweet that will be checked:
getTweet("you are the best loyal friendly happy cool nice", scoresDict)
Any ideas?

If you want to get rid of all the non-alphanumeric characters you can try:
import re
re.sub(r'[^\w]', ' ', string)
The pattern [^\w] matches every character that is not a letter, digit, or underscore, so this replaces each of them with a space.

Before doing the split, replace the characters with spaces, and then split on the spaces.
import re
line = ' a.,b"c'
line = re.sub('[,."]', ' ', line)
print(line) # ' a  b c'
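
Putting the regex idea together with the scoring loop from the question, a minimal sketch could look like this. The function name score_tweet and the sample scores dictionary are hypothetical (the real scoresDict is not shown in the question), and lower-casing the tweet is an extra assumption that only helps if the dictionary keys are lower case.

```python
import re

def score_tweet(tweet, scores):
    """Sum the scores of known words, ignoring punctuation and case."""
    # \w+ keeps runs of letters, digits and underscores and drops the symbols
    words = re.findall(r"\w+", tweet.lower())
    return sum(int(scores[w]) for w in words if w in scores)

# Hypothetical scores dictionary, standing in for scoresDict
scores = {"best": "3", "happy": "2", "cool": "1"}
print(score_tweet('You are the "best", happy: cool.', scores))  # 6
```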
