I've got a problem and don't know how to solve it.
E.g., I have a dynamically expanding file which contains lines split by '\n'.
Each line is a message (a string) built from a constant pattern plus a value part that is specific to that line.
For example:
line 1: The temperature is 10 above zero
line 2: The temperature is 16 above zero
line 3: The temperature is 5 degree zero
So, as you see, the constant part (pattern) is
The temperature is zero
Value part:
For line 1 will be: 10 above
For line 2 will be: 16 above
For line 3 will be: 5 degree
Of course, this is a very simple example.
In fact there are a great many lines and about ~50 patterns in one file.
The value part may be anything: a number, a word, punctuation, etc.!
And my question is - how can I find all possible patterns from data?
This sounds like a log message clustering problem.
Trivial solution: replace all numbers with the string NUMBER using a regex. You might need to exclude dates or IP addresses or something. That might be enough to give you a list of all patterns in your logs.
Alternately, you might be able to count the number of words (whitespace-delimited fields) in each message and group the messages that way. For example, maybe all messages with 7 words are in the same format. If two different messages have the same format you can also match on the first word or something.
If neither of the above work then things get much more complicated; clustering arbitrary log messages is a research problem. If you search for "event log clustering" on Google Scholar you should see a lot of approaches you can learn from.
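As a rough illustration of the first idea, the sketch below normalises numbers and collects the distinct templates; the filename and the NUMBER placeholder are assumptions for the example, not part of your data:
import re
from collections import Counter

pattern_counts = Counter()
with open('messages.log') as f:          # assumed filename
    for line in f:
        # collapse every run of digits into a placeholder token
        template = re.sub(r'\d+', 'NUMBER', line.strip())
        pattern_counts[template] += 1

# each key is now a candidate pattern; the count shows how common it is
for template, count in pattern_counts.most_common():
    print(count, template)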
If the number of words in a line is fixed, as in your example string, then you can use str.split():
text = '''
The temperature is 10 above zero
The temperature is 16 above zero
The temperature is 5 degree zero
'''
for line in text.split('\n'):
    if len(line.split()) >= 5:
        a, b = line.split()[3], line.split()[4]
        print(a, b)
Output:
10 above
16 above
5 degree
First, we would read the file line by line and add all sentences to a list.
In the example below, I am adding a few lines to a list.
This list holds all the sentences.
lstSentences = ['The temperature is 10 above zero', 'The temperature is 16 above zero', 'The temperature is 5 degree above zero','Weather is ten degree below normal', 'Weather is five degree below normal' , 'Weather is two degree below normal']
Create a list to store all patterns
lstPatterns=[]
Initialize
intJ = len(lstSentences)-1
Compare one sentence against the one that follows it. If there are more than 2 matching words between two sentences, perhaps this is a pattern.
for inti, sentence in enumerate(lstSentences):
    if intJ != inti:
        lstMatch = [matching for matching in sentence.split()
                    if matching in lstSentences[inti + 1].split()]
        if len(lstMatch) > 2:  # we need more than 2 matching words between sentences
            if ' '.join(lstMatch) not in lstPatterns:  # if not in list, add
                lstPatterns.append(' '.join(lstMatch))
    lstMatch = []
print(lstPatterns)
I am assuming patterns come one after the other (i.e., 10 rows with one pattern and then 10 rows with another pattern). If not, the above code needs to change.
I am stymied trying to search for a string using regex when the string spans two lines in my text file. I have searched Stack Overflow to no end. I have tried the regex \s and DOTALL and other things for some time now.
The problem seems to be that I am iterating line by line, which seems to me to be the right thing to do for large files. I get that looking line by line shouldn't pick up something on the next line, but I thought there would be a regex flag for this. I can't seem to find the relevant flag.
I have also tried various if/then constructs to deal with looking around the corner, so to speak. But first, that doesn't seem Pythonic, and second, I keep getting errors like "cannot concatenate strings and lists". And doing it that way just keeps getting more and more complicated.
Here is my script:
captured_text = []
captured_multi_nums = []
with open('text.txt', mode='r') as ptnt_txt:
    for line in ptnt_txt:
        my_txt_pull = re.findall("[a-zA-Z]+ [a-zA-Z]+ [0-9][0-9]*", line, re.M)
        if my_txt_pull:  # captures nonempty list
            for item in my_txt_pull:
                captured_text.append(item)
make_text_unique = set(captured_text)
with open('patent_fig_number_output.txt', 'w') as f:
    for item in make_text_unique:
        f.write(item)
        f.write('\n')
Here is the text.txt file I use; what does NOT get captured is "line chaz \n56":
"chaz help for chaz new line chaz
56 you see.
To keep the telescoping arms 40 in their final extended state after deployment in a vessel, a one-way latch may be used to lock adjacent segments 42. FIG. 5 shows one possible latch 44, in a first position, for locking the telescoping arms 40. The latch 44 may consist of a one or more grooves 46 associated with a first segment 48 and a tooth 50 associated with a second, adjacent segment 52. As the telescoping arm 40 is expanded, the second segment 52 moves in a first direction A relative to the first segment 48. The tooth 50 and the grooves 46 are aligned so as to engage when the telescoping arm 40 is extended. Once the tooth 50 engages a groove 46, as shown in FIG. 6, the second segment 52 may not move in a second direction B relative to the first segment 48. Accordingly, the telescoping arm 40 is free to extend but may not collapse once extended. Of course other one-way latches may be used to lock the segments 42 of the telescoping arms 40. FIG. 7 illustrates one possible cross-section of a segment 42 of the telescoping arm 40. This "rail" design permits room for sliding and positioning of a one-way latch, like the one shown in FIG. 5, between segments 42 shown in FIG. 4."
The statement for line in ptnt_txt: splits the input on newlines and
processes it line by line, so you can't apply the regex across lines.
Use ptnt_txt.read() instead to slurp the whole text into a variable.
In the regex "[a-zA-Z]+ [a-zA-Z]+ [0-9][0-9]*" the atoms are separated by
literal spaces, which never match newlines. Try something like:
"[a-zA-Z]+\s*[a-zA-Z]+\s*[0-9][0-9]*"
Then the lines between #4 and #8 will look like:
s = ptnt_txt.read()
my_txt_pull = re.findall("[a-zA-Z]+\s+[a-zA-Z]+\s+[0-9][0-9]*", s, re.M)
if my_txt_pull:
    for item in my_txt_pull:
        captured_text.append(item.replace('\n', ' '))
Here's what I ended up with, based on @tshiono's answer and the first comment, which was eye-opening:
patent_file = 'test2.txt'
with open(patent_file, mode='r') as ptnt_txt:
    patent_txt = ptnt_txt.read().replace('\n', ' ')
my_txt_pull = re.findall("[a-zA-Z]+ [a-zA-Z]+\s+[0-9][0-9]*", patent_txt, re.M)
I have read a lot of tutorials about BPE, but I am still confused about how it works.
For example,
in a tutorial online they said the following:
Algorithm
1. Prepare a large enough training data set (i.e. a corpus).
2. Define a desired subword vocabulary size.
3. Split each word into a sequence of characters and append the suffix "</w>" to the end of the word, together with the word frequency. So the basic unit is the character at this stage. For example, if the frequency of "low" is 5, then we rephrase it to "l o w </w>": 5.
4. Generate a new subword according to the highest-frequency occurrence.
5. Repeat step 4 until reaching the subword vocabulary size defined in step 2, or until the next highest-frequency pair has a count of 1.
Taking "low": 5, "lower": 2, "newest": 6 and "widest": 3 as an example, the highest-frequency subword pair is e and s, because we get 6 counts from "newest" and 3 counts from "widest". The new subword (es) is then formed, and it becomes a candidate in the next iteration.
In the second iteration, the next highest-frequency subword pair is es (generated in the previous iteration) and t, again because we get 6 counts
from "newest" and 3 counts from "widest".
I do not understand why low is 5 and lower is 2:
does this mean l, o, w, lo, ow + ... = 6, and then lower equals two? But why is it not e, r, er, which gives three?
The numbers you are asking about are the frequencies of the words in the corpus. The word "low" was seen in the corpus 5 times and the word "lower" 2 times (they just assume this for the example).
In the first iteration we see that the character pair "es" is the most frequent one because it appears 6 times in the 6 occurrences of "newest" and 3 times in the 3 occurrences of the word "widest".
In the second iteration we have "es" as a unit in our vocabulary the same way we have single characters. Then we see that "est" is the most common character combination ("newest" and "widest").
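To make the counting step concrete, here is a small sketch of how the pair frequencies in that example are computed (the </w> end-of-word marker and the frequencies are the ones from the tutorial; the code itself is just an illustration):
from collections import Counter

# toy corpus from the example: each word is split into symbols, with its frequency
vocab = {('l', 'o', 'w', '</w>'): 5,
         ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
         ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6,
         ('w', 'i', 'd', 'e', 's', 't', '</w>'): 3}

def count_pairs(vocab):
    # count adjacent symbol pairs, weighted by the word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

print(count_pairs(vocab).most_common(3))
# ('e', 's'), ('s', 't') and ('t', '</w>') all get 9 = 6 (newest) + 3 (widest);
# BPE merges one of them (the tutorial picks "es") and repeats the count on the new vocabulary.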
Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]
Basically, I want to go through all the lines in this column of the DataFrame and keep the rows with a unique cashtag, regardless if it is in caps or not, and delete all others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0, len(df) - 1):
    print(i, end="\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df["Word_count"][i] = count
However if I do this I will be deleting rows that I don't want to. For example, rows where the unique cashtag is mentioned several times ([3],[5])
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
    count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
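Run against the example data, that filter keeps exactly the rows you asked for; the DataFrame construction below is only there to make the sketch self-contained:
import pandas as pd

df = pd.DataFrame({"Current": [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]})
cashtags = ["$AAPL", "$FB", "$AMZN"]

# keep only rows that mention exactly one distinct cashtag (case-insensitively)
result = df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
print(result)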
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$\w+)(?!.*\1) (see e.g. here for a detailed explanation), but the general structure is:
\$\w+: find a dollar sign followed by one or more letters/digits (or
an _). If you just wanted to count how many tags you had, this is all you need,
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this counts repeated tags more than once, so you need to add a negative lookahead:
(?!.*\1): a negative lookahead; it means don't match if the same captured tag appears again later on. The effect is that only the last occurrence of each tag is counted in the string.
Using this, you can then use the pandas string methods, specifically .str.count (passing re.I as the flags argument makes the match case-insensitive):
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?
I'm preparing text for a word cloud, but I get stuck.
I need to remove all digits and all signs like . , - ? = / ! # etc., but I don't know how. I don't want to call replace again and again. Is there a method for that?
Here is my concept and what I have to do:
Concatenate texts in one string
Set chars to lowercase <--- I'm here
Now I want to delete specific signs and divide the text into words (a list)
Calculate the frequency of the words
Next, run the stopwords script...
abstracts_list = open('new', 'r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
Lower = allab.lower()
Text example:
MicroRNAs (miRNAs) are a class of noncoding RNA molecules
approximately 19 to 25 nucleotides in length that downregulate the
expression of target genes at the post-transcriptional level by
binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus
(EBV) generates at least 44 miRNAs, but the functions of most of these
miRNAs have not yet been identified. Previously, we reported BRUCE as
a target of miR-BART15-3p, a miRNA produced by EBV, but our data
suggested that there might be other apoptosis-associated target genes
of miR-BART15-3p. Thus, in this study, we searched for new target
genes of miR-BART15-3p using in silico analyses. We found a possible
seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The
luciferase activity of a reporter vector including the 3'-UTR of
TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated
the expression of TAX1BP1 mRNA and protein in AGS cells, while an
inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1
mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB
activity in gastric cancer cell lines. Moreover, miR-BART15-3p
strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our
results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1
gene in cancer cells, causing increased apoptosis and chemosensitivity
to 5-FU.
So to set upper case characters to lower case characters you could do the following:
Just store your text in a string variable, for example STRING, and then use the command (this assumes import re at the top):
STRING = re.sub('([A-Z]{1})', r'\1', STRING).lower()
Now your string will be free of capital letters.
To remove the special characters, the re module's sub command can again help you:
STRING = re.sub('[^a-zA-Z0-9-_*.]', ' ', STRING)
With this command your string will be free of special characters.
And to determine the word frequency you could use the collections module, from which you import Counter.
Then use the following command to determine the frequency with which the words occur:
Counter(STRING.split()).most_common()
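Putting those pieces together, a rough end-to-end sketch might look like the following; note that here I drop digits and everything else that is not a letter, since you said you want them removed (the filename 'new' is taken from your own code, and the stopword step would come after the Counter):
import re
from collections import Counter

with open('new', 'r') as f:
    text = f.read().lower()               # concatenate and lowercase in one go
text = re.sub(r'[^a-z\s]', ' ', text)      # remove digits and special signs
words = text.split()                       # divide the text into words
freq = Counter(words).most_common()        # word frequencies
print(freq[:20])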
I'd probably try to use string.isalpha():
abstracts = []
with open('new', 'r') as abstracts_list:
    for ab in abstracts_list:  # this gives one line of text
        if not ab.isalpha():
            # keep letters and whitespace so the words stay separated
            ab = ''.join(c for c in ab if c.isalpha() or c.isspace())
        abstracts.append(ab.lower())
# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)
Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        # return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'), which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
Split the corpus into lines.
For each line: find the syllable length and stress pattern.
Classify the stress patterns.
You'll find that the first step is a single function call in Python:
corpus.split("\n")
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up itself and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a postgrad for a couple of months to help you, perhaps in lieu of some workshop requirement.
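To make that structure concrete, here is a bare skeleton of the top-down decomposition; every name in it is a placeholder of my own, not working poem analysis:
corpus = open('poem.txt').read()           # assumed input file

def line_stress(line):
    # step 2: look up each word in cmudict and concatenate its stress digits
    return ""                              # stub

def classify(patterns):
    # step 3: compare the per-line patterns against known forms
    return "unknown"                       # stub

lines = corpus.split("\n")                 # step 1
patterns = [line_stress(line) for line in lines]
print(classify(patterns))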
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are, as the number of changes needed to turn one string into another. Pros: easy to implement. Cons: not normalized; a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3. (See the sketch right after this list.)
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement, but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last-year undergrad should not have too much trouble with it (hint hint).
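As an illustration of the first option, here is a minimal pure-Python sketch of Levenshtein distance, normalised into a 0.0-1.0 similarity so that patterns of different lengths are comparable (this is my own sketch, not code from the links mentioned above):
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # 1.0 means identical, 0.0 means completely different
    return 1.0 - levenshtein(a, b) / float(max(len(a), len(b)))

print(similarity("1101111101", "0101010101"))   # how close a line is to iambic pentameter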
I think those were all your questions. Hope this helps a bit.
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick the ones that look like known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
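One way to generate those permutations is itertools.product over per-word candidate stress strings; the word list here is a made-up illustration, not output from nltk:
from itertools import product

word_options = [["1", "0"],    # single-syllable word: try both stresses
                ["01"],        # word with a single cmudict pronunciation
                ["10", "01"]]  # word with two pronunciations

candidates = ["".join(combo) for combo in product(*word_options)]
print(candidates)   # ['10110', '10101', '00110', '00101']
Each candidate line pattern can then be scored against the known forms, e.g. with the fuzzy matching mentioned above.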
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
And the code included in this question helped me, so I'm posting what I came up with, building on that foundation. It is one way to extract the stress as a single string, correct for the cmudict bias with a 'fudging factor', and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict

prondict = cmudict.dict()

#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        # print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                # print strip_letters(s), word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}

# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}

# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    # print "strip_letters"
    nm = ''
    for ws in ls:
        # print "ws", ws
        for ch in list(ws):
            # print "ch", ch
            if ch.isdigit():
                nm = nm + ch
    # print "ad to nm", nm, type(nm)
    return nm

# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
"""