My YAML database:
left:
- title: Active Indicative
fill: "#cb202c"
groups:
- "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
My Python code:
import io
import yaml
with open("C:/Users/colin/Desktop/LBot/latin3_2.yaml", 'r', encoding="utf8") as f:
    doc = yaml.safe_load(f)
txt = doc["left"][1]["groups"][1]
print(txt)
Currently my output is Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt] but I would like the output to be ō, is, it, or imus. Is this possible in PyYaml and if so how would I implement it? Thanks in advance.
I don't have a PyYaml solution, but if you already have the string from the YAML file, you can use Python's regex module to extract the text inside the [ ].
import re
txt = "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
parts = txt.split(" | ")
print(parts)
# ['Present', 'dūc[ō]', 'dūc[is]', 'dūc[it]', 'dūc[imus]', 'dūc[itis]', 'dūc[unt]']
pattern = re.compile(r"\[(.*?)\]")
output = []
for part in parts:
    match = pattern.search(part)
    if match:
        # group(0) is the matched part, ex. [ō]
        # group(1) is the text inside the (.*?), ex. ō
        output.append(match.group(1))
    else:
        output.append(part)
print(" | ".join(output))
# Present | ō | is | it | imus | itis | unt
The code first splits the text into individual parts, then loops through each part searching for the pattern [x]. If it finds a match, it extracts the text inside the brackets from the match object and stores it in a list. If the part does not match the pattern (e.g. 'Present'), it is added as is.
At the end, all the extracted strings are joined together to rebuild the string without the brackets.
EDIT based on comment:
If you just need one of the strings inside the [ ], you can use the same regex pattern but use the findall method instead on the entire txt, which will return a list of matching strings in the same order that they were found.
import re
txt = "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
pattern = re.compile("\\[(.*?)\\]")
matches = pattern.findall(txt)
print(matches)
# ['ō', 'is', 'it', 'imus', 'itis', 'unt']
Then it's just a matter of using some variable to select an item from the list:
selected_idx = 1 # 0-based indexing, so this selects the 2nd item
print(matches[selected_idx])
# is
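The two steps above (pull out the bracketed endings, then pick one) can be wrapped in a small helper that is applied to the string after it has been loaded from the YAML file. This is only a sketch: the names parse_group and ENDING are made up for illustration, and it assumes every group line has the shape "Label | form[ending] | ...".

```python
import re

# Matches the text between square brackets, non-greedily.
ENDING = re.compile(r"\[(.*?)\]")

def parse_group(line):
    """Return (label, endings) for a line such as 'Present | dūc[ō] | dūc[is]'.

    Hypothetical helper: the label is assumed to be the text before the first '|'.
    """
    label = line.split("|", 1)[0].strip()
    endings = ENDING.findall(line)
    return label, endings

txt = "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
label, endings = parse_group(txt)
print(label, endings[0])
```

In the asker's script, txt would be the value read from doc["left"][...]["groups"][...] rather than a literal.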
I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'
Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-']
                                      )
The resulting alignments - pretty close already:
>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
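If installing Biopython is not an option, a rough word-level alignment can also be sketched with the standard library's difflib. To be clear, this swaps in difflib.SequenceMatcher in place of pairwise2 and fuzzywuzzy, and the function name align_words is invented for this example. Runs of words that differ between the two strings are mapped onto the joined replacement text, which happens to reproduce the desired output for the example in the question.

```python
from difflib import SequenceMatcher

def align_words(source, target):
    """Return (source_word, aligned_target_text) pairs.

    Sketch only: exact-match alignment on whitespace-separated tokens.
    """
    a, b = source.split(), target.split()
    pairs = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # identical runs align word for word
            pairs.extend(zip(a[i1:i2], b[j1:j2]))
        elif tag == "replace":
            # every differing source word points at the whole replacement run
            joined = " ".join(b[j1:j2])
            pairs.extend((word, joined) for word in a[i1:i2])
        elif tag == "delete":
            # source words with no counterpart in the target
            pairs.extend((word, None) for word in a[i1:i2])
        # "insert": target-only words have no source word to attach to
    return pairs

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignment = dict(align_words(string1, string2))
print(alignment['dot'], alignment['live'])
```

Note that turning the pairs into a dict loses information if a word occurs more than once in the source; keep the list of pairs in that case.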
Previous answers offer biology-based alignment methods; there are NLP-based alignment methods as well. The most standard would be the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular it's possible to mix alignment at the level of words and characters, as well as different measures (e.g. SoftTFIDF, see this answer).
The Needleman-Wunsch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT: A A A C C C T T A G
+ + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match the word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then the order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The two alignments above have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments separately instead of computing the score only once.
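For reference, here is a minimal sketch of Needleman-Wunsch working on lists of words instead of DNA letters. The scoring values (match=1, mismatch=-1, gap=-1) and the names needleman_wunsch and GAP are assumptions chosen for illustration, not settings taken from any particular library.

```python
GAP = "<GAP>"

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; returns two lists of equal length."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]

top, bottom = needleman_wunsch("the cat sat".split(), "the sat".split())
print(top)
print(bottom)
```

When several alignments tie for the best score, this traceback returns only one of them (it prefers the diagonal step, then a gap in the second sequence).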
In my task I want to fetch only the times and store them in a variable. A time may occur more than once in my string, and it may be followed by "AM" or "PM".
I only want to store these values from my string:
"4:19:27" and "7:00:05". The time may occur more than twice.
str = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
My code is:
str = '''TEXT VIEW : 16908310=android.widget.TextView#405ee2f0=Troubles | 2131034163=android.widget.TextView#405ef6d0=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#40630608=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#40631068=OK | 2131034162=android.widget.TextView#40632078=Sep 12, 2017 4:19:27 AM | VIEW : -1=android.widget.LinearLayout#405ed390 | -1=android.widget.FrameLayout#405edd48 | 16908310=android.widget.TextView#405ee2f0 | 16908290=android.widget.FrameLayout#405eefa8 | -1=android.widget.LinearLayout#405ef468 | 2131034163=android.widget.TextView#405ef6d0 | -1=android.widget.ScrollView#405effc8 | 2131034164=android.widget.TableLayout#405f0cd0 | 2131034158=android.widget.TableRow#4062f7a8 | 2131034159=android.widget.ImageView#4062fcd0 | 2131034160=android.widget.TextView#40630608 | 2131034161=android.widget.RadioButton#40631068 | 2131034162=android.widget.TextView#40632078 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ecb98 | BUTTONS : 2131034161=android.widget.RadioButton#40631068 |'''
if " AM " or " PM " in str:
Time = str.split(" AM " or " PM ")[0].rsplit(None, 1)[-1]
print Time
Note that you shouldn't give a variable the name of a built-in such as str, because that shadows the built-in. You could use a regular expression, like this:
import re
my_string = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
pattern = r'\d{1,2}:\d{2}:\d{2}\s[AP]M'
date_list = re.findall(pattern, my_string)
print(date_list)
# outputs ['4:19:27 AM', '7:00:05 PM']
Explanation of the pattern:
\d{1,2} matches one or two digits
: matches ":"
\d{2} matches exactly two digits
: matches ":"
\d{2} matches exactly two digits
\s matches a whitespace character (here, the space)
[AP] matches either an A or a P, only one
M matches the final M
Use regex with this expression: ([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM). This pattern will give you two groups: one for the numbers of the time and one for the AM or PM information. This is much better than splitting the string manually. You can test it here, and get used to using regex.
All in all you can use it like this in python:
import re
p = re.compile(r'([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM)')
# findall returns one (numbers, status) tuple per match in the string
for numbers, status in p.findall(theString):
    # prints the numbers, e.g. 4:19:27
    print(numbers)
    # prints AM or PM
    print(status)
It's not a good idea to use str as a variable name because that's a builtin
so assuming your string is in s, here is an interactive demonstration of
what I think you want.
>>> import re
>>> re.findall('[=][^|=]+[AP]M [|]', s)
['=Sep 12, 2017 4:19:27 AM |', '=Sep 12, 2017 7:00:05 PM |']
>>> [r.split() for r in re.findall('[=][^|=]+[AP]M [|]', s)]
[['=Sep', '12,', '2017', '4:19:27', 'AM', '|'], ['=Sep', '12,', '2017', '7:00:05', 'PM', '|']]
>>> [r.split()[3] for r in re.findall('[=][^|=]+[AP]M [|]', s)]
['4:19:27', '7:00:05']
>>>
Regular expressions are your friend here. For example:
import re
inputstring = '''...'''
timematch = re.compile(r'\d{1,2}:\d{1,2}:\d{1,2} [AP]M')
print(timematch.findall(inputstring))
The regular expression in question matches any occurrence of XX:XX:XX AM and XX:XX:XX PM, and takes into account time noted as 4:00:00 AM as well as 04:00:00 AM.
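If the goal is to store the times in variables for later comparison, the matched strings can be converted into datetime.time objects. The sketch below is one possible way to do that; the names TIME_RE, extract_times and sample are invented here, and parsing "AM"/"PM" with %p assumes an English (or C) locale.

```python
import re
from datetime import datetime

# Same idea as the patterns above: h:mm:ss followed by AM or PM.
TIME_RE = re.compile(r"\d{1,2}:\d{2}:\d{2} [AP]M")

def extract_times(text):
    """Return every 12-hour time found in text as a datetime.time object."""
    # %I is the 12-hour clock hour, %p is AM/PM (locale dependent).
    return [datetime.strptime(found, "%I:%M:%S %p").time()
            for found in TIME_RE.findall(text)]

# Shortened stand-in for the long string in the question.
sample = "TextView#4082ac98=Sep 12, 2017 4:19:27 AM | TextView#407520c8=Sep 12, 2017 7:00:05 PM |"
times = extract_times(sample)
print([t.isoformat() for t in times])
```

Each element of times can then be assigned to its own variable or compared directly, since time objects support ordering.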
It would be easy to use regex:
You can use this regex:
\d+:\d+:\d+
or r'\d{1,2}:\d{1,2}:\d{1,2}'
Code: https://repl.it/Kyqe/0
Example
300 january 10 20
300 februari 120,30 10
300 march 20,30 10
300,10 april 20,30 10
300 may 420,10 10,46
I want to reorder columns.
The first thing I do is to separate the columns in the text using a separator, e.g.
(?<=\S)(\s{2,})(?=\S) or
(?<=\S)(\s{1,})(?=\S)
Then I want to put the columns in a list like this:
|300 | |january | | 10 | |20 |
|300 | |februari| |120,30| |10 |
|300 | |march | |20,30 | |10 |
|300,10| | april | | 20,30| |10 |
|300 | |may | |420,10| |10,46|
expected output:
mylist = [['300 ','january ',' 10 ','20 '],
['300 ','februari','120,30','10 '],
['300 ','march ','20,30 ','10 '],
['300,10',' april ',' 20,30','10 '],
['300 ','may ','420,10','10,46']]
I have no idea how to capture the spaces.
I tried this to capture the spaces after use of the separator:
#find the max length of an element in a column
length_temp = [[len(x) for x in row] for row in mylist]
maxlengthcolumn = max(l[len(mylist[0])-1] for l in length_temp)
#add spaces to elements
for b in range(0,len(mylist)):
    if length_temp[b][len(mylist[-1])-1] < maxlengthcolumn:
        mylist[b][len(mylist[-1])-1] = mylist[b][len(mylist[-1])-1] + ' '*(maxlengthcolumn-length_temp[b][len(mylist[-1])-1])
but this removes the spaces before the elements in a column.
How can I capture the elements in a list as in my example above?
Assuming that you're working with strings, you can use ord() to obtain the ASCII values, and split your string where the alphabetic and numeric runs begin and end.
To break it down:
Read the text in one line at a time (from what I've read, it looks like your original text could be a .txt file). To import it, you can use file I/O methods (more about that here and here).
Pass each line as a string and convert it to ASCII values using ord(); store these values in a separate variable.
Set up logic to see where words/numbers begin. You should be looking for a pattern of an alpha or numeric character, followed by zero or more alpha/numeric characters, followed by spaces, and after that series of spaces another alpha or numeric character. Store the location of each beginning (defined as the first character in the string, or the first alpha/numeric character that follows a series of spaces).
Index the line of text you're currently working with and pull out the desired strings.
This might be unclear, so see the pseudocode below:
strings_start = [5, 12, 22] # this would be where the words/numbers begin in the string that holds a line of your text
# we'll assume you have some variable, line, which holds the current line of the text you're parsing in a loop
string_list = []
for i in range(len(strings_start)):
    if i < len(strings_start) - 1: # subtract 1 because indexes start at 0
        # slice from this start position up to the next start position
        string_list.append(line[strings_start[i]:strings_start[i + 1]])
    else:
        # the last field runs to the end of the line
        string_list.append(line[strings_start[i]:])
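The start positions do not have to be worked out by hand with ord(). As a sketch of the same idea, the version below uses re.finditer to locate where each run of non-space characters begins and then slices the line between consecutive starts, so each field keeps its trailing padding. The function name split_keep_padding is made up for this example, and it does not try to recover leading padding that belongs to a right-aligned column.

```python
import re

def split_keep_padding(line):
    """Split a line into fields, keeping the spaces that follow each field."""
    # Start index of every run of non-whitespace characters.
    starts = [m.start() for m in re.finditer(r"\S+", line)]
    if not starts:
        return []
    starts[0] = 0  # any leading spaces stay with the first field
    bounds = starts + [len(line)]
    # Slice from each start up to the next start (or the end of the line).
    return [line[begin:end] for begin, end in zip(bounds, bounds[1:])]

fields = split_keep_padding("300   january  10  20")
print(fields)
```

Joining the fields back together with ''.join(fields) reproduces the original line, which is a quick way to check that no spaces were lost.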
I have this code below that compares a piece of text to a stop word set and returns a list of words in the text that are not in the stop word set. Then I change the list of words to a string, so I can use it in the textmining module to create a term document matrix.
I have checks in the code that show that hyphenated words are being maintained in the list and in the string, but once I pass them through the TDM part of the code, the hyphenated words are broken up. Is there a way to maintain hyphenated words in the textmining module and TDM?
import re
f= open ("words") #dictionary
stops = set()
for line in f:
    stops.add(line.strip())
f = open ("azathoth") #Azathoth (1922)
azathoth = list()
for line in f:
    azathoth.extend(re.findall("[A-z\-\']+", line.strip()))
azathothcount = list()
for w in azathoth:
    if w in stops:
        continue
    else:
        azathothcount.append(w)
print azathothcount[1:10]
raw_input('Press Enter...')
azathothstr = ' '.join(azathothcount)
print azathothstr
raw_input('Press Enter...')
import textmining
def termdocumentmatrix_example():
    doc1 = azathothstr
    tdm = textmining.TermDocumentMatrix()
    tdm.add_doc(doc1)
    tdm.write_csv('matrixhp.csv', cutoff=1)
    for row in tdm.rows(cutoff=1):
        print row
    raw_input('Press Enter...')
termdocumentmatrix_example()
The textmining package defaults to its own ‘simple_tokenize’ function when initializing the TermDocumentMatrix class. add_doc() pushes your text through simple_tokenize() before adding it to the tdm.
help(textmining) yields, in part:
class TermDocumentMatrix(__builtin__.object)
| Class to efficiently create a term-document matrix.
|
| The only initialization parameter is a tokenizer function, which should
| take in a single string representing a document and return a list of
| strings representing the tokens in the document. If the tokenizer
| parameter is omitted it defaults to using textmining.simple_tokenize
|
| Use the add_doc method to add a document (document is a string). Use the
| write_csv method to output the current term-document matrix to a csv
| file. You can use the rows method to return the rows of the matrix if
| you wish to access the individual elements without writing directly to a
| file.
|
| Methods defined here:
|
| __init__(self, tokenizer=<function simple_tokenize>)
|
| ...
|
|simple_tokenize(document)
| Clean up a document and split into a list of words.
|
| Converts document (a string) to lowercase and strips out
| everything which is not a lowercase letter.
So you’ll have to roll your own tokenizer that does not split on the hyphen, and pass it through when you initialize the TermDocumentMatrix class.
In my mind, it would be best if this process kept the rest of the functionality of the simple_tokenize() function, apart from breaking up hyphenated words, so you might route the hyphenated words around that function. Below, I've removed the hyphenated words from the document, pushed the remainder through simple_tokenize() and then merged the two lists (hyphenated words + simple_tokenize() results) before adding them to the tdm:
doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! '
import re
import textmining
def toknzr(txt):
    # collect words that contain at least one internal hyphen
    hyph_words = re.findall(r'\w+(?:-\w+)+',txt)
    remove = '|'.join(hyph_words)
    regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
    # strip the hyphenated words, then tokenize the rest as usual
    simple = regex.sub("", txt)
    return(hyph_words + textmining.simple_tokenize(simple))
tdm = textmining.TermDocumentMatrix(tokenizer = toknzr)
tdm.add_doc(doc1)
This may not be the most pythonic way to make your own tokenizer (feedback appreciated!), but the main point here is that you'll have to initialize the class with a new tokenizer and not use the default simple_tokenize().
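As one alternative, a tokenizer can be written as a single regular expression that treats an internal hyphen as part of the word. This is only a sketch: hyphen_tokenize is a made-up name, it lowercases the text and keeps only ASCII letters and internal hyphens, and it is not guaranteed to match everything simple_tokenize() does. It is shown standalone so that it runs without textmining installed.

```python
import re

# One or more letters, optionally followed by "-letters" groups.
WORD = re.compile(r"[a-z]+(?:-[a-z]+)*")

def hyphen_tokenize(document):
    """Lowercase the document and return its words, keeping internal hyphens."""
    return WORD.findall(document.lower())

doc1 = 'blah "blah" blahbitty-blah, in-the bloopity blip bleep br-rump! '
tokens = hyphen_tokenize(doc1)
print(tokens)
```

Going by the help text quoted above, such a function would be handed over as the tokenizer argument, i.e. textmining.TermDocumentMatrix(tokenizer=hyphen_tokenize). Unlike toknzr, it keeps the tokens in document order.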
My regex is not working properly. Below I show the text before and after applying it. I'm using re.search(r'(?ms).*?{{(Infobox film.*?)}}', text). As you can see, the result stops after | country = Assam, {{IND. Can you please help me? Thanks.
Before regex:
{{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free
}}
After regex:
{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND
Why does the regex stop at this point? country = Assam, {{IND
Edit: expected result
Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free
Your regex is catching everything between the first {{ and the first }}, which is in the "country" entry of the infobox. If you want everything between the first {{ and the last }}, then you want to make the .* inside the braces greedy by removing the ?:
re.search(r'(?ms).*?{{(Infobox film.*)}}', text)
Note that this will find the last }} in the input (e.g. if there's another template far below the end of the infobox, it will find the end of that), so this may not be what you want. When you have nested structures like this, regex is not always the best way to search.
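One way to handle the nesting without a regex is to scan the text and keep a count of how many {{ are currently open. The sketch below does that; extract_template is a hypothetical helper written for this answer, and it assumes braces only ever appear in balanced {{ }} pairs.

```python
def extract_template(text, name="Infobox film"):
    """Return the contents of the first {{name ...}} template, nesting included."""
    start = text.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}":
            depth -= 1          # leaving a template
            i += 2
            if depth == 0:
                # drop the outer {{ and }} themselves
                return text[start + 2:i - 2]
        else:
            i += 1
    return None  # the outer template was never closed

text = "intro {{Infobox film\n| country = Assam, {{IND}}\n| budget =\n}} more text {{Other}}"
body = extract_template(text)
print(body)
```

Because the scan stops as soon as the outer template closes, a later template such as {{Other}} is not swallowed, which is the failure mode of the greedy pattern.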