How to capture element in columns with spaces around? - python

Example
300 january 10 20
300 februari 120,30 10
300 march 20,30 10
300,10 april 20,30 10
300 may 420,10 10,46
I want to reorder columns.
The first thing I do is to separate the columns between the text using a separator, e.g.
(?<=\S)(\s{2,})(?=\S) or
(?<=\S)(\s{1,})(?=\S)
Then I want to put the columns in a list like this:
|300 | |january | | 10 | |20 |
|300 | |februari| |120,30| |10 |
|300 | |march | |20,30 | |10 |
|300,10| | april | | 20,30| |10 |
|300 | |may | |420,10| |10,46|
expected output:
mylist = [['300 ','january ',' 10 ','20 '],
['300 ','februari','120,30','10 '],
['300 ','march ','20,30 ','10 '],
['300,10',' april ',' 20,30','10 '],
['300 ','may ','420,10','10,46']]
I have no idea how to capture the spaces.
I tried this to capture the spaces after use of the separator:
#find the max length of an element in a column
length_temp = [[len(x) for x in row] for row in mylist]
maxlengthcolumn = max(l[len(mylist[0])-1] for l in length_temp)
#add spaces to elements
for b in range(0,len(mylist)):
    if length_temp[b][len(mylist[-1])-1] < maxlengthcolumn:
        mylist[b][len(mylist[-1])-1] = mylist[b][len(mylist[-1])-1] + ' '*(maxlengthcolumn-length_temp[b][len(mylist[-1])-1])
but this removes the spaces before the elements in a column.
How can I capture the elements in a list as in my example above?

Assuming that you're working with strings, you can use ord() to obtain the ASCII values, and split your string where alphas and numerics begin and end.
To break it down:
Read in each line of the text one at a time (from what I've read, it looks like your original text could be a .txt file?). To import it you can use file I/O methods.
Pass each line as a string and convert it to ASCII values using ord(); store these values in a separate variable.
Set up logic to see where words/numbers begin: you should be looking for a pattern of an alpha or numeric, followed by 0 or more alphas/numerics, followed by spaces, and after that series of spaces you should find another alpha or numeric. Store the location of each beginning (a beginning is the first character in the string, or the first alpha/numeric that follows a series of spaces).
Index the line of text you're currently working with and pull out the desired strings.
This might be unclear, so see the pseudocode below:
strings_start = [5, 12, 22] # this would be where the words/numbers begin in the string that holds a line of your text
# we'll assume you have some variable, line, which holds the current line of the text you're parsing in a loop
string_list = []
for i in range(len(strings_start)):
    if i < len(strings_start) - 1: # every start except the last one
        # slice from this start up to (not including) the next start
        string_list.append(line[strings_start[i]:strings_start[i + 1]])
    else:
        # the last piece runs to the end of the line
        string_list.append(line[strings_start[i]:])
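The steps above can be sketched in a few lines. This is only an illustration under some assumptions: it finds the start positions with re.finditer instead of scanning ord() values by hand, the function name split_keep_spaces and the sample line are made up, and it keeps the spaces that follow each element (not the ones before it) so that the pieces join back into the original line.

```python
import re

def split_keep_spaces(line):
    """Split a line into columns, keeping the spaces that follow each element."""
    # Start index of every run of non-space characters.
    starts = [m.start() for m in re.finditer(r"\S+", line)]
    if not starts:
        return []
    starts[0] = 0  # keep any leading spaces with the first column
    # Each piece ends where the next one starts; the last runs to the end.
    ends = starts[1:] + [len(line)]
    return [line[s:e] for s, e in zip(starts, ends)]

line = "300    january    10   20"  # hypothetical sample row
columns = split_keep_spaces(line)
print(columns)  # ['300    ', 'january    ', '10   ', '20']
```

Because no character is dropped, "".join(columns) gives back the original line, which makes it easy to reorder the columns and write them out again.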

Related

Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'
Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-']
                                      )
The resulting alignments - pretty close already:
>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
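The block structure in that output (runs of identical words separated by runs that differ) can also be turned into the dictionary asked for in the question. The sketch below is a swap-in, not the pairwise2 approach: it uses difflib.SequenceMatcher from the standard library, and the function name align_words is made up for illustration. It assumes that a run of unmatched words in string1 should map to the joined run of unmatched words in string2.

```python
from difflib import SequenceMatcher

def align_words(s1, s2):
    """Map each word of s1 to the word(s) of s2 it lines up with."""
    w1, w2 = s1.split(), s2.split()
    alignment = {}
    matcher = SequenceMatcher(None, w1, w2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs: pair the words one to one.
            for word_a, word_b in zip(w1[i1:i2], w2[j1:j2]):
                alignment[word_a] = word_b
        elif tag in ("replace", "delete"):
            # Differing runs: map every word of the s1 run to the whole
            # s2 run (an empty string when s2 has nothing there).
            target = " ".join(w2[j1:j2])
            for word_a in w1[i1:i2]:
                alignment[word_a] = target
        # "insert" runs exist only in s2, so there is no s1 word to map.
    return alignment

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignment = align_words(string1, string2)
print(alignment['dot'])   # youtube.com/example
print(alignment['live'])  # livestreaming
```

Note that a plain dict keeps only the last mapping for a word that occurs more than once, and that punctuation stays attached to the words (the key here is 'twitch.', not 'twitch').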
Previous answers offer biology-based alignment methods; there are NLP-based alignment methods as well. The most standard would be the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular it's possible to mix alignment at the level of words and characters, as well as different measures (e.g. SoftTFIDF).
The Needleman-Wunsch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT: A A A C C C T T A G
+ + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match the word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then the order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The two alignments above have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments twice instead of only once.
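For reference, here is a minimal word-level sketch of Needleman-Wunsch in its usual table-based form. The scoring values (match=1, mismatch=-1, gap=-1), the function name and the "<GAP>" token are assumptions chosen for illustration; other scores will give other alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_token="<GAP>"):
    """Globally align two token lists; returns a list of (token_a, token_b) pairs."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                   # a[i-1] against a gap
                score[i][j - 1] + gap,                   # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], gap_token))
            i -= 1
        else:
            pairs.append((gap_token, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs

pairs = needleman_wunsch("The reason that I went".split(), "reason I went".split())
for left, right in pairs:
    print(left, "->", right)
```

With these scores the words "The" and "that" are paired with the gap token, in the same way as in the table earlier in this answer. When several alignments tie for the best score, the traceback returns just one of them.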

How do I print a specific part of a YAML string

My YAML database:
left:
  - title: Active Indicative
    fill: "#cb202c"
    groups:
      - "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
My Python code:
import io
import yaml
with open("C:/Users/colin/Desktop/LBot/latin3_2.yaml", 'r', encoding="utf8") as f:
    doc = yaml.safe_load(f)
txt = doc["left"][1]["groups"][1]
print(txt)
Currently my output is Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt] but I would like the output to be ō, is, it, or imus. Is this possible in PyYaml and if so how would I implement it? Thanks in advance.
I don't have a PyYaml solution, but if you already have the string from the YAML file, you can use Python's regex module to extract the text inside the [ ].
import re
txt = "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
parts = txt.split(" | ")
print(parts)
# ['Present', 'dūc[ō]', 'dūc[is]', 'dūc[it]', 'dūc[imus]', 'dūc[itis]', 'dūc[unt]']
pattern = re.compile("\\[(.*?)\\]")
output = []
for part in parts:
    match = pattern.search(part)
    if match:
        # group(0) is the matched part, ex. [ō]
        # group(1) is the text inside the (.*?), ex. ō
        output.append(match.group(1))
    else:
        output.append(part)
print(" | ".join(output))
# Present | ō | is | it | imus | itis | unt
The code first splits the text into individual parts, then loops through each part, searching for the pattern [x]. If it finds it, it extracts the text inside the brackets from the match object and stores it in a list. If the part does not match the pattern (ex. 'Present'), it just adds it as is.
At the end, all the extracted strings are joined together to rebuild the string without the brackets.
EDIT based on comment:
If you just need one of the strings inside the [ ], you can use the same regex pattern but use the findall method instead on the entire txt, which will return a list of matching strings in the same order that they were found.
import re
txt = "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
pattern = re.compile("\\[(.*?)\\]")
matches = pattern.findall(txt)
print(matches)
# ['ō', 'is', 'it', 'imus', 'itis', 'unt']
Then it's just a matter of using some variable to select an item from the list:
selected_idx = 1 # 0-based indexing so this means the 2nd item
print(matches[selected_idx])
# is

I would like to slice the first part of a string before the '|'

I would like to slice different strings at a certain point. To be specific, I want to print the part of the string before the first '|'.
data=' xbox 360 | 10000 | NEW '
length=len(data)
for i in range(length):
    if (data[i]=='|'):
        product=data[:i]
print(product)
However when i run the code the result is this:
xbox 360 | 10000
i want it to show only:
xbox 360
All you need is .split() as below:
the_stuff = data.split('|')[0]
This will split the line using | as the delimiter and give the results in a list; the [0] returns only the first item of that list, which is everything before the first |.
If you want all 3 components, then you just need:
list_of_the_stuff = data.split('|')
And now you have a list of: [' xbox 360 ', ' 10000 ', ' NEW ']
Edit: and as suggested below, you can use .strip() to clean up the resulting values of your list at some point.
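Putting the pieces together, a small sketch (the variable names are only illustrative). The loop in the question prints too much because it keeps overwriting product at every '|', so the last one wins; adding a break after the first hit would also fix it.

```python
data = ' xbox 360 | 10000 | NEW '

# Everything before the first '|', with the surrounding spaces removed.
# The second argument (maxsplit=1) stops splitting after the first '|'.
product = data.split('|', 1)[0].strip()
print(product)  # xbox 360

# str.partition also cuts at the first separator only.
before, sep, after = data.partition('|')

# All three fields, cleaned up.
fields = [part.strip() for part in data.split('|')]
print(fields)  # ['xbox 360', '10000', 'NEW']
```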

Regular expression Variant

I want to extract the length of a dress from a pandas dataframe. A row of that dataframe looks like this:
A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4
As you can see, the size is contained between About and shoulder, but in some cases shoulder is replaced by waist, hem etc. Below is my Python script that finds the length, but it fails when, let's say, there is a comma after About, since I am slicing the matched string.
import re
def regexfinder(string_var):
    res=''
    x=re.search(r"(?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips])", string_var).group(0)
    tohave=int(x[1:3])
    if tohave >=16 and tohave<=36:
        res="Mini"
        return res
    if tohave>36 and tohave<40:
        res="Above the Knee"
        return res
    if tohave >=40 and tohave<=46:
        res="Knee length"
        return res
    if tohave>46 and tohave<49:
        res="Mid/Tea length"
        return res
    if tohave >=49 and tohave<=59:
        res="Long/Maxi length"
        return res
    if tohave>59:
        res="Floor Length"
        return res
Your regex (?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips]) uses a character class for the words shoulder,waist,hem,bust,neck,bust,top,hips.
I think you want to put them in a non-capturing group using alternation with |.
Try it like this, using an optional comma ,?:
(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))
The size is in the first capturing group.
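A short sketch of that pattern in use. The function name dress_length_inches and the shortened sample strings are made up for illustration; the pattern is the alternation form described above, without a stray ] inside the group.

```python
import re

# Optional comma after "About", one space, then the digits we want.
# The lookahead only checks that one of the body-part words follows later.
LENGTH_RE = re.compile(
    r"(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))"
)

def dress_length_inches(description):
    """Return the length after 'About' as an int, or None if it is not found."""
    match = LENGTH_RE.search(description)
    return int(match.group(1)) if match else None

print(dress_length_inches('Long sleeves | About 23" from shoulder to hem | Dry clean'))   # 23
print(dress_length_inches('Crewneck | About, 31" from shoulder to hem | Hand wash'))      # 31
print(dress_length_inches('Crewneck | Hand wash'))                                        # None
```

Returning None instead of calling .group() on a failed search avoids an AttributeError on rows that have no length at all.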
import re
s = """A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4"""
q = """'Velvet dress featuring mesh front, back and sleeves | Crewneck | Long bell sleeves | Self-tie closure at back cutout | About, 31" from shoulder to hem | Viscose/nylon | Hand wash | Imported | Model shown is 5\'10" (177cm) wearing a size Small.'1"""
def getSize(stringVal, strtoCheck):
    for i in stringVal.split("|"):  # Split string by "|"
        if i.strip().startswith(strtoCheck):  # Check if string starts with "About"
            val = i.strip()
            return re.findall(r"\d+", val)[0]  # Extract the first run of digits
print(getSize(s, "About"))
print(getSize(q, "About"))
Output:
23
31

Tabbing in Python?

I need to write data to a text file as a table, sort of like a database. The header has Drivers, Cars, Teams, Grids, Fastest Lap, Race Time and Points. When I try to write the data that goes under it, the columns don't line up, as some drivers' names are longer than others.
It looks a bit like this:
| Driver |
|Sebastian William|
|Tom Hamilton |
Only 2 of the names actually align with the header. I am only trying to solve the issue with Drivers for now; once I figure that out I should be able to get all the other headers lined up.
Using a for loop through the array of dictionaries, I set x to the length of the driver's name, and 22 is the length of the longest name (18) plus a few spaces.
TextFile.write((items['Driver']+'\t|').expandtabs(22-x))
Any way of making them line up?
You could use format string syntax:
>>> "|{:22}|".format("Niki Lauda")
'|Niki Lauda            |'
You can also change the alignment:
>>> "|{:>22}|".format("Niki Lauda")
'|            Niki Lauda|'
>>> "|{:^22}|".format("Niki Lauda")
'|      Niki Lauda      |'
and if you want more flexibility with your column size, you can parametrize that as well:
>>> "|{:^{}}|".format("Niki Lauda", 24)
'|       Niki Lauda       |'
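Applied to the Driver column from the question, this could look like the sketch below. The list of dictionaries is made-up sample data, and the width is taken from the longest entry rather than hard-coded.

```python
# Hypothetical sample data in the shape described in the question.
drivers = [{"Driver": "Sebastian William"}, {"Driver": "Tom Hamilton"}]

# Column width: the longest of the header and all driver names.
width = max(len("Driver"), max(len(d["Driver"]) for d in drivers))

lines = ["|{:^{w}}|".format("Driver", w=width)]  # centred header
for d in drivers:
    lines.append("|{:<{w}}|".format(d["Driver"], w=width))  # left-aligned names

print("\n".join(lines))
```

Every line ends up with the same length, so the closing | characters line up; each line can then be written to the file followed by a newline.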
On top of the answer provided by Tim, you could opt to use the tabulate package, which is very easy to use and customise.
from tabulate import tabulate
table = [["spam",42],["eggs",451],["bacon",0]]
headers = ["item", "qty"]
print(tabulate(table, headers, tablefmt="grid"))
+--------+-------+
| item   |   qty |
+========+=======+
| spam   |    42 |
+--------+-------+
| eggs   |   451 |
+--------+-------+
| bacon  |     0 |
+--------+-------+
This provides support for multiple different table styles too. I prefer this to simply using format because it allows me to completely change the output style by configuring the tablefmt argument.
