Tabbing in Python? - python

I need to write data to a text file as a table, sort of like a database. The header has Drivers, Cars, Teams, Grids, Fastest Lap, Race Time and Points. When I try to write the data that goes underneath, the columns don't line up because some drivers' names are longer than others.
It looks a bit like this:
| Driver |
|Sebastian William|
|Tom Hamilton |
Only 2 of the names actually align with the header. I am only trying to solve the issue with Drivers for now; once I figure that out I should be able to get all the other headers lined up.
Looping through the array of dictionaries, I set x to the length of the driver's name; 22 is the length of the longest name (18) plus a few spaces.
TextFile.write((items['Driver']+'\t|').expandtabs(22-x))
Any way of making them line up?

You could use format string syntax:
>>> "|{:22}|".format("Niki Lauda")
'|Niki Lauda            |'
You can also change the alignment:
>>> "|{:>22}|".format("Niki Lauda")
'|            Niki Lauda|'
>>> "|{:^22}|".format("Niki Lauda")
'|      Niki Lauda      |'
and if you want more flexibility with your column size, you can parametrize that as well:
>>> "|{:^{}}|".format("Niki Lauda", 24)
'|       Niki Lauda       |'
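The same format specs work in f-strings on Python 3.6+; a minimal sketch applied to the Drivers column from the question (the list literal and variable names are mine):

```python
drivers = [{"Driver": "Sebastian William"}, {"Driver": "Tom Hamilton"}]

lines = ["|{:22}|".format("Driver")]        # header padded to 22 characters
for item in drivers:
    lines.append(f"|{item['Driver']:22}|")  # same width spec, f-string form

for line in lines:
    print(line)
```

Every row comes out 24 characters wide, so the pipes line up regardless of name length.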

On top of the answer provided by Tim, you could opt to use Tabulate, which is very easy to use and customise.
from tabulate import tabulate

table = [["spam", 42], ["eggs", 451], ["bacon", 0]]
headers = ["item", "qty"]
print(tabulate(table, headers, tablefmt="grid"))
+--------+-------+
| item   |   qty |
+========+=======+
| spam   |    42 |
+--------+-------+
| eggs   |   451 |
+--------+-------+
| bacon  |     0 |
+--------+-------+
This provides support for many different table styles too. I prefer this to plain format strings because it lets me completely change the output style just by changing the tablefmt argument.


Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'
Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical, which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-'])
The resulting alignments - pretty close already:
>>> print(format_alignment(*alignments[0]))
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
Previous answers offer biology-based alignment methods; there are NLP-based alignment methods as well. The most standard is the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular, it's possible to mix alignment at the level of words and characters, as well as different measures (e.g. SoftTFIDF, see this answer).
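For a lighter-weight sketch using only the standard library (my suggestion, not from the answers above): difflib.SequenceMatcher can pair up exactly-matching words and map each unmatched word onto the whole span that replaced it. The two strings are the ones from the question:

```python
from difflib import SequenceMatcher

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

a, b = string1.split(), string2.split()
matcher = SequenceMatcher(a=a, b=b, autojunk=False)

alignment = {}
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == 'equal':
        # identical words align one-to-one
        for i, j in zip(range(i1, i2), range(j1, j2)):
            alignment[a[i]] = b[j]
    else:
        # map every unmatched source word onto the whole replacement span
        target = ' '.join(b[j1:j2])
        for i in range(i1, i2):
            alignment[a[i]] = target

print(alignment)
```

This reproduces the desired output above for exact word matches; unlike a real edit-distance alignment it cannot pair near-matches such as 'live' vs 'livestreaming' by similarity, only by position inside a replaced span.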
The Needleman-Wunch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT:   A A A C C C T T A G
       + + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match that word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then the order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The two alignments above have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments twice instead of computing them only once.
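The idea above can be sketched as a small word-level implementation (my own illustration, not code from the answer; the scoring constants match = 1, mismatch = -1, gap = -1 are arbitrary choices):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Align two token lists, inserting '<GAP>' where a token has no partner."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # trace back from the bottom-right corner to recover one best alignment
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('<GAP>'); i -= 1
        else:
            out_a.append('<GAP>'); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

top, bottom = needleman_wunsch("the reason that I went".split(),
                               "reason I went".split())
print(top)
print(bottom)
```

The traceback recovers only one of the equally scored alignments, which is exactly the redundancy the answer is complaining about.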

python check if a string contains a 'word' with a certain format

I'm trying to check if a string contains a word with a certain format: 3 digits + 'x' + 3 digits. I'm working with a pandas dataframe and the data looks like this:
| ad name |
| puma sneaker ad banner 320x480 |
| puma mobile 320x240 video ad |
The 320x480 and 320x240 indicate the size of the ad banner, and I want to create a new column that contains only the size:
| ad name | banner size |
| puma sneaker ad banner 320x480 | 320x480 |
| puma mobile 320x240 video ad | 320x240 |
For Example in sentence 'puma sneaker ad banner 320x480', I want to be able to print out '320x480', and in sentence 'puma mobile 320x240 video ad', I want to be able to print '320x240'.
I am not familiar with Regex and don't even know if this is achievable with it.
To brute force it I can do an if-else statement:
if "320x240" in somestring:
    print("320x240")
elif "320x480" in somestring:
    print("320x480")
...
But I don't want to brute force it, I'd like to find another way around to make the code cleaner. Any advice?
For pandas:
df['banner size'] = df['ad name'].str.extract(r'(\d{3}x\d{3})')
If you have more than one banner size in a row, use str.findall instead.
import re

if re.search(r'\d{3}x\d{3}', somestring):
    output = re.findall(r'\d{3}x\d{3}', somestring)
    print(', '.join(output))
else:
    print('Nothing found')
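For completeness, a self-contained run of the same pattern on the two sample ad names from the question (the variable names are mine):

```python
import re

pattern = re.compile(r'\d{3}x\d{3}')  # 3 digits, an 'x', 3 digits
ad_names = [
    "puma sneaker ad banner 320x480",
    "puma mobile 320x240 video ad",
]

sizes = []
for name in ad_names:
    match = pattern.search(name)
    # keep None for rows with no size, mirroring what str.extract would do
    sizes.append(match.group(0) if match else None)

print(sizes)
```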

NLTK CFG No spaces between non terminals

I want to define a CFG txt file to read into NLTK using nltk.CFG.fromstring(). Thing is, when I define rules, I want to make rules that don't output spaces between non terminals. For example, say I have this grammar:
X -> TENS ONES
TENS -> '二十' | '三十' | '四十' | '五十' | '六十' | '七十' | '八十' | '九十'
ONES -> '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九'
If I want the word "二十一", I cannot generate it because TENS ONES will insert a space and produce "二十 一". If I instead write the rule as X -> TENSONES, TENSONES is treated as one non-terminal, not two, and thus there is no parse. Is there a way I can use two non-terminals in a production without the need for a space between them?
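One possible workaround (my suggestion, not confirmed by the thread): keep the grammar with space-separated tokens and drop the spaces only when rendering, since nltk.parse.generate.generate yields each sentence as a list of terminal tokens that can be joined with an empty separator. A cut-down version of the grammar:

```python
import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring("""
X -> TENS ONES
TENS -> '二十' | '三十'
ONES -> '一' | '二'
""")

# generate() yields token lists; joining with '' removes the space
# the grammar would otherwise imply between TENS and ONES
words = [''.join(tokens) for tokens in generate(grammar)]
print(words)
```

Parsing works the same way in reverse: tokenize "二十一" into ['二十', '一'] before handing it to the parser, since NLTK's CFG machinery operates on token sequences, not raw strings.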

Is there a way to add a new line after every ']' in Python?

I originally had a string containing BBCode which I wanted to format so it would be readable.
I had something like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG][b]\\u25ba osu! Mapping Theory[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]Linear Momentum[\\/url] | [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]Linear Momentum 2[\\/url] | [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]Angular Momentum and Circular Flow[\\/url] | [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]Active and Passive Mapping[\\/url]\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]Slider Flow[\\/url] | [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]Stream Flow[\\/url] | [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]Slider Mechanics[\\/url] | [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]Aesthetics by Symmetry[\\/url] | [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]Aesthetics by Complexity[\\/url] | [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]Defining Flow[\\/url]\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]Flow and Aesthetics[\\/url] | [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]Angle Emphasis[\\/url] | [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]Strain[\\/url] | [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]Pressure[\\/url] | [url=https:\\/\\/youtu.be\\/jm43HilQhYk]Tension[\\/url] | [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]Song Choice[\\/url] | [url=https:\\/\\/youtu.be\\/BNjVu8xq4os]Song Length[\\/url]\\n\\n[url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG6t5MCwGnq87iYZnE5G7aZL][b]\\u25ba osu! 
Rambling[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/-Beeh7dKyTk]Storyboards[\\/url] | [url=https:\\/\\/youtu.be\\/i6zzHMzwIzU]Why[\\/url]\\n\\n[url=https:\\/\\/youtu.be\\/_sBP7ttRQog]0 BPM[\\/url] | [url=https:\\/\\/youtu.be\\/UgtR6WnuTT8]ppv2 Pt.1[\\/url] | [url=https:\\/\\/youtu.be\\/Bx14u5tltyE]ppv2 Pt.2[\\/url] | [url=https:\\/\\/youtu.be\\/-095yuSLE4Y]Super high star rating[\\/url][\\/notice][url=https:\\/\\/amo.s-ul.eu\\/oApvJHWA][b]Skin v3.4[\\/b][\\/url]\\n[size=85]Personal edit of [url=https:\\/\\/osu.ppy.sh\\/forum\\/t\\/481314]Re:m skin by Choilicious[\\/url][\\/size]\\n\\n[img]http:\\/\\/puu.sh\\/qqv6C\\/0aaca52f51.jpg[\\/img][url=https:\\/\\/osu.ppy.sh\\/u\\/Satellite][img]http:\\/\\/puu.sh\\/qqv6K\\/94681bed3f.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Sellenite][img]http:\\/\\/puu.sh\\/qqv6T\\/c943ed1703.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Morinaga][img]http:\\/\\/puu.sh\\/qqv70\\/cfbdb2a242.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/-Mo-][img]http:\\/\\/puu.sh\\/qqv77\\/ca489f2d00.jpg[\\/img][\\/url]\\n[notice]I don\'t really do nomination stuff often anymore. \\nHowever, please do show me your map if it\'s any of the following:[box=][b]Bounty[\\/b]\\n[size=50]High priority modding for these artists\\/songs (maybe a GD, just ask).\\nPreferably non-cut versions and songs that have no ranked maps yet.[\\/size]\\n\\nYuuhei Satellite\\nYuuhei Catharsis\\nShoujo Fractal\\nHoneyWorks (non-vocaloid)\\nTrySail\\nClariS\\n\\nClariS - CLICK (Asterisk DnB Remix), [size=85]either version.[\\/size]\\nfhana - Outside of Melancholy, [size=85]a version that isn\'t cut pls[\\/size]\\nAny cover of \\u7832\\u96f7\\u6483\\u6226\\u3001\\u59cb\\u3081![\\/box]I also do storyboard checks for any map.\\n\\nPMs are open for anything. Ask me anything. 
\\nAsk me what my favourite colour is if you really want even.[\\/notice][box=Guests][b]Ranked[\\/b]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1575100][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Terasareru kurai no Shiawase [Lunatic][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1794557][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Arehateta Chijou no Uta [Collab Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1592915][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] Tanaka Hirokazu - C-TYPE [TetriS-TYPE] [S-TYPE][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1490130][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] TrySail - adrenaline!!! [Insane][\\/url] [size=85]Slightly ruined version.[\\/size]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1401096][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Shunkan Everlasting [Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/795269][img]
Basically unreadable currently.
I tried making it look like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG]
[b]
\\u25ba osu! Mapping Theory[\\/b]
[\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]
Linear Momentum[\\/url]
| [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]
Linear Momentum 2[\\/url]
| [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]
Angular Momentum and Circular Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]
Active and Passive Mapping[\\/url]
\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]
Slider Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]
Stream Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]
Slider Mechanics[\\/url]
| [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]
Aesthetics by Symmetry[\\/url]
| [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]
Aesthetics by Complexity[\\/url]
| [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]
Defining Flow[\\/url]
\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]
Flow and Aesthetics[\\/url]
| [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]
Angle Emphasis[\\/url]
| [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]
Strain[\\/url]
| [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]
Pressure[\\/url]
| [url=https:\\/\\/youtu.be\\/jm43HilQhYk]
Tension[\\/url]
| [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]
Song Choice[\\/url]
|
Where there's a new line after every ']'
I've tried tweaking re.sub("[\(\[].*?[\)\]]", "", str(x)) to fit what I need but it just deletes everything inside of them. (I have no idea how regex works)
How can I go about this?
There's no need for a regular expression, just use the simple str.replace() function.
x = x.replace(']', ']\n')
It really depends on exactly what you want your output to look like.
I interpreted your output as wanting a newline after each url= tag, which would require the following regex:
output = re.sub(r"(\[url.*?\])", r"\1\n", input)
The brackets () form a capture group, which is then used in the replacement as \1 since it's the first unnamed capture group.
You can change the regex to your liking; just keep the stuff you want within the capture group.
If you want to experiment with regex you can use https://regexr.com/ which is an amazing resource when fiddling around with regex.
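A quick check of both suggestions on a shortened BBCode fragment (the sample string is mine, cut down from the question's):

```python
import re

bbcode = ("[url=https://youtu.be/abc]Linear Momentum[/url] | "
          "[url=https://youtu.be/def]Stream Flow[/url]")

# simple version: newline after every ']'
replaced = bbcode.replace(']', ']\n')

# regex version: newline only after the opening [url=...] tags
after_url = re.sub(r"(\[url.*?\])", r"\1\n", bbcode)

print(after_url)
```

Note that `\[url.*?\]` does not touch the closing `[/url]` tags, because those start with `[/` rather than `[u`.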

Extracting specific information from a string list using regular expressions

I have a string list with several thousands of URL values in different structures and I am trying to use regex to extract specific information from the URL values. The following gives you an example URL from which you can get an idea about the structure of this specific URL (note that there are many other records in this format, only the numbers changes across the data):
url_id | url_text
15 | /course/123908/discussion_topics/394785/entries/980389/read
Using the re library in python I can find which URLs have this structure:
re.findall(r"/course/\d{6}/discussion_topics/\d{6}/entries/\d{6}/read", text)
However, I also need to extract the '394785' and '980389' values and create a new matrix that may look like this:
url_id | topic_394785 | entry_980389 | {other items will be added as new column}
15 | 1 | 1 | 0 | 0 | 1 | it goes like this
Can someone help me in extracting this specific info? I know that 'split' method of 'str' could be an option. But, I wonder if there is a better solution.
Thanks!
Do you mean something like this?
import re

text = '/course/123908/discussion_topics/394785/entries/980389/read'
pattern = r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read"
for match in re.finditer(pattern, text):
    topic, entry = match.group('topic'), match.group('entry')
    print('Topic ID={}, entry ID={}'.format(topic, entry))
Output
Topic ID=394785, entry ID=980389
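If the end goal is the indicator-style matrix from the question, the named groups can feed one dict of indicator columns per url_id; the column-naming scheme here is my guess at the asker's intent:

```python
import re

pattern = re.compile(
    r"/course/\d{6}/discussion_topics/(?P<topic>\d{6})/entries/(?P<entry>\d{6})/read")

# url_id -> url_text, as in the question's sample row
rows = {15: '/course/123908/discussion_topics/394785/entries/980389/read'}

matrix = {}
for url_id, url_text in rows.items():
    cols = {}
    m = pattern.search(url_text)
    if m:
        # one indicator column per topic and per entry
        cols['topic_' + m.group('topic')] = 1
        cols['entry_' + m.group('entry')] = 1
    matrix[url_id] = cols

print(matrix)
```

URLs that do not match the pattern simply get an empty dict, which would become all-zero columns when the dicts are assembled into a table.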
