Split Text into paragraphs NLTK - usage of nltk.tokenize.texttiling? - python

I was looking at methods to split documents into paragraphs and I came across texttiling as one possible way to do this.
Here is my attempt to use it. However, I don't understand how to work with the output. I'd appreciate your help.
t = unidecode(doclist[0].decode('utf-8','ignore'))
nltk.tokenize.texttiling.TextTilingTokenizer(t)
output:
<nltk.tokenize.texttiling.TextTilingTokenizer at 0x11e9c6350>

I'm messing around with this one myself just now for the same reason you are and had the same question you did so don't be too upset if this is wrong. I figured best to pass on what little I know... :)
I'm not sure yet but I found in this bug report an example of using the TextTilingTokenizer:
alice=nltk.corpus.gutenberg.raw('carroll-alice.txt')
ttt = nltk.tokenize.TextTilingTokenizer()
tiles = ttt.tokenize(alice[140309 : ])
It appears that you want to feed your text to the tokenize method on the the TextTilingTokenizer.

Related

How to get partial match with soup.find()?

For some reason I wasn't able to find the answer to this somewhere.
So, I'm using this
soup.find(text="Dimensions").find_next().text
to grab the text after "Dimensions". My problem is on the website I'm scraping sometimes it is displayed as "Dimensions:" (with colon) or sometimes it has space "Dimensions " and my code throws an error. So that's why I'm looking for smth like (obviously, this is an invalid piece of code) to get a partial match:
soup.find(if "Dimensions" in text).find_next().text
How can I do that?
Ok, I've just found out looks like it's much simpler than I thought
soup.find(text=re.compile(r"Dimensions")).find_next().text
does what I need

How to find the root of a word from its present participle or other variations in Python?

I'm working on a NLP project, and right now, I'm stuck on detecting antonyms for certain phrases that aren't in their "standard" forms (like verbs, adjectives, nouns) instead of present-participles, past tense, or something to that effect. For instance, if I have the phrase "arriving" or "arrived", I need to convert it to "arrive". Similarly, "came" should be "come". Lastly, “dissatisfied” should be “dissatisfy”. Can anyone help me out with this? I have tried several stemmers and lemmanizers in NLTK with Python, to no avail; most of them don’t produce the correct root. I’ve also thought about the ConceptNet semantic network and other dictionary APIs, but it seems far too complicated for what I need. Any advice is helpful. Thanks!
If you know you'll be working with a limited set, you could create a dictionary.
Example :
look_up = {'arriving' : 'arrive',
'arrived' : 'arrive',
'came' : 'come',
'dissatisfied' : 'dissatisfy'}
test = 'arrived'
print (look_up [test])

REGEX to find all matches inside a given string

I have a problem that drives me nuts currently. I have a list with a couple of million entries, and I need to extract product categories from them. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
A type check did indeed give me string: print(type(item)) <class 'str'>
Now I searched online for a possible (and preferably fast - because of the million entries) regex solution to extract all the categories.
I found several questions here Match single quotes from python re: I tried re.findall(r"'(\w+)'", item) but only got empty brackets [].
Then I went on and searched for alternative methods like this one: Python Regex to find a string in double quotes within a string There someone tries the following matches=re.findall(r'\"(.+?)\"',item)
print(matches), but this failed in my case as well...
After that I tried some idiotic approach to get at least a workaround and solve this problem later: list_cat_split = item.split(',') which gives me
e["[['Electronics'"," 'Computers & Accessories'"," 'Cables & Accessories'"," 'Memory Card Adapters']]"]
Then I tried string methods to get rid of the stuff and then apply a regex:
list_categories = []
for item in list_cat_split:
item.strip('\"')
item.strip(']')
item.strip('[')
item.strip()
category = re.findall(r"'(\w+)'", item)
if category not in list_categories:
list_categories.append(category)
however even this approach failed: [['Electronics'], []]
I searched further but did not find a proper solution. Sorry if this question is completly stupid, I am new to regex, and probably this is a no-brainer for regular regex users?
UPDATE:
Somehow I cannot answer my own question, thererfore here an update:
thanks for the answers - sorry for incomplete information, I very rarely ask here and usually try to find solutions on my own.. I do not want to use a database, because this is only a small fraction of my preprocessing work for an ML-application that is written entirely in Python. Also this is for my MSc project, so no production environment. Therefore I am fine with a slower, but working, solution as I do it once and for all. However as far as I can see the solution of #FailSafe worked for me:screenshot of my jupyter notebook
here the result with list
But yes I totally agree with # Wiktor Stribiżew: in a production setup, I would for sure set up a database and let this run over night,.. Thanks for all the help anyways, great people here :-)
this may not be your final answer but it creates a list of categories.
x="[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
y=x[2:-2]
z=y.split(',')
for item in z:
print(item)

Python Reportlab combine paragraph

I hope you can help me trying to combine a paragraph, my style is called "cursiva" and works perfectly also I have other's but it's the same if I change cursiva to other one. the issue is that If I use this coude o get this.
As you can see guys it shows with a line break and I need it shows togetter.
The problem is that i need to make it like this (one, one) togetter because I need to use two styles, the issue here is that I'm using arial narrrow so if I use italic or bold I need to use each one by separate because the typography does not alow me to use "< i >italic text< /i > ", so I need to use two different styles that actually works fine by separate.
how can I achive this?
cursiva = ParagraphStyle('cursiva')
cursiva.fontSize = 8
cursiva.fontName= "Arialni"
incertidumbre=[]
incertidumbre.extend([Paragraph("one", cursiva), Paragraph("one", cursiva)])
Thank you guys
The question you are asking is actually caused by a workaround for a different problem, namely that you don't know how to register font families in Reportlab. Because that is what is needed to make <i> and <b> work.
So you probably already managed to add a custom font, so the first part should look familiar, the final line is probably the missing link. It is registering the combination of these fonts a family.
from reportlab.pdfbase.pdfmetrics import registerFontFamily
pdfmetrics.registerFont(TTFont('Arialn', 'Arialn.ttf'))
pdfmetrics.registerFont(TTFont('Arialnb', 'Arialnb.ttf'))
pdfmetrics.registerFont(TTFont('Arialni', 'Arialni.ttf'))
pdfmetrics.registerFont(TTFont('Arialnbi', 'Arialnbi.ttf'))
registerFontFamily('Arialn',normal='Arialn',bold='Arialnb',italic='Arialni',boldItalic='Arialnbi')

How to transform hyperlink codes into normal URL strings?

I'm trying to build a blog system. So I need to do things like transforming '\n' into < br /> and transform http://example.com into < a href='http://example.com'>http://example.com< /a>
The former thing is easy - just using string replace() method
The latter thing is more difficult, but I found solution here: Find Hyperlinks in Text using Python (twitter related)
But now I need to implement "Edit Article" function, so I have to do the reverse action on this.
So, how can I transform < a href='http://example.com'>http://example.com< /a> into http://example.com?
Thanks! And I'm sorry for my poor English.
Sounds like the wrong approach. Making round-trips work correctly is always challenging. Instead, store the source text only, and only format it as HTML when you need to display it. That way, alternate output formats / views (RSS, summaries, etc) are easier to create, too.
Separately, we wonder whether this particular wheel needs to be reinvented again ...
Since you are using the answer from that other question your links will always be in the same format. So it should be pretty easy using regex. I don't know python, but going by the answer from the last question:
import re
myString = 'This is my tweet check it out http://tinyurl.com/blah'
r = re.compile(r'(http://[^ ]+)')
print r.sub(r'\1', myString)
Should work.

Categories