Check if text is sentences? [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
So I have a scraper that gets articles. However, it doesn't always work properly. I want to get better at checking when it doesn't work. For example, the following is something like I want it to scrape:
Hello. This is a sequence of sentences that are put together. They don't have to follow this exact format, but something very close to this would be nice! Just basically stuff like this put together with the occasional weird formatting, which depends on what is scraped.
But I might also get something that is obviously not text:
REGISTER | LOGIN | LOGOUT | Sign in to your account Forgot your password? {* #signInForm *}....
Is there any python library that checks the general format of strings? Basically, I am scraping articles and want to see if the text scraped is article-y. If there isn't a python library, would the best way to go be some sort of regex matching? Is this possible to do reasonably well?
Any help would be greatly appreciated, thanks!!
[edit] if you voted to close, do you mind leaving a comment as to why? Reason being: There is no stack exchange for NLP. Hence, where else can I ask this question? Thanks.

There are many ways to do this, and without seeing a lot more of your data, predicting the correct way will be difficult.
That said, here's one simple strategy: split the text into words and check whether it statistically looks like writing rather than boilerplate. For example, in any sufficiently long piece of English writing, roughly 5% of the tokens should be the word "the". For short pieces of text this is less reliable, but based on your examples above, a very simple check along these lines ("do a|an|the make up more than 1% of the tokens?") may work.
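The check described above could be sketched roughly as follows. This is only a minimal illustration: the function names, the tokenizing regex, and the default 1% threshold are assumptions chosen for the example, and the threshold would need tuning against real scraped data.

```python
import re

# The three English articles used as a cheap signal for running prose.
ARTICLES = {"a", "an", "the"}


def article_ratio(text):
    """Return the fraction of word tokens that are 'a', 'an', or 'the'."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in ARTICLES for token in tokens) / len(tokens)


def looks_like_prose(text, threshold=0.01):
    """Guess whether text is article-like; the threshold is a tunable assumption."""
    return article_ratio(text) > threshold


good = ("Hello. This is a sequence of sentences that are put together. "
        "They don't have to follow this exact format, but something very "
        "close to this would be nice!")
bad = ("REGISTER | LOGIN | LOGOUT | Sign in to your account "
       "Forgot your password? {* #signInForm *}....")

print(looks_like_prose(good))
print(looks_like_prose(bad))
```

On very short strings a single article can swing the ratio a lot, so it may be worth also requiring a minimum number of tokens before trusting the result.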
For more sophisticated methods you can look at a list of boilerplate removal libraries here.

Related

Tips to practice matplotlib [closed]

Closed 2 years ago.
I've been studying Python for data science for about 5 months now, but I get really stuck when it comes to matplotlib. There are always so many ways to do anything, and I can't see a well-defined path. Does anyone else have this problem and know how to deal with it?
I had the same problem some time back. I just picked the Boston Housing Prices dataset and kept practicing on that. If you work on it enough, you will be able to create all types of plots for EDA and get good practice. Of course, after a certain point it can get boring; that's when you jump to a dataset in an area that interests you, which in my case was movie reviews.
Below is the link to the housing prices data.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
I think your question is stating that you are bored and do not have any projects to make. If that is correct, sites like Kaggle have many open datasets that programmers can practice on.
In programming in general, "there's always so many options to do anything".
I recommend that you read through a library's documentation to get a quick overview of its functions and classes, then go and solve some problems from websites, or take on a real project if you can. If your code works, do not worry and move on.
After this trial and error you will have a real feel for various problems, and you will recognize the differences between these options and their pros and cons. That was me three years ago.

Understanding the structure of python code [closed]

Closed 3 years ago.
Does anyone know of any useful resources to learn the structure of python code? When I say structure, I'm referring to the ingredients necessary to make the code correct, so a combination of syntax and the order of terms etc.
For example, to understand the type of a pandas dataframe, df, (I know, I've already identified that it's a dataframe, but just hypothetically), we type: type(df). For the shape, we write df.shape.
So, in the latter, why do we put df first, as in df.shape? Yet in the former, we put df last and in parentheses, as in type(df). So, mixing these commands up, one may write df.type and expect to get an answer, but they wouldn't. So how do we learn these rules? I have searched high and low for some guidance, but I find very little; a lot of the guides assume you have an appreciation of these rules and don't explain them.
I'm not asking for help on this particular problem. Rather, I'm looking for a resource that would enable me to understand this, and the multitude of other rules better.
I don't have a Computer Science background, so my answer might be unorthodox or incomplete. There are things which come across as inconsistent when you use Python, and these things often trick me when I write new code. So you can easily find yourself looking up the same stuff time and time again.
However, I would say that the more you practice, the less likely you are to make mistakes or be confused. One good tip is to create a OneNote page with all the snippets that usually trip you up. By going back to the same snippets you may end up saving yourself time and fixing those concepts in memory. It will cost you nothing, so give it a try.
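For what it's worth, the type(df) versus df.shape difference from the question comes down to three kinds of things: built-in functions (which take the object as an argument), attributes (data stored on the object, read with a dot and no parentheses), and methods (functions that belong to the object, called with a dot and parentheses). Here is a small sketch using a made-up Table class as a stand-in for a DataFrame; the class and its members are hypothetical and only mimic the naming used by pandas.

```python
class Table:
    """A tiny, hypothetical stand-in for a DataFrame-like object."""

    def __init__(self, rows):
        self.rows = rows
        # Attribute: a value stored on the object, read as table.shape.
        self.shape = (len(rows), len(rows[0]) if rows else 0)

    def head(self, n=2):
        # Method: a function attached to the object, called as table.head().
        return self.rows[:n]


table = Table([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Built-in function: works on any object, so the object goes in parentheses.
print(type(table))

# Attribute: dot syntax, no parentheses.
print(table.shape)

# Method: dot syntax with parentheses (and optional arguments).
print(table.head(1))

# table.type would fail because Table defines no attribute named "type".
print(hasattr(table, "type"))
```

Calling dir(table) or help(table) in an interactive session lists which attributes and methods an object actually has, which is often quicker than guessing.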

Setting up every sentence (or string) in a text to have its own URL [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
So I want to have a webpage that has a text - say a short story.
I want every sentence in the short story to have its own URL and page.
The idea would be to allow users to save and comment on each sentence.
It would be similar to what RapGenius does for lyrics. I.e. a song has its own page (https://genius.com/Eminem-venom-music-from-the-motion-picture-lyrics). But each line in the song can also have its own URL/page (https://genius.com/15303513 or https://genius.com/15306806)
What is the best way to approach this?
1. Should I be splitting the short story into sentences beforehand, then import it into the database?
2. Or should I be looking to upload the short story as a whole text (either onto the database or on the server) and then try to split it after the fact?
3. Or is "splitting up" the story into sentences the wrong approach completely? Should I be looking to have the URL generate based off the sentence's location in the text?
I'm currently leaning towards option 1.
I would appreciate any help or guidance! Looking to build my first proper Django app (after doing a bunch of tutorials) and I would like to make sure I'm on the right path.
I'm not sure there's particularly a "right" answer here.
If you are always going to split per sentence, I'd just Keep it Simple and go with (1) and store each sentence separately in the database. It makes a simple one-to-one relationship between sentence and page, and decreases the amount of work your code has to do each time the story or sentence is viewed. It also means if you modify a sentence nothing will get messed up.
OTOH, if you want to have some sort of user-definable "highlight" like genius.com, that won't work, as of course you won't know what the user wants to highlight until they're there.
The difference between (1) and (2) seems like more of an implementation decision - when you split the sentences out. It really depends on how you're adding these stories to the database. In the long run you'll probably benefit from having an automated process on the server that does this, but that might not be the MVP (minimum viable product) solution for now.
(3) seems like a massive PITA and could get messed up if you edit the text anywhere above the split out area.
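As a rough sketch of option (1), here is one way the splitting step could look in plain Python before anything is written to the database. The function names, the record fields, and the /sentences/<id>/ URL pattern are all assumptions made up for illustration, and the regex split is deliberately naive (it will break on abbreviations such as "Mr." or on quoted dialogue).

```python
import re


def split_into_sentences(story):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", story.strip())
    return [part for part in parts if part]


def build_sentence_records(story, first_id=1):
    """Return one record per sentence, shaped like rows of a sentences table."""
    return [
        {"id": first_id + position, "position": position, "text": sentence}
        for position, sentence in enumerate(split_into_sentences(story))
    ]


story = "It was late. Who was at the door? Nobody answered!"
records = build_sentence_records(story)

for record in records:
    # Hypothetical URL scheme: one page per stored sentence.
    print(f"/sentences/{record['id']}/", record["text"])
```

Keeping a separate position field alongside the id means the sentence URLs stay stable even if the story is later edited and the sentences are reordered.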

Is it possible to send emails in python? [closed]

Closed 7 years ago.
My goal is to send emails entirely from python. I want to do it all from scratch, maybe go as far as building an email server in python if someone hasn't done it already. I want to do this because I'm basically tired of using Postfix or the common email providers with the standard SMTP/POP/IMAP libraries. Also, another reason I want to try to do this is because I want to try and understand better how the email protocols work.
I'm not entirely sure where to start. Maybe I should take a look at the Postfix source code and try and make a python SMTP server. I know it would be much easier to just stick with the standard way of doing it instead of building from scratch, but like I said, this is more of an educational study for me to learn how it all works, I will very likely never use it in production.
So, give me ideas guys. Where should I start? If you know of an article that may enlighten me, please post it. --Thanks
It's been done before, but you can always make one if you want.
smtpd was a Python Module of the Week.
This is some good reading that was provided in a similar SO question here.
I've used this before when I was working on a project.
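As a starting point for experimenting, the client side of SMTP can be exercised with only the standard library. The sketch below builds a message with email.message and hands it to a server with smtplib. The addresses, the helper names, and the localhost:25 default are placeholders assumed for the example, and send_message is defined but not called here because it needs a running SMTP server. Note that the smtpd module mentioned above was removed from the standard library in Python 3.12, so a server-side experiment on a recent Python would need a different starting point.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Assemble a plain-text email with the usual headers."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_message(message, host="localhost", port=25):
    """Hand the message to an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


message = build_message(
    "alice@example.com", "bob@example.com", "Test", "Hello from Python."
)

# Printing the message shows the raw headers and body that go over the wire.
print(message.as_string())
```

Calling server.set_debuglevel(1) inside send_message prints the SMTP conversation (EHLO, MAIL FROM, RCPT TO, DATA), which is a handy way to see how the protocol works.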

Is there a Ruby/Python HTML reflow/layout library? [closed]

Closed 3 years ago.
I'm looking for a library in Ruby or Python that would take some HTML and CSS as the input and return data that contains the positions and sizes of the elements. If it helps, I don't need the info for all the elements but just the major divs of the page.
Scriptor, I think what you are looking for is more likely something in JavaScript than in Ruby or Python. The positions and sizes are essentially going to be determined by the rendering engine (the browser). You might consider using something like jQuery to loop through all of your desired objects, outputting the name of each object (like the DIV's ID) and its height and width. So, for what it's worth, I'd look at jQuery and its height() and width() methods if I were in your position. You never know, there may already be a jQuery plugin.
Both Ruby and Python have a Regex library. Why not search for things like /width=\"(\d+)px\"/ and /height:(\d+)px/. Use $1 to find the value in the group. I'm not a regex expert and I'm doing this from memory, so refer to any of the tutorials on the net for the correct syntax and variable usage, but that's where to start. Good luck,
bsperlinus
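In Python, the regex idea above might look something like this. The sample HTML string is invented for the example, and Python reads the captured group with match.group(1) rather than $1. Keep in mind that this only finds sizes that are written out explicitly in the markup; it does not compute the rendered layout, which is why the browser-based approach in the other answer may be needed.

```python
import re

# Invented sample markup with an explicit width attribute and inline height.
html = '<div id="main" width="640px" style="height:480px; color:red">...</div>'

# Capture the digits before "px" in a width attribute and an inline height rule.
width_match = re.search(r'width="(\d+)px"', html)
height_match = re.search(r"height:\s*(\d+)px", html)

# group(1) holds the first captured group; fall back to None when nothing matched.
width = int(width_match.group(1)) if width_match else None
height = int(height_match.group(1)) if height_match else None

print(width, height)
```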
