So, I've been researching this for a while now and I couldn't find anything about detecting a URL in a string. The problem is that most results are about detecting whether a string IS a URL, not whether it contains one. The 2 results that look best to me are
Regex to find urls in string in Python
and
Detecting a (naughty or nice) URL or link in a text string
but the first requires http://, which is not something spammers would use (:P), and the second one doesn't look like a regex to me, and with my limited knowledge I don't know how to translate either of them. Something I have considered doing is using something dull like
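spamlist = [".com", ".co.uk", "etc"]
for word in string.split():  # split into words; iterating the string itself yields characters
    if any(word.endswith(tld) for tld in spamlist):
        do_stuff()  # placeholder for whatever should happen on a hit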
But that would honestly do more harm than good, and I am 100% sure there is a better way using regex or something else!
So if anyone knows anything that could help me I'd be very grateful! I've only been doing python for 1-2 months and not very intensively during this period but I feel like I'm making great progress and this one thing is all that's in the way, really.
EDIT: Sorry for not specifying earlier, I am looking to use this locally, not website (apache) based or anything similar. More trying to clean out any links from files I've got hanging around.
As I said in the comments,
the solution in "Detecting a (naughty or nice) URL or link in a text string" is a regex. When using it in Python, you should probably make it a raw string or escape the backslashes in it.
You really shouldn't reinvent the square wheel here, especially since spam filtering is an arms race domain (couldn't remember the exact English phrase for this)
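As a rough illustration of the raw-string advice, here is a minimal sketch. The pattern is a deliberately small, hypothetical one written for this example (it is not the regex from the linked answer), and the TLD list and the names `URL_RE`, `contains_url` and `strip_urls` are assumptions. It does not require http://, so it also catches bare domains.

```python
import re

# Hypothetical, simplified pattern: NOT the regex from the linked answer.
# Written as raw strings so the backslashes reach the regex engine unchanged.
URL_RE = re.compile(
    r"(?:https?://|www\.)?"      # optional scheme or leading www.
    r"(?:[a-z0-9-]+\.)+"         # one or more domain labels, each followed by a dot
    r"(?:com|net|org|co\.uk)\b"  # small illustrative TLD list (an assumption)
    r"(?:/\S*)?",                # optional path
    re.IGNORECASE,
)


def contains_url(text):
    """Return True if the text contains something that looks like a URL."""
    return URL_RE.search(text) is not None


def strip_urls(text):
    """Remove everything the pattern recognizes as a URL."""
    return URL_RE.sub("", text)


print(contains_url("cheap pills at example.com, buy now"))
print(strip_urls("see www.example.co.uk/page for details"))
```

A short TLD list like this will miss plenty of real links, which is exactly why a well-tested pattern is usually the better starting point.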
I am trying to correct badly written emails contained in a list, by searching differences in the most common domains. E.g: hotmal.com to hotmail.com.
The thing is, there are tons of variations of any single domain. It would be extremely helpful if someone knew of an algorithm in Python that can work as an autocorrect for email domains, or could tell me whether this is too complex a problem for a few lines of code.
Check Levenshtein distance starting at https://en.wikipedia.org/wiki/Levenshtein_distance
It is commonly used for auto-correct
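A minimal sketch of how that could look, assuming a hand-picked list of known domains and a cutoff of 2 edits (both `KNOWN_DOMAINS` and `max_distance` are illustrative choices, not anything from the question):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


# Assumption: a small list of domains you expect to see in your data.
KNOWN_DOMAINS = ["hotmail.com", "gmail.com", "yahoo.com", "outlook.com"]


def correct_domain(domain, max_distance=2):
    """Return the closest known domain, or the input if nothing is close enough."""
    domain = domain.lower()
    best = min(KNOWN_DOMAINS, key=lambda known: levenshtein(domain, known))
    return best if levenshtein(domain, best) <= max_distance else domain


print(correct_domain("hotmal.com"))
print(correct_domain("example.org"))
```

The cutoff matters: too high and legitimate but unusual domains get rewritten, too low and common typos slip through.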
What if you search for keywords in the domain? For hotmail.com, for example, you can search for hot, or something similar. Also, as @user10817019 wrote, you can combine it with checking the first and last letters of the domain.
Write a small script in your preferred language that takes domains that start with h and end with l, and replaces the entire string with hotmail, so it fixes everything in between. Search for mai in case they forgot the L. I had to do this the other day in vb.net to check my lists twice and correct bad data.
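In Python, that heuristic could be sketched roughly as follows. The function name `fix_hotmail` and the exact rule are assumptions made for illustration, and the rule is crude enough that it will also rewrite any other domain that happens to start with h and end with l.

```python
def fix_hotmail(domain):
    """Rewrite domains that look like a misspelled hotmail; leave others alone."""
    name, dot, rest = domain.lower().partition(".")
    # Starts with "h" and ends with "l", or contains "mai" when the "l" is missing.
    looks_like_hotmail = name.startswith("h") and (name.endswith("l") or "mai" in name)
    return "hotmail" + dot + rest if looks_like_hotmail else domain


for candidate in ["hotmal.com", "hotmai.com", "gmail.com"]:
    print(candidate, "->", fix_hotmail(candidate))
```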
I am using Python 3.4, Windows 10, and Visual Studio 2015. I am trying to make a program that scrapes phone numbers from websites formatted like this one.
I am using Beautiful Soup 4, and am trying to get the number of beds from the table. I have tried soup.select('.td') and it only returns an empty array, I am not sure what else to try.
Why not grab the entire page HTML as a string and then use a regular expression to parse it? Is that not where Python excels?
In case you are afraid of regex, here is a beginner-friendly tutorial:
https://regexone.com/
The syntax for Python might be slightly different:
https://docs.python.org/2/library/re.html
And I seriously hope you are not scraping phone numbers for nefarious purposes. I don't want a phone call from you :-).
Here is another Stack Overflow answer which gives a good starting regex:
https://stackoverflow.com/a/123666/5129424
Here's a regex for a 7 or 10 digit number, with extensions allowed, delimiters are spaces, dashes, or periods:
^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
Just because you "might mess it up" doesn't mean you shouldn't try it and test it. Regardless of what you do, you are either at the mercy of the structure of the page, which may change, or the format of the phone numbers, which may also change. There is no perfect solution.
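To show the idea of running a regex over the whole page, here is a minimal sketch. The pattern is a much simpler, hypothetical one than the regex above (it is not anchored with ^ and $, so it can find numbers in the middle of a string), and the HTML snippet and the name `find_phone_numbers` are made up for the example.

```python
import re

# Simplified, hypothetical pattern: optional area code, then 3 digits, a
# separator, and 4 digits. Only non-capturing groups are used, so findall()
# returns the full matched text.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\d{3}[\s.-])?\d{3}[\s.-]\d{4}")


def find_phone_numbers(text):
    """Return every substring of text that looks like a phone number."""
    return PHONE_RE.findall(text)


# Made-up page content standing in for the downloaded HTML.
html = "<td>Call (555) 234-5678</td><td>or 555.876.5432</td>"
print(find_phone_numbers(html))
```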
I'm making an email in plain text. Now I have a string with text in it which I insert into the email. However, I want this text to have a max character width.
So input text is for example:
This is the input text. It's very boring to read because it's only an example which is used to explain my problem better. I hope you can help me.
And I want it to become:
This is the input text. It's very
boring to read because it's only
an example which is used to explain
my problem better. I hope you can
help me.
Of course we need to take into account that you can't split in the middle of a word. It may get pretty tricky when you have symbols like ' and -, so I was wondering if there are tools that can do this for you? I've heard about NLTK but I couldn't find a solution there yet, and maybe it's a little bit overkill?
There is a textwrap library for just this:
http://docs.python.org/2/library/textwrap.html
Examples can be found there too. You likely want to use:
textwrap.fill(text, width)
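For example, with the text from the question and an assumed width of 35 characters (the width is my guess based on the desired output, not something the question states):

```python
import textwrap

text = (
    "This is the input text. It's very boring to read because it's only an "
    "example which is used to explain my problem better. I hope you can help me."
)

# fill() returns one string with newlines inserted at word boundaries,
# so no line is longer than the given width.
wrapped = textwrap.fill(text, width=35)
print(wrapped)
```

If you need the lines as a list instead of one string, textwrap.wrap(text, width) returns them separately.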
I now also found the preferred way to do this in a Django project (which I use). It's the built-in template tag wordwrap
{{ value|wordwrap:5 }}
https://docs.djangoproject.com/en/1.3/ref/templates/builtins/#wordwrap
I'm currently learning how to program plugins for SiriServer, in the hope of creating a bit of home automation using my phone. I'm trying to figure out how to program the pattern that the speech converted to text has to match in order to run the plugin.
I've learnt how to do short phrases, like this for example:
@register("en-US", ".*Start.*XBMC.*")
Though if I'm understanding it correctly, it's searching at random for the two words. If I were to say XBMC Start, it would probably work as well, but when I start working with wolframalpha, I need to be a bit more specific.
For example, speech to text saying "What's the weather like in Toronto?", somehow connects to this:
@register("en-US", "(what( is|'s) the )?weather( like)? in (?P<location>[\w ]+?)$")
What would all the extra symbols in that line mean that could connect these two together? I've tried messing around with a couple ideas but nothing seems to work the way I want it to. Any help is appreciated, thanks!
I will break down the example you provided so hopefully that is a good start, but searching for python regex would provide more thorough information.
The parentheses group the enclosed items, so the rest of the expression treats them as a single unit rather than as individual items. The pipes mean "or", and a question mark after a group means that portion may or may not be present. (?P<location>...) is a named group: whatever the input contains at that point is captured under the name "location". The $ at the end means the match must run to the end of the sentence. .* means anything at this place in the input is acceptable and is otherwise ignored. Hopefully that helps.
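You can try both patterns from the question directly with Python's re module. This is only a sketch: how SiriServer applies the pattern internally is not shown here, so the use of re.match and re.IGNORECASE is an assumption, and the test phrases are made up.

```python
import re

# Patterns copied from the question; the flags are an assumption.
WEATHER = re.compile(
    r"(what( is|'s) the )?weather( like)? in (?P<location>[\w ]+?)$",
    re.IGNORECASE,
)
START_XBMC = re.compile(r".*Start.*XBMC.*", re.IGNORECASE)

for phrase in ["What's the weather like in Toronto", "weather in New York"]:
    match = WEATHER.match(phrase)
    # The named group holds whatever matched at the "location" position.
    print(phrase, "->", match.group("location") if match else None)

# The order matters: "Start" has to appear before "XBMC".
print(bool(START_XBMC.match("Please start XBMC now")))
print(bool(START_XBMC.match("XBMC start")))
```

Note that the second pattern is not a search for the two words in any order, so "XBMC Start" does not match it.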
Maybe this is a dumb question, but I don't get it, so I apologize :)
I have an RTF document, and I want to change it. E.g. there is a table, I want to duplicate a row and change the text in the second row in my code in an object-oriented way.
I think pyparsing should be the way to go, but I'm fiddling around for hours and don't get it. I'm providing no example code because it's all nonsense I think :/
Am I on the right path or is there a better approach?
Has anyone done something like that before?
RTFs are text documents with special "symbols" to create the formatting (see http://search.cpan.org/~sburke/RTF-Writer/lib/RTF/Cookbook.pod#RTF_Document_Structure; it seems that Perl has a good RTF library, though), so yes, PyParsing is a good way to go. You have to learn the structure and then parse it. There are Perl code examples on the page I mentioned; if you are lucky, you can translate them to Python with some effort.
There is a basic RTF module available for python. Check - http://pyrtf.sourceforge.net/
Hope that helps you a little.