I want to create a sane/safe app bundle identifier (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).
(It doesn't matter to me whether the function is Cocoa, ObjC, Python, etc.)
(This is related to the filename question and the bundle name question, but the bundle identifier is much more restrictive. I think it cannot even contain spaces, and I would also want to strip out the dots and add my own prefix.)
I think Xcode also has some function to do that automatically from the app name. Maybe there is some standard function in Cocoa to do that.
Bundle identifiers are meant to be in reverse URL form (guaranteeing global uniqueness):
com.apple.xcode, for example
So really you need a domain name, then you can invent whatever scheme you like below that.
Given this, and some knowledge of the characters in your input, you can either scan through your input composing a new string with only the bits you want, or use methods like stringByReplacingOccurrencesOfString:withString: and, if you like, lowercaseString.
The permitted characters in bundle identifiers are named in the Property List Documentation as:
The bundle ID string must be a uniform type identifier (UTI) that contains only alphanumeric (A-Z,a-z,0-9), hyphen (-), and period (.) characters. The string should also be in reverse-DNS format.
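Since the question allows Python, here is a minimal sketch of the "compose a new string with only the bits you want" approach. The function names, the `com.example` prefix, and the `app` fallback are all illustrative assumptions, not anything Xcode or Cocoa provides.

```python
import re
import unicodedata


def bundle_id_component(name, fallback="app"):
    """Reduce an arbitrary Unicode string to [a-z0-9-] (hypothetical helper)."""
    # Decompose accented characters, then drop everything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of disallowed characters (including dots) into one hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "-", ascii_name).strip("-").lower()
    # Nothing usable may remain (e.g. for purely non-Latin input).
    return cleaned or fallback


def make_bundle_id(name, prefix="com.example"):
    """Prepend an assumed reverse-DNS prefix to the sanitized name."""
    return f"{prefix}.{bundle_id_component(name)}"
```

Note that different inputs can collapse to the same identifier, so uniqueness still has to come from your own prefix or naming scheme.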
I am looking for a way to force the groupSeparator symbol of a doublespinbox.
For context, one of my programs uses a numerical input (doublespinbox) + unit choices (radio buttons) to form a number. It looks roughly like this:
voltage [ 5 ]   o V
                o mV
                o µV
I use a group separator to make reading easier. On a French machine I get a satisfying display where, for example, one thousand and one look like 1 000 and 1,000 respectively. On an English machine, I get 1,000 and 1.000, which can easily be confused. How can I force the group separator to always be a space?
Alternatively, I believe that a solution could be to force the locale of the program as answered here but I'm always interested in seeing if custom solutions are possible. Otherwise, I'll stick to
self.setLocale(QtCore.QLocale(QtCore.QLocale.French))
Another possibility is to reimplement your own subclass for the spinbox and override the textFromValue() function:
class SpaceSeparatorSpin(QtWidgets.QDoubleSpinBox):
    def textFromValue(self, value):
        # Format with the current locale, then swap its group separator for a space.
        text = self.locale().toString(float(value), 'f', self.decimals())
        return text.replace(self.locale().groupSeparator(), ' ')
In this way, we use the current (default) locale to transform the value to a string and then return the string with the separator replaced with the space.
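Outside of Qt, the same "format first, then replace the group separator" idea can be sketched in plain Python. This is only an illustration: the function name is made up, and unlike the locale-based override it always uses "." as the decimal point.

```python
def format_with_space_groups(value, decimals=2):
    """Format a number with spaces between groups of three digits (sketch)."""
    # The "," format option inserts a comma as thousands separator;
    # it is then replaced by a space, exactly like in textFromValue() above.
    return f"{value:,.{decimals}f}".replace(",", " ")
```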
There are some issues with both approaches, though.
Using a custom locale for a single widget class can result in unexpected behavior when using copy&paste functions: if the user lives in a country that uses the point for the decimals, a simple "50.2" value that might be taken from another source will not be pasted, as the validator will not recognize that string as valid (for the French locale, it should be "50,2").
Using the textFromValue override has the opposite problem if the user wants to copy from a subclassed spinbox to another, as the space separator will make the validator ignore the string when the spinbox calls valueFromText().
To avoid all that, you could override the validate() function: if the base implementation returns a valid or intermediate value, return it, otherwise validate it on your own being careful about the current locale and the possibility of the "double input possibilities" (with or without spaces, inverted points/commas for decimals and group separators); note that while pasting a "space-separated" value on a locale that uses them works, QAbstractSpinBox doesn't accept spaces when typing.
Besides all that, keep in mind that using "de-localized" decimal points and separators is not a good thing. While it might seem fine to you, users accustomed to other punctuation will probably find it very annoying, especially those who are used to the numeric pad: usually, the decimal point key of the pad follows the system locale, so users whose locale uses the point for decimals won't be able to type decimals from the pad, forcing them to move their hand away from it to type the comma.
I have a route:
@app.route("/login/<user>/<timestamp>")
def user(user, timestamp): ...
But I need it in this form:
@app.route("/login/<user><timestamp>")
def user(user, timestamp): ...
i.e. without the slash ('/').
Is there any way to do it?
Short answer: it is possible, provided the two parameters have non-overlapping patterns. With the default wildcard pattern, however (you did not specify a converter), practically all of the content will end up in user. That being said, it is advisable to have a clear separator.
As specified in the documentation, you define variables by writing them like HTML tags, e.g. <var>, and you can also specify a converter, e.g. <converter:var>. If you do not specify a converter, the parameter is assumed to be a string that cannot contain slashes.
There are however other converters, like int, float, path and uuid.
If the patterns are written in such a way that it is clear where the first pattern ends and the second begins, then this can be handled. For example:
@app.route("/login/<int:day><user>")
can work, given user can not start with a digit, since here once the sequence of digits ends, Flask will parse the <user> parameter.
By writing @app.route("/login/<user><timestamp>"), however, the two patterns overlap: without a parsing strategy, any split could be a valid one. Since the engine is greedy, if I recall correctly, in practice user will swallow practically all of the characters, leaving timestamp with next to nothing.
Since the default string converter does not accept a slash, the slash acts as a clear separator in your example: it cannot appear in either variable.
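The greedy behaviour can be illustrated with plain regular expressions. This is only a sketch: it approximates a default string variable as "one or more non-slash characters" and an int variable as "one or more digits", which is an assumption about how the routing patterns behave rather than Flask's actual implementation.

```python
import re

# Two adjacent string-like variables: both accept the same characters.
overlapping = re.compile(r"^/login/(?P<user>[^/]+)(?P<timestamp>[^/]+)$")

# An int-like variable followed by a string-like one: the digits end the first part.
separable = re.compile(r"^/login/(?P<day>\d+)(?P<user>[^/]+)$")

m1 = overlapping.match("/login/alice1700000000")
# The greedy first group keeps everything except the single character
# that the second group needs in order to match at all.
print(m1.group("user"), m1.group("timestamp"))

m2 = separable.match("/login/17alice")
# Here the split is unambiguous as long as the user name does not start with a digit.
print(m2.group("day"), m2.group("user"))
```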
My project (unrelated to this question, just context) is an ML classifier. I'm trying to improve it and have found that, when I stripped URLs from the text given to it, some of the URLs had been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub(r'(www|http).*?(org |net |edu |com |be |tt |me |ms )', '', tweet)
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect, but it works. However, I thought I might try to preemptively remove Twitter URLs only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier's accuracy. This would get rid of the string of characters that occurs after a link, specifically pictures, which make up a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
In this case I escaped the slashes / with backslashes, which is needed wherever / is the regex delimiter (as in the demo); Python's re module does not require it, but it does no harm. I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.
What a positive lookbehind does is specify some mandatory text that must come before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if that text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. This is an additional complication, since lookbehinds (in Python's re module, at least) only accept fixed-width patterns.
So you can just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
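Putting the two steps together, a minimal sketch in Python could look like this (the URL is the placeholder example from above, and the variable names are illustrative):

```python
import re

url = "https://twitter.com/username/status/ID"

# Fixed-width lookbehind: match only what follows "https://twitter.com/".
match = re.search(r"(?<=https://twitter\.com/).*", url)
rest = match.group(0)                # everything after the host part

# Split once at the first slash to drop the variable-length username.
after_username = rest.split("/", 1)[1]
print(after_username)
```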
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?
Does anyone know a regular expression to match Domain.CCTLD? I don't want subdomains, only the "atomic domain". For example, docs.google.com doesn't get matched, but google.com does. However, this gets complicated with stuff like .co.uk, CCTLDs. Does anyone know a solution? Thanks in advance.
EDIT: I've realized I also have to deal with multiple subdomains, like john.doe.google.co.uk. Need a solution now more than ever :P.
It sounds like you are looking for the information available through the Public Suffix List project.
A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are ".com", ".co.uk" and "pvt.k12.wy.us". The Public Suffix List is a list of all known public suffixes.
There is no single regular expression that will reasonably match the list of public suffixes. You will need to implement code to use the public suffix list, or find an existing library that already does so.
Based on your comment above, I'm going to reinterpret the question -- rather than making a regex that will match them, we'll create a function that will match them, and apply that function to filter a list of domain names to only include first class domains, e.g. google.com, amazon.co.uk.
First, we'll need a list of TLDs. As Greg mentioned, the public suffix list is a great place to start. Let's assume you've parsed the list into a Python list called suffixes. If this isn't something you're comfortable with, comment and I can add some code that will do it.
suffixes = parse_suffix_list("suffix_list.txt")
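A possible shape for parse_suffix_list, offered as a rough sketch: it assumes the file uses the Public Suffix List text format (one rule per line, `//` comments) and it deliberately skips wildcard (`*.`) and exception (`!`) rules, which a complete implementation would have to honour. The helper parse_suffix_lines is a made-up name.

```python
def parse_suffix_lines(lines):
    """Collect the plain suffix rules from an iterable of list lines."""
    suffixes = set()
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comments.
        if not stripped or stripped.startswith("//"):
            continue
        # A rule ends at the first whitespace character.
        rule = stripped.split()[0]
        # Simplification: wildcard and exception rules are ignored here.
        if rule.startswith(("*", "!")):
            continue
        suffixes.add(rule.lower())
    return suffixes


def parse_suffix_list(path):
    """Read the suffix rules from a local copy of the list."""
    with open(path, encoding="utf-8") as handle:
        return parse_suffix_lines(handle)
```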
Now we'll need code that identifies whether a given domain name matches the pattern some-name.suffix:
def is_domain(d):
    for suffix in suffixes:
        # Require the dot so that e.g. "xcom" does not match the suffix "com"
        if d.endswith('.' + suffix):
            # Get the base domain name without the suffix and its leading dot
            base_name = d[:-(len(suffix) + 1)]
            # If it contains '.', it's a subdomain.
            if base_name and '.' not in base_name:
                return True
    # If we get here, no matches were found
    return False
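For the multi-level case from the question's edit (john.doe.google.co.uk), a variant that looks for the longest matching suffix may be easier to reason about. This is a sketch under assumptions: registrable_domain is a hypothetical name, the suffix set below is a tiny illustrative subset, and wildcard/exception rules of the real list are not handled.

```python
def registrable_domain(host, suffixes):
    """Return the suffix plus one label, or None if no such domain exists."""
    labels = host.lower().split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The host is itself a public suffix (e.g. "co.uk").
                return None
            return ".".join(labels[i - 1:])
    return None


example_suffixes = {"com", "uk", "co.uk"}   # illustrative subset only
print(registrable_domain("john.doe.google.co.uk", example_suffixes))
```

A host is then a first-class domain exactly when registrable_domain(host, suffixes) == host.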
I would probably solve this by getting a complete list of TLDs and using it to create the regex. For example (in Ruby, sorry, not a Pythonista yet):
tld_alternation = ['\.com','\.co\.uk','\.eu','\.org',...].join('|')
regex = /^[a-z0-9]([a-z0-9\-]*[a-z0-9])?(#{tld_alternation})$/i
I don't think it's possible to properly differentiate between a real two-part TLD and a subdomain without knowing the actual list of TLDs (i.e. you could always construct a subdomain that looks like a TLD if you knew how the regex worked).
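For Python readers, the same idea (building the alternation from a list of TLDs) might be sketched as follows. The list here is a short illustrative subset, not a complete set of suffixes.

```python
import re

tlds = ["com", "co.uk", "eu", "org"]          # illustrative subset only

# Escape each suffix (the dots in particular) and join them with "|".
tld_alternation = "|".join(re.escape("." + tld) for tld in tlds)

# One label without dots, followed by exactly one known suffix.
domain_re = re.compile(
    rf"^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:{tld_alternation})$",
    re.IGNORECASE,
)

print(bool(domain_re.match("google.co.uk")), bool(domain_re.match("docs.google.com")))
```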
I am using Python and would like a simple API or regex to check a domain name's validity. By validity I mean syntactical validity, not whether the domain name actually exists on the Internet.
Any domain name is (syntactically) valid if it's a dot-separated list of identifiers, each no longer than 63 characters, and made up of letters, digits and dashes (no underscores).
So:
r'[a-zA-Z\d-]{,63}(\.[a-zA-Z\d-]{,63})*'
would be a start. Of course, these days some non-ASCII characters may be allowed (a very recent development), which changes the parameters a lot -- do you need to deal with that?
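The rule stated above (dot-separated labels of letters, digits and dashes, each at most 63 characters) can also be checked label by label instead of with one large pattern. A minimal sketch, with a made-up function name; unlike the {,63} pattern, it rejects empty labels:

```python
import re

# One label: 1 to 63 ASCII letters, digits or dashes.
LABEL_RE = re.compile(r"[A-Za-z0-9-]{1,63}")


def has_valid_domain_syntax(name):
    """Check every dot-separated label against the simple label rule."""
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))
```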
r'^(?=.{4,255}$)([a-zA-Z0-9]([a-zA-Z0-9-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z0-9]{2,6}$'
The lookahead makes sure that the name has a minimum of 4 (a.in) and a maximum of 255 characters.
One or more labels (separated by periods), each 1 to 63 characters long, starting and ending with an alphanumeric character, with alphanumeric characters and hyphens in the middle.
Followed by a top-level domain name (whose maximum length is 6, for museum).
Note that while you can do something with regular expressions, the most reliable way to test for valid domain names is to actually try to resolve the name (with socket.getaddrinfo):
from socket import getaddrinfo
# Raises socket.gaierror if the name cannot be resolved
result = getaddrinfo("www.google.com", None)
print(result[0][4])
Note that technically this can leave you open to DoS (if someone submits thousands of invalid domain names, it can take a while to resolve invalid names) but you could simply rate-limit someone who tries this.
The advantage of this is that it'll catch "hotmail.con" as invalid (instead of "hotmail.com", say) whereas a regex would say "hotmail.con" is valid.
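Wrapped into a yes/no check, the lookup could be sketched like this. The function name and the injectable resolver parameter are illustrative choices (the latter just makes the function testable without network access); rate limiting is not shown.

```python
import socket


def resolves(name, resolver=socket.getaddrinfo):
    """Return True if the name can be resolved, False otherwise."""
    try:
        resolver(name, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed; UnicodeError: the name could not
        # even be encoded (e.g. an empty or over-long label).
        return False
    return True
```

Calling resolves("hotmail.con") performs a real DNS lookup, so the result depends on the network the code runs on.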
I've been using this:
(r'(\.|\/)(([A-Za-z\d]+|[A-Za-z\d][-])+[A-Za-z\d]+){1,63}\.([A-Za-z]{2,3}\.[A-Za-z]{2}|[A-Za-z]{2,6})')
to ensure that the name follows either a dot (www.) or a slash (http://), that a dash occurs only inside the name, and that suffixes such as gov.uk are matched too.
The answers are all pretty outdated relative to the spec at this point. I believe the pattern below matches the current spec correctly:
r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'
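A quick way to try the pattern above is to compile it and run a few names through it. The sample names below are made up for illustration.

```python
import re

hostname_re = re.compile(
    r'^(?=.{1,253}$)(?!.*\.\..*)(?!\..*)([a-zA-Z0-9-]{,63}\.){,127}[a-zA-Z0-9-]{1,63}$'
)

samples = ["example.com", "a..b", ".example.com", "exa_mple.com", "a" * 64 + ".com"]
for name in samples:
    # Print a shortened form of the name and whether the pattern accepts it.
    print(name[:20], bool(hostname_re.match(name)))
```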