Django query phone numbers excluding brackets - python

I am trying to build a Django query that filters phone numbers (a CharField) while ignoring brackets and spaces.
e.g.
if I search for 0123456789 it should find (01) 234 567 89
Thanks

Well, you can either use regex, or you can reformat your search:
pn = '0123456789'
Model.objects.filter(phone='(%s) %s %s %s' % (pn[:2], pn[2:5], pn[5:8], pn[8:]))
Ideally you'd normalize all phone numbers and search for them in that format. If you check out django.contrib.localflavor.us's PhoneNumberField, it forces all new phone numbers to be saved in XXX-XXX-XXXX format, for instance. If you aren't normalizing the phone numbers somehow, you should be; dealing with multiple potential formats is not fun.
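A minimal sketch of that normalization idea (the helper name and the digits-only canonical form are assumptions, not the localflavor format):

import re

def normalize_phone(raw):
    # keep digits only, e.g. '(01) 234 567 89' -> '0123456789'
    return re.sub(r'\D', '', raw)

If both the stored value and the search term pass through the same helper, a plain exact-match filter is enough.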

You can also use regular expressions in your lookup. See: https://docs.djangoproject.com/en/dev/ref/models/querysets/#s-regex
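For this particular case, a hedged sketch of a regex lookup (assuming the field is called phone, as in the snippet above, and that the match should succeed regardless of any non-digit formatting in the stored value):

import re

pn = re.sub(r'\D', '', '0123456789')   # digits only
pattern = r'\D*'.join(pn)              # '0\D*1\D*2...\D*9'
matches = Model.objects.filter(phone__regex=pattern)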

I never like to answer my own questions, but this solution might be helpful to others doing a similar thing:
I defined a function on the model using the @property decorator:
class MyModel(models.Model):
    # ... other fields ...
    phoneNumber = models.CharField(max_length=20)  # max_length is illustrative

    @property
    def raw_phone_number(self):
        # strip brackets, spaces and other formatting, returning just the digits
        return ''.join(c for c in self.phoneNumber if c.isdigit())
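With the property in place, a simple (if database-unfriendly) lookup can then be done in Python; a sketch, assuming the search term is already digits-only:

pn = '0123456789'
matches = [m for m in MyModel.objects.all() if m.raw_phone_number == pn]

This walks every row, so the normalization or __regex approaches above scale better.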

Related

Is there an R or Python function for separating information in non-delimited strings, where the information varies?

I am currently cleaning up a messy data sheet in which all the information sits in one Excel cell and the different characteristics are not delimited (no commas, spaces are random).
My problem is therefore to separate the different pieces of information without a delimiter I could use in my code (so a plain split won't work).
I assume I need to describe some characteristic of each piece of information so that it can be recognized, but I don't have a clue how to do that: I am quite new to Python and have only worked with R in the context of regression models and other statistical analysis.
Short data example:
INPUT:
"WMIN CBOND12/05/2022 23554132121"
or
"WalMaInCBND 12/05/2022-23554132121"
or
"WalmartI CorpBond12/05/2022|23554132121"
EXPECTED OUTPUT:
"Walmart Inc.", "Corporate Bond", "12/05/2022", "23554132121"
So each of these parts should be placed in its own column with the corresponding header (Company, Security, Maturity, Account Number).
As you can see, the input varies randomly, but I want the same output for each of the three inputs given above (I have over 200k data points with different companies, securities etc.).
The first problem is how to separate the information effectively without being able to rely on a systematic pattern.
The second problem (lower priority) is how to identify the company without setting up a dictionary of 50 different spellings for each of 50k companies.
Thanks for your help!
I recommend first introducing useful separators where possible and constructing a dictionary of replacements for processing with regular expressions.
import re

s = 'WMIN CBOND12/05/2022 23554132121'

# CAREFUL: this is not a real date regex, it should just
# illustrate the principle;
# see https://stackoverflow.com/a/15504877/5665958 for
# a good US date regex
date_re = re.compile('([0-9]{2}/[0-9]{2}/[0-9]{4})')

# prepend a whitespace before the date:
# search for the date within the string and replace it with
# itself plus a leading whitespace; \1 means "insert the first
# capture group", which in our case is the date
s = re.sub(date_re, r' \1', s)

# split on one or more whitespaces and insert a
# separator (';') to make working with the string easier
s = ';'.join(s.split())

# build a dictionary of replacements
replacements = {
    'WMIN': 'Walmart Inc.',
    'CBOND': 'Corporate Bond',
}

# apply each replacement as a substitution;
# a better, but more involved, solution for this is given here:
# https://stackoverflow.com/a/15175239/5665958
for pattern, r in replacements.items():
    s = re.sub(pattern, r, s)

# use our custom separator to split the parts
out = s.split(';')
print(out)
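For the sample string this prints ['Walmart Inc.', 'Corporate Bond', '12/05/2022', '23554132121'].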
Using python and regular expressions:
import re

def make_filter(pattern):
    pattern = re.compile(pattern)
    def filter(s):
        filtered = pattern.match(s)
        return filtered.group(1), filtered.group(2), filtered.group(3), filtered.group(4)
    return filter

filter = make_filter(r"^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$")
filter("WMIN CBOND12/05/2022 23554132121")
The make_filter function is just a utility that lets you swap in a different pattern. It returns a function that filters the output according to that pattern. I use it with the pattern "^([a-zA-Z]+)\s([a-zA-Z]+)(\d+/\d+/\d+)\s(\d+)$", which expects some text, a space, some text, a date, a space, and a number. If you want to adapt this pattern, provide more info about the format. The output will be ("WMIN", "CBOND", "12/05/2022", "23554132121").
Welcome! Yeah, we would definitely need to see more examples, and regex seems to be the way to go... but since there seems to be no structure, I think it's better to treat this as separate steps.
We KNOW there's a date of the form (X)X/(X)X/XXXX (i.e. one- or two-digit day, one- or two-digit month, four-digit year, with or without the slashes, right?) and after that there's a number. So solve that part first, leaving only the first two categories. That's actually the easy part :) so don't lose heart!
If these two categories might not have ANY delimiter (for example WMINCBOND 12/05/202223554132121), or the delimiters are not always delimiters (for example IMAGINARY COMPANY X CBOND), then you're in deep trouble. :) BUT this is what we can do:
Gather a list of all the codes (hopefully you have that).
Use str_detect() on each code and see if you can recognize the exact string anywhere in the dataset (if you do have the codes, let me know and I'll write the code to do this part).
What's left after identifying the code will be the CBOND, whatever that is... so do that part last: what's left of the string will be that. Alternatively, you can use the same str_detect() if you have a list of whatever the CBOND items are.
ONLY AFTER you've identified everything should you replace the codes with what they stand for.
If you have the code list, let me know and I'll post the code.
Edit:
s = c("WMIN CBOND12/05/2022 23554132121",
"WalMaInCBND 12/05/2022-23554132121",
"WalmartI CorpBond12/05/2022|23554132121")
ID = gsub("([a-zA-Z]+).*","\\1",s)
ID2 = gsub(".* ([a-zA-Z]+).*","\\1",s)
date = gsub("[a-zA-Z ]+(\\d+\\/\\d+\\/\\d+).*","\\1",s)
num = gsub("^.*[^0-9](.*$)","\\1",s)
data.frame(ID=ID,ID2=ID2,date=date,num=num,stringsAsFactors=FALSE)
           ID                                ID2       date         num
1        WMIN                              CBOND 12/05/2022 23554132121
2 WalMaInCBND WalMaInCBND 12/05/2022-23554132121 12/05/2022 23554132121
3    WalmartI                           CorpBond 12/05/2022 23554132121
This works for cases 1 and 3, but I haven't figured out a logic for the second case: how can we know where to split the string containing the company and the security if they are not separated?
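A hedged Python sketch of the code-list idea from the steps above (the code list itself and the prefix-matching rule are assumptions):

known_codes = {'WMIN': 'Walmart Inc.', 'WalMaIn': 'Walmart Inc.', 'WalmartI': 'Walmart Inc.'}

def find_company(s):
    # try longer codes first so short codes don't shadow longer ones
    for code in sorted(known_codes, key=len, reverse=True):
        if s.startswith(code):
            return known_codes[code], s[len(code):]
    return None, s

Whatever remains after the company code is stripped can then be checked against a security list in the same way.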

Python random character string repeated 7/2000 records

I am using the below to generate a random set of characters and numbers:
tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])
I thought that this was a decent method: 36 characters long, with each character being one of 62 possible options (upper- and lowercase letters plus digits). Should be a good amount of randomness, right?
Then, I was running a query off an instance with what I thought was a unique tag. Turns out, there were SEVEN (7) records with the same "random" tag. So, I opened the DB, and ran a query to see the repeatability of my tags.
Turns out that not only does mine show up 7 times, but there are a number of tags that repeatedly appear over and over again. With approximately 2000 rows, it clearly should not be happening.
Two questions:
(1) What is wrong with my approach, and why would it be repeating the same tag so often?
(2) What would be a better approach to get unique tags for each record?
Here is the code I am using to save this to the DB. While it is written in Django, clearly this is not a django related question.
class Note(models.Model):
    # ... other fields ...
    def save(self, *args, **kwargs):
        import random
        import string
        self.tag = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(36)])
        super(Note, self).save(*args, **kwargs)
The problems with your approach:
True randomness/crypto is hard; you should use tested existing solutions instead of implementing your own.
Uniqueness isn't guaranteed: while it's 'unlikely', nothing prevents the same string from being generated more than once.
A better solution is to not reinvent the wheel and to use the uuid module, a common solution for generating unique identifiers:
import uuid
tag = uuid.uuid1()
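A minimal sketch of wiring this into the model (the answer uses uuid1; uuid4().hex, the field length and unique=True here are choices of this sketch, not part of the answer):

import uuid
from django.db import models

def make_tag():
    # uuid4 is randomly generated; .hex gives a 32-character string
    return uuid.uuid4().hex

class Note(models.Model):
    tag = models.CharField(max_length=32, unique=True, default=make_tag)

The unique=True constraint also lets the database reject the astronomically unlikely duplicate.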
Use a cryptographically secure PRNG with random.SystemRandom(). It will use the PRNG of whatever system you are on.
tag = ''.join(random.SystemRandom().choice(string.ascii_letters + string.digits) for n in xrange(36))
Note that there is no need to pass this as a list comprehension to join().
There are 62^36 possible combinations (a number with 65 digits), so duplicates should be extremely rare, even if you take the birthday paradox into consideration.
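A back-of-the-envelope check of that claim, using the standard birthday-paradox approximation n(n-1)/2N with the row count from the question:

n = 2000          # approximate number of rows
N = 62 ** 36      # possible 36-character tags
print(n * (n - 1) / (2.0 * N))   # ~6e-59, i.e. effectively zero

Seven identical tags in about 2,000 rows is therefore not plain bad luck, which is why the answers above point to uuid or SystemRandom.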

get string with parsing in python list

I have a list like this:
["<name:john student male age=23 subject=\computer\sience_{20092973}>",
"<name:Ahn professor female age=61 subject=\computer\math_{20092931}>"]
I want to get "student" using {20092973} and "professor" using {20092931}.
My expected result 1 (when the input is {20092973}):
"student"
My expected result 2 (when the input is {20092931}):
"professor"
I have already been searching but couldn't find anything... sorry.
How can I do this?
I don't think you should be doing this in the first place. Unlike your toy example, your real problem doesn't involve a string in some clunky format; it involves a Scapy NetworkInterface object. Which has attributes that you can just access directly. You only have to parse it because for some reason you stored its string representation. Just don't do that; store the attributes you actually want when you have them as attributes.
The NetworkInterface object isn't described in the documentation (because it's an implementation detail of the Windows-specific code), but you can interactively inspect it like any other class in Python (e.g., dir(ni) will show you all the attributes), or just look at the source. The values you want are name and win_name. So, instead of print ni, just do something like print '%s,%s' % (ni.name, ni.win_name). Then, parsing the results in some other program will be trivial, instead of a pain in the neck.
Or, better, if you're actually using this in Scapy itself, just make the dict directly out of {ni.win_name: ni.name for ni in nis}. (Or, if you're running Scapy against Python 2.5 or something, dict((ni.win_name, ni.name) for ni in nis).)
But to answer the question as you asked it (maybe you already captured all the data and it's too late to capture new data, so now we're stuck working around your earlier mistake…), there are three steps to this: (1) Figure out how to parse one of these strings into its component parts. (2) Do that in a loop to build a dict mapping the numbers to the names. (3) Just use the dict for your lookups.
For parsing, I'd use a regular expression. For example:
<name:\S+\s(\S+).*?\{(\d+)\}>
Now, let's build the dict:
import re

r = re.compile(r'<name:\S+\s(\S+).*?\{(\d+)\}>')
matches = (r.match(thing) for thing in things)
d = {match.group(2): match.group(1) for match in matches}
And now:
>>> d['20092973']
'student'
Code:
def grepRole(role, lines):
    return [line.split()[1] for line in lines if role in line][0]

l = ["<name:john student male age=23 subject=\computer\sience_{20092973}>",
     "<name:Ahn professor female age=61 subject=\computer\math_{20092931}>"]

print(grepRole("{20092973}", l))
print(grepRole("{20092931}", l))
Output:
student
professor
current_list = ["<name:john student male age=23 subject=\computer\sience_{20092973}>",
                "<name:Ahn professor female age=61 subject=\computer\math_{20092931}>"]

def get_identity(code):
    print([row.split(' ')[1] for row in current_list if code in row][0])

get_identity("{20092973}")
Regular expressions are good, but for me, a rookie, regular expressions are another big problem...

Validate email local component

I'm writing a registration form that only needs to accept the local component of a desired email address. The domain component is fixed to the site. I am attempting to validate it by selectively copying from validators.validate_email which Django provides for EmailField:
email_re = re.compile(
r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*" # dot-atom
# quoted-string, see also http://tools.ietf.org/html/rfc2822#section-3.2.5
r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-\011\013\014\016-\177])*"'
r')@((?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$)' # domain
r'|\[(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\]$', re.IGNORECASE) # literal form, ipv4 address (SMTP 4.1.3)
validate_email = EmailValidator(email_re, _(u'Enter a valid e-mail address.'), 'invalid')
Following is my code. My main issue is that I'm unable to adapt the regex. At this point I'm only testing it in a regex tester at http://www.pythonregex.com/ however it's failing:
^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$
This seems to be passing undesirable characters such as ?
The entire code for my Field, which is not necessarily relevant at this stage but I wouldn't mind some comment on it would be:
class LocalEmailField(CharField):
    email_local_re = re.compile(r"^([-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*)$", re.IGNORECASE)
    validate_email_local = RegexValidator(email_local_re, (u'Enter a valid e-mail username.'), 'invalid')
    default_validators = [validate_email_local]
EDIT: To clarify, the user is only entering the text BEFORE the @, hence why I have no need to validate the @domain.com in the validator.
EDIT 2: So the form field and label will look like this:
Desired Email Address: [---type-able area---] @domain.com
You say "undesirable characters such as ?", but I think you're mistaken about what characters are desirable. The original regex allows question marks.
Note that you can also define your own validator that doesn't use a massive regex, and have some chance of decoding the logic later.
Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems. - Jamie Zawinski
Checking via regex is an exercise in wasting your time. The best way is to attempt delivery; that way you not only verify the email address, but also whether the mailbox is actually active and can receive mail.
Otherwise you'll end up with an ever-expanding regular expression that can't possibly hope to match all the rules.
"Haha boo hoo woo woo!"@foo.com is a valid address, and so is qwerterukeriouo@gmail.com.
Instead, offer the almost-standard "Please click on the link in the email we sent to blahblah@goo.com to verify your address." approach.
If you want to create email addresses, then you can write your own rules on what can be part of the local component; they can be a subset of the officially allowed characters in the RFC.
For example, a conservative rule (that doesn't use regular expressions):
import string

allowed_chars = string.digits + string.letters + '-'
if [x for x in user_input if x not in allowed_chars]:
    print 'Sorry, invalid characters'
else:
    if user_input[0] in string.digits + '-':
        print 'Cannot start with a number or `-`'
    else:
        if check_if_already_exists(user_input):
            print 'Sorry, already taken'
        else:
            print 'Congratulations!'
I'm still new to Django and Python, but why reinvent the wheel and maintain your own regex? If, apart from wanting users to enter only the local portion of their email address, you're happy with Django's built-in EmailField, you can subclass it quite easily and tweak the validation logic a bit:
DOMAIN_NAME = u'foo.com'

class LocalEmailField(models.EmailField):
    def clean(self, local_part):
        whole_address = '%s@%s' % (local_part, DOMAIN_NAME)
        clean_address = super(LocalEmailField, self).clean(whole_address)
        # Can do more checking here if necessary
        clean_local, at_sign, clean_domain = clean_address.rpartition('@')
        return clean_local
Have you looked at the documentation for Form and Field Validation and the .clean() method?
If you want to do it 100% correctly with regex, you need to use an engine with some form of extended regex that allows matching nested parentheses.
Python's default engine does not allow this, so you're better off compromising with a very simple (permissive) regex.
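A hedged sketch of what such a simple, permissive check might look like (the character set and rules here are assumptions, not the RFC grammar):

import re

LOCAL_PART_RE = re.compile(r'^[A-Z0-9][A-Z0-9._%+-]*$', re.IGNORECASE)

def is_valid_local_part(value):
    # letters, digits and a few common separators; must start with a letter or digit
    return bool(LOCAL_PART_RE.match(value))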

Irregular String Parsing on Python

I'm new to Python/Django and I am trying to get more useful information out of my scraper. Currently, the scraper takes a list of comic book titles and correctly divides them into a CSV list with three parts (Published Date, Original Date, and Title). I then pass the current date and title through to different parts of my database, which I do in my loader script (convert mm/dd/yy into yyyy-mm-dd, save to the "pub_date" column; the title goes to the "title" column).
A common string can look like this:
10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)
I am successfully grabbing the date, but the title is trickier. In this instance, I'd ideally like to fill three different columns with the information after the second "|": the title should go to "title", a CharField; the number 12 (after the '#') should go into the DecimalField "issue_num"; and everything between the '(' and ')' should go into the "Special" CharField. I am not sure how to do this kind of rigorous parsing.
Sometimes there are multiple '#'s (one comic in particular is described as a bundle, "Containing issues #90-#95"), and several have multiple '()' groups (such as "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover)").
What would be a good road to start down to crack this problem? My knowledge of if/else statements quickly fell apart for the more complicated lines. How can I efficiently and (if possible) pythonically parse these lines and subdivide them so I can later slot them into the correct place in my database?
Use the regular expression module re. For example, if you have the third |-delimited field of your sample record in a variable s, then you can do
import re

match = re.match(r"^(?P<title>[^#]*) #(?P<num>[0-9]+) \((?P<special>.*)\)$", s)
title = match.group('title')
issue = match.group('num')
special = match.group('special')
If the regex doesn't match at all, re.match returns None and the last three lines will fail. Adapt the RE until it parses everything you want.
Parsing the title is the hard part; it sounds like you can handle the dates etc. yourself. The problem is that there is no single rule that can parse every title: there are many rules, and you can only guess which one works for a particular title.
I usually handle this by creating a list of rules, from most specific to general and try them out one by one until one matches.
To write such rules you can use the re module or even pyparsing.
The general idea goes like this:
class CantParse(Exception):
    pass

# one rule to parse one kind of title
import re

def title_with_special(title):
    """ accepts only a title of the form
    <text> #<issue> (<special>) """
    m = re.match(r"[^#]*#(\d+) \(([^)]+)\)", title)
    if m:
        return m.group(1), m.group(2)
    else:
        raise CantParse(title)

def parse_extra(title, rules):
    """ tries to parse extra information from a title using the rules """
    for rule in rules:
        try:
            return rule(title)
        except CantParse:
            pass
    # nothing matched
    raise CantParse(title)

# lets try this out
rules = [title_with_special]  # list of rules to apply, add more functions here
titles = ["Stan Lee's Traveler #12 (10 Copy Incentive Cover)",
          "Betrayal Of The Planet Of The Apes #1 (Of 4)(25 Copy Incentive Cover) )"]
for title in titles:
    try:
        issue, special = parse_extra(title, rules)
        print "Parsed", title, "to issue=%s special='%s'" % (issue, special)
    except CantParse:
        print "No matching rule for", title
As you can see the first title is parsed correctly, but not the 2nd. You'll have to write a bunch of rules that account for every possible title format in your data.
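As a sketch, here is one additional rule for the bundle format mentioned in the question ("Containing issues #90-#95"); the return convention (issue range as a string, empty special text) is an assumption, and CantParse / title_with_special come from the code above:

import re

def title_with_issue_range(title):
    """ accepts titles containing an issue range like #90-#95 """
    m = re.match(r"[^#]*#(\d+)-#(\d+)", title)
    if m:
        return "%s-%s" % (m.group(1), m.group(2)), ""
    else:
        raise CantParse(title)

# most specific rule first, so ranges are tried before single issues
rules = [title_with_issue_range, title_with_special]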
Regular expressions are the way to go. But if you feel uncomfortable writing them, you can try a small parser that I wrote (https://github.com/hgrecco/stringparser). It translates a string format (PEP 3101) to a regular expression. In your case, you would do the following:
>>> from stringparser import Parser
>>> p = Parser(r"{date:s}\|{date2:s}\|{title:s}#{issue:d} \({special:s}\)")
>>> x = p("10/12/11|10/12/11|Stan Lee's Traveler #12 (10 Copy Incentive Cover)")
>>> x
OrderedDict([('date', '10/12/11'), ('date2', '10/12/11'), ('title', "Stan Lee's Traveler "), ('issue', 12), ('special', '10 Copy Incentive Cover')])
>>> x.issue
12
The output in this case is an (ordered) dictionary. This will work for simple cases, and you might be able to tweak it to catch multiple issues or multiple '()' groups.
One more thing: notice that in the current version you need to manually escape regex characters (i.e. if you want to find |, you need to type \|). I am planning to change this soon.
