I am trying to fix older code someone wrote years ago in Python. I believe the "\d\d\d\d" refers to the number of text characters, and 0-9A-Z limits the type of input, but I can't find any documentation on this.
idTypes = {"PFI":"\d\d\d\d",
"VA HOSPITAL ID":"V\d\d\d",
"CERTIFICATION NUMBER":"\d\d\d-[A-Z]-\d\d\d",
"MORTUARY FIRM ID":"[0-9]",
"HEALTH DEPARTMENT ID":"[0-9]",
"NYSDOH OFFICE ID":"[0-9]",
"ACF ID":"AF\d\d\d\d",
"GENERIC NUMBER ID":"[0-9]",
"GENERIC ID":"[A-Za-z0-9]",
"OASAS FAC":"[0-9]",
"OMH PSYCH CTR":"[0-9A-Z]"}
For example, the PFI values seem to be limited to 4 numeric digits in a string field, so 12345 doesn't work later in the code but 1234 does. Adding another \d doesn't appear to be the answer.
These are, apparently, regular expressions used to validate inputs: \d matches a single digit 0-9, so "\d\d\d\d" means exactly four digits, and [0-9A-Z] is a character class matching one digit or uppercase letter. See https://docs.python.org/2/library/re.html
Without seeing the code that uses these values it is impossible to say more.
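Still, as a rough illustration of how such a table is typically used for validation (the is_valid helper and the anchoring shown here are assumptions, since the real code isn't shown):

import re

idTypes = {"PFI": "\d\d\d\d"}  # \d matches exactly one digit 0-9

def is_valid(id_type, value):
    # Appending $ anchors the match, so the whole string must fit the pattern
    return re.match(idTypes[id_type] + r'$', value) is not None

print(is_valid("PFI", "1234"))   # True
print(is_valid("PFI", "12345"))  # False: a fifth digit would need a fifth \d (or \d{5})

Whether a longer value is accepted depends on how the pattern is applied in the surrounding code, which is why adding another \d alone may not have worked for you.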
I set a format using xlwings: ws.range('A1').number_format = '#.0;[Red]-#.0', but one of the users has a French Excel and there is an error because of [Red]. I have to add a condition based on the Excel language: for French Excel instances it must be [Rouge].
Here is my question: do you know how I can get the language of an Excel instance in Python (pywin32 / xlwings)?
In VBA, the following code will return 1 for English Excel and 33 for French Excel:
Application.International(xlApplicationInternational.xlCountryCode)
But I can't manage to get the python equivalent.
Thanks.
wb.app.api.International returns a tuple with index 0 being 1 or 33 depending on the language.
As asked, the Python equivalent to
Application.International(xlApplicationInternational.xlCountryCode)
would be something like (wb being an xlwings.Workbook instance):
from xlwings.constants import ApplicationInternational
int_constants = wb.app.api.International
country_code = int_constants[ApplicationInternational.xlCountryCode]
The class ApplicationInternational defines the indices you can use to retrieve the respective properties from the International tuple, as done above. From experience, sometimes you need to subtract 1 from these values.
I don't have a lot of experience with specifying colours, but it's probably possible to specify the colour in RGB (it seems to be possible in VBA). This way you can skip the locale part altogether (and you could use something like xlwings.constants.RgbColor.rgbRed if you're feeling fancy).
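Putting the pieces together, a minimal sketch of the locale-dependent format (assuming ws is your xlwings sheet and that French, country code 33, is the only non-English locale you need to handle):

from xlwings.constants import ApplicationInternational

# 1 = English, 33 = French; check for an off-by-one if this looks wrong
country_code = wb.app.api.International[ApplicationInternational.xlCountryCode]

color_name = '[Rouge]' if country_code == 33 else '[Red]'
ws.range('A1').number_format = '#.0;{}-#.0'.format(color_name)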
I'm wondering if I can represent CLI prompts using a CFG or PEG inspired grammar; for instance, to auto-generate a setup-wizard or survey. In order to achieve this, the parser has to prompt the user for each next input token given what they have already entered. Example:
customer_info -> "My name is " name_expression " and I'm " %age " years old."
name_expression -> %name %name
                 | %name
name_expression allows you to enter first and last name, or simply a single name. The string constants would be auto-filled by the prompt. This spec would compile to the following example experience for a hypothetical user:
My name is (enter %name):
>> john
My name is john (1 for "%lastname", 2 for " and I'm "):
>> 2
My name is john and I'm (enter a number):
>> 39
My name is john and I'm [39] years old.
Prompt complete, exiting.
I've read a little about "inverse parsers", the idea being that during an interactive dialog, all your potential responses are laid out at each step of the conversation (think RPG-style video game conversation with an NPC). Information about this technique seems to be scarce online and I'm not sure it would do what I want entirely.
I've looked into Earley parsers, predictive LL parsers, and some others, but learning each and every one of these candidates just to find out if it is suitable for this case seems unreasonable. My question is, what kind of parsing technique would best allow me to prompt the user for a list of valid tokens given an incomplete input sentence?
Though I'm comfortable with recursive descent parsing and using various parser generators, I've only studied the material for about a year, so pardon my ignorance.
Thank you.
This technique already exists in LRSTAR and I think it's built into ANTLR and Bison/Yacc generated parsers. It gets activated when an error is encountered in the input. Then it lists all expected valid tokens.
Some people call it auto-complete or sentence-completion. It's rarely used for the purpose you are asking about. However, it's doable with a modified parser. The parser would have to generate the question, "My first name is", then read the expected token, "<first_name>", from the user.
It's really a simplification of what a parser can do. Using an LR parser for something this simple is overkill.
The grammar might look like this:
Goal -> Questions <eof>
Questions -> FirstName LastName Street City State Zipcode Age
FirstName -> first name <first_name>
LastName -> last name <last_name>
Street -> street <street>
City -> city <city>
State -> state <state>
Zipcode -> zipcode <zipcode>
Age -> age <age>
It's a valid way to automate the creation of questionnaires. The parser would generate the words that are not in angled brackets and ask the user to input the variable information. Or just put the angled-bracket words in the grammar, to avoid the redundancy, and ask the user for <first_name>, <last_name>, etc., as sketched below.
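A minimal sketch of that questionnaire idea in Python (the flattening of the grammar into prompt/placeholder pairs is hand-done here; a real generator would derive these pairs from the grammar itself):

# Literal words become the prompt; <angle-bracket> words are read from the user.
grammar = [("first name", "<first_name>"),
           ("last name", "<last_name>"),
           ("street", "<street>"),
           ("city", "<city>"),
           ("state", "<state>"),
           ("zipcode", "<zipcode>"),
           ("age", "<age>")]

answers = {}
for words, placeholder in grammar:
    answers[placeholder] = input("Enter {} {}: ".format(words, placeholder))
print(answers)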
The best and most reliable method would be creating a Canonical LR(1) parser.
This kind of parser has all the expected tokens in every state. No need to look at other states via default reductions. As long as your grammar is not huge, you should try a CLR(1) parser.
LRSTAR can generate a CLR(1) parser which already has code builtin to list the expected tokens.
Any left-to-right parsing scheme which requires no more than one lookahead token would work fine, provided that the language can be parsed with that scheme. Table-driven implementations are probably easier to work with, provided that the parsing tables are accessible and documented (not the case for most parser generators, unfortunately), but you could use any black-box parser with a "push" interface and a copyable state, by simply cycling through all possible token types and recording which ones don't produce errors.
The prediction logic is easier with an LL(1) grammar than with an LR(1) grammar, because the LL parser state is always a unique item. LR parser states are often the union of several items, so it might not be totally obvious how to describe the current parsing context. On the other hand, LR parsers can handle a larger set of grammars.
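To illustrate the black-box probing idea, here's a minimal sketch; push is a hypothetical stand-in for whatever push-style interface your parser exposes, and the error it raises on a bad token is assumed to be an Exception:

import copy

def valid_next_tokens(parser, all_token_types):
    # Probe a copy of the parser with each candidate token type and keep
    # the ones that are accepted. Requires a copyable parser state.
    valid = []
    for tok in all_token_types:
        probe = copy.deepcopy(parser)  # leave the real parse state untouched
        try:
            probe.push(tok)            # hypothetical push interface
            valid.append(tok)
        except Exception:              # whatever error the parser signals
            pass
    return valid

The returned list is exactly what you would present to the user as the menu of valid next inputs at each prompt.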
I have a problem that is currently driving me nuts. I have a list with a couple of million entries, and I need to extract product categories from them. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
A type check did indeed give me a string: print(type(item)) yields <class 'str'>.
Now I searched online for a possible (and preferably fast - because of the million entries) regex solution to extract all the categories.
I found several questions here, e.g. "Match single quotes from python re". I tried re.findall(r"'(\w+)'", item) but only got empty brackets [].
Then I went on and searched for alternative methods like this one: "Python Regex to find a string in double quotes within a string", where someone tries matches=re.findall(r'\"(.+?)\"',item) followed by print(matches), but this failed in my case as well...
After that I tried some idiotic approach to get at least a workaround and solve this problem later: list_cat_split = item.split(',') which gives me
["[['Electronics'", " 'Computers & Accessories'", " 'Cables & Accessories'", " 'Memory Card Adapters']]"]
Then I tried string methods to get rid of the stuff and then apply a regex:
list_categories = []
for item in list_cat_split:
    item.strip('\"')
    item.strip(']')
    item.strip('[')
    item.strip()
    category = re.findall(r"'(\w+)'", item)
    if category not in list_categories:
        list_categories.append(category)
However, even this approach failed: [['Electronics'], []]
I searched further but did not find a proper solution. Sorry if this question is completely stupid; I am new to regex, and probably this is a no-brainer for regular regex users.
UPDATE:
Somehow I cannot answer my own question, therefore here is an update:
Thanks for the answers, and sorry for the incomplete information; I very rarely ask here and usually try to find solutions on my own. I do not want to use a database, because this is only a small fraction of my preprocessing work for an ML application that is written entirely in Python. Also, this is for my MSc project, so no production environment. Therefore I am fine with a slower, but working, solution as I do it once and for all. As far as I can see, the solution of @FailSafe worked for me.
But yes, I totally agree with @Wiktor Stribiżew: in a production setup, I would for sure set up a database and let this run overnight. Thanks for all the help anyway, great people here :-)
This may not be your final answer, but it creates a list of categories:
x="[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
y=x[2:-2]
z=y.split(',')
for item in z:
print(item)
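Since each entry is itself valid Python literal syntax, another option worth sketching is the standard library's ast.literal_eval, which safely parses the string into a real nested list (this is an alternative, assuming every entry is a well-formed literal):

import ast

item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
categories = ast.literal_eval(item)[0]  # the outer list holds a single inner list
print(categories)
# ['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']

This sidesteps regex entirely and copes with the commas, ampersands and spaces inside the category names.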
I am trying to get the output as a string using LexRankSummarizer in the sumy library.
I am using the following code (pretty straightforward):
parser = PlaintextParser.from_string(text, Tokenizer('english'))
summarizer = LexRankSummarizer()
sum_1 = summarizer(parser.document, 10)
sum_lex = []
for sent in sum_1:
    sum_lex.append(sent)
Using the above code I am getting an output in the form of Sentence objects rather than plain strings. Consider a summary as given below from a text as input:
The Mahājanapadas were sixteen kingdoms or oligarchic republics that existed in ancient India from the sixth to fourth centuries BCE.
Two of them were most probably ganatantras (republics) and others had forms of monarchy.
Using the above code I am getting an output as
sum_lex = [<Sentence: The Mahājanapadas were sixteen kingdoms or oligarchic republics that existed in ancient India from the sixth to fourth centuries BCE.>,
<Sentence: Two of them were most probably ganatantras (republics) and others had forms of monarchy.>]
However, if I use print(sent) I am getting proper output as given above.
How to tackle this issue?
Replacing sum_lex.append(sent) with sum_lex.append(str(sent)) should do.
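For example, building the whole list in one go with a list comprehension (the same fix, more compactly):

# str(sent) converts each sumy Sentence object to plain text
sum_lex = [str(sent) for sent in sum_1]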
I'm using the following regex to match email addresses in Python:
EMAIL_REGEX = re.compile(r"[^@]+@[^@]+\.[^@]+")
This works great, except that when I run it, I'm getting domains such as '.co.uk', etc. For my project, I am simply trying to get a count of international-looking TLDs. I understand that this doesn't necessarily guarantee that my users are only US based, but it does give me a count of my users without internationally based TLDs (or what we would consider internationally based TLDs: .co.uk, .jp, etc).
What you want is very difficult.
If I make a mail server called this.is.my.email.my-domain.com, and an account called martin, my perfectly valid US email would be martin@this.is.my.email.my-domain.com. Emails with more than one domain part are not uncommon (.gov addresses are a common example).
Disallowing emails from the .uk TLD is also problematic, since many US-based people might have a .uk address, for example because they think it sounds nice, work for a UK-based company, have a UK spouse, used to live in the UK and never changed emails, etc.
If you only want US-based registrations, your options are:
Ask your users if they are US-based, and tell them your service is only for US-based users if they answer with a non-US country.
Ask for a US address or phone number. Although this can be faked, it's not easy to get a matching address & ZIP code, for example.
Use GeoIP, and allow only US IP addresses. This is not fool-proof, since people can use your service on holidays and such.
In the question's comments, you said:
Does it not make sense that if someone has a .jp TLD, or .co.uk, it stands to reason (with considerable accuracy) that they are internationally based?
Usually, yes. But far from always. My girlfriend has 4 .uk email addresses, and she doesn't live in the UK anymore :-) This is where you have to make a business choice, you can either:
Turn away potential customers
Take more effort in allowing customers with slightly "strange" email addresses
Your business, your choice ;-)
So, with that preamble, if you must do this, this is how you could do it:
import re

EMAIL_REGEX = re.compile(r'''
    ^            # Anchor to the start of the string
    [^@]+        # Username
    @            # Literal @
    ([^@.]+){1}  # One domain part
    \.           # Literal .
    ([^@.]+){1}  # One domain part (the TLD)
    $            # Anchor to the end of the string
''', re.VERBOSE)

print(EMAIL_REGEX.search('test@example.com'))
print(EMAIL_REGEX.search('test@example.co.uk'))
Of course, this still allows you to register with a .nl address, for example. If you want to allow only a certain set of TLDs, then use:
allowed_tlds = ['com', 'net']  # ... Probably more
result = EMAIL_REGEX.search('test@example.com')
if result is None or result.groups()[1] not in allowed_tlds:
    print('Not allowed')
However, if you're going to create a whitelist, then you don't need the regexp anymore, since not using it will allow US people with multi-domain addresses to sign up (such as @nlm.nih.gov).
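A sketch of that whitelist approach without the regexp, simply splitting off the last dotted label (the contents of allowed_tlds here are illustrative assumptions):

allowed_tlds = {'com', 'net', 'org', 'gov', 'edu', 'us'}  # extend as needed

def tld_allowed(email):
    # The TLD is whatever follows the final dot of the domain
    domain = email.rsplit('@', 1)[-1]
    return domain.rsplit('.', 1)[-1].lower() in allowed_tlds

print(tld_allowed('test@nlm.nih.gov'))    # True: multi-domain addresses pass
print(tld_allowed('test@example.co.uk'))  # False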