excluding delimiters from csv.Sniffer()

As a part of a script, I am reading a file without the user specifying the delimiter.
Theoretically, I want to be able to use this script for any type of delimiter besides some specific ones.
Is there a way to tell csv.Sniffer() "These characters are not delimiters:..." ?
From the docs I saw that there's an optional argument for specifying valid delimiters but that's not what I'm looking for.
As it seems, this option is not supported. Let me know if I am wrong, or if there is another module that provides the same functionality.

There isn't a way to specify that characters aren't delimiters in the existing Sniffer implementation.
Delimiters are identified in the Sniffer._guess_quote_and_delimiter and Sniffer._guess_delimiters methods.
To negatively specify delimiters you would need to subclass Sniffer and override these methods to take into account the set of "not allowed" delimiter characters.
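A minimal sketch of the subclassing idea, under one simplifying assumption: instead of rewriting the two private methods, it overrides the public sniff() and passes every character of the sample except the excluded ones through the existing delimiters argument. The class name ExcludingSniffer and its excluded parameter are hypothetical, not part of the csv module.

```python
import csv


class ExcludingSniffer(csv.Sniffer):
    """Sniffer that never reports the given characters as the delimiter.

    Hypothetical helper: it relies on the documented `delimiters`
    argument of Sniffer.sniff() rather than on the private methods.
    """

    def __init__(self, excluded=""):
        super().__init__()
        self.excluded = set(excluded)

    def sniff(self, sample, delimiters=None):
        # Start from the caller's candidates, or from every character
        # that occurs in the sample when no candidates were given.
        candidates = set(sample) if delimiters is None else set(delimiters)
        allowed = "".join(sorted(candidates - self.excluded))
        return super().sniff(sample, delimiters=allowed)


sample = "a,b;c\nd,e;f\ng,h;i\n"
dialect = ExcludingSniffer(excluded=",").sniff(sample)
print(repr(dialect.delimiter))
```

If every plausible candidate is excluded, the inherited sniff() raises csv.Error, exactly as it does when it cannot determine a delimiter.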

Related

what are delims used for tab completer

I am reading some code that uses the readline library. When I call readline.get_completer_delims(), it returns delimiters like ~!##$%^&*()-=+[{]}\|;:'",<>/?
What do those delimiters mean for tab completion? Can someone explain?
readline.get_completer_delims()

Per the docs:
Set or get the word delimiters for completion. These determine the start of the word to be considered for completion (the completion scope). These functions access the rl_completer_word_break_characters variable in the underlying library.
These are characters after which the tab completer should consider the start of a "word", and ignore characters on the line prior to that one. For example, if you wanted to implement a tab completer for attributes and methods on Python objects you might add . to this list, so when hitting tab after a . it would indicate that you want completion for whatever comes after the dot.
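To make the "start of the word" idea concrete, here is a small pure-Python illustration. It does not call readline; completion_scope is a hypothetical helper that only mimics how a set of delimiters decides which part of the line is handed to the completer, and the delimiter string used is a shortened, made-up set.

```python
def completion_scope(line, delims):
    """Return the trailing part of `line` that follows the last delimiter.

    Illustration only: this mimics the effect of the completer
    delimiters, it is not readline's actual implementation.
    """
    start = len(line)
    # Walk backwards until a delimiter (or the start of the line) is hit.
    while start > 0 and line[start - 1] not in delims:
        start -= 1
    return line[start:]


short_delims = " \t\n()[]{}"  # hypothetical, shortened delimiter set

# Without "." as a delimiter the completer sees the dotted name.
print(completion_scope("print(os.pa", short_delims))
# With "." added, only the part after the dot is completed.
print(completion_scope("print(os.pa", short_delims + "."))
```

With the real module, the equivalent change would be made by passing the extended string to readline.set_completer_delims().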

How to open a file in its default program with python

I want to open a file in python 3.5 in its default application, specifically 'screen.txt' in Notepad.
I have searched the internet, and found os.startfile(path) on most of the answers. I tried that with the file's path os.startfile(C:\[directories n stuff]\screen.txt) but it returned an error saying 'unexpected character after line continuation character'. I tried it without the file's path, just the file's name but it still didn't work.
What does this error mean? I have never seen it before.
Please provide a solution for opening a .txt file that works.
EDIT: I am on Windows 7 on a restricted (school) computer.

It's hard to be certain from your question as it stands, but I bet your problem is backslashes.
[EDITED to add:] Or actually maybe it's something simpler. Did you put quotes around your pathname at all? If not, that will certainly not work -- but once you do, you will find that then you need the rest of what I've written below.
In a Windows filesystem, the backslash \ is the standard way to separate directories.
In a Python string literal, the backslash \ is used for putting things into the string that would otherwise be difficult to enter. For instance, if you are writing a single-quoted string and you want a single quote in it, you can do this: 'don\'t'. Or if you want a newline character, you can do this: 'First line.\nSecond line.'
So if you take a Windows pathname and plug it into Python like this:
os.startfile('C:\foo\bar\baz')
then the string actually passed to os.startfile will not contain those backslashes; it will contain a form-feed character (from the \f) and two backspace characters (from the \bs), which is not what you want at all.
You can deal with this in three ways.
You can use forward slashes instead of backslashes. Although Windows prefers backslashes in its user interface, forward slashes work too, and they don't have special meaning in Python string literals.
You can "escape" the backslashes: two backslashes in a row mean an actual backslash. os.startfile('C:\\foo\\bar\\baz')
You can use a "raw string literal". Put an r before the opening single or double quotes. This will make backslashes not get interpreted specially. os.startfile(r'C:\foo\bar\baz')
The last is maybe the nicest, except for one annoying quirk: a backslash in front of a quote still stops that quote from ending a raw string literal (although the backslash itself stays in the string), which means you can't end a raw string literal with a single backslash.
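The difference between these spellings can be checked without opening anything. The sketch below only builds the strings; the path C:\foo\bar\baz is the placeholder from the answer above, not a real file.

```python
# The unescaped literal silently turns \f and \b into control characters.
broken = 'C:\foo\bar\baz'

# Three spellings that keep the path intact.
forward = 'C:/foo/bar/baz'
escaped = 'C:\\foo\\bar\\baz'
raw = r'C:\foo\bar\baz'

print(repr(broken))       # shows \x0c (form feed) and \x08 (backspace)
print(escaped == raw)     # both contain real backslashes
print(len(broken), len(raw))
```

Any of forward, escaped, or raw could then be passed to os.startfile on Windows.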

The recommended way to open a file with the default program is os.startfile. You can do something a bit more manual using os.system or subprocess though:
os.system(r'start ' + path_to_file)
or
subprocess.Popen('{start} {path}'.format(
start='start', path=path_to_file), shell=True)
Of course, this won't work cross-platform, but it might be enough for your use case.
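If cross-platform behaviour does matter, one common pattern is to branch on sys.platform. This is only a sketch: the helper names are hypothetical, and it assumes the open command exists on macOS and xdg-open is installed on other Unix-like systems.

```python
import os
import subprocess
import sys


def default_open_command(path, platform=sys.platform):
    """Return the argv list that opens `path`, or None on Windows."""
    if platform.startswith("win"):
        return None  # Windows has no separate command; use os.startfile
    if platform == "darwin":
        return ["open", path]
    return ["xdg-open", path]  # assumption: xdg-utils is installed


def open_with_default_app(path):
    """Open `path` in whatever program the OS associates with it."""
    command = default_open_command(path)
    if command is None:
        os.startfile(path)  # only available on Windows
    else:
        subprocess.Popen(command)
```

Passing the arguments as a list avoids shell=True, so a path containing spaces needs no extra quoting.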

For example I created file "test file.txt" on my drive D: so file path is 'D:/test file.txt'
Now I can open it with associated program with that script:
import os
os.startfile('d:/test file.txt')

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched a lot and discovered quoting=csv.QUOTE_NONNUMERIC; however, this now results in a quote mark around every field and I don't know why.
If I try one of the other quoting options like MINIMAL, I end up with an error message regarding the date value, 2008-01-09, not being a float.
I have tried to create a dialect and to add the quoting on the csv reader and writer, but nothing I have tried results in an exact match to the original data.
Has anyone had this same problem and found a solution?

When writing, quoting=csv.QUOTE_NONNUMERIC keeps values unquoted as long as they're numbers, i.e. if their type is int or float (for example), which means it will write what you expect.
Your problem could be that, when reading, a csv.reader turns every row it reads into a list of strings. If you read the documentation carefully, you'll see that a reader does not perform automatic data type conversion by default.
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything in quotes, because everything you write is a string.
Edit: of course, date fields will be quoted, because they are not numbers, meaning you cannot get the exact expected behaviour using the standard csv.writer.
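A minimal sketch of that conversion step, using the first sample row from the question (shortened). The helper to_number is hypothetical; it simply tries int, then float, and otherwise leaves the field as a string.

```python
import csv
import io


def to_number(field):
    """Convert a field to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field


source = '15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE"\r\n'

# Read everything as strings, then convert the numeric-looking fields.
rows = [[to_number(f) for f in row] for row in csv.reader(io.StringIO(source))]

out = io.StringIO(newline="")
csv.writer(out, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)
print(out.getvalue())
```

This reproduces the first row, but it is not a general round trip: dates and originally unquoted empty fields come back quoted, and a value such as 377016.00 is written as 377016.0.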

Are you sure you have a problem? The behavior you're describing is correct: The csv module will enclose strings in quotes only if it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing a comma, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.

Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the "quoted" input fields could have commas is usually not that big a deal. If you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.
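A quick check of that claim with the default writer settings (QUOTE_MINIMAL). The field values are made up for illustration.

```python
import csv
import io

out = io.StringIO(newline="")
# Only the field that contains the delimiter needs quoting.
csv.writer(out).writerow(["BYRNESS, RAW", "ENG", 10])
print(out.getvalue())
```

Reading the result back with csv.reader yields the original three fields, with the comma still inside the first one.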

Create (sane/safe) app bundle identifier from any (unsafe) string

I want to create a sane/safe app bundle name (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).
(It doesn't matter to me whether the function is Cocoa, ObjC, Python, etc.)
(This is related to the filename question and the bundle name question, but the bundle identifier is much more restrictive. I think it cannot even contain spaces, and I would also want to strip out the dots and put my own prefix.)
I think Xcode also has some function to do that automatically from the app name. Maybe there is some standard function in Cocoa to do that.

Bundle identifiers are meant to be in reverse URL form (guaranteeing global uniqueness):
com.apple.xcode, for example
So really you need a domain name, then you can invent whatever scheme you like below that.
Given this, and some knowledge of the characters in your input, you can either scan through your input composing a new string with only the bits you want, or use methods like stringByReplacingOccurrencesOfString: withString: and, if you like, lowercaseString.
The permitted characters in bundle identifiers are named in the Property List Documentation as:
The bundle ID string must be a uniform type identifier (UTI) that contains only alphanumeric (A-Z,a-z,0-9), hyphen (-), and period (.) characters. The string should also be in reverse-DNS format.
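Since the question allows Python, here is one possible sketch that follows the character rules quoted above. The function name make_bundle_id, the com.example prefix, and the fallback name app are all placeholders; collapsing every disallowed run (including dots) into a single hyphen is a design choice, not a requirement.

```python
import re
import unicodedata


def make_bundle_id(name, prefix="com.example"):
    """Build a bundle identifier from an arbitrary Unicode string."""
    # Decompose accented characters, then drop anything non-ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of disallowed characters (dots too) into a hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_name).strip("-")
    if not slug:
        slug = "app"  # hypothetical fallback when nothing usable remains
    return f"{prefix}.{slug}"


print(make_bundle_id("My Café App!"))
```

Different inputs can collapse to the same identifier, so uniqueness still has to be checked separately.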

easy way to determine if a string CAN'T be a valid regex

I have a config file in which the user can specify sections, and within those sections they can specify regular expressions. I have to parse this config file and separate the regexes into the various sections.
Is there an easy way to delimit a regex from a section header? I was thinking just the standard
[section]
regex1
regex2
But I just realized that [section] is a valid regex. So I'm wondering if there's a way I can format a section header so that it can ONLY be understood as a section header and not a regex.

There are unlimited ways of making an invalid regexp, but the first thing that comes to mind would be
*section*
You can't have a quantifier (*) at the start of the regexp.
(The other * is there just to satisfy my obsession for symmetry.)
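In Python this can be verified by attempting to compile the candidate header. The helper name is_valid_regex is made up for this sketch.

```python
import re


def is_valid_regex(pattern):
    """Return True if `pattern` compiles as a Python regular expression."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(is_valid_regex("*section*"))  # leading quantifier: nothing to repeat
print(is_valid_regex("[section]"))  # a character class, so a valid regex
```

Note that this only tests Python's re syntax; another regex engine may accept or reject different patterns.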

I don't know your problem domain, so I don't know what forms of regex you're expecting, but it seems to me you should keep your section formatting as it is. A regex that starts with [ and ends with ] and has no square brackets in between is quite unusual. It can only match a single character. So leave the section headers as they are. Strictly speaking, they are valid regexes, but they probably aren't interesting regexes.
Also, why not use ConfigParser from the standard library, and let it do the parsing for you?

There are easy ways, but they all require changing your format:
Use indentation, similar to how Python source is interpreted. Leading spaces would need special handling, e.g. "(?: )abc" instead of " abc".
Use an INI format, where each item in a section requires a name=value pair.
Use some sort of list syntax. ast.literal_eval will be helpful.
section1 = [
"regex 1",
"2",
"3",
]
section2 = ["..."]
Primarily, don't invent your own format, or make it as close to a known format as you can. The third is a subset of Python syntax, for example, and you could even use raw string literals naturally.
JSON or YAML may be useful for you.
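As a rough sketch of the JSON route: sections become object keys and each section holds a list of patterns. The section names and patterns below are invented for illustration. One cost of JSON is that every backslash in a regex has to be doubled in the file.

```python
import json
import re

# What the config file might contain (note the doubled backslash).
config_text = r'''
{
    "section1": ["^foo\\d+$", "bar.*baz"],
    "section2": ["[section]"]
}
'''

config = json.loads(config_text)

# Compile every pattern, keeping the section grouping.
compiled = {
    section: [re.compile(pattern) for pattern in patterns]
    for section, patterns in config.items()
}

print(list(compiled))
print(bool(compiled["section1"][0].match("foo42")))
```

Because section names are keys and regexes are values, a pattern such as [section] can no longer be mistaken for a header.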

As others have said, please don't invent yet another config format. Use the Python Standard Library's ConfigParser, which will be able to parse the [section] notation exactly as you have shown it.
EDIT: The allow_no_value option allows you to just have a single entry, rather than a key/value pair. And the default dict type is OrderedDict, so it will maintain order.
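A short sketch of how that might look with Python 3's configparser module. The patterns are invented, and there are assumptions to keep in mind: the parser still treats = and : as key/value delimiters, lines starting with # or ; as comments, and lines starting with [ as section headers, so regexes containing those need extra care.

```python
import configparser

# Hypothetical config text in the format from the question.
config_text = "\n".join([
    "[section]",
    r"^foo\d+$",
    "bar.*baz",
    "",
])

parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
# Keep keys case-sensitive; the default lowercases them (\D would become \d).
parser.optionxform = str
parser.read_string(config_text)

patterns = list(parser["section"])
print(patterns)
```

Each regex ends up as a key whose value is None, in the order it appeared in the file.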
