Wildcards in a Search and Replace Dictionary - python

I am fairly new to Python, so I apologize if my terminology is wrong. I am running Python 2.6.5; I am not sure about updating to 3.x, since Python was installed along with my spatial analysis software.
I am writing a program to search and replace column headers in multiple comma-delimited text files. Since there are over a hundred headers and they are the same in all the files, I decided to create a dictionary and 'pickle' it to save all the replacements (I got the idea from reading other posts). My issue arose when I noticed there are tabs and spaces within the column headings, for example:
..."Prev Roll #: ","Prev Prime/Sub","Frontage : ","Depth : ","Area : ","Unit of Measure : ",...
So I thought, why not just stick a wildcard at the end of my key term so the search will match it no matter how many spaces divide the name and the colon? I was trying the * wildcard, but it doesn't work: when I run it, no matches/replacements are made. Am I using the wildcard correctly? Is what I'm trying to do even possible? Or should I do away with the dictionary pickle?
Below is a sample of what I'm trying to do:
import cPickle as pickle
general_D = {
    ....
    "Prev Prime/Sub": "PrvPrimeSub",
    "Frontage*": "Frontage",
    "Depth*": "Depth",
    "Area*": "Area",
    "Unit of Measure*": "UnitMeasure",
Thanks for the input!

Use the csv module to parse and write your comma-separated data.
Use the string strip() method to remove unwanted spaces and tabs.
Do not include * in your dict key names. They will not glob as you hope; they just represent literal *s there.
It is probably better to use json instead of pickle. JSON is human-readable and independent of the programming language; pickle may have problems even across different versions of Python.
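Putting those pieces together, a minimal sketch for Python 2.6 as in the question; 'input.csv', 'output.csv', and 'headers.json' are placeholder names, the map is assumed to use the plain header names (no * wildcards), and clean() assumes the stray text is only spaces, tabs, and a trailing colon:

import csv
import json

# Load the replacement map; JSON instead of pickle, per the advice above.
with open('headers.json', 'rb') as f:
    general_D = json.load(f)

def clean(name):
    # 'Frontage : ' and 'Frontage\t:' both become 'Frontage'
    return name.strip().rstrip(':').strip()

with open('input.csv', 'rb') as src:
    rows = list(csv.reader(src))

# Replace each cleaned header if it is in the map; keep it otherwise.
rows[0] = [general_D.get(clean(h), h) for h in rows[0]]

with open('output.csv', 'wb') as dst:
    csv.writer(dst).writerows(rows)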

Related

Hidden characters in integer-like string

I scraped data about fundraising from the web and put it into a table.
As I start to clean the data, I see that some elements, for instance "2 000000", are read as "2\xa0000000" by the machine.
1/ What does that mean?
2/ How can I remove it? (I want to transform the whole column to integers.)
Best,
To fix a DataFrame column, use:
df['col'] = df['col'].str.replace(r'\D', '', regex=True).astype(int)
The issue is that the string contains a non-ASCII Unicode character, which Python displays as an escape sequence like \xa0. The easiest way to remove such characters without calling replace for each specific one is the unicodedata module.
Specifically:
from unicodedata import normalize
string1 = "2\xa0000000"
new_string = normalize('NFKD', string1)
print(new_string)
Output:
2 000000
unicodedata is part of the Python standard library, so there is nothing extra to install. I find this approach better because the normalization works across many kinds of formatting, so you do not need a new replace call each time you see something else that is not formatted correctly.
The character with hex code A0 is a non-breaking space (Python shows it as the escape sequence \xa0). So to speak, you can just treat it as a space in most cases. In my experience, it mostly comes up when I process data generated by Microsoft Office products, or data from the web where the HTML contained non-breaking spaces.
Unfortunately, split() on a Python 2 byte string (for example; I don't know how you process your data) will not treat that as a space, though Python 3's str.split() does. Either way, as it is just a distinct character, you can solve the issue with:
longstring.replace('\xA0', ' ').split()
PS: Reading your question again, it seems the character should simply be removed so that the value parses as the number two million. In that case you will want to replace '\xA0' with the empty string instead.
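Following that PS, a minimal example of the empty-string variant feeding int():

# Drop the non-breaking space entirely so int() can parse the value.
value = int('2\xa0000000'.replace('\xa0', ''))
print(value)  # 2000000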

Processing a CSV with colon separated pairs in fields

I have a CSV in the format of
Fruit:Apple,Seeds:Yes,Colour:Red or Green
Fruit:Orange,Seeds:No,Colour:Orange
Fruit:Pear,Seeds:Yes,Colour:Green,Shape:Odd
Fruit:Banana,Seeds:No,Colour:Yellow,Shape:Also Odd
and I want to create a JSON object for these values that looks something like
{"requestdata":{
"testdata":"example",
"testcategory":"category",
"fruits":{
"Fruit":{
"value":"Apple"
"type":"string"},
"Seeds":{
"value":"Yes"
"type":"bool"}
}
etc
I know I can load the CSV with a delimiter of my choosing, but how would I specify the second delimiter? Or should I build a dictionary for each cell of data instead, treating the cell as a string to split?
You should just split on the comma and use a string split to process the remaining elements, building a dictionary, then have the json module produce JSON from the dictionary. It is fairly easy to create malformed JSON when trying to be clever with text processing, such as by:
Forgetting to quote keys.
Quoting values you didn't mean to quote.
Not escaping JSON special characters.
Building the dictionary and then having the module do its thing will make your code much more maintainable and less error-prone.
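A minimal sketch of that approach; 'fruits.csv' is a placeholder name, the Yes/No-to-bool test is an assumption about the desired types, and the surrounding requestdata envelope is left out:

import csv
import json

records = []
with open('fruits.csv', newline='') as f:
    for row in csv.reader(f):
        record = {}
        for field in row:
            # Split each 'Key:Value' cell on the first colon only.
            key, _, value = field.partition(':')
            kind = 'bool' if value in ('Yes', 'No') else 'string'
            record[key] = {'value': value, 'type': kind}
        records.append(record)

print(json.dumps({'fruits': records}, indent=2))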

Python 3: how to parse a csv file where the text fields can contain embedded new line characters

When exporting Excel/LibreOffice sheets whose cells can contain newlines as CSV, the resulting file will have those newlines preserved as literal newline characters, not as something like the two-character string "\n".
The standard csv module in Python 3 apparently does not handle this as would be necessary. The documentation says: "Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future." Well, duh.
Is there some other way to read in such CSV files properly? What csv really should do is ignore any newlines within quoted text fields and only recognise newline characters outside a field, but since it does not, is there a different way to solve this short of implementing my own CSV parser?
Try using pandas, with something like df = pandas.read_csv('my_data.csv'). You'll have more granular control over how the data is read in. If you're worried about formatting, you can also set the delimiter for the CSV from LibreOffice to something that doesn't occur in nature, like ;;
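A minimal sketch, keeping the placeholder name 'my_data.csv' from above; pandas preserves newlines that sit inside quoted fields as part of the field's value:

import pandas as pd

# Embedded newlines inside quoted fields survive as '\n' in the values.
df = pd.read_csv('my_data.csv')

# If you re-exported from LibreOffice with a custom separator instead:
# df = pd.read_csv('my_data.csv', sep=';;', engine='python')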

importing CSV file with values wrapped in " when some of them contains " as well as commas

I think I searched thoroughly, but if I missed something, please let me know.
I am trying to import a CSV file where all non-numerical values are wrapped in ".
I have encountered a problem with:
df = pd.read_csv('file.csv')
Example of CSV:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
Because of the multiple quotes and the commas inside them, pandas sees more than 4 columns in this case (like 5 or 6).
I have already tried to play with
df = pd.read_csv('file.csv', quotechar='"', quoting=2)
But got
ParserError: Error tokenizing data (...)
What works is skipping bad lines with
error_bad_lines=False
but I'd rather have all the data somehow taken into consideration than just omit it.
Many thanks for any help!
This seems like badly formed CSV data, as the '"' characters within the values should be escaped. I've often seen such values escaped by doubling them up or prefixing them with a \. See https://en.wikipedia.org/wiki/Comma-separated_values#cite_ref-13
The first thing I'd do is fix whatever is exporting those files. However, if you cannot do that, you may be able to work around the issue by escaping the " characters which are part of a value.
Your best bet might be to assume that a " is only followed (or preceded) by a comma or newline if it is the end of a value. Then you could use a regex something like the following (working from memory, so it may not be 100%, but it should give you the right idea; you'll have to adapt it for whatever regex library you have handy):
s/([^,\n])"([^,\n])/$1""$2/g
So if you were to run your example file through that, it would be escaped something like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company ""MoscowMining"" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company"" Jankowski,A,B"""
or using the following
s/([^,\n])"([^,\n])/$1\"$2/g
the file would be escaped something like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski,A,B\""
Depending on your CSV parser, one of those should be accepted and work as expected.
If, as @exe suggests, your CSV parser also requires the commas within values to be escaped, you can apply a similar regex to replace the commas.
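For reference, a minimal sketch of the doubling variant using Python's re module; 'file.csv' and 'file_fixed.csv' are placeholder names, and it assumes the file is small enough to read into memory at once:

import re

with open('file.csv', encoding='utf-8') as f:
    raw = f.read()

# Double any quote that is not adjacent to a comma or newline,
# i.e. a quote that sits inside a value rather than delimiting it.
fixed = re.sub(r'([^,\n])"([^,\n])', r'\1""\2', raw)

with open('file_fixed.csv', 'w', encoding='utf-8') as f:
    f.write(fixed)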
If I understand correctly, what you need is to escape the quotes and commas before pandas reads the CSV.
Like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1\, Owner2\, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski\,A\,B\""

Save sentence as server filename

I'm saving the recording of a set of sentences to a corresponding set of audio files.
Sentences include:
Ich weiß es nicht!
¡No lo sé!
Ég veit ekki!
How would you recommend I convert a sentence to a human-readable filename, which will later be served from an online server? I'm not sure right now what languages I might be dealing with in the future.
UPDATE:
Please note that two sentences can't clash with each other. For example:
É bär icke dej.
E bår icke dej.
can't resolve to the same filename as these will overwrite each other. This is the problem with the slugify function mentioned here: Turn a string into a valid filename?
The best I have come up with is to use urllib.parse.quote. However, I think the resulting output is harder to read than I would have hoped. Any suggestions?
Ich%20wei%C3%9F%20es%20nicht%21
%C2%A1No%20lo%20s%C3%A9%21
%C3%89g%20veit%20ekki%21
What about unidecode?
import unidecode
a = [u'Ich weiß es nicht!', u'¡No lo sé!', u'Ég veit ekki!']
for s in a:
    print(unidecode.unidecode(s).replace(' ', '_'))
This gives pure-ASCII strings that can readily be processed further if they still contain unwanted characters. Keeping spaces distinct, in the form of underscores, helps with readability.
Ich_weiss_es_nicht!
!No_lo_se!
Eg_veit_ekki!
If uniqueness is a problem, a hash or something like that might be added to the strings.
Edit:
Some clarification seems to be required with respect to the hashing. Many hash functions are explicitly designed to give very different outputs for close inputs. For example, the built-in hash function of Python gives:
In [1]: hash('¡No lo sé!')
Out[1]: 6428242682022633791
In [2]: hash('¡No lo se!')
Out[2]: 4215591310983444451
With that you can do something like
unidecode.unidecode(s).replace(' ', '_') + '_' + str(hash(s))[:10]
in order to get strings that are not too long. Even with such shortened hashes, clashes are pretty unlikely.
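One caveat worth adding: since Python 3.3, the built-in hash() of strings is salted per process, so the suffix would change between runs. If the filename must be reproducible across runs, a truncated digest from hashlib is a stable alternative; a minimal sketch:

import hashlib

import unidecode

def to_filename(s):
    # A stable short digest, unlike the per-process salted built-in hash().
    digest = hashlib.sha1(s.encode('utf-8')).hexdigest()[:10]
    return unidecode.unidecode(s).replace(' ', '_') + '_' + digest

print(to_filename(u'É bär icke dej.'))
print(to_filename(u'E bår icke dej.'))  # same slug, different digest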
You should probably try to convert spaces into another symbol, making your string look like É-bär-icke-dej.
If you're using Python, I would do it like this:
Replace spaces with another symbol like (-) or (/):
mystring = mystring.replace(' ', '-')
Detect your character encoding using chardet, a Python package that detects encodings.
Decode your string (if it is a Python 2 byte string) with:
mystring = mystring.decode(detected_encoding)
Check if the file name is already in your directory using Python's os module, something like:
files = os.listdir(path_to_directory)
# count how many times the file name has been repeated
redundance = 0
for name in files:
    if mystring in name:
        redundance += 1
Append redundance to your string:
if redundance != 0:
    mystring = mystring + str(redundance)
Use your string as a file name!
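Putting those steps together as one Python 3 sketch (the string is already Unicode there, so the chardet/decode step drops out; path_to_directory is a placeholder):

import os

def unique_name(sentence, path_to_directory):
    # Replace spaces, then count existing files already containing the name.
    name = sentence.replace(' ', '-')
    redundance = sum(1 for f in os.listdir(path_to_directory) if name in f)
    if redundance != 0:
        name = name + str(redundance)
    return name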
Hope this helps!
The only disallowed characters in traditional Unix/Linux file names are the slash (/, U+002F) and the null character (U+0000). There is no need to convert your example human-readable strings into anything else.
If you need to make the files available to systems which do not use the same file-name encoding, such as for downloading over FTP or from a web server, you may want to expose them as explicitly UTF-8. On most modern U*xes, this should be the default out of the box anyway. This corresponds to the results you get from urllib quoting, where the percent-encoding is a safe and reasonably standard way of producing a machine-readable and unambiguous representation of the encoding. If you embed these in a snippet of HTML, you can keep the display text human-readable, e.g. "Ég veit ekki!", and just keep the link machine-readable.
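A minimal sketch of that split between human-readable display text and machine-readable link target, using urllib.parse.quote from the question ('.ogg' is just an example extension):

from urllib.parse import quote

sentence = 'Ég veit ekki!'
# Percent-encoded, machine-readable href; raw sentence as link text.
href = quote(sentence) + '.ogg'
print('<a href="{}">{}</a>'.format(href, sentence))
# <a href="%C3%89g%20veit%20ekki%21.ogg">Ég veit ekki!</a>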
