I have found information on how this could be done, but nothing has worked for me. I am trying to replace the special character 'ð'. I imported my data from a csv file and I used encoding='latin1' or else I kept getting errors. However, a simple DF['Column'].str.replace('ð', '') will not do the trick. I also tried decoding and using the hex value for that character which was recommended on another post, but that still won't work for me. Help is very much appreciated, and I am willing to post code if necessary.
Call str.encode followed by str.decode:
df.YourCol.str.encode('utf-8').str.decode('ascii', 'ignore')
If you want to do this for multiple columns, you can slice and call df.applymap:
df[col_list].applymap(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))
Remember that these operations are not in-place. So, you'll have to assign those columns back to their rightful place.
I have a python script which parses an xml file and then gives me the required information. My output looks like this, and is 100% correct:
output = ['77:275,77:424,77:425,77:426,77:427,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:431,77:432,77:433,77:435,77:467,77:470,77:471,77:484,77:485,77:475,77:476,77:437,77:438,77:439,77:440,77:442,77:443,77:444,77:445,77:446,77:447,77:449,77:450,77:451,77:454,77:455,77:456,77:305,77:309,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:317,77:321,77:346,77:349,77:350,77:351,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:362,77:367,77:369,77:374,77:370,77:372,77:373,77:387,77:388,77:389,77:392,77:393,77:394,77:328,77:283,77:284,77:285,77:288,77:289,77:290,77:292,']
It is all fine, but I want to remove the duplicate elements in an element, like in the case above. I tried using the OrderedDict package or just simple list(set(output)), but obvoiusly they both didn't work. Does anyone have a tip for me on how to solve this problem.
You have one element in a list. If you expected it to be treated as separate elements, you need to explicitly split it.
You could split the string on the ',' comma character into a list with str.split():
separate_elements = output[0].split(',')
after which you can use set() (unordered) or OrderedDict (maintaining order) and re-join the string if you still need just the one string object:
','.join(set(separate_elements))
You can put that back into a list with just one element, but there is little point if all you ever handle is that one string.
I've written an XML parser in Python and have just added functionality to read a further script from a different directory.
I've got two args, first is the path where I'm parsing XML. Second is a string in another XML file which I want to match with the first path;
arg1 = \work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator
path = calculators/2012/example/calculator
How can I compare the two strings to match identify that they're both referencing the same thing and also, how can I strip calculator from either string so I can store that & use it?
edit
Just had a thought. I have used a Regex to get the year out of the path already with year = re.findall(r"\.(\d{4})\.", path) following a problem Python has with numbers when converting the path to an import statement.
I could obviously split the strings and use a regex to match the path as a pattern in arg1 but this seems a long way round. Surely there's a better method?
Here I am assuming you are actually talking about strings, and not file paths - for which #mgilson's suggestion is better
How can I compare the two strings to match identify that they're both
referencing the same thing
Well first you need to identify what you mean by "the same thing"
At first glance it seems that if the the second string ends with the first string with the reversed slash, you have a match.
arg1 = r'\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator'
arg2 = r'calculators/2012/example/calculator'
>>> arg1.endswith(arg2.replace('/','\\'))
True
and also, how can I strip calculator from
either string so I can store that & use it?
You also need to decide if you want to strip the first calculator, the last calculator or any occurance of calculator in the string.
If you just want to remove the last string after the separator, then its simply:
>>> arg2.split('/')[-1]
'calculator'
Now to get the orignal string back, without the last bit:
>>> '/'.join(arg2.split('/')[:-1])
'calculators/2012/example'
check out os.path.samefile:
http://docs.python.org/library/os.path.html#os.path.samefile
and os.path.dirname:
http://docs.python.org/library/os.path.html#os.path.dirname
or maybe os.path.basename (I'm not sure what part of the string you want to keep).
Here, try this:
arg1 = "\work\parser\main\tools\app\shared\xml\calculators\2012\example\calculator"
path = "calculators/2012/example/calculator"
arg1=arg1.replace("/","\\")
path=path.replace("/","\\")
if str(arg1).endswith(str(path)) or str(path).endswith(str(arg1)):
print "Match"
That should work for your needs. Cheers :)
I am evaluating hundreds of thousands of html files. I am looking for particular parts of the files. There can be small variations in the way the files were created
For example, in one file I can have a section heading (after I converted it to upper and split then joined the text to get rid of possibly inconsistent white space:
u'KEY1A\x97RISKFACTORS'
In another file I could have:
'KEY1ARISKFACTORS'
I am trying to create a dictionary of possible responses and I want to compare these two and conclude that they are equal. But every substitution I try to run the first string to remove the '\97 does not seem to work
There are a fair number of variations of keys with various representations of entities so I would really like to create a dictionary more or less automatically so I have something like:
key_dict={'u'KEY1A\x97RISKFACTORS':''KEY1ARISKFACTORS',''KEY1ARISKFACTORS':'KEY1ARISKFACTORS',. . .}
I am assuming that since when I run
S1='A'
S2=u'A'
S1==S2
I get
True
I should be able to compare these once the html entities are handled
What I specifically tried to do is
new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')
I got an error
Sorry, I have been at this since last night. SLott pointed out something and I see I used the wrong label I hope this makes more sense
You are correct that if S1='A' and S2 = u'A', then S1 == S2. Instead of assuming this though, you can do a simple test:
key_dict= {u'A':'Value1',
'A':'Value2'}
print key_dict
print u'A' == 'A'
This outputs:
{u'A': 'Value2'}
True
That resolved, let's look at:
new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')
There's a problem here, \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (ascii and unicode) and clearly not the value you need to replace. Even if the target and search string were both ascii or unicode, you'd still not find the '\x97'. Second problem is that you are trying to search for a non-unicode string in a unicode string. The easiest solution, and one that makes the most sense is to simply search for u'\x97':
print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')
Outputs:
KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
Why not the obvious .replace(u'\x97','')? Where does the idea of that '|' come from?
>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
I am converting some matlab code to C, currently I have some lines that have powers using the ^, which is rather easy to do with something along the lines \(?(\w*)\)?\^\(?(\w*)\)?
works fine for converting (glambda)^(galpha),using the sub routine in python pattern.sub(pow(\g<1>,\g<2>),'(glambda)^(galpha)')
My problem comes with nested parenthesis
So I have a string like:
glambdastar^(1-(1-gphi)*galpha)*(glambdaq)^(-(1-gphi)*galpha);
And I can not figure out how to convert that line to:
pow(glambdastar,(1-(1-gphi)*galpha))*pow(glambdaq,-(1-gphi)*galpha));
Unfortunately, regular expressions aren't the right tool for handling nested structures. There are some regular expressions engines (such as .NET) which have some support for recursion, but most — including the Python engine — do not, and can only handle as many levels of nesting as you build into the expression (which gets ugly fast).
What you really need for this is a simple parser. For example, iterate over the string counting parentheses and storing their locations in a list. When you find a ^ character, put the most recently closed parenthesis group into a "left" variable, then watch the group formed by the next opening parenthesis. When it closes, use it as the "right" value and print the pow(left, right) expression.
I think you can use recursion here.
Once you figure out the Left and Right parts, pass each of those to your function again.
The base case would be that no ^ operator is found, so you will not need to add the pow() function to your result string.
The function will return a string with all the correct pow()'s in place.
I'll come up with an example of this if you want.
Nested parenthesis cannot be described by a regexp and require a full parser (able to understand a grammar, which is something more powerful than a regexp). I do not think there is a solution.
See recent discussion function-parser-with-regex-in-python (one of many similar discussions). Then follow the suggestion to pyparsing.
An alternative would be to iterate until all ^ have been exhausted. no?.
Ruby code:
# assuming str contains the string of data with the expressions you wish to convert
while str.include?('^')
str!.gsub!(/(\w+)\^(\w+)/, 'pow(\1,\2)')
end