Unable to find a character in a string? - python

I am getting the data from different Wikipedia pages. First, this data is stored in a local location (in the form of a python pickle). Here is the code:
k = '9–10000000'
if '-' in k:
    print('Found')
The '-' in the if statement was typed on the keyboard, and the print statement shows nothing. But if I copy the '-' from the value of k, I get the expected output (it prints 'Found'). I don't know what the difference is between these two '-' characters.
This is the simplest example I can share here. Several other characters show the same behavior.
Any idea why?

Possibly because the character you typed is a hyphen (strictly, "hyphen-minus") and the one in your k is an en dash: they have different character codes but look almost the same to the naked eye.

Try running this modified version of your program; you will see that the two characters are, contrary to what you think, not the same.
k = '9–10000000' # first dash
print(ord('–')) # printing first dash
print(ord('-')) # printing second dash
if '-' in k: # second dash
    print('Found')
where ord() returns the Unicode code point of a character, as an integer.
It prints
8211
45
8211 (U+2013) is EN DASH
45 (U+002D) is HYPHEN-MINUS
Have a look at this if you'd like to know more:
http://www.fileformat.info/info/unicode/char/2013/index.htm
http://www.fileformat.info/info/unicode/char/2d/index.htm
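If several look-alike characters are involved, it can help to print each character's Unicode name and to fold the dash variants into a plain hyphen-minus before searching. The following is only a sketch: the DASH_LOOKALIKES table and the two helper functions are made up for illustration, and the table covers just a handful of dash characters.

```python
import unicodedata

# Hypothetical mapping: a few common dash look-alikes folded to HYPHEN-MINUS.
DASH_LOOKALIKES = {
    0x2010: "-",  # HYPHEN
    0x2012: "-",  # FIGURE DASH
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2212: "-",  # MINUS SIGN
}

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text]

def fold_dashes(text):
    """Replace the dash variants listed above with a plain hyphen-minus."""
    return text.translate(DASH_LOOKALIKES)

k = "9\u201310000000"  # same value as in the question, with the en dash escaped
print(describe(k)[1])         # the second character is the en dash
print("-" in k)               # False: no hyphen-minus in the original string
print("-" in fold_dashes(k))  # True once the en dash has been folded
```

Whether folding is appropriate depends on the data; for Wikipedia text the en dash is often meaningful (for example in number ranges), so you may prefer to search for both characters instead.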

Related

How to separate user's input with two separators? And controlling the users input

I want to split the user's input using two different separators, ":" and ";".
The user should enter 4 subjects and their amounts. The format should be:
(Subject:amount;Subject:amount;Subject:amount;Subject:amount)
If the input is wrong, it should print "Invalid Input".
Here's my code, but I can only use one separator. How can I validate the user's input?
B = input("Enter 4 subjects and amount separated by (;) like Math:90;Science:80:").split(";")
Please help. I can't figure it out.
If you are fine with using regular expressions in Python, you could use the following code:
import re
output_list = re.split("[;:]", input_string)
Inside the square brackets you list all the characters (also known as delimiters) that you want to split on. The square brackets form a regex character class, and the quotes around them make the pattern a string, which is how we tell re.split what to split on.
Further reading on regex can be found here if you feel like it: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
However, if you want to do it without importing anything, here is another possible solution (one I would recommend against, but it gets the job done):
input_string = input_string.replace(";", ":")
output_list = input_string.split(":")
This works by first replacing all of the semicolons in the input string with colons (it would also work the other way around) and then splitting on the remaining separator (in this case, the colon).
Hope this helped, as it is my first answer on Stack Overflow.
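Neither approach above checks the input, which the question also asks about. One possible way is to validate the whole string against a pattern first and only then split it. This is a sketch under assumptions that are not in the question: subjects consist of letters only, amounts are whole numbers, and the names PAIR, PATTERN, and parse_subjects are invented for the example.

```python
import re

# Assumed format: exactly four "Subject:amount" pairs separated by ";",
# where the subject is letters only and the amount is a whole number.
PAIR = r"[A-Za-z]+:\d+"
PATTERN = re.compile(rf"{PAIR}(?:;{PAIR}){{3}}")

def parse_subjects(text):
    """Return a list of (subject, amount) tuples, or None if the format is wrong."""
    text = text.strip()
    if not PATTERN.fullmatch(text):
        return None
    pairs = (item.split(":") for item in text.split(";"))
    return [(subject, int(amount)) for subject, amount in pairs]

# In a real program the string would come from input(...).
result = parse_subjects("Math:90;Science:80;English:75;Art:88")
print(result if result is not None else "Invalid Input")
```

Validating with fullmatch before splitting keeps the error handling in one place; adjust the PAIR pattern if subjects may contain spaces or amounts may be decimals.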

Remove "." and "\" from a string

My project captures a log number from a Google Sheet using the gspread module. The problem is that the captured log number is the string ".\1300". I only want the number from the string, but I could not remove the other characters using the code below.
I tried using .replace() to replace "\" with "", but it failed.
a='.\1362'
a.replace('\\',"")
I expected to obtain the string "1362" without the symbols.
But the result obtained is ".^2".
The problem is that \136 has special meaning (similar to \n for newline, \t for tab, etc.). It is an octal escape sequence: octal 136 is decimal 94, which is the code for ^.
Check out the following example:
a = '.\1362'
a = a.replace('\\',"")
print(a)
b = r'.\1362'
b = b.replace('\\',"")
print(b)
Produces
.^2
.\1362
Now, if your Google Sheets module sends .\1362 instead of .\\1362, it is very likely because you are in fact supposed to receive .^2. Or there's a problem with your character encoding somewhere along the way.
The r prefix I put on the b variable's string means raw string: Python will not interpret backslashes and will leave your string alone. This is only really useful when typing the strings in manually, but you could perhaps try:
a = r'{}'.format(yourStringFromGoogle)
Edit: As pointed out in the comments, the original code did in fact discard the result of the .replace() method. I've updated the code, but please note that the string interpolation issue remains the same.
When you do a='.\1362', a will only have three characters:
a = '.\1362'
print(len(a)) # => 3
That is because \136 represents a single character. If you want to create a six-character string with a dot, a backslash, and the digits 1362, you either need to escape the backslash or create a raw string:
a = r'.\1362'
print(len(a)) # => 6
In either case, calling replace on a string will not change that string in place; a will still be what it was before calling replace. Instead, replace returns a new string:
a = r'.\1362'
b = a.replace('\\', '')
print(a) # => .\1362
print(b) # => .1362
So, if you want to replace characters, calling replace is the way to do it, but you've got to save the result in a new variable or overwrite the old one.
See String and Bytes literals in the official Python documentation for more information.
Your string should contain two backslashes, like this: .\\1362, or use r'.\1362' (which declares the literal as raw, so the backslash is kept as-is when the source is parsed). If there is only one backslash, Python will read \136 as ^, as you can see (ref: link).
What's happening here is that \1362 is being interpreted as ^2 because of the backslash, so you need to make the string raw before you can use it. You can do this with:
a = r'{}'.format(rawInputString)
or, if you're on Python 3.6+, you can do:
a = rf'{rawInputString}'
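One caveat worth checking before relying on the format trick: the r prefix only changes how a literal written in source code is parsed, so formatting a string that already exists at runtime returns it unchanged. The sketch below builds a string that really contains a backslash (as a value read from a spreadsheet presumably would) and extracts the digits; the variable names are made up for the example.

```python
# A string that really contains a backslash at runtime, as it would if it
# came from a spreadsheet cell rather than from a Python literal.
received = "." + chr(92) + "1362"  # chr(92) is the backslash character

print(len(received))                        # 6 characters
print(r"{}".format(received) == received)   # True: formatting changes nothing

# Keep only the digits (one possible way to extract the log number).
log_number = "".join(ch for ch in received if ch.isdigit())
print(log_number)
```

If the value read from the sheet has length 3 instead of 6, the backslash was already lost before the string reached your code, and no amount of raw-string handling can bring it back.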

get escaped unicode code from string

I seem to be having the opposite issue as everyone else in the development world. I need to generate escaped characters from strings. For instance, say I have the word MESSAGE:, I need to generate:
\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A\\u0053\\u0069\\u006D
The closest thing I could get using Python was:
u'MESSAGE:'.encode('utf16')
# output = '\xff\xfeM\x00E\x00S\x00S\x00A\x00G\x00E\x00:\x00'
My first thought was that I could replace \x with \u00 (or something to that effect), but I quickly realized that wouldn't work. What can I do to output the escaped (unescaped?) string in Python (preferably)?
Before everyone starts "answering" and down voting, the escaped \u00... string is what my app is getting from another 3rd party app which I have no control over. I'm trying to generate my own test data so I don't have to rely on that 3rd party app.
Pierre's answer is nearly right, but the for x in u'MESSAGE:' bit would fail for characters above U+FFFF, except for ‘narrow builds’ (primarily Python 1.6–3.2 on Windows) which use UTF-16 for Unicode strings.
On ‘wide builds’ (and in 3.3+ where the distinction no longer exists), len(unichr(0x10000)) is 1 not 2. When this code point is UTF-16BE-encoded you get two surrogates taking up four bytes, so the output is '\\uD800DC00' instead of what you probably wanted, u'\\uD800\\uDC00'.
To handle both build types (in Python 2, where str.encode('hex') is available) you can do:
>>> h = u'MESSAGE:\U00010000'.encode('utf-16be').encode('hex')
# '004d004500530053004100470045003ad800dc00'
>>> ''.join(r'\u' + h[i:i+4] for i in range(0, len(h), 4))
'\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a\\ud800\\udc00'
I think this (quick & dirty) code does what you want:
''.join('\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a'
Or, if you want the backslashes doubled:
''.join('\\\\u' + x.encode('utf_16_be').encode('hex') for x in u'MESSAGE:')
# output: '\\\\u004d\\\\u0045\\\\u0053\\\\u0053\\\\u0041\\\\u0047\\\\u0045\\\\u003a'
print _
# output: \\u004d\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003a
If you absolutely need upper-case for hexadecimal codes:
''.join('\\u' + x.encode('utf_16_be').encode('hex').upper() for x in u'MESSAGE:')
# output: '\\u004D\\u0045\\u0053\\u0053\\u0041\\u0047\\u0045\\u003A'
There's no need to go through the .encode() step if you don't have characters outside the BMP (>0xFFFF):
>>> ''.join('\\u{:04x}'.format(ord(a)) for a in u'Message')
'\\u004d\\u0065\\u0073\\u0073\\u0061\\u0067\\u0065'
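The snippets above use Python 2 idioms (unichr, str.encode('hex'), print as a statement). A rough Python 3 equivalent, which also emits surrogate pairs for characters above U+FFFF, might look like the sketch below; the function name escape_utf16 is invented for the example.

```python
def escape_utf16(text):
    """Return text as a sequence of \\uXXXX escapes, one per UTF-16 code unit."""
    hex_digits = text.encode("utf-16-be").hex()
    return "".join("\\u" + hex_digits[i:i + 4].upper()
                   for i in range(0, len(hex_digits), 4))

print(escape_utf16("MESSAGE:"))
print(escape_utf16("\U00010000"))  # a non-BMP character becomes two escapes
```

Encoding to UTF-16BE first means every four hex digits are exactly one code unit, so the same slicing works for BMP characters and for surrogate pairs.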

Python: Separating string into individual letters/words to be converted into ascii-hex or ascii-dec

As the title suggests, I want to take a string, split it into individual characters to pass to something like ord(''), and get a value for each character in that string. I'm still learning Python, so things like this get super confusing :P. Furthermore, the encryption step will just shift each character's decimal code by a specified value, output the shifted character, and state the shifted value for each character. How would I go about doing this? Any and all help would be greatly appreciated!
message = input("Enter message here: ")
shift = int(input("Enter Shift....explained shift: "))
for c in message:
    a = ord(c)
    print(c)
This is the very basic idea of what I was doing (there was more code, but it was similar), but obviously it didn't work :C. The "indented-->" marker just meant that a line was indented; I don't know how to show that on Stack Overflow.
UPDATE: IT WORKS (kinda). Using the loop and tweaking it according to the comments, I got a list of the ASCII decimal value of every character in the string! I'll try to use @Hugh Bothwell's suggestion within the loop and hopefully get some work done.
mystring = "this is a test"
shift = 3
encoded = ''.join(chr(ord(ch) + shift) for ch in mystring)
You'll have to do a little more if you want your alphabet to wrap around, i.e. encode('y') == 'b', but this should give you the gist of it.
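For the wrap-around case, one common approach is to shift within the 26-letter alphabet using modular arithmetic. This is a sketch, not the answerer's code: the function name caesar is made up, and it assumes only ASCII letters should be shifted while everything else is left alone.

```python
import string

def caesar(text, shift):
    """Shift ASCII letters by `shift` places, wrapping around the alphabet."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            base = ord("a")
        elif ch in string.ascii_uppercase:
            base = ord("A")
        else:
            shifted.append(ch)  # leave spaces, digits, punctuation unchanged
            continue
        shifted.append(chr((ord(ch) - base + shift) % 26 + base))
    return "".join(shifted)

encoded = caesar("this is a test", 3)
print(encoded)               # wklv lv d whvw
print(caesar(encoded, -3))   # this is a test
```

Decoding is the same operation with the negated shift, because Python's % always returns a non-negative result for a positive modulus.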

How to compare unicode strings with entity ref to non-unicode string

I am evaluating hundreds of thousands of HTML files, looking for particular parts of each file. There can be small variations in the way the files were created.
For example, in one file I can have a section heading (after I converted it to upper case, then split and joined the text to get rid of possibly inconsistent white space):
u'KEY1A\x97RISKFACTORS'
In another file I could have:
'KEY1ARISKFACTORS'
I am trying to create a dictionary of possible responses, and I want to compare these two and conclude that they are equal. But every substitution I run on the first string to remove the '\x97' does not seem to work.
There are a fair number of variations of keys with various representations of entities, so I would really like to create a dictionary more or less automatically, so that I have something like:
key_dict={u'KEY1A\x97RISKFACTORS':'KEY1ARISKFACTORS','KEY1ARISKFACTORS':'KEY1ARISKFACTORS',. . .}
I am assuming that since when I run
S1='A'
S2=u'A'
S1==S2
I get
True
I should be able to compare these once the html entities are handled
What I specifically tried to do is
new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')
I got an error
Sorry, I have been at this since last night. SLott pointed out something, and I see I used the wrong label. I hope this makes more sense.
You are correct that if S1='A' and S2 = u'A', then S1 == S2. Instead of assuming this though, you can do a simple test:
key_dict = {u'A': 'Value1',
            'A': 'Value2'}
print key_dict
print u'A' == 'A'
This outputs:
{u'A': 'Value2'}
True
That resolved, let's look at:
new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')
There's a problem here: \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (in both ASCII and Unicode) and clearly not the value you need to replace. Even if the target and search string were both ASCII or both Unicode, you'd still not find the '\x97'. The second problem is that you are searching for a non-Unicode string in a Unicode string. The easiest solution, and the one that makes the most sense, is to simply search for u'\x97':
print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')
Outputs:
KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
Why not the obvious .replace(u'\x97','')? Where does the idea of that '|' come from?
>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'
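To build the lookup dictionary more or less automatically, as the question asks, one option is to normalize every heading the same way instead of listing each stray character by hand. The sketch below is written for Python 3 (where every str is Unicode) and assumes that only letters and digits matter for the comparison; normalize_key and the sample headings are invented for illustration.

```python
def normalize_key(heading):
    """Upper-case the heading and drop everything except letters and digits."""
    return "".join(ch for ch in heading.upper() if ch.isalnum())

# Hypothetical sample headings with different separators between the parts.
headings = ["KEY1A\x97RISKFACTORS", "Key 1A Risk Factors", "KEY1ARISKFACTORS"]
key_dict = {h: normalize_key(h) for h in headings}

print(key_dict["KEY1A\x97RISKFACTORS"])
print(len(set(key_dict.values())))  # all three variants map to one key
```

This also removes whitespace and punctuation, so it replaces the upper/split/join step described in the question; if some punctuation is significant in your headings, filter more selectively.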
