Python: split a unicode string by 3-byte UTF-8 characters

Suppose we have a unicode string in Python:
s = u"abc你好def啊"
Now I want to split it into runs of ASCII and non-ASCII characters, with a result like
result = ["abc", "你好", "def", "啊"]
So, how can I implement that?

With a regex you can simply alternate between runs of alphanumeric characters and runs of everything else.
>>> import re
>>> re.findall('([a-zA-Z0-9]+|[^a-zA-Z0-9]+)', u"abc你好def啊")
["abc", "你好", "def", "啊"]
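Or, with all printable ASCII characters (escaped with re.escape so that characters such as \ and ] are taken literally inside the character class):
>>> ascii = re.escape(''.join(chr(x) for x in range(33, 127)))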
>>> re.findall('([{}]+|[^{}]+)'.format(ascii, ascii), u"abc你好def啊")
['abc', '你好', 'def', '啊']
Or, even simpler, as suggested by @Dolda2000:
>>> re.findall('([ -~]+|[^ -~]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']

You can do something like this:
s = u"abc你好def啊"
status = ord(s[0]) < 128
word = ""
res = []
for b, letter in zip([ord(c) < 128 for c in s], s):
    # b is True for ASCII characters; start a new word whenever it flips
    if b != status:
        res.append(word)
        status = b
        word = ""
    word += letter
res.append(word)
print(res)
>> ["abc", "你好", "def", "啊"]
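The same idea (start a new piece whenever the "is ASCII" flag flips) can be sketched with itertools.groupby. This is a minimal Python 3 sketch; the function name split_ascii_runs is made up for illustration.

```python
from itertools import groupby


def split_ascii_runs(text):
    """Split text into maximal runs of ASCII / non-ASCII characters."""
    # groupby starts a new group each time the key (ASCII or not) changes.
    return ["".join(run) for _, run in groupby(text, key=lambda ch: ord(ch) < 128)]


print(split_ascii_runs("abc你好def啊"))  # ['abc', '你好', 'def', '啊']
```

An empty input string simply yields an empty list, so no special case for s[0] is needed.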

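import re
s = "abc你好def啊"
filter(None, re.split(r'(\w+|\W+)', s))
This works in Python 2.x only, where s is a byte string and \w matches only ASCII word characters by default. In Python 3, \w also matches the Chinese characters, so the string is not split unless the re.ASCII flag is passed.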
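For Python 3, a minimal sketch of the same split, assuming the goal is "ASCII word characters versus everything else". The re.ASCII flag restricts \w to [a-zA-Z0-9_]; the variable name parts is only illustrative.

```python
import re

s = "abc你好def啊"

# With re.ASCII, \w matches only [a-zA-Z0-9_], so the Chinese characters fall under \W.
# re.split keeps the captured pieces and puts empty strings between them; drop those.
parts = [p for p in re.split(r"(\w+|\W+)", s, flags=re.ASCII) if p]

print(parts)  # ['abc', '你好', 'def', '啊']
```

Note that spaces and ASCII punctuation also count as \W here, so they would end up in the same pieces as the non-ASCII characters.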

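Just ... split into fixed-size chunks of 3 (this assumes a Python 2 byte string in which every Chinese character takes 3 UTF-8 bytes, so 你 and 好 come out as separate items):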
[s[i:i+3] for i in xrange(0, len(s), 3)]
http://ideone.com/PeoGaF

Related

How do I let Python know that these two words are the same?

I have a .csv file. Here is an example:
Table format:
A   | AA
BB  | B
C   | CC
D D | D DD
Text format:
A,AA
BB,B
C,CC
D D,D DD
I want Python to know that A is equal to AA, BB is equal to B, and C is equal to CC. Also, the fourth example has spaces.
It can also be reversed, such as checking whether AA is equal to A.
What I'm working on is a search engine. A word may be written in two ways, so I need to do this.
For example, I have a boolean check on the search results: if the query is A and the search returns AA, the check should be True. Of course, returning A should also be True.
My code:
query = …  # "AA"
result_list = …  # ["A"]
sorted_list = [element for element in result_list if element.find(query) != -1]
Feel free to leave a comment if you need more information.
You can solve that by comparing a simplified form of the strings on the left and right sides. For example, A is simplified to A, AA to A, and both D D and D DD to the same pair of characters (D and a space). Try this:
import pandas as pd
def simplify(s):
    # Sort the unique characters so that equal character sets always give the same string
    return "".join(sorted(set(s)))
# data.csv
# A,AA
# BB,B
# C,CC
# D D,D DD
df = pd.read_csv('data.csv', header=None)
df = df.values.tolist()
result = []
for i in df:
    result.append(simplify(i[0]) == simplify(i[1]))
print(result)
The output should be:
[True, True, True, True]
After using the csv module to read the rows as lists, remove any spaces and compare the sets of characters.
For example: all(set(x.replace(' ', '')) == set(A[0]) for x in A)
>>> A = ['AA']
>>> D = ['D','D DD']
>>> all(set(x.replace(' ', '')) == set(A[0]) for x in A)
True
>>> all(set(x.replace(' ', '')) == set(D[0]) for x in D)
True
Turn both strings into sets, add a space to one, and check whether it is a superset of the other. Wrap that into a function and use it as you like:
def isequal_with_spaces(s1, s2):
    set1 = set(s1 + " ")
    return set1.issuperset(s2)
assert isequal_with_spaces('A', 'AA')
assert not isequal_with_spaces('A', 'BA')
assert isequal_with_spaces('A', 'A A')
assert isequal_with_spaces('AA', 'AAA A ')
assert not isequal_with_spaces('B', 'A A')
This doesn't take case into account and might not work quite as you need for strings like 'AB' == 'A A', but that wasn't specified in the question =)
Moreover, if a "word" can be defined as "a capital letter followed by any number of spaces or repetitions of the same letter", and that's guaranteed, you can simplify isequal_with_spaces even further:
def isequal_with_spaces(s1, s2):
    return s1[0] == s2[0]
One way to handle this scenario is to use a dictionary that maps words to their equivalent words. For example:
word_map = {
    "A": "AA",
    "AA": "A",
    "BB": "B",
    "B": "BB",
    "C": "CC",
    "CC": "C",
    "D D": "D DD",
    "D DD": "D D",
}
Then, you can use this dictionary to check if the query is equivalent to any of the elements in the result_list:
equivalent_word = word_map.get(query, None)
if equivalent_word in result_list:
    sorted_list.append(equivalent_word)
You can add all the equivalent words to the word_map dictionary, and then use this dictionary to check if a word is equivalent to another word.
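Rather than typing the dictionary by hand, it could be built from the rows of the .csv file. The following is a minimal sketch under stated assumptions: csv_text stands in for the real file so the example is self-contained, and the helper name matches is hypothetical.

```python
import csv
import io

# Stand-in for the question's .csv file (assumption: two columns, no header).
csv_text = "A,AA\nBB,B\nC,CC\nD D,D DD\n"

# Record each pair in both directions so lookups work either way.
word_map = {}
for left, right in csv.reader(io.StringIO(csv_text)):
    word_map[left] = right
    word_map[right] = left


def matches(query, result):
    """Return True if result is the query itself or its recorded equivalent."""
    return result == query or word_map.get(query) == result


query = "AA"
result_list = ["A", "B"]
print([r for r in result_list if matches(query, r)])  # ['A']
```

With a real file, csv.reader(open('data.csv', newline='')) would replace the io.StringIO wrapper.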
Maybe it helps if you check for the hex code. Something like this:
text = "A AA"  # renamed from input to avoid shadowing the built-in
hex_code = ":".join("{:02x}".format(ord(x)) for x in text)
if "41" in hex_code:
    print("True")

how to split after each word and get the following string in an organized way?

Given the following string:
'hello0192239world0912903spam209394'
I would like to be able to split the above string into this
hello, 0192239, world, 0912903, spam, 209394
and ideally end with a list:
[hello, 0192239], [world, 0912903], [spam, 209394]
But I just don't know how to go about even the first step, splitting by word and number. I know there's the split method and something called regex, but I don't know how to use them, or even whether they're the right thing to use.
Try this:
>>> import re
>>> lst = re.split(r'(\d+)', 'hello0192239world0912903spam209394')
>>> list(zip(lst[::2], lst[1::2]))
[('hello', '0192239'), ('world', '0912903'), ('spam', '209394')]
>>> lst = re.split(r'(\d+)', '09182hello2349283world892')
>>> list(zip(lst[::2], lst[1::2]))
[('', '09182'), ('hello', '2349283'), ('world', '892')]
# as a list
>>> list(map(list, zip(lst[::2], lst[1::2])))
[['', '09182'], ['hello', '2349283'], ['world', '892']]
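If every number is preceded by a word, re.findall with two capture groups returns the pairs directly. This is a minimal sketch assuming the words consist of ASCII letters only; the function name word_number_pairs is made up for illustration.

```python
import re


def word_number_pairs(text):
    """Return [word, number] pairs for each run of letters followed by digits."""
    # Each match is a (letters, digits) tuple; convert it to a list as requested.
    return [list(pair) for pair in re.findall(r"([A-Za-z]+)(\d+)", text)]


print(word_number_pairs("hello0192239world0912903spam209394"))
# [['hello', '0192239'], ['world', '0912903'], ['spam', '209394']]
```

Unlike the re.split version, leading digits with no word in front of them (as in '09182hello2349283world892') are skipped rather than paired with an empty string.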
See below. The idea is to maintain a 'mode' and flip mode every time you switch from digit to char or the other way around.
data = 'hello0192239world0912903spam209394'
A = 'A'
D = 'D'
mode = D if data[0].isdigit() else A
holder = []
tmp = []
for x in data:
    if mode == A:
        is_digit = x.isdigit()
        if is_digit:
            mode = D
            holder.append(''.join(tmp))
            tmp = [x]
            continue
    else:
        is_char = not x.isdigit()
        if is_char:
            mode = A
            holder.append(''.join(tmp))
            tmp = [x]
            continue
    tmp.append(x)
holder.append(''.join(tmp))
print(holder)
output
['hello', '0192239', 'world', '0912903', 'spam', '209394']
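The mode-flipping loop can also be sketched with itertools.groupby, which starts a new group whenever str.isdigit changes from one character to the next. This is an alternative sketch, not the answer's own code.

```python
from itertools import groupby

data = "hello0192239world0912903spam209394"

# Consecutive characters with the same isdigit() result end up in one group.
holder = ["".join(run) for _, run in groupby(data, key=str.isdigit)]

print(holder)  # ['hello', '0192239', 'world', '0912903', 'spam', '209394']
```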

python distinguish number and string solution

I'm new to Python and trying to separate the digits from the letters in a string.
For example :
Input: 111aa111aa
Output : Number: 111111 , String : aaaa
Here is your answer.
For numbers:
import re
x = '111aa111aa'
num = ''.join(re.findall(r'[\d]+',x))
For letters:
import re
x = '111aa111aa'
alphabets = ''.join(re.findall(r'[a-zA-Z]', x))
You can use the built-in string methods isdigit() and isalpha():
>>> x = '111aa111aa'
>>> number = ''.join([i for i in x if i.isdigit()])
>>> number
'111111'
>>> string = ''.join([i for i in x if i.isalpha()])
>>> string
'aaaa'
Or you can use regex here:
>>> x = '111aa111aa'
>>> import re
>>> numbers = ''.join(re.findall(r'\d+', x))
>>> numbers
'111111'
>>> string = ''.join(re.findall(r'[a-zA-Z]', x))
>>> string
'aaaa'
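A variation on the regex approach is to delete what you don't want instead of collecting what you do. This is a minimal sketch with re.sub, assuming "String" means ASCII letters only, as in the findall patterns above.

```python
import re

x = "111aa111aa"

number = re.sub(r"\D", "", x)         # remove every non-digit character
string = re.sub(r"[^A-Za-z]", "", x)  # remove everything that is not an ASCII letter

print("Number: {} , String: {}".format(number, string))
# Number: 111111 , String: aaaa
```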
>>> my_string = '111aa111aa'
>>> ''.join(filter(str.isdigit, my_string))
'111111'
>>> ''.join(filter(str.isalpha, my_string))
'aaaa'
Try isalpha for letters and isdigit for digits:
In [45]: a = '111aa111aa'
In [47]: ''.join([i for i in a if i.isalpha()])
Out[47]: 'aaaa'
In [48]: ''.join([i for i in a if i.isdigit()])
Out[48]: '111111'
OR (in Python 3, filter returns an iterator, so join the results):
In [18]: strings, numbers = ''.join(filter(str.isalpha, a)), ''.join(filter(str.isdigit, a))
In [19]: print(strings, numbers)
aaaa 111111
As you mentioned you are new to Python, most of the presented approaches using str.join with list comprehensions or functional styles are quite sufficient. Alternatively, I present some options using dictionaries that can help organize data, starting from basic to intermediate examples with arguably increasing intricacy.
Basic Alternative
# Dictionary
s = "111aa111aa"
d = {"Number":"", "String":""}
for char in s:
    if char.isdigit():
        d["Number"] += char
    elif char.isalpha():
        d["String"] += char
d
# {'Number': '111111', 'String': 'aaaa'}
d["Number"] # access by key
# '111111'
import collections as ct
# Default Dictionary
dd = ct.defaultdict(str)
for char in s:
    if char.isdigit():
        dd["Number"] += char
    elif char.isalpha():
        dd["String"] += char
dd

Same Python code returns different results for same input string

The code below is supposed to return the most common letter in the TEXT string, in the following format:
always lowercase
ignoring punctuation and spaces
in the case of words such as "One", where no letter occurs more than once, return the letter that comes first in the alphabet
Each time I run the code using the same string, e.g. "One", the result cycles through the letters... weirdly though, only from the third try (in this "One" example).
text=input('Insert String: ')
def mwl(text):
    from string import punctuation
    from collections import Counter
    for l in punctuation:
        if l in text:
            text = text.replace(l,'')
    text = text.lower()
    text=''.join(text.split())
    text= sorted(text)
    collist=Counter(text).most_common(1)
    print(collist[0][0])
mwl(text)
Counter uses a dictionary:
>>> Counter('one')
Counter({'e': 1, 'o': 1, 'n': 1})
Before Python 3.7, dictionaries are not ordered, so the order of letters with equal counts can change between runs, hence the behavior.
You can get the desired output with OrderedDict by replacing the two lines below:
text= sorted(text)
collist=Counter(text).most_common(1)
with:
collist = OrderedDict([(i,text.count(i)) for i in text])
collist = sorted(collist.items(), key=lambda x:x[1], reverse=True)
You also need to import OrderedDict for this.
Demo:
>>> from collections import Counter, OrderedDict
>>> text = 'One'
>>> collist = OrderedDict([(i,text.count(i)) for i in text])
>>> print(sorted(collist.items(), key=lambda x:x[1], reverse=True)[0][0])
O
>>> print(sorted(collist.items(), key=lambda x:x[1], reverse=True)[0][0])
O # it will always return O
>>> text = 'hello'
>>> collist = OrderedDict([(i,text.count(i)) for i in text])
>>> print(sorted(collist.items(), key=lambda x:x[1], reverse=True)[0][0])
l # l returned because it is most frequent
This can also be done without Counter or OrderedDict:
In [1]: s = 'Find the most common letter in THIS sentence!'
In [2]: letters = [letter.lower() for letter in s if letter.isalpha()]
In [3]: max(set(letters), key=letters.count)
Out[3]: 'e'
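Ties are the source of the inconsistent results, so one option is to break them explicitly. This is a minimal sketch, assuming that "first letter in the alphabet" should decide between letters with equal counts; the function name most_common_letter is made up for illustration.

```python
from collections import Counter


def most_common_letter(text):
    """Return the most frequent letter, lowercased; ties go to the earliest letter."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    # Sort key: highest count first (negated), then alphabetical order.
    return min(counts, key=lambda ch: (-counts[ch], ch))


print(most_common_letter("One"))    # e
print(most_common_letter("hello"))  # l
```

Because the tie-break is part of the key, the result does not depend on dictionary or set ordering. A string with no letters makes min raise ValueError, so that case would need its own handling.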

Single integer to multiple integer translation in Python

I'm trying to translate a single integer input to a multiple integer output, and am currently using the maketrans function. For instance,
intab3 = "abcdefg"
outtab3 = "ABCDEFG"
trantab3 = maketrans(intab3, outtab3)
is the most basic version of what I'm doing. What I'd like to be able to do is have the input be a single letter and the output be multiple letters. So something like:
intab4 = "abc"
outtab = "yes,no,maybe"
but commas and quotation marks don't work.
It keeps saying :
ValueError: maketrans arguments must have same length
Is there a better function I should be using? Thanks,
You can use a dict here:
>>> dic = {"a":"yes", "b":"no", "c":"maybe"}
>>> strs = "abcd"
>>> "".join(dic.get(x,x) for x in strs)
'yesnomaybed'
In Python 3, str.translate accepts a mapping from code points to replacement strings of any length, so this just works:
>>> intab4 = "abc"
>>> outtab = "yes,no,maybe"
>>> d = {ord(k): v for k, v in zip(intab4, outtab.split(','))}
>>> print(d)
{97: 'yes', 98: 'no', 99: 'maybe'}
>>> 'abcdefg'.translate(d)
'yesnomaybedefg'
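In Python 3, str.maketrans also accepts a single dictionary whose keys are one-character strings, which builds the same code-point table without calling ord by hand. A minimal sketch follows; the variable name table is only illustrative.

```python
# Keys must be single characters (or integer code points); values can be longer strings.
table = str.maketrans({"a": "yes", "b": "no", "c": "maybe"})

print(table)                       # {97: 'yes', 98: 'no', 99: 'maybe'}
print("abcdefg".translate(table))  # yesnomaybedefg
```

Characters that are not in the table, such as d through g here, are left unchanged.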
