python split a unicode string by 3-bytes utf8 character

Suppose we have a Unicode string in Python:
s = u"abc你好def啊"
I want to split it into runs of ASCII and non-ASCII characters, so that the result looks like this:
result = ["abc", "你好", "def", "啊"]
How can I implement that?

With a regex you can split wherever the text switches between alphanumeric ASCII characters and everything else:
>>> import re
>>> re.findall(r'([a-zA-Z0-9]+|[^a-zA-Z0-9]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
Or, with all printable ASCII characters except the space (escaped so that characters such as ] and \ are safe inside the character class):
>>> ascii_chars = re.escape(''.join(chr(x) for x in range(33, 127)))
>>> re.findall('([{0}]+|[^{0}]+)'.format(ascii_chars), u"abc你好def啊")
['abc', '你好', 'def', '啊']
Or, even simpler, as suggested by @Dolda2000, with a character range from space to tilde:
>>> re.findall('([ -~]+|[^ -~]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
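The same idea can be written against the full 7-bit ASCII range, control characters included. This is a minimal Python 3 sketch; the names _RUNS and split_ascii_runs are assumptions made for illustration and are not part of the answer above.

```python
import re

# Matches either a run of ASCII characters (code points 0-127)
# or a run of non-ASCII characters.
_RUNS = re.compile(r'[\x00-\x7f]+|[^\x00-\x7f]+')

def split_ascii_runs(text):
    """Split text into alternating ASCII and non-ASCII runs (hypothetical helper)."""
    return _RUNS.findall(text)

print(split_ascii_runs("abc你好def啊"))
```

Unlike the space-to-tilde range, this variant keeps tabs and newlines in the ASCII runs, which may or may not be what you want.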

You can do something like this:
s = u"abc你好def啊"
status = ord(s[0]) < 128  # True while the current run is ASCII
word = ""
res = []
for b, letter in zip([ord(c) < 128 for c in s], s):
    if b != status:
        # The character class changed, so close the current run.
        res.append(word)
        status = b
        word = ""
    word += letter
res.append(word)
print(res)
>> ["abc", "你好", "def", "啊"]
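The manual loop above tracks when the "is ASCII" flag flips. As a hedged alternative, itertools.groupby can do that bookkeeping. The function name split_by_ascii is an assumption for illustration, and the sketch targets Python 3.

```python
from itertools import groupby

def split_by_ascii(text):
    """Group consecutive characters by whether their code point is below 128."""
    return ["".join(group) for _, group in groupby(text, key=lambda c: ord(c) < 128)]

print(split_by_ascii("abc你好def啊"))
```

Because groupby starts a new group whenever the key changes, an empty input simply yields an empty list, whereas the loop version would fail on s[0].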

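import re
s = "abc你好def啊"
filter(None, re.split(r'(\w+|\W+)', s))
This works in Python 2.x, where s is a byte string and \w matches only ASCII word characters. In Python 3, \w also matches 你好 and 啊, so the string is not split.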

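Just ... split.
[s[i:i+3] for i in xrange(0, len(s), 3)]
http://ideone.com/PeoGaF
Note that this slices s into fixed chunks of three units. On a UTF-8 byte string, where each of these Chinese characters takes three bytes, it yields 你 and 好 as separate items rather than 你好.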

Related

How do I let Python know that these two words are the same?

I have a .csv file. Here is an example:
Table format:
A    | AA
BB   | B
C    | CC
D D  | D DD
Text format:
A,AA
BB,B
C,CC
D D,D DD
I want Python to treat A as equal to AA, BB as equal to B, and C as equal to CC. The fourth pair also contains spaces.
The check should work in both directions, so AA must also be equal to A.
I'm working on a search engine, and a word may be written in two ways, which is why I need this.
For example, I have a boolean check on the search result: if the query is A and the result is AA, it should be True. Of course, a result of A should also be True.
My code:
query = …  # "AA"
result_list = …  # ["A"]
sorted_list = [element for element in result_list if element.find(query) != -1]
Feel free to leave a comment if you need more information.

You can solve this by reducing the strings on the left and right sides to a simplified form and comparing the results. For example, A is simplified to A, AA to A, and both D D and D DD to the characters D and space. Try this:
import pandas as pd
def simplify(s):
    # Keep each distinct character once; sort so the result is deterministic.
    return "".join(sorted(set(s)))
# data.csv
# A,AA
# BB,B
# C,CC
# D D,D DD
df = pd.read_csv('data.csv', header=None)
rows = df.values.tolist()
result = []
for row in rows:
    result.append(simplify(row[0]) == simplify(row[1]))
print(result)
The output should be:
[True, True, True, True]

After using the csv module to read each row as a list, remove any spaces and compare the sets of characters:
>>> A = ['AA']
>>> D = ['D','D DD']
>>> all(set(x.replace(' ', '')) == set(A[0]) for x in A)
True
>>> all(set(x.replace(' ', '')) == set(D[0]) for x in D)
True

Turn both strings into sets, add a space to one of them, and check whether it contains every character of the other. Wrap that in a function and use it as you like:
def isequal_with_spaces(s1, s2):
    set1 = set(s1 + " ")
    return set1.issuperset(s2)
assert isequal_with_spaces('A', 'AA')
assert not isequal_with_spaces('A', 'BA')
assert isequal_with_spaces('A', 'A A')
assert isequal_with_spaces('AA', 'AAA A ')
assert not isequal_with_spaces('B', 'A A')
This doesn't take case into account, and it may not do what you need for pairs like 'AB' and 'A A' (which it treats as equal), but that wasn't specified in the question =)
Moreover, if a "word" is guaranteed to be a capital letter followed by any number of spaces or repetitions of the same letter, you can simplify isequal_with_spaces even further:
def isequal_with_spaces(s1, s2):
    return s1[0] == s2[0]

One way to handle this scenario is to use a dictionary that maps words to their equivalent words. For example:
word_map = {
    "A": "AA",
    "AA": "A",
    "BB": "B",
    "B": "BB",
    "C": "CC",
    "CC": "C",
    "D D": "D DD",
    "D DD": "D D",
}
Then, you can use this dictionary to check if the query is equivalent to any of the elements in the result_list:
equivalent_word = word_map.get(query, None)
if equivalent_word in result_list:
    sorted_list.append(equivalent_word)
You can add all the equivalent words to the word_map dictionary, and then use this dictionary to check if a word is equivalent to another word.
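Rather than typing the dictionary by hand, it could be built from the two-column CSV in the question. The sketch below is only one way to do that: the helper names build_word_map and is_match are hypothetical, and an in-memory string stands in for the real file so the example is self-contained.

```python
import csv
import io

# Stand-in for the CSV file from the question (assumed to have exactly two columns per row).
csv_text = "A,AA\nBB,B\nC,CC\nD D,D DD\n"

def build_word_map(lines):
    """Map each spelling to its counterpart, in both directions."""
    word_map = {}
    for left, right in csv.reader(lines):
        word_map[left] = right
        word_map[right] = left
    return word_map

def is_match(query, result, word_map):
    """True if result is the query itself or its registered equivalent."""
    return result == query or word_map.get(query) == result

word_map = build_word_map(io.StringIO(csv_text))
print(is_match("AA", "A", word_map))
print(is_match("D D", "D DD", word_map))
```

With a real file, the StringIO object would be replaced by an open file handle; the lookup itself stays the same.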

Maybe it helps to check the hex codes of the characters. Something like this:
text = "A AA"
hex_code = ":".join("{:02x}".format(ord(x)) for x in text)
if "41" in hex_code:  # 0x41 is the code point of "A"
    print("True")

how to split after each word and get the following string in an organized way?

Given the following string:
'hello0192239world0912903spam209394'
I would like to be able to split the above string into this:
hello, 0192239, world, 0912903, spam, 209394
and ideally end up with a list of pairs:
[hello, 0192239], [world, 0912903], [spam, 209394]
But I don't know how to approach even the first step, splitting into a word followed by a number. I know there is the split method and something called regex, but I don't know how to use them or whether they are even the right tools.

Try this:
>>> import re
>>> lst = re.split(r'(\d+)', 'hello0192239world0912903spam209394')
>>> list(zip(lst[::2],lst[1::2]))
[('hello', '0192239'), ('world', '0912903'), ('spam', '209394')]
>>> lst = re.split(r'(\d+)', '09182hello2349283world892')
>>> list(zip(lst[::2],lst[1::2]))
[('', '09182'), ('hello', '2349283'), ('world', '892')]
# as a list
>>> list(map(list,zip(lst[::2],lst[1::2])))
[['', '09182'], ['hello', '2349283'], ['world', '892']]
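If every item is strictly a word followed by a number, re.findall with two groups returns the pairs directly. This is a sketch under that assumption; the function name word_number_pairs is made up for illustration.

```python
import re

def word_number_pairs(text):
    """Return [word, number] pairs for each run of letters followed by digits."""
    return [list(pair) for pair in re.findall(r'([A-Za-z]+)(\d+)', text)]

print(word_number_pairs('hello0192239world0912903spam209394'))
```

One difference from the re.split approach: a leading number with no word in front of it, as in '09182hello2349283world892', is skipped instead of being paired with an empty string.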

See below. The idea is to maintain a 'mode' and flip mode every time you switch from digit to char or the other way around.
data = 'hello0192239world0912903spam209394'
A = 'A'
D = 'D'
mode = D if data[0].isdigit() else A
holder = []
tmp = []
for x in data:
    if mode == A:
        is_digit = x.isdigit()
        if is_digit:
            mode = D
            holder.append(''.join(tmp))
            tmp = [x]
            continue
    else:
        is_char = not x.isdigit()
        if is_char:
            mode = A
            holder.append(''.join(tmp))
            tmp = [x]
            continue
    tmp.append(x)
holder.append(''.join(tmp))
print(holder)
output
['hello', '0192239', 'world', '0912903', 'spam', '209394']
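The mode-flipping loop above can also be expressed with itertools.groupby, which starts a new group whenever the key function's result changes. This is a minimal sketch, not the answer author's code; split_runs is a hypothetical name.

```python
from itertools import groupby

def split_runs(text):
    """Split text into alternating runs of digits and non-digits."""
    return ["".join(group) for _, group in groupby(text, key=str.isdigit)]

print(split_runs('hello0192239world0912903spam209394'))
```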

python distinguish number and string solution

I'm new to Python and trying to separate the digits from the letters in a string.
For example:
Input: 111aa111aa
Output: Number: 111111, String: aaaa

Here is your answer.
For numbers:
import re
x = '111aa111aa'
num = ''.join(re.findall(r'\d+', x))
For letters:
alphabets = ''.join(re.findall(r'[a-zA-Z]', x))

You can use the built-in string methods isdigit() and isalpha():
>>> x = '111aa111aa'
>>> number = ''.join([i for i in x if i.isdigit()])
>>> number
'111111'
>>> string = ''.join([i for i in x if i.isalpha()])
>>> string
'aaaa'
Or you can use a regex here:
>>> x = '111aa111aa'
>>> import re
>>> numbers = ''.join(re.findall(r'\d+', x))
>>> numbers
'111111'
>>> string = ''.join(re.findall(r'[a-zA-Z]', x))
>>> string
'aaaa'

>>> my_string = '111aa111aa'
>>> ''.join(filter(str.isdigit, my_string))
'111111'
>>> ''.join(filter(str.isalpha, my_string))
'aaaa'

Try isalpha for letters and isdigit for digits:
In [45]: a = '111aa111aa'
In [47]: ''.join([i for i in a if i.isalpha()])
Out[47]: 'aaaa'
In [48]: ''.join([i for i in a if i.isdigit()])
Out[48]: '111111'
Or, in Python 2, where filter applied to a string returns a string:
In [18]: strings,numbers = filter(str.isalpha,a),filter(str.isdigit,a)
In [19]: print strings,numbers
aaaa 111111
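In Python 3, filter returns an iterator rather than a string, so the results have to be joined before printing. A small sketch of that, wrapped in a function whose name (separate) is only an illustration:

```python
def separate(text):
    """Return (digits, letters) collected from text, each as a string."""
    digits = "".join(filter(str.isdigit, text))
    letters = "".join(filter(str.isalpha, text))
    return digits, letters

number, string = separate("111aa111aa")
print("Number:", number, ", String:", string)
```

Characters that are neither digits nor letters, such as spaces or punctuation, are dropped from both results.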

Since you mentioned you are new to Python: most of the approaches presented here, using str.join with list comprehensions or a functional style, are quite sufficient. As alternatives, here are some options using dictionaries, which can help organize the data, going from basic to intermediate with arguably increasing intricacy.
Basic Alternative
s = "111aa111aa"
# Dictionary
d = {"Number": "", "String": ""}
for char in s:
    if char.isdigit():
        d["Number"] += char
    elif char.isalpha():
        d["String"] += char
d
# {'Number': '111111', 'String': 'aaaa'}
d["Number"]  # access by key
# '111111'
import collections as ct
# Default Dictionary
dd = ct.defaultdict(str)
for char in s:
    if char.isdigit():
        dd["Number"] += char
    elif char.isalpha():
        dd["String"] += char
dd

Same Python code returns different results for same input string

The code below is supposed to return the most common letter in the text string, subject to these rules:
always lowercase
ignoring punctuation and spaces
for words such as "One", where no letter occurs more than once, return the letter that comes first in the alphabet
Each time I run the code with the same string, e.g. "One", the result cycles through the letters. Strangely, though, this only starts on the third try (in this "One" example).
text=input('Insert String: ')
def mwl(text):
    from string import punctuation
    from collections import Counter
    for l in punctuation:
        if l in text:
            text = text.replace(l,'')
    text = text.lower()
    text=''.join(text.split())
    text= sorted(text)
    collist=Counter(text).most_common(1)
    print(collist[0][0])
mwl(text)

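Counter uses a dictionary:
>>> from collections import Counter
>>> Counter('one')
Counter({'e': 1, 'o': 1, 'n': 1})
Dictionaries are not ordered (before Python 3.7), so when several letters share the highest count, most_common(1) may return any of them. That explains the behavior.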

You can get the desired output with an OrderedDict by replacing these two lines:
text= sorted(text)
collist=Counter(text).most_common(1)
with:
collist = OrderedDict([(i,text.count(i)) for i in text])
collist = sorted(collist.items(), key=lambda x:x[1], reverse=True)
You also need to import OrderedDict for this.
Demo:
>>> from collections import Counter, OrderedDict
>>> text = 'One'
>>> collist = OrderedDict([(i,text.count(i)) for i in text])
>>> print(sorted(collist.items(), key=lambda x:x[1], reverse=True)[0][0])
O
>>> print(sorted(collist.items(), key=lambda x:x[1], reverse=True)[0][0])
O # it will always return O
>>> text = 'hello'
>>> collist = OrderedDict([(i,text.count(i)) for i in text])
>>> print(sorted(collist.items(), key=lambda x:x[1], reverse=True)[0][0])
l # l returned because it is most frequent

This can also be done without Counter or OrderedDict:
In [1]: s = 'Find the most common letter in THIS sentence!'
In [2]: letters = [letter.lower() for letter in s if letter.isalpha()]
In [3]: max(set(letters), key=letters.count)
Out[3]: 'e'
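When several letters share the highest count, the tie still has to be broken explicitly to get the alphabetically first letter the question asks for. One way to sketch that is to pick the minimum over a (negative count, letter) key. The function name most_common_letter is hypothetical, and using isalpha to drop punctuation and spaces (which also drops digits) is an assumption.

```python
from collections import Counter

def most_common_letter(text):
    """Most frequent letter, lowercased; ties go to the alphabetically first letter."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    # Highest count sorts first because of the negation; equal counts fall back to the letter.
    # Raises ValueError if text contains no letters.
    return min(counts, key=lambda ch: (-counts[ch], ch))

print(most_common_letter("One"))
print(most_common_letter("hello"))
```

Because the key is fully determined by the count and the letter, the result does not depend on dictionary or set ordering.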

Single integer to multiple integer translation in Python

I'm trying to translate a single integer input to a multiple integer output, and am currently using the maketrans function. For instance,
intab3 = "abcdefg"
outtab3 = "ABCDEFG"
trantab3 = maketrans(intab3, outtab3)
is the most basic version of what I'm doing. What I'd like is for the input to be a single letter and the output to be multiple letters. So something like:
intab4 = "abc"
outtab = "yes,no,maybe"
but commas and quotation marks don't work.
It keeps saying:
ValueError: maketrans arguments must have same length
Is there a better function I should be using? Thanks,

You can use a dict here:
>>> dic = {"a":"yes", "b":"no", "c":"maybe"}
>>> strs = "abcd"
>>> "".join(dic.get(x,x) for x in strs)
'yesnomaybed'

In Python 3, str.translate accepts a table that maps code points to strings of any length, so this just works:
>>> intab4 = "abc"
>>> outtab = "yes,no,maybe"
>>> d = {ord(k): v for k, v in zip(intab4, outtab.split(','))}
>>> print(d)
{97: 'yes', 98: 'no', 99: 'maybe'}
>>> 'abcdefg'.translate(d)
'yesnomaybedefg'
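In Python 3 the table can also be built with str.maketrans, which accepts a single dictionary whose keys are one-character strings and converts them to code points. A short sketch, where the variable name table is arbitrary:

```python
# A one-argument str.maketrans call takes a dict; values may be strings of any length.
table = str.maketrans({"a": "yes", "b": "no", "c": "maybe"})

print(table)
print("abcdefg".translate(table))
```

Characters without an entry in the table, such as d through g here, pass through unchanged.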
