Python convert utf-8 back to string

I have a string which looks like
a = 'Verm\xc3\xb6gensverzeichnis'
When I do print(a), it shows me the right result, which is Vermögensverzeichnis.
print(a)
Vermögensverzeichnis
What I want to do is count the occurrences of each letter using Counter() and save them in a dataframe. When I use Counter(a), it gives me a result like this:
Counter({'V': 1,
'c': 1,
'e': 4,
'g': 1,
'h': 1,
'i': 2,
'm': 1,
'n': 2,
'r': 2,
's': 2,
'v': 1,
'z': 1,
'\xb6': 1,
'\xc3': 1})
Could you please help me get rid of codes like \xc3\xb6? I have tried many existing answers, but unfortunately they do not work.
Thanks a lot in advance!

This must be Python 2. Work with Unicode if you want to count characters vs. encoded bytes. \xc3\xb6 are the encoded bytes of ö:
>>> a = 'Verm\xc3\xb6gensverzeichnis'
>>> print a # Note this only works if your terminal is configured for UTF-8 encoding.
Vermögensverzeichnis
Decode to Unicode. It should still print correctly as long as your terminal is configured correctly:
>>> u = a.decode('utf8')
>>> u
u'Verm\xf6gensverzeichnis'
>>> print u
Vermögensverzeichnis
Count the Unicode code points:
>>> from collections import Counter
>>> Counter(u)
Counter({u'e': 4, u'i': 2, u'n': 2, u's': 2, u'r': 2, u'c': 1, u'v': 1, u'g': 1, u'h': 1, u'V': 1, u'm': 1, u'\xf6': 1, u'z': 1})
u'\xf6' is the Unicode codepoint for ö. Print the keys and values to display them on the terminal properly:
>>> for k,v in Counter(u).iteritems():
...     print k,v
...
c 1
v 1
e 4
g 1
i 2
h 1
V 1
m 1
n 2
s 2
r 2
ö 1
z 1
Future study to see where this will break: Unicode normalization and graphemes.
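As a rough illustration of the normalization caveat, here is a minimal sketch. It assumes Python 3 (where the encoded data is a bytes object, unlike the Python 2 session above) and uses the standard unicodedata module. The variable names are only illustrative.

```python
import unicodedata
from collections import Counter

raw = b'Verm\xc3\xb6gensverzeichnis'   # UTF-8 encoded bytes
text = raw.decode('utf-8')              # str of Unicode code points
counts = Counter(text)                  # 'ö' is counted as one character

# NFD splits 'ö' into 'o' plus a combining diaeresis (U+0308),
# so a naive count now sees two code points instead of one letter.
decomposed = unicodedata.normalize('NFD', text)
counts_nfd = Counter(decomposed)

# Normalizing back to NFC before counting restores the composed form.
counts_nfc = Counter(unicodedata.normalize('NFC', decomposed))

print(counts['ö'], counts_nfd['o'], counts_nfd['\u0308'])
```

Normalizing to one form (NFC is a common choice) before counting keeps visually identical strings from producing different counts. Grapheme clusters that have no composed form would still need a dedicated library.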

Related

Detect specific value change in Pandas DataSeries

I want to detect a specific change in the value of a pandas Series.
Given a Series built from data in the following format:
ds = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 1, 'f': 5, 'g': 2}
With the Series built-in diff(1) function I am able to detect a value change and how big it was. Is it possible to get only the occurrences where the value changes from 4 to 1?

Compare the original values with 4 and the values shifted by one position with 1, then use sum to count the True values:
import pandas as pd
ds = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 1, 'f': 5, 'g': 2}
s = pd.Series(ds)
# Keep s as the Series; store the number of 4 -> 1 changes separately
count = (s.eq(4) & s.shift(-1).eq(1)).sum()
print (count)
1
Details:
print (s.eq(4) & s.shift(-1).eq(1))
a False
b False
c False
d True
e False
f False
g False
dtype: bool
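If the positions of the change are needed rather than the count, the same comparison can be turned around. The following is a small sketch under the assumption that the label where the new value appears is what is wanted. The helper name transition_labels is hypothetical and not part of pandas.

```python
import pandas as pd


def transition_labels(s, old, new):
    """Return the index labels at which `new` directly follows `old`."""
    # s.shift(1) aligns every value with its predecessor
    mask = s.shift(1).eq(old) & s.eq(new)
    return s[mask].index.tolist()


s = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 1, 'f': 5, 'g': 2})
labels = transition_labels(s, 4, 1)
print(labels)
```

Here shift(1) is used instead of shift(-1), so the mask is True at the row holding the new value (1) rather than at the row holding the old value (4).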

Combinations with max length and per element max repetition values

My goal is to find a more efficient way to get all combinations of 1 to r mixed elements, where each family of element potentially has a different count and r is a parameter. The elements can be any (hashable) type. The result is a list of Counter-like dictionaries.
Here is an example data:
example = {1e-8: 3, "k": 2}
r = 5 # sum(example.values()) == 5 therefore all possible combinations for this example
The expected result is the following:
[{1e-08: 1},
{'k': 1},
{1e-08: 2},
{1e-08: 1, 'k': 1},
{'k': 2},
{1e-08: 3},
{1e-08: 2, 'k': 1},
{1e-08: 1, 'k': 2},
{1e-08: 3, 'k': 1},
{1e-08: 2, 'k': 2},
{1e-08: 3, 'k': 2}]
... corresponding to every possible combination of 1, 2, 3, 4 and 5 elements.
The order preservation of the list is preferable (since Python 3.7+ preserves the order of keys inside dictionaries) but not mandatory.
Here is the solution I currently use:
from more_itertools import distinct_combinations
from collections import Counter
def all_combis(elements, r=None):
    if r is None:
        r = sum(elements.values())
    # "Flattening" by repeating the elements according to their count
    flatt = []
    for k, v in elements.items():
        flatt.extend([k] * v)
    for r in range(1, r+1):
        for comb in distinct_combinations(flatt, r):
            yield dict(Counter(comb))

list(all_combis(example))
# > The expected result
A real-life example has 300 elements distributed among 15 families. It is processed in ~13 seconds with a value of r=10 for about 2 million combinations, and ~31 seconds with r=11 for 4.5 million combinations.
I'm guessing there are better ways which avoid "flattening" the elements and/or counting the combinations, but I struggle to find any when each element has a different count.
Can you design a more time-efficient solution?

The keys are a bit of a distraction. They can be added in later. Mathematically, what you have is a vector of bounds, together with a global bound, and want to generate all tuples where each element is bounded by its respective bound, and the total is bounded by the global bound. This leads to a simple recursive approach based on the idea that if
(a_1, a_2, ..., a_n) <= (b_1, b_2, ..., b_n) with a_1 + ... + a_n <= k
then
(a_2, ..., a_n) <= (b_2, ..., b_n) with a_2 + ... + a_n <= k - a_1
This leads to something like:
def bounded_tuples(r,bounds):
    n = len(bounds)
    if r == 0:
        return [(0,)*n]
    elif n == 0:
        return [()]
    else:
        tuples = []
        for i in range(1+min(r,bounds[0])):
            tuples.extend((i,)+t for t in bounded_tuples(r-i,bounds[1:]))
        return tuples
Note that this includes the solution with all 0's -- which you exclude, but that can be filtered out and the keys reintroduced:
def all_combis(elements, r=None):
    if r is None:
        r = sum(elements.values())
    for t in bounded_tuples(r,list(elements.values())):
        if max(t) > 0:
            yield dict(zip(elements.keys(),t))
For example:
example = {1e-8: 3, "k": 2}
for d in all_combis(example):
    print(d)
Output:
{1e-08: 0, 'k': 1}
{1e-08: 0, 'k': 2}
{1e-08: 1, 'k': 0}
{1e-08: 1, 'k': 1}
{1e-08: 1, 'k': 2}
{1e-08: 2, 'k': 0}
{1e-08: 2, 'k': 1}
{1e-08: 2, 'k': 2}
{1e-08: 3, 'k': 0}
{1e-08: 3, 'k': 1}
{1e-08: 3, 'k': 2}
Which is essentially what you have. The code could obviously be tweaked to eliminate dictionary entries with the value 0.
Timing with larger examples seems to suggest that my approach isn't any quicker than yours, though it still might give you some ideas.
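One possible version of that tweak is sketched below. It is a generator-based variant of the same recursion that omits zero entries, not a faster algorithm. The name all_combis_nonzero is made up for this illustration.

```python
def bounded_tuples(r, bounds):
    """Yield every tuple t with 0 <= t[i] <= bounds[i] and sum(t) <= r."""
    if not bounds:
        yield ()
        return
    for i in range(min(r, bounds[0]) + 1):
        for rest in bounded_tuples(r - i, bounds[1:]):
            yield (i,) + rest


def all_combis_nonzero(elements, r=None):
    """Yield Counter-like dicts, leaving out keys whose count is 0."""
    if r is None:
        r = sum(elements.values())
    keys = list(elements)
    for t in bounded_tuples(r, list(elements.values())):
        combo = {k: n for k, n in zip(keys, t) if n > 0}
        if combo:  # skip the all-zero tuple
            yield combo


example = {1e-8: 3, "k": 2}
result = list(all_combis_nonzero(example))
for d in result:
    print(d)
```

The dictionaries contain the same combinations as the expected result in the question, although the order differs because the tuples are generated lexicographically rather than by size.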

As @John Coleman said, without the keys you may be able to speed things up.
This recursive approach starts at the front of the list and iterates over each element's count until either the maximum sum or that element's maximum count is reached.
It returns a list, but as @John Coleman also showed, it is easy to add the keys later.
From my tests it appears to run in about half the time of your current implementation.
def all_combis(elements, r=None):
    if r is None:
        r = sum(elements)
    if r == 0:
        yield [0] * len(elements)
        return
    if not elements:
        yield []
        return
    elements = list(elements)
    element = elements.pop(0)
    for i in range(min(element + 1, r + 1)):
        for combi in all_combis(elements, r - i):
            yield [i] + combi

example = {1e-8: 3, "k": 2}
list(all_combis([val for val in example.values()]))
Output:
[[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2], [2, 0], [2, 1], [2, 2], [3, 0], [3, 1], [3, 2]]
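To get from these count lists back to the format asked for in the question, the keys can be zipped back on afterwards. This is a small sketch that takes the output above as given. The helper with_keys is a hypothetical name, and sorting by size is only one way to approximate the preferred ordering.

```python
example = {1e-8: 3, "k": 2}

# Count lists as shown in the output above for this example.
count_lists = [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2],
               [2, 0], [2, 1], [2, 2], [3, 0], [3, 1], [3, 2]]


def with_keys(keys, counts):
    """Pair each count with its key, dropping entries whose count is 0."""
    return {k: n for k, n in zip(keys, counts) if n > 0}


# any(counts) filters out the all-zero list before the keys are attached.
combis = [with_keys(example, counts) for counts in count_lists if any(counts)]

# A stable sort by total size groups the combinations by 1, 2, ... elements.
combis.sort(key=lambda d: sum(d.values()))
print(combis)
```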

How to separate upper and lower case with counter?

I am thinking of something with collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
import collections
c = collections.Counter(s)
As a result I have
Counter({' ': 8,
',': 1,
'.': 1,
'?': 1,
'H': 1,
'M': 1,
'R': 1,
'T': 1,
'a': 2,
'd': 1,
'e': 5,
'f': 1,
'g': 1,
'h': 2,
'i': 2,
'l': 2,
'n': 1,
'o': 4,
'r': 3,
's': 3,
't': 1,
'u': 2,
'w': 1,
'y': 2})
If I try sum, I get a syntax error:
print sum(1 for i in c if i.isupper())
File "<ipython-input-21-66a7538534ee>", line 4
print sum(1 for i in c if i.isupper())
^
SyntaxError: invalid syntax
How should I count only upper or lower from the counter?

You lack the () in your generator expression:
sum((1 for x in c if x.isupper()))
4
EDIT: As @Błotosmętek suggests, you lack the () in your print; I guess you are using Python 3, so you should use print().
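Note that iterating over a Counter yields its keys, so the sum above counts distinct uppercase letters rather than their occurrences. A minimal sketch of both counts, assuming Python 3 (the variable names are only illustrative):

```python
from collections import Counter

s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = Counter(s)

# Iterating over c yields keys only: number of distinct uppercase letters
distinct_upper = sum(1 for ch in c if ch.isupper())

# Summing the stored counts gives the number of occurrences
total_upper = sum(n for ch, n in c.items() if ch.isupper())
total_lower = sum(n for ch, n in c.items() if ch.islower())

print(distinct_upper, total_upper, total_lower)
```

For this sentence the two uppercase figures happen to coincide because each uppercase letter appears only once.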

You can try something like this:
import collections
s = 'Hello Mr. Rogers, how are you this fine Tuesday?'
c = collections.Counter([ch for ch in s if ch.isupper()])
# Change to ch.islower() if you need lower case
# c = collections.Counter([ch for ch in s if ch.islower()])
print(c)

Counting total number of letters in a string

a = "All men are created equal under the power of the constitution, Thomas Jefferson"
I know a.count('A') will return how many "A"s there are. But I want to count how many A's, e's, c's and T's there are and add them together. Help much appreciated.
I'm using Python 3.

Look into collections.Counter:
>>> from collections import Counter
>>> import string
>>> c = Counter(l for l in a if l in string.ascii_letters)
>>> c
Counter({'e': 11, 't': 6, 'o': 6, 'r': 5, 'n': 5, 'a': 4, 'l': 3, 'f': 3,
's': 3, 'u': 3, 'h': 3, 'i': 2, 'd': 2, 'c': 2, 'm': 2, 'A': 1,
'p': 1, 'w': 1, 'T': 1, 'J': 1, 'q': 1})
>>> sum(c.values())
66
>>> c = Counter(l for l in a if l in 'AecT')
>>> c
Counter({'e': 11, 'c': 2, 'A': 1, 'T': 1})
>>> sum(c.values())
15

Python has a great module for this. Use Counter from collections
from collections import Counter
a = "All men are created equal under the power of the constitution, Thomas Jefferson"
counter = Counter(a)
print(counter)
It will output a dictionary-like Counter with every character (including spaces and punctuation) as keys and their occurrence counts as values.

You could use regular expressions to find the total number of letters easily:
import re
p = re.compile(r"\w")  # note: \w also matches digits and underscores
a = "All men are created equal under the power of the constitution, Thomas Jefferson"
numberOfLetters = len(p.findall(a))
This returns 66.
If you just want A, e, c, and T, you should use this regex instead:
p = re.compile("[AecT]")  # no | inside a character class; it would match a literal |
This returns 15.

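I just tried another approach:
list(map(lambda x: [x, a.count(x)], 'AecT'))
Here a is the input string, and 'AecT' can be replaced with whichever letters you need. This gives the count for each letter; list() is needed in Python 3 because map returns an iterator.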
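Since the question asks for the counts added together, the per-letter results of str.count can simply be summed. A short sketch, assuming Python 3:

```python
a = "All men are created equal under the power of the constitution, Thomas Jefferson"
wanted = 'AecT'

# Count each wanted letter separately, then add the counts together
per_letter = {ch: a.count(ch) for ch in wanted}
total = sum(per_letter.values())

print(per_letter, total)
```

This scans the string once per wanted letter, which is fine for a handful of letters. For many letters a single pass with collections.Counter is likely the better fit.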

Count every word in a text file python

What I want is to be able to feed in a multiline text file, about a paragraph long, and get back something like:
{'Total words': 'NUMBER', 'Words ending with LY': 'NUMBER'}
I have never used Counter before, but I believe that is how I would do it. So I want it to count every word, and if the word ends in LY, add it to the second count. Since I have never used Counter, I don't know where to go...
with open('SOMETHING.txt') as f:
    # something to do with counter here?
EDIT: I have to do it without using Counter! How would I achieve the same result without the Counter library?

This should work for you...
def parse_file():
    with open('SOMETHING.txt', 'r') as f:
        c1 = 0
        c2 = 0
        for i in f:
            w = i.split()
            c1 += len(w)
            for j in w:
                if j.endswith('LY'):
                    c2 += 1
    return {'Total words': c1, 'Words ending with LY': c2}
I would recommend, however, that you have a look at a few Python basics.
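The function above matches only a literal uppercase 'LY' and treats trailing punctuation as part of the word. If case and punctuation should be ignored (an assumption, since the question does not say), a variant could look like the sketch below. The function name count_words and the sample lines are hypothetical, and it takes any iterable of lines so it works with an open file as well.

```python
import string


def count_words(lines):
    """Count all words and the words ending in 'ly' (any case)."""
    total = 0
    ly = 0
    for line in lines:
        for word in line.split():
            total += 1
            # Strip surrounding punctuation and ignore case before testing
            if word.strip(string.punctuation).lower().endswith('ly'):
                ly += 1
    return {'Total words': total, 'Words ending with LY': ly}


sample = ["She quickly ran home.", "It was EARLY, and oddly quiet."]
result = count_words(sample)
print(result)
```

With a real file, the call would be count_words(f) inside a with open(...) as f: block.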

Is this hard to try?
from collections import defaultdict
result = defaultdict(int)
result_second = defaultdict(int)
for word in open('text.txt').read().split():
    result[word] += 1
    if word.endswith('LY'):
        result_second[word] +=1
print result,result_second
Output:
defaultdict(<type 'int'>, {'and': 1, 'Considering': 1, 'have': 2, "don't": 1, 'is': 1, 'it': 2, 'second': 1, 'want': 1, 'in': 1, 'before': 1, 'would': 1, 'to': 3, 'count.': 1, 'go...': 1, 'how': 1, 'add': 1, 'if': 1, 'LY': 1, 'it.': 1, 'do': 1, 'ends': 1, 'used': 2, 'that': 1, 'I': 1, 'Counter': 2, 'but': 1, 'So': 1, 'know': 1, 'never': 2, 'believe': 1, 'count': 1, 'word': 2, 'i': 5, 'every': 1, 'the': 2, 'where': 1})

Use collections.Counter():
import collections
with open('your_file.txt') as fp:
    text = fp.read()
    counter = collections.Counter(['ends_in_ly' if token.endswith('LY') else 'doesnt_end_in_ly' for token in text.split()])
Without Counter:
with open('file.txt') as fp:
    tokens = fp.read().split()
    c = sum([1 if token.endswith('LY') else 0 for token in tokens])
# A bare return is only valid inside a function, so print the result instead
print({'ending_in_ly': c, 'not_ending_in_ly': len(tokens) - c})
