I have a dataframe in which each row shows one transaction and the items within that transaction. Here is what my dataframe looks like:
itemList
A,B,C
B,F
G,A
...
I want to find the frequency of each item (how many times it appeared across all transactions). I have defined a dictionary and try to update its values as shown below:
counts = {}  # renamed to avoid shadowing the built-in dict
def update(itemList):
    # Update the count of each item in the dict
    for item in itemList.split(','):
        counts[item] = counts.get(item, 0) + 1
df.itemList.apply(lambda x: update(x))
As the apply function gets executed for multiple rows at the same time, multiple rows try to update the values in the dict at the same time, and it's causing an issue. How can I make sure that multiple updates to the dict do not cause any issue?
I think you only need Series.str.get_dummies:
df['itemList'].str.get_dummies(',').sum().to_dict()
#{'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1}
If there are more columns, use:
df.stack().str.get_dummies(',').sum().to_dict()
If you want to count per row:
df['itemList'].str.get_dummies(',').to_dict('index')
#{0: {'A': 1, 'B': 1, 'C': 1, 'F': 0, 'G': 0},
# 1: {'A': 0, 'B': 1, 'C': 0, 'F': 1, 'G': 0},
# 2: {'A': 1, 'B': 0, 'C': 0, 'F': 0, 'G': 1}}
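To see why this works: Series.str.get_dummies(',') splits each row on the separator and builds an indicator frame, which sum then collapses into per-item totals. A minimal sketch of the intermediate frame:

import pandas as pd

df = pd.DataFrame({'itemList': ['A,B,C', 'B,F', 'G,A']})
print(df['itemList'].str.get_dummies(','))
#    A  B  C  F  G
# 0  1  1  1  0  0
# 1  0  1  0  1  0
# 2  1  0  0  0  1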
As @Quang Hoang said in the comments, apply simply applies the function to each row/column using a loop.
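A quick way to convince yourself of that, as a sketch: record the order in which apply visits the rows via a side effect.

import pandas as pd

df = pd.DataFrame({'itemList': ['A,B,C', 'B,F', 'G,A']})
seen = []
df['itemList'].apply(seen.append)  # the side effect records the visit order
print(seen)  # ['A,B,C', 'B,F', 'G,A'] -- one row at a time, in order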
You might be better off relying on native Python here:
import pandas as pd

df = pd.DataFrame({'itemlist':['a,b,c', 'b,f', 'g,a', 'd,g,f,d,s,a,v', 'e,w,d,f,g,h', 's,d,f,e,r,t', 'e,d,f,g,r,r','s,d,f']})
Here is a solution using Counter (stripping the commas and counting characters works here because every item name is a single character):
from collections import Counter

df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
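If the item names can be longer than one character, a hedged variant of the same idea splits on commas instead of stripping them:

from collections import Counter

# Counter over the split lists handles multi-character item names
df['itemlist'].str.split(',').apply(Counter).sum()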
Some comparisons:
%timeit df['itemlist'].str.split(',', expand = True).stack().value_counts().to_dict()
2.64 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['itemlist'].str.get_dummies(',').sum().to_dict()
3.22 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
from collections import Counter
%timeit df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
778 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
What is the fastest way to count the number of two-letter pairs in a string (e.g. AA, AB, AC, ...)? Is it possible to use numpy to speed up this computation?
I am using a list comprehension with str.count(), but this is quite slow.
import itertools
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
print(pairs[:10])
print(len(pairs))
['AA', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AK', 'AL']
400
%timeit counts = np.array([seq.count(pair) for pair in pairs])
231 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
print(counts[:10])
[0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
If you don't mind getting the counts in a dictionary, the Counter class from collections would process 2-3 times faster:
from collections import Counter
chars = set('ACDEFGHIKLMNPQRSTVWY')
counts = Counter( a+b for a,b in zip(seq,seq[1:]) if a in chars and b in chars)
print(counts)
Counter({'RS': 4, 'VV': 4, 'SI': 4, 'MR': 3, 'SG': 3, 'LL': 3, 'LS': 3,
'PL': 3, 'IE': 3, 'DI': 3, 'IA': 3, 'AN': 3, 'VK': 3, 'KE': 3,
'EV': 3, 'TS': 3, 'NL': 2, 'LA': 2, 'IP': 2, 'AR': 2, 'SK': 2,
...
This approach will properly count sequences of the same character repeated 3 or more times (e.g. "WWW" will count as 2 occurrences of "WW", whereas seq.count() or re.findall() would only count 1).
Keep in mind that the Counter dictionary will return zero for counts['LC'], but counts.items() will not contain 'LC' or any other pair not actually in the string.
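A two-line check of the overlap behavior, as a sketch:

from collections import Counter

seq = 'WWW'
print(seq.count('WW'))  # 1 -- str.count finds non-overlapping occurrences
print(Counter(a + b for a, b in zip(seq, seq[1:]))['WW'])  # 2 -- the zip approach sees both overlapping pairs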
If needed you could get the counts for all theoretical pairs in a second step:
from itertools import product
chars = 'ACDEFGHIKLMNPQRSTVWY'
print([counts[a+b] for a,b in product(chars,chars)][:10])
[1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
There is a numpy function, np.char.count(), but it appears to be much slower than str.count():
%timeit counts = np.array([np.char.count(seq, pair) for pair in pairs])
1.79 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since speed is of utmost importance, below is a comparison of different methods:
import numpy as np
import itertools
from collections import Counter
seq = 'MRNLAIIPARSGSKGLKDKNIKLLSGKPLLAYTIEAARESGLFGEIMVSTDSQEYAD'\
'IAKQWGANVPFLRSNELSNDTASSWDVVKEVIEGYKNLGTEFDTVVLLQPTSPLRTS'\
'IEGYKIMKEKDANFVVGVCEMDHSPLWANTLPEDLSMENFIRPEVVKMPRQSIPTYY'\
'RINGALYIVKVDYLMRTSDIYGERSIASVMRKENSIDIDNQMDFTIAEVLISERSKK'
chars = list('ACDEFGHIKLMNPQRSTVWY')
pairs = [''.join(pair) for pair in itertools.product(chars, chars)]
def countpairs1():
    return np.array([seq.count(pair) for pair in pairs])
%timeit counts = countpairs1()
144 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs2():
    counted = Counter(a+b for a,b in zip(seq,seq[1:]))
    return np.array([counted[pair] for pair in pairs])
%timeit counts = countpairs2()
102 µs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
def countpairs3():
    return np.array([np.char.count(seq, pair) for pair in pairs])
%timeit counts = countpairs3()
1.65 ms ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Obviously, the best/fastest method is Counter.
There is a random 1D array m_0:
np.array([0, 1, 2])
I need to generate two 1D arrays:
np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
Is there a faster way to do it than this one:
import numpy as np
import time
N = 3
m_0 = np.arange(N)
t = time.time()
m_1 = np.tile(m_0, N)
m_2 = np.repeat(m_0, N)
t = time.time() - t
The size of m_0 is 10**3.
You could use itertools.product to form the Cartesian product of m_0 with itself, then take the result apart again to get your two arrays.
import numpy as np
from itertools import product
N = 3
m_0 = np.arange(N)
m_2, m_1 = map(np.array, zip(*product(m_0, m_0)))
# m_1 is now array([0, 1, 2, 0, 1, 2, 0, 1, 2])
# m_2 is now array([0, 0, 0, 1, 1, 1, 2, 2, 2])
However, for large N this is probably quite a bit less performant than your solution, as it probably can't use many of NumPy's SIMD optimizations.
For alternatives and comparisons, you'll probably want to look at the answers to Cartesian product of x and y array points into single array of 2D points.
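One such alternative that stays inside NumPy, a sketch using np.meshgrid:

import numpy as np

N = 3
m_0 = np.arange(N)
X, Y = np.meshgrid(m_0, m_0)
m_1 = X.ravel()  # array([0, 1, 2, 0, 1, 2, 0, 1, 2]) -- like np.tile(m_0, N)
m_2 = Y.ravel()  # array([0, 0, 0, 1, 1, 1, 2, 2, 2]) -- like np.repeat(m_0, N)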
I guess you could try reshape:
>>> np.reshape([m_0]*3, (-1,), order='C')
array([0, 1, 2, 0, 1, 2, 0, 1, 2])
>>> np.reshape([m_0]*3, (-1,), order='F')
array([0, 0, 0, 1, 1, 1, 2, 2, 2])
Should be a tiny bit faster for larger arrays.
>>> m_0 = np.random.randint(0, 10**3, size=(10**3,))
>>> %timeit np.tile([m_0]*10**3, N)
5.85 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit np.reshape([m_0]*10**3, (-1,), order='C')
1.94 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can slightly improve speed if you reuse your first variable to create the second.
N=1000
%timeit t = np.arange(N); a = np.tile(t, N); b = np.repeat(t, N)
%timeit t = np.arange(N); a = np.tile(t, N); b = np.reshape(a.reshape((N,N)),-1,'F')
7.55 ms ± 46.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
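The reshape trick works because reading the tiled array column by column (Fortran order) yields exactly the repeated array; a small illustration with N = 3:

import numpy as np

N = 3
t = np.arange(N)
a = np.tile(t, N)                        # [0 1 2 0 1 2 0 1 2]
print(a.reshape(N, N))                   # each row is a copy of t
print(a.reshape(N, N).ravel(order='F'))  # [0 0 0 1 1 1 2 2 2] == np.repeat(t, N)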
If you insist on speeding it up further, you can specify the dtype of your array.
%timeit t = np.arange(N,dtype=np.uint16); a = np.tile(t, N); b = np.repeat(t, N)
%timeit t = np.arange(N,dtype=np.uint16); a = np.tile(t, N); b = np.reshape(a.reshape((N,N)),-1,'F')
6.03 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.2 ms ± 37.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Be sure to keep the data type limit in mind.
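For example, uint16 tops out at 65535, so anything larger wraps around (modulo 2**16) on conversion:

import numpy as np

print(np.array([70000]).astype(np.uint16))  # [4464] == 70000 % 65536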
I want to repeat a string len(non_current_assets) times in an array. So I tried:
["", "totalAssets", "total_non_current_assets" * len(non_current_assets), "totalAssets"]
But it returns:
['',
'totalAssets',
'total_non_current_assetstotal_non_current_assetstotal_non_current_assetstotal_non_current_assetstotal_non_current_assets',
'totalAssets']
Place your str inside a list, multiply, then unpack (using the * operator); that is:
non_current_assets = (1, 2, 3, 4, 5) # so len(non_current_assets) == 5, might be anything as long as supports len
lst = ["", "totalAssets", *["total_non_current_assets"] * len(non_current_assets), "totalAssets"]
print(lst)
Output:
['', 'totalAssets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'total_non_current_assets', 'totalAssets']
(tested in Python 3.7)
This should work:
string_to_be_repeated = ["total_non_current_assets"]
needed_list = string_to_be_repeated * 3
list_to_appended = ["","totalAssets"]
list_to_appended.extend(needed_list)
print(list_to_appended)
You want to use a loop:
for x in range(len(non_current_assets)):
    YOUR_ARRAY.append("total_non_current_assets")
You can use itertools.repeat together with the unpacking operator *:
import itertools as it
["", "totalAssets",
*it.repeat("total_non_current_assets", len(non_current_assets)),
"totalAssets"]
It makes the intent pretty clear and saves the creation of a temporary list (hence better performance).
In [1]: import itertools as it
In [2]: %timeit [0, 1, *[3]*1000, 4, 5]
6.51 µs ± 8.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [3]: %timeit [0, 1, *it.repeat(3, 1000), 4, 5]
4.94 µs ± 73.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I have a dataframe with a column which has dictionaries in it. I want to count the number of occurrences of a dictionary key in the whole column.
One way of doing it is below:
import pandas as pd
from collections import Counter
df = pd.DataFrame({"data": [{"weight": 3, "color": "blue"},
{"size": 5, "weight": 2},{"size": 3, "color": "red"}]})
c = Counter()
for index, row in df.iterrows():
    for item in list(row["data"].keys()):
        c[item] += 1
print(c)
Which gives
Counter({'weight': 2, 'color': 2, 'size': 2})
Are there faster ways of doing it?
A much faster approach would be to flatten the column with itertools.chain and build a Counter from the result (which will only contain the dictionary keys):
from itertools import chain
Counter(chain.from_iterable(df.data.values.tolist()))
# Counter({'weight': 2, 'color': 2, 'size': 2})
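This works because iterating over a dict yields its keys, so chaining the dicts flattens the column straight into a stream of keys; a minimal sketch:

from itertools import chain

dicts = [{"weight": 3, "color": "blue"}, {"size": 5, "weight": 2}]
print(list(chain.from_iterable(dicts)))
# ['weight', 'color', 'size', 'weight'] -- each dict contributes its keys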
Timings:
def OP(df):
    c = Counter()
    for index, row in df.iterrows():
        for item in list(row["data"].keys()):
            c[item] += 1
%timeit OP(df)
# 570 µs ± 49.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit Counter(chain.from_iterable(df.data.values.tolist()))
# 14.2 µs ± 902 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
We can use:
pd.DataFrame(df.data.tolist()).notnull().sum().to_dict()
Out[653]: {'color': 2, 'size': 2, 'weight': 2}
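This works because building a DataFrame from the list of dicts aligns the keys as columns, filling NaN where a row lacks a key; notnull().sum() then counts the non-missing cells per column. A small illustration (output sketched):

import pandas as pd

data = [{"weight": 3, "color": "blue"}, {"size": 5, "weight": 2}, {"size": 3, "color": "red"}]
print(pd.DataFrame(data))
#    weight color  size
# 0     3.0  blue   NaN
# 1     2.0   NaN   5.0
# 2     NaN   red   3.0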
First, creating an empty Counter and filling it by hand is, in my opinion, pretty pointless: Counter can do the counting for you if you supply it an iterable. That is its main purpose, I would say.
I would do:
from functools import reduce
c = reduce(lambda x, y : x+y, [Counter(x.keys()) for x in df['data']])
and c is:
Counter({'color': 2, 'size': 2, 'weight': 2})
To explain what the line above does: first, it creates a list of Counter objects using a list comprehension, iterating over the column and building a Counter from the keys of each dictionary.
Then, using reduce, those counters are summed; Counter supports addition.
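For instance, Counter addition sums counts key by key, which is exactly what reduce folds over:

from collections import Counter

print(Counter({'a': 1, 'b': 1}) + Counter({'b': 1, 'c': 1}))
# Counter({'b': 2, 'a': 1, 'c': 1})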
On my machine this approach, using the provided input, is roughly 4 times faster than OP method.
More concisely, with pandas features:
In [171]: df['data'].apply(pd.Series).count(axis=0).to_dict()
Out[171]: {'weight': 2, 'color': 2, 'size': 2}
I have a string with 50-ish elements. I need to randomize it and generate a much longer string. I found that random.sample() only picks unique elements, which is great but does not fit my purpose. Is there a way to allow repetitions in Python, or do I need to manually build a cycle?
You can use numpy.random.choice. It has an argument to specify how many samples you want, and an argument to specify whether you want replacement. Something like the following should work.
import numpy as np
choices = np.random.choice([1, 2, 3], size=10, replace=True)
# array([2, 1, 2, 3, 3, 1, 2, 2, 3, 2])
If your input is a string, say something like my_string = 'abc', you can use:
choices = np.random.choice([char for char in my_string], size=10, replace=True)
# array(['c', 'b', 'b', 'c', 'b', 'a', 'a', 'a', 'c', 'c'], dtype='<U1')
Then get a new string out of it with:
new_string = ''.join(choices)
# 'cbbcbaaacc'
Performance
Timing the three answers so far and random.choices from the comments (skipping the ''.join part since we all used it) producing 1000 samples from the string 'abc', we get:
numpy.random.choice([char for char in 'abc'], size=1000, replace=True):
34.1 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
random.choices('abc', k=1000)
269 µs ± 4.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[random.choice('abc') for _ in range(1000)]:
924 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[random.sample('abc',1)[0] for _ in range(1000)]:
4.32 ms ± 67.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy is fastest by far. If you put the ''.join parts in there, you actually see numpy and random.choices neck and neck, with both being three times faster than the next fastest for this example.
You could do something like this:
import random
chars = 'abcdef'  # renamed to avoid shadowing the built-in dict
''.join([random.choice(chars) for x in range(50)])
Not saying this is the most efficient (you should probably use choice here), but consider it:
import random
a = ['a','b','c']
' '.join([random.sample(a,1)[0] for _ in range(6)])
I have found this (I forgot to mention I was on Python 3.6):
DICTIONARY_NUMBERS_HEX = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F']
block_text = "".join(random.choices(DICTIONARY_NUMBERS_HEX, k=50))
Using the k=50 keyword argument will generate repeated elements.
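For contrast, a minimal sketch of both sampling modes:

import random

pool = 'abcdef'
with_repeats = random.choices(pool, k=10)  # with replacement: repeats allowed, k may exceed len(pool)
no_repeats = random.sample(pool, k=4)      # without replacement: k must not exceed len(pool)
print(''.join(with_repeats), ''.join(no_repeats))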