I make bins out of my column using pandas' pd.qcut(). I would then like to smooth the values by replacing each one with the mean of its bin.
I generate my bins with something like
pd.qcut(col, 3)
For example,
Given the column values [4, 8, 15, 21, 21, 24, 25, 28, 34]
and the generated bins
Bin1 [4, 15]: 4, 8, 15
Bin2 [21, 24]: 21, 21, 24
Bin3 [25, 34]: 25, 28, 34
I would like to replace the values with the following means
Mean of Bin1 (4, 8, 15) = 9
Mean of Bin2 (21, 21, 24) = 22
Mean of Bin3 (25, 28, 34) = 29
Therefore:
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
making the final dataset: [9, 9, 9, 22, 22, 22, 29, 29, 29]
How can one also add a column with closest bin boundaries?
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
making the final dataset: [4, 4, 15, 21, 21, 24, 25, 25, 34]
This is very similar to this question, which is for R.
It's exactly as you laid out. Using this technique to get the nearest boundary:
import pandas as pd
df = pd.DataFrame({"col": [4, 8, 15, 21, 21, 24, 25, 28, 34]})
df2 = df.assign(bin=pd.qcut(df.col, 3),
                # mean of "col" within each bin, broadcast back to each row
                colbmean=lambda dfa: dfa.groupby("bin")["col"].transform("mean"),
                # interval edge (left or right) closest to the value
                colbin=lambda dfa: dfa.apply(lambda r: min([r.bin.left, r.bin.right], key=lambda x: abs(x - r.col)), axis=1))
   col             bin  colbmean  colbin
0    4   (3.999, 19.0]         9   3.999
1    8   (3.999, 19.0]         9   3.999
2   15   (3.999, 19.0]         9      19
3   21  (19.0, 24.333]        22      19
4   21  (19.0, 24.333]        22      19
5   24  (19.0, 24.333]        22  24.333
6   25  (24.333, 34.0]        29  24.333
7   28  (24.333, 34.0]        29  24.333
8   34  (24.333, 34.0]        29      34
Below is the solution I came up with for your problem.
One limitation remains: pandas.qcut returns half-open intervals whose edges are quantiles rather than observed values, so the results are not exactly the ones you described.
import pandas as pd
df = pd.DataFrame({'value': [4, 8, 15, 21, 21, 24, 25, 28, 34]})
df['bin'] = pd.qcut(df['value'], 3)
df = df.join(df.groupby('bin')['value'].mean(), on='bin', rsuffix='_average_in_bin')
df['bin_left'] = df['bin'].apply(lambda x: x.left)
df['bin_right'] = df['bin'].apply(lambda x: x.right)
df['nearest_boundary'] = df.apply(lambda x: x['bin_left'] if abs(x['value'] - x['bin_left']) < abs(x['value'] - x['bin_right']) else x['bin_right'], axis=1)
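The interval edges returned by pd.qcut (3.999, 19.0, 24.333, ...) are not values from the column, which is why the boundary column above differs from the one in the question. A minimal sketch of one way around that, assuming the "boundaries" wanted are the smallest and largest value actually present in each bin; the names smoothed_mean, smoothed_boundary, bin_mean and the tie-breaking rule are choices made for this illustration only:

```python
import numpy as np
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34], name="value")
bins = pd.qcut(values, 3)  # same binning as in the question

# Group the values by the bin each one falls into.
grouped = values.groupby(bins, observed=True)

# Smoothing by bin means: every value becomes the mean of its bin.
smoothed_mean = grouped.transform("mean")

# Smoothing by bin boundaries: use the smallest and largest value
# actually present in each bin instead of the interval edges.
lowest = grouped.transform("min")
highest = grouped.transform("max")
# Ties go to the lower boundary (an arbitrary choice for this sketch).
smoothed_boundary = np.where(values - lowest <= highest - values, lowest, highest)

result = pd.DataFrame({
    "value": values,
    "bin_mean": smoothed_mean,
    "nearest_boundary": smoothed_boundary,
})
print(result)
```

Because transform keeps the original index, both smoothed columns line up with the input row by row, and no row-wise apply is needed.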
I want to print numbers 1-100 in this format
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
etc.
But I can't find a way to do this efficiently.
There are more efficient ways of doing this, but since it looks like you are just starting out with Python, try using for loops to iterate through each row of 10.
for i in range(10):
    for j in range(1, 11):
        print(i * 10 + j, end=" ")
    print()
You can try this
Code:
lst = list(range(1, 11))
numrows = 5  # Decide number of rows you want to print
for i in range(numrows):
    print([x + (i * 10) for x in lst])
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
[41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
I would check out how mods work in Python. Such as: How to calculate a mod b in python?
for i in range(1, 101):
    print(i, end=" ")
    if i % 10 == 0:
        print()  # end the row; print("\n") would add an extra blank line
[print(i, end=" ") if i % 10 else print(i) for i in range(1, 101)]
This list comprehension is used only for its side effect (printing); the resulting list itself isn't used.
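Another option is to build each row as a string first and print once at the end, which also avoids the trailing space the loops above leave on every line. A small sketch; the function name format_grid and its parameters are made up for this example:

```python
def format_grid(start=1, stop=100, per_row=10):
    """Return the numbers start..stop as text, per_row numbers on each line."""
    rows = []
    for first in range(start, stop + 1, per_row):
        # The last row may be shorter if the range is not a multiple of per_row.
        last = min(first + per_row - 1, stop)
        rows.append(" ".join(str(n) for n in range(first, last + 1)))
    return "\n".join(rows)

print(format_grid())
```

Returning the text instead of printing inside the function makes it easy to reuse the same layout for a file or a log message.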
For example, if a question/answer you encounter posts an array like this:
[[ 0 1 2 3 4 5 6 7]
[ 8 9 10 11 12 13 14 15]
[16 17 18 19 20 21 22 23]
[24 25 26 27 28 29 30 31]
[32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47]
[48 49 50 51 52 53 54 55]
[56 57 58 59 60 61 62 63]]
How would you load it into a variable in a REPL session without having to add commas everywhere?
For a one-time occasion, I might do this:
Copy the text containing the array to the clipboard.
In an ipython shell, enter s = """, but do not hit return.
Paste the text from the clipboard.
Type the closing triple quote.
That gives me:
In [16]: s = """[[ 0 1 2 3 4 5 6 7]
...: [ 8 9 10 11 12 13 14 15]
...: [16 17 18 19 20 21 22 23]
...: [24 25 26 27 28 29 30 31]
...: [32 33 34 35 36 37 38 39]
...: [40 41 42 43 44 45 46 47]
...: [48 49 50 51 52 53 54 55]
...: [56 57 58 59 60 61 62 63]]"""
Then use np.loadtxt() as follows:
In [17]: a = np.loadtxt([line.lstrip(' [').rstrip(']') for line in s.splitlines()], dtype=int)
In [18]: a
Out[18]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63]])
If you have Pandas, pyperclip or something else to read from the clipboard you could use something like this:
from pandas.io.clipboard import clipboard_get
# import pyperclip
import numpy as np
import re
import ast
def numpy_from_clipboard():
    inp = clipboard_get()
    # inp = pyperclip.paste()
    inp = inp.strip()
    # if it starts with "array(" we just need to remove the
    # leading "array(" and remove the optional ", dtype=xxx)"
    if inp.startswith('array('):
        inp = re.sub(r'^array\(', '', inp)
        dtype = re.search(r', dtype=(\w+)\)$', inp)
        if dtype:
            return np.array(ast.literal_eval(inp[:dtype.start()]), dtype=dtype.group(1))
        else:
            return np.array(ast.literal_eval(inp[:-1]))
    else:
        # In case it's the string representation it's a bit harder.
        # We need to remove all spaces between closing and opening brackets
        inp = re.sub(r'\]\s+\[', '],[', inp)
        # We need to remove all whitespaces following an opening bracket
        inp = re.sub(r'\[\s+', '[', inp)
        # and all leading whitespaces before closing brackets
        inp = re.sub(r'\s+\]', ']', inp)
        # replace all remaining whitespaces with ","
        inp = re.sub(r'\s+', ',', inp)
        return np.array(ast.literal_eval(inp))
And then read what you saved in the clipboard:
>>> numpy_from_clipboard()
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63]])
This should be able to parse (most) arrays (str as well as repr of arrays) from your clipboard. It should even work for multi-line arrays (where np.loadtxt fails):
[[ 0.34866207 0.38494993 0.7053722 0.64586156 0.27607369 0.34850162
0.20530567 0.46583039 0.52982216 0.92062115]
[ 0.06973858 0.13249867 0.52419149 0.94707951 0.868956 0.72904737
0.51666421 0.95239542 0.98487436 0.40597835]
[ 0.66246734 0.85333546 0.072423 0.76936201 0.40067016 0.83163118
0.45404714 0.0151064 0.14140024 0.12029861]
[ 0.2189936 0.36662076 0.90078913 0.39249484 0.82844509 0.63609079
0.18102383 0.05339892 0.3243505 0.64685352]
[ 0.803504 0.57531309 0.0372428 0.8308381 0.89134864 0.39525473
0.84138386 0.32848746 0.76247531 0.99299639]]
>>> numpy_from_clipboard()
array([[ 0.34866207, 0.38494993, 0.7053722 , 0.64586156, 0.27607369,
0.34850162, 0.20530567, 0.46583039, 0.52982216, 0.92062115],
[ 0.06973858, 0.13249867, 0.52419149, 0.94707951, 0.868956 ,
0.72904737, 0.51666421, 0.95239542, 0.98487436, 0.40597835],
[ 0.66246734, 0.85333546, 0.072423 , 0.76936201, 0.40067016,
0.83163118, 0.45404714, 0.0151064 , 0.14140024, 0.12029861],
[ 0.2189936 , 0.36662076, 0.90078913, 0.39249484, 0.82844509,
0.63609079, 0.18102383, 0.05339892, 0.3243505 , 0.64685352],
[ 0.803504 , 0.57531309, 0.0372428 , 0.8308381 , 0.89134864,
0.39525473, 0.84138386, 0.32848746, 0.76247531, 0.99299639]])
However, I'm not too good with regexes, so this probably isn't foolproof, and using ast.literal_eval feels a bit awkward (but it avoids doing the parsing yourself).
Feel free to suggest improvements.
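One possible simplification for the str() form, offered as a sketch rather than a replacement: if the array is known to be 2-D and printed in full (no "..." ellipsis), the brackets can be dropped and the row count recovered from the number of closing brackets, so no regex or literal_eval is needed. The function name parse_2d_array_str and its convert parameter are invented for this example:

```python
import numpy as np

def parse_2d_array_str(text, convert=float):
    """Parse the str() form of a 2-D NumPy array, e.g. '[[0 1]\n [2 3]]'."""
    # Each row contributes one "]" and the outer list one more.
    n_rows = text.count("]") - 1
    # Drop the brackets; whitespace (including line breaks) separates numbers.
    tokens = text.replace("[", " ").replace("]", " ").split()
    flat = np.array([convert(tok) for tok in tokens])
    return flat.reshape(n_rows, -1)

# Rows wrapped over several lines are handled, since only "]" ends a row.
s = """[[ 0.1  0.2  0.3
   0.4]
 [ 0.5  0.6  0.7
   0.8]]"""
a = parse_2d_array_str(s)
print(a.shape)  # (2, 4)
```

Pass convert=int for integer arrays. Arrays of other dimensionality, or the repr() form with commas, would still need the clipboard function above.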
I have a pandas DataFrame in which one column, say column1, contains a series of numbers. Each of these numbers is the result of cell mutation: the cell numbered n splits into two cells numbered 2*n and 2*n+1. I want to search this column for all rows that correspond to daughters of a specific number k, that is, the rows whose column1 value is one of {2*k, 2*k+1, 2*(2*k), 2*(2*k+1), ... }. I don't want to use a tree structure. How can I approach this? Thanks.
The two sequences look like the numbers whose binary expansion starts with 10 and the numbers whose binary expansion starts with 11.
Both sequences can be found directly:
import math

def f(n=2):
    while True:
        yield int(n + 2**math.floor(math.log(n, 2)))
        n += 1

def g(n=2):
    while True:
        yield int(n + 2 * 2**math.floor(math.log(n, 2)))
        n += 1

a, b = f(), g()
print([next(a) for i in range(15)])
print([next(b) for i in range(15)])
>>> [4, 5, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23, 32]
>>> [6, 7, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31, 48]
EDIT:
For an arbitrary starting point, you can do the following, which I think meets your criteria.
import queue

def f(k):
    # breadth-first walk: each cell p enqueues its daughters 2*p and 2*p+1
    q = queue.Queue()
    q.put(k)
    while not q.empty():
        p = q.get()
        a, b = 2*p, 2*p+1
        q.put(a)
        q.put(b)
        yield a
        yield b

a = f(4)
print([next(a) for i in range(16)])
>>> [8, 9, 16, 17, 18, 19, 32, 33, 34, 35, 36, 37, 38, 39, 64, 65] # ...
a = f(5)
print([next(a) for i in range(16)])
>>> [10, 11, 20, 21, 22, 23, 40, 41, 42, 43, 44, 45, 46, 47, 80, 81] # ...
Checking those sequences against OEIS:
f(2) - Starting 10 - A004754
f(3) - Starting 11 - A004755
f(4) - Starting 100 - A004756
f(5) - Starting 101 - A004757
f(6) - Starting 110 - A004758
f(7) - Starting 111 - A004759
...
Which means you can simply do:
import math

def f(k, n=2):
    while True:
        yield int(n + (k-1) * 2**math.floor(math.log(n, 2)))
        n += 1

for i in range(2, 8):
    a = f(i)
    print(i, [next(a) for j in range(15)])
>>> 2 [4, 5, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23, 32]
>>> 3 [6, 7, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31, 48]
>>> 4 [8, 9, 16, 17, 18, 19, 32, 33, 34, 35, 36, 37, 38, 39, 64]
>>> 5 [10, 11, 20, 21, 22, 23, 40, 41, 42, 43, 44, 45, 46, 47, 80]
>>> 6 [12, 13, 24, 25, 26, 27, 48, 49, 50, 51, 52, 53, 54, 55, 96]
>>> 7 [14, 15, 28, 29, 30, 31, 56, 57, 58, 59, 60, 61, 62, 63, 112]
# ... where the first number is shown for clarity.
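Since the daughters of k are exactly the numbers whose binary expansion starts with the bits of k, the rows can also be filtered directly, without generating any sequence. A hedged sketch, assuming column1 holds positive integers; the helper is_descendant and the sample data are made up for this illustration:

```python
import pandas as pd

def is_descendant(m, k):
    """True if m is a strict descendant of k under n -> 2*n, 2*n+1."""
    # Number of extra binary digits m has compared with k.
    shift = m.bit_length() - k.bit_length()
    # Dropping those extra digits must leave exactly k.
    return shift > 0 and (m >> shift) == k

# Hypothetical sample data standing in for the real DataFrame.
df = pd.DataFrame({"column1": [1, 2, 3, 4, 5, 8, 9, 10, 17, 19, 40]})
k = 4

mask = df["column1"].map(lambda m: is_descendant(int(m), k))
daughters = df[mask]
print(daughters)
```

This uses only integer operations, so it avoids the floating-point logarithm and works for values of any size.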
Ugly, but it seems to work alright. What I think you might have needed to know about is the newer yield from construction, which is used twice in this code. Never thought I would.
from fractions import Fraction
from itertools import count

def daughters(k):
    print ('daughters of cell', k)
    if k<=0:
        return
    if k==1:
        yield from count(1)
    def locateK():
        cells = 1
        newCells = 2
        generation = 1
        while True:
            generation += 1
            previousCells = cells
            cells += newCells
            newCells *= 2
            if k > previousCells and k <= cells :
                break
        return ( generation, k - previousCells )
    parentGeneration, parentCell = locateK()
    cells = 1
    newCells = 2
    generation = 1
    while True:
        generation += 1
        previousCells = cells
        if generation > parentGeneration:
            if parentCell%2:
                firstChildCell=previousCells+int(Fraction(parentCell-1, 2**parentGeneration)*newCells)+1
            else:
                firstChildCell=previousCells+int(Fraction(parentCell, 2**parentGeneration)*newCells)+1
            yield from range(firstChildCell, firstChildCell+int(newCells*Fraction(1,2)))
        cells += newCells
        newCells *= 2

for n, d in enumerate(daughters(2)):
    print (d)
    if n > 15:
        break
Couple of representative results:
daughters of cell 2
4
5
8
9
10
11
16
17
18
19
20
21
22
23
32
33
34
daughters of cell 3
6
7
12
13
14
15
24
25
26
27
28
29
30
31
48
49
50
Is there a more idiomatic way of doing this in Pandas?
I want to set up a column that repeats the integers 1 to 48, for an index of length 2000:
df = pd.DataFrame(np.zeros((2000, 1)), columns=['HH'])
h = 1
for i in range(0, 2000):
    df.loc[i, 'HH'] = h
    if h >= 48:
        h = 1
    else:
        h += 1
Here is a more direct and faster way:
pd.DataFrame(np.tile(np.arange(1, 49), 2000 // 48 + 1)[:2000], columns=['HH'])
The detailed steps:
np.arange(1, 49) creates an array from 1 to 48 (inclusive)
>>> l = np.arange(1, 49)
>>> l
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])
np.tile(A, N) repeats the array A N times, so in this case you get [1 2 3 ... 48 1 2 3 ... 48 ... 1 2 3 ... 48]. You should repeat the array 2000 // 48 + 1 times in order to get at least 2000 values.
>>> r = np.tile(l, 2000 // 48 + 1)
>>> r
array([ 1, 2, 3, ..., 46, 47, 48])
>>> r.shape # The array is slightly larger than 2000
(2016,)
[:2000] takes the first 2000 values of the generated array to create your DataFrame.
>>> d = pd.DataFrame(r[:2000], columns=['HH'])
df = pd.DataFrame({'HH':np.append(np.tile(range(1,49),int(2000/48)), range(1,np.mod(2000,48)+1))})
That is, appending 2 arrays:
(1) np.tile(range(1,49),int(2000/48))
len(np.tile(range(1,49),int(2000/48)))
1968
(2) range(1,np.mod(2000,48)+1)
len(range(1,np.mod(2000,48)+1))
32
And constructing the DataFrame from a corresponding dictionary.
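The same column can also be produced with plain modular arithmetic, which avoids computing how many repetitions are needed. A short sketch; the variable names n_rows and period are chosen for this example:

```python
import numpy as np
import pandas as pd

n_rows, period = 2000, 48

# 0, 1, ..., 1999 reduced modulo 48 gives 0..47 repeating; add 1 for 1..48.
df = pd.DataFrame({"HH": np.arange(n_rows) % period + 1})

print(df["HH"].iloc[[0, 47, 48, 1999]].tolist())  # [1, 48, 1, 32]
```

The last value is 32 because 2000 = 41 * 48 + 32, which matches the length of the second array in the answer above.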