Pandas repeated values - python

Is there a more idiomatic way of doing this in Pandas?
I want to set up a column that repeats the integers 1 to 48, for an index of length 2000:
df = pd.DataFrame(np.zeros((2000, 1)), columns=['HH'])
h = 1
for i in range(0,2000) :
    df.loc[i,'HH'] = h
    if h >=48 : h =1
    else : h += 1

Here is a more direct and faster way:
pd.DataFrame(np.tile(np.arange(1, 49), 2000 // 48 + 1)[:2000], columns=['HH'])
The detailed steps:
np.arange(1, 49) creates an array of the integers from 1 to 48 (inclusive)
>>> l = np.arange(1, 49)
>>> l
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48])
np.tile(A, N) repeats the array A N times, so in this case you get [1 2 3 ... 48 1 2 3 ... 48 ... 1 2 3 ... 48]. You should repeat the array 2000 // 48 + 1 times in order to get at least 2000 values.
>>> r = np.tile(l, 2000 // 48 + 1)
>>> r
array([ 1, 2, 3, ..., 46, 47, 48])
>>> r.shape # The array is slightly larger than 2000
(2016,)
[:2000] takes the first 2000 values from the generated array to create your DataFrame.
>>> d = pd.DataFrame(r[:2000], columns=['HH'])

df = pd.DataFrame({'HH':np.append(np.tile(range(1,49),int(2000/48)), range(1,np.mod(2000,48)+1))})
That is, appending 2 arrays:
(1) np.tile(range(1,49),int(2000/48))
len(np.tile(range(1,49),int(2000/48)))
1968
(2) range(1,np.mod(2000,48)+1)
len(range(1,np.mod(2000,48)+1))
32
And constructing the DataFrame from a corresponding dictionary.
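Both answers above build the column by tiling. The wrap-around in the original loop can also be written with modular arithmetic: position i (counting from 0) holds i % 48 + 1. Below is a minimal plain-Python sketch that checks this against the counter logic from the question; the names PERIOD, LENGTH, loop_values and mod_values are illustrative only.

```python
PERIOD = 48    # values cycle through 1..48
LENGTH = 2000  # number of rows wanted

# Reference: the counter logic from the question, collected in a list.
loop_values = []
h = 1
for _ in range(LENGTH):
    loop_values.append(h)
    h = 1 if h >= PERIOD else h + 1

# Same sequence from the index alone: i % 48 runs 0..47, so add 1.
mod_values = [i % PERIOD + 1 for i in range(LENGTH)]
assert mod_values == loop_values

# 2000 = 41 * 48 + 32: 41 full cycles plus a partial cycle of 32 values.
full_cycles, remainder = divmod(LENGTH, PERIOD)
print(full_cycles, remainder, mod_values[-1])
```

With NumPy the same expression vectorizes as np.arange(2000) % 48 + 1, and the resulting array can be passed to pd.DataFrame in the same way as the tiled array.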


Sort for Matrix

I have a problem with sorting a matrix.
I need to create an M×M matrix whose size is read from input, and fill the nested lists using randrange.
from random import randrange
matrix_size = int(input("Enter size of the matrix: "))
matrix = [[randrange(1, 51) for column in range(matrix_size)] for row in range(matrix_size)]
In the next step I should find the sum of each column of the matrix, so I do this:
for i in range(matrix_size):
    sum_column = 0
    for j in range(matrix_size):
        sum_column += matrix[j][i]
        print(f'{matrix[i][j]:>5}', end='')
    print(f'{sum_column:>5}')
The problem is that I should add the row of sums at the end of the matrix, but this is what I get instead:
Enter the size of the matrix: 5
15 23 14 22 20 73
7 26 26 27 27 160
17 36 9 13 42 104
1 32 41 2 29 113
33 43 14 49 12 130
It counts correctly, but how can I add the sums to the end of the matrix and sort in ascending order by the column sums? I hope some of you will understand what I need. Thanks.
Do you mean something like this?
import numpy as np
matrix = np.array(matrix)
rowsum = matrix.sum(axis=1) # sum of rows
idx = np.argsort(rowsum) # permutation that makes rowsum sorted
result = np.hstack([matrix, rowsum[:, None]]) # join matrix and rowsum
result = result[idx] # sort rows in ascending order
For the matrix
array([[31, 13, 29, 5, 1],
[21, 9, 34, 31, 22],
[13, 38, 29, 20, 50],
[21, 12, 26, 5, 15],
[19, 24, 38, 44, 41]])
the output would be:
array([[ 31, 13, 29, 5, 1, 79],
[ 21, 12, 26, 5, 15, 79],
[ 21, 9, 34, 31, 22, 117],
[ 13, 38, 29, 20, 50, 150],
[ 19, 24, 38, 44, 41, 166]])
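Note that the snippet above sums rows, while the question asks for column sums. Here is a sketch of the column version in plain Python, under the assumption that "sort ascending" means reordering the columns by their sums and then appending the sums as a final row. The function name sort_columns_by_sum is hypothetical.

```python
def sort_columns_by_sum(matrix):
    """Reorder columns by ascending column sum and append the sums as a last row."""
    columns = list(zip(*matrix))                 # transpose: one tuple per column
    columns.sort(key=sum)                        # stable sort, ties keep original order
    rows = [list(row) for row in zip(*columns)]  # transpose back to rows
    rows.append([sum(col) for col in columns])   # final row holds the column sums
    return rows


matrix = [
    [31, 13, 29, 5, 1],
    [21, 9, 34, 31, 22],
    [13, 38, 29, 20, 50],
    [21, 12, 26, 5, 15],
    [19, 24, 38, 44, 41],
]

for row in sort_columns_by_sum(matrix):
    print("".join(f"{value:>5}" for value in row))
```

Working on the transposed matrix keeps the sort a one-liner; if the rows should be sorted instead, the same function can be applied without the two transposes.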

Replacing data in column with mean value of corresponding bin?

I make bins out of my column using pandas' pd.qcut(). I would then like to apply smoothing by replacing each value with the mean of its bin.
I generate my bins with something like
pd.qcut(col, 3)
For example,
Given the column values [4, 8, 15, 21, 21, 24, 25, 28, 34]
and the generated bins
Bin1 [4, 15]: 4, 8, 15
Bin2 [21, 24]: 21, 21, 24
Bin3 [25, 34]: 25, 28, 34
I would like to replace the values with the following means
Mean of Bin1 (4, 8, 15) = 9
Mean of Bin2 (21, 21, 24) = 22
Mean of Bin3 (25, 28, 34) = 29
Therefore:
Bin1: 9, 9, 9
Bin2: 22, 22, 22
Bin3: 29, 29, 29
making the final dataset: [9, 9, 9, 22, 22, 22, 29, 29, 29]
How can one also add a column with closest bin boundaries?
Bin1: 4, 4, 15
Bin2: 21, 21, 24
Bin3: 25, 25, 34
making the final dataset: [4, 4, 15, 21, 21, 24, 25, 25, 34]
This is very similar to this question, which is for R.
It's exactly as you laid out, using this technique to get the nearest boundary:
df = pd.DataFrame({"col":[4, 8, 15, 21, 21, 24, 25, 28, 34]})
df2 = df.assign(bin=pd.qcut(df.col, 3),
                colbmean=lambda dfa: dfa.groupby("bin").transform("mean"),
                colbin=lambda dfa: dfa.apply(lambda r: min([r.bin.left,r.bin.right], key=lambda x: abs(x-r.col)), axis=1))
   col  bin             colbmean  colbin
0    4  (3.999, 19.0]          9   3.999
1    8  (3.999, 19.0]          9   3.999
2   15  (3.999, 19.0]          9  19
3   21  (19.0, 24.333]        22  19
4   21  (19.0, 24.333]        22  19
5   24  (19.0, 24.333]        22  24.333
6   25  (24.333, 34.0]        29  24.333
7   28  (24.333, 34.0]        29  24.333
8   34  (24.333, 34.0]        29  34
Below is the solution I came up with for your problem.
There is still a limitation: pandas.qcut does not return closed intervals, so the results are not exactly the ones you described.
import pandas as pd
df = pd.DataFrame({'value': [4, 8, 15, 21, 21, 24, 25, 28, 34]})
df['bin'] = pd.qcut(df['value'], 3)
df = df.join(df.groupby('bin')['value'].mean(), on='bin', rsuffix='_average_in_bin')
df['bin_left'] = df['bin'].apply(lambda x: x.left)
df['bin_right'] = df['bin'].apply(lambda x: x.right)
df['nearest_boundary'] = df.apply(lambda x: x['bin_left'] if abs(x['value'] - x['bin_left']) < abs(x['value'] - x['bin_right']) else x['bin_right'], axis=1)
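Both answers take the boundaries from the qcut interval edges (3.999, 19.0, 24.333, 34.0), which is why they differ from the values listed in the question. The plain-Python sketch below reproduces the question's numbers instead. It assumes equal-sized bins over the sorted data and uses each bin's smallest and largest member as its boundaries; it does not use pandas, and the function name smooth_by_bins is made up for illustration.

```python
def smooth_by_bins(values, n_bins):
    """Return (means, boundaries) smoothing for equal-frequency bins.

    Assumes len(values) is divisible by n_bins. Results are in sorted order.
    """
    ordered = sorted(values)
    size = len(ordered) // n_bins
    means, boundaries = [], []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        mean = sum(chunk) / len(chunk)
        low, high = chunk[0], chunk[-1]
        for v in chunk:
            means.append(mean)
            # min() returns the first candidate on ties, so ties go to the lower boundary.
            boundaries.append(min((low, high), key=lambda b: abs(v - b)))
    return means, boundaries


means, boundaries = smooth_by_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
print(means)
print(boundaries)
```

The trade-off is that this sketch defines bins by position in the sorted data, whereas pd.qcut defines them by quantile edges, so the two can disagree when values are tied across a bin edge.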

Print the numbers 1-100 in lines of 10 numbers

I want to print numbers 1-100 in this format
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
etc.
But I can't find a way to do this effectively.
There are more efficient ways of doing it, but since it looks like you are just starting out with Python, try using for loops to iterate through each row of 10.
for i in range(10):
    for j in range(1, 11):
        print(i * 10 + j, end=" ")
    print()
You can try this
Code:
lst = list(range(1,11))
numrows = 5 #Decide number of rows you want to print
for i in range(numrows):
    print([x+(i*10) for x in lst])
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
[31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
[41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
I would check out how the modulo operator works in Python, for example: How to calculate a mod b in Python?
for i in range(1, 101):
    print(i, end=" ")
    if(i%10==0):
        print()  # end the line after every tenth number
[print(i, end=" ") if i % 10 else print(i) for i in range(1, 101)]
This list comprehension is used only for its side effect (printing); the resulting list itself isn't used.
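Another option is to build each line as a string with str.join, which also avoids the trailing space that end=" " leaves at the end of every row. A sketch, with a hypothetical helper name and parameters:

```python
def number_rows(first=1, last=100, per_row=10):
    """Return the numbers first..last as lines of per_row space-separated values."""
    return [
        " ".join(str(n) for n in range(start, min(start + per_row, last + 1)))
        for start in range(first, last + 1, per_row)
    ]


for line in number_rows():
    print(line)
```

Returning the lines instead of printing them directly keeps the formatting separate from the output, so the same helper works for other ranges or row widths.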

Python: How to efficiently count the number of "1"s in the binary representation of 1 to n numbers?

E.g. For the input 5, the output should be 7.
(bin(1) = 1, bin(2) = 10 ... bin(5) = 101) --> 1 + 1 + 2 + 1 + 2 = 7
Here's what I've tried, but it isn't a very efficient algorithm, considering that I iterate the loop once for each integer. My code (Python 3):
i = int(input())
a = 0
for b in range(i+1):
    a = a + bin(b).count("1")
print(a)
Thank you!
Here's a solution based on the recurrence relation from OEIS:
def onecount(n):
    if n == 0:
        return 0
    if n % 2 == 0:
        m = n//2  # integer division keeps the results as ints in Python 3
        return onecount(m) + onecount(m-1) + m
    m = (n-1)//2
    return 2*onecount(m)+m+1
>>> [onecount(i) for i in range(30)]
[0, 1, 2, 4, 5, 7, 9, 12, 13, 15, 17, 20, 22, 25, 28, 32, 33, 35, 37, 40, 42, 45, 48, 52, 54, 57, 60, 64, 67, 71]
gmpy2, due to Alex Martelli et al., seems to perform better, at least on my Win10 machine.
from time import time
import gmpy2
def onecount(n):
    if n == 0:
        return 0
    if n % 2 == 0:
        m = n//2
        return onecount(m) + onecount(m-1) + m
    m = (n-1)//2
    return 2*onecount(m)+m+1
N = 10000
initial = time()
for _ in range(N):
    for i in range(30):
        onecount(i)
print (time()-initial)
initial = time()
for _ in range(N):
    total = 0
    for i in range(30):
        total+=gmpy2.popcount(i)
print (time()-initial)
Here's the output:
1.7816979885101318
0.07404899597167969
If you want a list, and you're using Python 3.2 or later:
>>> from itertools import accumulate
>>> result = list(accumulate([gmpy2.popcount(_) for _ in range(30)]))
>>> result
[0, 1, 2, 4, 5, 7, 9, 12, 13, 15, 17, 20, 22, 25, 28, 32, 33, 35, 37, 40, 42, 45, 48, 52, 54, 57, 60, 64, 67, 71]
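The total can also be computed without visiting every integer by counting, for each bit position, how many numbers in 0..n have that bit set. Here is a sketch of that idea (the function name count_ones_upto is illustrative), checked against the brute-force loop from the question:

```python
def count_ones_upto(n):
    """Total number of 1 bits in the binary representations of 0..n."""
    total = 0
    bit = 1                      # value of the current bit position: 1, 2, 4, ...
    while bit <= n:
        cycle = bit * 2          # this bit repeats: `bit` zeros, then `bit` ones
        full, rest = divmod(n + 1, cycle)
        total += full * bit              # ones from complete cycles
        total += max(0, rest - bit)      # ones from the trailing partial cycle
        bit = cycle
    return total


# Cross-check against the straightforward per-number count.
for n in range(200):
    assert count_ones_upto(n) == sum(bin(b).count("1") for b in range(n + 1))

print(count_ones_upto(5))  # 7, as in the question
```

The loop runs once per bit of n, so the work grows with the number of bits in n rather than with n itself.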

How to find all possible daughters in a sequence of numbers stored in dataframe

I have a Python dataframe in which one column, say column1, contains a series of numbers. Each of these numbers is the result of cell mutation, so the cell with number n gives rise to two cells with the numbers 2*n and 2*n+1. I want to search this column to find all rows that correspond to daughters of a specific number k, that is, the rows that contain any of {2*k, 2*k+1, 2*(2*k), 2*(2*k+1), ... } in column1. I don't want to use a tree structure. How can I approach the solution? Thanks.
The two sequences look like the numbers whose binary expansion starts with 10 and the numbers whose binary expansion starts with 11.
Both sequences can be found directly:
import math
def f(n=2):
    while True:
        yield int(n + 2**math.floor(math.log(n,2)))
        n += 1
def g(n=2):
    while True:
        yield int(n + 2 * 2**math.floor(math.log(n,2)))
        n += 1
a, b = f(), g()
print([next(a) for i in range(15)])
print([next(b) for i in range(15)])
>>> [4, 5, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23, 32]
>>> [6, 7, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31, 48]
EDIT:
For an arbitrary starting point, you can do the following, which I think meets your criteria.
import queue
def f(k):
    q = queue.Queue()
    q.put(k)
    while not q.empty():
        p = q.get()
        a, b = 2*p, 2*p+1
        q.put(a)
        q.put(b)
        yield a
        yield b
a = f(4)
print([next(a) for i in range(16)])
>>> [8, 9, 16, 17, 18, 19, 32, 33, 34, 35, 36, 37, 38, 39, 64, 65] # ...
a = f(5)
print([next(a) for i in range(16)])
>>> [10, 11, 20, 21, 22, 23, 40, 41, 42, 43, 44, 45, 46, 47, 80, 81] # ...
Checking those sequences against OEIS:
f(2) - Starting 10 - A004754
f(3) - Starting 11 - A004755
f(4) - Starting 100 - A004756
f(5) - Starting 101 - A004757
f(6) - Starting 110 - A004758
f(7) - Starting 111 - A004759
...
Which means you can simply do:
import math
def f(k, n=2):
    while True:
        yield int(n + (k-1) * 2**math.floor(math.log(n, 2)))
        n+=1
for i in range(2,8):
    a = f(i)
    print(i, [next(a) for j in range(15)])
>>> 2 [4, 5, 8, 9, 10, 11, 16, 17, 18, 19, 20, 21, 22, 23, 32]
>>> 3 [6, 7, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31, 48]
>>> 4 [8, 9, 16, 17, 18, 19, 32, 33, 34, 35, 36, 37, 38, 39, 64]
>>> 5 [10, 11, 20, 21, 22, 23, 40, 41, 42, 43, 44, 45, 46, 47, 80]
>>> 6 [12, 13, 24, 25, 26, 27, 48, 49, 50, 51, 52, 53, 54, 55, 96]
>>> 7 [14, 15, 28, 29, 30, 31, 56, 57, 58, 59, 60, 61, 62, 63, 112]
# ... where the first number is shown for clarity.
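Since the question is about filtering an existing column rather than generating the sequence, a membership test may be more direct: m descends from k exactly when the binary expansion of m starts with that of k and is longer. Here is a sketch using int.bit_length; the function name is_descendant and the sample data are illustrative.

```python
def is_descendant(m, k):
    """True if m can be reached from k by repeatedly applying n -> 2n or 2n + 1."""
    if k <= 0 or m <= k:
        return False
    shift = m.bit_length() - k.bit_length()
    # Dropping the extra low bits of m must leave exactly k.
    return (m >> shift) == k


column1 = list(range(1, 66))   # stand-in for the values in the dataframe column
print([m for m in column1 if is_descendant(m, 4)])
```

With a DataFrame, the same predicate could be applied as df[df['column1'].apply(lambda m: is_descendant(int(m), k))], assuming the column holds positive integers. No tree or generated sequence is needed, and the check costs a couple of integer operations per row.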
Ugly, but it seems to work all right. What I think you might have needed to know about is the newer yield from construct, which is used twice in this code. Never thought I would use it.
from fractions import Fraction
from itertools import count
def daughters(k):
    print ('daughters of cell', k)
    if k<=0:
        return
    if k==1:
        yield from count(1)
    def locateK():
        cells = 1
        newCells = 2
        generation = 1
        while True:
            generation += 1
            previousCells = cells
            cells += newCells
            newCells *= 2
            if k > previousCells and k <= cells :
                break
        return ( generation, k - previousCells )
    parentGeneration, parentCell = locateK()
    cells = 1
    newCells = 2
    generation = 1
    while True:
        generation += 1
        previousCells = cells
        if generation > parentGeneration:
            if parentCell%2:
                firstChildCell=previousCells+int(Fraction(parentCell-1, 2**parentGeneration)*newCells)+1
            else:
                firstChildCell=previousCells+int(Fraction(parentCell, 2**parentGeneration)*newCells)+1
            yield from range(firstChildCell, firstChildCell+int(newCells*Fraction(1,2)))
        cells += newCells
        newCells *= 2
for n, d in enumerate(daughters(2)):
    print (d)
    if n > 15:
        break
A couple of representative results:
daughters of cell 2
4
5
8
9
10
11
16
17
18
19
20
21
22
23
32
33
34
daughters of cell 3
6
7
12
13
14
15
24
25
26
27
28
29
30
31
48
49
50
