Is there a more efficient way to remove the 0 from the beginning, append the 20 at the end, and retain the shape (1, 20)?
import numpy as np

# What I have.
array = np.arange(20)[np.newaxis]
print(array.shape, array)
# Remove 0 from the beginning and add 20 to the end.
# (np.append without an axis returns a flattened 1-D array,
# hence re-adding np.newaxis below.)
array = np.append(array[0, 1:], np.array([[20]]))
print(array)
array = array[np.newaxis]
print(array.shape, array)
Output:
(1, 20) [[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]]
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
(1, 20) [[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]]
You can just select a subset of the current array, excluding the first element, and then append 20 (or whatever scalar you want) at the end. Passing axis=1 keeps the (1, 20) shape:
x = np.append(array[:, 1:], [[20]], axis=1)
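If this runs repeatedly (e.g., as a rolling buffer), one option that avoids allocating a new array on every shift is plain slice assignment; this is my own sketch, not from the answers:

import numpy as np

array = np.arange(20)[np.newaxis]
# Shift everything left by one in place (NumPy buffers overlapping
# slice assignments), then write the new value into the last slot.
array[0, :-1] = array[0, 1:]
array[0, -1] = 20
print(array.shape, array)  # (1, 20) [[ 1  2 ... 19 20]]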
Maybe like this:
array = np.linspace(start=1, stop=20, num=20, endpoint=True, dtype=int)[np.newaxis]
print(array.shape, array)
I have the following dataframe with each row containing two values.
print(x)
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17
6 16 18
7 16 19
8 17 18
9 17 19
10 18 19
11 20 21
I want to merge these values if one or both values of a particular row reoccur in another row. The principle can be explained as follows: if A and B are together in one row, and B and C are together in another row, then A, B, and C should be together. What I want as an outcome, looking at the dataframe above, is:
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
I tried creating a loop with df.duplicated that would create such an outcome, but it hasn't worked out yet.
This seems like a graph theory problem dealing with connected components. You can use the networkx library:
import networkx as nx
import pandas as pd

# Assumes the two columns are named 'a' and 'b'.
g = nx.from_pandas_edgelist(df, 'a', 'b')

# connected_components() yields unordered sets, so sort each component
# to make the first element and the join order deterministic.
pd.concat([pd.Series([comp[0], ' '.join(map(str, comp[1:]))], index=['a', 'b'])
           for comp in (sorted(c) for c in nx.connected_components(g))],
          axis=1).T
Output:
a b
0 0 1
1 4 5
2 8 9
3 10 11
4 14 15
5 16 17 18 19
6 20 21
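If you'd rather not depend on networkx, the same connected-components idea can be done with a small union-find. This is my own illustration (it assumes the two columns are named 'a' and 'b', as above), not part of the original answer:

import pandas as pd

def merge_groups(pairs):
    # Tiny union-find: parent[x] points toward the root of x's group.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

df = pd.DataFrame({'a': [0, 4, 8, 10, 14, 16, 16, 16, 17, 17, 18, 20],
                   'b': [1, 5, 9, 11, 15, 17, 18, 19, 18, 19, 19, 21]})
for group in merge_groups(zip(df['a'], df['b'])):
    print(group)
# [0, 1], [4, 5], ..., [16, 17, 18, 19], [20, 21]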
I am a newbie in Python and on Stack Overflow. I am trying to change my way of thinking about loops.
I have a series of values whose type is <class 'pandas.core.series.Series'>.
Goal: Given a depth n, I would like to compute, for each value (except the first 2*n-2):
result(i) = sum[j=0 to n-1]( distance(i-j) * value(i-j) ) / sum[j=0 to n-1]( distance(i-j) )
with distance(i) = sum[k=1 to n-1]( (value(i) - value(i-k))^2 )
I want to avoid loops, so is there a better way to achieve my goal using numpy?
EDIT:
OK, it seems that I was not that clear, so here is an example with n = 4:
Index  Value
0      2
1      4
2      5
3      3
4      1
5      8
6      9
7      4
8      2
9      1
10     7
Then I compute the squared differences (value(i) - value(j))^2 with j = i-1 to i-3:
diff² |   0   1   2   3   4   5   6   7   8   9  10
------+--------------------------------------------
    0 |
    1 |
    2 |
    3 |   1   1   4
    4 |       9  16   4
    5 |           9  25  49
    6 |              36  64   1
    7 |                   9  16  25
    8 |                      36  49   4
    9 |                          64   9   1
   10 |                               9  25  36
I think that computing this matrix, full or not, is the core of my problem.
I can now compute distance(i), which is the sum of row i, and distance(i) * value(i):
Index  distance  distance x Value
0
1
2
3         6         18
4        29         29
5        83        664
6       101        909
7        50        200
8        89        178
9        74         74
10       70        490
And finally I can get the result:
Index  Value  Result
0      2
1      4
2      5
3      3
4      1
5      8
6      9      7.397260274
7      4      6.851711027
8      2      6.040247678
9      1      4.334394904
10     7      3.328621908
For example:
result(10) = (distance(10)*value(10)+distance(9)*value(9)+distance(8)*value(8)+distance(7)*value(7))/(distance(10)+distance(9)+distance(8)+distance(7))
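Plugging in the numbers from the tables above as a check (my arithmetic, not in the original post):
result(10) = (70*7 + 74*1 + 89*2 + 50*4) / (70 + 74 + 89 + 50) = 942 / 283 ≈ 3.3286
which matches the last row of the result table.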
I have a Java version of the algorithm if needed.
Thank you.
UPDATE:
I finally found how to get the full squared-differences matrix:
import numpy as np
import pandas as pd

n = 4
myseries = pd.Series([2, 4, 5, 3, 1, 8, 9, 4, 2, 1, 7])
l = len(myseries)

# Tile the series into an l x l matrix (row i holds value(i) everywhere),
# so that mat - mat.T gives every pairwise difference value(i) - value(j).
vector = np.repeat(myseries, l)
mat = vector.to_numpy().reshape((l, l))
diff = mat - np.transpose(mat)
squared_diff = np.multiply(diff, diff)
print(squared_diff)
I still have to get the sum of the selected elements.
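For what it's worth, here is a minimal sketch of that remaining step, done without an explicit loop. It reproduces the worked example above; the banded mask and sliding_window_view (NumPy >= 1.20) are my own choices, not from the original post:

import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

n = 4
myseries = pd.Series([2, 4, 5, 3, 1, 8, 9, 4, 2, 1, 7])
v = myseries.to_numpy()
l = len(v)

# Full squared-difference matrix, as in the UPDATE above.
squared_diff = np.square(v[:, None] - v[None, :])

# distance(i) sums the n-1 entries just left of the diagonal in row i.
rows, cols = np.indices((l, l))
band = (cols < rows) & (cols >= rows - (n - 1))
distance = np.where(band, squared_diff, 0).sum(axis=1)

# result(i) = sum_j distance(i-j)*value(i-j) / sum_j distance(i-j),
# j = 0 .. n-1, via sliding windows of length n over d and d*v.
dw = sliding_window_view(distance, n).sum(axis=1)
dvw = sliding_window_view(distance * v, n).sum(axis=1)
result = np.full(l, np.nan)
result[n - 1:] = dvw / dw
# The first 2n-2 results rely on partially summed distances; mask them.
result[:2 * n - 2] = np.nan
print(result)  # ... 7.39726027 6.85171103 6.04024768 4.3343949 3.32862191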
You could do it like this:
myseries = pd.Series(np.random.rand(100), dtype='float32')
sum_of_squared_distances = np.sum(np.square(np.diff(myseries.values[:n][1::2])))
where n is the depth and the [1::2] slice keeps only the odd-indexed values, since you only need the values corresponding to odd indices (except 2*n-2).
Assume I have a data frame like:
import pandas as pd
df = pd.DataFrame({"user_id": [1, 5, 11],
"user_type": ["I", "I", "II"],
"joined_for": [1.4, 9.4, 18.1]})
Now I'd like to:
Take each user's joined_for and get the ceiling integer.
Based on the integer, create a new data frame containing number sequences where the maximum is the ceiling number.
This is how I do it now:
import math
new_df = pd.DataFrame()
for i in range(df.shape[0]):
    ceil_num = math.ceil(df.iloc[i]["joined_for"])
    # Note: DataFrame.append was removed in pandas 2.0 (pd.concat is the
    # modern equivalent), but the row-by-row loop is the real bottleneck.
    new_df = new_df.append(pd.DataFrame({"user_id": df.iloc[i]["user_id"],
                                         "joined_month": range(1, ceil_num + 1)}),
                           ignore_index=True)
new_df = new_df.merge(df.drop(columns="joined_for"), on="user_id")
new_df is what I want, but it's so time-consuming when there are lots of users and the number of joined_for can be larger. Is there any better way to do this? Faster or neater?
Using a comprehension:
import math

pd.DataFrame([
    [t.user_id, m, t.user_type] for t in df.itertuples(index=False)
    for m in range(1, math.ceil(t.joined_for) + 1)
], columns=['user_id', 'joined_month', 'user_type'])
user_id joined_month user_type
0 1 1 I
1 1 2 I
2 5 1 I
3 5 2 I
4 5 3 I
5 5 4 I
6 5 5 I
7 5 6 I
8 5 7 I
9 5 8 I
10 5 9 I
11 5 10 I
12 11 1 II
13 11 2 II
14 11 3 II
15 11 4 II
16 11 5 II
17 11 6 II
18 11 7 II
19 11 8 II
20 11 9 II
21 11 10 II
22 11 11 II
23 11 12 II
24 11 13 II
25 11 14 II
26 11 15 II
27 11 16 II
28 11 17 II
29 11 18 II
30 11 19 II
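If speed matters, a fully vectorized alternative (my own sketch, not from the answer above) is to repeat each row ceil(joined_for) times with index.repeat and then number the repeats with groupby().cumcount():

import numpy as np
import pandas as pd

# Repeat each row ceil(joined_for) times, keyed by the original index.
reps = np.ceil(df["joined_for"]).astype(int)
out = df.loc[df.index.repeat(reps)].drop(columns="joined_for")
# Number the copies of each original row 1, 2, ..., ceil(joined_for).
out["joined_month"] = out.groupby(level=0).cumcount() + 1
out = out.reset_index(drop=True)

The column order differs from the loop version (joined_month comes last), but the rows are the same.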
I want to randomly select 10% of all rows in my df and replace each with a randomly sampled existing row from the df.
To randomly select 10% of the rows, rows_to_change = df.sample(frac=0.1) works, and I can get a new random existing row with replacement_sample = df.sample(n=1), but how do I put this together to quickly iterate over the entire 10%?
The df contains millions of rows x ~100 cols.
Example df:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                   'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]})
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
Let's say it randomly samples indexes 2 and 13 to replace with randomly selected indexes 6 and 9; the final df would look like:
A B C
0 1 1 1
1 2 2 2
2 7 7 7
3 4 4 4
4 5 5 5
5 6 6 6
6 7 7 7
7 8 8 8
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 10 10 10
14 15 15 15
You can take a random sample, then take another random sample of the same size and replace the values at those indices with the original sample.
import pandas as pd
df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})  # 15 rows, as in the example
samp = df.sample(frac=0.1)
samp
# returns:
A B C
6 7 7 7
9 10 10 10
replace = df.loc[~df.index.isin(samp.index)].sample(samp.shape[0])
replace
# returns:
A B C
3 4 4 4
7 8 8 8
df.loc[replace.index] = samp.values
This copies the rows without replacement
df
# returns:
A B C
0 1 1 1
1 2 2 2
2 3 3 3
3 7 7 7
4 5 5 5
5 6 6 6
6 7 7 7
7 10 10 10
8 9 9 9
9 10 10 10
10 11 11 11
11 12 12 12
12 13 13 13
13 14 14 14
14 15 15 15
To sample with replacement, use the keyword replace=True when defining samp.
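For instance, a one-line variant of the sampling step above:

# Allow the same source row to be drawn more than once.
samp = df.sample(frac=0.1, replace=True)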
James' answer is a smart Pandas solution. However, since you noted your dataset length is somewhere in the millions, you could also consider NumPy, given that Pandas often comes with significant performance overhead.
import numpy as np
import pandas as pd

def repl_rows(df: pd.DataFrame, pct: float):
    # Modifies `df` in place.
    n, _ = df.shape
    rows = int(2 * np.ceil(n * pct))    # Total rows in both sets
    idx = np.arange(n, dtype=np.int64)  # np.int was removed from NumPy
    full = np.random.choice(idx, size=rows, replace=False)
    to_repl, repl_with = np.split(full, 2)
    # Assign via iloc: writing through df.values only works reliably for
    # single-dtype frames, where .values is a view of the underlying data.
    df.iloc[to_repl] = df.iloc[repl_with].to_numpy()
Steps:
Get target rows as an integer.
Get a NumPy range-array the same length as your index. Might provide more stability than using the index itself if you have something like an uneven datetime index. (I'm not totally sure, something to toy around with.)
Sample from this index without replacement, sample size is 2 times the number of rows you want to manipulate.
Split the result in half to get targets and replacements. Should be faster than two calls to choice().
Replace at positions to_repl with values from repl_with.
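A quick usage sketch with the example frame from this question (with 15 rows and pct=0.1, the rounded-up sample size works out to two target rows and two source rows):

df = pd.DataFrame({'A': range(1, 16), 'B': range(1, 16), 'C': range(1, 16)})
repl_rows(df, 0.1)  # replaces two rows in place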
Example input:
[('b', 'c', 4),
('l', 'r', 5),
('i', 'a', 6),
('c', 't', 7),
('a', '$', 8),
('n', '$', 9)]
[0] contains the vertical heading, [1] contains the horizontal heading.
Example output:
  c r a t $ $
b 4
l   5
i     6
c       7
a         8
n           9
Note: given enough tuples the entire table could be filled :P
How do I format output as a table in Python using [preferably] one line of code?
Here's an answer for your revised question:
data = [
    ['A', 'a', '1'],
    ['B', 'b', '2'],
    ['C', 'c', '3'],
    ['D', 'd', '4']
]

# Desired output:
#
#   A B C D
# a 1
# b   2
# c     3
# d       4

# Check data consists of colname, rowname, value triples
assert all([3 == len(row) for row in data])

# Convert all data to strings
data = [[str(c) for c in r] for r in data]

# Check all data is one character wide
assert all([1 == len(s) for r in data for s in r])

#============================================================================
# Verbose version
#============================================================================

col_names, row_names, values = zip(*data)  # Transpose

header_line = '  ' + ' '.join(col_names)

row_lines = []
for idx, (row_name, value) in enumerate(zip(row_names, values)):
    # Use '  '*n to get 2n consecutive spaces.
    row_line = row_name + ' ' + '  '*idx + value
    row_lines.append(row_line)

print(header_line)
for r in row_lines:
    print(r)
Or, if that's too long for you, try this:
cs, rs, vs = zip(*data)
print('\n'.join(['  ' + ' '.join(cs)] + [r + ' ' + '  '*i + v for i, (r, v) in enumerate(zip(rs, vs))]))
Both have the following output:
  A B C D
a 1
b   2
c     3
d       4
Here's the kernel of what you want (no header row or header column):
>>> print('\n'.join([''.join([str(i+j+2).rjust(3) for i in range(10)])
...                  for j in range(10)]))
  2  3  4  5  6  7  8  9 10 11
  3  4  5  6  7  8  9 10 11 12
  4  5  6  7  8  9 10 11 12 13
  5  6  7  8  9 10 11 12 13 14
  6  7  8  9 10 11 12 13 14 15
  7  8  9 10 11 12 13 14 15 16
  8  9 10 11 12 13 14 15 16 17
  9 10 11 12 13 14 15 16 17 18
 10 11 12 13 14 15 16 17 18 19
 11 12 13 14 15 16 17 18 19 20
It uses a nested list comprehension over i and j to generate the numbers i+j+2, then str.rjust() to pad all fields to three characters in length, and finally some str.join()s to put all the substrings together.
Assuming Python 2.x, it's a bit ugly, but it's functional:
import operator
from functools import partial

x = range(1, 11)
y = range(0, 11)
# In Python 2, map() returns a list, so each row is a plain list.
multtable = [y] + [[i] + map(partial(operator.add, i), y[1:]) for i in x]
for i in multtable:
    for j in i:
        print str(j).rjust(3),
    print
  0   1   2   3   4   5   6   7   8   9  10
  1   2   3   4   5   6   7   8   9  10  11
  2   3   4   5   6   7   8   9  10  11  12
  3   4   5   6   7   8   9  10  11  12  13
  4   5   6   7   8   9  10  11  12  13  14
  5   6   7   8   9  10  11  12  13  14  15
  6   7   8   9  10  11  12  13  14  15  16
  7   8   9  10  11  12  13  14  15  16  17
  8   9  10  11  12  13  14  15  16  17  18
  9  10  11  12  13  14  15  16  17  18  19
 10  11  12  13  14  15  16  17  18  19  20
Your problem is so darn specific, it's difficult to make a real generic example.
The important part here, though, is the part that makes the table, rather than the actual printing:
[map(partial(operator.add, i), y[1:]) for i in x]