I want to create a simple vector of many repeated values. This is easy in R:
> numbers <- c(rep(1,5), rep(2,4), rep(3,3))
> numbers
[1] 1 1 1 1 1 2 2 2 2 3 3 3
However, if I try to do this in Python using pandas and numpy, I don't quite get the same thing:
numbers = pd.Series([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
numbers
0 [1, 1, 1, 1, 1]
1 [2, 2, 2, 2]
2 [3, 3, 3]
dtype: object
What's the R equivalent in Python?
Just adjust how you use np.repeat
np.repeat([1, 2, 3], [5, 4, 3])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
Or with pd.Series
pd.Series(np.repeat([1, 2, 3], [5, 4, 3]))
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
dtype: int64
That said, the purest form to replicate what you've done in R is to use np.concatenate in conjunction with np.repeat. It just isn't what I'd recommend doing.
np.concatenate([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
Now you can use the same syntax in python:
>>> from datar.base import c, rep
>>>
>>> numbers = c(rep(1,5), rep(2,4), rep(3,3))
>>> print(numbers)
[1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
I am the author of the datar package. Feel free to submit issues if you have any questions.
Related
I am studying list in python and I wonder how to input lists inside a list and saw this method to input multiple list inside a list:
from itertools import groupby
values = [0, 1, 2, 3, 4, 5, 351, 0, 1, 2, 3, 4, 5, 6, 750, 0, 1, 2, 3, 4, 559]
print([[0]+list(g) for k, g in groupby(values, bool) if k])
I tried to execute this in http://pythontutor.com/ to see the process step-by-step and I don't know what bool is checking if true or false before printing this output: [[0, 1, 2, 3, 4, 5, 351], [0, 1, 2, 3, 4, 5, 6, 750], [0, 1, 2, 3, 4, 559]]
link of the code above: Creating a list within a list in Python
Best way to understand such problems is to dissect it.
for k,g in groupby(values,bool):
if k:
print(*g)
# output
1 2 3 4 5 351
1 2 3 4 5 6 750
1 2 3 4 559
for k,g in groupby(values,bool):
if not k:
print(*g)
# output
0
0
0
Python considers bool(0) as False. So basically, it groups all the numbers between False together and it adds them to [0] therefore you get your result. bool(any non 0) is True.
for k,g in groupby(values,bool):
print(k)
print(*g)
#output
False
0
True
1 2 3 4 5 351
False
0
True
1 2 3 4 5 6 750
False
0
True
1 2 3 4 559
for k,g in groupby(values,bool):
print(k)
if k:
print([0]+list(g))
# output
False
True
[0, 1, 2, 3, 4, 5, 351]
False
True
[0, 1, 2, 3, 4, 5, 6, 750]
False
True
[0, 1, 2, 3, 4, 559]
Note
I might have explained myself loosely, what I meant by it groups what's between False together, is because they are True.
I am trying to repeat the whole data in a column in each each cell of the column.
My code:
df3=pd.DataFrame({
'x':[1,2,3,4,5],
'y':[10,20,30,20,10],
'z':[5,4,3,2,1]
})
df3 =
x y z
0 1 10 5
1 2 20 4
2 3 30 3
3 4 20 2
4 5 10 1
df3['z'] = df['z'].agg(lambda x: list(x))
Present output:
KeyError: 'z'
Expected output:
df=
x y z
0 1 10 [5, 4, 3, 2, 1]
1 2 20 [5, 4, 3, 2, 1]
2 3 30 [5, 4, 3, 2, 1]
3 4 20 [5, 4, 3, 2, 1]
4 5 10 [5, 4, 3, 2, 1]
Another way is to list(df.column.values)
df3.assign(z=[list(df3.z.values)]*len(df3))
x y z
0 5 10 [5, 4, 3, 2, 1]
1 4 20 [5, 4, 3, 2, 1]
2 3 30 [5, 4, 3, 2, 1]
3 2 20 [5, 4, 3, 2, 1]
4 1 10 [5, 4, 3, 2, 1]
Check with
df3['new_z']=[df3.z.tolist()]*len(df3)
More safe
df3['new_z']=[df3.z.tolist() for x in df.index]
I'm not sure how I should proceed in this case.
Consider a df like bellow and when I do df.A.unique() -> give me an array like this [1, 2, 3, 4]
But also I want the index of this values, like numpy.unique()
df = pd.DataFrame({'A': [1,1,1,2,2,2,3,3,4], 'B':[9,8,7,6,5,4,3,2,1]})
df.A.unique()
>>> array([1, 2, 3, 4])
And
np.unique([1,1,1,2,2,2,3,3,4], return_inverse=True)
>>> (array([1, 2, 3, 4]), array([0, 0, 0, 1, 1, 1, 2, 2, 3]))
How can I do it in Pandas? Unique values with index.
In pandas we have drop_duplicates
df.A.drop_duplicates()
Out[22]:
0 1
3 2
6 3
8 4
Name: A, dtype: int64
To match the np.unique output factorize
pd.factorize(df.A)
Out[21]: (array([0, 0, 0, 1, 1, 1, 2, 2, 3]), Int64Index([1, 2, 3, 4], dtype='int64'))
You can also use a dict to .map() with index of .unique():
df.A.map({i:e for e,i in enumerate(df.A.unique())})
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 3
I'm using numpy and I don't know how translate this MATLAB code to python:
C = reshape(A(B.',:).', 6, []).';
I think that the only right thing that I did is:
temp=A[B.transpose(),:]
but I don't know how translate all of the rows.
example of matrix:
A =
1 2
1 3
1 4
1 5
1 6
2 3
2 4
2 5
2 6
B =
1 2 3
1 2 4
1 2 5
1 2 6
1 2 7
1 2 8
1 2 9
C =
1 2 1 3 1 4
1 2 1 3 1 5
1 2 1 3 1 6
1 2 1 3 2 3
1 2 1 3 2 4
1 2 1 3 2 5
1 2 1 3 2 6
This looks like an indexing plus reshaping operation; one thing to keep in mind is that numpy is zero-indexed, while matlab is one-indexed. That means you need to index A with B - 1, and then reshape your result as desired. For example:
import numpy as np
A = np.array([[1, 2],
[1, 3],
[1, 4],
[1, 5],
[1, 6],
[2, 3],
[2, 4],
[2, 5],
[2, 6]])
B = np.array([[1, 2, 3],
[1, 2, 4],
[1, 2, 5],
[1, 2, 6],
[1, 2, 7],
[1, 2, 8],
[1, 2, 9]])
C = A[B - 1].reshape(B.shape[0], -1)
The result is:
>>> C
array([[1, 2, 1, 3, 1, 4],
[1, 2, 1, 3, 1, 5],
[1, 2, 1, 3, 1, 6],
[1, 2, 1, 3, 2, 3],
[1, 2, 1, 3, 2, 4],
[1, 2, 1, 3, 2, 5],
[1, 2, 1, 3, 2, 6]])
One potentially confusing piece: the -1 in the reshape method is a marker that indicates numpy should calculate the appropriate dimension to preserve the size of the array.
I have a dataset of like 3 items e.g. [1,2,3]
I want to find the product of it with 3 repeats and then separate them into 3 datasets like this (it should be vertical actually):
[1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3]
[1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3]
[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3]
I noticed that in python I can use iteration.product for finding products as:
data_prod=itertools.product(data,repeat=3)
now my question is how can I convert each column of the result (which the datatype is itertools.product) to 3 new datasets as shown in above example?
Use zip(*..) to turn columns into rows:
dataset1, dataset2, dataset3 = zip(*itertools.product(data,repeat=3))
Demo:
>>> zip(*itertools.product(data,repeat=3))
[(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), (1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3), (1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)]
>>> dataset1, dataset2, dataset3 = zip(*itertools.product(data,repeat=3))
>>> dataset1
(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3)
>>> dataset2
(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3)
>>> dataset3
(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
An alternate way, for display purposes, still using itertools.product:
import itertools
import pandas as pd
cols=['series1', 'series2', 'series3']
originDataset = [1,2,3]
data_prod = lambda x: list(itertools.product(x, repeat=3))
df1 = pd.DataFrame(originDataset, columns=['OriginalDataSet'])
df2 = pd.DataFrame(data_prod(originDataset), columns=cols)
print df1
print '-'*80
print df2
print '-'*80
series1, series2, series3 = df2.T.values
print series1
print series2
print series3
Output:
OriginalDataSet
0 1
1 2
2 3
--------------------------------------------------------------------------------
series1 series2 series3
0 1 1 1
1 1 1 2
2 1 1 3
3 1 2 1
4 1 2 2
5 1 2 3
6 1 3 1
7 1 3 2
8 1 3 3
9 2 1 1
10 2 1 2
11 2 1 3
12 2 2 1
13 2 2 2
14 2 2 3
15 2 3 1
16 2 3 2
17 2 3 3
18 3 1 1
19 3 1 2
20 3 1 3
21 3 2 1
22 3 2 2
23 3 2 3
24 3 3 1
25 3 3 2
26 3 3 3
--------------------------------------------------------------------------------
[1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3]
[1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3]
[1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3]
I hope it helps to, at the same time, learn how to use Pandas