Pandas: map unique values back to a column in order - python

I'm not sure how I should proceed in this case.
Consider a df like the one below: df.A.unique() gives me an array like [1, 2, 3, 4].
But I also want the inverse index of these values, like numpy.unique() with return_inverse=True:
df = pd.DataFrame({'A': [1,1,1,2,2,2,3,3,4], 'B':[9,8,7,6,5,4,3,2,1]})
df.A.unique()
>>> array([1, 2, 3, 4])
And
np.unique([1,1,1,2,2,2,3,3,4], return_inverse=True)
>>> (array([1, 2, 3, 4]), array([0, 0, 0, 1, 1, 1, 2, 2, 3]))
How can I do this in pandas, i.e. get the unique values together with their inverse index?

In pandas we have drop_duplicates
df.A.drop_duplicates()
Out[22]:
0 1
3 2
6 3
8 4
Name: A, dtype: int64
To match the np.unique output, use factorize:
pd.factorize(df.A)
Out[21]: (array([0, 0, 0, 1, 1, 1, 2, 2, 3]), Int64Index([1, 2, 3, 4], dtype='int64'))
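Note that the tuple order differs from numpy: pd.factorize returns (codes, uniques), while np.unique(..., return_inverse=True) returns (uniques, inverse). A minimal sketch to line the two up:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 4]})
codes, uniques = pd.factorize(df.A)
# Swap the pair (and convert the Index to an array) to mirror np.unique
print((np.asarray(uniques), codes))
# (array([1, 2, 3, 4]), array([0, 0, 0, 1, 1, 1, 2, 2, 3]))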

You can also build a dict from enumerate() over .unique() and pass it to .map():
df.A.map({i:e for e,i in enumerate(df.A.unique())})
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 3
Name: A, dtype: int64
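As a sanity check, this dict-based mapping produces exactly the codes that factorize returns, since both encode each value by the order of its first appearance. A small sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 4]})
codes, _ = pd.factorize(df.A)
mapped = df.A.map({v: i for i, v in enumerate(df.A.unique())})
assert (mapped.values == codes).all()  # identical encodings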

Pandas Create Column of Numpy Arrays Given Min and Max in Other Columns

Given the following dataframe:
df = pd.DataFrame({'min':[1,2,3],'max':[4,5,6]})
df
min max
0 1 4
1 2 5
2 3 6
I need to add a third column called "arrays" that holds, for each row, the array generated from the "min" and "max" columns (with 1 added to the "max" value).
For example, using data from the first row, min = 1 and max = 4:
np.arange(1, 5)
array([1, 2, 3, 4])
So I would need that result stored in the new "arrays" column in the first row.
Here is the desired result:
min max arrays
0 1 4 [1, 2, 3, 4]
1 2 5 [2, 3, 4, 5]
2 3 6 [3, 4, 5, 6]
Use a list comprehension with range:
df['arrays'] = [list(range(m, mx+1)) for m, mx in zip(df['min'], df['max'])]
Out[1015]:
min max arrays
0 1 4 [1, 2, 3, 4]
1 2 5 [2, 3, 4, 5]
2 3 6 [3, 4, 5, 6]
Another solution:
df = pd.DataFrame({'min':[1,2,3],'max':[4,5,6]})
df['arrays'] = df.apply(lambda x: np.arange(x['min'], x['max']+1), axis=1)
print(df)
Prints:
min max arrays
0 1 4 [1, 2, 3, 4]
1 2 5 [2, 3, 4, 5]
2 3 6 [3, 4, 5, 6]
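One caveat that applies to both answers: a column whose cells hold lists or arrays always gets object dtype, because pandas stores each cell as a full Python object. A quick check, reusing the example frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'min': [1, 2, 3], 'max': [4, 5, 6]})
df['arrays'] = [np.arange(lo, hi + 1) for lo, hi in zip(df['min'], df['max'])]
print(df['arrays'].dtype)          # object
print(type(df['arrays'].iloc[0]))  # <class 'numpy.ndarray'>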

Remove all columns that are of a certain value in a specific row

I'm looking for a way to remove all columns from my pandas df based on the value of a single row, e.g., return a new df with all rows but only those columns that are zero in row X.
You can do this with loc and iloc
df = pd.DataFrame({'a':[1, 20, 30, 4, 0],
'b':[1, 0, 3, 4, 0],
'c':[1, 3, 7, 7, 5],
'd':[1, 8, 3, 8, 5],
'e':[1, 11, 3, 4, 0]})
df.loc[:, df.iloc[4,:] == 0]
a b e
0 1 1 1
1 20 0 11
2 30 3 3
3 4 4 4
4 0 0 0
OK, I found a way. Any better/quicker/more pythonic/pandas solution, anyone?
zero_cols = df1.loc['X'] == 0  # boolean mask over the columns, from row label 'X'
df2 = df1.loc[:, zero_cols]
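For completeness, the same selection can be phrased as a drop instead of a keep; this sketch assumes, as in the answer above, that row position 4 is the row of interest:
import pandas as pd

df = pd.DataFrame({'a': [1, 20, 30, 4, 0],
                   'b': [1, 0, 3, 4, 0],
                   'c': [1, 3, 7, 7, 5],
                   'd': [1, 8, 3, 8, 5],
                   'e': [1, 11, 3, 4, 0]})
# Drop every column whose value in row 4 is non-zero,
# keeping only the columns that are zero there.
print(df.drop(columns=df.columns[df.iloc[4] != 0]))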

pandas equivalent to R series of multiple repeated numbers

I want to create a simple vector of many repeated values. This is easy in R:
> numbers <- c(rep(1,5), rep(2,4), rep(3,3))
> numbers
[1] 1 1 1 1 1 2 2 2 2 3 3 3
However, if I try to do this in Python using pandas and numpy, I don't quite get the same thing:
numbers = pd.Series([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
numbers
0 [1, 1, 1, 1, 1]
1 [2, 2, 2, 2]
2 [3, 3, 3]
dtype: object
What's the R equivalent in Python?
Just adjust how you use np.repeat
np.repeat([1, 2, 3], [5, 4, 3])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
Or with pd.Series
pd.Series(np.repeat([1, 2, 3], [5, 4, 3]))
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
dtype: int64
That said, the purest form to replicate what you've done in R is to use np.concatenate in conjunction with np.repeat. It just isn't what I'd recommend doing.
np.concatenate([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
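If you would rather stay entirely within pandas, Series.repeat accepts an array of counts as well; a minimal sketch (the reset_index is needed because repeat keeps the original index labels):
import pandas as pd

numbers = pd.Series([1, 2, 3]).repeat([5, 4, 3]).reset_index(drop=True)
print(numbers.tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]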
With the datar package you can now use the same syntax in Python:
>>> from datar.base import c, rep
>>>
>>> numbers = c(rep(1,5), rep(2,4), rep(3,3))
>>> print(numbers)
[1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
I am the author of the datar package. Feel free to submit issues if you have any questions.

Summing Two DataFrames by Index

I have the following
df1 = pd.DataFrame([1, 1, 1, 1, 1], index=[ 1, 2, 3, 4 ,5 ], columns=['A'])
df2 = pd.DataFrame([ 1, 1, 1, 1, 1], index=[ 2, 3, 4, 5, 6], columns=['A'])
I want to return the DataFrame which will be the sum of the two for each row:
df = pd.DataFrame([ 1, 2, 2, 2, 2, 1], index=[1, 2, 3, 4, 5, 6], columns=['A'])
Of course, the idea is that I don't know what the actual indices are, so the intersection could be empty, in which case I'd simply get a concatenation of both DataFrames.
You can concatenate along the columns, fill missing values with 0, and sum across each row:
>>> pd.concat([df1, df2], axis=1).fillna(0).sum(axis=1)
1 1
2 2
3 2
4 2
5 2
6 1
dtype: float64
If you want it as a DataFrame, simply do
pd.DataFrame({
'A': pd.concat([df1, df2], axis=1).fillna(0).sum(axis=1)})
(Also, note that if you need to do this just for a specific Series A, just use
pd.concat([df1.A, df2.A], axis=1).fillna(0).sum(axis=1)
)
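A related spelling worth knowing: DataFrame.add with a fill_value performs the alignment and the sum in one call (a minimal sketch; the result dtype comes back float because the alignment introduces NaN before the fill):
import pandas as pd

df1 = pd.DataFrame([1, 1, 1, 1, 1], index=[1, 2, 3, 4, 5], columns=['A'])
df2 = pd.DataFrame([1, 1, 1, 1, 1], index=[2, 3, 4, 5, 6], columns=['A'])
# add() aligns both frames on the union of their indexes;
# fill_value=0 treats rows missing from one side as zero.
print(df1.add(df2, fill_value=0))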

pandas replace zeros with previous non zero value

I have the following dataframe:
index = range(14)
data = [1, 0, 0, 2, 0, 4, 6, 8, 0, 0, 0, 0, 2, 1]
df = pd.DataFrame(data=data, index=index, columns = ['A'])
How can I fill the zeros with the previous non-zero value using pandas? Is there a fillna that is not just for NaN?
The output should look like:
[1, 1, 1, 2, 2, 4, 6, 8, 8, 8, 8, 8, 2, 1]
(This question was asked before here: Fill zero values of 1d numpy array with last non-zero values, but it asked exclusively for a numpy solution.)
You can use replace with method='ffill'
In [87]: df['A'].replace(to_replace=0, method='ffill')
Out[87]:
0 1
1 1
2 1
3 2
4 2
5 4
6 6
7 8
8 8
9 8
10 8
11 8
12 2
13 1
Name: A, dtype: int64
To get a numpy array, work on values:
In [88]: df['A'].replace(to_replace=0, method='ffill').values
Out[88]: array([1, 1, 1, 2, 2, 4, 6, 8, 8, 8, 8, 8, 2, 1], dtype=int64)
This is a better answer than the previous one, since the previous answer returns a dataframe in which the zero values are merely hidden. Instead, use the following line of code:
df['A'].mask(df['A'] == 0).ffill(downcast='infer')
This resolves the problem: it replaces every 0 with the previous non-zero value.
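For reference, a runnable sketch of the mask-and-ffill approach. In recent pandas versions the method= argument of replace and the downcast= argument of ffill are deprecated, so this spelling avoids both; the astype is safe here because the series does not start with a zero:
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0, 2, 0, 4, 6, 8, 0, 0, 0, 0, 2, 1]})
# mask() turns zeros into NaN, ffill() carries the last non-zero
# value forward, astype(int) restores the integer dtype.
filled = df['A'].mask(df['A'] == 0).ffill().astype(int)
print(filled.tolist())  # [1, 1, 1, 2, 2, 4, 6, 8, 8, 8, 8, 8, 2, 1]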
