I want to write code that outputs the repetition count of each distinct value in the array a, and then build a pandas DataFrame to print the result. The sums line below does not work; how can I fix it to get the expected output?
import numpy as np
import pandas as pd
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
uniques = np.unique(a)
sums = np.sum(uniques[:-1]==a[:-1])
Expected Output:
Value Repetition Count
1 1
3 3
12 3
22 1
43 4
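For reference: the line sums = np.sum(uniques[:-1]==a[:-1]) compares arrays of different lengths (uniques holds one entry per distinct value, a one per element), so it cannot produce per-value counts. A minimal NumPy-only sketch that does, using np.unique with return_counts=True:

import numpy as np
import pandas as pd

a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
# return_counts=True yields the repetition count of each unique value
values, counts = np.unique(a, return_counts=True)
print(pd.DataFrame({'Value': values, 'Repetition Count': counts}))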
Define a DataFrame df from the array a, then use .groupby() + .size() to get the count of each unique value, as follows:
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
df = pd.DataFrame({'Value': a})
df.groupby('Value').size().reset_index(name='Repetition Count')
Result:
Value Repetition Count
0 1 1
1 3 3
2 12 3
3 22 1
4 43 4
Edit
If you also want the percentages of the counts, you can use:
(df.groupby('Value', as_index=False)
   .agg(**{'Repetition Count': ('Value', 'size'),
           'Percent': ('Value', lambda x: round(x.size / len(a) * 100, 2))})
)
Result:
Value Repetition Count Percent
0 1 1 8.33
1 3 3 25.00
2 12 3 25.00
3 22 1 8.33
4 43 4 33.33
Or use .value_counts() with normalize=True:
pd.Series(a).value_counts(normalize=True).mul(100)
Result:
43 33.333333
12 25.000000
3 25.000000
22 8.333333
1 8.333333
dtype: float64
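Counts and percentages can also be assembled into a single frame; a sketch along the same lines (column names are illustrative):

s = pd.Series(a)
# both Series share the unique values as index, so the frame aligns them
out = pd.DataFrame({'Repetition Count': s.value_counts(),
                    'Percent': s.value_counts(normalize=True).mul(100).round(2)}).sort_index()
print(out)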
You can use groupby:
>>> pd.Series(a).groupby(a).count()
1 1
3 3
12 3
22 1
43 4
dtype: int64
Or value_counts():
>>> pd.Series(a).value_counts().sort_index()
1 1
3 3
12 3
22 1
43 4
dtype: int64
It is easiest to make a pandas DataFrame from the NumPy array and then use value_counts().
df = pd.DataFrame(data=a, columns=['col1'])
print(df.col1.value_counts())
43 4
12 3
3 3
22 1
1 1
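To reproduce the question's two-column layout exactly, the value_counts result can be reshaped; a small sketch using rename_axis and reset_index:

out = (pd.Series(a).value_counts()
         .sort_index()
         .rename_axis('Value')
         .reset_index(name='Repetition Count'))
print(out)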
I want to make new columns calculated from existing columns.
For example, given df:
no data1 data2
1 10 15
2 51 46
3 36 20
...
I want to produce new_df:
no data1 data2 data1/-2 data1/2 data2/-2 data2/2
1 10 15 -5 5 -7.5 7.5
2 51 46 -25.5 25.5 -23 23
3 36 20 -18 18 -10 10
but I don't know how to do this as efficiently as possible.
To create a new df column based on calculations over two or more other columns, define the new column and set it equal to your expression. For example:
df['new_col'] = df['col_1'] * df['col_2']
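Applied to the frame in the question, a minimal sketch that builds all four derived columns in a loop (data and labels copied from the question):

import pandas as pd

df = pd.DataFrame({'no': [1, 2, 3], 'data1': [10, 51, 36], 'data2': [15, 46, 20]})
# each derived column is a single vectorized division over an existing one
for col in ['data1', 'data2']:
    df[col + '/-2'] = df[col] / -2
    df[col + '/2'] = df[col] / 2
print(df)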
Is this what you mean?
import pandas as pd
number = [[1,2],[3,4],[5,6],[7,8],[9,10]]
df = pd.DataFrame(number)
df['Data 1/2'] = df[0] / df[1]
And the output:
0 1 Data 1/2
0 1 2 0.500000
1 3 4 0.750000
2 5 6 0.833333
3 7 8 0.875000
4 9 10 0.900000
I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
0 1 2 3
0 0 5 15 30
1 1 7 18 34
2 2 9 21 38
3 3 11 24 42
4 4 13 27 46
For instance, I want to return the first column whose value is greater than 20.
Desired output:
3
3
2
2
2
Many thanks as always!
Try with idxmax
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
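One caveat worth noting (an aside, not part of the answer above): when no value in a row exceeds the threshold, gt(20).idxmax(1) silently returns the first column label, because the maximum of an all-False row is its first element. A sketch that marks such rows as NaN instead, reusing df from the question:

mask = df.gt(20)
# rows with no True would otherwise report column 0; blank them out
print(mask.idxmax(axis=1).where(mask.any(axis=1)))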
Not as short as @YOBEN_S's answer, but chaining index.get_loc and first_valid_index also works:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
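A vectorized alternative (an aside) pushes the search down to NumPy and avoids the row-wise apply; note it returns column positions rather than labels and shares the all-False caveat mentioned above:

# argmax finds the position of the first True in each row
print((df.to_numpy() > 20).argmax(axis=1))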
After a groupby, when using agg, a dict of columns:functions can be passed and the functions will be applied to the corresponding columns. However, this syntax doesn't work with transform. Is there another way to apply several functions with transform?
Let's give an example:
import pandas as pd
df_test = pd.DataFrame([[1,2,3],[1,20,30],[2,30,50],[1,2,33],[2,4,50]],columns = ['a','b','c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
    return series.mean()

def my_fct2(series):
    return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different functions to different columns with agg, but if we want to transform the columns without aggregating them, agg can no longer be used. Therefore:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
You can still use a dict, but with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
       .reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if/else to check the column names:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: x.cumsum() if x.name=='b' else x.cumprod())\
       .reset_index()
As of pandas 0.20.2, transform is not implemented with a dict of column names to functions the way agg is.
If the functions return Series of the same length, agg works:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the aggregation returns a different length, a join is needed:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
With newer versions of pandas, you can use the assign method along with transform to either append new columns or replace existing columns with new values:
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
               c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
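For completeness, a dictionary-driven variant of the same idea (a sketch: build each transformed column separately and reassemble with pd.concat):

g = df_test.groupby('a')
# each per-column transform keeps the original index, so concat realigns them
out = df_test[['a']].join(pd.concat({'b': g['b'].cumsum(),
                                     'c': g['c'].cumprod()}, axis=1))
print(out)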
I have a DataFrame with column names in the shape of x.y, where I would like to sum up all columns with the same value on x without having to explicitly name them. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: you can extract the prefix from the column names and use it as the grouping variable:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
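Note that grouping with axis=1 is deprecated in recent pandas versions (2.1 onwards, to my knowledge); an equivalent sketch transposes, groups the rows by prefix, and transposes back:

prefixes = df.columns.str.split('.').str[0]
# group the transposed rows by prefix, then restore the original orientation
print(df.T.groupby(prefixes).sum().T)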
You can first create a MultiIndex by splitting the column names, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
x y
1 2 8 92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7
Does Pandas contain an easy method to apply a mapper one row at a time?
For example:
import pandas as pd
df = pd.DataFrame(
    [[j + (3*i) for j in range(3)] for i in range(4)],
    columns=['a','b','c']
)
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
And then apply some mapper (in pseudocode)
df_ret = df.rowmap(lambda d: d['a'] + d['c'])
print(df_ret)
0
0 2
1 8
2 14
3 20
Note, adding numbers really isn't the point here. The point is to have a row-wise mapper.
You can use apply with parameter axis=1:
df_ret = df.apply(lambda d: d['a'] + d['c'], axis=1)
print(df_ret)
0 2
1 8
2 14
3 20
dtype: int64
but it is faster to use vectorized solutions:
print (df.a + df.c)
0 2
1 8
2 14
3 20
print (df.a.add(df.c))
0 2
1 8
2 14
3 20
dtype: int64
print (df[['a','c']].sum(axis=1))
0 2
1 8
2 14
3 20
dtype: int64
The fastest solution is DataFrame.add (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.add.html), as it is internally optimized.
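To check the speed claim on your own data, a quick timing sketch (illustrative only; absolute numbers depend on machine and data size):

import timeit
import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.randint(0, 100, size=(100000, 3)), columns=['a', 'b', 'c'])
# compare the row-wise apply against the vectorized addition
t_apply = timeit.timeit(lambda: big.apply(lambda d: d['a'] + d['c'], axis=1), number=3)
t_add = timeit.timeit(lambda: big['a'].add(big['c']), number=3)
print('apply: %.3fs  add: %.3fs' % (t_apply, t_add))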