I want to write code that outputs the repetition count of each distinct value in the array a, and then build a pandas DataFrame to print the result. The sums line below does not work; how can I fix it to get the expected output?
import numpy as np
import pandas as pd
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
uniques = np.unique(a)
sums = np.sum(uniques[:-1]==a[:-1])
Expected Output:
Value Repetition Count
1 1
3 3
12 3
22 1
43 4
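For reference: the line sums = np.sum(uniques[:-1]==a[:-1]) compares arrays of different lengths (uniques holds one entry per distinct value, a one per element), so it cannot produce per-value counts. A minimal NumPy-only sketch that does, using np.unique with return_counts=True:

import numpy as np
import pandas as pd

a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
# return_counts=True yields the repetition count of each unique value
values, counts = np.unique(a, return_counts=True)
print(pd.DataFrame({'Value': values, 'Repetition Count': counts}))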
Define a DataFrame df from the array a, then use .groupby() + .size() to get the count of each unique value, as follows:
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
df = pd.DataFrame({'Value': a})
df.groupby('Value').size().reset_index(name='Repetition Count')
Result:
Value Repetition Count
0 1 1
1 3 3
2 12 3
3 22 1
4 43 4
Edit
If you also want the percentages of the counts, you can use:
(df.groupby('Value', as_index=False)
   .agg(**{'Repetition Count': ('Value', 'size'),
           'Percent': ('Value', lambda x: round(x.size / len(a) * 100, 2))})
)
Result:
Value Repetition Count Percent
0 1 1 8.33
1 3 3 25.00
2 12 3 25.00
3 22 1 8.33
4 43 4 33.33
Or use .value_counts() with normalize=True:
pd.Series(a).value_counts(normalize=True).mul(100)
Result:
43 33.333333
12 25.000000
3 25.000000
22 8.333333
1 8.333333
dtype: float64
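Counts and percentages can also be assembled into a single frame; a sketch along the same lines (column names are illustrative):

s = pd.Series(a)
# both Series share the unique values as index, so the frame aligns them
out = pd.DataFrame({'Repetition Count': s.value_counts(),
                    'Percent': s.value_counts(normalize=True).mul(100).round(2)}).sort_index()
print(out)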
You can use groupby:
>>> pd.Series(a).groupby(a).count()
1 1
3 3
12 3
22 1
43 4
dtype: int64
Or value_counts():
>>> pd.Series(a).value_counts().sort_index()
1 1
3 3
12 3
22 1
43 4
dtype: int64
It is easiest to make a pandas DataFrame from the NumPy array and then use value_counts().
df = pd.DataFrame(data=a, columns=['col1'])
print(df.col1.value_counts())
43 4
12 3
3 3
22 1
1 1
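To reproduce the question's two-column layout exactly, the value_counts result can be reshaped; a small sketch using rename_axis and reset_index:

out = (pd.Series(a).value_counts()
         .sort_index()
         .rename_axis('Value')
         .reset_index(name='Repetition Count'))
print(out)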
I want to make new columns calculated from existing columns.
For example, given df:
no data1 data2
1 10 15
2 51 46
3 36 20
...
I want to produce new_df:
no data1 data2 data1/-2 data1/2 data2/-2 data2/2
1 10 15 -5 5 -7.5 7.5
2 51 46 -25.5 25.5 -23 23
3 36 20 -18 18 -10 10
but I don't know how to do this as efficiently as possible.
To create a new df column based on calculations over two or more other columns, define the new column and set it equal to your expression. For example:
df['new_col'] = df['col_1'] * df['col_2']
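Applied to the frame in the question, a minimal sketch that builds all four derived columns in a loop (data and labels copied from the question):

import pandas as pd

df = pd.DataFrame({'no': [1, 2, 3], 'data1': [10, 51, 36], 'data2': [15, 46, 20]})
# each derived column is a single vectorized division over an existing one
for col in ['data1', 'data2']:
    df[col + '/-2'] = df[col] / -2
    df[col + '/2'] = df[col] / 2
print(df)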
Is this what you mean?
import pandas as pd
number = [[1,2],[3,4],[5,6],[7,8],[9,10]]
df = pd.DataFrame(number)
df['Data 1/2'] = df[0] / df[1]
And the output:
0 1 Data 1/2
0 1 2 0.500000
1 3 4 0.750000
2 5 6 0.833333
3 7 8 0.875000
4 9 10 0.900000
I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
0 1 2 3
0 0 5 15 30
1 1 7 18 34
2 2 9 21 38
3 3 11 24 42
4 4 13 27 46
For instance, I want to return the first column whose value is greater than 20.
Desired output:
3
3
2
2
2
Many thanks as always!
Try with idxmax
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
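One caveat worth noting (an aside, not part of the answer above): when no value in a row exceeds the threshold, gt(20).idxmax(1) silently returns the first column label, because the maximum of an all-False row is its first element. A sketch that marks such rows as NaN instead, reusing df from the question:

mask = df.gt(20)
# rows with no True would otherwise report column 0; blank them out
print(mask.idxmax(axis=1).where(mask.any(axis=1)))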
Not as short as @YOBEN_S's answer, but chaining index.get_loc and first_valid_index also works:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
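A vectorized alternative (an aside) pushes the search down to NumPy and avoids the row-wise apply; note it returns column positions rather than labels and shares the all-False caveat mentioned above:

# argmax finds the position of the first True in each row
print((df.to_numpy() > 20).argmax(axis=1))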
After a groupby, when using agg, a dict of columns:functions can be passed and the functions will be applied to the corresponding columns. However, this syntax doesn't work with transform. Is there another way to apply several functions with transform?
Let's give an example:
import pandas as pd
df_test = pd.DataFrame([[1,2,3],[1,20,30],[2,30,50],[1,2,33],[2,4,50]],columns = ['a','b','c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
    return series.mean()

def my_fct2(series):
    return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different functions to different columns with agg, but if we want to transform the columns without aggregating them, agg can no longer be used. Therefore:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
You can still use a dict, but with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
       .reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if/else to check the column names:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: x.cumsum() if x.name=='b' else x.cumprod())\
       .reset_index()
As of pandas 0.20.2, transform is not implemented with a dict of column names to functions the way agg is.
If the functions return Series of the same length, agg works:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the aggregation returns a different length, a join is needed:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
With newer versions of pandas, you can use the assign method along with transform to either append new columns or replace existing columns with new values:
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
               c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
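For completeness, a dictionary-driven variant of the same idea (a sketch: build each transformed column separately and reassemble with pd.concat):

g = df_test.groupby('a')
# each per-column transform keeps the original index, so concat realigns them
out = df_test[['a']].join(pd.concat({'b': g['b'].cumsum(),
                                     'c': g['c'].cumprod()}, axis=1))
print(out)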
I have a DataFrame with column names in the shape of x.y, where I would like to sum up all columns with the same value on x without having to explicitly name them. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names and how they should group.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: you can extract the prefix from the column names and use it as the grouping variable:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
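Note that grouping with axis=1 is deprecated in recent pandas versions (2.1 onwards, to my knowledge); an equivalent sketch transposes, groups the rows by prefix, and transposes back:

prefixes = df.columns.str.split('.').str[0]
# group the transposed rows by prefix, then restore the original orientation
print(df.T.groupby(prefixes).sum().T)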
You can first create a MultiIndex by splitting the column names, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
x y
1 2 8 92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7
Does Pandas contain an easy method to apply a mapper one row at a time?
For example:
import pandas as pd
df = pd.DataFrame(
    [[j + (3*i) for j in range(3)] for i in range(4)],
    columns=['a','b','c']
)
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
And then apply some mapper (in pseudocode)
df_ret = df.rowmap(lambda d: d['a'] + d['c'])
print(df_ret)
0
0 2
1 8
2 14
3 20
Note, adding numbers really isn't the point here. The point is to have a row-wise mapper.
You can use apply with parameter axis=1:
df_ret = df.apply(lambda d: d['a'] + d['c'], axis=1)
print(df_ret)
0 2
1 8
2 14
3 20
dtype: int64
but it is faster to use vectorized solutions:
print (df.a + df.c)
0 2
1 8
2 14
3 20
print (df.a.add(df.c))
0 2
1 8
2 14
3 20
dtype: int64
print (df[['a','c']].sum(axis=1))
0 2
1 8
2 14
3 20
dtype: int64
The fastest solution is DataFrame.add (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.add.html), as it is internally optimized.
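To check the speed claim on your own data, a quick timing sketch (illustrative only; absolute numbers depend on machine and data size):

import timeit
import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.randint(0, 100, size=(100000, 3)), columns=['a', 'b', 'c'])
# compare the row-wise apply against the vectorized addition
t_apply = timeit.timeit(lambda: big.apply(lambda d: d['a'] + d['c'], axis=1), number=3)
t_add = timeit.timeit(lambda: big['a'].add(big['c']), number=3)
print('apply: %.3fs  add: %.3fs' % (t_apply, t_add))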