Is there a function in Python that does what the R fct_lump function does (i.e. to group all groups that are too small into one 'OTHER' group)?
Example below:
library(dplyr)
library(forcats)
> x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
> x
[1] A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A B B B B B B B B
[49] B B C C C C C D D D D D D D D D D D D D D D D D D D D D D D D D D D E F G H I
Levels: A B C D E F G H I
> x %>% fct_lump_n(3)
[1] A A A A A A A A A A A A A A A A
[17] A A A A A A A A A A A A A A A A
[33] A A A A A A A A B B B B B B B B
[49] B B Other Other Other Other Other D D D D D D D D D
[65] D D D D D D D D D D D D D D D D
[81] D D Other Other Other Other Other
Levels: A B D Other
pip install siuba
#(run this in a Python or Anaconda prompt/shell)
#use the library as:
from siuba.dply.forcats import fct_lump, fct_reorder
#just like R's fct_lump:
df['Your_column'] = fct_lump(df['Your_column'], n=10)
df['Your_column'].value_counts() # check your levels
#it reduces the levels to 10 and lumps all the others into 'Other'
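If you would rather avoid an extra dependency, here is a minimal pandas-only sketch of the same idea (the helper name lump_n is my own, not from any library, and it assumes a plain string/object column rather than a Categorical):

import pandas as pd

def lump_n(s: pd.Series, n: int, other_level: str = 'Other') -> pd.Series:
    # keep the n most frequent values and replace everything else with other_level
    keep = s.value_counts().nlargest(n).index
    return s.where(s.isin(keep), other_level)

# usage, analogous to the siuba call above:
# df['Your_column'] = lump_n(df['Your_column'], n=10)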
You may also want to try datar:
>>> from datar.all import factor, rep, LETTERS, c, fct_lump_n, fct_count
>>>
>>> x = factor(rep(LETTERS[:9], times=c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
>>> x >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 C 5
3 D 27
4 E 1
5 F 1
6 G 1
7 H 1
8 I 1
>>> x >> fct_lump_n(3) >> fct_count()
f n
<category> <int64>
0 A 40
1 B 10
2 D 27
3 Other 10
Disclaimer: I am the author of the datar package.
I searched the internet for a solution to my problem, but I could not find one.
I have the following dataframe:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
and I want to append to the existing dataframe the following dataframe:
pos1 pos2 pos3
0 A B C
1 A B C
2 A B C
3 D E F
4 D E F
5 D E F
6 G H I
7 G H I
8 G H I
So that I get the following dataframe:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
9 A B C
10 A B C
11 A B C
12 D E F
13 D E F
14 D E F
15 G H I
16 G H I
17 G H I
I know that the number of rows is always a multiple of the number of columns. That means if I have 4 columns, the number of rows will be 4, 8, 12, 16, etc. In my example there are 3 columns and 9 rows.
What I then want to do is transpose the rows into columns, but only in blocks of that many rows. So I want the first 3 rows to be transposed against the columns, then the next 3 rows, and so forth.
I now have the following code:
import pandas as pd
import io
s = """pos1 pos2 pos3
A A A
B B B
C C C
D D D
E E E
F F F
G G G
H H H
I I I
"""
df = pd.read_csv(io.StringIO(s), delim_whitespace=True)
final_df = df.copy()
index_values = final_df.index.values
value = 0
while value < len(df.index):
    sub_df = df[value:value+3]
    sub_df.columns = index_values[value: value + 3]
    sub_df = sub_df.T
    sub_df.columns = df.columns
    final_df = pd.concat([final_df, sub_df])
    value += len(df.columns)
final_df = final_df.reset_index(drop=True)
print(final_df)
The code that I now have is slow because of the loop.
Is it possible to obtain the same result without using a loop?
You can use the underlying numpy array with ravel and reshape with the order='F' parameter (column-major order) and the pandas.DataFrame constructor.
Then concat the output with the original dataframe:
pd.concat([df,
           pd.DataFrame(df.to_numpy().ravel().reshape(df.shape, order='F'),
                        columns=df.columns)
           ], ignore_index=True)
Output:
pos1 pos2 pos3
0 A A A
1 B B B
2 C C C
3 D D D
4 E E E
5 F F F
6 G G G
7 H H H
8 I I I
9 A D G
10 A D G
11 A D G
12 B E H
13 B E H
14 B E H
15 C F I
16 C F I
17 C F I
This is somewhat efficient if you want to use pandas only.
for value in range(1, len(df.index) // 3 + 1):
    # take the first column of each block of 3 rows and append it 3 times as new rows
    row = df.iloc[(value * 3) - 3:value * 3, 0:1].T.values[0]
    df.loc[len(df)] = row
    df.loc[len(df)] = row
    df.loc[len(df)] = row
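Another option (my own sketch, not from either answer above) is a loop-free block-wise transpose using a reshape and an axis swap, assuming, as the question states, that the row count is a multiple of the column count:

import pandas as pd

n = len(df.columns)                       # block size, assumed to divide the row count
blocks = df.to_numpy().reshape(-1, n, n)  # one (n x n) block per n rows
flipped = blocks.transpose(0, 2, 1)       # transpose each block individually
out = pd.concat([df, pd.DataFrame(flipped.reshape(-1, n), columns=df.columns)],
                ignore_index=True)

For the example data this appends the rows A B C, A B C, A B C, D E F, ... as described in the question.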
Suppose my pandas dataframe has 3 categories for variable X: [A, B, C] and 2 categories for variable Y: [D, E]. I want to cross-tab this, with something like:
+--------+----------------------+-----+
| X/Y | D | E |
+--------+----------------------+-----+
| A or B | count(X=A or B, Y=D) | ... |
| C      | count(X=C, Y=D)      | ... |
+--------+----------------------+-----+
Is this what you are looking for?
import pandas as pd
import numpy as np
x = np.random.choice(['A', 'B', 'C'], size=10)
y = np.random.choice(['D', 'E'], size=10)
df = pd.DataFrame({'X':x, 'Y':y})
df.head()
Output:
X Y
0 A D
1 B D
2 B E
3 B D
4 A E
Dataframe modifications:
df['X'] = df['X'].apply(lambda x: 'A or B' if x == 'A' or x == 'B' else x)
Crosstab application:
pd.crosstab(df.X, df.Y)
Output:
Y D E
X
A or B 1 3
C 4 2
You can use pandas.pivot_table() for this purpose. This should do the trick; df refers to the input dataframe.
import numpy as np
df["catX"]=np.where(df["X"].isin(["A","B"]), "AB", np.where(df["X"]=="C", "C", "other"))
df2=df.pivot_table(index="catX", columns="Y", aggfunc='count', values="X")
Sample output:
#input - df with extra categorical column - catX
X Y catX
0 A D AB
1 B D AB
2 C E C
3 B E AB
4 C D C
5 B D AB
6 C D C
7 A E AB
8 A D AB
9 A E AB
10 C E C
11 C E C
12 A E AB
#result:
Y D E
catX
AB 4 4
C 2 3
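One small caveat (my addition, not part of the answer above): if some catX/Y combination never occurs, pivot_table leaves a NaN in that cell; passing fill_value=0 gives a zero count instead:

df2 = df.pivot_table(index="catX", columns="Y", aggfunc="count", values="X", fill_value=0)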
I have a huge dataset with some repeated data (a user log file) and would like to do pattern-occurrence recognition and recommendation based on the user's downloads. Once the patterns are recognized, I need to recommend the best possible value to the user.
For example following are the download logs based on time:
A C D F A C D A B D A C D F A C D A B D
I would like to recognize the pattern that exists between this dataset and display the result as:
A -> C = 4
C -> D = 4
D -> F = 2
F -> A = 2
D -> A = 3
A -> B = 1
B -> D = 1
A -> C -> D = 2
C -> D -> F = 2
D -> F -> A = 1
F -> A -> C = 1
C -> D -> A = 1
D -> A -> B = 1
A -> B -> D = 1
The number at the end represents the number of repetitions of that pattern.
When the user inputs "A", the best recommendation should be "C", and if the user inputs "A -> C", then it should be "D".
Currently I am doing data cleaning using pandas in Python and for pattern recognition, I think scikit-learn might work (not sure though).
Is there any good library or algorithm that I can make use of for this problem, or any good approach for this kind of problem?
Since the data size is very big, I am implementing it using Python.
The current problem can be easily solved with n-grams. You can use CountVectorizer to find the n-grams and their frequencies in the text and generate the output you want.
from sklearn.feature_extraction.text import CountVectorizer
# Changed the token_pattern to identify single-letter words
# ngram_range=(2, 5) identifies everything from 2-grams up to 5-grams
cv = CountVectorizer(ngram_range=(2, 5), token_pattern=r"(?u)\b\w\b",
                     lowercase=False)
# Wrapped the data in a list, because CountVectorizer requires an iterable
data = ['A C D F A C D A B D A C D F A C D A B D']
# Learn about the data
cv.fit(data)
# This is just to prettify the printing
import pandas as pd
df = pd.DataFrame(cv.get_feature_names(), columns = ['pattern'])
# Add the frequencies
df['count'] = cv.transform(data).toarray()[0] #<== Changing to dense matrix
df
#Output
pattern count
A B 2
A B D 2
A B D A 1
A B D A C 1
A C 4
A C D 4
A C D A 2
A C D A B 2
A C D F 2
A C D F A 2
B D 2
B D A 1
B D A C 1
B D A C D 1
C D 4
C D A 2
C D A B 2
C D A B D 2
C D F 2
C D F A 2
C D F A C 2
D A 3
D A B 2
D A B D 2
D A B D A 1
D A C 1
D A C D 1
D A C D F 1
D F 2
D F A 2
D F A C 2
D F A C D 2
F A 2
F A C 2
F A C D 2
F A C D A 2
But I would also recommend that you try recommender systems, pattern mining, and association-rule mining (Apriori) algorithms, etc., which will help you more.
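For the recommendation step the question asks about, here is a rough sketch (my own helper, not part of CountVectorizer) that picks the most frequent pattern extending a given prefix by one item, using the pattern/count dataframe built above:

def recommend(prefix, patterns):
    # patterns is the dataframe with 'pattern' and 'count' columns built above;
    # prefix is a space-separated history such as 'A' or 'A C'
    cand = patterns[patterns['pattern'].str.startswith(prefix + ' ')]
    # keep only patterns exactly one item longer than the prefix
    n_items = len(prefix.split()) + 1
    cand = cand[cand['pattern'].str.split().str.len() == n_items]
    if cand.empty:
        return None
    best = cand.loc[cand['count'].idxmax(), 'pattern']
    return best.split()[-1]

recommend('A', df)    # -> 'C'  (A C occurs 4 times)
recommend('A C', df)  # -> 'D'  (A C D occurs 4 times)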
I have a pandas dataframe as
df = pd.DataFrame(np.random.randn(6,4), index=list('EFGHIJ'), columns=list('ABCD'))
df
A B C D
E -0.585995 1.325598 -1.172405 -2.810322
F -2.282079 -1.203231 -0.304155 -0.119221
G -0.739126 1.114628 0.381701 -0.485394
H 1.162010 -1.472594 1.767941 1.450582
I 0.119481 0.097139 -0.091432 -0.415333
J 1.266389 0.875473 1.787459 -1.149971
How can I flatten this array, whilst keeping the column and index IDs as here:
E A -0.585995
E B 1.325598
E C -1.172405
E D -2.810322
F A ...
F B ...
...
...
J D -1.149971
It doesn't matter what order the values occur in...
df.values.flatten() can be used to flatten the values into a 1D array, but then I lose the corresponding index and column labels...
Use stack + reset_index:
df = df.stack().reset_index()
df.columns = ['a','b','c']
print (df)
a b c
0 E A -0.585995
1 E B 1.325598
2 E C -1.172405
3 E D -2.810322
4 F A -2.282079
5 F B -1.203231
6 F C -0.304155
7 F D -0.119221
8 G A -0.739126
9 G B 1.114628
10 G C 0.381701
11 G D -0.485394
12 H A 1.162010
13 H B -1.472594
14 H C 1.767941
15 H D 1.450582
16 I A 0.119481
17 I B 0.097139
18 I C -0.091432
19 I D -0.415333
20 J A 1.266389
21 J B 0.875473
22 J C 1.787459
23 J D -1.149971
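A minor variation (my addition, not part of the answer above): the column names can be set in the same chain instead of assigning df.columns afterwards:

df = df.stack().rename_axis(['a', 'b']).reset_index(name='c')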
Numpy solution with numpy.tile + numpy.repeat + numpy.ravel:
b = np.tile(df.columns, len(df.index))
a = np.repeat(df.index, len(df.columns))
c = df.values.ravel()
df = pd.DataFrame({'a':a, 'b':b, 'c':c})
print (df)
a b c
0 E A -0.585995
1 E B 1.325598
2 E C -1.172405
3 E D -2.810322
4 F A -2.282079
5 F B -1.203231
6 F C -0.304155
7 F D -0.119221
8 G A -0.739126
9 G B 1.114628
10 G C 0.381701
11 G D -0.485394
12 H A 1.162010
13 H B -1.472594
14 H C 1.767941
15 H D 1.450582
16 I A 0.119481
17 I B 0.097139
18 I C -0.091432
19 I D -0.415333
20 J A 1.266389
21 J B 0.875473
22 J C 1.787459
23 J D -1.149971
Timings:
In [103]: %timeit (df.stack().reset_index())
1000 loops, best of 3: 1.26 ms per loop
In [104]: %timeit (pd.DataFrame({'a':np.repeat(df.index, len(df.columns)), 'b':np.tile(df.columns, len(df.index)), 'c':df.values.ravel()}))
1000 loops, best of 3: 436 µs per loop
I have data in this format
ID Val
1 A
1 B
1 C
2 A
2 C
2 D
I want to group the data by ID, see which combinations of values exist within each ID, and sum up the repeated combinations. The resulting output should look like
v1 v2 count
A B 1
A C 2
A D 1
B C 1
C D 1
Is there a smart way to get this instead of looping through each possible combination?
This should work:
>>> ts = df.groupby('Val')['ID'].aggregate(lambda ts: set(ts))
>>> ts
Val
A set([1, 2])
B set([1])
C set([1, 2])
D set([2])
Name: ID, dtype: object
>>> from itertools import product
>>> pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
... columns=['v1', 'v2', 'count'])
v1 v2 count
0 A B 1
1 A C 2
2 A D 1
3 B C 1
4 B D 0
5 C D 1
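If the zero-count pairs (such as B D above) should not appear, to match the desired output exactly, the result can simply be filtered afterwards (a small tweak on my part, not in the original answer):

res = pd.DataFrame([[i, j, len(ts[i] & ts[j])] for i, j in product(ts.index, ts.index) if i < j],
                   columns=['v1', 'v2', 'count'])
res = res[res['count'] > 0].reset_index(drop=True)  # drop pairs that never co-occur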
What I came up with:
Use pd.merge to create the cartesian product
Filter the cartesian product to include only combinations of the form that you desire
Count the number of combinations
Convert to the desired dataframe format
Unsure if it is faster than looping through all possible combinations.
#!/usr/bin/env python2.7
# encoding: utf-8
'''
'''
import pandas as pd
from itertools import izip
# Create the dataframe
df = pd.DataFrame([
    [1, 'A'],
    [1, 'B'],
    [1, 'C'],
    [2, 'A'],
    [2, 'C'],
    [2, 'D'],
], columns=['ID', 'Val'])
'''
ID Val
0 1 A
1 1 B
2 1 C
3 2 A
4 2 C
5 2 D
[6 rows x 2 columns]
'''
# Create the cartesian product
df2 = pd.merge(df, df, on='ID')
'''
ID Val_x Val_y
0 1 A A
1 1 A B
2 1 A C
3 1 B A
4 1 B B
5 1 B C
6 1 C A
7 1 C B
8 1 C C
9 2 A A
10 2 A C
11 2 A D
12 2 C A
13 2 C C
14 2 C D
15 2 D A
16 2 D C
17 2 D D
[18 rows x 3 columns]
'''
# Count the pairs, filtering out self-pairs like (A, A) and reversed duplicates like (B, A).
counts = pd.Series([
    v for v in izip(df2.Val_x, df2.Val_y)
    if v[0] != v[1] and v[0] < v[1]
]).value_counts(sort=False).sort_index()
'''
(A, B) 1
(A, C) 2
(A, D) 1
(B, C) 1
(C, D) 1
dtype: int64
'''
# Combine the counts
df3 = pd.DataFrame(dict(
    v1=[v1 for v1, _ in counts.index],
    v2=[v2 for _, v2 in counts.index],
    count=counts.values
))
'''
count v1 v2
0 1 A B
1 2 A C
2 1 A D
3 1 B C
4 1 C D
'''