pandas groupby operation with missing data


In a pandas dataframe I have a column that looks like:
0 M
1 E
2 L
3 M.1
4 M.2
5 M.3
6 E.1
7 E.2
8 E.3
9 E.4
10 L.1
11 L.2
12 M.1.a
13 M.1.b
14 M.1.c
15 M.2.a
16 M.3.a
17 E.1.a
18 E.1.b
19 E.1.c
20 E.2.a
21 E.3.a
22 E.3.b
23 E.4.a
I need to group all the values whose first element is E, M, or L and then, for each group, create a subgroup keyed by the index 1, 2, or 3 that contains one record for each lowercase letter (a, b, c, ...).
Ideally the solution should work for any number of concatenated levels (in this case there are 3 levels, e.g. A.1.a).
0  1  2
E  1  a
      b
      c
   2  a
   3  a
      b
   4  a
L  1
   2
M  1  a
      b
      c
   2  a
   3  a
I tried with:
df.groupby([0,1,2]).count()
But the result is missing the L level because it doesn't have records at the last sub-level
A workaround is to add a dummy variable and then remove it ... like:
df[2][(df[0]=='L') & (df[2].isnull()) & (df[1].notnull())]='x'
df = df.replace(np.nan,' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0,1,2]).count()
which gives:
0  1  2
E  1  a
      b
      c
   2  a
   3  a
      b
   4  a
L  1  x
   2  x
M  1  a
      b
      c
   2  a
   3  a
I then deal with the dummy entry x later in my code.
How can I avoid this hackish way of using groupby?

Assuming the column under consideration is represented by s, we can:
Split on the "." delimiter with expand=True to produce an expanded DataFrame.
Define fnc: it checks whether all elements of the grouped values are None and, if so, replaces them with a dummy entry "" via a list comprehension. The Series constructor is then called on the resulting list, and any remaining None values are removed with dropna.
Perform a groupby on columns 0 and 1 and apply fnc to column 2.
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)
produces:
0  1
E  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
      2    b
   4  1    a
L  1  0
   2  0
M  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
Name: 2, dtype: object
To obtain a flattened DataFrame, reset the index levels that were used for grouping and drop the remaining inner index:
split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)
produces:
0 1 2
0 E 1 a
1 E 1 b
2 E 1 c
3 E 2 a
4 E 3 a
5 E 3 b
6 E 4 a
7 L 1
8 L 2
9 M 1 a
10 M 1 b
11 M 1 c
12 M 2 a
13 M 3 a
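
The question also asks for something that works for any number of levels. A minimal sketch of one way to do that without a dummy entry, assuming that the rows to keep are the "leaf" labels (those that are not a prefix of any other label). The helper name proper_prefixes and the variable names are illustrative, not part of the answer above:

```python
import pandas as pd

labels = ["M", "E", "L", "M.1", "M.2", "M.3", "E.1", "E.2", "E.3", "E.4",
          "L.1", "L.2", "M.1.a", "M.1.b", "M.1.c", "M.2.a", "M.3.a",
          "E.1.a", "E.1.b", "E.1.c", "E.2.a", "E.3.a", "E.3.b", "E.4.a"]
s = pd.Series(labels)


def proper_prefixes(label):
    """Return every strict prefix of a dotted label, e.g. 'M.1.a' -> {'M', 'M.1'}."""
    parts = label.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts))}


# Every label that is a prefix of some other label is an inner node, not a leaf.
all_prefixes = set().union(*(proper_prefixes(v) for v in s))
leaves = s[~s.isin(all_prefixes)]

# Split the leaves into one column per level; shorter labels get "" in the missing levels.
levels = leaves.str.split(".", expand=True).fillna("")
result = levels.sort_values(list(levels.columns)).reset_index(drop=True)
print(result)
```

Because nothing here depends on the number of columns produced by the split, the same code should apply to deeper hierarchies such as A.1.a.x.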

Maybe you have to find a way with regex.
import pandas as pd
df = pd.read_clipboard(header=None).iloc[:, 1]
df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
print(df2.set_index([0, 1]))
and the result is,
     2
0 1
M
E
L
M 1
  2
  3
E 1
  2
  3
  4
L 1
  2
M 1  a
  1  b
  1  c
  2  a
  3  a
E 1  a
  1  b
  1  c
  2  a
  3  a
  3  b
  4  a
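
Since read_clipboard depends on whatever was last copied, here is a small reproducible sketch of the same regex on an inline Series. The sample values are a subset chosen for illustration:

```python
import pandas as pd

s = pd.Series(["M", "E.1", "L.2", "M.1.a", "E.3.b"])

# One capture group per level; the optional groups match an empty string
# when a level is absent, so missing levels come back as "" rather than NaN.
pattern = r"([A-Z])\.?([0-9]?)\.?([a-z]?)"
df2 = s.str.extract(pattern)
print(df2.set_index([0, 1]))
```

Note that this pattern is tied to exactly three levels with single-character parts, so it would need to be adapted for deeper or longer labels.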

Related

Create a dataframe of all combinations of columns names per row based on mutual presence of columns pairs

I'm trying to create a dataframe based on another dataframe and a specific condition.
Given the pandas dataframe above, I'd like a two-column dataframe in which each row is a pair of words that are both different from 0 (i.e. that coexist in a given row), beginning with the first row.
For example, for this part of the image above, the new dataframe that I want looks like the following:
and so on...
Does anyone have a tip on how I can do it? I'm struggling... Thanks!

As you didn't provide a text example, here is a dummy one:
>>> df
A B C D E
0 0 1 1 0 1
1 1 1 1 1 1
2 1 0 0 1 0
3 0 0 0 0 1
4 0 1 1 0 0
You could use a combination of masking, explode and itertools.combinations:
from itertools import combinations
mask = df.gt(0)
series = (mask*df.columns).apply(lambda x: list(combinations(set(x).difference(['']), r=2)), axis=1)
pd.DataFrame(series.explode().dropna().to_list(), columns=['X', 'Y'])
output:
X Y
0 C E
1 C B
2 E B
3 E D
4 E C
5 E B
6 E A
7 D C
8 D B
9 D A
10 C B
11 C A
12 B A
13 A D
14 C B
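
Because the approach above goes through set(), the order of the names inside each pair is not guaranteed. A minimal sketch of a variant that keeps the pairs in column order, using the same dummy data; the row-wise loop is a simplification for readability, not a performance recommendation:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "A": [0, 1, 1, 0, 0],
    "B": [1, 1, 0, 0, 1],
    "C": [1, 1, 0, 0, 1],
    "D": [0, 1, 1, 0, 0],
    "E": [1, 1, 0, 1, 0],
})

# For each row, take the names of the non-zero columns (in column order)
# and emit every 2-combination of them.
pairs = [
    pair
    for _, row in df.iterrows()
    for pair in combinations(row[row > 0].index, 2)
]
result = pd.DataFrame(pairs, columns=["X", "Y"])
print(result)
```

Rows with fewer than two non-zero columns (row 3 here) contribute no pairs.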

Matching two columns from Pandas Dataframe but the order matters

I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match columns A and B, based on column B from df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx', left_by='A', right_by='B', direction='backward', tolerance=2).dropna().drop(labels='idx', axis='columns').reset_index(drop=True)
Gets me what I want.

IIUC this should work:
df_result = df_1.merge(df_2,
left_on=['idx', 'A'], right_on=['idx', 'B'])
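
To see how the exact-key merge compares with the merge_asof call from the question's edit, here is a small sketch on the sample frames. The variable names exact and nearest are illustrative. On this data the exact merge only returns rows where both idx and the value coincide, while merge_asof also accepts a matching value up to 2 positions earlier:

```python
import pandas as pd

df_1 = pd.DataFrame({"idx": range(6), "A": [1, 2, 3, 4, 1, 2], "X": list("ABCDEF")})
df_2 = pd.DataFrame({"idx": range(6), "B": [1, 2, 4, 2, 3, 1], "Y": list("HIJKLM")})

# Exact match on both the position (idx) and the value (A == B).
exact = df_1.merge(df_2, left_on=["idx", "A"], right_on=["idx", "B"])

# Nearest earlier-or-equal idx with the same value, at most 2 positions back.
nearest = pd.merge_asof(
    df_1, df_2, on="idx", left_by="A", right_by="B",
    direction="backward", tolerance=2,
).dropna()

print(exact)
print(nearest)
```

So whether the plain merge is enough depends on whether matching rows always share the same idx; for the desired result shown in the question they do not.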

How can I remove a certain type of values in a group in pandas?

I have the following dataframe which is a small part of a bigger one:
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
I'd like to delete all rows where the last items are "d". So my desired dataframe would look like this:
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
So the point is that a group shouldn't have "d" as its last item.
I have code that deletes the last row of each group where the last item is "d". But in this case I have to run the code twice to delete all the trailing "d"s in group 3, for example.
clean_3 = clean_2[clean_2.groupby('account_num')['trans_cdi'].transform(lambda x: (x.iloc[-1] != "d") | (x.index != x.index[-1]))]
Is there a better solution to this problem?

We can use idxmax on the reversed data ([::-1]) to find the last non-"d" row of each group, and then keep everything up to that index:
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
Testing with consecutive values:
acc_num trans_cdi
0 1 c
1 1 d <--- d between two c, so we need to keep
2 1 c
3 1 d <--- row to be dropped
4 3 d
5 3 c
6 3 d
7 3 d
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
1 1 d
2 1 c
4 3 d
5 3 c
Still gives correct result.
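
A possible variant without apply, sketched under the assumption that "trailing" means every "d" after the last non-"d" row of its group: count the non-"d" rows seen from the end of each group and keep a row only if that count is positive. The names not_d, seen_from_end and keep are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "acc_num": [1, 1, 1, 1, 3, 3, 3, 3],
    "trans_cdi": ["c", "d", "c", "d", "d", "c", "d", "d"],
})

# 1 for rows that are not "d", 0 for rows that are.
not_d = df["trans_cdi"].ne("d").astype(int)

# Walk each group from its end: the running sum stays 0 only on trailing "d" rows.
seen_from_end = not_d.iloc[::-1].groupby(df["acc_num"].iloc[::-1]).cumsum()

# Restore the original row order before using the mask.
keep = seen_from_end.gt(0).sort_index()
result = df[keep]
print(result)
```

Under this reading, a group consisting only of "d" rows is dropped entirely.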

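You can try this not-so-pandorable solution.
def r(x):
    # count the trailing 'd' values, scanning the group from the end
    c = 0
    for v in x['trans_cdi'].iloc[::-1]:
        if v == 'd':
            c = c + 1
        else:
            break
    # slice by length so that c == 0 keeps every row (x.iloc[:-0] would be empty)
    return x.iloc[:len(x) - c]
df.groupby('acc_num', group_keys=False).apply(r)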
acc_num trans_cdi
0 1 c
3 3 d
4 3 c

First, use shift to compare each row with the previous one and check whether both values are equal to 'd'. ~ filters out those rows.
Second, make sure the last row's value is not 'd'. If it is, delete that row.
Code:
df = df[~((df['trans_cdi'] == 'd') & (df.shift(1)['trans_cdi'] == 'd'))]
if df['trans_cdi'].iloc[-1] == 'd': df = df.iloc[0:-1]
df
input (I tested it on more input data to ensure there were no bugs):
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
7 1 d
8 1 d
9 3 c
10 3 c
11 3 d
12 3 d
output:
acc_num trans_cdi
0 1 c
1 1 d
4 3 c
5 3 d
9 3 c
10 3 c

How to get top 5 items for each group in grouped dataframe?

df = pd.DataFrame({'Weekday':list('MMMMMMMMMMTTTTTTTTTT'),
'Items': list("AAABBCDEFGBBBCCADEFG")
})
grouped = df.groupby(['Weekday','Items'],sort=True).agg({'Items': 'count'})
Then, I get the result of grouped:
Weekday  Items
M        A      3
         B      2
         C      1
         D      1
         E      1
         F      1
         G      1
T        A      1
         B      3
         C      2
         D      1
         E      1
         F      1
         G      1
So how can I output the top 5 items for each weekday (5 for 'M' and 5 for 'T'), like:
Weekday  Items
M        A      3
         B      2
         C      1
         D      1
         E      1
T        B      3
         C      2
         A      1
         D      1
         E      1
Can anyone help with this?

df = pd.DataFrame({'Weekday':list('MMMMMMMMMMTTTTTTTTTT'),
'Item': list("AAABBCDEFGBBBCCADEFG")
})
grouped = df.groupby(['Weekday','Item'],sort=True).agg(count=('Item', 'count'))
grouped.sort_values(['Weekday','count'],ascending=False).groupby('Weekday').head(5)
               count
Weekday  Item
T        B         3
         C         2
         A         1
         D         1
         E         1
M        A         3
         B         2
         C         1
         D         1
         E         1

grouped = (df.groupby(['Weekday','Items'])
.Items.agg(counter='count')
.groupby(['Weekday'],
as_index=False))
pd.concat([group.nlargest(5,'counter') for name,group in grouped])
                counter
Weekday  Items
M        A            3
         B            2
         C            1
         D            1
         E            1
T        B            3
         C            2
         A            1
         D            1
         E            1
This uses groupby twice: the first call produces the counter variable, and the second allows iterating through the groups to take the top 5 of each with nlargest. The last step combines the dataframes in the list into one.
vb_rise's solution should be faster as it avoids the iteration.
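
Another option, sketched here as an alternative rather than a correction of either answer: a grouped value_counts already returns the counts sorted in descending order within each group, so taking head(5) per weekday should give the top 5 directly. The name top5 is illustrative, and the order among tied counts is not something this sketch relies on:

```python
import pandas as pd

df = pd.DataFrame({
    "Weekday": list("MMMMMMMMMMTTTTTTTTTT"),
    "Items": list("AAABBCDEFGBBBCCADEFG"),
})

# Count items per weekday (sorted descending within each weekday),
# then keep the first 5 rows of every weekday.
counts = df.groupby("Weekday")["Items"].value_counts()
top5 = counts.groupby(level="Weekday").head(5)
print(top5)
```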

Formatting dataframe in appending

I want to append 2 dataframes:
data1:
a
1 a
2 b
3 c
4 d
5 e
data2:
b
1 f
2 g
3 h
4 i
5 j
output:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
currently i am using:
all_data= data1.append(data2, ignore_index=True)
this gives me result as:
a b
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
i.e. in different columns.
How can I get them in the same column?
I also tried converting the dataframes into lists and then appending them, but that gave me the error:
TypeError: append() takes no keyword arguments
Also, is there any other function to remove duplicates from a dataframe of strings? The drop_duplicates() function does not work in my case. The data still has duplicates.

You need to change one column name so that append can detect what you want to do:
data2.columns = ["a"]
or
data1.columns = ["b"]
And then, after using data2.columns = ["a"]:
all_data = data1.append(data2, ignore_index=True)
all_data
a
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
The resulting column takes its name from data1's column, which you can rename if you want:
all_data.columns = ["Foo"]
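
DataFrame.append was removed in pandas 2.0, so on current versions the same idea has to go through pd.concat. A minimal sketch, assuming the two single-column frames from the question; the rename is done on the fly so data2 itself is left untouched:

```python
import pandas as pd

data1 = pd.DataFrame({"a": list("abcde")}, index=range(1, 6))
data2 = pd.DataFrame({"b": list("fghij")}, index=range(1, 6))

# Give both frames the same column name so the rows stack in one column,
# and let ignore_index build a fresh 0..n-1 index.
all_data = pd.concat([data1, data2.rename(columns={"b": "a"})], ignore_index=True)
print(all_data)
```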

merge or concat work on keys. In this case, there are no common columns. However, why not use numpy append and create the dataframe?
In [68]: pd.DataFrame(np.append(data1.values, data2.values), columns=['A'])  # needs import numpy as np; the pd.np alias was removed in pandas 2.0
Out[68]:
A
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j

df1.columns = ['b']
Out[78]:
b
0 a
1 b
2 c
3 d
4 e
pd.concat([df1 , df2] , ignore_index=True)
Out[80]:
b
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
