arrange values in an order in a pandas df - python

I have a table as a pandas df:
p_id_x  p_id_y  count
a       b       2
b       c       4
a       c       8
d       a       1
x       a       6
m       b       3
c       z       7
I am trying to write a function:
def function_combination(p_id):
    df[['p_id_x', 'p_id_y']] = df[['p_id_x', 'p_id_y']].apply(sorted, axis=1)
    df.groupby(['p_id_x', 'p_id_y'], as_index=False)['count'].sum()
(The function is not complete and has errors.)
I got the result below by running the code inside the function separately:
df[['p_id_x', 'p_id_y']]
p_id_x p_id_y
a b
b c
a c
a d
a x
b m
c z
But what I want my output to look like is:
p_id_x p_id_y
a b
a c
a d
a x
b c
b m
c z
i.e. all the combinations for a first, followed by b, then c.
This is only a part of my data; I have 20+ such rows.
Is there a way to do this so that I can have both pieces of code inside the function?

You need to add sort_values by column p_id_x:
df[['p_id_x', 'p_id_y']] = df[['p_id_x', 'p_id_y']].apply(sorted, axis=1)
df = df.groupby(['p_id_x', 'p_id_y'], as_index=False)['count'].sum().sort_values('p_id_x')
print (df)
p_id_x p_id_y count
0 a b 2
1 a c 8
2 a d 1
3 a x 6
4 b c 4
5 b m 3
6 c z 7
print (df[['p_id_x','p_id_y']])
p_id_x p_id_y
0 a b
1 a c
2 a d
3 a x
4 b c
5 b m
6 c z
EDIT by comment - use boolean indexing:
mask = (df.p_id_x == 'a') & (df['count'] > 3)
print (mask)
0 False
1 True
2 False
3 True
4 False
5 False
6 False
dtype: bool
print (df[mask])
p_id_x p_id_y count
1 a c 8
3 a x 6
Or query:
print (df.query("p_id_x == 'a' and count > 3"))
p_id_x p_id_y count
1 a c 8
3 a x 6
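For completeness, a rough sketch of how both steps could live inside the function the question asks for (this wrapper is mine, based on the answer above; it takes the DataFrame instead of the unused p_id argument):
def function_combination(df):
    # normalise each pair so the smaller id always ends up in p_id_x
    df[['p_id_x', 'p_id_y']] = df[['p_id_x', 'p_id_y']].apply(sorted, axis=1)
    # aggregate duplicate pairs and order the result by p_id_x
    return (df.groupby(['p_id_x', 'p_id_y'], as_index=False)['count']
              .sum()
              .sort_values('p_id_x'))

df = function_combination(df)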

Related

Value counts for two columns inside the same table

I'm trying to count values in two columns and then put the results in the same table.
import pandas as pd

data = {"before": list("ABCDEFABDCFEFF"),
        "after": list("FABFCFFEEDEBFF")}
df = pd.DataFrame(data)
Output
before after
0 A F
1 B A
2 C B
3 D F
4 E C
5 F F
6 A F
7 B E
8 D E
9 C D
10 F E
11 E B
12 F F
13 F F
I've achieved something close to what I want, but this looks messy, and I'm hoping for a "smoother" solution.
df.melt().groupby("variable")["value"].value_counts().to_frame().unstack()
Output:
value
value A B C D E F
variable
after 1 2 1 1 3 6
before 2 2 2 2 2 4
You can use apply with value_counts:
df.apply(lambda x: x.value_counts())
If you want to have before and after as the row indexes, as shown in your current output, use the following:
df.apply(lambda x: x.value_counts()).transpose()
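One caveat (my addition, not part of the answer above): if some letter appeared in only one of the two columns, value_counts would leave NaN in the other column, so it may be worth filling those:
# fill missing counts with 0 and keep the table integer-valued
df.apply(lambda x: x.value_counts()).transpose().fillna(0).astype(int)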
A different way with melt using pivot_table:
>>> df.melt().assign(count=1).pivot_table('count', 'variable', 'value', aggfunc='count')
value A B C D E F
variable
after 1 2 1 1 3 6
before 2 2 2 2 2 4
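For reference, a compact alternative of my own (not from the answers above) that builds the same count table with pd.crosstab on the melted frame:
m = df.melt()
# cross-tabulate variable (before/after) against the letter values
print(pd.crosstab(m['variable'], m['value']))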

How do I give score (0/1) to CSV rows

My CSV file's row and column data looks like this:
a a a a a
b b b b b
c c c c c
d d d d d
a b c d e
a d b c c
When I have patterns like rows 1-5, I want to return the value 0.
When I have a row like row 6, or random letters (not like rows 1-5), I want to return the value 1.
How do I do it using Python? It must be done using a CSV file.
You can read your CSV file into a pandas DataFrame using:
df = pd.read_csv('file.csv', header=None)  # 'file.csv' is a placeholder for your file name
output:
0 1 2 3 4
0 a a a a a
1 b b b b b
2 c c c c c
3 d d d d d
4 a b c d e
5 a d b c c
Then, use nunique to count the number of unique values per row; if it is 1 or 5 (the maximum), the row matches the pattern (score 0), otherwise it doesn't (score 1). Use between for that.
df.nunique(1).between(2, len(df.columns)-1).astype(int)
output:
0 0
1 0
2 0
3 0
4 0
5 1
dtype: int64
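Putting it together, a minimal end-to-end sketch; "data.csv" stands in for your actual file name:
import pandas as pd

df = pd.read_csv("data.csv", header=None)  # placeholder file name

# 0 when a row is all-identical (1 unique value) or all-distinct (5 unique values), 1 otherwise
score = df.nunique(axis=1).between(2, len(df.columns) - 1).astype(int)
print(score)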

How can I remove a certain type of values in a group in pandas?

I have the following dataframe which is a small part of a bigger one:
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
I'd like to delete all rows where the last items are "d". So my desired dataframe would look like this:
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
So the point is that a group shouldn't have "d" as its last item.
I have code that deletes the last row of the groups where the last item is "d", but in this case I have to run it twice to delete all of the trailing "d"s in group 3, for example.
clean_3 = clean_2[clean_2.groupby('acc_num')['trans_cdi'].transform(lambda x: (x.iloc[-1] != "d") | (x.index != x.index[-1]))]
Is there a better solution to this problem?
We can use idxmax here, reversing the data with [::-1], and then get the index:
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
Testing on consecutive values:
acc_num trans_cdi
0 1 c
1 1 d <--- d between two c, so we need to keep
2 1 c
3 1 d <--- row to be dropped
4 3 d
5 3 c
6 3 d
7 3 d
grps = df['trans_cdi'].ne('d').groupby(df['acc_num'], group_keys=False)
idx = grps.apply(lambda x: x.loc[:x[::-1].idxmax()]).index
df.loc[idx]
acc_num trans_cdi
0 1 c
1 1 d
2 1 c
4 3 d
5 3 c
Still gives correct result.
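As an aside, here is a sketch of my own (not part of the answer above) that avoids apply by taking a cumulative max over each group from the bottom up; it keeps every row up to and including the last non-'d' row of its group:
not_d = df['trans_cdi'].ne('d')
# walk each group bottom-up: once the last non-'d' row is reached, flag it and everything above it
keep = not_d.iloc[::-1].groupby(df['acc_num']).cummax().iloc[::-1].astype(bool)
df[keep]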
You can try this not-so-pandorable solution:
def r(x):
    c = 0
    for v in x['trans_cdi'].iloc[::-1]:
        if v == 'd':
            c = c + 1
        else:
            break
    # guard: if the group does not end in 'd', keep it as-is (x.iloc[:-0] would drop everything)
    return x if c == 0 else x.iloc[:-c]

df.groupby('acc_num', group_keys=False).apply(r)
acc_num trans_cdi
0 1 c
3 3 d
4 3 c
First, compare each row to the previous row with shift and flag rows where both values are 'd'; the ~ inverts the mask so those rows are filtered out.
Second, make sure the last row's value is not 'd'. If it is, delete that row.
code:
df = df[~((df['trans_cdi'] == 'd') & (df.shift(1)['trans_cdi'] == 'd'))]
if df['trans_cdi'].iloc[-1] == 'd': df = df.iloc[0:-1]
df
input (I tested it on more input data to ensure there were no bugs):
acc_num trans_cdi
0 1 c
1 1 d
3 3 d
4 3 c
5 3 d
6 3 d
7 1 d
8 1 d
9 3 c
10 3 c
11 3 d
12 3 d
output:
acc_num trans_cdi
0 1 c
1 1 d
4 3 c
5 3 d
9 3 c
10 3 c

pandas: find indices of rows in each group which meet a certain condition and assign values to those rows

I have a df,
name_id name
1 a
2 b
2 b
3 c
3 c
3 c
Now I want to group by name_id and assign -1 to the rows of any group whose length is 1 (i.e. < 2):
one_occurrence_indices = df.groupby('name_id').filter(lambda x: len(x) == 1).index.tolist()
for index in one_occurrence_indices:
    df.loc[index, 'name_id'] = -1
I am wondering what the best way to do it is, so that the resulting df is:
name_id name
-1 a
2 b
2 b
3 c
3 c
3 c
Use transform with loc:
df.loc[df.groupby('name_id')['name_id'].transform('size') == 1, 'name_id'] = -1
An alternative is numpy.where:
df['name_id'] = np.where(df.groupby('name_id')['name_id'].transform('size') == 1,
                         -1, df['name_id'])
print (df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Also, if you want to test for duplicates, use duplicated:
df['name_id'] = np.where(df.duplicated('name_id', keep=False), df['name_id'], -1)
Use:
df.name_id *= (df.groupby('name_id').name.transform(len) == 1).map({True: -1, False: 1})
df
Out[50]:
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Using pd.DataFrame.mask:
lens = df.groupby('name_id')['name'].transform(len)
df['name_id'].mask(lens < 2, -1, inplace=True)
print(df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c

pandas groupby operation with missing data

In a pandas dataframe I have a column that looks like:
0 M
1 E
2 L
3 M.1
4 M.2
5 M.3
6 E.1
7 E.2
8 E.3
9 E.4
10 L.1
11 L.2
12 M.1.a
13 M.1.b
14 M.1.c
15 M.2.a
16 M.3.a
17 E.1.a
18 E.1.b
19 E.1.c
20 E.2.a
21 E.3.a
22 E.3.b
23 E.4.a
I need to group all the values whose first element is E, M, or L and then, for each group, create a subgroup whose index is 1, 2, 3, ..., containing a record for each lowercase letter (a, b, c, ...).
Potentially, the solution should work for any number of concatenated levels (in this case the number of levels is 3, e.g. A.1.a).
0  1  2
E  1  a
      b
      c
   2  a
   3  a
      b
   4  a
L  1
   2
M  1  a
      b
      c
   2  a
   3  a
I tried with:
df.groupby([0,1,2]).count()
But the result is missing the L level because it doesn't have records at the last sub-level.
A workaround is to add a dummy value and then remove it later, like:
df.loc[(df[0] == 'L') & (df[2].isnull()) & (df[1].notnull()), 2] = 'x'
df = df.replace(np.nan,' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0,1,2]).count()
which gives:
0  1  2
E  1  a
      b
      c
   2  a
   3  a
      b
   4  a
L  1  x
   2  x
M  1  a
      b
      c
   2  a
   3  a
I then deal with the dummy entry x later in my code ...
How can I avoid this hackish way of using groupby?
Assuming the column under consideration is represented by s, we can:
Split on the "." delimiter with expand=True to produce an expanded DF.
fnc: checks whether all elements of the grouped frame are None; if so, it replaces them with a dummy entry "" via a list comprehension. A Series constructor is then called on the filtered list, and any remaining Nones are removed using dropna.
Perform a groupby w.r.t. the 0 & 1 column names and apply fnc to column 2.
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)
produces:
0  1
E  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
      2    b
   4  1    a
L  1  0
   2  0
M  1  1    a
      2    b
      3    c
   2  1    a
   3  1    a
Name: 2, dtype: object
To obtain a flattened DF, reset the index for the same levels that were used to group the DF before:
split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)
produces:
    0  1  2
0   E  1  a
1   E  1  b
2   E  1  c
3   E  2  a
4   E  3  a
5   E  3  b
6   E  4  a
7   L  1
8   L  2
9   M  1  a
10  M  1  b
11  M  1  c
12  M  2  a
13  M  3  a
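A side note of my own (not part of the answer above): groupby drops NaN keys by default, which is exactly why the L rows vanished from the original attempt; filling the missing levels with an empty string keeps them without the dummy-x workaround:
counts = split_str.fillna('').groupby([0, 1, 2]).size()
# on newer pandas versions, groupby([0, 1, 2], dropna=False) achieves the same without fillna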
Maybe you can find a way with regex.
import pandas as pd
df = pd.read_clipboard(header=None).iloc[:, 1]
df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
print(df2.set_index([0, 1]))
and the result is:
     2
0 1
M
E
L
M 1
  2
  3
E 1
  2
  3
  4
L 1
  2
M 1  a
  1  b
  1  c
  2  a
  3  a
E 1  a
  1  b
  1  c
  2  a
  3  a
  3  b
  4  a
