Increment dataframe column based on condition - python

I have a dataframe and I want to create a new column based on a condition on a different column. Create the new column "ans" starting at 1 and increment it based on the column "ix": if the value in "ix" is the same as the next one, keep "ans" the same; if it is different, increment "ans".
Thank you for your answer, I am new to Python, so I am not sure how to do this.
index ix
1 pa
2 pa
3 pa
4 pe
5 fc
6 pb
7 pb
8 df
should result in:
index ix ans
1 pa 1
2 pa 1
3 pa 1
4 pe 2
5 fc 3
6 pb 4
7 pb 4
8 df 5

In [47]: df['ans'] = (df['ix'] != df['ix'].shift(1)).cumsum()

In [48]: df
Out[48]:
   index  ix  ans
0      1  pa    1
1      2  pa    1
2      3  pa    1
3      4  pe    2
4      5  fc    3
5      6  pb    4
6      7  pb    4
7      8  df    5
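For intuition: the comparison against shift(1) flags the rows where ix changes, and cumsum() turns those flags into running group numbers. A minimal sketch of the intermediate step, using the same df as above:

changes = df['ix'] != df['ix'].shift(1)   # True where ix differs from the previous row
changes.tolist()                          # [True, False, False, True, True, True, False, True]
changes.cumsum().tolist()                 # [1, 1, 1, 2, 3, 4, 4, 5]

Note that the first row compares against NaN, so it always counts as a change and the numbering starts at 1.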

Related

How to find the next row that has a value in a column in a pandas dataframe?

I have a dataframe such as:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Every group should have the numbers 7, 8 and 9. In the example above, group 1 does not have all three numbers; the number 9 is missing. In that case, I would like to find the closest row with a 9 in the label and add it to the dataframe, also changing the date to the group's date.
So the desired result would be:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Welcome to SO. It's good to include what you have tried so far, so keep that in mind. Anyhow, for this question, break down your thought process into pandas syntax. The first step would be to check which groups are missing which labels from [8, 9]:
dfs = df.groupby(['group', 'date']).agg({'label':set}).reset_index().sort_values('group')
dfs['label'] = dfs['label'].apply(lambda x: {8, 9}.difference(x)).explode() # This is the missing label
dfs
Which will give you:

group   date  label
    1  02/05      9
    2  09/05    NaN
Now merge it with the original on label so that info gets filled in:
final_df = pd.concat([df, dfs.merge(df[['label', 'info']], on='label', suffixes=['','_grouped'])])
final_df
 id  info   date  group  label
  1    aa  02/05      1      7
  2    ba  02/05      1      8
  3    cp  09/05      2      7
  4    dd  09/05      2      8
  5    ii  09/05      2      9
NaN    ii  02/05      1      9
And prettify:
final_df.reset_index(drop=True).reset_index().assign(id=lambda x:x['index']+1).drop(columns=['index']).sort_values(['group', 'id'])
id  info   date  group  label
 1    aa  02/05      1      7
 2    ba  02/05      1      8
 6    ii  02/05      1      9
 3    cp  09/05      2      7
 4    dd  09/05      2      8
 5    ii  09/05      2      9
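As a quick sanity check (not part of the original answer), you can verify that every group now carries all three labels:

final_df.groupby('group')['label'].apply(set)   # each group should contain {7, 8, 9}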

Maximum in two columns dataframe python pandas

I have two columns in a dataframe: one of them contains strings (countries) and the other contains integers related to each country. How do I find which country has the biggest value using python pandas?
Setup
df = pd.DataFrame(dict(Num=[*map(int, '352741845')], Country=[*'ABCDEFGHI']))
df
Num Country
0 3 A
1 5 B
2 2 C
3 7 D
4 4 E
5 1 F
6 8 G
7 4 H
8 5 I
idxmax
df.loc[[df.Num.idxmax()]]
Num Country
6 8 G
nlargest
df.nlargest(1, columns=['Num'])
Num Country
6 8 G
sort_values and tail
df.sort_values('Num').tail(1)
Num Country
6 8 G
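If you only need the country name rather than the whole row, a minimal variant of the idxmax approach (same df as in the setup):

df.loc[df.Num.idxmax(), 'Country']   # 'G'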

Filling a column with a condition on another column and shifting the values in pandas

My dataframe looks like this
   №№№  randomNumCol  n_k
0    5             1
1    6             0
2    7             1
3    8             0
4    9             1
5   10             1
6   11             1
7   12             1
...
I need to fill the column n_k as follows: if the value in the column randomNumCol is 1, copy the value from the column №№№; if it is 0, carry over the previous value from the column n_k.
BUT the first value in the column n_k should be equal to 2 (for now I don't know why that is).
It should look like this
   №№№  randomNumCol  n_k
0    5             1    2
1    6             0    2
2    7             1    7
3    8             0    7
4    9             1    9
5   10             1   10
6   11             1   11
7   12             1   12
...
My code does not give the right result
dftest['n_k'] = np.where(dftest['randomNumCol'] == 1, dftest['№№№'], dftest['n_k'].shift(1))
I do not quite understand how to use shift(). And what to do with the first cell in n_k, which should always be 2?
Any advice, please?
You can copy the values from the '№№№' column where randomNumCol is 1, set the remaining values to NaN, and then use ffill to fill the missing values forward. (A shift(1) is not enough here: it would need the previous n_k values, which have not been computed yet, whereas a forward fill simply carries the last copied value down the column.)
import numpy as np
import pandas as pd

# keep №№№ where randomNumCol is 1, otherwise leave NaN (pd.np is gone in recent pandas, use np.nan)
df['n_k'] = df['№№№'].where(df.randomNumCol == 1, np.nan)
df['n_k'].iat[0] = 2                                    # the first value is fixed to 2
df['n_k'] = df['n_k'].ffill().astype(df['№№№'].dtype)   # forward-fill, then restore the original dtype
df
# №№№ randomNumCol n_k
#0 5 1 2
#1 6 0 2
#2 7 1 7
#3 8 0 7
#4 9 1 9
#5 10 1 10
#6 11 1 11
#7 12 1 12
You can use a forward fill (ffill()) instead of shift():
import numpy as np
import pandas as pd

df['n_k'] = np.nan
df.loc[df['randomNumCol'] == 1, 'n_k'] = df['№№№']  # copy №№№ where the flag is 1
df.loc[0, 'n_k'] = 2                                # .ix is removed in current pandas; use .loc
df['n_k'] = df['n_k'].ffill()                       # forward-fill the remaining NaNs
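For anyone reproducing the answers above, the example frame can be rebuilt like this (a sketch using the column names and values from the question):

import pandas as pd
df = pd.DataFrame({'№№№': range(5, 13),
                   'randomNumCol': [1, 0, 1, 0, 1, 1, 1, 1]})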

Sum pandas dataframe column values based on condition of column name

I have a DataFrame with column names of the form x.y, and I would like to sum up all columns with the same value of x without having to name them explicitly. That is, the value of column_name.split(".")[0] should determine their group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to explicitly list the column names or how they should be grouped.
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: you can extract the prefix from the column names and use it as the grouping key:
df.groupby(by = df.columns.str.split('.').str[0], axis = 1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
You can first create a MultiIndex by splitting the column names, then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
   x      y
   1  2   8  92
0  1  5  19  10
1  2  4   2   9
2  3  3   1   2
3  4  2   3   4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7
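Note that passing axis=1 to groupby is deprecated in newer pandas releases; if that affects your version, a transposed variant gives the same result (a sketch using the original df from the question, before its columns were split):

prefixes = df.columns.str.split('.').str[0]   # ['x', 'x', 'y', 'y']
df.T.groupby(prefixes).sum().T                # group the transposed rows by prefix, then transpose back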

How to skip a row in pandas dataframe group by cumsum function

I'm trying to use cumsum() to get the result I want in pandas, but I'm stuck.
           score1  score2
team slot
a    2          4       6
a    3          3       7
a    4          2       1
a    5          4       3
b    1          7       2
b    2          2      10
b    5          1       9
My original data looks like the above. I want to compute the cumulative sum of score1 and score2, grouped by team and slot. I used
df = df.groupby(by=['team','slot']).sum().groupby(level=[0]).cumsum()
The code above almost gives what I want, but each team needs exactly 5 slots, like the output below. How can I fix this issue?
As @Paul H commented, here is the code:
import io
import pandas as pd
text = """team slot score1 score2
a 2 4 6
a 3 3 7
a 4 2 1
a 5 4 3
b 1 7 2
b 2 2 10
b 5 1 9
"""
df = pd.read_csv(io.StringIO(text), delim_whitespace=True, index_col=[0, 1])  # StringIO: the text above is a str, not bytes
df2 = df.reindex(pd.MultiIndex.from_product([df.index.levels[0], range(1, 6)]))
df2.fillna(0).groupby(level=[0]).cumsum()
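A small variant, in case you want to keep the integer dtypes: reindex with fill_value=0 instead of filling NaNs afterwards (a sketch built on the same df as above, with the level names taken from the question's layout):

full_index = pd.MultiIndex.from_product([df.index.levels[0], range(1, 6)],
                                        names=['team', 'slot'])
df2 = df.reindex(full_index, fill_value=0)   # missing slots become 0 and dtypes stay integer
df2.groupby(level='team').cumsum()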
