I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
col1 col2
A 1 6
B 2 7
C 3 8
D 4 9
E 5 10
I need to write a function (say the name will be getNrRows(fromIndex) ) that will take an index value as input and will return the number of rows between that given index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be
len(df[row_index:]) - 1
For your information we have built-in function get_indexer_for
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
Related
Say I have the following data frame:
import pandas as pd
d = {'id': [1, 2, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3],
'date': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4],
'product': ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'c', 'b', 'c', 'c', 'c']}
df = pd.DataFrame(d)
I want to keep all data for each ID the day of and after they bought product 'b' and get rid of all data before they bought product 'b'. ID 1 would have no data because they did not purchase the product, ID 2 would have data for the 3rd and 4th day, and ID 3 would have data for days 1-4.
I know that I could groupby id and then filter rows from individual groups but I can't figure out how to make the filter dynamic based on the group. I've tried looping through the groups but it's slow (right now I have 19,000 IDs but it'll only grow as I continue the project).
Any help would be greatly appreciated. Thank you!
You can select the products "b" with eq and set the successive rows to True per group using groupby+cummax. Then slice the dataframe
df[df['product'].eq('b').groupby(df['id']).cummax()]
output:
id date product
2 3 1 b
3 3 2 a
4 3 2 b
6 2 3 b
8 2 3 b
9 3 3 c
10 2 4 c
11 3 4 c
NB. this assumes the dataframe is ordered by date. If not use sort_values(by='date') (or by=['group', 'date'])
I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index of both data-frames are 'A'.
How to replace the values of df1's column 'B' with the values of df2 column 'B'?
RESULT of df1:
A B
1 8
3 10
5 12
Maybe dataframe.isin() is what you're searching:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
A B
0 1 8
1 3 10
2 5 12
One of possible solutions:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
.drop(columns=['B_x'])
I Have a column within a dataset, regarding categorical company sizes, which currently looks like this, where the '-' hyphens are currently representing missing data:
I want to change the '-' in missing values with nulls so i can analyse missing data. However when I use the pd replace tool (see following code) with a None value it seems to also make any of the genuine entries as they also contain hyphens (e.g 51-200).
df['Company Size'].replace({'-': None},inplace =True, regex= True)
How can I replace only lone standing hyphens and leave the other entries untouched?
You need not to use regex=True.
df['Company Size'].replace({'-': None},inplace =True)
You could also just do:
df['column_name'] = df['column_name'].replace('-','None')
import numpy as np
df.replace('-', np.NaN, inplace=True)
This code worked for me.
you can do it like this
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': [5, 6, 7, 8, 9],
'C': ['a', '-', 'c--', 'd', 'e']})
df['C'] = df['C'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
# can also use this -> df['C'] = df['C'].where((pd.notnull(df)), None)
print(df)
output:
A B C
0 0 5 a
1 1 6 None
2 2 7 c--
3 3 8 d
4 4 9 e
another example:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': ['5-5', '-', 7, 8, 9],
'C': ['a', 'b', 'c--', 'd', 'e']})
df['B'] = df['B'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
print(df)
output:
A B C
0 0 5-5 a
1 1 None b
2 2 7 c--
3 3 8 d
4 4 9 e
I have to perform the same arithmetic operation on specific columns of a pandas DataFrame. I do it as
c.loc[:,'col3'] += cons
c.loc[:,'col5'] += cons
c.loc[:,'col6'] += cons
There should be a simpler approach to do all of these in one operation. I mean updating col3,col5,col6 in one command.
pd.DataFrame.loc label indexing accepts lists:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.loc[:, ['B', 'C']] += 10
print(df)
A B C
0 1 12 13
1 4 15 16
My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k number of times. Along with it, I also want to create a column with values 0 to k-1. So
import pandas as pd
df = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'n' : [ 1, 2, 3],
'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
'id': ['A', 'B', 'B', 'C', 'C', 'C'],
'n' : [ 1, 2, 2, 3, 3, 3],
'v' : [ 10, 13, 13, 8, 8, 8],
'repeat_id': [0, 0, 1, 0, 1, 2]
})
Command below does half of the job. I am looking for pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount and copy for avoid SettingWithCopyWarning:
If you modify values in df1 later you will find that the modifications do not propagate back to the original data (df), and that Pandas does warning.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print (df1)
id n v repeat_id
0 A 1 10 0
1 B 2 13 0
2 B 2 13 1
3 C 3 8 0
4 C 3 8 1
5 C 3 8 2