Suppose I have the following pandas dataframe, and I need to rank the values within each row into new columns (i.e., if I compare 4 columns, I create 4 new rank columns). The dataframe below has three numerical columns, so I need to compare and rank the values across them in each row and create three new columns that hold each value's rank within its row:
Revenue  SaleCount  salesprices  ranka  rankb  rankc
300      10         8000         2      1      3
100      9000       1000         1      3      2
How can I do that with simple code, using a for loop? Thanks in advance.
import pandas as pd
df = pd.DataFrame({'Revenue': [300, 9000, 1000, 750, 500, 2000, 0, 600, 50, 500],
                   'Date': ['2016-12-02' for i in range(10)],
                   'SaleCount': [10, 100, 30, 35, 20, 100, 0, 30, 2, 20],
                   'salesprices': [8000, 1000, 500, 700, 2500, 3800, 16, 7400, 3200, 21]})
print(df)
We can loop over the columns, take the suffixes from string.ascii_lowercase, and build each new column with rank over axis=1:
import string
cols = ['Revenue', 'SaleCount', 'salesprices']
for index, col in enumerate(cols):
    df[f'rank{string.ascii_lowercase[index]}'] = df[cols].rank(axis=1)[col]
Output:
print(df)
Revenue Date SaleCount salesprices ranka rankb rankc
0 300 2016-12-02 10 8000 2.0 1.0 3.0
1 9000 2016-12-02 100 1000 3.0 1.0 2.0
2 1000 2016-12-02 30 500 3.0 1.0 2.0
3 750 2016-12-02 35 700 3.0 1.0 2.0
4 500 2016-12-02 20 2500 2.0 1.0 3.0
5 2000 2016-12-02 100 3800 2.0 1.0 3.0
6 0 2016-12-02 0 16 1.5 1.5 3.0
7 600 2016-12-02 30 7400 2.0 1.0 3.0
8 50 2016-12-02 2 3200 2.0 1.0 3.0
9 500 2016-12-02 20 21 3.0 1.0 2.0
Note: I used an f-string, which is only supported on Python 3.6+. On older versions, use .format string formatting instead:
import string
cols = ['Revenue', 'SaleCount', 'salesprices']
for index, col in enumerate(cols):
    df['rank{}'.format(string.ascii_lowercase[index])] = df[cols].rank(axis=1)[col]
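As a side note, the loop can be avoided entirely: rank(axis=1) already computes every column's rank at once, so all three new columns can be assigned in one step. A minimal sketch (not part of the original answer):
import string

cols = ['Revenue', 'SaleCount', 'salesprices']
new_cols = [f'rank{c}' for c in string.ascii_lowercase[:len(cols)]]

# rank within each row once, rename the result columns,
# then attach all rank columns in a single assignment
ranks = df[cols].rank(axis=1)
ranks.columns = new_cols
df[new_cols] = ranks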
Related
I want to loop through each element of a pandas dataframe row such that only that element is stressed (i.e., multiplied by 1.1, a 10% increase) while the other elements of the row are kept equal. I'm planning to use this for sensitivity analysis.
Example:
df = pd.DataFrame({'AGE':[5,10],'POP':[100,200]})
AGE  POP
5    100
10   200
Final desired output:
AGE     POP
5       100
10      200
5*1.1   100
5       100*1.1
10*1.1  200
10      200*1.1
If you have 2 columns, you can multiply the frame by [stress, 1] and by its reverse, concatenate the two results with a sorted index so that each original row's stressed variants stay together, and finally prepend the original frame:
stress = 1.1
factor = [stress, 1]

pd.concat([df,
           pd.concat([df.mul(factor),
                      df.mul(factor[::-1])]).sort_index()
           ], ignore_index=True)
AGE POP
0 5.0 100.0
1 10.0 200.0
2 5.5 100.0
3 5.0 110.0
4 11.0 200.0
5 10.0 220.0
Generalizing to N columns can be done with a small generator:
def gen_factors(stress, N):
    for j in range(N):
        # make an all-1s list, except the j'th entry is `stress`
        f = [1] * N
        f[j] = stress
        yield f
stress = 1.1
N = len(df.columns)

pd.concat([df,
           pd.concat(df.mul(factor)
                     for factor in gen_factors(stress, N)).sort_index()
           ], ignore_index=True)
Example run for a 3-column frame:
>>> df
AGE POP OTHER
0 5 100 7
1 10 200 8
>>> # output of above:
AGE POP OTHER
0 5.0 100.0 7.0
1 10.0 200.0 8.0
2 5.5 100.0 7.0
3 5.0 110.0 7.0
4 5.0 100.0 7.7
5 11.0 200.0 8.0
6 10.0 220.0 8.0
7 10.0 200.0 8.8
You can use a cross merge and concat: cross-merge df with a factor Series, then pop the factor column off the merged frame and multiply the remaining columns by it row-wise:
pd.concat([df,
           (df.merge(pd.Series([1.1, 1], name='factor'), how='cross')
              .pipe(lambda d: d.mul(d.pop('factor'), axis=0))
           )], ignore_index=True)
Output:
AGE POP
0 5.0 100.0
1 10.0 200.0
2 5.5 110.0
3 5.0 100.0
4 11.0 220.0
5 10.0 200.0
I want to update df1 from df: wherever df1 has NaN, replace it with the corresponding value from df, matching rows on the ID column present in both dataframes. My expected output is df1 with the NaNs replaced by those values. I have provided a sample of df below:
ID QD QP QE
101 4 6 4
102 5 8 5
103 7 6 6
104 8 3 5
105 4 2 5
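For context, here is a minimal sketch of two such frames; the NaN positions in df1 are assumed, since only df was shown:
import numpy as np
import pandas as pd

# df holds the complete values
df = pd.DataFrame({'ID': [101, 102, 103, 104, 105],
                   'QD': [4, 5, 7, 8, 4],
                   'QP': [6, 8, 6, 3, 2],
                   'QE': [4, 5, 6, 5, 5]})

# df1 has the same shape but with NaNs to fill (positions assumed)
df1 = df.copy()
df1[['QD', 'QP', 'QE']] = df1[['QD', 'QP', 'QE']].astype(float)
df1.loc[[1, 3], 'QD'] = np.nan
df1.loc[[0, 4], 'QE'] = np.nan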
If your ID columns are sorted identically and the rows of the two frames correspond one-to-one, you can use:
df1[df1.isnull()] = df.values
print(df1)
ID QD QP QE
0 101 4.0 6.0 4.0
1 102 5.0 8.0 5.0
2 103 7.0 6.0 6.0
3 104 8.0 3.0 5.0
4 105 4.0 2.0 5.0
If not, it's better to set the ID column as the index and choose one of fillna, combine_first, or update to fill values according to the index:
df1 = df1.set_index('ID')

# Option 1: fillna
df1 = df1.fillna(df.set_index('ID').set_axis(df1.columns, axis=1)).reset_index()

# Option 2: combine_first -- if df is bigger than your original df1,
# the additional rows and columns are added as well
df1 = df1.combine_first(df.set_index('ID').set_axis(df1.columns, axis=1)).reset_index()

# Option 3: update modifies the data in place,
# so reset the index in a separate step
df1.update(df.set_index('ID').set_axis(df1.columns, axis=1))
df1 = df1.reset_index()
print(df1)
ID QD QP QE
0 101 4.0 6.0 4.0
1 102 5.0 8.0 5.0
2 103 7.0 6.0 6.0
3 104 8.0 3.0 5.0
4 105 4.0 2.0 5.0
How can I delete rows in a pandas dataframe when more than two consecutive rows have the same value, keeping only the first two of each run (in the example below, the repeated value is the integer 4)?
Consider the following code:
import pandas as pd

df = pd.DataFrame({
    'rating': [4, 4, 3.5, 15, 5, 4, 4, 4, 4, 4]
})
rating
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
7 4.0
8 4.0
9 4.0
I would like to get the following output, with the three extra consecutive rows containing the value 4 removed:
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
First build a group id that increments each time the value changes, then use GroupBy.head:
new_df = df.groupby(df['rating'].ne(df['rating'].shift()).cumsum()).head(2)
print(new_df)
rating
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
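To see the grouping key both answers rely on (a quick sketch): comparing each value to the previous row and cumulatively summing the changes gives one id per run of equal values.
# the id increments whenever the rating differs from the previous row
group_id = df['rating'].ne(df['rating'].shift()).cumsum()
print(group_id.tolist())  # [1, 1, 2, 3, 4, 5, 5, 5, 5, 5]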
Use GroupBy.cumcount for a within-group counter and filter rows with boolean indexing:
# keep rows whose within-group counter is less than 2 (cumcount starts at 0)
df = df[df.groupby(df['rating'].ne(df['rating'].shift()).cumsum()).cumcount().lt(2)]
print (df)
rating
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
I have a dataset with a number of numerical variables and a number of ordinal numeric variables. To fill the missing values, I want to use the mean for the numerical variables and the median for the ordinal ones. With the following code, each filled column is created separately and they are never collected back into one dataframe:
import pandas as pd

data = [['age', 'score'],
        [10, 1],
        [20, ""],
        ["", 0],
        [40, 1],
        [50, 0],
        ["", 3],
        [70, 1],
        [80, ""],
        [90, 0],
        [100, 1]]
df = pd.DataFrame(data[1:])
df.columns = data[0]

df = df[['age']].fillna(df.mean())
df = df[['score']].fillna(df.median())
pandas.DataFrame.fillna accepts a dict whose keys are column names, so you might do:
import pandas as pd
data = [['age', 'score'],
        [10, 1],
        [20, None],
        [None, 0],
        [40, 1],
        [50, 0],
        [None, 3],
        [70, 1],
        [80, None],
        [90, 0],
        [100, 1]]
df = pd.DataFrame(data[1:], columns=data[0])
df = df.fillna({'age':df['age'].mean(),'score':df['score'].median()})
print(df)
Output:
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
Keep in mind that an empty string is different from NaN; the latter can be created with Python's None. So first replace the empty strings with missing values, then fill the missing values per column:
import numpy as np

df = df.replace('', np.nan)
df['age'] = df['age'].fillna(df['age'].mean())
df['score'] = df['score'].fillna(df['score'].median())
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
You can also use DataFrame.agg to build a Series of aggregate values and pass it to DataFrame.fillna:
df = df.replace('', np.nan)
print (df.agg({'age':'mean', 'score':'median'}))
age 57.5
score 1.0
dtype: float64
df = df.fillna(df.agg({'age':'mean', 'score':'median'}))
print (df)
age score
0 10.0 1.0
1 20.0 1.0
2 57.5 0.0
3 40.0 1.0
4 50.0 0.0
5 57.5 3.0
6 70.0 1.0
7 80.0 1.0
8 90.0 0.0
9 100.0 1.0
I have a column (let's call it Column X) containing around 16000 NaN values. The column has only two possible values, 1 or 0 (binary).
I want to fill the NaN values in Column X, but I don't want to use a single value for ALL of the NaN entries. Say, for instance, that I want to fill 50% of the NaN values with '1' and the other 50% with '0'.
I have read the fillna() documentation but found nothing that covers this functionality, and I have no idea how to move forward. I could do:
df['Column_x'] = df['Column_x'].fillna(df['Column_x'].mode()[0])
but this would fill ALL the NaN values in Column X of my dataframe df with the mode of the column; I want to fill 50% with one value and the other 50% with a different one.
Since I haven't tried anything yet, I can't show actual results. The expected result would be roughly 8000 NaN values of Column X replaced with '1' and the other 8000 with '0'.
A visual example:
Before Handling NaN
Index Column_x
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
After Handling NaN
Index Column_x
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 1.0
8 1.0
9 1.0
10 1.0
11 1.0
12 0.0
13 0.0
14 0.0
15 0.0
16 1.0
17 1.0
18 1.0
19 1.0
You can use random.choices with its weights parameter to keep the fill distribution as specified. I've simulated a NaN column with numpy here and computed the exact number of replacements needed. This approach also works for columns with more than two classes and for more complex distributions.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'col1': range(16000)})
df['col2'] = np.nan
nans = df['col2'].isna()
length = sum(nans)
replacement = random.choices([0, 1], weights=[.5, .5], k=length)
df.loc[nans,'col2'] = replacement
print(df.describe())
'''
Out:
col1 col2
count 16000.000000 16000.000000
mean 7999.500000 0.507625
std 4618.946489 0.499957
min 0.000000 0.000000
25% 3999.750000 0.000000
50% 7999.500000 1.000000
75% 11999.250000 1.000000
max 15999.000000 1.000000
'''
Using pandas.Series.sample. For a runnable example, first reconstruct the question's Before column (the frame below is an assumed sketch):
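import numpy as np
import pandas as pd

# assumed reconstruction of the question's "Before" data
df = pd.DataFrame({'Index': range(20),
                   'Column_x': [0.0]*6 + [1.0]*6 + [np.nan]*8})

Then sample half of the NaN index labels, set them to 1, and fill the rest with 0: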
mask = df['Column_x'].isna()
ind = df['Column_x'].loc[mask].sample(frac=0.5).index
df.loc[ind, 'Column_x'] = 1
df['Column_x'] = df['Column_x'].fillna(0)
print(df)
Output:
Index Column_x
0 0 0.0
1 1 0.0
2 2 0.0
3 3 0.0
4 4 0.0
5 5 0.0
6 6 1.0
7 7 1.0
8 8 1.0
9 9 1.0
10 10 1.0
11 11 1.0
12 12 1.0
13 13 0.0
14 14 1.0
15 15 0.0
16 16 0.0
17 17 1.0
18 18 1.0
19 19 0.0
Use column slicing and fill the values. isnull() detects missing values in a Series. Example:
import pandas as pd

df = pd.DataFrame({'Column_y': pd.Series(range(9), index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']),
                   'Column_x': pd.Series(range(1), index=['a'])})
print(df)
# get the index labels where Column_x is NaN
idx = df['Column_x'].index[df['Column_x'].isnull()]
total_nan_len = len(idx)
first_nan = total_nan_len//2
# fill the first 50% with 1
df.loc[idx[0:first_nan], 'Column_x'] = 1
# fill the last 50% with 0
df.loc[idx[first_nan:total_nan_len], 'Column_x'] = 0
print(df)
Output:
Before Dataframe
Column_y Column_x
a 0 0.0
b 1 NaN
c 2 NaN
d 3 NaN
e 4 NaN
f 5 NaN
g 6 NaN
h 7 NaN
i 8 NaN
After Dataframe
Column_y Column_x
a 0 0.0
b 1 1.0
c 2 1.0
d 3 1.0
e 4 1.0
f 5 0.0
g 6 0.0
h 7 0.0
i 8 0.0