How to select row before and after NaN in pandas?

I have a dataframe which looks like this :
Name Age Job
0 Alex 20 Student
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
4 Rosa 20 senior manager
5 johanes 25 Dentist
6 lina 23 Student
7 yaser 25 Pilot
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
...
I want to select the rows before and after each row that has a NaN value in column Job, together with the row itself. For that I have the following code:
Rows = df[df.shift(1, fill_value="dummy").Job.isna()
          | df.Job.isna()
          | df.shift(-1, fill_value="dummy").Job.isna()]
print(Rows)
the result is this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
The only problem here is row number 10: it should appear twice in the result, because this row is both the row after a NaN (row number 9) and the row before a NaN (row number 11); it sits between two rows with NaN values. So at the end I want to have this:
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
So every row that sits between two rows with NaN values should also appear twice in the result (i.e. should be duplicated). Is there any way to do this? Any help will be appreciated.

Use concat with the rows before, the rows after, and the rows matching the condition:
m = df.Job.isna()
df = pd.concat([df[m.shift(fill_value=False)],      # rows directly after a NaN
                df[m.shift(-1, fill_value=False)],  # rows directly before a NaN
                df[m]]).sort_index()                # the NaN rows themselves
print(df)
Name Age Job
1 Sara 21 Doctor
2 john 23 NaN
3 kevin 22 Teacher
8 jason 20 Manager
9 Ali 23 NaN
10 Ahmad 21 Professor
10 Ahmad 21 Professor
11 Joe 24 NaN
12 Donald 29 Waiter
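A boolean mask selects each row at most once, which is why the original `|` approach cannot emit row 10 twice; concatenating three independent selections can. A minimal self-contained sketch of the approach (the sample data is abbreviated here to rows 8-12):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['jason', 'Ali', 'Ahmad', 'Joe', 'Donald'],
                   'Age': [20, 23, 21, 24, 29],
                   'Job': ['Manager', np.nan, 'Professor', np.nan, 'Waiter']},
                  index=range(8, 13))

m = df.Job.isna()
out = pd.concat([df[m.shift(fill_value=False)],      # rows after a NaN
                 df[m.shift(-1, fill_value=False)],  # rows before a NaN
                 df[m]]).sort_index()                # the NaN rows themselves
print(out)   # row 10 (Ahmad) appears twice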

Related

Missing value replacement using mode in pandas in a subgroup of a group

I have a data set as below. I need to group a subset of the rows and fill the missing values using the mode. Specifically, the missing Value entries for Tom from UK need to be filled: group the rows for Tom from UK, and within that group use the most frequently occurring value to fill the missing ones.
The dataset:
Name    location   Value
Tom     USA        20
Tom     UK         NaN
Tom     USA        NaN
Tom     UK         20
Jack    India      NaN
Nihal   Africa     30
Tom     UK         NaN
Tom     UK         20
Tom     UK         30
Tom     UK         20
Tom     UK         30
Sam     UK         30
Sam     UK         30
Try:
df = (df.set_index(['Name', 'location'])
        .fillna(df[df.Name.eq('Tom') & df.location.eq('UK')]
                .groupby(['Name', 'location'])
                .agg(pd.Series.mode)
                .to_dict())
        .reset_index())
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
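If the fill should apply to every Name/location group rather than only Tom/UK (an assumption beyond the question as asked), a groupby-transform sketch would look like this:
import pandas as pd

# Fill each group's NaNs with that group's own mode; groups whose
# values are all NaN have an empty mode and are left unchanged.
df['Value'] = (df.groupby(['Name', 'location'])['Value']
                 .transform(lambda s: s.fillna(s.mode().iloc[0])
                            if not s.mode().empty else s))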

Left shift with condition in pandas

I have some problems with a CSV file. I have tried several solutions with the pandas library, but none has worked for me. I want to left-shift 3 columns whenever one of them contains a certain code (in this case 11 or 22). For example, this would be my input:
code   name    %    code 2   name 2    % 2   code 3   name 3   % 3
11     John    34   44       Rob       23    33       Peter    15
22     Ken     45   33       Peter     45    44       Rob      25
33     Peter   34   66       Abraham   37    77       Harry    67
11     John    45   77       Harry     39    88       Mary     20
And I expect something like this:
code   name    %    code 2   name 2    % 2   code 3   name 3   % 3
44     Rob     23   33       Peter     15
33     Peter   45   44       Rob       25
33     Peter   34   66       Abraham   37    77       Harry    67
77     Harry   39   88       Mary      20
Any idea how I could solve my problem with pandas?
Thanks in advance!
Do you want this?
mask = df['code'].isin([11, 22])
df.loc[mask] = df.loc[mask].shift(-3, axis=1)
Output -
code name % code 2 name 2 % 2 code 3 name 3 % 3
0 44.0 Rob 23.0 33.0 Peter 15.0 NaN NaN NaN
1 33.0 Peter 45.0 44.0 Rob 25.0 NaN NaN NaN
2 33.0 Peter 34.0 66.0 Abraham 37.0 77.0 Harry 67.0
3 77.0 Harry 39.0 88.0 Mary 20.0 NaN NaN NaN
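`shift(-3, axis=1)` moves every value in the masked rows three columns to the left and fills the vacated trailing columns with NaN. A self-contained sketch (column names as in the question, data abbreviated to two rows):
import pandas as pd

df = pd.DataFrame([[11, 'John', 34, 44, 'Rob', 23, 33, 'Peter', 15],
                   [33, 'Peter', 34, 66, 'Abraham', 37, 77, 'Harry', 67]],
                  columns=['code', 'name', '%', 'code 2', 'name 2', '% 2',
                           'code 3', 'name 3', '% 3'])

mask = df['code'].isin([11, 22])               # rows whose leading code should be dropped
df.loc[mask] = df.loc[mask].shift(-3, axis=1)  # shift those rows 3 columns left
print(df)                                      # first row shifted; trailing cells NaN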

Binning Categorical Columns Programmatically Using Python

I am trying to bin categorical columns programmatically. Any idea how I can achieve this without manually hard-coding each value in that column?
Essentially, what I would like is a function that keeps all values that together account for 80% of the column [leaving the city names as they are] and replaces the remaining 20% of city names with the word 'Other'.
I.e.: if the first 17 city names make up 80% of that column, keep those city names as they are, else return 'Other'.
EG:
0 Brighton
1 Yokohama
2 Levin
3 Melbourne
4 Coffeyville
5 Whakatane
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Te Awamutu
11 Bishkek
12 Melbourne
13 Whanganui
14 Coffeyville
15 New York
16 Brisbane
17 Greymouth
18 Brisbane
19 Chuo City
20 Accra
21 Levin
22 Waiouru
23 Brisbane
24 New York
25 Chuo City
26 Lucerne
27 Whanganui
28 Los Angeles
29 Melbourne
df['city'].head(30).value_counts(ascending=False, normalize=True)*100
Melbourne 16.666667
Levin 10.000000
Brisbane 10.000000
Whanganui 6.666667
Coffeyville 6.666667
New York 6.666667
Chuo City 6.666667
Waiouru 3.333333
Greymouth 3.333333
Te Awamutu 3.333333
Bishkek 3.333333
Lucerne 3.333333
Ashburn 3.333333
Yokohama 3.333333
Whakatane 3.333333
Accra 3.333333
Brighton 3.333333
Los Angeles 3.333333
From Ashburn down, the names should be renamed to 'Other'.
I have tried the below, which is a start but not exactly what I want:
city_map = dict(df['city'].value_counts(ascending=False, normalize=True) * 100)
df['city_count'] = df['city'].map(city_map)

def count(df):
    if df["city_count"] > 10:
        return "High"
    elif df["city_count"] < 0:
        return "Medium"
    else:
        return "Low"

df.apply(count, axis=1)
I'm not expecting any code - just some guidance on where to start or ideas on how I can achieve this
We can groupby on city and get the size of each city. We divide those values by the length of our dataframe with len and calculate the cumsum. Last step is to check from which point we exceed the threshold, so we can broadcast the boolean series back to your dataframe with map.
threshold = 0.7
m = df['city'].map(df.groupby('city')['city'].size()
                     .sort_values(ascending=False)
                     .div(len(df))
                     .cumsum()
                     .le(threshold))
df['city'] = np.where(m, df['city'], 'Other')
city
0 Other
1 Other
2 Levin
3 Melbourne
4 Coffeyville
5 Other
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Other
11 Bishkek
12 Melbourne
13 Other
14 Coffeyville
15 New York
16 Brisbane
17 Other
18 Brisbane
19 Chuo City
20 Other
21 Levin
22 Other
23 Brisbane
24 New York
25 Chuo City
26 Other
27 Other
28 Other
29 Melbourne
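An equivalent, slightly shorter formulation (a sketch, using the 80% threshold from the question) builds the keep/drop decision from value_counts(normalize=True) directly:
import pandas as pd

freq = df['city'].value_counts(normalize=True)   # share of each city, sorted descending
keep = freq.cumsum().le(0.8)                     # True while still within the top 80%
df['city'] = df['city'].where(df['city'].map(keep), 'Other')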
old method
If I understand you correctly, you want to calculate a cumulative sum with .cumsum and check when it exceeds your set threshold.
Then we use np.where to conditionally fill in the City name or Other.
threshold = 80
m = df['Normalized'].cumsum().le(threshold)
df['City'] = np.where(m, df['City'], 'Other')
City Normalized
0 Auckland 40.399513
1 Christchurch 13.130783
2 Wellington 12.267604
3 Hamilton 4.026242
4 Tauranga 3.867353
5 (not set) 3.540075
6 Dunedin 2.044508
7 Other 1.717975
8 Other 1.632849
9 Other 1.520342
10 Other 1.255651
11 Other 1.173878
12 Other 1.040508
13 Other 0.988166
14 Other 0.880502
15 Other 0.766877
16 Other 0.601468
17 Other 0.539067
18 Other 0.471824
19 Other 0.440903
20 Other 0.440344
21 Other 0.405884
22 Other 0.365836
23 Other 0.321131
24 Other 0.306602
25 Other 0.280524
26 Other 0.237123
27 Other 0.207878
28 Other 0.186084
29 Other 0.167085
30 Other 0.163732
31 Other 0.154977
Note: this method assumes that your Normalized column is sorted in descending order.

Add new column and remove duplicates, replacing null values column-wise

Duplication type:
- Check this column only (default)
- Check other columns only
- Check all columns

Use Last Value:
- True: retain the last duplicate value
- False: retain the first of the duplicates (default)

This rule should add a new column to the dataframe which contains the same values as the source column for unique rows and is null for duplicate rows.
The basic code is df.loc[df.duplicated(), get_unique_column_name(df, "clean")] = df[get_column_name(df, column)], with the parameters for duplicated() set based on the duplication type.
See the reference for this function: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html
You should set the subset parameter based on the duplication type, and the keep parameter based on Use Last Value above.
This is my file.
Jason Miller 42 4 25
Tina Ali 36 31 57
Jake Milner 24 2 62
Jason Miller 42 4 25
Jake Milner 24 2 62
Amy Cooze 73 3 70
Jason Miller 42 4 25
Jason Miller 42 4 25
Jake Milner 24 2 62
Jake Miller 42 4 25
I want to get the output below using pandas; in the file below I have chosen 2 columns.
Jason Miller 42 4 25
Jake Ali 36 31 57
Jake Milner 24 2 62
Jason Miller 4 25
Jake Milner 2 62
Jake Cooze 73 3 70
Jason Miller 4 25
Jason Miller 4 25
Jake Milner 2 62
Jake Miller 4 25
Could anybody please reply to my query?
You can use df.duplicated and assign the values of column C where the first occurrence of the values appears along the columns.
You can then fill the NaNs produced with empty strings to produce the required dataframe.
df = pd.read_csv(data, delim_whitespace=True, header=None, names=['A','B','C','D','E'])
df.loc[~df.duplicated(), "C'"] = df['C']
df.fillna('', inplace=True)
df = df[["A","B", "C'","D","E"]]
print(df)
A B C' D E
0 Jason Miller 42 4 25
1 Tina Ali 36 31 57
2 Jake Milner 24 2 62
3 Jason Miller 4 25
4 Jake Milner 2 62
5 Amy Cooze 73 3 70
6 Jason Miller 4 25
7 Jason Miller 4 25
8 Jake Milner 2 62
9 Jake Miller 42 4 25
Another way of doing this would be to take a subset of the duplicated rows and replace the concerned column with empty strings. Then you could use update to modify the dataframe in place with the original, df.
In [2]: duplicated_cols = df[df.duplicated(subset=['C', 'D', 'E'])]
In [3]: duplicated_cols
Out[3]:
A B C D E
3 Jason Miller 42 4 25
4 Jake Milner 24 2 62
6 Jason Miller 42 4 25
7 Jason Miller 42 4 25
8 Jake Milner 24 2 62
9 Jake Miller 42 4 25
In [4]: duplicated_cols.loc[:,'C'] = ''
In [5]: df.update(duplicated_cols)
In [6]: df
Out[6]:
A B C D E
0 Jason Miller 42 4.0 25.0
1 Tina Ali 36 31.0 57.0
2 Jake Milner 24 2.0 62.0
3 Jason Miller 4.0 25.0
4 Jake Milner 2.0 62.0
5 Amy Cooze 73 3.0 70.0
6 Jason Miller 4.0 25.0
7 Jason Miller 4.0 25.0
8 Jake Milner 2.0 62.0
9 Jake Miller 4.0 25.0
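Tying this back to the rule described in the question, a parameterized sketch might look like the following (the parameter names and the '_clean' naming scheme are assumptions; get_unique_column_name from the question is not reimplemented here):
import pandas as pd

def add_deduped_column(df, column, subset=None, use_last_value=False):
    """Copy `column` into a new column, blanking it on duplicate rows.

    subset         -- columns to check for duplication (the 'duplication type');
                      defaults to checking `column` only
    use_last_value -- True keeps the last duplicate, False keeps the first
    """
    keep = 'last' if use_last_value else 'first'
    dup = df.duplicated(subset=subset or [column], keep=keep)
    df[column + '_clean'] = df[column].where(~dup)  # NaN on duplicate rows
    return df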

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
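Since the stated goal was a boxplot, the NaNs can also simply be left in place: pandas' DataFrame.boxplot drops missing values per column, so no fillna is needed. A short sketch (assumes the yindex column from above has already been added):
import matplotlib.pyplot as plt

new_df = df.pivot(index="yindex", columns="Year", values="Money")
new_df.boxplot()   # NaNs are ignored per column
plt.show()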
