I have a DataFrame with two columns, first_columns and second_columns. first_columns is an id and second_columns is a room number.
As you can see from the picture, a given id serves in several different room numbers. Now I want to replace all the second_columns values with 1 or 0 under the following condition:
1) If a particular first_columns id does not serve in rooms 9, 10 and 11, replace all of its room numbers with 1; if it does, replace all of its room numbers with 0.
In the picture above, first_columns id 3737 does not work in rooms 9, 10 and 11, so all rows for 3737 get their room number replaced by 1.
I think you need groupby with transform to compare sets, then invert the mask with ~ and convert to integers:
df['new'] = ((~df.groupby('first_column')['second_column']
                 .transform(lambda x: set(x) >= set([9, 10, 11])))
               .astype(int))
print (df)
first_column second_column new
0 3767 2 1
1 3767 4 1
2 3767 6 1
3 6282 2 0
4 6282 9 0
5 6282 10 0
6 6282 11 0
7 10622 0 1
8 13096 7 1
9 13096 10 1
10 13896 11 1
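A minimal, self-contained sketch of the same approach, with the sample frame rebuilt from the printed output above; transform broadcasts the one set comparison per id back to every row of that id:

import pandas as pd

# frame reconstructed from the output shown above
df = pd.DataFrame({
    'first_column':  [3767, 3767, 3767, 6282, 6282, 6282, 6282, 10622, 13096, 13096, 13896],
    'second_column': [2, 4, 6, 2, 9, 10, 11, 0, 7, 10, 11],
})

# True where the id serves in all of rooms 9, 10 and 11
serves_all = df.groupby('first_column')['second_column'].transform(
    lambda x: set(x) >= {9, 10, 11})

df['new'] = (~serves_all).astype(int)
print(df)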
Index number 72 is missing from the original dataframe, as shown in the image. I want to cut the dataframe like [0:71, :], with the condition that when the index sequence breaks, the dataframe is automatically cut at the previous index value.
Shift the index by -1, subtract the original index, and flag where the difference is greater than 1 (the row just before the gap); then reverse with [::-1] and apply Series.cummax so everything up to the gap is flagged, and finally filter with boolean indexing:
df = pd.DataFrame({'a': range(3,13)}).drop(3)
print (df)
a
0 3
1 4
2 5
4 7
5 8
6 9
7 10
8 11
9 12
df = df[df.index.to_series().shift(-1, fill_value=0).sub(df.index).gt(1)[::-1].cummax()]
print (df)
a
0 3
1 4
2 5
I came up with this:
df = pd.DataFrame({'col': [1,2,3,4,5,6,7,8,9]}, index=[-1,0,1,2,3,4,5,7,8])

# position of the first break in the index (or len(df) if there is none), plus 1
ind = next((i for i in range(len(df)-1) if df.index[i]+1 != df.index[i+1]), len(df)) + 1

>>> df.iloc[:ind]
col
-1 1
0 2
1 3
2 4
3 5
4 6
5 7
With numpy, keep the rows whose index matches a consecutive range starting from the first index value, up to the first mismatch (excluded):
df[np.minimum.accumulate(df.index==np.arange(df.index[0], df.index[0]+len(df)))]
Example:
col
-1 1
0 2
1 3
3 4
4 5
output:
col
-1 1
0 2
1 3
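Putting the numpy approach together as a runnable sketch (the frame is rebuilt from the example above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5]}, index=[-1, 0, 1, 3, 4])

# True while the index still matches a consecutive range starting at the
# first index value; minimum.accumulate keeps everything False after the
# first mismatch
mask = np.minimum.accumulate(df.index == np.arange(df.index[0], df.index[0] + len(df)))

print(df[mask])
#    col
# -1   1
#  0   2
#  1   3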
I have a df:
id value
1 10
2 15
1 10
1 10
2 13
3 10
3 20
I am trying to keep only the rows whose id has exactly 1 unique value in column value, so that the result df looks like this:
id value
1 10
1 10
1 10
I dropped ids 2 and 3 because they have more than one unique value in column value: 15, 13 and 10, 20 respectively.
I read this answer.
But this simply removes duplicates whereas I want to check if a given column - in this case column value has more than 1 unique value.
I tried:
df['uniques'] = pd.Series(df.groupby('id')['value'].nunique())
But this returns NaN for every row, since I am trying to fit n group results onto n+m rows after grouping. I could write a function and apply it to every row, but I was wondering if there is a smart, quick filter that achieves my goal.
Use transform with groupby to align the group values to the individual rows:
df['nuniques'] = df.groupby('id')['value'].transform('nunique')
Output:
id value nuniques
0 1 10 1
1 2 15 2
2 1 10 1
3 1 10 1
4 2 13 2
5 3 10 2
6 3 20 2
If you only need to filter your data, you don't need to assign the new column:
df[df.groupby('id')['value'].transform('nunique') == 1]
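For example, with the question's data this filter keeps only the id 1 rows (a small sketch for reference):

import pandas as pd

df = pd.DataFrame({'id':    [1, 2, 1, 1, 2, 3, 3],
                   'value': [10, 15, 10, 10, 13, 10, 20]})

# keep only ids whose values are all identical
out = df[df.groupby('id')['value'].transform('nunique') == 1]
print(out)
#    id  value
# 0   1     10
# 2   1     10
# 3   1     10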
Let us do groupby.filter:
out = df.groupby('id').filter(lambda x : x['value'].nunique()==1)
Out[6]:
id value
0 1 10
2 1 10
3 1 10
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
      id  value
id
1  0   1      1
   1   1      2
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
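If "top 2" should mean the two largest values per id rather than the first two rows, the ordering note above can be made explicit; a small sketch, rebuilding the question's sample frame:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# sort so the largest values per id come first, then take two per group
top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2)
          .reset_index(drop=True))
print(top2)
#    id  value
# 0   1      3
# 1   1      2
# 2   2      4
# 3   2      3
# 4   3      1
# 5   4      1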
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1   2     3
    1     2
2   6     4
    5     3
3   7     1
4   8     1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
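A small sketch of that cleanup on the question's frame, dropping the original-index level so only the group key remains:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

s = df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)
print(s)
# id
# 1    3
# 1    2
# 2    4
# 2    3
# 3    1
# 4    1
# Name: value, dtype: int64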
Sometimes sorting the whole data ahead of time is very time consuming. We can group by first and take the top k rows of each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value inside head() is the same as the value passed to nlargest: the number of rows to keep per group.
reset_index is optional and not necessary.
This works for duplicated values
If there are duplicates among the top-n values and you want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the audit department we get top 3 salaries as 110k,100k and 100k.
If we want to have not-duplicated salaries per each department, we can do this:
(df.groupby('department')['salary']
   .apply(lambda ser: ser.drop_duplicates().nlargest(3))
   .droplevel(level=1)
   .sort_index()
   .reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# to keep entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# to keep only a specific column
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
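Relatedly, for the SQL row_number()-style numbering the question also asks about, groupby().cumcount() gives a per-group counter that can be filtered directly; a brief sketch:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# 0-based row number within each id, analogous to SQL's row_number()
df['rn'] = df.groupby('id').cumcount()

print(df[df['rn'] < 2][['id', 'value']])
#    id  value
# 0   1      1
# 1   1      2
# 3   2      1
# 4   2      2
# 7   3      1
# 8   4      1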
I have a pandas dataframe where I am trying to replace duplicate values with 0 (I don't want to delete them) within a certain range of days.
So, in the example given below, I want to replace duplicate values in all columns with 0 within a range of, let's say, 3 days (the number can be changed). The desired result is also given below.
A B C
01-01-2011 2 10 0
01-02-2011 2 12 2
01-03-2011 2 10 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 5 23 1
01-07-2011 4 21 4
01-08-2011 2 21 5
01-09-2011 1 11 0
So, the output should look like
A B C
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
Any help will be appreciated.
You can use df.shift() for this to look at a value one or more rows up or down (the number of rows is given by the x in .shift(x)).
You can use that in combination with .loc to select all rows that have a value identical to one of the two rows above and then replace it with 0.
Something like this should work (the code is written to be flexible for any number of columns and any number of days):
numberOfDays = 3  # number of days to compare
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)
This gives me the output:
A B C
date
01-01-2011 2 10 0
01-02-2011 0 12 2
01-03-2011 0 0 0
01-04-2011 3 11 3
01-05-2011 5 15 0
01-06-2011 0 23 1
01-07-2011 4 21 4
01-08-2011 2 0 5
01-09-2011 1 11 0
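For reference, a minimal setup that reproduces the run above; the frame is rebuilt from the example, and the dates are kept as plain strings since .shift() only relies on row order here:

import pandas as pd

df = pd.DataFrame(
    {'A': [2, 2, 2, 3, 5, 5, 4, 2, 1],
     'B': [10, 12, 10, 11, 15, 23, 21, 21, 11],
     'C': [0, 2, 0, 3, 0, 1, 4, 5, 0]},
    index=['01-0%d-2011' % d for d in range(1, 10)])
df.index.name = 'date'

# loop from the answer above
numberOfDays = 3
for col in df.columns:
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0
print(df)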
I don't find anything better than looping over all columns, because every column leads to a different grouping.
First define a function which does what you want at grouped level, i.e. setting all but the first entry to zero:
def set_zeros(g):
    g.values[1:] = 0
    return g

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)
This custom function can be applied to each group, which is defined by a time range (freq='3D') and equal values of a column within this period. As the columns generally have their equal values in different rows, this has to be done for each column in a loop.
Change freq to 5D, 10D or 20D for your other considerations.
For a detailed description of how to define the time period see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
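A runnable sketch of this approach on the question's example. pd.Grouper(freq='3D') needs a DatetimeIndex, so the date strings are converted first; set_zeros is written here to return a modified copy instead of mutating the group in place, which is the safer variant on recent pandas versions:

import pandas as pd

df = pd.DataFrame(
    {'A': [2, 2, 2, 3, 5, 5, 4, 2, 1],
     'B': [10, 12, 10, 11, 15, 23, 21, 21, 11],
     'C': [0, 2, 0, 3, 0, 1, 4, 5, 0]},
    index=pd.to_datetime(['01-0%d-2011' % d for d in range(1, 10)],
                         format='%m-%d-%Y'))

def set_zeros(g):
    # keep the first occurrence in each (value, 3-day bin) group, zero the rest
    out = g.copy()
    out.iloc[1:] = 0
    return out

for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')])[c].transform(set_zeros)

print(df)   # matches the desired output in the question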
I have a large time-series df (2.5 million rows) that contains 0 values in some rows, some of which are legitimate. However, if there are repeated continuous occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle and leave the remaining 0s, to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of the zero run that triggers deletion should be a settable parameter - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if it is 0 and either the previous or the next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive zeros
Following the example in the linked answer, add a new column that tracks the size of each run of consecutive values, then use it to filter:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]   # 2 = longest zero run to keep
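As a check, a self-contained sketch applying this to the series from the question (assuming the column is named ColA and the threshold is 2):

import pandas as pd

df = pd.DataFrame({'ColA': [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]})

# size of the run of identical consecutive values each row belongs to
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')

# drop zero runs longer than 2, keep everything else
out = df[~((df.consecutive > 2) & (df.ColA == 0))]
print(out.ColA.tolist())   # [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]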
We need to build a new parameter here, then use drop_duplicates:
df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
s = pd.concat([df.loc[df.A.ne(0), :],
               df.loc[df.A.eq(0), :].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
# df.A.eq(0) finds the values equal to 0
# .diff().ne(0).cumsum() starts a new group whenever the zero/non-zero flag changes, so each run of equal flags gets the same group number