Imagine I have a dataframe like this:
df = pd.DataFrame({"ID":["A","B","C","C","D"],
"DAY 1":[0, 0, 4, 0, 8],
"DAY 2":[3, 0, 4, 1, 2],
"DAY 3":[0, 2, 9, 9, 6],
"DAY 4":[9, 2, 4, 5, 7]})
df
Out[7]:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 A 0 3 0 9
1 B 0 0 2 2
2 C 4 4 9 4
3 C 0 1 9 5
4 D 8 2 6 7
I would like to iterate over every row and replace the leading 0 values, i.e. the zeros that appear before the first non-zero value in that row.
The ID column should be excluded from this condition; only the other columns apply. And I would like to replace these values with NaN. So the output should be like this:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 A nan 3 0 9
1 B nan nan 2 2
2 C 4 4 9 4
3 C nan 1 9 5
4 D 8 2 6 7
Notice that the 0 in df.loc[0, "DAY 3"] is still there because it doesn't meet the condition: the leading zeros end once the non-zero value at df.loc[0, "DAY 2"] appears.
Could anyone help me?
You can use a boolean cummin on a subset of the DataFrame to generate a mask for boolean indexing:
mask = (df.filter(like='DAY').eq(0).cummin(axis=1)
.reindex(columns=df.columns, fill_value=False)
)
df[mask] = float('nan')
print(df)
Output:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 A NaN 3.0 0 9
1 B NaN NaN 2 2
2 C 4.0 4.0 9 4
3 C NaN 1.0 9 5
4 D 8.0 2.0 6 7
Intermediate mask:
ID DAY 1 DAY 2 DAY 3 DAY 4
0 False True False False False
1 False True True False False
2 False False False False False
3 False True False False False
4 False False False False False
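If you prefer not to mutate df in place, a small variant (a sketch reusing the same mask built above) is DataFrame.mask, which returns a new frame with NaN wherever the mask is True:
out = df.mask(mask)   # df itself is left untouched; out matches the output shown above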
I have a dataframe that represents several different machine ids, their job number, and the value they output, as follows:
id job value
0 1 1 42
1 1 2 42
2 1 3 42
3 1 4 45
4 2 1 38
5 2 2 38
6 2 3 40
7 2 4 40
8 2 5 42
9 3 1 44
10 3 2 44
11 3 3 43
12 3 4 43
A machine gets each job done in 20 seconds. My goal is to know how many times the value changed per minute. For example, the intermediate step dataframe would be as follows:
id changes
0 1 0 # jobs 1-3=60 seconds. No changes in value
1 1 1 # job 4. Changed from previous value
2 2 1 # jobs 1-3. One change
3 2 1 # jobs 4-5. One change
4 3 1 # jobs 1-3. One change
5 3 0 # job 4. No changes
Then I can easily calculate the end result (change rate per minute) by summing the changes column and dividing by the number of entries.
id rate
0 1 0.5
1 2 1.0
2 3 0.5
I looked at other questions that partially answer mine, such as this one that uses df.groupby(df.index // 3) to do the bins, but in my case I want that grouping to be per id (groupby(df["id"]) plus a within-group position // 3?). And something like df['changes'] = df.value - df.value.shift()?
Edit to add more test cases:
# Values start at zero to indicate when changes happen
df = pd.DataFrame(
{
"id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
"job": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4],
"value": [0, 1, 2, 3, 4, 5, 0, 0, 0, 0, 1, 1, 2]
}
)
Expected result is
1 2.5 # 5 changes total over two minutes
2 0.0 # 0 changes total over one minute
3 1.0 # 2 changes total over a minute and 20 seconds (2 periods)
You can use groupby().rolling() to group the rows, then compare the max and min to know whether the value has changed:
rolling = df.groupby('id')['value'].rolling(3)
intermediate = rolling.max().dropna() != rolling.min().dropna()
out = intermediate.groupby('id').mean()
Output (intermediate):
id
1 2 False
3 True
2 6 True
7 True
8 True
3 11 True
12 True
Name: value, dtype: bool
Output:
id
1 0.5
2 1.0
3 1.0
Name: value, dtype: float64
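If you would rather use non-overlapping bins of three jobs per id (closer to the question's expected intermediate) instead of overlapping rolling windows, here is a possible sketch, assuming the DataFrame from the edit:
# position of each job within its id, binned into groups of 3 (= one minute)
bins = df.groupby('id').cumcount() // 3
# a change happens when the value differs from the previous value of the same id
diffs = df.groupby('id')['value'].diff()
changed = diffs.notna() & diffs.ne(0)
# count changes per bin, then average the bins of each id to get the rate per minute
changes = changed.groupby([df['id'], bins]).sum()
rate = changes.groupby(level=0).mean()
print(rate)   # with the edited test data: 1 -> 2.5, 2 -> 0.0, 3 -> 1.0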
Thanks to @quang hoang's answer, I was able to reach my goal. Here's my code:
JOB_DURATION_SECONDS = 10
print(df)
# Compute the number of changes
df["num_of_changes"] = df.groupby("id").value.apply(
lambda x: x.rolling(2).apply(lambda x: x.iloc[0] < x.iloc[1])
)
print(df)
# Group by id and calculate the rate per minute:
# total changes * 60 / (number of jobs * seconds per job)
out = (
    df[["id", "num_of_changes"]]
    .groupby(["id"])
    .apply(lambda x: x["num_of_changes"].sum() * 60 / (len(x) * JOB_DURATION_SECONDS))
)
print(out)
Resulting in
# Input
id job value
0 1 1 0
1 1 2 1
2 1 3 2
3 1 4 3
4 1 5 4
5 1 6 5
6 2 1 0
7 2 2 0
8 2 3 0
9 3 1 0
10 3 2 0
11 3 3 1
12 3 4 1
13 3 5 1
14 3 6 1
15 3 7 1
# Intermediate
id job value num_of_changes
0 1 1 0 NaN
1 1 2 1 1.0
2 1 3 2 1.0
3 1 4 3 1.0
4 1 5 4 1.0
5 1 6 5 1.0
6 2 1 0 NaN
7 2 2 0 0.0
8 2 3 0 0.0
9 3 1 0 NaN
10 3 2 0 0.0
11 3 3 1 1.0
12 3 4 1 0.0
13 3 5 1 0.0
14 3 6 1 0.0
15 3 7 1 0.0
# Out
id
1 5.000000
2 0.000000
3 0.857143
dtype: float64
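For reference, a shorter equivalent of the code above (a sketch that, like the lambda, counts only increases and assumes the same JOB_DURATION_SECONDS):
increases = df.groupby("id")["value"].diff().gt(0)              # True where the value went up
seconds_per_id = df.groupby("id").size() * JOB_DURATION_SECONDS
rate = increases.groupby(df["id"]).sum() * 60 / seconds_per_id
print(rate)   # reproduces the "Out" block above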
Given the following data
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [4, 5, 9, 5, 6, 4, 0]})
df["split_by"] = df["b"].eq(9)
which looks as
a b split_by
0 1 4 False
1 2 5 False
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
I would like to create two dataframes as follows:
a b split_by
0 1 4 False
1 2 5 False
and
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
Clearly this is based on the value in column split_by, but I'm not sure how to subset using this.
My approach is:
split_1 = df.index < df[df["split_by"].eq(True)].index.to_list()[0]
split_2 = ~df.index.isin(split_1)
df1 = df[split_1]
df2 = df[split_2]
Use argmax as:
true_index = df['split_by'].argmax()
df1 = df.loc[:true_index-1, :]
df2 = df.loc[true_index:, :]
print(df1)
a b split_by
0 1 4 False
1 2 5 False
print(df2)
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
Another approach:
i = df[df['split_by']==True].index.values[0]
df1 = df.iloc[:i]
df2 = df.iloc[i:]
This assumes you have only one True. If you have more than one True, this code will still split df into only two dataframes, using only the first True.
Use groupby with cumsum. Note that if you have more than one True, this will split the dataframe into n+1 dataframes (for n Trues):
d = {x: y for x, y in df.groupby(df.split_by.cumsum())}
d[0]
a b split_by
0 1 4 False
1 2 5 False
d[1]
a b split_by
2 3 9 True
3 4 5 False
4 5 6 False
5 6 4 False
6 7 0 False
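If the index is the default RangeIndex and you just need the list of pieces, a positional variant (a sketch) that also generalizes to any number of True markers:
breaks = df.index[df['split_by']].tolist()        # positions of the True rows, here [2]
edges = [0] + breaks + [len(df)]
parts = [df.iloc[start:stop] for start, stop in zip(edges[:-1], edges[1:])]
# parts[0] and parts[1] are the two dataframes shown above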
I am a beginner in python.
I need to recode a CSV file:
unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
The condition is: for each [unique_id], if there is any [Age]==6, then put the value 1 in that group's row with [pid]==1; all other rows should be 0.
The output CSV will look like this:
unique_id,pid,Age,recode
1,1,1,0
1,2,3,0
2,1,5,1
2,2,6,0
3,1,6,1
3,2,4,0
3,3,6,0
3,4,1,0
3,5,4,0
4,1,6,1
4,2,5,0
I was using numpy, like the following:
import numpy
input_file1 = "data.csv"
input_folder = 'G:/My Drive/'
Her_HH =pd.read_csv(input_folder + input_file1)
Her_HH['recode'] = numpy.select([Her_PP['Age']==6,Her_PP['Age']<6], [1,0], default=Her_HH['recode'])
Her_HH.to_csv('recode_elderly.csv', index=False)
but it does not put the value 1 where [pid] is 1. Any help will be appreciated.
You can use DataFrame.assign to add a helper test column, GroupBy.transform with 'any' to check whether at least one Age in each group equals 6, chain that mask with the pid == 1 test using & for bitwise AND, and finally cast the output to integers:
#sorting if necessary
df = df.sort_values('unique_id')
m1 = df.assign(test=df['Age'] == 6).groupby('unique_id')['test'].transform('any')
Another idea to get the groups containing a 6: filter their unique_id values and test membership with Series.isin:
m1 = df['unique_id'].isin(df.loc[df['Age'] == 6, 'unique_id'])
m2 = df['pid'] == 1
df['recode'] = (m1 & m2).astype(int)
print (df)
unique_id pid Age recode
0 1 1 1 0
1 1 2 3 0
2 2 1 5 1
3 2 2 6 0
4 3 1 6 1
5 3 2 4 0
6 3 3 6 0
7 3 4 1 0
8 3 5 4 0
9 4 1 6 1
10 4 2 5 0
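Since the question started from numpy, the final step can equivalently be written with numpy.where (a sketch, reusing the m1 and m2 masks above and assuming numpy is imported as np):
df['recode'] = np.where(m1 & m2, 1, 0)   # 1 where the group has an Age of 6 and pid is 1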
EDIT:
To check the groups with no Age matching 6, filter by the inverted mask with ~; if you want only one row per unique_id, add DataFrame.drop_duplicates:
print (df[~m1])
unique_id pid Age
0 1 1 1
1 1 2 3
df1 = df[~m1].drop_duplicates('unique_id')
print (df1)
unique_id pid Age
0 1 1 1
This is a bit clumsy, since I know numpy a lot better than pandas.
Load your csv sample into a dataframe:
In [205]: df = pd.read_csv('stack59885878.csv')
In [206]: df
Out[206]:
unique_id pid Age
0 1 1 1
1 1 2 3
2 2 1 5
3 2 2 6
4 3 1 6
5 3 2 4
6 3 3 6
7 3 4 1
8 3 5 4
9 4 1 6
10 4 2 5
Generate a groupby object based on the unique_id column:
In [207]: gps = df.groupby('unique_id')
In [209]: gps.groups
Out[209]:
{1: Int64Index([0, 1], dtype='int64'),
2: Int64Index([2, 3], dtype='int64'),
3: Int64Index([4, 5, 6, 7, 8], dtype='int64'),
4: Int64Index([9, 10], dtype='int64')}
I've seen pandas ways of iterating over groups, but here's a list comprehension. Each iteration produces a tuple with the id and a group dataframe. We want to test each group dataframe for 'Age' and 'pid' values:
In [211]: recode_values = [(gp['Age']==6).any() & (gp['pid']==1) for x, gp in gps]
In [212]: recode_values
Out[212]:
[0 False
1 False
Name: pid, dtype: bool, 2 True
3 False
Name: pid, dtype: bool, 4 True
5 False
6 False
7 False
8 False
Name: pid, dtype: bool, 9 True
10 False
Name: pid, dtype: bool]
The result is a list of Series, with a True where pid is 1 and there's an Age of 6 in the group.
Joining these Series with numpy.hstack produces a boolean array, which we can convert to an integer array:
In [214]: np.hstack(recode_values)
Out[214]:
array([False, False, True, False, True, False, False, False, False,
True, False])
In [215]: df['recode']=_.astype(int) # assign that to a new column
In [216]: df
Out[216]:
unique_id pid Age recode
0 1 1 1 0
1 1 2 3 0
2 2 1 5 1
3 2 2 6 0
4 3 1 6 1
5 3 2 4 0
6 3 3 6 0
7 3 4 1 0
8 3 5 4 0
9 4 1 6 1
10 4 2 5 0
Again, I think there's an idiomatic pandas way of joining those series. But for now this works.
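(One such idiomatic join, as a sketch: pd.concat stitches the list of Series back together, and since each piece keeps its original index the result aligns with df directly.)
df['recode'] = pd.concat(recode_values).astype(int)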
===
OK, the groupby object has an apply:
In [223]: def foo(gp):
...: return (gp['Age']==6).any() & (gp['pid']==1).astype(int)
...:
In [224]: gps.apply(foo)
Out[224]:
unique_id
1 0 0
1 0
2 2 1
3 0
3 4 1
5 0
6 0
7 0
8 0
4 9 1
10 0
Name: pid, dtype: int64
And remove the multi-indexing with:
In [242]: gps.apply(foo).reset_index(0, True)
Out[242]:
0 0
1 0
2 1
3 0
4 1
5 0
6 0
7 0
8 0
9 1
10 0
Name: pid, dtype: int64
In [243]: df['recode']=_ # and assign to recode
Lots of experimenting and learning here.
Lets say I have something that looks like this
df = pd.DataFrame({'Event':['A','A','A','A', 'A' ,'B','B','B','B','B'], 'Number':[1,2,3,4,5,6,7,8,9,10],'Ref':[False,False,False,False,True,False,False,False,True,False]})
What I want to do is create a new column which is the difference in Number from the row where Ref is True. For the A group the True is the last row, so the column would read -4, -3, -2, -1, 0. I have been thinking of doing the following:
for col in df.groupby('Event'):
temp = col[1]
reference = temp[temp.Ref==True]
dist1 = temp.apply(lambda x:x.Number-reference.Number,axis=1)
This seems to correctly calculate for each group, but I am not sure how to join the result into the df.
In your case
df['new']=(df.set_index('Event').Number-df.query('Ref').set_index('Event').Number).to_numpy()
df
Event Number Ref new
0 A 1 False -4
1 A 2 False -3
2 A 3 False -2
3 A 4 False -1
4 A 5 True 0
5 B 6 False -3
6 B 7 False -2
7 B 8 False -1
8 B 9 True 0
9 B 10 False 1
You could do the following:
df["new"] = df.Number - df.Number[df.groupby('Event')['Ref'].transform('idxmax')].reset_index(drop=True)
print(df)
Output
Event Number Ref new
0 A 1 False -4
1 A 2 False -3
2 A 3 False -2
3 A 4 False -1
4 A 5 True 0
5 B 6 False -3
6 B 7 False -2
7 B 8 False -1
8 B 9 True 0
9 B 10 False 1
This: df.groupby('Event')['Ref'].transform('idxmax') will find, per group, the index where Ref is True. Basically it finds the index of the maximum value in each group, and given that True = 1 and False = 0, that is the index of the (first) True value.
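For illustration, a sketch of that intermediate step on the sample data:
idx = df.groupby('Event')['Ref'].transform('idxmax')
# every row of Event A gets 4 (the position of A's True row) and every row of Event B gets 8,
# so df.Number[idx] picks the reference Number for each row before the subtraction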
Try where and groupby transform with 'first':
s = df.Number.where(df.Ref).groupby(df.Event).transform('first')
df.Number - s
Out[319]:
0 -4.0
1 -3.0
2 -2.0
3 -1.0
4 0.0
5 -3.0
6 -2.0
7 -1.0
8 0.0
9 1.0
Name: Number, dtype: float64
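For completeness, the asker's own loop can be finished by collecting the per-group results and concatenating them back, aligned on the original index (a sketch):
pieces = []
for _, grp in df.groupby('Event'):
    ref_number = grp.loc[grp.Ref, 'Number'].iloc[0]   # Number of the True row in this group
    pieces.append(grp.Number - ref_number)
df['new'] = pd.concat(pieces)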
I have a dataset which I need to modify using pandas. Below are the details of the particular column I need to work on:
df["Dependents"].value_counts()
0 345
1 102
2 101
3+ 51
Name: Dependents, dtype: int64
df["Dependents"].notnull().value_counts()
True 599
False 15
Name: Dependents, dtype: int64
I need to fill the null values with 0, 1 or 2, cycling one by one: for the first null row I assign 0, the next one gets 1, the next one 2, and then I start again from 0 until all null values are filled.
How can I achieve this?
IIUC you can do it this way:
assuming you have the following DF:
In [214]: df
Out[214]:
Dependents
0 NaN
1 0
2 0
3 0
4 NaN
5 1
6 NaN
7 3+
8 NaN
9 3+
10 2
11 3+
12 1
13 NaN
Solution:
In [215]: idx = df.index[df.Dependents.isnull()]
In [216]: idx
Out[216]: Int64Index([0, 4, 6, 8, 13], dtype='int64')
In [217]: df.loc[idx, 'Dependents'] = np.take(list('012'), [x%3 for x in range(len(idx))])
In [218]: df
Out[218]:
Dependents
0 0
1 0
2 0
3 0
4 1
5 1
6 2
7 3+
8 0
9 3+
10 2
11 3+
12 1
13 1
Similar to MaxU's answer, but using numpy put with 'wrap' mode.
Sample dataframe (df):
Dependents
0 NaN
1 0
2 0
3 0
4 NaN
5 1
6 NaN
7 3+
8 NaN
9 3+
10 2
11 3+
12 1
13 NaN
idx = df.index[df.Dependents.isnull()]
np.put(df.Dependents, idx, [0, 1, 2], mode='wrap')
Dependents
0 0
1 0
2 0
3 0
4 1
5 1
6 2
7 3+
8 0
9 3+
10 2
11 3+
12 1
13 1
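Note that np.put repeats the values whenever they are shorter than the list of target indices, and this call relies on Series.put, which newer pandas versions no longer provide. A version-independent variant of the same idea (a sketch):
idx = df.index[df.Dependents.isnull()]
df.loc[idx, 'Dependents'] = np.resize(['0', '1', '2'], len(idx))   # cycles 0, 1, 2, 0, 1, ...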