I am looking to write a quick script that will run through a CSV file with two columns and give me the rows where the value in column B switches from one value to another:
eg:
dataframe:
# | A | B
--+-----+-----
1 | 2 | 3
2 | 3 | 3
3 | 4 | 4
4 | 5 | 4
5 | 5 | 4
would tell me that the change happened between row 2 and row 3. I know how to get these values using for loops, but I was hoping there was a more Pythonic way of approaching this problem.
You can create a new column for the difference:
> df['C'] = df['B'].diff()
> print(df)
   #  A  B    C
0  1  2  3  NaN
1  2  3  3  0.0
2  3  4  4  1.0
3  4  5  4  0.0
4  5  5  4  0.0
Note that diff() leaves a NaN in the first row, and NaN != 0 evaluates to True, so fill it before filtering:
> df_filtered = df[df['C'].fillna(0) != 0]
> print(df_filtered)
   #  A  B    C
2  3  4  4  1.0
This gives you the required rows.
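If you also want the row just before each switch (the question asks for the pair of rows the change happened between), a minimal sketch along the same lines, using the question's sample data:

```python
import pandas as pd

# Sample data from the question (column B switches from 3 to 4)
df = pd.DataFrame({"A": [2, 3, 4, 5, 5], "B": [3, 3, 4, 4, 4]})

# True wherever B differs from the previous row; fillna(0) stops the
# leading NaN from diff() counting as a change
change = df["B"].diff().fillna(0).ne(0)

# Pair each change row with the row before it
pairs = [(i - 1, i) for i in df.index[change]]
print(pairs)  # [(1, 2)] -> the switch happens between index 1 and index 2
```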
You can do the following, which also works for non-numerical values:
>>> import pandas as pd
>>> df = pd.DataFrame({"Status": ["A","A","B","B","C","C","C"]})
>>> df["isStatusChanged"] = df["Status"].shift(1, fill_value=df["Status"].iloc[0]) != df["Status"]
>>> df
Status isStatusChanged
0 A False
1 A False
2 B True
3 B False
4 C True
5 C False
6 C False
>>>
Note the fill_value could be different depending on your application.
You can use this; it is much faster:
my_column_changes = df["MyStringColumn"].shift() != df["MyStringColumn"]
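Applied to the question's numeric data, the same shift comparison picks out the change rows directly; a small sketch (the NaN produced by shift() always compares unequal, so the first row is masked off explicitly):

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3, 4, 5, 5], "B": [3, 3, 4, 4, 4]})

changes = df["B"].shift() != df["B"]
changes.iloc[0] = False  # shift() leaves NaN in row 0, which compares unequal to everything

print(df.index[changes].tolist())  # [2]
```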
I have a df with multiple columns and am trying to select a subset of the data based on OR logic:
df [ (df['col1']==0) | (df['col2']==0) | (df['col3']==0) | (df['col4']==0) |
(df['col5']==0) | (df['col6']==0) | (df['col7']==0) | (df['col8']==0) |
(df['col9']==0) | (df['col10']==0) | (df['col11']==0) ]
When I apply this logic the result is empty, but I know some of the values are zero.
All the values of these columns are int64.
I noticed that 'col11' is all 1's. When I remove 'col11' or change the order of the query (e.g., putting "| (df['col11']==0)" in the middle), I get the expected results.
I wonder if anyone has had this problem or has any idea why I'm getting an empty df.
Use (df==0).any(axis=1):
>>> df
a b c d e f
0 6 8 7 19 3 14
1 14 19 3 13 10 10
2 6 18 16 0 15 12
3 19 4 14 3 8 3
4 4 14 15 1 6 11
>>> (df==0).any(axis=1)
0 False
1 False
2 True
3 False
4 False
>>> #subset of the columns
>>> (df[['a','c','e']]==0).any(axis=1)
0 False
1 False
2 False
3 False
4 False
dtype: bool
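To go from the boolean mask to the rows themselves, wrap it in boolean indexing; a minimal sketch with comparable fake data:

```python
import pandas as pd

df = pd.DataFrame({"a": [6, 14, 6], "b": [8, 19, 0], "c": [7, 3, 16]})

mask = (df == 0).any(axis=1)  # True for rows containing at least one zero
print(df[mask])               # only the row with b == 0 survives
```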
If the DataFrame is all integers, you can make use of the fact that zero is falsy and use
~df.all(axis=1)
To make fake data:
import numpy as np
import pandas as pd
rng = np.random.default_rng()
nrows = 5
df = pd.DataFrame(rng.integers(0,20,(nrows,6)),columns=['a', 'b', 'c', 'd','e','f'])
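A quick check that the two masks agree on all-integer data (0 is the only falsy integer, so ~df.all(axis=1) flags exactly the rows containing a zero):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, (5, 6)), columns=list("abcdef"))

# Both masks flag rows that contain at least one zero
assert ((df == 0).any(axis=1) == ~df.all(axis=1)).all()
```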
Given the following datatable Frame:
DT = dt.Frame({'A':['A','A','A','B','B','B'],
'B':['a','a','b','a','a','a'],
})
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
According to this thread, for pandas, cumcount() or rank() would be options, but they do not seem to be defined for pydatatable:
DT = DT[:, f[:].extend({'C': cumcount()}),by(f.A,f.B)]
DT = DT[:, f[:].extend({'C': rank(f.B)}),by(f.A,f.B)]
a) How can I number the rows within groups?
b) Is there a comprehensive resource with all the currently available functions for pydatatable?
Update:
datatable now has a cumcount function in the dev version:
DT[:, {'C':dt.cumcount() + 1}, by('A', 'B')]
| A B C
| str32 str32 int64
-- + ----- ----- -----
0 | A a 1
1 | A a 2
2 | A b 1
3 | B a 1
4 | B a 2
5 | B a 3
[6 rows x 3 columns]
Old answer:
This is a hack; in time there should be a built-in way to do a cumulative count. Meanwhile, you can take advantage of itertools or other performant tools within Python while still being very fast:
Step 1: Get the counts per group of columns A and B and export them to a list:
result = DT[:, dt.count(), by("A","B")][:,'count'].to_list()
Step 2: Use a combination of itertools.chain and a list comprehension to get the cumulative counts:
from itertools import chain
cumcount = chain.from_iterable([i+1 for i in range(n)] for n in result[0])
Step 3: Assign the result back to DT:
DT['C'] = dt.Frame(tuple(cumcount))
print(DT)
A B C
▪▪▪▪ ▪▪▪▪ ▪▪▪▪
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
6 rows × 3 columns
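For comparison, the pandas cumcount() mentioned in the linked thread does the same numbering in one line; a minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({"A": list("AAABBB"), "B": list("aabaaa")})

# cumcount() is zero-based, so add 1 to match the desired output
df["C"] = df.groupby(["A", "B"]).cumcount() + 1
print(df["C"].tolist())  # [1, 2, 1, 1, 2, 3]
```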
I have a pandas dataset that I want to clean up before applying my ML algorithm. I am wondering if it is possible to remove a row if an element of one of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working, as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1, I'll have something like 1abc. That is why I would like to remove anything that is not an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with isin on both columns combined with &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d (in newer NumPy, numpy.isin is preferred):
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for unparseable values, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin():
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
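The two steps can also be combined: coerce to numeric first, then apply the allowed-value check on the coerced series, so rows like '1abc' become NaN and fail isin automatically. A sketch under the question's assumptions (sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": ["1abc", "1", "3", "4"], "b": ["6", "6", "7", "7"]})

# Coerce dirty strings to numbers; unparseable values become NaN
a = pd.to_numeric(df["a"], errors="coerce")
b = pd.to_numeric(df["b"], errors="coerce")

# NaN is never in the allowed set, so dirty rows drop out automatically
df1 = df[a.isin([1, 3]) & b.isin([6, 7])]
print(df1)  # keeps only the clean rows with a in {1,3} and b in {6,7}
```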
I have a dataframe with three series. Column A contains a group_id. Column B contains True or False. Column C contains a 1-n ranking (where n is the number of rows per group_id).
I'd like to store a subset of this dataframe for each row that:
1) Column C == 1
OR
2) Column B == True
The following logic copies my old dataframe row for row into the new dataframe:
new_df = df[df.column_b | df.column_c == 1]
IIUC, starting from a sample dataframe like:
A,B,C
01,True,1
01,False,2
02,False,1
02,True,2
03,True,1
you can:
df = df[(df['C']==1) | (df['B']==True)]
which returns:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
You have a couple of methods for filtering, and performance varies based on the size of your data:
In [722]: df[(df['C']==1) | df['B']]
Out[722]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [723]: df.query('C==1 or B==True')
Out[723]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
In [724]: df[df.eval('C==1 or B==True')]
Out[724]:
A B C
0 1 True 1
2 2 False 1
3 2 True 2
4 3 True 1
I want to treat non-consecutive ids as different groups during groupby, so that I can return the first value of stamp and the sum of increment as a new dataframe. Here is sample input and output.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.array(['a','a','a','b','c','b','b','a','a','a']),
np.arange(1, 11), np.ones(10)]).T
df.columns = ['id', 'stamp', 'increment']
df_result = pd.DataFrame([ np.array(['a','b','c','b','a']),
np.array([1,4,5,6,8]), np.array([3,1,1,2,3])]).T
df_result.columns = ['id', 'stamp', 'increment_sum']
In [2]: df
Out[2]:
id stamp increment
0 a 1 1
1 a 2 1
2 a 3 1
3 b 4 1
4 c 5 1
5 b 6 1
6 b 7 1
7 a 8 1
8 a 9 1
9 a 10 1
In [3]: df_result
Out[3]:
id stamp increment_sum
0 a 1 3
1 b 4 1
2 c 5 1
3 b 6 2
4 a 8 3
I can accomplish this via
def get_result(d):
sum = d.increment.sum()
stamp = d.stamp.min()
name = d.id.max()
return name, stamp, sum
#idea from http://stackoverflow.com/questions/25147091/combine-consecutive-rows-with-the-same-column-values
df['key'] = (df['id'] != df['id'].shift(1)).astype(int).cumsum()
result = zip(*df.groupby([df.key]).apply(get_result))
df = pd.DataFrame(np.array(result).T)
df.columns = ['id', 'stamp', 'increment_sum']
But I'm sure there must be a more elegant solution.
Not the most optimal code, but it solves the problem:
> df_group = df.groupby('id')
We can't use id alone for the groupby, so add another new column to group by within id, based on whether the stamps are continuous or not:
> df['group_diff'] = df_group['stamp'].diff().apply(lambda v: float('nan') if v == 1 else v).ffill().fillna(0)
> df
id stamp increment group_diff
0 a 1 1 0
1 a 2 1 0
2 a 3 1 0
3 b 4 1 0
4 c 5 1 0
5 b 6 1 2
6 b 7 1 2
7 a 8 1 5
8 a 9 1 5
9 a 10 1 5
Now we can use the new column group_diff for secondary grouping. A sort is added at the end, as suggested in the comments, to get the exact expected output:
> df.groupby(['id','group_diff']).agg({'increment':'sum', 'stamp': 'first'}).reset_index()[['id', 'stamp','increment']].sort_values('stamp')
id stamp increment
0 a 1 3
2 b 4 1
4 c 5 1
3 b 6 2
1 a 8 3
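A more compact route keeps the question's own run-detection key (a new group starts whenever id changes) and uses named aggregation, so no helper function or post-sort is needed; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": list("aaabcbbaaa"),
    "stamp": range(1, 11),
    "increment": [1] * 10,
})

# New group whenever id differs from the previous row
key = (df["id"] != df["id"].shift()).cumsum()

out = df.groupby(key, sort=False).agg(
    id=("id", "first"),
    stamp=("stamp", "first"),
    increment_sum=("increment", "sum"),
).reset_index(drop=True)
print(out)
```

Because the key is cumulative, the groups come out already in order of first appearance, which matches the desired df_result.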