I have a dataframe where I have transformed all NaN to 0 for a specific reason. In doing another calculation on the df, my group by is picking up a 0 and making it a value to perform the counts on. Any idea how to get python and pandas to exclude the 0 value? In this case the 0 represents a single row in the data. Is there a way to exclude all 0's from the groupby?
My groupby looks like this
+----------------+----------------+-------------+
| Team | Method | Count |
+----------------+----------------+-------------+
| Team 1 | Automated | 1 |
| Team 1 | Manual | 14 |
| Team 2 | Automated | 5 |
| Team 2 | Hybrid | 1 |
| Team 2 | Manual | 25 |
| Team 4 | 0 | 1 |
| Team 4 | Automated | 1 |
| Team 4 | Hybrid | 13 |
+----------------+----------------+-------------+
My code looks like this (after importing the Excel file):
df = df1.fillna(0)
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
I'd filter the df prior to grouping:
In [8]:
a = df.loc[df['Method'] !=0, ['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[8]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 Automated 1
Hybrid 1
Here we only select rows where Method is not equal to 0. Compare against the output without filtering:
In [9]:
a = df[['Team', 'Method']]
b = a.groupby(['Team', 'Method']).agg({'Method' : 'count'})
b
Out[9]:
Method
Team Method
1 Automated 1
Manual 1
2 Automated 1
Hybrid 1
Manual 1
4 0 1
Automated 1
Hybrid 1
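As a side note (not part of the original answer), the same filtered count can also be written with size(), which avoids the dict-style agg:
# Same df as above; size() counts the rows per (Team, Method) group and
# reset_index turns the result back into a flat table with a Count column.
b = (df[df['Method'] != 0]
       .groupby(['Team', 'Method'])
       .size()
       .reset_index(name='Count'))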
You need the filter.
The filter method returns a subset of the original object. Suppose
we want to take only elements that belong to groups with a group sum
greater than 2.
Example:
In [94]: sf = pd.Series([1, 1, 2, 3, 3, 3])
In [95]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[95]:
3    3
4    3
5    3
dtype: int64
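Applied to the question's frame, a minimal sketch (the toy data below is assumed, shaped like the table above) that drops the 0 group with filter before counting:
import pandas as pd

# Toy frame shaped like the question's data; Method 0 stands in for the filled NaN rows.
df = pd.DataFrame({
    'Team':   ['Team 1', 'Team 1', 'Team 4', 'Team 4', 'Team 4'],
    'Method': ['Automated', 'Manual', 0, 'Automated', 'Hybrid'],
})

# filter() keeps only the rows belonging to groups that satisfy the condition,
# i.e. every Method group except the 0 group.
kept = df.groupby('Method').filter(lambda g: (g['Method'] != 0).all())
print(kept.groupby(['Team', 'Method']).size())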
If I have a python datatable like this:
from datatable import f, dt
data = dt.Frame(grp=["a","a","b","b","b","b","c"], value=[2,3,1,2,5,9,2])
how do I create a new column that has the row number, by group? That is, what is the equivalent of R data.table's
data[, id:=1:.N, by=.(grp)]
This works, but seems completely ridiculous
data['id'] = np.concatenate(
[np.arange(x)
for x in data[:,dt.count(), dt.by(f.grp)]['count'].to_numpy()])
desired output:
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
Update:
Datatable now has a cumcount function in dev:
data[:, [f.value, dt.cumcount()], 'grp']
| grp value C0
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
Old Answer:
datatable does not have a cumulative count function; in fact, there is no cumulative function for any aggregation at the moment.
One way to possibly improve the speed is to push the iteration into numpy, where the for loop runs in C and is more efficient. The code below is adapted from elsewhere for this purpose:
from datatable import dt, f, by
import numpy as np
def create_ranges(indices):
    # indices holds the size of each group; build 0..n-1 ranges for all
    # groups in one shot with cumsum.
    cum_length = indices.cumsum()
    ids = np.ones(cum_length[-1], dtype=int)
    ids[0] = 0
    # At each group boundary, jump back so the running cumsum restarts at 0.
    ids[cum_length[:-1]] = -1 * indices[:-1] + 1
    return ids.cumsum()
counts = data[:, dt.count(), by('grp', add_columns=False)].to_numpy().ravel()
data[:, f[:].extend({"counts" : create_ranges(counts)})]
| grp value counts
| str32 int32 int64
-- + ----- ----- ------
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
The create_ranges function is wonderful (the logic built on cumsum is nice) and really kicks in as the array size increases.
Of course this has its drawbacks: you step out of datatable into numpy territory and back again, and it relies on the groups being sorted lexically; it won't work if the data is unsorted (it would first have to be sorted on the grouping column).
Preliminary tests show a marked improvement in speed; again it is limited in scope and it would be much easier/better if this was baked into the datatable library.
If you are good with C++, you could consider contributing this function to the library; I and so many others would appreciate your effort.
You could have a look at pypolars and see if it helps with your use case. From the h2o benchmarks it looks like a very fast tool.
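For illustration, a hedged sketch with polars (pypolars has since been renamed to polars; expression names have shifted across versions, so treat the exact calls as assumptions):
import polars as pl

pf = pl.DataFrame({"grp": ["a", "a", "b", "b", "b", "b", "c"],
                   "value": [2, 3, 1, 2, 5, 9, 2]})

# Row number within each group: a 0..n-1 integer range computed per 'grp' window.
out = pf.with_columns(pl.int_range(0, pl.len()).over("grp").alias("id"))
print(out)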
One approach is to convert to_pandas, groupby (on the pandas DataFrame) and use cumcount:
import datatable as dt
data = dt.Frame(grp=["a", "a", "b", "b", "b", "b", "c"], value=[2, 3, 1, 2, 5, 9, 2])
data["id"] = data.to_pandas().groupby("grp").cumcount()
print(data)
Output
| grp value id
| str32 int32 int64
-- + ----- ----- -----
0 | a 2 0
1 | a 3 1
2 | b 1 0
3 | b 2 1
4 | b 5 2
5 | b 9 3
6 | c 2 0
[7 rows x 3 columns]
Let's say I have a data frame with 3 columns:
| id | value | date |
+====+=======+===========+
| 1 | 50 | 1-Feb-19 |
+----+-------+-----------+
| 1 | 100 | 5-Feb-19 |
+----+-------+-----------+
| 1 | 200 | 6-Jun-19 |
+----+-------+-----------+
| 1 | 500 | 1-Dec-19 |
+----+-------+-----------+
| 2 | 10 | 6-Jul-19 |
+----+-------+-----------+
| 3 | 500 | 1-Mar-19 |
+----+-------+-----------+
| 3 | 200 | 5-Apr-19 |
+----+-------+-----------+
| 3 | 100 | 30-Jun-19 |
+----+-------+-----------+
| 3 | 10 | 25-Dec-19 |
+----+-------+-----------+
ID column contains the ID of a particular person.
Value column contains the value of their transaction.
Date column contains the date of their transaction.
Is there a way in Python to identify ID 1 as the ID with the increasing value of transactions over time?
I'm looking for some way to extract ID 1 as my desired ID with increasing value of transactions, filter out ID 2 because it doesn't have enough transactions to analyze a trend, and also filter out ID 3 as its trend of transactions is declining over time.
Perhaps group by the id, and check that the sorted values are the same whether sorted by value or by date (assuming the date column has been parsed with pd.to_datetime, so that sorting by date is chronological):
>>> df.groupby('id').apply( lambda x:
... (
... x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value']
... ).all()
... )
id
1 True
2 True
3 False
dtype: bool
EDIT:
To make id=2 not True, we can do this instead:
>>> df.groupby('id').apply( lambda x:
... (
... (x.sort_values('value', ignore_index=True)['value'] == x.sort_values('date', ignore_index=True)['value'])
... & (len(x) > 1)
... ).all()
... )
id
1 True
2 False
3 False
dtype: bool
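A hedged variant of the same check (not from the original answer): sort within each id by date and require every step to be a strict increase, plus more than one transaction. It assumes the question's frame with the date column parsed via pd.to_datetime:
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 2, 3, 3, 3, 3],
    'value': [50, 100, 200, 500, 10, 500, 200, 100, 10],
    'date':  pd.to_datetime(['1-Feb-19', '5-Feb-19', '6-Jun-19', '1-Dec-19',
                             '6-Jul-19', '1-Mar-19', '5-Apr-19', '30-Jun-19',
                             '25-Dec-19']),
})

# True only for ids whose values strictly increase in date order and that
# have more than one transaction.
trend = (df.sort_values('date')
           .groupby('id')['value']
           .agg(lambda s: bool(len(s) > 1 and s.diff().dropna().gt(0).all())))

print(trend[trend.astype(bool)].index.tolist())   # [1]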
import numpy as np

df['new'] = df.groupby(['id'])['value'].transform(lambda x : \
                np.where(x.diff()>0,'increase',
                         np.where(x.diff()<0,'decrease','--')))
df = df.groupby('id').new.agg(['last'])
df
Output:
last
id
1 increase
2 --
3 decrease
Only increasing ID:
increasingList = df[(df['last']=='increase')].index.values
print(increasingList)
Result:
[1]
Assuming data like this won't happen (a value going back down after increasing):
1 50
1 100
1 50
If so, then:
df['new'] = df.groupby(['id'])['value'].transform(lambda x : \
np.where(x.diff()>0,'increase',
np.where(x.diff()<0,'decrease','--')))
df
Output:
value new
id
1 50 --
1 100 increase
1 200 increase
2 10 --
3 500 --
3 300 decrease
3 100 decrease
Concat strings:
df = df.groupby(['id'])['new'].apply(lambda x: ','.join(x)).reset_index()
df
Intermediate Result:
id new
0 1 --,increase,increase
1 2 --
2 3 --,decrease,decrease
Check whether 'decrease' exists in a row, or whether the row contains only '--', and drop those rows:
df = df.drop(df[df['new'].str.contains("dec")].index.values)
df = df.drop(df[(df['new']=='--')].index.values)
df
Result:
id new
0 1 --,increase,increase
I have a dataframe (df) as given below
Hair Legs Type
0 0 1
0 0 1
1 2 1
1 2 1
1 4 1
0 2 2
0 2 2
1 6 6
and I want to calculate contingency tables for all the different combinations. For example, a contingency table for Hair(H=1), Legs (L=2) and Type (T=1) would look like this
         |  T  |  T` |
---------+-----+-----+
 (H,2L)  |  2  |  0  |
---------+-----+-----+
 (H,2L)` |  3  |  3  |
I am currently doing the same with other variables and it works quite well; however, those variables are boolean, and I am facing this problem only with the variable Legs, which has numerical values (0, 2, 4, 5, 6, 8).
Moreover, I want to calculate for H=0, L=0, T=0 and then for H=1, L=4 and T=1 and so on.
Any help regarding this would be highly appreciated.
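One possible way to build such a table (a sketch, not from the original post) is pd.crosstab on two boolean masks, one for the (H, 2L) condition and one for T:
import pandas as pd

# The frame from the question.
df = pd.DataFrame({'Hair': [0, 0, 1, 1, 1, 0, 0, 1],
                   'Legs': [0, 0, 2, 2, 4, 2, 2, 6],
                   'Type': [1, 1, 1, 1, 1, 2, 2, 6]})

# Boolean masks: rows satisfying (Hair == 1, Legs == 2), and rows with Type == 1.
rows = (df['Hair'] == 1) & (df['Legs'] == 2)
cols = df['Type'] == 1
print(pd.crosstab(rows, cols))
# True/True = 2, True/False = 0, False/True = 3, False/False = 3, matching the
# table above; other combinations of H, L and T just swap the masks.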
I have the following grid with bins defined by x and y, and each grid square given a unique id
mapping = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9], 'x': [1,1,1,2,2,2,3,3,3], 'y': [1,2,3,1,2,3,1,2,3]})
id x y
0 1 1 1
1 2 1 2
2 3 1 3
3 4 2 1
4 5 2 2
5 6 2 3
6 7 3 1
7 8 3 2
8 9 3 3
I also have a new dataframe of observations for which I would like to know what their associated id should be (e.g. which grid square they fall into)
coordinates = pd.DataFrame({'x': [1.4, 2.7], 'y': [1.9, 1.1]})
x y
0 1.4 1.9
1 2.7 1.1
My solution is the following function:
import bisect
def get_id(coords, mapping):
    # Find the bin (lower bound) each coordinate falls into ...
    x_val = mapping.x[bisect.bisect_right(mapping.x, coords[0]) - 1]
    y_val = mapping.y[bisect.bisect_right(mapping.y, coords[1]) - 1]
    # ... and look up the id of that grid square.
    id = mapping[(mapping.x == x_val) & (mapping.y == y_val)].iloc[0, 0]
    return id
coordinates.apply(get_id, mapping = mapping, axis = 1)
Out[21]:
0 1
1 4
dtype: int64
This works but becomes slow as the coordinates dataframe grows long. Is there a faster way to do this for a coordinates dataframe with 10^6+ observations?
Edit:
To answer #abdurrehman245 question from the comments below.
My current method is to simply round down any data point; this allows me to map it to an ID using the mapping dataframe, which contains the minimum entries (the bin lower bounds) for any given ID. So x=1.4, y=1.9 rounds down to x=1, y=1, which is mapped to id=1 according to the mapping.
Maybe this cartesian visualisation makes this a little bit more clear:
Y
4 -------------------------
| 3 | 6 | 9 |
| | | |
3 -------------------------
| 2 | 5 | 8 |
| | | |
2 -------------------------
| 1 | 4 | 7 |
| | | |
1 ------------------------- X
1 2 3 4
I would also add that I could not use the floor function as the bins are not necessarily nice integers as in this example.
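For reference, a vectorized sketch (not from the original post) that replaces the per-row apply with np.searchsorted over the sorted bin edges; it assumes the grid is complete and every coordinate falls inside it:
import numpy as np

# Bin edges are the unique lower bounds from the mapping frame defined above.
x_edges = np.sort(mapping['x'].unique())
y_edges = np.sort(mapping['y'].unique())

# searchsorted(..., side='right') - 1 gives the bin index of every coordinate at once.
x_bin = np.searchsorted(x_edges, coordinates['x'].to_numpy(), side='right') - 1
y_bin = np.searchsorted(y_edges, coordinates['y'].to_numpy(), side='right') - 1

# 2-D lookup table of ids, indexed by (x bin, y bin).
id_grid = mapping.pivot(index='x', columns='y', values='id').to_numpy()
coordinates['id'] = id_grid[x_bin, y_bin]   # [1, 4] for the example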
I have the following problem, which I am stuck on and unfortunately cannot resolve by myself or with the help of similar questions I found on Stack Overflow.
To keep it simple, I'll give a short example of my problem:
I have a DataFrame with several columns, one of which indicates the ID of a user. The same user may have several entries in this data frame:
| | userID | col2 | col3 |
+---+-----------+----------------+-------+
| 1 | 1 | a | b |
| 2 | 1 | c | d |
| 3 | 2 | a | a |
| 4 | 3 | d | e |
Something like this. Now I want to know the number of rows that belong to a certain userID. For this I tried df.groupby('userID').size(), whose result I then want to use for another simple calculation, like a division.
But as I try to save the results of the calculation in a separate column, I keep getting NaN values.
Is there a way to solve this so that I get the result of the calculations in a separate column?
Thanks for your help!
edit//
To make clear how my output should look: the upper dataframe is my main data frame, so to say. Besides this frame I have a second frame looking like this:
| | userID | value | value/appearances |
+---+-----------+----------------+-------+
| 1 | 1 | 10 | 10 / 2 = 5 |
| 3 | 2 | 20 | 20 / 1 = 20 |
| 4 | 3 | 30 | 30 / 1 = 30 |
So in the column 'value/appearances' I basically want the number in the value column divided by the number of appearances of that user in the main dataframe. For the user with ID=1 this would be 10/2, as this user has a value of 10 and 2 rows in the main dataframe.
I hope this makes it a bit clearer.
IIUC you want to do the following: groupby on 'userID', call transform on the grouped column, and pass 'size' to identify the method to call:
In [54]:
df['size'] = df.groupby('userID')['userID'].transform('size')
df
Out[54]:
userID col2 col3 size
1 1 a b 2
2 1 c d 2
3 2 a a 1
4 3 d e 1
What you tried:
In [55]:
df.groupby('userID').size()
Out[55]:
userID
1 2
2 1
3 1
dtype: int64
When assigned back to the df, this result aligns on the df index rather than on userID, so the values can land on the wrong rows and the last row gets NaN:
In [57]:
df['size'] = df.groupby('userID').size()
df
Out[57]:
userID col2 col3 size
1 1 a b 2
2 1 c d 1
3 2 a a 1
4 3 d e NaN
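As a hedged follow-up sketch for the second frame from the question ('second' is an assumed name for it), the per-user counts can then be mapped onto the userID column and used for the division:
import pandas as pd

df = pd.DataFrame({'userID': [1, 1, 2, 3],
                   'col2': ['a', 'c', 'a', 'd'],
                   'col3': ['b', 'd', 'a', 'e']})
second = pd.DataFrame({'userID': [1, 2, 3], 'value': [10, 20, 30]})

counts = df.groupby('userID').size()                       # userID -> appearances
second['value/appearances'] = second['value'] / second['userID'].map(counts)
print(second)   # 5.0, 20.0, 30.0 for users 1, 2, 3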