I want to extract a specific number of groups after applying a groupby on a column, for example the first 2 or 3 groups.
I have a data frame:
id gender value
1 f 1123
1 f 10
2 m 123
2 m 154
2 m 165
3 m 654
3 m 987
4 f 7654
4 f 7654
4 f 7654
... ... ....
I want something like this:
id gender value
2 m 123
2 m 154
3 m 654
3 m 987
... .. ...
My code is:
dtFrame2 = dtFrame.groupby('id').head(2)
dtFrameMale = dtFrame2.loc[dtFrame2.gender == 'm']
maleGroups = dtFrameMale.groupby('id')
temp = maleGroups.filter(lambda x: len(x) == 2)
The last statement gives me all the groups with two rows, but after that I want to extract the first two, three, or n groups.
Something like this:
In [60]: s = df[df['gender'] == 'm'].groupby('id').size()
In [61]: s.name = 'size'
In [62]: df2 = df.join(s, on='id')
In [63]: df2[df2['size'] == 2]
Out[63]:
id gender value size
5 3 m 654 2
6 3 m 987 2
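To then keep only the first n of those groups, here is a follow-up sketch (n and the selection by position are my own additions, not part of the original answer), reusing the s and df2 defined above:
n = 2  # however many groups you want
first_ids = s[s == 2].index[:n]      # ids of the first n groups that have exactly two rows
df2[df2['id'].isin(first_ids)]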
I am working on a raw excel file to develop an organized format database (.xlsx) format. The demo input file is given as:
FromTo B# Bname Id Mend
1 to 2 123 bus1 1 F
1 to 3 234 bus2 1 F
5 to 6 321 bus3 2 F
9 to 10 322 bus5 2 F
1 to 2 326 bus6 1 F
1 to 2 457 bus7 1 F
5 to 6 656 bus8 1 F
9 to 10 780 bus9 2 F
1 to 3 875 bus10 2 F
1 to 3 564 bus11 2 F
The required output is in the following format:
[output format image: each 'FromTo' value appears only once, followed by that group's rows in the remaining columns]
Essentially, I want to automate the filter method on column 'FromTo' (based on cell value) of the input and put the information of other columns as it is, as depicted in the output format image.
For the output, I am able to get columns B to E as required, in the correct order and format. For this, I used the following logic in pandas:
import pandas as pd

df = pd.read_excel('st2_trial.xlsx')

# create an empty dataframe to collect the filtered rows
df_1 = pd.DataFrame()
ai = ['1 to 2', '1 to 3', '5 to 6', '9 to 10']  # all entries from input column 'FromTo'
for i in range(len(ai)):
    filter_ai = (df['FromTo'] == ai[i])
    df_ai = df.loc[filter_ai]
    df_1 = pd.concat([df_1, df_ai])
print(df_1)
Getting the following output from this code:
FromTo B# Bname Id Mend
1 to 2 123 bus1 1 F
1 to 2 326 bus6 1 F
1 to 2 457 bus7 1 F
1 to 3 234 bus2 1 F
1 to 3 875 bus10 2 F
1 to 3 564 bus11 2 F
1 to 3 893 bus12 1 F
5 to 6 321 bus3 2 F
5 to 6 656 bus8 1 F
5 to 6 212 bus13 2 F
9 to 10 322 bus5 2 F
9 to 10 780 bus9 2 F
However, clearly, the first column is not the way I want it! I am looking to avoid redundant entries of '1 to 2', '1 to 3', etc. in the first column.
I believe this can be achieved with proper loops for the first output column. Any help will be highly appreciated!
PS: I have something in mind to work around this (see the sketch after this list):
- create an empty dataframe
- list all unique entries of column 'FromTo'
- take the first element of the list and put it in the first column of the output
- then run my logic in a loop to get the other required information as-is
This way, I think, it would avoid the redundant entries in the first column of the output.
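For illustration, a minimal sketch of that workaround (my own, assuming the df_1 assembled by the loop above): keep the first 'FromTo' label of each block and blank out the repeats.
# df_1 is already grouped into blocks by the loop above;
# duplicated() marks every occurrence of a label after its first.
df_out = df_1.reset_index(drop=True).copy()
df_out.loc[df_out['FromTo'].duplicated(), 'FromTo'] = ''
print(df_out)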
The above question seems similar, if not identical, to How to print a groupby object. However, I will post my answer here in case it helps.
import pandas as pd

df = pd.read_excel('st2_trial.xlsx')
# group by 'FromTo'; within each group, drop the grouping column and renumber the rows
df_group = df.groupby('FromTo').apply(lambda a: a.drop('FromTo', axis=1).reset_index(drop=True))
print(df_group)
OUTPUT:
B# Bname Id Mend
FromTo
1 to 2 0 123 bus1 1 F
1 326 bus6 1 F
2 457 bus7 1 F
1 to 3 0 234 bus2 1 F
1 875 bus10 2 F
2 564 bus11 2 F
5 to 6 0 321 bus3 2 F
1 656 bus8 1 F
9 to 10 0 322 bus5 2 F
1 780 bus9 2 F
You could also try something like this to get your expected output. Note that printing a GroupBy object directly only shows its repr, so iterate over the groups instead:
df_group = df_1.groupby('FromTo')
for name, group in df_group:
    print(name)
    print(group.drop('FromTo', axis=1))
I have a wide dataframe I want to be able to reshape.
I have some columns that I want to preserve. I have been exploring melt and wide_to_long, but I'm not sure they are what I need.
Imagine I have some columns named: 'id', 'classroom', 'city'
And other columns called: 'alumn_x_subject_y_mark', 'alumn_x_subject_y_name', 'alumn_x_subject_y_teacher'
Here x and y range over the product of range(20) and range(10).
I would like to end with a df that has columns: id, classroom, city, alumn, subject, mark, name, teacher
With all the original 20*10 columns converted to rows.
An empty dataframe with that structure can be generated this way:
import pandas as pd
import itertools

vals = list(itertools.product(range(20), range(10)))
pd.DataFrame(columns=['id', 'classroom', 'city']
             + ['alumn_{0}_subject_{1}_mark'.format(x, y) for x, y in vals]
             + ['alumn_{0}_subject_{1}_name'.format(x, y) for x, y in vals]
             + ['alumn_{0}_subject_{1}_teacher'.format(x, y) for x, y in vals],
             dtype=object)
I'm not building this dataframe but receiving it from a file; that's why it has so many columns, and I cannot change that.
If you had only 2 parameters to extract, wide_to_long would work.
Here you have 3, thus you can perform a manual reshaping with a MultiIndex:
regex = r'alumn_(\d+)_subject_(\d+)_(.*)'

out = (df
       .set_index(['id', 'classroom', 'city'])
       .pipe(lambda d: d.set_axis(
           pd.MultiIndex.from_frame(d.columns.str.extract(regex),
                                    names=['alumn', 'subject', None]),
           axis=1))
       .stack(['alumn', 'subject'])
       .reset_index()
      )
output:
Empty DataFrame
Columns: [id, classroom, city, alumn, subject, mark, name, teacher]
Index: []
output with a single row (after df.loc[0] = range(df.shape[1])):
id classroom city alumn subject mark name teacher
0 0 1 2 0 0 3 203 403
1 0 1 2 0 1 4 204 404
2 0 1 2 0 2 5 205 405
3 0 1 2 0 3 6 206 406
4 0 1 2 0 4 7 207 407
.. .. ... ... ... ... ... ... ...
195 0 1 2 9 5 98 298 498
196 0 1 2 9 6 99 299 499
197 0 1 2 9 7 100 300 500
198 0 1 2 9 8 101 301 501
199 0 1 2 9 9 102 302 502
[200 rows x 8 columns]
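For reference, a minimal sketch of the two-parameter case mentioned above, where wide_to_long suffices. The flat layout (columns mark_0, name_0, ...) is hypothetical, chosen to match wide_to_long's stub/suffix convention:
import pandas as pd

# hypothetical two-parameter layout: one stub per field, one numeric suffix per alumn
df2 = pd.DataFrame({'id': [1], 'classroom': ['A'], 'city': ['X'],
                    'mark_0': [9], 'name_0': ['Ann'],
                    'mark_1': [7], 'name_1': ['Bob']})

long_df = pd.wide_to_long(df2, stubnames=['mark', 'name'],
                          i=['id', 'classroom', 'city'],
                          j='alumn', sep='_').reset_index()
print(long_df)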
For each category in the Customer_Acquisition_Channel column, I would like to add all values from the Days_To_Acquisition column to a separate df.
All Customer_ID values are unique in the dataset below.
DF
Customer_ID Customer_Acquisition_Channel Days_To_Acquisition
323 Organic 2
583 Organic 5
838 Organic 2
193 Website 7
241 Website 7
642 Website 1
Desired Output:
Days_To_Acq_Organic_Df
Index Days_To_Acquisition
0 2
1 5
2 2
Days_To_Acq_Website_Df
Index Days_To_Acquisition
0 7
1 7
2 1
This is what I have tried so far, but I would like to use a for loop instead of creating each subset manually:
sub_1 = df.loc[df['Customer_Acquisition_Channel'] == 'Organic']
Days_To_Acq_Organic_Df=sub_1[['Days_To_Acquisition']]
sub_2 = df.loc[df['Customer_Acquisition_Channel'] == 'Website']
Days_To_Acq_Website_Df=sub_2[['Days_To_Acquisition']]
You can iterate through the unique values of the channel column, create a new dataframe for each, change the column names, and append them to a list:
dataframes = []
for channel in df.Customer_Acquisition_Channel.unique():
    new_df = df[df['Customer_Acquisition_Channel'] == channel][['Customer_ID', 'Days_To_Acquisition']]
    new_df.columns = ['Customer_ID', f'Days_To_Acquisition_{channel}_df']
    dataframes.append(new_df)
OUTPUT:
for df in dataframes:
    print(df, '\n__________')
Customer_ID Days_To_Acquisition_Organic_df
0 323 2
1 583 5
2 838 2
__________
Customer_ID Days_To_Acquisition_Website_df
3 193 7
4 241 7
5 642 1
__________
Alternatively, you can store the dataframes in a dictionary so you can name them and access them individually:
dataframes = {}
for channel in df.Customer_Acquisition_Channel.unique():
    new_df = df[df['Customer_Acquisition_Channel'] == channel][['Customer_ID', 'Days_To_Acquisition']]
    new_df.columns = ['Customer_ID', f'Days_To_Acquisition_{channel}']
    dataframes[f'Days_To_Acquisition_{channel}_df'] = new_df
OUTPUT:
print(dataframes['Days_To_Acquisition_Organic_df'])
Customer_ID Days_To_Acquisition_Organic
0 323 2
1 583 5
2 838 2
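Alternatively, a single groupby can build much the same dictionary without any explicit filtering. This is my own sketch, not part of the original answer; it keeps only the Days_To_Acquisition column and renumbers from 0, matching the desired output above:
# one pass over the data; each value holds one channel's Days_To_Acquisition column
dataframes = {
    f'Days_To_Acq_{channel}_Df': group[['Days_To_Acquisition']].reset_index(drop=True)
    for channel, group in df.groupby('Customer_Acquisition_Channel')
}
print(dataframes['Days_To_Acq_Organic_Df'])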
There is the following data frame df (rows shown with fewer values have empty FD_* cells):
df =
ID_DATA FD_1 FD_2 FD_3 FD_4 GRADE
111 23 12 34 45 1
111 23 67 45 5
111 12 67 45 23 5
222 23 55 66 4
222 55 66 4
I calculated the frequency per ID_DATA as follows:
freq = df.ID_DATA.value_counts().reset_index()
freq =
ID_DATA FREQ
111 3
222 2
However, I need to change the logic of this calculation as follows. There are two lists with different values of FD_*:
BaseList = [23,34]
AdjList = [12,45,67]
I need to count the frequency of occurrence of values from these two lists in df, subject to some rules:
1) If a row contains any FD_* value from AdjList, BaseList should not be counted; BaseList is counted only for rows that contain no AdjList value.
2) If a row contains multiple values from BaseList, it should be counted as +1.
3) If a row contains multiple values from AdjList, only the last FD_* column should be counted.
The result should be this:
ID_DATA FREQ_BaseList FREQ_12 FREQ_45 FREQ_67
111 0 0 3 0
222 1 0 0 0
The value of FREQ_BaseList is 0 for 111 because rule #1 fires.
The idea is to create a custom function for this and then adjust it as needed. You can of course make it a bit prettier by replacing the hardcoded list of columns:
>>> def worker1(x):
...     b = 0
...     for v in x:
...         if v in AdjList:
...             # rule 3: the first AdjList hit wins (columns arrive last-to-first)
...             return ['FREQ_' + str(int(v)), 1]
...         else:
...             b = b + BaseList.count(v)
...     return ('FREQ_BaseList', b)
...
>>> def worker2(x):
...     # FD columns reversed so the last FD_* column is checked first (rule 3)
...     r = worker1(x[['FD_4', 'FD_3', 'FD_2', 'FD_1']])
...     return pd.Series([x['ID_DATA'], r[1]], index=['ID_DATA', r[0]])
...
>>> res = df.apply(worker2, axis=1).groupby('ID_DATA').sum()
>>> res
FREQ_45 FREQ_BaseList
ID_DATA
111.0 3.0 NaN
222.0 NaN 1.0
>>> res.reindex(columns=['FREQ_BaseList','FREQ_12','FREQ_45','FREQ_67']).fillna(0).astype(int)
FREQ_BaseList FREQ_12 FREQ_45 FREQ_67
ID_DATA
111.0 0 0 3 0
222.0 1 0 0 0
I need a count of non-zero variables in pairs of rows.
I have a dataframe that lists density of species found at several sampling points. I need to know the total number of species found at each pair of sampling points. Here is an example of my data:
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':[111,222,333,444],'minnow':[1,3,5,4],'trout':[2,0,0,3],'bass':[0,1,3,0],'gar':[0,1,0,0]})
>>> df
ID bass gar minnow trout
0 111 0 0 1 2
1 222 1 1 3 0
2 333 3 0 5 0
3 444 0 0 4 3
I will pair the rows by ID number, so the pair (111,222) should return a total of 4, while the pair (111,333) should return a total of 3. I know I can get a sum of non-zeros for each row, but if I add those totals for each pair I will be double counting some of the species.
Here's an approach with NumPy (assuming import numpy as np) -
In [35]: df
Out[35]:
ID bass gar minnow trout
0 111 0 0 1 2
1 222 1 1 3 0
2 333 3 0 5 0
3 444 0 0 4 3
In [36]: a = df.iloc[:,1:].values != 0         # boolean presence matrix: species found or not
In [37]: r,c = np.triu_indices(df.shape[0],1)  # row indices of every pair i < j
In [38]: l = df.ID                             # sampling-point labels
In [39]: pd.DataFrame(np.column_stack((l[r], l[c], (a[r] | a[c]).sum(1))))
Out[39]:
0 1 2
0 111 222 4
1 111 333 3
2 111 444 2
3 222 333 3
4 222 444 4
5 333 444 3
You can do this using iloc for slicing and NumPy:
np.sum((df.iloc[[0, 1], 1:]!=0).any(axis=0))
Here df.iloc[[0, 1], 1:] selects the first two rows (excluding the ID column), and np.sum counts the columns that are non-zero in at least one of those rows, i.e. the total number of species found across the pair. You can change the row positions in df.iloc to select any combination of rows.
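To look pairs up by ID rather than by position, here is a small helper of my own (an illustration, not part of the original answer):
def pair_total(df, id_a, id_b):
    # count species that are non-zero at either of the two sampling points
    rows = df.set_index('ID').loc[[id_a, id_b]]
    return int((rows != 0).any(axis=0).sum())

pair_total(df, 111, 222)  # 4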
If the rows are sorted so that the two members of each pair occur one after another, you could do:
import pandas as pd
import numpy as np
x = np.random.randint(0,2,(10,3))
df = pd.DataFrame(x)
pair_a = df.loc[::2].reset_index(drop=True)   # even-positioned rows
pair_b = df.loc[1::2].reset_index(drop=True)  # odd-positioned rows
paired = pd.concat([pair_a, pair_b], axis=1)
Then find where paired is non-zero.
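A sketch of that last step (my own completion, not part of the original answer): since pair_a and pair_b share column names and index, their boolean masks can be combined directly.
# for each pair of consecutive rows, count columns non-zero in at least one member
species_per_pair = ((pair_a != 0) | (pair_b != 0)).sum(axis=1)
print(species_per_pair)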