import pandas as pd
df = pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9,10,11],
'text': ['abc','zxc','qwe','asf','efe','ert','poi','wer','eer','poy','wqr']})
I have a DataFrame with columns:
id text
1 abc
2 zxc
3 qwe
4 asf
5 efe
6 ert
7 poi
8 wer
9 eer
10 poy
11 wqr
I have a list L = [1,3,6,10] of ids.
I want to merge the text column using this list: each id in the list starts a group that runs up to, but not including, the next id in the list. So for 1 and 3 (the first two values), the text of id 2 is appended to the row with id = 1 and the row with id 2 is deleted; for 3 and 6, the text of ids 4 and 5 is appended to the row with id = 3 and those rows are deleted; and so on for each consecutive pair (x, x+1) in the list. The last id in the list takes all remaining rows.
My final output would look like this:
id text
1 abczxc # joining id 1 and 2
3 qweasfefe # joining id 3,4 and 5
6 ertpoiwereer # joining id 6,7,8,9
10 poywqr # joining id 10 and 11
You can use isin with where and ffill to build a grouping Series, then pass it to groupby and apply the join function:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
df1 = df.groupby(s)['text'].apply(''.join).reset_index()
print (df1)
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr
It works because:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
print (s)
0 1
1 1
2 3
3 3
4 3
5 6
6 6
7 6
8 6
9 10
10 10
Name: id, dtype: int32
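For reference, the cumsum idea could look something like the sketch below. It is only an illustration under the question's assumptions (the first id in df is in L and the rows are already in id order); the helper name group_no and the label 'grp' are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'text': ['abc', 'zxc', 'qwe', 'asf', 'efe', 'ert', 'poi', 'wer', 'eer', 'poy', 'wqr']})
L = [1, 3, 6, 10]

# True on every row that starts a new group; the running sum numbers the groups
group_no = df['id'].isin(L).cumsum().rename('grp')

# keep the first id of each group and concatenate its text values
df1 = (df.groupby(group_no)
         .agg(id=('id', 'first'), text=('text', ''.join))
         .reset_index(drop=True))
print(df1)
```

Unlike the where/ffill version, this never produces NaN, so no astype(int) is needed afterwards.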
I changed the values not in the list to NaN, then used ffill and groupby. Though @Jezrael's approach is much better. I need to remember to use cumsum :)
l = [1,3,6,10]
df['id'] = df['id'].where(df['id'].isin(l))  # ids not in l become NaN
df = df.ffill().groupby('id').sum()
text
id
1.0 abczxc
3.0 qweasfefe
6.0 ertpoiwereer
10.0 poywqr
Use pd.cut to create your bins, then groupby with a lambda function to join the text in each group.
df.groupby(pd.cut(df.id,L+[np.inf],right=False, labels=[i for i in L])).apply(lambda x: ''.join(x.text))
EDIT:
(df.groupby(pd.cut(df.id,L+[np.inf],
right=False,
labels=[i for i in L]))
.apply(lambda x: ''.join(x.text)).reset_index().rename(columns={0:'text'}))
Output:
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr
I've seen this answered a lot for values, but not for the column header itself.
Say I have my original dataframe, df1:
axx byy czz
0 1 2 3
1 4 5 6
And a second dataframe, df2:
dd ee
0 7 8
1 9 10
If a column name in df1 contains the string sequence "yy", append the whole column (values included) to df2, so in the end for df2 I would get this:
dd ee byy
0 7 8 2
1 9 10 5
How do I do this? I know it has something along the lines of df1.columns.str.contains('yy') but this returns a boolean, and I can't work out how to use that to copy over and append the entire column.
You can use filter for this:
new_df = pd.concat([df2, df1.filter(like='yy')], axis=1)
Output:
>>> new_df
dd ee byy
0 7 8 2
1 9 10 5
Or df.columns.str.contains, like you were thinking:
yy_cols = df1.columns[df1.columns.str.contains('yy')]
new_df = pd.concat([df2, df1[yy_cols]], axis=1)
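The boolean array from str.contains can also be used directly as a column selector with loc, which is probably the piece that was missing. A minimal sketch, rebuilding the two frames from the question (the variable name mask is just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'axx': [1, 4], 'byy': [2, 5], 'czz': [3, 6]})
df2 = pd.DataFrame({'dd': [7, 9], 'ee': [8, 10]})

# one boolean per column name: True where the name contains 'yy'
mask = df1.columns.str.contains('yy')

# .loc[:, mask] keeps all rows and only the matching columns;
# join aligns on the index, like pd.concat(..., axis=1)
new_df = df2.join(df1.loc[:, mask])
print(new_df)
```

Note that both join and concat align on the index, so this assumes df1 and df2 share the same row labels.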
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it with numbering records within group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more efficient/elegant approach to do this? Also, is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try this?
df.groupby('id').head(2)
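Output generated (this is from an older pandas version; recent versions keep the original flat index instead of a MultiIndex):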
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
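For the second part of the question (numbering records within each group, like row_number()), GroupBy.cumcount is one option. A small sketch on the question's data; the column name rn is an arbitrary choice for this example:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# 0-based position of each row inside its id group (add 1 for SQL-style numbering)
df['rn'] = df.groupby('id').cumcount()

# keeping positions 0 and 1 gives the first two rows per id
top2 = df.loc[df['rn'] < 2, ['id', 'value']]
print(top2)
```

The numbering follows the current row order, so sort first if "top" should mean largest or smallest.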
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
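As one example of where the extra index level is useful: it can be fed back into loc to recover the full rows rather than just the value column. A sketch on the question's data (the names idx and top_rows are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

largest = df.groupby('id')['value'].nlargest(2)

# level 0 is the group key, level 1 is the original row label
idx = largest.index.get_level_values(1)

# look the original rows up again by label
top_rows = df.loc[idx]
print(top_rows)
```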
Sometimes sorting the whole dataset up front is very time consuming.
We can group first and take the top k within each group:
topk = 2
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk,['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The argument to head plays the same role as the argument to nlargest: the number of rows to keep for each group.
The reset_index call is optional.
This works for duplicated values
If the top-n values contain duplicates and you want only unique values, you can do this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
To get distinct salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed that they were 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to a .nth() call:
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
I'm trying to group the following dataframe by unique binId and then, within each group, pick the row with the highest value of 'z'. Here is my dataframe.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3','4','5','6'], 'binId': ['1','2','2','1','1','3'], 'x':[1,4,5,6,3,4], 'y':[11,24,35,16,23,34],'z':[1,4,5,2,3,4]})
I tried the following code, which gives the required answer:
def f(x):
    tp = df[df['binId'] == x][['binId','ID','x','y','z']].sort_values(by='z', ascending=False).iloc[0]
    return tp
and then:
binids = pd.Series(df.binId.unique())
print(binids.apply(f))
The output is,
binId ID x y z
0 1 5 3 23 3
1 2 3 5 35 5
2 3 6 4 34 4
But the execution is too slow. What is the faster way of doing this?
Use idxmax to get the index labels of the per-group maxima and select those rows with loc:
df1 = df.loc[df.groupby('binId')['z'].idxmax()]
Or, faster, use sort_values with drop_duplicates:
df1 = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')
print (df1)
ID binId x y z
4 5 1 3 23 3
2 3 2 5 35 5
5 6 3 4 34 4
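To make the comparison concrete, here is a self-contained sketch that runs both variants on the question's data and checks that they select the same rows. The names by_idxmax and by_sort are only for this example.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3', '4', '5', '6'],
                   'binId': ['1', '2', '2', '1', '1', '3'],
                   'x': [1, 4, 5, 6, 3, 4],
                   'y': [11, 24, 35, 16, 23, 34],
                   'z': [1, 4, 5, 2, 3, 4]})

# variant 1: label of the max z per binId, then look the rows up
by_idxmax = df.loc[df.groupby('binId')['z'].idxmax()]

# variant 2: sort so the largest z is last within each binId, keep that last row
by_sort = df.sort_values(['binId', 'z']).drop_duplicates('binId', keep='last')

print(by_idxmax.equals(by_sort))
```

With ties in z the two variants may pick different rows within a group (idxmax returns the first maximum, keep='last' the last row after sorting), so they are only guaranteed to agree when the maximum is unique per group.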
I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it was possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence why I would like to remove anything that is not an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.isin (the newer replacement for numpy.in1d, which is deprecated in NumPy 2.0):
df1 = df[(np.isin(df['a'], [1,3])) & (np.isin(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for values that cannot be parsed, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
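Putting the two ideas together for the original problem (drop rows with junk like 1abc and keep only the allowed values) might look like the sketch below. The sample data and the name allowed are invented for illustration and are not from the question.

```python
import pandas as pd

# made-up data: string columns where one entry carries stray characters
df = pd.DataFrame({'a': ['1', '4', '2', '3', '1abc'],
                   'b': ['6', '7', '4', '7', '6']})

# values that cannot be parsed as numbers become NaN
num = df.apply(pd.to_numeric, errors='coerce')

# hypothetical whitelist of allowed values for each column
allowed = {'a': [1, 3], 'b': [6, 7]}

# NaN never matches isin, so the unparseable rows drop out here as well
mask = num['a'].isin(allowed['a']) & num['b'].isin(allowed['b'])
clean = num[mask].astype(int)
print(clean)
```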
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin():
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
2 3
0 1
1 A 9 9
C 8 8
D 4 5
2 H 7 7
L 5 5
I want to filter the rows in the nth position of level 1 of the multiindex, i.e. filtering the first
2 3
0 1
1 A 9 9
2 H 7 7
or filtering the third
2 3
0 1
1 D 4 5
How can I achieve this?
You can filter rows with the help of GroupBy.nth after performing grouping on the first level of the multi-index DF. Since n follows the 0-based indexing approach, you need to provide the values appropriately to it as shown:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)
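The index that nth returns has differed between pandas versions, so if that matters, a boolean mask built from GroupBy.cumcount is an alternative that always keeps the original MultiIndex. A sketch on the question's data (the variable names are only for this example):

```python
import pandas as pd

L = [(1, 'A', 9, 9), (1, 'C', 8, 8), (1, 'D', 4, 5), (2, 'H', 7, 7), (2, 'L', 5, 5)]
df = pd.DataFrame.from_records(L).set_index([0, 1])

# 0-based position of each row within its level-0 group
pos = df.groupby(level=0).cumcount()

first_rows = df[pos == 0]   # first row of every group
third_rows = df[pos == 2]   # third row, only for groups that have one
print(first_rows)
print(third_rows)
```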