How do I specify a column header for pandas groupby result? - python

I need to group by two columns and then return the values of a third column concatenated into one string. I have managed to do this, but the returned dataframe has a column named 0. Just 0. Is there a way to specify what the result column will be called?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting dataframe has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using
.rename(index=str, columns={0: "variant"}), this seems very inelegant.
Is there any way to provide a header for the column directly? Thanks

The simplest way is to remove as_index=False so that the result is a Series, and then pass the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
'URL':[5,5,4,4,4,4],
'INDEX':list('aaabbb')})
print (all_columns)
INDEX URL VALUE
0 a 5 a
1 a 5 s
2 a 4 d
3 b 4 ss
4 b 4 t
5 b 4 y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
.apply(' '.join) \
.reset_index(name='variant')
print (all_columns_grouped)
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y

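You can use agg on a single column (VALUE in this case) to assign a column name to the result of a function. Passing a dict of new names, as in agg({'variant': ...}), was deprecated in pandas 0.20 and raises SpecificationError from pandas 1.0 on; named aggregation (pandas 0.25+) takes the new name as a keyword instead.
# Sample data (thanks @jezrael)
import pandas as pd
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
                            'URL':[5,5,4,4,4,4],
                            'INDEX':list('aaabbb')})
# Solution
>>> all_columns.groupby(['INDEX','URL'])['VALUE'].agg(
...     variant=lambda x: ' '.join(x)).reset_index()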
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
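For comparison, here is a minimal sketch of the same idea applied to the whole grouped frame rather than to the selected VALUE column. It assumes pandas 0.25 or later, where named aggregation accepts new_column=(source_column, function) pairs; the sample data is the one used above.

```python
import pandas as pd

all_columns = pd.DataFrame({
    'VALUE': ['a', 's', 'd', 'ss', 't', 'y'],
    'URL': [5, 5, 4, 4, 4, 4],
    'INDEX': list('aaabbb'),
})

# Named aggregation: new_column=(source_column, function).
# as_index=False keeps INDEX and URL as ordinary columns.
all_columns_grouped = (
    all_columns
    .groupby(['INDEX', 'URL'], as_index=False)
    .agg(variant=('VALUE', ' '.join))
)
print(all_columns_grouped)
```

The tuple form is handy when several output columns are built from different source columns in one call.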

Related

Pandas Groupby remove row where specific value combination occurs

As stated in the title, I would like to remove a specific row based on group-by logic. In the dataframe below, wherever both F and G occur for an ID, I would like to remove the row with value G.
import pandas as pd
op_d = {'ID': [1,1,2,2,3,4],'Value':['F','G','K','G','H','G']}
df = pd.DataFrame(data=op_d)
df
In this case, I would like to remove the second row, which has value 'G' for ID = 1. So far I have:
temp = df.groupby('ID').apply(lambda x: (x['Value'].nunique()>1)).reset_index().rename(columns={0:'Expected_Output'})
temp = temp.loc[temp['Expected_Output']==True]
multiple_options = df.loc[df['ID'].isin(temp['ID'])]
So far, I am able to figure out which IDs have multiple values. Could you tell me how to remove this specific row?
Use Series.eq + Series.groupby transform with any:
m1, m2 = df['Value'].eq('F'), df['Value'].eq('G')
m = m2 & m1.groupby(df['ID']).transform('any') & m2.groupby(df['ID']).transform('any')
df1 = df[~m]
Result:
print(df1)
ID Value
0 1 F
2 2 K
3 2 G
4 3 H
5 4 G
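Using isin (this assumes each of F and G appears at most once per ID, so a count of 2 means both are present):
c = (df['Value'].isin(['F','G']).groupby(df['ID']).transform('sum').eq(2)
     & df['Value'].eq('G'))
out = df[~c].copy()
print(out)
ID Value
0 1 F
2 2 K
3 2 G
4 3 H
5 4 G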
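The masking logic from the answers above can be wrapped in a small helper. This is only a sketch: the function name drop_g_when_f_present is made up for illustration, and it checks for the presence of F per ID with transform('any'), so it does not depend on F or G appearing only once per ID.

```python
import pandas as pd

def drop_g_when_f_present(df):
    """Drop rows with Value 'G' for every ID that also has a row with Value 'F'."""
    # True on every row of an ID that contains at least one 'F'
    has_f = df['Value'].eq('F').groupby(df['ID']).transform('any')
    # Rows to drop: the 'G' rows inside those IDs
    to_drop = df['Value'].eq('G') & has_f
    return df[~to_drop]

op_d = {'ID': [1, 1, 2, 2, 3, 4], 'Value': ['F', 'G', 'K', 'G', 'H', 'G']}
df = pd.DataFrame(data=op_d)
df1 = drop_g_when_f_present(df)
print(df1)
```

Only the G row of ID 1 is removed; the G rows of IDs 2 and 4 stay because those IDs have no F.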

extract rows with conditions and with new created column in python

I have data like this:
id name sub marks
1 a m 52
1 a s 69
1 a p 63
2 b m 36
2 b s 52
2 b p 56
3 c m 85
3 c s 62
3 c p 56
I want an output table containing the columns id and name, plus a new column result, using this criterion: if the marks in all subjects are greater than 40, the student passes.
id name result
1 a pass
2 b fail
3 c pass
I would like to do this in python.
Create a boolean mask from marks, and then use groupby (on id and name) + all:
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
v = df.assign(result=df.marks.gt(40))\
.groupby(['id', 'name'])\
.result\
.all()\
.reset_index()
v['result'] = np.where(v['result'], 'pass', 'fail')
v
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Here's one way
In [127]: df.groupby(['id', 'name']).marks.agg(
     ...:     lambda x: 'pass' if x.gt(40).all() else 'fail'
     ...: ).reset_index(name='result')
Out[127]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Another way, inspired by jpp's solution: use replace or map
In [132]: df.groupby(['id', 'name']).marks.min().gt(40).replace(
     ...:     {True: 'pass', False: 'fail'}
     ...: ).reset_index(name='result')
Out[132]:
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Here is one way via pandas. Note that your criterion is equivalent to the minimum mark being above 40. Taking the group minimum uses pandas' built-in reduction, which avoids calling a Python function once per group.
import numpy as np
import pandas as pd
df = pd.read_csv('file.csv')
df = df.groupby(['id', 'name'])['marks'].min().reset_index()
df['result'] = np.where(df['marks'] > 40, 'pass', 'fail')
df = df[['id', 'name', 'result']]
Result
id name result
0 1 a pass
1 2 b fail
2 3 c pass
Explanation
First perform a groupby.min() by id and name.
Then assign the column a string depending on value.
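The answers above read from file.csv, which is not shown. As a self-contained sketch, the snippet below builds the question's table inline and applies the minimum-mark approach, using map for the final labels; the threshold of strictly greater than 40 is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'name':  ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'sub':   ['m', 's', 'p', 'm', 's', 'p', 'm', 's', 'p'],
    'marks': [52, 69, 63, 36, 52, 56, 85, 62, 56],
})

# A student passes only if the lowest mark is above 40.
result = (
    df.groupby(['id', 'name'])['marks']
      .min()
      .gt(40)
      .map({True: 'pass', False: 'fail'})
      .reset_index(name='result')
)
print(result)
```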

Python/Pandas dataframe - return column name

Is there a way to return the name/header of a column into a string in a pandas dataframe? I want to work with a row of data which has the same prefix. The dataframe header looks like this:
col_00 | col_01 | ... | col_51 | bc_00 | cd_00 | cd_01 | ... | cd_90
I'd like to apply a function to each row, but only from col_00 to col_51 and from cd_00 to cd_90, separately. To do this, I thought I'd collect the column names into a list: e.g. to_work_with would be the list of columns starting with the prefix 'col', and I would apply the function to df[to_work_with]. Then I'd change to_work_with so that it contains the list of columns starting with the 'cd' prefix, et cetera. But I don't know how to iterate through the column names.
So basically, the thing I'm looking for is this function:
to_work_with = column names in the df that start with "thisstring"
How can I do that? Thank you!
You can use boolean indexing with str.startswith:
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
Sample:
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 6 7 8 9
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
# if you want to apply some function to the filtered columns only
def f(x):
    return x + 1
df[cols] = df[cols].apply(f)
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 7 8 9 10
Another solution with list comprehension:
cols = [col for col in df.columns if col.startswith("cd")]
print (cols)
['cd_00', 'cd_01', 'cd_02', 'cd_90']
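To handle each prefix group separately, as the question asks, the selection can be put in a loop over prefixes. This is a rough sketch: the row-wise sum stands in for whatever function is actually needed, and the prefixes 'col_' and 'cd_' and the sample values are taken from the example above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8, 9]],
    columns=['col_00', 'col_01', 'col_02', 'col_51', 'bc_00',
             'cd_00', 'cd_01', 'cd_02', 'cd_90'],
)

# One row-wise result per prefix; sum(axis=1) is only a placeholder function.
row_sums = {}
for prefix in ('col_', 'cd_'):
    to_work_with = df.columns[df.columns.str.startswith(prefix)]
    row_sums[prefix] = df[to_work_with].sum(axis=1)

print(row_sums['col_'])
print(row_sums['cd_'])
```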

How to process column names and create new columns

This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How do I proceed with steps 2 and 3 so that the solution is automatic? I don't want to define the column names cm1 and cm2 manually, because the original data set might have many cm variations.
You can use:
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First, filter the columns whose names contain the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can rename the columns to their cm labels, such as cm1, cm2, cm3.
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum, since True counts as 1 and False as 0. The counts are needed per unique column name, so group the columns by name and sum. (groupby(..., axis=1) is deprecated in recent pandas, so the grouping below is done on the transposed frame.)
df1 = df1.astype(bool)
print(df1)
cm1 cm1 cm2
0 True True False
1 True True True
print(df1.T.groupby(level=0).sum().T)
cm1 cm2
0 2 0
1 2 1
You need the unique column names, which will be added to the original df:
print(df1.columns.unique())
Index(['cm1', 'cm2'], dtype='object')
Finally, add the new columns df[['cm1','cm2']] from the groupby result:
df[df1.columns.unique()] = df1.T.groupby(level=0).sum().T
print(df)
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
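The steps above can also be condensed into one short pipeline. This is a sketch under two assumptions that are not in the original answer: the cm label always has the form cm followed by digits, so a regular expression can extract it, and the per-label counts are joined back with DataFrame.join.

```python
import pandas as pd

df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})

# Keep only the columns that mention cm.
cm_cols = df.filter(regex='cm')
# Extract the cm label (cm1, cm2, ...) from each column name.
labels = cm_cols.columns.str.extract(r'(cm\d+)', expand=False)
# Non-zero flags, grouped by label on the transposed frame, then transposed back.
counts = cm_cols.ne(0).T.groupby(labels.to_numpy()).sum().T
result = df.join(counts)
print(result)
```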
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard coding this in you might as well use 2
so that instead of the string after cm it is cm and the character directly following, in this case it would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col,new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a!=0) for a in df[col]]
    else:
        df[new_col] += [int(a!=0) for a in df[col]]
Note that int(a!=0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that the dict is iterated in insertion order (and in arbitrary order before Python 3.7), so it may be preferable to add the new columns in order of their values (like the answer here):
import operator
for col,new_col in sorted(col_map.items(),key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a!=0) for a in df[col]]
    else:
        df[new_col] = [int(a!=0) for a in df[col]]
to ensure the new columns are inserted in order.
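One limitation of the col_map expression above is that it keeps only the single character after cm, so a column containing cm10 would be mapped to cm1. A possible variation, sketched here with made-up column names (a_cm1, b_cm10_x, c_cm1, d), uses a regular expression to capture all the digits; it assumes the label is always cm followed by digits.

```python
import re
import pandas as pd

# Hypothetical data with a two-digit cm label.
df = pd.DataFrame({
    'a_cm1': [1, 0],
    'b_cm10_x': [0, 2],
    'c_cm1': [3, 4],
    'd': [5, 5],
})

# Map each original column to its full cm label (cm1, cm10, ...).
col_map = {}
for c in df.columns:
    match = re.search(r'cm\d+', c)
    if match:
        col_map[c] = match.group(0)

# Add the counts, visiting the columns in order of their new label.
for col, new_col in sorted(col_map.items(), key=lambda item: item[1]):
    nonzero = df[col].ne(0).astype(int)
    if new_col in df:
        df[new_col] += nonzero
    else:
        df[new_col] = nonzero

print(df)
```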

Editing then concatenating values of several columns into a single one (pandas, python)

I'm looking for a way to use pandas and python to combine several columns in an excel sheet with known column names into a new, single one, keeping all the important information as in the example below:
input:
ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
desired output:
ID,tp_all
0,transportation
1,cars
2,boats
3,cars+boats
4,boats+planes
5,cars+boats+planes
The row with an ID of 0 contains a description of the contents of each column. Ideally the code would parse the description in the second row, take the text after the '-', and concatenate those values in the new "tp_all" column.
This is quite interesting as it's a reverse get_dummies...
I think I would manually munge the column names so that you have a boolean DataFrame:
In [11]: df1 # df == 'checked'
Out[11]:
cars boats planes
0
1 True False False
2 False True False
3 True True False
4 False True True
5 True True True
Now you can use an apply with zip:
In [12]: df1.apply(lambda row: '+'.join([col for col, b in zip(df1.columns, row) if b]),
axis=1)
Out[12]:
0
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
dtype: object
Now you just have to tweak the headers, to get the desired csv.
Would be nice if there were a less manual way / faster to do reverse get_dummies...
OK a more dynamic method:
In [63]:
# get a list of the columns
col_list = list(df.columns)
# remove 'ID' column
col_list.remove('ID')
# create a dict as a lookup
col_dict = dict(zip(col_list, [df.iloc[0][col].split(' - ')[1] for col in col_list]))
col_dict
Out[63]:
{'tp_b': 'boats', 'tp_c': 'cars', 'tp_p': 'planes'}
In [64]:
# define a func that tests the value and uses the dict to create our string
def func(x):
    temp = ''
    for col in col_list:
        if x[col] == 'checked':
            if len(temp) == 0:
                temp = col_dict[col]
            else:
                temp = temp + '+' + col_dict[col]
    return temp
df['combined'] = df[1:].apply(lambda row: func(row), axis=1)
df
Out[64]:
ID tp_c tp_b tp_p \
0 0 transportation - cars transportation - boats transportation - planes
1 1 checked NaN NaN
2 2 NaN checked NaN
3 3 checked checked NaN
4 4 NaN checked checked
5 5 checked checked checked
combined
0 NaN
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
[6 rows x 5 columns]
In [65]:
df = df.loc[1:,['ID', 'combined']]
df
Out[65]:
ID combined
1 1 cars
2 2 boats
3 3 cars+boats
4 4 boats+planes
5 5 cars+boats+planes
[5 rows x 2 columns]
Here is one way:
import pandas
newCol = pandas.Series('', index=d.index)
for col in d.iloc[:, 1:]:
    name = '+' + col.split('-')[1].strip()
    newCol[d[col]=='checked'] += name
newCol = newCol.str.strip('+')
Then:
>>> newCol
0 cars
1 boats
2 cars+boats
3 boats+planes
4 cars+boats+planes
dtype: object
You can create a new DataFrame with this column or do what you like with it.
Edit: I see that you have edited your question so that the names of the modes of transportation are now in row 0 instead of in the column headers. It is easier if they're in the column headers (as my answer assumes), and your new column headers don't seem to contain any additional useful information, so you should probably start by just setting the column names to the info from row 0, and deleting row 0.
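Putting the pieces from the answers above together, here is a self-contained sketch that starts from the CSV text in the question. It assumes the description row is always the first data row and that every description has the form "<group> - <label>"; the variable names are chosen for illustration only.

```python
import io
import pandas as pd

raw = """ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
"""
df = pd.read_csv(io.StringIO(raw))

value_cols = [c for c in df.columns if c != 'ID']
# Take the label after ' - ' from the description row (row 0).
labels = {c: df.loc[0, c].split(' - ')[1] for c in value_cols}

# Boolean frame of the data rows, with the labels as column names.
checked = df.loc[1:, value_cols].eq('checked').rename(columns=labels)
# Join the labels of the checked columns in each row.
tp_all = checked.apply(lambda row: '+'.join(row[row].index), axis=1)

result = pd.DataFrame({'ID': df.loc[1:, 'ID'], 'tp_all': tp_all})
print(result)
```

The description row itself is left out of result; it can be added back by hand if the header line "0,transportation" is needed in the output file.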
