Get only first row per subject in dataframe - python

I was wondering if there is an easy way to get only the first row of each grouped object (subject id for example) in a dataframe. Doing this:
for index, row in df.iterrows():
    # do stuff
gives us each one of the rows, but I am interested in doing something like this:
groups = df.groupby('Subject id')
for index, row in groups.iterrows():
    # give me the first row of each group
    continue
Is there a pythonic way to do the above?

Direct solution - without .groupby() - using .drop_duplicates()
What you want is to keep only the rows with the first occurrence of each value in a specific column:
df.drop_duplicates(subset='Subject id', keep='first')
General solution
Using the .apply(func) in Pandas:
df.groupby('Subject id').apply(lambda df: df.iloc[0, :])
It applies a function (often generated on the fly with a lambda) to each sub-dataframe produced by df.groupby() and concatenates the results into a single final data frame.
However, the solution by @AkshayNevrekar using .first() is really nice. And as he did there, you could also attach a .reset_index() at the end.
Let's call this the more general solution - you could also take any nth row this way - however, it works only if every sub-dataframe has more than n rows (.iloc is 0-based).
Otherwise, use:
n = 3
col = 'Subject id'
res_df = pd.DataFrame()
for name, df in df.groupby(col):
    if n < df.shape[0]:
        res_df = res_df.append(df.reset_index().iloc[n, :])
Or as a function:
def group_by_select_nth_row(df, col, n):
    res_df = pd.DataFrame()
    for name, df in df.groupby(col):
        if n < df.shape[0]:
            res_df = res_df.append(df.reset_index().iloc[n, :])
    return res_df
Somewhat confusingly, df.append(), in contrast to list.append(), returns a new DataFrame with the row appended and leaves the original df unchanged.
Therefore you always have to reassign the result if you want the 'in place' behaviour you are used to from list.append().
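Note that DataFrame.append() was removed in pandas 2.0, so on recent versions you would collect the rows and pd.concat them instead. As a rough sketch, GroupBy.nth() gives the same nth-row-per-group selection in one call and simply skips groups with fewer than n+1 rows (the toy data below is made up, assuming a 'Subject id' column as in the question):

import pandas as pd

# minimal sketch with invented data
df = pd.DataFrame({'Subject id': [1, 1, 1, 2, 2, 3],
                   'val': [10, 11, 12, 20, 21, 30]})

n = 2
# .nth(n) keeps the (n+1)-th row of each group and skips groups that are too short
print(df.groupby('Subject id').nth(n))   # only subject 1 has a third row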

Use first() to get the first row of each group.
df = pd.DataFrame({'subject_id': [1,1,2,2,2,3,4,4], 'val':[20,32,12,34,45,43,23,10]})
# print(df.groupby('subject_id').first().reset_index())
print(df.groupby('subject_id', as_index=False).first())
Output:
   subject_id  val
0           1   20
1           2   12
2           3   43
3           4   23

Related

Concatenating values into column from multiple rows

I have a dataframe containing only duplicate "MainID" rows. One MainID may have multiple secondary IDs (SecID). I want to concatenate the values of SecID, joined by ':', in the SecID column wherever rows share a MainID. What is the best way of achieving this? Yes, I know this is not best practice, however it's the structure the software wants.
I need to keep the df structure and the values in the rest of the df. They will always match the other duplicated row; only SecID will be different.
Current:
data={'MainID':['NHFPL0580','NHFPL0580','NHFPL0582','NHFPL0582'],'SecID':['G12345','G67890','G11223','G34455'], 'Other':['A','A','B','B']}
df=pd.DataFrame(data)
print(df)
      MainID   SecID Other
0  NHFPL0580  G12345     A
1  NHFPL0580  G67890     A
2  NHFPL0582  G11223     B
3  NHFPL0582  G34455     B
Intended Structure
MainID      SecID           Other
NHFPL0580   G12345:G67890   A
NHFPL0582   G11223:G34455   B
Try:
df.groupby('MainID').apply(lambda x: ':'.join(x.SecID))
the above code returns a pd.Series, and you can convert it to a dataframe as @Guy suggested:
You need .reset_index(name='SecID') if you want it back as DataFrame
The solution to the edited question:
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
You can then change the column order
cols = df.columns.tolist()
df = df[[cols[i] for i in [0, 2, 1]]]
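For completeness, here is a sketch of the same result without apply(), using a plain aggregation (column names taken from the example data above):

import pandas as pd

data = {'MainID': ['NHFPL0580', 'NHFPL0580', 'NHFPL0582', 'NHFPL0582'],
        'SecID': ['G12345', 'G67890', 'G11223', 'G34455'],
        'Other': ['A', 'A', 'B', 'B']}
df = pd.DataFrame(data)

# join the SecID values within each (MainID, Other) group
out = df.groupby(['MainID', 'Other'], as_index=False).agg({'SecID': ':'.join})
print(out[['MainID', 'SecID', 'Other']])   # restore the original column order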

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 csv files (A and B) and find the rows that are present in B but not in A, based only on specific columns.
I found a few answers to this, but they still don't give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work: it works for a single column but not for multiple columns.
Answer 2 :
df = pd.concat([old, new])                # concat dataframes
df = df.reset_index(drop=True)            # reset the index
df_gpby = df.groupby(list(df.columns))    # group by all columns
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # indexes of non-duplicated rows
final = df.reindex(idx)
This takes as an input specific columns and also outputs specific columns. I want to print the whole record and not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
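Another common approach, as a sketch: merge with indicator=True and keep the rows flagged as appearing only in B. The frames and column names below are made up for illustration.

import pandas as pd

# hypothetical example frames; 'column1'/'column2' are the key columns
A = pd.DataFrame({'column1': [1, 2], 'column2': ['x', 'y'], 'extra': ['a', 'b']})
B = pd.DataFrame({'column1': [1, 3], 'column2': ['x', 'z'], 'extra': ['c', 'd']})
keys = ['column1', 'column2']

merged = B.merge(A[keys].drop_duplicates(), on=keys, how='left', indicator=True)
only_in_B = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(only_in_B)   # full B records whose key combination is missing from A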

Select a specific slice of data from a main dataframe, conditional on a value in a main dataframe column

I have a main dataframe (df) with a Date column (non-index), a column 'VXX_Full' with values, and a 'signal' column.
I want to iterate through the signal column, and whenever it is 1, I want to capture a slice (20 rows before, 40 rows after) of the 'VXX_Full' column and create a new dataframe with all the slices. I would like the column names of the new dataframe to be the row numbers of the original dataframe.
VXX_signal = pd.DataFrame(np.zeros((60, 0)))
counter = 1
for row in df.index:
    if df.loc[row, 'signal'] == 1:
        add_row = df.loc[row - 20:row + 20, 'VXX_Full']
        VXX_signal[counter] = add_row
        counter += 1
VXX_signal
It just doesn't seem to be working. It creates a dataframe, but the values are all NaN. The first slice at least appears to pull data from the main df, although the data doesn't correspond to the correct location. The remaining columns (there are 30 signals, so 30 columns are created) in the new df are all NaN.
Thanks in advance!
I'm not sure about your current code - but basically all you need is a list of ranges of indexes. If your index is linear, this would be something like:
indexes = list(df[df.signal == 1].index)
ranges = [(i, list(range(i - 20, i + 21))) for i in indexes]  # tuples of (original index, range)
dfs = [df.loc[i[1]].copy().rename(
           columns={'VXX_Full': i[0]}).reset_index(drop=True) for i in ranges]

# EDIT: for only the VXX_Full column:
dfs = [df.loc[i[1], ['VXX_Full']].copy().rename(
           columns={'VXX_Full': i[0]}).reset_index(drop=True) for i in ranges]

# Here we take the -20:+20 slice of df as a separate dataframe, then rename
# 'VXX_Full' to the original index value and reset the index to 0:40.
# The new index will be useful when putting all the columns next to each other.
So we made a list of indexes with signal == 1, turned it into a list of ranges and finally a list of dataframes with reset index.
Now we want to merge it all together:
from functools import reduce
merged_df = reduce(lambda left, right: pd.merge(
left, right, left_index=True, right_index=True), dfs)
I would build the resulting dataframe from a dictionary of lists:
resul = pd.DataFrame({i: df.loc[i - 20 if i >= 20 else 0: i + 40 if i <= len(df) - 40 else len(df),
                                'VXX_Full'].values
                      for i in df.loc[df.signal == 1].index})
The trick is that .values extracts a numpy array with no associated index.
Beware: above code assumes that the index of the original dataframe is just the row number. Use reset_index first if it is different.
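If the edge slices can end up shorter than the others, a pd.concat-based sketch pads them with NaN instead of failing. The toy frame below merely mimics the question's 'VXX_Full'/'signal' layout with a plain row-number index:

import numpy as np
import pandas as pd

# hypothetical stand-in for the question's dataframe
rng = np.random.default_rng(0)
df = pd.DataFrame({'VXX_Full': rng.random(200), 'signal': 0})
df.loc[[30, 150], 'signal'] = 1           # two example signal rows

slices = {i: df.loc[max(i - 20, 0): i + 40, 'VXX_Full'].reset_index(drop=True)
          for i in df.index[df['signal'] == 1]}
result = pd.concat(slices, axis=1)        # columns are named by the original row numbers
print(result.shape)                       # (61, 2) here; shorter edge slices would be NaN-padded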

How to keep indexes when sum by columns based on grouped_by in pandas

I have a dataset where each ID has 6 corresponding rows. I want to group this dataset by the column ID and aggregate using sum. I wrote this piece of code:
col = [col for col in train.columns if col not in ['Month', 'ID']]
train.groupby('ID')[col].sum().reset_index()
Everything works fine except that I lose the column ID. The unique IDs from my initial database disappeared, and instead I just have ids enumerated from 0 up to the number of rows in the resulting dataset. I want to keep the initial indexes, because I will need to merge this dataset with another one later. How can I deal with this problem? Thanks very much for helping!
P.S.: deleting reset_index() has no effect.
P.S.: You can see the two problems in the images. The first image shows the original database; you can see 6 entries for each ID. The second image shows the database that results from the grouped statement. First problem: the IDs are not the same as in the original table. Second problem: the sum over 6 months for each ID is not correct.
Instead of using reset_index(), you can simply use the keyword argument as_index: df.groupby('ID', as_index=False)
This will keep the ID column in the aggregated output, as described in groupby()'s documentation.
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
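Applied to the question's snippet, this would look roughly like the following (a toy 'train' frame stands in for the real data):

import pandas as pd

# hypothetical stand-in for the question's `train` dataframe
train = pd.DataFrame({'ID': ['a', 'a', 'a', 'b', 'b', 'b'],
                      'Month': [1, 2, 3, 1, 2, 3],
                      'v1': [1, 2, 3, 4, 5, 6]})

col = [c for c in train.columns if c not in ['Month', 'ID']]
print(train.groupby('ID', as_index=False)[col].sum())
#   ID  v1
# 0  a   6
# 1  b  15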
When you group a data frame by some columns, those columns become your new index.
import pandas as pd
import numpy as np
# Create data
n = 6; m = 3
col_id = np.hstack([['id-'+str(i)] * n for i in range(m)]).reshape(-1, 1)
np.random.shuffle(col_id)
data = np.random.rand(m*n, m)
columns = ['v'+str(i+1) for i in range(m)]
df = pd.DataFrame(data, columns=columns)
df['ID'] = col_id
# Group by ID
print(df.groupby('ID').sum())
Will simply give you
            v1        v2        v3
ID
id-0  2.099219  2.708839  2.766141
id-1  2.554117  2.183166  3.914883
id-2  2.485505  2.739834  2.250873
If you just want the column ID back, you just have to reset_index()
print(df.groupby('ID').sum().reset_index())
which will leave you with
     ID        v1        v2        v3
0  id-0  2.099219  2.708839  2.766141
1  id-1  2.554117  2.183166  3.914883
2  id-2  2.485505  2.739834  2.250873
Note:
groupby sorts the result by the group keys. If you don't want that for any reason, just set sort=False (see also the documentation)
print(df.groupby('ID', sort=False).sum())

print the unique values in every column in a pandas dataframe

I have a dataframe (df) and want to print the unique values from each column in the dataframe.
I need to substitute the variable (i) [column name] into the print statement
column_list = df.columns.values.tolist()
for column_name in column_list:
    print(df."[column_name]".unique()
Update
When I use this: I get "Unexpected EOF Parsing" with no extra details.
column_list = sorted_data.columns.values.tolist()
for column_name in column_list:
    print(sorted_data[column_name].unique()
What is the difference between your syntax YS-L (above) and the below:
for column_name in sorted_data:
    print(column_name)
    s = sorted_data[column_name].unique()
    for i in s:
        print(str(i))
It can be written more concisely like this:
for col in df:
    print(df[col].unique())
Generally, you can access a column of the DataFrame through indexing using the [] operator (e.g. df['col']), or through attribute (e.g. df.col).
Attribute accessing makes the code a bit more concise when the target column name is known beforehand, but has several caveats -- for example, it does not work when the column name is not a valid Python identifier (e.g. df.123), or clashes with the built-in DataFrame attribute (e.g. df.index). On the other hand, the [] notation should always work.
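A quick sketch of that caveat (the column names below are made up):

import pandas as pd

df = pd.DataFrame({'col': [1, 2], '123': [3, 4], 'index': [5, 6]})

print(df['col'].unique())    # works, equivalent to df.col.unique()
print(df['123'].unique())    # works; df.123 would be a SyntaxError
print(df['index'].unique())  # works; df.index is the row index, not this column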
The most upvoted answer is a loop solution, so here is a one-line solution using the pandas apply() method and a lambda function.
print(df.apply(lambda col: col.unique()))
This will get the unique values in proper format:
pd.Series({col:df[col].unique() for col in df})
If you're trying to create multiple separate dataframes as mentioned in your comments, create a dictionary of dataframes:
df_dict = {col: pd.DataFrame(df[col].unique(), columns=[col]) for col in df.columns}
Then you can access any dataframe easily using the name of the column:
df_dict['column name']
We can make this even more concise:
df.describe(include='all').loc['unique', :]
Pandas describe gives a few key statistics about each column, but we can just grab the 'unique' statistic and leave it at that.
Note that this will give a unique count of NaN for numeric columns - if you want to include those columns as well, you can do something like this:
df.astype('object').describe(include='all').loc['unique', :]
I was looking for a solution to this problem as well, and the code below proved to be more helpful in my situation:
for col in df:
    print(col)
    print(df[col].unique())
    print('\n')
It gives something like below:
Fuel_Type
['Diesel' 'Petrol' 'CNG']
HP
[ 90 192 69 110 97 71 116 98 86 72 107 73]
Met_Color
[1 0]
The code below provides a list of unique values for each field; I find it very useful when you want to take a deeper look at the data frame:
for col in list(df):
    print(col)
    print(df[col].unique())
You can also sort the unique values if you want them to be sorted:
import numpy as np

for col in list(df):
    print(col)
    print(np.sort(df[col].unique()))
# collect the unique values of the first 7 columns of `card` into a transposed DataFrame
cu = []
i = []
for cn in card.columns[:7]:
    cu.append(card[cn].unique())
    i.append(cn)
pd.DataFrame(cu, index=i).T
Simply do this:
for i in df.columns:
    print(df[i].unique())
Or, for a single column, it can be written as:
for val in df['column_name'].unique():
    print(val)
Even better, here's code to view all the unique values as a dataframe, transposed column-wise:
columns = [*df.columns]
unique_values = {}
for i in columns:
    unique_values[i] = df[i].unique()
unique = pd.DataFrame({k: pd.Series(v) for k, v in unique_values.items()})
unique.fillna('').T
This solution constructs a dataframe of unique values with some stats and gracefully handles any unhashable column types.
Resulting dataframe columns are: col, unique_len, df_len, perc_unique, unique_values
df_len = len(df)
unique_cols_list = []

for col in df:
    try:
        unique_values = df[col].unique()
        unique_len = len(unique_values)
    except TypeError:  # not all cols are hashable
        unique_values = ""
        unique_len = -1
    perc_unique = unique_len * 100 / df_len
    unique_cols_list.append((col, unique_len, df_len, perc_unique, unique_values))

df_unique_cols = pd.DataFrame(
    unique_cols_list,
    columns=["col", "unique_len", "df_len", "perc_unique", "unique_values"])
df_unique_cols = df_unique_cols[df_unique_cols["unique_len"] > 0].sort_values(
    "unique_len", ascending=False)
print(df_unique_cols)
The best way to do that:
Series.unique()
For example, students.age.unique() will output the different values that occur in the age column of the students data frame.
To get only the number of distinct values:
Series.nunique()
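A tiny sketch of the two calls, with a made-up students frame:

import pandas as pd

students = pd.DataFrame({'age': [18, 19, 19, 21, 18]})

print(students.age.unique())    # [18 19 21]
print(students.age.nunique())   # 3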
