I have a dataframe containing only duplicated "MainID" rows. One MainID may have multiple secondary IDs (SecID). I want to concatenate the SecID values for rows that share a MainID, joined by ':' in the SecID column. What is the best way of achieving this? Yes, I know this is not best practice, but it's the structure the software wants.
I need to keep the df structure and the values in the rest of the df; they will always match the other duplicated row. Only SecID will be different.
Current:
data={'MainID':['NHFPL0580','NHFPL0580','NHFPL0582','NHFPL0582'],'SecID':['G12345','G67890','G11223','G34455'], 'Other':['A','A','B','B']}
df=pd.DataFrame(data)
print(df)
MainID SecID Other
0 NHFPL0580 G12345 A
1 NHFPL0580 G67890 A
2 NHFPL0582 G11223 B
3 NHFPL0582 G34455 B
Intended Structure
MainID SecID Other
NHFPL0580 G12345:G67890 A
NHFPL0582 G11223:G34455 B
Try:
df.groupby('MainID').apply(lambda x: ':'.join(x.SecID))
The above code returns a pd.Series; you can convert it back to a dataframe as @Guy suggested:
You need .reset_index(name='SecID') if you want it back as DataFrame
The solution to the edited question:
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')
You can then change the column order
cols = df.columns.tolist()
df = df[[cols[i] for i in [0, 2, 1]]]
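Putting it together with the sample data above, a minimal end-to-end sketch (the final step just restores the original column order):
import pandas as pd

data = {'MainID': ['NHFPL0580', 'NHFPL0580', 'NHFPL0582', 'NHFPL0582'],
        'SecID': ['G12345', 'G67890', 'G11223', 'G34455'],
        'Other': ['A', 'A', 'B', 'B']}
df = pd.DataFrame(data)

# Join the SecID values within each (MainID, Other) group.
df = df.groupby(['MainID', 'Other']).apply(lambda x: ':'.join(x.SecID)).reset_index(name='SecID')

# Restore the original column order: MainID, SecID, Other.
df = df[['MainID', 'SecID', 'Other']]
print(df)
#      MainID          SecID Other
#0  NHFPL0580  G12345:G67890     A
#1  NHFPL0582  G11223:G34455     B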
Related
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1:
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then trying to drop any columns not in expected_cols into another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the dataframe could have either more columns than expected or fewer columns than expected. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then how to handle the case where there are more columns than expected.
You can use set difference using -:
Assuming df1 has these columns:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(columns=unwanted_cols, inplace=True)
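A self-contained version of the same idea (note that set difference does not preserve column order, so sort the result if order matters):
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]], columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']

# Columns present in df1 but not in the expected list.
unwanted_cols = list(set(df1.columns) - set(expected_cols))

df2 = df1[unwanted_cols]               # keep the "dropped" columns around
df1 = df1.drop(columns=unwanted_cols)  # df1 now holds only the expected columns

print(df1.columns.tolist())   # ['name_of_fruit', 'price']
print(df2.columns.tolist())   # ['type_of_fruit']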
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data = [[1,2,3]],
columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
I have following sample dataframe with column A and B:
df:
A B
123 555
456 123
789 666
I want to know which method can be used to print out 123 (a method to print out values of A which also exist in column B). I tried the following:
for i, row in df.iterrows():
    if row.A in row.B:
        print(row.A, row.B)
but got the error: argument of type 'float' is not iterable.
If you are trying to print any row where row.A exists in column B, then your code should be:
for i, row in df.iterrows():
    if row.A in df.B.values:  # .values checks the column's values; "in df.B" would check the index
        print(row.A, row.B)
col_B = df['B'].unique()
val_B_in_A = [ i for i in df['A'].unique() if i in col_B ]
print(val_B_in_A)
Be careful with "dot" notation in dataframes, since columns can contain spaces and it starts to be a pain dealing with those. With that said,
Depending on how many rows you are iterating over, and the proportion of rows that contain unique values, it may be computationally less expensive to iterate over the unique values in 'A', and check if each one is in 'B':
import pandas as pd
tmp = []
for value in df['A'].unique():
    tmp.append(df.loc[df['B'] == value])
df_results = pd.concat(tmp)
print(df_results)
You could also use the built-in method .isin(). In fact, much of the power of pandas is in its array-wise operators, which are significantly quicker than most approaches involving loops:
df.loc[df['B'].isin(df['A'].unique())]
And to only show one column with the ".loc" accessor, just add
df.loc[df['B'].isin(df['A'].unique()), 'A']
And to just return the values in an optimized array
df.loc[df['B'].isin(df['A'].unique()), 'A'].values
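A runnable sketch on the question's sample data. Note that the direction of the check matters: to get the values of A that also exist in B (the 123 from the question), test column A against column B:
import pandas as pd

df = pd.DataFrame({'A': [123, 456, 789], 'B': [555, 123, 666]})

# Values of A that also appear in B.
print(df.loc[df['A'].isin(df['B'].unique()), 'A'].values)
# [123]

# Rows whose B value appears somewhere in A.
print(df.loc[df['B'].isin(df['A'].unique())])
#     A    B
#1  456  123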
If you are concerned with counting exact matches, you can build a per-row match count (here, how many times each value of A occurs in B):
df['match'] = pd.Series([(df['B'] == item).sum() for item in df['A']])
I have the following data sample that I am trying to flatten out using pandas; I want to flatten this data over Candidate_Name.
This is my implementation:
df = df.merge(df, on='Candidate_Name')
but I am not getting the desired result. My desired output is as follows: basically, have all the rows that match Candidate_Name in a single row, where duplicate column names may be suffixed with _x.
I think you need GroupBy.cumcount with DataFrame.unstack, and then flatten the MultiIndex, keeping the plain column name for the first group (counter 0) and appending the counter for later groups, to avoid duplicated column names:
df = df.set_index(['Candidate_Name', df.groupby('Candidate_Name').cumcount()]).unstack()
df.columns = [a if b == 0 else f'{a}_{b}' for a, b in df.columns]
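The question's sample data isn't shown, so here is an illustration with assumed data:
import pandas as pd

df = pd.DataFrame({'Candidate_Name': ['Alice', 'Alice', 'Bob'],
                   'Skill': ['SQL', 'Python', 'Java'],
                   'Score': [80, 90, 70]})

# Number each row within its Candidate_Name group, move that counter into the
# index alongside the name, then unstack the counter into the columns.
df = df.set_index(['Candidate_Name', df.groupby('Candidate_Name').cumcount()]).unstack()

# Flatten the MultiIndex columns: plain name for the first occurrence,
# name_<counter> for the rest.
df.columns = [a if b == 0 else f'{a}_{b}' for a, b in df.columns]
print(df)
#                Skill Skill_1  Score  Score_1
#Candidate_Name
#Alice            SQL  Python   80.0     90.0
#Bob             Java     NaN   70.0      NaN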
I have two dataframes, df1 and df2. df1 has columns A,B,C,D,E,F and df2 has A,B,J,D,E,K. I want to update the second dataframe with the rows of the first, but only when the first two columns have the same value in both dataframes. For each row where the following two conditions are true:
df1.A = df2.A
df1.B = df2.B
then update accordingly:
df2.D = df1.D
df2.E = df1.E
My dataframes have different numbers of rows.
When I tried this code I got a TypeError: cannot do positional indexing with these indexers of type 'str'.
for a in df1:
    for t in df2:
        if df1.iloc[a]['A'] == df2.iloc[t]['A'] and df1.iloc[a]['B'] == df2.iloc[t]['B']:
            df2.iloc[t]['D'] = df1.iloc[a]['D']
            df2.iloc[t]['E'] = df1.iloc[a]['E']
The Question:
You'd be better served merging the dataframes than doing nested iteration.
df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'], how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])
df2['E'] = df2['E'].fillna(df2['E_old'])
del df2['D_old']
del df2['E_old']
The first line attaches columns to df2 with the values of columns D and E from the corresponding rows of df1, and renames the old columns with an _old suffix.
The next two lines fill in the rows for which df1 had no matching row, and the last two delete the initial, now outdated versions of the columns.
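A small illustration with made-up frames (column names from the question, values assumed):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': ['c', 'c'],
                    'D': [100, 200], 'E': [1000, 2000], 'F': ['f', 'f']})
df2 = pd.DataFrame({'A': [1, 3], 'B': [10, 30], 'J': ['j', 'j'],
                    'D': [-1, -3], 'E': [-10, -30], 'K': ['k', 'k']})

# Bring in D and E from df1 for every row of df2 whose (A, B) pair matches.
df2 = df2.merge(df1[['A', 'B', 'D', 'E']], on=['A', 'B'], how='left', suffixes=['_old', ''])
df2['D'] = df2['D'].fillna(df2['D_old'])
df2['E'] = df2['E'].fillna(df2['E_old'])
del df2['D_old']
del df2['E_old']
print(df2)
#   A   B  J  K      D       E
#0  1  10  j  k  100.0  1000.0
#1  3  30  j  k   -3.0   -30.0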
The Error:
Your TypeError happened because for a in df1: iterates over the columns of a dataframe, which are strings here, while .iloc only takes integers. Additionally, though you didn't get to this point, to set a value you'd need both index and column contained within the brackets.
So if you did need to set values by row, you'd want something more like
for a, row_a in df1.iterrows():
    for t, row_t in df2.iterrows():
        if df1.loc[a, 'A'] == ...
Though I'd strongly caution against doing that. If you find yourself thinking about it, there's probably either a much faster, less painful way to do it in pandas, or you're better off using another tool less focused on tabular data.
Suppose I have df1:
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
I would like to create df2 with the same shape, index and columns. I often find myself doing something like this:
df2 = pd.DataFrame(np.ones(df1.shape), index=df1.index, columns=df1.columns)
This is less than ideal. What's the pythonic way?
How about this:
df2 = df1.copy()
df2[:] = 1 # Or any other value, for the matter
The last line is not even necessary if all you want is to preserve the shape and the row/column headers.
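If you do want a specific fill value in a single step, the DataFrame constructor also accepts a scalar; this is not from the answer above, just a common alternative:
import pandas as pd
import numpy as np

dates = pd.date_range('20170101', periods=20)
df1 = pd.DataFrame(np.random.randint(10, size=(20, 3)), index=dates, columns=['foo', 'bar', 'see'])

# Scalar broadcast: same shape, index and columns as df1, every cell set to 1.
df2 = pd.DataFrame(1, index=df1.index, columns=df1.columns)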
You can also use the dataframe method "where", which allows you to keep data based on a condition while preserving the shape/index of the original df.
dates = pd.date_range('20170101',periods=20)
df1 = pd.DataFrame(np.random.randint(10,size=(20,3)),index=dates,columns=['foo','bar','see'])
df2= df1.where(df1['foo'] % 2 == 0, 9999)
df2