Reindexing dataframes and joining columns - python

Given two DataFrames A and B that have the same length (number of rows) but different integer indices, how do I add the columns of A to the columns of B while ignoring the indices? (I.e. row 1 of A goes with row 1 of B regardless of the index values.)
If A has a non-consecutive integer index, how do I reindex it to consecutive integers 1...n? The index of B is already a consecutive 1...n integer index.
Is it best practice to reindex A and then add columns from B to it?

You can combine the columns of two DataFrames using concat:
pd.concat([A, B], axis=1)
To make the index consecutive integers you can use reset_index; pass drop=True so the old index is discarded rather than inserted as a new column (this gives pandas' default 0...n-1 RangeIndex):
A.reset_index(drop=True, inplace=True)
Or, alternatively you can match the index of B to that of A using:
B.index = A.index
What the "best" choice is here I think depends on the context/the meaning of the index.
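As a minimal sketch with made-up frames A and B, the row-by-row pairing can be achieved by dropping both indexes before the concat:

```python
import pandas as pd

# Hypothetical frames of equal length with mismatched integer indices
A = pd.DataFrame({'x': [10, 20, 30]}, index=[7, 3, 9])
B = pd.DataFrame({'y': [1, 2, 3]}, index=[1, 2, 3])

# Concatenating directly aligns on the indices and pads with NaN
bad = pd.concat([A, B], axis=1)

# Dropping both indexes first pairs the frames row by row
combined = pd.concat([A.reset_index(drop=True),
                      B.reset_index(drop=True)], axis=1)
```

Here combined pairs (10, 1), (20, 2), (30, 3) on a fresh 0...n-1 index, while bad contains NaN wherever an index value appears in only one frame.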

Related

Pandas merge two data frame only to first occurrence

I have two dataframes and can merge them with pd.merge(df1, df2, on='column_name'), but I only want to merge on the first occurrence of each key in df1. It's a many-to-one merge, and only the first occurrence should receive the merged values. Any pointers or solutions? Thanks in advance!
Since you want to merge two dataframes of different lengths, the merged dataframe will have to contain NaN in the cells that have no corresponding row in df2. So let's try this: merge left, which duplicates df2's values for rows of df1 with a repeated column_name; then use a mask of those duplicated rows to assign NaN in the columns that came from df2.
mask = df1['column_name'].duplicated()
new_df = df1.merge(df2, how='left', on='column_name')
new_df.loc[mask, df2.columns[df2.columns!='column_name']] = np.nan
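A self-contained sketch of those three steps on made-up frames (names hypothetical):

```python
import numpy as np
import pandas as pd

# df1 has a repeated key 'a'; df2 maps each key to one value
df1 = pd.DataFrame({'column_name': ['a', 'a', 'b'], 'v': [1, 2, 3]})
df2 = pd.DataFrame({'column_name': ['a', 'b'], 'w': [10, 20]})

mask = df1['column_name'].duplicated()                 # True for repeat rows
new_df = df1.merge(df2, how='left', on='column_name')  # duplicates df2 values
other_cols = df2.columns[df2.columns != 'column_name'] # columns from df2
new_df.loc[mask, other_cols] = np.nan                  # blank all but the first hit
```

After the assignment, the 'w' column reads 10, NaN, 20: only the first 'a' row keeps its merged value.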

Get columns with the max value accounting for ties in pandas

Let's say that we have this pandas DataFrame.
I want to know which column has the maximum value in each row.
The output for rows 1, 2 and 3 would be all 5 columns,
for row 4 it would be visits_total,
and for row 5 it would be ['content_gene_strength', 'sport_gene_strength', 'visits_total'].
Thanks
Compare all columns to the row-wise maximum with DataFrame.eq, then use DataFrame.dot to matrix-multiply the resulting boolean frame with the column names plus a separator, and finally strip the trailing separator with Series.str.rstrip:
df['new'] = df.eq(df.max(axis=1), axis=0).dot(df.columns + ',').str.rstrip(',')
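For example, on a reduced three-column version of the frame (values made up):

```python
import pandas as pd

df = pd.DataFrame({'content_gene_strength': [5, 1, 3],
                   'sport_gene_strength':   [5, 2, 3],
                   'visits_total':          [5, 2, 1]})

# True where a cell equals its row's max; dot() concatenates the
# column names of the True cells, each followed by a comma
df['new'] = df.eq(df.max(axis=1), axis=0).dot(df.columns + ',').str.rstrip(',')
```

Row 1 ties on all three columns, row 2 ties on the last two, and row 3 ties on the first two, so each row lists exactly its maximal columns.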

Comparing Overlap in Pandas Columns

So I have four columns in a pandas dataframe, column A, B, C and D. Column A contains 30 words, 18 of which are in column B. Column C contains either a 1 or 2 (keyboard response to column B words) and column D contains 1 or 2 also (the correct response).
What I need to do is see the total correct for only the words where column A and B overlap. I understand how to compare the C and D columns to get the total correct once I have the correct dataframe, but I am having a hard time wrapping my head around comparing the overlap in A and B.
Use Series.isin():
df.B.isin(df.A)
That will give you a boolean Series the same length as df.B indicating for each value in df.B whether it is also present in df.A.
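A small sketch (made-up words and responses) combining the overlap test with the C/D comparison:

```python
import pandas as pd

# Hypothetical data: B's words partly overlap A's; C is the keyboard
# response and D is the correct response
df = pd.DataFrame({'A': ['cat', 'dog', 'bird', 'fish'],
                   'B': ['dog', 'mole', 'cat', 'wolf'],
                   'C': [1, 2, 2, 1],
                   'D': [1, 1, 2, 2]})

overlap = df[df.B.isin(df.A)]                  # rows whose B word also appears in A
total_correct = (overlap.C == overlap.D).sum() # correct responses among those rows
```

The boolean mask keeps only the overlapping rows ('dog' and 'cat' here), and the C/D comparison is then restricted to them.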

Pandas reindexing task based on a column value

I have a dataframe with millions of rows with unique indexes and a column('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values but remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged like in a keep=first strategy. Example below.
Input dataframe:
df = pd.DataFrame(data = [[1,"non_duplicated_1"],
[2,"duplicated"],
[2,"duplicated"],
[3,"non_duplicated_2"],
[4,"non_duplicated_3"]],
index=['one','two','three','four','five'],
columns=['a','b'])
desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult...Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4
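Putting the pieces together end to end on the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame(data=[[1, "non_duplicated_1"],
                        [2, "duplicated"],
                        [2, "duplicated"],
                        [3, "non_duplicated_2"],
                        [4, "non_duplicated_3"]],
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=['a', 'b'])

# Join the old indexes of each 'b' group and keep the first 'a' value
out = (df.reset_index()
         .groupby('b', as_index=False, sort=False)
         .agg({'index': ','.join, 'a': 'first'})
         .set_index('index'))
```

One vectorized groupby pass covers both the index concatenation and the keep-first behaviour, with no Python-level loop over the rows.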

How to multiply one column by several other columns in a Python DataFrame

I have a DataFrame with 100 columns, and I want to multiply one column ('Count') by the columns at positions 6 to 74. Please tell me how to do that.
I have been trying
df = df.ix[0, 6:74].multiply(df["Count"], axis="index")
df = df[df.columns[6:74]]*df["Count"]
Neither of them works.
The result Dataframe should be of 100 columns with all original columns where columns number 6 to 74 have the multiplied values in all the rows.
You can multiply the columns in place, but note that plain * between a DataFrame and a Series aligns the Series index with the DataFrame's column labels (not its rows), which would fill the block with NaN here. Use multiply with axis=0 so 'Count' is broadcast down the rows:
columns = df.columns[6:75]
df[columns] = df[columns].multiply(df['Count'], axis=0)
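Note that multiply with axis=0 aligns 'Count' on the rows, whereas a plain df[columns] * df['Count'] would align it on the column labels and yield NaN. A runnable sketch on a toy frame (column names hypothetical):

```python
import pandas as pd

# Toy stand-in for the 100-column frame; 'p' and 'q' play the role
# of the columns at positions 6..74
df = pd.DataFrame({'Count': [2, 3],
                   'p': [1, 1],
                   'q': [10, 10]})

cols = ['p', 'q']  # stands in for df.columns[6:75]
df[cols] = df[cols].multiply(df['Count'], axis=0)  # row-wise broadcast
```

Each row of the selected columns is scaled by that row's 'Count', and all other columns (including 'Count' itself) are left untouched.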
