Remove duplicate columns by their values - Python

I just got an assignment in which I have a lot of features (as columns) and records (as rows) in a CSV file, and I am cleaning the data using Python (including pandas). Sample data:
A,B,C
1,1,1
0,0,0
1,0,1
I would like to delete all the duplicate columns with the same values and keep only one of them. A and B would be the only columns to remain.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do it?
Thanks.

I would like to delete all the duplicate columns with the same values and keep only one of them. A will be the only column to remain.
You mean that's the only one of A and C that's kept, right? (B doesn't duplicate anything.)
You can use DataFrame.drop_duplicates:
df = df.T.drop_duplicates().T
It works on rows, not columns, so I transpose before and after calling it.
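As a quick check, a minimal sketch applying this to the sample data from the question:

```python
import pandas as pd

# Sample data from the question: C has the same values as A
df = pd.DataFrame({'A': [1, 0, 1], 'B': [1, 0, 0], 'C': [1, 0, 1]})

# Transpose so columns become rows, drop duplicate rows, transpose back
df = df.T.drop_duplicates().T
# Only A and B remain
```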
I would like to combine the columns that have high Pearson correlation with the target value, how can i do it?
You can loop over the columns and compute each one's correlation with the target using DataFrame.corr or numpy.corrcoef.
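For instance, a minimal sketch using DataFrame.corrwith, which computes the Pearson correlation of every column against a Series in one call. The column name 'target', the toy values, and the 0.9 threshold are assumptions, not from the question:

```python
import pandas as pd

# Toy data; the 'target' column name and values are assumptions for illustration
df = pd.DataFrame({
    'f1': [1, 2, 3, 4],
    'f2': [4, 3, 2, 1],
    'f3': [1, 1, 2, 1],
    'target': [2, 4, 6, 8],
})

# Pearson correlation of each feature column with the target
corrs = df.drop(columns='target').corrwith(df['target'])

# Keep only features whose absolute correlation exceeds an assumed threshold
high_corr = corrs[corrs.abs() > 0.9].index.tolist()
```

Here f1 and f2 are perfectly (anti-)correlated with the target, so only they pass the threshold.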

Related

How to modify column names when combining rows of multiple columns into single rows in a dataframe based on a categorical value. + selective sum/mean

I'm using the pandas .groupby function to combine multiple rows into a single row.
Currently I have a dataframe df_clean which has 405 rows, and 85 columns. There are up to 3 rows that correspond to a single Batch.
My current code for combining the multiple rows is:
num_V = 84  # number of columns - 1, excluding the "Batch" column they are grouped by
max_row = df_clean.groupby('Batch').Batch.count().max()
df2 = (
    df_clean.groupby('Batch')
    .apply(lambda x: x.values[:, 1:].reshape(1, -1)[0])
    .apply(pd.Series)
)
This code works, creating a dataframe df2 that groups the rows by Batch; however, the columns in the resulting dataframe are simply numbered (0, 1, 2, ..., 250, 251). Note that 84*3 = 252, i.e. (number of columns - the Batch column)*3 = 252, and Batch becomes the index.
I'm cleaning some data for analysis, and I want to combine the data of several (generally 1-3) Sub_Batch values on separate rows into a single row based on their Batch. Ideally I would like to be able to determine which columns are grouped into a row and remain separate columns in the row, as well as which columns have their average or total value reported.
For example, the desired input/output:
Original dataframe
Output dataframe
Note the naming of the columns, that all columns are copied over, and that the columns are ordered according to which Sub_Batch they belong to; i.e. Weight_2 will always correspond to the second Sub_Batch that is part of that Batch, and Weight_3 to the third.
Ideal output dataframe
Note the naming of the columns, and that in this dataframe there is only a single column that records the Color, as it is identical for all Sub_Batch values within a Batch. The individual Temperature values are recorded, as well as the average of the Temperature values for a Batch. The individual Weight values are recorded, as well as the sum of the Weight values in the column 'Total_weight'.
I am 100% okay with the Output dataframe scenario, as I will simply add the values that I want afterwards using .mean and .sum. I am simply asking if it can be done using .groupby, as it is not something I have worked with before, and I know that it does have some ability to sum or average results.
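One way to sketch this is groupby plus cumcount to number the rows within each Batch, then unstack to spread them into suffixed columns. The column names (Batch, Color, Temperature, Weight) and the toy values below are stand-ins for the real 85-column data, not taken from it:

```python
import pandas as pd

# Hypothetical sample; column names and values are assumptions for illustration
df = pd.DataFrame({
    'Batch': ['X', 'X', 'Y'],
    'Color': ['red', 'red', 'blue'],
    'Temperature': [10.0, 20.0, 30.0],
    'Weight': [1.0, 2.0, 3.0],
})

# Number each row within its Batch (1-based) so columns can be suffixed per sub-batch
df['sub'] = df.groupby('Batch').cumcount() + 1

# Pivot the sub-batch number into the columns, then flatten the
# resulting MultiIndex into names like Temperature_1, Temperature_2
wide = df.set_index(['Batch', 'sub']).unstack('sub')
wide.columns = [f'{col}_{n}' for col, n in wide.columns]

# Add the aggregates afterwards, as the question proposes
wide['Avg_temperature'] = df.groupby('Batch')['Temperature'].mean()
wide['Total_weight'] = df.groupby('Batch')['Weight'].sum()
```

Batches with fewer sub-batches simply get NaN in the higher-numbered columns.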

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns are different, as is the number of rows.
How can I get the rows from dataframe MR whose 'ID' values are equal to those in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed in https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they propose (and my goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe of combined records for the same ID, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
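A small runnable sketch of both options with made-up toy data (the column names besides 'ID' are hypothetical):

```python
import pandas as pd

# Toy frames; 'ID' mirrors the question, the other columns are made up
MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'x': [10, 20, 21, 30, 40]})
DT = pd.DataFrame({'ID': [2, 3], 'y': ['a', 'b']})

# isin keeps every MR row whose ID appears in DT (repeated IDs are preserved)
filtered = MR.loc[MR['ID'].isin(DT['ID'])]

# merge combines columns from both frames, but only where ID matches
combined = pd.merge(MR, DT, on='ID')
```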

Looking for the "pandas" way to aggregate multiple columns to a single row

I'm working with tables containing a number of image-derived features, with a row of features for every timepoint of an image series, and multiple images. An image identifier is contained in one column. To condense this (primarily to be used in a parallel coordinates plot), I want to reduce/aggregate the columns to a single value. The reduce operation has to be chosen depending on the feature in the column (for example mean, max, or something custom). DataFrame.agg seemed like the natural thing to do, but it gives me a table with multiple rows. Right now I'm doing something like this:
result_df = pd.DataFrame()
for col in df.columns:
    if col in ReduceThisColumnByMean:
        result_df[col] = [df[col].mean()]  # wrap in a list to get a one-row frame
    elif col in ReduceThisColumnByMax:
        result_df[col] = [df[col].max()]
This seems like a detour to me, and might not scale well (not a big concern, as the number of reduce operations will most probably not grow beyond a few). Is there a more pandas-esque way to aggregate multiple columns with specific operations into a single row?
You can select the columns by list, get the mean and max, join them together with concat, and finally convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose:
result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
                       df[ReduceThisColumnByMax].max()]).to_frame().T
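A self-contained sketch of this approach with toy data (the column names and lists are assumptions):

```python
import pandas as pd

# Hypothetical feature table and column groupings
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
ReduceThisColumnByMean = ['a', 'b']
ReduceThisColumnByMax = ['c']

# Mean-reduce one column group, max-reduce the other, then combine
# the two Series and transpose into a single-row DataFrame
result_df = pd.concat([df[ReduceThisColumnByMean].mean(),
                       df[ReduceThisColumnByMax].max()]).to_frame().T
```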

Aggregation based on the median of probability distribution

I have a big data set (75 million rows) consisting of 12 columns. The rows are repeated, holding the same values except for the last 2 columns, which constitute a probability distribution.
As we can see in this snippet, the rows are equal in values in the first 10 columns, and the last 2 (value_count, value) are the probability distribution for the rows. I want to aggregate those rows into one row based on the median of the probability distribution of value_count, value.
EDIT: edited after the comment.
You can get the median easily by using summary:
df_result = your_table.select("value_count", "value").summary("50%")
The result is a dataframe with one row and 2 columns. You can join it back to your original dataframe if you want to: your_table.select("col1", .. , "coln").distinct().join(df_result, "outer")
Alternatively, there are DataFrame.approxQuantile and the SQL function percentile_approx, which could probably do the job without using a join (as above).

How to find rows that differ by only one column in pandas?

I have a dataframe with three columns. I have grouped them based on two of the three columns. Now I need to find only those rows where the two columns word1, word2 are the same but the third column, Tag, is different.
In other words, I need to find those rows where, for the same word1 and word2, we have different labels. But I am not able to filter the dataframe based on the groupby construct shown below:
newComps.groupby(['word1','word2']).count()
Here it will be helpful if I can see only the ones with the same word1, word2 but with a different Tag, rather than all the entries. I have tried calling the above code inside [], as we do to filter data, but to no avail.
Ideally I should see only:
A,gawam,A1
A,gawam,BS1
A,gawaH,T1
A,gawaH,T2
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Look at the subset and keep options.
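Another way to get exactly those rows is groupby plus filter, keeping only the (word1, word2) groups that carry more than one distinct Tag. A minimal sketch, with sample data mirroring the expected output above plus one non-matching row:

```python
import pandas as pd

# Sample data based on the expected output in the question;
# the last row is a made-up non-matching entry
df = pd.DataFrame({
    'word1': ['A', 'A', 'A', 'A', 'B'],
    'word2': ['gawam', 'gawam', 'gawaH', 'gawaH', 'x'],
    'Tag':   ['A1', 'BS1', 'T1', 'T2', 'T1'],
})

# Keep only groups that share word1/word2 but have more than one distinct Tag
result = df.groupby(['word1', 'word2']).filter(lambda g: g['Tag'].nunique() > 1)
```

The (B, x) row has a unique word1/word2 pair, so its group is filtered out.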
