How to find rows that differ by only one column in pandas? - python

I have a dataframe with three columns, and I have grouped it on two of them. Now I need to find only those rows where the columns word1 and word2 are the same but the third column, Tag, is different.
In other words, I need to find the rows where the same word1 and word2 carry different labels. But I am not able to filter the DataFrame based on the groupby construct shown below:
newComps.groupby(['word1','word2']).count()
Here it will be helpful if I can see only the entries with the same word1 and word2 but a different Tag, rather than all the entries. I tried putting the above code inside [], as we do to filter data, but to no avail.
Ideally I should see only:
A,gawam,A1
A,gawam,BS1
A,gawaH,T1
A,gawaH,T2

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Look at the subset and keep options.
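A minimal sketch of one way to get the rows asked for, assuming the frame is called newComps as in the question and using made-up sample rows. drop_duplicates with subset only removes exact repeats; the filtering itself is done here by counting distinct Tag values per (word1, word2) pair, which is an assumption about the intent rather than something the linked page prescribes.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's frame
newComps = pd.DataFrame({
    "word1": ["A", "A", "A", "A", "B", "B"],
    "word2": ["gawam", "gawam", "gawaH", "gawaH", "x", "x"],
    "Tag": ["A1", "BS1", "T1", "T2", "Z", "Z"],
})

# Remove rows that repeat the same (word1, word2, Tag) combination
unique_rows = newComps.drop_duplicates(subset=["word1", "word2", "Tag"])

# Count the distinct tags within each (word1, word2) pair, aligned to the rows
tag_counts = unique_rows.groupby(["word1", "word2"])["Tag"].transform("nunique")

# Keep only the pairs that carry more than one distinct tag
conflicting = unique_rows[tag_counts > 1]
print(conflicting)
```

Using transform keeps the result aligned with the original rows, so it can be used directly as a boolean mask.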

Related

Pandas - drop rows based on two conditions on different columns

Although there are several related questions answered for pandas, I cannot solve this issue. I have a large dataframe (~49,000 rows) and want to drop the rows (~120) that meet two conditions at the same time:
For one column: an exact string
For another column: a NaN value
My code is ignoring the conditions and no row is removed.
to_remove = ['string1', 'string2']
df.drop(df[df['Column 1'].isin(to_remove) & (df['Column 2'].isna())].index, inplace=True)
What am I doing wrong? Thanks for any hint!
Instead of calling drop and passing the index, you can create a mask for the rows you want to keep and select only those rows. There also seems to be a logic error: you are checking two different conditions, combined with AND, against the same column's values.
df[~(df['Column 1'].isin(to_remove) & (df['Column 2'].isna()))]
Also, if you need to check both conditions in the same column, you probably want to combine them with OR, i.e. |
If needed, you can call reset_index at the end.
Also, as a side note, your list to_remove contains the same string twice; I'm assuming that's a typo in the question.
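For illustration, a small self-contained sketch of the mask approach. The sample values are invented; the column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the first and last rows meet both conditions
df = pd.DataFrame({
    "Column 1": ["string1", "string1", "other", "string2"],
    "Column 2": [np.nan, 1.0, np.nan, np.nan],
})
to_remove = ["string1", "string2"]

# True for rows that should be dropped: listed string AND missing value
drop_mask = df["Column 1"].isin(to_remove) & df["Column 2"].isna()

# Negate the mask to keep everything else, then renumber the index
kept = df[~drop_mask].reset_index(drop=True)
print(kept)
```

If nothing is removed on real data, it may be worth checking that the strings match exactly (case, whitespace) and that the missing values are real NaN rather than the text "NaN".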

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR. The two dataframes are only similar (not equal) in this ID column; the rest of the columns differ, as does the number of rows.
How can I get the rows from dataframe MR whose 'ID' equals an 'ID' in dataframe DT? Note that values in 'ID' can appear several times in the same column.
DT has 1538 rows and MR has 2060 rows.
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or if you want to just get a new dataframe of combined records for the same ID you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
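A small sketch contrasting the two options on invented data (the mr_value and dt_value columns are placeholders, not from the question):

```python
import pandas as pd

# Hypothetical frames; ID 2 appears twice in MR
MR = pd.DataFrame({"ID": [1, 2, 2, 3, 4], "mr_value": ["a", "b", "c", "d", "e"]})
DT = pd.DataFrame({"ID": [2, 4], "dt_value": ["x", "y"]})

# Option 1: keep MR's rows (and only MR's columns) whose ID occurs in DT
matching_id = MR["ID"].isin(DT["ID"])
filtered = MR.loc[matching_id, :]

# Option 2: inner merge, which also brings in DT's columns
merged = pd.merge(MR, DT, on="ID")

print(filtered)
print(merged)
```

One difference to keep in mind: isin never changes the number of MR rows kept per ID, while merge produces one row per matching pair, so an ID repeated in both frames is multiplied.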

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do that the data that I merged doesn't carry over and I get NaN values instead. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that tries to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
with this output (the id, ts and tastant columns should match correctly with the first dataframe but don't).
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports; it could be treating numbers as int in one dataframe and as object in the other.
For the string columns, check for extra whitespace. It can appear in datasets, and since you can't see it but pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
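A rough sketch of both checks on invented data. The frame names follow the question, but the columns shown and their values are assumptions, and only two key columns are used to keep it short.

```python
import pandas as pd

# Hypothetical data: 'trial' is int in one frame and str in the other,
# and one 'subject' value has a trailing space
first = pd.DataFrame({"trial": [1, 2], "subject": ["s1 ", "s2"], "response": [0.5, 0.7]})
second = pd.DataFrame({"trial": ["1", "2"], "subject": ["s1", "s2"], "tastant": ["t1", "t2"]})

keys = ["trial", "subject"]

# Compare the key dtypes side by side
print(pd.DataFrame({"first": first[keys].dtypes, "second": second[keys].dtypes}))

# Normalise the keys in both frames: same type, no surrounding whitespace
for frame in (first, second):
    for col in keys:
        frame[col] = frame[col].astype(str).str.strip()

combined = first.merge(second, on=keys, how="left")
print(combined)
```

Casting everything to str is the bluntest fix; note that it turns missing values into the text "nan", so on real data a targeted cast per column may be preferable.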

combine the values of a specific column of a dataframe in one row or unit

I want to combine the different values/rows of a certain column. These values are texts, and I want to combine them to perform a word count and find the most common words.
The dataframe is called df and has 30 columns. I want to combine all the rows of the first column (labeled 'text') into one row, or one list, etc. It doesn't matter as long as I can perform FreqDist on it. I am not interested in grouping the values according to a certain value; I just want all the values in this column to become one block.
I looked around a lot and couldn't find what I am looking for.
Thanks a lot.
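One possible approach, offered only as a sketch since the thread shows no answer: join the column into a single string, split it into tokens, and count them. The sample texts are invented, the whitespace tokenisation is deliberately naive, and collections.Counter stands in for nltk's FreqDist here (FreqDist can be built from the same token list).

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the 'text' column of the 30-column frame
df = pd.DataFrame({"text": ["the cat sat", "the dog sat", "a cat"]})

# Combine every row of the column into one block of text
all_text = " ".join(df["text"].astype(str))

# Naive tokenisation: lowercase and split on whitespace
tokens = all_text.lower().split()

# Count occurrences of each word
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```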

Remove duplicate columns only by their values

I just got an assignment in which I have a lot of features (as columns) and records (as rows) in a csv file.
Cleaning the data using Python (including pandas):
A,B,C
1,1,1
0,0,0
1,0,1
I would like to delete all the duplicate columns with the same values and keep only one of them. A and B will be the only columns to remain.
I would also like to combine the columns that have a high Pearson correlation with the target value. How can I do it?
Thanks.
"I would like to delete all the duplicate columns with the same values and keep only one of them. A will be the only column to remain."
You mean that's the only one among A and C that's kept, right? (B doesn't duplicate anything.)
You can use DataFrame.drop_duplicates:
df = df.T.drop_duplicates().T
It works on rows, not columns, so I transpose before/after calling it.
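Applied to the sample data from the question, the transpose trick looks like this. The second variant, using duplicated() as a column mask, is an alternative not mentioned in the answer.

```python
import pandas as pd

# The sample data from the question: A and C hold identical values
df = pd.DataFrame({"A": [1, 0, 1], "B": [1, 0, 0], "C": [1, 0, 1]})

# Transpose so columns become rows, drop duplicate rows, transpose back
deduped = df.T.drop_duplicates().T

# Alternative: mark duplicated columns and select the rest,
# which leaves the original column dtypes untouched
alt = df.loc[:, ~df.T.duplicated()]

print(list(deduped.columns))
print(list(alt.columns))
```

With mixed column types, transposing can turn everything into object dtype, which is one reason to prefer the mask variant.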
"I would like to combine the columns that have a high Pearson correlation with the target value. How can I do it?"
You can loop over the columns, matching them up and computing their correlation with DataFrame.corr or with numpy.corrcoef.
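A sketch of the first step, finding which columns correlate strongly with the target. The column names, the data, and the 0.9 threshold are all hypothetical, and DataFrame.corr is used on the whole frame instead of an explicit loop. How the selected columns should then be "combined" is left open, since the question does not say.

```python
import pandas as pd

# Hypothetical features plus a 'target' column
df = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],
    "f2": [5, 3, 4, 1, 2],
    "f3": [2, 4, 6, 8, 11],
    "target": [1.1, 2.0, 2.9, 4.2, 5.0],
})

threshold = 0.9  # assumed cut-off for "high" correlation

# Pearson correlation of every feature with the target
correlations = df.corr()["target"].drop("target")

# Keep the features whose absolute correlation exceeds the threshold
selected = correlations[correlations.abs() > threshold].index.tolist()
print(correlations)
print(selected)
```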
