Python Pandas Dataframes comparison on 2 columns (with where clause)

I'm stuck on a particular Python question here. I have two dataframes, DF1 and DF2. Both have the columns pID and yID (which are not indexed, just the default index). I'm looking to add a column Found to DF1 that flags where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to restrict the search to just values in DF2 where aID == 'Text'.
I believe the line below gets me the first part of this question; however, I'm unsure how to incorporate the where clause.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.

You could subset the second dataframe to the rows containing aID == 'Text' to get a reduced DF, and select from it the columns to be compared against the first dataframe.
Use DF.isin() to check whether the values present under these column names match. .all(axis=1) returns True for a row only when both columns match, and False otherwise. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
import pandas as pd

df1 = pd.DataFrame(dict(pID=[1, 2, 3, 4, 5],
                        yID=[10, 20, 30, 40, 50]))
df2 = pd.DataFrame(dict(pID=[1, 2, 8, 4, 5],
                        yID=[10, 12, 30, 40, 50],
                        aID=['Text', 'Best', 'Text', 'Best', 'Text']))
If it does not matter where those matches occur, then merge the two dataframes with the common columns 'pID' and 'yID' as the key. Pandas does not allow on= to be combined with right_index=True, so carry df1's row labels through the merge explicitly by resetting the index first; the labels that survive the merge are exactly the rows of df1 where a match was found.
Assign the value 1 to a new column named Found at those labels, then fill its remaining elements with 0s throughout.
idx = pd.merge(df1_sub.reset_index(), df2_sub, on=['pID', 'yID'])['index']
df1.loc[idx, 'Found'] = 1
df1['Found'] = df1['Found'].fillna(0).astype(int)  # assignment avoids the chained inplace fillna warning
df1 will be modified accordingly after the above steps.
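For the demo frames above, both approaches should flag rows 0 and 4, since (1, 10) and (5, 50) are the only (pID, yID) pairs that also appear in the aID == 'Text' subset of df2:
   pID  yID  Found
0    1   10      1
1    2   20      0
2    3   30      0
3    4   40      0
4    5   50      1
Note that isin() compares values at aligned index labels, whereas the merge matches pairs anywhere in df2_sub; the two agree here because the matching rows happen to line up.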

Related

How to get rows from one dataframe based on another dataframe

I just edited the question because maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The column 'ID' in dataframe DT is a subset of the column 'ID' in dataframe MR; the two frames are similar (not equal) only in this ID column, while the rest of the columns differ, as does the number of rows.
How can I get the rows of MR whose 'ID' is equal to a value in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT is 1538 rows and MR is 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe combining the records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
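A minimal sketch with made-up frames, showing that isin() keeps every matching row, including IDs that repeat in MR:
import pandas as pd

MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'x': list('abcde')})
DT = pd.DataFrame({'ID': [2, 4], 'y': ['p', 'q']})

matching_id = MR.ID.isin(DT.ID)  # one boolean per row of MR
print(MR.loc[matching_id, :])    # keeps both rows with ID == 2, plus ID == 4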

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do, the data that I merged doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header; the second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that tries to merge them on the common columns:
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
with output where the id, ts, and tastant columns should match the first dataframe correctly, but don't.
Check your dtypes and make sure they match between the two dataframes. Pandas makes assumptions about data types when it imports; it could be reading numbers as int in one dataframe and object in the other.
For the string columns, check for extra whitespace. It can creep into datasets, and since you can't see it but pandas can, it results in no match. You can strip it with df['column'].str.strip().
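As a rough pre-merge checklist (a sketch reusing the key columns from your merge call; adapt the names to your data):
# Compare the dtypes of the join keys across both frames
keys = ['trial', 'experiment', 'subject', 'date', 'procedure']
print(first[keys].dtypes)
print(second[keys].dtypes)

# Normalize likely culprits: cast the keys to str and strip stray whitespace
for col in keys:
    first[col] = first[col].astype(str).str.strip()
    second[col] = second[col].astype(str).str.strip()

combined = first.merge(second, on=keys, how='left')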
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html

How to check split values of a column without changing the dataframe?

I am trying to find rows hitting specific conditions and put a value in the column col.
My current implementation is:
df.loc[~(df['myCol'].isin(myInfo)), 'col'] = 'ok'
In the future, myCol will hold multiple pieces of info. So I need to split the value in myCol without changing the dataframe and check if any of the split values are in myInfo. If one of them is, the current row should get the value 'ok' in the column col. Is there an elegant way to do this without actually splitting and saving into an extra variable?
Currently, I do not know how the multiple pieces of info will be represented (either separated by a character or just concatenated one after another, each consisting of 4 alphanumeric values).
Let's say you need to split on "-" for your myCol column.
sep = '-'
deconcat = df['myCol'].str.split(sep, expand=True)
new_df = df.join(deconcat)
The new_df DataFrame will have the same index as df, so you can do what you want with new_df and then join back to df to filter it how you want.
You can apply the .isin check above to each of the new split columns to get your desired result, as sketched below.
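Putting that together, a sketch assuming '-' as the separator (myInfo is the list from your snippet):
# Split myCol into pieces, test each piece against myInfo,
# and flag rows where any piece matches
pieces = df['myCol'].str.split('-', expand=True)
df.loc[pieces.isin(myInfo).any(axis=1), 'col'] = 'ok'
Negate the condition with ~ if you want to keep the semantics of your original one-liner.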
Source: code taken from the pyjanitor documentation, which has a built-in function, deconcatenate_column, that does exactly this.

Python pandas issues with .drop and a non-unique index

I have a pandas DataFrame, say df, and I'm trying to drop certain rows by an index. Specifically:
myindex = df[df.column2 != myvalue].index
df.drop(myindex, inplace = True)
This seems to work just fine for most DataFrames, but strange things happen with one DataFrame where I get a non-unique index myindex (I am not quite sure why, since the DataFrame has no duplicate rows). To be more precise, many more rows get dropped than there are values in the index (in the extreme case, I actually drop all rows even though there are several hundred rows where column2 has myvalue). Extracting only unique values (myindex.unique()) and dropping the rows using the unique index doesn't help either. At the same time,
df = df[df.column2 != myvalue]
works just as I'd like it to. I'd rather use the in-place drop, however; but more importantly, I would like to understand why the results are not the same with the direct assignment and with the drop method using the index.
Unfortunately, I cannot provide the data, as it cannot be published, and since I am not sure what exactly is wrong, I cannot simulate it either. However, I suspect it has something to do with myindex being non-unique (which also confuses me, since there are no duplicate rows in df, but it may well be that I misunderstand how the index is created).
If there are repeated values in your index, doing reset_index first might help. That will set your current index as a column and add a new sequential index (with unique values) instead.
df = df.reset_index()
The reason the two methods are not the same is that in one case you are passing a series of booleans that represents which rows to keep and which to drop (index values are not relevant here). With drop, you are passing a list of index values, and a duplicated label maps to several positions, so every row carrying that label is dropped.
Finally, to check if your index has duplicates, you shouldn't check for duplicate rows. Simply do:
df.index.has_duplicates
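A toy example (made-up data) of how a duplicated label inflates a drop:
import pandas as pd

df = pd.DataFrame({'column2': ['a', 'b', 'a']}, index=[0, 0, 1])
myindex = df[df.column2 != 'b'].index  # Index([0, 1]); label 0 also sits on the 'b' row
print(df.drop(myindex))                # drops every row labelled 0 or 1: nothing is left
print(df[df.column2 == 'b'])           # the boolean mask keeps the single 'b' row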

Why do I get a Pandas data frame with only one column vs a Series?

I've noticed single-column data frames a couple of times, much to my chagrin (examples below); but in most other cases a one-column data frame would just be a Series. Is there any rhyme or reason as to why a one-column DF would be returned?
Examples:
1) when indexing columns by a boolean mask where the mask only has one true value:
df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])
mask = [False, True, False]
type(df.loc[:, mask])
2) when setting an index on a DataFrame that only has two columns to begin with:
df = pd.DataFrame([list('ab'), list('de'), list('fg')], columns=['foo', 'bar'])
type(df.set_index('foo'))
I feel like if I'm expecting a DF with only one column, I can deal with it by just calling
pd.Series(df.values.ravel(), index=df.index)
In general, a one-column DataFrame will be returned when the operation could return a multicolumn DataFrame. For instance, when you use a boolean column index, a multicolumn DataFrame would have to be returned if there was more than one True value, so a DataFrame will always be returned, even if it has only one column. Likewise when setting an index, if your DataFrame had more than two columns, the result would still have to be a DataFrame after removing one for the index, so it will still be a DataFrame even if it has only one column left.
In contrast, if you do something like df.loc[:, 'col'], it returns a Series, because there is no way that passing one column name to select can ever select more than one column.
The idea is that doing an operation should not sometimes return a DataFrame and sometimes a Series based on features specific to the operands (i.e., how many columns they happen to have, how many values are True in your boolean mask). When you do df.set_index('col'), it's simpler if you know that you will always get a DataFrame, without having to worry about how many columns the original happened to have.
Note that there is also the DataFrame method .squeeze() for turning a one-column DataFrame into a Series.
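For example:
import pandas as pd

df = pd.DataFrame([list('abc'), list('def')], columns=['foo', 'bar', 'tar'])
one_col = df.loc[:, [False, True, False]]  # still a one-column DataFrame
s = one_col.squeeze()                      # now a Series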
