How do I iterate through two dataframes of different sizes? - python

Specifically I want to iterate through two dataframes, one being large and one being small.
Ultimately, I would like to compare values within a certain column.
I tried creating a nested for loop, with the outer loop iterating through the large dataframe and the inner loop iterating through the small dataframe, but I am having difficulties.
I'm looking for a way to identify the rows where the "name" and "value" in my large dataframe match my small dataframe.
Background info: I am using the pandas library.
Large dataframe:
Small dataframe:
Name Value
SF 12.84
TH -49.45

If the goal is to iterate through one, or especially more, DataFrames, then explicit for loops are usually the wrong move. In this case, because you're trying to
identify the rows where the "name" and "value" in my large dataframe match my small dataframe,
the operation that you're looking for is either pd.merge or pd.DataFrame.join, which do the comparisons "under the hood" and return matching information. So, say you have the 2 DataFrames and they're called large and small. Then
import pandas as pd

new_large = pd.merge(left=large,
                     right=small,
                     how='left',
                     on=['Name', 'Value'],
                     indicator=True)
new_large['_merge'] = new_large['_merge'].apply(lambda x: 1 if x == 'both' else 0)
By doing a left join between large and small (how='left'), pd.merge returns every row of large and records, for each one, whether it has a match in small on the ('Name', 'Value') pair. Most of the heavy lifting is done by the indicator keyword which, quoting the pd.merge version 0.25.0 docs:
If True, adds a column to output DataFrame called "_merge" with
information on the source of each row.
Information column is Categorical-type and takes on a value of "left_only"
for observations whose merge key only appears in 'left' DataFrame,
"right_only" for observations whose merge key only appears in 'right'
DataFrame, and "both" if the observation's merge key is found in both.
So, new_large is the original large DataFrame with a new column called _merge: rows whose ('Name', 'Value') pair was not found in small carry the value 'left_only', and rows that matched on Name as well as Value carry the value 'both'. The last step changes 'both' and 'left_only' to 1 and 0 respectively, as you specified.
Now, the left join returned what it did because both of the Name values in the small DataFrame were present in the large DataFrame, so the left join of large and small returned the whole large DataFrame. When this is not the case, there will be NaN values resulting from pd.merge and you'll have to employ a few more tricks to get the nice Boolean (integer) column to show what matched and what didn't. HTH.
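For concreteness, here is a minimal, self-contained sketch of the approach above. The small frame reuses the two rows shown in the question; the large frame is invented purely for illustration.
import pandas as pd

# Hypothetical large frame (made up for illustration); the small frame
# reuses the two rows shown in the question.
large = pd.DataFrame({'Name': ['SF', 'TH', 'AB', 'SF'],
                      'Value': [12.84, -49.45, 3.50, 99.99]})
small = pd.DataFrame({'Name': ['SF', 'TH'],
                      'Value': [12.84, -49.45]})

new_large = pd.merge(left=large, right=small, how='left',
                     on=['Name', 'Value'], indicator=True)
new_large['_merge'] = new_large['_merge'].apply(lambda x: 1 if x == 'both' else 0)

# _merge is now 1 for the (SF, 12.84) and (TH, -49.45) rows and 0 elsewhere
print(new_large)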


How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
Column 'A' (the ID column) in dataframe DT is a subset of column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows from dataframe MR whose 'ID' values are equal to values in DT['ID'], knowing that values in 'ID' can appear several times in the same column?
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they proposed (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you just want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
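As a minimal, self-contained sketch with made-up MR and DT frames (only the 'ID' column is taken from the question; the other columns and values are invented):
import pandas as pd

MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'mr_col': ['a', 'b', 'c', 'd', 'e']})
DT = pd.DataFrame({'ID': [2, 4, 4], 'dt_col': ['x', 'y', 'z']})

# Option 1: keep the MR rows whose ID appears anywhere in DT['ID']
matching_id = MR['ID'].isin(DT['ID'])
mr_matched = MR.loc[matching_id, :]      # rows with ID 2, 2 and 4

# Option 2: combine columns from both frames wherever the IDs match
combined = pd.merge(MR, DT, on='ID')     # repeated IDs multiply the rows

print(mr_matched)
print(combined)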

list of booleans to slice DataFrame

I'm trying to find a less manual, more convenient way to slice a Pandas DataFrame based on multiple boolean conditions. To illustrate what I'm after, here is a simplified example
df = pd.DataFrame({'col1':[True,False,True,False,False,True],'col2':[False,False,True,True,False,False]})
Suppose I am interested in the subset of the DataFrame where both 'col1' and 'col2' are True. I can find this by running:
df[(df['col1']==True) & (df['col2']==True)]
This is manageable enough in a low-dimensional example like this one, but the real DataFrame can have up to a hundred columns, so rather than having to string together a long boolean expression like the one above, I would rather read the columns of interest into a list, e.g.
['col1','col2']
and select the rows where all of the listed columns are True.
If you need all columns:
df[df.all(axis=1)==True]
If you have a list of columns:
df[df[COLS].all(axis=1)==True]
For the opposite, just compare against False:
df[df.all(axis=1)==False]
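Putting this together with the question's example df, where COLS stands for the list of columns of interest (the ==True comparison can also be dropped, since .all(axis=1) already returns booleans):
import pandas as pd

df = pd.DataFrame({'col1': [True, False, True, False, False, True],
                   'col2': [False, False, True, True, False, False]})

COLS = ['col1', 'col2']          # columns of interest, read in from wherever

# Keep the rows where every column in COLS is True (row 2 in this example)
subset = df[df[COLS].all(axis=1)]
print(subset)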

Pandas, when merging two dataframes and values for some columns don't carry over

I'm trying to combine two dataframes in pandas using a left merge on common columns, but when I do that, the data that I merged doesn't carry over and instead gives NaN values. All of the columns are objects and match that way, so I'm not quite sure what's going on.
This is my first dataframe header, which is the output from a program.
This is my second dataframe header. The second df is a 'key' document to match the first output with its correct id/tastant/etc., and they share the same date/subject/procedure/etc.
And this is my code that's trying to merge them on the common columns.
combined = first.merge(second, on=['trial', 'experiment','subject', 'date', 'procedure'], how='left')
With this output, the id, ts and tastant columns should match correctly with the first dataframe but don't.
Check your dtypes and make sure they match between the 2 dataframes. Pandas makes assumptions about data types when it imports data; it could be assuming numbers are int in one dataframe and object in another.
For the string columns, check for additional whitespace. It can appear in datasets, and since you can't see it and pandas can, it results in no match. You can use df['column'].str.strip().
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.strip.html
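As a rough sketch of both checks before merging, with made-up stand-ins for the question's first and second dataframes (the column names and values here are invented):
import pandas as pd

# Hypothetical stand-ins for the question's 'first' and 'second' dataframes
first = pd.DataFrame({'trial': ['1', '2'], 'subject': ['A ', 'B']})
second = pd.DataFrame({'trial': [1, 2], 'subject': ['A', 'B'], 'id': [10, 20]})

# 1. Compare the dtypes of the merge keys between the two frames
print(first.dtypes)
print(second.dtypes)

# 2. Normalise: cast the keys to a common type and strip stray whitespace
first['trial'] = first['trial'].astype(str)
second['trial'] = second['trial'].astype(str)
first['subject'] = first['subject'].str.strip()

combined = first.merge(second, on=['trial', 'subject'], how='left')
print(combined)   # the id column now carries over instead of being NaN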

Python pandas issues with .drop and a non-unique index

I have a pandas DataFrame, say df, and I'm trying to drop certain rows by an index. Specifically:
myindex = df[df.column2 != myvalue].index
df.drop(myindex, inplace = True)
This seems to work just fine for most DataFrames, but strange things seem to happen with one DataFrame where I get a non-unique index myindex (I am not quite sure why, since the DataFrame has no duplicate rows). To be more precise, a lot more values get dropped than there are in the index (in the extreme case I actually drop all rows, even though there are several hundred rows where column2 has myvalue). Extracting only unique values (myindex.unique()) and dropping the rows using the unique index doesn't help either. At the same time,
df = df[df.column2 != myvalue]
works just as I'd like it to. I'd rather use the in-place drop, but more importantly I would like to understand why the results are not the same with the direct assignment and with the drop method using the index.
Unfortunately, I cannot provide the data as it cannot be published, and since I am not sure what exactly is wrong, I cannot simulate it either. However, I suspect it probably has something to do with myindex being non-unique (which also confuses me, since there are no duplicate rows in df, but it might very well be that I misunderstand the way the index is created).
If there are repeated values in your index, doing reset_index before might help. That will set your current index as a column and add a new sequential index (with unique values) instead.
df = df.reset_index()
The reason the 2 methods are not the same is that in one case you are passing a series of booleans that represents which rows to keep and which ones to drop (index values are not relevant here). In the case with the drop, you are passing a list of index values, which can map to several positions when the index is not unique.
Finally, to check if your index has duplicates, you shouldn't check for duplicate rows. Simply do:
df.index.has_duplicates
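Here is a tiny sketch of the pitfall with made-up data: the index label 0 appears twice, so dropping by label removes both of those rows even though only one of them satisfies the condition.
import pandas as pd

# Hypothetical frame with a non-unique index: the label 0 appears twice
df = pd.DataFrame({'column2': ['keep', 'drop', 'keep']}, index=[0, 0, 1])
myvalue = 'keep'

myindex = df[df.column2 != myvalue].index   # Index([0]): the label of the 'drop' row

print(df.drop(myindex))             # drops BOTH rows labelled 0, keeps only label 1
print(df[df.column2 == myvalue])    # boolean mask keeps exactly the two 'keep' rows

print(df.index.has_duplicates)      # True
df = df.reset_index()               # old labels become a column, new unique index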

Python Pandas Dataframes comparison on 2 columns (with where clause)

I'm stuck on a particular Python question here. I have 2 dataframes, DF1 and DF2. In both, I have 2 columns, pID and yID (which are not indexed, just default). I'm looking to add a column Found in DF1 indicating where the respective values of the columns (pID and yID) were found in DF2. Also, I would like to consider only the values in DF2 where aID == 'Text'.
I believe the below gets me the 1st part of this question; however, I'm unsure how to incorporate the where clause.
DF1['Found'] = (DF1[['pID', 'yID']] == DF2[['pID','yID']]).all(axis=1).astype(bool)
Suggestions or answers would be most appreciated. Thanks.
You could subset the second dataframe to the rows containing aID == 'Text' to get a reduced DF, and from it select the columns to be compared against the first dataframe.
Use DF.isin() to check whether the values present under these column names match or not. Then .all(axis=1) returns True if both columns are True, and False otherwise. Convert the boolean series to integers via astype(int) and assign the result to the new column, Found.
df1_sub = df1[['pID', 'yID']]
df2_sub = df2.query('aID=="Text"')[['pID', 'yID']]
df1['Found'] = df1_sub.isin(df2_sub).all(axis=1).astype(int)
df1
Demo DF's used:
df1 = pd.DataFrame(dict(pID=[1,2,3,4,5],
                        yID=[10,20,30,40,50]))
df2 = pd.DataFrame(dict(pID=[1,2,8,4,5],
                        yID=[10,12,30,40,50],
                        aID=['Text','Best','Text','Best','Text']))
If it does not matter where those matches occur, then merge the two dataframes on the common 'pID' and 'yID' columns as the key, considering the bigger DF's index (right_index=True) as the index that is kept and aligned after the merge operation is over.
Access these indices, which indicate where matches were found, assign the value 1 to a new column named Found, and fill its missing elements with 0's throughout.
df1.loc[pd.merge(df1_sub, df2_sub, on=['pID', 'yID'], right_index=True).index, 'Found'] = 1
df1['Found'].fillna(0, inplace=True)
df1 should be modified accordingly after the above steps.
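As an alternative sketch that avoids mixing on= with right_index= (not part of the original answer; it assumes pandas 0.24+ for MultiIndex.from_frame and reuses the demo frames above):
import pandas as pd

# Build a (pID, yID) key for each frame and test membership directly,
# so the row positions in df2 do not matter.
key1 = pd.MultiIndex.from_frame(df1[['pID', 'yID']])
key2 = pd.MultiIndex.from_frame(df2.query('aID=="Text"')[['pID', 'yID']])
df1['Found'] = key1.isin(key2).astype(int)
print(df1)   # Found is [1, 0, 0, 0, 1] for the demo data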
