How to multiply matching rows from two dataframes - python

I have two dataframes that look like this:
df1:
df2:
I want to multiply the Ratio column of df1 with the columns Total, Hombres, and Mujeres of df2 wherever the Estado column in df1 matches the Entidad Federativa column in df2; when the rows stop matching, it should move on to the next row of df1 and do the same for its matching rows. Does anyone have any idea how to do this? I would appreciate it a lot.

Use DataFrame.div with level=0 to match the first level of the MultiIndex in df2 against the index values of Estado in df1:
# unify the state name spelling between the two frames
df1 = df1.rename(index={'Aguascal':'Aguascalientes'})
# if necessary
# df1 = df1.set_index('Estado')
# strip thousands separators, cast to int, and index by state and age group
df2 = (df2.replace(',', '', regex=True)
          .astype(int)
          .set_index(['Entidad Federativa', 'Grupo quinquenal de edad']))
df = df2.div(df1['Ratio'], level=0, axis=0)
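For reference, a minimal runnable sketch of the same pattern on made-up data (the state names and numbers below are placeholders; DataFrame.mul works the same way if multiplication rather than division is what you need):
import pandas as pd

# hypothetical stand-ins for the question's data
df1 = pd.DataFrame({'Ratio': [0.5, 2.0]},
                   index=pd.Index(['Aguascalientes', 'Baja California'], name='Estado'))
mux = pd.MultiIndex.from_product(
    [['Aguascalientes', 'Baja California'], ['0-4', '5-9']],
    names=['Entidad Federativa', 'Grupo quinquenal de edad'])
df2 = pd.DataFrame({'Total': [10, 20, 30, 40],
                    'Hombres': [4, 8, 12, 16],
                    'Mujeres': [6, 12, 18, 24]}, index=mux)

# each df2 row is divided by the Ratio of the state it belongs to
print(df2.div(df1['Ratio'], level=0, axis=0))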

Related

How to drop columns, do operation on remaining columns, then insert back the dropped columns?

I have a large dataset. I want to apply an operation to all the columns except for 2.
I dropped the 2 columns into a separate dataframe, then tried merging the dataframes back after the operation was applied.
I tried appending, merging, and joining the two dataframes, but they all created duplicate rows: appending doubled the row count and changed the dropped columns.
I just want to add the 2 columns back to the initial dataframe unchanged. Any help?
df = col1  col2  col3 ... col100
     1     2     3        100
df2=df.loc[:,['col2', 'col3']]
df.drop(columns=['col2', 'col3'], inplace=True)
Then I do what I need to do to df. Now I want to merge df and df2 back together.
Like this:
cols = ['col2', 'col3']
df2 = df[cols]
df.drop(columns=cols, inplace=True)
# do something
df = pd.concat([df, df2], axis=1)
This will work as long as you don't remove rows from either dataframe or change their order, since concat with axis=1 aligns on the index.
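A compact, self-contained version of the whole round trip (the multiplication below is just a placeholder for the real operation):
import pandas as pd

df = pd.DataFrame({'col1': [1, 10], 'col2': [2, 20], 'col3': [3, 30]})

cols = ['col2', 'col3']
df2 = df[cols]               # set the two columns aside
df = df.drop(columns=cols)

df = df * 100                # stand-in for the real operation

# axis=1 concat aligns on the index, so the rows match up again
df = pd.concat([df, df2], axis=1)
print(df)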

How to get rows from a dataframe that are not joined when merging with other dataframe in pandas

I am trying to make 2 new dataframes by using 2 given dataframe objects:

DF1 =
id  feature_text      length
1   "example text"    12
2   "example text2"   13
...

DF2 =
id  case_num
3   0
...
As you can see, both df1 and df2 have a column called "id". However, df1 has all id values, while df2 has only some of them: df1 has 3200 rows, each with a unique id value (1~3200), whereas df2 has only a subset (i.e. id=[3,7,20,...]).
What I want to do is 1) get a merged dataframe which contains all rows whose id values appear in both df1 and df2, and 2) get a dataframe containing the rows of df1 whose id values are not included in df2.
I was able to find a solution for 1), but have no idea how to do 2).
Thanks.
For the first case, you could use an inner merge:
out = df1.merge(df2, on='id')
For the second case, you could use isin with the negation operator ~, so that we filter out the rows in df1 whose ids also exist in df2:
out = df1[~df1['id'].isin(df2['id'])]
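Both cases side by side, on small made-up frames (the values are placeholders):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'feature_text': ['example text', 'example text2', 'example text3'],
                    'length': [12, 13, 13]})
df2 = pd.DataFrame({'id': [3], 'case_num': [0]})

matched = df1.merge(df2, on='id')             # 1) ids present in both
unmatched = df1[~df1['id'].isin(df2['id'])]   # 2) df1 rows whose id is missing from df2
print(matched)
print(unmatched)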

Pandas merge two data frame only to first occurrence

I have two dataframes, and I am able to merge them with pd.merge(df1, df2, on='column_name'), but I only want to merge on the first occurrence in df1. It's a many-to-one merge, and I only want the first occurrence merged. Any pointers or solutions? Thanks in advance!
Since you want to merge two dataframes of different lengths, you'll have to have NaN values in the merged dataframe cells where there is no corresponding index in df2. So let's try this: merge left, which duplicates df2's values for duplicated column_name rows in df1; then have a mask ready to filter those rows and assign NaN to the df2 columns for them.
import numpy as np

mask = df1['column_name'].duplicated()
new_df = df1.merge(df2, how='left', on='column_name')
new_df.loc[mask, df2.columns[df2.columns != 'column_name']] = np.nan
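For example, with a duplicated key (toy frames; the column names and values are just placeholders):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'column_name': ['x', 'x', 'y'], 'val': [1, 2, 3]})
df2 = pd.DataFrame({'column_name': ['x', 'y'], 'extra': [10, 20]})

mask = df1['column_name'].duplicated()
new_df = df1.merge(df2, how='left', on='column_name')
new_df.loc[mask, df2.columns[df2.columns != 'column_name']] = np.nan
print(new_df)  # 'extra' is filled on the first 'x' row only; the duplicate gets NaN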

Filter pandas dataframe columns based on other dataframe

I have two dataframes, df1 and df2. df1 gives some numerical data on some elements (A, B, C, ...), while df2 acts as a classification table whose index holds the column names of df1. I would like to filter df1, keeping only the columns that match a certain classification in df2.
For instance, let's assume the following two dataframes, and that I only want to keep the elements (i.e. columns of df1) that belong to class 'C1':
df1 = pd.DataFrame({'A': [1,2], 'B': [3,4], 'C': [5,6]}, index=[0, 1])
df2 = pd.DataFrame({'Name': ['A','B','C'],
                    'Class': ['C1','C1','C2'],
                    'Subclass': ['C11','C12','C21']}, index=[0, 1, 2])
df2 = df2.set_index('Name')
The expected result should be the dataframe df1 with only columns A and B, because in df2 we can see that A and B are in class C1. I'm not sure how to do that. I was thinking about first filtering df2 by 'C1' values in its 'Class' column and then checking whether df1.columns are in df2.index, but I suppose there is a more efficient way to do it. Thanks for your help.
Here is one way, using an index slice:
df1.loc[:,df2.index[df2.Class=='C1']]
Out[578]:
Name  A  B
0     1  3
1     2  4

Pandas Merge a Grouped-by dataframe with another dataframe for each group

I have a dataframe like:

id  date        temperature
1   2011-09-12  12
    2011-09-15  12
    2011-10-13  12
2   2011-12-12  14
    2011-12-24  15
I want to make sure that each device id has a temperature recording for each day: if the value exists it will be copied from above; if it doesn't, I will put 0.
So, I prepare another dataframe which has dates for the entire year, using:
pd.DataFrame(0, index=pd.date_range('2011-01-01', '2011-12-12'), columns=['temperature'])
date        temperature
2011-01-01  0
...
2011-12-12  0
Now, for each id, I want to merge this dataframe so that I have the entire year's entries for each id.
I am stuck at the merge step; just merging on the date column does not work, i.e.
pd.merge(df1, df2, on=['date'])
gives a blank dataframe.
As an alternative to jezrael's answer, you could also do the following iteration, especially if you want to keep your device id intact:
data={"date":[pd.Timestamp('2011-09-12'), pd.Timestamp('2011-09-15'), pd.Timestamp('2011-10-13'),pd.Timestamp('2011-12-12'),pd.Timestamp('2011-12-24')],"temperature":[12,12,12,14,15],"sensor_id":[1,1,1,2,2]}
df1=pd.DataFrame(data,index=data["sensor_id"])
df2=pd.DataFrame(0, index=pd.date_range('2011-01-01', '2011-12-12'), columns=['temperature','sensor_id'])
for i,row in df1.iterrows():
df2.loc[df2.index==row["date"], ['temperature']] = row['temperature']
df2.loc[df2.index==row["date"], ['sensor_id']] = row['sensor_id']
for t in data["date"]:
print(df2[df2.index==t])
Note that df2 in your question only goes to 2011-12-12, hence the last print() will return an empty DataFrame. I wasn't sure whether you did this on purpose.
Also, depending on the variability and density of your actual data, it might make sense to use:
for s in [1, 2]:  # iterate over device ids
    ma = (df['sensor_id'] == s)
    df.loc[ma] = df.loc[ma].ffill()  # fill forward
hence an incomplete time series would be filled (forward) with the last measured temperature value. It depends on the quality of your data, of course, and df.resample() might make more sense, as sketched below.
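A possible resample-based sketch, assuming the df1 built above (note this only fills the span between each sensor's first and last reading, not the whole year):
# group per sensor, regularize each group to daily frequency, forward-fill gaps
daily = (df1.set_index('date')
            .groupby('sensor_id')['temperature']
            .resample('D')
            .ffill())
print(daily)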
Create a MultiIndex with MultiIndex.from_product and merge by both MultiIndexes:
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  pd.date_range('2011-01-01', '2011-12-12')],
                                 names=['id','date'])
df1 = pd.DataFrame(0, index=mux, columns=['temperature'])
df = pd.merge(df1, df, left_index=True, right_index=True, how='left')
If you want only one temperature column:
df = pd.merge(df1, df, left_index=True, right_index=True, how='left', suffixes=('','_'))
df['temperature'] = df.pop('temperature_').fillna(df['temperature'])
Another idea is to use itertools.product to build a 2-column DataFrame:
from itertools import product
data = list(product(df.index.levels[0], pd.date_range('2011-01-01', '2011-12-12')))
df1 = pd.DataFrame(data, columns=['id','date'])
df = pd.merge(df1, df, left_on=['id','date'], right_index=True, how='left')
Another idea is to use DataFrame.reindex:
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  pd.date_range('2011-01-01', '2011-12-12')],
                                 names=['id','date'])
df = df.reindex(mux, fill_value=0)
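For completeness, here is the reindex idea end to end on toy data matching the question (a minimal sketch; the values are placeholders):
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 2, 2],
     pd.to_datetime(['2011-09-12', '2011-09-15', '2011-10-13',
                     '2011-12-12', '2011-12-24'])],
    names=['id', 'date'])
df = pd.DataFrame({'temperature': [12, 12, 12, 14, 15]}, index=idx)

mux = pd.MultiIndex.from_product(
    [df.index.levels[0], pd.date_range('2011-01-01', '2011-12-12')],
    names=['id', 'date'])
out = df.reindex(mux, fill_value=0)
print(out.loc[1].loc['2011-09-11':'2011-09-16'])  # zeros except on recorded days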
