I would like to know how I can update two DataFrames, df1 and df2, from another DataFrame, df3. All of this happens inside a for loop that iterates over every row of df3:
for i in range(len(df3)):
    df1.p_mw = ...
    df2.p_mw = ...
The initial DataFrames df1 and df2 are as follows:
df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
                    ['GH_2', 20, 'Hidro'],
                    ['GH_3', 30, 'Hidro']],
                   columns=['name', 'p_mw', 'type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
                    ['GT_2', 50, 'Termo'],
                    ['GF_1', 10, 'Fict']],
                   columns=['name', 'p_mw', 'type'])
The DataFrame from which I want to update the data is:
df3 = pd.DataFrame([[150, 57, 110, 20, 10],
                    [120, 66, 110, 20, 0],
                    [90, 40, 105, 20, 0],
                    [60, 40, 90, 20, 0]],
                   columns=['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])
As you can see, the DataFrame df3 contains the data for the p_mw column of both df1 and df2, keyed by name. Furthermore, df2 has an element named GF_1 for which there is no update; it should remain unchanged.
After updating for the last iteration, the desired output is the following:
df1 = pd.DataFrame([['GH_1', 60, 'Hidro'],
                    ['GH_2', 40, 'Hidro'],
                    ['GH_3', 90, 'Hidro']],
                   columns=['name', 'p_mw', 'type'])
df2 = pd.DataFrame([['GT_1', 20, 'Termo'],
                    ['GT_2', 0, 'Termo'],
                    ['GF_1', 10, 'Fict']],
                   columns=['name', 'p_mw', 'type'])
Create a mapping Series by selecting the last row of df3, then map it onto the name column and fill the NaN values (names with no update in df3) from the existing p_mw column:
s = df3.iloc[-1]
df1['p_mw'] = df1['name'].map(s).fillna(df1['p_mw'])
df2['p_mw'] = df2['name'].map(s).fillna(df2['p_mw'])
If there are multiple DataFrames that need to be updated, we can use a for loop to avoid repeating code:
for df in (df1, df2):
    df['p_mw'] = df['name'].map(s).fillna(df['p_mw'])
>>> df1
name p_mw type
0 GH_1 60 Hidro
1 GH_2 40 Hidro
2 GH_3 90 Hidro
>>> df2
name p_mw type
0 GT_1 20.0 Termo
1 GT_2 0.0 Termo
2 GF_1 10.0 Fict
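Note that df2's p_mw column comes back as float: map returns NaN for GF_1, and fillna then leaves the column as float64. If the column should stay integer, a minimal sketch (assuming every updated value is a whole number):
for df in (df1, df2):
    df['p_mw'] = df['name'].map(s).fillna(df['p_mw']).astype(int)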
This should do as you ask. No need for a for loop.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
                    ['GH_2', 20, 'Hidro'],
                    ['GH_3', 30, 'Hidro']],
                   columns=['name', 'p_mw', 'type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
                    ['GT_2', 50, 'Termo'],
                    ['GF_1', 10, 'Fict']],
                   columns=['name', 'p_mw', 'type'])
df3 = pd.DataFrame([[150, 57, 110, 20, 10],
                    [120, 66, 110, 20, 0],
                    [90, 40, 105, 20, 0],
                    [60, 40, 90, 20, 0]],
                   columns=['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])

# take the last row of df3 as the update values
updates = df3.iloc[-1].values
df1["p_mw"] = updates[:3]
# append GF_1's existing value, since df3 carries no update for it
df2["p_mw"] = np.append(updates[3:], df2["p_mw"].iloc[-1])
I have two DataFrames and would like to create a new column in DataFrame 1 based on DataFrame 2's values.
But I don't want to join the two DataFrames per se and make one big DataFrame; rather, I want to use the second DataFrame simply as a look-up.
#Main Dataframe:
df1 = pd.DataFrame({'Size':["Big", "Medium", "Small"], 'Sold_Quantity':[10, 6, 40]})
#Lookup Dataframe
df2 = pd.DataFrame({'Size':["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean':[10, 20, 30]})
#Create column in DataFrame 1 based on lookup DataFrame values (SQL-style pseudocode):
df1['New_Column'] = when df1['Size'] = df2['Size'] and df1['Sold_Quantity'] < df2['Sold_Quantiy_Score_Mean'] then 'Below Average Sales' else 'Above Average Sales!' end
One approach is to use np.where:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantity': [10, 6, 40]})
df2 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean': [10, 20, 30]})
condition = (df1['Size'] == df2['Size']) & (df1['Sold_Quantity'] < df2['Sold_Quantiy_Score_Mean'])
df1['New_Column'] = np.where(condition, 'Below Average Sales', 'Above Average Sales!')
print(df1)
Output
Size Sold_Quantity New_Column
0 Big 10 Above Average Sales!
1 Medium 6 Below Average Sales
2 Small 40 Above Average Sales!
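Note that the element-wise comparison df1['Size'] == df2['Size'] only works because both DataFrames happen to list the sizes in the same row order. A variant that matches on Size instead of on position, via a left merge, might look like this (a sketch reusing df1 and df2 as defined above):
import numpy as np

# align df2's mean scores with df1's rows by Size, not by position
merged = df1.merge(df2, on='Size', how='left')
df1['New_Column'] = np.where(
    merged['Sold_Quantity'] < merged['Sold_Quantiy_Score_Mean'],
    'Below Average Sales', 'Above Average Sales!')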
Given that df2 is essentially a lookup table keyed on Size, it would make sense for Size to be its index:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantity': [10, 6, 40]})
df2 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean': [10, 20, 30]})
lookup = df2.set_index("Size")
You can then map the Sizes in df1 to their mean and compare each with the sold quantity:
is_below_mean = df1["Sold_Quantity"] < df1["Size"].map(lookup["Sold_Quantiy_Score_Mean"])
and finally map the boolean values to the respective strings using np.where
df1["New_Column"] = np.where(is_below_mean, 'Below Average Sales', 'Above Average Sales!')
df1:
Size Sold_Quantity New_Column
0 Big 10 Above Average Sales!
1 Medium 6 Below Average Sales
2 Small 40 Above Average Sales!
I have two dataframes:
df1:
x1  y1  x2    y2    label
0   0   1240  1755  label1
0   0   1240  2     label2
df2:
x1      y1     x2      y2     text
992.0   943.0  1166.0  974.0  text1
1110.0  864.0  1166.0  890.0  text2
Based on a condition like the following:
if df1['x1'] >= df2['x1'] or df1['y1'] >= df2['y1']:
    # I want to add a new column 'text' in df1 with the text from df2.
    df1['text'] = df2['text']
What's more, it is possible in df2 to have more than one row that makes the above-mentioned condition True, so I will need to add another if statement for df2 to get the best match.
My problem here is not the conditions but how I am supposed to approach the interaction between the two data frames. Any help or advice would be appreciated.
If you want to check each row of df1 against every row of df2 and return a match, you can do it with the .apply() function on df1, using df2 as a lookup table.
NOTE: In the example below I return only the first match (using .iloc[0]), not all matches.
Create two dummy dataframes
import pandas as pd
df1 = pd.DataFrame({'x1': [1, 2, 3], 'y1': [1, 5, 6]})
df2 = pd.DataFrame({'x1': [11, 1, 13], 'y1': [3, 52, 26], 'text': ['text1', 'text2', 'text3']})
Create a lookup function
def apply_condition(row, df):
    condition = ((row['x1'] >= df['x1']) | (row['y1'] >= df['y1']))
    return df[condition]['text'].iloc[0]  # ATTENTION: only the first match is returned
Create new column and print results
df1['text'] = df1.apply(lambda row: apply_condition(row, df2), axis=1)
df1.head()
Result:
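   x1  y1   text
0   1   1  text2
1   2   5  text1
2   3   6  text1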
My integers become NaNs when I add the index to the DataFrame.
I run this:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
and I get this:
guavas pears avocados
0 10 111 200
1 20 222 3000
guavas pears avocados
Store
Thriftway NaN NaN NaN
Meijer NaN NaN NaN
The "old" newDF has index [0, 1] while the "new" newDF has index ['Thriftway', 'Meijer']. When using the DataFrame-constructor with a DataFrame, i.e. pd.DataFrame(newDF, index=['Thriftway', 'Meijer']), pandas internally does a reindex with the list in the index-argument on the index of newDF.
Values in the new index that do not have corresponding records in the DataFrame are assigned NaN. The index [0, 1] and the index ['Thriftway', 'Meijer'] have no overlapping values thus result is a DataFrame with NaN as values.
To appreciate this try running the following:
import pandas as pd
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print (newDF)
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer', 0, 1])
newDF.index.name = 'Store'
print(newDF)
and notice that the new DataFrame now contains the old data. To achieve what you want you can instead reindex the existing DataFrame with the new index like so:
import pandas as pd
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
print(newDF)
newDF = newDF.reindex(['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
You can even reproduce what pandas is doing internally by using the index-argument of reindex:
newDF.reindex(index=['Thriftway', 'Meijer'])
The result is, as before, a DataFrame where labels that were not in the DataFrame before have been assigned NaN:
guavas pears avocados
Thriftway NaN NaN NaN
Meijer NaN NaN NaN
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
In the line above, you are passing both a DataFrame and an index to pd.DataFrame().
From the source code of pandas.DataFrame(), here is the relevant part, under the assumption that data is a DataFrame:
def __init__(
    self,
    data=None,
    index: Optional[Axes] = None,
    columns: Optional[Axes] = None,
    dtype: Optional[Dtype] = None,
    copy: bool = False,
):
    if isinstance(data, BlockManager):
        if index is None and columns is None and dtype is None and copy is False:
            # GH#33357 fastpath
            NDFrame.__init__(self, data)
            return

        mgr = self._init_mgr(
            data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
        )
If index is given and its labels do not overlap the existing index, pandas.DataFrame() creates a DataFrame with the same columns as the passed one, with every cell filled with NaN.
If index is not given, it creates a DataFrame identical to the passed one, with the same index, columns, and data.
As far as I understand, you want to set the index in your DataFrame to something other than 0, 1. However,
newDF = pd.DataFrame(newDF, index=['Thriftway', 'Meijer'])
this will actually reindex newDF by the given index (['Thriftway', 'Meijer']). And since newDF currently doesn't have any values for those two index labels, the columns are filled with NaN for them.
Two possible solutions for setting up your custom index:
You can specify the index when you create the DataFrame:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows, index=['Thriftway', 'Meijer'])
newDF.index.name = 'Store'
print(newDF)
Or you can use set_index afterwards:
newRows = {'guavas': [10, 20],
'pears': [111,222],
'avocados':[200,3000]}
newDF = pd.DataFrame(newRows)
newDF = newDF.set_index(pd.Index(['Thriftway', 'Meijer']))
newDF.index.name = 'Store'
print(newDF)
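Both variants produce the same result:
           guavas  pears  avocados
Store
Thriftway      10    111       200
Meijer         20    222      3000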
I have two data frames, say df1 and df2; each has two columns, ['Name', 'Marks'].
I want to find the difference between the two DataFrames for corresponding Name values.
Eg:
df1 = pd.DataFrame([["Shivi",70],["Alex",40]],columns=['Names', 'Value'])
df2 = pd.DataFrame([["Shivi",40],["Andrew",40]],columns=['Names', 'Value'])
For df1-df2 I want
pd.DataFrame([["Shivi",30],["Alex",40],["Andrew",40]],columns=['Names', 'Value'])
You can use:
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
So a complete program will look like this:
import pandas as pd
data1 = {'Name': ["Ashley", "Tom"], 'Marks': [40, 50]}
data2 = {'Name': ["Ashley", "Stan"], 'Marks': [80, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
print(diff)
Output:
Marks
Name
Ashley -40.0
Stan -90.0
Tom 50.0
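Note that subtract gives a signed difference: with the question's own data, Andrew would come out as -40.0 (0 - 40) rather than the desired 40. If the absolute difference is what's wanted, a minimal sketch using the question's column names:
diff = (df1.set_index("Names")
           .subtract(df2.set_index("Names"), fill_value=0)
           .abs()
           .reset_index())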
I want to order a DataFrame by multiple regex. That is to say, for example in this DataFrame
df = pd.DataFrame({'Col1': [20, 30],
'Col2': [50, 60],
'Pol2': [50, 60]})
get the columns beginning with P before the ones beginning with C.
I've discovered that you can filter with one regex, like
df.filter(regex = "P*")
but I don't see how to combine several regexes in one filter.
UPDATE:
I want to do that in one instruction, I'm already able to use a list of regex and concatenate the columns in another DataFrame.
I believe you need a list of DataFrames, each filtered by one regex from the list, joined together with concat:
reg = ['^P','^C']
df1 = pd.concat([df.filter(regex = r) for r in reg], axis=1)
print (df1)
Pol2 Col1 Col2
0 50 20 50
1 60 30 60
You can just re-order the columns by regular assignment: export the columns to a sorted list and index by it. Try:
import pandas as pd
df = pd.DataFrame({'Col1': [20, 30],
'Pol2': [50, 60],
'Col2': [50, 60],
})
df = df[sorted(df.columns.to_list(), key=lambda col: col.startswith("P"), reverse=True)]
print(df)
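Since Python's sort is stable, the non-P columns keep their original relative order, and this prints:
   Pol2  Col1  Col2
0    50    20    50
1    60    30    60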