How to join and explode dataframes by multiple IDs using Pandas - python

I have two dataframes and I would like to perform a join with multiple IDs.
In df1 I have the column KeyWordGroupID with multiple IDs. These IDs can also be found in df2.
If there is a match, the KeyWordGroupName values of df2 should become new columns in the result dataframe, each containing the corresponding KeyWords values.
import pandas as pd

# initialize list of lists
data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
df1
Output:
RuleSetID RuleSetName KeyWordGroupID
0 Standard1 [100, 101, 102]
1 Standard2 [100, 102]
2 Standard3 [103]
The second dataframe is:
# initialize list of lists
data = [[100, 'verfahren', ['word1', 'word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns = ['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
df2
Output:
KeyWordGroupID KeyWordGroupName KeyWords
100 verfahren [word1, word2]
101 flaechen [word3]
102 nutzung [word4, word5]
103 ort [word6, word7]
The desired output is:
RuleSetID RuleSetName KeyWordGroupID verfahren flaechen nutzung ort
0 Standard1 [100, 101, 102] [word1, word2] [word3] [word4, word5] None
1 Standard2 [100, 102] [word1, word2] None [word4, word5] None
2 Standard3 [103] None None None [word6, word7]
Any hint on how to perform a join like this is highly appreciated.

This one is a little tricky, but here's one approach. It takes advantage of explode to make the merge possible, and pivot, which is ultimately what this is. Then, to get rid of the empty lists, it uses applymap.
import numpy as np
import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])

data = [[100, 'verfahren', ['word1', 'word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4', 'word5']],
        [103, 'ort', ['word6', 'word7']]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns = ['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
(
    df1.explode('KeyWordGroupID')                               # one row per (rule set, group ID)
       .merge(df2, on='KeyWordGroupID')                         # attach group names and keywords
       .pivot(index=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'],
              columns='KeyWordGroupName', values='KeyWords')    # group names become columns
       .reset_index()
       .groupby(['RuleSetID', 'RuleSetName'])
       .agg(lambda x: list(x) if x.name == 'KeyWordGroupID' else x.dropna())
       .applymap(lambda x: np.nan if len(x) == 0 else x)       # empty lists -> NaN
       .reset_index()
)
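Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of the element-wise DataFrame.map, so on newer versions the applymap step can be written as .map(lambda x: np.nan if len(x) == 0 else x).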

Related

How to compare coordinates in two dataframes?

I have two dataframes
df1:
   x1  y1    x2    y2   label
0   0   0  1240  1755  label1
1   0   0  1240     2  label2
df2:
       x1     y1      x2     y2   text
0   992.0  943.0  1166.0  974.0  text1
1  1110.0  864.0  1166.0  890.0  text2
Based on a condition like the following:
if df1['x1'] >= df2['x1'] or df1['y1'] >= df2['y1']:
# I want to add a new column 'text' in df1 with the text from df2.
df1['text'] = df2['text']
What's more, it is possible for df2 to have more than one row that makes the above-mentioned condition True, so I will need another condition on df2 to pick the best match.
My problem here is not the conditions but how am I supposed to approach the interaction between both data frames. Any help, or advice would be appreciated.
If you want to check every row of df1 against every row of df2 and return a match, you can do it with the .apply() function on df1, using df2 as a lookup table.
NOTE: In the example below I return only the first match (by using .iloc[0]), not all the matches.
Create two dummy dataframes
import pandas as pd
df1 = pd.DataFrame({'x1': [1, 2, 3], 'y1': [1, 5, 6]})
df2 = pd.DataFrame({'x1': [11, 1, 13], 'y1': [3, 52, 26], 'text': ['text1', 'text2', 'text3']})
Create a lookup function
def apply_condition(row, df):
    condition = ((row['x1'] >= df['x1']) | (row['y1'] >= df['y1']))
    return df[condition]['text'].iloc[0]  # ATTENTION: only the first match is returned
Create new column and print results
df1['text'] = df1.apply(lambda row: apply_condition(row, df2), axis=1)
df1.head()
Result:
   x1  y1   text
0   1   1  text2
1   2   5  text1
2   3   6  text1
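If df2 is large, the row-wise apply can be slow. As a minimal vectorized sketch (assuming pandas >= 1.2 for how='cross', and the same dummy frames as above), a cross join can evaluate the condition for all row pairs at once, again keeping only the first match per row:
# Cross-join (all row pairs), keep the pairs that meet the condition,
# then take the first df2 match for each original df1 row.
pairs = df1[['x1', 'y1']].reset_index().merge(df2, how='cross', suffixes=('', '_df2'))
hits = pairs[(pairs['x1'] >= pairs['x1_df2']) | (pairs['y1'] >= pairs['y1_df2'])]
df1['text'] = hits.groupby('index')['text'].first()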

Function not working when looping through a list of dataframes

I have a list of dataframes that I want to loop through and perform the same actions on. The dataframes have the same format. I used a function and a loop like you see in the code below, but it seems that the only change that is passed through is the renaming of the columns. Am I missing something here?
def changes(df):
    df = df[["A","B","C"]]
    df = df/1000000
    df["A"] = df["A"]*1000000
    df.rename(columns={'A': 'A1', 'B': 'B1','C': 'C1'}, inplace=True)
    df["A"] = df["A"].astype(int)
    df = df.transpose()
    return df

dfs = [df1,df2,df3]
for i in dfs:
    i = changes(i)
Use enumerate in your loop:
# Setup
import pandas as pd

df1 = pd.DataFrame({'A': [10, 20, 30], 'B': [11, 21, 31], 'C': [12, 22, 32]})
df2 = pd.DataFrame({'A': [110, 120, 130], 'B': [111, 121, 131], 'C': [112, 122, 132]})
df3 = pd.DataFrame({'A': [210, 220, 230], 'B': [211, 221, 231], 'C': [212, 222, 232]})
dfs = [df1, df2, df3]

def changes(df):
    df = df[["A","B","C"]]
    df = df/1000000
    df["A"] = df["A"]*1000000
    df = df.rename(columns={'A': 'A1', 'B': 'B1','C': 'C1'})  # <- Don't use inplace
    df["A1"] = df["A1"].astype(int)  # <- A does not exist anymore
    df = df.transpose()
    return df

for i, df in enumerate(dfs):
    dfs[i] = changes(df)
Output:
>>> dfs
[ 0 1 2
A1 10.000000 20.000000 30.000000
B1 0.000011 0.000021 0.000031
C1 0.000012 0.000022 0.000032,
0 1 2
A1 110.000000 120.000000 130.000000
B1 0.000111 0.000121 0.000131
C1 0.000112 0.000122 0.000132,
0 1 2
A1 210.000000 220.000000 230.000000
B1 0.000211 0.000221 0.000231
C1 0.000212 0.000222 0.000232]
The problem is that you are assigning the modified dataframe to i, which is the loop variable, so the result is not stored anywhere. You could solve this by creating a new list of dataframes with the desired output, using a list comprehension to avoid for loops. For example:
dfs = [df1,df2,df3]
new_dfs = [changes(i) for i in dfs]
Edit:
You can simply reassign them with:
df1,df2,df3 = [changes(i) for i in dfs]
Alternatively, with an OOP approach:
class Changes:
    def __init__(self, df):
        self.df = df

    def transform(self):
        df = self.df[["A","B","C"]]
        df = df/1000000
        df["A"] = df["A"]*1000000
        df = df.rename(columns={'A': 'A1', 'B': 'B1','C': 'C1'})
        df["A1"] = df["A1"].astype(int)
        return df.transpose()

obj = Changes(df)
df = obj.transform()
Now you can iterate through your list of dataframes.
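For example, a small sketch reusing the list from above:
# Apply the class-based transform to every dataframe in the list
dfs = [df1, df2, df3]
dfs = [Changes(df).transform() for df in dfs]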

How to update several DataFrames values with another DataFrame values?

I would like to know how I can update two DataFrames, df1 and df2, from another DataFrame df3. All of this is done within a for loop that iterates over all the rows of the DataFrame df3:
for i in range(len(df3)):
    df1.p_mw = ...
    df2.p_mw = ...
The initial DataFrames df1 and df2 are as follows:
df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
['GH_2', 20, 'Hidro'],
['GH_3', 30, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
['GT_2', 50, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
The DataFrame from which I want to update the data is:
df3 = pd.DataFrame([[150,57,110,20,10],
[120,66,110,20,0],
[90,40,105,20,0],
[60,40,90,20,0]],
columns= ['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])
As you can see the DataFrame df3 contains data from the corresponding column p_mw for both DataFrames df1 and df2. Furthermore, the DataFrame df2 has an element named GF_1 for which there is no update and should remain the same.
After updating for the last iteration, the desired output is the following:
df1 = pd.DataFrame([['GH_1', 60, 'Hidro'],
['GH_2', 40, 'Hidro'],
['GH_3', 90, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 20, 'Termo'],
['GT_2', 0, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
Create a mapping series by selecting the last row from df3, then map it onto the name column and fill the NaN values (names that have no update) from the existing p_mw column:
s = df3.iloc[-1]
df1['p_mw'] = df1['name'].map(s).fillna(df1['p_mw'])
df2['p_mw'] = df2['name'].map(s).fillna(df2['p_mw'])
If there are multiple dataframes that need to be updated, we can use a for loop to avoid repeating the code:
for df in (df1, df2):
    df['p_mw'] = df['name'].map(s).fillna(df['p_mw'])
>>> df1
name p_mw type
0 GH_1 60 Hidro
1 GH_2 40 Hidro
2 GH_3 90 Hidro
>>> df2
name p_mw type
0 GT_1 20.0 Termo
1 GT_2 0.0 Termo
2 GF_1 10.0 Fict
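If the update really has to happen inside a loop over every row of df3, as in the question's pseudocode, the same mapping can be applied per iteration; a minimal sketch:
# Apply each row of df3 in turn; names absent from df3 keep their p_mw
for i in range(len(df3)):
    s = df3.iloc[i]
    for df in (df1, df2):
        df['p_mw'] = df['name'].map(s).fillna(df['p_mw'])
    # ... per-iteration work with the updated df1 and df2 goes here ...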
This should do as you ask. No need for a for loop.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
                    ['GH_2', 20, 'Hidro'],
                    ['GH_3', 30, 'Hidro']],
                   columns=['name', 'p_mw', 'type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
                    ['GT_2', 50, 'Termo'],
                    ['GF_1', 10, 'Fict']],
                   columns=['name', 'p_mw', 'type'])
df3 = pd.DataFrame([[150, 57, 110, 20, 10],
                    [120, 66, 110, 20, 0],
                    [90, 40, 105, 20, 0],
                    [60, 40, 90, 20, 0]],
                   columns=['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])

# The last row of df3 holds the final updates
updates = df3.iloc[-1].values
df1["p_mw"] = updates[:3]
# GF_1 has no update, so keep its current value
df2["p_mw"] = np.append(updates[3:], df2["p_mw"].iloc[-1])

How to delete rows from a DF until an empty row is found (Python)

I have built up a dictionary of dataframes containing similar data imported from many excel sheets. However, the data are a bit messy and for each DF I have a header of data I need to remove (like metadata). The problem is that this header of useless data is not always the same in terms of length so I cannot use always the same number of rows to drop while looping through these DFs.
The only common thing across all the DFs is that between this messy data and the data I need(tabular data) there is an empty excel row.
So my idea was to loop through all the DFs in this dictionary and, starting from the first row, drop rows until an empty row is met. Once the empty row is met, I drop it as well and exit the loop. I hope it is clear. Any help would be more than appreciated. BR Luigi
In pandas, empty values are represented with np.nan. For a single dataframe, you can use pd.isnull with all(axis=1) to find a whole empty row. Then, you can use idxmax to get the first row where that is true (if you have more than one empty row, you'll want the first one, right?), and then slice with iloc to get the "rest". Like so:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cola': [100, 99, 98, np.nan, 96, np.nan],
    'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
print(df)
cola colb
0 100.0 1.0
1 99.0 NaN
2 98.0 3.0
3 NaN NaN <- This is the row we want
4 96.0 5.0
5 NaN NaN <- Not this one
rest = df.iloc[pd.isnull(df).all(axis=1).idxmax() + 1:, :]
print(rest)
cola colb
4 96.0 5.0
5 NaN NaN
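A side note: idxmax returns an index label while iloc slices by position, so the one-liner above works as written for the default RangeIndex. With a non-default index, a purely positional variant is safer; a minimal sketch:
# Position (not label) of the first all-NaN row
first_empty = pd.isnull(df).all(axis=1).to_numpy().argmax()
rest = df.iloc[first_empty + 1:]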
In terms of doing it to multiple dataframes in a dictionary, you can simply iterate over them and repeat the previous method.
# Sample data
df1 = pd.DataFrame({
'cola': [100, 99, 98, np.nan, 96, np.nan],
'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
df2 = pd.DataFrame({
'cola': [100, 99, 98, np.nan, 96, np.nan],
'colb': [1, np.nan, 3, np.nan, 5, np.nan]
})
dct = {'first': df1, 'second': df2}
# Solution
out_dict = {}
for key, frame in dct.items():
new_frame = frame.iloc[pd.isnull(frame).all(axis=1).idxmax() + 1:, :].reset_index(drop=True)
out_dict[key] = new_frame
out_dict now contains your desired dataframes.

Order DataFrame columns by multiple regex

I want to order a DataFrame's columns by multiple regexes. That is to say, for example, in this DataFrame
df = pd.DataFrame({'Col1': [20, 30],
'Col2': [50, 60],
'Pol2': [50, 60]})
get the columns beginning with P before the ones beginning with C.
I've discovered that you can filter with one regex like
df.filter(regex = "P*")
but I can't do that with more than one pattern.
UPDATE:
I want to do that in one instruction, I'm already able to use a list of regex and concatenate the columns in another DataFrame.
I believe you need a list of DataFrames, each filtered by one regex from the list, joined together with concat:
reg = ['^P','^C']
df1 = pd.concat([df.filter(regex = r) for r in reg], axis=1)
print (df1)
Pol2 Col1 Col2
0 50 20 50
1 60 30 60
You can just re-order the columns by regular assignment: export the columns to a sorted list and index by it.
Try:
import pandas as pd
df = pd.DataFrame({'Col1': [20, 30],
'Pol2': [50, 60],
'Col2': [50, 60],
})
df = df[sorted(df.columns.to_list(), key=lambda col: col.startswith("P"), reverse=True)]
print(df)
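Output (columns starting with "P" come first; sorted is stable, so the remaining columns keep their original order):
   Pol2  Col1  Col2
0    50    20    50
1    60    30    60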
