Order DataFrame columns by multiple regex - python

I want to order a DataFrame by multiple regexes. That is to say, for example in this DataFrame
df = pd.DataFrame({'Col1': [20, 30],
                   'Col2': [50, 60],
                   'Pol2': [50, 60]})
get the columns beginning with P before the ones beginning with C.
I've discovered that you can filter with one regex, like
df.filter(regex = "P*")
but I can't do that with more levels.
UPDATE:
I want to do that in one instruction, I'm already able to use a list of regex and concatenate the columns in another DataFrame.

I believe you need a list of DataFrames, each filtered by one regex from the list, joined with concat:
reg = ['^P','^C']
df1 = pd.concat([df.filter(regex = r) for r in reg], axis=1)
print (df1)
   Pol2  Col1  Col2
0    50    20    50
1    60    30    60

You can just re-order the columns by regular assignment: export the columns to a sorted list and index by it.
try:
import pandas as pd
df = pd.DataFrame({'Col1': [20, 30],
                   'Pol2': [50, 60],
                   'Col2': [50, 60],
                   })
df = df[sorted(df.columns.to_list(), key=lambda col: col.startswith("P"), reverse=True)]
print(df)
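If you need an arbitrary ordered list of regexes rather than a single prefix, here is a minimal sketch along the same lines; the rank helper is my own illustration, not a pandas function:
import re
import pandas as pd

df = pd.DataFrame({'Col1': [20, 30], 'Col2': [50, 60], 'Pol2': [50, 60]})
reg = ['^P', '^C']  # patterns in the desired column order

def rank(col):
    # position of the first pattern that matches; unmatched columns sort last
    return next((i for i, p in enumerate(reg) if re.match(p, col)), len(reg))

df = df[sorted(df.columns, key=rank)]
print(df)  # columns: Pol2, Col1, Col2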

Related

How to compare coordinates in two dataframes?

I have two dataframes.
df1:
   x1  y1    x2    y2   label
0   0   0  1240  1755  label1
1   0   0  1240     2  label2
df2:
       x1     y1      x2     y2   text
0   992.0  943.0  1166.0  974.0  text1
1  1110.0  864.0  1166.0  890.0  text2
Based on a condition like the following:
if df1['x1'] >= df2['x1'] or df1['y1'] >= df2['y1']:
    # I want to add a new column 'text' in df1 with the text from df2.
    df1['text'] = df2['text']
What's more, it is possible in df2 to have more than one row that makes the above-mentioned condition True, so I will need to add another if statement for df2 to get the best match.
My problem here is not the conditions but how I am supposed to approach the interaction between both data frames. Any help or advice would be appreciated.
If you want to iterate from df1 through every row of df2 and return a match, you can do it with the .apply() function on df1 and use df2 as a lookup table.
NOTE: In the example below I return only the first match (via .iloc[0]), not all the matches.
Create two dummy dataframes
import pandas as pd
df1 = pd.DataFrame({'x1': [1, 2, 3], 'y1': [1, 5, 6]})
df2 = pd.DataFrame({'x1': [11, 1, 13], 'y1': [3, 52, 26], 'text': ['text1', 'text2', 'text3']})
Create a lookup function
def apply_condition(row, df):
    condition = ((row['x1'] >= df['x1']) | (row['y1'] >= df['y1']))
    return df[condition]['text'].iloc[0]  # ATTENTION: only the first match is returned
Create new column and print results
df1['text'] = df1.apply(lambda row: apply_condition(row, df2), axis=1)
df1.head()
Result:
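   x1  y1   text
0   1   1  text2
1   2   5  text1
2   3   6  text1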

If more than one column name contains the same string, how to perform analysis on each

I am performing analysis on several dataframes. Some of them have variables with really similar names, for example:
d = {'id': [1, 2, 3], '1-abc': [13, 15, 27], '2-abc': [23, 36, 12]}
df = pd.DataFrame(data=d)
In this example, I have specific analysis I want to perform on the quantity columns. When there is more than one column containing 'abc' I want to perform the analysis on both 'abc' columns.
I have tried df['ABC'] = df.loc[:, ['abc' in i for i in df.columns]], but this doesn't work on the dataframes where there is more than one column containing 'abc'.
Is there a way to create an if-else statement that performs like the pseudo-code below?
for col in df.columns:
    if df.columns contains > 1 col containing 'abc':
        *perform analysis on 'abc' columns*
    else:
        continue
Your question is a bit broad, but a better way to obtain a subset of columns is df.filter(). Then you could perform your analysis on each column of the filtered DataFrame.
ABC = df.filter(like='abc').columns
for col in ABC:
    ANALYZE_ME(df[col])  # Perform analysis on each column
You can also pack two columns into a list with something like:
df['ABC'] = df.filter(like='abc').apply(list, axis=1)
Output:
   id  1-abc  2-abc       ABC
0   1     13     23  [13, 23]
1   2     15     36  [15, 36]
2   3     27     12  [27, 12]
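To mirror the asker's pseudo-code, you can branch on the filtered frame's column count; a small sketch, where ANALYZE_ME again stands in for whatever analysis is needed:
abc_cols = df.filter(like='abc')
if abc_cols.shape[1] > 1:  # more than one column contains 'abc'
    for col in abc_cols.columns:
        ANALYZE_ME(df[col])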

How to join and explode dataframes by multiple IDs using Pandas

I have two dataframes and I would like to perform a join with multiple IDs.
In df1 I have the column KeyWordGroupID with multiple IDs. These IDs can also be found in df2.
If there is a match, the KeyWordGroupName values of df2 are split out into new columns holding the corresponding KeyWords values.
import pandas as pd

# initialize list of lists
data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
df1
Output:
   RuleSetID RuleSetName   KeyWordGroupID
0          0   Standard1  [100, 101, 102]
1          1   Standard2       [100, 102]
2          2   Standard3            [103]
The second dataframe is:
# initialize list of lists
data = [[100, 'verahren', ['word1, word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4, word5']],
        [103, 'ort', ['word6, word7']]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])
df2
Output:
   KeyWordGroupID KeyWordGroupName        KeyWords
0             100         verahren  [word1, word2]
1             101         flaechen         [word3]
2             102          nutzung  [word4, word5]
3             103              ort  [word6, word7]
The desired output is:
   RuleSetID RuleSetName   KeyWordGroupID       verfahren flaechen         nutzung             ort
0          0   Standard1  [100, 101, 102]  [word1, word2]  [word3]  [word4, word5]            None
1          1   Standard2       [100, 102]  [word1, word2]     None  [word4, word5]            None
2          2   Standard3            [103]            None     None            None  [word6, word7]
Any hint how to perform a join like this is highly appreciated.
This one is a little tricky, but here's one approach. It takes advantage of explode to make the merge possible, and pivot, which is what this ultimately is. Then, to get rid of the empty lists, it uses applymap.
import numpy as np
import pandas as pd

data = [[0, 'Standard1', [100, 101, 102]], [1, 'Standard2', [100, 102]], [2, 'Standard3', [103]]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'])
data = [[100, 'verahren', ['word1, word2']],
        [101, 'flaechen', ['word3']],
        [102, 'nutzung', ['word4, word5']],
        [103, 'ort', ['word6, word7']]]
# Create the pandas DataFrame
df2 = pd.DataFrame(data, columns=['KeyWordGroupID', 'KeyWordGroupName', 'KeyWords'])

(
    df1.explode('KeyWordGroupID')
       .merge(df2, on='KeyWordGroupID')
       .pivot(index=['RuleSetID', 'RuleSetName', 'KeyWordGroupID'],
              columns='KeyWordGroupName', values='KeyWords')
       .reset_index()
       .groupby(['RuleSetID', 'RuleSetName'])
       .agg(lambda x: list(x) if x.name == 'KeyWordGroupID' else x.dropna())
       .applymap(lambda x: np.nan if len(x) == 0 else x)
       .reset_index()
)
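For comparison, here is a simpler (if less general) sketch that builds each keyword column directly by looking up every group ID in df2; this loop is my own illustration, not part of the original answer:
lookup = df2.set_index('KeyWordGroupID')

result = df1.copy()
for gid, rec in lookup.iterrows():
    # one new column per KeyWordGroupName; None where the RuleSet lacks that group ID
    result[rec['KeyWordGroupName']] = df1['KeyWordGroupID'].apply(
        lambda ids: rec['KeyWords'] if gid in ids else None)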

How to sum same columns (differentiated by suffix) in pandas?

I have a dataframe that looks like this:
   total_customers  total_customer_2021-03-31  total_purchases  total_purchases_2021-03-31
0                1                         10                4                           6
1                3                         14                3                           2
Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e., the expected output is:
   total_customers  total_purchases
0               11               10
1               17                5
The issue why I cannot do this manually is because I have 100+ column pairs, so I need an efficient way to do this. Also, the order of columns is not predictable either. What do you recommend?
Thanks!
Somehow we need to get an Index of columns such that paired columns share the same name; then we can do a groupby sum on axis=1:
cols = pd.Index(['total_customers', 'total_customers',
                 'total_purchases', 'total_purchases'])
result_df = df.groupby(cols, axis=1).sum()
With the shown example, we can str.replace an optional s, followed by an underscore, followed by the date format (four digits, two digits, two digits, separated by hyphens), with a single 's'. This pattern may need to be modified depending on the actual column names:
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()
result_df:
   total_customers  total_purchases
0               11               10
1               17                5
Setup and imports:
import pandas as pd
df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2]
})
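Note that groupby(..., axis=1) is deprecated in recent pandas releases; a forward-compatible sketch of the same idea (assuming the cols Index built above) transposes, groups, and transposes back:
result_df = df.T.groupby(cols).sum().T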
Assuming that your dataframe is called df, a direct (if manual) solution is:
sum_customers = df['total_customers'] + df['total_customer_2021-03-31']
sum_purchases = df['total_purchases'] + df['total_purchases_2021-03-31']
df_total = pd.DataFrame({'total_customers': sum_customers,
                         'total_purchases': sum_purchases})
and that will give you the output you want
import pandas as pd
data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}
df = pd.DataFrame(data=data)
final_df = pd.DataFrame()
final_df["total_customers"] = df.filter(regex='total_customers*').sum(1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(1)
Output:
final_df
   total_customers  total_purchases
0               11               10
1               17                5
Using #HenryEcker's sample data, and building off of the example in the docs, you can create a function and groupby on the column axis:
def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'

df.groupby(get_column, axis=1).sum()
   total_customers  total_purchases
0               11               10
1               17                5
I changed the column headings while coding to make them shorter, just for info:
data = {"total_c": [1, 3], "total_c_2021": [10, 14],
        "total_p": [4, 3], "total_p_2021": [6, 2]}
df = pd.DataFrame(data)
df["total_customers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]
If you don't want to see the other columns you can drop them:
df = df.loc[:, ['total_customers', 'total_purchases']]
NEW PART
I might have found a starting point for your solution! I don't know your column names, but the following code can be adapted if your column names follow a pattern (dates, prefixes, etc.). Can you change the column names with a loop?
df['total_customers'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)
This solution might be helpful for you with some alterations.
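Building on that idea, a hedged sketch that loops over shared prefixes; the prefixes list is an assumption here and would be derived from your real column names:
prefixes = ['total_c', 'total_p']  # assumed shared prefixes; adjust to your data
for p in prefixes:
    cols = [c for c in df.columns if c.startswith(p)]
    df[p + '_sum'] = df[cols].sum(axis=1)  # distinct _sum suffix avoids clashing with inputs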

How to update several DataFrames values with another DataFrame values?

I would like to know how I can update two DataFrames, df1 and df2, from another DataFrame df3. All this is done within a for loop that iterates over all the rows of the DataFrame df3:
for i in range(len(df3)):
    df1.p_mw = ...
    df2.p_mw = ...
The initial DataFrames df1 and df2 are as follows:
df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
                    ['GH_2', 20, 'Hidro'],
                    ['GH_3', 30, 'Hidro']],
                   columns=['name', 'p_mw', 'type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
                    ['GT_2', 50, 'Termo'],
                    ['GF_1', 10, 'Fict']],
                   columns=['name', 'p_mw', 'type'])
The DataFrame from which I want to update the data is:
df3 = pd.DataFrame([[150, 57, 110, 20, 10],
                    [120, 66, 110, 20, 0],
                    [90, 40, 105, 20, 0],
                    [60, 40, 90, 20, 0]],
                   columns=['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])
As you can see, the DataFrame df3 contains data for the corresponding p_mw column of both DataFrames df1 and df2. Furthermore, the DataFrame df2 has an element named GF_1 for which there is no update and which should remain the same.
After updating for the last iteration, the desired output is the following:
df1 = pd.DataFrame([['GH_1', 60, 'Hidro'],
                    ['GH_2', 40, 'Hidro'],
                    ['GH_3', 90, 'Hidro']],
                   columns=['name', 'p_mw', 'type'])
df2 = pd.DataFrame([['GT_1', 20, 'Termo'],
                    ['GT_2', 0, 'Termo'],
                    ['GF_1', 10, 'Fict']],
                   columns=['name', 'p_mw', 'type'])
Create a mapping series by selecting the last row from df3, then map it on the name column and fill the NaN values using the values from the p_mw column:
s = df3.iloc[-1]
df1['p_mw'] = df1['name'].map(s).fillna(df1['p_mw'])
df2['p_mw'] = df2['name'].map(s).fillna(df2['p_mw'])
If there are multiple dataframes that need to be updated, we can use a for loop to avoid repeating our code:
for df in (df1, df2):
    df['p_mw'] = df['name'].map(s).fillna(df['p_mw'])
>>> df1
   name  p_mw   type
0  GH_1    60  Hidro
1  GH_2    40  Hidro
2  GH_3    90  Hidro
>>> df2
   name  p_mw   type
0  GT_1  20.0  Termo
1  GT_2   0.0  Termo
2  GF_1  10.0   Fict
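If the update really has to happen once per iteration (e.g. per-period work over every row of df3, as in the asker's loop), the same mapping can go inside it; a sketch:
for i in range(len(df3)):
    s = df3.iloc[i]  # mapping series for this iteration
    for df in (df1, df2):
        df['p_mw'] = df['name'].map(s).fillna(df['p_mw'])
    # ... per-iteration work with the updated df1/df2 goes here ...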
This should do as you ask. No need for a for loop.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
                    ['GH_2', 20, 'Hidro'],
                    ['GH_3', 30, 'Hidro']],
                   columns=['name', 'p_mw', 'type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
                    ['GT_2', 50, 'Termo'],
                    ['GF_1', 10, 'Fict']],
                   columns=['name', 'p_mw', 'type'])
df3 = pd.DataFrame([[150, 57, 110, 20, 10],
                    [120, 66, 110, 20, 0],
                    [90, 40, 105, 20, 0],
                    [60, 40, 90, 20, 0]],
                   columns=['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])

updates = df3.iloc[-1].values
df1["p_mw"] = updates[:3]                                   # first three columns of df3 match df1's rows
df2["p_mw"] = np.append(updates[3:], df2["p_mw"].iloc[-1])  # keep GF_1's value unchanged
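Note this relies on df3's column order matching the row order of df1/df2. A name-robust variant of the same idea (a sketch, assuming df3's columns are named after the name values, as in the example):
last = df3.iloc[-1]
df1['p_mw'] = last[df1['name']].values        # every df1 name exists in df3
mask = df2['name'].isin(df3.columns)          # GF_1 has no column in df3
df2.loc[mask, 'p_mw'] = last[df2.loc[mask, 'name']].values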
