I have two dataframes, left_df and right_df, which both have 20 columns with identical names and dtypes. right_df also has 2 additional columns with unique values on every row.
I want to update rows in right_df with ALL the values from left_df wherever both dataframes have identical values in every column of a given subset, e.g. matching_cols = ['col_1', 'col_3', 'col_10', 'col_12']. The values in the 2 additional unique columns in right_df should be preserved.
Ideally, I also want to drop those rows from left_df in the same command, or in the next command if that isn't possible. I need to repeat this process, matching on several different lists of columns and dropping matched rows from left_df on each loop, until eventually no further matches are found.
An acceptable alternative would be any method that creates a new dataframe new_df containing the set of rows where all columns in matching_cols match, with values from left_df in the first 20 columns and values from right_df in the remaining 2 columns.
I don't care about preserving the indices at any point in either dataframe; I am importing the data to SQL after this and will reindex on one of the 2 unique right_df columns at the end.
I'm new to Pandas and can't determine which method to use. I have tried variations of .merge, .join, and .update, but I can't work out how to update only when all my desired column values match, nor how to drop those rows or export them to a new dataframe.
Update: Added pseudocode below:
For a left_df as:
import pandas as pd

left_df = pd.DataFrame({
'col_0': ['0', '1', '2', '3', '4', '5'],
'col_1': ['A', 'B', 'C', 'D', 'E', 'F'],
'col_2': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_3': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_4': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_5': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_6': ['new', 'new', 'new', 'new', 'new', 'new'],
'col_7': ['new', 'new', 'new', 'new', 'new', 'new'],
})
and a right_df as:
right_df = pd.DataFrame({
'col_0': ['0', '1', '2', '3', '4', '5'],
'col_1': ['A', 'B', 'C', 'X', 'E', 'F'],
'col_2': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_3': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_4': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_5': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_6': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_7': ['old', 'old', 'old', 'old', 'old', 'old'],
'col_8': ['uid_0', 'uid_1', 'uid_2', 'uid_3', 'uid_4', 'uid_5'],
'col_9': ['uid_a', 'uid_b', 'uid_c', 'uid_d', 'uid_e', 'uid_f'],
})
Where matching_cols = ['col_0', 'col_1']
I want to get the following result, either as a new dataframe or in-place on right_df (note that col_1 doesn't match on row 3, so that row is not changed):
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0 A new new new new new new uid_0 uid_a
1 1 B new new new new new new uid_1 uid_b
2 2 C new new new new new new uid_2 uid_c
3 3 X old old old old old old uid_3 uid_d
4 4 E new new new new new new uid_4 uid_e
5 5 F new new new new new new uid_5 uid_f
Worked it out thanks to this post and the Pandas documentation:
First, it's a .merge I need, and I specify the '_r' suffix for only the columns copied from right_df, i.e. the old values I'm replacing:
merged_df = pd.merge(left_df, right_df, on=['col_0', 'col_1'], suffixes=(None, '_r'))
This yields a new dataframe whose rows contain both the new and the old columns, only for the rows where the values in on=['col_0', 'col_1'] match in both dataframes. Then I drop the "old" columns using a regex filter on the suffix '_r':
merged_df.drop(columns=merged_df.filter(regex='_r$').columns, inplace=True)
This yields a dataframe with only the "modified" rows and no unmodified rows, which is close enough for what I need.
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0 A new new new new new new uid_0 uid_a
1 1 B new new new new new new uid_1 uid_b
2 2 C new new new new new new uid_2 uid_c
3 4 E new new new new new new uid_4 uid_e
4 5 F new new new new new new uid_5 uid_f
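To also drop the matched rows from left_df after each pass (the looping part of my question), an indicator merge works. A minimal sketch, where matching_col_lists and the loop structure are illustrative rather than tested on the real data:
# repeat the merge for several key-column lists, removing matched rows
# from left_df each pass until no further matches are found
matching_col_lists = [['col_0', 'col_1'], ['col_0']]
matched_parts = []
for cols in matching_col_lists:
    merged = pd.merge(left_df, right_df, on=cols, suffixes=(None, '_r'))
    merged = merged.drop(columns=merged.filter(regex='_r$').columns)
    matched_parts.append(merged)
    # anti-join: keep only the left_df rows whose key values did not match
    keys = merged[cols].drop_duplicates()
    left_df = (left_df.merge(keys, on=cols, how='left', indicator=True)
                      .query('_merge == "left_only"')
                      .drop(columns='_merge'))
new_df = pd.concat(matched_parts, ignore_index=True)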
Try this:
new_df = pd.concat([left_df, right_df.iloc[:, -2:]], axis=1)
Note that right_df.iloc[:, -2:] selects the last two (unique) columns, and that pd.concat aligns rows by index rather than by the values in matching_cols.
So here is my issue:
import numpy as np
import pandas as pd

# creation of dataframe
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index = labels)
If I run the code below, the values of the column priority change for the dataframe dg (OK, good), but they change for the dataframe df too. Why?
# map function
dg = df
dg["priority"] = dg["priority"].map({"yes":True, "no":False})
print(dg)
Simply use df.copy().
Because df is a DataFrame instance, and therefore an object, assigning it to another variable just points two variables at the same object; pandas does not create a new one.
That's because pandas dataframes are mutable.
pandas.DataFrame
Two-dimensional, size-mutable, potentially
heterogeneous tabular data.
You want pandas.DataFrame.copy to keep the original dataframe (in your case df) unchanged.
# map function
dg = df.copy()
dg["priority"] = dg["priority"].map({"yes":True, "no":False})
I am trying to apply the following calculation to all the columns of a dataframe EXCEPT three string columns. The code below works fine on the sample data, but in reality there are 100+ month columns, and one more is added every month, while the 3 string columns are fixed. I want to apply the /100 to every column where View Description == 'Percent change', except the Series ID, View Description, and Country columns. How can I build the list so that it names just the 3 string columns and the .loc is applied to everything else?
import pandas as pd
df = pd.DataFrame({
'Series ID': ['Food', 'Drinks', 'Food at Home'],
'View Description': ['Percent change', 'Original Data Value', 'Original Data Value'],
'Jan': [219.98, 'B', 'A'],
'Feb': [210.98, 'B', 'A'],
'Mar': [205, 'B', 'A'],
'Apr': [202, 'B', 'A'],
'Country': ['Italy', 'B', 'A']
})
months = ['Jan', 'Feb', 'Mar', 'Apr']
df.loc[df['View Description'] == 'Percent change', months] /= 100
print(df)
Thanks!
You can change months to be a boolean array which omits the string columns:
months = ~df.columns.isin(['Series ID', 'View Description', 'Country'])
The command for applying the division will be the same as you have above. This change just programmatically selects the month columns by excluding the non-month columns.
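Putting it together on the sample data, a minimal sketch (the exclude name is just illustrative):
# build a boolean mask that is True for every column except the fixed string columns
exclude = ['Series ID', 'View Description', 'Country']
months = ~df.columns.isin(exclude)
df.loc[df['View Description'] == 'Percent change', months] /= 100
print(df)
If you prefer a list of labels over a boolean mask, df.columns.difference(exclude) gives the same selection.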
I need help with a problem with a dataframe like this:
import pandas as pd

df = pd.DataFrame({'column_A': [[{'zone':'A', 'number':'7'}, {'zone':'B', 'number': '8'}],
[{'zone':'A', 'number':'6'}, {'zone':'E', 'number':'7'}]],
'column_B': [[{'zone':'C', 'number':'4'}], [{'zone':'D', 'number': '9'}]]})
I want to insert column_B into the column_A list so the output of the first line of column_A has to be:
[{'zone':'A', 'number':'7'}, {'zone':'B', 'number': '8'}, {'zone':'C', 'number':'4'}]
It's probably the easiest thing imaginable, but I keep hitting errors with functions like insert and the '+' operator, and I have run out of ideas.
Simplest is to join the lists with +:
df['column_A'] = df['column_A'] + df['column_B']
print (df)
column_A \
0 [{'zone': 'A', 'number': '7'}, {'zone': 'B', '...
1 [{'zone': 'A', 'number': '6'}, {'zone': 'E', '...
column_B
0 [{'zone': 'C', 'number': '4'}]
1 [{'zone': 'D', 'number': '9'}]
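The print above truncates the joined lists; inspecting the first row directly shows the expected result:
print(df.loc[0, 'column_A'])
# [{'zone': 'A', 'number': '7'}, {'zone': 'B', 'number': '8'}, {'zone': 'C', 'number': '4'}]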
If the data are different and the second column contains single dicts instead of lists, wrap each value in a list first:
df = pd.DataFrame({'column_A': [[{'zone':'A', 'number':'7'}, {'zone':'B', 'number': '8'}],
[{'zone':'A', 'number':'6'}, {'zone':'E', 'number':'7'}]],
'column_B': [{'zone':'C', 'number':'4'}, {'zone':'D', 'number': '9'}]})
df['column_A'] = df['column_A'] + df['column_B'].apply(lambda x: [x])
print (df)
column_A \
0 [{'zone': 'A', 'number': '7'}, {'zone': 'B', '...
1 [{'zone': 'A', 'number': '6'}, {'zone': 'E', '...
column_B
0 {'zone': 'C', 'number': '4'}
1 {'zone': 'D', 'number': '9'}
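Again, checking a row confirms each dict was appended as a single list element:
print(df.loc[1, 'column_A'])
# [{'zone': 'A', 'number': '6'}, {'zone': 'E', 'number': '7'}, {'zone': 'D', 'number': '9'}]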
When I searched for a way to remove an entire row in pandas if it contains a null/NaN value, the only appropriate function I found was dropna(). For some reason, it's not removing the entire row as intended, but instead replacing the null values with zero. As I want to discard the entire row and then take the mean age of the animals in the dataframe, I need a way to not count the NaN values.
Here's the code:
import numpy as np
import pandas as pd
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, labels)
df.dropna(inplace= True)
df.head()
In this case, I need to delete the dog 'd' and the cat 'h', but that is not what comes out. Note that I have also tried the following, and it didn't work either:
df2 = df.dropna()
You have to specify axis=1 and how='any' to remove columns.
see : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
df.dropna(axis=1, inplace= True, how='any')
If you just want to delete the rows:
df.dropna(inplace= True, how='any')
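For the stated goal of a mean age that ignores missing values, dropping rows first may not even be necessary; Series.mean skips NaN by default. A minimal sketch on the original df:
print(df['age'].mean())                        # NaN ages are skipped (skipna=True by default)
print(df.dropna(subset=['age'])['age'].mean()) # same result, dropping only rows with a missing age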
I'm still relatively new to Pandas and I can't tell which function I'm best off using to get to my answer. I have looked at pivot, pivot_table, groupby, and aggregate, but I can't seem to get them to do what I require. Quite possibly user error, for which I apologise!
I have data like this:
Code to create df:
import pandas as pd
df = pd.DataFrame([
['1', '1', 'A', 3, 7],
['1', '1', 'B', 2, 9],
['1', '1', 'C', 2, 9],
['1', '2', 'A', 4, 10],
['1', '2', 'B', 4, 0],
['1', '2', 'C', 9, 8],
['2', '1', 'A', 3, 8],
['2', '1', 'B', 10, 4],
['2', '1', 'C', 0, 1],
['2', '2', 'A', 1, 6],
['2', '2', 'B', 10, 2],
['2', '2', 'C', 10, 3]
], columns = ['Field1', 'Field2', 'Type', 'Price1', 'Price2'])
print(df)
I am trying to reshape the data so there is one row per Field1/Field2 pair, with the prices spread across the Type values. My end goal is to end up with one column for A, one for B, and one for C, where A will use Price1 and B & C will use Price2.
I don't necessarily want the max, min, average, or sum of the Price, as theoretically (although unlikely) there could be two different Price1 values for the same Fields & Type.
What's the best function to use in Pandas to get to what I need?
Use DataFrame.set_index with DataFrame.unstack for the reshape. The output has a MultiIndex in the columns, so sort the second level with DataFrame.sort_index, flatten the column names, and finally convert the Field levels back to columns with reset_index:
df1 = (df.set_index(['Field1','Field2', 'Type'])
.unstack(fill_value=0)
.sort_index(axis=1, level=1))
df1.columns = [f'{b}-{a}' for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
Field1 Field2 A-Price1 A-Price2 B-Price1 B-Price2 C-Price1 C-Price2
0 1 1 3 7 2 9 2 9
1 1 2 4 10 4 0 9 8
2 2 1 3 8 10 4 0 1
3 2 2 1 6 10 2 10 3
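From here, the stated end goal (one column for A from Price1, and B & C from Price2) is just a column selection on df1; a minimal sketch:
# keep Price1 for Type A and Price2 for Types B and C, per the question
final = df1[['Field1', 'Field2', 'A-Price1', 'B-Price2', 'C-Price2']]
print(final)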
A solution with DataFrame.pivot_table is also possible, but it aggregates values when the first three columns contain duplicates, using the mean function by default:
df2 = (df.pivot_table(index=['Field1','Field2'],
columns='Type',
values=['Price1', 'Price2'],
aggfunc='mean')
.sort_index(axis=1, level=1))
df2.columns = [f'{b}-{a}' for a, b in df2.columns]
df2 = df2.reset_index()
print (df2)
Use pivot_table:
pd.pivot_table(df, values =['Price1', 'Price2'], index=['Field1','Field2'],columns='Type').reset_index()
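Note that this one-liner leaves a MultiIndex in the columns; a minimal sketch to flatten it the same way as above:
out = pd.pivot_table(df, values=['Price1', 'Price2'],
                     index=['Field1', 'Field2'], columns='Type').reset_index()
# ('Price1', 'A') -> 'A-Price1'; the index columns have an empty second
# level after reset_index, so keep their first-level name as-is
out.columns = [f'{b}-{a}' if b else a for a, b in out.columns]
print(out)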