I have a df that compares new and old data. Is there a way to calculate the difference between the old and new values? For generality, I don't want to sort the dataframe, but only compare root variables that share the suffixes "_old" and "_new".
df
   apple_old  daily  banana_new  banana_tree  banana_old  apple_new
0          5      3           4            2          10          6
for x in df.columns:
    if x.endswith("_old") and x.endswith("_new"):
        x = x.dif()
Expected output (parentheses shown just for clarity):
df_diff
   apple_diff (old-new)  banana_diff (old-new)
0              -1 (5-6)               6 (10-4)
Let's try creating a Multi-Index, then subtracting old from new.
Setup:
import pandas as pd
df = pd.DataFrame({'apple_old': {0: 5}, 'daily': {0: 3}, 'banana_new': {0: 4},
                   'banana_tree': {0: 2}, 'banana_old': {0: 10},
                   'apple_new': {0: 6}})
# Creation of Multi-Index:
df.columns = df.columns.str.rsplit('_', n=1, expand=True).swaplevel(0, 1)
# Subtract old from new:
output_df = (df['old'] - df['new']).add_suffix('_diff')
# Display:
print(output_df)
   apple_diff  banana_diff
0          -1            6
Multi-Index with str.rsplit, using a max split of n=1 so that column names containing multiple underscores are handled safely:
df.columns = df.columns.str.rsplit('_', n=1, expand=True).swaplevel(0, 1)
    old   NaN     new    tree     old    new
  apple daily  banana  banana  banana  apple
0     5     3       4       2      10      6
Then selection:
df['old']

   apple  banana
0      5      10

df['new']

   banana  apple
0       4       6
Subtraction aligns by column name; then add_suffix appends _diff to the columns.
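As a minimal sketch of that alignment (toy frames, not the original df): the two sub-frames list their columns in different orders, but subtraction still matches them by name.

```python
import pandas as pd

# Columns appear in different orders, but subtraction aligns them by name
old = pd.DataFrame({'apple': [5], 'banana': [10]})
new = pd.DataFrame({'banana': [4], 'apple': [6]})

diff = (old - new).add_suffix('_diff')
print(diff)
#    apple_diff  banana_diff
# 0          -1            6
```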
Related
I am using python and pandas and have tried a variety of attempts to pivot the following (switch the rows and columns).
Example:
A is unique
A B C D E... (and so on)
[0] apple 2 22 222
[1] peach 3 33 333
[N] ... and so on
And I would like to see
? ? ? ? ... and so on
A apple peach
B 2 3
C 22 33
D 222 333
E
... and so on
I am OK if the columns are named after the values in column "A", and if the first column needs a name, let's call it "name":
name apple peach ...
B 2 3
C 22 33
D 222 333
E
... and so on
I think you're wanting transpose here.
df = pd.DataFrame({'A': {0: 'apple', 1: 'peach'}, 'B': {0: 2, 1: 3}, 'C': {0: 22, 1: 33}})
df = df.T
print(df)
0 1
A apple peach
B 2 3
C 22 33
Edit for the comment: I would probably reset the index and then use df.columns to update the column names with a list. You may want to reset the index again at the end as needed.
df.reset_index(inplace=True)
df.columns = ['name', 'apple', 'peach']
df = df.iloc[1:, :]
print(df)
name apple peach
1 B 2 3
2 C 22 33
try df.transpose() it should do the trick
Taking the advice from the other posts, and a few other tweaks (explained inline), here is what worked for me.
# get the key column that will become the column names,
# and add a name for the existing index column
cols = df['A'].tolist()
cols.append('name')
# Transpose
df = df.T
# the transpose turns the old column names into the index;
# we need to add them back into the data set as a column (you might want
# to drop the index later to get rid of it altogether)
df['name'] = df.index
# now rebuild the columns and move the new "name" column to the first slot
df.columns = cols
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
# remove the first row (it was the column we used for the column names)
df = df.iloc[1:, :]
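The multi-step dance above can be shortened: a sketch of the same result, assuming (as the question states) that column A is unique, by setting it as the index before transposing.

```python
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'peach'], 'B': [2, 3], 'C': [22, 33]})

out = (df.set_index('A').T            # fruit names become the columns
         .rename_axis(columns=None)   # drop the leftover 'A' columns name
         .reset_index()               # turn the B/C index into a column
         .rename(columns={'index': 'name'}))
print(out)
#   name  apple  peach
# 0    B      2      3
# 1    C     22     33
```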
I have two dataframes:
Row No. Subject
1 Apple
2 Banana
3 Orange
4 Lemon
5 Strawberry
row_number Subjects Special?
1 Banana Yes
2 Lemon No
3 Apple No
4 Orange No
5 Strawberry Yes
6 Cranberry Yes
7 Watermelon No
I want to change the Row No. of the first dataframe to match the second. It should be like this:
Row No. Subject
3 Apple
1 Banana
4 Orange
2 Lemon
5 Strawberry
I have tried this code:
for index, row in df1.iterrows():
    if df1['Subject'] == df2['Subjects']:
        df1['Row No.'] = df2['row_number']
But I get the error:
ValueError: Can only compare identically-labeled Series objects
Does that mean the dataframes have to have the same amount of rows and columns? Do they have to be labelled the same too? Is there a way to bypass this limitation?
Edit: I have found a promising alternative:
for x in df1['Subject']:
    if x in df2['Subjects'].values:
        df2.loc[df2['Subjects'] == x]['row_number'] = df1.loc[df1['Subject'] == x]['Row No.']
But it appears it doesn't modify the first dataframe like I want it to. Any tips why?
Furthermore, I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would avoid using for loops, especially when pandas has such great methods to handle these types of problems already.
Using pd.Series.replace
Here is a vectorized way of doing this -
d is the dictionary that maps the fruit to the number in the second dataframe.
You can use df.Subject.replace(d) to now simply replace the keys in the dict d with their values.
Overwrite the Row No. column with this now.
d = dict(zip(df2['Subjects'], df2['row_number']))
df1['Row No.'] = df1.Subject.replace(d)
print(df1)
   Row No.     Subject
0        3       Apple
1        1      Banana
2        4      Orange
3        2       Lemon
4        5  Strawberry
Using pd.merge
Let's try simply merging the 2 dataframes and replacing the column completely.
ddf = pd.merge(df1['Subject'],
               df2[['row_number', 'Subjects']],
               left_on='Subject',
               right_on='Subjects',
               how='left').drop(columns='Subjects')
ddf.columns = df1.columns[::-1]
print(ddf)
Subject Row No.
0 Apple 3
1 Banana 1
2 Orange 4
3 Lemon 2
4 Strawberry 5
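Another option worth noting (a sketch, not from the answer above) is Series.map, which behaves like replace here except that any subject missing from the dictionary becomes NaN instead of being silently kept:

```python
import pandas as pd

df1 = pd.DataFrame({'Row No.': [1, 2, 3, 4, 5],
                    'Subject': ['Apple', 'Banana', 'Orange', 'Lemon', 'Strawberry']})
df2 = pd.DataFrame({'row_number': [1, 2, 3, 4, 5, 6, 7],
                    'Subjects': ['Banana', 'Lemon', 'Apple', 'Orange',
                                 'Strawberry', 'Cranberry', 'Watermelon']})

d = dict(zip(df2['Subjects'], df2['row_number']))

# map() looks each Subject up in d; subjects absent from d would map to NaN,
# whereas replace() would keep the original value
df1['Row No.'] = df1['Subject'].map(d)
print(df1)
```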
Assuming the first is df1 and the second is df2, this should do what you want it to:
import pandas as pd
d1 = {'Row No.': [1, 2, 3, 4, 5],
      'Subject': ['Apple', 'Banana', 'Orange', 'Lemon', 'Strawberry']}
df1 = pd.DataFrame(data=d1)
d2 = {'row_number': [1, 2, 3, 4, 5, 6, 7],
      'Subjects': ['Banana', 'Lemon', 'Apple', 'Orange', 'Strawberry',
                   'Cranberry', 'Watermelon'],
      'Special?': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No']}
df2 = pd.DataFrame(data=d2)
for x in df1['Subject']:
    if x in df2['Subjects'].values:
        df1.loc[df1['Subject'] == x, 'Row No.'] = (df2.loc[df2['Subjects'] == x]['row_number']).item()
#print(df1)
#print(df2)
In your edited answer it looks like you had the dataframes swapped, and you were missing the .item() call to get the actual row_number value rather than a Series object.
I want to filter rows by multi-column values.
For example, given the following dataframe,
import pandas as pd
df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
print(df)
print(df)
Original dataframe:
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
3 Bob 1 b 1
4 Bob 2 b 2
I want to select the rows whose [name, group, place] combination appears in selectRow.
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
Then the expected dataframe is :
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
I have tried it, but my method is not efficient and runs for a long time, especially when there are many samples in the original dataframe.
My Simple Method:
newdf = pd.DataFrame({})
for item in selectRow:
    print(item)
    tmp = df.loc[(df['name'] == item[0]) & (df['group'] == item[1]) & (df['place'] == item[2])]
    newdf = newdf.append(tmp)
newdf = newdf.reset_index(drop=True)
newdf.tail()
print(newdf)
I'm hoping for an efficient method to achieve this.
Try using isin:
mask = (df['name'].isin(list(zip(*selectRow))[0])
        & df['group'].isin(list(zip(*selectRow))[1])
        & df['place'].isin(list(zip(*selectRow))[2]))
print(df[mask])
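One caveat with the per-column isin approach: it also keeps rows whose values match each column separately but not as a combination (e.g. ("Amy", 2, "a") would pass). A sketch of an exact-combination filter using an inner merge instead:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]

# An inner merge keeps only rows whose (name, group, place) triple
# appears in selectRow, never a mixed combination of values
keys = pd.DataFrame(selectRow, columns=["name", "group", "place"])
out = df.merge(keys, on=["name", "group", "place"])
print(out)
```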
I want to add _x suffix to each column name like so:
featuresA = myPandasDataFrame.columns.values + '_x'
How do I do this? Additionally, if I wanted to add x_ as a suffix, how would the solution change?
The following is the nicest way to add a suffix, in my opinion.
df = df.add_suffix('_some_suffix')
As it is a function that is called on a DataFrame and returns a DataFrame, you can use it in a chain of calls.
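For instance, a minimal sketch of such a chain (the query step is just an illustrative filter, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# add_suffix returns a new DataFrame, so it slots into a method chain
out = (df.query('A > 1')      # hypothetical filtering step
         .add_suffix('_x'))
print(out)
#    A_x  B_x
# 1    2    4
```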
You can use a list comprehension:
df.columns = [str(col) + '_x' for col in df.columns]
There are also built-in methods like .add_suffix() and .add_prefix() as mentioned in another answer.
Elegant In-place Concatenation
If you're trying to modify df in-place, then the cheapest (and simplest) option is in-place addition directly on df.columns (i.e., using Index.__iadd__).
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.columns += '_some_suffix'
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
To add a prefix, you would similarly use
df.columns = 'some_prefix_' + df.columns
df
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Another cheap option is using a list comprehension with f-string formatting (available in Python 3.6+).
df.columns = [f'{c}_some_suffix' for c in df]
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
And for prefix, similarly,
df.columns = [f'some_prefix{c}' for c in df]
Method Chaining
It is also possible to add *fixes while method chaining. To add a suffix, use DataFrame.add_suffix:
df.add_suffix('_some_suffix')
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
This returns a copy of the data. IOW, df is not modified.
Adding prefixes is also done with DataFrame.add_prefix.
df.add_prefix('some_prefix_')
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Which also does not modify df.
Critique of add_*fix
These are good methods if you're trying to perform method chaining:
df.some_method1().some_method2().add_*fix(...)
However, add_prefix (and add_suffix) creates a copy of the entire dataframe, just to modify the headers. If you believe this is wasteful, but still want to chain, you can call pipe:
def add_suffix(df):
    df.columns += '_some_suffix'
    return df
df.some_method1().some_method2().pipe(add_suffix)
I know 4 ways to add a suffix (or prefix) to your column names:
1- df.columns = [str(col) + '_some_suffix' for col in df.columns]
or
2- df.rename(columns=lambda col: col + '_some_suffix')
or
3- df.columns += '_some_suffix', which is much easier.
or, the nicest:
4- df.add_suffix('_some_suffix')
I haven't seen this solution proposed above, so I'm adding it to the list:
df.columns += '_x'
And you can easily adapt for the prefix scenario.
Using DataFrame.rename
df = pd.DataFrame({'A': range(3), 'B': range(4, 7)})
print(df)
A B
0 0 4
1 1 5
2 2 6
Using rename with axis=1 and string formatting:
df.rename('col_{}'.format, axis=1)
# or df.rename(columns='col_{}'.format)
col_A col_B
0 0 4
1 1 5
2 2 6
To actually overwrite your column names, we can assign the returned values to our df:
df = df.rename('col_{}'.format, axis=1)
or use inplace=True:
df.rename('col_{}'.format, axis=1, inplace=True)
I figured that this is what I would use quite often, for example:
df = pd.DataFrame({'silverfish': range(3), 'silverspoon': range(4, 7),
                   'goldfish': range(10, 13), 'goldilocks': range(17, 20)})
My way of dynamically renaming:
color_list = ['gold', 'silver']
for i in color_list:
    df[f'color_{i}'] = df.filter(like=i).sum(axis=1)
OUTPUT:
{'silverfish': {0: 0, 1: 1, 2: 2},
 'silverspoon': {0: 4, 1: 5, 2: 6},
 'goldfish': {0: 10, 1: 11, 2: 12},
 'goldilocks': {0: 17, 1: 18, 2: 19},
 'color_gold': {0: 27, 1: 29, 2: 31},
 'color_silver': {0: 4, 1: 6, 2: 8}}
Pandas also has add_prefix and add_suffix methods to do this.
I have a dataframe with the following header:
id, type1, ..., type10, location1, ..., location10
and I want to convert it as follows:
id, type, location
I managed to do this using embedded for loops but it's very slow:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())
new_index = 0
for index, row in data.iterrows():
    ID = row["ID"]
    for i in range(1, 11):
        # (pd.isna is needed here; a `== np.nan` comparison is always False)
        if pd.isna(row["type" + str(i)]):
            continue
        else:
            new_row = pd.Series([ID, row["type" + str(i)], row["location" + str(i)]])
            new_format_dataframe.loc[new_index] = new_row.values
            new_index += 1
Any suggestions for improvement using native pandas features?
You can use lreshape:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
Sample:
import pandas as pd
df = pd.DataFrame({
    'type1': {0: 1, 1: 4},
    'id': {0: 'a', 1: 'a'},
    'type10': {0: 1, 1: 8},
    'location1': {0: 2, 1: 9},
    'location10': {0: 5, 1: 7}})
print(df)
  id  location1  location10  type1  type10
0  a          2           5      1       1
1  a          9           7      4       8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
Another solution with double melt:
print(pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
                 pd.melt(df, value_vars=location, value_name='Location')], axis=1)
        .drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
EDIT:
lreshape is now undocumented, and it is possible it will be removed in the future (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one - maybe melt, but it is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
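In the meantime, the same reshape can be sketched with pd.wide_to_long. Note that it requires the i column to uniquely identify rows, so this toy frame uses distinct ids (unlike the sample above, where both rows share id 'a'):

```python
import pandas as pd

# Toy frame with unique ids (wide_to_long requires 'i' to be unique)
df = pd.DataFrame({'id': ['a', 'b'],
                   'type1': [1, 4], 'type10': [1, 8],
                   'location1': [2, 9], 'location10': [5, 7]})

# stubnames=['type', 'location'] splits e.g. 'type10' into stub 'type'
# plus suffix 10, which lands in the new index level 'num'
out = (pd.wide_to_long(df, stubnames=['type', 'location'], i='id', j='num')
         .reset_index()
         .drop(columns='num'))
print(out)
```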