I have a df that compares new and old data. Is there a way to calculate the difference between the old and new values? For generality, I don't want to sort the dataframe, but only compare root variables that share the suffixes "_old" and "_new".
df
   apple_old  daily  banana_new  banana_tree  banana_old  apple_new
0          5      3           4            2          10          6
for x in df.columns:
    if x.endswith("_old") and x.endswith("_new"):
        x = x.dif()
Expected output (parentheses shown just for clarity):
df_diff
   apple_diff (old-new)  banana_diff (old-new)
0              -1 (5-6)               6 (10-4)
Let's try creating a Multi-Index, then subtracting old from new.
Setup:
import pandas as pd
df = pd.DataFrame({'apple_old': {0: 5}, 'daily': {0: 3}, 'banana_new': {0: 4},
                   'banana_tree': {0: 2}, 'banana_old': {0: 10},
                   'apple_new': {0: 6}})
# Creation of Multi-Index:
df.columns = df.columns.str.rsplit('_', n=1, expand=True).swaplevel(0, 1)
# Subtract old from new:
output_df = (df['old'] - df['new']).add_suffix('_diff')
# Display:
print(output_df)
   apple_diff  banana_diff
0          -1            6
Multi-Index with str.rsplit, using a max split of n=1 so that column names containing multiple underscores are handled safely:
df.columns = df.columns.str.rsplit('_', n=1, expand=True).swaplevel(0, 1)
    old   NaN     new    tree     old    new
  apple daily  banana  banana  banana  apple
0     5     3       4       2      10      6
Then selection:
df['old']

   apple  banana
0      5      10

df['new']

   banana  apple
0       4       6
Subtraction aligns by column name; then add_suffix appends _diff to the columns.
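As a minimal sketch of that alignment (toy frames, not the original df): the two sub-frames list their columns in different orders, but subtraction still matches them by name.

```python
import pandas as pd

# Columns appear in different orders, but subtraction aligns them by name
old = pd.DataFrame({'apple': [5], 'banana': [10]})
new = pd.DataFrame({'banana': [4], 'apple': [6]})

diff = (old - new).add_suffix('_diff')
print(diff)
#    apple_diff  banana_diff
# 0          -1            6
```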
Related
I am using python and pandas and have tried a variety of attempts to pivot the following (switch the rows and columns).
Example:
A is unique
A B C D E... (and so on)
[0] apple 2 22 222
[1] peach 3 33 333
[N] ... and so on
And I would like to see
? ? ? ? ... and so on
A apple peach
B 2 3
C 22 33
D 222 333
E
... and so on
I am OK if the columns are named after the values in column "A", and if the first column needs a name, let's call it "name":
name apple peach ...
B 2 3
C 22 33
D 222 333
E
... and so on
I think you're wanting transpose here.
df = pd.DataFrame({'A': {0: 'apple', 1: 'peach'}, 'B': {0: 2, 1: 3}, 'C': {0: 22, 1: 33}})
df = df.T
print(df)
0 1
A apple peach
B 2 3
C 22 33
Edit for the comment: I would probably reset the index and then use df.columns to update the column names with a list. You may want to reset the index again at the end as needed.
df.reset_index(inplace=True)
df.columns = ['name', 'apple', 'peach']
df = df.iloc[1:, :]
print(df)
name apple peach
1 B 2 3
2 C 22 33
try df.transpose() it should do the trick
Taking the advice from the other posts, and a few other tweaks (explained inline), here is what worked for me.
# get the key column that will become the column names,
# and add a name for the existing index column
cols = df['A'].tolist()
cols.append('name')
# Transpose
df = df.T
# the transpose turns the old column names into the index;
# we need to add them back into the data set as a column (you might want
# to drop the index later to get rid of it altogether)
df['name'] = df.index
# now rebuild the columns and move the new "name" column to the first slot
df.columns = cols
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
# remove the first row (it was the column we used for the column names)
df = df.iloc[1:, :]
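The multi-step dance above can be shortened: a sketch of the same result, assuming (as the question states) that column A is unique, by setting it as the index before transposing.

```python
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'peach'], 'B': [2, 3], 'C': [22, 33]})

out = (df.set_index('A').T            # fruit names become the columns
         .rename_axis(columns=None)   # drop the leftover 'A' columns name
         .reset_index()               # turn the B/C index into a column
         .rename(columns={'index': 'name'}))
print(out)
#   name  apple  peach
# 0    B      2      3
# 1    C     22     33
```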
I have two dataframes:
Row No. Subject
1 Apple
2 Banana
3 Orange
4 Lemon
5 Strawberry
row_number Subjects Special?
1 Banana Yes
2 Lemon No
3 Apple No
4 Orange No
5 Strawberry Yes
6 Cranberry Yes
7 Watermelon No
I want to change the Row No. of the first dataframe to match the second. It should be like this:
Row No. Subject
3 Apple
1 Banana
4 Orange
2 Lemon
5 Strawberry
I have tried this code:
for index, row in df1.iterrows():
    if df1['Subject'] == df2['Subjects']:
        df1['Row No.'] = df2['row_number']
But I get the error:
ValueError: Can only compare identically-labeled Series objects
Does that mean the dataframes have to have the same amount of rows and columns? Do they have to be labelled the same too? Is there a way to bypass this limitation?
Edit: I have found a promising alternative:
for x in df1['Subject']:
    if x in df2['Subjects'].values:
        df2.loc[df2['Subjects'] == x]['row_number'] = df1.loc[df1['Subject'] == x]['Row No.']
But it appears it doesn't modify the first dataframe like I want it to. Any tips why?
Furthermore, I get this warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would avoid using for loops, especially when pandas has such great methods to handle these types of problems already.
Using pd.Series.replace
Here is a vectorized way of doing this -
d is the dictionary that maps the fruit to the number in the second dataframe.
You can use df.Subject.replace(d) to now simply replace the keys in the dict d with their values.
Overwrite the Row No. column with this now.
d = dict(zip(df2['Subjects'], df2['row_number']))
df1['Row No.'] = df1.Subject.replace(d)
print(df1)
   Row No.     Subject
0        3       Apple
1        1      Banana
2        4      Orange
3        2       Lemon
4        5  Strawberry
Using pd.merge
Let's try simply merging the 2 dataframes and replacing the column completely.
ddf = pd.merge(df1['Subject'],
               df2[['row_number', 'Subjects']],
               left_on='Subject',
               right_on='Subjects',
               how='left').drop(columns='Subjects')
ddf.columns = df1.columns[::-1]
print(ddf)
Subject Row No.
0 Apple 3
1 Banana 1
2 Orange 4
3 Lemon 2
4 Strawberry 5
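Another option worth noting (a sketch, not from the answer above) is Series.map, which behaves like replace here except that any subject missing from the dictionary becomes NaN instead of being silently kept:

```python
import pandas as pd

df1 = pd.DataFrame({'Row No.': [1, 2, 3, 4, 5],
                    'Subject': ['Apple', 'Banana', 'Orange', 'Lemon', 'Strawberry']})
df2 = pd.DataFrame({'row_number': [1, 2, 3, 4, 5, 6, 7],
                    'Subjects': ['Banana', 'Lemon', 'Apple', 'Orange',
                                 'Strawberry', 'Cranberry', 'Watermelon']})

d = dict(zip(df2['Subjects'], df2['row_number']))

# map() looks each Subject up in d; subjects absent from d would map to NaN,
# whereas replace() would keep the original value
df1['Row No.'] = df1['Subject'].map(d)
print(df1)
```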
Assuming the first is df1 and the second is df2, this should do what you want it to:
import pandas as pd
d1 = {'Row No.': [1, 2, 3, 4, 5],
      'Subject': ['Apple', 'Banana', 'Orange', 'Lemon', 'Strawberry']}
df1 = pd.DataFrame(data=d1)
d2 = {'row_number': [1, 2, 3, 4, 5, 6, 7],
      'Subjects': ['Banana', 'Lemon', 'Apple', 'Orange', 'Strawberry',
                   'Cranberry', 'Watermelon'],
      'Special?': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'No']}
df2 = pd.DataFrame(data=d2)
for x in df1['Subject']:
    if x in df2['Subjects'].values:
        df1.loc[df1['Subject'] == x, 'Row No.'] = (df2.loc[df2['Subjects'] == x]['row_number']).item()
#print(df1)
#print(df2)
In your edited answer it looks like you had the dataframes swapped, and you were missing the .item() call to get the actual row_number value rather than a Series object.
I want to filter rows by multi-column values.
For example, given the following dataframe,
import pandas as pd
df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
print(df)
print(df)
Original dataframe:
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
3 Bob 1 b 1
4 Bob 2 b 2
I want to select the rows whose [name, group, place] combination appears in selectRow.
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
Then the expected dataframe is :
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
I have tried it, but my method is not efficient and runs for a long time, especially when there are many samples in the original dataframe.
My Simple Method:
newdf = pd.DataFrame({})
for item in selectRow:
    print(item)
    tmp = df.loc[(df['name'] == item[0]) & (df['group'] == item[1]) & (df['place'] == item[2])]
    newdf = newdf.append(tmp)
newdf = newdf.reset_index(drop=True)
newdf.tail()
print(newdf)
I'm hoping for an efficient method to achieve this.
Try using isin:
mask = (df['name'].isin(list(zip(*selectRow))[0])
        & df['group'].isin(list(zip(*selectRow))[1])
        & df['place'].isin(list(zip(*selectRow))[2]))
print(df[mask])
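One caveat with the per-column isin approach: it also keeps rows whose values match each column separately but not as a combination (e.g. ("Amy", 2, "a") would pass). A sketch of an exact-combination filter using an inner merge instead:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]

# An inner merge keeps only rows whose (name, group, place) triple
# appears in selectRow, never a mixed combination of values
keys = pd.DataFrame(selectRow, columns=["name", "group", "place"])
out = df.merge(keys, on=["name", "group", "place"])
print(out)
```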
I want to add _x suffix to each column name like so:
featuresA = myPandasDataFrame.columns.values + '_x'
How do I do this? Additionally, if I wanted to add x_ as a suffix, how would the solution change?
The following is the nicest way to add a suffix, in my opinion.
df = df.add_suffix('_some_suffix')
As it is a function that is called on a DataFrame and returns a DataFrame, you can use it in a chain of calls.
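For instance, a minimal sketch of such a chain (the query step is just an illustrative filter, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# add_suffix returns a new DataFrame, so it slots into a method chain
out = (df.query('A > 1')      # hypothetical filtering step
         .add_suffix('_x'))
print(out)
#    A_x  B_x
# 1    2    4
```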
You can use a list comprehension:
df.columns = [str(col) + '_x' for col in df.columns]
There are also built-in methods like .add_suffix() and .add_prefix() as mentioned in another answer.
Elegant In-place Concatenation
If you're trying to modify df in-place, then the cheapest (and simplest) option is in-place addition directly on df.columns (i.e., using Index.__iadd__).
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.columns += '_some_suffix'
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
To add a prefix, you would similarly use
df.columns = 'some_prefix_' + df.columns
df
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Another cheap option is using a list comprehension with f-string formatting (available in Python 3.6+).
df.columns = [f'{c}_some_suffix' for c in df]
df
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
And for prefix, similarly,
df.columns = [f'some_prefix{c}' for c in df]
Method Chaining
It is also possible to add *fixes while method chaining. To add a suffix, use DataFrame.add_suffix:
df.add_suffix('_some_suffix')
A_some_suffix B_some_suffix
0 9 12
1 4 7
2 2 5
3 1 4
This returns a copy of the data. IOW, df is not modified.
Adding prefixes is also done with DataFrame.add_prefix.
df.add_prefix('some_prefix_')
some_prefix_A some_prefix_B
0 9 12
1 4 7
2 2 5
3 1 4
Which also does not modify df.
Critique of add_*fix
These are good methods if you're trying to perform method chaining:
df.some_method1().some_method2().add_*fix(...)
However, add_prefix (and add_suffix) creates a copy of the entire dataframe, just to modify the headers. If you believe this is wasteful, but still want to chain, you can call pipe:
def add_suffix(df):
    df.columns += '_some_suffix'
    return df
df.some_method1().some_method2().pipe(add_suffix)
I know 4 ways to add a suffix (or prefix) to your column names:
1- df.columns = [str(col) + '_some_suffix' for col in df.columns]
or
2- df.rename(columns=lambda col: col + '_some_suffix')
or
3- df.columns += '_some_suffix', which is much easier.
or, the nicest:
4- df.add_suffix('_some_suffix')
I haven't seen this solution proposed above, so I'm adding it to the list:
df.columns += '_x'
And you can easily adapt for the prefix scenario.
Using DataFrame.rename
df = pd.DataFrame({'A': range(3), 'B': range(4, 7)})
print(df)
A B
0 0 4
1 1 5
2 2 6
Using rename with axis=1 and string formatting:
df.rename('col_{}'.format, axis=1)
# or df.rename(columns='col_{}'.format)
col_A col_B
0 0 4
1 1 5
2 2 6
To actually overwrite your column names, we can assign the returned values to our df:
df = df.rename('col_{}'.format, axis=1)
or use inplace=True:
df.rename('col_{}'.format, axis=1, inplace=True)
I figured that this is what I would use quite often, for example:
df = pd.DataFrame({'silverfish': range(3), 'silverspoon': range(4, 7),
                   'goldfish': range(10, 13), 'goldilocks': range(17, 20)})
My way of dynamically renaming:
color_list = ['gold', 'silver']
for i in color_list:
    df[f'color_{i}'] = df.filter(like=i).sum(axis=1)
OUTPUT:
{'silverfish': {0: 0, 1: 1, 2: 2},
 'silverspoon': {0: 4, 1: 5, 2: 6},
 'goldfish': {0: 10, 1: 11, 2: 12},
 'goldilocks': {0: 17, 1: 18, 2: 19},
 'color_gold': {0: 27, 1: 29, 2: 31},
 'color_silver': {0: 4, 1: 6, 2: 8}}
Pandas also has add_prefix and add_suffix methods to do this.
I have a dataframe with the following header:
id, type1, ..., type10, location1, ..., location10
and I want to convert it as follows:
id, type, location
I managed to do this using embedded for loops but it's very slow:
new_format_columns = ['ID', 'type', 'location']
new_format_dataframe = pd.DataFrame(columns=new_format_columns)
print(data.head())
new_index = 0
for index, row in data.iterrows():
    ID = row["ID"]
    for i in range(1, 11):
        # (pd.isna is needed here; a `== np.nan` comparison is always False)
        if pd.isna(row["type" + str(i)]):
            continue
        else:
            new_row = pd.Series([ID, row["type" + str(i)], row["location" + str(i)]])
            new_format_dataframe.loc[new_index] = new_row.values
            new_index += 1
Any suggestions for improvement using native pandas features?
You can use lreshape:
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
Sample:
import pandas as pd
df = pd.DataFrame({
    'type1': {0: 1, 1: 4},
    'id': {0: 'a', 1: 'a'},
    'type10': {0: 1, 1: 8},
    'location1': {0: 2, 1: 9},
    'location10': {0: 5, 1: 7}})
print(df)
  id  location1  location10  type1  type10
0  a          2           5      1       1
1  a          9           7      4       8
types = [col for col in df.columns if col.startswith('type')]
location = [col for col in df.columns if col.startswith('location')]
print(pd.lreshape(df, {'Type':types, 'Location':location}, dropna=False))
id Location Type
0 a 2 1
1 a 9 4
2 a 5 1
3 a 7 8
Another solution with double melt:
print(pd.concat([pd.melt(df, id_vars='id', value_vars=types, value_name='type'),
                 pd.melt(df, value_vars=location, value_name='Location')], axis=1)
        .drop('variable', axis=1))
id type Location
0 a 1 2
1 a 4 9
2 a 1 5
3 a 8 7
EDIT:
lreshape is now undocumented, and it is possible it will be removed in the future (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one - maybe melt, but it is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
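In the meantime, the same reshape can be sketched with pd.wide_to_long. Note that it requires the i column to uniquely identify rows, so this toy frame uses distinct ids (unlike the sample above, where both rows share id 'a'):

```python
import pandas as pd

# Toy frame with unique ids (wide_to_long requires 'i' to be unique)
df = pd.DataFrame({'id': ['a', 'b'],
                   'type1': [1, 4], 'type10': [1, 8],
                   'location1': [2, 9], 'location10': [5, 7]})

# stubnames=['type', 'location'] splits e.g. 'type10' into stub 'type'
# plus suffix 10, which lands in the new index level 'num'
out = (pd.wide_to_long(df, stubnames=['type', 'location'], i='id', j='num')
         .reset_index()
         .drop(columns='num'))
print(out)
```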