Pandas duplicate row based on condition and unstack - python

I have a pandas data frame of the following form:
Name  Age  BMoney  BTime  BEffort
John   22       1      0        0
Pete   54       0      1        0
Lisa   26       0      1        1
And I would like to convert it to
Name  Age  B
John   22  Money
Pete   54  Time
Lisa   26  Effort
Lisa   26  Time
That is, based on the values in the "B<reason>" columns, I would like to create a new column "B" containing the reason. If multiple reasons exist for a person (i.e., a row contains multiple 1's), I would like to create separate rows for that person in the new dataframe, one per reason.

With Multi Index and stack():
# Create the dataframe (same values as in the question)
import pandas as pd
df = [["John", 22, 1, 0, 0],
      ["Pete", 54, 0, 1, 0],
      ["Lisa", 26, 0, 1, 1]]
df = pd.DataFrame(df, columns=["Name", "Age", "BMoney", "BTime", "BEffort"])
# Set Multi Indexing
df.set_index(["Name", "Age"], inplace=True)
# Use the fact that columns and Series can carry names and use stack to do the transformation
df.columns.name = "B"
df = df.stack()
df.name = "value"
df = df.reset_index()
# Keep only the rows flagged with 1, drop the helper column and strip the leading "B" from the labels
df = df[df.value == 1].drop(columns="value")
df["B"] = df["B"].str[1:]
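For comparison, the same reshaping can be sketched with melt instead of a MultiIndex and stack(). This is only an alternative under the assumption that the flag columns all start with a single "B"; the names long and result are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [["John", 22, 1, 0, 0],
     ["Pete", 54, 0, 1, 0],
     ["Lisa", 26, 0, 1, 1]],
    columns=["Name", "Age", "BMoney", "BTime", "BEffort"],
)

# Go from wide to long: one row per (person, flag column) pair.
long = df.melt(id_vars=["Name", "Age"], var_name="B", value_name="flag")

# Keep the flagged rows, drop the helper column and strip the leading "B".
result = (
    long[long["flag"] == 1]
    .drop(columns="flag")
    .assign(B=lambda d: d["B"].str[1:])
    .reset_index(drop=True)
)
print(result)
```

The rows come out grouped by reason rather than by person, so sort by Name afterwards if the order matters.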

Related

How to change column names of Pandas Series object?

I'm trying to prefix the column names in each group of my stacked pandas Series with the value of one of the other columns. My goal is to turn a pandas DataFrame with 3 columns into a DataFrame with a single column named 'Data' (or whatever). Below is an example of stacking a DataFrame to get a single dimension to work with.
df_single_level_cols = pd.DataFrame([[0, 1, 20], [2, 3, 40]],columns=['weight', 'height', 'girth'])
df = df_single_level_cols.stack()
print(df)
0  weight     0
   height     1
   girth     20
1  weight     2
   height     3
   girth     40
dtype: int64
For each series I need to prefix both column names weight and height with the value of girth. When that is done I will drop girth from the equation, leaving me with only the weight and height for my series. After prefixing and dropping the series object should look like the following:
0  20weight    0
   20height    1
1  40weight    2
   40height    3
dtype: int64
Then when converting this to a Dataframe I shall have the following:
Data
20weight 0
20height 1
40weight 2
40height 3
I've tried messing around with .apply(...), .rename(...) and .add_prefix(...), but none of them seem to do the trick. If I do something like
df[0] = df[0].add_prefix("test")
I get an error about setting an array element with a sequence. This also does not actually use the value of girth; it was more a way of getting accustomed to the rename functionality.
You can melt instead:
df = (df_single_level_cols
.astype({'girth': str})
.melt('girth', value_name='Data')
.assign(**{'girth': lambda d: d['girth']+d.pop('variable')})
.set_index('girth')
)
output:
Data
girth
20weight 0
40weight 2
20height 1
40height 3
Like this?
df = pd.DataFrame([[0, 1, 20], [2, 3, 40]],columns=['weight', 'height', 'girth'])
df = df[['weight', 'height']].stack().reset_index(level=1).merge(df.girth, left_index=True, right_index=True, how='left')
df = df.set_index(df.girth.astype(str) + df.level_1).rename(columns={0: 'Data'})[['Data']]
> df
Data
20weight 0
20height 1
40weight 2
40height 3
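Another way to get there, sketched under the assumption that girth is only needed as a prefix: move girth into the index before stacking, then join the two index levels into one label. The names s and result are just for illustration.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 20], [2, 3, 40]], columns=['weight', 'height', 'girth'])

# With girth in the index, stack() yields a (girth, column name) MultiIndex.
s = df.set_index('girth').stack()

# Collapse each (girth, column name) pair into a single string label.
s.index = [f"{girth}{name}" for girth, name in s.index]

result = s.to_frame('Data')
print(result)
```

This keeps the row order of the stacked Series, so the labels alternate weight and height per original row.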

pandas pivot data Cols to rows and rows to cols

I am using Python and pandas and have tried a variety of approaches to pivot the following (switch the rows and columns).
Example:
A is unique
A B C D E... (and so on)
[0] apple 2 22 222
[1] peach 3 33 333
[N] ... and so on
And I would like to see
? ? ? ? ... and so on
A apple peach
B 2 3
C 22 33
D 222 333
E
... and so on
I am OK if the columns are named after the values in column "A", and if the first column needs a name, let's call it "name".
name apple peach ...
B 2 3
C 22 33
D 222 333
E
... and so on
I think you want transpose here.
df = pd.DataFrame({'A': {0: 'apple', 1: 'peach'}, 'B': {0: 2, 1: 3}, 'C': {0: 22, 1: 33}})
df = df.T
print(df)
0 1
A apple peach
B 2 3
C 22 33
Edit for comment. I would probably reset the index and then use the df.columns to update the column names with a list. You may want to reset the index again at the end as needed.
df.reset_index(inplace=True)
df.columns = ['name', 'apple', 'peach']
df = df.iloc[1:, :]
print(df)
name apple peach
1 B 2 3
2 C 22 33
Try df.transpose(); it should do the trick.
Taking the advice from the other posts, plus a few other tweaks (explained inline), here is what worked for me.
# get the key column whose values will become the column names,
# and add a name for the column that will hold the old column labels
cols = df['A'].tolist()
cols.append('name')
# Transpose
df = df.T
# the transpose turns the old column labels into the index.
# add them back into the data set as a regular column (you might
# want to drop the index later to get rid of it altogether)
df['name'] = df.index
# now rebuild the columns and move the new "name" column to the first position
df.columns = cols
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]
# remove the first row (it was the column we used for the column names)
df = df.iloc[1:, :]
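A shorter variant of the same idea, as a sketch: setting "A" as the index before transposing means its values become the column names directly, so there is no row to remove afterwards. It assumes the values in "A" are unique, as stated in the question; the sample values and the name result are only for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'peach'],
                   'B': [2, 3],
                   'C': [22, 33],
                   'D': [222, 333]})

# Use "A" as the index so that its values become the column labels after .T
result = df.set_index('A').T

# Name the new index "name" and clear the leftover "A" label on the columns.
result.index.name = 'name'
result.columns.name = None

# Turn the index into a regular first column.
result = result.reset_index()
print(result)
```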

Change column values based on other dataframe columns

I have two dataframes that look like this
df1 ==
IDLocation x-coord y-coord
1 -1.546 7.845
2 3.256 1.965
.
.
35 5.723 -2.724
df2 ==
PIDLocation DIDLocation
14 5
3 2
7 26
I want to replace the columns PIDLocation and DIDLocation with Px-coord, Py-coord, Dx-coord and Dy-coord. Both PIDLocation and DIDLocation hold IDLocation values, and each IDLocation corresponds to an x-coord and y-coord in the first dataframe.
If you set the ID column as the index of df1, you can get the coord values by indexing. I changed the values in df2 in the example below to avoid index errors that would result from not having the full dataset.
import pandas as pd
df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
'x-coord': [-1.546, 3.256, 5.723],
'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
'DIDLocation': [2, 1, 35]})
df1.set_index('IDLocation', inplace=True)
df2['Px-coord'] = [df1['x-coord'].loc[i] for i in df2.PIDLocation]
df2['Py-coord'] = [df1['y-coord'].loc[i] for i in df2.PIDLocation]
df2['Dx-coord'] = [df1['x-coord'].loc[i] for i in df2.DIDLocation]
df2['Dy-coord'] = [df1['y-coord'].loc[i] for i in df2.DIDLocation]
del df2['PIDLocation']
del df2['DIDLocation']
print(df2)
Px-coord Py-coord Dx-coord Dy-coord
0 5.723 -2.724 3.256 1.965
1 -1.546 7.845 -1.546 7.845
2 3.256 1.965 5.723 -2.724
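The same lookup can also be sketched with Series.map, which avoids the Python-level list comprehensions. It uses the same reduced sample values as above; coords and result are names made up for the example. IDs that are missing from df1 would come back as NaN rather than raising an error.

```python
import pandas as pd

df1 = pd.DataFrame({'IDLocation': [1, 2, 35],
                    'x-coord': [-1.546, 3.256, 5.723],
                    'y-coord': [7.845, 1.965, -2.724]})
df2 = pd.DataFrame({'PIDLocation': [35, 1, 2],
                    'DIDLocation': [2, 1, 35]})

# Index the coordinates by ID so that each column can act as a lookup table.
coords = df1.set_index('IDLocation')

# map() looks up every ID in the given Series and returns the matching value.
result = pd.DataFrame({
    'Px-coord': df2['PIDLocation'].map(coords['x-coord']),
    'Py-coord': df2['PIDLocation'].map(coords['y-coord']),
    'Dx-coord': df2['DIDLocation'].map(coords['x-coord']),
    'Dy-coord': df2['DIDLocation'].map(coords['y-coord']),
})
print(result)
```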

How to quickly select dataframe rows by multiple columns in pandas

I want to filter rows by multi-column values.
For example, given the following dataframes,
import pandas as pd
df = pd.DataFrame({"name":["Amy", "Amy", "Amy", "Bob", "Bob",],
"group":[1, 1, 1, 1, 2],
"place":['a', 'a', "a", 'b', 'b'],
"y":[1, 2, 3, 1, 2]
})
print(df)
Original dataframe:
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
3 Bob 1 b 1
4 Bob 2 b 2
I want to select the samples that match one of the [name, group, place] combinations in selectRow.
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
Then the expected dataframe is :
name group place y
0 Amy 1 a 1
1 Amy 1 a 2
2 Amy 1 a 3
I have tried it, but my method is inefficient and runs for a long time, especially when there are many samples in the original dataframe.
My Simple Method:
# Filter once per requested combination and collect the matches.
# DataFrame.append was removed in pandas 2.0, so the parts are joined with pd.concat.
parts = []
for item in selectRow:
    print(item)
    tmp = df.loc[(df['name'] == item[0]) & (df['group'] == item[1]) & (df['place'] == item[2])]
    parts.append(tmp)
newdf = pd.concat(parts).reset_index(drop=True)
newdf.tail()
print(newdf)
I am hoping for an efficient way to achieve this.
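Try using isin. Note that this checks each column independently, so it would also keep rows that mix values from different entries of selectRow (for example ["Amy", 1, "b"]); it gives the expected result on the sample data only because no such row exists: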
print(df[df['name'].isin(list(zip(*selectRow))[0]) & df['group'].isin(list(zip(*selectRow))[1]) & df['place'].isin(list(zip(*selectRow))[2])])
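If the three values have to match together as one combination, an inner merge against a small key dataframe is one way to do it. This is a sketch, not a benchmarked solution; keys and result are names made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]

# Put the wanted combinations into their own dataframe, one row per combination.
# drop_duplicates() guards against repeated entries duplicating the matches.
key_cols = ["name", "group", "place"]
keys = pd.DataFrame(selectRow, columns=key_cols).drop_duplicates()

# An inner merge keeps only the rows of df whose three key columns
# match one row of keys exactly.
result = df.merge(keys, on=key_cols, how="inner")
print(result)
```

The merge does the matching in a single call instead of one boolean filter per entry of selectRow.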

Pandas: Converting Columns to Rows based on ID

I am new to pandas,
I have the following dataframe:
df = pd.DataFrame([[1, 'name', 'peter'], [1, 'age', 23], [1, 'height', '185cm']], columns=['id', 'column','value'])
id column value
0 1 name peter
1 1 age 23
2 1 height 185cm
I need to create a single row for each ID. Like so:
id name age height
0 1 peter 23 185cm
Any help is greatly appreciated, thank you.
You can use pivot_table with aggregate join:
df = pd.DataFrame([[1, 'name', 'peter'],
[1, 'age', 23],
[1, 'height', '185cm'],
[1, 'age', 25]], columns=['id', 'column','value'])
print (df)
id column value
0 1 name peter
1 1 age 23
2 1 height 185cm
3 1 age 25
df1 = df.astype(str).pivot_table(index="id",columns="column",values="value",aggfunc=','.join)
print (df1)
column age height name
id
1 23,25 185cm peter
Another solution with groupby + apply join and unstack:
df1 = df.astype(str).groupby(["id","column"])["value"].apply(','.join).unstack(fill_value=0)
print (df1)
column age height name
id
1 23,25 185cm peter
Assuming your dataframe is named "df" and has no duplicate (id, column) pairs, the line below would help:
df.pivot(index="id", columns="column", values="value")
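To get exactly the layout asked for in the question (id as a regular column, then name, age, height), the pivot result needs a little cleanup. A sketch using the question's original three rows; wide is a name made up for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 'name', 'peter'], [1, 'age', 23], [1, 'height', '185cm']],
                  columns=['id', 'column', 'value'])

# One row per id, one column per distinct entry of "column".
wide = df.pivot(index='id', columns='column', values='value')

# pivot sorts the new columns alphabetically; restore the wanted order.
wide = wide[['name', 'age', 'height']]

# Clear the leftover "column" label and turn id back into a regular column.
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```

Unlike the pivot_table and groupby versions above, pivot raises an error if an id has the same attribute twice, so this only fits data without duplicates.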
