Python dataframe transpose where some rows have multiple values - python

I've a dataframe:
field,value
a,1
a,2
b,8
I want to pivot it into this form:
a,b
1,8
2,8

Use set_index with a cumcount over each field group plus the field column itself, then unstack and ffill:
df.set_index(
[df.groupby('field').cumcount(), 'field']
).value.unstack().ffill().astype(df.value.dtype)
field a b
0 1 8
1 2 8
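To make the role of the cumcount step visible, here is a self-contained sketch of the same approach that keeps the counter in its own variable. The names `counter` and `wide` are illustrative and do not come from the answer.

```python
import pandas as pd

df = pd.DataFrame({"field": ["a", "a", "b"], "value": [1, 2, 8]})

# Number each row within its 'field' group: a -> 0, 1 and b -> 0.
counter = df.groupby("field").cumcount()

# Index by (counter, field), move 'field' into the columns,
# then forward-fill the gap left by the shorter 'b' group.
wide = (
    df.set_index([counter, "field"])["value"]
    .unstack()
    .ffill()
    .astype(df["value"].dtype)
)
print(wide)
```

Without the counter, the two rows for a would share the same index entry and could not be spread over separate output rows.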

You can do it like so:
# df = pd.read_clipboard(sep=',')
df.pivot(columns='field', values='value').bfill().dropna()

print (df)
0 1
0 a 1
1 a 2
2 b 8
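Solution that creates groups for the new index with GroupBy.cumcount, then pivots and forward-fills the missing values:
g = df.groupby(0).cumcount()
# Recent pandas versions require the data to pivot, so attach the counter as a column first
df1 = (df.assign(g=g)
         .pivot(index='g', columns=0, values=1)
         .ffill()
         .astype(int)
         .rename_axis(index=None, columns=None))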
print (df1)
a b
0 1 8
1 2 8
Another solution builds the groups with apply and reshapes with unstack:
print (df.groupby(0).apply(lambda x: pd.Series(x[1].values)).unstack(0).ffill().astype(int)
.rename_axis(None, axis=1))
a b
0 1 8
1 2 8

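A plain transpose is available as DataFrame.T, but note that it only swaps rows and columns, so on this frame it returns the rows field and value rather than the requested a, b layout: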
df_new = df.T
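For comparison, a minimal sketch of what the transpose returns on the question's sample frame (rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({"field": ["a", "a", "b"], "value": [1, 2, 8]})

# .T swaps the axes: the old column names become the index,
# and the old row labels 0, 1, 2 become the columns.
transposed = df.T
print(transposed)
print(transposed.shape)  # (2, 3)
```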

Related

Shuffle Columns in Dataframe

I want to shuffle the columns pseudo-randomly, in one line of code.
Before:
A B
0 1 2
1 1 2
After:
B A
0 2 1
1 2 1
My attempts so far:
df = df.reindex(columns=columns)
df.sample(frac=1, axis=1)
df.apply(np.random.shuffle, axis=1)
You can use np.random.default_rng()'s permutation with a seed to make it reproducible.
df = df[np.random.default_rng(seed=42).permutation(df.columns.values)]
Use DataFrame.sample with the axis argument set to columns (1):
df = df.sample(frac=1, axis=1)
print(df)
B A
0 2 1
1 2 1
Or use Series.sample with columns converted to Series and change order of columns by subset:
df = df[df.columns.to_series().sample(frac=1)]
print(df)
B A
0 2 1
1 2 1
Use numpy.random.permutation with list of column names.
df = df[np.random.permutation(df.columns)]
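If the shuffle needs to be repeatable, DataFrame.sample also accepts a random_state argument. The sketch below uses a three-column frame made up for illustration; the resulting order depends on the seed, so no particular order is claimed.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1], "B": [2, 2], "C": [3, 3]})

# frac=1 keeps every column; axis=1 samples columns instead of rows.
# A fixed random_state makes the column order the same on every run.
shuffled = df.sample(frac=1, axis=1, random_state=0)
print(list(shuffled.columns))

# Only the order changes: each column keeps its own values.
print(shuffled["A"].tolist())
```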

Pandas groupby with new column for each value

I hope the title speaks for itself; I'd just like to add that each key can be assumed to have the same number of values.
Searching online for the title turned up the following question:
Split pandas dataframe based on groupby
It is supposed to solve my problem, but it does not.
I'll give an example:
Input:
pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
Output:
pd.DataFrame(data={'a':['foo','bar'],'b':[1,4],'c':[2,5],'d':[3,6]})
Intuitively, it would be a groupby function without an aggregation function, or an aggregation function that makes a list out of the keys.
Obviously, it can be done 'manually' using for loops etc., but using for loops with large data sets is very expensive computationally.
Use GroupBy.cumcount to build a counter Series or column g, then reshape with DataFrame.set_index + Series.unstack or with DataFrame.pivot, and finally clean up with DataFrame.add_prefix, DataFrame.reset_index and DataFrame.rename_axis:
g = df1.groupby('a').cumcount()
df = (df1.set_index(['a', g])['b']
.unstack()
.add_prefix('new_')
.reset_index()
.rename_axis(None, axis=1))
print (df)
a new_0 new_1 new_2
0 bar 4 5 6
1 foo 1 2 3
Or:
df1['g'] = df1.groupby('a').cumcount()
df = df1.pivot(index='a', columns='g', values='b').add_prefix('new_').reset_index().rename_axis(None, axis=1)
print (df)
a new_0 new_1 new_2
0 bar 4 5 6
1 foo 1 2 3
Here is an alternative approach, using groupby.apply and string.ascii_lowercase if column names are important:
from string import ascii_lowercase
df = pd.DataFrame(data={'a':['foo','foo','foo','bar','bar','bar'],'b':[1,2,3,4,5,6]})
# Groupby 'a'
g = df.groupby('a')['b'].apply(list)
# Construct new DataFrame from g
new_df = pd.DataFrame(g.values.tolist(), index=g.index).reset_index()
# Fix column names
new_df.columns = list(ascii_lowercase[:new_df.shape[1]])
print(new_df)
a b c d
0 bar 4 5 6
1 foo 1 2 3
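The question assumes every key has the same number of values. As a hedged sketch of what happens when that assumption does not hold, the cumcount approach pads the shorter group with NaN. The frame below drops one bar row purely for illustration.

```python
import pandas as pd

# 'bar' has only two values here, 'foo' has three (illustrative data).
df1 = pd.DataFrame({"a": ["foo", "foo", "foo", "bar", "bar"],
                    "b": [1, 2, 3, 4, 5]})

# Position of each row inside its group.
g = df1.groupby("a").cumcount()

wide = (
    df1.set_index(["a", g])["b"]
    .unstack()              # one column per position
    .add_prefix("new_")
    .reset_index()
    .rename_axis(None, axis=1)
)
print(wide)
```

Because of the missing value, the value columns come back as floats; fill or convert them afterwards if integers are needed.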

Combining similar dataframe rows

I currently have a dataframe which looks like this
User Date FeatureA FeatureB
John DateA 1 2
John DateB 3 5
Is there any way I can combine the 2 rows so that it becomes
User Date1 Date2 FeatureA1 FeatureB1 FeatureA2 FeatureB2
John DateA DateB 1 2 3 5
I think you need:
g = df.groupby(['User']).cumcount()
df = df.set_index(['User', g]).unstack()
df.columns = ['{}{}'.format(i, j+1) for i, j in df.columns]
df = df.reset_index()
print (df)
User Date1 Date2 FeatureA1 FeatureA2 FeatureB1 FeatureB2
0 John DateA DateB 1 3 2 5
Explanation:
Get a running count within each User group with cumcount
Create a MultiIndex with set_index
Reshape with unstack
Flatten the MultiIndex in the columns
Convert the index to columns with reset_index
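The steps above can be sketched as a self-contained script that keeps each intermediate result. The frame is rebuilt from the question, and the name `wide` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["John", "John"],
    "Date": ["DateA", "DateB"],
    "FeatureA": [1, 3],
    "FeatureB": [2, 5],
})

# Step 1: position of each row within its User group (0, 1, ...).
g = df.groupby("User").cumcount()

# Steps 2-3: index by (User, position), then move position into the columns.
wide = df.set_index(["User", g]).unstack()
print(wide.columns.tolist())  # pairs such as ('Date', 0), ('Date', 1), ...

# Step 4: flatten each (name, position) pair into a single 1-based label.
wide.columns = [f"{name}{pos + 1}" for name, pos in wide.columns]

# Step 5: turn User back into an ordinary column.
wide = wide.reset_index()
print(wide)
```

The columns come out grouped by original column name (Date1, Date2, FeatureA1, ...); reorder them afterwards if a different layout is wanted.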

How to count data in a column based on another column separately?

I have two dataframes like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count how often each number in df1 occurs in df2. The correct answer looks like:
No Amount
1 3
2 2
Instead of:
No Amount
1 5
2 5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
df2[df2['a'].isin(df1['a'].unique())]['a']
.value_counts()
.reset_index()
)
result.columns = ['No', 'Amount']
>>> result
No Amount
0 1 3
1 2 2
Since pandas 0.21.0 you can use set_axis to rename columns inside a method chain (its inplace argument was removed in pandas 2.0, so it is omitted here). Here's a one-statement solution:
df2[df2.a.isin(df1.a)]\
.squeeze()\
.value_counts()\
.reset_index()\
.set_axis(['No','Amount'], axis=1)
Output:
No Amount
0 1 3
1 2 2
You can simply compute value_counts of the second df and map it onto the first df, i.e.:
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output :
No Amount
0 1 3
1 2 2
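One thing to watch with the map approach: a value of df1 that never occurs in df2 maps to NaN. A minimal sketch, assuming such values should count as 0 (the extra value 9 is added here only for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 9]})  # 9 does not occur in df2
df2 = pd.DataFrame({"a": [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8]})

# Occurrences of every distinct value in df2, indexed by the value itself.
counts = df2["a"].value_counts()

result = (
    df1.assign(Amount=df1["a"].map(counts).fillna(0).astype(int))
    .rename(columns={"a": "No"})
)
print(result)
```

Using assign instead of in-place assignment leaves df1 unchanged.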

Make new column in pandas dataframe by adding values from other columns

I have a dataframe with values like
A B
1 4
2 6
3 9
I need to add a new column by adding values from column A and B, like
A B C
1 4 5
2 6 8
3 9 12
I believe this can be done using lambda function, but I can't figure out how to do it.
Very simple:
df['C'] = df['A'] + df['B']
Building a little more on Anton's answer, you can add all the columns like this:
df['sum'] = df[list(df.columns)].sum(axis=1)
The simplest way would be to use DeepSpace answer. However, if you really want to use an anonymous function you can use apply:
df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
You could use the sum function to achieve that, as @EdChum mentioned in the comments:
df['C'] = df[['A', 'B']].sum(axis=1)
In [245]: df
Out[245]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
You could do:
df['C'] = df.sum(axis=1)
If you only want to do numerical values:
df['C'] = df.sum(axis=1, numeric_only=True)
The axis parameter takes either 0 or 1: axis=0 sums down the rows and gives one total per column, while axis=1 sums across the columns and gives one total per row.
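A short sketch on the question's data shows the difference between the two axis values (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 6, 9]})

# axis=0: collapse the rows, one total per column.
col_totals = df.sum(axis=0)
print(col_totals.tolist())  # [6, 19]

# axis=1: collapse the columns, one total per row.
row_totals = df.sum(axis=1)
print(row_totals.tolist())  # [5, 8, 12]
```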
As of Pandas version 0.16.0 you can use assign as follows:
df = pd.DataFrame({"A": [1,2,3], "B": [4,6,9]})
df.assign(C = df.A + df.B)
# Out[383]:
# A B C
# 0 1 4 5
# 1 2 6 8
# 2 3 9 12
You can add multiple columns this way as follows:
df.assign(C = df.A + df.B,
Diff = df.B - df.A,
Mult = df.A * df.B)
# Out[379]:
# A B C Diff Mult
# 0 1 4 5 3 4
# 1 2 6 8 4 12
# 2 3 9 12 6 27
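Since the question mentions lambda functions: assign also accepts callables, which receive the frame being built, so a later column can refer to one created earlier in the same call. A minimal sketch, with `C_doubled` as a made-up column name:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 6, 9]})

out = df.assign(
    C=lambda d: d["A"] + d["B"],      # computed from the original columns
    C_doubled=lambda d: d["C"] * 2,   # can already see the new column C
)
print(out)
```

assign returns a new frame, so df itself keeps only columns A and B.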
Concerning n00b's comment: "I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead"
I was getting the same error. In my case it was because I was trying to perform the column addition on a dataframe that was created like this:
df_b = df[['colA', 'colB', 'colC']]
instead of:
df_c = pd.DataFrame(df, columns=['colA', 'colB', 'colC'])
df_b is a copy of a slice from df.
df_c is a new dataframe, so
df_c['colD'] = df['colA'] + df['colB']+ df['colC']
will add the columns and won't raise any warning. Same if .sum(axis=1) is used.
I wanted to add a comment responding to the error message n00b was getting but I don't have enough reputation. So my comment is an answer in case it helps anyone...
n00b said:
I get the following warning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
He got this warning because whatever manipulations he did to his dataframe prior to creating df['C'] created a view into the dataframe rather than a copy of it. The warning didn't arise from the simple calculation df['C'] = df['A'] + df['B'] suggested by DeepSpace.
Have a look at the Returning a view versus a copy docs.
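A common way to sidestep the warning is to take an explicit copy of the column subset before adding to it. The sketch below is one such approach under made-up column names; it is not taken from either answer above.

```python
import pandas as pd

df = pd.DataFrame({"colA": [1, 2], "colB": [3, 4], "colC": [5, 6], "other": [0, 0]})

# .copy() makes the subset an independent frame, so assigning a new
# column to it cannot be mistaken for a write into df.
subset = df[["colA", "colB", "colC"]].copy()
subset["colD"] = subset["colA"] + subset["colB"] + subset["colC"]

print(subset)
print("colD" in df.columns)  # False: the original frame is untouched
```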
You can do this using loc:
In [37]: df = pd.DataFrame({"A":[1,2,3],"B":[4,6,9]})
In [38]: df
Out[38]:
A B
0 1 4
1 2 6
2 3 9
In [39]: df['C']=df.loc[:,['A','B']].sum(axis=1)
In [40]: df
Out[40]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
eval lets you sum and create columns right away:
In [8]: df.eval('C = A + B', inplace=True)
In [9]: df
Out[9]:
A B C
0 1 4 5
1 2 6 8
2 3 9 12
Since inplace=True you don't need to assign it back to df.
You can solve it simply by adding the two columns:
df['C'] = df['A'] + df['B']
