I have a dataframe containing some data that I want to reshape, so that the values of one column become the columns of the new dataframe.
>>> import pandas as pd
>>> df = pd.DataFrame([['a','a','b','b'],[6,7,8,9]], index=['A','B']).T
>>> df
   A  B
0  a  6
1  a  7
2  b  8
3  b  9
The values of column A should become the column names of the new dataframe. The result of the transformation should look like this:
a b
0 6 8
1 7 9
What I came up with so far didn't work completely:
>>> pd.DataFrame({ k : df.loc[df['A'] == k, 'B'] for k in df['A'].unique() })
     a    b
0    6  NaN
1    7  NaN
2  NaN    8
3  NaN    9
Besides this being incorrect, I guess there probably is a more efficient way anyway. I'm just really having a hard time understanding how to handle things with pandas.
You were almost there, but you need .values so that each piece is a plain array (its original index is discarded), and then you can provide the column names.
pd.DataFrame(pd.DataFrame({ k : df.loc[df['A'] == k, 'B'].values for k in df['A'].unique() }), columns=df['A'].unique())
Output:
a b
0 6 8
1 7 9
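The NaNs in the original attempt come from index alignment: each Series returned by df.loc keeps its original row labels (0-3), so the b values end up in rows 2 and 3. Using .values strips that index; resetting it per Series works just as well. A minimal sketch of that variant, assuming the same df as above:
pd.DataFrame({k: df.loc[df['A'] == k, 'B'].reset_index(drop=True)
              for k in df['A'].unique()})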
Using a dictionary comprehension with groupby:
res = pd.DataFrame({col: vals['B'].values for col, vals in df.groupby('A')})
print(res)
a b
0 6 8
1 7 9
Use set_index, groupby, cumcount, and unstack:
(df.set_index(['A', df.groupby('A').cumcount()])['B']
   .unstack(0)
   .rename_axis([None], axis=1))
Output:
a b
0 6 8
1 7 9
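For comparison, the same reshape can also be written with pivot by first building an explicit within-group position; a minimal sketch, assuming the same df with columns A and B:
out = (df.assign(key=df.groupby('A').cumcount())    # row position within each 'A' group
         .pivot(index='key', columns='A', values='B'))
out.index.name = None     # drop the leftover 'key' / 'A' axis names
out.columns.name = None
out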
I have a dataframe like this
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(2, 11).reshape(-1, 3),
    index=list('ABC'),
    columns=pd.MultiIndex.from_arrays([
        ['data1', 'data2', 'data3'],
        ['F', 'K', ''],
        ['', '', '']
    ], names=['meter', 'Sleeper', ''])
).rename_axis('Index')
df
meter   data1 data2 data3
Sleeper     F     K
Index
A           2     3     4
B           5     6     7
C           8     9    10
So I want to join the level names and flatten the columns, following this solution: Pandas dataframe with multiindex column - merge levels
df.columns = df.columns.map('_'.join).str.strip('|')
df.reset_index(inplace=True)
Getting this
Index data1_F_ data2_K_ data3__
0 A 2 3 4
1 B 5 6 7
2 C 8 9 10
but I don't want those _ at the end of the column names, so I added
df.columns = df.columns.apply(lambda x: x[:-1] if x.endswith('_') else x)
df
But got
AttributeError: 'Index' object has no attribute 'apply'
How can I combine map and apply (flatten the column names and remove the _ at the end of the column names) in one run?
expected output
Index data1_F data2_K data3
0 A 2 3 4
1 B 5 6 7
2 C 8 9 10
Thanks
You can try this:
df.columns = df.columns.map('_'.join).str.strip('_')
df
Out[132]:
data1_F data2_K data3
Index
A 2 3 4
B 5 6 7
C 8 9 10
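An alternative sketch that avoids creating the trailing underscores in the first place: MultiIndex columns iterate as tuples of level values, so you can join only the non-empty entries (filter(None, ...) drops the '' levels). Applied to the original MultiIndex frame, before any flattening:
df.columns = ['_'.join(filter(None, levels)) for levels in df.columns]
df.reset_index(inplace=True)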
I hope someone can help me solve my issue.
Given a pandas dataframe as depicted in the image below,
I would like to re-arrange it into a new dataframe, combining several sets of columns (the sets all have the same size) such that each set becomes a single column, as shown in the desired result image below.
Thank you in advance for any tips.
For a general solution, you can try one of these two options:
You could try this, using OrderedDict to collect the distinct letters of the column names in order of appearance, pd.DataFrame.filter to select the columns sharing each letter, and then pd.DataFrame.stack to concatenate their values:
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame([[0,1,2,3,4],[5,6,7,8,9]], columns=['a1','a2','b1','b2','c'])
newdf = pd.DataFrame()
for col in list(OrderedDict.fromkeys(''.join(df.columns)).keys()):
    if col.isalpha():
        newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reset_index(drop=True)
Output:
df
a1 a2 b1 b2 c
0 0 1 2 3 4
1 5 6 7 8 9
newdf
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
Another way to get the column names is to use re and a set like this, and then sort the columns alphabetically:
import re

newdf = pd.DataFrame()
for col in set(re.findall(r'[^\W\d_]', ''.join(df.columns))):
    newdf[col] = df.filter(like=col, axis=1).stack().reset_index(level=1, drop=True)
newdf = newdf.reindex(sorted(newdf.columns), axis=1).reset_index(drop=True)
Output:
newdf
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
You can do this with pd.wide_to_long and rename the 'c' column:
df_out = pd.wide_to_long(df.reset_index().rename(columns={'c': 'c1'}),
                         ['a', 'b', 'c'], 'index', 'no')
df_out = df_out.reset_index(drop=True).ffill().astype(int)
df_out
Output:
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
It's the same dataframe, just the sorting is different.
pd.wide_to_long(df, ['a','b'], 'c', 'no').reset_index().drop('no', axis=1)
Output:
c a b
0 4 0 2
1 9 5 7
2 4 1 3
3 9 6 8
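If you want the same row order as the other answers, one option is to sort on one of the value columns afterwards; a small sketch of that idea:
(pd.wide_to_long(df, ['a','b'], 'c', 'no')
   .reset_index()
   .drop('no', axis=1)
   .sort_values('a')
   .reset_index(drop=True))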
The fact that column c had only one column, versus the other letters having two columns each, made it kind of tricky. I first stacked the dataframe and got rid of the numbers in the column names. Then, for a and b, I pivoted a dataframe and removed all NaNs. For c, I repeated its values so that its length matched a and b (twice as long) and then merged it in with a and b.
input:
import pandas as pd
df = pd.DataFrame({'a1': {0: 0, 1: 5},
                   'a2': {0: 1, 1: 6},
                   'b1': {0: 2, 1: 7},
                   'b2': {0: 3, 1: 8},
                   'c':  {0: 4, 1: 9}})
df
code:
import numpy as np  # needed below for np.repeat

df1 = df.copy().stack().reset_index().replace('[0-9]+', '', regex=True)
dfab = df1[df1['level_1'].isin(['a', 'b'])].pivot(index=0, columns='level_1', values=0) \
    .apply(lambda x: pd.Series(x.dropna().values)).astype(int)
dfc = pd.DataFrame(np.repeat(df['c'].values, 2, axis=0)).rename({0: 'c'}, axis=1)
df2 = pd.merge(dfab, dfc, how='left', left_index=True, right_index=True)
df2
output:
a b c
0 0 2 4
1 1 3 4
2 5 7 9
3 6 8 9
Hi, I need to sort a data frame. My data frame looks like the one below.
A B
2 5
3 9
2 7
I want to sort this by column A.
A B
2 5
2 7
3 9
When there are duplicates in column A,
sorted_data=data.sort_values(by=['A'], inplace=True)
doesn't work out. Any suggestion on how I can fix this?
It has worked correctly: with inplace=True the sorting is applied to your original DataFrame (data in your case) and the method returns None, which is what ends up in sorted_data.
If you want the ordered dataframe stored in sorted_data, do the following:
sorted_data = data.sort_values(by=['A'])
For example:
>>> df = pd.DataFrame({'A': [2,3,2], 'B': [5,9,7]})
>>> df.sort_values(by=['A'],inplace=True)
>>> df
   A  B
0  2  5
2  2  7
1  3  9
The other way:
>>> df = pd.DataFrame({'A': [2,3,2], 'B': [5,9,7]})
>>> sorted_df = df.sort_values(by=['A'])
>>> sorted_df
   A  B
0  2  5
2  2  7
1  3  9
>>> df
   A  B
0  2  5
1  3  9
2  2  7
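The key point is that with inplace=True the method returns None, which is what the original assignment stored; a quick check:
>>> df = pd.DataFrame({'A': [2,3,2], 'B': [5,9,7]})
>>> result = df.sort_values(by=['A'], inplace=True)
>>> print(result)
None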
If I wanted to go from a long format to a grouped, aggregated format, I would simply do:
s = pd.DataFrame(['a','a','a','a','b','b','c'], columns=['value'])
s.groupby('value').size()
value
a 4
b 2
c 1
dtype: int64
Now if I wanted to revert that aggregation and go from a grouped format to a long format, how would I go about doing that? I guess I could loop through the grouped series and repeat 'a' 4 times and 'b' 2 times etc.
Is there a better way to do this in pandas or any other Python package?
Thankful for any hints
Perhaps .transform can help with this:
s.set_index('value', drop=False, inplace=True)
s['size'] = s.groupby(level='value')['value'].transform('size')
s.reset_index(inplace=True, drop=True)
s
yielding:
value size
0 a 4
1 a 4
2 a 4
3 a 4
4 b 2
5 b 2
6 c 1
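A shorter variant of the same idea that skips the set_index/reset_index round trip (a sketch, using the same s as above):
s['size'] = s.groupby('value')['value'].transform('size')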
Another and rather simple approach is to use np.repeat (assuming numpy is imported as np and s2 is the aggregated series):
In [17]: np.repeat(s2.index.values, s2.values)
Out[17]: array(['a', 'a', 'a', 'a', 'b', 'b', 'c'], dtype=object)
In [18]: pd.DataFrame(np.repeat(s2.index.values, s2.values), columns=['value'])
Out[18]:
value
0 a
1 a
2 a
3 a
4 b
5 b
6 c
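pandas can also do the repetition itself via Index.repeat, which avoids dropping down to numpy; a sketch, again assuming s2 is the aggregated series:
pd.DataFrame({'value': s2.index.repeat(s2.values)})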
There might be something cleaner, but here's an approach. First, store your groupby results in a dataframe and rename the columns.
agg = s.groupby('value').size().reset_index()
agg.columns = ['key', 'count']
Then, build a frame with columns that track the count for each letter.
counts = agg['count'].apply(lambda x: pd.Series([0] * x))
counts['key'] = agg['key']
In [107]: counts
Out[107]:
   0   1   2   3 key
0  0   0   0   0   a
1  0   0 NaN NaN   b
2  0 NaN NaN NaN   c
Finally, this can be melted and the nulls dropped to get your desired frame.
In [108]: pd.melt(counts, id_vars='key').dropna()[['key']]
Out[108]:
key
0 a
1 b
2 c
3 a
4 b
6 a
9 a
Suppose I have a DataFrame like this:
>>> df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]], columns=['a','b','b'])
>>> df
a b b
0 1 2 3
1 4 5 6
2 7 8 9
And I want to remove the second 'b' column. If I just use the del statement, it deletes both 'b' columns:
>>> del df['b']
>>> df
a
0 1
1 4
2 7
I can select columns by position with .iloc[] and reassign the DataFrame, but how can I delete only the second 'b' column, for example by index?
df = df.drop(['b'], axis=1).join(df['b'].iloc[:, 0:1])
>>> df
a b
0 1 2
1 4 5
2 7 8
Or, just for this case:
df = df.iloc[:, 0:2]
But I think there are better ways.
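If the goal is simply to keep the first occurrence of each duplicated column name, a common idiom is a boolean mask built from columns.duplicated(); a minimal sketch on the df above:
df = df.loc[:, ~df.columns.duplicated()]   # keeps 'a' and the first 'b'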