My current data frame is comprised of 10 rows and thousands of columns. The setup currently looks similar to this:
A B A B
1 2 3 4
5 6 7 8
But I desire something more like below, where essentially I would transpose the columns into rows once the headers start repeating themselves.
A B
1 2
5 6
3 4
7 8
I've been trying df.reshape but perhaps can't get the syntax right. Any suggestions on how best to transpose the data like this?
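I'd probably go for stacking, grouping and then building a new DataFrame from scratch, e.g.:
pd.DataFrame({col: vals for col, vals in df.stack().groupby(level=1).agg(list).items()})
That gives you the values below (note that the rows come out in a different order than in the question's desired output):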
A B
0 1 2
1 3 4
2 5 6
3 7 8
Try with stack, groupby and pivot:
stacked = df.T.stack().to_frame().assign(idx=df.T.stack().groupby(level=0).cumcount()).reset_index()
output = stacked.pivot(index="idx", columns="level_0", values=0).rename_axis(None, axis=1).rename_axis(None, axis=0)
>>> output
A B
0 1 2
1 5 6
2 3 4
3 7 8
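For the layout in the question, slicing the frame into column blocks and concatenating them may be simpler than stacking. The sketch below assumes the headers repeat in a fixed, complete cycle (A, B, A, B, ...); the names block_width, blocks and result are illustrative and do not come from the answers above.

```python
import pandas as pd

# Sample frame from the question, with repeating headers.
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "A", "B"])

# Assumption: the headers repeat in a fixed, complete cycle,
# so the number of distinct labels is the width of one block.
block_width = df.columns.nunique()

# Cut the frame into consecutive column blocks by position.
blocks = [
    df.iloc[:, start:start + block_width]
    for start in range(0, df.shape[1], block_width)
]

# Stack the blocks vertically and renumber the rows.
result = pd.concat(blocks, ignore_index=True)
print(result)
```

Because the slicing is positional (iloc), the duplicate column labels do not get in the way, and the rows keep the order asked for in the question.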
I am working with a pandas dataframe where I have the following two columns: "personID" and "points". I would like to create a third variable ("localMin") which will store the minimum value of the column "points" at each point in the dataframe as compared with all previous values in the "points" column for each personID (see image below).
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3,4,2,6,1,2,4,3,1,2,6,1]
                   })
df['localMin'] = df.groupby('personID')['points'].cummin()
output:
personID points localMin
0 A 3 3
1 A 4 3
2 A 2 2
3 A 6 2
4 A 1 1
5 A 2 1
6 B 4 4
7 B 3 3
8 B 1 1
9 B 2 1
10 B 6 1
11 B 1 1
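The example above has each person's rows in one contiguous run. As a small additional sketch (the interleaved sample data here is made up for illustration), the same call also handles rows where the persons alternate, since the running minimum is tracked per group in row order:

```python
import pandas as pd

# Hypothetical data: rows for persons A and B are interleaved.
df = pd.DataFrame({
    "personID": ["A", "B", "A", "B", "A"],
    "points": [3, 4, 5, 1, 2],
})

# Running minimum of "points" within each personID, in row order.
df["localMin"] = df.groupby("personID")["points"].cummin()
print(df)
```

This is the case where shift() with fixed periods tends to break down, because the distance to a person's previous row is not constant.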
I'm currently analysing rolling correlations of a dataset with four compared values, but I only need the output rows labelled 'a'.
I got my data frame by using the command newdf = df.rolling(3).corr()
Sample input (random numbers)
a b c d
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b 5 6 3
3 c 4 3 1
3 d 3 4 2
4 a 1 3 5 6
4 b 6 2 4 1
4 c 8 6 6 7
4 d 2 5 4 6
5 a 2 5 4 1
5 b 1 4 6 3
5 c 2 6 3 7
5 d 3 6 3 7
and need the output
a b c d
1 a 1 3 5 6
2 a 2 5 4 1
I've tried filtering it with adf = newdf.filter(['a'], axis=0), but that gets rid of everything, and when I do it for the other axis it filters by column. Unfortunately the column containing the values a, b, c, d is unnamed, so I can't filter that column individually. This wouldn't be an issue, however, if it's possible to flip the rows and columns, with the values listed by index, to get the desired output.
Try using loc: set the column holding a, b, c, d, a, b, c, d, ... as the index and then select with loc:
df.loc['a']
The actual source of the problem is that your DataFrame has a MultiIndex.
When you run newdf.filter(['a'], axis=0), you ask pandas to keep only the rows whose index label is exactly the string "a".
But with a MultiIndex, each row that has "a" at level 1 also carries a number at level 0, so no label matches and nothing is kept.
To get your intended result, run:
newdf.filter(like='a', axis=0)
maybe followed by .dropna().
An alternative solution is:
newdf.xs('a', level=1, drop_level=False)
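A minimal, self-contained sketch of the xs approach follows. The random input data and the names a_rows, same_rows and a_complete are assumptions made for illustration; only the rolling(3).corr() call and the xs selection come from the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical input: 5 rows of random numbers in columns a..d.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(5, 4)), columns=list("abcd"))

# Rolling correlation returns a frame with a two-level index:
# level 0 is the original row label, level 1 is the column name.
newdf = df.rolling(3).corr()

# Keep only the rows whose level-1 label is "a".
a_rows = newdf.xs("a", level=1, drop_level=False)

# Equivalent selection with loc and an index slice.
same_rows = newdf.loc[pd.IndexSlice[:, "a"], :]

# The first two windows are incomplete, so their rows are all NaN.
a_complete = a_rows.dropna()
print(a_complete)
```

Passing drop_level=False keeps both index levels, so the original row label stays visible next to "a".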
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
3 c 2 9
4 b 2 10
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
I want to select the last 3 rows of each group (from the above df) like the following, but perform the operation in place. I want to ensure that only the new df object is kept in memory after assignment. What would be an efficient way of doing it?
df = df.groupby('Group').tail(3)
The result should look like the following:
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
N.B:- This question is related to Keeping the last N duplicates in pandas
df = df.groupby('Group').tail(3) is already an efficient way of doing it. Because you are overwriting the df variable, Python will take care of releasing the memory of the old dataframe, and you will only have access to the new one.
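As a runnable sketch of that rebinding, using the sample data from the question:

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "value": list("abccbxdeda"),
    "Group": [1, 1, 1, 2, 2, 2, 2, 3, 2, 3],
    "something": [1, 2, 4, 9, 10, 5, 3, 5, 10, 5],
})

# Keep the last 3 rows of each group and rebind the name df.
# The old frame becomes eligible for garbage collection once
# nothing else references it.
df = df.groupby("Group").tail(3)
print(df)
```

The rows keep their original order and index labels, so rows 3 and 4 (the two oldest rows of group 2) are the ones that disappear.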
Trying way too hard to guess what you want.
NOTE: using Pandas inplace argument where it is available is NO guarantee that a new DataFrame won't be created in memory. In fact, it may very well create a new DataFrame in memory and replace the old one behind the scenes.
from collections import defaultdict
def f(s):
    c = defaultdict(int)
    for i, x in zip(s.index[::-1], s.values[::-1]):
        c[x] += 1
        if c[x] > 3:
            yield i
df.drop([*f(df.Group)], inplace=True)
df
value Group something
0 a 1 1
1 b 1 2
2 c 1 4
5 x 2 5
6 d 2 3
7 e 3 5
8 d 2 10
9 a 3 5
Your answer is already in the post. However, as noted in the comments, you are overwriting the existing df, so to avoid that assign the result to a new variable name like below:
new_df = df.groupby('Group').tail(3)
However, out of curiosity, if you are not concerned about the groupby and are only looking for the last N rows of the df, you can do it like below:
df[-2:] # last 2 rows
I want to sort a subset of a dataframe (say, between indexes i and j) according to some value. I tried
df2=df.iloc[i:j].sort_values(by=...)
df.iloc[i:j]=df2
The first line works without problems, but nothing happens when I run the second one (not even an error). How should I do this? (I also tried the update function, but that didn't work either.)
I believe you need to assign to the selected slice after converting the sorted data to a NumPy array with .values, to avoid index alignment:
df = pd.DataFrame({'A': [1,2,3,4,3,2,1,4,1,2]})
print (df)
A
0 1
1 2
2 3
3 4
4 3
5 2
6 1
7 4
8 1
9 2
i = 2
j = 7
df.iloc[i:j] = df.iloc[i:j].sort_values(by='A').values
print (df)
A
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 4
8 1
9 2
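The same idea can be written with to_numpy(), which the pandas documentation recommends over .values. This is a sketch on the sample data above; sorted_block is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 3, 2, 1, 4, 1, 2]})
i, j = 2, 7

# Sort the slice on its own; the result still carries the old index labels.
sorted_block = df.iloc[i:j].sort_values(by="A")

# Assign the raw values so pandas cannot realign them by index label,
# which would put every value back in its original row.
df.iloc[i:j] = sorted_block.to_numpy()
print(df)
```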
Hi, I have a dataframe that looks like:
a b c d
2 4 2 7
4 2 3 8
5 3 2 9
I want to return 2 2 2 7 (the minimum of each column).
Is there a function to do this, or what is the most efficient way to do it? Thanks
df.min(axis=0)
The axis keyword selects the direction: axis=0 takes the minimum down each column, axis=1 takes it across each row.
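A short sketch on the frame from the question, showing both directions (col_min and row_min are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [2, 4, 5],
    "b": [4, 2, 3],
    "c": [2, 3, 2],
    "d": [7, 8, 9],
})

# Minimum of each column: one value per column label.
col_min = df.min(axis=0)

# Minimum of each row: one value per row label.
row_min = df.min(axis=1)

print(col_min.tolist())  # [2, 2, 2, 7]
print(row_min.tolist())  # [2, 2, 2]
```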