Looping through pandas and finding row column pairs - python

I can't wrap my head around the best way to accomplish this.
The example below shows what I would like to accomplish. I'm not sure what it's called exactly, but essentially I want to iterate through the rows and columns and build a new dataframe with x, y, and the value where they intersect. I just want to reshape the dataframe without losing any values.
I'm doing this just to learn pandas, so any pointers on how to think about this, or on the best way to solve it, would be greatly appreciated.
Input:
    1   2   3
1  10  15  20
2  11  16  21
3  12  17  22
Desired output:
x  y   z
1  1  10
1  2  15
1  3  20
2  1  11
2  2  16
2  3  21
3  1  12
3  2  17
3  3  22

Given your dataset and expected output, you are looking for pandas melt. Iterating over a dataframe can be slow and inefficient; if you are learning, I strongly suggest looking for more efficient approaches (for example vectorizing an operation, or pivoting) instead of using for loops. The last line of the proposed solution merely orders the columns to match your desired output. Try the following:
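import pandas as pd

df = pd.DataFrame({1: [10, 11, 12],
                   2: [15, 16, 17],
                   3: [20, 21, 22]})
# melt turns the column labels into the 'x' column and the cell values into 'z'
df = df.melt(value_vars=[1, 2, 3], var_name='x', value_name='z')
# number the rows within each original column, starting at 1
df['y'] = df.groupby('x').cumcount() + 1
df[['x', 'y', 'z']]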
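Outputs (note that here x is the original column label and y the row position, so the pairs are ordered column by column rather than row by row as in the question):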
x y z
0 1 1 10
1 1 2 11
2 1 3 12
3 2 1 15
4 2 2 16
5 2 3 17
6 3 1 20
7 3 2 21
8 3 3 22
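If the row-by-row layout from the question is wanted exactly (x as the row label, y as the column label), one possible variation on the melt approach is sketched below. It assumes the input frame is indexed 1..3 as shown in the question; the name long_df is only illustrative.

```python
import pandas as pd

# Same data as in the question; rows and columns are both labelled 1..3
# (the 1..3 row index is an assumption taken from the question's table).
df = pd.DataFrame({1: [10, 11, 12],
                   2: [15, 16, 17],
                   3: [20, 21, 22]},
                  index=[1, 2, 3])

long_df = (df.rename_axis("x")          # name the row index 'x'
             .reset_index()             # turn it into a regular column
             .melt(id_vars="x", var_name="y", value_name="z")
             .sort_values(["x", "y"], ignore_index=True))
print(long_df)
```

Sorting by x and then y is what puts the pairs in row-by-row order; without it, melt returns them column by column.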

Related

Ordering a dataframe by each column

I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()), however I'm worried this might be quite slow for much larger dataframes (in terms of columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
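import numpy as np
a = df.to_numpy()
# lexsort uses the last key as the primary one; rot90 puts the first column last
order = np.lexsort(np.rot90(a))
# reorder the index with the same permutation so each row keeps its label
out = pd.DataFrame(a[order], index=df.index[order], columns=df.columns)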
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
Here is the comparison for 100 to 100M items (lower runtime is better): [plot not preserved]
The same graph with the speed relative to pandas: [plot not preserved]
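If you stay with df.sort_values(), you can try selecting the sorting algorithm. The default is 'quicksort'; the alternatives are 'mergesort', 'heapsort', and 'stable'. Note, however, that according to the pandas documentation linked below, for DataFrames this option is only applied when sorting on a single column or label, so it may make no difference for a multi-column sort.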
Maybe specifying one of these would improve it?
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
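As a quick sanity check of the numpy.lexsort idea on the question's own data, here is a minimal sketch. It assumes every column is numeric (so to_numpy() yields a single numeric array) and uses iloc so that rows keep their original labels; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID":    [9, 4, 9, 3, 12, 2, 9, 9, 4],
                   "Age":   [5, 6, 7, 2, 1, 25, 5, 5, 2],
                   "Score": [3, 1, 2, 1, 15, 6, 4, 61, 12]})

a = df.to_numpy()
# np.lexsort treats the LAST key as the primary one, so the columns are
# passed in reverse order: Score, Age, ID.
order = np.lexsort(a.T[::-1])
# Positional indexing keeps each row's original label attached to it.
out = df.iloc[order]
print(out)
```

On this data the resulting row labels come out as 5, 3, 8, 1, 0, 6, 7, 2, 4, which is the order requested in the question.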

Filling data in Pandas

I'm using pandas and I have a small dataset like this:
4 1
5 8
6 25
7 33
8 24
9 4
I want to fill in the missing parts, so that it looks like this:
1 0
2 0
3 0
4 1
5 8
6 25
7 33
8 24
9 4
10 0
In the end it should be a list, like this: [0,0,0,1,8,25,33,24,4,0]
I looked for a solution but couldn't find one. Any ideas?
Try reindex, using a range that starts at 1 so the result has exactly the ten entries you listed:
l = s.reindex(range(1, 10+1), fill_value=0).tolist()
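A self-contained sketch of the reindex approach follows. It assumes the first column of the question's data is the Series index and the second column holds the values; the names s and filled are illustrative.

```python
import pandas as pd

# Assumed layout: labels 4..9 are the index, the counts are the values.
s = pd.Series([1, 8, 25, 33, 24, 4], index=[4, 5, 6, 7, 8, 9])

# Reindex onto the full 1..10 range; labels missing from s get fill_value.
filled = s.reindex(range(1, 11), fill_value=0).tolist()
print(filled)  # [0, 0, 0, 1, 8, 25, 33, 24, 4, 0]
```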

In Python, using iloc, how would you retrieve the last 12 values of a specific column in a data frame?

I want to access the data in a dataframe, but only the last twelve values in every column. For example, I have this data frame:
index A B C
20 1 2 3
21 2 5 6
22 7 8 9
23 10 1 2
24 3 1 2
25 4 9 0
26 10 11 12
27 1 2 3
28 2 1 5
29 6 7 8
30 8 4 5
31 1 3 4
32 1 2 3
33 5 6 7
34 1 3 4
The values in A, B, and C are not important; they are just an example.
Currently I am using
df1=df2.iloc[23:35]
Perhaps there is an easier way to do this, because I have to do it for around 20 dataframes of different sizes. I know that if I use
df1=df2.iloc[-1]
it will return the last row, but I don't know how to extend that to the last twelve. Any help would be appreciated.
You can get the last n rows of a DataFrame with:
df.tail(n)
or
df.iloc[-n:]
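To illustrate on a frame shaped like the one in the question, here is a small sketch. The frame df2 and its values are made up; only the index range 20..34 mirrors the example.

```python
import pandas as pd

# Hypothetical frame with 15 rows labelled 20..34, like the question's example.
df2 = pd.DataFrame({"A": range(15), "B": range(15, 30)}, index=range(20, 35))

n = 12
last_by_tail = df2.tail(n)      # last 12 rows, all columns
last_by_iloc = df2.iloc[-n:]    # same rows via positional slicing
last_a = df2["A"].iloc[-n:]     # last 12 values of a single column

print(last_by_tail.index.tolist())
```

Both forms work regardless of the frame's length, so the same line can be reused for all 20 dataframes. One difference to keep in mind: with n = 0, tail(0) returns an empty frame while iloc[-0:] returns every row.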

Axis bug on Pandas groupby boxplots

The head of my pandas dataframe, df, is shown below:
count1 count2 totalcount season
0 3 13 16 1
1 8 32 40 1
2 5 27 32 1
3 3 10 13 1
4 0 1 1 1
I'd like to make boxplots of count1, count2, and totalcount, grouped by season (there are 4 seasons) and have each set of box plots show up on their own subplot in a single figure.
When I do this with only two of the columns, say count1 and count2, everything looks great.
df.boxplot(['count1', 'count2'], by='season')
But when I add totalcount to the mix, the axis limits go haywire.
df.boxplot(['count1', 'count2', 'totalcount'], by='season')
This happens regardless of the order of the columns. I realize there are several ways around this problem, but it would be much more convenient if this worked properly.
Am I missing something? Is this a known bug in Pandas? I wasn't able to find anything in my first pass of the Pandas bug reports.
I'm using Pandas 0.14.0 and matplotlib 1.3.1.
Have you tried upgrading your pandas/matplotlib packages?
I'm using Pandas 0.13.1 + Matplotlib 1.2.1 and this is the plot I get (image not preserved) with the following data:
In [31]: df
Out[34]:
count1 count2 totalcount season
0 3 13 16 1
1 8 32 40 1
2 5 27 32 1
3 3 10 13 1
4 0 1 1 1
5 3 13 16 2
6 8 32 40 2
7 5 27 32 3
8 3 10 13 3
9 0 1 1 4
10 3 10 13 4
11 3 13 16 4
[12 rows x 4 columns]

Resample pandas dataframe only knowing result measurement count

I have a dataframe which looks like this:
Trial Measurement Data
0 0 12
1 4
2 12
1 0 12
1 12
2 0 12
1 12
2 NaN
3 12
I want to resample my data so that every trial has just two measurements
So I want to turn it into something like this:
Trial Measurement Data
0 0 8
1 8
1 0 12
1 12
2 0 12
1 12
This rather uncommon task stems from the fact that my data has an intentional jitter on the part of the stimulus presentation.
I know pandas has a resample function, but I have no idea how to apply it to my second-level index while keeping the data in discrete categories based on the first-level index :(
I also wanted to iterate over my first-level index values, but apparently
for sub_df in np.arange(len(df['Trial'].max()))
won't work: since 'Trial' is an index, pandas can't find it as a column.
Well, it's not the prettiest I've ever seen, but from a frame looking like
>>> df
Trial Measurement Data
0 0 0 12
1 0 1 4
2 0 2 12
3 1 0 12
4 1 1 12
5 2 0 12
6 2 1 12
7 2 2 NaN
8 2 3 12
then we can manually build the two "average-like" objects and then use pd.melt to reshape the output:
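# Build the two "average-like" columns explicitly: mean of the first half
# and mean of the last half of each trial (the middle value counts in both).
grouped = df.groupby("Trial")["Data"]
avg = pd.DataFrame({0: grouped.apply(lambda x: x.head((len(x)+1)//2).mean()),
                    1: grouped.apply(lambda x: x.tail((len(x)+1)//2).mean())})
result = pd.melt(avg.reset_index(), "Trial", var_name="Measurement", value_name="Data")
result = result.sort_values(["Trial", "Measurement"]).set_index(["Trial", "Measurement"])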
which produces
>>> result
Data
Trial Measurement
0 0 8
1 8
1 0 12
1 12
2 0 12
1 12
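On the side question of iterating over the first-level index: when 'Trial' is an index level rather than a column, groupby(level="Trial") yields one sub-frame per trial. The sketch below rebuilds the question's data with a two-level index and applies the same head/tail averaging inside a loop; the helper name halves is hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

# The question's data, with Trial and Measurement as a two-level index.
df = pd.DataFrame({"Trial":       [0, 0, 0, 1, 1, 2, 2, 2, 2],
                   "Measurement": [0, 1, 2, 0, 1, 0, 1, 2, 3],
                   "Data":        [12, 4, 12, 12, 12, 12, 12, np.nan, 12]}
                  ).set_index(["Trial", "Measurement"])

def halves(data):
    """Mean of the first and of the last ceil(n/2) values (NaN is skipped)."""
    k = (len(data) + 1) // 2
    return data.head(k).mean(), data.tail(k).mean()

rows = {}
# groupby(level=...) groups on an index level, so no 'Trial' column is needed.
for trial, sub_df in df.groupby(level="Trial"):
    first, last = halves(sub_df["Data"])
    rows[(trial, 0)] = first
    rows[(trial, 1)] = last

# Tuple keys become a two-level index again.
result = (pd.Series(rows, name="Data")
            .rename_axis(["Trial", "Measurement"])
            .to_frame())
print(result)
```

The explicit loop is slower than the groupby-and-melt version on large frames, but it makes each per-trial step easy to inspect.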
