Pandas Structured 2D Data to XYZ Table - python

I want to create an xyz table from a structured grid representation of data in a Pandas DataFrame.
What I have is:
# Grid Nodes/indices
x=np.arange(5)
y=np.arange(5)
# DataFrame
df = pd.DataFrame(np.random.rand(5,5), columns=x, index=y)
>>>df
0 1 2 3 4
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
I want to convert the above DataFrame to this DataFrame structure:
>>>df
x y val
0 0 #
0 1 #
...
4 4 #
I could write a for loop to do this, but I believe I should be able to use pivot, stack, or some other built-in method; I just can't work it out from the documentation. Those methods seem to create multilevel DataFrames, which I do not want. Bonus points for converting it back.

dff = df.stack().reset_index(name="values")
pd.pivot_table(index="level_0",columns="level_1",values="values",data=dff)
The first line stacks the grid into a long table (this part is taken from the previous answer); the second line unstacks that table back into the grid.
# Stacking
level_0 level_1 values
0 0 0 0.536047
1 0 1 0.673782
2 0 2 0.935536
3 0 3 0.853286
4 0 4 0.916081
5 1 0 0.884820
6 1 1 0.438207
7 1 2 0.070120
8 1 3 0.292445
9 1 4 0.789046
10 2 0 0.899633
11 2 1 0.822928
12 2 2 0.445154
13 2 3 0.643797
14 2 4 0.776154
15 3 0 0.682129
16 3 1 0.974159
17 3 2 0.078451
18 3 3 0.306872
19 3 4 0.689137
20 4 0 0.117842
21 4 1 0.770962
22 4 2 0.861076
23 4 3 0.429738
24 4 4 0.149199
# Unstacking
level_1 0 1 2 3 4
level_0
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199

Use df.stack with df.reset_index:
In [4474]: df = df.stack().reset_index(name='value').rename(columns={'level_0':'x', 'level_1': 'y'})
In [4475]: df
Out[4475]:
x y value
0 0 0 0.772210
1 0 1 0.921495
2 0 2 0.903645
3 0 3 0.980514
4 0 4 0.156923
5 1 0 0.516448
6 1 1 0.121148
7 1 2 0.394074
8 1 3 0.532963
9 1 4 0.369175
10 2 0 0.605971
11 2 1 0.712189
12 2 2 0.866299
13 2 3 0.174830
14 2 4 0.042236
15 3 0 0.350161
16 3 1 0.100152
17 3 2 0.049185
18 3 3 0.808631
19 3 4 0.562624
20 4 0 0.090918
21 4 1 0.713475
22 4 2 0.723183
23 4 3 0.569887
24 4 4 0.980238
For converting it back, use df.pivot:
In [4481]: unstacked_df = df.pivot(index='x', columns='y')
In [4482]: unstacked_df
Out[4482]:
value
y 0 1 2 3 4
x
0 0.772210 0.921495 0.903645 0.980514 0.156923
1 0.516448 0.121148 0.394074 0.532963 0.369175
2 0.605971 0.712189 0.866299 0.174830 0.042236
3 0.350161 0.100152 0.049185 0.808631 0.562624
4 0.090918 0.713475 0.723183 0.569887 0.980238
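The round trip described in the answers above can be sketched end to end as follows. This is a minimal example, assuming the grid from the question (columns=x, index=y); the names grid, xyz and back are only illustrative. Note that stack() emits the row labels first, so naming the axes with rename_axis before stacking keeps x and y attached to the right level, and passing values= to pivot avoids the extra 'value' column level.

```python
import numpy as np
import pandas as pd

# Grid from the question: columns are x, the index is y
x = np.arange(5)
y = np.arange(5)
grid = pd.DataFrame(np.random.rand(5, 5), columns=x, index=y)

# Name the axes first so the stacked levels come out labelled y and x
xyz = (
    grid.rename_axis(index="y", columns="x")
        .stack()
        .reset_index(name="val")[["x", "y", "val"]]
)

# Converting back: values="val" gives plain (single-level) columns
back = xyz.pivot(index="y", columns="x", values="val")

print(xyz.head())
print(back)
```

Here back has the same values and orientation as grid; its index and columns carry the names y and x.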

Related

pandas restart cumsum every time the value is zero

So I have a series that I want to cumsum, but starting over every time I hit a 0, something like this:
orig wanted result
0 0 0
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
7 0 0
8 1 1
9 1 2
10 1 3
11 0 0
12 1 1
13 1 2
14 1 3
15 1 4
16 1 5
17 1 6
any ideas? (pandas, pure python, other)
Use df['orig'].eq(0).cumsum() to generate groups starting on each 0, then cumcount to get the increasing values:
df['result'] = df.groupby(df['orig'].eq(0).cumsum()).cumcount()
output:
orig wanted result result
0 0 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 1 4 4
5 1 5 5
6 1 6 6
7 0 0 0
8 1 1 1
9 1 2 2
10 1 3 3
11 0 0 0
12 1 1 1
13 1 2 2
14 1 3 3
15 1 4 4
16 1 5 5
17 1 6 6
Intermediate:
df['orig'].eq(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 2
9 2
10 2
11 3
12 3
13 3
14 3
15 3
16 3
17 3
Name: orig, dtype: int64
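import pandas as pd
# Flag the rows where the series hits 0 (the column is named 'orig', lower case)
condition = df['orig'].eq(0)
# The running count of zeros labels each run that starts at a 0
df['reset'] = condition.cumsum()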
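cumcount reproduces the wanted column here because every non-zero value is 1. If the values can be anything, a grouped cumsum over the same group labels is one way to get a true restarting cumulative sum. A minimal sketch, with the series s made up purely for illustration:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame({"orig": [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]})

# A new group starts at every 0
groups = df["orig"].eq(0).cumsum()

# Cumulative sum that restarts within each group
df["result"] = df.groupby(groups)["orig"].cumsum()

# Illustrative series with values other than 1 (not from the question)
s = pd.Series([0, 2, 3, 0, 5])
restarted = s.groupby(s.eq(0).cumsum()).cumsum()

print(df)
print(restarted.tolist())
```

For s, cumcount would give 0, 1, 2, 0, 1, whereas the grouped cumsum gives 0, 2, 5, 0, 5.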

Column that counts up within subgroups pandas

I've got a df
df1
a b
4 0 1
5 0 1
6 0 2
2 0 3
3 1 2
15 1 3
12 1 3
13 1 1
15 3 1
14 3 1
8 3 3
9 3 2
10 3 1
The df should be grouped by a and b, and I need a column c that counts up from 1 to the number of distinct b groups within each subgroup of a:
df1
a b c
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 4
How can I do that?
We can do groupby + transform factorize
df['C']=df.groupby('a').b.transform(lambda x : x.factorize()[0]+1)
4 1
5 1
6 2
2 3
3 1
15 2
12 2
13 3
15 1
14 1
8 1
9 1
10 2
Name: b, dtype: int64
Just so we can see the loop version
from itertools import count
from collections import defaultdict
x = defaultdict(count)  # one running counter per value of a
y = {}                  # number already assigned to each (a, b) pair
c = []
for a, b in zip(df.a, df.b):
    if (a, b) not in y:
        y[(a, b)] = next(x[a]) + 1
    c.append(y[(a, b)])
df.assign(C=c)
a b C
4 0 1 1
5 0 1 1
6 0 2 2
2 0 3 3
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 3
15 3 1 1
14 3 1 1
8 3 3 2
9 3 2 3
10 3 1 1
One option is to group by a, then iterate through each group and group by b; you can then use ngroup (which numbers the groups from 0, in sorted order of b):
df['c'] = np.hstack([g.groupby('b').ngroup().to_numpy() for _,g in df.groupby('a')])
a b c
4 0 1 0
5 0 1 0
6 0 2 1
2 0 3 2
3 1 2 1
15 1 3 2
12 1 3 2
13 1 1 0
15 3 1 0
14 3 1 0
8 3 1 0
9 3 1 0
10 3 2 1
You can use groupby.rank if you don't care about the order of appearance in the data (a dense rank numbers the b values in sorted order):
df['c'] = df.groupby('a')['b'].rank(method='dense').astype(int)
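The difference between the factorize and dense-rank answers only shows up when b is not already sorted within a group. A small sketch using the a and b columns from the question (the original index labels are dropped here, and the column names c_appearance and c_sorted are just illustrative):

```python
import pandas as pd

# a and b from the question, with a default integer index
df = pd.DataFrame({
    "a": [0, 0, 0, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "b": [1, 1, 2, 3, 2, 3, 3, 1, 1, 1, 3, 2, 1],
})

# Number each b in order of first appearance within its a group
df["c_appearance"] = df.groupby("a")["b"].transform(lambda s: s.factorize()[0] + 1)

# Number each b by its sorted position within its a group
df["c_sorted"] = df.groupby("a")["b"].rank(method="dense").astype(int)

print(df)
```

In the a == 1 group, b runs 2, 3, 3, 1, so the first-appearance numbering is 1, 2, 2, 3 while the dense rank is 2, 3, 3, 1.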

Stacking Pandas Dataframe without dropping row

Currently, I have a dataframe like this:
0 0 0 3 0 0
0 7 8 9 1 0
0 4 5 2 4 0
My code to stack it:
dt = dataset.iloc[:,0:7].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
dt['variable'] = pandas.Categorical(dt.index).codes+1
dt.rename(columns={0:index_column_name}, inplace=True)
dt.set_index(index_column_name, inplace=True)
dt['variable'] = numpy.sort(dt['variable'])
However, it drops the first row when I'm stacking it, and I want to keep the headers / first row, how would I achieve this?
In essence, I'm losing the data from the first row (a.k.a column headers) and I want to keep it.
Desired Output:
value,variable
0 1
0 1
0 1
0 2
7 2
4 2
0 3
8 3
5 3
3 4
9 4
2 4
0 5
1 5
4 5
0 6
0 6
0 6
Current output:
value,variable
0 1
0 1
7 2
4 2
8 3
5 3
9 4
2 4
1 5
4 5
0 6
0 6
Why not use df.melt as @WeNYoBen mentioned?
print(df)
1 2 3 4 5 6
0 0 0 0 3 0 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
print(df.melt())
variable value
0 1 0
1 1 0
2 1 0
3 2 0
4 2 7
5 2 4
6 3 0
7 3 8
8 3 5
9 4 3
10 4 9
11 4 2
12 5 0
13 5 1
14 5 4
15 6 0
16 6 0
17 6 0
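If the first row goes missing because it was consumed as column headers when the file was loaded (an assumption; the question does not show how dataset was read), one possible fix is to read the data without a header row and then melt. A sketch with the three rows from the question inlined as CSV text:

```python
import io
import pandas as pd

# The three data rows from the question, as headerless CSV text (assumed layout)
raw = "0,0,0,3,0,0\n0,7,8,9,1,0\n0,4,5,2,4,0\n"

# header=None keeps the first line as data instead of using it as column names
df = pd.read_csv(io.StringIO(raw), header=None)

# Label the columns 1..6 so that 'variable' matches the desired output
df.columns = range(1, df.shape[1] + 1)

long_df = df.melt()
print(long_df)
```

This yields 18 rows, three per column, including the values from the first line.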

Separating elements of a Pandas DataFrame in Python

I have a pandas DataFrame that looks like the following:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
which can be generated with the following code:
import pandas
time=[0,1,2,3,4]
repeat_1_conc_1=[1,2,3,4,5]
repeat_1_conc_2=[2,3,4,5,6]
repeat_1_conc_3=[3,4,5,6,7]
d1=pandas.DataFrame([time,repeat_1_conc_1]).transpose()
d2=pandas.DataFrame([time,repeat_1_conc_2]).transpose()
d3=pandas.DataFrame([time,repeat_1_conc_3]).transpose()
repeat_2_conc_1=[1,2,3,4,5]
repeat_2_conc_2=[2,3,4,5,6]
repeat_2_conc_3=[3,4,5,6,7]
d4=pandas.DataFrame([time,repeat_2_conc_1]).transpose()
d5=pandas.DataFrame([time,repeat_2_conc_2]).transpose()
d6=pandas.DataFrame([time,repeat_2_conc_3]).transpose()
df= pandas.concat([d1,d2,d3,d4,d5,d6]).reset_index()
df.drop('index',axis=1,inplace=True)
df.columns=['Time','Measurement']
print(df)
If you look at the code, you'll see that I have two experimental repeats in the same DataFrame, which should be separated at df.iloc[:15]. Additionally, within each experiment I have 3 sub-experiments that can be thought of as the starting conditions of a dose response, i.e. the first sub-experiment starts with 1, the second with 2 and the third with 3. These should be separated at index intervals of `len(time)`, i.e. 5 elements (0-4) within each experimental repeat. Could somebody please tell me the best way to separate this data into individual time course measurements for each experiment? I'm not exactly sure what the best data structure would be, but I just need to be able to access the data for each sub-experiment of each experimental repeat easily. Perhaps something like:
repeat1=
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
Repeat 2=
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
IIUC, you may set a multiindex so that you can index your DF accessing experiments and subexperiments easily:
In [261]: dfi = df.set_index([df.index//15+1, df.index//5 - df.index//15*3 + 1])
In [262]: dfi
Out[262]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
2 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
selecting subexperiments
In [263]: dfi.loc[1,1]
Out[263]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
In [264]: dfi.loc[2,2]
Out[264]:
Time Measurement
2 2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
select second experiment with all subexperiments:
In [266]: dfi.loc[2,:]
Out[266]:
Time Measurement
1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
alternatively you can create your own slicing function:
def my_slice(rep=1, subexp=1):
    # .ix was removed in pandas 1.0; .loc slices by label and includes the end label
    start = (rep - 1) * 15 + (subexp - 1) * 5
    return df.loc[start : start + 4, :]
demo:
In [174]: my_slice(1,1)
Out[174]:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
In [175]: my_slice(2,1)
Out[175]:
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
In [176]: my_slice(2,2)
Out[176]:
Time Measurement
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
PS: a slightly more convenient way to concatenate your DFs:
df = pandas.concat([d1,d2,d3,d4,d5,d6], ignore_index=True)
so you don't need the subsequent .reset_index() and drop() calls.
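Another way to get at each time course, sketched here under the question's layout (2 repeats, 3 sub-experiments each, 5 time points each), is to derive the repeat and sub-experiment numbers from the row position and collect the pieces with groupby. The names block, repeat, subexp and groups are only illustrative:

```python
import numpy as np
import pandas as pd

time = [0, 1, 2, 3, 4]
starts = [1, 2, 3, 1, 2, 3]  # starting value of each sub-experiment, in order

# Rebuild the DataFrame from the question
df = pd.concat(
    [pd.DataFrame({"Time": time, "Measurement": [s + t for t in time]}) for s in starts],
    ignore_index=True,
)

n_time = len(time)  # rows per sub-experiment
n_sub = 3           # sub-experiments per repeat

# Position of each row's 5-row block, then the repeat and sub-experiment it belongs to
block = np.arange(len(df)) // n_time
df["repeat"] = block // n_sub + 1
df["subexp"] = block % n_sub + 1

# One small DataFrame per (repeat, sub-experiment) pair
groups = {
    key: g[["Time", "Measurement"]]
    for key, g in df.groupby(["repeat", "subexp"])
}

print(groups[(2, 2)])
```

groups[(2, 2)] then holds rows 20 to 24, the second sub-experiment of the second repeat.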

Python Pandas dataframe is not including all duplicates

I'm basically trying to create a Pandas dataframe (CQUAD_mech_loads) that is a subset of a larger dataframe (CQUAD_Mech). This subset dataframe is essentially created by filtering based on two conditions. There are NO duplicates in the larger dataframe (CQUAD_Mech).
The problem is that my subset dataframe doesn't include the duplicate IDs in the ELM column. It does, however, include duplicates in the LC column.
CQUAD_ELM is a list containing four IDs ([387522, 387522, 387506, 387507]), with 387522 appearing twice. Right now, CQUAD_mech_loads is a dataframe with only three rows for the three unique IDs. I want that fourth, duplicate ID in there as well.
The code:
def get_df(df, col1, cond1, col2='', cond2=0):
    return df[(df[col1] == cond1) & (df[col2].isin(cond2))].reset_index(drop=True)
CQUAD_mech_loads = get_df(CQUAD_Mech,'LC', LC, 'ELM', CQUAD_ELM)
The output (where is the other line for 387522?):
LC ELM FX FY FXY
0 3113 387506 0 0 0
1 3113 387507 0 0 0
2 3113 387522 0 0 0
Since you're dropping the index anyway, you can just set the index to be the column you're interested in and use .loc indexing (.ix was removed in pandas 1.0):
In [28]: df = pd.DataFrame(np.arange(25).reshape(5,5))
In [29]: df
Out[29]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
In [30]: df.set_index(4, drop=False).loc[[4,4,19,4,24]].reset_index(drop=True)
Out[30]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 15 16 17 18 19
3 0 1 2 3 4
4 20 21 22 23 24
EDIT: Your current method just finds each distinct col1/col2 pair. If you want to filter on multiple columns, just do it twice, once for each column:
In [98]: df.set_index(1, drop=False).loc[[1, 6, 16]].set_index(4, drop=False).loc[[4,4,4,4,4,4,4,4,19,9]].reset_index(drop=True)
Out[98]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4
6 0 1 2 3 4
7 0 1 2 3 4
8 15 16 17 18 19
9 5 6 7 8 9
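The reason isin drops the repeat is that it is only a membership test: each row of the larger frame is either kept or not, however many times its ID occurs in the list. One possible alternative, sketched here, is to turn the ID list into a small DataFrame and left-merge it onto the filtered frame, which yields one output row per list entry. The contents of CQUAD_Mech below are made up for illustration; only the column names, the ID list and LC = 3113 come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the larger frame (no duplicate rows)
CQUAD_Mech = pd.DataFrame({
    "LC": [3113, 3113, 3113, 3114],
    "ELM": [387506, 387507, 387522, 387522],
    "FX": [0, 0, 0, 0],
})
CQUAD_ELM = [387522, 387522, 387506, 387507]
LC = 3113

# Filter on the load case first, then merge so every list entry produces a row
wanted = pd.DataFrame({"ELM": CQUAD_ELM})
subset = CQUAD_Mech[CQUAD_Mech["LC"] == LC]
CQUAD_mech_loads = wanted.merge(subset, on="ELM", how="left")

print(CQUAD_mech_loads)
```

The result has four rows in the order of CQUAD_ELM, with 387522 appearing twice.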
