Transform list to DataFrame efficiently - Python

I have a list of images and I want to get all the pixels of each image into one DataFrame column and the image number into another column. I am trying to do it with:
import numpy as np
from pandas import DataFrame

plotDF = DataFrame()
plotData = [np.array([[1,2,1],[1,1,2],[4,2,1]]), np.array([[1,2,2,1],[1,3,1,3]]), np.array([[1,1,2,3],[4,1,1,1],[1,1,1,4]])]
plotData = [image.flatten() for image in plotData]
for n, pD in zip(range(len(plotData)), plotData):
    for pixel in pD:
        plotDF = plotDF.append(DataFrame.from_records([{'n': n, 'pixel': pixel}]))
plotDF = plotDF.reset_index(drop=True)
but this seems really inefficient.
How can I do this more efficiently, possibly with https://github.com/kieferk/dfply?

I think you can use numpy.repeat to repeat each image number by the lengths obtained from str.len, and flatten the nested values with itertools.chain.
from itertools import chain
import numpy as np
import pandas as pd

s = pd.Series(plotData)
df2 = pd.DataFrame({
    "n": np.repeat(s.index + 1, s.str.len()),
    "pixel": list(chain.from_iterable(s))})
print(df2)
n pixel
0 1 1
1 1 2
2 1 1
3 1 1
4 1 1
5 1 2
6 1 4
7 1 2
8 1 1
9 2 1
10 2 2
11 2 2
12 2 1
13 2 1
14 2 3
15 2 1
16 2 3
17 3 1
18 3 1
19 3 2
20 3 3
21 3 4
22 3 1
23 3 1
24 3 1
25 3 1
26 3 1
27 3 1
28 3 4
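Since the flattened images are already NumPy arrays, the same frame can also be built without itertools; a minimal alternative sketch using np.concatenate for the pixels and np.repeat for the image numbers:
import numpy as np
import pandas as pd

# plotData is the list of flattened pixel arrays from the question
lengths = [len(a) for a in plotData]
df2 = pd.DataFrame({
    "n": np.repeat(np.arange(1, len(plotData) + 1), lengths),
    "pixel": np.concatenate(plotData)})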

Related

Replicating rows of a Pandas dataframe based on a column condition

I have a dataframe looking like this:
Weekday Day_in_Month Starting_hour Ending_hour Power
3 1 1 3 35
3 1 3 7 15
4 2 22 2 5
...
I want to duplicate every row until the Starting_hour matches the Ending_hour.
-> All values of the row should stay the same, but the Starting_hour should increase by 1 in every new row.
The final dataframe should look like the following:
Weekday Day_in_Month Starting_hour Ending_hour Power
3 1 1 3 35
3 1 2 3 35
3 1 3 3 35
3 1 3 7 15
3 1 4 7 15
3 1 5 7 15
3 1 6 7 15
3 1 7 7 15
4 2 22 2 5
4 2 23 2 5
4 2 24 2 5
4 2 1 2 5
4 2 2 2 5
I appreciate any ideas on it, thanks!
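For reference, a minimal sketch that rebuilds the sample frame (column names and values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'Weekday': [3, 3, 4],
    'Day_in_Month': [1, 1, 2],
    'Starting_hour': [1, 3, 22],
    'Ending_hour': [3, 7, 2],
    'Power': [35, 15, 5]})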
Use Index.repeat with the difference of the hour columns (plus one) to repeat rows via DataFrame.loc, then add a per-group counter from GroupBy.cumcount to Starting_hour:
df1 = df.loc[df.index.repeat(df['Ending_hour'].sub(df['Starting_hour']).add(1))]
df1['Starting_hour'] += df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print(df1)
EDIT: If Starting_hour can be greater than Ending_hour (the interval wraps past midnight), add 24 to Ending_hour in those rows; then, as a last step, subtract 1, take modulo 24, and add 1 back so the hours stay in the 1-24 range. For the 22 -> 2 row, e becomes 26, so the row repeats 26 - 22 + 1 = 5 times and the counter yields 22, 23, 24, 25, 26, which map back to 22, 23, 24, 1, 2:
m = df['Starting_hour'].gt(df['Ending_hour'])
e = df['Ending_hour'].mask(m, df['Ending_hour'].add(24))
df1 = df.loc[df.index.repeat(e.sub(df['Starting_hour']).add(1))]
df1['Starting_hour'] = (df1['Starting_hour'].add(df1.groupby(level=0).cumcount())
.sub(1).mod(24).add(1))
df1 = df1.reset_index(drop=True)
print(df1)
Weekday Day_in_Month Starting_hour Ending_hour Power
0 3 1 1 3 35
1 3 1 2 3 35
2 3 1 3 3 35
3 3 1 3 7 15
4 3 1 4 7 15
5 3 1 5 7 15
6 3 1 6 7 15
7 3 1 7 7 15
8 4 2 22 2 5
9 4 2 23 2 5
10 4 2 24 2 5
11 4 2 1 2 5
12 4 2 2 2 5

Pandas Structured 2D Data to XYZ Table

I want to create an xyz table from a structured grid representation of data in a Pandas DataFrame.
What I have is:
import numpy as np
import pandas as pd

# Grid nodes/indices
x = np.arange(5)
y = np.arange(5)
# DataFrame
df = pd.DataFrame(np.random.rand(5, 5), columns=x, index=y)
>>>df
0 1 2 3 4
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
I want to convert the above DataFrame to this DataFrame structure:
>>>df
x y val
0 0 #
0 1 #
...
4 4 #
I can write a for loop to do this, but I believe I should be able to do it with pivot, stack, or some other built-in method; I am just not getting it from the documentation. Those seem to create multilevel DataFrames, which I do not want. Bonus points for converting it back.
The first line stacks the grid into a long xyz table; the second pivots it back to the wide shape (the stacking step is adapted from the other answer):
dff = df.stack().reset_index(name="values")
pd.pivot_table(index="level_0", columns="level_1", values="values", data=dff)
level_0 level_1 values
0 0 0 0.536047
1 0 1 0.673782
2 0 2 0.935536
3 0 3 0.853286
4 0 4 0.916081
5 1 0 0.884820
6 1 1 0.438207
7 1 2 0.070120
8 1 3 0.292445
9 1 4 0.789046
10 2 0 0.899633
11 2 1 0.822928
12 2 2 0.445154
13 2 3 0.643797
14 2 4 0.776154
15 3 0 0.682129
16 3 1 0.974159
17 3 2 0.078451
18 3 3 0.306872
19 3 4 0.689137
20 4 0 0.117842
21 4 1 0.770962
22 4 2 0.861076
23 4 3 0.429738
24 4 4 0.149199
# Unstacking
level_1 0 1 2 3 4
level_0
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
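One caveat: pivot_table aggregates duplicate index/column pairs (mean by default), which is harmless here only because each (level_0, level_1) pair is unique. The exact inverse of stack() is unstack(); a minimal sketch:
# Rebuild the wide grid from the long frame without any aggregation
df_back = dff.set_index(["level_0", "level_1"])["values"].unstack()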
Use df.stack with df.reset_index:
In [4474]: df = df.stack().reset_index(name='value').rename(columns={'level_0':'x', 'level_1': 'y'})
In [4475]: df
Out[4475]:
x y value
0 0 0 0.772210
1 0 1 0.921495
2 0 2 0.903645
3 0 3 0.980514
4 0 4 0.156923
5 1 0 0.516448
6 1 1 0.121148
7 1 2 0.394074
8 1 3 0.532963
9 1 4 0.369175
10 2 0 0.605971
11 2 1 0.712189
12 2 2 0.866299
13 2 3 0.174830
14 2 4 0.042236
15 3 0 0.350161
16 3 1 0.100152
17 3 2 0.049185
18 3 3 0.808631
19 3 4 0.562624
20 4 0 0.090918
21 4 1 0.713475
22 4 2 0.723183
23 4 3 0.569887
24 4 4 0.980238
For converting it back, use df.pivot:
In [4481]: unstacked_df = df.pivot('x', 'y')
In [4482]: unstacked_df
Out[4482]:
value
y 0 1 2 3 4
x
0 0.772210 0.921495 0.903645 0.980514 0.156923
1 0.516448 0.121148 0.394074 0.532963 0.369175
2 0.605971 0.712189 0.866299 0.174830 0.042236
3 0.350161 0.100152 0.049185 0.808631 0.562624
4 0.090918 0.713475 0.723183 0.569887 0.980238
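Note that recent pandas versions make df.pivot's arguments keyword-only; passing values explicitly also avoids the extra 'value' column level seen above. A hedged equivalent:
# Keyword form for newer pandas; values='value' drops the extra column level
unstacked_df = df.pivot(index='x', columns='y', values='value')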

Separating elements of a Pandas DataFrame in Python

I have a pandas DataFrame that looks like the following:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
which can be generated with the following code:
import pandas
time=[0,1,2,3,4]
repeat_1_conc_1=[1,2,3,4,5]
repeat_1_conc_2=[2,3,4,5,6]
repeat_1_conc_3=[3,4,5,6,7]
d1=pandas.DataFrame([time,repeat_1_conc_1]).transpose()
d2=pandas.DataFrame([time,repeat_1_conc_2]).transpose()
d3=pandas.DataFrame([time,repeat_1_conc_3]).transpose()
repeat_2_conc_1=[1,2,3,4,5]
repeat_2_conc_2=[2,3,4,5,6]
repeat_2_conc_3=[3,4,5,6,7]
d4=pandas.DataFrame([time,repeat_2_conc_1]).transpose()
d5=pandas.DataFrame([time,repeat_2_conc_2]).transpose()
d6=pandas.DataFrame([time,repeat_2_conc_3]).transpose()
df= pandas.concat([d1,d2,d3,d4,d5,d6]).reset_index()
df.drop('index',axis=1,inplace=True)
df.columns=['Time','Measurement']
print(df)
If you look at the code, you'll see that I have two experimental repeats in the same DataFrame, which should be separated at df.iloc[:15]. Additionally, within each repeat I have 3 sub-experiments that can be thought of as the starting conditions of a dose response, i.e. the first sub-experiment starts with 1, the second with 2 and the third with 3. These should be separated at index intervals of len(time), which is 0-4, i.e. 5 elements per sub-experiment. Could somebody please tell me the best way to separate this data into individual time-course measurements for each experiment? I'm not exactly sure what the best data structure would be, but I just need to be able to access the data for each sub-experiment of each experimental repeat easily. Perhaps something like:
repeat1=
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 0 2
6 1 3
7 2 4
8 3 5
9 4 6
10 0 3
11 1 4
12 2 5
13 3 6
14 4 7
repeat2 =
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
25 0 3
26 1 4
27 2 5
28 3 6
29 4 7
IIUC, you may set a MultiIndex so that you can index your DF to access experiments and subexperiments easily:
In [261]: dfi = df.set_index([df.index//15+1, df.index//5 - df.index//15*3 + 1])
In [262]: dfi
Out[262]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
2 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
Selecting subexperiments:
In [263]: dfi.loc[1,1]
Out[263]:
Time Measurement
1 1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
In [264]: dfi.loc[2,2]
Out[264]:
Time Measurement
2 2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
Select the second experiment with all subexperiments:
In [266]: dfi.loc[2,:]
Out[266]:
Time Measurement
1 0 1
1 1 2
1 2 3
1 3 4
1 4 5
2 0 2
2 1 3
2 2 4
2 3 5
2 4 6
3 0 3
3 1 4
3 2 5
3 3 6
3 4 7
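An optional refinement (the level names here are my own choice, not from the question): naming the MultiIndex levels makes these selections self-documenting:
# Name the levels, then select by name instead of position
dfi.index = dfi.index.set_names(['repeat', 'subexp'])
second_repeat = dfi.xs(2, level='repeat')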
Alternatively, you can create your own slicing function:
def my_slice(rep=1, subexp=1):
    rep -= 1
    subexp -= 1
    # .loc is label-based and end-inclusive here, replacing the removed .ix
    return df.loc[rep*15 + subexp*5 : rep*15 + subexp*5 + 4, :]
Demo:
In [174]: my_slice(1,1)
Out[174]:
Time Measurement
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
In [175]: my_slice(2,1)
Out[175]:
Time Measurement
15 0 1
16 1 2
17 2 3
18 3 4
19 4 5
In [176]: my_slice(2,2)
Out[176]:
Time Measurement
20 0 2
21 1 3
22 2 4
23 3 5
24 4 6
P.S. A slightly more convenient way to concatenate your DFs:
df = pandas.concat([d1,d2,d3,d4,d5,d6], ignore_index=True)
so you don't need the subsequent .reset_index() and .drop().
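If you prefer plain Python containers over a MultiIndex, a hedged sketch that splits the frame into a dict keyed by (repeat, subexperiment), assuming 15 rows per repeat and 5 per sub-experiment:
# Keys are (repeat, subexp), both 1-based; values are the 5-row slices
groups = {key: g.reset_index(drop=True)
          for key, g in df.groupby([df.index // 15 + 1,
                                    df.index % 15 // 5 + 1])}
repeat2_sub1 = groups[(2, 1)]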

Python Pandas dataframe is not including all duplicates

I'm basically trying to create a Pandas dataframe (CQUAD_mech_loads) that is a subset of a larger dataframe (CQUAD_Mech). This subset is created by filtering on two conditions. There are NO duplicates in the larger dataframe (CQUAD_Mech).
The problem is that my subset dataframe doesn't include the duplicate IDs in the ELM column. It does, however, include duplicates in the LC column.
CQUAD_ELM is a list containing four IDs ([387522, 387522, 387506, 387507]), so 387522 appears twice. Right now, CQUAD_mech_loads is a dataframe with only three rows, one for each unique ID. I want that fourth row for the duplicate ID in there as well.
The code:
def get_df(df, col1, cond1, col2='', cond2=0):
    return df[(df[col1] == cond1) & (df[col2].isin(cond2))].reset_index(drop=True)
CQUAD_mech_loads = get_df(CQUAD_Mech,'LC', LC, 'ELM', CQUAD_ELM)
The output (where is the other line for 387522?):
LC ELM FX FY FXY
0 3113 387506 0 0 0
1 3113 387507 0 0 0
2 3113 387522 0 0 0
Since you're dropping the index anyway, you can just set the index to the column you're interested in and use .loc indexing (.ix is removed in modern pandas):
In [28]: df = pd.DataFrame(np.arange(25).reshape(5,5))
In [29]: df
Out[29]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
In [30]: df.set_index(4, drop=False).loc[[4,4,19,4,24]].reset_index(drop=True)
Out[30]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 15 16 17 18 19
3 0 1 2 3 4
4 20 21 22 23 24
EDIT: Your current method just finds each distinct col1/col2 pair. If you want to filter on multiple columns, just do it twice, once for each column:
In [98]: df.set_index(1, drop=False).loc[[1, 6, 16]].set_index(4, drop=False).loc[[4,4,4,4,4,4,4,4,19,9]].reset_index(drop=True)
Out[98]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4
6 0 1 2 3 4
7 0 1 2 3 4
8 15 16 17 18 19
9 5 6 7 8 9
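A hedged alternative that keeps every occurrence from the ID list: merge the LC-filtered frame against a one-column frame built from CQUAD_ELM, so each duplicate in the list produces its own row (names taken from the question):
import pandas as pd

# One row per entry in CQUAD_ELM, duplicates included
ids = pd.DataFrame({'ELM': CQUAD_ELM})
CQUAD_mech_loads = ids.merge(CQUAD_Mech[CQUAD_Mech['LC'] == LC], on='ELM')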

pandas python sorting according to a pattern

I have a pandas data frame that consists of 5 columns. The second column has the numbers 1 to 500 repeated 5 times. As a shorter example, the second column is something like this (1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3), and I want to sort it to look like this (1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4). The code I am using to sort is df=res.sort([2],ascending=True), but this code sorts it as (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4).
Any help will be much appreciated. Thanks
How about this: sort by the cumcount and then by the value itself:
In [11]: df = pd.DataFrame({"s": [1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3]})
In [12]: df.groupby("s").cumcount()
Out[12]:
0 0
1 0
2 0
3 1
4 0
5 1
6 2
7 1
8 2
9 1
10 2
11 3
12 3
13 2
14 3
15 3
dtype: int64
In [13]: df["s_cumcounts"] = df.groupby("s").cumcount()
In [14]: df.sort_values(["s_cumcounts", "s"])
Out[14]:
s s_cumcounts
0 1 0
2 2 0
4 3 0
1 4 0
5 1 1
7 2 1
9 3 1
3 4 1
6 1 2
10 2 2
13 3 2
8 4 2
11 1 3
14 2 3
15 3 3
12 4 3
In [15]: df = df.sort_values(["s_cumcounts", "s"])
In [16]: del df["s_cumcounts"]
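The same approach as a single chained expression, so the temporary key column never lands in the frame:
df = (df.assign(key=df.groupby("s").cumcount())
        .sort_values(["key", "s"])
        .drop(columns="key"))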
