Python Pandas dataframe is not including all duplicates - python

I'm basically trying to create a Pandas dataframe (CQUAD_mech_loads) that is a subset of a larger dataframe (CQUAD_Mech). This subset dataframe is essentially created by filtering based on two conditions. There are NO duplicates in the larger dataframe (CQUAD_Mech).
The problem is that my subset dataframe doesn't include the duplicate ID's in the ELM column. It does, however, include duplicates in the LC column.
CQUAD_ELM is a list containing four ID's ([387522, 387522, 387506, 387507]). I have duplicate ID's of 387522. Right now, CQUAD_mech_loads is a dataframe with only three rows for the three unique IDs. I want that fourth duplicate ID in there as well.
The code:
def get_df(df, col1, cond1, col2='', cond2=0):
return df[(df[col1] == cond1) & (df[col2].isin(cond2))].reset_index(drop=True)
CQUAD_mech_loads = get_df(CQUAD_Mech,'LC', LC, 'ELM', CQUAD_ELM)
The output (where is the other line for 387522?):
LC ELM FX FY FXY
0 3113 387506 0 0 0
1 3113 387507 0 0 0
2 3113 387522 0 0 0

Since you're dropping the index anyway, you can just set the index to be the column you're interested in and use .ix indexing:
In [28]: df = pd.DataFrame(np.arange(25).reshape(5,5))
In [29]: df
Out[29]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
In [30]: df.set_index(4, drop=False).ix[[4,4,19,4,24]].reset_index(drop=True)
Out[30]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 15 16 17 18 19
3 0 1 2 3 4
4 20 21 22 23 24
EDIT: Your current method just finds each distinct col1/col2 pair. If you want to filter on multiple columns, just do it twice, once for each column:
In [98]: df.set_index(1, drop=False).ix[[1, 6, 16]].set_index(4, drop=False).ix[[4,4,4,4,4,4,4,4,19,9]].reset_index(drop=True)
Out[98]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4
6 0 1 2 3 4
7 0 1 2 3 4
8 15 16 17 18 19
9 5 6 7 8 9

Related

Pandas Structured 2D Data to XYZ Table

I want to create an xyz table from a structured grid representation of data in a Pandas DataFrame.
What I have is:
# Grid Nodes/indices
x=np.arange(5)
y=np.arange(5)
# DataFrame
df = pd.DataFrame(np.random.rand(5,5), columns=x, index=y)
>>>df
0 1 2 3 4
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
I want to convert the above DataFrame to this DataFrame structure:
>>>df
x y val
0 0 #
0 1 #
...
4 4 #
I can create a for loop to do this but I believe I should be able to do this using the pivot, stack, or some other built-in method though but I am not getting it from the documentation. It seems to create multilevel DataFrames which I do not want. Bonus points for converting it back.
dff = df.stack().reset_index(name="values")
pd.pivot_table(index="level_0",columns="level_1",values="values",data=dff)
First part is taken from the previous answer to be used for the unstacking part.
First one is for stacking and second for unstacking.
level_0 level_1 values
0 0 0 0.536047
1 0 1 0.673782
2 0 2 0.935536
3 0 3 0.853286
4 0 4 0.916081
5 1 0 0.884820
6 1 1 0.438207
7 1 2 0.070120
8 1 3 0.292445
9 1 4 0.789046
10 2 0 0.899633
11 2 1 0.822928
12 2 2 0.445154
13 2 3 0.643797
14 2 4 0.776154
15 3 0 0.682129
16 3 1 0.974159
17 3 2 0.078451
18 3 3 0.306872
19 3 4 0.689137
20 4 0 0.117842
21 4 1 0.770962
22 4 2 0.861076
23 4 3 0.429738
24 4 4 0.149199
# Unstacking
level_1 0 1 2 3 4
level_0
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
Use df.stack with df.reset_index:
In [4474]: df = df.stack().reset_index(name='value').rename(columns={'level_0':'x', 'level_1': 'y'})
In [4475]: df
Out[4475]:
x y value
0 0 0 0.772210
1 0 1 0.921495
2 0 2 0.903645
3 0 3 0.980514
4 0 4 0.156923
5 1 0 0.516448
6 1 1 0.121148
7 1 2 0.394074
8 1 3 0.532963
9 1 4 0.369175
10 2 0 0.605971
11 2 1 0.712189
12 2 2 0.866299
13 2 3 0.174830
14 2 4 0.042236
15 3 0 0.350161
16 3 1 0.100152
17 3 2 0.049185
18 3 3 0.808631
19 3 4 0.562624
20 4 0 0.090918
21 4 1 0.713475
22 4 2 0.723183
23 4 3 0.569887
24 4 4 0.980238
For converting it back, use df.pivot:
In [4481]: unstacked_df = df.pivot('x', 'y')
In [4482]: unstacked_df
Out[4482]:
value
y 0 1 2 3 4
x
0 0.772210 0.921495 0.903645 0.980514 0.156923
1 0.516448 0.121148 0.394074 0.532963 0.369175
2 0.605971 0.712189 0.866299 0.174830 0.042236
3 0.350161 0.100152 0.049185 0.808631 0.562624
4 0.090918 0.713475 0.723183 0.569887 0.980238

pandas create category column based on sequence repetition in another column

This is very likely a duplicate, but I'm not sure what to search for to find it.
I have a column in a dataframe that cycles from 0 to some value a number of times (in my example it cycles to 4 three times) . I want to create another column that simply shows which cycle it is. Example:
import pandas as pd
df = pd.DataFrame({'A':[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]})
df['desired_output'] = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
print(df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
I was thinking maybe something along the lines of a groupby(), cumsum() and transform(), but I'm not quite sure how to implement it. Could be wrong though.
Compare by 0 with Series.eq and then add Series.cumsum, last subtract 1:
df['desired_output'] = df['A'].eq(0).cumsum() - 1
print (df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2

Creating new column in pandas from values of other column

I am dealing with the following dataframe:
p q
0 11 2
1 11 2
2 11 2
3 11 3
4 11 3
5 12 2
6 12 2
7 13 2
8 13 2
I want create a new column, say s, which starts with 0 and goes on. This new col is based on "p" column, whenever the p changes, the "s" should change too.
For the first 4 rows, the "p" = 11, so "s" column should have values 0 for this first 4 rows, and so on...
Below is the expectant df:
s p q
0 0 11 2
1 0 11 2
2 0 11 2
3 0 11 2
4 1 11 4
5 1 11 4
6 1 11 4
7 1 11 4
8 2 12 2
9 2 12 2
10 2 12 2
11 3 12 3
12 3 12 3
You need diff with cumsum (subtract one if you want the id to start from 0):
df["finalID"] = (df.ProjID.diff() != 0).cumsum()
df
Update, if you want to take both voyg_id and ProjID into consideration, you can use a OR condition on the two columns difference, so that whichever column changes, you get an increase in the final id.
df['final_id'] = ((df.voyg_id.diff() != 0) | (df.proj_id.diff() != 0)).cumsum()
df

Transform list to dataframe efficiently

I have a list of images and I want to get all the pixels of each image in one DataFrame column and the number of the image into another column. I am trying to do it with
plotDF = DataFrame()
plotData = [np.array([[1,2,1],[1,1,2],[4,2,1]]), np.array([[1,2,2,1],[1,3,1,3]]), np.array([[1,1,2,3],[4,1,1,1],[1,1,1,4]])]
plotData = [image.flatten() for image in plotData]
for n, pD in zip(range(len(plotData)), plotData):
for pixel in pD:
plotDF = plotDF.append(DataFrame.from_records([{'n': n, 'pixel': pixel}]))
plotDF = plotDF.reset_index(drop=True)
but this seems really inefficient.
How can I do this more efficient, possibly with https://github.com/kieferk/dfply?
I think you can use numpy.repeat for repeat values by legths by str.len and flat values of nested lists by chain.
from itertools import chain
s = pd.Series(plotData)
df2 = pd.DataFrame({
"n": np.repeat(s.index + 1, s.str.len()),
"pixel": list(chain.from_iterable(s))})
print (df2)
n pixel
0 1 1
1 1 2
2 1 1
3 1 1
4 1 1
5 1 2
6 1 4
7 1 2
8 1 1
9 2 1
10 2 2
11 2 2
12 2 1
13 2 1
14 2 3
15 2 1
16 2 3
17 3 1
18 3 1
19 3 2
20 3 3
21 3 4
22 3 1
23 3 1
24 3 1
25 3 1
26 3 1
27 3 1
28 3 4

pandas python sorting according to a pattern

I have a pandas data frame that consists of 5 columns. The second column has the numbers 1 to 500 repeated 5 times. As a shorter example the second column is something like this (1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3) and I want to sort it to look like this (1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4). The code i am using to sort is df=res.sort([2],ascending=True) but this code sorts it (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4).
Any help will be much appreciated. Thanks
How's about this: sort by the cumcount and then the value itself:
In [11]: df = pd.DataFrame({"s": [1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3]})
In [12]: df.groupby("s").cumcount()
Out[12]:
0 0
1 0
2 0
3 1
4 0
5 1
6 2
7 1
8 2
9 1
10 2
11 3
12 3
13 2
14 3
15 3
dtype: int64
In [13]: df["s_cumcounts"] = df.groupby("s").cumcount()
In [14]: df.sort_values(["s_cumcounts", "s"])
Out[14]:
s s_cumcounts
0 1 0
2 2 0
4 3 0
1 4 0
5 1 1
7 2 1
9 3 1
3 4 1
6 1 2
10 2 2
13 3 2
8 4 2
11 1 3
14 2 3
15 3 3
12 4 3
In [15]: df = df.sort_values(["s_cumcounts", "s"])
In [16]: del df["s_cumcounts"]

Categories