Python Pandas dataframe is not including all duplicates

I'm trying to create a Pandas dataframe (CQUAD_mech_loads) that is a subset of a larger dataframe (CQUAD_Mech), filtered on two conditions. There are NO duplicates in the larger dataframe (CQUAD_Mech).
The problem is that the subset dataframe doesn't include the duplicate IDs in the ELM column. It does, however, include duplicates in the LC column.
CQUAD_ELM is a list of four IDs ([387522, 387522, 387506, 387507]), in which 387522 appears twice. Right now, CQUAD_mech_loads has only three rows, one per unique ID. I want a fourth row for the duplicate ID as well.
The code:
def get_df(df, col1, cond1, col2='', cond2=0):
    return df[(df[col1] == cond1) & (df[col2].isin(cond2))].reset_index(drop=True)
CQUAD_mech_loads = get_df(CQUAD_Mech,'LC', LC, 'ELM', CQUAD_ELM)
The output (where is the other line for 387522?):
LC ELM FX FY FXY
0 3113 387506 0 0 0
1 3113 387507 0 0 0
2 3113 387522 0 0 0

Since you're dropping the index anyway, you can set the index to the column you're interested in and use label-based .loc indexing (.ix, which older pandas versions offered for this, was removed in pandas 1.0):
In [28]: df = pd.DataFrame(np.arange(25).reshape(5,5))
In [29]: df
Out[29]:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
4 20 21 22 23 24
In [30]: df.set_index(4, drop=False).loc[[4,4,19,4,24]].reset_index(drop=True)
Out[30]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 15 16 17 18 19
3 0 1 2 3 4
4 20 21 22 23 24
EDIT: Your current method builds a boolean mask, so it only finds each distinct col1/col2 pair once, however often an ID is repeated in the list. If you want to filter on multiple columns, do the lookup twice, once for each column:
In [98]: df.set_index(1, drop=False).loc[[1, 6, 16]].set_index(4, drop=False).loc[[4,4,4,4,4,4,4,4,19,9]].reset_index(drop=True)
Out[98]:
0 1 2 3 4
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4
6 0 1 2 3 4
7 0 1 2 3 4
8 15 16 17 18 19
9 5 6 7 8 9
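
As an alternative to reindexing, a left merge from the ID list also keeps repeated IDs. The following is only a minimal sketch: the small CQUAD_Mech frame and its values are made up for illustration, and only the names CQUAD_Mech, CQUAD_ELM and LC come from the question.

```python
import pandas as pd

# Hypothetical stand-in for CQUAD_Mech: one row per (LC, ELM) pair, no duplicates.
CQUAD_Mech = pd.DataFrame({
    "LC":  [3113, 3113, 3113, 3114],
    "ELM": [387506, 387507, 387522, 387522],
    "FX":  [0, 0, 0, 1],
})
CQUAD_ELM = [387522, 387522, 387506, 387507]
LC = 3113

# isin() builds a boolean mask, so each source row is selected at most once.
mask_result = CQUAD_Mech[(CQUAD_Mech["LC"] == LC) & CQUAD_Mech["ELM"].isin(CQUAD_ELM)]

# A left merge from the ID list yields one output row per list entry, in list order.
wanted = pd.DataFrame({"ELM": CQUAD_ELM})
merged = wanted.merge(CQUAD_Mech[CQUAD_Mech["LC"] == LC], on="ELM", how="left")

print(len(mask_result), len(merged))
```

With this made-up data the mask-based result has three rows, while the merge returns four, with 387522 appearing twice.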

Related

Pandas Structured 2D Data to XYZ Table

I want to create an xyz table from a structured grid representation of data in a Pandas DataFrame.
What I have is:
# Grid Nodes/indices
x=np.arange(5)
y=np.arange(5)
# DataFrame
df = pd.DataFrame(np.random.rand(5,5), columns=x, index=y)
>>>df
0 1 2 3 4
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199
I want to convert the above DataFrame to this DataFrame structure:
>>>df
x y val
0 0 #
0 1 #
...
4 4 #
I could write a for loop to do this, but I believe it should be possible with pivot, stack, or some other built-in method; I just can't work it out from the documentation. What I've tried seems to create multilevel DataFrames, which I do not want. Bonus points for converting it back.

dff = df.stack().reset_index(name="values")
pd.pivot_table(index="level_0",columns="level_1",values="values",data=dff)
The first line, taken from the previous answer, stacks the grid into long format; the second line unstacks it back into a grid.
# Stacking
level_0 level_1 values
0 0 0 0.536047
1 0 1 0.673782
2 0 2 0.935536
3 0 3 0.853286
4 0 4 0.916081
5 1 0 0.884820
6 1 1 0.438207
7 1 2 0.070120
8 1 3 0.292445
9 1 4 0.789046
10 2 0 0.899633
11 2 1 0.822928
12 2 2 0.445154
13 2 3 0.643797
14 2 4 0.776154
15 3 0 0.682129
16 3 1 0.974159
17 3 2 0.078451
18 3 3 0.306872
19 3 4 0.689137
20 4 0 0.117842
21 4 1 0.770962
22 4 2 0.861076
23 4 3 0.429738
24 4 4 0.149199
# Unstacking
level_1 0 1 2 3 4
level_0
0 0.536047 0.673782 0.935536 0.853286 0.916081
1 0.884820 0.438207 0.070120 0.292445 0.789046
2 0.899633 0.822928 0.445154 0.643797 0.776154
3 0.682129 0.974159 0.078451 0.306872 0.689137
4 0.117842 0.770962 0.861076 0.429738 0.149199

Use df.stack with df.reset_index:
In [4474]: df = df.stack().reset_index(name='value').rename(columns={'level_0':'x', 'level_1': 'y'})
In [4475]: df
Out[4475]:
x y value
0 0 0 0.772210
1 0 1 0.921495
2 0 2 0.903645
3 0 3 0.980514
4 0 4 0.156923
5 1 0 0.516448
6 1 1 0.121148
7 1 2 0.394074
8 1 3 0.532963
9 1 4 0.369175
10 2 0 0.605971
11 2 1 0.712189
12 2 2 0.866299
13 2 3 0.174830
14 2 4 0.042236
15 3 0 0.350161
16 3 1 0.100152
17 3 2 0.049185
18 3 3 0.808631
19 3 4 0.562624
20 4 0 0.090918
21 4 1 0.713475
22 4 2 0.723183
23 4 3 0.569887
24 4 4 0.980238
For converting it back, use df.pivot (pandas 2.0 and later require index and columns to be passed as keywords):
In [4481]: unstacked_df = df.pivot(index='x', columns='y')
In [4482]: unstacked_df
Out[4482]:
value
y 0 1 2 3 4
x
0 0.772210 0.921495 0.903645 0.980514 0.156923
1 0.516448 0.121148 0.394074 0.532963 0.369175
2 0.605971 0.712189 0.866299 0.174830 0.042236
3 0.350161 0.100152 0.049185 0.808631 0.562624
4 0.090918 0.713475 0.723183 0.569887 0.980238
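
The round trip can also be written without a rename step and without the extra 'value' column level. This is a sketch under assumptions: the variable names grid, xyz and back are made up here, and the axis names "x" (columns) and "y" (index) follow the question's setup.

```python
import numpy as np
import pandas as pd

x = np.arange(5)
y = np.arange(5)
grid = pd.DataFrame(np.random.rand(5, 5), columns=x, index=y)

# Name the axes first so reset_index() yields "y" and "x" instead of level_0/level_1.
xyz = grid.rename_axis(index="y", columns="x").stack().reset_index(name="val")

# Passing values= explicitly makes pivot() return single-level columns.
back = xyz.pivot(index="y", columns="x", values="val")

print(xyz.head())
print(back.shape)
```

Naming the values column in pivot is what avoids the multilevel column header seen above.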

pandas create category column based on sequence repetition in another column

This is very likely a duplicate, but I'm not sure what to search for to find it.
I have a column in a dataframe that cycles from 0 to some value a number of times (in my example it cycles to 4 three times). I want to create another column that simply shows which cycle it is. Example:
import pandas as pd
df = pd.DataFrame({'A':[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]})
df['desired_output'] = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
print(df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
I was thinking maybe something along the lines of a groupby(), cumsum() and transform(), but I'm not quite sure how to implement it. Could be wrong though.

Compare the column with 0 using Series.eq, take the running total with Series.cumsum, and finally subtract 1:
df['desired_output'] = df['A'].eq(0).cumsum() - 1
print (df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
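
If the cycles are not guaranteed to restart at exactly 0, one possible variant is to start a new cycle whenever the value drops. The sketch below shows both approaches side by side; the column names cycle and cycle_by_drop are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]})

# Approach from the answer: every 0 marks the start of a new cycle.
df["cycle"] = df["A"].eq(0).cumsum() - 1

# Variant (an assumption about the data): a new cycle starts whenever the value decreases.
# diff() is NaN on the first row, and NaN < 0 is False, so counting starts at 0.
df["cycle_by_drop"] = df["A"].diff().lt(0).cumsum()

print(df)
```

On the example data both columns are identical; they differ only if a cycle begins at a value other than 0.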

Creating new column in pandas from values of other column

I am dealing with the following dataframe:
p q
0 11 2
1 11 2
2 11 2
3 11 3
4 11 3
5 12 2
6 12 2
7 13 2
8 13 2
I want to create a new column, say s, that starts at 0 and increments. It is based on the "p" column: whenever p changes, s should change too.
For the first 4 rows, p = 11, so the s column should be 0 for those 4 rows, and so on.
Below is the expected df:
s p q
0 0 11 2
1 0 11 2
2 0 11 2
3 0 11 2
4 1 11 4
5 1 11 4
6 1 11 4
7 1 11 4
8 2 12 2
9 2 12 2
10 2 12 2
11 3 12 3
12 3 12 3

You need diff with cumsum (subtract one if you want the id to start from 0):
df["finalID"] = (df.ProjID.diff() != 0).cumsum()
df
Update: if you want to take both voyg_id and ProjID into consideration, you can use an OR condition on the two column differences, so that whichever column changes, the final id increases.
df['final_id'] = ((df.voyg_id.diff() != 0) | (df.ProjID.diff() != 0)).cumsum()
df
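
Applied to the p and q columns shown in the question, the same idea might look like the sketch below. The column names s and s_pq are made up here, and the data is the first table from the question rather than the expected output.

```python
import pandas as pd

df = pd.DataFrame({
    "p": [11, 11, 11, 11, 11, 12, 12, 13, 13],
    "q": [2, 2, 2, 3, 3, 2, 2, 2, 2],
})

# diff() is NaN on the first row, and NaN != 0 is True, so cumsum() starts at 1;
# subtracting 1 makes the id start from 0.
df["s"] = (df["p"].diff() != 0).cumsum() - 1

# Increase the id when either column changes.
df["s_pq"] = ((df["p"].diff() != 0) | (df["q"].diff() != 0)).cumsum() - 1

print(df)
```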

Transform list to dataframe efficiently

I have a list of images and I want to get all the pixels of each image in one DataFrame column and the number of the image into another column. I am trying to do it with
plotDF = DataFrame()
plotData = [np.array([[1,2,1],[1,1,2],[4,2,1]]), np.array([[1,2,2,1],[1,3,1,3]]), np.array([[1,1,2,3],[4,1,1,1],[1,1,1,4]])]
plotData = [image.flatten() for image in plotData]
for n, pD in zip(range(len(plotData)), plotData):
    for pixel in pD:
        plotDF = plotDF.append(DataFrame.from_records([{'n': n, 'pixel': pixel}]))
plotDF = plotDF.reset_index(drop=True)
but this seems really inefficient.
How can I do this more efficiently, possibly with https://github.com/kieferk/dfply?

I think you can use numpy.repeat to repeat each image number by the length of its flattened array (obtained with str.len), and itertools.chain to flatten the nested arrays:
from itertools import chain
s = pd.Series(plotData)
df2 = pd.DataFrame({
    "n": np.repeat(s.index + 1, s.str.len()),
    "pixel": list(chain.from_iterable(s))})
print (df2)
n pixel
0 1 1
1 1 2
2 1 1
3 1 1
4 1 1
5 1 2
6 1 4
7 1 2
8 1 1
9 2 1
10 2 2
11 2 2
12 2 1
13 2 1
14 2 3
15 2 1
16 2 3
17 3 1
18 3 1
19 3 2
20 3 3
21 3 4
22 3 1
23 3 1
24 3 1
25 3 1
26 3 1
27 3 1
28 3 4
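
A pure NumPy variant of the same idea is sketched below, assuming every image is a NumPy array as in the question. It numbers the images from 0, like the loop in the question, rather than from 1 as in the output above; the names flat and sizes are made up for illustration.

```python
import numpy as np
import pandas as pd

plotData = [
    np.array([[1, 2, 1], [1, 1, 2], [4, 2, 1]]),
    np.array([[1, 2, 2, 1], [1, 3, 1, 3]]),
    np.array([[1, 1, 2, 3], [4, 1, 1, 1], [1, 1, 1, 4]]),
]

# Flatten each image and record how many pixels it has.
flat = [image.ravel() for image in plotData]
sizes = [len(pixels) for pixels in flat]

# Build both columns in one step instead of appending row by row.
plotDF = pd.DataFrame({
    "n": np.repeat(np.arange(len(flat)), sizes),
    "pixel": np.concatenate(flat),
})

print(plotDF)
```

Building the two arrays up front avoids creating a new DataFrame for every pixel, which is what makes the row-by-row loop slow.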

pandas python sorting according to a pattern

I have a pandas data frame with 5 columns. The second column contains the numbers 1 to 500, repeated 5 times. As a shorter example, the second column looks something like (1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3), and I want to sort it to look like (1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4). The code I am using to sort is df=res.sort([2],ascending=True), but this sorts it as (1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4).
Any help will be much appreciated. Thanks

How about this: sort by the cumcount and then by the value itself:
In [11]: df = pd.DataFrame({"s": [1,4,2,4,3,1,1,2,4,3,2,1,4,3,2,3]})
In [12]: df.groupby("s").cumcount()
Out[12]:
0 0
1 0
2 0
3 1
4 0
5 1
6 2
7 1
8 2
9 1
10 2
11 3
12 3
13 2
14 3
15 3
dtype: int64
In [13]: df["s_cumcounts"] = df.groupby("s").cumcount()
In [14]: df.sort_values(["s_cumcounts", "s"])
Out[14]:
s s_cumcounts
0 1 0
2 2 0
4 3 0
1 4 0
5 1 1
7 2 1
9 3 1
3 4 1
6 1 2
10 2 2
13 3 2
8 4 2
11 1 3
14 2 3
15 3 3
12 4 3
In [15]: df = df.sort_values(["s_cumcounts", "s"])
In [16]: del df["s_cumcounts"]
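
The same steps can be chained so the helper column never has to be added and deleted by hand. This is only a sketch; the temporary column name occurrence and the variable result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"s": [1, 4, 2, 4, 3, 1, 1, 2, 4, 3, 2, 1, 4, 3, 2, 3]})

result = (
    # Number each occurrence of a value: 0 for its first appearance, 1 for the second, ...
    df.assign(occurrence=df.groupby("s").cumcount())
      # Sort by occurrence first, then by the value within each occurrence.
      .sort_values(["occurrence", "s"])
      .drop(columns="occurrence")
      .reset_index(drop=True)
)

print(result["s"].tolist())
```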
