Pandas drop duplicates based on 2 columns sometimes reversed - python

I have a DF that looks something like
c1 c2 c3
1 A B x
2 A C y
3 B A x
4 B D z
5 A B y
As you can see, rows 1 and 3 are duplicates if we disregard which of c1 and c2 holds which value (the values are simply reversed); row 5, by contrast, repeats row 1 without any reversal. How can I drop rows based on columns c1 and c2, regardless of which of the two columns each value appears in?
Thanks in advance

OK, let's try frozenset: it turns each row's pair of values into an order-insensitive set, which we can then feed to duplicated:
df[~df[['c1','c2']].apply(frozenset,axis=1).duplicated()]
Out[666]:
c1 c2 c3
1 A B x
2 A C y
4 B D z

You can select the column subset, sort the values row-wise with numpy.sort, create a new DataFrame from the resulting array, and use DataFrame.duplicated, filtering with the inverted condition via boolean indexing:
df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1), index=df.index).duplicated()]
print (df)
c1 c2 c3
1 A B x
2 A C y
4 B D z
Or:
df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1)).duplicated().values]

It can also be done by sorting the row values with sorted():
df[~df[['c1','c2']].apply(lambda row: sorted(row), axis = 1).duplicated()]
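For reference, a minimal runnable sketch that rebuilds the example frame and applies the frozenset and numpy.sort approaches (the index values are assumed from the question):
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame(
    {"c1": ["A", "A", "B", "B", "A"],
     "c2": ["B", "C", "A", "D", "B"],
     "c3": ["x", "y", "x", "z", "y"]},
    index=[1, 2, 3, 4, 5],
)

# Order-insensitive duplicate check via frozenset
print(df[~df[["c1", "c2"]].apply(frozenset, axis=1).duplicated()])

# Same idea via numpy.sort on the two columns
key = pd.DataFrame(np.sort(df[["c1", "c2"]], axis=1), index=df.index)
print(df[~key.duplicated()])
Both variants keep rows 1, 2 and 4, as in the outputs above.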

Related

How to group a certain part of a heading in a dataset using pandas?

I have a dataset in which I have to group each sample. The group of each sample is part of the sample's name. Each row has a semi-unique heading, e.g.
TCGA.02.0047.GBM.C4, TCGA.02.0055.GBM.C4, TCGA.ZS.A9CG.LIHC.C3, TCGA.ZU.A8S4.CHOL.C1, TCGA.ZX.AA5X.CESC.C2.
I need to target the C part of the heading and group the samples so that each one ends up in C1, C2, C3, or C4.
How would I go about doing this?
For example, say you have a dataset like this:
import pandas as pd
df = pd.DataFrame({"Column_A": ["TCGA.02.0047.GBM.C4", "TCGA.02.0055.GBM.C4", "TCGA.ZS.A9CG.LIHC.C3", "TCGA.ZU.A8S4.CHOL.C1", "TCGA.ZX.AA5X.CESC.C2"]})
Column_A
0 TCGA.02.0047.GBM.C4
1 TCGA.02.0055.GBM.C4
2 TCGA.ZS.A9CG.LIHC.C3
3 TCGA.ZU.A8S4.CHOL.C1
4 TCGA.ZX.AA5X.CESC.C2
You can add a new column holding the group:
df["Group"] = df["Column_A"].str[-2:]
Column_A Group
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2
If the sample names are your column headers, you can extract the part after the last period and use it as the grouper:
df.groupby(df.columns.str.extract('([^.]+)$', expand=False), axis=1)
Then perform the desired aggregation.
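For instance, a minimal sketch of that column-wise grouping, assuming the sample names are column headers over numeric data (the values and the mean aggregation are made up for illustration; grouping on axis=1 still works but is deprecated in recent pandas):
import pandas as pd

# Hypothetical wide frame: one column per sample
df = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]],
    columns=["TCGA.02.0047.GBM.C4", "TCGA.02.0055.GBM.C4", "TCGA.ZS.A9CG.LIHC.C3"],
)

# Group columns by the text after the last period (C4, C4, C3) and average within each group
grouper = df.columns.str.extract('([^.]+)$', expand=False)
print(df.groupby(grouper, axis=1).mean())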
If the sample names are in a single column:
df['new'] = df['col'].str.extract('([^.]+)$')
Output:
col new
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2

Pandas Melt Trouble in Python

I have a csv file data frame that looks like the following:
My goal is to melt (transform) the dataframe into a refined dataframe that looks like the following:
This is my code up to now:
import glob, pandas as pd
file = r"C:\Users\jrivera\OneDrive - Accelerate Resources\Documents\Python\maverickAvgTCProductionInput.csv"
dfTotal = pd.DataFrame()
for prd in glob.glob(file):
    df = pd.read_csv(prd)
    dfTotal = pd.concat([dfTotal, df])
dfTotal.shape
dfHDprd = pd.read_csv(r"C:\Users\jrivera\OneDrive - Accelerate Resources\Documents\Python\maverickAvgTCProductionInput.csv")
id_vars, dct = ["TCA","MONTH",],{}
for x in ["OIL", "GAS"]:
dct["value_vars_%s" % x] = ["NORM_%s"%x]
dfNew = pd.melt(frame = dfHDprd, id_vars = ["TCA", "MONTHS"], value_vars = ["NORM_OIL_1KFT", "NORM_GAS_1KFT"], var_name= "OIL", var_value = "GAS")
I'm not really sure what your goal is; from the link it just seems like you want to limit the months to 0-3 and remove some columns. I would suggest explicitly explaining what you need.
pd.melt is used to convert a wide dataframe into a long dataframe by 'melting' columns, with the variable names (NORM_OIL_1KFT, NORM_GAS_1KFT) going into the rows instead of as column headers. I don't think this is what you are looking for.
If you simply want to retain only the columns in your desired dataframe:
new_df = dfHDprd[['TCA','MONTH','NORM_OIL_1KFT','NORM_GAS_1KFT']]
new_df.columns = ['TCA','MONTH','OIL','GAS']
If you do want to melt the dataframe (which is probably not what you are looking for), you would need to rewrite your expression like this to see the purpose of melting:
dfNew = pd.melt(frame=dfHDprd, id_vars=["TCA", "MONTHS"], value_vars=["NORM_OIL_1KFT", "NORM_GAS_1KFT"], var_name="FUEL_TYPE", value_name="QUANTITY")
Here var_name names the column that distinguishes the melted variables, and value_name names the column that holds their values (note the parameter is value_name, not var_value).
Trivial example (as I can't copy any of your data):
df = pd.DataFrame({'id':['a','b','c'], 'C1':[1,2,3],'C2':[4,5,6],'C3':[5,6,7]})
>>>
id C1 C2 C3
0 a 1 4 5
1 b 2 5 6
2 c 3 6 7
pd.melt(frame=df, id_vars=['id'], value_vars=['C1','C2','C3'], value_name='value', var_name='variable')
>>>
id variable value
0 a C1 1
1 b C1 2
2 c C1 3
3 a C2 4
4 b C2 5
5 c C2 6
6 a C3 5
7 b C3 6
8 c C3 7
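Applied to the asker's columns, the same pattern would look roughly like this (the data values are invented; only the column names come from the question):
import pandas as pd

# Stand-in for the asker's CSV, with made-up values
dfHDprd = pd.DataFrame({
    "TCA": ["A", "A"],
    "MONTHS": [1, 2],
    "NORM_OIL_1KFT": [10.0, 12.0],
    "NORM_GAS_1KFT": [100.0, 110.0],
})

dfNew = pd.melt(frame=dfHDprd, id_vars=["TCA", "MONTHS"],
                value_vars=["NORM_OIL_1KFT", "NORM_GAS_1KFT"],
                var_name="FUEL_TYPE", value_name="QUANTITY")
print(dfNew)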

pandas iterrows then match

I have a dataframe like this:
C1 C2 C3 C4
A TV /r/tv3 NaN
B Music Pop /r/pop
C /r/foo NaN NaN
I need to iterate through each row, take the value of the first column, and then find the value in that row which starts with /r/. The output should look like this:
A /r/tv3
B /r/pop
C /r/foo
What is the fastest pythonic way to do this?
Using where after startswith
df.where(df.apply(lambda x : x.str.startswith(pat='/r/'),axis=1)).stack().reset_index(level=1,drop=True)
Out[680]:
C1
A /r/tv3
B /r/pop
C /r/foo
dtype: object
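A runnable reconstruction of the same where/startswith/stack idea, with the first column used as the index so the result matches the output above; the na=False flag is an addition (not in the original answer) so the NaN cells yield a clean boolean mask:
import numpy as np
import pandas as pd

# Rebuild the example frame, using C1 as the index
df = pd.DataFrame(
    {"C2": ["TV", "Music", "/r/foo"],
     "C3": ["/r/tv3", "Pop", np.nan],
     "C4": [np.nan, "/r/pop", np.nan]},
    index=pd.Index(["A", "B", "C"], name="C1"),
)

# Mask everything that does not start with /r/, then stack to drop the NaNs
mask = df.apply(lambda col: col.str.startswith('/r/', na=False))
print(df.where(mask).stack().reset_index(level=1, drop=True))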

df.unique() on whole DataFrame based on a column

I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index Id Type
0 a1 A
1 a2 A
2 b1 B
3 b3 B
4 a1 A
...
When I use:
uniqueId = df["Id"].unique()
I get a list of unique IDs.
How can I apply this filtering on the whole DataFrame such that it keeps the structure but that the duplicates (based on "Id") are removed?
It seems you need DataFrame.drop_duplicates with the subset parameter, which specifies the columns that are checked for duplicates:
#keep the first occurrence of each Id (default)
df = df.drop_duplicates(subset=['Id'])
print (df)
Id Type
Index
0 a1 A
1 a2 A
2 b1 B
3 b3 B
#keep the last occurrence of each Id
df = df.drop_duplicates(subset=['Id'], keep='last')
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
4 a1 A
#drop every row whose Id appears more than once
df = df.drop_duplicates(subset=['Id'], keep=False)
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
It's also possible to call duplicated() to flag the duplicates and filter the frame with the negated flags.
df = df[~df.duplicated(subset=['Id'])].copy()
This is particularly useful if you want to drop duplicates conditionally, e.g. only duplicates of a specific value. For example, the following code drops duplicate 'a1's from column Id (duplicates of other values are not dropped).
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
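A quick check of that condition on the sample frame (rebuilt from the question, without the named index):
import pandas as pd

df = pd.DataFrame({"Id": ["a1", "a2", "b1", "b3", "a1"],
                   "Type": ["A", "A", "B", "B", "A"]})

# Drop only rows that are both duplicated and equal to 'a1'
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
print(new_df)   # the second 'a1' row is gone; everything else is kept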

How to not sort the index in pandas

I have 2 data frames with one column each. Index of the first is [C,B,F,A,Z] not sorted in any way. Index of the second is [C,B,Z], also unsorted.
I use pd.concat([df1,df2],axis=1) and get a data frame with 2 columns and NaN in the second column where there is no appropriate value for the index.
The problem I have is that index automatically becomes sorted in alphabetical order.
I have tried pd.concat([df1, df2], axis=1, names=my_list) where my_list = [C,B,F,A,Z], but that didn't change anything.
How can I keep the index unsorted?
This seems to be by design; the only thing I'd suggest is to call reindex on the concatenated df and pass it the index of df:
In [56]:
df = pd.DataFrame(index=['C','B','F','A','Z'], data={'a':np.arange(5)})
df
Out[56]:
a
C 0
B 1
F 2
A 3
Z 4
In [58]:
df1 = pd.DataFrame(index=['C','B','Z'], data={'b':np.random.randn(3)})
df1
Out[58]:
b
C -0.146799
B -0.227027
Z -0.429725
In [67]:
pd.concat([df,df1],axis=1).reindex(df.index)
Out[67]:
a b
C 0 -0.146799
B 1 -0.227027
F 2 NaN
A 3 NaN
Z 4 -0.429725
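An alternative worth knowing (not part of the original answer) is DataFrame.join, which defaults to a left join and therefore keeps the calling frame's index order without a separate reindex:
# df and df1 as above; join defaults to how='left', preserving df's index order
print(df.join(df1))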
