df.unique() on whole DataFrame based on a column - python

I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index Id Type
0 a1 A
1 a2 A
2 b1 B
3 b3 B
4 a1 A
...
When I use:
uniqueId = df["Id"].unique()
I get a list of unique IDs.
How can I apply this filtering on the whole DataFrame such that it keeps the structure but that the duplicates (based on "Id") are removed?

It seems you need DataFrame.drop_duplicates with the parameter subset, which specifies which column(s) are checked for duplicates:
#keep first duplicate value
df = df.drop_duplicates(subset=['Id'])
print (df)
Id Type
Index
0 a1 A
1 a2 A
2 b1 B
3 b3 B
#keep last duplicate value
df = df.drop_duplicates(subset=['Id'], keep='last')
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B
4 a1 A
#remove all duplicate values
df = df.drop_duplicates(subset=['Id'], keep=False)
print (df)
Id Type
Index
1 a2 A
2 b1 B
3 b3 B

It's also possible to call duplicated() to flag the duplicates and keep only the rows where the flag is negated.
df = df[~df.duplicated(subset=['Id'])].copy()
This is particularly useful if you want to conditionally drop duplicates, e.g. drop duplicates of a specific value, etc. For example, the following code drops duplicate 'a1's from column Id (other duplicates are not dropped).
new_df = df[~df['Id'].duplicated() | df['Id'].ne('a1')].copy()
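For reference, here is a minimal runnable sketch that puts both approaches together on the sample data (values assumed from the question):
import pandas as pd

# sample frame from the question
df = pd.DataFrame({"Id": ["a1", "a2", "b1", "b3", "a1"],
                   "Type": ["A", "A", "B", "B", "A"]})

# drop_duplicates keeps the first occurrence of each Id by default
first = df.drop_duplicates(subset=["Id"])

# boolean-mask equivalent: keep rows whose Id has not been seen before
mask = df[~df.duplicated(subset=["Id"])].copy()

print(first.equals(mask))  # True - both keep rows 0, 1, 2 and 3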

Related

How to group a certain part of a heading in a dataset using pandas?

I have a dataset in which I have to group each sample. The group of each sample is part of the sample's name. Each row has a semi-unique heading, e.g.
TCGA.02.0047.GBM.C4, TCGA.02.0055.GBM.C4, TCGA.ZS.A9CG.LIHC.C3, TCGA.ZU.A8S4.CHOL.C1, TCGA.ZX.AA5X.CESC.C2.
I need to target the C part of the heading and group the values so that each sample ends up in either C1, C2, C3 or C4.
How would I go about doing this?
For example, suppose you have a dataset like this:
import pandas as pd
df = pd.DataFrame({"Column_A": ["TCGA.02.0047.GBM.C4", "TCGA.02.0055.GBM.C4", "TCGA.ZS.A9CG.LIHC.C3", "TCGA.ZU.A8S4.CHOL.C1", "TCGA.ZX.AA5X.CESC.C2"]})
Column_A
0 TCGA.02.0047.GBM.C4
1 TCGA.02.0055.GBM.C4
2 TCGA.ZS.A9CG.LIHC.C3
3 TCGA.ZU.A8S4.CHOL.C1
4 TCGA.ZX.AA5X.CESC.C2
You can add a new column with the group:
df["Group"] = df["Column_A"].str[-2:]
Column_A Group
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2
if the sample names are your column names
You can extract the part after the last period and use it as the grouper:
df.groupby(df.columns.str.extract('([^.]+)$', expand=False), axis=1)
Then perform the desired aggregation.
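To make the column-wise case concrete, here is a hedged sketch with made-up numeric data (the column names are hypothetical; note that axis=1 grouping is deprecated in recent pandas versions, where you can transpose, group, and transpose back instead):
import pandas as pd

# hypothetical wide frame: one numeric column per sample, names ending in the group label
wide = pd.DataFrame({"TCGA.02.0047.GBM.C4": [1, 2],
                     "TCGA.ZU.A8S4.CHOL.C1": [3, 4],
                     "TCGA.02.0055.GBM.C4": [5, 6]})

grouper = wide.columns.str.extract('([^.]+)$', expand=False)  # ['C4', 'C1', 'C4']
print(wide.groupby(grouper, axis=1).sum())
#    C1  C4
# 0   3   6
# 1   4   8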
if the sample names are in a single column
df['new'] = df['col'].str.extract('([^.]+)$')
Output:
col new
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2
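Since the question ultimately asks to group the samples, a short follow-up sketch showing one example aggregation once the label is extracted (counting samples per group; names are taken from the answer above):
import pandas as pd

df = pd.DataFrame({'col': ["TCGA.02.0047.GBM.C4", "TCGA.02.0055.GBM.C4",
                           "TCGA.ZS.A9CG.LIHC.C3", "TCGA.ZU.A8S4.CHOL.C1",
                           "TCGA.ZX.AA5X.CESC.C2"]})
# expand=False keeps the extracted group label as a Series
df['new'] = df['col'].str.extract('([^.]+)$', expand=False)

# group the samples by the extracted label, e.g. count how many fall into each C group
print(df.groupby('new').size())
# C1    1
# C2    1
# C3    1
# C4    2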

Pandas Melt Trouble in Python

I have a csv file data frame that looks like the following:
My goal is to melt (transform) the dataframe into a refined dataframe that looks like the following:
This is my code up to now:
import glob, pandas as pd
file = r"C:\Users\jrivera\OneDrive - Accelerate Resources\Documents\Python\maverickAvgTCProductionInput.csv"
dfTotal = pd.DataFrame()
for prd in glob.glob(file):
    df = pd.read_csv(prd)
    dfTotal = pd.concat([dfTotal, df])
dfTotal.shape
dfHDprd = pd.read_csv(r"C:\Users\jrivera\OneDrive - Accelerate Resources\Documents\Python\maverickAvgTCProductionInput.csv")
id_vars, dct = ["TCA","MONTH",],{}
for x in ["OIL", "GAS"]:
dct["value_vars_%s" % x] = ["NORM_%s"%x]
dfNew = pd.melt(frame = dfHDprd, id_vars = ["TCA", "MONTHS"], value_vars = ["NORM_OIL_1KFT", "NORM_GAS_1KFT"], var_name= "OIL", var_value = "GAS")
I'm not really sure what your goal is; from the link it just seems like you want to limit the months to 0-3 and remove some columns. I would suggest explicitly explaining what you need.
pd.melt is used to convert a wide dataframe into a long dataframe by 'melting' columns, with the variable names (NORM_OIL_1KFT, NORM_GAS_1KFT) going into the rows instead of as column headers. I don't think this is what you are looking for.
If you simply want to retain only the columns in your desired dataframe:
new_df = dfHDprd[['TCA','MONTH','NORM_OIL_1KFT','NORM_GAS_1KFT']]
new_df.columns = ['TCA','MONTH','OIL','GAS']
To actually melt the dataframe (which is probably not what you are looking to do), you would need to re-define your expression like this:
dfNew = pd.melt(frame = dfHDprd, id_vars = ["TCA", "MONTHS"], value_vars = ["NORM_OIL_1KFT", "NORM_GAS_1KFT"], var_name = "FUEL_TYPE", value_name = "QUANTITY")
Here var_name is the column header distinguishing between the variables which get melted into the dataframe, and value_name is the column header labelling the melted values.
Trivial example (as I can't copy any of your data):
df = pd.DataFrame({'id':['a','b','c'], 'C1':[1,2,3],'C2':[4,5,6],'C3':[5,6,7]})
>>>
id C1 C2 C3
0 a 1 4 5
1 b 2 5 6
2 c 3 6 7
pd.melt(frame=df, id_vars=['id'], value_vars=['C1','C2','C3'], value_name='value', var_name='variable')
>>>
id variable value
0 a C1 1
1 b C1 2
2 c C1 3
3 a C2 4
4 b C2 5
5 c C2 6
6 a C3 5
7 b C3 6
8 c C3 7
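For completeness, a hedged end-to-end sketch of the corrected melt call; the data is made up since the original CSV is not available, and the column names are assumed from the question:
import pandas as pd

# hypothetical stand-in for dfHDprd
dfHDprd = pd.DataFrame({"TCA": ["T1", "T1"],
                        "MONTHS": [1, 2],
                        "NORM_OIL_1KFT": [10.0, 11.0],
                        "NORM_GAS_1KFT": [100.0, 120.0]})

# value_name (not var_value) names the column that holds the melted values
dfNew = pd.melt(frame=dfHDprd, id_vars=["TCA", "MONTHS"],
                value_vars=["NORM_OIL_1KFT", "NORM_GAS_1KFT"],
                var_name="FUEL_TYPE", value_name="QUANTITY")
print(dfNew)
#   TCA  MONTHS      FUEL_TYPE  QUANTITY
# 0  T1       1  NORM_OIL_1KFT      10.0
# 1  T1       2  NORM_OIL_1KFT      11.0
# 2  T1       1  NORM_GAS_1KFT     100.0
# 3  T1       2  NORM_GAS_1KFT     120.0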

Pandas: conditionally concatenate original columns with a string

INPUT>df1
ColumnA ColumnB
A1 NaN
A1A2 NaN
A3 NaN
What I am trying to do is change ColumnB's values conditionally, by iterating over ColumnA and appending remarks to ColumnB.
The previous value of ColumnB should be kept when the new string is added.
For the sample dataframe, what I want to do is:
Check if ColumnA contains A1; if so, append the string "A1" to ColumnB (without clearing its previous value).
Check if ColumnA contains A2; if so, append the string "A2" to ColumnB (without clearing its previous value).
OUTPUT>df1
ColumnA ColumnB
A1 A1
A1A2 A1_A2
A3 NaN
I have tried the following code but it is not working well.
Could anyone give me some advice? Thanks.
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A1'), df1['ColumnB']+"_A1",df1['ColumnB'])
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A2'), df1['ColumnB']+"_A2",df1['ColumnB'])
One way using pandas.Series.str.findall with join:
key = ["A1", "A2"]
df["ColumnB"] = df["ColumnA"].str.findall("|".join(key)).str.join("_")
print(df)
Output:
ColumnA ColumnB
0 A1 A1
1 A1A2 A1_A2
2 A3
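If you prefer the no-match rows to stay NaN (as in the desired output) rather than become empty strings, a small follow-up to the answer above, assuming numpy is available:
import numpy as np

key = ["A1", "A2"]
joined = df["ColumnA"].str.findall("|".join(key)).str.join("_")
# rows with no match come back as empty strings; swap them for NaN
df["ColumnB"] = joined.replace("", np.nan)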
You cannot add or append strings to np.nan. That means you would always need to check whether a position in your ColumnB is still np.nan or already a string before setting its new value. If all you want to do is work with text, you could initialize ColumnB with empty strings and append the selected string pieces from ColumnA as:
import pandas as pd
import numpy as np
I = pd.DataFrame({'ColA': ['A1', 'A1A2', 'A2', 'A3']})
I['ColB'] = ''
I.loc[I.ColA.str.contains('A1'), 'ColB'] += 'A1'
print(I)
I.loc[I.ColA.str.contains('A2'), 'ColB'] += 'A2'
print(I)
The output is:
ColA ColB
0 A1 A1
1 A1A2 A1
2 A2
3 A3
ColA ColB
0 A1 A1
1 A1A2 A1A2
2 A2 A2
3 A3
Note: this is a very verbose version as an example.
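Relatedly, the original np.where attempt fails only because NaN plus a string stays NaN; a minimal sketch of that approach, assuming ColumnB is first filled with empty strings:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ColumnA': ['A1', 'A1A2', 'A3'], 'ColumnB': np.nan})

# fill NaN with '' so that string concatenation works
df1['ColumnB'] = df1['ColumnB'].fillna('')
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A1'), df1['ColumnB'] + 'A1', df1['ColumnB'])
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A2'), df1['ColumnB'] + '_A2', df1['ColumnB'])
print(df1)  # A1 -> A1, A1A2 -> A1_A2, A3 -> ''
# caveat: a value containing only A2 would get a leading underscore with this ordering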

How can I concatenate two dataframes with different multi-indexes without merging indexes?

I have two dataframes. Each has a two-level multi-index. The first level is the same in each, but the second level is different. I would like to concatenate the dataframes and end up with a dataframe with a three-level multi-index, where records from the first dataframe would have 'NaN' in the third index level, and records from the second dataframe would have 'NaN' in the second index level. Instead, I get a dataframe with a two-level index, where the values in the second level of each dataframe are put in the same index level, which takes the name of the second level in the first dataframe (see code below).
Is there a nice way to do this? I could make the second level of each index into a column, concatenate, then put them back into the index, but this seems like a roundabout way of doing it to me.
df1 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-2':['a2','b2','c2','d2'], 'values':[1,2,3,4]})
df2 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-3':['a3','b3','c3','d3'], 'values':[5,6,7,8]})
df1.set_index(['index-1','index-2'], inplace=True)
df2.set_index(['index-1','index-3'], inplace=True)
pd.concat([df1, df2])
Thanks!
It'll be easier to reset the index on the two input dataframes, concat them and then set the index again:
pd.concat([df1.reset_index(), df2.reset_index()], sort=False) \
.set_index(['index-1', 'index-2', 'index-3'])
Result:
values
index-1 index-2 index-3
a1 a2 NaN 1
b1 b2 NaN 2
c1 c2 NaN 3
d1 d2 NaN 4
a1 NaN a3 5
b1 NaN b3 6
c1 NaN c3 7
d1 NaN d3 8

Pandas drop duplicates based on 2 columns sometimes reversed

I have a DF that looks something like
c1 c2 c3
1 A B x
2 A C y
3 B A x
4 B D z
5 A B y
As you can see, lines 1 and 3 are repeated if we disregard that c1 and c2 are different columns (or if they become reversed). However, line 5 is not. How can I drop rows based on columns c1 and c2, regardless of where the repeated values are?
Thanks in advance
OK, let us try something new: frozenset. It turns each c1/c2 pair into an order-independent set, which we can then pass to duplicated:
df[~df[['c1','c2']].apply(frozenset,axis=1).duplicated()]
Out[666]:
c1 c2 c3
1 A B x
2 A C y
4 B D z
You can select the columns by subset, sort the values within each row with numpy.sort, create a new DataFrame from the resulting array, and use DataFrame.duplicated with the inverted condition for boolean indexing:
df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1), index=df.index).duplicated()]
print (df)
c1 c2 c3
1 A B x
2 A C y
4 B D z
Or:
df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1)).duplicated().values]
It can also be done by sorting the row values using sorted():
df[~df[['c1','c2']].apply(lambda row: sorted(row), axis = 1).duplicated()]
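For reference, a runnable version of the np.sort approach on the sample data (index values assumed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": ["A", "A", "B", "B", "A"],
                   "c2": ["B", "C", "A", "D", "B"],
                   "c3": ["x", "y", "x", "z", "y"]},
                  index=[1, 2, 3, 4, 5])

# sort the c1/c2 pair within each row so order no longer matters, then flag duplicates
dedup = df[~pd.DataFrame(np.sort(df[['c1', 'c2']], axis=1), index=df.index).duplicated()]
print(dedup)
#   c1 c2 c3
# 1  A  B  x
# 2  A  C  y
# 4  B  D  z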
