I have a dataframe like this:
C1 C2 C3 C4
A TV /r/tv3 NaN
B Music Pop /r/pop
C /r/foo NaN NaN
I need to iterate through each row and get the value of the first column and then find the value of the column which starts with /r/. So the output should look like this:
A /r/tv3
B /r/pop
C /r/foo
What is the fastest pythonic way to do this?
Using where after startswith
df.where(df.apply(lambda x : x.str.startswith(pat='/r/'),axis=1)).stack().reset_index(level=1,drop=True)
Out[680]:
C1
A /r/tv3
B /r/pop
C /r/foo
dtype: object
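As a self-contained variant of the same idea, here is a minimal sketch that rebuilds the sample frame and filters the stacked values instead of calling where. It assumes C1 is the index (as the output above suggests); the names stacked and result are only for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with C1 assumed to be the index.
df = pd.DataFrame(
    {
        "C2": ["TV", "Music", "/r/foo"],
        "C3": ["/r/tv3", "Pop", np.nan],
        "C4": [np.nan, "/r/pop", np.nan],
    },
    index=pd.Index(["A", "B", "C"], name="C1"),
)

# Turn the frame into one long Series and drop the missing cells explicitly.
stacked = df.stack().dropna()

# Keep only the values that start with "/r/", then drop the column-name level.
result = stacked[stacked.str.startswith("/r/")].droplevel(1)
print(result)
```

If a row could contain more than one /r/ value, this keeps all of them, so a further groupby(level=0).first() would be needed to keep only the first.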
I have a dataset here in which I have to group each sample. The group of each sample is a part of the sample's name. Each row has a semi-unique heading, e.g.
TCGA.02.0047.GBM.C4, TCGA.02.0055.GBM.C4, TCGA.ZS.A9CG.LIHC.C3, TCGA.ZU.A8S4.CHOL.C1, TCGA.ZX.AA5X.CESC.C2.
I need to target the C part of the heading and group the values by it, so that each sample ends up in one of C1, C2, C3 or C4.
How would I go about doing this?
For example, suppose you have a dataset like this:
import pandas as pd
df = pd.DataFrame({"Column_A": ["TCGA.02.0047.GBM.C4", "TCGA.02.0055.GBM.C4", "TCGA.ZS.A9CG.LIHC.C3", "TCGA.ZU.A8S4.CHOL.C1", "TCGA.ZX.AA5X.CESC.C2"]})
Column_A
0 TCGA.02.0047.GBM.C4
1 TCGA.02.0055.GBM.C4
2 TCGA.ZS.A9CG.LIHC.C3
3 TCGA.ZU.A8S4.CHOL.C1
4 TCGA.ZX.AA5X.CESC.C2
You can add a new column with the group:
df["Group"] = df["Column_A"].str[-2:]
Column_A Group
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2
If the sample names are column names
You can extract the part after the last period and use it as grouper:
df.groupby(df.columns.str.extract('([^.]+)$', expand=False), axis=1)
Then perform the desired aggregation.
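As an illustration of the aggregation step, here is a minimal sketch that sums the columns within each group. The numeric values are made up for the example, and it groups the transposed frame rather than passing axis=1, on the assumption that you are on a recent pandas where grouping along columns is deprecated.

```python
import pandas as pd

# Hypothetical numeric data; only the column names come from the question.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]],
    columns=[
        "TCGA.02.0047.GBM.C4",
        "TCGA.02.0055.GBM.C4",
        "TCGA.ZS.A9CG.LIHC.C3",
        "TCGA.ZU.A8S4.CHOL.C1",
        "TCGA.ZX.AA5X.CESC.C2",
    ],
)

# Everything after the last period is the group label.
groups = df.columns.str.extract(r"([^.]+)$", expand=False).to_numpy()

# Transpose so samples are rows, group and sum, then transpose back.
summed = df.T.groupby(groups).sum().T
print(summed)
```

Any other aggregation (mean, max, and so on) can replace sum here.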
If the sample names are in one column
df['new'] = df['col'].str.extract('([^.]+)$')
Output:
col new
0 TCGA.02.0047.GBM.C4 C4
1 TCGA.02.0055.GBM.C4 C4
2 TCGA.ZS.A9CG.LIHC.C3 C3
3 TCGA.ZU.A8S4.CHOL.C1 C1
4 TCGA.ZX.AA5X.CESC.C2 C2
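Once the group label is a column, the actual grouping is a plain groupby. A minimal sketch, assuming the column is called col as above and that you want the sample names collected per group (by_group is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col": [
            "TCGA.02.0047.GBM.C4",
            "TCGA.02.0055.GBM.C4",
            "TCGA.ZS.A9CG.LIHC.C3",
            "TCGA.ZU.A8S4.CHOL.C1",
            "TCGA.ZX.AA5X.CESC.C2",
        ]
    }
)

# expand=False returns a Series instead of a one-column DataFrame.
df["new"] = df["col"].str.extract(r"([^.]+)$", expand=False)

# Collect the sample names that fall into each group.
by_group = df.groupby("new")["col"].apply(list).to_dict()
print(by_group)
```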
INPUT>df1
ColumnA ColumnB
A1 NaN
A1A2 NaN
A3 NaN
What I am trying to do is change ColumnB's value conditionally:
iterate over a list of checks on ColumnA and add a remark to ColumnB for each match.
The previous value of ColumnB must be kept when a new string is added.
In the sample dataframe, what I want is:
If ColumnA contains A1, add the string "A1" to ColumnB (without clearing the previous value).
If ColumnA contains A2, add the string "A2" to ColumnB (without clearing the previous value).
OUTPUT>df1
ColumnA ColumnB
A1 A1
A1A2 A1_A2
A3 NaN
I have tried the following code, but it does not work well.
Could anyone give me some advice? Thanks.
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A1'), df1['ColumnB']+"_A1",df1['ColumnB'])
df1['ColumnB'] = np.where(df1['ColumnA'].str.contains('A2'), df1['ColumnB']+"_A2",df1['ColumnB'])
One way using pandas.Series.str.findall with join:
key = ["A1", "A2"]
df["ColumnB"] = df["ColumnA"].str.findall("|".join(key)).str.join("_")
print(df)
Output:
ColumnA ColumnB
0 A1 A1
1 A1A2 A1_A2
2 A3
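Note that rows without any match end up with an empty string rather than the NaN shown in the desired output. A minimal sketch of one way to restore NaN, assuming an empty string always means "no match" (the name joined is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ColumnA": ["A1", "A1A2", "A3"]})
key = ["A1", "A2"]

# Same findall + join as above: one "_"-separated string per row.
joined = df["ColumnA"].str.findall("|".join(key)).str.join("_")

# mask() replaces the entries where the condition is True with NaN.
df["ColumnB"] = joined.mask(joined.eq(""))
print(df)
```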
You cannot add or append strings to np.nan. That means you would always need to check if any position in your ColumnB is still a np.nan or already a string to properly set its new value. If all you want to do is to work with text you could initialize your ColumnB with empty strings and append selected string pieces from ColumnA as:
import pandas as pd
import numpy as np
I = pd.DataFrame({'ColA': ['A1', 'A1A2', 'A2', 'A3']})
I['ColB'] = ''
I.loc[I.ColA.str.contains('A1'), 'ColB'] += 'A1'
print(I)
I.loc[I.ColA.str.contains('A2'), 'ColB'] += 'A2'
print(I)
The output is:
ColA ColB
0 A1 A1
1 A1A2 A1
2 A2
3 A3
ColA ColB
0 A1 A1
1 A1A2 A1A2
2 A2 A2
3 A3
Note: this is a very verbose version as an example.
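The same idea can be written as a loop over the keys. This is only a sketch: it adds the "_" separator from the question's desired output, which the verbose version above does not, and the names df, hit and prefix are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ColA": ["A1", "A1A2", "A2", "A3"]})
df["ColB"] = ""

for key in ["A1", "A2"]:
    # Rows whose ColA contains the key as a literal substring.
    hit = df["ColA"].str.contains(key, regex=False)
    # Keep empty cells empty; add "_" after cells that already hold text.
    prefix = df["ColB"].where(df["ColB"].eq(""), df["ColB"] + "_")
    # Append the key only on the matching rows.
    df.loc[hit, "ColB"] = prefix[hit] + key

print(df)
```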
I have two dataframes. Each has a two-level multi-index. The first level is the same in each, but the second level is different. I would like to concatenate the dataframes and end up with a dataframe with a three-level multi-index, where records from the first dataframe would have 'NaN' in the third index level, and records from the second dataframe would have 'NaN' in the second index level. Instead, I get a dataframe with a two-level index, where the values in the second level of each dataframe are put in the same index level, which takes the name of the second level in the first dataframe (see code below).
Is there a nice way to do this? I could make the second level of each index into a column, concatenate, then put them back into the index, but this seems like a roundabout way of doing it to me.
df1 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-2':['a2','b2','c2','d2'], 'values':[1,2,3,4]})
df2 = pd.DataFrame({'index-1':['a1','b1','c1','d1'], 'index-3':['a3','b3','c3','d3'], 'values':[5,6,7,8]})
df1.set_index(['index-1','index-2'], inplace=True)
df2.set_index(['index-1','index-3'], inplace=True)
pd.concat([df1, df2])
Thanks!
It'll be easier to reset the index on the two input dataframes, concat them and then set the index again:
pd.concat([df1.reset_index(), df2.reset_index()], sort=False) \
.set_index(['index-1', 'index-2', 'index-3'])
Result:
values
index-1 index-2 index-3
a1 a2 NaN 1
b1 b2 NaN 2
c1 c2 NaN 3
d1 d2 NaN 4
a1 NaN a3 5
b1 NaN b3 6
c1 NaN c3 7
d1 NaN d3 8
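For completeness, here is a runnable sketch of the whole round trip, plus one way to pick out the rows that came from the first dataframe by looking for NaN in the third level. The names out and from_df1 are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"index-1": ["a1", "b1", "c1", "d1"],
                    "index-2": ["a2", "b2", "c2", "d2"],
                    "values": [1, 2, 3, 4]}).set_index(["index-1", "index-2"])
df2 = pd.DataFrame({"index-1": ["a1", "b1", "c1", "d1"],
                    "index-3": ["a3", "b3", "c3", "d3"],
                    "values": [5, 6, 7, 8]}).set_index(["index-1", "index-3"])

# Move the index levels to columns, concatenate, then rebuild a 3-level index.
out = (
    pd.concat([df1.reset_index(), df2.reset_index()], sort=False)
    .set_index(["index-1", "index-2", "index-3"])
)

# Rows from df1 are the ones with NaN in the "index-3" level.
from_df1 = out[out.index.get_level_values("index-3").isna()]
print(from_df1)
```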
I have a dataframe as follows:
id|s1|s2|s3|s4|s5
0|a|b|NaN|NaN|NaN
0|NaN|NaN|NaN|c|NaN
0|a1|NaN|NaN|c2|NaN
1|b|c|NaN|NaN|NaN
1|NaN|NaN|a1|NaN|NaN
1|a1|b|NaN|c1|NaN
.
.
.
.
1000(rows)...............
I want this to be restructured like this:
id|s1|s2|s3|s4|s5
0|a|b|NaN|c|NaN
0|a1|b|NaN|c2|NaN
1|b|c|a1|c1|NaN
1|a1|b|a1|c1|NaN
I have tried:
df.unstack(),df.melt() and df.pivot()
None of them gave me the expected result. Basically, I want to reduce the NaN values as much as possible. Could anyone suggest a way? I want only one entry per cell, not a group of entries in a single cell.
I don't want NaN values; I want the rows to flow as shown in the expected output above. I want NaN only when no row with the same id has a value in that column.
Group on id, ffill and then bfill within each group, then drop_duplicates:
df.groupby('id').apply(lambda x: x.ffill().bfill()).drop_duplicates()
id s1 s2 s3 s4 s5
0 0 a b NaN c NaN
2 0 a1 b NaN c2 NaN
3 1 b c a1 c1 NaN
5 1 a1 b a1 c1 NaN
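Depending on the pandas version, groupby(...).apply may emit a warning or handle the id column differently. A minimal sketch that avoids apply by using the groupby ffill and bfill methods directly, assuming the columns are named s1 to s5 as in the question (filled, cols and result are illustrative names):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "id": [0, 0, 0, 1, 1, 1],
        "s1": ["a", nan, "a1", "b", nan, "a1"],
        "s2": ["b", nan, nan, "c", nan, "b"],
        "s3": [nan, nan, nan, nan, "a1", nan],
        "s4": [nan, "c", "c2", nan, nan, "c1"],
        "s5": [nan, nan, nan, nan, nan, nan],
    }
)

cols = ["s1", "s2", "s3", "s4", "s5"]
filled = df.copy()

# Fill forward, then backward, but never across different ids.
filled[cols] = filled.groupby("id")[cols].ffill()
filled[cols] = filled.groupby("id")[cols].bfill()

# Rows that became identical after filling are collapsed into one.
result = filled.drop_duplicates()
print(result)
```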
I have a DF that looks something like
c1 c2 c3
1 A B x
2 A C y
3 B A x
4 B D z
5 A B y
As you can see, lines 1 and 3 are repeated if we disregard that c1 and c2 are different columns (or if they become reversed). However, line 5 is not. How can I drop rows based on columns c1 and c2, regardless of where the repeated values are?
Thanks in advance
OK, let us try something different: frozenset turns each row's (c1, c2) pair into an order-insensitive, hashable set, so duplicated can then flag the repeated pairs:
df[~df[['c1','c2']].apply(frozenset,axis=1).duplicated()]
Out[666]:
c1 c2 c3
1 A B x
2 A C y
4 B D z
You can select the two columns as a subset, sort each row with numpy.sort, create a new DataFrame from the resulting array, and use DataFrame.duplicated, filtering by the inverted mask with boolean indexing:
df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1), index=df.index).duplicated()]
print (df)
c1 c2 c3
1 A B x
2 A C y
4 B D z
Or:
df = df[~pd.DataFrame(np.sort(df[['c1','c2']], axis=1)).duplicated().values]
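For reference, a self-contained sketch of the numpy.sort approach on the question's data, with the imports it needs. The names pairs and deduped are illustrative, and the explicit to_numpy() call is an addition for clarity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "c1": ["A", "A", "B", "B", "A"],
        "c2": ["B", "C", "A", "D", "B"],
        "c3": ["x", "y", "x", "z", "y"],
    },
    index=range(1, 6),
)

# Sort each (c1, c2) pair so that (B, A) and (A, B) look the same.
pairs = pd.DataFrame(np.sort(df[["c1", "c2"]].to_numpy(), axis=1), index=df.index)

# Keep the first occurrence of every pair, whatever its original order.
deduped = df[~pairs.duplicated()]
print(deduped)
```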
It can also be done by sorting the row values with sorted(), converted to a tuple so that duplicated can hash each row:
df[~df[['c1','c2']].apply(lambda row: tuple(sorted(row)), axis=1).duplicated()]