If the index column value is separated by ";", then retain only the substring after the ";"; else, retain it as-is. It would be even better if it were a list comprehension.
My code raised ValueError: Length of values (4402) does not match length of index (22501).
# If gene name is separated by ";", then get the substring after the ";"
list = []
for i in meth["Name"]:
    if ";" in i:
        list.append(i.split(";", 1)[1])
    else:
        continue
meth["Name"] = list
Traceback:
--> 532 "Length of values "
533 f"({len(data)}) "
534 "does not match length of index "
ValueError: Length of values (4402) does not match length of index (22501)
Sample data:
meth.iloc[0:4,0:4].to_dict()
{'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
 'TCGA-2K-A9WE-01A': {0: 0.27891582736223297,
                      1: 0.786837244239289,
                      2: 0.5310546143038515,
                      3: 0.7119161837613309},
 'TCGA-2Z-A9J1-01A': {0: 0.318496987871566,
                      1: 0.386177267500376,
                      2: 0.5086236274690276,
                      3: 0.4036012750884792},
 'TCGA-2Z-A9J2-01A': {0: 0.400119915667055,
                      1: 0.54983504208745,
                      2: 0.5352071929258406,
                      3: 0.6139719037555759}}
Are you trying to perform this operation on the column names, or on the values of a specific column? (The ValueError itself comes from your else: continue branch: values without ";" are skipped entirely, so the list you build ends up with 4402 items while the column you assign it to has 22501 rows.)
Either way, I think this will do the job:
import pandas as pd
# Define the example dataframe
df = pd.DataFrame(
    {
        'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
        'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
                             2: 0.5310546143038515, 3: 0.7119161837613309},
        'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
                             2: 0.5086236274690276, 3: 0.4036012750884792},
        'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
                             2: 0.5352071929258406, 3: 0.6139719037555759}
    }
)
# Original dataframe:
df
"""
   index  TCGA-2K-A9WE-01A  TCGA-2Z-A9J1-01A  TCGA-2Z-A9J2-01A
0   A1BG          0.278916          0.318497          0.400120
1   A1CF          0.786837          0.386177          0.549835
2  A2BP1          0.531055          0.508624          0.535207
3  A2LD1          0.711916          0.403601          0.613972
"""
# Replacing column names, using '-' as separator:
df.columns = df.columns.astype(str).str.split('-').str[-1]
# Modified dataframe:
df
"""
   index       01A       01A       01A
0   A1BG  0.278916  0.318497  0.400120
1   A1CF  0.786837  0.386177  0.549835
2  A2BP1  0.531055  0.508624  0.535207
3  A2LD1  0.711916  0.403601  0.613972
"""
You can apply the same logic to your dataframe index, or to specific columns:
df = pd.DataFrame(
    {
        'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
        'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
                             2: 0.5310546143038515, 3: 0.7119161837613309},
        'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
                             2: 0.5086236274690276, 3: 0.4036012750884792},
        'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
                             2: 0.5352071929258406, 3: 0.6139719037555759},
        'name': {0: 'A;B;C', 1: 'AAA', 2: 'BBB', 3: 'C-DE'}
    }
)
# Original dataframe:
df
"""
   index  TCGA-2K-A9WE-01A  TCGA-2Z-A9J1-01A  TCGA-2Z-A9J2-01A   name
0   A1BG          0.278916          0.318497          0.400120  A;B;C
1   A1CF          0.786837          0.386177          0.549835    AAA
2  A2BP1          0.531055          0.508624          0.535207    BBB
3  A2LD1          0.711916          0.403601          0.613972   C-DE
"""
# Splitting values from column "name":
df['name'] = df['name'].astype(str).str.split(';').str[-1]
df
"""
   index  TCGA-2K-A9WE-01A  TCGA-2Z-A9J1-01A  TCGA-2Z-A9J2-01A  name
0   A1BG          0.278916          0.318497          0.400120     C
1   A1CF          0.786837          0.386177          0.549835   AAA
2  A2BP1          0.531055          0.508624          0.535207   BBB
3  A2LD1          0.711916          0.403601          0.613972  C-DE
"""
Note
If a value contains the separator more than once (e.g. "A;B;C"), only the substring after the last separator is returned (for "A;B;C", that is "C").
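If you want the rule exactly as stated in the question — everything after the first ";", leaving values without a ";" untouched — a plain list comprehension does it. A minimal sketch (meth and its Name column are the objects from the question):

# Everything after the first ";" when present; values without ";" pass through unchanged
meth["Name"] = [i.split(";", 1)[1] if ";" in i else i for i in meth["Name"]]

An equivalent vectorized form is meth["Name"].str.split(";", n=1).str[-1]: with n=1 the split happens only at the first ";", and .str[-1] falls back to the whole string when no ";" is present.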
Related
I have a Dataframe with a really strange format, columns are matching/linked two by two.
The first column contains labels and codes associated with the adjacent column values.
Here's what it looks like:

1      2                 3      4
Name   letter_1          Name   letter_2
Title  Choose a letter:  Title  Choose another letter
1      a                 1      z
2      b                 2      y
3      c
4      d
And here's what I need:

Name      Title                  Code  Label
letter_1  Choose a letter:       1     a
letter_1  Choose a letter:       2     b
letter_1  Choose a letter:       3     c
letter_1  Choose a letter:       4     d
letter_2  Choose another letter  1     z
letter_2  Choose another letter  2     y
I managed to do it with this code:

# Init an empty DataFrame
df_format = pd.DataFrame()

# Iterate over the columns, step 2
for i in range(0, len(df.columns), 2):
    # Get the column names for the current col and col+1, since they're linked together
    col_i, col_ii = df.columns[i], df.columns[i] + 1
    # Concat codes from col and labels from col+1
    codes = pd.concat([df[col_i].loc[3:], df[col_ii].loc[3:]], axis=1).dropna()
    # Get the "Name", col+1 line 0
    name = df[col_ii].loc[0]
    # Get the title, col+1 line 1
    title = df[col_ii].loc[1]
    codes.loc[:, ['Name', 'Title']] = [name, title]
    codes.columns = ["Code", "Label", "Name", "Title"]
    df_format = pd.concat([df_format, codes])
But the question is: is there a more pythonic way to do this?
I assume pandas can do it, but it sometimes kind of breaks my brain.
Here is the example data to use with pd.DataFrame:
[{1: 'Name', 2: 'letter_1', 3: 'Name', 4: 'letter_2'},
{1: 'Title', 2: 'Choose a letter:', 3: 'Title', 4: 'Choose another letter'},
{1: 1, 2: 'a', 3: 1, 4: 'z'},
{1: 2, 2: 'b', 3: 2, 4: 'y'},
{1: 3, 2: 'c', 3: np.nan, 4: np.nan},
{1: 4, 2: 'd', 3: np.nan, 4: np.nan}]
Thanks a lot for your help!
EDIT
This data comes from an Excel file with another sheet containing the answers given by some respondents, where the column names correspond to letter_1, letter_2, etc.
The sheet I am working on has about 15,000 columns, all ordered the same way as shown above.
I read it with pd.read_excel(file, header=None, sheet_name="sheet1"), which is why I do not have easy-to-read column names (I dropped column 0).
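For the record, one more compact approach builds a tidy block per (code, label) column pair and concatenates them with a list comprehension. A minimal sketch, assuming the sample layout above (names in row 0, titles in row 1, data from row 2 — adjust the row slice to match the real sheet):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [{1: 'Name', 2: 'letter_1', 3: 'Name', 4: 'letter_2'},
     {1: 'Title', 2: 'Choose a letter:', 3: 'Title', 4: 'Choose another letter'},
     {1: 1, 2: 'a', 3: 1, 4: 'z'},
     {1: 2, 2: 'b', 3: 2, 4: 'y'},
     {1: 3, 2: 'c', 3: np.nan, 4: np.nan},
     {1: 4, 2: 'd', 3: np.nan, 4: np.nan}]
)

# One tidy block per (code, label) column pair, stacked into a single frame
df_format = pd.concat(
    [df.loc[2:, [code, label]]              # data rows only
       .dropna()
       .set_axis(['Code', 'Label'], axis=1)
       .assign(Name=df.loc[0, label],       # row 0 holds the name
               Title=df.loc[1, label])      # row 1 holds the title
     for code, label in zip(df.columns[::2], df.columns[1::2])],
    ignore_index=True,
)[['Name', 'Title', 'Code', 'Label']]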
I have a dataset with two columns
   id                                          to
0   1  0x954b890704693af242613edef1b603825afcd708
1   1  0x954b890704693af242613edef1b603825afcd708
2   1  0x607f4c5bb672230e8672085532f7e901544a7375
3   1  0x9b9647431632af44be02ddd22477ed94d14aacaa
4   2  0x9b9647431632af44be02ddd22477ed94d14aacaa
and I would like to print any value in column 'to' that appears under more than one distinct value of column 'id'; in the example above, the only value that should be printed is 0x9b9647431632af44be02ddd22477ed94d14aacaa.
I have done this with a nested for loop; I wonder if there is a better way of doing it:
for index, row in df.iterrows():
    to = row['to']
    id = row['id']
    for index, row in df.iterrows():
        if row['to'] == to and row['id'] != id:
            print(to)
You can use df.groupby on column to, apply nunique, and keep only the entries > 1. So:
import pandas as pd

d = {'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
     'to': {0: '0x954b890704693af242613edef1b603825afcd708',
            1: '0x954b890704693af242613edef1b603825afcd708',
            2: '0x607f4c5bb672230e8672085532f7e901544a7375',
            3: '0x9b9647431632af44be02ddd22477ed94d14aacaa',
            4: '0x9b9647431632af44be02ddd22477ed94d14aacaa'}}
df = pd.DataFrame(d)

nunique = df.groupby('to')['id'].nunique()
print(nunique)
to
0x607f4c5bb672230e8672085532f7e901544a7375    1
0x954b890704693af242613edef1b603825afcd708    1
0x9b9647431632af44be02ddd22477ed94d14aacaa    2
res = nunique[nunique>1]
print(res.index.tolist())
['0x9b9647431632af44be02ddd22477ed94d14aacaa']
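If you prefer a single expression, the same idea works with transform, which broadcasts the per-group count back onto the rows (a sketch equivalent to the groupby above):

# Rows whose 'to' value appears under more than one distinct id
dupes = df.loc[df.groupby('to')['id'].transform('nunique') > 1, 'to'].unique()
print(dupes)
['0x9b9647431632af44be02ddd22477ed94d14aacaa']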
I have 2 dataframes:
df1 = pd.DataFrame.from_dict(
    {('category', ''): {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'},
     (pd.Timestamp('2021-06-28 00:00:00'), 'metric_1'): {
         0: 4120.549999999999, 1: 11226.016666666665, 2: 25049.443333333333,
         3: 18261.083333333332, 4: 2553.1208333333334, 5: 2843.01, 6: 73203.51333333334},
     (pd.Timestamp('2021-06-28 00:00:00'), 'metric_2'): {
         0: 9907.79, 1: 7614.650000000001, 2: 13775.259999999998,
         3: 13158.250000000004, 4: 1457.85, 5: 1089.5600000000002, 6: 38864.9},
     (pd.Timestamp('2021-07-05 00:00:00'), 'metric_1'): {
         0: 5817.319999999998, 1: 10799.45, 2: 23521.51,
         3: 22062.350833333334, 4: 1249.5974999999999, 5: 3229.77, 6: 52796.06083333332},
     (pd.Timestamp('2021-07-05 00:00:00'), 'metric_2'): {
         0: 6321.21, 1: 5606.01, 2: 10239.689999999999,
         3: 17476.600000000002, 4: 943.7199999999999, 5: 1410.33, 6: 29645.45}}
).set_index('category')
df2 = pd.DataFrame.from_dict(
    {'category': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'},
     1: {0: 36234.035577957984, 1: 69078.07089184562, 2: 128879.5397517309,
         3: 178376.63536908248, 4: 9293.956915067887, 5: 8184.780211399392,
         6: 177480.74540313095},
     2: {0: 37887.581678419825, 1: 72243.67956241772, 2: 134803.02342121338,
         3: 186603.8963173654, 4: 9716.385738295368, 5: 8555.606693927,
         6: 185658.87577993725}}
).set_index('category')
First I change the column names of df2 to match df1:

date_mappings = {1: '2021-06-28',
                 2: '2021-07-05'}
df2 = df2.rename(columns=date_mappings)
Then I try to merge it
f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack(), left_index=True, right_index=True).sort_index(axis=1))
But I get an error:
ValueError: Cannot merge a Series without a name
What is my mistake?
My goal is to add the df2 column to df1 under each week, so that df1 has three metric columns per week instead of two.
After using
c = [df2.columns.map(date_mappings.get), df2.columns]
df1.join(df2.set_axis(c, axis=1)).sort_index(axis=1)
I get the values appended to the end of the dataframe rather than placed under the matching week columns. Maybe the issue is that df2 holds dates from 2021-06-28 to 2022-06-27 while df1 holds dates from 2020 to today.
[screenshot: the new columns are appended at the end of the dataframe]
The idea is to create a MultiIndex in both DataFrames:
date_mappings = {1: '2021-06-28',
                 2: '2021-07-05'}

# Create a MultiIndex in df2 with datetimes in the first level
df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
                                          ['metric_3']])

# Remove unused levels (here: category), so the first level can be converted to datetimes
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)

# Join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print(df)
           2021-06-28                          2021-07-05           \
             metric_1  metric_2       metric_3    metric_1  metric_2
category
A         4120.550000   9907.79   36234.035578   5817.320000   6321.21
B        11226.016667   7614.65   69078.070892  10799.450000   5606.01
C        25049.443333  13775.26  128879.539752  23521.510000  10239.69
D        18261.083333  13158.25  178376.635369  22062.350833  17476.60
E         2553.120833   1457.85    9293.956915   1249.597500    943.72
F         2843.010000   1089.56    8184.780211   3229.770000   1410.33
G        73203.513333  38864.90  177480.745403  52796.060833  29645.45

              metric_3
category
A         37887.581678
B         72243.679562
C        134803.023421
D        186603.896317
E          9716.385738
F          8555.606694
G        185658.875780
If you need to drop datetimes greater than the maximum datetime in df1, use:
# Change the mapping for the test
date_mappings = {1: '2021-06-28',
                 2: '2022-07-05'}

df2.columns = pd.MultiIndex.from_product([pd.to_datetime(df2.columns.map(date_mappings)),
                                          ['metric_3']])
df1.columns = df1.columns.remove_unused_levels()
df1.columns = df1.columns.set_levels(pd.to_datetime(df1.columns.levels[0]), level=0)

df2 = df2.loc[:, df2.columns.get_level_values(0) <= df1.columns.get_level_values(0).max()]
print(df2)
            2021-06-28
              metric_3
category
A         36234.035578
B         69078.070892
C        128879.539752
D        178376.635369
E          9293.956915
F          8184.780211
G        177480.745403
# Join together and sort the MultiIndex
df = df1.join(df2).sort_index(axis=1)
print(df)
           2021-06-28                          2021-07-05
             metric_1  metric_2       metric_3    metric_1  metric_2
category
A         4120.550000   9907.79   36234.035578   5817.320000   6321.21
B        11226.016667   7614.65   69078.070892  10799.450000   5606.01
C        25049.443333  13775.26  128879.539752  23521.510000  10239.69
D        18261.083333  13158.25  178376.635369  22062.350833  17476.60
E         2553.120833   1457.85    9293.956915   1249.597500    943.72
F         2843.010000   1089.56    8184.780211   3229.770000   1410.33
G        73203.513333  38864.90  177480.745403  52796.060833  29645.45
Use pd.DataFrame.reindex + pd.DataFrame.join.
reindex has a convenient level parameter that lets you expand over index levels that are not present.
df1.join(df2.reindex(df1.index, level=0))
I am not sure if this is what you want, but you might need to_frame: df1.unstack() produces a Series without a name, and merge cannot merge an unnamed Series (which is exactly the ValueError you got), so convert it to a one-column DataFrame first:

f = lambda x: pd.to_datetime(x)
df = (df2.merge(df1.unstack().to_frame(), left_index=True, right_index=True).sort_index(level=0))
print(df)
I have 2 dataframes
df1:

product_id  value  name
abc            10     a
def            20     b
ggg            10     c

df2, which I get after using df2.groupby(['prod_id'])['code'].count().reset_index():

prod_id  code
abc        10
def        20
ggg        10
ooo         5
hhh         1
I want to merge the values from df2 into df1, joining left on product_id and right on prod_id, to get:

product_id  value  name
abc            20     a
def            40     b
ggg            20     c
I tried:
pd.merge(df1, df2.groupby(['prod_id'])['code'].count().reset_index(),
         left_on='product_id', right_on='prod_id', how='left')

This returns df1 with two additional columns, prod_id and code, where code holds the amount by which I would like to increase value in df1. I could simply add those two columns afterwards, but I would like to avoid that.
Here’s one alternative:
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
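Here, dict(df2.values) turns df2 into a {prod_id: code} mapping (this assumes prod_id and code are its only two columns, in that order), map looks up each product_id, fillna(0) covers products missing from df2, and add sums the looked-up amounts with the existing value column.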
Complete example:
df1 = pd.DataFrame({'product_id': {0: 'abc', 1: 'def', 2: 'ggg'},
                    'value': {0: 10, 1: 20, 2: 10},
                    'name': {0: 'a', 1: 'b', 2: 'c'}})
df2 = pd.DataFrame({'prod_id': {0: 'abc', 1: 'def', 2: 'ggg', 3: 'ooo', 4: 'hhh'},
                    'code': {0: 10, 1: 20, 2: 10, 3: 5, 4: 1}})

df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
OUTPUT:
  product_id  value name
0        abc     20    a
1        def     40    b
2        ggg     20    c
You could use reindex on df2 with the order of df1's product_id, after the groupby count (without the reset_index). reindex aligns the counts row-by-row with df1, and fill_value=0 handles products that are missing from df2:

df1['value'] += (
    df2.groupby(['prod_id'])['code'].count()
       .reindex(df1['product_id'], fill_value=0)
       .to_numpy()
)
I have a multilevel dataframe, and I want to compare the values in column secret with a condition on column group: if group is A, the value in the other dataframe is allowed to be empty or na; otherwise, output INVALID for the mismatching ones.
multilevel dataframe:

     name         secret           group
      df1   df2      df1     df2    df1  df2
id
1     Tim   Tim   random      na      A    A
2     Tom   Tom     tree              A    A
3    Alex  Alex    apple   apple      B    B
4     May   May     file  cheese      C    C
expected output for secret:

id  name  secret   group
1   Tim   na       A
2   Tom            A
3   Alex  apple    B
4   May   INVALID  C
so far I have:
result_df['result'] = multilevel_df.groupby(level=0, axis=0).apply(lambda x: secret_check(x))
# take care of the rest by comparing column by column
result_df = multilevel_df.groupby(level=0, axis=1).apply(lambda x: validate(x))

def validate(x):
    if x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'

def secret_check(x):
    if (x['group'] == 'A' and pd.isnull(['secret']):  # this line is off
        return x[1]
    elif x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'
Assuming we have the following dataframe:
df = pd.DataFrame({0: {0: 1, 1: 2, 2: 3, 3: 4},
                   1: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
                   2: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
                   3: {0: 'random', 1: 'tree', 2: 'apple', 3: 'file'},
                   4: {0: 'na', 1: '', 2: 'apple', 3: 'cheese'},
                   5: {0: 'A', 1: 'A', 2: 'B', 3: 'C'},
                   6: {0: 'A', 1: 'A', 2: 'B', 3: 'C'}})

df.columns = pd.MultiIndex.from_tuples([('id', ''), ('name', 'df1'), ('name', 'df2'),
                                        ('secret', 'df1'), ('secret', 'df2'),
                                        ('group', 'df1'), ('group', 'df2')])
df
Out[1]:
   id  name        secret          group
        df1   df2     df1     df2   df1 df2
0   1   Tim   Tim  random      na     A   A
1   2   Tom   Tom    tree             A   A
2   3  Alex  Alex   apple   apple     B   B
3   4   May   May    file  cheese     C   C
You can use np.select() to return results based on conditions, .droplevel() to get out of the MultiIndex, and df.loc[:, ~df.columns.duplicated()] to drop the duplicate columns. Since we write the answer into the df1 columns, the df2 columns are no longer needed.
import numpy as np

df[('secret', 'df1')] = np.select(
    [(df[('group', 'df2')] != 'A') &
     (df[('secret', 'df1')] != df[('secret', 'df2')])],       # condition 1
    [df[('secret', 'df1')] + ' > ' + df[('secret', 'df2')]],  # result 1
    df[('secret', 'df2')])                                    # alternative if the condition is not met

df.columns = df.columns.droplevel(1)
df = df.loc[:, ~df.columns.duplicated()]
df
Out[1]:
   id  name         secret group
0   1   Tim             na     A
1   2   Tom                    A
2   3  Alex          apple     B
3   4   May  file > cheese     C
If I understand you right, you want to mark "secret" in df2 as INVALID if the secrets in df1 and df2 differ and the group is not A. There you go:

condition = ((df[('secret', 'df1')] != df[('secret', 'df2')]) &
             (df[('group', 'df1')] != 'A'))
df.loc[condition, ('secret', 'df2')] = 'INVALID'
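To also reproduce the expected output above — where group A keeps its empty/na df2 secrets as-is — a minimal sketch, assuming "empty or na" means a literal '' or 'na' string as in the sample data:

import numpy as np

s1, s2 = df[('secret', 'df1')], df[('secret', 'df2')]
grp = df[('group', 'df1')]

# Valid if the secrets match, or if group is 'A' and df2's secret is empty/na
ok = (s1 == s2) | ((grp == 'A') & (s2.isna() | s2.isin(['', 'na'])))
df[('secret', 'result')] = np.where(ok, s2, 'INVALID')

For the sample frame this yields na, '', apple, INVALID for rows 1-4, matching the expected column.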