I have a DataFrame with a really strange format: the columns are matched/linked two by two. The first column of each pair contains labels and codes associated with the adjacent column's values. Here's what it looks like:
       1                 2      3                      4
0   Name          letter_1   Name               letter_2
1  Title  Choose a letter:  Title  Choose another letter
2      1                 a      1                      z
3      2                 b      2                      y
4      3                 c    NaN                    NaN
5      4                 d    NaN                    NaN
And here's what I need:
       Name                  Title  Code  Label
0  letter_1       Choose a letter:     1      a
1  letter_1       Choose a letter:     2      b
2  letter_1       Choose a letter:     3      c
3  letter_1       Choose a letter:     4      d
4  letter_2  Choose another letter     1      z
5  letter_2  Choose another letter     2      y
I managed to do it with this code:
# Init an empty DataFrame
df_format = pd.DataFrame()
# Iterate over the columns, step 2
for i in range(0, len(df.columns), 2):
    # Get the column names for the current col and col+1, since they're linked together
    col_i, col_ii = df.columns[i], df.columns[i + 1]
    # Concat codes from col and labels from col+1 (data rows start at line 2)
    codes = pd.concat([df[col_i].loc[2:], df[col_ii].loc[2:]], axis=1).dropna()
    # Get the "Name", col+1 line 0
    name = df[col_ii].loc[0]
    # Get the title, col+1 line 1
    title = df[col_ii].loc[1]
    codes.loc[:, ['Name', 'Title']] = [name, title]
    codes.columns = ["Code", "Label", "Name", "Title"]
    df_format = pd.concat([df_format, codes])
But the question is: is there a more pythonic way to do this? I assume there is, with pandas, but sometimes it kind of breaks my brain. Here is the example data to use with pd.DataFrame:
[{1: 'Name', 2: 'letter_1', 3: 'Name', 4: 'letter_2'},
{1: 'Title', 2: 'Choose a letter:', 3: 'Title', 4: 'Choose another letter'},
{1: 1, 2: 'a', 3: 1, 4: 'z'},
{1: 2, 2: 'b', 3: 2, 4: 'y'},
{1: 3, 2: 'c', 3: np.nan, 4: np.nan},
{1: 4, 2: 'd', 3: np.nan, 4: np.nan}]
Thanks a lot for your help!
EDIT
This data comes from an Excel file; another sheet contains the answers given by respondents, where the column names correspond to letter_1, letter_2, etc.
The sheet I am working on has about 15,000 columns, all ordered the same way as shown above.
I read it with pd.read_excel(file, header=None, sheet_name="sheet1"), which is why I do not have easy-to-read column names (I dropped column 0).
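A slightly more idiomatic sketch of the same reshape, assuming the layout shown above: collect one small frame per column pair and concatenate once at the end, instead of growing the result inside the loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [{1: 'Name', 2: 'letter_1', 3: 'Name', 4: 'letter_2'},
     {1: 'Title', 2: 'Choose a letter:', 3: 'Title', 4: 'Choose another letter'},
     {1: 1, 2: 'a', 3: 1, 4: 'z'},
     {1: 2, 2: 'b', 3: 2, 4: 'y'},
     {1: 3, 2: 'c', 3: np.nan, 4: np.nan},
     {1: 4, 2: 'd', 3: np.nan, 4: np.nan}])

pieces = []
for i in range(0, df.shape[1], 2):
    pair = df.iloc[:, i:i + 2]                  # one (code, label) column pair
    name, title = pair.iloc[0, 1], pair.iloc[1, 1]
    codes = pair.iloc[2:].dropna()              # data rows start at line 2
    codes.columns = ['Code', 'Label']
    pieces.append(codes.assign(Name=name, Title=title))

df_format = pd.concat(pieces, ignore_index=True)
```

Concatenating once avoids the quadratic copying that repeated pd.concat inside the loop causes on 15,000 columns.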
Related
Is there a direct way to drop all columns before a matching string in a pandas DataFrame? E.g. if my column 8 contains the string 'Matched', I want to drop columns 0 to 7.
Well, you did not give any information about where and how to look for 'Matched', but let's say that the integer col_num contains the position of the matched column:
col_num = np.where(df == 'Matched')[1][0]
df.drop(columns=df.columns[0:col_num], inplace=True)
will do the drop.
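A minimal sketch of those two lines on a hypothetical frame where one cell holds the marker string 'Matched':

```python
import numpy as np
import pandas as pd

# hypothetical example: the marker 'Matched' sits in column C
df = pd.DataFrame({'A': [1], 'B': [2], 'C': ['Matched'], 'D': [4]})

col_num = np.where(df == 'Matched')[1][0]     # column position of the match
df = df.drop(columns=df.columns[0:col_num])   # drop everything before it
```

After the drop, only columns C and D remain.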
Example
data = {'A': {0: 1}, 'B': {0: 2}, 'C': {0: 3}, 'Match1': {0: 4}, 'D': {0: 5}}
df = pd.DataFrame(data)
df
A B C Match1 D
0 1 2 3 4 5
Code
Remove everything in front of the first column whose name starts with 'Match', using boolean indexing:
df.loc[:, df.columns.str.startswith('Match').cumsum() > 0]
result
Match1 D
0 4 5
I have a dataset with two columns
id to
0 1 0x954b890704693af242613edef1b603825afcd708
1 1 0x954b890704693af242613edef1b603825afcd708
2 1 0x607f4c5bb672230e8672085532f7e901544a7375
3 1 0x9b9647431632af44be02ddd22477ed94d14aacaa
4 2 0x9b9647431632af44be02ddd22477ed94d14aacaa
and I would like to print the values in column 'to' that appear under more than one value of column 'id'; in the example above, the only value to be printed should be 0x9b9647431632af44be02ddd22477ed94d14aacaa.
I have done this with a nested for loop, but I wonder if there is a better way of doing it:
for index, row in df.iterrows():
    to = row['to']
    id = row['id']
    for index2, row2 in df.iterrows():
        if row2['to'] == to and row2['id'] != id:
            print(to)
You can use df.groupby on column to, apply nunique and keep only the entries > 1. So:
import pandas as pd
d = {'id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
'to': {0: '0x954b890704693af242613edef1b603825afcd708',
1: '0x954b890704693af242613edef1b603825afcd708',
2: '0x607f4c5bb672230e8672085532f7e901544a7375',
3: '0x9b9647431632af44be02ddd22477ed94d14aacaa',
4: '0x9b9647431632af44be02ddd22477ed94d14aacaa'}}
df = pd.DataFrame(d)
nunique = df.groupby('to')['id'].nunique()
print(nunique)
to
0x607f4c5bb672230e8672085532f7e901544a7375 1
0x954b890704693af242613edef1b603825afcd708 1
0x9b9647431632af44be02ddd22477ed94d14aacaa 2
res = nunique[nunique>1]
print(res.index.tolist())
['0x9b9647431632af44be02ddd22477ed94d14aacaa']
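The two steps above can also be chained into a single expression; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2],
                   'to': ['0x954b890704693af242613edef1b603825afcd708',
                          '0x954b890704693af242613edef1b603825afcd708',
                          '0x607f4c5bb672230e8672085532f7e901544a7375',
                          '0x9b9647431632af44be02ddd22477ed94d14aacaa',
                          '0x9b9647431632af44be02ddd22477ed94d14aacaa']})

shared = (df.groupby('to')['id'].nunique()
            .loc[lambda s: s > 1]          # keep values seen under > 1 id
            .index.tolist())
```

This avoids the intermediate variable and reads top-to-bottom as one pipeline.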
If an index column value contains a ';', retain only the substring after the ';'; else, retain the value as-is. It would be even better if it were a list comprehension.
My code raised ValueError: Length of values (4402) does not match length of index (22501).
# If gene name is separated by ";", then get the substring after the ";"
list = []
for i in meth["Name"]:
    if ";" in i:
        list.append(i.split(";", 1)[1])
    else:
        continue
meth["Name"] = list
Traceback:
--> 532 "Length of values "
533 f"({len(data)}) "
534 "does not match length of index "
ValueError: Length of values (4402) does not match length of index (22501)
Sample data:
meth.iloc[0:4,0:4].to_dict()
{'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297,
1: 0.786837244239289,
2: 0.5310546143038515,
3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566,
1: 0.386177267500376,
2: 0.5086236274690276,
3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055,
1: 0.54983504208745,
2: 0.5352071929258406,
3: 0.6139719037555759}}
Are you trying to perform this operation on the column names, or to values of a specific column?
Either way, I think this will do the job:
import pandas as pd
# Define the example dataframe
df = pd.DataFrame(
{
'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
2: 0.5310546143038515, 3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
2: 0.5086236274690276, 3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
2: 0.5352071929258406, 3: 0.6139719037555759}
}
)
# Original dataframe:
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A
0 A1BG 0.278916 0.318497 0.400120
1 A1CF 0.786837 0.386177 0.549835
2 A2BP1 0.531055 0.508624 0.535207
3 A2LD1 0.711916 0.403601 0.613972
"""
# Replacing column names, using '-' as separator:
df.columns = df.columns.astype(str).str.split('-').str[-1]
# Modified dataframe:
df
"""
index 01A 01A 01A
0 A1BG 0.278916 0.318497 0.400120
1 A1CF 0.786837 0.386177 0.549835
2 A2BP1 0.531055 0.508624 0.535207
3 A2LD1 0.711916 0.403601 0.613972
"""
You can apply the same logic to your dataframe index, or specific columns:
df = pd.DataFrame(
{
'index': {0: 'A1BG', 1: 'A1CF', 2: 'A2BP1', 3: 'A2LD1'},
'TCGA-2K-A9WE-01A': {0: 0.27891582736223297, 1: 0.786837244239289,
2: 0.5310546143038515, 3: 0.7119161837613309},
'TCGA-2Z-A9J1-01A': {0: 0.318496987871566, 1: 0.386177267500376,
2: 0.5086236274690276, 3: 0.4036012750884792},
'TCGA-2Z-A9J2-01A': {0: 0.400119915667055, 1: 0.54983504208745,
2: 0.5352071929258406, 3: 0.6139719037555759},
'name': {0: 'A;B;C', 1: 'AAA', 2: 'BBB', 3: 'C-DE'}
}
)
# Original dataframe:
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A name
0 A1BG 0.278916 0.318497 0.400120 A;B;C
1 A1CF 0.786837 0.386177 0.549835 AAA
2 A2BP1 0.531055 0.508624 0.535207 BBB
3 A2LD1 0.711916 0.403601 0.613972 C-DE
"""
# Splitting values from column "name":
df['name'] = df['name'].astype(str).str.split(';').str[-1]
df
"""
index TCGA-2K-A9WE-01A TCGA-2Z-A9J1-01A TCGA-2Z-A9J2-01A name
0 A1BG 0.278916 0.318497 0.400120 C
1 A1CF 0.786837 0.386177 0.549835 AAA
2 A2BP1 0.531055 0.508624 0.535207 BBB
3 A2LD1 0.711916 0.403601 0.613972 C-DE
"""
Note
Please note that if the column values hold multiple repetitions of the same separator (e.g.: "A;B;C"), only the last substring gets returned (for "A;B;C", returns "C").
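For the single-column case the question asks about, the requested list comprehension keeps rows without a ';' unchanged (silently dropping them is what caused the original ValueError); a sketch on a hypothetical Name column:

```python
import pandas as pd

meth = pd.DataFrame({'Name': ['A;B;C', 'AAA', 'BBB', 'C-DE']})

# split at the first ';' and keep the remainder; leave other values as-is
meth['Name'] = [n.split(';', 1)[1] if ';' in n else n for n in meth['Name']]
```

Because every input row produces exactly one output value, the list length always matches the index length.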
I want to do something like VLOOKUP in pandas. I have a two-column data frame and need to check whether the 2nd column's values (B) exist in the 1st column (A); if yes, the B value of the matching row should be inserted in a new column named C. Below is a sample table.
original data frame is:
A B
a -
b a
c a
d b
e d
preferred data frame will be:
A B C
a - N/A
b a -
c a -
d b a
e d b
Actually, I am a beginner in Python, but in Excel this could easily be done with a VLOOKUP between columns A and B, with the result returned in column C.
Below is the code I wrote, but it is incomplete and does not work:
import pandas as pd

excel_file = r'D:\Test\Test.xlsx'
data = pd.read_excel(excel_file, sheet_name=0)
df = pd.DataFrame(data, columns=['A', 'B'])
lr = df.index.values.astype(int)[-1]
for j in range(0, 2):
    for i in range(1, lr):
        C = []
        row = 0
        for i in df.iloc[:, 1]:
            df["C"] = df.iloc[:, 0].str.match(i)
            if i == "-":
                C[row] = C.append(i)
                row += 1
            elif df.at[i, ['Index']]:
                idx = next(iter(df[df['Index'] == True].index), 'no match')
                df.at[i, "C"] = df.iloc[idx, 1]
print(df)
You could fill C using np.where and then map C using a dictionary of A and B
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
'B': {0: '-', 1: 'a', 2: 'a', 3: 'b', 4: 'd'}})
df['C'] = np.where(df['B'].isin(df['A'].values), df['B'], np.nan)
df['C'] = df['C'].map(dict(zip(df.A.values, df.B.values)))
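A VLOOKUP is essentially a left join, so the same result can also be sketched with merge (same sample frame; the unmatched '-' row comes back as NaN, i.e. the N/A in the preferred output):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
                   'B': ['-', 'a', 'a', 'b', 'd']})

# look up each B value in A and bring back that row's B as the new column C
lookup = df.rename(columns={'A': 'B', 'B': 'C'})
out = df.merge(lookup, on='B', how='left')
```

how='left' preserves the original row order and keeps unmatched rows with NaN in C.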
If I remember correctly, VLOOKUP takes a value (say "a") and returns the value in another column (say column "B") whenever this value is matched: VLOOKUP("a", A:A, 2)
If this is what you want, you can do that by creating a new (empty) column, then filling the right rows with the right value, for example:
# Create a new column named C
df["C"] = None
# Fill the cells in C for which the column A matches the condition =="a"
df.loc[df["A"] == "a", "C"] = df.loc[df["A"] == "a", "B"]
Use a list comprehension to do the VLOOKUP:
list1 = ['a', 'b', 'c', 'd', 'e']
list2 = [np.nan, 'a', 'a', 'b', 'd']
df = pd.DataFrame({'A': list1, 'B': list2})
df['C'] = [df.loc[df['A'] == x, 'B'].values[0] if x in df['A'].values else '' for x in df['B']]
I have the below dataframe and I want to pivot it, turning the Name column's values into multiple columns, with the corresponding values from the Data column as those columns' values.
Since the Data column holds all kinds of data, when I pivot I don't get the required result; can someone please advise what I'm doing wrong?
import numpy as np
import pandas as pd

dict_d = {'Name': {0: 'Number', 1: 'Purpose', 2: 'Approver', 3: 'internal/external', 4: 'Name', 5: 'N Mnemonic'}, 'Data': {0: '123456', 1: 'BC', 2: np.nan, 3: 'internal', 4: np.nan, 5: 'xyz'}}
df = pd.DataFrame(dict_d)
df
Output:
Name Data
0 Number 123456
1 Purpose BC
2 Approver NaN
3 internal/external internal
4 Name NaN
5 N Mnemonic xyz
I've tried this
df.pivot_table(columns='Name', values='Data', aggfunc=lambda x: ''.join(str(x)))
Name Approver N Mnemonic Name Number Purpose internal/external
Data NameData NameData NameData NameData NameData NameData
But in row 1 I want the data values.
I think you need to convert Name to the index, select a one-column DataFrame with double brackets [[]], and transpose:
df1 = df.set_index('Name')[['Data']].T
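Run on the sample data from the question, that one-liner gives a single row labelled Data with the Name values as columns:

```python
import numpy as np
import pandas as pd

dict_d = {'Name': {0: 'Number', 1: 'Purpose', 2: 'Approver',
                   3: 'internal/external', 4: 'Name', 5: 'N Mnemonic'},
          'Data': {0: '123456', 1: 'BC', 2: np.nan,
                   3: 'internal', 4: np.nan, 5: 'xyz'}}
df = pd.DataFrame(dict_d)

# Name values become the columns; the single Data row holds the values
df1 = df.set_index('Name')[['Data']].T
```

No aggregation is needed because each Name occurs exactly once, which is why the transpose works where pivot_table's aggfunc got in the way.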