Changing row names in dataframe - python

I have a dataframe, and one of its columns roughly looks as shown below. Is there any way to rename the rows? The rows should be renamed as psPARP8, psEXOC8, psTMEM128, psCFHR3, where ps stands for pseudogene and the term in brackets is the code for that pseudogene. I would highly appreciate it if anyone could write a python function or suggest any alternative to perform this task.
import pandas as pd

d = {'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
"exocyst complex component 8 (EXOC8) pseudogene",
"transmembrane protein 128 (TMEM128) pseudogene",
"complement factor H related 3 (CFHR3) pseudogene",
"mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
"relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene",
"nasGBP7and GBP2"
]}
df = pd.DataFrame(data=d)
The desired output should look like this
gene_final
-----------
psPARP8
psEXOC8
psTMEM128
psCFHR3
psMT-ND4L
psRXFP4
nasGBP2

import pandas as pd
import re

# build dataframe
df = pd.DataFrame({'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                                  "exocyst complex component 8 (EXOC8) pseudogene",
                                  "transmembrane protein 128 (TMEM128) pseudogene",
                                  "complement factor H related 3 (CFHR3) pseudogene",
                                  "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                                  "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene"]})

def extract_name(s):
    """Helper function to extract the ps name."""
    s = re.findall(r"\s\((\S*)\s?\)", s)[0]  # find the word between ' (' and ')'
    s = f"ps{s}"                             # add the ps prefix to the string
    return s

# apply extract_name() to each row
df['gene_final'] = df['gene_final'].apply(extract_name)
print(df)
> gene_final
> 0 psPARP8
> 1 psEXOC8
> 2 psTMEM128
> 3 psCFHR3
> 4 psMT-ND4L
> 5 psRXFP4

I think you are asking about the index names (rows).
This is how you set the row names when building a DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'TWO', 'THREE'])
print(df)
and you can also change the row names after building the dataframe, like this:
df_new = df.rename(columns={'A': 'Col_1'}, index={'ONE': 'Row_1'})
print(df_new)
# Col_1 B C
# Row_1 11 12 13
# TWO 21 22 23
# THREE 31 32 33
print(df)
# A B C
# ONE 11 12 13
# TWO 21 22 23
# THREE 31 32 33
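If you just want to replace every row label at once, a minimal alternative is to assign a new list to the index directly:
df.index = ['Row_1', 'Row_2', 'Row_3']
print(df)
#         A   B   C
# Row_1  11  12  13
# Row_2  21  22  23
# Row_3  31  32  33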

Related

Explode a string with random length equally to next empty columns pandas

Let's say I have a df like this:
                                                string  some_col
0   But were so TESSA tell me a little bit more t ...        10
1                                                             15
2                                                             14
3                          Some other text xxxxxxxxxx         20
How can I split the string column so that the long string is exploded into chunks spread evenly across the following empty cells? It should look like this after fitting:
                          string  some_col
0    But were so TESSA tell me .        10
1   little bit more t seems like        15
2               you pretty upset        14
Reproducible:
import pandas as pd
data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14]]
df = pd.DataFrame(data, columns=['string', 'some_col'])
print(df)
I have no idea how to even get started. I'm looking for execution steps so that I can implement it on my own; any reference would be great!
You need to create groups made of a non-empty row and all the consecutive empty rows that follow it (the group length gives the number of chunks), then use np.array_split to create n lists of words:
import numpy as np

# x.iloc[0] is the first (non-empty) row of the group; len(x) is the group length
wrap = lambda x: [' '.join(l) for l in np.array_split(x.iloc[0].split(), len(x))]

df['string2'] = (df.groupby(df['string'].str.len().ne(0).cumsum())['string']
                   .apply(wrap).explode().to_numpy())
Output:
                                           string  some_col                      string2
0  But were so TESSA tell me a you pretty upset.         10            But were so TESSA
1                                                         15                    tell me a
2                                                         14            you pretty upset.
3                     Some other text xxxxxxxxxx          20   Some other text xxxxxxxxxx
This works in your case:
import pandas as pd
import numpy as np
from math import ceil

data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14],
        ['Some other long string that you need..', 10], ['', 15]]
df = pd.DataFrame(data, columns=['string', 'some_col'])

df['string'] = np.where(df['string'] == '', None, df['string'])
df.ffill(inplace=True)
df['group_id'] = df.groupby('string').cumcount() + 1
df['max_group_id'] = df.groupby('string').transform('count')['group_id']
df['string'] = df['string'].str.split(' ')
df['string'] = df.apply(func=lambda r: r['string'][int(ceil(len(r['string']) / r['max_group_id']) * (r['group_id'] - 1)):
                                                   int(ceil(len(r['string']) / r['max_group_id']) * r['group_id'])], axis=1)
df.drop(columns=['group_id', 'max_group_id'], inplace=True)
print(df)
Result:
                         string  some_col
0        [But, were, so, TESSA]        10
1            [tell, me, a, you]        15
2              [pretty, upset.]        14
3   [Some, other, long, string]        10
4           [that, you, need..]        15
You can customize the number of rows you want with this code:
import pandas as pd
import random

df = pd.read_csv('text.csv')
string = df.at[0, 'string']

# the number of rows you want
num_of_rows = 4
endLineLimits = random.sample(range(1, string.count(' ')), num_of_rows - 1)

count = 1
for i in range(len(string)):
    if string[i] == ' ':
        if count in endLineLimits:
            string = string[:i] + ';' + string[i+1:]
        count += 1

newStrings = string.split(';')
for i in range(len(df)):
    df.at[i, 'string'] = newStrings[i]
print(df)
Example result:
                    string  some_col
0   But were so TESSA tell        10
1   me a little bit more t        15
2    seems like you pretty        14
3                    upset        20

How to sum same columns (differentiated by suffix) in pandas?

I have a dataframe that looks like this:
total_customers  total_customer_2021-03-31  total_purchases  total_purchases_2021-03-31
              1                         10                4                           6
              3                         14                3                           2
Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e. the expected output is:
total_customers total_purchases
11 10
17 5
The reason I cannot do this manually is that I have 100+ column pairs, so I need an efficient way to do this. Also, the order of the columns is not predictable. What do you recommend?
Thanks!
We need to build an Index of columns such that the columns in each pair share the same name; then we can groupby sum on axis=1:
cols = pd.Index(['total_customers', 'total_customers',
                 'total_purchases', 'total_purchases'])
result_df = df.groupby(cols, axis=1).sum()
With the shown example, we can str.replace an optional 's', followed by an underscore, followed by the date format (four digits-two digits-two digits), with a single 's'. This pattern may need to be modified depending on the actual column names:
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()
result_df:
total_customers total_purchases
0 11 10
1 17 5
Setup and imports:
import pandas as pd

df = pd.DataFrame({
    'total_customers': [1, 3],
    'total_customer_2021-03-31': [10, 14],
    'total_purchases': [4, 3],
    'total_purchases_2021-03-31': [6, 2]
})
Assuming that your dataframe is called df, a straightforward solution is:
sum_customers = df['total_customers'] + df['total_customer_2021-03-31']
sum_purchases = df['total_purchases'] + df['total_purchases_2021-03-31']
df_total = pd.DataFrame({'total_customers': sum_customers, 'total_purchases': sum_purchases})
and that will give you the output you want.
import pandas as pd
data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}
df = pd.DataFrame(data=data)
final_df = pd.DataFrame()
final_df["total_customers"] = df.filter(regex='total_customers*').sum(1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(1)
output
final_df
total_customers total_purchases
0 11 10
1 17 5
Using #HenryEcker's sample data, and building off of the example in the docs, you can create a function and groupby on the column axis:
def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'

df.groupby(get_column, axis=1).sum()
total_customers total_purchases
0 11 10
1 17 5
I changed the headings while coding to make it shorter, just FYI.
data = {"total_c" : [1,3], "total_c_2021" :[10,14],
"total_p": [4,3], "total_p_2021": [6,2]}
df = pd.DataFrame(data)
df["total_costumers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]
If you don't want to see other columns you can drop them
df = df.loc[:, ['total_costumers','total_purchases']]
NEW PART
So I might have found a starting point for your solution! I don't know the column names, but the following code can be adapted if your column names follow a pattern (patterned dates, names, etc.). Could you match the column names with a loop?
df['total_customer'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)
And this solution might be helpful for you with some alterations; see the sketch below.
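A minimal sketch of that loop idea, assuming the two base prefixes are known (with 100+ pairs they would need to be derived from df.columns instead):
import pandas as pd

result = pd.DataFrame()
for prefix in ['total_customer', 'total_purchase']:              # assumed prefixes
    pair_cols = [c for c in df.columns if c.startswith(prefix)]  # both columns of the pair
    result[prefix + 's'] = df[pair_cols].sum(axis=1)
print(result)
#    total_customers  total_purchases
# 0               11               10
# 1               17                5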

Element-wise Comparison of Two Pandas Dataframes

I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns element-wise and, when they are not matching, print out both values. I have tried:
if df1[col] != df2[col]:
    print(df1[col])
    print(df2[col])
where I get the error 'The truth value of a Series is ambiguous'.
I believe this is because the comparison produces a Series of boolean values, which causes the ambiguity. I also tried various forms of for loops, which did not resolve the issue.
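For illustration, a minimal sketch (with made-up values) of what such a comparison returns, i.e. a boolean Series rather than a single True/False:
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([1, 9, 3])
print(s1 != s2)
# 0    False
# 1     True
# 2    False
# dtype: bool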
Can anyone point me to how I should go about doing what I described?
This might work for you:
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})

if not df2[df2['col1'] != df1['col1']].empty:
    print(df1[df1['col1'] != df2['col1']])
    print(df2[df2['col1'] != df1['col1']])
Output:
col1
2 3
4 5
col1
2 9
4 7
You need to get hold of the index where the column values are not matching. Once you have that index then you can query the individual DFs to get the values.
Please try the following and see if this helps:
for ind in df1.loc[df1['col1'] != df2['col1']].index:
    x = df1.loc[df1.index == ind, 'col1'].values[0]
    y = df2.loc[df2.index == ind, 'col1'].values[0]
    print(x, y)
Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row and pass the resulting boolean mask to df.loc[mask, [col1, col2]], which returns the required set of rows where col1 != col2.
Option-2: We build the boolean mask directly with df[col1] != df[col2]; the rest of the logic is the same as Option-1.
Dummy Data
I made the dummy data such that for indices 2, 6, and 8, columns 'a' and 'c' differ. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
a b c
0 0 0 0
1 1 1 1
2 2 4 0
3 3 9 3
4 4 16 4
5 5 25 5
6 6 36 20
7 7 49 7
8 8 64 40
9 9 81 9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
a c
2 2 0
6 6 20
8 8 40
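As a side note, if you are on pandas 1.1 or newer and both frames share the same shape and labels, DataFrame.compare gives a similar side-by-side view of just the differing values (a small sketch reusing df1/df2 from the first answer):
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
# keeps only the rows/columns that differ, labelled 'self' and 'other'
print(df1.compare(df2))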

String increment of characters for a column

I've tried researching but didn't get any leads, so I'm posting a question.
I have a df, and I want each character of the string column values to be incremented by 3 based on its ASCII value:
import pandas as pd

data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
Name Age
0 Tom 10
1 Nick 15
2 Juli 14
The final answer should look like this, with each character of Name shifted by 3 ASCII positions:
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
This action has to be carried out on a df with 32,000 rows. Please suggest how to achieve this result.
Here's one way using Python's built-in chr and ord:
df['Name'] = [''.join(chr(ord(s)+3) for s in i) for i in df.Name]
print(df)
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14
Try the code below:
import pandas as pd

data = [['Tom', 10], ['Nick', 15], ['Juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

def fn(inp_str):
    return ''.join([chr(ord(i) + 3) for i in inp_str])

df['Name'] = df['Name'].apply(fn)
df
Output is
Name Age
0 Wrp 10
1 Qlfn 15
2 Mxol 14

Curious about groupby function in pandas , how to write groupby for a generic dataset?

I want to write a generic function for groupby. Suppose I have a dataset with around 100 columns, for example 70 categorical columns and 30 numeric attributes. Now I would like to write a generic Python function that will just take the dataset and display appropriate groupby results in the form of plots or data. Any expert advice before I start on this?
Thanks,
Shivam
You could incorporate the random module from the standard library to get a random sample of all numeric columns.
import random
import pandas as pd

df = pd.DataFrame({
    'a': list('abcde'),
    'b': ['1', '2', '3', '4', '5'],
    'c': range(5),
    'd': [i * 21 for i in range(5)],
    'e': [12, 32, 45, 67, 54]})

str_cols = df.select_dtypes(exclude='number').columns.tolist()
num_cols = random.sample(df.select_dtypes('number').columns.tolist(), k=2)
dff = df.loc[:, str_cols + num_cols]
print(dff)
   a  b   d  c
0  a  1   0  0
1  b  2  21  1
2  c  3  42  2
3  d  4  63  3
4  e  5  84  4
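To get closer to the generic function asked about, here is a rough sketch under the assumption that every categorical column should simply be grouped against all numeric columns with a single aggregation (mean here); plotting and smarter column selection are left out:
import pandas as pd

def groupby_summary(df, agg='mean'):
    """Sketch: group every categorical column against every numeric column."""
    cat_cols = df.select_dtypes(exclude='number').columns
    num_cols = list(df.select_dtypes(include='number').columns)
    return {cat: df.groupby(cat)[num_cols].agg(agg) for cat in cat_cols}

# usage with the small frame built above
for cat, summary in groupby_summary(dff).items():
    print(f'grouped by {cat}:')
    print(summary)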
