I understand the general usage of iloc as follows.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df_ = df.iloc[:, 1:4]
On the other hand, although it is a limited usage, is it possible to set iloc using a string?
Below is pseudo code that does not work properly but is what I would like to do.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df.columns = ["money","job","fruits","animals","height"]
tests = ["1:2","2:3", "1:4"]
for i in tests:
    print(df.iloc[:,i])
Is there a better way to split the string into "start_col" and "end_col" using a function?
You can just create a converter function:
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
ranges = ["1:2", "2:3", "1:4"]
def as_int_range(ranges):
    return [i for rng in ranges for i in range(*map(int, rng.split(':')))]
df.iloc[as_int_range(ranges),:]
0 1 2 3 4
1 4 5 6 4 5
2 7 8 9 4 5
1 4 5 6 4 5
2 7 8 9 4 5
3 10 11 12 4 5
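Since the question asks about columns rather than rows, here is a minimal sketch of the same idea applied to the column axis. It assumes every string has the form "start:end" with both numbers present; `to_slice` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [4, 5, 6, 4, 5], [7, 8, 9, 4, 5], [10, 11, 12, 4, 5]])
df.columns = ["money", "job", "fruits", "animals", "height"]

def to_slice(text):
    """Turn a string such as "1:4" into slice(1, 4) (hypothetical helper)."""
    start, end = (int(part) for part in text.split(":"))
    return slice(start, end)

for text in ["1:2", "2:3", "1:4"]:
    # iloc accepts a slice object on the column axis; the end position is excluded
    print(df.iloc[:, to_slice(text)])
```

A slice object keeps the selection contiguous; if the ranges may overlap or repeat, a list of positions as in the answer above is the more flexible choice.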
iloc[ ] selects by integer position. To select by label, such as a column name, use loc[ ] in the same way you used iloc[ ]; note that a loc[ ] slice includes its end label. The official pandas documentation for loc[ ] is at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
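As a small illustration of the difference, the sketch below selects the same three columns once by position and once by label. The column names are taken from the question; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [4, 5, 6, 4, 5], [7, 8, 9, 4, 5], [10, 11, 12, 4, 5]],
    columns=["money", "job", "fruits", "animals", "height"],
)

# Position-based: columns at positions 1, 2 and 3 (the end position 4 is excluded)
by_position = df.iloc[:, 1:4]

# Label-based: columns from "job" through "animals" (the end label is included)
by_label = df.loc[:, "job":"animals"]

print(by_position.equals(by_label))
```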
I didn't mention it in my original question.
I wrote a program that supports examples like ["1:3, 4"].
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df.columns = ["a", "b", "c" , "d", "e"]
def args_to_list(string):
    strings = string.split(",")
    column_list = []
    for each_string in strings:
        each_string = each_string.strip()
        if ":" in each_string:
            start_, end_ = each_string.split(":")
            for i in range(int(start_), int(end_)):
                column_list.append(i)
        else:
            column_list.append(int(each_string))
    return column_list
tests = ["1:2", "1,2,3,4", "1:2,3", "1,2:3,4"]
for i in tests:
    list_ = args_to_list(i)
    print(list_)
    print(df.iloc[:, list_])
print(list_)
I have a dataframe in which one of the columns holds strings separated by a dash. I want to get the part before the dash. Could you help me with that?
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5]
df['b'] = ['C-C02','R-C05','R-C01','C-C06', 'RC-C06']
The desired output is:
You could use str.replace to remove the - and all characters after it:
df['b'] = df['b'].str.replace(r'-.*$', '', regex=True)
Output:
a b
0 1 C
1 2 R
2 3 R
3 4 C
4 5 RC
You want to split each string on the '-' character and keep the part before it:
df['c'] = [s.split('-')[0] for s in df['b']]
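The same split can also be written with the vectorized string accessor instead of a list comprehension. This is a minimal sketch using the data from the question; passing `n=1` limits the split to the first dash.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['C-C02', 'R-C05', 'R-C01', 'C-C06', 'RC-C06'],
})

# Split on the first '-' only, then keep the first piece of each result
df['c'] = df['b'].str.split('-', n=1).str[0]
print(df)
```

Unlike the list comprehension, the `.str` accessor passes missing values through as NaN rather than raising an error.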
I have a dataframe, and one of its columns looks roughly like the one shown below. Is there any way to rename the rows? They should be renamed to psPARP8, psEXOC8, psTMEM128, psCFHR3, where ps stands for pseudogene and the term in brackets is the code for that pseudogene. I would highly appreciate it if anyone could write a Python function or suggest an alternative to perform this task.
d = {'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
"exocyst complex component 8 (EXOC8) pseudogene",
"transmembrane protein 128 (TMEM128) pseudogene",
"complement factor H related 3 (CFHR3) pseudogene",
"mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
"relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene",
"nasGBP7and GBP2"
]}
df = pd.DataFrame(data=d)
The desired output should look like this
gene_final
-----------
psPARP8
psEXOC8
psTMEM128
psCFHR3
psMT-ND4L
psRXFP4
nasGBP2
import re
import pandas as pd
# build dataframe
df = pd.DataFrame({'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                                  "exocyst complex component 8 (EXOC8) pseudogene",
                                  "transmembrane protein 128 (TMEM128) pseudogene",
                                  "complement factor H related 3 (CFHR3) pseudogene",
                                  "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                                  "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene"]})
def extract_name(s):
    """Helper function to extract the ps name."""
    s = re.findall(r"\s\((\S*)\s?\)", s)[0]  # find the code between ' (' and ')', allowing a trailing space
    s = f"ps{s}"  # add the ps prefix
    return s
# apply extract_name() to each row
df['gene_final'] = df['gene_final'].apply(extract_name)
print(df)
> gene_final
> 0 psPARP8
> 1 psEXOC8
> 2 psTMEM128
> 3 psCFHR3
> 4 psMT-ND4L
> 5 psRXFP4
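A row with no bracketed code, such as "nasGBP7and GBP2" in the question, would raise an IndexError in a findall-based helper. The sketch below is one possible alternative using `Series.str.extract`, which returns NaN where nothing matches. It assumes such rows should simply be left unchanged, which the question does not specify, and the column name `short_name` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'gene_final': [
    "poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
    "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene",
    "nasGBP7and GBP2",
]})

# Capture the code inside ' (' ... ')'; rows without a match become NaN
code = df['gene_final'].str.extract(r"\s\((\S+)\s?\)", expand=False)

# Prefix the matches with 'ps'; fall back to the original text where there was no match
df['short_name'] = ('ps' + code).fillna(df['gene_final'])
print(df['short_name'])
```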
I think you are asking about index names (row labels).
This is how you set the row names when building a DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
'B': [12, 22, 32],
'C': [13, 23, 33]},
index=['ONE', 'TWO', 'THREE'])
print(df)
You can also change the row names after building the DataFrame, like this:
df_new = df.rename(columns={'A': 'Col_1'}, index={'ONE': 'Row_1'})
print(df_new)
# Col_1 B C
# Row_1 11 12 13
# TWO 21 22 23
# THREE 31 32 33
print(df)
# A B C
# ONE 11 12 13
# TWO 21 22 23
# THREE 31 32 33
I have a df with some trigrams (and some more ngrams) and I would like to check if the sentence starts or ends with a list of specific words and remove them from my df. For example:
import pandas as pd
df = pd.DataFrame({'Trigrams+': ['because of tuna', 'to your family', 'pay to you', 'give you in','happy birthday to you'], 'Count': [10,9,8,7,5]})
list_remove = ['of','in','to', 'a']
print(df)
Trigrams+ Count
0 because of tuna 10
1 to your family 9
2 pay to you 8
3 give you in 7
4 happy birthday to you 5
I tried using strip, but in the example above the first row would become "because of tun".
The output should be like this:
list_remove = ['of','in','to', 'a']
Trigrams+ Count
0 because of tuna 10
1 pay to you 8
2 happy birthday to you 5
Can someone help me with that? Thanks in advance!
Try:
list_remove = ["of", "in", "to", "a"]
tmp = df["Trigrams+"].str.split()
df = df[~(tmp.str[0].isin(list_remove) | tmp.str[-1].isin(list_remove))]
print(df)
Prints:
Trigrams+ Count
0 because of tuna 10
2 pay to you 8
4 happy birthday to you 5
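A regular expression can express the same filter in one pass. This is a sketch rather than a drop-in replacement: it assumes every entry has at least two words separated by whitespace, so a row consisting of a single listed word would not be removed. The names `words`, `pattern` and `filtered` are only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Trigrams+': ['because of tuna', 'to your family', 'pay to you',
                  'give you in', 'happy birthday to you'],
    'Count': [10, 9, 8, 7, 5],
})
list_remove = ['of', 'in', 'to', 'a']

# Escape each word, then match it as the whole first word or the whole last word
words = "|".join(map(re.escape, list_remove))
pattern = rf"^(?:{words})\s|\s(?:{words})$"

filtered = df[~df['Trigrams+'].str.contains(pattern, regex=True)]
print(filtered)
```

Requiring whitespace next to the word is what keeps "because of tuna" in the result: the final "a" of "tuna" is not a separate word.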
You can try something like this:
import numpy as np
def func(x):
    y = x.split()[0]   # first word
    z = x.split()[-1]  # last word
    if (y in list_remove) or (z in list_remove):
        return np.nan
    return x
df['Trigrams+'] = df['Trigrams+'].apply(func)
df = df.dropna().reset_index(drop=True)
I'm trying to replace and add some values in a pandas DataFrame. I have the following code:
import pandas as pd
df = pd.DataFrame({'A': ["va-lue", "value-%", "value"], 'B': [4, 5, 6]})
print(df)
df['A'] = df['A'].str.replace('%', '_0')
print(df)
df['A'] = df['A'].str.replace('-', '')
print(df)
# almost there?
df.A[df['A'].str.contains('-')] + "_0"
How can I find the cell values in column A that contain a '-' sign, remove the '-' from them, and add a trailing '_0' to those values? The resulting data set should look like this:
A B
0 value_0 4
1 value_0 5
2 value 6
You can first record which rows of A need the trailing string, and then perform the operations in two steps:
mask = df['A'].str.contains('-')  # rows that originally contain '-'
df['A'] = df['A'].str.replace('-|%', '', regex=True)  # drop '-' and '%'
df.loc[mask, 'A'] += '_0'  # append the suffix only to the masked rows
print(df)
Output:
A B
0 value_0 4
1 value_0 5
2 value 6