Explode a string into roughly equal chunks across the following empty rows in pandas - python

Let's say I have a df like this:

                                              string  some_col
0  But were so TESSA tell me a little bit more t...        10
1                                                           15
2                                                           14
3                         Some other text xxxxxxxxxx        20

How can I split the string column so that the long string is exploded into roughly equal chunks across the empty cells below it? It should look like this after fitting:

                         string  some_col
0   But were so TESSA tell me .        10
1  little bit more t seems like        15
2              you pretty upset        14
Reproducible example:
import pandas as pd
data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14]]
df = pd.DataFrame(data, columns=['string', 'some_col'])
print(df)
I have no idea how to even get started. I'm looking for execution steps so that I can implement it on my own; any reference would be great!

You need to create groups made of a non-empty row and all the consecutive empty rows that follow it (the group length gives the number of chunks), then use np.array_split to create n lists of words:
import numpy as np

# split the group's first (non-empty) string into len(group) word chunks, then join each chunk back
wrap = lambda x: [' '.join(l) for l in np.array_split(x.iloc[0].split(), len(x))]

df['string2'] = (df.groupby(df['string'].str.len().ne(0).cumsum())['string']
                   .apply(wrap).explode().to_numpy())
Output:
                                          string  some_col                      string2
0  But were so TESSA tell me a you pretty upset.        10            But were so TESSA
1                                                        15                    tell me a
2                                                        14            you pretty upset.
3                     Some other text xxxxxxxxxx        20  Some other text xxxxxxxxxx
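To see what the grouping key looks like, here is a quick check (my addition) on the reproducible three-row example: each non-empty string opens a new group, and the empty rows below it inherit the same group number.

groups = df['string'].str.len().ne(0).cumsum()
print(groups.tolist())  # [1, 1, 1] -- the long string and its two empty followers form one group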

This works in your case:
import pandas as pd
import numpy as np
from math import ceil

data = [['But were so TESSA tell me a you pretty upset.', 10], ['', 15], ['', 14],
        ['Some other long string that you need..', 10], ['', 15]]
df = pd.DataFrame(data, columns=['string', 'some_col'])

# forward-fill the long string into the empty rows below it
df['string'] = np.where(df['string'] == '', None, df['string'])
df.ffill(inplace=True)

# position of each row within its group, and the group size
df['group_id'] = df.groupby('string').cumcount() + 1
df['max_group_id'] = df.groupby('string').transform('count')['group_id']

# give each row an (approximately) equal slice of the word list
df['string'] = df['string'].str.split(' ')
df['string'] = df.apply(
    lambda r: r['string'][int(ceil(len(r['string']) / r['max_group_id']) * (r['group_id'] - 1)):
                          int(ceil(len(r['string']) / r['max_group_id']) * r['group_id'])],
    axis=1)
df.drop(columns=['group_id', 'max_group_id'], inplace=True)
print(df)
Result:
                        string  some_col
0       [But, were, so, TESSA]        10
1           [tell, me, a, you]        15
2             [pretty, upset.]        14
3  [Some, other, long, string]        10
4          [that, you, need..]        15
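If you want plain strings rather than lists of words, you can join each list back together afterwards (a small follow-up of mine, not part of the answer above):

df['string'] = df['string'].str.join(' ')  # turn each word list back into a single string
print(df)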

You can customize the number of rows you want with this code:
import pandas as pd
import random

df = pd.read_csv('text.csv')
string = df.at[0, 'string']

# the number of rows you want
num_of_rows = 4

# pick random word boundaries where the string will be cut
endLineLimits = random.sample(range(1, string.count(' ')), num_of_rows - 1)

count = 1
for i in range(len(string)):
    if string[i] == ' ':
        if count in endLineLimits:
            string = string[:i] + ';' + string[i+1:]
        count += 1

newStrings = string.split(';')
for i in range(len(df)):
    df.at[i, 'string'] = newStrings[i]
print(df)
Example result:
                   string  some_col
0  But were so TESSA tell        10
1  me a little bit more t        15
2   seems like you pretty        14
3                   upset        20
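Note that random.sample picks different split points on every run; if you need the result to be reproducible, seed the random module first (my addition, not part of the answer above):

import random
random.seed(42)  # any fixed seed makes random.sample return the same split points each run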

Related

How to use a string to set iloc in pandas

I understand the general usage of iloc as follows.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df_ = df.iloc[:, 1:4]
On the other hand, although it is a limited use case, is it possible to set the iloc slice using a string?
Below is pseudo code that does not work properly but is what I would like to do.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df.columns = ["money","job","fruits","animals","height"]
tests = ["1:2","2:3", "1:4"]
for i in tests:
    print(df.iloc[:, i])
Is there a better way to split the string into "start_col" and "end_col" using a function?
You can just create a converter function:
import pandas as pd

df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
ranges = ["1:2", "2:3", "1:4"]

def as_int_range(ranges):
    # expand each "start:end" string into the integers of that range
    return [i for rng in ranges for i in range(*map(int, rng.split(':')))]

df.iloc[as_int_range(ranges), :]
    0   1   2  3  4
1   4   5   6  4  5
2   7   8   9  4  5
1   4   5   6  4  5
2   7   8   9  4  5
3  10  11  12  4  5
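The question actually slices columns rather than rows; the same helper works on the column axis as well (my usage note, reusing the column names from the question):

df.columns = ["money", "job", "fruits", "animals", "height"]
df.iloc[:, as_int_range(["1:2", "2:3", "1:4"])]  # positions 1, 2, 1, 2, 3 -> job, fruits, job, fruits, animals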
iloc[ ] is for positional (integer-based) slicing. For slicing with string labels you can use loc[ ] the same way you used iloc[ ] with numbers. Here is the official pandas documentation for loc[ ]: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
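For example, with the column names from the question, a label-based slice looks like this (a short illustration of the suggestion above; note that, unlike iloc, a loc slice includes the end label):

df.loc[:, "job":"animals"]  # columns job, fruits and animals -- the end label is included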
I didn't mention it in my original question, but I wrote a program that supports examples like ["1:3, 4"].
import pandas as pd

df = pd.DataFrame([[1,2,3,4,5],[4,5,6,4,5],[7,8,9,4,5],[10,11,12,4,5]])
df.columns = ["a", "b", "c", "d", "e"]

def args_to_list(string):
    """Turn a string like "1:3, 4" into a list of integer positions."""
    strings = string.split(",")
    column_list = []
    for each_string in strings:
        each_string = each_string.strip()
        if ":" in each_string:
            start_, end_ = each_string.split(":")
            for i in range(int(start_), int(end_)):
                column_list.append(i)
        else:
            column_list.append(int(each_string))
    return column_list

tests = ["1:2", "1,2,3,4", "1:2,3", "1,2:3,4"]
for i in tests:
    list_ = args_to_list(i)
    print(list_)
    print(df.iloc[:, list_])
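For what it is worth, numpy ships an indexer that accepts a similar mix of slices and integers directly, so you may not need the string parsing at all (my addition, not part of the answer above):

import numpy as np
df.iloc[:, np.r_[1:3, 4]]  # np.r_[1:3, 4] evaluates to array([1, 2, 4]) -> columns b, c and e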

split the string in dataframe in python

I have a dataframe and one of its columns is a string separated by a dash. I want to get the part before the dash. Could you help me with that?
import pandas as pd
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 5]
df['b'] = ['C-C02','R-C05','R-C01','C-C06', 'RC-C06']
The desired output is:
You could use str.replace to remove the - and all characters after it:
df['b'] = df['b'].str.replace(r'-.*$', '', regex=True)
Output:
   a   b
0  1   C
1  2   R
2  3   R
3  4   C
4  5  RC
You want to split each string on the '-' character and keep the part before it:
df['c'] = [s.split('-')[0] for s in df['b']]
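A vectorized alternative that stays inside pandas (same idea as the list comprehension above, using the .str accessor instead of a Python loop):

df['c'] = df['b'].str.split('-').str[0]  # keep the part before the first dash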

Changing row names in dataframe

I have a dataframe and one of its columns roughly looks like the one shown below. Is there any way to rename the rows? The rows should be renamed psPARP8, psEXOC8, psTMEM128, psCFHR3, where ps represents pseudogene and the term in brackets is the code for that pseudogene. I would highly appreciate it if anyone could write a Python function, or suggest an alternative, to perform this task.
import pandas as pd

d = {'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                    "exocyst complex component 8 (EXOC8) pseudogene",
                    "transmembrane protein 128 (TMEM128) pseudogene",
                    "complement factor H related 3 (CFHR3) pseudogene",
                    "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                    "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene",
                    "nasGBP7and GBP2"]}
df = pd.DataFrame(data=d)
The desired output should look like this
gene_final
-----------
psPARP8
psEXOC8
psTMEM128
psCFHR3
psMT-ND4L
psRXFP4
nasGBP2
import pandas as pd
import re

# build dataframe (the MT-ND4L and RXFP4 rows from the question are included so that
# the output below matches)
df = pd.DataFrame({'gene_final': ["poly(ADP-ribose) polymerase family member 8 (PARP8) pseudogene",
                                  "exocyst complex component 8 (EXOC8) pseudogene",
                                  "transmembrane protein 128 (TMEM128) pseudogene",
                                  "complement factor H related 3 (CFHR3) pseudogene",
                                  "mitochondrially encoded NADH 4L dehydrogenase (MT-ND4L) pseudogene",
                                  "relaxin family peptide/INSL5 receptor 4 (RXFP4 ) pseudogene"]})

def extract_name(s):
    """Helper function to extract the ps name."""
    s = re.findall(r"\s\((\S*)\s?\)", s)[0]  # find the word between ' (' and ')'
    s = f"ps{s}"                             # add the ps prefix
    return s

# apply extract_name() to each row
df['gene_final'] = df['gene_final'].apply(extract_name)
print(df)
>   gene_final
> 0    psPARP8
> 1    psEXOC8
> 2  psTMEM128
> 3    psCFHR3
> 4  psMT-ND4L
> 5    psRXFP4
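extract_name assumes every row contains a code in parentheses; for an entry like "nasGBP7and GBP2" from the question, findall returns an empty list and the [0] lookup raises an IndexError. A guarded variant (my sketch, not from the original answer) can fall back to the raw value:

def extract_name_safe(s):
    """Like extract_name, but leaves rows without a (CODE) part untouched."""
    matches = re.findall(r"\s\((\S*)\s?\)", s)
    return f"ps{matches[0]}" if matches else s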
I think you are asking about index names (rows).
This is how you change the row names in DataFrames:
import pandas as pd

df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'TWO', 'THREE'])
print(df)
and you can also change the row names after building the dataframe, like this:
df_new = df.rename(columns={'A': 'Col_1'}, index={'ONE': 'Row_1'})
print(df_new)
#        Col_1   B   C
# Row_1     11  12  13
# TWO       21  22  23
# THREE     31  32  33
print(df)
#         A   B   C
# ONE    11  12  13
# TWO    21  22  23
# THREE  31  32  33
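If you want to replace all row labels at once rather than renaming selected ones, assigning a new list to the index also works (a small addition on my part):

df.index = ['Row_1', 'Row_2', 'Row_3']  # the list must match the number of rows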

Trimming specific words in a dataframe

I have a df with some trigrams (and some other ngrams) and I would like to check whether the sentence starts or ends with any word from a list of specific words and, if so, remove that row from my df. For example:
import pandas as pd
df = pd.DataFrame({'Trigrams+': ['because of tuna', 'to your family', 'pay to you', 'give you in','happy birthday to you'], 'Count': [10,9,8,7,5]})
list_remove = ['of','in','to', 'a']
print(df)
               Trigrams+  Count
0        because of tuna     10
1         to your family      9
2             pay to you      8
3            give you in      7
4  happy birthday to you      5
I tried using strip, but in the example above the first row would come back as "because of tun" (strip removes characters, not whole words).
The output should look like this:

               Trigrams+  Count
0        because of tuna     10
1             pay to you      8
2  happy birthday to you      5
Can someone help me with that? Thanks in advance!
Try:
list_remove = ["of", "in", "to", "a"]
tmp = df["Trigrams+"].str.split()
df = df[~(tmp.str[0].isin(list_remove) | tmp.str[-1].isin(list_remove))]
print(df)
Prints:
               Trigrams+  Count
0        because of tuna     10
2             pay to you      8
4  happy birthday to you      5
You can try something like this:
import numpy as np

def func(x):
    y = x.split()[0]   # first word
    z = x.split()[-1]  # last word
    if (y in list_remove) or (z in list_remove):
        return np.nan  # mark the row so it can be dropped
    return x

df['Trigrams+'] = df['Trigrams+'].apply(lambda x: func(x))
df = df.dropna().reset_index(drop=True)

Pandas: how to find and concatenate values

I'm trying to replace and append some values in a pandas DataFrame. I have the following code:
import pandas as pd

df = pd.DataFrame({'A': ["va-lue", "value-%", "value"], 'B': [4, 5, 6]})
print(df)

df['A'] = df['A'].str.replace('%', '_0')
print(df)
df['A'] = df['A'].str.replace('-', '')
print(df)

# almost there?
df.A[df['A'].str.contains('-')] + "_0"
How can I find the cell values in column A that contain a '-' sign, replace the '-' with '', and add a trailing '_0' to those values? The resulting data set should look like this:
         A  B
0  value_0  4
1  value_0  5
2    value  6
You can first keep track of the rows whose A needs the trailing string appended, then perform the operations in two steps:
mask = df['A'].str.contains('-')
df['A'] = df['A'].str.replace('-|%', '', regex=True)  # drop both '-' and '%'
df.loc[mask, 'A'] += '_0'
print(df)
Output:
         A  B
0  value_0  4
1  value_0  5
2    value  6
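An equivalent single-step alternative with np.where (my sketch, not part of the original answer), starting again from the unmodified df: rows that contain '-' get the cleaned value plus the suffix, everything else is left as is.

import numpy as np

df['A'] = np.where(df['A'].str.contains('-'),
                   df['A'].str.replace('[-%]', '', regex=True) + '_0',
                   df['A'])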
