I want to remove the first letter if it is a C, and the last letter if it is an F or a W.
But when I use:
df1['trimmed_seq'] = df1['seq'].str.strip("CFW")
Input:
seq
0 CASSAQGTGDRGYTF
1 CASSLVATGNTGELFF
2 CASSKGTVSGLSG
3 CALKVGADTQYF
4 CASSLWASGRGGTGELFF
5 CASSLLGWEQLDEQFF
6 CASSSGTGVYGYTF
7 CASSPLEWEGVTEAFF
8 CASSFWSSGRGGTDTQYF
9 CASSAGQGASDEQFF
Output:
seq
0 ASSAQGTGDRGYT
1 ASSLVATGNTGEL
2 ASSKGTVSGLSG
3 ALKVGADTQY
4 ASSLWASGRGGTGEL
5 ASSLLGWEQLDEQ
6 ASSSGTGVYGYT
7 ASSPLEWEGVTEA
8 ASSFWSSGRGGTDTQY
9 ASSAGQGASDEQ
The problem I have is that, for example, in row 1 both F's at the end are removed, and if a sequence ended with CFW, all of those characters would be removed.
So my question is: can this be solved somehow using the same str.strip function?
This is not possible using strip, because it has no notion of a maximum number of characters to remove. So I would use replace and a regex to remove an optional prefix and an optional suffix:
df['seq'].str.replace(r'^C?(.*?)[FW]?$', r'\1')
It gives the expected result:
0 ASSAQGTGDRGYT
1 ASSLVATGNTGELF
2 ASSKGTVSGLSG
3 ALKVGADTQY
4 ASSLWASGRGGTGELF
5 ASSLLGWEQLDEQF
6 ASSSGTGVYGYT
7 ASSPLEWEGVTEAF
8 ASSFWSSGRGGTDTQY
9 ASSAGQGASDEQF
Name: seq, dtype: object
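For completeness, here is a self-contained sketch of that approach. Note that recent pandas versions require regex=True for str.replace to treat the pattern as a regular expression (the column name seq is taken from the question):

import pandas as pd

df1 = pd.DataFrame({"seq": ["CASSAQGTGDRGYTF", "CASSLVATGNTGELFF",
                            "CASSKGTVSGLSG"]})
# regex=True is required in pandas >= 2.0, where the default became literal
df1["trimmed_seq"] = df1["seq"].str.replace(r"^C?(.*?)[FW]?$", r"\1",
                                            regex=True)
print(df1["trimmed_seq"].tolist())
# ['ASSAQGTGDRGYT', 'ASSLVATGNTGELF', 'ASSKGTVSGLSG']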
You can use .loc to filter the rows that need trimming and .str to perform the string slicing:
mask = (df.seq.str[0] == 'C')
df.loc[mask, "seq"] = df.loc[mask, "seq"].str[1:]
mask = (df.seq.str[-1] == 'F') | (df.seq.str[-1] == 'W')
df.loc[mask, "seq"] = df.loc[mask, "seq"].str[:-1]
I am trying to display each of the characters along with the number of times it appears consecutively.
Input Specification
The first line of input contains the number N, which is the number of lines that follow. The next
N lines will contain at least one and at most 80 characters, none of which are spaces.
Output Specification
Output will be N lines. Line i of the output will be the encoding of the line i + 1 of the input.
The encoding of a line will be a sequence of pairs, separated by a space, where each pair is an
integer (representing the number of times the character appears consecutively) followed by a space,
followed by the character.
Sample Input
4
+++===!!!!
777777......TTTTTTTTTTTT
(AABBC)
3.1415555
Output for Sample Input
3 + 3 = 4 !
6 7 6 . 12 T
1 ( 2 A 2 B 1 C 1 )
1 3 1 . 1 1 1 4 1 1 4 5
Just use itertools.groupby and format each group as its length followed by its value, then join the elements:
import itertools

s = "+++===!!!! 777777......TTTTTTTTTTTT (AABBC) 3.1415555"
result = " ".join(["{} {}".format(sum(1 for _ in group), value)
                   for value, group in itertools.groupby(s)])
result:
3 + 3 = 4 ! 1 6 7 6 . 12 T 1 1 ( 2 A 2 B 1 C 1 ) 1 1 3 1 . 1 1 1 4 1 1 4 5
Without a key parameter, itertools.groupby just groups consecutive identical items, preserving order, so all that remains is to count them. Here I chose not to create a list merely to consume the group (len(list(group))) but to do sum(1 for _ in group) instead.
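Since the actual task encodes N separate input lines, here is a sketch wiring the same groupby idea to the stated input format (the encode helper is my own naming):

import itertools

def encode(line):
    # one "count char" pair per run of identical consecutive characters
    return " ".join("{} {}".format(sum(1 for _ in group), value)
                    for value, group in itertools.groupby(line))

for line in ["+++===!!!!", "777777......TTTTTTTTTTTT", "(AABBC)", "3.1415555"]:
    print(encode(line))
# 3 + 3 = 4 !
# 6 7 6 . 12 T
# 1 ( 2 A 2 B 1 C 1 )
# 1 3 1 . 1 1 1 4 1 1 4 5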
I'd do something like this:
s = "+++===!!!! 777777......TTTTTTTTTTTT (AABBC) 3.1415555"
d = {char: 0 for char in s}
for char in s:
    d[char] += 1
output = "".join([" {} {}".format(value, key) for key, value in d.items()])
# outputs: '3 + 3 = 4 ! 3   6 7 7 . 12 T 1 ( 2 A 2 B 1 C 1 ) 1 3 2 1 1 4 4 5'
Since it looks like you aren't counting total repeated characters, I would suggest reading the string backward. For each character, count how many times it appears as you iterate, and once you hit a different character, use the current count for the output. In fact, you can generate the output as you iterate through the string backward.
It might look something like this:
text = "+++===!!!!"     # the line to encode; avoid shadowing the builtin input
reverse = text[::-1]    # note: [-1:0] is an empty slice, [::-1] reverses
output = ''
count = 0
letter = reverse[0]
for k in range(len(reverse)):
    if reverse[k] == letter:
        count += 1
    else:
        # a new character starts: emit the finished run, then reset
        output = str(count) + ' ' + letter + ' ' + output
        letter = reverse[k]
        count = 1
# emit the final run
output = str(count) + ' ' + letter + ' ' + output
I have a dataframe with two columns. I want to know how many characters the values have in common; the number of common characters should go in a new column. Here's a minimal reproducible example.
What I have:
import pandas as pd
from string import ascii_lowercase
import numpy as np
df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase), 8))
                    for i in range(10)] for i in range(2)],
                  index=['col_1', 'col_2']).T
Out[17]:
col_1 col_2
0 ollcgfmy daeubsrx
1 jtvtqoux xbgtrzno
2 irwmoqqa mdblczfa
3 jyebzpyd xwlynkhw
4 ifuqojvs lxotbsju
5 fybsqbku xwbluaek
6 oylztnpf gelonsay
7 zdkibutk ujlcwhfu
8 uhrcjbsk nhxhpoii
9 eocxreqz muvfwusi
What I need (the numbers are random):
Out[19]:
col_1 col_2 common_letters
0 ollcgfmy daeubsrx 1
1 jtvtqoux xbgtrzno 1
2 irwmoqqa mdblczfa 0
3 jyebzpyd xwlynkhw 3
4 ifuqojvs lxotbsju 3
5 fybsqbku xwbluaek 3
6 oylztnpf gelonsay 3
7 zdkibutk ujlcwhfu 3
8 uhrcjbsk nhxhpoii 1
9 eocxreqz muvfwusi 3
EDIT: to anyone reading this who is trying to get the similarity between two strings, don't use this approach. Other similarity measures exist, such as Levenshtein or Jaccard.
Using df.apply and set operations is one way to solve the problem:
df["common_letters"] = df.apply(
lambda x: len(set(x["col_1"]).intersection(set(x["col_2"]))),
axis=1)
output:
col_1 col_2 common_letters
0 cgeabfem amnwfsde 4
1 vozgpmgs slfwvjnv 2
2 xyvktrfr jtzijmud 1
3 piexmmgh ydaxbmyo 2
4 iydpnwcu hhdxyptd 3
If you like sets you can go for:
df['common_letters'] = (df.col_1.apply(set).apply(len)
                        + df.col_2.apply(set).apply(len)
                        - (df.col_1 + df.col_2).apply(set).apply(len))
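This works by inclusion-exclusion on the character sets: len(A) + len(B) - len(A | B) equals len(A & B). A quick check on one pair from the sample:

a, b = set("ollcgfmy"), set("daeubsrx")
assert len(a) + len(b) - len(a | b) == len(a & b)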
You can use numpy:
df["noCommonChars"]=np.bitwise_and(df["col_1"].map(set), df["col_2"].map(set)).str.len()
Output:
col_1 col_2 noCommonChars
0 smuxucyw hywtedvz 2
1 bniuqhkh axcuukjg 2
2 ttzehrtl nbmsmwsc 0
3 ndwyjusu dssmdnvb 3
4 zqvsvych wguthcwu 2
5 jlnpjqgn xgedmodm 1
6 ocjbtnpy lywjqkjf 2
7 tolrpshi hslxxmgo 4
8 ehatmryw fhpvluvq 1
9 icciebte joyiwooi 1
Edit
In order to include repeating characters, you can do:
from collections import Counter
df["common_letters_full"]=np.bitwise_and(df["col_1"].map(Counter), df["col_2"].map(Counter))
df["common_letters"]=df["common_letters_full"].map(dict.values).apply(sum)
#alternatively:
df["common_letters"]=df["common_letters_full"].apply(pd.Series).sum(axis=1)
I can already see better answers :D, but here goes a kind of good one. I might have been able to use more from pandas:
I took some code from here
import pandas as pd
from string import ascii_lowercase
import numpy as np
def countPairs(s1, s2):
    n1 = len(s1)
    n2 = len(s2)

    # To store the frequencies of characters
    # of string s1 and s2
    freq1 = [0] * 26
    freq2 = [0] * 26

    # To store the count of valid pairs
    count = 0

    # Update the frequencies of
    # the characters of string s1
    for i in range(n1):
        freq1[ord(s1[i]) - ord('a')] += 1

    # Update the frequencies of
    # the characters of string s2
    for i in range(n2):
        freq2[ord(s2[i]) - ord('a')] += 1

    # Find the count of valid pairs
    for i in range(26):
        count += min(freq1[i], freq2[i])

    return count

# This code is contributed by Ryuga
df = pd.DataFrame([[''.join(np.random.choice(list(ascii_lowercase), 8))
                    for i in range(10)] for i in range(2)],
                  index=['col_1', 'col_2']).T
counts = []
for i in range(0, df.shape[0]):
    counts.append(countPairs(df.iloc[i].col_1, df.iloc[i].col_2))
df["counts"] = counts
col_1 col_2 counts
0 ploatffk dwenjpmc 1
1 gjjupyqg smqtlmzc 1
2 cgtxexho hvwhpyfh 1
3 mifsbfhc ufalhlbi 4
4 qnjesfdn lyhrrnkf 2
5 omnumzmf dagttzqo 2
6 gsygkrrb aocfoqxk 1
7 wrgvruuw ydnlzvyf 1
8 ivkdxoft zmgcnrjr 0
9 vvthbzjj mmirlcvx 1
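As the answer itself hints, the same per-row count can be expressed with more pandas and less code; a sketch combining apply with the Counter intersection shown in the previous answer:

from collections import Counter

df["counts"] = df.apply(
    lambda row: sum((Counter(row.col_1) & Counter(row.col_2)).values()),
    axis=1)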
I'm trying to extract tables from log files which are in .txt format. The file is loaded using read_csv() from pandas.
The log file looks like this:
aaa
bbb
ccc
=====================
A B C D E F
=====================
1 2 3 4 5 6
7 8 9 1 2 3
4 5 6 7 8 9
1 2 3 4 5 6
---------------------
=====================
G     H     I     J
=====================
1          3    4
5          6    7
---------------------
=====================
K    L    M    N    O
=====================
1              2    3
4    5              6
7    8         9
---------------------
xxx
yyy
zzz
Here are some points about the log file:
Files start and end with some lines of comment which can be ignored.
In the example above there are three tables.
Headers for each table are located between lines of "======..."
The end of each table is signified by a line of "------..."
My code as of now:
import pandas as pd
import itertools
df = pd.read_csv("xxx.txt", sep="\n", header=None)
# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21
for i in range(len(df.index)-2):
    # if loop to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))

        # while loop to find lines which are table rows & append to one list
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            r.append(df.iloc[i+x].str.split().tolist())
            x += 1
        r = list(itertools.chain(*r))

        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
This code returns AssertionError: 14 columns passed, passed data had 15 columns. I know this is because, for the table rows, I am using .str.split(), which by default splits on whitespace. Since some columns have missing values, the number of elements in the table header and the number of elements in the table rows do not match for the second and third tables. I am struggling to get around this, since the number of whitespace characters signifying missing values differs for each table.
My question is: is there a way to account for missing values in some of the columns, so that I can get a DataFrame as output where there are either null or NaN for missing values as appropriate?
Using Victor Ruiz's method, I added if branches to handle different header sizes.
=^..^=
Description in code:
import re
import pandas as pd
import itertools
df = pd.read_csv("stack.txt", sep="\n", header=None)
# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21
for i in range(len(df.index)-2):
    # if loop to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))

        # get header string
        head = df.iloc[i+1].to_string()

        # get space distance in header
        space_range = 0
        for result in re.findall('([ ]*)', head):
            if len(result) > 0:
                space_range = len(result)

        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break

            # strip line
            line = df.iloc[i+x].to_string()[5::]

            # collect items based on elements distance
            items = []
            for result in re.finditer(r'(\d+)([ ]*)', line):
                item, delimiter = result.groups()
                items.append(item)
                if len(delimiter) > space_range*2+1:
                    items.append('NaN')
                    items.append('NaN')
                if len(delimiter) < space_range*2+2 and len(delimiter) > space_range:
                    items.append('NaN')

            r.append([items])
            x += 1
        r = list(itertools.chain(*r))

        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
Output:
A B C D E F
0 1 2 3 4 5 6
1 7 8 9 1 2 3
2 4 5 6 7 8 9
3 1 2 3 4 5 6
G H I J
0 1 NaN 3 4
1 5 NaN 6 7
K L M N O
0 1 NaN NaN 2 3
1 4 5 NaN NaN 6
2 7 8 NaN 9 None
Maybe this can help you.
Suppose we have the next line of text:
1          3    4
The problem is to identify how many spaces delimit two consecutive items when no value is missing between them. Let's consider that 5 spaces act as a delimiter, and more than 5 indicate a missing value.
You can use regex to parse the items:
from math import nan  # nan was used but never defined in the original snippet
from re import finditer

line = '1          3    4'
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    if len(delimiter) > 5:
        items.append(nan)
print(items)
Output is:
['1', nan, '3', '4']
A more complex situation arises if two or more consecutive missing values can appear (the code above will inject only one nan).
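One hedged way to handle that, assuming a hypothetical fixed column slot of 5 characters, is to derive the number of missing values from the delimiter length instead of appending a single nan:

from math import nan
from re import finditer

line = '1               3    4'   # wide gap hiding two missing values
unit = 5                          # assumed width of one column slot
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    # every extra `unit` of spaces beyond a plain separator is one missing value
    items.extend([nan] * max(0, (len(delimiter) - 1) // unit))
print(items)  # ['1', nan, nan, '3', '4']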
Assume a simple dataframe, for example:
A B
0 1 0.810743
1 2 0.595866
2 3 0.154888
3 4 0.472721
4 5 0.894525
5 6 0.978174
6 7 0.859449
7 8 0.541247
8 9 0.232302
9 10 0.276566
How can I retrieve an index value of a row, given a condition?
For example:
dfb = df[df['A']==5].index.values.astype(int)
returns [4], but what I would like to get is just 4. This causes me trouble later in the code.
Based on some conditions, I want to have a record of the indexes where that condition is fulfilled, and then select rows between.
I tried
dfb = df[df['A']==5].index.values.astype(int)
dfbb = df[df['A']==8].index.values.astype(int)
df.loc[dfb:dfbb,'B']
for the desired output
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
but I get TypeError: '[4]' is an invalid key
The easiest fix is to add [0] to select the first value of the one-element array:
dfb = df[df['A']==5].index.values.astype(int)[0]
dfbb = df[df['A']==8].index.values.astype(int)[0]
Or:
dfb = int(df[df['A']==5].index[0])
dfbb = int(df[df['A']==8].index[0])
But if some value has no match, an error is raised, because the first value does not exist. The solution is to use next with iter to get a default value if nothing matches:
dfb = next(iter(df[df['A']==5].index), 'no match')
print (dfb)
4
dfb = next(iter(df[df['A']==50].index), 'no match')
print (dfb)
no match
Then it seems you need to subtract 1:
print (df.loc[dfb:dfbb-1,'B'])
4 0.894525
5 0.978174
6 0.859449
Name: B, dtype: float64
Another solution with boolean indexing or query:
print (df[(df['A'] >= 5) & (df['A'] < 8)])
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
print (df.loc[(df['A'] >= 5) & (df['A'] < 8), 'B'])
4 0.894525
5 0.978174
6 0.859449
Name: B, dtype: float64
print (df.query('A >= 5 and A < 8'))
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
To answer the original question on how to get the index as an integer for the desired selection, the following will work :
df[df['A']==5].index.item()
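Note that .item() insists on the selection containing exactly one element, so an ambiguous or empty selection fails loudly instead of silently picking a value:

idx = df[df['A'] == 5].index.item()   # 4
# df[df['A'] > 5].index.item()        # raises ValueError (not exactly one element)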
A little summary for searching by row:
This can be useful if you don't know the column values or if columns have non-numeric values
If you want to get the index number as an integer, you can also do:
item = df[4:5].index.item()
print(item)
4
It also works with numpy arrays / lists:
numpy = df[4:7].index.to_numpy()[0]
lista = df[4:7].index.to_list()[0]
In [x] you pick the position within the slice [4:7]; for example, if you want 6:
numpy = df[4:7].index.to_numpy()[2]
print(numpy)
6
for DataFrame:
df[4:7]
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
or:
df[(df.index>=4) & (df.index<7)]
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
The nature of wanting to include the row where A == 5 and all rows up to but not including the row where A == 8 means we will end up using iloc (loc includes both ends of the slice).
In order to get the index labels we use idxmax. This will return the first position of the maximum value. I run this on a boolean series where A == 5 (then when A == 8) which returns the index value of when A == 5 first happens (same thing for A == 8).
Then I use searchsorted to find the ordinal position of where the index label (that I found above) occurs. This is what I use in iloc.
i5, i8 = df.index.searchsorted([df.A.eq(5).idxmax(), df.A.eq(8).idxmax()])
df.iloc[i5:i8]
numpy
You can further enhance this by using the underlying numpy objects and the analogous numpy functions. I wrapped it up into a handy function.
def find_between(df, col, v1, v2):
    vals = df[col].values
    mx1, mx2 = (vals == v1).argmax(), (vals == v2).argmax()
    idx = df.index.values
    i1, i2 = idx.searchsorted([mx1, mx2])
    return df.iloc[i1:i2]

find_between(df, 'A', 5, 8)
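On the sample frame from the question this returns the three desired rows:

print(find_between(df, 'A', 5, 8))
#    A         B
# 4  5  0.894525
# 5  6  0.978174
# 6  7  0.859449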
Or you can add a for loop
for i in dfb:
    dfb = i
for j in dfbb:
    dfbb = j
This way the element 4 is taken out of the list.
I have this strange problem while following a reference. This code:
for r in range(10):
    for c in range(r):
        print "",
    for c in range(10-r):
        print c,
    print ""
should print out something like this:
0 1 2 3 4 5 6 7 8 9
  0 1 2 3 4 5 6 7 8
    0 1 2 3 4 5 6 7
      0 1 2 3 4 5 6
        0 1 2 3 4 5
          0 1 2 3 4
            0 1 2 3
              0 1 2
                0 1
                  0
but Instead I am getting this:
0 1 2 3 4 5 6 7 8 9
 0 1 2 3 4 5 6 7 8
  0 1 2 3 4 5 6 7
   0 1 2 3 4 5 6
    0 1 2 3 4 5
     0 1 2 3 4
      0 1 2 3
       0 1 2
        0 1
         0
Can anyone explain to me what is causing the ragged indent on the right side? It seems so simple, but I have no clue how to fix it.
You were printing the leading spaces incorrectly. Printing empty quotes ("") prints only a single space, while print c prints c plus a trailing space, i.e. two characters. You should print " " instead to get the matching two-character spacing. This is a very subtle thing to notice.
for r in range(10):
    for c in range(r):
        print " ",  # print it here
    for c in range(10-r):
        print c,
    print ""
If you want to format it just so, it might be better to let Python do it for you instead of counting explicit and hidden implicit spaces. See the string formatting docs for what {:^19} means and more.
for i in range(10):
    nums = ' '.join(str(x) for x in range(10 - i))
    #print '{:^19}'.format(nums)  # reproduces your "broken" code
    print '{:>19}'.format(nums)   # your desired output
Using the print function is a good alternative sometimes, as you can eliminate hidden spaces by setting the keyword argument end to an empty string:
from __future__ import print_function # must be at the top of the file.
# ...
print(x, end='')
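For instance, the whole triangle rewritten with the print function and an explicit two-space indent per row (a sketch of the same logic, runnable on Python 2 or 3):

from __future__ import print_function  # only needed on Python 2

for r in range(10):
    print('  ' * r, end='')   # two spaces of indent per row
    for c in range(10 - r):
        print(c, end=' ')
    print()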
You are simply not creating enough indent on the left side (there is no such thing as right side indent while printing).
For every new line you want to increase the indent by two spaces, because the line above it has one extra number plus a whitespace. print "", automatically adds one whitespace (this is why there are whitespaces between the numbers). Since you need to add two, simply add a whitespace within the quotes, like this: " ",.
The extra whitespace fills the space of the number in the line above; the comma in "", only fills the space between the numbers. To clarify: " ", uses the same space as c, (two characters), while "", only uses one.
Here is your code with the small fix:
for r in range(10):
    for c in range(r):
        print " ",  # added a whitespace here for correct indent
    for c in range(10-r):
        print c,
    print ""