Is stripping string by '\r\n ' necessary in Python? - python

In Java, it's necessary to handle \r\n explicitly; see, e.g., the question "split( "\r\n") is not splitting my string in java".
But is \r\n necessary in Python? Is the following true?
str.strip() == str.strip('\r\n ')
From the docs:
Return a copy of the string with the leading and trailing characters
removed. The chars argument is a string specifying the set of
characters to be removed. If omitted or None, the chars argument
defaults to removing whitespace. The chars argument is not a prefix or
suffix; rather, all combinations of its values are stripped
Judging from this CPython test, str.strip() seems to strip:
\t\n\r\f\v
Can anyone point me to the code in CPython that does the string stripping?

Are you looking for these lines?
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Objects/unicodeobject.c#L12222-L12247
#define LEFTSTRIP 0
#define RIGHTSTRIP 1
#define BOTHSTRIP 2
/* Arrays indexed by above */
static const char *stripfuncnames[] = {"lstrip", "rstrip", "strip"};
#define STRIPNAME(i) (stripfuncnames[i])
/* externally visible for str.strip(unicode) */
PyObject *
_PyUnicode_XStrip(PyObject *self, int striptype, PyObject *sepobj)
{
    void *data;
    int kind;
    Py_ssize_t i, j, len;
    BLOOM_MASK sepmask;
    Py_ssize_t seplen;
    if (PyUnicode_READY(self) == -1 || PyUnicode_READY(sepobj) == -1)
        return NULL;
    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    len = PyUnicode_GET_LENGTH(self);
    seplen = PyUnicode_GET_LENGTH(sepobj);
    sepmask = make_bloom_mask(PyUnicode_KIND(sepobj),
                              PyUnicode_DATA(sepobj),
                              seplen);
    i = 0;
    if (striptype != RIGHTSTRIP) {
        while (i < len) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            if (!BLOOM(sepmask, ch))
                break;
            if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
                break;
            i++;
        }
    }
    j = len;
    if (striptype != LEFTSTRIP) {
        j--;
        while (j >= i) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, j);
            if (!BLOOM(sepmask, ch))
                break;
            if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
                break;
            j--;
        }
        j++;
    }
    return PyUnicode_Substring(self, i, j);
}

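Essentially, for ASCII text:
str.strip() == str.strip(string.whitespace) == str.strip(' \t\n\r\f\v') != str.strip('\r\n')
(With no argument, str.strip() also removes other Unicode whitespace, such as '\xa0', which string.whitespace does not contain.)
Unless you are explicitly trying to remove ONLY carriage returns and newlines, str.strip() and str.strip('\r\n') behave differently.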
>>> '\nfoo\n'.strip()
'foo'
>>> '\nfoo\n'.strip('\r\n')
'foo'
>>> '\r\n\r\n\r\nfoo\r\n\r\n\r\n'.strip()
'foo'
>>> '\r\n\r\n\r\nfoo\r\n\r\n\r\n'.strip('\r\n')
'foo'
>>> '\n\tfoo\t\n'.strip()
'foo'
>>> '\n\tfoo\t\n'.strip('\r\n')
'\tfoo\t'
This all seems fine, but note that if there is whitespace (or any other character) between a newline and the start or end of a string, .strip('\r\n') won't remove the newline.
>>> '\t\nfoo\n\t'.strip()
'foo'
>>> '\t\nfoo\n\t'.strip('\r\n')
'\t\nfoo\n\t'
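The comparison above can be put side by side in a short script. This is only a sketch: the helper name strip_variants and the sample strings are made up for illustration, and the last sample uses an em space (U+2003) to show where the default differs from string.whitespace.

```python
import string


def strip_variants(text):
    """Return the results of three ways of stripping the same string."""
    return {
        "default": text.strip(),                           # all Unicode whitespace
        "newlines_only": text.strip("\r\n"),               # only CR and LF
        "ascii_whitespace": text.strip(string.whitespace), # ' \t\n\r\x0b\x0c'
    }


for sample in ["\nfoo\n", "\n\tfoo\t\n", "\t\nfoo\n\t", "\u2003foo\u2003"]:
    print(repr(sample), strip_variants(sample))
```

The three variants agree only when the leading and trailing characters are all \r or \n.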

Related

Count of string2 in string1 not working in C, but works in Python

The problem itself is simple. I have to count the number of occurrences of s2 in s1.
The length of s2 is always 2. I tried to implement it in C, but it did not work even though I believe the logic is correct. So I tried the same logic in Python and it works perfectly. Can someone explain why? Or did I do something wrong in C? Both versions are given below.
C
#include<stdio.h>
#include<string.h>
int main()
{
    char s1[100],s2[2];
    int count = 0;
    gets(s1);
    gets(s2);
    for(int i=0;i<strlen(s1);i++)
    {
        if(s1[i] == s2[0] && s1[i+1] == s2[1])
        {
            count++;
        }
    }
    printf("%d",count);
    return 0;
}
Python
s1 = input()
s2 = input()
count = 0
for i in range(0,len(s1)):
    if(s1[i] == s2[0] and s1[i+1] == s2[1]):
        count = count+1
print(count)
Your Python code is actually incorrect: it raises an IndexError if the last character of s1 matches the first character of s2, because s1[i+1] is then out of range.
You have to stop iterating at the second-to-last character of s1.
Here is a generic solution that works for any length of s2:
s1 = 'abaccabaabaccca'
s2 = 'aba'
count = 0
for i in range(len(s1)-len(s2)+1):
    if s2 == s1[i:i+len(s2)]:
        count += 1
print(count)
Output: 3
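The same count can also be obtained by repeatedly calling str.find() and restarting one character after each hit, which avoids manual index arithmetic. This is only a sketch, and the function name count_overlapping is made up for illustration. Note that the built-in str.count() counts non-overlapping matches, so it can give a smaller number.

```python
def count_overlapping(haystack, needle):
    """Count occurrences of needle in haystack, including overlapping ones."""
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        # Restart one character later so overlapping matches are found too.
        start = haystack.find(needle, start + 1)
    return count


print(count_overlapping('abaccabaabaccca', 'aba'))
print(count_overlapping('aaaa', 'aa'), 'aaaa'.count('aa'))
```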
First, as others have pointed out, do not use gets(); use fgets() instead. Also note that s2[2] has no room for two characters plus the terminating null byte, so that buffer needs to be larger. Otherwise your logic is correct, but when you read the input with fgets(), the newline character is included in the string.
If you were to input test and es, your strings would contain test\n and es\n (each followed by the terminating null byte \0). You would then be searching the string test\n for the substring es\n, which it does not contain. So you must first remove the newline character from, at least, the substring you want to search for, which you can do with strcspn() to give you es.
Once the trailing newline (\n) has been replaced with a terminating null byte, you can search the string for occurrences.
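#include <stdio.h>
#include <string.h>
int main(void) {
    char s1[100], s2[4];
    int count = 0;
    /* fgets() bounds each read; stop if either read fails. */
    if (!fgets(s1, sizeof s1, stdin) || !fgets(s2, sizeof s2, stdin))
        return 1;
    /* Replace the trailing newline, if any, with a terminating null byte. */
    s1[strcspn(s1, "\n")] = '\0';
    s2[strcspn(s2, "\n")] = '\0';
    size_t n = strlen(s1);
    /* i + 1 < n avoids the unsigned underflow of strlen(s1) - 1 on empty input. */
    for (size_t i = 0; i + 1 < n; i++) {
        if (s1[i] == s2[0] && s1[i+1] == s2[1]) {
            count++;
        }
    }
    printf("%d\n", count);
    return 0;
}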

How to replace string in order

I want to replace some values in order.
For example, below is a sample XPath:
/MCCI_IN200100UV01[#ITSVersion='XML_1.0'][#xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']
/PORR_IN049016UV[r]/controlActProcess[#classCode='CACT']
[#moodCode='EVN']/subject[#typeCode='SUBJ'][1]/investigationEvent[#classCode='INVSTG']
[#moodCode='EVN']/outboundRelationship[#typeCode='SPRT'][relatedInvestigation/code[#code='2']
[#codeSystem='2.16.840.1.113883.3.989.2.1.1.22']][r]/relatedInvestigation[#classCode='INVSTG']
[#moodCode='EVN']/subjectOf2[#typeCode='SUBJ']/controlActEvent[#classCode='CACT']
[#moodCode='EVN']/author[#typeCode='AUT']/assignedEntity[#classCode='ASSIGNED']/assignedPerson[#classCode='PSN']
[#determinerCode='INSTANCE']/name/prefix[1]/#nullFlavor",
I would like to find each [r] in order and replace them with [0] through [n], depending on the number of elements.
How can I replace [r]?
const txt = `/MCCI_IN200100UV01[#ITSVersion='XML_1.0'][#xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']
/PORR_IN049016UV[r]/controlActProcess[#classCode='CACT']
[#moodCode='EVN']/subject[#typeCode='SUBJ'][1]/investigationEvent[#classCode='INVSTG']
[#moodCode='EVN']/outboundRelationship[#typeCode='SPRT'][relatedInvestigation/code[#code='2']
[#codeSystem='2.16.840.1.113883.3.989.2.1.1.22']][r]/relatedInvestigation[#classCode='INVSTG']
[#moodCode='EVN']/subjectOf2[#typeCode='SUBJ']/controlActEvent[#classCode='CACT']
[#moodCode='EVN']/author[#typeCode='AUT']/assignedEntity[#classCode='ASSIGNED']/assignedPerson[#classCode='PSN']
[#determinerCode='INSTANCE']/name/prefix[1]/#nullFlavor",`;
const count = (txt.match(/\[r\]/g) || []).length; // count occurrences using RegExp
let replacements = []; // set replacement values in order
switch (count) {
  case 0:
    break;
  case 1:
    replacements = ["a"];
    break;
  case 2:
    replacements = ["___REPLACEMENT_1___", "___REPLACEMENT_2___"];
    break;
  case 3:
    replacements = ["d", "e", "f"];
    break;
}
let out = txt; // output variable
for (let i = 0; i < count; i++) {
  // With a string pattern, replace() changes only the first occurrence,
  // so each pass consumes the next [r].
  out = out.replace("[r]", replacements[i]);
}
console.log(out);
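In Python, use str.replace(). For example:
>>> 'test[r]test'.replace('[r]', '[0]')
'test[0]test'
Note that this replaces every occurrence; pass 1 as the third argument (count) to replace one occurrence at a time. See the str.replace() documentation for details.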
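To number the placeholders in order, one option is re.sub() with a callback that draws from a counter. This is a minimal sketch rather than the only approach; the function name number_placeholders and the short sample path are made up for illustration.

```python
import itertools
import re


def number_placeholders(text, placeholder="[r]"):
    """Replace each occurrence of placeholder with [0], [1], ... in order."""
    counter = itertools.count()
    # re.escape() makes the brackets literal; the callback runs once per match.
    return re.sub(re.escape(placeholder), lambda match: f"[{next(counter)}]", text)


print(number_placeholders("/a[r]/b[1]/c[r]/d"))
```

Existing numeric indices such as [1] are left alone because only the literal placeholder is matched.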

Carriage return in raw python string

For some experiments with syntax highlighting, I create the following raw string in Python 3.6 (please note that the string itself contains a snippet of C-code, but that's not important right now):
myCodeSample = r"""#include <stdio.h>
int main()
{
char arr[5] = {'h', 'e', 'l', 'l', 'o'};
int i;
for(i = 0; i < 5; i++) {
printf(arr[i]);
}
return 0;
}"""
I have noticed that each line ends in a Unix-style \n newline character. But I actually want - for the sake of my experiment - to have every line ending in the Windows-style \r\n newline character. Is there a way to do this elegantly?
Just define your string as you're doing, but call replace() on the literal:
myCodeSample = r"""#include <stdio.h>
int main()
{
char arr[5] = {'h', 'e', 'l', 'l', 'o'};
int i;
for(i = 0; i < 5; i++) {
printf(arr[i]);
}
return 0;
}""".replace("\n","\r\n")

How to Store Column Values in a Variable

I am dealing with a tab-separated file that contains multiple columns. Each column contains roughly 3000 records.
Column1 Column2 Column3 Column4
1000041 11657 GenNorm albumin
1000043 24249 GenNorm CaBP
1000043 29177 GenNorm calcium-binding protein
1000045 2006 GenNorm tropoelastin
Problem: Using Python, how can I read the tab-separated file, store each column (with its records) in a single variable, and use print to output specific columns?
Preliminary code: this is what I have so far to read the TSV file:
import csv
Dictionary1 = {}
with open("sample.txt", 'r') as samplefile:
    reader = csv.reader(samplefile, delimiter="\t")
I think you're just asking how to "transpose" a CSV file from a sequence of rows to a sequence of columns.
In Python, you can always transpose any iterable of iterables by using the zip function:
with open("sample1.txt") as samplefile:
reader = csv.reader(samplefile, delimiter="\t")
columns = zip(*reader)
Now, if you want to print each column in order:
for column in columns:
    print(column)
Here, columns is an iterator of tuples. If you want some other format, like a dict mapping the column names to a list of values, you can transform it easily. For example:
columns = {column[0]: list(column[1:]) for column in columns}
Or, if you want to put them in four separate variables, you can just use normal tuple unpacking:
col1, col2, col3, col4 = columns
But there doesn't seem to be a very good reason to do that.
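Putting the pieces together, a self-contained sketch might look like the following. It reads from an in-memory copy of the sample rows instead of a file so that it can be run as-is; the names SAMPLE and read_columns are made up for illustration.

```python
import csv
import io

# In-memory stand-in for the tab-separated file from the question.
SAMPLE = (
    "Column1\tColumn2\tColumn3\tColumn4\n"
    "1000041\t11657\tGenNorm\talbumin\n"
    "1000043\t24249\tGenNorm\tCaBP\n"
    "1000043\t29177\tGenNorm\tcalcium-binding protein\n"
    "1000045\t2006\tGenNorm\ttropoelastin\n"
)


def read_columns(handle):
    """Map each header name to the list of values in that column."""
    reader = csv.reader(handle, delimiter="\t")
    # zip(*rows) transposes rows into columns; the first item is the header.
    return {column[0]: list(column[1:]) for column in zip(*reader)}


columns = read_columns(io.StringIO(SAMPLE))
print(columns["Column4"])
```

With a real file, the same function can be called as read_columns(samplefile) inside the with block.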
I'm not sure what the code would be in Python, but the idea is a loop like the one below (C++). Once everything is stored in the dictionary, use the loop to split it into columns, then call the function to look up an index and print the matching record. You can modify the function to suit whatever key you want, for example by passing in a word to search for.
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<string> arrColumn1, arrColumn2, arrColumn3, arrColumn4;
// Assumes `dictionary` holds every field of the file in reading order,
// four fields per row; each field is appended to the column it belongs to.
void splitColumns(const vector<string>& dictionary) {
    int mainCounter = 0;  // which column the next field belongs to
    for (size_t i = 0; i < dictionary.size(); ++i) {
        switch (mainCounter) {
        case 0:
            arrColumn1.push_back(dictionary[i]);
            ++mainCounter;
            break;
        case 1:
            arrColumn2.push_back(dictionary[i]);
            ++mainCounter;
            break;
        case 2:
            arrColumn3.push_back(dictionary[i]);
            ++mainCounter;
            break;
        case 3:
            arrColumn4.push_back(dictionary[i]);
            mainCounter = 0;
            break;
        }
    }
}
// Search column colToSearch (1-4) for findThis and print the whole record.
void printRecordFunction(int colToSearch, const string& findThis,
                         const vector<string>& arr1, const vector<string>& arr2,
                         const vector<string>& arr3, const vector<string>& arr4) {
    const vector<string>* columns[] = {&arr1, &arr2, &arr3, &arr4};
    if (colToSearch < 1 || colToSearch > 4)
        return;
    const vector<string>& column = *columns[colToSearch - 1];
    for (size_t i = 0; i < column.size(); ++i) {
        if (column[i] == findThis) {
            cout << "Record: " << arr1[i] << " " << arr2[i] << " " << arr3[i] << " " << arr4[i] << endl;
            return;
        }
    }
}
Sorry, this is all fairly hard-coded, but I hope it gives you some idea and you can adjust it.

appending 4 bytes to byte array

I'm trying to translate this code from C into Python:
} else if(remaining == 3) {
firstB = (BYTE*)&buf[0];
*firstB ^= 0x12;
firstW = (WORD*)&buf[1];
*firstW ^= 0x1234;
i = 3;
}
for(; i<len;)
{
then = (DWORD*)&buf[i];
*then ^= 0x12345678;
i += 4;
}
What I got:
elif remaining == 3:
    new_packet.append(struct.unpack('<B', packet_data[0:1])[0] ^ 0x12)
    new_packet.append(struct.unpack('<H', packet_data[1:3])[0] ^ 0x1234)
    i = 3
while i < packet_len:
    new_packet.append(struct.unpack('<L', packet_data[i:i+4])[0] ^ 0x12345678)
    i += 4
return new_packet
The problem is that I always get ValueError: byte must be in range(0, 256).
So I must be translating this wrong. What am I missing, and is there a way to make this more efficient? Why is the Python code wrong?
update
new_bytes = struct.unpack('<H', packet_data[1:3])
new_packet.append(new_bytes[0] ^ 0x1234)
I get the first few bytes right with the code above, but nothing comes out right with the code below:
new_bytes = struct.unpack('<BB', packet_data[1:3])
new_packet.append(new_bytes[0] ^ 0x12)
new_packet.append(new_bytes[1] ^ 0x34)
So my problem still remains inside the while loop, and the question remains how to do this right:
new_bytes = struct.unpack('<L', packet_data[i:i+4])
new_packet.append(new_bytes[0] ^ 0x12345678)
This line
new_packet.append(struct.unpack('<H', packet_data[1:3])[0] ^ 0x1234)
tries to append a two-byte value to the byte array. One fix is to append the two bytes of the word separately:
# Little-endian, so the first byte is low byte of the word.
new_bytes = struct.unpack('BB', packet_data[1:3])
new_packet.append(new_bytes[0] ^ 0x34)
new_packet.append(new_bytes[1] ^ 0x12)
# Similarly for the 4-byte value
new_bytes = struct.unpack('BBBB', packet_data[i:i+4])
new_packet.append(new_bytes[0] ^ 0x78)
new_packet.append(new_bytes[1] ^ 0x56)
new_packet.append(new_bytes[2] ^ 0x34)
new_packet.append(new_bytes[3] ^ 0x12)
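An alternative to splitting the key by hand is to XOR the whole little-endian value and convert it back to bytes before extending the bytearray. The sketch below covers only the remaining == 3 branch shown in the question; the helper names xor_le and xor_remaining_3 are made up for illustration, and int.from_bytes()/int.to_bytes() are used in place of struct.

```python
def xor_le(chunk, key):
    """XOR a little-endian chunk with the low bytes of key; return bytes."""
    # Mask the key to the chunk length so a short final chunk cannot overflow.
    mask = (1 << (8 * len(chunk))) - 1
    value = int.from_bytes(chunk, "little") ^ (key & mask)
    return value.to_bytes(len(chunk), "little")


def xor_remaining_3(packet_data):
    """Mirror the C branch for remaining == 3: one byte, one word, then dwords."""
    new_packet = bytearray()
    new_packet += xor_le(packet_data[0:1], 0x12)
    new_packet += xor_le(packet_data[1:3], 0x1234)
    i = 3
    while i < len(packet_data):
        new_packet += xor_le(packet_data[i:i + 4], 0x12345678)
        i += 4
    return new_packet


print(xor_remaining_3(bytes(range(7))).hex())
```

Using += with a bytes object appends all of its bytes at once, whereas bytearray.append() accepts only a single integer in range(0, 256), which is what triggers the ValueError.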
