Carriage return in raw python string - python

For some experiments with syntax highlighting, I create the following raw string in Python 3.6 (please note that the string itself contains a snippet of C-code, but that's not important right now):
myCodeSample = r"""#include <stdio.h>
int main()
{
char arr[5] = {'h', 'e', 'l', 'l', 'o'};
int i;
for(i = 0; i < 5; i++) {
printf(arr[i]);
}
return 0;
}"""
I have noticed that each line ends in a Unix-style \n newline character. But I actually want - for the sake of my experiment - to have every line ending in the Windows-style \r\n newline character. Is there a way to do this elegantly?

Just define your string as you're doing, but call replace on the literal:
myCodeSample = r"""#include <stdio.h>
int main()
{
char arr[5] = {'h', 'e', 'l', 'l', 'o'};
int i;
for(i = 0; i < 5; i++) {
printf(arr[i]);
}
return 0;
}""".replace("\n","\r\n")
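If the text might already contain some \r\n line endings, a plain replace would turn them into \r\r\n. A minimal sketch of a variant that is safe to apply more than once, assuming a helper name to_crlf that is made up here and not part of the original answer:

```python
import re

def to_crlf(text):
    # "\r?\n" matches a bare LF as well as an existing CRLF,
    # so lines that are already converted are left unchanged.
    return re.sub(r"\r?\n", "\r\n", text)

sample = r"""int main()
{
    return 0;
}"""

converted = to_crlf(sample)
print(repr(converted))
```

For a literal that you know contains only \n, the one-line replace above is simpler and does the same job.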

Related

Count of string2 in string1 not working in C, but works in Python

The problem itself is simple: I have to count the number of occurrences of s2 in s1.
The length of s2 is always 2. I tried to implement it in C, but it did not work even though I believe the logic is correct. So I tried the same logic in Python and it works perfectly. Can someone explain why? Or did I do something wrong in C? I have given both programs below.
C
#include<stdio.h>
#include<string.h>
int main()
{
    char s1[100],s2[2];
    int count = 0;
    gets(s1);
    gets(s2);
    for(int i=0;i<strlen(s1);i++)
    {
        if(s1[i] == s2[0] && s1[i+1] == s2[1])
        {
            count++;
        }
    }
    printf("%d",count);
    return 0;
}
Python
s1 = input()
s2 = input()
count = 0
for i in range(0,len(s1)):
    if(s1[i] == s2[0] and s1[i+1] == s2[1]):
        count = count+1
print(count)
Your Python code is actually incorrect: it raises an IndexError if the last character of s1 matches the first character of s2, because s1[i+1] is then out of range.
You have to stop iterating at the second-to-last character of s1.
Here is a generic solution that works for any length of s2:
s1 = 'abaccabaabaccca'
s2 = 'aba'
count = 0
for i in range(len(s1)-len(s2)+1):
    if s2 == s1[i:i+len(s2)]:
        count += 1
print(count)
output: 3
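Note that the built-in str.count() counts non-overlapping occurrences only, so it is not always a drop-in replacement for the loop above. A small sketch of an overlapping count built on str.find(); the function name count_overlapping is made up for this illustration:

```python
def count_overlapping(haystack, needle):
    # An empty needle would match at every position; reject it explicitly.
    if not needle:
        raise ValueError("needle must not be empty")
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        # Resume one character after the last hit so overlaps are found too.
        start = haystack.find(needle, start + 1)
    return count

print(count_overlapping('abaccabaabaccca', 'aba'))          # 3
print('aaaa'.count('aa'), count_overlapping('aaaa', 'aa'))  # 2 3
```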
First, as others have pointed out, you do not want to use gets(); use fgets() instead. Otherwise your logic is correct, but when fgets() reads the input, the newline character is included in the string.
If you were to input test and es, your strings would contain test\n and es\n (each followed by the terminating null byte \0). This leads to you searching the string test\n for the substring es\n, which it will not find. So you must first remove the newline character from, at least, the substring you want to search for, which you can do with strcspn() to give you es.
Once the trailing newline (\n) has been replaced with a terminating null byte, you can search the string for occurrences.
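#include <stdio.h>
#include <string.h>

int main(void) {
    /* s2 needs room for two characters, a newline and the terminating '\0' */
    char s1[100], s2[4];
    int count = 0;

    if (!fgets(s1, sizeof s1, stdin) || !fgets(s2, sizeof s2, stdin)) {
        return 1; /* input ended early */
    }
    /* replace the trailing newline, if any, with a terminating null byte */
    s1[strcspn(s1, "\n")] = '\0';
    s2[strcspn(s2, "\n")] = '\0';

    /* i + 1 < len avoids the unsigned wrap-around of strlen(s1) - 1 when s1 is empty */
    size_t len = strlen(s1);
    for (size_t i = 0; i + 1 < len; i++) {
        if (s1[i] == s2[0] && s1[i + 1] == s2[1]) {
            count++;
        }
    }
    printf("%d\n", count);
    return 0;
}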

How to replace string in order

I want to replace some values in order.
For example, below is a sample XPath:
/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']
/PORR_IN049016UV[r]/controlActProcess[@classCode='CACT']
[@moodCode='EVN']/subject[@typeCode='SUBJ'][1]/investigationEvent[@classCode='INVSTG']
[@moodCode='EVN']/outboundRelationship[@typeCode='SPRT'][relatedInvestigation/code[@code='2']
[@codeSystem='2.16.840.1.113883.3.989.2.1.1.22']][r]/relatedInvestigation[@classCode='INVSTG']
[@moodCode='EVN']/subjectOf2[@typeCode='SUBJ']/controlActEvent[@classCode='CACT']
[@moodCode='EVN']/author[@typeCode='AUT']/assignedEntity[@classCode='ASSIGNED']/assignedPerson[@classCode='PSN']
[@determinerCode='INSTANCE']/name/prefix[1]/@nullFlavor
I would like to find each [r] in order and replace them with [0] to [n], depending on the number of occurrences.
How can I replace [r]?
const txt = `/MCCI_IN200100UV01[@ITSVersion='XML_1.0'][@xsi:schemaLocation='urn:hl7-org:v3 MCCI_IN200100UV01.xsd']
/PORR_IN049016UV[r]/controlActProcess[@classCode='CACT']
[@moodCode='EVN']/subject[@typeCode='SUBJ'][1]/investigationEvent[@classCode='INVSTG']
[@moodCode='EVN']/outboundRelationship[@typeCode='SPRT'][relatedInvestigation/code[@code='2']
[@codeSystem='2.16.840.1.113883.3.989.2.1.1.22']][r]/relatedInvestigation[@classCode='INVSTG']
[@moodCode='EVN']/subjectOf2[@typeCode='SUBJ']/controlActEvent[@classCode='CACT']
[@moodCode='EVN']/author[@typeCode='AUT']/assignedEntity[@classCode='ASSIGNED']/assignedPerson[@classCode='PSN']
[@determinerCode='INSTANCE']/name/prefix[1]/@nullFlavor`;
const count = (txt.match(/\[r\]/g) || []).length; // count occurrences using RegExp
let replacements = []; // replacement values, in order
switch (count) {
  case 0:
    break;
  case 1:
    replacements = ["a"];
    break;
  case 2:
    replacements = ["___REPLACEMENT_1___", "___REPLACEMENT_2___"];
    break;
  case 3:
    replacements = ["d", "e", "f"];
    break;
}
let out = txt; // output variable
for (let i = 0; i < replacements.length; i++) {
  // with a string pattern, replace() only replaces the first occurrence
  out = out.replace("[r]", replacements[i]);
}
console.log(out);
With str.replace(). For example:
>>> 'test[r]test'.replace('[r]', '[0]')
'test[0]test'
Here's the docs on it.
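str.replace() on its own gives every [r] the same replacement. To number the occurrences in order, one option is re.sub() with a callback that draws from a counter. This is only a sketch: number_placeholders is a name invented for the example, and it assumes the indices should start at 0 as in the question.

```python
import itertools
import re

def number_placeholders(text, placeholder="[r]"):
    counter = itertools.count()  # yields 0, 1, 2, ...
    # re.escape() makes the brackets literal; the callback runs once per match,
    # left to right, so each occurrence receives the next index.
    return re.sub(re.escape(placeholder),
                  lambda match: "[%d]" % next(counter),
                  text)

print(number_placeholders("/a[r]/b[1]/c[r]/d"))  # /a[0]/b[1]/c[1]/d
```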

Is stripping string by '\r\n ' necessary in Python?

In Java, it's necessary to handle \r\n explicitly; see, e.g., the question "split( "\r\n") is not splitting my string in java".
But is \r\n necessary in Python? Is the following true?
str.strip() == str.strip('\r\n ')
From the docs:
Return a copy of the string with the leading and trailing characters
removed. The chars argument is a string specifying the set of
characters to be removed. If omitted or None, the chars argument
defaults to removing whitespace. The chars argument is not a prefix or
suffix; rather, all combinations of its values are stripped
From this CPython test, str.strip() seems to be stripping:
\t\n\r\f\v
Can anyone point me to the code in CPython that does the string stripping?
Are you looking for these lines?
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Objects/unicodeobject.c#L12222-L12247
#define LEFTSTRIP 0
#define RIGHTSTRIP 1
#define BOTHSTRIP 2
/* Arrays indexed by above */
static const char *stripfuncnames[] = {"lstrip", "rstrip", "strip"};
#define STRIPNAME(i) (stripfuncnames[i])
/* externally visible for str.strip(unicode) */
PyObject *
_PyUnicode_XStrip(PyObject *self, int striptype, PyObject *sepobj)
{
    void *data;
    int kind;
    Py_ssize_t i, j, len;
    BLOOM_MASK sepmask;
    Py_ssize_t seplen;
    if (PyUnicode_READY(self) == -1 || PyUnicode_READY(sepobj) == -1)
        return NULL;
    kind = PyUnicode_KIND(self);
    data = PyUnicode_DATA(self);
    len = PyUnicode_GET_LENGTH(self);
    seplen = PyUnicode_GET_LENGTH(sepobj);
    sepmask = make_bloom_mask(PyUnicode_KIND(sepobj),
                              PyUnicode_DATA(sepobj),
                              seplen);
    i = 0;
    if (striptype != RIGHTSTRIP) {
        while (i < len) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, i);
            if (!BLOOM(sepmask, ch))
                break;
            if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
                break;
            i++;
        }
    }
    j = len;
    if (striptype != LEFTSTRIP) {
        j--;
        while (j >= i) {
            Py_UCS4 ch = PyUnicode_READ(kind, data, j);
            if (!BLOOM(sepmask, ch))
                break;
            if (PyUnicode_FindChar(sepobj, ch, 0, seplen, 1) < 0)
                break;
            j--;
        }
        j++;
    }
    return PyUnicode_Substring(self, i, j);
}
Essentially:
str.strip() == str.strip(string.whitespace) == str.strip(' \t\n\r\f\v') != str.strip('\r\n')
str.strip() and str.strip('\r\n') are different; use the latter only if you are explicitly trying to remove ONLY newline characters.
>>> '\nfoo\n'.strip()
'foo'
>>> '\nfoo\n'.strip('\r\n')
'foo'
>>> '\r\n\r\n\r\nfoo\r\n\r\n\r\n'.strip()
'foo'
>>> '\r\n\r\n\r\nfoo\r\n\r\n\r\n'.strip('\r\n')
'foo'
>>> '\n\tfoo\t\n'.strip()
'foo'
>>> '\n\tfoo\t\n'.strip('\r\n')
'\tfoo\t'
This all seems fine, but note that if there is whitespace (or any other character) between a newline and the start or end of a string, .strip('\r\n') won't remove the newline.
>>> '\t\nfoo\n\t'.strip()
'foo'
>>> '\t\nfoo\n\t'.strip('\r\n')
'\t\nfoo\n\t'
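To answer the exact comparison from the question, here is a small sketch that puts str.strip() and str.strip('\r\n ') side by side. The helper name compare_strip and the sample strings are made up for illustration.

```python
def compare_strip(text):
    # First result: default behaviour (all whitespace is removed).
    # Second result: only CR, LF and the space character are removed.
    return text.strip(), text.strip('\r\n ')

for sample in ['  foo \r\n', '\tfoo\t\n', '\r\n foo\x0b']:
    default, explicit = compare_strip(sample)
    print(repr(sample), repr(default), repr(explicit), default == explicit)
```

The two calls agree on the first sample only. As soon as a tab or a vertical tab sits at either end, strip('\r\n ') stops there while strip() keeps going.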

How to Store Column Values in a Variable

I am dealing with a tab-separated file that contains multiple columns. Each column contains more than 3000 records.
Column1  Column2  Column3  Column4
1000041  11657    GenNorm  albumin
1000043  24249    GenNorm  CaBP
1000043  29177    GenNorm  calcium-binding protein
1000045  2006     GenNorm  tropoelastin
Problem: Using Python, how do I read the tab-separated file and store each column (with its records) in a single variable, so that I can use print to print out specific column(s)?
Preliminary code: so far I have used this code to read the TSV file:
import csv
Dictionary1 = {}
with open("sample.txt", 'r') as samplefile:
    reader = csv.reader(samplefile, delimiter="\t")
I think you're just asking how to "transpose" a CSV file from a sequence of rows to a sequence of columns.
In Python, you can always transpose any iterable of iterables by using the zip function:
import csv
with open("sample.txt") as samplefile:
    reader = csv.reader(samplefile, delimiter="\t")
    columns = zip(*reader)
Now, if you want to print each column in order:
for column in columns:
    print(column)
Here, columns is an iterator of tuples. If you want some other format, like a dict mapping the column names to a list of values, you can transform it easily. For example:
columns = {column[0]: list(column[1:]) for column in columns}
Or, if you want to put them in four separate variables, you can just use normal tuple unpacking:
col1, col2, col3, col4 = columns
But there doesn't seem to be a very good reason to do that.
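Putting the pieces together, here is a self-contained sketch of the dict variant. It reads from an in-memory io.StringIO holding two sample rows instead of the real file, and read_columns is a name chosen just for this example.

```python
import csv
import io

# Stand-in for the tab-separated file from the question.
sample = io.StringIO(
    "Column1\tColumn2\tColumn3\tColumn4\n"
    "1000041\t11657\tGenNorm\talbumin\n"
    "1000043\t24249\tGenNorm\tCaBP\n"
)

def read_columns(fileobj):
    reader = csv.reader(fileobj, delimiter="\t")
    # zip(*reader) turns the rows into columns; the first entry of each
    # column is the header, the rest are its values.
    return {column[0]: list(column[1:]) for column in zip(*reader)}

columns = read_columns(sample)
print(columns["Column4"])  # ['albumin', 'CaBP']
```

With a real file, pass the object returned by open("sample.txt", newline="") in place of sample.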
I'm not sure of the code in Python, but you can use a loop like this one. Once you have stored everything in the dictionary, use the loop to split the values into one array per column, then call the function to look up an index and print that record. You can modify the function to suit whatever you want the key to be, for example by passing in a word to search for.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

vector<string> arrColumn1, arrColumn2, arrColumn3, arrColumn4;

// dictionary is assumed to hold every cell of the file in reading order:
// row 1 column 1, row 1 column 2, ..., then row 2 column 1, and so on.
void splitIntoColumns(const vector<string>& dictionary) {
    int mainCounter = 0; // column that the next value belongs to
    for (size_t i = 0; i < dictionary.size(); ++i) {
        switch (mainCounter) {
        case 0:
            arrColumn1.push_back(dictionary[i]);
            ++mainCounter;
            break;
        case 1:
            arrColumn2.push_back(dictionary[i]);
            ++mainCounter;
            break;
        case 2:
            arrColumn3.push_back(dictionary[i]);
            ++mainCounter;
            break;
        case 3:
            arrColumn4.push_back(dictionary[i]);
            mainCounter = 0; // start the next row
            break;
        }
    }
}

// Print the first record whose column colToSearch (1 to 4) equals findThis.
void printRecordFunction(int colToSearch, const string& findThis,
                         const vector<string>& arr1, const vector<string>& arr2,
                         const vector<string>& arr3, const vector<string>& arr4) {
    const vector<string>* columns[] = {&arr1, &arr2, &arr3, &arr4};
    if (colToSearch < 1 || colToSearch > 4) {
        return;
    }
    const vector<string>& column = *columns[colToSearch - 1];
    for (size_t i = 0; i < column.size(); ++i) {
        if (column[i] == findThis) {
            cout << "Record: " << arr1[i] << " " << arr2[i] << " "
                 << arr3[i] << " " << arr4[i] << endl;
            return;
        }
    }
}
Sorry, this is all pretty hard-coded, but I hope it gives you some idea and you can adjust it.

Algorithm to match 2 lists with wildcards

I'm looking for an efficient way to match 2 lists, one which contains complete information, and one which contains wildcards. I've been able to do this with wildcards of fixed lengths, but am now trying to do it with wildcards of variable lengths.
Thus:
match( ['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D'] )
would return True as long as all the elements are in the same order in both lists.
I'm working with lists of objects, but used strings above for simplicity.
[edited to justify no RE after OP comment on comparing objects]
It appears you are not using strings, but rather comparing objects. I am therefore giving an explicit algorithm. Regular expressions provide a good solution tailored for strings, don't get me wrong, but from what you said in a comment on your question, it seems an explicit, simple algorithm may make things easier for you.
It turns out that this can be solved with a much simpler algorithm than this previous answer:
def matcher (l1, l2):
    if (l1 == []):
        return (l2 == [] or l2 == ['*'])
    if (l2 == [] or l2[0] == '*'):
        return matcher(l2, l1)
    if (l1[0] == '*'):
        return (matcher(l1, l2[1:]) or matcher(l1[1:], l2))
    if (l1[0] == l2[0]):
        return matcher(l1[1:], l2[1:])
    else:
        return False
The key idea is that when you encounter a wildcard, you can explore two options:
either advance in the list that contains the wildcard (and consider that the wildcard has matched whatever there was until now),
or advance in the list that doesn't contain the wildcard (and consider that whatever is at the head of that list has to be matched by the wildcard).
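The same two options can also be explored without recursion or list slicing, by tracking which prefixes of the plain list are reachable after each pattern element. The following is a sketch rather than the answer's own method: wildcard_match and WILDCARD are names invented here, and it assumes that only the first list contains wildcards and that the elements support ==.

```python
WILDCARD = '*'

def wildcard_match(pattern, items):
    # reachable[j] is True when the pattern elements consumed so far
    # can match exactly the first j items.
    reachable = [True] + [False] * len(items)
    for term in pattern:
        if term == WILDCARD:
            # A wildcard absorbs zero or more items: every position at or
            # after a reachable one becomes reachable.
            seen = False
            for j in range(len(items) + 1):
                seen = seen or reachable[j]
                reachable[j] = seen
        else:
            # A plain element must equal the next item exactly.
            advanced = [False] * (len(items) + 1)
            for j, item in enumerate(items):
                if reachable[j] and term == item:
                    advanced[j + 1] = True
            reachable = advanced
    return reachable[len(items)]

print(wildcard_match(['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D']))  # True
print(wildcard_match(['A', 'B', '*', 'D'], ['A', 'B', 'C']))                 # False
```

The work is proportional to len(pattern) * len(items), and consecutive wildcards need no special handling.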
How about the following:
import re
def match(pat, lst):
    regex = ''.join(term if term != '*' else '.*' for term in pat) + '$'
    s = ''.join(lst)
    return re.match(regex, s) is not None
print(match(['A', 'B', '*', 'D'], ['A', 'B', 'C', 'C', 'C', 'D']))
It uses regular expressions. Wildcards (*) are changed to .* and all other search terms are kept as-is.
One caveat is that if your search terms could contain things that have special meaning in the regex language, those would need to be properly escaped. It's pretty easy to handle this in the match function, I just wasn't sure if this was something you required.
I'd recommend converting ['A', 'B', '*', 'D'] to '^AB.*D$', ['A', 'B', 'C', 'C', 'C', 'D'] to 'ABCCCD', and then using the re module (regular expressions) to do the match.
This will be valid if the elements of your lists are only one character each, and if they're strings.
Something like:
import re
def myMatch(patternList, stringList):
    # convert pattern to flat string with wildcards
    # convert AB*D to valid regex ^AB.*D$
    pattern = ''.join(patternList)
    regexPattern = '^' + pattern.replace('*', '.*') + '$'
    # perform matching
    against = ''.join(stringList)  # convert ['A','B','C','C','C','D'] to ABCCCD
    # return whether there is a match
    return re.match(regexPattern, against) is not None
If the lists contain numbers, or words, choose a character that you wouldn't expect to be in either, for example #. Then ['Aa','Bs','Ce','Cc','CC','Dd'] can be converted to Aa#Bs#Ce#Cc#CC#Dd, the wildcard pattern ['Aa','Bs','*','Dd'] could be converted to ^Aa#Bs#.*#Dd$, and the match performed.
Practically speaking this just means all the ''.join(...) becomes '#'.join(...) in myMatch.
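A sketch of the separator idea for multi-character elements follows. It differs slightly from the description above: every element, including the last, is followed by the separator, so that a wildcard can also stand for zero elements, and literal elements go through re.escape(). The names SEP and match_words are made up for this example, and it assumes '#' never occurs inside an element.

```python
import re

SEP = '#'  # assumed never to appear inside any list element

def match_words(pattern, words):
    parts = []
    for term in pattern:
        if term == '*':
            # zero or more complete "element + separator" chunks
            parts.append('(?:[^%s]*%s)*' % (SEP, SEP))
        else:
            # escape regex metacharacters in literal elements
            parts.append(re.escape(term + SEP))
    subject = ''.join(word + SEP for word in words)
    return re.fullmatch(''.join(parts), subject) is not None

print(match_words(['Aa', 'Bs', '*', 'Dd'], ['Aa', 'Bs', 'Ce', 'Cc', 'CC', 'Dd']))  # True
print(match_words(['Aa', '*'], ['Aab']))                                            # False
```

Ending each element with the separator also stops 'Aa' from matching the start of a longer element such as 'Aab'.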
I agree with the comment that this could be done with regular expressions. For example:
import re
lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = ['A', 'B', 'C+', 'D']
print(re.match(''.join(pattern), ''.join(lst)))  # Will successfully match
Edit: As pointed out in a comment, it might be known in advance only that some character has to be matched, but not which one. In that case, regular expressions are still useful:
import re
lst = ['A', 'B', 'C', 'C', 'C', 'D']
pattern = r'AB(\w)\1*D'
print(re.match(pattern, ''.join(lst)).groups())
I agree, regular expressions are usually the way to go with this sort of thing. This algorithm works, but it just looks convoluted to me. It was fun to write though.
def match(listx, listy):
    listx, listy = map(iter, (listx, listy))
    while 1:
        try:
            x = next(listx)
        except StopIteration:
            # listx is exhausted; check whether listy is exhausted too.
            try:
                y = next(listy)
            except StopIteration:
                # This means there are no more values to be compared in either
                # listx or listy; since no mismatch was found elsewhere, the
                # lists match.
                return True
            else:
                # This means that there are values in listy that are not in
                # listx.
                return False
        else:
            try:
                y = next(listy)
            except StopIteration:
                # Similarly, there are values in listx that aren't in listy.
                return False
        if x == y:
            pass
        elif x == '*':
            try:
                # Get the value in listx after '*'.
                x = next(listx)
            except StopIteration:
                # This means that listx terminates with '*'. If there are any
                # remaining values of listy, they will, by definition, match.
                return True
            while 1:
                if x == y:
                    # I didn't shift to the next value in listy because I
                    # assume that a '*' matches the empty string as well as
                    # any other.
                    break
                else:
                    try:
                        y = next(listy)
                    except StopIteration:
                        # This means there is at least one remaining value in
                        # listx that is not in listy, because listy has no
                        # more values.
                        return False
                    else:
                        pass
        # Same algorithm as above, given there is a '*' in listy.
        elif y == '*':
            try:
                y = next(listy)
            except StopIteration:
                return True
            while 1:
                if x == y:
                    break
                else:
                    try:
                        x = next(listx)
                    except StopIteration:
                        return False
                    else:
                        pass
        else:
            # Two different plain values: the lists do not match.
            return False
I had this piece of C++ code which seems to be doing what you are trying to do (the inputs are strings instead of arrays of characters, but you'll have to adapt things anyway).
bool Utils::stringMatchWithWildcards (const std::string str, const std::string strWithWildcards)
{
    PRINT("Starting in stringMatchWithWildcards('" << str << "','" << strWithWildcards << "')");
    const std::string wildcard="*";
    const bool startWithWildcard=(strWithWildcards.find(wildcard)==0);
    std::size_t pos=strWithWildcards.rfind(wildcard);
    const bool endWithWildcard = (pos!=std::string::npos) && (pos+wildcard.size()==strWithWildcards.size());
    // Basically, the point is to split the string with wildcards in strings with no wildcard.
    // Then search in the first string for the different chunks of the second in the correct order
    std::vector<std::string> vectStr;
    boost::split(vectStr, strWithWildcards, boost::is_any_of(wildcard));
    // I expected all the chunks in vectStr to be non-empty. It doesn't seem to be the case so let's remove them.
    vectStr.erase(std::remove_if(vectStr.begin(), vectStr.end(),
                                 [](const std::string& s) { return s.empty(); }),
                  vectStr.end());
    // Check if at least one element (to have first and last element)
    if (vectStr.empty())
    {
        const bool matchEmptyCase = (startWithWildcard || endWithWildcard || str.empty());
        PRINT("Match " << (matchEmptyCase?"":"un") << "successful (empty case) : '" << str << "' and '" << strWithWildcards << "'");
        return matchEmptyCase;
    }
    // First Element
    std::vector<std::string>::const_iterator vectStrIt = vectStr.begin();
    std::string aStr=*vectStrIt;
    if (!startWithWildcard && str.find(aStr, 0)!=0) {
        PRINT("Match unsuccessful (beginning) : '" << str << "' and '" << strWithWildcards << "'");
        return false;
    }
    // "Normal" Elements
    pos=0;
    std::vector<std::string>::const_iterator vectStrEnd = vectStr.end();
    for ( ; vectStrIt!=vectStrEnd ; vectStrIt++)
    {
        aStr=*vectStrIt;
        PRINT( "Searching '" << aStr << "' in '" << str << "' from " << pos);
        pos=str.find(aStr, pos);
        if (pos==std::string::npos)
        {
            PRINT("Match unsuccessful ('" << aStr << "' not found) : '" << str << "' and '" << strWithWildcards << "'");
            return false;
        } else
        {
            PRINT( "Found at position " << pos);
            pos+=aStr.size();
        }
    }
    // Last Element
    const bool matchEnd = (endWithWildcard || str.rfind(aStr)+aStr.size()==str.size());
    PRINT("Match " << (matchEnd?"":"un") << "successful (usual case) : '" << str << "' and '" << strWithWildcards);
    return matchEnd;
}
/* Tested on these values :
assert( stringMatchWithWildcards("ABC","ABC"));
assert( stringMatchWithWildcards("ABC","*"));
assert( stringMatchWithWildcards("ABC","*****"));
assert( stringMatchWithWildcards("ABC","*BC"));
assert( stringMatchWithWildcards("ABC","AB*"));
assert( stringMatchWithWildcards("ABC","A*C"));
assert( stringMatchWithWildcards("ABC","*C"));
assert( stringMatchWithWildcards("ABC","A*"));
assert(!stringMatchWithWildcards("ABC","BC"));
assert(!stringMatchWithWildcards("ABC","AB"));
assert(!stringMatchWithWildcards("ABC","AB*D"));
assert(!stringMatchWithWildcards("ABC",""));
assert( stringMatchWithWildcards("",""));
assert( stringMatchWithWildcards("","*"));
assert(!stringMatchWithWildcards("","ABC"));
*/
It's not something I'm really proud of but it seems to be working so far. I hope you can find it useful.
