I'm trying to create a random text generator in python. I'm using Markovify to produce the required text, a filter to not let it start generating text unless the first word is capitalized and, to prevent it from ending "mid sentence", want the program to search from the back of the output to the front and remove all text after the last (for instance) period. I want it to ignore all other instances of the selected delimiter(s). I have no idea how many instances of the delimiter will occur in the generated text, nor have anyway to know in advance.
While looking into this I found rsplit(), and tried using that, but ran into a problem.
'''tweet = buff.rsplit('.')[-1] '''
The above is what I tried first, and I thought it was working until I noticed that all of the lines printed with that had only a single sentence in them. Never more than that. The problem seems to be that the text is being dumped into an array of strings, and the [-1] bit is calling just one entry from that array.
'''tweet = buff.rsplit('.') - buff.rsplit('.')[-1] '''
Next I tried the above. The thinking, was that it would remove the last entry in the array, and then I could just print what remained. It... didn't go to plan. I get an "unsupported operand type" error, specifically tied to the attempt to subtract. Not sure what I'm missing at this point.
.rsplit has second optional argument - maxsplit i.e. maximum number of split to do. You could use it following way:
txt = 'some.text.with.dots'
all_but_last = txt.rsplit('.', 1)[0]
print(all_but_last)
Output:
some.text.with
i'm new to python, and i am developing a tool/script for ssh and sftp. i noticed some of the code i'm using creates what i thought was a string array
channel_data = str()
to hold the console output from an ssh session. if i check "type" on channel_data it comes back as class 'str' ,
but yet if i perform for loop to read each item in channel_data , and channel_data contains what appears to be 30 lines from an ssh console
for line in channel_data:
if "my text" in line:
found = True
each iteration of "line" shows a single character, as if the whole ssh console output of 30 lines of text is broken down into single character array. i do have \n within all the text.
for example channel_data would contain "Cisco Nexus Operation System (NX-OS) Software\r\nCopyright (c) 2002-2016\r\n ..... etc. etc.. ", but again would read in my for loop and print out "C" then "i" then "s" etc..
i'm trying to understand do i have a char array here or a string array here that is made up of single string characters and how to convert it into a string list based on \n within Python?
You can iterate a string just like a list in Python. So, yes, as expected, your string type channel_data will in fact give you every character.
Python does not have a char array. You will have a list of strings, even as a single character as each item in the list:
>>> type(['a', 'b'])
<type 'list'>
Also, just for the sake of adding some extra information for your own knowledge when it comes to usage of terminology, there is a difference between array and list in Python: Python List vs. Array - when to use?
So, what you are actually looking to do here is take the channel_data string and make it a list by calling the split method on it.
The split method will, by default, split on white space characters only. Check the documentation. So, you will want to make sure what you want to actually split on and provide that detail to the method.
You can take a look at splitlines to see if that works for you.
As specified in the documentation for splitlines:
Line breaks are not included in the resulting list unless keepends is
given and true.
Your result will then be a list of strings as you expect. So, as an example you can do:
your_new_list_of_str = channel_data.split('\n')
or
your_new_list_of_str = channel_data.splitlines()
string_list = channel_data.splitlines()
See docs at https://docs.python.org/3.6/library/stdtypes.html#str.splitlines
I have a python script which parses an xml file and then gives me the required information. My output looks like this, and is 100% correct:
output = ['77:275,77:424,77:425,77:426,77:427,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:431,77:432,77:433,77:435,77:467,77:470,77:471,77:484,77:485,77:475,77:476,77:437,77:438,77:439,77:440,77:442,77:443,77:444,77:445,77:446,77:447,77:449,77:450,77:451,77:454,77:455,77:456,77:305,77:309,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:317,77:321,77:346,77:349,77:350,77:351,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:362,77:367,77:369,77:374,77:370,77:372,77:373,77:387,77:388,77:389,77:392,77:393,77:394,77:328,77:283,77:284,77:285,77:288,77:289,77:290,77:292,']
It is all fine, but I want to remove the duplicate elements in an element, like in the case above. I tried using the OrderedDict package or just simple list(set(output)), but obvoiusly they both didn't work. Does anyone have a tip for me on how to solve this problem.
You have one element in a list. If you expected it to be treated as separate elements, you need to explicitly split it.
You could split the string on the ',' comma character into a list with str.split():
separate_elements = output[0].split(',')
after which you can use set() (unordered) or OrderedDict (maintaining order) and re-join the string if you still need just the one string object:
','.join(set(separate_elements))
You can put that back into a list with just one element, but there is little point if all you ever handle is that one string.
I am using re and would like to search a string between two strings. My problem is the string that I would like to search may end with either newline(\n) or another string. So what I want to do is if it is newline or another string it should give me back the string. The reason why I want to do that is some of my documents are created wrong in a way that it does not have new line, so I have to get the text until newline and then check if it has the corresponding string.
I have tried this:
recipients = re.search('Recipients:(.*)\n', body)
reciBody = re.search('(.*)Notes', recipients.group(1).encode("utf-8"))
Later on I am trying to split this by using:
recipientsList = reciBody.group(1).encode("utf-8").split(',')
The problem is I am getting this error if there is no corresponding string:
recipientsList = reciBody.group(1).encode("utf-8").split(',')
AttributeError: 'NoneType' object has no attribute 'group'
What other ways can I use? Or how can I handle this errror?
I'm assuming nothing needs to be done if the group isn't found. Simplest is to just skip the error.
try:
recipientsList = reciBody.group(1).encode("utf-8").split(',')
except AttributeError:
pass # nothing needs to be done
Instead of pass you may need to set recipientsList to something else
I am evaluating hundreds of thousands of html files. I am looking for particular parts of the files. There can be small variations in the way the files were created
For example, in one file I can have a section heading (after I converted it to upper and split then joined the text to get rid of possibly inconsistent white space:
u'KEY1A\x97RISKFACTORS'
In another file I could have:
'KEY1ARISKFACTORS'
I am trying to create a dictionary of possible responses and I want to compare these two and conclude that they are equal. But every substitution I try to run the first string to remove the '\97 does not seem to work
There are a fair number of variations of keys with various representations of entities so I would really like to create a dictionary more or less automatically so I have something like:
key_dict={'u'KEY1A\x97RISKFACTORS':''KEY1ARISKFACTORS',''KEY1ARISKFACTORS':'KEY1ARISKFACTORS',. . .}
I am assuming that since when I run
S1='A'
S2=u'A'
S1==S2
I get
True
I should be able to compare these once the html entities are handled
What I specifically tried to do is
new_string=u'KEY1A\x97RISKFACTORS'.replace('|','')
I got an error
Sorry, I have been at this since last night. SLott pointed out something and I see I used the wrong label I hope this makes more sense
You are correct that if S1='A' and S2 = u'A', then S1 == S2. Instead of assuming this though, you can do a simple test:
key_dict= {u'A':'Value1',
'A':'Value2'}
print key_dict
print u'A' == 'A'
This outputs:
{u'A': 'Value2'}
True
That resolved, let's look at:
new_string=u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace('|','')
There's a problem here, \x97 is the value you're trying to replace in the target string. However, your search string is '|', which is hex value 0x7C (ascii and unicode) and clearly not the value you need to replace. Even if the target and search string were both ascii or unicode, you'd still not find the '\x97'. Second problem is that you are trying to search for a non-unicode string in a unicode string. The easiest solution, and one that makes the most sense is to simply search for u'\x97':
print u'KEY1A\x97DEMOGRAPHICRESPONSES'
print u'KEY1A\x97DEMOGRAPHICRESPONSES'.replace(u'\x97', u'')
Outputs:
KEY1A\x97DEMOGRAPHICRESPONSES
KEY1ADEMOGRAPHICRESPONSES
Why not the obvious .replace(u'\x97','')? Where does the idea of that '|' come from?
>>> s = u'KEY1A\x97DEMOGRAPHICRESPONSES'
>>> s.replace(u'\x97', '')
u'KEY1ADEMOGRAPHICRESPONSES'