Remove duplicates from a list in Python - python

I have a python script which parses an xml file and then gives me the required information. My output looks like this, and is 100% correct:
output = ['77:275,77:424,77:425,77:426,77:427,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:412,77:413,77:414,77:431,77:432,77:433,77:435,77:467,77:470,77:471,77:484,77:485,77:475,77:476,77:437,77:438,77:439,77:440,77:442,77:443,77:444,77:445,77:446,77:447,77:449,77:450,77:451,77:454,77:455,77:456,77:305,77:309,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:317,77:321,77:346,77:349,77:350,77:351,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:496,77:497,77:500,77:504,77:506,77:507,77:508,77:513,77:515,77:514,77:517,77:518,77:519,77:521,77:522,77:523,77:403,77:406,77:404,77:405,77:403,77:406,77:404,77:405,77:526,77:362,77:367,77:369,77:374,77:370,77:372,77:373,77:387,77:388,77:389,77:392,77:393,77:394,77:328,77:283,77:284,77:285,77:288,77:289,77:290,77:292,']
It is all fine, but I want to remove the duplicate entries inside that one element, as in the case above. I tried using OrderedDict and a simple list(set(output)), but obviously neither worked. Does anyone have a tip on how to solve this problem?

You have one element in a list. If you expected it to be treated as separate elements, you need to explicitly split it.
You could split the string on the ',' comma character into a list with str.split():
separate_elements = output[0].split(',')
after which you can use set() (unordered) or OrderedDict (maintaining order) and re-join the string if you still need just the one string object:
','.join(set(separate_elements))
You can put that back into a list with just one element, but there is little point if all you ever handle is that one string.
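If the original order matters, the OrderedDict route mentioned above could look like the sketch below. The sample list is a shortened stand-in for the real output, and the names unique_elements and deduplicated are made up for illustration. Note that the original string ends with a comma, so split(',') produces an empty last piece, which is filtered out here.

```python
from collections import OrderedDict

# Shortened stand-in for the real output (assumption: same shape, one string).
output = ['77:275,77:424,77:412,77:413,77:412,77:413,']

# Split the single string; skip the empty piece left by the trailing comma.
separate_elements = [item for item in output[0].split(',') if item]

# OrderedDict.fromkeys keeps only the first occurrence of each key, in order.
unique_elements = list(OrderedDict.fromkeys(separate_elements))

# Re-join into one string if a single string is still what you need.
deduplicated = ','.join(unique_elements)
print(deduplicated)
```

With set() instead of OrderedDict.fromkeys() the duplicates are also removed, but the order of the result is not guaranteed.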

Related

Python 2.7 Regex; Finding varying number of expressions

I am working on a bioinformatics project and am currently trying to split a certain string containing locations on a chromosome.
Example of a few strings, which go by the name "location":
NC_000023.11:g.154532082
NC_000023.11:g.154532058_154532060
NC_000023.11:g.154532046
What I would like returned looks like:
([154532082])
([154532058], [154532060])
([154532046])
I can not think of a regex that normally captures only the first number, and when present, separately captures the second number, without creating a second group, as with:
re.findall(":g.(\d*)_?(\d*)", location)
which gives:
([154532082], [])
([154532058], [154532060])
([154532046], [])
or
re.findall(":g.(\d*)", location), re.findall("\d_(\d*)", location)
which gives:
[(154532082), ()]
[(154532058), (154532060)]
[(154532046), ()]
Is there any expression that would solve this? Or should I see and try to remove the empty lists after finding them the way I do?
Here is what you could do:
[re.search(r"(?<=:g\.)(\d*)_?(\d*)", item).group() for item in location.split("\n")]
This is a list comprehension that does everything in a single line. Going through it part by part:
for item in location.split("\n")
This splits the location string at every line break and iterates over the resulting list. Each piece between two line breaks is bound to the name item.
re.search(r"(?<=:g\.)(\d*)_?(\d*)", item).group()
The (?<=:g\.) part is a positive lookbehind assertion: the match must be preceded by ':g.', but ':g.' itself is not part of the match. The dot is escaped so that it matches only a literal '.'. Calling group() with no arguments returns the whole matched text, so a location with two numbers comes back as a single string such as '154532058_154532060'.
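If the goal is one list per location with no empty placeholders, an alternative is to cut the string at ':g.' and collect every run of digits after it. This is only a sketch of a different technique than the lookbehind above; the helper name extract_positions is hypothetical, and it assumes each location contains ':g.' exactly as in the examples.

```python
import re

def extract_positions(location):
    """Return every number that follows ':g.' in one location string."""
    # partition() splits at the first ':g.'; tail is everything after it.
    _, _, tail = location.partition(":g.")
    # findall() with no groups returns each run of digits as a string.
    return [int(number) for number in re.findall(r"\d+", tail)]

locations = [
    "NC_000023.11:g.154532082",
    "NC_000023.11:g.154532058_154532060",
    "NC_000023.11:g.154532046",
]
for location in locations:
    print(extract_positions(location))
```

A location with a single position yields a one-item list and a range yields two items, so there is nothing to clean up afterwards.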
Read the python documentation on regex, it helps a lot:
https://docs.python.org/2/library/re.html

How can i use the "list" function and get all items in a single index on python?

I'm working on a really long program, and I want it to save an input inside a new list. For that I tried:
thing = list(input("say something"))  # user types: hello
print(thing)
# ['h', 'e', 'l', 'l', 'o']
How can I get ['hello'] instead?
Offhand, I'd say the easiest would be to initialize thing with an empty list and then append the user's input to it:
thing = []
thing.append(input("say something: "))
Use:
thing = [input("say something")]
In your version "hello" is treated as an iterable, which all Python strings are. A list then gets created with individual characters as items from that iterable (see docs on list()). If you want a list with the whole string as the only item, you have to do it using the square bracket syntax.
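The difference between the two spellings can be checked without any user input. In this small sketch a fixed string stands in for what input() would return; the variable names are made up for illustration.

```python
word = "hello"  # stands in for the value returned by input()

# list() iterates over its argument, so a string is split into characters.
as_characters = list(word)

# Square brackets build a list whose only item is the string itself.
as_single_item = [word]

print(as_characters)
print(as_single_item)
print(len(as_characters), len(as_single_item))
```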

Iterating through python string array gives unexpected output

I was debugging some Python code and, like any beginner, I'm using print statements. I narrowed down the problem to:
paths = ("../somepath") #is this not how you declare an array/list?
for path in paths:
    print path
I was expecting the whole string to be printed out, but only . is. Since I planned on expanding it anyway to cover more paths, it appears that
paths = ("../somepath", "../someotherpath")
fixes the problem and correctly prints out both strings.
I'm assuming the initial version treats the string as an array of characters (or maybe that's just the C++ in me talking) and just prints out the characters.
I'd still like to know why this happens.
("../somepath")
is nothing but a string wrapped in parentheses, so it is the same as "../somepath". Since Python's for loop can iterate over any iterable, and a string is an iterable, it prints one character at a time.
To create a tuple with one element, add a comma at the end:
("../somepath",)
If you want to create a list, you need to use square brackets, like this:
["../somepath"]
paths = ["../somepath","abc"]
This way you create a list, and your code will work.
paths = ("../somepath", "../someotherpath") worked because it formed a tuple, which is an immutable sequence: it behaves like a list that cannot be modified.
I tested it, and the output is one character per line, because the string is iterated character by character.
To get what you want, you need:
paths = ['../somepath']  # square brackets make a list containing one string
for path in paths:
    print path
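The three spellings discussed in the answers can be compared directly. The sketch below is written for Python 3, so print is a function rather than the print statement used in the question; the variable names are made up for illustration.

```python
just_a_string = ("../somepath")   # parentheses alone do not make a tuple
one_item_tuple = ("../somepath",) # the trailing comma makes the tuple
one_item_list = ["../somepath"]   # square brackets make a list

for candidate in (just_a_string, one_item_tuple, one_item_list):
    # Show the type and how many items a for loop would visit.
    print(type(candidate).__name__, len(candidate))

# Iterating over the plain string visits single characters.
first_from_string = next(iter(just_a_string))
# Iterating over the tuple or list visits the whole path.
first_from_tuple = next(iter(one_item_tuple))
print(repr(first_from_string), repr(first_from_tuple))
```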

striplines() not working when using finditer python

I am trying to convert a multiline string to a single list, which should be possible using splitlines(), but for some reason it keeps converting each line into its own list instead of processing all the lines at once. I tried doing it outside the for loop, but that doesn't seem to have any effect. I need the lines as a single list to use in another function. Below is how I get the multiline string into a single variable. What am I missing?
multiline_string_final = []
for match_multiline in re.finditer(r'(^(\w+):\sThis particular string\s*|This particular string\s*)\{\s(\w+)\s\{(.*?)\}', string, re.DOTALL):
    multiline_string = match_multiline.group(4)
    print multiline_string
This last print statement prints out the strings like this:
blah=0; blah_blah=1; Foo=3;
blah=4; blah_blah=5; Foo=0;
However I need:
['blah=0; blah_blah=1; Foo=3;''blah=4; blah_blah=5; Foo=0;']
I understand it has something to do with the finditer, but I can't seem to fix it.
Your new problem also has nothing to do with finditer. (Also, your code is still not an MCVE, you still haven't shown us the sample input data, etc., making it harder to help you.)
From this desired output:
['blah=0; blah_blah=1; Foo=3;''blah=4; blah_blah=5; Foo=0;']
I'm pretty sure what you're looking for is to get a list of the matches, instead of printing out each match on its own. That isn't a valid list, because it's missing the comma between the elements,* but I'll assume that's a typo from you making up data instead of building an MCVE and copying and pasting the real output.
Anyway, to get a list, you have to build a list. Printing things to the screen doesn't build anything. So, try this:
multiline_string_final.append(multiline_string)
Then, at the end—not inside the loop, only after the loop has finished—you can print that out:
print multiline_string_final
And it'll look like this:
['blah=0; blah_blah=1; Foo=3;',
'blah=4; blah_blah=5; Foo=0;']
* Actually, it is a valid list, because adjacent strings get concatenated… but it's not the string you wanted, and not a format Python would ever print out for you.
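As a rough illustration of collecting matches into a list, here is a sketch in Python 3. The question never showed its input, so sample_text and the simplified pattern below are invented stand-ins, not the asker's real data or regex.

```python
import re

# Invented sample input (assumption): two nested brace blocks.
sample_text = (
    "first { a { blah=0; blah_blah=1; Foo=3; } }\n"
    "second { b { blah=4; blah_blah=5; Foo=0; } }\n"
)

# Simplified hypothetical pattern: capture the text of the inner braces.
pattern = re.compile(r"\{\s*\w+\s*\{(.*?)\}", re.DOTALL)

multiline_string_final = []
for match in pattern.finditer(sample_text):
    # Append each captured group instead of printing it.
    multiline_string_final.append(match.group(1).strip())

# Print once, after the loop has finished.
print(multiline_string_final)
```

The same loop can be written as a list comprehension over pattern.finditer(sample_text) if you prefer a single expression.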
The problem has nothing to do with the finditer, it's that you're doing the wrong thing:
for line in multiline_string:
    print multiline_string.splitlines()
If multiline_string really is a multiline string, then for line in multiline_string will iterate over the characters of that string.
Then, within the loop, you completely ignore line anyway, and instead print multiline_string.splitlines().
So, if multiline_string is this:
abc
def
Then you'll print ['abc', 'def'] 8 times in a row, once for each character of the string (the two newlines included). That's not what you want (or what you described).
What you want to do is:
split the string into lines
loop over those lines, not over the original un-split string
print each line, not the whole thing
So:
for line in multiline_string.splitlines():
    print line
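The difference between looping over the string and looping over its lines can be checked with a short Python 3 sketch. It uses the two-line example string from above; the names characters and lines are made up for illustration.

```python
multiline_string = "abc\ndef\n"

# Iterating over a string visits every character, newlines included.
characters = [char for char in multiline_string]

# splitlines() returns the lines without their line endings.
lines = multiline_string.splitlines()

print(len(characters))  # number of loop iterations over the raw string
print(lines)

# Loop over the split lines, not over the original string.
for line in lines:
    print(line)
```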

python print string on multiple lines

I have a function that can only accept strings. (It creates an image from the string, but the string has little formatting and no word wrapping, so a long string just bleeds past the edge of the image and keeps going, when I would have liked it to form a paragraph instead of one endless line.)
I need it to print with line breaks. Currently the file is being read in using
inputFiles.readlines()
which reads the entire file. Storing the result of readlines() gives a list, so it cannot be passed to my function, which expects a string.
I used
inputFileContent = ' \n'.join(inputFiles.readlines())
in an attempt to force hard line breaks into the string between the list items. This does not work (edit: to elaborate, the inputFileContent string does not appear to have line breaks even though I put '\n' between the list elements). From my understanding, the readlines() function puts the individual lines into individual elements of a list.
Any suggestions? Thank you.
Use inputFiles.read(), which returns the whole file as a single string. Does that help?
The join should have worked. Your problem may be that whatever renders the string ignores newline characters. You could maybe try '\r\n'.join(...)
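To see what read() and readlines() actually return, here is a small Python 3 sketch. io.StringIO stands in for the opened file, and its contents are invented for illustration.

```python
import io

# In-memory stand-in for the opened file (assumption: three short lines).
input_file = io.StringIO("first line\nsecond line\nthird line\n")

lines = input_file.readlines()   # list of lines, each keeping its '\n'
input_file.seek(0)               # rewind so the file can be read again
whole_text = input_file.read()   # one string with the newlines intact

# Joining with an empty string reproduces exactly what read() returns.
rejoined = "".join(lines)

# Joining with ' \n' adds a second line break after each existing one.
double_spaced = " \n".join(lines)

print(repr(lines))
print(repr(whole_text))
print(repr(double_spaced))
```

Since each item from readlines() already ends in '\n', the string does contain line breaks either way; if they do not show up in the image, the likely cause is the drawing function rather than the join.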
