Split identifier string python - python

DB00002
DB00914
DB00222
DB01056
I have a list of database IDs and want to trim each one down to its number only, e.g. (2, 914, 222, 1056). How can I do this in Python? Many thanks!

Just exclude the first two characters and convert the rest to int.
data = ["DB00002", "DB00914", "DB00222", "DB01056"]
print([int(item[2:]) for item in data])
# [2, 914, 222, 1056]
If you are not sure how many leading characters are non-numeric, you can skip every non-digit character with a generator expression, like this:
[int("".join(char for char in item if char.isdigit())) for item in data]
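A short sketch of the digit-filtering approach, assuming some IDs may have prefixes of different lengths. The extra IDs and the helper name digits_only are made up for illustration. Note that filtering with isdigit() glues together every digit in the string, so it only fits when the digits form a single run.

```python
def digits_only(item):
    """Keep only the digit characters of item and convert them to int."""
    return int("".join(ch for ch in item if ch.isdigit()))

# Hypothetical IDs with prefixes of different lengths
data = ["DB00002", "DRUG00914", "X222"]

print([digits_only(item) for item in data])  # [2, 914, 222]

# Caveat: digits separated by letters are joined into one number
print(digits_only("A1B2"))  # 12
```

A string with no digits at all would reach int("") and raise ValueError, so this is worth guarding against if the input is not trusted.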

Assuming you want to remove the first two characters and convert the rest into an integer:
text = "DB00914"
num = int(text[2:])

Related

How to use sep command with sorted

I want to separate sorted numbers with "<", but can't do it.
Here is the code:
numbers = [3, 7, 5]
print(sorted(numbers), sep="<")
The * operator, as mentioned by #MisterMiyagi, can be used to unpack the list into separate arguments so that sep is applied between them.
Code:
print(*sorted(numbers), sep="<")
I don't know if this is the answer you want, but here is Python code that separates the sorted numbers with "<" by converting the numbers to strings and then using join.
As the items in the iterable must be strings, I first use a list comprehension to create a list containing each integer as a string, and pass this as input to str.join().
# Initial list
numbers = [5, 1, 2, 3, 4]
# Sort the numbers
sorted_nums = sorted(numbers)
# Convert each number to a string (avoid naming the loop variable "int")
string_ints = [str(n) for n in sorted_nums]
# Join the sorted strings with "<"
output = '<'.join(string_ints)
print(output)
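As a quick check, the two approaches above print the same line. A minimal sketch reusing the numbers from the question; map(str, ...) is used here as a shorter stand-in for the list comprehension:

```python
numbers = [3, 7, 5]

# Approach 1: unpack the sorted list so print() receives separate arguments
print(*sorted(numbers), sep="<")  # 3<5<7

# Approach 2: build the string explicitly; str.join() needs strings, not ints
joined = "<".join(map(str, sorted(numbers)))
print(joined)  # 3<5<7
```

The join version is the one to reach for when the result is needed as a value rather than only printed.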

How to extract a number from an alphanumeric string in Python?

I want to extract the numbers contained in alphanumeric strings. Please help me with this.
Example:
line = ["frame_117", "frame_11","frame_1207"]
Result:
[117, 11, 1207]
You can split on the '_' character like this:
numbers = []
line = ["frame_117", "frame_11", "frame_1207"]
for item in line:
    # split once on "_" and convert the part after it
    number = int(item.split("_", 1)[1])
    numbers.append(number)
print(numbers)
import re
temp = []
lines = ["frame_117", "frame_11", "frame_1207"]
for line in lines:
    num = re.search(r'-?\d+', line)
    temp.append(int(num.group(0)))
print(temp)  # [117, 11, 1207]
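One thing to watch with the regex version: re.search returns None when a string contains no digits, so num.group(0) would raise AttributeError. A minimal sketch that skips such entries; the helper name first_int and the "frame_x" entry are made up for illustration.

```python
import re

def first_int(text):
    """Return the first (optionally negative) integer in text, or None."""
    match = re.search(r'-?\d+', text)
    return int(match.group(0)) if match else None

lines = ["frame_117", "frame_11", "frame_x", "frame_1207"]
numbers = [n for n in map(first_int, lines) if n is not None]
print(numbers)  # [117, 11, 1207]
```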
Rationale
The first thing I see is that the names inside the list follow a pattern: the string frame, an underscore _, and the number as a string, i.e. "frame_number".
Step-By-Step
With that in mind, you can:
Loop through the list. We'll use a list comprehension.
Get each item from the list (the names="frame_number" )
Split them according to a separator (getting a sublist with ["frame", "number"])
And then create a new list with the last items of each sublist
numbers = [x.split("_")[-1] for x in line]
['117', '11', '1207']
Solution
But you need numbers and here you have a list of strings. We make one extra step and use int().
numbers = [int(x.split("_")[-1]) for x in line]
[117, 11, 1207]
This works only because we detected a pattern in the target name.
But what happens if you need to find all numbers in a random string? Or floats? Or Complex numbers? Or negative numbers?
That's a little bit more complex and out of scope of this answer.
See How to extract numbers from a string in Python?
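For a rough idea of what the more general case can look like, here is a sketch using re.findall with a pattern that accepts an optional minus sign and an optional decimal part. The pattern and the sample string are assumptions for illustration; the pattern does not cover exponents, complex numbers, or thousands separators.

```python
import re

# Optional minus sign, digits, and an optional fractional part
NUMBER = re.compile(r'-?\d+(?:\.\d+)?')

text = "offset -3.5, count 42, ratio 0.25"
found = [float(tok) for tok in NUMBER.findall(text)]
print(found)  # [-3.5, 42.0, 0.25]
```

Be aware that a hyphen directly before a digit is always read as a minus sign here, which may not be what you want for names like "frame-12".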

How to count frequencies/occurrences of all values within a string

I need a count of all the emails in a list, some of the emails however are consolidated together with a | symbol. These need to be split and the emails need to be counted after splitting to avoid getting an inaccurate or low count of frequencies.
I have a list that is something like this:
test = ['abc#gmail.com', 'xyz#jad.com|abc#gmail.com', 'asd#ajf.com|abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']
I performed a set of operations to split the entries. When I split, the pipe ends up replaced by double quotes at that location, so I remove the double quotes to leave all email IDs enclosed in single quotes.
# convert list to a string
test_str = str(test)
# apply string operation to split by separator '|'
test1 = test_str.split('|')
print(test1)
--> OUTPUT of above print statement: ["['abc#gmail.com', 'xyz#jad.com", "abc#gmail.com', 'asd#ajf.com", "abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']"]
test2 = str(test1)
test3 = test2.replace('"','')
print(test3)
--> OUTPUT of above print statement: [['abc#gmail.com', 'xyz#jad.com', 'abc#gmail.com', 'asd#ajf.com', 'abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']]
How can I now obtain a count of all the emails? This is a string essentially and if it's a list, I could use collections.Counter to easily obtain a count.
I'd like to get a list like the one listed below that has the email and the count in descending order of frequency
['abc#gmail.com': 3, 'xyz#jad.com': 2, 'asd#ajf.com': 1, 'asdf#adh.com': 1]
Thanks for the help!
You can use collections.Counter with a generator expression that iterates over the input list of strings and then iterates over the sub-list of emails by splitting the strings. Use the most_common method to ensure a descending order of counts:
from collections import Counter
dict(Counter(e for s in test if s for e in s.split('|')).most_common())
This returns:
{'abc#gmail.com': 3, 'xyz#jad.com': 2, 'asd#ajf.com': 1, 'asdf#adh.com': 1}
What about iterating over the list and calling Counter.update on every string? Like this:
from collections import Counter
test = ['abc#gmail.com', 'xyz#jad.com|abc#gmail.com', 'asd#ajf.com|abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']
c = Counter()
for email_str in test:
    if email_str:
        # add every email from this entry to the running counts
        c.update(email_str.split('|'))
res = c.most_common()
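Both answers can be wrapped in a small function. This sketch uses itertools.chain.from_iterable to flatten the split pieces before counting; the function name count_emails and its sep parameter are made up for illustration.

```python
from collections import Counter
from itertools import chain

def count_emails(entries, sep='|'):
    """Count every email in entries, splitting combined entries on sep."""
    pieces = chain.from_iterable(s.split(sep) for s in entries if s)
    # most_common() returns (email, count) pairs, highest count first
    return Counter(pieces).most_common()

test = ['abc#gmail.com', 'xyz#jad.com|abc#gmail.com',
        'asd#ajf.com|abc#gmail.com', 'asdf#adh.com', 'xyz#jad.com']

for email, count in count_emails(test):
    print(email, count)
```

Working on the original list directly avoids the str(test) round trip from the question, which is what turned the data into one hard-to-parse string.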

Taking a specific character in the string for a list of strings in python

I have a list of 22000 strings like abc.wav. I want to take a specific character out of each one in Python, such as the character just before .wav, for all the files. How can I do that in Python?
To find the position of a character you could use .split(), but to pull out a specific position in a string you could index it with list[stringNum][letterNum]. list[stringNum].split("a") would then give two or more separate strings, one for each side of the letter "a". By comparing the lengths of those pieces with the length of the whole string, you can work out where each "a" was. Just a simple algorithm idea; you'd have to play around with it.
I am assuming you are trying to reconstruct the same string without the letter before the extension.
# filenames is your list of strings, e.g. ["abc.wav", ...]
resultList = []
for item in filenames:
    newstr = item.split('.')[0]
    extstr = item.split('.')[1]
    locstr = newstr[:-1]  # change the slice here depending on the char you want to remove
    endstr = locstr + '.' + extstr  # put the dot back before the extension
    resultList.append(endstr)
If you are trying to just save a list of the letters you remove only, do the following:
resultList = []
for item in filenames:
    newstr = item.split('.')[0]
    endstr = newstr[-1]  # the character just before the extension
    resultList.append(endstr)
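An alternative sketch using os.path.splitext, which splits on the last dot only, so names that contain extra dots are handled too. The sample names and the helper name char_before_ext are made up for illustration.

```python
import os

def char_before_ext(filename):
    """Return the character just before the extension, or '' if there is none."""
    stem, _ext = os.path.splitext(filename)
    return stem[-1:]  # slicing avoids IndexError on an empty stem

filenames = ["abc.wav", "take.2.wav", "xyz.wav"]
print([char_before_ext(name) for name in filenames])  # ['c', '2', 'z']
```

With item.split('.')[0], "take.2.wav" would give "take" as the stem, so the two approaches differ on names with more than one dot.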
import pandas as pd
df = pd.DataFrame({'something': ['asb.wav', 'xyz.wav']})
# raw string, with the dot escaped so it only matches a literal "."
df.something.str.extract(r"(\w*)(\.wav$)", expand=True)
Gives:
     0     1
0  asb  .wav
1  xyz  .wav

how to split records with non-standard delimiters

In my CSV file I have the following records, each enclosed in brackets and separated by commas:
(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)
How do I split the data into a list so that I get something more like this:
a1,a2,a3
b1,b2,b3
c1,c2,c3
d1,d2,d3
Currently my python code looks like this:
with open('sample_dump.csv', 'r') as f:
    dump = f.read()
splitdump = dump.split('\n')
print(splitdump)
You could do something along the lines of:
Remove first and last brackets
Split by ),( character sequence
To split by a custom string, just add it as a parameter to the split method, e.g.:
line.split("),(")
It's a bit hacky, so you'll have to generalize based on any expected variations in your input data format (e.g. will your first/last chars always be brackets?).
Try this: split first by ")," then join and split again by "(" to leave the tuples without brackets.
_line = dump.strip().split("),")
# strip the final ")" before splitting on "("
_line = ''.join(_line).rstrip(")").split("(")
print(_line)
>> ['', 'a1,a2,a3', 'b1,b2,b3', 'c1,c2,c3', 'd1,d2,d3']
# drop the first, empty element
_line.pop(0)
print(_line)
>> ['a1,a2,a3', 'b1,b2,b3', 'c1,c2,c3', 'd1,d2,d3']
First you need to work out the steps required to get your result. Here's a hacky solution:
remove first and last brackets
use the ),( as the group separator, split
split each group by ,
line = '(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)'
[group.split(',') for group in line[1:-1].split('),(')]
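A regular expression is another way to pull out the groups without trimming the outer brackets first. This is a sketch of a different technique from the answers above, assuming no group contains a nested or unbalanced bracket:

```python
import re

line = '(a1,a2,a3),(b1,b2,b3),(c1,c2,c3),(d1,d2,d3)'

# Capture everything between each "(" and the next ")"
groups = re.findall(r'\(([^)]*)\)', line)

# One record per line, as in the question's desired output
for group in groups:
    print(group)

# Split each record into its fields if a list of lists is needed
rows = [group.split(',') for group in groups]
```

Because the pattern only looks at what sits inside brackets, it does not matter whether the input starts or ends with extra whitespace or a newline.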
