Splitting data - specific case

I'm trying to split some data that is in this form:
['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '20150407,34.5400,34.8900,34.5100,34.6300,14331200']
The first item in each string in the list is a date. I am trying to split the data at a chosen date while keeping each string whole. For example, if my chosen date was 2015-04-07, the above data would split like this:
['20150406,34.4800,34.8100,34.2300,34.4200,21480500']
['20150407,34.5400,34.8900,34.5100,34.6300,14331200']
This also has to work for lists containing many strings in the same form.

Use next() and enumerate() to find the position of the string with the desired date, then slice:
>>> d = '20150407'
>>> l = [
... '20150406,34.4800,34.8100,34.2300,34.4200,21480500',
... '20160402,34.1,32.8100,33.2300,31.01,22282510',
... '20150407,34.5400,34.8900,34.5100,34.6300,14331200',
... '20120101,2.540,14.8201,32.00,30.1311,12331230'
... ]
>>> index = next(i for i, item in enumerate(l) if item.startswith(d))
>>> l[:index]
['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '20160402,34.1,32.8100,33.2300,31.01,22282510']
>>> l[index:]
['20150407,34.5400,34.8900,34.5100,34.6300,14331200', '20120101,2.540,14.8201,32.00,30.1311,12331230']
A couple of notes:
next() raises a StopIteration exception if there is no match. You should either handle it with try/except or provide a default value, -1 for example (check for the default before slicing, because -1 is also a valid index):
next((i for i, item in enumerate(l) if item.startswith(d)), -1)
To check whether the date matches the desired one, we simply check whether an item starts with a specific date string. If the desired date comes as a date or datetime, you need to format it beforehand using strftime():
>>> from datetime import datetime
>>> d = datetime(2015, 4, 7)
>>> d = d.strftime("%Y%m%d")
>>> d
'20150407'
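The two notes above can be combined into one small helper. This is a minimal sketch rather than part of the original answer: the name split_at_date is hypothetical, and it assumes every row starts with the date in YYYYMMDD form followed by a comma. It uses None instead of -1 as the default so that a missing date cannot be mistaken for a valid slice index.

```python
from datetime import date


def split_at_date(rows, day):
    """Split rows into (before, from_day_onward) at the first row dated `day`."""
    # Include the comma so the prefix has to match the whole date field.
    prefix = day.strftime("%Y%m%d") + ","
    index = next((i for i, row in enumerate(rows) if row.startswith(prefix)), None)
    if index is None:
        # No row carries that date: treat everything as "before".
        return list(rows), []
    return rows[:index], rows[index:]


rows = [
    "20150406,34.4800,34.8100,34.2300,34.4200,21480500",
    "20150407,34.5400,34.8900,34.5100,34.6300,14331200",
]
before, after = split_at_date(rows, date(2015, 4, 7))
print(before)
print(after)
```

Returning a pair of lists in both cases keeps the calling code free of special handling for a date that is absent.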

I think you want a groupby, grouping strings that don't start with the date and ones that do so the date delimits the groups:
l = ['20150406,34.4800,34.8100,34.2300,34.4200,21480500',
     '2015010,34.5400,34.8900,34.5100,34.6300,14331200',
     '20150407,34.5400,34.8900,34.5100,34.6300,14331200']
dte = "2015-04-07"
delim = dte.replace("-","") + ","
from itertools import groupby
print([list(v) for k,v in groupby(l,key=lambda x: not x.startswith(delim))])
[['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '2015010,34.5400,34.8900,34.5100,34.6300,14331200'], ['20150407,34.5400,34.8900,34.5100,34.6300,14331200']]
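groupby starts a new group every time the key changes, so it keeps splitting the data as many times as there are strings that start with the date. A run of matching strings forms a group of its own, separate from the rows before and after it.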
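To see how this grouping behaves when the date occurs in the middle of the list or more than once, here is a small sketch with shortened, made-up rows (the values after the comma are placeholders, not real data):

```python
from itertools import groupby

# Placeholder rows: only the leading date matters for the grouping.
rows = ["20150406,1", "20150407,2", "20150408,3", "20150407,4"]
delim = "20150407,"

# The key flips between True and False at every boundary,
# and groupby opens a new group on each flip.
groups = [list(v) for _, v in groupby(rows, key=lambda r: not r.startswith(delim))]
print(groups)
```

Each row carrying the chosen date ends up in a group by itself here, so this variant suits cases where the date rows act as separators rather than as the first row of the second half.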

Extending alecxe's answer:
This code splits the original list into several sublists, starting a new sublist at each occurrence of the input date.
d = '20150407'
l = [
    '20150406,34.4800,34.8100,34.2300,34.4200,21480500',
    '20160402,34.1,32.8100,33.2300,31.01,22282510',
    '20150407,34.5400,34.8900,34.5100,34.6300,14331200',
    '20120101,2.540,14.8201,32.00,30.1311,12331230',
    '20150407,34.5400,34.8900,34.5100,34.6300,14331200',
]
# positions of every row that starts with the date
index = [i for i, item in enumerate(l) if item.startswith(d)]
# slice between consecutive positions; None means "to the end of the list"
[l[i:j] for i, j in zip([0]+index, index+[None])]
Output:
[['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '20160402,34.1,32.8100,33.2300,31.01,22282510'], ['20150407,34.5400,34.8900,34.5100,34.6300,14331200', '20120101,2.540,14.8201,32.00,30.1311,12331230'], ['20150407,34.5400,34.8900,34.5100,34.6300,14331200']]

Related

How to grab dates from a string of conjoined dates

This is the string I am dealing with: '5Nov20217Dec202110Jan2022'
The string could also be:
'5Nov2021 7Dec2021 10Jan2022'
I would like to obtain a list like:
['5Nov2021','7Dec2021','10Jan2022']
I am currently using regex but to no avail:
re.findall('^\d{1,2}[a-zA-Z]{3}\d{4}$','5Nov20217Dec202110Jan2022')
A regex solution is not a must.

Based on the variability of your input, I suggest combining re with string slicing in a while loop:
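import re
def extract_dates(d):
    while d:
        # Try a four-digit year first; accept it only if the date ends the
        # string or is followed by a digit (the start of the next date).
        if (k := re.findall(r'^\d{1,2}[a-zA-Z]{3}\d{4}', d)):
            if not (l := d[len(k[0]):]) or l[0].isdigit():
                yield k[0]
                d = l
                continue
        # Otherwise fall back to a two-digit year.
        if (k := re.findall(r'^\d{1,2}[a-zA-Z]{3}\d{2}', d)):
            yield k[0]
            d = d[len(k[0]):]
        else:
            # No date at this position: skip one character.
            d = d[1:]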
dates = ['5Nov20217Dec202110Jan2022', '5Nov217Dec2110Jan22', '5Nov21 7Dec21 10Jan22']
results = [list(extract_dates(i)) for i in dates]
Output:
[['5Nov2021', '7Dec2021', '10Jan2022'], ['5Nov21', '7Dec21', '10Jan22'], ['5Nov21', '7Dec21', '10Jan22']]
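If the years always have four digits, as in the two strings from the question, a simpler route may be enough. The question's pattern finds nothing because ^ and $ anchor it to the whole string; without the anchors, re.findall scans left to right and returns each non-overlapping match. The sketch below assumes four-digit years only (it does not handle the two-digit forms covered above), and the names DATE_RE and find_dates are hypothetical.

```python
import re

# Day (1-2 digits), three-letter month, four-digit year; no anchors.
DATE_RE = re.compile(r"\d{1,2}[a-zA-Z]{3}\d{4}")


def find_dates(text):
    """Return every day-month-year token found in text, in order."""
    return DATE_RE.findall(text)


joined = find_dates("5Nov20217Dec202110Jan2022")
spaced = find_dates("5Nov2021 7Dec2021 10Jan2022")
print(joined)
print(spaced)
```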

coupling str elements from a list to a tuple list

I have the following list:
lines
['line_North_Mid', 'line_South_Mid',
'line_North_South', 'line_Mid_South',
'line_South_North','line_Mid_North' ]
I would like to couple them in a tuple list as follows, with respect to their names:
tuple_list
[('line_Mid_North', 'line_North_Mid'),
('line_North_South', 'line_South_North'),
('line_Mid_South', 'line_South_Mid')]
I thought maybe I could do a string search over the elements of lines, but that won't be efficient. Is there a better way to arrange the elements of lines so that they look like tuple_list?
Pairing criteria:
Two elements are paired if they contain the same two area names ('North', 'Mid', 'South'), in either order.
E.g.: 'line_North_Mid' should be coupled with 'line_Mid_North'

Try this:
from itertools import combinations
tuple_list = [i for i in combinations(lines,2) if i[0].split('_')[1] == i[1].split('_')[2] and i[0].split('_')[2] == i[1].split('_')[1]]
or I think this is better:
[i for i in combinations(lines,2) if i[0].split('_')[1:] == i[1].split('_')[1:][::-1]]

An order-agnostic O(n) solution is possible using collections.defaultdict. The idea is to use as our dictionary keys the last 2 components of your strings delimited by '_', appending values from your input list. Then extract values and convert to a list of tuples.
from collections import defaultdict
L = ['line_North_Mid', 'line_South_Mid',
'line_North_South', 'line_Mid_South',
'line_South_North', 'line_Mid_North']
dd = defaultdict(list)
for item in L:
    dd[frozenset(item.rsplit('_', maxsplit=2)[1:])].append(item)
res = list(map(tuple, dd.values()))
# [('line_North_Mid', 'line_Mid_North'),
# ('line_South_Mid', 'line_Mid_South'),
# ('line_North_South', 'line_South_North')]
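The reason a frozenset works as the dictionary key can be shown in isolation: it is hashable and ignores the order of its elements. This is a small sketch; the helper name pair_key is hypothetical and it assumes every name has the form prefix_Area_Area.

```python
def pair_key(name):
    # Hypothetical helper: the last two '_'-separated parts, order ignored.
    return frozenset(name.rsplit("_", maxsplit=2)[1:])


same = pair_key("line_North_Mid") == pair_key("line_Mid_North")
different = pair_key("line_North_Mid") == pair_key("line_North_South")
print(same, different)
```

One consequence of using a set: a name whose two areas are identical, such as a hypothetical 'line_North_North', collapses to a single-element key.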

You can use the following list comprehension:
lines = ['line_Mid_North', 'line_North_Mid',
'line_North_South', 'line_South_North',
'line_Mid_South', 'line_South_Mid']
[(j,i) for i in lines for j in lines if j not in i
if set(j.split('_')[1:]) < set(i.split('_'))][::2]
[('line_Mid_North', 'line_North_Mid'),
('line_North_South', 'line_South_North'),
('line_Mid_South', 'line_South_Mid')]

I suggest you have a function that returns the same key for strings that are supposed to be together (a grouping key).
def key(s):
    # ignore first part and sort other 2 parts, so they will always be in same order
    _, part_1, part_2 = s.split('_')
    return tuple(sorted([part_1, part_2]))
Then you have to use some grouping method; I used defaultdict, for example:
import collections
lines = [
'line_North_Mid', 'line_South_Mid',
'line_North_South', 'line_Mid_South',
'line_South_North','line_Mid_North',
]
dd = collections.defaultdict(list)
for s in lines:
    dd[key(s)].append(s)  # those with same key get grouped
print(list(tuple(v) for v in dd.values()))
# [
# ('line_North_Mid', 'line_Mid_North'),
# ('line_South_Mid', 'line_Mid_South'),
# ('line_North_South', 'line_South_North'),
# ]

how to separate cam1,2,3,4,5,6 first images from the list

lst = ['Cam218-10-03_16-05-21-54.jpg',
'Cam318-10-03_17-04-21-54.jpg',
'Cam418-10-03_16-04-21-54.jpg',
'Cam218-10-02_16-05-21-54.jpg',
'Cam318-10-02_17-04-21-54.jpg',
'Cam418-10-02_16-04-21-54.jpg',
'Cam218-10-02_16-04-08-31.jpg',
'Cam318-10-02_16-04-08-30.jpg',
'Cam418-10-02_16-04-08-30.jpg',
'Cam518-10-02_16-04-08-35.jpg',
'Cam618-10-02_16-04-08-36.jpg',
'Cam118-10-02_16-04-09-33.jpg',
'Cam218-10-02_16-04-09-33.jpg',
'Cam318-10-02_16-04-09-33.jpg',
'Cam418-10-02_16-04-09-33.jpg',
'Cam518-10-02_16-04-09-33.jpg',
'Cam618-10-02_16-04-09-33.jpg',
'Cam118-10-02_16-04-11-53.jpg',
'Cam218-10-02_16-04-11-53.jpg',
'Cam318-10-02_16-04-11-53.jpg',
'Cam418-10-02_16-04-08-30.jpg',
'Cam118-10-02_16-04-08-31.jpg',
'Cam518-10-02_16-04-11-53.jpg',
'Cam118-10-02_16-04-11-53.jpg']
From this list I want the output:
['Cam118-10-02_16-04-08-31.jpg',
'Cam218-10-02_16-04-08-31.jpg',
'Cam318-10-02_16-04-08-30.jpg',
'Cam418-10-02_16-04-08-30.jpg',
'Cam518-10-02_16-04-08-35.jpg',
'Cam618-10-02_16-04-08-36.jpg']
by using Python. Could anybody help me?

With itertools.groupby - O(n*log(n))
>>> from itertools import groupby
>>> [next(g) for _, g in groupby(sorted(lst), key=lambda cam: cam.partition('-')[0])]
['Cam118-10-02_16-04-08-31.jpg',
'Cam218-10-02_16-04-08-31.jpg',
'Cam318-10-02_16-04-08-30.jpg',
'Cam418-10-02_16-04-08-30.jpg',
'Cam518-10-02_16-04-08-35.jpg',
'Cam618-10-02_16-04-08-36.jpg']
With keeping track of duplicates manually (output not sorted, but potentially useful to other readers) - O(n)
>>> seen = set()
>>> result = []
>>>
>>> for cam in lst:
...     model, *_ = cam.partition('-')
...     if model not in seen:
...         result.append(cam)
...         seen.add(model)
...
>>> result
['Cam218-10-03_16-05-21-54.jpg',
'Cam318-10-03_17-04-21-54.jpg',
'Cam418-10-03_16-04-21-54.jpg',
'Cam518-10-02_16-04-08-35.jpg',
'Cam618-10-02_16-04-08-36.jpg',
'Cam118-10-02_16-04-09-33.jpg']
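Both approaches above key on everything before the first '-', which includes the two-digit year (for example 'Cam218'). The sketch below keys on the camera tag alone. It assumes the tag is always 'Cam' followed by a single digit, so the first four characters identify the camera; the function name first_per_camera and the shortened sample list are illustrative only.

```python
def first_per_camera(names):
    """Return the alphabetically first file name for each camera tag."""
    first = {}
    for name in sorted(names):
        # Assumption: 'Cam' plus one digit, i.e. the first four characters.
        # setdefault keeps the earliest name seen for each tag.
        first.setdefault(name[:4], name)
    return [first[cam] for cam in sorted(first)]


sample = [
    "Cam218-10-03_16-05-21-54.jpg",
    "Cam118-10-02_16-04-09-33.jpg",
    "Cam218-10-02_16-04-08-31.jpg",
    "Cam118-10-02_16-04-08-31.jpg",
]
result = first_per_camera(sample)
print(result)
```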

You can use an if condition to check for the first occurrence of each camera tag after sorting the list:
lst.sort()
i = 1
for item in lst:
    # item[3] is the camera digit in 'CamN...'
    if item[3] == str(i):
        i = i + 1
        print(item)
The result is
Cam118-10-02_16-04-08-31.jpg
Cam218-10-02_16-04-08-31.jpg
Cam318-10-02_16-04-08-30.jpg
Cam418-10-02_16-04-08-30.jpg
Cam518-10-02_16-04-08-35.jpg
Cam618-10-02_16-04-08-36.jpg
If you want the first occurrence of each camera in the original list order rather than in sorted order, remove the sort() call. The loop still looks for the cameras in the order 1, 2, 3, and so on.

Zip Lists together based on many to one relationship

I have two lists and I would like to find a way to link them together (I'm not sure of the exact term for this) by zipping them.
In list one I have a series of tif files:
list1=['LT50300281984137PAC00_sr_band1.tif',
'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif','LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif']
In list two I have two files:
list2=['LT50300281984137PAC00_mask.tif','LT50300281994260XXX03_mask.tif']
I want to zip the files in list one which start with LT50300281984137PAC00 to the file in list two which starts the same way, and likewise for the files which start with LT50300281994260XXX03.
The code I have tried is:
ziplist=zip(sorted(list1),sorted(list2))
but this returns:
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281984137PAC00_sr_band2.tif', 'LT50300281994260XXX03_mask.tif')]
I would like this to be returned:
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif', 'LT50300281994260XXX03_mask.tif')]

You can use itertools.groupby:
from itertools import groupby
list1 = [
'LT50300281984137PAC00_sr_band1.tif',
'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif',
'LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif'
]
list2 = [
'LT50300281984137PAC00_mask.tif',
'LT50300281994260XXX03_mask.tif'
]
def extract_key(s):
    return s[:s.index('_')]
l = sorted(list1 + list2, key=extract_key)
l = [tuple(items) for s, items in groupby(l, key=extract_key)]
Result:
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif', 'LT50300281994260XXX03_mask.tif')]
The idea is to sort the union of the two lists by the first part of each filename (extract_key). Then use groupby to create groups of the same first part.
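The sort matters because groupby only merges adjacent items with equal keys. A short sketch with made-up names shows the difference; the names and the helper prefix are illustrative and not taken from the question.

```python
from itertools import groupby


def prefix(name):
    # Everything before the first underscore.
    return name.split("_")[0]


names = ["A_1", "B_1", "A_2"]

# Without sorting, the two 'A' items are not adjacent and form separate groups.
fragmented = [(k, list(g)) for k, g in groupby(names, key=prefix)]

# sorted() is stable, so items with the same key keep their original order.
grouped = [(k, list(g)) for k, g in groupby(sorted(names, key=prefix), key=prefix)]

print(fragmented)
print(grouped)
```

The stability of sorted() is also why, in the answer above, the band files from list1 stay ahead of the mask file from list2 within each group.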

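You can use a list comprehension and the built-in function filter. In Python 3, filter returns an iterator, so it is wrapped in list() before concatenating:
In [24]: [tuple(list(filter(lambda x: x.startswith(e.split('_')[0]), list1)) + [e]) for e in list2]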
Out[24]:
[('LT50300281984137PAC00_sr_band1.tif',
'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif',
'LT50300281984137PAC00_mask.tif'),
('LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif',
'LT50300281994260XXX03_mask.tif')]

Can also be done using regex.
import re
list1=['LT50300281984137PAC00_sr_band1.tif'
,'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif','LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif']
list2=['LT50300281984137PAC00_mask.tif','LT50300281994260XXX03_mask.tif']
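# collect the list1 names containing the first scene id, then append its mask
match = re.findall(r'(\b\w+(?:PAC00)\w+\.\w+\b)'," ".join(list1))
tuple1 = tuple(match+[list2[0]])
# same for the second scene id
match = re.findall(r'(\b\w+(?:0XXX0)\w+\.\w+\b)'," ".join(list1))
tuple2 = tuple(match+[list2[1]])
print([tuple1,tuple2])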
Output
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif', 'LT50300281994260XXX03_mask.tif')]

A dictionary will work better here, you can then later repurpose it for what you need:
results = {}
# one empty bucket per mask file, keyed by the shared prefix
for f in list2:
    common = f.split('_')[0]
    results[common] = []
for f in list1:
    common = f.split('_')[0]
    try:
        results[common].append(f)
    except KeyError:
        print('{} not a valid grouper'.format(common))
# To convert into a list of tuples
as_list = [(k,)+tuple(v) for k,v in results.items()]
print(as_list)
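The dictionary above produces tuples that start with the shared prefix rather than ending with the mask file name. A variant that returns the shape requested in the question might look like the sketch below; the function name group_with_masks and the shortened file names are hypothetical, and it assumes the text before the first underscore identifies the scene.

```python
from collections import defaultdict


def group_with_masks(bands, masks):
    """Return one tuple per mask: its matching band files, then the mask."""
    by_scene = defaultdict(list)
    for name in bands:
        by_scene[name.split("_")[0]].append(name)
    # .get avoids creating empty entries for masks without any band files.
    return [tuple(by_scene.get(m.split("_")[0], []) + [m]) for m in masks]


bands = ["S1_band1.tif", "S1_band2.tif", "S2_band1.tif"]
masks = ["S1_mask.tif", "S2_mask.tif"]
pairs = group_with_masks(bands, masks)
print(pairs)
```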

I would use itertools.chain and itertools.groupby, with a lambda expression that takes only the part up to the first _ for the grouping. Example:
>>> from itertools import chain,groupby
>>> list1=['LT50300281984137PAC00_sr_band1.tif','LT50300281984137PAC00_sr_band2.tif','LT50300281984137PAC00_sr_band3.tif','LT50300281994260XXX03_sr_band1.tif','LT50300281994260XXX03_sr_band2.tif','LT50300281994260XXX03_sr_band3.tif']
>>> list2=['LT50300281984137PAC00_mask.tif','LT50300281994260XXX03_mask.tif']
>>>
>>> chained_sorted = sorted(chain(list1,list2))
>>> ret = []
>>> for i, x in groupby(chained_sorted,lambda x: x.split('_')[0]):
...     ret.append(tuple(x))
...
>>> ret
[('LT50300281984137PAC00_mask.tif', 'LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif'), ('LT50300281994260XXX03_mask.tif', 'LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif')]

My first answer on StackOverflow, so please be patient. But I didn't see a need for zip()
mask1, mask2 = list2[0], list2[1]
for b in reversed(list1):
    if b[0:20] in mask1:
        mask1 = b + " " + mask1
    else:
        mask2 = b + " " + mask2
ziplist = [tuple(mask1.split()), tuple(mask2.split())]
I think ziplist should now be what you were asking for.

sorting lists of list to get unique ids for last column

I have this data saved in a file:
['5',60680,60854,'gene_id "ENS1"']
['5',59106,89211,'gene_id "ENS1"']
['5',58686,58765,'gene_id "ENS1"']
['5',80835,93381,'gene_id "ENS2"']
['5',55555,92223,'gene_id "ENS2"']
['5',73902,74276,'gene_id "ENS2"']
I need help with Python to get an output in which each item in the 4th column appears only once, with the minimum of the second-column values and the maximum of the third-column values for that item. So I want my output to look like this:
['5',58686,89211,'gene_id "ENS1"']
['5',55555,93381,'gene_id "ENS2"']
Each item in the 4th column should only appear once. How can I also get rid of the [] around the data? Thank you.

>>> from itertools import groupby
>>> for i, j in groupby(lst, key=lambda x: x[3]):
...     t = list(zip(*j))
...     print(t[0][0], min(t[1]), max(t[2]), t[3][0])
...
5 58686 89211 gene_id "ENS1"
5 55555 93381 gene_id "ENS2"
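It's not clear what you mean by getting rid of []; the brackets are just the syntax for Python lists. Note that groupby only groups adjacent rows, so lst (the list of rows) must already be ordered by gene id, as it is in the question.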
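If the rows are not guaranteed to arrive grouped by gene id, a plain dictionary avoids the ordering requirement. This is a minimal sketch under the assumption that the rows have already been parsed into Python lists as shown in the question; the function name collapse is hypothetical.

```python
def collapse(rows):
    """Merge rows per gene id, keeping the smallest start and largest end."""
    merged = {}
    for chrom, start, end, gene in rows:
        if gene in merged:
            c, lo, hi = merged[gene]
            merged[gene] = (c, min(lo, start), max(hi, end))
        else:
            merged[gene] = (chrom, start, end)
    # Dictionaries keep insertion order, so genes appear in first-seen order.
    return [(c, lo, hi, gene) for gene, (c, lo, hi) in merged.items()]


rows = [
    ['5', 60680, 60854, 'gene_id "ENS1"'],
    ['5', 80835, 93381, 'gene_id "ENS2"'],
    ['5', 59106, 89211, 'gene_id "ENS1"'],
    ['5', 58686, 58765, 'gene_id "ENS1"'],
    ['5', 55555, 92223, 'gene_id "ENS2"'],
    ['5', 73902, 74276, 'gene_id "ENS2"'],
]
result = collapse(rows)
for row in result:
    # Unpacking into print() writes the fields without list brackets.
    print(*row)
```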

import re
# capture start, end and gene id from each bracketed row
pat = re.compile(r"\['[^']+',([^,]+),([^,]+),'([^']+)'\]")
ch = '''
['5',60680,60854,'gene_id "ENS1"']
['5',59106,89211,'gene_id "ENS1"']
['5',58686,58765,'gene_id "ENS1"']
['5',80835,93381,'gene_id "ENS2"']
['5',55555,92223,'gene_id "ENS2"']
['5',73902,74276,'gene_id "ENS2"']'''
li = pat.findall(ch)
print(li)
deekmin = {}
deekmax = {}
for a, b, c in li:
    # the regex captures strings; compare the coordinates as integers
    a, b = int(a), int(b)
    if c in deekmin:
        if a < deekmin[c]:
            deekmin[c] = a
        if b > deekmax[c]:
            deekmax[c] = b
    else:
        deekmin[c] = a
        deekmax[c] = b
res = [(deekmin[c], deekmax[c], c) for c in deekmin]
print(res)
