How to grab dates from a string of conjoined dates - python

This is the string I am dealing with: '5Nov20217Dec202110Jan2022'
The string could also be:
'5Nov2021 7Dec2021 10Jan2022'
I would like to obtain a list like:
['5Nov2021','7Dec2021','10Jan2022']
I am currently using regex but to no avail:
re.findall('^\d{1,2}[a-zA-Z]{3}\d{4}$','5Nov20217Dec202110Jan2022')
A regex solution is not a must.

Based on the variability of your input, I suggest combining re with string slicing in a while loop:
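import re

def extract_dates(d):
    while d:
        # Try a four-digit year first.
        if (k:=re.findall(r'^\d{1,2}[a-zA-Z]{3}\d{4}', d)):
            # Accept it only if the remainder is empty or does not start with a
            # letter; otherwise the last two digits belong to the next date's day.
            if not (l:=d[len(k[0]):]) or not l[0].isalpha():
                yield k[0]
                d = l
                continue
        # Fall back to a two-digit year.
        if (k:=re.findall(r'^\d{1,2}[a-zA-Z]{3}\d{2}', d)):
            yield k[0]
            d = d[len(k[0]):]
        else:
            # No date starts here (e.g. a space): skip one character.
            d = d[1:]
dates = ['5Nov20217Dec202110Jan2022', '5Nov217Dec2110Jan22', '5Nov21 7Dec21 10Jan22']
results = [list(extract_dates(i)) for i in dates]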
Output:
[['5Nov2021', '7Dec2021', '10Jan2022'], ['5Nov21', '7Dec21', '10Jan22'], ['5Nov21', '7Dec21', '10Jan22']]
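If every year has four digits, as in the question, the original attempt seems to fail only because `^` and `$` force the pattern to match the whole string. A minimal sketch without the anchors follows; the names `DATE_RE` and `find_dates` are made up for illustration, and the `strptime` step assumes English month abbreviations:

```python
import re
from datetime import datetime

# Day (1-2 digits), three-letter month, four-digit year; no ^ or $ anchors.
DATE_RE = re.compile(r'\d{1,2}[a-zA-Z]{3}\d{4}')

def find_dates(s):
    """Return every day/month/four-digit-year token found in s."""
    return DATE_RE.findall(s)

joined = find_dates('5Nov20217Dec202110Jan2022')
spaced = find_dates('5Nov2021 7Dec2021 10Jan2022')

# Optionally convert the pieces to date objects.
parsed = [datetime.strptime(d, '%d%b%Y').date() for d in joined]

print(joined)
print(spaced)
print(parsed)
```

This does not cover the two-digit-year inputs handled by the generator above, where the split point is ambiguous without look-ahead.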

Related

What is an easy way to remove duplicates from only part of the string in Python?

I have a list of strings that goes like this:
1;213;164
2;213;164
3;213;164
4;213;164
5;213;164
6;213;164
7;213;164
8;213;164
9;145;112
10;145;112
11;145;112
12;145;112
13;145;112
14;145;112
15;145;112
16;145;112
17;145;112
1001;1;151
1002;2;81
1003;3;171
1004;4;31
I would like to remove all duplicates where the last two numbers are the same. So after running it through the program I would get something like this:
1;213;164
9;145;112
1001;1;151
1002;2;81
1003;3;171
1004;4;31
But something like
8;213;164
15;145;112
1001;1;151
1002;2;81
1003;3;171
1004;4;31
would also be correct.
Here is a nice and fast trick you can use (assuming l is your list):
list({ s.split(';', 1)[1] : s for s in l }.values())
No need to import anything, and fast as can be.
In general you can define:
def custom_unique(L, keyfunc):
    return list({ keyfunc(li): li for li in L }.values())
You can group the items by this key and then use the first item in each group (assuming l is your list).
import itertools
keyfunc = lambda x: x.split(";", 1)[1]
[next(g) for k, g in itertools.groupby(sorted(l, key=keyfunc), keyfunc)]
Here is the code run on a few of the first items; just swap my list for yours:
x = [
    '7;213;164',
    '8;213;164',
    '9;145;112',
    '10;145;112',
    '11;145;112',
]
new_list = []
for i in x:
    check = True
    s_part = i[i.find(';'):]  # everything from the first ';' onward
    for j in new_list:
        if j.endswith(s_part):  # compare the whole tail, not just a substring
            check = False
    if check:
        new_list.append(i)
print(new_list)
Output:
['7;213;164', '9;145;112']
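Note that the dictionary trick keeps the last string seen for each key, and the sorted `groupby` version reorders the list. If the first occurrence and the original order both matter, a sketch along these lines may help (the function name `unique_by_tail` is an assumption, not taken from the answers):

```python
def unique_by_tail(lines, sep=';'):
    """Keep the first line for each distinct 'everything after the first sep'."""
    seen = set()
    kept = []
    for line in lines:
        key = line.split(sep, 1)[1]  # e.g. '213;164'
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

rows = ['1;213;164', '2;213;164', '9;145;112', '10;145;112', '1001;1;151']
deduped = unique_by_tail(rows)
print(deduped)  # ['1;213;164', '9;145;112', '1001;1;151']
```

The set lookup keeps this linear in the length of the list, unlike the nested loop above.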

python edit tuple duplicates in a list

My goal is:
While looping over a list, I would like to check for duplicates and, if there are any, append a number to them. See the following example.
My list, as an example:
[('name','company'), ('someguy','microsoft'), ('anotherguy','microsoft'), ('thirdguy','amazon')]
In a loop I would like to edit those duplicates, so instead of the second microsoft I would like to have microsoft1 (if there were three microsoft guys, the third would have microsoft2).
With this I can find the duplicates, but I don't know how to edit them directly in the list:
list = [('name','company'), ('someguy','microsoft'), ('anotherguy','microsoft'), ('thirdguy','amazon')]
names = []
double = []
for u in list[1:]:
    names.append(u[1])
list_size = len(names)
for i in range(list_size):
    k = i + 1
    for j in range(k, list_size):
        if names[i] == names[j] and names[i] not in double:
            double.append(names[i])
This is one approach using collections.defaultdict.
Ex:
from collections import defaultdict
lst = [('name','company'), ('someguy','microsoft'), ('anotherguy','microsoft'), ('thirdguy','amazon')]
seen = defaultdict(int)
result = []
for k, v in lst:
    if seen[v]:
        result.append((k, "{}_{}".format(v, seen[v])))
    else:
        result.append((k,v))
    seen[v] += 1
print(result)
Output:
[('name', 'company'),
('someguy', 'microsoft'),
('anotherguy', 'microsoft_1'),
('thirdguy', 'amazon')]
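The question asked for `microsoft1` rather than `microsoft_1`. Assuming that exact format is wanted, the same counting idea can be sketched as a small function (the name `number_duplicates` and the extra `fourthguy` row are hypothetical, added only to show a third occurrence):

```python
from collections import defaultdict

def number_duplicates(pairs):
    """Append 1, 2, ... to the second, third, ... occurrence of each company."""
    seen = defaultdict(int)
    result = []
    for name, company in pairs:
        count = seen[company]
        # The first occurrence stays unchanged; later ones get the running count.
        result.append((name, company if count == 0 else f"{company}{count}"))
        seen[company] += 1
    return result

rows = [('someguy', 'microsoft'), ('anotherguy', 'microsoft'),
        ('thirdguy', 'amazon'), ('fourthguy', 'microsoft')]
numbered = number_duplicates(rows)
print(numbered)
```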

How to replace text between parentheses in Python?

I have a dictionary containing the following key-value pairs: d={'Alice':'x','Bob':'y','Chloe':'z'}
I want to replace the lower case variables(values) by the constants(keys) in any given string.
For example, if my string is:
A(x)B(y)C(x,z)
how do I replace the characters in order to get a resultant string of :
A(Alice)B(Bob)C(Alice,Chloe)
Should I use regular expressions?
re.sub() solution with replacement function:
import re
d = {'Alice':'x','Bob':'y','Chloe':'z'}
flipped = dict(zip(d.values(), d.keys()))
s = 'A(x)B(y)C(x,z)'
result = re.sub(r'\([^()]+\)', lambda m: '({})'.format(','.join(flipped.get(k,'')
    for k in m.group().strip('()').split(','))), s)
print(result)
The output:
A(Alice)B(Bob)C(Alice,Chloe)
Extended version:
import re
def repl(m):
    val = m.group().strip('()')
    d = {'Alice':'x','Bob':'y','Chloe':'z'}
    flipped = dict(zip(d.values(), d.keys()))
    if ',' in val:
        return '({})'.format(','.join(flipped.get(k,'') for k in val.split(',')))
    else:
        return '({})'.format(flipped.get(val,''))
s = 'A(x)B(y)C(x,z)'
result = re.sub(r'\([^()]+\)', repl, s)
print(result)
Bonus approach for particular input case A(x)B(y)C(Alice,z):
...
s = 'A(x)B(y)C(Alice,z)'
result = re.sub(r'\([^()]+\)', lambda m: '({})'.format(','.join(flipped.get(k,'') or k
    for k in m.group().strip('()').split(','))), s)
print(result)
I assume you want to replace the values in a string with the respective keys of the dictionary. If my assumption is correct, you can try this without using regex.
First, swap the keys and values using a dictionary comprehension.
my_dict = {'Alice':'x','Bob':'y','Chloe':'z'}
my_dict = { y:x for x,y in my_dict.items()}
Then, using a list comprehension, replace the values:
str_ = 'A(x)B(y)C(x,z)'
output = ''.join([i if i not in my_dict.keys() else my_dict[i] for i in str_])
Hope this is what you need ;)
Code
import re
d={'Alice':'x','Bob':'y','Chloe':'z'}
keys = list(d.keys())
values = list(d.values())
s = "A(x)B(y)C(x,z)"
for i in range(len(keys)):
    rx = re.escape(values[i])
    s = re.sub(rx, keys[i], s)
print(s)
Output
A(Alice)B(Bob)C(Alice,Chloe)
You could also use the replace method in Python like this:
d={'x':'Alice','y':'Bob','z':'Chloe'}
s = "A(x)B(y)C(x,z)"
for key in d:
    s = s.replace(key, d[key])
print(s)
But yes, you should swap your dictionary keys and values as Kishore suggested.
This is the way that I would do it:
import re
def sub_args(text, tosub):
    ops = '|'.join(tosub.keys())
    for argstr, _ in re.findall(r'(\(([%s]+?,?)+\))' % ops, text):
        args = argstr[1:-1].split(',')
        args = [tosub[a] for a in args]
        subbed = '(%s)' % ','.join(map(str, args))
        text = re.sub(re.escape(argstr), subbed, text)
    return text
text = 'A(x)B(y)C(x,z)'
tosub = {
'x': 'Alice',
'y': 'Bob',
'z': 'Chloe'
}
print(sub_args(text, tosub))
Basically you just use the regex pattern to find all of the argument groups and substitute in the proper values--the nice thing about this approach is that you don't have to worry about subbing where you don't want to (for example, if you had a string like 'Fn(F,n)'). You can also have multi-character keys, like 'F(arg1,arg2)'.
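One way to combine these ideas is a single `re.sub` pass that only touches text inside parentheses and leaves unknown arguments unchanged. This is a sketch rather than any answerer's code; the name `replace_args` and the second test string are illustrative assumptions:

```python
import re

def replace_args(text, mapping):
    """Replace each comma-separated argument inside (...) via mapping, if present."""
    def repl(match):
        args = match.group(1).split(',')
        # Unknown arguments are kept as they are.
        return '({})'.format(','.join(mapping.get(a.strip(), a.strip()) for a in args))
    return re.sub(r'\(([^()]*)\)', repl, text)

d = {'Alice': 'x', 'Bob': 'y', 'Chloe': 'z'}
flipped = {v: k for k, v in d.items()}  # {'x': 'Alice', ...}

first = replace_args('A(x)B(y)C(x,z)', flipped)
second = replace_args('x(x)Fn(F,n)', flipped)
print(first)   # A(Alice)B(Bob)C(Alice,Chloe)
print(second)  # x(Alice)Fn(F,n)
```

Because whole arguments are looked up in the dictionary, multi-character keys work and text outside the parentheses is never rewritten.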

Splitting data - specific case

I'm trying to split some data, the data is in this form...
['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '20150407,34.5400,34.8900,34.5100,34.6300,14331200']
The first item in each string in the list is a date. I am trying to split the data at a chosen date while keeping each string whole. For example, if my chosen date was 2015-04-07, the above data would split like this...
['20150406,34.4800,34.8100,34.2300,34.4200,21480500']
['20150407,34.5400,34.8900,34.5100,34.6300,14331200']
This also has to work for lists with lots of strings in the same form as this...
Use next() and enumerate() to find the position of the string with the desired date, then slice:
>>> d = '20150407'
>>> l = [
... '20150406,34.4800,34.8100,34.2300,34.4200,21480500',
... '20160402,34.1,32.8100,33.2300,31.01,22282510',
... '20150407,34.5400,34.8900,34.5100,34.6300,14331200',
... '20120101,2.540,14.8201,32.00,30.1311,12331230'
... ]
>>> index = next(i for i, item in enumerate(l) if item.startswith(d))
>>> l[:index]
['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '20160402,34.1,32.8100,33.2300,31.01,22282510']
>>> l[index:]
['20150407,34.5400,34.8900,34.5100,34.6300,14331200', '20120101,2.540,14.8201,32.00,30.1311,12331230']
A couple of notes:
next() throws a StopIteration exception if there is no match. You should either handle it with try/except or provide a default value, -1 for example:
next((i for i, item in enumerate(l) if item.startswith(d)), -1)
To check whether the date matches the desired one, we simply check whether an item starts with a specific date string. If the desired date comes as a date or datetime, you would need to format it beforehand using strftime():
>>> from datetime import datetime
>>> d = datetime(2015, 4, 7)
>>> d = d.strftime("%Y%m%d")
>>> d
'20150407'
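Putting the two notes together, a small helper might look like the sketch below. The name `split_at_date`, the trailing comma in the prefix, and the choice to return the whole list plus an empty list when the date is missing are all assumptions made for illustration:

```python
from datetime import date

def split_at_date(rows, when):
    """Split rows at the first row whose date field equals `when`."""
    prefix = when.strftime("%Y%m%d") + ","  # match the complete first field
    index = next((i for i, row in enumerate(rows) if row.startswith(prefix)), None)
    if index is None:
        # No row carries that date: nothing to split off.
        return rows[:], []
    return rows[:index], rows[index:]

rows = [
    '20150406,34.4800,34.8100,34.2300,34.4200,21480500',
    '20150407,34.5400,34.8900,34.5100,34.6300,14331200',
]
before, after = split_at_date(rows, date(2015, 4, 7))
missing = split_at_date(rows, date(2020, 1, 1))
print(before)
print(after)
```

Using `None` as the default avoids the pitfall of `-1`, which is itself a valid index when slicing.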
I think you want a groupby, grouping strings that don't start with the date and ones that do so the date delimits the groups:
l = ['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '2015010,34.5400,34.8900,34.5100,34.6300,14331200'
, '20150407,34.5400,34.8900,34.5100,34.6300,14331200']
dte = "2015-04-07"
delim = dte.replace("-","") + ","
from itertools import groupby
print([list(v) for k,v in groupby(l,key=lambda x: not x.startswith(delim))])
[['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '2015010,34.5400,34.8900,34.5100,34.6300,14331200'], ['20150407,34.5400,34.8900,34.5100,34.6300,14331200']]
The groupby will keep splitting the data as many times as there are strings that start with the date.
Extending alecxe's answer:
This code splits the original list into several sublists, starting a new one at every string that begins with the input date.
d = '20150407'
l = [
    '20150406,34.4800,34.8100,34.2300,34.4200,21480500',
    '20160402,34.1,32.8100,33.2300,31.01,22282510',
    '20150407,34.5400,34.8900,34.5100,34.6300,14331200',
    '20120101,2.540,14.8201,32.00,30.1311,12331230',
    '20150407,34.5400,34.8900,34.5100,34.6300,14331200',
]
index = [i for i, item in enumerate(l) if item.startswith(d)]
[l[i:j] for i, j in zip([0]+index, index+[None])]
output:
[['20150406,34.4800,34.8100,34.2300,34.4200,21480500', '20160402,34.1,32.8100,33.2300,31.01,22282510'], ['20150407,34.5400,34.8900,34.5100,34.6300,14331200', '20120101,2.540,14.8201,32.00,30.1311,12331230'], ['20150407,34.5400,34.8900,34.5100,34.6300,14331200']]

Zip Lists together based on many to one relationship

I have two lists and I would like to find a way to link them together (I'm not sure of the exact term for this) by zipping them.
In list one I have a series of tif files:
list1=['LT50300281984137PAC00_sr_band1.tif',
'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif','LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif']
In list two I have two files:
list2=['LT50300281984137PAC00_mask.tif','LT50300281994260XXX03_mask.tif']
I want to zip the files in list one which start with LT50300281984137PAC00 to the file in list 2 which starts the same way, and the same for the files which start with LT50300281994260XXX03
The code I have tried is:
ziplist=zip(sorted(list1),sorted(list2))
but this returns:
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281984137PAC00_sr_band2.tif', 'LT50300281994260XXX03_mask.tif')]
I would like this to be returned:
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif', 'LT50300281994260XXX03_mask.tif')]
You can use itertools.groupby:
from itertools import groupby
list1 = [
'LT50300281984137PAC00_sr_band1.tif',
'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif',
'LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif'
]
list2 = [
'LT50300281984137PAC00_mask.tif',
'LT50300281994260XXX03_mask.tif'
]
def extract_key(s):
    return s[:s.index('_')]
l = sorted(list1 + list2, key=extract_key)
l = [tuple(items) for s, items in groupby(l, key=extract_key)]
Result:
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif', 'LT50300281994260XXX03_mask.tif')]
The idea is to sort the union of the two lists by the first part of each filename (extract_key). Then use groupby to create groups of the same first part.
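You can use a list comprehension and the built-in function filter (wrapped in list() so that it also works on Python 3, where filter returns an iterator):
In [24]: [tuple(list(filter(lambda x: x.startswith(e.split('_')[0]), list1))+[e]) for e in list2]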
Out[24]:
[('LT50300281984137PAC00_sr_band1.tif',
'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif',
'LT50300281984137PAC00_mask.tif'),
('LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif',
'LT50300281994260XXX03_mask.tif')]
Can also be done using regex.
import re
list1=['LT50300281984137PAC00_sr_band1.tif'
,'LT50300281984137PAC00_sr_band2.tif',
'LT50300281984137PAC00_sr_band3.tif','LT50300281994260XXX03_sr_band1.tif',
'LT50300281994260XXX03_sr_band2.tif',
'LT50300281994260XXX03_sr_band3.tif']
list2=['LT50300281984137PAC00_mask.tif','LT50300281994260XXX03_mask.tif']
match = re.findall(r'(\b\w+(?:PAC00)\w+.\w+\b)'," ".join(list1))
tuple1 = tuple(match+[list2[0]])
match = re.findall(r'(\b\w+(?:0XXX0)\w+.\w+\b)'," ".join(list1))
tuple2 = tuple(match+[list2[1]])
print([tuple1,tuple2])
Output
[('LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif', 'LT50300281984137PAC00_mask.tif'), ('LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif', 'LT50300281994260XXX03_mask.tif')]
A dictionary will work better here, you can then later repurpose it for what you need:
results = {}
for f in list2:
    common = f.split('_')[0]
    results[common] = []
for f in list1:
    common = f.split('_')[0]
    try:
        results[common].append(f)
    except KeyError:
        print('{} not a valid grouper'.format(common))
# To convert into a list of tuples
as_list = [(k,)+tuple(v) for k,v in results.items()]
print(as_list)
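If the mask file itself should come last in each tuple, as in the desired output, the dictionary idea can be sketched like this. It assumes every filename carries its scene ID before the first underscore; `group_with_masks` is a made-up name:

```python
def group_with_masks(bands, masks):
    """Return one tuple per mask: its matching band files, then the mask."""
    # One empty bucket per scene ID found in the mask names.
    groups = {m.split('_', 1)[0]: [] for m in masks}
    for b in bands:
        key = b.split('_', 1)[0]
        if key in groups:  # bands without a matching mask are skipped
            groups[key].append(b)
    return [tuple(groups[m.split('_', 1)[0]]) + (m,) for m in masks]

list1 = ['LT50300281984137PAC00_sr_band1.tif',
         'LT50300281984137PAC00_sr_band2.tif',
         'LT50300281984137PAC00_sr_band3.tif',
         'LT50300281994260XXX03_sr_band1.tif',
         'LT50300281994260XXX03_sr_band2.tif',
         'LT50300281994260XXX03_sr_band3.tif']
list2 = ['LT50300281984137PAC00_mask.tif', 'LT50300281994260XXX03_mask.tif']

ziplist = group_with_masks(list1, list2)
print(ziplist)
```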
I would use itertools.chain and itertools.groupby, with a lambda expression that takes only the part up to the first _ as the grouping key. Example:
>>> from itertools import chain,groupby
>>> list1=['LT50300281984137PAC00_sr_band1.tif','LT50300281984137PAC00_sr_band2.tif','LT50300281984137PAC00_sr_band3.tif','LT50300281994260XXX03_sr_band1.tif','LT50300281994260XXX03_sr_band2.tif','LT50300281994260XXX03_sr_band3.tif']
>>> list2=['LT50300281984137PAC00_mask.tif','LT50300281994260XXX03_mask.tif']
>>>
>>> chained_sorted = sorted(chain(list1,list2))
>>> ret = []
>>> for i, x in groupby(chained_sorted,lambda x: x.split('_')[0]):
...     ret.append(tuple(x))
...
>>> ret
[('LT50300281984137PAC00_mask.tif', 'LT50300281984137PAC00_sr_band1.tif', 'LT50300281984137PAC00_sr_band2.tif', 'LT50300281984137PAC00_sr_band3.tif'), ('LT50300281994260XXX03_mask.tif', 'LT50300281994260XXX03_sr_band1.tif', 'LT50300281994260XXX03_sr_band2.tif', 'LT50300281994260XXX03_sr_band3.tif')]
My first answer on StackOverflow, so please be patient. But I didn't see a need for zip()
mask1, mask2 = list2[0], list2[1]
for b in reversed(list1):
    if b[0:20] in mask1:
        mask1 = b + " " + mask1
    else:
        mask2 = b + " " + mask2
ziplist = [tuple(mask1.split()), tuple(mask2.split())]
I think ziplist should now be what you were asking for.
