How to filter with different lengths

I'm pulling data from my server. The movement number normally consists of 3 letters followed by 4 digits (e.g. CVG5694), but it is stored as a string, so I use CAST in the query to take the last 4 characters as an integer. That gives me an integer column holding the last 4 digits. However, some of the trucks I need use a different format: 1 letter (X) followed by 5 digits (e.g. X12051). This is a problem because I filter out numbers less than 5000, and since I only take the last 4 digits, those trucks get dropped (X12051 becomes 2051 in the movement column and is filtered out). How can I keep trucks with numbers such as X12051 in my filter?
Below is some of my code:
SQL_Query = pd.read_sql_query('''SELECT[MVMT_DT],
[MVMT_NUMBER],
CAST(RIGHT(MVMT_NUMBER,4) as int) as movement,
[MVMT_TYPE],
[OPERATOR],
[EQUIPMENT],
[ORIG],
[DEST],
[MVMT_STATUS],
CASE WHEN [GROSS_WEIGHT_(KG)] < 0 THEN 0 ELSE [GROSS_WEIGHT_(KG)] END AS [GROSS_WEIGHT_(KG)],
CASE WHEN [NET_WEIGHT_(KG)]< 0 THEN 0 ELSE [NET_WEIGHT_(KG)] END AS [NET_WEIGHT_(KG)],
CASE WHEN [NMBR_ULDS] < 0 THEN 0 ELSE [NMBR_ULDS] END AS [NMBR_ULDS],
CASE WHEN [NMBR_POS] < 0 THEN 0 ELSE [NMBR_POS] END AS [NMBR_POS]
FROM PATH
WHERE [F-T-O] = 'T'
AND ORIG IN ('CVG', 'CVG CRN', 'MIA', 'MIA GTW', 'LAX', 'LAX GTW', 'JFK', 'JFK GTW', 'ORD', 'ORD GTW')
AND MVMT_TYPE IN ('O/XL', 'O/XL/AH', 'T/XL', 'T/XL/AH', 'CL/AH', 'O/AH', 'T/AH')
AND [MVMT_NUMBER] NOT LIKE '%AMZ%'
AND [MVMT_NUMBER] NOT LIKE '%A0%'
AND [MVMT_NUMBER] NOT LIKE '%K0%'
AND [MVMT_NUMBER] NOT LIKE '%A1%'
AND [MVMT_NUMBER] NOT LIKE '%K1%'
--AND RIGHT([MVMT_NUMBER], 4) <= 5000
AND MVMT_DT = '2021-12-06' --DATEADD(DAY, -2, GETDATE()) AND DATEADD(DAY, -1, GETDATE())''',conn_)
CXL_Filter = ['O/XL', 'O/XL/AH', 'T/XL', 'T/XL/AH']
Ad_Hoc_Filter = ['CL/AH', 'O/AH', 'T/AH']
CXL_CVG = SQL_Query[SQL_Query.MVMT_TYPE.isin(CXL_Filter) & (SQL_Query['ORIG'] == 'CVG') & (SQL_Query['movement'] >= 5000)]
CXL_CVG_CRN = SQL_Query[SQL_Query.MVMT_TYPE.isin(CXL_Filter) & (SQL_Query['ORIG'] == 'CVG CRN') & (SQL_Query['movement'] >= 5000)]
Ad_Hoc_CVG = SQL_Query[SQL_Query.MVMT_TYPE.isin(Ad_Hoc_Filter) & (SQL_Query['ORIG'] == 'CVG') & (SQL_Query['movement'] >= 5000)]
Ad_Hoc_CVG_CRN = SQL_Query[SQL_Query.MVMT_TYPE.isin(Ad_Hoc_Filter) & (SQL_Query['ORIG'] == 'CVG CRN') & (SQL_Query['movement'] >= 5000)]

You can extract the whole trailing number from MVMT_NUMBER, however many digits it has, instead of a fixed 4 characters:
...
CAST(RIGHT([MVMT_NUMBER], PATINDEX('%[0-9][^0-9]%', REVERSE([MVMT_NUMBER])+' ')) as INT) as movement,
...
Test:
select
    MVMT_NUMBER
    , CAST(RIGHT([MVMT_NUMBER], PATINDEX('%[0-9][^0-9]%', REVERSE([MVMT_NUMBER])+' ')) as INT) as movement
from (VALUES ('ABC5444'),('X12345'),('ABCD'),('1234')) val([MVMT_NUMBER])

Result:
MVMT_NUMBER  movement
ABC5444      5444
X12345       12345
ABCD         0
1234         1234
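
If the numeric part ever has to be extracted on the Python side rather than in SQL, the same idea (take all trailing digits, not a fixed four) can be sketched with a regular expression. The helper name trailing_number is an assumption made for illustration and is not part of the original code; it returns 0 when there are no trailing digits, matching the ABCD row in the test result.

```python
import re

# Hypothetical helper (not from the original post): integer value of the
# trailing digits of a movement number, or 0 if it does not end in a digit.
def trailing_number(mvmt_number):
    match = re.search(r'[0-9]+$', mvmt_number)
    return int(match.group()) if match else 0

samples = ['ABC5444', 'X12345', 'ABCD', '1234', 'X12051']
movements = {s: trailing_number(s) for s in samples}

# Keep only movement numbers of 5000 and above, as in the question's filter.
kept = [s for s in samples if movements[s] >= 5000]
print(movements)
print(kept)
```

On a DataFrame the same pattern could likely be applied to the MVMT_NUMBER column with Series.str.extract, but that variant is untested here.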

Related

Debugging the solution to a possible Bipartition

I came across this problem
We want to split a group of n people (labeled from 1 to n)
into two groups of any size. Each person may dislike some other people,
and they should not go into the same group.
Given the integer n and the array dislikes where dislikes[i] = [ai, bi]
indicates that the person labeled ai does not like the person labeled bi,
return true if it is possible to split everyone into two groups in this way.
Example 1:
Input: n = 4, dislikes = [[1,2],[1,3],[2,4]]
Output: true
Explanation: group1 [1,4] and group2 [2,3].
Example 2:
Input: n = 3, dislikes = [[1,2],[1,3],[2,3]]
Output: false
Example 3:
Input: n = 5, dislikes = [[1,2],[2,3],[3,4],[4,5],[1,5]]
Output: false
Below is my approach to the solution:
1. Create two lists, group1 and group2, and initialise group1 with 1.
2. Generate all the numbers from 2 to n in a variable called num.
3. Check whether num is an enemy of any element of group1; if so, check whether num is an enemy of any element of group2; if it is as well, return False.
4. Otherwise put num in the group it does not conflict with and go to step 2 with the next value.
5. Return True.
Below is the code implementation:
class Solution(object):
    def possibleBipartition(self, n, dislikes):
        """
        :type n: int
        :type dislikes: List[List[int]]
        :rtype: bool
        """
        group1 = [1]
        group2 = []
        for num in range(2, n+1):
            put_to_group_1 = 1
            for _n in group1:
                if [_n, num] in dislikes or [num, _n] in dislikes:
                    put_to_group_1 = 0
                    break
            put_to_group_2 = 1
            for _n in group2:
                if [_n, num] in dislikes or [num, _n] in dislikes:
                    put_to_group_2 = 0
                    break
            if put_to_group_1 == 0 and put_to_group_2 == 0:
                return False
            if put_to_group_1 == 1:
                group1.append(num)
            else:
                group2.append(num)
        return True
However, for the following input I am getting False, but the expected output is True.
50
[[21,47],[4,41],[2,41],[36,42],[32,45],[26,28],[32,44],[5,41],[29,44],[10,46],[1,6],[7,42],[46,49],[17,46],[32,35],[11,48],[37,48],[37,43],[8,41],[16,22],[41,43],[11,27],[22,44],[22,28],[18,37],[5,11],[18,46],[22,48],[1,17],[2,32],[21,37],[7,22],[23,41],[30,39],[6,41],[10,22],[36,41],[22,25],[1,12],[2,11],[45,46],[2,22],[1,38],[47,50],[11,15],[2,37],[1,43],[30,45],[4,32],[28,37],[1,21],[23,37],[5,37],[29,40],[6,42],[3,11],[40,42],[26,49],[41,50],[13,41],[20,47],[15,26],[47,49],[5,30],[4,42],[10,30],[6,29],[20,42],[4,37],[28,42],[1,16],[8,32],[16,29],[31,47],[15,47],[1,5],[7,37],[14,47],[30,48],[1,10],[26,43],[15,46],[42,45],[18,42],[25,42],[38,41],[32,39],[6,30],[29,33],[34,37],[26,38],[3,22],[18,47],[42,48],[22,49],[26,34],[22,36],[29,36],[11,25],[41,44],[6,46],[13,22],[11,16],[10,37],[42,43],[12,32],[1,48],[26,40],[22,50],[17,26],[4,22],[11,14],[26,39],[7,11],[23,26],[1,20],[32,33],[30,33],[1,25],[2,30],[2,46],[26,45],[47,48],[5,29],[3,37],[22,34],[20,22],[9,47],[1,4],[36,46],[30,49],[1,9],[3,26],[25,41],[14,29],[1,35],[23,42],[21,32],[24,46],[3,32],[9,42],[33,37],[7,30],[29,45],[27,30],[1,7],[33,42],[17,47],[12,47],[19,41],[3,42],[24,26],[20,29],[11,23],[22,40],[9,37],[31,32],[23,46],[11,38],[27,29],[17,37],[23,30],[14,42],[28,30],[29,31],[1,8],[1,36],[42,50],[21,41],[11,18],[39,41],[32,34],[6,37],[30,38],[21,46],[16,37],[22,24],[17,32],[23,29],[3,30],[8,30],[41,48],[1,39],[8,47],[30,44],[9,46],[22,45],[7,26],[35,42],[1,27],[17,30],[20,46],[18,29],[3,29],[4,30],[3,46]]
Can anyone tell me where I might be going wrong with the implementation?

Consider a scenario:
Let's assume that in the dislikes array, we have [1,6],[2,6] among other elements (so 6 hates 1 and 2).
Person 1 doesn't hate anybody else
After placing everybody into groups, let's say 2 gets placed in group 2.
While placing 6, you can't put it in either group, since it conflicts with 1 in group 1 and 2 in group 2.
6 could have been placed in group 1 if you didn't start with the assumption of placing 1 in group 1 (ideally 1 could have been placed in group 2 without conflict).
Long story short, don't start with person 1 in group 1. Take the first element in the dislikes array, put either of them in either group, and then continue with the algorithm.
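
The underlying issue is that a greedy placement commits people to a group before their conflicts are known. A common alternative, which is not what the answer above proposes, is to treat the dislikes as edges of a graph and try to two-colour it with a breadth-first search. The sketch below assumes people are labelled 1 to n as in the problem statement; the function name possible_bipartition is chosen for illustration.

```python
from collections import deque

def possible_bipartition(n, dislikes):
    # Adjacency list: graph[p] holds everyone p must be separated from.
    graph = [[] for _ in range(n + 1)]
    for a, b in dislikes:
        graph[a].append(b)
        graph[b].append(a)

    color = [0] * (n + 1)  # 0 = unassigned, 1 and -1 = the two groups
    for start in range(1, n + 1):
        if color[start] != 0:
            continue
        # A new connected component: its first member can go in either group.
        color[start] = 1
        queue = deque([start])
        while queue:
            person = queue.popleft()
            for enemy in graph[person]:
                if color[enemy] == 0:
                    color[enemy] = -color[person]  # force the opposite group
                    queue.append(enemy)
                elif color[enemy] == color[person]:
                    return False  # two enemies forced into the same group
    return True

print(possible_bipartition(4, [[1, 2], [1, 3], [2, 4]]))
print(possible_bipartition(3, [[1, 2], [1, 3], [2, 3]]))
```

Because each person's group is forced by an already-placed enemy, the only free choice is the first member of each connected component, and that choice cannot cause a conflict.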

Find intersection in Dataframe with Start and Enddate in Python

I have a dataframe of elements with a start and end datetime. What is the best option to find intersections of the dates? My naive approach right now consists of two nested loops cross-comparing the elements, which obviously is super slow. What would be a better way to achieve that?
counts = {}  # renamed from dict to avoid shadowing the built-in
start = "start_time"
end = "end_time"
# select the two columns with a list; a set is not a valid column indexer in recent pandas
for index1, rowLoop1 in df[[start, end]].head(500).iterrows():
    matches = []
    counts[(index1, rowLoop1[start])] = 0
    for index2, rowLoop2 in df[[start, end]].head(500).iterrows():
        if index1 != index2:
            if date_intersection(rowLoop1[start], rowLoop1[end], rowLoop2[start], rowLoop2[end]):
                counts[(index1, rowLoop1[start])] += 1
Code for date_intersection:
def date_intersection(t1start, t1end, t2start, t2end):
    if (t1start <= t2start <= t2end <= t1end): return True
    elif (t1start <= t2start <= t1end): return True
    elif (t1start <= t2end <= t1end): return True
    elif (t2start <= t1start <= t1end <= t2end): return True
    else: return False
Sample data:
id,start_date,end_date
41234132,2021-01-10 10:00:05,2021-01-10 10:30:27
64564512,2021-01-10 10:10:00,2021-01-11 10:28:00
21135765,2021-01-12 12:30:00,2021-01-12 12:38:00
87643252,2021-01-12 12:17:00,2021-01-12 12:42:00
87641234,2021-01-12 12:58:00,2021-01-12 13:17:00

You can do something like merging your dataframe with itself to get the cartesian product and comparing columns.
df = df.merge(df, how='cross', suffixes=('','_2'))
df['date_intersection'] = (
    (
        (df['start_date'].le(df['start_date_2']) & df['start_date_2'].le(df['end_date']))  # start 2 within start/end
        | (df['start_date'].le(df['end_date_2']) & df['end_date_2'].le(df['end_date']))  # end 2 within start/end
        | (df['start_date_2'].le(df['start_date']) & df['start_date'].le(df['end_date_2']))  # start within start 2/end 2
        | (df['start_date_2'].le(df['end_date']) & df['end_date'].le(df['end_date_2']))  # end within start 2/end 2
    )
    & df['id'].ne(df['id_2'])  # id not compared to itself
)
and then, to return the ids and whether each has a date intersection:
df.groupby('id')['date_intersection'].any()
id
21135765     True
41234132     True
64564512     True
87641234    False
87643252     True
or, if you need the ids that were intersected:
df.loc[df['date_intersection'], :].groupby(['id'])['id_2'].agg(list).to_frame('intersected_ids')
         intersected_ids
id
21135765      [87643252]
41234132      [64564512]
64564512      [41234132]
87643252      [21135765]
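
The four comparisons above can be collapsed into one test, assuming every row has its start no later than its end: two intervals overlap exactly when each one starts before the other one ends. The plain-Python sketch below checks this on the sample data from the question; the names overlaps and intersected are illustrative.

```python
from datetime import datetime
from itertools import combinations

def overlaps(start1, end1, start2, end2):
    # Assumes start <= end within each interval.
    return start1 <= end2 and start2 <= end1

FMT = "%Y-%m-%d %H:%M:%S"
rows = [
    (41234132, "2021-01-10 10:00:05", "2021-01-10 10:30:27"),
    (64564512, "2021-01-10 10:10:00", "2021-01-11 10:28:00"),
    (21135765, "2021-01-12 12:30:00", "2021-01-12 12:38:00"),
    (87643252, "2021-01-12 12:17:00", "2021-01-12 12:42:00"),
    (87641234, "2021-01-12 12:58:00", "2021-01-12 13:17:00"),
]
intervals = {
    row_id: (datetime.strptime(s, FMT), datetime.strptime(e, FMT))
    for row_id, s, e in rows
}

# Compare each unordered pair once and record the match on both sides.
intersected = {row_id: [] for row_id in intervals}
for (id1, (s1, e1)), (id2, (s2, e2)) in combinations(intervals.items(), 2):
    if overlaps(s1, e1, s2, e2):
        intersected[id1].append(id2)
        intersected[id2].append(id1)

print(intersected)
```

The same two-comparison condition could replace the four-clause expression in the merged DataFrame; the pair-wise loop here is still quadratic and is only meant to show the predicate.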

Filter by the number of digits pandas

I have a Dataframe that has only one column with numbers ranging from 1 to 10000000000.
df1 =
165437890
2321434256
324334567
4326457
243567869
234567843
......
7654356785432
7654324567543
I want a resulting DataFrame that contains only the 9-digit numbers whose digits are all different from each other. Is this possible? I don't know where to start.
Note: I need to filter out any number that has repeated digits.
For example, 122234543 would be dropped from my DataFrame, since the digit 2 appears 3 times and the digits 4 and 3 appear 2 times each.

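def is_good(num):
    numstr = str(num)
    # keep only 9-digit numbers whose digits are all distinct
    return len(numstr) == 9 and len(set(numstr)) == 9
# apply to the single column (selected by position here), not to the whole DataFrame
df1[df1.iloc[:, 0].apply(is_good)]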

flt = (df.Numbers >= 100000000) & (df.Numbers < 1000000000)
df = pd.DataFrame(df[flt]['Numbers'].unique())
Here Numbers is the name of the column holding your numbers. This keeps the 9-digit numbers and drops duplicate rows.
To also require that the digits within each number differ from each other:
df.Numbers = df.Numbers.astype('str')
df = df[df.Numbers.str.match(r'^(?!.*(.).*\1)[0-9]{9}$')]
Or another solution, based on Igor's answer:
def has_unique_9digits(n):
    s = str(n)
    return len(s) == len(set(s)) == 9
df = df[df.Numbers.apply(has_unique_9digits)]
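
As a quick sanity check, the regular expression and the set-based function can be compared on a few values outside pandas. The negative lookahead (?!.*(.).*\1) rejects any string in which some character appears twice. The sample list below reuses numbers from the question plus 123456789, which is added here purely as an illustration.

```python
import re

# Nine digits, and no character that occurs again later in the string.
UNIQUE_9_DIGITS = re.compile(r'^(?!.*(.).*\1)[0-9]{9}$')

def has_unique_9digits(n):
    s = str(n)
    return len(s) == len(set(s)) == 9

samples = [165437890, 2321434256, 324334567, 4326457,
           243567869, 234567843, 122234543, 123456789]

kept_by_regex = [n for n in samples if UNIQUE_9_DIGITS.match(str(n))]
kept_by_set = [n for n in samples if has_unique_9digits(n)]
print(kept_by_regex)
print(kept_by_set)
```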

How to get the correct number of distinct combination locks with a margin of error of +-2?

I am trying to solve the USACO problem Combination Lock, in which you are given two lock combinations. The locks have a margin of error of +-2, so if a lock's combination were 1-3-5, the combination 3-1-7 would still open it.
You are also given a dial size. The dial starts at 1 and ends at the given number, so a dial of 50 runs from 1 to 50. Since the beginning of the dial is adjacent to the end, the combination 49-1-3 would also open the lock 1-3-5.
The program has to output the number of distinct settings that open at least one of the two locks. For the record, the combinations 3-2-1 and 1-2-3 are considered distinct, but 2-2-2 and 2-2-2 are not.
I have tried creating two functions: one to check whether three numbers match the constraints of the first combination lock, and another to check whether they match the constraints of the second.
a,b,c = 1,2,3
d,e,f = 5,6,7
dial = 50
def check(i,j,k):
    i = (i+dial) % dial
    j = (j+dial) % dial
    k = (k+dial) % dial
    if abs(a-i) <= 2 and abs(b-j) <= 2 and abs(c-k) <= 2:
        return True
    return False
def check1(i,j,k):
    i = (i+dial) % dial
    j = (j+dial) % dial
    k = (k+dial) % dial
    if abs(d-i) <= 2 and abs(e-j) <= 2 and abs(f-k) <= 2:
        return True
    return False
res = []
count = 0
for i in range(1,dial+1):
    for j in range(1,dial+1):
        for k in range(1,dial+1):
            if check(i,j,k):
                count += 1
                res.append([i,j,k])
            if check1(i,j,k):
                count += 1
                res.append([i,j,k])
print(sorted(res))
print(count)
The dial is 50 and the first combination is 1-2-3 and the second combination is 5-6-7.
The program should output 249 as the count, but it instead outputs 225. I am not really sure why this is happening. I have added the array for display purposes only. Any help would be greatly appreciated!

You're going to a lot of trouble to solve this by brute force.
First of all, your two check routines have identical functionality: just call the same routine for both combinations, giving the correct combination as a second set of parameters.
The critical logic problem is handling the dial wrap-around: you miss picking up the adjacent numbers. Run 49 through your check against a correct value of 1:
# using a=1, i=49
i = (49+50) % 50 # i is still 49
...
if abs(1-49) <= 2 ... # abs(1-49) is 48. You need it to show up as 2.
Instead, you can check each end of the dial:
a_diff = abs(i-a)
if a_diff <=2 or a_diff >= (dial-2) ...
Another way is to start by making a list of acceptable values:
a_vals = [(a-oops) % dial for oops in range(-2, 3)]
... but note that you have to change the 0 value to dial. For instance, for a value of 1, you want a list of [49, 50, 1, 2, 3]
With this done, you can check like this:
if i in a_vals and j in b_vals and k in c_vals:
...
If you want to use the itertools module, you can simply generate all desired combinations:
combo = set(itertools.product(a_vals, b_vals, c_vals))
Do that for both given combinations and take the union of the two sets. The length of the union is the desired answer.
I see the follow-up isn't obvious -- at least, it's not appearing in the comments.
You have 5*5*5 solutions for each combination; start with 250 as your total.
Compute the sizes of the overlap sets: the numbers in each triple that can serve for each combination. For your given problem, those are [3],[4],[5]
The product of those set sizes is the quantity of overlap: 1*1*1 in this case.
The overlapping solutions got double-counted, so simply subtract the extra from 250, giving the answer of 249.
For example, given 1-2-3 and 49-6-6, you would get sets
{49, 50, 1}
{4}
{4, 5}
The sizes are 3, 1, 2; the product of those numbers is 6, so your answer is 250-6 = 244
Final note: If you're careful with your modular arithmetic, you can directly compute the set sizes without building the sets, making the program very short.
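
The counting argument above can be sketched directly. The helper names near and count_openings are illustrative and not from the answer; the dial positions are assumed to be 1-based as in the problem, and each per-position set is built explicitly rather than computed with modular arithmetic alone.

```python
def near(value, dial, tol=2):
    """Dial positions (1..dial) within tol of value, wrapping around the dial."""
    return {(value - 1 + offset) % dial + 1 for offset in range(-tol, tol + 1)}

def count_openings(first, second, dial, tol=2):
    opens_first = opens_second = opens_both = 1
    for a, d in zip(first, second):
        a_vals, d_vals = near(a, dial, tol), near(d, dial, tol)
        opens_first *= len(a_vals)           # choices that fit lock 1 at this position
        opens_second *= len(d_vals)          # choices that fit lock 2 at this position
        opens_both *= len(a_vals & d_vals)   # choices that fit both locks
    # Settings that open both locks were counted twice, so subtract them once.
    return opens_first + opens_second - opens_both

print(count_openings((1, 2, 3), (5, 6, 7), 50))
print(count_openings((1, 2, 3), (49, 6, 6), 50))
```

Using the set sizes instead of a fixed 5 per position also covers very small dials, where the tolerance window wraps onto itself.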

Here is one approach to a semi-brute-force solution:
import itertools

# The following code assumes 0-based combinations,
# represented as tuples of numbers in the range 0 to dial - 1.
# A simple wrapper function can be used to make the
# code apply to 1-based combos.

# The following function finds all combos which open a lock with a given combo:
def combos(combo, tol, dial):
    valids = []
    for p in itertools.product(range(-tol, 1+tol), repeat=3):
        valids.append(tuple((x+i) % dial for x, i in zip(combo, p)))
    return valids

# The following finds all combos for a given iterable of target combos:
def all_combos(targets, tol, dial):
    return set(combo for target in targets for combo in combos(target, tol, dial))

For example, len(all_combos([(0,1,2),(4,5,6)],2,50)) evaluates to 249.

The correct code for what you are trying to do is the following:
dial = 50
a = 1
b = 2
c = 3
d = 5
e = 6
f = 7
def check(i,j,k):
    if (abs(a-i) <= 2 or (dial-abs(a-i)) <= 2) and \
       (abs(b-j) <= 2 or (dial-abs(b-j)) <= 2) and \
       (abs(c-k) <= 2 or (dial-abs(c-k)) <= 2):
        return True
    return False
def check1(i,j,k):
    if (abs(d-i) <= 2 or (dial-abs(d-i)) <= 2) and \
       (abs(e-j) <= 2 or (dial-abs(e-j)) <= 2) and \
       (abs(f-k) <= 2 or (dial-abs(f-k)) <= 2):
        return True
    return False
res = []
count = 0
for i in range(1,dial+1):
    for j in range(1,dial+1):
        for k in range(1,dial+1):
            if check(i,j,k):
                count += 1
                res.append([i,j,k])
            elif check1(i,j,k):
                count += 1
                res.append([i,j,k])
print(sorted(res))
print(count)
The result is 249: there are 2*(5**3) = 250 combinations in total, but [3, 4, 5] opens both locks and must be counted only once.

select region from sequence with Phobius output in Python

I need to use a certain program to validate some of my results. I am relatively new to Python. The output differs a lot between entries; see a snippet below:
SEQENCE ID TM SP PREDICTION
YOL154W_Q12512_Saccharomyces_cerevisiae 0 Y n8-15c20/21o
YDR481C_P11491_Saccharomyces_cerevisiae 1 0 i34-53o
YAL007C_P39704_Saccharomyces_cerevisiae 1 Y n5-20c25/26o181-207i
YAR028W_P39548_Saccharomyces_cerevisiae 2 0 i51-69o75-97i
YBL040C_P18414_Saccharomyces_cerevisiae 7 0 o6-26i38-56o62-80i101-119o125-143i155-174o186-206i
YBR106W_P38264_Saccharomyces_cerevisiae 1 0 o28-47i
YBR287W_P38355_Saccharomyces_cerevisiae 8 0 o12-32i44-63o69-90i258-275o295-315i327-351o363-385i397-421o
So, I need the last transmembrane region, which is always the last pair of numbers between an o and an i (or vice versa). If TM = 0 there is no transmembrane region, so I only want the numbers when TM > 0.
Output I need:
34-53
181-207
75-97
186-206
28-47
397-421
preferably as separate values, like:
first_number = 34
second_number = 53
Because I will be using a loop, the values will be overwritten anyway. To summarize: I need the last region between the o and i (or vice versa), from strings that vary a lot in both length and composition.
Trouble: if I just search (for example with a regular expression) for the last region between o and i, I will sometimes pick the wrong region.

If the Phobius output is stored in a file, change 'Phobius_output' to its path; the following code should then give the expected result:
with open('Phobius_output') as file:
    for line in file.readlines()[1:]:
        if int(line.split()[1]) > 0:
            prediction = line.split()[3]
            i_idx, o_idx = prediction.rfind('i'), prediction.rfind('o')
            last_region = prediction[i_idx + 1:o_idx] if i_idx < o_idx else prediction[o_idx + 1:i_idx]
            first_number, second_number = map(int, last_region.split('-'))
            print(last_region)
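
A regular expression can also work here if it is anchored on both sides, which addresses the concern about picking the wrong region: a transmembrane range is a number pair that has i or o immediately before and after it, whereas signal-peptide ranges such as n8-15c do not. The sketch below is an alternative to the code above, with the pattern and the function name last_tm_region chosen for illustration; it is checked only against the sample lines from the question.

```python
import re

# A number pair with i or o directly before and after it.
TM_REGION = re.compile(r'(?<=[io])(\d+)-(\d+)(?=[io])')

def last_tm_region(line):
    """Return (first_number, second_number) of the last TM region, or None."""
    fields = line.split()
    if int(fields[1]) == 0:  # TM column: no transmembrane region predicted
        return None
    regions = TM_REGION.findall(fields[3])
    if not regions:
        return None
    first_number, second_number = map(int, regions[-1])
    return first_number, second_number

sample = """YOL154W_Q12512_Saccharomyces_cerevisiae 0 Y n8-15c20/21o
YDR481C_P11491_Saccharomyces_cerevisiae 1 0 i34-53o
YAL007C_P39704_Saccharomyces_cerevisiae 1 Y n5-20c25/26o181-207i
YAR028W_P39548_Saccharomyces_cerevisiae 2 0 i51-69o75-97i
YBL040C_P18414_Saccharomyces_cerevisiae 7 0 o6-26i38-56o62-80i101-119o125-143i155-174o186-206i
YBR106W_P38264_Saccharomyces_cerevisiae 1 0 o28-47i
YBR287W_P38355_Saccharomyces_cerevisiae 8 0 o12-32i44-63o69-90i258-275o295-315i327-351o363-385i397-421o"""

results = [last_tm_region(line) for line in sample.splitlines()]
for region in results:
    if region is not None:
        print("{}-{}".format(*region))
```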
