Minimum number of rows whose column sum is as close to N as possible, dealing with non-integers - python

I've got a DataFrame that looks something like this in Pandas.
| clip_id | duration |
|---------:|-----------:|
| 0050 | 3.085 |
| 0019 | 3.125 |
| 0001 | 3.265 |
...
| 0010 | 4.47 |
| 0024 | 4.48 |
| 0034 | 4.49 |
| 0004 | 4.515 |
...
| 0008 | 6.795 |
| 0034 | 6.99 |
| 0026 | 6.995 |
...
| 0004 | 9.005 |
| 0024 | 9.185 |
| 0048 | 9.265 |
| 0029 | 10.055 |
| 0001 | 10.255 |
| 0006 | 10.85 |
I've trimmed the table here using ellipses, but the number of rows is usually between 30 and 100. I have also sorted the table by the duration column.
My goal is to find the minimum number of clips such that their sum is just barely above some value N. In other words, if N=25, picking the bottom three rows is not a good enough solution, as their sum would be 31.16 and there exists a greedier/lazier selection whose sum is closer to 25.
It's been a super long time since I've taken an algorithms / data-structures class but I was sure there was a Heap-related solution to this problem. I also haven't done dynamic programming in Python before but perhaps there's a solution which involves DP. Looking around at other solved questions on StackOverflow, the most voted answers always assume that
(A) you're dealing with integers, OR
(B) you'll be able to find an exact sum
But that won't be the case for what I'm trying to do. Ideally, if I can do this while the data is stored in a Pandas DataFrame, it'll be easy to return the clip_id values for the resulting rows.
Appreciate any and all help I can get on this front!
Edit: Thinking about the problem more, what makes it tough is that there are two competing goals: I want the fewest rows, but I also want the sum to be barely above N if possible. Between the two, I would say being as close to N as possible is the more important condition. For example, if adding two more rows brings the total closer to N (while keeping it >= N), that would be preferred.

Is the solution below something like what you meant? Do you want to get the rows that push the sum above N, or do you want to exclude them? Depending on your answer, I can update the code.
def min_num_rows(data, threshold):
    # Work upwards from the longest clips and stop as soon as their sum passes the threshold.
    for i in range(1, len(data) + 1):
        if data[-i:]['duration'].sum() > threshold:
            return data[-i:]
    return data  # even the whole table does not reach the threshold

df = df.sort_values('duration', ascending=True)
above = min_num_rows(df, 25)
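If being as close to N as possible matters more than the row count (per the edit), a small subset-sum style dynamic program is another option instead of the greedy suffix above. This is only a sketch of one possible approach, not necessarily what you had in mind: it assumes durations can be scaled to integers (milliseconds here) so sums can be used as dict keys, and closest_subset_above is a made-up name.

def closest_subset_above(df, target, precision=3):
    """Return the clip_ids whose durations reach the smallest total >= target,
    breaking ties by using the fewest clips."""
    scale = 10 ** precision
    target_i = round(target * scale)
    durations = [round(d * scale) for d in df['duration']]

    # best[s] = (clip count, row labels) for the cheapest known way to reach sum s
    best = {0: (0, [])}
    for idx, d in zip(df.index, durations):
        for s, (cnt, rows) in list(best.items()):
            if s >= target_i:
                continue  # already past the target; adding more clips only hurts
            new_s = s + d
            if new_s not in best or best[new_s][0] > cnt + 1:
                best[new_s] = (cnt + 1, rows + [idx])

    reachable = [(s, cnt, rows) for s, (cnt, rows) in best.items() if s >= target_i]
    if not reachable:
        return None  # even the whole table does not reach the target
    _, _, rows = min(reachable, key=lambda t: (t[0], t[1]))  # closest sum first, then fewest clips
    return df.loc[rows, 'clip_id'].tolist()

clip_ids = closest_subset_above(df, 25)

The pruning keeps the state table small (no sum grows more than one clip past the target), so 30 to 100 rows is well within reach.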

Related

How to explore features of strings PYTHON

I have a table with relative abundances of strings collected elsewhere. I also have a list of features that are associated with the strings. What is the best way to check each string for each feature and sum the relative abundances?
Example Input Table:
+────────────+────────────+
| String | Abundance |
+────────────+────────────+
| abcdef | 12 |
| cdefgh | 15 |
| fghijk | 36 |
| jklmnoabc | 37 |
+────────────+────────────+
Example String Features:
cdef, abc, jk
Example Output
+──────────+────────────────+
| Feature | Abundance (%) |
+──────────+────────────────+
| cdef | 27 |
| abc | 59 |
| jk | 73 |
+──────────+────────────────+
Any help would be greatly appreciated!
The answer is to go through the list of string features for each string you have and use the `in` operator in Python.
This will check whether your feature occurs in the string you apply it to.
You then want to accumulate the abundance and associate it with your feature.
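A minimal sketch of that idea, assuming the table is a pandas DataFrame shaped like the example and the features are plain strings; the percentage step simply divides by the total abundance:

import pandas as pd

# Example input table and features from the question
df = pd.DataFrame({'String': ['abcdef', 'cdefgh', 'fghijk', 'jklmnoabc'],
                   'Abundance': [12, 15, 36, 37]})
features = ['cdef', 'abc', 'jk']

total = df['Abundance'].sum()
result = {}
for feature in features:
    # `feature in string` is True when the feature occurs anywhere in the string
    matched = sum(abundance for string, abundance in zip(df['String'], df['Abundance'])
                  if feature in string)
    result[feature] = 100 * matched / total  # percentage of the total abundance

print(result)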
Did you try a loop with a regex? Either way, you must go through both of your lists (inputs and features); I don't think there is a particular algorithm that would greatly accelerate the process.
Here is what I'm thinking of:
import re

for feature in features:
    p = re.compile(feature.string)
    feature.abundance = 0
    for input in inputs:
        m = p.search(input.string)  # search() finds the pattern anywhere; match() only anchors at the start
        if m:  # not None means the feature occurs in this string
            feature.abundance += input.abundance
With that, you will have all your stuff in your features list.

Is there a way to improve a MERGE query?

I am using this query to insert new entries into my table:
MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount
It seems that I do more inserts than updates; is there a way to improve my current performance?
I am using cx_Oracle with Python 3 and OracleDB 19c.
If you had massive problems with your approach, you would most probably be missing an index on the column clean.id, which is required because the MERGE uses dual as a source for each row.
That is unlikely, though, since you say the id is a primary key.
So basically you are doing the right thing, and you will see an execution plan similar to the one below:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | | | 2 (100)| |
| 1 | MERGE | CLEAN | | | | |
| 2 | VIEW | | | | | |
| 3 | NESTED LOOPS OUTER | | 1 | 40 | 2 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL | DUAL | 1 | 2 | 2 (0)| 00:00:01 |
| 5 | VIEW | VW_LAT_A18161FF | 1 | 38 | 0 (0)| |
| 6 | TABLE ACCESS BY INDEX ROWID| CLEAN | 1 | 38 | 0 (0)| |
|* 7 | INDEX UNIQUE SCAN | CLEAN_UX1 | 1 | | 0 (0)| |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("CLEAN"."ID"=:ID)
So the execution plan is fine and works effectively, but it has one problem.
Remember: whenever you use an index, you will be happy while processing a few rows, but it will not scale.
If you are processing millions of records, you may fall back to two-step processing:
insert all rows in a temporary table
perform a single MERGE statement using the temporary table
The big advantage is that Oracle can use a hash join and get rid of the index access for each of the millions of rows.
Here is an example of a test on the clean table, initialized with 1M ids (not shown), performing 1M inserts and 1M updates:
n = 1000000
data2 = [{"id" : i, "xcount" :1} for i in range(2*n)]
sql3 = """
insert into tmp (id,count)
values (:id,:xcount)"""
sql4 = """MERGE into clean USING tmp on (clean.id = tmp.id)
when not matched then insert (id, count) values (tmp.id, tmp.count)
when matched then update set clean.count= clean.count + tmp.count"""
cursor.executemany(sql3, data2)
cursor.execute(sql4)
The test runs in approx. 10 seconds, which is less than half the time of your approach with MERGE using dual.
If this is still not enough, you'll have to use the parallel option.
MERGE is quite fast. Inserts are faster than updates, I'd say (usually).
So, if you're asking how to make inserts faster, then it depends.
If you're inserting one row at a time, there shouldn't be any bottleneck.
If you're inserting millions of rows, see whether there are triggers enabled on the table which fire for each row and do something (slowing the process down).
As for updates, is there an index on the clean.id column? If not, it would probably help.
Otherwise, see what the explain plan says, and collect statistics regularly.
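If the index does turn out to be missing, here is a hedged sketch of adding it and refreshing statistics from cx_Oracle; the connect string and index name are made up, so adjust them to your environment.

import cx_Oracle

# Hypothetical connection details; replace with your own credentials and DSN.
conn = cx_Oracle.connect("user", "password", "localhost/orclpdb1")
cur = conn.cursor()

# Create the missing index on clean.id (the index name is made up).
cur.execute("CREATE INDEX clean_id_ix ON clean (id)")

# Refresh optimizer statistics so the execution plan stays accurate.
cur.callproc("dbms_stats.gather_table_stats", [conn.username.upper(), "CLEAN"])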

Efficient way to write Pandas groupby codes by eliminating repetition

I have a DataFrame as below.
import pandas as pd

df = pd.DataFrame({
    'Country': ['A','A','A','A','A','A','B','B','B'],
    'City': ['C 1','C 1','C 1','B 2','B 2','B 2','C 1','C 1','C 1'],
    'Date': ['7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020','7/1/2020','7/2/2020','7/3/2020'],
    'Value': [46,90,23,84,89,98,31,84,41]
})
I need to create two averages:
Firstly, with both Country and City as the criteria
Secondly, an average for only the Country
In order to achieve this, we can easily write the code below:
df.groupby(['Country','City']).agg('mean')
+---------+------+-------+
| Country | City | Value |
+---------+------+-------+
| A | B 2 | 90.33 |
| +------+-------+
| | C 1 | 53 |
+---------+------+-------+
| B | C 1 | 52 |
+---------+------+-------+
df.groupby(['Country']).agg('mean')
+---------+-------+
| Country | Value |
+---------+-------+
| A | 71.67 |
+---------+-------+
| B | 52 |
+---------+-------+
The only change between the above two snippets is the groupby criterion City; apart from that, everything is the same, so there is clear repetition/duplication of code (especially when it comes to complex scenarios).
Now my question is: is there any way we could write one piece of code that covers both scenarios at once? DRY - Don't Repeat Yourself.
What I have in mind is something like the below.
Choice = 'City' `<<-- Here I type either 'City' or None based on the requirement. E.g. if None, the code below should ignore that criterion.`
df.groupby(['Country',Choice]).agg('mean')
Is this possible? Or what is the best way to write the above code efficiently, without repetition?
I am not sure what you want to accomplish, but why not just use an if?
columns = ['Country']
if Choice:
    columns.append(Choice)
df.groupby(columns).agg('mean')
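If this pattern shows up in many places, the same idea can also be wrapped in a small helper so each call site stays a one-liner; country_mean is a made-up name, and selecting 'Value' before averaging keeps the non-numeric Date column out of the aggregation.

import pandas as pd

df = pd.DataFrame({
    'Country': ['A','A','A','A','A','A','B','B','B'],
    'City': ['C 1','C 1','C 1','B 2','B 2','B 2','C 1','C 1','C 1'],
    'Date': ['7/1/2020','7/2/2020','7/3/2020'] * 3,
    'Value': [46,90,23,84,89,98,31,84,41],
})

def country_mean(frame, extra=None):
    """Average Value by Country, optionally adding one more grouping column."""
    keys = ['Country'] + ([extra] if extra else [])
    return frame.groupby(keys)['Value'].mean()

print(country_mean(df, 'City'))  # Country + City averages
print(country_mean(df))          # Country-only averages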

Calculate lag difference in group

I am trying to solve a problem using just SQL (I am able to do this when combining SQL and Python).
Basically what I want to do is to calculate score changes per candidate, where a score is obtained by joining a score lookup table and then summing the individual event scores. If a candidate fails, they are required to retake the events. Here is an example output:
| brandi_id | retest | total_score |
|-----------|--------|-------------|
| 1 | true | 128 |
| 1 | false | 234 |
| 2 | true | 200 |
| 2 | false | 230 |
| 3 | false | 265 |
What I want is to calculate a score change only for those candidates who took a retest, where the score change is just the difference in total_score between the retest = true and retest = false rows:
| brandi_id | difference |
|-----------|------------|
| 1 | 106 |
| 2 | 30 |
This is the SQL that I am using (with this approach I still need Python to finish the job):
select e.brandi_id, e.retest, sum(sl.scaled_score) as total_score
from event as e
left join apf_score_lookup as sl
on sl.asmnt_code = e.asmnt_code
and sl.raw_score = e.score
where e.asmnt_code in ('APFPS','APFSU','APF2M')
group by e.brandi_id, e.retest
order by e.brandi_id;
I think the solution involves using LAG and PARTITION but I cannot get it. Thanks!
If someone does the retest only once, then you can use a join:
select tc.*, tr.score, (tc.score - tr.score) as diff
from t tc join
t tr
on tc.brandi_id = tr.brandi_id and
tc.retest = 'true' and tr.retest = 'false';
You don't describe your table layout. If the results are from the query in your question, you can just plug that in as a CTE.
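For comparison, since the question mentions being able to finish the job by combining SQL and Python, here is a minimal pandas sketch of the same difference, starting from the output of the aggregation query (sample values copied from the question, computed as non-retest minus retest to match the expected table):

import pandas as pd

# Result of the aggregation query in the question (sample rows from the post)
scores = pd.DataFrame({
    'brandi_id': [1, 1, 2, 2, 3],
    'retest': [True, False, True, False, False],
    'total_score': [128, 234, 200, 230, 265],
})

# One column per retest flag; candidates without a retest get NaN in the True column
wide = scores.pivot(index='brandi_id', columns='retest', values='total_score')

# Non-retest score minus retest score, keeping only candidates who actually retested
diff = (wide[False] - wide[True]).dropna().rename('difference').reset_index()
print(diff)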

Simple moving average for random related time values

I'm a beginner programmer looking for help with a simple moving average (SMA). I'm working with column files, where the first column is time and the second is a value. The time intervals are irregular and so are the values. Usually the files are not big, but the process collects data for a long time. At the end the files look similar to this:
+-----------+-------+
| Time | Value |
+-----------+-------+
| 10 | 3 |
| 1345 | 50 |
| 1390 | 4 |
| 2902 | 10 |
| 34057 | 13 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
After the whole process, the number of rows is around 60k-100k.
Then I'm trying to "smooth" the data with some time window. For this purpose I'm using an SMA. [AWK_method]
awk 'BEGIN{size=$timewindow} {mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]};sum+=$1;array[mod]=$1;print sum/count}' file.dat
To get the SMA working properly with a predefined $timewindow, I create a linearly incremented time axis filled with zeros. Next, I run the script using different $timewindow values and observe the results.
+-----------+-------+
| Time | Value |
+-----------+-------+
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| (...) | |
| 10 | 3 |
| 11 | 0 |
| 12 | 0 |
| (...) | |
| 1343 | 0 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
For small data sets this was relatively comfortable, but now it is quite time-consuming, and the created files are starting to be too big.
So here are my questions:
Is it possible to change the awk solution to bypass filling data with zeros?
Do you recommend any other solution using bash?
I have also considered learning Python because, after 6 months of learning bash, I have come to know its limitations. Will I be able to solve this in Python without creating big files?
I'll be glad of any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
You included a python tag; check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with not-equal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
Key phrase for further research:
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
Another awk attempt, which walks the time column of the table and emits zero-value filler rows to pad the gaps:
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file
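Since the question carries a python tag, here is a minimal pandas sketch of a time-windowed average computed directly on the raw, unevenly spaced rows, with no zero padding; the column names follow the example table, the window width is a placeholder, and Time is assumed to be in seconds.

import pandas as pd

# Toy data shaped like the question's table; Time is assumed to be in seconds.
df = pd.DataFrame({'Time': [10, 1345, 1390, 2902, 34057],
                   'Value': [3, 50, 4, 10, 13]})

timewindow = 1000  # window width, in the same units as Time (placeholder value)

# Index the values by time and use a time-based rolling window: only the samples
# that actually fall inside each window are averaged, so the gaps never need
# to be padded with zeros.
s = pd.Series(df['Value'].to_numpy(),
              index=pd.to_datetime(df['Time'], unit='s'))
df['SMA'] = s.rolling(f'{timewindow}s').mean().to_numpy()
print(df)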
