Returning Max value grouping by N attributes - python

I am coming from a Java background and learning Python by applying it in my work environment whenever possible. I have a piece of functioning code that I would really like to improve.
Essentially I have a list of namedtuples with 3 numerical values and 1 time value.
from collections import namedtuple

complete = []
uniqueComplete = set()
screenedPartitions = namedtuple('screenedPartitions', ['feedID', 'partition', 'date', 'screeeningMode'])
I parse a log and after this is populated, I want to create a reduced set that is essentially the most recently dated member where feedID, partition and screeningMode are identical. So far I can only get it out by using a nasty nested loop.
for a in complete:
    max = a
    for b in complete:
        if a.feedID == b.feedID and a.partition == b.partition and \
           a.screeeningMode == b.screeeningMode and a.date < b.date:
            max = b
    uniqueComplete.add(max)
Could anyone give me advice on how to improve this? It would be great to work it out with what's available in the stdlib, as I guess my main task here is to get myself thinking about it with the map/filter functionality.
The data looks akin to
FeedID | Partition | Date | ScreeningMode
68 | 5 |10/04/2017 12:40| EPEP
164 | 1 |09/04/2017 19:53| ISCION
164 | 1 |09/04/2017 20:50| ISCION
180 | 1 |10/04/2017 06:11| ISAN
128 | 1 |09/04/2017 21:16| ESAN
So after the code is run, the second data row (164 | 1 | 09/04/2017 19:53) would be removed, as the third row is a more recent version.
TL;DR, what would this SQL be in Python:
SELECT feedID, partition, screeeningMode, max(date)
FROM Complete
GROUP BY feedID, partition, screeeningMode

Try something like this:
import pandas as pd
df = pd.DataFrame(complete, columns=screenedPartitions._fields)
df = df.groupby(['feedID', 'partition', 'screeeningMode']).max()
It really depends on how your date is represented, but if you provide data I think we can work something out.
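Since the question asks for a stdlib approach, here is a minimal sketch that replaces the nested loop with a single pass over complete, keeping a dict keyed by the grouping attributes (it assumes date is stored as something directly comparable, such as a datetime, rather than the display strings shown above):

def latest_per_group(complete):
    """Keep only the most recent record per (feedID, partition, screeeningMode)."""
    latest = {}
    for rec in complete:
        key = (rec.feedID, rec.partition, rec.screeeningMode)
        # Replace the stored record whenever a newer date shows up for the same key
        if key not in latest or rec.date > latest[key].date:
            latest[key] = rec
    return set(latest.values())

uniqueComplete = latest_per_group(complete)

This does in one O(n) pass what the nested loop does in O(n²), and it is essentially the GROUP BY ... max(date) from the SQL version.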

Related

How to check index numbers from database using Django?

I've got a system built in Django which receives data. I store the data as follows:
id | sensor | message_id | value
----+--------+------------+-------
1 | A | 1 | xxx
2 | A | 2 | xxx
3 | A | 3 | xxx
4 | B | 1 | xxx
5 | B | 2 | xxx
6 | B | 4 | xxx
7 | B | 7 | xxx
We expect the message_id to increase by one per sensor with every subsequent message. As you can see, the message_ids for sensor B are: 1, 2, 4, 7. This means the messages with numbers 3, 5 and 6 are missing for sensor B. In this case we would need to investigate the missing messages, especially if there are many of them. So I now want a way to know about these missing messages when this happens.
So I want to do a check for whether any messages are missing in the past five minutes. I would expect an output that says something like:
3 messages are missing for sensor B in the last 5 minutes. The following ids are missing: 3, 5, 6
The simplest way I thought of doing this is by querying the message_id for one sensor and then looping over them to check whether any number is skipped. I thought of something like this:
five_minutes_ago = datetime.now() - timedelta(minutes=5)
queryset = MessageData.objects.filter(created__gt=five_minutes_ago).filter(sensor='B').order_by('message_id')

last_message_id = None
for md in queryset:
    if last_message_id is None:
        last_message_id = md.message_id
    else:
        if md.message_id != last_message_id + 1:
            missing_messages = md.message_id - last_message_id - 1
            print(f"{missing_messages} messages missing for sensor {md.sensor}")
        # remember the current id for the next comparison
        last_message_id = md.message_id
But since I've got hundreds of sensors this seems like it's not the best way to do it. It might even be possible to do in the SQL itself, but I'm unaware of a way to do so.
How could I efficiently do this?
You can try something like this; I have added comments above each line explaining the logic. Feel free to comment in case of any query.
five_minutes_ago = datetime.now() - timedelta(minutes=5)
queryset = MessageData.objects.filter(created__gt=five_minutes_ago).filter(sensor='B').order_by('message_id')
# rows that should ideally be there if no message_id was missing, i.e. equal to the last message_id
ideal_num_of_rows = queryset.last().message_id
# total number of message_ids present
total_num_of_rows_present = queryset.count()
# number of missing message_ids
num_of_missing_message_ids = ideal_num_of_rows - total_num_of_rows_present - 1
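If you also need the actual missing ids rather than just a count, and for every sensor at once, one possible approach (a sketch only, using the field names from the question) is to pull the sensor/message_id pairs for the window in a single query and diff against the expected dense range per sensor in Python:

from collections import defaultdict
from datetime import datetime, timedelta

five_minutes_ago = datetime.now() - timedelta(minutes=5)

# One query for all sensors: just the columns needed for the check
rows = (MessageData.objects
        .filter(created__gt=five_minutes_ago)
        .values_list('sensor', 'message_id'))

seen = defaultdict(set)
for sensor, message_id in rows:
    seen[sensor].add(message_id)

for sensor, ids in seen.items():
    # Expect every id between the smallest and largest seen in the window
    expected = set(range(min(ids), max(ids) + 1))
    missing = sorted(expected - ids)
    if missing:
        print(f"{len(missing)} messages are missing for sensor {sensor} "
              f"in the last 5 minutes. The following ids are missing: {missing}")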
You can accomplish what you want with a single SQL statement. The following generates, for each sensor, an array of the missing message ids and a count of how many messages are missing. This is done in 3 steps:
1. Get the minimum and maximum message ids.
2. Generate a dense list of the message ids needed.
3. Left join the actual sensor messages with the dense list and select only those in the dense list that are not in the actual table. Count the items selected.
with sensor_range (sensor, min_msg_id, max_msg_id) as -- 1 get necessary message range
( select sensor
, min(message_id)
, max(message_id)
from sensor_messages
group by sensor
-- where message_ts > current_timestamp - interval '5 min'
) --select * from sensor_range;
, sensor_series (sensor, msg_id) as -- 2 generate list of needed messages_id
( select sensor, n
from sensor_range sr
cross join generate_series( sr.min_msg_id
, sr.max_msg_id
, 1
) gs(n)
) --select * from sensor_series;
select ss.sensor
, array_agg(ss.msg_id) missing_message_ids --3 Identify messing message_id and count their number
, array_length(array_agg(ss.msg_id),1) missing_messages_count
from sensor_series ss
left join sensor_messages sm
on ( ss.sensor = sm.sensor
and sm.message_id = ss.msg_id
)
where sm.message_id is null
group by ss.sensor
order by ss.sensor;
See demo here. This could be packaged into an SQL function that returns a table if desired. A good reference.
Your description mentions a time range, but your data does not have a timestamp column. The query has a comment for handling this.

Pandas df loop + merge

Hello guys I need your wisdom,
I'm still new to Python and pandas, and I'm looking to achieve the following.
df = pd.DataFrame({'code': [125, 265, 128,368,4682,12,26,12,36,46,1,2,1,3,6], 'parent': [12,26,12,36,46,1,2,1,3,6,'a','b','a','c','f'], 'name':['unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','unknow','g1','g2','g1','g3','g6']})
ds = pd.DataFrame({'code': [125, 265, 128,368,4682], 'name': ['Eagle','Cat','Koala','Panther','Dophin']})
I would like to add a new column in the ds dataframe with the name of the highest parent.
As an example, for the first row:
code | name | category
125 | Eagle | a
"a" is the result of walking the chain between df.code and df.parent: 125 > 12 > 1 > a
Since the last parent is not a number but a letter, I think I must use a regex and then .merge from pandas to populate the ds['category'] column. Maybe also an apply function, but that seems a little bit above my current knowledge.
Could anyone help me with this?
Regards,
The following is certainly not the fastest solution but it works if your dataframes are not too big. First create a dictionary from the parent codes of df and then apply this dict recursively until you come to an end.
p = df[['code','parent']].set_index('code').to_dict()['parent']

def get_parent(code):
    while par := p.get(code):
        code = par
    return code

ds['category'] = ds.code.apply(get_parent)
Result:
code name category
0 125 Eagle a
1 265 Cat b
2 128 Koala a
3 368 Panther c
4 4682 Dophin f
PS: get_parent uses an assignment expression (Python >= 3.8); for older versions of Python you could use:
def get_parent(code):
    while True:
        par = p.get(code)
        if par:
            code = par
        else:
            return code
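If the lookup dictionary is large and many codes share long chains, a possible speed-up along the same lines (a sketch, not benchmarked against your real data) is to cache the resolved root of every code seen along the way, so each chain is only walked once:

resolved = {}  # cache: code -> final root

def get_parent_cached(code):
    path = []
    # Walk up until we reach a code with no parent or one we have already resolved
    while code in p and code not in resolved:
        path.append(code)
        code = p[code]
    root = resolved.get(code, code)
    # Remember the root for every code on the path just walked
    for c in path:
        resolved[c] = root
    return root

ds['category'] = ds.code.apply(get_parent_cached)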

Numerical simulations with multiprocessing much slower than hoped: am I doing anything wrong? Can I speed it up?

I am running set of numerical simulations. I need to run some sensitivity analyses on the results, i.e. calculate and show how much certain outputs change, as certain inputs vary within given ranges. Basically I need to create a table like this, where each row is the result of one model run:
+-------------+-------------+-------------+-------------+
| Input 1 | Input 2 | Output 1 | Output 2 |
+-------------+-------------+-------------+-------------+
| 0.708788979 | 0.614576315 | 0.366315092 | 0.476088865 |
| 0.793662551 | 0.938622754 | 0.898870204 | 0.014915374 |
| 0.366560694 | 0.244354275 | 0.740988568 | 0.197036087 |
+-------------+-------------+-------------+-------------+
Each individual model run is tricky to parallelise, but it shouldn't be too hard to get each CPU to run a different model with different inputs.
I have put something together with the multiprocessing library, but it is much slower than I would have hoped. Do you have any suggestions on what I am doing wrong / how I can speed it up? I am open to using a library other than multiprocessing.
Does it have to do with load balancing?
I must confess I am new to multiprocessing in Python and am not too clear on the differences among map, apply, and apply_async.
I have made a toy example to show what I mean: I create random samples from a lognormal distribution, and calculate how much the mean of my sample changes as the mean and sigma of the distribution change. This is just a banal example because what matters here is not the model itself, but running multiple models in parallel.
In my example, the times (in seconds) are:
+-----------------+-----------------+---------------------+
| Million records | Time (parallel) | Time (not parallel) |
+-----------------+-----------------+---------------------+
| 5 | 24.4 | 18 |
| 10 | 26.5 | 35.8 |
| 20 | 32.2 | 71 |
+-----------------+-----------------+---------------------+
Only between a sample size of 5 and 10 million does parallelising bring any benefits. Is this to be expected?
P.S. I am aware of the SALib library for sensitivity analyses, but, as far as I can see, it doesn't do what I'm after.
My code:
import numpy as np
import pandas as pd
import time
import multiprocessing
from multiprocessing import Pool
# I store all the possible inputs in a dataframe
tmp = {}
i = 0
for mysigma in np.linspace(0, 1, 10):
    for mymean in np.linspace(0, 1, 10):
        i += 1
        tmp[i] = pd.DataFrame({'mean': [mymean],
                               'sigma': [mysigma]})
par_inputs = pd.concat([tmp[x] for x in tmp], axis=0, ignore_index=True)

def not_parallel(df):
    for row in df.itertuples(index=True):
        myindex = row[0]
        mymean = row[1]
        mysigma = row[2]
        dist = np.random.lognormal(mymean, mysigma, size=n)
        empmean = dist.mean()
        df.loc[myindex, 'empirical mean'] = empmean
    df.to_csv('results not parallel.csv')

# splits the dataframe and sets up the parallelisation
def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    conc_df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    conc_df.to_csv('results parallelized.csv')
    return conc_df

# the actual function being parallelised
def parallel_sensitivities(data):
    for row in data.itertuples(index=True):
        myindex = row[0]
        mymean = row[1]
        mysigma = row[2]
        dist = np.random.lognormal(mymean, mysigma, size=n)
        empmean = dist.mean()
        print(empmean)
        data.loc[myindex, 'empirical mean'] = empmean
    return data

num_cores = multiprocessing.cpu_count()
num_partitions = num_cores
n = int(5e6)

if __name__ == '__main__':
    start = time.time()
    not_parallel(par_inputs)
    time_np = time.time() - start

    start = time.time()
    parallelize_dataframe(par_inputs, parallel_sensitivities)
    time_p = time.time() - start
The time differences are due to starting the multiple processes up. Starting each process takes some fraction of a second. Your actual processing time is much better than the non-parallel run, but part of the multiprocessing speed increase is accepting the time it takes to start each process.
In this case, your example function is relatively fast in absolute terms, so you don't see the time gain immediately on a small number of records. For more intensive operations on each record you would see much more significant time gains by parallelizing.
Keep in mind that parallelization is both costly and time-consuming due to the overhead of the subprocesses that is needed by your operating system. Compared to running two or more tasks in a linear way, doing this in parallel you may save between 25 and 30 percent of time per subprocess, depending on your use-case. For example, two tasks that consume 5 seconds each need 10 seconds in total if executed in series, and may need about 8 seconds on average on a multi-core machine when parallelized. 3 of those 8 seconds may be lost to overhead, limiting your speed improvements.
From this article.
Edited:
When using a Pool(), you have a few options to assign tasks to the pool.
multiprocessing.apply_async() (docs) is used to assign a single task, and avoids blocking while waiting for that task to complete.
multiprocessing.map_async() (docs) will chunk an iterable by chunk_size and add each chunk to the pool to be completed.
In your case, it will depend on the real scenario you are using, but the choice isn't about which is faster, rather about what function you need to run. I'm not going to say for sure which one you need, since you used a toy example. I'm guessing you could use apply_async if you need each function call to run and the function is self-contained. If the function can run in parallel over an iterable, you would want map_async.
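As a rough sketch of what each of those looks like here (reusing par_inputs, num_cores, num_partitions and parallel_sensitivities from the question's code; this is illustrative, not a tuned implementation):

import numpy as np
from multiprocessing import Pool

if __name__ == '__main__':
    chunks = np.array_split(par_inputs, num_partitions)
    with Pool(num_cores) as pool:
        # map: hand the pool an iterable of chunks and block until all results are back
        mapped = pool.map(parallel_sensitivities, chunks)

        # apply_async: submit each self-contained task individually, collect the results later
        pending = [pool.apply_async(parallel_sensitivities, (chunk,)) for chunk in chunks]
        collected = [res.get() for res in pending]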

How to make Slices from a Dataframe where Column Equals a Value

I have two sets of csv data. One contains two columns (time and a boolean flag), and the other contains some info I'd like to display visually with some graphing functions I have. The data is sampled at different frequencies, so the number of rows may not match between the datasets. How do I plot individual graphs for the ranges of data where the boolean is true?
Here is what the contact data looks like:
INDEX | TIME | CONTACT
0 | 240:18:59:31.750 | 0
1 | 240:18:59:32.000 | 0
2 | 240:18:59:32.250 | 0
........
1421 | 240:19:05:27.000 | 1
1422 | 240:19:05:27.250 | 1
The other (Vehicle) data isn't really important, but it contains values like Weight, Speed (MPH), Pedal Position etc.
I have many separate large Excel files, and because the shapes do not match I am unsure how to slice the data using the time flags, so I made the function below to create the ranges. I am thinking this can be done in an easier manner.
Here is the working code (with output below). In short, is there an easier way to do this?
def determineContactSlices(data):
    contactStart = None
    contactEnd = None
    slices = pd.DataFrame([])
    for index, row in data.iterrows():
        if row['CONTACT'] == 1:
            # begin slice
            if contactStart is None:
                contactStart = index
                continue
            else:
                # still valid, move onto next
                continue
        elif row['CONTACT'] == 0:
            if contactStart is not None:
                contactEnd = index - 1
                # create slice and add the df to list
                slice = data[contactStart:contactEnd]
                print(slice)
                slices = slices.append(slice)
                # then reset everything
                slice = None
                contactStart = None
                contactEnd = None
                continue
            else:
                # move onto next row
                continue
    return slices
Output: ([15542 rows x 2 columns])
Index Time CONTACT
1421 240:19:05:27.000 1
1422 240:19:05:27.250 1
1423 240:19:05:27.500 1
1424 240:19:05:27.750 1
1425 240:19:05:28.000 1
1426 240:19:05:28.250 1
... ...
56815 240:22:56:15.500 1
56816 240:22:56:15.750 1
56817 240:22:56:16.000 1
56818 240:22:56:16.250 1
56819 240:22:56:16.500 1
With this output I intend to loop through each time slice and display the Vehicle Data in subplots.
Any help or guidance would be much appreciated (:
UPDATE:
I believe I can just do filteredData = vehicleData[contactData['CONTACT'] == 1], but then I am faced with how to go about graphing individually when there is a disconnect. For example, if there are 7 connections at various times and lengths, I would like to have 7 individual plots.
I think what you are trying to do is relatively simple, although I am not sure if I understand the output that you want or what you want to do with it after you have it. For example:
contact_df = data[data['CONTACT'] == 1]
non_contact_df = data[data['CONTACT'] == 0]
If this isn't helpful, please provide some additional details as to what the output should look like and what you plan to do with it after it is created.
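Regarding the update about plotting each contact period separately: a common pandas pattern (a sketch, assuming the contact dataframe is the data variable from the question with a CONTACT column) is to label each contiguous run of CONTACT values and group on that label:

contact = data['CONTACT']
# A new run id starts every time CONTACT changes value
run_id = (contact != contact.shift()).cumsum()

# One DataFrame per contiguous CONTACT == 1 period
contact_periods = [grp for _, grp in data[contact == 1].groupby(run_id[contact == 1])]

for i, period in enumerate(contact_periods):
    print(f"Contact period {i}: rows {period.index[0]} to {period.index[-1]}")
    # slice and plot the matching vehicle data for this period here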
Old question but why not:
sliceStart_index = df[ df["date"]=="2012-12-28" ].index.tolist()[0]
sliceEnd_index = df[ df["date"]=="2013-01-10" ].index.tolist()[0]
this_is_your_slice = df.iloc[sliceStart_index : sliceEnd_index]
The first two lines actually get you a list of indexes where the condition is met; I just chose the first one from each list for the example.

How to generate table using Python

I am quite struggling with this, as I have tried many libraries to print a table but had no success, so I thought to post here and ask.
My data is in a text file (resource.txt) which looks like this (the exact same way it prints)
pipelined 8 8 0 17 0 0
nonpipelined 2 2 0 10 0 0
I want my data printed in the following manner:
Design name LUT Lut as m Lut as I FF DSP BRAM
-------------------------------------------------------------------
pipelined 8 8 0 17 0 0
Non piplined 2 2 0 10 0 0
Sometimes there may be more data; the columns remain the same but the number of rows may increase.
(I have Python 2.7.)
I am using this part in my Python code. All the code works, but I couldn't print the data I extracted to the text file in tabular form. I can't use the pandas library, as it won't work with my Python 2.7 setup, but I can use tabulate and other libraries. Can anyone please help me?
I tried using tabulate but I keep getting errors.
At the end I tried a simple method to print, but it's not working (the same code works if I put it at the top of my script, but at the end it won't). Does anyone have any idea?
q11 = open("resource.txt", "r")
for line in q11:
    print(line)
Here's a self contained function that makes a left-justified, technical paper styled table.
def makeTable(headerRow, columnizedData, columnSpacing=2):
    """Creates a technical paper style, left justified table
    Author: Christopher Collett
    Date: 6/1/2019"""
    from numpy import array, max, vectorize
    cols = array(columnizedData, dtype=str)
    colSizes = [max(vectorize(len)(col)) for col in cols]
    header = ''
    rows = ['' for i in cols[0]]
    for i in range(0, len(headerRow)):
        if len(headerRow[i]) > colSizes[i]: colSizes[i] = len(headerRow[i])
        headerRow[i] += ' ' * (colSizes[i] - len(headerRow[i]))
        header += headerRow[i]
        if not i == len(headerRow) - 1: header += ' ' * columnSpacing
        for j in range(0, len(cols[i])):
            if len(cols[i][j]) < colSizes[i]:
                cols[i][j] += ' ' * (colSizes[i] - len(cols[i][j]) + columnSpacing)
            rows[j] += cols[i][j]
            if not i == len(headerRow) - 1: rows[j] += ' ' * columnSpacing
    line = '-' * len(header)
    print(line)
    print(header)
    print(line)
    for row in rows: print(row)
    print(line)
And here's an example using this function.
>>> header = ['Name','Age']
>>> names = ['George','Alberta','Frank']
>>> ages = [8,9,11]
>>> makeTable(header,[names,ages])
------------
Name Age
------------
George 8
Alberta 9
Frank 11
------------
Since the number of columns remains the same, you could just print out the header line with ample spaces as required, for example:
print("Design name", ' ', "LUT", ' ', "Lut as m", ' ', "and continue like that")
Then read the csv file:
import csv

datafile = open('resource.csv', 'r')
reader = csv.reader(datafile)
for col in reader:
    print(col[0], ' ', col[1], ' ', col[2], ' ', "and continue depending on the number of columns")
This is not the optimized solution, but since it looks like you are new, this will help you understand better. Otherwise you can use row_format print options in Python 2.7.
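Since that last answer mentions the row_format approach, here is a minimal sketch of what it can look like on Python 2.7, using the column headers and resource.txt file from the question (the column widths are arbitrary):

headers = ['Design name', 'LUT', 'Lut as m', 'Lut as I', 'FF', 'DSP', 'BRAM']
row_format = '{:<15}' + '{:<10}' * (len(headers) - 1)

print(row_format.format(*headers))
print('-' * (15 + 10 * (len(headers) - 1)))
with open('resource.txt', 'r') as f:
    for line in f:
        # skip blank lines so format always gets the expected number of fields
        if line.strip():
            print(row_format.format(*line.split()))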
Here is code to print the data as a nice table. You transfer all your data into lists and then print them; alternatively you can transfer each line of the text file into one list and print it.
from beautifultable import BeautifulTable
h0=["jkgjkg"]
h1=[2,3]
h2=[2,3]
h3=[2,3]
h4=[2,3]
h5=[2,3]
h0.append("FPGA resources")
table = BeautifulTable()
table.column_headers = h0
table.append_row(h1)
table.append_row(h2)
table.append_row(h3)
table.append_row(h4)
table.append_row(h5)
print(table)
Output:
+--------+----------------+
| jkgjkg | FPGA resources |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
| 2 | 3 |
+--------+----------------+
