Pandas remove rows based on timestamps - python

Thanks in advance for checking out my question and helping me out!
Basically, what I am trying to do is remove MAC addresses which haven't been detected in the past hour, or even longer than that.
I collect probe requests from my wifi network with a timestamp for each MAC captured. The data is processed using Pandas. The dataframe has only 2 columns: 'MAC' and 'TIME' (in strftime format). Below is a screenshot of my dataframe.
As you can see, I only consider rows which have the same MAC address to be 'duplicates'. My problem is that I can't work out, for each duplicated MAC address, the time gap between the last entry and the one before it.
(screenshots: MAC csv, MAC csv2)
What I have tried so far:
I tried to use groupby and tail(2) to group the data by MAC and take the last 2 entries. However, when there are several duplicated MACs in the dataframe this won't work, because this method seems to only work on the last two entries.
Here is the code I tried:
from time import sleep, time
import pandas as pd

def CheckListCleaner(inputDF) -> pd.DataFrame:
    sleep(60 - time() % 60)  # wait for the start of the next minute
    cond1 = inputDF.groupby("MAC").count() > 1
    cond2 = inputDF.groupby("MAC").tail(2).diff() > 3600
    combined_cond = cond1.mul(cond2)
    combined_cond["M1"] = combined_cond.index
    combined_cond.rename({"T": "val"}, axis=1, inplace=True)
    out = inputDF.merge(combined_cond, left_on="MAC", right_on="M1")
    listToDel = out[~out["val"]]
    return listToDel
I am open to any new ideas. I am also wondering whether there are easier ways or libraries I could use to make this work without so many groupby calls and conditions.
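For illustration, here is a minimal sketch of one possible approach on made-up data (groupby + tail(2) + diff on a 'MAC'/'TIME' frame like the one described above; the sample values, the one-hour threshold and the variable names are assumptions):

import pandas as pd

# made-up data; TIME is assumed to hold '%H:%M:%S' strings as produced
# by the capture code in the P.S. below
df = pd.DataFrame({
    'MAC':  ['aa:bb', 'aa:bb', 'cc:dd', 'aa:bb', 'cc:dd'],
    'TIME': ['10:00:00', '10:20:00', '10:30:00', '12:05:00', '10:31:00'],
})
df['TIME'] = pd.to_datetime(df['TIME'], format='%H:%M:%S')

# gap in seconds between the last two sightings of each MAC
# (NaN for MACs seen only once)
last_two = df.sort_values('TIME').groupby('MAC').tail(2)
gaps = last_two.groupby('MAC')['TIME'].agg(
    lambda s: s.diff().iloc[-1].total_seconds() if len(s) > 1 else float('nan'))
print(gaps)

# keep only MACs whose latest sighting falls within the last hour of the capture
cutoff = df['TIME'].max() - pd.Timedelta(hours=1)
latest = df.groupby('MAC')['TIME'].max()
df_recent = df[df['MAC'].isin(latest[latest >= cutoff].index)]
print(df_recent)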
P.S. In case you wonder how I captured these MAC addresses: I am only interested in type 2 transmitter MACs. Below is the code I used to collect them.
def PacketHandler(pkt):
    if pkt.haslayer(Dot11):
        if pkt.type == 0:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))
        if pkt.type == 1 and pkt.addr2 is not None:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))
        if pkt.type == 2 and pkt.addr2 is not None:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))
        if pkt.type == 3 and pkt.addr2 is not None:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))

Related

Python/ R code is taking too long to extract pairwise information from dataset. How to optimize?

The code was initially in R, but as R does not handle large datasets well, I converted it to Python and ported it to Google Colab. Even on Google Colab it took very long, and I never actually saw it finish running even after 8 hours. I also added more breaking statements to avoid unnecessary runs.
The dataset has around 50,000 unique timestamps and 40,000 unique ids. It is in the format ['time', 'id', 'x-coordinate', 'y-coordinate'], a very clear-cut passenger trajectory dataset.
What the code is trying to do is extract all the pairs of IDs which are 2 meters or less apart from each other in the same time frame.
Please let me know if there are ways to optimize this.
Here's a short overview of the data (my_data.head(10)): https://i.stack.imgur.com/YVdmB.png
i = 0
y = pd.DataFrame(columns=['source', 'dest'])  # empty contact network df
infectedGrp = [824, 11648, 23468]

while (i < my_data.shape[0]):
    row1 = my_data.iloc[i]
    id1 = row1[1]
    time1 = row1[0]
    x1 = row1[2]
    y1 = row1[3]
    infected1 = my_data.iloc[i, 4]
    infectious1 = my_data.iloc[i, 5]
    #print(row1)
    #print(time1)
    for j in range(i + 1, my_data.shape[0]):
        row2 = my_data.iloc[j]
        id2 = row2[1]
        time2 = row2[0]
        x2 = row2[2]
        y2 = row2[3]
        infected2 = my_data.iloc[j, 4]
        infectious2 = my_data.iloc[j, 5]
        print(time2)
        if (time2 != time1):
            i = i + 1
            print("diff time...breaking")
            break
        if (x2 > x1 + 2) or (x1 > x2 + 2):
            i = i + 1
            print("x more than 2...breaking")
            break
        if (y2 > y1 + 2) or (y1 > y2 + 2):
            i = i + 1
            print("y more than 2...breaking")
            break
        probability = 0
        distance = round(math.sqrt(pow((x1 - x2), 2) + pow((y1 - y2), 2)), 2)
        print(distance)
        print(infected1)
        print(infected2)
        if (distance <= R):
            if infectious1 and not infected2:  # if one person is infectious and the other is not infected
                probability = (1 - beta) * (1 / R) * (math.sqrt(R**2 - distance**2))
                print(probability)
                print("here")
                infected2 = decision(probability)
                numid2 = int(id2)  # update all entries for id2
                if (infected2):
                    my_data.loc[my_data['id'] == numid2, 'infected'] = True
                #my_data.iloc[j,7]=probability
            elif infectious2 and not infected1:
                infected1 = decision(probability)
                numid1 = int(id1)  # update all entries for id1
                if (infected1):
                    my_data.loc[my_data['id'] == numid1, 'infected'] = True
                #my_data.iloc[i,7]=probability
            inf1 = 'F'
            inf2 = 'F'
            if (infected1):
                inf1 = 'T'
            if (infected2):
                inf2 = 'T'
            print('prob ' + str(probability) + ' at time ' + str(time1))
            new_row = {'source': id1.astype(str) + ' ' + inf1, 'dest': id2.astype(str) + ' ' + inf2}
            y = y.append(new_row, ignore_index=True)
    i = i + 1
It's hard to tell for sure, but I think a good guess is that this line is your biggest "sin":
y = y.append(new_row, ignore_index=True)
You should not append rows to a dataframe in a loop.
You should aggregate them in a Python list and then create the DataFrame from all of them after the loop:
y = []
while (i < my_data.shape[0]):
    (...)
    y.append(new_row)
y = pd.DataFrame(y)
I also suggest using a line profiler to analyse which parts of the code are the bottlenecks.
You are using a nested loop to find time values that are equivalent. You can get a huge improvement by doing a groupby operation instead and then iterating through the groups.
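For illustration, a minimal sketch of that idea on made-up data (the column names 'time', 'id', 'x', 'y' and R = 2 are assumptions; adapt them to the real dataset):

import math
import pandas as pd

R = 2
my_data = pd.DataFrame({
    'time': [1, 1, 1, 2, 2],
    'id':   [10, 11, 12, 10, 11],
    'x':    [0.0, 1.0, 5.0, 0.0, 0.5],
    'y':    [0.0, 1.5, 5.0, 0.0, 0.5],
})

pairs = []
for t, grp in my_data.groupby('time'):           # compare only rows in the same time frame
    rows = list(grp[['id', 'x', 'y']].itertuples(index=False))
    for a in range(len(rows)):
        for b in range(a + 1, len(rows)):
            id1, x1, y1 = rows[a]
            id2, x2, y2 = rows[b]
            if math.hypot(x1 - x2, y1 - y2) <= R:
                pairs.append({'time': t, 'source': id1, 'dest': id2})

pairs_df = pd.DataFrame(pairs)                   # build the dataframe once, after the loop
print(pairs_df)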

filling in columns with info from other file based on condition

So there are 2 csv files I'm working with:
file 1:
City KWR1 KWR2 KWR3
Killeen
Killeen
Houston
Whatever
file2:
location link reviews
Killeen www.example.com 300
Killeen www.differentexample.com 200
Killeen www.example3.com 100
Killeen www.extraexample.com 20
Here's what I'm trying to make this code do:
look at the 'City' in file 1, take the top 3 links from file 2 (you can go ahead and assume the cities won't get mixed up), and then put these top 3 into the KWR1, KWR2, KWR3 columns for all the rows with the same 'City' value.
So it gets the top 3 and then just copies them to the right of all the same 'City' values.
Even asking this question correctly is difficult for me; I hope I've provided enough information.
I know how to read the files in with pandas and all that, I just can't code this exact situation...
It is a slightly unusual requirement, but I think you need three steps:
1. Keep only the first three values you actually need.
df = df.sort_values(by='reviews',ascending=False).groupby('location').head(3).reset_index()
Hopefully this keeps only the first three from every city.
Then you somehow need to label your data. There might be better ways to do this, but here is one way: you assign a new column with numbers and create a user-defined function.
import numpy as np
df['nums'] = np.arange(len(df))
Now you have a column full of numbers (kind of like line numbers)
You create your function then that will label your data...
def my_func(index):
    if index % 3 == 0:
        x = 'KWR' + str(1)
    elif index % 3 == 1:
        x = 'KWR' + str(2)
    elif index % 3 == 2:
        x = 'KWR' + str(3)
    return x
You can then create the labels you need:
df['labels'] = df.nums.apply(my_func)
Then you can do:
my_df = pd.pivot_table(df, values='reviews', index=['location'], columns='labels', aggfunc='max').reset_index()
This literally pulls out the labels (pivots) and puts the values into the right places.
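For what it's worth, here is a minimal end-to-end sketch of the same three steps on made-up data. It pivots the links (rather than the review counts), assuming the links are what should end up in KWR1..KWR3, and uses groupby().cumcount() as a compact alternative to the counter-plus-my_func labelling above:

import pandas as pd

file2 = pd.DataFrame({
    'location': ['Killeen', 'Killeen', 'Killeen', 'Killeen'],
    'link': ['www.example.com', 'www.differentexample.com',
             'www.example3.com', 'www.extraexample.com'],
    'reviews': [300, 200, 100, 20],
})

# 1. keep the top three links per city by review count
top3 = (file2.sort_values(by='reviews', ascending=False)
             .groupby('location').head(3).reset_index(drop=True))

# 2. label each row KWR1/KWR2/KWR3 within its city
top3['labels'] = 'KWR' + (top3.groupby('location').cumcount() + 1).astype(str)

# 3. pivot so each city becomes one row with KWR1-KWR3 columns
wide = pd.pivot_table(top3, values='link', index='location',
                      columns='labels', aggfunc='first').reset_index()
print(wide)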

Libre office calc and excel showing different value

I am trying to do some date parsing in Python, and while parsing I ran into this weird error:
time data 'nan' does not match format '%d/%m/%y'
When I checked my .csv file in LibreOffice Calc everything looked fine; no NaN values whatsoever. However, when I checked it in Excel (the Excel mobile version, since I don't want to pay) I saw a different value. The same cell was shown as follows in the two editors:
Libre office calc - 11/09/93
excel - ########.
Here is a screenshot below:
How could I change it in LibreOffice or Python so that these won't be treated as NaN values but as the real values they should be?
I don't have much knowledge of Excel and LibreOffice Calc, so any explanation of how to solve this simple issue would be welcome.
Here is the python code
import pandas as pd
from datetime import datetime as dt

loc = "C:/Data/"
season1993_94 = pd.read_csv(loc + '1993-94.csv')

def parse_date_type1(date):
    if date == '':
        return None
    return dt.strptime(date, '%d/%m/%y').date()

def parse_date_type2(date):
    if date == '':
        return None
    return dt.strptime(date, '%d/%m/%Y').date()

season1993_94.Date = season1993_94.Date.astype(str).apply(parse_date_type1)
Error:
<ipython-input-13-46ff7e1afe94> in <module>()
----> 1 season1993_94.Date = season1993_94.Date.astype(str).apply(parse_date_type1)
ValueError: time data 'nan' does not match format '%d/%m/%y'
PS: If the question seems inappropriate as per the context given, please feel free to edit it.
To see what is going on, use a text editor such as Notepad++. Viewing with Excel or Calc may not show the problem; at least, the problem cannot be seen from the images in the question.
The error occurs with a CSV file consisting of the following three lines.
Date,Place
28/08/93,Southampton
,Newcastle
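Incidentally, the 'nan' in the error message comes from pandas itself: the empty Date field is read as NaN, and .astype(str) turns it into the literal string 'nan', so the if date == '': check never matches and strptime raises. A tiny illustration (a made-up example, not from the original post):

import numpy as np
import pandas as pd

s = pd.Series(['28/08/93', np.nan])
print(s.astype(str).tolist())   # ['28/08/93', 'nan']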
Here is the solution, adapted from How to convert string to datetime with nulls - python, pandas?
season1993_94['Date'] = pd.to_datetime(season1993_94['Date'], errors='coerce')
The result:
>>> season1993_94
Date Place
0 1993-08-28 Southampton
1 NaT Newcastle

Compare each pair of dates in two columns in python efficiently

I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And I have begun to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"
This works; it is just taking a really, really long time (over 15 minutes). I need more efficient code - is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
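For example, if you also want the same 'correct'/'same'/'incorrect' labels written to a column, a minimal sketch using numpy.select (on made-up data, assuming the columns are, or are converted to, datetimes) could look like this:

import numpy as np
import pandas as pd

dates = pd.DataFrame({'Start': ['2008-10-01', '2006-07-01', '1969-01-01'],
                      'End':   ['2008-10-31', '2006-07-01', '1968-12-31']})
dates[['Start', 'End']] = dates[['Start', 'End']].apply(pd.to_datetime)

# pick the first matching label per row; rows matching neither condition get the default
dates['Correct'] = np.select(
    [dates['Start'] < dates['End'], dates['Start'] == dates['End']],
    ['correct', 'same'],
    default='incorrect',
)
print(dates)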
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
Something like the following may be quicker:
import pandas as pd
import datetime
df = pd.DataFrame({
    'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
    'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})

def comparison_check(df):
    start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447/3 µs) * 14,000 = 149 µs * 14,000 ≈ 2.086 s, which is quite a bit shorter than 15 minutes :)

Detection of variable length pattern in pandas dataframe column

The last 2 columns of a time-series-indexed dataframe identify the start ('A' or 'AA' or 'AAA'), end ('F' or 'FF' or 'FFF') and duration (number of rows between start and end) of a physical process. They look like this (screenshot omitted), and the A-F sequences, as well as the 'n' sequences between them, are of variable length.
How can I identify these patterns and for each of them calculate averages of other columns for the corresponding rows?
What I, very badly, tried to do is the following:
import pandas as pd
import xlrd

##### EXCEL LOAD
filepath = 'H:\\CCGT GE startup.xlsx'
df = pd.read_excel(filepath, sheet_name='Sheet1', header=0, skiprows=0, parse_cols='A:CO', index_col=0)
df = df.sort_index()  # set increasing time index, source data is time decreasing

gas = []
for i, row in df.iterrows():
    if df['FLAG STARTUP TG1'] is not 'n':
        while 'F' not in df['FLAG STARTUP TG1']:
            gas.append(df['PORTATA GREZZA TG1 - m3/h'])
            gas.append(i)
But the script gets stuck on the first if (it doesn't match the 'n' condition and keeps appending the same row, i pair).
p.s. the first 1000 rows df is here http://www.filedropper.com/ccgtgestartup1000
p.p.s. Besides not working, my method is also wrong in excluding the last 'F' row that still pertains to the same process and should be considered as part of it!
p.p.p.s. The 2 columns refer to 2 different processes/machines and are unrelated (almost; more on this later). I want to do the same analysis on both (they will refer to different columns' averages). The first 'A' string marks the beginning of the process and gets repeated until the last timestamp, which gets marked with an 'F' string. In the original file the timestamps are descending, and that's why I used the sort_index() method. The string length depends on other columns' values, but the obvious correlation between the FLAG columns is only in the 3-character strings 'AAA' & 'FFF', because this should occur only if the 2 processes start within +-1 timestamp of each other.
This is how I managed to get the desired results (N.B. I later decided that only the single-character 'A'-->'F' sequences are of interest):
import pandas as pd
import numpy as np

##### EXCEL LOAD
filepath = 'H:\\CCGT GE startup.xlsx'
df = pd.read_excel(filepath, sheet_name='Sheet1', header=0, skiprows=0, parse_cols='A:CO', index_col=0)
df = df.sort_index()  # set increasing time index, source data is time decreasing

tg1 = pd.DataFrame(index=df.index.copy(), columns=['counter', 'flag', 'gas', 'p', 'raw_p', 'tv_p', 'lhv', 'fs'])

k = 0
for i, row in df.iterrows():
    if 'A' == str(row['FLAG STARTUP TG1']):
        tg1.ix[i, 'flag'] = row['FLAG STARTUP TG1']
        tg1.ix[i, 'gas'] = row['Portata gas naturale']
        tg1.ix[i, 'counter'] = k
        tg1.ix[i, 'fs'] = row['1FIRED START COUNT - N°']
        tg1.ix[i, 'p'] = row['POTENZA ATTIVA MONTANTE 1 SU 400 KV - MW']
        tg1.ix[i, 'raw_p'] = row['POTENZA ATTIVA MONTANTE 1 SU 15 KV - MW']
        tg1.ix[i, 'tv_p'] = row['POTENZA ATTIVA MONTANTE TV - MW']
        tg1.ix[i, 'lhv'] = row['LHV - MJ/Sm3']
    elif 'F' == str(row['FLAG STARTUP TG1']):
        tg1.ix[i, 'flag'] = row['FLAG STARTUP TG1']
        tg1.ix[i, 'gas'] = row['Portata gas naturale']
        tg1.ix[i, 'counter'] = k
        tg1.ix[i, 'fs'] = row['1FIRED START COUNT - N°']
        tg1.ix[i, 'p'] = row['POTENZA ATTIVA MONTANTE 1 SU 400 KV - MW']
        tg1.ix[i, 'raw_p'] = row['POTENZA ATTIVA MONTANTE 1 SU 15 KV - MW']
        tg1.ix[i, 'tv_p'] = row['POTENZA ATTIVA MONTANTE TV - MW']
        tg1.ix[i, 'lhv'] = row['LHV - MJ/Sm3']
        k += 1

tg1 = tg1.dropna(axis=0)
tg1 = tg1[tg1['gas'] != 0]  # data where gas flow measurement is missing is dropped
tg1 = tg1.convert_objects(convert_numeric=True)

# timestamp count for each startup for duration calculation
counts = pd.DataFrame(tg1['counter'].value_counts(), columns=['duration'])
counts['start'] = counts.index
counts = counts.set_index(np.arange(len(tg1['counter'].value_counts())))
tg1 = tg1.merge(counts, how='inner', left_on='counter', right_on='start')

# filter out non pertinent startups (too long or too short)
tg1 = tg1[tg1['duration'].isin([6, 7])]

# calculate thermal input per start (process)
table = tg1.groupby(['counter']).mean()
table['t_in'] = table.apply((lambda row: row['gas'] * row['duration'] * 0.25 * row['lhv'] / 3600), axis=1)
Any improvements and suggestions for doing the calculations within the iteration and avoiding all the "prep-work" after it are welcome.
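For what it's worth, the row-by-row loop (and the value_counts/merge prep-work) can usually be replaced by building a per-process counter directly from the flag column and grouping on it. A minimal sketch on made-up data, assuming single-character flags in a column named 'flag' (column and value names here are illustrative, not the ones in the real file):

import pandas as pd

# a process is a run of 'A' rows terminated by one 'F' row, with 'n' rows in between
df = pd.DataFrame({
    'flag': ['n', 'A', 'A', 'A', 'F', 'n', 'n', 'A', 'A', 'F', 'n'],
    'gas':  [0.0, 10., 12., 11., 9., 0.0, 0.0, 8., 9., 7., 0.0],
})

in_process = df['flag'].isin(['A', 'F'])                    # rows belonging to some process
starts = (df['flag'] == 'A') & (df['flag'].shift() != 'A')  # first 'A' of each run
df['counter'] = starts.cumsum().where(in_process)           # one id per process, NaN elsewhere

grouped = df.dropna(subset=['counter']).groupby('counter')
summary = pd.DataFrame({'duration': grouped['flag'].size(),  # rows per process, incl. the 'F' row
                        'mean_gas': grouped['gas'].mean()})
print(summary)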
