Slice DataFrame at specific points and plot each slice - python

I am new to programming and Python; could you help me?
I have a dataframe that looks like this:
import pandas as pd

d = {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
df = pd.DataFrame(data=d)
I want to slice the data whenever value == 100 and then plot all the slices in one figure.
So my questions are: how do I slice or cut the data as described, and what is the best structure to hold the slices for plotting?
Note 1: the value column has no frequency I can use; it varies from 0 to 100, and time is arbitrary.
Note 2: I already tried the solution below, but I get the same table back:
decreased_value = df[df['value'] <= 100][['time', 'value']].reset_index(drop=True)
How can I slice one column in a dataframe to several series based on a condition
Thanks in advance!

EDIT:
Here's a simpler way of handling my first answer (thanks to @aneroid for the suggestion).
Get the indices where value == 100 and add 1 so that these rows land at the bottom of each slice:
indices = df.index[df['value'] == 100] + 1
Then use numpy.split (thanks to this answer for that method) to make a list of dataframes:
df_list = np.split(df, indices)
Then do your plotting for each slice in a for loop:
for df in df_list:
    # --- plot based on df here ---
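Putting it together, here's a minimal end-to-end sketch of this approach using the dataframe from the question (assuming matplotlib for the plotting):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

d = {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
df = pd.DataFrame(data=d)

# Split just below each row where value == 100,
# so each 100 lands at the bottom of its slice
indices = df.index[df['value'] == 100] + 1
df_list = np.split(df, indices)

# Plot every slice into the same figure
for chunk in df_list:
    plt.plot(chunk['time'], chunk['value'])
plt.show()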
VERBOSE / FROM SCRATCH METHOD:
You can get the indices for where value==100 like this:
indices = df.index[df.value==100]
Then add the smallest and largest indices in order to not leave out the beginning and end of the df:
indices = indices.insert(0,0).to_list()
indices.append(df.index[-1]+1)
Then cycle through a while loop to cut up the dataframe and put each slice into a list of dataframes:
i = 0
df_list = []
while i + 1 < len(indices):
    df_list.append(df.iloc[indices[i]:indices[i+1]])
    i += 1

I already solved the problem using a for loop, which slices and plots at the same time without using the np.split function, and also keeps the data structure intact.
Thanks to the previous answer by @k_n_c, which helped me improve it.
slices = df.index[df['value'] == 100]
slices = slices + 1
slices = np.append(slices, df.index[-1] + 1)
prev_ind = 0
for ind in slices:
    temp = df.iloc[prev_ind:ind, :]
    plt.plot(temp.time, temp.value)
    prev_ind = ind
plt.show()
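For completeness, a compact alternative sketch (my own suggestion, not from the answers above): label each slice with a cumulative sum and let groupby do the splitting. Shifting the mask by one row moves each group boundary to the row after a 100, so each 100 stays at the bottom of its slice:

# True one row after each value == 100; cumsum turns that into group ids
group_id = df['value'].eq(100).shift(fill_value=False).cumsum()
for _, chunk in df.groupby(group_id):
    plt.plot(chunk['time'], chunk['value'])
plt.show()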

Related

how to select rows from a dataframe between two lists of index?

I'm new to Python, and I'm working on a very raw dataset.
Let's say I have two index lists, x and y:
x is [1, 10, 20, 25, 37]
y is [6, 15, 24, 29, 39]
I know how to select rows based on 1 index list, such as:
df.iloc[x]
but I was wondering if there is any way I could select the rows between the two index lists?
something like:
df.iloc[x:y]
so, from the above, I would get all the rows between 1:6, 10:15, and so on.
Thank you so much!
P.S. After I get all the rows between the two lists, there will be many dataframes, I guess. How could I loop over that?
so far I can think of:
for x, y in zip(list_1, list_2):
    pd.merge(df.iloc[x:y])
but it's wrong :(
You could write it as df.iloc[x[i]:y[j]], where i and j can be used as indexes.
Yes, you can do that; the code will be like df.iloc[x[0]:y[0]].
The indexes into lists x and y can be variables if you have some logic to run.
TRY:
x = [1, 10, 20, 25, 37]
y = [6, 15, 24, 29, 39]
for i, j in zip(x, y):
    # select rows from i to j and then select all columns
    print(df.iloc[i:j, :])
If, instead of printing, you want to concat all the dfs, use:
result = pd.concat([df.iloc[i:j, :] for i, j in zip(x, y)])
NOTE: You can use iloc as shown below, and you can also play with the step size if you want:
dataframe.iloc[start_index:end_index, start_col:end_col]
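For instance, a small sketch of the step size in action (the bounds here are just illustrative):

# every 2nd row among positions 0-9, first two columns
subset = df.iloc[0:10:2, 0:2]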
Taking advantage of boolean indexing, the following can be done:
import numpy as np
import pandas as pd
df = pd.DataFrame() # your dataframe here
x = [1, 10, 20, 25, 37]
y = [6, 15, 24, 29, 39]
indices = np.zeros(len(df.index), dtype=bool)  # .loc needs a boolean mask
for i, j in zip(x, y):
    indices[i:j] = True
print(df.loc[indices])

Quickly remove outliers from list in Python?

I have many long lists of time and temperature values, which have the following structure:
list1 = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74], [6, 73], [7, 71], [8, 92], [9, 73]]
Some of the time/temperature pairs are incorrect spikes in the data. For example, at time 8 it spiked to 92 degrees. I would like to get rid of these sudden jumps or dips in the temperature values.
To do this, I wrote the following code (I removed the stuff that isn't necessary and only copied the part that removes the spikes/outliers):
outlierpercent = 3
for i in values:
    temperature = i[1]
    index = values.index(i)
    if index > 0:
        prevtemp = values[index - 1][1]
        pctdiff = (temperature / prevtemp - 1) * 100
        if abs(pctdiff) > outlierpercent:
            outliers.append(i)
While this works (outlierpercent sets the minimum percentage difference required for a pair to count as a spike), it takes a super long time (5-10 minutes per list). My lists are extremely long (around 5 million data points each), and I have hundreds of lists.
I was wondering if there is a much quicker way of doing this? My main concern here is time. There are other similar questions, but the answers don't seem to be efficient for super long lists with this structure, so I'm not sure how to do it. Thanks!
outlierpercent = 3
for index in range(1, len(values)):
    temperature = values[index][1]
    prevtemp = values[index - 1][1]
    pctdiff = (temperature / prevtemp - 1) * 100
    if abs(pctdiff) > outlierpercent:
        outliers.append(index)
This should do a lot better with time: iterating by position avoids the values.index(i) lookup, which itself scans the whole list on every iteration.
Update:
The issue of only the first outlier being removed is that after we flag an outlier, the next iteration compares against the temp from the removed outlier (prevtemp = values[index-1][1]).
I believe you can avoid that by tracking the last valid temp explicitly. Something like this:
outlierpercent = 3
prevtemp = values[0][1]
for index in range(1, len(values)):
    temperature = values[index][1]
    pctdiff = (temperature / prevtemp - 1) * 100
    # outlier - add to list and don't update prev temp
    if abs(pctdiff) > outlierpercent:
        outliers.append(index)
    # valid temp - update prev temp to the current reading
    else:
        prevtemp = temperature
Using NumPy to speed up the computation
With
values = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74], [6, 73], [7, 71], [8, 92], [9, 73]]
NumPy code:
import numpy as np

outlierpercent = 3
# Convert the list to a NumPy array
a = np.array(values)
# Percent difference of each temperature relative to the previous one
b = np.diff(a[:, 1]) * 100 / a[:-1, 1]
# Indices of outliers in b (add one when indexing a, since b is one
# element short due to computing the difference)
outlier_indices = np.where(np.abs(b) > outlierpercent)[0]
if outlier_indices.size:
    print(a[outlier_indices + 1])
# Output: list of outliers, same as the original code
[[ 8 92]
 [ 9 73]]
This should make two lists, valids and outliers.
I tried to keep math operations to a minimum for speed.
Pardon any typos; this was composed at the keyboard, untested.
lolim = None
outliers = []
outlierpercent = 3.0
lower_mult = (100.0 - outlierpercent) / 100.0
upper_mult = (100.0 + outlierpercent) / 100.0
for index, temp in values:
    if lolim is None:
        valids = [[index, temp]]  # start the valid list
        lolim, hilim = lower_mult * temp, upper_mult * temp  # create initial range
    else:
        if lolim <= temp <= hilim:
            valids.append([index, temp])  # new valid entry
            lolim, hilim = lower_mult * temp, upper_mult * temp  # update range
        else:
            outliers.append([index, temp])  # save outlier, keep old range
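Since the snippet above was posted untested, here is a self-contained sketch of the same idea wrapped in a function, run against the sample list from the question (the expected output is my own check, not part of the original answer):

def split_outliers(values, outlierpercent=3.0):
    """Split [time, temp] pairs into valids and outliers, comparing each
    temperature against a band around the last accepted reading."""
    lower_mult = (100.0 - outlierpercent) / 100.0
    upper_mult = (100.0 + outlierpercent) / 100.0
    valids, outliers = [], []
    lolim = hilim = None
    for index, temp in values:
        if lolim is None or lolim <= temp <= hilim:
            valids.append([index, temp])
            lolim, hilim = lower_mult * temp, upper_mult * temp
        else:
            outliers.append([index, temp])
    return valids, outliers

list1 = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74],
         [6, 73], [7, 71], [8, 92], [9, 73]]
valids, outliers = split_outliers(list1)
print(outliers)  # [[8, 92]] -- only the spike is flagged

One nice property of this approach over the pairwise percent difference: the comparison band only moves when a reading is accepted, so a single spike cannot drag the reference temperature along with it.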

Finding Dates in one array based on the ranges from another array and closest value

I have two nested NumPy arrays (dateValArr & searchDates). dateValArr contains all dates for May 2011 (1st - 31st) and an associated value for each date. searchDates contains 2 dates and an associated value as well (the 2 dates correspond to a date range).
Using the date ranges specified in the searchDates array, I want to find the matching dates in the dateValArr array. Then, for those selected dates in dateValArr, I want to find the value closest to the specified value in searchDates.
I have come up with this code, but the first part only works if a single value is specified.
#setup arrays ---------------------------------------------------------------------------
import numpy as np
import pandas as pd
from datetime import datetime

# Generate dates
st_date = '2011-05-01'
ed_date = '2011-05-31'
dates = pd.date_range(st_date, ed_date).to_numpy(dtype=object)
# Generate values
val_arr = np.random.uniform(1, 12, 31)
dateValLs = []
for i, j in zip(dates, val_arr):
    dateValLs.append((i, j))
dateValArr = np.asarray(dateValLs)
print(dateValArr)
#out:
[[Timestamp('2011-05-01 00:00:00', freq='D') 7.667399233149668]
[Timestamp('2011-05-02 00:00:00', freq='D') 5.906099813052642]
[Timestamp('2011-05-03 00:00:00', freq='D') 3.254485533826182]
...]
#Generate search dates
searchDates = np.array([(datetime(2011,5,11),datetime(2011,5,20),9),(datetime(2011,5,25),datetime(2011,5,29),2)])
print(searchDates)
#out:
[[datetime.datetime(2011, 5, 11, 0, 0) datetime.datetime(2011, 5, 20, 0, 0) 9]
[datetime.datetime(2011, 5, 25, 0, 0) datetime.datetime(2011, 5, 29, 0, 0) 2]]
#end setup ------------------------------------------------------------------------------
x = np.where(np.logical_and(dateValArr[:, 0] > searchDates[0][0], dateValArr[:, 0] < searchDates[0][1]))
print(x)
#out: (array([11, 12, 13, 14, 15, 16, 17, 18], dtype=int64),)
However, the code works only if I select the first element of searchDates (searchDates[0][0]). It will not run for all the values in searchDates. What I mean is, if I replace it with the following code:
x = np.where(np.logical_and(dateValArr[:, 0] > searchDates[0], dateValArr[:, 0] < searchDates[0]))
then I get the following error: operands could not be broadcast together with shapes (31,) (3,)
To find the closest value, I was hoping to somehow combine it with the following line of code:
n = (np.abs(dateValArr[:, 1] - searchDates[:, 2])).argmin()
Any ideas on how to solve it? Thanks in advance.
The only thing that came to my mind is a for loop.
result = np.array([])
for search_term in searchDates:
    mask = (dateValArr[:, 0] > search_term[0]) & (dateValArr[:, 0] < search_term[1])
    date_search_result = dateValArr[mask, :]
    d = np.abs(date_search_result[:, 1] - search_term[2])
    result = np.hstack([result, date_search_result[d.argmin()]])
print(result)
I kind of figured it out as well:
date_value = []
for i in searchDates:
    dateidx_arr = np.where(np.logical_and(dateValArr[:, 0] >= i[0], dateValArr[:, 0] <= i[1]))  # get the indices of the specified date range
    date_arr = dateValArr[dateidx_arr]  # based on the indices, get the dates and values
    value_arr = (np.abs(date_arr[:, 1] - i[2])).argmin()  # for those dates, find the index of the closest value
    date_value.append(date_arr[value_arr])  # use the index to get the closest date and value
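If the loop itself ever becomes a bottleneck, a possible vectorized sketch of the range lookup uses np.searchsorted. This is my own suggestion, not from the answers above, and it assumes the date column of dateValArr is sorted ascending (true for pd.date_range output):

# First index with date >= start and first index with date > end,
# so each half-open window [s, e) covers the inclusive date range
starts = np.searchsorted(dateValArr[:, 0], searchDates[:, 0], side='left')
ends = np.searchsorted(dateValArr[:, 0], searchDates[:, 1], side='right')

date_value = []
for s, e, target in zip(starts, ends, searchDates[:, 2]):
    window = dateValArr[s:e]
    date_value.append(window[np.abs(window[:, 1] - target).argmin()])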

Python: Efficient looping in dataframe to find duplicates for multiple columns

I am using Python and I want to go through a dataset and highlight the most used locations.
This is my dataset (but with 300,000+ records):
Longitude Latitude
14.28586 48.3069
14.28577 48.30687
14.28555 48.30678
14.28541 48.30673
First I add a density column:
df['Density'] = 0
And this is the code that I am using to increase the density value for each record:
for index in range(0, len(df)):
    for index2 in range(index + 1, len(df)):
        if df['Longitude'].loc[index] == df['Longitude'].loc[index2] and df['Latitude'].loc[index] == df['Latitude'].loc[index2]:
            df['Density'].loc[index] += 1
            df['Density'].loc[index2] += 1
            print("match")
    print(str(index) + "/" + str(len(df)))
The code above simply iterates through the dataframe, comparing each record against all the later records (inner loop); when a match is found, both of their density values are incremented.
I want to find the longitudes and latitudes that match and increase their density value.
The code is obviously very slow, and I am sure Python has a cool technique for doing something like this. Any ideas?
You can use duplicated, groupby, transform & sum to achieve this:
Let's create a sample dataset that actually has duplicates:
df = pd.DataFrame({'lat': [0, 0, 0, 1, 1, 2, 2, 2],
                   'lon': [1, 1, 2, 1, 0, 2, 2, 2]})
First flag the duplicate rows based on lat & lon, then apply the transform to create a new column:
df['is_dup'] = df[['lat', 'lon']].duplicated()
df['dups'] = df.groupby(['lat','lon']).is_dup.transform(np.sum)
# df outputs:
   lat  lon  is_dup  dups
0    0    1   False     1
1    0    1    True     1
2    0    2   False     0
3    1    1   False     0
4    1    0   False     0
5    2    2   False     2
6    2    2    True     2
7    2    2    True     2
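If the end goal is just the Density column from the question (for each row, how many other rows share its coordinates), a shorter sketch along the same lines, using the question's column names:

# size of each (Longitude, Latitude) group, minus the row itself
df['Density'] = df.groupby(['Longitude', 'Latitude'])['Longitude'].transform('size') - 1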

Numpy array conditional matching

I need to match two very large NumPy arrays (one is 20000 rows, the other about 100000 rows) and I am trying to build a script that does it efficiently. Simple looping over the arrays is incredibly slow; can someone suggest a better way?

Here is what I am trying to do: the arrays datesSecondDict and pwfs2Dates contain datetime values. I need to take each datetime value from pwfs2Dates (the smaller array) and see whether there is a datetime value like it (plus or minus 5 minutes) in datesSecondDict (there might be more than one). If there is one (or more), I populate a new array (of the same size as pwfs2Dates) with the value (one of the values) from valsSecondDict (which is just the array of numerical values corresponding to datesSecondDict).

Here is a solution by @unutbu and @joaquin that worked for me (thanks, guys!):
import time
import datetime as dt
import numpy as np

def combineArs(dict1, dict2):
    """Combine data from 2 dictionaries into a list.
    dict1 contains primary data (e.g. seeing parameter).
    The function compares each timestamp in dict1 to dict2
    to see if there is a matching timestamp record(s)
    in dict2 (plus/minus 5 minutes).
    ==If yes: a list called data gets appended with the
    corresponding parameter value from dict2.
    (Note that if there is more than 1 matching record,
    the first occurring value gets appended to the list.)
    ==If no: a list called data gets appended with 0."""
    # Specify the keys to use
    pwfs2Key = 'pwfs2:dc:seeing'
    dimmKey = 'ws:seeFwhm'
    # Create an iterator for the primary dict
    datesPrimDictIter = iter(dict1[pwfs2Key]['datetimes'])
    # Take the first timestamp value in the primary dict
    nextDatePrimDict = next(datesPrimDictIter)
    # Split the second dictionary into lists
    datesSecondDict = dict2[dimmKey]['datetime']
    valsSecondDict = dict2[dimmKey]['values']
    # Define the time window
    fiveMins = dt.timedelta(minutes=5)
    data = []
    #st = time.time()
    for i, nextDateSecondDict in enumerate(datesSecondDict):
        try:
            while nextDatePrimDict < nextDateSecondDict - fiveMins:
                # If there is no match: append zero and move on
                data.append(0)
                nextDatePrimDict = next(datesPrimDictIter)
            while nextDatePrimDict < nextDateSecondDict + fiveMins:
                # If there is a match: append the value from the second dict
                data.append(valsSecondDict[i])
                nextDatePrimDict = next(datesPrimDictIter)
        except StopIteration:
            break
    data = np.array(data)
    #st = time.time() - st
    return data
Thanks,
Aina.
Are the array dates sorted?
If yes, you can speed up your comparisons by breaking out of the inner loop once its dates are bigger than the date given by the outer loop. This way you make a one-pass comparison instead of looping over the dimVals items len(pwfs2Vals) times.
If no, maybe you should transform the current pwfs2Dates array into, for example, an array of pairs [(date, array_index), ...]; then you can sort all your arrays by date, make the one-pass comparison indicated above, and at the same time still get the original indexes needed to set data[i].
For example, if the arrays were already sorted (I use lists here; not sure you need arrays for this):
(Edited: now using an iterator so as not to loop over pwfs2Dates from the beginning on each step):
pdates = iter(enumerate(pwfs2Dates))
i, datei = pdates.next()
for datej, valuej in zip(dimmDates, dimVals):
    while datei < datej - fiveMinutes:
        i, datei = pdates.next()
    while datei < datej + fiveMinutes:
        data[i] = valuej
        i, datei = pdates.next()
Otherwise, if they were not ordered and you created the sorted, indexed lists like this:
pwfs2Dates = sorted([(date, idx) for idx, date in enumerate(pwfs2Dates)])
dimmDates = sorted([(date, idx) for idx, date in enumerate(dimmDates)])
the code would be:
(Edited: now using an iterator so as not to loop over pwfs2Dates from the beginning on each step):
pdates = iter(pwfs2Dates)
datei, i = pdates.next()
for datej, j in dimmDates:
    while datei < datej - fiveMinutes:
        datei, i = pdates.next()
    while datei < datej + fiveMinutes:
        data[i] = dimVals[j]
        datei, i = pdates.next()
Note that dimVals:
dimVals = np.array(dict1[dimmKey]['values'])
is not used in your code and can be eliminated.
Note that your code gets greatly simplified by looping through the array itself instead of using xrange.
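For instance (a generic Python 2 sketch; process here is a hypothetical stand-in for whatever is done with each element):

# index-based loop:
for k in xrange(len(pwfs2Dates)):
    process(pwfs2Dates[k])

# looping through the array itself:
for date in pwfs2Dates:
    process(date)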
Edit: The answer from unutbu addresses some weak parts in the code above. I indicate them here for completeness:
Use of next: next(iterator) is preferred to iterator.next(). iterator.next() is an exception to a conventional naming rule, fixed in py3k by renaming the method to iterator.__next__().
Check for the end of the iterator with a try/except. After all the items in the iterator are exhausted, the next call to next() raises a StopIteration exception. Use try/except to gracefully break out of the loop when that happens. For the specific case of the OP's question this is not an issue, because the two arrays are the same size, so the for loop finishes at the same time as the iterator and no exception is raised. However, there could be cases where dict1 and dict2 are not the same size, and then there is the possibility of an exception being raised.
The question is: which is better, to use try/except, or to prepare the arrays before looping by equalizing them to the shorter one?
Building on joaquin's idea:
import datetime as dt
import itertools

def combineArs(dict1, dict2, delta=dt.timedelta(minutes=5)):
    marks = dict1['datetime']
    values = dict1['values']
    pdates = iter(dict2['datetime'])
    data = []
    datei = next(pdates)
    for datej, val in itertools.izip(marks, values):
        try:
            while datei < datej - delta:
                data.append(0)
                datei = next(pdates)
            while datei < datej + delta:
                data.append(val)
                datei = next(pdates)
        except StopIteration:
            break
    return data
dict1 = {'ws:seeFwhm':
         {'datetime': [dt.datetime(2011, 12, 19, 12, 0, 0),
                       dt.datetime(2011, 12, 19, 12, 1, 0),
                       dt.datetime(2011, 12, 19, 12, 20, 0),
                       dt.datetime(2011, 12, 19, 12, 22, 0),
                       dt.datetime(2011, 12, 19, 12, 40, 0), ],
          'values': [1, 2, 3, 4, 5]}}
dict2 = {'pwfs2:dc:seeing':
         {'datetime': [dt.datetime(2011, 12, 19, 12, 9),
                       dt.datetime(2011, 12, 19, 12, 19),
                       dt.datetime(2011, 12, 19, 12, 29),
                       dt.datetime(2011, 12, 19, 12, 39),
                       ], }}
if __name__ == '__main__':
    dimmKey = 'ws:seeFwhm'
    pwfs2Key = 'pwfs2:dc:seeing'
    print(combineArs(dict1[dimmKey], dict2[pwfs2Key]))
yields
[0, 3, 0, 5]
I think you can do it with one fewer loop:
import datetime
import numpy

# Test data
# Create an array of dates spaced at 1 minute intervals
m = range(1, 21)
n = datetime.datetime.now()
a = numpy.array([n + datetime.timedelta(minutes=i) for i in m])
# A smaller array with three of those dates
m = [5, 10, 15]
b = numpy.array([n + datetime.timedelta(minutes=i) for i in m])
# End of test data

def date_range(date_array, single_date, delta):
    plus = single_date + datetime.timedelta(minutes=delta)
    minus = single_date - datetime.timedelta(minutes=delta)
    return date_array[(date_array < plus) * (date_array > minus)]

dates = []
for i in b:
    dates.append(date_range(a, i, 5))

all_matches = numpy.unique(numpy.array(dates).flatten())
There is surely a better way to gather and merge the matches, but you get the idea... You could also use numpy.argwhere((a < plus) * (a > minus)) to return the indexes instead of the dates, and use the indexes to grab the whole row and place it into your new array.
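For example, a small sketch of that index-based variant, reusing a, b and the 5-minute delta from the test data above:

# indexes of dates in a within +/- 5 minutes of the first search date
plus = b[0] + datetime.timedelta(minutes=5)
minus = b[0] - datetime.timedelta(minutes=5)
idx = numpy.argwhere((a < plus) * (a > minus)).ravel()
matched = a[idx]  # with a 2-D array, idx could grab whole rows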
