Sampling a dataframe by selecting rows where the location modulo P = Q - python

Let's say I have a dataframe with N rows. I want to pick the rows where the row location modulo P gives Q. So for concreteness, let's say P = 7 and Q = 5.
Row 0: 0 mod 7 = 0 (not satisfied)
Row 1: 1 mod 7 = 1 (not satisfied)
...
Row 5: 5 mod 7 = 5 (satisfied)
...
Row 12: 12 mod 7 = 5 (satisfied)
So the rows that are selected will be 5, 12, 19, 26 ....
If Q=0, you can use the slicing method df.iloc[::P]. How does one do it for mod P = Q?

df.iloc[Q::P] does this: start at row Q, then step in increments of P.
When the start isn't given, as in .iloc[::P], it is implicitly 0 (and the stop is implicitly the end of the DataFrame); you can simply specify a start other than 0 if that is what you need.
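For example, a quick check on a small 25-row frame (a sketch; iloc slices by position, so it works regardless of the index labels):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(100).reshape(25, 4), columns=['A', 'B', 'C', 'D'])
P, Q = 7, 5
# positional slice: rows 5, 12, 19, ...
print(df.iloc[Q::P])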

Using the numpy package, you can label every row according to the condition (note this compares index values, so it matches the row location only when the index is the default 0..N-1 RangeIndex):
import numpy as np
# label each row based on the modulus condition
df["satisfied"] = np.where(df.index % P == Q, "(satisfied)", "(not satisfied)")
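This labels the rows rather than selecting them; to keep only the matching rows, a boolean mask over the same condition should work (a small sketch, again assuming the default RangeIndex):
# boolean mask keeps only rows whose position satisfies the condition
selected = df[df.index % P == Q]
print(selected)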

Code:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(100).reshape(25,4), columns = ['A','B','C','D'])
p = 7
q = 5
a = []
# collect the row locations i where i % p == q
for i in range(df.shape[0]):
    if i % p == q:
        a.append(i)
# print the rows at those locations
print(df.iloc[a,:])
Output:
================
     A   B   C   D
5   20  21  22  23
12  48  49  50  51
19  76  77  78  79

Related

How to subtract consecutive elements of a column in a text file from each other?

I have a text file with two columns and 135001 rows. The first column is amplitude and the second column is the related time. I need to go over the first column, find where the amplitude increases and then decreases again, and extract the related time. Essentially I should take a derivative of the first column: when the amplitude increases I should count one, and then wait until the amplitude reaches zero before doing this process again. As I mentioned, I need the related time as well. Below is a very rough sketch of what I am thinking of; I know it is not correct, but I do not know how to complete it. As a first step I have a problem with subtracting the rows of the first column from each other, and I get the error "str could not convert to float".
n = 0
with open('39-1+2.txt', "r") as f:
    for line in f:
        data = line.split(' ')[0]
        time = line.split(' ')[1]
with open('grad-time.txt', 'w') as s:
    for i in range(0, 135001):
        if
            d = float(data[i+1]) - float(data[i]) > 0
            n = n + 1
            s.write("{}\n".format(d))
        wait
            float(data[i] = 0.0)
            continue
For example, I have this file:
0 11
2 12
3 13
1 14
0 15
1 16
0 17
0 18
The output should look like:
2 12
1 16
Since you want to use the value of a previous row to make a decision about the current row, you can make use of pandas' shift. This will allow you to create a new column that holds the value of the previous row.
Using that logic, you just need to check that the previous row is 0 and that the current value is higher than that.
>>> import pandas as pd
>>> df = pd.DataFrame([[0,11],[2,12],[3,13],[1,14],[0,15],[1,16],[0,17],[0,18]])
>>> df['shift'] = df[0].shift(1)
>>> df
   0   1  shift
0  0  11    NaN
1  2  12    0.0
2  3  13    2.0
3  1  14    3.0
4  0  15    1.0
5  1  16    0.0
6  0  17    1.0
7  0  18    0.0
>>> df[(df['shift']==0) & (df[0] > df['shift'])].drop(columns=['shift'])
   0   1
1  2  12
5  1  16
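To build the DataFrame from the text file itself, pandas' read_csv with a whitespace separator should work (a sketch, assuming the file contains exactly two whitespace-separated columns and no header):
>>> df = pd.read_csv('39-1+2.txt', sep=r'\s+', header=None)  # amplitude in column 0, time in column 1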
I haven't tested it, but the following functions should work for your problem:
def prepare_date(filename):
    with open(filename) as f:
        data = f.readlines()
    prepared_data = []
    for line in data:
        # A line "1 16" becomes [1, 16] in prepared_data
        prepared_data.append(
            [int(item) for item in line.split()]
        )
    return prepared_data

def find_increases_in_amplitude(prepared_data):
    # get the first data point
    last_data_point = prepared_data[0]
    increases = []
    # loop over data and find increases
    for data_point in prepared_data:
        # if the last data point had amplitude 0, and the current has amplitude
        # greater than zero: store the data_point in "increases"
        if last_data_point[0] == 0 and data_point[0] > 0:
            increases.append(data_point)
        # update last_data_point
        last_data_point = data_point
    return increases
Use the first function to open and prepare your data (it turns the file into a list like [[1, 12], ...]), then run that list through the second function.
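A minimal usage sketch (untested, reusing the file name from the question):
increases = find_increases_in_amplitude(prepare_date('39-1+2.txt'))
for amplitude, time in increases:
    print(amplitude, time)  # for the sample file this prints "2 12" and "1 16"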

Discard points with X,Y coordinates close to each other in a DataFrame

I have the following dataframe (it is actually several hundred MB long):
   X   Y  Size
0  10  20     5
1  11  21     2
2   9  35     1
3   8   7     7
4   9  19     2
I want to discard any X, Y point that has a Euclidean distance of less than delta=3 from any other X, Y point in the dataframe. In those cases I want to keep only the row with the larger Size.
In this example the intended result would be:
   X   Y  Size
0  10  20     5
2   9  35     1
3   8   7     7
As the question is stated, it is not clear how the desired algorithm should deal with chaining of distances.
If chaining is allowed, one solution is to cluster the dataset using a density-based clustering algorithm such as DBSCAN.
You just need to set the neighbourhood radius eps to delta and the min_samples parameter to 1 so that isolated points form their own clusters. Then you can find, within each group, the point with the maximum Size.
from sklearn.cluster import DBSCAN
X = df[['X', 'Y']]
db = DBSCAN(eps=3, min_samples=1).fit(X)
df['grp'] = db.labels_
df_new = df.loc[df.groupby('grp').idxmax()['Size']]
print(df_new)
>>>
    X   Y  Size  grp
0  10  20     5    0
2   9  35     1    1
3   8   7     7    2
You can use the script below and also try improving it.
# get all euclidean distances using sklearn;
# it will create an array of pairwise euclidean distances;
# then get the index pairs from df whose euclidean distance is less than 3
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
Z = df[['X', 'Y']]
euc = euclidean_distances(Z, Z)
idx = [(i, j) for i in range(len(euc)-1) for j in range(i+1, len(euc)) if euc[i, j] < 3]
# collect all row positions of df that appear in a close pair and find the row with the max Size
# then keep all rows of df NOT involved in a close pair and add back the max-size row
# create a new df called df_new by combining the rest of df with that row
# (this assumes the default RangeIndex, so positions and index labels coincide)
from itertools import chain
df_idx = list(set(chain(*idx)))
df2 = df.iloc[df_idx]
idx_max = df2[df2['Size'] == df2['Size'].max()].index.tolist()
df_new = pd.concat([df.iloc[~df.index.isin(df_idx)], df2.loc[idx_max]])
df_new
Result:
    X   Y  Size
2   9  35     1
3   8   7     7
0  10  20     5

Generate combinations for comma-separated strings in a pandas row

I have a dataframe like this:
ID  Values
1   10, 11, 12, 13
2   14
3   15, 16, 17, 18
I want to create a new dataframe like this:
ID  Col1  Col2
1   10    11
1   11    12
1   12    13
2   14
3   15    16
3   16    17
3   17    18
How can I do this?
Note: the entries in the Values column of the input df are of str type.
Use a list comprehension with flattening and a small change, if i > 0: to if i == 2:, so that one-element values are handled correctly:
from collections import deque

#https://stackoverflow.com/a/36586925
def chunks(iterable, chunk_size=2, overlap=1):
    # we'll use a deque to hold the values because it automatically
    # discards any extraneous elements if it grows too large
    if chunk_size < 1:
        raise Exception("chunk size too small")
    if overlap >= chunk_size:
        raise Exception("overlap too large")
    queue = deque(maxlen=chunk_size)
    it = iter(iterable)
    i = 0
    try:
        # start by filling the queue with the first group
        for i in range(chunk_size):
            queue.append(next(it))
        while True:
            yield tuple(queue)
            # after yielding a chunk, get enough elements for the next chunk
            for i in range(chunk_size - overlap):
                queue.append(next(it))
    except StopIteration:
        # if the iterator is exhausted, yield any remaining elements
        i += overlap
        if i == 2:
            yield tuple(queue)[-i:]
L = [[x] + list(z) for x, y in zip(df['ID'], df['Values']) for z in (chunks(y.split(', ')))]
df = pd.DataFrame(L, columns=['ID','Col1','Col2']).fillna('')
print (df)
  ID Col1 Col2
0  1   10   11
1  1   11   12
2  1   12   13
3  2   14
4  3   15   16
5  3   16   17
6  3   17   18
I tried a slightly different approach and created a function which returns the numbers in pairs from the initial comma-separated string.
def pairup(mystring):
    """Function to return paired up list from string"""
    mylist = mystring.split(',')
    if len(mylist) == 1: return [mylist]
    splitlist = []
    for index, item in enumerate(mylist):
        try:
            splitlist.append([mylist[index], mylist[index+1]])
        except:
            pass
    return splitlist
Now let's create the new data frame.
# https://stackoverflow.com/a/39955283/3679377
new_df = df[['ID']].join(
    df.Values.apply(lambda x: pd.Series(pairup(x)))
        .stack()
        .apply(lambda x: pd.Series(x))
        .fillna("")
        .reset_index(level=1, drop=True),
    how='left').reset_index(drop=True)
new_df.columns = ['ID', 'Col 1', 'Col 2']
Here's the output of print(new_df).
  ID Col 1 Col 2
0  1    10    11
1  1    11    12
2  1    12    13
3  2    14
4  3    15    16
5  3    16    17
6  3    17    18

Pandas track consecutive near numbers via compare-cumsum-groupby pattern

I am trying to extend my current pattern to accommodate an extra condition of +/- a percentage of the last value, rather than a strict match against the previous value.
data = np.array([[2,30],[2,900],[2,30],[2,30],[2,30],[2,1560],[2,30],
[2,300],[2,30],[2,450]])
df = pd.DataFrame(data)
df.columns = ['id','interval']
UPDATE 2 (id fix): Updated Data 2 with more data:
data2 = np.array([[2,30],[2,900],[2,30],[2,29],[2,31],[2,30],[2,29],[2,31],[2,1560],[2,30],[2,300],[2,30],[2,450], [3,40],[3,900],[3,40],[3,39],[3,41], [3,40],[3,39],[3,41] ,[3,1560],[3,40],[3,300],[3,40],[3,450]])
df2 = pd.DataFrame(data2)
df2.columns = ['id','interval']
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
results in [30,30,30]
However, I really want to catch near-number conditions, say when a number is within +/-10% of the previous number.
So looking at df2 I would like to pick up the series [30, 29, 31].
for i, g in df2.groupby([(df2.interval != <???+- 10% magic ???>).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
UPDATE: Here is the end-of-line processing code, where I store the gathered lists in a dictionary with the ID as the key:
leak_intervals = {}
final_leak_intervals = {}
serials = []
for i, g in df.groupby([(df.interval != df.interval.shift()).cumsum()]):
    if len(g.interval.tolist()) >= 3:
        print(g.interval.tolist())
        serial = g.id.values[0]
        if serial not in serials:
            serials.append(serial)
        if serial not in leak_intervals:
            leak_intervals[serial] = g.interval.tolist()
        else:
            leak_intervals[serial] = leak_intervals[serial] + (g.interval.tolist())
UPDATE:
In [116]: df2.groupby(df2.interval.pct_change().abs().gt(0.1).cumsum()) \
              .filter(lambda x: len(x) >= 3)
Out[116]:
    id  interval
2    2        30
3    2        29
4    2        31
5    2        30
6    2        29
7    2        31
15   3        40
16   3        39
17   3        41
18   3        40
19   3        39
20   3        41
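To feed these runs into the dictionary-building loop from the question, the same grouping key can be reused (a sketch, untested, mirroring the original leak_intervals logic):
key = df2.interval.pct_change().abs().gt(0.1).cumsum()
leak_intervals = {}
for _, g in df2.groupby(key):
    if len(g) >= 3:
        serial = g.id.values[0]
        # append this run of near-equal intervals under its id
        leak_intervals.setdefault(serial, []).extend(g.interval.tolist())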

Binning values into groups with a minimum size using pandas

I'm trying to bin a sample of observations into n discrete groups, then combine these groups until each subgroup has a minimum of 6 members. So far, I've generated bins and grouped my DataFrame into them:
# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()
1 4
2 1
3 2
4 3
5 2
6 8
7 7
8 6
9 19
10 12
11 13
12 12
13 7
14 12
15 12
16 2
17 3
18 6
19 3
21 1
So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.
You can do this:
df = pd.DataFrame(np.random.random_integers(1,200,135), columns=['heights'])
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()
def f(vals, max):
    # assign consecutive group numbers, starting a new group
    # whenever the running size total would exceed max
    sum = 0
    group = 1
    for v in vals:
        sum += v
        if sum <= max:
            yield group
        else:
            group += 1
            sum = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])
And if you do print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()) you will see that the cumulative sums are grouped as expected.
Also if you want to group the original values you can do something like:
dat = np.random.random_integers(0,200,135)
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
83,155,161,29,197,143,122,72,60])
df = pd.DataFrame({'heights':dat})
bins = np.digitize(dat,np.linspace(0,200,21))
grp = df.heights.groupby(bins)
m = 15  # you should put 6 here, the minimum
s = 0
c = 1
def f(x):
    global c, s
    res = pd.Series([c]*x.size, index=x.index)
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res
g = grp.apply(f)
print(df.groupby(g).size())
# another way of doing the same, just a matter of taste
m = 15  # you should put 6 here, the minimum
s = 0
c = 1
def f2(x):
    global c, s
    res = [c]*x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res
g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
