If T1 is this:
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
and P is this:
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
how do I get the positions of P within T1 ?
In terms of min and max, I would like to see this returned:
min  max
  3    6
  8   11
If these dataframes were represented as SQL tables, I could use the following SQL method, which I would like to translate to pandas:
DECLARE @Items INT = (SELECT COUNT(*) FROM #P);

SELECT MIN(t.KeyCol) AS MinKey,
       MAX(t.KeyCol) AS MaxKey
FROM dbo.T1 AS t
INNER JOIN #P AS p ON p.Val = t.Val
GROUP BY t.KeyCol - p.KeyCol
HAVING COUNT(*) = @Items;
This SQL solution is from Pesomannen's reply to http://sqlmag.com/t-sql/identifying-subsequence-in-sequence-part-2
Well, you can always do a workaround like this:
t1 = ''.join(T1.val)
p = ''.join(P.val)
start, res = 0, []
while True:
    try:
        res.append(t1.index(p, start))
        start = res[-1] + 1
    except ValueError:
        break
This gets you the starting indices; from those you can work out the ending indices arithmetically and access the dataframe with iloc. Note that you should use 0-based indexing (not 1-based, as in your example).
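For instance, a minimal sketch building on res and p above (these are 0-based positions, so they are one lower than the 1-based table in the question):
matches = pd.DataFrame({'min': res, 'max': [i + len(p) - 1 for i in res]})
print(matches)
#    min  max
# 0    2    5
# 1    7   10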
Granted, this doesn't utilize P, but may serve your purposes.
groups = T1.groupby(T1.val).groups
pd.DataFrame({'min': [min(x) for x in groups.values()],
              'max': [max(x) for x in groups.values()]}, index=groups.keys())
yields
   max  min
E    7    2
B   10    0
D    9    1
A    8    3

[4 rows x 2 columns]
I think I've worked it out by following the same approach as the SQL solution - a type of relational division (i.e. match on the values, group by the differences in the key columns, and select the groups whose count equals the size of the subsequence):
import pandas as pd
T1 = pd.DataFrame(data = {'val':['B','D','E','A','D','B','A','E','A','D','B']})
# use the index to create a new column that's going to be the key (zero based)
T1 = T1.reset_index()
# do the same for the subsequence that we want to find within T1
P = pd.DataFrame(data = {'val': ['E','A','D','B']})
P = P.reset_index()
# join on the val column
J = T1.merge(P,on=['val'],how='inner')
# group by difference in key columns calculating the min, max and count of the T1 key
FullResult = J.groupby(J['index_x'] - J['index_y'])['index_x'].agg(['min', 'max', 'count'])
# Final result is where the count is the size of the subsequence - in this case 4
FullResult[FullResult['count'] == 4]
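For reference, the filtered result should come out along these lines - the 0-based positions of the two matches (the 1-based 3-6 and 8-11 in the question are the same rows, shifted by one):
print(FullResult[FullResult['count'] == 4])
#    min  max  count
# 2    2    5      4
# 7    7   10      4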
Really enjoying using pandas!
Related
For dummy's sake, let's say that I have a database with columns text, ind and sentid. You can think of it as a database with one row per word, with a word's text, its position in the sentence, and the ID of the sentence.
To query for specific occurrences of n words, I join the table with itself n times. That table does not have a unique column except for the default rowid. I oftentimes want to query these columns in such a way that the integers in the n ind columns are sequential without any integer between them. Sometimes the order matters, sometimes the order does not. At the same time, each of the n columns also needs to fulfil some requirement, e.g. n0.text = 'a' AND n1.text = 'b' AND n2.text = 'c'. Put differently, in every sentence (unique sentid), find all occurrences of a b c either ordered or in any order (but sequential).
I have solved the ordered case quite easily. For three columns with names ind0, ind1, ind2 you could simply have a query like the following and scale it accordingly as n columns grows.
WHERE ind1 = ind0 + 1 AND ind2 = ind1 + 1
Perhaps there is a better way (do tell me if that is so) but I have found that this works reliably well in terms of performance (query speed).
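If it helps to see it spelled out, here is a hypothetical helper (not part of my current code) that builds that ordered-only condition for arbitrary n, using the same w{i}.ind aliases as the full example further down:
def ordered_condition(n: int) -> str:
    # each position must be exactly one more than the previous one
    return " AND ".join(f"w{i}.ind = w{i - 1}.ind + 1" for i in range(1, n))

print(ordered_condition(3))  # w1.ind = w0.ind + 1 AND w2.ind = w1.ind + 1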
The tougher nut to crack is those cases where the integers also have to be sequential but where the order does not matter (e.g., 7 9 8, 3 1 2, 11 10 9). My current approach is "brute forcing" it by simply generating all possible permutations of orders (e.g., (ind1 = ind0 + 1 AND ind2 = ind1 + 1) OR (ind0 = ind1 + 1 AND ind2 = ind0 + 1) OR ...). But as n grows, this becomes a huge list of possibilities and my query speed seems to suffer badly because of it. For example, for n=6 (the maximum requirement) this will generate 720 potential orders separated by OR. Such an approach, which works, is given as a minimal but documented example below for you to try out.
I am looking for a more generic, SQL-y solution that hopefully positively impacts performance when querying for sequential (but not necessarily ordered) columns.
Fiddle with data and current implementation here, reproducible Python code below.
Note that it is possible to get multiple results per sentid, but only if at least one of the word indices differs between the three matched words; permutations of the results themselves are not needed. For example, 1-2-0 and 3-2-1 can both be valid results for one sentid, but 1-2-0 and 2-1-0 cannot.
import sqlite3
from itertools import permutations
from pathlib import Path
from random import shuffle


def generate_all_possible_sequences(n: int) -> str:
    """Given an integer, generates all possible permutations of the 'n' given indices with respect to
    order in SQLite. What this means is that it will generate all possible permutations, e.g., for '3':
    0, 1, 2; 0, 2, 1; 1, 0, 2; 1, 2, 0 etc. and then build corresponding SQLite requirements, e.g.,
    0, 1, 2: ind1 = ind0 + 1 AND ind2 = ind1 + 1
    0, 2, 1: ind2 = ind0 + 1 AND ind1 = ind2 + 1
    ...
    and all these possibilities are then concatenated with OR to allow every possibility:
    ((ind1 = ind0 + 1 AND ind2 = ind1 + 1) OR (ind2 = ind0 + 1 AND ind1 = ind2 + 1) OR ...)
    """
    idxs = list(range(n))
    order_perms = []
    for perm in permutations(idxs):
        this_perm_orders = []
        for i in range(1, len(perm)):
            this_perm_orders.append(f"w{perm[i]}.ind = w{perm[i-1]}.ind + 1")
        order_perms.append(f"({' AND '.join(this_perm_orders)})")
    return f"({' OR '.join(order_perms)})"


def main():
    pdb = Path("temp.db")
    if pdb.exists():
        pdb.unlink()

    conn = sqlite3.connect(str(pdb))
    db_cur = conn.cursor()

    # Create a table of words, where each word has its text, its position in the sentence,
    # and the ID of its sentence
    db_cur.execute("CREATE TABLE tbl(text TEXT, ind INTEGER, sentid INTEGER)")

    # Create dummy data
    vals = []
    for sent_id in range(20):
        shuffled = ["a", "b", "c", "d", "e", "a", "c"]
        shuffle(shuffled)
        for word_id, word in enumerate(shuffled):
            vals.append((word, word_id, sent_id))

    # Wrap the values in single quotes for SQLite
    vals = [(f"'{v}'" for v in val) for val in vals]
    # Convert values into INSERT commands
    cmds = [f"INSERT INTO tbl VALUES ({','.join(val)})" for val in vals]

    # Build DB
    db_cur.executescript(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;")
    print(f"BEGIN TRANSACTION;{';'.join(cmds)};COMMIT;\n")

    # Query DB for sequential occurrences in ind0, ind1, and ind2: the order does not matter
    # but they have to be sequential
    query = f"""SELECT w0.ind, w1.ind, w2.ind, w0.sentid
                FROM tbl AS w0
                JOIN tbl AS w1 USING (sentid)
                JOIN tbl AS w2 USING (sentid)
                WHERE w0.text = 'a'
                  AND w1.text = 'b'
                  AND w2.text = 'c'
                  AND {generate_all_possible_sequences(3)}"""
    print(query)
    print()

    print("a_idx\tb_idx\tc_idx\tsentid")
    for res in db_cur.execute(query).fetchall():
        print("\t".join(map(str, res)))

    db_cur.close()
    conn.commit()
    conn.close()
    pdb.unlink()


if __name__ == '__main__':
    main()
This is a solution that uses mostly the rowids to create all the possible permutations:
WITH cte AS (
SELECT t0.rowid rowid0, t1.rowid rowid1, t2.rowid rowid2
FROM tbl AS t0
JOIN tbl AS t1 ON t1.sentid = t0.sentid AND t1.ind = t0.ind + 1
JOIN tbl AS t2 ON t2.sentid = t1.sentid AND t2.ind = t1.ind + 1
)
SELECT t0.ind ind0, t1.ind ind1, t2.ind ind2
FROM cte c
JOIN tbl t0 ON t0.rowid IN (c.rowid0, c.rowid1, c.rowid2)
JOIN tbl t1 ON t1.rowid IN (c.rowid0, c.rowid1, c.rowid2) AND t1.rowid <> t0.rowid
JOIN tbl t2 ON t2.rowid IN (c.rowid0, c.rowid1, c.rowid2) AND t2.rowid NOT IN (t0.rowid, t1.rowid)
See a simplified demo.
The query plan (in the above demo) shows that SQLite uses covering indexes and the rowids to perform the joins.
I don't know how this would scale for more columns, as this requirement is a performance killer by design because of the multiple joins and the number of rows that must be returned for each tuple of n columns satisfying the conditions (= n!).
I found that the easiest way to implement this is to use the following condition:
AND MAX(w0.ind, w1.ind, w2.ind) - MIN(w0.ind, w1.ind, w2.ind) = 2
where 2 is the number of words that we are looking for minus 1. That being said, it is hard to say much about the performance, since, as @forpas mentions, "no, indexes can't be used in expressions like MAX(...) - MIN(...) where functions are used."
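Concretely, here is a sketch of the question's query with the permutation blob swapped for this span check (the distinct text filters already guarantee three distinct rows, so three distinct positions spanning exactly 2 must be consecutive):
query = """SELECT w0.ind, w1.ind, w2.ind, w0.sentid
           FROM tbl AS w0
           JOIN tbl AS w1 USING (sentid)
           JOIN tbl AS w2 USING (sentid)
           WHERE w0.text = 'a'
             AND w1.text = 'b'
             AND w2.text = 'c'
             AND MAX(w0.ind, w1.ind, w2.ind) - MIN(w0.ind, w1.ind, w2.ind) = 2"""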
I am new to Python, coming from SciLab (an open-source MATLAB ersatz), which I am using as a toolbox for my analyses (test data analysis, reliability, acoustics, ...); I am definitely not a computer science lad.
I have data in the form of lists of same length (vectors of same size in SciLab).
I use some of them as parameters in order to select data from another one, e.g.:
t_v = [1:10]; // a parameter vector
p_v = [20:29]; // another parameter vector
res_v(t_v > 5 & p_v < 28); // the res_v elements whose "corresponding" p_v and t_v values comply with my criteria; I can use this for analyses.
This is very direct and simple in SciLab; I have not found a way to achieve the same with Python, either "Pythonically" or simply translated.
Any idea that could help me, please?
Have a nice day,
Patrick.
You could use numpy arrays. It's easy:
import numpy as np

par1 = np.array([1, 1, 5, 5, 5, 1, 1])
par2 = np.array([-1, 1, 1, -1, 1, 1, 1])
data = np.array([1, 2, 3, 4, 5, 6, 7])

print(par1)
print(par2)
print(data)

bool_filter = (par1 > 1) & (par2 < 0)

# example of filtering directly on the array
filtered_data = data[par1 > 1]
print(filtered_data)

# filtering with the two parameters
filtered_data_twice = data[bool_filter]
print(filtered_data_twice)
output:
[1 1 5 5 5 1 1]
[-1 1 1 -1 1 1 1]
[1 2 3 4 5 6 7]
[3 4 5]
[4]
Note that it does not keep the same number of elements.
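To mirror the exact example from the question, the direct translation would be something like this (the res_v values here are made up just to have something to index into):
import numpy as np

t_v = np.arange(1, 11)      # 1..10, like the SciLab t_v = [1:10]
p_v = np.arange(20, 30)     # 20..29, like p_v = [20:29]
res_v = np.arange(30, 40)   # hypothetical result vector of the same length

print(res_v[(t_v > 5) & (p_v < 28)])   # [35 36 37]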
Here's my modified solution according to your last comment.
t_v = list(range(1, 10))
p_v = list(range(20, 29))
res_v = list(range(30, 39))

def first_index_greater_than(search_number, lst):
    for count, number in enumerate(lst):
        if number > search_number:
            return count

def first_index_lower_than(search_number, lst):
    for count, number in enumerate(lst[::-1]):
        if number < search_number:
            # since lst was searched from the end, the count has to be reversed as well
            return len(lst) - count

t_v_index = first_index_greater_than(5, t_v)
p_v_index = first_index_lower_than(28, p_v)
print(res_v[min(t_v_index, p_v_index):max(t_v_index, p_v_index)])
It prints the list [35, 36, 37].
I'm sure you can optimize it better according to your needs.
The problem statement is not clearly defined, but this is what I interpret to be a likely solution.
import pandas as pd

tv = list(range(1, 11))
pv = list(range(20, 30))
res = list(range(30, 40))

df = pd.DataFrame({'tv': tv, 'pv': pv, 'res': res})
print(df)

def criteria(row, col1, a, col2, b):
    if (row[col1] > a) & (row[col2] < b):
        return True
    else:
        return False

df['select'] = df.apply(lambda row: criteria(row, 'tv', 5, 'pv', 28), axis=1)
selected_res = df.loc[df['select']]['res'].tolist()
print(selected_res)

# ... or another way ..
print(df.loc[(df.tv > 5) & (df.pv < 28)]['res'])
This builds a dataframe whose columns are the original lists, applies a selection criterion based on columns tv and pv to each row, and stores the result in a new boolean column identifying the rows where the criterion is satisfied.
[35, 36, 37]
5    35
6    36
7    37
I am calculating correlations and the data frame I have needs to be filtered.
Starting from the first row and looping all the way through the dataframe to the last row, I am looking to remove the rows below the current row whose values are within X above or below it.
example:
df['y'] has the values 50,51,52,53,54,55,70,71,72,73,74,75
if X = 10 it would start at 50 and see 51,52,53,54,55 as within that +/-10 range and delete those rows. 70 would stay, as it is not within that range, and the same test would start again at 70, where 71,72,73,74,75 and their respective rows would be deleted.
The filter with X=10 would thus leave us with the rows containing 50 and 70 in df.
It would leave me with a clean dataframe that drops the instances linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that, but I am left with the wrong result and am desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index) / 3

# Taking initial comparison values from first row
c = df6.iloc[0]['index']
# Including first row in result
filters = [True]

# Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c - boom <= row['index'] <= c + boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']

df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
OUTPUT BEFORE
OUTPUT AFTER
IIUC, your main issue is to filter consecutive values within a threshold.
You can use a custom function for that that acts on a Series (=column) to return the list of valid indices:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = []
    for i, val in s.items():   # .items() rather than the deprecated .iteritems()
        if val - prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:
import pandas as pd
df = pd.DataFrame({'y': [50,51,52,53,54,55,70,71,72,73,74,75]})
df2 = df.loc[consecutive(df['y'])]
Output:
y
0 50
6 70
variant
If you prefer the function to return a boolean indexer, here is a variant:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = [False] * len(s)
    for i, val in s.items():
        if val - prev > threshold:
            idx[i] = True   # assumes the default RangeIndex, so the label doubles as a position
            prev = val
    return idx
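Usage is then the same, since .loc also accepts a boolean mask (again assuming the default RangeIndex):
df2 = df.loc[consecutive(df['y'])]   # again keeps rows 0 (y=50) and 6 (y=70)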
I would like to count how many times column A has the same value as B and as C. Similarly, I would like to count how many times A2 has the same value as B2 and as C2.
I have this dataframe:
,A,B,C,A2,B2,C2
2018-12-01,7,0,8,17,17,17
2018-12-02,0,0,8,20,18,18
2018-12-03,9,8,8,17,17,18
2018-12-04,8,8,8,17,17,18
2018-12-05,8,8,8,17,17,17
2018-12-06,9,8,8,15,17,17
2018-12-07,8,9,9,17,17,16
2018-12-08,0,0,0,17,17,17
2018-12-09,8,0,0,17,20,18
2018-12-10,8,8,8,17,17,17
2018-12-11,8,8,9,17,17,17
2018-12-12,8,8,8,17,17,17
2018-12-13,8,8,8,17,17,17
2018-12-14,8,8,8,17,17,17
2018-12-15,9,9,9,17,17,17
2018-12-16,12,0,0,17,19,17
2018-12-17,11,9,9,17,17,17
2018-12-18,8,9,9,17,17,17
2018-12-19,8,9,8,17,17,17
2018-12-20,9,8,8,17,17,17
2018-12-21,9,9,9,17,17,17
2018-12-22,10,9,0,17,17,17
2018-12-23,10,11,10,17,17,17
2018-12-24,10,10,8,17,19,17
2018-12-25,7,10,10,17,17,18
2018-12-26,10,0,10,17,19,17
2018-12-27,9,10,8,18,17,17
2018-12-28,9,9,9,17,17,17
2018-12-29,10,10,12,18,17,17
2018-12-30,10,0,10,16,19,17
2018-12-31,11,8,8,19,17,16
I expect the following values:
A with B = 14
A with C = 14
A2 with B2 = 14
A2 with C2 = 14
I have done this:
ia = 0
for i in range(0,len(dfr_h_max1)):
if dfr_h_max1['A'][i] == dfr_h_max1['B'][i]:
ia=ia+1
ib = 0
for i in range(0,len(dfr_h_max1)):
if dfr_h_max1['A'][i] == dfr_h_max1['C'][i]:
ib=ib+1
In order to take advantage of pandas, this is one possible solution:
import numpy as np
dfr_h_max1['que'] = np.where((dfr_h_max1['A'] == dfr_h_max1['B']), 1, 0)
After that I could sum all the elements in the new column 'que'.
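That summing step would simply be (using the 'que' column created above):
print(dfr_h_max1['que'].sum())   # number of rows where A == B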
Another possibility could be related to some sort of boolean variable. Unfortunately, I still do not have enough knowledge about that.
Any other more efficient or elegant solutions?
The primary calculation you need here is, for example, dfr_h_max1['A'] == dfr_h_max1['B'] - as you've done in your edit. That gives you a Series of True/False values based on the equality of each pair of items in the two series. Since True evaluates to 1 and False evaluates to 0, the .sum() is the count of how many True's there were - hence, how many matches.
Put that in a loop and add the required "text" for the output you want:
mains = ('A', 'A2')                   # the main columns
comps = (['B', 'C'], ['B2', 'C2'])    # columns to compare each main with

for main, pair in zip(mains, comps):
    for col in pair:
        print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')
        # or without f-strings, do:
        # print(main, 'with', col, '=', (dfr_h_max1[main] == dfr_h_max1[col]).sum())
Output:
A with B = 14
A with C = 14
A2 with B2 = 21
A2 with C2 = 20
Btw, (df[main] == df[comp]).sum() for Series.sum() can also be written as sum(df[main] == df[comp]) for Python's builtin sum().
In case you have more than two "triplets" of columns (not just A & A2), change the mains and comps to this, so that it works on all triplets:
mains = dfr_h_max1.columns[::3]          # main columns (A's), in steps of 3
comps = zip(dfr_h_max1.columns[1::3],    # offset by 1 column (B's),
            dfr_h_max1.columns[2::3])    # offset by 2 columns (C's), in steps of 3
(Or even using the column names / starting letter.)
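For instance, a sketch of the starting-letter idea (this assumes the main columns are exactly those whose names begin with 'A', as in the example data):
mains = [c for c in dfr_h_max1.columns if c.startswith('A')]                    # ['A', 'A2']
comps = [[m.replace('A', letter, 1) for letter in ('B', 'C')] for m in mains]   # [['B', 'C'], ['B2', 'C2']]
for main, pair in zip(mains, comps):
    for col in pair:
        print(f'{main} with {col} = {(dfr_h_max1[main] == dfr_h_max1[col]).sum()}')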
The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, there is no value exactly 0.21 seconds after another value, so I'll put the 1 in the outputTics column two rows later: for example, at index 3 there is a 1 at 11.4 seconds, so I'm putting a 1 in the output column at 11.6 seconds.
If there is another 1 in the "inputTics" column within 0.21 seconds, do not put a 1 in the output column: an example would be index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe:
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta if you want, to avoid Python's floating-point rounding issues.
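For instance, a minimal sketch of that idea (assuming the A dataframe from the question; the code below does not depend on it):
ts = pd.to_timedelta(A['Timestamp'], unit='s')   # exact nanosecond timedeltas instead of floats
window_end = ts + pd.Timedelta(seconds=0.21)     # the 0.21 s window after each row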
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
if row['inputTics'] == 0:
return 0
index = df[df == row].dropna().index
# check for a 1 in input within 0.11 seconds
t = row['Timestamp'] + pd.TimeDelta(seconds = 0.11)
indices = df[df.Timestamp <= t].index
c = 0
for i in indices:
if df.loc[i,'inputTics'] == 0:
c = c + 1
else:
c = 0
break
if c > 0:
df.loc[indices[-1] + 1, 'outputTics'] = 1
return 0
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# numpy is needed for argmax/zeros below
import numpy as np

# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])

# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics'] == 1].index + 0.11

# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to full dataframe's timestamps
    # and return index of nearest timestamp
    oi = np.argmax((A.index - ii) >= 0)
    output_indices.append(oi)

# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1

# Add it to dataframe
A['outputTics'] = output_tics

# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']

# Clean up negative values in the outputTics column
A.loc[A['outputTics'] < 0, 'outputTics'] = 0

# The first row becomes 1 because of indexing; change it to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0