Logical indexing in pandas dataframes [duplicate] - python
This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Closed 3 years ago.
I have some data like this:
+-----------+---------+-------+
| Duration | Outcome | Event |
+-----------+---------+-------+
| 421 | 0 | 1 |
| 421 | 0 | 1 |
| 261 | 0 | 1 |
| 24 | 0 | 1 |
| 27 | 0 | 1 |
| 613 | 0 | 1 |
| 2454 | 0 | 1 |
| 227 | 0 | 1 |
| 2560 | 0 | 1 |
| 229 | 0 | 1 |
| 2242 | 0 | 1 |
| 6680 | 0 | 1 |
| 1172 | 0 | 1 |
| 5656 | 0 | 1 |
| 5082 | 0 | 1 |
| 7239 | 0 | 1 |
| 127 | 0 | 1 |
| 128 | 0 | 1 |
| 128 | 0 | 1 |
| 7569 | 1 | 1 |
| 324 | 0 | 2 |
| 6395 | 0 | 2 |
| 6196 | 0 | 2 |
| 31 | 0 | 2 |
| 228 | 0 | 2 |
| 274 | 0 | 2 |
| 270 | 0 | 2 |
| 275 | 0 | 2 |
| 232 | 0 | 2 |
| 7310 | 0 | 2 |
| 7644 | 1 | 2 |
| 6949 | 0 | 3 |
| 6903 | 1 | 3 |
| 6942 | 0 | 4 |
| 7031 | 1 | 4 |
+-----------+---------+-------+
Now, for each Event, with the Outcome 0/1 considered as Fail/Pass, I want to sum the total Duration of Fail/Pass events separately in 2 new columns (or 1, whatever ensures readability).
I'm new to dataframes, and I suspect some significant logical indexing is involved here. What is the best way to approach this problem?
df.groupby(['Event', 'Outcome'])['Duration'].sum()
So you group by both the event and the outcome, select the Duration column, and then take the sum of each group.
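If you want the Fail/Pass totals as the two separate columns you mention, one option is to unstack the Outcome level of that result. This is a minimal sketch, assuming Outcome only ever takes the values 0 and 1:

sums = df.groupby(['Event', 'Outcome'])['Duration'].sum()
result = sums.unstack('Outcome')   # one column per Outcome value
result.columns = ['Fail', 'Pass']  # 0 -> Fail, 1 -> Pass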
You can also try:
pd.pivot_table(index='Event',
               columns='Outcome',
               values='Duration',
               data=df,
               aggfunc='sum')
which gives you a table with two columns:
+---------+-------+------+
| Outcome | 0 | 1 |
+---------+-------+------+
| Event | | |
+---------+-------+------+
| 1 | 35691 | 7569 |
| 2 | 21535 | 7644 |
| 3 | 6949 | 6903 |
| 4 | 6942 | 7031 |
+---------+-------+------+
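If readability is the goal, you could also relabel and fill the pivoted table; a small sketch along the same lines (again assuming the 0/1 Fail/Pass coding above):

pivoted = pd.pivot_table(index='Event', columns='Outcome',
                         values='Duration', data=df,
                         aggfunc='sum', fill_value=0)
pivoted = pivoted.rename(columns={0: 'Fail', 1: 'Pass'})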
Related
Is there a method for turning user input into csv format?
This is the example data that would be pasted into an input() prompt; ideally I would like it to be processed and made into a csv file through Python:

,,,,,,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,SCA,SCA,Passes,Passes,Passes,Passes,Carries,Carries,Dribbles,Dribbles,-additional
Player,#,Nation,Pos,Age,Min,Gls,Ast,PK,PKatt,Sh,SoT,CrdY,CrdR,Touches,Press,Tkl,Int,Blocks,xG,npxG,xA,SCA,GCA,Cmp,Att,Cmp%,Prog,Carries,Prog,Succ,Att,-9999
Gabriel Jesus,9,br BRA,FW,25-124,82,0,0,0,0,1,0,0,0,40,13,1,1,0,0.1,0.1,0.0,4,0,20,27,74.1,2,33,1,4,5,b66315ae
Eddie Nketiah,14,eng ENG,FW,23-067,8,0,0,0,0,0,0,0,0,6,2,0,0,0,0.0,0.0,0.1,2,0,4,4,100.0,1,4,1,0,0,a53649b7
Martinelli,11,br BRA,LW,21-048,90,1,0,0,0,2,1,0,0,38,21,0,2,1,0.6,0.6,0.1,1,0,24,28,85.7,1,34,5,3,4,48a5a5d6
Bukayo Saka,7,eng ENG,RW,20-334,90,0,0,0,0,3,0,0,0,52,23,3,0,3,0.2,0.2,0.0,2,1,24,36,66.7,2,37,8,2,2,bc7dc64d
Martin Ødegaard,8,no NOR,AM,23-231,89,0,0,0,0,2,0,0,0,50,22,2,1,2,0.1,0.1,0.0,2,0,30,39,76.9,5,28,3,1,2,79300479
Albert Sambi Lokonga,23,be BEL,CM,22-287,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0.0,0.0,0.0,0,0,1,1,100.0,0,1,1,0,0,1b4f1169
Granit Xhaka,34,ch SUI,DM,29-312,90,0,0,0,0,0,0,1,0,60,5,0,2,3,0.0,0.0,0.0,4,0,42,49,85.7,6,32,2,0,0,e61b8aee
Thomas Partey,5,gh GHA,DM,29-053,90,0,0,0,0,1,0,0,0,62,25,7,1,2,0.1,0.1,0.0,0,0,40,47,85.1,5,26,4,0,1,529f49ab
Oleksandr Zinchenko,35,ua UKR,LB,25-233,82,0,1,0,0,1,1,0,0,64,16,3,3,1,0.0,0.0,0.3,2,1,44,54,81.5,6,36,5,0,0,51cf8561
Kieran Tierney,3,sct SCO,LBWB,25-061,8,0,0,0,0,0,0,0,0,6,1,0,0,0,0.0,0.0,0.0,0,0,2,4,50.0,0,1,0,0,0,fce2302c
Gabriel Dos Santos,6,br BRA,CB,24-229,90,0,0,0,0,0,0,0,0,67,5,1,1,2,0.0,0.0,0.0,0,0,52,58,89.7,1,48,3,0,0,67ac5bb8
William Saliba,12,fr FRA,CB,21-134,90,0,0,0,0,0,0,0,0,58,3,1,2,2,0.0,0.0,0.0,0,0,42,46,91.3,1,35,1,0,0,972aeb2a
Ben White,4,eng ENG,RB,24-301,90,0,0,0,0,0,0,1,0,61,22,7,4,5,0.0,0.0,0.1,1,0,29,40,72.5,5,25,2,1,1,35e413f1
Aaron Ramsdale,1,eng ENG,GK,24-083,90,0,0,0,0,0,0,0,0,33,0,0,0,0,0.0,0.0,0.0,0,0,24,32,75.0,0,21,0,0,0,466fb2c5
14 Players,,,,,990,1,1,0,0,10,2,2,0,599,158,25,17,21,1.1,1.1,0.5,18,2,378,465,81.3,35,361,36,11,15,-9999

The link to the table is: https://fbref.com/en/matches/e62f6e78/Crystal-Palace-Arsenal-August-5-2022-Premier-League#stats_18bb7c10_summary

I have attempted to use a pandas dataframe, but I am only able to export the first row of headers and nothing else (only the items before Player).
Would have been nice for you to include your attempt. Pandas works just fine:

import pandas as pd

url = 'https://fbref.com/en/matches/e62f6e78/Crystal-Palace-Arsenal-August-5-2022-Premier-League#stats_18bb7c10_summary'
df = pd.read_html(url)[10]
cols = [f'{each[0]}_{each[1]}' if 'Unnamed' not in each[0] else f'{each[1]}' for each in df.columns]
df.columns = cols
df.to_csv('output.csv', index=False)

Output of print(df.to_markdown()):

|    | Player               |   # | Nation   | Pos   | Age    |   Min |   Gls |   Ast |   PK |   PKatt |   Sh |   SoT |   CrdY |   CrdR |   Touches |   Press |   Tkl |   Int |   Blocks |   xG |   npxG |   xA |   SCA |   GCA |   Cmp |   Att |   Cmp% |   Prog |   Carries |   Prog.1 |   Succ |   Att.1 |
|---:|:---------------------|----:|:---------|:------|:-------|------:|------:|------:|-----:|--------:|-----:|------:|-------:|-------:|----------:|--------:|------:|------:|---------:|-----:|-------:|-----:|------:|------:|------:|------:|-------:|-------:|----------:|---------:|-------:|--------:|
|  0 | Gabriel Jesus        |   9 | br BRA   | FW    | 25-124 |    82 |     0 |     0 |    0 |       0 |    1 |     0 |      0 |      0 |        40 |      13 |     1 |     1 |        0 |  0.1 |    0.1 |  0   |     4 |     0 |    20 |    27 |   74.1 |      2 |        33 |        1 |      4 |       5 |
|  1 | Eddie Nketiah        |  14 | eng ENG  | FW    | 23-067 |     8 |     0 |     0 |    0 |       0 |    0 |     0 |      0 |      0 |         6 |       2 |     0 |     0 |        0 |  0   |    0   |  0.1 |     2 |     0 |     4 |     4 |  100   |      1 |         4 |        1 |      0 |       0 |
|  2 | Martinelli           |  11 | br BRA   | LW    | 21-048 |    90 |     1 |     0 |    0 |       0 |    2 |     1 |      0 |      0 |        38 |      21 |     0 |     2 |        1 |  0.6 |    0.6 |  0.1 |     1 |     0 |    24 |    28 |   85.7 |      1 |        34 |        5 |      3 |       4 |
|  3 | Bukayo Saka          |   7 | eng ENG  | RW    | 20-334 |    90 |     0 |     0 |    0 |       0 |    3 |     0 |      0 |      0 |        52 |      23 |     3 |     0 |        3 |  0.2 |    0.2 |  0   |     2 |     1 |    24 |    36 |   66.7 |      2 |        37 |        8 |      2 |       2 |
|  4 | Martin Ødegaard      |   8 | no NOR   | AM    | 23-231 |    89 |     0 |     0 |    0 |       0 |    2 |     0 |      0 |      0 |        50 |      22 |     2 |     1 |        2 |  0.1 |    0.1 |  0   |     2 |     0 |    30 |    39 |   76.9 |      5 |        28 |        3 |      1 |       2 |
|  5 | Albert Sambi Lokonga |  23 | be BEL   | CM    | 22-287 |     1 |     0 |     0 |    0 |       0 |    0 |     0 |      0 |      0 |         2 |       0 |     0 |     0 |        0 |  0   |    0   |  0   |     0 |     0 |     1 |     1 |  100   |      0 |         1 |        1 |      0 |       0 |
|  6 | Granit Xhaka         |  34 | ch SUI   | DM    | 29-312 |    90 |     0 |     0 |    0 |       0 |    0 |     0 |      1 |      0 |        60 |       5 |     0 |     2 |        3 |  0   |    0   |  0   |     4 |     0 |    42 |    49 |   85.7 |      6 |        32 |        2 |      0 |       0 |
|  7 | Thomas Partey        |   5 | gh GHA   | DM    | 29-053 |    90 |     0 |     0 |    0 |       0 |    1 |     0 |      0 |      0 |        62 |      25 |     7 |     1 |        2 |  0.1 |    0.1 |  0   |     0 |     0 |    40 |    47 |   85.1 |      5 |        26 |        4 |      0 |       1 |
|  8 | Oleksandr Zinchenko  |  35 | ua UKR   | LB    | 25-233 |    82 |     0 |     1 |    0 |       0 |    1 |     1 |      0 |      0 |        64 |      16 |     3 |     3 |        1 |  0   |    0   |  0.3 |     2 |     1 |    44 |    54 |   81.5 |      6 |        36 |        5 |      0 |       0 |
|  9 | Kieran Tierney       |   3 | sct SCO  | LB,WB | 25-061 |     8 |     0 |     0 |    0 |       0 |    0 |     0 |      0 |      0 |         6 |       1 |     0 |     0 |        0 |  0   |    0   |  0   |     0 |     0 |     2 |     4 |   50   |      0 |         1 |        0 |      0 |       0 |
| 10 | Gabriel Dos Santos   |   6 | br BRA   | CB    | 24-229 |    90 |     0 |     0 |    0 |       0 |    0 |     0 |      0 |      0 |        67 |       5 |     1 |     1 |        2 |  0   |    0   |  0   |     0 |     0 |    52 |    58 |   89.7 |      1 |        48 |        3 |      0 |       0 |
| 11 | William Saliba       |  12 | fr FRA   | CB    | 21-134 |    90 |     0 |     0 |    0 |       0 |    0 |     0 |      0 |      0 |        58 |       3 |     1 |     2 |        2 |  0   |    0   |  0   |     0 |     0 |    42 |    46 |   91.3 |      1 |        35 |        1 |      0 |       0 |
| 12 | Ben White            |   4 | eng ENG  | RB    | 24-301 |    90 |     0 |     0 |    0 |       0 |    0 |     0 |      1 |      0 |        61 |      22 |     7 |     4 |        5 |  0   |    0   |  0.1 |     1 |     0 |    29 |    40 |   72.5 |      5 |        25 |        2 |      1 |       1 |
| 13 | Aaron Ramsdale       |   1 | eng ENG  | GK    | 24-083 |    90 |     0 |     0 |    0 |       0 |    0 |     0 |      0 |      0 |        33 |       0 |     0 |     0 |        0 |  0   |    0   |  0   |     0 |     0 |    24 |    32 |   75   |      0 |        21 |        0 |      0 |       0 |
| 14 | 14 Players           | nan | nan      | nan   | nan    |   990 |     1 |     1 |    0 |       0 |   10 |     2 |      2 |      0 |       599 |     158 |    25 |    17 |       21 |  1.1 |    1.1 |  0.5 |    18 |     2 |   378 |   465 |   81.3 |     35 |       361 |       36 |     11 |      15 |
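Note that hard-coding [10] is brittle if the site reorders its tables. read_html can instead match a table by its HTML attributes; here is a sketch, assuming the table's id is the stats_18bb7c10_summary fragment seen in the URL:

# hypothetical, more robust table selection by id (id assumed from the URL fragment)
df = pd.read_html(url, attrs={'id': 'stats_18bb7c10_summary'})[0]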
Could you elaborate more? Maybe you could split the raw text by comma and then convert it to a dataframe, like:

raw_text = input()
list_of_strings = raw_text.split(',')
df = pd.DataFrame(list_of_strings)
df.to_csv('yourfile.csv')
The correct approach is the one proposed by chitown88. However, if you want to copy-paste the data by hand into the terminal and get a csv, you can do something like this:

import pandas as pd
from datetime import datetime

while True:
    print("Enter/Paste your content. Ctrl-D or Ctrl-Z ( windows ) to save it.")
    contents = []
    while True:
        try:
            line = input()
        except EOFError:
            break
        contents.append(line)
    df = pd.DataFrame(contents)
    df.to_csv(f"df_{int(datetime.now().timestamp())}.csv", index=None)

Start the Python script, paste the data into the terminal, then press CTRL+D (and Enter) to export what you pasted into a csv file.
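One caveat with the snippet above: pd.DataFrame(contents) produces a single column holding whole lines. To get real columns, a possible follow-up (a sketch, assuming the pasted text is comma-separated with column names on its second line, like the example data) is:

import io
import pandas as pd

# 'contents' is the list of pasted lines collected above
df = pd.read_csv(io.StringIO('\n'.join(contents)), header=1)  # header=1: second line holds the column names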
You can use an input-controlled while loop to collect user input, and exit depending on the user's choice. Look at the code below:

user_input = 'Y'
while user_input.lower() == 'y':
    # Run your code here.
    user_input = input('Do you want to add one more entry: Y or N? ')
This is the most intuitive and understandable solution I could come up with; it uses a bit of basic linear algebra (transposing the rows into columns) to solve the problem, which I find pretty neat. I recommend you find another way to parse the data, though. Check out beautifulsoup and requests.

import pandas as pd  # for dataframe

data = '''
,,,,,,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,SCA,SCA,Passes,Passes,Passes,Passes,Carries,Carries,Dribbles,Dribbles,-additional
Player,#,Nation,Pos,Age,Min,Gls,Ast,PK,PKatt,Sh,SoT,CrdY,CrdR,Touches,Press,Tkl,Int,Blocks,xG,npxG,xA,SCA,GCA,Cmp,Att,Cmp%,Prog,Carries,Prog,Succ,Att,-9999
Gabriel Jesus,9,br BRA,FW,25-124,82,0,0,0,0,1,0,0,0,40,13,1,1,0,0.1,0.1,0.0,4,0,20,27,74.1,2,33,1,4,5,b66315ae
Eddie Nketiah,14,eng ENG,FW,23-067,8,0,0,0,0,0,0,0,0,6,2,0,0,0,0.0,0.0,0.1,2,0,4,4,100.0,1,4,1,0,0,a53649b7
Martinelli,11,br BRA,LW,21-048,90,1,0,0,0,2,1,0,0,38,21,0,2,1,0.6,0.6,0.1,1,0,24,28,85.7,1,34,5,3,4,48a5a5d6
Bukayo Saka,7,eng ENG,RW,20-334,90,0,0,0,0,3,0,0,0,52,23,3,0,3,0.2,0.2,0.0,2,1,24,36,66.7,2,37,8,2,2,bc7dc64d
Martin Ødegaard,8,no NOR,AM,23-231,89,0,0,0,0,2,0,0,0,50,22,2,1,2,0.1,0.1,0.0,2,0,30,39,76.9,5,28,3,1,2,79300479
Albert Sambi Lokonga,23,be BEL,CM,22-287,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0.0,0.0,0.0,0,0,1,1,100.0,0,1,1,0,0,1b4f1169
Granit Xhaka,34,ch SUI,DM,29-312,90,0,0,0,0,0,0,1,0,60,5,0,2,3,0.0,0.0,0.0,4,0,42,49,85.7,6,32,2,0,0,e61b8aee
Thomas Partey,5,gh GHA,DM,29-053,90,0,0,0,0,1,0,0,0,62,25,7,1,2,0.1,0.1,0.0,0,0,40,47,85.1,5,26,4,0,1,529f49ab
Oleksandr Zinchenko,35,ua UKR,LB,25-233,82,0,1,0,0,1,1,0,0,64,16,3,3,1,0.0,0.0,0.3,2,1,44,54,81.5,6,36,5,0,0,51cf8561
Kieran Tierney,3,sct SCO,LBWB,25-061,8,0,0,0,0,0,0,0,0,6,1,0,0,0,0.0,0.0,0.0,0,0,2,4,50.0,0,1,0,0,0,fce2302c
Gabriel Dos Santos,6,br BRA,CB,24-229,90,0,0,0,0,0,0,0,0,67,5,1,1,2,0.0,0.0,0.0,0,0,52,58,89.7,1,48,3,0,0,67ac5bb8
William Saliba,12,fr FRA,CB,21-134,90,0,0,0,0,0,0,0,0,58,3,1,2,2,0.0,0.0,0.0,0,0,42,46,91.3,1,35,1,0,0,972aeb2a
Ben White,4,eng ENG,RB,24-301,90,0,0,0,0,0,0,1,0,61,22,7,4,5,0.0,0.0,0.1,1,0,29,40,72.5,5,25,2,1,1,35e413f1
Aaron Ramsdale,1,eng ENG,GK,24-083,90,0,0,0,0,0,0,0,0,33,0,0,0,0,0.0,0.0,0.0,0,0,24,32,75.0,0,21,0,0,0,466fb2c5
14 Players,,,,,990,1,1,0,0,10,2,2,0,599,158,25,17,21,1.1,1.1,0.5,18,2,378,465,81.3,35,361,36,11,15,-9999
'''
# you can just replace data with user input

def tryNum(x):
    # take a value: if it is a number, return it as a float, otherwise return it unchanged
    try:
        return float(x)
    except ValueError:
        return x

rows = [i.split(',')[:-1] for i in data.split('\n')[2:-2]]  # remove the useless lines and drop the trailing id column
col_names = [i for i in rows[0]]  # fetch all column names
cols = [[tryNum(rows[j][i]) for j in range(1, len(rows))] for i in range(len(rows[0]))]  # get each column's values by transposing the "matrix", if you will
full = {}  # set up the dictionary
for i, y in zip(col_names, cols):  # put the data in the dict
    full[i] = y
df = pd.DataFrame(data=full)  # load it all into the df
print(df.head())
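For the cleaner parsing route the answer alludes to, a minimal sketch (assuming the page is reachable and that pandas can see the table in the raw HTML) might look like:

import io
import requests
import pandas as pd

url = 'https://fbref.com/en/matches/e62f6e78/Crystal-Palace-Arsenal-August-5-2022-Premier-League'
resp = requests.get(url)                       # fetch the page ourselves
tables = pd.read_html(io.StringIO(resp.text))  # parse every <table> pandas can find
print(len(tables))                             # inspect which index holds the summary table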
Python Pivot Table based on multiple criteria
I was asking the question in this link: SUMIFS in python jupyter. However, I just realized that the solution didn't work, because they can switch in and switch out on different dates; basically, they have to switch out first before they can switch in. Here is the dataframe (sorted based on the date):

+---------------+--------+---------+-----------+--------+
| Switch In/Out | Client | Quality | Date      | Amount |
+---------------+--------+---------+-----------+--------+
| Out           | 1      | B       | 15-Aug-19 | 360    |
| In            | 1      | A       | 16-Aug-19 | 180    |
| In            | 1      | B       | 17-Aug-19 | 180    |
| Out           | 1      | A       | 18-Aug-19 | 140    |
| In            | 1      | B       | 18-Aug-19 | 80     |
| In            | 1      | A       | 19-Aug-19 | 60     |
| Out           | 2      | B       | 14-Aug-19 | 45     |
| Out           | 2      | C       | 15-Aug-20 | 85     |
| In            | 2      | C       | 15-Aug-20 | 130    |
| Out           | 2      | A       | 20-Aug-19 | 100    |
| In            | 2      | A       | 22-Aug-19 | 30     |
| In            | 2      | B       | 23-Aug-19 | 30     |
| In            | 2      | C       | 23-Aug-19 | 40     |
+---------------+--------+---------+-----------+--------+

I would then create a new column that divides them into different transactions:

+---------------+--------+---------+-----------+--------+------+
| Switch In/Out | Client | Quality | Date      | Amount | Rows |
+---------------+--------+---------+-----------+--------+------+
| Out           | 1      | B       | 15-Aug-19 | 360    | 1    |
| In            | 1      | A       | 16-Aug-19 | 180    | 1    |
| In            | 1      | B       | 17-Aug-19 | 180    | 1    |
| Out           | 1      | A       | 18-Aug-19 | 140    | 2    |
| In            | 1      | B       | 18-Aug-19 | 80     | 2    |
| In            | 1      | A       | 19-Aug-19 | 60     | 2    |
| Out           | 2      | B       | 14-Aug-19 | 45     | 3    |
| Out           | 2      | C       | 15-Aug-20 | 85     | 3    |
| In            | 2      | C       | 15-Aug-20 | 130    | 3    |
| Out           | 2      | A       | 20-Aug-19 | 100    | 4    |
| In            | 2      | A       | 22-Aug-19 | 30     | 4    |
| In            | 2      | B       | 23-Aug-19 | 30     | 4    |
| In            | 2      | C       | 23-Aug-19 | 40     | 4    |
+---------------+--------+---------+-----------+--------+------+

With this, I can apply the pivot formula and take it from there. However, how do I do this in python? In Excel, I can just use multiple SUMIFS and compare in and out, but this is not possible in python. Thank you!
One simple solution is to iterate and apply a check (a function) over each element, with the results becoming a new column; in other words, map. Using df.index.map we get each item's index passed as an argument, so we can fetch values and compare them. In your case the aim is to identify each change to "Out" while keeping a counter.

import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In", "Out",
               "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=['Switch In/Out'])

counter = 1

def changeToOut(i):
    # start a new transaction whenever an "Out" directly follows an "In"
    global counter
    if df["Switch In/Out"].get(i) == "Out" and df["Switch In/Out"].get(i-1) == "In":
        counter += 1
    return counter

rows = df.index.map(changeToOut)
df["Rows"] = rows
df

Result:

+----+-----------------+--------+
|    | Switch In/Out   |   Rows |
|----+-----------------+--------|
|  0 | Out             |      1 |
|  1 | In              |      1 |
|  2 | In              |      1 |
|  3 | Out             |      2 |
|  4 | In              |      2 |
|  5 | In              |      2 |
|  6 | Out             |      3 |
|  7 | Out             |      3 |
|  8 | In              |      3 |
|  9 | Out             |      4 |
| 10 | In              |      4 |
| 11 | In              |      4 |
| 12 | In              |      4 |
+----+-----------------+--------+
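A vectorized alternative (a sketch of the same idea, not part of the answer above): an "Out" that directly follows an "In" starts a new transaction, so you can compare the column with a shifted copy of itself and take a cumulative sum of the breakpoints:

is_new = (df['Switch In/Out'] == 'Out') & (df['Switch In/Out'].shift() == 'In')
df['Rows'] = is_new.cumsum() + 1  # +1 so numbering starts at 1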
Assigning states of Hidden Markov Models by idealized intensity values
I'm running the pomegranate HMM (http://pomegranate.readthedocs.io/en/latest/HiddenMarkovModel.html) on my data, and I load the results into a Pandas DF, and define the idealized intensity as the median of all the points in that state: df["hmm_idealized"] = df.groupby(["hmm_state"],as_index = False)["Raw"].transform("median"). Sample data: +-----+-----------------+-------------+------------+ | | hmm_idealized | hmm_state | hmm_diff | |-----+-----------------+-------------+------------| | 0 | 99862 | 3 | nan | | 1 | 99862 | 3 | 0 | | 2 | 99862 | 3 | 0 | | 3 | 99862 | 3 | 0 | | 4 | 99862 | 3 | 0 | | 5 | 99862 | 3 | 0 | | 6 | 117759 | 4 | 1 | | 7 | 117759 | 4 | 0 | | 8 | 117759 | 4 | 0 | | 9 | 117759 | 4 | 0 | | 10 | 117759 | 4 | 0 | | 11 | 117759 | 4 | 0 | | 12 | 117759 | 4 | 0 | | 13 | 117759 | 4 | 0 | | 14 | 124934 | 2 | -2 | | 15 | 124934 | 2 | 0 | | 16 | 124934 | 2 | 0 | | 17 | 124934 | 2 | 0 | | 18 | 124934 | 2 | 0 | | 19 | 117759 | 4 | 2 | | 20 | 117759 | 4 | 0 | | 21 | 117759 | 4 | 0 | | 22 | 117759 | 4 | 0 | | 23 | 117759 | 4 | 0 | | 24 | 117759 | 4 | 0 | | 25 | 117759 | 4 | 0 | | 26 | 117759 | 4 | 0 | | 27 | 117759 | 4 | 0 | | 28 | 117759 | 4 | 0 | | 29 | 117759 | 4 | 0 | | 30 | 117759 | 4 | 0 | | 31 | 117759 | 4 | 0 | | 32 | 117759 | 4 | 0 | | 33 | 117759 | 4 | 0 | | 34 | 117759 | 4 | 0 | | 35 | 117759 | 4 | 0 | | 36 | 117759 | 4 | 0 | | 37 | 117759 | 4 | 0 | | 38 | 117759 | 4 | 0 | | 39 | 117759 | 4 | 0 | | 40 | 106169 | 1 | -3 | | 41 | 106169 | 1 | 0 | | 42 | 106169 | 1 | 0 | | 43 | 106169 | 1 | 0 | | 44 | 106169 | 1 | 0 | | 45 | 106169 | 1 | 0 | | 46 | 106169 | 1 | 0 | | 47 | 106169 | 1 | 0 | | 48 | 106169 | 1 | 0 | | 49 | 106169 | 1 | 0 | | 50 | 106169 | 1 | 0 | | 51 | 106169 | 1 | 0 | | 52 | 106169 | 1 | 0 | | 53 | 106169 | 1 | 0 | | 54 | 106169 | 1 | 0 | | 55 | 106169 | 1 | 0 | | 56 | 106169 | 1 | 0 | | 57 | 106169 | 1 | 0 | | 58 | 106169 | 1 | 0 | | 59 | 106169 | 1 | 0 | | 60 | 106169 | 1 | 0 | | 61 | 106169 | 1 | 0 | | 62 | 106169 | 1 | 0 | | 63 | 106169 | 1 | 0 | | 64 | 106169 | 1 | 0 | | 65 | 106169 | 1 | 0 | | 66 | 106169 | 1 | 0 | | 67 | 106169 | 1 | 0 | | 68 | 106169 | 1 | 0 | | 69 | 106169 | 1 | 0 | | 70 | 106169 | 1 | 0 | | 71 | 106169 | 1 | 0 | | 72 | 106169 | 1 | 0 | | 73 | 106169 | 1 | 0 | | 74 | 106169 | 1 | 0 | | 75 | 99862 | 3 | 2 | | 76 | 99862 | 3 | 0 | | 77 | 99862 | 3 | 0 | | 78 | 99862 | 3 | 0 | | 79 | 99862 | 3 | 0 | | 80 | 99862 | 3 | 0 | | 81 | 99862 | 3 | 0 | | 82 | 99862 | 3 | 0 | | 83 | 99862 | 3 | 0 | | 84 | 99862 | 3 | 0 | | 85 | 99862 | 3 | 0 | | 86 | 99862 | 3 | 0 | | 87 | 99862 | 3 | 0 | | 88 | 99862 | 3 | 0 | | 89 | 99862 | 3 | 0 | | 90 | 99862 | 3 | 0 | | 91 | 99862 | 3 | 0 | | 92 | 99862 | 3 | 0 | | 93 | 99862 | 3 | 0 | | 94 | 99862 | 3 | 0 | | 95 | 99862 | 3 | 0 | | 96 | 99862 | 3 | 0 | | 97 | 99862 | 3 | 0 | | 98 | 99862 | 3 | 0 | | 99 | 99862 | 3 | 0 | | 100 | 99862 | 3 | 0 | | 101 | 99862 | 3 | 0 | | 102 | 99862 | 3 | 0 | | 103 | 99862 | 3 | 0 | | 104 | 99862 | 3 | 0 | | 105 | 99862 | 3 | 0 | | 106 | 99862 | 3 | 0 | | 107 | 99862 | 3 | 0 | | 108 | 94127 | 0 | -3 | | 109 | 94127 | 0 | 0 | | 110 | 94127 | 0 | 0 | | 111 | 94127 | 0 | 0 | | 112 | 94127 | 0 | 0 | | 113 | 94127 | 0 | 0 | | 114 | 94127 | 0 | 0 | | 115 | 94127 | 0 | 0 | | 116 | 94127 | 0 | 0 | | 117 | 94127 | 0 | 0 | | 118 | 94127 | 0 | 0 | | 119 | 94127 | 0 | 0 | | 120 | 94127 | 0 | 0 | | 121 | 94127 | 0 | 0 | | 122 | 94127 | 0 | 0 | | 123 | 94127 | 0 | 0 | | 124 | 94127 | 0 | 0 | | 125 | 94127 | 0 | 0 | | 126 | 94127 | 0 | 0 | | 127 | 94127 | 0 | 0 | | 128 | 94127 | 0 | 0 | | 
129 | 94127 | 0 | 0 | | 130 | 94127 | 0 | 0 | | 131 | 94127 | 0 | 0 | | 132 | 94127 | 0 | 0 | | 133 | 94127 | 0 | 0 | | 134 | 94127 | 0 | 0 | | 135 | 94127 | 0 | 0 | | 136 | 94127 | 0 | 0 | | 137 | 94127 | 0 | 0 | | 138 | 94127 | 0 | 0 | | 139 | 94127 | 0 | 0 | | 140 | 94127 | 0 | 0 | | 141 | 94127 | 0 | 0 | | 142 | 94127 | 0 | 0 | | 143 | 94127 | 0 | 0 | | 144 | 94127 | 0 | 0 | | 145 | 94127 | 0 | 0 | | 146 | 94127 | 0 | 0 | | 147 | 94127 | 0 | 0 | | 148 | 94127 | 0 | 0 | | 149 | 94127 | 0 | 0 | | 150 | 94127 | 0 | 0 | | 151 | 94127 | 0 | 0 | | 152 | 94127 | 0 | 0 | | 153 | 94127 | 0 | 0 | | 154 | 94127 | 0 | 0 | | 155 | 94127 | 0 | 0 | | 156 | 94127 | 0 | 0 | | 157 | 94127 | 0 | 0 | | 158 | 94127 | 0 | 0 | | 159 | 94127 | 0 | 0 | | 160 | 94127 | 0 | 0 | | 161 | 94127 | 0 | 0 | | 162 | 94127 | 0 | 0 | | 163 | 94127 | 0 | 0 | | 164 | 94127 | 0 | 0 | | 165 | 94127 | 0 | 0 | | 166 | 94127 | 0 | 0 | | 167 | 94127 | 0 | 0 | | 168 | 94127 | 0 | 0 | | 169 | 94127 | 0 | 0 | | 170 | 94127 | 0 | 0 | | 171 | 94127 | 0 | 0 | | 172 | 94127 | 0 | 0 | | 173 | 94127 | 0 | 0 | | 174 | 94127 | 0 | 0 | | 175 | 94127 | 0 | 0 | | 176 | 94127 | 0 | 0 | | 177 | 94127 | 0 | 0 | | 178 | 94127 | 0 | 0 | | 179 | 94127 | 0 | 0 | | 180 | 94127 | 0 | 0 | | 181 | 94127 | 0 | 0 | | 182 | 94127 | 0 | 0 | | 183 | 94127 | 0 | 0 | | 184 | 94127 | 0 | 0 | | 185 | 94127 | 0 | 0 | | 186 | 94127 | 0 | 0 | | 187 | 94127 | 0 | 0 | | 188 | 94127 | 0 | 0 | | 189 | 94127 | 0 | 0 | | 190 | 94127 | 0 | 0 | | 191 | 94127 | 0 | 0 | | 192 | 94127 | 0 | 0 | | 193 | 94127 | 0 | 0 | | 194 | 94127 | 0 | 0 | | 195 | 94127 | 0 | 0 | | 196 | 94127 | 0 | 0 | | 197 | 94127 | 0 | 0 | | 198 | 94127 | 0 | 0 | | 199 | 94127 | 0 | 0 | | 200 | 94127 | 0 | 0 | +-----+-----------------+-------------+------------+ When analyzing the results, I want to count the number of increases in the model. I would like to use the df.where(data.hmm_diff > 0).count() function to read off how many times I increment by one state. However, the increase sometimes spans two states (i.e. it skips the middle state), so I need to reassign the HMM-state label by sorting the idealized values, so that the lowest state would be 0, the highest 4, etc. Is there a way to reassign the hmm_state label from arbitrary to dependent on idealized intensity? For example,the hmm_state labeled as "1" lies in between of hmm_state 3 and 4
It looks like you just need to define a sorted HMM state, like this:

state_orders = {v: i for i, v in enumerate(sorted(df.hmm_idealized.unique()))}
df['sorted_state'] = df.hmm_idealized.map(state_orders)

Then you can continue as you did in the question, but taking the diff on this column and counting the jumps on it.
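The counting step could then look like this (a sketch; here an "increase" is any positive jump in the sorted state, including ones that skip a level):

df['sorted_diff'] = df['sorted_state'].diff()
print((df['sorted_diff'] > 0).sum())  # number of upward state transitions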
Numpy version of finding the highest and lowest value locations within an interval of another column?
Given the following numpy array. How can I find the highest and lowest value locations of column 0 within the interval on column 1 using numpy? import numpy as np data = np.array([ [1879.289,np.nan],[1879.281,np.nan],[1879.292,1],[1879.295,1],[1879.481,1],[1879.294,1],[1879.268,1], [1879.293,1],[1879.277,1],[1879.285,1],[1879.464,1],[1879.475,1],[1879.971,1],[1879.779,1], [1879.986,1],[1880.791,1],[1880.29,1],[1879.253,np.nan],[1878.268,np.nan],[1875.73,1],[1876.792,1], [1875.977,1],[1876.408,1],[1877.159,1],[1877.187,1],[1883.164,1],[1883.171,1],[1883.495,1], [1883.962,1],[1885.158,1],[1885.974,1],[1886.479,np.nan],[1885.969,np.nan],[1884.693,1],[1884.977,1], [1884.967,1],[1884.691,1],[1886.171,1],[1886.166,np.nan],[1884.476,np.nan],[1884.66,1],[1882.962,1], [1881.496,1],[1871.163,1],[1874.985,1],[1874.979,1],[1871.173,np.nan],[1871.973,np.nan],[1871.682,np.nan], [1872.476,np.nan],[1882.361,1],[1880.869,1],[1882.165,1],[1881.857,1],[1880.375,1],[1880.66,1], [1880.891,1],[1880.377,1],[1881.663,1],[1881.66,1],[1877.888,1],[1875.69,1],[1875.161,1], [1876.697,np.nan],[1876.671,np.nan],[1879.666,np.nan],[1877.182,np.nan],[1878.898,1],[1878.668,1],[1878.871,1], [1878.882,1],[1879.173,1],[1878.887,1],[1878.68,1],[1878.872,1],[1878.677,1],[1877.877,1], [1877.669,1],[1877.69,1],[1877.684,1],[1877.68,1],[1877.885,1],[1877.863,1],[1877.674,1], [1877.676,1],[1877.687,1],[1878.367,1],[1878.179,1],[1877.696,1],[1877.665,1],[1877.667,np.nan], [1878.678,np.nan],[1878.661,1],[1878.171,1],[1877.371,1],[1877.359,1],[1878.381,1],[1875.185,1], [1875.367,np.nan],[1865.492,np.nan],[1865.495,1],[1866.995,1],[1866.672,1],[1867.465,1],[1867.663,1], [1867.186,1],[1867.687,1],[1867.459,1],[1867.168,1],[1869.689,1],[1869.693,1],[1871.676,1], [1873.174,1],[1873.691,np.nan],[1873.685,np.nan] ]) In the third column below you can see where the max and min is for each interval. 
+-------+----------+-----------+---------+ | index | Value | Intervals | Min/Max | +-------+----------+-----------+---------+ | 0 | 1879.289 | np.nan | | | 1 | 1879.281 | np.nan | | | 2 | 1879.292 | 1 | | | 3 | 1879.295 | 1 | | | 4 | 1879.481 | 1 | | | 5 | 1879.294 | 1 | | | 6 | 1879.268 | 1 | -1 | min | 7 | 1879.293 | 1 | | | 8 | 1879.277 | 1 | | | 9 | 1879.285 | 1 | | | 10 | 1879.464 | 1 | | | 11 | 1879.475 | 1 | | | 12 | 1879.971 | 1 | | | 13 | 1879.779 | 1 | | | 17 | 1879.986 | 1 | | | 18 | 1880.791 | 1 | 1 | max | 19 | 1880.29 | 1 | | | 55 | 1879.253 | np.nan | | | 56 | 1878.268 | np.nan | | | 57 | 1875.73 | 1 | -1 |min | 58 | 1876.792 | 1 | | | 59 | 1875.977 | 1 | | | 60 | 1876.408 | 1 | | | 61 | 1877.159 | 1 | | | 62 | 1877.187 | 1 | | | 63 | 1883.164 | 1 | | | 64 | 1883.171 | 1 | | | 65 | 1883.495 | 1 | | | 66 | 1883.962 | 1 | | | 67 | 1885.158 | 1 | | | 68 | 1885.974 | 1 | 1 | max | 69 | 1886.479 | np.nan | | | 70 | 1885.969 | np.nan | | | 71 | 1884.693 | 1 | | | 72 | 1884.977 | 1 | | | 73 | 1884.967 | 1 | | | 74 | 1884.691 | 1 | -1 | min | 75 | 1886.171 | 1 | 1 | max | 76 | 1886.166 | np.nan | | | 77 | 1884.476 | np.nan | | | 78 | 1884.66 | 1 | 1 | max | 79 | 1882.962 | 1 | | | 80 | 1881.496 | 1 | | | 81 | 1871.163 | 1 | -1 | min | 82 | 1874.985 | 1 | | | 83 | 1874.979 | 1 | | | 84 | 1871.173 | np.nan | | | 85 | 1871.973 | np.nan | | | 86 | 1871.682 | np.nan | | | 87 | 1872.476 | np.nan | | | 88 | 1882.361 | 1 | 1 | max | 89 | 1880.869 | 1 | | | 90 | 1882.165 | 1 | | | 91 | 1881.857 | 1 | | | 92 | 1880.375 | 1 | | | 93 | 1880.66 | 1 | | | 94 | 1880.891 | 1 | | | 95 | 1880.377 | 1 | | | 96 | 1881.663 | 1 | | | 97 | 1881.66 | 1 | | | 98 | 1877.888 | 1 | | | 99 | 1875.69 | 1 | | | 100 | 1875.161 | 1 | -1 | min | 101 | 1876.697 | np.nan | | | 102 | 1876.671 | np.nan | | | 103 | 1879.666 | np.nan | | | 111 | 1877.182 | np.nan | | | 112 | 1878.898 | 1 | | | 113 | 1878.668 | 1 | | | 114 | 1878.871 | 1 | | | 115 | 1878.882 | 1 | | | 116 | 1879.173 | 1 | 1 | max | 117 | 1878.887 | 1 | | | 118 | 1878.68 | 1 | | | 119 | 1878.872 | 1 | | | 120 | 1878.677 | 1 | | | 121 | 1877.877 | 1 | | | 122 | 1877.669 | 1 | | | 123 | 1877.69 | 1 | | | 124 | 1877.684 | 1 | | | 125 | 1877.68 | 1 | | | 126 | 1877.885 | 1 | | | 127 | 1877.863 | 1 | | | 128 | 1877.674 | 1 | | | 129 | 1877.676 | 1 | | | 130 | 1877.687 | 1 | | | 131 | 1878.367 | 1 | | | 132 | 1878.179 | 1 | | | 133 | 1877.696 | 1 | | | 134 | 1877.665 | 1 | -1 | min | 135 | 1877.667 | np.nan | | | 136 | 1878.678 | np.nan | | | 137 | 1878.661 | 1 | 1 | max | 138 | 1878.171 | 1 | | | 139 | 1877.371 | 1 | | | 140 | 1877.359 | 1 | | | 141 | 1878.381 | 1 | | | 142 | 1875.185 | 1 | -1 | min | 143 | 1875.367 | np.nan | | | 144 | 1865.492 | np.nan | | | 145 | 1865.495 | 1 | -1 | min | 146 | 1866.995 | 1 | | | 147 | 1866.672 | 1 | | | 148 | 1867.465 | 1 | | | 149 | 1867.663 | 1 | | | 150 | 1867.186 | 1 | | | 151 | 1867.687 | 1 | | | 152 | 1867.459 | 1 | | | 153 | 1867.168 | 1 | | | 154 | 1869.689 | 1 | | | 155 | 1869.693 | 1 | | | 156 | 1871.676 | 1 | | | 157 | 1873.174 | 1 | 1 | max | 158 | 1873.691 | np.nan | | | 159 | 1873.685 | np.nan | | +-------+----------+-----------+---------+ I must specify upfront that this question has already been answered here with a pandas solution. The solution performs reasonable at about 300 seconds for a table of around 1 million rows. But after some more testing, I see that if the table is over 3 million rows, the execution time increases dramatically to over 2500 seconds and even more. 
This is obviously too long for such a simple task. How would the same problem be solved with numpy?
Here's one NumPy approach:

import numpy as np

mask = ~np.isnan(data[:, 1])

# start/stop indices of each valid (non-NaN) interval
s0 = np.flatnonzero(mask[1:] > mask[:-1]) + 1
s1 = np.flatnonzero(mask[1:] < mask[:-1]) + 1
lens = s1 - s0

# label every masked row with its interval id, then sort values within each interval
tags = np.repeat(np.arange(len(lens)), lens)
idx = np.lexsort((data[mask, 0], tags))

# map sorted positions back to positions in the original (unmasked) array
starts = np.r_[0, lens.cumsum()]
offsets = np.r_[s0[0], s0[1:] - s1[:-1]]
offsets_cumsum = offsets.cumsum()
min_ids = idx[starts[:-1]] + offsets_cumsum
max_ids = idx[starts[1:] - 1] + offsets_cumsum

# -1 marks each interval's minimum, 1 its maximum
out = np.full(data.shape[0], np.nan)
out[min_ids] = -1
out[max_ids] = 1
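With the sample array, out ends up holding -1 at every interval minimum and 1 at every maximum (NaN everywhere else), matching the Min/Max column in the question; np.flatnonzero(out == 1), for instance, lists the max locations.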
So this is a bit of a cheat since it uses scipy:

import numpy as np
from scipy import ndimage

markers = np.isnan(data[:, 1])
groups = np.cumsum(markers)
groups[markers] = 0  # label 0 is treated as background and ignored
# measure each non-NaN block; a block's label is the number of NaNs seen before it
mins, maxs, min_idx, max_idx = ndimage.extrema(
    data[:, 0], labels=groups, index=np.unique(groups[groups > 0]))
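To turn the scipy result into the same ±1 marker column as the NumPy answer above, the returned positions can be scattered back into a full-length array; a sketch, assuming min_idx and max_idx come back as sequences of coordinate tuples (which is how scipy reports positions per label):

out = np.full(data.shape[0], np.nan)
out[[int(i[0]) for i in min_idx]] = -1  # interval minima
out[[int(i[0]) for i in max_idx]] = 1   # interval maxima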
How to add the first values in cascading columns?
Can anyone help me with a better solution than looping? Let's say that we have the following pandas data-frame made out of 4 columns. I am looking for a way to get the same values as in "Result" column through other methods than looping. Here is the logic: If priority1=1 then result=1 If priority1=1 and priority2=1 then result=2 (ignore all other if priority1 !=1) If priority1=1 and priority2=1 and priority3=1 then result=3 (ignore all other if priority1 and priority2 != 1) If priority1=1 and priority2=1 and priority3=1 and priority4=1 then result=4 (ignore all other if priority1 and priority2 and priority3 != 1) The opposite is happening if priority1 is negative. Here is my final result after a very ugly and inefficient looping: +-----+-----------+-----------+-----------+-----------+--------+ | | priority1 | priority2 | priority3 | priority4 | Result | +-----+-----------+-----------+-----------+-----------+--------+ | 0 | | | | | | | 1 | | 1 | -1 | -1 | | | 2 | | | | | | | 3 | | | | 1 | | | 4 | | | | 1 | | | 5 | | | | | | | 6 | | | | -1 | | | 7 | | | | | | | 8 | | | | | | | 9 | 1 | 1 | 1 | 1 | 1 | | 10 | | | | | | | 11 | | | | 1 | | | 12 | | | 1 | | | | 13 | | | | | | | 14 | | | -1 | -1 | | | 15 | | | | | | | 16 | | | | | | | 17 | | | | -1 | | | 18 | | | | | | | 19 | | | | | | | 20 | | 1 | 1 | 1 | 2 | | 21 | | | | | | | 22 | | | -1 | -1 | | | 23 | | | | | | | 24 | | | | | | | 25 | | | | -1 | | | 26 | | | | | | | 27 | | | 1 | 1 | 3 | | 28 | | | | | | | 29 | | | | | | | 30 | | | | -1 | | | 31 | | | | | | | 32 | | | | | | | 33 | | | -1 | -1 | | | 34 | | | | | | | 35 | | | 1 | 1 | 4 | | 36 | | | | | | | 37 | | | | | | | 38 | | | | | | | 39 | | | -1 | -1 | | | 40 | | | | | | | 41 | | | | | | | 42 | | 1 | 1 | 1 | 2 | | 43 | | | | | | | 44 | | | | | | | 45 | | | | -1 | | | 46 | | | | | | | 47 | | | | | | | 48 | | | | | | | 49 | | | | | | | 50 | | -1 | -1 | -1 | | | 51 | | | | | | | 52 | | | | | | | 53 | | 1 | 1 | 1 | 2 | | 54 | | | | | | | 55 | | | | | | | 56 | | | | -1 | | | 57 | | | | | | | 58 | | | | | | | 59 | | | -1 | -1 | | | 60 | | | | | | | 61 | | | | | | | 62 | | | 1 | 1 | 3 | | 63 | | | | | | | 64 | -1 | -1 | -1 | -1 | -1 | | 65 | | | | | | | 66 | | | 1 | 1 | | | 67 | | | | | | | 68 | | | | | | | 69 | | | | | | | 70 | | | | -1 | | | 71 | | | | | | | 72 | | | | | | | 73 | | | 1 | 1 | | | 74 | | | | -1 | | | 75 | | | | | | | 76 | | | 1 | 1 | | | 77 | | | | | | | 78 | | -1 | -1 | -1 | -2 | | 79 | | | 1 | | | | 80 | | | 1 | 1 | | | 81 | | | | | | | 82 | | | -1 | -1 | -3 | | 83 | | | 1 | 1 | | | 84 | | | | | | | 85 | | | | | | | 86 | | | | | | | 87 | | | -1 | -1 | -4 | | 88 | | | | | | | 89 | | -1 | -1 | -1 | -2 | | 90 | | | | | | | 91 | | | | | | | 92 | | | | -1 | | | 93 | | | | | | | 94 | | | | | | | 95 | | | 1 | 1 | | | 96 | | | | | | | 97 | | | | | | | 98 | | | | -1 | | | 99 | | | | 1 | | | 100 | | | | | | | 101 | | | -1 | -1 | -3 | | 102 | | | | | | | 103 | | | | | | | 104 | | 1 | 1 | 1 | | | 105 | | | | | | | 106 | | | | 1 | | | 107 | | | | | | | 108 | | | -1 | -1 | | | 109 | | | | | | | 110 | | | | | | | 111 | | | 1 | 1 | | | 112 | | | | | | | 113 | | | | | | | 114 | | | -1 | -1 | | | 115 | | | | | | | 116 | | | 1 | 1 | | | 117 | | | | | | | 118 | | | | | | | 119 | | -1 | -1 | -1 | -2 | | 120 | | | | | | | 121 | | | | | | | 122 | | | | 1 | | | 123 | | | | | | | 124 | | | | 1 | | | 125 | | | | | | | 126 | | | | | | | 127 | | | 1 | 1 | | | 128 | | | | | | | 129 | | | | | | | 130 | | | -1 | -1 | -3 | | 131 | | | | | | | 132 | | | | | | | 133 | | | | | | | 134 | 1 | 1 | 1 | 1 | 1 | | 135 
| | | | -1 | | | 136 | | | | | | | 137 | | -1 | -1 | -1 | | | 138 | | | 1 | | | | 139 | | | | 1 | | | 140 | | 1 | 1 | 1 | 2 | | 141 | | | 1 | 1 | 3 | | 142 | | | | | | | 143 | | | | -1 | | | 144 | | | | | | | 145 | | | | 1 | 4 | +-----+-----------+-----------+-----------+-----------+--------+
setup

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [ 1,  0,  0,  0],
    [ 1,  1,  0,  0],
    [ 1,  1,  1,  0],
    [ 1,  1,  1,  1],
    [ 0,  1,  1,  1],
    [ 0,  0,  1,  1],
    [ 0,  0,  0,  1],
    [ 1,  0,  1,  1],  # this should end up 1
    [ 0,  0,  0,  0],
    [-1,  0,  0,  0],
    [-1, -1,  0,  0],
    [-1, -1, -1,  0],
    [-1, -1, -1, -1],
    [ 0, -1, -1, -1],
    [ 0,  0, -1, -1],
    [ 0,  0,  0, -1],
], columns=['priority{}'.format(i) for i in range(1, 5)])

solution

v = df.values
df.assign(Results=(v * v.cumprod(1).astype(bool)).sum(1))

    priority1  priority2  priority3  priority4  Results
0           1          0          0          0        1
1           1          1          0          0        2
2           1          1          1          0        3
3           1          1          1          1        4
4           0          1          1          1        0
5           0          0          1          1        0
6           0          0          0          1        0
7           1          0          1          1        1
8           0          0          0          0        0
9          -1          0          0          0       -1
10         -1         -1          0          0       -2
11         -1         -1         -1          0       -3
12         -1         -1         -1         -1       -4
13          0         -1         -1         -1        0
14          0          0         -1         -1        0
15          0          0          0         -1        0

how it works

- grab the numpy array with v = df.values
- mark non-zeros as True with v.astype(bool)
- track each successive column continuing to be non-zero with v.astype(bool).cumprod(1)
- multiply by v to filter out which ones to add
- then sum along the rows: (v * v.astype(bool).cumprod(1)).sum(1)

naive time test (benchmark plots for small and big data omitted)
Using piRSquared's example frame (hat tip!), I might do something like

match = (df.abs() == 1) & (df.eq(df.iloc[:, 0], axis=0))
out = match.cumprod(axis=1).sum(axis=1) * df.iloc[:, 0]

which gives me

In [107]: df["Result"] = out

In [108]: df
Out[108]:
    priority1  priority2  priority3  priority4  Result
0           1          0          0          0       1
1           1          1          0          0       2
2           1          1          1          0       3
3           1          1          1          1       4
4           0          1          1          1       0
5           0          0          1          1       0
6           0          0          0          1       0
7           0          0          0          0       0
8          -1          0          0          0      -1
9          -1         -1          0          0      -2
10         -1         -1         -1          0      -3
11         -1         -1         -1         -1      -4
12          0         -1         -1         -1       0
13          0          0         -1         -1       0
14          0          0          0         -1       0
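The trick here is that df.eq(df.iloc[:, 0], axis=0) compares every column against the priority1 value row-wise, so match.cumprod(axis=1) counts only the initial run of columns that keep matching it, and multiplying by df.iloc[:, 0] restores the sign.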