Python Pandas - Split Excel Spreadsheet By Empty Rows

Given the following input file ("ToSplit2.xlsx"):
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section One | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section Two | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| | | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section Three | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
And the following Python code:
import pandas as pd
import numpy as np

spreadsheetPath = "ToSplit2.xlsx"
xls = pd.ExcelFile(spreadsheetPath)

# Iterate through worksheets in the opened Excel file
for sheet in xls.sheet_names:
    # Create a pandas dataframe from the Excel worksheet (with no headers)
    excel_data_df = pd.read_excel(
        spreadsheetPath, sheet_name=sheet, header=None)
    # Return a list of dataframe index values where the entire row is blank
    indexList = excel_data_df[excel_data_df.isnull().all(1)].index.tolist()
    # Prints [11, 23]
    print(indexList)
    # Initiate a dictionary
    dataframeDictionary = {}
    # For every index value in the list
    for index in indexList:
        # Split and add the result to the dictionary of pandas dataframes
        dataframeDictionary = np.array_split(excel_data_df, index)
    # For every pandas dataframe in the dataframe dictionary
    for dataframe in dataframeDictionary:
        # Write the pandas dataframe to Excel with a worksheet name equal to dataframe cell 0,0
        dataframe.to_excel("output.xlsx", sheet_name=str(dataframe.iloc[0][0]))
I am trying to split the Excel worksheet into multiple worksheets based on the blank rows. E.g.:
Section One: (there would also be Section Two and Section Three worksheets)
+-----------------+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Section One | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 1 | 100 | | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 2 | 100 | 200 | | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 3 | 100 | 200 | 300 | | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 4 | 100 | 200 | 300 | 400 | | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 5 | 100 | 200 | 300 | 400 | 500 | | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 6 | 100 | 200 | 300 | 400 | 500 | 600 | | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 7 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 8 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 9 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| Label 10 | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
+-----------+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
I believe I am really close, but seem to be slipping up on the data frame splitting.

Adjust the file name to match your own file:
import pandas as pd
import numpy as np

# Read the Excel file (no headers; blank rows are kept as all-NaN rows)
df = pd.read_excel('ToSplit2.xlsx', header=None)

# Split by blank rows
df_list = np.split(df, df[df.isnull().all(1)].index)

# Create a new Excel file to write the dataframes to
writer = pd.ExcelWriter('Excel_one.xlsx', engine='xlsxwriter')
for i in range(1, len(df_list) + 1):
    df_list[i - 1] = df_list[i - 1].dropna(how='all')
    df_list[i - 1].to_excel(writer, sheet_name='Sheet{}'.format(i), header=False, index=False)

# Save and close the Excel file (writer.save() in older pandas versions)
writer.close()
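If you also want each worksheet named after its section heading (as in your original attempt with dataframe.iloc[0][0]) rather than Sheet1, Sheet2, ..., a minimal variation of the above could look like this; it is a sketch that assumes the section name always sits in the first cell of each block:
import pandas as pd
import numpy as np

df = pd.read_excel('ToSplit2.xlsx', header=None)

# Split on the fully blank rows, then drop the blank rows from each chunk
sections = np.split(df, df[df.isnull().all(axis=1)].index)

with pd.ExcelWriter('Sections.xlsx', engine='xlsxwriter') as writer:
    for section in sections:
        section = section.dropna(how='all')
        if section.empty:
            continue
        # First cell of the chunk ("Section One", "Section Two", ...) becomes the sheet name;
        # Excel limits sheet names to 31 characters
        sheet_name = str(section.iloc[0, 0])[:31]
        section.to_excel(writer, sheet_name=sheet_name, header=False, index=False)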

Related

Python Pivot Table based on multiple criteria

I was asking the question in this link SUMIFS in python jupyter
However, I just realized that the solution didn't work because they can switch in and switch out on different dates. So basically they have to switch out first before they can switch in.
Here is the dataframe (sorted based on the date):
+---------------+--------+---------+-----------+--------+
| Switch In/Out | Client | Quality | Date | Amount |
+---------------+--------+---------+-----------+--------+
| Out | 1 | B | 15-Aug-19 | 360 |
| In | 1 | A | 16-Aug-19 | 180 |
| In | 1 | B | 17-Aug-19 | 180 |
| Out | 1 | A | 18-Aug-19 | 140 |
| In | 1 | B | 18-Aug-19 | 80 |
| In | 1 | A | 19-Aug-19 | 60 |
| Out | 2 | B | 14-Aug-19 | 45 |
| Out | 2 | C | 15-Aug-20 | 85 |
| In | 2 | C | 15-Aug-20 | 130 |
| Out | 2 | A | 20-Aug-19 | 100 |
| In | 2 | A | 22-Aug-19 | 30 |
| In | 2 | B | 23-Aug-19 | 30 |
| In | 2 | C | 23-Aug-19 | 40 |
+---------------+--------+---------+-----------+--------+
I would then create a new column and divide them into different transactions.
+---------------+--------+---------+-----------+--------+------+
| Switch In/Out | Client | Quality | Date | Amount | Rows |
+---------------+--------+---------+-----------+--------+------+
| Out | 1 | B | 15-Aug-19 | 360 | 1 |
| In | 1 | A | 16-Aug-19 | 180 | 1 |
| In | 1 | B | 17-Aug-19 | 180 | 1 |
| Out | 1 | A | 18-Aug-19 | 140 | 2 |
| In | 1 | B | 18-Aug-19 | 80 | 2 |
| In | 1 | A | 19-Aug-19 | 60 | 2 |
| Out | 2 | B | 14-Aug-19 | 45 | 3 |
| Out | 2 | C | 15-Aug-20 | 85 | 3 |
| In | 2 | C | 15-Aug-20 | 130 | 3 |
| Out | 2 | A | 20-Aug-19 | 100 | 4 |
| In | 2 | A | 22-Aug-19 | 30 | 4 |
| In | 2 | B | 23-Aug-19 | 30 | 4 |
| In | 2 | C | 23-Aug-19 | 40 | 4 |
+---------------+--------+---------+-----------+--------+------+
With this, I can apply the pivot formula and take it from there.
How do I do this in Python? In Excel, I can just use multiple SUMIFS and compare the ins and outs, but I don't know how to replicate that in pandas.
Thank you!
One simple solution is to iterate and apply a check (a function) to each element, with the results forming a new column: in other words, map.
Using df.index.map, each row's index is passed to the function as an argument, so we can look up and compare the current and previous values. In your case the aim is to detect each switch to "Out" while keeping a running counter.
import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In",
               "Out", "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=['Switch In/Out'])

counter = 1

def changeToOut(i):
    global counter
    if df["Switch In/Out"].get(i) == "Out" and df["Switch In/Out"].get(i-1) == "In":
        counter += 1
    return counter

rows = df.index.map(changeToOut)
df["Rows"] = rows
df
Result:
+----+-----------------+--------+
| | Switch In/Out | Rows |
|----+-----------------+--------|
| 0 | Out | 1 |
| 1 | In | 1 |
| 2 | In | 1 |
| 3 | Out | 2 |
| 4 | In | 2 |
| 5 | In | 2 |
| 6 | Out | 3 |
| 7 | Out | 3 |
| 8 | In | 3 |
| 9 | Out | 4 |
| 10 | In | 4 |
| 11 | In | 4 |
| 12 | In | 4 |
+----+-----------------+--------+
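For larger frames, the same grouping can be done without a global counter; here is a vectorized sketch of the same idea (assuming the same 'Switch In/Out' column), which flags every row where an "Out" follows an "In" and takes a cumulative sum:
import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In",
               "Out", "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=['Switch In/Out'])

# A new transaction starts whenever an "Out" row follows an "In" row
s = df['Switch In/Out']
new_transaction = (s == 'Out') & (s.shift() == 'In')
df['Rows'] = new_transaction.cumsum() + 1
This produces the same 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4 sequence as the result above.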

How can I Group By Month from a Date field

I have a data frame similar to this one
| date | Murders | State |
|-----------|--------- |------- |
| 6/2/2017 | 100 | Ags |
| 5/23/2017 | 200 | Ags |
| 5/20/2017 | 300 | BC |
| 6/22/2017 | 400 | BC |
| 6/21/2017 | 500 | Ags |
I would like to group the above data by month and state to get an output as:
| date | Murders(SUM) | State |
|-----------|--------- |------- |
| January | 100 | Ags |
| February | 200 | Ags |
| March | 300 | Ags |
| .... | .... | Ags |
| January | 400 | BC |
| February | 500 | BC |
.... .... ..
I tried with this:
dg = DF.groupby(pd.Grouper(key='date', freq='1M')).sum() # groupby each 1 month
dg.index = dg.index.strftime('%B')
But these lines only sum the murders by month without taking the State into account.
We can do
df.groupby([pd.to_datetime(df.date).dt.strftime('%B'),df.State]).Murders.sum().reset_index()
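If you prefer to keep the pd.Grouper approach from the question (which also keeps months of different years apart), you can simply pass State as a second grouping key; a short sketch assuming DF and its date column as in the question:
import pandas as pd

DF['date'] = pd.to_datetime(DF['date'])
dg = (DF.groupby([pd.Grouper(key='date', freq='M'), 'State'])['Murders']
        .sum()
        .reset_index())
# Convert the month-end timestamps to month names for display
dg['date'] = dg['date'].dt.strftime('%B')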

Python: Parse multiple tables from webpage and group data in CSV

I'm a total newbie at Python and have what I think is a pretty complex problem. I'd like to parse two tables from a website for about 80 URLs, example of one of the pages: https://www.sports-reference.com/cfb/players/sam-darnold-1.html
I'd need the first table "Passing" and the second table "Rushing and Receiving" from each of the 80 URLs (I know how to get the first and second table). But the problem is I need it for all 80 URLs in one csv.
This is my code so far and how the data looks:
import requests
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ['School', 'Conf', 'Class', 'Pos', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate']

urls = ['https://www.sports-reference.com/cfb/players/russell-wilson-1.html',
        'https://www.sports-reference.com/cfb/players/cam-newton-1.html',
        'https://www.sports-reference.com/cfb/players/peyton-manning-1.html']

# Scrape elements
dataframes = []
try:
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        # print(soup)
        table = soup.find_all('table')[0]  # Find the first "table" tag in the page
        rows = table.find_all("tr")
        cy_data = []
        for row in rows:
            cells = row.find_all("td")
            cells = cells[0:14]
            cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
        dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))
except:
    pass

data = pd.concat(dataframes)
data.to_csv('testcsv3.csv', sep=',')
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
| | | School | Conf | Class | Pos | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate |
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
| 1 | | | | | | | | | | | | | | | |
| 2 | | North Carolina State | ACC | FR | QB | 11 | 150 | 275 | 54.5 | 1955 | 7.1 | 8.2 | 17 | 1 | 133.9 |
| 3 | | North Carolina State | ACC | SO | QB | 12 | 224 | 378 | 59.3 | 3027 | 8 | 8.3 | 31 | 11 | 147.8 |
| 4 | | North Carolina State | ACC | JR | QB | 13 | 308 | 527 | 58.4 | 3563 | 6.8 | 6.6 | 28 | 14 | 127.5 |
| 5 | | Wisconsin | Big Ten | SR | QB | 14 | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 |
| 6 | | Overall | | | | | 907 | 1489 | 60.9 | 11720 | 7.9 | 8.4 | 109 | 30 | 147.2 |
| 7 | | North Carolina State | | | | | 682 | 1180 | 57.8 | 8545 | 7.2 | 7.5 | 76 | 26 | 135.5 |
| 8 | | Wisconsin | | | | | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 |
| 1 | | | | | | | | | | | | | | | |
| 2 | | Florida | SEC | FR | QB | 5 | 5 | 10 | 50 | 40 | 4 | 4 | 0 | 0 | 83.6 |
| 3 | | Florida | SEC | SO | QB | 1 | 1 | 2 | 50 | 14 | 7 | 7 | 0 | 0 | 108.8 |
| 4 | | Auburn | SEC | JR | QB | 14 | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 |
| 5 | | Overall | | | | | 191 | 292 | 65.4 | 2908 | 10 | 10.9 | 30 | 7 | 178.2 |
| 6 | | Florida | | | | | 6 | 12 | 50 | 54 | 4.5 | 4.5 | 0 | 0 | 87.8 |
| 7 | | Auburn | | | | | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 |
+---+--+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+
And this is how I'd like the data to look. Note that the player name, which is missing from each grouping, would ideally be derived from the URL, and I've added the second table, which I need help appending:
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
| | | School | Conf | Class | Pos | G | Cmp | Att | Pct | Yds | Y/A | AY/A | TD | Int | Rate | School | Conf | Class | Pos | G | Att | Yds | Avg | TD |
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
| 1 | | | | | | | | | | | | | | | | | | | | | | | | |
| 2 | Russell Wilson | North Carolina State | ACC | FR | QB | 11 | 150 | 275 | 54.5 | 1955 | 7.1 | 8.2 | 17 | 1 | 133.9 | North Carolina State | ACC | FR | QB | 11 | 150 | 467 | 6.7 | 3 |
| 3 | Russell Wilson | North Carolina State | ACC | SO | QB | 12 | 224 | 378 | 59.3 | 3027 | 8 | 8.3 | 31 | 11 | 147.8 | North Carolina State | ACC | SO | QB | 12 | 129 | 300 | 6.8 | 2 |
| 4 | Russell Wilson | North Carolina State | ACC | JR | QB | 13 | 308 | 527 | 58.4 | 3563 | 6.8 | 6.6 | 28 | 14 | 127.5 | North Carolina State | ACC | JR | QB | 13 | 190 | 560 | 7.1 | 5 |
| 5 | Russell Wilson | Wisconsin | Big Ten | SR | QB | 14 | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 | Wisconsin | Big Ten | SR | QB | 14 | 210 | 671 | 7.3 | 7 |
| 6 | Russell Wilson | Overall | | | | | 907 | 1489 | 60.9 | 11720 | 7.9 | 8.4 | 109 | 30 | 147.2 | Overall | | | | | | | | |
| 7 | Russell Wilson | North Carolina State | | | | | 682 | 1180 | 57.8 | 8545 | 7.2 | 7.5 | 76 | 26 | 135.5 | North Carolina State | | | | | | | | |
| 8 | Russell Wilson | Wisconsin | | | | | 225 | 309 | 72.8 | 3175 | 10.3 | 11.8 | 33 | 4 | 191.8 | Wisconsin | | | | | | | | |
| 1 | | | | | | | | | | | | | | | | | | | | | | | | |
| 2 | Cam Newton | Florida | SEC | FR | QB | 5 | 5 | 10 | 50 | 40 | 4 | 4 | 0 | 0 | 83.6 | Florida | SEC | FR | QB | 5 | 210 | 456 | 7.1 | 2 |
| 3 | Cam Newton | Florida | SEC | SO | QB | 1 | 1 | 2 | 50 | 14 | 7 | 7 | 0 | 0 | 108.8 | Florida | SEC | SO | QB | 1 | 212 | 478 | 4.5 | 5 |
| 4 | Cam Newton | Auburn | SEC | JR | QB | 14 | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 | Auburn | SEC | JR | QB | 14 | 219 | 481 | 6.7 | 6 |
| 5 | Cam Newton | Overall | | | | | 191 | 292 | 65.4 | 2908 | 10 | 10.9 | 30 | 7 | 178.2 | Overall | | | | | | | 3.4 | 7 |
| 6 | Cam Newton | Florida | | | | | 6 | 12 | 50 | 54 | 4.5 | 4.5 | 0 | 0 | 87.8 | Florida | | | | | | | | |
| 7 | Cam Newton | Auburn | | | | | 185 | 280 | 66.1 | 2854 | 10.2 | 11.2 | 30 | 7 | 182 | Auburn | | | | | | | | |
+---+----------------+----------------------+---------+-------+-----+----+-----+------+------+-------+------+------+-----+-----+-------+----------------------+---------+-------+-----+----+-----+-----+-----+----+
So basically I'd want to append the second table (only the columns mentioned) to the end of the first table and add the player name (read from the URL) to each row.
import requests
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ['School', 'Conf', 'Class', 'Pos', 'G', 'Cmp', 'Att', 'Pct', 'Yds', 'Y/A', 'AY/A', 'TD', 'Int', 'Rate']
COLUMNS2 = ['School', 'Conf', 'Class', 'Pos', 'G', 'Att', 'Yds', 'Avg', 'TD', 'Rec', 'Yds', 'Avg', 'TD', 'Plays', 'Yds', 'Avg', 'TD']

urls = ['https://www.sports-reference.com/cfb/players/russell-wilson-1.html',
        'https://www.sports-reference.com/cfb/players/cam-newton-1.html',
        'https://www.sports-reference.com/cfb/players/peyton-manning-1.html']

# Scrape elements
dataframes = []
dataframes2 = []
for url in urls:
    print(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    table = soup.find_all('table')[0]  # Find the first "table" tag in the page (Passing)
    rows = table.find_all("tr")
    cy_data = []
    for row in rows:
        cells = row.find_all("td")
        cells = cells[0:14]
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    cy_data = pd.DataFrame(cy_data, columns=COLUMNS)
    # Create a Player column in the first position and derive the player name from the URL
    cy_data.insert(0, 'Player', url)
    cy_data['Player'] = cy_data['Player'].str.split('/').str[5].str.split('-').str[0].str.title() + ' ' + cy_data['Player'].str.split('/').str[5].str.split('-').str[1].str.title()
    dataframes.append(cy_data)

    table2 = soup.find_all('table')[1]  # Find the second "table" tag in the page (Rushing and Receiving)
    rows2 = table2.find_all("tr")
    cy_data2 = []
    for row2 in rows2:
        cells2 = row2.find_all("td")
        cells2 = cells2[0:17]  # 17 cells to match COLUMNS2
        cy_data2.append([cell.text for cell in cells2])
    cy_data2 = pd.DataFrame(cy_data2, columns=COLUMNS2)
    cy_data2.insert(0, 'Player', url)
    cy_data2['Player'] = cy_data2['Player'].str.split('/').str[5].str.split('-').str[0].str.title() + ' ' + cy_data2['Player'].str.split('/').str[5].str.split('-').str[1].str.title()
    dataframes2.append(cy_data2)

data = pd.concat(dataframes).reset_index()
data2 = pd.concat(dataframes2).reset_index()

data3 = data.merge(data2, on=['index', 'Player'], suffixes=('', ' '))
# Filter out None rows
data3 = data3.loc[data3['School'].notnull()].drop('index', axis=1)

display(data, data2, data3)
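One small caveat: the Player derivation above assumes the URL slug is exactly two words plus a number. A slightly more general helper (hypothetical, not part of the original answer) could replace the two str.split lines inside the loop:
def player_from_url(url):
    """Turn '.../peyton-manning-1.html' into 'Peyton Manning'."""
    slug = url.rstrip('/').rsplit('/', 1)[-1].replace('.html', '')
    parts = slug.split('-')
    # Drop the trailing numeric disambiguator ("-1") if present
    if parts and parts[-1].isdigit():
        parts = parts[:-1]
    return ' '.join(parts).title()

# Inside the loop, after building cy_data / cy_data2:
cy_data.insert(0, 'Player', player_from_url(url))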

Numpy version of finding the highest and lowest value locations within an interval of another column?

Given the following numpy array. How can I find the highest and lowest value locations of column 0 within the interval on column 1 using numpy?
import numpy as np
data = np.array([
[1879.289,np.nan],[1879.281,np.nan],[1879.292,1],[1879.295,1],[1879.481,1],[1879.294,1],[1879.268,1],
[1879.293,1],[1879.277,1],[1879.285,1],[1879.464,1],[1879.475,1],[1879.971,1],[1879.779,1],
[1879.986,1],[1880.791,1],[1880.29,1],[1879.253,np.nan],[1878.268,np.nan],[1875.73,1],[1876.792,1],
[1875.977,1],[1876.408,1],[1877.159,1],[1877.187,1],[1883.164,1],[1883.171,1],[1883.495,1],
[1883.962,1],[1885.158,1],[1885.974,1],[1886.479,np.nan],[1885.969,np.nan],[1884.693,1],[1884.977,1],
[1884.967,1],[1884.691,1],[1886.171,1],[1886.166,np.nan],[1884.476,np.nan],[1884.66,1],[1882.962,1],
[1881.496,1],[1871.163,1],[1874.985,1],[1874.979,1],[1871.173,np.nan],[1871.973,np.nan],[1871.682,np.nan],
[1872.476,np.nan],[1882.361,1],[1880.869,1],[1882.165,1],[1881.857,1],[1880.375,1],[1880.66,1],
[1880.891,1],[1880.377,1],[1881.663,1],[1881.66,1],[1877.888,1],[1875.69,1],[1875.161,1],
[1876.697,np.nan],[1876.671,np.nan],[1879.666,np.nan],[1877.182,np.nan],[1878.898,1],[1878.668,1],[1878.871,1],
[1878.882,1],[1879.173,1],[1878.887,1],[1878.68,1],[1878.872,1],[1878.677,1],[1877.877,1],
[1877.669,1],[1877.69,1],[1877.684,1],[1877.68,1],[1877.885,1],[1877.863,1],[1877.674,1],
[1877.676,1],[1877.687,1],[1878.367,1],[1878.179,1],[1877.696,1],[1877.665,1],[1877.667,np.nan],
[1878.678,np.nan],[1878.661,1],[1878.171,1],[1877.371,1],[1877.359,1],[1878.381,1],[1875.185,1],
[1875.367,np.nan],[1865.492,np.nan],[1865.495,1],[1866.995,1],[1866.672,1],[1867.465,1],[1867.663,1],
[1867.186,1],[1867.687,1],[1867.459,1],[1867.168,1],[1869.689,1],[1869.693,1],[1871.676,1],
[1873.174,1],[1873.691,np.nan],[1873.685,np.nan]
])
In the Min/Max column below you can see where the max and min are for each interval.
+-------+----------+-----------+---------+
| index | Value | Intervals | Min/Max |
+-------+----------+-----------+---------+
| 0 | 1879.289 | np.nan | |
| 1 | 1879.281 | np.nan | |
| 2 | 1879.292 | 1 | |
| 3 | 1879.295 | 1 | |
| 4 | 1879.481 | 1 | |
| 5 | 1879.294 | 1 | |
| 6 | 1879.268 | 1 | -1 | min
| 7 | 1879.293 | 1 | |
| 8 | 1879.277 | 1 | |
| 9 | 1879.285 | 1 | |
| 10 | 1879.464 | 1 | |
| 11 | 1879.475 | 1 | |
| 12 | 1879.971 | 1 | |
| 13 | 1879.779 | 1 | |
| 17 | 1879.986 | 1 | |
| 18 | 1880.791 | 1 | 1 | max
| 19 | 1880.29 | 1 | |
| 55 | 1879.253 | np.nan | |
| 56 | 1878.268 | np.nan | |
| 57 | 1875.73 | 1 | -1 |min
| 58 | 1876.792 | 1 | |
| 59 | 1875.977 | 1 | |
| 60 | 1876.408 | 1 | |
| 61 | 1877.159 | 1 | |
| 62 | 1877.187 | 1 | |
| 63 | 1883.164 | 1 | |
| 64 | 1883.171 | 1 | |
| 65 | 1883.495 | 1 | |
| 66 | 1883.962 | 1 | |
| 67 | 1885.158 | 1 | |
| 68 | 1885.974 | 1 | 1 | max
| 69 | 1886.479 | np.nan | |
| 70 | 1885.969 | np.nan | |
| 71 | 1884.693 | 1 | |
| 72 | 1884.977 | 1 | |
| 73 | 1884.967 | 1 | |
| 74 | 1884.691 | 1 | -1 | min
| 75 | 1886.171 | 1 | 1 | max
| 76 | 1886.166 | np.nan | |
| 77 | 1884.476 | np.nan | |
| 78 | 1884.66 | 1 | 1 | max
| 79 | 1882.962 | 1 | |
| 80 | 1881.496 | 1 | |
| 81 | 1871.163 | 1 | -1 | min
| 82 | 1874.985 | 1 | |
| 83 | 1874.979 | 1 | |
| 84 | 1871.173 | np.nan | |
| 85 | 1871.973 | np.nan | |
| 86 | 1871.682 | np.nan | |
| 87 | 1872.476 | np.nan | |
| 88 | 1882.361 | 1 | 1 | max
| 89 | 1880.869 | 1 | |
| 90 | 1882.165 | 1 | |
| 91 | 1881.857 | 1 | |
| 92 | 1880.375 | 1 | |
| 93 | 1880.66 | 1 | |
| 94 | 1880.891 | 1 | |
| 95 | 1880.377 | 1 | |
| 96 | 1881.663 | 1 | |
| 97 | 1881.66 | 1 | |
| 98 | 1877.888 | 1 | |
| 99 | 1875.69 | 1 | |
| 100 | 1875.161 | 1 | -1 | min
| 101 | 1876.697 | np.nan | |
| 102 | 1876.671 | np.nan | |
| 103 | 1879.666 | np.nan | |
| 111 | 1877.182 | np.nan | |
| 112 | 1878.898 | 1 | |
| 113 | 1878.668 | 1 | |
| 114 | 1878.871 | 1 | |
| 115 | 1878.882 | 1 | |
| 116 | 1879.173 | 1 | 1 | max
| 117 | 1878.887 | 1 | |
| 118 | 1878.68 | 1 | |
| 119 | 1878.872 | 1 | |
| 120 | 1878.677 | 1 | |
| 121 | 1877.877 | 1 | |
| 122 | 1877.669 | 1 | |
| 123 | 1877.69 | 1 | |
| 124 | 1877.684 | 1 | |
| 125 | 1877.68 | 1 | |
| 126 | 1877.885 | 1 | |
| 127 | 1877.863 | 1 | |
| 128 | 1877.674 | 1 | |
| 129 | 1877.676 | 1 | |
| 130 | 1877.687 | 1 | |
| 131 | 1878.367 | 1 | |
| 132 | 1878.179 | 1 | |
| 133 | 1877.696 | 1 | |
| 134 | 1877.665 | 1 | -1 | min
| 135 | 1877.667 | np.nan | |
| 136 | 1878.678 | np.nan | |
| 137 | 1878.661 | 1 | 1 | max
| 138 | 1878.171 | 1 | |
| 139 | 1877.371 | 1 | |
| 140 | 1877.359 | 1 | |
| 141 | 1878.381 | 1 | |
| 142 | 1875.185 | 1 | -1 | min
| 143 | 1875.367 | np.nan | |
| 144 | 1865.492 | np.nan | |
| 145 | 1865.495 | 1 | -1 | min
| 146 | 1866.995 | 1 | |
| 147 | 1866.672 | 1 | |
| 148 | 1867.465 | 1 | |
| 149 | 1867.663 | 1 | |
| 150 | 1867.186 | 1 | |
| 151 | 1867.687 | 1 | |
| 152 | 1867.459 | 1 | |
| 153 | 1867.168 | 1 | |
| 154 | 1869.689 | 1 | |
| 155 | 1869.693 | 1 | |
| 156 | 1871.676 | 1 | |
| 157 | 1873.174 | 1 | 1 | max
| 158 | 1873.691 | np.nan | |
| 159 | 1873.685 | np.nan | |
+-------+----------+-----------+---------+
I must specify upfront that this question has already been answered here with a pandas solution. The solution performs reasonably, at about 300 seconds for a table of around 1 million rows. But after some more testing, I see that if the table is over 3 million rows, the execution time increases dramatically to over 2,500 seconds. This is obviously too long for such a simple task. How would the same problem be solved with numpy?
Here's one NumPy approach -
mask = ~np.isnan(data[:, 1])
# Start/stop boundaries of each contiguous run of interval (non-NaN) rows
s0 = np.flatnonzero(mask[1:] > mask[:-1]) + 1
s1 = np.flatnonzero(mask[1:] < mask[:-1]) + 1
lens = s1 - s0
# Tag every masked row with its interval id, then sort within each interval by value
tags = np.repeat(np.arange(len(lens)), lens)
idx = np.lexsort((data[mask, 0], tags))
starts = np.r_[0, lens.cumsum()]
# Cumulative NaN-gap lengths translate masked positions back to original row numbers
offsets = np.r_[s0[0], s0[1:] - s1[:-1]]
offsets_cumsum = offsets.cumsum()
min_ids = idx[starts[:-1]] + offsets_cumsum
max_ids = idx[starts[1:] - 1] + offsets_cumsum
out = np.full(data.shape[0], np.nan)
out[min_ids] = -1
out[max_ids] = 1
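As a slower but easy-to-verify cross-check of the vectorized result, the same output can be built with a plain loop over the NaN-delimited runs; a sketch using the same data array and mask definition:
import numpy as np

mask = ~np.isnan(data[:, 1])
prev = np.concatenate(([False], mask[:-1]))
nxt = np.concatenate((mask[1:], [False]))
# Start (inclusive) and stop (exclusive) positions of each run of interval rows
run_starts = np.flatnonzero(mask & ~prev)
run_stops = np.flatnonzero(mask & ~nxt) + 1

out_check = np.full(data.shape[0], np.nan)
for st, en in zip(run_starts, run_stops):
    seg = data[st:en, 0]
    out_check[st + seg.argmin()] = -1
    out_check[st + seg.argmax()] = 1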
So this is a bit of a cheat since it uses scipy:
import numpy as np
from scipy import ndimage

markers = np.isnan(data[:, 1])
groups = np.cumsum(markers)
# Each NaN row bumps the label, so interval rows share the label of the NaN run before them;
# index=range(2, groups.max(), 2) assumes every NaN gap is exactly two rows long
mins, maxs, min_idx, max_idx = ndimage.extrema(
    data[:, 0], labels=groups, index=range(2, groups.max(), 2))

Interpolate in SQL based on subgroup in django models

I have the following sheetinfo model with the following data:
| Trav | Group | Subgroup | Sheet_num | T_val |
| SAT123A01 | SAT123 | A | 1 | 400 |
| SAT123A02 | SAT123 | A | 2 | 0 |
| SAT123A03 | SAT123 | A | 3 | 0 |
| SAT123A04 | SAT123 | A | 4 | 0 |
| SAT123A05 | SAT123 | A | 5 | 500 |
| SAT123B05 | SAT123 | B | 5 | 400 |
| SAT123B04 | SAT123 | B | 4 | 0 |
| SAT123B03 | SAT123 | B | 3 | 0 |
| SAT123B02 | SAT123 | B | 2 | 500 |
| SAT124A01 | SAT124 | A | 1 | 400 |
| SAT124A02 | SAT124 | A | 2 | 0 |
| SAT124A03 | SAT124 | A | 3 | 0 |
| SAT124A04 | SAT124 | A | 4 | 475 |
I would like to interpolate and update the table with the correct T_val.
Formula is:
new_t_val = delta / (cnt - 1) * (sheet_num - 1) + min_tvc_of_subgroup
For instance the top 5:
| Trav | Group | Subgroup | Sheet_num | T_val |
| SAT123A01 | SAT123 | A | 1 | 400 |
| SAT123A02 | SAT123 | A | 2 | 425 |
| SAT123A03 | SAT123 | A | 3 | 450 |
| SAT123A04 | SAT123 | A | 4 | 475 |
| SAT123A05 | SAT123 | A | 5 | 500 |
I have a Django query that works to update the data; however, it is SLOW and stops after a while (due to type errors, etc.).
My question: is there a way to accomplish this in SQL?
The ability to do this as one database call doesn't exist in stock Django. 3rd party packages exist though: https://github.com/aykut/django-bulk-update
Example of how that package works:
rows = Model.objects.all()
for row in rows:
    # Modify rows as appropriate; delta, cnt and min_tvc_of_subgroup are computed per subgroup beforehand
    row.T_val = delta / (cnt - 1) * (row.sheet_num - 1) + min_tvc_of_subgroup
Model.objects.bulk_update(rows)
For datasets up to the 1,000,000 range, this should have reasonable performance. Most of the bottleneck in iterating through and .save()-ing each object is the overhead on a database call. The python part is reasonably fast. The above example has only two database calls so it will be perhaps an order of magnitude or two faster.
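To make the example concrete, delta, cnt and min_tvc_of_subgroup have to be computed once per (Group, Subgroup) before updating; a rough sketch, assuming the model is called Sheetinfo with lower-case field names taken from the table above:
from collections import defaultdict
from django_bulk_update.helper import bulk_update

# Sheetinfo, group, subgroup, sheet_num and t_val are assumed names based on the table above
rows = list(Sheetinfo.objects.all())
by_subgroup = defaultdict(list)
for row in rows:
    by_subgroup[(row.group, row.subgroup)].append(row)

for members in by_subgroup.values():
    endpoints = [r.t_val for r in members if r.t_val]  # the non-zero first/last sheets
    delta = max(endpoints) - min(endpoints)
    cnt = len(members)
    min_tvc_of_subgroup = min(endpoints)
    for r in members:
        r.t_val = delta / (cnt - 1) * (r.sheet_num - 1) + min_tvc_of_subgroup

# django-bulk-update batches the writes into a handful of UPDATE statements
bulk_update(rows, update_fields=['t_val'])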
