Numpy Separating CSV into columns - python
I'm trying to use a CSV exported from bballreference.com. But as you can see, the separated values all end up in a single column rather than being split across columns. In NumPy or pandas, what would be the easiest way to fix this? I've googled to no avail.
I don't know how to post CSV file in a clean way but here it is:
",,,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Shooting,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game"
"Rk,Player,Age,G,GS,MP,FG,FGA,3P,3PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,MP,PTS,TRB,AST,STL,BLK"
"1,Kevin Durant\duranke01,29,5,5,182,54,107,9,28,22,27,3,34,37,24,7,6,10,7,139,.505,.321,.815,36.5,27.8,7.4,4.8,1.4,1.2"
"2,Klay Thompson\thompkl01,27,5,5,183,38,99,12,43,11,11,3,29,32,9,1,2,6,11,99,.384,.279,1.000,36.7,19.8,6.4,1.8,0.2,0.4"
"3,Stephen Curry\curryst01,29,4,3,125,32,67,15,34,19,19,2,19,21,14,8,2,15,6,98,.478,.441,1.000,31.2,24.5,5.3,3.5,2.0,0.5"
"4,Draymond Green\greendr01,27,5,5,186,27,55,8,20,12,15,12,47,59,50,12,8,18,16,74,.491,.400,.800,37.1,14.8,11.8,10.0,2.4,1.6"
"5,Andre Iguodala\iguodan01,34,5,4,140,14,29,4,12,7,12,4,21,25,17,10,2,3,7,39,.483,.333,.583,27.9,7.8,5.0,3.4,2.0,0.4"
"6,Quinn Cook\cookqu01,24,4,0,58,12,27,0,10,6,8,1,8,9,4,1,0,2,4,30,.444,.000,.750,14.4,7.5,2.3,1.0,0.3,0.0"
"7,Kevon Looney\looneke01,21,5,0,113,12,17,0,0,4,8,10,19,29,5,4,1,2,17,28,.706,,.500,22.6,5.6,5.8,1.0,0.8,0.2"
"8,Shaun Livingston\livinsh01,32,5,0,79,11,27,0,0,4,4,0,6,6,12,0,1,3,9,26,.407,,1.000,15.9,5.2,1.2,2.4,0.0,0.2"
"9,David West\westda01,37,5,0,40,8,14,0,0,0,0,2,5,7,13,2,4,3,4,16,.571,,,7.9,3.2,1.4,2.6,0.4,0.8"
"10,Nick Young\youngni01,32,4,2,41,3,11,3,10,2,3,0,4,4,1,1,0,1,3,11,.273,.300,.667,10.2,2.8,1.0,0.3,0.3,0.0"
"11,JaVale McGee\mcgeeja01,30,3,1,19,3,8,0,1,0,0,4,2,6,0,0,1,0,2,6,.375,.000,,6.2,2.0,2.0,0.0,0.0,0.3"
"12,Zaza Pachulia\pachuza01,33,2,0,8,1,2,0,0,2,4,4,2,6,0,2,0,1,1,4,.500,,.500,4.2,2.0,3.0,0.0,1.0,0.0"
"13,Jordan Bell\belljo01,23,4,0,23,1,4,0,0,1,2,1,5,6,5,2,2,0,2,3,.250,,.500,5.8,0.8,1.5,1.3,0.5,0.5"
"14,Damian Jones\jonesda03,22,1,0,3,0,1,0,0,2,2,0,0,0,0,0,0,0,0,2,.000,,1.000,3.2,2.0,0.0,0.0,0.0,0.0"
",Team Totals,26.5,5,,1200,216,468,51,158,92,115,46,201,247,154,50,29,64,89,575,.462,.323,.800,240.0,115.0,49.4,30.8,10.0,5.8"
It seems that the first two rows of your CSV file are headers, but by default pd.read_csv treats only the first row as the header.
Also, the leading and trailing quotes make pd.read_csv treat the text in between as a single field/column.
You could try the following:
Remove the leading and trailing quote on each line, and then read the file with a two-row header:
bbal = pd.read_csv('some_file.csv', header=[0, 1], delimiter=',')
Following is how you could use Python to remove the leading and trailing quotes. Note that each line read from the file still ends with a newline, so that has to be stripped before slicing off the quotes (otherwise the slice removes the newline instead of the closing quote):
import pandas as pd

# open 'quotes.csv' in read mode and 'no_quotes.csv' in write mode
with open('quotes.csv') as in_file, open('no_quotes.csv', 'w') as out_file:
    # read in_file line by line; each line is a string
    for line in in_file:
        # drop the trailing newline, slice off the first and last
        # characters (the quotes), and write the result with a newline
        out_file.write(line.rstrip('\n')[1:-1] + '\n')

# read_csv on 'no_quotes.csv' with both header rows
bbal = pd.read_csv('no_quotes.csv', header=[0, 1], delimiter=',')
bbal.head()
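With header=[0, 1], pandas builds a MultiIndex for the columns, which can be awkward to work with. A minimal sketch (using made-up data, not the basketball file) of flattening it by joining the two header levels:

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row-header CSV shaped like the basketball data.
data = "A,A,B\nx,y,z\n1,2,3\n"
df = pd.read_csv(StringIO(data), header=[0, 1])

# header=[0, 1] yields MultiIndex columns like ('A', 'x');
# join the two levels to get flat single-string names.
df.columns = ['_'.join(col) for col in df.columns]
print(df.columns.tolist())  # prints ['A_x', 'A_y', 'B_z']
```

The same one-liner applied to the basketball frame would turn pairs like ('Totals', 'FG') into 'Totals_FG'.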
Consider reading the CSV in as a text file and stripping the leading/trailing quote from each line, since those quotes tell the parser that all the data in between is a single value. Then use the built-in StringIO to read the cleaned text straight into a dataframe instead of saving an intermediate file to disk.
Additionally, skip the first row of repeated Totals and Per Game labels, and even the last aggregate row, since you can compute those aggregations with pandas.
from io import StringIO
import pandas as pd

with open('BasketballCSVQuotes.csv') as f:
    csvdata = f.read().replace('"', '')

df = pd.read_csv(StringIO(csvdata), skiprows=1, skipfooter=1, engine='python')
print(df)
Output
Rk Player Age G GS MP FG FGA 3P 3PA ... PTS FG% 3P% FT% MP.1 PTS.1 TRB.1 AST.1 STL.1 BLK.1
0 1.0 Kevin Durant\duranke01 29.0 5 5.0 182 54 107 9 28 ... 139 0.505 0.321 0.815 36.5 27.8 7.4 4.8 1.4 1.2
1 2.0 Klay Thompson\thompkl01 27.0 5 5.0 183 38 99 12 43 ... 99 0.384 0.279 1.000 36.7 19.8 6.4 1.8 0.2 0.4
2 3.0 Stephen Curry\curryst01 29.0 4 3.0 125 32 67 15 34 ... 98 0.478 0.441 1.000 31.2 24.5 5.3 3.5 2.0 0.5
3 4.0 Draymond Green\greendr01 27.0 5 5.0 186 27 55 8 20 ... 74 0.491 0.400 0.800 37.1 14.8 11.8 10.0 2.4 1.6
4 5.0 Andre Iguodala\iguodan01 34.0 5 4.0 140 14 29 4 12 ... 39 0.483 0.333 0.583 27.9 7.8 5.0 3.4 2.0 0.4
5 6.0 Quinn Cook\cookqu01 24.0 4 0.0 58 12 27 0 10 ... 30 0.444 0.000 0.750 14.4 7.5 2.3 1.0 0.3 0.0
6 7.0 Kevon Looney\looneke01 21.0 5 0.0 113 12 17 0 0 ... 28 0.706 NaN 0.500 22.6 5.6 5.8 1.0 0.8 0.2
7 8.0 Shaun Livingston\livinsh01 32.0 5 0.0 79 11 27 0 0 ... 26 0.407 NaN 1.000 15.9 5.2 1.2 2.4 0.0 0.2
8 9.0 David West\westda01 37.0 5 0.0 40 8 14 0 0 ... 16 0.571 NaN NaN 7.9 3.2 1.4 2.6 0.4 0.8
9 10.0 Nick Young\youngni01 32.0 4 2.0 41 3 11 3 10 ... 11 0.273 0.300 0.667 10.2 2.8 1.0 0.3 0.3 0.0
10 11.0 JaVale McGee\mcgeeja01 30.0 3 1.0 19 3 8 0 1 ... 6 0.375 0.000 NaN 6.2 2.0 2.0 0.0 0.0 0.3
11 12.0 Zaza Pachulia\pachuza01 33.0 2 0.0 8 1 2 0 0 ... 4 0.500 NaN 0.500 4.2 2.0 3.0 0.0 1.0 0.0
12 13.0 Jordan Bell\belljo01 23.0 4 0.0 23 1 4 0 0 ... 3 0.250 NaN 0.500 5.8 0.8 1.5 1.3 0.5 0.5
13 14.0 Damian Jones\jonesda03 22.0 1 0.0 3 0 1 0 0 ... 2 0.000 NaN 1.000 3.2 2.0 0.0 0.0 0.0 0.0
[14 rows x 30 columns]
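Once loaded, the Player column still mixes the display name with the bballreference id (e.g. Kevin Durant\duranke01). A small sketch of splitting it on the backslash; the column names Name and PlayerID are my own choice:

```python
import pandas as pd

# Hypothetical frame mimicking the Player column in the output above.
df = pd.DataFrame({'Player': ['Kevin Durant\\duranke01',
                              'Klay Thompson\\thompkl01']})

# Split on the literal backslash into a display name and an id
# ('\\' in source code is a single backslash character).
df[['Name', 'PlayerID']] = df['Player'].str.split('\\', n=1, expand=True)
```

After this, df['Name'] holds just the readable names, which is usually what you want for display or grouping.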
Related
Having issues trying to make my dataframe numeric
I have a local SQLite database that I read into my program as pandas dataframes:

# Separating hitters and pitchers
pitchers = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE BF_y >= 20 AND BF_x >= 20", northwoods_db)
hitters = pd.read_sql_query("SELECT * FROM ALL_NORTHWOODS_DATA WHERE PA_y >= 25 AND PA_x >= 25", northwoods_db)

But when I do this, some of the numbers are not numeric. Here is the head of one of the dataframes:

index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21 -0.3 Hillsdale GMAC NCAA None 5 None ... 4.0 None 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 -- None Duke ACC NCAA None 15 None ... 13 0 1 88 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21 0.1 Wisconsin-Milwaukee Horz NCAA None 8 None ... 1.0 0.0 2.0 21.0 2.25 9.0 0.0 11.3 11.3 1.0
3 357 2017 22 1.0 Nova Southeastern SSC NCAA None 15.0 None ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21 -0.4 Creighton BigE NCAA None 4 None ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25

To make the dataframes numeric, I used this code:

hitters = hitters.apply(pd.to_numeric, errors='coerce')
pitchers = pitchers.apply(pd.to_numeric, errors='coerce')

But when I did that, the new head of the dataframes is full of NaNs. It seems it got rid of all of the string values, but I want to keep those.

index Year Age_x AgeDif_x Tm_x Lg_x Lev_x Aff_x G_x PA_x ... ER_y BK_y WP_y BF_y WHIP_y H9_y HR9_y BB9_y SO9_y SO/W_y
0 84 2020 21.0 -0.3 NaN NaN NaN NaN 5.0 NaN ... 4.0 NaN 3.0 71.0 1.132 5.6 0.0 4.6 8.7 1.89
1 264 2018 NaN NaN NaN NaN NaN NaN 15.0 NaN ... 13.0 0.0 1.0 88.0 2.111 10.0 0.5 9.0 8.0 0.89
2 298 2019 21.0 0.1 NaN NaN NaN NaN 8.0 NaN ... 1.0 0.0 2.0 21.0 2.250 9.0 0.0 11.3 11.3 1.00
3 357 2017 22.0 1.0 NaN NaN NaN NaN 15.0 NaN ... 20.0 0.0 3.0 206.0 1.489 9.7 0.4 3.7 8.5 2.32
4 418 2021 21.0 -0.4 NaN NaN NaN NaN 4.0 NaN ... 26.0 1.0 6.0 226.0 1.625 8.6 0.9 6.0 7.5 1.25

Is there a better way to make the number values numeric while keeping all my string columns? Maybe there is an SQLite function that can do it better? I am not sure; any help is appreciated.
Maybe you can use combine_first:

hitters_new = hitters.apply(pd.to_numeric, errors='coerce').combine_first(hitters)
pitchers_new = pitchers.apply(pd.to_numeric, errors='coerce').combine_first(pitchers)
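To see why combine_first helps here, a tiny self-contained sketch (column names borrowed from the question, values made up): to_numeric coerces every non-numeric cell to NaN, and combine_first then backfills exactly those NaN slots from the original frame, so string columns survive while numeric strings become numbers.

```python
import pandas as pd

# Made-up miniature of the question's frame: numeric strings mixed
# with genuine text columns.
df = pd.DataFrame({'Year': ['2020', '2018'],
                   'Tm_x': ['Hillsdale', 'Duke'],
                   'ER_y': ['4.0', '13']})

# Coerce everything, then fill the coerced-away cells back in
# from the original frame.
fixed = df.apply(pd.to_numeric, errors='coerce').combine_first(df)
```

After this, fixed['Year'] and fixed['ER_y'] are numeric while fixed['Tm_x'] keeps its strings.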
You can try using astype or convert_dtypes. astype accepts a dict mapping the columns you want to convert to their target dtypes, so if you already know which columns are numeric and which are strings, that can work. Otherwise, take a look at this thread to do it automatically.
exporting to table not aligned
I'm trying to scrape a table from this link: https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc
When scraping the table, the names and stats categories align, but the numbers themselves don't.

import csv
from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(
    requests.get("https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc", timeout=30).text,
    'lxml')

def scrape_data(url):
    # the categories of stats (first row)
    ct = soup.find_all('tr', class_="Table__TR Table__even")
    # player's stats table (the names and numbers)
    st = soup.find_all('tr', class_="Table__TR Table__TR--sm Table__even")
    header = [th.text.rstrip() for th in ct[1].find_all('th')]
    with open('s espn.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in st[1:]:
            data = [th.text.rstrip() for th in row.find_all('td')]
            writer.writerow(data)

scrape_data(soup)

https://imgur.com/UFHC8wf
That's because those are under 2 separate table tags: a table for the names, then a table for the stats. You'll need to merge them.
Use pandas. It's a lot easier for parsing tables (it actually uses BeautifulSoup under the hood). pd.read_html() will return a list of dataframes; I just merged them together. Then you can also manipulate the tables any way you need.

import pandas as pd

url = 'https://www.espn.com/nba/stats/player/_/table/offensive/sort/avgPoints/dir/desc'

dfs = pd.read_html(url)
df = dfs[0].join(dfs[1])
df[['Name','Team']] = df['Name'].str.extract('^(.*?)([A-Z]+)$', expand=True)
df.to_csv('s espn.csv', index=False)

Output:
print (df.head(10).to_string())
RK Name Team POS GP MIN PTS FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% REB AST STL BLK TO DD2 TD3 PER
0 1 Trae Young ATL PG 2 36.5 38.5 13.5 23.0 58.7 5.5 10.0 55.0 6.0 8.0 75.0 7.0 9.0 1.5 0.0 5.5 0 0 40.95
1 2 Kyrie Irving BKN PG 3 34.7 37.7 12.0 26.3 45.6 4.7 11.3 41.2 9.0 9.7 93.1 5.7 6.3 1.7 0.7 2.0 0 0 37.84
2 3 Karl-Anthony Towns MIN C 3 33.7 32.0 10.7 20.3 52.5 5.0 9.7 51.7 5.7 9.0 63.0 13.3 5.0 3.0 2.0 2.3 3 0 40.60
3 4 Damian Lillard POR PG 3 36.7 31.7 10.7 20.3 52.5 2.3 8.0 29.2 8.0 9.0 88.9 4.0 6.0 1.7 0.3 2.7 0 0 32.86
4 5 Giannis Antetokounmpo MIL PF 2 32.5 29.5 11.5 19.0 60.5 1.0 5.0 20.0 5.5 10.0 55.0 15.0 10.0 2.0 1.5 5.5 2 1 34.55
5 6 Luka Doncic DAL SF 3 36.3 29.3 10.0 20.0 50.0 3.0 9.7 31.0 6.3 8.0 79.2 10.3 7.3 2.3 0.0 4.3 2 1 29.45
6 7 Pascal Siakam TOR PF 3 34.0 28.7 9.7 21.0 46.0 2.7 5.7 47.1 6.7 7.0 95.2 10.7 3.7 0.3 0.3 4.3 1 0 24.56
7 8 Brandon Ingram NO SF 3 35.0 27.3 10.7 20.3 52.5 3.3 6.3 52.6 2.7 3.3 80.0 9.3 4.3 0.3 1.7 2.7 1 0 25.72
8 9 Kristaps Porzingis DAL PF 3 31.0 26.3 8.7 18.7 46.4 3.0 7.3 40.9 6.0 8.3 72.0 5.7 3.3 0.3 2.7 2.3 0 0 27.15
9 10 Russell Westbrook HOU PG 2 33.0 26.0 8.0 17.0 47.1 2.0 5.0 40.0 8.0 10.5 76.2 13.0 10.0 1.5 0.5 3.5 2 1 30.96
...
Explained line by line:

dfs = pd.read_html(url)

This will return a list of dataframes.
Essentially, it parses every <table> tag within the html. When you do this with the given URL, you will see it returns 2 dataframes.
The dataframe in index position 0 has the ranking and player name. If I look at that, I notice the text is the player name followed by the team abbreviation (the last 3 characters in the string). So dfs[0]['Team'] = dfs[0]['Name'].str[-3:] would create a column called 'Team' holding the last 3 characters of each Name string, and dfs[0]['Name'] = dfs[0]['Name'].str[:-3] would keep the string up until the last 3 characters, essentially splitting the Name column into 2 columns.

df = dfs[0].merge(dfs[1], left_index=True, right_index=True)

The last part takes those 2 dataframes in index positions 0 and 1 (stored in the dfs list) and merges them together on the index values.
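As a side note, the str.extract call from the answer can be sketched in isolation (on made-up sample strings): the non-greedy first group stops as early as possible, leaving the trailing run of capital letters for the second group, which makes it more robust than a fixed str[-3:] slice since team abbreviations are not always exactly 3 characters.

```python
import pandas as pd

# Made-up samples of the combined "NameTEAM" strings ESPN renders.
s = pd.Series(['Trae YoungATL', 'Karl-Anthony TownsMIN'])

# (.*?) is non-greedy, so ([A-Z]+)$ captures the trailing capitals.
parts = s.str.extract(r'^(.*?)([A-Z]+)$', expand=True)
```

parts column 0 holds the names and column 1 the team abbreviations.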
Merging different length dataframe in Python/pandas
I have 2 dataframes.
df1:
aa    gg    pm
1     3.3   0.5
1     0.0   4.7
1     9.3   0.2
2     0.3   0.6
2    14.0  91.0
3    13.0  31.0
4    13.1  64.0
5     1.3   0.5
6     3.3   0.5
7    11.1   3.0
7    11.3  24.0
8     3.2   0.0
8     5.3   0.3
8     3.3   0.3
and df2:
aa    gg   st
1     3.3  in
2     0.3  in
5     1.3  in
7    11.1  in
8     5.3  in
I would like to merge these two dataframes on columns aa and gg to get results like:
aa    gg    pm   st
1     3.3   0.5  in
1     0.0   4.7
1     9.3   0.2
2     0.3   0.6  in
2    14.0  91.0
3    13.0  31.0
4    13.1  64.0
5     1.3   0.5  in
6     3.3   0.5
7    11.1   3.0  in
7    11.3  24.0
8     3.2   0.0
8     5.3   0.3  in
8     3.3   0.3
I want to map the st column onto df1 based on columns aa and gg. Please let me know how to do this.
You can multiply the float columns by 1000 (or 10000) and convert to integers, then use these new columns for the join:

df1['gg_int'] = df1['gg'].mul(1000).astype(int)
df2['gg_int'] = df2['gg'].mul(1000).astype(int)

df = df1.merge(df2.drop('gg', axis=1), on=['aa','gg_int'], how='left')
df = df.drop('gg_int', axis=1)
print (df)
    aa    gg    pm   st
0    1   3.3   0.5   in
1    1   0.0   4.7  NaN
2    1   9.3   0.2  NaN
3    2   0.3   0.6   in
4    2  14.0  91.0  NaN
5    3  13.0  31.0  NaN
6    4  13.1  64.0  NaN
7    5   1.3   0.5   in
8    6   3.3   0.5  NaN
9    7  11.1   3.0   in
10   7  11.3  24.0  NaN
11   8   3.2   0.0  NaN
12   8   5.3   0.3   in
13   8   3.3   0.3  NaN
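The reason for the integer detour is that floats like 3.3 may not compare exactly equal after independent computations. A minimal sketch of the same idea on tiny made-up frames, with an extra round() guard (my addition, not in the answer above), since 3.3 * 1000 is 3300.0000000000005 in floating point and a bare astype(int) truncates:

```python
import pandas as pd

# Small frames with a float join key, mirroring the aa/gg columns.
df1 = pd.DataFrame({'aa': [1, 1, 2], 'gg': [3.3, 0.0, 0.3],
                    'pm': [0.5, 4.7, 0.6]})
df2 = pd.DataFrame({'aa': [1, 2], 'gg': [3.3, 0.3], 'st': ['in', 'in']})

# Scale the float key to integers so equality in the merge is exact;
# round() guards against truncation of values like 299.99999....
for d in (df1, df2):
    d['gg_int'] = d['gg'].mul(1000).round().astype(int)

out = (df1.merge(df2.drop(columns='gg'), on=['aa', 'gg_int'], how='left')
          .drop(columns='gg_int'))
```

Rows without a match in df2 get NaN in st, exactly as in the full output above.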
How to get indexes of values in a Pandas DataFrame?
I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
      x  val1  val2   val3
0   0.0  10.0   NaN    NaN
1   0.5  10.5   NaN    NaN
2   1.0  11.0   NaN    NaN
3   1.5  11.5   NaN  11.60
4   2.0  12.0   NaN  12.08
5   2.5  12.5  12.2  12.56
6   3.0  13.0  19.8  13.04
7   3.5  13.5  13.3  13.52
8   4.0  14.0  19.8  14.00
9   4.5  14.5  14.4  14.48
10  5.0  15.0  19.8  14.96
11  5.5  15.5  15.5  15.44
12  6.0  16.0  19.8  15.92
13  6.5  16.5  16.6  16.40
14  7.0  17.0  19.8  18.00
15  7.5  17.5  17.7    NaN
16  8.0  18.0  19.8    NaN
17  8.5  18.5  18.8    NaN
18  9.0  19.0  19.8    NaN
19  9.5  19.5  19.9    NaN
20 10.0  20.0  19.8    NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, I am facing the problem that x and val are not of the same dimension. I feel the way to overcome this would be to select only those x intervals for which val is notnull. But I am not able to do that. I am probably making some very stupid mistakes (I am not a programmer, so please be patient with me :) ).
Here is the code so far (now that I think of it, I may have introduced some mistakes just by leaving in old pieces of code, because I've been messing with it for a while, trying different things):

import pandas as pd
import numpy as np

df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]

for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative = np.diff(V)/np.diff(X)

But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values... Is there an obvious mistake that I am making, or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here, or had I better calculate it manually by finite differences? I'm not finding the numpy documentation very helpful.)
To calculate dVal/dX:

dVal = df.iloc[:, 1:].diff()  # `x` is in column 0.
dX = df['x'].diff()

>>> dVal.apply(lambda series: series / dX)
    val1  val2  val3
0    NaN   NaN   NaN
1      1   NaN   NaN
2      1   NaN   NaN
3      1   NaN   NaN
4      1   NaN  0.96
5      1   NaN  0.96
6      1  15.2  0.96
7      1 -13.0  0.96
8      1  13.0  0.96
9      1 -10.8  0.96
10     1  10.8  0.96
11     1  -8.6  0.96
12     1   8.6  0.96
13     1  -6.4  0.96
14     1   6.4  3.20
15     1  -4.2   NaN
16     1   4.2   NaN
17     1  -2.0   NaN
18     1   2.0   NaN
19     1   0.2   NaN
20     1  -0.2   NaN

We difference all columns except the first one, and then apply a lambda function to each column which divides it by the difference in column x.
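For completeness: the ValueError in the question comes from V holding the boolean notnull test itself (always full length) while X was already masked. If you do want the loop-with-mask approach rather than the vectorised diff, a sketch on made-up miniature data, applying the same mask to both arrays:

```python
import numpy as np
import pandas as pd

# Made-up miniature of the question's frame: x plus value columns
# with NaNs at different positions.
df = pd.DataFrame({'x':    [0.0, 0.5, 1.0, 1.5],
                   'val1': [10.0, 10.5, 11.0, 11.5],
                   'val2': [np.nan, np.nan, 12.2, 12.4]})

derivatives = {}
for col in ['val1', 'val2']:
    # Apply the SAME notnull mask to both the value column and x,
    # so np.diff sees arrays of equal length.
    mask = df[col].notnull()
    derivatives[col] = (np.diff(df.loc[mask, col].to_numpy())
                        / np.diff(df.loc[mask, 'x'].to_numpy()))
```

Note this measures the slope across the gap left by the NaNs, whereas the diff-based answer above yields NaN at those positions; which behaviour you want depends on the data.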
Calculating the accumulated summation of clustered data in data frame in pandas
Given the following data frame:
index  value
1      0.8
2      0.9
3      1.0
4      0.9
5      nan
6      nan
7      nan
8      0.4
9      0.9
10     nan
11     0.8
12     2.0
13     1.4
14     1.9
15     nan
16     nan
17     nan
18     8.4
19     9.9
20     10.0
...
in which the 'value' data is separated into a number of clusters by NaN values: is there any way I can calculate some statistics, such as the accumulated sum or the mean, of the clustered data? For example, I want to calculate the accumulated sum and generate the following data frame:
index  value  cumsum
1      0.8    0.8
2      0.9    1.7
3      1.0    2.7
4      0.9    3.6
5      nan    0
6      nan    0
7      nan    0
8      0.4    0.4
9      0.9    1.3
10     nan    0
11     0.8    0.8
12     2.0    2.8
13     1.4    4.2
14     1.9    6.1
15     nan    0
16     nan    0
17     nan    0
18     8.4    8.4
19     9.9    18.3
20     10.0   28.3
...
Any suggestions?
As a simple extension of the problem: if two clusters of data are close enough (say, only 1 NaN separates them), we consider them one cluster, so that we get the following data frame:
index  value  cumsum
1      0.8    0.8
2      0.9    1.7
3      1.0    2.7
4      0.9    3.6
5      nan    0
6      nan    0
7      nan    0
8      0.4    0.4
9      0.9    1.3
10     nan    1.3
11     0.8    2.1
12     2.0    4.1
13     1.4    5.5
14     1.9    7.4
15     nan    0
16     nan    0
17     nan    0
18     8.4    8.4
19     9.9    18.3
20     10.0   28.3
Thank you for the help!
You can do the first part using the compare-cumsum-groupby pattern. Your "simple extension" isn't quite so simple, but we can still pull it off by finding the parts of value that we want to treat as zero:

n = df["value"].isnull()
clusters = (n != n.shift()).cumsum()
df["cumsum"] = df["value"].groupby(clusters).cumsum().fillna(0)

to_zero = n & (df["value"].groupby(clusters).transform('size') == 1)
tmp_value = df["value"].where(~to_zero, 0)
n2 = tmp_value.isnull()
new_clusters = (n2 != n2.shift()).cumsum()
df["cumsum_skip1"] = tmp_value.groupby(new_clusters).cumsum().fillna(0)

which produces

>>> df
    index  value  cumsum  cumsum_skip1
0       1    0.8     0.8           0.8
1       2    0.9     1.7           1.7
2       3    1.0     2.7           2.7
3       4    0.9     3.6           3.6
4       5    NaN     0.0           0.0
5       6    NaN     0.0           0.0
6       7    NaN     0.0           0.0
7       8    0.4     0.4           0.4
8       9    0.9     1.3           1.3
9      10    NaN     0.0           1.3
10     11    0.8     0.8           2.1
11     12    2.0     2.8           4.1
12     13    1.4     4.2           5.5
13     14    1.9     6.1           7.4
14     15    NaN     0.0           0.0
15     16    NaN     0.0           0.0
16     17    NaN     0.0           0.0
17     18    8.4     8.4           8.4
18     19    9.9    18.3          18.3
19     20   10.0    28.3          28.3
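The core compare-cumsum-groupby trick is easiest to see on a minimal series (values made up): each time the null-ness of the series flips, the comparison with the shifted copy is True, so the running cumsum assigns a fresh cluster id.

```python
import numpy as np
import pandas as pd

# Minimal series with one NaN separating two clusters.
s = pd.Series([0.8, 0.9, np.nan, 0.4, 0.9])

# A new cluster id starts whenever null-ness flips; cumsum then runs
# within each cluster, and fillna(0) zeroes out the NaN rows.
n = s.isnull()
clusters = (n != n.shift()).cumsum()
out = s.groupby(clusters).cumsum().fillna(0)
```

Here clusters is [1, 1, 2, 3, 3], so the second run of values restarts its accumulated sum from scratch.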