pandas mean calculation over a column in a csv - python

I have some data in a csv file as shown below (only partial data is shown here).
SourceID local_date local_time Vge BSs PC hour Type
7208 8/01/2015 11:00:19 15.4 87 +BC_MSG 11 MAIN
11060 8/01/2015 11:01:56 14.9 67 +AB_MSG 11 MAIN
3737 8/01/2015 11:02:09 15.4 88 +AB_MSG 11 MAIN
9683 8/01/2015 11:07:19 14.9 69 +AB_MSG 11 MAIN
9276 8/01/2015 11:07:52 15.4 88 +AB_MSG 11 MAIN
7754 8/01/2015 11:09:26 14.7 62 +AF_MSG 11 MAIN
11111 8/01/2015 11:10:06 15.2 80 +AF_MSG 11 MAIN
9276 8/01/2015 11:10:52 15.4 88 +AB_MSG 11 MAIN
11111 8/01/2015 11:12:56 15.2 80 +AB_MSG 11 MAIN
6148 8/01/2015 11:15:29 15 70 +AB_MSG 11 MAIN
11111 8/01/2015 11:15:56 15.2 80 +AB_MSG 11 MAIN
9866 8/01/2015 11:16:28 4.102 80 +SUB_MSG 11 SUB
9866 8/01/2015 11:16:38 15.1 78 +AH_MSG 11 MAIN
9866 8/01/2015 11:16:38 4.086 78 +SUB_MSG 11 SUB
20729 8/01/2015 11:23:21 11.6 82 +AB_MSG 11 MAIN
9276 8/01/2015 11:25:52 15.4 88 +AB_MSG 11 MAIN
11111 8/01/2015 11:34:16 15.2 80 +AF_MSG 11 MAIN
20190 8/01/2015 11:36:09 11.2 55 +AF_MSG 11 MAIN
7208 8/01/2015 11:37:09 15.3 85 +AB_MSG 11 MAIN
7208 8/01/2015 11:38:39 15.3 86 +AB_MSG 11 MAIN
7754 8/01/2015 11:39:16 14.7 61 +AB_MSG 11 MAIN
8968 8/01/2015 11:39:39 15.5 91 +AB_MSG 11 MAIN
3737 8/01/2015 11:41:09 15.4 88 +AB_MSG 11 MAIN
9683 8/01/2015 11:41:39 14.9 69 +AF_MSG 11 MAIN
20729 8/01/2015 11:44:36 11.6 81 +AB_MSG 11 MAIN
9704 8/01/2015 11:45:20 14.9 68 +AF_MSG 11 MAIN
11111 8/01/2015 11:46:06 4.111 87 +SUB_MSG 11 PAN
I have the following Python program that uses pandas to process this input:
import sys
import csv
import operator
import os
from glob import glob
import fileinput
from relativeDates import *
import datetime
import math
import pprint
import numpy as np
import pandas as pd
from io import StringIO
COLLECTION = 'NEW'
BATTERY = r'C:\MyFolder\Analysis\{}'.format(COLLECTION)
INPUT_FILE = BATTERY + r'\in.csv'    # the original used an undefined name "Pandas" here
OUTPUT_FILE = BATTERY + r'\out.csv'
with open(INPUT_FILE) as fin:
    df = pd.read_csv(INPUT_FILE,
                     usecols=["SourceID", "local_date", "local_time", "Vge", 'BSs', 'PC'],
                     header=0)

#df.set_index(['SourceID','local_date','local_time','Vge','BSs','PC'],inplace=True)
df.drop_duplicates(inplace=True)
#df.reset_index(inplace=True)

hour_list = []
gb = df['local_time'].groupby(df['local_date'])
for i in list(gb)[0][1]:
    hour_list.append(i.split(':')[0])
for j in list(gb)[1][1]:
    hour_list.append(str(int(j.split(':')[0]) + 24))
df['hour'] = pd.Series(hour_list, index=df.index)
df.set_index(['SourceID', 'local_date', 'local_time', 'Vge'], inplace=True)

#gb = df['hour'].groupby(df['PC'])
#print(list(gb))
gb = df['PC']
class_list = []
for msg in df['PC']:
    if 'SUB' in msg:
        class_list.append('SUB')
    else:
        class_list.append('MAIN')
df['Type'] = pd.Series(class_list, index=df.index)

print(df.groupby(['hour', 'Type'])['BSs'].aggregate(np.mean))
gb = df['Type'].groupby(df['hour'])
#print(list(gb))
#print(list(df.groupby(['hour','Type']).count()))
df.to_csv(OUTPUT_FILE)
I want to get an average of the BSs field over time. This is what I am attempting with print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) above.
However, a few things need to be considered.
Vge values can be classified into 2 types based on the Type field.
The number of Vge values we get can vary widely from hour to hour.
The whole data set covers 24 hours.
The Vge values can be received from a number of SourceIDs.
The Vge values can vary a little between SourceIDs but should be somewhat similar during the same time interval (same hour).
In such a situation a simple mean, as in print(df.groupby(['hour','Type'])['BSs'].aggregate(np.mean)) above, won't be sufficient, because the number of samples differs between time periods (hours).
What function should be used in such a situation?
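As a starting point, one count-aware option (a sketch on a tiny made-up frame, not the data above) is to aggregate both the mean and the sample count per (hour, Type) group, then average the hourly means so every hour contributes equally no matter how many samples fell into it:

```python
import pandas as pd

# Made-up miniature of the data: two hours, two message types
df = pd.DataFrame({
    'hour': ['11', '11', '11', '12', '12'],
    'Type': ['MAIN', 'MAIN', 'SUB', 'MAIN', 'MAIN'],
    'BSs':  [87, 67, 80, 55, 85],
})

# Per-(hour, Type) mean together with the sample count behind it,
# so under-sampled hours are visible
per_hour = df.groupby(['hour', 'Type'])['BSs'].agg(['mean', 'count'])
print(per_hour)

# "Mean of hourly means": each hour weighs the same regardless of
# how many samples it contributed
overall = per_hour['mean'].groupby(level='Type').mean()
print(overall)
```

Whether the unweighted mean of hourly means is the right estimator depends on what the average should represent; exposing the per-group counts at least makes the imbalance explicit.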

Related

How to iterate pandas Dataframe month-wise to satisfy demand over time

Suppose I have a dataframe df
pd demand mon1 mon2 mon3
abc1 137 46 37 31
abc2 138 33 37 50
abc3 120 38 47 46
abc4 149 39 30 30
abc5 129 33 42 42
abc6 112 30 45 43
abc7 129 43 33 45
I want to satisfy the demand of each pd month-wise. I am generating some random numbers which indicate satisfied demand. For example, for pd abc1, demand is 137; say I have produced 42 units for mon1, but mon1 demand is 46. Hence the revised dataframe would be
pd demand mon2 mon3
abc1 137 - 42 = 95 37 + 4 (unsatisfied demand from the previous month) 31
Then it will run for mon2, and so on. In this way, I would like to capture how much of each pd's demand is satisfied (excess or unsatisfied).
My try:
import pandas as pd
import random

mon = ['mon1', 'mon2', 'mon3']
for i in df['pd'].values.tolist():
    t = df.loc[df['pd'] == i, :]
    for m in t.columns[2:]:
        y = t[m].iloc[0]
        n = random.randint(20, 70)
        t['demand'] = t['demand'].iloc[0] - n
I can't quite work out the logic.
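One way to sketch the carry-over logic (a hypothetical implementation on made-up data, with the design choice that production beyond the month's need is not banked, only shortfalls roll forward): walk the month columns in order, subtract what was satisfied from the total demand, and add any shortfall to the next month's need.

```python
import pandas as pd
import random

df = pd.DataFrame({
    'pd': ['abc1', 'abc2'],
    'demand': [137, 138],
    'mon1': [46, 33],
    'mon2': [37, 37],
    'mon3': [31, 50],
})

random.seed(0)  # reproducible for the sketch
months = ['mon1', 'mon2', 'mon3']

for idx, row in df.iterrows():
    carry = 0  # unsatisfied demand rolled over from the previous month
    for m in months:
        need = row[m] + carry              # this month's demand plus backlog
        produced = random.randint(20, 70)  # randomly "produced" units
        satisfied = min(produced, need)    # surplus production is discarded
        carry = need - satisfied           # shortfall rolls into next month
        df.loc[idx, 'demand'] -= satisfied
    df.loc[idx, 'unsatisfied'] = carry     # backlog left after the last month

print(df[['pd', 'demand', 'unsatisfied']])
```

The invariant is that the final demand equals the initial demand minus everything satisfied, i.e. minus (sum of monthly demands minus the final backlog).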

Replacing different numbers of whitespace with 2 spaces Python

I am opening a binary file to write to a txt file. I need to remove some lines of the code before writing it to the text so I do this with the following code:
with open(rawfile) as f, open(outfile, 'w') as f2:
    for x in f:
        if ':' not in x and 'Station' not in x and '--' not in x and 'hPa' not in x:
            f2.write(x.strip() + '\n')
This produces several columns of data like:
Header Header Header Header
Data Data Data Data
Data Data Data Data
As you can see, there is a varying number of spaces between the strings in each column/row. While keeping the columns and rows intact, all I need to do is replace each varying run of spaces with exactly 2 spaces between all data strings. I tried adding another for statement but could not figure out how to keep it all in one file and have it overwrite the file every time.
EDIT WITH DATA SNIPPET:
PRES HGHT TEMP DWPT FRPT RELH RELI MIXR DRCT SKNT
1002.0 53 32.4 23.4 23.4 59 59 18.47 345 10
1000.0 65 31.0 19.0 19.0 49 49 14.03 340 10
999.0 74 30.4 19.4 19.4 52 52 14.41 339 10
996.0 101 30.1 19.5 19.5 53 53 14.54 335 10
972.0 318 28.1 20.3 20.3 63 63 15.64 355 10
959.0 437 26.9 20.7 20.7 69 69 16.29 340 11
932.0 692 24.5 21.6 21.6 84 84 17.74 0 5
931.0 1 5
925.0 10 5
888.0 12.88 95 13
Additionally, I cannot have the data become bunched up in those large blank spaces. Data needs to remain under its associated header (this is a text file and not a dataframe)
You could do:
header_len = 0
for x in f:
    if ':' not in x and 'Station' not in x and '--' not in x and 'hPa' not in x:
        line = x.strip() + '\n'
        line_parts = line.split(' ')
        if header_len == 0:
            header_len = len(line_parts[0])
        elif len(line_parts[0]) < header_len:
            line_parts = [lp + ' ' * (header_len - len(lp)) for lp in line_parts]
        while '' in line_parts:
            line_parts.remove('')
        f2.write('  '.join(line_parts))
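An alternative sketch (not from the answer above, and assuming every value stays within the column span of its header token): slice each line at the offsets where the header tokens begin. A missing value then becomes an empty field instead of letting later fields slide left under the wrong header.

```python
import re

def split_by_header(lines):
    """Split fixed-width rows at the header's column offsets, so a
    missing value yields an empty field rather than shifting later
    columns out from under their headers."""
    header = lines[0]
    starts = [m.start() for m in re.finditer(r'\S+', header)]
    bounds = list(zip(starts, starts[1:] + [None]))
    return [[line[a:b].strip() for a, b in bounds] for line in lines]

# Made-up miniature of the sounding data, with HGHT missing on one row
rows = split_by_header([
    "PRES    HGHT  TEMP",
    "1002.0    53  32.4",
    "931.0         32.0",
])
for r in rows:
    print('  '.join(r))
```

Joining with two spaces then gives the requested separator while the empty field still records that a value was absent.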

Webscraping - Iterate through a table row with active/inactive status

I want to web-scrape the information on:
https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d
There is a main table with a User column. When you click on a user, another table appears beside it showing the team that user entered in the contest. I want to extract the team of all the users, so I need to click through every user and extract the information in the second table. Here is my code to extract the team of the first user:
from selenium import webdriver
import csv
from selenium.webdriver.support.ui import Select
from datetime import date, timedelta
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chromedriver = "C:/Users/Michel/Desktop/python/package/chromedriver_win32/chromedriver.exe"
driver = webdriver.Chrome(chromedriver)
DFSteam = []
driver.get("https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d")
Team1 = driver.find_element_by_css_selector("table.ant-table-fixed")
print(Team1.text)
driver.close()
However, I am not able to iterate through the different users. I noticed that when I click on a user, the tr class of that row switches from inactive to active in the page source, but I do not know how to use that. Moreover, I would like to store the extracted teams in a data frame. I am not sure whether it is better to do that at the same time or afterwards.
The data frame would look like this:
RANK(team) / C / C / W / W / W / D / D /G/ UTIL/ TOTAL($) / Total Points
1 / Mark Scheifel/ Mickael Backlund/ Artemi Panarin / Nick Foligno / Michael Frolik / Mark Giordano / Zach Werenski / Connor Hellebuyck / Brandon Tanev / 50 000 / 54.60
You have the right idea. It's just a matter of finding the username element to click on, then grabbing the lineup table and reformatting it to combine everything into one results dataframe.
The user name text is tagged with <a>. You just need to find the <a> tag that matches the user name.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
url = 'https://rotogrinders.com/resultsdb/date/2019-01-13/sport/4/slate/5c3c66edb1699a43c0d7bba7/contest/5c3c66f2b1699a43c0d7bd0d'
# Open Browser and go to site
driver = webdriver.Chrome("C:/chromedriver_win32/chromedriver.exe")
driver.get(url)
# Waits until tables are loaded and has text. Timeouts after 60 seconds
WebDriverWait(driver, 60).until(ec.presence_of_element_located((By.XPATH, './/tbody//tr//td//span//a[text() != ""]')))
# Get tables to get the user names
tables = pd.read_html(driver.page_source)
users_df = tables[0][['Rank','User']]
users_df['User'] = users_df['User'].str.replace(' Member', '')
# Initialize results dataframe and iterate through users
results = pd.DataFrame()
for i, row in users_df.iterrows():
    rank = row['Rank']
    user = row['User']
    # Find the user name and click on the name
    user_link = driver.find_elements(By.XPATH, "//a[text()='%s']" % (user))[0]
    user_link.click()
    # Get the lineup table after clicking on the user name
    tables = pd.read_html(driver.page_source)
    lineup = tables[1]
    #print (user)
    #print (lineup)
    # Restructure to put into results dataframe
    lineup.loc[9, 'Name'] = lineup.iloc[9]['Salary']
    lineup.loc[10, 'Name'] = lineup.iloc[9]['Pts']
    temp_df = pd.DataFrame(lineup['Name'].values.reshape(-1, 11),
                           columns=lineup['Pos'].iloc[:9].tolist() + ['Total_$', 'Total_Pts'])
    temp_df.insert(loc=0, column='User', value=user)
    temp_df.insert(loc=0, column='Rank', value=rank)
    results = results.append(temp_df)

results = results.reset_index(drop=True)
driver.close()
Output:
print (results)
Rank User ... Total_$ Total_Pts
0 1 Canadaman101 ... $50,000.00 54.6
1 2 MayhemLikeMe27 ... $50,000.00 53.9
2 2 gunslinger58 ... $50,000.00 53.9
3 4 oilkings ... $48,600.00 53.6
4 5 TTB19 ... $50,000.00 53.4
5 6 Adamjloder ... $49,800.00 53.1
6 7 DollarBillW ... $49,900.00 52.6
7 8 Biglarry696 ... $49,900.00 52.4
8 8 tical1994 ... $49,900.00 52.4
9 8 rollem02 ... $49,900.00 52.4
10 8 kchoban ... $50,000.00 52.4
11 8 TBirdSCIL ... $49,900.00 52.4
12 13 manny716 ... $49,900.00 52.1
13 14 JayKooks ... $50,000.00 51.9
14 15 Cambie19 ... $49,900.00 51.4
15 16 mjh6588 ... $50,000.00 51.1
16 16 shanefriesen ... $50,000.00 51.1
17 16 mnfish42 ... $50,000.00 51.1
18 19 Pugsly55 ... $49,900.00 50.9
19 19 volpez7 ... $50,000.00 50.9
20 19 Scherr47 ... $49,900.00 50.9
21 19 Testosterown ... $50,000.00 50.9
22 23 markm22 ... $49,700.00 50.6
23 23 foreveryoung12 ... $49,800.00 50.6
24 23 STP_Picks ... $49,900.00 50.6
25 26 jibbinghippo ... $49,800.00 50.4
26 26 loumister35 ... $49,900.00 50.4
27 26 creels3 ... $50,000.00 50.4
28 26 JayKooks ... $50,000.00 51.9
29 26 mmeiselman731 ... $49,900.00 50.4
30 26 volpez7 ... $50,000.00 50.9
31 26 tommienation1 ... $49,900.00 50.4
32 26 jibbinghippo ... $49,800.00 50.4
33 26 Testosterown ... $50,000.00 50.9
34 35 nut07 ... $50,000.00 49.9
35 35 volpez7 ... $50,000.00 50.9
36 35 durfdurf ... $50,000.00 49.9
37 35 chupacabra21 ... $50,000.00 49.9
38 39 Mbermes01 ... $50,000.00 49.6
39 40 suerte41 ... $50,000.00 49.4
40 40 spliksskins77 ... $50,000.00 49.4
41 42 Andrewskoff ... $49,600.00 49.1
42 42 Alky14 ... $49,800.00 49.1
43 42 bretned ... $50,000.00 49.1
44 42 bretned ... $50,000.00 49.1
45 42 gehrig38 ... $49,700.00 49.1
46 42 d-train_91 ... $49,500.00 49.1
47 42 DiamondDallas ... $50,000.00 49.1
48 49 jdmre ... $50,000.00 48.9
49 49 Devosty ... $50,000.00 48.9
[50 rows x 13 columns]
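The non-obvious step in the answer is the restructuring: the lineup table arrives as one row per position, and `values.reshape(-1, 11)` flattens its `Name` column into a single wide row. A minimal sketch of that trick with a made-up 4-row lineup (real lineups have 11 rows):

```python
import pandas as pd

# Hypothetical lineup: 2 position rows plus the Total_$ / Total_Pts rows,
# already moved into the Name column as the answer does
lineup = pd.DataFrame({
    'Pos':  ['C', 'W', None, None],
    'Name': ['Mark Scheifele', 'Artemi Panarin', '$50,000.00', '54.6'],
})

# Flatten the Name column into one wide row; positions become columns
wide = pd.DataFrame(lineup['Name'].values.reshape(-1, 4),
                    columns=lineup['Pos'].iloc[:2].tolist() + ['Total_$', 'Total_Pts'])
print(wide)
```

The reshape works only because every lineup has the same number of rows in the same position order, which the site's table guarantees.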

DataFrame max() not return max

Real beginner question here, but it is so simple, I'm genuinely stumped. Python/DataFrame newbie.
I've loaded a DataFrame from a Google Sheet, however any graphing or attempts at calculations are generating bogus results. Loading code:
# Setup
!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Linear Regression - Brain vs. Body Predictor').worksheet("Raw Data")
rows = worksheet.get_all_values()
# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)
This seems to work fine, and the data looks to be correctly loaded when I print out the DataFrame, but running max() returns obviously false results. For example:
print(df[0])
print(df[0].max())
Will output:
0 3.385
1 0.48
2 1.35
3 465
4 36.33
5 27.66
6 14.83
7 1.04
8 4.19
9 0.425
10 0.101
11 0.92
12 1
13 0.005
14 0.06
15 3.5
16 2
17 1.7
18 2547
19 0.023
20 187.1
21 521
22 0.785
23 10
24 3.3
25 0.2
26 1.41
27 529
28 207
29 85
...
32 6654
33 3.5
34 6.8
35 35
36 4.05
37 0.12
38 0.023
39 0.01
40 1.4
41 250
42 2.5
43 55.5
44 100
45 52.16
46 10.55
47 0.55
48 60
49 3.6
50 4.288
51 0.28
52 0.075
53 0.122
54 0.048
55 192
56 3
57 160
58 0.9
59 1.62
60 0.104
61 4.235
Name: 0, Length: 62, dtype: object
Max: 85
Obviously, the maximum value is way out -- it should be 6654, not 85.
What on earth am I doing wrong?
First StackOverflow post, so thanks in advance.
If you check the end of your print() output, you'll see dtype: object. You'll also notice your pandas Series mixes "int"-looking values with "float"-looking values (e.g. 6654 and 3.5 in the same Series).
These are good hints that you have a Series of strings, and that max here is comparing strings. You want a Series of numbers (specifically floats) compared numerically instead.
Check the following reproducible example:
>>> df = pd.DataFrame({'col': ['0.02', '9', '85']}, dtype=object)
>>> df.col.max()
'9'
You can check that because
>>> '9' > '85'
True
You want these values to be considered floats instead. Use pd.to_numeric
>>> df['col'] = pd.to_numeric(df.col)
>>> df.col.max()
85
For more on str and int comparison, check this question
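Applied to the Google Sheets case above, the whole frame can be converted at once (a sketch; it assumes the first row of `rows` is the header row, which `get_all_values` returns along with the data, and uses hypothetical column names):

```python
import pandas as pd

# Simulated gspread get_all_values() output: header row plus string cells
rows = [['brain', 'body'],
        ['3.385', '44.5'],
        ['6654', '5712']]

# Promote the first row to column names, then convert every column;
# cells that aren't numeric become NaN instead of raising
df = pd.DataFrame(rows[1:], columns=rows[0])
df = df.apply(pd.to_numeric, errors='coerce')
print(df['brain'].max())
```

After the conversion, max() compares floats and returns the expected 6654 rather than a lexicographically largest string.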

Debugging a print DataFrame issue in Pandas

How do I debug a problem with printing a Pandas DataFrame? I call this function and then print the output (which is a Pandas DataFrame).
n=ion_tab(y_ion,cycles,t,pH)
print(n)
The last part of the output looks like this:
58 O2 1.784306e-35 4 86 7.3
60 HCO3- 5.751170e+02 5 86 7.3
61 Ca+2 1.825748e+02 5 86 7.3
62 CO2 3.928413e+01 5 86 7.3
63 CaHCO3+ 3.755015e+01 5 86 7.3
64 CaCO3 4.616840e+00 5 86 7.3
65 SO4-2 1.393365e+00 5 86 7.3
66 CO3-2 8.243118e-01 5 86 7.3
67 CaSO4 7.363058e-01 5 86 7.3
... ... ... ... ...
[65 rows x 5 columns]
But if I do an n.tail() command, I see the missing data that ... seems to suggest.
print(n.tail())
Species ppm as ion Cycles Temp F pH
68 OH- 5.516061e-03 5 86 7.3
69 CaOH+ 6.097815e-04 5 86 7.3
70 HSO4- 5.395493e-06 5 86 7.3
71 CaHSO4+ 2.632098e-07 5 86 7.3
73 O2 1.783007e-35 5 86 7.3
[5 rows x 5 columns]
If I count the number of rows showing up on the screen, I get 60. If I add the 5 extra that show up with n.tail(), I get the expected 65 rows. Is there some limit in print that allows only 60 rows? What's causing the ... at the end of my DataFrame?
Initially I thought something in the ion_tab function was limiting the printing. But once I saw the missing data appear in the n.tail() output, I got confused.
Any help in debugging this would be appreciated.
Pandas limits the number of rows printed by default. You can change that with pd.set_option
In [4]: pd.get_option('display.max_rows')
Out[4]: 60
In [5]: pd.set_option('display.max_rows', 100)
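If the higher limit is only wanted for one print, `pd.option_context` scopes the override to a block instead of changing the global setting (a sketch on a made-up frame):

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

# Inside the block the 60-row default is lifted entirely (None = no limit),
# so no "..." truncation marker appears; outside, the default applies again
with pd.option_context('display.max_rows', None):
    print(df)
```

Setting the option to None prints every row; a specific number such as 100 raises the threshold at which pandas starts truncating.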
