I have this data file and I have to find the 3 largest numbers it contains
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
I have written the following code, but it only searches the first row of numbers instead of the entire list. Can anyone help me find the error?
def three_highest_temps(f):
    file = open(f, "r")
    largest = 0
    second_largest = 0
    third_largest = 0
    temp = []
    for line in file:
        temps = line.split()
        for i in temps:
            if i > largest:
                largest = i
            elif largest > i > second_largest:
                second_largest = i
            elif second_largest > i > third_largest:
                third_largest = i
        return largest, second_largest, third_largest

print(three_highest_temps("data5.txt"))
Your data contains floats, not integers.
You can use sorted:
>>> data = '''24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
... 16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
... 10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
... 21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
... 19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
... 14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
... 8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
... 11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
... 13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
... 22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
... 17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
... 20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
... '''
>>> sorted(map(float, data.split()), reverse=True)[:3]
[74.0, 73.7, 73.7]
If you want integer results:
>>> temps = sorted(map(float, data.split()), reverse=True)[:3]
>>> list(map(int, temps))
[74, 73, 73]
You only get the max elements for the first line because you return at the end of the first iteration. You should de-indent the return statement.
Sorting the data and picking the first 3 elements runs in O(n log n):

data = [float(v) for line in file for v in line.split()]
sorted(data, reverse=True)[:3]
It is perfectly fine for 144 elements.
You can also get the answer in linear time using heapq:
import heapq
heapq.nlargest(3, data)
Your return statement is inside the for loop. Once return is reached, the function terminates, so the loop never gets into a second iteration. Move the return outside the loop by reducing indentation.
for line in file:
    temps = line.split()
    for i in temps:
        if i > largest:
            largest = i
        elif largest > i > second_largest:
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
In addition, your comparisons won't work, because line.split() returns a list of strings, not floats. (As has been pointed out, your data consists of floats, not ints; I'm assuming the task is to find the largest floats.) So let's convert the strings using float():
Your code still won't be correct, though, because when you find a new largest value, you completely discard the old one. Instead you should now consider it the second largest known value. Same rule applies for second to third largest.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif largest > i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif second_largest > i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
Now there is one last issue:
You overlook cases where i is identical to one of the largest values. In such a case i > largest would be false, but so would largest > i. You could change either of these comparisons to >= to fix this.
Instead, let us simplify the if clauses by noting that the elif conditions are only evaluated after all previous conditions have already been found false. When we reach the first elif, we already know that i cannot be larger than largest, so it suffices to compare it to second_largest. The same goes for the second elif.
for line in file:
    temps = line.split()
    for temp_string in temps:
        i = float(temp_string)
        if i > largest:
            third_largest = second_largest
            second_largest = largest
            largest = i
        elif i > second_largest:
            third_largest = second_largest
            second_largest = i
        elif i > third_largest:
            third_largest = i
return largest, second_largest, third_largest
This way we avoid accidentally filtering out the i == largest and i == second_largest edge cases.
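Putting the pieces together, the complete function would look like this (a sketch assembling the fixes above; the with block for file handling is my addition):

def three_highest_temps(f):
    largest = second_largest = third_largest = 0
    with open(f, "r") as file:
        for line in file:
            for temp_string in line.split():
                i = float(temp_string)
                if i > largest:
                    third_largest = second_largest
                    second_largest = largest
                    largest = i
                elif i > second_largest:
                    third_largest = second_largest
                    second_largest = i
                elif i > third_largest:
                    third_largest = i
    return largest, second_largest, third_largest

print(three_highest_temps("data5.txt"))  # (74.0, 73.7, 73.7)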
Since you are dealing with a file, as a fast and numpythonic approach you can load the file as an array, sort the array, and take the last 3 items:
import numpy as np

with open('filename') as f:
    array = np.genfromtxt(f).ravel()
array.sort()
print(array[-3:])
[ 73.7 73.7 74. ]
Related
I have a dataframe that looks like this:
Temp
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
... ...
1981-12-27 15.5
1981-12-28 13.3
1981-12-29 15.6
1981-12-30 15.2
1981-12-31 17.4
365 rows × 1 columns
And I want to transform it so that it looks like this:
1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
0 20.7 17.0 18.4 19.5 13.3 12.9 12.3 15.3 14.3 14.8
1 17.9 15.0 15.0 17.1 15.2 13.8 13.8 14.3 17.4 13.3
2 18.8 13.5 10.9 17.1 13.1 10.6 15.3 13.5 18.5 15.6
3 14.6 15.2 11.4 12.0 12.7 12.6 15.6 15.0 16.8 14.5
4 15.8 13.0 14.8 11.0 14.6 13.7 16.2 13.6 11.5 14.3
... ... ... ... ... ... ... ... ... ... ...
360 15.5 15.3 13.9 12.2 11.5 14.6 16.2 9.5 13.3 14.0
361 13.3 16.3 11.1 12.0 10.8 14.2 14.2 12.9 11.7 13.6
362 15.6 15.8 16.1 12.6 12.0 13.2 14.3 12.9 10.4 13.5
363 15.2 17.7 20.4 16.0 16.3 11.7 13.3 14.8 14.4 15.7
364 17.4 16.3 18.0 16.4 14.4 17.2 16.7 14.1 12.7 13.0
My attempt:
groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values
Question:
The above code gives me my desired output, but is there a more efficient way of transforming this?
I can't post the whole data because there are 3650 rows in the dataframe, so you can download the csv file (60.6 kB) for testing from here
Try grabbing the year and dayofyear from the index then pivoting:
import pandas as pd
import numpy as np

# Create random data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1982-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])

# Get year and day of year
df['year'] = df.index.year
df['day'] = df.index.dayofyear

# Pivot
p = df.pivot(index='day', columns='year', values='Temp')
print(p)
p:
year 1981 1982
day
1 38 85
2 51 70
3 76 61
4 71 47
5 44 76
.. ... ...
361 23 22
362 42 64
363 84 22
364 26 56
365 67 73
Run-Time via Timeit
import timeit

setup = '''
import pandas as pd
import numpy as np
# Create random data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1983-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])'''

pivot = '''
df['year'] = df.index.year
df['day'] = df.index.dayofyear
p = df.pivot(index='day', columns='year', values='Temp')'''

groupby_for = '''
groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values'''

if __name__ == '__main__':
    print("Pivot")
    print(timeit.timeit(setup=setup, stmt=pivot, number=1000))
    print("Groupby For")
    print(timeit.timeit(setup=setup, stmt=groupby_for, number=1000))
Pivot
1.598973
Groupby For
2.3967995999999996
Additional note: the groupby-for option will not work across leap years, as it cannot handle 1984 being 366 days long instead of 365. Pivot will work regardless.
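To illustrate the leap-year point (a minimal sketch; the date range is illustrative): dayofyear simply runs to 366 in 1984, so the pivot just gains a row 366 holding NaN for non-leap years, whereas the groupby loop tries to assign a 366-value column alongside 365-value ones and fails.

import pandas as pd

dr = pd.date_range("1984-01-01", "1984-12-31")  # 1984 is a leap year
print(len(dr))             # 366
print(dr.dayofyear.max())  # 366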
I am running into an issue when using BeautifulSoup to scrape data off of www.basketball-reference.com. I've used BeautifulSoup before on Bballreference before so I am a little stumped as to what is happening (granted I am a pretty huge noob so please bear with me).
I am trying to scrape team season stats off of https://www.basketball-reference.com/leagues/NBA_2020.html and am running into troubles from the very start:
from bs4 import BeautifulSoup
import requests
web_response = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html').text
soup = BeautifulSoup(web_response, 'lxml')
table = soup.find('table', id='team-stats-per_game')
print(table)
This prints None, showing that the table in question was not found, even though I can clearly locate that tag when inspecting the web page. Okay... no biggie so far (usually these errors are on my end), so I instead just print out the whole soup:
soup = BeautifulSoup(web_response, 'lxml')
print(soup)
I copy and paste that into https://codebeautify.org/htmlviewer/ to get a better view than from the terminal, and I see that it does not look how I would expect it to. Essentially the meta tags are fine, but everything else appears to have lost its opening and closing tags, turning my soup into an actual soup...
Again, no biggie (still pretty sure it is something that I am doing), so I go and grab the html from a simple blog site, print it, and paste it into codebeautify and lo and behold it looks normal. Now I have a suspicion that something is occurring on basketball-reference's side that is obscuring my ability to even grab the html.
My question is this; what exactly is going on here? I am assuming it's an 80% chance it is still me but the 20% is not so sure at this point. Can someone point out what I am doing wrong or how to grab the html?
The data is stored within the page, but inside an HTML comment.
To parse it, you can do for example:
import requests
from bs4 import BeautifulSoup, Comment

web_response = requests.get('https://www.basketball-reference.com/leagues/NBA_2020.html').text
soup = BeautifulSoup(web_response, 'lxml')

# find the comment section where the data is stored
for idx, c in enumerate(soup.select_one('div#all_team-stats-per_game').contents):
    if isinstance(c, Comment):
        break

# load the data from the comment:
soup2 = BeautifulSoup(soup.select_one('div#all_team-stats-per_game').contents[idx], 'html.parser')

# print the data:
for tr in soup2.select('tr:has(td)'):
    tds = tr.select('td')
    for td in tds:
        print(td.get_text(strip=True), end='\t')
    print()
Prints:
Dallas Mavericks 67 241.5 41.6 90.0 .462 15.3 41.5 .369 26.3 48.5 .542 17.9 23.1 .773 10.6 36.4 47.0 24.5 6.3 5.0 12.8 19.0 116.4
Milwaukee Bucks* 65 240.8 43.5 91.2 .477 13.7 38.6 .356 29.8 52.6 .567 17.8 24.0 .742 9.5 42.2 51.7 25.9 7.4 6.0 14.9 19.2 118.6
Houston Rockets 64 241.2 41.1 90.7 .454 15.4 44.3 .348 25.7 46.4 .554 20.5 26.0 .787 10.4 34.6 44.9 21.5 8.5 5.1 14.7 21.6 118.1
Portland Trail Blazers 66 240.8 41.9 90.9 .461 12.6 33.8 .372 29.3 57.1 .513 17.3 21.7 .798 10.1 35.4 45.5 20.2 6.1 6.2 13.0 21.4 113.6
Atlanta Hawks 67 243.0 40.6 90.6 .449 12.0 36.1 .333 28.6 54.5 .525 18.5 23.4 .790 9.9 33.4 43.3 24.0 7.8 5.1 16.2 23.1 111.8
New Orleans Pelicans 64 242.3 42.6 92.2 .462 14.0 37.6 .372 28.6 54.6 .525 16.9 23.2 .729 11.2 35.8 47.0 27.0 7.6 5.1 16.2 21.0 116.2
Los Angeles Clippers 64 241.2 41.6 89.7 .464 12.2 33.2 .366 29.5 56.5 .522 20.8 26.2 .792 11.0 37.0 48.0 23.8 7.1 5.0 14.8 22.0 116.2
Washington Wizards 64 241.2 41.9 91.0 .461 12.3 33.1 .372 29.6 57.9 .511 19.5 24.8 .787 10.1 31.6 41.7 25.3 8.1 4.3 14.1 22.6 115.6
Memphis Grizzlies 65 240.4 42.8 91.0 .470 10.9 31.1 .352 31.8 59.9 .531 16.2 21.3 .761 10.4 36.3 46.7 27.0 8.0 5.6 15.3 20.8 112.6
Phoenix Suns 65 241.2 40.8 87.8 .464 11.2 31.7 .353 29.6 56.1 .527 19.8 24.0 .826 9.8 33.3 43.1 27.2 7.8 4.0 15.1 22.1 112.6
Miami Heat 65 243.5 39.6 84.4 .470 13.4 34.8 .383 26.3 49.6 .530 19.5 25.1 .778 8.5 36.0 44.5 26.0 7.4 4.5 14.9 20.4 112.2
Minnesota Timberwolves 64 243.1 40.4 91.6 .441 13.3 39.7 .336 27.1 52.0 .521 19.1 25.4 .753 10.5 34.3 44.8 23.8 8.7 5.7 15.3 21.4 113.3
Boston Celtics* 64 242.0 41.2 89.6 .459 12.4 34.2 .363 28.8 55.4 .519 18.3 22.8 .801 10.7 35.3 46.0 22.8 8.3 5.6 13.6 21.4 113.0
Toronto Raptors* 64 241.6 40.6 88.5 .458 13.8 37.0 .371 26.8 51.5 .521 18.1 22.6 .800 9.7 35.5 45.2 25.4 8.8 4.9 14.4 21.5 113.0
Los Angeles Lakers* 63 240.8 42.9 88.6 .485 11.2 31.4 .355 31.8 57.1 .556 17.3 23.7 .730 10.6 35.5 46.1 25.9 8.6 6.8 15.1 20.6 114.3
Denver Nuggets 65 242.3 41.8 88.9 .471 10.9 30.4 .358 31.0 58.5 .529 15.9 20.5 .775 10.8 33.5 44.3 26.5 8.1 4.6 13.7 20.0 110.4
San Antonio Spurs 63 242.8 42.0 89.5 .470 10.7 28.7 .371 31.4 60.8 .517 18.4 22.8 .809 8.8 35.6 44.4 24.5 7.2 5.5 12.3 19.2 113.2
Philadelphia 76ers 65 241.2 40.8 87.7 .465 11.4 31.6 .362 29.4 56.1 .523 16.6 22.1 .752 10.4 35.1 45.5 25.9 8.2 5.4 14.2 20.6 109.6
Indiana Pacers 65 241.5 42.2 88.4 .477 10.0 27.5 .363 32.2 60.9 .529 15.1 19.1 .787 8.8 34.0 42.8 25.9 7.2 5.1 13.1 19.6 109.3
Utah Jazz 64 240.4 40.1 84.6 .475 13.2 34.4 .383 27.0 50.2 .537 17.6 22.8 .772 8.8 36.3 45.1 22.2 5.9 4.0 14.9 20.0 111.0
Oklahoma City Thunder 64 241.6 40.3 85.1 .473 10.4 29.3 .355 29.9 55.8 .536 19.8 24.8 .797 8.1 34.6 42.7 21.9 7.6 5.0 13.5 18.8 110.8
Brooklyn Nets 64 243.1 40.0 90.0 .444 12.9 37.9 .340 27.1 52.2 .519 18.0 24.1 .744 10.8 37.6 48.5 24.0 6.5 4.6 15.5 20.7 110.8
Detroit Pistons 66 241.9 39.3 85.7 .459 12.0 32.7 .367 27.3 53.0 .515 16.6 22.4 .743 9.8 32.0 41.7 24.1 7.4 4.5 15.3 19.7 107.2
New York Knicks 66 241.9 40.0 89.3 .447 9.6 28.4 .337 30.4 61.0 .499 16.3 23.5 .694 12.0 34.5 46.5 22.1 7.6 4.7 14.3 22.2 105.8
Sacramento Kings 64 242.3 40.4 87.8 .459 12.6 34.7 .364 27.7 53.2 .522 15.6 20.3 .769 9.6 32.9 42.5 23.4 7.6 4.2 14.4 21.9 109.0
Cleveland Cavaliers 65 241.9 40.3 87.9 .458 11.2 31.8 .351 29.1 56.1 .519 15.1 19.9 .758 10.8 33.4 44.2 23.1 6.9 3.2 16.5 18.3 106.9
Chicago Bulls 65 241.2 39.6 88.6 .447 12.2 35.1 .348 27.4 53.5 .511 15.5 20.5 .755 10.5 31.4 41.9 23.2 10.0 4.1 15.5 21.8 106.8
Orlando Magic 65 240.4 39.2 88.8 .442 10.9 32.0 .341 28.3 56.8 .498 17.0 22.1 .770 10.4 34.2 44.5 24.0 8.4 5.7 12.6 17.6 106.4
Golden State Warriors 65 241.9 38.6 88.2 .438 10.4 31.3 .334 28.2 56.9 .495 18.7 23.2 .803 10.0 32.9 42.8 25.6 8.2 4.6 14.9 20.1 106.3
Charlotte Hornets 65 242.3 37.3 85.9 .434 12.1 34.3 .352 25.2 51.6 .489 16.2 21.6 .748 11.0 31.8 42.8 23.8 6.6 4.1 14.6 18.8 102.9
League Average 65 241.7 40.8 88.8 .460 12.1 33.9 .357 28.7 54.9 .523 17.7 22.9 .771 10.1 34.7 44.9 24.3 7.7 4.9 14.5 20.6 111.4
I want to add three additional columns using pandas and Python. I'm not sure how to add columns based on rows that share the same groupId value.
min_avg: the lowest avg value among rows with the same groupId
max_avg: the highest avg value among rows with the same groupId
group_avg: the average of each row's min_avg and max_avg values
I'm not entirely sure where to begin with this one.
I have this:
avg groupId
0 25.5 1016
1 26.7 1048
2 25.8 1016
3 53.5 1048
4 29.3 1064
5 27.7 1016
and my goal is this:
avg groupId min_avg max_avg group_average
0 25.5 1016 25.5 27.7 26.6
1 26.7 1048 26.3 53.5 39.9
2 25.8 1016 25.5 27.7 26.6
3 53.5 1048 26.3 53.5 39.9
4 29.3 1064 29.3 29.3 29.3
5 27.7 1016 25.5 27.7 26.6
We can do a merge with groupby + describe:
df = df.merge(df.groupby('groupId').avg.describe()[['mean', 'min', 'max']].reset_index(), how='left')
Out[25]:
avg groupId mean min max
0 25.5 1016 26.333333 25.5 27.7
1 26.7 1048 40.100000 26.7 53.5
2 25.8 1016 26.333333 25.5 27.7
3 53.5 1048 40.100000 26.7 53.5
4 29.3 1064 29.300000 29.3 29.3
5 27.7 1016 26.333333 25.5 27.7
The describe method, as used in YOBEN_S's solution, will compute more statistics than required, including count and std. See here.
We can get around this by using the agg method.
df.merge(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')
# output
avg groupId min max mean
0 25.5 1016 25.5 27.7 26.333333
1 26.7 1048 26.7 53.5 40.100000
2 25.8 1016 25.5 27.7 26.333333
3 53.5 1048 26.7 53.5 40.100000
4 29.3 1064 29.3 29.3 29.300000
5 27.7 1016 25.5 27.7 26.333333
Speed Comparison
Approach 1
%%timeit -n 1000
df.merge(df.groupby('groupId').avg.describe()[['mean','min','max']].reset_index(),how='left')
9.6 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Approach 2
%%timeit -n 1000
df.merge(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')
3.42 ms ± 74.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Approach 3
Additionally, we can get a slight speedup by converting df.merge to df.join.
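The join variant presumably looks like this (a sketch; groupby leaves groupId as the aggregated frame's index, which join aligns with via the on= column):

df.join(df.groupby('groupId')['avg'].agg([min, max, 'mean']), on='groupId')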
2.96 ms ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have a data file with data in a specific format and some extra lines to ignore while processing. I need to process the data and calculate a value from it.
Sample Data:
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
Source of Sample File: http://robjhyndman.com/tsdldata/data/cryer2.dat
Note: Here, rows represent the years and columns represent the months.
I am trying to write a function which returns the average temperature of any month from the given url.
I have tried as below:
def avg_temp_march(f):
    march_temps = []
    # read each line of the file and store the values
    # as floats in a list
    for line in f:
        line = str(line, 'ascii')  # now line is a string
        temps = line.split()
        # check that it is not empty
        if temps != []:
            march_temps.append(float(temps[2]))
    # calculate the average and return it
    return sum(march_temps) / len(march_temps)

avg_temp_march("data5.txt")
But I am getting this error on the line line = str(line, 'ascii'):
TypeError: decoding str is not supported
I think there is no need to convert a string to a string.
I tried to fix your code with some modifications:
def avg_temp_march(f):
    # f is a string read from the file
    march_temps = []
    for line in f.split("\n"):
        if line == "":
            continue
        temps = line.split(" ")
        temps = [t for t in temps if t != ""]
        month_index = 2  # column for March
        # check that the line has enough columns
        if len(temps) > month_index:
            try:
                march_temps.append(float(temps[month_index]))
            except ValueError as e:
                print(temps)
                print("Skipping line:", e)
    # calculate the average and return it
    return sum(march_temps) / len(march_temps)
Output:
['Average', 'monthly', 'temperatures', 'in', 'Dubuque,', 'Iowa,']
Skipping line: could not convert string to float: temperatures
['January', '1964', 'through', 'december', '1975,', 'n=144']
Skipping line: could not convert string to float: through
32.475
Based on your original question (before latest edits), I think you can solve your problem in this way.
# from urllib2 import urlopen  # Python 2
from urllib.request import urlopen  # Python 3

def avg_temp_march(url):
    f = urlopen(url).read().decode('ascii')
    data = f.split("\n")[3:]  # ignore the first 3 lines
    data = [line.split() for line in data if line != '']  # ignore the empty lines
    data = [[float(x) for x in line] for line in data]  # convert all numbers to float
    month_index = 2  # 2 for March
    monthly_sum = sum(line[month_index] for line in data)
    monthly_avg = monthly_sum / len(data)
    return monthly_avg

print(avg_temp_march("http://robjhyndman.com/tsdldata/data/cryer2.dat"))
Using pandas, the code becomes a bit shorter:

import calendar
import pandas as pd

df = pd.read_csv('data5.txt', delim_whitespace=True, skiprows=2,
                 names=calendar.month_abbr[1:])
Now for March:
>>> df.Mar.mean()
32.475000000000001
and for all months:
>>> df.mean()
Jan 16.608333
Feb 20.650000
Mar 32.475000
Apr 46.525000
May 58.091667
Jun 67.500000
Jul 71.716667
Aug 69.333333
Sep 61.025000
Oct 50.975000
Nov 36.650000
Dec 23.641667
dtype: float64
I have a text file of temperature data that looks like this:
3438012868.0 0.0 21.7 22.6 22.5 22.5 21.2
3438012875.0 0.0 21.6 22.6 22.5 22.5 21.2
3438012881.9 0.0 21.7 22.5 22.5 22.5 21.2
3438012888.9 0.0 21.6 22.6 22.5 22.5 21.2
3438012895.8 0.0 21.6 22.5 22.6 22.5 21.3
3438012902.8 0.0 21.6 22.5 22.5 22.5 21.2
3438012909.7 0.0 21.6 22.5 22.5 22.5 21.2
3438012916.6 0.0 21.6 22.5 22.5 22.5 21.2
3438012923.6 0.0 21.6 22.6 22.5 22.5 21.2
3438012930.5 0.0 21.6 22.5 22.5 22.5 21.2
3438012937.5 0.0 21.7 22.5 22.5 22.5 21.2
3438012944.5 0.0 21.6 22.5 22.5 22.5 21.3
3438012951.4 0.0 21.6 22.5 22.5 22.5 21.2
3438012958.4 0.0 21.6 22.5 22.5 22.5 21.3
3438012965.3 0.0 21.6 22.6 22.5 22.5 21.2
3438012972.3 0.0 21.6 22.5 22.5 22.5 21.3
3438012979.2 0.0 21.6 22.6 22.5 22.5 21.2
3438012986.1 0.0 21.6 22.5 22.5 22.5 21.3
3438012993.1 0.0 21.6 22.5 22.6 22.5 21.2
3438013000.0 0.0 21.6 0.0 22.5 22.5 21.3
3438013006.9 0.0 21.6 22.6 22.5 22.5 21.2
3438013014.4 0.0 21.6 22.5 22.5 22.5 21.3
3438013021.9 0.0 21.6 22.5 22.5 22.5 21.3
3438013029.9 0.0 21.6 22.5 22.5 22.5 21.2
3438013036.9 0.0 21.6 22.6 22.5 22.5 21.2
3438013044.6 0.0 21.6 22.5 22.5 22.5 21.2
but the entire file is much longer; this is just the first few lines. The first column is a timestamp and the next 6 columns are temperature recordings. I need to write a loop that finds the average of the 6 measurements but ignores measurements of 0.0, because a zero just means the sensor wasn't turned on. Later in the file, the first temperature column does have measurements. Is there a way to write an if statement, or some other way, to average only the non-zero numbers in a list? Right now, I have:
from datetime import datetime, timedelta
import numpy as np

time = []
t1 = []
t2 = []
t3 = []
t4 = []
t5 = []
t6 = []
newdate = []

temps = open('file_path', 'r')
sepfile = temps.read().replace('\n', '').split('\r')
temps.close()

for plotpair in sepfile:
    data = plotpair.split('\t')
    time.append(float(data[0]))
    t1.append(float(data[1]))
    t2.append(float(data[2]))
    t3.append(float(data[3]))
    t4.append(float(data[4]))
    t5.append(float(data[5]))
    t6.append(float(data[6]))

for data_seconds in time:
    date = datetime(1904, 1, 1, 5, 26, 2)
    delta = timedelta(seconds=data_seconds)
    newdate.append(date + delta)

for datapoint in t2, t3, t4, t5, t6:
    temperatures = np.array([t2, t3, t4, t5, t6]).mean(0).tolist()
which only finds the average of the last 5 measurement columns. I'm hoping for a better method that will ignore 0.0s and include the first temperature column when it is non-zero.
Prior questions show you have NumPy installed. So using NumPy, you could set the zeros to NaN and then call np.nanmean to take the mean, ignoring NaNs:
import numpy as np
data = np.genfromtxt('data')
data[data == 0] = np.nan
means = np.nanmean(data[:, 1:], axis=1)
yields
array([ 22.1 , 22.08 , 22.08 , 22.08 , 22.1 , 22.06 , 22.06 ,
22.06 , 22.08 , 22.06 , 22.08 , 22.08 , 22.06 , 22.08 ,
22.08 , 22.08 , 22.08 , 22.08 , 22.08 , 21.975, 22.08 ,
22.08 , 22.08 , 22.06 , 22.08 , 22.06 ])
You can take a truncated/trimmed mean using scipy.stats.tmean; see the sketch below.
Or you can check whether float(data[X]) is equal to 0 before appending it to the corresponding list.
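A minimal sketch of the tmean route (the sample row is a made-up line of readings; an exclusive lower limit of 0 drops the zeros from the mean):

import numpy as np
from scipy import stats

row = [0.0, 21.6, 22.5, 22.5, 22.5, 21.2]  # hypothetical sensor readings
# values at or below 0 are excluded by the (exclusive) lower limit
print(stats.tmean(row, limits=(0, np.inf), inclusive=(False, True)))  # 22.06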
This will work with Python 3:
import csv

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    for time, *temps in csv.reader(infile, delimiter='\t'):
        temps = [float(t) for t in temps if t != '0.0']
        avg = sum(temps) / len(temps)
        writer.writerow([time, avg])
with open('infile') as f1, open('outfile', 'w') as f2:
    for x in f1:
        nums = [float(i) for i in x.strip().split() if i != '0.0']
        avg = sum(nums[1:]) / len(nums[1:])
        f2.write("{}\t{}\n".format(nums[0], avg))