Sum of adjacent rows in the same column in pandas - python

I have a dataframe something like this:
d1 d2 d3 d4
780 37.0 21.4 122840.0
784 38.1 21.4 122860.0
846 38.1 21.4 122880.0
843 38.0 21.5 122900.0
820 36.3 22.9 133220.0
819 36.3 22.9 133240.0
819 36.4 22.9 133260.0
820 36.3 22.9 133280.0
822 36.4 22.9 133300.0
How do I get the sum of consecutive values in the same column, as a new column of the dataframe?
for example:
d1 d2 d3 d4 d5
780 37.0 21.4 122840.0 1564
784 38.1 21.4 122860.0 1630
846 38.1 21.4 122880.0 1689
I want a new column with the sum d1[i] + d1[i+1]. I know about .sum() in pandas, but I can't see how to sum values within the same column.

Your question is not fully clear to me, but I think what you mean to do is:
df['d5'] = df['d1'] + df['d1'].shift(-1)
Now you have to decide what you want to happen for the last element of the series.
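For instance, a minimal sketch (filling the last row with its own value is just one arbitrary way to handle that missing successor):

```python
import pandas as pd

df = pd.DataFrame({'d1': [780, 784, 846, 843]})

# Pair each value with the next one down the same column.
df['d5'] = df['d1'] + df['d1'].shift(-1)

# The last row has no successor, so shift(-1) leaves NaN there;
# here we fill it with the row's own value.
df['d5'] = df['d5'].fillna(df['d1'])
```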

Check with rolling
df['d5'] = df['d1'].rolling(2, min_periods=1).sum()
df
Out[321]:
d1 d2 d3 d4 d5
0 780 37.0 21.4 122840.0 780.0
1 784 38.1 21.4 122860.0 1564.0
2 846 38.1 21.4 122880.0 1630.0
3 843 38.0 21.5 122900.0 1689.0
4 820 36.3 22.9 133220.0 1663.0
5 819 36.3 22.9 133240.0 1639.0
6 819 36.4 22.9 133260.0 1638.0
7 820 36.3 22.9 133280.0 1639.0
8 822 36.4 22.9 133300.0 1642.0

Related

How to fit a Breit-Wigner / Lorentzian to data (scipy.optimize) in Python

Here is the data I used for the fit which does not work:
x_vals = [20.1 20.2 20.3 20.4 20.5 20.6 20.7 20.8 20.9 21. 21.1 21.2 21.3 21.4
21.5 21.6 21.7 21.8 21.9 22. 22.1 22.2 22.3 22.4 22.5 22.6 22.7 22.8
22.9 23. 23.1 23.2 23.3 23.4 23.5 23.6 23.7 23.8 23.9 24. 24.1 24.2
24.3 24.4 24.5 24.6 24.7 24.8 24.9 25. 25.1 25.2 25.3 25.4 25.5 25.6
25.7 25.8 25.9 26. 26.1 26.2 26.3 26.4 26.5 26.6 26.7 26.8 26.9 27.
27.1 27.2 27.3 27.4 27.5 27.6 27.7 27.8 27.9 28. 28.1 28.2 28.3 28.4
28.5 28.6 28.7 28.8 28.9 29. 29.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8
29.9]
y_vals = [1922 1947 1985 2019 2050 1955 2143 2133 2132 2214 2268 2293 2397 2339
2407 2447 2540 2504 2661 2714 2758 2945 3108 3161 3254 3434 3883 3997
4250 4659 4782 5150 5603 5833 6225 6613 6502 6911 6873 6941 6876 6709
6663 6238 5949 5728 5120 4649 4273 3671 3340 2855 2621 2246 1920 1666
1476 1293 1099 1061 982 993 908 905 806 821 744 705 751 701
673 728 662 677 658 615 684 688 679 624 600 622 608 572
626 637 586 567 579 576 572 585 557 536 549 565 509 511
521]
The fit isn't great; it's off by a lot and I am not sure how to fix it. Please let me know if there is a better way to fit this.
def lorentzian(x, a, x0):
    return a / ((x - x0)**2 + a**2) / np.pi

# Obtain xdata and ydata
...
# Initial guess of the parameters (you must find them some way!)
#pguess = [2.6, 24]
# Fit the data
normalization_factor = np.trapz(x_vals, y_vals)  # area under the curve
popt, pcov = curve_fit(lorentzian, x_vals, y_vals/normalization_factor)
# Results
a, x0 = popt[0], popt[1]
plt.plot(x_vals, lorentzian(x_vals, popt[0], popt[1])*normalization_factor,
         color='crimson', label='Fitted function')
plt.plot(x_vals, y_vals, 'o', label='data')
plt.show()
You have the arguments to np.trapz reversed. It should be
normalization_factor = np.trapz(y_vals, x_vals)
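A minimal sketch of the corrected call on synthetic data (model-generated values stand in for the real y_vals, and an explicit initial guess is assumed, since curve_fit otherwise starts from all-ones and may not find the peak):

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, a, x0):
    return a / ((x - x0)**2 + a**2) / np.pi

# Synthetic data generated from the model itself (a=2.6, x0=24),
# standing in for the real y_vals.
x = np.linspace(20, 30, 101)
y = lorentzian(x, 2.6, 24.0)

# np.trapz takes y first, then x (the argument order was the bug).
area = np.trapz(y, x)

# Fit with an explicit initial guess near the peak.
popt, pcov = curve_fit(lorentzian, x, y, p0=[2.0, 23.0])
a_fit, x0_fit = popt
```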

Transforming yearwise data using pandas

I have a dataframe that looks like this:
Temp
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
... ...
1981-12-27 15.5
1981-12-28 13.3
1981-12-29 15.6
1981-12-30 15.2
1981-12-31 17.4
365 rows × 1 columns
And I want to transform it so that it looks like:
1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
0 20.7 17.0 18.4 19.5 13.3 12.9 12.3 15.3 14.3 14.8
1 17.9 15.0 15.0 17.1 15.2 13.8 13.8 14.3 17.4 13.3
2 18.8 13.5 10.9 17.1 13.1 10.6 15.3 13.5 18.5 15.6
3 14.6 15.2 11.4 12.0 12.7 12.6 15.6 15.0 16.8 14.5
4 15.8 13.0 14.8 11.0 14.6 13.7 16.2 13.6 11.5 14.3
... ... ... ... ... ... ... ... ... ... ...
360 15.5 15.3 13.9 12.2 11.5 14.6 16.2 9.5 13.3 14.0
361 13.3 16.3 11.1 12.0 10.8 14.2 14.2 12.9 11.7 13.6
362 15.6 15.8 16.1 12.6 12.0 13.2 14.3 12.9 10.4 13.5
363 15.2 17.7 20.4 16.0 16.3 11.7 13.3 14.8 14.4 15.7
364 17.4 16.3 18.0 16.4 14.4 17.2 16.7 14.1 12.7 13.0
My attempt:
groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values
Question:
The above code gives me my desired output, but is there a more efficient way of doing this transformation?
As I can't post the whole data (there are 3650 rows in the dataframe), you can download the csv file (60.6 kb) for testing from here
Try grabbing the year and dayofyear from the index then pivoting:
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1982-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])
# Get Year and Day of Year
df['year'] = df.index.year
df['day'] = df.index.dayofyear
# Pivot
p = df.pivot(index='day', columns='year', values='Temp')
print(p)
p:
year 1981 1982
day
1 38 85
2 51 70
3 76 61
4 71 47
5 44 76
.. ... ...
361 23 22
362 42 64
363 84 22
364 26 56
365 67 73
Run-Time via Timeit
import timeit
setup = '''
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1983-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])'''
pivot = '''
df['year'] = df.index.year
df['day'] = df.index.dayofyear
p = df.pivot(index='day', columns='year', values='Temp')'''
groupby_for = '''
groups = df.groupby(df.index.year)
keys = groups.groups.keys()
years = pd.DataFrame()
for key in keys:
    years[key] = groups.get_group(key)['Temp'].values'''
if __name__ == '__main__':
    print("Pivot")
    print(timeit.timeit(setup=setup, stmt=pivot, number=1000))
    print("Groupby For")
    print(timeit.timeit(setup=setup, stmt=groupby_for, number=1000))
Pivot
1.598973
Groupby For
2.3967995999999996
*Additional note: the groupby-for option will not work across leap years, since it cannot handle 1984 having 366 days instead of 365. Pivot works regardless.
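To see the leap-year behaviour concretely, a small sketch (random data over a range that includes the leap year 1984):

```python
import numpy as np
import pandas as pd

dr = pd.date_range("1983-01-01", "1985-12-31")
df = pd.DataFrame({'Temp': np.random.randint(1, 100, size=dr.shape)},
                  index=dr)

df['year'] = df.index.year
df['day'] = df.index.dayofyear
p = df.pivot(index='day', columns='year', values='Temp')

# Day 366 exists only in 1984; the other year columns simply get NaN
# in that row, so the pivot never misaligns.
```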

Numpy - removing rows from one array using rows of another one

I have two arrays which contain large datasets - point clouds.
The first array has more than three columns: the first three are XYZ coordinates and the rest contain additional information. One row is basically one point with given coordinates and additional parameters - not important at this stage.
The second array contains only three columns - XYZ.
From the first array I would like to remove all rows (points) whose XYZ coordinates overlap, within a given buffer, with the corresponding row (point) from the second array.
For example here is the first array:
15.0 23.0 35.5 222 211 254
13.0 33.0 34.5 223 232 244
15.0 23.0 35.5 226 211 253
15.4 22.1 32.5 122 231 252
14.1 24.4 36.5 242 212 251
15.0 23.4 55.5 223 211 253
15.0 23.5 45.5 222 211 254
Here is the second one:
15.0 23.1 35.6
13.1 33.1 34.4
15.5 23.1 35.8
15.4 22.1 32.9
14.1 24.8 36.5
15.5 23.4 55.9
15.9 23.5 45.5
And my given buffer is 0.1. As a result I would like to obtain the following array:
15.0 23.0 35.5 226 211 253
15.4 22.1 32.5 122 231 252
14.1 24.4 36.5 242 212 251
15.0 23.4 55.5 223 211 253
15.0 23.5 45.5 222 211 254
What is the best way to implement this task using numpy?
How about this? Comparing rows pairwise with a per-coordinate buffer (which is what your expected output implies - note that the two arrays have the same number of rows):
def filter(arr1, arr2, buffer):
    # Keep rows whose XYZ differ from the paired arr2 row
    # by more than the buffer in at least one coordinate.
    return arr1[np.abs(arr1[:, :3] - arr2).max(axis=1) > buffer]
A Euclidean norm with `< threshold` would instead keep the overlapping rows and drop the rest - the opposite of what you want.
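A runnable sketch on the example data, comparing rows pairwise with a per-coordinate buffer (that interpretation, rather than a Euclidean distance, is what reproduces the expected output; the small epsilon absorbs floating-point noise in differences like 23.1 - 23.0):

```python
import numpy as np

arr1 = np.array([[15.0, 23.0, 35.5, 222, 211, 254],
                 [13.0, 33.0, 34.5, 223, 232, 244],
                 [15.0, 23.0, 35.5, 226, 211, 253],
                 [15.4, 22.1, 32.5, 122, 231, 252],
                 [14.1, 24.4, 36.5, 242, 212, 251],
                 [15.0, 23.4, 55.5, 223, 211, 253],
                 [15.0, 23.5, 45.5, 222, 211, 254]])
arr2 = np.array([[15.0, 23.1, 35.6],
                 [13.1, 33.1, 34.4],
                 [15.5, 23.1, 35.8],
                 [15.4, 22.1, 32.9],
                 [14.1, 24.8, 36.5],
                 [15.5, 23.4, 55.9],
                 [15.9, 23.5, 45.5]])

buffer = 0.1
# A row "overlaps" when every coordinate is within the buffer of the
# paired arr2 row; keep the rest.
keep = np.abs(arr1[:, :3] - arr2).max(axis=1) > buffer + 1e-9
result = arr1[keep]
```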

invalid syntax for a big list of numbers

I found a big table of data online and would like to use it in Python to graph two of its columns.
I copied and pasted the table, trying to make a string out of it, but the table is just raw numbers - no commas or anything - and Python isn't happy with that.
Is there any way I can do this in Python?
(I added the first couple of commas while experimenting.)
import math
a=(
1983, 937.700, 645 1580 71.6 65.9 65.9 65.8 65.8
1984 3426.020 645 6742 76.8 67.8 67.4 60.5 61.6
1985 3189.450 645 6347 72.4 71.1 69.1 56.4 59.3
1986 3792.140 645 7488 85.5 85.8 74.2 67.1 61.7
1987 4658.460 640 7654 87.4 85.5 76.8 83.1 66.7
1988 5283.590 640 8372 95.3 95.3 80.4 94.0 71.9
1989 4870.250 640 7722 88.2 89.5 81.8 86.9 74.3
1990 4080.560 640 7748 88.4 72.9 80.6 72.8 74.1
1991 3925.510 640 6317 72.1 69.9 79.3 70.0 73.6
1992 4701.500 640 7431 84.6 84.8 79.9 83.6 74.7
1993 4827.100 685 7731 88.2 92.4 81.2 80.4 75.2
1994 5405.460 635 8634 98.6 98.6 82.7 97.2 77.2
1995 4518.970 635 7229 82.5 82.5 82.7 81.2 77.5
1996 5241.980 635 8289 94.4 94.4 83.6 94.0 78.7
1997 4217.520 635 6901 78.8 78.8 83.2 75.8 78.5
1998 3825.060 635 6258 71.4 71.4 82.5 68.8 77.9
1999 3793.280 635 6132 70.0 69.9 81.7 68.2 77.3
2000 4886.200 635 7879 89.7 89.7 82.2 87.6 77.9
2001 4711.190 635 7766 88.6 88.3 82.5 84.7 78.3
2002 4532.290 635 7366 84.1 83.4 82.5 81.5 78.4
2003 3567.070 635 5833 66.6 65.2 81.7 64.1 77.7
2004 4875.390 635 7905 90.0 89.2 82.0 87.4 78.2
2005 4486.190 635 7329 83.7 83.5 82.1 80.6 78.3
2006 4595.250 635 7541 86.1 86.1 82.3 82.6 78.5
2007 4328.590 635 7126 81.4 77.8 82.1 77.8 78.4
2008 3648.410 635 6207 70.7 65.4 81.4 65.4 77.9
2009 3611.440 635 6039 68.9 64.9 80.8 64.9 77.4
2010 3490.450 635 5641 64.4 62.8 80.2 62.8 76.9
2011 3490.600 635 5861 66.9 62.8 79.5 62.8 76.4
2012 3911.560 )
File "", line 3
1983, 937.70, 645 1580 71.6 65.9 65.9 65.8 65.8
^
SyntaxError: invalid syntax
Create a file named data.csv and paste the original values into it (without the commas you added). Now you can read this file:
import csv

with open('data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        print(', '.join(row))
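Alternatively, since the fields are separated by runs of whitespace, plain str.split() avoids the empty fields that csv.reader with delimiter=' ' produces. A self-contained sketch (two sample rows stand in for the pasted table):

```python
# Write two rows of the table to data.csv just for the demo.
sample = ("1983 937.700 645 1580 71.6 65.9 65.9 65.8 65.8\n"
          "1984 3426.020 645 6742 76.8 67.8 67.4 60.5 61.6\n")
with open('data.csv', 'w') as f:
    f.write(sample)

rows = []
with open('data.csv') as f:
    for line in f:
        fields = line.split()   # splits on any run of whitespace
        if fields:              # skip blank lines
            rows.append([float(v) for v in fields])
```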

average temperature from year and month data in a file python

I have a data file with data in a specific format, plus some extra lines to ignore while processing. I need to process the data and calculate a value from it.
Sample Data:
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
Source of Sample File: http://robjhyndman.com/tsdldata/data/cryer2.dat
Note: Here, rows represent the years and columns represent the months.
I am trying to write a function which returns the average temperature of any month from the given url.
I have tried as below:
def avg_temp_march(f):
    march_temps = []
    # read each line of the file and store the values
    # as floats in a list
    for line in f:
        line = str(line, 'ascii')  # now line is a string
        temps = line.split()
        # check that it is not empty.
        if temps != []:
            march_temps.append(float(temps[2]))
    # calculate the average and return it
    return sum(march_temps) / len(march_temps)

avg_temp_march("data5.txt")
But I am getting the error
TypeError: decoding str is not supported
on the line line = str(line, 'ascii'). I think there should be no need to convert a string to a string.
I tried to fix your code with some modifications (Python 3 syntax):
def avg_temp_march(f):
    # f is a string read from file
    march_temps = []
    for line in f.split("\n"):
        if line == "":
            continue
        temps = [t for t in line.split(" ") if t != ""]
        # check that the month column exists on this line.
        month_index = 2
        if len(temps) > month_index:
            try:
                march_temps.append(float(temps[month_index]))
            except ValueError as e:
                print(temps)
                print("Skipping line:", e)
    # calculate the average and return it
    return sum(march_temps) / len(march_temps)
Output:
['Average', 'monthly', 'temperatures', 'in', 'Dubuque,', 'Iowa,']
Skipping line: could not convert string to float: temperatures
['January', '1964', 'through', 'december', '1975,', 'n=144']
Skipping line: could not convert string to float: through
32.475
Based on your original question (before the latest edits), I think you can solve your problem this way:
from urllib.request import urlopen  # urllib2.urlopen in Python 2

def avg_temp_march(url):
    f = urlopen(url).read().decode()  # bytes -> str in Python 3
    data = f.split("\n")[3:]  # ignore the first 3 lines
    data = [line.split() for line in data if line != '']  # ignore the empty lines
    data = [list(map(float, line)) for line in data]  # convert all numbers to float
    month_index = 2  # 2 for March
    monthly_sum = sum(line[month_index] for line in data)
    monthly_avg = monthly_sum / len(data)
    return monthly_avg

print(avg_temp_march("http://robjhyndman.com/tsdldata/data/cryer2.dat"))
Using pandas, the code becomes a bit shorter:
import calendar
import pandas as pd

df = pd.read_csv('data5.txt', delim_whitespace=True, skiprows=2,
                 names=calendar.month_abbr[1:])
Now for March:
>>> df.Mar.mean()
32.475000000000001
and for all months:
>>> df.mean()
Jan 16.608333
Feb 20.650000
Mar 32.475000
Apr 46.525000
May 58.091667
Jun 67.500000
Jul 71.716667
Aug 69.333333
Sep 61.025000
Oct 50.975000
Nov 36.650000
Dec 23.641667
dtype: float64
