Using Python, how do I convert numeric data to ordinal? - python

data :
118.6
109.7
126.7
107.8
113.9
109.7
109.7
98.2
112.3
153.7
157.8
85
126.7
125.1
155.4
138.5
154.6
189.9
120.6
101.6
128.7
138.5
210.8
124.4
189.8
122.5
161.7
188.6
229.1
168.7
233.7
137.5
126.6
244.4
141.9
227.5
183
177.6
244.4
95.1
116.4
75.9
75.3
109.8
117.1
75.9
109.8
71.2
71.3
89.6
93.3
84.7
85
82.9
145.3
107.7
84.2
96.7
89.8
86.2
85
89.6
67.5
64.9
48.1
54.9
56.1
60.6
51
44.6
64.3
57.6
66.2
69
60
70.2
65.4
60.1
49.4
61.4
62.8
78.8
70.3
82.7
68.6
I want to convert this numeric data to ordinal values.
Example:
If a value falls between 60 and 69.9, the result should be 1.
If a value falls between 70 and 79.9, the result should be 2.
If a value falls between 80 and 89.9, the result should be 3.
If a value falls between 90 and 99.9, the result should be 4, and so on.
I know how to binarize data using binaryX = binarizer.transform(X),
but I don't know how to convert a numeric interval into a single ordinal value.

What about just dividing by 10, and subtracting the offset?
data = ['60', '69.9', '70', '73', '80']
[int((float(a) // 10) - 5) for a in data] # [1, 1, 2, 2, 3]
or, if you are using NumPy
((numpy.array([float(a) for a in data]) // 10) - 5).astype(int) # [1 1 2 2 3]
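If the intervals are not all the same width, the divide-and-offset trick no longer applies. A minimal sketch of an alternative, assuming you list the lower edge of each interval explicitly (the `edges` list and the `ordinal_code` name are illustrative, not from the question), is to look up each value with the standard library's `bisect`:

```python
from bisect import bisect_right

# Lower edge of each interval: [60, 70) -> 1, [70, 80) -> 2, and so on.
# These edges are an assumption taken from the example in the question.
edges = [60, 70, 80, 90, 100]

def ordinal_code(value, edges):
    """Return how many edges are <= value, i.e. the 1-based interval number."""
    return bisect_right(edges, value)

values = [60, 69.9, 70, 85, 99.9, 48.1, 118.6]
codes = [ordinal_code(v, edges) for v in values]
print(codes)  # [1, 1, 2, 3, 4, 0, 5]
```

Values below the first edge get 0 here, since the question does not say how they should be coded. For arrays, `numpy.digitize` with increasing bins should give the same numbering.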

Related

Numpy - removing rows from one array using rows of another one

I have two arrays that contain large datasets (point clouds).
The first array has more than three columns: the first three are XYZ coordinates and the remaining columns contain additional information. Each row is basically one point with its coordinates and additional parameters, which are not important at this stage.
The second array contains only the three XYZ columns.
From the first array I would like to remove all rows (points) whose XYZ coordinates overlap, within a given buffer, with any row (point) from the second array.
For example here is the first array:
15.0 23.0 35.5 222 211 254
13.0 33.0 34.5 223 232 244
15.0 23.0 35.5 226 211 253
15.4 22.1 32.5 122 231 252
14.1 24.4 36.5 242 212 251
15.0 23.4 55.5 223 211 253
15.0 23.5 45.5 222 211 254
Here is the second one:
15.0 23.1 35.6
13.1 33.1 34.4
15.5 23.1 35.8
15.4 22.1 32.9
14.1 24.8 36.5
15.5 23.4 55.9
15.9 23.5 45.5
And my given buffer is 0.1. As a result I would like to obtain the following array:
15.0 23.0 35.5 226 211 253
15.4 22.1 32.5 122 231 252
14.1 24.4 36.5 242 212 251
15.0 23.4 55.5 223 211 253
15.0 23.5 45.5 222 211 254
What is the best way to implement this task using numpy?
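How about this? Note that it compares the two arrays row by row, so both must have the same number of rows.
import numpy as np
def filter_rows(arr1, arr2, threshold):
    # keep only the rows of arr1 that are farther than threshold from the matching row of arr2
    return arr1[np.linalg.norm(arr1[:, :3] - arr2, axis=1) > threshold]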
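The question asks to drop a point when it is close to any row of the second array, not only the row at the same position. A rough sketch of that variant, assuming Euclidean distance is the intended measure of "within the buffer" (the function name and the small sample arrays below are made up for illustration), can use broadcasting to compare every pair of points:

```python
import numpy as np

def remove_near_points(points, reference, buffer):
    """Drop rows of `points` whose XYZ lies within `buffer` of any row of `reference`."""
    xyz = points[:, :3]
    # Pairwise differences via broadcasting: shape (len(points), len(reference), 3).
    diff = xyz[:, None, :] - reference[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Keep a point only if it is farther than `buffer` from every reference point.
    keep = (dist > buffer).all(axis=1)
    return points[keep]

# Illustrative data: XYZ plus one extra column.
points = np.array([
    [0.0, 0.0, 0.0, 10],
    [5.0, 5.0, 5.0, 20],
    [9.0, 9.0, 9.0, 30],
])
reference = np.array([
    [0.0, 0.0, 0.05],
    [9.0, 9.5, 9.0],
])
result = remove_near_points(points, reference, buffer=0.1)
print(result)  # only the rows with extra values 20 and 30 remain
```

The intermediate array grows with the product of the two row counts, so for very large point clouds this would need to be done in chunks or with a spatial index.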

Not able to read txt file without comma separator in pandas python

CODE
import pandas
df = pandas.read_csv('biharpopulation.txt', delim_whitespace=True)
df.columns = ['SlNo','District','Total','Male','Female','Total','Male','Female','SC','ST','SC','ST']
DATA
SlNo District Total Male Female Total Male Female SC ST SC ST
1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.2 38.6 68.7
2 Nalanda 473786 248246 225540 970 524 446 20.2 0.0 29.4 29.8
3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.4 39.1 46.7
4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.6 37.9 44.6
5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.0 41.3 30.0
6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.8 40.5 38.6
7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.1 26.3 49.1
8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
9 Arawal 11479 57677 53802 294 179 115 18.8 0.04
10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.1 22.4 20.5
11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.1 35.7 49.7
Saran
12 Saran 389933 199772 190161 6667 3384 3283 12 0.2 33.6 48.5
13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.5 35.6 44.0
14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.3 32.1 37.8
15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.1 28.9 50.4
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.1 22.1 31.4
19 Sheohar 74391 39405 34986 64 35 29 14.4 0.0 16.9 38.8
20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.1 29.4 29.9
21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.0 24.7 49.5
22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.0 22.2 35.8
23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.1 25.1 22.0
24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.6 42.6 37.3
25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.1 31.4 78.6
26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.0 25.2 45.6
27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.7 26.8 12.9
28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.8 24.5 26.7
The issue is with these 2 lines:
16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3
17 W. Champaran 434714 228057 206657 44912 23135 21777 14.3 1.5 22.3 24.1
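The district names in these rows contain a space, so each name is split into two columns. If you can remove that space (giving E.Champaran and W.Champaran), then you can do this:
import pandas as pd
df = pd.read_csv('test.csv', sep=r'\s+', skip_blank_lines=True, skipinitialspace=True)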
print(df)
SlNo District Total Male Female Total.1 Male.1 Female.1 SC ST SC.1 ST.1
0 1 Patna 729988 386991 342997 9236 5352 3884 15.5 0.20 38.6 68.7
1 2 Nalanda 473786 248246 225540 970 524 446 20.2 0.00 29.4 29.8
2 3 Bhojpur 343598 181372 162226 8337 4457 3880 15.3 0.40 39.1 46.7
3 4 Buxar 198014 104761 93253 8428 4573 3855 14.1 0.60 37.9 44.6
4 5 Rohtas 444333 233512 210821 25663 13479 12184 18.1 1.00 41.3 30.0
5 6 Kaimur 286291 151031 135260 35662 18639 17023 22.2 2.80 40.5 38.6
6 7 Gaya 1029675 529230 500445 2945 1526 1419 29.6 0.10 26.3 49.1
7 8 Jehanabad 174738 90485 84253 1019 530 489 18.9 0.07 32.6 32.4
8 9 Arawal 11479 57677 53802 294 179 115 18.8 0.04 NaN NaN
9 10 Nawada 435975 223929 212046 2158 1123 1035 24.1 0.10 22.4 20.5
10 11 Aurangabad 472766 244761 228005 1640 865 775 23.5 0.10 35.7 49.7
11 12 Saran 389933 199772 190161 6667 3384 3283 12.0 0.20 33.6 48.5
12 13 Siwan 309013 153558 155455 13822 6856 6966 11.4 0.50 35.6 44.0
13 14 Gopalganj 267250 134796 132454 6157 2984 3173 12.4 0.30 32.1 37.8
14 15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.10 28.9 50.4
15 16 E.Champaran 514119 270968 243151 4812 2518 2294 13.0 0.10 20.6 34.3
16 17 W.Champaran 434714 228057 206657 44912 23135 21777 14.3 1.50 22.3 24.1
17 18 Sitamarhi 315646 166607 149039 1786 952 834 11.8 0.10 22.1 31.4
18 19 Sheohar 74391 39405 34986 64 35 29 14.4 0.00 16.9 38.8
19 20 Vaishali 562123 292711 269412 3068 1595 1473 20.7 0.10 29.4 29.9
20 21 Darbhanga 511125 266236 244889 841 467 374 15.5 0.00 24.7 49.5
21 22 Madhubani 481922 248774 233148 1260 647 613 13.5 0.00 22.2 35.8
22 23 Samastipur 628838 325101 303737 3362 2724 638 18.5 0.10 25.1 22.0
23 24 Munger 150947 80031 70916 18060 9297 8763 13.3 1.60 42.6 37.3
24 25 Begusarai 341173 177897 163276 1505 823 682 14.5 0.10 31.4 78.6
25 26 Shekhapura 103732 54327 49405 211 115 96 19.7 0.00 25.2 45.6
26 27 Lakhisarai 126575 65781 60794 5636 2918 2718 15.8 0.70 26.8 12.9
27 28 Jamui 242710 124538 118172 67357 34689 32668 17.4 4.80 24.5 26.7
Your problem is that the file is whitespace-delimited, but some of the district names also contain whitespace. If the columns are actually separated by tab characters and none of the district names contain '\t', you can read the file with a tab delimiter instead:
df = pandas.read_csv('biharpopulation.txt', delimiter='\t')
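If the file turns out to be separated by plain spaces and you would rather not edit it by hand, another option is to split each line yourself and treat everything between the serial number and the first numeric token as the district name. This is only a sketch under that assumption; `parse_row` is a hypothetical helper, not part of pandas:

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def parse_row(line):
    """Split one whitespace-delimited row, keeping multi-word district names together."""
    tokens = line.split()
    # Skip blank lines, the header, and stray lines that do not start with a serial number.
    if not tokens or not tokens[0].isdigit():
        return None
    # The first numeric token after the serial number starts the numeric columns.
    first_num = next(
        (i for i in range(1, len(tokens)) if NUMBER.fullmatch(tokens[i])),
        len(tokens),
    )
    district = " ".join(tokens[1:first_num])
    return [int(tokens[0]), district] + [float(t) for t in tokens[first_num:]]

lines = [
    "SlNo District Total Male Female Total Male Female SC ST SC ST",
    "15 Muzaffarpur 594577 308894 285683 3472 1789 1683 15.9 0.1 28.9 50.4",
    "16 E. Champaran 514119 270968 243151 4812 2518 2294 13.0 0.1 20.6 34.3",
    "Saran",
]
rows = [row for row in map(parse_row, lines) if row is not None]
print(rows[1][:3])  # [16, 'E. Champaran', 514119.0]
```

The resulting list of rows can then be passed to `pandas.DataFrame` together with the column names.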

30 Day distance between dates in datetime64[ns] column

I have data of the following form:
6460 2001-07-24 00:00:00 67.5 75.1 75.9 71.0 75.2 81.8
6490 2001-06-24 00:00:00 68.4 74.9 76.1 70.9 75.5 82.7
6520 2001-05-25 00:00:00 69.6 74.7 76.3 70.8 75.5 83.2
6550 2001-04-25 00:00:00 69.2 74.6 76.1 70.6 75.0 83.1
6580 2001-03-26 00:00:00 69.1 74.4 75.9 70.5 74.3 82.8
6610 2001-02-24 00:00:00 69.0 74.0 75.3 69.8 73.8 81.9
6640 2001-01-25 00:00:00 68.9 73.9 74.6 69.7 73.5 80.0
6670 2000-12-26 00:00:00 69.0 73.5 75.0 69.5 72.6 81.8
6700 2000-11-26 00:00:00 69.8 73.2 75.1 69.5 72.0 82.7
6730 2000-10-27 00:00:00 70.3 73.1 75.0 69.4 71.3 82.6
6760 2000-09-27 00:00:00 69.4 73.0 74.8 69.4 71.0 82.3
6790 2000-08-28 00:00:00 69.6 72.8 74.6 69.2 70.7 81.9
6820 2000-07-29 00:00:00 67.8 72.9 74.4 69.1 70.6 81.8
I want all the dates to have a 30 day difference between each other. I know how to add a specific day or month to a datetime object with something like
ndfd = ndf['Date'].astype('datetime64[ns]')
ndfd = ndfd.apply(lambda dt: dt.replace(day=15))
But this does not take into account the difference in days from month to month.
How can I ensure there is a consistent step in days from month to month in my data, given that I am able to change the day as long as it remains on the same month?
You could use date_range:
df['date'] = pd.date_range(start=df['date'][0], periods=len(df), freq='30D')
IIUC you could change your date column like this:
import datetime
a = df.iloc[0,0] # first date, assuming date col is first
df['date'] = [a + datetime.timedelta(days=30 * i) for i in range(len(df))]
I haven't tested this, so I'm not sure it works as smoothly as I expect.
You can convert the first date to an ordinal, add 30*i to it, and then convert it back.
import datetime
first_day = df.iloc[0]['date_column'].toordinal()
df['date'] = [datetime.date.fromordinal(first_day + 30 * i) for i in range(len(df))]
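All of the suggestions above come down to the same idea: pick a start date and add a fixed multiple of 30 days for each row. A small standard-library sketch of that idea follows; the function name is illustrative, and the negative step is an assumption based on the sample data, which runs backwards in time:

```python
from datetime import date, timedelta

def evenly_spaced_dates(start, count, step_days=30):
    """Return `count` dates starting at `start`, each `step_days` after the previous one."""
    return [start + timedelta(days=step_days * i) for i in range(count)]

# The sample data goes backwards in time, so use a negative step here.
dates = evenly_spaced_dates(date(2001, 7, 24), 4, step_days=-30)
print(dates)  # 2001-07-24, 2001-06-24, 2001-05-25, 2001-04-25
```

The resulting list can be assigned to the date column directly, as in the answers above.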

invalid syntax for a big list of numbers

I found a big table of data online. I would like to use it in python. I was going to make a graph out of two of the columns of data.
I copied and pasted the table and tried to make a string out of it, but the table is just raw numbers with no commas or anything, and Python isn't happy with that.
Is there any way I can do this in python?
(I added the first couple of commas experimenting)
import math
a=(
1983, 937.700, 645 1580 71.6 65.9 65.9 65.8 65.8
1984 3426.020 645 6742 76.8 67.8 67.4 60.5 61.6
1985 3189.450 645 6347 72.4 71.1 69.1 56.4 59.3
1986 3792.140 645 7488 85.5 85.8 74.2 67.1 61.7
1987 4658.460 640 7654 87.4 85.5 76.8 83.1 66.7
1988 5283.590 640 8372 95.3 95.3 80.4 94.0 71.9
1989 4870.250 640 7722 88.2 89.5 81.8 86.9 74.3
1990 4080.560 640 7748 88.4 72.9 80.6 72.8 74.1
1991 3925.510 640 6317 72.1 69.9 79.3 70.0 73.6
1992 4701.500 640 7431 84.6 84.8 79.9 83.6 74.7
1993 4827.100 685 7731 88.2 92.4 81.2 80.4 75.2
1994 5405.460 635 8634 98.6 98.6 82.7 97.2 77.2
1995 4518.970 635 7229 82.5 82.5 82.7 81.2 77.5
1996 5241.980 635 8289 94.4 94.4 83.6 94.0 78.7
1997 4217.520 635 6901 78.8 78.8 83.2 75.8 78.5
1998 3825.060 635 6258 71.4 71.4 82.5 68.8 77.9
1999 3793.280 635 6132 70.0 69.9 81.7 68.2 77.3
2000 4886.200 635 7879 89.7 89.7 82.2 87.6 77.9
2001 4711.190 635 7766 88.6 88.3 82.5 84.7 78.3
2002 4532.290 635 7366 84.1 83.4 82.5 81.5 78.4
2003 3567.070 635 5833 66.6 65.2 81.7 64.1 77.7
2004 4875.390 635 7905 90.0 89.2 82.0 87.4 78.2
2005 4486.190 635 7329 83.7 83.5 82.1 80.6 78.3
2006 4595.250 635 7541 86.1 86.1 82.3 82.6 78.5
2007 4328.590 635 7126 81.4 77.8 82.1 77.8 78.4
2008 3648.410 635 6207 70.7 65.4 81.4 65.4 77.9
2009 3611.440 635 6039 68.9 64.9 80.8 64.9 77.4
2010 3490.450 635 5641 64.4 62.8 80.2 62.8 76.9
2011 3490.600 635 5861 66.9 62.8 79.5 62.8 76.4
2012 3911.560 )
File "", line 3
1983, 937.70, 645 1580 71.6 65.9 65.9 65.8 65.8
^
SyntaxError: invalid syntax
Create a file named data.csv and paste the original values into it (without the commas you added). Now you can read this file:
import csv
with open('data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        print(', '.join(row))
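If you would rather keep the data inside the script, the pasted table can also go into a triple-quoted string and be split on whitespace. A minimal sketch, using only the first three rows of the table and assuming the first two columns are the ones to plot (the variable names are made up):

```python
# Paste the raw table between the triple quotes; no commas are needed.
raw = """
1983 937.700 645 1580 71.6 65.9 65.9 65.8 65.8
1984 3426.020 645 6742 76.8 67.8 67.4 60.5 61.6
1985 3189.450 645 6347 72.4 71.1 69.1 56.4 59.3
"""

# One list of floats per non-empty line; split() handles any run of spaces.
rows = [[float(x) for x in line.split()] for line in raw.strip().splitlines()]

years = [int(row[0]) for row in rows]   # first column
values = [row[1] for row in rows]       # second column
print(years)   # [1983, 1984, 1985]
print(values)  # [937.7, 3426.02, 3189.45]
```

Because each row is indexed separately, an incomplete last row (such as the 2012 line in the question) still works as long as it contains the columns you read.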

average temperature from year and month data in a file python

I have a data file in a specific format that has some extra lines to ignore while processing. I need to process the data and calculate a value from it.
Sample Data:
Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
21.5 14.7 35.0 48.3 54.0 68.2 69.6 65.7 60.8 49.1 33.2 26.0
19.1 20.6 40.2 50.0 55.3 67.7 70.7 70.3 60.6 50.7 35.8 20.7
14.0 24.1 29.4 46.6 58.6 62.2 72.1 71.7 61.9 47.6 34.2 20.4
8.4 19.0 31.4 48.7 61.6 68.1 72.2 70.6 62.5 52.7 36.7 23.8
11.2 20.0 29.6 47.7 55.8 73.2 68.0 67.1 64.9 57.1 37.6 27.7
13.4 17.2 30.8 43.7 62.3 66.4 70.2 71.6 62.1 46.0 32.7 17.3
22.5 25.7 42.3 45.2 55.5 68.9 72.3 72.3 62.5 55.6 38.0 20.4
17.6 20.5 34.2 49.2 54.8 63.8 74.0 67.1 57.7 50.8 36.8 25.5
20.4 19.6 24.6 41.3 61.8 68.5 72.0 71.1 57.3 52.5 40.6 26.2
Source of Sample File: http://robjhyndman.com/tsdldata/data/cryer2.dat
Note: Here, rows represent the years and columns represent the months.
I am trying to write a function which returns the average temperature of any month from the given url.
I have tried as below:
def avg_temp_march(f):
    march_temps = []
    # read each line of the file and store the values
    # as floats in a list
    for line in f:
        line = str(line, 'ascii') # now line is a string
        temps = line.split()
        # check that it is not empty.
        if temps != []:
            march_temps.append(float(temps[2]))
    # calculate the average and return it
    return sum(march_temps) / len(march_temps)
avg_temp_march("data5.txt")
But I am getting an error on the line line = str(line, 'ascii'):
TypeError: decoding str is not supported
There is no need to convert a string to a string.
I fixed your code with some modifications:
def avg_temp_march(f):
    # f is a string read from file
    march_temps = []
    for line in f.split("\n"):
        if line == "": continue
        temps = line.split(" ")
        temps = [t for t in temps if t != ""]
        # check that it is not empty.
        month_index = 2
        if len(temps) > month_index:
            try:
                march_temps.append(float(temps[month_index]))
            except ValueError as e:
                print(temps)
                print("Skipping line:", e)
    # calculate the average and return it
    return sum(march_temps) / len(march_temps)
Output:
['Average', 'monthly', 'temperatures', 'in', 'Dubuque,', 'Iowa,']
Skipping line: could not convert string to float: 'temperatures'
['January', '1964', 'through', 'december', '1975,', 'n=144']
Skipping line: could not convert string to float: 'through'
32.475
Based on your original question (before the latest edits), I think you can solve your problem this way:
# from urllib2 import urlopen  # Python 2
from urllib.request import urlopen  # Python 3
def avg_temp_march(url):
    f = urlopen(url).read().decode('ascii')  # read() returns bytes, so decode to str
    data = f.split("\n")[3:]  # ignore the first 3 lines
    data = [line.split() for line in data if line.strip()]  # ignore the empty lines
    data = [[float(x) for x in line] for line in data]  # convert all numbers to float
    month_index = 2  # 2 for March
    monthly_sum = sum(line[month_index] for line in data)
    monthly_avg = monthly_sum / len(data)
    return monthly_avg
print(avg_temp_march("http://robjhyndman.com/tsdldata/data/cryer2.dat"))
Using pandas, the code becomes a bit shorter:
import calendar
import pandas as pd
df = pd.read_csv('data5.txt', sep=r'\s+', skiprows=2,
                 names=calendar.month_abbr[1:])
Now for March:
>>> df.Mar.mean()
32.475000000000001
and for all months:
>>> df.mean()
Jan 16.608333
Feb 20.650000
Mar 32.475000
Apr 46.525000
May 58.091667
Jun 67.500000
Jul 71.716667
Aug 69.333333
Sep 61.025000
Oct 50.975000
Nov 36.650000
Dec 23.641667
dtype: float64
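For comparison, here is a standard-library sketch that works for any month and does not depend on knowing how many header lines the file has. It assumes every data row holds exactly 12 numeric values; the function name and the shortened sample text are illustrative only:

```python
def monthly_average(text, month_index):
    """Average one column (0 = January ... 11 = December) over all data rows."""
    values = []
    for line in text.splitlines():
        try:
            row = [float(x) for x in line.split()]
        except ValueError:
            continue  # header or other non-numeric line
        if len(row) == 12:  # blank or incomplete lines are skipped
            values.append(row[month_index])
    return sum(values) / len(values)

# First three data rows of the sample, with the two header lines kept in place.
sample = """Average monthly temperatures in Dubuque, Iowa,
January 1964 through december 1975, n=144
24.7 25.7 30.6 47.5 62.9 68.5 73.7 67.9 61.1 48.5 39.6 20.0
16.1 19.1 24.2 45.4 61.3 66.5 72.1 68.4 60.2 50.9 37.4 31.1
10.4 21.6 37.4 44.7 53.2 68.0 73.7 68.2 60.7 50.2 37.2 24.6
"""
march = monthly_average(sample, 2)
print(round(march, 2))  # 30.73
```

To use it on a file, read the whole file into a string first and pass that string as `text`.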
