Error while training a model with keras train_function (empty Logs) - python

I am trying to train a new model on top of BERT's pre-trained weights. However, the moment I start training I get an error with empty logs, which seems related to this Keras issue: https://github.com/keras-team/keras/issues/16202. The data was prepared as follows:
Xtrain_encoded = np.array(tokenizer.batch_encode_plus(X_train.astype('str'), truncation=True, max_length = 512)['input_ids'])
ytrain_encoded = tf.keras.utils.to_categorical(y_train, num_classes=4, dtype = 'int32')
Xtest_encoded = np.array(tokenizer.batch_encode_plus(X_test.astype('str'), truncation=True, max_length = 512)['input_ids'])
ytest_encoded = tf.keras.utils.to_categorical(y_test, num_classes=4, dtype = 'int32')
tensor_with_from_Xtrain_encoded = tf.ragged.constant(Xtrain_encoded)
tensor_with_from_Ytrain_encoded = tf.ragged.constant(ytrain_encoded)
tensor_with_from_Xtest_encoded = tf.ragged.constant(Xtest_encoded)
tensor_with_from_Ytest_encoded = tf.ragged.constant(ytest_encoded)
Then, I created the training and testing datasets:
BATCH_SIZE = 32*strategy.num_replicas_in_sync
AUTO = tf.data.experimental.AUTOTUNE
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((tensor_with_from_Xtrain_encoded, tensor_with_from_Ytrain_encoded))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)
test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(tensor_with_from_Xtest_encoded)
    .batch(BATCH_SIZE)
)
And this is where it breaks:
n_steps = tensor_with_from_Xtrain_encoded.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    epochs=10
)
A single element of tensor_with_from_Xtrain_encoded and of tensor_with_from_Ytrain_encoded, respectively, looks like this (the real dataset contains many more such elements):
tf.Tensor(
[ 101 11909 26136 23194 1000 1037 2329 10687 3005 2269 2003 15497
3979 1999 7128 2005 1037 3940 1997 2149 8408 20109 2003 1037
2877 8343 1999 1996 3347 25990 2278 2022 4974 2075 1997 2137
4988 2508 17106 1010 2009 2001 3936 2006 5958 1012 19935 2884
1011 16686 2098 19935 2884 3347 2100 2040 3728 1056 28394 3064
1037 6302 1997 2370 3173 2039 1037 16574 2132 2001 2426 2093
28101 2015 4453 2004 4298 2108 1996 16520 6359 2124 2004 2198
1996 3786 2571 1012 3347 2100 1010 2484 1010 2003 1996 2365
1997 2019 6811 1011 2141 16830 2040 2003 15497 3979 2006 7404
5571 5079 2000 1996 9252 2687 20109 1997 7861 22083 14625 1999
7938 1998 11959 1012 2036 2104 4812 2024 1996 2567 1997 1037
2329 3460 2320 5338 2007 15071 2048 2530 2162 11370 2015 1010
1998 1037 2280 6080 2266 2040 4991 2000 7025 1998 6158 2000
7795 1010 3725 2015 10013 3780 2988 1012 1037 6474 2137 4675
3334 29165 2964 8519 2024 3517 2000 4875 2000 1996 2866 2306
2420 2000 2393 6709 17106 2015 6359 1010 3725 2015 3679 5653
2988 1012 2280 19323 2218 2011 18301 2031 2056 2002 2003 2028
1997 2195 24815 5130 2027 9919 1996 11555 2349 2000 2037 2329
24947 1010 2007 2048 1997 2010 13675 10698 2229 3615 2000 2004
2577 1998 25589 1012 3347 2100 1010 2040 2253 2000 7795 2197
2095 2000 2954 1999 2049 6703 2942 2162 1010 2038 1037 3857
1010 3096 4309 1998 9669 2035 2714 2000 2216 1997 2198 1010
2429 2000 1996 10013 1012 2077 3352 1037 24815 2923 1010 2002
2001 2019 22344 10687 2013 2225 2414 2124 2004 1048 9743 4890
1010 3005 2189 2001 2209 2006 4035 2557 1015 1012 3347 2100
2036 2596 1999 2189 6876 6866 2006 7858 2005 2774 4159 26641
1010 3909 2152 1998 24726 1012 2021 2002 2001 7283 7490 3550
2011 8771 1997 2543 23544 5499 14512 2019 6460 2213 16480 14066
2854 1998 2939 2041 1997 2010 2155 2015 27729 2188 1999 1996
10850 2050 10380 2212 1997 2414 2197 2095 1010 3038 2002 2001
2975 2673 2005 1996 8739 1997 16455 1012 3041 2023 3204 1010
2002 2001 2464 1999 1037 6302 6866 2000 10474 4147 21356 5929
1998 1037 2304 21451 20464 12462 2096 3173 1037 16574 2132 2007
2010 2187 2192 1996 2168 2192 2198 2003 2464 2478 2000 4009
1037 5442 2408 17106 2015 3759 1999 2010 7781 2678 1012 2429
2000 1996 10013 1010 3347 2100 2003 2006 2019 2880 2862 1997
2329 24815 5130 2040 2089 2022 2198 1012 2036 2006 1996 2862
2003 10958 9759 2140 7025 1010 2538 1010 1996 3259 2758 1012
1999 2262 1010 7025 2015 2048 3080 3428 2164 2852 1012 21146
9103 2140 7025 1010 1037 6731 7522 2007 2563 2015 2120 2740
2326 2020 5338 2007 15071 2048 2530 2162 11370 2015 2379 1996
9042 3675 2007 4977 1012 2021 1996 3428 2020 2207 2197 2095
2044 4445 2329 4988 2198 2064 15286 102], shape=(512,), dtype=int32)
tf.Tensor([0 0 1 0], shape=(4,), dtype=int32)
The error itself:
ValueError: Unexpected result of `train_function` (Empty logs). Please use `Model.compile(..., run_eagerly=True)`, or `tf.config.run_functions_eagerly(True)` for more information of where went wrong, or file a issue/bug to `tf.keras`.
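One likely cause of the empty-logs failure is feeding ragged (variable-length) tensors into a BERT-style model, which expects rectangular, padded batches. Below is a numpy-only sketch of right-padding a batch of token-id lists; the function name and pad id 0 are illustrative, not part of the original code (with Hugging Face tokenizers the same effect comes from passing padding='max_length' to batch_encode_plus):

```python
import numpy as np

def pad_batch(token_id_lists, max_len=512, pad_id=0):
    # right-pad every sequence with pad_id so the batch becomes a plain
    # rectangular (batch, max_len) int32 array instead of a ragged tensor
    out = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int32)
    for i, ids in enumerate(token_id_lists):
        ids = ids[:max_len]          # truncate anything over max_len
        out[i, :len(ids)] = ids
    return out
```

A rectangular array like this can go straight into tf.data.Dataset.from_tensor_slices without any tf.ragged.constant step.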


Melting a dataframe based on a dependency between columns to be melted and value vars [duplicate]

I have a dataset in the following format:
county area pop_2006 pop_2007 pop_2008
01001 275 1037 1052 1102
01003 394 2399 2424 2438
01005 312 1638 1647 1660
And I need it in a format like this:
county year pop area
01001 2006 1037 275
01001 2007 1052 275
01001 2008 1102 275
01003 2006 2399 394
01003 2007 2424 394
...
I've tried every combination of pivot_table, stack, unstack, wide_to_long that I can think of, with no success yet. (clearly I'm mostly illiterate in Python/pandas, so please be gentle...).
You can use melt for reshaping, then split the variable column, drop it, and sort_values. You can cast the year column to int with astype, and finally reorder the columns by subsetting:
df1 = (pd.melt(df, id_vars=['county','area'], value_name='pop'))
df1[['tmp','year']] = df1.variable.str.split('_', expand=True)
df1 = df1.drop(['variable', 'tmp'],axis=1).sort_values(['county','year'])
df1['year'] = df1.year.astype(int)
df1 = df1[['county','year','pop','area']]
print (df1)
county year pop area
0 1001 2006 1037 275
3 1001 2007 1052 275
6 1001 2008 1102 275
1 1003 2006 2399 394
4 1003 2007 2424 394
7 1003 2008 2438 394
2 1005 2006 1638 312
5 1005 2007 1647 312
8 1005 2008 1660 312
print (df1.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
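For a self-contained check of the melt approach, the sample frame can be rebuilt inline (the DataFrame constructor below is just a stand-in for the question's data, with the counties' leading zeros dropped as in the printed output):

```python
import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({
    'county': [1001, 1003, 1005],
    'area': [275, 394, 312],
    'pop_2006': [1037, 2399, 1638],
    'pop_2007': [1052, 2424, 1647],
    'pop_2008': [1102, 2438, 1660],
})

# melt to long form, split the year out of the variable name, tidy up
df1 = df.melt(id_vars=['county', 'area'], value_name='pop')
df1['year'] = df1['variable'].str.split('_').str[1].astype(int)
df1 = df1.drop(columns='variable').sort_values(['county', 'year'])
df1 = df1[['county', 'year', 'pop', 'area']]
```

Each of the three wide population columns contributes one long row per county, so 3 counties x 3 years gives the 9 rows shown above.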
Another solution with set_index, stack and reset_index:
df2 = df.set_index(['county','area']).stack().reset_index(name='pop')
df2[['tmp','year']] = df2.level_2.str.split('_', expand=True)
df2 = df2.drop(['level_2', 'tmp'],axis=1)
df2['year'] = df2.year.astype(int)
df2 = df2[['county','year','pop','area']]
print (df2)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
print (df2.dtypes)
county int64
year int32
pop int64
area int64
dtype: object
As the question title suggests, we can use pd.wide_to_long:
res = pd.wide_to_long(df, stubnames="pop", i=["county", "area"], j="year", sep="_")
to get
pop
county area year
1001 275 2006 1037
2007 1052
2008 1102
1003 394 2006 2399
2007 2424
2008 2438
1005 312 2006 1638
2007 1647
2008 1660
To exactly match the output format in the question, a reset_index and reindex (over columns) can be chained:
>>> res.reset_index().reindex(["county", "year", "pop", "area"], axis=1)
county year pop area
0 1001 2006 1037 275
1 1001 2007 1052 275
2 1001 2008 1102 275
3 1003 2006 2399 394
4 1003 2007 2424 394
5 1003 2008 2438 394
6 1005 2006 1638 312
7 1005 2007 1647 312
8 1005 2008 1660 312
One option is pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index=['county', 'area'],
     names_to=('.value', 'year'),
     names_sep='_',
     sort_by_appearance=True)
)
county area year pop
0 1001 275 2006 1037
1 1001 275 2007 1052
2 1001 275 2008 1102
3 1003 394 2006 2399
4 1003 394 2007 2424
5 1003 394 2008 2438
6 1005 312 2006 1638
7 1005 312 2007 1647
8 1005 312 2008 1660
For this particular reshape, any part of a column name associated with .value stays as a column header, while the rest is transposed into columns. You can also change the dtype of the transposed columns (this can be efficient, especially for large datasets):
(df
 .pivot_longer(
     index=['county', 'area'],
     names_to=('.value', 'year'),
     names_sep='_',
     names_transform={'year': int},
     sort_by_appearance=True)
)
county area year pop
0 1001 275 2006 1037
1 1001 275 2007 1052
2 1001 275 2008 1102
3 1003 394 2006 2399
4 1003 394 2007 2424
5 1003 394 2008 2438
6 1005 312 2006 1638
7 1005 312 2007 1647
8 1005 312 2008 1660

How to get the body of the table using Python?

I am teaching myself web scraping and I am trying to get the tbody of a table with BeautifulSoup.
My attempt:
import requests
from bs4 import BeautifulSoup

url = 'https://www.agrolok.pl/notowania/notowania-cen-pszenicy.htm'
page = requests.get(url).content
soup = BeautifulSoup(page, 'lxml')
table = soup.findAll('table', class_='hover')
print(table)
That's what I get:
<table class="hover"></table>
Any hints highly appreciated
The table with class 'hover' is populated dynamically with JavaScript, which is why the raw HTML contains no tbody, tr, or td elements and you get an empty table back. You can render the page first and then parse it with pandas/bs4; I use Selenium with pandas.
Script:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.agrolok.pl/notowania/notowania-cen-pszenicy.htm')
driver.maximize_window()
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'lxml')
df = pd.read_html(str(soup))[0]
d = df.rename(columns=df.iloc[0]).drop(df.index[0])
print(d)
Output:
7/4/2022 1410 1380 343.25 4.7002 1613 1640
1 7/1/2022 1410 1300 334.50 4.7176 1578 1630
2 6/30/2022 1410 1320 350.25 4.6806 1639 1650
3 6/29/2022 1500 1380 358.50 4.6809 1678 1710
4 6/28/2022 1450 1360 356.75 4.7004 1677 1690
5 6/27/2022 1450 1360 350.00 4.6965 1644 1690
6 6/24/2022 1450 1360 357.25 4.7094 1682 1700
7 6/23/2022 1450 1360 359.00 4.7096 1691 1690
8 6/22/2022 1470 1410 370.50 4.6590 1726 1750
9 6/21/2022 1500 1370 372.50 4.6460 1731 1730
10 6/20/2022 1540 1460 388.25 4.6731 1814 1780
11 6/15/2022 1560 1460 392.75 4.6642 1832 1780
12 6/14/2022 1560 1460 392.25 4.6548 1826 1780
13 6/13/2022 1540 1460 394.50 4.6313 1827 1800
14 6/10/2022 1530 1450 391.75 4.6030 1803 1760
15 6/9/2022 1540 1500 386.25 4.5826 1770 1730
16 6/8/2022 1550 1520 381.75 4.5817 1749 1730
17 6/7/2022 1500 1540 385.50 4.5855 1768 1700
18 6/6/2022 1600 1510 397.50 4.5880 1824 1760
19 6/3/2022 1560 1490 378.25 4.5908 1736 1700
20 6/2/2022 1590 1490 382.50 4.5876 1755 1710
21 6/1/2022 1590 1490 380.50 4.5891 1746 1700
22 5/31/2022 1650 1560 392.25 4.5756 1795 1750
23 5/30/2022 1670 1590 406.75 4.5869 1866 1800
24 5/27/2022 1670 1580 414.75 4.6102 1912 1700
25 5/26/2022 1650 1580 409.50 4.6135 1889 1700
26 5/25/2022 1670 1600 404.50 4.5955 1859 1700
27 5/24/2022 1690 1630 410.50 4.6107 1893 1800
28 5/23/2022 1700 1600 426.00 4.6171 1966 1860
29 5/20/2022 1700 1630 420.75 4.6366 1951 1840
30 5/19/2022 1700 1640 422.25 4.6429 1960 1850
31 5/18/2022 1700 1640 430.50 4.6528 2003 1850
32 5/17/2022 1690 1640 438.25 4.6558 2040 1850
33 5/16/2022 1690 1640 438.25 4.6724 2048 1880
34 5/13/2022 1670 1560 416.50 4.6679 1944 1800
35 5/12/2022 1670 1540 414.25 4.6841 1940 1790
36 5/11/2022 1670 1540 403.25 4.6700 1883 1790
37 5/10/2022 1680 1560 396.50 4.6761 1854 1780
38 5/9/2022 1670 1560 394.50 4.7059 1856 1780
39 5/6/2022 1600 1580 406.25 4.6979 1909 1760
40 5/5/2022 1660 1610 401.00 4.6658 1871 1780
41 5/4/2022 1660 1630 390.50 4.6777 1827 1735
42 4/29/2022 1660 1630 400.75 4.6582 1867 1720
43 4/28/2022 1670 1640 416.50 4.6915 1954 1740
44 4/27/2022 1670 1630 418.25 4.7076 1969 1720
45 4/26/2022 1660 1640 415.25 4.6429 1928 1685
46 4/25/2022 1665 1630 408.25 4.6405 1894 1670
47 4/22/2022 1665 1650 407.00 4.6361 1887 1690
48 4/21/2022 1660 1650 405.75 4.6523 1888 1690
49 4/20/2022 1660 1660 398.50 4.6295 1845 1700
50 4/19/2022 1680 1660 399.50 4.6361 1852 1740
51 4/15/2022 1690 1660 401.00 4.6378 1860 1770
52 4/14/2022 1690 1660 401.00 4.6447 1863 1770
53 4/13/2022 1680 1630 403.00 4.6460 1872 1780
54 4/12/2022 1650 1620 399.25 4.6626 1862 1700
55 4/11/2022 1630 1590 379.50 4.6451 1763 1670
56 4/8/2022 1650 1610 372.75 4.6405 1730 1660
57 4/7/2022 1650 1610 363.75 4.6478 1691 1670
58 4/6/2022 1650 1600 364.00 4.6539 1694 1670
59 4/5/2022 1650 1620 364.50 4.6317 1688 1680
60 4/4/2022 1640 1610 363.75 4.6373 1687 1680
soup = BeautifulSoup(HTML)
# the first argument to find tells it what tag to search for;
# for the second you can pass a dict of attr->value pairs to filter
# results that match the first tag
table = soup.find("table", {"title": "TheTitle"})
rows = list()
for row in table.findAll("tr"):
    rows.append(row)
# now rows contains each tr in the table (as a BeautifulSoup object)
# and you can search them to pull out the times
tbody = table.find('tbody')  # grabs the tbody directly, if the markup has one
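For contrast, when the table rows are present in the static HTML, BeautifulSoup sees them without any browser in the loop. A minimal sketch with an inline document (the two data rows here are made up to mirror the site's layout):

```python
from bs4 import BeautifulSoup

# a static table: the rows exist in the HTML itself, no JavaScript needed
html = """
<table class="hover">
  <tbody>
    <tr><td>7/4/2022</td><td>1410</td></tr>
    <tr><td>7/1/2022</td><td>1410</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
rows = [[td.get_text() for td in tr.find_all('td')]
        for tr in soup.find('table', class_='hover').find_all('tr')]
```

If the same code against the live page yields an empty list, that is the tell-tale sign the rows are injected client-side and a rendered page source (e.g. from Selenium) is needed.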

On HackerRank a test case gives a runtime error, but on my local machine it gives the correct output

This code gives a runtime error for the test case below on HackerRank, but when I run it on my local machine with the same test case it gives the correct output.
n = int(input())
li = [int(x) for x in input().split()]
se = set()
se1 = set()
for x in range(n):
    for y in range(x, n):
        if tuple(li[x:y+1]) not in se1:
            se.add(len(li[x:y+1]) * min(li[x:y+1]))
            se1.add(tuple(li[x:y+1]))
print(max(se))
Test Case
1000
8505 7916 8460 4200 7096 2044 5971 3746 7934 1043 9114 7722 5284 4324 1603 8290 1460 5855 1989 7322 7962 2004 2885 6643 543 240 6303 999 4967 6429 6653 9824 697 1466 4025 4146 9862 6348 4244 7797 3743 3358 5519 9028 4034 3474 3670 5494 5681 2011 9169 3643 4015 8406 6638 910 8646 2941 8261 3613 5722 4915 3437 6420 6381 3814 6918 2595 162 1162 392 3906 872 2263 9286 4906 2089 9308 6752 7770 1320 5921 1414 1687 679 8052 2598 9325 7346 7211 9290 3068 2126 9080 5840 4859 2894 9110 7455 9409 272 4199 3315 7496 2815 8953 2402 4904 8261 5507 2675 5933 7780 441 7621 8460 4845 6571 4137 2191 3782 3428 1612 2261 8860 7452 7120 1754 6563 927 7515 3187 1479 7182 684 4294 6135 9438 9198 749 1297 8225 6682 9078 8666 4303 3890 3512 874 8027 5703 1009 7807 3667 3270 6667 1120 6742 4774 4035 4022 2289 7222 5501 9472 4258 6147 1959 49 5345 2708 1346 3571 9391 6776 8589 46 666 2101 7273 5046 4157 8282 9205 7824 1552 5873 5296 8294 6999 9331 2316 5640 2906 4169 5112 3516 316 3424 9917 2014 6132 7616 1937 1875 4392 526 1922 1411 2628 9195 2809 3137 3829 2014 7313 5381 7887 2610 27 1238 8293 8696 6879 7551 2865 1991 1068 3182 1767 7337 1548 4252 4953 3485 6127 9346 4011 4401 7109 2991 9948 9918 6128 3777 1932 9794 5510 6172 2404 5538 7410 7049 4234 4289 4601 3451 2633 2021 2985 4400 9358 4533 8652 4312 8018 1132 10 2030 1885 7119 5021 1834 3389 7502 1963 5321 3648 7474 1493 6052 9364 5256 9453 9950 9545 4054 3401 2178 2427 2739 2931 8138 7272 7935 8802 5291 9067 8812 3673 953 2283 5046 9139 5672 8900 1102 7345 2548 4928 5191 4952 4292 447 4406 594 6344 4812 348 8523 7240 3087 7806 1730 359 5741 532 2002 1161 5696 5675 2114 7979 722 7605 3 5974 8707 7348 8523 9988 8891 3475 632 9338 7881 1227 5683 9046 1575 558 2638 4662 8364 4368 1373 457 4900 3376 1618 596 5403 3732 4927 2477 1337 4930 8452 6397 8630 6975 6385 7522 6802 7017 3212 1036 4596 5247 82 6171 5805 2720 7185 521 7088 4911 979 8340 4639 2597 5288 6394 2682 215 8872 371 1497 3676 6768 6479 651 9505 353 3805 2875 3566 4841 7471 
5165 4923 9995 971 3995 3532 7844 7435 4795 5175 2127 9434 7773 7415 5829 6807 3982 1053 7178 5479 4729 299 8311 1732 6156 8664 5537 9031 8582 6731 2855 3748 8006 9202 1071 2002 2734 8915 5789 7530 4091 7917 3316 8216 5332 5497 5023 9315 6550 8553 1146 1279 5204 5809 3011 1361 4474 4901 6744 3056 1632 9599 6804 9638 8801 7875 1640 7888 3143 3782 1770 7234 1699 5086 5450 7031 6936 6825 2698 3486 5378 197 4766 6935 6006 4129 8296 480 5382 5040 9889 7014 992 3045 6653 6145 7273 4645 385 416 8427 2155 4002 6478 3594 9452 3510 530 2629 2560 4016 4359 9109 5134 1294 5116 5616 9590 1948 998 983 8189 4365 8327 1235 7370 824 4860 8367 1210 5276 3147 9717 9278 9625 3311 5082 9487 193 4063 2048 562 8422 7509 5696 9717 2625 7664 5659 926 5015 2994 5467 9380 7673 6702 6750 8498 1562 5117 6060 3190 8264 5777 8820 4242 5441 3902 3729 5634 7965 2129 6196 2740 9639 8245 2457 8616 2261 4468 9542 7276 7463 1362 3008 5136 8064 9758 9986 5979 1228 6046 5521 5844 1824 4342 86 3617 4596 168 9251 2562 2297 1800 5302 1936 45 4111 6905 2306 8579 2799 5935 2394 4161 8943 3883 8578 5054 3869 909 2634 6268 6430 8478 4444 7124 4917 8061 1721 1437 3664 635 3734 5464 2289 2023 5509 6400 8928 4168 1331 1727 103 78 5889 9046 3961 4467 452 4182 5376 3086 6802 8158 1565 1246 5283 2834 5659 7004 4271 9324 7639 4357 4788 6280 2732 298 2680 1660 818 363 9740 921 441 5629 6319 754 6448 6772 1289 8176 9858 8091 6334 7775 5690 7969 6961 1349 1325 7584 673 8964 8294 1814 1596 1026 8464 4276 2687 9282 992 8779 6555 7785 760 9226 4892 7208 5998 6181 1736 2209 624 4422 6336 6314 2392 9650 4016 3717 7234 1041 9034 5528 2855 630 6555 1319 1259 5594 6953 2251 4373 3508 6388 5133 2735 1280 8693 5085 7461 429 3646 4438 1203 6335 752 3595 5985 1120 7313 3219 2162 2699 5100 5017 3329 8007 2689 940 9953 9642 3191 678 9503 5932 2163 8590 7212 856 27 1026 7637 3674 5464 8840 6361 2568 8788 2346 3689 2453 1917 2203 5152 7017 3572 4833 1376 6261 5774 1329 2256 8965 2007 1759 1249 4170 6701 8462 1378 6728 9488 9015 6754 
1304 4208 3115 3872 2996 5461 3913 5449 7379 6116 6953 748 9689 1786 2125 2302 7560 3454 4558 2878 1814 2669 4127 5984 5722 2589 3715 2451 8429 9082 5557 6085 3290 8673 6310 2638 486 223 4439 4217 2692 1392 1318 8733 3179 3443 1035 739 3249 1946 3617 5063 967 4097 7400 6690 3038 1115 9141 1468 197 1050 3905 9840 9723 215 8830 6562 6791 3270 779 9483 4662 8449 4568 4193 1892 5603 1285 1494 3901 1254 2909 4869 5351 309 1559 8390 7776 7052 6210 4326 8102 115 518 4178 6683 9348 740 3474 2618 7871 9309 3633 6321 3877 7826 4565 5832 5463 6059 9734 6718 8969 4603 8421 5630 6162 6811 3407 9566 9373 4085 7668 5841 4603 1846 2524 3951 8938 2350 6570 6810 8011 6555 9483 1888 733 4048 4072 6197 108 3806 9267 5429 8409 7688 1059 923 852 818 489 225 4903 4510 2418 9506 2708 4942 3458 1647 3644 6380 4809 1655 9287 4292 9895 20 8340 3968 2569 4800 7774 1836 6581 2536 5877 7641 9811 6729 4811 6653
I optimized your algorithm: I removed the sets and sped up the minimum computation of the sub-ranges by keeping a rolling minimum. My algorithm's running time is O(N^2).
Your algorithm is O(N^3): the two nested loops are each O(N), and inside them the slicing, the minimum computation, and the set insertion each take O(N) time (adding a tuple of length O(N) to a set requires hashing all of its elements). So your algorithm would probably be faster even without the sets, since adding a tuple to a set costs as much as recomputing the minimum. You also don't need to store results in the se set; you can track the maximal value as you go.
Try it online!
n, l = int(input()), list(map(int, input().split()))
maxv = None
for i in range(n):
    minv, maxv = l[i], l[i] if maxv is None else maxv
    for j in range(i, n):
        minv = min(minv, l[j])
        maxv = max(maxv, (j - i + 1) * minv)
print(maxv)
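The same rolling-minimum idea, wrapped in a function so it can be sanity-checked on a small hand-checkable input (the function name is mine, not HackerRank's):

```python
def max_len_times_min(l):
    # best value of (length * minimum) over all contiguous sub-ranges;
    # the inner loop extends the range rightward, updating the minimum in O(1)
    best = None
    for i in range(len(l)):
        minv = l[i]
        for j in range(i, len(l)):
            minv = min(minv, l[j])
            cand = (j - i + 1) * minv
            if best is None or cand > best:
                best = cand
    return best
```

For [1, 2, 3] the winning sub-range is [2, 3]: length 2 times minimum 2 gives 4.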
Input: the same 1000-element test case shown in the question above.
Output:
85175

Python Pandas Dataframe assignment

I am following a Lynda tutorial where they use the following code:
import pandas as pd
import seaborn
flights = seaborn.load_dataset('flights')
flights_indexed = flights.set_index(['year','month'])
flights_unstacked = flights_indexed.unstack()
flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
and it works perfectly there. However, in my case the code fails: for the last line I keep getting an error.
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category
I know in the video they are using Python 2, however I have Python 3 since I am learning for work (which uses Python 3). Most of the differences I have been able to figure out, however I cannot figure out how to create this new column called 'total' with the sums of the passengers.
The root cause of this error message is the categorical nature of the month column:
In [42]: flights.dtypes
Out[42]:
year int64
month category
passengers int64
dtype: object
In [43]: flights.month.cat.categories
Out[43]: Index(['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'], d
type='object')
and you are trying to add a category total - Pandas doesn't like that.
Workaround:
In [45]: flights['month'] = flights.month.cat.add_categories('total')  # inplace was removed from the .cat accessor in newer pandas
In [46]: x = flights.pivot(index='year', columns='month', values='passengers')
In [47]: x['total'] = x.sum(1)
In [48]: x
Out[48]:
month January February March April May June July August September October November December total
year
1949 112.0 118.0 132.0 129.0 121.0 135.0 148.0 148.0 136.0 119.0 104.0 118.0 1520.0
1950 115.0 126.0 141.0 135.0 125.0 149.0 170.0 170.0 158.0 133.0 114.0 140.0 1676.0
1951 145.0 150.0 178.0 163.0 172.0 178.0 199.0 199.0 184.0 162.0 146.0 166.0 2042.0
1952 171.0 180.0 193.0 181.0 183.0 218.0 230.0 242.0 209.0 191.0 172.0 194.0 2364.0
1953 196.0 196.0 236.0 235.0 229.0 243.0 264.0 272.0 237.0 211.0 180.0 201.0 2700.0
1954 204.0 188.0 235.0 227.0 234.0 264.0 302.0 293.0 259.0 229.0 203.0 229.0 2867.0
1955 242.0 233.0 267.0 269.0 270.0 315.0 364.0 347.0 312.0 274.0 237.0 278.0 3408.0
1956 284.0 277.0 317.0 313.0 318.0 374.0 413.0 405.0 355.0 306.0 271.0 306.0 3939.0
1957 315.0 301.0 356.0 348.0 355.0 422.0 465.0 467.0 404.0 347.0 305.0 336.0 4421.0
1958 340.0 318.0 362.0 348.0 363.0 435.0 491.0 505.0 404.0 359.0 310.0 337.0 4572.0
1959 360.0 342.0 406.0 396.0 420.0 472.0 548.0 559.0 463.0 407.0 362.0 405.0 5140.0
1960 417.0 391.0 419.0 461.0 472.0 535.0 622.0 606.0 508.0 461.0 390.0 432.0 5714.0
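Note that newer pandas versions removed inplace=True from the .cat accessor, so the add_categories step becomes a plain assignment. A minimal sketch on a toy categorical (the month values here are illustrative):

```python
import pandas as pd

# 'month' as a categorical column, like in the flights dataset
month = pd.Series(pd.Categorical(['January', 'February']))

# add_categories returns a new Series; assign it back instead of inplace=True
month = month.cat.add_categories('total')
```

After this, a 'total' label can be used wherever that categorical appears without tripping the CategoricalIndex error.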
UPDATE: alternatively if you don't want to touch the original DF you can get rid of categorical columns in the flights_unstacked DF:
In [76]: flights_unstacked.columns = \
...: flights_unstacked.columns \
...: .set_levels(flights_unstacked.columns.get_level_values(1).categories,
...: level=1)
...:
In [77]: flights_unstacked['passengers','total'] = flights_unstacked.sum(axis=1)
In [78]: flights_unstacked
Out[78]:
passengers
month January February March April May June July August September October November December total
year
1949 112 118 132 129 121 135 148 148 136 119 104 118 1520
1950 115 126 141 135 125 149 170 170 158 133 114 140 1676
1951 145 150 178 163 172 178 199 199 184 162 146 166 2042
1952 171 180 193 181 183 218 230 242 209 191 172 194 2364
1953 196 196 236 235 229 243 264 272 237 211 180 201 2700
1954 204 188 235 227 234 264 302 293 259 229 203 229 2867
1955 242 233 267 269 270 315 364 347 312 274 237 278 3408
1956 284 277 317 313 318 374 413 405 355 306 271 306 3939
1957 315 301 356 348 355 422 465 467 404 347 305 336 4421
1958 340 318 362 348 363 435 491 505 404 359 310 337 4572
1959 360 342 406 396 420 472 548 559 463 407 362 405 5140
1960 417 391 419 461 472 535 622 606 508 461 390 432 5714

Formatted input in Python

I have a file that has the following:
A B C D
1 2 3 4 5
2 2 4
3 1 3 4
Note that 4 on line 2 is followed immediately by the new line.
I want to make a dictionary that looks like this
d['A']['1'] = 2, d['B']['1'] = 3, ..., d['D']['1'] = 5, d['B']['2'] = 2, etc.
The blanks should not appear in the dictionary.
What's the best way to do this in python?
The data will all be single digits right? So it lines up with the column headers? In that case, you can do this:
it = iter(datafile)
cols = list(next(it)[2::2])
d = {}
for row in it:
    for col, val in zip(cols, row[2::2]):
        if val != ' ':
            d.setdefault(col, {})[row[0]] = int(val)
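To sanity-check that snippet, the sample rows can be reconstructed as fixed-width strings (the exact spacing of the blank cells is my assumption about the original file, chosen to match the expected dictionary in the question):

```python
def parse_grid(lines):
    # header: column letters start at index 2 and repeat every 2 characters;
    # the row label is the character at index 0
    it = iter(lines)
    cols = list(next(it)[2::2])
    d = {}
    for row in it:
        for col, val in zip(cols, row[2::2]):
            if val != ' ':                       # skip blank cells
                d.setdefault(col, {})[row[0]] = int(val)
    return d

lines = [
    "  A B C D",
    "1 2 3 4 5",
    "2   2 4",     # blank under A; the last digit is followed by the newline
    "3 1 3 4",
]
grid = parse_grid(lines)
```

Short rows simply stop contributing cells because zip ends with the shorter sequence, so trailing blanks need no special handling.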
Based on the author's data and code that was recently added, the above code clearly isn't enough. If the document is always formatted as 31 rows of rise/set pairs for 12 months, in groups of 6, we can handle it in many ways. This is what I wrote; it's not the most elegant, and probably not as efficient as it could be, but it gets the job done. This is one of the reasons why you index by row first, then column.
def process(data):
    import re
    hre = re.compile(r' +([A-Z]+)'*6)
    sre = re.compile(r' +([a-z]+) ([a-z]+)'*6)
    dre = re.compile(r'(\d{1,2}) ' + r'(.{4}) (.{4}) {,4}'*6)
    it = iter(data)
    headers = None
    result = {}
    for line in it:
        if not line: continue
        if not headers:
            # find the first header
            hmatch = hre.match(line)
            if hmatch:
                subs = iter(sre.match(next(it)).groups())
                headers = [h + next(subs)
                           for h in hmatch.groups()
                           for _ in range(2)]
                count = 0
        else:
            # fill in the data
            dmatch = dre.match(line)
            row = dmatch.group(1)
            for col, d in zip(headers, dmatch.groups()[1:]):
                if d.strip():
                    result.setdefault(col, {})[row] = int(d)
            count += 1
            if count == 31:
                headers = None
    return result
data = """
TIMES OF SUNRISE AND SUNSET (for ideal horizon & meteorological conditions)
For the year 2012
Make corrections for daylight saving time where necessary.
------------------------------------------------------------------------------
JAN FEB MAR APR MAY JUN
rise set rise set rise set rise set rise set rise set
1 0513 1925 0541 1918 0606 1851 0628 1812 0648 1738 0708 1720
2 0514 1925 0541 1918 0606 1850 0628 1811 0649 1737 0709 1719
3 0515 1925 0542 1917 0607 1849 0629 1810 0649 1736 0709 1719
4 0515 1926 0543 1916 0608 1847 0630 1808 0650 1736 0710 1719
5 0516 1926 0544 1915 0609 1846 0630 1807 0651 1735 0710 1719
6 0517 1926 0545 1915 0609 1845 0631 1806 0651 1734 0711 1719
7 0518 1926 0546 1914 0610 1844 0632 1805 0652 1733 0711 1719
8 0519 1926 0547 1913 0611 1843 0632 1803 0653 1732 0712 1719
9 0519 1926 0548 1912 0612 1841 0633 1802 0653 1731 0712 1718
10 0520 1926 0549 1911 0612 1840 0634 1801 0654 1731 0712 1718
11 0521 1926 0550 1911 0613 1839 0634 1800 0655 1730 0713 1718
12 0522 1926 0551 1910 0614 1838 0635 1759 0655 1729 0713 1718
13 0523 1926 0551 1909 0615 1836 0636 1757 0656 1729 0714 1719
14 0524 1926 0552 1908 0615 1835 0636 1756 0657 1728 0714 1719
15 0525 1925 0553 1907 0616 1834 0637 1755 0657 1727 0714 1719
16 0526 1925 0554 1906 0617 1832 0638 1754 0658 1727 0715 1719
17 0527 1925 0555 1905 0617 1831 0638 1753 0659 1726 0715 1719
18 0527 1925 0556 1904 0618 1830 0639 1752 0659 1725 0715 1719
19 0528 1924 0557 1903 0619 1829 0640 1751 0700 1725 0716 1719
20 0529 1924 0558 1902 0619 1827 0640 1749 0701 1724 0716 1719
21 0530 1924 0558 1901 0620 1826 0641 1748 0701 1724 0716 1720
22 0531 1923 0559 1900 0621 1825 0642 1747 0702 1723 0716 1720
23 0532 1923 0600 1859 0621 1824 0642 1746 0703 1723 0716 1720
24 0533 1923 0601 1858 0622 1822 0643 1745 0703 1722 0717 1720
25 0534 1922 0602 1857 0623 1821 0644 1744 0704 1722 0717 1721
26 0535 1922 0602 1855 0624 1820 0644 1743 0705 1722 0717 1721
27 0536 1921 0603 1854 0624 1818 0645 1742 0705 1721 0717 1721
28 0537 1921 0604 1853 0625 1817 0646 1741 0706 1721 0717 1722
29 0538 1920 0605 1852 0626 1816 0646 1740 0706 1720 0717 1722
30 0539 1920 0626 1815 0647 1739 0707 1720 0717 1722
31 0540 1919 0627 1813 0707 1720
JUL AUG SEP OCT NOV DEC
rise set rise set rise set rise set rise set rise set
1 0717 1723 0705 1740 0632 1759 0553 1818 0518 1841 0503 1907
2 0717 1723 0704 1741 0631 1800 0552 1819 0517 1842 0503 1908
3 0717 1724 0703 1741 0630 1801 0551 1819 0517 1843 0503 1909
4 0717 1724 0702 1742 0629 1801 0550 1820 0516 1843 0503 1910
5 0717 1724 0701 1743 0627 1802 0548 1821 0515 1844 0503 1911
6 0717 1725 0700 1743 0626 1802 0547 1821 0514 1845 0503 1911
7 0716 1725 0700 1744 0625 1803 0546 1822 0513 1846 0503 1912
8 0716 1726 0659 1745 0624 1804 0545 1823 0513 1847 0503 1913
9 0716 1726 0658 1745 0622 1804 0543 1823 0512 1848 0503 1914
10 0716 1727 0657 1746 0621 1805 0542 1824 0511 1849 0503 1914
11 0716 1727 0656 1746 0620 1805 0541 1825 0511 1850 0503 1915
12 0715 1728 0655 1747 0618 1806 0540 1825 0510 1850 0504 1916
13 0715 1729 0654 1748 0617 1807 0538 1826 0509 1851 0504 1916
14 0715 1729 0653 1748 0616 1807 0537 1827 0509 1852 0504 1917
15 0714 1730 0652 1749 0614 1808 0536 1827 0508 1853 0505 1918
16 0714 1730 0651 1750 0613 1809 0535 1828 0508 1854 0505 1918
17 0713 1731 0650 1750 0612 1809 0534 1829 0507 1855 0505 1919
18 0713 1731 0649 1751 0610 1810 0533 1830 0507 1856 0506 1920
19 0713 1732 0648 1751 0609 1810 0531 1830 0506 1857 0506 1920
20 0712 1733 0647 1752 0608 1811 0530 1831 0506 1858 0507 1921
21 0712 1733 0645 1753 0607 1812 0529 1832 0505 1859 0507 1921
22 0711 1734 0644 1753 0605 1812 0528 1833 0505 1859 0508 1922
23 0711 1734 0643 1754 0604 1813 0527 1834 0505 1900 0508 1922
24 0710 1735 0642 1755 0603 1813 0526 1834 0504 1901 0509 1923
25 0709 1736 0641 1755 0601 1814 0525 1835 0504 1902 0509 1923
26 0709 1736 0640 1756 0600 1815 0524 1836 0504 1903 0510 1923
27 0708 1737 0638 1756 0559 1815 0523 1837 0503 1904 0510 1924
28 0707 1738 0637 1757 0557 1816 0522 1838 0503 1905 0511 1924
29 0707 1738 0636 1758 0556 1817 0521 1838 0503 1906 0512 1924
30 0706 1739 0635 1758 0555 1817 0520 1839 0503 1906 0512 1925
31 0705 1739 0634 1759 0519 1840 0513 1925
""".split('\n')
>>> d = process(data)
>>> d['DECrise']['8']
503
>>> d
{'AUGset': {'24': 1755, '25': 1755, '26': 1756, '27': 1756, '20': 1752...
For fun and interest, I came up with a totally different answer:
import datetime
import math
import ephem # PyEphem module
class SunTimes(object):
    """Helper class for finding sun rise/set times

    @param date: observation date, one of
        string, "yyyy[/mm[/dd[ hh[:mm[:ss]]]]]"
            (unspecified pieces are assumed to be 0)
        datetime.date
    @param lat: latitude, one of
        string, "d[:mm[:ss]]", angle measured in degrees, minutes, seconds
            (unspecified pieces are assumed to be 0)
        ephem.Angle
        numeric angle in degrees
    @param lon: longitude, same types as lat
    @param fromCity: string, city name
        If specified, overrides lat and lon
        If the city is not recognized, raises KeyError
    """
    def __init__(self, *args, **kwargs):
        super(SunTimes, self).__init__()
        self.sun = ephem.Sun()
        self.date = ephem.Date(0)
        self._date = 0
        self.viewer = ephem.Observer()
        self._lat = ''
        self._lon = ''
        self._city = None
        self.dirty = True  # lazy updates
        self._clean(*args, **kwargs)

    def _clean(self, date=None, lat=None, lon=None, fromCity=None):
        if date is not None and date != self._date:
            self.date = ephem.Date(date)
            self._date = date
            self.dirty = True
        if lat is not None and lat != self._lat:
            self.viewer.lat = self.getAngle(lat)
            self._lat = lat
            self.viewer.name = None
            self._city = None
            self.dirty = True
        if lon is not None and lon != self._lon:
            self.viewer.long = self.getAngle(lon)
            self._lon = lon
            self.viewer.name = None
            self._city = None
            self.dirty = True
        if fromCity is not None and fromCity != self._city:
            self.viewer = ephem.city(fromCity)
            self._city = fromCity
            self._lat = self.viewer.lat
            self._lon = self.viewer.long
            self.dirty = True
        if self.dirty:
            self.viewer.date = self.date
            self.sun.compute(self.viewer)
            self.dirty = False

    def getAngle(self, value):
        if isinstance(value, ephem.Angle):
            return value
        elif isinstance(value, str):
            return ephem.degrees(value)
        else:
            return ephem.degrees(math.radians(value))

    def sunrise(self, *args, **kwargs):
        self._clean(*args, **kwargs)
        return self.sun.rise_time.datetime()

    def sunset(self, *args, **kwargs):
        self._clean(*args, **kwargs)
        return self.sun.set_time.datetime()
The tables given match the local times in Perth, Australia very nicely:
>>> sun = SunTimes(lat='-31.9273', lon='115.87925')  # Perth
>>> print sun.sunrise(date='2012/1/1')
2012-01-01 05:15:42.835679
>>> print sun.sunset()
2012-01-01 19:24:23.083130
The times are not exactly identical to the tabled values, but they are very close.
Read each line into a string, parse it into a list with some empty elements, like
(2,,2,4,)
and then convert that list into your dictionary entries. Before parsing, you might want to read about the methods in the string module.
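A minimal sketch of that idea in Python 3, assuming cells are separated by single spaces so that a blank cell shows up as an empty string after `split(' ')` (the header offset and sample row below are illustrative, not from the real table):

```python
# Split a fixed-layout line on single spaces; blank cells become
# empty strings, which we skip when building the dict.
def line_to_cells(line):
    return line.split(' ')

headers = line_to_cells('  A B C D')[2:]   # skip the two leading blanks
row = line_to_cells('2  2 4 ')             # ['2', '', '2', '4', '']
label, cells = row[0], row[1:]

d = {h: {} for h in headers}
for h, cell in zip(headers, cells):
    if cell.isdigit():                      # skip the blanks
        d[h][label] = int(cell)
```

Here only `B` and `C` receive values for row `2`; `A` and `D` stay empty because their cells were blank.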
Looks like a homework problem regarding sparse matrices.
This is a possible solution, assuming text is the content of the input file:
lines = text.rstrip().split('\n')
# pad lines with spaces so that all of them have the same length
lines = [line + ' ' * (max(map(len, lines)) - len(line)) for line in lines]
rows = tuple(line.split(' ') for line in lines)
columns = tuple(zip(*rows))  # transpose the rows matrix
table = dict()
for i, column in enumerate(columns):
    if i > 0:  # skip first column
        table[rows[0][i]] = dict()
        for j, cell in enumerate(column):
            if j > 0 and cell.isdigit():  # skip header and blanks
                table[rows[0][i]][columns[0][j]] = int(cell)
print table  # prints the resulting dict
This assumes that all data items are separated by a single whitespace and that 'blank' items consist of a single whitespace.
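For reference, here is a Python 3 run of the same idea on the small example from the question, under the same assumption (single-space separators, blanks as single spaces); the inline `text` literal is a reconstruction of that example, not the real table:

```python
text = (" A B C D\n"
        "1 2 3 4 5\n"
        "2  2 4\n"
        "3 1  3 4")

lines = text.split('\n')
width = max(map(len, lines))
lines = [line.ljust(width) for line in lines]   # pad lines to equal length
rows = [line.split(' ') for line in lines]
columns = list(zip(*rows))                      # transpose; zip trims ragged tails
table = {}
for i, column in enumerate(columns):
    if i > 0:                                   # skip the row-label column
        table[column[0]] = {}
        for j, cell in enumerate(column):
            if j > 0 and cell.isdigit():        # skip the header row and blanks
                table[column[0]][columns[0][j]] = int(cell)
```

This reproduces the expected result: blank cells (A and D in row 2, B in row 3) simply never appear in the dict.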
from collections import defaultdict
import re
testData = """
A B C D
1 2 3 4 5
2 2 4
3 1 3 4
"""
def strToArray(s):
    item = re.compile(r'(\s|\S+)(?:\s|$)')
    return [item.findall(ln) for ln in s.split('\n') if len(ln)]

def arrayToDict(array):
    res = defaultdict(dict)
    xIds = array.pop(0)[1:]
    for row in array:
        yId = row.pop(0)
        for xId, item in zip(xIds, row):
            if item.strip():
                res[xId][yId] = int(item)
    return res

def main():
    data = arrayToDict(strToArray(testData))
    print data

if __name__ == "__main__":
    main()
which results in
{'A': {'1': 2, '3': 1}, 'C': {'1': 4, '3': 3, '2': 4}, 'B': {'1': 3, '2': 2}, 'D': {'1': 5, '3': 4}}
Here's a solution using your later data:
data = """
TIMES OF SUNRISE AND SUNSET (for ideal horizon & meteorological conditions)
For the year 2012
Make corrections for daylight saving time where necessary.
------------------------------------------------------------------------------
JAN FEB MAR APR MAY JUN
rise set rise set rise set rise set rise set rise set
1 0513 1925 0541 1918 0606 1851 0628 1812 0648 1738 0708 1720
2 0514 1925 0541 1918 0606 1850 0628 1811 0649 1737 0709 1719
3 0515 1925 0542 1917 0607 1849 0629 1810 0649 1736 0709 1719
4 0515 1926 0543 1916 0608 1847 0630 1808 0650 1736 0710 1719
5 0516 1926 0544 1915 0609 1846 0630 1807 0651 1735 0710 1719
6 0517 1926 0545 1915 0609 1845 0631 1806 0651 1734 0711 1719
7 0518 1926 0546 1914 0610 1844 0632 1805 0652 1733 0711 1719
8 0519 1926 0547 1913 0611 1843 0632 1803 0653 1732 0712 1719
9 0519 1926 0548 1912 0612 1841 0633 1802 0653 1731 0712 1718
10 0520 1926 0549 1911 0612 1840 0634 1801 0654 1731 0712 1718
11 0521 1926 0550 1911 0613 1839 0634 1800 0655 1730 0713 1718
12 0522 1926 0551 1910 0614 1838 0635 1759 0655 1729 0713 1718
13 0523 1926 0551 1909 0615 1836 0636 1757 0656 1729 0714 1719
14 0524 1926 0552 1908 0615 1835 0636 1756 0657 1728 0714 1719
15 0525 1925 0553 1907 0616 1834 0637 1755 0657 1727 0714 1719
16 0526 1925 0554 1906 0617 1832 0638 1754 0658 1727 0715 1719
17 0527 1925 0555 1905 0617 1831 0638 1753 0659 1726 0715 1719
18 0527 1925 0556 1904 0618 1830 0639 1752 0659 1725 0715 1719
19 0528 1924 0557 1903 0619 1829 0640 1751 0700 1725 0716 1719
20 0529 1924 0558 1902 0619 1827 0640 1749 0701 1724 0716 1719
21 0530 1924 0558 1901 0620 1826 0641 1748 0701 1724 0716 1720
22 0531 1923 0559 1900 0621 1825 0642 1747 0702 1723 0716 1720
23 0532 1923 0600 1859 0621 1824 0642 1746 0703 1723 0716 1720
24 0533 1923 0601 1858 0622 1822 0643 1745 0703 1722 0717 1720
25 0534 1922 0602 1857 0623 1821 0644 1744 0704 1722 0717 1721
26 0535 1922 0602 1855 0624 1820 0644 1743 0705 1722 0717 1721
27 0536 1921 0603 1854 0624 1818 0645 1742 0705 1721 0717 1721
28 0537 1921 0604 1853 0625 1817 0646 1741 0706 1721 0717 1722
29 0538 1920 0605 1852 0626 1816 0646 1740 0706 1720 0717 1722
30 0539 1920 0626 1815 0647 1739 0707 1720 0717 1722
31 0540 1919 0627 1813 0707 1720
JUL AUG SEP OCT NOV DEC
rise set rise set rise set rise set rise set rise set
1 0717 1723 0705 1740 0632 1759 0553 1818 0518 1841 0503 1907
2 0717 1723 0704 1741 0631 1800 0552 1819 0517 1842 0503 1908
3 0717 1724 0703 1741 0630 1801 0551 1819 0517 1843 0503 1909
4 0717 1724 0702 1742 0629 1801 0550 1820 0516 1843 0503 1910
5 0717 1724 0701 1743 0627 1802 0548 1821 0515 1844 0503 1911
6 0717 1725 0700 1743 0626 1802 0547 1821 0514 1845 0503 1911
7 0716 1725 0700 1744 0625 1803 0546 1822 0513 1846 0503 1912
8 0716 1726 0659 1745 0624 1804 0545 1823 0513 1847 0503 1913
9 0716 1726 0658 1745 0622 1804 0543 1823 0512 1848 0503 1914
10 0716 1727 0657 1746 0621 1805 0542 1824 0511 1849 0503 1914
11 0716 1727 0656 1746 0620 1805 0541 1825 0511 1850 0503 1915
12 0715 1728 0655 1747 0618 1806 0540 1825 0510 1850 0504 1916
13 0715 1729 0654 1748 0617 1807 0538 1826 0509 1851 0504 1916
14 0715 1729 0653 1748 0616 1807 0537 1827 0509 1852 0504 1917
15 0714 1730 0652 1749 0614 1808 0536 1827 0508 1853 0505 1918
16 0714 1730 0651 1750 0613 1809 0535 1828 0508 1854 0505 1918
17 0713 1731 0650 1750 0612 1809 0534 1829 0507 1855 0505 1919
18 0713 1731 0649 1751 0610 1810 0533 1830 0507 1856 0506 1920
19 0713 1732 0648 1751 0609 1810 0531 1830 0506 1857 0506 1920
20 0712 1733 0647 1752 0608 1811 0530 1831 0506 1858 0507 1921
21 0712 1733 0645 1753 0607 1812 0529 1832 0505 1859 0507 1921
22 0711 1734 0644 1753 0605 1812 0528 1833 0505 1859 0508 1922
23 0711 1734 0643 1754 0604 1813 0527 1834 0505 1900 0508 1922
24 0710 1735 0642 1755 0603 1813 0526 1834 0504 1901 0509 1923
25 0709 1736 0641 1755 0601 1814 0525 1835 0504 1902 0509 1923
26 0709 1736 0640 1756 0600 1815 0524 1836 0504 1903 0510 1923
27 0708 1737 0638 1756 0559 1815 0523 1837 0503 1904 0510 1924
28 0707 1738 0637 1757 0557 1816 0522 1838 0503 1905 0511 1924
29 0707 1738 0636 1758 0556 1817 0521 1838 0503 1906 0512 1924
30 0706 1739 0635 1758 0555 1817 0520 1839 0503 1906 0512 1925
31 0705 1739 0634 1759 0519 1840 0513 1925
"""
import re
import itertools
parsed = re.findall(r'''(?xm) # verbose, multiline
^ # start of line
(\d+) # the date
\s{2} # 2 spaces
(?:(\d+)\s(\d+)\s{4}|\s{13}) # rise/set time or 13 spaces
(?:(\d+)\s(\d+)\s{4}|\s{13}) # rise/set time or 13 spaces
(?:(\d+)\s(\d+)\s{4}|\s{13}) # rise/set time or 13 spaces
(?:(\d+)\s(\d+)\s{4}|\s{13}) # rise/set time or 13 spaces
(?:(\d+)\s(\d+)\s{4}|\s{13}) # rise/set time or 13 spaces
(?:(\d+)\s(\d+)|\s{9})? # rise/set time or 9 spaces (optional)
$ # end of line
''',data)
# Transpose, throw out the date column and create an iterator
# that will walk the original table column by column.
parsed = zip(*parsed)[1:]
data_gen = itertools.chain(*parsed)

sun = {}
# Date changes fastest, followed by 6-month step, then rise/set, then first 6 months.
for m in range(1, 7):
    for t in ['rise', 'set']:
        for s in [0, 6]:
            for d in range(1, 32):
                data = next(data_gen)
                if data:  # handle blanks
                    sun[m + s, d, t] = data

if __name__ == '__main__':
    print sun[11, 24, 'rise']
I propose using the csv module.
The presence of blanks poses some hard problems, so I created a file containing
A B C D
1 2 3 4 5
2 8 2 4 10
3 1 88 3 4
and I wrote this code, which roughly processes the content as you wish:
f = open('gogo.txt', 'rb')
print f.read()
f.seek(0, 0)

import csv
dodo = csv.reader(f, delimiter=' ')

headers = dodo.next()[-4:]
print 'headers==', headers
print

d = {}
for k in headers:
    d[k] = {}
print d
print

for row in dodo:
    print row[0], row[1:]
    z = zip(headers, row[1:])
    print "z==", z
    for x, y in zip(headers, row[-4:]):
        print x, y
        d[x][row[0]] = y
    print d
    print '-----------------------------------'

print d
Result
A B C D
1 2 3 4 5
2 8 2 4 10
3 1 88 3 4
headers== ['A', 'B', 'C', 'D']
{'A': {}, 'C': {}, 'B': {}, 'D': {}}
1 ['2', '3', '4', '5']
z== [('A', '2'), ('B', '3'), ('C', '4'), ('D', '5')]
A 2
B 3
C 4
D 5
{'A': {'1': '2'}, 'C': {'1': '4'}, 'B': {'1': '3'}, 'D': {'1': '5'}}
-----------------------------------
2 ['8', '2', '4', '10']
z== [('A', '8'), ('B', '2'), ('C', '4'), ('D', '10')]
A 8
B 2
C 4
D 10
{'A': {'1': '2', '2': '8'}, 'C': {'1': '4', '2': '4'}, 'B': {'1': '3', '2': '2'}, 'D': {'1': '5', '2': '10'}}
-----------------------------------
3 ['1', '88', '3', '4']
z== [('A', '1'), ('B', '88'), ('C', '3'), ('D', '4')]
A 1
B 88
C 3
D 4
{'A': {'1': '2', '3': '1', '2': '8'}, 'C': {'1': '4', '3': '3', '2': '4'}, 'B': {'1': '3', '3': '88', '2': '2'}, 'D': {'1': '5', '3': '4', '2': '10'}}
-----------------------------------
{'A': {'1': '2', '3': '1', '2': '8'}, 'C': {'1': '4', '3': '3', '2': '4'}, 'B': {'1': '3', '3': '88', '2': '2'}, 'D': {'1': '5', '3': '4', '2': '10'}}
Since it has been said that it is a homework, it will be a good thing that you'll have to search how to improve this code to make it able to process lines containing blanks.
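A Python 3 sketch of the same csv idea, extended to handle the blank cells (the part left as an exercise above). With `delimiter=' '`, consecutive spaces yield empty fields, so blanks survive as `''` and can be skipped explicitly; the inline data mimics the question's small example:

```python
import csv
import io

# Blank cells become '' between consecutive space delimiters.
f = io.StringIO(" A B C D\n"
                "1 2 3 4 5\n"
                "2  2 4\n"
                "3 1  3 4\n")
reader = csv.reader(f, delimiter=' ')
headers = next(reader)[1:]          # Python 3: next(reader), not reader.next()
d = {h: {} for h in headers}
for row in reader:
    for h, cell in zip(headers, row[1:]):
        if cell:                    # skip the blanks
            d[h][row[0]] = int(cell)
```

This yields the same dict as the other answers, with blank cells simply absent.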
Sloppy code, but I believe this does what you asked for:
jcomeau@intrepid:/tmp$ cat test.dat test.py; ./test.py
A B C D
1 2 3 4 5
2 2 4
3 1 3 4
#!/usr/bin/python
import re
input = open('test.dat')
data = input.readlines()
input.close()
pattern = r'(\S+|\s?)\s'
parsed = [map(str.strip, re.compile(pattern).findall(line)) for line in data]
columns = parsed.pop(0)[1:]
rows = [r.pop(0) for r in parsed]
d = {}
for c in columns:
    if c not in d:
        d[c] = {}
    for r in rows:
        try:
            d[c][r] = int(parsed[rows.index(r)][columns.index(c)])
        except (ValueError, IndexError):
            pass
print d, d['A']['1'], d['B']['1'], d['D']['1'], d['B']['2']
{'A': {'1': 2, '3': 1}, 'C': {'1': 4, '3': 3, '2': 4}, 'B': {'1': 3, '2': 2}, 'D': {'1': 5, '3': 4}} 2 3 5 2
Hugh Bothwell and jcomeau_ictx: interesting ideas, but not general enough; they don't work with the real data, although I'm sure a regexp could be adapted to make them work.
scoffey, thanks. I've used your idea of padding the lines to the same length.
eyquem, you must be dreaming. I never said anything about homework.
Below is my code with the real data now.
l = """
TIMES OF SUNRISE AND SUNSET (for ideal horizon & meteorological conditions)
For the year 2012
Make corrections for daylight saving time where necessary.
------------------------------------------------------------------------------
JAN FEB MAR APR MAY JUN
rise set rise set rise set rise set rise set rise set
1 0513 1925 0541 1918 0606 1851 0628 1812 0648 1738 0708 1720
2 0514 1925 0541 1918 0606 1850 0628 1811 0649 1737 0709 1719
3 0515 1925 0542 1917 0607 1849 0629 1810 0649 1736 0709 1719
4 0515 1926 0543 1916 0608 1847 0630 1808 0650 1736 0710 1719
5 0516 1926 0544 1915 0609 1846 0630 1807 0651 1735 0710 1719
6 0517 1926 0545 1915 0609 1845 0631 1806 0651 1734 0711 1719
7 0518 1926 0546 1914 0610 1844 0632 1805 0652 1733 0711 1719
8 0519 1926 0547 1913 0611 1843 0632 1803 0653 1732 0712 1719
9 0519 1926 0548 1912 0612 1841 0633 1802 0653 1731 0712 1718
10 0520 1926 0549 1911 0612 1840 0634 1801 0654 1731 0712 1718
11 0521 1926 0550 1911 0613 1839 0634 1800 0655 1730 0713 1718
12 0522 1926 0551 1910 0614 1838 0635 1759 0655 1729 0713 1718
13 0523 1926 0551 1909 0615 1836 0636 1757 0656 1729 0714 1719
14 0524 1926 0552 1908 0615 1835 0636 1756 0657 1728 0714 1719
15 0525 1925 0553 1907 0616 1834 0637 1755 0657 1727 0714 1719
16 0526 1925 0554 1906 0617 1832 0638 1754 0658 1727 0715 1719
17 0527 1925 0555 1905 0617 1831 0638 1753 0659 1726 0715 1719
18 0527 1925 0556 1904 0618 1830 0639 1752 0659 1725 0715 1719
19 0528 1924 0557 1903 0619 1829 0640 1751 0700 1725 0716 1719
20 0529 1924 0558 1902 0619 1827 0640 1749 0701 1724 0716 1719
21 0530 1924 0558 1901 0620 1826 0641 1748 0701 1724 0716 1720
22 0531 1923 0559 1900 0621 1825 0642 1747 0702 1723 0716 1720
23 0532 1923 0600 1859 0621 1824 0642 1746 0703 1723 0716 1720
24 0533 1923 0601 1858 0622 1822 0643 1745 0703 1722 0717 1720
25 0534 1922 0602 1857 0623 1821 0644 1744 0704 1722 0717 1721
26 0535 1922 0602 1855 0624 1820 0644 1743 0705 1722 0717 1721
27 0536 1921 0603 1854 0624 1818 0645 1742 0705 1721 0717 1721
28 0537 1921 0604 1853 0625 1817 0646 1741 0706 1721 0717 1722
29 0538 1920 0605 1852 0626 1816 0646 1740 0706 1720 0717 1722
30 0539 1920 0626 1815 0647 1739 0707 1720 0717 1722
31 0540 1919 0627 1813 0707 1720
JUL AUG SEP OCT NOV DEC
rise set rise set rise set rise set rise set rise set
1 0717 1723 0705 1740 0632 1759 0553 1818 0518 1841 0503 1907
2 0717 1723 0704 1741 0631 1800 0552 1819 0517 1842 0503 1908
3 0717 1724 0703 1741 0630 1801 0551 1819 0517 1843 0503 1909
4 0717 1724 0702 1742 0629 1801 0550 1820 0516 1843 0503 1910
5 0717 1724 0701 1743 0627 1802 0548 1821 0515 1844 0503 1911
6 0717 1725 0700 1743 0626 1802 0547 1821 0514 1845 0503 1911
7 0716 1725 0700 1744 0625 1803 0546 1822 0513 1846 0503 1912
8 0716 1726 0659 1745 0624 1804 0545 1823 0513 1847 0503 1913
9 0716 1726 0658 1745 0622 1804 0543 1823 0512 1848 0503 1914
10 0716 1727 0657 1746 0621 1805 0542 1824 0511 1849 0503 1914
11 0716 1727 0656 1746 0620 1805 0541 1825 0511 1850 0503 1915
12 0715 1728 0655 1747 0618 1806 0540 1825 0510 1850 0504 1916
13 0715 1729 0654 1748 0617 1807 0538 1826 0509 1851 0504 1916
14 0715 1729 0653 1748 0616 1807 0537 1827 0509 1852 0504 1917
15 0714 1730 0652 1749 0614 1808 0536 1827 0508 1853 0505 1918
16 0714 1730 0651 1750 0613 1809 0535 1828 0508 1854 0505 1918
17 0713 1731 0650 1750 0612 1809 0534 1829 0507 1855 0505 1919
18 0713 1731 0649 1751 0610 1810 0533 1830 0507 1856 0506 1920
19 0713 1732 0648 1751 0609 1810 0531 1830 0506 1857 0506 1920
20 0712 1733 0647 1752 0608 1811 0530 1831 0506 1858 0507 1921
21 0712 1733 0645 1753 0607 1812 0529 1832 0505 1859 0507 1921
22 0711 1734 0644 1753 0605 1812 0528 1833 0505 1859 0508 1922
23 0711 1734 0643 1754 0604 1813 0527 1834 0505 1900 0508 1922
24 0710 1735 0642 1755 0603 1813 0526 1834 0504 1901 0509 1923
25 0709 1736 0641 1755 0601 1814 0525 1835 0504 1902 0509 1923
26 0709 1736 0640 1756 0600 1815 0524 1836 0504 1903 0510 1923
27 0708 1737 0638 1756 0559 1815 0523 1837 0503 1904 0510 1924
28 0707 1738 0637 1757 0557 1816 0522 1838 0503 1905 0511 1924
29 0707 1738 0636 1758 0556 1817 0521 1838 0503 1906 0512 1924
30 0706 1739 0635 1758 0555 1817 0520 1839 0503 1906 0512 1925
31 0705 1739 0634 1759 0519 1840 0513 1925
"""
l = l.split('\n')
l = filter(None, l)
f = map(lambda _: str.ljust(_[4:], 78), l[7:38])
s = map(lambda _: str.ljust(_[4:], 78), l[40:71])
l = map(lambda _: ''.join(_), zip(f, s))
a = []
r = [13 * i for i in xrange(13)]
for line in l:
    d = [line[r[i]:r[i + 1]] for i in xrange(12)]
    a.append(d)

import numpy
a = numpy.transpose(a).tolist()

sun = {}
for m in xrange(12):
    a[m] = filter(lambda _: not _.isspace(), a[m])
    for d in xrange(len(a[m])):
        date = "%4d-%02d-%d" % (2012, m + 1, d + 1)
        sun[date] = {}
        sun[date]['rise'], sun[date]['set'] = a[m][d].split()
print sun
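The fixed-width slicing step above can be sketched in isolation (Python 3, toy data): each month column is 13 characters wide, so slicing at multiples of 13 recovers one "rise set" pair per month. The sample line below is constructed for illustration, not taken from the table:

```python
# Build one padded line holding two 13-character month columns.
line = "0513 1925".ljust(13) + "0541 1918".ljust(13)
offsets = [13 * i for i in range(3)]
fields = [line[offsets[i]:offsets[i + 1]] for i in range(2)]
pairs = [f.split() for f in fields]   # [['rise', 'set'], ...] per month
```

An all-blank field (a day that doesn't exist in that month) would split to `[]`, which is why the code above filters with `isspace()` first.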
Andrey, to automate the start of the processing a bit more, I would do it like this:
l = filter(None, l.splitlines())
starts = [i + 1 for i, line in enumerate(l) if 'rise' in line and 'set' in line]
print 'starts==', starts

f = []
for line in l[starts[0]:]:
    if any(c not in ' 0123456789' for c in line):
        break
    else:
        f.append(line.partition(' ')[2])

s = []
for line in l[starts[1]:]:
    if any(c not in ' 0123456789' for c in line):
        break
    else:
        s.append(line.partition(' ')[2])

a = [(x + ' ' + y).split(' ') for x, y in zip(f, s)]
And improving the algorithm:
l = filter(None, l.splitlines())
starts = [i + 1 for i, line in enumerate(l) if 'rise' in line and 'set' in line]
print 'starts==', starts

nb, a = 0, []
while starts[1] + nb < len(l):
    line0 = l[starts[0] + nb].partition(' ')[2]
    line1 = l[starts[1] + nb].partition(' ')[2]
    if any(c not in ' 0123456789' for c in line0) or any(c not in ' 0123456789' for c in line1):
        break
    else:
        a.append((line0 + ' ' + line1).split(' '))
    nb += 1
I can't go further, because I don't have numpy and I don't know the final data structure you want to obtain.
But one thing is evident: for this kind of processing, regexes are absolutely the way to go; they make the treatment far more comfortable.
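As a minimal illustration of the regex suggestion (Python 3): pulling all four-digit times from a data row separates rises from sets by position, and blank months simply contribute nothing. The sample row is illustrative:

```python
import re

# A data row: day number followed by up to six "rise set" pairs.
row = " 1  0513 1925  0541 1918  0606 1851"
times = re.findall(r'\b\d{4}\b', row)   # only 4-digit runs, so the day is skipped
rises, sets = times[0::2], times[1::2]  # pairs alternate rise, set
```

This sidesteps the fixed-width bookkeeping entirely, at the cost of assuming times are always exactly four digits.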
