I have a little problem parsing a date in Python.
This is the date I have to parse:
Sun Sep 15, 2013 12:10pm EDT
And this is the code I'm using to parse it:
datetime.strptime(date, "%a %b %d, %Y %I:%M%p %Z")
Everything works except the time-zone parsing, which always raises a ValueError exception. I've also tried pytz, but without any success.
So how can I parse this kind of date using Python?
Using dateutil:
import dateutil.parser
import pytz
tz_str = '''-12 Y
-11 X NUT SST
-10 W CKT HAST HST TAHT TKT
-9 V AKST GAMT GIT HADT HNY
-8 U AKDT CIST HAY HNP PST PT
-7 T HAP HNR MST PDT
-6 S CST EAST GALT HAR HNC MDT
-5 R CDT COT EASST ECT EST ET HAC HNE PET
-4 Q AST BOT CLT COST EDT FKT GYT HAE HNA PYT
-3 P ADT ART BRT CLST FKST GFT HAA PMST PYST SRT UYT WGT
-2 O BRST FNT PMDT UYST WGST
-1 N AZOT CVT EGT
0 Z EGST GMT UTC WET WT
1 A CET DFT WAT WEDT WEST
2 B CAT CEDT CEST EET SAST WAST
3 C EAT EEDT EEST IDT MSK
4 D AMT AZT GET GST KUYT MSD MUT RET SAMT SCT
5 E AMST AQTT AZST HMT MAWT MVT PKT TFT TJT TMT UZT YEKT
6 F ALMT BIOT BTT IOT KGT NOVT OMST YEKST
7 G CXT DAVT HOVT ICT KRAT NOVST OMSST THA WIB
8 H ACT AWST BDT BNT CAST HKT IRKT KRAST MYT PHT SGT ULAT WITA WST
9 I AWDT IRKST JST KST PWT TLT WDT WIT YAKT
10 K AEST ChST PGT VLAT YAKST YAPT
11 L AEDT LHDT MAGT NCT PONT SBT VLAST VUT
12 M ANAST ANAT FJT GILT MAGST MHT NZST PETST PETT TVT WFT
13 FJST NZDT
11.5 NFT
10.5 ACDT LHST
9.5 ACST
6.5 CCT MMT
5.75 NPT
5.5 SLT
4.5 AFT IRDT
3.5 IRST
-2.5 HAT NDT
-3.5 HNT NST NT
-4.5 HLV VET
-9.5 MART MIT'''
tzd = {}
for tz_descr in map(str.split, tz_str.split('\n')):
    tz_offset = int(float(tz_descr[0]) * 3600)
    for tz_code in tz_descr[1:]:
        tzd[tz_code] = tz_offset

date = 'Sun Sep 15, 2013 12:10pm EDT'
dateutil.parser.parse(date, tzinfos=tzd)
# => datetime.datetime(2013, 9, 15, 12, 10, tzinfo=tzoffset(u'EDT', -14400))
The tzd generation code comes from this answer.
UPDATE: As Matt Johnson commented, the list of time zone abbreviations is not accurate. See his answer.
You can't. Not reliably anyway. Time zone abbreviations are ambiguous and contradictory. There are no standards.
For example "CST" has 5 distinctly different meanings.
(UTC-06:00) Central Standard Time (America)
(UTC-05:00) Cuba Standard Time
(UTC+08:00) China Standard Time
(UTC+09:30) Central Standard Time (Australia)
(UTC+10:30) Central Summer Time (Australia)
See this list for additional examples.
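That said, if you control the data source and know which meaning each abbreviation carries in your feed, you can resolve the ambiguity yourself by passing an explicit mapping to dateutil. A minimal sketch, assuming the abbreviations in this particular feed are the North American ones:

import dateutil.parser
import dateutil.tz

# Assumption: in *this* feed, EDT/CST always mean the US zones.
tzinfos = {
    "EDT": dateutil.tz.gettz("America/New_York"),
    "CST": dateutil.tz.gettz("America/Chicago"),
}
dt = dateutil.parser.parse("Sun Sep 15, 2013 12:10pm EDT", tzinfos=tzinfos)
print(dt)  # 2013-09-15 12:10:00-04:00

Unlike a flat offset table, gettz() gives you a real zone, so daylight-saving transitions are handled for you; but the choice of which zone each abbreviation means is still yours to make.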
I am working with the Bicycle dataset. I want to replace the text values in the 'weather' column with the numbers 1 to 4. This field is an object field. I tried all of the following ways, but none seem to work.
There is another field called 'season'. If I apply the same code to 'season', my code works fine. Please help.
Sample data:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 Clear + Few clouds 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 Clear + Few clouds 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 Light Snow, Light Rain 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 Mist + Cloudy 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 Clear + Few clouds 25.42 31.060 43 23.9994
I tried the following; none of it worked on 'weather', but when I use the same code on the 'season' column it works fine.
test["weather"] = np.where(test["weather"]=="Clear + Few clouds", 1,
(np.where(test["weather"]=="Mist + Cloudy",2,(np.where(test["weather"]=="Light Snow, Light
Rain",3,(np.where(test["weather"]=="Heavy Rain + Thunderstorm",4,0)))))))
PE_weather = [
    (train['weather'] == ' Clear + Few clouds '),
    (train['weather'] == 'Mist + Cloudy'),
    (train['weather'] >= 'Light Snow, Light Rain'),
    (train['weather'] >= 'Heavy Rain + Thunderstorm')]
PE_weather_value = ['1', '2', '3', '4']
train['Weather'] = np.select(PE_weather, PE_weather_value)

test.loc[test.weather == 'Clear + Few clouds', 'weather'] = '1'
I suggest you make a dictionary to look up the corresponding values and then apply a lookup to the weather column.
weather_lookup = {
    'Clear + Few clouds': 1,
    'Mist + Cloudy': 2,
    'Light Snow, Light Rain': 3,
    'Heavy Rain + Thunderstorm': 4,
}

def lookup(w):
    return weather_lookup.get(w, 0)

test['weather'] = test['weather'].apply(lookup)
Output:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 1 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 1 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 3 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 2 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 1 25.42 31.060 43 23.9994
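Equivalently, and a bit more idiomatically, pandas' built-in Series.map can do the same lookup in one line; unmatched values become NaN, so fill them with 0 to mirror the .get(w, 0) default:

test['weather'] = test['weather'].map(weather_lookup).fillna(0).astype(int)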
I worked on calculating churn using a mix of a pandas dataframe of git logs and the git show command for a particular commit, to see where exactly the changes were made based on LOC. However, I was not able to calculate churn based on days, i.e. calculate churn only when an engineer rewrites or deletes their own code that is less than 3 weeks old.
This is how I have done it for such a dataframe, for each commit:
git logs dataframe
sha timestamp date author message body age insertion deletion filepath churn merges
1 1 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997
2 21 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997 31.0 0.0 packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js 31.0
3 22 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997 1.0 1.0 packages/react-test-renderer/src/ReactTestRenderer.js 0.0
4 23 cae635054 Sat Jun 26 14:51:23 2021 -0400 2021-06-26 18:51:23+00:00 Andrew Clark `act`: Resolve to return value of scope function (#21759) When migrating some internal tests I found it annoying that I couldn't -24 days +12:21:32.839997 24.0 14.0 packages/react/src/ReactAct.js 10.0
5 25 e2453e200 Fri Jun 25 15:39:46 2021 -0400 2021-06-25 19:39:46+00:00 Andrew Clark act: Add test for bypassing queueMicrotask (#21743) Test for fix added in #21740 -25 days +13:09:55.839997 50.0 0.0 packages/react-reconciler/src/__tests__/ReactIsomorphicAct-test.js 50.0
6 27 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 4.0 5.0 packages/react-devtools-shared/src/__tests__/FastRefreshDevToolsIntegration-test.js -1.0
7 28 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 4.0 4.0 packages/react-devtools-shared/src/__tests__/componentStacks-test.js 0.0
8 29 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 12.0 12.0 packages/react-devtools-shared/src/__tests__/console-test.js 0.0
9 30 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 7.0 6.0 packages/react-devtools-shared/src/__tests__/editing-test.js 1.0
10 31 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 47.0 42.0 packages/react-devtools-shared/src/__tests__/inspectedElement-test.js 5.0
11 32 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 7.0 6.0 packages/react-devtools-shared/src/__tests__/ownersListContext-test.js 1.0
12 33 73ffce1b6 Thu Jun 24 22:42:44 2021 -0400 2021-06-25 02:42:44+00:00 Brian Vaughn DevTools: Update tests to fix warnings/errors (#21748) Some new ones had slipped in (e.g. deprecated ReactDOM.render message from 18) -26 days +20:12:53.839997 22.0 21.0 packages/react-devtools-shared/src/__tests__/profilerContext-test.js 1.0
churn calculation
commits = df["sha"].unique().tolist()
for commit in commits:
contribution, churn = await self.calculate_churn(commit)
async def calculate_churn(self, stream):
PREVIOUS_BASE_DIR = os.path.abspath("")
try:
GIT_DIR = os.path.join(PREVIOUS_BASE_DIR, "app/git/react.git")
os.chdir(GIT_DIR)
except FileNotFoundError as e:
raise ValueError(e)
cmd = f"git show --format= --unified=0 --no-prefix {stream}"
cmds = [f"{cmd}"]
results = get_proc_out(cmds)
[files, contribution, churn] = get_loc(results)
# need to circle back to previous path
os.chdir(PREVIOUS_BASE_DIR)
return contribution, churn
def is_new_file(result, file):
    # search for the destination file (+++ ) and update the file variable
    if result.startswith("+++"):
        return result[result.rfind(" ") + 1:]
    else:
        return file

def is_loc_change(result, loc_changes):
    # search for loc changes (@@ ) and update the loc_changes variable
    # e.g. @@ -1,5 +1,4 @@
    # i.e. @@ -l,s +l,s @@
    if result.startswith("@@"):
        # "-1,5 +1,4 @@"
        loc_change = result[result.find(" ") + 1:]
        # "-1,5 +1,4"
        loc_change = loc_change[:loc_change.find(" @@")]
        return loc_change
    else:
        return loc_changes
def get_loc_change(loc_changes):
    # removals
    # "-1,5 +1,4" -> "-1,5"
    left = loc_changes[:loc_changes.find(" ")]
    left_dec = 0
    if left.find(",") > 0:
        comma = left.find(",")          # e.g. 2
        left_dec = int(left[comma + 1:])  # e.g. 5
        left = int(left[1:comma])         # e.g. 1
    else:
        left = int(left[1:])
        left_dec = 1
    # additions
    # "+1,4"
    right = loc_changes[loc_changes.find(" ") + 1:]
    right_dec = 0
    if right.find(",") > 0:
        comma = right.find(",")
        right_dec = int(right[comma + 1:])
        right = int(right[1:comma])
    else:
        right = int(right[1:])
        right_dec = 1
    if left == right:
        return {left: (right_dec - left_dec)}
    else:
        return {left: left_dec, right: right_dec}
def get_loc(results):
    files = {}
    contribution = 0
    churn = 0
    file = ""
    loc_changes = ""
    for result in results:
        new_file = is_new_file(result, file)
        if file != new_file:
            file = new_file
            if file not in files:
                files[file] = {}
        else:
            new_loc_changes = is_loc_change(
                result, loc_changes
            )  # returns either empty or "-6 +6" or "-13,0 +14,2" format
            if loc_changes != new_loc_changes:
                loc_changes = new_loc_changes
                locc = get_loc_change(loc_changes)  # {2: 0} or {8: 0, 9: 1}
                for loc in locc:
                    # files[file] = {2: 0, 8: 0, 9: 1}
                    # print("loc", loc, files[file], locc[loc])
                    if loc in files[file]:
                        # a change to existing lines was triggered
                        files[file][loc] += locc[loc]
                        churn += abs(locc[loc])
                    else:
                        files[file][loc] = locc[loc]
                        contribution += abs(locc[loc])
            else:
                continue
    return [files, contribution, churn]
How can I utilize this same code, but count churn only for changes to code that is less than 3 weeks old?
The only practical way to do this is to iterate through the DataFrame, and because that sucks with pandas, it almost always means you have the wrong data structure. If you're not doing numerical analysis, and it looks like you aren't, then just keep a simple list of dicts. Pandas has its shining points, but it's not a universal database.
Here's the rough code you'd need, although I'm glossing over details:
# Go through the df row by row.
lastdate = {}
for index, row in df.iterrows():
    if row['filepath'] in lastdate:
        if lastdate[row['filepath']] - row['date'] < timedelta(days=21):
            print("Last change to", row['filepath'], "was within three weeks")
    lastdate[row['filepath']] = row['date']
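To fold the three-week rule into your churn numbers, here's a rough sketch along the same lines. It assumes the rows are ordered newest-first (as git log emits them), that 'date' is a timestamp column, and that the per-file 'churn' column from your dataframe is the quantity you want to accumulate:

from datetime import timedelta

recent_churn = 0
last = {}  # filepath -> (date, churn) of the most recent change seen so far
for _, row in df.iterrows():
    fp = row['filepath']
    if fp in last:
        newer_date, newer_churn = last[fp]
        if newer_date - row['date'] < timedelta(days=21):
            # The newer change rewrote code that was less than
            # 3 weeks old at the time: count its churn as recent.
            recent_churn += newer_churn
    last[fp] = (row['date'], row['churn'])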
I'm scraping a website that has a table of satellite values (https://planet4589.org/space/gcat/data/cat/satcat.html).
Because every entry is only separated by whitespace, I need a way to split the string of data entries into an array.
However, the .split() function does not suit my needs: because some of the data entries contain spaces (e.g. Able 3), I can't just split on every run of whitespace.
It gets trickier, however. In some cases where no data is available, a dash ("-") is used. If two data entries are separated by only a space, and one of them is a dash, I don't want to merge them into one entry.
E.g. say we have the two entries "Able 3" and "-", separated only by a single space. In the file, they would appear as "Able 3 -". I want to split this string into the separate data entries "Able 3" and "-" (as a list, this would be ["Able 3", "-"]).
Another example would be the need to split "data1 -" into ["data1", "-"].
Pretty much, I need to take a string and split it into a list of words separated by whitespace, except that words separated by a single space belong together, unless one of them is a dash.
Also, as you can see, the table is massive. I thought about looping through every character, but that would be too slow, and I need to run this thousands of times.
Here is a sample from the beginning of the file:
JCAT Satcat Piece Type Name PLName LDate Parent SDate Primary DDate Status Dest Owner State Manufacturer Bus Motor Mass DryMass TotMass Length Diamete Span Shape ODate Perigee Apogee Inc OpOrbitOQU AltNames
S00001 00001 1957 ALP 1 R2 8K71A M1-10 8K71A M1-10 (M1-1PS) 1957 Oct 4 - 1957 Oct 4 1933 Earth 1957 Dec 1 1000? R - OKB1 SU OKB1 Blok-A - 7790 7790 7800 ? 28.0 2.6 28.0 Cyl 1957 Oct 4 214 938 65.10 LLEO/I -
S00002 00002 1957 ALP 2 P 1-y ISZ PS-1 1957 Oct 4 S00001 1957 Oct 4 1933 Earth 1958 Jan 4? R - OKB1 SU OKB1 PS - 84 84 84 0.6 0.6 2.9 Sphere + Ant 1957 Oct 4 214 938 65.10 LLEO/I -
S00003 00003 1957 BET 1 P A 2-y ISZ PS-2 1957 Nov 3 A00002 1957 Nov 3 0235 Earth 1958 Apr 14 0200? AR - OKB1 SU OKB1 PS - 508 508 8308 ? 2.0 1.0 2.0 Cone 1957 Nov 3 211 1659 65.33 LEO/I -
S00004 00004 1958 ALP P A Explorer 1 Explorer 1 1958 Feb 1 A00004 1958 Feb 1 0355 Earth 1970 Mar 31 1045? AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Feb 1 359 2542 33.18 LEO/I -
S00005 00005 1958 BET 2 P Vanguard I Vanguard Test Satellite 1958 Mar 17 S00016 1958 Mar 17 1224 Earth - O - NRL US NRL NRL 6" - 2 2 2 0.1 0.1 0.1 Sphere 1959 May 23 657 3935 34.25 MEO -
S00006 00006 1958 GAM P A Explorer 3 Explorer 3 1958 Mar 26 A00005 1958 Mar 26 1745 Earth 1958 Jun 28 AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Mar 26 195 2810 33.38 LEO/I -
S00007 00007 1958 DEL 1 R2 8K74A 8K74A 1958 May 15 - 1958 May 15 0705 Earth 1958 Dec 3 R - OKB1 SU OKB1 Blok-A - 7790 7790 7820 ? 28.0 2.6 28.0 Cyl 1958 May 15 214 1860 65.18 LEO/I -
S00008 00008 1958 DEL 2 P 3-y Sovetskiy ISZ D-1 No. 2 1958 May 15 S00007 1958 May 15 0706 Earth 1960 Apr 6 R - OKB1 SU OKB1 Object D - 1327 1327 1327 3.6 1.7 3.6 Cone 1959 May 7 207 1247 65.12 LEO/I -
S00009 00009 1958 EPS P A Explorer 4 Explorer 4 1958 Jul 26 A00009 1958 Jul 26 1507 Earth 1959 Oct 23 AR - ABMA/JPL US JPL Explorer - 12 12 17 0.8 0.1 0.8 Cyl 1959 Apr 24 258 2233 50.40 LEO/I -
S00010 00010 1958 ZET P A SCORE SCORE 1958 Dec 18 A00015 1958 Dec 18 2306 Earth 1959 Jan 21 AR - ARPA/SRDL US SRDL SCORE - 68 68 3718 2.5 ? 1.5 ? 2.5 Cone 1958 Dec 30 159 1187 32.29 LEO/I -
S00011 00011 1959 ALP 1 P Vanguard II Cloud cover satellite 1959 Feb 17 S00012 1959 Feb 17 1605 Earth - O - BSC US NRL NRL 20" - 10 10 10 0.5 0.5 0.5 Sphere 1959 May 15 564 3304 32.88 MEO -
S00012 00012 1959 ALP 2 R3 GRC 33-KS-2800 GRC 33-KS-2800 175-15-21 1959 Feb 17 R02749 1959 Feb 17 1604 Earth - O - BSC US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Apr 28 564 3679 32.88 MEO -
S00013 00013 1959 BET P A Discoverer 1 CORONA Test Vehicle 2 1959 Feb 28 A00017 1959 Feb 28 2156 Earth 1959 Mar 5 AR - ARPA/CIA US LMSD CORONA - 78 ? 78 ? 668 ? 2.0 1.5 2.0 Cone 1959 Feb 28 163? 968? 89.70 LLEO/P -
S00014 00014 1959 GAM P A Discoverer 2 CORONA BIO 1 1959 Apr 13 A00021 1959 Apr 13 2126 Earth 1959 Apr 26 AR - ARPA/CIA US LMSD CORONA - 110 ? 110 ? 788 1.3 1.5 1.3 Frust 1959 Apr 13 239 346 89.90 LLEO/P -
S00015 00015 1959 DEL 1 P Explorer 6 NASA S-2 1959 Aug 7 S00017 1959 Aug 7 1430 Earth 1961 Jul 1 R? - GSFC US TRW Able Probe ARC 420 40 40 42 ? 0.7 0.7 2.2 Sphere + 4 Pan 1959 Sep 8 250 42327 46.95 HEO - Able 3
S00016 00016 1958 BET 1 R3 GRC 33-KS-2800 GRC 33-KS-2800 144-79-22 1958 Mar 17 R02064 1958 Mar 17 1223 Earth - O - NRL US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Sep 30 653 4324 34.28 MEO -
S00017 00017 1959 DEL 2 R3 Altair Altair X-248 1959 Aug 7 A00024 1959 Aug 7 1428 Earth 1961 Jun 30 R? - USAF US ABL Altair - 24 24 24 1.5 0.5 1.5 Cyl 1961 Jan 8 197 40214 47.10 GTO -
S00018 00018 1959 EPS 1 P A Discoverer 5 CORONA C-2 1959 Aug 13 A00028 1959 Aug 13 1906 Earth 1959 Sep 28 AR - ARPA/CIA US LMSD CORONA - 140 140 730 1.3 1.5 1.3 Frust 1959 Aug 14 215 732 80.00 LLEO/I - NRO Mission 9002
A less haphazard approach would be to interpret the headers on the first line as column indicators, and split on those widths.
import sys
import re

def col_widths(s):
    # Shamelessly adapted from https://stackoverflow.com/a/33090071/874188
    cols = re.findall(r'\S+\s+', s)
    return [len(col) for col in cols]

widths = col_widths(next(sys.stdin))
for line in sys.stdin:
    line = line.rstrip('\n')
    fields = []
    for col_max in widths[:-1]:
        fields.append(line[0:col_max].strip())
        line = line[col_max:]
    fields.append(line)
    print(fields)
Demo: https://ideone.com/ASANjn
This seems to provide a better interpretation of e.g. the LDate column, where the dates are sometimes padded with more than one space. The penultimate column preserves the final dash as part of the column value; this seems more consistent with the apparent intent of the author of the original table, though you could separately split that off from that specific column if it's not to your liking.
If you don't want to read sys.stdin, just wrap this in with open(filename) as handle: and replace sys.stdin with handle everywhere.
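For example, a file-based version of the same loop (assuming a hypothetical filename satcat.txt):

with open("satcat.txt") as handle:          # hypothetical filename
    widths = col_widths(next(handle))       # col_widths as defined above
    for line in handle:
        line = line.rstrip('\n')
        fields = []
        for col_max in widths[:-1]:
            fields.append(line[0:col_max].strip())
            line = line[col_max:]
        fields.append(line)
        print(fields)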
One approach is to use pandas.read_fwf(), which reads text files in fixed-width format. The function returns Pandas DataFrames, which are useful for handling large data sets.
As a quick taste, here's what this simple bit of code does:
import pandas as pd
data = pd.read_fwf("data.txt")
print(data.columns) # Prints an index of all columns.
print()
print(data.head(5)) # Prints the top 5 rows.
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate',
# 'Unnamed: 7', 'Parent', 'SDate', 'Unnamed: 10', 'Unnamed: 11',
# 'Primary', 'DDate', 'Unnamed: 14', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'Unnamed: 23', 'DryMass',
# 'Unnamed: 25', 'TotMass', 'Unnamed: 27', 'Length', 'Unnamed: 29',
# 'Diamete', 'Span', 'Unnamed: 32', 'Shape', 'ODate', 'Unnamed: 35',
# 'Perigee', 'Apogee', 'Inc', 'OpOrbitOQU', 'AltNames'],
# dtype='object')
#
# JCAT Satcat Piece Type ... Apogee Inc OpOrbitOQU AltNames
# 0 S00001 1 1957 ALP 1 R2 ... 938 65.10 LLEO/I - NaN
# 1 S00002 2 1957 ALP 2 P ... 938 65.10 LLEO/I - NaN
# 2 S00003 3 1957 BET 1 P A ... 1659 65.33 LEO/I - NaN
# 3 S00004 4 1958 ALP P A ... 2542 33.18 LEO/I - NaN
# 4 S00005 5 1958 BET 2 P ... 3935 34.25 MEO - NaN
You'll note that some of the columns are unnamed. We can solve this by determining the field widths of the file ourselves and passing them to read_fwf() to guide its parsing. We'll achieve this by reading the first line of the file and iterating over it.
# Read the header line, whose spacing defines the field widths.
with open("data.txt") as f:
    first_line = f.readline()

field_widths = []  # We'll append column widths into this list.
last_i = 0
new_field = False
for i, x in enumerate(first_line):
    if x != ' ' and new_field:
        # Register a new field.
        new_field = False
        field_widths.append(i - last_i)  # Field width = current index minus
                                         # the previous field's start index.
        last_i = i  # Set the new field's start index.
    elif not new_field and x == ' ':
        # We've encountered a space.
        new_field = True  # Set True so that the next non-space character
                          # is recognised as the start of a new field.
else:
    # for-else: runs once after the loop completes.
    field_widths.append(64)  # Append the last field. Set to a high number
                             # so that the final strings are read to the end.
Just a simple for-loop. Nothing fancy.
All that's left is passing the field_widths list through the widths= keyword arg:
data = pd.read_fwf("data.txt", widths=field_widths)
print(data.columns)
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate', 'Parent',
# 'SDate', 'Primary', 'DDate', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'DryMass', 'TotMass', 'Length',
# 'Diamete', 'Span', 'Shape', 'ODate', 'Perigee', 'Apogee', 'Inc',
# 'OpOrbitOQU'],
# dtype='object')
data is a DataFrame, but with some work you can change it to a list of lists or a list of dicts. Or you could work with the DataFrame directly.
Say you want the first row; then you could do:
datalist = data.values.tolist()
print(datalist[0])
# ['S00001', 1, '1957 ALP 1', 'R2', '8K71A M1-10', '8K71A M1-10 (M1-1PS)', '1957 Oct 4', '-', '1957 Oct 4 1933', 'Earth', '1957 Dec 1 1000?', 'R', '-', 'OKB1', 'SU', 'OKB1', 'Blok-A', '-', '7790', '7790', '7800 ?', '28.0', '2.6', '28.0', 'Cyl', '1957 Oct 4', '214', '938', '65.10', 'LLEO/I -']
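If you'd rather have a list of dicts keyed by column name, DataFrame.to_dict with orient="records" does that directly; a small sketch:

records = data.to_dict(orient="records")
print(records[0]['JCAT'])  # 'S00001'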
I am trying to assign each game in the NFL a value for the week in which it occurs.
For example, in the 2008 season, all the games that occur in the range between the 4th and 10th of September occur in week 1.
i = 0
week = 1
start_date = df2008['date'].iloc[0]
end_date = df2008['date'].iloc[-1]
week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')

for row in df2008['date']:
    row = row.date()
    if row in week_range[i]:
        df2008['week'] = week
    else:
        week += 1
However, this is updating all of the games to week 1
date week
1601 2008-09-04 1
1602 2008-09-07 1
1603 2008-09-07 1
1604 2008-09-07 1
1605 2008-09-07 1
... ... ...
1863 2009-01-11 1
1864 2009-01-11 1
1865 2009-01-18 1
1866 2009-01-18 1
1867 2009-02-01 1
I have tried using print statements to debug, and these are my results. "In Range" marks games that occur in week 1, and those return as expected.
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range
df_sample:
display(df2008[['date', 'home', 'away', 'week']])
date home away week
1601 2008-09-04 Giants Redskins 1
1602 2008-09-07 Falcons Lions 1
1603 2008-09-07 Bills Seahawks 1
1604 2008-09-07 Titans Jaguars 1
1605 2008-09-07 Dolphins Jets 1
... ... ... ... ...
1863 2009-01-11 Giants Eagles 1
1864 2009-01-11 Steelers Chargers 1
1865 2009-01-18 Cardinals Eagles 1
1866 2009-01-18 Steelers Ravens 1
1867 2009-02-01 Cardinals Steelers 1
Can anyone point out where I am going wrong?
OP's original question was "Can anyone point out where I am going wrong?",
so, although using pandas.Series.dt.week is a fine pandas solution (as Parfait pointed out), to help the OP find the answer I followed the original code logic, with some fixes:
import pandas as pd

df2008 = pd.DataFrame({
    "date": [pd.Timestamp("2008-09-04"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"),
             pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2009-01-11"),
             pd.Timestamp("2009-01-11"), pd.Timestamp("2009-01-18"), pd.Timestamp("2009-01-18"),
             pd.Timestamp("2009-02-01")],
    "home": ["Giants", "Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals"],
    "away": ["Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals", "Ravens"]
})

i = 0
week = 1
start_date = df2008['date'].iloc[0]
# end_date = df2008['date'].iloc[-1]
end_date = pd.Timestamp("2009-03-01")
week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')
df2008['week'] = None

for i in range(len(df2008['date'])):
    rd = df2008.loc[i, 'date'].date()
    while True:
        if week == len(week_range):
            break
        if rd in week_range[week - 1]:
            df2008.loc[i, 'week'] = week
            break
        else:
            week += 1

print(df2008)
Out:
date home away week
0 2008-09-04 Giants Falcon 1
1 2008-09-07 Falcon Bills 1
2 2008-09-07 Bills Titans 1
3 2008-09-07 Titans Dolphins 1
4 2008-09-07 Dolphins Giants 1
5 2009-01-11 Giants Steelers 19
6 2009-01-11 Steelers Cardinals 19
7 2009-01-18 Cardinals Steelers 20
8 2009-01-18 Steelers Cardinals 20
9 2009-02-01 Cardinals Ravens 22
Consider avoiding any looping and use pandas.Series.dt.week on datetime fields, which returns the week of the year. Then subtract the first week. However, a wrinkle occurs at the new year, so it must be handled conditionally: add the difference up to the end of the year, then the weeks of the new year. Fortunately, weeks start on Monday (so Thursday through Sunday maintain the same week number).
first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values

# FIND LAST SUNDAY OF YEAR (NOT NECESSARILY DEC 31)
end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values
new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values

# CONDITIONALLY ASSIGN
df2008['week'] = np.where(df2008['date'] < '2009-01-01',
                          (df2008['date'].dt.week - first_week) + 1,
                          ((end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
                         )
To demonstrate with random seeded data (including new year dates); replace with OP's reproducible sample as needed.
Data
import numpy as np
import pandas as pd

### DATA BUILD
np.random.seed(120619)
df2008 = pd.DataFrame({'group': np.random.choice(['sas', 'stata', 'spss', 'python', 'r', 'julia'], 500),
                       'int': np.random.randint(1, 10, 500),
                       'num': np.random.randn(500),
                       'char': [''.join(np.random.choice(list('ABC123'), 3)) for _ in range(500)],
                       'bool': np.random.choice([True, False], 500),
                       'date': np.random.choice(pd.date_range('2008-09-04', '2009-01-06'), 500)
                      })
Calculation
first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values
end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values
new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values

df2008['week'] = np.where(df2008['date'] < '2008-12-28',
                          (df2008['date'].dt.week - first_week) + 1,
                          ((end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
                         )
df2008 = df2008.sort_values('date').reset_index(drop=True)
print(df2008.head(10))
# group int num char bool date week
# 0 sas 2 0.099927 A2C False 2008-09-04 1
# 1 python 3 0.241393 2CB False 2008-09-04 1
# 2 python 8 0.516716 ABC False 2008-09-04 1
# 3 spss 2 0.974715 3CB False 2008-09-04 1
# 4 stata 9 -1.582096 CAA True 2008-09-04 1
# 5 sas 3 0.070347 1BB False 2008-09-04 1
# 6 r 5 -0.419936 1CA True 2008-09-05 1
# 7 python 6 0.628749 1AB True 2008-09-05 1
# 8 python 3 0.713695 CA1 False 2008-09-05 1
# 9 python 1 -0.686137 3AA False 2008-09-05 1
print(df2008.tail(10))
# group int num char bool date week
# 490 spss 5 -0.548257 3CC True 2009-01-04 17
# 491 julia 8 -0.176858 AA2 False 2009-01-05 18
# 492 julia 5 -1.422237 A1B True 2009-01-05 18
# 493 stata 2 -1.710138 BB2 True 2009-01-05 18
# 494 python 4 -0.285249 1B1 True 2009-01-05 18
# 495 spss 3 0.918428 C23 True 2009-01-06 18
# 496 r 5 -1.347936 1AC False 2009-01-06 18
# 497 stata 3 0.883093 1C3 False 2009-01-06 18
# 498 python 9 0.448237 12A True 2009-01-06 18
# 499 spss 3 1.459097 2A1 False 2009-01-06 18
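A side note: in recent pandas versions Series.dt.week is deprecated; if you hit a warning, the equivalent is the week column of dt.isocalendar(), e.g.:

weeks = df2008['date'].dt.isocalendar().week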
My current project is scraping weather data from websites for calculation. Part of this calculation involves different logic depending on whether the current time is before or after noon.
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np
# Arkansas State Plant Board Weather Web data
url1 = "http://170.94.200.136/weather/Inversion.aspx"
response1 = requests.get(url1)
soup1 = BeautifulSoup(response1.content)
table1 = soup1.find("table", id="MainContent_GridView1")
data1 = pd.read_html(str(table1),header=0)[0]
data1.columns = ['Station', 'Low Temp (F)', 'Time of Low', 'Current Temp (F)', 'Current Time', 'Wind Speed (MPH)', 'Wind Dir', 'High Temp (F)', 'Time Of High']
print(url1)
print(data1[0:4])
array1 = np.array(data1[0:4])
This is my code to bring in the data I need. However, I don't know how to compare the current time, which I receive as a Unicode string, to see whether it is before or after noon. Can anyone help me with this?
Edit: some data from the current request
Station Low Temp (F) Time of Low Current Temp (F) Current Time \
0 Arkansas 69.0 5:19 AM 88.7 2:09 PM
1 Ashley 70.4 4:39 AM 91.2 2:14 PM
2 Bradley 69.4 4:09 AM 90.6 2:14 PM
3 Chicot -40.2 2:14 PM -40.2 2:14 PM
Wind Speed (MPH) Wind Dir High Temp (F) Time Of High
0 4.3 213 88.9 2:04 PM
1 4.1 172 91.2 2:14 PM
2 6.0 203 90.6 2:09 PM
3 2.2 201 -40.1 12:24 AM
Just check if the meridian is PM or AM.
time = "2:09 PM"
meridian = time.split(' ')[-1] # get just the meridian
before_noon = meridian == 'AM'
after_noon = meridian == 'PM'
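Applied to the whole scraped column at once (assuming, as in the sample, that every 'Current Time' string ends in AM or PM):

data1['after_noon'] = data1['Current Time'].str.endswith('PM')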
You can do it like this:
t = pd.to_datetime(data1['Current Time'][0:1][0])
noon = pd.to_datetime("12:00 PM")
if t < noon:
    print("yes")
else:
    print("no")
>>> no
t
>>> Timestamp('2016-07-11 14:04:00')
noon
>>> Timestamp('2016-07-11 12:00:00')
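Note that pd.to_datetime fills in today's date for both stamps (as the Timestamp reprs above show), which is why the comparison works. To make the intent explicit you could compare only the clock times:

if t.time() < noon.time():
    print("before noon")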