I'm scraping a website that has a table of satellite values (https://planet4589.org/space/gcat/data/cat/satcat.html).
Because every entry is only separated by whitespace, I need a way to split the string of data entries into an array.
However, the .split() function does not suit my needs: because some of the data entries themselves contain spaces (e.g. "Able 3"), I can't just split on every run of whitespace.
It gets trickier, however. In cases where no data is available, a dash ("-") is used. If two data entries are separated by only a single space and one of them is a dash, I don't want to treat them as a single entry.
E.g., say we have the two entries "Able 3" and "-", separated only by a single space. In the file, they would appear as "Able 3 -". I want to split this string into the separate data entries "Able 3" and "-" (as a list, this would be ["Able 3", "-"]).
Another example would be the need to split "data1 -" into ["data1", "-"]
Pretty much, I need to take a string and split it into a list of words separated by whitespace, except that a single space between two words should not act as a separator unless one of the words is a dash.
Also, as you can see the table is massive. I thought about looping through every character, but that would be too slow, and I need to run this thousands of times.
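To spell the rule out, here is a naive sketch of what I'm after (split on runs of two or more spaces, then peel lone dashes off the remaining chunks). It works on the small examples above, but I assume something smarter is needed for the full table:

import re

def split_fields(line):
    fields = []
    # Runs of 2+ spaces always separate fields.
    for chunk in re.split(r'\s{2,}', line.strip()):
        words = []
        for token in chunk.split(' '):
            if token == '-':
                # A lone dash is its own field, even if only one space away.
                if words:
                    fields.append(' '.join(words))
                    words = []
                fields.append('-')
            else:
                words.append(token)
        if words:
            fields.append(' '.join(words))
    return fields

print(split_fields("Able 3 -"))  # ['Able 3', '-']
print(split_fields("data1 -"))   # ['data1', '-']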
Here is a sample from the beginning of the file:
JCAT Satcat Piece Type Name PLName LDate Parent SDate Primary DDate Status Dest Owner State Manufacturer Bus Motor Mass DryMass TotMass Length Diamete Span Shape ODate Perigee Apogee Inc OpOrbitOQU AltNames
S00001 00001 1957 ALP 1 R2 8K71A M1-10 8K71A M1-10 (M1-1PS) 1957 Oct 4 - 1957 Oct 4 1933 Earth 1957 Dec 1 1000? R - OKB1 SU OKB1 Blok-A - 7790 7790 7800 ? 28.0 2.6 28.0 Cyl 1957 Oct 4 214 938 65.10 LLEO/I -
S00002 00002 1957 ALP 2 P 1-y ISZ PS-1 1957 Oct 4 S00001 1957 Oct 4 1933 Earth 1958 Jan 4? R - OKB1 SU OKB1 PS - 84 84 84 0.6 0.6 2.9 Sphere + Ant 1957 Oct 4 214 938 65.10 LLEO/I -
S00003 00003 1957 BET 1 P A 2-y ISZ PS-2 1957 Nov 3 A00002 1957 Nov 3 0235 Earth 1958 Apr 14 0200? AR - OKB1 SU OKB1 PS - 508 508 8308 ? 2.0 1.0 2.0 Cone 1957 Nov 3 211 1659 65.33 LEO/I -
S00004 00004 1958 ALP P A Explorer 1 Explorer 1 1958 Feb 1 A00004 1958 Feb 1 0355 Earth 1970 Mar 31 1045? AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Feb 1 359 2542 33.18 LEO/I -
S00005 00005 1958 BET 2 P Vanguard I Vanguard Test Satellite 1958 Mar 17 S00016 1958 Mar 17 1224 Earth - O - NRL US NRL NRL 6" - 2 2 2 0.1 0.1 0.1 Sphere 1959 May 23 657 3935 34.25 MEO -
S00006 00006 1958 GAM P A Explorer 3 Explorer 3 1958 Mar 26 A00005 1958 Mar 26 1745 Earth 1958 Jun 28 AR - ABMA/JPL US JPL Explorer - 8 8 14 0.8 0.1 0.8 Cyl 1958 Mar 26 195 2810 33.38 LEO/I -
S00007 00007 1958 DEL 1 R2 8K74A 8K74A 1958 May 15 - 1958 May 15 0705 Earth 1958 Dec 3 R - OKB1 SU OKB1 Blok-A - 7790 7790 7820 ? 28.0 2.6 28.0 Cyl 1958 May 15 214 1860 65.18 LEO/I -
S00008 00008 1958 DEL 2 P 3-y Sovetskiy ISZ D-1 No. 2 1958 May 15 S00007 1958 May 15 0706 Earth 1960 Apr 6 R - OKB1 SU OKB1 Object D - 1327 1327 1327 3.6 1.7 3.6 Cone 1959 May 7 207 1247 65.12 LEO/I -
S00009 00009 1958 EPS P A Explorer 4 Explorer 4 1958 Jul 26 A00009 1958 Jul 26 1507 Earth 1959 Oct 23 AR - ABMA/JPL US JPL Explorer - 12 12 17 0.8 0.1 0.8 Cyl 1959 Apr 24 258 2233 50.40 LEO/I -
S00010 00010 1958 ZET P A SCORE SCORE 1958 Dec 18 A00015 1958 Dec 18 2306 Earth 1959 Jan 21 AR - ARPA/SRDL US SRDL SCORE - 68 68 3718 2.5 ? 1.5 ? 2.5 Cone 1958 Dec 30 159 1187 32.29 LEO/I -
S00011 00011 1959 ALP 1 P Vanguard II Cloud cover satellite 1959 Feb 17 S00012 1959 Feb 17 1605 Earth - O - BSC US NRL NRL 20" - 10 10 10 0.5 0.5 0.5 Sphere 1959 May 15 564 3304 32.88 MEO -
S00012 00012 1959 ALP 2 R3 GRC 33-KS-2800 GRC 33-KS-2800 175-15-21 1959 Feb 17 R02749 1959 Feb 17 1604 Earth - O - BSC US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Apr 28 564 3679 32.88 MEO -
S00013 00013 1959 BET P A Discoverer 1 CORONA Test Vehicle 2 1959 Feb 28 A00017 1959 Feb 28 2156 Earth 1959 Mar 5 AR - ARPA/CIA US LMSD CORONA - 78 ? 78 ? 668 ? 2.0 1.5 2.0 Cone 1959 Feb 28 163? 968? 89.70 LLEO/P -
S00014 00014 1959 GAM P A Discoverer 2 CORONA BIO 1 1959 Apr 13 A00021 1959 Apr 13 2126 Earth 1959 Apr 26 AR - ARPA/CIA US LMSD CORONA - 110 ? 110 ? 788 1.3 1.5 1.3 Frust 1959 Apr 13 239 346 89.90 LLEO/P -
S00015 00015 1959 DEL 1 P Explorer 6 NASA S-2 1959 Aug 7 S00017 1959 Aug 7 1430 Earth 1961 Jul 1 R? - GSFC US TRW Able Probe ARC 420 40 40 42 ? 0.7 0.7 2.2 Sphere + 4 Pan 1959 Sep 8 250 42327 46.95 HEO - Able 3
S00016 00016 1958 BET 1 R3 GRC 33-KS-2800 GRC 33-KS-2800 144-79-22 1958 Mar 17 R02064 1958 Mar 17 1223 Earth - O - NRL US GCR 33-KS-2800 - 195 22 22 1.5 0.7 1.5 Cyl 1959 Sep 30 653 4324 34.28 MEO -
S00017 00017 1959 DEL 2 R3 Altair Altair X-248 1959 Aug 7 A00024 1959 Aug 7 1428 Earth 1961 Jun 30 R? - USAF US ABL Altair - 24 24 24 1.5 0.5 1.5 Cyl 1961 Jan 8 197 40214 47.10 GTO -
S00018 00018 1959 EPS 1 P A Discoverer 5 CORONA C-2 1959 Aug 13 A00028 1959 Aug 13 1906 Earth 1959 Sep 28 AR - ARPA/CIA US LMSD CORONA - 140 140 730 1.3 1.5 1.3 Frust 1959 Aug 14 215 732 80.00 LLEO/I - NRO Mission 9002
A less haphazard approach would be to interpret the headers on the first line as column indicators, and split on those widths.
import sys
import re

def col_widths(s):
    # Shamelessly adapted from https://stackoverflow.com/a/33090071/874188
    cols = re.findall(r'\S+\s+', s)
    return [len(col) for col in cols]

widths = col_widths(next(sys.stdin))
for line in sys.stdin:
    line = line.rstrip('\n')
    fields = []
    for col_max in widths[:-1]:
        fields.append(line[0:col_max].strip())
        line = line[col_max:]
    fields.append(line)
    print(fields)
Demo: https://ideone.com/ASANjn
This seems to provide a better interpretation of, e.g., the LDate column, where the dates are sometimes padded with more than one space. The penultimate column preserves the final dash as part of the column value; this seems more consistent with the apparent intent of the author of the original table, though you could separately split it off from that specific column if that's not to your liking.
If you don't want to read sys.stdin, just wrap this in with open(filename) as handle: and replace sys.stdin with handle everywhere.
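For example, a minimal sketch of that variant (assuming the table was saved locally as satcat.txt; the filename is just a placeholder):

with open("satcat.txt") as handle:
    widths = col_widths(next(handle))
    for line in handle:
        line = line.rstrip('\n')
        fields = []
        for col_max in widths[:-1]:
            fields.append(line[0:col_max].strip())
            line = line[col_max:]
        fields.append(line)
        print(fields)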
One approach is to use pandas.read_fwf(), which reads text files in fixed-width format. The function returns a Pandas DataFrame, which is useful for handling large data sets.
As a quick taste, here's what this simple bit of code does:
import pandas as pd
data = pd.read_fwf("data.txt")
print(data.columns) # Prints an index of all columns.
print()
print(data.head(5)) # Prints the top 5 rows.
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate',
# 'Unnamed: 7', 'Parent', 'SDate', 'Unnamed: 10', 'Unnamed: 11',
# 'Primary', 'DDate', 'Unnamed: 14', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'Unnamed: 23', 'DryMass',
# 'Unnamed: 25', 'TotMass', 'Unnamed: 27', 'Length', 'Unnamed: 29',
# 'Diamete', 'Span', 'Unnamed: 32', 'Shape', 'ODate', 'Unnamed: 35',
# 'Perigee', 'Apogee', 'Inc', 'OpOrbitOQU', 'AltNames'],
# dtype='object')
#
# JCAT Satcat Piece Type ... Apogee Inc OpOrbitOQU AltNames
# 0 S00001 1 1957 ALP 1 R2 ... 938 65.10 LLEO/I - NaN
# 1 S00002 2 1957 ALP 2 P ... 938 65.10 LLEO/I - NaN
# 2 S00003 3 1957 BET 1 P A ... 1659 65.33 LEO/I - NaN
# 3 S00004 4 1958 ALP P A ... 2542 33.18 LEO/I - NaN
# 4 S00005 5 1958 BET 2 P ... 3935 34.25 MEO - NaN
You'll note that some of the columns are unnamed. We can solve this by determining the field widths of the file ourselves and using them to guide read_fwf()'s parsing. We'll achieve this by reading the first line of the file and iterating over it.
with open("data.txt") as f:
    first_line = f.readline().rstrip('\n')

field_widths = []  # We'll append column widths into this list.
last_i = 0
new_field = False
for i, x in enumerate(first_line):
    if x != ' ' and new_field:
        # Register a new field: its width is the distance
        # from the previous field's starting index.
        new_field = False
        field_widths.append(i - last_i)
        last_i = i  # Remember where this field starts.
    elif not new_field and x == ' ':
        # We've encountered a space: the next non-space
        # character starts a new field.
        new_field = True
field_widths.append(64)  # Append the last field. Set to a high number,
                         # so that the remainder of each line is always read.
Just a simple for-loop. Nothing fancy.
All that's left is passing the field_widths list through the widths= keyword arg:
data = pd.read_fwf("data.txt", widths=field_widths)
print(data.columns)
# Index(['JCAT', 'Satcat', 'Piece', 'Type', 'Name', 'PLName', 'LDate', 'Parent',
# 'SDate', 'Primary', 'DDate', 'Status', 'Dest', 'Owner', 'State',
# 'Manufacturer', 'Bus', 'Motor', 'Mass', 'DryMass', 'TotMass', 'Length',
# 'Diamete', 'Span', 'Shape', 'ODate', 'Perigee', 'Apogee', 'Inc',
# 'OpOrbitOQU'],
# dtype='object')
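As an aside, read_fwf() can also infer the column boundaries from a sample of data rows rather than only the header, which may be worth trying before computing widths yourself. This is only a sketch; the infer_nrows parameter requires pandas 0.24 or newer:

# Let pandas infer column breaks from the first 200 data rows (default is 100).
data_inferred = pd.read_fwf("data.txt", colspecs='infer', infer_nrows=200)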
data is a DataFrame, but with some work you can change it to a list of lists or a list of dicts. Or you could work with the DataFrame directly.
So say you want the first row. Then you could do:
datalist = data.values.tolist()
print(datalist[0])
# ['S00001', 1, '1957 ALP 1', 'R2', '8K71A M1-10', '8K71A M1-10 (M1-1PS)', '1957 Oct 4', '-', '1957 Oct 4 1933', 'Earth', '1957 Dec 1 1000?', 'R', '-', 'OKB1', 'SU', 'OKB1', 'Blok-A', '-', '7790', '7790', '7800 ?', '28.0', '2.6', '28.0', 'Cyl', '1957 Oct 4', '214', '938', '65.10', 'LLEO/I -']
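And if a list of dicts keyed by column name is more convenient than a list of lists, DataFrame.to_dict can produce that directly, for example:

records = data.to_dict(orient="records")  # one dict per satellite row
print(records[0]["Name"])                 # '8K71A M1-10'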
I have a problem creating a DataFrame from a string that looks like a table. Specifically, I want to recreate the same table as in my data. This is my data, and below is my code:
0 2017 IX 2018 X 2018 X 2018 X 2018
0 2017 IX 2018 0 2017 IX 2018
UKUPNO 1.053 1.075 1.093 103,8 101,7 1.633 1.669 1.701 104,2 101,9
A Poljoprivreda, šumarstvo i ribolov 907 888 925 102,0 104,2 1.394 1.356 1.420 101,9 104,7
B Vađenje ruda i kamena 913 919 839 91,9 91,3 1.395 1.406 1.297 93,0 92,2
C Prerađivačka industrija 769 764 775 100,8 101,4 1.176 1.169 1.187 100,9 101,5
D Proizvodnja i snabdijevanje 1.574 1.570 1.647 104,6 104,9 2.459 2.455 2.579 104,9 105,1
električnom energijom, plinom,
parom i klimatizacija
E Snabdijevanje vodom; uklanjanje 956 973 954 99,8 98,0 1.462 1.491 1.462 100,0 98,1
otpadnih voda, upravljanje otpadom
import io
import pandas as pd

TESTDATA = io.StringIO(''' ''')
df = pd.read_csv(TESTDATA, sep='delimiter', header=None, engine='python')
When I run my code, I get this DataFrame:
0 Prosječna neto plaća ...
1 u KM ...
2 Index Index ...
3 0 2017 IX 2018 X 2018 X 2018 ...
4 0 2017 IX 2018 ...
5 UKUPNO ...
6 A Poljoprivreda, šumarstvo i ribolov ...
7 B Vađenje ruda i kamena ...
8 C Prerađivačka industrija ...
9 D Proizvodnja i snabdijevanje ...
10 električnom energijom, plinom,
I'm trying to format a pandas dataframe:
> year mileage model manufacturer power fuel_type price
> 0 2011 184000 c-klasa Mercedes-Benz 161 diesel 114340
> 1 2013 102000 v40 Volvo 130 diesel 80511
> 2 2014 191000 scenic Renault 85 diesel 57613
> 3 1996 210000 vectra Opel 85 benzin 6278
> 4 2005 258000 tucson Hyundai 83 diesel 41363
> 5 2007 325000 astra Opel 74 diesel 26590
> 6 2002 200000 megane Renault 79 plin 16988
> 7 2011 191000 touran VW 77 diesel 62783
> 8 2007 210000 118 BMW 105 diesel 44318
> 9 2012 104000 3 Mazda 85 diesel 63522
> 10 2011 68000 c3 Citroen 54 benzin 44318
> 11 1993 200000 ax Citroen 37 diesel 43467
> 12 2011 142000 twingo Renault 55 benzin 28068
> 13 2005 280000 320 BMW 120 diesel 28068
so that the output fits JSON object requirements.
Here's my code:
for model, car in carsDF.groupby('manufacturer'):
    print("{\"", model, ":\"[\"", '","'.join(car['model'].unique()), "\"]},")
which yields:
> {" Alfa Romeo
> :"["156","159","146","147","giulietta","gt","33","mito","166","145","brera","sprint","spider","155","ostalo
> "]}, {" Aston Martin :"[" vantage "]},...
Which is OK except for the spaces that show up each time I use the escape character "\".
How can I create the JSON object without them?
Is there a better way to generate a JSON object for a case like this?
I believe you need to create a Series of the unique values with SeriesGroupBy.unique and then convert it to JSON with Series.to_json:
j = carsDF.groupby('manufacturer')['model'].unique().to_json()
print(j)
{
"BMW": ["118", "320"],
"Citroen": ["c3", "ax"],
"Hyundai": ["tucson"],
"Mazda": ["3"],
"Mercedes-Benz": ["c-klasa"],
"Opel": ["vectra", "astra"],
"Renault": ["scenic", "megane", "twingo"],
"VW": ["touran"],
"Volvo": ["v40"]
}
If you want each JSON object separately, the solution is to create dictionaries and convert them to JSON:
import json
for model, car in carsDF.groupby('manufacturer'):
    print(json.dumps({model: car['model'].unique().tolist()}))
{"BMW": ["118", "320"]}
{"Citroen": ["c3", "ax"]}
{"Hyundai": ["tucson"]}
{"Mazda": ["3"]}
{"Mercedes-Benz": ["c-klasa"]}
{"Opel": ["vectra", "astra"]}
{"Renault": ["scenic", "megane", "twingo"]}
{"VW": ["touran"]}
{"Volvo": ["v40"]}
I have the following dataframe:
df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]
This will produce 1.4 million records; I've taken the first 12.
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for certain hiring codes.
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like in excel where I can have the ['regions'] as my row and the ['fy'] as the columns to give me a total count of numbers based off the ['ssno'] for each ['fy']. It would also be nice to eventually do calculations based off the numbers too, like averages and sums.
Along with looking at examples in the url: http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So based off the answer below I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)
Which produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value my results came out really strange and didn't match my value counts numbers. Not sure if the pivot eliminates duplicates or what, but it did not come out correctly.
ssno
fy 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
nat_actn_2_3
100 34 20 25 18 38 43 45 14 19 25 10
101 510 453 725 795 1029 1293 957 383 470 605 145
108 170 132 112 85 123 127 84 43 40 29 10
115 9203 8972 7946 9038 10139 10480 9211 8735 10482 11258 339
130 299 313 431 324 291 325 336 202 230 436 112
140 62 74 71 75 132 125 82 42 45 74 18
141 20 16 23 17 20 14 10 9 13 17 7
170 202 433 226 278 336 386 284 265 121 118 49
171 4771 4627 4234 4196 4470 4472 3270 3145 354 341 34
190 1 1 NaN NaN NaN 1 NaN NaN NaN NaN NaN
702 3141 3099 3429 3030 3758 3952 3813 2902 2329 2375 650
703 2280 2354 2225 2050 2260 2328 2172 2503 2649 2856 726
Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin references a list of strings, but the data you provide in column 'nat_actn_2_3' is int.
If you have an older version of pandas, try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
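Since you asked about groupby: you can get the same counts with a groupby/unstack, which some find easier to reason about. A sketch, assuming you want one row per region and one column per fiscal year:

# Count rows per (region, fy) pair, then pivot fy out into columns.
counts = h3.groupby(['regions', 'fy']).size().unstack('fy', fill_value=0)
print(counts)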