I have a DataFrame (df) with the following structure:
date a b c d e f g
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.80 223716 790.8724 5.7916
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434
Where columns a and g both have data, I would like to multiply them together using the following:
df["h"] = df["a"]*df["g"]
However, as you can see from the time series above, there is not always data with which to perform the calculation, and I am returned the following error:
KeyError: 'g'
Is there a way to check whether the data exists before performing the calculation? I am trying to use:
df["h"] = np.where((df.a == blank)|(df.g == blank),"",df.a*df.g)
I would like to have returned:
date a b c d e f g h
23/02/2009 577.9102
24/02/2009 579.1345
25/02/2009 583.2158
26/02/2009 629.7425
27/02/2009 553.8306
02/03/2009 6.15 5.31 5.8 223716 790.8724 5.7916 1.0618
03/03/2009 6.16 6.2 6.26 818424 770.6165 6.0161 1.0239
04/03/2009 6.6 6.485 6.57 636544 858.5754 1.4331 6.4149 1.0288
05/03/2009 6.1 5.98 6.06 810584 816.5025 1.7475 6.242 0.9772
06/03/2009 5.845 5.95 6.00 331079 796.7618 1.7144 6.0427 0.9672
09/03/2009 5.4 5.2 5.28 504271 744.0833 1.6449 5.4076 0.9985
10/03/2009 5.93 5.59 5.595 906742 814.2862 1.4128 5.8434 1.0148
but am unsure of the syntax for a blank data field. What should that be?
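For what it's worth: a KeyError: 'g' usually means the column itself is absent rather than blank, so check df.columns first. Once both columns exist, a minimal sketch, assuming the blank cells are parsed as NaN (pandas' usual marker for missing data):
import numpy as np
import pandas as pd

# if the blanks are NaN, multiplication propagates them automatically,
# so rows missing a or g simply get NaN in h
df["h"] = df["a"] * df["g"]

# equivalent np.where form: NaN is the "blank", tested with isna()
df["h"] = np.where(df["a"].isna() | df["g"].isna(), np.nan, df["a"] * df["g"])

# if the blanks were read as empty strings instead, coerce to numeric first
df["h"] = pd.to_numeric(df["a"], errors="coerce") * pd.to_numeric(df["g"], errors="coerce")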
I'm trying to pull information off of this web page (which provides an AJAX call to this page).
I'm able to print out the whole page, but the find_all function just returns a blank list. What am I doing wrong?
from bs4 import BeautifulSoup
import requests
url = "http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL®ion=usa&culture=en-US&cur=&order=asc&_=1653673850919"
def pageText():
    result = requests.get(url)
    doc = BeautifulSoup(result.text, "html.parser")
    return doc
specialNum = pageText()
print(specialNum)
specialNum = pageText().find_all('literally anything I am trying to pull off of the page')
print(specialNum) #This will always print a blank list
Apologies if this is a stupid question. I'm a bit of a beginner.
EDIT
As mentioned by @furas, if you remove the parameter and value callback=jsonp1653673850875 from the URL, the server will send pure JSON and you can get the HTML directly via r.json()['componentData'].
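A minimal sketch of that variant (the callback and cache-buster parameters dropped; the rest of the URL kept as in the original):
import requests
from bs4 import BeautifulSoup

# without callback=... the server responds with plain JSON
url = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc'
r = requests.get(url)
soup = BeautifulSoup(r.json()['componentData'], 'html.parser')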
The simplest approach, in my opinion, is to unwrap the JSON string and convert it with json.loads() to access the HTML.
From there you can go with beautifulsoup or pandas to scrape the content.
Example beautifulsoup
import json, requests
from bs4 import BeautifulSoup
r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')
soup = BeautifulSoup(
    json.loads(
        r.text.split('(',1)[-1].rsplit(')',1)[0]
    )['componentData'],
    'html.parser'
)
for row in soup.select('table tr'):
    ...
Example pandas
import json, requests
import pandas as pd
r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')
pd.read_html(
    json.loads(
        r.text.split('(',1)[-1].rsplit(')',1)[0]
    )['componentData']
)[0].dropna()
Output
Unnamed: 0                       2012-09  2013-09  2014-09  2015-09  2016-09  2017-09  2018-09  2019-09  2020-09  2021-09      TTM
Revenue USD Mil                   156508   170910   182795   233715   215639   229234   265595   260174   274515   365817   386017
Gross Margin %                      43.9     37.6     38.6     40.1     39.1     38.5     38.3     37.8     38.2     41.8     43.3
Operating Income USD Mil           55241    48999    52503    71230    60024    61344    70898    63930    66288   108949   119379
Operating Margin %                  35.3     28.7     28.7     30.5     27.8     26.8     26.7     24.6     24.1     29.8     30.9
Net Income USD Mil                 41733    37037    39510    53394    45687    48351    59531    55256    57411    94680   101935
Earnings Per Share USD              1.58     1.42     1.61     2.31     2.08      2.3     2.98     2.97     3.28     5.61     6.15
Dividends USD                       0.09     0.41     0.45     0.49     0.55      0.6     0.68     0.75      0.8     0.85     0.88
Payout Ratio % *                       —     27.4     28.5     22.3     24.8     26.5     23.7     25.1     23.7     16.3     14.3
Shares Mil                         26470    26087    24491    23172    22001    21007    20000    18596    17528    16865    16585
Book Value Per Share * USD          4.25      4.9     5.15     5.63     5.93     6.46     6.04     5.43     4.26     3.91     4.16
Operating Cash Flow USD Mil        50856    53666    59713    81266    65824    63598    77434    69391    80674   104038   116426
Cap Spending USD Mil               -9402    -9076    -9813   -11488   -13548   -12795   -13313   -10495    -7309   -11085   -10633
Free Cash Flow USD Mil             41454    44590    49900    69778    52276    50803    64121    58896    73365    92953   105793
Free Cash Flow Per Share * USD      1.58     1.61     1.93     2.96     2.24     2.41     2.88     3.07     4.04     5.57        —
Working Capital USD Mil            19111    29628     5083     8768    27863    27831    14473    57101    38321     9355        —
I have a document.gca file that contains specific information I need. I'm trying to extract certain parts of it; the following pattern repeats throughout the text:
#Sta/Elev= xx
(here goes pair numbers)
#Mann
This block repeats several times. My goal is to capture the number pairs in that interval, for every occurrence in my text. How can I extract them? Say I have this:
Sta/Elev= 259
0 2186.31 .3 2186.14 .9 2185.83 1.4 2185.56 2.5 2185.23
3 2185.04 3.6 2184.83 4.7 2184.61 5.6 2184.4 6.4 2184.17
6.9 2183.95 7.5 2183.69 7.6 2183.59 8 2183.35 8.6 2182.92
10.2 2181.47 10.8 2181.03 11.3 2180.63 11.9 2180.27 12.4 2179.97
13 2179.72 13.6 2179.47 14.1 2179.3 14.3 2179.21 14.7 2179.11
15.7 2178.9 17.4 2178.74 17.9 2178.65 20.1 2178.17 20.4 2178.13
20.4 2178.12 21.5 2177.94 22.6 2177.81 22.6 2177.8 22.9 2177.79
24.1 2177.78 24.4 2177.75 24.6 2177.72 24.8 2177.68 25.2 2177.54
Mann= 3 , 0 , 0
0 .2 0 26.9 .2 0 46.1 .2 0
Bank Sta=26.9,46.1
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2176.01,0.3, 56
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
Type RM Length L Ch R = 1 ,2655 ,11.2,11.1,10.5
XS GIS Cut Line=4
858341.2470677761196439.12427935858354.9998313071196457.53292637
858369.2753539641196470.40256485858387.8228168661196497.81690065
Node Last Edited Time=Aug/05/2019 11:42:02
Sta/Elev= 245
0 2191.01 .8 2190.54 2.5 2189.4 5 2187.76 7.2 2186.4
8.2 2185.73 9.5 2184.74 10.1 2184.22 10.3 2184.04 10.8 2183.55
12.8 2180.84 13.1 2180.55 13.3 2180.29 13.9 2179.56 14.2 2179.25
14.5 2179.03 15.8 2178.18 16.4 2177.81 16.7 2177.65 17 2177.54
17.1 2177.51 17.2 2177.48 17.5 2177.43 17.6 2177.4 17.8 2177.39
18.3 2177.37 18.8 2177.37 19.7 2177.44 20 2177.45 20.6 2177.45
20.7 2177.45 20.8 2177.44 21 2177.42 21.3 2177.41 21.4 2177.4
21.7 2177.32 22 2177.26 22.1 2177.21 22.2 2177.13 22.5 2176.94
22.6 2176.79 22.9 2176.54 23.2 2176.19 23.5 2175.88 23.9 2175.68
24.4 2175.55 24.6 2175.54 24.8 2175.53 24.9 2175.53 25.1 2175.54
25.7 2175.63 26 2175.71 26.3 2175.78 26.4 2175.8 26.4 2175.82
#Mann= 3 , 0 , 0
0 .2 0 22.9 .2 0 43 .2 0
Bank Sta=22.9,43
XS Rating Curve= 0 ,0
XS HTab Starting El and Incr=2175.68,0.3, 51
XS HTab Horizontal Distribution= 0 , 0 , 0
Exp/Cntr(USF)=0,0
Exp/Cntr=0.3,0.1
But I want to select the numbers between Sta/Elev and Mann and save them as pairs of vectors, one set for each Sta/Elev. Right now I have this:
import re
with open('a.g01','r') as file:
    file_contents = file.read()
    #print(file_contents)
try:
    found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
except AttributeError:
    found = '' # apply your error handling
print(found)
found is empty, and I want to capture all the numbers in the interval between '#Sta/Elev' and '#Mann'.
The problem is in your regex; try switching
found = re.search('#Sta/Elev(.+?)#Mann',file_contents).group(1)
to
found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
output:
>>> import re
>>> file_contents = 'Sta/ElevthisisatestMann'
>>> found = re.search('Sta/Elev(.*)Mann',file_contents).group(1)
>>> print(found)
thisisatest
Edit:
For multiline matching try adding the DOTALL parameter:
found = re.search('Sta/Elev=(.*)Mann',file_contents, re.DOTALL).group(1)
It was not clear to me what the separating string is, since it differs between your examples, but you can just change it in the regex.
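To capture every Sta/Elev block separately rather than one greedy match across the whole file, a sketch using a non-greedy pattern with re.findall, assuming the layout shown above:
import re

with open('a.g01','r') as file:
    file_contents = file.read()

# non-greedy (.*?) plus re.DOTALL stops each match at the first
# following "Mann", yielding one capture per Sta/Elev block
blocks = re.findall(r'Sta/Elev=\s*\d+(.*?)Mann', file_contents, re.DOTALL)

# split each block into floats and group them into (sta, elev) pairs
pairs_per_block = []
for block in blocks:
    nums = [float(x) for x in block.split()]
    pairs_per_block.append(list(zip(nums[::2], nums[1::2])))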
I am currently trying to slice a MultiIndex DataFrame that has three levels, by position.
I am using pandas 0.19.1
Level0 Level1 Level2 Value
03-00368 A Item111 6.9
03-00368 A Item333 19.2
03-00368 B Item111 9.7
03-00368 B Item222 17.4
04-00176 C Item110 17.4
04-00176 C Item111 9.7
04-00246 D Item46 12.5
04-00246 D Item66 5.6
04-00246 D Item99 11.2
04-00247 E Item23 12.5
04-00247 E Item24 5.6
04-00247 E Item111 11.2
04-00247 F Item23 7.9
04-00247 F Item24 9.7
04-00247 F Item111 12.5
04-00247 G Item46 11.2
04-00247 G Item66 9.7
04-00247 G Item999 9.7
04-00247 H Item23 11.2
04-00247 H Item94 7.9
04-00247 H Item111 11.2
04-00247 I Item46 5.6
04-00247 I Item66 12.5
04-00247 I Item888 11.2
04-00353 J Item66 12.5
04-00353 J Item99 12.5
04-00354 K Item43 12.5
04-00354 K Item94 12.5
04-00355 L Item54 50
04-00355 L Item99 50
Currently I can achieve:
df.loc[(slice('03-00368', '04-00361'), slice(None), slice(None)), :]
But in practice I won't know what the labels will be. I just want to select the first ten level-0 labels, so I tried this (and many similar things):
>>> df.iloc[(slice(0, 10), slice(None), slice(None)), :]
TypeError: unorderable types: int() >= NoneType()
The end goal is to limit the final number of rows displayed without breaking up the Level0 index:
>>> df.iloc[(0, 1), :]
Level0 Level1 Level2 Value
03-00368 A Item111 6.9
03-00368 A Item333 19.2
Notice that it only returned the first two rows, I would like the result to be:
Level0 Level1 Level2 Value
03-00368 A Item111 6.9
03-00368 A Item333 19.2
03-00368 B Item111 9.7
03-00368 B Item222 17.4
04-00176 C Item110 17.4
04-00176 C Item111 9.7
There are hacky ways to accomplish this, but I'm posting because I want to know what I am doing wrong, or why I can't expect to be able to slice MultiIndexes this way.
method 1
groupby + head
df.groupby(level=0).head(10)
method 2
Unnecessarily verbose
IndexSlice
df.sort_index().loc[pd.IndexSlice[df.index.levels[0][:10], :, :], :]
method 3
loc
df.loc[df.index.levels[0][:10].tolist()]
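A further sketch along the same lines, using get_level_values so the index need not be sorted:
# first ten distinct level-0 labels, in order of appearance
first_ten = df.index.get_level_values(0).unique()[:10]
df[df.index.get_level_values(0).isin(first_ten)]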
You could groupby level and take the top two this way
df.groupby(level=0).head(2)
I have a DataFrame like so:
In [10]: df.head()
Out[10]:
sand silt clay rho_b ... n \
5 25 60 5 25 60 5 25 60 5 ... 60
STID ...
ACME 73.0 60.3 52.5 19.7 23.9 25.9 7.2 15.7 21.5 1.27 ... 1.32
ADAX 61.1 51.1 47.6 22.0 25.4 24.6 16.9 23.5 27.8 1.01 ... 1.25
ALTU 23.8 17.8 14.3 40.0 45.2 40.9 36.2 37.0 44.8 1.57 ... 1.18
ALV2 33.3 21.2 19.8 31.4 29.7 29.8 35.3 49.1 50.5 1.66 ... 1.20
ANT2 55.6 57.5 47.7 34.9 31.1 26.8 9.4 11.3 25.5 1.49 ... 1.29
So for every STID (e.g. ACME, ADAX, ALTU), there's some property (e.g. sand, silt, clay) defined at three depths (5, 25, 60).
This structure makes it really easy to do per-depth calculations at each STID, e.g.:
In [12]: (df['sand'] + df['silt']).head()
Out[12]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
How can I neatly incorporate a calculated result back in to the DataFrame? For example, if I wanted to call the result of the above calculation 'notclay':
In [13]: df['notclay'] = df['sand'] + df['silt']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-a30bd9ba99c3> in <module>()
----> 1 df['notclay'] = df['sand'] + df['silt']
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
A destination column is expected for each of the three columns in the result, not just the single 'notclay' label.
I do have a solution using strict assignments, but I'm not very satisfied with it:
In [21]: df[[('notclay', 5), ('notclay', 25), ('notclay', 60)]] = df['sand'] + df['silt']
In [22]: df['notclay'].head()
Out[22]:
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
I have many other calculations to do similar to this one, and using a strict assignment every time seems tedious. I'm guessing there's a better/"right" way to do this. I think add a field in pandas dataframe with MultiIndex columns might contain the answer, but I don't understand the solutions very well (or even what a Panel is and whether it can help me).
Edit: Something I tried that doesn't work, prepending a category using concat:
In [36]: concat([df['sand'] + df['silt']], axis=1, keys=['notclay']).head()
Out[36]:
notclay
5 25 60
STID
ACME 92.7 84.2 78.4
ADAX 83.1 76.5 72.2
ALTU 63.8 63.0 55.2
ALV2 64.7 50.9 49.6
ANT2 90.5 88.6 74.5
In [37]: df['notclay'] = concat([df['sand'] + df['silt']], axis=1, keys=['notclay'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<snip>
ValueError: Wrong number of items passed 3, placement implies 1
Same ValueError raised as above.
Depending on your taste, this may be a nicer way to do it still using concat:
In [53]: df
Out[53]:
blah foo
1 2 3 1 2 3
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406
In [54]: newdf
Out[54]:
1 2 3
a 0.433377 0.806679 0.976298
b 0.593683 0.217415 0.086565
c 0.716244 0.908777 0.180252
d 0.031942 0.074283 0.745019
e 0.651517 0.393569 0.861616
In [56]: newdf.columns=pd.MultiIndex.from_product([['bar'], newdf.columns])
In [57]: pd.concat([df, newdf], axis=1)
Out[57]:
blah foo bar \
1 2 3 1 2 3 1
a 0.351045 0.044654 0.855627 0.839725 0.675183 0.325324 0.433377
b 0.610374 0.394499 0.924708 0.924303 0.404475 0.885368 0.593683
c 0.116418 0.487866 0.190669 0.283535 0.862869 0.346477 0.716244
d 0.771014 0.204143 0.143449 0.848520 0.887373 0.220083 0.031942
e 0.103268 0.306820 0.277125 0.627272 0.631019 0.386406 0.651517
2 3
a 0.806679 0.976298
b 0.217415 0.086565
c 0.908777 0.180252
d 0.074283 0.745019
e 0.393569 0.861616
In order to store this in the original dataframe, you can simply assign the result in the last line:
In [58]: df = pd.concat([df, newdf], axis=1)
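For the 'notclay' case above, an alternative sketch is to compute the result once and assign each depth in a loop, which avoids both concat and hand-written tuples (assuming the same column layout):
# compute the sum once, then write each depth under the new top-level label
result = df['sand'] + df['silt']
for depth in result.columns:
    df[('notclay', depth)] = result[depth]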
I want to extract the names of comets from my table, which is held in a text file. However, some comet names are one word, others are two words, and some are three words. My table looks like this:
9P/Tempel 1 1.525 0.514 10.5 5.3 2.969
27P/Crommelin 0.748 0.919 29.0 27.9 1.484
126P/IRAS 1.713 0.697 45.8 13.4 1.963
177P/Barnard 1.107 0.954 31.2 119.6 1.317
P/2008 A3 (SOHO) 0.049 0.984 22.4 5.4 1.948
P/2008 Y11 (SOHO) 0.046 0.985 24.4 5.3 1.949
C/1991 L3 Levy 0.983 0.929 19.2 51.3 1.516
However, I know that the comet name runs from character 5 to character 37. How can I write code to tell Python that the first column spans characters 5 to 37?
data = """9P/Tempel 1 1.525 0.514 10.5 5.3 2.969
27P/Crommelin 0.748 0.919 29.0 27.9 1.484
126P/IRAS 1.713 0.697 45.8 13.4 1.963
177P/Barnard 1.107 0.954 31.2 119.6 1.317
P/2008 A3 (SOHO) 0.049 0.984 22.4 5.4 1.948
P/2008 Y11 (SOHO) 0.046 0.985 24.4 5.3 1.949
C/1991 L3 Levy 0.983 0.929 19.2 51.3 1.516""".split('\n')
To read the whole file you can use
f = open('data.txt', 'r').readlines()
It seems that your file has fixed-width columns that you can use.
If you're only interested in the first column, then:
len("9P/Tempel 1 ")
It gives 33.
So, to extract the first column:
for line in data:
    print(line[:33].strip())
Here's what's printed:
9P/Tempel 1
27P/Crommelin
126P/IRAS
177P/Barnard
P/2008 A3 (SOHO)
P/2008 Y11 (SOHO)
C/1991 L3 Levy
If what you want is:
Tempel 1
Crommelin
IRAS
...
You have to use a regular expression.
Example:
import re

reg = r'.*?/[\d\s]*(.*)'
print(re.match(reg, '27P/Crommelin').group(1))
print(re.match(reg, 'C/1991 L3 Levy').group(1))
Here's the output:
Crommelin
L3 Levy
You can also take a look at read_fwf from the pandas library.
It allows you to parse your file by specifying the number of characters per column.
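A sketch for the comet table, where the column boundaries after the 33-character name field are hypothetical and should be adjusted to the real file's widths:
import pandas as pd

# name spans the first 33 characters (per the len() check above);
# the remaining colspecs and column names are illustrative assumptions
df = pd.read_fwf(
    'data.txt',
    colspecs=[(0, 33), (33, 40), (40, 47), (47, 54), (54, 61), (61, 68)],
    names=['name', 'c1', 'c2', 'c3', 'c4', 'c5'],
)
print(df['name'])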