I would like to know how I can transform the day columns into week columns.
I tried groupby.sum(), but there is no column name pattern, so I don't know what to group by.
The result should have column names like 'weekX': "week1" (the sum of the first 7 days), "week2", "week3", and so on.
Thanks in advance.
You can try:
idx = pd.RangeIndex(len(df.columns[4:])) // 7
out = df.iloc[:, 4:].groupby(idx, axis=1).sum().rename(columns=lambda x:f'Week{x+1}')
out = pd.concat([df.iloc[:, :4], out], axis=1)
print(out)
# Output
Province/State Country/Region Lat ... Week26 Week27 Week28
0 NaN Afghanistan 3.393.911 ... 247210 252460 219855
1 NaN Albania 411.533 ... 28068 32671 32113
2 NaN Algeria 280.339 ... 157675 187224 183841
3 NaN Andorra 425.063 ... 6147 6283 5552
4 NaN Angola -112.027 ... 4741 6341 6978
.. ... ... ... ... ... ... ...
261 NaN Sao Tome and Principe 1.864 ... 5199 5813 5231
262 NaN Yemen 15.552.727 ... 11089 11717 10363
263 NaN Comoros -116.455 ... 2310 2419 2292
264 NaN Tajikistan 38.861 ... 47822 50032 44579
265 NaN Lesotho -29.61 ... 2259 3011 3922
[266 rows x 32 columns]
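The trick here is the grouping key: pd.RangeIndex(len(df.columns[4:])) // 7 maps every 7 consecutive date columns to the same label (0, 1, 2, ...), so the column-wise groupby(..., axis=1).sum() adds them up week by week before the labels are renamed to Week1, Week2, and so on. (Note that axis=1 in groupby is deprecated in newer pandas versions; grouping the transposed frame, df.iloc[:, 4:].T.groupby(idx).sum().T, should give the same result.)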
You can use the melt method to combine all your date columns into a single 'Date' column:
df = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'], var_name='Date', value_name='Value')
From this point it should be straightforward to group the 'Date' column by week, and then unstack it if you want the weeks as multiple columns again.
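A minimal sketch of that approach (assuming the day column headers parse with pd.to_datetime; dropna=False keeps the rows whose Province/State is NaN, and the weekly frequency is a choice you may want to adjust):
long_df = df.melt(id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
                  var_name='Date', value_name='Value')
long_df['Date'] = pd.to_datetime(long_df['Date'])
weekly = (long_df
          .groupby(['Province/State', 'Country/Region', 'Lat', 'Long',
                    pd.Grouper(key='Date', freq='W')], dropna=False)['Value']
          .sum()
          .unstack('Date'))
The resulting columns are week-ending dates; renaming them to Week1, Week2, ... is then a simple rename step.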
I'm trying to extract name, brand, prices and stock microdata from pages listed in a sitemap.xml.
But I'm blocked at the following step; thank you for helping me, as I'm a newbie and can't understand what is blocking me.
Scrape the sitemap.xml to have list of urls : OK
Extract the metadata : OK
Extract the product schema : OK
Extract the products : not OK
Crawl the site and store the products : not OK
Scrape the sitemap.xml to have list of urls : OK
import pandas as pd
import requests
import extruct
from w3lib.html import get_base_url
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import advertools as adv
proximus_sitemap = adv.sitemap_to_df('https://www.proximus.be/iportal/sitemap.xml')
proximus_sitemap = proximus_sitemap[proximus_sitemap['loc'].str.contains('boutique')]
proximus_sitemap = proximus_sitemap[proximus_sitemap['loc'].str.contains('/fr/')]
Extract the metadata : OK
def extract_metadata(url):
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)
    metadata = extruct.extract(r.text,
                               base_url=base_url,
                               uniform=True,
                               syntaxes=['json-ld',
                                         'microdata',
                                         'opengraph'])
    return metadata
metadata = extract_metadata('https://www.proximus.be/fr/id_cr_apple-iphone-13-128gb-blue/particuliers/equipement/boutique/apple-iphone-13-128gb-blue.html')
metadata
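(For context: extruct.extract returns a dict keyed by syntax name, here 'json-ld', 'microdata' and 'opengraph', each mapping to a list of the structured-data items found on the page; that is the structure the helper below walks through.)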
Extract the product schema : OK
def get_dictionary_by_key_value(dictionary, target_key, target_value):
    for key in dictionary:
        if len(dictionary[key]) > 0:
            for item in dictionary[key]:
                if item[target_key] == target_value:
                    return item
Product = get_dictionary_by_key_value(metadata, "#type", "Product")
Product
Extract the products : not OK => error message: KeyError: 'offers'
def get_products(metadata):
    Product = get_dictionary_by_key_value(metadata, "#type", "Product")
    if Product:
        products = []
        for offer in Product['offers']['offers']:
            product = {
                'product_name': Product.get('name', ''),
                'brand': offer.get('description', ''),
                'availability': offer.get('availability', ''),
                'lowprice': offer.get('lowPrice', ''),
                'highprice': offer.get('highPrice', ''),
                'price': offer.get('price', ''),
                'priceCurrency': offer.get('priceCurrency', ''),
            }
            products.append(product)
        return products
Crawl the site and store the products : not OK, as I'm blocked at the previous step
def scrape_products(proximus_sitemap, url='url'):
    df_products = pd.DataFrame(columns=['product_name', 'brand', 'name', 'availability',
                                        'lowprice', 'highprice', 'price', 'priceCurrency'])
    for index, row in proximus_sitemap.iterrows():
        metadata = extract_metadata(row[url])
        products = get_products(metadata)
        if products is not None:
            for product in products:
                df_products = df_products.append(product, ignore_index=True)
    return df_products
df_products = scrape_products(proximus_sitemap, url='loc')
df_products.to_csv('patch.csv', index=False)
df_products.head()
You can simply continue by using the advertools SEO crawler. It has a crawl function that also extracts structured data by default (JSON-LD, OpenGraph, and Twitter).
I tried crawling a sample of ten pages, and this is what the output looks like:
adv.crawl(proximus_sitemap['loc'], 'proximums.jl')
proximus_crawl = pd.read_json('proximums.jl', lines=True)
proximus_crawl.filter(regex='jsonld').columns
Index(['jsonld_#context', 'jsonld_#type', 'jsonld_name', 'jsonld_url',
'jsonld_potentialAction.#type', 'jsonld_potentialAction.target',
'jsonld_potentialAction.query-input', 'jsonld_1_#context',
'jsonld_1_#type', 'jsonld_1_name', 'jsonld_1_url', 'jsonld_1_logo',
'jsonld_1_sameAs', 'jsonld_2_#context', 'jsonld_2_#type',
'jsonld_2_itemListElement', 'jsonld_2_name', 'jsonld_2_image',
'jsonld_2_description', 'jsonld_2_sku', 'jsonld_2_review',
'jsonld_2_brand.#type', 'jsonld_2_brand.name',
'jsonld_2_aggregateRating.#type',
'jsonld_2_aggregateRating.ratingValue',
'jsonld_2_aggregateRating.reviewCount', 'jsonld_2_offers.#type',
'jsonld_2_offers.priceCurrency', 'jsonld_2_offers.availability',
'jsonld_2_offers.price', 'jsonld_3_#context', 'jsonld_3_#type',
'jsonld_3_itemListElement', 'jsonld_image', 'jsonld_description',
'jsonld_sku', 'jsonld_review', 'jsonld_brand.#type',
'jsonld_brand.name', 'jsonld_aggregateRating.#type',
'jsonld_aggregateRating.ratingValue',
'jsonld_aggregateRating.reviewCount', 'jsonld_offers.#type',
'jsonld_offers.lowPrice', 'jsonld_offers.highPrice',
'jsonld_offers.priceCurrency', 'jsonld_offers.availability',
'jsonld_offers.price', 'jsonld_offers.offerCount',
'jsonld_1_itemListElement', 'jsonld_2_offers.lowPrice',
'jsonld_2_offers.highPrice', 'jsonld_2_offers.offerCount',
'jsonld_itemListElement'],
dtype='object')
These are some of the columns you might be interested in (containing price, currency, availability, etc.)
(Shown transposed for readability: each row is one of the crawl columns listed above, and columns 0-9 are the ten sampled pages.)
crawl column                   0    1           2        3    4           5                    6        7                    8    9
jsonld_2_description           nan  Numéro 7    nan      nan  nan         nan                  Apple    Apple                nan  nan
jsonld_2_offers.priceCurrency  nan  EUR         nan      nan  nan         nan                  EUR      EUR                  nan  nan
jsonld_2_offers.availability   nan  OutOfStock  nan      nan  nan         nan                  InStock  LimitedAvailability  nan  nan
jsonld_2_offers.price          nan  369.99      nan      nan  nan         nan                  589.99   589.99               nan  nan
jsonld_description             nan  nan         Apple    nan  Huawei      Apple                nan      nan                  nan  nan
jsonld_offers.lowPrice         nan  nan         81.82    nan  nan         81.82                nan      nan                  nan  nan
jsonld_offers.priceCurrency    nan  nan         EUR      nan  EUR         EUR                  nan      nan                  nan  nan
jsonld_offers.availability     nan  nan         InStock  nan  OutOfStock  LimitedAvailability  nan      nan                  nan  nan
jsonld_offers.price            nan  nan         487.6    nan  330.57      487.6                nan      nan                  nan  nan
jsonld_2_offers.lowPrice       nan  nan         nan      nan  nan         nan                  99       99                   nan  nan
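If you then want a tidy products table, one possible follow-up is to keep only the offer-related columns (a sketch; the exact column names vary from page to page, so reindex is used to tolerate missing ones):
cols = ['url',
        'jsonld_2_offers.price', 'jsonld_2_offers.priceCurrency', 'jsonld_2_offers.availability',
        'jsonld_offers.price', 'jsonld_offers.priceCurrency', 'jsonld_offers.availability',
        'jsonld_offers.lowPrice', 'jsonld_offers.highPrice']
df_products = proximus_crawl.reindex(columns=cols)
df_products.to_csv('proximus_products.csv', index=False)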
How can I modify the output from what it currently is into the arrangement described at the bottom? I've tried stacking and un-stacking, but I can't seem to hit the nail on the head. Help would be highly appreciated.
My code:
portfolio_count = 0
Equity_perportfolio = []
Portfolio_sequence = []
while portfolio_count < 1:
    # declaring the candidate list (renamed to avoid shadowing the built-in `list`)
    ticker_list = Tickers
    portfolio_count = portfolio_count + 1
    # initializing the value of n (number of assets in the portfolio)
    n = 5
    # drawing n random tickers for the potential portfolio
    potential_portfolio = random.sample(ticker_list, n)
    print("Portfolio number", portfolio_count)
    print(potential_portfolio)
    # Pull 'relevant data' about the selected stocks (Yahoo API?)
    # 1. df with index Date and closing prices
    price_data_close = web.get_data_yahoo(potential_portfolio,
                                          start='2012-01-01',
                                          end='2021-03-31')['Close']
    price_data = web.get_data_yahoo(potential_portfolio,
                                    start='2012-01-01',
                                    end='2021-03-31')
    print(price_data)
Which gives me the following structure (ignore the NaNs):
Attributes Adj Close ... Volume
Symbols D HOLX PSX ... PSX MGM PG
Date ...
2012-01-03 36.209511 17.840000 NaN ... NaN 25873300.0 11565900.0
2012-01-04 35.912926 17.910000 NaN ... NaN 14717400.0 10595400.0
2012-01-05 35.837063 18.360001 NaN ... NaN 12437500.0 10085300.0
2012-01-06 35.471519 18.570000 NaN ... NaN 9079700.0 8421200.0
2012-01-09 35.423241 18.520000 NaN ... NaN 15750100.0 7836100.0
... ... ... ... ... ... ... ...
2021-03-25 75.220001 71.050003 82.440002 ... 2613300.0 9601500.0 7517300.0
2021-03-26 75.779999 73.419998 84.309998 ... 2368900.0 7809100.0 10820100.0
2021-03-29 76.699997 74.199997 82.529999 ... 1880600.0 7809700.0 11176000.0
2021-03-30 75.529999 73.870003 82.309998 ... 1960600.0 5668500.0 8090600.0
2021-03-31 75.959999 74.379997 81.540001 ... 2665200.0 7029900.0 9202600.0
However, I wanted it to output in this format:
Date Symbols Open High Low Close Volume Adjusted
04/12/2020 MMM 172.130005 173.160004 171.539993 172.460007 2663600 171.050461
07/12/2020 MMM 171.720001 172.5 169.179993 170.149994 2526800 168.759323
08/12/2020 MMM 169.740005 172.830002 169.699997 172.460007 1730800 171.050461
08/12/2020 MMM 169.740005 172.830002 169.699997 172.460007 1730800 171.050461
11/12/2020 D 172.300003 174.649994 172.169998 174.020004 1875700 172.597702
11/12/2020 D 172.300003 174.649994 172.169998 174.020004 1875700 172.597702
11/12/2020 D 172.300003 174.649994 172.169998 174.020004 1875700 172.597702
14/12/2020 D 175.669998 176.199997 172.990005 173.080002 3700100 171.66539
14/12/2020 D 175.669998 176.199997 172.990005 173.080002 3700100 171.66539
14/12/2020 PSX 175.669998 176.199997 172.990005 173.080002 3700100 171.66539
14/12/2020 PSX 175.669998 176.199997 172.990005 173.080002 3700100 171.66539
15/12/2020 PSX 174.389999 175.059998 172.550003 174.679993 2270600 173.252304
18/12/2020 PSX 176.759995 177.460007 175.110001 176.419998 4682000 174.978088
18/12/2020 PSX 176.759995 177.460007 175.110001 176.419998 4682000 174.978088
23/12/2020 PG 175.300003 175.809998 173.960007 173.990005 1762600 172.567963
28/12/2020 PG 175.309998 176.399994 174.389999 174.710007 1403000 173.282074
29/12/2020 PG 175.550003 175.639999 173.149994 173.850006 1218900 172.429108
31/12/2020 PG 174.119995 174.869995 173.179993 174.789993 1841300 173.361404
05/01/2021 PG 172.009995 173.25 170.649994 171.580002 2295300 170.177643
07/01/2021 MMM 171.559998 173.460007 166.160004 169.720001 5863400 168.332855
07/01/2021 MMM 171.559998 173.460007 166.160004 169.720001 5863400 168.332855
07/01/2021 MMM 171.559998 173.460007 166.160004 169.720001 5863400 168.332855
08/01/2021 MMM 169.169998 169.539993 164.610001 166.619995 4808100 165.258179
13/01/2021 MMM 167.270004 167.740005 166.050003 166.279999 2098000 164.920959
15/01/2021 MMM 165.630005 166.259995 163.380005 165.550003 3550700 164.19693
19/01/2021 MMM 167.259995 169.550003 166.800003 169.119995 3903200 167.737747
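A minimal sketch of one way to get that layout (assuming the column MultiIndex levels are named 'Attributes' and 'Symbols', as in the printout above):
long_data = (price_data
             .stack(level='Symbols')   # move the tickers from the columns into the rows
             .rename_axis(['Date', 'Symbols'])
             .reset_index()
             .sort_values(['Date', 'Symbols']))
print(long_data.head())
Renaming and reordering the resulting columns (Open, High, Low, Close, Volume, Adj Close) to match the table above is then a plain rename/column-selection step.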
I have the following dataframe:
Attributes Adj Close
Symbols ADANIPORTS.NS ASIANPAINT.NS AXISBANK.NS BAJAJ-AUTO.NS BAJFINANCE.NS BAJAJFINSV.NS BHARTIARTL.NS INFRATEL.NS BPCL.NS BRITANNIA.NS ... TCS.NS TATAMOTORS.NS TATASTEEL.NS TECHM.NS TITAN.NS ULTRACEMCO.NS UPL.NS VEDL.NS WIPRO.NS ZEEL.NS
month day
1 1 279.239893 676.232860 290.424052 2324.556588 974.134152 3710.866499 290.157978 243.696764 146.170036 950.108271 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 240.371331 507.737111 236.844831 2340.821987 718.111446 3042.034076 277.125503 236.177303 122.136606 733.759396 ... -2.714824 2.830603 109.334502 -17.856865 13.293902 18.980020 0.689529 -0.006994 -3.862265 -10.423989
3 241.700116 498.997079 213.632179 2368.956136 746.050460 3292.162304 279.075750 231.213816 114.698633 686.986466 ... 0.075497 -0.629591 -0.241416 -0.260787 1.392858 -1.196444 -0.660421 -0.161608 -0.243293 -1.687734
4 223.532480 439.849441 201.245454 2391.910913 499.554044 2313.025635 287.582485 276.568762 104.650728 603.446742 ... -1.270405 0.178012 0.109399 -0.224380 -0.415277 -5.050810 -0.084462 -0.075032 3.924894 0.959136
5 213.588413 359.632790 187.594303 2442.596619 309.180993 1587.324934 260.401816 305.384079 95.571235 475.708696 ... -0.995601 -1.093621 0.214684 -1.189623 -2.503186 -0.511994 -0.512211 0.693024 -1.025715 -1.516946
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12 27 238.901700 500.376711 227.057510 2413.230611 748.599821 3299.320564 276.806537 242.597250 124.235449 727.263012 ... 2.770155 -4.410527 -0.031403 -5.315438 -1.792164 1.038870 -0.860125 -1.258880 -0.933370 -1.487581
28 236.105050 461.535601 218.893424 2375.671582 542.521903 2613.480190 284.374906 264.309625 117.807956 681.625725 ... -0.614677 -1.045941 0.688749 -0.375988 1.848569 -1.362454 37.301528 4.794349 -21.079648 -2.224608
29 215.606034 372.030459 203.876520 2450.112244 324.772498 1765.010912 257.278008 300.096024 108.679112 543.112336 ... 3.220893 -28.873421 0.197491 0.649738 0.737047 -6.121189 -1.165286 0.197648 0.250269 -0.064486
30 205.715512 432.342895 235.872734 2279.715479 515.535031 2164.257183 237.584375 253.401642 116.322402 634.503822 ... -1.190093 0.111826 -1.100066 -0.274475 -1.107278 -0.638013 -7.148901 -0.594369 -0.622608 0.368726
31 222.971462 490.784491 246.348255 2211.909688 670.891505 2671.694809 260.623987 230.032092 108.617400 719.389436 ... -1.950700 0.994181 -11.328524 -1.575859 -8.297147 1.151578 -0.059656 -0.650074 -0.648105 -0.749307
366 rows × 601 columns
To select the row which is month 1 and day 1, I have used the following code:
df.query('month ==1' and 'day ==1')
But this produced the following dataframe:
Attributes Adj Close
Symbols ADANIPORTS.NS ASIANPAINT.NS AXISBANK.NS BAJAJ-AUTO.NS BAJFINANCE.NS BAJAJFINSV.NS BHARTIARTL.NS INFRATEL.NS BPCL.NS BRITANNIA.NS ... TCS.NS TATAMOTORS.NS TATASTEEL.NS TECHM.NS TITAN.NS ULTRACEMCO.NS UPL.NS VEDL.NS WIPRO.NS ZEEL.NS
month day
1 1 279.239893 676.232860 290.424052 2324.556588 974.134152 3710.866499 290.157978 243.696764 146.170036 950.108271 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1 215.752040 453.336287 213.741552 2373.224390 517.295897 2289.618629 280.212598 253.640594 104.505893 620.435294 ... -2.526060 -1.059128 -2.052233 3.941005 25.233763 -41.377432 1.032536 7.398859 -4.622867 -1.506376
3 1 233.534958 472.889636 204.900776 2318.030298 561.193189 2697.357413 254.006857 250.426263 106.528327 649.475321 ... -2.269081 -1.375370 -1.734496 27.675276 -1.944131 0.401074 -0.852499 -0.119033 -1.723600 -1.930760
4 1 192.280787 467.604906 227.369618 1982.318034 506.188324 1931.920305 252.626459 226.062386 98.663596 637.086713 ... -0.044923 -0.111909 -0.181328 -1.943672 1.983368 -1.677000 -0.531217 0.032385 -0.956535 -2.015332
5 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
6 1 230.836429 509.991614 218.370072 2463.180957 526.564244 2231.603166 289.425584 298.146594 118.566019 754.736115 ... -0.807933 -1.509616 1.792957 10.396550 -1.060003 2.008286 1.029651 6.690478 -3.114476 0.766063
7 1 197.943186 355.930544 242.388461 2168.834937 412.196744 1753.647647 233.189894 241.823186 90.870574 512.000742 ... -1.630295 11.019253 -0.244958 2.188104 -0.505939 -0.564639 -1.747775 -0.394980 -2.736355 -0.140087
8 1 236.361903 491.867703 218.289537 2102.183175 657.764627 2792.688073 264.695685 249.063224 108.213277 662.192035 ... -1.655988 -1.555488 -1.199192 -0.565774 -1.831832 -4.770262 -0.442534 -6.168488 -0.267261 -3.324977
9 1 229.131335 372.101859 225.172708 2322.747894 333.243305 1800.901049 246.923254 287.262203 114.754666 562.854895 ... -2.419973 0.205031 -1.096847 -0.840121 -2.932670 1.719342 6.196965 -2.674245 -6.542936 -2.526353
10 1 208.748352 429.829772 222.081509 2095.421448 553.005620 2204.335371 259.718945 229.177512 102.475334 641.439810 ... 0.752312 -1.371583 -1.367145 -5.607321 3.259092 26.787332 -1.023199 -0.589042 0.507405 2.428903
11 1 248.233805 545.774276 241.743095 2390.945333 803.738236 3088.686081 277.757322 243.703551 131.933623 789.243830 ... -1.882445 -0.660089 -0.476966 -1.097497 -0.525270 -0.857579 -0.702017 0.016806 -0.792296 -0.368364
12 1 200.472858 353.177721 200.870312 2451.274841 295.858735 1556.379498 255.714673 301.000198 103.908244 514.528562 ... -0.789445 -14.382776 0.196276 -0.394203 7.600042 48.345830 -0.276618 -0.411825 2.271997 42.734886
12 rows × 601 columns
It has produced day 1 for every month instead of the single row for month 1, day 1. What can I do to resolve this issue?
Remove the extra quotes and pass one string for the whole condition:
df.query('month == 1 and day == 1')
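The original version fails because the two strings are combined by Python's and before pandas ever sees them, and and between two non-empty strings simply returns the second operand:
'month ==1' and 'day ==1'   # evaluates to 'day ==1', so only the day filter was applied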
I have an issue calculating the rolling mean for a column I added in the code. For some reason, it doesn't work on the column I added but works on a column from the original csv.
Original dataframe from the csv as follow:
Open High Low Last Change Volume Open Int
Time
09/20/19 98.50 99.00 98.35 98.95 0.60 3305.0 0.0
09/19/19 100.35 100.75 98.10 98.35 -2.00 17599.0 0.0
09/18/19 100.65 101.90 100.10 100.35 0.00 18258.0 121267.0
09/17/19 103.75 104.00 100.00 100.35 -3.95 34025.0 122453.0
09/16/19 102.30 104.95 101.60 104.30 1.55 21403.0 127447.0
Ticker = pd.read_csv('\\......\Historical data\kcz19 daily.csv',
index_col=0, parse_dates=True)
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1)).fillna('')
Ticker['ret20'] = Ticker['Return'].rolling(window=20, win_type='triang').mean()
print(Ticker.head())
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 -0.00608213
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 0.0201315
09/17/19 103.75 104.00 100.00 ... 122453.0 0 0
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 0.0386073
The ret20 column should hold the rolling mean of the Return column, so it should show data starting from row 21, whereas here it is only a copy of the Return column.
If I replace Return with the Last column it works.
Below is the result using column Last:
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0 NaN
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 NaN
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 NaN
09/17/19 103.75 104.00 100.00 ... 122453.0 0 NaN
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 NaN
09/13/19 103.25 103.60 102.05 ... 128707.0 -0.0149725 NaN
09/12/19 102.80 103.85 101.15 ... 128904.0 0.00823848 NaN
09/11/19 102.00 104.70 101.40 ... 132067.0 -0.00193237 NaN
09/10/19 98.50 102.25 98.00 ... 135349.0 -0.0175614 NaN
09/09/19 97.00 99.25 95.30 ... 137347.0 -0.0335283 NaN
09/06/19 95.35 97.30 95.00 ... 135399.0 -0.0122889 NaN
09/05/19 96.80 97.45 95.05 ... 136142.0 -0.0171477 NaN
09/04/19 95.65 96.95 95.50 ... 134864.0 0.0125002 NaN
09/03/19 96.00 96.60 94.20 ... 134685.0 -0.0109291 NaN
08/30/19 95.40 97.20 95.10 ... 134061.0 0.0135137 NaN
08/29/19 97.05 97.50 94.75 ... 132639.0 -0.0166584 NaN
08/28/19 97.40 98.15 95.95 ... 130573.0 0.0238601 NaN
08/27/19 97.35 98.00 96.40 ... 129921.0 -0.00410889 NaN
08/26/19 95.55 98.50 95.25 ... 129003.0 0.0035962 NaN
08/23/19 96.90 97.40 95.05 ... 130268.0 -0.0149835 98.97775
Appreciate any help
The .fillna('') puts a string in the first row, which turns the column into object dtype and breaks the rolling calculation for Ticker['ret20'].
Delete it and the code will run fine:
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1))
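With that change the first row of Return stays NaN (there is no previous price to compare against), and ret20 should start showing values once 20 observations are available, around row 21:
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1))
Ticker['ret20'] = Ticker['Return'].rolling(window=20, win_type='triang').mean()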