Scrape a JS-rendered site without a Chrome GUI? - python

I am trying to scrape a JS-rendered site using Selenium and BeautifulSoup. The code works fine, but I need to run it on a server that doesn't have Chrome installed. What should I change in the code so that it works without a GUI?
Below is the current code:
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
import json
from selenium.webdriver.common.keys import Keys
url = 'https://www.bigbasket.com/pc/fruits-vegetables/fresh-vegetables/?nc=nb'
chromepath = "/Users/Nitin/Desktop/Milkbasket/Scraping/chromedriver"
driver = webdriver.Chrome(chromepath)
driver.get(url)
#rest of code for fetching prices

I would recommend you ditch the Selenium approach and work on getting the information you need using the built-in urllib libraries or, if possible, the requests library. The information for all of the products can be obtained from the returned JSON data. For example:
import requests

params = {
    "type": "pc",
    "slug": "fresh-vegetables",
    "tab_type": '["all"]',
    "sorted_on": "popularity",
    "listtype": "pc",
}
session = requests.Session()
for page in range(1, 10):
    params['page'] = page
    req_vegetables = session.get("https://www.bigbasket.com/product/get-products", params=params)
    json_vegetables = req_vegetables.json()
    print(f'Page {page}')
    for product in json_vegetables['tab_info']['product_map']['all']['prods']:
        print(f"  {product['p_desc']} - {product['sp']} - {product['mrp']}")
This would give you the following output:
Page 1
Onion - 21.00 - 26.25
Potato - 27.00 - 33.75
Tomato - Hybrid - 40.00 - 50.00
Ladies Finger - 10.00 - 12.50
Cauliflower - 35.00 - 43.75
Palak - 30.00 - 37.50
Potato Onion Tomato 1 kg Each - 88.00 - 110.00
Carrot - Local - 59.00 - 73.75
Capsicum - Green - 89.00 - 111.25
Tomato - Local - 47.00 - 58.75
Mushrooms - Button - 49.00 - 61.25
Cucumber - 25.00 - 31.25
Broccoli - 18.40 - 23.00
Bottle Gourd - 17.00 - 21.25
Cabbage - 32.00 - 40.00
Cucumber - English - 23.00 - 28.75
Tomato - Local, Organically Grown - 29.00 - 36.25
Brinjal - Bottle Shape - 72.00 - 90.00
Onion - Organically Grown - 23.00 - 28.75
Methi - 19.00 - 23.75
Page 2
Bitter Gourd / Karela - 59.20 - 74.00
Beetroot - 40.00 - 50.00
Fresho Palak - Without Root 250 Gm + Amul Malai Paneer 200 Gm - 94.20 - 102.00
Capsicum - Red - 299.00 - 373.75
... etc
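That said, if you do need to keep Selenium on a server with no display, Chrome can also be run headless. Below is a minimal sketch, assuming the Selenium 3 style webdriver.Chrome(path, ...) call used in the question (newer Selenium versions pass the driver path via a Service object instead); the --no-sandbox and --disable-dev-shm-usage flags are common server/container settings, not something required by the question:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")               # run Chrome without opening a window
options.add_argument("--no-sandbox")             # often needed when running as root on a server
options.add_argument("--disable-dev-shm-usage")  # avoids /dev/shm issues in containers

chromepath = "/Users/Nitin/Desktop/Milkbasket/Scraping/chromedriver"
driver = webdriver.Chrome(chromepath, options=options)
driver.get("https://www.bigbasket.com/pc/fruits-vegetables/fresh-vegetables/?nc=nb")
# ...rest of the original code for fetching prices...
driver.quit()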

Related

How to scrape all results from a list instead of being limited to only 20?

I'm using BeautifulSoup and trying to scrape some cars24.com data. The list, however, only contains details for 20 cars. That's odd, since the page contains many more car listings (I tried saving it). What am I doing wrong, and how can I get it to scrape the whole page?
This is my code:
from bs4 import BeautifulSoup as bs
import requests

link = 'https://www.cars24.com/buy-used-car?sort=P&storeCityId=2&pinId=110001'
page = requests.get(link)
soup = bs(page.content, 'html.parser')
car_name = soup.find_all('h2', class_='_3FpCg')
cust_name = []
for i in range(0, len(car_name)):
    cust_name.append(car_name[i].get_text())
cust_name
Is there a workaround for this? Appreciate the help.
Use the API endpoint.
For example:
import requests

url = "https://api-sell24.cars24.team/buy-used-car?sort=P&serveWarrantyCount=true&gaId=&page=1&storeCityId=2&pinId=110001"
cars = requests.get(url).json()['data']['content']
base = "https://www.cars24.com/buy-used-"
for car in cars:
    car_name = "-".join(car['carName'].lower().split())
    car_city = "-".join(car['city'].lower().split())
    offer = f"{base}{car_name}-{car['year']}-cars-{car_city}-{car['carId']}"
    print(f"{car['carName']} - {car['year']} - {car['price']}")
    print(offer)
Output:
Maruti Swift Dzire - 2010 - 256299
https://www.cars24.com/buy-used-maruti-swift-dzire-2010-cars-new-delhi-10084891724
Hyundai Grand i10 - 2018 - 526599
https://www.cars24.com/buy-used-hyundai-grand-i10-2018-cars-noida-10572294761
Datsun Redi Go - 2018 - 234499
https://www.cars24.com/buy-used-datsun-redi-go-2018-cars-gurgaon-11073694705
Maruti Swift - 2020 - 566499
https://www.cars24.com/buy-used-maruti-swift-2020-cars-faridabad-11041770770
Hyundai i10 - 2009 - 170699
https://www.cars24.com/buy-used-hyundai-i10-2009-cars-rohtak-1007315463
Maruti Swift - 2020 - 577399
https://www.cars24.com/buy-used-maruti-swift-2020-cars-new-delhi-10065678773
Hyundai Grand i10 - 2018 - 508799
https://www.cars24.com/buy-used-hyundai-grand-i10-2018-cars-ghaziabad-11261195767
Maruti Swift - 2020 - 587599
https://www.cars24.com/buy-used-maruti-swift-2020-cars-new-delhi-10016194709
Maruti Swift - 2020 - 524099
https://www.cars24.com/buy-used-maruti-swift-2020-cars-new-delhi-10010390743
Hyundai AURA - 2021 - 675099
https://www.cars24.com/buy-used-hyundai-aura-2021-cars-faridabad-11095494760
Maruti Swift - 2019 - 541899
https://www.cars24.com/buy-used-maruti-swift-2019-cars-new-delhi-10016570794
Hyundai Grand i10 - 2019 - 490449
https://www.cars24.com/buy-used-hyundai-grand-i10-2019-cars-noida-10532691707
Hyundai Santro Xing - 2013 - 281999
https://www.cars24.com/buy-used-hyundai-santro-xing-2013-cars-gurgaon-10168291760
Hyundai Santro Xing - 2014 - 272099
https://www.cars24.com/buy-used-hyundai-santro-xing-2014-cars-gurgaon-10121974770
Mercedes Benz C Class - 2014 - 1854499
https://www.cars24.com/buy-used-mercedes-benz-c-class-2014-cars-new-delhi-1050064264
KIA CARENS - 2022 - 1608099
https://www.cars24.com/buy-used-kia-carens-2022-cars-gurgaon-10160777793
Tata ALTROZ - 2021 - 711599
https://www.cars24.com/buy-used-tata-altroz-2021-cars-new-delhi-10083196703
Maruti New Wagon-R - 2020 - 508899
https://www.cars24.com/buy-used-maruti-new-wagon-r-2020-cars-new-delhi-10084875775
Hyundai Grand i10 - 2018 - 509099
https://www.cars24.com/buy-used-hyundai-grand-i10-2018-cars-new-delhi-10011277773
Maruti Wagon R 1.0 - 2011 - 282499
https://www.cars24.com/buy-used-maruti-wagon-r-1.0-2011-cars-new-delhi-10080499706
Note: You can paginate the API by incrementing the value of page in the URL.
For example:
import requests
import pandas as pd

base = "https://www.cars24.com/buy-used-"
table = []
with requests.Session() as s:
    for page in range(1, 11):
        url = f"https://api-sell24.cars24.team/buy-used-car?sort=P&serveWarrantyCount=true&gaId=&page={page}&storeCityId=2&pinId=110001"
        cars = s.get(url).json()['data']['content']
        print(f"Getting page {page}...")
        for car in cars:
            car_name = "-".join(car['carName'].lower().split())
            car_city = "-".join(car['city'].lower().split())
            offer_url = f"{base}{car_name}-{car['year']}-cars-{car_city}-{car['carId']}"
            table.append([car['carName'], car['year'], car['price'], offer_url])

df = pd.DataFrame(table, columns=['Car Name', 'Year', 'Price', 'Offer URL'])
df.to_csv('cars.csv', index=False)
Output: a cars.csv file containing the collected rows.

Web scraping with python - table with multiple tbody elements

I'm trying to scrape the data from the top table on this page ("2021-2022 Regular Season Player Stats") using Python and BeautifulSoup. The page shows stats for 100 NHL players, one player per row. The code below works, but the problem is it only pulls the first ten rows into the dataframe. This is because every ten rows are wrapped in a separate <tbody>, so the loop only iterates through the rows in the first <tbody>. How can I get it to continue through the rest of the <tbody> elements on the page?
Another question: this table has about 1000 rows total, and only shows up to 100 per page. Is there a way to rewrite the code below to iterate through the entire table at once instead of just the 100 rows that show on the page?
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')
table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')
df = pd.DataFrame(columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists', 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])
for row in table.tbody.find_all('tr'):
    columns = row.find_all('td')
    Player = columns[1].text.strip()
    Team = columns[2].text.strip()
    GamesPlayed = columns[3].text.strip()
    Goals = columns[4].text.strip()
    Assists = columns[5].text.strip()
    TotalPoints = columns[6].text.strip()
    PointsPerGame = columns[7].text.strip()
    PIM = columns[8].text.strip()
    PM = columns[9].text.strip()
    df = df.append({"Player": Player, "Team": Team, "GamesPlayed": GamesPlayed, "Goals": Goals, "Assists": Assists, "TotalPoints": TotalPoints, "PointsPerGame": PointsPerGame, "PIM": PIM, "PM": PM}, ignore_index=True)
To load all player stats into a dataframe and save them to a CSV file, you can use the following example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

dfs = []
for page in range(1, 11):
    url = f"https://www.eliteprospects.com/league/nhl/stats/2021-2022?sort=tp&page={page}"
    print(f"Loading {url=}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    df = (
        pd.read_html(str(soup.select_one(".player-stats")))[0]
        .dropna(how="all")
        .reset_index(drop=True)
    )
    dfs.append(df)

df_final = pd.concat(dfs).reset_index(drop=True)
print(df_final)
df_final.to_csv("data.csv", index=False)
Prints:
...
1132 973.0 Austin Poganski (RW) Winnipeg Jets 16 0 0 0 0.00 7 -3.0
1133 974.0 Mikhail Maltsev (LW) Colorado Avalanche 18 0 0 0 0.00 2 -5.0
1134 975.0 Mason Geertsen (D/LW) New Jersey Devils 23 0 0 0 0.00 62 -4.0
1135 976.0 Jack McBain (C) Arizona Coyotes - - - - - - NaN
1136 977.0 Jordan Harris (D) Montréal Canadiens - - - - - - NaN
1137 978.0 Nikolai Knyzhov (D) San Jose Sharks - - - - - - NaN
1138 979.0 Marc McLaughlin (C) Boston Bruins - - - - - - NaN
1139 980.0 Carson Meyer (RW) Columbus Blue Jackets - - - - - - NaN
1140 981.0 Leon Gawanke (D) Winnipeg Jets - - - - - - NaN
1141 982.0 Brady Keeper (D) Vancouver Canucks - - - - - - NaN
1142 983.0 Miles Wood (LW) New Jersey Devils - - - - - - NaN
1143 984.0 Samuel Morin (D/LW) Philadelphia Flyers - - - - - - NaN
1144 985.0 Connor Carrick (D) Seattle Kraken - - - - - - NaN
1145 986.0 Micheal Ferland (LW/RW) Vancouver Canucks - - - - - - NaN
1146 987.0 Jake Gardiner (D) Carolina Hurricanes - - - - - - NaN
1147 988.0 Oscar Klefbom (D) Edmonton Oilers - - - - - - NaN
1148 989.0 Shea Weber (D) Montréal Canadiens - - - - - - NaN
1149 990.0 Brandon Sutter (C/RW) Vancouver Canucks - - - - - - NaN
1150 991.0 Brent Seabrook (D) Tampa Bay Lightning - - - - - - NaN
and saves the results to data.csv.
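If you would rather stay with the original BeautifulSoup loop instead of pd.read_html, another option is to iterate over every <tbody> in the table rather than only the first one. A minimal sketch based on the question's code (the class string and column positions are taken from the question; the guard on the column count is an added assumption):
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.eliteprospects.com/league/nhl/stats/2021-2022'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table', class_='table table-striped table-sortable player-stats highlight-stats season')

rows = []
for tbody in table.find_all('tbody'):        # walk every <tbody>, not just the first
    for row in tbody.find_all('tr'):
        columns = row.find_all('td')
        if len(columns) >= 10:               # skip any short/separator rows
            rows.append([c.text.strip() for c in columns[1:10]])

df = pd.DataFrame(rows, columns=['Player', 'Team', 'GamesPlayed', 'Goals', 'Assists',
                                 'TotalPoints', 'PointsPerGame', 'PIM', 'PM'])
print(df)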

How to get strings between two specific string terms in Pandas

Good evening.
I would like help with how to extract information between two strings in pandas (Python).
Imagine that I have a database of car prices for each car dealer, in which each cell has text similar to this (note: the car dealer column can be the index of each row):
"1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes -
Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery
(caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) -
new:R$90000 / used:R$75500 hatch car"
Thanks for the help!!
Three-step process:
1. split the whole string into line items using re.split()
2. parse out the constituent parts of each line using pandas str.extract
3. finally shape the dataframe as wide format
import re
import pandas as pd
import numpy as np

s = "1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes - Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery (caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) - new:R$90000 / used:R$75500 hatch car"

# there is a car after each "<n> - "; split into lines
df = pd.DataFrame(re.split("[ ]?[0-9] - ", s)).replace("", np.nan).dropna()

# parse out each of the strings
df = df[0].str.extract("(?P<car>.*) \([0-9]\) - new:R\$(?P<new>[0-9]*) \/ used:R\$(?P<used>[0-9]*).*")

# finally format as wide format...
df = (df.melt()
        .assign(car=lambda dfa: dfa.groupby("variable").cumcount(),
                col=lambda dfa: dfa.variable + (dfa.car + 1).astype(str))
        .drop(columns=["variable", "car"])
        .set_index("col")
        .T
)
This produces a single-row wide dataframe:
col    car1  car2             car3          car4    new1   new2    new3   new4   used1  used2  used3  used4
value  Ford  Mercedes - Benz  Chery (caoa)  Others  60000  130000  80000  90000  30000  95000  60000  75500
You could use extractall to get a MultiIndex dataframe that, in summary, contains the dealer, the car number and the values extracted from the regex named groups. After extractall, use stack to reshape the dataframe and the innermost index level; this allows you to set a new index with the format [(dealer, carN), ...] and subsequently groupby that same first index level to keep the capturing order. Append each dealer's data to a list and create the dataframe.
import pandas as pd
import re

df = pd.DataFrame(
    ["1 - Ford (1) - new:R$60000 / used:R$30000 sedan car 2 - Mercedes - Benz (1) - new:R$130000 / used:R$95000 silver sedan car 3 - Chery (caoa) (1) - new:R$80000 / used:R$60000 SUV car 5 - Others (1) - new:R$90000 / used:R$75500 hatch car",
     "2 - Toyota (1) - new:R$10543 / used:R$9020 silver sedan car",
     "3 - Honda (1) - new:R$123600 / used:R$34400 sedan car 2 - Fiat (1) - new:R$1955 / used:R$877 silver sedan car 3 - Cadillac (1) - new:R$174500 / used:R$12999 SUV car"])

regex = re.compile(
    r"\d\s-\s(?P<car>.*?)(?:\s\(\d+\)?\s)-\s"
    r"new:R\$(?P<new>[\d\.\,]+)\s/\s"
    r"used:R\$(?P<used>[\d\.\,]+).*?car"
)

df_out = df[0].str.extractall(regex).stack()
df_out.index = [df_out.index.get_level_values(0),
                df_out.index.map(lambda x: f'{x[2]+str(x[1]+1)}')]

dealers = []
for n, g in df_out.groupby(level=0):
    dealers.append(g.droplevel(0))

df1 = pd.DataFrame(dealers).rename_axis('Dealer')
print(df1)
Output from df1
car1 new1 used1 car2 new2 used2 car3 new3 used3 car4 new4 used4
Dealer
0 Ford 60000 30000 Mercedes - Benz 130000 95000 Chery (caoa) 80000 60000 Others 90000 75500
1 Toyota 10543 9020 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Honda 123600 34400 Fiat 1955 877 Cadillac 174500 12999 NaN NaN NaN

How do I fine-tune regex syntax to parse a tricky dirty text?

I have an OCR'ed .txt file that is a volume with data about book reviews (It's a Book Review Index). I'm attempting to separate authors, titles, and review data. I've been able to cleanly separate authors, but still can't separate titles from review data cleanly. Here is a sample of the .txt file:
MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183 c MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih
BL - v77 - Ja 1 ’81 - p630 MA, Hsin-Teh - Chinese Women In The Great Leap Forward
Choice - v20 - N ’82 - p396 MA, Huan • The Overall Survey Of The Ocean’s Shores 1433
Choice - v8 - 0 ’71 - pl074 MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681 MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279
JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285 MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39 MA, Laurence J C - Urban Development In Modern China
Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39 MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38 MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757 MA, Nancy Chih - Mrs. Ma’s Japanese Cooking
VQR - v58 - Spring ’82 - p68 MA, Tsu Sheng - Microscale Manipulations In Chemistry
Choice-vl3-N ’76 -pi 164 MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320
Pac A - v56 - Winter ’83 - p796 MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555 MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651 MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 1 WP - v25 - O '72 - pl55 MA COY, Ramelle • Short-Time Compensation
Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c MA De - The Cowherd And The Weaving Maid
Cur R - v20 - S '81 -p325 c MA De - Crickets
Cur R - v20 - S '81 - p325 c MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325 MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r MAACK, Mary N - Libraries In Senegal ARBA - vl3 - '82 - pi53 CRL - v45 - Mr '84-pl52 JAL - v7 - S '81 - p244 JLH - vl9 - Spring ’84 - p315 LJ - vl07 - My 1 ’82 - p865 LQ - v52 - Ap '82-pl75 MAACK, Reinhard • Kontinentaldrift Und Geologie Des Sudatlantischen Ozeans GJ - vl36 - Mr '70 - pl38 MAAG, Russell C - Observe And Understand The Sun
S&T - v54 - S ’77 - p221 MAAG, Victor - Hiob
Rel St Rev - vlO - Ap '84 - pi 75 MAAILMA Katettu Poyta
And here is a cleaner version, to show more clearly what I'm trying to separate:
MA, Chi-Hua - Huan Chiu Hsin Ying BL - v70 - Jl 1 ’74 - pi 183 c
MA, Ching-Hsien - Pei Niang Niang Ti Ku Shih BL - v77 - Ja 1 ’81 - p630
MA, Hsin-Teh - Chinese Women In The Great Leap Forward Choice - v20 - N ’82 - p396
MA, Huan • The Overall Survey Of The Ocean’s Shores 1433 Choice - v8 - 0 ’71 - pl074
MA, Huan • Ying-Yai Sheng-Lan AHR - v76 - D ’71 - pl578 GJ - vl37 - Je ’71 - p213 JAS - v31 - N ’71 - pl81 TLS - Je 16 ’72 - p681
MA, Laurence J C - Commercial Development And Urban Change In Sung China 960-1279 JAS - v31 - Ag ’72 - p928 Pac A - v45 - Summer ’72 - p285
MA, Laurence J C - The Environment JAS - v42 - N ’82 - pl39
MA, Laurence J C - Urban Development In Modern China Choice - vl9 - Ja ’82 - p696 JAS - v42 - N 82 - pl39
MA, Nancy Chih - Cook Chinese AB - v45 - My 25 ’70 - pl786 PW - vl97 - Mr 23 ’70 - p38
MA, Nancy Chih • Don’t Lick The Chopsticks CSM - v66 - Ja 10 ’74 - pF2 LJ - v99 - Mr 15 ’74 - p757
MA, Nancy Chih - Mrs. Ma’s Japanese Cooking VQR - v58 - Spring ’82 - p68
MA, Tsu Sheng - Microscale Manipulations In Chemistry Choice-vl3-N ’76 -pi 164
MA, Tsu Sheng - Organic Functional Group Analysis By Gas Chromatography Choice - vl3 - F ’77 - pl624 r
MA, Wei-Yi - A Bibliography Of Chinese-Language Materials On The People's Communes ARBA - vl5 - '84 - p320 Pac A - v56 - Winter ’83 - p796
MA, Wook - Seoul Ro Kanun Kil BL - v78 - 0 15 '81 - p294 y
MA, Y W - Traditional Chinese Stories ANQ - vl8 - 0 ’79 - p30 BF - v4 - Ap 40 '79 - p575 Choice -vl5-Ja ’79 -pl528 HR-v32-Spring'79-pl23 JAS - v38 - Ag '79 - p773 Kliatt - vl3 • Winter '79 - p26 WIT - v53 - Summer '79 - p555
MA, Yun • Shih Ching T'ao Hsing BL - v68 - Ap 1 '72 - p651
MA BRICALL, Josep - Politica Economica De La Generalitat 1936-1939. Vol. 1 WP - v25 - O '72 - pl55
MA COY, Ramelle • Short-Time Compensation Choice - v21 - Jl '84 - pl648 Econ Bks - vll - S ’84 - p62 c
MA De - The Cowherd And The Weaving Maid Cur R - v20 - S '81 -p325 c
MA De - Crickets Cur R - v20 - S '81 - p325 c
MA De - School-Master Dongguo Cur R - v20 - S '81 - p325 c
MA De - Thrice Borrowing The Plantain Fan CurR- v20-S ’81 -p325 c
MA De - The Wonderful Gourds Cur R - v20 - S '81 - p325
MAACK, Berthold - Preussen JMH - v55 - Mr '83 - p71 r
And here's my code:
# read in review volume .txt file
import pandas as pd
import numpy as np
import re

file = '/Users/sinykin/Dropbox/US_LIT_PRODUCTION_DATA/REVIEWS_DATA/BOOK_REVIEWS_INDEX_TEXTS/1965_1984_Vol_5_M-P.txt'
with open(file) as f:
    content = f.readlines()
content = [x.strip() for x in content]
content = " ".join(content)

# Get all authors
pattern = r"[A-Z\-]{2,}[\,]+\s[A-Za-z\s\,\(\)\.]+\s[\-\*\•\.\■ ]{1}"
authors = re.findall(pattern, content)

# Now replace all found authors with XXX_XXX
if re.search(pattern, content):
    r = re.compile(pattern)
    content2 = r.sub(r'XXX_XXX', content)

# Now get all the content for each author
content3 = content2.split('XXX_XXX')
bib = content3[1:]

# Now separate reviews from titles
pattern2 = r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))"
bib2 = "".join(bib)
titles = re.findall(pattern2, bib2)
print(titles[:1000])
It's the regex code in pattern2 that I'm struggling with. For example, it currently gives me this, for what should be titles:
[('The Overall Survey Of The Ocean\xe2\x80\x99s Shores 1433', '3'),
('Ying-Yai Sheng-Lan AHR', 'H'),
('Commercial Development And Urban Change In Sung China 960-1279 JAS', 'A'),
('Pac A', ' '),
('Summer \xe2\x80\x9972', '7'),
('The Environment JAS', 'A'),
('Urban Development In Modern China', 'n'),
('Cook Chinese AB', 'A'),
('Don\xe2\x80\x99t Lick The Chopsticks CSM', 'S'),
('Mrs. Ma\xe2\x80\x99s Japanese Cooking VQR', 'Q'),
('Spring \xe2\x80\x9982', '8'),
('Microscale Manipulations In Chemistry', 'r')
As you can see, I'm getting extra data beyond the end of the titles, especially the capital letters that are abbreviations for the reviewing publications.
Can you help me refine my regex to capture just the titles?
You could keep building up on the final group of the pattern by adding some specific alternatives to screen out the unwanted trailing characters. For example, extending
(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$))
into
(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$|\s+[A-Z]{2,3}|\s+Cur R))
would eliminate the Cur R and code endings (AB, JAS, etc.):
('The Environment', 'n')
('Urban Development In Modern China', 'n')
('Cook Chinese', 's')
('Nancy Chih \xe2\x80\xa2 Don\xe2\x80\x99t Lick The Chopsticks', 'k')
('Mrs. Ma\xe2\x80\x99s Japanese Cooking', 'n')
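A rough, runnable sketch of how the extended alternation would slot into the question's pattern2, tried on one of the cleaned sample lines from the question (the extra branches are a heuristic for this particular OCR text, not a general rule):
import re

# pattern2 from the question, with the extra branches appended to the final group
pattern2 = (r"\s+(?:([A-Z][a-z][a-z]((?!\s+-|\s+Choice\s*-).)*?\w)"
            r"(?:\s+-\s*|\s+(?=Choice\s*-)|\s*$|\s+[A-Z]{2,3}|\s+Cur R))")

sample = " Urban Development In Modern China Choice - vl9 - Ja '82 - p696 JAS - v42 - N 82 - pl39"
print(re.findall(pattern2, sample))
# expected: [('Urban Development In Modern China', 'n')]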
Question: ... refine my regex to capture just the titles?
Split content2 on these patterns to get a list of books:
books = []
for c in re.split('(XXX_XXX|MA, | MA, | MA )', content2):
    if c and c not in ['MA, ', ' MA, ', ' MA ', 'XXX_XXX']:
        books.append(c.strip())
Loop over books to extract the title:
reObj = re.compile(r'(.+?)(( [A-Z\&]{1,4})?| Choice| Cur ?R) ?- ?[’v][l]?(\d{1,2}|O )')
for book in books:
    match = reObj.match(book)
    if match:
        title = match.groups()[0]
        print('{}'.format(title))
    else:
        print('FAIL:{}'.format(book))
Output:
Chi-Hua - Huan Chiu Hsin Ying
Ching-Hsien - Pei Niang Niang Ti Ku Shih
Hsin-Teh - Chinese Women In The Great Leap Forward
The Overall Survey Of The Ocean’s Shores 1433
Ying-Yai Sheng-Lan
Commercial Development And Urban Change In Sung China 960-1279
The Environment
Urban Development In Modern China
Cook Chinese
Don’t Lick The Chopsticks
... (omitted for brevity)
Tested with Python:3.4.2 - re:2.2.1

Summing column values in pandas and attaching or merging the total to the dataframe?

I have this function:
def source_revenue(self):
    items = self.data.items()
    df = pandas.DataFrame(
        {'SOURCE OF BUSINESS': [i[0] for i in items], 'INCOME': [i[1] for i in items]})
    pivoting = pd.pivot_table(df, index=['SOURCE OF BUSINESS'], values=['INCOME'])
    suming = pivoting.sum(index=(0), columns=(1))
This function yields this:
INCOME 216424.9
dtype: float64
Without summing, it returns the full dataframe, like this:
INCOME
SOURCE OF BUSINESS
BYD - Other 500.0
BYD - Retail 1584.0
BYD - Transport 42498.0
BYD Beverage - A La Carte 39401.5
BYD Food - A La Carte 瓦厂食品-零点 68365.0
BYD Food - Catering Banquet 53796.0
BYD Rooms 瓦厂房间 5148.0
GS - Retail 386.0
GS Food - A La Carte 48.0
Orchard Retail 130.0
SCH - Food - A La Carte 96.0
SCH - Retail 375.4
SCH - Transport 888.0
SCH Beverage - A La Carte 119.0
Spa 3052.0
XLM Beverage - A La Carte 38.0
The reason I am doing this is that I want to take all the returned rows, sum them, and attach a total to the dataframe.
Initially I tried margins=True (I read around here that it sums and attaches the total to the dataframe, which turned out not to be true).
So what I want to know is whether there is a way to return the dataframe, but also sum up the values and attach a total to the end of the dataframe, the way I expected margins=True to work.
I think you can use groupby rather than pivot_table here, because groupby is faster.
You can use pivot_table, but the default aggfunc is np.mean, which is easy to forget:
pivoting = pd.pivot_table(df,
                          index=['SOURCE OF BUSINESS'],
                          values=['INCOME'],
                          aggfunc=np.mean)
I think you need aggfunc=np.sum:
print df
A B C D
0 zoo one small 1
1 zoo one large 2
2 zoo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
print pd.pivot_table(df, values='D', index=['A'], aggfunc=np.sum)
A
bar 22
foo 6
zoo 5
Name: D, dtype: int64
df1 = df.groupby('A')['D'].sum()
print df1
A
bar 22
foo 6
zoo 5
Name: D, dtype: int64
If you need to add a Total to the Series, use loc and sum:
print df1.sum()
33
df1.loc['Total'] = df1.sum()
print df1
A
bar 22
foo 6
zoo 5
Total 33
Name: D, dtype: int64
Timings:
In [111]: %timeit df.groupby('A')['D'].sum()
1000 loops, best of 3: 581 µs per loop
In [112]: %timeit pd.pivot_table(df, values='D', index=['A'], aggfunc=np.sum)
100 loops, best of 3: 2.28 ms per loop
Adding Total in your df by setting with enlargement:
print df
INCOME
SOURCE OF BUSINESS
BYD - Other 500.0
BYD - Retail 1584.0
BYD - Transport 42498.0
BYD Beverage - A La Carte 39401.5
BYD Food - A La Carte 68365.0
BYD Food - Catering Banquet 53796.0
BYD Rooms 5148.0
GS - Retail 386.0
GS Food - A La Carte 48.0
Orchard Retail 130.0
SCH - Food - A La Carte 96.0
SCH - Retail 375.4
SCH - Transport 888.0
SCH Beverage - A La Carte 119.0
Spa 3052.0
XLM Beverage - A La Carte 38.0
df.loc['Total', 'INCOME'] = df['INCOME'].sum()
print df
INCOME
SOURCE OF BUSINESS
BYD - Other 500.0
BYD - Retail 1584.0
BYD - Transport 42498.0
BYD Beverage - A La Carte 39401.5
BYD Food - A La Carte 68365.0
BYD Food - Catering Banquet 53796.0
BYD Rooms 5148.0
GS - Retail 386.0
GS Food - A La Carte 48.0
Orchard Retail 130.0
SCH - Food - A La Carte 96.0
SCH - Retail 375.4
SCH - Transport 888.0
SCH Beverage - A La Carte 119.0
Spa 3052.0
XLM Beverage - A La Carte 38.0
Total 216424.9
df.ix[len(df)] = ... will add a row to the end of your dataframe. Your data then needs to match the correct number of columns. Also, I wouldn't recommend adding this to your actual data, as any subsequent analysis would be invalid; it's probably best to create a new series and then concat it if needed for display purposes (a sketch of that approach follows after the output below).
df.ix[len(df)] = ['Total', df.INCOME.sum()]
>>> df
SOURCE OF BUSINESS INCOME
0 BYD - Other 500
1 BYD - Retail 1584
2 BYD - Transport 42498
3 BYD Beverage - A La Carte 39401.5
4 BYD Food - A La Carte 瓦厂食品-零点 68365
5 BYD Food - Catering Banquet 53796
6 BYD Rooms 瓦厂房间 5148
7 GS - Retail 386
8 GS Food - A La Carte 48
9 Orchard Retail 130
10 SCH - Food - A La Carte 96
11 SCH - Retail 375.4
12 SCH - Transport 888
13 SCH Beverage - A La Carte 119
14 Spa 3052
15 XLM Beverage - A La Carte 38
16 Total 216425
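A minimal sketch of the "keep the data untouched, concat a total only for display" idea mentioned above, using a couple of rows from the question's data as a stand-in for the full dataframe:
import pandas as pd

# toy frame standing in for the question's df
df = pd.DataFrame({'SOURCE OF BUSINESS': ['BYD - Other', 'BYD - Retail'],
                   'INCOME': [500.0, 1584.0]})

# keep df untouched for analysis; build a separate frame with the total for display only
total_row = pd.DataFrame([{'SOURCE OF BUSINESS': 'Total', 'INCOME': df['INCOME'].sum()}])
df_display = pd.concat([df, total_row], ignore_index=True)
print(df_display)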
