Creating a dataframe where one of the arrays has a different length - python

I am learning to scrape data from websites with Python, and I'm extracting weather information about San Francisco from this page. I get stuck when combining the data into a Pandas DataFrame. Is it possible to create a dataframe where the columns have different lengths?
I have already tried two approaches based on answers here, but they are not exactly what I am looking for. Both answers shift the values of the temps column up. Here is a screen of what I am trying to explain.
1st way: https://stackoverflow.com/a/40442094/10179259
2nd way: https://stackoverflow.com/a/19736406/10179259
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
periods=[pt.get_text() for pt in seven_day.select('.tombstone-container .period-name')]
short_descs=[sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]
temps=[t.get_text() for t in seven_day.select('.tombstone-container .temp')]
descs = [d['alt'] for d in seven_day.select('.tombstone-container img')]
#print(len(periods), len(short_descs), len(temps), len(descs))
weather = pd.DataFrame({
"period": periods, #length is 9
"short_desc": short_descs, #length is 9
"temp": temps, #problem here length is 8
#"desc":descs #length is 9
})
print(weather)
I expect the first row of the temp column to be NaN. Thank you.

You can loop over each forecast_items value and use iter with next to select the first matched value; if no value exists, NaN is assigned to the dictionary:
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
out = []
for x in forecast_items:
    # next(iter(...), np.nan) takes the first match, or NaN when the tag is missing
    periods = next(iter([t.get_text() for t in x.select('.period-name')]), np.nan)
    short_descs = next(iter([t.get_text() for t in x.select('.short-desc')]), np.nan)
    temps = next(iter([t.get_text() for t in x.select('.temp')]), np.nan)
    descs = next(iter([d['alt'] for d in x.select('img')]), np.nan)
    out.append({'period': periods, 'short_desc': short_descs, 'temp': temps, 'descs': descs})
weather = pd.DataFrame(out)
print(weather)
descs period \
0 NOW until4:00pm Sat
1 Today: Showers, with thunderstorms also possib... Today
2 Tonight: Showers likely and possibly a thunder... Tonight
3 Sunday: A chance of showers before 11am, then ... Sunday
4 Sunday Night: Rain before 11pm, then a chance ... SundayNight
5 Monday: A 40 percent chance of showers. Cloud... Monday
6 Monday Night: A 30 percent chance of showers. ... MondayNight
7 Tuesday: A 50 percent chance of rain. Cloudy,... Tuesday
8 Tuesday Night: Rain. Cloudy, with a low aroun... TuesdayNight
short_desc temp
0 Wind Advisory NaN
1 Showers andBreezy High: 56 °F
2 ShowersLikely Low: 49 °F
3 Heavy Rainand Windy High: 56 °F
4 Heavy Rainand Breezythen ChanceShowers Low: 52 °F
5 ChanceShowers High: 58 °F
6 ChanceShowers Low: 53 °F
7 Chance Rain High: 59 °F
8 Rain Low: 53 °F
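The next(iter(lst), default) idiom that drives the answer can be seen in isolation: it returns the first element of a list, or the default when the list is empty, instead of raising IndexError.

```python
import numpy as np

# First element of a non-empty list
first = next(iter(['High: 56 °F']), np.nan)
print(first)   # High: 56 °F

# Empty list: the default (NaN) is returned instead of an IndexError
missing = next(iter([]), np.nan)
print(missing)  # nan
```

This is why the Wind Advisory row, which has no .temp tag, ends up with NaN rather than shifting every later temperature up by one row.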

Related

Trying to access variables while scraping website; trying to get var in script

Trying to web scrape info from this website: http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed=.
For context, I'm trying to find the tyre brand (Bridgestone, Michelin), pattern (e.g. Turanza T001, Ecopia EP500), tyre size (205/55. 16 V (91), 225/50. 16 W (100) XL), seasonality if available (Summer, Winter), and price.
My measurements for the tyre are Width – 205, Aspect Ratio – 55, Rim Size - 16.
I found all the info I need in the var allTyres section of the page source. The problem is, I am struggling with how to extract the "manufacturer" (brand), "description" (the description has the pattern and size), "winter" (0 for no, 1 for yes), "summer" (same as winter) and "price".
Afterwards, I want to export the data in CSV format.
Thanks
To create a pandas DataFrame from the allTyres data, you can do the following (from the DataFrame you can select the columns you want, save it to CSV, etc.):
import re
import json
import requests
import pandas as pd
url = "http://www.dexel.co.uk/shopping/tyre-results?width=205&profile=55&rim=16&speed="
data = json.loads(
    re.search(r"allTyres = (.*);", requests.get(url).text).group(1)
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
df = pd.DataFrame(data)
print(df.head())
Prints:
id ManufacturerID width profile rim speed load description part_no pattern manufacturer extra_load run_flat winter summer OEList price tyre_class rolling_resistance wet_grip Graphic noise_db noise_rating info pattern_name recommended rating
0 1881920 647 205 55 16 V 91 205/55VR16 BUDGET VR 2055516VBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
1 3901788 647 205 55 16 H 91 205/55R16 BUDGET 91H 2055516HBUD Economy N N 0 1 53.20 C1 G F BUD 73 3 0 1
2 1881957 647 205 55 16 W 91 205/55ZR16 BUDGET ZR 2055516ZBUD Economy N N 0 1 53.54 C1 G F BUD 73 3 0 1
3 6022423 129 205 55 16 H 91 205/55R16 91H UROYAL RAINSPORT 5 2055516HUN09BGS RainSport 5 Uniroyal N N 0 1 70.46 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
4 6022424 129 205 55 16 V 91 205/55R16 91V UROYAL RAINSPORT 5 2055516VUN09BGR RainSport 5 Uniroyal N N 0 1 70.81 C1 C A UNIRSP5 71 2 <p>The NEW RainSport 5 combines best-in-class wet performance, enhanced mileage, and superior steering control for maximum driving pleasure.</p>\n<ul>\n <li>Safe driving even in the most challenging wet conditions</li>\n <li>Extended tyre life for a long journey</li>\n <li>Excellent control and steering response for maximum driving pleasure.</li>\n</ul> RainSport 5 0 4
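The core move here — capturing a JavaScript literal with a regex and handing it to json.loads — works without any network access. A minimal offline sketch of the same technique, using a made-up HTML snippet that stands in for the real page source:

```python
import re
import json

# Hypothetical stand-in for the page source, mimicking the "var allTyres = [...]" block
html = 'var allTyres = [{"manufacturer": "Uniroyal", "description": "205/55R16 91H", "price": "70.46"}];'

# Capture everything between "allTyres = " and the closing ";" and parse it as JSON
data = json.loads(re.search(r"allTyres = (.*);", html).group(1))
print(data[0]["manufacturer"])  # Uniroyal
```

This only works because the embedded array happens to be valid JSON; if the site ever switches to relaxed JavaScript syntax (unquoted keys, trailing commas), json.loads will raise and a tolerant parser would be needed.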

Python Dataframe - Can't replace text with a number

I am working with a bicycle dataset. I want to replace the text values in the 'weather' column with the numbers 1 to 4. This field is an object field. I tried all of the following approaches, but none seems to work.
There is another field called 'season'. If I apply the same code to 'season', it works fine. Please help.
Sample data:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 Clear + Few clouds 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 Clear + Few clouds 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 Light Snow, Light Rain 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 Mist + Cloudy 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 Clear + Few clouds 25.42 31.060 43 23.9994
I tried the following; none worked on 'weather', but when I use the same code on the 'season' column it works fine.
test["weather"] = np.where(test["weather"]=="Clear + Few clouds", 1,
    np.where(test["weather"]=="Mist + Cloudy", 2,
        np.where(test["weather"]=="Light Snow, Light Rain", 3,
            np.where(test["weather"]=="Heavy Rain + Thunderstorm", 4, 0))))
PE_weather = [
    (train['weather'] == ' Clear + Few clouds '),
    (train['weather'] == 'Mist + Cloudy'),
    (train['weather'] >= 'Light Snow, Light Rain'),
    (train['weather'] >= 'Heavy Rain + Thunderstorm')]
PE_weather_value = ['1', '2', '3', '4']
train['Weather'] = np.select(PE_weather, PE_weather_value)
test.loc[test.weather == 'Clear + Few clouds', 'weather'] = '1'
I suggest you make a dictionary to look up the corresponding values and then apply a lookup to the weather column:
weather_lookup = {
    'Clear + Few clouds': 1,
    'Mist + Cloudy': 2,
    'Light Snow, Light Rain': 3,
    'Heavy Rain + Thunderstorm': 4
}
def lookup(w):
    # unmapped values fall back to 0
    return weather_lookup.get(w, 0)
test['weather'] = test['weather'].apply(lookup)
Output:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 1 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 1 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 3 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 2 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 1 25.42 31.060 43 23.9994

BeautifulSoup: find all instances when class name repeats

I have the following code:
import requests, pandas as pd
from bs4 import BeautifulSoup
s = requests.session()
url2 = r'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
r = s.get(url2)
soup = BeautifulSoup(r.text, 'html.parser')
z2 = soup.find_all("div", {"class": 'dc_blocks_2c'})
z2 returns a long list. How do I get all the variables and values into a dataframe, i.e. gather the dc_label and dc_value pairs?
When reading tables, it's sometimes easier to just use the read_html() method. If it doesn't capture everything you want, you can write code for the rest; it just depends on what you need from the page.
url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
list_of_dataframes = pd.read_html(url)
for df in list_of_dataframes:
print(df)
or get a df by its position in the list, for example:
df = list_of_dataframes[2]
All dataframes captured:
0 1
0 Original List Price: $249,890
1 Price Reduced: -$1,000
2 Current List Price: $248,890
3 Last Reduction on: 05/14/2021
0 1
0 Original List Price: $249,890
1 Price Reduced: -$1,000
2 Current List Price: $248,890
3 Last Reduction on: 05/14/2021
Tax Year Cost/sqft Market Value Change Tax Assessment Change.1
0 2020 $114.36 $187,555 -4.88% $187,555 -4.88%
1 2019 $120.22 $197,168 -9.04% $197,168 -9.04%
2 2018 $132.18 $216,768 0.00% $216,768 0.00%
3 2017 $132.18 $216,768 5.74% $216,768 9.48%
4 2016 $125.00 $205,000 2.19% $198,000 6.90%
5 2015 $122.32 $200,612 18.71% $185,219 10.00%
6 2014 $103.05 $169,000 10.40% $168,381 10.00%
7 2013 $93.34 $153,074 0.00% $153,074 0.00%
8 2012 $93.34 $153,074 NaN $153,074 NaN
0 1
0 Market Land Value: $39,852
1 Market Improvement Value: $147,703
2 Total Market Value: $187,555
0 1
0 HOUSTON ISD: 1.1367 %
1 HARRIS COUNTY: 0.4071 %
2 HC FLOOD CONTROL DIST: 0.0279 %
3 PORT OF HOUSTON AUTHORITY: 0.0107 %
4 HC HOSPITAL DIST: 0.1659 %
5 HC DEPARTMENT OF EDUCATION: 0.0050 %
6 HOUSTON COMMUNITY COLLEGE: 0.1003 %
7 HOUSTON CITY OF: 0.5679 %
8 Total Tax Rate: 2.4216 %
0 1
0 Estimated Monthly Principal & Interest (Based on the calculation below) $ 951
1 Estimated Monthly Property Tax (Based on Tax Assessment 2020) $ 378
2 Home Owners Insurance Get a Quote
Alternatively, you can pull the dc_label / dc_value pairs straight from z2:
pd.DataFrame([el.find_all('div', {'dc_label','dc_value'}) for el in z2])
0 1
0 [MLS#:] [30509690 (HAR) ]
1 [Listing Price:] [$ 248,890 ($151.76/sqft.) , [], [$Convert ], ...
2 [Listing Status:] [[\n, [\n, <span class="status_icon_1" style="...
3 [Address:] [6408 Burgoyne Road #157]
4 [Unit No.:] [157]
5 [City:] [[Houston]]
6 [State:] [TX]
7 [Zip Code:] [[77057]]
8 [County:] [[Harris County]]
9 [Subdivision:] [ , [Briarwest T/H Condo (View subdivision pri...
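Once the label and value texts have been extracted, pairing them into rows is just slicing. A sketch using a hypothetical flat list of alternating label/value strings, which is the shape the dc_blocks_2c markup produces:

```python
import pandas as pd

# Hypothetical alternating label/value texts pulled from the page
items = ['MLS#:', '30509690', 'City:', 'Houston', 'State:', 'TX']

# Even indices are labels, odd indices are values
pairs = list(zip(items[::2], items[1::2]))
df = pd.DataFrame(pairs, columns=['dc_label', 'dc_value'])
print(df)
```

This assumes every label is immediately followed by exactly one value; rows where a value is missing would shift the pairing and need handling first.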

How to select the subset of data from each category using for loop in Python?

I have customer data (in CSV format) as:
index category text
0 spam you win much money
1 spam you win 7000 car
2 not_spam the weather in Chicago is nice
3 neutral we have a party now
4 neutral they are driving to downtown
5 not_spam pizza is an Italian food
As an example, each category has a different count:
customer.category.value_counts():
spam 100
not_spam 20
neutral 45
where:
min(customer.category.value_counts()): 20
I want to write a for loop in Python to create a new data file in which every category contains the same number of rows, equal to the smallest category count (in this example, the smallest category is not_spam).
My expected output would be:
new_customer.category.value_counts():
spam 20
not_spam 20
neutral 20
It's easier to use groupby:
min_count = df.category.value_counts().min()
df.groupby('category').head(min_count)
That said, if you really want a loop, you can write it as a list comprehension:
categories = df.category.unique()
min_count = df.category.value_counts().min()
df = pd.concat([df.query('category == @cat')[:min_count] for cat in categories])
My randomly generated dataframe has 38 rows with the following distribution of categories:
spam 17
not_spam 16
neutral 5
Name: category, dtype: int64
I was thinking that the first thing you need to do is find the smallest category; once you know that, you can .sample() each category using the calculated value as n:
def sample(df: pd.DataFrame, category: str):
    threshold = df[category].value_counts().min()
    for cat in df[category].unique():
        data = df.loc[df[category].eq(cat)]
        yield data.sample(threshold)
data = sample(df, "category")
pd.concat(data, ignore_index=True)
text category
0 v not_spam
1 l not_spam
2 q not_spam
3 j not_spam
4 f not_spam
5 l spam
6 t spam
7 r spam
8 n spam
9 k spam
10 n neutral
11 n neutral
12 d neutral
13 q neutral
14 l neutral
This should work. It keeps concatenating the top min-count records taken from each category:
minval = min(df1.category.value_counts())
df2 = pd.concat([df1[df1.category == cat].head(minval) for cat in df1.category.unique() ])
print(df2)
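On pandas >= 1.1 the same balancing can be done in one call with GroupBy.sample, which draws a random subset (rather than the top n rows) from each category. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'category': ['spam'] * 4 + ['not_spam'] * 2 + ['neutral'] * 3,
                   'text': list('abcdefghi')})
# Size of the smallest category
n = df['category'].value_counts().min()
# Randomly sample n rows per category; random_state makes the draw reproducible
balanced = df.groupby('category', group_keys=False).sample(n=n, random_state=0)
print(balanced['category'].value_counts().to_dict())  # every category has n rows
```

Random sampling is usually preferable to head() here, since taking the first n rows of each group can bias the result if the data is ordered.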

How to use date to Split a dataframe column into multiple columns in python

I have a dataframe data with 2 columns, ID and Text. The goal is to split the values in the Text column into multiple columns based on dates. Typically, a date starts a series of string values that needs to go in its own column, except when the date is at the end of the string (in that case, it's considered part of the string that started with the preceding date).
data:
ID Text
10 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20 7/17/06-advil, qui;
10 7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;
40 9/26/06-penicilin, tramadol;
91 5/23/06-penicilin, amoxicilin, tylenol;
84 10/20/06-ibuprofen, tramadol;
17 12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up
23 12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up
15 Follow up appt. scheduled
69 talk to care giver
32 12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months
70 12/1/06?Follow up but no serious allergies
70 12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil
Expected output:
ID   Text                                                                               Text2                                                                                   Text3
10   6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007
20   7/17/06-advil, qui;
10   7/19/06-ibuprofen.                                                                 8/31/06-penicilin, tramadol;
40   9/26/06-penicilin, tramadol;
91   5/23/06-penicilin, amoxicilin, tylenol;
84   10/20/06-ibuprofen, tramadol;
17   12/19/06-vit D, tramadol.                                                          12/1/09 -6/18/10 vit D only for 5 months.                                               3/7/11 f/up
23   12/19/06-vit D, tramadol;                                                          12/1/09 -6/18/10 vit D;                                                                 3/7/11 video follow-up
15   Follow up appt. scheduled
69   talk to care giver
32   12/15/06-2/16/07 everyday Follow-up;                                               6/8/16 discharged after 2 months
70   12/1/06?Follow up but no serious allergies
70   12/12/06-tylenol, vit D,advil;                                                     1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil
My code so far:
d = []
for i in data.Text:
    d = list(datefinder.find_dates(i))  # I can get the dates so far, but still want to format the date values as %m/%d/%Y
    if len(d) > 1:  # checks every record that has more than 1 date
        for j in range(0, len(d)):
            i = " " + " ".join(re.split(r'[^a-z 0-9 / -]', i.lower())) + " "  # cleans the text strings of any special characters
            #data.Text[j] = d[j]r'[/^(.*?)]'d[j+1]'/'  # this is not working
# The goal is for the Text column to retain the string from the first date up to just before the second date, then create a new Text2 holding every value from the second date up to just before the third date, and if there are more dates, create Textn, and so on.
# Exception: if a date immediately follows a date (i.e. 12/1/09 -6/18/10), or a date ends a value string (i.e. 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007), they should be considered part of the same column.
Any thoughts on how to make this work will save my day. Thank you!
There you go:
from itertools import chain, starmap, zip_longest
import itertools
import re
import pandas as pd
ids = [10, 20, 10, 40, 91, 84, 17, 23, 15, 69, 32, 70, 70]
text = [
"6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007",
"7/17/06-advil, qui;",
"7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
"9/26/06-penicilin, tramadol;",
"5/23/06-penicilin, amoxicilin, tylenol;",
"10/20/06-ibuprofen, tramadol;",
"12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up",
"12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up",
"Follow up appt. scheduled",
"talk to care giver",
"12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months",
"12/1/06?Follow up but no serious allergies",
"12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil"]
by_date = re.compile(
    r"""((?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d\s*"""
    r"""(?:(?:-|to |through )\s*(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d)?\s*\S)""")
def to_items(line):
    # start offset of every date; a date range like "12/1/09 -6/18/10" is one match
    starts = [m.start() for m in by_date.finditer(line)]
    if not starts or starts[0] > 0:
        starts.insert(0, 0)  # text before the first date belongs to the first column
    stops = iter(starts)
    next(stops)  # stops lags starts by one; the last slice runs to the end of the line
    return map(line.__getitem__, starmap(slice, zip_longest(starts, stops)))
cleaned = zip_longest(*map(to_items, text))
col_names = chain(["Text"], map("Text{}".format, itertools.count(2)))
df = pd.DataFrame(dict(zip(col_names, cleaned), ID=ids))
print(df)
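The optional (?:-|to |through ) branch is what keeps a range like 12/1/09 -6/18/10 inside a single slice: the second date is consumed as part of the first match, so it never starts a new column. A quick check of that behaviour with the same pattern on one of the sample rows:

```python
import re

by_date = re.compile(
    r"((?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d\s*"
    r"(?:(?:-|to |through )\s*(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d)?\s*\S)")

line = "12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up"
# Only three matches: the range "12/1/09 -6/18/10" is swallowed by one match
matches = [m.group(1) for m in by_date.finditer(line)]
print(matches)
```

Each match greedily includes the following date when it is joined by "-", "to ", or "through ", which is exactly the exception the question describes.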
