I am trying to extract only the rented prices (green dots) from this site, but I can't find where the data is coming from. I want to use BeautifulSoup or Scrapy to do the web scraping. Is the data being imported as JSON, or how is it appearing on the website? Apologies for such a broad question; I am relatively new to the Python programming language. There are URLs in the source code that may lead to the data, but I can't figure it out. Any push in the right direction would be much appreciated.
Here is the website: https://www.redweek.com/whats-my-timeshare-worth/P5035-wyndham-bonnet-creek-resort/rental-historical
I am only going to help you find where the data is coming from; parsing the JSON is up to you. If you open up the Network tab in Chrome's Developer Tools, you can see:
xhr?resort_id=5035&type=rental&active=0
Now, when you click on that, you will get the Request URL option on the right hand side. This is where the data is coming from:
https://redweek.com/whats-my-timeshare-worth/xhr?resort_id=5035&type=rental&active=0
Hamza already said where to find it, but to reiterate: when you are on the site, right-click and select "Inspect" (or press Ctrl+Shift+I). In the right panel you'll find it under Network, XHR, Headers (you may need to reload the page once you have the panel open).
Here's the code to turn that JSON into a table:
import pandas as pd
import requests

url = 'https://www.redweek.com/whats-my-timeshare-worth/xhr?resort_id=5035&type=rental&active=0'
headers = {'User-Agent': 'Mozilla/5.0'}
jsonData = requests.get(url, headers=headers).json()

# Map column indices to their labels; column 0 holds the week number
cols = {}
for idx, each in enumerate(jsonData['cols']):
    cols.update({idx: each['label']})
cols.update({0: 'Week'})

# Flatten each row of cells into a dict keyed by column label
rows = []
for row in jsonData['rows']:
    temp_row = {}
    for idx, each in enumerate(row['c']):
        temp_row.update({cols[idx]: each['v']})
    rows.append(temp_row)

df = pd.DataFrame(rows)

# Collapse the per-status price columns into a single Price column
df['Price'] = df['Rented'].fillna(df['Unknown'])
df = df.drop(['Not Rented', 'Active posting', 'Rented', 'Unknown'], axis=1)
Output:
print(df)
Bedrooms Status Week Price
0 1 Unknown 52 120.0
1 1 Unknown 52 130.0
2 1 Unknown 53 120.0
3 1 Unknown 1 60.0
4 1 Unknown 3 140.0
5 1 Unknown 5 100.0
6 1 Unknown 11 170.0
7 1 Unknown 11 90.0
8 1 Unknown 20 90.0
9 1 Unknown 22 130.0
10 1 Unknown 23 100.0
11 1 Unknown 24 100.0
12 1 Unknown 24 180.0
13 1 Unknown 25 100.0
14 1 Unknown 27 90.0
15 1 Unknown 28 90.0
16 1 Unknown 29 90.0
17 1 Unknown 30 90.0
18 1 Unknown 47 100.0
19 1 Unknown 52 100.0
20 1 Unknown 1 140.0
21 1 Unknown 10 140.0
22 1 Unknown 12 130.0
23 1 Unknown 14 100.0
24 1 Unknown 14 160.0
25 1 Unknown 26 110.0
26 1 Unknown 34 90.0
27 1 Unknown 39 140.0
28 1 Unknown 43 160.0
29 1 Unknown 51 100.0
... ... ... ...
4035 3 Rented 12 250.0
4036 3 Rented 13 270.0
4037 3 Rented 14 230.0
4038 3 Rented 18 280.0
4039 3 Rented 27 180.0
4040 3 Rented 35 90.0
4041 4 Rented 53 330.0
4042 4 Rented 15 170.0
4043 4 Rented 14 310.0
4044 4 Rented 18 250.0
4045 4 Rented 19 250.0
4046 4 Rented 46 300.0
4047 4 Rented 18 250.0
4048 4 Rented 19 200.0
4049 4 Rented 8 190.0
4050 4 Rented 12 240.0
4051 4 Rented 27 200.0
4052 4 Rented 7 240.0
4053 4 Rented 18 200.0
4054 4 Rented 47 310.0
4055 4 Rented 45 210.0
4056 4 Rented 7 320.0
4057 4 Rented 51 320.0
4058 4 Rented 9 300.0
4059 4 Rented 15 220.0
4060 4 Rented 39 210.0
4061 4 Rented 41 280.0
4062 4 Rented 4 200.0
4063 4 Rented 5 260.0
4064 4 Rented 35 130.0
[4065 rows x 4 columns]
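Since the question only asks for the rented prices (the green dots), a minimal follow-up sketch, assuming the resulting Status column is labelled exactly as shown in the output above, would be:
# Keep only the rows that correspond to the green "Rented" dots
rented = df[df['Status'] == 'Rented']
print(rented[['Bedrooms', 'Week', 'Price']])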
Related
I have a dataframe which has three batteries' charging and discharging sequences:
Battery 1 Battery 2 Battery 3
0 32 3 -1
1 21 11 -31
2 23 27 63
3 12 -22 -22
4 -21 22 44
5 -66 6 66
6 -12 32 -52
7 -45 -45 -4
8 45 -55 -77
9 66 66 96
10 99 -39 -69
11 88 99 48
If the number is negative then the battery is charging, and if it is positive then it is discharging. So I summed all the batteries per row and then tried to split the total into charging and discharging sequences.
import pandas as pd

dic1 = {
    'Battery 1': [32, 21, 23, 12, -21, -66, -12, -45, 45, 66, 99, 88],
    'Battery 2': [3, 11, 27, -22, 22, 6, 32, -45, -55, 66, -39, 99],
    'Battery 3': [-1, -31, 63, -22, 44, 66, -52, -4, -77, 96, -69, 48]
}
df = pd.DataFrame(dic1)
bess = df.filter(like='Battery').sum(axis=1)  # Adding all batteries
charging = bess[bess<=0].fillna(0)  # Charging
discharging = bess[bess>0].fillna(0)  # Discharging
bess['charging'] = charging  # creating new column for charging
bess['discharging'] = discharging  # creating new column for discharging
print(bess)
Expected output:
bess charging discharging
0 34 0.0 34.0
1 1 0.0 1.0
2 113 0.0 113.0
3 -32 -32.0 0.0
4 45 0.0 45.0
5 6 0.0 6.0
6 -32 -32.0 0.0
7 -94 -94.0 0.0
8 -87 -87.0 0.0
9 228 0.0 228.0
10 -9 -9.0 0.0
11 235 0.0 235.0
but instead, somehow, fillna is not filling in the 0 values and I get this output:
bess charging discharging
0 34 34
1 1 1
2 113 113
3 -32 -32
4 45 45
5 6 6
6 -32 -32
7 -94 -94
8 -87 -87
9 228 228
10 -9 -9
11 235 235
Change these lines to use reindex instead:
charging = bess[bess<=0].reindex(df.index,fill_value=0) #Charging
discharging = bess[bess>0].reindex(df.index,fill_value=0) #Discharging
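To get the exact expected output, the charging and discharging columns also need to go into a DataFrame rather than being assigned onto the bess Series. A minimal sketch, assuming the same dic1 data as in the question:
import pandas as pd

df = pd.DataFrame(dic1)  # dic1 as defined in the question
bess = df.filter(like='Battery').sum(axis=1)

out = pd.DataFrame({'bess': bess})
out['charging'] = bess[bess <= 0].reindex(df.index, fill_value=0)
out['discharging'] = bess[bess > 0].reindex(df.index, fill_value=0)
print(out)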
Here is a way using clip:
df.assign(bess=df.sum(axis=1),
          charging=df.sum(axis=1).clip(upper=0),
          discharging=df.sum(axis=1).clip(lower=0))
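Here clip(upper=0) zeroes out the positive row totals and keeps the negative (charging) values, while clip(lower=0) does the opposite and keeps the positive (discharging) values, which matches the expected output above.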
My dataset has Customer_Code, As_Of_Date and 24 products. Each product has a value of 0 or 1. I ordered the dataset by Customer_Code and As_Of_Date. For each product, I want to subtract the previous row from the next row. The important thing here is that the difference must be computed per customer, according to their As_Of_Date.
I tried
df2.set_index('Customer_Code').diff()
and
df2.set_index('As_Of_Date').diff()
and
for i in new["Customer_Code"].unique():
    df14 = df12.set_index('As_Of_Date').diff()
but the results are not correct. My code gives the right values for the first customer, but not for the second customer.
How can I do this?
You didn't share any data, so I made up something that you may use. Your expected outcome is also missing. For future reference, please do not share data as images. Let's say you have this data:
id date product
0 12 2008-01-01 1
1 12 2008-01-01 2
2 12 2008-01-01 1
3 12 2008-01-02 4
4 12 2008-01-02 5
5 34 2009-01-01 6
6 34 2009-01-01 7
7 34 2009-01-01 84
8 34 2009-01-02 4
9 34 2009-01-02 3
10 34 2009-01-02 3
11 34 2009-01-03 5
12 34 2009-01-03 6
13 34 2009-01-03 8
As I understand it, you want to subtract the previous row's product value, grouped by id and date (adapt the grouping columns if needed). You then need to do this:
import numpy as np

mask = df.duplicated(['id', 'date'])
df['product_diff'] = np.where(mask, df['product'] - df['product'].shift(1), np.nan)
which returns:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -1.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 1.0
7 34 2009-01-01 84 77.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 -1.0
10 34 2009-01-02 3 0.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 1.0
13 34 2009-01-03 8 2.0
or if you want it the other way around:
mask = df.duplicated(['id', 'date'])
df['product_diff'] = np.where(mask, df['product'] - df['product'].shift(-1), np.nan)
which gives:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -3.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 -1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 -77.0
7 34 2009-01-01 84 80.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 0.0
10 34 2009-01-02 3 -2.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 -2.0
13 34 2009-01-03 8 NaN
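As a side note, the first variant (the difference from the previous row within each id/date group) can also be reproduced with a plain groupby diff, which avoids building the mask explicitly; a minimal sketch:
# NaN for the first row of each (id, date) group, otherwise the
# difference from the previous row, same as the first output above
df['product_diff'] = df.groupby(['id', 'date'])['product'].diff()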
I have data
customer_id purchase_amount date_of_purchase
0 760 25.0 06-11-2009
1 860 50.0 09-28-2012
2 1200 100.0 10-25-2005
3 1420 50.0 09-07-2009
4 1940 70.0 01-25-2013
5 1960 40.0 10-29-2013
6 2620 30.0 09-03-2006
7 3050 50.0 12-04-2007
8 3120 150.0 08-11-2006
9 3260 45.0 10-20-2010
10 3510 35.0 04-05-2013
11 3970 30.0 07-06-2007
12 4000 20.0 11-25-2005
13 4180 20.0 09-22-2010
14 4390 30.0 04-15-2011
15 4750 60.0 02-12-2013
16 4840 30.0 10-14-2005
17 4910 15.0 12-13-2006
18 4950 50.0 05-19-2010
19 4970 30.0 01-12-2006
20 5250 50.0 12-20-2005
Now I want to subtract 01-01-2016 from each row of date_of_purchase
I tried the following, expecting a new column days_since with the number of days:
NOW = pd.to_datetime('01/01/2016').strftime('%m-%d-%Y')
gb = customer_purchases_df.groupby('customer_id')
df2 = gb.agg({'date_of_purchase': lambda x: (NOW - x.max()).days})
Any suggestion on how I can achieve this? Thanks in advance.
pd.to_datetime(df['date_of_purchase']).rsub(pd.to_datetime('2016-01-01')).dt.days
0 2395
1 1190
2 3720
3 2307
4 1071
5 794
6 3407
7 2950
8 3430
9 1899
10 1001
11 3101
12 3689
13 1927
14 1722
15 1053
16 3731
17 3306
18 2053
19 3641
20 3664
Name: date_of_purchase, dtype: int64
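To store the result as the days_since column asked for in the question, the same expression can simply be assigned back:
df['days_since'] = pd.to_datetime(df['date_of_purchase']).rsub(pd.to_datetime('2016-01-01')).dt.days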
I'm assuming the 'date_of_purchase' column already has the datetime dtype.
>>> df
customer_id purchase_amount date_of_purchase
0 760 25.0 2009-06-11
1 860 50.0 2012-09-28
2 1200 100.0 2005-10-25
>>> df['days_since'] = df['date_of_purchase'].sub(pd.to_datetime('01/01/2016')).dt.days.abs()
>>> df
customer_id purchase_amount date_of_purchase days_since
0 760 25.0 2009-06-11 2395
1 860 50.0 2012-09-28 1190
2 1200 100.0 2005-10-25 3720
I would like to group my df by the variable "cod_id" and then apply this function:
[df.loc[df['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in df['dt_op']]
Moving from this df:
print(df)
dt_op quantity cod_id
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
To this one:
print(final_df)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...
I tried with:
def lookforward(x):
    L = [x.loc[x['dt_op'].between(row.dt_op, row.dt_op + pd.Timedelta(days=7)),
               'quantity'].sum() for row in x.itertuples(index=False)]
    return pd.Series(L, index=x.index)

s = df.groupby('cod_id').apply(lookforward)
s.index = s.index.droplevel(0)
df['Final_Quantity'] = s
print(df)
dt_op quantity cod_id Final_Quantity
0 2018-01-20 1 613 2
1 2018-01-21 8 611 8
2 2018-01-21 1 613 1
But it is not an efficient solution; it is computationally slow.
How can I improve its performance?
I would accept completely new code or a new function, as long as it leads to the same result.
EDIT:
Here is a subset of the original dataset, with just one product (cod_id == 2), on which I tried to run the code provided by "w-m":
print(df)
cod_id dt_op quantita final_sum
0 2 2017-01-03 1 54.0
1 2 2017-01-04 1 53.0
2 2 2017-01-13 1 52.0
3 2 2017-01-23 2 51.0
4 2 2017-01-26 1 49.0
5 2 2017-02-03 1 48.0
6 2 2017-02-27 1 47.0
7 2 2017-03-05 1 46.0
8 2 2017-03-15 1 45.0
9 2 2017-03-23 1 44.0
10 2 2017-03-27 2 43.0
11 2 2017-03-31 3 41.0
12 2 2017-04-04 1 38.0
13 2 2017-04-05 1 37.0
14 2 2017-04-15 2 36.0
15 2 2017-04-27 2 34.0
16 2 2017-04-30 1 32.0
17 2 2017-05-16 1 31.0
18 2 2017-05-18 1 30.0
19 2 2017-05-19 1 29.0
20 2 2017-06-03 1 28.0
21 2 2017-06-04 1 27.0
22 2 2017-06-07 1 26.0
23 2 2017-06-13 2 25.0
24 2 2017-06-14 1 23.0
25 2 2017-06-20 1 22.0
26 2 2017-06-22 2 21.0
27 2 2017-06-28 1 19.0
28 2 2017-06-30 1 18.0
29 2 2017-07-03 1 17.0
30 2 2017-07-06 2 16.0
31 2 2017-07-07 1 14.0
32 2 2017-07-13 1 13.0
33 2 2017-07-20 1 12.0
34 2 2017-07-28 1 11.0
35 2 2017-08-06 1 10.0
36 2 2017-08-07 1 9.0
37 2 2017-08-24 1 8.0
38 2 2017-09-06 1 7.0
39 2 2017-09-16 2 6.0
40 2 2017-09-20 1 4.0
41 2 2017-10-07 1 3.0
42 2 2017-11-04 1 2.0
43 2 2017-12-07 1 1.0
Edit 181017: this approach doesn't work, because forward-rolling windows on sparse time series are not currently supported by pandas; see the comments.
Using for loops can be a performance killer when doing pandas operations.
The for loop around the rows plus their timedelta of 7 days can be replaced with a .rolling("7D"). To get a forward-rolling time delta (current date + 7 days), we reverse the df by date, as shown here.
Then no custom function is required anymore, and you can just take .quantity.sum() from the groupby.
quant_sum = df.sort_values("dt_op", ascending=False).groupby("cod_id") \
              .rolling("7D", on="dt_op").quantity.sum()
cod_id dt_op
611 2018-01-21 8.0
613 2018-01-21 1.0
2018-01-20 2.0
Name: quantity, dtype: float64
result = df.set_index(["cod_id", "dt_op"])
result["final_sum"] = quant_sum
result.reset_index()
cod_id dt_op quantity final_sum
0 613 2018-01-20 1 2.0
1 611 2018-01-21 8 8.0
2 613 2018-01-21 1 1.0
Implementing the exact behavior from the question is difficult due to two shortcomings in pandas: neither groupby/rolling/transform nor forward-looking rolling over sparse dates is implemented (see the other answer for more details).
This answer attempts to work around both by resampling the data, filling in all days, and then joining the quant_sums back with the original data.
# Create a temporary df with all in-between days filled in with zeros
filled = df.set_index("dt_op").groupby("cod_id") \
           .resample("D").asfreq().fillna(0) \
           .quantity.to_frame()

# Reverse and sum
filled["quant_sum"] = filled.reset_index().set_index("dt_op") \
                            .iloc[::-1] \
                            .groupby("cod_id") \
                            .rolling(7, min_periods=1) \
                            .quantity.sum().astype(int)

# Join with the original `df`, dropping the filled days
result = df.set_index(["cod_id", "dt_op"]).join(filled.quant_sum).reset_index()
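Another way around the missing forward-looking rolling support is to compute the window sums per group with numpy's searchsorted on the sorted dates. This is only a sketch, not part of the original answers; the window [d, d + 7 days] is treated as inclusive on both ends, matching the between call in the question:
import numpy as np
import pandas as pd

def forward_7d_sum(g):
    # g must be sorted by dt_op; for every date d, sum the quantities
    # of all rows in the group whose dt_op falls in [d, d + 7 days]
    dates = g["dt_op"].values
    qty = g["quantity"].values
    end = np.searchsorted(dates, dates + np.timedelta64(7, "D"), side="right")
    start = np.searchsorted(dates, dates, side="left")
    csum = np.concatenate(([0], qty.cumsum()))
    return pd.Series(csum[end] - csum[start], index=g.index)

df = df.sort_values(["cod_id", "dt_op"])
df["Final_Quantity"] = df.groupby("cod_id", group_keys=False).apply(forward_7d_sum)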
I have 2 data frames, df1 and df2, both have the same format.
For example, df1 looks like this:
Date A B C D E
2018-03-01 1 40 30 30 70
2018-03-02 3 60 70 50 55
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
df2 may look like this; the only difference is the start date:
Date A B C D E
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
2018-03-06 7 55 26 46 42
2018-03-07 2 73 46 33 25
I want to append the new rows from df2 to df1, in this case all the rows from 2018-03-06 onwards, so that df1 becomes:
Date A B C D E
2018-03-01 1 40 30 30 70
2018-03-02 3 60 70 50 55
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
2018-03-06 7 55 26 46 42
2018-03-07 2 73 46 33 25
Note: df2 may skip 2018-03-06, so all rows from 2018-03-07 will be copied and appended if that's the case.
My dtype for df['Date'] is datetime64. I got an error when I tried to index the last_date of df1 to find the next_date to copy from df2.
>>>> last_date = df1['Date'].tail(1)
>>>> next_date = datetime.datetime(last_date) + datetime.timedelta(days=1)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'
Alternatively, how would you copy all the rows in df2 (starting from the date after the last date of df1) and append them to df1? Thanks.
Option 1
Use combine_first on the Date column:
i = df1.set_index('Date')
j = df2[df2.Date.gt(df1.Date.max())].set_index('Date')
i.combine_first(j).reset_index()
Date A B C D E
0 2018-03-01 1.0 40.0 30.0 30.0 70.0
1 2018-03-02 3.0 60.0 70.0 50.0 55.0
2 2018-03-03 4.0 60.0 70.0 45.0 80.0
3 2018-03-04 5.0 80.0 90.0 30.0 47.0
4 2018-03-05 3.0 40.0 40.0 37.0 20.0
5 2018-03-06 7.0 55.0 26.0 46.0 42.0
6 2018-03-07 2.0 73.0 46.0 33.0 25.0
Option 2
concat + groupby
pd.concat([i, j]).groupby('Date').first().reset_index()
Date A B C D E
0 2018-03-01 1 40 30 30 70
1 2018-03-02 3 60 70 50 55
2 2018-03-03 4 60 70 45 80
3 2018-03-04 5 80 90 30 47
4 2018-03-05 3 40 40 37 20
5 2018-03-06 7 55 26 46 42
6 2018-03-07 2 73 46 33 25
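As a note on the TypeError in the question, and as a simpler alternative sketch (assuming both Date columns are already datetime64): the next date can be computed directly from the last Timestamp, and the new rows filtered and concatenated:
# datetime.datetime(...) cannot take a Timestamp; arithmetic on it works directly
next_date = df1['Date'].iloc[-1] + pd.Timedelta(days=1)

# Append only the rows of df2 that fall on or after that date
df1 = pd.concat([df1, df2[df2['Date'] >= next_date]], ignore_index=True)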