I am trying to get data from a website and write it to an Excel file to work on. I have a main URL scheme in which I have to change the "year" and the "reference number" accordingly:
http://calcio-seriea.net/presenze/"year"/"reference number"/
I have already written part of the code, but I have one issue. The year should stay the same while the reference number runs through an interval of 18 values; then the year increases by 1 and the reference number runs through the next interval of 18. For example:
Y = 1998 RN = [1142:1159];
Y = 1999 RN = [1160:1177];
Y = 2000 RN = [1178:1195];
Y = … RN = …
Then from year 2004 the interval becomes 20, so
Y = 2004 RN = [1250:1269];
Y = 2005 RN = [1270:1289];
This continues up to and including year 2019.
This is the code I have written so far:
import pandas as pd

year = str(1998)
all_items = []
for i in range(1142, 1160):  # stop is exclusive, so this covers reference numbers 1142-1159 for 1998
    pattern = "http://calcio-seriea.net/presenze/" + year + "/" + str(i) + "/"
    df = pd.read_html(pattern)[6]
    all_items.append(df)
pd.DataFrame(all_items).to_csv(r"C:\Users\glcve\Desktop\data.csv", index=False, header=False)
print("Done!")
Thanks to all in advance
All that's missing is a pd.concat at the end. However, since you're calling the same method over and over, let's write a function so you can keep your code DRY.
def create_html_df(base_url, year, range_nums=()):
    """
    Returns a dataframe from a url/html table.
    base_url : the url to target
    year : the target year
    range_nums : the (start, stop) range of reference numbers, e.g. (1, 50); stop is exclusive
    """
    start, stop = range_nums
    url_pat = [f"{base_url}{year}/{i}/" for i in range(start, stop)]
    dfs = []
    for each_url in url_pat:
        df = pd.read_html(each_url)[6]
        dfs.append(df)
    return pd.concat(dfs)

final_df = create_html_df(base_url="http://calcio-seriea.net/presenze/",
                          year=1998,
                          range_nums=(1142, 1160))
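To cover every season rather than just 1998, one possibility (a sketch, assuming the pattern described in the question really holds: 18 reference numbers per year from 1998 to 2003, then 20 per year from 2004 through 2019) is to generate the (year, range) pairs first and call the function in a loop:
def build_year_ranges():
    """Build {year: (start, stop)} pairs following the intervals described in the question."""
    ranges = {}
    start = 1142                     # first reference number of the 1998 season
    for year in range(1998, 2020):   # 1998 through 2019 inclusive
        size = 18 if year < 2004 else 20
        ranges[year] = (start, start + size)  # stop is exclusive, as range() expects
        start += size
    return ranges

all_years = []
for year, nums in build_year_ranges().items():
    all_years.append(create_html_df(base_url="http://calcio-seriea.net/presenze/",
                                    year=year,
                                    range_nums=nums))

pd.concat(all_years).to_csv(r"C:\Users\glcve\Desktop\data.csv", index=False, header=False)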
Hello, I need some help with this problem:
import pandas as pd
a = pd.date_range(start="2001-01-01", freq="T", periods=520000)
This creates the date range I need for one year. I want to do the same for the next 80 years. The end result should be a date range spanning 80 years, but with every year ending after 520,000 minutes. Then I add the date range to my dataset.
import numpy as np

# this is the data
ALL_Data = pd.DataFrame({"Lebensverbrauch_Min": LebensverbrauchMIN,
                         "HPT": Heisspunkttemperatur_Sim1,
                         "Innentemperatur": StartS,
                         "Verlustleistung": V_Leistung,
                         "SolarEintrag": SolarEintrag,
                         "Lastfaktor": K_Load_Faktor
                         })

# how many minutes of the year the data covers, and how many are left
DatenJahr = len(pd.date_range(start=str(xx) + "-01-01", freq="T", periods=520000))
VollesJahr = len(pd.date_range(start=str(xx) + "-01-01", freq="T", end=str(xx + 1) + "-01-01"))
GG = (VollesJahr - DatenJahr)
d = pd.DataFrame(np.zeros((GG, 6)), columns=['Lebensverbrauch_Min', 'HPT', 'Innentemperatur', 'Verlustleistung',
                                             'SolarEintrag', 'Lastfaktor'])

# combine Data with 0
ALL_Data = pd.concat([ALL_Data, d])
It seems to work, but the complete code takes 4 hours to run, so we will see.
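For reference, a minimal sketch of the index part (assuming the goal is that each of the 80 years keeps only its first 520,000 minutes) could build the per-year ranges and append them into one DatetimeIndex:
import pandas as pd

# every year from 2001 to 2080 contributes only its first 520,000 minutes
years = range(2001, 2081)  # 80 years
pieces = [pd.date_range(start=f"{year}-01-01", freq="T", periods=520000) for year in years]
full_index = pieces[0].append(pieces[1:])

print(len(full_index))      # 80 * 520,000 = 41,600,000 minutes
print(full_index[520000])   # first minute of 2002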
I'm fairly new to Orange.
I'm trying to separate rows of angle (elv) into intervals.
Let's say I want to separate my 90-degree range into 8 intervals, i.e. 90/8 = 11.25 degrees per interval.
Here's the table I'm working with
Here's what I did originally, separating them by their elv value
Here's the result that I want: x rows and 16 columns, separated by their elv value.
But I want them done dynamically.
I list them out and turn each list into a table with x rows and 2 columns.
This is what I originally did
from Orange.data.table import Table
from Orange.data import Domain, ContinuousVariable, DiscreteVariable
import numpy
import pandas as pd
from pandas import DataFrame

df = pd.DataFrame()
num = 10  # number of intervals that we want to separate our elv into
interval = 90.00 / num  # separating them into degrees per interval
low = 0
high = interval
table = []
first = []
second = []

for i in range(num):
    between = []
    if i != 0:  # not the first run
        low = high
        high = high + interval
    for row in in_data:  # run through the whole table to see if the elv falls in the current interval
        if row[0] >= low and row[0] < high:
            between.append(row)
    elv = "elv" + str(i)
    err = "err" + str(i)
    domain = Domain([ContinuousVariable.make(err)], [ContinuousVariable.make(elv)])
    data = Table.from_numpy(domain, numpy.array(between))
    print("table number ", i)
    print(data[:3])
Here's the output
But as you can see, these are separate tables, each one reassigned on every iteration of the loop.
I have to find a way to concatenate these tables along axis=1.
Even the source code for Orange3 forbids this for some reason.
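In case it helps, here is a rough pandas sketch of the binning-plus-concatenation idea (not Orange-specific; the elv/err columns are filled with placeholder random data, and converting the resulting wide frame back into an Orange Table is not shown):
import numpy as np
import pandas as pd

num = 10
df = pd.DataFrame({"elv": np.random.uniform(0, 90, 100),
                   "err": np.random.uniform(0, 1, 100)})

bins = np.linspace(0, 90, num + 1)  # interval edges: 0, 9, 18, ..., 90
df["bin"] = pd.cut(df["elv"], bins=bins, labels=list(range(num)), include_lowest=True)

pieces = []
for i, group in df.groupby("bin", observed=True):
    piece = group[["err", "elv"]].reset_index(drop=True)
    piece.columns = [f"err{i}", f"elv{i}"]  # err0/elv0, err1/elv1, ...
    pieces.append(piece)

wide = pd.concat(pieces, axis=1)  # one wide table with 2 * num columns
print(wide.head())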
I have data in the following CSV format:
Date,State,City,Station Code,Minimum temperature (C),Maximum temperature (C),Rainfall (mm),Evaporation (mm),Sunshine (hours),Direction of maximum wind gust,Speed of maximum wind gust (km/h),9am Temperature (C),9am relative humidity (%),3pm Temperature (C),3pm relative humidity (%)
2017-12-25,VIC,Melbourne,086338,15.1,21.4,0,8.2,10.4,S,44,17.2,57,20.7,54
2017-12-25,VIC,Bendigo,081123,11.3,26.3,0,,,ESE,46,17.2,53,25.5,25
2017-12-25,QLD,Gold Coast,040764,22.3,35.7,0,,,SE,59,29.2,53,27.7,67
2017-12-25,SA,Adelaide,023034,13.9,29.5,0,10.8,12.4,N,43,18.6,42,27.7,17
The output for VIC should be
S : 1
ESE : 1
SE : 0
N : 0
However, I am getting the output as
S : 1
ESE : 1
Thus I would like to know how a unique function can be used to include the other 2 missing results. Below is the program which reads the CSV file:
import pandas as pd

#read file
df = pd.read_csv('climate_data_Dec2017.csv')

#marker
value = df['Date']
date = value == "2017-12-26"
marker = df[date]

#group data
directionwise_data = marker.groupby('Direction of maximum wind gust')
count = directionwise_data.size()
numbers = count.to_dict()

for key in numbers:
    print(key, ":", numbers[key])
To begin with, I'm not sure what you're trying to get from this:
Your data sample has no "2017-12-26" records, yet you're using that date in your code, so for that sample I'll change the code to "2017-12-25" just to see what it produces, and it produces exactly what you're expecting. Therefore I guess that in your full data you don't have "2017-12-26" records for SE and N, so they're not being grouped. I suggest you create a unique set of the four directions you have in your df, then count their occurrences in a slice of your dataframe for the needed date.
Or, if all you want is how many records you have for each direction by date, why not just pivot it like below:
output = df.pivot_table(index='Date', columns = 'Direction of maximum wind gust', aggfunc={'Direction of maximum wind gust':'count'}, fill_value=0)
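If you want the zero counts to appear explicitly rather than in a pivot, a short sketch of the "unique set of directions" idea (assuming, as in your expected output, that the set of directions comes from every state on that date while the counts are restricted to VIC) would be:
import pandas as pd

df = pd.read_csv('climate_data_Dec2017.csv')

#slice to the date of interest; collect every direction seen on that date
day = df[df['Date'] == '2017-12-25']
all_directions = day['Direction of maximum wind gust'].unique()

#count per direction within VIC and fill the missing directions with 0
vic = day[day['State'] == 'VIC']
counts = vic['Direction of maximum wind gust'].value_counts().reindex(all_directions, fill_value=0)

for direction, n in counts.items():
    print(direction, ":", n)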
EDIT:
OK, so I wrote this quickly and it should get you what you want; however, you need to feed it the date you want:
import pandas as pd

#read csv
df = pd.read_csv('climate_data_Dec2017.csv')

#specify date
neededDate = '2017-12-25'

#slice dataframe to keep needed records based on the date
subFrame = df.loc[df['Date'] == neededDate].reset_index(drop=True)

#set count to zero
d1 = 0  #'S'
d2 = 0  #'SE'
d3 = 0  #'N'
d4 = 0  #'ESE'

#loop over slice and count directions
for i, row in subFrame.iterrows():
    direction = subFrame.at[i, 'Direction of maximum wind gust']
    if direction == 'S':
        d1 = d1 + 1
    elif direction == 'SE':
        d2 = d2 + 1
    elif direction == 'N':
        d3 = d3 + 1
    elif direction == 'ESE':
        d4 = d4 + 1

#print directions count
print('S = ' + str(d1))
print('SE = ' + str(d2))
print('N = ' + str(d3))
print('ESE = ' + str(d4))
S = 1
SE = 1
N = 1
ESE = 1
I have made a for loop which uses a list of stock tickers to get daily closing prices. Once collected, the code stores the data in a dataframe. This works fine, but I am having trouble finding a way to append the dataframes over and over again so that I am left with one large dataframe. Can anybody help with that? Please note that the API connection allows only a certain number of calls per minute, so there should be a time delay when a call fails; I have tried to account for this. Please see the code below:
import time

import pandas as pd
import requests

C20 = ['AMBU-B.CPH', 'MAERSK-B.CPH']
df = pd.DataFrame()

def getdata(symbol_input):
    for ticker in symbol_input:
        try:
            API_KEY = 'XXXXXXXXX'  # MY API KEY
            symbol = ticker  # search Google for the company name plus "stock price"; the ticker is the one to use
            r = requests.get('https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=' + ticker + '&apikey=' + API_KEY)
            result = r.json()
            AllData = result['Time Series (Daily)']
            alldays = list(AllData.keys())
            alldays.sort()
            timeinterval = 10
            days = alldays[len(alldays) - timeinterval:len(alldays)]
            #print(days)
            SymbolList = []
            for i in range(timeinterval):
                SymbolList.append(symbol)
            #print(SymbolList)
            StockPriceList = []
            if (r.status_code == 200):
                for i, day in enumerate(days):
                    result = r.json()
                    dataForAllDays = result['Time Series (Daily)']
                    dataForSingleDate = dataForAllDays[days[i]]
                    #print(days[i], dataForSingleDate['4. close'])
                    StockPriceList.append(dataForSingleDate['4. close'])
            #print(StockPriceList)
            combined_lists = list(zip(days, StockPriceList, SymbolList))  # create tuples from multiple lists to feed into the dataframe
            df1 = pd.DataFrame(combined_lists, columns=['Date', 'Price', 'Stock Ticker'])
            print(df1)
            time.sleep(10)
        except:
            print('could not get data for: ' + ticker)
            time.sleep(1)  # wait for 1 second before trying to fetch the data again
            continue

print(getdata(C20))
You can use pd.concat to join everything, via a temporary dataframe, into one final dataframe.
You can use this code as an example for concatenating two different dataframes into a single final dataframe.
dataset1 = pd.DataFrame([[1,2],[2,3],[3,4]],columns=['A','B'])
dataset2 = pd.DataFrame([[4,5],[5,6],[6,7]],columns=['A','B'])
full_dataset = pd.concat([dataset1,dataset2])
full_dataset
A B
0 1 2
1 2 3
2 3 4
0 4 5
1 5 6
2 6 7
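Applied to your loop, a rough sketch (the fetch function below is only a placeholder for the requests/JSON parsing in your question) would collect each per-ticker frame in a list and concatenate once at the end:
import pandas as pd

def fetch_prices(symbol):
    # placeholder for the Alpha Vantage request and parsing shown in the question
    return pd.DataFrame({"Date": ["2020-01-01", "2020-01-02"],
                         "Price": [100.0, 101.0],
                         "Stock Ticker": [symbol, symbol]})

frames = []
for symbol in ['AMBU-B.CPH', 'MAERSK-B.CPH']:
    frames.append(fetch_prices(symbol))  # this is your df1

all_prices = pd.concat(frames, ignore_index=True)
print(all_prices)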
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
Let me know if you require anything else. Have a great day!
For every user, I'd like to find the date of their earliest visit that falls within a 90-day lookback window from their first order date.
import pandas as pd
from datetime import timedelta
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
lookback = test[test['orderNumber']==1]['date'].apply(lambda x: x - timedelta(days=90))
lookback.name = 'window_min'
ids = test['fullVisitorId']
ids = ids.reset_index()
ids = ids.set_index('index')
lookback = lookback.reset_index()
lookback['fullVisitorId'] = lookback['index'].map(ids['fullVisitorId'])
lookback = lookback.set_index('fullVisitorId')
test['window'] = test['fullVisitorId'].map(lookback['window_min'])
test = test[test['window']<test['date']]
test.loc[test.groupby('fullVisitorId')['date'].idxmin()]
This works, but I feel like there ought to be a cleaner way...
How about this? Basically we assign a new column (firstorder_90, the first order date minus 90 days) to help us filter away the visits that fall outside the window.
We then apply groupby and pick the first (nth(0)) element per visitor.
import pandas as pd
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
test.sort_values(by='date', inplace=True)
firstorder = test[test.orderNumber > 0].set_index('fullVisitorId').date
test['firstorder_90'] = test.fullVisitorId.map(firstorder - pd.Timedelta(days=90))
test.query('date >= firstorder_90').groupby('fullVisitorId', as_index=False).nth(0)
We get:
date fullVisitorId sessionId \
121154 2016-10-07 7634897085866546110 7634897085866546110_1475846055
189763 2016-12-18 643786734868244401 0643786734868244401_1482120932
orderNumber firstorder_90
121154 0.0 2016-10-07
189763 1.0 2016-09-19