How to calculate combined experience - python

I have the following in a function:
edu = sorted(context.education, key=lambda studied: studied["studied_to"], reverse=True)
# print edu[0].studied_to
job_history = []
for job in context.job_history:
    if edu[0].fields_of_study in job.industries:
        from_ = job.from_
        to_ = job.to_
        industries = job.industries
        rd = rdelta.relativedelta(to_, from_)  # get date difference
        # I suspect that combined exp calculation would be done here.
        experience = "{0.years} Years {0.months} Months".format(rd)  # get Year - Month format
        # print experience
        job_history.append({"job_title": job_title,
                            "company_name": company_name,
                            "from_": from_,
                            "to_": to_,
                            "industries": industries,
                            "experience": experience})
j = sorted(job_history, key=lambda s: s["to_"])
# print j[0]["job_title"]
return {"relocate_list": provinces,
        "disabilities_dict": disabilities,
        "industries_list": industry_dict,
        "job_history_sorted": j,
        "education_sorted": edu}
I can get the experience from each job with the code above. Is there a way to calculate the combined experience?
Currently, say the user has or had more than one job in the IT industry, for argument's sake, the above code will give me e.g. 1 Years 0 Months and 1 Years 4 Months.
How can I calculate the combined experience, so that the above example would give 2 Years 4 Months?
I have tried:
rd += rd
But this just adds the same delta to itself, i.e.
1 Years 4 Months + 1 Years 4 Months
would output:
2 Years 8 Months

Why not create a new variable to accumulate the relative deltas and display it outside the loop, say:
job_history = []
total_exp = rdelta.relativedelta()
for job in context.job_history:
    if edu[0].fields_of_study in job.industries:
        from_ = job.from_
        to_ = job.to_
        industries = job.industries
        rd = rdelta.relativedelta(to_, from_)  # get date difference
        total_exp += rd  # accumulate the combined experience
        experience = "{0.years} Years {0.months} Months".format(rd)  # per-job experience
        job_history.append({"job_title": job_title,
                            "company_name": company_name,
                            "from_": from_,
                            "to_": to_,
                            "industries": industries,
                            "experience": experience})
# the combined experience is formatted once, outside the loop
combined = "{0.years} Years {0.months} Months".format(total_exp)  # get Year - Month format
# print combined
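To make the accumulation concrete, here is a minimal, self-contained sketch with made-up job dates; only dateutil's documented relativedelta arithmetic is assumed, and it shows that the months are normalized into years for you:

from datetime import date
from dateutil import relativedelta as rdelta

jobs = [(date(2015, 1, 1), date(2016, 1, 1)),   # 1 Years 0 Months
        (date(2016, 2, 1), date(2017, 6, 1))]   # 1 Years 4 Months

total_exp = rdelta.relativedelta()
for from_, to_ in jobs:
    total_exp += rdelta.relativedelta(to_, from_)  # sum the per-job deltas

print("{0.years} Years {0.months} Months".format(total_exp))
# -> 2 Years 4 Months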

Related

python create multiple year daterange with specific end date for every year

Hello, I need some help with this problem:
a = pd.date_range(start="2001-01-01", freq="T", periods=520000)
This creates the date range I need for one year. I want to do the same for the next 80 years. The end result should be a date range spanning 80 years, where every year ends after 520,000 minutes. Then I add the date range to my dataset.
# this is the data
ALL_Data = pd.DataFrame({"Lebensverbrauch_Min": LebensverbrauchMIN,
                         "HPT": Heisspunkttemperatur_Sim1,
                         "Innentemperatur": StartS,
                         "Verlustleistung": V_Leistung,
                         "SolarEintrag": SolarEintrag,
                         "Lastfaktor": K_Load_Faktor})
# How many minutes are left in the year
DatenJahr = len(pd.date_range(start=str(xx) + "-01-01", freq="T", periods=520000))
VollesJahr = len(pd.date_range(start=str(xx) + "-01-01", freq="T", end=str(xx + 1) + "-01-01"))
GG = VollesJahr - DatenJahr
d = pd.DataFrame(np.zeros((GG, 6)), columns=['Lebensverbrauch_Min', 'HPT', 'Innentemperatur',
                                             'Verlustleistung', 'SolarEintrag', 'Lastfaktor'])
# combine Data with 0
ALL_Data = pd.concat([ALL_Data, d])
This seems to work, but the complete code needs about 4 hours to run, so we will see.
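For the index itself, here is a minimal sketch of one way to build all 80 years at once, assuming each year should contribute exactly its first 520,000 minutes (the start year 2001 is taken from the question):

import pandas as pd

# one truncated minute-range per year; 520,000 minutes is ~361 days,
# so each range stays inside its calendar year
ranges = [pd.date_range(start=str(year) + "-01-01", freq="T", periods=520000)
          for year in range(2001, 2081)]  # 80 years

full_index = ranges[0].append(ranges[1:])  # one long DatetimeIndex
print(len(full_index))  # 80 * 520000 = 41,600,000 rows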

Double for loop to extract data from several urls

I am trying to get data from a website and write it to an Excel file to be worked on. I have a main URL scheme, and I have to change the "year" and the "reference number" accordingly:
http://calcio-seriea.net/presenze/"year"/"reference number"/
I already tried to write part of the code, but I have one issue. First of all, I should keep the year the same while the reference number takes every number in an interval of 18. Then the year increases by 1, and the reference number again takes every number in an interval of 18. To give an example:
Y = 1998 RN = [1142:1159];
Y = 1999 RN = [1160:1177];
Y = 2000 RN = [1178:1195];
Y = … RN = …
Then from year 2004 the interval becomes 20, so
Y = 2004 RN = [1250:1269];
Y = 2005 RN = [1270:1289];
up to and including year 2019.
This is the code I could make so far:
import pandas as pd

year = str(1998)
all_items = []
for i in range(1142, 1159):
    pattern = "http://calcio-seriea.net/presenze/" + year + "/" + str(i) + "/"
    df = pd.read_html(pattern)[6]
    all_items.append(df)
pd.DataFrame(all_items).to_csv(r"C:\Users\glcve\Desktop\data.csv", index=False, header=False)
print("Done!")
Thanks to all in advance
All that's missing is a pd.concat from your function; however, as you're calling the same method over and over, let's write a function so you can keep your code DRY.
def create_html_df(base_url, year, range_nums=()):
    """
    Returns a dataframe from a url/html table.
    base_url : the url to target
    year : the target year
    range_nums : the range of numbers, i.e. (1, 50)
    """
    start, stop = range_nums
    url_pat = [f"{base_url}/{year}/{i}" for i in range(start, stop)]
    dfs = []
    for each_url in url_pat:
        df = pd.read_html(each_url)[6]
        dfs.append(df)
    return pd.concat(dfs)

final_df = create_html_df(base_url="http://calcio-seriea.net/presenze/",
                          year=1998,
                          range_nums=(1142, 1159))
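To cover the whole 1998-2019 span, the year-to-reference-number schedule can be generated rather than hard-coded. A hedged sketch: the start number 1142 and the interval sizes (18 until 2003, 20 from 2004 on) come from the question, everything else is an assumption; note that Python's range excludes its stop, so (rn, rn + size) covers all size numbers including the last one.

import pandas as pd

all_dfs = []
rn = 1142  # first reference number, per the question
for year in range(1998, 2020):        # 1998 .. 2019 inclusive
    size = 18 if year < 2004 else 20  # interval widens from 2004 on
    all_dfs.append(create_html_df(base_url="http://calcio-seriea.net/presenze/",
                                  year=year,
                                  range_nums=(rn, rn + size)))
    rn += size

pd.concat(all_dfs).to_csv(r"C:\Users\glcve\Desktop\data.csv", index=False)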

What is the best way to compute a rolling (lag and lead) difference in sales?

I'm looking to add a field or two to my data set that represent the difference in sales from the last week to the current week, and from the current week to the next week.
My dataset is about 4.5 million rows, so I'm looking for an efficient way of doing this. Currently I'm getting into a lot of iteration and for loops, and I'm quite sure I'm going about this the wrong way. But I'm trying to write code that will be reusable on other datasets, and there are situations where you might have nulls, or no change in sales week to week (and therefore no record).
The dataset looks like the following:
Store  Item  WeekID  WeeklySales
1      1567  34           100.00
2      2765  34            86.00
3      1163  34           200.00
1      1567  35           160.00
...
I have each week as its own dictionary, and then each store's sales for that week in a dictionary within it. So I can use the week as a key, and then within the week access that store's dictionary of item sales.
weekly_sales_dict = {}
for i in df['WeekID'].unique():
    store_items_dict = {}
    subset = df[df['WeekID'] == i]
    subset = subset.groupby(['Store', 'Item']).agg({'WeeklySales': 'sum'}).reset_index()
    for j in subset['Store'].unique():
        storeset = subset[subset['Store'] == j]
        store_items_dict.update({str(j): storeset})
    weekly_sales_dict.update({str(i): store_items_dict})
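For illustration, a small usage sketch of this nested structure, assuming it was built from the sample data above (the keys '34' and '1' are a hypothetical week and store ID):

week_34_store_1 = weekly_sales_dict['34']['1']  # store 1's item sales in week 34
row = week_34_store_1[week_34_store_1['Item'] == 1567]
print(row['WeeklySales'].values)  # e.g. [100.0]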
Then I iterate through each week in weekly_sales_dict and compare each store/item within it to the week behind it (I planned to do the same for the next week as well). The lag_list I create can be indexed by week, store, and item, so I was going to iterate through and add the values to my df as a new lag column, but I feel I am way overthinking this.
count = 0
key_list = list(df['WeekID'].unique())
lag_list = []
for k, v in weekly_sales_dict.items():
    if count != 0 and count != len(df['WeekID'].unique()) - 1:
        prev_wk = weekly_sales_dict[str(key_list[count - 1])]
        current_wk = weekly_sales_dict[str(key_list[count])]
        for i in df['Store'].unique():
            prev_df = prev_wk[str(i)]
            current_df = current_wk[str(i)]
            for j in df['Item'].unique():
                print('in j')
                if j in list(current_df['Item'].unique()) and j in list(prev_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values - prev_df[prev_df['Item'] == int(j)]['WeeklySales'].values
                    df[df['Item'] == j][df['Store'] == i][df['WeekID'] == key_list[count]]['lag'] = item_lag[0]
                    lag_list.append((str(i), str(j), item_lag[0]))
                elif j in list(current_df['Item'].unique()):
                    item_lag = current_df[current_df['Item'] == int(j)]['WeeklySales'].values
                    lag_list.append((str(i), str(j), item_lag[0]))
                else:
                    pass
        count += 1
    else:
        count += 1
Using pd.diff() the problem was solved. I sorted all rows by week, then created a subset with a multi-index by grouping on store, item, and week. Finally I used pd.diff() with a period of 1, and I ended up with the sales difference from the current week to the week prior.
df = df.sort_values(by='WeekID')
subset = df.groupby(['Store', 'Item', 'WeekID']).agg({'WeeklySales': 'sum'})
subset['lag'] = subset[['WeeklySales']].diff(1)
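One caveat with a plain diff over the grouped frame is that the difference can leak across store/item boundaries. Here is a hedged sketch of a groupby + shift variant on made-up data shaped like the sample, computing both the lag and the lead difference per store/item:

import pandas as pd

df = pd.DataFrame({
    "Store":       [1, 1, 1, 2, 2],
    "Item":        [1567, 1567, 1567, 2765, 2765],
    "WeekID":      [34, 35, 36, 34, 35],
    "WeeklySales": [100.0, 160.0, 150.0, 86.0, 90.0],
}).sort_values(["Store", "Item", "WeekID"])

g = df.groupby(["Store", "Item"])["WeeklySales"]
df["lag_diff"] = df["WeeklySales"] - g.shift(1)    # change vs previous week
df["lead_diff"] = g.shift(-1) - df["WeeklySales"]  # change vs next week
print(df)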

Need a regular expression to split String in Python [closed]

str = 'FW201703002082017MF0164EXESTBOPF01163500116000 0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M'
The above is a single string that needs to be split up and its fields extracted separately, as follows:
Borkername = FW
Sale year = 2017
Saleno = 0300
sale_dte = 20.08.2017  # date needs to be formatted
Factoryno = MF0164
Catalogu code = EXEST
Grade = BOPF
Gross weight = 01163.50  # decimal point needed
Net Weight = 01160.00  # decimal point needed
Lot_No = 0001
invoice_year = 2017
invoice_no = 00258
price = 000580.00  # decimal point needed
Netweight = 01160.00  # decimal point needed
Buyer = 'WALTERS BAY BOGAWANTALAWA'
Buyer_code = '1M'
This is a single line without any delimiters, so kindly help me write a regular expression to separate each field into a pandas column in Python.
For example:
(\A[A-Z]{2})
This will give me the first 2 characters. How can I get the next 4 digits as the year?
You need to do this in two goes. First use a regular expression to split the string up into (mostly) fixed-length segments. Then, with the list you get back, fix the fields manually into the format you require. For example:
import re
import csv

headings = [
    "Borkername", "Sale year", "Saleno", "sale_dte", "Factoryno", "Catalogu code", "Grade", "Gross weight",
    "Net Weight", "Lot_No", "invoice_year", "invoice_no", "price", "Netweight", "Buyer", "Buyer_code"]

re_fields = re.compile(r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) (.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$')

with open('input.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_writer = csv.writer(f_output)
    csv_writer.writerow(headings)

    for line in f_input:
        fields = list(re_fields.match(line).groups())
        fields[3] = "{}.{}.{}".format(fields[3][:2], fields[3][2:4], fields[3][4:])
        fields[7] = float("{}.{}".format(fields[7][:5], fields[7][5:]))
        fields[8] = float("{}.{}".format(fields[8][:5], fields[8][5:]))
        fields[12] = float("{}.{}".format(fields[12][:6], fields[12][6:]))
        fields[13] = float("{}.{}".format(fields[13][:5], fields[13][5:]))
        csv_writer.writerow(fields)
This would give you output.csv containing:
Borkername,Sale year,Saleno,sale_dte,Factoryno,Catalogu code,Grade,Gross weight,Net Weight,Lot_No,invoice_year,invoice_no,price,Netweight,Buyer,Buyer_code
FW,2017,030,02.08.2017,MF0164,EXEST,BOPF,1163.5,1160.0,0001,2017,00258,580.0,1160.0,WALTERS BAY BOGAWANTALAWA,1M
This can then be read in using Pandas:
import pandas as pd

data = pd.read_csv('output.csv')
print(data)
Which gives:
Borkername Sale year Saleno sale_dte Factoryno Catalogu code Grade Gross weight Net Weight Lot_No \
0 FW 2017 30 02.08.2017 MF0164 EXEST BOPF 1163.5 1160.0 1
invoice_year invoice_no price Netweight Buyer Buyer_code
0 2017 258 580.0 1160.0 WALTERS BAY BOGAWANTALAWA 1M
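As a hedged alternative sketch, the same fixed-width pattern can be applied directly in pandas via Series.str.extract with named groups, skipping the intermediate CSV. The group names reuse the question's field names (with underscores, since regex group names cannot contain spaces), and the numeric clean-ups would still be applied afterwards:

import pandas as pd

lines = pd.Series(['FW201703002082017MF0164EXESTBOPF01163500116000 '
                   '0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M'])

pat = (r'(?P<Borkername>.{2})(?P<Sale_year>.{4})(?P<Saleno>.{3})(?P<sale_dte>.{8})'
       r'(?P<Factoryno>.{6})(?P<Catalogu_code>.{5})(?P<Grade>.{4})(?P<Gross_weight>.{7})'
       r'(?P<Net_Weight>.{7}) (?P<Lot_No>.{4})(?P<invoice_year>.{4})(?P<invoice_no>.{5})'
       r'(?P<price>.{8})(?P<Netweight>.{7})\.(?P<Buyer>.*?) (?P<Buyer_code>.{2})$')

data = lines.str.extract(pat)  # one column per named group
print(data.T)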

Find the year which has highest number of active bonds

Given two lists:
Issue year of bonds
Maturity year of bond
Something like:
issue_year = [1934, 1932, 1945, 1946, ...]
mature_years = [1967, 1937, 1957, 1998, ...]
With this example, the first bond has issue-year of 1934, and maturity year of 1967, while the second bond has issue-year of 1932 and maturity year of 1937, and so on.
The problem I am trying to solve is to find the year which has the highest number of active bonds.
Here is what I have so far. This finds the year in which all bonds are active.
L1 = [1936, 1934, 1937]
L2 = [1940, 1938, 1940]
ctr = 0
for i in range(len(L1)):
    j = i
    L3 = list(range(L1[i], L2[j]))
    if ctr == 0:
        tempnew = L3
    else:
        tempnew = list(set(L3) & set(tempnew))
    ctr = ctr + 1
Here tempnew is the intersection of the active years across all the bonds. But it might happen that this intersection is empty; for example, if bond 1 were active from 1932 through 1945, and bond 2 from 1947 through 1960.
Can someone help?
Here is some code which I believe meets your requirements. It works by scanning through the issue and mature year lists using zip. It then fills out a dict whose keys are all of the active years and whose values are the number of bonds active in that year. Finally it dumps all of the years which have the max number of active bonds:
Code:
def add_active_years(years, issue, mature):
    for year in range(issue, mature + 1):
        years[year] = years.get(year, 0) + 1

# go through the lists and calculate the active years
years = {}
for issue, mature in zip(L1, L2):
    add_active_years(years, issue, mature)

# now reverse the years dict into a count dict of lists of years
counts = {}
for year, count in years.items():
    counts[count] = counts.get(count, []) + [year]

# show the result
print(counts[max(counts.keys())])
Sample Data:
L1 = [1936,1934,1937]
L2 = [1940,1938,1940]
Results:
[1937, 1938]
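For comparison, an equivalent and slightly more compact sketch using collections.Counter (same sample data and the same counting logic; Counter.update accepts any iterable of keys):

from collections import Counter

issue_year = [1936, 1934, 1937]
mature_years = [1940, 1938, 1940]

active = Counter()
for issue, mature in zip(issue_year, mature_years):
    active.update(range(issue, mature + 1))  # count each active year once per bond

best = max(active.values())
print(sorted(year for year, n in active.items() if n == best))
# -> [1937, 1938]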
