Most profitable element for every category
I need to read my file and determine the most profitable element for every category within a date range entered by the user.
File:
Date|Category|Name|Price
05/01/2016|category6|Name8|4200
06/01/2016|category1|Name1|1000
07/01/2016|category2|Name2|1200
07/01/2016|category3|Name1|1000
07/01/2016|category1|Name2|1200
07/01/2016|category3|Name2|1200
07/01/2016|category2|Name1|1000
07/01/2016|category2|Name2|1200
07/01/2016|category2|Name2|1200
08/01/2016|category2|Name1|1000
09/01/2016|category4|Name7|3100
My real file will be a lot bigger; this is just an example.
Start Date : 07/01/2016
End Date: 07/01/2016
For the dates in that range, the program should print the most profitable element for every category:
Category 1:
07/01/2016|category1|Name2|1200
Name2 = 1200
Comparing prices >>> Most profitable is: Name2
Category 2:
07/01/2016|category2|Name2|1200
07/01/2016|category2|Name1|1000
07/01/2016|category2|Name2|1200
07/01/2016|category2|Name2|1200
Name1 = 1000
Name2 = 3600
Comparing prices >>> Most profitable: Name2
Category 3:
07/01/2016|category3|Name1|1000
07/01/2016|category3|Name2|1200
Name1: 1000
Name2: 1200
Comparing prices >>> Most profitable: Name2
The problem is that I don't know how to compare these prices across categories and names.
Also, the dates will always be in ascending order.
I'm using both dictionaries and lists.
INPUT AND OUTPUT:
Start Date : 07/01/2016
End Date: 07/01/2016
Category1; Most profitable is: Name2
Category2; Most profitable is: Name2
Category3; Most profitable is: Name2
In this case the most profitable is Name2 for every category.
The following is not exactly what you need (it compares individual prices rather than summed totals), but it should give you a fair idea to get going. I keep track of the most profitable name and value for each combination of date and category:
date_cat_profit_dict = {}
with open('data.txt') as f:
    next(f)  # skip the 'Date|Category|Name|Price' header line
    for line in f:
        # Split and store into variables.
        # You could skip processing the line here
        # if you are looking for a specific date.
        date, category, name, profit = line.strip().split('|')
        # Convert to int for comparison
        profit = int(profit)
        # Key for storing into dict
        composite_key = '{0}|{1}'.format(date, category)
        # _ because we don't need the name right now
        _, max_profit = (date_cat_profit_dict.
                         setdefault(composite_key, ('', 0)))
        if max_profit < profit:
            date_cat_profit_dict[composite_key] = (name, profit)
for composite_key, (name, profit) in date_cat_profit_dict.items():
    print('Max for {0} : {1}, {2}'.format(composite_key, name, profit))
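To get closer to the output asked for in the question, the prices could be summed per category and name before comparing. Below is a minimal sketch of that idea, not a definitive implementation. The function name `most_profitable` and the constant `DATE_FORMAT` are made up for illustration, and the day/month/year date format is an assumption, since the sample data does not make the order clear.

```python
from collections import defaultdict
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # assumption: dates are day/month/year


def most_profitable(lines, start, end):
    """Return {category: (name, total)} for rows dated start..end inclusive."""
    start_date = datetime.strptime(start, DATE_FORMAT)
    end_date = datetime.strptime(end, DATE_FORMAT)
    # totals[category][name] -> summed price inside the date range
    totals = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.strip()
        if not line or line.startswith("Date|"):
            continue  # skip blank lines and the header
        date_text, category, name, price = line.split("|")
        date = datetime.strptime(date_text, DATE_FORMAT)
        if date > end_date:
            break  # the question says dates are in ascending order
        if date >= start_date:
            totals[category][name] += int(price)
    # For each category keep the name with the largest summed price
    return {
        category: max(names.items(), key=lambda item: item[1])
        for category, names in totals.items()
    }
```

Calling `most_profitable(open('data.txt'), '07/01/2016', '07/01/2016')` on the sample file should give Name2 for category1, category2 and category3, with a total of 3600 for category2.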
I have a list Business = ['Company name','Mycompany','Revenue','1000','Income','2000','employee','3000','Facilities','4000','Stock','5000']. The structure of the list is shown below:
Company Mycompany
Revenue 1000
Income 2000
employee 3000
Facilities 4000
Stock 5000
The dynamic list gets updated on every iteration, and some of the items in the list are missing. For example, in execution 1 the list is updated as below:
Company Mycompany
Income 2000 #revenue is missing
employee 3000
Facilities 4000
Stock 5000
In the list above, Revenue has been removed because the company has no revenue. In the second example:
Company Mycompany
Revenue 1000
Income 2000
Facilities 4000 #Employee is missing
Stock 5000
In example 2 above, employee is missing. How can I create an output list that replaces the missing values with 0? In example 1 Revenue is missing, so I have to put ['Revenue','0'] at its position in the output list. For better understanding, please see below:
output list created for example 1: Revenue replaced with 0
Company Mycompany| **Revenue 0**| Income 2000| employee 3000| Facilities 4000| Stock 5000
Output list for example 2: employee is replaced with 0
Company Mycompany| Revenue 1000| Income 2000| **employee 0**| Facilities 4000| Stock 5000
How can I build the output list with 0 in place of the missing items, without changing the structure of the list? My code so far:
for line in Business:
    if 'Company' not in line:
        Business.insert( 0, 'company')
        Business.insert( 1, '0')
    if 'Revenue' not in line:
        #got stuck here
    if 'Income' not in line:
        #got stuck here
    if 'Employee' not in line:
        #got stuck here
    if 'Facilities' not in line:
        #got stuck here
    if 'Stock' not in line:
        #got stuck here
Thanks a lot in advance
If you are getting the input as a list, you can convert it into a dict like this, which gives you a better handle on the data. Receiving the data as a dictionary in the first place would be a better choice, though.
Business = ['Company name','Mycompany','Revenue',1000,'Income',2000,'employee',3000,'Facilities',4000,'Stock',5000]
BusinessDict = {Business[i]:Business[i+1] for i in range(0,len(Business)-1,2)}
print(BusinessDict)
As said in the comments, a dict is a much better data structure for the problem. If you really need the list, you could use a temporary dict like this:
example = ['Company name','Mycompany','Income','2000','employee','3000','Facilities','4000','Stock','5000']
template = ['Company name', 'Revenue', 'Income', 'employee', 'Facilities', 'Stock']
# build a temporary dict
exDict = dict(zip(example[::2], example[1::2]))
# work on it
result = []
for i in template:
    result.append(i)
    if i in exDict.keys():
        result.append(exDict[i])
    else:
        result.append(0)
A bit more efficient (but harder to understand for beginners) would be to create the temporary dict like this:
i = iter(example)
example_dict = dict(zip(i, i))
This works because both arguments to zip are the same iterator, so zip consumes the items alternately as keys and values.
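Putting the two pieces together, the whole fix could be wrapped in one small function. This is only a sketch: the name `fill_missing` and its `default` parameter are invented here, and it assumes the input list always alternates key, value.

```python
def fill_missing(flat, template, default='0'):
    """Rebuild a flat [key, value, ...] list, adding `default` for absent keys."""
    it = iter(flat)
    pairs = dict(zip(it, it))  # consecutive items become key/value pairs
    result = []
    for key in template:
        # dict.get returns the default when the key is missing
        result.extend([key, pairs.get(key, default)])
    return result


template = ['Company name', 'Revenue', 'Income', 'employee', 'Facilities', 'Stock']
example = ['Company name', 'Mycompany', 'Income', '2000', 'employee', '3000',
           'Facilities', '4000', 'Stock', '5000']
print(fill_missing(example, template))
```

Because the template fixes the order, the output keeps the original structure no matter which items were dropped.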
You can use a dictionary like this:
d={'Company':0,'Revenue':0,'Income':0,'employee':0,'Facilities':0,'Stock':0}
given=[['Company','Mycompany'],['Income',2000],['employee',3000],['Facilities',4000],['Stock',5000]]
for i in given:
    d[i[0]]=i[1]
ans=[]
for key,value in d.items():
    ans.append([key,value])
I'm pretty new to Python and have searched the web for an answer to this but it is tricky to find without showing it as an example!
The data I have is here:
Dataset
What I'm after is the number of times each 'HomeTeam' has appeared in both the 'HomeTeam' and 'AwayTeam' columns up to and including the date. So for the last row of data in the sample, the input would be 'Fulham', and the output = 4. This is because 'Fulham' has appeared 4 times in the 'HomeTeam' and 'AwayTeam' columns. For the first row of data, again, the input would be 'Fulham', but the output = 1, as it is the first time 'Fulham' has appeared. For the sample dataset, the output should be:
[1,1,2,1,3,1,4]
My code so far only allows me to get the number of times each team has appeared in the 'HomeTeam' column only:
df['H Count'] = df.groupby(['HomeTeam']).cumcount()+1
This gives me the output:
[1,1,1,1,2,1,2]
Any help would be much appreciated!
As I understand, the team currently in the HomeTeam is being used as input.
I don't know how you read in the dataset, but I have just created lists below. The logic should however be clear.
Having the below, I get [1, 1, 3]
HomeTeam = list()
HomeTeam.append("Fulham")
HomeTeam.append("Tottenham")
HomeTeam.append("Fulham")
AwayTeam = list()
AwayTeam.append("Chelsea")
AwayTeam.append("Fulham")
AwayTeam.append("Liverpool")
H_Count = []
p = 1
# The team in HomeTeam is used as input
for team in HomeTeam:
    # Get the lists up to and including the current row
    tmp_Home = HomeTeam[:p]
    tmp_Away = AwayTeam[:p]
    # Count the number of times the team has occurred in home and away
    H_Count.append(tmp_Home.count(team) + tmp_Away.count(team))
    p+=1
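The loop above re-counts both lists for every row, which gets slow on a long season. A possible alternative is to keep one running tally instead. The sketch below uses `collections.Counter`; the function name `running_appearances` is made up for illustration, and it assumes the rows are already in date order.

```python
from collections import Counter


def running_appearances(home_teams, away_teams):
    """For each row, count how often the home team has appeared so far
    in either column, including the current row."""
    seen = Counter()
    counts = []
    for home, away in zip(home_teams, away_teams):
        # Register both teams of this row before reading the tally
        seen[home] += 1
        seen[away] += 1
        counts.append(seen[home])
    return counts


print(running_appearances(["Fulham", "Tottenham", "Fulham"],
                          ["Chelsea", "Fulham", "Liverpool"]))  # [1, 1, 3]
```

With a DataFrame, the two columns could be passed in as `df['HomeTeam']` and `df['AwayTeam']`, and the returned list assigned to `df['H Count']`.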
str = 'FW201703002082017MF0164EXESTBOPF01163500116000 0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M'
The expression above is the string that needs to be split, with the fields extracted separately as follows:
Borkername = FW
Sale year = 2017
Saleno = 0300
sale_dte = 20.08.2017 # date need to be format
Factoryno = MF0164
Catalogu code= EXEST
Grade =BOPF
Gross weight =01163.50 #decimal point needed
Net Weight = 01163.50 #decimal point needed
Lot_No = 0001
invoice_year = 2017
invoice_no = 00258
price = 000580.00 #decimal point needed
Netweight = 01160.00 #decimal point needed
Buyer = 'WALTERS BAY BOGAWANTALAWA'
Buyer_code = '1M'
This is a single line without any delimiters. Kindly help me write a regular expression to separate each field into a column of a pandas DataFrame in Python.
For example:
(\A[A-Z]{2})
This will give me the first 2 characters. How can I get next 4 digits as the year?
You need to do this in two goes. First use a regular expression to split the string up into (mostly) fixed length segments. Then with the list you get back, fix the fields manually into the format you require. For example:
import re
import csv
headings = [
"Borkername", "Sale year", "Saleno", "sale_dte", "Factoryno", "Catalogu code", "Grade", "Gross weight",
"Net Weight", "Lot_No", "invoice_year", "invoice_no", "price", "Netweight", "Buyer", "Buyer_code"]
re_fields = re.compile(r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) (.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$')
with open('input.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_writer = csv.writer(f_output)
    csv_writer.writerow(headings)
    for line in f_input:
        fields = list(re_fields.match(line).groups())
        fields[3] = "{}.{}.{}".format(fields[3][:2], fields[3][2:4], fields[3][4:])
        fields[7] = float("{}.{}".format(fields[7][:5], fields[7][5:]))
        fields[8] = float("{}.{}".format(fields[8][:5], fields[8][5:]))
        fields[12] = float("{}.{}".format(fields[12][:6], fields[12][6:]))
        fields[13] = float("{}.{}".format(fields[13][:5], fields[13][5:]))
        csv_writer.writerow(fields)
This would give you output.csv containing:
Borkername,Sale year,Saleno,sale_dte,Factoryno,Catalogu code,Grade,Gross weight,Net Weight,Lot_No,invoice_year,invoice_no,price,Netweight,Buyer,Buyer_code
FW,2017,030,02.08.2017,MF0164,EXEST,BOPF,1163.5,1160.0,0001,2017,00258,580.0,1160.0,WALTERS BAY BOGAWANTALAWA,1M
This can then be read in using Pandas:
import pandas as pd
data = pd.read_csv('output.csv')
print(data)
Which gives:
Borkername Sale year Saleno sale_dte Factoryno Catalogu code Grade Gross weight Net Weight Lot_No \
0 FW 2017 30 02.08.2017 MF0164 EXEST BOPF 1163.5 1160.0 1
invoice_year invoice_no price Netweight Buyer Buyer_code
0 2017 258 580.0 1160.0 WALTERS BAY BOGAWANTALAWA 1M
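If the fields are needed by name rather than by position, the same two-step idea (split by fixed widths, then fix up individual fields) could be written with named groups. This is a sketch under assumptions: the group names and the helper `parse_record` are invented here, the widths are copied from the regular expression above, and the numeric fields are assumed to carry two implied decimal places.

```python
import re

# Same widths as the positional pattern above, with (hypothetical) names
RECORD = re.compile(
    r'(?P<broker>.{2})(?P<sale_year>.{4})(?P<sale_no>.{3})(?P<sale_date>.{8})'
    r'(?P<factory_no>.{6})(?P<catalogue>.{5})(?P<grade>.{4})'
    r'(?P<gross_weight>.{7})(?P<net_weight>.{7}) '
    r'(?P<lot_no>.{4})(?P<invoice_year>.{4})(?P<invoice_no>.{5})'
    r'(?P<price>.{8})(?P<net_weight_2>.{7})\.'
    r'(?P<buyer>.*?) (?P<buyer_code>.{2})$'
)

# Fields assumed to have two implied decimal places
DECIMAL_FIELDS = ("gross_weight", "net_weight", "price", "net_weight_2")


def parse_record(line):
    """Split one fixed-width line into a dict of cleaned-up fields."""
    match = RECORD.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("line does not match the expected layout: %r" % line)
    record = match.groupdict()
    raw_date = record["sale_date"]  # DDMMYYYY -> DD.MM.YYYY
    record["sale_date"] = "{}.{}.{}".format(raw_date[:2], raw_date[2:4], raw_date[4:])
    for field in DECIMAL_FIELDS:
        record[field] = int(record[field]) / 100
    return record
```

A list of such dicts can be handed straight to `pd.DataFrame(...)`, which would skip the intermediate CSV file and keep leading zeros in fields such as `lot_no`, since they stay strings.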
I'm hoping to pick your brains on optimization. I am still learning more and more about Python and using it in my day-to-day operations analyst position. One of my tasks is to go through approximately 60k unique record identifiers and search another dataframe that holds approximately 120k records of interactions, the employee who authored each interaction, and the time it happened.
For Reference, the two dataframes at this point look like:
main_data = Unique Identifier Only
nok_data = Authored By Name, Unique Identifier (known as Case File Identifier), Note Text, Created On.
My setup currently sorts through and matches my data at approximately 2,500 rows per minute, so a run takes approximately 25-30 minutes. What I am curious about is whether any of the steps I perform are:
Redundant or inefficient, slowing my process overall
A poor use of syntax to work around my lack of knowledge
Below is my code:
import csv
import pandas as pd

nok_data = pd.read_csv("raw nok data.csv") #Data set from warehouse
main_data = pd.read_csv("exampledata.csv") #Data set taken from iTx ids from referral view
row_count = 0
error_count = 0
print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist()) #Commented out, used to grab header titles if needed.
data_length = len(main_data) #used for counting how many records left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"]) #convert all dates to datetime at beginning.
for row in main_data["iTx Case ID"]:
    list_data = []
    nok = nok_data["Case File Identifier"] == row
    matching_dates = nok_data[["Created On", "Authored By Name"]][nok == True] #takes created on date only if nok shows row was true
    if len(matching_dates) > 0:
        try:
            min_dates = matching_dates.min(axis=0)
            earliest_nok[row] = [min_dates[0], min_dates[1]]
        except ValueError:
            error_count += 1
            earliest_nok[row] = None
    row_count += 1
    print("{} out of {} records".format(row_count, data_length))
with open('finaloutput.csv','wb') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in earliest_nok.items():
        writer.writerow([key, value])
Looking for any advice or expertise from those who have been writing code like this much longer than I have. I appreciate all of you who even just took the time to read this. Happy Tuesday,
Andy M.
**** EDIT REQUESTED TO SHOW DATA
Sorry for my novice move there not including any data type.
main_data example
ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590
nok_data aka "raw nok data.csv"
Authored By: Case File Identifier: Note Text: Authored on
John Doe 2017-023594 Random Text 4/1/2017 13:24:35
John Doe 2017-023594 Random Text 4/1/2017 13:11:20
Jane Doe 2017-023590 Random Text 4/3/2017 09:32:00
Jane Doe 2017-023590 Random Text 4/3/2017 07:43:23
Jane Doe 2017-023590 Random Text 4/3/2017 7:41:00
John Doe 2017-023592 Random Text 4/5/2017 23:32:35
John Doe 2017-023592 Random Text 4/6/2017 00:00:35
It looks like you want to group on the Case File Identifier and get the minimum date and corresponding author.
# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.
# (The sort is chronological only if `Authored on` holds datetimes, e.g. after pd.to_datetime.)
nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)
df = (
    nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
    .groupby('Case File Identifier:')[['Authored on', 'Authored By:']].first()
)
d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').items()}
>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
'2017-023592': ['4/5/17 23:32', 'John Doe'],
'2017-023594': ['4/1/17 13:11', 'John Doe']}
>>> df
Authored on Authored By:
Case File Identifier:
2017-023590 4/3/17 7:41 Jane Doe
2017-023592 4/5/17 23:32 John Doe
2017-023594 4/1/17 13:11 John Doe
It is probably easier to use df.to_csv(...).
The items from main_data['ITX Case ID'] where there is no matching record have been ignored but could be included if required.
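To see why the grouped version is so much faster than the original loop, it may help to look at the same idea in plain Python: the original scans all ~120k notes once per case ID, whereas a single pass over the notes that remembers the earliest entry per case is enough. The sketch below is illustrative only. The function name `earliest_notes` and the constant `STAMP_FORMAT` are invented here, and the month/day/year timestamp format is an assumption based on the sample rows.

```python
from datetime import datetime

STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumption: month/day/year timestamps


def earliest_notes(case_ids, notes):
    """Return {case_id: (earliest_timestamp, author)} in one pass over notes.

    `notes` is an iterable of (author, case_id, created_on_text) tuples.
    """
    wanted = set(case_ids)  # set membership tests are O(1)
    earliest = {}
    for author, case_id, created_on in notes:
        if case_id not in wanted:
            continue
        stamp = datetime.strptime(created_on, STAMP_FORMAT)
        # Keep this note only if it is the first or an earlier one
        if case_id not in earliest or stamp < earliest[case_id][0]:
            earliest[case_id] = (stamp, author)
    return earliest
```

Case IDs without any note simply do not appear in the result, which matches the behaviour of the grouped version above.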
I'm attempting to learn how to search csv files. In this example, I've worked out how to search a specific column (date of birth) and how to search indexes within that column to get the year of birth.
I can search for greater than a specific year - e.g. typing in 45 will give me everyone born in or after 1945, but the bit I'm stuck on is if I type in a year not specifically in the csv/list I will get an error saying the year isn't in the list (which it isn't).
What I'd like to do is iterate through the years in the column until the next year that is in the list is found and print anything greater than that.
I've tried a few bits with iteration, but my brain has finally ground to a halt. Here is my code so far...
import csv

data=[]
with open("users.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data.append(row)
print(data)
lookup = input("Please enter a year of birth to start at (eg 67): ")
#lookupint = int(lookup)
#searching column 3 eg [3]
#but also searching index 6-8 in column 3
#eg [6:8] being the year of birth within the DOB field
col3 = [x[3][6:8] for x in data]
#just to check if col3 is showing the right data
print(col3)
print ("test3")
#looks in column 3 for 'lookup' which is a string
#in the table
if lookup in col3: #can get rid of this
    output = col3.index(lookup)
    print (col3.index(lookup))
    print("test2")
for k in range (0, len(col3)):
    #looks for data that is equal or greater than YOB
    if col3[k] >= lookup:
        print(data[k])
Thanks in advance!
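One possible way around the "not in list" error is to skip the lookup by position altogether and compare the years as numbers, so the entered year does not have to exist in the file. The sketch below is a guess at what is wanted, not a definitive answer. The function name `born_in_or_after` is made up, and it assumes that column 3 holds dates like DD/MM/YY, that every row is a data row (no header), and that all years fall in the same century.

```python
def born_in_or_after(rows, lookup, dob_column=3):
    """Return the rows whose two-digit birth year is >= the entered year."""
    threshold = int(lookup)
    # [6:8] picks the YY part of a DD/MM/YY date, as in the code above
    return [row for row in rows if int(row[dob_column][6:8]) >= threshold]


rows = [
    ["1", "Ann", "Smith", "01/02/45"],
    ["2", "Bob", "Jones", "03/04/67"],
    ["3", "Cy", "Brown", "05/06/70"],
]
for row in born_in_or_after(rows, "50"):  # 50 is not in the data
    print(row)
```

Here `rows` stands in for the `data` list built from `users.csv`, and the names in it are placeholder values.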