I am trying to iterate over a dictionary whose values are lists of row indexes, applying the pd.DataFrame.nsmallest function to each set of row indexes to get the top 3 smallest values per group. However, something is wrong with my for loop: the top 3 values get overwritten on every iteration up to the last set of values in the dictionary, so my final Excel output only shows 3 rows, from the last run of the loop.
When I use print statements this works as expected and I get output for all 16 keys in the dictionary, but when writing to an Excel file I only get the output of the last run of the loop.
import pandas as pd
from tabulate import tabulate
VA = pd.read_excel('Columnar BU P&L.xlsx', sheet_name = 'Variance by Co')
legcon = VA[['Expense', 'Consolidation', 'Exp Category']]
legcon['Variance Type'] = ['Unfavorable' if x < 0 else 'favorable' for x in legcon['Consolidation']]
d = {'Travel & Entertainment': [1,2,3,4,5,6,7,8,9,10,11], 'Office supplies & Expenses': [13,14,15,16,17],
'Professional Fees':[19,20,21,22,23], 'Fees & Assessments':[25,26,27], 'IT Expenses':[29],
'Bad Debt Expense':[31],'Miscellaneous expenses': [33,34,35,36,37],'Marketing Expenses':[40,41,42],
'Payroll & Related Expenses': [45,46,47,48,49,50,51,52,53,54,55,56], 'Total Utilities':[59,60],
'Total Equipment Maint, & Rental Expense': [63,64,65,66,67,68],'Total Mill Expense':[70,71,72,73,74,75,76,77],
'Total Taxes':[80,81],'Total Insurance Expense':[83,84,85],'Incentive Compensation':[88],
'Strategic Initiative':[89]}
Printing the output directly works fine when I do this:
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    print(a)
                Expense  Consolidation            Exp Category Variance Type
5  Transportation - AIR         -19054  Travel & Entertainment   Unfavorable
9                 Meals          -9617  Travel & Entertainment   Unfavorable
7               Lodging          -9439  Travel & Entertainment   Unfavorable
            Expense  Consolidation        Exp Category Variance Type
26     Bank Charges          -4320  Fees & Assessments   Unfavorable
27  Finance Charges          -1389  Fees & Assessments   Unfavorable
25     Payroll Fees          -1145  Fees & Assessments   Unfavorable
However, when I use the below code to write to Excel:
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    for i in range(0, 16):
        a.to_excel(writer, sheet_name='test', startrow=row + 4, index=False)
writer.save()
my output only shows three rows for the last expense category instead of all of them.
I would really appreciate any feedback on how to correct this. Thanks in advance!
With some help from a friend I just realized my silly mistake: there was no row increment in my for loop, so every group was written to the same location instead of on the next rows. The code below fixed the issue (initially I had placed the row increment inside my df.to_excel statement):
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
    a.to_excel(writer, sheet_name='Testt', startrow=row, index=False)
    row = row + 4
writer.save()
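As a side note, newer pandas versions deprecate writer.save() in favor of closing the writer, so the same loop can also be written with a context manager; a minimal sketch, assuming the legcon and d objects defined above:

import pandas as pd

# The context manager closes (and saves) the file automatically.
with pd.ExcelWriter('testt.xlsx', engine='xlsxwriter') as writer:
    row = 0
    for key, value in d.items():
        a = legcon.iloc[value][legcon.iloc[:, 1] < 0].nsmallest(3, 'Consolidation')
        a.to_excel(writer, sheet_name='Testt', startrow=row, index=False)
        row += 4  # three data rows plus one header row per group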
I've been given a homework task to get data from a csv file without using Pandas. The info in the csv file contains headers such as...
work_year:
experience_level: EN Entry-level / Junior, MI Mid-level / Intermediate, SE Senior-level / Expert, EX Executive-level / Director
employment_type: PT Part-time, FT Full-time, CT Contract, FL Freelance
job_title:
salary:
salary_currency:
salary_in_usd: The salary in USD
employee_residence: Employee’s primary country of residence
remote_ratio:
One of the questions is:
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
The only way I've managed to do this is to iterate through the csv and add a load of 'if' statements according to the experience level and job title, but this is taking me forever.
Any ideas of how to tackle this differently? Not using any libraries/modules.
Example of my code:
en_ai_scientist = 0
count_en_ai_scientist = 0

with open('/Users/xxx/Desktop/ds_salaries.csv', 'r') as f:
    csv_reader = f.readlines()
    for row in csv_reader[1:]:
        new_row = row.split(',')
        experience_level = new_row[2]
        job_title = new_row[4]
        salary_in_usd = new_row[7]
        if experience_level == 'EN' and job_title == 'AI Scientist':
            en_ai_scientist += int(salary_in_usd)
            count_en_ai_scientist += 1

avg_en_ai_scientist = en_ai_scientist / count_en_ai_scientist
print(avg_en_ai_scientist)
When working out an example like this, I find it helpful to ask, "What data structure would make this question easy to answer?"
For example, the question asks
For each experience level, compute the average salary (over 3 years (2020/21/22)) for each job title?
To me, this implies that I want a dictionary keyed by a tuple of experience level and job title, with the salaries of every person who matches. Something like this:
data = {
    ("EN", "AI Scientist"): [1000, 2000, 3000],
    ("SE", "AI Scientist"): [2000, 3000, 4000],
}
The next question is: how do I get my data into that format? I would read the data in with csv.DictReader, and add each salary number into the structure.
import csv

data = {}
with open('input.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        experience_level = row['experience_level']
        job_title = row['job_title']
        key = experience_level, job_title
        if key not in data:
            # provide a default value if the key does not exist yet
            # (look at collections.defaultdict for a better way to do this)
            data[key] = []
        # convert to int so the averages below are numeric
        data[key].append(int(row['salary_in_usd']))
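As the comment hints, collections.defaultdict removes the need for the membership check; a minimal sketch of the same loop under that assumption:

import csv
from collections import defaultdict

data = defaultdict(list)  # missing keys start out as an empty list
with open('input.csv', newline='') as csvfile:
    for row in csv.DictReader(csvfile):
        data[row['experience_level'], row['job_title']].append(int(row['salary_in_usd']))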
Now that you have your data organized, you can compute average salaries:
for (experience_level, job_title), salary_data in data.items():
    print(experience_level, job_title, sum(salary_data) / len(salary_data))
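For instance, with the small sample dict shown earlier, this prints:

EN AI Scientist 2000.0
SE AI Scientist 3000.0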
I am new to Python and trying to read a CSV file and generate a sales report. For example, I have a dataset as per below:
Category   | State      | Sales | Profit
Clothes    | California | 2389  | 608
Stationery | Georgia    | 687   | 54
Gadgets    | Washington | 324   | 90
How can I get the sum of the profit based on the state and category without using pandas? Meaning, I need to sum the values of "Sales" and "Profit" when the category is "Clothes".
I am using the code below currently, which requires a lot of manual effort.
import csv

with open(path, "r") as store_file:
    reader = csv.DictReader(store_file)
    total_sales_clothes = 0
    total_sales_stationery = 0
    total_sales_gadgets = 0
    for row in reader:
        category = row.get("Category")
        if category == "Clothes":
            sales_clothes = float(row.get("Sales"))
            total_sales_clothes += sales_clothes
        elif category == "Stationery":
            sales_stationery = float(row.get("Sales"))
            total_sales_stationery += sales_stationery
        elif category == "Gadgets":
            sales_gadgets = float(row.get("Sales"))
            total_sales_gadgets += sales_gadgets

print("Total Sales for Clothes is: {0}".format(total_sales_clothes))
print("Total Sales for Stationery is {0}".format(total_sales_stationery))
print("Total Sales for Gadgets is {0}".format(total_sales_gadgets))
You can use a Python dict; this way, you don't have to hardcode the categories.
The Python dictionary is a key-value data structure which is extremely useful.
In your case, the keys would be the categories and the values would be the running sales totals.
E.g.
{"Clothes": 2389, "Stationery": 0, "Gadgets": 0}
Edit: note that you should check whether the category already exists as a key. If it does, just add the sales value; otherwise, assign it.
import csv

with open(path, "r") as store_file:
    reader = csv.DictReader(store_file)
    total_sales = {}
    for row in reader:
        category = row.get("Category")
        if category in total_sales:
            total_sales[category] += float(row.get("Sales"))
        else:
            total_sales[category] = float(row.get("Sales"))

print("Total Sales for Clothes is: {0}".format(total_sales['Clothes']))
If you want to traverse the dict and print the sales for all the categories, use:
for key in total_sales:
    print("Total Sales for {0} is: {1}".format(key, total_sales[key]))
And though you mentioned you don't want to use pandas, I would suggest checking it out if you are going to work with these kinds of CSV datasets for a long period. Once you start, you will find how much easier it makes your job; you will save more time than you spend learning it.
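For reference, a minimal pandas sketch of the same aggregation, assuming the column names from the table above:

import pandas as pd

df = pd.read_csv(path)  # same file as in the snippets above
# One row per category: the sum of Sales and Profit across all matching rows.
print(df.groupby("Category")[["Sales", "Profit"]].sum())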
I want to return the joined matches from the following text for find = ['gold', 'mining', 'silver', 'steel'], but it turns out my code just prints the first one that appears.
One of the rows in output.csv:
desc
"The **2014 Orkney earthquake** occurred at 12:22:33 SAST on 5 August, with the
epicentre near Orkney, a gold mining town in the Klerksdorp district in the
North West province of South Africa. The shock was assigned a magnitude of 5.5
on the Richter scale by the Council for Geoscience (CGS) in South Africa,
making it the biggest earthquake in South Africa since the 1969 Tulbagh
earthquake, which had a magnitude of 6.3 on the Richter scale. The United
States Geological Survey (USGS) estimated a focal depth of 5.0 km (3.1 mi).
The CGS reported 84 aftershocks on 5 August and 31 aftershocks on 6 August,
with a magnitude of 1.0 to 3.8 on the Richter scale. According to the CGS, the
earthquake is the biggest mining-related earthquake in South African history."
output: gold
expected output: gold, mining
Here is what I have done:
reader = pd.read_csv('output.csv', chunksize=1000)
find = ['gold', 'mining', 'silver', 'steel']
for chunk in reader:
    chunk.columns = ['desc']

    def process(x):
        for s in find:
            if s in x['desc']:
                print('', s)
                return s
        return ''

    chunk['place'] = chunk.apply(lambda x: process(x), axis=1)
    chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
    print(chunk)
How to join the result?
EDIT
def preprocess_patetnt(name):
    reader = pd.read_csv('output.csv', chunksize=1000)
    sname = sorted(name, key=len, reverse=True)
    for chunk in reader:
        chunk.columns = ['row', 'desc']
        chunk['place'] = chunk["desc"].str.findall("|".join(sname)).apply(set)
        chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
        print(chunk)

place = pd.read_csv('country.csv', chunksize=13000000, error_bad_lines=False)
for chunk in place:
    chunk.columns = ['name']
    preprocess_patetnt(chunk["name"])
country.csv contains a list of country names, and output.csv is as shown above.
But it gives me this error: re.error: bad character range á-K at position 77230
Your process function returns as soon as it gets the first hit. Instead, you should collect all of the hits and return them as one string. Use a list comprehension for this kind of loop, and the str.join(iterable) method to join the list into a string (I'm guessing here that sname is actually find).
reader = pd.read_csv('output.csv', chunksize=1000)
find = ['gold', 'mining', 'silver', 'steel']
for chunk in reader:
    chunk.columns = ['desc']

    def process(x):
        return ','.join([s for s in find if s in x['desc']])

    chunk['place'] = chunk.apply(lambda x: process(x), axis=1)
    chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
    print(chunk)
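As for the re.error in your EDIT: it most likely comes from country names containing regex metacharacters (the á-K range suggests an unescaped character class forming inside the joined pattern), so escaping each name before building the pattern should help; a minimal sketch, assuming sname as in your edit:

import re

# Escape each name so characters like '-' or '[' are matched literally
# instead of being interpreted as regex syntax.
pattern = "|".join(re.escape(s) for s in sname)
chunk['place'] = chunk["desc"].str.findall(pattern).apply(set)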
happy coding!
I have stumbled upon a trivial problem in pandas. I have two dataframes. The first one, df_1, is as follows:
vendor_name   date           company_name     state
PERTH         is june 2019   Abc enterprise   Kentucky
Megan Ent     25-april-2019  Xyz Fincorp      Texas
The second one df_2 contains the correct values for each column in df_1.
df_2
Field          wrong value      correct value
vendor_name    PERTH            Perth Enterprise
date           is               15    ## this means that "is" should be read as "15"
company_name   Abc enterprise   ABC International Enterprise Inc.
In order to replace the values with the correct ones in df_1 (except the date field), I am using the pandas .loc method. Below is the code snippet:
vend = df_1['vendor_name'].tolist()
comp = df_1['company_name'].tolist()
state = df_1['state'].tolist()
for i in vend:
    if df_2['wrong value'].str.contains(i):
        crct = df_2.loc[df_2['wrong value'] == i, 'correct value'].tolist()
Similarly, for company and state I have followed the above approach.
However, crct is returning a blank list. Ideally it should return
['Perth Enterprise', 'ABC International Enterprise Inc.']
The next step would be to replace the respective field values with the above list.
With the above, I have three questions:
Why is the above code generating a blank list? What am I missing here?
How can I replace the respective fields using the df_1.replace method?
What is the correct approach to replacing the date portion in df_1 with the correct one from df_2?
Edit: when the data has looping replacements (i.e. overlapping keys and values), replacement on the whole dataframe will fail. In that case, do it column by column and concat them together. Finally, use join to add back any columns from df1 that were not replaced:
df_replace = pd.concat([df1[k].replace(val, regex=True) for k, val in d.items()], axis=1).join(df1.state)
Original:
I tried your code in my interactive session and it gives the error ValueError: The truth value of a Series is ambiguous on df_2['wrong value'].str.contains(i).
Assuming you have multiple vendor names, the simple way is to construct a dictionary from a groupby of df2 and use it with df.replace on df1:
d = {k: gp.set_index('wrong value')['correct value'].to_dict()
     for k, gp in df2.groupby('Field')}
Out[64]:
{'company_name': {'Abc enterprise': 'ABC International Enterprise Inc. '},
'date': {'is': '15'},
'vendor_name': {'PERTH': 'Perth Enterprise'}}
df_replace = df1.replace(d, regex=True)
print(df_replace)
        vendor_name           date                       company_name     state
0  Perth Enterprise   15 june 2019  ABC International Enterprise Inc.  Kentucky
1         Megan Ent  25-april-2019                        Xyz Fincorp     Texas
Note: your sample df2 only has a value for vendor PERTH, so only the first row is replaced. When you have all the vendor_names in df2, it will replace them all in df1.
A simple way to do that is to iterate over the first dataframe and then replace the wrong values :
Result = pd.DataFrame()
for i in range(len(df1)):
    vendor_name = df1.iloc[i]['vendor_name']
    date = df1.iloc[i]['date']
    company_name = df1.iloc[i]['company_name']
    if vendor_name in df2['wrong value'].values:
        vendor_name = df2.loc[df2['wrong value'] == vendor_name]['correct value'].values[0]
    if company_name in df2['wrong value'].values:
        company_name = df2.loc[df2['wrong value'] == company_name]['correct value'].values[0]
    new_row = {'vendor_name': [vendor_name], 'date': [date], 'company_name': [company_name]}
    new_row = pd.DataFrame(new_row, columns=['vendor_name', 'date', 'company_name'])
    Result = Result.append(new_row, ignore_index=True)
Result:
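Side note: DataFrame.append was removed in recent pandas versions, so on newer installs the same loop can collect plain dicts and build the frame once at the end; a minimal sketch under that assumption:

rows = []
for i in range(len(df1)):
    # take the three fields of interest as a plain dict
    row = df1.iloc[i][['vendor_name', 'date', 'company_name']].to_dict()
    for col in ('vendor_name', 'company_name'):
        match = df2.loc[df2['wrong value'] == row[col], 'correct value']
        if not match.empty:
            row[col] = match.values[0]
    rows.append(row)
Result = pd.DataFrame(rows, columns=['vendor_name', 'date', 'company_name'])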
Define the following replace function:
import re

def repl(row):
    fld = row.Field
    v1 = row['wrong value']
    v2 = row['correct value']
    updInd = df_1[df_1[fld].str.contains(v1)].index
    df_1.loc[updInd, fld] = df_1.loc[updInd, fld]\
        .str.replace(re.escape(v1), v2)
Then call it for each row in df_2:
for _, row in df_2.iterrows():
    repl(row)
Note that str.replace alone does not require importing re (pandas imports it under the hood). But in the above function, re.escape is called explicitly from our code, hence the import re at the top is required.
In a CSV file, bikeshare data is available for three different cities: NYC, Chicago, and Washington. I need to find which city has the highest number of trips, and also which city has the highest proportion of trips made by subscribers (user_type).
Below is my code:
import csv

def number_of_trips(filename):
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        # initialize count variables
        ny_trips = 0
        wh_trips = 0
        ch_trips = 0
        n_usertype = 0
        # tally up ride types
        for row in reader:
            if row['city'] == 'NYC':
                ny_trips += 1
            elif row['city'] == 'Chicago':
                ch_trips += 1
            else:
                wh_trips += 1
        # pick the city with the highest tally
        if wh_trips < ny_trips > ch_trips:
            city = 'NYC'
        elif ny_trips < wh_trips > ch_trips:
            city = 'Washington'
        else:
            city = 'Chicago'
        return city
This is throwing an error: KeyError: 'city'.
I am very new to Python, so please guide me on how to achieve the above requirements.
You should consider using the pandas library.
import pandas as pd

# Reading the CSV
df = pd.read_csv('file')

# Counting the values of each city entry (assuming 'city' is a column header)
df['city'].value_counts()
For the second piece, you can use a pivot table with the len as the aggfunc value. The documentation for pd.pivot_table is shown here.
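A minimal sketch of that idea, assuming hypothetical 'user_type' and 'trip_id' column names and a 'Subscriber' user type:

import pandas as pd

df = pd.read_csv('file')

# Trips per city and user type; len counts the rows in each bucket.
# 'trip_id' and 'user_type' are assumed column names here.
counts = pd.pivot_table(df, index='city', columns='user_type',
                        values='trip_id', aggfunc=len, fill_value=0)

# Proportion of each city's trips made by subscribers.
subscriber_share = counts['Subscriber'] / counts.sum(axis=1)
print(subscriber_share.idxmax())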