I have the following piece of code:
my_list = ["US", "IT", "ES", "NL"]
for i in my_list:
    A = sum_products_by_country(world_level, i)
    df = pd.DataFrame({'value': A})
    Descending = df.sort_values(by='value', ascending=False)
    Top_5 = Descending[0:5]
    print(Top_5)
The "sum_products_by_country" is a created function which takes as arguments a data frame ( in my case is named "world_level") and a country name and returns the sum by product for this country. Using this loop I find the top5 products and the sums for each country of my_list. Here is the output of this loop:
US value
Product
B 1492
H 455
BB 351
C 119
F 117
IT value
Product
P 346
U 331
A 379
Q 190
D 1389
ES value
Product
P 3046
U3 331
A 379
Q 1390
DD 10389
NL value
Product
P 3465
U 3313
AA 379
2Q 190
D 189
I want to write this output to an Excel sheet using:
writer = pd.ExcelWriter('top products.xlsx', engine='xlsxwriter')
Top_5.to_excel(writer, sheet_name='Sheet1')
writer.save()
Could you tell me where I should put the code above in order to get the required Excel document?
Is there also a way to get the column names (country, product, value) only once at the top of my Excel document, rather than for each country separately? I want something like this:
Country Product value
US
B 1492
H 455
BB 351
C 119
F 117
IT
P 346
U 331
A 379
Q 190
D 1389
ES
P 3046
U3 331
A 379
Q 1390
DD 10389
NL
P 3465
U 3313
AA 379
2Q 190
D 189
Thank you
This script should help you:
import openpyxl

# Create workbook object
wb = openpyxl.Workbook()
sheet = wb.active
sheet.title = 'Products by country'

# Add titles in the first row of each column
sheet.cell(row=1, column=1).value = 'country'
sheet.cell(row=1, column=2).value = 'product'
sheet.cell(row=1, column=3).value = 'value'

# Loop to set the value of each cell
for i in range(0, len(Country)):
    # Country is a list of country names. If you have 5 values for one
    # country, I would advise just repeating the country name five times.
    sheet.cell(row=i + 2, column=1).value = Country[i]
    sheet.cell(row=i + 2, column=2).value = Product[i]  # list of products
    sheet.cell(row=i + 2, column=3).value = Values[i]   # list of values

# Finally, save the file and give it a name
wb.save('NameFile.xlsx')
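If you would rather stay in pandas, here is a minimal sketch (reusing sum_products_by_country and world_level from the question, which are assumed to exist) that collects each country's top 5 into one DataFrame and writes it in a single call, so the header row appears only once:

import pandas as pd

frames = []
for country in ["US", "IT", "ES", "NL"]:
    A = sum_products_by_country(world_level, country)  # from the question
    top5 = (pd.DataFrame({'value': A})
              .sort_values(by='value', ascending=False)
              .head(5)
              .reset_index())          # turn the Product index into a column
    top5.insert(0, 'country', country) # label each row with its country
    frames.append(top5)

result = pd.concat(frames, ignore_index=True)

writer = pd.ExcelWriter('top products.xlsx', engine='xlsxwriter')
result.to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()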
Related
I have a dataframe detailing information on store location and revenue. I would like to iterate over this information, breaking it down by location and machine number, and then export it to Excel. My current dataframe looks like this:
Location Machine Number Net Funds Net Revenue
0 Location 1 123456 123 76
1 Location 1 325462 869 522
2 Location 1 569183 896 234
3 Location 2 129756 535 542
4 Location 2 234515 986 516
5 Location 2 097019 236 512
6 Location 3 129865 976 251
Ideally, the output would look something like this:
Machine Number Net Funds Net Revenue
Location 1
123456 123 76
325462 869 522
269183 896 234
Machine Number Net Funds Net Revenue
Location 2
129756 535 542
234515 986 516
097019 236 512
Machine Number Net Funds Net Revenue
Location 3
129865 976 251
While I have been able to iterate this data into the format that I like using
for name, group in grouped:
    print(name)
    print(group)
I cannot work out how to write it to Excel with xlsxwriter.
Any guidance would be appreciated.
For this, you can use to_csv to create a CSV string, then adjust the column headers. You can open the resulting CSV file in Excel.
Try this code:
import pandas as pd

cols = ['Location', 'Machine Number', 'Net Funds', 'Net Revenue']
lst = [
    ['Location 1', '123456', 123, 76],
    ['Location 1', '325462', 869, 522],
    ['Location 1', '569183', 896, 234],
    ['Location 2', '129756', 535, 542],
    ['Location 2', '234515', 986, 516],
    ['Location 2', '097019', 236, 512],
    ['Location 3', '129865', 976, 251]]
df = pd.DataFrame(lst, columns=cols)
loclst = df['Location'].unique().tolist()
cc = ""
for loc in loclst:
    dfl = df[df['Location'] == loc][cols[1:]]
    cc += ','.join(cols[1:]) + '\n' + loc + ',,\n' + dfl.to_csv(index=False, header=False)
print(cc)
with open('out.csv', 'w') as f:
    f.write(cc.replace('\r\n', '\n'))
Output (out.csv)
Machine Number,Net Funds,Net Revenue
Location 1,,
123456,123,76
325462,869,522
569183,896,234
Machine Number,Net Funds,Net Revenue
Location 2,,
129756,535,542
234515,986,516
097019,236,512
Machine Number,Net Funds,Net Revenue
Location 3,,
129865,976,251
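If you want a real .xlsx file rather than a CSV, a minimal sketch (assuming the same df as above) is to write each group with to_excel at an increasing startrow, putting a caption cell above each block:

import pandas as pd

writer = pd.ExcelWriter('out.xlsx', engine='xlsxwriter')
row = 0
for name, group in df.groupby('Location'):
    body = group.drop(columns='Location')
    # header + data for this location, one row below the caption
    body.to_excel(writer, sheet_name='Sheet1', startrow=row + 1, index=False)
    # caption cell, written after to_excel so the sheet already exists
    writer.sheets['Sheet1'].write(row, 0, name)
    row += len(body) + 3  # caption + header + data + one blank row
writer.save()

This gives one sheet with a labelled block per location, close to the layout shown in the question.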
I am adding a column "state" to an existing dataframe that does not share a common column with my other dataframe. Therefore, I need to convert zipcodes into states (for example, 00704 would be PR) to load into the dataframe that has the new state column.
reviewers = pd.read_csv('reviewers.txt',
                        sep='|',
                        header=None,
                        names=['user id', 'age', 'gender', 'occupation', 'zipcode'])
reviewers['state'] = ""
user id age gender occupation zipcode state
0 1 24 M technician 85711
1 2 53 F other 94043
zipcodes = pd.read_csv('zipcodes.txt',
                       usecols=[1, 4],
                       converters={'Zipcode': str})
Zipcode State
0 00704 PR
1 00704 PR
2 00704 PR
3 00704 PR
4 00704 PR
zipcodes1 = zipcodes.set_index('Zipcode') ###Setting the index to zipcode
dfzip = zipcodes1
print(dfzip)
State
Zipcode
00704 PR
00704 PR
00704 PR
zips = pd.Series(dfzip.values.tolist(), index=zipcodes1['State'].index)
states = []
for zipcode in reviewers['Zipcode']:
    if re.search('[a-zA-Z]+', zipcode):
        append.states['canada']
    elif zipcode in zips.index:
        append.states(zips['zipcode'])
    else:
        append.states('unkown')
I am not sure if my loop is correct either. I have to classify the zipcodes as U.S. zipcodes (numeric), Canadian zip codes (alphanumeric), and other zip codes, which we define as unknown. Let me know if you need the data file.
Use:
import numpy as np

# remove duplicates and create a Series for mapping
zips = zipcodes.drop_duplicates().set_index('Zipcode')['State']
# mask for Canadian zip codes
# (if lowercase letters are possible, change the pattern to [a-zA-Z]+)
mask = reviewers['zipcode'].str.match('[A-Z]+')
# new column by mask
reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
# NaNs are replaced
reviewers['state'] = reviewers['state'].fillna('unknown')
Loop version with apply:
import re

def f(code):
    res = "unknown"
    # if lowercase letters are possible, change the pattern to [a-zA-Z]+
    if re.match('[A-Z]+', code):
        res = 'canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['State1'] = reviewers['zipcode'].apply(f)
print (reviewers.tail(10))
user id age gender occupation zipcode state State1
933 934 61 M engineer 22902 VA VA
934 935 42 M doctor 66221 KS KS
935 936 24 M other 32789 FL FL
936 937 48 M educator 98072 WA WA
937 938 38 F technician 55038 MN MN
938 939 26 F student 33319 FL FL
939 940 32 M administrator 02215 MA MA
940 941 20 M student 97229 OR OR
941 942 48 F librarian 78209 TX TX
942 943 22 M student 77841 TX TX
#test if same output
print ((reviewers['State1'] == reviewers['state']).all())
True
Timings:
In [56]: %%timeit
...: mask = reviewers['zipcode'].str.match('[A-Z]+')
...: reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
...: reviewers['state'] = reviewers['state'].fillna('unknown')
...:
100 loops, best of 3: 2.08 ms per loop
In [57]: %%timeit
...: reviewers['State1'] = reviewers['zipcode'].apply(f)
...:
100 loops, best of 3: 17 ms per loop
Your loop needs to be fixed:
states = []
for zipcode in reviewers['zipcode']:
    if re.match('[A-Z]+', zipcode):
        states.append('Canada')
    elif zipcode in zips.index:
        states.append(zips[zipcode])
    else:
        states.append('Unknown')
Note: use append, not extend (extend adds a string one character at a time), and match letters explicitly, since \w+ also matches digits and would tag every numeric zipcode as Canada.
Also, I am assuming you want the states list to be plugged back into the dataframe. In that case you don't need the for loop; you can use pandas apply on the dataframe to get a new column:
def findState(code):
    res = 'Unknown'
    if re.match('[A-Z]+', code):
        res = 'Canada'
    elif code in zips.index:
        res = zips[code]
    return res

reviewers['State'] = reviewers['zipcode'].apply(findState)
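As a quick sanity check, a self-contained sketch with made-up rows (hypothetical sample data, not the question's actual files) shows the mapping and both fallbacks:

import re
import numpy as np
import pandas as pd

# hypothetical stand-ins for reviewers.txt / zipcodes.txt
reviewers = pd.DataFrame({'zipcode': ['00704', 'M5V', '99999']})
zipcodes = pd.DataFrame({'Zipcode': ['00704', '00704'], 'State': ['PR', 'PR']})

zips = zipcodes.drop_duplicates().set_index('Zipcode')['State']
mask = reviewers['zipcode'].str.match('[A-Z]+')
reviewers['state'] = np.where(mask, 'canada', reviewers['zipcode'].map(zips))
reviewers['state'] = reviewers['state'].fillna('unknown')
print(reviewers)
#   zipcode    state
# 0   00704       PR
# 1     M5V   canada
# 2   99999  unknown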
I have a dataframe of samples with a country column. The number of records per country is:
d1.groupby("country").size()
country
Australia 21
Cambodia 58
China 280
India 133
Indonesia 195
Malaysia 138
Myanmar 51
Philippines 49
Singapore 1268
Taiwan 47
Thailand 273
Vietnam 288
How do I select, say, 100 random samples from each country, if that country has > 100 samples? (if the country has <= 100 samples, do nothing). Currently, I do this for, say, Singapore:
names_nonsg_ls = []
names_sg_ls = []
# If the country is not SG, add it to names_nonsg_ls.
# Else, add it to names_sg_ls, which will be subsampled later.
for index, row in d0.iterrows():
    if str(row["country"]) != "Singapore":
        names_nonsg_ls.append(str(row["header"]))
    else:
        names_sg_ls.append(str(row["header"]))
# Select 100 random names from names_sg_ls
names_sg_ls = random.sample(names_sg_ls, 100)
# Form the list of names to retain
names_ls = names_nonsg_ls + names_sg_ls
# Create the new dataframe
d1 = d0.loc[d0["header"].isin(names_ls)]
But manually creating a new list for each country that has > 100 samples is just poor form, not to mention that I first have to manually pick out the countries with > 100 samples.
You can group by country, then sample based on the group size:
d1.groupby("country", group_keys=False).apply(lambda g: g.sample(100) if len(g) > 100 else g)
Example:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
df.groupby('A', group_keys=False).apply(lambda g: g.sample(3) if len(g) > 3 else g)
# A B
#2 a 2
#0 a 0
#1 a 1
#4 b 4
#5 b 5
#6 b 6
#7 c 7
#8 d 8
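The conditional can also be folded into the sample size: taking min(len(g), n) rows is safe for every group, since sampling a group of size len(g) simply returns all of its rows (possibly reshuffled):

n = 100
d1.groupby("country", group_keys=False).apply(lambda g: g.sample(min(len(g), n)))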
My excel data looks like this:
A B C
1 123 534 576
2 456 745 345
3 234 765 285
In another excel spreadsheet, my data may look like this:
B C A
1 123 534 576
2 456 745 345
3 234 765 285
How can I extract column C's contents from both spreadsheets?
My code is as follows:
#Open the workbook
ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
#Store column 3's data inside an array
ips = ow.col_values(2, 1)
I would like something more like: ips = ow.col_values(C, 1)
How can I achieve the above?
Since I have two different spreadsheets, and the data I want sits in a different column in each, I have to search the first row by header name until I find it, then extract that column.
Here's how I did it:
ow = xlrd.open_workbook('export.xlsx').sheet_by_index(0)
for x in range(0, 20):
    try:
        if ow.cell_value(0, x) == "IP Address":
            print("found it!")
            ips = ow.col_values(x, 1)
            break
    except IndexError:
        continue
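With pandas this lookup comes for free, since read_excel keys columns by header name regardless of their position; a minimal sketch (assuming the same 'IP Address' header):

import pandas as pd

# column order doesn't matter: columns are keyed by header name
df = pd.read_excel('export.xlsx', sheet_name=0)
ips = df['IP Address'].tolist()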
I have a csv file that contains trade data for some countries. The data has a format as follows:
rep par commodity value
USA GER 1 700
USA GER 2 100
USA GER 3 400
USA GER 5 100
USA GER 80 900
GER USA 2 300
GER USA 4 500
GER USA 5 700
GER USA 97 450
GER UK 50 300
UK USA 4 1100
UK USA 80 200
UK GER 50 200
UK GER 39 650
I intend to build a dictionary and use it to calculate the total value of commonly traded commodities between countries.
For example, consider trade between USA-GER: I intend to check whether GER-USA is in the data and, if it exists, sum the values of the common commodities, doing the same for all country pairs. The dictionary should look like:
Dic_c1c2_products =
{('USA','GER'): {('1','700'), ('2','100'), ('3','400'), ('5','100'), ('80','900')},
 ('GER','USA'): {('2','300'), ('4','500'), ('5','700'), ('97','450')},
 ('GER','UK'): {('50','300')},
 ('UK','USA'): {('4','1100'), ('80','200')},
 ('UK','GER'): {('50','200'), ('39','650')}}
As you can see, USA-GER and GER-USA have commodities 2 and 5 in common, and the total value of these goods is (100+300)+(100+700) = 1200.
For the pair UK-USA there is no USA-UK trade in the data, so there are no common commodities and total trade is 0. For GER-UK and UK-GER, commodity 50 is in common and total trade is 300+200 = 500.
At the end, I want to have something like:
Dic_c1c2_summation = {('USA','GER'): 1200, ('GER','UK'): 500, ('UK','USA'): 0}
Any help would be appreciated.
In addition to my post, I have written the following lines:
import csv
from collections import defaultdict

rfile = csv.reader(open("filepath", 'r'))
next(rfile)  # skip the header row
dic_c1c2_products = defaultdict(set)
dic_c_products = {}
country = set()
for row in rfile:
    c1 = row[0]
    c2 = row[1]
    p = row[2]
    country.add(c1)
for i in country:
    dic_c_products[i] = set()
rfile = csv.reader(open("filepath"))
next(rfile)
for i in rfile:
    c1 = i[0]
    c2 = i[1]
    p = i[2]
    v = i[3]
    dic_c_products[c1].add((p, v))
    dic_c1c2_products[(c1, c2)].add((p, v))  # defaultdict creates the set
c_list = dic_c_products.keys()
dic_c1c2_productsum = {}
for i in dic_c1c2_products.keys():
    if (i[1], i[0]) in dic_c1c2_products:
        for p1, v1 in dic_c1c2_products[(i[0], i[1])]:
            for p2, v2 in dic_c1c2_products[(i[1], i[0])]:
                if p1 == p2:
                    summation = int(v1) + int(v2)  # values are strings
                    if i not in dic_c1c2_productsum:
                        dic_c1c2_productsum[i] = {(p1, summation)}
                    else:
                        dic_c1c2_productsum[i].add((p1, summation))
    else:
        dic_c1c2_productsum[i] = " "
# save your data in a file called data
import pandas as pd

data = pd.read_csv('data', delim_whitespace=True)
data['par_rep'] = data.apply(lambda x: '_'.join(sorted([x['par'], x['rep']])), axis=1)
result = (data.groupby(['par_rep', 'commodity'])
              .filter(lambda x: len(x) >= 2)
              .groupby('par_rep')['value'].sum()
              .to_dict())
At the end, result is {'GER_UK': 500, 'GER_USA': 1200}.
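If you want the plain-dictionary version the question asked for (including pairs such as ('UK','USA') with no matching reverse trade, which the pandas one-liner drops), here is a minimal sketch under the assumption of a whitespace-delimited file:

from collections import defaultdict

# build {(rep, par): {commodity: value}} from the file
trade = defaultdict(dict)
with open("filepath") as f:
    next(f)  # skip the header row
    for line in f:
        rep, par, commodity, value = line.split()
        trade[(rep, par)][commodity] = int(value)

# for each pair, sum the values of commodities traded in both directions
dic_c1c2_summation = {}
for (c1, c2), products in trade.items():
    if (c2, c1) in dic_c1c2_summation:
        continue  # already counted from the other direction
    reverse = trade.get((c2, c1), {})
    common = set(products) & set(reverse)
    dic_c1c2_summation[(c1, c2)] = sum(products[p] + reverse[p] for p in common)

print(dic_c1c2_summation)
# {('USA','GER'): 1200, ('GER','UK'): 500, ('UK','USA'): 0}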