I have a csv file that contains trade data for some countries. The data has a format as follows:
rep par commodity value
USA GER 1 700
USA GER 2 100
USA GER 3 400
USA GER 5 100
USA GER 80 900
GER USA 2 300
GER USA 4 500
GER USA 5 700
GER USA 97 450
GER UK 50 300
UK USA 4 1100
UK USA 80 200
UK GER 50 200
UK GER 39 650
I intend to make a new dictionary and by using the created dictionary, calculate the total value of common traded commodities between countries.
For example, consider trade between USA-GER, I intend to check whether GER-USA is in the data and if it exists, values for the common commodities be summed and do the same for all countries. The dictionary should be like:
Dic_c1c2_producs=
{('USA','GER'): ('1','700'),('2','100'),('3','400'),('5','100'),('80','900');
('GER','USA'):('2','300'),('4','500'),('5','700'),('97','450') ;
('GER','UK'):('50','300');
('UK','USA'): ('4','80'),('80','200');
('UK','GER'): ('50','200'),('39','650')}
As you can see, USA-GER and GER-USA have commodities 2 and 5 in common and the value of these goods are (100+300)+(100+700).
For the pairs USA-UK and UK-USA, we have common commodities: 0 so total trade will be 0 as well. For GER-UK and UK-GER, commodity 50 is in common and total trade is 300+200.
At the end, I want to have something like:
Dic_c1c2_summation={('USA','GER'):1200;('GER','UK'):500; ('UK','USA'):0}
Any help would be appreciated.
In addition to my post, I have written following lines:
from collections import defaultdict
rfile = csv.reader(open("filepath",'r'))
rfile.next()
dic_c1c2_products = defaultdict(set)
dic_c_products = {}
country = set()
for row in rfile :
c1 = row[0]
c2 = row[1]
p = row[2]
country.add(c1)
for i in country :
dic_c_products[i] = set()
rfile = csv.reader(open("filepath"))
rfile.next()
for i in rfile:
c1 = i[0]
c2 = i[1]
p = i[2]
v=i[3]
dic_c_products[c1].add((p,v))
if not dic_c1c2_products.has_key((c1,c2)) :
dic_c1c2_products[(c1,c2)] = set()
dic_c1c2_products[(c1,c2)].add((p,v))
else:
dic_c1c2_products[(c1,c2)].add((p,v))
c_list = dic_c_products.keys()
dic_c1c2_productsummation = set()
for i in dic_c1c2_products.keys():
if dic_c1c2_products.has_key((i[1],i[0])):
for p1, v1 in dic_c1c2_products[(i[0],i[1])]:
for p2, v2 in dic_c1c2_products[(i[1],i[0])]:
if p1==p2:
summation=v1+v2
if i not in dic_c1c2_productsum.keys():
dic_c1c2_productsum[(i[0],i[1])]=(p1, summation)
else:
dic_c1c2_productsum[(i[0],i[1])].add((p1, summation))
else:
dic_c1c2_productsn[i] = " "
# save your data in a file called data
import pandas as pd
data = pd.read_csv('data', delim_whitespace=True)
data['par_rep'] = data.apply(lambda x: '_'.join(sorted([x['par'], x['rep']])), axis=1)
result = data.groupby(('par_rep', 'commodity')).filter(lambda x: len(x) >= 2).groupby(('par_rep'))['value'].sum().to_dict()
at the end result is {'GER_UK': 500, 'GER_USA': 1200}
Related
I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
labels='name', node_color=dim('index').str()))
That makes this, which looks nice.
[![enter image description here][1]][1]
The sample data is sourced from here.
https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas dataframe and 'nodes' is a holoviews dataset, and the type is like this.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this...how can I feed a dataframe into a Chord Diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function. Let me explain why I think so?
The Chord-function expects at least on dataset (this can be a pandas DataFrame) with three columns, but all elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. This can take strings in the second columns to add labels for example.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your date in the basic form, if you replace the strings in your DataFrame df by numbers like this:
_df = df.copy()
values = list(_df.Measure.unique())+list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
return d[s]
_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20',
edge_color=dim('Measure').str(),
labels='Country',
node_color=dim('index').str()))
As you can see, all the conection lines only have one of two colors. This is because in the Measure column are only two elements. Therefor I think, this is not what you want.
Modificated Example
Let's Modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
_list = list(df[df['Measure']==value].Country.unique())
node = pd.concat([node, pd.DataFrame({'Name':_list, 'Group':i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again and can call the Chord-function again.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
return d[s]
new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)
hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
labels='Name', node_color=dim('index').str()
)
)
The are now two groups added to the HoverTool.
I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
["SEX", "Male"],
["SEX", "Female"],
["EXACT_AGE", None],
["Country", "Afghanistan"],
["Country", "Albania"]],
columns=['FullName', 'ResponseLabel'
])
Now what I need to do is to add sort order to this dataframe. Each new "FullName" would increment it by 100 and each consecutive "ResponseLabel" for a given "FullName" would increment it by 1 (for this specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The result that I get on my "SortOrder" column is correct but I wonder if there is some better approach pandas-wise?
Thank you!
The best way to do this would be to use ngroup and cumcount
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
name_group.ngroup(ascending=False).add(1).mul(100) +
name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302
I am trying to combine OR | with df.loc to extract data. The code I have written extracts everything in the csv file. Here is the original csv file: https://drive.google.com/open?id=16eo29mF0pn_qNw-BGpZyVM9PBxv2aN1G
import pandas as pd
df = pd.read_csv("yelp_business.csv")
df = df.loc[(df['categories'].str.contains('chinese', case = False)) | (df['name'].str.contains('subway', case = False)) | (df['categories'].str.contains('', case = False)) | (df['address'].str.contains('', case = False))]
print df
It looks like the blank quotes '' are not working in str.contains or the OR | doesn't work in df.loc. Instead of just returning rows with chinese restaurants (which are 4171 in number) and the row with the restaurant name subway, it returns all the 174,568 rows.
EDITED
The output I want should be all the rows of category chinese and all the rows of name subway while taking into consideration that the address might not have any assigned value or is null.
import pandas as pd
df = pd.read_csv("yelp_business.csv")
cusine = 'chinese'
name = 'subway'
address #address has no assigned value or is NULL
df = df.loc[(df['categories'].str.contains(cusine, case = False)) |
(df['name'].str.contains(name, case = False)) |
(df['address'].str.contains(address, case = False))]
print df
This code gives me an error NameError: name 'address' is not defined.
I think here is possible chain conditions by | for categories column, for find empty string use ^""$ - it match start and end of string with quotes:
df = pd.read_csv("yelp_business.csv")
df1 = df.loc[(df['categories'].str.contains('chinese|^""$', case = False)) |
(df['name'].str.contains('subway', case = False)) |
(df['address'].str.contains('^""$', case = False))]
print (len(df1))
11320
print (df1.head())
business_id name neighborhood \
9 TGWhGNusxyMaA4kQVBNeew "Detailing Gone Mobile" NaN
53 4srfPk1s8nlm1YusyDUbjg ***"Subway" Southeast
57 spDZkD6cp0JUUm6ghIWHzA "Kitchen M" Unionville
63 r6Jw8oRCeumxu7Y1WRxT7A "D&D Cleaning" NaN
88 YhV93k9uiMdr3FlV4FHjwA "Caviness Studio" NaN
address city state postal_code latitude \
9 ***"" Henderson NV 89014 36.055825
53 "6889 S Eastern Ave, Ste 101" Las Vegas NV 89119 36.064652
57 "8515 McCowan Road" Markham ON L3P 5E5 43.867918
63 ***"" Urbana IL 61802 40.110588
88 ***"" Phoenix AZ 85001 33.449967
longitude stars review_count is_open \
9 -115.046350 5.0 7 1
53 -115.118954 2.5 6 1
57 -79.283687 3.0 80 1
63 -88.207270 5.0 4 0
88 -112.070223 5.0 4 1
categories
9 Automotive;Auto Detailing
53 Fast Food;Restaurants;Sandwiches
57 ***Restaurants;Chinese
63 Home Cleaning;Home Services;Window Washing
88 Marketing;Men's Clothing;Restaurants;Graphic D...
EDIT: If need filter out empty and NaNs values:
df2 = df.loc[(df['categories'].str.contains('chinese', case = False)) |
(df['name'].str.contains('subway', case = False)) &
~((df['address'] == '""') | (df['categories'] == '""'))]
print (df2.head())
business_id name neighborhood \
53 4srfPk1s8nlm1YusyDUbjg "Subway" Southeast
57 spDZkD6cp0JUUm6ghIWHzA "Kitchen M" Unionville
96 dTWfATVrBfKj7Vdn0qWVWg "Flavor Cuisine" Scarborough
126 WUiDaFQRZ8wKYGLvmjFjAw "China Buffet" University City
145 vzx1WdVivFsaN4QYrez2rw "Subway" NaN
address city state postal_code \
53 "6889 S Eastern Ave, Ste 101" Las Vegas NV 89119
57 "8515 McCowan Road" Markham ON L3P 5E5
96 "8 Glen Watford Drive" Toronto ON M1S 2C1
126 "8630 University Executive Park Dr" Charlotte NC 28262
145 "5111 Boulder Hwy" Las Vegas NV 89122
latitude longitude stars review_count is_open \
53 36.064652 -115.118954 2.5 6 1
57 43.867918 -79.283687 3.0 80 1
96 43.787061 -79.276166 3.0 6 1
126 35.306173 -80.752672 3.5 76 1
145 36.112895 -115.062353 3.0 3 1
categories
53 Fast Food;Restaurants;Sandwiches
57 Restaurants;Chinese
96 Restaurants;Chinese;Food Court
126 Buffets;Restaurants;Sushi Bars;Chinese
145 Sandwiches;Restaurants;Fast Food
Find detail information about contains at
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html
I have a dataframe of samples, with a country column. The relative number of records in each country are:
d1.groupby("country").size()
country
Australia 21
Cambodia 58
China 280
India 133
Indonesia 195
Malaysia 138
Myanmar 51
Philippines 49
Singapore 1268
Taiwan 47
Thailand 273
Vietnam 288
How do I select, say, 100 random samples from each country, if that country has > 100 samples? (if the country has <= 100 samples, do nothing). Currently, I do this for, say, Singapore:
names_nonsg_ls = []
names_sg_ls = []
# if the country is not SG, add it to names_nonsg_ls.
# else, add it to names_sg_ls, which will be subsampled later.
for index, row in d0.iterrows():
if str(row["country"]) != "Singapore":
names_nonsg_ls.append(str(row["header"]))
else:
names_sg_ls.append(str(row["header"]))
# Select 100 random names from names_sg_ls
names_sg_ls = random.sample(names_sg_ls, 100)
# Form the list of names to retain
names_ls = names_nonsg_ls + names_sg_ls
# create new dataframe
d1 = d0.loc[d0["header"].isin(names_ls)]
But manually a new list for each country that has >100 names is just poor form, not to mention that I first have to manually pick out the countries with > 100 names.
You can group by country, then sample based on the group size:
d1.groupby("country", group_keys=False).apply(lambda g: g.sample(100) if len(g) > 100 else g)
Example:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
df.groupby('A', group_keys=False).apply(lambda g: g.sample(3) if len(g) > 3 else g)
# A B
#2 a 2
#0 a 0
#1 a 1
#4 b 4
#5 b 5
#6 b 6
#7 c 7
#8 d 8
This is my data:
Year Country Albania Andorra Armenia Austria Azerbaijan
2009 Lithuania 0 0 0 0 1
2009 Israel 0 7 0 0 0
2008 Israel 1 2 2 0 4
2008 Lithuania 1 5 1 0 8
Actually, it is csv file and delimiter is , so raw data is:
Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
where the first element of the list means sum by column for Lithuania and the second element means sum by column for Israel (for Albania column)?
I am a beginner in python and don't really know many python tricks. What I do know is that I probably complicate too much in my code.
And I want to get this:
final_dict = {Albania: [1, 1], Andorra: [5, 9], Armenia: [1, 2], Austria: [0, 0], Azerbaijan: [9, 4]}
Explanation of output: for every country in first row (Albania, Andorra, Armenia, Austria and Azerbaijan) I want to get sum by countries from Country column.
Andorra: [5,9]
# 5 is sum for Lithuania in Andorra column
# 9 is sum for Israel in Andorra column
You can use the Pandas module which is perfect for this type of application:
import pandas as pd
df = pd.read_csv('songfestival.csv')
gb = df.groupby('Country')
res = pd.concat([i[1].sum(numeric_only=True) for i in gb], axis=1).T
res.pop('Year')
order = [i[0] for i in gb]
print(order)
print(res)
#['Israel', 'Lithuania']
# Albania Andorra Armenia Austria Azerbaijan
#0 1 9 2 0 4
#1 1 5 1 0 9
to query the result for each column you can simply do:
print(res.Albania)
print(res.Andorra)
...
Ok, so you want the the lines aggregated by year:
import csv
from collections import defaultdict
with open("songfestival.csv", "r") as ifile:
reader = csv.DictReader(ifile)
country_columns = [k for k in reader.fieldnames if k not in ["Year","Country"]]
data = defaultdict(lambda:defaultdict(int))
for line in reader:
curr_country = data[line["Country"]]
for country_column in country_columns:
curr_country[country_column] += int(line[country_column])
with open("songfestival_aggr.csv", "w") as ofile:
writer = csv.DictWriter(ofile, fieldnames=country_columns+["Country"])
writer.writeheader()
for k, v in data.items():
row = dict(v)
row["Country"] = k
writer.writerow(row)
I toke the liberty to output it in another csv file. Your data struct is very error prone, since it depends on the order of the columns. Better to use an intermediate dict in a dict to assign names to the aggregations -> see #gboffi's comment on your question.
Your hattrick is using the defaultdict from the collections module, please search for
python defaultdict
on SO, you'll find lot of useful examples, and here is my answer
import csv
from collections import defaultdict
# slurp the data
data = list(csv.reader(open('points.csv')))
# massage the data
for i, row in enumerate(data[1:],1):
data[i] = [int(elt) if elt.isdigit() else elt for elt in row]
points = {} # an empty dictionary
for i, country in enumerate(data[0][2:],2):
# for each country, a couple country:defaultdict is put in points
points[country] = defaultdict(int)
for row in data[1:]:
opponent = row[1]
points[country][opponent] += row[i]
# here you can post-process points as you like,
# I'll simply print out the stuff
for country in points:
for opponent in points[country]:
print country, "vs", opponent, "scored",
print points[country][opponent], "points."
The example output for your data has been
Andorra vs Israel scored 9 points.
Andorra vs Lithuania scored 5 points.
Austria vs Israel scored 0 points.
Austria vs Lithuania scored 0 points.
Albania vs Israel scored 1 points.
Albania vs Lithuania scored 1 points.
Azerbaijan vs Israel scored 4 points.
Azerbaijan vs Lithuania scored 9 points.
Armenia vs Israel scored 2 points.
Armenia vs Lithuania scored 1 points.
Edit
If you're against defaultdict, you can use the .get method of an ordinary dict, that allows you to have back an optional default value if the key:value pair was not initialised
points[country] = {} # a standard empty dict
for row in data[1:]:
opponent = row[1]
points[country][opponent] = points[country].get(opponent,0) + row[i]
as you see, it's a bit clumsier, but still manageable.