Python / PuLP - first pick one item of each group

I have a dataframe that looks like this:
City      Country    Time  Points
---------------------------------
London    UK         31    20
Paris     France     42    12
Sydney    Australia  73    44
New York  US         41    18
Lyon      France     22    25
...
The program should pick cities to visit within a given time frame. There is only a time constraint; there is no hard limit on the number of cities.
However:
each city may be visited at most once;
you can visit twice in the same country, but only after you have visited every single country in the dataframe once (the same goes for a 3rd visit to the same country: you have to visit each country twice before).
Currently my code looks like this:
max_time = 500

x = pulp.LpVariable.dicts("x", df.index, 0, 1, pulp.LpInteger)
mod = pulp.LpProblem("travel_prog", pulp.LpMaximize)

# objective: maximise the total points of the selected cities
objval_points = {idx: df['Points'][idx] for idx in df.index}
mod += pulp.lpSum([x[idx] * objval_points[idx] for idx in df.index])

# time budget (PuLP constraints are <=, not strict <)
objval_time = {idx: df['Time'][idx] for idx in df.index}
mod += pulp.lpSum([x[idx] * objval_time[idx] for idx in df.index]) <= max_time

# redundant given the 0..1 bounds on x, but harmless
for idx in df.index:
    mod += x[idx] <= 1
I have created a constraint that lets the program choose only one city per country, but this is not what I want:

for country in df['Country'].unique():
    sub_idx = df[df['Country'] == country].index
    mod += pulp.lpSum([x[idx] for idx in sub_idx]) <= 1

Count the number of times each country is visited and put this in a variable, say CountVisit[country]. Introduce variables MaxVisit and MinVisit, then add the constraints MaxVisit >= CountVisit[country] and MinVisit <= CountVisit[country] for every country, and finally impose MaxVisit - MinVisit <= 1. This keeps the visit counts of all countries within one of each other, which is exactly the ordering rule you describe.
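A minimal sketch of those constraints in PuLP, assuming df, x and mod are defined as in the question (CountVisit needs no extra variables, since it is just a linear sum per country):

# bounds of len(df) are always safe for the visit counters
max_visit = pulp.LpVariable("max_visit", 0, len(df), pulp.LpInteger)
min_visit = pulp.LpVariable("min_visit", 0, len(df), pulp.LpInteger)

for country in df['Country'].unique():
    sub_idx = df[df['Country'] == country].index
    count_visit = pulp.lpSum([x[idx] for idx in sub_idx])
    mod += max_visit >= count_visit
    mod += min_visit <= count_visit

# visit counts across countries may differ by at most one
mod += max_visit - min_visit <= 1

The solver will push min_visit up to the smallest per-country count and max_visit down to the largest, so the last constraint does the real work.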

Related

Extract matrix from a dataframe by value from columns

I am trying something that could be a little hard to understand, but I will try to be very specific.
I have a pandas dataframe like this:
Locality                          Count       Lat.        Long.
Krasnodar                         Russia      44          39
Tirana                            Albania     41.33       19.83
Areni                             Armenia     39.73       45.2
Kars                              Armenia     40.604517   43.100758
Brunn Wolfholz                    Austria     48.120396   16.291722
Kleinhadersdorf Flur Marchleiten  Austria     48.663197   16.589687
Jalilabad district                Azerbaijan  39.3607139  48.4613556
Zeyem Chaj                        Azerbaijan  40.9418889  45.8327778
Jalilabad district                Azerbaijan  39.5186111  48.65
And a file cities.txt with the names of some countries:
Albania
Armenia
Austria
Azerbaijan
And so on.
The next thing I am doing is converting the Lat. and Long. values to radians and then, with the values from the list, doing something like:

with open('cities.txt') as file:
    lines = file.readlines()

x = np.where(df['Count'].eq(lines),
             pd.DataFrame(dist.pairwise(df[['Lat.', 'Long.']].to_numpy()) * 6373,
                          columns=df.Locality.unique(),
                          index=df.Locality.unique()))

where the pd.DataFrame(...) call turns the radian Lat./Long. values into distances in km and creates a dataframe as a matrix for each line (country).
In the end I will have a lot of 2-D matrices (in theory), grouped by country, and I want to apply this:
>>>Russia.min()
0
>>>Russia.max()
5
to get the .min() and .max() values of each matrix and save these results in cities.txt as:

Country     Max. Dist.  Min. Dist.
Albania     5           1
Armenia     10          9
Austria     5           3
Azerbaijan  0           0
Unfortunately, 1) I'm stuck in the first part, where I get the error ValueError: Lengths must be equal; 2) is it possible to have these matrices grouped by country; and 3) how do I save my .min() and .max() values?
I am not sure what exactly you want as the minimum. In this solution, the minimum is 0 if there is only one city, and otherwise the shortest distance between two cities within the country. Also, the file cities.txt seems to be just a filter; I didn't apply it here, but it seems straightforward (see the sketch after the output below).
import numpy as np
import pandas as pd
Here is just some sample data:
cities = pd.read_json("https://raw.githubusercontent.com/lutangar/cities.json/master/cities.json")
cities = cities.sample(10000)
Create and apply a custom aggregate for groupby()
from sklearn.metrics import DistanceMetric

dist = DistanceMetric.get_metric('haversine')
country_groups = cities.groupby("country")

def city_distances(group):
    geo = group[['lat', 'lng']]
    EARTH_RADIUS = 6371  # km
    haversine_distances = dist.pairwise(np.radians(geo))
    haversine_distances *= EARTH_RADIUS
    distances = {}
    distances['max'] = np.max(haversine_distances)
    # the diagonal is all zeros, so take the minimum over the nonzero entries
    distances['min'] = 0
    if len(haversine_distances[np.nonzero(haversine_distances)]) > 0:
        distances['min'] = np.min(haversine_distances[np.nonzero(haversine_distances)])
    return pd.Series(distances)

country_groups.apply(city_distances)
In my case this prints something like
max min
country
AE 323.288482 323.288482
AF 1130.966661 15.435642
AI 12.056890 12.056890
AL 272.300688 3.437074
AM 268.051071 1.328605
... ... ...
YE 662.412344 19.103222
YT 3.723376 3.723376
ZA 1466.334609 24.319334
ZM 1227.429001 218.566369
ZW 503.562608 26.316902
[194 rows x 2 columns]
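Applying the cities.txt filter that the answer leaves out could be a small sketch like this, assuming the file holds one country name per line and that those names match the values in the grouping column (the output file name is just a placeholder):

with open('cities.txt') as fh:
    wanted = {line.strip() for line in fh if line.strip()}

# keep only the listed countries, then aggregate exactly as above
subset = cities[cities['country'].isin(wanted)]
result = subset.groupby('country').apply(city_distances)
result.to_csv('cities_dist.csv')  # placeholder file name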

Getting a value out of a pandas dataframe based on a set of conditions

I have a dataframe as shown below
Token Label StartID EndID
0 Germany Country 0 2
1 Berlin Capital 6 9
2 Frankfurt City 15 18
3 four million Number 21 24
4 Sweden Country 26 27
5 United Kingdom Country 32 34
6 ten million Number 40 45
7 London Capital 50 55
I am trying to get a row based on a certain condition, i.e. associate the label Number with the closest Capital, e.g. Berlin:

3 four million Number 21 24 -> 1 Berlin Capital 6 9

or something like:

df[row3] -> df[row1]
A pseudo-logic:
First check for the rows with the label Number; the assumption is that the capital is always '2 rows' above or below and has the label Capital. Also, a Capital row is always located after a Country row.
What I have done until now,
columnsName = ['Token', 'Label', 'StartID', 'EndID']
df = pd.read_csv('resources/testcsv.csv', index_col=0, skip_blank_lines=True, header=0)
print(df)

key_number = 'Number'
df_with_number = df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_number), regex=True, case=False)]
print(df_with_number)

key_capital = 'Capital'
df_with_capitals = df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_capital), regex=True, case=False)]
print(df_with_capitals)

key_country = 'Country'
df_with_country = df[df['Label'].str.lower().str.contains(r"\b{}\b".format(key_country), regex=True, case=False)]
print(df_with_country)
The logic is to compare the indexes and then make the possible relations, i.e.:

df[row3] -> [df[row1], df[row7]]
You could use merge_asof with the parameter direction='nearest', for example:

df_nb_cap = pd.merge_asof(df_with_number.reset_index(),
                          df_with_capitals.reset_index(),
                          on='index',
                          suffixes=('_nb', '_cap'),
                          direction='nearest')
print(df_nb_cap)
index Token_nb Label_nb StartID_nb EndID_nb Token_cap Label_cap \
0 3 four_million Number 21 24 Berlin Capital
1 6 ten_million Number 40 45 London Capital
StartID_cap EndID_cap
0 6 9
1 50 55
from io import StringIO

# adjusted sample data
s = """Token,Label,StartID,EndID
Germany,Country,0,2
Berlin,Capital,6,9
Frankfurt,City,15,18
four million,Number,21,24
Sweden,Country,26,27
United Kingdom,Country,32,34
ten million,Number,40,45
London,Capital,50,55
ten million,Number,40,45
ten million,Number,40,45"""
df = pd.read_csv(StringIO(s))

# create a mask for Number where Capital is 2 above or below
# and where Country is three above Number or one below Number
mask = (df['Label'] == 'Number') & (((df['Label'].shift(2) == 'Capital') |
                                     (df['Label'].shift(-2) == 'Capital')) &
                                    (df['Label'].shift(3) == 'Country') |
                                    (df['Label'].shift(-1) == 'Country'))

# create a mask for Capital where Number is 2 above or below
# and where Country is one above Capital
mask2 = (df['Label'] == 'Capital') & (((df['Label'].shift(2) == 'Number') |
                                       (df['Label'].shift(-2) == 'Number')) &
                                      (df['Label'].shift(1) == 'Country'))

# hstack your two masks and create a frame
new_df = pd.DataFrame(np.hstack([df[mask].to_numpy(), df[mask2].to_numpy()]))
print(new_df)
0 1 2 3 4 5 6 7
0 four million Number 21 24 Berlin Capital 6 9
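If the fixed-offset assumption feels too brittle, a hedged alternative is to pair each Number row with the Capital row whose index is nearest. A minimal sketch, assuming df is the frame from the question:

numbers = df[df['Label'] == 'Number']
capitals = df[df['Label'] == 'Capital']

# for each Number row, pick the Capital row with the smallest index distance
pairs = {i: (capitals.index.to_series() - i).abs().idxmin() for i in numbers.index}
for num_idx, cap_idx in pairs.items():
    print(df.loc[num_idx, 'Token'], '->', df.loc[cap_idx, 'Token'])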

Python pandas random sample by row

I have a dataframe of samples, with a country column. The relative number of records in each country are:
d1.groupby("country").size()
country
Australia 21
Cambodia 58
China 280
India 133
Indonesia 195
Malaysia 138
Myanmar 51
Philippines 49
Singapore 1268
Taiwan 47
Thailand 273
Vietnam 288
How do I select, say, 100 random samples from each country, if that country has > 100 samples? (if the country has <= 100 samples, do nothing). Currently, I do this for, say, Singapore:
names_nonsg_ls = []
names_sg_ls = []

# if the country is not SG, add it to names_nonsg_ls.
# else, add it to names_sg_ls, which will be subsampled later.
for index, row in d0.iterrows():
    if str(row["country"]) != "Singapore":
        names_nonsg_ls.append(str(row["header"]))
    else:
        names_sg_ls.append(str(row["header"]))

# Select 100 random names from names_sg_ls
names_sg_ls = random.sample(names_sg_ls, 100)

# Form the list of names to retain
names_ls = names_nonsg_ls + names_sg_ls

# create new dataframe
d1 = d0.loc[d0["header"].isin(names_ls)]
But manually creating a new list for each country that has > 100 names is just poor form, not to mention that I first have to manually pick out the countries with > 100 names.
You can group by country, then sample based on the group size:
d1.groupby("country", group_keys=False).apply(lambda g: g.sample(100) if len(g) > 100 else g)
Example:
df = pd.DataFrame({
    'A': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'd'],
    'B': list(range(9))
})
df.groupby('A', group_keys=False).apply(lambda g: g.sample(3) if len(g) > 3 else g)
# A B
#2 a 2
#0 a 0
#1 a 1
#4 b 4
#5 b 5
#6 b 6
#7 c 7
#8 d 8
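Equivalently, a sketch without apply that concatenates a capped sample per group of d1 (random_state is only there to make the sketch reproducible):

capped = pd.concat(
    g.sample(min(len(g), 100), random_state=0)
    for _, g in d1.groupby("country")
)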

Intersections between values in two dictionaries in python

I have a csv file that contains trade data for some countries. The data has a format as follows:
rep par commodity value
USA GER 1 700
USA GER 2 100
USA GER 3 400
USA GER 5 100
USA GER 80 900
GER USA 2 300
GER USA 4 500
GER USA 5 700
GER USA 97 450
GER UK 50 300
UK USA 4 1100
UK USA 80 200
UK GER 50 200
UK GER 39 650
I intend to make a new dictionary and, using it, calculate the total value of the commodities traded in common between countries.
For example, consider the trade between USA-GER: I intend to check whether GER-USA is in the data and, if it exists, sum the values for the common commodities, and do the same for all country pairs. The dictionary should look like:
Dic_c1c2_products = {
    ('USA','GER'): {('1','700'), ('2','100'), ('3','400'), ('5','100'), ('80','900')},
    ('GER','USA'): {('2','300'), ('4','500'), ('5','700'), ('97','450')},
    ('GER','UK'):  {('50','300')},
    ('UK','USA'):  {('4','1100'), ('80','200')},
    ('UK','GER'):  {('50','200'), ('39','650')}
}
As you can see, USA-GER and GER-USA have commodities 2 and 5 in common, and the total value of these goods is (100+300)+(100+700) = 1200.
The pairs USA-UK and UK-USA have no commodities in common, so the total trade will be 0. For GER-UK and UK-GER, commodity 50 is in common and the total trade is 300+200 = 500.
At the end, I want to have something like:
Dic_c1c2_summation = {('USA','GER'): 1200, ('GER','UK'): 500, ('UK','USA'): 0}
Any help would be appreciated.
In addition to my post, I have written the following lines:
import csv
from collections import defaultdict

rfile = csv.reader(open("filepath", 'r'))
next(rfile)  # skip the header row

dic_c1c2_products = defaultdict(set)
dic_c_products = {}
country = set()

for row in rfile:
    c1 = row[0]
    c2 = row[1]
    p = row[2]
    country.add(c1)

for i in country:
    dic_c_products[i] = set()

rfile = csv.reader(open("filepath"))
next(rfile)
for i in rfile:
    c1 = i[0]
    c2 = i[1]
    p = i[2]
    v = i[3]
    dic_c_products[c1].add((p, v))
    # defaultdict(set) creates the entry on first access
    dic_c1c2_products[(c1, c2)].add((p, v))

c_list = dic_c_products.keys()
dic_c1c2_productsum = {}
for i in dic_c1c2_products.keys():
    if (i[1], i[0]) in dic_c1c2_products:
        for p1, v1 in dic_c1c2_products[(i[0], i[1])]:
            for p2, v2 in dic_c1c2_products[(i[1], i[0])]:
                if p1 == p2:
                    summation = int(v1) + int(v2)
                    if i not in dic_c1c2_productsum:
                        dic_c1c2_productsum[(i[0], i[1])] = {(p1, summation)}
                    else:
                        dic_c1c2_productsum[(i[0], i[1])].add((p1, summation))
    else:
        dic_c1c2_productsum[i] = set()
# save your data in a file called data
import pandas as pd

data = pd.read_csv('data', delim_whitespace=True)
data['par_rep'] = data.apply(lambda x: '_'.join(sorted([x['par'], x['rep']])), axis=1)
result = (data.groupby(['par_rep', 'commodity'])
              .filter(lambda x: len(x) >= 2)
              .groupby('par_rep')['value']
              .sum()
              .to_dict())

At the end, result is {'GER_UK': 500, 'GER_USA': 1200}.
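If you want to stay with the dictionary approach from the question (including a 0 for pairs with no overlap), a minimal sketch over Dic_c1c2_products as defined above could be:

def common_trade(dic):
    # sum the values of commodities that appear in both trade directions
    totals = {}
    for (c1, c2), items in dic.items():
        if (c2, c1) in totals:  # this unordered pair was already handled
            continue
        rev = dic.get((c2, c1), set())
        common = {p for p, _ in items} & {p for p, _ in rev}
        totals[(c1, c2)] = sum(int(v) for p, v in (items | rev) if p in common)
    return totals

# common_trade(Dic_c1c2_products)
# -> {('USA','GER'): 1200, ('GER','UK'): 500, ('UK','USA'): 0}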

How to sum columns in a csv file in Python more efficiently

This is my data:
Year Country Albania Andorra Armenia Austria Azerbaijan
2009 Lithuania 0 0 0 0 1
2009 Israel 0 7 0 0 0
2008 Israel 1 2 2 0 4
2008 Lithuania 1 5 1 0 8
Actually, it is a csv file with , as the delimiter, so the raw data is:
Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
I am a beginner in Python and don't really know many Python tricks. What I do know is that I probably complicate things too much in my code.
I want to get this:

final_dict = {'Albania': [1, 1], 'Andorra': [5, 9], 'Armenia': [1, 2], 'Austria': [0, 0], 'Azerbaijan': [9, 4]}

where, for every country in the first row (Albania, Andorra, Armenia, Austria and Azerbaijan), the first element of the list is the column sum for Lithuania and the second element is the column sum for Israel, taken from the Country column. For example:

'Andorra': [5, 9]
# 5 is the sum for Lithuania in the Andorra column
# 9 is the sum for Israel in the Andorra column
You can use the pandas module, which is perfect for this type of application:
import pandas as pd
df = pd.read_csv('songfestival.csv')
gb = df.groupby('Country')
res = pd.concat([i[1].sum(numeric_only=True) for i in gb], axis=1).T
res.pop('Year')
order = [i[0] for i in gb]
print(order)
print(res)
#['Israel', 'Lithuania']
# Albania Andorra Armenia Austria Azerbaijan
#0 1 9 2 0 4
#1 1 5 1 0 9
To query the result for each column, you can simply do:
print(res.Albania)
print(res.Andorra)
...
OK, so you want the lines aggregated by country:
import csv
from collections import defaultdict

with open("songfestival.csv", "r") as ifile:
    reader = csv.DictReader(ifile)
    country_columns = [k for k in reader.fieldnames if k not in ["Year", "Country"]]
    data = defaultdict(lambda: defaultdict(int))
    for line in reader:
        curr_country = data[line["Country"]]
        for country_column in country_columns:
            curr_country[country_column] += int(line[country_column])

with open("songfestival_aggr.csv", "w") as ofile:
    writer = csv.DictWriter(ofile, fieldnames=country_columns + ["Country"])
    writer.writeheader()
    for k, v in data.items():
        row = dict(v)
        row["Country"] = k
        writer.writerow(row)
I took the liberty of outputting it to another csv file. Your data structure is very error-prone, since it depends on the order of the columns. It is better to use an intermediate dict in a dict to assign names to the aggregations; see @gboffi's comment on your question.
The trick to use here is the defaultdict from the collections module; please search for
python defaultdict
on SO, you'll find lots of useful examples. And here is my answer:
import csv
from collections import defaultdict

# slurp the data
data = list(csv.reader(open('points.csv')))

# massage the data: convert numeric strings to ints
for i, row in enumerate(data[1:], 1):
    data[i] = [int(elt) if elt.isdigit() else elt for elt in row]

points = {}  # an empty dictionary
for i, country in enumerate(data[0][2:], 2):
    # for each country, a couple country:defaultdict is put in points
    points[country] = defaultdict(int)
    for row in data[1:]:
        opponent = row[1]
        points[country][opponent] += row[i]

# here you can post-process points as you like;
# I'll simply print out the stuff
for country in points:
    for opponent in points[country]:
        print(country, "vs", opponent, "scored",
              points[country][opponent], "points.")
The example output for your data is:
Andorra vs Israel scored 9 points.
Andorra vs Lithuania scored 5 points.
Austria vs Israel scored 0 points.
Austria vs Lithuania scored 0 points.
Albania vs Israel scored 1 points.
Albania vs Lithuania scored 1 points.
Azerbaijan vs Israel scored 4 points.
Azerbaijan vs Lithuania scored 9 points.
Armenia vs Israel scored 2 points.
Armenia vs Lithuania scored 1 points.
Edit
If you're against defaultdict, you can use the .get method of an ordinary dict, which gives you back an optional default value if the key has not been initialised:

points[country] = {}  # a standard empty dict
for row in data[1:]:
    opponent = row[1]
    points[country][opponent] = points[country].get(opponent, 0) + row[i]

As you see, it's a bit clumsier, but still manageable.
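For completeness, the same aggregation can also be written with collections.Counter, which likewise defaults missing keys to 0; a small sketch:

from collections import Counter

points[country] = Counter()
for row in data[1:]:
    opponent = row[1]
    points[country][opponent] += row[i]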
