How to sum columns in a csv file in Python more efficiently

This is my data:
Year Country Albania Andorra Armenia Austria Azerbaijan
2009 Lithuania 0 0 0 0 1
2009 Israel 0 7 0 0 0
2008 Israel 1 2 2 0 4
2008 Lithuania 1 5 1 0 8
Actually, it is a csv file and the delimiter is ',', so the raw data is:
Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
I am a beginner in Python and don't really know many Python tricks. What I do know is that I probably overcomplicate things in my code.
And I want to get this:
final_dict = {'Albania': [1, 1], 'Andorra': [5, 9], 'Armenia': [1, 2], 'Austria': [0, 0], 'Azerbaijan': [9, 4]}
Explanation of output: for every country in the first row (Albania, Andorra, Armenia, Austria and Azerbaijan) I want the sums grouped by the values in the Country column, where the first element of each list is the column sum for Lithuania and the second element is the column sum for Israel. For example:
Andorra: [5, 9]
# 5 is the sum of the Andorra column over the Lithuania rows
# 9 is the sum of the Andorra column over the Israel rows

You can use the pandas module, which is perfect for this type of application:
import pandas as pd
df = pd.read_csv('songfestival.csv')
gb = df.groupby('Country')
res = pd.concat([i[1].sum(numeric_only=True) for i in gb], axis=1).T
res.pop('Year')
order = [i[0] for i in gb]
print(order)
print(res)
#['Israel', 'Lithuania']
# Albania Andorra Armenia Austria Azerbaijan
#0 1 9 2 0 4
#1 1 5 1 0 9
to query the result for each column you can simply do:
print(res.Albania)
print(res.Andorra)
...
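If you want the dict-of-lists shape from your question, one way to finish is the small sketch below, assuming the res and order objects above. Note that groupby sorts the countries alphabetically, so each list here is [Israel, Lithuania]:
final_dict = res.to_dict('list')  # {column: [sum per country, ...]}
print(final_dict)
#{'Albania': [1, 1], 'Andorra': [9, 5], 'Armenia': [2, 1], 'Austria': [0, 0], 'Azerbaijan': [4, 9]}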

OK, so you want the lines aggregated by country:
import csv
from collections import defaultdict

with open("songfestival.csv", "r") as ifile:
    reader = csv.DictReader(ifile)
    country_columns = [k for k in reader.fieldnames if k not in ["Year", "Country"]]
    data = defaultdict(lambda: defaultdict(int))
    for line in reader:
        curr_country = data[line["Country"]]
        for country_column in country_columns:
            curr_country[country_column] += int(line[country_column])

with open("songfestival_aggr.csv", "w", newline="") as ofile:  # newline="" avoids blank lines on Windows
    writer = csv.DictWriter(ofile, fieldnames=country_columns + ["Country"])
    writer.writeheader()
    for k, v in data.items():
        row = dict(v)
        row["Country"] = k
        writer.writerow(row)
I took the liberty of writing the output to another csv file. Your data structure is error-prone, since it depends on the order of the columns. Better to use an intermediate dict of dicts to give names to the aggregations -> see #gboffi's comment on your question.
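If you still want the final_dict from your question rather than a second csv file, here is a minimal sketch built on the data dict above; row_order is a hypothetical helper fixing which country's sum comes first in each list:
row_order = ["Lithuania", "Israel"]  # hypothetical: the order you want inside each list
final_dict = {col: [data[c][col] for c in row_order] for col in country_columns}
print(final_dict)
#{'Albania': [1, 1], 'Andorra': [5, 9], 'Armenia': [1, 2], 'Austria': [0, 0], 'Azerbaijan': [9, 4]}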

The trick you need is the defaultdict from the collections module; please search for
python defaultdict
on SO, you'll find lots of useful examples. Here is my answer:
import csv
from collections import defaultdict

# slurp the data
data = list(csv.reader(open('points.csv')))

# massage the data: turn the digit strings into ints
for i, row in enumerate(data[1:], 1):
    data[i] = [int(elt) if elt.isdigit() else elt for elt in row]

points = {}  # an empty dictionary
for i, country in enumerate(data[0][2:], 2):
    # for each country, a couple country:defaultdict is put in points
    points[country] = defaultdict(int)
    for row in data[1:]:
        opponent = row[1]
        points[country][opponent] += row[i]

# here you can post-process points as you like,
# I'll simply print out the stuff
for country in points:
    for opponent in points[country]:
        print(country, "vs", opponent, "scored",
              points[country][opponent], "points.")
The example output for your data is:
Andorra vs Israel scored 9 points.
Andorra vs Lithuania scored 5 points.
Austria vs Israel scored 0 points.
Austria vs Lithuania scored 0 points.
Albania vs Israel scored 1 points.
Albania vs Lithuania scored 1 points.
Azerbaijan vs Israel scored 4 points.
Azerbaijan vs Lithuania scored 9 points.
Armenia vs Israel scored 2 points.
Armenia vs Lithuania scored 1 points.
Edit
If you're against defaultdict, you can use the .get method of an ordinary dict, which gives you back an optional default value when the key has not been initialised:
points[country] = {}  # a standard empty dict
for row in data[1:]:
    opponent = row[1]
    points[country][opponent] = points[country].get(opponent, 0) + row[i]
as you see, it's a bit clumsier, but still manageable.
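For a side-by-side feel of the two approaches, here is a self-contained toy example using nothing beyond the standard library:
from collections import defaultdict

d = defaultdict(int)  # missing keys spring into existence as 0
d['x'] += 5           # no KeyError

e = {}                # plain dict: supply the default yourself
e['x'] = e.get('x', 0) + 5

print(d['x'], e['x'])  # 5 5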

Related

Extract matrix from a dataframe by value from columns

I am trying something that could be a little hard to understand, but I will try to be very specific.
I have a pandas dataframe like this:
Locality                          Count       Lat.        Long.
Krasnodar                         Russia      44          39
Tirana                            Albania     41.33       19.83
Areni                             Armenia     39.73       45.2
Kars                              Armenia     40.604517   43.100758
Brunn Wolfholz                    Austria     48.120396   16.291722
Kleinhadersdorf Flur Marchleiten  Austria     48.663197   16.589687
Jalilabad district                Azerbaijan  39.3607139  48.4613556
Zeyem Chaj                        Azerbaijan  40.9418889  45.8327778
Jalilabad district                Azerbaijan  39.5186111  48.65
And a file cities.txt with the names of some countries:
Albania
Armenia
Austria
Azerbaijan
And so on.
The next thing I do is convert these Lat. and Long. values to radians and then, with the values from the list, do something like:
with open('cities.txt') as file:
    lines = file.readlines()

x = np.where(df['Count'].eq(lines), pd.DataFrame(
    dist.pairwise(df[['Lat.','Long.']].to_numpy())*6373,
    columns=df.Locality.unique(), index=df.Locality.unique()))
Here pd.DataFrame(dist.pairwise(df[['Lat.','Long.']].to_numpy())*6373, columns=df.Locality.unique(), index=df.Locality.unique()) converts the radian Lat./Long. values into distances in km and creates a matrix-like dataframe per country.
In the end I will have many 2D matrices (in theory), grouped by country, and I want to apply this:
>>>Russia.min()
0
>>>Russia.max()
5
to get the .min() and .max() values of each matrix and save these results in cities.txt as:
Country Max.Dist. Min. Dist.
Albania 5 1
Armenia 10 9
Austria 5 3
Azerbaijan 0 0
Unfortunately: 1) I'm stuck in the first part, where I get ValueError: Lengths must be equal; 2) is it possible to have these matrices grouped by country; and 3) how do I save my .min() and .max() values?
I am not sure what exactly you want as the minimum. In this solution, the minimum is 0 if there is only one city, and otherwise the shortest distance between two cities within the country. Also, cities.txt seems to be just a filter; I didn't apply it here, but it seems straightforward (see the sketch after the output below).
import numpy as np
import pandas as pd
Here is just some sample data:
cities = pd.read_json("https://raw.githubusercontent.com/lutangar/cities.json/master/cities.json")
cities = cities.sample(10000)
Create and apply a custom aggregate for groupby()
from sklearn.metrics import DistanceMetric

dist = DistanceMetric.get_metric('haversine')
country_groups = cities.groupby("country")

def city_distances(group):
    geo = group[['lat', 'lng']]
    EARTH_RADIUS = 6371
    haversine_distances = dist.pairwise(np.radians(geo))
    haversine_distances *= EARTH_RADIUS
    distances = {}
    distances['max'] = np.max(haversine_distances)
    distances['min'] = 0
    if len(haversine_distances[np.nonzero(haversine_distances)]) > 0:
        distances['min'] = np.min(haversine_distances[np.nonzero(haversine_distances)])
    return pd.Series(distances)

country_groups.apply(city_distances)
In my case this prints something like
max min
country
AE 323.288482 323.288482
AF 1130.966661 15.435642
AI 12.056890 12.056890
AL 272.300688 3.437074
AM 268.051071 1.328605
... ... ...
YE 662.412344 19.103222
YT 3.723376 3.723376
ZA 1466.334609 24.319334
ZM 1227.429001 218.566369
ZW 503.562608 26.316902
[194 rows x 2 columns]
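As for the cities.txt filter mentioned above, a hedged sketch follows. It assumes one country name per line and that those names match the values in the country column; for this cities.json sample the column holds ISO codes, so you may need a name-to-code mapping first:
with open('cities.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}

result = country_groups.apply(city_distances)
result[result.index.isin(wanted)].to_csv('cities_out.txt', sep='\t')  # hypothetical output file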

Data Frame - changing of the nested variable

The data under discussion is imported from Excel:
ene2 = pd.read_excel('Energy Indicators.xls', index=False)
Recently I asked a question where the answers were clear, straightforward and brought success:
Changing Values of elements in Pandas Datastructure
However, I went a few steps further, and I have a similar (sic!) problem, where assigning a variable does not change anything.
Let's consider the data structure:
print(ene2.head())
Country Energy Supply Energy Supply per Capita % Renewable's
15 NaN Gigajoules Gigajoules %
16 Afghanistan 321000000 10 78.6693
17 Albania 102000000 35 100
18 Algeria1 1959000000 51 0.55101
19 American Samoa ... ... 0.641026
238 Viet Nam 2554000000 28 45.3215
239 Wallis and Futuna Islands 0 26 0
240 Yemen 344000000 13 0
241 Zambia 400000000 26 99.7147
242 Zimbabwe 480000000 32 52.5361
243 NaN NaN NaN NaN
244 NaN NaN NaN NaN
where some country names carry a trailing index (like Algeria1 or Australia12).
I want to change those names to become just Algeria, Australia and so on.
In total there are 20 entries that are supposed to be changed.
I developed a method to do it, which fails at the last step:
for value in ene2['Country']:
    if type(value) == float:  # to cover NaN values
        continue
    x = re.findall("\D+\d", value)  # to find those countries/elements which end with a number
    while len(x) > 0:  # there are elements with a number, otherwise the answer is [], which is falsy
        for letters in x:  # to touch letters
            right = letters[:-1]  # and get rid of the last number
            ene2.loc[ene2['Country'] == value, 'Country'] = right  # THIS IS THE ELEMENT WHICH FAILS <= it does not change the value
            x = re.findall("\D+\d", value)  # to bring the new value to the while loop
The code above should do the task and finally remove all the indexes from the names; however, the ene2.loc[...] line, which used to work previously, does nothing here, where it is nested.
Why might this assignment not work, and how can I overcome the problem a) in an old-style way, b) in the pandas way?
Your code suggests you already use pandas, so why not use the built-in str.replace method with a regex?
df = pd.DataFrame(data=["Afghanistan", "Albania", "Algeria1", "Algeria9999"], columns=["Country"])
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '', regex=True)
output:
print(df["Country_clean"])
0 Afghanistan
1 Albania
2 Algeria
3 Algeria
Name: Country_clean, dtype: object
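Applied to the asker's frame, this would be the sketch below, assuming the ene2 column from the question exists; regex=True is needed on recent pandas, where the default flipped to literal matching:
ene2['Country'] = ene2['Country'].str.replace(r'\d+$', '', regex=True)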

Sum df columns for weighted average

Backstory: I have a pandas dataframe scaledData that is just a standard df of information as follows:
COL NAME0 COL NAME1 ... COL NAME3 COL NAME4
0 Alabama 4.099099 ... 2.042345 1.392755
1 Alaska 1.396396 ... 1.000000 1.000000
2 Arizona 4.189189 ... 2.003257 1.537777
3 Arkansas 2.927928 ... 2.208723 1.007370
4 California 3.378378 ... 1.754930 2.012395
5 Colorado 3.378378 ... 3.282196 2.843435
6 Connecticut 5.000000 ... 1.452587 4.277286
7 Delaware 4.409692 ... 2.134501 1.970434
8 District of Columbia 5.000000 ... 1.000000 1.000000
9 Florida 4.628118 ... 1.806412 2.213038
10 Georgia 4.628118 ... 1.513896 2.748559
11 Hawaii 3.902494 ... 2.891694 3.872309
12 Idaho 1.090703 ... 2.978469 4.127419
13 Illinois 4.537415 ... 1.242970 1.888353
14 Indiana 4.537415 ... 2.368881 2.307914
15 Iowa 2.088435 ... 3.298368 3.421122
16 Kansas 2.723356 ... 2.791375 2.160330
17 Kentucky 3.902494 ... 1.692890 4.133744
18 Louisiana 2.451247 ... 1.000000 1.000000
19 Maine 3.448980 ... 2.535328 5.000000
20 Maryland 5.000000 ... 1.632194 1.046567
I want to create another column, Total, in this df that results from adding all of the column values for each state (COL NAME0) and dividing by the sum of the weights in a dictionary. Additionally, a column E should perform the same total but only for the columns carrying that specific tag. The weights dictionary's keys are the column names of the df, and the values are tuples containing the weight value for the column (used previously but irrelevant to this problem) and the category the column belongs to. Here is my current implementation:
weights = {'COL NAME1': (2.14, 'E'), 'COL NAME2': (5.14, 'E'), 'COL NAME3': (10, 'G'), 'COL NAME4': (5, 'E')}
eWeights = {key: value for key, value in weights.items() if value[1] == 'E'}
gWeights = {key: value for key, value in weights.items() if value[1] == 'G'}

# Total should be the result of adding each of the columns per COL NAME0 row
# and dividing by the sum of the weight values.
scaledData['Total'] = scaledData.sum(axis=1, skipna=True) / sum(list(weights.values())[0])

# Same calculation on only columns marked 'E'
for key in eWeights:
    scaledData['E'] = scaledData['E'] + scaledData[key]
scaledData['E'] = scaledData['E'] / sum(list(eWeights.values())[0])
Unfortunately, the above code results in the following error (caused by the line creating the Total column in scaledData):
TypeError: unsupported operand type(s) for +: 'float' and 'str'
I've simplified the scaledData and weights but any solution or suggestions will help me with my actual df with many more rows and columns. Appreciate the help and let me know if more information is needed.
Some of your columns seem to be stored as strings rather than floats. Try:
for key in eWeights:
    scaledData['E'] = scaledData['E'].astype(float) + scaledData[key].astype(float)
scaledData['E'] / sum(list(eWeights.values())[0])
# should this be a print? Are you trying to set any values?
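A hedged alternative sketch that avoids the error altogether: sum only the weight columns, so the state-name strings in COL NAME0 never enter the addition. It assumes the scaledData, weights and eWeights objects from the question:
weight_cols = list(weights)  # the numeric columns named in the weights dict
e_cols = list(eWeights)

scaledData['Total'] = scaledData[weight_cols].sum(axis=1) / sum(w for w, _ in weights.values())
scaledData['E'] = scaledData[e_cols].sum(axis=1) / sum(w for w, _ in eWeights.values())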

Python / Pulp - first pick one item of each group

I have a dataframe that looks like this:
City Country Time Points
---------------------------------------
London UK 31 20
Paris France 42 12
Sydney Australia 73 44
New York US 41 18
Lyon France 22 25
...
...
The program should pick cities to visit in a given time frame. There is only a time constraint and no strict constraint on the number of cities.
However:
each city should be visited once at maximum
you can visit the same country twice, but only after you have visited every single country in the dataframe (the same goes for the 3rd visit to the same country: you have to visit each country twice before)
Currently my code looks like this:
max_time = 500

x = pulp.LpVariable.dicts("x", df.index, 0, 1, pulp.LpInteger)
mod = pulp.LpProblem("travel_prog", pulp.LpMaximize)

objval_points = {idx: (df['Points'][idx]) for idx in df.index}
mod += sum([x[idx]*objval_points[idx] for idx in df.index])

objval_time = {idx: (df['Time'][idx]) for idx in df.index}
mod += sum([x[idx]*objval_time[idx] for idx in df.index]) <= max_time

for idx in df.index:
    mod += x[idx] <= 1
I have created a constraint that allows the program to choose only one city per country, but this is not what I want:
for country in df['Country'].unique():
    sub_idx = df[df['Country']==country].index
    mod += pulp.lpSum([x[idx] for idx in sub_idx]) <= 1
Count the number of times each country is visited; put this in a variable, say CountVisit[Country]. Introduce variables MaxVisit and MinVisit. Then add the constraints MaxVisit >= CountVisit[Country] and MinVisit <= CountVisit[Country] for every country, and finally impose the constraint MaxVisit - MinVisit <= 1.
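In PuLP terms, that could look like the following sketch, assuming the x, df and mod objects from the question; here each CountVisit[c] is a linear expression rather than a separate variable:
countries = df['Country'].unique()

# CountVisit[c]: how many cities are picked in country c
count_visit = {c: pulp.lpSum(x[idx] for idx in df[df['Country'] == c].index)
               for c in countries}

max_visit = pulp.LpVariable("MaxVisit", lowBound=0, cat=pulp.LpInteger)
min_visit = pulp.LpVariable("MinVisit", lowBound=0, cat=pulp.LpInteger)

for c in countries:
    mod += max_visit >= count_visit[c]  # MaxVisit >= CountVisit[c]
    mod += min_visit <= count_visit[c]  # MinVisit <= CountVisit[c]

mod += max_visit - min_visit <= 1  # visit counts differ by at most one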

Intersections between values in two dictionaries in python

I have a csv file that contains trade data for some countries. The data has a format as follows:
rep par commodity value
USA GER 1 700
USA GER 2 100
USA GER 3 400
USA GER 5 100
USA GER 80 900
GER USA 2 300
GER USA 4 500
GER USA 5 700
GER USA 97 450
GER UK 50 300
UK USA 4 1100
UK USA 80 200
UK GER 50 200
UK GER 39 650
I intend to make a new dictionary and, using it, calculate the total value of the commodities commonly traded between countries.
For example, consider the trade between USA-GER: I intend to check whether GER-USA is in the data and, if it exists, sum the values of the common commodities, doing the same for all country pairs. The dictionary should be like:
Dic_c1c2_products =
{('USA','GER'): {('1','700'),('2','100'),('3','400'),('5','100'),('80','900')},
 ('GER','USA'): {('2','300'),('4','500'),('5','700'),('97','450')},
 ('GER','UK'):  {('50','300')},
 ('UK','USA'):  {('4','1100'),('80','200')},
 ('UK','GER'):  {('50','200'),('39','650')}}
As you can see, USA-GER and GER-USA have commodities 2 and 5 in common, and the value of these goods is (100+300)+(100+700).
For the pair UK-USA there is no matching USA-UK trade, so there are no common commodities and the total trade will be 0 as well. For GER-UK and UK-GER, commodity 50 is in common and the total trade is 300+200.
At the end, I want to have something like:
Dic_c1c2_summation = {('USA','GER'): 1200, ('GER','UK'): 500, ('UK','USA'): 0}
Any help would be appreciated.
In addition to my post, I have written the following lines:
import csv
from collections import defaultdict

rfile = csv.reader(open("filepath", 'r'))
rfile.next()
dic_c1c2_products = defaultdict(set)
dic_c_products = {}
country = set()
for row in rfile:
    c1 = row[0]
    c2 = row[1]
    p = row[2]
    country.add(c1)
for i in country:
    dic_c_products[i] = set()
rfile = csv.reader(open("filepath"))
rfile.next()
for i in rfile:
    c1 = i[0]
    c2 = i[1]
    p = i[2]
    v = i[3]
    dic_c_products[c1].add((p, v))
    if not dic_c1c2_products.has_key((c1, c2)):
        dic_c1c2_products[(c1, c2)] = set()
        dic_c1c2_products[(c1, c2)].add((p, v))
    else:
        dic_c1c2_products[(c1, c2)].add((p, v))
c_list = dic_c_products.keys()
dic_c1c2_productsummation = set()
for i in dic_c1c2_products.keys():
    if dic_c1c2_products.has_key((i[1], i[0])):
        for p1, v1 in dic_c1c2_products[(i[0], i[1])]:
            for p2, v2 in dic_c1c2_products[(i[1], i[0])]:
                if p1 == p2:
                    summation = v1 + v2
                    if i not in dic_c1c2_productsum.keys():
                        dic_c1c2_productsum[(i[0], i[1])] = (p1, summation)
                    else:
                        dic_c1c2_productsum[(i[0], i[1])].add((p1, summation))
    else:
        dic_c1c2_productsn[i] = " "
# save your data in a file called data
import pandas as pd

data = pd.read_csv('data', delim_whitespace=True)
data['par_rep'] = data.apply(lambda x: '_'.join(sorted([x['par'], x['rep']])), axis=1)
result = data.groupby(['par_rep', 'commodity']).filter(lambda x: len(x) >= 2).groupby('par_rep')['value'].sum().to_dict()
At the end, result is {'GER_UK': 500, 'GER_USA': 1200}.
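Note that the filter drops pairs with no commodity traded in both directions, so the ('UK','USA'): 0 entry from the question never appears. A small follow-up sketch to add those zeros back, assuming the data and result objects above:
all_pairs = set(data['par_rep'])
result.update({pair: 0 for pair in all_pairs if pair not in result})
print(result)
#{'GER_UK': 500, 'GER_USA': 1200, 'UK_USA': 0}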
