Extract matrix from a dataframe by value from columns - python

I am trying something that could be a little hard to understand, but I will try to be very specific.
I have a pandas DataFrame like this:
Locality                          Count       Lat.        Long.
Krasnodar                         Russia      44          39
Tirana                            Albania     41.33       19.83
Areni                             Armenia     39.73       45.2
Kars                              Armenia     40.604517   43.100758
Brunn Wolfholz                    Austria     48.120396   16.291722
Kleinhadersdorf Flur Marchleiten  Austria     48.663197   16.589687
Jalilabad district                Azerbaijan  39.3607139  48.4613556
Zeyem Chaj                        Azerbaijan  40.9418889  45.8327778
Jalilabad district                Azerbaijan  39.5186111  48.65
And a dataframe cities.txt with the names of some countries:
Albania
Armenia
Austria
Azerbaijan
And so on.
The next step is to convert the Lat. and Long. values to radians and then, with the values from the list, do something like:
with open('cities.txt') as file:
    lines = file.readlines()

x = np.where(df['Count'].eq(lines),
             pd.DataFrame(dist.pairwise(df[['Lat.','Long.']].to_numpy())*6373,
                          columns=df.Locality.unique(), index=df.Locality.unique()))
Here pd.DataFrame(dist.pairwise(df[['Lat.','Long.']].to_numpy())*6373, columns=df.Locality.unique(), index=df.Locality.unique()) converts the radian Lat./Long. values into distances in km and builds a DataFrame as a matrix for each line (country).
In the end I will (in theory) have a lot of 2D matrices grouped by country, and I want to apply this:
>>>Russia.min()
0
>>>Russia.max()
5
to get the .min() and .max() values of each matrix and save these results in cities.txt as:
Country     Max. Dist.  Min. Dist.
Albania     5           1
Armenia     10          9
Austria     5           3
Azerbaijan  0           0
Unfortunately: 1) I'm stuck in the first part, where I get the error ValueError: Lengths must be equal; 2) is it possible to have these matrices grouped by country; and 3) how can I save my .min() and .max() values?

I am not sure what exactly you want as the minimum. In this solution, the minimum is 0 if there is only 1 city, and otherwise the shortest distance between 2 cities within the country. Also, the file cities.txt seems to be just a filter; I didn't apply it, but that seems straightforward.
import numpy as np
import pandas as pd
Here is just some sample data:
cities = pd.read_json("https://raw.githubusercontent.com/lutangar/cities.json/master/cities.json")
cities = cities.sample(10000)
Create and apply a custom aggregate for groupby()
from sklearn.metrics import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
country_groups = cities.groupby("country")
def city_distances(group):
    geo = group[['lat', 'lng']]
    EARTH_RADIUS = 6371
    haversine_distances = dist.pairwise(np.radians(geo))
    haversine_distances *= EARTH_RADIUS
    distances = {}
    distances['max'] = np.max(haversine_distances)
    distances['min'] = 0
    if len(haversine_distances[np.nonzero(haversine_distances)]) > 0:
        distances['min'] = np.min(haversine_distances[np.nonzero(haversine_distances)])
    return pd.Series(distances)
country_groups.apply(city_distances)
In my case this prints something like
max min
country
AE 323.288482 323.288482
AF 1130.966661 15.435642
AI 12.056890 12.056890
AL 272.300688 3.437074
AM 268.051071 1.328605
... ... ...
YE 662.412344 19.103222
YT 3.723376 3.723376
ZA 1466.334609 24.319334
ZM 1227.429001 218.566369
ZW 503.562608 26.316902
[194 rows x 2 columns]
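If you also want the cities.txt filter and need to write the min/max values back out (parts 2 and 3 of the question), a minimal sketch could look like the following, assuming cities.txt holds one country name per line matching the values in the grouping column (the output filename is just an example):
# hypothetical follow-up: filter by cities.txt and save the result
with open('cities.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}

result = country_groups.apply(city_distances)   # the max/min frame from above
result = result[result.index.isin(wanted)]      # keep only the listed countries
result.to_csv('cities_min_max.txt', sep='\t')   # save the .min()/.max() per country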

Related

Calculating Distances and Filtering and Summing Values Pandas

I currently have data which contains a location name, latitude, longitude, and a number value associated with each location. My final goal is to get a dataframe that has, for each location, the sum of the values of the locations within specific distance ranges. A sample dataframe is below:
IDVALUE,Latitude,Longitude,NumberValue
ID1,44.968046,-94.420307,1
ID2,44.933208,-94.421310,10
ID3,33.755787,-116.359998,15
ID4,33.844843,-116.54911,207
ID5,44.92057,-93.44786,133
ID6,44.240309,-91.493619,52
ID7,44.968041,-94.419696,39
ID8,44.333304,-89.132027,694
ID9,33.755783,-116.360066,245
ID10,33.844847,-116.549069,188
ID11,44.920474,-93.447851,3856
ID12,44.240304,-91.493768,189
Firstly, I managed to get the distances between each of them using the haversine function. Using the code below I turned the latlongs into radians and then created a matrix where the diagonals are infinite values.
import math
import numpy as np
import pandas as pd
from sklearn.metrics import DistanceMetric

df_latlongs['LATITUDE'] = np.radians(df_latlongs['LATITUDE'])
df_latlongs['LONGITUDE'] = np.radians(df_latlongs['LONGITUDE'])
dist = DistanceMetric.get_metric('haversine')
latlong_df = pd.DataFrame(dist.pairwise(df_latlongs[['LATITUDE','LONGITUDE']].to_numpy())*6373,
                          columns=df_latlongs.IDVALUE.unique(), index=df_latlongs.IDVALUE.unique())
np.fill_diagonal(latlong_df.values, math.inf)
This distance matrix is then in kilometres. What I'm struggling with next is to be able to filter the distances of each of the locations and get the total number of values within a range and link this to the original dataframe.
Below is the code I have used to filter the distance matrix to get all of the locations within 500 meters:
latlong_df_rows = latlong_df[latlong_df < 0.5]
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=0)
latlong_df_rows = latlong_df_rows.dropna(how='all', axis=1)
My attempt was to then get, for each location, a list of the locations that were within this range, using the code below:
within_range_df = latlong_df_rows.apply(lambda row: row[row < 0.05].index.tolist(), axis=1)
within_range_df = within_range_df.to_frame()
within_range_df = within_range_df.dropna(how='all', axis=0)
within_range_df = within_range_df.dropna(how='all', axis=1)
From here I was going to try and get the NumberValue from the original dataframe by looping through the list of values to obtain another column for the number for that location. Then sum all of them. The final dataframe would ideally look like the following:
IDVALUE,<500m,500-1000m,>1000m
ID1,x1,y1,z1
ID2,x2,y2,z2
ID3,x3,y3,z3
ID4,x4,y4,z4
ID5,x5,y5,z5
ID6,x6,y6,z6
ID7,x7,y7,z7
ID8,x8,y8,z8
ID9,x9,y9,z9
ID10,x10,y10,z10
ID11,x11,y11,z11
ID12,x12,y12,z12
Where x, y, and z are the total number values for the nearby locations at the different distance ranges. I know this is probably really weird and overcomplicated, so any tips on changing the question are welcome, and I'll be happy to provide anything else that is needed. Cheers
I would define a helper function, making use of BallTree, e.g.
from sklearn.neighbors import BallTree
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv')
We use query_radius() to get the IDs, and a list comprehension to get the values and sum them:
locations_radians = np.radians(df[["Latitude","Longitude"]].values)
tree = BallTree(locations_radians, leaf_size=12, metric='haversine')
def summed_numbervalue_for_radius(radius_in_m=100):
    distance_in_meters = radius_in_m
    earth_radius = 6371000
    radius = distance_in_meters / earth_radius
    ids_within_radius = tree.query_radius(locations_radians, r=radius, count_only=False)
    values_as_array = np.array(df.NumberValue)
    summed_values = [values_as_array[ix].sum() for ix in ids_within_radius]
    return np.array(summed_values)
With the helper function you can do for instance;
df = df.assign( sum_100=summed_numbervalue_for_radius(100))
df = df.assign( sum_500=summed_numbervalue_for_radius(500))
df = df.assign( sum_1000=summed_numbervalue_for_radius(1000))
df = df.assign( sum_1000_to_5000=summed_numbervalue_for_radius(5000)-summed_numbervalue_for_radius(1000))
This will give you:
IDVALUE Latitude Longitude NumberValue sum_100 sum_500 sum_1000 \
0 ID1 44.968046 -94.420307 1 40 40 40
1 ID2 44.933208 -94.421310 10 10 10 10
2 ID3 33.755787 -116.359998 15 260 260 260
3 ID4 33.844843 -116.549110 207 395 395 395
4 ID5 44.920570 -93.447860 133 3989 3989 3989
5 ID6 44.240309 -91.493619 52 241 241 241
6 ID7 44.968041 -94.419696 39 40 40 40
7 ID8 44.333304 -89.132027 694 694 694 694
8 ID9 33.755783 -116.360066 245 260 260 260
9 ID10 33.844847 -116.549069 188 395 395 395
10 ID11 44.920474 -93.447851 3856 3989 3989 3989
11 ID12 44.240304 -91.493768 189 241 241 241
sum_1000_to_5000
0 10
1 40
2 0
3 0
4 0
5 0
6 10
7 0
8 0
9 0
10 0
11 0
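If you need the exact <500m / 500-1000m / >1000m columns from the question, one hedged way to build them on top of the same helper (the bucket names here are just illustrative) is:
# hypothetical binning into the ranges asked for in the question;
# note that each location's own value ends up in the <500m bucket,
# because query_radius() includes the point itself
sum_500 = summed_numbervalue_for_radius(500)
sum_1000 = summed_numbervalue_for_radius(1000)
total = df.NumberValue.sum()                 # every location, regardless of distance

df = df.assign(**{'<500m': sum_500,
                  '500-1000m': sum_1000 - sum_500,
                  '>1000m': total - sum_1000})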

Data Frame - changing of the nested variable

We are discussing data that is imported from Excel:
ene2 = pd.read_excel('Energy Indicators.xls', index=False)
Recently I asked this in a post, where the answers were clear, straightforward, and brought success:
Changing Values of elements in Pandas Datastructure
However, I went a few steps further, and I have a similar (sic!) problem, where assigning the variable does not change anything.
Let's consider the data structure:
print(ene2.head())
Country Energy Supply Energy Supply per Capita % Renewable's
15 NaN Gigajoules Gigajoules %
16 Afghanistan 321000000 10 78.6693
17 Albania 102000000 35 100
18 Algeria1 1959000000 51 0.55101
19 American Samoa ... ... 0.641026
238 Viet Nam 2554000000 28 45.3215
239 Wallis and Futuna Islands 0 26 0
240 Yemen 344000000 13 0
241 Zambia 400000000 26 99.7147
242 Zimbabwe 480000000 32 52.5361
243 NaN NaN NaN NaN
244 NaN NaN NaN NaN
where some countries have a number appended to their name (like Algeria1 or Australia12).
I want to change those names to become just Algeria, Australia, and so on.
There are 20 entries in total that are supposed to be changed.
I developed a method to do it, which fails at the last step.
for value in ene2['Country']:
    if type(value) == float:  # to cover NaN values
        continue
    x = re.findall("\D+\d", value)  # to find those countries/elements which end with a number
    while len(x) > 0:  # this finds elements with a number, otherwise the answer is [], whose length is 0
        for letters in x:  # to touch the letters
            right = letters[:-1]  # and get rid of the last number
            ene2.loc[ene2['Country'] == value, 'Country'] = right  # THIS IS THE ELEMENT WHICH FAILS <= it does not change the value
        x = re.findall("\D+\d", value)  # to bring the new value to the while loop
The code above should do the job and finally remove all the numbers from the names;
however, the ene2.loc[...] line, which used to work previously, does nothing here where it is nested.
Why does this assignment not work, and how can I overcome the problem a) in an old-style way, b) in the pandas way?
The code suggests you already use pandas, so why not use the built-in str.replace method with a regex?
df = pd.DataFrame(data=["Afghanistan","Albania", "Algeria1", "Algeria9999"], columns=["Country"])
df["Country_clean"] = df["Country"].str.replace(r'\d+$', '')
output:
print(df["Country_clean"])
0 Afghanistan
1 Albania
2 Algeria
3 Algeria
Name: Country_clean, dtype: object
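Applied to the ene2 frame from the question, a minimal sketch (assuming the country names only ever carry trailing digits that should be stripped) would be:
# hypothetical: strip trailing digits from the Country column of ene2
ene2['Country'] = ene2['Country'].str.replace(r'\d+$', '', regex=True)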

Find highest growth using python pandas?

I have a dataframe named growth with 4 columns.
State Name Average Fare ($)_x Average Fare ($)_y Average Fare ($)
0 AK 599.372368 577.790640 585.944324
1 AL 548.825867 545.144447 555.939466
2 AR 496.033146 511.867026 513.761296
3 AZ 324.641818 396.895324 389.545267
4 CA 368.937971 376.723839 366.918761
5 CO 502.611572 537.206439 531.191893
6 CT 394.105453 388.772428 370.904182
7 DC 390.872738 382.326510 392.394165
8 FL 324.941100 329.728524 337.249248
9 GA 485.335737 480.606365 489.574241
10 HI 326.084793 335.547369 298.709998
11 IA 428.151682 445.625840 462.614195
12 ID 482.092567 475.822275 491.714945
13 IL 329.449503 349.938794 346.022226
14 IN 391.627917 418.945137 412.242053
15 KS 452.312058 490.024059 420.182836
The last three columns are the average fare for each year for each state,
the 2nd, 3rd, and 4th columns being the years 2017, 2018, and 2019 respectively.
I want to find out which state has the highest growth in fare since 2017.
I tried with this code of mine and it gives some output that I can't really understand.
I just need to find the state that has the highest fare growth since 2017.
my code:
growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change()
You can use this:
df.set_index('State_name').pct_change(periods = 1, axis='columns').idxmax()
Change the periods value to 2 if you want to calculate the difference between first year & the 3rd year.
output
Average_fare_x NaN
Average_fare_y AZ #state with max change between 1st & 2nd year
Average_fare WV #state with max change between 2nd & 3rd year
growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change(axis='columns')
This should give you the percentage change between each year.
growth['variation_percentage'] = growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change(axis='columns').sum(axis=1)
This should give you the cumulative percentage change.
Since you are talking about fare variation, the total growth/decrease will be the variation from 2017 to your last available data (2019). Therefore you can compute this ratio and then just take the max() to find the row with the most growth.
growth['variation_fare'] = growth['Average Fare ($)'] / growth['Average Fare ($)_x']
growth = growth.sort_values(['variation_fare'],ascending=False)
print(growth.head(1))
Example:
import pandas as pd
a = {'State':['AK','AL','AR','AZ','CA'],'2017':[100,200,300,400,500],'2018':[120,242,324,457,592],'2019':[220,393,484,593,582]}
growth = pd.DataFrame(a)
growth['2018-2017 variation'] = (growth['2018'] / growth['2017']) - 1
growth['2019-2018 variation'] = (growth['2019'] / growth['2018']) - 1
growth['total variation'] = (growth['2019'] / growth['2017']) - 1
growth = growth.sort_values(['total variation'],ascending=False)
print(growth.head(5)) #Showing top 5
Output:
State 2017 2018 2019 2018-2017 variation 2019-2018 variation total variation
0 AK 100 120 220 0.2000 0.833333 1.200000
1 AL 200 242 393 0.2100 0.623967 0.965000
2 AR 300 324 484 0.0800 0.493827 0.613333
3 AZ 400 457 593 0.1425 0.297593 0.482500
4 CA 500 592 582 0.1840 -0.016892 0.164000
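Transferred back to the real column names from the question, a hedged sketch to pull out the single state with the highest growth might be:
# hypothetical, using the original column names; _x is 2017 and the unsuffixed column is 2019
growth['total variation'] = growth['Average Fare ($)'] / growth['Average Fare ($)_x'] - 1
print(growth.loc[growth['total variation'].idxmax(), 'State Name'])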

operations in pandas DataFrame

I have a fairly large (~5000 rows) DataFrame, with a number of variables, say 2 ['max', 'min'], sorted by 4 parameters, ['Hs', 'Tp', 'wd', 'seed']. It looks like this:
>>> data.head()
Hs Tp wd seed max min
0 1 9 165 22 225 18
1 1 9 195 16 190 18
2 2 5 165 43 193 12
3 2 10 180 15 141 22
4 1 6 180 17 219 18
>>> len(data)
4500
I want to keep only the first 2 parameters and get the maximum of the standard deviations over all seeds, calculated individually for each 'wd'.
In the end, I'm left with unique (Hs, Tp) pairs with the maximum standard deviations for each variable. Something like:
>>> stdev.head()
Hs Tp max min
0 1 5 43.31321 4.597629
1 1 6 43.20004 4.640795
2 1 7 47.31507 4.569408
3 1 8 41.75081 4.651762
4 1 9 41.35818 4.285991
>>> len(stdev)
30
The following code does what I want, but since I have little understanding about DataFrames, I'm wondering if these nested loops can be done in a different and more DataFramy way =)
import pandas as pd
import numpy as np
#
#data = pd.read_table('data.txt')
#
# don't worry too much about this ugly generator,
# it just emulates the format of my data...
total = 4500
data = pd.DataFrame()
data['Hs'] = np.random.randint(1, 4, size=total)
data['Tp'] = np.random.randint(5, 15, size=total)
data['wd'] = [[165, 180, 195][np.random.randint(0, 3)] for _ in xrange(total)]
data['seed'] = np.random.randint(1, 51, size=total)
data['max'] = np.random.randint(100, 250, size=total)
data['min'] = np.random.randint(10, 25, size=total)

# and here it starts. would the creators of pandas pull their hair out if they see this?
# can this be made better?
stdev = pd.DataFrame(columns=['Hs', 'Tp', 'max', 'min'])
i = 0
for hs in set(data['Hs']):
    data_Hs = data[data['Hs'] == hs]
    for tp in set(data_Hs['Tp']):
        data_tp = data_Hs[data_Hs['Tp'] == tp]
        stdev.loc[i] = [
            hs,
            tp,
            max([np.std(data_tp[data_tp['wd'] == wd]['max']) for wd in set(data_tp['wd'])]),
            max([np.std(data_tp[data_tp['wd'] == wd]['min']) for wd in set(data_tp['wd'])])]
        i += 1
Thanks!
PS: if curious, this is statistics on variables depending on sea waves. Hs is wave height, Tp wave period, wd wave direction, the seeds represent different realizations of an irregular wave train, and min and max are the peaks or my variable during a certain exposition time. After all this, by means of the standard deviation and average, I can fit some distribution to the data, like Gumbel.
This could be a one-liner, if I understood you correctly:
data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']].std(ddof=0).max(level=[0, 1])
(include reset_index() on the end if you want)
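Note that newer pandas releases removed the level argument of max(); assuming a recent pandas, an equivalent (slightly longer) form would be:
# same computation without max(level=...), for newer pandas versions
(data.groupby(['Hs', 'Tp', 'wd'])[['max', 'min']]
     .std(ddof=0)
     .groupby(level=['Hs', 'Tp'])
     .max())
(again, append reset_index() if you want Hs and Tp back as columns)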

How to sum columns in csv file in python more efficiently

This is my data:
Year Country Albania Andorra Armenia Austria Azerbaijan
2009 Lithuania 0 0 0 0 1
2009 Israel 0 7 0 0 0
2008 Israel 1 2 2 0 4
2008 Lithuania 1 5 1 0 8
Actually, it is csv file and delimiter is , so raw data is:
Year,Country,Albania,Andorra,Armenia,Austria,Azerbaijan
2009,Lithuania,0,0,0,0,1
2009,Israel,0,7,0,0,0
2008,Israel,1,2,2,0,4
2008,Lithuania,1,5,1,0,8
I am a beginner in Python and don't really know many Python tricks. What I do know is that I probably complicate things too much in my code.
I want to get this:
final_dict = {Albania: [1, 1], Andorra: [5, 9], Armenia: [1, 2], Austria: [0, 0], Azerbaijan: [9, 4]}
where, for each column country (e.g. Albania), the first element of the list is the column sum for Lithuania and the second element is the column sum for Israel.
Explanation of output: for every country in the first row (Albania, Andorra, Armenia, Austria and Azerbaijan) I want the sums grouped by the countries in the Country column.
Andorra: [5, 9]
# 5 is the sum for Lithuania in the Andorra column
# 9 is the sum for Israel in the Andorra column
You can use the Pandas module which is perfect for this type of application:
import pandas as pd
df = pd.read_csv('songfestival.csv')
gb = df.groupby('Country')
res = pd.concat([i[1].sum(numeric_only=True) for i in gb], axis=1).T
res.pop('Year')
order = [i[0] for i in gb]
print(order)
print(res)
#['Israel', 'Lithuania']
# Albania Andorra Armenia Austria Azerbaijan
#0 1 9 2 0 4
#1 1 5 1 0 9
to query the result for each column you can simply do:
print(res.Albania)
print(res.Andorra)
...
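If you really need the dictionary shape from the question, a hedged sketch building it from the same grouping is:
# hypothetical: reproduce final_dict, with the Lithuania sum first as in the question
sums = df.groupby('Country').sum(numeric_only=True).drop(columns='Year')
sums = sums.reindex(['Lithuania', 'Israel'])
final_dict = {col: sums[col].tolist() for col in sums.columns}
print(final_dict)
# {'Albania': [1, 1], 'Andorra': [5, 9], 'Armenia': [1, 2], 'Austria': [0, 0], 'Azerbaijan': [9, 4]}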
Ok, so you want the lines aggregated by country:
import csv
from collections import defaultdict

with open("songfestival.csv", "r") as ifile:
    reader = csv.DictReader(ifile)
    country_columns = [k for k in reader.fieldnames if k not in ["Year", "Country"]]
    data = defaultdict(lambda: defaultdict(int))
    for line in reader:
        curr_country = data[line["Country"]]
        for country_column in country_columns:
            curr_country[country_column] += int(line[country_column])

with open("songfestival_aggr.csv", "w") as ofile:
    writer = csv.DictWriter(ofile, fieldnames=country_columns + ["Country"])
    writer.writeheader()
    for k, v in data.items():
        row = dict(v)
        row["Country"] = k
        writer.writerow(row)
I took the liberty of outputting it to another csv file. Your data structure is very error-prone, since it depends on the order of the columns. It is better to use an intermediate dict of dicts to assign names to the aggregations -> see @gboffi's comment on your question.
The trick here is using defaultdict from the collections module. Please search for
python defaultdict
on SO; you'll find lots of useful examples, and here is my answer:
import csv
from collections import defaultdict

# slurp the data
data = list(csv.reader(open('points.csv')))

# massage the data
for i, row in enumerate(data[1:], 1):
    data[i] = [int(elt) if elt.isdigit() else elt for elt in row]

points = {}  # an empty dictionary
for i, country in enumerate(data[0][2:], 2):
    # for each country, a couple country:defaultdict is put in points
    points[country] = defaultdict(int)
    for row in data[1:]:
        opponent = row[1]
        points[country][opponent] += row[i]

# here you can post-process points as you like,
# I'll simply print out the stuff
for country in points:
    for opponent in points[country]:
        print country, "vs", opponent, "scored",
        print points[country][opponent], "points."
The example output for your data has been
Andorra vs Israel scored 9 points.
Andorra vs Lithuania scored 5 points.
Austria vs Israel scored 0 points.
Austria vs Lithuania scored 0 points.
Albania vs Israel scored 1 points.
Albania vs Lithuania scored 1 points.
Azerbaijan vs Israel scored 4 points.
Azerbaijan vs Lithuania scored 9 points.
Armenia vs Israel scored 2 points.
Armenia vs Lithuania scored 1 points.
Edit
If you're against defaultdict, you can use the .get method of an ordinary dict, which gives you back a default value when the key has not been initialised yet:
points[country] = {}  # a standard empty dict
for row in data[1:]:
    opponent = row[1]
    points[country][opponent] = points[country].get(opponent, 0) + row[i]
as you see, it's a bit clumsier, but still manageable.
