I am trying to generate a choropleth map of India with some data. Everything seems to work fine: I cleaned the data and built a proper pandas DataFrame, but the map is still blank. The GeoJSON _id values match the DataFrame, yet it does not resolve.
import plotly.express as px
import pandas as pd
import json
import numpy as np
india_map = json.load(open('INDIA_STATES.json','r'))
df = pd.read_excel("data.xlsx")
df_s = pd.DataFrame({'Stats': df['Unnamed: 7'][3:38]})
#print(df_s)
df = pd.concat([df['Unnamed: 1'][3:38],df_s],axis=1,ignore_index=True)
df.reset_index(drop=True,inplace=True)
df = df.rename(columns={df.columns[0]:"State",df.columns[1]:"Stat"})
#print(df)
state_id_json = []
state_name_json = []
for st in india_map['features']:
    state_name_json.append(st['properties']['STATE'])
    state_id_json.append(st['_id'])
state_name_json = list(map(lambda x:x.upper(),state_name_json))
df['State'] = list(map(lambda x:x.upper(),df['State']))
state_id = pd.DataFrame({"State":state_name_json,"_id":state_id_json}).sort_values(by=['State'])
#print(state_id,df)
df = pd.merge(df,state_id,on='State')
print(df)
fig = px.choropleth(df, geojson=india_map, locations='_id', color='Stat', scope='asia')
fig.show()
I get this in the terminal:
State Stat _id
0 ANDAMAN AND NICOBAR ISLANDS 356.0 8
1 ANDHRA PRADESH 76210.0 9
2 ARUNACHAL PRADESH 1098.0 10
3 ASSAM 26656.0 11
4 BIHAR 82999.0 row_112
5 CHANDIGARH 901.0 35
6 CHANDIGARH 901.0 13
7 CHHATTISGARH 20834.0 row_251
8 DAMAN AND DIU 158.0 34
9 DELHI 13851.0 16
10 GOA 1348.0 17
11 GUJARAT 50671.0 18
12 HARYANA 21145.0 19
13 HIMACHAL PRADESH 6078.0 20
14 JAMMU AND KASHMIR 10144.0 7
15 JHARKHAND 26946.0 row_267
16 KARNATAKA 52851.0 21
17 KERALA 31841.0 22
18 LAKSHADWEEP 61.0 23
19 MADHYA PRADESH 60348.0 row_250
20 MAHARASHTRA 96879.0 25
21 MANIPUR 2294.0 26
22 MEGHALAYA 2319.0 27
23 MIZORAM 889.0 28
24 NAGALAND 1990.0 29
25 ODISHA 36805.0 30
26 PUDUCHERRY 974.0 37
27 PUNJAB 24359.0 1
28 RAJASTHAN 56507.0 2
29 SIKKIM 541.0 3
30 TAMIL NADU 62406.0 4
31 TRIPURA 3199.0 5
32 UTTAR PRADESH 166198.0 31
33 WEST BENGAL 80176.0 32
Plotly result: an empty map with no data.
Since no GeoJSON was provided, I retrieved one from the web and built the code around the data you posted, so I cannot be certain, but the problem is most likely a missing featureidkey setting: by default, px.choropleth matches the locations values against each feature's id field. In my example I associate the state name with NAME_1 in the GeoJSON. Some of the state names do not match, so I adapted them to the GeoJSON.
Note: some states are missing because the state names in the user data do not match NAME_1 in the GeoJSON. Please adjust them for your data.
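Alternatively, if you want to keep the original _id-based matching instead of renaming states, here is a minimal sketch; it assumes your INDIA_STATES.json really carries _id at the feature level, as the question's loop suggests:

# Hedged sketch: px.choropleth matches 'locations' against feature['id'] by
# default, so copy the assumed top-level '_id' into 'id' before plotting.
for feature in india_map['features']:
    feature['id'] = feature['_id']

fig = px.choropleth(df, geojson=india_map, locations='_id', color='Stat', scope='asia')
fig.update_geos(fitbounds="locations", visible=False)
fig.show()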
from urllib.request import urlopen
import json
import pandas as pd
import numpy as np
import io
import plotly.express as px
# https://github.com/Subhash9325/GeoJson-Data-of-Indian-States/blob/master/Indian_States
url = 'https://raw.githubusercontent.com/Subhash9325/GeoJson-Data-of-Indian-States/master/Indian_States'
with urlopen(url) as response:
    india_map = json.load(response)
data = '''
State Stat _id
0 "ANDAMAN AND NICOBAR ISLANDS" 356.0 8
1 "ANDHRA PRADESH" 76210.0 9
2 "ARUNACHAL PRADESH" 1098.0 10
3 ASSAM 26656.0 11
4 BIHAR 82999.0 row_112
5 CHANDIGARH 901.0 35
6 CHANDIGARH 901.0 13
7 CHHATTISGARH 20834.0 row_251
8 "DAMAN AND DIU" 158.0 34
9 DELHI 13851.0 16
10 GOA 1348.0 17
11 GUJARAT 50671.0 18
12 HARYANA 21145.0 19
13 "HIMACHAL PRADESH" 6078.0 20
14 "JAMMU AND KASHMIR" 10144.0 7
15 JHARKHAND 26946.0 row_267
16 KARNATAKA 52851.0 21
17 KERALA 31841.0 22
18 LAKSHADWEEP 61.0 23
19 "MADHYA PRADESH" 60348.0 row_250
20 MAHARASHTRA 96879.0 25
21 MANIPUR 2294.0 26
22 MEGHALAYA 2319.0 27
23 MIZORAM 889.0 28
24 NAGALAND 1990.0 29
25 ODISHA 36805.0 30
26 PUDUCHERRY 974.0 37
27 PUNJAB 24359.0 1
28 RAJASTHAN 56507.0 2
29 SIKKIM 541.0 3
30 "TAMIL NADU" 62406.0 4
31 TRIPURA 3199.0 5
32 "UTTAR PRADESH" 166198.0 31
33 "WEST BENGAL" 80176.0 32
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
df['State'] = df['State'].str.title()
df['State'] = df['State'].str.replace(' And ', ' and ')
df.loc[0, 'State'] = 'Andaman and Nicobar'
fig = px.choropleth(df,
                    geojson=india_map,
                    locations='State',
                    color='Stat',
                    featureidkey="properties.NAME_1",
                    scope='asia')
fig.update_geos(fitbounds="locations", visible=False)
fig.update_layout(autosize=False,
                  width=1000,
                  height=600)
fig.show()
Related
When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back empty. Is there a workaround for this? I've tried using Beautiful Soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select correct table, here I select the first one:
tbl = soup.select("table")[0]
# remove the (aged XX) part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]
df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
Try setting the parse_dates parameter to True in the read_html method.
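For what it's worth, a minimal sketch of that suggestion; whether it helps depends on whether the underlying parser can digest the "(aged XX)" suffix:

import pandas as pd

url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
# parse_dates is forwarded to the table parser; cells that mix a date with
# extra text may still need the BeautifulSoup cleanup shown above.
tables = pd.read_html(url, parse_dates=True)
print(tables[0].head())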
I want to plot a variable by date, across days and months. The grid is uneven where the month changes. How can I force an even grid spacing in this case?
Data is loaded via Pandas, as DataFrame.
ga =
Reference Organic_search Direct Date
0 0 0 0 2021-11-22
1 0 0 0 2021-11-23
2 0 0 0 2021-11-24
3 0 0 0 2021-11-25
4 0 0 0 2021-11-26
5 0 0 0 2021-11-27
6 0 0 0 2021-11-28
7 42 19 35 2021-11-29
8 69 33 48 2021-11-30
9 107 32 35 2021-12-01
10 62 30 26 2021-12-02
11 20 26 30 2021-12-03
12 22 22 20 2021-12-04
13 40 41 20 2021-12-05
14 14 39 26 2021-12-06
15 18 25 34 2021-12-07
16 8 21 13 2021-12-08
17 11 21 17 2021-12-09
18 23 27 20 2021-12-10
19 46 26 17 2021-12-11
20 29 42 20 2021-12-12
21 122 37 19 2021-12-13
22 97 25 29 2021-12-14
23 288 51 39 2021-12-15
24 96 29 26 2021-12-16
25 51 25 36 2021-12-17
26 23 16 21 2021-12-18
27 47 32 10 2021-12-19
code:
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(ga.Date, ga.Reference)
ax.set(xlabel='Date',
       ylabel='Site traffic')
date_form = DateFormatter('%d/%m')
ax.xaxis.set_major_formatter(date_form)
Looking at the added data, I realized why the interval is not constant: the number of days in each month differs. So I converted the date data to plain strings, which makes the x-axis categorical, and forced the grid spacing to be uniform.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
df = pd.read_excel('test.xlsx', index_col=0)

fig, ax = plt.subplots(figsize=(15, 5))
# plotting against formatted strings makes the x-axis categorical,
# so every day occupies the same width regardless of month
ax.plot(df['Date'].dt.strftime('%d/%m'), df.Reference)
ax.set(xlabel='Date',
       ylabel='Site traffic')
ax.grid(True)

# set x-axis tick interval: one major tick (and grid line) every 3 categories
interval = 3
ax.xaxis.set_major_locator(ticker.MultipleLocator(interval))
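If you would rather keep real datetime values on the x-axis instead of converting to strings, a hedged alternative is a fixed-interval date locator (a sketch using matplotlib's standard date API):

import matplotlib.dates as mdates

# One major tick (and grid line) every third day, regardless of month length.
ax.xaxis.set_major_locator(mdates.DayLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d/%m'))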
I'm scraping National Hockey League (NHL) data for multiple seasons from this URL:
https://www.hockey-reference.com/leagues/NHL_2018_skaters.html
I'm only getting a few rows here, and I have tried moving my dict statements throughout the for loops. I've also tried solutions from other posts, with no luck. Any help is appreciated. Thank you!
import requests
from bs4 import BeautifulSoup
import pandas as pd

dict = {}
for i in range(2010, 2020):
    year = str(i)
    source = requests.get('https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html').text
    soup = BeautifulSoup(source, features='lxml')

    # identifying table in html
    table = soup.find('table', id="stats")

    # grabbing <tr> tags in html
    rows = table.findAll("tr")

    # creating passable values for each "stat" in td tag
    data_stats = [
        "player",
        "age",
        "team_id",
        "pos",
        "games_played",
        "goals",
        "assists",
        "points",
        "plus_minus",
        "pen_min",
        "ps",
        "goals_ev",
        "goals_pp",
        "goals_sh",
        "goals_gw",
        "assists_ev",
        "assists_pp",
        "assists_sh",
        "shots",
        "shot_pct",
        "time_on_ice",
        "time_on_ice_avg",
        "blocks",
        "hits",
        "faceoff_wins",
        "faceoff_losses",
        "faceoff_percentage"
    ]
    for rownum in rows:
        # grabbing player name and using as key
        filter = {"data-stat": 'player'}
        cell = rows[3].findAll("td", filter)
        nameval = cell[0].string
        list = []
        for data in data_stats:
            # iterating through data_stat to grab values
            filter = {"data-stat": data}
            cell = rows[3].findAll("td", filter)
            value = cell[0].string
            list.append(value)
        dict[nameval] = list
        dict[nameval].append(year)

# conversion to numeric values and creating dataframe
columns = [
    "player",
    "age",
    "team_id",
    "pos",
    "games_played",
    "goals",
    "assists",
    "points",
    "plus_minus",
    "pen_min",
    "ps",
    "goals_ev",
    "goals_pp",
    "goals_sh",
    "goals_gw",
    "assists_ev",
    "assists_pp",
    "assists_sh",
    "shots",
    "shot_pct",
    "time_on_ice",
    "time_on_ice_avg",
    "blocks",
    "hits",
    "faceoff_wins",
    "faceoff_losses",
    "faceoff_percentage",
    "year"
]
df = pd.DataFrame.from_dict(dict, orient='index', columns=columns)
cols = df.columns.drop(['player', 'team_id', 'pos', 'year'])
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
print(df)
Output
Craig Adams Craig Adams 32 ... 43.9 2010
Luke Adam Luke Adam 22 ... 100.0 2013
Justin Abdelkader Justin Abdelkader 29 ... 29.4 2017
Will Acton Will Acton 27 ... 50.0 2015
Noel Acciari Noel Acciari 24 ... 44.1 2016
Pontus Aberg Pontus Aberg 25 ... 10.5 2019
[6 rows x 28 columns]
I'd just use pandas' .read_html(); it does the hard work of parsing tables for you (it uses BeautifulSoup under the hood).
Code:
import pandas as pd

result = pd.DataFrame()
for i in range(2010, 2020):
    print(i)
    year = str(i)
    url = 'https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html'
    # source = requests.get('https://www.hockey-reference.com/leagues/NHL_' + year + '_skaters.html').text
    df = pd.read_html(url, header=1)[0]
    df['year'] = year
    result = result.append(df, sort=False)

result = result[~result['Age'].str.contains("Age")]
result = result.reset_index(drop=True)
You can then save to file with result.to_csv('filename.csv',index=False)
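One caveat: DataFrame.append was removed in pandas 2.0, so on newer versions the same loop is usually written by collecting frames in a list and concatenating once. A sketch of the equivalent logic:

import pandas as pd

frames = []
for i in range(2010, 2020):
    url = 'https://www.hockey-reference.com/leagues/NHL_' + str(i) + '_skaters.html'
    df = pd.read_html(url, header=1)[0]
    df['year'] = str(i)
    frames.append(df)

result = pd.concat(frames, ignore_index=True, sort=False)
result = result[~result['Age'].str.contains("Age")]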
Output:
print (result)
Rk Player Age Tm Pos GP ... BLK HIT FOW FOL FO% year
0 1 Justin Abdelkader 22 DET LW 50 ... 20 152 148 170 46.5 2010
1 2 Craig Adams 32 PIT RW 82 ... 58 193 243 311 43.9 2010
2 3 Maxim Afinogenov 30 ATL RW 82 ... 21 32 1 2 33.3 2010
3 4 Andrew Alberts 28 TOT D 76 ... 88 216 0 1 0.0 2010
4 4 Andrew Alberts 28 CAR D 62 ... 67 172 0 0 NaN 2010
5 4 Andrew Alberts 28 VAN D 14 ... 21 44 0 1 0.0 2010
6 5 Daniel Alfredsson 37 OTT RW 70 ... 36 41 14 25 35.9 2010
7 6 Bryan Allen 29 FLA D 74 ... 137 120 0 0 NaN 2010
8 7 Cody Almond 20 MIN C 7 ... 5 7 18 12 60.0 2010
9 8 Karl Alzner 21 WSH D 21 ... 21 15 0 0 NaN 2010
10 9 Artem Anisimov 21 NYR C 82 ... 41 45 310 380 44.9 2010
11 10 Nik Antropov 29 ATL C 76 ... 35 82 481 627 43.4 2010
12 11 Colby Armstrong 27 ATL RW 79 ... 29 74 10 10 50.0 2010
13 12 Derek Armstrong 36 STL C 6 ... 0 4 7 8 46.7 2010
14 13 Jason Arnott 35 NSH C 63 ... 17 24 526 551 48.8 2010
15 14 Dean Arsene 29 EDM D 13 ... 13 18 0 0 NaN 2010
16 15 Evgeny Artyukhin 26 TOT RW 54 ... 10 127 1 1 50.0 2010
17 15 Evgeny Artyukhin 26 ANA RW 37 ... 8 90 0 1 0.0 2010
18 15 Evgeny Artyukhin 26 ATL RW 17 ... 2 37 1 0 100.0 2010
19 16 Arron Asham 31 PHI RW 72 ... 16 92 2 11 15.4 2010
20 17 Adrian Aucoin 36 PHX D 82 ... 67 131 1 0 100.0 2010
21 18 Keith Aucoin 31 WSH C 9 ... 0 2 31 25 55.4 2010
22 19 Sean Avery 29 NYR C 69 ... 17 145 4 10 28.6 2010
23 20 David Backes 25 STL RW 79 ... 60 266 504 561 47.3 2010
24 21 Mikael Backlund 20 CGY C 23 ... 4 12 100 86 53.8 2010
25 22 Nicklas Backstrom 22 WSH C 82 ... 61 90 657 660 49.9 2010
26 23 Josh Bailey 20 NYI C 73 ... 36 67 171 255 40.1 2010
27 24 Keith Ballard 27 FLA D 82 ... 201 156 0 0 NaN 2010
28 25 Krys Barch 29 DAL RW 63 ... 13 120 0 3 0.0 2010
29 26 Cam Barker 23 TOT D 70 ... 53 75 0 0 NaN 2010
... ... .. ... .. .. ... ... ... ... ... ... ...
10251 885 Chris Wideman 29 TOT D 25 ... 26 35 0 0 NaN 2019
10252 885 Chris Wideman 29 OTT D 19 ... 25 26 0 0 NaN 2019
10253 885 Chris Wideman 29 EDM D 5 ... 1 7 0 0 NaN 2019
10254 885 Chris Wideman 29 FLA D 1 ... 0 2 0 0 NaN 2019
10255 886 Justin Williams 37 CAR RW 82 ... 32 55 92 150 38.0 2019
10256 887 Colin Wilson 29 COL C 65 ... 31 55 20 32 38.5 2019
10257 888 Garrett Wilson 27 PIT LW 50 ... 16 114 3 4 42.9 2019
10258 889 Scott Wilson 26 BUF C 15 ... 2 29 1 2 33.3 2019
10259 890 Tom Wilson 24 WSH RW 63 ... 52 200 29 24 54.7 2019
10260 891 Luke Witkowski 28 DET D 34 ... 27 67 0 0 NaN 2019
10261 892 Christian Wolanin 23 OTT D 30 ... 31 11 0 0 NaN 2019
10262 893 Miles Wood 23 NJD LW 63 ... 27 97 0 2 0.0 2019
10263 894 Egor Yakovlev 27 NJD D 25 ... 22 12 0 0 NaN 2019
10264 895 Kailer Yamamoto 20 EDM RW 17 ... 11 18 0 0 NaN 2019
10265 896 Keith Yandle 32 FLA D 82 ... 76 47 0 0 NaN 2019
10266 897 Pavel Zacha 21 NJD C 61 ... 24 68 348 364 48.9 2019
10267 898 Filip Zadina 19 DET RW 9 ... 3 6 3 3 50.0 2019
10268 899 Nikita Zadorov 23 COL D 70 ... 67 228 0 0 NaN 2019
10269 900 Nikita Zaitsev 27 TOR D 81 ... 151 139 0 0 NaN 2019
10270 901 Travis Zajac 33 NJD C 80 ... 38 66 841 605 58.2 2019
10271 902 Jakub Zboril 21 BOS D 2 ... 0 3 0 0 NaN 2019
10272 903 Mika Zibanejad 25 NYR C 82 ... 66 134 830 842 49.6 2019
10273 904 Mats Zuccarello 31 TOT LW 48 ... 43 57 10 20 33.3 2019
10274 904 Mats Zuccarello 31 NYR LW 46 ... 42 57 10 20 33.3 2019
10275 904 Mats Zuccarello 31 DAL LW 2 ... 1 0 0 0 NaN 2019
10276 905 Jason Zucker 27 MIN LW 81 ... 38 87 2 11 15.4 2019
10277 906 Valentin Zykov 23 TOT LW 28 ... 6 26 2 7 22.2 2019
10278 906 Valentin Zykov 23 CAR LW 13 ... 2 6 2 6 25.0 2019
10279 906 Valentin Zykov 23 VEG LW 10 ... 3 18 0 1 0.0 2019
10280 906 Valentin Zykov 23 EDM LW 5 ... 1 2 0 0 NaN 2019
[10281 rows x 29 columns]
Scraping heavily formatted tables is positively painful with Beautiful Soup (not to bash Beautiful Soup; it's wonderful for several use cases). There's a bit of a 'hack' I use for scraping data surrounded by dense markup, if you're willing to be a bit utilitarian about it:
1. Select entire table on web page
2. Copy + paste into Evernote (simplifies and reformats the HTML)
3. Copy + paste from Evernote to Excel or another spreadsheet software (removes the HTML)
4. Save as .csv
It isn't perfect. There will be blank lines in the CSV, but blank lines are easier and far less time-consuming to remove than such data is to scrape. Good luck!
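If you do go this route, the blank-line cleanup is a one-liner in pandas (a sketch; the filename is an assumption):

import pandas as pd

df = pd.read_csv('pasted_table.csv')  # the .csv exported from the spreadsheet step
df = df.dropna(how='all').reset_index(drop=True)  # drop the all-blank rows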
I want to sort my dataframe in descending order by "Total Confirmed cases".
My Code
high_cases_sorted_df = df.sort_values(by='Total Confirmed cases',ascending=False)
print(high_cases_sorted_df)
Output
state Total Confirmed cases
19 Maharashtra 8590
14 Jharkhand 82
24 Puducherry 8
9 Goa 7
32 West Bengal 697
13 Jammu and Kashmir 546
15 Karnataka 512
30 Uttarakhand 51
16 Kerala 481
6 Chandigarh 40
12 Himachal Pradesh 40
7 Chhattisgarh 37
4 Assam 36
10 Gujarat 3548
5 Bihar 345
1 Andaman and Nicobar Islands 33
25 Punjab 313
8 Delhi 3108
11 Haryana 296
26 Rajasthan 2262
18 Madhya Pradesh 2168
17 Ladakh 20
20 Manipur 2
29 Tripura 2
31 Uttar Pradesh 1955
I don't know why it comes out like this; it should be (1. Maharashtra, 2. Gujarat, 3. Delhi, etc.).
The column contains strings, so the sort is lexicographic rather than numeric. Simply convert that column to integer first:
df['Total Confirmed cases'] = df['Total Confirmed cases'].astype(int)
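Putting it together (a sketch, using the column name from the printed output):

df['Total Confirmed cases'] = df['Total Confirmed cases'].astype(int)
high_cases_sorted_df = df.sort_values(by='Total Confirmed cases', ascending=False)
print(high_cases_sorted_df)  # now Maharashtra, Gujarat, Delhi, ... as expected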
Assume that I have a dataframe like this:
Date Artist percent_gray percent_blue percent_black percent_red
33 Leonardo 22 33 36 46
45 Leonardo 23 47 23 14
46 Leonardo 13 34 33 12
23 Michelangelo 28 19 38 25
25 Michelangelo 24 56 55 13
26 Michelangelo 21 22 45 13
13 Titian 24 17 23 22
16 Titian 45 43 44 13
19 Titian 17 45 56 13
24 Raphael 34 34 34 45
27 Raphael 31 22 25 67
I want to get the maximum color difference across different pictures by the same artist. I can also compare percent_gray with percent_blue, e.g. for Leonardo the biggest difference is percent_red (date 46) - percent_blue (date 45) = 12 - 47 = -35. I want to see how it evolves over time, so I only want to compare newer pictures of the same artist with older ones (in this case I can compare the third picture with the first and second ones, and the second picture only with the first one) and get the maximum differences. So the dataframe should look like
Date Artist max_d
33 Leonardo NaN
45 Leonardo -32
46 Leonardo -35
23 Michelangelo NaN
25 Michelangelo 37
26 Michelangelo -43
13 Titian NaN
16 Titian 28
19 Titian 43
24 Raphael NaN
27 Raphael 33
I think I have to use groupby but couldn't manage to get the output I want.
You can compare each row with the artist's previous row: the extreme signed difference is either (current min - previous max) or (current max - previous min), whichever is larger in absolute value:
import numpy as np
import pandas as pd

# first sort in real data
df = df.sort_values(['Artist', 'Date'])
mi = df.iloc[:,2:].min(axis=1)
ma = df.iloc[:,2:].max(axis=1)
ma1 = ma.groupby(df['Artist']).shift()
mi1 = mi.groupby(df['Artist']).shift()
mad1 = mi - ma1
mad2 = ma - mi1
df['max_d'] = np.where(mad1.abs() > mad2.abs(), mad1, mad2)
print (df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
Explanation (with new columns):
#get min and max per rows
df['min'] = df.iloc[:,2:].min(axis=1)
df['max'] = df.iloc[:,2:].max(axis=1)
#get shifted min and max by Artist
df['max1'] = df.groupby('Artist')['max'].shift()
df['min1'] = df.groupby('Artist')['min'].shift()
#get differences
df['max_d1'] = df['min'] - df['max1']
df['max_d2'] = df['max'] - df['min1']
#if else of absolute values
df['max_d'] = np.where(df['max_d1'].abs() > df['max_d2'].abs(), df['max_d1'], df['max_d2'])
print (df)
percent_red min max max1 min1 max_d1 max_d2 max_d
0 46 22 46 NaN NaN NaN NaN NaN
1 14 14 47 46.0 22.0 -32.0 25.0 -32.0
2 12 12 34 47.0 14.0 -35.0 20.0 -35.0
3 25 19 38 NaN NaN NaN NaN NaN
4 13 13 56 38.0 19.0 -25.0 37.0 37.0
5 13 13 45 56.0 13.0 -43.0 32.0 -43.0
6 22 17 24 NaN NaN NaN NaN NaN
7 13 13 45 24.0 17.0 -11.0 28.0 28.0
8 13 13 56 45.0 13.0 -32.0 43.0 43.0
9 45 34 45 NaN NaN NaN NaN NaN
10 67 22 67 45.0 34.0 -23.0 33.0 33.0
And if you use the second (explanation) solution, remove the helper columns:
df = df.drop(['min','max','max1','min1','max_d1', 'max_d2'], axis=1)
print (df)
Date Artist percent_gray percent_blue percent_black \
0 33 Leonardo 22 33 36
1 45 Leonardo 23 47 23
2 46 Leonardo 13 34 33
3 23 Michelangelo 28 19 38
4 25 Michelangelo 24 56 55
5 26 Michelangelo 21 22 45
6 13 Titian 24 17 23
7 16 Titian 45 43 44
8 19 Titian 17 45 56
9 24 Raphael 34 34 34
10 27 Raphael 31 22 25
percent_red max_d
0 46 NaN
1 14 -32.0
2 12 -35.0
3 25 NaN
4 13 37.0
5 13 -43.0
6 22 NaN
7 13 28.0
8 13 43.0
9 45 NaN
10 67 33.0
How about a custom apply function? Does this work?
from operator import itemgetter
import pandas
import itertools

p = pandas.read_csv('Artits.tsv', sep='\s+')

def diff(x):
    return x

def max_any_color(cols):
    grey = []
    blue = []
    black = []
    red = []
    for row in cols.iterrows():
        date = row[1]['Date']
        grey.append(row[1]['percent_gray'])
        blue.append(row[1]['percent_blue'])
        black.append(row[1]['percent_black'])
        red.append(row[1]['percent_red'])
    gb = max([abs(a[0] - a[1]) for a in itertools.product(grey, blue)])
    gblack = max([abs(a[0] - a[1]) for a in itertools.product(grey, black)])
    gr = max([abs(a[0] - a[1]) for a in itertools.product(grey, red)])
    bb = max([abs(a[0] - a[1]) for a in itertools.product(blue, black)])
    br = max([abs(a[0] - a[1]) for a in itertools.product(blue, red)])
    blackr = max([abs(a[0] - a[1]) for a in itertools.product(black, red)])
    l = [gb, gblack, gr, bb, br, blackr]
    c = ['grey/blue', 'grey/black', 'grey/red', 'blue/black', 'blue/red', 'black/red']
    max_ = max(l)
    between_colors_index = l.index(max_)
    return c[between_colors_index], max_

p.groupby('Artist').apply(lambda x: max_any_color(x))
Output:
Leonardo (blue/red, 35)
Michelangelo (blue/red, 43)
Raphael (blue/red, 45)
Titian (black/red, 43)
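If you want the color pair and the value as separate columns rather than one tuple per artist, a small hedged follow-up (assuming the same p frame as above):

res = p.groupby('Artist').apply(lambda x: max_any_color(x))
# expand the (colors, max_diff) tuples into two named columns
out = pandas.DataFrame(res.tolist(), index=res.index, columns=['colors', 'max_diff'])
print(out)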