I am trying to read a JSON file using pandas:
import pandas as pd
df = pd.read_json('https://data.gov.in/node/305681/datastore/export/json')
I get ValueError: arrays must all be same length
Some other JSON pages show this error:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
How can I read the values anyway? I am not too particular about data validity.
Looking at the JSON, it is valid, but it is nested with data and fields:
import json
import requests
In [11]: d = json.loads(requests.get('https://data.gov.in/node/305681/datastore/export/json').text)
In [12]: list(d.keys())
Out[12]: ['data', 'fields']
You want the data as the content, and fields as the column names:
In [13]: pd.DataFrame(d["data"], columns=[x["label"] for x in d["fields"]])
Out[13]:
S. No. States/UTs 2008-09 2009-10 2010-11 2011-12 2012-13
0 1 Andhra Pradesh 183446.36 193958.45 201277.09 212103.27 222973.83
1 2 Arunachal Pradesh 360.5 380.15 407.42 419 438.69
2 3 Assam 4658.93 4671.22 4707.31 4705 4709.58
3 4 Bihar 10740.43 11001.77 7446.08 7552 8371.86
4 5 Chhattisgarh 9737.92 10520.01 12454.34 12984.44 13704.06
5 6 Goa 148.61 148 149 149.45 457.87
6 7 Gujarat 12675.35 12761.98 13269.23 14269.19 14558.39
7 8 Haryana 38149.81 38453.06 39644.17 41141.91 42342.66
8 9 Himachal Pradesh 977.3 1000.26 1020.62 1049.66 1069.39
9 10 Jammu and Kashmir 7208.26 7242.01 7725.19 6519.8 6715.41
10 11 Jharkhand 3994.77 3924.73 4153.16 4313.22 4238.95
11 12 Karnataka 23687.61 29094.3 30674.18 34698.77 36773.33
12 13 Kerala 15094.54 16329.52 16856.02 17048.89 22375.28
13 14 Madhya Pradesh 6712.6 7075.48 7577.23 7971.53 8710.78
14 15 Maharashtra 35502.28 38640.12 42245.1 43860.99 45661.07
15 16 Manipur 1105.25 1119 1137.05 1149.17 1162.19
16 17 Meghalaya 994.52 999.47 1010.77 1021.14 1028.18
17 18 Mizoram 411.14 370.92 387.32 349.33 352.02
18 19 Nagaland 831.92 833.5 802.03 703.65 617.98
19 20 Odisha 19940.15 23193.01 23570.78 23006.87 23229.84
20 21 Punjab 36789.7 32828.13 35449.01 36030 37911.01
21 22 Rajasthan 6449.17 6713.38 6696.92 9605.43 10334.9
22 23 Sikkim 136.51 136.07 139.83 146.24 146
23 24 Tamil Nadu 88097.59 108475.73 115137.14 118518.45 119333.55
24 25 Tripura 1388.41 1442.39 1569.45 1650 1565.17
25 26 Uttar Pradesh 10139.8 10596.17 10990.72 16075.42 17073.67
26 27 Uttarakhand 1961.81 2535.77 2613.81 2711.96 3079.14
27 28 West Bengal 33055.7 36977.96 39939.32 43432.71 47114.91
28 29 Andaman and Nicobar Islands 617.58 657.44 671.78 780 741.32
29 30 Chandigarh 272.88 248.53 180.06 180.56 170.27
30 31 Dadra and Nagar Haveli 70.66 70.71 70.28 73 73
31 32 Daman and Diu 18.83 18.9 18.81 19.67 20
32 33 Delhi 1.17 1.17 1.17 1.23 NA
33 34 Lakshadweep 134.64 138.22 137.98 139.86 139.99
34 35 Puducherry 111.69 112.84 113.53 116 112.89
See also json_normalize for more complex JSON-to-DataFrame extraction.
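For instance, when each record itself contains nested dicts, pd.json_normalize flattens them into dotted column names. A minimal sketch with made-up records (not the data.gov.in payload):
import pandas as pd

records = [
    {"state": "Goa", "totals": {"2008-09": 148.61, "2009-10": 148.0}},
    {"state": "Sikkim", "totals": {"2008-09": 136.51, "2009-10": 136.07}},
]
flat = pd.json_normalize(records)
print(flat.columns.tolist())
# ['state', 'totals.2008-09', 'totals.2009-10']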
The following listed both the key and value pairs for me:
import json
import pandas as pd
import requests

# parse the API response into a plain dict first
repo = json.loads(requests.get('https://api.github.com/repos/akkhil2012/MachineLearning').text)
# one row per key, with the value in a single column
data = pd.DataFrame.from_dict(repo, orient='index')
print(data)
Related
I am currently working on a project and want to get the table from a website, either via an API or by web scraping.
I have the following code:
import requests
import pandas as pd
import numpy as np
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
resp = requests.get(url)
tables = pd.read_html(resp.text)
all_df = pd.concat(tables)
data= pd.DataFrame(all_df)
But I got the error message "No tables found", even though the page has a table that can also be downloaded as CSV.
Anyone know what the problem is?
With some help from Selenium before calling read_html:
#https://selenium-python.readthedocs.io/installation.html
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
import pandas as pd
s = Service("./chromedriver.exe")
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
with webdriver.Chrome(service=s) as driver:
    driver.get(url)
    df = pd.concat(pd.read_html(driver.page_source))
Output:
print(df)
State Circumcision Rate
0 West Virginia 87%
1 Michigan 86%
2 Kentucky 85%
3 Nebraska 84%
4 Ohio 84%
.. ... ...
45 Alaska 0%
46 Arizona 0%
47 Delaware 0%
48 Idaho 0%
49 Mississippi 0%
[50 rows x 2 columns]
Here is one way of getting that data as a dataframe:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import json
url = 'https://worldpopulationreview.com/state-rankings/circumcision-rates-by-state'
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get(url, headers=headers).text, 'html.parser')
script_w_data = soup.select_one('script[id="__NEXT_DATA__"]').text
df = pd.json_normalize(json.loads(script_w_data)['props']['pageProps']['listing'])
print(df)
Result in terminal:
fips state densityMi pop2023 pop2022 pop2020 pop2019 pop2010 growthRate growth growthSince2010 circumcisionRate
0 54 West Virginia 73.88019 1775932 1781860 1793716 1799642 1852994 -0.00333 -5928 -0.04159 0.87
1 26 Michigan 179.26454 10135438 10116069 10077331 10057961 9883640 0.00191 19369 0.02548 0.86
2 21 Kentucky 115.37702 4555777 4539130 4505836 4489190 4339367 0.00367 16647 0.04987 0.85
3 31 Nebraska 26.06024 2002052 1988536 1961504 1947985 1826341 0.00680 13516 0.09621 0.84
4 39 Ohio 290.70091 11878330 11852036 11799448 11773150 11536504 0.00222 26294 0.02963 0.84
5 18 Indiana 191.92896 6876047 6845874 6785528 6755359 6483802 0.00441 30173 0.06050 0.83
6 19 Iowa 57.89018 3233572 3219171 3190369 3175964 3046355 0.00447 14401 0.06146 0.82
7 55 Wisconsin 109.96966 5955737 5935064 5893718 5873043 5686986 0.00348 20673 0.04726 0.82
8 45 South Carolina 175.18855 5266343 5217037 5118425 5069118 4625364 0.00945 49306 0.13858 0.81
9 42 Pennsylvania 292.62222 13092796 13062764 13002700 12972667 12702379 0.00230 30032 0.03074 0.79
10 56 Wyoming 5.98207 580817 579495 576851 575524 563626 0.00228 1322 0.03050 0.79
11 15 Hawaii 231.00763 1483762 1474265 1455271 1445774 1360301 0.00644 9497 0.09076 0.78
12 20 Kansas 36.24443 2963308 2954832 2937880 2929402 2853118 0.00287 8476 0.03862 0.77
13 38 North Dakota 11.75409 811044 800394 779094 768441 672591 0.01331 10650 0.20585 0.77
14 40 Oklahoma 58.63041 4021753 4000953 3959353 3938551 3751351 0.00520 20800 0.07208 0.77
15 46 South Dakota 11.98261 908414 901165 886667 879421 814180 0.00804 7249 0.11574 0.77
16 29 Missouri 90.26083 6204710 6188111 6154913 6138318 5988927 0.00268 16599 0.03603 0.76
17 33 New Hampshire 155.90830 1395847 1389741 1377529 1371424 1316470 0.00439 6106 0.06030 0.76
18 44 Rhode Island 1074.29594 1110822 1106341 1097379 1092896 1052567 0.00405 4481 0.05535 0.76
19 47 Tennessee 171.70515 7080262 7023788 6910840 6854371 6346105 0.00804 56474 0.11569 0.76
20 51 Virginia 223.36045 8820504 8757467 8631393 8568357 8001024 0.00720 63037 0.10242 0.74
21 13 Georgia 191.59470 11019186 10916760 10711908 10609487 9687653 0.00938 102426 0.13745 0.72
22 24 Maryland 648.84362 6298325 6257958 6177224 6136855 5773552 0.00645 40367 0.09089 0.72
23 9 Connecticut 746.69537 3615499 3612314 3605944 3602762 3574097 0.00088 3185 0.01158 0.71
24 23 Maine 44.50148 1372559 1369159 1362359 1358961 1328361 0.00248 3400 0.03327 0.67
25 5 Arkansas 58.42619 3040207 3030646 3011524 3001967 2915918 0.00315 9561 0.04262 0.66
26 8 Colorado 57.86332 5997070 5922618 5773714 5699264 5029196 0.01257 74452 0.19245 0.66
27 25 Massachusetts 919.82103 7174604 7126375 7029917 6981690 6547629 0.00677 48229 0.09576 0.66
28 34 New Jersey 1283.40005 9438124 9388414 9288994 9239284 8791894 0.00529 49710 0.07350 0.66
29 50 Vermont 70.33514 648279 646545 643077 641347 625741 0.00268 1734 0.03602 0.64
30 17 Illinois 230.67908 12807072 12808884 12812508 12814324 12830632 -0.00014 -1812 -0.00184 0.63
31 27 Minnesota 73.18202 5827265 5787008 5706494 5666238 5303925 0.00696 40257 0.09867 0.63
32 36 New York 433.90472 20448194 20365879 20201249 20118937 19378102 0.00404 82315 0.05522 0.59
33 37 North Carolina 220.30026 10710558 10620168 10439388 10348993 9535483 0.00851 90390 0.12323 0.52
34 30 Montana 7.64479 1112668 1103187 1084225 1074744 989415 0.00859 9481 0.12457 0.50
35 48 Texas 116.16298 30345487 29945493 29145505 28745507 25145561 0.01336 399994 0.20679 0.50
36 35 New Mexico 17.60148 2135024 2129190 2117522 2111685 2059179 0.00274 5834 0.03683 0.49
37 22 Louisiana 108.67214 4695071 4682633 4657757 4645314 4533372 0.00266 12438 0.03567 0.45
38 49 Utah 41.66892 3423935 3373162 3271616 3220842 2763885 0.01505 50773 0.23881 0.42
39 12 Florida 416.95573 22359251 22085563 21538187 21264502 18801310 0.01239 273688 0.18924 0.35
40 41 Oregon 45.41307 4359110 4318492 4237256 4196636 3831074 0.00941 40618 0.13783 0.24
41 6 California 258.20877 40223504 39995077 39538223 39309799 37253956 0.00571 228427 0.07971 0.22
42 1 Alabama 100.65438 5097641 5073187 5024279 4999822 4779736 0.00482 24454 0.06651 0.20
43 53 Washington 120.37292 7999503 7901429 7705281 7607206 6724540 0.01241 98074 0.18960 0.15
44 32 Nevada 29.38425 3225832 3185426 3104614 3064205 2700551 0.01268 40406 0.19451 0.12
45 2 Alaska 1.29738 740339 738023 733391 731075 710231 0.00314 2316 0.04239 NaN
46 4 Arizona 64.96246 7379346 7303398 7151502 7075549 6392017 0.01040 75948 0.15446 NaN
47 10 Delaware 522.08876 1017551 1008350 989948 980743 897934 0.00912 9201 0.13321 NaN
48 16 Idaho 23.23926 1920562 1893410 1839106 1811950 1567582 0.01434 27152 0.22517 NaN
49 28 Mississippi 63.07084 2959473 2960075 2961279 2961879 2967297 -0.00020 -602 -0.00264 NaN
I am trying to scrape the table from:
https://worldpopulationreview.com/states
My code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://worldpopulationreview.com/states'
page = requests.get(url)
soup = BeautifulSoup(page.text,'lxml')
table = soup.find('table', {'class': 'jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow'})
headers = []
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)
df = pd.DataFrame(columns=headers)
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    length = len(df)
    df.loc[length] = row_data
df
Currently returns
'NoneType' object has no attribute 'find_all'
Clearly the error is because the table variable is returning nothing, but I believe I have the table tag correct.
The table data is dynamically loaded by JavaScript, and bs4 can't render JS. You can still do the job with bs4 by pairing it with an automation tool such as Selenium and then grabbing the table as a pandas DataFrame.
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
driver.get('https://worldpopulationreview.com/states')
driver.maximize_window()
time.sleep(8)
soup = BeautifulSoup(driver.page_source,"lxml")
#You can pull the table directly from the web page
df = pd.read_html(str(soup))[0]
print(df)
#OR
#table= soup.select_one('table[class="jsx-a3119e4553b2cac7 table is-striped is-hoverable is-fullwidth tp-table-body is-narrow"]')
# df = pd.read_html(str(table))[0]
# print(df)
Output:
Rank State 2022 Population Growth Rate ... 2010 Population Growth Since 2010 % of US Density (/mi²)
0 1 California 39995077 0.57% ... 37253956 7.36% 11.93% 257
1 2 Texas 29945493 1.35% ... 25145561 19.09% 8.93% 115
2 3 Florida 22085563 1.25% ... 18801310 17.47% 6.59% 412
3 4 New York 20365879 0.41% ... 19378102 5.10% 6.07% 432
4 5 Pennsylvania 13062764 0.23% ... 12702379 2.84% 3.90% 292
5 6 Illinois 12808884 -0.01% ... 12830632 -0.17% 3.82% 231
6 7 Ohio 11852036 0.22% ... 11536504 2.74% 3.53% 290
7 8 Georgia 10916760 0.95% ... 9687653 12.69% 3.26% 190
8 9 North Carolina 10620168 0.86% ... 9535483 11.38% 3.17% 218
9 10 Michigan 10116069 0.19% ... 9883640 2.35% 3.02% 179
10 11 New Jersey 9388414 0.53% ... 8791894 6.78% 2.80% 1277
11 12 Virginia 8757467 0.73% ... 8001024 9.45% 2.61% 222
12 13 Washington 7901429 1.26% ... 6724540 17.50% 2.36% 119
13 14 Arizona 7303398 1.05% ... 6392017 14.26% 2.18% 64
14 15 Massachusetts 7126375 0.68% ... 6547629 8.84% 2.13% 914
15 16 Tennessee 7023788 0.81% ... 6346105 10.68% 2.09% 170
16 17 Indiana 6845874 0.44% ... 6483802 5.58% 2.04% 191
17 18 Maryland 6257958 0.65% ... 5773552 8.39% 1.87% 645
18 19 Missouri 6188111 0.27% ... 5988927 3.33% 1.85% 90
19 20 Wisconsin 5935064 0.35% ... 5686986 4.36% 1.77% 110
20 21 Colorado 5922618 1.27% ... 5029196 17.76% 1.77% 57
21 22 Minnesota 5787008 0.70% ... 5303925 9.11% 1.73% 73
22 23 South Carolina 5217037 0.95% ... 4625364 12.79% 1.56% 174
23 24 Alabama 5073187 0.48% ... 4779736 6.14% 1.51% 100
24 25 Louisiana 4682633 0.27% ... 4533372 3.29% 1.40% 108
25 26 Kentucky 4539130 0.37% ... 4339367 4.60% 1.35% 115
26 27 Oregon 4318492 0.95% ... 3831074 12.72% 1.29% 45
27 28 Oklahoma 4000953 0.52% ... 3751351 6.65% 1.19% 58
28 29 Connecticut 3612314 0.09% ... 3574097 1.07% 1.08% 746
29 30 Utah 3373162 1.53% ... 2763885 22.04% 1.01% 41
30 31 Iowa 3219171 0.45% ... 3046355 5.67% 0.96% 58
31 32 Nevada 3185426 1.28% ... 2700551 17.95% 0.95% 29
32 33 Arkansas 3030646 0.32% ... 2915918 3.93% 0.90% 58
33 34 Mississippi 2960075 -0.02% ... 2967297 -0.24% 0.88% 63
34 35 Kansas 2954832 0.29% ... 2853118 3.57% 0.88% 36
35 36 New Mexico 2129190 0.27% ... 2059179 3.40% 0.64% 18
36 37 Nebraska 1988536 0.68% ... 1826341 8.88% 0.59% 26
37 38 Idaho 1893410 1.45% ... 1567582 20.79% 0.56% 23
38 39 West Virginia 1781860 -0.33% ... 1852994 -3.84% 0.53% 74
39 40 Hawaii 1474265 0.65% ... 1360301 8.38% 0.44% 230
40 41 New Hampshire 1389741 0.44% ... 1316470 5.57% 0.41% 155
41 42 Maine 1369159 0.25% ... 1328361 3.07% 0.41% 44
42 43 Rhode Island 1106341 0.41% ... 1052567 5.11% 0.33% 1070
43 44 Montana 1103187 0.87% ... 989415 11.50% 0.33% 8
44 45 Delaware 1008350 0.92% ... 897934 12.30% 0.30% 517
45 46 South Dakota 901165 0.81% ... 814180 10.68% 0.27% 12
46 47 North Dakota 800394 1.35% ... 672591 19.00% 0.24% 12
47 48 Alaska 738023 0.31% ... 710231 3.91% 0.22% 1
48 49 Vermont 646545 0.27% ... 625741 3.32% 0.19% 70
49 50 Wyoming 579495 0.23% ... 563626 2.82% 0.17% 6
[50 rows x 9 columns]
The table is rendered dynamically from JSON that is placed at the end of the source code, so it does not need Selenium; simply extract the tag and load the JSON. This also includes all the additional information from the page:
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
Example
import requests, json
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://worldpopulationreview.com/states').text)
pd.DataFrame(
    json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
)
Example
Because there is also additional information that is used for the map, simply choose the columns you need by header; see the short snippet after the table below.
    fips          state  densityMi   pop2022   pop2021   pop2020   pop2019   pop2010  growthRate  growth  growthSince2010    area     fill          Name  rank
0      6     California    256.742  39995077  39766650  39538223  39309799  37253956  0.00574419  228427        0.0735793  155779  #084594    California     1
1     48          Texas    114.632  29945493  29545499  29145505  28745507  25145561   0.0135382  399994         0.190886  261232  #084594         Texas     2
2     12        Florida    411.852  22085563  21811875  21538187  21264502  18801310   0.0125477  273688         0.174682   53625  #084594       Florida     3
3     36       New York    432.158  20365879  20283564  20201249  20118937  19378102  0.00405821   82315        0.0509739   47126  #084594      New York     4
4     42   Pennsylvania    291.951  13062764  13032732  13002700  12972667  12702379  0.00230435   30032        0.0283715   44743  #2171b5  Pennsylvania     5
..   ...            ...        ...       ...       ...       ...       ...       ...         ...     ...              ...     ...      ...           ...   ...
45    46   South Dakota     11.887    901165    893916    886667    879421    814180  0.00810926    7249         0.106838   75811  #c6dbef  South Dakota    46
46    38   North Dakota    11.5997    800394    789744    779094    768441    672591   0.0134854   10650         0.190016   69001  #c6dbef  North Dakota    47
47     2         Alaska    1.29332    738023    735707    733391    731075    710231  0.00314799    2316        0.0391309  570641  #c6dbef        Alaska    48
48    50        Vermont     70.147    646545    644811    643077    641347    625741  0.00268916    1734         0.033247    9217  #c6dbef       Vermont    49
49    56        Wyoming    5.96845    579495    578173    576851    575524    563626  0.00228651    1322        0.0281552   97093  #c6dbef       Wyoming    50
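For example, to keep just a few of those columns by header (a minimal sketch, assuming the DataFrame built in the example above is stored in a variable df):
df = pd.DataFrame(
    json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['data']
)
# pick columns of interest by header name
print(df[['state', 'pop2022', 'densityMi', 'growthRate']].head())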
Here is my dataframe:
Boston
Zipcode Employees Latitude Longitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
And I want to draw circles on my map; each circle corresponds to one point, and the size of the circle depends on the number of Employees.
Here is my map code (I tried to use markers, but I think circles would be better):
import folium
from folium import plugins

boston_map = folium.Map([Boston['Longitude'].mean(), Boston['Latitude'].mean()], zoom_start=12)
incidents2 = plugins.MarkerCluster().add_to(boston_map)
for Latitude, Longitude, Employees in zip(Boston.Latitude, Boston.Longitude, Boston.Employees):
    folium.Marker(location=[Latitude, Longitude], icon=None, popup=Employees).add_to(incidents2)
boston_map.add_child(incidents2)
boston_map
Here is my map:
If the number of employees could be shown on each circle, that would be even better! Thank you very much!
To draw circles you can use CircleMarker instead of Marker.
BTW: your column names are swapped. Boston has lat: 42.361145, long: -71.057083, but you have values around 42 in the Longitude column and values around -71 in the Latitude column.
Because I don't use Jupyter, I save the map to an HTML file and use webbrowser to open it automatically in a web browser.
Because it created big circles, I divide Employees by 10 to create smaller circles. But now some circles are very small, and the cluster shows a count instead of the circles. Maybe math.log() or some other method should be used to make the values smaller (normalized); see the sketch after the code below.
I use tooltip=str(employees) to display the number when you hover over a circle.
text = '''
Zipcode Employees Longitude Latitude
0 02021 174 -71.131057 42.228065
1 02026 193 -71.143038 42.237719
3 02109 45 -71.054027 42.363498
4 02110 14 -71.053642 42.357649
5 02111 30 -71.060280 42.350586
6 02113 77 -71.054618 42.365215
8 02115 116 -71.095106 42.343330
10 02118 318 -71.072103 42.339342
11 02119 804 -71.085268 42.323002
12 02120 168 -71.097569 42.332539
13 02121 781 -71.086649 42.305792
15 02124 1938 -71.066702 42.281721
16 02125 859 -71.053049 42.310813
17 02126 882 -71.090424 42.272444
19 02128 786 -71.016037 42.375254
21 02130 886 -71.114080 42.309087
22 02131 1222 -71.121464 42.285216
23 02132 1348 -71.168150 42.280316
24 02134 230 -71.123323 42.355355
25 02135 584 -71.147046 42.357537
26 02136 1712 -71.125550 42.255064
28 02152 119 -70.960324 42.351129
29 02163 1 -71.120420 42.367263
30 02186 361 -71.113223 42.258883
31 02199 4 -71.082279 42.346991
32 02210 35 -71.044281 42.347148
33 02215 83 -71.103877 42.348709
34 02459 27 -71.187563 42.286356
35 02467 66 -71.157691 42.314277
'''
import pandas as pd
import io
import folium
import folium.plugins
boston = pd.read_csv(io.StringIO(text), sep=r'\s+')  # raw string avoids an invalid escape sequence warning
boston_map = folium.Map([boston.Latitude.mean(), boston.Longitude.mean(), ], zoom_start=12)
incidents2 = folium.plugins.MarkerCluster().add_to(boston_map)
for latitude, longitude, employees in zip(boston.Latitude, boston.Longitude, boston.Employees):
    print(latitude, longitude, employees)
    folium.vector_layers.CircleMarker(
        location=[latitude, longitude],
        tooltip=str(employees),
        radius=employees/10,
        color='#3186cc',
        fill=True,
        fill_color='#3186cc'
    ).add_to(incidents2)
boston_map.add_child(incidents2)
# display in web browser
import webbrowser
boston_map.save('map.html')
webbrowser.open('map.html')
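If you prefer log scaling over dividing by 10, a minimal sketch of a hypothetical helper (not part of folium) could look like this; math.log1p keeps small counts visible while compressing large ones:
import math

def scaled_radius(employees, scale=3):
    # log1p(1) ~ 0.69 and log1p(1938) ~ 7.57, so radii stay roughly between 2 and 23
    return scale * math.log1p(employees)

# then use radius=scaled_radius(employees) inside the CircleMarker call above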
EDIT: the answer to the question "how to add a label on each circle in a folium circle map python" shows how to use Marker with icon=DivIcon(text) to add text, but it doesn't work as I expected.
It is the first time I am using pandas and I do not really know how to deal with my problem.
In fact I have 2 data frame:
import pandas
blast=pandas.read_table("blast")
cluster=pandas.read_table("cluster")
Here is an example of their contents:
>>> cluster
cluster_name seq_names
0 1 g1.t1_0035
1 1 g1.t1_0035_0042
2 119365 g1.t1_0042
3 90273 g1.t1_0042_0035
4 71567 g10.t1_0035
5 37976 g10.t1_0035_0042
6 22560 g10.t1_0042
7 90280 g10.t1_0042_0035
8 82698 g100.t1_0035
9 47392 g100.t1_0035_0042
10 28484 g100.t1_0042
11 22580 g100.t1_0042_0035
12 19474 g1000.t1_0035
13 5770 g1000.t1_0035_0042
14 29708 g1000.t1_0042
15 99776 g1000.t1_0042_0035
16 6283 g10000.t1_0035
17 39828 g10000.t1_0035_0042
18 25383 g10000.t1_0042
19 106614 g10000.t1_0042_0035
20 6285 g10001.t1_0035
21 13866 g10001.t1_0035_0042
22 121157 g10001.t1_0042
23 106615 g10001.t1_0042_0035
24 6286 g10002.t1_0035
25 113 g10002.t1_0035_0042
26 25397 g10002.t1_0042
27 106616 g10002.t1_0042_0035
28 4643 g10003.t1_0035
29 13868 g10003.t1_0035_0042
... ... ...
[78793 rows x 2 columns]
and
>>> blast
qseqid sseqid pident length mismatch \
0 g1.t1_0035_0042 g1.t1_0035_0042 100.0 286 0
1 g1.t1_0035_0042 g1.t1_0035 100.0 257 0
2 g1.t1_0035_0042 g9307.t1_0035 26.9 134 65
3 g2.t1_0035_0042 g2.t1_0035_0042 100.0 445 0
4 g2.t1_0035_0042 g2.t1_0035 95.8 451 3
5 g2.t1_0035_0042 g24520.t1_0042_0035 61.1 429 137
6 g2.t1_0035_0042 g9924.t1_0042 61.1 429 137
7 g2.t1_0035_0042 g1838.t1_0035 86.2 29 4
8 g3.t1_0035_0042 g3.t1_0035_0042 100.0 719 0
9 g3.t1_0035_0042 g3.t1_0035 84.7 753 62
10 g4.t1_0035_0042 g4.t1_0035_0042 100.0 242 0
11 g4.t1_0035_0042 g3.t1_0035 98.8 161 2
12 g5.t1_0035_0042 g5.t1_0035_0042 100.0 291 0
13 g5.t1_0035_0042 g3.t1_0035 93.1 291 0
14 g6.t1_0035_0042 g6.t1_0035_0042 100.0 152 0
15 g6.t1_0035_0042 g4.t1_0035 100.0 152 0
16 g7.t1_0035_0042 g7.t1_0035_0042 100.0 216 0
17 g7.t1_0035_0042 g5.t1_0035 98.1 160 3
18 g7.t1_0035_0042 g11143.t1_0042 46.5 230 99
19 g7.t1_0035_0042 g27537.t1_0042_0035 40.8 233 111
20 g3778.t1_0035_0042 g3778.t1_0035_0042 100.0 86 0
21 g3778.t1_0035_0042 g6174.t1_0035 98.0 51 1
22 g3778.t1_0035_0042 g20037.t1_0035_0042 100.0 50 0
23 g3778.t1_0035_0042 g37190.t1_0035 100.0 50 0
24 g3778.t1_0035_0042 g15112.t1_0042_0035 66.0 53 18
25 g3778.t1_0035_0042 g6061.t1_0042 66.0 53 18
26 g18109.t1_0035_0042 g18109.t1_0035_0042 100.0 86 0
27 g18109.t1_0035_0042 g33071.t1_0035 100.0 81 0
28 g18109.t1_0035_0042 g32810.t1_0035 96.4 83 3
29 g18109.t1_0035_0042 g17982.t1_0035_0042 98.6 72 1
... ... ... ... ... ...
If you focus on the cluster dataframe, the first column corresponds to the cluster ID, and inside those clusters there are several sequence IDs.
What I need to do is first to split all my clusters (in R it would be something like: liste = split(x = data$V2, f = data$V1)).
And then create a function which displays the most similar pair of sequences within each cluster.
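From what I understand, the pandas equivalent of R's split() would be groupby; a minimal sketch with the cluster dataframe above:
# one list of sequence names per cluster ID
liste = {name: group['seq_names'].tolist()
         for name, group in cluster.groupby('cluster_name')}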
Here is an example:
let's say I have two clusters (dataframe cluster):
cluster 1:
seq1
seq2
seq3
seq4
cluster 2:
seq5
seq6
seq7
...
In the blast dataframe, the 3rd column contains the similarity between all sequences (all against all), so something like:
seq1 vs seq1 100
seq1 vs seq2 90
seq1 vs seq3 56
seq1 vs seq4 49
seq1 vs seq5 40
....
seq2 vs seq3 70
seq2 vs seq4 98
...
seq5 vs seq5 100
seq5 vs seq6 89
seq5 vs seq7 60
seq7 vs seq7 46
seq7 vs seq7 100
seq6 vs seq6 100
and what I need to get is:
cluster 1 (best paired sequences):
seq 1 vs seq 2
cluster2 (best paired sequences):
seq 5 vs seq6
...
So as you can see, I do not want to take into account the sequences paired with themselves.
If someone could give me some clues, it would be fantastic.
Thank you all.
Firstly, I assume that there are no pairings in blast with sequences from two different clusters. In other words: in this solution the cluster ID of a pairing is determined by only one of the two sequence IDs.
Include the cluster information and the pairing information in one dataframe:
data = cluster.merge(blast, left_on='seq_names', right_on='qseqid')
Then the data should only contain pairings of different sequences:
data = data[data['qseqid']!=data['sseqid']]
To ignore pairings whose seqids share the same spec substring (the part after the first underscore), the most readable way is to add columns holding these values:
data['qspec'] = [seqid.split('_')[1] for seqid in data['qseqid'].values]
data['sspec'] = [seqid.split('_')[1] for seqid in data['sseqid'].values]
Now equal spec values can be filtered out the same way as was done with equal seqids above:
data = data[data['qspec']!=data['sspec']]
In the end the data should be grouped by cluster-ID and within each group, the maximum of pident is of interest:
data_grpd = data.groupby('cluster_name')
result = data.loc[data_grpd['pident'].idxmax()]
The only drawback here, apart from the assumption mentioned above, is that if there are several exactly equal max values, only one of them is taken into account.
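If you do want to keep ties, a small variation (same assumptions as above) is to compare each row against its group maximum instead of using idxmax:
# keep every row that reaches its cluster's maximum pident, ties included
max_per_cluster = data.groupby('cluster_name')['pident'].transform('max')
result_with_ties = data[data['pident'] == max_per_cluster]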
Note: if you don't want the spec columns to be of type string, you can easily turn them into integers on the fly (np.int is gone from modern NumPy, so the built-in int is used here):
data['qspec'] = [int(seqid.split('_')[1]) for seqid in data['qseqid'].values]
This filters out any rows with a 100% match, then merges the dataframes, first on sseqid and then on qseqid, and builds results_df. Let me know if this works. You can then order by cluster name (see the snippet below).
import pandas as pd

blast = blast.loc[blast['pident'] != 100]
results_df = cluster.merge(blast, left_on='seq_names', right_on='sseqid')
# DataFrame.append was removed in pandas 2.0, so concatenate instead
results_df = pd.concat([results_df, cluster.merge(blast, left_on='seq_names', right_on='qseqid')])
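For example, to order the result by cluster and, within each cluster, by similarity (a small sketch using the columns from the merged frames above):
results_df = results_df.sort_values(['cluster_name', 'pident'], ascending=[True, False])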
I have a dataframe, grouped, with multiindex columns as below:
import random
import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
'code' : [random.choice(codes) for i in range(1,N+1)],
'colour': [random.choice(colours) for i in range(1,N+1)],
'texture': [random.choice(textures) for i in range(1,N+1)],
'size': [random.randint(1,100) for i in range(1,N+1)],
'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
}, columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])
grouped = df.groupby(['code', 'colour']).agg( {'size': [np.sum, np.average, np.size, pd.Series.idxmax],'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()
>>> grouped
code colour size scaled_size
sum average size idxmax sum average size idxmax
0 one black 1031 60.647059 17 81 185.153944 10.891408 17 47
1 one white 481 37.000000 13 53 204.139249 15.703019 13 53
2 three black 822 48.352941 17 6 123.269405 7.251141 17 31
3 three white 1614 57.642857 28 50 285.638337 10.201369 28 37
4 two black 523 58.111111 9 85 80.908912 8.989879 9 88
5 two white 669 41.812500 16 78 82.098870 5.131179 16 78
[6 rows x 10 columns]
How can I flatten/merge the column index levels as: "Level1|Level2", e.g. size|sum, scaled_size|sum. etc? If this is not possible, is there a way to groupby() as I did above without creating multi-index columns?
There are potentially better, more pythonic ways to flatten MultiIndex columns.
1. Use map and join with string column headers:
grouped.columns = grouped.columns.map('|'.join).str.strip('|')
print(grouped)
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 862 53.875000 16 14
1 one white 554 46.166667 12 18
2 three black 842 49.529412 17 90
3 three white 740 56.923077 13 97
4 two black 1541 61.640000 25 50
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 6980 436.250000 16 77
1 6101 508.416667 12 13
2 7889 464.058824 17 64
3 6329 486.846154 13 73
4 12809 512.360000 25 23
2. Use map with format, which also works when a column level contains non-string (e.g. numeric) labels:
grouped.columns = grouped.columns.map('{0[0]}|{0[1]}'.format)
Output:
code| colour| size|sum size|average size|size size|idxmax \
0 one black 734 52.428571 14 30
1 one white 1110 65.294118 17 88
2 three black 930 51.666667 18 3
3 three white 1140 51.818182 22 20
4 two black 656 38.588235 17 77
5 two white 704 58.666667 12 17
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 8229 587.785714 14 57
1 8781 516.529412 17 73
2 10743 596.833333 18 21
3 10240 465.454545 22 26
4 9982 587.176471 17 16
5 6537 544.750000 12 49
3. Use list comprehension with f-string for Python 3.6+:
grouped.columns = [f'{i}|{j}' if j != '' else f'{i}' for i,j in grouped.columns]
Output:
code colour size|sum size|average size|size size|idxmax \
0 one black 1003 43.608696 23 76
1 one white 1255 59.761905 21 66
2 three black 777 45.705882 17 39
3 three white 630 52.500000 12 23
4 two black 823 54.866667 15 33
5 two white 491 40.916667 12 64
scaled_size|sum scaled_size|average scaled_size|size scaled_size|idxmax
0 12532 544.869565 23 27
1 13223 629.666667 21 13
2 8615 506.764706 17 92
3 6101 508.416667 12 43
4 7661 510.733333 15 42
5 6143 511.916667 12 49
You could always change the columns:
grouped.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in grouped.columns]
Based on Scott Boston's answer, a little update (it will also work for columns with 2 or more levels):
temp.columns.map(lambda x: '|'.join([str(i) for i in x]))
Thank you, Boston!
Full credit to suraj's concise answer: https://stackoverflow.com/a/72616083/317797
df.columns = df.columns.map('_'.join)
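If you would rather avoid the MultiIndex in the first place (the second part of the question), named aggregation (available since pandas 0.25) produces flat column names directly; a minimal sketch with the grouping from the question:
grouped = df.groupby(['code', 'colour']).agg(
    size_sum=('size', 'sum'),
    size_mean=('size', 'mean'),
    size_idxmax=('size', 'idxmax'),
    scaled_size_sum=('scaled_size', 'sum'),
    scaled_size_mean=('scaled_size', 'mean'),
).reset_index()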