I am starting from a known public data set, which I copied to my own server.
The data set is here: https://www.kaggle.com/imdevskp/corona-virus-report/download
import pandas as pd
#df = pd.read_csv("http://g0mesp1res.dynip.sapo.pt/covid_19_clean_complete.csv", index_col=4, parse_dates=True)
df = pd.read_csv("http://g0mesp1res.dynip.sapo.pt/covid_19_clean_complete.csv")
df = df.drop(columns=['Province', 'Lat', 'Long'])
#print(df.head())
df['Date']=pd.to_datetime(df['Date'])
#print(df.head())
list_countries = ['Portugal','Brazil','Spain','Italy','Korea, South','Japan']
df= df[df['Country'].isin(list_countries)]
df_pt = df[df.Country == 'Portugal']
df_es = df[df.Country == 'Spain']
df_it = df[df.Country == 'Italy']
print(df_pt.head())
print(df_pt.tail())
I get what I expected
Country Date Confirmed Deaths Recovered
59 Portugal 2020-01-22 0 0 0
345 Portugal 2020-01-23 0 0 0
631 Portugal 2020-01-24 0 0 0
917 Portugal 2020-01-25 0 0 0
1203 Portugal 2020-01-26 0 0 0
Country Date Confirmed Deaths Recovered
15503 Portugal 2020-03-16 331 0 3
15789 Portugal 2020-03-17 448 1 3
16075 Portugal 2020-03-18 448 2 3
16361 Portugal 2020-03-19 785 3 3
16647 Portugal 2020-03-20 1020 6 5
However, when plotting, it seems that all data is in January!
import plotly.graph_objects as go
fig = go.Figure( go.Scatter(x=df.Date, y=df_pt.Confirmed, name='Portugal'))
fig.show()
plotly output graph :
What is missing?
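Change the x axis from x=df.Date to x=df_pt.Date. df still holds the rows for all six countries, while y=df_pt.Confirmed holds only Portugal's rows, so the x and y values do not correspond row by row: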
import plotly.graph_objects as go
fig = go.Figure(go.Scatter(x = df_pt.Date,
y = df_pt.Confirmed,
name='Portugal'))
fig.show()
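As a rough illustration of the mismatch, the sketch below uses a small made-up frame (the values are not from the real data set) and builds one (x, y) pair per country so that both always come from the same filtered rows. The `traces` dictionary is only an illustrative name.

```python
import pandas as pd

# Toy stand-in for the filtered COVID frame: two countries, three dates each.
df = pd.DataFrame({
    "Country": ["Portugal", "Spain"] * 3,
    "Date": pd.to_datetime(["2020-01-22", "2020-01-22",
                            "2020-01-23", "2020-01-23",
                            "2020-01-24", "2020-01-24"]),
    "Confirmed": [0, 1, 2, 3, 5, 8],
})

df_pt = df[df["Country"] == "Portugal"]

# x taken from the full frame is longer than y and mixes both countries.
print(len(df["Date"]), len(df_pt["Confirmed"]))

# One (dates, values) pair per country, both taken from the same group.
traces = {
    country: (group["Date"].tolist(), group["Confirmed"].tolist())
    for country, group in df.groupby("Country")
}
```

Each pair could then be passed to `go.Scatter(x=dates, y=values, name=country)` and added to the figure with `fig.add_trace`.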
I'm trying to set up a bubble map with go.Scattergeo. Everything is good so far, but when I change the size of the bubbles (each one represents a number), the biggest one "absorbs" the smallest one, see pictures:
https://i.stack.imgur.com/qgnIG.png
https://i.stack.imgur.com/z5IrD.png
As you can see I have commented out two more options that I tried, but both ended with the same result. Changing the color doesn't affect the position, and I've checked the whole documentation and found nothing about it.
Does anyone know what I am doing wrong?
Thank you in advance.
This is my code so far
This is a 10-row sample of the DataFrame I'm using (essentially it's the Titanic dataset with an assigned latitude and longitude for each port of departure):
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked lon lat
0 1 0 3 Braund, Mr. Owen Harris 0 22.000000 1 0 A/5 21171 7.2500 0 -1.406013 50.896364
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.000000 1 0 PC 17599 71.2833 1 -8.294143 51.850910
2 3 1 3 Heikkinen, Miss. Laina 1 26.000000 0 0 STON/O2. 3101282 7.9250 0 -1.406013 50.896364
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.000000 1 0 113803 53.1000 0 -1.406013 50.896364
4 5 0 3 Allen, Mr. William Henry 0 35.000000 0 0 373450 8.0500 0 -1.406013 50.896364
5 6 0 3 Moran, Mr. James 0 29.699118 0 0 330877 8.4583 2 -1.612260 49.648194
6 7 0 1 McCarthy, Mr. Timothy J 0 54.000000 0 0 17463 51.8625 0 -1.406013 50.896364
7 8 0 3 Palsson, Master. Gosta Leonard 0 2.000000 3 1 349909 21.0750 0 -1.406013 50.896364
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 1 27.000000 0 2 347742 11.1333 0 -1.406013 50.896364
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) 1 14.000000 1 0 237736 30.0708 1 -8.294143 51.850910
The coordinates are assigned to each port of departure as follows:
df.replace({'Embarked':{'S':0, 'C':1, 'Q':2}}, inplace=True)
df.loc[df['Embarked']==0, 'lon'] = '-1.406013'
df.loc[df['Embarked']==1, 'lon'] = '-8.294143'
df.loc[df['Embarked']==2, 'lon'] = '-1.612260'
df.loc[df['Embarked']==0, 'lat'] = '50.896364'
df.loc[df['Embarked']==1, 'lat'] = '51.850910'
df.loc[df['Embarked']==2, 'lat'] = '49.648194'
import numpy as np
import plotly.graph_objects as go
sizearray = np.asarray(df['lon'].value_counts())
fig = go.Figure(data=go.Scattergeo(lon = df['lon'],lat = df['lat'],mode = 'markers'))
fig.update_layout(
title = 'Shipping Ports',geo_scope='world',)
fig.update_geos(fitbounds="locations")
fig.update_traces(marker=dict(size=sizearray,sizemode='area',sizeref=2.*max(sizearray)/(40.**2),sizemin=4))
#fig.update_traces(marker_size=sizearray/5)
#fig.update_traces(marker_size=df['lon'].value_counts()/5)
fig.update_traces(marker=dict(color=['rgb(93, 164, 214)', 'rgb(255, 144, 14)','rgb(44, 160, 101)']))
fig.show()
If you want to visualize the number of people embarking from each port, you can either draw the graph from a data frame that groups the original data, or build it from the unique latitude and longitude values, in the same way the counts were extracted. Here I took the unique latitudes and longitudes together with the array of counts and passed those as the coordinates and marker sizes. I also kept the marker size and coloring from your code. Please correct me if my customization is wrong.
import numpy as np
import plotly.graph_objects as go
sizearray = np.asarray(df['lon'].value_counts())
latarray = np.asarray(df['lat'].unique())
lonarray = np.asarray(df['lon'].unique())
fig = go.Figure(data=go.Scattergeo(lon=lonarray,
lat=latarray,
mode='markers',
marker=dict(
size=sizearray,
sizemode='area',
sizeref=2.*max(sizearray)/(40.**2),
sizemin=4,
color=[
'rgb(93, 164, 214)',
'rgb(255, 144, 14)',
'rgb(44, 160, 101)'
]
)))
fig.update_layout(autosize=True, height=600, title='Shipping Ports', geo_scope='europe')
fig.update_geos(fitbounds="locations")
fig.show()
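The other option mentioned above, grouping the original data, might look like the sketch below. It uses a small made-up frame rather than the real Titanic data, and the column name `passengers` is only an illustrative choice. One reason to consider it: `value_counts()` orders by frequency while `unique()` orders by first appearance, so the two arrays are not guaranteed to line up in general, whereas a single `groupby` keeps each count next to its own coordinates.

```python
import pandas as pd

# Toy stand-in for the Titanic frame: one row per passenger, with the
# coordinates stored as strings as in the question.
df = pd.DataFrame({
    "lon": ["-1.406013", "-8.294143", "-1.406013", "-1.612260", "-1.406013"],
    "lat": ["50.896364", "51.850910", "50.896364", "49.648194", "50.896364"],
})

# One row per port; each count is computed alongside its own coordinates.
ports = (
    df.astype({"lon": float, "lat": float})
      .groupby(["lon", "lat"])
      .size()
      .reset_index(name="passengers")
)
print(ports)
```

The resulting columns could then be passed to `go.Scattergeo` as `lon=ports["lon"]`, `lat=ports["lat"]` and `marker=dict(size=ports["passengers"], ...)`, with the remaining marker settings as in the code above.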
I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
labels='name', node_color=dim('index').str()))
That makes this, which looks nice.
The sample data is sourced from here.
https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas dataframe and 'nodes' is a holoviews dataset, and the type is like this.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this...how can I feed a dataframe into a Chord Diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function. Let me explain why.
hv.Chord expects at least one dataset (this can be a pandas DataFrame) with three columns, all of whose elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. It can hold strings in its second column, for example to add labels.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your data into the basic form if you replace the strings in your DataFrame df with numbers like this:
_df = df.copy()
values = list(_df.Measure.unique())+list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
    return d[s]
_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> _df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20',
edge_color=dim('Measure').str(),
labels='Country',
node_color=dim('index').str()))
As you can see, all the connection lines have only one of two colors. This is because the Measure column contains only two distinct elements. Therefore, I think this is not what you want.
Modified Example
Let's modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
    _list = list(df[df['Measure']==value].Country.unique())
    node = pd.concat([node, pd.DataFrame({'Name':_list, 'Group':i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again and can call the Chord-function again.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
    return d[s]
new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)
hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
labels='Name', node_color=dim('index').str()
)
)
There are now two groups added to the HoverTool.
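The string-to-number step can also be sketched with `pd.factorize`, which assigns one integer per distinct label. This is only an alternative to the `str2num` helper above, not something the original answer used. The edge list below is made up, and the names `edges_num` and `nodes` are illustrative.

```python
import pandas as pd

# Hypothetical edge list with string endpoints (not the asker's real data).
edges = pd.DataFrame({
    "From": ["Greece", "Spain", "France", "Iceland"],
    "To": ["Iceland", "Iceland", "America", "Greece"],
    "Value": [1590, 1455, 1345, 1100],
})

# Assign one integer id per distinct label across both endpoint columns.
codes, labels = pd.factorize(pd.concat([edges["From"], edges["To"]]))
n = len(edges)
edges_num = pd.DataFrame({
    "source": codes[:n],
    "target": codes[n:],
    "value": edges["Value"].to_numpy(),
})

# Node table: the integer id in 'index' and the original label in 'name'.
nodes = pd.DataFrame({"index": range(len(labels)), "name": labels})
print(edges_num)
print(nodes)
```

Following the pattern used earlier, these two frames could be passed on as `hv.Chord((edges_num, hv.Dataset(nodes, 'index')))` with `labels='name'` in the options.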
I'm new to Python, so I apologize in advance if the question is too easy.
I'm trying to build a simulation to find the optimum in a dataframe. This is what I have so far:
import random
import pandas as pd
import math
import numpy as np
loops = int(input('Q of simulations: '))
cost = 175
sell_price = 250
sale_price = 250/2
# order = 1000
simulation = 0
profit = 0
rows = []
order = range(1000, 3000)
ordenes = []
for i in order:
    ordenes.append(i)
for i in ordenes:
    demand = math.trunc(1000 + random.random() * (2001))
    if demand >= i:
        profit = (sell_price - cost)* i
        rows.append([simulation, demand, i, profit, (demand - i)])
    else:
        profit = (sell_price - cost)* demand - (i - demand)* (sale_price - cost)
        rows.append([simulation, demand, i, profit, (demand - i)])
DataFrame = pd.DataFrame(rows, columns = ['#Simulation', 'Demand', 'Order', 'Utility', 'Shortage'])
print(DataFrame)
DataFrame.loc[DataFrame['Utility'].idxmax()]
The current output (for any number specified in the input):
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 0 1392 1001 75075.0 391
2 0 1042 1002 75150.0 40
3 0 1457 1003 75225.0 454
4 0 1930 1004 75300.0 926
... ... ... ... ... ...
1995 0 1823 2995 195325.0 -1172
1996 0 2186 2996 204450.0 -810
1997 0 1384 2997 184450.0 -1613
1998 0 1795 2998 194775.0 -1203
1999 0 1611 2999 190225.0 -1388
[2000 rows x 5 columns]
#Simulation 0.0
Demand 2922.0
Order 2989.0
Utility 222500.0
Shortage -67.0
Name: 1989, dtype: float64
Desired Output (writing 5 in the input):
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 1 1392 1001 75075.0 391
2 2 1042 1002 75150.0 40
3 3 1457 1003 75225.0 454
4 4 1930 1004 75300.0 926
[5 rows x 5 columns]
#Simulation 4.0
Demand 1930.0
Order 1004.0
Utility 75300.0
Shortage 926.0
Name: 1989, dtype: float64
I really don't know how to make it happen, I've tried everything that comes to my mind but the outcome either fails on the 'order' column or as shown above.
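One possible reading of the desired output is that each simulation gets its own row, with the simulation counter increasing and the order quantity starting at 1000 and growing by one per run. The sketch below follows that assumption. The function name `simulate` and the parameters `first_order` and `seed` are hypothetical, and the profit formula is copied unchanged from the code above without checking whether the sign of the leftover term matches the intended economics.

```python
import math
import random

import pandas as pd

COST = 175
SELL_PRICE = 250
SALE_PRICE = SELL_PRICE / 2


def simulate(loops, first_order=1000, seed=None):
    """Run `loops` simulations, one row each, with order = first_order + n."""
    rng = random.Random(seed)
    rows = []
    for simulation in range(loops):
        order = first_order + simulation
        demand = math.trunc(1000 + rng.random() * 2001)
        if demand >= order:
            profit = (SELL_PRICE - COST) * order
        else:
            # Formula copied unchanged from the question.
            profit = (SELL_PRICE - COST) * demand - (order - demand) * (SALE_PRICE - COST)
        rows.append([simulation, demand, order, profit, demand - order])
    return pd.DataFrame(
        rows, columns=["#Simulation", "Demand", "Order", "Utility", "Shortage"]
    )


result = simulate(5, seed=0)
print(result)
print(result.loc[result["Utility"].idxmax()])
```

In the original script the number of runs would come from `int(input('Q of simulations: '))` instead of the hard-coded 5.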
Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No columns names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can keep the column names by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note: I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … wow. What an enormous performance difference. I get exactly the same correct answer, and I get it as soon as I click the mouse, whereas the pandas lambda solution also provided by @AndyHayden takes about 20 seconds to perform the sort. The dataset has 58,000+ rows. The numpy solution returns the sort instantly.
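For completeness, here is a sketch of the full step (sort each pair, then count) on a made-up flights frame. It uses `np.sort`, which returns a sorted copy, instead of the in-place `.sort()` shown above, so the original columns stay untouched. That substitution is my own choice, and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights data (hypothetical rows).
flights = pd.DataFrame({
    "ORG_AIR": ["ATL", "IAH", "ATL", "LAX", "SLC"],
    "DEST_AIR": ["IAH", "ATL", "IAH", "SLC", "LAX"],
})

# np.sort returns a sorted copy, so the original frame is left untouched.
pairs = np.sort(flights[["ORG_AIR", "DEST_AIR"]].to_numpy(), axis=1)
flights_sort = pd.DataFrame(pairs, columns=["AIR1", "AIR2"], index=flights.index)

# Count flights between each pair of airports, regardless of direction.
flights_ct2 = flights_sort.groupby(["AIR1", "AIR2"]).size()
print(flights_ct2)
```

Because the new frame is built with the generic column names directly, no separate `rename` step is needed.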
I have a dataset in .xlsx with hundreds of thousands of rows as follows:
slug symbol name date ranknow open high low close volume market close_ratio spread
companyA AAA companyA 28/04/2013 1 135,3 135,98 132,1 134,21 0 1500520000 0,5438 3,88
companyA AAA companyA 29/04/2013 1 134,44 147,49 134 144,54 0 1491160000 0,7813 13,49
companyA AAA companyA 30/04/2013 1 144 146,93 134,05 139 0 1597780000 0,3843 12,88
....
companyA AAA companyA 17/04/2018 1 8071,66 8285,96 7881,72 7902,09 6900880000 1,3707E+11 0,0504 404,24
....
lancer LA Lancer 09/01/2018 731 0,347111 0,422736 0,345451 0,422736 3536710 0 1 0,08
lancer LA Lancer 10/01/2018 731 0,435794 0,512958 0,331123 0,487106 2586980 0 0,8578 0,18
lancer LA Lancer 11/01/2018 731 0,479738 0,499482 0,309485 0,331977 950410 0 0,1184 0,19
....
lancer LA Lancer 17/04/2018 731 0,027279 0,041106 0,02558 0,031017 9936 1927680 0,3502 0,02
....
yocomin YC Yocomin 21/01/2016 732 0,008135 0,010833 0,002853 0,002876 63 139008 0,0029 0,01
yocomin YC Yocomin 22/01/2016 732 0,002872 0,008174 0,001192 0,005737 69 49086 0,651 0,01
yocomin YC Yocomin 23/01/2016 732 0,005737 0,005918 0,001357 0,00136 67 98050 0,0007 0
....
yocomin YC Yocomin 17/04/2018 732 0,020425 0,021194 0,017635 0,01764 12862 2291610 0,0014 0
....
Let's say I have a .txt file with a list of the symbols whose time series I want to extract. For example:
AAA
LA
YC
I would like to get a dataset that looks as follows:
date AAA LA YC
28/04/2013 134,21 NaN NaN
29/04/2013 144,54 NaN NaN
30/04/2013 139 NaN NaN
....
....
....
17/04/2018 7902,09 0,031017 0,01764
where under each stock symbol (like AAA, etc.) I get the "close" price. I'm open to both Python and R. Any help would be great!
In Python, using pandas, this should work.
import pandas as pd
df = pd.read_excel("/path/to/file/Book1.xlsx")
df = df.loc[:, ['symbol', 'name', 'date', 'close']]
df = df.set_index(['symbol', 'name', 'date'])
df = df.unstack(level=[0,1])
df = df['close']
To read the symbols file and then filter out symbols not in the dataframe:
symbols = pd.read_csv('/path/to/file/symbols.txt', sep=" ", header=None)
symbols = symbols[0].tolist()
symbols = pd.Index(symbols).unique()
symbols = symbols.intersection(df.columns.get_level_values(0))
And the output will look like:
print(df[symbols])
symbol AAA LA YC
name companyA Lancer Yocomin
date
2018-09-01 00:00:00 None 0,422736 None
2018-10-01 00:00:00 None 0,487106 None
2018-11-01 00:00:00 None 0,331977 None
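In the output above, 09/01/2018 appears as 2018-09-01 and the prices still contain commas, which suggests the dates were read month-first and the prices were left as text. A minimal sketch of one way to handle both, assuming `date` and `close` arrive as strings, is shown below on a few made-up rows. The `wanted` list stands in for the contents of the .txt file.

```python
import pandas as pd

# Toy stand-in for the spreadsheet, keeping day-first dates and decimal commas.
df = pd.DataFrame({
    "symbol": ["AAA", "AAA", "LA", "YC"],
    "date": ["28/04/2013", "17/04/2018", "17/04/2018", "17/04/2018"],
    "close": ["134,21", "7902,09", "0,031017", "0,01764"],
})
wanted = ["AAA", "LA", "YC"]  # would come from the .txt file

# Parse dates as day/month/year and convert decimal commas to floats.
df["date"] = pd.to_datetime(df["date"], dayfirst=True)
df["close"] = df["close"].str.replace(",", ".", regex=False).astype(float)

# One row per date, one column per requested symbol.
wide = (
    df[df["symbol"].isin(wanted)]
      .pivot(index="date", columns="symbol", values="close")
      .reindex(columns=wanted)
)
print(wide)
```

Dates on which a symbol has no row come out as NaN, which matches the layout asked for in the question.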