Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column name are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code my output differs from the authors:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the authors output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No columns names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the columns names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
A final note. I just applied #AndyHayden numpy based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … Wow. What an enormous performance difference. I get the exact same
correct answer and I get it as soon as I click the mouse as compared to the pandas lambda solution also provided by #AndyHayden which takes about 20 seconds to perform the sort. That dataset is 58,000+ rows. The numpy solution returns the sort instantly.
Related
I want to add all the data from charts.zip from https://doi.org/10.5281/zenodo.4778562 in a single DataFrame. The data consist of a file per year that contains multiple CSVs. I made the following code:
header = 0
dfs = []
for file in glob.glob('Charts/*/201?/*.csv'):
region = file.split('/')[1]
dates = re.findall('\d{4}-\d{2}-\d{2}', file.split('/')[-1])
weekly_chart = pd.read_csv(file, header=header, sep='\t')
weekly_chart['week_start'] = datetime.strptime(dates[0], '%Y-%m-%d')
weekly_chart['week_end'] = datetime.strptime(dates[1], '%Y-%m-%d')
weekly_chart['region'] = region
dfs.append(weekly_chart)
all_charts = pd.concat(dfs)
But, when I run it, python returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_12886/3473678833.py in <module>
9 weekly_chart['region'] = region
10 dfs.append(weekly_chart)
---> 11 all_charts = pd.concat(dfs)
~/Downloads/enter/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
344 ValueError: Indexes have overlapping values: ['a']
345 """
--> 346 op = _Concatenator(
347 objs,
348 axis=axis,
~/Downloads/enter/lib/python3.9/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
401
402 if len(objs) == 0:
--> 403 raise ValueError("No objects to concatenate")
404
405 if keys is None:
ValueError: No objects to concatenate
How can I fix it?
I think the glob.glob might just be over complicating things... This works perfectly for me.
# Gives you a list of EVERY file in the Charts directory
# and sub directories that is a CSV
file_list = []
for path, subdirs, files in os.walk("Charts"):
file_list.extend([os.path.join(path, x) for x in files if x.endswith('.csv')])
dfs = []
for file in file_list:
region = file.split('/')[1]
dates = re.findall('\d{4}-\d{2}-\d{2}', file.split('/')[-1])
df = pd.read_csv(file, sep='\t')
df['week_start'] = dates[0]
df['week_end'] = dates[1]
df['region'] = region
dfs.append(df)
all_charts = pd.concat(dfs, ignore_index=True)
print(all_charts)
Output:
position song_id song_name artist streams ... peak_position position_status week_start week_end region
0 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 ... 1 0 2017-10-20 2017-10-27 us
1 2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 ... 2 0 2017-10-20 2017-10-27 us
2 3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 ... 3 1 2017-10-20 2017-10-27 us
3 4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 ... 4 1 2017-10-20 2017-10-27 us
4 5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 ... 1 -2 2017-10-20 2017-10-27 us
... ... ... ... ... ... ... ... ... ... ... ...
273595 196 6kex4EBAj0WHXDKZMEJaaF Swalla (feat. Nicki Minaj & Ty Dolla $ign) Jason Derulo 3747830 ... 8 -5 2018-03-02 2018-03-09 global
273596 197 0CokSRCu5hZgPxcZBaEzVE Glorious (feat. Skylar Grey) Macklemore 3725286 ... 14 -8 2018-03-02 2018-03-09 global
273597 198 7oK9VyNzrYvRFo7nQEYkWN Mr. Brightside The Killers 3717326 ... 148 -3 2018-03-02 2018-03-09 global
273598 199 7EUfNvyCVxQV3oN5ScA2Lb Next To Me Imagine Dragons 3681739 ... 122 -77 2018-03-02 2018-03-09 global
273599 200 6u0EAxf1OJTLS7CvInuNd7 Vai malandra (feat. Tropkillaz & DJ Yuri Martins) Anitta 3676542 ... 30 -23 2018-03-02 2018-03-09 global
If you really want the dates to be dates, you can run this on those two columns at the end.
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
Personally, I'd also do the following:
all_charts['week_start'] = pd.to_datetime(all_charts['week_start'])
all_charts['week_end'] = pd.to_datetime(all_charts['week_end'])
all_charts['region'] = all_charts['region'].astype('category')
all_charts['artist'] = all_charts['artist'].astype('category')
all_charts['song_name'] = all_charts['song_name'].astype('category')
all_charts['song_id'] = all_charts['song_id'].astype('category')
all_charts.set_index(['region', 'week_start', 'week_end', 'position'], inplace=True)
all_charts.position_status = pd.to_numeric(all_charts.position_status, errors='coerce')
print(df.head(10))
Giving:
song_id song_name artist streams last_week_position weeks_on_chart peak_position position_status
region week_start week_end position
us 2017-10-20 2017-10-27 1 7wGoVu4Dady5GV0Sv4UIsx rockstar Post Malone 17532665 1.0 3 1 0.0
2 75ZvA4QfFiZvzhj2xkaWAh I Fall Apart Post Malone 8350785 2.0 6 2 0.0
3 2fQrGHiQOvpL9UgPvtYy6G Bank Account 21 Savage 7589124 4.0 5 3 1.0
4 43ZyHQITOjhciSUUNPVRHc Gucci Gang Lil Pump 7584237 5.0 3 4 1.0
5 5tz69p7tJuGPeMGwNTxYuV 1-800-273-8255 Logic 7527770 3.0 26 1 -2.0
6 5Gd19NupVe5X8bAqxf9Iaz Gorgeous Taylor Swift 6940802 NaN 1 6 NaN
7 0ofbQMrRDsUaVKq2mGLEAb Havana Camila Cabello 6623184 10.0 12 7 3.0
8 2771LMNxwf62FTAdpJMQfM Bodak Yellow Cardi B 6472727 6.0 14 3 -2.0
9 5Z3GHaZ6ec9bsiI5BenrbY Young Dumb & Broke Khalid 5982108 9.0 29 6 0.0
10 7GX5flRQZVHRAGd6B4TmDO XO Tour Llif3 Lil Uzi Vert 5822583 8.0 9 2 -2.0
I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):
In[60]: Results.columns
Out[60]:
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
'Total', 'Unnamed: 9', '30-50', 'Above',
...
'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
'Share', 'Turnout'],
dtype='object', length=147)
So...
In[63]: Results.head()
Out[63]:
Constituency Region Country ID Type \
PAID
1 Aberavon Wales Wales W07000049 County
2 Aberconwy Wales Wales W07000058 County
3 Aberdeen North Scotland Scotland S14000001 Burgh
4 Aberdeen South Scotland Scotland S14000002 Burgh
5 Aberdeenshire West & Kincardine Scotland Scotland S14000058 County
Electorate Total Unnamed: 9 30-50 Above ... WP WRP WVPTFP \
PAID ...
1 49821 31523 NaN NaN NaN ... NaN NaN NaN
2 45525 30148 NaN NaN NaN ... NaN NaN NaN
3 67745 43936 NaN NaN NaN ... NaN NaN NaN
4 68056 48551 NaN NaN NaN ... NaN NaN NaN
5 73445 55196 NaN NaN NaN ... NaN NaN NaN
Yorks Young Zeb Party Votes Share Turnout
PAID
1 NaN NaN NaN Lab 15416 0.489040 0.632725
2 NaN NaN NaN Con 12513 0.415052 0.662230
3 NaN NaN NaN SNP 24793 0.564298 0.648550
4 NaN NaN NaN SNP 20221 0.416490 0.713398
5 NaN NaN NaN SNP 22949 0.415773 0.751528
[5 rows x 147 columns]
The per-constituency results for each party are given in the columns Results.ix[:, 'Unnamed: 9': 'Zeb']
I can find the winning party (i.e. the party which polled highest number of votes) and the number of votes it polled using:
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)
But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?
To get the highest values of a column, you can use nlargest() :
df['High'].nlargest(2)
The above will give you the 2 highest values of column High.
You can also use nsmallest() to get the lowest values.
Here is a NumPy solution:
In [120]: df
Out[120]:
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404, 1.52208164],
[ 1.28237078, 2.05657214],
[ 0.17379254, 0.95558613],
[ 1.06729107, 1.20100071],
[ 0.86201603, 1.28471676],
[ 1.19706331, 1.57417327],
[ 0.61145573, 1.35202868],
[ 0.15513379, 0.40842477],
[ 0.28792928, 1.42722604],
[ 0.48749578, 2.41126532]])
or as a pandas Data Frame:
In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
2nd-largest largest
0 1.334444 1.522082
1 1.282371 2.056572
2 0.173793 0.955586
3 1.067291 1.201001
4 0.862016 1.284717
5 1.197063 1.574173
6 0.611456 1.352029
7 0.155134 0.408425
8 0.287929 1.427226
9 0.487496 2.411265
or a faster solution from #Divakar:
In [6]: df
Out[6]:
a b c d e f g h
0 0.649517 -0.223116 0.264734 -1.121666 0.151591 -1.335756 -0.155459 -2.500680
1 0.172981 1.233523 0.220378 1.188080 -0.289469 -0.039150 1.476852 0.736908
2 -1.904024 0.109314 0.045741 -0.341214 -0.332267 -1.363889 0.177705 -0.892018
3 -2.606532 -0.483314 0.054624 0.979734 0.205173 0.350247 -1.088776 1.501327
4 1.627655 -1.261631 0.589899 -0.660119 0.742390 -1.088103 0.228557 0.714746
5 0.423972 -0.506975 -0.783718 -2.044002 -0.692734 0.980399 1.007460 0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797 0.599724 -0.884832 -0.086431 -0.738298
7 1.131621 1.218199 0.645709 0.066216 -0.265023 0.606963 -0.194694 0.463576
8 0.421164 0.626731 -0.547738 0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690 0.696714 -0.155727 -0.991975 -0.806530 1.454522 0.788688 0.409516
In [7]: a = df.values
In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665, 0.26473378],
[ 1.47685226, 1.23352348],
[ 0.17770473, 0.10931398],
[ 1.50132666, 0.97973383],
[ 1.62765464, 0.74238959],
[ 1.00745981, 0.98039898],
[ 0.5997243 , -0.0864306 ],
[ 1.21819904, 1.13162068],
[ 0.98982033, 0.62673128],
[ 1.45452173, 0.78868785]])
Here is an interesting approach. What if we replace the maximum value with the minimum value and calculate. Although it is a quick hack and, not recommended!
first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(),df(min)).idxmax()
first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
You could just sort your results, such that the first rows will contain the max. Then you can simply use indexing to get the first n places.
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].sort_values(by='votes', ascending=False)
RawResults.iloc[0, :] # First place
RawResults.iloc[1, :] # Second place
RawResults.iloc[n, :] # nth place
Here is a solution using nlargest function:
>>> df
a b c
0 4 20 2
1 5 10 2
2 3 40 5
3 1 50 10
4 2 30 15
>>> def give_largest(col,n):
... largest = col.nlargest(n).reset_index(drop = True)
... data = [x for x in largest]
... index = [f'{i}_largest' for i in range(1,len(largest)+1)]
... return pd.Series(data,index=index)
...
...
>>> def n_largest(df, axis, n):
... '''
... Function to return the n-largest value of each
... column/row of the input DataFrame.
... '''
... return df.apply(give_largest, axis = axis, n = n)
...
>>> n_largest(df,axis = 1, n = 2)
1_largest 2_largest
0 20 4
1 10 5
2 40 5
3 50 10
4 30 15
>>> n_largest(df,axis = 0, n = 2)
a b c
1_largest 5 50 15
2_largest 4 40 10
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
def second_largest(df):
return (df.nlargest(2).min())
print(df.apply(second_largest))
a 4
b 40
c 20
dtype: int64
df
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
tranpose and use nlargest in a for loop to get the results order by each line:
df1=df.T
results=list()
for col in df1.columns: results.append(df1[col].nlargest(len(df.columns))
the results var is a list of pandas objects, where the first item on the list will be the df's first row sorted in descending order and so on. Since each item on the list is a pandas object, it carries df's column as index (it was transposed), so you will get the values and the df's columns name of each row sorted
results
[h 1.522082
a 1.334444
b 0.322029
c 0.302296
g -0.157942
e -0.360488
d -0.841236
f -0.860188
Name: 0, dtype: float64,
a 2.056572
g 1.282371
b 0.991643
f 0.533202
e 0.235132
c 0.160067
d -0.066473
h -2.050731
Name: 1, dtype: float64,
....
I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
labels='name', node_color=dim('index').str()))
That makes this, which looks nice.
[![enter image description here][1]][1]
The sample data is sourced from here.
https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas dataframe and 'nodes' is a holoviews dataset, and the type is like this.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this...how can I feed a dataframe into a Chord Diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function. Let me explain why I think so?
The Chord-function expects at least on dataset (this can be a pandas DataFrame) with three columns, but all elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. This can take strings in the second columns to add labels for example.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your date in the basic form, if you replace the strings in your DataFrame df by numbers like this:
_df = df.copy()
values = list(_df.Measure.unique())+list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
return d[s]
_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20',
edge_color=dim('Measure').str(),
labels='Country',
node_color=dim('index').str()))
As you can see, all the conection lines only have one of two colors. This is because in the Measure column are only two elements. Therefor I think, this is not what you want.
Modificated Example
Let's Modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
_list = list(df[df['Measure']==value].Country.unique())
node = pd.concat([node, pd.DataFrame({'Name':_list, 'Group':i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again and can call the Chord-function again.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
return d[s]
new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)
hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
labels='Name', node_color=dim('index').str()
)
)
The are now two groups added to the HoverTool.
I'm having trouble removing non-digits from a df column. I have tried a few methods, but there are still quite a few that produce NaN values when the function passed through the column.
I need the output to be only digits in an integer form (No leading zeros)
Cust #
0 10726
2 11699
5 12963
8 z13307
9 13405
12 14831-001
16 16416
17 16917
18 18027
24 19233z
dtype('O')
I have tried:
Unique_Stores['Cust #2']=Unique_Stores['Cust #2'].str.extract('(\d+)',expand=True)
Unique_Stores['Cust #2'].str.replace(r'(\D+)','')
Unique_Stores['Cust #2'].replace(to_replace="([0-9]+)", value=r"\1", regex=True, inplace=True)
Unique_Stores['Cust #2'] = pd.to_numeric(Unique_Stores['Cust #2'].str.replace(r'\D+', ''), errors='coerce')
Thank you in advance, and let me know if you need more info.
But no matter what I do, the first 1000 or so lines return NaN values- even when the value is an integer.
Link to actual dataset
UPDATE:
In [144]: df = pd.read_csv(r'D:\download\Customer_Numbers.csv', index_col=0)
In [145]: df['Cust #2'] = df['Cust #'].str.replace(r'\D+', '').astype(int)
In [146]: df
Out[146]:
State Zip Code Cust # Cust #2
0 PA 16505 10726 10726
2 MI 48103 11699 11699
5 NH 3253 12963 12963
8 PA 18951 13307 13307
9 MA 2360 13405 13405
12 NY 11940 14831 14831
16 OH 44278 16416 16416
17 OH 45459 16917 16917
18 MA 1748 18027 18027
24 NY 14226 19233 19233
... ... ... ... ...
54393 WA 99207 005611-99 561199
54394 WA 99006 7775 7775
54395 WA 99353 8006 8006
54399 WA 99206 8888 8888
54404 CA 92117 444202 444202
54408 CA 90019 30066 30066
54411 CA 90026 443607 443607
54414 CA 90094 9242 9242
54417 CA 90405 9245 9245
54420 CA 90038 9247 9247
[6492 rows x 4 columns]
In [147]: df.dtypes
Out[147]:
State object
Zip Code object
Cust # object
Cust #2 int32
dtype: object
OLD answer:
In [123]: df
Out[123]:
val
0 10726
2 11699
5 12963
8 z13307
9 13405
12 14831-001
16 16416
17 16917
18 18027
24 19233z
In [124]: df['val'] = pd.to_numeric(df['val'].str.replace(r'\D+', ''), errors='coerce')
In [125]: df
Out[125]:
val
0 10726
2 11699
5 12963
8 13307
9 13405
12 14831001
16 16416
17 16917
18 18027
24 19233
I have a pandas DataFrame structured in the following way
0 1 2 3 4 5 6 7 8 9
0 42 2012 106 1200 0.112986 -0.647709 -0.303534 31.73 14.80 1096
1 42 2012 106 1200 0.185159 -0.588728 -0.249392 31.74 14.80 1097
2 42 2012 106 1200 0.199910 -0.547780 -0.226356 31.74 14.80 1096
3 42 2012 106 1200 0.065741 -0.796107 -0.099782 31.70 14.81 1097
4 42 2012 106 1200 0.116718 -0.780699 -0.043169 31.66 14.78 1094
5 42 2012 106 1200 0.280035 -0.788511 -0.171763 31.66 14.79 1094
6 42 2012 106 1200 0.311319 -0.663151 -0.271162 31.78 14.79 1094
In which columns 4, 5 and 6 are actually the components of a vector. I want to apply a matrix multiplication in these columns, that is to replace columns 4, 5 and 6 with the vector resulting of a the multiplication of the previous vector with a matrix.
What I did was
DC=[[ .. definition of multiplication matrix .. ]]
def rotate(vector):
return dot(DC, vector)
data[[4,5,6]]=data[[4,5,6]].apply(rotate, axis='columns')
Which I thought should work, but the returned DataFrame is exactly the same as the original.
What am I missing here?
You code is correct but very slow. You can use values property to get the ndarray and use dot() to transform all the vectors at once:
import numpy as np
import pandas as pd
DC = np.random.randn(3, 3)
df = pd.DataFrame(np.random.randn(1000, 10))
df2 = df.copy()
df[[4,5,6]] = np.dot(DC, df[[4,5,6]].values.T).T
def rotate(vector):
return np.dot(DC, vector)
df2[[4,5,6]] = df2[[4,5,6]].apply(rotate, axis='columns')
df.equals(df2)
On my PC, it's about 90x faster.