Get length of values in pandas dataframe column - python

I'm trying to get the length of each zipCd value in the dataframe mentioned below. When I run the code below I get 958 for every record. I'm expecting to get something more like '4'. Does anyone see what the issue is?
Code:
zipDfCopy['zipCd'].str.len()
Data:
print zipDfCopy[1:5]
   Zip Code  Place Name          State State Abbreviation     County  \
1       544  Holtsville       New York                 NY    Suffolk
2      1001      Agawam  Massachusetts                 MA    Hampden
3      1002     Amherst  Massachusetts                 MA  Hampshire
4      1003     Amherst  Massachusetts                 MA  Hampshire

   Latitude  Longitude                    zipCd
1   40.8154   -73.0451  0 501\n1 544\n2 1001...
2   42.0702   -72.6227  0 501\n1 544\n2 1001...
3   42.3671   -72.4646  0 501\n1 544\n2 1001...
4   42.3919   -72.5248  0 501\n1 544\n2 1001...

Looking at your data, each zipCd cell appears to hold the string dump of a whole column ('0 501\n1 544\n2 1001...'), which is why .str.len() returns the same length, 958, for every row. Once each cell holds a single zip code, one way is to convert to string and use pd.Series.map with the len built-in.
pd.Series.str is used for vectorized string functions, while pd.Series.astype is used to change the column type.
import pandas as pd

df = pd.DataFrame({'ZipCode': [341, 4624, 536, 123, 462, 4642]})
df['ZipLen'] = df['ZipCode'].astype(str).map(len)

#    ZipCode  ZipLen
# 0      341       3
# 1     4624       4
# 2      536       3
# 3      123       3
# 4      462       3
# 5     4642       4
For positive integers, an arithmetic alternative is np.log10, since a number n has floor(log10(n)) + 1 digits:
import numpy as np
df['ZipLen'] = np.floor(np.log10(df['ZipCode'].values)).astype(int) + 1
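For completeness, a minimal sketch (with made-up values) showing that .str.len() itself behaves as expected once each cell holds a single zip-code string:
import pandas as pd
df = pd.DataFrame({'zipCd': ['501', '544', '1001']})
df['zipCd'].str.len()   # vectorized string method, same result as astype(str).map(len)
# 0    3
# 1    3
# 2    4
# dtype: int64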


How can we create a Chord Diagram with a dataframe object?

I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
               labels='name', node_color=dim('index').str()))
That makes this, which looks nice.
[rendered chord diagram]
The sample data is sourced from here: https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas DataFrame and 'nodes' is a HoloViews Dataset; their types are:
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this: how can I feed a dataframe into a Chord Diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function; let me explain why.
The Chord function expects at least one dataset (this can be a pandas DataFrame) with three columns, in which all elements are numbers.
   source  target  value
0       1       0      1
1       2       0      8
2       3       0     10
A second dataset is optional. This one can contain strings, for example in a second column used as labels.
   index name  group
0      0    a      0
1      1    b      0
2      2    c      0
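Putting those two tables together, here is a minimal, self-contained sketch of feeding both to hv.Chord (the fourth node 'd' is a hypothetical addition so that every source index in the edge table has a label):
import pandas as pd
import holoviews as hv
hv.extension('bokeh')

# Edge table: three numeric columns
links = pd.DataFrame({'source': [1, 2, 3], 'target': [0, 0, 0], 'value': [1, 8, 10]})
# Node table: numeric index plus string labels; 'index' is the key dimension
nodes = hv.Dataset(pd.DataFrame({'index': [0, 1, 2, 3],
                                 'name': ['a', 'b', 'c', 'd'],
                                 'group': [0, 0, 0, 0]}), 'index')
hv.Chord((links, nodes))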
Basic Example
Your given data looks like this.
      Measure  Country  Value
0    Arrivals   Greece   1590
1    Arrivals    Spain   1455
2    Arrivals   France   1345
3    Arrivals  Iceland   1100
4    Arrivals  Iceland   1850
5  Departures  America   2100
6  Departures  Ireland   1000
7  Departures  America    950
8  Departures  Ireland   1200
9  Departures    Japan   1050
You can bring your data into the basic form if you replace the strings in your DataFrame df with numbers, like this:
_df = df.copy()
values = list(_df.Measure.unique()) + list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> _df
   Measure  Country  Value
0        0        2   1590
1        0        3   1455
2        0        4   1345
3        0        5   1100
4        0        5   1850
5        1        6   2100
6        1        7   1000
7        1        6    950
8        1        7   1200
9        1        8   1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('Measure').str(),
               labels='Country',
               node_color=dim('index').str()))
As you can see, all the connection lines have only one of two colors. This is because the Measure column contains only two distinct elements. Therefore I think this is not what you want.
Modified Example
Let's modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
      From       To  Value
0   Greece  Iceland   1590
1    Spain  Iceland   1455
2   France  America   1345
3  Iceland  Ireland   1100
4  Iceland  America   1850
5  America  Ireland   2100
6  Ireland    Japan   1000
7  America   Greece    950
8  Ireland    Spain   1200
9    Japan   France   1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
    _list = list(df[df['Measure'] == value].Country.unique())
    node = pd.concat([node, pd.DataFrame({'Name': _list, 'Group': i})], ignore_index=True)
>>> node
      Name  Group
0   Greece      0
1    Spain      0
2   France      0
3  Iceland      0
4  America      1
5  Ireland      1
6    Japan      1
Now we have to replace the strings in new_df with numbers again, and then we can call the Chord function once more.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)
hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
               labels='Name', node_color=dim('index').str()))
There are now two groups added to the HoverTool.

How do I calculate the average of a range from a series within a dataframe?

I'm new to Python and to data manipulation.
I have a dataframe
df3
Out[22]:
                           Breed Lifespan
0         New Guinea Singing Dog       18
1                      Chihuahua       17
2                     Toy Poodle       16
3           Jack Russell Terrier       16
4                       Cockapoo       16
..                           ...      ...
201                      Whippet   12--15
202  Wirehaired Pointing Griffon   12--14
203               Xoloitzcuintle       13
204                  Yorkie--Poo       14
205            Yorkshire Terrier   14--16
As you can see above, some of the lifespans are given as a range, like 14--16. The type of the Lifespan column is
type(df3['Lifespan'])
Out[24]: pandas.core.series.Series
I want each range to be replaced by the average of its two numbers, i.e. 15. I do not want any ranges, just the average as a single number. How do I do this?
Use str.split with expand=True:
import pandas as pd

df = pd.DataFrame({'Breed': ['Dog1', 'Dog2'],
                   'Lifespan': [12, '14--15']})
df['Lifespan'] = (df['Lifespan']
                  .astype(str).str.split('--', expand=True)
                  .astype(float).mean(axis=1))
df
#   Breed  Lifespan
# 0  Dog1      12.0
# 1  Dog2      14.5
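To see why this works: split with expand=True pads the single-number rows with None, astype(float) turns None into NaN, and mean(axis=1) skips NaN by default. A small sketch of the intermediate frame (raw here stands for the frame before the assignment above):
raw = pd.DataFrame({'Breed': ['Dog1', 'Dog2'], 'Lifespan': [12, '14--15']})
raw['Lifespan'].astype(str).str.split('--', expand=True)
#     0     1
# 0  12  None
# 1  14    15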

Pandas can only convert an array of size 1 to a Python scalar

I have this dataframe, df_pm:
Player GameWeek Minutes \
PlayerMatchesDetailID
1 Alisson 1 90
2 Virgil van Dijk 1 90
3 Joseph Gomez 1 90
ForTeam AgainstTeam \
1 Liverpool Norwich City
2 Liverpool Norwich City
3 Liverpool Norwich City
Goals ShotsOnTarget ShotsInBox CloseShots \
1 0 0 0 0
2 1 1 1 1
3 0 0 0 0
TotalShots Headers GoalAssists ShotOnTargetCreated \
1 0 0 0 0
2 1 1 0 0
3 0 0 0 0
ShotInBoxCreated CloseShotCreated TotalShotCreated \
1 0 0 0
2 0 0 0
3 0 0 1
HeadersCreated
1 0
2 0
3 0
this second dataframe, df_melt:
MatchID GameWeek Date Team Home \
0 46605 1 2019-08-09 Liverpool Home
1 46605 1 2019-08-09 Norwich City Away
2 46606 1 2019-08-10 AFC Bournemouth Home
AgainstTeam
0 Norwich City
1 Liverpool
2 Sheffield United
3 AFC Bournemouth
...
575 Sheffield United
576 Newcastle United
577 Southampton
and this snippet, which uses both:
match_ids = []
home_away = []
dates = []
# For each row in the player matches dataframe...
for row in df_pm.itertuples():
    # Look up the match id from the team matches dataframe
    team = row.ForTeam
    againstteam = row.AgainstTeam
    gameweek = row.GameWeek
    print(team, againstteam, gameweek)
    match_id = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                           & (df_melt['Team'] == team)
                           & (df_melt['AgainstTeam'] == againstteam),
                           'MatchID'].item()
    date = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                       & (df_melt['Team'] == team)
                       & (df_melt['AgainstTeam'] == againstteam),
                       'Date'].item()
    home = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                       & (df_melt['Team'] == team)
                       & (df_melt['AgainstTeam'] == againstteam),
                       'Home'].item()
    match_ids.append(match_id)
    home_away.append(home)
    dates.append(date)
On the first iteration, this prints:
Liverpool
Norwich City
1
But I'm getting the error:
Traceback (most recent call last):
File "tableau_data_generation.py", line 166, in <module>
'MatchID'].item()
File "/Users/me/anaconda2/envs/data_science/lib/python3.7/site-packages/pandas/core/base.py", line 652, in item
return self.values.item()
ValueError: can only convert an array of size 1 to a Python scalar
Printing the whole df_melt dataframe, I see that these four datetime values are flawed:
540 46875 28 TBC Aston Villa Home
541 46875 28 TBC Sheffield United Away
...
548 46879 28 TBC Manchester City Home
549 46879 28 TBC Arsenal Away
How do I fix this?
When you use item() on a Series, you should actually have received:
FutureWarning: `item` has been deprecated and will be removed in a future version
Since item() was deprecated in version 0.25.0, it looks like you are using an
outdated version of Pandas, and you should probably start by upgrading it.
Even in newer versions of Pandas you can still use item(), but on a NumPy
array (where, at least for now, it is not deprecated).
So change your code to:
df_melt.loc[...].values.item()
Another option is to use iloc[0], so you can also change your code to:
df_melt.loc[...].iloc[0]
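A quick sketch of the difference on a toy one-element Series: item() on the underlying NumPy array requires exactly one element, while iloc[0] simply takes the first.
import pandas as pd

s = pd.Series([46605])
s.iloc[0]         # 46605 - first element of the selection
s.values.item()   # 46605 - works only because the array has exactly one element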
Edit
The above solution can still raise an exception (IndexError) if df_melt
contains no row meeting the given criteria.
To make your code resistant to such cases (and return some default value instead),
you can add a function that gets the given attribute (attr, actually a
column) from the first row meeting the given criteria (gameweek, team,
and againstteam):
def getAttr(gameweek, team, againstteam, attr, default=None):
    xx = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                     & (df_melt['Team'] == team)
                     & (df_melt['AgainstTeam'] == againstteam)]
    return default if xx.empty else xx.iloc[0].loc[attr]
Then, instead of the three ... = df_melt.loc[...].item() instructions, run:
match_id = getAttr(gameweek, team, againstteam, 'MatchID', default=-1)
date = getAttr(gameweek, team, againstteam, 'Date')
home = getAttr(gameweek, team, againstteam, 'Home', default='????')
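Put together, the lookup part of the original loop then collapses to something like this sketch (reusing the defaults from above):
match_ids = []
home_away = []
dates = []
for row in df_pm.itertuples():
    match_ids.append(getAttr(row.GameWeek, row.ForTeam, row.AgainstTeam, 'MatchID', default=-1))
    dates.append(getAttr(row.GameWeek, row.ForTeam, row.AgainstTeam, 'Date'))
    home_away.append(getAttr(row.GameWeek, row.ForTeam, row.AgainstTeam, 'Home', default='????'))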

Creating column to keep track of missing values in another column

I am adding a mock dataframe to exemplify my problem.
I have a large dataframe in which some columns are missing values.
I would like to create some extra boolean columns in which 1 corresponds to a non missing value in the row and 0 corresponds to a missing value.
import numpy as np
import pandas as pd

names = ['Banana, Andrew Something (Maria Banana)', np.nan, 'Willis, Mr. Bruce (Demi Moore)', 'Crews, Master Terry', np.nan]
room = [100, 330, 212, 111, 222]
hotel_loon = {'Name': pd.Series(names), 'Room': pd.Series(room)}
hotel_loon_df = pd.DataFrame(hotel_loon)
In another question I found on Stack Overflow, the answers were thorough and clear on how to keep track of all the columns that have missing values, but not on how to do it for specific ones.
I tried a few variations of that code (namely using where), but I was not successful in creating what I wanted, which would be something like this:
                                       Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                       NaN   330             0             1
2            Willis, Mr. Bruce (Demi Moore)   212             1             1
3                       Crews, Master Terry   111             1             1
4                                       NaN   222             0             1
Thank you for your time. I am sure that in the end it will turn out to be trivial, but for some reason I got stuck.
To save some typing, use DataFrame.notnull, add some suffixes, and join the result back.
pd.concat([hotel_loon_df, hotel_loon_df.notnull().astype(int).add_suffix('_present')], axis=1)
                                       Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                       NaN   330             0             1
2            Willis, Mr. Bruce (Demi Moore)   212             1             1
3                       Crews, Master Terry   111             1             1
4                                       NaN   222             0             1
You can use .isnull() for your case, and change the type from bool to int:
hotel_loon_df['Name_present'] = (~hotel_loon_df['Name'].isnull()).astype(int)
hotel_loon_df['Room_present'] = (~hotel_loon_df['Room'].isnull()).astype(int)
Out[1]:
                                       Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                       NaN   330             0             1
2            Willis, Mr. Bruce (Demi Moore)   212             1             1
3                       Crews, Master Terry   111             1             1
4                                       NaN   222             0             1
The ~ operator is a logical NOT: it inverts each boolean, so ~isnull() is True exactly where a value is present.
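A tiny illustration of ~ on a boolean Series:
import pandas as pd
~pd.Series([True, False])
# 0    False
# 1     True
# dtype: bool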
If you are tracking only NaN fields, you can use the isnull() function and then map the booleans to 0/1:
df['Name_present'] = df['Name'].isnull()
df['Name_present'].replace(True, 0, inplace=True)
df['Name_present'].replace(False, 1, inplace=True)
df['Room_present'] = df['Room'].isnull()
df['Room_present'].replace(True, 0, inplace=True)
df['Room_present'].replace(False, 1, inplace=True)
We can do this in a concise manner by using DataFrame.isnull:
hotel_loon_df[['Name_present', 'Room_present']] = (~hotel_loon_df.isnull()).astype(int)
                                       Name  Room  Name_present  Room_present
0  Banana, Andrew Something (Maria Banana)   100             1             1
1                                       NaN   330             0             1
2            Willis, Mr. Bruce (Demi Moore)   212             1             1
3                       Crews, Master Terry   111             1             1
4                                       NaN   222             0             1

Getting KeyError after reading in pipe-separated CSV

I read in a pipe-separated CSV like this
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv")
test.head()
This returns
SK_Country|"Number"|"Alpha2Code"|"Alpha3Code"|"CountryName"|"TopLevelDomain"
0 1|20|"ad"|"and"|"Andorra"|".ad"
1 2|4|"af"|"afg"|"Afghanistan"|".af"
2 3|28|"ag"|"atg"|"Antigua and Barbuda"|".ag"
3 4|660|"ai"|"aia"|"Anguilla"|".ai"
4 5|8|"al"|"alb"|"Albania"|".al"
When I try and extract specific data from it, like below:
df = test[["Alpha3Code"]]
I get the following error:
KeyError: ['Alpha3Code'] not in index
I don't understand what's going wrong - I can see the value is in the CSV when I print the head, and likewise when I open the CSV, everything looks fine.
I've googled around and read some posts regarding this issue here on Stack Overflow, and tried different approaches, but nothing seems to fix this annoying problem.
Notice how everything is crammed into one string column? That's because you didn't specify the delimiter separating columns to pd.read_csv, which in this case has to be '|'.
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv",
                   sep='|')
test.head()
# SK_Country Number Alpha2Code Alpha3Code CountryName \
# 0 1 20 ad and Andorra
# 1 2 4 af afg Afghanistan
# 2 3 28 ag atg Antigua and Barbuda
# 3 4 660 ai aia Anguilla
# 4 5 8 al alb Albania
#
# TopLevelDomain
# 0 .ad
# 1 .af
# 2 .ag
# 3 .ai
# 4 .al
As pointed out in the comment by @chrisz, you have to specify the delimiter:
test = pd.read_csv("http://kejser.org/wp-content/uploads/2014/06/Country.csv", delimiter='|')
test.head()
SK_Country Number Alpha2Code Alpha3Code CountryName \
0 1 20 ad and Andorra
1 2 4 af afg Afghanistan
2 3 28 ag atg Antigua and Barbuda
3 4 660 ai aia Anguilla
4 5 8 al alb Albania
TopLevelDomain
0 .ad
1 .af
2 .ag
3 .ai
4 .al
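With the delimiter in place, the column selection from the question works as expected (a quick check, using the column names shown above):
df = test[["Alpha3Code"]]
df.head(3)
#   Alpha3Code
# 0        and
# 1        afg
# 2        atg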
