This simple code draws line chart as expected:
james_f=names[(names.name=='James') & (names.sex=='F')]
plt.plot(james_f['year'],james_f['births'])
plt.show()
But then I change condition, just delete one of them, and then it starts to draw bar chart. Why and how to force to draw line chart?
james_f=names[(names.name=='James')]
plt.plot(james_f['year'],james_f['births'])
plt.show()
Adding instead of it 1==1 rule, nothing changes(
james_f=names[(names.name=='James') & ( 1 == 1)]
plt.plot(james_f['year'],james_f['births'])
plt.show()
Even this code draws barchart:
james_f=names[(names.name=='James') | (names.name=='John') | (names.name=='Robert') ]
plt.plot(james_f['year'],james_f['births'])
james_f['births'] output (pandas.core.series.Series):
228 46
343 22
538 11
942 9655
944 5927
2312 26
2329 24
2617 9
2938 8769
....
Name: births, dtype: int64
james_f['births'].min() return 7 There is no zero or NaN values
>>> print(james_f[james_f['births'].isnull()])
Empty DataFrame
Columns: [name, sex, births, year]
Index: []
>>> james_f.head(10)
name sex births year
343 James F 22 1880
944 James M 5927 1880
2329 James F 24 1881
2940 James M 5441 1881
4372 James F 18 1882
4965 James M 5892 1882
6428 James F 25 1883
7118 James M 5223 1883
8488 James F 33 1884
9320 James M 5693 1884
Not filtering on gender yields two observations per year: one for women and one for men. The numbers of men and women with name 'James' are vastly different making the plot appear very noisy. You have (at least) two options:
(1) Sum up the number of men and women like so.
james = names[names.name == 'james']
years = []
births = []
for year, subset in james.groupby('year'):
years.append(year)
births.append(subset.births.sum())
plt.plot(years, births)
Someone with more pandas skills can probably write this as one line.
(2) Plot two separate lines for men and women like so.
james = names[names.name == 'james']
for sex, subset in james.groupby('sex'):
plt.plot(subset.year, subset.births, label=sex)
plt.legend()
Related
My dataframe looks like:
School
Term
Students
A
summer 2020
324
B
spring 21
101
A
summer/spring
201
F
wintersem
44
C
fall trimester
98
E
23
I need to add a new column Termcode that assumes any of the 6 values:
summer, spring, fall, winter, multiple, none based on corresponding value in the Term Column, viz:
School
Term
Students
Termcode
A
summer 2020
324
summer
B
spring 21
101
spring
A
summer/spring
201
multiple
F
wintersem
44
winter
C
fall trimester
98
fall
E
23
none
You can use a regex with str.extractall and filling of the values depending on the number of matches:
terms = ['summer', 'spring', 'fall', 'winter']
regex = r'('+'|'.join(terms)+r')'
# '(summer|spring|fall|winter)'
# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)
# get the first match, replace with "multiple" if more than one
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')
# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')
output:
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 none
Series.findall
l = ['summer', 'spring', 'fall', 'winter']
s = df['Term'].str.findall(fr"{'|'.join(l)}")
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 NaN
I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
labels='name', node_color=dim('index').str()))
That makes this, which looks nice.
[![enter image description here][1]][1]
The sample data is sourced from here.
https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas dataframe and 'nodes' is a holoviews dataset, and the type is like this.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this...how can I feed a dataframe into a Chord Diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function. Let me explain why I think so?
The Chord-function expects at least on dataset (this can be a pandas DataFrame) with three columns, but all elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. This can take strings in the second columns to add labels for example.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your date in the basic form, if you replace the strings in your DataFrame df by numbers like this:
_df = df.copy()
values = list(_df.Measure.unique())+list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
return d[s]
_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20',
edge_color=dim('Measure').str(),
labels='Country',
node_color=dim('index').str()))
As you can see, all the conection lines only have one of two colors. This is because in the Measure column are only two elements. Therefor I think, this is not what you want.
Modificated Example
Let's Modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
_list = list(df[df['Measure']==value].Country.unique())
node = pd.concat([node, pd.DataFrame({'Name':_list, 'Group':i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again and can call the Chord-function again.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}
def str2num(s):
return d[s]
new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)
hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
labels='Name', node_color=dim('index').str()
)
)
The are now two groups added to the HoverTool.
I have a data frame with 20 values, and I am trying to bar.plot it using matplotlib. when I do it, I am not seeing the 20 bars but 10. I have 5 nana values in it and 4 of them.
Here is a sample of dataframe:
Name Bonus
Jack Carpenter 890
John Clegg 653
Mike Holiday 367
Rene Moukad 900
........... ...
my code is standard:
fig,ax = plt.subplots(figsize=(16,6))
plt.bar(df.Name, df.Bonus)
fig.autofmt_xdate(rotation=45)
I have a data frame like this:
Code Group Name Number
ABC Group_1_ABC Mike 40
Amber 60
Group_2_ABC Rachel 90
XYZ Group_1_XYZ Bob 30
Peter 75
Nikki 55
Group_2_XYZ Julia 23
Ross 80
LMN Group_1_LMN Paul 95
. . . .
. . . .
I have created this data frame by grouping by code, group, name and summing the number.
Now i want to calculate the percentage of each name for a particular code. For that i want to sum all the numbers that are part of one code. I was doing this to calculate the percentage.
df['Percentage']= (df['Number']/df['??'])*100
Now for the total sum part for each group, I can`t figure out how to calculate it? I want the total sum for each code category, in order to calculate the percentage.
So for example for Code: ABC the total should be 40+60+90=190. This 190 would than be divided with all the number for each user in ABC to calculate their percentage for their respective code category. So technically the column group and name don`t have any role in calculating the total sum for each code category.
Use GroupBy.transform by first level or by level name Code:
df['Percentage']= (df['Number']/df.groupby(level=0)['Number'].transform('sum'))*100
df['Percentage']= (df['Number']/df.groupby(level=['Code'])['Number'].transform('sum'))*100
Or in last pandas versions is not necessary specified level parameter:
df['Percentage']= (df['Number']/df.groupby('Code')['Number'].transform('sum'))*100
print (df)
Number Percentage
Code Group Name
ABC Group_1_ABC Mike 40 21.052632
Amber 60 31.578947
Group_2_ABC Rachel 90 47.368421
XYZ Group_1_XYZ Bob 30 11.406844
Peter 75 28.517110
Nikki 55 20.912548
Group_2_XYZ Julia 23 8.745247
Ross 80 30.418251
LMN Group_1_LMN Paul 95 100.000000
Detail:
print (df.groupby(level=0)['Number'].transform('sum'))
Code Group Name
ABC Group_1_ABC Mike 190
Amber 190
Group_2_ABC Rachel 190
XYZ Group_1_XYZ Bob 263
Peter 263
Nikki 263
Group_2_XYZ Julia 263
Ross 263
LMN Group_1_LMN Paul 95
Name: Number, dtype: int64
I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, that condition needs to take in input both from the "input" row and the "output" row.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
if height1 > height2 and weight1 < weight2:
return True
else:
return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
edit 2:fixed typo
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I test it a bit and then change conditions with custom function:
def f(x):
#check boolean mask
#print ((df.height > x.height) & (df.weight < x.weight))
return ((df.height < x.height) & (df.weight > x.weight)).sum()
df['new_column'] = df.apply(f, axis=1)
print (df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row compare values and for count simply sum values True.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...
def foo(r):
return ((df.height > r.height) & (df.weight < r.weight)).sum()
df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons taller OR heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here for instance, no person is Heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to modify the & above to a | instead or switching out the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)