Conditional function on multiple rows - python

I have a csv file like so:
            Landform  Number      Name  Class
0      Deltaic Plain     912        Lx    NaN
1  Hummock and Swale     912        Lx    NaN
2         Sand Dunes     912        Lx    NaN
3  Hummock and Swale     939  Woodbury    NaN
4         Sand Dunes     939  Woodbury    NaN
When Landform contains Deltaic Plain, Hummock and Swale, and Sand Dunes for a particular Name, I want to assign a value of 1 to Class.
When Landform contains only Hummock and Swale and Sand Dunes, I want to assign a value of 2 to Class.
My desired output is:
            Landform  Number      Name  Class
0      Deltaic Plain     912        Lx      1
1  Hummock and Swale     912        Lx      1
2         Sand Dunes     912        Lx      1
3  Hummock and Swale     939  Woodbury      2
4         Sand Dunes     939  Woodbury      2
I know how to do this for just 1 row like this:
def f(x):
    if x['Landform'] == 'Hummock and Swale':
        return '1'
    else:
        return '2'

df['Class'] = df.apply(f, axis=1)
but I am not sure how to group by Name and then create the conditional functions based on numerous rows.

The idea is to group on your Number column, and apply a function that looks at all the landforms in that group and returns an appropriate class. Here's an example:
def determineClass(landforms):
    if all(form in landforms.values for form in ('Deltaic Plain', 'Hummock and Swale', 'Sand Dunes')):
        return 1
    elif all(form in landforms.values for form in ('Hummock and Swale', 'Sand Dunes')):
        return 2
    # etc.
    else:
        # return "default" class
        return 0
>>> df.groupby('Number').Landform.apply(determineClass)
Number
912 1
939 2
Name: Landform, dtype: int64
If you want to assign the values back into the Class column, just use map on the Number column:
>>> classes = df.groupby('Number').Landform.apply(determineClass)
>>> df['Class'] = df.Number.map(classes)
>>> df
            Landform  Number      Name  Class
0      Deltaic Plain     912        Lx      1
1  Hummock and Swale     912        Lx      1
2         Sand Dunes     912        Lx      1
3  Hummock and Swale     939  Woodbury      2
4         Sand Dunes     939  Woodbury      2
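Equivalently, you can skip the intermediate Series and the map call with groupby().transform, which broadcasts each group's scalar result back onto its rows. A minimal sketch reusing determineClass from above:
# transform passes each group's Landform Series to determineClass and
# broadcasts the returned scalar to every row of that group
df['Class'] = df.groupby('Number')['Landform'].transform(determineClass)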

Merge two dataframes based on condition

I have these two dataframes:
sp_client
ConnectionID Value
0 CN01493292 495
1 CN01492424 440
2 CN01491959 403
3 CN01493200 312
4 CN01493278 282
.. ... ...
110 CN01492864 1
111 CN01492513 1
112 CN01492899 1
113 CN01493010 1
114 CN01493032 1
[115 rows x 2 columns]
sp_server
ConnectionID Value
1 CN01491920 2
1 CN01491920 2
3 CN01491922 2
3 CN01491922 2
5 CN01491928 2
.. ... ...
595 CN01493166 3
595 CN01493166 3
595 CN01493166 3
597 CN01493163 2
597 CN01493163 2
[673 rows x 2 columns]
I would like to merge them so that sp_client['Value'] is incremented by sp_server['Value'] wherever the rows satisfy the condition sp_server['ConnectionID'] == sp_client['ConnectionID'].
This was a little complicated for me; I tried a merge, but I am missing the condition part. Maybe it does not need to be merged in the first place. Happy to hear suggestions.
As per my comment, concatenate the tables and group by ConnectionID while summing the Value column, as in this example:
all_data = pd.concat([sp_server, sp_client])
all_data = all_data.groupby('ConnectionID')['Value'].sum().reset_index()
out:
ConnectionID Value
0 CN01491920 4
1 CN01491922 4
2 CN01491928 2
3 CN01491959 403
4 CN01492424 440
5 CN01493200 312
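If you would rather keep only sp_client's rows and add the matching server totals to them (instead of the union of both tables that concat produces), a sketch along these lines should also work:
# total server value per ConnectionID
server_totals = sp_server.groupby('ConnectionID')['Value'].sum()
# add the matching total to each client row; IDs absent from sp_server add 0
sp_client['Value'] = sp_client['Value'] + sp_client['ConnectionID'].map(server_totals).fillna(0)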

How can we create a Chord Diagram with a dataframe object?

I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
               labels='name', node_color=dim('index').str()))
That produces a nice-looking chord diagram.
The sample data is sourced from here.
https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas dataframe and 'nodes' is a holoviews dataset, and the type is like this.
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this...how can I feed a dataframe into a Chord Diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function. Let me explain why.
The Chord function expects at least one dataset (this can be a pandas DataFrame) with three columns, where all elements are numbers.
   source  target  value
0       1       0      1
1       2       0      8
2       3       0     10
A second dataset is optional. This one can take strings in its columns, to add labels for example.
   index name  group
0      0    a      0
1      1    b      0
2      2    c      0
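To make the expected call shape concrete, here is a minimal sketch built from the two tiny tables above (the node names are just placeholders, and every index referenced in the links must appear in the nodes):
import pandas as pd
import holoviews as hv
hv.extension('bokeh')

# links: all three columns numeric
links = pd.DataFrame({'source': [1, 2, 3], 'target': [0, 0, 0], 'value': [1, 8, 10]})
# nodes: optional second dataset; may carry string labels
nodes = hv.Dataset(pd.DataFrame({'index': [0, 1, 2, 3],
                                 'name': ['a', 'b', 'c', 'd'],
                                 'group': [0, 0, 0, 0]}), 'index')
hv.Chord((links, nodes))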
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your data into this basic form if you replace the strings in your DataFrame df with numbers, like this:
_df = df.copy()
values = list(_df.Measure.unique()) + list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
>>> _df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('Measure').str(),
               labels='Country',
               node_color=dim('index').str()))
As you can see, all the connection lines have only one of two colors. This is because the Measure column contains only two distinct values. Therefore I think this is not what you want.
Modified Example
Let's modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From': _list, 'To': _list[3:] + _list[:3], 'Value': df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
    _list = list(df[df['Measure'] == value].Country.unique())
    node = pd.concat([node, pd.DataFrame({'Name': _list, 'Group': i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again, and then we can call the Chord function once more.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)

hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
               labels='Name', node_color=dim('index').str()))
There are now two groups shown in the HoverTool.
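Since the string-to-number replacement is written out twice above, you could factor it into a small helper. A sketch (the encode function and its argument names are my own, not part of HoloViews):
def encode(frame, cols, labels):
    # labels must be in the same order as the node table's index,
    # so the integer codes in the links line up with the nodes
    mapping = {label: i for i, label in enumerate(labels)}
    out = frame.copy()
    for col in cols:
        out[col] = out[col].map(mapping)
    return out

# e.g. new_df = encode(new_df, ['From', 'To'], df.Country.unique())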

Creating column to keep track of missing values in another column

I am adding a mock dataframe to exemplify my problem.
I have a large dataframe in which some columns are missing values.
I would like to create some extra boolean columns in which 1 corresponds to a non missing value in the row and 0 corresponds to a missing value.
import numpy as np
import pandas as pd

names = ['Banana, Andrew Something (Maria Banana)', np.nan, 'Willis, Mr. Bruce (Demi Moore)', 'Crews, Master Terry', np.nan]
room = [100, 330, 212, 111, 222]
hotel_loon = {'Name': pd.Series(names), 'Room': pd.Series(room)}
hotel_loon_df = pd.DataFrame(hotel_loon)
In another question I found on Stack Overflow, the answers were super thorough and clear on how to keep track of all the columns that have missing values, but not for specific ones.
I tried a few variations of that code (namely using where), but I was not successful in creating what I wanted, which would be something like this:
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
Thank you for your time, I am sure that in the end it is going to be trivial, but for some reason I got stuck.
To save some typing, use DataFrame.notnull, add some suffixes, and join the result back.
pd.concat([df, df.notnull().astype(int).add_suffix('_present')], axis=1)
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
You can use .isnull() for your case, and change the type from bool to int:
hotel_loon_df['Name_present'] = (~hotel_loon_df['Name'].isnull()).astype(int)
hotel_loon_df['Room_present'] = (~hotel_loon_df['Room'].isnull()).astype(int)
Out[1]:
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
The ~ is the boolean NOT operator: it gives the opposite, i.e. things that are not null.
If you are only tracking NaN fields, you can use the isnull() function.
# replace with a dict so both substitutions happen in one pass; chaining
# replace(True, 0) and then replace(False, 1) can collide, because the
# freshly inserted 0s compare equal to False
df['name_present'] = df['name'].isnull().replace({True: 0, False: 1})
df['room_present'] = df['room'].isnull().replace({True: 0, False: 1})
We can do this in a concise manner by using DataFrame.isnull:
hotel_loon_df[['Name_present', 'Room_present']] = (~hotel_loon_df.isnull()).astype(int)
Name Room Name_present Room_present
0 Banana, Andrew Something (Maria Banana) 100 1 1
1 NaN 330 0 1
2 Willis, Mr. Bruce (Demi Moore) 212 1 1
3 Crews, Master Terry 111 1 1
4 NaN 222 0 1
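If you only want indicator columns for specific columns rather than the whole frame, a minimal sketch that loops over just the names you care about:
# flag only the columns of interest, leaving the rest of the frame alone
for col in ['Name']:  # extend this list as needed
    hotel_loon_df[col + '_present'] = hotel_loon_df[col].notnull().astype(int)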

Pandas - Count the number of rows that would be true for a function - for each input row

I have a dataframe that needs a column added to it. That column needs to be a count of all the other rows in the table that meet a certain condition, where the condition takes input from both the current row and the row being compared.
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
I'd want the height and weight of the row, as well as the height and weight of the other rows in a function, so I can do something like:
def example_function(height1, weight1, height2, weight2):
    if height1 > height2 and weight1 < weight2:
        return True
    else:
        return False
And it would just sum up all the True's and give that sum in the column.
Is something like this possible?
Thanks in advance for any ideas!
Edit: Sample input:
id name height weight country
0 Adam 70 180 USA
1 Bill 65 190 CANADA
2 Chris 71 150 GERMANY
3 Eric 72 210 USA
4 Fred 74 160 FRANCE
5 Gary 75 220 MEXICO
6 Henry 61 230 SPAIN
The result would need to be:
id name height weight country new_column
0 Adam 70 180 USA 1
1 Bill 65 190 CANADA 1
2 Chris 71 150 GERMANY 3
3 Eric 72 210 USA 1
4 Fred 74 160 FRANCE 4
5 Gary 75 220 MEXICO 1
6 Henry 61 230 SPAIN 0
I believe it will need to be some sort of function, as the actual logic I need to use is more complicated.
Edit 2: fixed typo.
You can add booleans, like this:
count = ((df.height1 > df.height2) & (df.weight1 < df.weight2)).sum()
EDIT:
I tested it a bit and then changed the conditions with a custom function:
def f(x):
    # check the boolean mask
    # print((df.height > x.height) & (df.weight < x.weight))
    return ((df.height < x.height) & (df.weight > x.weight)).sum()

df['new_column'] = df.apply(f, axis=1)
print(df)
id name height weight country new_column
0 0 Adam 70 180 USA 2
1 1 Bill 65 190 CANADA 1
2 2 Chris 71 150 GERMANY 3
3 3 Eric 72 210 USA 1
4 4 Fred 74 160 FRANCE 4
5 5 Gary 75 220 MEXICO 1
6 6 Henry 61 230 SPAIN 0
Explanation:
For each row, compare its values against every row in the frame; summing the boolean mask counts the rows where the condition is True.
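If the frame is large, the row-wise apply above gets slow, since it runs Python-level code once per row. A sketch of the same logic vectorized with NumPy broadcasting (it builds an n-by-n mask, so it trades memory for speed):
import numpy as np

h = df['height'].to_numpy()
w = df['weight'].to_numpy()
# mask[i, j] is True when row j is shorter and heavier than row i
mask = (h[None, :] < h[:, None]) & (w[None, :] > w[:, None])
df['new_column'] = mask.sum(axis=1)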
For example, if it was a dataframe describing people, and I wanted to make a column that counted how many people were taller than the current row and lighter.
As far as I understand, you want to assign to a new column something like
df['num_heigher_and_leighter'] = df.apply(lambda r: ((df.height > r.height) & (df.weight < r.weight)).sum(), axis=1)
However, your text description doesn't seem to match the outcome, which is:
0 2
1 3
2 0
3 1
4 0
5 0
6 6
dtype: int64
Edit
As in any other case, you can use a named function instead of a lambda:
df = ...

def foo(r):
    return ((df.height > r.height) & (df.weight < r.weight)).sum()

df['num_heigher_and_leighter'] = df.apply(foo, axis=1)
I'm assuming you had a typo and want to compare heights with heights and weights with weights. If so, you could count the number of persons taller OR heavier like so:
>>> for i,height,weight in zip(df.index,df.height, df.weight):
... cnt = df.loc[((df.height>height) & (df.weight>weight)), 'height'].count()
... df.loc[i,'thing'] = cnt
...
>>> df
name height weight country thing
0 Adam 70 180 USA 2.0
1 Bill 65 190 CANADA 2.0
2 Chris 71 150 GERMANY 3.0
3 Eric 72 210 USA 1.0
4 Fred 74 160 FRANCE 1.0
5 Gary 75 220 MEXICO 0.0
6 Henry 61 230 SPAIN 0.0
Here, for instance, no person is heavier than Henry, and no person is taller than Gary. If that's not what you intended, it should be easy to modify the & above to a |, or to switch the > to a <.
When you're more accustomed to Pandas, I suggest you use Ami Tavory's excellent answer instead.
PS. For the love of god, use the Metric system for representing weight and height, and convert to whatever for presentation. These numbers are totally nonsensical for the world population at large. :)

Get length of values in pandas dataframe column

I'm trying to get the length of each zipCd value in the dataframe mentioned below. When I run the code below I get 958 for every record. I'm expecting to get something more like '4'. Does anyone see what the issue is?
Code:
zipDfCopy['zipCd'].str.len()
Data:
print zipDfCopy[1:5]
Zip Code Place Name State State Abbreviation County \
1 544 Holtsville New York NY Suffolk
2 1001 Agawam Massachusetts MA Hampden
3 1002 Amherst Massachusetts MA Hampshire
4 1003 Amherst Massachusetts MA Hampshire
Latitude Longitude zipCd
1 40.8154 -73.0451 0 501\n1 544\n2 1001...
2 42.0702 -72.6227 0 501\n1 544\n2 1001...
3 42.3671 -72.4646 0 501\n1 544\n2 1001...
4 42.3919 -72.5248 0 501\n1 544\n2 1001...
One way is to convert to string and use pd.Series.map with the len builtin.
pd.Series.str provides vectorized string functions, while pd.Series.astype changes the column type. As an aside, every zipCd cell in your data appears to hold the string form of an entire Series (note the '0 501\n1 544\n...' contents), which is why .str.len() returns 958 for every row.
import pandas as pd
df = pd.DataFrame({'ZipCode': [341, 4624, 536, 123, 462, 4642]})
df['ZipLen'] = df['ZipCode'].astype(str).map(len)
# ZipCode ZipLen
# 0 341 3
# 1 4624 4
# 2 536 3
# 3 123 3
# 4 462 3
# 5 4642 4
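Equivalently, once the column is cast to str, the vectorized .str accessor from the question behaves as expected:
df['ZipLen'] = df['ZipCode'].astype(str).str.len()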
An arithmetic alternative, valid only for strictly positive integers, is np.log10:
import numpy as np
df['ZipLen'] = np.floor(np.log10(df['ZipCode'].values)).astype(int) + 1
