Can anyone help me in getting the name of the CurrencySymbol which have the highest count.
filt = df['Country'] == 'India'
df.loc[filt]['CurrencySymbol'].value_counts()
INR
4918
USD
133
AED
16
AUD
11
EUR
8
AMD
7
AFN
5
CAD
2
RON
2
AWG
2
JPY
2
ARS
2
AOA
2
when I tried this:
df.loc[filt]['CurrencySymbol'].value_counts().max()
It returns me 4918 But I want to return INR.
Because your value count is already ordered, just query the first element of the index of the value count.
pd.Series.value_counts().index[0]
Eg, for this trivial example:
df = pd.DataFrame([{'str': 'free peoples of middle earth'}])
df['str'].value_counts().index[0]
>>> 'free peoples of middle earth'
Related
I have a simple DataFrame like the following:
I want to select all values from the 'First Season' column and replace those that are over 1990 by 1. In this example, only Baltimore Ravens would have the 1996 replaced by 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But, it replaces all the values in that row by 1, and not just the values in the 'First Season' column.
How can I replace just the values from that column?
You need to select that column:
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 1950 1003
So the syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]
You can check the docs and also the 10 minutes to pandas which shows the semantics
EDIT
If you want to generate a boolean indicator then you can just use the boolean condition to generate a boolean Series and cast the dtype to int this will convert True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
Team First Season Total Games
0 Dallas Cowboys 0 894
1 Chicago Bears 0 1357
2 Green Bay Packers 0 1339
3 Miami Dolphins 0 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 0 1003
A bit late to the party but still - I prefer using numpy where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
df.loc[df['First season'] > 1990, 'First Season'] = 1
Explanation:
df.loc takes two arguments, 'row index' and 'column index'. We are checking if the value is greater than 1990 of each row value, under "First season" column and then we replacing it with 1.
df['First Season'].loc[(df['First Season'] > 1990)] = 1
strange that nobody has this answer, the only missing part of your code is the ['First Season'] right after df and just remove your curly brackets inside.
for single condition, ie. ( 'employrate'] > 70 )
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
therefore syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]
For multiple conditions ie. (df['employrate'] <=55) & (df['employrate'] > 50)
use this:
df['employrate'] = np.where(
(df['employrate'] <=55) & (df['employrate'] > 50) , 11, df['employrate']
)
out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
therefore syntax here is:
df['<column_name>'] = np.where((<filter 1> ) & (<filter 2>) , <new value>, df['column_name'])
Another option is to use a list comprehension:
df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]
You can also use mask which replaces the values where the condition is met:
df['First Season'].mask(lambda col: col > 1990, 1)
We can update the First Season column in df with the following syntax:
df['First Season'] = expression_for_new_values
To map the values in First Season we can use pandas‘ .map() method with the below syntax:
data_frame(['column']).map({'initial_value_1':'updated_value_1','initial_value_2':'updated_value_2'})
What im trying to achieve is to combine Name into one value using comma delimiter whenever Country column is duplicated, and sum the values in Salary column.
Current input :
pd.DataFrame({'Name': {0: 'John',1: 'Steven',2: 'Ibrahim',3: 'George',4: 'Nancy',5: 'Mo',6: 'Khalil'},
'Country': {0: 'USA',1: 'UK',2: 'UK',3: 'France',4: 'Ireland',5: 'Ireland',6: 'Ireland'},
'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
Name Country Salary
0 John USA 100
1 Steven UK 200
2 Ibrahim UK 200
3 George France 100
4 Nancy Ireland 50
5 Mo Ireland 100
6 Khalil Ireland 10
Expected output :
Row 1 & 2 (in inputs) got grupped into one since Country column is duplicated & Salary column is summed up.
Tha same goes for Row 4,5 & 6.
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
What i have tried, but im not sure how to combine text in Name column :
df.groupby(['Country'],as_index=False)['Salary'].sum()
[Out:]
Country Salary
0 France 100
1 Ireland 160
2 UK 400
3 USA 100
use groupby() and agg():
out=df.groupby('Country',as_index=False).agg({'Name':', '.join,'Salary':'sum'})
If needed unique values of 'Name' column then use :
out=(df.groupby('Country',as_index=False)
.agg({'Name':lambda x:', '.join(set(x)),'Salary':'sum'}))
Note: use pd.unique() in place of set() if order of unique values is important
output of out:
Country Name Salary
0 France George 100
1 Ireland Nancy, Mo, Khalil 160
2 UK Steven, Ibrahim 400
3 USA John 100
Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})
And to get the columns in order you can add [df.columns] to the pipe:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})[df.columns]
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
I have a dataframe df which contains company names that I need to neatly-format. The names are already in titlecase:
Company Name
0 Visa Inc
1 Msci Inc
2 Coca Cola Inc
3 Pnc Bank
4 Aig Corp
5 Td Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
Since many of the companies go by an acronym or an abbreviation (rows 1, 3, 4, 5), I want only the first string in those company names to be uppercase, like so:
Company Name
0 Visa Inc
1 MSCI Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
I know I can't get 100% accurate replacement, but I believe I can get close by uppercasing only the first string if:
it's 4 or fewer characters
and the first string is not a word in the dictionary
How can I achieve this with something like: df['Company Name'] = df['Company Name'].replace()?
So you can actually use the enchant module to find out if it is a dictionary word or not. Given you are still going to have some off results I.E. Uber.
Here is the code I came up with, sorry for the terrible names of variables and what not.
import enchant
import pandas as pd
def main():
d = enchant.Dict("en_US")
listofcompanys = ['Msci Inc',
'Coca Cola Inc',
'Pnc Bank',
'Aig Corp',
'Td Ameritrade',
'Uber Inc',
'Costco Inc',
'New York Times']
dataframe = pd.DataFrame(listofcompanys, columns=['Company Name'])
for index, name in dataframe.iterrows():
first_word = name['Company Name'].split()
is_word = d.check(first_word[0])
if not is_word:
name['Company Name'] = first_word[0].upper() + ' ' + first_word[1]
print(dataframe)
if __name__ == '__main__':
main()
Output for this was:
Company Name
0 MSCI Inc
1 Coca Cola Inc
2 PNC Bank
3 AIG Corp
4 TD Ameritrade
5 UBER Inc
6 Costco Inc
7 New York Times
Here's a working solution, which makes use of a english word list. Only it's not accurate for td and uber, but like you said, this will be hard to get 100% accurate.
url = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'
words = set(pd.read_csv(url, header=None)[0])
w1 = df['Company Name'].str.split()
m1 = ~w1.str[0].str.lower().isin(words) # is not an english word
m2 = w1.str[0].str.len().le(4) # first word is < 4 characters
df.loc[m1 & m2, 'Company Name'] = w1.str[0].str.upper() + ' ' + w1.str[1:].str.join(' ')
Company Name
0 Visa Inc
1 MSCI Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 Td Ameritrade
6 UBER Inc
7 Costco Inc
8 New York Times
Note: I also tried this with nltk package, but apparently, the nltk.corpus.words module is by far not complete with the english words.
You can first separate the first words and the other parts. Then filter those first words based on your logic:
company_list = ['Visa']
s = df['Company Name'].str.extract('^(\S+)(.*)')
mask = s[0].str.len().le(4) & (~s[0].isin(company_list))
df['Company Name'] = s[0].mask(mask, s[0].str.upper()) + s[1]
Output (notice that NEW in New York gets changed as well):
Company Name
0 Visa Inc
1 MSCI Inc
2 COCA Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 UBER Inc
7 Costco Inc
8 NEW York Times
This will get you first word from the string and make it upper only for those company names that are included in the include list:
import pandas as pd
import numpy as np
company_name = {'Visa Inc', 'Msci Inc', 'Coca Cola Ins', 'Pnc Bank'}
include = ['Msci', 'Pnc']
df = pd.DataFrame(company_name)
df.rename(columns={0: 'Company Name'}, inplace=True)
df['Company Name'] = df['Company Name'].apply(lambda x: x.split()[0].upper() + ' ' + x[len(x.split()[0].upper()):] if x.split()[0].strip() in include else x)
df['Company Name']
Output:
0 MSCI Inc
1 Coca Cola Ins
2 PNC Bank
3 Visa Inc
Name: Company Name, dtype: object
A manual workaround could be appending words like "uber"
from nltk.corpus import words
dict_words = words.words()
dict_words.append('uber')
create a new column
df.apply(lambda x : x['Company Name'].replace(x['Company Name'].split(" ")[0].strip(), x['Company Name'].split(" ")[0].strip().upper())
if len(x['Company Name'].split(" ")[0].strip()) <= 4 and x['Company Name'].split(" ")[0].strip().lower() not in dict_words
else x['Company Name'],axis=1)
Output:
0 Visa Inc
1 Msci Inc
2 Coca Cola Inc
3 PNC Bank
4 AIG Corp
5 TD Ameritrade
6 Uber Inc
7 Costco Inc
8 New York Times
Download the nltk package version by runnning:
import nltk
nltk.download()
Demo:
from nltk.corpus import words
"new" in words.words()
Output:
False
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe with these two, so that in 2nd dataframe, I have new columns with count of ethinicities from each companies, such as American -2 Mexican -5 and so on, so that later on, i can calculate diversity score.
the variables in the output dataframe is like,
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per groups by groupby with size and unstack, last join to second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
#slowier alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
EDIT: You need replace unit by 0 and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float).sort_values(ascending=False)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this, note that Rank has to be generated first:
In [140]:
df['Rank'] = (-1*df).groupby('topic').score.transform(np.argsort)
df['New_str'] = df.word + df.score.apply(' ({0:.2f})'.format)
df2 = df.sort(['Rank', 'score'])[['New_str', 'topic','Rank']]
print df2.pivot(index='Rank', values='New_str', columns='topic')
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)