I have a dataset as shown below. I need to group a subset of the rows and fill their missing values with the group's mode. Specifically, the missing values for Tom from UK need to be filled: within the group of rows where Name is Tom and location is UK, the most frequent Value should replace each NaN.
The desired output fills every NaN in that group with the group's mode.
Here is the dataset:
Name location Value
Tom USA 20
Tom UK Nan
Tom USA Nan
Tom UK 20
Jack India Nan
Nihal Africa 30
Tom UK Nan
Tom UK 20
Tom UK 30
Tom UK 20
Tom UK 30
Sam UK 30
Sam UK 30
try:
df = df\
    .set_index(['Name', 'location'])\
    .fillna(
        df[df.Name.eq('Tom') & df.location.eq('UK')]
          .groupby(['Name', 'location'])
          .agg(pd.Series.mode)
          .to_dict()
    )\
    .reset_index()
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
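If you need to fill the NaNs of every (Name, location) group with that group's mode, not only Tom/UK, here is a minimal sketch using groupby.transform (my generalization, assuming the column names above; note it also fills groups like Tom/USA):
import pandas as pd

# Replace each group's NaNs with the group's most frequent value.
# mode() can return several values on a tie, so take the first;
# groups whose values are all NaN are left unchanged.
df['Value'] = df.groupby(['Name', 'location'])['Value'].transform(
    lambda s: s if s.mode().empty else s.fillna(s.mode().iloc[0])
)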
So I have data in a CSV file. Here is my code.
data = pd.read_csv('cast.csv')  # read_csv already returns a DataFrame, no pd.DataFrame() wrap needed
print(data)
The result looks like this.
title year name type \
0 Closet Monster 2015 Buffy #1 actor
1 Suuri illusioni 1985 Homo $ actor
2 Battle of the Sexes 2017 $hutter actor
3 Secret in Their Eyes 2015 $hutter actor
4 Steve Jobs 2015 $hutter actor
... ... ... ... ...
74996 Mia fora kai ena... moro 2011 Penelope Anastasopoulou actress
74997 The Magician King 2004 Tiannah Anastassiades actress
74998 Festival of Lights 2010 Zoe Anastassiou actress
74999 Toxic Tutu 2016 Zoe Anastassiou actress
75000 Fugitive Pieces 2007 Anastassia Anastassopoulou actress
character n
0 Buffy 4 31.0
1 Guests 22.0
2 Bobby Riggs Fan 10.0
3 2002 Dodger Fan NaN
4 1988 Opera House Patron NaN
... ... ...
74996 Popi voulkanizater 11.0
74997 Unicycle Race Attendant NaN
74998 Guidance Counselor 20.0
74999 Demon of Toxicity NaN
75000 Laundry Girl 25.0
[75001 rows x 6 columns]
I want to group the data by year and type. Then I want to know the size of each type in a specific year. So here is my code.
grouped = data.groupby(['year', 'type']).size()
print(grouped)
The result looks like this.
year type
1912 actor 1
actress 2
1913 actor 9
actress 1
1914 actor 38
..
2019 actress 3
2020 actor 3
actress 1
2023 actor 1
actress 2
Length: 220, dtype: int64
The problem is: what if I want to get the size data from 1910 until 2020 in steps of 10 years (per decade)? So the year index would be 1910, 1920, 1930, 1940, and so on until 2020.
I see two simple options.
1- round the years down to the lower 10:
group = data['year'] // 10 * 10  # floor to the lower multiple of 10 (.round(-1) would round to the *nearest* 10 instead)
grouped = data.groupby([group, 'type']).size()
2- use pandas.cut:
years = list(range(1910, 2031, 10))
# right=False makes the bins left-closed ([1910, 1920), ...),
# so that e.g. 1910 itself falls into the 1910 bin
group = pd.cut(data['year'], bins=years, labels=years[:-1], right=False)
grouped = data.groupby([group, 'type']).size()
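If you then want one row per decade and one column per type, the grouped Series can be unstacked; a small usage sketch (not part of the original answer):
# pivot the 'type' level into columns; missing decade/type pairs become 0
decade_table = grouped.unstack('type', fill_value=0)
print(decade_table)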
I have built a dataframe by extracting data with a scraper. One of the columns contains job titles, which currently look like this:
Title Research Number \
1 Dean NaN
2 Professor of Law NaN
3 Associate Dean for Information & Technology Se... NaN
4 Professor of Law\n NaN
5 Associate Dean for Faculty Development\nCharle... NaN
6 Associate Dean for Faculty Development\nCharle... NaN
7 Assistant Professor of Clinical Education & Di... NaN
8 Judge George Howard, Jr., Distinguished Profes... NaN
9 Visiting Assistant Professor of Law NaN
10 Associate Dean for Academic Affairs\nArkansas ... NaN
11 Distinguished Professor in Constitutional Law NaN
12 Assistant Professor of Law NaN
13 Instructor of Clinical Education; Supervising ... NaN
14 Associate Professor of Law NaN
15 Assistant Professor of Law\n NaN
16 Assistant Professor of Clinical Education; Tax... NaN
17 Assistant Professor of Law Librarianship; NaN
18 Byron M. Eiseman Distinguished Professor of Ta... NaN
19 Professor of Law\n NaN
20 Associate Professor of Law; Mediation Clinic D... NaN
21 Assistant Professor of Clinical Education; Fam... NaN
22 Assistant Professor of Clinical Education; Co... NaN
23 Associate Professor of Law\n NaN
24 Professor of Law Librarianship; Electronic Res... NaN
25 Professor of Law\n NaN
26 Professor of Law\n NaN
27 Associate Dean for Experiential Learning & Cli... NaN
28 Associate Professor of Law\n NaN
29 Assistant Professor of Clinical Education; Bus... NaN
30 Associate Professor of Law Librarianship; NaN
I would like to replace these titles with the following titles:
titles=["Adjunct Professor","Professor Emeritus","Associate Professor","Assistant Professor","Professor"]
How can I look for partial text and replace it? I don't want to fully replace the text if it's not a 100% match.
For example 'Visiting Assistant Professor of Law' should be replaced with 'Assistant Professor'
Thank you!
Use str.extract:
df['Title2'] = df['Title'].str.extract(f'({"|".join(titles)})')
output:
Title2
1 NaN
2 Professor
3 NaN
...
29 Assistant Professor
30 Associate Professor
If you want to keep the original Title in case of no match, use:
df['Title'] = df['Title'].str.extract(f'({"|".join(titles)})', expand=False).fillna(df['Title'])
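Two caveats worth adding (my notes, not part of the original answer): Python's re module returns the leftmost match, and among alternatives matching at the same position the first listed wins, so 'Professor Emeritus' must stay before the bare 'Professor' in titles, or those rows would extract just 'Professor'. And if the titles could ever contain regex metacharacters, escaping them first is safer:
import re

# escape any regex metacharacters before building the alternation
pattern = '|'.join(map(re.escape, titles))
df['Title2'] = df['Title'].str.extract(f'({pattern})', expand=False)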
Hope the title is not misleading.
I load an Excel file into a pandas dataframe as usual
df = pd.read_excel('complete.xlsx')
and this is what's inside (it is usually already sorted; this is a really small sample)
df
Out[21]:
Country City First Name Last Name Ref
0 England London John Smith 34
1 England London Bill Owen 332
2 England Brighton Max Crowe 25
3 England Brighton Steve Grant 55
4 France Paris Roland Tomas 44
5 France Paris Anatole Donnet 534
6 France Lyon Paulin Botrel 234
7 Spain Madrid Oriol Abarquero 34
8 Spain Madrid Alberto Olloqui 534
9 Spain Barcelona Ander Moreno 254
10 Spain Barcelona Cesar Aranda 222
What I need to do is automate an export of the data, creating a sqlite db for every country (e.g. 'England.sqlite'), which will contain a table for every city (e.g. London and Brighton), and every table will have the related personnel info.
The sqlite part is not a problem; I'm only trying to figure out how to "unpack" the dataframe in the most rapid and "pythonic" way.
Thanks
You can loop over the DataFrame.groupby object:
for i, subdf in df.groupby('Country'):
    print(i)
    print(subdf)
    # processing
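Building on that loop, a minimal sketch of the full export, assuming the column names shown above and that the Country and City values are safe to use as file and table names:
import sqlite3
import pandas as pd

# one sqlite database per country, one table per city inside it
for country, country_df in df.groupby('Country'):
    conn = sqlite3.connect(f'{country}.sqlite')
    for city, city_df in country_df.groupby('City'):
        # drop the grouping columns; they are implied by the db/table names
        city_df.drop(columns=['Country', 'City'])\
               .to_sql(city, conn, if_exists='replace', index=False)
    conn.close()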
How can I create new rows from an existing DataFrame by grouping by certain fields (in the example "Country" and "Industry") and applying some math to another field (in the example "Field" and "Value")?
Source DataFrame
df = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'Canada', 'Canada'],
                   'Industry': ['Finance', 'Finance', 'Retail', 'Retail', 'Energy', 'Energy', 'Retail', 'Retail'],
                   'Field': ['Import', 'Export', 'Import', 'Export', 'Import', 'Export', 'Import', 'Export'],
                   'Value': [100, 50, 80, 10, 20, 5, 30, 10]})
Country Industry Field Value
0 USA Finance Import 100
1 USA Finance Export 50
2 USA Retail Import 80
3 USA Retail Export 10
4 USA Energy Import 20
5 USA Energy Export 5
6 Canada Retail Import 30
7 Canada Retail Export 10
Target DataFrame
Net = Import - Export
Country Industry Field Value
0 USA Finance Net 50
1 USA Retail Net 70
2 USA Energy Net 15
3 Canada Retail Net 20
There are quite possibly many ways. Here's one using groupby and unstack:
(df.groupby(['Country', 'Industry', 'Field'], sort=False)['Value']
   .sum()
   .unstack('Field')
   .eval('Import - Export')
   .reset_index(name='Value'))
Country Industry Value
0 USA Finance 50
1 USA Retail 70
2 USA Energy 15
3 Canada Retail 20
IIUC, subtract the Export values from the Import values on an aligned index (the question defines Net = Import - Export):
df = df.set_index(['Country', 'Industry'])
Newdf = (df.loc[df.Field == 'Import', 'Value'] - df.loc[df.Field == 'Export', 'Value'])\
        .reset_index().assign(Field='Net')
Newdf
  Country Industry  Value Field
0     USA  Finance     50   Net
1     USA   Retail     70   Net
2     USA   Energy     15   Net
3  Canada   Retail     20   Net
Use pivot_table:
df.pivot_table(index=['Country', 'Industry'], columns='Field', values='Value', aggfunc='sum').\
    diff(axis=1).\
    dropna(axis=1).\
    rename(columns={'Import': 'Value'}).\
    reset_index()
Out[112]:
Field Country Industry Value
0 Canada Retail 20.0
1 USA Energy 15.0
2 USA Finance 50.0
3 USA Retail 70.0
You can do it this way to add those rows to your original dataframe:
df.set_index(['Country', 'Industry', 'Field'])\
  .unstack()['Value']\
  .eval('Net = Import - Export')\
  .stack().rename('Value').reset_index()
Output:
Country Industry Field Value
0 Canada Retail Export 10
1 Canada Retail Import 30
2 Canada Retail Net 20
3 USA Energy Export 5
4 USA Energy Import 20
5 USA Energy Net 15
6 USA Finance Export 50
7 USA Finance Import 100
8 USA Finance Net 50
9 USA Retail Export 10
10 USA Retail Import 80
11 USA Retail Net 70
You can use GroupBy.diff() to take the within-group difference, then recreate the Field column, and finally drop the leftover rows with DataFrame.dropna:
# diff() gives Export - Import within each (Country, Industry) pair, so
# abs() flips the sign; this relies on each group having exactly one
# Import row followed by one Export row
df['Value'] = df.groupby(['Country', 'Industry'])['Value'].diff().abs()
df['Field'] = 'Net'
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Country Industry Field Value
0 USA Finance Net 50.0
1 USA Retail Net 70.0
2 USA Energy Net 15.0
3 Canada Retail Net 20.0
This answer takes advantage of the fact that pandas puts the group keys in the MultiIndex of the resulting Series. (If there were only one group key, you could use loc.)
>>> s = df.groupby(['Country', 'Industry', 'Field'])['Value'].sum()
>>> s.xs('Import', axis=0, level='Field') - s.xs('Export', axis=0, level='Field')
Country Industry
Canada Retail 20
USA Energy 15
Finance 50
Retail 70
Name: Value, dtype: int64
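To reshape that Series into the target DataFrame layout, a short follow-up sketch (my addition, building on s above):
net = s.xs('Import', level='Field') - s.xs('Export', level='Field')
result = net.rename('Value').reset_index().assign(Field='Net')
result = result[['Country', 'Industry', 'Field', 'Value']]  # match the target column order
print(result)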