I have a DataFrame (df) with following values:
Title
fintech_countries
US 60
UK 54
India 28
Australia 25
Germany 13
Singapore 11
Canada 10
I want to combine all the countries with values < 25 and show them as a single 'Others' row with their sum (34).
I have named the columns axis "countries" through the following code:
df1 = df.rename_axis('fintech_countries').rename_axis("countries", axis="columns" , inplace=True)
countries Title
fintech_countries
US 60
UK 54
India 28
Australia 25
Germany 13
Singapore 11
Canada 10
Now, I have tried the following code based on another query on StackOverflow:
df1.loc[df1['Title'] < 25, "countries"].sum()
but am getting the following error:
KeyError: 'the label [countries] is not in the [columns]'
Can someone please help? I need the final output as:
countries Title
fintech_countries
US 60
UK 54
India 28
Australia 25
Others 34
TIA
Solution: filter with boolean indexing, then use loc setting-with-enlargement to add the 'Others' row:
mask = df['Title'] < 25
print (mask)
fintech_countries
US False
UK False
India False
Australia False
Germany True
Singapore True
Canada True
Name: Title, dtype: bool
df1 = df[~mask].copy()
df1.loc['Others', 'Title'] = df.loc[mask, 'Title'].sum()
df1.Title = df1.Title.astype(int)
print (df1)
countries Title
fintech_countries
US 60
UK 54
India 28
Australia 25
Others 34
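As an alternative, a minimal sketch (assuming the same df and threshold as above) builds the 'Others' row separately and appends it with pd.concat, which keeps the integer dtype without the final astype(int):

mask = df['Title'] < 25
# one-row frame holding the sum of the small countries, with the same index name
others = pd.DataFrame({'Title': [df.loc[mask, 'Title'].sum()]},
                      index=pd.Index(['Others'], name=df.index.name))
df1 = pd.concat([df[~mask], others])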
What I need to do is to increment the ID based on the value in the column Country. I used this code:
i = 1
for row in new_cols5:
    new_cols5.loc[new_cols5.Country=='Germany', 'ID'] = 'GR' + str(i)
    new_cols5.loc[new_cols5.Country=='Italy', 'ID'] = 'IT' + str(i)
    new_cols5.loc[new_cols5.Country=='France', 'ID'] = 'FR' + str(i)
    i += 1
What I get is always the same number concatenated to the ID:
ID   Country
GR1  Germany
FR2  France
IT3  Italy
GR1  Germany
FR2  France
IT3  Italy
Desired output:
ID   Country
GR1  Germany
FR1  France
IT1  Italy
GR2  Germany
FR2  France
IT2  Italy
GR3  Germany
FR3  France
IT3  Italy
GR4  Germany
FR4  France
IT4  Italy
I would appreciate your help.
First, you could use print() to see what you get with new_cols5.loc[...].
new_cols5.loc[...] gives you all matching rows, and you assign the same value to all of those rows at once.
You would have to iterate over these rows to assign different values.
Or:
You could get the number of matching rows, build the list ["GR1", "GR2", ...], and assign that list. This doesn't need a for-loop:
matching = new_cols5.loc[new_cols5.Country=='Germany']
count = len(matching)
ids = [f"GR{i}" for i in range(1, count+1)]
new_cols5.loc[new_cols5.Country=='Germany', 'ID'] = ids
Or use only the True/False mask with sum(), which treats True as 1 and False as 0:
mask = (new_cols5.Country == 'Germany')
count = sum(mask)
ids = [f'GR{i}' for i in range(1, count+1)]
new_cols5.loc[new_cols5.Country == 'Germany', 'ID'] = ids
Minimal working code:
import pandas as pd
# --- columns ---
data = {
'ID': ['A','B','C','D','E','F','G','H','I'],
'Country': ['Germany','France','Italy','Germany','France','Italy','Germany','France','Italy'],
}
df = pd.DataFrame(data)
print(df)
# --- version 1 ---
matching = df.loc[df.Country=='Germany']
count = len(matching)
ids = [f"GR{i}" for i in range(1, count+1)]
df.loc[df.Country=='Germany', 'ID'] = ids
print(df)
# --- version 2 ---
count = sum(df.Country == 'France')
ids = [f'FR{i}' for i in range(1, count+1)]
df.loc[ df.Country == 'France', 'ID' ] = ids
print(df)
Result:
ID Country
0 A Germany
1 B France
2 C Italy
3 D Germany
4 E France
5 F Italy
6 G Germany
7 H France
8 I Italy
# version 1
ID Country
0 GR1 Germany
1 B France
2 C Italy
3 GR2 Germany
4 E France
5 F Italy
6 GR3 Germany
7 H France
8 I Italy
# version 2
ID Country
0 GR1 Germany
1 FR1 France
2 C Italy
3 GR2 Germany
4 FR2 France
5 F Italy
6 GR3 Germany
7 FR3 France
8 I Italy
EDIT:
A version which counts all values in the column and then replaces the IDs for all countries at once.
Note that it uses the first two letters of the country name, so GE instead of GR:
for country, count in df.Country.value_counts().items():
    short = country[:2].upper()
    ids = [f'{short}{i}' for i in range(1, count+1)]
    df.loc[df.Country == country, 'ID'] = ids
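A more vectorized sketch (assuming the same df, and taking the two-letter prefix from the country name as the EDIT above does) numbers the rows within each country with groupby().cumcount(), so no loop is needed at all:

counter = df.groupby('Country').cumcount() + 1   # 1-based position of each row within its country
df['ID'] = df.Country.str[:2].str.upper() + counter.astype(str)   # e.g. GE1, FR1, IT1, GE2, ...

The prefixes (GE, FR, IT) and column names are taken from the example above; adjust them to the real data.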
I have a pandas DataFrame with a column "Bio Location", and I would like to filter it so that I only keep the locations that contain one of the city names in my list. I have written the following code, which works except for one problem.
For example, if the location is "Paris France" and I have Paris in my list, then the row is returned. However, if it were "France Paris", it would not be returned. Do you have a solution? Maybe use regex? Thank you a lot!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['Paris', 'Bruxelles', 'Madrid']
values = df[df['Bio Location'].isin(cities)]
values.to_csv(r'results.csv', index = False)
What you want here is .str.contains():
1. The DF I used to test:
df = {
    'col1': ['Paris France', 'France Paris Test', 'France Paris',
             'Madrid Spain', 'Spain Madrid Test', 'Spain Madrid']
    # tested with a city once at the start, once in the middle and once at the end of a string
}
df = pd.DataFrame(df)
df
Result:
                col1
0       Paris France
1  France Paris Test
2       France Paris
3       Madrid Spain
4  Spain Madrid Test
5       Spain Madrid
2. Then applying the code below:
reg = ('Paris|Madrid')
df = df[df.col1.str.contains(reg)]
df
Result:
                col1
0       Paris France
1  France Paris Test
2       France Paris
3       Madrid Spain
4  Spain Madrid Test
5       Spain Madrid
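Applied back to the original question, a minimal sketch (assuming the column is named 'Bio Location' and the cities list from the question) builds the pattern from the list instead of hard-coding it:

import re

cities = ['Paris', 'Bruxelles', 'Madrid']
pattern = '|'.join(re.escape(c) for c in cities)   # escape the names in case one contains a regex metacharacter
values = df[df['Bio Location'].str.contains(pattern, case=False, na=False)]
values.to_csv('results.csv', index=False)

case=False makes the match case-insensitive, and na=False keeps rows with a missing 'Bio Location' out of the result instead of producing NaN.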
What I'm trying to achieve is to combine the Name values into one, using a comma delimiter, whenever the Country column is duplicated, and to sum the values in the Salary column.
Current input :
pd.DataFrame({'Name': {0: 'John',1: 'Steven',2: 'Ibrahim',3: 'George',4: 'Nancy',5: 'Mo',6: 'Khalil'},
'Country': {0: 'USA',1: 'UK',2: 'UK',3: 'France',4: 'Ireland',5: 'Ireland',6: 'Ireland'},
'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
Name Country Salary
0 John USA 100
1 Steven UK 200
2 Ibrahim UK 200
3 George France 100
4 Nancy Ireland 50
5 Mo Ireland 100
6 Khalil Ireland 10
Expected output:
Rows 1 & 2 (in the input) get grouped into one since the Country column is duplicated, and the Salary column is summed up.
The same goes for rows 4, 5 & 6.
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
What I have tried (but I'm not sure how to combine the text in the Name column):
df.groupby(['Country'],as_index=False)['Salary'].sum()
[Out:]
Country Salary
0 France 100
1 Ireland 160
2 UK 400
3 USA 100
Use groupby() and agg():
out=df.groupby('Country',as_index=False).agg({'Name':', '.join,'Salary':'sum'})
If you need only unique values of the 'Name' column, then use:
out=(df.groupby('Country',as_index=False)
.agg({'Name':lambda x:', '.join(set(x)),'Salary':'sum'}))
Note: use pd.unique() in place of set() if the order of unique values is important.
Output of out:
Country Name Salary
0 France George 100
1 Ireland Nancy, Mo, Khalil 160
2 UK Steven, Ibrahim 400
3 USA John 100
Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})
And to get the columns in the original order, you can add [df.columns] at the end of the pipe:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})[df.columns]
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
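A variant of the same idea (a sketch assuming the same df) uses named aggregation, which makes the output column names explicit:

out = (df.groupby('Country', as_index=False)
         .agg(Name=('Name', ', '.join), Salary=('Salary', 'sum')))
print(out)

The result matches the output above; only the aggregation syntax differs.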
I'll try my best to explain this, as I had trouble phrasing the title. I have two DataFrames. What I would like to do is insert the columns of df1 into df2, alternating them with df2's columns.
For example, df1 looks like this :
Age City
0 34 Sydney
1 30 Toronto
2 31 Mumbai
3 32 Richmond
And after inserting them into df2, it looks like this:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
In terms of code, I wasn't quite sure where to even start.
'''Concatenating the dataframes'''
for i in range(len(df2)):
    pos = i + 1
    df3 = df2.insert
#df2 = pd.concat([df1, df2], axis=1).sort_index(axis=1)
#df2.columns = np.arange(len(df2.columns))
#print (df2)
I was originally going to run it through a loop, but I wasn't quite sure how to do it. Any help would be appreciated!
You can use itertools.zip_longest. For example:
from itertools import zip_longest
new_columns = [
    v
    for v in (c for a in zip_longest(df2.columns, df1.columns) for c in a)
    if v is not None
]
df_out = pd.concat([df1, df2], axis=1)[new_columns]
print(df_out)
Prints:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
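An alternative sketch (assuming, as in the example above, that df1's i-th column should land right after df2's (i+1)-th original column) inserts the columns one at a time with DataFrame.insert:

out = df2.copy()
for i, col in enumerate(df1.columns):
    # each earlier insertion shifts the remaining columns right, hence the 2*i + 1 position
    out.insert(2 * i + 1, col, df1[col])
print(out)

This produces the same Name, Age, Clicks, City, Country order as above.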
I have two functions which shift a row of a pandas DataFrame to the top or bottom, respectively. After applying them more than once to a DataFrame, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
    """Shift row, given by index_to_shift, to bottom of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex(idx + [index_to_shift])
    return df

def shift_row_to_top(df, index_to_shift):
    """Shift row, given by index_to_shift, to top of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex([index_to_shift] + idx)
    return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country' : ['USA', 'GE', 'Russia', 'BR', 'France'],
'ID' : ['11', '22', '33','44', '55'],
'City' : ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
'short_name' : ['NY', 'Ber', 'Mosc','Lon', 'Pa']
})
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
That is my DataFrame. Now, apply the function for the first time and move the row with index 0 to the bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply the function again. This time, move the row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when I apply the function a second time. The problem is analogous for the function shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to the top / bottom of the DataFrame? Maybe a pandas function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
idx is a list, and idx.pop(index_to_shift) removes the item at position index_to_shift of idx, which does not necessarily have the value index_to_shift, as happens in your second call.
Try this function:
def shift_row_to_bottom(df, index_to_shift):
    # keep every index label except the one to move, then append that label at the end
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[idx + [index_to_shift]]
# call the function twice
for i in range(2): df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
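For completeness, a sketch of the same fix applied to shift_row_to_top, under the same assumptions:

def shift_row_to_top(df, index_to_shift):
    # keep every index label except the one to move, then put that label first
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[[index_to_shift] + idx]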