If I have this dataframe:
# data
data = [['london_1', 10,'london'], ['london_2', 15,'london'], ['london_3', 14,'london'],['london',49,'']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['station', 'info','parent_station'])
So:
station info parent_station
0 london_1 10 london
1 london_2 15 london
2 london_3 14 london
3 london 49
I would like to overwrite the info value of the child station according to the info value of the parent station:
station info parent_station
0 london_1 49 london
1 london_2 49 london
2 london_3 49 london
3 london 49
Is there a simple way to do that?
Additional information:
There could be more than one parent station, but only one parent station per station.
You can map each row's parent station to that parent's info value, then assign the result only to the rows that have a parent:
df.loc[df.parent_station.ne(''),'info'] = df.parent_station.map(df.set_index('station')['info'])
df
Out[329]:
station info parent_station
0 london_1 49.0 london
1 london_2 49.0 london
2 london_3 49.0 london
3 london 49.0
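Note that info comes back as float (49.0) in the output above, because the mapped Series contains NaN for the row without a parent. If the integer dtype matters, a minimal sketch, assuming every non-empty parent_station actually appears in the station column, is to fall back to the row's own value and cast back:

```python
import pandas as pd

data = [['london_1', 10, 'london'], ['london_2', 15, 'london'],
        ['london_3', 14, 'london'], ['london', 49, '']]
df = pd.DataFrame(data, columns=['station', 'info', 'parent_station'])

# Look up each row's parent in a station -> info Series.
# Rows whose parent_station is '' find no match and get NaN.
parent_info = df['parent_station'].map(df.set_index('station')['info'])

# Keep the row's own value where there is no parent, then restore the int dtype.
df['info'] = parent_info.fillna(df['info']).astype(int)
print(df)
```

This also works with several parent stations, as long as each station name is unique.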
I have a big dataset. It's about news reading. I'm trying to clean it. I created a checklist of cities that I want to keep (the set has all the cities). How can I drop the rows based on that checklist? For example, I have a checklist (as a list) that contains all the french cities. How can I drop other cities?
To picture the data frame (I have 1.5m rows btw):
City Age
0 Paris 25-34
1 Lyon 45-54
2 Kiev 35-44
3 Berlin 25-34
4 New York 25-34
5 Paris 65+
6 Toulouse 35-44
7 Nice 55-64
8 Hannover 45-54
9 Lille 35-44
10 Edinburgh 65+
11 Moscow 25-34
You can do this with pandas.Series.isin, which returns a boolean for each element indicating whether it is in the list x. You can then use those booleans to keep only the rows that return True, with df[df['City'].isin(x)]. Here is my solution:
import pandas as pd
x = ['Paris' , 'Marseille']
df = pd.DataFrame(data={'City':['Paris', 'London', 'New York', 'Marseille'],
'Age':[1, 2, 3, 4]})
print(df)
df = df[df['City'].isin(x)]
print(df)
Output:
City Age
0 Paris 1
1 London 2
2 New York 3
3 Marseille 4
City Age
0 Paris 1
3 Marseille 4
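Applied to the data from the question, the same idea might look like the sketch below. The french_cities list is a hypothetical checklist made up for illustration; the boolean mask can also be inverted with ~ to see which rows would be dropped.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'Lyon', 'Kiev', 'Berlin', 'New York', 'Paris',
             'Toulouse', 'Nice', 'Hannover', 'Lille', 'Edinburgh', 'Moscow'],
    'Age': ['25-34', '45-54', '35-44', '25-34', '25-34', '65+',
            '35-44', '55-64', '45-54', '35-44', '65+', '25-34'],
})

# Hypothetical checklist of cities to keep.
french_cities = ['Paris', 'Lyon', 'Toulouse', 'Nice', 'Lille', 'Marseille']

# True for rows whose City is in the checklist.
mask = df['City'].isin(french_cities)

kept = df[mask]       # rows to keep
dropped = df[~mask]   # rows that would be removed
print(kept)
```

The original index labels are preserved; call reset_index(drop=True) on the result if a fresh 0..n-1 index is wanted.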
I'll try my best to explain this as I had trouble phrasing the title. I have two dataframes. What I would like to do is add a column from df1 into df2 between every other column.
For example, df1 looks like this :
Age City
0 34 Sydney
1 30 Toronto
2 31 Mumbai
3 32 Richmond
And after adding in df2 it looks like this:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
In terms of code, I wasn't quite sure where to even start.
'''Concatenating the dataframes'''
for i in range len(df2):
pos = i+1
df3 = df2.insert
#df2 = pd.concat([df1, df2], axis=1).sort_index(axis=1)
#df2.columns = np.arange(len(df2.columns))
#print (df2)
I was originally going to run it through a loop, but I wasn't quite sure how to do it. Any help would be appreciated!
You can use itertools.zip_longest. For example:
from itertools import zip_longest
new_columns = [
    v
    for v in (c for a in zip_longest(df2.columns, df1.columns) for c in a)
    if v is not None
]
df_out = pd.concat([df1, df2], axis=1)[new_columns]
print(df_out)
Prints:
Name Age Clicks City Country
0 Ali 34 10 Sydney Australia
1 Lori 30 20 Toronto Canada
2 Asher 31 45 Mumbai United States
3 Lylah 32 33 Richmond United States
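For reference, here is a self-contained sketch of the same approach. The contents of df2 (Name, Clicks, Country) are an assumption reconstructed from the expected output in the question, which does not show df2 on its own.

```python
from itertools import zip_longest

import pandas as pd

df1 = pd.DataFrame({'Age': [34, 30, 31, 32],
                    'City': ['Sydney', 'Toronto', 'Mumbai', 'Richmond']})
# Assumed shape of df2, inferred from the expected output.
df2 = pd.DataFrame({'Name': ['Ali', 'Lori', 'Asher', 'Lylah'],
                    'Clicks': [10, 20, 45, 33],
                    'Country': ['Australia', 'Canada', 'United States', 'United States']})

# Pair the columns up: (Name, Age), (Clicks, City), (Country, None).
# zip_longest pads the shorter side with None, which is filtered out.
new_columns = [
    col
    for pair in zip_longest(df2.columns, df1.columns)
    for col in pair
    if col is not None
]

# Concatenate side by side, then reorder columns into the interleaved order.
df_out = pd.concat([df1, df2], axis=1)[new_columns]
print(df_out)
```

This relies on the two dataframes sharing the same row index and having no column names in common.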
I have two functions which shift a row of a pandas dataframe to the top or bottom, respectively. After applying them more than once to a dataframe, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
    """Shift row, given by index_to_shift, to bottom of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex(idx + [index_to_shift])
    return df

def shift_row_to_top(df, index_to_shift):
    """Shift row, given by index_to_shift, to top of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex([index_to_shift] + idx)
    return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country' : ['USA', 'GE', 'Russia', 'BR', 'France'],
'ID' : ['11', '22', '33','44', '55'],
'City' : ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
'short_name' : ['NY', 'Ber', 'Mosc','Lon', 'Pa']
})
This is my dataframe:
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
Now, apply function for the first time. Move row with index 0 to bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply function again. This time move row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when I apply the function a second time. The problem is analogous for the function shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to top / bottom of the dataframe? Maybe a pandas-function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
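idx is a list, and idx.pop(index_to_shift) removes the item at position index_to_shift of idx. That item is not necessarily the one whose value is index_to_shift, as the second call shows: after the first shift the list is [1, 2, 3, 4, 0], so position 2 holds the label 3.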
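The difference between removing by position and removing by value can be seen on a plain list, without pandas. This is only an illustration using the index labels from the question after its first shift:

```python
# Index labels after the first shift in the question: row 0 was moved to the end.
idx = [1, 2, 3, 4, 0]

# list.pop takes a position, so pop(2) removes the third item (label 3), not label 2.
by_position = idx.copy()
popped = by_position.pop(2)

# list.remove takes a value, so remove(2) drops label 2 wherever it sits.
by_value = idx.copy()
by_value.remove(2)

print(popped, by_position)  # 3 [1, 2, 4, 0]
print(by_value)             # [1, 3, 4, 0]
```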
Try this function:
def shift_row_to_bottom(df, index_to_shift):
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[idx + [index_to_shift]]
# call the function twice
for i in range(2):
    df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
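The question also asks about shift_row_to_top. A counterpart written the same way might look like this sketch, which assumes the index labels are unique (with duplicate labels, df.loc would return every matching row):

```python
import pandas as pd

def shift_row_to_top(df, index_to_shift):
    """Move the row labelled index_to_shift to the top, keeping index labels."""
    # Filter by label value rather than by list position.
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[[index_to_shift] + idx]

df = pd.DataFrame({'Country': ['USA', 'GE', 'Russia', 'BR', 'France'],
                   'ID': ['11', '22', '33', '44', '55']})

# Apply it twice with different labels to check that repeated calls behave.
df = shift_row_to_top(df, 3)
df = shift_row_to_top(df, 1)
print(df)
```

After both calls the index order is 1, 3, 0, 2, 4, and no row is lost or duplicated.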
I have two dataframes. The first one has 1000 rows and looks like this:
Date tri23_1 hsgç_T2 bbbj-1Y_jn Family Bonus
2011-06-09 qwer 1 rits Laavin 456
2011-07-09 ww 43 mayo Grendy 679
2011-09-10 wwer 44 ramya Fantol 431
2011-11-02 5 sam Gondow 569
The second dataframe contains all the unique values and also the hotels, that are associated to these values:
Group Hotel
tri23_1 Jamel
hsgç_T2 Frank
bbbj-1Y_jn Luxy
mlkl_781 Grand Hotel
vchs_94 Vancouver
My goal is to replace the column names of the first dataframe with the corresponding values from the Hotel column of the second dataframe. The output should look like this:
Date Jamel Frank Luxy Family Bonus
2011-06-09 qwer 1 rits Laavin 456
2011-07-09 ww 43 mayo Grendy 679
2011-09-10 wwer 44 ramya Fantol 431
2011-11-02 5 sam Gondow 569
Can I achieve this using Python?
You could try this, using to_dict():
df1.columns=[df2.set_index('Group').to_dict()['Hotel'][i] if i in df2.set_index('Group').to_dict()['Hotel'].keys() else i for i in df1.columns]
print(df1)
Output:
df1
Date tri23_1 hsgç_T2 bbbj-1Y_jn Family Bonus
0 2011-06-09 qwer 1 rits Laavin 456.0
1 2011-07-09 ww 43 mayo Grendy 679.0
2 2011-09-10 wwer 44 ramya Fantol 431.0
3 2011-11-02 5 sam Gondow 569
df2
Group Hotel
0 tri23_1 Jamel
1 hsgç_T2 Frank
2 bbbj-1Y_jn Luxy
3 mlkl_781 Grand Hotel
4 vchs_94 Vancouver
df1 changed
Date Jamel Frank Luxy Family Bonus
0 2011-06-09 qwer 1 rits Laavin 456.0
1 2011-07-09 ww 43 mayo Grendy 679.0
2 2011-09-10 wwer 44 ramya Fantol 431.0
3 2011-11-02 5 sam Gondow 569
Update: Explanation
First, if df2['Group'] isn't the index of df2, we set it as index.
Then convert the dataframe to a dict:
df2.set_index('Group').to_dict()
>>>{'Hotel': {'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}}
Then we select the value of key 'Hotel'
df2.set_index('Group').to_dict()['Hotel']
>>>{'tri23_1': 'Jamel', 'hsgç_T2': 'Frank', 'bbbj-1Y_jn': 'Luxy', 'mlkl_781': 'Grand Hotel', 'vchs_94': 'Vancouver'}
Then, column by column, we look up the column name in that dictionary. If the name doesn't exist among the keys of the dictionary, we just return the same name, e.g. Date, Family, Bonus:
i='Date'
i in df2.set_index('Group').to_dict()['Hotel'].keys() --->False
return 'Date'
...
i='tri23_1'
i in df2.set_index('Group').to_dict()['Hotel'].keys() --->True
return df2.set_index('Group').to_dict()['Hotel']['tri23_1']
...
...
#And so on...
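As an alternative to building the list by hand, DataFrame.rename accepts a dict and leaves any column not found in it unchanged, which gives the same "fall back to the original name" behaviour. A minimal sketch, using one row of the question's data for illustration:

```python
import pandas as pd

df1 = pd.DataFrame(
    [['2011-06-09', 'qwer', 1, 'rits', 'Laavin', 456]],
    columns=['Date', 'tri23_1', 'hsgç_T2', 'bbbj-1Y_jn', 'Family', 'Bonus'],
)
df2 = pd.DataFrame({
    'Group': ['tri23_1', 'hsgç_T2', 'bbbj-1Y_jn', 'mlkl_781', 'vchs_94'],
    'Hotel': ['Jamel', 'Frank', 'Luxy', 'Grand Hotel', 'Vancouver'],
})

# Build the Group -> Hotel lookup once.
mapping = dict(zip(df2['Group'], df2['Hotel']))

# Columns missing from the mapping (Date, Family, Bonus) keep their names;
# mapping keys that are not columns (mlkl_781, vchs_94) are ignored.
renamed = df1.rename(columns=mapping)
print(renamed.columns.tolist())
```

Building the mapping once also avoids recomputing set_index and to_dict for every column.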
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the second dataframe gets new columns with the count of each ethnicity per company (such as American: 2, Mexican: 5, and so on), so that later on I can calculate a diversity score.
The columns in the output dataframe would look like this:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get the counts per group using groupby with size and unstack, then join the result to the second DataFrame:
df1 = pd.DataFrame({'Company Name':list('aabcac'),
'Ethnicity':['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
#df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
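In the question, the second dataframe also lists companies that have no rows in the first one (for example Markel Corporation). After the join those companies get NaN rather than 0. A sketch of one way to handle that, using a few rows from the question and the made-up variable names directors and companies:

```python
import pandas as pd

directors = pd.DataFrame({
    'Company Name': ['Big Lots Inc.'] * 3 + ['Momentive Performance Materials Inc'],
    'Ethnicity': ['American', 'American', 'American', 'Asian'],
})
companies = pd.DataFrame({
    'Company Name': ['Big Lots Inc.', 'Momentive Performance Materials Inc',
                     'Markel Corporation'],
    'Net Sale': ['5.2B', '544M', '5.61B'],
})

# One row per company, one column per ethnicity, 0 where a pair never occurs.
counts = directors.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)

# Left join on the company name; companies without directors get NaN counts.
result = companies.join(counts, on='Company Name')

# Replace those NaN counts with 0 and go back to integers.
ethnicity_cols = list(counts.columns)
result[ethnicity_cols] = result[ethnicity_cols].fillna(0).astype(int)
print(result)
```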
EDIT: You need to replace the unit suffix with the matching number of zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0'*6, 'B': '0'*9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
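Appending zeros works for whole numbers, but the question's data also contains values such as 5.2B and 146m, and '5.2' followed by nine zeros still parses as 5.2. A hedged sketch that splits the number from the suffix and multiplies instead, assuming only M and B suffixes (in either case) occur:

```python
import pandas as pd

sales = pd.Series(['5.2B', '544M', '146m', '2.06B'])

# Assumed suffix meanings: M = million, B = billion.
multipliers = {'M': 1e6, 'B': 1e9}

# Capture the numeric part and the one-letter suffix as two columns (0 and 1).
parts = sales.str.extract(r'^([\d.]+)\s*([MBmb])$')

# Convert the number to float and scale it by the suffix's multiplier.
values = parts[0].astype(float) * parts[1].str.upper().map(multipliers)
print(values)
```

Strings that do not match the pattern end up as NaN, which makes unexpected formats easy to spot.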