I have a pandas DataFrame df like this:
In [1]: df
Out[1]:
country count
0 Japan 78
1 Japan 80
2 USA 45
3 France 34
4 France 90
5 UK 45
6 UK 34
7 China 32
8 China 87
9 Russia 20
10 Russia 67
I want to remove rows with the maximum value in each group. So the result should look like:
country count
0 Japan 78
3 France 34
6 UK 34
7 China 32
9 Russia 20
My first attempt:
idx = df.groupby(['country'], sort=False).max()['count'].index
df_new = df.drop(list(idx))
My second attempt:
idx = df.groupby(['country'])['count'].transform(max).index
df_new = df.drop(list(idx))
But it didn't work. Any ideas?
groupby / transform('max')
You can first calculate a series of maximums by group, then filter out rows where count equals that series. Note this will also remove duplicated maximums.
g = df.groupby(['country'])['count'].transform('max')
df = df[~(df['count'] == g)]
The series g gives each row its group's maximum. Where this equals df['count'] (aligned by index), the row holds the maximum for its group. You then use ~ to negate the condition.
print(df.groupby(['country'])['count'].transform('max'))
0     80
1     80
2     45
3     90
4     90
5     45
6     45
7     87
8     87
9     67
10    67
Name: count, dtype: int64
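As a self-contained sketch of the whole recipe on the sample data (note USA disappears entirely, since its only row is its maximum):

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'country': ['Japan', 'Japan', 'USA', 'France', 'France',
                'UK', 'UK', 'China', 'China', 'Russia', 'Russia'],
    'count': [78, 80, 45, 34, 90, 45, 34, 32, 87, 20, 67],
})

# Keep only rows whose count is strictly below their group's maximum.
g = df.groupby('country')['count'].transform('max')
out = df[df['count'] != g]
```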
sort + drop
Alternatively, you can sort and drop the final occurrence:
res = df.sort_values('count')
res = res.drop(res.groupby('country').tail(1).index)
print(res)
country count
9 Russia 20
7 China 32
3 France 34
6 UK 34
0 Japan 78
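A third route, if only one row per group should go even when the maximum is tied, is idxmax, which returns the index label of the first maximum in each group. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['Japan', 'Japan', 'USA', 'France', 'France',
                'UK', 'UK', 'China', 'China', 'Russia', 'Russia'],
    'count': [78, 80, 45, 34, 90, 45, 34, 32, 87, 20, 67],
})

# idxmax gives one index label per group (the first maximum),
# so exactly one row per country is dropped.
kept = df.drop(df.groupby('country')['count'].idxmax())
```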
Related
I have a dataframe created by the following code:
dfHubR2I=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby(['SHOP_CODE', dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2I=dfHubR2I['median'].unstack('SHOP_CODE')
dfHubR2I=dfHubR2I.iloc[:date.month-1]
dfHubR2I
It looks like this:
shop code A B C D All Shops
ind
1 23 34 23 56 34
2 13 23 45 47 34
3 56 67 42 85 57
4 3 3 2 6 46
where ind is months and the letters are different shops
I then got the median across all the shops for each month from this code:
dfHubR2Imonthallshops=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby([dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2Imonthallshops=dfHubR2Imonthallshops.rename(columns={'median':'All Shops'})
dfHubR2Imonthallshops=dfHubR2Imonthallshops.iloc[:date.month-1]
dfHubR2Imonthallshops
which looks like this:
A B C D All shops
median 2 3 4 5 2
And I need to append it onto the bigger dataframe as a row, but when I try to use pd.concat I get the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I'm assuming it's because the larger dataframe has two levels, but I'm not sure how to go about getting my final desired result:
shop code A B C D All shops
ind
1 23 34 23 56 34
2 13 23 45 47 34
3 56 67 42 85 57
4 3 3 2 6 46
YTD 2 3 4 5 2
Have you tried to do it with an assignment?
dfHubR2I.loc['YTD', :] = dfHubR2Imonthallshops.loc['median', :]
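As a sketch on toy stand-ins for the two frames (hypothetical shop columns A and B), the label assignment aligns on column names, so the shop columns just need to match between the frames:

```python
import pandas as pd

# Toy stand-ins: months x shops, plus a one-row frame of medians.
big = pd.DataFrame({'A': [23, 13], 'B': [34, 23]},
                   index=pd.Index([1, 2], name='ind'))
small = pd.DataFrame({'A': [2], 'B': [3]}, index=['median'])

# loc assignment with a row label that doesn't exist yet appends a new row,
# aligning the assigned Series on column names.
big.loc['YTD', :] = small.loc['median', :]
```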
I have a DataFrame:
teamId pts xpts
Liverpool 82 59
Man City 57 63
Leicester 53 47
Chelsea 48 55
And I'm trying to add new columns that give the team's position by each column.
I want to get this:
teamId pts xpts №pts №xpts
Liverpool 82 59 1 2
Man City 57 63 2 1
Leicester 53 47 3 4
Chelsea 48 55 4 3
I tried to do something similar with the following code, but to no avail; the result is a list:
df = [df.sort_values(by=i, ascending=False).assign(new_col=lambda x: range(1, len(df) + 1)) for i in df.columns]
Use DataFrame.rank with DataFrame.add_prefix, and add the new DataFrame to the original with DataFrame.join:
c = ['pts','xpts']
df = df.join(df[c].rank(method='dense', ascending=False).astype(int).add_prefix('N'))
print (df)
teamId pts xpts Npts Nxpts
0 Liverpool 82 59 1 2
1 Man City 57 63 2 1
2 Leicester 53 47 3 4
3 Chelsea 48 55 4 3
Another idea, with a for loop and f-strings for the new column names:
c = ['pts','xpts']
for x in c:
    df[f'N{x}'] = df[x].rank(method='dense', ascending=False).astype(int)
print (df)
teamId pts xpts Npts Nxpts
0 Liverpool 82 59 1 2
1 Man City 57 63 2 1
2 Leicester 53 47 3 4
3 Chelsea 48 55 4 3
Another option, similar to the rank and join recipe shown by jezrael, is to use np.argsort, which operates on the DataFrame (pandas >= 1.2) and returns a DataFrame you can join. One caveat: argsort returns sort positions rather than ranks; the two happen to coincide for this data, but in general argsort must be applied twice to get ranks:
df.join(np.argsort(-df[['pts','xpts']], axis=0).add(1).add_prefix('No'))
teamId pts xpts Nopts Noxpts
0 Liverpool 82 59 1 2
1 Man City 57 63 2 1
2 Leicester 53 47 3 4
3 Chelsea 48 55 4 3
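To see why the double application matters in general, here is the same data in plain NumPy; argsort once gives sort positions, argsort twice gives ranks (they coincide above only because those particular permutations are their own inverses):

```python
import numpy as np

pts = np.array([82, 57, 53, 48])
xpts = np.array([59, 63, 47, 55])

# argsort of argsort converts sort positions into ranks (1 = highest).
rank_pts = np.argsort(np.argsort(-pts)) + 1
rank_xpts = np.argsort(np.argsort(-xpts)) + 1
```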
There are two files. If an ID number appears in both files, I want to take only Value 1 and Value 2 from File2.txt. Please let me know if my question is unclear.
File1.txt
ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy
File2.txt
0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51
The output should look like this.
ID Number Value 1 Value 2 Country
0001 5 6 Spain
0231 7 18 USA
4213 54 87 Canada
7541 32 29 Italy
New example:
File3.txt
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.739000 -114.635700
01711 CX 21 -- 3 32.779700 -115.567500
08433 CX 21 -- 2 31.919930 -123.321000
File4.txt
20022,32.45,-114.88
01192,32.839,-115.487
01711,32.88,-115.45
01218,32.717,-115.637
output
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.45 -114.88
01711 CX 21 -- 3 32.88 -115.45
08433 CX 21 -- 2 31.919930 -123.321000
The code I have so far:
f = open("File3.txt", "r")
x= open("File4.txt","r")
df1 = pd.read_csv(f, sep=' ', engine='python')
df2 = pd.read_csv(x, sep=' ', header=None, engine='python')
df2 = df2.set_index(0).rename_axis("#ID")
df2 = df2.rename(columns={5:'LATITUDE', 6: 'LONGITUDE'})
df1 = df1.set_index('#ID')
df1.update(df2)
print(df1)
Something like this, possibly:
file1_data = []
file1_headers = []
with open("File1.txt") as file1:
    for line in file1:
        file1_data.append(line.strip().split("\t"))
file1_headers = file1_data[0]
del file1_data[0]

file2_data = []
with open("File2.txt") as file2:
    for line in file2:
        file2_data.append(line.strip().split("\t"))
file2_ids = [x[0] for x in file2_data]

final_data = [file1_headers] + file1_data
for i in range(1, len(final_data)):
    if final_data[i][0] in file2_ids:
        match = [x for x in file2_data if x[0] == final_data[i][0]]
        # keep File2's ID and values, then re-append the country from File1
        final_data[i] = match[0] + [final_data[i][3]]

with open("output.txt", "w") as output:
    output.writelines(["\t".join(x) + "\n" for x in final_data])
final_data starts as file1_data with the header row re-attached; the loop then selectively replaces rows whose IDs match in file2_data, while keeping the country.
Okay, what you need to do here is to get the indexes to match in both dataframes after importing. This is important because pandas uses data alignment based on indexes.
Here is a complete example using your data:
from io import StringIO
import pandas as pd
File1txt=StringIO("""ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy""")
File2txt = StringIO("""0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51""")
df1 = pd.read_csv(File1txt, sep=r'\s\s+', engine='python')
df2 = pd.read_csv(File2txt, sep=r'\s\s+', header=None, engine='python')
print(df1)
# ID Number Value 1 Value 2 Country
# 0 1 23 55 Spain
# 1 231 15 23 USA
# 2 4213 10 11 Canada
# 3 7541 32 29 Italy
print(df2)
# 0 1 2
# 0 1 5 6
# 1 231 7 18
# 2 4213 54 87
# 3 5554 12 10
# 4 1111 31 13
# 5 6422 66 51
df2 = df2.set_index(0).rename_axis('ID Number')
df2 = df2.rename(columns={1:'Value 1', 2: 'Value 2'})
df1 = df1.set_index('ID Number')
df1.update(df2)
print(df1.reset_index())
Output:
ID Number Value 1 Value 2 Country
0 1 5.0 6.0 Spain
1 231 7.0 18.0 USA
2 4213 54.0 87.0 Canada
3 7541 32.0 29.0 Italy
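The same update recipe carries over to the File3/File4 example, with two adjustments worth noting: File4.txt is comma-separated, and reading the ID columns as str preserves the leading zeros that the integer parse above stripped (0001 became 1). A sketch using inline stand-ins for the files:

```python
from io import StringIO
import pandas as pd

# Inline stand-ins for File3.txt (whitespace-separated) and File4.txt (comma-separated).
File3txt = StringIO("""#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.739000 -114.635700
01711 CX 21 -- 3 32.779700 -115.567500
08433 CX 21 -- 2 31.919930 -123.321000""")
File4txt = StringIO("""20022,32.45,-114.88
01192,32.839,-115.487
01711,32.88,-115.45
01218,32.717,-115.637""")

# dtype=str on the ID columns keeps the leading zeros intact.
df1 = pd.read_csv(File3txt, sep=r'\s+', dtype={'#ID': str}).set_index('#ID')
df2 = pd.read_csv(File4txt, sep=',', header=None, dtype={0: str})
df2 = df2.set_index(0).rename_axis('#ID').rename(columns={1: 'LATITUDE', 2: 'LONGITUDE'})

# update overwrites LATITUDE/LONGITUDE in place wherever the IDs match.
df1.update(df2)
```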
I have the following pandas DataFrame.
name day h1 h2 h3 h4 h5
pepe 1 10 4 0 4 7
pepe 2 54 65 4 42 6
pepe 3 1 3 28 6 12
pepe 4 5 6 1 8 5
juan 1 78 9 2 65 4
juan 2 2 42 14 54 95
I want to obtain:
name day h1 h2 h3 h4 h5 sum
pepe 1 10 4 0 4 7
pepe 2 54 65 4 42 6 18
pepe 3 1 3 28 6 12 165
pepe 4 5 6 1 8 5 38
juan 1 78 9 2 65 4
juan 2 2 42 14 54 95 154
I've been searching the web, but without success.
The 38 in the sum column sits in pepe's day-4 row and is the sum of h1 to h4 from pepe's row for day 4-1 = 3.
It proceeds similarly for day 3 and day 2. Day 1 keeps an empty result in its sum cell.
The same must be done for juan, and so on for the other values of name.
How can I do it? Maybe it's better to try a loop using iterrows first, or something like that.
I would sum the rows based on the values. This is my favorite resource for complex loc calls; there are lots of options here:
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/
df.reset_index(inplace=True)
df.loc[df['name'] == 'pepe', 'sum'] = df[['h1', 'h2', 'h3', 'h4']].sum(axis=1)
or
df.reset_index(inplace=True)
df.groupby('name')[['h1', 'h2', 'h3', 'h4']].sum()
To use a loop, you would need df.itertuples():
df['sum'] = 0  # must initialize the column first
for row in df.itertuples():
    temp_sum = row.h1 + row.h2 + row.h3 + row.h4
    # may need to check for the final row of each 'name', or group by name first
    df.at[row.Index, 'sum'] = temp_sum
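Neither snippet above does the shift the question asks for (each day's sum cell holding the previous day's h1-h4 total). One way that appears to match the desired output is a row sum followed by a per-name shift:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['pepe', 'pepe', 'pepe', 'pepe', 'juan', 'juan'],
    'day': [1, 2, 3, 4, 1, 2],
    'h1': [10, 54, 1, 5, 78, 2],
    'h2': [4, 65, 3, 6, 9, 42],
    'h3': [0, 4, 28, 1, 2, 14],
    'h4': [4, 42, 6, 8, 65, 54],
})

# Sum h1..h4 per row, then shift down within each name so every day
# sees the previous day's total; each group's first day stays NaN.
df['sum'] = df[['h1', 'h2', 'h3', 'h4']].sum(axis=1).groupby(df['name']).shift()
```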
Suppose I have a large data set (in CSV format) like the following:
Country Age Salary Purchased
0 France 44 72000 No
1 Spain 27 48000 Yes
2 Germany 30 54000 No
3 Spain 38 61000 No
4 Germany 40 45000 Yes
5 France 35 58000 Yes
6 Spain 75 52000 No
7 France 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
Now how can I shuffle all the values of a selected column randomly? For example,
I want to shuffle all the values of the first column, 'Country', randomly.
Looking for your suggestions. Thanks in advance!
Shuffle in place using np.random.shuffle:
# <= 0.23
# np.random.shuffle(df['Country'].values)
# 0.24+
np.random.shuffle(df['Country'].to_numpy())
Or, assign back with np.random.choice (replace=False makes it a permutation):
df['Country'] = np.random.choice(df['Country'], len(df), replace=False)
permutation
np.random.seed([3, 1415])
df.assign(Country=df.Country.to_numpy()[np.random.permutation(len(df))])
Country Age Salary Purchased
0 France 44 72000 No
1 Germany 27 48000 Yes
2 France 30 54000 No
3 Spain 38 61000 No
4 France 40 45000 Yes
5 Spain 35 58000 Yes
6 Germany 75 52000 No
7 Spain 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
sample
df.assign(Country=df.Country.sample(frac=1).to_numpy())
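If the shuffle needs to be reproducible, sample also accepts a random_state; a small sketch on a cut-down version of the frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain'],
    'Age': [44, 27, 30, 38],
})

# sample(frac=1) returns a shuffled copy; to_numpy() discards the shuffled
# index so the values land back positionally rather than realigning.
df['Country'] = df['Country'].sample(frac=1, random_state=42).to_numpy()
```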