Suppose I have a large dataset (in CSV format) like the following:
Country Age Salary Purchased
0 France 44 72000 No
1 Spain 27 48000 Yes
2 Germany 30 54000 No
3 Spain 38 61000 No
4 Germany 40 45000 Yes
5 France 35 58000 Yes
6 Spain 75 52000 No
7 France 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
Now, how can I shuffle all the values of a selected column randomly? For example,
I want to shuffle all the values of the first column, 'Country', randomly.
Looking forward to your suggestions. Thanks in advance!
Shuffle in place using np.random.shuffle:
# pandas <= 0.23
# np.random.shuffle(df['Country'].values)
# pandas 0.24+
np.random.shuffle(df['Country'].to_numpy())
Or, assign back with np.random.choice; with replace=False every value is drawn exactly once, i.e. a random permutation:
df['Country'] = np.random.choice(df['Country'], len(df), replace=False)
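A minimal end-to-end sketch on a toy frame shaped like the one above; assigning the result of np.random.choice back avoids relying on to_numpy() returning a view of the underlying data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany'],
    'Age': [44, 27, 30, 38, 40],
})

# replace=False draws each value exactly once -> a random permutation
df['Country'] = np.random.choice(df['Country'], len(df), replace=False)

# Only the order changes; the multiset of values is preserved
print(sorted(df['Country']))  # -> ['France', 'Germany', 'Germany', 'Spain', 'Spain']
```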
permutation
np.random.seed([3, 1415])
df.assign(Country=df.Country.to_numpy()[np.random.permutation(len(df))])
Country Age Salary Purchased
0 France 44 72000 No
1 Germany 27 48000 Yes
2 France 30 54000 No
3 Spain 38 61000 No
4 France 40 45000 Yes
5 Spain 35 58000 Yes
6 Germany 75 52000 No
7 Spain 48 79000 Yes
8 Germany 50 83000 No
9 France 37 67000 Yes
sample
df.assign(Country=df.Country.sample(frac=1).to_numpy())
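A short runnable sketch of the sample route; sample(frac=1) returns every row in random order, and to_numpy() strips the shuffled index so the values assign back positionally (random_state is optional, used here only for reproducibility):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany'],
                   'Age': [44, 27, 30]})

# sample(frac=1) shuffles the rows; to_numpy() discards the shuffled
# index so the values align positionally on assignment
shuffled = df.assign(Country=df['Country'].sample(frac=1, random_state=0).to_numpy())

# The other columns are untouched
print(shuffled['Age'].tolist())  # -> [44, 27, 30]
```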
Having a dataset as below, I need to group a subset by column and fill its missing values using the mode. Specifically, the missing values for Tom from the UK need to be filled: I group the rows for Tom from the UK, and within that group the most frequent value should replace each missing value. In other words, all the NaN values in that group need to be replaced using the mode.
Attaching the dataset:
Name   location  Value
Tom    USA        20
Tom    UK        NaN
Tom    USA       NaN
Tom    UK         20
Jack   India     NaN
Nihal  Africa     30
Tom    UK        NaN
Tom    UK         20
Tom    UK         30
Tom    UK         20
Tom    UK         30
Sam    UK         30
Sam    UK         30
Try:
df = (
    df.set_index(['Name', 'location'])
      .fillna(
          df[df.Name.eq('Tom') & df.location.eq('UK')]
          .groupby(['Name', 'location'])
          .agg(pd.Series.mode)
          .to_dict()
      )
      .reset_index()
)
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
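The same fill can also be sketched more directly with a boolean mask; a simplified, runnable toy version of the data above (only the Tom/UK group is touched):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name':     ['Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom', 'Tom'],
    'location': ['USA', 'UK',  'USA', 'UK',  'UK',  'UK',  'UK'],
    'Value':    [20, np.nan, np.nan, 20, np.nan, 20, 30],
})

group = df.Name.eq('Tom') & df.location.eq('UK')

# Mode of the Tom/UK subgroup (Series.mode ignores NaN by default)
fill = df.loc[group, 'Value'].mode()[0]

# Fill only that subgroup's missing values
df.loc[group & df.Value.isna(), 'Value'] = fill
```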
Consider the following pandas DataFrame:
   ID region
0  12    FRA
1  99    GER
2  13    ESP
3  69     UK
4  17    GER
5  02    GER
Using the following code:
dictionary = {'GER': 'Germany', 'FRA': 'France'}
df['region'] = df['region'].map(dictionary)
I get the following result:
   ID   region
0  12   France
1  99  Germany
2  13      NaN
3  69      NaN
4  17  Germany
5  02  Germany
What I want is for values that do not appear in the dictionary to keep their original values:
   ID   region
0  12   France
1  99  Germany
2  13      ESP
3  69       UK
4  17  Germany
5  02  Germany
How could I do this? Thank you in advance.
I think what you want is this:
df.replace({"region": dictionary})
Use fillna (or combine_first):
df['region'] = df['region'].map(dictionary).fillna(df['region'])
or take advantage of the get method to set the value as default:
df['region'] = df['region'].map(lambda x: dictionary.get(x, x))
output:
ID region
0 12 France
1 99 Germany
2 13 ESP
3 69 UK
4 17 Germany
5 2 Germany
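All three variants can be checked side by side; a small runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['12', '99', '13', '69'],
                   'region': ['FRA', 'GER', 'ESP', 'UK']})
dictionary = {'GER': 'Germany', 'FRA': 'France'}

a = df['region'].replace(dictionary)                   # unmapped values kept
b = df['region'].map(dictionary).fillna(df['region'])  # NaNs backfilled
c = df['region'].map(lambda x: dictionary.get(x, x))   # default = original

print(a.tolist())  # -> ['France', 'Germany', 'ESP', 'UK']
```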
I have a pandas DataFrame which looks like this:
Region Sub Region Country Size Plants Birds Mammals
Africa Northern Africa Algeria 2380000 22 41 15
Egypt 1000000 8 58 14
Libya 1760000 7 32 8
Sub-Saharan Africa Angola 1250000 34 53 32
Benin 115000 20 40 12
Western Africa Cape Verde 4030 51 35 7
Americas Latin America Antigua 440 4 31 3
Argentina 2780000 70 42 52
Bolivia 1100000 106 8 55
Northern America Canada 9980000 18 44 24
Grenada 340 3 29 2
USA 9830000 510 251 91
Asia Central Asia Kazakhstan 2720000 14 14 27
Kyrgyz 200000 13 3 15
Uzbekistan 447000 16 7 19
Eastern Asia China 9560000 593 136 96
Japan 378000 50 77 49
South Korea 100000 31 28 33
So I am trying to prompt the user to input a value and, if the input exists within the 'Sub Region' level of the index, perform a particular task.
I tried turning the 'Sub Region' level into a list and iterating through it to match the user input:
sub_region_list = []
for i in world_data.index.values:
    sub_region_list.append(i[1])
print(sub_region_list[0])
That is not the output I had in mind. I believe there is an easier way to do this, but I cannot seem to figure it out.
You can use get_level_values and test membership directly:
sub_region = input("Enter a sub region:")
if sub_region not in df.index.get_level_values('Sub Region'):
    raise ValueError("You must enter a valid sub-region")
If you want to save the column values in a list, try:
df.index.get_level_values("Sub Region").unique().to_list()
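A self-contained sketch (with a tiny hypothetical MultiIndex frame) showing the membership test and the list extraction:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('Africa', 'Northern Africa'), ('Africa', 'Northern Africa'),
     ('Africa', 'Sub-Saharan Africa')],
    names=['Region', 'Sub Region'])
df = pd.DataFrame({'Plants': [22, 8, 34]}, index=idx)

sub_regions = df.index.get_level_values('Sub Region')

# Membership test works directly on the Index
print('Northern Africa' in sub_regions)  # -> True

# Unique values as a plain Python list
print(sub_regions.unique().to_list())    # -> ['Northern Africa', 'Sub-Saharan Africa']
```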
I have a huge DataFrame such as:
country1 import1 export1 country2 import2 export2
0 USA 12 82 Germany 12 82
1 Germany 65 31 France 65 31
2 England 74 47 Japan 74 47
3 Japan 23 55 England 23 55
4 France 48 12 Usa 48 12
export1 and import1 belong to country1; export2 and import2 belong to country2.
I want to sum the export and import values by country.
Output may be like:
country | total_export | total_import
______________________________________________
USA | 12211221 | 212121
France | 4545 | 5454
...
...
Use wide_to_long first:
df = (pd.wide_to_long(data.reset_index(), ['country','import','export'], i='index', j='tmp')
.reset_index(drop=True))
print (df)
country import export
0 USA 12 82
1 Germany 65 31
2 England 74 47
3 Japan 23 55
4 France 48 12
5 Germany 12 82
6 France 65 31
7 Japan 74 47
8 England 23 55
9 Usa 48 12
And then aggregate sum:
df = df.groupby('country', as_index=False).sum()
print (df)
country import export
0 England 97 102
1 France 113 43
2 Germany 77 113
3 Japan 97 102
4 USA 12 82
5 Usa 48 12
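A runnable sketch of the wide_to_long route on a two-row toy frame (note that 'USA' and 'Usa' would remain separate groups unless you normalize case first):

```python
import pandas as pd

data = pd.DataFrame({
    'country1': ['USA', 'Germany'], 'import1': [12, 65], 'export1': [82, 31],
    'country2': ['Germany', 'France'], 'import2': [12, 65], 'export2': [82, 31],
})

# Stub names 'country', 'import', 'export' match suffixes 1 and 2
long_df = (pd.wide_to_long(data.reset_index(),
                           ['country', 'import', 'export'], i='index', j='tmp')
           .reset_index(drop=True))

totals = long_df.groupby('country', as_index=False).sum()
print(totals)
```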
You can slice the table into two parts and concatenate them (using pd.concat, since DataFrame.append was removed in pandas 2.0):
func = lambda x: x[:-1]  # or lambda x: x.rstrip('0123456789')
pd.concat([data.iloc[:, :3].rename(func, axis=1),
           data.iloc[:, 3:].rename(func, axis=1)]).groupby('country').sum()
Output:
import export
country
England 97 102
France 113 43
Germany 77 113
Japan 97 102
USA 12 82
Usa 48 12
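A runnable sketch of the slice-and-rename approach; pd.concat does the vertical stack (DataFrame.append was removed in pandas 2.0):

```python
import pandas as pd

data = pd.DataFrame({
    'country1': ['USA', 'Germany'], 'import1': [12, 65], 'export1': [82, 31],
    'country2': ['Germany', 'France'], 'import2': [12, 65], 'export2': [82, 31],
})

strip_digits = lambda c: c.rstrip('0123456789')  # 'import1' -> 'import'

totals = (pd.concat([data.iloc[:, :3].rename(strip_digits, axis=1),
                     data.iloc[:, 3:].rename(strip_digits, axis=1)])
          .groupby('country').sum())
print(totals)
```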
I have a pandas data frame df like this.
In [1]: df
Out[1]:
country count
0 Japan 78
1 Japan 80
2 USA 45
3 France 34
4 France 90
5 UK 45
6 UK 34
7 China 32
8 China 87
9 Russia 20
10 Russia 67
I want to remove rows with the maximum value in each group. So the result should look like:
country count
0 Japan 78
3 France 34
6 UK 34
7 China 32
9 Russia 20
My first attempt:
idx = df.groupby(['country'], sort=False).max()['count'].index
df_new = df.drop(list(idx))
My second attempt:
idx = df.groupby(['country'])['count'].transform(max).index
df_new = df.drop(list(idx))
But it didn't work. Any ideas?
groupby / transform('max')
You can first calculate a series of maximums by group, then filter out rows where count equals that series. Note this will also remove duplicate maximums.
g = df.groupby(['country'])['count'].transform('max')
df = df[~(df['count'] == g)]
The series g gives, for each row, the maximum of its group. Where this equals df['count'] (aligned by index), the row holds its group's maximum; ~ then negates the condition.
print(df.groupby(['country'])['count'].transform('max'))
0 80
1 80
2 45
3 90
4 90
5 45
6 45
7 87
8 87
9 20
Name: count, dtype: int64
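Put together as a runnable sketch (on a toy subset of the data above):

```python
import pandas as pd

df = pd.DataFrame({'country': ['Japan', 'Japan', 'France', 'France'],
                   'count': [78, 80, 34, 90]})

# Each row gets its group's maximum; rows equal to it are dropped
g = df.groupby('country')['count'].transform('max')
result = df[df['count'] != g]

print(result['count'].tolist())  # -> [78, 34]
```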
sort + drop
Alternatively, you can sort and drop the final occurrence:
res = df.sort_values('count')
res = res.drop(res.groupby('country').tail(1).index)
print(res)
country count
9 Russia 20
7 China 32
3 France 34
6 UK 34
0 Japan 78
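A runnable check of the sort-and-drop route on the same toy subset; note that, unlike the transform approach, this removes only one row per group even when the maximum is duplicated:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Japan', 'Japan', 'France', 'France'],
                   'count': [78, 80, 34, 90]})

res = df.sort_values('count')
# After sorting, tail(1) picks the largest row per group
res = res.drop(res.groupby('country').tail(1).index)

print(res['count'].tolist())  # -> [34, 78]
```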