Here is what my dataframe looks like:
df = pd.DataFrame([
['01', 'aa', '1+', 1200],
['01', 'ab', '1+', 1500],
['01', 'jn', '1+', 1600],
['02', 'bb', '2', 2100],
['02', 'ji', '2', 785],
['03', 'oo', '2', 5234],
['04', 'hg', '5-', 1231],
['04', 'kf', '5-', 454],
['05', 'mn', '6', 45],
], columns=['faculty_id', 'sub_id', 'default_grade', 'sum'])
df
I want to group by faculty_id, ignore sub_id, aggregate sum, and assign one default_grade to each faculty_id. How do I do that? I know how to group by faculty_id and aggregate sum, but I'm not sure how to assign the default_grade to each faculty.
Thanks a lot!
You can apply a different aggregation function to each column in a groupby by passing a dictionary to agg:
df.groupby('faculty_id').agg({'default_grade': 'first', 'sum': 'sum'})
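For reference, on the sample frame above this produces one row per faculty_id, e.g. faculty 01 keeps grade '1+' and gets a total of 4300. A runnable sketch:

```python
import pandas as pd

df = pd.DataFrame([
    ['01', 'aa', '1+', 1200],
    ['01', 'ab', '1+', 1500],
    ['01', 'jn', '1+', 1600],
    ['02', 'bb', '2', 2100],
    ['02', 'ji', '2', 785],
    ['03', 'oo', '2', 5234],
    ['04', 'hg', '5-', 1231],
    ['04', 'kf', '5-', 454],
    ['05', 'mn', '6', 45],
], columns=['faculty_id', 'sub_id', 'default_grade', 'sum'])

# Per-column aggregation: keep the first grade seen, total the sums
out = df.groupby('faculty_id').agg({'default_grade': 'first', 'sum': 'sum'})
```

Using 'first' assumes the grade is constant within each faculty_id, which holds for the sample data; if it could vary, you would need to decide which grade wins.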
How can I sort a dictionary using the values from a list?
names = ['bread', 'banana', 'apple']
prices = ['2', '4', '1']
...
d = {'name': names, 'price': prices}
d is now {'name': ['bread', 'banana', 'apple'], 'price': ['2', '4', '1']}
I want to sort the dictionary so that the first name corresponds to the lowest price.
Is this possible to achieve with sorting on a dictionary?
Example
sorted_dict = {'name': ['apple', 'bread', 'banana'], 'price': ['1', '2', '4']}
IIUC, you want to sort the first list (in name) based on the values of the second list (in price).
If that's what you want, then a quick way is to use pandas, since the data structure you have (dict of lists), fits really nicely with a pd.DataFrame.
import pandas as pd
pd.DataFrame(d).sort_values('price').to_dict('list')
{'name': ['apple', 'bread', 'banana'], 'price': ['1', '2', '4']}
Added an example per the OP's modified request:
names = ['bread', 'banana', 'apple']
prices = ['2', '4', '1']
description = ['a','x','b']
...
d = {'name': names, 'price': prices, 'description':description}
pd.DataFrame(d).sort_values('price').to_dict('list')
{'name': ['apple', 'bread', 'banana'],
'price': ['1', '2', '4'],
'description': ['b', 'a', 'x']}
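If you'd rather not pull in pandas for this, a plain-Python sketch using sorted over zipped pairs does the same thing. Note the prices are strings, so sort with an int key; lexicographic order happens to work for single digits but would break on a price like '10':

```python
names = ['bread', 'banana', 'apple']
prices = ['2', '4', '1']

# Pair each price with its name, sort numerically, then unzip
pairs = sorted(zip(prices, names), key=lambda t: int(t[0]))
sorted_dict = {'name': [n for _, n in pairs],
               'price': [p for p, _ in pairs]}
```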
I have the following 2d list and dictionary:
List2d = [['1', '55', '32', '667' ],
['43', '76', '55', '100'],
['23', '70', '15', '300']]
dictionary = {'New York':0, "London": 0, "Tokyo": 0, "Toronto": 0 }
How do I replace all the values of the dictionary with the sums of the columns in List2d, so that the dictionary looks like this?
dictionary= {'New York' : 67, 'London': 201, 'Tokyo': 102, 'Toronto': 1067}
#67 comes from adding up first column (1+43+23) in 'List2d'
#201 comes from adding up second column (55+76+70) in 'List2d'
#102 comes from adding up third column (32+55+15) in 'List2d'
#1067 comes from adding up fourth column (667+100+300) in 'List2d'
Since Python 3.7, keys in a dict are ordered.
You can use enumerate to keep track of each key's position while iterating over the dict. Then use i as a column index into each row of the 2D list, convert each value to int, and sum the results.
List2d = [['1', '55', '32', '667' ],
['43', '76', '55', '100'],
['23', '70', '15', '300']]
dictionary = {'New York':0, "London": 0, "Tokyo": 0, "Toronto": 0 }
for i, city in enumerate(dictionary.keys()):
dictionary[city] = sum(int(row[i]) for row in List2d)
print(dictionary)
# {'New York': 67, 'London': 201, 'Tokyo': 102, 'Toronto': 1067}
Use pandas
#!pip install pandas
import pandas as pd
pd.DataFrame(List2d, columns=dictionary.keys()).astype(int).sum(axis=0).to_dict()
output:
{'New York': 67, 'London': 201, 'Tokyo': 102, 'Toronto': 1067}
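Another dependency-free option, again relying on dicts keeping insertion order (Python 3.7+): transpose the 2D list with zip(*...) so you iterate over columns, and zip the column sums back onto the keys:

```python
List2d = [['1', '55', '32', '667'],
          ['43', '76', '55', '100'],
          ['23', '70', '15', '300']]
dictionary = {'New York': 0, 'London': 0, 'Tokyo': 0, 'Toronto': 0}

# zip(*List2d) yields the columns; sum each after converting to int
dictionary = {city: sum(map(int, col))
              for city, col in zip(dictionary, zip(*List2d))}
```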
I have a bunch of dataframes, like the ones below:
import pandas as pd
data1 = [['1', '2', 'mary', 123], ['1', '3', 'john', 234 ], ['2', '4', 'layla', 345 ]]
data2 = [['2', '6', 'josh', 345], ['1', '2', 'dolores', 987], ['1', '4', 'kate', 843]]
df1 = pd.DataFrame(data1, columns = ['state', 'city', 'name', 'number1'])
df2 = pd.DataFrame(data2, columns = ['state', 'city', 'name', 'number1'])
for some silly reason I need to transform it into a list in this manner (one dict per row):
list(
df1.apply(
lambda x: {
"profile": {"state": x["state"], "city": x["city"], "name": x["name"]},
"number1": x["number1"],
},
axis=1,
)
)
which returns exactly what I need:
[{'profile': {'state': '1', 'city': '2', 'name': 'mary'}, 'number1': 123},
{'profile': {'state': '1', 'city': '3', 'name': 'john'}, 'number1': 234},
{'profile': {'state': '2', 'city': '4', 'name': 'layla'}, 'number1': 345}]
It works if I do it for each dataframe, but I need to write a function so I can reuse it later. Also, I need to store both df1 and df2 separately after the operation.
I tried something like this:
df_list = [df1, df2]
for row in df_list:
row = list(row.apply(lambda x: {'send': {'state':x['state'], 'city':x['city'], 'name':x['name']}, 'number1':x['number1']}, axis=1))
but it saves only the value of the last df in the list (df2) in row.
also, I tried something like this (and a lot of other stuff):
new_values = []
for row in df_list:
row = list(row.apply(lambda x: {'send': {'state':x['state'],'city':x['city'],'name':x['name']}, 'number1':x['number1']}, axis=1))
new_values.append(df_list)
I know it might be about not saving the row value locally. I've read a lot of posts here similar to my problem, but I couldn't manage to fully use them... Any help will be appreciated, I'm really stuck here.
Do you mean this?
def func(df):
return list(df.apply(lambda x:{'profile' : {'state': x['state'],'city': x['city'],'name':x['name']},'number1': x['number1']}, axis=1))
You can use it like this:
df1 = func(df1)
Also, if you want to apply it to all of the dataframes:
df1, df2 = [func(df) for df in [df1, df2]]
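An equivalent sketch that avoids a row-wise apply: build the inner dicts with to_dict('records') and zip the number1 column back on (this assumes the same column names as in the question):

```python
import pandas as pd

def func(df):
    # One dict per row for the profile columns, paired with its number1
    profiles = df[['state', 'city', 'name']].to_dict('records')
    return [{'profile': p, 'number1': n}
            for p, n in zip(profiles, df['number1'])]

data1 = [['1', '2', 'mary', 123], ['1', '3', 'john', 234], ['2', '4', 'layla', 345]]
df1 = pd.DataFrame(data1, columns=['state', 'city', 'name', 'number1'])
result = func(df1)
```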
How do I reshape my dataframe from wide format (separate Car1/Car2 columns) to long format (one row per person-car pair) using Python?
df1 = pd.DataFrame({'Name':['John', 'Martin', 'Ricky'], 'Age': ['25', '27', '22'], 'Car1': ['Hyundai', 'VW', 'Ford'], 'Car2': ['Maruti', 'Merc', 'NA']})
You want:
df_melted = pd.melt(df1, id_vars=['Name', 'Age'], value_vars=['Car1', 'Car2'], var_name='car_number', value_name='Car')
df_melted.drop('car_number', axis=1, inplace=True)
df_melted.sort_values('Name', inplace=True)
df_melted.dropna(inplace=True)
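Against the sample df1 above, a self-contained sketch would look like this; note that the sample uses the literal string 'NA' rather than a real missing value, so it has to be converted to NaN before dropna can remove it:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Name': ['John', 'Martin', 'Ricky'],
                    'Age': ['25', '27', '22'],
                    'Car1': ['Hyundai', 'VW', 'Ford'],
                    'Car2': ['Maruti', 'Merc', 'NA']})

# Wide -> long: one row per (person, car)
df_melted = pd.melt(df1, id_vars=['Name', 'Age'],
                    value_vars=['Car1', 'Car2'],
                    var_name='car_number', value_name='Car')
df_melted = df_melted.drop('car_number', axis=1)

# 'NA' is a string in the sample, so turn it into NaN before dropping
df_melted = df_melted.replace('NA', np.nan).dropna()
df_melted = df_melted.sort_values('Name').reset_index(drop=True)
```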
I believe I am ultimately looking for a way to change the dtype of data frame indices. Please allow me to explain:
Each df is multi-indexed on (the same) four levels. One level consists of mixed labels of integers, integer and letters (like D8), and just letters.
However, for df1, the integers within the index labels are surrounded by quotation marks, while for df2, the same integer labels are free of any quotes; i.e.,
df1.index.levels[1]
Index(['Z5', '02', '1C', '26', '2G', '2S', '30', '46', '48', '5M', 'CSA', etc...'], dtype='object', name='BMDIV')
df2.index.levels[1]
Index([ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y', '8F',
'8J', 'AN', 'AS', 'C3', 'CA', etc.
dtype='object', name='BMDIV')
When I try to merge these tables
df_merge = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
I get:
TypeError: type object argument after * must be a sequence, not map
Is there a way to change, for example, the type of label in df2 so that the numbers are in quotes and therefore presumably match the corresponding labels in df1?
One way to change the level values is to build a new MultiIndex and re-assign it to df.index:
import pandas as pd
df = pd.DataFrame(
{'index':[ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
'8F', '8J', 'AN', 'AS', 'C3', 'CA'],
'foo':1, 'bar':2})
df = df.set_index(['index', 'foo'])
level_values = [df.index.get_level_values(i) for i in range(df.index.nlevels)]
level_values[0] = level_values[0].astype(str)
df.index = pd.MultiIndex.from_arrays(level_values)
which makes the level values strings:
In [53]: df.index.levels[0]
Out[53]:
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
'AS', 'C3', 'CA'],
dtype='object', name='index')
Alternatively, you could avoid the somewhat low-level messiness by using reset_index and set_index:
import pandas as pd
df = pd.DataFrame(
{'index':[ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
'8F', '8J', 'AN', 'AS', 'C3', 'CA'],
'foo':1, 'bar':2})
df = df.set_index(['index', 'foo'])
df = df.reset_index('index')
df['index'] = df['index'].astype(str)
df = df.set_index('index', append=True)
df = df.swaplevel(0, 1, axis=0)
which again produces string-valued index level values:
In [67]: df.index.levels[0]
Out[67]:
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
'AS', 'C3', 'CA'],
dtype='object', name='index')
Of these two options, using_MultiIndex is faster:
N = 1000
def make_df(N):
df = pd.DataFrame(
{'index': np.random.choice(np.array(
[26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
'8F', '8J', 'AN', 'AS', 'C3', 'CA'], dtype='O'), size=N),
'foo':1, 'bar':2})
df = df.set_index(['index', 'foo'])
return df
def using_MultiIndex(df):
level_values = [df.index.get_level_values(i) for i in range(df.index.nlevels)]
level_values[0] = level_values[0].astype(str)
df.index = pd.MultiIndex.from_arrays(level_values)
return df
def using_reset_index(df):
df = df.reset_index('index')
df['index'] = df['index'].astype(str)
df = df.set_index('index', append=True)
df = df.swaplevel(0, 1, axis=0)
return df
In [81]: %%timeit df = make_df(1000)
....: using_MultiIndex(df)
....:
1000 loops, best of 3: 693 µs per loop
In [82]: %%timeit df = make_df(1000)
....: using_reset_index(df)
....:
100 loops, best of 3: 2.09 ms per loop
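For what it's worth, pandas also exposes MultiIndex.set_levels, which makes the conversion a one-liner on the same sample frame (assuming a pandas version new enough to accept the level argument):

```python
import pandas as pd

df = pd.DataFrame(
    {'index': [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'],
     'foo': 1, 'bar': 2})
df = df.set_index(['index', 'foo'])

# Replace level 0's values with their string representations in one call
df.index = df.index.set_levels(df.index.levels[0].astype(str), level=0)
```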