I believe I am ultimately looking for a way to change the dtype of data frame indices. Please allow me to explain:
Each df is multi-indexed on (the same) four levels. One level consists of mixed labels: plain integers, combinations of integers and letters (like D8), and letters only.
However, for df1 the integers within the index labels are surrounded by quotation marks, while for df2 the same integer labels are free of any quotes; i.e.,
df1.index.levels[1]
Index(['Z5', '02', '1C', '26', '2G', '2S', '30', '46', '48', '5M', 'CSA', etc...'], dtype='object', name='BMDIV')
df2.index.levels[1]
Index([ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y', '8F',
'8J', 'AN', 'AS', 'C3', 'CA', etc.
dtype='object', name='BMDIV')
When I try to merge these tables
df_merge = pd.merge(df1, df2, how='left', left_index=True, right_index=True)
I get:
TypeError: type object argument after * must be a sequence, not map
Is there a way to change, for example, the type of label in df2 so that the numbers are in quotes and therefore presumably match the corresponding labels in df1?
One way to change the level values is to build a new MultiIndex and re-assign it to df.index:
import pandas as pd

df = pd.DataFrame(
    {'index': [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'],
     'foo': 1, 'bar': 2})
df = df.set_index(['index', 'foo'])
level_values = [df.index.get_level_values(i) for i in range(df.index.nlevels)]
level_values[0] = level_values[0].astype(str)
df.index = pd.MultiIndex.from_arrays(level_values)
which makes the level values strings:
In [53]: df.index.levels[0]
Out[53]:
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
       'AS', 'C3', 'CA'],
      dtype='object', name='index')
Alternatively, you could avoid the somewhat low-level messiness by using reset_index and set_index:
import pandas as pd

df = pd.DataFrame(
    {'index': [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'],
     'foo': 1, 'bar': 2})
df = df.set_index(['index', 'foo'])
df = df.reset_index('index')
df['index'] = df['index'].astype(str)
df = df.set_index('index', append=True)
df = df.swaplevel(0, 1, axis=0)
which again produces string-valued index level values:
In [67]: df.index.levels[0]
Out[67]:
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
'AS', 'C3', 'CA'],
dtype='object', name='index')
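A third option, assuming a pandas version that provides MultiIndex.set_levels (available for many releases now), is to replace one level's values directly without rebuilding all the level arrays:

```python
import pandas as pd

df = pd.DataFrame(
    {'index': [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'],
     'foo': 1, 'bar': 2})
df = df.set_index(['index', 'foo'])

# Cast only level 0's unique values to str; the codes are untouched,
# so the row order and label-to-row mapping are preserved
df.index = df.index.set_levels(df.index.levels[0].astype(str), level=0)
```

Since only the (deduplicated) level values are converted rather than one value per row, this should also be cheap for long frames.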
Of these two options, using_MultiIndex is faster:
import numpy as np
import pandas as pd

N = 1000

def make_df(N):
    df = pd.DataFrame(
        {'index': np.random.choice(np.array(
            [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y',
             '8F', '8J', 'AN', 'AS', 'C3', 'CA'], dtype='O'), size=N),
         'foo': 1, 'bar': 2})
    df = df.set_index(['index', 'foo'])
    return df

def using_MultiIndex(df):
    level_values = [df.index.get_level_values(i) for i in range(df.index.nlevels)]
    level_values[0] = level_values[0].astype(str)
    df.index = pd.MultiIndex.from_arrays(level_values)
    return df

def using_reset_index(df):
    df = df.reset_index('index')
    df['index'] = df['index'].astype(str)
    df = df.set_index('index', append=True)
    df = df.swaplevel(0, 1, axis=0)
    return df
In [81]: %%timeit df = make_df(1000)
....: using_MultiIndex(df)
....:
1000 loops, best of 3: 693 µs per loop
In [82]: %%timeit df = make_df(1000)
....: using_reset_index(df)
....:
100 loops, best of 3: 2.09 ms per loop
I currently have two datasets.
The first contains a list of football teams with values I have worked out.
The second has a list of the teams that are playing today.
What I would like to do is add to dataset2, for each statistic, the mean of the two teams that are playing each other.
I have looked through Stack Overflow and not found anything that has been able to help. I am fairly new to working with Pandas, so I am not sure if this is possible or not.
As an example data set:
data1 = {
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
}
data2 = {
    'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
    'Home0-0': ['80', '78', '80', '66'],
    'Home1-0': ['81', '100', '90', '70'],
    'Away0-1': ['88', '42', '75', '69'],
}
with the desired output being
Desired = {
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
    'Home0-0': ['80', '72'],
    'Home1-0': ['86', '85'],
    'Away0-1': ['86', '56'],
}
Another option, as opposed to iterating over rows, is to merge the datasets and then iterate over the columns.
I also noticed your desired output is rounded, so that is handled as well.
Sample:
data1 = pd.DataFrame({
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
    'Team': ['Hull', 'Leeds', 'Everton', 'Man City'],
    'Home0-0': ['80', '78', '80', '66'],
    'Home1-0': ['81', '100', '90', '70'],
    'Away0-1': ['88', '42', '75', '69'],
})
Desired = pd.DataFrame({
    'DATAMECI': ['17/06/2020', '17/06/2020'],
    'ORAMECI': ['11:30', '15:30'],
    'TXTECHIPA1': ['Everton', 'Man City'],
    'TXTECHIPA2': ['Hull', 'Leeds'],
    'Home0-0': ['80', '72'],
    'Home1-0': ['86', '85'],
    'Away0-1': ['86', '56'],
})
Code:
import pandas as pd

cols = [x for x in data2 if 'Home' in x or 'Away' in x]
data1 = data1.merge(data2.rename(columns={'Team': 'TXTECHIPA1'}), how='left', on=['TXTECHIPA1'])
data1 = data1.merge(data2.rename(columns={'Team': 'TXTECHIPA2'}), how='left', on=['TXTECHIPA2'])
for col in cols:
    data1[col] = data1[[col + '_x', col + '_y']].astype(int).mean(axis=1).round(0)
    data1 = data1.drop([col + '_x', col + '_y'], axis=1)
Output:
print(data1)
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 86.0 82.0
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 56.0
Thanks for adding the data. Here is a simple way using loops: loop through df2 (the upcoming matches), find the corresponding rows for the two participating teams in df1 (the team statistics), average the desired columns of those two rows, and write the result into df2.
Considering similar structure as your data set, here is an example:
df1 = pd.DataFrame({'team': ['one', 'two', 'three', 'four', 'five'],
                    'home0-0': [86, 78, 65, 67, 100],
                    'home1-0': [76, 86, 67, 100, 0],
                    'home0-1': [91, 88, 75, 100, 67],
                    'home1-1': [75, 67, 67, 100, 100],
                    'away0-0': [57, 86, 71, 91, 50],
                    'away1-0': [73, 50, 71, 100, 100],
                    'away0-1': [78, 62, 40, 80, 0],
                    'away1-1': [50, 71, 33, 100, 0]})
df2 = pd.DataFrame({'date': ['2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17'],
                    'time': [1800, 1200, 1100, 2005, 1000, 1800, 1800],
                    'team1': ['one', 'two', 'three', 'four', 'five', 'one', 'three'],
                    'team2': ['five', 'four', 'two', 'one', 'three', 'two', 'four']})
import numpy as np

for i, row in df2.iterrows():
    team1 = df1[df1['team'] == row['team1']]
    team2 = df1[df1['team'] == row['team2']]
    for col in df1.columns[1:]:
        df2.loc[i, col] = np.mean([team1[col].values[0], team2[col].values[0]])
print(df2)
For your sample data set:
for i, row in data1.iterrows():
    team1 = data2[data2['Team'] == row['TXTECHIPA1']]
    team2 = data2[data2['Team'] == row['TXTECHIPA2']]
    for col in data2.columns[1:]:
        data1.loc[i, col] = np.mean([int(team1[col].values[0]), int(team2[col].values[0])])
print(data1)
Result:
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 85.5 81.5
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 55.5
I have a dataframe that looks like:
import pandas as pd
data = {
    'Other': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'category': [['Transfer'], ['Unknown'], ['Transfer', 'Facebook'],
                 ['Facebook', 'Google', 'Other'], ['C3']]
}
df = pd.DataFrame(data)
I am trying to get a list of the unique category values; unfortunately, using
categories = df['category'].unique()
doesn't work, because the column contains lists, which are unhashable. I am not sure what the approach should be to end up with
['Transfer', 'Unknown', 'Facebook', 'Google', 'Other', 'C3']
Let us try explode
df.category.explode().unique()
array(['Transfer', 'Unknown', 'Facebook', 'Google', 'Other', 'C3'],
dtype=object)
If you need the unique values another way, you can flatten df.category with np.concatenate and pass the result to pd.unique:
import numpy as np
l = pd.unique(np.concatenate(df.category))
Out[100]:
array(['Transfer', 'Unknown', 'Facebook', 'Google', 'Other', 'C3'],
dtype=object)
How do I reshape my dataframe from
to
using Python?
df1 = pd.DataFrame({'Name':['John', 'Martin', 'Ricky'], 'Age': ['25', '27', '22'], 'Car1': ['Hyundai', 'VW', 'Ford'], 'Car2': ['Maruti', 'Merc', 'NA']})
You want:
df_melted = pd.melt(df1, id_vars=['Name', 'Age'], value_vars=['Car1', 'Car2'], var_name='car_number', value_name='car')
df_melted.drop('car_number', axis=1, inplace=True)
df_melted.sort_values('Name', inplace=True)
df_melted.dropna(inplace=True)
Note that dropna only removes real NaN values; the literal string 'NA' in Car2 would need to be converted first (e.g. with replace) for that row to be dropped.
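For the sample df1 above, here is a runnable sketch of the same melt, with the assumption that the literal string 'NA' is meant to be treated as missing:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Martin', 'Ricky'],
                    'Age': ['25', '27', '22'],
                    'Car1': ['Hyundai', 'VW', 'Ford'],
                    'Car2': ['Maruti', 'Merc', 'NA']})

# Melt the two car columns into one long 'car' column
df_melted = pd.melt(df1, id_vars=['Name', 'Age'],
                    value_vars=['Car1', 'Car2'],
                    var_name='car_number', value_name='car')
# 'NA' is a plain string, not a missing value; convert it so dropna can act
df_melted['car'] = df_melted['car'].replace('NA', np.nan)
df_melted = (df_melted.drop('car_number', axis=1)
                      .dropna()
                      .sort_values('Name')
                      .reset_index(drop=True))
print(df_melted)
```

This leaves five rows: two cars each for John and Martin, and only one for Ricky, whose 'NA' entry was dropped.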
I have a csv file in the following format:
"age","job","marital","education","default","balance","housing","loan"
58,"management","married","tertiary","no",2143,"yes","no"
44,"technician","single","secondary","no",29,"yes","no"
However, instead of being split into different columns, all the values seem to lie in the same first column. When I try reading this using pandas, the output gives all the values in the same list instead of a list of lists.
My code:
dataframe = pd.read_csv("marketing-data.csv", header=0, sep=",")
dataset = dataframe.values
print(dataset)
O/p:
[[58 'management' 'married' ..., 2143 'yes' 'no']
[44 'technician' 'single' ..., 29 'yes' 'no']]
What I need:
[[58, 'management', 'married', ..., 2143, 'yes', 'no']
[44 ,'technician', 'single', ..., 29, 'yes', 'no']]
What is it I am missing?
I think you are confused by the print() output which doesn't show commas.
Demo:
In [1]: df = pd.read_csv(filename)
Pandas representation:
In [2]: df
Out[2]:
age job marital education default balance housing loan
0 58 management married tertiary no 2143 yes no
1 44 technician single secondary no 29 yes no
Numpy representation:
In [3]: df.values
Out[3]:
array([[58, 'management', 'married', 'tertiary', 'no', 2143, 'yes', 'no'],
[44, 'technician', 'single', 'secondary', 'no', 29, 'yes', 'no']], dtype=object)
Numpy string representation (result of print(numpy_array)):
In [4]: print(df.values)
[[58 'management' 'married' 'tertiary' 'no' 2143 'yes' 'no']
[44 'technician' 'single' 'secondary' 'no' 29 'yes' 'no']]
Conclusion: your CSV file has been parsed correctly.
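If you do want actual Python lists (which print with commas) rather than a NumPy array, tolist() converts the parsed values. A self-contained sketch, using io.StringIO in place of the real file so the snippet runs on its own:

```python
import io
import pandas as pd

# Stand-in for marketing-data.csv, using the rows shown in the question
csv_text = '''"age","job","marital","education","default","balance","housing","loan"
58,"management","married","tertiary","no",2143,"yes","no"
44,"technician","single","secondary","no",29,"yes","no"'''

df = pd.read_csv(io.StringIO(csv_text))
rows = df.values.tolist()   # list of plain Python lists
print(rows)
```

Printing rows shows the comma-separated list-of-lists form asked for, with numeric fields already parsed as integers.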
I don't really see a difference between what you want and what you get, but parsing the csv file with the built-in csv module gives your desired result:
import csv
with open('file.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    print(list(spamreader))
[
['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan'],
['58', 'management', 'married', 'tertiary', 'no', '2143', 'yes', 'no'],
['44', 'technician', 'single', 'secondary', 'no', '29', 'yes', 'no']
]
Here is what my dataframe looks like:
df = pd.DataFrame([
    ['01', 'aa', '1+', 1200],
    ['01', 'ab', '1+', 1500],
    ['01', 'jn', '1+', 1600],
    ['02', 'bb', '2', 2100],
    ['02', 'ji', '2', 785],
    ['03', 'oo', '2', 5234],
    ['04', 'hg', '5-', 1231],
    ['04', 'kf', '5-', 454],
    ['05', 'mn', '6', 45],
], columns=['faculty_id', 'sub_id', 'default_grade', 'sum'])
df
I want to group by faculty_id, ignore sub_id, aggregate sum, and assign one default_grade to each faculty_id. How do I do that? I know how to group by faculty_id and aggregate sum, but I'm not sure how to assign the default_grade to each faculty.
Thanks a lot!
You can apply different functions by column in a groupby using dictionary syntax.
df.groupby('faculty_id').agg({'default_grade': 'first', 'sum': 'sum'})
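Run against the sample frame above, this keeps the first default_grade seen per faculty and totals the sum column:

```python
import pandas as pd

df = pd.DataFrame([
    ['01', 'aa', '1+', 1200],
    ['01', 'ab', '1+', 1500],
    ['01', 'jn', '1+', 1600],
    ['02', 'bb', '2', 2100],
    ['02', 'ji', '2', 785],
    ['03', 'oo', '2', 5234],
    ['04', 'hg', '5-', 1231],
    ['04', 'kf', '5-', 454],
    ['05', 'mn', '6', 45],
], columns=['faculty_id', 'sub_id', 'default_grade', 'sum'])

# 'first' picks one representative grade per group; 'sum' totals the values
out = df.groupby('faculty_id').agg({'default_grade': 'first', 'sum': 'sum'})
print(out)
```

Since the default_grade is constant within each faculty_id in this data, 'first' (or equivalently 'min'/'max') safely picks that shared value.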