I have 3 dataframes with the same format.
Then I combine them horizontally and get:
I would like to add a row to denote the name of each dataframe, i.e.:
I get the above form by copying the data into MS Excel and manually adding the row. Is there any way to do this directly for display in Python?
import pandas as pd
data = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21]}
df1 = pd.DataFrame(data)
data = {'Name': ['John', 'Kim'], 'Age': [15, 17]}
df2 = pd.DataFrame(data)
data = {'Name': ['Paul', 'Dood'], 'Age': [10, 5]}
df3 = pd.DataFrame(data)
pd.concat([df1, df2, df3], axis=1)
Use the keys parameter in concat:
df = pd.concat([df1, df2, df3], axis=1, keys=('df1', 'df2', 'df3'))
print(df)
      df1         df2         df3
     Name Age    Name Age    Name Age
0     Tom  20    John  15    Paul  10
1  Joseph  21     Kim  17    Dood   5
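With the keys in place, each original frame can be pulled back out by its first-level label, e.g.:
print(df['df2'])
#    Name  Age
# 0  John   15
# 1   Kim   17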
The row is actually a first-level column. You can get it by adding this level to each dataframe before concatenating:
for df_name, df in zip(("df1", "df2", "df3"), (df1, df2, df3)):
    df.columns = pd.MultiIndex.from_tuples((df_name, col) for col in df)
pd.concat([df1, df2, df3], axis=1)
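Note that the loop above rewrites the columns of df1, df2 and df3 in place. If you would rather leave the originals untouched, a minimal non-mutating sketch (assuming the frames still have their plain Name/Age columns) built with set_axis:
pd.concat(
    [f.set_axis(pd.MultiIndex.from_product([[name], f.columns]), axis=1)
     for name, f in [('df1', df1), ('df2', df2), ('df3', df3)]],
    axis=1,
)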
Very niche case, but you can use MultiIndex objects in order to build what you want.
Consider that what you need is a two-level header to display the information the way you want. A MultiIndex at the column level can accomplish that.
To understand the code better, read about MultiIndex objects in pandas. You basically create the labels (called levels) and then use indexes that point to those labels (called codes) to build the object.
Here is how to do it:
data = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21]}
df1 = pd.DataFrame(data)
data = {'Name': ['John', 'Kim'], 'Age': [15, 17]}
df2 = pd.DataFrame(data)
data = {'Name': ['Paul', 'Dood'], 'Age': [10, 5]}
df3 = pd.DataFrame(data)
df1.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[0, 0], [0, 1]])
df2.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[1, 1], [0, 1]])
df3.columns = pd.MultiIndex(levels=[['df1', 'df2', 'df3'], ['Name', 'Age']], codes=[[2, 2], [0, 1]])
And after the concatenation, you will have:
pd.concat([df1, df2, df3], axis=1)
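Writing the levels and codes by hand gets tedious; an equivalent, more compact sketch of the same header assignment uses pd.MultiIndex.from_product:
df1.columns = pd.MultiIndex.from_product([['df1'], ['Name', 'Age']])
df2.columns = pd.MultiIndex.from_product([['df2'], ['Name', 'Age']])
df3.columns = pd.MultiIndex.from_product([['df3'], ['Name', 'Age']])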
I have a large dataset (circa 200,000 rows × 30 columns) as a CSV. I need to use pandas to pre-process this data. I have included a dummy dataset below to help visualise the problem.
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
        'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
        'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
       }
df = pd.DataFrame(data)
df
The goal is to have individual columns that show the probability of each outcome for a batsman and bowler. By way of an example from the dummy dataset, Tom would have a 50% chance of an outcome of '1' or 'Out'.
This is calculated by:
1. Batsman column - the total number of rows with batsman 'X';
2. Outcome column - the total number of outcomes with value 'X';
3. Divide point 2 by point 1 to determine the probability of each outcome;
4. Repeat the above to determine the bowler probabilities.
The final dataframe from the dummy data should look similar to:
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
        'zero_prob_bat':[0,0.4,0.4,0.4,0,0.4,0.4,0.5,0.5],
        'one_prob_bat':[0.5,0.4,0.4,0.4,0.5,0.4,0.4,0,0],
        'two_prob_bat':[0,0,0,0,0,0,0,0.5,0.5],
        'three_prob_bat':[0,0,0,0,0,0,0,0,0],
        'four_prob_bat':[0,0.2,0.2,0.2,0,0.2,0.2,0,0],
        'six_prob_bat':[0,0,0,0,0,0,0,0,0],
        'out_prob_bat':[0.5,0,0,0,0.5,0,0,0,0],
        'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben'],
        'zero_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.5,0.5],
        'one_prob_bowl':[0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0.4285,0,0],
        'two_prob_bowl':[0,0,0,0,0,0,0,0.5,0.5],
        'three_prob_bowl':[0,0,0,0,0,0,0,0,0],
        'four_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
        'six_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
        'out_prob_bowl':[0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0.1428,0,0],
        'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0]
       }
One issue is that my original dataset has over 600 unique names. I could manually .groupby each unique name in the batsman/bowler columns, but this is not a scalable solution as new names will continually be added.
I am tempted to:
1. .count the number of instances of each unique name for batsman/bowler;
2. .count the number of different outcomes for each unique batsman/bowler;
3. Perform a lookup to match the probability next to each batsman/bowler.
However, I am cautious about implementing a lookup function as detailed in the answer here, due to my dataset size, which will continuously grow. In the past this has also created numerous issues when I have worked with Excel/CSVs, so I do not want to fall into any similar traps.
If someone could explain how they would go about solving this problem, so that I have something to aim towards, then it would be much appreciated.
Not sure how much this scales with your actual dataset, but I find it hard to think of a better solution than using groupby on the "Batsman" column and then value_counts on the grouped "Outcome" column. Example:
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
        'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
        'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
       }
df = pd.DataFrame(data)
grouped_data = df.groupby('Batsman')['Outcome'].value_counts(normalize=True)
print(grouped_data)
Output:
Batsman  Outcome
Nick     1        0.4
         0        0.2
         4        0.2
         Wide     0.2
Pete     0        0.5
         2        0.5
Tom      1        0.5
         Out      0.5
Name: Outcome, dtype: float64
Note that we did not need to groupby over each unique name manually, since groupby already does that for us.
The same logic can be applied to the "Bowler" column by simply replacing the "Batsman" string in the groupby call.
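If you then need those probabilities as columns next to each row (as in your target frame), one possible sketch is to unstack the normalized counts and join them back on 'Batsman'; the generated column names ('0_prob_bat', '1_prob_bat', ...) are illustrative, not the exact ones from your target:
bat_probs = (
    df.groupby('Batsman')['Outcome']
      .value_counts(normalize=True)
      .unstack(fill_value=0)      # one column per outcome
      .add_suffix('_prob_bat')
)
df_with_probs = df.join(bat_probs, on='Batsman')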
I think this answers your question...
import pandas as pd
import numpy as np
data = {'Batsman':['Tom', 'Nick', 'Nick', 'Nick', 'Tom', 'Nick', 'Nick', 'Pete', 'Pete'],
        'Outcome':[1, 0, 1, 'Wide', 'Out', 4, 1, 2, 0],
        'Bowler':['Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Bill', 'Ben', 'Ben']
       }
df = pd.DataFrame(data)
display(df)
batsman = df['Batsman'].unique()
bowler = df['Bowler'].unique()
print(sorted(batsman))
print(sorted(bowler))
final_df = pd.DataFrame()
for man in batsman:
    df1 = df[df['Batsman'] == man]
    count_man = len(df1)
    outcome = df['Outcome'].unique()
    count_outcome = len(outcome)
    batsman_prob = np.array(count_man/count_outcome)
    batsman_df = pd.DataFrame(data=[batsman_prob], columns=[man], index=['Batsman'])
    final_df = pd.concat([final_df, batsman_df], axis=1)
for man in bowler:
    df1 = df[df['Bowler'] == man]
    count_man = len(df1)
    outcome = df['Outcome'].unique()
    count_outcome = len(outcome)
    bowler_prob = np.array(count_man/count_outcome)
    bowler_df = pd.DataFrame(data=[bowler_prob], columns=[man], index=['Bowler'])
    final_df = pd.concat([final_df, bowler_df], axis=1)
display(final_df)
Here is the output:
              Tom      Nick      Pete      Bill       Ben
Batsman  0.333333  0.333333  0.333333       NaN       NaN
Bowler        NaN       NaN       NaN  1.166667  0.333333
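For comparison, pd.crosstab can build a per-outcome probability table in one call; a minimal sketch on the same dummy df:
# one row per batsman, one column per outcome, each row summing to 1
pd.crosstab(df['Batsman'], df['Outcome'], normalize='index')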
I would like to know how to take a start date from one dataframe column and add rows to the dataframe based on the number of days in another column, with a new date per day.
Essentially, I am trying to turn this data frame:
df = pd.DataFrame({
    'Name':['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration':[2, 3, 5, 6],
    'Hrs':[0.6, 1, 1.2, 0.3]})
...into this data frame:
df_2 = pd.DataFrame({
    'Name':['Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter'],
    'Date':['1/1/2019', '1/2/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/10/2019', '1/15/2019', '1/16/2019'],
    'Hrs':[0.6, 0.6, 1, 1, 1, 1.2, 0.3, 0.3]})
I'm new to programming in general and have tried the following:
df_2 = pd.DataFrame({
    'date': pd.date_range(
        start = df.Planned_Start,
        end = pd.to_timedelta(df.Duration, unit='D'),
        freq = 'D'
    )
})
... and ...
df["date"] = df.Planned_Start + timedelta(int(df.Duration))
with no luck.
I am not entirely sure what you are trying to achieve, as your df_2 looks a bit wrong from what I can see.
If you want to take the Duration column as days and add that many dates to a date column, then the code below achieves that. You can also drop any columns you don't need with the pd.Series.drop() method:
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame({
    'Name':['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start':['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration':[2, 3, 5, 6],
    'Hrs':[0.6, 1, 1.2, 0.3]})

df_new = pd.DataFrame()
for i, row in df.iterrows():
    for duration in range(row.Duration):
        # one new row per day, shifting the start date by the day offset
        date = pd.Series([datetime.strptime(row.Planned_Start, '%m/%d/%Y') + timedelta(days=duration)],
                         index=['date'])
        df_new = pd.concat([df_new, pd.concat([row, date]).to_frame().T], ignore_index=True)
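Row-by-row loops like this get slow as the frame grows. A vectorized sketch of the same expansion, using the df defined above: repeat each row Duration times, then offset the parsed start date by the row's position within its block:
# repeat each row 'Duration' times (index labels repeat with the rows)
out = df.loc[df.index.repeat(df['Duration'])].copy()
out['Date'] = (
    pd.to_datetime(out['Planned_Start'], format='%m/%d/%Y')
    + pd.to_timedelta(out.groupby(level=0).cumcount(), unit='D')  # 0..Duration-1 per original row
)
out = out.drop(columns=['Planned_Start', 'Duration']).reset_index(drop=True)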
Let's say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({
    'name': ['Waren'] * 6 + ['Kim'] * 6 + ['Mary'],
    'time': ['20:15', '20:12', '20:11', '01:29', '02:15', '02:16',
             '20:11', '01:29', '02:15', '01:49', '01:49', '02:15', '22:15'],
})
df = df.drop(df.index[2])
df = df.drop(df.index[7])
I would like to group this frame by name and then by continuous indexes (Group by continuous indexes in Pandas DataFrame).
The desired output would be a grouping like this:
So the rows are grouped by name, and from each run of continuously increasing indexes only the first and last element is taken.
I tried it like so:
df.groupby(['name']).groupby(df.index.to_series().diff().ne(1).cumsum()).group
which only raises the error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any help is welcome!
You are doing it wrong: df.groupby(['name']) returns a DataFrameGroupBy object, whose groupby attribute is not callable. You need to apply both groupings together:
df.groupby(['name', df.index.to_series().diff().ne(1).cumsum()]).groups
Out:
{('Kim', 2): [6, 7],
('Kim', 3): [9, 10, 11],
('Mary', 3): [12],
('Waren', 1): [0, 1],
('Waren', 2): [3, 4, 5]}
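And if you only want the first and last row of every continuous block, as described in the question, one sketch is to aggregate over the same two groupers:
# block id increments whenever the index jumps by more than 1
blocks = df.index.to_series().diff().ne(1).cumsum()
df.groupby(['name', blocks])['time'].agg(['first', 'last'])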