Plotting Multiple Lines using GroupBy Function in Pandas / Matplotlib - python

I'm trying to plot a pandas DataFrame using matplotlib, but I'm having issues with the grouping. The DataFrame contains statistics for a player in each round of the season. My DataFrame is much larger, but for this example I have simplified it:
Desc   Round 1  Round 2  Round 3  Round 4  Round 5  Round 6  Round 7  Round 8  Round 9  Round 10
Ben    22.3     33.3     21.5     27.7     31.3     43       33.5     20       29.7     22.7
Tom             28.2     29.2     23.1     25       21.4     22.3     26.2     25.3     19.6
Jack   21.3     30.4     20.8     18       24.5     28.3     32.6     17       25.1     23.7
However when I simply try to plot this using:
df.plot()
plt.show()
The lines are grouped by the round number instead of the player's name, and it appears the Y values are actually the players' row indices.
So I believe the DataFrame isn't correctly indexed for rows/columns, which is causing the problem. I've looked into df.groupby but can't find a solution.
I can easily create the line graph I'm after using MS Excel (one line per player across the rounds).
Does anyone have a solution for what I can do to either my DataFrame or my plotting code to get the desired outcome? I have already made sure I set the row index to the players' names using:
df.set_index('Desc')
However this hasn't fixed the issue.

Use set_index then transpose:
Creating data
colNames = ['Desc', 'Round1', 'Round2', 'Round3', 'Round4', 'Round5', 'Round6', 'Round7', 'Round8', 'Round9', 'Round10']
df = pd.DataFrame(columns = colNames)
df.loc[len(df)] = ['Ben', '22.3', '33.3', '21.5', '27.7', '31.3', '43', '33.5', '20', '29.7', '22.7']
df.loc[len(df)] = ['Tom', '', '28.2', '29.2', '23.1', '25', '21.4', '22.3', '26.2', '25.3', '19.6']
df.loc[len(df)] = ['Jack', '21.3', '30.4', '20.8', '18', '24.5', '28.3', '32.6', '17', '25.1', '23.7']
Pre-processing
df.set_index("Desc", inplace = True)
df = df.apply(pd.to_numeric, errors='coerce')
Plotting the data
df.T.plot()
plt.show()
This gives us the expected graph:
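If the round scores are already numeric, a one-liner is enough; a minimal sketch, assuming a 'Desc' column holding the player names (the toy frame below is illustrative):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Desc': ['Ben', 'Tom', 'Jack'],
                   'Round1': [22.3, None, 21.3],
                   'Round2': [33.3, 28.2, 30.4],
                   'Round3': [21.5, 29.2, 20.8]})
# set_index returns a new frame, so chain it (or pass inplace=True)
df.set_index('Desc').T.plot()
plt.show()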

How to create a new data frame which takes average of 1 column on the condition of another column whilst grouping the IDs

So I have the following data frame:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'uuid': ['0068edf090ceaf1356'] * 5 + ['009eft67eaa133cea4'] * 4,
                   'prom_completed_date': [-26, -26, -36, 81, 181, -51, -81, 61, 71],
                   'prom_score': [18.0, 18.0, 27.0, 53.0, 43.6, 12.4, 24.4, 63.0, 72.8]})
In [2]: df
Out[2]:
uuid prom_completed_date prom_score
0068edf090ceaf1356 -26 18.0
0068edf090ceaf1356 -26 18.0
0068edf090ceaf1356 -36 27.0
0068edf090ceaf1356 81 53.0
0068edf090ceaf1356 181 43.6
009eft67eaa133cea4 -51 12.4
009eft67eaa133cea4 -81 24.4
009eft67eaa133cea4 61 63.0
009eft67eaa133cea4 71 72.8
Each patient has multiple entries. Bear in mind that the first two entries are not duplicates; they only look identical here because other columns (not shown) differ. So rather than the pre-op average being (18+18+27)/3, it should be (18+27)/2.
I want to create a new data frame where each uuid has three new columns:
an average PROM score where the values in the prom_completed_date_relative are negative
an average PROM score where the values in the prom_completed_date_relative are positive
the difference between the two above averages.
I'm not exactly sure how to code this in Python while ensuring that the uuids stay grouped.
I'm looking for something like this:
In [3]:
Out[3]:
uuid postop_avgPROM preop_avgPROM difference
0068edf090ceaf1356 48.3 22.5 25.8
009eft67eaa133cea4 67.9 18.4 49.5
I have tried the following:
df.query("prom_completed_date_relative">0).groupby("uuid")["prom_score"].mean().reset_index(name="postop_avgPROM_score")
but it does not seem to work, unfortunately.
Thank you.
There may be a more concise way to achieve your result, but here is a multi-step way that is pretty clear
# get the post-op and pre-op averages per uuid
post_op = df[df['prom_completed_date'] > 0].groupby('uuid')['prom_score'].mean()
pre_op = df[df['prom_completed_date'] < 0].groupby('uuid')['prom_score'].mean()
difference = post_op - pre_op
# concat them together
df1 = pd.concat([post_op, pre_op, difference], axis=1)
# rename the columns
df1.columns = ['postop_avgPROM', 'preop_avgPROM', 'difference']
df1
postop_avgPROM preop_avgPROM difference
uuid
0068edf090ceaf1356 48.3 22.5 25.8
009eft67eaa133cea4 67.9 18.4 49.5
Here is a solution for what you tried.
This will give you the prom_score average for each combination of uuid and the sign (negative vs. non-negative) of prom_completed_date_relative.
df_avg = df.groupby(["uuid",df["prom_completed_date_relative"]>=0])["prom_score"].mean().reset_index()
You will need to process it a little more in order to get the columns the way you want.
Using .pivot() on df_avg:
df_avg = df_avg.pivot(index="uuid", columns="prom_completed_date_relative", values="prom_score")
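A possible continuation sketch: the pivoted columns come from the boolean grouping key (False for the negative dates, True for the non-negative ones), so they can be renamed and the difference computed directly.
df_avg.columns = ['preop_avgPROM', 'postop_avgPROM']
df_avg['difference'] = df_avg['postop_avgPROM'] - df_avg['preop_avgPROM']
df_avg = df_avg.reset_index()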

Python dataframe remove top n rows and move up remaining

I have a data frame of 2500 rows. I am trying to remove the top n rows and move the remaining rows up without changing the index. Here is an example of my problem and what I want:
df =
A
10 10.5
11 20.5
12 30.5
13 40.5
14 50.5
15 60.5
16 70.5
In the above, I would like to remove the top two rows and move the remaining rows up without disturbing the index. My code and current output:
idx = df.index
df.drop(df.index[:2],inplace=True)
df.set_index(idx[:len(df)],inplace=True)
df =
A
10 30.5
11 40.5
12 50.5
13 60.5
14 70.5
I got the output that I wanted. Is there a better way to do it, like a one-liner?
You can use iloc to remove the rows and set the index to the original without the last 2 values.
df = df.iloc[2:].set_index(df.index[:-2])
You can also use shift() and then drop the resulting NaN rows to build the new frame:
df = pd.DataFrame(df.A.shift(-2).dropna(how='all'))
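A quick sanity-check sketch of both one-liners on the example frame (values and index taken from the question):
import pandas as pd
df = pd.DataFrame({'A': [10.5, 20.5, 30.5, 40.5, 50.5, 60.5, 70.5]}, index=range(10, 17))
out1 = df.iloc[2:].set_index(df.index[:-2])            # drop the first 2 rows, reuse index 10..14
out2 = pd.DataFrame(df.A.shift(-2).dropna(how='all'))  # shift values up by 2, drop the trailing NaNs
print(out1.equals(out2))                               # both should give A = 30.5..70.5 with index 10..14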

Unable to find Column Name after Transpose

After transposing my DataFrame, I cannot access the column names to plot a graph. I want to choose two columns, but it fails, saying there are no such column names. I am pretty new to Python, DataFrames and transposing. Could someone help, please?
Below is my input file; I want to transpose the rows to columns. The transpose succeeds, but I cannot select "Canada" and "Cameroon" to plot a graph.
country 1990 1991 1992 1993 1994 1995
0 Cambodia 65.4 65.7 66.2 66.7 67.1 68.4
1 Cameroon 63.9 63.7 64.7 65.6 66.6 67.6
2 Canada 98.6 99.6 99.6 99.8 99.9 99.9
3 Cape Verde 77.7 77.0 76.6 89.0 79.0 78.0
import pandas as pd
import numpy as np
import re
import math
import matplotlib.pyplot as plt
missing_values=["n/a","na","-","-","N/A"]
df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values)
# Transpose
df = df.transpose()
plt.figure(figsize=(12,8))
plt.plot(df['Canada','Cameroon'], linewidth = 0.5)
plt.title("Time Series for Canada")
plt.show()
It produces a long list of error messages but the final message is
KeyError: ('Canada', 'Cameroon')
There are a few things you might need to do when working with the data.
If the csv file has no header then use df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values, header = None).
When you transpose, you need to name the columns
df.columns= df.iloc[0].
Having done this you need to drop the first row of your table (because it contains the column names) df = df.reindex(df.index.drop(0)).
Finally, when accessing the data by columns (in the plt.plot() command) you need to use df[] on the list of columns, i.e. df[['Canada', 'Cameroon']].
EDIT: So the code, as it works for me, is as follows:
df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values, header = None)
df = df.T
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
df.index.name = 'Year'
plt.figure(figsize=(12,8))
plt.plot(df[['Canada','Cameroon']], linewidth = 0.5)
plt.title("Time Series for Canada")
plt.show()
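If the CSV does contain the 'country' header row (as the printed frame above suggests), an alternative sketch avoids the manual column renaming by reading the country names straight into the index:
import pandas as pd
import matplotlib.pyplot as plt
missing_values = ["n/a", "na", "-", "N/A"]
df = pd.read_csv('StackoverflowGap.csv', na_values=missing_values, index_col='country').T
df.index.name = 'Year'
plt.figure(figsize=(12, 8))
plt.plot(df[['Canada', 'Cameroon']], linewidth=0.5)
plt.title("Time Series for Canada and Cameroon")
plt.show()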

What is the best way to oversample a dataframe preserving its statistical properties in Python 3?

I have the following toy df:
FilterSystemO2Concentration (Percentage) ProcessChamberHumidityAbsolute (g/m3) ProcessChamberPressure (mbar)
0 0.156 1 29.5 28.4 29.6 28.4
2 0.149 1.3 29.567 28.9
3 0.149 1 29.567 28.9
4 0.148 1.6 29.6 29.4
This is just a sample; the original has over 1,200 rows. What's the best way to oversample it while preserving its statistical properties?
I have googled for some time and have only come across resampling algorithms for imbalanced classes, but that's not what I want. I'm not interested in balancing the data; I just would like to produce more samples in a way that more or less preserves the original data distributions and statistical properties.
Thanks in advance
Using scipy.stats.rv_histogram(np.histogram(data)).isf(np.random.random(size=n)) will create n new samples randomly chosen from the distribution (histogram) of the data. You can do this for each column:
Example:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'x': np.random.random(100) * 3, 'y': np.random.random(100) * 4 - 2})
n = 5
new_values = pd.DataFrame({s: stats.rv_histogram(np.histogram(df[s])).isf(np.random.random(size=n)) for s in df.columns})
df = pd.concat([df.assign(data_type='original'), new_values.assign(data_type='oversampled')])
df.tail(7)
>> x y data_type
98 1.176073 -0.207858 original
99 0.734781 -0.223110 original
0 2.014739 -0.369475 oversampled
1 2.825933 -1.122614 oversampled
2 0.155204 1.421869 oversampled
3 1.072144 -1.834163 oversampled
4 1.251650 1.353681 oversampled
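An equivalent sketch using .rvs() on the frozen rv_histogram distribution, which reads a bit more directly than .isf() on uniform draws; note that either way each column is sampled independently, so marginal distributions are preserved but cross-column correlations are not:
import numpy as np
import pandas as pd
import scipy.stats as stats
df = pd.DataFrame({'x': np.random.random(100) * 3, 'y': np.random.random(100) * 4 - 2})
n = 5
new_values = pd.DataFrame({c: stats.rv_histogram(np.histogram(df[c])).rvs(size=n) for c in df.columns})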

Fill NaN values from another DataFrame (with different shape)

I'm looking for a faster approach to improve the performance of my solution for the following problem: a certain DataFrame has two columns with a few NaN values in them. The challenge is to replace these NaNs with values from a secondary DataFrame.
Below I'll share the data and code used to implement my approach. Let me explain the scenario: merged_df is the original DataFrame with a few columns and some of them have rows with NaN values:
Columns day_of_week and holiday_flg are of particular interest. I would like to fill the NaN values in these columns by looking into a second DataFrame called date_info_df (both DataFrames are reproduced in the code below).
Using the values of the visit_date column in merged_df, it is possible to search the second DataFrame on calendar_date and find matches, which provide the values for day_of_week and holiday_flg.
The end result of this exercise is merged_df with the NaN values in those two columns filled in.
You'll notice the approach I'm using relies on apply() to execute a custom function on every row of merged_df:
For every row, search for NaN values in day_of_week and holiday_flg;
When a NaN is found in either or both of these columns, use the date from that row's visit_date to find an equivalent match in the second DataFrame, specifically in the date_info_df['calendar_date'] column;
After a successful match, the value from date_info_df['day_of_week'] must be copied into merged_df['day_of_week'], and the value from date_info_df['holiday_flg'] must also be copied into merged_df['holiday_flg'].
Here is a working source code:
import math
import pandas as pd
import numpy as np
from IPython.display import display
### Data for df
data = { 'air_store_id': [ 'air_a1', 'air_a2', 'air_a3', 'air_a4' ],
'area_name': [ 'Tokyo', np.nan, np.nan, np.nan ],
'genre_name': [ 'Japanese', np.nan, np.nan, np.nan ],
'hpg_store_id': [ 'hpg_h1', np.nan, np.nan, np.nan ],
'latitude': [ 1234, np.nan, np.nan, np.nan ],
'longitude': [ 5678, np.nan, np.nan, np.nan ],
'reserve_datetime': [ '2017-04-22 11:00:00', np.nan, np.nan, np.nan ],
'reserve_visitors': [ 25, 35, 45, np.nan ],
'visit_datetime': [ '2017-05-23 12:00:00', np.nan, np.nan, np.nan ],
'visit_date': [ '2017-05-23' , '2017-05-24', '2017-05-25', '2017-05-27' ],
'day_of_week': [ 'Tuesday', 'Wednesday', np.nan, np.nan ],
'holiday_flg': [ 0, np.nan, np.nan, np.nan ]
}
merged_df = pd.DataFrame(data)
display(merged_df)
### Data for date_info_df
data = { 'calendar_date': [ '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26', '2017-05-27', '2017-05-28' ],
'day_of_week': [ 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday' ],
'holiday_flg': [ 0, 0, 0, 0, 1, 1 ]
}
date_info_df = pd.DataFrame(data)
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
display(date_info_df)
# Fix the NaN values in day_of_week and holiday_flg by inspecting data from another dataframe (date_info_df)
def fix_weekday_and_holiday(row):
    weekday = row['day_of_week']
    holiday = row['holiday_flg']
    # search dataframe date_info_df for the appropriate value when weekday is NaN
    if (type(weekday) == float and math.isnan(weekday)):
        search_date = row['visit_date']
        #print(' --> weekday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        weekday = date_info_df.at[idx, 'day_of_week']
        #print(' --> weekday search_date=', search_date, 'is', weekday)
        row['day_of_week'] = weekday
    # search dataframe date_info_df for the appropriate value when holiday is NaN
    if (type(holiday) == float and math.isnan(holiday)):
        search_date = row['visit_date']
        #print(' --> holiday search_date=', search_date, 'type=', type(search_date))
        indexes = date_info_df.index[date_info_df['calendar_date'] == search_date].tolist()
        idx = indexes[0]
        holiday = date_info_df.at[idx, 'holiday_flg']
        #print(' --> holiday search_date=', search_date, 'is', holiday)
        row['holiday_flg'] = int(holiday)
    return row
# send every row to fix_day_of_week
merged_df = merged_df.apply(fix_weekday_and_holiday, axis=1)
# Convert data from float to int (to remove decimal places)
merged_df['holiday_flg'] = merged_df['holiday_flg'].astype(int)
display(merged_df)
I did a few measurements so you can understand the struggle:
On a DataFrame with 6 rows, apply() takes 3.01 ms;
On a DataFrame with ~250000 rows, apply() takes 2min 51s.
On a DataFrame with ~1215000 rows, apply() takes 4min 2s.
How do I improve the performance of this task?
You can use an index to speed up the lookup, and combine_first() to fill the NaN values:
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
merged_df[cols] = merged_df[cols].combine_first(
    date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
print(merged_df[cols])
the result:
day_of_week holiday_flg
0 Tuesday 0.0
1 Wednesday 0.0
2 Thursday 0.0
3 Saturday 1.0
This is one solution. It should be efficient as there is no explicit merge or apply.
merged_df['visit_date'] = pd.to_datetime(merged_df['visit_date'])
date_info_df['calendar_date'] = pd.to_datetime(date_info_df['calendar_date'])
s = date_info_df.set_index('calendar_date')['day_of_week']
t = date_info_df.set_index('day_of_week')['holiday_flg']
merged_df['day_of_week'] = merged_df['day_of_week'].fillna(merged_df['visit_date'].map(s))
merged_df['holiday_flg'] = merged_df['holiday_flg'].fillna(merged_df['day_of_week'].map(t))
Result
air_store_id area_name day_of_week genre_name holiday_flg hpg_store_id \
0 air_a1 Tokyo Tuesday Japanese 0.0 hpg_h1
1 air_a2 NaN Wednesday NaN 0.0 NaN
2 air_a3 NaN Thursday NaN 0.0 NaN
3 air_a4 NaN Saturday NaN 1.0 NaN
latitude longitude reserve_datetime reserve_visitors visit_date \
0 1234.0 5678.0 2017-04-22 11:00:00 25.0 2017-05-23
1 NaN NaN NaN 35.0 2017-05-24
2 NaN NaN NaN 45.0 2017-05-25
3 NaN NaN NaN NaN 2017-05-27
visit_datetime
0 2017-05-23 12:00:00
1 NaN
2 NaN
3 NaN
Explanation
s is a pd.Series mapping calendar_date to day_of_week from date_info_df.
Use pd.Series.map, which takes pd.Series as an input, to update missing values, where possible.
Edit: one can also use merge to solve the problem. 10 times faster than the old approach. (Need to make sure "visit_date" and "calendar_date" are of the same format.)
# don't need to `set_index` for date_info_df but select columns needed.
merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
                left_on="visit_date",
                right_on="calendar_date",
                how="left")  # outer should also work
The desired result will be in the "day_of_week_y" and "holiday_flg_y" columns. In this approach, as in the map approach, we don't use the old "day_of_week" and "holiday_flg" at all; we just map the results from date_info_df onto merged_df.
merge can also do the job because date_info_df's entries are unique, so no duplicate rows will be created.
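A short sketch of folding the merge result back into plainly named columns (the _x/_y suffixes are pandas' defaults for overlapping column names; the variable out is just illustrative):
out = merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]],
                      left_on="visit_date", right_on="calendar_date", how="left")
out = (out.drop(columns=["day_of_week_x", "holiday_flg_x", "calendar_date"])
          .rename(columns={"day_of_week_y": "day_of_week", "holiday_flg_y": "holiday_flg"}))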
You can also try using pandas.Series.map. What it does is
Map values of Series using input correspondence (which can be a dict, Series, or function)
# set "calendar_date" as the index such that
# mapping["day_of_week"] and mapping["holiday_flg"] will be two series
# with date_info_df["calendar_date"] as their index.
mapping = date_info_df.set_index("calendar_date")
# this line is optional (depending on the layout of data.)
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
# do replacement here.
merged_df["day_of_week"] = merged_df.visit_date.map(mapping["day_of_week"])
merged_df["holiday_flg"] = merged_df.visit_date.map(mapping["holiday_flg"])
Note that merged_df.visit_date was originally of string type. Thus, we use
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
to make it datetime.
Timings, using the date_info_df and merged_df datasets provided by karlphillip:
date_info_df = pd.read_csv("full_date_info_data.csv")
merged_df = pd.read_csv("full_data.csv")
merged_df.visit_date = pd.to_datetime(merged_df.visit_date)
date_info_df.calendar_date = pd.to_datetime(date_info_df.calendar_date)
cols = ["day_of_week", "holiday_flg"]
visit_date = pd.to_datetime(merged_df.visit_date)
# merge method I propose at the top.
%timeit merged_df.merge(date_info_df[["calendar_date", "day_of_week", "holiday_flg"]], left_on="visit_date", right_on="calendar_date", how="left")
511 ms ± 34.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method without assigning it back
%timeit merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
772 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# HYRY's method with assigning it back
%timeit merged_df[cols] = merged_df[cols].combine_first(date_info_df.set_index("calendar_date").loc[visit_date, cols].set_index(merged_df.index))
258 ms ± 69.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
One can see that HYRY's method runs about 3 times faster if the result is assigned back to merged_df, which is why I thought HYRY's method was faster than mine at first glance. I suspect that is because of the nature of combine_first: its speed probably depends on how sparse merged_df is. Once the results are assigned back, the columns are full, so rerunning the timing is faster.
The performances of the merge and combine_first methods are nearly equivalent. Perhaps there are circumstances in which one is faster than the other; it is left to each user to test them on their own datasets.
Another thing to note is that the merge method assumes every date in merged_df is contained in date_info_df. If some dates appear in merged_df but not in date_info_df, the lookup returns NaN, and that NaN can override parts of merged_df that originally contained values! This is when the combine_first method should be preferred. See the discussion by MaxU in Pandas replace, multi column criteria.
