After transposing my pandas DataFrame, I could not access my columns by name to plot a graph. I want to select two columns, but it fails and keeps saying there are no such column names. I am pretty new to Python, DataFrames and transposing. Could someone help, please?
Below is my input file; I want to transpose rows to columns. The transpose itself succeeded, but I could not select "Canada" and "Cameroon" to plot a graph.
      country  1990  1991  1992  1993  1994  1995
0    Cambodia  65.4  65.7  66.2  66.7  67.1  68.4
1    Cameroon  63.9  63.7  64.7  65.6  66.6  67.6
2      Canada  98.6  99.6  99.6  99.8  99.9  99.9
3  Cape Verde  77.7  77.0  76.6  89.0  79.0  78.0
import pandas as pd
import numpy as np
import re
import math
import matplotlib.pyplot as plt
missing_values=["n/a","na","-","-","N/A"]
df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values)
# Transpose
df = df.transpose()
plt.figure(figsize=(12,8))
plt.plot(df['Canada','Cameroon'], linewidth = 0.5)
plt.title("Time Series for Canada")
plt.show()
It produces a long list of error messages but the final message is
KeyError: ('Canada', 'Cameroon')
There are a few things you might need to do when working with the data.
If the csv file has no header then use df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values, header = None).
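When you transpose, you need to name the columns from the first row of the transposed table: df.columns = df.iloc[0].
Having done this, you need to drop that first row (because it contains the column names): df = df.reindex(df.index.drop(0)). If the file was read with its header row, that row is labelled 'country' rather than 0, so use df.index.drop('country') instead.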
Finally, when accessing the data by columns (in the plt.plot() command) you need to use df[] on the list of columns, i.e. df[['Canada', 'Cameroon']].
EDIT: The code, as it works for me, is as follows:
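# the sample file has a header row, so let read_csv use it (no header=None)
df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values)
df = df.T
# the first row of the transposed frame holds the country names
df.columns = df.iloc[0]
# drop that row, which is labelled 'country' after the transpose
df = df.reindex(df.index.drop('country'))
df.index.name = 'Year'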
plt.figure(figsize=(12,8))
plt.plot(df[['Canada','Cameroon']], linewidth = 0.5)
plt.title("Time Series for Canada")
plt.show()
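As an alternative sketch (not the code from the answer above): if the file has a header row, setting the country column as the index before transposing makes the country names the column labels directly, with no row to drop afterwards. The inline CSV text is an assumed stand-in for StackoverflowGap.csv, built from the sample rows in the question.

```python
import io
import pandas as pd

# Stand-in for StackoverflowGap.csv (assumed layout: header row, one row per country)
csv_text = """country,1990,1991,1992,1993,1994,1995
Cambodia,65.4,65.7,66.2,66.7,67.1,68.4
Cameroon,63.9,63.7,64.7,65.6,66.6,67.6
Canada,98.6,99.6,99.6,99.8,99.9,99.9
Cape Verde,77.7,77.0,76.6,89.0,79.0,78.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Move the country names into the index, then transpose so that
# each country becomes a column and each year becomes a row
wide = df.set_index("country").T
wide.index.name = "Year"

# A list inside the brackets selects several columns at once
selected = wide[["Canada", "Cameroon"]]
print(selected)
```

From here, selected.plot(linewidth=0.5) followed by plt.show() should draw both series against the year.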
I have written a program like so:
# Author: Evan Gertis
# Date : 11/09
# program: Linear Regression
# Resource: https://seaborn.pydata.org/generated/seaborn.scatterplot.html
import seaborn as sns
import pandas as pd
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Step 1: load the data
grades = pd.read_csv("grades.csv")
logging.info(grades.head())
# Step 2: plot the data
plot = sns.scatterplot(data=grades, x="Hours", y="GPA")
fig = plot.get_figure()
fig.savefig("out.png")
Using the data set
Hours,GPA,Hours,GPA,Hours,GPA
11,2.84,9,2.85,25,1.85
5,3.20,5,3.35,6,3.14
22,2.18,14,2.60,9,2.96
23,2.12,18,2.35,20,2.30
20,2.55,6,3.14,14,2.66
20,2.24,9,3.05,19,2.36
10,2.90,24,2.06,21,2.24
19,2.36,25,2.00,7,3.08
15,2.60,12,2.78,11,2.84
18,2.42,6,2.90,20,2.45
I would like to plot all of the relationships, but at the moment I get just one plot.
Expected: all relationships plotted.
Actual: only the first Hours/GPA pair is plotted.
I wrote a basic program and I was expecting all of the relationships to be plotted.
The origin of the problem is that several columns in your file share the same name, so when pandas reads the file it appends a numeric suffix to the repeated names in the loaded data frame:
import seaborn as sns
import pandas as pd
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
grades = pd.read_csv("grades.csv")
print(grades.columns)
>>> Index(['Hours', 'GPA', 'Hours.1', 'GPA.1', 'Hours.2', 'GPA.2'], dtype='object')
Therefore, when you draw the scatter plot, you need to pass the column names that pandas assigned:
# in case you want all scatter plots in the same figure
plot = sns.scatterplot(data=grades, x="Hours", y="GPA", label='GPA')
sns.scatterplot(data=grades, x='Hours.1', y='GPA.1', ax=plot, label="GPA.1")
sns.scatterplot(data=grades, x='Hours.2', y='GPA.2', ax=plot, label='GPA.2')
fig = plot.get_figure()
fig.savefig("out.png")
There are better options than manually creating a plot for each group of columns
Because the columns in the file have redundant names, pandas automatically renames them.
Imports and DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# read the data from the file
df = pd.read_csv('d:/data/gpa.csv')
# display(df)
Hours GPA Hours.1 GPA.1 Hours.2 GPA.2
0 11 2.84 9 2.85 25 1.85
1 5 3.20 5 3.35 6 3.14
2 22 2.18 14 2.60 9 2.96
3 23 2.12 18 2.35 20 2.30
4 20 2.55 6 3.14 14 2.66
5 20 2.24 9 3.05 19 2.36
6 10 2.90 24 2.06 21 2.24
7 19 2.36 25 2.00 7 3.08
8 15 2.60 12 2.78 11 2.84
9 18 2.42 6 2.90 20 2.45
Option 1: Chunk the column names
This option can be used to plot the data in a loop without manually creating each plot
Using the answer from "How to iterate over a list in chunks" creates a list of column name groups:
[Index(['Hours', 'GPA'], dtype='object'), Index(['Hours.1', 'GPA.1'], dtype='object'), Index(['Hours.2', 'GPA.2'], dtype='object')]
# create groups of column names to be plotted together
def chunker(seq, size):
    return [seq[pos:pos + size] for pos in range(0, len(seq), size)]
# function call
col_list = chunker(df.columns, 2)
# iterate through each group of column names to plot
for x, y in chunker(df.columns, 2):
    sns.scatterplot(data=df, x=x, y=y, label=y)
Option 2: Fix the data
# filter each group of columns, melt the result into a long form, and get the value
h = df.filter(like='Hours').melt().value
g = df.filter(like='GPA').melt().value
# get the gpa column names
gpa_cols = df.columns[1::2]
# use numpy to create a list of labels with the appropriate length
labels = np.repeat(gpa_cols, len(df))
# otherwise use a list comprehension to create the labels
# labels = [v for x in gpa_cols for v in [x]*len(df)]
# create a new dataframe
dfl = pd.DataFrame({'hours': h, 'gpa': g, 'label': labels})
# save dfl if desired
dfl.to_csv('gpa_long.csv', index=False)
# plot
sns.scatterplot(data=dfl, x='hours', y='gpa', hue='label')
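A different way to reach the same long form, sketched here as an alternative to the melt-based code above, is to build one small frame per Hours/GPA pair and stack them with pd.concat. The three-pair frame below is a shortened stand-in for the data, using the column names pandas assigns on load.

```python
import pandas as pd

# Small stand-in for the wide data, using the column names pandas assigns
df = pd.DataFrame({
    "Hours": [11, 5], "GPA": [2.84, 3.20],
    "Hours.1": [9, 5], "GPA.1": [2.85, 3.35],
    "Hours.2": [25, 6], "GPA.2": [1.85, 3.14],
})

parts = []
# Walk over the columns two at a time: (Hours, GPA), (Hours.1, GPA.1), ...
for pos in range(0, len(df.columns), 2):
    hours_col, gpa_col = df.columns[pos:pos + 2]
    part = pd.DataFrame({
        "hours": df[hours_col],
        "gpa": df[gpa_col],
        "label": gpa_col,  # remember which pair the row came from
    })
    parts.append(part)

# Stack the pairs on top of each other into one long frame
dfl = pd.concat(parts, ignore_index=True)
print(dfl)
```

The resulting dfl has the same three columns as in Option 2, so the same sns.scatterplot call with hue='label' applies.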
Plot Result
I'm trying to plot a pandas dataframe using matplotlib however having issues with the grouping. The dataframe contains statistics for a player in each round of the season. My dataframe is much larger however for this example I have simplified it:
Desc  Round 1  Round 2  Round 3  Round 4  Round 5  Round 6  Round 7  Round 8  Round 9  Round 10
Ben   22.3     33.3     21.5     27.7     31.3     43       33.5     20       29.7     22.7
Tom            28.2     29.2     23.1     25       21.4     22.3     26.2     25.3     19.6
Jack  21.3     30.4     20.8     18       24.5     28.3     32.6     17       25.1     23.7
However when I simply try to plot this using:
df.plot()
plt.show()
The lines are grouped by the round number instead of the player's name and it appears the Y values are actually the player's row index. Here is the plot it outputs.
So I believe the pandas dataframe isn't correctly indexed for rows / columns, thus causing this problem. I've looked into using df.groupby but can't find a solution.
I can easily create the line graph I'm after using MS Excel - Here is the output I would like:
Does anyone have a solution for what I can do to either my dataframe or my plot code to get the desired outcome? I have already made sure I have set the row index to the player's name using:
df.set_index('Desc')
However this hasn't fixed the issue.
Use set_index then transpose:
Creating data
import pandas as pd
import matplotlib.pyplot as plt
colNames = ['Desc', 'Round1', 'Round2', 'Round3', 'Round4', 'Round5', 'Round6', 'Round7', 'Round8', 'Round9', 'Round10']
df = pd.DataFrame(columns = colNames)
df.loc[len(df)] = ['Ben', '22.3', '33.3', '21.5', '27.7', '31.3', '43', '33.5', '20', '29.7', '22.7']
df.loc[len(df)] = ['Tom', '', '28.2', '29.2', '23.1', '25', '21.4', '22.3', '26.2', '25.3', '19.6']
df.loc[len(df)] = ['Jack', '21.3', '30.4', '20.8', '18', '24.5', '28.3', '32.6', '17', '25.1', '23.7']
Pre-processing
df.set_index("Desc", inplace = True)
df = df.apply(pd.to_numeric, errors='coerce')
Plotting the data
df.T.plot()
plt.show()
This gives us the expected graph.
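A likely reason the set_index call in the question appeared to do nothing is that DataFrame.set_index returns a new DataFrame by default and leaves the original unchanged unless the result is assigned back or inplace=True is passed. A minimal sketch, using two players and two rounds from the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "Desc": ["Ben", "Jack"],
    "Round1": [22.3, 21.3],
    "Round2": [33.3, 30.4],
})

df.set_index("Desc")              # result is discarded, df is unchanged
cols_before = list(df.columns)    # 'Desc' is still a regular column here
print(cols_before)

df = df.set_index("Desc")         # assign the result back to keep the new index
by_round = df.T                   # rounds become rows, players become columns
print(by_round)
```

Calling by_round.plot() would then draw one line per player with the rounds along the x-axis.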
I want to write the average of two columns (Max and Min) into another column (Mean) for each row.
Specifically, as it iterates through the rows, determine the mean of the first two cells and write it into the third cell of that row.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
sheet._cell_overwrite_ok = True
df = pd.read_excel('tempMean.xlsx', sheet_name='tempMeanSheet')
listMax = df['Max']
listMin = df['Min']
listMean = df['Mean']
for index, row in df.iterrows():
    print('Max', row.Max, 'Min', row.Min, 'Mean', row.Mean)
Current Results:
Max 29.7 Min 20.5 Mean nan
Max 29.2 Min 20.2 Mean nan
Max 29.1 Min 21.2 Mean nan
Results I want:
Max 29.7 Min 20.5 Mean 25.1
Max 29.2 Min 20.2 Mean 24.7
Max 29.1 Min 21.2 Mean 25.15
I have been able to iterate through rows as seen in code.
However, I am not sure how to apply the equation to find mean for each of these rows.
Consequently, the row for mean has no data.
Let me know if anything doesn't make sense.
Try this:
df = pd.read_excel('tempMean.xlsx', sheet_name='tempMeanSheet')
mean = [(row["Min"] + row["Max"]) / 2 for index, row in df.iterrows()]
df = df.assign(Mean=mean)
Consider calculating the column beforehand, adding dummy columns for your Min, Max, and Mean labels, and printing with to_string, which avoids any loops:
# VECTORIZED CALCULATION OF MEAN
df['Mean'] = (df['Max'] + df['Min']) / 2
# ADD LABEL COLUMNS AND RE-ORDER COLUMNS
df = (df.assign(MaxLabel='Max', MinLabel='Min', MeanLabel='Mean')
        .reindex(['MaxLabel', 'Max', 'MinLabel', 'Min', 'MeanLabel', 'Mean'], axis='columns')
     )
# OUTPUT TO SCREEN IN ONE CALL
print(df.to_string())
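The same row-wise average can also be written with DataFrame.mean along axis=1, which extends naturally if more than two columns should be averaged. A minimal sketch, with the Max and Min values copied from the question's printed output:

```python
import pandas as pd

# Values taken from the question's printed output
df = pd.DataFrame({"Max": [29.7, 29.2, 29.1], "Min": [20.5, 20.2, 21.2]})

# Row-wise mean over the two columns, computed for all rows at once
df["Mean"] = df[["Max", "Min"]].mean(axis=1)
print(df)
```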
I have dataframe total_year, which contains three columns (year, action, comedy).
How can I plot two columns (action and comedy) on y-axis?
My code plots only one:
total_year[-15:].plot(x='year', y='action', figsize=(10,5), grid=True)
Several column names may be provided to the y argument of the pandas plotting function. Those should be specified in a list, as follows.
df.plot(x="year", y=["action", "comedy"])
Complete example:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"year": [1914,1915,1916,1919,1920],
"action" : [2.6,3.4,3.25,2.8,1.75],
"comedy" : [2.5,2.9,3.0,3.3,3.4] })
df.plot(x="year", y=["action", "comedy"])
plt.show()
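Applied to the call from the question, which slices the last 15 rows before plotting, the list form of y might look like the sketch below. The contents of total_year are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for total_year (values are invented for illustration)
total_year = pd.DataFrame({
    "year": list(range(2000, 2020)),
    "action": [2.0 + 0.05 * i for i in range(20)],
    "comedy": [3.0 - 0.03 * i for i in range(20)],
})

# Keep the last 15 rows, then plot both columns against year
ax = total_year.tail(15).plot(
    x="year", y=["action", "comedy"], figsize=(10, 5), grid=True
)
```

The returned ax is a matplotlib Axes, so titles or labels can be added before calling plt.show().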
By default, pandas.DataFrame.plot() uses the index for the x-axis and plots all other numeric columns as y values.
So setting year column as index will do the trick:
total_year.set_index('year').plot(figsize=(10,5), grid=True)
When using pandas.DataFrame.plot, it's only necessary to specify a column to the x parameter.
The caveat is, the rest of the columns with numeric values will be used for y.
The following code contains extra columns to demonstrate. Note, 'date' is left as a string. However, if 'date' is converted to a datetime dtype, the plot API will also plot the 'date' column on the y-axis.
If the dataframe includes many columns, some of which should not be plotted, then specify the y parameter as shown in this answer, but if the dataframe contains only columns to be plotted, then specify only the x parameter.
In cases where the index is to be used as the x-axis, then it is not necessary to specify x=.
import pandas as pd
# test data
data = {'year': [1914, 1915, 1916, 1919, 1920],
'action': [2.67, 3.43, 3.26, 2.82, 1.75],
'comedy': [2.53, 2.93, 3.02, 3.37, 3.45],
'test1': ['a', 'b', 'c', 'd', 'e'],
'date': ['1914-01-01', '1915-01-01', '1916-01-01', '1919-01-01', '1920-01-01']}
# create the dataframe
df = pd.DataFrame(data)
# display(df)
year action comedy test1 date
0 1914 2.67 2.53 a 1914-01-01
1 1915 3.43 2.93 b 1915-01-01
2 1916 3.26 3.02 c 1916-01-01
3 1919 2.82 3.37 d 1919-01-01
4 1920 1.75 3.45 e 1920-01-01
# plot the dataframe
df.plot(x='year', figsize=(10, 5), grid=True)
From "Fill in missing row values in pandas dataframe":
I have the following dataframe and would like to fill in missing values.
mukey   hzdept_r  hzdepb_r  sandtotal_r  silttotal_r
425897  0         61
425897  61        152       5.3          44.7
425911  0         30        30.1         54.9
425911  30        74        17.7         49.8
425911  74        84
I want each missing value to be the average of values corresponding to that mukey. In this case, e.g. the first row missing values will be the average of sandtotal_r and silttotal_r corresponding to mukey==425897. pandas fillna doesn't seem to do the trick. Any help?
While the code works for the sample dataframe in that example, it is failing on the larger dataset I have uploaded here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
df1.fillna(df.groupby('mukey').mean(),inplace=True)
df1.reset_index()
I get the error: InvalidIndexError. Why is it not working?
Use combine_first. It allows you to patch up the missing data on the left dataframe with the matching data on the right dataframe based on same index.
In this case, df1 is the one on the left and df2, the means, is the one on the right.
In [48]: df = pd.read_csv('www004.csv')
...: df1 = df.set_index('mukey')
...: df2 = df.groupby('mukey').mean()
In [49]: df1.loc[426178,:]
Out[49]:
hzdept_r hzdepb_r sandtotal_r silttotal_r claytotal_r om_r
mukey
426178 0 36 NaN NaN NaN 72.50
426178 36 66 NaN NaN NaN 72.50
426178 66 152 42.1 37.9 20 0.25
In [50]: df2.loc[426178,:]
Out[50]:
hzdept_r 34.000000
hzdepb_r 84.666667
sandtotal_r 42.100000
silttotal_r 37.900000
claytotal_r 20.000000
om_r 48.416667
Name: 426178, dtype: float64
In [51]: df3 = df1.combine_first(df2)
...: df3.loc[426178,:]
Out[51]:
hzdept_r hzdepb_r sandtotal_r silttotal_r claytotal_r om_r
mukey
426178 0 36 42.1 37.9 20 72.50
426178 36 66 42.1 37.9 20 72.50
426178 66 152 42.1 37.9 20 0.25
Note that the following rows still won't have values in the resulting df3
426162
426163
426174
426174
426255
because they were single rows to begin with, hence, .mean() doesn't mean anything to them (eh, see what I did there?).
The problem is the duplicate index values. When you use df1.fillna(df2), if you have multiple NaN entries in df1 where both the index and the column label are the same, pandas will get confused when trying to slice df1, and throw that InvalidIndexError.
Your sample dataframe works because even though you have duplicate index values there, only one of each index value is null. Your larger dataframe contains null entries that share both the index value and column label in some cases.
To make this work, you can do this one column at a time. For some reason, when operating on a series, pandas will not get confused by multiple entries of the same index, and will simply fill the same value in each one. Hence, this should work:
import pandas as pd
df = pd.read_csv('www004.csv')
# CSV file is here: https://www.dropbox.com/s/w3m0jppnq74op4c/www004.csv?dl=0
df1 = df.set_index('mukey')
grouped = df.groupby('mukey').mean()
for col in ['sandtotal_r', 'silttotal_r']:
    df1[col] = df1[col].fillna(grouped[col])
df1.reset_index()
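As an alternative to the loop above (a swapped-in technique, not the approach from this answer), groupby with transform('mean') sidesteps the duplicate-index problem: the group means come back aligned to the original row positions, so mukey never has to become the index. The sketch uses the five sample rows from the question, with NaN for the blank cells.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; NaN marks the missing cells
df = pd.DataFrame({
    "mukey": [425897, 425897, 425911, 425911, 425911],
    "hzdept_r": [0, 61, 0, 30, 74],
    "hzdepb_r": [61, 152, 30, 74, 84],
    "sandtotal_r": [np.nan, 5.3, 30.1, 17.7, np.nan],
    "silttotal_r": [np.nan, 44.7, 54.9, 49.8, np.nan],
})

cols = ["sandtotal_r", "silttotal_r"]

# transform("mean") returns the group mean repeated on every row of the group,
# aligned with df's own row index, so duplicate mukey values cause no ambiguity
group_means = df.groupby("mukey")[cols].transform("mean")
df[cols] = df[cols].fillna(group_means)
print(df)
```

Groups in which every value is missing still end up as NaN, since there is nothing to average.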
NOTE: Be careful using the combine_first method if you ever have "extra" data in the dataframe you're filling from. The combine_first function will include ALL indices from the dataframe you're filling from, even if they're not present in the original dataframe.
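The caveat can be seen in a small sketch with hypothetical frames, where the frame used for filling carries one index value that the original does not have:

```python
import numpy as np
import pandas as pd

# Hypothetical frames: 'means' has an extra index value (3) not present in df1
df1 = pd.DataFrame({"sand": [np.nan, 5.3]}, index=[1, 2])
means = pd.DataFrame({"sand": [4.0, 5.3, 9.9]}, index=[1, 2, 3])

patched = df1.combine_first(means)
print(patched)  # includes a row for index 3 that df1 never had

# Keeping only rows whose index value occurs in df1 removes the extras again
trimmed = patched[patched.index.isin(df1.index)]
print(trimmed)
```

Filtering with isin is used here because it also behaves sensibly when the index contains duplicate values, as it does in the mukey data.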