Parsing a CSV file using pandas (Python)

I have been using matplotlib for quite some time now and it is great; however, I want to switch to pandas, and my first attempt at it didn't go so well.
My data set looks like this:
sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
etc.....
First, I want to search the CSV for anything with the name mike and create its own list. With this list I want to be able to do some math, for example sam[3] + winter[4] or sam[1]/10. The last part would be to plot its columns against each other.
Going through this page
http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
The only examples I see assume a column header; however, I don't have any headers. I only know the position in a row of the values I want.
So my questions are:
How do I create a list for each name in the first column (sam, winter, summer)?
Is this method efficient if my CSV has millions of data points?
Could I use matplotlib to plot a pandas DataFrame?
i.e.:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike[1], winter[3], label='Mike vs Winter speed', color = 'red')

You can read a csv without headers:
data=pd.read_csv(filepath, header=None)
Columns will be numbered starting from 0.
Selecting and filtering:
all_summers = data[data[0]=='summer']
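As a minimal sketch of how this filtering covers the arithmetic from the question, the snippet below reads the sample rows from an in-memory string (an assumption made here so it runs without a file) and keeps each name as its own filtered DataFrame. The variable names total and scaled are only illustrative.

```python
import io

import pandas as pd

# Sample rows from the question, held in a string instead of a file (assumption)
csv_text = """sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
"""

data = pd.read_csv(io.StringIO(csv_text), header=None)

# One filtered DataFrame per name; columns are addressed by position
sam = data[data[0] == "sam"]
winter = data[data[0] == "winter"]

# Arithmetic aligns on the row index, so reset it to add rows by position
total = sam[3].reset_index(drop=True) + winter[4].reset_index(drop=True)

# Scalar operations apply to every element of the column
scaled = sam[1] / 10

print(total.tolist())
print(scaled.tolist())
```

Without reset_index the two columns would be aligned on their original row labels, which do not overlap here, and the sum would be all NaN.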
If you want to do some operations grouping by the first column, it will look like this:
data.groupby(0).sum()
data.groupby(0).count()
...
Selecting a row after grouping:
sums = data.groupby(0).sum()
sums.loc['sam']
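For illustration, here is a small self-contained sketch of the grouping step, again assuming the sample rows are read from a string rather than from filepath. It only shows what the grouped sums look like for one name.

```python
import io

import pandas as pd

# A few sample rows from the question, read from a string (assumption)
csv_text = """sam,123,184,2.6,543
winter,124,284,2.6,541
winter,178,584,2.6,545
sam,112,684,2.6,546
mike,110,984,2.6,548
"""

data = pd.read_csv(io.StringIO(csv_text), header=None)

# Group by the first column; the names become the index of the result
sums = data.groupby(0).sum()

# Row of column sums for a single name
sam_sums = sums.loc["sam"]
print(sam_sums)
```

Because the grouping column becomes the index, sums.loc['sam'] returns a Series whose labels are the remaining column positions 1 to 4.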
Plotting example:
import matplotlib.pyplot as plt
sums.plot()
plt.show()
For more details about plotting, see: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html

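import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(filepath, header=None)
# keep the filtered rows as DataFrames so columns can be selected by position
mike = df[df[0]=='mike']
winter = df[df[0]=='winter']
Then you can plot columns of those subsets as lists, as you wanted to above. Note that plot needs x and y of equal length, so both subsets must contain the same number of rows.
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike[1].tolist(), winter[3].tolist(), label='Mike vs Winter speed', color = 'red')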

Related

Python: graph from csv filtered by pandas shows no graph

I would like to create a graph on filtered data from a .csv file. A graph is created but without content.
Here is the code and the result:
# var 4 graph
xs = []
ys = []
name = "Anna"
gender = "F"
state = "CA"
# 4 reading csv file
import pandas as pd
# reading csv file
dataFrame = pd.read_csv("../Kursmaterialien/data/names.csv")
#print("DataFrame...\n",dataFrame)
# select rows containing text
dataFrame = dataFrame[(dataFrame['Name'] == name)&dataFrame['State'].str.contains(state)&dataFrame['Gender'].str.contains(gender)]
#print("\nFetching rows with text ...\n",dataFrame)
print(dataFrame)
# append var with value
xs.append(list(dataFrame['Year']))
ys.append(list(dataFrame['Count']))
#xs.append(list(map(str,dataFrame['Year'])))
#ys.append(list(map(str,dataFrame['Count'])))
print(xs)
print(ys)
Result from print(xs) and print(ys)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(xs,ys)
plt.show()
Resulting plot:
I see that the variables start with two brackets, but don't know if that is the problem and how to fix it.
The graphic should look something like this :
You are correct about the two brackets: you have to extract the data from the inner bracket. This is done by using index 0 to get the first element of the outer list (which is also the only one).
This should work:
XS=xs[0]
YS=ys[0]
plt.plot(XS,YS)
plt.show()
With your double brackets, plt.plot treats each pair of points as a separate data series. Plotting a single point doesn't draw a line, and markers are off by default, so nothing is visible. If you add markers to your plot, you will see the markers in different colours, but no lines.
plt.plot(xs,ys,'o') #round marker
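To make the difference concrete, here is a small sketch with made-up year and count values (not the data from names.csv). It counts how many line objects matplotlib creates for the nested and the flat input. The Agg backend is selected only so that the snippet runs without a display; in a notebook you would keep %matplotlib inline.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend, an assumption for running outside a notebook
import matplotlib.pyplot as plt

# Hypothetical values standing in for dataFrame['Year'] and dataFrame['Count']
years = [2000, 2001, 2002, 2003]
counts = [10, 12, 9, 15]

fig, (ax_nested, ax_flat) = plt.subplots(1, 2)

# Nested lists are read as 2D input with one row: each column becomes its own series
nested_lines = ax_nested.plot([years], [counts])

# Flat lists are read as a single series with four points
flat_lines = ax_flat.plot(years, counts)

print(len(nested_lines), "series from the nested lists")
print(len(flat_lines), "series from the flat lists")
```

Using xs.extend(...) instead of xs.append(list(...)) in the original code would avoid the extra nesting in the first place.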

how to make stacked plots for dataframe with multiple index in python?

I have trade export data which is collected weekly. I intend to make a stacked bar plot with matplotlib, but I am having difficulty managing a pandas dataframe with multiple indexes. I looked into this post but was not able to get what I am expecting. Can anyone suggest a possible way of doing this in Python? It seems I made the wrong data aggregation, and I think I might use a for loop to iterate over the years and then make a stacked bar plot on a weekly basis. Does anyone know how to make this easier in matplotlib?
reproducible data and my attempt
import pandas as pd
import matplotlib.pyplot as plt
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/0eb9d60374c8a0c17449eef4583705d7/raw/edea1777466284f2958ffac6cafb86683e08a65e/mydata.csv'
df = pd.read_csv(url, parse_dates=['weekly'])
df.drop('Unnamed: 0', axis=1, inplace=True)
nn = df.set_index(['year','week'])
nn.drop("weekly", axis=1, inplace=True)
f, a = plt.subplots(3,1)
nn.xs('2018').plot(kind='bar',ax=a[0])
nn.xs('2019').plot(kind='bar',ax=a[1])
nn.xs('2020').plot(kind='bar',ax=a[2])
plt.show()
plt.close()
This attempt didn't work for me. Instead of explicitly selecting years like 2018, 2019, ..., is there a more efficient way to make stacked bar plots for a dataframe with multiple indexes?
Desired output
This is the desired stacked bar plot for the year 2018 as an example:
How should I get my desired stacked bar plot?
Try this:
nn.groupby(level=0).plot.bar(stacked=True)
Or, to prevent the year from appearing as part of a tuple on the x axis:
for n, g in nn.groupby(level=0):
    g.loc[n].plot.bar(stacked=True)
Update per request in comments
for n, g in nn.groupby(level=0):
    ax = g.loc[n].plot.bar(stacked=True, title=f'{n} Year', figsize=(8,5))
    ax.legend(loc='lower center')
Change layout position
fig, ax = plt.subplots(1,3)
axi = iter(ax)
for n, g in nn.groupby(level=0):
    axs = next(axi)
    g.loc[n].plot.bar(stacked=True, title=f'{n}', figsize=(15,8), ax=axs)
    axs.legend(loc='lower center')
Try using loc instead of xs:
f, a = plt.subplots(3,1)
for x, ax in zip(nn.index.unique('year'),a.ravel()):
    nn.loc[x].plot.bar(stacked=True, ax=ax)
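The loop over years can be tried on a small synthetic frame. In this sketch the column names beef and pork and all values are hypothetical stand-ins for the real export data, and the Agg backend is an assumption so that it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend (assumption)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical weekly data indexed by (year, week)
idx = pd.MultiIndex.from_product([[2018, 2019], [1, 2, 3]], names=["year", "week"])
nn = pd.DataFrame(
    {"beef": [1, 2, 3, 4, 5, 6], "pork": [6, 5, 4, 3, 2, 1]},
    index=idx,
)

# One subplot per year, without hard-coding the years
years = nn.index.unique("year")
fig, axes = plt.subplots(len(years), 1, squeeze=False, figsize=(6, 6))

for year, ax in zip(years, axes.ravel()):
    # nn.loc[year] drops the year level, so the x axis shows only the week
    nn.loc[year].plot.bar(stacked=True, ax=ax, title=str(year))

fig.tight_layout()
```

Passing squeeze=False keeps axes two-dimensional even when there is only one year, so the same loop works for any number of years.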

How to plot a heatmap using seaborn or matplotlib?

I have a dataframe that I am trying to visualize as a heatmap. I used matplotlib to make a heatmap, but it is showing data that is not part of my dataframe.
I've tried to create a heatmap using matplotlib from an example I found online and changed the code to work for my data. But on the left side and top of the graph there are values that are not part of my data, and I'm not sure how to remove them.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
url = 'http://mcubed.net/ncaab/seeds.shtml'
#Getting the website text
data = requests.get(url).text
#Parsing the website
soup = BeautifulSoup(data, "html5lib")
#Create an empty list
dflist = []
#If we look at the html, we don't want the tag b, but what's next to it
#StringIO(b.next.next) takes the correct text and makes it readable to pandas
for b in soup.findAll({"b"})[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next.next), sep = r'\s+', header = None))
dflist[0]
#Created a new list, because the melt we are going to do cannot replace
#the dataframes in dflist
meltedDF = []
#The second item in the loop is the team number starting from 1
for df, teamnumber in zip(dflist, (np.arange(len(dflist))+1)):
    #Creating the team name
    name = "Team " + str(teamnumber)
    #Making the team name a column, with the values in df[0] and df[1] in our dataframes
    df[name] = df[0] + df[1]
    #Melting the dataframe to make the team name its own column
    meltedDF.append(df.melt(id_vars = [0, 1, 2, 3]))
# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)
# Final cleaning of our new single DataFrame
allTeamStats = allTeamStats.rename(columns = {0:name, 2:'Record', 3:'Win Percent', 'variable':'Team' , 'value': 'VS'})\
    .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis = 1)
allTeamStats
#Graph visualization Making a HeatMap
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
y=["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16"]
x=["16","15","14","13","12","11","10","9","8","7","6","5","4","3","2","1"]
winp = []
for i in x:
    lst = []
    for j in y:
        percent = allTeamStats.loc[(allTeamStats["Team"]== 'Team '+i) &\
                                   (allTeamStats["VS"]== "vs.#"+j)]['Win Percent'].iloc[0]
        percent = float(percent[:-1])
        lst.append(percent)
    winp.append(lst)
winpercentage= np.array([[]])
fig,ax=plt.subplots(figsize=(18,18))
im= ax.imshow(winp, cmap='hot')
# We want to show all ticks...
ax.set_xticks(np.arange(len(y)))
ax.set_yticks(np.arange(len(x)))
# ... and label them with the respective list entries
ax.set_xticklabels(y)
ax.set_yticklabels(x)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(x)):
    for j in range(len(y)):
        text = ax.text(j, i, winp[i][j],
                       ha="center", va="center", color="red")
ax.set_title("Win Percentage of Each Matchup", fontsize= 40)
heatmap = plt.pcolor(winp)
plt.colorbar(heatmap)
ax.set_ylabel('Seeds', fontsize=40)
ax.set_xlabel('Seeds', fontsize=40)
plt.show()
The results I get are what I want except for the two lines on the left side and top of the heatmap. I'm unsure where these values are coming from, and to see them more easily I used cmap='hot' to highlight the values that are not supposed to be there. Could you help me fix my code to plot it correctly, or plot an entirely new heatmap using seaborn (my TA told me to try seaborn, but I've never used it) with my data? Anything helps. Thanks!
I think the culprit is this line: im= ax.imshow(winp, cmap='hot') in your code. Delete it and try again. Basically, anything that you plotted after that line was laid over what that line created. The left and top "margins" were the only parts of the image on the bottom that you could see.
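The two layers do not line up because imshow centres each cell on an integer coordinate, so its image starts at -0.5, while pcolor draws cells starting at 0. A different way to avoid the overlap, sketched below, is to keep imshow (the ticks and text annotations in the question are already placed on its integer coordinates) and drop the pcolor call, passing the image itself to the colorbar. This is an alternative to the suggestion above, not the same fix, and the random matrix is only a stand-in for winp. The Agg backend is an assumption for running without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend (assumption)
import matplotlib.pyplot as plt
import numpy as np

# Random 16x16 matrix standing in for the win percentages (assumption)
rng = np.random.default_rng(0)
winp = rng.uniform(0, 100, size=(16, 16)).round(1)

labels_x = [str(i) for i in range(1, 17)]
labels_y = labels_x[::-1]

fig, ax = plt.subplots(figsize=(8, 8))

# A single image layer; no pcolor drawn on top of it
im = ax.imshow(winp, cmap="hot")
fig.colorbar(im, ax=ax)

# Ticks sit on integer coordinates, which are the cell centres for imshow
ax.set_xticks(np.arange(len(labels_x)))
ax.set_yticks(np.arange(len(labels_y)))
ax.set_xticklabels(labels_x)
ax.set_yticklabels(labels_y)

# The image spans half a cell beyond the first and last tick
extent = im.get_extent()
print(extent)
```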

In Python, how do I create a line plot based on groupby() of two categories, with one of those categories being the legend?

I used this code to group the avg. life expectancy by year and continent:
avg_lifeExp_by_cont_yr = df.groupby(['year','continent'])['lifeExp'].mean()
The result looks like this:
I want to create a line chart that has the year on the x-axis, avg. life expectancy on the y-axis, and the continent to be used as the legend (so one line per continent).
You can use df.unstack('continent') to place the continents as columns. The result is a 2D table where the index (year) is the X and each column is a Y. You can call the plot function directly or control the plot yourself with raw matplotlib operations.
Thanks for your data, here is the complete code sample for your request:
# imports
import pandas as pd
import matplotlib.pyplot as plt
# prepare dataframe
df = pd.read_csv('gapminder.tsv', sep='\t')
df = df.groupby(['year','continent']).lifeExp.mean()
# unstack the `continent` index, to place it as columns
df = df.unstack(level='continent')
# The name of the columns index becomes the legend title
# when using the dataframe plot function
df.columns.name = 'Life Expectancy'
# Now we have a 2D table: the index (year) becomes X
# and the columns become Y
# In [14]: df.head()
# Out[14]:
# Life Expectancy     Africa  Americas       Asia     Europe  Oceania
# year
# 1952             39.135500  53.27984  46.314394  64.408500   69.255
# 1957             41.266346  55.96028  49.318544  66.703067   70.295
# 1962             43.319442  58.39876  51.563223  68.539233   71.085
# 1967             45.334538  60.41092  54.663640  69.737600   71.310
# 1972             47.450942  62.39492  57.319269  70.775033   71.910
# matplotlib operations
# Here we use the dataframe plot function, passing ax so it draws on our axes
# You could also plot one column at a time with raw matplotlib for fine control
# Please polish the figure with more configurations
fig, ax = plt.subplots(figsize=(6, 4.5))
df.plot(ax=ax)
There are several tricks in the data processing; please check the comments in the code. The rough plot looks like this:
You can polish the figure with more matplotlib operations. For example:
Set the y-label
The legend is too tall; set it to two columns to reduce its height
Change the colors or styles of the lines
Add markers to the lines
Here are some tweaks:
# set axis labels
ax.set_xlabel('Year')
ax.set_ylabel('Life Expectancy')
# set markers
markers = ['o', 's', 'd', '^', 'v']
for i, line in enumerate(ax.get_lines()):
    line.set_marker(markers[i])
# update legend
ax.legend(ax.get_lines(), df.columns, loc='best', ncol=2)
plt.tight_layout()
The figure now looks like:
Use pivot_table:
data = pd.read_csv("https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv", sep="\t")
data.pivot_table(values="lifeExp", index="year", columns="continent").plot()
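The two approaches in this thread produce the same table, since pivot_table aggregates with the mean by default. A minimal sketch on a hypothetical six-row miniature of the gapminder columns (the values are made up) shows them side by side.

```python
import pandas as pd

# Hypothetical miniature of the gapminder columns used above
df = pd.DataFrame(
    {
        "year": [1952, 1952, 1952, 1957, 1957, 1957],
        "continent": ["Asia", "Asia", "Europe", "Asia", "Europe", "Europe"],
        "lifeExp": [40.0, 50.0, 64.0, 48.0, 66.0, 68.0],
    }
)

# Approach 1: group by both keys, then move continent from the index to the columns
via_unstack = df.groupby(["year", "continent"])["lifeExp"].mean().unstack("continent")

# Approach 2: pivot_table does the grouping and reshaping in one call
via_pivot = df.pivot_table(values="lifeExp", index="year", columns="continent")

print(via_unstack)
print(via_pivot)
```

Either result can be passed to .plot() to get one line per continent with the year on the x axis.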

Adding Legends in Pandas Plot

I am plotting density graphs using pandas plot, but I am not able to add appropriate legends for each of the graphs. My code and result are below:
for i in tickers:
    df = pd.DataFrame(dic_2[i])
    mean=np.average(dic_2[i])
    std=np.std(dic_2[i])
    maximum=np.max(dic_2[i])
    minimum=np.min(dic_2[i])
    df1=pd.DataFrame(np.random.normal(loc=mean,scale=std,size=len(dic_2[i])))
    ax=df.plot(kind='density', title='Returns Density Plot for '+ str(i),colormap='Reds_r')
    df1.plot(ax=ax,kind='density',colormap='Blues_r')
You can see in the picture, in the box at the top right, that the legend entries are shown as 0. How do I add something meaningful there?
print(df.head())
0
0 -0.019043
1 -0.0212065
2 0.0060413
3 0.0229895
4 -0.0189266
I think you may want to restructure the way you've created the graph. An easy way to do this is to create the ax before plotting:
# sample data
df = pd.DataFrame()
df['returns_a'] = [x for x in np.random.randn(100)]
df['returns_b'] = [x for x in np.random.randn(100)]
print(df.head())
returns_a returns_b
0 1.110042 -0.111122
1 -0.045298 -0.140299
2 -0.394844 1.011648
3 0.296254 -0.027588
4 0.603935 1.382290
fig, ax = plt.subplots()
I then created the dataframe using the parameters specified in your variables:
mean=np.average(df.returns_a)
std=np.std(df.returns_a)
maximum=np.max(df.returns_a)
minimum=np.min(df.returns_a)
pd.DataFrame(np.random.normal(loc=mean,scale=std,size=len(df.returns_a))).rename(columns={0: 'std_normal'}).plot(kind='density',colormap='Blues_r', ax=ax)
# select the column first; passing 'returns_a' positionally would use it as the x index instead
df[['returns_a']].plot(kind='density', ax=ax)
This second dataframe you're working with is created by default with column 0. You'll need to rename this.
I figured out a simpler way to do this. Just add column names to the dataframes.
for i in tickers:
    df = pd.DataFrame(dic_2[i],columns=['Empirical PDF'])
    print(df.head())
    mean=np.average(dic_2[i])
    std=np.std(dic_2[i])
    maximum=np.max(dic_2[i])
    minimum=np.min(dic_2[i])
    df1=pd.DataFrame(np.random.normal(loc=mean,scale=std,size=len(dic_2[i])),columns=['Normal PDF'])
    ax=df.plot(kind='density', title='Returns Density Plot for '+ str(i),colormap='Reds_r')
    df1.plot(ax=ax,kind='density',colormap='Blues_r')
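To check that the column names really end up in the legend, the sketch below reads the legend texts back from the axes. It uses randomly generated returns as a stand-in for dic_2[i] and kind='hist' instead of kind='density', because density plots need SciPy installed; both substitutions are assumptions made here, and the legend labelling works the same way. The Agg backend is likewise only for running without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend (assumption)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random returns standing in for dic_2[i] (assumption)
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.02, size=200)

# Naming the single column gives the legend entry its label
df = pd.DataFrame(returns, columns=["Empirical PDF"])
df1 = pd.DataFrame(
    rng.normal(returns.mean(), returns.std(), size=len(returns)),
    columns=["Normal PDF"],
)

# Histograms are used here instead of density plots to avoid the SciPy dependency
ax = df.plot(kind="hist", bins=30, alpha=0.5, title="Returns")
df1.plot(ax=ax, kind="hist", bins=30, alpha=0.5)

# Read the labels back from the legend on the shared axes
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
```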
