I have a dataset with many columns, and I want to summarize the total number of rows for each country. I grouped the data by country, applied the count function, and plotted the result, but the plot shows every column for each country. I want a single summary value per country shown on the bar chart, as one bar, line, or dot.
I want something like the R function summarize(Total = n()).
This is my approach in Python:
newData = myData.groupby('Country').count()
newData.plot(kind='bar', figsize=(15, 10))
Try the following for a basic row count per country:
newData = myData.groupby('Country').size()  # one row count per country, regardless of the number of columns
newData.plot(kind='bar', figsize=(15, 10))
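As a minimal sketch of the same idea, assuming a small made-up dataframe (the City and Sales columns below are illustrative and not from the question), size() returns a single row count per country, which is roughly what summarize(Total = n()) does in R:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical data standing in for myData
myData = pd.DataFrame({
    "Country": ["USA", "USA", "IND", "UK", "UK", "UK"],
    "City": ["NYC", "LA", "Delhi", "London", "Leeds", "York"],
    "Sales": [10, 13, 7, 8, 2, 10],
})

# One number per country (a Series), no matter how many columns the frame has
totals = myData.groupby("Country").size().rename("Total")

# A Series plots as one bar per country
ax = totals.plot(kind="bar", figsize=(15, 10))
ax.set_ylabel("Total rows")
```

Because the result is a Series rather than a dataframe, the bar chart has exactly one bar per country.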
I have a dataframe, where I would like to make a time series plot with three different lines that each show the daily occurrences (the number of rows per day) for each of the values in another column.
To give an example, for the following dataframe, I would like to see the development for how many a's, b's and c's there have been each day.
df = pd.DataFrame({'date':pd.to_datetime(['2019-10-10','2019-10-14','2019-10-09','2019-10-10','2019-10-08','2019-10-14','2019-10-10','2019-10-08','2019-10-08','2019-10-13','2019-10-08','2019-10-12','2019-10-11','2019-10-09','2019-10-08']),
'letter':['a','b','c','a','b','b','b','b','c','b','b','a','b','a','c']})
When I try the command below (my best guess so far), however, it does not give me what I want: I would like three lines, one for each of the letters.
Any ideas on how to solve this?
df.groupby(['date']).count().plot()['letter']
I have also tried a solution in Matplotlib, but this one gives an error:
fig, ax = plt.subplots()
ax.plot(df['date'], df['letter'].count())
Based on your question, I believe you are looking for a line plot with dates on the X-axis and the counts of each letter on the Y-axis. To achieve this, follow these steps:
Group the dataframe by date and then by letter, and get the number of rows in each group using size().
Flatten the grouped dataframe using reset_index(), rename the new column to Counts, and sort by the letter column (so that the legend is in alphabetical order). These steps mostly keep the new dataframe and graph clean and presentable. I would suggest running each step separately and printing the result, so that you can see what each one does.
Plot one line per letter by filtering the dataframe for each specific letter.
Show the legend and rotate the date labels for better visibility.
The code is shown below:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'date':pd.to_datetime(['2019-10-10','2019-10-14','2019-10-09','2019-10-10','2019-10-08','2019-10-14','2019-10-10','2019-10-08','2019-10-08','2019-10-13','2019-10-08','2019-10-12','2019-10-11','2019-10-09','2019-10-08']),
                   'letter':['a','b','c','a','b','b','b','b','c','b','b','a','b','a','c']})
df_grouped = df.groupby(by=['date', 'letter']).size().reset_index() ## New DF for grouped data
df_grouped.rename(columns = {0 : 'Counts'}, inplace = True)
df_grouped.sort_values(['letter', 'date'], inplace=True) ## Sort by date within each letter so lines are drawn in time order
colors = ['r', 'g', 'b'] ## One color per letter, change as per your preference
for i, ltr in enumerate(df_grouped.letter.unique()):
    subset = df_grouped[df_grouped.letter == ltr] ## Rows for this letter only
    plt.plot(subset.date, subset.Counts, '-o', label=ltr, c=colors[i])
plt.gcf().autofmt_xdate() ## Rotate X-axis labels so you can see dates clearly without overlap
plt.legend() ## Show legend
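A shorter variant, sketched here as an alternative rather than as the answer's method, is to pivot the letters into columns with unstack() and let pandas draw one line per column. Note one difference: with fill_value=0, days on which a letter does not occur are plotted as 0 instead of being skipped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-10-10", "2019-10-14", "2019-10-09", "2019-10-10", "2019-10-08",
        "2019-10-14", "2019-10-10", "2019-10-08", "2019-10-08", "2019-10-13",
        "2019-10-08", "2019-10-12", "2019-10-11", "2019-10-09", "2019-10-08",
    ]),
    "letter": ["a", "b", "c", "a", "b", "b", "b", "b", "c", "b", "b", "a", "b", "a", "c"],
})

# Rows are dates, columns are letters, values are the number of rows per (date, letter)
counts = df.groupby(["date", "letter"]).size().unstack(fill_value=0)

# One line per column, with the legend taken from the column names
ax = counts.plot(marker="o")
ax.set_ylabel("Counts")
ax.figure.autofmt_xdate()
```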
Output graph
I have searched and searched and not found what I would think was a common question. Which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one has a column for data, currency exchange rate, etc. Here's the head for one of them (sorry it's blurry, it was fine bigger but I stuck in an m in the URL because it was huge):
I would be stoked to assign each one of those strings in col_names as a variable name for a dataframe in df_list. I did make a dictionary where each key/value pair was a currency name and the corresponding df. But I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_names and df_list together? I could also just unpack each df in df_list and use the title of its second column as the name of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then put them by hand into the function I needed. Super kludgy, but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
    ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client. I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility. I have to take a list of data frames and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
    index = col_names.index(currency)
    df = df_list[index] # this is a dataframe containing a single currency and the columns built in cell 3
    return df
brazilian_real = framer("brazilian_real")
This unpacks a df (but only if I type out the name), and then:
def volatizer(currency):
    all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()] # list of dataframes for each year
    c_name = currency.columns[1]
    df_dict = {}
    for frame in all_the_years:
        year_name = frame.iat[0,4] # the year for each df, becomes the "year" cell for annual volatility df
        annual_volatility = frame["log_rate"].std()*253**.5 # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        df_dict[year_name] = annual_volatility
    df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name+"_annual_vol"]) # indexing on year, not sure if this is cool
    return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and annual volatility. Then I want to concatenate them and use that for more charts. Ultimately make a little dashboard that lets you switch between weekly, monthly, annual and maybe set date lims.
So maybe there's some cool way to run those functions on the original df or on the lists of dfs that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically same question, how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
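If the frames do not carry the currency name in their second column, the two lists can also be zipped directly, as the question asks. This is a minimal sketch, assuming col_names and df_list are in the same order; the tiny frames below are made-up stand-ins for the real data. Since Python 3.7, dictionaries keep insertion order, so the result is not unordered.

```python
import pandas as pd

col_names = ["australian_dollar", "bulgarian_lev", "brazilian_real"]

# Hypothetical stand-ins for the real frames in df_list
df_list = [
    pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-01-02"]), name: [1.0, 1.1]})
    for name in col_names
]

# Pair each name with the frame at the same position
dfs = dict(zip(col_names, df_list))

# Look up a single frame by name
brazilian_real = dfs["brazilian_real"]

# Or loop over all of them without typing any names
for name, frame in dfs.items():
    print(name, frame.shape)
```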
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
    annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()] # list of annual dfs
    c_name = currency.columns[1]
    row_dict = {} # dictionary with year:annual_volatility as key:value
    for frame in annual_df_list:
        year_name = frame.iat[0,4] # first cell of the "year" column, becomes the "year" key for row_dict
        annual_volatility = frame["log_rate"].std()*253**.5 # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
        row_dict[year_name] = annual_volatility # store this year's volatility under its year
    df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name+"_annual_vol"]) # new df from dictionary indexing on year
    return df
# apply volatizer to each currency df
# (df_dict is the dictionary of currency name -> dataframe, built as in the answer above)
for key in df_dict:
    df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.
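For the concatenation step mentioned in the question, here is a hedged sketch that swaps the manual per-year loop for a groupby on the year column and then joins the results side by side. The helper name annual_volatility and the random log rates are assumptions for illustration only; the year and log_rate column names and the 253-day factor come from the code above.

```python
import numpy as np
import pandas as pd

def annual_volatility(frame, name):
    """Annualized standard deviation of log_rate per year, as a one-column dataframe."""
    vol = frame.groupby("year")["log_rate"].std() * 253 ** 0.5
    return vol.rename(name + "_annual_vol").to_frame()

# Hypothetical data: two currencies, two years each, with made-up log rates
rng = np.random.default_rng(0)
df_dict = {
    name: pd.DataFrame({
        "year": [2019] * 5 + [2020] * 5,
        "log_rate": rng.normal(0, 0.01, 10),
    })
    for name in ["brazilian_real", "bulgarian_lev"]
}

# One small volatility frame per currency, keyed by name
vol_by_currency = {name: annual_volatility(frame, name) for name, frame in df_dict.items()}

# Join on the year index: one row per year, one column per currency
all_vol = pd.concat(list(vol_by_currency.values()), axis=1)
```

The combined frame can then be plotted directly or filtered by year for a dashboard.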
I have a program that forecasts individual stock data. It's very simple and straightforward. The user needs to select one stock and the range of data.
I'm ready to take it to the next level by allowing this application to create individual forecasts for multiple stocks in one sitting by passing a list of stock symbols to my model. For example, instead of running this program 20 times for 20 different stocks, it would only have to run once for 20 individual stocks. Before, I could only use this application for one stock at a time.
Let's look at where I currently am. I have already made a dummy list of stocks in tickers and started a loop which turned into poorly designed data frames and dictionaries.
import yfinance as yf
# stock symbols
tickers = ["LRCX", "FB", "COF"]
# mm-dd-yyyy format
start_date = "01-01-2014"
end_date = "11-23-2019"
stocks = {}
for symbol in tickers:
    stock_info = pdr.get_data_yahoo(tickers, start=start_date,end=end_date)
    stock_info['date'] = stock_info.index
    key_name = 'df_' + symbol
    stock_info.drop(['Open', 'High', 'Low','Volume'], axis=1)
    stock_info.rename(columns={'Close': 'y', 'date': 'ds'}, inplace=True)
    stocks[key_name] = stock_info
This is the current data frame that the code above produced:
https://i.stack.imgur.com/Yh1F9.png I call it with stocks[key_name]. However, this is not the dataframe I had in mind. I want to have a loop that creates individual dataframes for each stock above in my list of tickers. Then process the individual dataframes by dropping and renaming each necessary column. In this case, finalizing dataframes with only y and ds columns for each stock.
An example of a dataframe I wanted to create for stocks in my list, one df per stock
Once that is settled, I would like to create loops that pass these dataframes into my model and plot the data.
The method below did not work for me because I'm using a dictionary and it got overly complicated. I also found out that I need to pass dataframes to .fit() when using Prophet() (it's a forecasting model developed by Facebook). I would need to loop through each dataframe created and fit them individually, as below.
for k, v in stocks.items():
    m = Prophet()
    m.fit(stocks)
Below is what I have in mind for plotting each dataframe and their respective columns of data. It might help you understand this workflow better. I'm assuming that it's very easy to loop over a list for plotting, but I'm also struggling with that. Would I need to automate the size of the subplots as well, in case I want to try out 30 stocks? These are just some of the questions I keep running into.
for stock in list_of_df:
    # First Subplot
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14,5))
    ax1.plot(stock_info["date"], stock_info["Close"])
    ax1.set_xlabel("Date", fontsize=12)
    ax1.set_ylabel("Stock Price")
    ax1.set_title(f"{ticker} Close Price History")
    # Second Subplot
    ax1.plot(stock_info["date"], stock_info["High"], color="green")
    ax1.set_xlabel("Date", fontsize=12)
    ax1.set_ylabel("Stock Price")
    ax1.set_title(f"{ticker} High Price History")
    # Third Subplot
    ax1.plot(stock_info["date"], stock_info["Low"], color="red")
    ax1.set_xlabel("Date", fontsize=12)
    ax1.set_ylabel("Stock Price")
    ax1.set_title(f"{ticker} Low Price History")
    # Fourth Subplot
    ax2.plot(stock_info["date"], stock_info["Volume"], color="orange")
    ax2.set_xlabel("Date", fontsize=12)
    ax2.set_ylabel("Stock Price")
    ax2.set_title(f"{ticker} Volume History")
    plt.show()
I would greatly appreciate some guidance here from any dataframe and looping expert. Streamlining this workflow has turned out to be a lot more difficult than I thought, but essentially I'm trying to make a loop or a function that works for creating any number of dataframes at once given the proper data.
# Plot every non-date column against the date, one figure per column
cols = [c for c in stock_info.columns if "date" not in c]
for col in cols:
    f, ax1 = plt.subplots(figsize=(14, 5))
    ax1.plot(stock_info["date"], stock_info[col])
    ax1.set_xlabel("Date", fontsize=12)
    ax1.set_ylabel(col)
    ax1.set_title(f"{ticker} {col} History")
    plt.show()
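For the first part of the question (one dataframe per ticker with only ds and y columns), here is a rough sketch of one possible structure. To stay runnable offline it uses a hypothetical helper, fake_prices, that fabricates placeholder prices; in real use that call would be replaced by a per-symbol download, passing the single symbol rather than the whole tickers list. The helper to_prophet_frame is also a made-up name.

```python
import numpy as np
import pandas as pd

tickers = ["LRCX", "FB", "COF"]

def fake_prices(symbol, periods=5):
    """Hypothetical stand-in for a per-symbol download; returns OHLCV-like placeholder data."""
    rng = np.random.default_rng(len(symbol))
    idx = pd.date_range("2014-01-01", periods=periods, freq="D", name="Date")
    close = 100 + rng.normal(0, 1, periods).cumsum()
    return pd.DataFrame(
        {"Open": close, "High": close + 1, "Low": close - 1, "Close": close, "Volume": 1000},
        index=idx,
    )

def to_prophet_frame(raw):
    """Keep only the date index and closing price, renamed to ds and y."""
    out = raw[["Close"]].reset_index()
    out.columns = ["ds", "y"]
    return out

# One cleaned dataframe per ticker, keyed by symbol
stocks = {symbol: to_prophet_frame(fake_prices(symbol)) for symbol in tickers}

# With Prophet installed, each frame (not the whole dict) would be fitted in turn:
# for symbol, frame in stocks.items():
#     m = Prophet()
#     m.fit(frame)
```

Keeping the frames in a dictionary keyed by symbol means the same loop works for 3 or 30 tickers.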
I need to plot a pie chart of frequencies from a column of a dataframe, but many low-frequency values appear and the visualization is poor.
The code I wrote is:
df[column].value_counts(normalize=True).plot(kind="pie")
I know that df[column].value_counts(normalize=True) will give me the share of every unique value, but I want to apply the filter share > 0.05.
What I tried:
new_df = df[column].value_counts(normalize=True)
but this gives me column as index, so I reset the index
new_df = new_df.reset_index()
and then tried
new_df.plot(kind = "pie")
but nothing appears.
I want a one-liner that does something like:
df[column].value_counts(normalize=True).plot(kind="pie" if value_counts > 0.05)
Try this:
df['column'].value_counts()[df['column'].value_counts(normalize=True)>0.05].plot(kind='pie')
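A variant that computes the shares only once is sketched below, assuming a made-up column of 100 values. It filters the normalized counts with .loc and a lambda, so it can also be written as a single chained expression. Unlike the line above, it plots the shares rather than the raw counts; the pie looks the same because the slices are rescaled either way.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical data: shares are a=0.50, b=0.30, c=0.17, d=0.02, e=0.01
df = pd.DataFrame({"column": ["a"] * 50 + ["b"] * 30 + ["c"] * 17 + ["d"] * 2 + ["e"]})

shares = df["column"].value_counts(normalize=True)

# Keep only the values whose share exceeds 5%
big = shares.loc[lambda s: s > 0.05]

ax = big.plot(kind="pie")
```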
I have a dataframe with three columns. The first one is the country, the second one is the number of days, and the third one is a count column. A sample would look like this:
import pandas as pd
df = pd.DataFrame({'Country':['USA','USA','IND','UK','UK','UK'],
'Days':[4,5,6,8,9,4],
'Count': [10,13,7,8,2,10]})
I want to plot Days on the X-axis and Count on the Y-axis for each country (a line plot), but I want the graphs to be in one frame, much like a pair plot. Is there a way to achieve this? Also, I am not sure how to filter the dataframe and plot the filtered object, as I want one graph per country.
I want something along these lines, where for the USA it would look like this:
Days = [4,5]
Count = [10,13]
plt.plot(Days, Count, color='green')
plt.xlabel('Days')
plt.ylabel('Count')
plt.title('Days vs count for USA')
plt.show()
But I want it for every country in a separate plot, all in one frame like a pair plot.
Any help would be useful. Thanks!
There are probably better built-in methods for this, but I would use:
for country in df['Country'].unique():
    df[df['Country']==country].sort_values('Days').plot.line(x='Days',
                                                             y='Count',
                                                             title=country)
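The loop above opens a separate figure for each country. To get all countries in one frame, one option (a sketch, not the only way) is to create a row of subplots with plt.subplots and draw each country's group on its own axes:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "USA", "IND", "UK", "UK", "UK"],
                   "Days": [4, 5, 6, 8, 9, 4],
                   "Count": [10, 13, 7, 8, 2, 10]})

# Sort by Days first so each line is drawn left to right
groups = df.sort_values("Days").groupby("Country")

# One row of axes, one panel per country, sharing the Y-axis for comparison
fig, axes = plt.subplots(1, groups.ngroups, figsize=(4 * groups.ngroups, 3),
                         sharey=True, squeeze=False)

for ax, (country, sub) in zip(axes[0], groups):
    ax.plot(sub["Days"], sub["Count"], marker="o")
    ax.set_title(f"Days vs Count for {country}")
    ax.set_xlabel("Days")

axes[0][0].set_ylabel("Count")
fig.tight_layout()
```

With many countries, the same loop works on a grid by passing several rows to plt.subplots and iterating over axes.flat.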