plot the relationship between two variables with pandas - python

I am new to python but am aware about the usefulness of pandas, thus I would like to kindly ask if someone can help me to use pandas in order to address the below problem.
I have a dataset with buses, which looks like:
BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
The file is .csv format and is separated by semicolon (;). Please note that I would like to plot the relationship between ‘Engine hours’ and ‘Fuel consumption’ by 'calculating the mean value of both for each day' based on the UtcTime. Moreover, I would like to plot graphs for all the busses in the dataset (not only 0001 but also 0002, 0003 etc.). How I can do that with simple loop?

Start with the following interactive mode
import pandas as pd
df = pd.read_csv('bus.csv', sep=";", parse_dates=['UtcTime'])
You should be able to start playing around with the DataFrame and discovering functions you can directly use with the data. To get a list of buses by ID just do:
>>> bus1 = df[df.BusID == 1]
>>> bus1
Substitute 1 with the ID of the bus you require. This will return you a sub-DataFrame. To get BusID 1 and just their engine hours do:
>>> bus1[bus1.ModeName == "Engine hours"]
You can quickly get statistics of columns by doing
>>> bus1.Value.describe()
Once you grouped the data you need you can start plotting:
>>> bus1[bus1.ModeName == "Engine hours"].plot()
>>> bus1[bus1.ModeName == "Fuel consumption"].plot()
>>> plt.show()
There is more explanation on the docs. Please refer to http://pandas.pydata.org/pandas-docs/stable/.

If you really want to use pandas, remember this simple thing: never use a loop. Loops aren't scalable, so try to use built-in functions. First let's read your dataframe:
import pandas as pd
data = pd.read_csv('bus.csv',sep = ';')
Here is the weak point of my answer, I don't know how to manage dates efficently. So create a column named day which contains the day from UtcTime (I would use an apply methode like this data['day'] = data['UtcTime'].apply(lambda x: x[:10]) but it's a hidden loop so don't do that!)
Then to take only the data of a single bus, try a slicing method:
data_bus1 = data[data.BusID == 1]
Finally use the groupby function:
data_bus1[['Modename','Value','day']].groupby(['ModeName','day'],as_index = False).mean()
Or if you don't need to separate your busses in different dataframes, you can use the groupby on the whole data:
data[['BusID','ModeName','Value','day']].groupby(['BusID','ModeName','day'],as_index = False).mean()

Related

dynamically create and name DFs using a function?

Suppose you have a survey, and you want to calculate the Net Promoter Score (NPS) of different cuts of respondents. My data may look something like this:
import pandas as pd
data = [[1,111,1,1,35,'F','UK','High'], [1,112,0,1,42,'F','Saudi Arabia','Low'], [1,113,1,1,17,'M','Belize','High'],[1,1234,1,1,35,'F','Saudi Arabia','High'],[2,1854,1,1,35,'M','Belize','Low'],[2,1445,1,1,35,'F','UK','Low']]
df = pd.DataFrame(data, columns = ['survey_num','id_num','nps_sum','nps_count','age','gender','country','income_level'])
df
I want to be able to write a function that cycles through this data and does the following each time:
col_list = ['survey_num','nps_sum','nps_count']
df_customname = df_customname1[col_list]
df_customname = df_customname.groupby('survey_num').sum()
df_customname['nps_customname'] = (df_customname['nps_sum'] / df_customname['nps_count'])*100
df_customname = df_customname.sort_values(by=['survey_num'],ascending=True)
df_customname= pd.DataFrame(df_customname.drop(['nps_sum','nps_count'], axis=1))
df_customname
The reason I need this to be dynamic is because I need to repeat this process for different cuts of data. For example, I want to be able to filter for gender = F AND country = Saudi Arabia, for example. Or Just Gender = M. Or just income = High. I then want to do a left join of that to the original df that is currently called customname (this would be my base case, so it may just be called 'all'
So the final table after running the function a few times, defining my cuts each time, my final output will look like this:
data = [[1,66.67,83.5,22.5,47.7,74.1],[2,75.67,23.5,24.5,76.7,91.1]]
df_final = pd.DataFrame(data, columns = ['survey_num','nps_all','nps_saudi_f','nps_m','nps_high','nps_40plus'])
df_final
Note there may be better ways to run this, but I'm looking for the quickest/simplest possible way that is as close as this as possible. I don't yet know what my cuts will be, but there are likely to be a lot of them, so the easier it is to just define those outside the function and have the function run that code, then left join to the original df, the better.
Thank you!

assigning list of strings as name for dataframe

I have searched and searched and not found what I would think was a common question. Which makes me think I'm going about this wrong. So I humbly ask these two versions of the same question.
I have a list of currency names, as strings. A short version would look like this:
col_names = ['australian_dollar', 'bulgarian_lev', 'brazilian_real']
I also have a list of dataframes (df_list). Each one is has a column for data, currency exchange rate, etc. Here's the head for one of them (sorry it's blurry, it was fine bigger but I stuck in an m in the URL because it was huge):
I would be stoked to assign each one of those strings col_list as a variable name for a data frame in df_list. I did make a dictionary where key/value was currency name and the corresponding df. But I didn't really know how to use it, primarily because it was unordered. Is there a way to zip col_list and df_list together? I could also just unpack each df in df_list and use the title of the second column be the title of the frame. That seems really cool.
So instead I just wrote something that gave me index numbers and then hand put them into the function I needed. Super kludgy but I want to make the overall project work for now. I end up with this in my figure code:
for ax, currency in zip((ax1, ax2, ax3, ax4), (df_list[38], df_list[19], df_list[10], df_list[0])):
ax.plot(currency["date"], currency["rolling_mean_30"])
And that's OK. I'm learning, not delivering something to a client. I can use it to make eight line plots. But I want to do this with 40 frames so I can get the annual or monthly volatility. I have to take a list of data frames and unpack them by hand.
Here is the second version of my question. Take df_list and:
def framer(currency):
index = col_names.index(currency)
df = df_list[index] # this is a dataframe containing a single currency and the columns built in cell 3
return df
brazilian_real = framer("brazilian_real")
Which unpacks the a df (but only if type out the name) and then:
def volatizer(currency):
all_the_years = [currency[currency['year'] == y] for y in currency['year'].unique()] # list of dataframes for each year
c_name = currency.columns[1]
df_dict = {}
for frame in all_the_years:
year_name = frame.iat[0,4] # the year for each df, becomes the "year" cell for annual volatility df
annual_volatility = frame["log_rate"].std()*253**.5 # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
df_dict[year_name] = annual_volatility
df = pd.DataFrame.from_dict(df_dict, orient="index", columns=[c_name+"_annual_vol"]) # indexing on year, not sure if this is cool
return df
br_vol = volatizer(brazilian_real)
which returns a df with a row for each year and annual volatility. Then I want to concatenate them and use that for more charts. Ultimately make a little dashboard that lets you switch between weekly, monthly, annual and maybe set date lims.
So maybe there's some cool way to run those functions on the original df or on the lists of dfs that I don't know about. I have started using df.map and df.apply some.
But it seems to me it would be pretty handy to be able to unpack the one list using the names from the other. Basically same question, how do I get the dataframes in df_list out and attached to variable names?
Sorry if this is waaaay too long or a really bad way to do this. Thanks ahead of time!
Do you want something like this?
dfs = {df.columns[1]: df for df in df_list}
Then you can reference them like this for example:
dfs['brazilian_real']
This is how I took the approach suggested by Kelvin:
def volatizer(currency):
annual_df_list = [currency[currency['year'] == y] for y in currency['year'].unique()] # list of annual dfs
c_name = currency.columns[1]
row_dict = {} # dictionary with year:annual_volatility as key:value
for frame in annual_df_list:
year_name = frame.iat[0,4] # first cell of the "year" column, becomes the "year" key for row_dict
annual_volatility = frame["log_rate"].std()*253**.5 # volatility measured by standard deviation * 253 trading days per year raised to the 0.5 power
row_dict[year_name] = annual_volatility # dictionary with year:annual_volatility as key:value
df = pd.DataFrame.from_dict(row_dict, orient="index", columns=[c_name+"_annual_vol"]) # new df from dictionary indexing on year
return df
# apply volatizer to each currency df
for key in df_dict:
df_dict[key] = volatizer(df_dict[key])
It worked fine. I can use a list of strings to access any of the key:value pairs. It feels like a better way than trying to instantiate a bunch of new objects.

Filling in missing values with Pandas

Link: CSV with missing Values
I am trying to figure out the best way to fill in the 'region_cd' and 'model_cd' fields in my CSV file with Pandas. The 'RevenueProduced' field can tell you what the right value is for either missing fields. My idea is to make some query in my dataframe that looks for all the fields that have the same 'region_cd' and 'RevenueProduced' and make all the 'model_cd' match (vice versa for the missing 'region_cd').
import pandas as pd
import requests as r
#variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs3358/hw/hw2/'
file_1 = 'powergeneration.csv'
res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text), delimiter=',')
There is likely many ways to solve this but I am just starting Pandas and I am stumped to say the least. Any help would be awesome.
Assuming that each RevenueProduced maps to exactly one region_cd and one model_cd.
Take a look at the groupby pandas function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
You could do the following:
# create mask to grab only regions with values
mask = df['region_cd'].notna()
# group by region, collect the first `RevenueProduced` and reset the index
region_df = df[mask].groupby('RevenueProduced')["region_cd"].first().reset_index()
# checkout the built-in zip function to understand what's happening here
region_map = dict(zip(region_df.RevenueProduced, region_df.region_cd))
# store data in new column, although you could overwrite "region_cd"
df.loc[:, 'region_cd_NEW'] = df["RevenueProduced"].map(region_map)
You would do the exact same process with model_cd. I haven't run this code since at the time of writing this I don't have access to your csv, but I hope this helps.
Here is the documentation for .map series method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
(Keep in mind a series is just a column in a dataframe)

Creating multiples pandas dataframes as outputs of a function iteration over a list

I'm trying to use the function itis.hierarchy_full of the pytaxize package in order to retrieve information about a biological species from a specific Id.
The function takes only one values/Id and save all the taxonomic information inside a pandas dataframe that I can edit later.
import pandas as pd
from pytaxize import itis
test1 = itis.hierarchy_full(180530, as_dataframe = True)
I have something like 800 species Ids, and I want to automate the process to obtain 800 different dataframes.
I have somehow created a test with a small list (be aware, I am a biologist so the code is really basic and maybe inefficient:
species = [180530, 48739, 567823]
tx = {}
for e in species2:
tx[e] = pd.DataFrame(itis.hierarchy_full(e, as_dataframe = True))
Now if I input tx (I'm using a Jupyter Notebook) I obtain a dictionary of pandas dataframes (I think it is a nested dictionary). And if I input tx[180530] I obtain exactly a single dataframe equal to the ones that I can create with the original function.
from pandas.testing import assert_frame_equal
assert_frame_equal(test_180530, sp_180530)
Now I can write something to save each result stored in dictionary as a separate dataframe:
sp_180530 = tx[180530]
sp_48739 = tx[48739]
sp_567823 = tx[567823]
There is a way to automate the process and save each dataframe to a sp_id? Or even better, there is a way to include in the original function where I create tx, to output directly multiple dataframes?
Not exactly what you asked, but to be able to elaborate a bit more on working with the dataframes in the dictionary... To work with the dictionary, loop over the dict and then use every contained dataframe one by one...
for key in tx.keys():
df_temp = tx[key]
# < do all your stuff to df_temp .....>
# Save the dataframe as you want/need (I assume as csv for here)
df_temp.to_csv(f'sp_{key}.csv')

How do I find all rows with a certain date using Pandas?

I have a simple Pandas DataFrame containing columns 'valid_time' and 'value'. The frequency of the sampling is roughly hourly, but irregular and with some large gaps. I want to be able to efficiently pull out all rows for a given day (i.e. within a calender day). How can I do this using DataFrame.where() or something else?
I naively want to do something like this (which obviously doesn't work):
dt = datetime.datetime(<someday>)
rows = data.where( data['valid_time'].year == dt.year and
data['valid_time'].day == dt.day and
data['valid_time'].month == dt.month)
There's at least a few problems with the above code. I am new to pandas so am fumbling with something that is probably straightforward.
Pandas is absolutely terrific for things like this. I would recommend making your datetime field your index as can be seen here. If you give a little bit more information about the structure of your dataframe, I would be happy to include more detailed directions.
Then, you can easily grab all rows from a date using df['1-12-2014'] which would grab everything from Jan 12, 2014. You can edit that to get everything from January by using df[1-2014]. If you want to grab data from a range of dates and/or times, you can do something like:
df['1-2014':'2-2014']
Pandas is pretty powerful, especially for time-indexed data.
Try this (is just like the continuation of your idea):
import pandas as pd
import numpy.random as rd
import datetime
times = pd.date_range('2014/01/01','2014/01/6',freq='H')
values = rd.random_integers(0,10,times.size)
data = pd.DataFrame({'valid_time':times, 'values': values})
dt = datetime.datetime(2014,1,3)
rows = data['valid_time'].apply(
lambda x: x.year == dt.year and x.month==dt.month and x.day== dt.day
)
print data[rows]

Categories