Filling in missing values with Pandas - python

Link: CSV with missing Values
I am trying to figure out the best way to fill in the 'region_cd' and 'model_cd' fields in my CSV file with Pandas. The 'RevenueProduced' field can tell you what the right value is for either missing field. My idea is to query my dataframe for all the rows that have the same 'region_cd' and 'RevenueProduced' and make all the 'model_cd' values match (and vice versa for the missing 'region_cd').
import io
import pandas as pd
import requests as r
# variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs3358/hw/hw2/'
file_1 = 'powergeneration.csv'
res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text), delimiter=',')
There are likely many ways to solve this, but I am just starting with Pandas and I am stumped, to say the least. Any help would be awesome.

Assuming that each RevenueProduced maps to exactly one region_cd and one model_cd, take a look at the pandas groupby function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
You could do the following:
# create mask to grab only regions with values
mask = df['region_cd'].notna()
# group by RevenueProduced, take the first non-null `region_cd` and reset the index
region_df = df[mask].groupby('RevenueProduced')["region_cd"].first().reset_index()
# checkout the built-in zip function to understand what's happening here
region_map = dict(zip(region_df.RevenueProduced, region_df.region_cd))
# store data in new column, although you could overwrite "region_cd"
df.loc[:, 'region_cd_NEW'] = df["RevenueProduced"].map(region_map)
You would do the exact same process with model_cd. I haven't run this code since at the time of writing this I don't have access to your csv, but I hope this helps.
Here is the documentation for .map series method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
(Keep in mind a series is just a column in a dataframe)
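Because .map replaces every value in the column, a common variation is to fill only the gaps with fillna. Here is a minimal sketch on made-up data (the values are invented; only the column names come from the question), still assuming that each RevenueProduced corresponds to a single region_cd and model_cd:

```python
import numpy as np
import pandas as pd

# Invented sample data; only the column names come from the question.
df = pd.DataFrame({
    "RevenueProduced": [100, 100, 200, 200, 300],
    "region_cd": ["A", np.nan, "B", "B", np.nan],
    "model_cd": [np.nan, "m1", "m2", np.nan, "m3"],
})

for col in ["region_cd", "model_cd"]:
    # First non-null value of `col` seen for each RevenueProduced.
    lookup = df.dropna(subset=[col]).groupby("RevenueProduced")[col].first()
    # Fill only the missing cells; existing values are left untouched.
    df[col] = df[col].fillna(df["RevenueProduced"].map(lookup))

print(df)
```

The last row keeps a missing region_cd because no other row with the same RevenueProduced supplies one, so it is worth checking for leftover gaps afterwards with df.isna().sum().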

Related

An efficient way to fill a dearpygui table using pandas

For now, I just turn each column into a list with df['Name'].to_list(), zip all the lists together with zip(list1, list2, ...), iterate over the result, and add each row to the table.
I would imagine this is far from an ideal solution. Is there anything better to fill the dearpygui table while using pandas?
I don't know much about your approach, but here is a generalized example of what I use:
import pandas as pd
import dearpygui.dearpygui as dpg
dataset = pd.read_csv(filename)  # Take your df from wherever
n = min(10, len(dataset))  # Number of rows to show (10 is arbitrary)
with dpg.table(label='DatasetTable'):
    for i in range(dataset.shape[1]):  # Generates the correct amount of columns
        dpg.add_table_column(label=dataset.columns[i])  # Adds the headers
    for i in range(n):  # Shows the first n rows
        with dpg.table_row():
            for j in range(dataset.shape[1]):
                # Displays the value of each row/column combination
                dpg.add_text(f"{dataset.iloc[i, j]}")
I hope it can be useful to someone.

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. To do this, I need to use the Python package sportsreference. I have two dataframes: one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to replace the date column with the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which should make it easier to do this with regex or .contains(). I keep getting a KeyError: 0 error. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i,"boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i,"date"] = box.loc[i,"boxscore_index"]
get_team_schedule("Virginia")
It looks like "box" and "df" are pandas DataFrames. Since you are iterating through all the rows, it may be more efficient to use iterrows instead of looking up each cell with ".loc". Note also that a plain Python string has no .contains method; use the "in" operator instead:
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
The ".at" accessor overwrites the value of a single cell.
Just FYI, iterrows is usually faster than repeated .loc lookups, but itertuples is typically faster still (often around 10x), and zipping the columns is faster again (often around 100x). The exact ratios depend on the data.
The KeyError: 0 is saying that box.loc[i, "boxscore_index"] cannot find a row labelled 0, because the index has no value 0 (the index values are the boxscore strings, for example '2020-12-22-14-virginia'). You could use .iloc instead, which looks rows up by position, like box.iloc[i]["boxscore_index"]. You would have to convert all the .loc calls to that.
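To make the difference between label-based and position-based lookup concrete, here is a small sketch. The frame is invented for illustration and only mimics an index made of boxscore strings:

```python
import pandas as pd

# Invented frame whose index holds string labels rather than 0, 1, 2, ...
box = pd.DataFrame(
    {"boxscore_index": ["2020-12-22-14-virginia", "2020-12-26-12-virginia"]},
    index=["2020-12-22-14-virginia", "2020-12-26-12-virginia"],
)

try:
    box.loc[0, "boxscore_index"]  # label lookup: there is no label 0
except KeyError as err:
    print("KeyError:", err)

# Position-based lookup works regardless of the index labels.
first = box.iloc[0]["boxscore_index"]
print(first)
```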
Like the other post said, though, I wouldn't go down that path. I actually wouldn't even use iterrows here. I would put the box_index into a list and iterate through that, then use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work, or is not what you are looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        # rows of df whose date equals this boxscore index
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data,'\n')
get_team_schedule("Virginia")
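If the goal is to replace each date with the link that starts with it, the matching can also be written without any index bookkeeping. This is only a sketch on invented data: find_link is a hypothetical helper, and it assumes the date column holds strings formatted exactly like the start of each link, as described in the question.

```python
import pandas as pd

# Invented stand-ins for the two objects in the question.
df = pd.DataFrame({"date": ["2020-12-22", "2020-12-26", "2021-01-02"]})
box = pd.Series(
    ["2020-12-22-14-virginia", "2020-12-26-12-virginia"],
    name="boxscore_index",
)

links = box.tolist()

def find_link(date):
    """Return the first link starting with `date`, or `date` itself if none match."""
    return next((link for link in links if link.startswith(date)), date)

# Apply the lookup to every date; unmatched dates are left as they were.
df["date"] = df["date"].map(find_link)
print(df)
```

This still compares every date against every link, so it is no faster in principle than the nested loop, but it cannot raise a KeyError because it never looks rows up by label.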

Creating multiples pandas dataframes as outputs of a function iteration over a list

I'm trying to use the function itis.hierarchy_full of the pytaxize package in order to retrieve information about a biological species from a specific Id.
The function takes only one value/ID and saves all the taxonomic information in a pandas dataframe that I can edit later.
import pandas as pd
from pytaxize import itis
test1 = itis.hierarchy_full(180530, as_dataframe = True)
I have something like 800 species Ids, and I want to automate the process to obtain 800 different dataframes.
I have put together a test with a small list (be aware, I am a biologist, so the code is really basic and maybe inefficient):
species = [180530, 48739, 567823]
tx = {}
for e in species:
    tx[e] = pd.DataFrame(itis.hierarchy_full(e, as_dataframe = True))
Now if I input tx (I'm using a Jupyter Notebook) I obtain a dictionary of pandas dataframes (I think it is a nested dictionary). And if I input tx[180530] I obtain exactly a single dataframe equal to the ones that I can create with the original function.
from pandas.testing import assert_frame_equal
assert_frame_equal(test_180530, sp_180530)
Now I can write something to save each result stored in dictionary as a separate dataframe:
sp_180530 = tx[180530]
sp_48739 = tx[48739]
sp_567823 = tx[567823]
Is there a way to automate the process and save each dataframe to an sp_id variable? Or, even better, is there a way to have the original loop where I create tx output multiple dataframes directly?
Not exactly what you asked, but to be able to elaborate a bit more on working with the dataframes in the dictionary... To work with the dictionary, loop over the dict and then use every contained dataframe one by one...
for key in tx.keys():
    df_temp = tx[key]
    # < do all your stuff to df_temp .....>
    # Save the dataframe as you want/need (I assume as csv for here)
    df_temp.to_csv(f'sp_{key}.csv')
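Keeping the dataframes in the dictionary is usually more convenient than creating 800 separate sp_... variables. The sketch below builds the dictionary in one expression and optionally stacks everything into a single frame. fetch_hierarchy is a hypothetical stand-in for itis.hierarchy_full(..., as_dataframe=True) so that the example runs offline, and its column names are invented:

```python
import pandas as pd

def fetch_hierarchy(species_id):
    """Hypothetical stand-in for itis.hierarchy_full(species_id, as_dataframe=True)."""
    return pd.DataFrame({"tsn": [species_id], "rankname": ["Species"]})

species = [180530, 48739, 567823]

# One dataframe per ID, keyed by the ID itself.
tx = {sp_id: fetch_hierarchy(sp_id) for sp_id in species}

# Optionally stack everything into one frame; the ID becomes an index level.
combined = pd.concat(tx, names=["species_id", "row"])
print(combined)

# Rows for a single species, much like tx[48739].
print(combined.loc[48739])
```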

How can I implement functions like mean, median and variance if I have a dictionary with 2 keys in Python?

I have many files in a folder that look like this one (the sample was posted as an image):
I'm trying to build a dictionary for the data. I'm interested in creating it with 2 keys: the first one is the HTTP address and the second is the third field (the plugin used, like adblock). The values refer to different metrics, so my intention is to compute the mean, median and variance of each metric for each site and plugin once the dictionary has been built. For example, for the mean, my intention is to consider all the values of the 4th field in the file, and so on. I tried to write this code (also posted as an image) but, first of all, I'm not sure that it is correct.
I read other posts but none of them solved my problem, since they either deal with only one key or don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming that the dictionary implementation is OK, how must I access the different values for key1: www.google.it -> key2: adblock?
Any kind of help is welcome, and I'm available to answer any questions.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around a tabular data structure called "DataFrame" that excels at column-wise and row-wise calculations such as the ones that you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() function. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations for the concatenated data, you can use the concat() function. This function takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following lines to create a DataFrame from all *.txt files in your directory:
import glob
df = pd.concat([pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
               ignore_index=True)
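The two-key dictionary from the question corresponds to grouping by two columns at once. Here is a minimal sketch on invented rows. The column names site, other, plugin and metric are assumptions chosen for readability; with header=None you would first assign names, for example via df.columns = [...]:

```python
import pandas as pd

# Invented rows in the layout described in the question.
df = pd.DataFrame(
    [
        ["www.google.it", "x", "adblock", 1.0],
        ["www.google.it", "x", "adblock", 3.0],
        ["www.google.it", "x", "ghostery", 4.0],
        ["www.example.com", "x", "adblock", 2.0],
        ["www.example.com", "x", "adblock", 6.0],
    ],
    columns=["site", "other", "plugin", "metric"],
)

# Mean, median and variance of the metric for every (site, plugin) pair.
stats = df.groupby(["site", "plugin"])["metric"].agg(["mean", "median", "var"])
print(stats)

# Look up one pair, much like dictionary[key1][key2].
print(stats.loc[("www.google.it", "adblock")])
```

Note that var() is the sample variance, so a group with a single row gets NaN rather than 0.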

plot the relationship between two variables with pandas

I am new to python but am aware about the usefulness of pandas, thus I would like to kindly ask if someone can help me to use pandas in order to address the below problem.
I have a dataset with buses, which looks like:
BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
The file is in .csv format and is separated by semicolons (;). Please note that I would like to plot the relationship between ‘Engine hours’ and ‘Fuel consumption’ by 'calculating the mean value of both for each day' based on the UtcTime. Moreover, I would like to plot graphs for all the buses in the dataset (not only 0001 but also 0002, 0003 etc.). How can I do that with a simple loop?
Start with the following in interactive mode:
import pandas as pd
df = pd.read_csv('bus.csv', sep=";", parse_dates=['UtcTime'])
You should be able to start playing around with the DataFrame and discovering functions you can use directly with the data. To get the rows for one bus by ID, just do:
>>> bus1 = df[df.BusID == 1]
>>> bus1
Substitute 1 with the ID of the bus you require. This will return you a sub-DataFrame. To get BusID 1 and just their engine hours do:
>>> bus1[bus1.ModeName == "Engine hours"]
You can quickly get statistics of columns by doing
>>> bus1.Value.describe()
Once you have selected the data you need, you can start plotting:
>>> import matplotlib.pyplot as plt
>>> bus1[bus1.ModeName == "Engine hours"].plot(x="UtcTime", y="Value")
>>> bus1[bus1.ModeName == "Fuel consumption"].plot(x="UtcTime", y="Value")
>>> plt.show()
There is more explanation in the docs. Please refer to http://pandas.pydata.org/pandas-docs/stable/.
If you really want to use pandas, remember one rule of thumb: avoid explicit loops wherever you can. Loops over rows scale poorly, so try to use built-in functions. First let's read your dataframe:
import pandas as pd
data = pd.read_csv('bus.csv',sep = ';')
Here is the weak point of my answer: I don't know how to handle dates efficiently. Create a column named day that contains the day from UtcTime. (I would use an apply method like data['day'] = data['UtcTime'].apply(lambda x: x[:10]), but that is a hidden loop, so it is not ideal. Since UtcTime is read as plain strings here, data['UtcTime'].str[:10] gives the same result without apply.)
Then to take only the data of a single bus, try a slicing method:
data_bus1 = data[data.BusID == 1]
Finally use the groupby function:
data_bus1[['ModeName','Value','day']].groupby(['ModeName','day'],as_index = False).mean()
Or if you don't need to separate your buses into different dataframes, you can use the groupby on the whole data:
data[['BusID','ModeName','Value','day']].groupby(['BusID','ModeName','day'],as_index = False).mean()
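For the date handling, one option is to parse UtcTime as datetimes and use the .dt accessor, which is vectorized. The sketch below runs on the sample rows from the question and reshapes the daily means so that each mode becomes its own column. Treat it as one possible approach, not the only one:

```python
import io
import pandas as pd

# The sample rows from the question.
raw = """BusModel;BusID;ModeName;Value;Unit;UtcTime
Alpha;0001;Engine hours;985;h;2016-06-22 19:58:09.000
Alpha;0001;Engine hours;987;h;2016-06-22 21:58:09.000
Alpha;0001;Engine hours;989;h;2016-06-22 23:59:09.000
Alpha;0001;Fuel consumption;78;l;2016-06-22 19:58:09.000
Alpha;0001;Fuel consumption;88;l;2016-06-22 21:58:09.000
Alpha;0001;Fuel consumption;98;l;2016-06-22 23:59:09.000
"""
data = pd.read_csv(io.StringIO(raw), sep=";", parse_dates=["UtcTime"])

# Vectorized day extraction via the .dt accessor (no apply needed).
data["day"] = data["UtcTime"].dt.date

# Daily mean per bus and mode, then one column per mode.
daily = (
    data.groupby(["BusID", "day", "ModeName"])["Value"]
    .mean()
    .unstack("ModeName")
)
print(daily)
```

With one row per bus and day, plotting every bus becomes a short loop over daily.groupby(level="BusID"), calling something like frame.plot(x="Engine hours", y="Fuel consumption", kind="scatter") on each group (this needs matplotlib installed).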
