For now, I convert each column into a list with df['Name'].to_list(), zip all the lists together with zip(list1, list2, ...), iterate over the result, and add the values to the table.
I would imagine this is far from an ideal solution. Is there anything better to fill the dearpygui table while using pandas?
I don't know much about your approach, but here is a generalized example of what I use:
import dearpygui.dearpygui as dpg
import pandas as pd

dataset = pd.read_csv(filename)  # Take your df from wherever
n = 10  # Number of rows to show (example value)
with dpg.table(label='DatasetTable'):
    for i in range(dataset.shape[1]):  # Generates the correct number of columns
        dpg.add_table_column(label=dataset.columns[i])  # Adds the headers
    for i in range(min(n, dataset.shape[0])):  # Shows the first n rows
        with dpg.table_row():
            for j in range(dataset.shape[1]):
                # Displays the value of each row/column combination
                dpg.add_text(f"{dataset.iloc[i, j]}")
I hope it can be useful to someone.
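As a possible alternative to indexing with iloc cell by cell, the rows can be pulled out once with itertuples and converted to strings before they reach the GUI. This is only a sketch: table_cells is a hypothetical helper name, the sample frame is made up, and the Dear PyGui calls are left out so the snippet runs without a GUI context. Each string in rows would be passed to dpg.add_text inside a dpg.table_row().

```python
import pandas as pd


def table_cells(df, max_rows=None):
    """Return (headers, rows) with every cell converted to a display string.

    Hypothetical helper: it only prepares the data, it does not touch the GUI.
    """
    shown = df if max_rows is None else df.head(max_rows)
    headers = [str(col) for col in shown.columns]
    # itertuples(index=False, name=None) yields one plain tuple per row
    rows = [
        tuple(str(value) for value in row)
        for row in shown.itertuples(index=False, name=None)
    ]
    return headers, rows


# Made-up data for illustration
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [31, 25, 40]})
headers, rows = table_cells(df, max_rows=2)
print(headers)
print(rows)
```

This avoids the manual to_list() and zip() step from the question, since itertuples already yields the rows as tuples.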
Link: CSV with missing Values
I am trying to figure out the best way to fill in the 'region_cd' and 'model_cd' fields in my CSV file with Pandas. The 'RevenueProduced' field can tell you what the right value is for either missing fields. My idea is to make some query in my dataframe that looks for all the fields that have the same 'region_cd' and 'RevenueProduced' and make all the 'model_cd' match (vice versa for the missing 'region_cd').
import io
import pandas as pd
import requests as r
#variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs3358/hw/hw2/'
file_1 = 'powergeneration.csv'
res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text), delimiter=',')
There are likely many ways to solve this, but I am just starting with Pandas and I am stumped, to say the least. Any help would be awesome.
This assumes that each RevenueProduced value maps to exactly one region_cd and one model_cd.
Take a look at the pandas groupby function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
You could do the following:
# create mask to grab only regions with values
mask = df['region_cd'].notna()
# group by `RevenueProduced`, collect the first `region_cd` in each group and reset the index
region_df = df[mask].groupby('RevenueProduced')["region_cd"].first().reset_index()
# checkout the built-in zip function to understand what's happening here
region_map = dict(zip(region_df.RevenueProduced, region_df.region_cd))
# store data in new column, although you could overwrite "region_cd"
df.loc[:, 'region_cd_NEW'] = df["RevenueProduced"].map(region_map)
You would do the exact same process with model_cd. I haven't run this code since at the time of writing this I don't have access to your csv, but I hope this helps.
Here is the documentation for .map series method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
(Keep in mind a series is just a column in a dataframe)
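Since I can't run this against the real CSV, here is a small self-contained sketch of the same idea on made-up rows. It uses groupby(...).transform("first") instead of building a dict, which is a variation on the steps above rather than the exact same code; "first" returns the first non-null value in each group. It still assumes every RevenueProduced value maps to a single region_cd and model_cd.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; the real powergeneration.csv is not reproduced here.
df = pd.DataFrame({
    "RevenueProduced": [100, 100, 200, 200, 300],
    "region_cd": ["A", np.nan, "B", "B", np.nan],
    "model_cd": [np.nan, "X1", "Y2", np.nan, "Z3"],
})

for col in ["region_cd", "model_cd"]:
    # First non-null value of `col` within each RevenueProduced group,
    # broadcast back to the shape of the original column
    known = df.groupby("RevenueProduced")[col].transform("first")
    # Only fill the gaps; existing values are left untouched
    df[col] = df[col].fillna(known)

print(df)
```

Rows whose group has no known value at all (the last region_cd here) stay missing, which is worth checking for after the fill.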
The function finds the correlation of a given store with every other store.
Input: the store number to compare against
Output: a dataframe of correlation coefficients
def calcCorr(store):
    a = []
    metrix = pre_df[['TOT_SALES', 'TXN_PER_CUST']]  # add metrics as required, e.g. 'TXN_PER_CUST'
    for i in metrix.index:
        a.append(metrix.loc[store].corrwith(metrix.loc[i[0]]))
    df = pd.DataFrame(a)
    df.index = metrix.index
    df = df.drop_duplicates()
    df.index = [s[0] for s in df.index]
    df.index.name = "STORE_NBR"
    return df
I don't understand this part: corrwith(metrix.loc[i[0]]). Why is there a [0]? Thanks for your help!
The dataframe pre_df looks like this:
[image of pre_df not included]
As commented, this should not be the way to go, as it produces a lot of duplicates (it loops through all the rows but only keeps the first level). The function can be written as:
def calcCorr1(store, df):
    return pd.DataFrame({k: df.loc[store].corrwith(df.loc[k])
                         for k in df.index.unique('STORE_NBR')
                         }).T
Notice that instead of looping through all the rows, we loop only through the unique values in the first level (STORE_NBR). Since each store contains many rows, this should cut the number of corrwith calls, and so the runtime, roughly by a factor equal to the number of rows per store.
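To make the [0] concrete, here is a small runnable sketch. The real pre_df isn't available, so the data is made up, and the name of the second index level (MONTH) is an assumption; only STORE_NBR comes from the question. With a MultiIndex, each index label is a tuple, so i[0] is the store number.

```python
import pandas as pd

# Made-up stand-in for pre_df; the second index level name is assumed.
index = pd.MultiIndex.from_product(
    [[1, 2, 3], [1, 2, 3, 4]], names=["STORE_NBR", "MONTH"]
)
pre_df = pd.DataFrame(
    {
        "TOT_SALES":    [1, 2, 3, 4, 2, 4, 6, 8, 4, 3, 2, 1],
        "TXN_PER_CUST": [1, 3, 2, 4, 1, 3, 2, 4, 4, 2, 3, 1],
    },
    index=index,
    dtype=float,
)

# Each label of a MultiIndex is a tuple (STORE_NBR, MONTH),
# so label[0] picks out the store number.
first_label = pre_df.index[0]
print(first_label, first_label[0])


def calcCorr1(store, df):
    # One corrwith call per unique store instead of one per row
    return pd.DataFrame(
        {k: df.loc[store].corrwith(df.loc[k]) for k in df.index.unique("STORE_NBR")}
    ).T


corr = calcCorr1(1, pre_df)
print(corr)
```

In this toy data, store 2 moves exactly with store 1 and store 3 moves exactly against it, so the coefficients come out at the extremes.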
I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the data of the packets, formatted as ASCII representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe and make sure that it is monotonically increasing, and if there are missing data, insert new rows so that the list increases monotonically, i.e. the 'data' column should be filled in with the appropriate value but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to get the next row, compare it with the current row, and add a new row if the difference is more than 100. Unfortunately I've struggled with this, since there doesn't seem to be a good way to retrieve the row after the current one.
2) Is there a better (faster) way to do this than the one I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. You can check monotonicity directly with df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want. It is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive so you may need to go over your data in chunks. In that case, you may need to provide index range ahead of time.
import io

import pandas as pd

# test data
df = pd.read_csv(
    io.StringIO(  # pd.compat.StringIO is not available in recent pandas versions
        """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
    ),
    sep=r"\s+",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09
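A possibly shorter variant of the same idea is to reindex the existing frame against the full range, instead of creating a blank frame and updating it. This is a sketch on the same test data; it assumes, as above, that the expected step between packets is 100 and that the data values are unique.

```python
import io

import pandas as pd

raw = """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Full set of expected values, assuming a fixed step of 100
start = int(df["data"].iloc[0])
stop = int(df["data"].iloc[-1])
full = pd.Index(range(start, stop + 1, 100), name="data")

# Rows missing from the original get NaN for frame_time_epoch
filled = df.set_index("data").reindex(full)
print(filled)
```

reindex keeps the float dtype of frame_time_epoch, whereas the blank frame above starts out with an object column.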
Just to check: did you try something like
delta = df['data'].diff()
delta[delta>0]
delta[delta<100]
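Spelled out a bit more, diff can be used to locate the gaps without looping. The sketch below uses scaled-down, made-up values and assumes a fixed step of 100 between consecutive packets.

```python
import pandas as pd

# Scaled-down, made-up values standing in for the packet counters
df = pd.DataFrame({"data": [1000, 1100, 1200, 1300, 1500]})
step = 100  # assumed spacing between consecutive packets

# Difference to the previous row; the first entry is NaN
delta = df["data"].diff()

# Rows that directly follow a gap
after_gap = df[delta > step]

# How many packets are missing in front of each of those rows
missing_per_gap = (delta[delta > step] // step - 1).astype(int)

print(after_gap)
print(missing_per_gap)
```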
Say I construct a dataframe with pandas, having multi-indexed columns:
import numpy as np
import pandas as pd
mi = pd.MultiIndex.from_product([['trial_1', 'trial_2', 'trial_3'], ['motor_neuron','afferent_neuron','interneuron'], ['time','voltage','calcium']])
ind = np.arange(1,11)
df = pd.DataFrame(np.random.randn(10,27),index=ind, columns=mi)
Say I want only the voltage data from trial 1. I know that the following code fails, because the indices are not sorted lexically:
idx = pd.IndexSlice
df.loc[:,idx['trial_1',:,'voltage']]
As explained in another post, the solution is to sort the dataframe's indices, which works as expected:
dfSorted = df.sortlevel(axis=1)  # on pandas versions without sortlevel, use df.sort_index(axis=1)
dfSorted.loc[:,idx['trial_1',:,'voltage']]
I understand why this is necessary. However, say I want to add a new column:
dfSorted.loc[:,('trial_1','interneuron','scaledTime')] = 100 * dfSorted.loc[:,('trial_1','interneuron','time')]
Now dfSorted is not sorted anymore, since the new column was tacked onto the end, rather than snuggled into order. Again, I have to call sortlevel before selecting multiple columns.
I feel this makes for repetitive, bug-prone code, especially when adding lots of columns to the much bigger dataframe in my own project. Is there a (preferably clean-looking) way of inserting new columns in lexical order without having to call sortlevel over and over again?
One approach would be to use filter, which does a text filter on the column names:
In [117]: df['trial_1'].filter(like='voltage')
Out[117]:
  motor_neuron afferent_neuron interneuron
       voltage         voltage     voltage
1    -0.548699        0.986121   -1.339783
2    -1.320589       -0.509410   -0.529686
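Another option that does not depend on the columns being sorted is to build a boolean mask from the column levels. This is a sketch of that idea, not something from the original post; it reuses the frame from the question and adds the scaledTime column at the end, where it lands out of lexical order.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([
    ["trial_1", "trial_2", "trial_3"],
    ["motor_neuron", "afferent_neuron", "interneuron"],
    ["time", "voltage", "calcium"],
])
df = pd.DataFrame(np.random.randn(10, 27), index=np.arange(1, 11), columns=mi)

# New column is appended at the end, so the columns are not sorted
df[("trial_1", "interneuron", "scaledTime")] = 100 * df[("trial_1", "interneuron", "time")]

# Select by level values instead of slicing; no sorting required
cols = df.columns
mask = (cols.get_level_values(0) == "trial_1") & (cols.get_level_values(2) == "voltage")
voltage = df.loc[:, mask]
print(voltage.head(2))
```

If slicing with pd.IndexSlice is still preferred, a single sort_index(axis=1) call after all the new columns have been added should be enough, rather than one call per added column.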
I am trying to do some analysis on baseball pitch F/x data. All the pitch data is stored in a pandas dataframe with columns like 'Pitch speed' and 'X location.' I have a wrapper function (using pandas.query) that, for a given pitch, will find other pitches with similar speed and location. This function returns a pandas dataframe of unknown size. I would like to use this function over large numbers of pitches; for example, to find all pitches similar to those thrown in a single game. I have a function that does this correctly, but it is quite slow (probably because it is constantly resizing resampled_pitches):
def get_pitches_from_templates(template_pitches, all_pitches):
    resampled_pitches = pd.DataFrame(columns=all_pitches.columns.values.tolist())
    for i, row in template_pitches.iterrows():
        resampled_pitches = resampled_pitches.append(get_pitches_from_template(row, all_pitches))
    return resampled_pitches
I have tried to rewrite the function using pandas.apply on each row, or by creating a list of dataframes and then merging, but can't quite get the syntax right.
What would be the fastest way to do this type of sampling and merging?
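It sounds like you should use pd.concat for this: collect the per-template results in a list and concatenate once at the end.
def get_pitches_from_templates(template_pitches, all_pitches):
    res = []
    for i, row in template_pitches.iterrows():
        res.append(get_pitches_from_template(row, all_pitches))
    return pd.concat(res)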
I think that a merge might be even faster. Usage of df.iterrows() isn't recommended as it generates a series for every row.
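For what it's worth, here is a rough sketch of what the merge route could look like. The similarity rule is an assumption, since get_pitches_from_template isn't shown: pitches count as similar when speed and location are each within a tolerance. The function name, the tolerances and the data are all made up; only the column names come from the question. It needs a pandas version that supports how="cross".

```python
import pandas as pd


def similar_pitches_by_merge(template_pitches, all_pitches, speed_tol=1.0, loc_tol=0.5):
    """Hypothetical merge-based replacement for the row-by-row loop."""
    # Pair every template with every pitch; template columns get a _tmpl suffix
    pairs = template_pitches.merge(all_pitches, how="cross", suffixes=("_tmpl", ""))
    close = (
        (pairs["Pitch speed"] - pairs["Pitch speed_tmpl"]).abs().le(speed_tol)
        & (pairs["X location"] - pairs["X location_tmpl"]).abs().le(loc_tol)
    )
    # Keep only the matching pitches, with the original column layout
    return pairs.loc[close, all_pitches.columns].reset_index(drop=True)


# Made-up data for illustration
all_pitches = pd.DataFrame({
    "Pitch speed": [90.0, 91.0, 95.0, 80.0],
    "X location": [0.0, 0.1, 0.0, 2.0],
})
template_pitches = pd.DataFrame({
    "Pitch speed": [90.5, 80.0],
    "X location": [0.0, 2.0],
})

result = similar_pitches_by_merge(template_pitches, all_pitches)
print(result)
```

Keep in mind that a cross merge builds len(template_pitches) * len(all_pitches) rows before filtering, so for large inputs it may need to be done in chunks.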