Plot Subset of Dataframe without Being Redundant - python

a bit of a Python newb here. As a beginner it's easy to learn different functions and methods from training classes but it's another thing to learn how to "best" code in Python.
I have a simple scenario where I'm looking to plot a portion of a dataframe spdf. I only want to plot instances where speed is greater than 0 and use datetime as my X-axis. The way I've managed to get the job done seems awfully redundant to me:
ts = pd.Series(spdf[spdf['speed']>0]['speed'].values, index=spdf[spdf['speed']>0]['datetime'])
ts.dropna().plot(title='SP1 over Time')
Is there a better way to plot this data without specifying the subset of my dataframe twice?

You don't need to build a new Series. You can plot using your original df
df[df['col'] > 0].plot()
In your case:
spdf[spdf['speed'] > 0].dropna().plot(title='SP1 over Time')
I'm not sure what your spdf object is or how it was created. If you'll often need to plot against the 'datetime' column, you can set it as the index of the df. If you're reading the data from a csv you can do this using the parse_dates and index_col keyword arguments, or if you already have the df you can change the index using df.set_index('datetime'). You can use df.info() to see what is currently being used as your index and its datatype.
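A minimal sketch of both options (the data here is made up; the column names follow the question):

```python
import pandas as pd

# Option 1: parse and index 'datetime' while reading (assumes a CSV source):
# spdf = pd.read_csv('speeds.csv', parse_dates=['datetime'], index_col='datetime')

# Option 2: set the index on a frame you already have
spdf = pd.DataFrame({
    'datetime': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    'speed': [0, 5, 3],
})
spdf = spdf.set_index('datetime')

# With 'datetime' as the index, the filtered plot picks it up as the x-axis:
# spdf[spdf['speed'] > 0]['speed'].plot(title='SP1 over Time')
print(spdf.index.name)  # datetime
```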


How can I iterate over columns of a csv file to split it into several files?

I am completely new to Python (I started last week!), so while I looked at similar questions, I have difficulty understanding what's going on and even more difficulty adapting them to my situation.
I have a csv file where rows are dates and columns are different regions (see image 1). I would like to create a file that has 3 columns: Date, Region, and Indicator where for each date and region name the third column would have the correct indicator (see image 2).
I tried turning wide into long data, but I could not quite get it to work, as I said, I am completely new to Python. My second approach was to split it up by columns and then merge it again. I'd be grateful for any suggestions.
Here is a solution using stack() in pandas:
import pandas as pd
# In your case, use pd.read_csv instead of this:
frame = pd.DataFrame({
    'Date': ['3/24/2020', '3/25/2020', '3/26/2020', '3/27/2020'],
    'Algoma': [None, 0, 0, 0],
    'Brant': [None, 1, 0, 0],
    'Chatham': [None, 0, 0, 0],
})
solution = frame.set_index('Date').stack().reset_index(name='Indicator').rename(columns={'level_1':'Region'})
solution.to_csv('solution.csv')
This is the inverse of doing a pivot, as explained here: Doing the opposite of pivot in pandas Python. As you can see there, you could also consider using the melt function as an alternative.
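A rough sketch of the melt alternative mentioned above, using the same kind of sample frame (column names are taken from the question):

```python
import pandas as pd

frame = pd.DataFrame({
    'Date': ['3/24/2020', '3/25/2020'],
    'Algoma': [None, 0],
    'Brant': [None, 1],
})

# melt turns the wide region columns into Region/Indicator pairs
long = frame.melt(id_vars='Date', var_name='Region', value_name='Indicator')
print(long.columns.tolist())  # ['Date', 'Region', 'Indicator']
```

One difference worth knowing: melt keeps rows whose indicator is NaN, while stack() drops them by default.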
First, your region columns are currently 'one hot encoded'. What you are trying to do is "reverse" that one hot encoding. Maybe check if this link answers your question:
Reversing 'one-hot' encoding in Pandas.

I want to subtract one column from another in pandas, but I keep getting a copy error. Is there a better way to do this operation?

I have a data frame TB_greater_2018 that has 3 columns: country, e_inc_100k_2000 and e_inc_100k_2018. I would like to subtract e_inc_100k_2000 from e_inc_100k_2018 and then use those values returned to create a new column of the differences and then sort by the countries with the largest difference. My current code is:
case_increase_per_100k = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018["case_increase_per_100k"] = case_increase_per_100k
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
When I run this, I get a SettingWithCopyWarning. Is there a way to do this without getting this warning? Or just overall a better way of accomplishing the task?
You can do
TB_greater_2018["case_increase_per_100k"] = TB_greater_2018["e_inc_100k_2018"] - TB_greater_2018["e_inc_100k_2000"]
TB_greater_2018.sort_values("case_increase_per_100k", ascending=[False]).head()
The warning isn't really about computing the difference in a separate step. A SettingWithCopyWarning usually means TB_greater_2018 is itself a slice of another DataFrame, so pandas can't tell whether the column assignment should affect the original frame. Taking an explicit .copy() of the slice before assigning new columns makes the intent clear and silences the warning.
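A sketch with made-up data (the column names follow the question; the parent frame and filter condition are assumptions to reproduce the sliced-frame situation):

```python
import pandas as pd

tb = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'e_inc_100k_2000': [10, 20, 30],
    'e_inc_100k_2018': [50, 25, 90],
})

# Slicing can produce a view-like object; .copy() makes column assignment safe
TB_greater_2018 = tb[tb['e_inc_100k_2018'] > tb['e_inc_100k_2000']].copy()
TB_greater_2018['case_increase_per_100k'] = (
    TB_greater_2018['e_inc_100k_2018'] - TB_greater_2018['e_inc_100k_2000']
)
result = TB_greater_2018.sort_values('case_increase_per_100k', ascending=False)
print(result['country'].tolist())  # ['C', 'A', 'B']
```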

call pandas plot function for path entry

I have a pandas dataframe that holds the file path to .wav data. Can I use pandas DataFrame.plot() function to plot the data referenced?
Example:
typical usage:
df.plot()
what I'm trying to do:
df.plot(df.path_to_data)???
I suspect some combination of apply and lambda will do the trick, but I'm not very familiar with these tools.
No, that isn't possible. plot is a method that operates on the data already loaded into a pd.DataFrame; it can't follow a file path stored in a column. What you'd need to do is
Load your dataframe using pd.read_* (usually, pd.read_csv(file)) and assign to df
Now call df.plot
So, in summary, you need -
df = pd.read_csv(filename)
... # some processing here (if needed)
df.plot()
As for the question of whether this can be done "without loading data in memory"... you can't plot data that isn't in memory. If you want to, you can limit the number of rows you read, or you can load it efficiently, by loading it in chunks. You can also write code to aggregate/summarise data, or sample it.
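A rough sketch of the chunked approach, aggregating a mean without holding the whole file in memory (the in-memory StringIO here stands in for a large CSV on disk):

```python
import io
import pandas as pd

# Stand-in for a large CSV file; in practice pass a file path to read_csv
csv_data = io.StringIO("x\n1\n2\n3\n4\n")

total, count = 0.0, 0
# chunksize makes read_csv yield DataFrames of at most 2 rows each
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['x'].sum()
    count += len(chunk)

mean = total / count
print(mean)  # 2.5
```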
I think you first need to create the DataFrame, typically with read_csv, and then call DataFrame.plot:
pd.read_csv('path_to_data').plot()
But if you need to plot DataFrames created from the paths stored in a column of another DataFrame:
df.path_to_data.apply(lambda x: pd.read_csv(x).plot())
Or use custom function:
def f(x):
    pd.read_csv(x).plot()

df.path_to_data.apply(f)
Or use loop:
for x in df.path_to_data:
    pd.read_csv(x).plot()

Seaborn Distplot and Barplot

I have a dataframe that contains a column whose values are 'A','B','C','D'... This is just a grouping of some sort. I wanted to produce a histogram of the column values vs. their counts.
import seaborn as sns
sns.distplot(dfGroupingWithoutNan['patient_group'])
This produced an error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
I thought maybe, because I'm not familiar with distplot, I'm not using it the right way. I was thinking I could just pass a Series into it and it would determine the counts for each value and display them in the histogram accordingly.
Anyway, I thought of another solution and this is what I came up with.
series1 = dfGroupingWithoutNan['patient_group'].value_counts()
dfPatientGroup = pd.DataFrame( {'levels' : series1.index, 'level_values' : series1.values})
sns.set_style("whitegrid")
sns.barplot(x="levels", y="level_values", data=dfPatientGroup)
This time I was able to produce a plot of each values versus its count though using a bar plot.
I just wanted to ask: was there any other way to do this, like how it would have worked if I used distplot? Also, do I really need to create a new dataframe just to hold the values and their counts? I was thinking, wouldn't it be possible for distplot to determine the counts automatically without going through the hassle of creating a new dataframe?
I would use a Counter to do this. The logic is very similar to what you are doing, but you don't need to create an extra dataframe:
from collections import Counter
cnt = Counter(dfGroupingWithoutNan.patient_group)
sns.barplot(x=list(cnt.keys()), y=list(cnt.values()))
I'm not aware of any solution that automatically handles string values in seaborn or matplotlib histograms.

Python xarray: grouping by multiple parameters

When using the xarray package for Python 2.7, is it possible to group over multiple parameters like you can in pandas? In essence, an operation like:
data.groupby(['time.year','time.month']).mean()
if you wanted to get mean values for each year and month of a dataset.
Unfortunately, xarray does not support grouping with multiple arguments yet. This is something we would like to support and it would be relatively straightforward, but nobody has had the time to implement it yet (contributions would be welcome!).
An easy way around is to construct a multiindex and group by that "new" coordinate:
da_multiindex = da.stack(my_multiindex=['time.year','time.month'])
da_mean = da_multiindex.groupby("my_multiindex").mean()
da_mean.unstack() # go back to normal index
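For comparison, here is a sketch of the multi-parameter grouping the question describes, done in pandas on a datetime-indexed Series (synthetic data; this is the pandas analogue, not xarray):

```python
import pandas as pd

# Daily values spanning four calendar months
idx = pd.date_range('2019-11-01', '2020-02-28', freq='D')
data = pd.Series(range(len(idx)), index=idx)

# Group by year and month of the DatetimeIndex simultaneously
monthly_mean = data.groupby([data.index.year, data.index.month]).mean()
print(monthly_mean.index.nlevels)  # 2
```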
