I have a pandas dataframe that holds file paths to .wav data. Can I use the pandas DataFrame.plot() function to plot the data those paths reference?
Example:
typical usage:
df.plot()
what I'm trying to do:
df.plot(df.path_to_data)???
I suspect some combination of apply and lambda will do the trick, but I'm not very familiar with these tools.
No, that isn't possible directly. plot is a method that operates on the data held inside a pd.DataFrame object (here, df itself), not on file paths stored in one of its columns. What you'd need to do is:
Load your dataframe using pd.read_* (usually, pd.read_csv(file)) and assign to df
Now call df.plot
So, in summary, you need -
df = pd.read_csv(filename)
... # some processing here (if needed)
df.plot()
As for the question of whether this can be done "without loading data in memory"... you can't plot data that isn't in memory. If you want, you can limit the number of rows you read, or load the data more efficiently by reading it in chunks. You can also write code to aggregate/summarise the data, or sample it.
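For example (a minimal sketch, assuming a plain csv file; the filename, row count and chunk size are illustrative):

import pandas as pd

# read only the first 10,000 rows and plot them
pd.read_csv("data.csv", nrows=10_000).plot()

# or aggregate chunk by chunk, then plot the (small) summary
chunk_means = [chunk.mean() for chunk in pd.read_csv("data.csv", chunksize=100_000)]
pd.DataFrame(chunk_means).plot()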
I think you first need to create the DataFrame, obviously with read_csv, and then call DataFrame.plot:
pd.read_csv('path_to_data').plot()
But if you need to plot DataFrames created from the paths stored in a column of another DataFrame:
df.path_to_data.apply(lambda x: pd.read_csv(x).plot())
Or use custom function:
def f(x):
    pd.read_csv(x).plot()
df.path_to_data.apply(f)
Or use loop:
for x in df.path_to_data:
    pd.read_csv(x).plot()
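Since the question mentions .wav files rather than csv, the same loop pattern could use a wav reader instead of read_csv. A minimal sketch, assuming standard PCM .wav files and df.path_to_data as in the question; scipy.io.wavfile is my assumption, not something from the question:

import matplotlib.pyplot as plt
from scipy.io import wavfile

for path in df.path_to_data:
    rate, samples = wavfile.read(path)  # sample rate and raw samples
    plt.figure()
    plt.plot(samples)
    plt.title(path)
plt.show()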
I have a csv file where I validate each cell by some rule on its column.
df.drop(df[~validator].index, inplace=True)
The validator here can be different functions checking if a cell is integer-like, or if a string inside a cell is shorter than 10 characters, etc. So a cell alone has all the information needed to be validated, without requiring any other cells from the same row or column.
And I have this:
bad_dfs = []
for validator, error in people_csv_validators:
    bad_dfs.append(df.loc[~validator])
    df.drop(df[~validator].index, inplace=True)
bad_df = pd.concat(bad_dfs)
Previously the dataframes were smaller than 1M rows with 20 columns or fewer. The column count hasn't changed, but the row count has increased by a lot, and I want to be able to process this with a fixed amount of memory. So I figured I'd chunk it, since the validation doesn't depend on anything else.
Now, I know I can just pass a chunksize argument to the read_csv I have, then write to a csv file chunk by chunk with mode="a", but I heard about dask and a couple of other libraries that do something similar underneath with their dataframe class, and I figured there might be some other methods to do this.
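Roughly, the chunked pandas approach I'm describing looks like this (just a sketch; the chunk size and output handling are illustrative):

import pandas as pd

first_chunk = True
for chunk in pd.read_csv(path, chunksize=1_000_000):
    chunk = some_row_based_operations(chunk)  # per-chunk validation/filtering
    chunk.to_csv(output_path, mode="a", header=first_chunk, index=False)
    first_chunk = False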
Is there any standard way of doing this, like
df = pd.read_csv(path, chunk_in_the_background_and_write_to_this_file=output_path, chunk_count=10^6)
some_row_based_operations(df)
# It automatically reads the first 10^6 rows and processes them,
# then writes them to `output_path` and then reads the next 10^6 rows and so on
Again, this is rather a simple thing but I want to know if there is a canonical way.
The rough code to do this with dask is as follows:
import dask.dataframe as dd
# let's use ddf for dask df
ddf = dd.read_csv(path) # can also provide list of files
def some_row_based_operations(df):
    # a function that accepts and returns a pandas df,
    # implementing the required logic
    return df
# the line below is fine only if the function is row-based
# (no dependencies across different rows)
modified_ddf = ddf.map_partitions(some_row_based_operations)
# single_file kwarg is only if you want one file at the end
modified_ddf.to_csv(output_path, single_file=True)
One caution: with the approach above there should be no inplace changes to the df inside some_row_based_operations, but hopefully making a change like the one below is feasible:
# change this: df.drop(df[~validator].index, inplace=True)
# also note, that this logic should be part of `some_row_based_operations`
df = df.drop(df[~validator].index)
I'm converting a dataframe column of WKT strings into shapely geometries. How it's currently done is as follows:
geoms = df["wkt"].apply(shapely.wkt.loads).values
Here df["wkt"] has rows with data like:
"MULTIPOLYGON (((24.2401805 70.8385222,24.2402333 70.83850555,24.2402166 70.83848885,24.24015 70.83848885,24.2401277 70.83850555,24.2401805 70.8385222)))"
But as the dataframe the function is being applied to is huge, this takes a while. Is there a way to speed this up? I've tried looking at multithreading or similar, but didn't really get it to work.
The same applies to this line:
df_geoms = [shapely.wkt.loads(x) for x in df.geom.values]
You can try the from_wkt method from the GeoPandas library:
geoms = geopandas.GeoSeries.from_wkt(df["wkt"])
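from_wkt parses the strings in a vectorized way, which is typically much faster than applying shapely.wkt.loads row by row. For example (a minimal sketch; the sample data and CRS are illustrative):

import geopandas
import pandas as pd

df = pd.DataFrame({"wkt": [
    "POINT (24.2401805 70.8385222)",
    "POINT (24.2402333 70.83850555)",
]})
geoms = geopandas.GeoSeries.from_wkt(df["wkt"])
gdf = geopandas.GeoDataFrame(df, geometry=geoms, crs="EPSG:4326")
print(gdf.geometry.head())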
Let data be a giant pandas dataframe. It has many methods, and they do not modify it in place but return a new dataframe. How then am I supposed to perform multiple operations while maximizing performance?
For example, say I want to do
data = data.method1().method2().method3()
where method1 could be set_index, and so on.
Is this the way you are supposed to do it? My worry is that pandas creates a copy every time I call a method, so that 3 copies of my data are being made in the above, when in reality all I want is to modify the original one.
So is it faster to say
data = data.method1(inplace=True)
data = data.method2(inplace=True)
data = data.method3(inplace=True)
But this is just way too verbose for me.
Yes, you can do that, i.e. apply the methods one after the other. Since you overwrite data, you are not keeping 3 copies of your data, only one:
data = data.method1().method2().method3()
However, regarding your second example, the general rule is to either write over the existing dataframe or do it "inplace", but not both at the same time (the inplace methods return None, so data = data.method1(inplace=True) would actually overwrite data with None). So use either
data = data.method1()
or
data.method1(inplace=True)
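A small concrete sketch of the chained style (set_index is from the question; the other methods and the data are illustrative):

import pandas as pd

data = pd.DataFrame({"id": [2, 1, 3], "value": [10.0, 20.0, 30.0]})
data = (
    data
    .set_index("id")
    .sort_index()
    .rename(columns={"value": "reading"})
)
print(data)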
A bit of a Python newb here. As a beginner, it's easy to learn different functions and methods from training classes, but it's another thing to learn how to "best" code in Python.
I have a simple scenario where I'm looking to plot a portion of a dataframe spdf. I only want to plot instances where speed is greater than 0 and use datetime as my X-axis. The way I've managed to get the job done seems awfully redundant to me:
ts = pd.Series(spdf[spdf['speed']>0]['speed'].values, index=spdf[spdf['speed']>0]['datetime'])
ts.dropna().plot(title='SP1 over Time')
Is there a better way to plot this data without specifying the subset of my dataframe twice?
You don't need to build a new Series. You can plot using your original df
df[df['col'] > 0].plot()
In your case:
spdf[spdf['speed'] > 0].dropna().plot(title='SP1 over Time')
I'm not sure what your spdf object is or how it was created. If you'll often need to plot using the 'datetime' column, you can set it to be the index of the df. If you're reading the data from a csv, you can do this using the index_col (and parse_dates) keyword arguments, or if you already have the df you can change the index using df.set_index('datetime'). You can use df.info() to see what is currently being used as your index and its datatype.
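A minimal sketch of that approach (the spdf data here is made up for illustration):

import pandas as pd

spdf = pd.DataFrame({
    "datetime": pd.date_range("2015-01-01", periods=5, freq="min"),
    "speed": [0.0, 1.2, 0.0, 2.5, 3.1],
})
spdf = spdf.set_index("datetime")
spdf.loc[spdf["speed"] > 0, "speed"].dropna().plot(title="SP1 over Time")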
I have imported an array into my IPython notebook using the following method:
SDSS_local_AGN = np.fromfile('/Users/iMacHome/Downloads/SDSS_local_AGN_Spectra.dat', dtype=float)
The array is of the form:
SPECOBJID_1 RA DEC SUBCLASS ...
299528159882143744 146.29988 -0.12001413 AGN ...
299532283050747904 146.32957 -0.30622363 AGN ...
Essentially each column has a header, and I now need to plot certain values.
As an example, I want to plot RA against DEC...how would I go about doing this?
Perhaps:
axScatter.plot(SDSS_local_AGN[RA], SDSS_local_AGN[DEC])
(Note: this answer is mistaken; see the comments.)
If you want to access them by name, you should use pandas instead of numpy. In numpy, you need to look up by index:
plt.scatter(SDSS_local_AGN[1], SDSS_local_AGN[2])
But in pandas, it would be as simple as:
df = pd.read_csv('myfile')
df.plot(kind='scatter', x='RA', y='DEC')
http://pandas.pydata.org/pandas-docs/version/0.15.0/visualization.html#scatter-plot
SDSS_local_AGN['RA'] is a valid operation in pandas, but not in numpy.
PS, since you are working in a Notebook, pandas DataFrames will nicely render as HTML tables, making them much more readable.
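For example (a minimal sketch, assuming the .dat file is whitespace-delimited text with the header row shown above):

import pandas as pd

df = pd.read_csv("/Users/iMacHome/Downloads/SDSS_local_AGN_Spectra.dat", sep=r"\s+")
df.plot(kind="scatter", x="RA", y="DEC")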