I have an interesting problem. I have a data frame containing data from a car that is triggered by events in the GPS, and the data also includes the GPS coordinates themselves. I have made a polygon in matplotlib that I want to use to limit my GPS data. In other words, I am trying to filter out the data that was recorded inside a polygon built from GPS points. I have tried solving this by adding a new column based on matplotlib's built-in test (poly_path.contains_point(point)).
My GPS column looks like this:
GPS DATA:
GPS
(57.723124, 11.923557)
(57.724115, 11.933557)
(57.723124, 11.923557)
...
And I would like to add a new column using this condition.
GPS DATA:
GPS Is inside Polygon
(57.723124, 11.923557) True
(57.724115, 11.933557) False
(57.723124, 11.923557) True
...
And I have tried solving this problem by adding this line:
df1filt["Is inside Polygon"] = poly_path.contains_point(df1filt['GPS'])
However, this doesn't work.
Any ideas?
Thank you in advance!
Try:
df1filt["Is inside Polygon"] = df1filt['GPS'].apply(poly_path.contains_point)
Edit:
If the data type of the column is string, you need to create a cleaning function, apply it, then try my first solution.
E.g.
def clean_gps_col(text):
    text = text[1:-1]
    text = text.split(',')
    return (float(text[0]), float(text[1]))
df1filt['GPS_cleaned'] = df1filt['GPS'].apply(clean_gps_col)
Now try the first solution:
df1filt["Is inside Polygon"] = df1filt['GPS_cleaned'].apply(poly_path.contains_point)
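For reference, here is a minimal end-to-end sketch of the approach using matplotlib.path.Path. The polygon vertices and GPS points below are invented for illustration; note that Path also offers contains_points, which checks all rows in a single vectorized call:

```python
import pandas as pd
from matplotlib.path import Path

# Hypothetical polygon: a small box in (lat, lon) space
poly_path = Path([(57.70, 11.90), (57.70, 11.94),
                  (57.73, 11.94), (57.73, 11.90)])

# Hypothetical data frame with one tuple per row
df1filt = pd.DataFrame({"GPS": [(57.72, 11.92), (57.75, 11.95)]})

# Element-wise check with apply, as above:
df1filt["Is inside Polygon"] = df1filt["GPS"].apply(poly_path.contains_point)

# Equivalent vectorized check over all rows at once:
df1filt["Is inside Polygon"] = poly_path.contains_points(list(df1filt["GPS"]))
print(df1filt)
```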
I have a list of about 350 coordinates, which are coordinates within a specified area, that I want to extract from a netCDF file using Xarray. In case it is relevant, I am trying to extract SWE (snow water equivalent) data from a particular land surface model.
My problem is that this for loop takes forever to go through each item in the list and get the relevant timeseries data. Perhaps to some extent this is unavoidable, since I am having to actually load the data from the netCDF file for each coordinate. What I need help with is speeding up the code in any way possible. Right now it is taking a very long time to run, 3+ hours and counting to be precise.
Here is everything I have done so far:
import xarray as xr
import numpy as np
import pandas as pd
from datetime import datetime as dt
1) First, open all of the files (daily data from 1915-2011).
df = xr.open_mfdataset(r'C:\temp\*.nc',combine='by_coords')
2) Narrow my location to a smaller box within the continental United States
swe_sub = df.swe.sel(lon=slice(246.695, 251), lat=slice(33.189, 35.666))
3) I just want to extract the first daily value for each month, which also narrows the timeseries.
swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1)
Now I want to load up my list of coordinates (which happens to be in an Excel file).
coord = pd.read_excel(r'C:\Documents\Coordinate_List.xlsx')
print(coord)
lat = coord['Lat']
lon = coord['Lon']
lon = 360+lon
name = coord['OBJECTID']
The following for loop goes through each coordinate in my list of coordinates, extracts the timeseries at each coordinate, and rolls it into a tall DataFrame.
Newdf = pd.DataFrame([])
for i, j, k in zip(lat, lon, name):
    dsloc = swe_first.sel(lat=i, lon=j, method='nearest')
    DT = dsloc.to_dataframe()
    # Insert the name of the station with preferred column title:
    DT.insert(loc=0, column="Station", value=k)
    Newdf = Newdf.append(DT, sort=True)
I would greatly appreciate any help or advice y'all can offer!
Alright, I figured this one out. It turns out I needed to load my subset of data into memory first, since xarray "lazy loads" the Dataset by default.
Here is the line of code that I revised to make this work properly:
swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1).persist()
Here is a link I found helpful for this issue:
https://examples.dask.org/xarray.html
I hope this helps someone else out too!
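Beyond persisting the subset, the Python loop itself can be eliminated: xarray supports vectorized pointwise indexing by passing DataArray indexers to .sel, which extracts every station in one call instead of one .sel per station. A small sketch on synthetic data (the grid and the station coordinates here are invented stand-ins for swe_first and the Excel list):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for swe_first: a small (time, lat, lon) cube
data = xr.DataArray(
    np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2000-01-01", periods=2, freq="MS"),
            "lat": [33.5, 34.5, 35.5],
            "lon": [247.0, 248.0, 249.0, 250.0]},
    name="swe",
)

# Hypothetical station list (would come from the Excel file)
lat = [33.6, 35.4]
lon = [247.1, 249.9]
name = ["A", "B"]

# Vectorized nearest-neighbour selection: one call for all stations
points = data.sel(
    lat=xr.DataArray(lat, dims="station", coords={"station": name}),
    lon=xr.DataArray(lon, dims="station", coords={"station": name}),
    method="nearest",
)
# Tall frame indexed by (time, station), no append loop needed
Newdf = points.to_dataframe()
print(Newdf)
```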
I currently have a data set of latitude and longitude points plotted in ArcMap. These coordinates were imported from excel and have a "notes" column. I was wondering if there was any way to query select words from this column to change the symbol on the map.
I am not well versed in python but my attempted logic is as follows:
def FindKeyWord(Notes):
    if Notes.str.contains("Detrital zircon"):
        return (symbol as a black triangle)
    else:
        return (symbol as a black circle)
I hope what I am attempting to accomplish makes sense. I might have to just make a whole new spreadsheet for the rows with the attribute and have two separate layers.
I'm not sure about the symbol part, but you can query your data easily with pandas:
import pandas as pd
df = pd.read_excel('excel_file')
mask = df['notes'].str.contains('Detrital zircon')
black_triangle = df[mask]
black_circle = df[~mask]
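If a single table with a symbol label per row is more convenient than two separate frames, the same mask can also populate a new column. A sketch with made-up rows standing in for the Excel import (the column and label names are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imported Excel data
df = pd.DataFrame({"notes": ["Detrital zircon sample", "other sample"]})

mask = df["notes"].str.contains("Detrital zircon")
# One symbol column instead of two separate frames
df["symbol"] = np.where(mask, "black triangle", "black circle")
print(df)
```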
Good morning,
I am trying to iterate through a CSV to produce a title for each stock chart that I am making.
The CSV is formatted as: Ticker, Description spanning about 200 rows.
The code is shown below:
df_symbol_description = pd.read_csv('C:/TS/Combined/Tickers Desc.csv')
print(df_symbol_description['Description'])
for r in df_symbol_description['Description']:
    plt.suptitle(df_symbol_description['Description'][r], size='20')
It is erroneous, as it comes back with this error: "KeyError: 'iShrs MSCI ACWI ETF'". The error is just showing me the first ticker description in the CSV. If anyone knows how to fix this, it is much appreciated!
Thank you
I don't know how to fix the error, since it's unclear what you are trying to achieve, but we can have a look at the problem itself.
Consider this example, which is essentially your code in small.
import pandas as pd
df=pd.DataFrame({"x" : ["A","B","C"]})
for r in df['x']:
    print(r, df['x'][r])
The dataframe consists of one column, called x, which contains the values "A", "B", "C". In the for loop you iterate over those values, so in the first iteration r is "A". You are then using "A" as an index into the column, which is not possible, since the column is indexed by 0, 1 and 2, not by the strings it contains.
So in order to print the column values, you can simply use
for r in df['x']:
    print(r)
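Applied to the original question, that means using the description value itself rather than indexing with it. A sketch with invented data standing in for 'Tickers Desc.csv':

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt

# Hypothetical stand-in for the CSV of tickers and descriptions
df_symbol_description = pd.DataFrame({
    "Ticker": ["ACWI", "SPY"],
    "Description": ["iShrs MSCI ACWI ETF", "SPDR S&P 500 ETF"],
})

titles = []
for description in df_symbol_description["Description"]:
    fig = plt.figure()
    # Use the value itself, not df[...][description]
    title = fig.suptitle(description, size=20)
    titles.append(title.get_text())
    # ... plot this ticker's chart here, then save and close the figure ...
    plt.close(fig)
```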
I have many files in a folder that look like this one:
[image: sample contents of one file]
and I'm trying to build a dictionary for the data. I want it keyed on two levels: the first key is the HTTP address and the second is the third field (the plugin used, e.g. adblock). The values are the different metrics, and my intention is to compute, for each site and plugin, the mean, median and variance of each metric once the dictionary has been built. For example, for the mean I intend to use all the fourth-field values in the file, and so on. I tried to write this code but, first of all, I'm not sure that it is correct.
[image: my attempted dictionary code]
I read other posts, but none of them solved my problem, since they treat only one key, or they don't show how to access the different values inside the dictionary to compute the mean, median and variance.
The problem is simple: assuming the dictionary implementation is OK, how must I access the different values for key1: www.google.it -> key2: adblock?
Any kind of help is appreciated, and I'm available for any follow-up questions.
You can do what you want using a dictionary, but you should really consider using the Pandas library. This library is centered around tabular data structure called "DataFrame" that excels in column-wise and row-wise calculations such as the one that you seem to need.
To get you started, here is the Pandas code that reads one text file using the read_fwf() method. It also displays the mean and variance for the fourth column:
# import the Pandas library:
import pandas as pd
# Read the file 'table.txt' into a DataFrame object. Assume
# a header-less, fixed-width file like in your example:
df = pd.read_fwf("table.txt", header=None)
# Show the content of the DataFrame object:
print(df)
# Print the fourth column (zero-indexed):
print(df[3])
# Print the mean for the fourth column:
print(df[3].mean())
# Print the variance for the fourth column:
print(df[3].var())
There are different ways of selecting columns and rows from a DataFrame object. The square brackets [ ] in the previous examples selected a column in the data frame by column number. If you want to calculate the mean of the fourth column only from those rows that contain adblock in the third column, you can do it like so:
# Print those rows from the data frame that have the value 'adblock'
# in the third column (zero-indexed):
print(df[df[2] == "adblock"])
# Print only the fourth column (zero-indexed) from that data frame:
print(df[df[2] == "adblock"][3])
# Print the mean of the fourth column from that data frame:
print(df[df[2] == "adblock"][3].mean())
EDIT:
You can also calculate the mean or variance for more than one column at the same time:
# Use a list of column numbers to calculate the mean for all of them
# at the same time:
l = [3, 4, 5]
print(df[l].mean())
END EDIT
If you want to read the data from several files and do the calculations for the concatenated data, you can use the concat() method. This method takes a list of DataFrame objects and concatenates them (by default, row-wise). Use the following line to create a DataFrame from all *.txt files in your directory:
import glob

df = pd.concat([pd.read_fwf(file, header=None) for file in glob.glob("*.txt")],
               ignore_index=True)
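Finally, the per-(site, plugin) statistics the question asks for fall out of a groupby on the two key columns. A sketch with invented rows standing in for the real files (column 0 is the site, column 2 the plugin, column 3 one metric, matching the header-less numbering produced by read_fwf):

```python
import pandas as pd

# Hypothetical rows mimicking the header-less fixed-width columns
df = pd.DataFrame({
    0: ["www.google.it", "www.google.it", "www.example.com"],
    1: ["x", "y", "z"],
    2: ["adblock", "adblock", "adblock"],
    3: [10.0, 20.0, 30.0],
})

# Mean, median and variance of column 3 for each (site, plugin) pair:
stats = df.groupby([0, 2])[3].agg(["mean", "median", "var"])
print(stats)

# Look up one (site, plugin) combination, like dict[site][plugin]:
google_adblock_mean = stats.loc[("www.google.it", "adblock"), "mean"]
```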
In Python, I have the ability to add series data to my chart object to plot as a line graph.
I'm using the following lines:
overall_stats_sheet2 = current_book.worksheets[0]
overall_chart_sheet = current_book.worksheets[1]
chart_object = charts.LineChart()
for x in top_down_reference_points[0]:
    chart_object.append(charts.Series(charts.Reference(overall_stats_sheet, (x, 1), (x, overall_stats_sheet2.get_highest_column() + 1)), title='Erasure Decodes'))
chart_object.drawing.top = 0
chart_object.drawing.left = 400
chart_object.drawing.width = 650
chart_object.drawing.height = 400
overall_chart_sheet.add_chart(chart_object)
top_down_reference_points[0] contains all of the row numbers that erasure decode exists on. In the example picture, the numbers are row 19 and row 39.
My for loop code currently iterates through those and appends them to the graph, but it creates a new legend label and line for each erasure-decode set. I want to combine all that data from the sheet and graph one line associated with all the erasure decode data. Is this possible?
It's not entirely clear from your code which cells you want in your chart and how. It may be as simple as creating a single series that refers to multiple cells. At the moment you're creating multiple series which is why you're seeing multiple items in the legend.
BTW. I strongly recommend you start using the 2.3 beta of openpyxl which has much better chart support.