Using Python to query keywords in an ArcMap layer - python

I currently have a data set of latitude and longitude points plotted in ArcMap. These coordinates were imported from Excel and have a "notes" column. I was wondering if there is any way to query certain words from this column and change the symbol on the map accordingly.
I am not well versed in Python, but my attempted logic is as follows:
def FindKeyWord(notes):
    if notes.str.contains("Detrital zircon"):
        return ...  # symbol as a black triangle
    else:
        return ...  # symbol as a black circle
I hope what I am attempting to accomplish makes sense. I might have to just make a whole new spreadsheet for the rows with the attribute and have two separate layers.

I'm not sure about the symbol part, but you can query your data easily with pandas:
import pandas as pd
df = pd.read_excel('excel_file')
mask = df['notes'].str.contains('Detrital zircon')
black_triangle = df[mask]
black_circle = df[~mask]
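If you end up going the two-layer route you mention, a minimal sketch (the file names are placeholders) that writes each subset to its own CSV, so each can be imported into ArcMap as a separate layer and symbolized on its own:

# one file per symbol class; import each as its own ArcMap layer
df[mask].to_csv('detrital_zircon_points.csv', index=False)
df[~mask].to_csv('other_points.csv', index=False)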

Related

How can I make my data numerical, so I can visualize it via matplotlib?

So I have this df; the columns that I'm interested in visualizing
later with matplotlib are 'incident_date' and 'fatalities'. I want to create two diagrams. One will display the number of incidents with injuries (the column named 'fatalities' says whether it was a fatal accident, one with injuries, or neither), and the other will display the dates with the most deaths. So, in order to do that, I need to somehow turn the data in the 'fatalities' column into numerical values.
This is my df's head, so you get an idea
I created dummy data based on the picture you provided:
import pandas as pd

data = {'incident_date': ['1-Mar-20', '1-Mar-20', '3-Mar-20', '3-Mar-20', '3-Mar-20',
                          '5-Mar-20', '6-Mar-20', '7-Mar-20', '7-Mar-20'],
        'fatalities': ['Fatal', 'Fatal', 'Injuries', 'Injuries', 'Neither',
                       'Fatal', 'Fatal', 'Fatal', 'Fatal'],
        'conclusion_number': [1, 1, 3, 23, 23, 34, 23, 24, 123]}
df = pd.DataFrame(data)
All you need is a group by on incident_date and fatalities, and you will get the numerical values for that particular date and that particular incident type.
df_grp = df.groupby(['incident_date','fatalities'],as_index=False)['conclusion_number'].count()
df_grp.rename({'conclusion_number':'counts'},inplace=True, axis=1)
The output of the above looks like this:

  incident_date fatalities  counts
0      1-Mar-20      Fatal       2
1      3-Mar-20   Injuries       2
2      3-Mar-20    Neither       1
3      5-Mar-20      Fatal       1
4      6-Mar-20      Fatal       1
5      7-Mar-20      Fatal       2
Once you have the counts column you can build your matplotlib diagrams.
Let me know if you need help with the diagrams as well.
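For instance, a minimal sketch of one of the two diagrams, a bar chart of fatal incidents per date built from the df_grp above (styling is up to you):

import matplotlib.pyplot as plt

# keep only the fatal rows and plot their counts per date
fatal = df_grp[df_grp['fatalities'] == 'Fatal']
plt.bar(fatal['incident_date'], fatal['counts'])
plt.xlabel('incident date')
plt.ylabel('number of fatal incidents')
plt.show()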

Conditional Pandas combined with matplotlib

I have quite an interesting problem. I have a data frame containing data from a car that is triggered by events in the GPS. The data also contains GPS data. However, I have made a polygon in matplotlib that I want to use to limit my GPS data. In other words, I am trying to filter out the data that was recorded inside the polygon, which was created from GPS points. I have tried solving this problem by first adding a new column with a built-in matplotlib condition (poly_path.contains_point(point)).
My GPS column looks like this:
GPS DATA:
GPS
(57.723124, 11.923557)
(57.724115, 11.933557)
(57.723124, 11.923557)
...
And I would like to add a new column using this condition.
GPS DATA:
GPS Is inside Polygon
(57.723124, 11.923557) True
(57.724115, 11.933557) False
(57.723124, 11.923557) True
...
And I have tried solving this problem by adding this line:
df1filt["Is inside Polygon"] = poly_path.contains_point(df1filt['GPS'])
However, this doesn't work.
Any ideas?
Thank you in advance!
Try:
df1filt["Is inside Polygon"] = df1filt['GPS'].apply(poly_path.contains_point)
Edit:
If the data type of the column is string, you need to create a cleaning function, apply it, then try my first solution.
E.g.
def clean_gps_col(text):
    # strip the surrounding parentheses, split on the comma, cast to floats
    text = text[1:-1]
    text = text.split(',')
    return (float(text[0]), float(text[1]))
df1filt['GPS_cleaned'] = df1filt['GPS'].apply(clean_gps_col)
Now try the first solution:
df1filt["Is inside Polygon"] = df1filt['GPS_cleaned'].apply(poly_path.contains_point)

Filling in missing values with Pandas

Link: CSV with missing Values
I am trying to figure out the best way to fill in the 'region_cd' and 'model_cd' fields in my CSV file with pandas. The 'RevenueProduced' field can tell you the right value for either missing field. My idea is to write some query on my dataframe that looks for all the rows that have the same 'region_cd' and 'RevenueProduced' and makes all the 'model_cd' values match (and vice versa for the missing 'region_cd').
import io

import pandas as pd
import requests as r

# variables needed for ease of file access
url = 'http://drd.ba.ttu.edu/isqs3358/hw/hw2/'
file_1 = 'powergeneration.csv'

res = r.get(url + file_1)
res.status_code
df = pd.read_csv(io.StringIO(res.text), delimiter=',')
There are likely many ways to solve this, but I am just starting with pandas and I am stumped, to say the least. Any help would be awesome.
Assuming that each RevenueProduced maps to exactly one region_cd and one model_cd.
Take a look at the groupby pandas function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
You could do the following:
# create mask to grab only regions with values
mask = df['region_cd'].notna()
# group by region, collect the first `RevenueProduced` and reset the index
region_df = df[mask].groupby('RevenueProduced')["region_cd"].first().reset_index()
# checkout the built-in zip function to understand what's happening here
region_map = dict(zip(region_df.RevenueProduced, region_df.region_cd))
# store data in new column, although you could overwrite "region_cd"
df.loc[:, 'region_cd_NEW'] = df["RevenueProduced"].map(region_map)
You would do the exact same process with model_cd. I haven't run this code since at the time of writing this I don't have access to your csv, but I hope this helps.
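For completeness, a sketch of that same process repeated for model_cd (untested for the same reason):

# same idea: map each RevenueProduced to its first known model_cd
mask = df['model_cd'].notna()
model_df = df[mask].groupby('RevenueProduced')['model_cd'].first().reset_index()
model_map = dict(zip(model_df.RevenueProduced, model_df.model_cd))
df.loc[:, 'model_cd_NEW'] = df['RevenueProduced'].map(model_map)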
Here is the documentation for .map series method. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
(Keep in mind a series is just a column in a dataframe)

Efficient method to extract data from netCDF files, with Xarray, into a tall DataFrame

I have a list of about 350 coordinates within a specified area that I want to extract from a netCDF file using xarray. In case it is relevant, I am trying to extract SWE (snow water equivalent) data from a particular land surface model.
My problem is that this for loop takes forever to go through each item in the list and get the relevant timeseries data. Perhaps to some extent this is unavoidable, since I am having to actually load the data from the netCDF file for each coordinate. What I need help with is speeding up the code in any way possible. Right now it is taking a very long time to run, 3+ hours and counting to be more precise.
Here is everything I have done so far:
import xarray as xr
import numpy as np
import pandas as pd
from datetime import datetime as dt
1) First, open all of the files (daily data from 1915-2011).
df = xr.open_mfdataset(r'C:\temp\*.nc',combine='by_coords')
2) Narrow my location to a smaller box within the continental United States
swe_sub = df.swe.sel(lon=slice(246.695, 251), lat=slice(33.189, 35.666))
3) I just want to extract the first daily value for each month, which also narrows the timeseries.
swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1)
Now I want to load up my list of coordinates (which happens to be in an Excel file).
coord = pd.read_excel(r'C:\Documents\Coordinate_List.xlsx')
print(coord)
lat = coord['Lat']
lon = coord['Lon']
lon = 360+lon
name = coord['OBJECTID']
The following for loop goes through each coordinate in my list of coordinates, extracts the timeseries at each coordinate, and rolls it into a tall DataFrame.
Newdf = pd.DataFrame([])
for i, j, k in zip(lat, lon, name):
    dsloc = swe_first.sel(lat=i, lon=j, method='nearest')
    DT = dsloc.to_dataframe()
    # insert the name of the station with the preferred column title
    DT.insert(loc=0, column="Station", value=k)
    Newdf = Newdf.append(DT, sort=True)
I would greatly appreciate any help or advice y'all can offer!
Alright, I figured this one out. It turns out I needed to load my subset of data into memory first, since xarray "lazy loads" the Dataset by default.
Here is the line of code that I revised to make this work properly:
swe_first = swe_sub.sel(time=swe_sub.time.dt.day == 1).persist()
Here is a link I found helpful for this issue:
https://examples.dask.org/xarray.html
I hope this helps someone else out too!
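A further speed-up worth trying is xarray's vectorized (pointwise) indexing, which replaces the Python loop with a single .sel call; a sketch assuming the swe_first, lat, lon, and name variables from above:

import xarray as xr

# one vectorized nearest-neighbor selection for all ~350 stations;
# 'station' becomes a new dimension with one entry per coordinate pair
points = swe_first.sel(
    lat=xr.DataArray(lat.values, dims='station'),
    lon=xr.DataArray(lon.values, dims='station'),
    method='nearest',
)
stations = xr.DataArray(name.values, dims='station')
tall_df = points.assign_coords(station=stations).to_dataframe().reset_index()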

Selecting Specific Data to Sum and Plot

This is some of the data located in the Excel sheet:
I want to select musical theater shows (known in the code as 'ID') that had more minorities than Caucasians in the cast. Once determined, I want to place the information for the selected shows into a new data frame that will only hold those shows, because it will be easier to manipulate. In the new data frame, I want the related cast ethnicity in the same row as the show, so I can compare it to the audience ethnicity. I then tried to plot this information.
So generally, I want to add up the values in specific rows if a row fits specific summation criteria. All data used in this project is located in an Excel sheet that is converted to a CSV and uploaded as a data frame. I would then like to plot the values of the cast in its entirety and compare the cast's ethnicity to the audience ethnicity.
I am working with Python, and I have tried to remove data that is not needed by selecting columns with if statements, so that the data frame only includes the shows that have more minorities than Caucasians; I then tried to use this information in the plot. I am unsure whether I have to filter out all the unneeded columns if I am not using them in the calculations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# first need to import numpy so that calculations can be made
from google.colab import files
uploaded = files.upload()
# df = pd.read_csv('/content/drive/My Drive/allTheaterDataV2.csv')
import io
df = pd.read_csv(io.BytesIO(uploaded['allTheaterDataV2.csv']))
# need to download excel sheet as csv and then upload into colab so that it can
# be manipulated as a dataframe
# want to select shows(ID) that had more minorities than Caucasians in the cast
# once determined, the selected shows should be placed into a new data frame that
# will only hold the shows and the related ethnicity, and compared to audience ethnicity
# this information should then be plotted
# first we will determine the shows that have a majority ethnic cast
minorcal = list(df)
minorcal.remove('CAU')
minoritycastSUM = df[minorcal].sum(axis=1)
# print(minorcal)
# next, we determine how many people in the cast were Caucasian, so remove all others
caucasiancal = list(df)
# i first wanted to do caucasiancal.remove('AFRAM', 'ASIAM', 'LAT', 'OTH')
# but got the statement I could only have 1 argument so i just put each on their own line
caucasiancal.remove('AFRAM')
caucasiancal.remove('ASIAM')
caucasiancal.remove('LAT')
caucasiancal.remove('OTH')
idrowcaucal = df[caucasiancal].sum(axis=1)
minoritycompare = old.filter(['idrowcaucal','minoritycastSUM'])
print(minoritycompare)
# now compare the two values per line
if minoritycastSUM < caucasiancal:
    minoritydf = pd.df.minorcal.append()
# plot new data frame per each show and compare to audience ethnicity
df.plot(x=['AFRAM', 'ASIAM', 'CAU', 'LAT', 'OTH', 'WHT', 'BLK', 'ASN', 'HSP', 'MRO'], y = [''])
# i am unsure how to call the specific value for each column
plt.title('ID Ethnicity Comparison')
# i am unsure how to call the specific show so that only one show is per plot so for now i just subbed in 'ID'
plt.xlabel('Ethnicity comparison')
plt.ylabel('Number of Cast Members/Audience Members')
plt.show()
I would like to see the data frame with the specific shows that fit the criteria, and then the plot for each show, but right now I am getting errors about how to formulate the new data frame, and Python says the lengths in the if statement cannot be used.
First of all, this will not be a complete answer, as:
- I don't know how you're imagining your final plot to look
- I don't know what the columns in your DataFrame are (consider using more descriptive column labels, e.g. 'caucasian actors' instead of 'CAU', ...)
- it is unclear to me whether any trend can be formed from your data, since the screenshot you've posted shows equal audience compositions for the first shows
Nevertheless, I built upon the DataFrame in this answer, and maybe this initial plot of the "non caucasian/caucasian ratio" per show can point you in the right direction.
Perhaps you could build a similar set of sum & ratio columns for the audience columns, and then plot the actor ratio as a function of the audience ratio to see whether a more caucasian audience prefers more or less caucasian actors (I guess that's what you're after?).
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'ID':['Billy Elliot','next to normal','shrek','guys and dolls',
'west side story', 'pal joey'],
'Season' : [20082009,20082009,20082009,
20082009,20082009,20082009],
'AFRAM' : [2,0,4,4,0,1],
'ASIAM' : [0,0,1,0,0,0],
'CAU' : [48,10,25,24,28,20],
'LAT' : [1,0,1,3,18,0],
'OTH' : [0,0,0,0,0,0],
'WHT' : [73.7,73.7,73.7,73.7,73.7,73.7]})
## define a sum column for non caucasian actors (I suppose?)
df['non_cau']=df[['AFRAM','ASIAM','LAT','OTH']].sum(axis=1)
## build a ratio of non caucasian to caucasian
df['cau_ratio']=df['non_cau']/df['CAU']
## make a quick plot
fig,ax=plt.subplots()
ax.scatter(df['ID'],df['cau_ratio'])
ax.set_ylabel('non cau / cau ratio')
plt.tight_layout()
plt.show()
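And a hypothetical extension of that idea: if the audience columns from the original sheet ('WHT', 'BLK', 'ASN', 'HSP', 'MRO') are present, you could build the audience-side ratio the same way and make the scatter suggested above (the audience column names are assumptions taken from the question):

# audience-side ratio, mirroring the cast-side columns above
df['aud_non_wht'] = df[['BLK', 'ASN', 'HSP', 'MRO']].sum(axis=1)
df['aud_ratio'] = df['aud_non_wht'] / df['WHT']

fig, ax = plt.subplots()
ax.scatter(df['aud_ratio'], df['cau_ratio'])
ax.set_xlabel('non white / white audience ratio')
ax.set_ylabel('non cau / cau cast ratio')
plt.tight_layout()
plt.show()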
