Accessing Binned Data with pandas - python

I have a set of data which I have put into a data frame and then binned:
print(data1)
[[-1.90658883e+00 5.66881290e-01 1.45443907e+00]
[-1.82926850e+00 2.53325112e-01 1.45480072e+00]
[-1.59073925e+00 5.33264011e-01 1.45461954e+00]
...
[ 2.86246982e+02 4.52961148e-01 6.19121328e+00]]
df = pd.DataFrame(data=data1,)
print(df)
bins = [0,50,100,150,200,250,300,400]
df1 = pd.cut(df[0],bins, labels = False)
print(df1)
1 0
2 0
..
500 4
501 4
502 5
0 through 5 are the bin labels. I want to be able to access the data in each bin/category and store it in a variable. Something like this:
x = df1(4) # this doesn't work, just an example.
^ meaning I want to access the data stored in the 4th bin of the pandas dataframe and assign it to the variable x as an array, but I am unsure how to do that.

You can use pandas.DataFrame.loc and pass a boolean array to it.
bi = pd.cut(df[0], bins, labels=False)
x = df.loc[bi == 4]
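If you need the rows of a bin as a NumPy array (as asked), or all bins at once, here is a minimal follow-up sketch building on the df and bi from above:
# rows whose value in column 0 fell into bin 4, as a NumPy array
x = df.loc[bi == 4].to_numpy()

# or collect every bin into a dict of arrays, keyed by bin label
binned = {label: group.to_numpy() for label, group in df.groupby(bi)}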

Related

Getting data into a map

I got my .dat data formatted into arrays I could use in graphs and whatnot.
I got my data from this website and it requires an account if you want to download it yourself. The data will still be provided below, however.
https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1028
data in python:
import pandas as pd
df = pd.read_csv("ocean_flux_co2_2d.dat", header=None)
print(df.head())
0 1 2 3
0 -178.75 -77.0 0.000003 32128.7
1 -176.25 -77.0 0.000599 32128.7
2 -173.75 -77.0 0.001649 39113.5
3 -171.25 -77.0 0.003838 58934.0
4 -168.75 -77.0 0.007192 179959.0
I then decided to put this data into arrays that could be put into graphs and other functions.
Like so:
import numpy as np

lat = []
lon = []
sed = []
area = []
with open('/home/srowpie/SrowFinProj/Datas/ocean_flux_tss_2d.dat') as f:
    for line in f:
        parts = line.split(',')
        lat.append(float(parts[0]))
        lon.append(float(parts[1]))
        sed.append(float(parts[2]))
        area.append(float(parts[3]))
lat = np.array(lat)
lon = np.array(lon)
sed = np.array(sed)
area = np.array(area)
My question now is how can I put this data into a map with data points? Column 1 is latitude, Column 2 is longitude, Column 3 is sediment flux, and Column 4 is the area covered. Or do I have to bootleg it by making a graph that takes into account the variables lat, lon, and sed?
You don't need to build the arrays by hand. Just use df.values and you will have a NumPy array of all the data in the dataframe.
Example -
array([[-1.78750e+02, -7.70000e+01, 3.00000e-06, 3.21287e+04],
[-1.76250e+02, -7.70000e+01, 5.99000e-04, 3.21287e+04],
[-1.73750e+02, -7.70000e+01, 1.64900e-03, 3.91135e+04],
[-1.71250e+02, -7.70000e+01, 3.83800e-03, 5.89340e+04],
[-1.68750e+02, -7.70000e+01, 7.19200e-03, 1.79959e+05]])
I wouldn't recommend storing individual columns as variables. Instead, just set the column names for the dataframe and then use them to extract a pandas Series of the data in each column.
df.columns = ["Latitude", "Longitude", "Sediment Flux", "Area covered"]
This is what the table would look like after this:
   Latitude  Longitude  Sediment Flux  Area covered
0   -178.75      -77.0       3e-06          32128.7
1   -176.25      -77.0       0.000599       32128.7
2   -173.75      -77.0       0.001649       39113.5
3   -171.25      -77.0       0.003838       58934.0
4   -168.75      -77.0       0.007192      179959.0
Simply do df[column_name] to get the data in that column.
For example -> df["Latitude"]
Output -
0 -178.75
1 -176.25
2 -173.75
3 -171.25
4 -168.75
Name: Latitude, dtype: float64
Once you have done all this, you can use folium to plot the rows on real interactive maps.
import folium as fl

map = fl.Map(df.iloc[0, :2], zoom_start=100)
for index in df.index:
    row = df.loc[index, :]
    fl.Marker(row[:2].values, f"{dict(row[2:])}").add_to(map)
map
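If an interactive map is more than you need, a plain matplotlib scatter also works; this is just a sketch assuming the column names set above, with colour encoding the sediment flux:
import matplotlib.pyplot as plt

plt.scatter(df["Longitude"], df["Latitude"], c=df["Sediment Flux"], s=5, cmap="viridis")
plt.colorbar(label="Sediment Flux")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.show()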

Filtering function for pandas - Viewing NaN values within a column

Function I have created:
#Create a function that identifies blank values
def GPID_blank(df, variable):
    df = df.loc[df['GPID'] == variable]
    return df
Test:
variable = ''
test = GPID_blank(df, variable)
test
Goal: Create a function that can filter any dataframe column 'GPID' to see all of the rows where GPID has missing data.
I have tried running variable = 'NaN' and still no luck. However, I know the function works, as if I use a real-life variable "OH82CD85" the function filters my dataset accordingly.
Therefore, why doesn't it filter out the blank cells when variable = 'NaN'? I know that in my dataset there are 5 rows with GPID missing data.
Example df:
df = pd.DataFrame({'Client': ['A','B','C'], 'GPID':['BRUNS2','OH82CD85','']})
Client GPID
0 A BRUNS2
1 B OH82CD85
2 C
Sample of GPID column:
0 OH82CD85
1 BW07TI20
2 OW36HW81
3 PE56TA73
4 CT46SX81
5 OD79AU80
6 GF46DB60
7 OL07ST01
8 VP38SM57
9 AH90AE61
10 PG86KO78
11 NaN
12 NaN
13 SO21GR72
14 DY85IN90
15 KW80CV02
16 CM15QP83
17 VC38FP82
18 DA36RX05
19 DD74HD38
You can't use == with NaN. NaN != NaN.
Instead, you can modify your function a little to check if the parameter is NaN using pd.isna() (or np.isnan()):
def GPID_blank(df, variable):
    if pd.isna(variable):
        return df.loc[df['GPID'].isna()]
    else:
        return df.loc[df['GPID'] == variable]
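A quick usage sketch (assuming the example df above; pass np.nan, not the string 'NaN'):
import numpy as np

GPID_blank(df, np.nan)       # rows where GPID is actually missing (NaN)
GPID_blank(df, 'OH82CD85')   # rows where GPID matches a real value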
You can't really search for NaN values with an ordinary comparison. Also, in your example dataframe, '' is not NaN but an empty string, which can be matched with a comparison.
Instead, you need to check when you want to filter for NaN, and filter differently:
def GPID_blank(df, variable):
    if pd.isna(variable):
        df = df.loc[df['GPID'].isna()]
    else:
        df = df.loc[df['GPID'] == variable]
    return df
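If the blanks in your real data are empty strings rather than NaN, one option (a sketch, not part of this answer) is to convert them to NaN first so the same filter catches them:
import numpy as np

df['GPID'] = df['GPID'].replace('', np.nan)   # treat empty strings as missing
GPID_blank(df, np.nan)                        # now returns the blank rows too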
It's not working because with variable = 'NaN' you're looking for a string whose content is 'NaN', not for missing values.
You can try:
import pandas as pd
def GPID_blank(df):
    # filtered dataframe with NaN values in GPID column
    blanks = df[df['GPID'].isnull()].copy()
    return blanks
filtered_df = GPID_blank(df)

Why does PCA output duplicates for some components?

I'm working on the CTU-13 dataset; you can see an overview of its distributions in the dataset here. I'm using the 11th scenario of the CTU-13 dataset (S11.csv), which you can access here.
Given the synthetic nature of the dataset, I need to understand the most important features for the feature engineering stage.
#dataset loading
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv('/content/drive/My Drive/s11.csv')
#Keep events/rows which have 'Normal' or 'Bot'
df = df.loc[(df['Label'].str.contains('Normal') == True) | (df['Label'].str.contains('Bot') == True)]
#binary labeling
df.loc[(df['Label'].str.contains('Normal') == True),'Label'] = 0
df.loc[(df['Label'].str.contains('Bot') == True),'Label'] = 1
#data cleaning
null_columns = df.columns[df.isnull().any()]
#omit columns have more than 70% missing values
for i in null_columns:
    B = df[i].isnull().sum()
    if B > (df.shape[0]*70)//100:
        del df[i]
name_columns = list(df.columns)
for i in name_columns:
    if df[i].dtype == object:
        df[i] = pd.factorize(df[i])[0]+1
#impute mean of each column for missing values
name_columns = list(df.columns)
for i in name_columns:
    mean1 = df[i].mean()
    df[i] = df[i].replace(np.nan, mean1)
#Apply PCA
arr = df.to_numpy()
arr=arr[:,:-1]
pca=PCA(n_components=10)
x_pca=pca.fit_transform(arr)
explain=pca.explained_variance_ratio_
#sort and index pca top 10
n_pcs= pca.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = []
for col in df.columns:
    initial_feature_names.append(col)
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
print('important column by order: ')
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
top_components = pd.DataFrame(dic.items())
print(top_components)
Problem: I was wondering why the output of PCA duplicates some components?
important column by order:
0 1
0 PC0 TotBytes
1 PC1 SrcBytes
2 PC2 Load
3 PC3 Seq
4 PC4 DstLoad
5 PC5 DstLoad
6 PC6 Sport
7 PC7 Load
8 PC8 Rate
9 PC9 Rate
Any help to debug this problem will be appreciated! Probably I'm missing something in the implementation.

Pandas - Split column entry (every other separator)

I have a pandas data frame that looks something like this
| id | name | latlon |
0 sat -28,14 | -23, 12 | -21, 13...
the latlon column entry contains multiple latitude/longitude pairs, separated with the | symbol. I need to split them into lists as follows: lat = [-28, -23, -21] and lon = [14, 12, 13]
Running the following command will create a list of all the values:
sat_df["latlon"]= sat_df["latlon"].str.split("|", expand=False)
Example of one entry after splitting:
[-58.562242560404705,52.82662430990185, -61.300361184039964,64.0645716165538, -62.8683906074927,76.96557954998904, -63.078154849236505,90.49660509514713, -61.95530287454162,103.39930010176977, -59.727998547544765,114.629246065411, -56.63116878989326,124.07501384844198, -52.9408690779807,131.75498199669985, -48.85803704806645,137.9821558270659, -44.56621244973711,143.03546934613863, -40.08092215592037,147.27807367743728, -35.5075351924213,150.86679792543603,]
How can I continue to split the data so that every other entry is assigned to the lat/lon list respectively, for the entire dataframe? Alternatively, is there some way to create two columns (lat/lon) which both hold a list object with all the values?
EDIT:
import pandas as pd
sat_df = pd.DataFrame({'卫星编号': {0: 38858, 1: 5, 2: 16}, 'path': {0: '-2023240,1636954,-1409847|-2120945,1594435,-1311586|-2213791,1547970,-1209918|', 1: '8847,-974294,-168045|69303,-972089,-207786|129332,-963859,-246237|189050,-949637,-283483|', 2: '283880,751564,538726|214030,782804,550729|142133,808810,558964|69271,829348,563411|'}, 'latlon': {0: '-28.566504816706743,-58.42623323318429|-26.424915546197877,-58.03051668423269|-24.24957760771616,-57.709052434729294|-22.049419348341488,-57.45429550739338|-19.82765114196696,-57.258197633964414|-17.58719794818057,-57.113255687570714|-15.33074070109176,-57.01245109909582|-13.060755383916138,-56.949188922655416|-10.779548173615462,-56.91723753411087|-8.48928513939462,-56.910669632641685|-6.192021225701933,-56.92380598464241|-3.8897270110140494,-56.951159278680606|-1.5843114029280712,-56.987381318629815|0.7223533959819478,-57.02721062232328|3.028411197431552,-57.06542107180802|5.331999106238248,-57.09677071391785|7.631224662503422,-57.115951252231326|9.924144733525859,-57.11753523668981|12.20873984934678,-57.09592379302077|14.482890506579363,-57.045292032888945|16.744349099342163,-56.95953284633186|18.99070929829218,-56.83219872719919|', 1: '-9.826016080133869,71.12640824438319|-12.077961267269185,74.17040194928683|-14.251942328865088,77.22102880126546|-16.362232784638383,80.31943171515469|-18.372371674164317,83.43158582640798|-20.311489634835258,86.62273098947678|-22.14461262803909,89.85609377674561|-23.896490600856566,93.19765633031801|-25.53339979617313,96.60696767976263|-27.063070616439813,100.12254137641649|-28.488648081761962,103.78528610926675|-29.778331008010497,107.54645547637602|-30.942622037767002,111.47495996053523|-31.95152016226762,115.51397654947516|-32.80866797590735,119.73211812295206|-33.486858278098815,124.06227007574186|-33.98257678066123,128.57116785317814|-34.27304876808886,133.17990028392123|-34.34804732039687,137.91355482600457|-34.19053759979979,142.79776551711302|-33.788689805715364,147.73758823197466|-33.12248489727676,152.7937677542324|', 2: '34.00069374375586,-130.03583418452314|34.3070000099521,-125.16691893340256|34.37547230320849,-120.37930544344802|34.219644836708575,-115.72548686095767|33.8599777210809,-111.25048787484094|33.307236654159695,-106.89130089454063|32.579218893589676,-102.68672977394559|31.69071108398145,-98.63657044455137|30.663892680279847,-94.76720076317056|29.49498481622457,-91.01231662520239|28.20247456939903,-87.39472628213446|26.796048279088225,-83.90476041381801|25.29620394685256,-80.5572008057606|23.686627724590036,-77.28791855670698|21.984668849769005,-74.1108962902788|20.209508481020038,-71.0367205896831|18.337433788359615,-68.00383542959851|16.385207987194672,-65.02251732177939|14.355346635752394,-62.078279068092414|12.266387624465171,-59.17870114389838|10.087160866120724,-56.262880710180255|7.8348695447113235,-53.336971029542006|'}})
#splits latlon data into a list
sat_df.dropna(inplace=True)
sat_df["latlon"]= sat_df["latlon"].str.split("|", expand=False)
sat_df
#need to write each entries latlon list as two lists (alternating lat and lon)
lat = []
lon = []
#for sat_df["latlon"]:
Let's go a step back from your str.split and make use of explode, which was added in pandas 0.25,
then merge it back based on the index.
df = sat_df['latlon'].str.split('|').explode().str.split(',', expand=True)
new_df = pd.merge(sat_df.drop('latlon', axis=1),
                  df, left_index=True,
                  right_index=True).rename(columns={0: 'Lat', 1: 'Lon'})
print(new_df.drop('path', axis=1))
卫星编号 Lat Lon
0 38858 -28.566504816706743 -58.42623323318429
0 38858 -26.424915546197877 -58.03051668423269
0 38858 -24.24957760771616 -57.709052434729294
0 38858 -22.049419348341488 -57.45429550739338
0 38858 -19.82765114196696 -57.258197633964414
.. ... ... ...
2 16 14.355346635752394 -62.078279068092414
2 16 12.266387624465171 -59.17870114389838
2 16 10.087160866120724 -56.262880710180255
2 16 7.8348695447113235 -53.336971029542006
2 16 None
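Two small tweaks you may want (a sketch, not part of the answer above): strip the trailing '|' so the empty last element doesn't produce the None row in the output, and cast the new columns to float:
df = (sat_df['latlon'].str.rstrip('|')      # drop the trailing separator
      .str.split('|').explode()
      .str.split(',', expand=True)
      .astype(float))                       # Lat/Lon as numbers, not strings
new_df = (pd.merge(sat_df.drop('latlon', axis=1), df,
                   left_index=True, right_index=True)
          .rename(columns={0: 'Lat', 1: 'Lon'}))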
For this purpose we are using the pandas library.
Initially I have created a dataframe as you have mentioned.
Code:
import pandas as pd
latlon = [-58.562242560404705,52.82662430990185, -61.300361184039964,64.0645716165538, -62.8683906074927,76.96557954998904, -63.078154849236505,90.49660509514713, -61.95530287454162,103.39930010176977, -59.727998547544765,114.629246065411, -56.63116878989326,124.07501384844198, -52.9408690779807,131.75498199669985, -48.85803704806645,137.9821558270659, -44.56621244973711,143.03546934613863, -40.08092215592037,147.27807367743728, -35.5075351924213,150.86679792543603,]
# print(latlon)
data = pd.DataFrame({'id':[0],'name':['sat'],'latlon':[latlon]})
print(data)
Output:
id name latlon
0 0 sat [-58.562242560404705, 52.82662430990185, -61.3...
Now I've converted the latlon values to strings in order to iterate over them, because iterating over float values directly may raise an error. Then we pass the latitude and longitude values to the corresponding columns of the dataframe.
This code will work for any number of records or rows in your dataframe.
Code:
#splitting latlon and adding the values to lat and lon columns
lats = []
lons = []
for i in range(len(data)):
    lat_lon = [str(x) for x in (data['latlon'].tolist()[i])]
    lat = []
    lon = []
    for j in range(len(lat_lon)):
        if j % 2 == 0:
            lat.append(float(lat_lon[j]))
        else:
            lon.append(float(lat_lon[j]))
    lats.append(lat)
    lons.append(lon)
data = data.drop('latlon', axis=1)  # dropping latlon column
data.insert(2, 'lat', lats)         # adding lat column
data.insert(3, 'lon', lons)         # adding lon column
# print(data)
data #displaying dataframe
Output:
id name lat lon
0 0 sat [-58.562242560404705, -61.300361184039964, -62... [52.82662430990185, 64.0645716165538, 76.96557...
I hope it would be helpful.

Normalize columns in pandas data frame while one column is in a specific range

I have a data frame in pandas which contains my Experimental data. It looks like this:
KE  BE  EXP_DATA  COL_1  COL_2  COL_3  .....
10   1         5      1      2      3
 9   2         .      .      .      .
 8   3         .      .
 7   4
 6   5
 .
 .
The column KE is not used. BE holds the values for the x-axis and all other columns are y-axis values.
For normalisation I use the idea which is also presented here: Normalise, in the post by Michael Aquilina.
Therefore I need to find the maximum and the minimum of my data. I do it like this:
minBE = self.data[EXP_DATA].min()
maxBE = self.data[EXP_DATA].max()
Now I want to find the maximum and minimum value of the "column" EXP_DATA, but only where the "column" BE is in a certain range. So in essence I want to normalize the data only in a certain x-range.
Solution
Thanks to the solution Milo gave me, I now use this function:
def normalize(self, BE="Exp", NRANGE=False):
    """
    Normalize data by dividing all components by the max value of the data.
    """
    if BE not in self.data.columns:
        raise NameError("'{}' is not an existing column. ".format(BE) +
                        "Try list_columns()")
    if NRANGE and len(NRANGE) == 2:
        upper_be = max(NRANGE)
        lower_be = min(NRANGE)
        minBE = self.data[BE][(self.data.index > lower_be) & (self.data.index < upper_be)].min()
        maxBE = self.data[BE][(self.data.index > lower_be) & (self.data.index < upper_be)].max()
        for col in self.data.columns:  # done so the data in NRANGE is really scaled between [0, 1]
            msk = (self.data[col].index < max(NRANGE)) & (self.data[col].index > min(NRANGE))
            self.data[col] = self.data[col][msk]
    else:
        minBE = self.data[BE].min()
        maxBE = self.data[BE].max()
    for col in self.data.columns:
        self.data[col] = (self.data[col] - minBE) / (maxBE - minBE)
If I call the function with the parameter NRANGE=[a, b], and a and b are also the x-limits of my plot, it automatically scales the visible y-values between 0 and 1, as the rest of the data is masked. If the function is called without the NRANGE parameter, the whole range of the data passed to the function is scaled from 0 to 1.
Thank you for your help!
You can use boolean indexing. For example to select max and min values in column EXP_DATA where BE is larger than 2 and less than 5:
lower_be = 2
upper_be = 5
max_in_range = self.data['EXP_DATA'][(self.data['BE'] > lower_be) & (self.data['BE'] < upper_be)].max()
min_in_range = self.data['EXP_DATA'][(self.data['BE'] > lower_be) & (self.data['BE'] < upper_be)].min()
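From there, a possible follow-up sketch (assuming you then want to rescale EXP_DATA with these range-restricted extrema):
# values of EXP_DATA inside the BE window end up in [0, 1]
self.data['EXP_DATA'] = (self.data['EXP_DATA'] - min_in_range) / (max_in_range - min_in_range)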
