I have a dataframe where I am trying to remove all the values outside the range [-500, 500]; I simply want to drop the particular column/"Index" values that exceed this limit. I have tried a lot of different things, but nothing really seems to work. With the code below I get this error:
File "C:\Users\Jeffs.spyder-py3\kplr006779699.py", line 30, in data = data[data['0'] < abs(500)]
File "C:\Users\Jeffs\anaconda3\lib\site-packages\pandas\core\frame.py", line 3024, in getitem indexer = self.columns.get_loc(key)
File "C:\Users\Jeffs\anaconda3\lib\site-packages\pandas\core\indexes\range.py", line 354, in get_loc raise KeyError(key)
KeyError: '0'
which I'm guessing is because the column I'm calling '0' doesn't actually have that name.
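As a quick check of that guess: a DataFrame built from a bare NumPy array gets integer column labels, so the key is the integer 0, not the string '0' (tmp is just a throwaway example name):

import numpy as np
import pandas as pd

tmp = pd.DataFrame(np.arange(3))
print(tmp.columns.tolist())   # [0] -- an integer label, not the string '0'
print(tmp[0].tolist())        # works: [0, 1, 2]
# tmp['0'] raises KeyError: '0', exactly like the traceback above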
from astropy.io import ascii
import numpy as np
import matplotlib.pyplot as plt
import math
import pandas as pd
#Data from KIC 6779699
df = ascii.read(r'G:\Downloads\kplr006779699_kasoc-ts_llc_v1-2.dat')
# print(df)
x_Julian_data = df['col1']
x_data_raw = (x_Julian_data-54000)*86400 #Julian time to seconds: 60*60*24
data = np.linspace(0, 65541, num=65541, endpoint=True)
y_data_raw = df['col2'] #Relative flux ppm
for i in range(65541-2):  # cleaning up data
    data[i+1] = y_data_raw[i+1] - .5*(y_data_raw[i] + y_data_raw[i+2])
data[0] = 0
data[65540] = 0
data = pd.DataFrame(data)
data = data[data['0'] < abs(500)]
plt.plot(x_data_raw, data)
plt.xlim([1.1E8,1.25E8])
plt.ylim([-500,500])
I can't quite get it to work, even if I try wrapping it in a function definition.
Is there an easier way to approach this?
"data" is a numpy array (created using np.linspace), so you can filter it by value *before you create the data frame:
data = data[np.abs(data) < 500]  # note: plain abs(500) is just 500, so it would not remove values below -500
new_df = pd.DataFrame(data, columns=['a_useful_column_name'])
(While debugging, consider using a new variable name for the DataFrame.)
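A fuller sketch of the same idea, assuming data and x_data_raw are aligned 1-D NumPy arrays; the mask is applied to both so the later plt.plot call still gets equal-length inputs ('flux' is just an illustrative column name):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

mask = np.abs(data) < 500                    # True where the value lies inside (-500, 500)
new_df = pd.DataFrame({'flux': data[mask]})
plt.plot(x_data_raw[mask], data[mask])       # filter x the same way so the lengths still match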
I have this code where I wish to change the date format, but I only manage to change one line and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv("data_q_3.csv")
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_columns', None)
print("Covid 19 top 10 countries based on confirmed cases:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to fix the code so that the datetime changes in the whole dataset?
Thanks!
After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple; all you have to do is make use of Python's string slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will print abcde (the stop index is exclusive, so you get the characters at indices 0 through 4).
Below is the finished code.
import pandas as pd
# read unknown data
df = pd.read_csv("data_q_3.csv")
# List of unknown data
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_columns', None)
# you need a for loop to go through the whole column
for row in result.index:
    # get the currently stored time
    time = result.at[row, 'DateTime']
    # reformat the time string by slicing out the date (indices 0-10)
    # and the hour:minute part (indices 11-16, assuming the stored
    # strings look like 2020-03-18T12:13:09) and joining them with a dash
    time = time[0:10] + "-" + time[11:16]
    # store the new time back in the result
    result.at[row, 'DateTime'] = time
#print result
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
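A more pandas-idiomatic alternative, assuming the 'DateTime' values parse with pd.to_datetime, converts the whole column in one vectorized step instead of looping:

result['DateTime'] = pd.to_datetime(result['DateTime']).dt.strftime('%Y-%m-%d-%H:%M')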
I am trying to make 6 separate graphs from a DataFrame that has 5 columns and multiple rows, imported from Excel. I want to add two lines to each graph: the point in the DataFrame plus and minus the rolling standard deviation at that point, for each column and row of the DataFrame. To do this I am using a nested for loop and then graphing; however, it is saying "Wrong number of items passed, placement implies 1", and I do not know how to fix this.
I have tried converting the DataFrame to a list and appending rows as well. Nothing seems to work. I know this should be easy to do.
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for k,p in dfStorage, dfrollingStd:
    dftemp = pd.DataFrame(dfStorage, columns=[k])
    dfnew = pd.DataFrame(dfrollingStd, columns=[p])
    for i,j in dfStorage, dfrollingStd:
        dftemp = pd.DataFrame(dfStorage, index=[i])
        dfnew = pd.DataFrame(dfrollingStd, index=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
I expect the output to be 6 separate graphs, each with 3 lines. Instead I am not getting anything, and my loop is also not executing properly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
excel_file = 'C:/Users/afrydman/Documents/Storage and Data Centers FFO Multiples Data.xlsx'
dfStorage = pd.read_excel(excel_file,sheet_name='Storage Data', index_col='Date')
dfrollingStd = dfStorage.rolling(12).std().shift(-11)
#dfrollingStd.fillna(0)
#print(dfStorage[1][3])
for i in dfStorage:
    dftemp = pd.DataFrame(dfStorage, columns=[i])
    for j in dfrollingStd:
        dfnew = pd.DataFrame(dfrollingStd, columns=[j])
        dftemp['-1std'] = pd.DataFrame(dftemp).subtract(dfnew)
        dftemp['+1std'] = pd.DataFrame(dftemp).add(dfnew)
        pd.DataFrame(dftemp).plot()
        plt.ylabel('P/FFO')
This is my updated code and I am still getting the same error. This time it is saying "Wrong number of items passed 2, placement implies 1".
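For what it's worth, that error typically means a multi-column DataFrame is being assigned into a single column (dfnew still carries all the columns of dfrollingStd). A minimal sketch of a per-column approach, assuming dfStorage and dfrollingStd share the same column names:

for col in dfStorage.columns:
    dftemp = dfStorage[[col]].copy()
    dftemp['-1std'] = dfStorage[col] - dfrollingStd[col]
    dftemp['+1std'] = dfStorage[col] + dfrollingStd[col]
    dftemp.plot()
    plt.ylabel('P/FFO')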
I have a shapefile that I would like to convert into a DataFrame in Python 3.7. I have tried the following code:
import pandas as pd
import shapefile
sf_path = r'data/shapefile'
sf = shapefile.Reader(sf_path, encoding = 'Shift-JIS')
fields = [x[0] for x in sf.fields][1:]
records = sf.records()
shps = [s.points for s in sf.shapes()]
sf_df = pd.DataFrame(columns = fields, data = records)
But I got this error message saying
TypeError: Expected list, got _Record
So how should I convert the _Record objects to lists, or is there a way around it? I have tried GeoPandas too, but had some trouble installing it. Thanks!
def read_shapefile(sf_shape):
    """
    Read a shapefile into a Pandas DataFrame with a 'coords'
    column holding the geometry information. This uses the pyshp
    package.
    """
    fields = [x[0] for x in sf_shape.fields][1:]
    records = [y[:] for y in sf_shape.records()]
    #records = sf_shape.records()
    shps = [s.points for s in sf_shape.shapes()]
    df = pd.DataFrame(columns=fields, data=records)
    df = df.assign(coords=shps)
    return df
I had the same problem: pyshp returns each record as a _Record object rather than a plain list, and pandas expects a list when building the DataFrame, so the conversion fails. Try changing:
records = [y[:] for y in sf.records()]
I hope this works!
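A quick usage sketch with the Reader from the question (variable names follow the original post):

sf = shapefile.Reader(sf_path, encoding='Shift-JIS')
sf_df = read_shapefile(sf)
print(sf_df.head())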
I have 3 different CSV files. Each has 70 rows and 430 columns. I want to create and save a boolean result file (with the same shape) that holds True wherever the condition is met.
One file contains temperature data, one wind data, and one RH data. The condition is: (t >= 35) & (w >= 7) & (rh < 30)
I want the saved file to contain 0s and 1s showing in which cells the condition is met (1) or not (0). The problem is that the results are not correct! I really appreciate your help.
import numpy as np
import pandas as pd
dft = pd.read_csv("D:/practicet.csv", header=None)
dfrh = pd.read_csv("D:/practicerh.csv", header=None)
dfw = pd.read_csv("D:/practicew.csv", header=None)
result_set = []
for i in range(0, dft.shape[1]):
    t = dft[i]
    w = dfw[i]
    rh = dfrh[i]
    result = np.empty(dft.shape, dtype=bool)
    result = result[(t >= 35) & (w >= 7) & (rh < 30)]
    result_set = np.append(result_set, result)
np.savetxt("D:/result.csv", result_set, delimiter=",")
You can generate boolean Series by testing each column of the frame, then simply concatenate the columns back into a DataFrame object.
import pandas as pd
data = pd.read_csv('data.csv')
bool_temp = data['temperature'] > 22
bool_week = data['week'] > 5
bool_humid = data['humidity'] > 50
data_tmp = [bool_humid, bool_temp, bool_week]
df = pd.concat(data_tmp, axis=1, keys=[s.name for s in data_tmp])
The dummy data, written to data.csv:
temperature,week,humidity
25,3,80
29,4,60
22,4,20
20,5,30
2,7,80
30,9,80
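Since the question asks for a 0/1 file, a small follow-up sketch (df being the concatenated boolean frame above) casts the booleans to integers before saving:

df.astype(int).to_csv('result.csv', index=False)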
Give this a shot.
This is a proxy problem for yours, with random arrays from [0,100] in the same shape as your CSV.
import numpy as np
dft = np.random.rand(70,430)*100.
dfrh = np.random.rand(70,430)*100.
dfw = np.random.rand(70,430)*100.
result_set = []
for i in range(dft.shape[0]):
    result = ((dft[i] >= 35) & (dfw[i] >= 7) & (dfrh[i] < 30))
    result_set.append(result)
np.savetxt("result.csv", result_set, delimiter=",")
The critical problem with your code is:
result=np.empty(dft.shape,dtype=bool)
result=result[(t>=35) & (w>=7) & (rh<30)]
This does not do what you think it's doing. You (i) initialize an empty array (which will have garbage values), and then you (ii) apply your boolean mask to it. So, now you have a garbage array masked into another garbage array according to your specified boolean rules.
As an example...
In [5]: a = np.array([1,2,3,4,5])
In [6]: mask = np.array([True,False,False,False,True])
In [7]: a[mask]
Out[7]: array([1, 5])
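As an aside, the per-row loop isn't strictly necessary here; a fully vectorized sketch under the same shape assumptions:

result = (dft >= 35) & (dfw >= 7) & (dfrh < 30)   # elementwise over the whole arrays
np.savetxt("result.csv", result.astype(int), fmt="%d", delimiter=",")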
import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(r'^\s*$', np.nan, regex=True)
filevalues = filevalues.fillna(int(0))
int_series = filevalues.astype(int)
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
So I have hundreds of CSV files with many empty spots for values. Some of the blank spaces are detected as NaNs and others are empty strings. This has forced me to write my code the way it is right now: I need to apply a formula to each value, so I changed all such NaNs and empty strings to 0 so that I could apply the formula (in this example, multiplying by 1/1.2). The problem is that I do not want to see values that are 0, NaN, or empty strings when printing my DataFrame.
I have tried to use the following:
filevalues = filevalues.dropna()
But because certain CSV files have empty strings, this method does not fully work, and I get the error:
ValueError: invalid literal for int() with base 10: ' '
I have also tried the following after converting all values to 0:
filevalues = filevalues.loc[:, (filevalues != 0).all(axis=0)]
and
mask = np.any(np.isnan(filevalues) | np.equal(a, 0), axis=1)
Every method seems to give a different error. Is there a clean way to exclude these values when printing my pandas DataFrame? Please let me know if an example CSV file is needed.
Got it to work! Here is the answer if it is of use to anyone.
import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(" ", "", regex=True)
filevalues.replace("", np.nan, inplace=True) # replace empty string with np.nan
filevalues.dropna(inplace=True) # drop nan values
int_series = filevalues.astype(int) # change type
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
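A shorter variant, in case it helps anyone: pd.to_numeric with errors='coerce' turns blanks and other non-numeric entries into NaN in one step, so they can be dropped before applying the formula:

filevalues = pd.to_numeric(filevalues, errors='coerce').dropna()
calculated_series = filevalues * (1 / 1.2)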