If loop and saving the boolean results - python

I have 3 different CSV files. Each has 70 rows and 430 columns. I want to create and save a boolean result file (with the same shape) that puts True wherever the condition is met.
One file contains temperature data, one wind data, and one RH data. The condition is: [(t>=35) & (w>=7) & (rh<30)]
I want the saved file to be a 0-and-1 file that shows in which cells the condition has been met (1) or not (0). The problem is that the results are not correct! I really appreciate your help.
import numpy as np
import pandas as pd
dft = pd.read_csv("D:/practicet.csv", header=None)
dfrh = pd.read_csv("D:/practicerh.csv", header=None)
dfw = pd.read_csv("D:/practicew.csv", header=None)
result_set = []
for i in range(0, dft.shape[1]):
    t = dft[i]
    w = dfw[i]
    rh = dfrh[i]
    result = np.empty(dft.shape, dtype=bool)
    result = result[(t>=35) & (w>=7) & (rh<30)]
    result_set = np.append(result_set, result)
np.savetxt("D:/result.csv", result_set, delimiter=",")

You can generate boolean Series by testing each column of the frame, then simply concatenate the resulting columns back into a DataFrame object.
import pandas as pd
data = pd.read_csv('data.csv')
bool_temp = data['temperature'] > 22
bool_week = data['week'] > 5
bool_humid = data['humidity'] > 50
data_tmp = [bool_humid, bool_temp, bool_week]
df = pd.concat(data_tmp, axis=1, keys=[s.name for s in data_tmp])
The dummy data:
temperature,week,humidity
25,3,80
29,4,60
22,4,20
20,5,30
2,7,80
30,9,80
are written to data.csv
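If you then want the 0/1 file the original question asks for, the boolean frame can be converted before saving (a small sketch; result.csv is just an illustrative name):
df.astype(int).to_csv('result.csv', index=False)  # True/False become 1/0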

Give this a shot.
This is a proxy problem for yours, with random arrays drawn from [0, 100] in the same shape as your CSVs.
import numpy as np
dft = np.random.rand(70, 430) * 100.
dfrh = np.random.rand(70, 430) * 100.
dfw = np.random.rand(70, 430) * 100.
result_set = []
for i in range(dft.shape[0]):
    result = ((dft[i] >= 35) & (dfw[i] >= 7) & (dfrh[i] < 30))
    result_set.append(result)
np.savetxt("result.csv", result_set, delimiter=",")
The critical problem with your code is:
result = np.empty(dft.shape, dtype=bool)
result = result[(t>=35) & (w>=7) & (rh<30)]
This does not do what you think it's doing. You (i) initialize an empty array (which will have garbage values), and then you (ii) apply your boolean mask to it. So, now you have a garbage array masked into another garbage array according to your specified boolean rules.
As an example...
In [5]: a = np.array([1,2,3,4,5])
In [6]: mask = np.array([True,False,False,False,True])
In [7]: a[mask]
Out[7]: array([1, 5])
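As a side note, since the three frames in your setup share the same shape and default integer index, the whole computation can also be done without a loop. A minimal sketch, reusing the file paths from your question:
import pandas as pd

dft = pd.read_csv("D:/practicet.csv", header=None)
dfrh = pd.read_csv("D:/practicerh.csv", header=None)
dfw = pd.read_csv("D:/practicew.csv", header=None)

# Element-wise comparisons align by position, yielding one boolean frame
result = (dft >= 35) & (dfw >= 7) & (dfrh < 30)

# astype(int) turns True/False into 1/0 before writing
result.astype(int).to_csv("D:/result.csv", header=False, index=False)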


loop over columns in dataframes python

I want to loop over 2 columns in a specific dataframe, and I want to access the data by the name of the column, but it gives me a type error on line 3.
i = 0
for name, value in df.iteritems():
    q1 = df[name].quantile(0.25)
    q3 = df[name].quantile(0.75)
    IQR = q3 - q1
    min = q1 - 1.5*IQR
    max = q3 + 1.5*IQR
    minout = df[df[name] < min]
    maxout = df[df[name] > max]
    new_df = df[(df[name] < max) & (df[name] > min)]
    i += 1
    if i == 2:
        break
It looks like you want to exclude outliers based on the 1.5*IQR rule. Here is a simpler solution:
Input dummy data:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({'col%s' % (i+1): np.random.normal(size=1000)
                   for i in range(4)})
Removing the outliers (keep data: Q1-1.5IQR < data < Q3+1.5IQR):
Q1 = df.iloc[:, :2].quantile(.25)
Q3 = df.iloc[:, :2].quantile(.75)
IQR = Q3-Q1
non_outliers = (df.iloc[:, :2] > Q1-1.5*IQR) & (df.iloc[:, :2] < Q3+1.5*IQR)
new_df = df[non_outliers.all(axis=1)]
The output new_df contains only the rows whose first two columns lie within the Q1 - 1.5*IQR to Q3 + 1.5*IQR bounds.
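A quick sanity check on how many rows were dropped (using the frame names from the snippet above):
print(df.shape[0], new_df.shape[0])  # row counts before and after the filter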
A type error can happen for a lot of reasons, so it would be better if you added part of the DataFrame to the question to help narrow down the issue.
To loop over the data you can also use the iterrows() function, which iterates over the rows and lets you access each column by name:
import pandas as pd
df = pd.read_csv('filename.csv')
for _, content in df.iterrows():
    print(content['columnname'])  # add the name of the column you want to read
Refer to the following link for more information:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows
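If the goal really is to loop over the columns themselves, DataFrame.items() (the successor of the iteritems() used in the question) yields (column name, Series) pairs. A minimal sketch, assuming numeric columns:
import pandas as pd

df = pd.read_csv('filename.csv')
for name, column in df.items():
    # each 'column' is a full Series, so per-column statistics work directly
    print(name, column.quantile(0.25), column.quantile(0.75))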

Pandas loc gives different values for same filters

The number of rows in my data frame for similar filters is different, and I cannot figure out why. Here is my code -
import numpy as np
import pandas as pd
df = pd.read_csv("Automobile_price_data_clean-f18.csv")
df
df.loc[(df['body-style']== 'hatchback') & df['city-mpg']]
a = df.loc[(df['body-style']== 'hatchback') & df['city-mpg']]
foo_1 = a.count()
b = df.loc[(df['body-style']== 'hatchback')]
foo_2 = b.count()
foo_1 == foo_2
Here is my data - https://paste.pythondiscord.com/apizixigay.apache
The two queries are simply not the same.
a = df.loc[(df['body-style']== 'hatchback') & df['city-mpg']]  # also incorporates city-mpg, and is hence more restrictive
Here the boolean mask is bitwise-ANDed with the raw integer city-mpg column, which is not a meaningful boolean filter on mpg, so rows are dropped wherever that AND comes out false. To check further, try:
a.shape versus b.shape
and
a['city-mpg'].nunique() versus b['city-mpg'].nunique()
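If the intent was to also filter on city-mpg, the comparison needs to be explicit. A sketch, with 30 as a purely illustrative threshold:
a = df.loc[(df['body-style'] == 'hatchback') & (df['city-mpg'] > 30)]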

Trouble importing Excel fields into Python via Pandas - index out of bounds error

I'm not sure what happened; my code worked earlier today, but now it won't. I have an Excel spreadsheet of projects I want to individually import and put into lists. However, I'm getting an "IndexError: index 8 is out of bounds for axis 0 with size 8" error, and Google searches have not resolved this for me. Any help is appreciated. I have the following fields in my Excel sheet: id, funding_end, keywords, pi, summaryurl, htmlabstract, abstract, project_num, title. Not sure what I'm missing...
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx', encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
cols = [0,1,2,3,4,5,6,7,8]
df = df[df.columns[cols]]
# collector lists (defined here so the snippet runs on its own)
allenddates, allkeywords, allpis, allsummaryurls = [], [], [], []
allhtmlabstracts, allabstracts, allprojectnums, alltitles = [], [], [], []
tt = df['funding_end'] = df['funding_end'].astype(str)
tt = df.funding_end.tolist()
for t in tt:
    allenddates.append(t)
bb = df['keywords'] = df['keywords'].astype(str)
bb = df.keywords.tolist()
for b in bb:
    allkeywords.append(b)
uu = df['pi'] = df['pi'].astype(str)
uu = df.pi.tolist()
for u in uu:
    allpis.append(u)
vv = df['summaryurl'] = df['summaryurl'].astype(str)
vv = df.summaryurl.tolist()
for v in vv:
    allsummaryurls.append(v)
ww = df['htmlabstract'] = df['htmlabstract'].astype(str)
ww = df.htmlabstract.tolist()
for w in ww:
    allhtmlabstracts.append(w)
xx = df['abstract'] = df['abstract'].astype(str)
xx = df.abstract.tolist()
for x in xx:
    allabstracts.append(x)
yy = df['project_num'] = df['project_num'].astype(str)
yy = df.project_num.tolist()
for y in yy:
    allprojectnums.append(y)
zz = df['title'] = df['title'].astype(str)
zz = df.title.tolist()
for z in zz:
    alltitles.append(z)
"IndexError: index 8 is out of bounds for axis 0 with size 8"
cols = [0,1,2,3,4,5,6,7,8]
should be cols = [0,1,2,3,4,5,6,7].
I think you have 8 columns, but your cols list contains 9 column indices.
An "IndexError: index out of bounds" means you're trying to access or insert something beyond the limit of its range.
Whenever you load a file such as test.xls, test.csv, or test.xlsx using pandas, for example:
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
it is a good idea to first find the number of columns of the DataFrame; this helps you move forward when working with large data sets, e.g.:
import pandas as pd
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
data_frames = pd.DataFrame(data_set)
print("Length of Columns:", len(data_frames.columns))
This gives you the exact number of columns of the spreadsheet. Then you can specify the column indices accordingly:
Length of Columns: 8
cols = [0, 1, 2, 3, 4, 5, 6, 7]
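If you would rather not hard-code the list at all, it can be built from the column count itself (a small sketch using the names above):
cols = list(range(len(data_frames.columns)))  # always matches the actual width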
I agree with @Bill CX that it sounds like you're trying to access a column that doesn't exist. Although I cannot reproduce your error, I have some ideas that may help you move forward.
First, double check the shape of your data frame:
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
print(df.shape) # print shape of data read in to python
The output should be
(X, 9) # "X" is the number of rows
If the data frame has 8 columns, then df.shape will be (X, 8). This could be why you are getting the error.
Another check for you is to print out the first few rows of your data frame.
print(df.head())
This will let you double-check to see if you have read in the data in the correct form. I'm not sure, but it might be possible that your .xlsx file has 9 columns, but pandas is reading in only 8 of them.
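You could also print the column names themselves to see exactly which ones pandas picked up:
print(df.columns.tolist())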

Using a PeriodIndex to slice a pandas series

I have a few pandas series with PeriodIndex of varying frequency. I'd like to filter these based on another PeriodIndex of which the frequency is in principle unknown (specified directly in the example below as selectionA or selectionB, but in practice stripped from another series).
I've found 3 approaches, each with its own downside, shown in the example below. Is there a better way?
import numpy as np
import pandas as pd
y = pd.Series(np.random.random(4), index=pd.period_range('2018', '2021', freq='A'), name='speed')
q = pd.Series(np.random.random(16), index=pd.period_range('2018Q1', '2021Q4', freq='Q'), name='speed')
m = pd.Series(np.random.random(48), index=pd.period_range('2018-01', '2021-12', freq='M'), name='speed')
selectionA = pd.period_range('2018Q3', '2020Q2', freq='Q') #subset of y, q, and m
selectionB = pd.period_range('2014Q3', '2015Q2', freq='Q') #not subset of y, q, and m
#Comparing some options:
#1: filter method
#2: slicing
#3: selection based on boolean comparison
#1: problem when frequencies unequal: always returns empty series
yA_1 = y.filter(selectionA, axis=0) #Fail: empty series
qA_1 = q.filter(selectionA, axis=0)
mA_1 = m.filter(selectionA, axis=0) #Fail: empty series
yB_1 = y.filter(selectionB, axis=0)
qB_1 = q.filter(selectionB, axis=0)
mB_1 = m.filter(selectionB, axis=0)
#2: problem when frequencies unequal: wrong selection and error instead of empty result
yA_2 = y[selectionA[0]:selectionA[-1]]
qA_2 = q[selectionA[0]:selectionA[-1]]
mA_2 = m[selectionA[0]:selectionA[-1]] #Fail: selects 22 months instead of 24
yB_2 = y[selectionB[0]:selectionB[-1]] #Fail: error
qB_2 = q[selectionB[0]:selectionB[-1]]
mB_2 = m[selectionB[0]:selectionB[-1]] #Fail: error
#3: works, but very verbose
yA_3 = y[(y.index >= selectionA[0].start_time) & (y.index <= selectionA[-1].end_time)]
qA_3 = q[(q.index >= selectionA[0].start_time) & (q.index <= selectionA[-1].end_time)]
mA_3 = m[(m.index >= selectionA[0].start_time) & (m.index <= selectionA[-1].end_time)]
yB_3 = y[(y.index >= selectionB[0].start_time) & (y.index <= selectionB[-1].end_time)]
qB_3 = q[(q.index >= selectionB[0].start_time) & (q.index <= selectionB[-1].end_time)]
mB_3 = m[(m.index >= selectionB[0].start_time) & (m.index <= selectionB[-1].end_time)]
Many thanks
I've solved it by adding start_time and end_time to the slice range:
yA_2fixed = y[selectionA[0].start_time: selectionA[-1].end_time]
qA_2fixed = q[selectionA[0].start_time: selectionA[-1].end_time]
mA_2fixed = m[selectionA[0].start_time: selectionA[-1].end_time] #now has 24 rows
yB_2fixed = y[selectionB[0].start_time: selectionB[-1].end_time] #doesn't fail; returns empty series
qB_2fixed = q[selectionB[0].start_time: selectionB[-1].end_time]
mB_2fixed = m[selectionB[0].start_time: selectionB[-1].end_time] #doesn't fail; returns empty series
But if there's a more concise way to write this, I'm still all ears. I especially would like to know if it's possible to do this filtering in a way that is more 'native' to the PeriodIndex, i.e., not converting it into datetime instances first with the start_time and end_time attributes.
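One way to cut the repetition, though it is no more PeriodIndex-native, is to wrap the fixed slice in a small helper (a sketch assuming a non-empty selection):
def filter_by_selection(s, selection):
    # Keep the part of s whose periods fall inside the overall time span
    # covered by the PeriodIndex 'selection'.
    return s[selection[0].start_time: selection[-1].end_time]

yA = filter_by_selection(y, selectionA)
mB = filter_by_selection(m, selectionB)  # empty series instead of an error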

Writing to a csv using pandas with filters

I'm using the pandas library to load in a csv file using Python.
import pandas as pd
df = pd.read_csv("movies.csv")
I'm then checking the columns for specific values or statements, such as:
viewNum = df["views"] >= 1000
starringActorNum = df["starring"] > 3
df["title"] = df["title"].astype("str")
titleLen = df["title"].str.len() <= 10
I want to create a new csv file using the criteria above, but am unsure how to do that as well as how to combine all those attributes into one csv.
Anyone have any ideas?
Combine the boolean masks using & (bitwise-and):
mask = viewNum & starringActorNum & titleLen
Select the rows of df where mask is True:
df_filtered = df.loc[mask]
Write the DataFrame to a csv:
df_filtered.to_csv('movies-filtered.csv')
import pandas as pd
df = pd.read_csv("movies.csv")
viewNum = df["views"] >= 1000
starringActorNum = df["starring"] > 3
df["title"] = df["title"].astype("str")
titleLen = df["title"].str.len() <= 10
mask = viewNum & starringActorNum & titleLen
df_filtered = df.loc[mask]
df_filtered.to_csv('movies-filtered.csv')
You can use the pandas.DataFrame.query() interface. It allows text-string queries, and is very fast for large data sets.
Something like this should work:
import pandas as pd
df = pd.read_csv("movies.csv")
# str.len() is not available inside query(), so pre-calculate the lengths
title_len = df["title"].str.len()
# filter the frame and send it to a csv file; the @ prefix is how
# query() references the local variable title_len
df.query('views >= 1000 and starring > 3 and @title_len <= 10').to_csv(...)
