I want to loop over 2 columns in a specific dataframe and I want to access the data by the name of the column but it gives me this error (type error) on line 3
i=0
for name,value in df.iteritems():
q1=df[name].quantile(0.25)
q3=df[name].quantile(0.75)
IQR=q3-q1
min=q1-1.5*IQR
max=q3+1.5*IQR
minout=df[df[name]<min]
maxout=df[df[name]>max]
new_df=df[(df[name]<max) & (df[name]>min)]
i+=1
if i==2:
break
It looks like you want to exclude outliers based on the 1.5*IQR rule. Here is a simpler solution:
Input dummy data:
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'col%s' % (i+1): np.random.normal(size=1000)
for i in range(4)})
Removing the outliers (keep data: Q1-1.5IQR < data < Q3+1.5IQR):
Q1 = df.iloc[:, :2].quantile(.25)
Q3 = df.iloc[:, :2].quantile(.75)
IQR = Q3-Q1
non_outliers = (df.iloc[:, :2] > Q1-1.5*IQR) & (df.iloc[:, :2] < Q3+1.5*IQR)
new_df = df[non_outliers.all(axis=1)]
output:
Type error might happen for a lot of reasons so it will be better if you add part of the DF to try to understand the issue.
Also to loop over columns you can also use the iterrows() function:
import pandas as pd
df = pd.read_csv('filename.csv')
for _, content in df.iterrows():
print(content['columnname']) #add the name of the columns you want to loop over
refer to the following link for more information
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows
Related
I'd like to append consistently empty rows in my dataframe.
I have following code what does what I want but I'm struggling in adjusting it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so I add two or more rows instead of one?
How can I set a starting point (e.g. it should start with row 5 -- Do I have to use .loc then in arange?)
Also tried this code but I was struggling in setting the starting row and the values to blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
df_new = df_new.append(row)
for _ in range(2):
df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
import numpy as np
v = np.ndarray(shape=(numberOfRowsYouWant,df.values.shape[1]), dtype=object)
v[:] = ""
pd.DataFrame(np.vstack((df.values, v)))
I think you can use NumPy
but, if you want to use your manner, simply convert NaN to "":
df.fillna("")
Following my previous question, now i'm trying to put data in a table and convert it to an excel file but i can't get the table i want, if anyone can help or explain what's the cause of it, this is the final output i want to get
this the data i'm printing
Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}
and here is my code
import pandas as pd
import ast
s="Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}"
ds = []
for l in s.splitlines():
d = l.split("-")
if len(d) > 1:
df = pd.DataFrame(ast.literal_eval(d[1].strip()))
ds.append(df)
for df in ds:
df.reset_index(drop=True, inplace=True)
df = pd.concat(ds, axis= 1)
cols = df.columns
cols = [((col.split('.')[0], col)) for col in df.columns]
df.columns=pd.MultiIndex.from_tuples(cols)
print(df.T)
df.to_excel("v.xlsx")
but this is what i get
How can i solve the probleme please this the final and most important part and thank you in advance.
Within the for loop, the value "Hotel1 : chambre double" is held in d[0]
(try it by yourself by printing d[0].)
In your previous question, the "Name3" column was built by the following line of code:
cols = [((col.split('.')[0], col)) for col in df.columns]
Now, to save "Hotel1 : chambre double", you need to access it within the first for loop.
import pandas as pd
import ast
s="Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}"
ds = []
cols = []
for l in s.splitlines():
d = l.split("-")
if len(d) > 1:
df = pd.DataFrame(ast.literal_eval(d[1].strip()))
ds.append(df)
cols2 = df.columns
cols = [((d[0], col)) for col in df.columns]
for df in ds:
df.reset_index(drop=True, inplace=True)
df = pd.concat(ds, axis= 1)
df.columns=pd.MultiIndex.from_tuples(cols)
print(df.T)
df.T.to_csv(r"v.csv")
This works, because you are taking the d[0] (hotel name) within the for loop, and creating tuples for your column names whilst you have access to that object.
you then create a multi index column in the line of code you already had, outside the loop:
df.columns=pd.MultiIndex.from_tuples(cols)
Finally, to answer the output to excel query you had, please add the following line of code at the bottom:
df.T.to_csv(r"v.csv")
I currently have the following wikipedia scraper:
import wikipedia as wp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Wikipedia __scraper__
wiki_page = 'Climate_of_Italy'
html = wp.page(wiki_page).html().replace(u'\u2212', '-')
def dataframe_cleaning(table_number: int):
global html
df = pd.read_html(html, encoding='utf-8')[table_number]
df.drop(np.arange(5, len(df.index)), inplace=True)
df.columns = df.columns.droplevel()
df.drop('Year', axis=1, inplace=True)
find = '\((.*?)\)'
for i, column in enumerate(df.columns):
if i>0:
df[column] = (df[column]
.str.findall(find)
.map(lambda x: np.round((float(x[0])-32)* (5/9), 2)))
return df
potenza_df = dataframe_cleaning(3)
milan_df = dataframe_cleaning(4)
florence_df = dataframe_cleaning(6)
italy_df = pd.concat((potenza_df, milan_df, florence_df))
Produces the following DataFrame:
As you may see I have concatenated the DataFrames, which result in a number of repeating lines. Using the groupby I want to filter all of these to be in a single DataFrame and using .agg method I want to ensure that there would application of min, max, mean. The issue that I am facing is inability to apply .agg method on row by row. I know it is a very simple question, but I've been looking through documentation and sadly cannot figure it out.
Thank you for your help in advance.
P.S. sorry if it is a repeated question post, but I was unable to find similar solution.
EDIT:
Added desired output (NOTE: was done on excel)
Just a quick update, I was able to achieve my desired goal, however I was not able to find a good resolution to it.
concat_df = pd.concat((potenza_df, milan_df, florence_df))
italy_df = pd.DataFrame()
for i, index in enumerate(list(set(concat_df['Month']))):
if i == 0:
temp_df = concat_df[concat_df['Month'] == index]
temp_df = temp_df.groupby('Month').agg(np.max)
if i in range(1, 4):
temp_df = concat_df[concat_df['Month'] == index]
temp_df = temp_df.groupby('Month').agg(np.mean)
if i == 4:
temp_df = concat_df[concat_df['Month'] == index]
temp_df = temp_df.groupby('Month').agg(np.min)
italy_df = italy_df.append(temp_df)
italy_df = italy_df.apply(lambda x: np.round(x, 2))
italy_df
The following code achieves the desired result, however, it is highly dependent on the user's manual configuration:
I have a dataframe like this in a .csv:
Consequence,N_samples
A,227
B,413
C,194
D,1
E,1610
F,10
G,7
H,1
I,1
J,5
K,1
L,5
M,5
N,30
O,7
P,3
And I want to make a plot pie out of it, but grouping all values lower than 150 into "Other" category. I've tried running this code but it's not working.
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plotother = {'Consequence' : 'Other', 'N_samples':0}
df=pd.read_csv('df.csv', sep=',')
df = df.append(other,ignore_index=True)
for i in df:
if (x in df['N_samples']) < 150:
df['N_samples'].iloc[-1]=df['N_samples'].iloc[-1] + (x in df['N_samples'])
df.drop([x])
df.plot.pie(label="", title="Consequence", startangle=90);
plot.savefig('Consequence.svg')
Once I run it I get the following error:
KeyError: "['Consequence'] not found in axis"
I would really appreciate any help.
You are making it more difficult than it is.
First get all the rows, where sample size is below 150:
small_sizes = df[df['N_Samples']<150]
The sum up their values:
other_samples = small_sizes['N_Samples'].sum()
Finally drop the rows and add the other row:
df = df[~df['N_Samples']<150]
df.loc['other','N_samples'] = other_samples
That should do the trick.
you can do this as follows:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv('df.csv')
collect the rows <150 into a new df:
df_other=pd.DataFrame([{'Consequence':'Other','N_samples':df[df.N_samples<150].N_samples.sum()}])
add that to the rows >= 150 and plot
df2=df[df.N_samples>=150]
df3=pd.concat([df2,df_other],axis=0)
df3.plot.pie(y='N_samples',labels=df3['Consequence'])
plt.show()
if you find yourself iterating thru a dataframe, be aware there's often a built-in way to do whatever you're trying to do.
Define your filtering condition:
cond = df.N_samples < 150
Sum values from filtering condition:
other_sum = df.N_samples[cond].sum()
Filter by opposite to condition and add 'other' row at the bottom in the same line:
df = df.loc[~cond].append({'Consequence': 'other', 'N_samples': other_sum}, ignore_index=True)
Sorry if this has been asked before -- I couldn't find this specific question.
In python, I'd like to subtract every even column from the previous odd column:
so go from:
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113
to
101.849 110.349 68.513
109.95 110.912 61.274
100.612 110.05 62.15
107.75 118.687 59.712
There will be an unknown number of columns. should I use something in pandas or numpy?
Thanks in advance.
You can accomplish this using pandas. You can select the even- and odd-indexed columns separately and then subtract them.
#hiro protagonist, I didn't know you could do that StringIO magic. That's spicy.
import pandas as pd
import io
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
df = pd.read_csv(data, sep='\s+')
Note that the even/odd terms may be counterintuitive because python is 0-indexed, meaning that the signal columns are actually even-indexed and the background columns odd-indexed. If I understand your question properly, this is contrary to your use of the even/odd terminology. Just pointing out the difference to avoid confusion.
# strip the columns into their appropriate signal or background groups
bg_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 1]]
signal_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 0]]
# subtract the values of the data frames and store the results in a new data frame
result_df = pd.DataFrame(signal_df.values - bg_df.values)
result_df contains columns which are the difference between the signal and background columns. You probably want to rename these column names, though.
>>> result_df
0 1 2
0 101.849 110.349 68.513
1 109.950 110.912 61.274
2 100.612 110.050 62.150
3 107.750 118.687 59.712
import io
# faking the data file
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
header = next(data) # read the first line from data
# print(header[:-1])
for line in data:
# print(line)
floats = [float(val) for val in line.split()] # create a list of floats
for prev, cur in zip(floats[::2], floats[1::2]):
print('{:6.3f}'.format(prev-cur), end=' ')
print()
with output:
101.849 110.349 68.513
109.950 110.912 61.274
100.612 110.050 62.150
107.750 118.687 59.712
if you know what data[start:stop:step] means and how zip works this should be easily understood.