I currently have the following Wikipedia scraper:
import wikipedia as wp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Wikipedia scraper
wiki_page = 'Climate_of_Italy'
# fetch the page HTML and normalise the Unicode minus sign so numbers parse
html = wp.page(wiki_page).html().replace(u'\u2212', '-')
def dataframe_cleaning(table_number: int) -> pd.DataFrame:
    df = pd.read_html(html, encoding='utf-8')[table_number]
    # keep only the first five statistic rows
    df.drop(np.arange(5, len(df.index)), inplace=True)
    # flatten the header and drop the yearly aggregate column
    df.columns = df.columns.droplevel()
    df.drop('Year', axis=1, inplace=True)
    # pull the parenthesised Fahrenheit value and convert it to Celsius
    find = r'\((.*?)\)'
    for i, column in enumerate(df.columns):
        if i > 0:
            df[column] = (df[column]
                          .str.findall(find)
                          .map(lambda x: np.round((float(x[0]) - 32) * (5 / 9), 2)))
    return df
potenza_df = dataframe_cleaning(3)
milan_df = dataframe_cleaning(4)
florence_df = dataframe_cleaning(6)
italy_df = pd.concat((potenza_df, milan_df, florence_df))
This produces a DataFrame of monthly temperature statistics for each city.
As you can see, I have concatenated the DataFrames, which results in a number of repeated rows. Using groupby I want to collapse these into a single DataFrame, and using the .agg method I want to apply min, max, and mean. The issue I am facing is that I am unable to apply .agg row by row. I know it is a very simple question, but I've been looking through the documentation and sadly cannot figure it out.
Thank you for your help in advance.
P.S. Sorry if this is a repeated question, but I was unable to find a similar solution.
EDIT:
Added the desired output (NOTE: it was mocked up in Excel).
Just a quick update: I was able to achieve my desired goal, but I was not able to find a clean way to do it.
concat_df = pd.concat((potenza_df, milan_df, florence_df))
italy_df = pd.DataFrame()
for i, index in enumerate(list(set(concat_df['Month']))):
    if i == 0:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.max)
    if i in range(1, 4):
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.mean)
    if i == 4:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.min)
    italy_df = italy_df.append(temp_df)
italy_df = italy_df.apply(lambda x: np.round(x, 2))
italy_df
The code above achieves the desired result; however, it is highly dependent on manual configuration (the iteration order of the set, and the hard-coded positions of the max/mean/min rows).
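A less fragile sketch, assuming the 'Month' column holds the statistic labels such as 'Record high' and 'Record low' (the exact labels are an assumption here, so adjust pick_agg to your data), picks the aggregation from the label itself rather than from a loop counter:

def pick_agg(label):
    # hypothetical label-to-aggregation mapping; adjust to the real row labels
    low = label.lower()
    if 'record high' in low:
        return 'max'
    if 'record low' in low:
        return 'min'
    return 'mean'

concat_df = pd.concat((potenza_df, milan_df, florence_df))
parts = [grp.groupby('Month').agg(pick_agg(label))
         for label, grp in concat_df.groupby('Month')]
# restore the original row order and round to two decimals
italy_df = pd.concat(parts).reindex(concat_df['Month'].unique()).round(2)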
Related
I'd like to insert empty rows at consistent intervals in my dataframe.
I have the following code, which does roughly what I want, but I'm struggling to adjust it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so that I add two or more rows instead of one?
How can I set a starting point (e.g. it should start with row 5; do I have to use .loc with arange then)?
I also tried this code, but I struggled to set the starting row and to make the values blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
    df_new = df_new.append(row)
    for _ in range(2):
        df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
I think you can use NumPy:
import numpy as np
v = np.ndarray(shape=(numberOfRowsYouWant, df.values.shape[1]), dtype=object)
v[:] = ""
pd.DataFrame(np.vstack((df.values, v)))
But if you want to keep your approach, simply convert the NaN values to "":
df.fillna("")
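For the two remaining sub-questions, here is a sketch that builds the output by slicing and concatenating instead of using groupby (the starting row of 5, the block size of 4, and the two blank rows are assumptions taken from the question, under one reading of "start with row 5"):

import pandas as pd

start, step = 5, 4          # assumed: first break after row 5, then every 4 rows
blank = pd.DataFrame('', index=range(2), columns=data_only_trades.columns)

pieces = [data_only_trades.iloc[:start]]
for k in range(start, len(data_only_trades), step):
    # two blank string rows (not NaN) before each subsequent block
    pieces += [blank, data_only_trades.iloc[k:k + step]]
out = pd.concat(pieces, ignore_index=True)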
I need to iterate over a dataframe; for each row I need to create an ID based on two existing columns, name and sex, and eventually add this new column to the df.
df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    if (index % 1000) == 0:
        print("Row node index: {}".format(str(index)))
    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)
df['id'] = row_ids
Is there a way to make it much faster without going row by row?
Adding more info based on the suggested solutions:
Use apply instead:
def func(x):
    # x.name is the row index when applying with axis=1
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(str(x.name)))
    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id

df['id'] = df.apply(func, axis=1)
If you are working on a large dataset, np.vectorize() should help bypass the apply() overhead, which should make it a bit faster.
import numpy as np
v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])
Edit:
To get even more of a speedup, pass the function get_id directly instead of using a lambda, and pass df[...].values arrays instead of the df[...] Series.
v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
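A quick usage sketch with a hypothetical get_id (an assumption purely for illustration; substitute your real function):

import numpy as np
import pandas as pd

def get_id(name, sex):
    # hypothetical ID rule, only for demonstration
    return f"{name[:3].upper()}-{sex[0].upper()}"

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'sex': ['female', 'male']})
v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
print(df)
#     name     sex     id
# 0  Alice  female  ALI-F
# 1    Bob    male  BOB-M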
Instead of printing updates about the progression through the process, try using tqdm to show a progress bar.
import numpy as np
from tqdm import tqdm

@np.vectorize
def get_id(name, sex):
    global pbar
    ...
    pbar.update(1)
    ...
    return

with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)
Following my previous question, I'm now trying to put the data in a table and convert it to an Excel file, but I can't get the table I want. If anyone can help or explain what the cause is, this is the final output I want to get.
This is the data I'm printing:
Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}
and here is my code:
import pandas as pd
import ast

s = "Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}"
ds = []
for l in s.splitlines():
    d = l.split("-")
    if len(d) > 1:
        df = pd.DataFrame(ast.literal_eval(d[1].strip()))
        ds.append(df)
for df in ds:
    df.reset_index(drop=True, inplace=True)
df = pd.concat(ds, axis=1)
cols = [(col.split('.')[0], col) for col in df.columns]
df.columns = pd.MultiIndex.from_tuples(cols)
print(df.T)
df.to_excel("v.xlsx")
but this is what I get instead.
How can I solve this problem, please? This is the final and most important part. Thank you in advance.
Within the for loop, the value "Hotel1 : chambre double" is held in d[0]
(try it yourself by printing d[0]).
In your previous question, the "Name3" column was built by the following line of code:
cols = [(col.split('.')[0], col) for col in df.columns]
Now, to save "Hotel1 : chambre double", you need to access it within the first for loop.
import pandas as pd
import ast

s = "Hotel1 : chambre double - {'lpd': ('112', '90','10'), 'pc': ('200', '140','10')}"
ds = []
cols = []
for l in s.splitlines():
    d = l.split("-")
    if len(d) > 1:
        df = pd.DataFrame(ast.literal_eval(d[1].strip()))
        ds.append(df)
        # build the MultiIndex tuples while the hotel name (d[0]) is in scope
        cols = [(d[0], col) for col in df.columns]
for df in ds:
    df.reset_index(drop=True, inplace=True)
df = pd.concat(ds, axis=1)
df.columns = pd.MultiIndex.from_tuples(cols)
print(df.T)
df.T.to_csv(r"v.csv")
This works because you take d[0] (the hotel name) within the for loop and create the tuples for your column names while you still have access to that object.
You then create the MultiIndex columns, outside the loop, with the line of code you already had:
df.columns = pd.MultiIndex.from_tuples(cols)
Finally, to answer the file-output part of your question, add the following line of code at the bottom:
df.T.to_csv(r"v.csv")
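If you specifically need an Excel file rather than a CSV, the same transposed frame should also write via to_excel (this assumes the openpyxl package is installed, which pandas uses for .xlsx output):

df.T.to_excel("v.xlsx")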
I want to loop over 2 columns in a specific dataframe and access the data by the name of the column, but it gives me this error (TypeError) on line 3:
i = 0
for name, value in df.iteritems():
    q1 = df[name].quantile(0.25)
    q3 = df[name].quantile(0.75)
    IQR = q3 - q1
    min = q1 - 1.5*IQR
    max = q3 + 1.5*IQR
    minout = df[df[name] < min]
    maxout = df[df[name] > max]
    new_df = df[(df[name] < max) & (df[name] > min)]
    i += 1
    if i == 2:
        break
It looks like you want to exclude outliers based on the 1.5*IQR rule. Here is a simpler solution:
Input dummy data:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'col%s' % (i+1): np.random.normal(size=1000)
                   for i in range(4)})
Removing the outliers (keep data where Q1 - 1.5*IQR < data < Q3 + 1.5*IQR):
Q1 = df.iloc[:, :2].quantile(.25)
Q3 = df.iloc[:, :2].quantile(.75)
IQR = Q3-Q1
non_outliers = (df.iloc[:, :2] > Q1-1.5*IQR) & (df.iloc[:, :2] < Q3+1.5*IQR)
new_df = df[non_outliers.all(axis=1)]
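If the two columns are known by name rather than by position, the same pattern works with an explicit column list (col1 and col2 are placeholders; use your real column names):

cols = ['col1', 'col2']
sub = df[cols]
Q1 = sub.quantile(.25)
Q3 = sub.quantile(.75)
IQR = Q3 - Q1
new_df = df[((sub > Q1 - 1.5*IQR) & (sub < Q3 + 1.5*IQR)).all(axis=1)]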
A TypeError can happen for a lot of reasons, so it would be better if you added part of the DataFrame so we can understand the issue.
Also, you can use the iterrows() function to loop over the rows and access each column by name:
import pandas as pd

df = pd.read_csv('filename.csv')
for _, content in df.iterrows():
    print(content['columnname'])  # put the name of the column you want to read here
Refer to the following link for more information:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows
I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate == p.RefDate) & (r.Item == p.Item) & (p.StartDate >= r.PeriodStartDate) & (p.EndDate <= r.PeriodEndDate).
Dummy data for this can be generated as below:
import pandas as pd
import numpy as np
from datetime import datetime

######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01', '2019-12-31')

p = pd.DataFrame(columns=['RefDate', 'Item', 'StartDate', 'EndDate', 'Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)

r = pd.DataFrame(columns=['RefDate', 'Item', 'PeriodStartDate', 'PeriodEndDate', 'AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019,10,31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True, inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently have, and whose performance I would like to improve, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that, when generating the r DataFrame, both PeriodStartDate and PeriodEndDate should be created as datetime; see the following fragment of your initialization code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item': item,
                   'PeriodStartDate': pd.to_datetime('2019-10-25'),
                   'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item (the two columns compared on equality) and sorted by index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
Then I defined the following function, which computes the mean of the rows from p "related to" the current row from r:
def myMean(row):
    # rows in p that share this row's (RefDate, Item) index
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing left to do is to apply this function (to each row in r) and save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine, and my version ran almost 10 times faster.
Check it on your own.
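For reference, this is roughly how the timing can be reproduced in IPython/Jupyter (a sketch; %timeit is an IPython magic, not plain Python, and it assumes p and r have already been indexed and sorted as above):

%timeit r.apply(myMean, axis=1)   # the indexed + apply version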
By using iterrows I managed to improve the performance, although there may still be quicker ways:
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price
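Another option worth trying is to drop the per-row loop entirely with a single merge (a sketch, assuming the datetime fix above and the original, unindexed p and r; note that the intermediate merge can use a lot of memory on large frames):

import pandas as pd

# join every r row to the p rows sharing its RefDate and Item
m = r.merge(p, on=['RefDate', 'Item'])
# keep only the p rows whose period lies inside the r period
m = m[(m['StartDate'] >= m['PeriodStartDate']) & (m['EndDate'] <= m['PeriodEndDate'])]
# p was built by appending to an empty frame, so Val may need an explicit cast
m['Val'] = m['Val'].astype(float)
# average Val per (RefDate, Item) and attach it back to r
avg = (m.groupby(['RefDate', 'Item'], as_index=False)['Val']
        .mean().rename(columns={'Val': 'AvgVal'}))
r = r.drop(columns='AvgVal').merge(avg, on=['RefDate', 'Item'], how='left')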