My pandas function is returning "None" instead of the DataFrame I am trying to filter with it. Why is this so, and how can I resolve it? Thank you!
import pandas as pd

nz_data = pd.read_csv('research-and-development-survey-2016-2019-csv.csv', index_col=2)

def count_of_mining_biz():
    if "B_Mining" in nz_data[["Breakdown_category"]] and "Count of businesses" in nz_data[["Units"]]:
        return nz_data.loc["2019", "RD_Value"]

print(count_of_mining_biz())
Here is what the data looks like.
I am trying to find the RD_Value in 2019 for the Mining industry. The reason I have to set a condition on the "Units" column is that there is another type of data in it besides the count of businesses mentioned.
Your function returns None because the if condition is never True, so execution falls off the end of the function and Python implicitly returns None (testing with "in" on a DataFrame such as nz_data[["Breakdown_category"]] checks column labels, not cell values). Separately, .loc[..., ...] means .loc[row_index, col_index], but there is no row index called "2019".
Try using .loc with boolean masks in this case:
def count_of_mining_biz():
    category = nz_data['Breakdown_category'] == 'B_Mining'
    units = nz_data['Units'] == 'Count of businesses'
    year = nz_data['Year'] == 2019
    return nz_data.loc[category & units & year].RD_Value
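If you want a single number rather than a one-row Series, a small follow-up (assuming the three masks match exactly one row) is to squeeze the result:

value = count_of_mining_biz().squeeze()  # length-1 Series -> scalar
print(value)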
I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As there is no missing information for "peak_id", I calculated the median height per peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe holding the median value, which differs depending on the "peak_id" (the value calculated with the code in the question). 2) Have the code check whether the value in highpoint_metres is null; if it is, put the value of the new column there instead. I hope that is clearer.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of Python is very bad ;-)).
I believe this is what you're looking for:
import numpy as np
import pandas as pd

members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")

# For every row, the median highpoint_metres of that row's peak_id
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")

# Where the original value is missing, fall back to the group median
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
So one way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use the apply function (computing the median once, outside the lambda, rather than recomputing it for every row):
median_val = np.median(df[col_name])
df[col_name] = df[col_name].apply(lambda x: median_val if x == 0 else x)
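One caveat (my addition): if the column also contains NaN, np.median propagates it and returns NaN, whereas pandas' own Series.median skips NaN by default:

df[col_name] = df[col_name].replace({0: df[col_name].median()})  # .median() ignores NaN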
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get median is through groupby and then left join it with the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id')
# Note: comparing with == np.nan is always False, so use pd.isna instead,
# and apply row-wise with axis=1:
df['highpoint_metres'] = df.apply(lambda x: x['Median'] if pd.isna(x['highpoint_metres']) else x['highpoint_metres'], axis=1)
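The row-wise apply can also be replaced with a vectorised fillna, a sketch assuming the merged Median column from above:

df['highpoint_metres'] = df['highpoint_metres'].fillna(df['Median'])
df = df.drop(columns='Median')  # drop the helper column once it has been used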
Let me know if this solves your issue
I'm trying to create a new column in a DataFrame that comes from a CSV file. What makes this a little bit tricky is that the values of this new column depend on conditions on other columns of the DataFrame.
The output column depends on the values of the following columns of this dataframe: VaccineCode | Occurrence | VaccineN | firstVaccineDate
So if the condition is met for a specific vaccine, I have to add the respective number of days to the date from the ApplicationDate column, in order to obtain the vaccination date of the second dose.
My code:
import pandas as pd
import datetime
from datetime import timedelta, date, datetime
df = pd.read_csv(path_csv, engine='python', sep=';')
criteria_Astrazeneca = (df.VaccineCode == 85) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_Pfizer = (df.VaccineCode == 86) & (df.Occurrence == 1) & (df.VaccineN == 1)
criteria_CoronaVac = (df.VaccineCode == 87) & (df.Occurrence == 1) & (df.VaccineN == 1)
days_pfizer = 56
days_coronaVac = 28
days_astraZeneca = 84
What I've tried so far:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
This works up to the point where I have to fill the same New_Column with the other results, like this:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df['New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df['New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
Naturally, the problem with this approach is that each statement overwrites the one before it, so I end up with New_Column holding only the results of the last statement. I need a way to put all the results in the same column.
My last try was:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df[criteria_Pfizer].loc[:,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
But it gives the following error:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_column(ilocs[0], value, pi)
Thank you very much @ddejohn, the first link helped me to solve my problem as follows:
df['New_Column'] = df[criteria_CoronaVac].firstVaccineDate + timedelta(days=days_coronaVac)
df.loc[criteria_Pfizer,'New_Column'] = df[criteria_Pfizer].firstVaccineDate + timedelta(days=days_pfizer)
df.loc[criteria_Astrazeneca,'New_Column'] = df[criteria_Astrazeneca].firstVaccineDate + timedelta(days=days_astraZeneca)
That way, the first statement creates the column and fills it at the CoronaVac indexes, and the next ones fill the same column just at their respective indexes.
Problem solved, thanks again.
You could also build the New_Column in a single vectorised pass instead of three separate assignments.
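For example, a sketch with numpy.select (my phrasing of that idea, not the answerer's code), reusing the masks and day offsets defined in the question; it assumes firstVaccineDate is already parsed as datetime64:

import numpy as np
from datetime import timedelta

# np.select picks, per row, the value from the first condition that matches;
# rows matching none of the criteria get NaT.
conditions = [criteria_Astrazeneca, criteria_Pfizer, criteria_CoronaVac]
choices = [
    df.firstVaccineDate + timedelta(days=days_astraZeneca),
    df.firstVaccineDate + timedelta(days=days_pfizer),
    df.firstVaccineDate + timedelta(days=days_coronaVac),
]
df['New_Column'] = np.select(conditions, choices, default=np.datetime64('NaT'))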
I am trying to filter the dataframe conditionally, cycling through the columns and their value lists, but the resulting dataframe is not filtered correctly. I used known filtering approaches for pandas dataframes from SO, such as post 1 and post 2, and I want to parameterize this data selection in a function, but cycling through the columns with a value list is not working correctly. Any idea how to overcome this issue? Any thoughts?
minimal reproducible example:
Here is the minimal reproducible example on gist that I used in my attempt.
my attempt:
I tried this approach and it worked pretty well, but I want to parameterize it in a function.
import pandas as pd
df = pd.read_csv('minimal_df.csv', encoding='utf-8')
df= df[(df['meat_type']=='Beef') & (df['trade_type']=='E') & (df['origin']=='US') & (df['date'] >'2014-01-01') & (df['date'] <'2019-01-01')]
As I said, I want to wrap this up in a data filtering function so I can do something like this:
def data_filter(df, colList, vaList, startDate, endDate):
    for col in colList:
        for val in vaList:
            masker = df[df[col] == val]
            masker.reset_index(drop=True)
            masker = masker.loc[(masker['date'] > startDate) & (masker['date'] < endDate)]
    return masker
columns = ['meat_type', 'temperature','origin']
values = ['Beef', 'Frozen','US']
data_filter(df=df, colList=columns, vaList=values, startDate='2013-12-31', endDate='2019-01-01')
But this attempt doesn't work for me because the resulting dataframe isn't actually filtered. Any idea how to make this work correctly?
How can I make my function even more efficient? For instance, instead of passing lists as parameters, is there a better way to pass parameters so that multiple columns can be selected with value lists? Any thoughts? Thanks.
You can use df.query to achieve this.
First create a query string and pass it to the function.
A sample query string looks like this:
'meat_type=="Beef"&temperature=="Frozen"&origin=="US"&startDate>"2013-12-31"&endDate<"2019-01-01"'
For parameterization, you can use it in two ways:
pass columns and values as list
make columns as parameters using kwargs
The two functions are as below:
def filter_1(df, startDate, endDate, date_colname="date", cols=None, vals=None, inplace=False):
    s = ''
    for i, j in zip(cols, vals):
        s += '{}=="{}"&'.format(i, j)
    s += '{}>"{}"&'.format(date_colname, startDate)
    s += '{}<"{}"'.format(date_colname, endDate)
    return df.query(s, inplace=inplace)
def filter_2(df, startDate, endDate, date_colname="date", inplace=False, **kwargs):
    s = ''
    for i, j in kwargs.items():
        s += '{}=="{}"&'.format(i, j)
    s += '{}>"{}"&'.format(date_colname, startDate)
    s += '{}<"{}"'.format(date_colname, endDate)
    return df.query(s, inplace=inplace)
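One caveat worth adding (not from the original answer): query() parses column names bare, so a column whose name contains spaces or other special characters must be backtick-quoted inside the query string:

df.query('`some column` == "value"')  # hypothetical column name, for illustration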
startDate, endDate = '2013-12-31', '2019-01-01'  # bounds from the question
print(filter_1(df, startDate, endDate, cols=columns, vals=values))
print(filter_2(df, startDate, endDate, meat_type='Beef', temperature='Frozen', origin='US'))
I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each field in that column, I want to select a different proportion of rows.
For instance:
df[df['type of use'] == 'housing'].sample(frac=0.2)
This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining fields in an 'idiomatic' way. I also do not know how to take the result of this sampling and form a new dataframe.
You can make a list of all the unique values in the column with list(df['type of use'].unique()) and iterate as below:
for i in list(df['type of use'].unique()):
    print(df[df['type of use'] == i].sample(frac=0.2))
or
i = 0
while i < len(list(df['type of use'].unique())):
    df1 = df[df['type of use'] == list(df['type of use'].unique())[i]].sample(frac=0.2)
    print(df1.head())
    i = i + 1
For storing you can create a dictionary:
dfs = ['df' + str(x) for x in list(df['type of use'].unique())]
dicdf = dict()
i = 0
while i < len(dfs):
    dicdf[dfs[i]] = df[df['type of use'] == list(df['type of use'].unique())[i]].sample(frac=0.2)
    i = i + 1
print(dicdf)
This will print a dictionary of the dataframes.
You can print whichever one you like to see, for example the housing sample: print(dicdf['dfhousing'])
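For what it's worth, the same dictionary can be built more compactly with a dict comprehension (a sketch under the same assumptions):

dicdf = {
    'df' + str(val): df[df['type of use'] == val].sample(frac=0.2)
    for val in df['type of use'].unique()
}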
Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received to a similar question here. Applying it to your data:
import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?
#First, create a new 'helper' dataframe:
random_state = 41 # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True) #may need this to simplify multi-index dataframe
# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
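As a quick sanity check (my addition, not part of the original answer), the marked fraction in each group should come out near percentage_to_flag:

print(df.groupby("type of use")["marked"].mean())  # roughly 0.2 per group, up to math.ceil rounding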
I am a Java programmer and I am learning python for Data Science and Analysis purposes.
I wish to clean the data in a Dataframe, but I am confused with the pandas logic and syntax.
What I wish to achieve is something like the following Java code:
for (String name : names) {
    if (name == "test") {
        name = "myValue";
    }
}
How can I do it with Python and a pandas dataframe?
I tried the following, but it does not work:
import pandas as pd
import numpy as np

df = pd.read_csv('Dataset V02.csv')
array = df['Order Number'].unique()

# On average, how many items does one order have?
for value in array:
    count = 0
    if df['Order Number'] == value:
        ......
I get an error at df['Order Number'] == value.
How can I identify the specific values and edit them?
In short, I want to:
- Check all the entries of the 'Order Number' column
- Execute an action (for example: replace the value, or count it) each time the record equals a given value (for example, the order code)
Just use the vectorised form for replacement:
df.loc[df['Order Number'] == 'test', 'Order Number'] = 'myValue'
This compares the entire column against a specific value; where the comparison is True, just those rows are replaced with the new value.
For the second part, if doesn't understand boolean arrays; it expects a single scalar truth value. If you're just doing a unique value/frequency count then just do:
df['Order Number'].value_counts()
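A tiny self-contained illustration (the frame below is made up, standing in for your CSV):

import pandas as pd

df = pd.DataFrame({'Order Number': ['test', 'A123', 'test', 'B456']})
df.loc[df['Order Number'] == 'test', 'Order Number'] = 'myValue'
print(df['Order Number'].value_counts())  # myValue 2, A123 1, B456 1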
The code goes this way:
import pandas as pd

df = pd.read_csv("Dataset V02.csv")
array = df['Order Number'].unique()

for value in array:
    count = 0
    if value in df['Order Number'].values:
        .......
You need to use "in" to check for presence; note that plain "in" on a pandas Series tests the index, which is why the check above goes through .values. Did I understand your problem correctly? If I did not, please comment and I will try to understand further.
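As a vectorised alternative (my sketch, not part of the original answer), Series.isin avoids the Python-level loop entirely:

mask = df['Order Number'].isin(array)  # True for rows whose order number is in `array`
print(mask.sum())                      # how many rows matched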