Replace unknown values (with different median values) - python

I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As there is no missing information for "peak_id", I calculated the median height per peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe holding the median value, which differs depending on the "peak_id" (the value calculated with the code below); 2) have the code check whether the value in highpoint_metres is null and, if it is, put the value of the new column there instead. I hope this is clearer.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of python is very bad ;-))

I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# Per-row median of highpoint_metres within each peak_id group
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
# Boolean mask of the rows where highpoint_metres is missing
is_highpoint_missing = np.isnan(members.highpoint_metres)
# Keep the original value where present, otherwise fall back to the group median
members["highpoint_metres_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)

So one way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use the apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x==0 else x)
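One caveat: np.median returns nan if the column itself contains missing values, whereas the pandas method skips them, so a safer variant might be:
df[col_name] = df[col_name].replace({0: df[col_name].median()})  # Series.median() ignores NaN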
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get the median is through groupby, and then left-join the result with the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id', how='left')  # left join keeps every original row
df['highpoint_metres'] = df.apply(lambda row: row['Median'] if pd.isna(row['highpoint_metres']) else row['highpoint_metres'], axis=1)
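A quick sanity check after the fill (note that peaks whose every height is missing will still be NaN):
print(df['highpoint_metres'].isna().sum())  # rows still missing after imputation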
Let me know if this solves your issue

Related

Calculate Gunning-Fog score on excel values

I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score on each row and have the value output to that same excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-key the text into the df variable. However, it does not work when I define the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
The line rfd = df["Item 1A"] returns the whole column (a pandas Series), not a single cell. Passing the Series straight to Readability is what raised the original TypeError, since a Series is not a string, and rfd.to_string() then produces a single display rendering of the entire column (with each value truncated) rather than the full text of each cell, which is why Readability sees too few words.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
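If the column position is not stable, you can also index by name (the name is taken from the question):
print(row["Item 1A"])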
Now that a cell's value is available, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog())
This shows how commands can be chained together; which way you find easier is up to you. Chaining is useful when you give it to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))

Taking first value in a rolling window that is not numeric

This question follows one I previously asked here, which was answered for numeric values.
I now raise this second one, relating to data of Period type.
While the example given below appears simple, I actually have windows of variable size. Since I am interested in the first row of each window, I am looking for a technique that makes use of this definition.
import pandas as pd
from random import seed, randint
# DataFrame
pi1h = pd.period_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in pi1h]
df = pd.DataFrame({'Values' : values, 'Period' : pi1h}, index=pi1h)
# This works (numeric type)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
# This doesn't (Period type)
df['OpeningPeriod'] = df['Period'].rolling(3).agg(lambda rows: rows[0])
Result of the 2nd command:
DataError: No numeric types to aggregate
Any idea? Thanks for any help!
The first row of a rolling window of size 3 is the row 2 rows above the current one, so just use pd.Series.shift(2):
df['OpeningPeriod'] = df['Period'].shift(2)
For the variable size (for the sake of example, I took the Values column as the offset):
import numpy as np

# Position of each window's opening row: current position minus the offset in Values
x = np.arange(len(df)) - df['Values']
# Where that position is in range, look up its Period; out-of-range positions become NaN
df['OpeningPeriod'] = np.where(x.ge(0), df.loc[df.index[x.tolist()], 'Period'], np.nan)
Convert your period[H] to a float
# convert to float
df['Period1'] = df['Period'].dt.to_timestamp().values.astype(float)
# rolling and convert back to period
df['OpeningPeriod'] = pd.to_datetime(
    df['Period1'].rolling(3).agg(lambda rows: rows[0])
).dt.to_period('1h')
# drop column
df = df.drop(columns='Period1')
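Another sketch, assuming positional lookup is acceptable, avoids the float round-trip entirely by rolling over integer row positions and then looking the opening period back up:
import numpy as np
import pandas as pd

# Roll over integer row positions instead of the Period values themselves
positions = pd.Series(np.arange(len(df)), index=df.index)
first_pos = positions.rolling(3).apply(lambda rows: rows.iloc[0])

valid = first_pos.notna()  # incomplete windows yield NaN
df['OpeningPeriod'] = df['Period'].to_numpy()[first_pos.fillna(0).astype(int)]
df.loc[~valid, 'OpeningPeriod'] = None  # blank out the incomplete windows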

How to subtract a percentage from a csv file and then output it into another file? I'd preferably like a formula like x*.10=y

Sorry if I haven't explained things very well; I'm a complete novice, so please feel free to critique.
I've searched everywhere, but I haven't found anything close to subtracting a percentage. When it's done on its own (x - .10 = y) it works wonderfully. The only problem is that I'm trying to make 'x' stand for sample_.csv[0], that is, the numerical value from the first column, from my understanding.
import csv
import numpy as np
import pandas as pd
readdata = csv.reader(open("sample_.csv"))
x = input(sample_.csv[0])
y = input(x * .10)
print(x + y)
The column looks something like this:
"20,a,"
"25,b,"
"35,c,"
"45,d,"
I think you should only need pandas for this task. I'm guessing you want to apply this operation on one column:
import pandas as pd
df = pd.read_csv('sample_.csv') # assuming columns within csv header.
df['new_col'] = df['20,a'] * 1.1 # Faster than adding to a percentage x + 0.1x = 1.1*x
df.to_csv('new_sample.csv', index=False) # Default behavior is to write index, which I personally don't like.
BTW: input is a built-in function in Python that asks the user for input. I'm guessing you don't want this behavior, but I could be wrong.
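If the file has no header row, which the sample in the question suggests, a minimal sketch (the column names here are made up):
import pandas as pd

# header=None treats the first row as data; the names are invented for illustration
df = pd.read_csv('sample_.csv', header=None, names=['amount', 'letter', 'extra'])
df['discounted'] = df['amount'] * 0.9  # subtract 10%
df.to_csv('new_sample.csv', index=False)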
import pandas as pd
df = pd.read_csv("sample_.csv")
df['newcolumn'] = df['column'].apply(lambda x : x * .10)
Please try this.
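Note that x * .10 gives you 10% of x; if the aim is to subtract 10% from each value, multiply by 0.9 instead:
df['newcolumn'] = df['column'] * 0.9  # original value minus 10%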

Python Dataframe

I am a Java programmer and I am learning python for Data Science and Analysis purposes.
I wish to clean the data in a Dataframe, but I am confused with the pandas logic and syntax.
What I wish to achieve is something like the following Java code:
for (String name : names) {
    if (name == "test") {
        name = "myValue";
    }
}
How can I do it with Python and a pandas dataframe?
I tried the following, but it does not work:
import pandas as pd
import numpy as np
df = pd.read_csv('Dataset V02.csv')
array = df['Order Number'].unique()
# On average, how many items does one order have?
for value in array:
    count = 0
    if df['Order Number'] == value:
        ......
I get an error at df['Order Number'] == value.
How can I identify the specific values and edit them?
In short, I want to:
- check all the entries of the 'Order Number' column
- execute an action (for example: replace the value, or count occurrences) each time the record equals a given value (for example, the order code)
Just use the vectorised form for replacement:
df.loc[df['Order Number'] == 'test', 'Order Number'] = 'myValue'
This will compare the entire column against a specific value; where the comparison is True, it will replace just those rows with the new value.
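An equivalent one-liner uses Series.replace (the replacement value is taken from the Java snippet):
df['Order Number'] = df['Order Number'].replace('test', 'myValue')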
For the second part, if doesn't understand boolean arrays; it expects a scalar result. If you're just doing a unique value/frequency count, then just do:
df['Order Number'].value_counts()
The code goes this way:
import pandas as pd
df = pd.read_csv("Dataset V02.csv")
array = df['Order Number'].unique()
for value in array:
    count = 0
    if value in df['Order Number'].values:
        .......
You need to use "in" to check for presence; note that in against a bare Series checks the index labels, so compare against .values. Did I understand your problem correctly? If I did not, please comment and I will try to understand further.

python pandas: how to avoid chained assignment

I have a pandas dataframe with two columns: x and value.
I want to find all the rows where x == 10, and for all these rows set value = 1,000. I tried the code below but I get the warning that
A value is trying to be set on a copy of a slice from a DataFrame.
I understand I can avoid this by using .loc or .ix, but I would first need to find the location or the indices of all the rows which meet my condition of x ==10. Is there a more direct way?
Thanks!
import numpy as np
import pandas as pd
df=pd.DataFrame()
df['x']=np.arange(10,14)
df['value']=np.arange(200,204)
print(df)
df[df['x'] == 10]['value'] = 1000  # this doesn't work (chained assignment)
print(df)
You should use loc so that you operate on the original dataframe rather than a copy; on your example the following will work and not raise a warning:
df.loc[df['x'] == 10, 'value'] = 1000
So the general form is:
df.loc[<mask or index label values>, <optional column>] = < new scalar value or array like>
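The right-hand side can also be array-like, as long as its length matches the number of selected rows; for example (values made up):
df.loc[df['x'] > 11, 'value'] = [999, 1000]  # exactly two rows match x > 11 in this example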
The docs highlight this error, and there is an introduction to indexing; granted, some of the function docs are sparse, so feel free to submit improvements.
