Calculate Gunning-Fog score on excel values - python

I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score on each row and have the value output to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-code the text into the df variable. However, it does not work when I define the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)

TL;DR: You need to pass a single cell's value, but you are currently passing a whole column of cells.
The line rfd = df["Item 1A"] returns the column as a pandas Series, not a string. Passing that Series straight to Readability is what raised the original TypeError: expected string or buffer. Calling rfd.to_string() silences that error, but it renders the entire column as one display string, and pandas truncates each cell to the display width (50 characters by default), which is why Readability then reports fewer than 100 words.
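You can see this for yourself by printing a slice of rfds (reusing the question's variables); the rendered column includes row indices and '...'-truncated cell text, so it contains far fewer than 100 real words:
print(rfds[:200])  # shows the truncated display string, not the full cell text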
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now that you have a single cell's value, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog().score)
This shows how commands can be chained together; whichever way you find easier is up to you. Chaining is useful when you hand the expression to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))

Related

Replace unknown values (with different median values)

I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As there is no missing information for "peak_id", I calculated the median height per peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe holding the median value, which differs depending on the "peak_id" (the value calculated by the code below); 2) have the code check whether the value in highpoint_metres is null and, if it is, put the value from the new column there instead. I hope this is clearer.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of python is very bad ;-))
I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
So one way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use the apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x==0 else x)
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get median is through groupby and then left join it with the original dataframe.
import numpy as np

df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id')
# Note: x == np.nan is always False, so test with np.isnan instead
df['highpoint_metres'] = df.apply(
    lambda x: x['Median'] if np.isnan(x['highpoint_metres']) else x['highpoint_metres'],
    axis=1)
Let me know if this solves your issue

How to run different functions in different parts of a dataframe in python?

I have a dataframe(df).
I need to build the standard deviation dataframe from this one. For the first row I want to use the traditional variance formula:
sum((x - mean(x))**2) / n
and from the second row (row i) onward I want to use the following formula, where "previous row" means row i-1:
lamb*(variance of the previous row) + (1-lamb)*(previous row of returns)**2
# Generate sample dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})
# Generate returns dataframe
returns = df.pct_change()
# Generate new zero-filled dataframe
d = pd.DataFrame(0, index=np.arange(len(returns)), columns=returns.columns)
# Populate first row with each column's ordinary variance
lamb = 0.94
d.iloc[0] = list(returns.var())
Now my question is how to populate the second row through to the end using the second formula.
It should be something like
d[1:].agg(lambda x: lamb*x.shift(-1) + (1-lamb)*returns[:2])
but it obviously returned a long error.
Could you please help?
Could you please help?
for i in range(1, len(d)):
    # Blend the previous row's variance with the previous row's squared return
    d.iloc[i] = lamb * d.iloc[i-1] + (1 - lamb) * returns.iloc[i-1]**2
I'm not completely sure this gives the right answer, but it won't throw an error. A plain for loop with .iloc to walk the rows should do the job for you, provided you plug in the correct formula.
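As an aside, that recursion is an exponentially weighted average of squared returns, which pandas can compute without an explicit loop. A sketch (note the seeding differs: the loop starts from returns.var(), while ewm starts from the first squared return, so treat this only as a starting point):
# ewm(alpha=1 - lamb, adjust=False) implements y_t = lamb*y_{t-1} + (1 - lamb)*x_t;
# shift(1) uses the previous row's return, as the question asks.
d_ewm = (returns.shift(1) ** 2).ewm(alpha=1 - lamb, adjust=False).mean()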

Python Dataframe

I am a Java programmer and I am learning python for Data Science and Analysis purposes.
I wish to clean the data in a Dataframe, but I am confused with the pandas logic and syntax.
What I wish to achieve is something like the following Java code:
for (String name : names) {
    if (name.equals("test")) {
        name = "myValue";
    }
}
How can I do it with Python and a pandas dataframe?
I tried the following, but it does not work:
import pandas as pd
import numpy as np
df = pd.read_csv('Dataset V02.csv')
array = df['Order Number'].unique()
#On average, one order how many items has?
for value in array:
    count = 0
    if df['Order Number'] == value:
        ......
I get an error at df['Order Number'] == value.
How can I identify the specific values and edit them?
In short, I want to:
-Check all the entries of 'Order Number' column
-Execute an action (example: replace the value, or count the value) each time the record is equal to a given value (example, the order code)
Just use the vectorised form for replacement:
df.loc[df['Order Number'] == 'test', 'Order Number'] = 'myValue'
This compares the entire column against a specific value and, where the comparison is True, replaces just those rows with the new value.
For the second part, if doesn't understand boolean arrays; it expects a scalar result. If you're just doing a unique value/frequency count then just do:
df['Order Number'].value_counts()
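Building on that, the comment in your code ("On average, one order how many items has?") needs no loop at all; a sketch:
# Each row is one item, so the mean frequency per unique order number
# is the average number of items per order.
print(df['Order Number'].value_counts().mean())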
The code goes this way:
import pandas as pd

df = pd.read_csv("Dataset V02.csv")
array = df['Order Number'].unique()
for value in array:
    count = 0
    if value in df['Order Number'].values:
        .......
You need to use "in" to check for presence; note that "in" on a bare Series checks the index labels, so test against .values (see the sketch below). Did I understand your problem correctly? If I did not, please comment and I will try to understand further.

Tricky str value replacement within PANDAS DataFrame

Problem Overview:
I am attempting to clean stock data loaded from CSV file into Pandas DataFrame. The indexing operation I perform works. If I call print, I can see the values I want are being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, PANDAS ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The PANDAS documentation suggests using the .replace() method, but that doesn't seem to be working with the operation I'm trying to perform.
Here's a pic of the code and data before and after code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i: j = j.replace('n/a','M')
    elif 'B' in i: j = j.replace('n/a','M')
The problem is that j is a string, thus immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegant, without zip (I simplified your test BTW since it did the same on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: shortest of both
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
now you're replacing in the original dataset.
If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you should be able to set a value
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
Here is a quite exhaustive tutorial on missing values and pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replace as follows.
# overwrites any data in the MarketCapSym column
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'] = 'M'
# only replaces 'n/a'
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'].replace({'n/a': 'M'}, inplace=True)
Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line (plus the re import):
import re
all_exchanges['MarketCapSymbol'] = [re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:, 'MarketCap']]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Initialize the values in the new column to those in 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() on each element, removing the characters matched by [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with PANDAS. Let me know what you think!
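If you want to skip the loop entirely, pandas' vectorised string methods can do the same substitution (a sketch with the same pattern):
# Equivalent extraction with a vectorised string replace
all_exchanges['MarketCapSymbol'] = all_exchanges['MarketCap'].str.replace(
    r'[$.0-9]', '', regex=True)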

Cell value when given column title and value from row

I have an Excel file/CSV that has both column and row titles (row 1 is all titles, column A is all row titles). I was hoping to use DictReader to return the value at the (x, y) coordinate when I supply the column and row.
Eventually, I want to be able to give multiple columns and a single row, and have it combine the value in each given column for that row. But I will start with baby steps, as I currently can't even return the first value I want. Here is a small sample of my Excel file/CSV:
PinName  RF_Switch_TX1   RF_Switch_TX2  RF_Switch_TX3  RF_Switch_TX3_Scope1  RF_Switch_TX3_Scope2
DM_D_0   1255,1266,1311                                1154,1105,
DM_D_1   1256,1266,1311                                1154,1105,
DQS                      1101,1161      1105                                 1153,1105
How can I build a function that, if supplied the pin name "DM_D_1" and the column title "RF_Switch_TX3_Scope1", would return 1154,1105,?
I was hoping to just use DictReader, but do I need to build an iterative function that searches through my file?
Would using Pandas be an acceptable approach? (The initial question mentioned Python, but now it seems not to.) I'm not sure that this is the most idiomatic use of Pandas, but it seems to do what you want to do.
The data
I put this into a CSV file.
PinName,RF_Switch_TX1, RF_Switch_TX2,RF_Switch_TX3,RF_Switch_TX3_Scope1,RF_Switch_TX3_Scope2
DM_D_0,"1255,1266,1311",,,"1154,1105,",
DM_D_1,"1256,1266,1311",,,"1154,1105,",
DQS,,"1101,1161",1105,,"1153,1105"
Some code
from pandas import read_csv
df = read_csv("/Users/igow/Desktop/so_data.csv")
df = df.set_index(['PinName'])
def get_value(row, col):
    return df[col][row]
print(get_value(col='RF_Switch_TX3_Scope1', row='DM_D_1'))
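Since the question mentions eventually supplying several columns for one row and combining their values, here is a hypothetical extension of get_value (the function name and comma-joining are assumptions):
def get_values(row, cols):
    # Look up each requested column for the given row and join the results
    return ",".join(str(df[col][row]) for col in cols)

print(get_values(row='DM_D_1', cols=['RF_Switch_TX1', 'RF_Switch_TX3_Scope1']))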
If you put the data in a CSV or at least specify the delimiter properly, then you can do the following (an IPython session, assuming pandas as pd and io.StringIO are already imported):
In [56]: q = StringIO('''PinName,RF_Switch_TX1, RF_Switch_TX2,RF_Switch_TX3,RF_Switch_TX3_Scope1,RF_Switch_TX3_Scope2
....: DM_D_0,"1255,1266,1311",,,"1154,1105,",
....: DM_D_1,"1256,1266,1311",,,"1154,1105,",
....: DQS,,"1101,1161",1105,,"1153,1105"''')
In [57]: df1 = pd.read_csv(q)
In [58]: df1.loc[df1['PinName'] == 'DM_D_1']['RF_Switch_TX3_Scope1'].values[0]
Out[58]: '1154,1105,'
