Python pandas - Function to get columns with their names

I have an Excel file:
Test_Case Value
Case_1 0.988532846
Case_2 0.829241525
Case_3 0.257209267
Case_4 0.871698313
Case_5 0.63913665
With pandas, I have seen that we can get a column like this:
import pandas as pd
myExcelFile = "data.xlsx"
readExcelFile = pd.read_excel(myExcelFile, sheet_name=0)
testCaseColumn = readExcelFile.Test_Case
Result:
0 Case_1
1 Case_2
2 Case_3
3 Case_4
4 Case_5
The name of the column can change, and I would like to create a function with two arguments to get the column I want:
def getColumn(readExcelFile, columnName):
    return readExcelFile.columnName
I would like to know how I can apply the columnName argument to my readExcelFile parameter.
Thanks for your help

You can use getattr.
def getColumn(readExcelFile, columnName):
    return getattr(readExcelFile, columnName)
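For example, with the DataFrame loaded above (column name taken from the question's sheet):
testCaseColumn = getColumn(readExcelFile, "Test_Case")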

Since your_dataframe.column_name works only with column names without space characters, and you've mentioned that the column name can change, you can access a column with your_dataframe.loc[:, 'column_name'] (see Alexander Cécile's comment).
On the other hand, if your dataset always has the same structure (n columns, the first one with some categorical data, the second one with values, etc.), then you can also access it positionally with your_dataframe.iloc[:, 0], with 0 being your first column-of-interest in the set.
Finally, if you really need a separate function (besides at least the two accessors I've mentioned) which returns exactly the same output, then you may use this:
def get_column(your_dataframe, column_name):
    return your_dataframe.loc[:, column_name]
...which is a highly non-Pythonic way of writing the code (see the Zen of Python).
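A quick sanity check that all three access patterns return the same Series, assuming the column name and ordering from the question's sheet:
col_a = get_column(your_dataframe, 'Test_Case')
col_b = your_dataframe.loc[:, 'Test_Case']
col_c = your_dataframe.iloc[:, 0]  # only if Test_Case is the first column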

Related

Calculate Gunning-Fog score on excel values

I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score on each row and have the value output to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-code the text into the df variable. However, it does not work when I define the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words, respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
The line rfd = df["Item 1A"] returns the whole column as a Series, not the text of a single cell. Passing that Series to Readability is what raised the original TypeError (a Series is not a string), and rfd.to_string() only papered over it by rendering the entire column, index labels included, as one string rather than giving you the per-row text.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The 2 in row.iloc[2] is the positional column index.
Now that you have an individual cell's value, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined into one command:
print(Readability(row.iloc[2]).gunning_fog())
This shows how commands can be chained together - whichever way you find easier is up to you. Chaining is useful when you hand the expression to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))

Find if a column in dataframe has neither nan nor none

I have gone through all the posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them come with None or NaN values. I need help in writing the if-else condition.
If the value in the dataframe column is neither None nor NaN, I need to format the datetime column. My current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index, 'startTime']):
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
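A sketch of wiring is_valid into the original loop (the column name and the date-parsing line are taken from the question; imports shown for completeness):
import math
import re
from datetime import datetime

for index, row in df_with_job_name.iterrows():
    start_time = row['startTime']
    if is_valid(start_time):  # skips None and NaN cells
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))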
Also, note that isna is not invoked with any argument; it returns a DataFrame of Boolean values (see the pandas docs). You can iterate through both DataFrames to determine whether a value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is missing. You tried to pass the individual value you're checking as a second input argument, but isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The Boolean map shares the original frame's index, so null_map_df.loc[index, 'startTime'] lines up cell for cell. You should also be able to apply an any (or a notna) operation to an entire row at once instead of checking cells one by one.
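For completeness, a vectorized sketch that avoids the row loop entirely; this assumes the startTime strings are timestamps that pd.to_datetime can parse directly, and the start_time_formatted column name is hypothetical:
# Mask out missing cells, then parse the rest in one call
valid = df_with_job_name['startTime'].notna()
df_with_job_name.loc[valid, 'start_time_formatted'] = pd.to_datetime(df_with_job_name.loc[valid, 'startTime'])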

How to extract value out of an array of Ordereddicts?

If I have a CSV file where one column contains OrderedDicts, how do I create a new column by extracting a single element of each OrderedDict, using Python (3.+) / pandas (0.18)?
Here's an example. My column, attributes, has BillingPostalCodes hidden in OrderedDicts. All I care about is creating a column with the BillingPostalCodes.
Here's what my data looks like now:
import pandas as pd
from datetime import datetime
import csv
from collections import OrderedDict
df = pd.read_csv('sf_account_sites.csv')
print(df)
yields:
id attributes
1 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
2 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'55555')])
...
I know on an individual level if I do this:
d = OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
print(d['BillingPostalCode'])
I'll get 85020 back as a result.
What do I have to do to get it to look like this?
id zip_codes
1 85020
2 55555
...
Do I have to use an apply function? A for loop? I've tried a lot of different things but I can't get anything to work on the dataframe.
Thanks in advance, and let me know if I need to be more specific.
This took me a while to work out, but the problem is resolved by doing the following:
df.apply(lambda row: row["attributes"]["BillingPostalCode"], axis = 1)
The trick here is to note that axis = 1 forces pandas to iterate through every row, rather than each column (which is the default setting, as seen in the docs).
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index either the
DataFrame’s index (axis=0) or the columns (axis=1). Return type
depends on whether passed function aggregates, or the reduce argument
if the DataFrame is empty.
Parameters:
func : function Function to apply to each column/row
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
From there, it is a simple matter to first extract the relevant column - in this case attributes - and then pull out only the BillingPostalCode.
You'll need to format the resulting DataFrame to have the correct column names.
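Putting the whole thing together with the names from the question (this assumes id is a regular column, as in the printed output, and zip_codes is the requested header):
# Extract the postal code from each row's OrderedDict into a new column
df['zip_codes'] = df.apply(lambda row: row['attributes']['BillingPostalCode'], axis=1)
result = df[['id', 'zip_codes']]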

Pandas DataFrame - Combining one column's values with same index into list

I've been at this issue for a while to no avail. This is almost a duplicate of at least one other question on here, but I can't quite figure out how to do exactly what I'm looking for from the related answers online.
I have a Pandas DataFrame (we'll call it df) that looks something like:
Name Value Value2
'A' '8.8.8.8' 'x'
'B' '6.6.6.6' 'y'
'A' '6.6.6.6' 'x'
'A' '8.8.8.8' 'x'
Where Name is the index. I want to convert this to something that looks like:
Name Value Value2
'A' ['8.8.8.8', '6.6.6.6'] 'x'
'B' ['6.6.6.6'] 'y'
So, basically, every Value that corresponds to the same index should be combined into a list (or a set, or a tuple) and that list made to be the Value for the corresponding index. And, as shown, Value2 is the same between like-indexed rows, so it should just stay the same in the end.
All I've done (successfully) is figure out how to make each element in the Value column into a list with:
df['Value'] = pd.Series([[val] for val in df['Value']])
In the question I linked at the start of this post, the recommended way to combine columns with duplicate indices offers a solution using df.groupby(df.index).sum(). I know I need something besides df.index as an argument to groupby since the Value column is treated as special, and I'm not sure what to put in place of sum() since that's not quite what I'm looking for.
Hopefully it's clear what I'm looking for, let me know if there's anything I can elaborate on. I've also tried simply looping through the DataFrame myself, finding rows with the same index, combining the Values into a list and updating df accordingly. After trying to get this method to work for a bit I thought I'd look for a more Pandas-esque way of handling this problem.
Edit: As a follow up to dermen's answer, that solution kind of worked. The Values did seem to concatenate correctly into a list. One thing I realized was that the unique function returns a Series, as opposed to a DataFrame. Also, I do have more columns in the actual setup than just Name, Value, and Value2. But I think I was able to get around both of the issues successfully with the following:
gb = df.groupby(tuple(df.columns.difference(['Value'])))
result = pd.DataFrame(gb['Value'].unique(), columns=df.columns)
Where the first line gives an argument to groupby of the list of columns minus the Value column, and the second line converts the Series returned by unique into a DataFrame with the same columns as df.
But I think with all of that in place (unless anyone sees an issue with this), almost everything works as intended. There does seem to be something a bit off here, though. When I try to output this to a file with to_csv, there are duplicate headers across the top (but only certain headers are duplicated, and there's no real pattern as to which, as far as I can tell). Also, the Value lists are truncated, which is probably a simpler issue to fix. The CSV output currently looks like:
Name Value Value2 Name Value2
'A' ['8.8.8.8' '7.7.7.7' 'x'
'B' ['6.6.6.6'] 'y'
The above looks weird, but that is exactly how it looks in the output. Note that, contrary to the example presented at the start of this post, there are assumed to be more than 2 Values for A (so that I can illustrate this point). When I do this with the actual data, the Value lists get cut off after the first 4 elements.
I think you are looking to use pandas.Series.unique. First, make the 'Name' index a column:
df
# Value2 Value
#Name
#A x 8.8
#B y 6.6
#A x 6.6
#A x 8.8
df.reset_index(inplace=True)
# Name Value2 Value
#0 A x 8.8
#1 B y 6.6
#2 A x 6.6
#3 A x 8.8
Next, call groupby and then call the unique function on the 'Value' series:
gb = df.groupby(['Name','Value2'])
result = gb['Value'].unique()
result.reset_index(inplace=True) #lastly, reset the index
# Name Value2 Value
#0 A x [8.8, 6.6]
#1 B y [6.6]
Finally, if you want 'Name' as the index again, just do
result.set_index( 'Name', inplace=True)
# Value2 Value
#Name
#A x [8.8, 6.6]
#B y [6.6]
UPDATE
As a follow-up, make sure you re-assign result after resetting the index:
result = gb['Value'].unique()
type(result)
#pandas.core.series.Series
result = result.reset_index()
type(result)
#pandas.core.frame.DataFrame
Saving as CSV (rather, TSV)
You don't want to use CSV here because there are commas in the Value column entries. Rather, save as TSV; you still use the same to_csv method, just change the sep argument:
result.to_csv( 'result.txt', sep='\t')
If I load result.txt in Excel as a TSV, the full lists come through without truncation.
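To read the file back into pandas later, pass the same separator (a sketch; note the lists come back as plain strings unless you parse them):
result2 = pd.read_csv('result.txt', sep='\t', index_col=0)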

How to self-reference column in pandas Data Frame?

In Python's Pandas, I am using the Data Frame as such:
drinks = pandas.read_csv(data_url)
Where data_url is a string URL to a CSV file
When indexing the frame for all "light drinkers", where a light drinker is defined by 1 drink, the following is written:
drinks.light_drinker[drinks.light_drinker == 1]
Is there a more DRY-like way to self-reference the "parent"? I.e. something like:
drinks.light_drinker[self == 1]
You can now use query or assign depending on what you need:
drinks.query('light_drinker == 1')
or, to mutate the df:
df.assign(strong_drinker = lambda x: x.light_drinker + 100)
Old answer
Not at the moment, but an enhancement along the lines of your idea is being discussed here. For simple cases, where might be enough. The proposed API might look like this:
df.set(new_column=lambda self: self.light_drinker*2)
In the most current version of pandas, .where() also accepts a callable!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html?highlight=where#pandas.DataFrame.where
So, the following is now possible:
drinks.light_drinker.where(lambda x: x == 1)
which is particularly useful in method-chains. However, this will return only the Series (not the DataFrame filtered based on the values in the light_drinker column). This is consistent with your question, but I will elaborate for the other case.
To get a filtered DataFrame, use:
drinks.where(lambda x: x.light_drinker == 1)
Note that this will keep the shape of the DataFrame (meaning you will have rows where all entries are NaN, because the condition failed for the light_drinker value at that index).
If you don't want to preserve the shape of the DataFrame (i.e you wish to drop the NaN rows), use:
drinks.query('light_drinker == 1')
Note that the items in DataFrame.index and DataFrame.columns are placed in the query namespace by default, meaning that you don't have to reference the self.
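query can also pick up local Python variables with the @ prefix, which keeps chained expressions DRY when the cutoff changes (a small sketch; the threshold variable is hypothetical):
threshold = 1
drinks.query('light_drinker == @threshold')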
I don't know of any way to reference parent objects like self or this in pandas, but another way of doing what you want, which could be considered more DRY, is where():
drinks.where(drinks.light_drinker == 1, inplace=True)
