Tricky str value replacement within PANDAS DataFrame - python

Problem Overview:
I am attempting to clean stock data loaded from a CSV file into a pandas DataFrame. The indexing operation I perform works: if I call print, I can see the values I want being pulled from the frame. However, when I try to replace the values, as shown in the screenshot, pandas ignores my request. Ultimately, I'm just trying to extract a value out of one column and move it to another. The pandas documentation suggests using the .replace() method, but that doesn't seem to work with the operation I'm trying to perform.
Here's a picture of the code and data before and after the code is run.
And the for loop (as referenced in the pic):
for i, j in zip(all_exchanges['MarketCap'], all_exchanges['MarketCapSym']):
    if 'M' in i:
        j = j.replace('n/a', 'M')
    elif 'B' in i:
        j = j.replace('n/a', 'M')

The problem is that j is a string, and strings are immutable.
You're replacing data, but not in the original dataset.
You have to do it another way, less elegantly, without zip (I simplified your test, by the way, since it did the same thing on both conditions):
aem = all_exchanges['MarketCap']
aems = all_exchanges['MarketCapSym']
for i in range(min(len(aem), len(aems))):  # like zip: stops at the shortest
    if 'M' in aem[i] or 'B' in aem[i]:
        aems[i] = aems[i].replace('n/a', 'M')
Now you're replacing in the original dataset.

If both columns are in the same dataframe, all_exchanges, iterate over the rows.
for i, row in all_exchanges.iterrows():
    # get whatever you want from row
    # using the index you can set a value back on the frame
    all_exchanges.loc[i, 'columnname'] = xyz
That should be the syntax, if I remember correctly ;)
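The row-iteration idea can be sketched on toy data (using iterrows; column values here are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for all_exchanges
df = pd.DataFrame({'MarketCap': ['$1.2M', '$3.4B', 'n/a'],
                   'MarketCapSym': ['n/a', 'n/a', 'n/a']})

# iterrows yields (index, row); writing back via .loc hits the original frame
for i, row in df.iterrows():
    if 'M' in row['MarketCap'] or 'B' in row['MarketCap']:
        df.loc[i, 'MarketCapSym'] = row['MarketCap'][-1]

# df['MarketCapSym'] is now ['M', 'B', 'n/a']
```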

Here is a quite exhaustive tutorial on missing values in pandas. I suggest using fillna():
df['MarketCap'].fillna('M', inplace=True)
df['MarketCapSym'].fillna('M', inplace=True)
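One caveat worth hedging: fillna only fills genuine NaN values. If the column holds the literal string 'n/a', as in the question, a sketch would first convert those strings to real NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MarketCapSym': ['n/a', 'M', 'n/a']})

# Turn the literal 'n/a' strings into real NaN, then fill
df['MarketCapSym'] = df['MarketCapSym'].replace('n/a', np.nan)
df['MarketCapSym'] = df['MarketCapSym'].fillna('M')
```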

Avoid iterating if you can. As already pointed out, you're not modifying the original data. Index on the MarketCap column and perform the replacement as follows.
# overwrites any data in the MarketCapSym column
all_exchanges.loc[all_exchanges['MarketCap'].str.contains('M|B'),
                  'MarketCapSym'] = 'M'
# only replaces 'n/a'
mask = all_exchanges['MarketCap'].str.contains('M|B')
all_exchanges.loc[mask, 'MarketCapSym'] = \
    all_exchanges.loc[mask, 'MarketCapSym'].replace({'n/a': 'M'})
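On toy data (names invented for illustration), the boolean-mask assignment behaves like this:

```python
import pandas as pd

all_exchanges = pd.DataFrame({'MarketCap': ['$1.2M', '$3.4B', 'n/a'],
                              'MarketCapSym': ['n/a', 'n/a', 'n/a']})

# Rows whose MarketCap contains M or B get their symbol set in one shot
mask = all_exchanges['MarketCap'].str.contains('M|B')
all_exchanges.loc[mask, 'MarketCapSym'] = 'M'
```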

Thanks to all who posted. After thinking about your solutions and the problem a bit longer, I realized there might be a different approach. Instead of initializing a MarketCapSym column with 'n/a', I instead created that column as a copy of MarketCap and then extracted anything that wasn't an "M" or "B".
I was able to get the solution down to one line:
all_exchanges['MarketCapSymbol'] = [ re.sub('[$.0-9]', '', i) for i in all_exchanges.loc[:,'MarketCap'] ]
A breakdown of the solution is as follows:
all_exchanges['MarketCapSymbol'] = - Make a new column on the DataFrame called 'MarketCapSymbol'.
all_exchanges.loc[:,'MarketCap'] - Initialize the values in the new column to those in 'MarketCap'.
re.sub('[$.0-9]', '', i) for i in - Since all I want is the 'M' or 'B', apply re.sub() to each element, stripping the characters [$.0-9] and leaving only the M or B.
Using a list comprehension this way seemed a bit more natural / readable to me in my limited experience with pandas. Let me know what you think!
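For what it's worth, the same one-liner can stay inside pandas with .str.replace, which should be equivalent here (a sketch, not a benchmarked claim):

```python
import pandas as pd

all_exchanges = pd.DataFrame({'MarketCap': ['$1.2M', '$3.4B']})

# Strip dollar signs, dots and digits, leaving only the scale letter
all_exchanges['MarketCapSymbol'] = (
    all_exchanges['MarketCap'].str.replace('[$.0-9]', '', regex=True)
)
```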

Related

Pandas values not being updated

I'm new to working with Pandas and I'm trying to do a very simple thing with it. Using the flights.csv file, I'm defining a new column, underperforming: if the number of passengers in a row is below average, the value should be 1. My problem is that the values are not being updated, so there might be something wrong with my logic. Here is an example:
df = pd.read_csv('flights.csv')
passengers_mean = df['passengers'].mean()
df['underperforming'] = 0
for idx, row in df.iterrows():
    if row['passengers'] < passengers_mean:
        row['underperforming'] = 1
print(df)
print(passengers_mean)
Any clue?
According to the docs:
You should never modify something you are iterating over. This is not guaranteed to work in all cases.
iterrows docs
What you can do instead is:
df['underperforming'] = (df['passengers'] < df['passengers'].mean()).astype('int')
Quoting the documentation:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Kindly use vectorized operations, or apply(), instead.
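On a small stand-in for flights.csv, the vectorized version looks like this:

```python
import pandas as pd

df = pd.DataFrame({'passengers': [100, 200, 300]})  # mean is 200

# Boolean comparison over the whole column, cast to 0/1
df['underperforming'] = (df['passengers'] < df['passengers'].mean()).astype(int)
```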

How can I access a row by its index using the pandas itertuples() method?

This is my code, where I need to access a certain tuple from the DataFrame df. Can you please help me with this matter, as I can't find any answer regarding this issue?
import pandas as pd
import openpyxl
df_sheet_index = pd.read_excel("path/to/excel/file.xlsx")
df = df_sheet_index.itertuples()
for tuple in df:
    print(tuple)
This is the output
Pandas(Index=0, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=1, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
Pandas(Index=2, _1=nan, _2=nan, _3=nan, _4=nan, _5=nan, _6=nan, _7=nan, _8=nan, _9=nan, _10=nan, _11=nan, _12=nan, _13=nan, _14=nan, _15=nan, _16=nan, _17=nan, _18=nan, _19=nan, _20=nan, _21=nan, _22=nan)
...
EDIT: As a general rule, you should use pandas builtin functions to search inside and not iterate on it. It's more efficient and more readable.
But if you really want to access the tuples:
target_index = 10
for tu in df.itertuples():
    if tu[0] == target_index:
        print(tu)
In a more general view, this is a regular tuple, so you can access each element by its position: the index will be tuple[0], then your first column tuple[1], the second tuple[2], etc.
NOTE: do not use tuple as a variable name; it is the name of Python's built-in tuple type, and shadowing it may create issues (on top of not being good practice).
If you are trying to get an element at a specific place, you can use .iloc.
It takes two parameters, the row and the column:
df.iloc[-1]["column"]
This will get the last row's element at that column.
For label-based access there is df.loc["row", "column"].
Note that df.loc[[label]] returns a DataFrame while df.loc[label] returns a Series.
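To make the iloc/loc distinction concrete (toy data, invented column name):

```python
import pandas as pd

df = pd.DataFrame({'col': [10, 20, 30]})

row = df.iloc[-1]          # last row as a Series
value = df.loc[2, 'col']   # scalar by row label and column name
sub = df.loc[[2]]          # one-row DataFrame, because a list of labels was passed
```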

Calculate Gunning-Fog score on excel values

I have a spreadsheet with fields containing a body of text.
I want to calculate the Gunning-Fog score on each row and have the value output to that same Excel file as a new column. To do that, I first need to calculate the score for each row. The code below works if I hard-key the text into the df variable. However, it does not work when I define the field in the sheet (i.e., rfds) and pass that through to my r variable. I get the following error, even though the two fields I am testing contain 3,896 and 4,843 words respectively.
readability.exceptions.ReadabilityException: 100 words required.
Am I missing something obvious? Disclaimer, I am very new to python and coding in general! Any help is appreciated.
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
rfd = df["Item 1A"]
rfds = rfd.to_string() # to fix "TypeError: expected string or buffer"
r = Readability(rfds)
fog = r.gunning_fog()
print(fog.score)
TL;DR: You need to pass the cell value and are currently passing a column of cells.
The line rfd = df["Item 1A"] returns a reference to a whole column (a pandas Series), not a string, which is why the TypeError was thrown. Calling rfd.to_string() then produces a printable dump of the column (index included, and possibly truncated) rather than the text of any one cell, which is why the readability check still fails.
Rather than taking a column and going down it, approach it from the other direction. Take the rows and then pull out the column:
for index, row in df.iterrows():
    print(row.iloc[2])
The [2] is the column index.
Now that a cell's value is available, it can be passed to the Readability calculator:
r = Readability(row.iloc[2])
fog = r.gunning_fog()
print(fog.score)
Note that these can be combined together into one command:
print(Readability(row.iloc[2]).gunning_fog())
This shows you how commands can be chained together - which way you find it easier is up to you. The chaining is useful when you give it to something like apply or applymap.
Putting the whole thing together (the step by step way):
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
for index, row in df.iterrows():
    r = Readability(row.iloc[2])
    fog = r.gunning_fog()
    print(fog.score)
Or the clever way:
from readability import Readability
import pandas as pd
df = pd.read_excel(r"C:/Users/name/edgar/test/item1a_sandbox.xls")
print(df["Item 1A"].apply(lambda x: Readability(x).gunning_fog()))
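The apply-chaining pattern generalizes beyond readability scoring; here is a sketch with a hypothetical word_count stand-in for Readability(x).gunning_fog(), so it runs without the readability package installed:

```python
import pandas as pd

def word_count(text):
    # Hypothetical stand-in for Readability(text).gunning_fog()
    return len(text.split())

df = pd.DataFrame({'Item 1A': ['one two three', 'four five']})

# apply calls the function once per cell and returns a Series of results
scores = df['Item 1A'].apply(lambda x: word_count(x))
```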

how to append a dataframe to an existing dataframe inside a loop

I made a simple DataFrame named middle_dataframe in Python, which looks like this and only has one row of data:
display of the existing dataframe
And I want to append a new dataframe generated each time in a loop to this existing dataframe. This is my program:
k = 2
for k in range(2, 32021):
    header = whole_seq_data[k]
    if header.startswith('>'):
        id_name = get_ucsc_ids(header)
        (chromosome, start_p, end_p) = get_chr_coordinates_from_string(header)
        if whole_seq_data[k + 1].startswith('[ATGC]'):
            seq = whole_seq_data[k + 1]
            df_temp = pd.DataFrame(
                {
                    "ucsc_id": [id_name],
                    "chromosome": [chromosome],
                    "start_position": [start_p],
                    "end_position": [end_p],
                    "whole_sequence": [seq]
                }
            )
            middle_dataframe.append(df_temp)
    k = k + 2
My iterations in the for loop seem to be fine, and I checked that the variables stored the correct values after using regular expressions. But middle_dataframe doesn't show any changes, and I cannot figure out why.
The DataFrame.append method returns the result of the append, rather than appending in-place (link to the official docs on append). The fix should be to replace this line:
middle_dataframe.append(df_temp)
with this:
middle_dataframe = middle_dataframe.append(df_temp)
Depending on how that works with your data, you might also need to pass in the parameter ignore_index=True.
The docs warn that appending one row at a time to a DataFrame can be more computationally intensive than building a python list and converting it into a DataFrame all at once. That's something to look into if your current approach ends up too slow for your purposes.
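Since DataFrame.append was deprecated and later removed (in pandas 2.0), the list-then-combine approach the docs recommend can be sketched with pd.concat (row values here are invented for illustration):

```python
import pandas as pd

middle_dataframe = pd.DataFrame({'ucsc_id': ['uc001'], 'whole_sequence': ['ATGC']})

# Accumulate one-row frames in a list, then concatenate once at the end
parts = [middle_dataframe]
for seq in ['GATTACA', 'CCGG']:
    parts.append(pd.DataFrame({'ucsc_id': ['ucXXX'], 'whole_sequence': [seq]}))
middle_dataframe = pd.concat(parts, ignore_index=True)
```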

Insert data from one dataframe into another by Index

I am working in Python and have a large dataset in a pandas dataframe. I have taken a section of this data and put it into another dataframe, where I have created a new column and populated it. I now want to put this new column back into the original dataframe, overwriting one of the existing columns, but only for the section I have edited.
Please can you help advise how this is best done? The only unique identifier is the index that is automatically generated. The 2nd dataframe has kept the same index values as the larger one so it should be quite straight forward but I cannot work out how to
a) reference the automatically created indexes
b) use these indexes to overwrite the existing data in the column from another dataframe
So, it should be something like this (I realise this is a mashup of syntax but just trying to better explain what I am trying to do!):
where df1.ROW.INDEX == df2.ROW.INDEX insert into
df1['col_name'].value from df2.['col_name'].value
Any help would be greatly appreciated.
UPDATE:
I now have this code which almost works:
index_values = edited_df.index.values
for i in index_values:
    main_df.iloc[i]['pop'] = edited_df.iloc[i]['new_col']
I get a "caveats" warning (the SettingWithCopyWarning), and main_df is not changed. It looks like it is making copies in each iteration rather than updating the main dataframe.
UPDATE: FIXED
I finally managed to work out the kinks, solution below for anyone that has a similar problem.
index_values = edited_df.index.values
for i in index_values:
    main_df.iloc[i, main_df.columns.get_loc('pop')] = edited_df.iloc[i]['new_col']
Consider using pandas.DataFrame.update for an in-place update from a passed-in dataframe. Be sure the column names match in both datasets.
main_df.update(edited_df, join='left', overwrite=True)
I appreciate that you've found a solution that works. However, you're using a for loop when you don't need to. I'll start by improving your loop; then I'll back up #Partfait's update idea.
Use loc to reference by index and column values. Note that you're relying on the coincidence that your index values are sequential integers.
index_values = edited_df.index.values
for i in index_values:
    main_df.loc[i, 'pop'] = edited_df.loc[i, 'new_col']
However, loc can take array-like indexers, and you're only using scalar indexers. That means you're better off using at:
index_values = edited_df.index.values
for i in index_values:
    main_df.at[i, 'pop'] = edited_df.at[i, 'new_col']
Or you could go even faster with set_value (note: set_value and get_value were deprecated and later removed from pandas, so on modern versions stick with at):
index_values = edited_df.index.values
for i in index_values:
    main_df.set_value(i, 'pop', edited_df.get_value(i, 'new_col'))
All that said, here is how you could use loc in one go (indexing by edited_df's index, so rows outside the edited subset keep their values):
main_df.loc[edited_df.index, 'pop'] = edited_df['new_col']
Or as #Partfait suggested
main_df.update(edited_df['new_col'].rename('pop'))
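A minimal sketch of the update route on toy data:

```python
import pandas as pd

main_df = pd.DataFrame({'pop': [1, 2, 3, 4]})
edited_df = pd.DataFrame({'new_col': [20, 30]}, index=[1, 2])

# Align on index; overwrite 'pop' only where edited_df has values.
# A Series passed to update must have a name matching the target column.
main_df.update(edited_df['new_col'].rename('pop'))
```

Rows 0 and 3 are untouched because edited_df has no entries at those index labels.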
