Append to Series in python/pandas not working

I am trying to append values to a pandas Series, where each value is the difference between the nth and (n+1)th elements of another array:
    q = pd.Series([])
    while i < len(other array):
        diff = some int value
        a = pd.Series([diff], ignore_index=True)
        q.append(a)
        i += 1
The output I get is:
    Series([], dtype: float64)
Why am I not getting an array with all the appended values?
--
P.S. This is a data science question where I have to find the state with the most counties by searching through a dataframe. I am using the index values where one state ends and the next begins (the values in the array that I am using to find the differences) to determine how many counties are in each state. If anyone knows a better way to solve this problem than the above, please let me know!

The append method doesn't work in-place. Instead, it returns a new Series object. So it should be:
    q = q.append(a)
Hope it helps!

The Series.append documentation states that it will "append rows of other to the end of this frame, returning a new object."
The examples are a little confusing, as they appear to show it working in place, but if you look closely you'll notice they use the interactive Python shell, which prints the result of the last call (the new object) rather than showing the original object.
The result of calling append is actually a brand new Series.
In your example you would need to assign q each time to the new object returned by .append:
    q = pd.Series([])
    while i < len(other array):
        diff = some int value
        a = pd.Series([diff], ignore_index=True)
        # change of code here
        q = q.append(a)
        i += 1
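
As an aside on the P.S.: if the boundary positions are already in an array, the pairwise differences can be computed in one vectorized step, with no loop or append at all. A minimal sketch, where the boundary values are made up for illustration:

    import numpy as np
    import pandas as pd

    # hypothetical index values marking where each state's rows begin/end
    boundaries = np.array([0, 12, 19, 77])

    # pairwise differences (counties per state) in one step, no while loop
    q = pd.Series(np.diff(boundaries))
    print(q)  # 12, 7, 58

And if the underlying dataframe has one row per county with a state-name column, something like df['STNAME'].value_counts().idxmax() (column name assumed) would give the state with the most counties directly.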

Related

Pandas values not being updated

I'm new to working with Pandas and I'm trying to do a very simple thing with it. Using the flights.csv file, I'm defining a new column, underperforming: if the number of passengers is below average, the value should be 1. My problem is that there might be something wrong with my logic, since the values are not being updated. Here is an example:
    df = pd.read_csv('flights.csv')
    passengers_mean = df['passengers'].mean()
    df['underperforming'] = 0

    for idx, row in df.iterrows():
        if row['passengers'] < passengers_mean:
            row['underperforming'] = 1

    print(df)
    print(passengers_mean)
Any clue?
According to the docs:
You should never modify something you are iterating over. This is not guaranteed to work in all cases.
iterrows docs
What you can do instead is:
df["underperforming"] = (df.passengers < x.passengers.mean()).astype('int')
Quoting the documentation:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Kindly use vectorized operations like apply().
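
A minimal runnable sketch of the vectorized fix, using a tiny made-up frame in place of flights.csv:

    import pandas as pd

    df = pd.DataFrame({'passengers': [100, 150, 350]})  # stand-in data; mean is 200
    df['underperforming'] = (df['passengers'] < df['passengers'].mean()).astype(int)
    print(df)
    #    passengers  underperforming
    # 0         100                1
    # 1         150                1
    # 2         350                0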

Getting a python pandas data frame's row name as a string

Having found the maximum value in a pandas data frame column, I am just trying to get the equivalent row name as a string.
Here's my code:
    df[df['ColumnName'] == df['ColumnName'].max()].index
Which returns me an answer:
    Index(['RowName'], dtype='object')
How do I just get RowName back?
(Stretch question: why does .idmax() fail in the formulation df['Colname'].idmax? And, yes, I have tried it as .idmax() and also appended it to df.loc[:,'ColName'], etc.)
Just use integer indexing:
    df[df['ColumnName'] == df['ColumnName'].max()].index[0]
Here [0] extracts the first element. Note that your criterion may match multiple rows, in which case the index will contain more than one label.
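
On the stretch question: the method is spelled idxmax, not idmax, which is likely why every variant failed. A small sketch with made-up data showing both routes:

    import pandas as pd

    df = pd.DataFrame({'ColumnName': [1, 5, 3]}, index=['a', 'RowName', 'c'])

    # boolean-mask route from the answer above
    print(df[df['ColumnName'] == df['ColumnName'].max()].index[0])  # RowName

    # equivalent one-liner; note the spelling is idxmax, not idmax
    print(df['ColumnName'].idxmax())  # RowName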

How to return the index value of an element in a pandas dataframe

I have a dataframe of corporate actions for a specific equity. It looks something like this:
    0            Declared Date     Ex-Date Record Date
    BAR_DATE
    2018-01-17      2017-02-21  2017-08-09  2017-08-11
    2018-01-16      2017-02-21  2017-05-10  2017-06-05
except that it has hundreds of rows, but that is unimportant. I created the index "BAR_DATE" from one of the columns, which is where the 0 above BAR_DATE comes from.
What I want to do is to be able to reference a specific element of the dataframe and return the index value, or BAR_DATE, I think it would go something like this:
    index_value = cacs.iloc[5, :].index.get_values()
except index_value becomes the column names, not the index. Now, this may stem from a poor understanding of indexing in pandas dataframes, so this may or may not be really easy to solve for someone else.
I have looked at a number of other questions including this one, but it returns column values as well.
Your code is really close, but you took it just one step further than you needed to.
    # creates a slice of the dataframe where the row is at iloc 5 (row number 5)
    # and where the slice includes all columns
    slice_of_df = cacs.iloc[5, :]

    # returns the index of the slice; this will be an Index() object
    index_of_slice = slice_of_df.index
From here we can use the documentation on the Index object: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
    # turns the index into a list of values in the index
    index_list = index_of_slice.to_list()

    # gets the first index value
    first_value = index_list[0]
The most important thing to remember about the Index is that it is an object of its own, and thus we need to change it to the type we expect to work with if we want something other than an index. This is where documentation can be a huge help.
EDIT: It turns out that the iloc in this case is returning a Series object which is why the solution is returning the wrong value. Knowing this, the new solution would be:
    # creates a Series object from row 5 (technically the 6th row)
    row_as_series = cacs.iloc[5, :]

    # the name of a row Series is its index label in the original DataFrame
    index_of_series = row_as_series.name
This would be the approach for single-row indexing. You would use the former approach with multi-row indexing where the return value is a DataFrame and not a Series.
Unfortunately, I don't know how to coerce the Series into a DataFrame for single-row slicing beyond explicit conversion:

    row_as_df = pd.DataFrame(cacs.iloc[5, :])

While this will work, and the first approach will happily take this and return the index, there is likely a reason why Pandas doesn't return a DataFrame for single-row slicing, so I am hesitant to offer this as a solution.
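
A compact runnable illustration of the .name route, using a two-row stand-in for the cacs frame described in the question (columns abbreviated):

    import pandas as pd

    cacs = pd.DataFrame(
        {'Declared Date': ['2017-02-21', '2017-02-21'],
         'Ex-Date': ['2017-08-09', '2017-05-10']},
        index=pd.Index(['2018-01-17', '2018-01-16'], name='BAR_DATE'),
    )

    row = cacs.iloc[0, :]      # a Series: its .index holds the column names
    print(list(row.index))     # ['Declared Date', 'Ex-Date']
    print(row.name)            # 2018-01-17, the BAR_DATE label we wanted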

Row wise operations or UDF by row on a dataframe in pyspark

I have to implement the equivalent of pandas .apply(function, axis=1) (a row-wise function) in pyspark. As I am a novice, I am not sure whether it can be implemented through a map function or using UDFs. I have not been able to find any similar implementation anywhere.
Basically, all I want is to pass a row to a function, do some operations to create new columns that depend on the values of the current and previous rows, and then return the modified rows to create a new dataframe.
One of the functions used with pandas is given below:

    previous = 1

    def row_operation(row):
        global previous
        if pd.isnull(row["PREV_COL_A"]) or row["COL_A"] != row["PREV_COL_A"]:
            current = 1
        elif row["COL_C"] > cutoff:
            current = previous + 1
        elif row["COL_C"] <= cutoff:
            current = previous
        else:
            current = np.nan
        previous = current
        return current
Here PREV_COL_A is nothing but COL_A lagged by one row.
Please note that this function is the simplest of them and does not return rows; however, others do.
If anyone can guide me on how to implement row operations in pyspark it would be a great help.
TIA
You could use rdd.mapPartitions. It gives you an iterator over all the rows in a partition, and you yield out the result rows you want to return. The iterator you are given won't let you move forward or backward by index; it just returns the next row. However, you can save off rows as you process them to do whatever you need to do. For example:
    def my_cool_function(rows):
        prev_rows = []
        for row in rows:
            # Do some processing with all the rows, and return a result
            yield my_new_row
            if len(prev_rows) >= 2:
                prev_rows = prev_rows[1:]
            prev_rows.append(row)

    updated_rdd = rdd.mapPartitions(my_cool_function)
Note, I used a list to track the previous rows for the sake of example, but Python lists are really arrays, which don't have efficient head push/pop methods, so you will probably want to use an actual queue (e.g. collections.deque).
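
A minimal, hypothetical sketch of this pattern applied to the question's row_operation. The column names, the cutoff value, and the simplified branching are assumptions, and the running state only holds within one partition, so real data would need to be partitioned and sorted appropriately first:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.getOrCreate()
    cutoff = 10  # assumed threshold from the question

    def with_counter(rows):
        # carry state across rows; lives only within one partition
        previous, prev_col_a = 1, None
        for row in rows:
            if prev_col_a is None or row['COL_A'] != prev_col_a:
                current = 1
            elif row['COL_C'] > cutoff:
                current = previous + 1
            else:
                current = previous
            yield Row(**row.asDict(), COUNTER=current)
            previous, prev_col_a = current, row['COL_A']

    df = spark.createDataFrame([('a', 5), ('a', 20), ('b', 20)],
                               ['COL_A', 'COL_C'])

    # coalesce(1) keeps the toy data in one partition so the state is global here
    updated = df.rdd.coalesce(1).mapPartitions(with_counter).toDF()
    updated.show()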

Applying the output of a function to two columns using .apply

I'm working on a script that takes in an address and spits out two values: coordinates (as a list) and result (whether the geocoding was successful or not). This works fine, but since the data is returned as a list, I then have to assign new columns based on the indices of that list, which works but returns a warning:
    A value is trying to be set on a copy of a slice from a DataFrame
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
EDIT: Just to be clear, I think I understand from that page that I should be using .loc to access the nested values. My question is more along the lines of generating two columns directly from a function as opposed to this workaround of having to dig the information out later.
I'd like to know the correct way to approach problems like these, as I actually have this problem twice in this project.
The actual specifics of the problem aren't important, so here's a simple example of how I've been approaching it:
    def geo(address):
        location = geocode(address)
        result = location.result
        coords = location.coords
        return coords, result

    df['output'] = df['address'].apply(geo)
Since this then yields a nested list into my df column, I then extract that into new columns as such:
    df['coordinates'] = None
    df['gps_status'] = None

    for index, row in df.iterrows():
        df['coordinates'][index] = df['output'][index][0]
        df['gps_status'][index] = df['output'][index][1]
And again, I get the warning:
    A value is trying to be set on a copy of a slice from a DataFrame
    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Any advice on the correct way to do this would be appreciated.
Usually you want to avoid iterrows(), since it is faster to operate on an entire column at once. You can assign the results from output directly to new columns.
    import pandas as pd

    def geo(x):
        return x*2, x*3

    df = pd.DataFrame({'address': [1, 2, 3]})
    output = df['address'].apply(geo)
    df['a'] = [x[0] for x in output]
    df['b'] = [x[1] for x in output]
gives you
       address  a  b
    0        1  2  3
    1        2  4  6
    2        3  6  9
with no copy warning.
Your function should return a Series:
    def geo(address):
        location = geocode(address)
        result = location.result
        coords = location.coords
        return pd.Series([coords, result], ['coordinates', 'gps_status'])

    df[['coordinates', 'gps_status']] = df['address'].apply(geo)
That said, this may be better written as a merge.
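
A runnable sketch of this Series-returning approach, with a stub geocode() standing in for the real service (the stub and its fields are assumptions for illustration):

    import pandas as pd

    def geocode(address):
        # hypothetical stand-in: pretend every address resolves successfully
        class Location:
            coords = [0.0, 0.0]
            result = 'OK'
        return Location()

    def geo(address):
        location = geocode(address)
        return pd.Series([location.coords, location.result],
                         ['coordinates', 'gps_status'])

    df = pd.DataFrame({'address': ['1 Main St', '2 Oak Ave']})
    df[['coordinates', 'gps_status']] = df['address'].apply(geo)
    print(df)

Because geo returns a Series, apply expands its index into columns, so both new columns are created in one pass with no chained assignment and no copy warning.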
