How to calculate the sum of two adjacent rows from one column? - python

I have a data frame with the following column (the first column is the index):
para
0 223.46
1 92.26
2 66.86
3 52.14
4 69.55
5 94.20
6 129.96
7 297.48
The sum should be of two adjacent rows from one column:
new_index0 will be the first value,
new_index1 = old_index0 + old_index1,
new_index2 = old_index1 + old_index2, ...and so on.
so I guess I need a for loop here (or maybe not).
I tried several ways, but really have no idea how to do it.
The following is what I tried:
def sum(i):
    for i in range(0, i):
        sum = data_10.icol[i] + data_10.icol[i+1]
    return sum
I expected to get:
para
0 223.46
1 315.72
2 159.12
3 119.00
4 121.69
5 163.75
6 224.16
7 427.38

This is a rolling sum:
df.rolling(2, min_periods=1).sum()
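A minimal sketch with the data above (assuming the frame is called df and the column is para):

import pandas as pd

df = pd.DataFrame({'para': [223.46, 92.26, 66.86, 52.14,
                            69.55, 94.20, 129.96, 297.48]})

# window of 2; min_periods=1 keeps the first row as-is instead of NaN
print(df.rolling(2, min_periods=1).sum())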

Related

Count text records in a column with pandas

I need to count how many times a value appears in the column. I did something similar in Excel and I want to understand how to do the same in pandas. Thanks
You can try something like this:
import pandas as pd
df = pd.DataFrame({'char_list':list('aabbbbssbbaaabdddcccsbcderfffrrcashhttyy')})
df = df['char_list'].value_counts().reset_index()
df.columns = ['char_list', 'count']
print(df)
Output:
char_list count
0 b 8
1 a 6
2 c 5
3 s 4
4 d 4
5 r 3
6 f 3
7 h 2
8 t 2
9 y 2
10 e 1
Do you want something like this:
df = pd.DataFrame({"a": [1, 2, 3, 1, 1, 4, 5, 6, 2, 1]})
oc = df.groupby("a").size()
df["count"] = df["a"].map(oc)
print(oc)
print()
print(df)
to get
a
1 4
2 2
3 1
4 1
5 1
6 1
dtype: int64
a count
0 1 4
1 2 2
2 3 1
3 1 4
4 1 4
5 4 1
6 5 1
7 6 1
8 2 2
9 1 4
Or do you prefer something like Pandas: Incrementally count occurrences in a column with an increment of occurrences?
Clarify and describe your requirements
Count the occurrence of string X inside what?
Where to look, how to count?
What is X?
What does your Excel formula do?
Your Excel formula is doing a window-based aggregation, where the aggregation function is a count (function COUNTIF) and the window runs from the first row to the current row (first parameter, a range). The value to count (the given criteria) is specified per row (second parameter, a cell value).
See Excel's function COUNTIF:
Counts the number of cells within a range that meet the given criteria
Illustrate by example
Instead of "window-based" we could also say cumulative:
The formula counts the occurrences of the string key123 (the value in column A of the current row, e.g. row 1) in the rows from the first ($A$1) to the current ($A1).
Given a column with strings where the first string is key123, then
its first occurrence should have count 1,
the second should have count 2
and so on
Equivalent functions in pandas
So your Excel formula =COUNTIF($A$1:$A1; A1) would directly translate to pandas GroupBy.cumcount like
df.groupby("Column_A").cumcount()+1
as already answered in:
Pandas: Incrementally count occurrences in a column
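A minimal worked sketch (the column name Column_A and the sample strings are assumed for illustration):

import pandas as pd

df = pd.DataFrame({"Column_A": ["key123", "abc", "key123", "key123"]})
# cumcount numbers each occurrence within its group, starting at 0
df["count"] = df.groupby("Column_A").cumcount() + 1
print(df)
#   Column_A  count
# 0   key123      1
# 1      abc      1
# 2   key123      2
# 3   key123      3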
Terminology
The cumulative count increases the count for each occurrence, similar to a cumulative sum, also known as a running total.
See also related SQL keywords/concepts:
GROUP BY: grouping records and applying aggregate-functions
COUNT: an aggregate-function like SUM, AVG, MAX, MIN
window functions: allow further fine-grained aggregation

How to restrict DataFrame number of rows to the Xth unique value in certain column?

Say for example we have the following DataFrame:
A B
1 2
1 2
2 3
3 4
4 5
4 2
And say we want the first x (say 3) unique values in column A.
Then the desired output would be:
A B
1 2
1 2
2 3
3 4
I thought about looping through the column in question, counting the number of unique values seen so far, and taking the subset of the DataFrame up to the right index. I am still a newbie to Python and I believe there is a more efficient way to do this; please share your solutions. Appreciated!
You can try Series.factorize, which labels the unique values starting at 0, and then select the rows whose label is <= n-1 (because the labels start at 0); this preserves order too:
n=3
df[df['A'].factorize()[0]<=n-1]
A B
0 1 2
1 1 2
2 2 3
3 3 4
Alternatively, if any 3 unique values will do (not necessarily the first 3), you can use np.random.choice to pick unique ids at random, then isin to select the rows with those ids:
import numpy as np

selected_ids = np.random.choice(df['A'].unique(), replace=False, size=3)
df[df['A'].isin(selected_ids)]

Multiple Condition Apply Function that iterates over itself

So I have a Dataframe that is the same thing 348 times, but with a different date as a static column. What I would like to do is add a column that checks against that date and then counts the number of rows that are within 20 miles using a lat/lon column and geopy.
My frame is like this:
What I am looking to do is something like an apply function that takes all of the identifying dates that are equal to the column and then run this:
geopy.distance.vincenty(x, y).miles
X would be the location's lat/lon and y would be the iterative lat/lon. I'd want the count of locations in which the above is < 20. I'd then like to store this count as a column in the initial Dataframe.
I'm ok with Pandas, but this is just outside my comfort zone. Thanks.
I started with this DataFrame (because I did not want to type that much by hand and you did not provide any code for the data):
df
Index Number la ID
0 0 1 [43.3948, -23.9483] 1/1/90
1 1 2 [22.8483, -34.3948] 1/1/90
2 2 3 [44.9584, -14.4938] 1/1/90
3 3 4 [22.39458, -55.34924] 1/1/90
4 4 5 [33.9383, -23.4938] 1/1/90
5 5 6 [22.849, -34.397] 1/1/90
Now I introduce an artificial column which is only there to help us get the Cartesian product needed for the pairwise distances:
df['join'] = 1
df_c = pd.merge(df, df[['la', 'join','Index']], on='join')
The next step is to apply the vincenty function via .apply and store the result in an extra column (this assumes from geopy import distance):
df_c['distance'] = df_c.apply(lambda x: distance.vincenty(x.la_x, x.la_y).miles, axis=1)
Now we have the Cartesian product of the original frame, which means we also compare each city with itself. We account for that in the next step by subtracting 1. We group by Index_x and count all the distances smaller than 20 miles.
df['num_close_cities'] = df_c.groupby('Index_x').apply(lambda x: sum(x.distance < 20)) - 1
df.drop('join', axis=1)
Index Number la ID num_close_cities
0 0 1 [43.3948, -23.9483] 1/1/90 0
1 1 2 [22.8483, -34.3948] 1/1/90 1
2 2 3 [44.9584, -14.4938] 1/1/90 0
3 3 4 [22.39458, -55.34924] 1/1/90 0
4 4 5 [33.9383, -23.4938] 1/1/90 0
5 5 6 [22.849, -34.397] 1/1/90 1
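For reference, a minimal self-contained sketch of the same approach (the data is the made-up frame above; newer geopy versions removed vincenty, so geodesic is substituted here):

import pandas as pd
from geopy import distance

df = pd.DataFrame({
    'Index': range(6),
    'Number': range(1, 7),
    'la': [[43.3948, -23.9483], [22.8483, -34.3948], [44.9584, -14.4938],
           [22.39458, -55.34924], [33.9383, -23.4938], [22.849, -34.397]],
    'ID': ['1/1/90'] * 6,
})

df['join'] = 1                      # helper column for the cross join
df_c = pd.merge(df, df[['la', 'join', 'Index']], on='join')
df_c['distance'] = df_c.apply(
    lambda x: distance.geodesic(x.la_x, x.la_y).miles, axis=1)
# subtract 1 to discard each row's zero-mile distance to itself
df['num_close_cities'] = df_c.groupby('Index_x').apply(
    lambda x: (x.distance < 20).sum()) - 1
print(df.drop('join', axis=1))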

Running total on a pandas df

I am trying to add a running count to a pandas df.
For the values in Column A, I want to add '5' and for values in Column B I want to add '1'.
So for the df below I'm hoping to produce:
A B Total
0 0 0 0
1 0 0 0
2 1 0 5
3 1 1 6
4 1 1 6
5 2 1 11
6 2 2 12
So every increment of the integer in Column A adds '5' to the total, while every increment in Column B adds '1'.
I tried:
df['Total'] = df['A'].cumsum(axis = 0)
But this doesn't include Column B
df['Total'] = df['A'] * 5 + df['B']
As far as I can tell, you are simply doing row-wise operations, not a cumulative sum. This snippet calculates the row value of A times 5 and adds the row value of B for each row. Please don't make it any more complicated than it really is.
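A quick sketch reproducing the frame from the question to check (column values copied from the expected output above):

import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1, 1, 2, 2],
                   'B': [0, 0, 0, 1, 1, 1, 2]})
df['Total'] = df['A'] * 5 + df['B']
print(df)   # the Total column comes out 0, 0, 5, 6, 6, 11, 12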
What is a cumulative sum (also called running total)?
From Wikipedia:
Consider the sequence < 5 8 3 2 >. What is the total of this sequence?
Answer: 5 + 8 + 3 + 2 = 18. This is arrived at by simple summation of the sequence.
Now we insert the number 6 at the end of the sequence to get < 5 8 3 2 6 >. What is the total of that sequence?
Answer: 5 + 8 + 3 + 2 + 6 = 24. This is arrived at by simple summation of the sequence. But if we regarded 18 as the running total, we need only add 6 to 18 to get 24. So, 18 was, and 24 now is, the running total. In fact, we would not even need to know the sequence at all, but simply add 6 to 18 to get the new running total; as each new number is added, we get a new running total.
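For completeness, the pandas equivalent of that running total is a one-liner (a sketch with the sequence from the quote):

import pandas as pd

s = pd.Series([5, 8, 3, 2, 6])
print(s.cumsum())   # 5, 13, 16, 18, 24 -- each value is the running total so far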

Attempting to delete multiple rows from Pandas Dataframe but more rows than intended are being deleted

I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1's and -1s
    global df1, df2
    deletions = []
    for i in xrange(len(results)-1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting inside the if statement instead of appending to a list first.)
Your index values are not unique, and when you use drop it removes all rows with those index values. to_delete may have had length 50, but there were 250 rows with those particular index values.
Consider the example
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
Which is a problem
Option 1
use np.in1d to find complement of to_del
This is more self-explanatory than the others. I'm checking, for each position from 0 to n-1, whether it is in to_del. The result is a boolean array the same length as df. I use ~ to get the negation and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
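On newer NumPy versions, np.isin is the preferred spelling of the same check (a sketch, same df and to_del as above):
df[~np.isin(np.arange(len(df)), to_del)]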
Option 2
use np.bincount to find complement of to_del
This accomplishes the same thing as option 1 by counting the positions defined in to_del. I end up with an array of 0s and 1s, with a 1 in each position defined in to_del and 0 elsewhere. I want to keep the 0s, so I make a boolean array by finding where it is equal to 0. I then use this to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
use np.setdiff1d to find positions
This uses set logic to find the difference between a full array of positions and just the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
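Applied back to the original question (a sketch; df1, df2 and to_delete as defined there), each option boils down to building one positional mask and reusing it for both frames:

import numpy as np

keep = ~np.isin(np.arange(len(df1)), to_delete)
df1 = df1[keep]
df2 = df2[keep]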
