Struggling to create a new series - Python

I have the following df:
                A  C
Date
2015-06-29  196.0  1
2015-09-18  255.0  2
2015-08-24  236.0  3
2014-11-20   39.0  4
2014-10-02    4.0  5
How can I generate a new series that, for each row, is the sum of column C over all rows up to and including that one?
This would be the desired output:
D
1
3     # this second value (3) is the sum of the first and second rows of column C
6     # this third value (6) is the sum of the first, second and third rows of column C, and so on
10
15
I have tried a loop such as:
for j in range(len(df)):
    new_series.iloc[j] += df['C'].iloc[j]
return new_series
But it does not seem to work.

IIUC you can use cumsum to perform this:
In [373]:
df['C'].cumsum()
Out[373]:
Date
2015-06-29 1
2015-09-18 3
2015-08-24 6
2014-11-20 10
2014-10-02 15
Name: C, dtype: int64
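If you want this running total as a new column D on the frame (matching the desired output), a minimal sketch:
df['D'] = df['C'].cumsum()  # cumulative sum of column C, aligned on the same index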

Numpy alternatives:
In [207]: np.add.accumulate(df['C'])
Out[207]:
2015-06-29 1
2015-09-18 3
2015-08-24 6
2014-11-20 10
2014-10-02 15
Name: C, dtype: int64
In [208]: np.cumsum(df['C'])
Out[208]:
2015-06-29 1
2015-09-18 3
2015-08-24 6
2014-11-20 10
2014-10-02 15
Name: C, dtype: int64
In [209]: df['C'].values.cumsum()
Out[209]: array([ 1, 3, 6, 10, 15], dtype=int64)


When to use .count() and .value_counts() in Pandas?

I am learning pandas. I'm not sure when to use the .count() function and when to use .value_counts().
count() is used to count the number of non-NA/null observations across the given axis. It works with non-floating type data as well.
Now, as an example, create a dataframe df:
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [10, 8, 12, None, 5, 3],
                   "B": [-1, None, 6, 4, None, 3],
                   "C": ["Shreyas", "Aman", "Apoorv", np.nan, "Kunal", "Ayush"]})
Find the count of non-NA values for each column (axis=0):
df.count(axis = 0)
Output:
A 5
B 4
C 5
dtype: int64
Find the count of non-NA/null values for each row (axis=1):
df.count(axis = 1)
Output:
0 3
1 2
2 3
3 1
4 2
5 3
dtype: int64
value_counts() function returns Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
So for the example shown below
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts()
The output would be:
3.0 2
4.0 1
2.0 1
1.0 1
dtype: int64
value_counts() aggregates the data and counts each unique value. You can achieve the same with groupby, which is a more general way to aggregate data in pandas.
count() simply returns the number of non-NaN/null values in the column (Series) you apply it on.
df = pd.DataFrame({'Id': ['A', 'B', 'B', 'C', 'D', 'E', 'F', 'F'],
                   'Value': [10, 20, 15, 5, 35, 20, 10, 25]})
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 C 5
4 D 35
5 E 20
6 F 10
7 F 25
# Value counts
df['Id'].value_counts()
F 2
B 2
C 1
A 1
D 1
E 1
Name: Id, dtype: int64
# Same operation but with groupby
df.groupby('Id')['Id'].count()
Id
A 1
B 2
C 1
D 1
E 1
F 2
Name: Id, dtype: int64
# Count()
df['Id'].count()
8
Example with NaN values and count() (here one of the Id values has been replaced with NaN):
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 NaN 5
4 D 35
5 E 20
6 F 10
7 F 25
df['Id'].count()
7
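value_counts() likewise skips NaN by default; if you also want NaN counted, passing dropna=False should work (a small sketch on the same frame):
df['Id'].value_counts(dropna=False)  # now also includes a NaN entry with count 1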
count() returns the total number of non-null values in the series.
value_counts() returns a series of the number of times each unique non-null value appears, sorted from most to least frequent.
As usual, an example is the best way to convey this:
ser = pd.Series(list('aaaabbbccdef'))
ser
>
0 a
1 a
2 a
3 a
4 b
5 b
6 b
7 c
8 c
9 d
10 e
11 f
dtype: object
ser.count()
>
12
ser.value_counts()
>
a 4
b 3
c 2
f 1
d 1
e 1
dtype: int64
Note that a dataframe has the count() method, which returns a series with the count() (scalar) value for each column in the df. However, a dataframe had no value_counts() method before pandas 1.1, which added a DataFrame.value_counts() that counts unique rows.
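As a small hedged sketch of that pandas 1.1+ behaviour (the frame and column names here are made up for illustration):
frame = pd.DataFrame({'x': [1, 1, 2], 'y': ['a', 'a', 'b']})
frame.value_counts()   # counts unique (x, y) row combinations: (1, 'a') -> 2, (2, 'b') -> 1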

Pandas iterate over DataFrame row pairs

How can I iterate over pairs of rows of a Pandas DataFrame?
For example:
content = [(1,2,[1,3]),(3,4,[2,4]),(5,6,[6,9]),(7,8,[9,10])]
df = pd.DataFrame( content, columns=["a","b","interval"])
print(df)
output:
a b interval
0 1 2 [1, 3]
1 3 4 [2, 4]
2 5 6 [6, 9]
3 7 8 [9, 10]
Now I would like to do something like
for (indx1, row1), (indx2, row2) in df.?
    print("row1:\n", row1)
    print("row2:\n", row2)
    print("\n")
which should output
row1:
a 1
b 2
interval [1,3]
Name: 0, dtype: int64
row2:
a 3
b 4
interval [2,4]
Name: 1, dtype: int64
row1:
a 3
b 4
interval [2,4]
Name: 1, dtype: int64
row2:
a 5
b 6
interval [6,9]
Name: 2, dtype: int64
row1:
a 5
b 6
interval [6,9]
Name: 2, dtype: int64
row2:
a 7
b 8
interval [9,10]
Name: 3, dtype: int64
Is there a builtin way to achieve this?
I looked at df.groupby(df.index // 2) and df.itertuples, but neither seems to do what I want.
Edit:
The overall goal is to get a list of bools indicating whether the intervals in column "interval" overlap. In the above example the list would be
overlaps = [True, False, False]
So one bool for each pair.
Shift the dataframe and concat it back onto the original with axis=1, so that each interval and the next interval end up in the same row:
df_merged = pd.concat([df, df.shift(-1).add_prefix('next_')], axis=1)
df_merged
#Out:
a b interval next_a next_b next_interval
0 1 2 [1, 3] 3.0 4.0 [2, 4]
1 3 4 [2, 4] 5.0 6.0 [6, 9]
2 5 6 [6, 9] 7.0 8.0 [9, 10]
3 7 8 [9, 10] NaN NaN NaN
Define an intersects function that works with your list representation and apply it on the merged data frame, ignoring the last row where next_interval is null:
def intersects(left, right):
    return left[1] > right[0]
df_merged[:-1].apply(lambda x: intersects(x.interval, x.next_interval), axis=1)
#Out:
0 True
1 False
2 False
dtype: bool
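A variant that skips building the merged frame (a sketch, assuming every interval is a two-element list) compares each interval's end with the next interval's start directly:
ends = df['interval'].apply(lambda iv: iv[1])                   # right endpoint of each interval
next_starts = df['interval'].apply(lambda iv: iv[0]).shift(-1)  # left endpoint of the following row
overlaps = (ends > next_starts).tolist()[:-1]                   # drop the last row, which has no pair
# overlaps == [True, False, False]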
If you want to keep the for loop, using zip and iterrows could be a way:
for (indx1, row1), (indx2, row2) in zip(df[:-1].iterrows(), df[1:].iterrows()):
    print("row1:\n", row1)
    print("row2:\n", row2)
    print("\n")
To access the next row at the same time, start the second iterrows one row later with df[1:].iterrows(), and you get the output the way you want:
row1:
a 1
b 2
Name: 0, dtype: int64
row2:
a 3
b 4
Name: 1, dtype: int64
row1:
a 3
b 4
Name: 1, dtype: int64
row2:
a 5
b 6
Name: 2, dtype: int64
row1:
a 5
b 6
Name: 2, dtype: int64
row2:
a 7
b 8
Name: 3, dtype: int64
But as @RafaelC said, a for loop might not be the best method for your general problem.
To get the output you've shown use:
for row in df.index[:-1]:
    print('row 1:')
    print(df.iloc[row].squeeze())
    print('row 2:')
    print(df.iloc[row + 1].squeeze())
    print()
You could try iloc indexing.
Example:
for i in range(df.shape[0] - 1):
    idx1, idx2 = i, i + 1
    row1, row2 = df.iloc[idx1], df.iloc[idx2]
    print(row1)
    print(row2)
    print()

How do I save a certain row in my database to a new Excel file?

I have located a specific row in my database using:
df.loc[df["Cost per m^3/$"].idxmin()]
However, I would now like to save this row to a new Excel spreadsheet. How can I do this?
You can do it like this:
row = df.loc[df["Cost per m^3/$"].idxmin()]
pd.DataFrame(row).to_excel('NewFile.xlsx')
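Note that pd.DataFrame(row) turns the Series into a single-column frame, so the row is written vertically; if you would rather keep it as one horizontal row, one option (a sketch) is:
row.to_frame().T.to_excel('NewFile.xlsx')  # transpose so the Series becomes a single row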
You can use the following trick:
Data:
In [120]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))
In [121]: df
Out[121]:
a b c
0 5 9 4
1 4 5 3
2 8 0 1
3 0 3 9
4 6 6 5
This returns a series:
In [122]: df.loc[df.a.idxmin()]
Out[122]:
a 0
b 3
c 9
Name: 3, dtype: int32
Let's use a list of indexes instead of a scalar value:
In [123]: df.loc[[df.a.idxmin()]]
Out[123]:
a b c
3 0 3 9
Now you can use the DataFrame.to_excel() method.
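For example, something along these lines should write that single-row frame to a new file (the file name is just illustrative):
df.loc[[df.a.idxmin()]].to_excel('NewFile.xlsx')  # needs an Excel writer engine such as openpyxl installed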

Pandas Series with column names for each value above a minimum

I am trying to get a new series from a DataFrame. For each row of the DataFrame, this series should contain the name of the left-most column whose value is at or above some threshold, like this:
df = pd.DataFrame(np.random.randint(0,10,size=(5, 6)), columns=list('ABCDEF'))
>>> df
A B C D E F
0 2 4 6 8 8 4
1 2 0 9 7 7 1
2 1 7 7 7 3 0
3 5 4 4 0 1 7
4 9 6 1 5 1 5
min = 3
Expected Output:
0 B
1 C
2 B
3 A
4 A
dtype: object
Here the output's row 0 is "B" because, in row index 0 of the DataFrame, column "B" is the left-most column with a value equal to or bigger than min = 3.
I know that I can use df.idxmin(axis=1) to get the column name of the minimum for each row, but I have no clue at all how to tackle this more complex problem.
Thanks for any help or hints!
UPDATE - index of the first element in each row satisfying the condition:
A more elegant and more efficient version from @DSM:
In [156]: (df>=3).idxmax(1)
Out[156]:
0 B
1 C
2 B
3 A
4 A
dtype: object
my version:
In [149]: df[df>=3].apply(lambda x: x.first_valid_index(), axis=1)
Out[149]:
0 B
1 C
2 B
3 A
4 A
dtype: object
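One caveat worth adding (not part of the original answer): if a row had no value >= 3 at all, (df>=3).idxmax(1) would still return the first column label, whereas the first_valid_index() version returns None. A sketch that masks such rows instead:
mask = df >= 3
mask.idxmax(axis=1).where(mask.any(axis=1))   # NaN for rows with no value >= 3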
Old answer - index of the minimum element for each row:
In [27]: df[df>=3].idxmin(1)
Out[27]:
0 E
1 A
2 C
3 C
4 F
dtype: object

df.iloc[0:1, :].apply(func, axis=1, args=(x, y, z)) executes func() 2 times

I have a dataframe df that has thousands of rows.
For each row I want to apply function func.
As a test, I wanted to run func for only the first row of df. In func() I placed a print statement. I realized that the print statement was run 2 times even though I am slicing df down to one row (the extra line in the displayed slice is just the column labels).
When I do the following
df[0:1].apply(func, axis=1, args=(x, y, z))
or
df.iloc[0:1, :].apply(func, axis=1, args=(x, y, z))
The print statement is run 2 times, which means func() was executed twice.
Any idea why this is happening?
The doc clearly says:
In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path.
Pay attention to the different slicing techniques:
In [134]: df
Out[134]:
a b c
0 9 5 4
1 4 7 2
2 1 3 7
3 6 3 2
4 4 5 2
In [135]: df.iloc[0:1]
Out[135]:
a b c
0 9 5 4
In [136]: df.loc[0:1]
Out[136]:
a b c
0 9 5 4
1 4 7 2
with printing:
print one row as Series:
In [139]: df[0:1].apply(lambda r: print(r), axis=1)
a 9
b 5
c 4
Name: 0, dtype: int32
Out[139]:
0 None
dtype: object
or using iloc:
In [144]: df.iloc[0:1, :].apply(lambda r: print(r), axis=1)
a 9
b 5
c 4
Name: 0, dtype: int32
Out[144]:
0 None
dtype: object
print two rows/Series:
In [140]: df.loc[0:1].apply(lambda r: print(r), axis=1)
a 9
b 5
c 4
Name: 0, dtype: int32
a 4
b 7
c 2
Name: 1, dtype: int32
Out[140]:
0 None
1 None
dtype: object
OP:
"the print statement was run 2 times even though I am slicing df to one row"
Actually, you were slicing it into two rows.
