It seems simple, but I can't find an efficient way to solve this in Python 3: is there a loop I can use on my dataframe that takes every column after the current column (starting with the 1st column) and subtracts it from the current column, so that I can add the resulting column to a new dataframe?
This is what my data looks like:
This is what I have so far, but when I run run_analysis the "result" line raises an error, and I do not know how to store the results in a new dataframe. I'm a beginner at all of this, so any help would be much appreciated.
storage = []  # container that will store the results of the subtracted columns
def subtract(a, b):  # function to call to do the column-wise subtractions
    return a - b
def run_analysis(frame, store):
    for first_col_index in range(len(frame)):  # finding the first column to use
        temp = []  # temporary place to store the column-wise values from the analysis
        for sec_col_index in range(len(frame)):  # finding the second column to subtract from the first column
            if sec_col_index <= first_col_index:  # if the column is below the current column or is equal to
                # the current column, then skip to next column
                continue
            else:
                result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
                # if column above our current column, then subtract values in the column and keep the result in temp
                temp.append(result)
        store.append(temp)  # save the complete analysis in the store
Something like this?
import pandas as pd
# dummy dataframe
df = pd.DataFrame({'a':list(range(10)), 'b':list(range(10,20)), 'c':list(range(10))})
print(df)
output:
a b c
0 0 10 0
1 1 11 1
2 2 12 2
3 3 13 3
4 4 14 4
5 5 15 5
6 6 16 6
7 7 17 7
8 8 18 8
9 9 19 9
Now iterate over pairs of adjacent columns, subtract them, and assign each result to a new column of the dataframe:
for c1, c2 in zip(df.columns[:-1], df.columns[1:]):
    df[f'{c2}-{c1}'] = df[c2]-df[c1]
print(df)
output:
a b c b-a c-b
0 0 10 0 10 -10
1 1 11 1 10 -10
2 2 12 2 10 -10
3 3 13 3 10 -10
4 4 14 4 10 -10
5 5 15 5 10 -10
6 6 16 6 10 -10
7 7 17 7 10 -10
8 8 18 8 10 -10
9 9 19 9 10 -10
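The loop above only pairs each column with its immediate neighbour. If the goal is what the question literally describes (every later column subtracted from the current one, stored in a separate dataframe), a minimal sketch could use itertools.combinations. The name diffs and the 'a-b' column labels are my own choices, not something from the question:

```python
from itertools import combinations

import pandas as pd

# Same dummy dataframe as in the answer above
df = pd.DataFrame({'a': list(range(10)),
                   'b': list(range(10, 20)),
                   'c': list(range(10))})

# combinations() yields (a, b), (a, c), (b, c): each column paired with every later column.
# Each entry is "current column minus later column", collected into a new dataframe.
diffs = pd.DataFrame({f'{c1}-{c2}': df[c1] - df[c2]
                      for c1, c2 in combinations(df.columns, 2)})

print(diffs.head())
```

Because the subtraction is done on whole columns, there is no need for an element-wise helper like subtract or for nested index loops.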
I have a pandas data frame that looks like this:
As you can see, the index has two levels, "State" and "City".
And I want to filter by state using loc, for example using:
nuevo4.loc["Bulgaria"]
(The name of the dataframe is "nuevo4".) But instead of getting the results I want, I get the error:
KeyError: 'Bulgaria'
I read the loc documentation online and I cannot see what is wrong here. I'm sorry if this is too obvious; the names are spelled correctly.
Your command should work; see the example below. You might have some whitespace issues in your data:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(25).reshape(-1,5), index = pd.MultiIndex.from_tuples([('A',1), ('A',2),('B', 0), ('C', 1), ('C', 2)]))
print(df)
print(df.loc['A'])
Output:
      0   1   2   3   4
A 1   0   1   2   3   4
  2   5   6   7   8   9
B 0  10  11  12  13  14
C 1  15  16  17  18  19
  2  20  21  22  23  24
Using loc:
   0  1  2  3  4
1  0  1  2  3  4
2  5  6  7  8  9
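To check the whitespace theory, it can help to print the raw index labels and then rebuild the index with the strings stripped. This is only a sketch on made-up data: the labels ('Bulgaria ' with a trailing space, the city names) are hypothetical, and it assumes every index level holds strings.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "State" labels carry a stray trailing space
df = pd.DataFrame(
    np.arange(6).reshape(-1, 2),
    index=pd.MultiIndex.from_tuples(
        [('Bulgaria ', 'Sofia'), ('Bulgaria ', 'Varna'), ('Chile', 'Santiago')],
        names=['State', 'City'],
    ),
)

# Printing the labels as a list shows their repr, so stray spaces become visible
print(df.index.get_level_values('State').unique().tolist())

# Rebuild the index with every (string) level stripped of surrounding whitespace
clean = df.copy()
clean.index = pd.MultiIndex.from_arrays(
    [clean.index.get_level_values(name).str.strip() for name in clean.index.names],
    names=clean.index.names,
)

result = clean.loc['Bulgaria']  # raises KeyError on df, works on clean
print(result)
```

If the labels turn out to be clean, the next thing to check would be whether "State" is really an index level rather than an ordinary column.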
I'm having problems with the pandas rolling() method: it returns several outputs even though the function I apply returns a single value.
My objective is to:
Calculate the absolute percentage difference between two DataFrames with 3 columns in each df.
Sum all values
I can do this using iterrows(), but that method is too slow on larger datasets.
This is the test data I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
This method produces the output I want by using DataFrame.iterrows():
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values)-1)*100))
        Average = Div.sum(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[991.2698412698413,
636.2698412698412,
456.19047619047626,
616.6666666666667,
935.7142857142858,
627.3809523809524,
592.8571428571429,
350.8333333333333,
449.1666666666667,
1290.0,
658.531746031746,
646.031746031746,
597.4603174603175,
478.80952380952385,
383.0952380952381,
980.5555555555555,
612.5]
Finally, below is my attempt to use rolling() so that I don't need to loop through each row.
def SumOfAverageFunction(vals):
    Div = abs((((df2.values / vals.reset_index(drop=True).values)-1)*100))
    Average = Div.sum()
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSums = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
Here is my problem: printing RunningSums from above outputs a value per column, and the numbers are not close to the results I get with the iterrows method. How do I solve this?
print(RunningSums)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 702.380952 780.000000 283.333333
3 533.333333 640.000000 533.333333
4 1200.000000 475.000000 403.174603
5 833.333333 1280.000000 625.396825
6 563.333333 760.000000 1385.714286
7 346.666667 386.666667 1016.666667
8 473.333333 573.333333 447.619048
9 533.333333 1213.333333 327.619048
10 375.000000 746.666667 415.714286
11 408.333333 453.333333 515.000000
12 604.166667 338.333333 1250.000000
13 1366.666667 577.500000 775.000000
14 847.619048 1400.000000 683.333333
15 314.285714 733.333333 455.555556
16 533.333333 441.666667 474.444444
17 347.619048 616.666667 546.666667
18 735.714286 466.666667 1290.000000
19 350.000000 488.888889 875.000000
20 525.000000 1361.111111 1266.666667
That's just how rolling behaves: it applies the function to each column's window separately, and I don't know of a way around that. One solution is to apply rolling to a single column and use the index of each window to slice the full dataframe inside your function. It is still expensive, but probably not as bad as what you're doing.
Also the output of your first method looks wrong. You're actually starting your calculations a few rows too late.
import numpy as np
def SumOfAverageFunction(vals):
    return (abs(np.divide(df2.values, df1.loc[vals.index].values)-1)*100).sum()
vals = df1.column1.rolling(3)
vals.apply(SumOfAverageFunction, raw=False)
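If calling a Python function per window is still too slow, one possible alternative (my own suggestion, not part of the answer above) is to build all the 3-row windows at once with NumPy's sliding_window_view and let broadcasting do the division. This sketch assumes NumPy 1.20 or newer and no zeros in df1; the names windows, pct_diff and running_sums are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2, 3, 4], [3, 4, 1], [3, 6, 1]])

# Shape (19, 3, 3): one entry per window, laid out as (window, column, row-in-window)
windows = sliding_window_view(df1.to_numpy(), window_shape=3, axis=0)
# Reorder to (window, row-in-window, column) so every window lines up with df2
windows = windows.transpose(0, 2, 1)

# df2 (3x3) broadcasts against each 3x3 window
pct_diff = np.abs(df2.to_numpy() / windows - 1) * 100

# One total per window, labelled by the last row of that window
running_sums = pd.Series(pct_diff.sum(axis=(1, 2)), index=df1.index[2:])
print(running_sums)
```

The first complete window ends at row 2, so this produces 19 values rather than the 17 from the iterrows loop; from row 4 onward the two agree.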
I have a pandas dataframe with N columns of integer values. The values in each column are the outcomes of a particular random experiment. For example, if I were to call df.head():
0 1 2 3
0 13 4 0 5
1 8 2 16 6
2 6 20 14 0
3 17 4 8 4
4 17 2 12 0
What I am interested in doing is counting the number of times each unique value occurs in a particular column. Looking at column 0 only, I may wish to know how many times I observed the value '17' in this experiment; in the box above we can see this occurred twice over the first 5 entries of column 0.
What would be the optimal method of doing this, via pandas itself or otherwise?
The first approach I considered was to collapse the column into a dictionary where each key is an observed data value and each value is the count for that key. I used the Counter data structure from Python's collections module.
import pandas as pd
from collections import Counter
# converting the Dataset into a Pandas Dataframe
df = pd.read_csv("newdataset.txt",
                 header=None,
                 #skiprows=0,
                 delim_whitespace=True)
print(df.head())
user0Counter = Counter()
for dataEntry in df[0]:
    user0Counter.update(dataEntry)
This leads to a type error.
TypeError Traceback (most recent call last)
<ipython-input-15-d2a83c38d0d0> in <module>
----> 1 import codecs, os;__pyfile = codecs.open('''~/dir/foo/bar.py''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();exec(compile(__code, '''~/dir/foo/bar.py''', 'exec'));
~/dir/foo/bar.py in <module>
28
29 for dataEntry in df[0]:
---> 30 user0Counter.update(dataEntry)
31
32 print(len(user0Counter))
~/anaconda3/lib/python3.7/collections/__init__.py in update(*args, **kwds)
651 super(Counter, self).update(iterable) # fast path when counter is empty
652 else:
--> 653 _count_elements(self, iterable)
654 if kwds:
655 self.update(kwds)
TypeError: 'int' object is not iterable
If I replace the user0Counter.update() call with print(dataEntry), there is no issue iterating over df[0].
0 1 2 3
0 13 4 0 5
1 8 2 16 6
2 6 20 14 0
3 17 4 8 4
4 17 2 12 0
13
8
6
17
17
1
1
4
6
19
3
11
3
4
12
7
1
9
4
2
1
2
5
1
2
13
And so forth.
You can use pandas directly.
import pandas as pd
# transform('count') returns one value per row, so the result lines up with df's index
df['col_frequency'] = df.groupby('col_to_count')['col_to_count'].transform('count')
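For the table of counts itself, Series.value_counts is probably the most direct route. As for the TypeError in the question: Counter.update expects an iterable, so handing it one int at a time fails, while handing it the whole column works. A small sketch, using only the five rows shown by df.head() (the rest of the file is not available here):

```python
from collections import Counter

import pandas as pd

# Only the first five rows from the question, columns 0 and 1
df = pd.DataFrame({0: [13, 8, 6, 17, 17], 1: [4, 2, 20, 4, 2]})

# pandas: one row per unique value, sorted by count in descending order
counts = df[0].value_counts()
print(counts)

# collections: pass the whole column (an iterable), not a single int
counter = Counter(df[0])
print(counter[17])
```

Both give a count of 2 for the value 17 on this sample.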
Is there an efficient way to change the value of a previous row whenever a condition is met in a subsequent entry? Specifically, I am wondering if there is any way to adapt pandas.where to modify the entry in the row before or after the one where the condition is tested. Suppose
Data={'Energy':[12,13,14,12,15,16],'Time':[2,3,4,2,5,6]}
DF = pd.DataFrame(Data)
DF
Out[123]:
Energy Time
0 12 2
1 13 3
2 14 4
3 12 2
4 15 5
5 16 6
If I wanted to change the value of Energy to 'X' whenever Time <= 2, I could just do something like:
DF['Energy']=DF['Energy'].where(DF['Time'] >2,'X')
or
DF.loc[DF['Time']<=2,'Energy']='X'
Which would output
Energy Time
0 X 2
1 13 3
2 14 4
3 X 2
4 15 5
5 16 6
But what if I want to change the value of 'Energy' in the row after Time <= 2, so that the output would actually be:
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Is there an easy modification for a vectorized approach to this?
Shift the values one row down using Series.shift and then compare:
DF.loc[DF['Time'].shift() <= 2, 'Energy'] = 'X'
DF
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Side note: I assume 'X' is actually something else here, but FYI, mixing strings and numeric data leads to object-dtype columns, which is a known pandas anti-pattern.
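Since the question also asks about the previous row, here is a sketch covering both directions. It uses NaN as the marker instead of 'X' (my substitution, to keep the column numeric), and the names after, before, flag_after and flag_before are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Energy': [12, 13, 14, 12, 15, 16],
                   'Time': [2, 3, 4, 2, 5, 6]})

# shift(1) moves Time one row down: True on rows that FOLLOW a row with Time <= 2
after = df['Time'].shift(1) <= 2
# shift(-1) moves Time one row up: True on rows that PRECEDE a row with Time <= 2
before = df['Time'].shift(-1) <= 2

# mask() replaces the flagged entries with NaN and leaves the rest untouched
flag_after = df['Energy'].mask(after)
flag_before = df['Energy'].mask(before)

print(flag_after)
print(flag_before)
```

The NaN introduced by shift at the edge compares as False, so the first (or last) row is never flagged by accident.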