These numbers don't make sense to me. Why does checking for list existence, or checking the len() of a list, take longer than a copy()? They're O(1) operations versus an O(n) operation.
Total time: 3.01392 s
File: all_combinations.py
Function: recurse at line 15
Line # Hits Time Per Hit % Time Line Contents
==============================================================
15 @profile
16 def recurse(in_arr, result=[]):
17 nonlocal count
18
19 1048576 311204.0 0.3 10.3 if not in_arr:
20 524288 141102.0 0.3 4.7 return
21
22 524288 193554.0 0.4 6.4 in_arr = in_arr.copy() # Note: this adds an O(n) operation
23
24 1572863 619102.0 0.4 20.5 for i in range(len(in_arr)):
25 1048575 541166.0 0.5 18.0 next = result + [in_arr.pop(0)]
26 1048575 854453.0 0.8 28.4 recurse(in_arr, next)
27 1048575 353342.0 0.3 11.7 count += 1
Total time: 2.84882 s
File: all_combinations.py
Function: recurse at line 38
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 @profile
39 def recurse(result=[], index=0):
40 nonlocal count
41 nonlocal in_arr
42
43 # base
44 1048576 374126.0 0.4 13.1 if index > len(in_arr):
45 return
46
47 # recur
48 2097151 846711.0 0.4 29.7 for i in range(index, len(in_arr)):
49 1048575 454619.0 0.4 16.0 next_result = result + [in_arr[i]]
50 1048575 838434.0 0.8 29.4 recurse(next_result, i + 1)
51 1048575 334930.0 0.3 11.8 count = count + 1
It's not that making the copy takes longer, per call, than the O(1) operations you mentioned. But remember that your base-case check runs twice as often as the copy: 1,048,576 hits for if not in_arr: versus 524,288 for in_arr.copy(), so its total time is larger even though each individual check is cheaper per hit.
I'm not sure what "it" refers to in your question. The generic ("royal") "it" does not take longer; your implementation is what takes longer. In most language implementations len() is O(1), because the length is an instance attribute that is updated along with any change to the object. The existence-check implementation is slower because every call copies the list and pops items off its front instead of simply indexing into a shared list, even though both versions make the same number of recursive calls.
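To see the asymmetry directly, here is a quick sketch (mine, not from the profiler run above) that times the two operations with timeit: len() stays flat as the list grows, while copy() scales with the length.

import timeit

# len() reads a stored size field; copy() must visit every element.
for n in (1_000, 10_000, 100_000):
    xs = list(range(n))
    t_len = timeit.timeit(lambda: len(xs), number=10_000)
    t_copy = timeit.timeit(lambda: xs.copy(), number=10_000)
    print("n=%7d: len %.4fs  copy %.4fs" % (n, t_len, t_copy))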
Real beginner question here, but it's so simple that I'm genuinely stumped. Python/DataFrame newbie.
I've loaded a DataFrame from a Google Sheet, but any graphing or attempts at calculations generate bogus results. Loading code:
# Setup
!pip install --upgrade -q gspread
from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('Linear Regression - Brain vs. Body Predictor').worksheet("Raw Data")
rows = worksheet.get_all_values()
# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)
This seems to work fine, and the data looks correctly loaded when I print out the DataFrame, but running max() returns obviously false results. For example:
print(df[0])
print(df[0].max())
Will output:
0 3.385
1 0.48
2 1.35
3 465
4 36.33
5 27.66
6 14.83
7 1.04
8 4.19
9 0.425
10 0.101
11 0.92
12 1
13 0.005
14 0.06
15 3.5
16 2
17 1.7
18 2547
19 0.023
20 187.1
21 521
22 0.785
23 10
24 3.3
25 0.2
26 1.41
27 529
28 207
29 85
...
32 6654
33 3.5
34 6.8
35 35
36 4.05
37 0.12
38 0.023
39 0.01
40 1.4
41 250
42 2.5
43 55.5
44 100
45 52.16
46 10.55
47 0.55
48 60
49 3.6
50 4.288
51 0.28
52 0.075
53 0.122
54 0.048
55 192
56 3
57 160
58 0.9
59 1.62
60 0.104
61 4.235
Name: 0, Length: 62, dtype: object
Max: 85
Obviously, the maximum value is way out -- it should be 6654, not 85.
What on earth am I doing wrong?
First StackOverflow post, so thanks in advance.
If you look at the end of your print() output, you'll see dtype: object. You'll also notice that your pandas Series appears to mix "int" values with "float" values (e.g. you have 6654 and 3.5 in the same Series).
These are good hints that you actually have a Series of strings, and that max() here is comparing them as strings, lexicographically. What you want is a Series of numbers (specifically floats), compared numerically.
Check the following reproducible example:
>>> df = pd.DataFrame({'col': ['0.02', '9', '85']}, dtype=object)
>>> df.col.max()
'9'
You can check that, because with string comparison:
>>> '9' > '85'
True
You want these values to be treated as floats instead. Use pd.to_numeric:
>>> df['col'] = pd.to_numeric(df.col)
>>> df.col.max()
85
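Applied to the DataFrame from your question (the column label is the integer 0, since from_records was given no header row), the same fix should give the expected maximum:

>>> df[0] = pd.to_numeric(df[0])
>>> df[0].max()
6654.0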
For more on str and int comparison, check this question
I have CSV files that I need to join together based on date, but the dates in each file are not the same (i.e. some files start on 1/1/1991 and others in 1998). I have a basic start to the code (see below), but I am not sure where to go from here. Any tips are appreciated. Below is a sample of the different CSVs I am trying to join.
import os, pandas as pd, glob

directory = r'C:\data\Monthly_Data'
files = os.listdir(directory)
print(files)

all_data = pd.DataFrame()
for f in glob.glob(directory):
    df = pd.read_csv(f)
    all_data = all_data.append(df, ignore_index=True)

all_data.describe()
File 1
DateTime F1_cfs F2_cfs F3_cfs F4_cfs F5_cfs F6_cfs F7_cfs
3/31/1991 0.860702028 1.167239264 0 0 0 0 0
4/30/1991 2.116930556 2.463493056 3.316688418
5/31/1991 4.056572581 4.544307796 5.562668011
6/30/1991 1.587513889 2.348215278 2.611659722
7/31/1991 0.55328629 1.089637097 1.132043011
8/31/1991 0.29702957 0.54186828 0.585073925 2.624375
9/30/1991 0.237083333 0.323902778 0.362583333 0.925563094 1.157786606 2.68722973 2.104090278
File 2
DateTime F1_mg-P_L F2_mg-P_L F3_mg-P_L F4_mg-P_L F5_mg-P_L F6_mg-P_L F7_mg-P_L
6/1/1992 0.05 0.05 0.06 0.04 0.03 0.18 0.08
7/1/1992 0.03 0.05 0.04 0.03 0.04 0.05 0.09
8/1/1992 0.02 0.03 0.02 0.02 0.02 0.02 0.02
File 3
DateTime F1_TSS_mgL F1_TVS_mgL F2_TSS_mgL F2_TVS_mgL F3_TSS_mgL F3_TVS_mgL F4_TSS_mgL F4_TVS_mgL F5_TSS_mgL F5_TVS_mgL F6_TSS_mgL F6_TVS_mgL F7_TSS_mgL F7_TVS_mgL
4/30/1991 10 7.285714286 8.5 6.083333333 3.7 3.1
5/31/1991 5.042553191 3.723404255 6.8 6.3 3.769230769 2.980769231
6/30/1991 5 5 1 1
7/31/1991
8/31/1991
9/30/1991 5.75 3.75 6.75 4.75 9.666666667 6.333333333 8.666666667 5 12 7.666666667 8 5.5 9 6.75
10/31/1991 14.33333333 9 14 10.66666667 16.25 11 12.75 9.25 10.25 7.25 29.33333333 18.33333333 13.66666667 9
11/30/1991 2.2 1.933333333 2 1.88 0 0 4.208333333 3.708333333 10.15151515 7.909090909 9.5 6.785714286 4.612903226 3.580645161
You didn't read the CSV files correctly.
1) You can comment out the following lines, because you never use them later in your code:
files = os.listdir(directory)
print(files)
2) glob.glob(directory) didn't return any matching files. glob.glob() takes a pattern as its argument, for example 'C:\data\Monthly_Data\File*.csv'; you passed a bare directory as the pattern, so no files are found here:
for f in glob.glob(directory):
I modified those two parts and printed all_data, and the file contents displayed on my console.
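A corrected sketch of the loading loop (the '*.csv' pattern is an assumption about how your monthly files are named, and pd.concat replaces the repeated append calls):

import glob
import os
import pandas as pd

directory = r'C:\data\Monthly_Data'

# Glob for a file pattern inside the directory, not the directory itself.
frames = [pd.read_csv(f) for f in glob.glob(os.path.join(directory, '*.csv'))]

# Concatenate once at the end instead of appending row by row.
all_data = pd.concat(frames, ignore_index=True)
print(all_data.describe())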
So I have a query: I'm accessing an API that gives the following response:
[["22014",201939,"0021401229","APR 15 2015",Team1 vs. Team2","W",
19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],["22014",201939,"0021401","APR
13 2015",Team1 vs. Team3","W",
15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],["22014",201939,"0021401192","APR
11 2015",Team1 vs. Team4","W",
22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]
I could just as easily create 16 different variables, assign zero to each, and print them out like in the following example:

sum_pts = 0
for n in range(0, len(shots_array)):  # range of games; these lengths vary per player
    sum_pts = sum_pts + float(json.dumps(shots_array[n][24]))
print sum_pts / float(len(shots_array))
Output:
>>>
23.75
But I'd rather not create 16 different variables to calculate the average of each individual element in this list. I'm looking for an easier way to get the averages for Team1.
I'd like the output to eventually look like one of the layouts below, so that I can apply this to any number of players or individual stats:
Team1 AVGPTS AVGAST AVGSTL AVGREB...
23.75 5.3 2.1 3.2
Or it could be:
Player1 AVGPTS AVGAST AVGSTL AVGREB ...
23.75 5.3 2.1 3.2 ...
To get the averages of the trailing numeric entries of each record (everything from index 6 onwards), you could use the following approach, which avoids the need to define a separate variable for each column:
data = [
    ["22014",201939,"0021401229","APR 15 2015","Team1 vs. Team2","W",19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],
    ["22014",201939,"0021401","APR 13 2015","Team1 vs. Team3","W",15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],
    ["22014",201939,"0021401192","APR 11 2015","Team1 vs. Team4","W",22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]

length = float(len(data))
values = []

for entry in data:
    values.append(entry[6:])

values = zip(*values)
averages = [sum(v) / length for v in values]

for col in averages:
    print "{:.2f} ".format(col),
This would display:
18.67 4.33 11.00 0.40 2.00 6.00 0.50 0.00 0.00 0.00 2.00 2.00 4.00 7.00 5.00 0.00 4.00 1.00 10.00 14.00 1.00
Note: your data is missing an opening quote before each of the "Team1 vs. TeamX" strings (they are added in the example above).
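If you want the labelled layout from your question, you could pair the averages with a list of stat names. The names and their order below are placeholders; the real order depends on the API's column layout:

# Hypothetical stat names; extend to cover all 21 trailing columns.
labels = ["MIN", "FGM", "FGA", "FG_PCT"]

for label, avg in zip(labels, averages):
    print "{} {:.2f}".format(label, avg)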
I'm using pandas to scrape a web page and iterate through a DataFrame object. Here's the function I'm calling:
def getTeamRoster(teamURL):
    teamPlayers = []
    table = pd.read_html(requests.get(teamURL).content)[4]
    nameTitle = '\n\t\t\t\tPlayers\n\t\t\t'
    ratingTitle = 'SinglesRating'
    finalTable = table[[nameTitle, ratingTitle]][:-1]
    print(finalTable)

    for index, row in finalTable:
        print(index, row)
I'm using the syntax advocated here:
http://www.swegler.com/becky/blog/2014/08/06/useful-pandas-snippets/
However, I'm getting this error:
File "SquashScraper.py", line 46, in getTeamRoster
for index, row in finalTable:
ValueError: too many values to unpack (expected 2)
For what it's worth, my finalTable prints as this:
\n\t\t\t\tPlayers\n\t\t\t SinglesRating
0 Browne,Noah 5.56
1 Ellis,Thornton 4.27
2 Line,James 4.25
3 Desantis,Scott J. 5.08
4 Bahadori,Cameron 4.97
5 Groot,Michael 4.76
6 Ehsani,Darian 4.76
7 Kardon,Max 4.83
8 Van,Jeremy 4.66
9 Southmayd,Alexander T. 4.91
10 Cacouris,Stephen A 4.68
11 Groot,Christopher 4.62
12 Mack,Peter D. (sub) 3.94
13 Shrager,Nathaniel O. 0.00
14 Woolverton,Peter C. 4.06
which looks right to me.
Any idea why Python doesn't like my syntax?
Thanks for the help,
bclayman
You probably want to try this:
for index, row in finalTable.iterrows():
    print(index, row)
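For context (my explanation, not part of the original answer): iterating over a DataFrame directly yields its column labels, so each item is a single string; Python then tries to unpack that string's characters into index and row, which raises the "too many values to unpack" error. iterrows() yields (index, Series) pairs, which unpack cleanly:

# Iterating the DataFrame itself yields column labels, not rows:
for col in finalTable:
    print(repr(col))  # '\n\t\t\t\tPlayers\n\t\t\t', then 'SinglesRating'

# iterrows() yields (index, row) pairs, where row is a Series:
for index, row in finalTable.iterrows():
    print(index, row['SinglesRating'])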