Extract data out of a csv file - python

I am trying to extract data out of a csv file and output the data to another csv file.
relate task perform
0 avc asd
1 12 24
2 34 54
3 22 33
4 11 11
5 335 534
Time A B C D
0 0.334 0.334 0.334 0.334
1 0.543 0.543 0.543 0.543
2 0.752 0.752 0.752 0.752
3 0.961 0.961 0.961 0.961
4 1.17 1.17 1.17 1.17
5 1.379 1.379 1.379 1.379
I am writing a Python script to read the above table. I want all the data from the row containing Time, A, B, C, and D onwards in a separate file.
import csv
import pandas as pd
import os
read_file = False
with open('xyz.csv', mode='r', encoding='utf-8') as f_read:
    reader = csv.reader(f_read)
    for row in reader:
        if 'Time' in row:
            pass  # stuck here
I am stuck here. I read all the data into 'reader', so all the rows should have been parsed inside 'reader'. Now, how can I extract the data from the line with Time onwards into a separate file?
Is there a better method to achieve the above objective?
Should I use pandas instead of regular python commands?
I read many similar answers on stackoverflow but I am confused on how to finish this problem. Your help is appreciated.
Best
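One way to do this with the csv module alone is to keep a flag that flips when the header row is reached, then write every row from that point on. This is a minimal sketch rather than the only approach: the function name copy_from_marker, the comma delimiter, and the in-memory sample are assumptions made for illustration (the table in the question may actually be space- or tab-separated, in which case a different delimiter would be passed).

```python
import csv
import io


def copy_from_marker(src, dst, marker="Time", delimiter=","):
    """Copy every row from the first row starting with `marker` to the end.

    Returns True if the marker row was found, False otherwise.
    """
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    found = False
    for row in reader:
        # Start copying at the first row whose first cell is the marker.
        if not found and row and row[0].strip() == marker:
            found = True
        if found:
            writer.writerow(row)
    return found


# In-memory demonstration with hypothetical data shaped like the question's.
sample = io.StringIO(
    "relate,task,perform\n"
    "0,avc,asd\n"
    "1,12,24\n"
    "Time,A,B,C,D\n"
    "0,0.334,0.334,0.334,0.334\n"
    "1,0.543,0.543,0.543,0.543\n"
)
result = io.StringIO()
copy_from_marker(sample, result)
print(result.getvalue())
```

With real files, both would be opened with newline='' as the csv documentation recommends, for example open('xyz.csv', newline='') for the input.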

Related

Turn an HTML table into a CSV file

How do I turn a table like this--batting gamelogs table--into a CSV file using Python and BeautifulSoup?
I want the first header where it says Rk, Gcar, Gtm, etc. and not any of the other headers within the table (the ones for each month of the season).
Here is the code I have so far:
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting"
        table = soup.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']== (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()
if __name__ == "__main__":
    stir_the_soup()
The id_nums list contains all of the id numbers for each player to use as the names for the separate CSV files.
For each row, the leftmost cell is a <th> tag and the rest of the row is <td> tags. How do I put those into one row, in addition to the header?
This code gets you the big table of stats, which is what I think you want.
Make sure you have lxml, beautifulsoup4 and pandas installed.
import pandas as pd
# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here are the first 5 rows of the output. You may need to clean it up slightly, as I don't know what your exact end goal is:
df[4].head(5)
Rk Gcar Gtm Date Tm Unnamed: 5 Opp Rslt Inngs PA ... CS BA OBP SLG OPS BOP aLI WPA RE24 Pos
0 1 66 2 (1) Apr 6 ARI NaN SDP L,3-6 7-8 1 ... 0 1.000 1.000 1.000 2.000 9 .94 0.041 0.51 PH
1 2 67 3 Apr 7 ARI NaN SDP W,5-3 7-8 1 ... 0 .500 .500 .500 1.000 9 1.16 -0.062 -0.79 PH
2 3 68 4 Apr 9 ARI NaN PIT W,9-1 8-GF 1 ... 0 .667 .667 .667 1.333 2 .00 0.000 0.13 PH SS
3 4 69 5 Apr 10 ARI NaN PIT L,3-6 CG 4 ... 0 .500 .429 .500 .929 2 1.30 -0.040 -0.37 SS
4 5 70 7 (1) Apr 13 ARI # LAD L,5-9 6-6 1 ... 0 .429 .375 .429 .804 9 1.52 -0.034 -0.46 PH
To select a certain column within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5)
Example: df[4]['Gcar']
Also, if typing df[4] gets annoying, you can always assign it to another variable: df2 = df[4]
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html=urllib2.urlopen(url)
bs = BeautifulSoup(html,'lxml')
table = str(bs.find('table',{'id':'batting_gamelogs'}))
dfs = pd.read_html(table)
This uses Pandas, which is pretty useful for stuff like this. It also puts it in a pretty reasonable format to do other operations on.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
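For the original question of getting the leading th cell and the td cells of a row into a single list, and keeping only the first header, a sketch along the following lines may help. It uses the standard library's html.parser under Python 3 instead of BeautifulSoup, the class and function names are made up for illustration, and it assumes that the in-table headers are exact copies of the first header row, which may not hold for every page.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every th/td cell, one list per tr."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the tr being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # th and td cells land in the same row list, in document order.
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def drop_repeated_headers(rows):
    """Keep the first row as the header and drop later copies of it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


# Hypothetical, heavily shortened stand-in for the game log table.
html = """
<table id="batting_gamelogs">
<thead><tr><th>Rk</th><th>Gcar</th><th>Date</th></tr></thead>
<tbody>
<tr><th>1</th><td>66</td><td>Apr 6</td></tr>
<tr><th>Rk</th><th>Gcar</th><th>Date</th></tr>
<tr><th>2</th><td>67</td><td>Apr 7</td></tr>
</tbody>
</table>
"""
parser = TableRows()
parser.feed(html)
rows = drop_repeated_headers(parser.rows)
print(rows)
```

If you would rather stay with BeautifulSoup, passing a list of tag names, as in row.find_all(['th', 'td']), returns both kinds of cell in document order, so one append per row is enough.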

How do I get only the lines that have the highest value if they are inside a time window?

I am new to Python and scripting in general, so I would really appreciate some guidance in writing a Python script.
So, to the point:
I have a large number of files in a directory. Some files are empty; others contain rows like this:
16 2009-09-30T20:07:59.659Z 0.05 0.27 13.559 6
16 2009-09-30T20:08:49.409Z 0.22 0.312 15.691 7
16 2009-09-30T20:12:17.409Z -0.09 0.235 11.826 4
16 2009-09-30T20:12:51.159Z 0.15 0.249 12.513 6
16 2009-09-30T20:15:57.209Z 0.16 0.234 11.776 4
16 2009-09-30T20:21:17.109Z 0.38 0.303 15.201 6
16 2009-09-30T20:23:47.959Z 0.07 0.259 13.008 5
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
16 2009-09-30T20:37:48.609Z -0.02 0.256 12.861 4
16 2009-09-30T20:44:19.359Z 0.14 0.251 12.597 4
16 2009-09-30T20:48:39.759Z 0.03 0.284 14.244 5
16 2009-09-30T20:49:36.159Z -0.07 0.278 13.98 4
16 2009-09-30T20:57:54.609Z 0.01 0.304 15.294 4
16 2009-09-30T20:59:47.759Z 0.27 0.265 13.333 4
16 2009-09-30T21:02:56.209Z 0.28 0.272 13.645 6
and so on.
I want to copy these lines out of the files into a new file, but there are some conditions.
If two or more successive lines fall inside a time window of 6 seconds, then only the line with the highest threshold should be written to the new file.
So, something like this:
Original:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5
in output file:
16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5
Keep in mind that lines from different files can fall inside the same 6-second window, so the line written to the output should be the one with the highest threshold across all files.
The code below shows what each field in a line means:
import glob
from datetime import datetime
path = './*.cat'
files=glob.glob(path)
for file in files:
    in_file=open(file, 'r')
    out_file = open("times_final", "w")
    for line in in_file.readlines():
        split_line = line.strip().split(' ')
        template_number = split_line[0]
        t = datetime.strptime(split_line[1], '%Y-%m-%dT%H:%M:%S.%fZ')
        mag = split_line[2]
        num = split_line[3]
        threshold = float(split_line[4])
        no_detections = split_line[5]
    in_file.close()
    out_file.close()
Thank you very much for hints, guidelines, ...
You said in the comments that you know how to merge multiple files into one sorted by t, and that the 6-second windows start with the first row and are based on the actual data.
So you need a way to remember the maximum threshold per window and to write a row only after you are sure you have processed all rows in that window. Sample implementation:
from datetime import datetime, timedelta
from csv import DictReader, DictWriter
fieldnames=("template_number", "t", "mag","num", "threshold", "no_detections")
with open('master_data') as f_in, open("times_final", "w") as f_out:
    reader = DictReader(f_in, delimiter=" ", fieldnames=fieldnames)
    writer = DictWriter(f_out, delimiter=" ", fieldnames=fieldnames,
                        lineterminator="\n")
    window_start = datetime(1900, 1, 1)
    window_timedelta = timedelta(seconds=6)
    window_max = 0
    window_row = None
    for row in reader:
        try:
            t = datetime.strptime(row["t"], "%Y-%m-%dT%H:%M:%S.%fZ")
            threshold = float(row["threshold"])
        except ValueError:
            # replace by actual error handling
            print("Problem with: {}".format(row))
            # skip the row; otherwise t and threshold would be stale or undefined
            continue
        # switch to new window after 6 seconds
        if t - window_start > window_timedelta:
            # write out previous window before switching
            if window_row:
                writer.writerow(window_row)
            window_start = t
            window_max = threshold
            window_row = row
        # remember max threshold inside a single window
        elif threshold > window_max:
            window_max = threshold
            window_row = row
    # don't forget the last window
    if window_row:
        writer.writerow(window_row)
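The implementation above reads from a single file that is already sorted by t. If each input file is itself sorted by time, one possible way to produce that merged stream is heapq.merge from the standard library, which merges lazily instead of loading everything into memory. This is only a sketch: the names TIME_FORMAT, timestamp and merge_by_time are made up here, and splitting the question's rows across two in-memory "files" is an assumption for the demonstration.

```python
import heapq
import io
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def timestamp(line):
    """Parse the second whitespace-separated field of a line as a datetime."""
    return datetime.strptime(line.split()[1], TIME_FORMAT)


def merge_by_time(*sources):
    """Lazily merge line iterables that are each already sorted by time."""
    # Drop blank lines so that empty files and trailing newlines are harmless.
    cleaned = [(line for line in src if line.strip()) for src in sources]
    return heapq.merge(*cleaned, key=timestamp)


# Two hypothetical input files, each sorted by time.
file_a = io.StringIO(
    "16 2009-09-30T20:07:59.659Z 0.05 0.27 13.559 6\n"
    "16 2009-09-30T20:32:10.109Z 0.0 0.283 14.195 5\n"
)
file_b = io.StringIO(
    "16 2009-09-30T20:08:49.409Z 0.22 0.312 15.691 7\n"
    "16 2009-09-30T20:32:10.309Z 0.0 0.239 12.009 5\n"
)
merged = list(merge_by_time(file_a, file_b))
for line in merged:
    print(line, end="")
```

With files on disk, the open file objects returned for each path from glob.glob('./*.cat') could be passed in the same way, and the merged lines written to master_data before running the windowing code.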

Using Pandas in Python to Join Multiple Files Based on Date

I have CSV files that I need to join together based on date, but the dates in each file are not the same (i.e. some files start on 1/1/1991 and others in 1998). I have a basic start to the code (see below), but I am not sure where to go from here. Any tips are appreciated. Below the code is a sample of the different CSV files I am trying to join.
import os, pandas as pd, glob
directory = r'C:\data\Monthly_Data'
files = os.listdir(directory)
print(files)
all_data = pd.DataFrame()
for f in glob.glob(directory):
    df = pd.read_csv(f)
    all_data = all_data.append(df, ignore_index=True)
all_data.describe()
File 1
DateTime F1_cfs F2_cfs F3_cfs F4_cfs F5_cfs F6_cfs F7_cfs
3/31/1991 0.860702028 1.167239264 0 0 0 0 0
4/30/1991 2.116930556 2.463493056 3.316688418
5/31/1991 4.056572581 4.544307796 5.562668011
6/30/1991 1.587513889 2.348215278 2.611659722
7/31/1991 0.55328629 1.089637097 1.132043011
8/31/1991 0.29702957 0.54186828 0.585073925 2.624375
9/30/1991 0.237083333 0.323902778 0.362583333 0.925563094 1.157786606 2.68722973 2.104090278
File 2
DateTime F1_mg-P_L F2_mg-P_L F3_mg-P_L F4_mg-P_L F5_mg-P_L F6_mg-P_L F7_mg-P_L
6/1/1992 0.05 0.05 0.06 0.04 0.03 0.18 0.08
7/1/1992 0.03 0.05 0.04 0.03 0.04 0.05 0.09
8/1/1992 0.02 0.03 0.02 0.02 0.02 0.02 0.02
File 3
DateTime F1_TSS_mgL F1_TVS_mgL F2_TSS_mgL F2_TVS_mgL F3_TSS_mgL F3_TVS_mgL F4_TSS_mgL F4_TVS_mgL F5_TSS_mgL F5_TVS_mgL F6_TSS_mgL F6_TVS_mgL F7_TSS_mgL F7_TVS_mgL
4/30/1991 10 7.285714286 8.5 6.083333333 3.7 3.1
5/31/1991 5.042553191 3.723404255 6.8 6.3 3.769230769 2.980769231
6/30/1991 5 5 1 1
7/31/1991
8/31/1991
9/30/1991 5.75 3.75 6.75 4.75 9.666666667 6.333333333 8.666666667 5 12 7.666666667 8 5.5 9 6.75
10/31/1991 14.33333333 9 14 10.66666667 16.25 11 12.75 9.25 10.25 7.25 29.33333333 18.33333333 13.66666667 9
11/30/1991 2.2 1.933333333 2 1.88 0 0 4.208333333 3.708333333 10.15151515 7.909090909 9.5 6.785714286 4.612903226 3.580645161
You didn't read the CSV files correctly.
1) You can comment out the following lines, because you never use their result later in your code.
files = os.listdir(directory)
print(files)
2) glob.glob(directory) doesn't return any of the files. glob.glob() takes a pattern as its argument, for example 'C:\data\Monthly_Data\File*.csv'. You passed a bare directory as the pattern, so none of the files inside it are matched:
for f in glob.glob(directory):
I modified those two parts and printed all_data, and the file contents were displayed on my console.
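Once the files are being read, the join on date itself could look something like the sketch below. It assumes every file has a DateTime column, as in the samples, and that a column-wise outer join is what is wanted; the function name join_on_date and the two tiny in-memory files are invented for illustration. It uses pd.concat because DataFrame.append was removed in pandas 2.0.

```python
import io

import pandas as pd


def join_on_date(sources, date_column="DateTime"):
    """Outer-join CSV sources column-wise on their date column."""
    frames = [
        pd.read_csv(src, parse_dates=[date_column], index_col=date_column)
        for src in sources
    ]
    # axis=1 aligns rows by date; dates missing from a file become NaN.
    return pd.concat(frames, axis=1, join="outer").sort_index()


# Hypothetical, shortened versions of File 1 and File 2.
flow = io.StringIO(
    "DateTime,F1_cfs,F2_cfs\n"
    "3/31/1991,0.86,1.17\n"
    "4/30/1991,2.12,2.46\n"
)
phosphorus = io.StringIO(
    "DateTime,F1_mg-P_L\n"
    "4/30/1991,0.05\n"
    "6/1/1992,0.03\n"
)
combined = join_on_date([flow, phosphorus])
print(combined)
```

For files on disk, the list passed in could be glob.glob(os.path.join(directory, '*.csv')). Note that rows only line up when the dates match exactly, so month-end dates in one file and first-of-month dates in another would need to be normalized first.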

Pandas issue iterating over DataFrame

I'm using pandas to scrape a web page and iterate through a DataFrame object. Here's the function I'm calling:
def getTeamRoster(teamURL):
    teamPlayers = []
    table = pd.read_html(requests.get(teamURL).content)[4]
    nameTitle = '\n\t\t\t\tPlayers\n\t\t\t'
    ratingTitle = 'SinglesRating'
    finalTable = table[[nameTitle, ratingTitle]][:-1]
    print(finalTable)
    for index, row in finalTable:
        print(index, row)
I'm using the syntax advocated here:
http://www.swegler.com/becky/blog/2014/08/06/useful-pandas-snippets/
However, I'm getting this error:
File "SquashScraper.py", line 46, in getTeamRoster
for index, row in finalTable:
ValueError: too many values to unpack (expected 2)
For what it's worth, my finalTable prints as this:
\n\t\t\t\tPlayers\n\t\t\t SinglesRating
0 Browne,Noah 5.56
1 Ellis,Thornton 4.27
2 Line,James 4.25
3 Desantis,Scott J. 5.08
4 Bahadori,Cameron 4.97
5 Groot,Michael 4.76
6 Ehsani,Darian 4.76
7 Kardon,Max 4.83
8 Van,Jeremy 4.66
9 Southmayd,Alexander T. 4.91
10 Cacouris,Stephen A 4.68
11 Groot,Christopher 4.62
12 Mack,Peter D. (sub) 3.94
13 Shrager,Nathaniel O. 0.00
14 Woolverton,Peter C. 4.06
which looks right to me.
Any idea why python doesn't like my syntax?
Thanks for the help,
bclayman
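Iterating over a DataFrame directly yields its column labels, not its rows, so Python tries to unpack each label string into index and row, which raises the ValueError. You probably want to try this:
for index, row in finalTable.iterrows():
    print(index, row)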
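To make the difference concrete, here is a small sketch with a made-up two-row table (the column name Players stands in for the whitespace-padded label in the question). It contrasts plain iteration, iterrows(), and itertuples(), the last of which is usually faster when only the values are needed.

```python
import pandas as pd

# Hypothetical stand-in for finalTable.
table = pd.DataFrame(
    {
        "Players": ["Browne,Noah", "Ellis,Thornton"],
        "SinglesRating": [5.56, 4.27],
    }
)

# Iterating over the DataFrame itself yields the column labels.
labels = [label for label in table]

# iterrows() yields (index, Series) pairs, one per row.
pairs = [
    (index, row["Players"], row["SinglesRating"])
    for index, row in table.iterrows()
]

# itertuples(index=False, name=None) yields one plain tuple per row.
tuples = list(table.itertuples(index=False, name=None))

print(labels)
print(pairs)
print(tuples)
```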

Delimiter error from np.loadtxt Python

I'm trying to sum some values in a list, so I loaded the .dat file that contains the values, but the only way I could get Python to sum the data was by separating it with ','. This is what I get:
altura = np.loadtxt("bio.dat",delimiter=',',usecols=(5,),dtype='float')
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 846, in loadtxt
vals = [vals[i] for i in usecols]
IndexError: list index out of range
This is my code
import numpy as np
altura = np.loadtxt("bio.dat",delimiter=',',usecols=(5,),dtype='str')
print altura
And this is the file 'bio.dat'
1 Katherine Oquendo M 18 1.50 50
2 Pablo Restrepo H 20 1.83 79
3 Ana Agudelo M 18 1.58 45
4 Kevin Vargas H 20 1.74 80
5 Pablo madrid H 20 1.70 55
What I intend to do is
x=sum(altura)
What should I do about the separator?
In my case, some lines included a # character.
numpy then ignores the rest of the line, because # marks a comment. So try again with the comments parameter, like
altura = np.loadtxt("bio.dat",delimiter=',',usecols=(5,),dtype='str',comments='')
I also recommend not using np.loadtxt, because it's incredibly slow if you must process a large (>1M lines) file.
Alternatively, you can convert the tab delimited file to csv first.
csv supports tab delimited files. Supply the delimiter argument to reader:
import csv
txt_file = r"mytxt.txt"
csv_file = r"mycsv.csv"
# use 'with' if the program isn't going to immediately terminate
# so you don't leave files open
# the 'b' is necessary on Windows
# it prevents \x1a, Ctrl-z, from ending the stream prematurely
# and also stops Python converting to / from different line terminators
# On other platforms, it has no effect
in_txt = csv.reader(open(txt_file, "rb"), delimiter = '\t')
out_csv = csv.writer(open(csv_file, 'wb'))
out_csv.writerows(in_txt)
This answer is not my work; it is the work of agf found at https://stackoverflow.com/a/10220428/3767980.
The file doesn't need to be comma separated. Here's my sample run, using StringIO to simulate a file. I assume you want to sum the numbers that look like a person's height (in meters).
In [17]: from StringIO import StringIO
In [18]: s="""\
1 Katherine Oquendo M 18 1.50 50
2 Pablo Restrepo H 20 1.83 79
3 Ana Agudelo M 18 1.58 45
4 Kevin Vargas H 20 1.74 80
5 Pablo madrid H 20 1.70 55
"""
In [19]: S=StringIO(s)
In [20]: data=np.loadtxt(S,dtype=float,usecols=(5,))
In [21]: data
Out[21]: array([ 1.5 , 1.83, 1.58, 1.74, 1.7 ])
In [22]: np.sum(data)
Out[22]: 8.3499999999999996
As a script (with the data in a .txt file):
import numpy as np
fname = 'stack25828405.txt'
data=np.loadtxt(fname,dtype=float,usecols=(5,))
print data
print np.sum(data)
2119:~/mypy$ python2.7 stack25828405.py
[ 1.5 1.83 1.58 1.74 1.7 ]
8.35
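If numpy is not required, the same sum can be sketched with plain Python 3 by splitting each line on whitespace. The helper name sum_column is made up for this example, and it assumes every row has the same number of fields before the height (a name with more than two words would shift the column).

```python
import io


def sum_column(lines, column):
    """Sum one whitespace-separated column, skipping blank lines."""
    return sum(float(line.split()[column]) for line in lines if line.strip())


# The contents of bio.dat from the question, simulated in memory.
bio = io.StringIO(
    "1 Katherine Oquendo M 18 1.50 50\n"
    "2 Pablo Restrepo H 20 1.83 79\n"
    "3 Ana Agudelo M 18 1.58 45\n"
    "4 Kevin Vargas H 20 1.74 80\n"
    "5 Pablo madrid H 20 1.70 55\n"
)
total = sum_column(bio, 5)
print(round(total, 2))
```

With the real file, the open file object from open('bio.dat') can be passed in place of the StringIO.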
