Opening a space(?) delimited text file in Python 2.7? - python

I have what I think is a space-delimited text file that I would like to open and copy some of the data into lists (Python 2.7). This is a snippet of the data file:
0.000000 11.00 737.09 1.00 1116.00
0.001000 14.00 669.29 10.00 613.70
0.002000 15.00 962.27 2.00 623.50
0.003000 7.00 880.86 7.00 800.71
0.004000 9.00 634.67 3.00 1045.00
0.005000 12.00 614.67 3.00 913.33
0.006000 12.00 782.58 6.00 841.00
0.007000 13.00 860.08 6.00 354.00
0.008000 14.00 541.07 4.00 665.25
0.009000 14.00 763.00 6.00 1063.00
0.010000 9.00 790.33 6.00 857.83
0.011000 6.00 899.83 4.00 1070.75
0.012000 16.00 710.88 10.00 809.90
0.013000 12.00 863.50 7.00 923.14
0.014000 9.00 591.67 6.00 633.17
0.015000 12.00 740.58 6.00 837.00
0.016000 10.00 727.60 7.00 758.00
0.017000 12.00 838.75 4.00 638.75
0.018000 9.00 991.33 7.00 731.57
0.019000 12.00 680.75 5.00 1079.40
0.020000 15.00 843.20 3.00 546.00
0.021000 11.00 795.18 5.00 1317.20
0.022000 9.00 943.33 5.00 911.00
0.023000 13.00 711.23 3.00 981.67
0.024000 11.00 922.73 5.00 1111.00
0.025000 1112.00 683.58 6.00 542.83
0.026000 15.00 1053.80 5.00 1144.40
Below is the code I have tried, which does not work. I would like to have two lists, one each from the second and fourth columns.
import csv

listb = []
listd = []
with open('data_file.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for a, b, c, d, e in reader:
        listb.append(int(b))
        listd.append(int(d))
What am I doing wrong?

One alternative is to take advantage of the built-in str.split():
a, b, c, d, e = zip(*(map(float, line.split()) for line in open('data_file.txt')))
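Each of a through e is then a tuple of floats, one per column. If you need the question's two integer lists, you could convert afterwards, e.g.:

listb = [int(v) for v in b]  # second column
listd = [int(v) for v in d]  # fourth column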

The problem is the multiple spaces between fields (columns).
CSV stands for comma-separated values. Imagine for a second that you are using commas instead of spaces. Line 1 in your file would then look like:
,,,,0.000000,,,,,,,11.00,,,,,,737.09,,,,,,,1.00,,,,,1116.00
So, the CSV reader sees more than 5 fields (columns) in that row.
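You can see this directly: with delimiter=' ', every single space is a delimiter, so each run of extra spaces produces empty fields (a minimal demonstration; csv.reader accepts any iterable of lines):

import csv

row = next(csv.reader(["0.000000    11.00"], delimiter=' '))
print row  # ['0.000000', '', '', '', '11.00']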
You have two options:
Switch to using single space separators
Use a simple split() to deal with multiple whitespace:
listb = []
listd = []
with open('text', 'r') as file:
    for row in file:
        a, b, c, d, e = row.split()
        listb.append(int(b))
        listd.append(int(d))
P.S.: Once this part is working, you will run into a problem calling int() on strings like "11.00", which aren't really integers.
So I recommend using something like:
int(float(b))
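Putting both fixes together, a minimal sketch of the corrected loop:

listb = []
listd = []
with open('data_file.txt', 'r') as f:
    for row in f:
        a, b, c, d, e = row.split()
        listb.append(int(float(b)))  # "11.00" -> 11.0 -> 11
        listd.append(int(float(d)))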

import re

with open("input.txt", 'r') as f:
    x = f.readlines()

list1 = []
list2 = []
pattern = re.compile(r"(\d+)(?=\.)")  # digits immediately followed by a '.'
for line in x:
    li = pattern.findall(line)
    list1.append(li[1])  # integer part of the second column
    list2.append(li[3])  # integer part of the fourth column
You can use this if you only want to capture the integer parts and not the full floats.
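For example, on the first data line the lookahead keeps only the digits before each decimal point:

import re

print re.findall(r"(\d+)(?=\.)", "0.000000 11.00 737.09 1.00 1116.00")
# ['0', '11', '737', '1', '1116']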

You can find all the values you need using a regexp:
import re

list_b = []
list_d = []
with open('C://data_file.txt', 'r') as f:
    for line in f:
        list_line = re.findall(r"[\d.]+", line)  # runs of digits and dots
        list_b.append(float(list_line[1]))  # appends second column
        list_d.append(float(list_line[3]))  # appends fourth column
print list_b
print list_d

Related

How to sum all rows from multiple columns

I want to do several operations that are repeated across several columns, but I can't manage it with a list comprehension or with a loop.
The dataframe I have is concern_polls, and I want to rescale the percentages and compute the total amounts.
text very somewhat \
0 How concerned are you that the coronavirus wil... 19.00 33.00
1 How concerned are you that the coronavirus wil... 26.00 32.00
2 Taking into consideration both your risk of co... 13.00 26.00
3 How concerned are you that the coronavirus wil... 23.00 32.00
4 How concerned are you that you or someone in y... 11.00 24.00
.. ... ... ...
625 How worried are you personally about experienc... 33.09 36.55
626 How do you feel about the possibility that you... 30.00 31.00
627 Are you concerned, or not concerned about your... 34.00 35.00
628 Are you personally afraid of contracting the C... 28.00 32.00
629 Taking into consideration both your risk of co... 22.00 40.00
not_very not_at_all url
0 23.00 11.00 https://morningconsult.com/wp-content/uploads/...
1 25.00 7.00 https://morningconsult.com/wp-content/uploads/...
2 43.00 18.00 https://d25d2506sfb94s.cloudfront.net/cumulus_...
3 24.00 9.00 https://morningconsult.com/wp-content/uploads/...
4 33.00 20.00 https://projects.fivethirtyeight.com/polls/202...
.. ... ... ...
625 14.92 12.78 https://docs.google.com/spreadsheets/d/1cIEEkz...
626 14.00 16.00 https://www.washingtonpost.com/context/jan-10-...
627 19.00 12.00 https://drive.google.com/file/d/1H3uFRD7X0Qttk...
628 16.00 15.00 https://leger360.com/wp-content/uploads/2021/0...
629 21.00 16.00 https://docs.cdn.yougov.com/4k61xul7y7/econTab...
[630 rows x 15 columns]
The variables very, somewhat, not_very and not_at_all are percentages of the column sample_size, which is not shown in the sample above. The percentages don't always add up to 100%, so I want to rescale them.
To do this, I take the following steps: I calculate the sum of the four columns into a sums variable, then the rescaled percentage for each column (this step could stay a variable rather than becoming a new column in the df), and finally I calculate the final amounts.
The code I have so far is this:
sums = concern_polls['very'] + concern_polls['somewhat'] + concern_polls['not_very'] + concern_polls['not_at_all']
concern_polls['Very'] = concern_polls['very'] / sums * 100
concern_polls['Somewhat'] = concern_polls['somewhat'] / sums * 100
concern_polls['Not_very'] = concern_polls['not_very'] / sums * 100
concern_polls['Not_at_all'] = concern_polls['not_at_all'] / sums * 100
concern_polls['Total_Very'] = concern_polls['Very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Somewhat'] = concern_polls['Somewhat'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_very'] = concern_polls['Not_very'] / 100 * concern_polls['sample_size']
concern_polls['Total_Not_at_all'] = concern_polls['Not_at_all'] / 100 * concern_polls['sample_size']
I have tried to write this with a list comprehension but I can't get it to work.
Could someone give me a suggestion?
The problem I keep running into is that I want to sum rows across several columns, and then repeat the same operations on several columns, but those columns are not all of the df.
Thank you.
df['newcolumn'] = df.apply(lambda row: function(row), axis=1)
is your friend here I think.
"axis=1" means it does it row by row.
As an example:
concern_polls['Very'] = concern_polls.apply(lambda row: row['very'] / sums * 100, axis=1)
And if you want sums to be the grand total across those df columns, it'll be:
sums = concern_polls[['very', 'somewhat', 'not_very', 'not_at_all']].sum().sum()
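That said, if the goal is simply to avoid repeating the question's eight assignments, a plain loop over the column names also works (a sketch that keeps the question's row-wise sums):

cols = ['very', 'somewhat', 'not_very', 'not_at_all']
row_sums = concern_polls[cols].sum(axis=1)  # row-wise total of the four columns
for c in cols:
    rescaled = concern_polls[c] / row_sums * 100
    concern_polls[c.capitalize()] = rescaled
    concern_polls['Total_' + c.capitalize()] = rescaled / 100 * concern_polls['sample_size']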

Encoding data with LabelEncoder()

I'm having the following dataset as a csv file.
Dataset ecoli.csv:
seq_name,mcg,gvh,lip,chg,aac,alm1,alm2,class
AAT_ECOLI,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
ACEA_ECOLI,0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
(more entries...)
ACKA_ECOLI,0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
ADI_ECOLI,0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
My purpose for this dataset is to apply some classification algorithms. To prepare the ecoli.csv file, I'm trying to move the class column to the front and drop the seq_name column. Then I print a check for null values. Afterwards I plot with the help of the sns library.
Code before error:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

column_drop = 'seq_name'
dataframe = pd.read_csv('ecoli.csv', header='infer')
dataframe.drop(column_drop, axis=1, inplace=True)  # dropping columns that I don't need
print(dataframe.isnull().sum())
plt.figure(figsize=(10, 8))
sns.heatmap(dataframe.corr(), annot=True)
plt.show()
Before the encoding, and the error I'm facing, I group the values of the dataset by class. Finally I try to encode the dataset with LabelEncoder, but an error appears:
Error code:
from sklearn import preprocessing

result = dataframe.groupby(by=("class")).sum().reset_index()
print(result)
le = preprocessing.LabelEncoder()
dataframe.result = le.fit_transform(dataframe.result)  # <- this line raises the error below
print(result)
Error:
AttributeError: 'DataFrame' object has no attribute 'result'
Update: result is filled with the following data:
class mcg gvh lip chg aac alm1 alm2
0 cp 51.99 58.59 68.64 71.5 64.99 44.71 56.52
1 im 36.84 38.24 37.48 38.5 41.28 58.33 56.24
2 imL 1.45 0.94 2.00 1.5 0.91 1.29 1.14
3 imS 1.48 1.02 0.96 1.0 1.07 1.28 1.14
4 imU 25.41 16.06 17.32 17.5 19.56 26.04 26.18
5 om 13.45 14.20 10.12 10.0 14.78 9.25 6.11
6 omL 3.49 2.56 5.00 2.5 2.71 2.82 1.11
7 pp 33.91 36.39 24.96 26.0 22.71 24.34 19.47
Desired output:
Any thoughts?
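For what it's worth, the traceback comes from dataframe.result: attribute access looks for a column named result on the dataframe, not the local variable result. A minimal sketch of what was probably intended, assuming the goal is to encode the class column of the grouped result:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
result['class'] = le.fit_transform(result['class'])  # cp, im, ... -> 0, 1, ...
print(result)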

Pairwise Elements Using Python - Calculating Average of individual elements of array

So I have a query; I am accessing an API that gives the following response:
[["22014",201939,"0021401229","APR 15 2015",Team1 vs. Team2","W",
19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],["22014",201939,"0021401","APR
13 2015",Team1 vs. Team3","W",
15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],["22014",201939,"0021401192","APR
11 2015",Team1 vs. Team4","W",
22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]
I could just as easily have 16 different variables that I assign zero to, then print them out like the following example:
import json

sum_pts = 0
for n in range(0, len(shots_array)):  # range of games; these lengths vary per player
    sum_pts = sum_pts + float(json.dumps(shots_array[n][24]))
print sum_pts / float(len(shots_array))
Output:
>>>
23.75
But I'd rather not create 16 different variables to calculate the average of each individual element in this list. I'm looking for an easier way to get the averages for Team1.
I would like the output to eventually look like this, so that I can apply it to any number of players or individual stats:
Team1 AVGPTS AVGAST AVGSTL AVGREB...
23.75 5.3 2.1 3.2
Or it could be:
Player1 AVGPTS AVGAST AVGSTL AVGREB ...
23.75 5.3 2.1 3.2 ...
To get the averages of the numeric values in each entry (everything from index 6 onward), you could use the following approach, which avoids having to define a separate variable for each column:
data = [
    ["22014",201939,"0021401229","APR 15 2015","Team1 vs. Team2","W",19,4,10,0.4,2,4,0.5,0,0,0,2,2,4,7,5,0,2,1,10,14,1],
    ["22014",201939,"0021401","APR 13 2015","Team1 vs. Team3","W",15,4,13,0.4,2,8,0.5,0,0,0,2,2,4,7,5,0,8,1,12,14,1],
    ["22014",201939,"0021401192","APR 11 2015","Team1 vs. Team4","W",22,5,10,0.4,2,6,0.5,0,0,0,2,2,4,7,5,0,2,1,8,14,1]]

length = float(len(data))
values = []
for entry in data:
    values.append(entry[6:])  # keep only the numeric stats
values = zip(*values)         # transpose: one tuple per column
averages = [sum(v) / length for v in values]
for col in averages:
    print "{:.2f}".format(col),
This would display:
18.67 4.33 11.00 0.40 2.00 6.00 0.50 0.00 0.00 0.00 2.00 2.00 4.00 7.00 5.00 0.00 4.00 1.00 10.00 14.00 1.00
Note, your data is missing an opening quote before each of the "Team1 vs. ..." strings.
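If you later want the labelled layout from the question, one option is to pair the averages with a list of stat names; the labels below are placeholders, since the API response in the question is unlabelled:

labels = ['AVGPTS', 'AVGAST', 'AVGSTL', 'AVGREB']  # hypothetical names and order
for name, value in zip(labels, averages):
    print name, "{:.2f}".format(value)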

how to merge rows from a .dat file using python

I have data like this in a .dat format
1 13 0.54
1 15 0.65
1 67 0.55
2 355 0.54
2 456 0.29
3 432 0.55
3 542 0.333
I want to merge the rows starting with 1, 2 and so on and want a final file like this:
1 13 0.54 15 0.65 67 0.55
2 355 0.54 456 0.29
3 432 0.55 542 0.333
Can someone please help me? I am new to Python, and unless I get the file into this format I cannot run my Abaqus code.
Explanation - first we split the file into lines, and then we split the lines on white space.
We then use itertools.groupby to group the lines by their first element.
We then take the values, ignore the first element and join on spaces, and prepend the key that we were grouping by and a space.
from itertools import groupby

with open("file.dat") as f:
    lines = [line.split() for line in f.readlines()]
filteredlines = [line for line in lines if len(line)]
for k, v in groupby(filteredlines, lambda x: x[0]):
    print k + " " + " ".join([" ".join(velem[1:]) for velem in v])
print
The Python CSV library can be used to both read your DAT file and also create your CSV file as follows:
import csv, itertools

with open("input.dat", "r") as f_input, open("output.csv", "wb") as f_output:
    csv_input = csv.reader(f_input, delimiter=" ", skipinitialspace=True)
    csv_output = csv.writer(f_output, delimiter=" ")
    dat = [line for line in csv_input if len(line)]
    for k, g in itertools.groupby(dat, lambda x: x[0]):
        csv_output.writerow([k] + list(itertools.chain.from_iterable([value[1:] for value in g])))
It produces an output file as follows:
1 13 0.54 15 0.65 67 0.55
2 355 0.54 456 0.29
3 432 0.55 542 0.333
Tested using Python 2.7

Download stocks data from google finance

I'm trying to download data from Google Finance for a list of stock symbols inside a .csv file.
This is the class that I'm trying to adapt from this site:
import urllib, time, datetime
import csv

class Quote(object):
    DATE_FMT = '%Y-%m-%d'
    TIME_FMT = '%H:%M:%S'

    def __init__(self):
        self.symbol = ''
        self.date, self.time, self.open_, self.high, self.low, self.close, self.volume = ([] for _ in range(7))

    def append(self, dt, open_, high, low, close, volume):
        self.date.append(dt.date())
        self.time.append(dt.time())
        self.open_.append(float(open_))
        self.high.append(float(high))
        self.low.append(float(low))
        self.close.append(float(close))
        self.volume.append(int(volume))

    def append_csv(self, filename):
        with open(filename, 'a') as f:
            f.write(self.to_csv())

    def __repr__(self):
        return self.to_csv()

    def get_symbols(self, filename):
        for line in open(filename, 'r'):
            if line != 'codigo':
                print line
                q = GoogleQuote(line, '2014-01-01', '2014-06-20')
                q.append_csv('data.csv')

class GoogleQuote(Quote):
    ''' Daily quotes from Google. Date format='yyyy-mm-dd' '''
    def __init__(self, symbol, start_date, end_date=datetime.date.today().isoformat()):
        super(GoogleQuote, self).__init__()
        self.symbol = symbol.upper()
        start = datetime.date(int(start_date[0:4]), int(start_date[5:7]), int(start_date[8:10]))
        end = datetime.date(int(end_date[0:4]), int(end_date[5:7]), int(end_date[8:10]))
        url_string = "http://www.google.com/finance/historical?q={0}".format(self.symbol)
        url_string += "&startdate={0}&enddate={1}&output=csv".format(
            start.strftime('%b %d, %Y'), end.strftime('%b %d, %Y'))
        csv = urllib.urlopen(url_string).readlines()
        csv.reverse()
        for bar in xrange(0, len(csv) - 1):
            try:
                #ds,open_,high,low,close,volume = csv[bar].rstrip().split(',')
                #open_,high,low,close = [float(x) for x in [open_,high,low,close]]
                #dt = datetime.datetime.strptime(ds,'%d-%b-%y')
                #self.append(dt,open_,high,low,close,volume)
                data = csv[bar].rstrip().split(',')
                dt = datetime.datetime.strptime(data[0], '%d-%b-%y')
                close = data[4]
                self.append(dt, close)  # note: Quote.append expects six values
            except:
                print "error " + str(len(csv) - 1)
                print "error " + csv[bar]

if __name__ == '__main__':
    q = Quote()  # create a generic quote object
    q.get_symbols('list.csv')
But for some quotes the code doesn't return all the data (e.g. BIOM3): some fields come back as '-'. How can I handle the split in these cases?
Lastly, at some point the script simply stops downloading data without returning any message. How can I handle this problem?
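One way to cope with both issues, sketched against the CSV lines the class already downloads (it reuses the commented-out parsing from the question): skip bars containing '-' and report, rather than swallow, rows that fail to parse:

for bar in xrange(0, len(csv) - 1):
    fields = csv[bar].rstrip().split(',')
    if '-' in fields:  # Google returns '-' for missing quotes
        continue
    try:
        ds, open_, high, low, close, volume = fields
        dt = datetime.datetime.strptime(ds, '%d-%b-%y')
        self.append(dt, open_, high, low, close, volume)
    except ValueError as exc:  # keep going, but say why a row was skipped
        print "skipping %r: %s" % (csv[bar], exc)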
It should work, but notice that the ticker should be: BVMF:ABRE11
In [250]:
import pandas.io.data as web
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
df=web.DataReader("BVMF:ABRE11", 'google', start, end)
print df.head(10)
Open High Low Close Volume
Date
2011-07-26 19.79 19.79 18.30 18.50 1843700
2011-07-27 18.45 18.60 17.65 17.89 1475100
2011-07-28 18.00 18.50 18.00 18.30 441700
2011-07-29 18.30 18.84 18.20 18.70 392800
2011-08-01 18.29 19.50 18.29 18.86 217800
2011-08-02 18.86 18.86 18.60 18.80 154600
2011-08-03 18.90 18.90 18.00 18.00 168700
2011-08-04 17.50 17.85 16.50 16.90 238700
2011-08-05 17.00 17.00 15.63 16.00 253000
2011-08-08 15.50 15.96 14.35 14.50 224300
[10 rows x 5 columns]
In [251]:
df=web.DataReader("BVMF:BIOM3", 'google', start, end)
print df.head(10)
Open High Low Close Volume
Date
2010-01-04 2.90 2.90 2.90 2.90 0
2010-01-05 3.00 3.00 3.00 3.00 0
2010-01-06 3.01 3.01 3.01 3.01 0
2010-01-07 3.01 3.09 3.01 3.09 2000
2010-01-08 3.01 3.01 3.01 3.01 0
2010-01-11 3.00 3.00 3.00 3.00 0
2010-01-12 3.00 3.00 3.00 3.00 0
2010-01-13 3.00 3.10 3.00 3.00 7000
2010-01-14 3.00 3.00 3.00 3.00 0
2010-01-15 3.00 3.00 3.00 3.00 1000
[10 rows x 5 columns]
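To tie this back to the list.csv loop in the question, a hedged sketch (reusing web, start and end from the session above) that skips the codigo header, as get_symbols does, and surfaces failures instead of stopping silently:

import csv
with open('list.csv') as f:
    for row in csv.reader(f):
        symbol = row[0].strip()
        if symbol == 'codigo':  # header line used in the question
            continue
        try:
            df = web.DataReader('BVMF:' + symbol, 'google', start, end)
            df.to_csv(symbol + '.csv')
        except Exception as exc:  # report instead of dying quietly
            print symbol, 'failed:', exc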
