I have a CSV file containing daily precipitation data, with 253 rows and 191 columns per day, so for one year I have 191 * 365 columns.
I want to extract the data for a certain row and column in my area of interest, for example row 20 and column 40 for the first day; for days 2, 3, 4 ... 365 the columns are the same distance apart.
I'm new to Python. Is there a way I can extract the data for a certain row and column for one year and store it in a new CSV?
Thanks
To get a value from a certain row and column you can try something like this:
from itertools import islice

def get_value(f, row, col):
    line = next(islice(f, row - 1, row))
    values = line.split(',')
    return values[col - 1]

with open('data.csv', 'r') as f:
    print(get_value(f, 10, 4))
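Building on that, a sketch for the full year, assuming each day's 191-column grid sits directly to the right of the previous day's in the same row:
from itertools import islice

def get_year_series(path, row, col, cols_per_day=191, days=365):
    # read the target row once, then step through it with a stride
    # of cols_per_day to pick out the same cell for every day
    with open(path, 'r') as f:
        line = next(islice(f, row - 1, row))
    values = line.split(',')
    return [values[col - 1 + d * cols_per_day] for d in range(days)]

# e.g. row 20, column 40 for every day of the year:
# series = get_year_series('data.csv', 20, 40)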
Apart from extracting the data, the first thing you need to do is rearrange your data.
As it is now, 191 columns are added every day. To do that, the whole file needs to be parsed (probably in memory, with the data growing every day), data has to be appended to the end of each row, and everything has to be written back to disk in full.
Usually, to add data to a csv, rows are added at the end of the file. No need to parse and rewrite the whole file each time.
On top of that, most software for reading CSV files starts having problems when the number of columns gets high.
So it would be a lot better to add the daily data as rows at the end of the csv file.
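A minimal sketch of that append, where day_grid is a hypothetical list of lists holding one day's 253 x 191 values:
import csv

# day_grid (hypothetical) is one day's 253 x 191 grid as a list of lists;
# appending writes only the new rows, nothing is re-parsed or rewritten
with open('data.csv', 'a', newline='') as f:
    csv.writer(f).writerows(day_grid)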
While we're at it: assuming the 253 x 191 is some sort of grid, or at least that every cell has the same data type, this would be a great candidate for binary storage.
All data could be stored in its binary form, resulting in fixed-length fields/cells. To access a field, its position can simply be calculated, with no need to parse and convert all the data each time. Retrieving data would be almost instant.
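Python can handle this, for instance with numpy's memmap; a minimal sketch, assuming float32 cells and a day-major layout:
import numpy as np

# one float32 per cell, laid out day-major: shape (days, rows, cols)
grid = np.memmap('precip.bin', dtype=np.float32, mode='w+',
                 shape=(365, 253, 191))
grid[0, 19, 39] = 1.5    # write day 1, row 20, column 40
value = grid[0, 19, 39]  # read it back; the offset is computed, nothing is parsed
grid.flush()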
I already managed to do the cutting with this script, after reading some examples and trying them:
from os import listdir
from os.path import isfile, join
from scipy.io import netcdf

# pixel: [id, x, y] for each grid cell of interest
pixel = [[1,36,77],[2,37,77],[3,35,78],[4,36,78],[5,37,78],[6,38,78],[7,39,78],[8,40,78],[9,35,79],[10,36,79],[11,37,79],[12,38,79],[13,39,79],[14,40,79],[15,35,80],[16,36,80],[17,37,80],[18,38,80],[19,35,81],[20,36,81],[21,37,81],[22,36,82]]
print(pixel)

folder = 'D:\\RCP45\\Hujan_Harian\\'
onlyfiles = [f for f in listdir(folder) if isfile(join(folder, f))]
print(onlyfiles)

with open('D:\\My Documents\\precipitation.txt', 'w') as fout:
    for filename in onlyfiles:
        print(filename)
        tahun = filename[0:4]  # the year, taken from the file name
        print(tahun)
        f1 = netcdf.netcdf_file(join(folder, filename), 'r')
        print(f1.variables)
        jlh_hari = int(len(f1.variables['time_bnds'][:]))  # number of days
        print(jlh_hari)
        for h in range(jlh_hari):
            for i in range(22):
                x = pixel[i][1]
                y = pixel[i][2]
                pr = f1.variables['pr'][h, x, y]
                fout.write(str(pixel[i][0]) + ', , ' + tahun + ', ' + str(pr) + '\n')
            fout.write('\n')
Related
I have a CSV file with 3 columns, "Username", "Date", and "Energy saved", and I would like to sum the "Energy saved" of a specific user by date.
For example, if username = 'merrytan', how can I print all the rows with "merrytan" such that the total energy saved is aggregated by date? (Date: 24/2/2022, Total Energy saved = 1001; Date: 24/2/2022, Total Energy saved = 700)
I am a beginner at Python. Typically I would use pandas to resolve this issue, but it is not allowed for this project, so I am at a complete loss as to where to even begin. I would appreciate any help and guidance. Thank you.
My alternative to pandas for opening CSV files is the csv module from the standard library. You read the file and extract just the values you need: filter on the first column and keep the matching values from the column concerned (the third column, index 2).
import csv

energy_saved = []
with open(r"D:\test_stack.csv", newline="") as csvfile:
    file = csv.reader(csvfile)
    for row in file:
        if row[0] == "merrytan":
            energy_saved.append(row[2])

energy_saved = sum(map(int, energy_saved))
Now you have a list of just the values concerned, and you can sum them afterwards.
Edit - So, I just realized that I left out the time part of your request completely lol. Here's the update.
import csv

my_dict = {}
with open(r"D:\test_stack.csv", newline="") as file:
    for row in csv.reader(file):
        if row[0] == "merrytan":
            my_dict[row[1]] = my_dict.get(row[1], 0) + int(row[2])
So, we need to pick up the date column of the file as well, and present two "columns" of results. With pandas prohibited, a dictionary with dates as keys and energy totals as values does the job.
But your date column has repeated values (whether intended or not), and dictionary keys must be unique. So we use a loop: each date is added as a key with its energy as the value, and when the date is already present, we add to the existing value instead.
I would turn your CSV file into a two-level dictionary, with username and then date as the keys:
savings = dict()
with open("data.csv", "r") as infile:
    # Skip the first line of the CSV, since that has the column names,
    # not data
    for row in infile.readlines()[1:]:
        username, date_col, saved = row.strip().split(",")
        saved = int(saved)
        if username in savings:
            if date_col in savings[username]:
                savings[username][date_col] = savings[username][date_col] + saved
            else:
                savings[username][date_col] = saved
        else:
            savings[username] = {date_col: saved}
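Reading the totals back out is then a plain dictionary lookup, for example:
# print every per-date total recorded for one user
for date_col, total in savings.get("merrytan", {}).items():
    print("Date: %s Total Energy saved = %d" % (date_col, total))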
I'm quite new to coding and don't have a formal education on the subject (most of my experience has been stumbling through Google searches), and I have a task that I would like assistance with.
I have 38 files which look something like this:
NGANo: 000a16d_1
Zeta: 0.050000
Ds5-95: 5.290000
Comments:
Period, SD, SV, SA
0.010000 0.000433 0.013167 170.812839
0.020000 0.001749 0.071471 172.720229
0.030000 0.004014 0.187542 176.055129
0.040000 0.007631 0.468785 189.322248
0.050000 0.012815 0.912067 203.359441
0.060000 0.019246 1.556853 210.602517
0.070000 0.025400 1.571091 206.360018
They're all .DAT files containing four columns of data (Period, SD, SV, SA) that are single-space-delimited in each row; additionally there are two spaces at the end of each line of data.
The only important data for me is the SA data. I'd like to take the SA data and the title (in this particular example, 000a16d_1) from each of these 38 files and put them all on the same sheet of an Excel spreadsheet (one column after the next), with just the title followed by the SA data.
I've tried a few different things, but I'm stuck on how to separate the rows of data from one column into four, and I'm not sure whether I should use numpy or pandas. I know that everything up to the second-to-last line is correct, since print(table) does print the rows of data; I just don't understand how to split the single column into multiple ones. Here is my current code; all assistance is appreciated.
import pandas as pd
import numpy as np
import os
import xlsxwriter

path = "C:/Users/amihi/Downloads/Plotter_Output"
dirs = os.listdir(path)

for file in dirs:
    table = pd.read_table(file, skiprows=4)
    SA = table.loc[:, "SA"]
    print(SA)
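(For reference, a likely fix on the pandas side is to tell the reader that the data rows are whitespace-delimited and to supply the column names yourself; a sketch, assuming the "Period, SD, SV, SA" header is the fifth line of every file:)
import os
import pandas as pd

path = "C:/Users/amihi/Downloads/Plotter_Output"
sa_columns = {}
for file in os.listdir(path):
    # skip the five header lines and name the columns ourselves,
    # since the data rows are space-delimited
    table = pd.read_csv(os.path.join(path, file), skiprows=5,
                        sep=r'\s+', names=["Period", "SD", "SV", "SA"])
    sa_columns[file] = table["SA"]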
You could also do this without using pandas if you wanted. The code below deals only with the table section of the file; it won't deal with the info at the top.
finalColumns = []
for file in dirs:
    columns = []  # reset for every file
    with open(file, "r") as f:
        for l in f:
            splitted = l.strip("\n").split()
            # grow the column list if this line has more fields than before
            if len(splitted) > len(columns):
                for i in range(len(splitted) - len(columns)):
                    columns.append([])
            for counter, item in enumerate(splitted):
                columns[counter].append(item)
    finalColumns.append(columns[3])  # SA is the fourth column
When adding to your other file, simply loop through finalColumns; each item will become one new column in your file, as sketched below.
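A minimal sketch of that last step with xlsxwriter (which the question already imports); titles here is a hypothetical list holding the NGANo value parsed from each file, in the same order as finalColumns:
import xlsxwriter

workbook = xlsxwriter.Workbook("combined.xlsx")
worksheet = workbook.add_worksheet()
for col_idx, column in enumerate(finalColumns):
    # hypothetical: titles[col_idx] is the NGANo parsed from this file
    worksheet.write(0, col_idx, titles[col_idx])
    for row_idx, value in enumerate(column, start=1):
        worksheet.write(row_idx, col_idx, value)
workbook.close()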
Thank you in advance for your answer!
I have a lot of files which contain columns.
I want to export all the columns separately into multiple files.
Moreover, I want to use the first value of each column as the index in the file name.
For example, if I have the file "test_.dat", which contains 3 columns:
12 54 159
2 9 87
5 99 201
...
...
91 1 777
I want three files: "test_12.dat", "test_54.dat" & "test_159.dat".
Where "test_12.dat" is :
2
5
...
...
91
I know that I need to consider two loops (one for the initial files) and another one for the reading/export of the columns.
I only know how to use append, but this is a very time-consuming approach.
I would deeply appreciate your support.
Here is my try:
Find all ".dat" from a folder :
import glob
import numpy

data = []
for fname in glob.glob('test_*.dat'):
    temp = numpy.loadtxt(fname, skiprows=2)
    data.append(temp)
namefiles = glob.glob('test*.dat')
Append all the columns together (a very slow step):
ikj = []
for i in range(len(namefiles)):
    for k in range(1, nbrechi + 1):   # nbrechi: the number of columns
        for j in range(points):       # points: the number of rows
            ikj.append(data[i][j][k])
Define two variables to split the list back into columns (points is the number of rows):
seq2 = [ikj[i:i + points] for i in range(0, len(ikj), points)]
chunks = [ikj[points * i:points * (i + 1)] for i in range(len(ikj) // points + 1)]
Export the columns in the specific files:
z = -1  # running index into seq2
for j in range(len(namefiles)):
    for i in range(len(seq2) // len(namefiles)):
        z = z + 1
        numpy.savetxt(namefiles[j][:-4] + "_number_" + str(flattened[i]) + ".dat",
                      list(zip(firstcolumn, seq2[z])))
        print(namefiles[j][:-4] + "_number_" + str(flattened[i]))
    zz.append(z)
An easy way is to read the large file with pandas; it is designed for handling big data.
To read the data use the following:
import pandas as pd
df = pd.read_csv('test_.dat', sep=r'\s', engine='python', header=None)
To save the columns as individual files you can use the following code:
for ci in df.columns.values:
    data = df[ci]
    data.to_csv('test_{}.dat'.format(data[0]))
You can change the sep depending on what is used in your .dat file. The default for pandas is a comma, but in this case, as in your example data, I used whitespace. Hope it helps!
# Open one output file per column, named after the values in the
# first row; every later row is spread across those files.
fps_out = []
with open('test_.dat', 'r') as fp_in:
    for line in fp_in:
        if not fps_out:
            # first line: create the output files
            for data in line.split():
                fps_out.append(open('test_%s.dat' % data, 'w'))
        else:
            # later lines: write each value to its column's file
            for pos, data in enumerate(line.split()):
                fps_out[pos].write(data + '\n')
for fp in fps_out:
    fp.close()
I believe that I have successfully read in my files with a for loop, as shown in the code below.
import pandas as pd
import glob

filename = glob.glob('1511**.mnd')
data_nov15_hereford = pd.DataFrame()
frames = []
for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows=33, sep='\s+')
    frames.append(f_nov15_hereford)
data_nov15_hereford = pd.concat(frames)
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True)
My problem now is that I want to extract some information from the files. Specifically, I want the 80 m wind speed. When I read in just one file, instead of looping over multiple files, the code works like I need it to by simply doing this:
import numpy as np

height = data_nov15_hereford['#']
wspd = data_nov15_hereford["z"]
hub = np.where(height == 80)
print(hub)
hub_wspd = wspd[5:4582:32]
hub_wspd is the 80 m wind speed that I am interested in. I got the index numbers 5:4582 by printing hub, and then all I have to do is skip every 32 rows to keep pulling out the 80 m wind speed from the file. However, now that I have read in multiple files (which all have the same layout as this one), I can't pull out the 80 m wind speed the same way: hub prints the indices 5:65418, I skip every 32 rows, but when I print the tail end of hub_wspd it doesn't match the file, so I must be doing something wrong. Any ideas why it isn't working with multiple files but worked with a single file? I can also attach a copy of a single data file if that would help. Thanks!
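(For reference, a stride-free way to make the same selection is to filter on the height value itself rather than on row positions, which keeps working after pd.concat; a sketch, assuming the height column really is named '#' and the wind speed column 'z':)
# select the 80 m rows by value instead of by fixed positions,
# so the selection still works after concatenating several files
mask = data_nov15_hereford['#'] == 80
hub_wspd = data_nov15_hereford.loc[mask, 'z']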
I'm trying to create code that checks whether the value in the index column of a CSV is the same across different rows, and if so, finds the most frequent values in the other columns and uses those as the final data. That's not a very good explanation; basically I want to take this data.csv:
customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
And create a new answer.csv that recognizes that there are multiple rows for the same customer, so it finds the values that occur the most in each column and outputs those into one row:
customer_ID,month,ABC
1003,Jan,114
1004,Feb,251
I'd also like to learn: if there are values with the same number of occurrences (month and B for customer 1004), how can I choose which one gets outputted?
I've currently written (thanks to Andy Hayden on a previous question I just asked):
import pandas as pd

df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print(df)
res.to_frame(name='answer').to_csv('answer.csv')
All this does, however, is create the following (I was ignoring month previously, but now I'd like to incorporate it, so that I can learn not only how to find the mode of a column of numbers but also how to find the most frequent string):
customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0
Note: I don't know why it is outputting the .0 at the end of the ABC, it seems to be in the wrong variable format. I want each column to be outputted as just the 3 digit number.
Edit: I'm also having an issue that if the value in column A is 0 then the output becomes 2 digits and does not incorporate the leading 0.
What about something like this? It's not using pandas, though; I am not a pandas expert.
from collections import Counter

dataDict = {}
# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:
        # split the line by ',' since it is a csv file...
        entry = line.split(',')
        # Check to make sure that there is data in the line
        if entry and len(entry[0]) > 0:
            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month': [entry[1]],
                                      'time': [entry[2]],
                                      'ABC': [''.join(entry[3:])],
                                      }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))

# Now write the output file
with open('out.csv', 'w') as f:
    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):
        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]
        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))
It generates a file called out.csv that looks like this (note that the header line of data.csv is treated like any other row, which is why it shows up, sorted after the numeric IDs, on the last line):
1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
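As for the tie-breaking part of the question: most_common() is stable, so among equally frequent values the one seen first wins (in CPython 3.7+, where insertion order is preserved). To choose differently, you can break ties explicitly; a small sketch that takes a preference function:
from collections import Counter

def most_common_pref(values, prefer=max):
    # count occurrences, find the highest count, then let `prefer`
    # decide among the values that tie for that count
    counts = Counter(values)
    best = max(counts.values())
    return prefer(v for v, c in counts.items() if c == best)

# e.g. most_common_pref(['Feb', 'Jul'], prefer=min) -> 'Feb'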