I've got a CSV file with 20 columns and about 60,000 rows.
I'd like to read fields 2 to 20 only. I've tried the code below, but the browser (I'm using IPython) freezes and it just goes on for ages.
import numpy as np
from numpy import genfromtxt
myFile = 'sampleData.csv'
myData = genfromtxt(myFile, delimiter=',', usecols=(2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19))
print(myData)
How could I tweak this to work better & actually produce output please?
import pandas as pd
myFile = 'sampleData.csv'
df = pd.read_csv(myFile, skiprows=1)  # read_csv already returns a DataFrame; skiprows=1 skips the header
print(df)
This works like a charm
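For reference, the genfromtxt approach from the question also works once usecols is passed as a keyword argument; a minimal sketch, assuming the same sampleData.csv:
import numpy as np
# usecols takes zero-based column indices; range(2, 20) covers columns 2..19
myData = np.genfromtxt('sampleData.csv', delimiter=',', usecols=range(2, 20))
print(myData)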
I am reading a CSV file and trying to convert the data into a JSON array, but I am facing the error "only size-1 arrays can be converted to Python scalars".
The CSV file's contents are:
4.4.4.4
5.5.5.5
My code is below:
import json
import numpy as np
import pandas as pd
df1 = pd.read_csv('/Users/Documents/datasetfiles/test123.csv', header=None)
df1.head(5)
#          0
# 0  4.4.4.4
# 1  5.5.5.5
df_to_array = np.array(df1)
app_json = json.dumps(df_to_array,default=int)
I need the output to be:
["4.4.4.4", "5.5.5.5", "3.3.3.3"]
As other answers mentioned, just use a list: json.dumps(list(df1[0]))
FYI, the data's shape is your problem.
If you absolutely must use numpy, then transpose the array first:
json.dumps(list(df_to_array.transpose()[0]))
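To see why the transpose helps, here is a minimal sketch with the two values inlined instead of read from a file:
import json
import numpy as np
import pandas as pd
df1 = pd.DataFrame(['4.4.4.4', '5.5.5.5'])
arr = np.array(df1)
print(arr.shape)              # (2, 1): two rows, one column
print(arr.transpose().shape)  # (1, 2): one row holding both values
print(json.dumps(list(arr.transpose()[0])))  # ["4.4.4.4", "5.5.5.5"]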
Given test.csv:
4.4.4.4
5.5.5.5
Doing:
import json
with open('test.csv') as f:
    data = f.read().splitlines()
print(data)
print(json.dumps(data))
Output:
['4.4.4.4', '5.5.5.5']
["4.4.4.4", "5.5.5.5"]
You're overcomplicating things by using pandas if this is all you want to do.
import json
import pandas as pd
df1 = pd.read_csv('/Users/Documents/datasetfiles/test123.csv', header=None)
df1.head(5)
#          0
# 0  4.4.4.4
# 1  5.5.5.5
df_to_array = list(df1[0])
app_json = json.dumps(df_to_array,default=int)
print(app_json)
["4.4.4.4", "5.5.5.5", "3.3.3.3"]
After applying some procedure I am getting millions of numpy arrays (in the case below, procedure converts e into a numpy array):
for e in l:
    procedure(e)
How can I correctly save each iteration's result into a single numpy file, so that I can read and load it later?
So far I have tried two options. With np.savez:
for i, e in enumerate(l):
    np.savez(f'/Users/user/array.npz', i=e)
And with pandas:
(1) For saving into a single file:
for e in l:
    arr = pd.DataFrame(procedure(e)).T
    arr.to_csv('/Users/user/Downloads/arr.csv', mode='a', index=False, header=False)
(2) For reading:
arr = np.genfromtxt("/Users/user/Downloads/arr.csv", delimiter=',', dtype='float32')
So far the solution that works is the pandas one. However, I guess I am losing precision in the numpy matrices, because instead of having values like this (in scientific notation, with the e):
-6.82821393e-01 -2.65419781e-01
I am getting values like this:
-0.6828214 , -0.26541978
That is, the numpy matrices are not being saved correctly.
What is the most efficient and correct way to dump each numpy matrix into a single file as the for loop iterates?
I don't know if CSV is the right format in this case, but you can specify a float format to avoid precision loss.
Append to CSV using pandas
import pandas as pd
import numpy as np
pd.set_option('display.precision', 16)  # for the print command
fn = 'pandasfile.csv'
arr = np.linspace(1,100,10000).reshape(5000,2)
df = pd.DataFrame(arr)
df.to_csv(fn, mode='a', index=False, header=False, float_format='%.16f', sep='\t')
Append to CSV using numpy
import numpy as np
np.set_printoptions(precision=16)
fn = 'numpyfile.csv'
arr = np.linspace(1,100,10000).reshape(5000,2)
print(arr)
with open(fn, "a") as f:
    np.savetxt(f, arr, fmt='%.16f', delimiter='\t')
I used a tab as the separator; it is more readable (some call this a TSV file). You can use "," or " " instead.
Load CSV to numpy
arr2 = np.loadtxt(fn, delimiter='\t')
print(arr2)
Load CSV to pandas
df = pd.read_csv(fn, header=None, sep='\t', dtype='float32')
print(df)
The numpy version is a bit faster, if that matters:
m#o780:~$ time python3 pdsave.py
real 0m0,473s
user 0m0,448s
sys 0m0,102s
m#o780:~$ time python3 npsave.py
real 0m0,199s
user 0m0,214s
sys 0m0,072s
m#o780:~$
np.savez saves arrays in a zip-style archive, under the default name arr_0 if you don't name them. If you call it again with the same file name, it overwrites the file, so only the latest save survives. The good thing is that you can name the arrays inside the zip, so you can use a custom name for each numpy array, or just the indices, and save them all in one call, as in the example below.
arrays = {f'arr_{i}': e for i, e in enumerate(l)}  # one named entry per array
np.savez('/Users/user/array.npz', **arrays)        # one file, written once
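For completeness, a minimal sketch of reading the named arrays back (np.load returns a dict-like NpzFile):
import numpy as np
data = np.load('/Users/user/array.npz')
print(data.files)      # the names used when saving, e.g. ['arr_0', 'arr_1', ...]
first = data['arr_0']  # look up one array by its name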
Sorry for my bad English.
I am doing an internship, and I had never used Python before this.
I need to extract data from a NetCDF file.
I have already created a loop which builds a DataFrame, but when I try to export this DataFrame I only get 201 of the 41,000 values.
import csv
import numpy as np
import pandas as pd
import netCDF4
from netCDF4 import Dataset, num2date
nc = Dataset('Q:/QGIS/2011001.nc', 'r')
chla = nc.variables['chlorophyll_a'][0]
lons = nc.variables['lon'][:]
lat = nc.variables['lat'][:]
time = nc.variables['time'][:]
nlons = len(lons)
nlat = len(lat)
The first loop gives me all 41,000 values in the ArcGIS Python console:
for i in range(0, nlat):
    dla = {'lat': lat[i], 'long': lons, 'chla': chla[i]}
    z = pd.DataFrame(dla)
    print(z)
    z.to_csv('Q:/QGIS/fichier.csv', sep=',', index=True)
But when I do the to_csv I only get 201 values in the CSV file.
for y in range(0, nlat):
    q[y].to_csv('Q:/QGIS/fichier.csv', sep=',', index=True)
for i in range(0, nlat):
    dlo = {'lat': lat[i], 'long': lons, 'chla': chla[i]}
    q[y] = pd.DataFrame(dlo)
    print(q)
I hope that you will have an answer to solve this. Moreover, if you have any script to extract the values for creating a shapefile (.shp), I would be very grateful if you could share it!
Best regards
Thank you in advance
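A note on the likely cause: to_csv uses mode='w' by default, so the file is rewritten from scratch on every loop iteration and only the last latitude's rows survive. A minimal sketch of one way around this, reusing the variable names from the question, is to collect the per-latitude frames and write once:
frames = []
for i in range(0, nlat):
    dla = {'lat': lat[i], 'long': lons, 'chla': chla[i]}
    frames.append(pd.DataFrame(dla))
z = pd.concat(frames, ignore_index=True)  # all latitudes in one DataFrame
z.to_csv('Q:/QGIS/fichier.csv', sep=',', index=True)
Alternatively, pass mode='a' and header=False to to_csv inside the original loop to append instead of overwrite.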
I have data in 10 individual CSV files. Each CSV file has just one row of data entries (500,000 data points, no headers, etc.). Three questions:
How can I transform the data into one column with 500000 rows?
Is it better to import them into one numpy array (500000 x 10) to analyze them? If so, how can one do this?
Or is it better to import them into one DataFrame (500000 x 10) to analyze them?
Assume you have a list of file names called files. Then:
df = pd.concat([pd.read_csv(f, header=None) for f in files], ignore_index=True)
df is a 10 x 500000 dataframe. Make it 500000 x 10 with df.T.
The answers to 2 and 3 depend on your task.
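If you do want the numpy array from question 2, a minimal sketch building on the df above:
arr = df.T.to_numpy()  # shape (500000, 10): one column per original file
print(arr.shape)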
First, read all 10 CSVs:
import os, csv, numpy
import pandas as pd
my_csvs = os.listdir('path to folder with 10 csvs')  # selects all files in the folder
list_of_columns = []
os.chdir('path to folder with 10 csvs')
for file in my_csvs:
    column = []
    with open(file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            column.append(row)
    list_of_columns.append(column)
This is how you get a list of row-lists, one per file. Next, transform them into a pandas DataFrame or a numpy array, or whatever you feel comfortable working with.
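For instance, a minimal sketch of that last step (assuming each file really holds a single row):
rows = [col[0] for col in list_of_columns]  # unwrap the one row read from each file
df = pd.DataFrame(rows).T                   # 500000 x 10: one column per original file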
Hi, I am new to Python. I am using pandas to read the CSV file data and print it. The code is as follows:
import numpy as np
import pandas as pd
import codecs
from pandas import Series, DataFrame
dframe = pd.read_csv("/home/vagrant/geonlp_japan_station.csv", sep=',',
                     encoding="Shift-JIS")
print(dframe.head(2))
but the data is printed like the following (I just give an example to show it):
However, I want the data to be aligned in columns, like the following:
I don't know how to make the printed data line up clearly. Thanks in advance!
You can check the unicode formatting options and set:
pd.set_option('display.unicode.east_asian_width', True)
I tested it with a UTF-8 version of the CSV:
dframe = pd.read_csv("test/geonlp_japan_station/geonlp_japan_station_20130912_u.csv")
and the alignment of the output seems better:
pd.set_option('display.unicode.east_asian_width', True)
print(dframe)
pd.set_option('display.unicode.east_asian_width', False)
print(dframe)
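If you only need the alignment for one print, pd.option_context applies the option temporarily; a minimal sketch with made-up sample data (the original CSV isn't shown):
import pandas as pd
df = pd.DataFrame({'駅名': ['東京', '新宿'], 'code': [1, 2]})
with pd.option_context('display.unicode.east_asian_width', True):
    print(df)  # wide characters count as two cells, so the columns line up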