I need to read columns of complex numbers in the format:
# index; (real part, imaginary part); (real part, imaginary part)
1 (1.2, 0.16) (2.8, 1.1)
2 (2.85, 6.9) (5.8, 2.2)
NumPy seems great for reading in columns of data with only a single delimiter, but the parentheses seem to ruin any attempt at using numpy.loadtxt().
Is there a clever way to read in the file with Python, or is it best to just read the file, remove all of the parentheses, and then feed it to NumPy?
This will need to be done for thousands of files, so I would like an automated way, but maybe NumPy is not capable of this.
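For reference, the brute-force route described above only takes a few lines; here is a minimal sketch, assuming the file layout shown and a hypothetical filename data.txt:
import numpy as np

# Strip parentheses and commas, skip the header, then let loadtxt parse
# the five resulting numeric columns: index, re1, im1, re2, im2.
with open('data.txt') as f:
    cleaned = [line.replace('(', ' ').replace(')', ' ').replace(',', ' ')
               for line in f if not line.startswith('#')]
raw = np.loadtxt(cleaned)               # loadtxt accepts a list of lines
z = raw[:, 1::2] + 1j * raw[:, 2::2]    # recombine into two complex columns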
Here's a more direct way than @Jeff's answer: tell loadtxt to load straight into a complex array, using a helper function parse_pair that maps (1.2,0.16) to 1.2+0.16j:
>>> import re
>>> import numpy as np
>>> pair = re.compile(r'\(([^,\)]+),([^,\)]+)\)')
>>> def parse_pair(s):
...     return complex(*map(float, pair.match(s).groups()))
>>> s = '''1 (1.2,0.16) (2.8,1.1)
2 (2.85,6.9) (5.8,2.2)'''
>>> from io import StringIO
>>> f = StringIO(s)
>>> np.loadtxt(f, delimiter=' ', dtype=complex,
...            converters={1: parse_pair, 2: parse_pair})
array([[ 1.00+0.j , 1.20+0.16j, 2.80+1.1j ],
       [ 2.00+0.j  ,  2.85+6.9j ,  5.80+2.2j ]])
Or in pandas:
>>> import pandas as pd
>>> f.seek(0)
0
>>> pd.read_csv(f, delimiter=' ', index_col=0, names=['a', 'b'],
... converters={1: parse_pair, 2: parse_pair})
a b
1 (1.2+0.16j) (2.8+1.1j)
2 (2.85+6.9j) (5.8+2.2j)
Since this issue is still not resolved in pandas, let me add another solution. You could modify your DataFrame with a one-liner after reading it in:
import pandas as pd
df = pd.read_csv('data.csv')
df = df.apply(lambda col: col.apply(lambda val: complex(val.strip('()'))))
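A quick check of that one-liner, assuming the values were written the way pandas renders complex numbers, e.g. (1.2+0.16j):
import pandas as pd

pd.DataFrame({'a': [1.2 + 0.16j], 'b': [2.8 + 1.1j]}).to_csv('data.csv', index=False)
df = pd.read_csv('data.csv')  # columns come back as strings like '(1.2+0.16j)'
df = df.apply(lambda col: col.apply(lambda val: complex(val.strip('()'))))
print(df.dtypes)  # both columns are complex128 again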
If your file only has 5 numeric fields per line like you've shown, you could preprocess it with a regex, replacing the parentheses with commas on every line, feed it to pandas, and then combine the real and imaginary columns as suggested in this SO answer to get complex numbers.
Pandas makes this easier because you can pass a regex as the delimiter to its read_csv method, which lets you write clearer code and use a converter like the one below. That is the advantage over the numpy version.
import pandas as pd
from io import StringIO
f_str = "1 (2, 3) (5, 6)\n2 (3, 4) (4, 8)\n3 (0.2, 0.5) (0.6, 0.1)"
buf = StringIO(f_str)
def complex_converter(txt):
    # "(2, 3" -> "2+3j"; the "+-" -> "-" replace handles negative imaginary parts
    txt = txt.strip("()").replace(", ", "+").replace("+-", "-") + "j"
    return complex(txt)
df = pd.read_csv(buf, delimiter=r" \(|\) \(", engine='python',
                 converters={1: complex_converter, 2: complex_converter}, index_col=0)
EDIT: Looks like @Dougal came up with this just before I posted it... it really just depends on how you want to handle the complex number. I like being able to avoid the explicit use of the re module.
After applying some procedure I am getting millions of numpy arrays (in the case below, procedure converts e to a numpy array):
for e in l:
    procedure(e)
How can I correctly save each iteration's array into a single numpy file so that I can read and load it later?
So far I tried two options, with np.savez:
for i, e in enumerate(l):
    np.savez(f'/Users/user/array.npz', i=e)
And with pandas:
(1) For saving into a single file:
for e in l:
    arr = pd.DataFrame(procedure(e)).T
    arr.to_csv('/Users/user/Downloads/arr.csv', mode='a', index=False, header=False)
(2) For reading:
arr = np.genfromtxt("/Users/user/Downloads/arr.csv", delimiter=',', dtype='float32')
So far the solution that works is the pandas one. However, I guess I am losing precision in the numpy matrices, because instead of having values in scientific notation (with the e) like this:
-6.82821393e-01 -2.65419781e-01
I am getting values like this:
-0.6828214 , -0.26541978
However, the numpy matrices are not being saved correctly.
What is the most efficient and correct way to dump into a single file each numpy matrix after the for loop iteration?
I don't know whether CSV is the right format in this case, but you can specify a float format to avoid precision loss.
Append to CSV using pandas
import pandas as pd
import numpy as np
pd.set_option('display.precision', 16)  # for print output
fn = 'pandasfile.csv'
arr = np.linspace(1,100,10000).reshape(5000,2)
df = pd.DataFrame(arr)
df.to_csv(fn, mode='a', index=False, header=False, float_format='%.16f', sep='\t')
Append to CSV using numpy
import numpy as np
np.set_printoptions(precision=16)
fn = 'numpyfile.csv'
arr = np.linspace(1,100,10000).reshape(5000,2)
print(arr)
with open(fn, "a") as f:
    np.savetxt(f, arr, fmt='%.16f', delimiter='\t')
I used a tab as the separator because it is more readable (some call this a TSV file). You can use "," or " " instead.
Load CSV to numpy
arr2 = np.loadtxt(fn, delimiter='\t')
print(arr2)
Load CSV to pandas
df = pd.read_csv(fn, header=None, sep='\t', dtype='float32')
print(df)
The NumPy version is a bit faster, if that matters:
m@o780:~$ time python3 pdsave.py
real 0m0,473s
user 0m0,448s
sys 0m0,102s
m@o780:~$ time python3 npsave.py
real 0m0,199s
user 0m0,214s
sys 0m0,072s
m@o780:~$
np.savez saves arrays in a zip-style archive; an unnamed array gets the default name arr_0. If you call it repeatedly with the same filename, it overwrites the file, so only the latest save survives. The good thing is that you can name each array inside the zip, using a custom name for each numpy array or just its index, as in the example below.
np.savez('/Users/user/array.npz', **{f'arr_{i}': e for i, e in enumerate(l)})
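Reading them back is then straightforward; a sketch, where the arr_{i} keys match the names used above:
import numpy as np

data = np.load('/Users/user/array.npz')
for name in data.files:  # e.g. 'arr_0', 'arr_1', ...
    print(name, data[name].shape)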
So I have an input.txt file like this:
2,1
.,1
2,1
1,1
2,1
3,1
3,1
There is a . in there because I parsed a long decimal number and that is the decimal point.
I want to find the mean of the numbers on the left, which would be (2+2+1+2+3+3)/6, and find the most repeated digit, which would be 2.
I'm trying to run a loop over the text file of the form:
for line in text:
But when I print(line) it only prints 2 and not the whole 2,1, which I could then separate using line.split(',').
Any help would be appreciated thanks.
You can use csv and statistics.mean/mode with a comprehension:
from statistics import mean, mode
from csv import reader
with open('input.txt') as f:
    vals = [int(i) for i, _ in reader(f) if i.isnumeric()]
avg = mean(vals)
most_freq = mode(vals)
Perhaps use numpy to load the file and get both the mean and the most frequent value:
import numpy as np
values = np.genfromtxt('file.txt', delimiter=',')[:, 0]
values = values[~np.isnan(values)]  # drop the '.' row, which parses as NaN
mean = np.mean(values)
mostFrequent = np.argmax(np.bincount(values.astype(int)))
Given
import pandas as pd
filename = "test.txt"
Code
df = pd.read_csv(filename, header=None, na_values=".")
Demo
df
df[0].mean()
# 2.1666666666666665
df[0].mode()[0]
# 2.0
df.describe()
Start from here:
from numpy import genfromtxt
my_data = genfromtxt('input.txt', delimiter=',')
print(my_data)
Then look into numpy.place.
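For instance, a minimal sketch of numpy.place here, assuming you want to replace the NaN produced by the '.' row (the replacement value 0 is a made-up choice):
import numpy as np

my_data = np.genfromtxt('input.txt', delimiter=',')
np.place(my_data, np.isnan(my_data), 0)  # in-place replacement of NaN
print(my_data)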
I'd like to perform some dimension-reduction (DR) methods such as PCA, ICA, t-SNE, and maybe LEM on a data set data.txt to compare the methods.
Therefore, I need to read in the data as a numpy.ndarray.
Every line corresponds to a row in the matrix, with delimiter = ' '.
Alternatively, I have the file loaded as a numpy.array now, but as an array of strings:
[ '16.72083152\t12.91868366\t14.37818919\n'
...
'16.9504402\t7.81951173\t12.81342726']
How can I quickly convert this into a numpy.array of the desired format (n x 3), splitting each row's elements on '\t' and chopping the '\n' at the end?
A quick answer would be much appreciated, as would other tips. Thanks!
You could just try the code below:
import numpy as np
data = np.loadtxt('data.txt',delimiter='\t')
This should do it:
import numpy
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
foo = '16.72083152\t12.91868366\t14.37818919\n16.9504402\t7.81951173\t12.81342726\n'
fn = StringIO(foo)  # make a file-like object from the string
data = numpy.loadtxt(fn)  # use loadtxt with default settings
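If you already have the lines as an array of strings, as in the question, a small sketch to convert them directly (split() handles both the '\t' separators and the trailing '\n'):
import numpy as np

lines = np.array(['16.72083152\t12.91868366\t14.37818919\n',
                  '16.9504402\t7.81951173\t12.81342726'])
data = np.array([row.split() for row in lines], dtype=float)
print(data.shape)  # (n, 3)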
In Python, I'm writing my Pandas Dataframe to a csv file and want to change the decimal delimiter to a comma (,). Like this:
results.to_csv('D:/Data/Kaeashi/BigData/ProcessMining/Voorbeelden/Voorbeeld/CaseEventsCel.csv', sep=';', decimal=',')
But the decimal delimiter in the CSV file is still a .
Why? What am I doing wrong?
If the decimal parameter doesn't work, maybe it's because the type of the column is object (check the dtype value in the last line of the output when you do df[column_name]).
That can happen if some rows have values that couldn't be parsed as numbers.
You can force the column to change type:
Change data type of columns in Pandas.
But that can make you lose non-numerical data in that column.
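A minimal sketch of forcing the conversion with pd.to_numeric (the column name 'value' and its contents are made up):
import pandas as pd

df = pd.DataFrame({'value': ['1.5', '2.25', 'n/a']})  # read in as strings (object dtype)
df['value'] = pd.to_numeric(df['value'], errors='coerce')  # unparseable rows become NaN
df.to_csv('out.csv', sep=';', decimal=',')  # decimal=',' now takes effect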
This functionality wasn't added until 0.16.0
Added decimal option in to_csv to provide formatting for non-‘.’ decimal separators (GH781)
Upgrade pandas to something more recent and it will work. The code below uses the 10 minute tutorial and pandas version 0.18.1
>>> import pandas as pd
>>> import numpy as np
>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
>>> df
A B C D
2013-01-01 -0.157833 1.719554 0.564592 -0.228870
2013-01-02 -0.316600 1.545763 -0.206499 0.793412
2013-01-03 1.905803 1.172803 0.744010 1.563306
2013-01-04 -0.142676 -0.362548 -0.554799 -0.086404
2013-01-05 1.708246 -0.505940 -1.135422 0.810446
2013-01-06 -0.150899 0.794215 -0.628903 0.598574
>>> df.to_csv("test.csv", sep=';', decimal=',')
This creates a "test.csv" file that looks like this:
;A;B;C;D
2013-01-01;-0,157833276159;1,71955439009;0,564592278787;-0,228870244247
2013-01-02;-0,316599953358;1,54576303958;-0,206499307398;0,793411528039
2013-01-03;1,90580284184;1,17280324924;0,744010110291;1,56330623177
2013-01-04;-0,142676406494;-0,36254842687;-0,554799190671;-0,0864039782679
2013-01-05;1,70824597265;-0,50594004498;-1,13542154086;0,810446051841
2013-01-06;-0,150899136973;0,794214730009;-0,628902891897;0,598573645748
In the case when the data is an object and not a plain float type, for example Python's decimal.Decimal(10.12), first change the type and then write to the CSV file:
import pandas as pd
from decimal import Decimal
data_frame = pd.DataFrame(data={'col1': [1.1, 2.2], 'col2': [Decimal(3.3), Decimal(4.4)]})
data_frame.to_csv('report_decimal_dot.csv', sep=';', decimal=',', float_format='%.2f')
data_frame = data_frame.applymap(lambda x: float(x) if isinstance(x, Decimal) else x)
data_frame.to_csv('report_decimal_comma.csv', sep=';', decimal=',', float_format='%.2f')
Somehow I don't get this to work either. I always just end up using the following script to rectify it. It's dirty, but it works for my ends:
for col in df.columns:
    try:
        df[col] = df[col].apply(lambda x: float(x.replace('.', '').replace(',', '.')))
    except:
        pass
EDIT: I misread the question; you might use the same tactic the other way around by changing all your floats to strings :). Then again, you should probably just figure out why it's not working. Do post it if you get it to work.
This example is supposed to work (it works for me):
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10))
with open('Data/out.csv', 'w') as f:
    s.to_csv(f, index=True, header=True, decimal=',', sep=';', float_format='%.3f')
out.csv:
;0
0;0,091
1;-0,009
2;-1,427
3;0,022
4;-1,270
5;-1,134
6;-0,965
7;-1,298
8;-0,854
9;0,150
I don't see exactly why your code doesn't work, but in any case, try adapting the example above to your needs.
Is there a direct way to import the contents of a CSV file into a record array, just like how R's read.table(), read.delim(), and read.csv() import data into R dataframes?
Or should I use csv.reader() and then apply numpy.core.records.fromrecords()?
Use numpy.genfromtxt() by setting the delimiter kwarg to a comma:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
Use pandas.read_csv:
import pandas as pd
df = pd.read_csv('myfile.csv', sep=',', header=None)
print(df.values)
[[ 1.   2.   3. ]
 [ 4.   5.5  6. ]]
This gives a pandas DataFrame which provides many useful data manipulation functions which are not directly available with numpy record arrays.
DataFrame is a 2-dimensional labeled data structure with columns of
potentially different types. You can think of it like a spreadsheet or
SQL table...
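If you specifically need a record array rather than a DataFrame, pandas can convert directly; a small sketch:
import pandas as pd

df = pd.read_csv('myfile.csv', sep=',', header=None)
rec = df.to_records(index=False)  # numpy record array
print(rec.dtype)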
I would also recommend numpy.genfromtxt. However, since the question asks for a record array, as opposed to a normal array, the dtype=None parameter needs to be added to the genfromtxt call:
import numpy as np
np.genfromtxt('myfile.csv', delimiter=',')
For the following 'myfile.csv':
1.0, 2, 3
4, 5.5, 6
the code above gives an array:
array([[ 1. , 2. , 3. ],
[ 4. , 5.5, 6. ]])
and
np.genfromtxt('myfile.csv', delimiter=',', dtype=None)
gives a record array:
array([(1.0, 2.0, 3), (4.0, 5.5, 6)],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])
This has the advantage that files with multiple data types (including strings) can be easily imported.
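For instance, with a hypothetical mixed.csv containing a string column:
import numpy as np

# mixed.csv (made-up contents):
# alice,25,1.65
# bob,30,1.80
arr = np.genfromtxt('mixed.csv', delimiter=',', dtype=None, encoding='utf-8')
print(arr.dtype)  # something like [('f0', '<U5'), ('f1', '<i8'), ('f2', '<f8')]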
I tried it:
from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))
versus :
import csv
import numpy as np
with open(dest_file, 'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter=delimiter,
                           quotechar='"')
    data = [data for data in data_iter]
data_array = np.asarray(data, dtype=<whatever options>)
on 4.6 million rows with about 70 columns and found that the NumPy path took 2 min 16 secs and the csv-list comprehension method took 13 seconds.
I would recommend the csv-list comprehension method, as it most likely relies on pre-compiled libraries and not on the interpreter as much as NumPy does. I suspect the pandas method would have similar interpreter overhead.
You can also try numpy.recfromcsv(), which can guess data types and return a properly formatted record array.
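A sketch of that approach; note that recfromcsv was removed in NumPy 2.0, where genfromtxt with dtype=None and names=True is the stated replacement (both assume the file has a header row for the field names):
import numpy as np

# On NumPy < 2.0:
# data = np.recfromcsv('myfile.csv', encoding='utf-8')

# Equivalent on current versions:
data = np.genfromtxt('myfile.csv', delimiter=',', dtype=None,
                     names=True, encoding='utf-8')
print(data.dtype.names)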
Having tried both NumPy and pandas, I find that using pandas has a lot of advantages:
Faster
Less CPU usage
1/3 the RAM usage compared to NumPy's genfromtxt
This is my test code:
$ for f in test_pandas.py test_numpy_csv.py ; do /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps
23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps
test_numpy_csv.py
from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')
test_pandas.py
from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')
Data file:
du -h ~/me/notebook/train.csv
59M /home/hvn/me/notebook/train.csv
With NumPy and pandas at versions:
$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2
Using numpy.loadtxt
A quite simple method, but it requires all the elements to be floats (ints, and so on):
import numpy as np
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)
You can use this code to send CSV file data into an array:
import numpy as np
csv = np.genfromtxt('test.csv', delimiter=",")
print(csv)
I would suggest using tables (pip3 install tables). You can save your .csv file to .h5 using pandas (pip3 install pandas):
import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()
You can then load your data into a NumPy array easily, and in less time, even for huge amounts of data:
import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()
# Data in NumPy format
data = data.values
This works like a charm...
import csv
import numpy as np

with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))
data = np.array(data, dtype=float)  # note: np.float was removed from NumPy; plain float works
This is the easiest way:
import csv
with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))
Now each entry in data is a row, represented as a list, so you have a 2D list of strings. It saved me so much time.
Available in recent pandas and NumPy versions:
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv', header=None)
# Discover, visualize, and preprocess data using pandas if needed.
data = data.to_numpy()
I tried this:
import pandas as p
import numpy as n
closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)
In [329]: %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s
In [330]: %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s
This is a very simple task; the best way to do it is as follows:
import pandas as pd
import numpy as np
# Put 'r' before the path string to handle special characters in the path, such as '\'.
# Don't forget to put the file name at the end of the path, plus ".csv".
df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')
print(df)
y = np.array(df)