Reading a complex csv file into a numpy array - python

I have a csv file like this:
rgb-28.ppm
rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.91839749)
rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.88473213) (281.05219, 54.649971, 319.0, 108.78567, 0.61637461)
On each line, there is the name of a file, followed by one or more tuples belonging to that file.
I want to read this csv file as follows.
On each row, the first column will hold the name of the file. The next columns will hold the tuples. If there is no tuple, the column will be empty. If there is a tuple, the tuple will occupy the column.
When I try to read this file with:
contours = genfromtxt(path, delimiter=' ')
I get the following error:
Line #36098 (got 6 columns instead of 1)
How can I read this kind of file into a numpy array?
Thanks,

Try this. The idea is: from the input file, find the line which has the maximum number of columns. Use this to construct a dynamic list of column names, and pass that list as the column names to Pandas. As mentioned in the comments, numpy is not efficient at handling missing values. Once the data is in a DataFrame, use str.replace on the columns C1, C2, etc. to remove the unwanted characters, and then str.split to convert each string into a list of numbers.
import pandas as pd

# Pass 1: find the maximum number of tab-separated columns in any line
l_max_col_nos = 0
with open('data.csv', 'r') as l_f:
    for each_line in l_f:
        l_split = len(each_line.split('\t'))
        if l_split > l_max_col_nos:
            l_max_col_nos = l_split

# Build the dynamic column names C0, C1, ...
l_column_list = []
for each_i in range(l_max_col_nos):
    l_column_list.append('C' + str(each_i))
print(l_column_list)

l_df = pd.read_csv('data.csv', sep='\t', header=None, names=l_column_list)
print(l_df)

# Strip the parentheses and whitespace, then split on the commas
print(l_df['C1'].str.replace(r'[()\s]', '', regex=True).str.split(','))
Output
['C0', 'C1', 'C2']
C0 C1 \
0 rgb-28.ppm NaN
1 rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.918...
2 rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.884...
C2
0 NaN
1 NaN
2 (281.05219, 54.649971, 319.0, 108.78567, 0.616...
0 NaN
1 [214.75142, 45.618622, 319.0, 152.53371, 0.918...
2 [235.09999, 47.999729, 319.0, 147.49998, 0.884...
dtype: object
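If every tuple column (C1, C2, ...) needs the same cleanup, a small loop over the generated column names would cover it. A minimal sketch, assuming the l_df and l_column_list from above, converting each tuple string to a list of floats while leaving the NaN cells alone:
# Apply the same cleanup to every tuple column; NaN cells pass through unchanged
for col in l_column_list[1:]:
    l_df[col] = (l_df[col].str.replace(r'[()\s]', '', regex=True)
                          .str.split(',')
                          .apply(lambda v: [float(i) for i in v]
                                 if isinstance(v, list) else v))
print(l_df)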

When you use genfromtxt(path, delimiter=' '), it reads each line and splits it on the delimiter. Without further specification it takes the number of split strings in the first line as the expected number for all lines.
The first line has just one string - so it expects one column all the way down.
The 2nd line has that string, but it also has those 5 number strings. Yes, they are wrapped in () and separated by commas, but they are also separated by spaces. genfromtxt does not handle ().
And then the 3rd line has 2 of those () blocks.
The csv.reader can handle quoted strings, but I don't think it can treat () as "...".
Your parsing goal does not fit an array or table. It sounds like you expect a variable number of 'columns' per row, and that each such 'column' will contain this 5 number tuple. That does not compute. Yes, you could force that structure into an object dtype array, but the fit is bad.
However, since each tuple contains 5 numbers, I can see creating a dictionary with the filename as key, and the tuples of that line as rows of a 5 column 2d array. But regardless of the target structure, you need to figure out a way of parsing one line, such as the one with 2 tuples. How do you split it on the spaces without splitting on the ', '? Once you have the () groups you can strip off the () and split on ', '. The re (regular expression) module might be the best tool for this (I'll try to develop that).
=======================
A possible parsing of your example
Start with a line parsing function:
import re
import numpy as np

def foo(aline):
    # the filename comes first; each tuple starts after ' ('
    alist = re.split(r' \(', aline)
    key = alist[0]
    rest = alist[1:]
    # drop surrounding whitespace and the trailing ')'
    rest = [r.strip().strip(')') for r in rest]
    if len(rest) > 0:
        rest = np.array([[float(i) for i in r.split(',')] for r in rest])
    else:
        rest = None
    return [key, rest]
Your sample text - copy-n-paste and split into lines
In [310]: txt="""rgb-28.ppm
rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.91839749)
rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.88473213) (281.05219, 54.649971, 319.0, 108.78567, 0.61637461)"""
In [311]: txt=txt.splitlines()
In [312]: txt
Out[312]:
['rgb-28.ppm',
'rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.91839749)',
'rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.88473213) (281.05219, 54.649971, 319.0, 108.78567, 0.61637461)']
Now pass each line through the function:
In [313]: data = []
In [314]: for line in txt:
   .....:     data.append(foo(line))
In [315]: data
Out[315]:
[['rgb-28.ppm', None],
['rgb-29.ppm',
array([[ 214.75142 , 45.618622 , 319. , 152.53371 ,
0.91839749]])],
['rgb-30.ppm',
array([[ 235.09999 , 47.999729 , 319. , 147.49998 ,
0.88473213],
[ 281.05219 , 54.649971 , 319. , 108.78567 ,
0.61637461]])]]
In [316]: data[2][1].shape
Out[316]: (2, 5)
The last line contains the data in a 2x5 array. The first has None.
Splitting on ' (' seems to be enough to handle the larger groups. It leaves a trailing ')' on each group, but that's easy to strip off. The rest is splitting each group into substrings and converting those to floats.
As written the function has no error checking or robustness, but it is a start. The data might not be exactly in the form you want, but it can be reworked as needed.
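A minimal follow-up sketch for the dictionary idea mentioned earlier - filename as key, the 2d array (or None) as value - assuming the data list built above:
# Build the suggested dictionary: filename -> 2d array of tuples (or None)
contours = {key: arr for key, arr in data}
print(contours['rgb-30.ppm'].shape)   # -> (2, 5)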

Concatenating and sorting

cols = [2,4,6,8,10,12,14,16,18] # select the columns I want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b= np.array(df1)
b
outcome
array([['WV5 6NY', 'RE4 9VU', 'BU4 N90', 'TU3 5RE', 'NE5 4F'],
['SA8 7TA', 'BA31 0PO', 'DE3 2FP', 'LR98 4TS', nan],
['MN0 4NU', 'RF5 5FG', 'WA3 0MN', 'EA15 8RE', 'BE1 4RE'],
['SB7 0ET', 'SA7 0SB', 'BT7 6NS', 'TA9 0LP', nan]], dtype=object)
a = np.concatenate(b) #concatenated to get a single array, this worked well
print(np.sort(a)) # to sort alphabetically
It gave me the error AxisError: axis -1 is out of bounds for array of dimension 0.
I also tried using a.sort(), but it gives me TypeError: '<' not supported between instances of 'float' and 'str'.
The CSV file contains lists of postcodes for different people, who travel from one postcode to another for different jobs; a person can travel to 5 postcodes a day. Using a numpy array, I got a list of lists of postcodes.
I then concatenated the lists of postcodes into one big list, which I want to sort alphabetically, but it keeps giving me errors.
Please, can someone help?
As mentioned in the comments, this error is caused by comparing nan to a string. To fix this, remove the nan values before sorting:
Convert the array to a list
Remove the nan values
Sort
import numpy as np

# Get the data (in your scenario, this would be achieved by reading from your file)
b = np.array([['WV5 6NY', 'RE4 9VU', 'BU4 N90', 'TU3 5RE', 'NE5 4F'],
              ['SA8 7TA', 'BA31 0PO', 'DE3 2FP', 'LR98 4TS', np.nan],
              ['MN0 4NU', 'RF5 5FG', 'WA3 0MN', 'EA15 8RE', 'BE1 4RE'],
              ['SB7 0ET', 'SA7 0SB', 'BT7 6NS', 'TA9 0LP', np.nan]], dtype=object)

# Flatten
a = np.concatenate(b)

# Remove the nan values - with dtype=object they stay floats, so keep only the strings
a = np.array([x for x in a if isinstance(x, str)])

# Finally, sort
a.sort()
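If you don't need the result as a NumPy array, a plain Python sorted() on the filtered values does the same job - a minimal sketch, reusing b from above:
# Flatten, drop the non-string (nan) entries, and sort in one step
result = sorted(x for x in np.concatenate(b) if isinstance(x, str))
print(result[:5])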

I have a dataset containing both strings and integers, how do I write a program that will only read the integer values in Python?

Need it to only read the integer values, and not the strings.
This is an example of a line in the text file:
yye5 mooProject No yeetcity Nrn de 0 .1 .5 0
We want to skip the first 5 columns (Nrn de is one column) and put every line in the file (which looks like this) into a numpy or pandas array.
Try/Except block is your friend.
x = ('yye5', 'mooProject', 'No', 'yeetcity', 'Nrn', 'de', '0', '.1', '.5', '0')
result = []
for i in x:
    try:
        result.append(float(i))
    except ValueError:
        pass  # not a number, skip it
print(result)
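To apply the same idea to the whole file, a minimal sketch (the filename data.txt is an assumption, and every line is assumed to yield the same number of numeric fields so the result is a regular 2d array):
import numpy as np

rows = []
with open('data.txt') as f:   # hypothetical filename
    for line in f:
        row = []
        for token in line.split():
            try:
                row.append(float(token))
            except ValueError:
                pass   # skip the string columns
        rows.append(row)

arr = np.array(rows)   # 2d as long as every line has the same count of numbers
print(arr)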

Saving a row-wise txt with numpy with a pre-defined delimiter

Let's say I have two numpy arrays x and y and I want to save them in a txt with a tab as a delimiter (\t), their appropriate types (x is a float and y is an integer), and a certain format. For example:
import numpy as np
x=np.random.uniform(0.1,10.,4)
y=np.random.randint(10,size=4)
If I simply use np.savetxt('name.txt',(x,y)), this is what I get:
6.111548206503401026e+00 4.208270401300298058e-01 5.914485954766230957e-01 6.652272388676337966e-01
6.027109785846696433e+00 1.024051075089774443e+01 3.358386699980072621e+01 7.652668778594046151e-01
But what I want is a row-wise txt file, so I followed this solution:
numpy array to a file, np.savetxt
and by using
np.savetxt('name.txt',np.vstack((x,y)).T,delimiter='\t') I get:
2.640596763338360020e+00 4.000000000000000000e+00
8.693117057064670306e+00 4.000000000000000000e+00
3.891035166453641558e+00 6.000000000000000000e+00
9.044178202861068883e+00 2.000000000000000000e+00
Up to here it is ok, but as I mentioned, I want the output to have the appropriate data type and some formatting, so I tried np.savetxt('name.txt', np.vstack((x,y)).T, fmt='%7.2f,%5i', delimiter='\t'), and what I get is:
2.64, 4
8.69, 4
3.89, 6
9.04, 2
which does have the appropriate format and data type, but adds a , after the first column. Does anyone know how to get rid of this , printed between the columns?
The comma is in your fmt string. Replace it with fmt='%7.2f %5i', like so:
np.savetxt('name.txt',np.vstack((x,y)).T,fmt='%7.2f %5i')
Note that the tab delimiter (delimiter='\t') has no effect here: when fmt is a single string that formats the whole row, np.savetxt ignores the delimiter argument. If you want a tab between the values, put it in the format string, fmt='%7.2f\t%5i', or alternatively pass one format per column:
np.savetxt('name.txt', np.vstack((x,y)).T, fmt=('%7.2f', '%5i'), delimiter='\t')
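For reference, a minimal round-trip sketch with the arrays from the question, checking what actually lands in the file:
import numpy as np

x = np.random.uniform(0.1, 10., 4)
y = np.random.randint(10, size=4)

# One format per column: the delimiter is inserted between them
np.savetxt('name.txt', np.vstack((x, y)).T, fmt=('%7.2f', '%5i'), delimiter='\t')

with open('name.txt') as f:
    print(f.read())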

Pandas: Replacing column values in dataframe columns

My goal for this question is to insert a comma between every character of every column value; the values have been hashed and padded to a length of 19 digits.
The code below works partially, but the array values get messed up when the f_comma function is applied to the column values... thanks for any help!
I've taken some of the answers from other questions and have created the following code:
using this function -
def f_comma(p_string, n=1):
    p_string = str(p_string)
    return ','.join(p_string[i:i+n] for i in range(0, len(p_string), n))
And opening a tsv file
data = pd.read_csv('a1.tsv', sep = '\t', dtype=object)
I have modified another answer to do the following -
h = 1
try:
    while data.columns[h]:
        a = data.columns[h]
        data[a] = f_comma((abs(data[a].apply(hash))).astype(str).str.zfill(19))
        h += 1
except IndexError:
    pass
which returns this array
array([[ '0, , , , ,4,1,7,5,7,0,1,4,5,4,6,1,6,5,3,1,4,6,1,\n,N,a,m,e,:, ,d,a,t,e,,, ,d,t,y,p,e,:, ,o,b,j,e,c,t',
'0, , , , ,6,2,9,1,6,7,0,8,4,2,8,2,9,1,0,9,5,9,4,\n,N,a,m,e,:, ,n,a,m,e,,, ,d,t,y,p,e,:, ,o,b,j,e,c,t']], dtype=object)
without the f_comma function the array looks like -
array([['3556968867719847281', '3691880917405293133']], dtype=object)
The goal is an array like this -
array([['3,5,5,6,9,6,8,8,6,7,7,1,9,8,4,7,2,8,1', '3,6,9,1,8,8,0,9,1,7,4,0,5,2,9,3,1,3,3']], dtype=object)
You should be able to use pandas string functions. The problem with your loop is that f_comma is called on the whole Series at once, so str(p_string) stringifies the entire Series, including its Name: and dtype: footer (visible in your garbled output), instead of each value.
e.g. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.join.html
df["my_column"].str.join(',')

Python - Subtract 2 lists from each other that are in a dictionary

So I have to take information from a large data file that has 14 properties (columns). Using this information I was able to combine the data into a list of floats. I have to analyse it and was required to normalise the values with (value - minvalue)/(maxvalue - minvalue). I then put the original value list into a dictionary with the normalised values so that they stay related.
I need to now take 2 different keys of this dictionary which correspond to 2 different lists of normalised values and then subtract them from each other for further analysis.
Sample of my dictionary info:
(3.0, 13.73, 4.36, 2.26, 22.5, 88.0, 1.28, 0.47, 0.52, 1.15, 6.62, 0.78, 1.75, 520.0):
[0.7105263157894738, 0.7154150197628459, 0.4812834224598929, 0.6134020618556701, 0.1956521739130435, 0.10344827586206898, 0.02742616033755273, 0.7358490566037735, 0.2334384858044164, 0.4556313993174061, 0.2439024390243903, 0.1758241758241758, 0.17261055634807418]
there are over 100 similar entries
Using Python3 and no libraries apart from math
Any help is appreciated but if you feel there is an easier way to do this please let me know.
Edit: I cannot use any imported libraries
I'll add in some of my code, but I have to snip a large portion out as it is much too large to include in this post.
for line in temp_file:
    line = line.strip()                # remove whitespace
    line_list = line.split(",")        # split the line into components separated by commas
    temp_list2 = []
    for item in line_list[0:]:
        value_float = float(item)      # convert values currently of type string into type float
        temp_list2.append(value_float)
    tuple_list = tuple(temp_list2)     # make each line into a separate tuple and collect
    data_list.append(tuple_list)       # these tuples in a master list data_list
prop_elts = [(x[1:]) for x in data_list]
------snip-------- (this is just where I defined each of the columns and then calculated the normalised values)
i = 0
while i < len(data_list):
    all_props_templist = [prop1_list[i], prop2_list[i], prop3_list[i], prop4_list[i], prop5_list[i], prop6_list[i], prop7_list[i], prop8_list[i], prop9_list[i], prop10_list[i], prop11_list[i], prop12_list[i], prop13_list[i]]
    all_properties.append(all_props_templist)
    i = i + 1
my_data_Dictionary = {el1: el2 for el1, el2 in zip(data_list, all_properties)}
If data is your dict,
[a-b for a, b in zip(data[key1], data[key2])]
is a list whose elements are the difference between the corresponding elements in data[key1] and data[key2].
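For example, reusing my_data_Dictionary from your code (any two keys work the same way):
# Take any two keys of the dictionary and subtract their normalised lists
key1, key2 = list(my_data_Dictionary)[:2]
diffs = [a - b for a, b in zip(my_data_Dictionary[key1], my_data_Dictionary[key2])]
print(diffs)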
PS. When you see numbered variable names:
all_props_templist = [prop1_list[i],prop2_list[i],prop3_list[i],prop4_list[i],prop5_list[i],prop6_list[i],prop7_list[i],prop8_list[i],prop9_list[i],prop10_list[i],prop11_list[i],prop12_list[i],prop13_list[i]]
know that the situation is crying out for a list with an index in place of the number:
all_props_templist = [prop_list[j][i] for j in range(13)]
