I used numpy.loadtxt to load a file that contains this structure:
99 0 1 2 3 ... n
46 0.137673 0.147241 0.130374 0.155461 ... 0.192291
32 0.242157 0.186015 0.153261 0.152680 ... 0.154239
77 0.163889 0.176748 0.184754 0.126667 ... 0.191237
12 0.139989 0.417530 0.148208 0.188872 ... 0.141071
64 0.172326 0.172623 0.196263 0.152864 ... 0.168985
50 0.145201 0.156627 0.214384 0.123387 ... 0.187624
92 0.127143 0.133587 0.133994 0.198704 ... 0.161480
Now, I need the first column (except the first line) to store the index of the highest value in its line.
At the end, save this array to a file with the same number format as the original.
Thanks.
Can you use numpy.argmax? Something like this:
import numpy as np
# This is a simple example. In your case, A is loaded with np.loadtxt
A = np.array([[1, 2.0, 3.0], [3, 1.0, 2.0], [2.0, 4.0, 3.0]])
B = A.copy()
# Copy the max indices of rows of A into first column of B
B[:,0] = np.argmax(A[:,1:], 1)
# Save the results using np.savetxt with fmt, dynamically generating the
# format string based on the number of columns in B (setting the first
# column to integer and the rest to float)
np.savetxt('/path/to/output.txt', B, fmt='%d' + ' %f' * (B.shape[1]-1))
Note that np.savetxt allows for formatting.
This example code doesn't address the fact that you want to skip the first row, and you may need to offset the result of np.argmax by 1 depending on whether you want the index counted relative to the value columns only or relative to the whole row including the index column (column 0).
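For completeness, here is a minimal sketch of the whole round trip, assuming the first line of data.txt is a header that should be skipped for the computation and copied back verbatim on save (the file names are placeholders):
import numpy as np

# assumption: 'data.txt' is laid out as shown above and its first line is a header
with open('data.txt') as f:
    header = f.readline().rstrip('\n')

A = np.loadtxt('data.txt', skiprows=1)
B = A.copy()
B[:, 0] = np.argmax(A[:, 1:], axis=1)  # index of the largest value in each row

fmt = '%d' + ' %f' * (B.shape[1] - 1)
np.savetxt('output.txt', B, fmt=fmt, header=header, comments='')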
Your data looks like a DataFrame with columns and an index: the data types are not homogeneous. It is more convenient to do this with pandas, which handles this layout natively:
import pandas as pd
a = pd.read_csv('data.txt', sep=r'\s+', index_col=0)  # from_csv is deprecated; read_csv with index_col=0 is the equivalent
u = a.set_index(a.values.argmax(axis=1)).to_string()
with open('out.txt', 'w') as f:
    f.write(u)
Then out.txt is:
0 1 2 3 4
4 0.137673 0.147241 0.130374 0.155461 0.192291
0 0.242157 0.186015 0.153261 0.152680 0.154239
4 0.163889 0.176748 0.184754 0.126667 0.191237
1 0.139989 0.417530 0.148208 0.188872 0.141071
2 0.172326 0.172623 0.196263 0.152864 0.168985
2 0.145201 0.156627 0.214384 0.123387 0.187624
3 0.127143 0.133587 0.133994 0.198704 0.161480
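If the floats need to be written with the same fixed six-decimal format as the original file (an assumption about your input), to_string also accepts a float_format callable:
# assumption: the original file uses six-decimal '%f'-style floats
u = a.set_index(a.values.argmax(axis=1)).to_string(float_format=lambda v: '%.6f' % v)
with open('out.txt', 'w') as f:
    f.write(u)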
I have created an array whose shape attribute is (6, 20), like this:
import numpy as np
data = np.random.logistic(10, 1, 120)
data = data.reshape(6, 20)
Then I instantiate a pandas.DataFrame from the array data:
import pandas as pd
data = pd.DataFrame(data)
Now this is a DataFrame created from values drawn with numpy's logistic distribution function,
and it returns this:
0 1 2 3 4 5
0 9.602117 9.507674 9.848685 9.215080 11.061676 9.627753
1 11.702407 9.804924 7.375905 10.784320 8.485818 10.938005
2 9.628927 9.713187 10.027626 10.653311 11.301493 8.756792
3 11.229905 12.013172 10.023200 9.211614 7.139757 9.687851
6 7 8 9 10 11 12
0 9.356069 11.483162 8.993130 8.015089 9.808234 9.435853 9.773375
1 13.422060 10.027434 9.694008 9.677682 10.806266 12.393364 9.479257
2 10.821846 10.690378 8.321566 9.595122 11.753948 10.021815 10.412572
3 8.499120 7.352394 9.288662 9.178306 10.073842 9.246110 9.075350
13 14 15 16 17 18 19
0 9.809366 8.502451 11.624395 12.824338 9.729167 8.945258 10.464157
1 6.698941 9.416421 11.477242 9.622115 6.374589 9.459355 10.435674
2 11.068721 9.775433 9.447799 8.972052 10.692942 10.978305 10.047067
3 10.381596 10.968330 11.892766 12.241880 9.980124 7.321942 9.241030
When I try to set columns=list("abcdef"), I get this error:
ValueError: Shape of passed values is (6, 20), indices imply (6, 6)
My expected output is similar to what is shown directly from the numpy array: it should contain each column as a pandas.Series of lists (or a list of lists).
a.
0 [ 6.98467276 9.16242742 6.99065177 11.50834399 9.29697138 7.93926441
9.05857668 7.13652948 11.01724792 13.31658877 8.63137079 9.5564405
7.37161153 11.19414704 9.45957466 9.19826796 10.13506672 9.74830158
9.97456348 8.35217153]
b.
[10.48249082 11.94030324 12.59080011 10.55695088 12.43071037 11.49568774
10.03540181 11.08708832 10.24655111 8.17904856 11.04791142 7.30069964
8.34783674 9.93743588 8.1537666 9.92773204 10.3416315 9.51624921
9.60124236 11.37511301]
c.
[ 8.21851024 12.71641524 9.7748047 9.51267978 7.92793378 12.1646706
9.67236267 10.22201002 9.67197374 9.70551429 7.79209516 9.20295594
9.26231527 8.04560836 11.0409066 8.63660332 9.18397671 8.17510874
9.61619671 8.42704322]
d.
[14.54825819 16.97573893 7.70643136 12.06334323 14.64054726 9.54619595
10.30686621 12.20487566 10.78492189 12.01011666 10.12405213 8.57057999
10.41665479 7.85921253 10.15572125 9.20554292 10.03832545 9.43720211
11.06605713 9.60298514]
I have found this thread that looks like my problem, but it has not helped me much; also, I would use the data in a different way.
Could I assign the lengths of the columns, or maybe set the dimensions of this pandas.DataFrame?
Your data has 6 rows and 20 columns. If you want to pass each "row" of the numpy array as a "column" to the DataFrame, you can simply transpose:
df = pd.DataFrame(data=np.random.logistic(10, 1, 120).reshape(6, 20).transpose(),
                  columns=list("abcdef"))
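As a quick sanity check (just a sketch of what to expect, given your description), the transposed frame should have 20 rows and the six columns a..f:
print(df.shape)             # (20, 6)
print(df.columns.tolist())  # ['a', 'b', 'c', 'd', 'e', 'f']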
Edit:
To get the data in a single row, try:
df = pd.DataFrame(columns=list("abcdef"), index=[0])
df.iloc[0] = np.random.logistic(10, 1, 120).reshape(6,20).transpose()
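If that assignment complains about mismatched lengths in your pandas version (assigning a (20, 6) array to a single 6-element row can raise a ValueError in some releases), a hedged alternative is to build the one-row frame directly, with each cell holding one 20-value list:
import numpy as np
import pandas as pd

data = np.random.logistic(10, 1, 120).reshape(6, 20)
# one row; each column a..f holds the corresponding 20-value list
df = pd.DataFrame({c: [row.tolist()] for c, row in zip("abcdef", data)})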
I have a .csv with a 'cities' column. The column's values are supposed to be a list with each element being a list itself in the following format:
['City', (latitude, longitude)]
So for example:
[['Athens', (37.9839412, 23.7283052)], ['Heraklion', (35.3400127, 25.1343475)], ['Mykonos', (37.45142265, 25.392303200095327)]]
I am trying to load the csv into a pandas dataframe using pd.read_csv().
The value in the column ends up with type string and looks like this:
'[[\'Athens\', (37.9839412, 23.7283052)], [\'Heraklion\', (35.3400127, 25.1343475)], [\'Mykonos\', (37.45142265, 25.392303200095327)]]'
However, because it's a string, iterating over it just yields one character at a time.
When I do:
for i in cities:
    print(i)
Or:
list(cities)
I get:
[
[
'
A
t
h
e
n
s
'
,
(
3
7
.
9
8
3
9
4
1
2
,
2
3
.
7
2
8
3
0
5
2
)
]
,
etc.
I am looking for a way to 're-build' the data back into Python list format so that I can access the string 'Athens' with df.loc[0]['cities'][0] and the tuple (37.9839412, 23.7283052) with df.loc[0]['cities'][1].
I have tried df['cities'].astype(list) which results in the error:
TypeError: dtype '<class 'list'>' not understood
It seems the data is a string that looks like a Python list. You can recover the values using ast.literal_eval, applying it to every row of the DataFrame and, if you like, storing the city and the coordinates as separate columns.
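A minimal sketch of that per-row approach, assuming the file is called cities.csv and the column is named 'cities' as in your example:
import ast
import pandas as pd

df = pd.read_csv('cities.csv')                        # assumption: file and column names
df['cities'] = df['cities'].apply(ast.literal_eval)   # string -> real Python list

first = df.loc[0, 'cities']   # the full list for row 0
print(first[0])               # ['Athens', (37.9839412, 23.7283052)]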
What's happening is that it's currently a string, rather than a list. Simply implement this:
import ast
l = '[[\'Athens\', (37.9839412, 23.7283052)], [\'Heraklion\', (35.3400127, 25.1343475)], [\'Mykonos\', (37.45142265, 25.392303200095327)]]'
res = ast.literal_eval(l)
print(res)
print(type(res))
Output:
[['Athens', (37.9839412, 23.7283052)], ['Heraklion', (35.3400127, 25.1343475)], ['Mykonos', (37.45142265, 25.392303200095327)]]
<class 'list'>
There is an INDEX/MATCH function combination in Excel that I use to check whether the elements are present in the required column:
=IFERROR(INDEX($B$2:$F$8,MATCH($J4,$B$2:$B$8,0),MATCH(K$3,$B$1:$F$1,0)),0)
This is the formula I am using right now and it is yielding good results, but I want to implement it in Python.
brand N Z None
Honor 63 96 190
Tecno 0 695 763
From this table I want:
brand L N Z
Honor 0 63 96
Tecno 0 0 695
It should compare both the column and the index and give the appropriate value.
I have tried the lookup function in pandas but that gives me:
ValueError: Row labels must have same size as column labels
What you basically do with your Excel formula is create something like a pivot table; you can do that with pandas as well, e.g. like this:
import pandas as pd
import numpy as np

# Define the columns and brands you would like to have in your result table;
# along with the dataframe in the variable df, this is the only input
columns_query = ['L', 'N', 'Z']
brands_query = ['Honor', 'Tecno', 'Bar']
# now begin processing by selecting the columns
# which should be shown and are actually present;
# add the brand column, even if it was not selected
columns_present = {col for col in set(columns_query) if col in df.columns}
columns_present.add('brand')
# select the brands in question and take the
# info in columns we identified for these brands
# from this generate a "flat" list-like data
# structure using melt
# it contains records containing
# (brand, column-name and cell-value)
flat = df.loc[df['brand'].isin(brands_query), list(columns_present)].melt(id_vars='brand')
# if you also want to see the columns and brands,
# for which you have no data in your original df
# you can use the following lines (if you don't
# need them, just skip the following lines until
# the next comment)
# the code just generates data points for the
# columns and rows which would otherwise not be
# displayed and fills them with NaN (the pandas
# equivalent of None)
columns_missing = set(columns_query).difference(columns_present)
brands_missing = set(brands_query).difference(df['brand'].unique())
num_dummies = max(len(brands_missing), len(columns_missing))
dummy_records = {
    'brand': list(brands_missing) + [brands_query[0]] * (num_dummies - len(brands_missing)),
    'variable': list(columns_missing) + [columns_query[0]] * (num_dummies - len(columns_missing)),
    'value': [np.NaN] * num_dummies
}
dummy_records = pd.DataFrame(dummy_records)
flat = pd.concat([flat, dummy_records], axis='index', ignore_index=True)
# we get the result by the following line:
flat.set_index(['brand', 'variable']).unstack(level=-1)
For my testdata, this outputs:
value
variable L N Z
brand
Bar NaN NaN NaN
Honor NaN 63.0 96.0
Tecno NaN 0.0 695.0
The testdata is as follows (note that above we don't see column None or row Foo, but we do see row Bar and column L, which are not actually present in the testdata but were "queried"):
brand N Z None
0 Honor 63 96 190
1 Tecno 0 695 763
2 Foo 8 111 231
You can generate this testdata using:
import pandas as pd
import numpy as np
import io
raw=\
"""brand N Z None
Honor 63 96 190
Tecno 0 695 763
Foo 8 111 231"""
df = pd.read_csv(io.StringIO(raw), sep=r'\s+')
Note: the result as shown in the output is a regular pandas DataFrame. So in case you plan to write the data back to an Excel sheet, there should be no problem (pandas provides methods to read/write DataFrames to/from Excel files).
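For example, a hedged sketch of writing the result back out (this assumes an Excel engine such as openpyxl is installed; the file name is a placeholder):
result = flat.set_index(['brand', 'variable']).unstack(level=-1)
result.to_excel('result.xlsx')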
Do you need to use pandas for this task? You can do it with plain Python as well: read from one text file and print out the matched and processed fields.
Basic file reading in Python goes like this, where datafile.csv is your file. It reads all the lines in the file and prints out the right result. First you need to save your file in .csv format so there is a ',' separator between the fields.
import csv  # use csv
print('brand L N Z')  # print new header
with open('datafile.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    next(spamreader, None)  # skip old header
    for row in spamreader:
        # You need to add the Excel MATCH etc. logic here.
        print(row[0], 0, row[1], row[2])  # print output
Input file:
brand,N,Z,None
Honor,63,96,190
Tecno,0,695,763
Prints out:
brand L N Z
Honor 0 63 96
Tecno 0 0 695
(I am not familiar with Excel's MATCH function, so you may need to add some logic to the above Python script to get it working with all your data.)
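If you want a sketch of that MATCH-like logic in plain Python (the column list and the default value 0 are assumptions mirroring your IFERROR(..., 0)):
import csv

wanted_cols = ['L', 'N', 'Z']  # assumption: the columns you query

with open('datafile.csv', newline='') as csvfile:
    rows = list(csv.reader(csvfile, delimiter=','))

header = rows[0]
# build {brand: {column: value}} from the input table
table = {row[0]: dict(zip(header[1:], row[1:])) for row in rows[1:]}

print('brand', *wanted_cols)
for brand, values in table.items():
    # missing columns fall back to 0, like IFERROR(..., 0)
    print(brand, *(values.get(col, 0) for col in wanted_cols))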
How do I convert the crosstab data from the input file mentioned below into columns based on the input list without using pandas?
Input list
[A,B,C]
Input data file
Labels A, B, C are only for representation; the original file only has the numeric values.
We can ignore the columns XX & YY based on the length of the input list.
A B C XX YY
A 0 2 3 4 8
B 4 0 6 4 8
C 7 8 0 5 8
Output (Output needs to have labels)
A A 0
A B 2
A C 3
B A 4
B B 0
B C 6
C A 7
C B 8
C C 0
The labels need to be present in the output file even though they are not present in the input file, hence I have shown them in the representation above.
NB: In reality the labels are sorted city names without duplicates in ascending order & not single alphabets like A or B.
Unfortunately this would have been easier if I could install pandas on the server & use unstack(), but installations aren't allowed on this old server right now.
This is on Python 3.5.
Considering you tagged the post csv, I'm assuming the actual input data is a .csv file, without header as you indicated.
So example data would look like:
0,2,3,4,8
4,0,6,4,8
7,8,0,5,8
If the labels are provided as a list matching the order of the columns and rows (i.e. ['A', 'B', 'C']), this would turn the example output into:
'A','A',0
'A','B',2
'A','C',3
'B','A',4
etc.
Note that this implies the number of rows and columns in the file cannot exceed the number of labels provided.
You indicate that the columns you label 'XX' and 'YY' are to be ignored, but not how that is to be communicated; since you mention the length of the input list determines it, I assume this means 'everything after column n can be ignored'.
This is a simple implementation:
from csv import reader

def unstack_csv(fn, columns, labels):
    with open(fn) as f:
        cr = reader(f)
        row = 0
        for line in cr:
            col = 0
            for x in line[:columns]:
                yield labels[row], labels[col], x
                col += 1
            row += 1

print(list(unstack_csv('unstack.csv', 3, ['A', 'B', 'C'])))
or if you like it short and sweet:
from csv import reader

with open('unstack.csv') as f:
    content = reader(f)
    labels = ['A', 'B', 'C']
    print([(labels[row], labels[col], x)
           for row, data in enumerate(content)
           for col, x in enumerate(data) if col < 3])
(I'm also assuming using numpy is out, for the same reason as pandas, but that stuff like csv is in, since it's a standard library)
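One small caveat: csv.reader yields strings, so the third element of each tuple is text ('0', '2', ...). A hedged variant of the comprehension that casts the values, assuming the crosstab only contains integers:
from csv import reader

labels = ['A', 'B', 'C']
with open('unstack.csv') as f:
    triples = [(labels[row], labels[col], int(x))
               for row, data in enumerate(reader(f))
               for col, x in enumerate(data) if col < 3]
print(triples)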
If you don't want to provide the labels explicitly, but just want them generated, you could do something like:
def label(n):
    r = n // 26
    c = chr(65 + (n % 26))
    if r > 0:
        return label(r - 1) + c
    else:
        return c
And then of course just remove the labels from the examples and replace with calls to label(col) and label(row).
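A quick check of what that helper produces for a few 0-based indices:
print([label(i) for i in (0, 1, 25, 26, 27)])  # ['A', 'B', 'Z', 'AA', 'AB']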
Python 2.7
I have a DataFrame with two columns, coordinates and locs. coordinates contains 10 lat/long pairs and locs contains 10 strings.
The following code leads to a ValueError: arrays were different lengths. It seems like I'm not writing the condition correctly.
lst_10_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'], ['40.289918, -83.036372'], ['37.226582, -95.70522299999999']]
lst_10_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX'], ['Seattle, WA'], ['Roswell, GA'], ['Texas'], ['null'], ['??, passing by...'], ['null']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_10_cords
df['locs'] = lst_10_locs
print df
df = df[df['coordinates'] != ['37.226582', '-95.70522299999999']] #ValueError
The error message is
File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", lin
e 1283, in wrapper
res = na_op(values, other)
File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", lin
e 1143, in na_op
result = _comp_method_OBJECT_ARRAY(op, x, y)
File "C:...\biney\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", lin
e 1120, in _comp_method_OBJECT_ARRAY
result = libops.vec_compare(x, y, op)
File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 10 vs 2
My goal here is to check and eliminate all entries in the coordinates column that are equal to the list ['37.226582, -95.70522299999999'], so I want df['coordinates'] to print out [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'], ['40.289918, -83.036372']]
I was hoping that this documentation would help, particularly the part that shows:
"You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame):"
df[df['A'] > 0]
So it seems like I'm not quite getting the syntax right, and I'm stuck. How do I set a condition on the cell values of a certain column and return a DataFrame containing only the rows whose cells meet that condition?
Can you consider this?
df
coordinates locs
0 [37.09024, -95.712891] [United States]
1 [-37.605, 145.146] [Doreen, Melbourne]
2 [43.0481962, -76.0488458] [Upstate NY]
3 [29.7604267, -95.3698028] [Houston, TX]
4 [47.6062095, -122.3320708] [Seattle, WA]
5 [34.0232431, -84.3615555] [Roswell, GA]
6 [31.9685988, -99.9018131] [Texas]
7 [37.226582, -95.705222999] [null]
8 [40.289918, -83.036372] [??, passing by...]
9 [37.226582, -95.7052229999] [null]
import numpy as np

df['lat'] = df['coordinates'].map(lambda x: float(x[0].split(",")[0]))
df['lon'] = df['coordinates'].map(lambda x: float(x[0].split(",")[1]))
df[~((np.isclose(df['lat'], 37.226582)) & (np.isclose(df['lon'], -95.70522299999999)))]
coordinates locs lat lon
0 [37.09024, -95.712891] [United States] 37.090240 -95.712891
1 [-37.605, 145.146] [Doreen, Melbourne] -37.605000 145.146000
2 [43.0481962, -76.0488458] [Upstate NY] 43.048196 -76.048846
3 [29.7604267, -95.3698028] [Houston, TX] 29.760427 -95.369803
4 [47.6062095, -122.3320708] [Seattle, WA] 47.606209 -122.332071
5 [34.0232431, -84.3615555] [Roswell, GA] 34.023243 -84.361555
6 [31.9685988, -99.9018131] [Texas] 31.968599 -99.901813
8 [40.289918, -83.036372] [??, passing by...] 40.289918 -83.036372
One issue: if you look into the objects your DataFrame is storing as coordinates, you see that each one is a single string (inside a one-element list). The error you are getting seems to come from comparing the 10-element .coordinates Series with a 2-element list, and there is obviously a mismatch. Using .values seemed to get around that.
df2 = pd.DataFrame([row if row[0] != ['37.226582, -95.70522299999999'] else [np.nan, np.nan] for row in df.values], columns=['coords', 'locs']).dropna()
OK, here is an approach to ensure you have clean data to operate on.
Let's assume 4 entries, one with a dirty coordinate entry.
lst_4_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['null']]
lst_4_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_4_cords
df['locs'] = lst_4_locs
coordinates locs
0 [37.09024, -95.712891] [United States]
1 [-37.605, 145.146] [Doreen, Melbourne]
2 [43.0481962, -76.0488458] [Upstate NY]
3 [null] [Houston, TX]
Now we make a cleaning method. You would really want to test the values:
type(value) is list
type(value[0]) is str
value[0].split(",") has two elements
each element can be cast to float, etc.
each is valid as a lat or a lon
However, we will do it the quick-and-dirty way using a try/except.
def scrubber_drainer(value):
    try:
        # we assume value is a list with a single string in position zero;
        # that string has a comma and can be split into a tuple of two floats
        return tuple([float(value[0].split(",")[0]), float(value[0].split(",")[1])])
    except:
        # return (38.9072, 77.0396)  # the swamp
        return tuple([0.0, 0.0])  # some default
So the return value is typically a tuple with 2 floats. If it can't become that, we return a default (0.0, 0.0).
Now update the coordinates:
df['coordinates'] = df['coordinates'].map(scrubber_drainer)
Then we use this technique to split the tuple out into separate columns:
df[['lat', 'lon']] = df['coordinates'].apply(pd.Series)
and now you can use np.isclose() to filter:
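A minimal sketch of that final step, assuming df now carries the lat/lon columns produced above and the pair below is the one you want to drop:
import numpy as np

mask = np.isclose(df['lat'], 37.226582) & np.isclose(df['lon'], -95.70522299999999)
clean = df[~mask]
print(clean)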