I have a pandas dataframe df.
One column is a string of numbers (as characters) divided by blank space
I need to convert it to multidim numpy array.
I thought that :
df.A.apply(lambda x: np.array(x.split(" "))).values
would do the trick.
Actually it returns an array of arrays:
array([array(['70', '80', '82', ..., '106', '109', '82'], dtype='<U3'),
array(['151', '150', '147', ..., '193', '183', '184'], dtype='<U3'),
This does not seem to be what I am looking for, which should rather look like
array([['70', '80', '82', ..., '106', '109', '82'], ['151', '150', '147', ..., '193', '183', '184'], ...
First: what should I do to get my data in the second format?
Second: I am actually a bit confused about the difference between the two data structures. At the end of the day, a multidimensional array is an array of arrays. From this perspective it would seem that the two are the same structure. But I am sure I am missing something.
EXAMPLE:
df=pd.DataFrame({"A":[0,1,2,3],"B":["1 2 3 4","5 6 7 8","9 10 11 12","13 14 15 16"]})
   A  B
0  0  "1 2 3 4"
1  1  "5 6 7 8"
2  2  "9 10 11 12"
3  3  "13 14 15 16"
This command
df.B.apply(lambda x: np.array(x.split(" "))).values
gives:
array([array(['1', '2', '3', '4'], dtype='<U1'),
array(['5', '6', '7', '8'], dtype='<U1'),
array(['9', '10', '11', '12'], dtype='<U2'),
array(['13', '14', '15', '16'], dtype='<U2')], dtype=object)
instead of
array([['1', '2', '3', '4'],
['5', '6', '7', '8'],
['9', '10', '11', '12'],
['13', '14', '15', '16']], dtype='<U2')
Question 1: How do I get this last structure?
Question 2: What is the difference between the two? Technically both are arrays of arrays...
You can do it by calling str.split on df.A directly with the parameter expand=True, and then using values:
df = pd.DataFrame({'A':['70 80 82','151 150 147']})
print (df.A.str.split(' ',expand=True).values)
array([['70', '80', '82'],
['151', '150', '147']], dtype=object)
With your method, if all the strings contain the same number of values, you can still use np.stack to get the same result:
print (np.stack(df.A.apply(lambda x: np.array(x.split(" "))).values))
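If the end goal is a numeric array rather than an array of strings, the split result can be cast in one extra step. The following is a minimal sketch using the example frame from the question; the variable names `tokens` and `numbers` are only illustrative, and it assumes every row splits into the same number of tokens.

```python
import pandas as pd

df = pd.DataFrame({"B": ["1 2 3 4", "5 6 7 8", "9 10 11 12", "13 14 15 16"]})

# expand=True yields one column per token, so the result is a DataFrame
tokens = df["B"].str.split(" ", expand=True)

# Cast the string tokens to integers to get a regular 2D numeric array
numbers = tokens.to_numpy().astype(int)

print(numbers.shape)  # (4, 4)
print(numbers[2, 1])  # 10
```

If the rows had different lengths, the missing cells would be filled with None and the integer cast would fail, so ragged data needs separate handling.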
EDIT: As for the difference, I am not sure I can explain it well enough, but I will try. Let's define
arr1 = df.A.str.split(' ',expand=True).values
arr2 = df.A.apply(lambda x: np.array(x.split(" "))).values
First you can notice that the shape is not the same:
print(arr1.shape)
(2, 3)
print(arr2.shape)
(2,)
So one difference is that arr2 is a 1D array whose elements happen to be 1D arrays themselves. When you build arr2 with values, pandas constructs a 1D array from the Series df.A.apply(lambda x: np.array(x.split(" "))) without looking at the type of the elements in that Series. For arr1, df.A.str.split(' ',expand=True) is not a Series but a DataFrame, so values constructs a 2D array of shape (number of rows, number of columns). In both cases you use values, but having an array in each cell of a Series (as your method creates) does not produce a 2D array.
Then, if you want to access an element (such as the second element of the first row), you can use arr1[0,1], while arr2[0,1] raises an error because that structure is not a 2D array; arr2[0][1] gives the right answer, because you access the second element [1] of the first 1D array [0] in arr2.
I hope it gives some explanation.
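The difference can also be reproduced without pandas. This is a small sketch that builds both structures by hand (the object array is filled element by element purely for illustration) and compares their shapes and indexing behaviour.

```python
import numpy as np

# A regular 2D array: a single block of 2 x 3 string elements
arr1 = np.array([["70", "80", "82"], ["151", "150", "147"]])

# A 1D object array whose two elements are themselves 1D arrays
arr2 = np.empty(2, dtype=object)
arr2[0] = np.array(["70", "80", "82"])
arr2[1] = np.array(["151", "150", "147"])

print(arr1.shape)  # (2, 3)
print(arr2.shape)  # (2,)

print(arr1[0, 1])  # 80
print(arr2[0][1])  # 80

# Two-index access fails on the object array because it has only one axis
try:
    arr2[0, 1]
except IndexError as exc:
    print("IndexError:", exc)

# np.stack turns the inner arrays into rows of a real 2D array
stacked = np.stack(list(arr2))
print(stacked.shape)  # (2, 3)
```

Because the 2D array is one contiguous block, slicing such as `arr1[:, 0]` and vectorised operations work on it, whereas the object array only stores references to separate arrays.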
I have code that takes two columns of a CSV file, converts them into lists, and does a < / > comparison between the corresponding numbers. Now I want to collect the results of this comparison into multiple lists, with six results per list.
eg I get
1
2
3
4
5
6
7
8
9
10
11
12
and i want to display this as
[1,2,3,4,5,6]
[7,8,9,10,11,12]
This is the code I'm using to compare the lists:
'''
for i in range(len(fsa)):
    if fsa[i] < ghf[i]:
        print('1')
    else:
        print('0')
'''
The code that's not working, which is supposed to show the results as lists of six, is this one:
'''
print()
start = 0
end = len(''' i want the length of my results from the previous code, the 1's and 0's here. ''')
for x in range(start,end,6):
print('''i want the results here as my list'''[x:x+6])
'''
I'm a beginner, please help: how do I make the results a list?
I got the answer I wanted. In case someone else is struggling with this as
well, here's my solution:
'''
kol = []
for i in range(len(fsa)):
    if fsa[i] < ghf[i]:
        kol.append('1')
    else:
        kol.append('0')
start = 0
end = len(fsa)
for x in range(start,end,6):
    print(kol[x:x+6])
'''
outcome
'''
['1', '1', '0', '0', '1', '1']
['1', '0', '0', '1', '0', '0']
['0', '0', '0', '0', '1', '1']
['1', '1', '1', '0', '1', '1']
'''
You just need to make a new list and append to it instead of printing.
...
...
temp = []
for x in range(start,end,6):
    temp.append(fsa[x:x+6])
print(temp)
#[[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]
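Both steps, comparing and chunking, can also be written with list comprehensions. This is a minimal sketch; the contents of `fsa` and `ghf` are made-up sample data and `chunk_size` is an illustrative name.

```python
# Made-up sample data standing in for the two CSV columns
fsa = [3, 8, 1, 9, 4, 7, 2, 6, 5, 0, 8, 1]
ghf = [5, 2, 6, 9, 8, 1, 7, 3, 9, 4, 2, 6]

# Compare the lists pairwise and keep the results instead of printing them
results = [1 if a < b else 0 for a, b in zip(fsa, ghf)]

# Slice the results into consecutive chunks of six
chunk_size = 6
chunks = [results[i:i + chunk_size] for i in range(0, len(results), chunk_size)]

for chunk in chunks:
    print(chunk)
# [1, 0, 1, 0, 1, 0]
# [1, 0, 1, 1, 0, 1]
```

If the number of results is not a multiple of six, the last chunk is simply shorter.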
I am a beginner in Python, so I kindly ask for your help. I would like to produce a document in which the first column is the month (e.g. 2011.01), the second column is the number of ARD 'events' in that month, and the third column is the average of all ARD values in that month. If a month has no events, the row should read e.g. 2012.07 0 0.
I've already tried for 3 hours and now I am getting nervous.
I would really appreciate your help.
import pandas as pd
from numpy import mean
from numpy import std
from numpy import cov
from matplotlib import pyplot
from scipy.stats import pearsonr
from scipy.stats import spearmanr
data = pd.read_csv('ARD.txt',delimiter= "\t")
month = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
day = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31']
year = ['2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']
ertek = data[:1].iloc[0].values
print(ertek)
print(data.head)
def list_to_string ( y, m, d):
    str = ""
    s = [y, m, d]
    str.join(s)
    return str
for x in year:
    for y in month:
        for i in day:
            x = 1
            ertek = data[:x].iloc[0].values
            list_to_string(x, y, i)
            if ertek[0] == list_to_string[x, y, i]:
                print("")
                x += 1
            else:
                print("")
Result:
['2011.01.05.' 0.583333333]
<bound method NDFrame.head of Date ARB
0 2011.01.05. 0.583333
1 2011.01.06. 0.583333
2 2011.01.07. 0.590909
3 2011.01.09. 0.625000
4 2011.01.10. 0.142857
... ... ...
1284 2020.12.31. 0.900000
1285 2020.12.31. 0.900000
1286 2020.12.31. 0.900000
1287 2020.12.31. 0.900000
1288 2020.12.31. 0.900000
[1289 rows x 2 columns]>
Traceback (most recent call last):
File "C:\Users\Kókai Dávid\Desktop\python,java\python\stock-trading-ml-master\venv\Scripts\orosz\oroszpred.py", line 29, in <module>
list_to_string(x, y, i)
File "C:\Users\Kókai Dávid\Desktop\python,java\python\stock-trading-ml-master\venv\Scripts\orosz\oroszpred.py", line 21, in list_to_string
str.join(s)
TypeError: sequence item 0: expected str instance, int found
Process finished with exit code 1
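I'm not quite certain I follow the intent of the list_to_string function; if it is for comparing dates as strings, you can sidestep that entirely by parsing the dates and resampling by month:
df['Date'] = pd.to_datetime(df['Date'], format='%Y.%m.%d.')
df = df.set_index('Date')
# number of events and average per month; use 'ME' instead of 'M' on pandas 2.2 or newer
monthly = df['ARB'].resample('M').agg(['count', 'mean'])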
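A sketch of the full monthly table the question describes (month, number of events, average, with 0 0 for months without events) could look like the following. The three sample rows are invented, the column names `Date` and `ARB` are taken from the printed output in the question, and grouping by monthly period is one possible approach rather than the only one.

```python
import pandas as pd

# Invented sample in the same shape as the file shown in the question
data = pd.DataFrame({
    "Date": ["2011.01.05.", "2011.01.06.", "2011.03.07."],
    "ARB": [0.5, 0.7, 0.9],
})

# Parse the dates (note the trailing dot) and reduce them to monthly periods
dates = pd.to_datetime(data["Date"], format="%Y.%m.%d.")
months = dates.dt.to_period("M")

# Count the events and average the values per month
monthly = data.groupby(months)["ARB"].agg(["count", "mean"])

# Add the months that have no events at all, filled with 0
all_months = pd.period_range(months.min(), months.max(), freq="M")
monthly = monthly.reindex(all_months, fill_value=0)

# Label the rows like 2011.01
monthly.index = monthly.index.strftime("%Y.%m")
print(monthly)
```

The result can then be written out with `monthly.to_csv(...)` using a tab separator if the output should match the input file style.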
# File 1
Column = ['1', '2', '3']
# File 2
Column = ['-2', '-6', '-7', '-6', '-7']
# File 3
Column=['0', '3', '4', '6', '5']
# File 4
Column = ['-1', '-2', '-3', '-3', '-3']
# Combined files
Column = ['1', '2', '3', '-2', '-6', '-7', '-6', '-7', '0', '3', '4', '6', '5', '-1', '-2', '-3', '-3', '-3']
Guys, I want to select either max or min value from each file in the combined files.
Expected output:
Column = ['3', '-7', '6', '-3']
Any help will be appreciated!
I think you are asking for the value with the largest absolute value in each column. Try the code below:
Column1 = [1, 2, 3]
Column2 = [-2, -6, -7, -6, -7]
Column3 = [0, 3, 4, 6, 5]
Column4 = [-1, -2, -3, -3, -3]
print(max(Column1, key=abs))
print(max(Column2, key=abs))
print(max(Column3, key=abs))
print(max(Column4, key=abs))
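The same idea works directly on the string lists from the question if the conversion to int happens inside the key function. This is a small sketch; the list name `files` and the assumption that the per-file lists are still available separately are mine.

```python
# The four per-file lists from the question, still as strings
files = [
    ['1', '2', '3'],
    ['-2', '-6', '-7', '-6', '-7'],
    ['0', '3', '4', '6', '5'],
    ['-1', '-2', '-3', '-3', '-3'],
]

# For each file, pick the entry whose integer value is largest in magnitude
extremes = [max(column, key=lambda s: abs(int(s))) for column in files]

print(extremes)  # ['3', '-7', '6', '-3']
```

When several entries tie, max returns the first one it encounters.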
Your lists contain strings, not integers, so you should first convert them to integers:
--> https://www.geeksforgeeks.org/python-converting-all-strings-in-list-to-integers/
Without the conversion, it is like asking a person "What's the biggest value of apples, oranges, pears?".
After that, simply use Python's built-in max and min functions.
Column = [1, 2, 3]
print(max(Column))
--> 3
print(min(Column))
--> 1
I hope I could help a little bit. :)
Here are two ways to pick either the minimum or the maximum of each list at random. Both need import random, and the strings should be converted to integers first, otherwise sorting and max/min compare them lexicographically.
column=[sorted(column1)[random.randint(-1,0)]]
The first method sorts each list and takes either its first or its last element:
column=[]
column.append(sorted(column1)[random.randint(-1,0)])
column.append(sorted(column2)[random.randint(-1,0)])
column.append(sorted(column3)[random.randint(-1,0)])
column.append(sorted(column4)[random.randint(-1,0)])
column.append(sorted(column5)[random.randint(-1,0)])
The second method uses random.choice, which takes a single sequence to choose from:
column=[]
column.append(random.choice([max(column1),min(column1)]))
column.append(random.choice([max(column2),min(column2)]))
column.append(random.choice([max(column3),min(column3)]))
column.append(random.choice([max(column4),min(column4)]))
column.append(random.choice([max(column5),min(column5)]))
I have a certain column in a Pandas DataFrame that has the following unique factor levels:
My_Factor_Levels = [9.0, 0, 6.0, '9', '6', 9, 6, 'DE', '3U', '9.0', '6Z', '6.0', '9.', '6.', '3B', '1U', '2Z', '68', '6B']
Note that there are ten separate values in My_factor_Levels (9.0, 6.0, '9', '6', 9, 6, '9.0', '6.0', '9.', '6.') that represent values from two different factor levels - '9' and '6'. How can I coerce these values to conform to one unique grouping (preferably in string format)? Any help would be much appreciated!
You can try casting each value to float, falling back to the original value when that fails, and then converting to a set (all unique values in the iterable):
My_Factor_Levels = [9.0, 0, 6.0, '9', '6', 9, 6, 'DE', '3U', '9.0', '6Z', '6.0', '9.', '6.', '3B', '1U', '2Z', '68', '6B']
def safe_convert(x):
    try:
        return str(float(x))
    except (TypeError, ValueError):
        # not a number (e.g. 'DE' or '3U'): keep the value unchanged
        return x
coerced = set([safe_convert(x) for x in My_Factor_Levels])
>>> coerced
{'0.0', '1U', '2Z', '3B', '3U', '6.0', '68.0', '6B', '6Z', '9.0', 'DE'}
If you would prefer the final coerced result to be a list, simply do list(set(...)) instead.
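If the levels should come out as '9' and '6' rather than '9.0' and '6.0', whole-number floats can be rendered as integers. This is a sketch of one way to do that; `normalize_level` is a hypothetical helper name, not part of pandas.

```python
My_Factor_Levels = [9.0, 0, 6.0, '9', '6', 9, 6, 'DE', '3U', '9.0', '6Z',
                    '6.0', '9.', '6.', '3B', '1U', '2Z', '68', '6B']

def normalize_level(x):
    """Return a string label, writing whole numbers without a decimal part."""
    try:
        value = float(x)
    except (TypeError, ValueError):
        # Not numeric (e.g. 'DE', '3U'): keep the label as a string
        return str(x)
    if value.is_integer():
        return str(int(value))
    return str(value)

coerced = {normalize_level(x) for x in My_Factor_Levels}
print(sorted(coerced))
# ['0', '1U', '2Z', '3B', '3U', '6', '68', '6B', '6Z', '9', 'DE']
```

On a DataFrame column the same function could be applied with `Series.map`, assuming the column holds the mixed values shown above.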
I want to construct a matrix like:
      Col1 Col2 Col3 Coln
row1     1    2    4    2
row2     3    8    3    3
row3     8    7    7    3
rown     n    n    n    n
I have yet to find anything in the python documentation that states how a list of list is assembled, is it like:
a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],[n,n,n,n]]
Where each row is a list item or should it be that each column is a list item:
b = [[1,3,8,n],[2,8,7,n],[4,3,7,n],[2,3,3,n]]
I would think that this would be a common question but I can't seem to find a straight answer.
Based on the documentation I'm guessing that I can convert this to a numpy array by simply:
np.array(a)
Can anyone help?
You want the first version:
a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],[n,n,n,n]]
When accessing an element in a matrix, you typically use matrix[row][col], so with the above Python list format a[i] would give you row i, and a[i][j] would give you the jth element from the ith row.
To convert it to a numpy array, np.array(a) is the correct method.
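A short sketch of the row-per-inner-list convention, using only the numeric rows of the example so that the array keeps an integer dtype:

```python
import numpy as np

# Each inner list is one row of the matrix
a = [[1, 2, 4, 2],
     [3, 8, 3, 3],
     [8, 7, 7, 3]]

print(a[1])     # [3, 8, 3, 3]  (row 1)
print(a[1][2])  # 3             (row 1, column 2)

arr = np.array(a)
print(arr.shape)  # (3, 4): 3 rows, 4 columns
print(arr[1, 2])  # 3
print(arr[:, 0])  # [1 3 8]     (column 0, which plain lists cannot slice directly)
```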
This:
a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],[n,n,n,n]]
will create the list you want, and yes, np.array(a) will convert it to a numpy array.
Also, this is the 'Pythonic' way of creating a list of lists with m rows and n columns (and setting all the elements to 0):
a = [[0 for i in range(n)] for j in range(m)]
Since you mention "matrix" let me also add that you have the np.matrix() option as well.
For example: You can use
A = [[1,2,3],[4,5,6],[7,8,9]]
to create a list (of lists), with each inner list representing a row.
Then
AA = np.array(A)
will create a 2D array with the appearance of a matrix, but not all the properties of a matrix.
Whereas
AM = np.matrix(A)
will create a matrix.
If you perform arithmetic operations on these two then you'll see the difference. For example
AA**2
will square each element in the 2D array. However
AM**2
will perform matrix multiplication of AM by itself.
BTW. The above usage assumes "import numpy as np" of course.
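Note that the NumPy documentation no longer recommends np.matrix, even for linear algebra. With a plain 2D array the matrix product is available through the @ operator or np.linalg.matrix_power, as this sketch shows:

```python
import numpy as np

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
AA = np.array(A)

# ** on an array squares each element
elementwise = AA ** 2
print(elementwise[0])    # [1 4 9]

# @ performs matrix multiplication, like AM**2 does for np.matrix
matrix_square = AA @ AA
print(matrix_square[0])  # [30 36 42]

# matrix_power gives the same result and generalises to higher powers
same = np.linalg.matrix_power(AA, 2)
print((same == matrix_square).all())  # True
```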
Use the first convention. If you need the transpose:
>>> a = [[1,2,4,2],[3,8,3,3],[8,7,7,3],['n','n','n','n']]
>>> trans=[]
>>> for i in range(len(a[0])):
...     trans.append([row[i] for row in a])
...
>>> trans
[[1, 3, 8, 'n'], [2, 8, 7, 'n'], [4, 3, 7, 'n'], [2, 3, 3, 'n']]
An element is then a[row][col] vs trans[col][row] (with respect to a of your example)
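For a plain list of lists, the transpose can also be written with zip. This sketch uses a non-square example to show that it does not depend on the number of rows matching the number of columns:

```python
a = [[1, 2, 4, 2],
     [3, 8, 3, 3],
     [8, 7, 7, 3]]

# zip(*a) pairs up the i-th items of every row, i.e. it yields the columns
trans = [list(col) for col in zip(*a)]

print(trans)  # [[1, 3, 8], [2, 8, 7], [4, 3, 7], [2, 3, 3]]

# a[row][col] and trans[col][row] address the same element
print(a[1][2] == trans[2][1])  # True
```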
Python itself uses the first convention, and it is easy to see why when the list is laid out like this:
a = [[1,2,4,2],
     [3,8,3,3],
     [8,7,7,3],
     ['n','n','n','n']]
Certainly when you use numpy, use the first convention since that is used by numpy:
>>> np.array(a)
array([['1', '2', '4', '2'],
['3', '8', '3', '3'],
['8', '7', '7', '3'],
['n', 'n', 'n', 'n']],
dtype='|S1')
>>> np.array(trans)
array([['1', '3', '8', 'n'],
['2', '8', '7', 'n'],
['4', '3', '7', 'n'],
['2', '3', '3', 'n']],
dtype='|S1')
Note: numpy converts the ints to strings because of the 'n' in the final row/col.
When you actually start to print that table, here is a way:
def pprint_table(table):
    def format_field(field, fmt='{:,.0f}'):
        if type(field) is str: return field
        if type(field) is tuple: return field[1].format(field[0])
        return fmt.format(field)
    def get_max_col_w(table, index):
        return max([len(format_field(row[index])) for row in table])
    col_paddings=[get_max_col_w(table, i) for i in range(len(table[0]))]
    for i,row in enumerate(table):
        # left col
        row_tab=[row[0].ljust(col_paddings[0])]
        # rest of the cols
        row_tab+=[format_field(row[j]).rjust(col_paddings[j]) for j in range(1,len(row))]
        print(' '.join(row_tab))
pprint_table([
['','Col 1', 'Col 2', 'Col 3', 'Col 4'],
['row 1', '1','2','4','2'],
['row 2','3','8','3','3'],
['row 3','8','7','7','3'],
['row 4', 'n','n','n','n']])
Prints:
      Col 1 Col 2 Col 3 Col 4
row 1     1     2     4     2
row 2     3     8     3     3
row 3     8     7     7     3
row 4     n     n     n     n