I am trying to append certain strings to a matrix: one that reads "Target" in row 0, column 0, and one that reads "Dog" in column 0 from row 1 down to the last row of the matrix.
My initial matrix looks like:
[screenshot of the initial matrix]
I have a small issue with the following program:
import numpy as np
import pandas as pd
main=pd.read_csv('C:/Users/Jonas/Desktop/testfile/biggertest.csv', header=None)
target_col = ['dog'] * main.shape[0]
main.insert(loc = 0, column = 'target', value = target_col)
This creates a new matrix that looks like this:
[screenshot of the resulting matrix]
Instead of:
[screenshot of the desired matrix]
What do I need to change to make this happen?
Cheers.
You could simply make the following modification.
import pandas as pd

main = pd.read_csv('C:/Users/Jonas/Desktop/testfile/biggertest.csv', header=None)
target_col = ['dog'] * main.shape[0]
target_col[0] = 'target'   # first entry becomes the 'target' label
main.insert(loc=0, column=-1, value=target_col)
Alternatively,
import pandas as pd

main = pd.read_csv('C:/Users/Jonas/Desktop/testfile/biggertest.csv', header=None)
main.insert(loc=0, column=-1, value='dog')  # broadcast 'dog' down the new column
main.at[0, -1] = 'target'
If you want the column indices to go from 0 to 4 (instead of -1 to 3), then you can add the following command:
main.rename(columns=lambda x: x + 1, inplace=True)
Resulting output from all commands:
0 1 2 3 4
0 target 5 5 8 9
1 dog 9 0 2 6
2 dog 6 6 4 3
3 dog 3 3 3 3
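For reference, the same steps can be reproduced without the CSV file; the DataFrame below is a stand-in for biggertest.csv, with values taken from the output above:

```python
import pandas as pd

# Stand-in for the CSV contents (values copied from the output above).
main = pd.DataFrame([[5, 5, 8, 9], [9, 0, 2, 6], [6, 6, 4, 3], [3, 3, 3, 3]])

target_col = ['dog'] * main.shape[0]
target_col[0] = 'target'
main.insert(loc=0, column=-1, value=target_col)      # new column gets label -1
main.rename(columns=lambda x: x + 1, inplace=True)   # shift labels to 0..4
print(main)
```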
I have a list of arrays:
[array([10,20,30]), array([5,6,7])]
How do I turn it into a pandas DataFrame? pd.DataFrame() puts the arrays in one column. The desired result is:
0 1 2
10 20 30
5 6 7
(0, 1, 2 here are the column names)
import pandas as pd
import numpy as np
a = [np.array([10,20,30]), np.array([5,6,7])]
print(pd.DataFrame(a))
Make sure you put the np before the array.
import pandas as pd
import numpy as np
arrs = [np.array([10, 20, 30]), np.array([5, 6, 7])]  # avoid shadowing the built-in `list`
df = pd.DataFrame(arrs)
print(df)
output:
0 1 2
0 10 20 30
1 5 6 7
If you still get an error: is the list of arrays the result of previous data manipulation, or did you type out the values / array lists manually?
I have a DataFrame with four columns. For each row, I want to find the minimum of the first two columns and the minimum of the last two columns.
Code:
import numpy as np
import pandas as pd

np.random.seed(0)
xdf = pd.DataFrame({'a': np.random.rand(1, 10)[0]*10,
                    'b': np.random.rand(1, 10)[0]*10,
                    'c': np.random.rand(1, 10)[0]*10,
                    'd': np.random.rand(1, 10)[0]*10},
                   index=np.arange(0, 10, 1))
xdf['ab_min'] = xdf[['a', 'b']].min(axis=1)
xdf['cd_min'] = xdf[['c', 'd']].min(axis=1)
xdf['minimum'] = xdf['ab_min'].list() + xdf['cd_min'].list()  # raises AttributeError
Expected answer:
xdf['minimum']
0 [ab_min,cd_min]
1 [ab_min,cd_min]
2 [ab_min,cd_min]
3 [ab_min,cd_min]
Present answer:
AttributeError: 'Series' object has no attribute 'list'
Select the columns ab_min and cd_min, then use to_numpy to convert them to a NumPy array, and assign the result to the minimum column:
xdf['minimum'] = xdf[['ab_min', 'cd_min']].to_numpy().tolist()
>>> xdf['minimum']
0 [3.23307959607905, 1.9836323494587338]
1 [6.189440334168731, 1.0578078219990983]
2 [3.1194570407645217, 1.2816570607783184]
3 [1.9170068676155894, 7.158027504597937]
4 [0.6244579166416464, 8.568849995324166]
5 [4.108986697339397, 0.6201685780268684]
6 [4.170639127277155, 2.3385281968695693]
7 [2.0831140755567814, 5.94063873401418]
8 [0.4887113296319978, 6.380570614449363]
9 [2.844815261473105, 0.9146457613970793]
Name: minimum, dtype: object
Try this:
import pandas as pd
import numpy as np

np.random.seed(0)  # match the question's seed
xdf = pd.DataFrame({'a': np.random.rand(1, 10)[0]*10,
                    'b': np.random.rand(1, 10)[0]*10,
                    'c': np.random.rand(1, 10)[0]*10,
                    'd': np.random.rand(1, 10)[0]*10},
                   index=np.arange(0, 10, 1))
print(xdf)
ab = xdf['ab_min'] = xdf[['a', 'b']].min(axis=1)
cd = xdf['cd_min'] = xdf[['c', 'd']].min(axis=1)
blah = pd.concat([ab, cd], axis=1)
print(blah)
You can use .apply with a lambda function along axis=1:
xdf['minimum'] = xdf.apply(lambda x: [x[['a','b']].min(),x[['c','d']].min()], axis=1)
Result:
>>> xdf
a b c d minimum
0 0.662634 4.166338 8.864823 9.004818 [0.6626341544146663, 8.864822751494284]
1 6.854054 6.163417 6.510728 0.049498 [6.163416966676091, 0.04949754019059838]
2 6.389760 4.462319 2.435369 3.732534 [4.462318678134215, 2.4353686460846893]
3 4.628735 7.571098 1.900726 9.046384 [4.628735362058981, 1.9007255361271058]
4 3.203285 4.364302 2.473973 2.911911 [3.203285015796596, 2.4739732602476727]
5 5.357440 3.166420 9.908758 0.910704 [3.166420385020304, 0.91070444348338]
6 8.120486 6.395869 0.970977 5.278279 [6.395868901095546, 0.9709769503958143]
7 1.574765 7.184971 3.835641 4.495135 [1.574765093192545, 3.835640598199231]
8 8.688497 0.069061 0.771772 8.971878 [0.06906065557899743, 0.7717717844423222]
9 5.455920 2.630342 1.966357 7.374366 [2.6303421168291843, 1.966357159086991]
In Excel, calculating a geometric mean of size 2 on Col1 would put a 6 in row 1 of Geo_2, since the geomean of 4 and 9 is 6. In pandas/NumPy it appears to be the reverse: with min_periods = 1, the first row reflects a calculation on just one value, and each subsequent row uses the previous and current rows of Col1 to calculate the geomean.
I want the calculation window to cover the current and the next row of Col1, so that the first value of Geo_2 is 6 and the last value is 2.
import numpy as np
import pandas as pd
from scipy.stats.mstats import gmean

DASeries = [4, 9, 3, 3, 5, 7, 8, 4, 2]
DA_df = pd.DataFrame(DASeries)
geoMA2 = [2, 3]
geo_df = pd.DataFrame([pd.Series(DASeries).rolling(window=elem, min_periods=1)
                                          .apply(gmean, raw=True)
                       for elem in geoMA2]).T
Final = pd.concat([DA_df, geo_df], axis=1)
Final.columns = ['Col1', 'Geo_2', 'Geo_3']
Final
Using .iloc[::-1] to reverse the Series, roll, then reverse back:
pd.Series(DASeries).iloc[::-1].rolling(window=2, min_periods=1).apply(gmean, raw=True).iloc[::-1]
0 6.000000
1 5.196152
2 3.000000
3 3.872983
4 5.916080
5 7.483315
6 5.656854
7 2.828427
8 2.000000
dtype: float64
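A sketch extending that reversal trick to both window sizes from the question (names reused from the asker's code; the helper function forward_gmean is my own):

```python
import pandas as pd
from scipy.stats.mstats import gmean

DASeries = [4, 9, 3, 3, 5, 7, 8, 4, 2]
s = pd.Series(DASeries)

def forward_gmean(series, window):
    # Reverse, roll, reverse back: each window now covers the
    # current row and the rows after it.
    return (series.iloc[::-1]
                  .rolling(window=window, min_periods=1)
                  .apply(gmean, raw=True)
                  .iloc[::-1])

Final = pd.DataFrame({'Col1': s,
                      'Geo_2': forward_gmean(s, 2),
                      'Geo_3': forward_gmean(s, 3)})
print(Final)
```

The first value of Geo_2 is now 6 (geomean of 4 and 9) and the last is 2, matching the Excel behaviour described above.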
I have the DataFrame below. I want to subtract each even row's value from the following odd row's value and make a new DataFrame from the differences. How can I do it?
import pandas as pd
import numpy as np
raw_data = {'Time': [281.54385, 436.55295, 441.74910, 528.36445,
974.48405, 980.67895, 986.65435, 1026.02485]}
data = pd.DataFrame(raw_data)
data
DataFrame:
Time
0 281.54385
1 436.55295
2 441.74910
3 528.36445
4 974.48405
5 980.67895
6 986.65435
7 1026.02485
Wanted result
ON_TIME
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use NumPy indexing:
res = pd.DataFrame(data.values[1::2] - data.values[::2], columns=['Time'])
print(res)
Time
0 155.00910
1 86.61535
2 6.19490
3 39.37050
You can use shift for the subtraction, and then pick every 2nd element, starting with the 2nd element (index 1):
(data.Time - data.Time.shift())[1::2].rename('On Time').reset_index(drop=True)
outputs:
0 155.00910
1 86.61535
2 6.19490
3 39.37050
Name: On Time, dtype: float64
Let's say I have a DataFrame like this:
df
A B
5 0 1
18 2 3
125 4 5
where 5, 18, and 125 are the index labels.
I'd like to get the line before (or after) a certain index. For instance, I have index 18 (e.g. from df[df.A == 2].index), and I want to get the line before it without knowing that this line has 5 as its index.
2 sub-questions:
How can I get the position of index 18? Something like a hypothetical df.loc[18].get_position(), which would return 1, so I could reach the line before with df.iloc[df.loc[18].get_position() - 1].
Is there another solution, a bit like grep's -A, -B, or -C options?
For your first question:
base = df.index.get_indexer_for(df[df.A == 2].index)
or alternatively
base = df.index.get_loc(18)
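Combining get_loc with .iloc answers the first sub-question directly; a minimal sketch on the example frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5]}, index=[5, 18, 125])

pos = df.index.get_loc(18)   # integer position of label 18
prev_row = df.iloc[pos - 1]  # the line before, without knowing its label
print(prev_row)
```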
To get the surrounding ones:
mask = pd.Index(base).union(pd.Index(base - 1)).union(pd.Index(base + 1))
I used Indexes and unions to remove duplicates. If you'd rather keep the duplicates, you can use np.concatenate instead.
Be careful with matches on the very first or last rows :)
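A minimal sketch of one way to guard those edges: collect the neighbouring positions, then drop anything outside the valid range before indexing (np.unique also removes duplicates here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5]}, index=[5, 18, 125])

base = df.index.get_indexer_for(df[df.A == 2].index)
around = np.concatenate([base - 1, base, base + 1])
# Keep only positions inside [0, len(df)) before using iloc.
around = np.unique(around[(around >= 0) & (around < len(df))])
print(df.iloc[around])
```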
If you need to convert more than one index, you can use np.where.
Example:
# df
A B
5 0 1
18 2 3
125 4 5
import pandas as pd
import numpy as np
df = pd.DataFrame({"A": [0,2,4], "B": [1,3,5]}, index=[5,18,125])
np.where(df.index.isin([18,125]))
Output:
(array([1, 2]),)
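Those positions are plain integers, so positional offsets work on them directly; for example, the row before each match (assuming no match sits in the very first row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5]}, index=[5, 18, 125])

pos = np.where(df.index.isin([18, 125]))[0]
print(df.iloc[pos - 1])  # the row before each matching index
```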