I have created an array whose shape attribute is (6, 20), like this:
import numpy as np
data = np.random.logistic(10, 1, 120)
data = data.reshape(6, 20)
Then I instantiate a pandas.DataFrame from the array:
import pandas as pd
data = pd.DataFrame(data)
This is now a DataFrame created from values drawn with NumPy's logistic distribution function, and it prints like this:
0 1 2 3 4 5
0 9.602117 9.507674 9.848685 9.215080 11.061676 9.627753
1 11.702407 9.804924 7.375905 10.784320 8.485818 10.938005
2 9.628927 9.713187 10.027626 10.653311 11.301493 8.756792
3 11.229905 12.013172 10.023200 9.211614 7.139757 9.687851
6 7 8 9 10 11 12
0 9.356069 11.483162 8.993130 8.015089 9.808234 9.435853 9.773375
1 13.422060 10.027434 9.694008 9.677682 10.806266 12.393364 9.479257
2 10.821846 10.690378 8.321566 9.595122 11.753948 10.021815 10.412572
3 8.499120 7.352394 9.288662 9.178306 10.073842 9.246110 9.075350
13 14 15 16 17 18 19
0 9.809366 8.502451 11.624395 12.824338 9.729167 8.945258 10.464157
1 6.698941 9.416421 11.477242 9.622115 6.374589 9.459355 10.435674
2 11.068721 9.775433 9.447799 8.972052 10.692942 10.978305 10.047067
3 10.381596 10.968330 11.892766 12.241880 9.980124 7.321942 9.241030
When I try to set columns=list("abcdef"), I get this error:
ValueError: Shape of passed values is (6, 20), indices imply (6, 6)
My expected output is similar to what is shown directly from the numpy array: it should contain each column as a pandas.Series of lists (or a list of lists).
a.
0 [ 6.98467276 9.16242742 6.99065177 11.50834399 9.29697138 7.93926441
9.05857668 7.13652948 11.01724792 13.31658877 8.63137079 9.5564405
7.37161153 11.19414704 9.45957466 9.19826796 10.13506672 9.74830158
9.97456348 8.35217153]
b.
[10.48249082 11.94030324 12.59080011 10.55695088 12.43071037 11.49568774
10.03540181 11.08708832 10.24655111 8.17904856 11.04791142 7.30069964
8.34783674 9.93743588 8.1537666 9.92773204 10.3416315 9.51624921
9.60124236 11.37511301]
c.
[ 8.21851024 12.71641524 9.7748047 9.51267978 7.92793378 12.1646706
9.67236267 10.22201002 9.67197374 9.70551429 7.79209516 9.20295594
9.26231527 8.04560836 11.0409066 8.63660332 9.18397671 8.17510874
9.61619671 8.42704322]
d.
[14.54825819 16.97573893 7.70643136 12.06334323 14.64054726 9.54619595
10.30686621 12.20487566 10.78492189 12.01011666 10.12405213 8.57057999
10.41665479 7.85921253 10.15572125 9.20554292 10.03832545 9.43720211
11.06605713 9.60298514]
I have found this thread that looks like my problem, but it has not helped me much; I would also use the data in a different way.
Could I assign the lengths of the columns, or maybe set the dimensions of this pandas.DataFrame?
Your data has 6 rows and 20 columns. If you want to pass each "row" of the numpy array as a "column" to the DataFrame, you can simply transpose:
df = pd.DataFrame(data=np.random.logistic(10, 1, 120).reshape(6, 20).transpose(),
                  columns=list("abcdef"))
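As a quick sanity check (same calls as above), the transposed frame should have 20 rows and the six labeled columns:

print(df.shape)    # (20, 6)
print(df.columns)  # Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')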
Edit:
To get the data in a single row, try:
data = np.random.logistic(10, 1, 120).reshape(6, 20)
df = pd.DataFrame(columns=list("abcdef"), index=[0])
df.iloc[0] = [list(row) for row in data]
Each cell then holds one length-20 row of the array, so every column is a Series whose single element is a list.
I'm trying to manipulate a list (type: string) to use that list to drop some columns from a dataframe.
The list is from a dataframe that I created a condition to return columns whose sums of all values are zero:
Selecting the columns with sum = 0
condicao_CO8 = (ex8_centro_oeste.sum(axis = 0) == 0)
condicao_CO8 = condicao_CO8[condicao_CO8 == True]
condicao_CO8.to_csv('D:\Programas\CO_8.csv')
Importing the dataframe and turning it into a list:
CO8 = pd.read_csv(r'D:\Programas\CO_8.csv',
delimiter=','
)
CO8.rename(columns={'Unnamed: 0': 'Nulos'}, inplace = True)
CO8.rename(columns={'0': 'Info'}, inplace = True)
CO8.drop(columns = ['Info'], inplace = True)
CO8.columns
Some items from the list:
0 ABI - LETRAS
1 ADMINISTRAÇÃO PÚBLICA
2 AGRONOMIA
3 ALIMENTOS
4 ARQUIVOLOGIA
5 AUTOMAÇÃO INDUSTRIAL
6 BIOMEDICINA
7 BIOTECNOLOGIA
8 CIÊNCIAS - BIOLOGIA
9 CIÊNCIAS - QUÍMICA
10 CIÊNCIAS AGRÁRIAS
11 CIÊNCIAS BIOLÓGICAS
12 CIÊNCIAS BIOLÓGICAS E CONSERVAÇÃO
13 CIÊNCIAS DA COMPUTAÇÃO
14 CIÊNCIAS ECONÔMICAS
My goal is to transform the list so that I can use it to drop those columns, turning it into this:
"ABI - LETRAS", "ADMINISTRAÇÃO PÚBLICA", "AGRONOMIA", "ALIMENTOS", "ARQUIVOLOGIA", "AUTOMAÇÃO INDUSTRIAL"...
For this I made the following code (unsuccessfully):
list_CO8 = ('\",\" '.join(CO8['Nulos'].apply(str.upper).tolist()))
Please, can anyone help me?
I'm trying to get:
list = "ABI - LETRAS", "ADMINISTRAÇÃO PÚBLICA", "AGRONOMIA", "ALIMENTOS", "ARQUIVOLOGIA", "AUTOMAÇÃO INDUSTRIAL"...
To make:
ex8_centro_oeste.drop(columns=[list])
Dataset link: Link for Drive (8kb)
Only now did I read your question properly, so I deleted my previous answer. You want to drop all the columns that sum to zero, if I am not mistaken.
import pandas as pd
import numpy as np

data = pd.read_csv("./centro-oeste.csv")
data.columns[np.sum(data) == 0]  # data is the dataframe
# It outputs a list of column labels that you can use.
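A short sketch of the drop itself (hypothetical, assuming data here is the same frame as ex8_centro_oeste in the question):

cols_to_drop = data.columns[np.sum(data) == 0]
data = data.drop(columns=cols_to_drop)  # no string joining needed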
I have a matrix of dimensions 183,223,040x4 with the variables shown below. There are 140 different values in 'REG', and 1145 different values in both 'SAMAC' and 'SAMAC.1'.
I want to iterate over REG to get 140 matrices of size 1145x1145, each filled with the right 'VALUE'.
I have tried the following:
- loop over countries
- create an empty 1145x1145 matrix, indexed with SAMAC and with column names SAMAC.1
- go line by line through the current dataframe
- check the value of SAMAC (rows) and SAMAC.1 (columns)
- locate SAMAC and SAMAC.1 in the empty matrix and assign the corresponding VALUE
import pandas as pd
import dask.dataframe as dd
all_sam=dd.read_csv(r'C:\GP2\all_sams_trial.csv',skiprows=1)
all_sam.head()
SAMAC SAMAC.1 REG Value
0 m_pdr m_pdr aus 0.0
1 m_wht m_pdr aus 0.0
2 m_gro m_pdr aus 0.0
3 m_v_f m_pdr aus 0.0
4 m_osd m_pdr aus 0.0
countries=list(all_sam["REG"].unique().compute())
col_names=list(all_sam["SAMAC"].unique().compute())
for country in countries:
    df = pd.DataFrame(0, index=col_names, columns=col_names)
    sam = all_sam[all_sam["REG"] == country].compute()
    for index, row in sam.iterrows():
        row_index = str(row["SAMAC"])
        col_index = str(row["SAMAC.1"])
        df.loc[row_index, col_index] = row['Value']
        print(index)
    df.to_csv(country + "_SAM.csv")
The problem is that it takes way too long to compute (around 2 days). Is there a way to speed this up?
Update 1: after understanding that the OP's computation is slow because of the large size of the dataframe, here's the update.
Check the dtypes of the columns using all_sam.dtypes and the size (in MB) of your dataframe using:
all_sam.memory_usage(deep=True) / 1024 ** 2
Consider renaming the column 'SAMAC.1' to 'SAMAC_1', as the dot in the name can cause errors in the lines that follow (a rename sketch is shown after the conversions below). Before processing, change the dtypes of 'REG', 'SAMAC' and 'SAMAC_1' to 'category':
all_sam.REG = all_sam.REG.astype('category')
all_sam.SAMAC = all_sam.SAMAC.astype('category')
all_sam.SAMAC_1 = all_sam.SAMAC_1.astype('category')
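A minimal sketch of the rename mentioned above (assuming the dask dataframe from the question; run it before the astype conversions):

all_sam = all_sam.rename(columns={'SAMAC.1': 'SAMAC_1'})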
Depending on your requirement, you can downcast the dtype of the 'Value' column to float16, int16, int8, etc. using the below code:
all_sam.Value = all_sam.Value.astype('float16')
Check the size again.
all_sam.memory_usage(deep=True) / 1024 ** 2
Hopefully, this will enable faster computation.
Ref: towardsdatascience.com
I have taken a small example dataframe to put up a solution to your problem.
import pandas as pd
import numpy as np
df = pd.DataFrame( {'REG':['A','A','A','A','A','A','B','B','B','B','B','B'], 'SAMAC1':['a','a','a','b','b','b','c','c','c','d','d','d'], 'SAMAC':['p','q','r','p','q','r','p','q','r','p','q','r'], 'value':[0,0,0,0,0,0,0,0,0,0,0,0]})
array_ = df[['REG','SAMAC1','SAMAC']].values.transpose()
index = pd.MultiIndex.from_arrays(array_, names=('REG', 'SAMAC1','SAMAC'))
df2 = df['value']
df2.index=index
country_labels = df2.index.get_level_values(0)
country_unique = country_labels.unique()
result_arr = []
for c in country_unique:
    df3 = df2[df2.index.get_level_values(0) == c]
    result_arr.append(df3.unstack().values)
result_arr = np.array(result_arr)
print(result_arr.shape)
Output: (2,2,3)
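If labeled frames are handier than the bare array, a small variation on the same objects collects each unstacked frame in a dict keyed by country and writes one CSV per REG, as in the original loop:

result = {c: df2[df2.index.get_level_values(0) == c].unstack() for c in country_unique}
for c, frame in result.items():
    frame.to_csv(str(c) + '_SAM.csv')  # e.g. 'A_SAM.csv'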
Can you store data in a pandas HDFStore and then open it / perform I/O using PyTables? The reason this question comes up is because I am currently storing data as:

store = pd.HDFStore('Filename', mode='a')
store.append('key', data)

However, as I understand it, pandas doesn't support updating records very well. I have a use case where I have to update 5% of the data daily. Would pd.io.pytables work? If so, I found no documentation on this. PyTables has a lot of documentation, but I am not sure whether I can open / update the file using PyTables when I didn't use PyTables to save the file initially.
Here is a demonstration of @flyingmeatball's answer:
Let's generate a test DF:
In [56]: df = pd.DataFrame(np.random.rand(15, 3), columns=list('abc'))
In [57]: df
Out[57]:
a b c
0 0.022079 0.901965 0.282529
1 0.596452 0.096204 0.197186
2 0.034127 0.992500 0.523114
3 0.659184 0.447355 0.246932
4 0.441517 0.853434 0.119602
5 0.779707 0.429574 0.744452
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
8 0.690729 0.052097 0.146705
9 0.828667 0.439608 0.091007
10 0.988435 0.326589 0.536904
11 0.687250 0.661912 0.318209
12 0.829129 0.758737 0.519068
13 0.500462 0.723528 0.026962
14 0.464162 0.364536 0.843899
and save it to the HDFStore (NOTE: don't forget to use data_columns=True (or data_columns=[list_of_columns_to_index]) in order to index all the columns that we want to use in the where clause):
In [58]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
In [59]: store.append('test', df, format='t', data_columns=True)
In [60]: store.close()
Solution:
In [61]: store = pd.HDFStore(r'd:/temp/test_removal.h5')
The .remove() method should return # of removed rows:
In [62]: store.remove('test', where="a > 0.5")
Out[62]: 9
Let's append the changed rows (multiplied by 100):
In [63]: store.append('test', df.loc[df.a > 0.5] * 100, format='t', data_columns=True)
Test:
In [64]: store.select('test')
Out[64]:
a b c
0 0.022079 0.901965 0.282529
2 0.034127 0.992500 0.523114
4 0.441517 0.853434 0.119602
6 0.105255 0.934440 0.545421
7 0.216278 0.217386 0.282171
14 0.464162 0.364536 0.843899
1 59.645151 9.620415 19.718557
3 65.918421 44.735482 24.693160
5 77.970749 42.957446 74.445185
8 69.072948 5.209725 14.670545
9 82.866731 43.960848 9.100682
10 98.843540 32.658931 53.690360
11 68.725002 66.191215 31.820942
12 82.912937 75.873689 51.906795
13 50.046189 72.352794 2.696243
finalize:
In [65]: store.close()
Here are the docs I think you're after:
http://pandas.pydata.org/pandas-docs/version/0.19.0/api.html?highlight=pytables
See this thread as well:
Update pandas DataFrame in stored in a Pytable with another pandas DataFrame
Looks like you can load the 5% of records into memory, remove them from the store, and then append the updated ones back, replacing them in the table:

store.remove(key, where = ...)
store.append(.....)
You can also do this outside of pandas; see the tutorial here on removal:
http://www.pytables.org/usersguide/tutorials.html
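For instance, a rough PyTables-only sketch of the removal step (an assumption on my side: the file was written by pandas with format='t' under the key 'test', so the table lives at the node '/test/table', and 'a' was made a data column so it is queryable):

import tables

with tables.open_file('test_removal.h5', mode='a') as h5:
    tbl = h5.get_node('/test/table')
    coords = tbl.get_where_list('a > 0.5')           # row numbers matching the condition
    for n in sorted(coords.tolist(), reverse=True):  # delete from the end so numbers stay valid
        tbl.remove_rows(n, n + 1)

Row-by-row deletion is slow for large tables, so for anything sizeable the pandas store.remove(...) shown above is the simpler route.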
I have a few functions that make new columns in a pandas dataframe, as a function of existing columns in the dataframe. There are two different scenarios: (1) the dataframe is NOT MultiIndex and has a set of columns, say [a,b], and (2) the dataframe is MultiIndex and has the same set of column headers repeated N times, say [(a,1),(b,1),(a,2),(b,2),...,(a,N),(b,N)].
I've been making the aforementioned functions in the style shown below:
def f(df):
    if multiindex(df):
        for s in df[a].columns:
            df[c, s] = someFunction(df[a, s], df[b, s])
    else:
        df[c] = someFunction(df[a], df[b])
Is there another way to do this, without having these if-multi-index/else statements everywhere and duplicating the someFunction code? I'd prefer NOT to split the multi-indexed frame into N smaller dataframes (I often need to filter data or do things while keeping the rows consistent across all the 1, 2, ..., N frames, and keeping them together in one frame seems to be the best way to do that).
You may still have to test whether columns is a MultiIndex, but this should be cleaner and more efficient. Caveat: this will not work if your function uses summary statistics on the column; for example, if someFunction divides by the average of column 'a' (see the illustration after this paragraph).
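To make the caveat concrete, here is a hypothetical someFunction of that kind (my own example, not part of the solution below). After stack(), a.mean() is computed over all N sub-columns pooled together instead of per sub-column, so the result changes:

def someFunction(a, b):
    return a / a.mean() + b  # a.mean() pools all sub-columns after stacking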
Solution
def someFunction(a, b):
    return a + b

def f(df):
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        df = df.stack()
    df['c'] = someFunction(df['a'], df['b'])
    if ismi:
        df = df.unstack()
    return df
Setup
import pandas as pd
import numpy as np
setup_tuples = []
for c in ['a', 'b']:
    for i in ['one', 'two', 'three']:
        setup_tuples.append((c, i))
columns = pd.MultiIndex.from_tuples(setup_tuples)
rand_array = np.random.rand(10, len(setup_tuples))
df = pd.DataFrame(rand_array, columns=columns)
df looks like this
a b
one two three one two three
0 0.282834 0.490313 0.201300 0.140157 0.467710 0.352555
1 0.838527 0.707131 0.763369 0.265170 0.452397 0.968125
2 0.822786 0.785226 0.434637 0.146397 0.056220 0.003197
3 0.314795 0.414096 0.230474 0.595133 0.060608 0.900934
4 0.334733 0.118689 0.054299 0.237786 0.658538 0.057256
5 0.993753 0.552942 0.665615 0.336948 0.788817 0.320329
6 0.310809 0.199921 0.158675 0.059406 0.801491 0.134779
7 0.971043 0.183953 0.723950 0.909778 0.103679 0.695661
8 0.755384 0.728327 0.029720 0.408389 0.808295 0.677195
9 0.276158 0.978232 0.623972 0.897015 0.253178 0.093772
I constructed df to have MultiIndex columns. What I'd do is use the .stack() method to push the second level of the column index to be the second level of the row index.
df.stack() looks like this
a b
0 one 0.282834 0.140157
three 0.201300 0.352555
two 0.490313 0.467710
1 one 0.838527 0.265170
three 0.763369 0.968125
two 0.707131 0.452397
2 one 0.822786 0.146397
three 0.434637 0.003197
two 0.785226 0.056220
3 one 0.314795 0.595133
three 0.230474 0.900934
two 0.414096 0.060608
4 one 0.334733 0.237786
three 0.054299 0.057256
two 0.118689 0.658538
5 one 0.993753 0.336948
three 0.665615 0.320329
two 0.552942 0.788817
6 one 0.310809 0.059406
three 0.158675 0.134779
two 0.199921 0.801491
7 one 0.971043 0.909778
three 0.723950 0.695661
two 0.183953 0.103679
8 one 0.755384 0.408389
three 0.029720 0.677195
two 0.728327 0.808295
9 one 0.276158 0.897015
three 0.623972 0.093772
two 0.978232 0.253178
Now you can operate on df.stack() as if the columns were not a MultiIndex
Demonstration
print(f(df))
will give you what you want
a b c \
one three two one three two one
0 0.282834 0.201300 0.490313 0.140157 0.352555 0.467710 0.422991
1 0.838527 0.763369 0.707131 0.265170 0.968125 0.452397 1.103697
2 0.822786 0.434637 0.785226 0.146397 0.003197 0.056220 0.969183
3 0.314795 0.230474 0.414096 0.595133 0.900934 0.060608 0.909928
4 0.334733 0.054299 0.118689 0.237786 0.057256 0.658538 0.572519
5 0.993753 0.665615 0.552942 0.336948 0.320329 0.788817 1.330701
6 0.310809 0.158675 0.199921 0.059406 0.134779 0.801491 0.370215
7 0.971043 0.723950 0.183953 0.909778 0.695661 0.103679 1.880821
8 0.755384 0.029720 0.728327 0.408389 0.677195 0.808295 1.163773
9 0.276158 0.623972 0.978232 0.897015 0.093772 0.253178 1.173173
three two
0 0.553855 0.958023
1 1.731494 1.159528
2 0.437834 0.841446
3 1.131408 0.474704
4 0.111555 0.777227
5 0.985944 1.341759
6 0.293454 1.001412
7 1.419611 0.287632
8 0.706915 1.536622
9 0.717744 1.231410
I used numpy.loadtxt to load a file that has this structure:
99 0 1 2 3 ... n
46 0.137673 0.147241 0.130374 0.155461 ... 0.192291
32 0.242157 0.186015 0.153261 0.152680 ... 0.154239
77 0.163889 0.176748 0.184754 0.126667 ... 0.191237
12 0.139989 0.417530 0.148208 0.188872 ... 0.141071
64 0.172326 0.172623 0.196263 0.152864 ... 0.168985
50 0.145201 0.156627 0.214384 0.123387 ... 0.187624
92 0.127143 0.133587 0.133994 0.198704 ... 0.161480
Now I need the first column (except the first line) to store the index of the highest value in its line.
At the end, I save this array to a file with the same number format as the original.
Thanks.
Can you use numpy.argmax? Something like this:
import numpy as np
# This is a simple example. In your case, A is loaded with np.loadtxt
A = np.array([[1, 2.0, 3.0], [3, 1.0, 2.0], [2.0, 4.0, 3.0]])
B = A.copy()
# Copy the max indices of rows of A into first column of B
B[:,0] = np.argmax(A[:,1:], 1)
# Save the results using np.savetxt with fmt, dynamically generating the
# format string based on the number of columns in B (setting the first
# column to integer and the rest to float)
np.savetxt('/path/to/output.txt', B, fmt='%d' + ' %f' * (B.shape[1]-1))
Note that np.savetxt allows for formatting.
This example code doesn't address the fact that you want to skip the first row, and you might need to offset the results of np.argmax depending on whether the index into the remaining columns should count the index column (0) or not; a fuller sketch that handles the header row follows.
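A minimal end-to-end sketch under those assumptions (hypothetical filenames; it treats the first line as a header to skip, and the stored indices are relative to the data columns only):

import numpy as np

A = np.loadtxt('data.txt', skiprows=1)   # skip the header line
A[:, 0] = np.argmax(A[:, 1:], axis=1)    # index of each row's max among the data columns
np.savetxt('out.txt', A, fmt='%d' + ' %f' * (A.shape[1] - 1))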
Your data looks like a DataFrame with columns and an index: the data types are not homogeneous. It is more convenient to handle this with pandas, which natively manages this layout:
import pandas as pd

# whitespace-separated file, first column as the index
a = pd.read_csv('data.txt', sep=r'\s+', index_col=0)
# replace the index with the position of each row's maximum
u = a.set_index(a.values.argmax(axis=1)).to_string()
with open('out.txt', 'w') as f:
    f.write(u)
then out.txt is
0 1 2 3 4
4 0.137673 0.147241 0.130374 0.155461 0.192291
0 0.242157 0.186015 0.153261 0.152680 0.154239
4 0.163889 0.176748 0.184754 0.126667 0.191237
1 0.139989 0.417530 0.148208 0.188872 0.141071
2 0.172326 0.172623 0.196263 0.152864 0.168985
2 0.145201 0.156627 0.214384 0.123387 0.187624
3 0.127143 0.133587 0.133994 0.198704 0.161480