Iterate through 200 datasets [duplicate] - python

This question already has answers here:
Creating multiple dataframes with a loop
(3 answers)
Closed 1 year ago.
I have 200 datasets and I want to iterate through them, pick random rows from each, and add them to another, initially empty, dataset using iloc and the values attribute. When I execute the code it does not give an error, but it also does not add anything to the empty dataset. However, when I run a single command to check whether the random row has any value, it gives this error:
AttributeError: 'str' object has no attribute 'iloc'.
my code is given below:
Tdata = np.zeros([20, 6])
k = 0
for j in range(200):
    for j1 in range(0, 20):
        Tdata[k:k+1,:] = (('dataset'+j)).iloc[random.randint(100)].values
        k += 1
('dataset'+j) is basically selecting different datasets. The names of my datasets are dataset0, dataset1, dataset2, ...; they are already defined.

There are multiple issues with your code.
1. Using str in place of the actual DataFrame variable
You are trying to use .iloc on a string, 'dataset1' for example. This won't work, since str has no attribute .iloc, which is exactly what the error tells you.
Since you want to work with DataFrame variable names, you may need to use eval() to interpret the string as a variable name. NOTE: BE EXTRA CAREFUL while using eval(). Please read up on the dangers of eval() before relying on it.
2. Sampling 20 rows from each DataFrame.
If you are trying to get 20 rows by using for j1 in range(0, 20): along with random.randint(100), there is a better way that avoids the inner loop: use np.random.randint(0, 100, (n,)) to get n random indexes at once, in this case np.random.randint(0, 100, (20,)).
Or, even better, simply use df.sample(20) to sample 20 rows from a given DataFrame.
3. Forcing updates over views of the DataFrame
It's better to use a different approach than forcing an update over a view of the DataFrame with Tdata[k:k+1,:] = .... Since you want to combine DataFrames, it's better to collect them in a list and pass them to pd.concat, which is much more useful.
Here is sample code with a simple setting which should help guide you to what you are looking for.
import pandas as pd
import numpy as np
dataset0 = pd.DataFrame(np.random.random((100,3)))
dataset1 = pd.DataFrame(np.random.random((100,3)))
dataset2 = pd.DataFrame(np.random.random((100,3)))
dataset3 = pd.DataFrame(np.random.random((100,3)))
##Using random.randint
##samples = [eval('dataset'+str(i)).iloc[np.random.randint(0,100,(3,))] for i in range(4)]
##Using df.sample()
samples = [eval('dataset'+str(i)).sample(3) for i in range(4)]
##Change -
##1. The 3 to 20 for 20 samples per dataframe
##2. range(4) to range(200) to work with 200 dataframes
output = pd.concat(samples)
print(output)
0 1 2
42 0.372626 0.445972 0.030467
20 0.376201 0.445504 0.835735
56 0.214806 0.083550 0.582863
85 0.691495 0.346022 0.619638
24 0.290397 0.202795 0.704082
16 0.112986 0.013269 0.903917
51 0.521951 0.115386 0.632143
73 0.946870 0.531085 0.437418
98 0.745897 0.718701 0.280326
56 0.679253 0.010143 0.124667
4 0.028559 0.769682 0.737377
84 0.857553 0.866464 0.827472
4. Storing 200 dataframes??
Last but not least, you should ask yourself why you are storing 200 DataFrames as individual variables, only to sample some rows from each.
Why not try to -
Read each of the files iteratively
Sample rows from each
Store them in a list of dataframes
pd.concat once you are done iterating over the 200 files
... instead of saving 200 DataFrames and then doing the same. A rough sketch of that approach follows.
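Here is a minimal sketch of that idea. It assumes the 200 datasets live in CSV files named dataset0.csv through dataset199.csv; the file names, the read_csv arguments and the sample size of 20 are assumptions to adapt to your actual setup.
import pandas as pd

samples = []
for i in range(200):
    # Read one file at a time instead of keeping 200 DataFrames in memory
    df = pd.read_csv(f'dataset{i}.csv')
    # Keep only 20 randomly sampled rows from this file
    samples.append(df.sample(20))

# One concatenation at the end replaces the row-by-row assignment
Tdata = pd.concat(samples, ignore_index=True)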

Related

Creating a Numpy Array from two Series

I had a DataFrame called "segments" that looks like the below:
   ORIGIN_AIRPORT_ID  DEST_AIRPORT_ID  FL_COUNT  ORIGIN_INDEX  DEST_INDEX  OUTDEGREE    WEIGHT
0              10135            10397        77           119         373          3  0.333333
1              10135            11433        85           119        1375          3  0.333333
Using this, I created two Boolean Series objects: One in which I'm storing all the IDs for which the WEIGHT column is not 0 and one in which they are:
Zeroes = (segments['WEIGHT'] == 0).groupby(segments['ORIGIN_INDEX']).all()
Non_zeroes = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).all()
I want to do two things (because I'm not sure which this task needs):
Create a NumPy vector where all "True" values in the Non_zeroes Series are set to the result of 1/4191 (≈0.024%) and all "True" values in the Zeroes Series are set to 0 (or the same logic using the True and False of one Series), keeping the IDs (e.g. ORIGIN_INDEX 119 → 0.024%, etc.)
And I'd also like to create a NumPy vector that is JUST a list of the percentages and zeroes WITHOUT the IDs
EDIT to add extra detail requested!
I tried using a condition as a variable, then using .loc to apply it:
cond_array = copied.WEIGHT is not 0
df.loc[cond_array, ID] = 1/4191
I tried using from_coo(), toarray(), and DataFrame to convert:
pd.Series.sparse.from_coo(P, dense_index=True)
P.toarray()
pd.DataFrame(P)
Finally, I tried applying logic to the DF instead of the COO Matrix. I THINK this gets close, but it is still failing. I believe it fails because it is not including the 0s (copied is just a DF that's a copy of segments):
copied['WEIGHT'] = copied.loc[copied['WEIGHT'] != 0, 'WEIGHT'] = float((1/len(copied))) #0.00023860653
The last code passes the first two tests (testing if it's an array and that it sums to 1.0), but fails the last
assert np.isclose(x0.max(), 1.0/n_actual, atol=10*n*np.finfo(float).eps), "x0` values seem off..."
EDIT 2:
Had the wrong count. It was supposed to be 1/300, not 1/4191. All fixed now, thanks all who took a look :)
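For reference, a minimal sketch of one way to build both vectors, assuming segments is the DataFrame shown above and n is whatever count applies (300 per the final edit); np.where and the groupby from the question do the work:
import numpy as np
import pandas as pd

n = 300  # divisor from the final edit; adjust to your case
zeroes = (segments['WEIGHT'] == 0).groupby(segments['ORIGIN_INDEX']).all()

# Per-ORIGIN_INDEX values: 0 where every weight is zero, 1/n otherwise (keeps the IDs)
keyed = pd.Series(np.where(zeroes, 0.0, 1.0 / n), index=zeroes.index)

# Plain NumPy vector of the same values, without the IDs
x0 = keyed.to_numpy()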

How do I call a value from a list inside of a pandas dataframe?

I have some data that I put into a pandas DataFrame. Inside of cell [0,5] I have a list of times that I want to access and print out.
Dataframe:
GAME_A PROCESSING_SPEED
yellow_selected 19
red_selected 0
yellow_total 19
red_total 60
counters [0.849998, 1.066601, 0.883263, 0.91658, 0.96668]
Code:
import pandas as pd
df = pd.read_csv('data.csv', sep = '>')
print(df.iloc[0])
proc_speed = df.iat[0,5]
print(proc_speed[2])
When I try to print the 3rd time in the list I get a single character instead. I tried to use a for loop to print the times, but I get the output below. How can I access specific values from the list? How would I print out the 3rd time, 0.883263?
[
0
.
8
4
9
9
9
8
,
1
.
0
6
6
...
This happens because, with the way you are loading the data, the column 'PROCESSING_SPEED' is read as an object dtype, so every element of that Series is a string. In this case proc_speed = "[0.849998, 1.066601, 0.883263, 0.91658, 0.96668]", which is exactly the string the loop is printing character by character.
Before printing the values you want from that cell, convert the string to a list of numbers, for example:
proc_speed = df.iat[4, 1]
proc_speed = [float(s) for s in proc_speed[1:-1].split(',')]
for num in proc_speed:
    print(num)
Where proc_speed[1:-1].split(',') takes the string containing the list, except for the brackets at the beginning and end, and splits it according to the commas separating values.
In general, we have to be careful when loading columns with varying or ambiguous data types, as Pandas could have trouble parsing them correctly or in the way we want/expect it to be.
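If you control the loading step, you can also avoid the string round-trip entirely by parsing that column while reading the file. Here is a sketch using ast.literal_eval as a converter; the file name, separator and iat indexes mirror the question and may need adjusting to how your file actually loads:
import ast
import pandas as pd

# Parse each PROCESSING_SPEED cell into a real Python object while reading,
# so the bracketed list comes back as a list and plain numbers stay numbers
df = pd.read_csv('data.csv', sep='>',
                 converters={'PROCESSING_SPEED': lambda v: ast.literal_eval(v.strip())})

proc_speed = df.iat[4, 1]  # the counters row, now a real list
print(proc_speed[2])       # 0.883263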
You can simply call proc_speed[index], as you have set this variable as a list. Here is a working example; note my call to df.iat has different indexes:
import pandas as pd

d = {'GAME_A': ['yellow_selected', 'red_selected', 'yellow_total', 'red_total', 'counters'],
     'PROCESSING_SPEED': [19, 0, 19, 60, [0.849998, 1.066601, 0.883263, 0.91658, 0.96668]]}
df = pd.DataFrame(d)
proc_speed = df.iat[4, 1]
for i in proc_speed:
    print(i)
0.849998
1.066601
0.883263
0.91658
0.96668
proc_speed[1]
1.066601
proc_speed[3]
0.91658
You can convert with apply; it's easier than splitting, and it keeps your ints as ints:
pd.read_clipboard(sep=r"(?!\s+(?<=,\s))\s+")['PROCESSING_SPEED'].apply(eval)[4][2]
# 0.883263

Fitting a pandas column containing a list in scikit-learn

I have a pandas DataFrame with a column called 'X' containing lists of 300 doubles and a column called 'label'. When trying to run:
cls = SVC()
cls.fit(miniset.loc[:,'X'],miniset.loc[:,'label'])
I get the error:
ValueError: setting an array element with a sequence.
Any idea how to fix it?
Thanks
Head of my DataFrame
label X
0 0 [-1.1990741, 0.98229957, -2.7413394, 0.5774205...
1 1 [0.10277234, 1.8292198, -1.8241594, 0.07206603...
2 0 [-0.26603428, 1.8654639, -2.2495375, -0.695124...
3 0 [-1.1662953, 3.0714324, -3.4975948, 0.01011618...
4 0 [-0.13769871, 1.9866339, -1.9885212, -0.830097...
Your issue is the 'X' column of your DataFrame. To get this to work with SVC (or basically any scikit-learn model), you need to split that column into several columns, one for each element of your lists.
You can fix that by doing something like the sketch below.
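A minimal sketch of that idea, assuming every list in miniset['X'] has the same length (300):
import pandas as pd
from sklearn.svm import SVC

# Expand the list column into one numeric column per element
X = pd.DataFrame(miniset['X'].tolist(), index=miniset.index)

cls = SVC()
cls.fit(X, miniset['label'])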
The pandas package is not intended to store lists or other collections as values. It is meant to store panel data, hence the name pandas.
You can try:
cls.fit(np.array(miniset.loc[:,'X'].tolist()),miniset.loc[:,'label'])
where tolist() gives you a 2D array (which would be good enough).
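As a quick sanity check on that conversion (assuming each list in 'X' really holds 300 values), the resulting array should be 2D:
import numpy as np

X = np.array(miniset.loc[:, 'X'].tolist())
print(X.shape)  # expected: (len(miniset), 300)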

Pandas .loc taking a very long time

I have a 10 GB csv file with 170,000,000 rows and 23 columns that I read in to a dataframe as follows:
import pandas as pd
d = pd.read_csv(f, dtype = {'tax_id': str})
I also have a list of strings with nearly 20,000 unique elements:
h = ['1123787', '3345634442', '2342345234', .... ]
I want to create a new column called class in the dataframe d. I want to assign d['class'] = 'A' whenever d['tax_id'] has a value that is found in the list of strings h. Otherwise, I want d['class'] = 'B'.
The following code works very quickly on a 1% sample of my dataframe d:
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
However, on the complete dataframe d, this code takes over 48 hours (and counting) to run on a 32 core server in batch mode. I suspect that indexing with loc is slowing down the code, but I'm not sure what it could really be.
In sum: Is there a more efficient way of creating the class column?
If your tax numbers are unique, I would recommend setting tax_num as the index and then indexing on that. As it stands, you call isin, which is a linear operation. However fast your machine is, it can't do a linear search on 170 million records in a reasonable amount of time.
df.set_index('tax_num', inplace=True) # df = df.set_index('tax_num')
df['class'] = 'B'
df.loc[h, 'class'] = 'A'
If you're still suffering from performance issues, I'd recommend switching to distributed processing with dask.
"I also have a list of strings with nearly 20,000 unique elements"
Well, for starters, you should make that list a set if you are going to be using it for membership testing. list objects have linear time membership testing, set objects have very optimized constant-time performance for membership testing. That is the lowest hanging fruit here. So use
h = set(h) # convert list to set
d['class'] = 'B'
d.loc[d['tax_num'].isin(h), 'class'] = 'A'
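Not in either answer above, but a common variation on the same isin idea is to build the whole column in one vectorized np.where call, which avoids the separate default assignment and the .loc indexing:
import numpy as np

h = set(h)  # keep the set conversion from the answer above
d['class'] = np.where(d['tax_num'].isin(h), 'A', 'B')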

Most efficient way to turn a 5D array into a Pandas dataframe

I have a 5D array called predictors with a shape of [6, 288, 37, 90, 107], where 6 is the number of variables, 288 is the length of each variable's time series, 37 is the number of k locations, 90 is the number of j locations, and 107 is the number of i locations.
I want to have a pandas dataframe that includes columns of each variable timeseries at each k,j,i location so that of course will be a lot of columns.
Then I would like to somehow obtain the names for each column.
For example, the first column would be var1_k_j_i = predictors[0,:,0,0,0], except in the name I actually want the k location, j location, and i location instead of k_j_i.
Since there are so many I can't do this by hand so I was hoping for a suggestion on the best way to organize this into a pandas dataframe and obtain the names? A loop possibly?
So in summary by the end of this I would like my 5D array of predictors turned into a large pandas dataframe where each column is a variable located at different k,j,i locations with the corresponding names of the variable and location in the header or first row of the dataframe.
Sounds like you need to have some fun with reshape here.
Addressing the i, j, k locations is easy with reshape. I'm not sure you can reshape again to obtain the 2D representation you need, so I'm proposing a loop, as follows.
import itertools
import pandas as pd

dfs = []
# Flatten the three location axes into a single axis of 37*90*107 locations
new_matrix = matrix.reshape([6, 288, 37 * 90 * 107])
for var in range(6):
    iterator = itertools.product(range(37), range(90), range(107))
    # Column names like 'var0_k_j_i' for every (k, j, i) location index
    columns = ['var%i_' % var + '_'.join(map(str, x)) for x in iterator]
    dfs.append(pd.DataFrame(new_matrix[var], columns=columns))
# Put the variables side by side: 288 rows of time steps,
# one column per variable at each location
result = pd.concat(dfs, axis=1)
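A quick check that the result has the intended layout (288 rows of time steps, one column per variable-location pair):
print(result.shape)       # expected: (288, 6 * 37 * 90 * 107)
print(result.columns[0])  # e.g. 'var0_0_0_0'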
