Pandas Dataframes: how to build them efficiently - python

I have a file with 1M rows that I'm trying to read into 20 DataFrames. I do not know in advance which row belongs to which DataFrame or how large each DataFrame will be. How can I process this file into DataFrames efficiently? I've tried to do this several different ways. Here is what I currently have:
data = pd.read_csv(r'train.data', sep=" ", header=None)  # Not slow

def collectData(row):
    id = row[0]
    df = dictionary[id]  # Row content determines which dataframe this row belongs to
    next = len(df.index)
    df.loc[next] = row

data.apply(collectData, axis=1)
It's very slow. What am I doing wrong? If I just apply an empty function, my code runs in 30 sec. With the actual function it takes at least 10 minutes and I'm not sure if it would finish.
Here are a few sample rows from the dataset:
1 1 4
1 2 2
1 3 10
1 4 4
The full dataset is available here (if you click on Matlab version)

Your approach is not a vectorized one, because you apply a Python function row by row.
Rather than creating 20 dataframes, make a dictionary that maps each key[0] value to an index (in range(20)). Then add this information to your DataFrame:
data['dict'] = data[0].map(dictionary)
Then reorganize:
data2 = data.reset_index().set_index(['dict', 'index'])
data2 is like:
            0  1   2
dict index
12   0      1  1   4
     1      1  2   2
     2      1  3  10
     3      1  4   4
     4      1  5   2
....
and data2.loc[i] is one of the DataFrames you want.
EDIT:
It seems that the dictionary is described in train.label.
You can build the dictionary beforehand with:
with open(r'train.label') as f:
    u = f.readlines()
v = [int(x) for x in u]  # len(v) = 11269 = data[0].max()
dictionary = dict(zip(range(1, len(v) + 1), v))
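Putting the pieces together, a short usage sketch (assuming, as in train.label, that the 20 labels run from 1 to 20):
# Collect the 20 sub-frames into a plain dict keyed by label.
frames = {i: data2.loc[i].reset_index(drop=True) for i in range(1, 21)}
frames[12].head()  # one of the 20 DataFrames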

Since the full data set fits easily in memory, the following should be fairly quick:
data_split = {i: data[data[0] == i] for i in range(1, 21)}
# to access each dataframe, do a dictionary lookup, i.e.
data_split[2].head()
0 1 2
769 2 12 4
770 2 16 2
771 2 23 4
772 2 27 2
773 2 29 6
You may also want to reset the indices or copy the data frame when you're slicing it into smaller data frames; a short sketch follows the reading list below.
additional reading:
copy
reset_index
view-vs-copy
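A minimal sketch of that reset/copy suggestion, reusing the data_split comprehension from above (the 1..20 label range is the same assumption as before):
data_split = {i: data[data[0] == i].copy().reset_index(drop=True)
              for i in range(1, 21)}
# .copy() makes each piece an explicit, independent frame and
# reset_index(drop=True) gives it a clean 0..n-1 index.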

If you want to build them efficiently, I think you need some good raw materials:
wood
cement
Both are robust and durable.
Try to avoid using hay, as the dataframe can be blown down by a little wind.
Hope that helps

Related

Pandas save counts of multiple columns in single dataframe

I have a dataframe with 3 columns which currently appears like this:
Model IsJapanese IsGerman
BenzC 0 1
BensGla 0 1
HondaAccord 1 0
HondaOdyssey 1 0
ToyotaCamry 1 0
I want to create a new dataframe and have TotalJapanese and TotalGerman as two columns in the same dataframe.
I am able to achieve this by creating 2 different dataframes. But wondering how to get both the counts in a single dataframe.
Please suggest, thank you!
Edit: adding another similar dataframe to this (sorry, not sure whether it's allowed, but trying).
Second dataset: I am trying to save multiple counts in a single dataframe, based on repetition of data.
Here is my sample dataset
Store Address IsLA IsGA
Albertsons Cross St 1 0
Safeway LeoSt 0 1
Albertsons Main St 0 1
RiteAid Culver St 1 0
My aim is to prepare a new dataset with multiple counts per store
The result should be like this
Store TotalStores TotalLA TotalGA
Albertsons 2 1 1
Safeway 1 0 1
RiteAid 1 1 0
Is it possible to achieve this in a single dataframe?
Thanks!
One way would be to store the sum of Japanese cars and German cars, and manually create a dataframe using them:
j, g = sum(df['IsJapanese']), sum(df['IsGerman'])
total_df = pd.DataFrame({'TotalJapanese': j,
                         'TotalGerman': g}, index=['Totals'])
print(total_df)
print(total_df)
TotalJapanese TotalGerman
Totals 3 2
Another way would be to transpose (T) your dataframe, sum(axis=1), and transpose back:
total_df_v2 = pd.DataFrame(df.set_index('Model').T.sum(axis=1)).T
print(total_df_v2)
   IsJapanese  IsGerman
0           3         2
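A slightly shorter sketch on the same toy data (total_df_v3 is just a new name here): sum the two indicator columns directly and reshape the result into a one-row frame:
total_df_v3 = (df[['IsJapanese', 'IsGerman']].sum()   # Series with values 3 and 2
               .to_frame('Totals').T                  # one row labelled 'Totals'
               .rename(columns={'IsJapanese': 'TotalJapanese',
                                'IsGerman': 'TotalGerman'}))
print(total_df_v3)
#         TotalJapanese  TotalGerman
# Totals              3            2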
To answer your 2nd question, you can use DataFrameGroupBy.agg on your 'Store' column, with count on Address and sum on your other two columns. Then you can rename() the columns if needed:
resulting_df = df.groupby('Store').agg({'Address': 'count',
                                        'IsLA': 'sum',
                                        'IsGA': 'sum'}).\
               rename({'Address': 'TotalStores',
                       'IsLA': 'TotalLA',
                       'IsGA': 'TotalGA'}, axis=1)
Prints:
            TotalStores  TotalLA  TotalGA
Store
Albertsons            2        1        1
RiteAid               1        1        0
Safeway               1        0        1
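On recent pandas (0.25+), named aggregation gives the same result without the separate rename step; a sketch on the same data:
resulting_df = df.groupby('Store').agg(TotalStores=('Address', 'count'),
                                       TotalLA=('IsLA', 'sum'),
                                       TotalGA=('IsGA', 'sum'))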

Conditionally dropping columns in a pandas dataframe

I have this dataframe and my goal is to remove any columns that have less than 1000 entries.
Prior to pivoting the df, I know I have 880 unique well_ids with entries ranging from 4 to 60k+. I know I should end up with 102 well_ids.
I tried to accomplish this in a very naïve way by collecting the wells that I am trying to remove in an array and using a loop, but I keep getting a 'TypeError: Level type mismatch'; when I just use del without a for loop it works.
# this works
del df[164301.0]
del df['TB-0071']

# this doesn't work
for id in unwanted_id:
    del df[id]
Any help is appreciated, Thanks.
You can use the dropna method:
df.dropna(axis='columns', thresh=1000)  # keep only columns with at least 1000 non-NA values
The advantage of this method is that you don't need to create a list.
Also don't forget to add the usual inplace = True if you want the changes to be made in place.
You can use the pandas drop method:
df.drop(columns=['colName'], inplace=True)
You can actually pass a list of column names:
unwanted_ids = [164301.0, 'TB-0071']
df.drop(columns=unwanted_ids, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
And to get those column names with more than 1000 unique values, you can use something like this:
counts = (df.nunique()[df.nunique() > 1000]
            .to_frame('uCounts')
            .reset_index()
            .rename(columns={'index': 'colName'}))
counts
colName uCounts
0 to 1001
1 freq 1050
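To tie this back to the original goal (dropping columns with fewer than 1000 entries), here is a sketch assuming "entries" means non-null values in the pivoted frame:
# Keep only columns with at least 1000 non-null values ...
df = df.loc[:, df.count() >= 1000]
# ... or, as an alternative, build the list of offending columns and drop them.
unwanted_ids = df.columns[df.count() < 1000]
df = df.drop(columns=unwanted_ids)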

Filtering after Multi Indexing in pandas iterable indexing

I want to make a subset of my DataFrame object, using pandas or any other Python library, with hierarchical indexing that I can iterate over depending on the number of rows in one of the columns.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv(address)
trajectory frame x y
1 1 447,956 2,219
1 2 447,839 2,327
1 3 449,183 1,795
1 4 450,444 1,833
1 5 448,514 1,708
1 6 451,532 1,832
1 7 448,471 1,759
1 8 450,028 2,097
1 9 448,215 2,203
1 10 449,311 2,063
1 11 451,745 1,76
1 12 450,827 2,264
1 13 448,991 2,208
1 14 452,829 3,106
1 15 448,688 1,77
1 16 449,844 1,951
1 17 450,044 1,991
1 18 449,835 1,901
1 19 450,793 3,49
1 20 449,618 2,354
2 1 445.936 7.219
2 2 442.879 3.327
3 1 441.283 9.795
4 1 447.956 2.219
4 3 447.839 2.327
4 6 449.183 1.795
In this DataFrame, let's say there are 4 columns, named 'trajectory', 'frame', 'x' and 'y'. The number of 'trajectory' values can differ from one dataframe to another. Each 'trajectory' can have multiple frames between 1 and 20, either sequential from 1-20 or with some missing frames. Each frame has its own value in the columns 'x' and 'y'.
My aim is to create a new dataframe containing only those 'trajectory' values for which a 'frame' value is present in all 20 rows. As the number of rows in the 'trajectory' and 'frame' columns changes, I would like code that can be used in such conditions.
df_1 = df.set_index(['trajectory','frame'], drop=False)
Here, I did hierarchical indexing using 'trajectory' and 'frame', and then I found that 'trajectory' numbers 1 and 6 have 20 frames in them, so I could manually select them using the following code.
df_1_subset = df_1[(df_1['trajectory'] == 1) | (df_1['trajectory'] == 6)]
However, I have multiple csv files, and in each DataFrame the 'trajectory' values that have 20 rows in the 'frame' column will be different, so I would have to do this manually every time. I think there must be a better way, but I just cannot seem to find it. I am very new to coding and would really appreciate anybody's help. Thank you very much in advance.
If you need to filter the trajectory level for 1 or 6, use Index.get_level_values with Index.isin:
df_1_subset = df_1[df_1.index.get_level_values('trajectory').isin([1, 6])]
If you need to filter the trajectory level for 1 and the frame level for 6, select with DataFrame.loc and a tuple:
df_1_subset = df_1.loc[(1, 6)]
Alternative:
df_1_subset = df_1.loc[(df_1.index.get_level_values('trajectory') == 1) &
                       (df_1.index.get_level_values('frame') == 6)]
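A sketch for picking those trajectories automatically instead of hard-coding 1 and 6, assuming "complete" means 20 distinct frame values as described in the question:
# Count distinct frames per trajectory and keep only complete ones.
frame_counts = df.groupby('trajectory')['frame'].nunique()
full_ids = frame_counts[frame_counts == 20].index
df_full = df[df['trajectory'].isin(full_ids)]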

Accessing part of a pandas DataFrame by reference in python

I have a large DataFrame object and I want to access parts of it by reference, that is, in a way that whenever the original, large DataFrame is updated, the smaller ones are also updated.
Creating a copy of the smaller parts does not work for obvious reasons.
import pandas as pd
# Create a DataFrame
large_df= pd.DataFrame(dict(a=range(3)))
large_df
0:
a
0 0
1 1
2 2
# Sample some of the DataFrame indices.
# In this example I keep accessing the even rows of a DataFrame
# while updating it, but `sample` is, in general,
# a random list of rows.
sample=[0,2]
# Create a copy of the sampled part of the DataFrame
sub_df = large_df.loc[sample]
sub_df
1:
a
0 0
2 2
# Modify the original DataFrame
large_df.loc[:,'b'] = range(3,6)
large_df
2:
a b
0 0 3
1 1 4
2 2 5
# The copy of the sampled part is kept unchanged
sub_df
3:
a
0 0
2 2
The only solution I found is going back to the loc statement.
# Reusing loc, the sampled part includes the modification
large_df.loc[sample]
4:
a b
0 0 3
2 2 5
Is there a simpler way?
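For what it's worth, row samples taken with a list of labels always come back as copies, so a minimal sketch of the workaround already shown above is to re-run .loc lazily through a small helper (the name is hypothetical):
def sampled():
    # Re-slice large_df each time, so the result always reflects its
    # current state, including the later addition of column 'b'.
    return large_df.loc[sample]

sampled()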

Re-shaping pandas data frame using shape or pivot_table (stack each row)

I have an almost embarrassingly simple question, which I cannot figure out for myself.
Here's a toy example to demonstrate what I want to do, suppose I have this simple data frame:
df = pd.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12]],index=range(2),columns=list('abcdef'))
a b c d e f
0 1 2 3 4 5 6
1 7 8 9 10 11 12
What I want is to stack it so that it takes the following form, where the column identifiers have been changed (to X and Y) so that they are the same for all re-stacked values:
X Y
0 1 2
3 4
5 6
1 7 8
9 10
11 12
I am pretty sure you can do it with pd.stack() or pd.pivot_table(), but I have read the documentation and cannot figure out how. Instead of appending all columns to the end of the next, I just want to append pairs (or triplets, actually) of values from each row.
Just to add some more flesh to the bones of what I want to do:
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
a b c d e f
0 -0.168636 -1.878447 -0.985152 -0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890 -1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250 -1.718324 0.145479 -0.099530
I want this re-stacked into this form (where the column labels have again been changed, to be the same for all values):
X Y Z
0 -0.168636 -1.878447 -0.985152
-0.101049 1.244617 1.256772
1 0.395110 -0.237559 0.034890
-1.244669 -0.721756 0.473696
2 -0.973043 1.784627 0.601250
-1.718324 0.145479 -0.099530
Yes, one could just make a for-loop with the following logic operating on each row of df.values:
row.reshape(df.shape[1] // 3, 3)
But then you would have to process each row individually, and my actual data has tens of thousands of rows.
So I want to stack each individual row selectively (e.g. by pairs or triplets of values), and then stack those row-stacks for the entire data frame, preferably all at once (if possible).
Apologies for such a trivial question.
Use numpy.reshape to reshape the underlying data in the DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3,6),index=range(3),columns=list('abcdef'))
print(df)
# a b c d e f
# 0 -0.889810 1.348811 -1.071198 0.091841 -0.781704 -1.672864
# 1 0.398858 0.004976 1.280942 1.185749 1.260551 0.858973
# 2 1.279742 0.946470 -1.122450 -0.355737 1.457966 0.034319
result = pd.DataFrame(df.values.reshape(-1,3),
                      index=df.index.repeat(2), columns=list('XYZ'))
print(result)
yields
X Y Z
0 -0.889810 1.348811 -1.071198
0 0.091841 -0.781704 -1.672864
1 0.398858 0.004976 1.280942
1 1.185749 1.260551 0.858973
2 1.279742 0.946470 -1.122450
2 -0.355737 1.457966 0.034319
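If the block width varies, a hedged generalization of the same idea (the helper name is hypothetical; it assumes the number of columns is a multiple of k):
def stack_rows(df, k):
    # Reshape the underlying values into blocks of k per row and repeat
    # the original index once per block.
    return pd.DataFrame(df.values.reshape(-1, k),
                        index=df.index.repeat(df.shape[1] // k),
                        columns=list('XYZ'[:k]))  # labels as in the question (k <= 3)

stack_rows(df, 2)  # pairs, as in the first toy example
stack_rows(df, 3)  # triplets, as in the second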
