Use a loop on a dataframe by taking a specific column - python

I am new to pandas and Python. Here I have a dataframe:
DID  feature
0    1
0    1
0    2
0    22
0    22
0    33
1    11
1    13
1    14
1    2
1    33
2    1
2    22
2    33
2    13
2    14
In this dataframe there are two columns: DID is a document ID and feature is a feature of that document.
Now I am trying to use a for loop here on the basis of the document IDs.
I am trying to call a function inside the loop that will receive the data of that DID only, i.e. only the features of that DID.
So:
for i in df1:
    # i is the document ID, which will be 0 first;
    # call the function with only the values for document i
    call_process(df1['feature'].values)
Like this? Is there any way to do this?
The expected output: when the method is called, it should receive the data of that document ID only, e.g. for DID 0:
call_process([1, 1, 2, 22, 22, 33])

If I understood you correctly, here is a simple function to get you the features for a DID:
def get_features(did):
    feats = []  # to collect the matching features
    for d, idx in zip(df['DID'], range(len(df))):  # get DID and index of DID
        if d == did:
            feats.append(df['feature'][idx])
    return feats  # return the features in a list
Then you call the function with the did value you want; suppose it is DID 0:
get_features(0)
And it returns:
[1, 1, 2, 22, 22, 33]
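As a side note (my own variant, not part of the answer above), a boolean mask gives the same result without an explicit loop:
def get_features(did):
    # Select the 'feature' values of the rows whose DID matches.
    return df.loc[df['DID'] == did, 'feature'].tolist()

get_features(0)  # [1, 1, 2, 22, 22, 33]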

I don't fully understand your purpose, but you can do it with a for loop over a groupby object:
for _, g in df1.groupby('DID'):
    call_process(g['feature'].values)
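For reference, a minimal runnable sketch of this groupby approach on the question's data, assuming call_process simply prints its argument:
import pandas as pd

df1 = pd.DataFrame({'DID':     [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                    'feature': [1, 1, 2, 22, 22, 33, 11, 13, 14, 2, 33]})

def call_process(features):
    print(list(features))  # stand-in for the real processing

for _, g in df1.groupby('DID'):
    call_process(g['feature'].values)
# prints [1, 1, 2, 22, 22, 33] and then [11, 13, 14, 2, 33]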

Related

split dataframe based on column value

I have a df that contains several IDs. I'm trying to run a regression on the data, and I need to be able to split it by ID to apply the regression to each ID.
Sample DF (this is only a sample; the real data is larger).
I tried to save the IDs in a list like this:
id_list = []
for data in df['id'].unique():
    id_list.append(data)
The list output is [1,2,3]
Then I was trying to use that to split the DF:
def create_dataframe(df):
    for unique_id in id_list:
        df = df[df['Campaign ID'] == campaign_id]
        return df
When I call the function, I only get the result for the first ID; the other two IDs [2, 3] are not returning any DF, which means that at some point the loop breaks.
Here is the entire code:
df = pd.read_csv('budget.csv')

id_list = []
for unique_id in df['id'].unique():
    id_list.append(unique_id)

def create_dataframe(df):
    for unique_id in id_list:
        df = df[df['Campaign ID'] == unique_id]
        return df

print(create_dataframe(df))
You can use the code snippet df.loc[df['id'] == item] to extract sub-dataframes based on a particular value of a column in the dataframe.
Please refer to the full code below:
import pandas as pd

df_dict = {"id":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
           "value": [12, 13, 14, 22, 23, 24, 32, 33, 34]}
df = pd.DataFrame(df_dict)
print(df)

id_list = []
for data in df['id'].unique():
    id_list.append(data)
print(id_list)

for item in id_list:
    sub_df = df.loc[df['id'] == item]
    print(sub_df)
    print("****")
This generates the following output, with a sub-dataframe for each distinct id:
id value
0 1 12
1 1 13
2 1 14
3 2 22
4 2 23
5 2 24
6 3 32
7 3 33
8 3 34
[1, 2, 3]
id value
0 1 12
1 1 13
2 1 14
****
id value
3 2 22
4 2 23
5 2 24
****
id value
6 3 32
7 3 33
8 3 34
****
Now, the issue in your code snippet is that the function create_dataframe() is called only once, and inside the function, when iterating through the elements, the return statement exits the loop immediately after fetching the sub-df for id = 1. Hence you only get the sub-df for id = 1.
You seem to be overwriting the df value in the for loop. I would recommend creating the result outside of the for loop and then appending to it in each iteration instead of overwriting it.
You can use numpy.split:
import numpy as np

df.sort_values('id', inplace=True)
np.split(df, df.index[df.id.diff().fillna(0).astype(bool)])
or pandas groupby:
grp = df.groupby('id')
[grp.get_group(g) for g in grp.groups]
Although I think you can run the regression directly with pandas groupby, since it can apply any function you want, treating each group as a distinct dataframe.
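For example, here is a minimal sketch of a per-ID regression via groupby.apply; the columns 'x' and 'y' and the use of numpy.polyfit are illustrative assumptions, not taken from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'x':  [1, 2, 3, 1, 2, 3],
                   'y':  [2.1, 4.2, 5.9, 1.0, 2.1, 2.9]})

def fit(group):
    # Linear least-squares fit within one group.
    slope, intercept = np.polyfit(group['x'], group['y'], 1)
    return pd.Series({'slope': slope, 'intercept': intercept})

print(df.groupby('id').apply(fit))  # one fitted line per id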

Pandas Dataframe Filter Multiple Conditions

I am looking to filter a dataframe to only include values that are equal to a certain value, or greater than another value.
Example dataframe:
0 1 2
0 0 1 23
1 0 2 43
2 1 3 54
3 2 3 77
From here, I want to pull all values from column 0, where column 2 is either equal to 23, or greater than 50 (so it should return 0, 1 and 2). Here is the code I have so far:
df = df[(df[2]==23) & (df[2]>50)]
This returns nothing. However, when I split these apart and run them individually (df = df[df[2]==23] and df = df[df[2]>50]), I do get results back. Does anyone have any insight into how to get this to work?
As you said, it's or (|), not and (&):
df = df[(df[2]==23) | (df[2]>50)]
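A quick runnable check on the sample data (the DataFrame construction is my reconstruction of the example above):
import pandas as pd

df = pd.DataFrame([[0, 1, 23], [0, 2, 43], [1, 3, 54], [2, 3, 77]])
print(df[(df[2] == 23) | (df[2] > 50)][0].tolist())  # [0, 1, 2]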

Take the values from Pandas dataframe column and create a new column with them as a list

I've got a dataframe which looks like this:
Record Field11 ID LesionNumber Diagnosis1
1 False 1000 1 22
1 False 1000 2 88
1 False 1000 3 22
1 False 1000 4 24
All of the IDs are the same, and this kind of structure repeats for many different IDs.
Using all rows with the same ID, I'd like to create a new dataframe which looks like this:
Record ID LesionNumber Diagnosis1
1 1000 1, 2, 3, 4 22, 88, 22, 24
I'd like to have the LesionNumber and Diagnosis1 appear as ordered lists.
I'm new to Pandas and dataframes so my terminology may be off. Is this possible?
Using agg:
df.groupby(['Record', 'Field11', 'ID']).agg(lambda x: ','.join(x.astype(str))).reset_index()
Out[634]:
Record Field11 ID LesionNumber Diagnosis1
0 1 False 1000 1,2,3,4 22,88,22,24
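If you want real Python lists rather than comma-joined strings (my reading of "ordered lists" in the question), agg(list) works as well:
df.groupby(['Record', 'Field11', 'ID']).agg(list).reset_index()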

How to use dataframe attributes inside of a for loop in python?

I'm new to Python, so I'll try to be as detailed as possible.
I'm trying to implement the bagging algorithm (I need to implement it for my final paper, so basically I can't use one that is already implemented).
I have a CSV file with X classes. What I want to do is read the CSV with pandas, then create one dataframe for each class read from the CSV file, and at the end create a new CSV file with all these new DataFrames. I already did this for a specific CSV file, but I need to make it generic.
This is the code that I made. It works for one specific CSV file:
import pandas as pd

# Function to compute a percentage
def percentage(number1, number2):
    percent = (number1 * number2) / 100
    return percent

# Read the csv file
dataFrame = pd.read_csv('Monk.csv', sep=',')

# Select all the values of the dataframe with the column Class = 1; do the same for Class = 2
class1 = dataFrame[dataFrame['Class'] == 1]
class2 = dataFrame[dataFrame['Class'] == 2]

# Reset the index of the new dataframes
class1 = class1.reset_index(drop=True)
class2 = class2.reset_index(drop=True)

# Get the number of rows of each class
lenClass1 = len(class1)
lenClass2 = len(class2)

# Randomly select n rows of each class dataframe
randClass1 = class1.sample(n=int(percentage(lenClass1, 33)))
randClass2 = class2.sample(n=int(percentage(lenClass2, 33)))

# Combine, shuffle, reset the index and write out
subSet = randClass1.append(randClass2)
subSet = subSet.sample(frac=1)
subSet = subSet.reset_index(drop=True)
subSet.to_csv('MonkSub1.csv', sep=',')
With this code I get this dataframe:
A1 A2 Class
01 a b 1
02 x a 2
03 f a 2
04 r b 1
05 l a 2
06 s b 1
After that I separate the dataframe into new dataframes:
Class1
A1 A2 Class
01 a b 1
04 r b 1
06 s b 1
Class 2
A1 A2 Class
02 x a 2
03 f a 2
05 l a 2
Then I reset the index:
Class1
A1 A2 Class
01 a b 1
02 r b 1
03 s b 1
Class 2
A1 A2 Class
01 x a 2
02 f a 2
03 l a 2
Then I randomly select a number of rows from each new DataFrame, based on a percentage of the rows in each DataFrame, and create a new CSV with the selected rows. For example, with 33% I will have 1 row of Class1 and 1 row of Class2 selected randomly.
NewCSV
A1 A2 Class
02 r b 1
01 x a 2
And finally I reset the index of this new DataFrame.
After this I tried to do the same for generic files. My idea was to read a CSV file with pandas and then group rows with similar values into separate DataFrames by 'Class' (like the example above). And this is where the problem is. I tried that using the groupby() function, but calling type() on the result gives pandas.core.groupby.DataFrameGroupBy. With a DataFrameGroupBy I couldn't use all the attributes of a DataFrame, so then I used this:
dataFrame = pd.read_csv('Monk.csv', sep=',')
grouped = dataFrame.groupby(["Class"])
test = grouped.apply(lambda x: x)
When I use the type() function on test, it returns pandas.core.frame.DataFrame. Basically I did all I wanted, but after that my problems begin. I need to save all the new DataFrames in a list or array, so I can iterate over them, reset the index (reset_index()), take a sample with a percentage value based on the total number of rows (sample()), and use other DataFrame methods. But I can't, because inside the for loop I can't use these DataFrame attributes.
for df in test.iterrows():
    FrameList = [df]
    FrameList.reset_index(drop=True)
But I got the error: AttributeError: 'list' object has no attribute 'reset_index'.
I don't know what to do; I already tried a lot of things, but none of them work.
PS: Sorry for my bad English!
It seems to me that you are trying to create a subset of a larger file where the classes in the subset are proportional to those in the larger file. You need a stratified sample.
Suppose you have a dataframe like so, with the following class summary.
import numpy as np
import pandas as pd

df = pd.DataFrame({'class': list('aabbb')*4,
                   'val': np.arange(20)}); df
Out[106]:
class val
0 a 0
1 a 1
2 b 2
3 b 3
4 b 4
5 a 5
6 a 6
7 b 7
8 b 8
9 b 9
10 a 10
11 a 11
12 b 12
13 b 13
14 b 14
15 a 15
16 a 16
17 b 17
18 b 18
19 b 19
df.groupby('class').count()
Out[107]:
val
class
a 8
b 12
If you want to create a 25% (to keep the math simple) stratified sample by 'class', then your final dataframe will have 5 obs with 2 from 'a' and 3 from 'b' to maintain proportionality. That can easily be done like so.
dfsub = df.groupby('class', as_index=False).apply(lambda x: x.sample(frac=.25)).reset_index(drop=True); dfsub
Out[109]:
class val
0 a 0
1 a 6
2 b 18
3 b 17
4 b 7
From here you can write it out to CSV.
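For example (the filename and the index choice are my assumptions):
dfsub.to_csv('subset.csv', index=False)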
for df in test.iterrows():
    FrameList = [df]
    FrameList.reset_index(drop=True)
"But I got the error: AttributeError: 'list' object has no attribute 'reset_index'. I don't know what to do, I already tried a lot of things, but none of them works."
FrameList = [df] creates a new list that contains the item yielded by iterrows() (which is actually an (index, Series) tuple, not a DataFrame).
reset_index() is a method of the DataFrame object, but you are calling it on the list, not on a DataFrame.
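A minimal sketch of what the asker seems to want, using groupby directly ('Class' and dataFrame are names from the question, and the 33% sample mirrors the earlier code); each group is a real DataFrame, so reset_index() and sample() work on it:
frame_list = [g.reset_index(drop=True) for _, g in dataFrame.groupby('Class')]
for frame in frame_list:
    sub = frame.sample(frac=0.33)  # sample a percentage of each class
    print(sub)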

Pandas Dataframes: how to build them efficiently

I have a file with 1M rows that I'm trying to read into 20 DataFrames. I do not know in advance which row belongs to which DataFrame or how large each DataFrame will be. How can I process this file into DataFrames efficiently? I've tried to do this several different ways. Here is what I currently have:
data = pd.read_csv(r'train.data', sep=" ", header=None)  # Not slow

def collectData(row):
    id = row[0]
    df = dictionary[id]  # Row content determines which dataframe this row belongs to
    next = len(df.index)
    df.loc[next] = row

data.apply(collectData, axis=1)
It's very slow. What am I doing wrong? If I just apply an empty function, my code runs in 30 seconds. With the actual function it takes at least 10 minutes, and I'm not sure it would ever finish.
Here are a few sample rows from the dataset:
1 1 4
1 2 2
1 3 10
1 4 4
The full dataset is available here (if you click on Matlab version)
Your approach is not a vectorized one, because you apply a Python function row by row.
Rather than creating 20 dataframes, make a dictionary containing an index (in range(20)) for each key[0]. Then add this information to your DataFrame:
data['dict'] = data[0].map(dictionary)
Then reorganize:
data2 = data.reset_index().set_index(['dict', 'index'])
data2 looks like:
0 1 2
dict index
12 0 1 1 4
1 1 2 2
2 1 3 10
3 1 4 4
4 1 5 2
....
and data2.loc[i] is one of the DataFrames you want.
EDIT:
It seems that the dictionary is described in train.label.
You can build the dictionary beforehand:
with open(r'train.label') as f:
    u = f.readlines()
v = [int(x) for x in u]  # len(v) = 11269 = data[0].max()
dictionary = dict(zip(range(1, len(v) + 1), v))
Since the full data set is easily loaded into memory, the following should be fairly quick:
data_split = {i: data[data[0] == i] for i in range(1, 21)}
# to access each dataframe, do a dictionary lookup, i.e.
data_split[2].head()
0 1 2
769 2 12 4
770 2 16 2
771 2 23 4
772 2 27 2
773 2 29 6
You may also want to reset the indices or copy the data frame when you're slicing it into smaller data frames; a short sketch follows the links below.
additional reading:
copy
reset_index
view-vs-copy
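As a concrete sketch of that advice (my own variant of the dictionary comprehension above, not part of the original answer):
# .copy() breaks the view/copy ambiguity, and reset_index gives each
# sub-frame a fresh 0..n-1 index.
data_split = {i: data[data[0] == i].copy().reset_index(drop=True)
              for i in range(1, 21)}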
If you want to build them efficiently, I think you need some good raw materials:
wood
cement
These are robust and durable.
Try to avoid using hay, as the dataframe can be blown away with a little wind.
Hope that helps
