I have two data frames whose shapes are (707, 140) and (34, 98).
I want to reduce the bigger data frame to match the smaller one, based on shared index names and column names.
So after removing the extra rows and columns from the bigger data frame, its final shape should be (34, 98), with the same index and columns as the small data frame.
How can I do this in Python?
I think you can select with loc, using the index and columns of the small DataFrame:
dfbig.loc[dfsmall.index, dfsmall.columns]
Sample:
import pandas as pd

dfbig = pd.DataFrame({'a':[1,2,3,4,5], 'b':[4,7,8,9,4], 'c':[5,0,1,2,4]})
print (dfbig)
a b c
0 1 4 5
1 2 7 0
2 3 8 1
3 4 9 2
4 5 4 4
dfsmall = pd.DataFrame({'a':[4,8], 'c':[0,1]})
print (dfsmall)
a c
0 4 0
1 8 1
print (dfbig.loc[dfsmall.index, dfsmall.columns])
a c
0 1 5
1 2 0
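If some of the small frame's labels might be absent from the big one, loc would raise a KeyError. A hedged alternative is reindex, which fills missing rows and columns with NaN instead (a sketch, assuming you want the small frame's exact shape either way):

out = dfbig.reindex(index=dfsmall.index, columns=dfsmall.columns)
# rows/columns of dfsmall that are missing from dfbig come back as NaN
# rather than raising a KeyError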
I am working with a pandas DataFrame that has the following two columns: "personID" and "points". I would like to create a third column ("localMin") that stores, for each personID, the running minimum of "points", i.e. the minimum of that person's "points" over all rows up to and including the current one.
Does anyone have an idea how to achieve this most efficiently? I have approached this problem using shift() with different period sizes, but of course, shift is sensitive to variations in the sequence and doesn't always produce the output I would expect.
Thank you in advance!
Use groupby.cummin:
df['localMin'] = df.groupby('personID')['points'].cummin()
Example:
import pandas as pd

df = pd.DataFrame({'personID': list('AAAAAABBBBBB'),
                   'points': [3,4,2,6,1,2,4,3,1,2,6,1]})
df['localMin'] = df.groupby('personID')['points'].cummin()
Output:
personID points localMin
0 A 3 3
1 A 4 3
2 A 2 2
3 A 6 2
4 A 1 1
5 A 2 1
6 B 4 4
7 B 3 3
8 B 1 1
9 B 2 1
10 B 6 1
11 B 1 1
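For reference, the same running minimum can be spelled with an expanding window; this makes the "all values up to the current row" logic explicit, though it is usually slower than cummin and returns floats:

df['localMin'] = df.groupby('personID')['points'].expanding().min().droplevel(0)
# droplevel(0) removes the personID level so the result aligns
# back on the original row index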
If I have the following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row holds my threshold value for each column. For each column, I want to count how many of its values are lower than that column's threshold, using Python pandas.
Desired output:
A B C D E
Count 2 2 3 3 4
But I need a general solution, not one written for these specific columns: I have a large dataset, so I cannot name every column in the code.
Could you please help me with this?
Select all rows except the last by positional indexing, compare them against the last row with DataFrame.lt, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame and a transpose via DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print (df)
A B C D E
count 2 2 3 3 4
NumPy alternative with the DataFrame constructor:

import numpy as np

arr = df.to_numpy()
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print (df)
A B C D E
count 2 2 3 3 4
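If the threshold row has a known label ('T' in the sample), an equivalent spelling selects it explicitly; a sketch, assuming df is the original frame that still contains the 'T' row:

thresh = df.loc['T']                       # threshold row, selected by label
out = df.drop(index='T').lt(thresh).sum().rename('Count').to_frame().T
# drop the threshold row, compare each remaining row against it,
# count True values per column, and shape the result as one row named 'Count'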
I have a dataframe extracted from an Excel file, which I have manipulated into the following form (there are multiple rows, but this is reduced to make my question as clear as possible):
   A  B  C  A  B  C
0  1  2  3  4  5  6
As you can see there are repetitions of the column names. I would like to merge this dataframe to look like the following:
   A  B  C
0  1  2  3
1  4  5  6
I have tried to use the melt function but have not had any success thus far.
import pandas as pd
df = pd.DataFrame([[1,2,3,4,5,6]], columns=['A', 'B', 'C', 'A', 'B', 'C'])
df
A B C A B C
0 1 2 3 4 5 6
pd.concat(x for _, x in df.groupby(df.columns.duplicated(), axis=1))
A B C
0 1 2 3
0 4 5 6
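Note that grouping along axis=1 is deprecated in recent pandas. A sketch of one alternative, which also generalises to names repeated more than twice (assuming every column name appears the same number of times): number each occurrence of a column name, group the transposed frame by that occurrence number, and stack the blocks:

occ = pd.Series(df.columns).groupby(df.columns).cumcount().to_numpy()  # [0, 0, 0, 1, 1, 1]
out = pd.concat([block.T for _, block in df.T.groupby(occ)], ignore_index=True)
# ignore_index=True renumbers the rows 0, 1, ... as in the desired output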
I have a dataframe with the following structure:
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
The index and column C are both set to the same values. This is because I have created a DataFrame whose index is dates covering every day of the year, while the dates of my raw data are deposited in column C. In practice I can deposit as much data as possible, and this can cover the majority of the year, but there will be some days with no data; my DataFrame is structured this way to account for that.
What I wish to do is enable support for multiple readings on one day. Currently my program selects which row to put data into by matching the raw data's date with the date in the index, so if I had the following:
A B C
2 3 2
The row would be selected by the value in column C and inserted into the data frame like so:
A B C
0 1 1 0
1 2 2 1
2 2 3 2
3 4 4 3
How would I handle the case where I have two sets of readings on one day, whilst keeping the indexing the same and inserting the data based on the column C value?
Like so:
A B C
4 3 1
2 4 1
And I want to be able to have the following:
A B C
0 1 1 0
1 4 3 1
1 2 4 1
2 2 3 2
3 4 4 3
I wish to keep the indexing the same so that the structure of the DataFrame still covers all days of the year, and on days where there are multiple readings the data can be inserted whilst keeping the index value the same.
This should do it for you:
Setup:
import pandas as pd
import io
a = io.StringIO(u'''
A B C
1 1 0
2 2 1
3 3 2
4 4 3
''')
df = pd.read_csv(a, sep=r'\s+')
b = io.StringIO(u'''
A B C
4 3 1
2 4 1
''')
dfX = pd.read_csv(b, sep=r'\s+')
Processing:
df = df.loc[~df['C'].isin(dfX['C'])]           # drop rows whose day is being replaced
df = pd.concat([df, dfX]).sort_values(by='C')  # add the new readings (DataFrame.append was removed in pandas 2.0)
df.index = df['C'].values                      # re-key the index on the column C values
Output:
A B C
0 1 1 0
1 4 3 1
1 2 4 1
2 3 3 2
3 4 4 3
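With the duplicate index in place, all readings for a given day can still be retrieved together:

print(df.loc[1])
#    A  B  C
# 1  4  3  1
# 1  2  4  1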
Given the following code:
import numpy as np
import pandas as pd
arr = np.array([
[1,2,9,1,1,1],
[2,3,3,1,0,1],
[1,4,2,1,2,1],
[2,3,1,1,2,1],
[1,2,3,1,8,1],
[2,2,5,1,1,1],
[1,3,8,7,4,1],
[2,4,7,8,3,3]
])
# 1,2,3,4,5,6 <- Number of the columns.
df = pd.DataFrame(arr)
for col in df.columns:
    # count occurrences of each distinct value in the column
    print({x: list(df[col]).count(x) for x in set(df[col])})
I want to delete from the dataframe all the columns in which one value occurs more often than all the other values of the column together. In this case I would like to drop the columns 4 and 6 (see comment) since the number 1 occurs more often than all the other numbers in these columns together (6 > 2 in column 4 and 7 > 1 in column 6). I don't want to drop the first column (4 = 4). How would I do that?
Another option is to do a value counts on each column and select it only if the maximum count is smaller than or equal to half the number of rows in the data frame:
df.loc[:, df.apply(lambda col: max(col.value_counts()) <= df.shape[0]/2)]
# 0 1 2 4
#0 1 2 9 1
#1 2 3 3 0
#2 1 4 2 2
#3 2 3 1 2
#4 1 2 3 8
#5 2 2 5 1
#6 1 3 8 4
#7 2 4 7 3
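An equivalent, fully vectorised spelling avoids the per-column apply by counting how often each column's most frequent value occurs (the first row of df.mode() holds a modal value per column):

modal_count = df.eq(df.mode().iloc[0]).sum()   # occurrences of each column's most common value
df = df.loc[:, modal_count <= len(df) / 2]     # keep columns where that value is not a strict majority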