I am trying to do some basic dimensional reduction. I have a CSV file that looks something like this:
A B C A B B A C
1 1 2 2 1 3 1 1
1 2 3 0 0 1 1 2
0 2 1 3 0 1 2 2
I want to import it as a pandas DataFrame without the headers being renamed to A.1, A.2, etc. Instead I want to sum the duplicates and keep the column names. Ideally my new DF should look like this:
A B C
4 5 3
2 3 5
5 3 3
Is it possible to do this easily or would you recommend a different way? I can also use bash, R, or anything that can do the trick with a file that is 1 million lines and 1000 columns.
Thank you!
Try splitting the column names on '.' and grouping by the first part:
df.groupby(df.columns.str.split('.').str[0], axis=1).sum()
Output:
A B C
0 4 5 3
1 2 3 5
2 5 3 3
Just load the dataframe normally, group by the first letter of each column name, and sum the values:
df.groupby(lambda colname: colname[0], axis=1).sum()
which gives
A B C
0 4 5 3
1 2 3 5
2 5 3 3
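Putting both answers together, a minimal end-to-end sketch ('data.csv' is a placeholder filename, assuming the whitespace-delimited sample above). read_csv mangles the duplicate headers on import, which is exactly what the split-on-'.' trick undoes; transposing sidesteps the axis=1 groupby, which newer pandas (2.1+) deprecates:
import pandas as pd

# 'data.csv' is a placeholder name; the sample file is whitespace-delimited.
df = pd.read_csv('data.csv', sep=r'\s+')

# read_csv mangles the duplicate headers to A.1, A.2, B.1, ...; strip the
# suffix, then group the transposed rows by base name and transpose back.
base = df.columns.str.split('.').str[0]
out = df.T.groupby(base).sum().T
print(out)
#    A  B  C
# 0  4  5  3
# 1  2  3  5
# 2  5  3  3
Note that transposing a 1-million-row, 1000-column frame is not free; on older pandas the axis=1 groupby from the answers above avoids the copy.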
I would like to achieve the result below in Python using Pandas.
I tried groupby and sum on the id and Group columns using the code below:
df.groupby(['id','Group'])['Total'].sum()
I got the first two columns, but I'm not sure how to get the third column (Overall_Total).
How can I do it?
Initial data (before grouping)
id  Group  Time
1   a      2
1   a      2
1   a      1
1   b      1
1   b      1
1   c      1
2   e      2
2   a      4
2   e      1
2   a      5
3   c      1
3   e      4
3   a      3
3   e      4
3   a      2
3   h      4
Assuming df is your initial dataframe, please try this:
df_group = df.groupby(['id', 'Group'])[['Time']].sum().rename(columns={'Time': 'Total'})
df_group['Overall_Total'] = df_group.groupby(['id'])['Total'].transform('sum')
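For reference, a minimal reproduction against the sample table above (the frame is rebuilt inline):
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
    'Group': ['a', 'a', 'a', 'b', 'b', 'c', 'e', 'a', 'e', 'a',
              'c', 'e', 'a', 'e', 'a', 'h'],
    'Time':  [2, 2, 1, 1, 1, 1, 2, 4, 1, 5, 1, 4, 3, 4, 2, 4],
})

# Sum Time per (id, Group), then broadcast the per-id sum onto every group.
df_group = df.groupby(['id', 'Group'])[['Time']].sum().rename(columns={'Time': 'Total'})
df_group['Overall_Total'] = df_group.groupby(['id'])['Total'].transform('sum')
print(df_group)
#           Total  Overall_Total
# id Group
# 1  a          5              8
#    b          2              8
#    c          1              8
# 2  a          9             12
#    e          3             12
# 3  a          5             18
#    c          1             18
#    e          8             18
#    h          4             18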
How would you use the values in one column of a pandas DataFrame to iteratively select other columns that have the same names as those values? If one column contains letters and other columns are titled with those letters, how could you use each row's letter to pick the value from the matching column?
Take this example data set
letter a b c d
a 1 3 4 2
d 4 3 2 1
c 2 1 4 3
d 3 4 2 1
desired results
letter a b c d correct answer
a 1 3 4 2 1
d 4 3 2 1 1
c 2 1 4 3 4
d 3 4 2 1 1
How do you create the correct_answer variable seen in the desired results?
Assuming that you have set letter as your index, I suppose you could do the following:
df['correct_answer'] = df.apply(lambda x: x.loc[x.name], axis=1)
Yields:
a b c d correct_answer
letter
a 1 3 4 2 1
d 4 3 2 1 1
c 2 1 4 3 4
d 3 4 2 1 1
Here is something that might help: pandas DataFrame diagonal
This is basically looking up an identity diagonal, i.e. the cells where the row label equals the column name. That should be enough to get you started.
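If apply over a large frame is too slow, here is a vectorized sketch using positional NumPy indexing (same assumption as above: letter is the index):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [1, 4, 2, 3], 'b': [3, 3, 1, 4],
     'c': [4, 2, 4, 2], 'd': [2, 1, 3, 1]},
    index=pd.Index(['a', 'd', 'c', 'd'], name='letter'))

# For each row, find the position of the column named after the row's
# index label, then pick that cell with fancy indexing.
cols = df.columns.get_indexer(df.index)
df['correct_answer'] = df.to_numpy()[np.arange(len(df)), cols]
print(df)
#         a  b  c  d  correct_answer
# letter
# a       1  3  4  2               1
# d       4  3  2  1               1
# c       2  1  4  3               4
# d       3  4  2  1               1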
I want to apply an operation to multiple groups of a data frame and then fill all values of each group with the result. Let's take mean and np.cumsum as an example, with the following dataframe:
df=pd.DataFrame({"a":[1,3,2,4],"b":[1,1,2,2]})
which looks like this
a b
0 1 1
1 3 1
2 2 2
3 4 2
Now I want to group the dataframe by b, take the mean of a in each group, apply np.cumsum to the means, and then replace all values of a with the (group-dependent) result.
For the first three steps, I would start like this
df.groupby("b").mean().apply(np.cumsum)
which gives
a
b
1 2
2 5
But what I want to get is
a b
0 2 1
1 2 1
2 5 2
3 5 2
Any ideas how this can be solved in a nice way?
You can use map with a Series:
df1 = df.groupby("b").mean().cumsum()
print (df1)
a
b
1 2
2 5
df['a'] = df['b'].map(df1['a'])
print (df)
a b
0 2 1
1 2 1
2 5 2
3 5 2
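The three steps also collapse into a single line, since map accepts the aggregated Series directly:
# per-group mean of a, cumulative sum over the groups, mapped back via b
df['a'] = df['b'].map(df.groupby('b')['a'].mean().cumsum())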
I have a dataframe with the following structure:
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 4 3
The index and column C are both set to the same value. This is because I have created a dataframe that uses dates as its index to cover every day in the year, and I have a large collection of data whose dates are deposited in column C. In practice I can deposit as much data as possible, and this can cover the majority of the year, but there will be some days with no data, and my dataframe is structured this way to account for that.
What I wish to do is enable support for multiple readings on one day. Currently my program selects which row to put data into by matching the raw data's date with the date in the index column, so if I had the following:
A B C
2 3 2
The row would be selected by the value in column C and inserted into the data frame like so:
A B C
0 1 1 0
1 2 2 1
2 2 3 2
3 4 4 3
How would I handle the case where I have two sets of readings on one day, whilst keeping the indexing the same and inserting the data based on the column C value?
Like so:
A B C
4 3 1
2 4 1
And I want to be able to have the following:
A B C
0 1 1 0
1 4 3 1
1 2 4 1
2 2 3 2
3 4 4 3
I wish to keep the indexing the same so that the structure of the dataframe still covers all days of the year, and on days where there are multiple readings the data can be inserted whilst keeping the index value the same.
This should do it for you:
Setup:
import io
import pandas as pd

# Original frame: one row per day, indexed by the day value in column C.
a = io.StringIO(u'''
A B C
1 1 0
2 2 1
3 3 2
4 4 3
''')
df = pd.read_csv(a, sep=r'\s+')

# New readings: two rows for day 1.
b = io.StringIO(u'''
A B C
4 3 1
2 4 1
''')
dfX = pd.read_csv(b, sep=r'\s+')
Processing:
# Drop the existing rows for any day that appears in the new readings,
# append the new readings, and rebuild the day-based index from column C.
df = df.loc[~df['C'].isin(dfX['C'])]
df = pd.concat([df, dfX]).sort_values(by='C')  # DataFrame.append was removed in pandas 2.0
df.index = df['C'].values
Output:
A B C
0 1 1 0
1 4 3 1
1 2 4 1
2 3 3 2
3 4 4 3
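One consequence of the duplicate index, if you adapt this: selecting by label now returns every reading for that day at once.
print(df.loc[1])
#    A  B  C
# 1  4  3  1
# 1  2  4  1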
How can I drop exact duplicate rows? Say I have a data frame that looks like this:
A B C
1 2 3
3 2 2
1 2 3
Now, my data frame is a lot larger than this, but is there a way to have Python look at every row and, if the values in a row are exactly the same as those in another row, drop that row? I want to take the whole data frame into account; I don't want to specify the columns to get unique values for.
You can use the DataFrame.drop_duplicates() method:
In [23]: df
Out[23]:
A B C
0 1 2 3
1 3 2 2
2 1 2 3
In [24]: df.drop_duplicates()
Out[24]:
A B C
0 1 2 3
1 3 2 2
You can get a de-duplicated dataframe with the inverse of .duplicated:
df[~df.duplicated(['A','B','C'])]
Returns:
>>> df[~df.duplicated(['A','B','C'])]
A B C
0 1 2 3
1 3 2 2
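Note that drop_duplicates compares all columns by default, so listing ['A', 'B', 'C'] is only needed when you want a subset. Two optional parameters are worth knowing; a small sketch:
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 1], 'B': [2, 2, 2], 'C': [3, 2, 3]})

# keep='last' retains the last occurrence of each duplicated row instead
# of the first; ignore_index=True renumbers the result from 0.
print(df.drop_duplicates(keep='last', ignore_index=True))
#    A  B  C
# 0  3  2  2
# 1  1  2  3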