I'm trying to use cumsum() to get the result I want in pandas, but I'm stuck.
           score1  score2
team slot
a    2          4       6
a    3          3       7
a    4          2       1
a    5          4       3
b    1          7       2
b    2          2      10
b    5          1       9
My original data looks like the above. I want cumulative sums of score1 and score2, grouped by team and slot. I used
df = df.groupby(by=['team', 'slot']).sum().groupby(level=[0]).cumsum()
The code above gives almost what I want, but each team needs exactly 5 slots, like the output below. How can I fix this?
As @Paul H commented, here is the code:
import io
import pandas as pd
text = """team slot score1 score2
a 2 4 6
a 3 3 7
a 4 2 1
a 5 4 3
b 1 7 2
b 2 2 10
b 5 1 9
"""
df = pd.read_csv(io.StringIO(text), delim_whitespace=True, index_col=[0, 1])
df2 = df.reindex(pd.MultiIndex.from_product([df.index.levels[0], range(1, 6)]))
df2.fillna(0).groupby(level=[0]).cumsum()
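For reference, with the sample data above, the reindexed and filled frame should cumsum to something like this (the zeros filled in for the missing slots push everything to floats):
        score1  score2
a 1        0.0     0.0
  2        4.0     6.0
  3        7.0    13.0
  4        9.0    14.0
  5       13.0    17.0
b 1        7.0     2.0
  2        9.0    12.0
  3        9.0    12.0
  4        9.0    12.0
  5       10.0    21.0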
I have a pandas DataFrame called "bag" with columns beans1, beans2, and beans3:
bag = pd.DataFrame({'beans1': [3,1,2,5,6,7], 'beans2': [2,2,1,1,5,6], 'beans3': [1,1,1,3,3,2]})
bag
Out[50]:
beans1 beans2 beans3
0 3 2 1
1 1 2 1
2 2 1 1
3 5 1 3
4 6 5 3
5 7 6 2
I want to use a loop to subset each column with observations greater than 1, so that I get:
beans1
0 3
2 2
3 5
4 6
5 7
beans2
0 2
1 2
4 5
5 6
beans3
3 3
4 3
5 2
The way to do it manually is:
beans1 = bag.loc[bag['beans1'] > 1, ['beans1']]
beans2 = bag.loc[bag['beans2'] > 1, ['beans2']]
beans3 = bag.loc[bag['beans3'] > 1, ['beans3']]
But I need to employ a loop, with something like:
for i in range(1, 4):
    beans + str(i) = bag.loc[bag['beans' + str(i)] > 1, ['beans' + str(i)]]
But it didn't work. I need a Python version of R's eval(parse(text="")).
Any help appreciated. Thanks much!
It is possible, but not recommended, with globals:
for i in range(1, 4):
    globals()['beans' + str(i)] = bag.loc[bag['beans' + str(i)] > 1, ['beans' + str(i)]]
Or, looping over the column names directly:
for c in bag.columns:
    globals()[c] = bag.loc[bag[c] > 1, [c]]
print (beans1)
beans1
0 3
2 2
3 5
4 6
5 7
Better is to create a dictionary:
d = {c: bag.loc[bag[c]>1, [c]] for c in bag}
print (d['beans1'])
beans1
0 3
2 2
3 5
4 6
5 7
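If you then need to loop over the subsets, the dictionary keeps everything in one place; a minimal usage sketch:
# Iterate the per-column subsets built above.
for name, subset in d.items():
    print(name)
    print(subset)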
I want to shift a specific column down by one (I don't know if another library can help me).
import pandas as pd
# pd.set_option('display.max_rows', 100)
fac = pd.read_excel('TEST.xlsm', sheet_name="DC - Consumables", header=None, skiprows=1)
df = pd.DataFrame(fac)
df1 = df.iloc[0:864, 20:39]
df2 = df.iloc[0:864, 40:59]
df1 = pd.concat([df1, df2])
print(df1)
I want one block of columns to be below the other. Right now they sit side by side:
A B C    A B C
1 2 3    6 7 8
4 5 8    4 1 9
My code prints this:
A B C
1 2 3
4 5 8
A B C
6 7 8
4 1 9
I need the second block (DataFrame) to be below the first, like this:
A B C
1 2 3
4 5 8
A B C
6 7 8
4 1 9
Please help me.
Try pd.concat().
df3 = pd.concat([df1, df2])
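Note that pd.concat aligns on column labels, so if the two blocks were sliced from different column ranges they will not stack but will sit side by side with NaNs. A sketch, assuming df2 should reuse df1's labels (both slices here have the same number of columns):
# Reuse df1's column labels so the blocks stack under the same columns,
# then renumber the rows.
df2.columns = df1.columns
df3 = pd.concat([df1, df2], ignore_index=True)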
I'm pretty new to Python.
I am trying to concat two DataFrames (df1 onto main_df) so that a row from df1 is added only if it does not already exist in main_df.
I don't want to use .concat().drop_duplicates(), because I don't want duplicates within the same DataFrame to be removed.
Backstory:
I have multiple CSV files that are exported from a piece of software in different locations. Once in a while I want to merge them into one file. The problem is that each export contains the same data as before, along with the new records made in that period of time, so I need to check whether each record is already there, since I will run the same code on every export.
For the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools.partial to make it explicit that we are comparing against main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[2,5,4,5],
[9,8,7,6],
[8,5,6,7]
])
df1 = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[1,5,4,8],
[7,3,5,7],
[4,3,8,5],
[4,3,8,5]
])
def has_row(df, row):
    # True when `row` matches every column of at least one row in `df`.
    return (df == row).all(axis=1).any()

main_df_has_row = partial(has_row, main_df)
duplicate_rows = df1.apply(main_df_has_row, axis=1)
df1_add = df1.loc[~duplicate_rows]
main_df = pd.concat([main_df, df1_add], ignore_index=True)
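For larger frames, a merge with indicator=True can replace the row-by-row apply; a sketch under the same assumptions (all columns together identify a row):
# Left-merge df1 against the de-duplicated main_df on all shared columns;
# '_merge' is 'both' for rows that already exist and 'left_only' for new ones.
merged = df1.merge(main_df.drop_duplicates(), how='left', indicator=True)
df1_new = df1[merged['_merge'].eq('left_only').to_numpy()]
main_df = pd.concat([main_df, df1_new], ignore_index=True)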
I am trying to select a subset of a DataFrame based on the columns of another DataFrame.
The DataFrames look like this:
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I want to get all rows of the first Dataframe for the columns which are included in both DataFrames. My result should look like this:
a b
0 0 1
1 4 5
2 8 9
3 12 13
You can use pd.Index.intersection (older pandas also accepted & on Index objects as syntactic sugar for this, but that has since been deprecated):
intersection_cols = df1.columns.intersection(df2.columns)
res = df1[intersection_cols]
import pandas as pd

data1 = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
data2 = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
df1 = pd.DataFrame(data=data1, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(data=data2, columns=['a', 'b'])
df1[df1.columns.intersection(df2.columns)]
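If df2 may contain columns that are absent from df1, DataFrame.filter silently ignores the missing labels instead of raising a KeyError:
# Keep only the columns of df1 whose labels also appear in df2.
res = df1.filter(items=df2.columns)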
I have a DataFrame with column names of the form x.y, and I would like to sum all columns that share the same x without naming them explicitly. That is, column_name.split(".")[0] should determine each column's group. Here's an example:
import pandas as pd
df = pd.DataFrame({'x.1': [1,2,3,4], 'x.2': [5,4,3,2], 'y.8': [19,2,1,3], 'y.92': [10,9,2,4]})
df
Out[3]:
x.1 x.2 y.8 y.92
0 1 5 19 10
1 2 4 2 9
2 3 3 1 2
3 4 2 3 4
The result should be the same as this operation, only I shouldn't have to list the column names and their grouping explicitly:
pd.DataFrame({'x': df[['x.1', 'x.2']].sum(axis=1), 'y': df[['y.8', 'y.92']].sum(axis=1)})
x y
0 6 29
1 6 11
2 6 3
3 6 7
Another option: extract the prefix from the column names and use it as the grouping key:
df.groupby(by=df.columns.str.split('.').str[0], axis=1).sum()
# x y
#0 6 29
#1 6 11
#2 6 3
#3 6 7
You can first create a MultiIndex by splitting the column names, and then group by the first level and aggregate with sum:
df.columns = df.columns.str.split('.', expand=True)
print (df)
   x      y
   1  2   8  92
0  1  5  19  10
1  2  4   2   9
2  3  3   1   2
3  4  2   3   4
df = df.groupby(axis=1, level=0).sum()
print (df)
x y
0 6 29
1 6 11
2 6 3
3 6 7
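A note for newer pandas: DataFrame.groupby(axis=1) is deprecated (pandas 2.1+). An equivalent that avoids it, starting from the original flat column names, is to transpose, group the rows by prefix, and transpose back:
# Group columns by their prefix without axis=1.
out = df.T.groupby(df.columns.str.split('.').str[0]).sum().T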