I have a dataframe A with 80 columns. I grouped it by three columns and summed 20 of the others, e.g.:
New_df = A.groupby(['X','Y','Z'])[['a','b','c',......]].sum().reset_index()
Then I want to overwrite the values of the columns in A that are also present in New_df with the corresponding values from New_df.
You can do:
cols1=set(A.columns.tolist())
cols2=set(New_df.columns.tolist())
common_cols = list(cols1.intersection(cols2))
A[common_cols]=New_df[common_cols]
This finds the columns that the two DataFrames have in common, then replaces those columns in the first with the columns from the second.
For example, given an initial A:
x y
0 1 a
1 2 b
2 3 c
and New_df:
z y
0 4 d
1 5 e
2 6 f
we wind up with the final A, with the y column taken from New_df:
x y
0 1 d
1 2 e
2 3 f
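Here is a minimal, self-contained sketch of that toy example (the frame names are just the ones above):

import pandas as pd

A = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
New_df = pd.DataFrame({'z': [4, 5, 6], 'y': ['d', 'e', 'f']})

# Columns present in both frames
common_cols = list(set(A.columns).intersection(New_df.columns))

# Overwrite the common columns in A with the values from New_df
A[common_cols] = New_df[common_cols]
print(A)
#    x  y
# 0  1  d
# 1  2  e
# 2  3  f

Note that the assignment aligns on the index, so if New_df has fewer rows than A (as it typically will after a groupby), the non-matching rows end up as NaN; in that case a merge on the key columns may be what you actually want.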
I have a dataset of 100 rows. I want to split it into chunks of 4 rows and then perform an operation on each chunk, i.e., first perform the operation on the first four rows, then on the next four rows, and so on.
Note: Rows are independent of each other.
I don't know how to do it. Can somebody please help me?
I will divide the DataFrame into chunks of 2 rows (a simple example) and make a list of DataFrames (dfs).
Example
import pandas as pd

df = pd.DataFrame(list('ABCDE'), columns=['value'])
df
value
0 A
1 B
2 C
3 D
4 E
Code
Build a grouper for grouping:
grouper = pd.Series(range(0, len(df))) // 2
grouper
0 0
1 0
2 1
3 1
4 2
dtype: int64
Divide into a list of DataFrames:
g = df.groupby(grouper)
dfs = [g.get_group(x) for x in g.groups]
Result (dfs):
[ value
0 A
1 B,
value
2 C
3 D,
value
4 E]
Check
dfs[0]
output:
value
0 A
1 B
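Applied to the original question (100 rows in chunks of 4), a minimal sketch could look like this; process is a hypothetical placeholder for whatever per-chunk operation you need:

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': range(100)})  # 100 independent rows

def process(chunk):
    # hypothetical per-chunk operation; replace with your own
    return chunk['value'].sum()

grouper = np.arange(len(df)) // 4   # 0,0,0,0,1,1,1,1,...
results = df.groupby(grouper).apply(process)
print(results.head())

Alternatively, np.array_split(df, len(df) // 4) would give you the list of 4-row DataFrames directly, mirroring the dfs list above.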
My data includes a few variables holding data from multi-answer questions. These are stored as strings (comma-separated) and aren't ordered by value.
I need to run different counts across 2 or more of these variables at the same time, i.e. get the frequencies of each combination of their unique values.
I also have a second dataframe with the available codes for each variable:
df_meta['a']['Categories'] = ['1', '2', '3','4']
df_meta['b']['Categories'] = ['1', '2']
If this is my data
df = pd.DataFrame(np.array([["1,3","1"],["3","1,2"],["1,3,2","1"],["3,1","2,1"]]),
columns=['a', 'b'])
index a b
1 1,3 1
2 3 1,2
3 1,3,2 1
4 3,1 2,1
Ideally, this is what the output would look like
a b count
1 1 3
1 2 1
2 1 1
2 2 0
3 1 4
3 2 2
4 1 0
4 2 0
Although, if it's not possible to get the zero counts, this would be just fine:
a b count
1 1 3
1 2 1
2 1 1
3 1 4
3 2 2
So far, I have the counts for each of these variables individually, using split and value_counts:
df["a"].str.split(',',expand=True).stack().value_counts()
3 4
1 3
2 1
df["b"].str.split(',',expand=True).stack().value_counts()
1 4
2 2
But I can't figure out how to group by them, because of the differences in the indexes.
df2 = pd.DataFrame()
df2["a"] = df["a"].str.split(',',expand=True).stack()
df2["b"] = df["b"].str.split(',',expand=True).stack()
df2.groupby(['a','b']).size()
a  b
1  1    3
3  1    1
   2    1
Is there a way to adjust the groupby to only count the instances of the first index, or another way to count the unique combinations more efficiently?
I can alternatively iterate through all codes using the df_meta dataframe, but some of the actual variables have 300-400 codes and it's very slow when I try to cross 2-3 of them; if it's possible to use groupby or another function, it should be much faster.
First, recreate the starting dataframe.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"],
                            ["3,1", "2,1"]]), columns=['a', 'b'])
Then split the columns into separate dataframes.
da = df["a"].str.split(',',expand=True)
db = df["b"].str.split(',',expand=True)
Loop through all rows of both dataframes, build temporary dataframes of all combinations, and add them to a list.
ab = list()
for r in range(len(da)):
    for i in da.iloc[r, :]:
        for j in db.iloc[r, :]:
            if i is not None and j is not None:
                daf = pd.DataFrame({'a': [i], 'b': [j]})
                ab.append(daf)
Concatenate the list of temporary dataframes into one new dataframe.
dfn = pd.concat(ab)
Grouping by the 'a' and 'b' columns and calling size() gives you the answer.
print(dfn.groupby(['a', 'b']).size().reset_index(name='count'))
a b count
0 1 1 3
1 1 2 1
2 2 1 1
3 3 1 4
4 3 2 2
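As an alternative sketch (not part of the answer above), if your pandas version has DataFrame.explode (0.25+), you can avoid the Python loops entirely: split both columns into lists, explode them, and count the combinations. Reindexing against the codes from df_meta also recovers the zero counts (the codes are hard-coded here as an assumption):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([["1,3", "1"], ["3", "1,2"], ["1,3,2", "1"],
                            ["3,1", "2,1"]]), columns=['a', 'b'])

# One row per (a, b) combination
exploded = (df.assign(a=df['a'].str.split(','), b=df['b'].str.split(','))
              .explode('a')
              .explode('b'))

counts = exploded.groupby(['a', 'b']).size()
print(counts.reset_index(name='count'))

# Optional: include the zero counts using the known category codes
# (in practice these would come from df_meta)
full_index = pd.MultiIndex.from_product([['1', '2', '3', '4'], ['1', '2']],
                                        names=['a', 'b'])
print(counts.reindex(full_index, fill_value=0).reset_index(name='count'))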
Let's say I have these two dataframes :
dfX = pd.DataFrame({'Points':["A","B","C","D"],'Group':[1,2,1,3]})
dfX
Points Group
0 A 1
1 B 2
2 C 1
3 D 3
dfY = pd.DataFrame({'Points':["A","B","C","D"],'Score':[2,3,4,5]})
dfY
Points Score
0 A 2
1 B 3
2 C 4
3 D 5
I would like to get the minimum Score of the Points sharing the Group of point C; in this case, I would like to get 2.
Because point C is in Group 1, and Group 1 is composed of points A and C, I would like the minimum Score between A and C, that is to say, 2.
How could I do that in Python?
Thanks
First you need to merge the dataframes on the key Points, then get the group of point C, and finally take the minimum of the scores within that group:
merged = pd.merge(dfX, dfY, on='Points')
group = merged.loc[merged.Points == 'C', 'Group']
val = merged.loc[merged.Group.isin(group), 'Score'].min()
print(val)
2
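An equivalent sketch of the same idea using groupby().transform, in case you prefer to keep the per-group minimum as a column (same data as above, just a different way of expressing the lookup):

import pandas as pd

dfX = pd.DataFrame({'Points': ["A", "B", "C", "D"], 'Group': [1, 2, 1, 3]})
dfY = pd.DataFrame({'Points': ["A", "B", "C", "D"], 'Score': [2, 3, 4, 5]})

merged = pd.merge(dfX, dfY, on='Points')

# Attach each row's group minimum, then look up point C
merged['group_min'] = merged.groupby('Group')['Score'].transform('min')
val = merged.loc[merged.Points == 'C', 'group_min'].iat[0]
print(val)  # 2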
I have a dataframe with rows that I'd like to shuffle continuously until the value in column B is not identical across any two consecutive rows:
initial dataframe:
A | B
_______
a 1
b 1
c 2
d 3
e 3
Possible outcome:
A | B
_______
b 1
c 2
e 3
a 1
d 3
I made a function, scramble, meant to do this, but I am having trouble passing the newly scrambled dataframe back into the function to test for matching B values:
def scramble(x):
    curr_B = 'nothing'
    for index, row in x.iterrows():
        next_B = row['B']
        if str(next_B) == str(curr_B):
            x = x.sample(frac=1)
            curr_B = next_B
        curr_B = next_B
    return x

df = scramble(df)
I suspect the function is finding the matching values in the next row, but I can't shuffle it continuously until there are no two sequential rows with the same B value.
Printing the output yields a dataframe that still shows consecutive rows with the same value in B.
If your goal is to eliminate consecutive duplicates, you can just use groupby and cumcount, then reindex your DataFrame:
df.loc[df.groupby('B').cumcount().sort_values().index]
A B
0 a 1
2 c 2
3 d 3
1 b 1
4 e 3
If you actually want randomness, then you can group on cumcount and call shuffle. This should eliminate consecutive dupes to some degree (NOT GUARANTEED) while preserving randomness and still avoiding slow iteration. Here's an example:
np.random.seed(0)
(df.groupby(df.groupby('B').cumcount(), group_keys=False)
.apply(lambda x: x.sample(frac=1))
.reset_index(drop=True))
A B
0 d 3
1 a 1
2 c 2
3 b 1
4 e 3
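If you do need the guarantee the question asks for, a brute-force sketch (my own assumption, not part of the answer above) is to keep reshuffling the whole frame until no two consecutive B values match; since a valid ordering may not exist, a retry cap is included:

import pandas as pd

df = pd.DataFrame({'A': list('abcde'), 'B': [1, 1, 2, 3, 3]})

def shuffle_no_consecutive(frame, col='B', max_tries=10_000):
    """Reshuffle until no two consecutive rows share a value in `col`."""
    for _ in range(max_tries):
        shuffled = frame.sample(frac=1).reset_index(drop=True)
        if not shuffled[col].eq(shuffled[col].shift()).any():
            return shuffled
    raise RuntimeError("no valid ordering found within max_tries")

print(shuffle_no_consecutive(df))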
I have a pandas dataframe from which I'm trying to drop rows based on a criterion across select columns. If the values in these select columns are all zero, the rows should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest] != 0], which overwrites your original df and sets the cells where the condition is not met to NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that there must be at least one non-NaN value in each remaining row; we can then use loc with the resulting index to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this more simply by using any, passing axis=1 to test the condition, and using the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
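Equivalently (just a sketch of the same logic stated the other way around), you can drop the rows where every column of interest is zero:

import pandas as pd

t = pd.DataFrame({'a': [1, 0, 0, 2], 'b': [1, 2, 0, 0], 'c': [1, 2, 3, 4]})
cols_of_interest = ['a', 'b']

# Keep a row unless all of the columns of interest are zero
result = t[~(t[cols_of_interest] == 0).all(axis=1)]
print(result)
#    a  b  c
# 0  1  1  1
# 1  0  2  2
# 3  2  0  4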