I've succeeded in splitting a DataFrame into several smaller DataFrames. I now want to give these DataFrames sequential names so that each one can be accessed independently.
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 3)
for part in result:
    print(part, '\n')
movie_id 1 2 5 borda rank IRAM
2 3 4 0 0 4 3 2
1 2 3 0 3 6 2 1
movie_id 1 2 5 borda rank IRAM
4 5 3 0 0 3 4 3
0 1 5 4 4 13 1 4
movie_id 1 2 5 borda rank IRAM
3 4 3 0 0 3 4 3
I want to name these separate DataFrames in sequential order, using a loop (or any other suitable method).
For instance :
df_1
movie_id 1 2 5 borda rank IRAM
2 3 4 0 0 4 3 2
1 2 3 0 3 6 2 1
df_2
movie_id 1 2 5 borda rank IRAM
4 5 3 0 0 3 4 3
0 1 5 4 4 13 1 4
df_3
movie_id 1 2 5 borda rank IRAM
3 4 3 0 0 3 4 3
I've been searching for a solution for a while, but I can't find an ideal answer to my problem.
This can be done by taking a dictionary and adding all dataframes into it:
df = pd.DataFrame({'Col1': np.random.randint(10, size=10)})
shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 3)
d = {}
for i, part in enumerate(result):
    d['df_'+str(i)] = part  # use str(i+1) to start the numbering from 1
print(d['df_0'])
Col1
7 7
6 0
4 5
2 3
print(d['df_1'])
Col1
0 0
8 1
1 5
print(d['df_2'])
Col1
5 2
3 2
9 4
df_dict = {}
for index, splited in enumerate(result):
    df_name = "df_{}".format(index)
    # optional: tag the DataFrame itself with its name
    splited.name = df_name
    # store the DataFrame in the dictionary under that name
    df_dict[df_name] = splited
print(df_dict)
{'df_0': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
9 10 3 2 0 0 0 4 0 0 0 0 0 9
7 8 1 0 0 0 4 5 0 0 0 4 0 14
6 7 4 0 0 0 2 5 3 4 4 0 0 22
0 1 5 4 0 4 4 0 0 0 4 0 0 21,
'df_1': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
8 9 5 0 0 0 4 5 0 0 4 5 0 23
3 4 3 0 0 0 0 5 0 0 4 0 5 17
5 6 5 0 0 0 0 0 0 5 0 0 0 10,
'df_2': movie_id 1 2 4 5 6 7 8 9 10 11 12 borda
4 5 3 0 0 0 0 0 0 0 0 0 0 3
2 3 4 0 0 0 0 0 0 0 0 0 0 4
1 2 3 0 0 3 0 0 0 0 0 0 0 6}
Then you can access any of the split DataFrames with df_dict[df_name].
You can use a dictionary, like this:
d = {"df_" + str(i): part for i, part in enumerate(result)}
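For completeness, here is a minimal self-contained sketch of the dictionary approach with names starting at df_1. The sample data, the `random_state` seed and the name `frames` are made up for illustration. It splits row positions rather than the DataFrame itself, because passing a DataFrame straight to np.array_split can emit deprecation warnings on some recent pandas versions (that is my understanding, so treat it as an assumption).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({"movie_id": [1, 2, 3, 4, 5], "borda": [13, 6, 4, 3, 3]})

# Shuffle the rows; the seed is only there to make the run repeatable.
shuffled = df.sample(frac=1, random_state=0)

# Split the row positions into 3 nearly equal chunks, then slice with iloc.
position_chunks = np.array_split(np.arange(len(shuffled)), 3)

# Build the dictionary with 1-based names: df_1, df_2, df_3.
frames = {
    f"df_{i}": shuffled.iloc[positions]
    for i, positions in enumerate(position_chunks, start=1)
}

for name, part in frames.items():
    print(name)
    print(part, "\n")
```

Each piece is then available as frames["df_1"], frames["df_2"] and so on, which is usually easier to work with than creating separate variables dynamically.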
I have a dataframe like this:
vehicle_id trip
0 0 0
1 0 0
2 0 0
3 0 1
4 0 1
5 1 0
6 1 0
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 2
13 2 0
14 2 1
15 2 2
I want to add a column that counts the frequency of each trip value for each 'vehicle id' group and drop the rows where the frequency is equal to 'one'. So after adding the column the frequency will be like this:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
12 1 2 1
13 2 0 1
14 2 1 1
15 2 2 1
and the final result will be like this
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
What is the best solution for that? Also, what should I do if I intend to directly drop rows where the frequency is equal to 1 in each group (without adding the frequency column)?
Check the Colab notebook here:
https://colab.research.google.com/drive/1AuBTuW7vWj1FbJzhPuE-QoLncoF5W_7W?usp=sharing
You can use df.groupby():
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
But of course you need to create the frequency column beforehand, so that transform("count") has a column to count:
df["frequency"] = 0
Taking your dataframe as an example, this gives:
import pandas as pd
# named "data" rather than "dict" to avoid shadowing the built-in
data = {"vehicle_id": [0,0,0,0,0,1,1,1,1,1,1,1],
        "trip": [0,0,0,1,1,0,0,1,1,1,1,1]}
df = pd.DataFrame.from_dict(data)
df["frequency"] = 0
df["frequency"] = df.groupby(["vehicle_id","trip"]).transform("count")
Try:
df["frequency"] = (
    df.assign(frequency=0).groupby(["vehicle_id", "trip"]).transform("count")
)
print(df[df.frequency > 1])
Prints:
vehicle_id trip frequency
0 0 0 3
1 0 0 3
2 0 0 3
3 0 1 2
4 0 1 2
5 1 0 2
6 1 0 2
7 1 1 5
8 1 1 5
9 1 1 5
10 1 1 5
11 1 1 5
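The second part of the question (dropping the rows without adding a frequency column) can be sketched as below. This is one possible approach, not the only one: it computes the group sizes as a separate Series with transform("size") and uses that as a boolean filter. The names `group_size`, `filtered` and `with_frequency` are mine.

```python
import pandas as pd

# The full example data from the question (16 rows).
df = pd.DataFrame({
    "vehicle_id": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "trip":       [0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 0, 1, 2],
})

# Size of each (vehicle_id, trip) group, aligned to the original rows.
group_size = df.groupby(["vehicle_id", "trip"])["trip"].transform("size")

# Keep only the rows whose group has more than one member;
# df itself never gets a frequency column.
filtered = df[group_size > 1]
print(filtered)

# If the column is wanted after all, it can be attached in one step.
with_frequency = df.assign(frequency=group_size)
```

Because transform returns a result aligned to the original index, there is no need to create a placeholder column first.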
I have a pandas dataframe as follows:
df2
amount 1 2 3 4
0 5 1 1 1 1
1 7 0 1 1 1
2 9 0 0 0 1
3 8 0 0 1 0
4 2 0 0 0 1
What I want to do is replace the 1s on every row with the value of the amount field in that row and leave the zeros as is. The output should look like this
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2
I've tried applying a lambda function row-wise like this, but I'm running into errors
df2.apply(lambda x: x.loc[i].replace(0, x['amount']) for i in len(x), axis=1)
Any help would be much appreciated. Thanks
Let's use mask:
df2.mask(df2 == 1, df2['amount'], axis=0)
Output:
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2
You can also do it with the pandas.DataFrame.mul() method, like this:
>>> df2.iloc[:, 1:] = df2.iloc[:, 1:].mul(df2['amount'], axis=0)
>>> print(df2)
amount 1 2 3 4
0 5 5 5 5 5
1 7 0 7 7 7
2 9 0 0 0 9
3 8 0 0 8 0
4 2 0 0 0 2
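If you prefer to stay close to the row-wise attempt in the question, a working version could look like the sketch below. The lambda receives one row at a time when axis=1 is used, so no inner loop over the row is needed. The `flag_cols` list is an assumption about which columns hold the 0/1 flags. The vectorized mask approach is shown next to it for comparison and will generally be faster on large frames.

```python
import pandas as pd

# The example frame from the question.
df2 = pd.DataFrame({
    "amount": [5, 7, 9, 8, 2],
    1: [1, 0, 0, 0, 0],
    2: [1, 1, 0, 0, 0],
    3: [1, 1, 0, 1, 0],
    4: [1, 1, 1, 0, 1],
})

# Assumed: these are the columns that hold the 0/1 flags.
flag_cols = [1, 2, 3, 4]

# Row-wise: replace every 1 in the flag columns with that row's amount.
row_wise = df2.copy()
row_wise[flag_cols] = df2.apply(
    lambda row: row[flag_cols].replace(1, row["amount"]), axis=1
)

# Vectorized equivalent using mask, aligned along the row index.
vectorized = df2.mask(df2 == 1, df2["amount"], axis=0)

print(row_wise)
```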
Currently, I have a dataframe like this:
0 0 0 3 0 0
0 7 8 9 1 0
0 4 5 2 4 0
My code to stack it:
dt = dataset.iloc[:,0:7].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
dt['variable'] = pandas.Categorical(dt.index).codes+1
dt.rename(columns={0:index_column_name}, inplace=True)
dt.set_index(index_column_name, inplace=True)
dt['variable'] = numpy.sort(dt['variable'])
However, it drops the first row when I'm stacking it, and I want to keep the headers / first row, how would I achieve this?
In essence, I'm losing the data from the first row (a.k.a column headers) and I want to keep it.
Desired Output:
value,variable
0 1
0 1
0 1
0 2
7 2
4 2
0 3
8 3
5 3
3 4
9 4
2 4
0 5
1 5
4 5
0 6
0 6
0 6
Current output:
value,variable
0 1
0 1
7 2
4 2
8 3
5 3
9 4
2 4
1 5
4 5
0 6
0 6
Why not use df.melt, as @WeNYoBen mentioned?
print(df)
1 2 3 4 5 6
0 0 0 0 3 0 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
print(df.melt())
variable value
0 1 0
1 1 0
2 1 0
3 2 0
4 2 7
5 2 4
6 3 0
7 3 8
8 3 5
9 4 3
10 4 9
11 4 2
12 5 0
13 5 1
14 5 4
15 6 0
16 6 0
17 6 0
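As for the missing first row: a likely cause (this is an assumption, since the question does not show how the data was loaded) is that the first line of the file was read as column headers. A minimal sketch, assuming the data comes from a header-less CSV, reads it with header=None so that every line is kept as data, then melts it. The inline `raw` string and the name `long_df` stand in for the real file and are made up.

```python
import io
import pandas as pd

# Stand-in for the real file: three data rows, no header line.
raw = "0,0,0,3,0,0\n0,7,8,9,1,0\n0,4,5,2,4,0\n"

# header=None keeps the first line as data instead of using it as headers.
df = pd.read_csv(io.StringIO(raw), header=None)

# Number the columns 1..6 so they can serve as the "variable" values.
df.columns = range(1, df.shape[1] + 1)

# Melt to long format and put the columns in the desired order.
long_df = df.melt(var_name="variable", value_name="value")[["value", "variable"]]
print(long_df)
```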
I have a dataframe df with the following data.
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
2 4 6 4
3 1 4 1
3 2 0 4
4 1 2 6
5 1 2 4
5 2 8 3
grp = df.groupby('A')
Next, I want all groups of df (grouped on column A) to have the same number of rows, either by truncating extra rows or by padding with rows of zeros. For the data above, I want every group to have 3 rows. I need the following result.
A B C D
1 1 3 1
1 2 9 8
1 3 3 9
2 1 2 9
2 2 1 4
2 3 9 5
3 1 4 1
3 2 0 4
3 0 0 0
4 1 2 6
4 0 0 0
4 0 0 0
5 1 2 4
5 2 8 3
5 0 0 0
Similarly, I may want to groupby on multiple columns, like
grp = df.groupby(['A','B'])
Use GroupBy.cumcount for counter column with DataFrame.reindex by MultiIndex.from_product:
df['g'] = df.groupby('A').cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(), range(3)], names=('A','g'))
df = (df.set_index(['A','g'])
        .reindex(mux, fill_value=0)
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
Another solution with DataFrame.merge with left join with helper DataFrame:
from itertools import product
df['g'] = df.groupby('A').cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(), range(3))), columns=['A','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)
print (df)
A B C D
0 1 1 3 1
1 1 2 9 8
2 1 3 3 9
3 2 1 2 9
4 2 2 1 4
5 2 3 9 5
6 3 1 4 1
7 3 2 0 4
8 3 0 0 0
9 4 1 2 6
10 4 0 0 0
11 4 0 0 0
12 5 1 2 4
13 5 2 8 3
14 5 0 0 0
EDIT:
df['g'] = df.groupby(['A','B']).cumcount()
mux = pd.MultiIndex.from_product([df['A'].unique(),
                                  df['B'].unique(),
                                  range(3)], names=('A','B','g'))
df = (df.set_index(['A','B','g'])
        .reindex(mux, fill_value=0)
        .reset_index(level=2, drop=True)
        .reset_index())
print (df.head(10))
A B C D
0 1 1 3 1
1 1 1 0 0
2 1 1 0 0
3 1 2 9 8
4 1 2 0 0
5 1 2 0 0
6 1 3 3 9
7 1 3 0 0
8 1 3 0 0
9 1 4 0 0
from itertools import product
df['g'] = df.groupby(['A','B']).cumcount()
df1 = pd.DataFrame(list(product(df['A'].unique(),
                                df['B'].unique(),
                                range(3))), columns=['A','B','g'])
df = df1.merge(df, how='left').fillna(0).astype(int).drop('g', axis=1)
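If you need this in several places, the single-key cumcount and reindex approach can be wrapped in a small helper. The sketch below is only that, a sketch: the function name `pad_or_truncate_groups`, its parameters and the temporary `_pos` column are made up for illustration. Rows beyond n_rows are dropped because their position is not part of the target index, and missing positions are filled with fill_value.

```python
import pandas as pd


def pad_or_truncate_groups(df, key, n_rows, fill_value=0):
    """Return a copy of df in which every group of `key` has exactly n_rows rows."""
    out = df.copy()
    # Position of each row within its group: 0, 1, 2, ...
    out["_pos"] = out.groupby(key).cumcount()
    # Every (group, position) pair that should exist in the result.
    full_index = pd.MultiIndex.from_product(
        [out[key].unique(), range(n_rows)], names=[key, "_pos"]
    )
    return (
        out.set_index([key, "_pos"])
        .reindex(full_index, fill_value=fill_value)  # pads and truncates
        .reset_index(level="_pos", drop=True)
        .reset_index()
    )


# The example data from the question.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5],
    "B": [1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 1, 2],
    "C": [3, 9, 3, 2, 1, 9, 6, 4, 0, 2, 2, 8],
    "D": [1, 8, 9, 9, 4, 5, 4, 1, 4, 6, 4, 3],
})

fixed = pad_or_truncate_groups(df, "A", 3)
print(fixed)
```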
I have
{"A":[0,1], "B":[4,5], "C":[0,1], "D":[0,1]}
What I want is:
A B C D
0 4 0 0
0 4 0 1
0 4 1 0
0 4 1 1
1 4 0 1
...and so on. Basically all the combinations of values for each of the categories.
What would be the best way to achieve this?
If x is your dict:
>>> pandas.DataFrame(list(itertools.product(*x.values())), columns=x.keys())
A C B D
0 0 0 4 0
1 0 0 4 1
2 0 0 5 0
3 0 0 5 1
4 0 1 4 0
5 0 1 4 1
6 0 1 5 0
7 0 1 5 1
8 1 0 4 0
9 1 0 4 1
10 1 0 5 0
11 1 0 5 1
12 1 1 4 0
13 1 1 4 1
14 1 1 5 0
15 1 1 5 1
If you want the columns in a particular order, you'll need to reorder them afterwards (with, e.g., df[["A", "B", "C", "D"]]).
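Here is the same idea as a self-contained sketch, with the reordering step included. On Python 3.7 and later, dicts preserve insertion order, so the columns already come out as A, B, C, D and the explicit selection is just a safeguard. The names `x` and `combos` are illustrative.

```python
import itertools
import pandas as pd

x = {"A": [0, 1], "B": [4, 5], "C": [0, 1], "D": [0, 1]}

# One row per combination of values, one column per dictionary key.
combos = pd.DataFrame(list(itertools.product(*x.values())), columns=list(x))

# Make the column order explicit, regardless of how the dict was built.
combos = combos[["A", "B", "C", "D"]]
print(combos)
```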