I know there are already lots of posts on how to convert a pandas dict to a DataFrame; however, I could not find one discussing the issue I have.
My dictionary looks as follows:
[Out 23]:
{'atmosphere': 0
2 5
9 4
15 1
26 5
29 5
... ..
2621 4
6419 3
[6934 rows x 1 columns],
'communication': 0
13 1
15 1
26 1
2621 2
3119 5
... ..
6419 4
6532 1
[714 rows x 1 columns]}
Now, what I want is to create a DataFrame out of this dictionary, where 'atmosphere' and 'communication' are the columns and the indices of both items are merged, so that the DataFrame looks as follows:
index  atmosphere  communication
2      5
9      4
13                 1
15     1           1
26     5           1
29     5
2621   4           2
3119               5
6419   3           4
6532               1
I already tried pd.DataFrame.from_dict, but it saves all values in one row.
Any help is much appreciated!
Use concat with DataFrame.droplevel to remove the second level (0) from the MultiIndex in the columns:
import pandas as pd

d = {'atmosphere': pd.DataFrame({0: {2: 5, 9: 4, 15: 1, 26: 5, 29: 5,
                                     2621: 4, 6419: 3}}),
     'communication': pd.DataFrame({0: {13: 1, 15: 1, 26: 1, 2621: 2,
                                        3119: 5, 6419: 4, 6532: 1}})}
print(d['atmosphere'])
0
2 5
9 4
15 1
26 5
29 5
2621 4
6419 3
print(d['communication'])
0
13 1
15 1
26 1
2621 2
3119 5
6419 4
6532 1
df = pd.concat(d, axis=1).droplevel(1, axis=1)
print(df)
atmosphere communication
2 5.0 NaN
9 4.0 NaN
13 NaN 1.0
15 1.0 1.0
26 5.0 1.0
29 5.0 NaN
2621 4.0 2.0
3119 NaN 5.0
6419 3.0 4.0
6532 NaN 1.0
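pd.concat called with a dict builds a MultiIndex in the columns: the first level holds the dict keys and the second level the original column name 0, which is why droplevel(1, axis=1) leaves only the keys.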
Alternative solution:
df = pd.concat({k: v[0] for k, v in d.items()}, axis=1)
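This works because selecting column 0 from each DataFrame yields a Series, so the concat keys already become flat column names and no droplevel is needed.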
You can use pandas.concat on the values and set_axis with the dictionary keys:
out = pd.concat(d.values(), axis=1).set_axis(d, axis=1)
output:
atmosphere communication
2 5.0 NaN
9 4.0 NaN
13 NaN 1.0
15 1.0 1.0
26 5.0 1.0
29 5.0 NaN
2621 4.0 2.0
3119 NaN 5.0
6419 3.0 4.0
6532 NaN 1.0
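In all of these variants the index merging comes for free: pd.concat aligns on the union of the row indices by default (join='outer') and fills the missing positions with NaN.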
I have a dataframe like this:
import pandas as pd
import numpy as np
data = {'trip': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
        'timestamps': [1235471761, 1235471763, 1235471765, 1235471767,
                       1235471770, 1235471772, 1235471776, 1235471779,
                       1235471780, 1235471789, 1235471792, 1235471793,
                       1235471829, 1235471833, 1235471835, 1235471838,
                       1235471844, 1235471847, 1235471848, 1235471852,
                       1235471855, 1235471859, 1235471900, 1235471904,
                       1235471911, 1235471913]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df
What I am looking for is to start from the first row (consider it as an origin) in the "TimeDistance" column and do a cumulative sum over its values; whenever this sum reaches 10, restart the cumsum and continue this procedure until the end of the trip (as you can see, this dataframe has 3 trips in the "trip" column).
I want all the cumulative sums in a new column, let's say a "cumu" column.
Another important point is that after the threshold is reached, the next row in the "cumu" column must be zero, and the summation restarts from this new origin again.
I hope I've understood your question right. You can use a generator with .send():
def my_accumulate(maxval):
    val = 0
    yield  # prime step: the very first sent value (the NaN origin) is discarded
    while True:
        if val < maxval:
            val += yield val  # yield the running sum, then add the next sent value
        else:
            yield val  # threshold reached: yield the sum as-is ...
            val = 0    # ... and restart from zero on the next send

def fn(x):
    a = my_accumulate(10)
    next(a)  # advance the generator to the bare yield
    x["cumu"] = [a.send(v) for v in x["TimeDistance"]]
    return x

df = df.groupby("trip").apply(fn)
print(df)
Prints:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0.0
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471770 3.0 9.0
5 1 1235471772 2.0 11.0
6 1 1235471776 4.0 0.0
7 1 1235471779 3.0 3.0
8 1 1235471780 1.0 4.0
9 1 1235471789 9.0 13.0
10 1 1235471792 3.0 0.0
11 1 1235471793 1.0 1.0
12 2 1235471829 NaN 0.0
13 2 1235471833 4.0 4.0
14 2 1235471835 2.0 6.0
15 2 1235471838 3.0 9.0
16 2 1235471844 6.0 15.0
17 2 1235471847 3.0 0.0
18 2 1235471848 1.0 1.0
19 2 1235471852 4.0 5.0
20 2 1235471855 3.0 8.0
21 2 1235471859 4.0 12.0
22 3 1235471900 NaN 0.0
23 3 1235471904 4.0 4.0
24 3 1235471911 7.0 11.0
25 3 1235471913 2.0 0.0
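To make the send() mechanics easier to follow, here is a minimal standalone demo of my_accumulate with made-up values (not the question's data):
gen = my_accumulate(10)
next(gen)               # prime: run the generator up to the bare yield
gen.send(float('nan'))  # -> 0, the first value (the NaN origin) is discarded
gen.send(4)             # -> 4
gen.send(7)             # -> 11, threshold crossed, the sum is yielded as-is
gen.send(3)             # -> 0, the value after the threshold is dropped; restart
gen.send(2)             # -> 2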
Another solution:
df = df.groupby("trip").apply(
    lambda x: x.assign(
        cumu=(
            val := 0,
            *(
                val := val + v if val < 10 else (val := 0)
                for v in x["TimeDistance"][1:]
            ),
        )
    ),
)
print(df)
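The tuple starts with val := 0 for the NaN origin row, and the unpacked generator expression then updates val row by row, emitting 0 and discarding the current value once the threshold is reached. Note that this relies on assignment expressions, so it requires Python 3.8+.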
Andrej's answer is better, as mine is probably not as efficient, and it depends on the df being ordered by trip and on TimeDistance being NaN in the first row of each trip.
cumulative_sum = 0
df['cumu'] = 0
for i in range(len(df)):
    # NaN marks the start of a new trip; >= 10 means the threshold was hit.
    # Either way the running sum restarts at this row.
    if np.isnan(df.loc[i, 'TimeDistance']) or cumulative_sum >= 10:
        cumulative_sum = 0
        df.loc[i, 'cumu'] = 0
    else:
        cumulative_sum += df.loc[i, 'TimeDistance']
        df.loc[i, 'cumu'] = cumulative_sum
print(df) outputs:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0
1 1 1235471763 2.0 2
2 1 1235471765 2.0 4
3 1 1235471767 2.0 6
4 1 1235471770 3.0 9
5 1 1235471772 2.0 11
6 1 1235471776 4.0 0
7 1 1235471779 3.0 3
8 1 1235471780 1.0 4
9 1 1235471789 9.0 13
10 1 1235471792 3.0 0
11 1 1235471793 1.0 1
12 2 1235471829 NaN 0
13 2 1235471833 4.0 4
14 2 1235471835 2.0 6
15 2 1235471838 3.0 9
16 2 1235471844 6.0 15
17 2 1235471847 3.0 0
18 2 1235471848 1.0 1
19 2 1235471852 4.0 5
20 2 1235471855 3.0 8
21 2 1235471859 4.0 12
22 3 1235471900 NaN 0
23 3 1235471904 4.0 4
24 3 1235471911 7.0 11
25 3 1235471913 2.0 0
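Note that the reset branch deliberately ignores the current row's TimeDistance: once the running sum has reached 10, the following row gets 0 and becomes the new origin, exactly as the question requires.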
I have the following pandas DataFrame
Id_household Age_Father Age_child
0 1 30 2
1 1 30 4
2 1 30 4
3 1 30 1
4 2 27 4
5 3 40 14
6 3 40 18
and I want to achieve the following result
Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
Id_household
1 30 1 2.0 4.0 4.0
2 27 4 NaN NaN NaN
3 40 14 18.0 NaN NaN
I tried stacking with multi-index renaming, but I am not very happy with it and I am not able to make everything work properly.
Use this:
df_out = df.set_index([df.groupby('Id_household').cumcount() + 1,
                       'Id_household',
                       'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()
Output:
Id_household Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
0 1 30 2.0 4.0 4.0 1.0
1 2 27 4.0 NaN NaN NaN
2 3 40 14.0 18.0 NaN NaN
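For comparison, a minimal sketch of the same reshape using pivot instead of unstack; it assumes the question's df and a pandas version that accepts a list for pivot's index argument (>= 1.1), and child_no is just an illustrative helper name:
key = df.groupby('Id_household').cumcount() + 1
out = (df.assign(child_no=key)
         .pivot(index=['Id_household', 'Age_Father'],  # row labels to keep
                columns='child_no',                    # per-household counter
                values='Age_child')
         .add_prefix('Age_child_')
         .reset_index())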
I need to combine multiple rows into a single row, and the original dataframes looks like:
IndividualID DayID TripID JourSequence TripPurpose
200100000001 1 1 1 3
200100000001 1 2 2 31
200100000001 1 3 3 23
200100000001 1 4 4 5
200100000009 1 55 1 3
200100000009 1 56 2 12
200100000009 1 57 3 4
200100000009 1 58 4 6
200100000009 1 59 5 19
200100000009 1 60 6 2
I was trying to build some sort of 'trip chain', so basically all the journey sequences and trip purposes of one individual on a single day should be in the same row...
Ideally I was trying to convert the table to something like this:
IndividualID DayID Seq1 TripPurp1 Seq2 TripPur2 Seq3 TripPurp3 Seq4 TripPur4
200100000001 1 1 3 2 31 3 23 4 5
200100000009 1 1 3 2 12 3 4 4 6
If this is not possible, then the following mode would also be fine:
IndividualID DayID TripPurposes
200100000001 1 3, 31, 23, 5
200100000009 1 3, 12, 4, 6
Are there any possible solutions? I was thinking of a for loop or a while statement, but maybe that is not really a good idea.
Thanks in advance!
You can try:
df_out = (df.set_index(['IndividualID', 'DayID',
                        df.groupby(['IndividualID', 'DayID']).cumcount() + 1])
            .unstack()
            .sort_index(level=1, axis=1))
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out.reset_index()
Output:
IndividualID DayID JourSequence_1 TripID_1 TripPurpose_1 \
0 200100000001 1 1.0 1.0 3.0
1 200100000009 1 1.0 55.0 3.0
JourSequence_2 TripID_2 TripPurpose_2 JourSequence_3 TripID_3 \
0 2.0 2.0 31.0 3.0 3.0
1 2.0 56.0 12.0 3.0 57.0
TripPurpose_3 JourSequence_4 TripID_4 TripPurpose_4 JourSequence_5 \
0 23.0 4.0 4.0 5.0 NaN
1 4.0 4.0 58.0 6.0 5.0
TripID_5 TripPurpose_5 JourSequence_6 TripID_6 TripPurpose_6
0 NaN NaN NaN NaN NaN
1 59.0 19.0 6.0 60.0 2.0
To get your second output you just need to groupby and apply list:
df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list)
TripPurpose
IndividualID DayID
200100000001 1 [3, 31, 23, 5]
200100000009 1 [3, 12, 4, 6, 19, 2]
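If you want the comma-separated strings shown in the question rather than Python lists, a small sketch (TripPurposes just mirrors the question's column name):
out = (df.groupby(['IndividualID', 'DayID'])['TripPurpose']
         .agg(lambda s: ', '.join(s.astype(str)))  # join values as one string
         .reset_index(name='TripPurposes'))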
To get your first output you can do something like this (probably not the best approach):
df2 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['TripPurpose'].apply(list))
trip = df2['TripPurpose'].apply(pd.Series).rename(columns=lambda x: 'TripPurpose' + str(x + 1))
df3 = pd.DataFrame(df.groupby(['IndividualID', 'DayID'])['JourSequence'].apply(list))
seq = df3['JourSequence'].apply(pd.Series).rename(columns=lambda x: 'seq' + str(x + 1))
pd.merge(trip, seq, on=['IndividualID', 'DayID'])
The output is not sorted, though.
I am trying to decile the column score of a DataFrame.
I use the following code:
np.percentile(df['score'], np.arange(0, 100, 10))
My problem is that score contains lots of zeros. How can I filter out these 0 values and compute the deciles only over the remaining values?
Filter them with boolean indexing:
df.loc[df['score']!=0, 'score']
or
df['score'][lambda x: x!=0]
and pass that to the percentile function.
np.percentile(df['score'][lambda x: x!=0], np.arange(0,100,10))
Consider the dataframe df
df = pd.DataFrame(
    dict(score=np.random.rand(20))
).where(
    np.random.choice([True, False], (20, 1), p=(.8, .2)),
    0
)
score
0 0.380777
1 0.559356
2 0.103099
3 0.800843
4 0.262055
5 0.389330
6 0.477872
7 0.393937
8 0.189949
9 0.571908
10 0.133402
11 0.033404
12 0.650236
13 0.593495
14 0.000000
15 0.013058
16 0.334851
17 0.000000
18 0.999757
19 0.000000
Use pd.qcut to decile:
pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10))
0 4
1 6
2 1
3 9
4 3
5 4
6 6
7 5
8 2
9 7
10 1
11 0
12 8
13 8
15 0
16 3
18 9
Name: score, dtype: category
Categories (10, int64): [0 < 1 < 2 < 3 ... 6 < 7 < 8 < 9]
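The third argument, range(10), supplies the labels, so each score is tagged with its decile number 0-9 rather than with interval bounds.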
Or all together:
df.assign(decile=pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10)))
score decile
0 0.380777 4.0
1 0.559356 6.0
2 0.103099 1.0
3 0.800843 9.0
4 0.262055 3.0
5 0.389330 4.0
6 0.477872 6.0
7 0.393937 5.0
8 0.189949 2.0
9 0.571908 7.0
10 0.133402 1.0
11 0.033404 0.0
12 0.650236 8.0
13 0.593495 8.0
14 0.000000 NaN
15 0.013058 0.0
16 0.334851 3.0
17 0.000000 NaN
18 0.999757 9.0
19 0.000000 NaN
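Because assign aligns on the index, the zero rows that were excluded from the qcut input simply receive NaN in the decile column.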
You can simply mask zeros and then remove them from your column using boolean indexing:
score = df['score']
score_no_zero = score[score != 0]
np.percentile(score_no_zero, np.arange(0,100,10))
or in one step:
np.percentile(df['score'][df['score'] != 0], np.arange(0,100,10))
I have a pandas.DataFrame that contains string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN ?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, which ever is sooner. So don't make this a long term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
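For reference, a common way to coerce every column in one call, whether or not it is what the referenced comment suggests, is DataFrame.apply, which forwards the keyword argument to pd.to_numeric for each column:
df = df.apply(pd.to_numeric, errors='coerce')  # coerce all columns at once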