Transpose by grouping a DataFrame having both numeric and string variables - python

I have the following DataFrame, which I want to reshape:
import pandas as pd

df = pd.DataFrame({'ID': [111, 111, 111, 222, 222, 333],
                   'class': ['merc', 'humvee', 'bmw', 'vw', 'bmw', 'merc'],
                   'imp': [1, 2, 3, 1, 2, 1]})
print(df)
ID class imp
0 111 merc 1
1 111 humvee 2
2 111 bmw 3
3 222 vw 1
4 222 bmw 2
5 333 merc 1
Desired output:
    ID     0       1    2
0  111  merc  humvee  bmw
1  111     1       2    3
2  222    vw     bmw
3  222     1       2
4  333  merc
5  333     1
I wish to transpose the entire DataFrame, grouped by a particular column (ID in this case), while maintaining the row order.
My attempt: I tried using .set_index() and .unstack(), but it did not work.

Use GroupBy.cumcount to create a counter, then reshape with DataFrame.stack and Series.unstack:
df1 = (df.set_index(['ID', df.groupby('ID').cumcount()])
         .stack()
         .unstack(1, fill_value='')
         .reset_index(level=1, drop=True)
         .reset_index())
print(df1)
    ID     0       1    2
0  111  merc  humvee  bmw
1  111     1       2    3
2  222    vw     bmw
3  222     1       2
4  333  merc
5  333     1

Another method would be to use groupby and concat. Although this is not fully dynamic, it works fine if you only have two value columns to work with, namely class and imp:
s = df.set_index([df['ID'], df.groupby('ID').cumcount()]).unstack(1)
df1 = pd.concat([s['class'], s['imp']], axis=0).sort_index().fillna('')
print(df1)
idx     0       1    2
ID
111  merc  humvee  bmw
111     1       2    3
222    vw     bmw
222     1       2
333  merc
333     1
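If you also need ID back as an ordinary column, as in the desired output, a small finishing step (just a sketch) is:

df1 = df1.reset_index()
print(df1)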

Related

Making a Unique Indicator Column in Pandas Based off of Two Columns

This has been a tricky one for me. So my data structure is roughly the following:
column_1  column_2
111       a
111       a
111       a
111       a
111       b
111       b
222       a
222       b
222       b
222       b
222       b
222       b
I want to get to this in the most efficient way possible from a processing perspective:
column_1  column_2  unique_id
111       a         1
111       a         2
111       a         3
111       a         4
111       b         1
111       b         2
222       a         1
222       b         1
222       b         2
222       b         3
222       b         4
222       b         5
In summary, I want to create a column that turns each row into a unique occurrence. Preferably, this unique_id column restarts at 1 for each new combination of column_1 and column_2.
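For reference, a common way to build such a per-group counter (a minimal sketch, not an answer taken from this thread) is GroupBy.cumcount:

import pandas as pd

df = pd.DataFrame({'column_1': [111]*6 + [222]*6,
                   'column_2': list('aaaabb') + list('abbbbb')})

# cumcount numbers the rows of each (column_1, column_2) group starting at 0,
# so adding 1 makes the counter restart at 1 for every new combination
df['unique_id'] = df.groupby(['column_1', 'column_2']).cumcount() + 1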

Convert a column with values in a list into separated rows grouped by specific columns

I'm trying to convert a column with list values into separate rows, grouped by specific columns.
That's the dataframe I have:
id rooms bathrooms facilities
111 1 2 [2, 3, 4]
222 2 3 [4, 5, 6]
333 2 1 [2, 3, 4]
That's the dataframe I need:
id rooms bathrooms facility
111 1 2 2
111 1 2 3
111 1 2 4
222 2 3 4
222 2 3 5
222 2 3 6
333 2 1 2
333 2 1 3
333 2 1 4
I tried converting the facilities column to a list first:
facilities = pd.DataFrame(df.facilities.tolist())
And then joining on the other columns, following the same method as another suggested solution:
df[['id', 'rooms', 'bathrooms']].join(facilities).melt(id_vars=['id', 'rooms', 'bathrooms']).drop('variable', 1)
Unfortunately, it didn't work for me.
Another solution?
Thanks in advance!
You need explode:
df.explode('facilities')
# id rooms bathrooms facilities
#0 111 1 2 2
#0 111 1 2 3
#0 111 1 2 4
#1 222 2 3 4
#1 222 2 3 5
#1 222 2 3 6
#2 333 2 1 2
#2 333 2 1 3
#2 333 2 1 4
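Note that explode keeps the original row index (hence the repeated 0, 1, 2 labels above). If you also want a fresh 0..n-1 index and the singular facility header from the question, one possible finishing step (a sketch) is:

out = (df.explode('facilities')
         .reset_index(drop=True)
         .rename(columns={'facilities': 'facility'}))   # match the question's 'facility' header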
It is a bit awkward to have lists as values in a DataFrame, so one way to work around this is to unpack the lists, store each element in its own column, and then use the melt function.
# recreate your data
d = {"id": [111, 222, 333],
     "rooms": [1, 2, 2],
     "bathrooms": [2, 3, 1],
     "facilities": [[2, 3, 4], [4, 5, 6], [2, 3, 4]]}
df = pd.DataFrame(d)

# unpack the lists
f0, f1, f2 = [], [], []
for row in df.itertuples():
    f0.append(row.facilities[0])
    f1.append(row.facilities[1])
    f2.append(row.facilities[2])
df["f0"] = f0
df["f1"] = f1
df["f2"] = f2

# melt the dataframe
df = pd.melt(df, id_vars=['id', 'rooms', 'bathrooms'],
             value_vars=["f0", "f1", "f2"], value_name="facilities")

# optionally sort the values and remove the "variable" column
df.sort_values(by=['id'], inplace=True)
df = df[['id', 'rooms', 'bathrooms', 'facilities']]
I think that should get you the dataframe you need.
id rooms bathrooms facilities
0 111 1 2 2
3 111 1 2 3
6 111 1 2 4
1 222 2 3 4
4 222 2 3 5
7 222 2 3 6
2 333 2 1 2
5 333 2 1 3
8 333 2 1 4
The following will give the desired output
def changeDf(x):
    df_m = pd.DataFrame(columns=['id', 'rooms', 'bathrooms', 'facilities'])
    for index, fc in enumerate(x['facilities']):
        df_m.loc[index] = [x['id'], x['rooms'], x['bathrooms'], fc]
    return df_m

df_modified = df.apply(changeDf, axis=1)
df_final = pd.concat([i for i in df_modified])
print(df_final)
"df" is input dataframe and "df_final" is desired dataframe
Try this (note that it needs numpy imported):
import numpy as np

reps = [len(x) for x in df.facilities]
facilities = pd.Series(np.array(df.facilities.tolist()).ravel())
df = df.loc[df.index.repeat(reps)].reset_index(drop=True)
df.facilities = facilities
df
id rooms bathrooms facilities
0 111 1 2 2
1 111 1 2 3
2 111 1 2 4
3 222 2 3 4
4 222 2 3 5
5 222 2 3 6
6 333 2 1 2
7 333 2 1 3
8 333 2 1 4
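np.array(df.facilities.tolist()).ravel() only flattens cleanly because every list happens to have the same length here. A variant of the same idea that also copes with ragged lists (a sketch, assuming df is still the original frame with list values) uses np.concatenate:

import numpy as np

reps = df.facilities.str.len()                    # per-row list lengths
flat = np.concatenate(df.facilities.to_numpy())   # flattens lists of any length
out = df.loc[df.index.repeat(reps)].reset_index(drop=True)
out['facilities'] = flat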

Count ids which are smaller than a certain value

My data consists of ids, each with a certain distance to a point. The goal is to count, for each id, how many distances are equal to or smaller than the radius.
Following example shows my DataFrame:
id distance radius
111 0.5 1
111 2 1
111 1 1
222 1 2
222 3 2
333 5 3
333 4 3
The output should look like this:
id count
111 2
222 1
333 0
You can do:
df['distance'].le(df['radius']).groupby(df['id']).sum()
Output:
id
111 2.0
222 1.0
333 0.0
dtype: float64
Or you can do:
(df.loc[df.distance <= df.radius, 'id']
   .value_counts()
   .reindex(df['id'].unique(), fill_value=0)
)
Output:
111 2
222 1
333 0
Name: id, dtype: int64
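Either variant returns a Series; to get the two-column id/count layout from the question, one possible finishing step (a sketch, building on the first variant) is:

out = (df['distance'].le(df['radius'])
         .groupby(df['id']).sum()
         .astype(int)            # boolean sums come back as float here
         .rename('count')
         .reset_index())
print(out)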

Compare 2 dataframes Pandas, returns wrong values

There are 2 dfs and the datatypes are the same.
df1 =
ID city name value
1 LA John 111
2 NY Sam 222
3 SF Foo 333
4 Berlin Bar 444
df2 =
ID city name value
1 NY Sam 223
2 LA John 111
3 SF Foo 335
4 London Foo1 999
5 Berlin Bar 444
I need to compare them and produce a new df containing only the rows that are in df2 but not in df1.
For some reason, the results of the different methods I have tried are wrong.
So far I've tried
pd.concat([df1, df2], join='inner', ignore_index=True)
but it returns all values together
pd.merge(df1, df2, how='inner')
it returns df1
then this one
df1[~(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])))]
it returns df1
The desired output is
ID city name value
1 NY Sam 223
2 SF Foo 335
3 London Foo1 999
Use DataFrame.merge on all columns except the first, with the indicator parameter:
c = df1.columns[1:].tolist()
Or:
c = ['city', 'name', 'value']
df = (df2.merge(df1, on=c, indicator=True, how='left', suffixes=('', '_'))
         .query("_merge == 'left_only'")[df1.columns])
print(df)
ID city name value
0 1 NY Sam 223
2 3 SF Foo 335
3 4 London Foo1 999
Try this:
print("------------------------------")
print(df1)
df2 = DataFrameFromString(s, columns)
print("------------------------------")
print(df2)
common = df1.merge(df2,on=["city","name"]).rename(columns = {"value_y":"value", "ID_y":"ID"}).drop("value_x", 1).drop("ID_x", 1)
print("------------------------------")
print(common)
OUTPUT:
------------------------------
ID city name value
0 ID city name value
1 1 LA John 111
2 2 NY Sam 222
3 3 SF Foo 333
4 4 Berlin Bar 444
------------------------------
ID city name value
0 1 NY Sam 223
1 2 LA John 111
2 3 SF Foo 335
3 4 London Foo1 999
4 5 Berlin Bar 444
------------------------------
city name ID value
0 LA John 2 111
1 NY Sam 1 223
2 SF Foo 3 335
3 Berlin Bar 5 444

Indexing/Binning Time Series

I have a dataframe like below
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin them into 5-day intervals.
The output should be
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in python, please?
Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, you can first call assign for a copy-based assignment and then do the groupby:
df['Bin'] = df.assign(
    Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3
One way is to create an array of your date range and use numpy.digitize.
import numpy as np

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = (pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')
                 .astype(np.int64).values)
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
