Restructuring CSV into Pandas DataFrame - python

I've got a CSV with a rather messy format:
t, 01_x, 01_y, 02_x, 02_y
0, 0, 1, ,
1, 1, 1, 0, 0
Here "01_" and "02_" are the entity numbers (1 and 2), which can vary from file to file, and there might be additional columns too (but at least the same ones for all entities).
Note also that entity 2 enters the scene at t=1 (no entries at t=0).
I already import the CSV into a pandas dataframe, but I don't see how to transform it into the following form:
t, entity, x, y
0, 1, 0, 1
1, 1, 1, 1
1, 2, 0, 0
Is there a simple (pythonic) way to transform that?
Thanks!
René

This is wide_to_long, but we first need to swap the order of the parts of your column names around the '_':
df.columns = ['_'.join(x.split('_')[::-1]) for x in df.columns]
#Index(['t', 'x_01', 'y_01', 'x_02', 'y_02'], dtype='object')
(pd.wide_to_long(df, i='t', j='entity', stubnames=['x', 'y'], sep='_')
.dropna()
.reset_index())
t entity x y
0 0 1 0.0 1.0
1 1 1 1.0 1.0
2 1 2 0.0 0.0
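For reference, a minimal end-to-end sketch, assuming the CSV lives in a file called data.csv (hypothetical name); skipinitialspace handles the blanks after the commas:
import pandas as pd

# hypothetical file name; skipinitialspace=True strips the spaces after each comma
df = pd.read_csv('data.csv', skipinitialspace=True)
# move the entity number to the end so 'x' and 'y' become the stubnames
df.columns = ['_'.join(c.split('_')[::-1]) for c in df.columns]
tidy = (pd.wide_to_long(df, i='t', j='entity', stubnames=['x', 'y'], sep='_')
        .dropna()
        .reset_index())
print(tidy)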

Related

summing rows based on one hot variables

I think the code below works, but it seems too clumsy. Basically, I want to go from the haves dataframe defined below to the wants result: for each dummy column, sum up Result over the rows where that dummy is 1. Hope this makes sense?
import pandas as pd
data = {'Dummy1': [0, 0, 1, 1],
        'Dummy2': [1, 1, 0, 0],
        'Result': [1, 1, 2, 2]}
haves = pd.DataFrame(data)
print(haves)
melted = pd.melt(haves, id_vars=['Result'])
melted = melted.loc[melted["value"] > 0]
print(melted)
wants = melted.groupby(["variable"])["Result"].sum()
print(wants)
No need to melt; perform a simple multiplication and sum:
wants = haves.drop('Result', axis=1).mul(haves['Result'], axis=0).sum()
output:
Dummy1 4
Dummy2 2
dtype: int64
Intermediate:
>>> haves.drop('Result', axis=1).mul(haves['Result'], axis=0)
Dummy1 Dummy2
0 0 1
1 0 1
2 2 0
3 2 0
A shorter variant:
Warning: this mutates the original dataframe, which will lose the 'Result' column.
wants = haves.mul(haves.pop('Result'), axis=0).sum()
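Equivalently, this is just a matrix-vector product between the dummy columns and Result. A small sketch of that view, assuming haves still contains the Result column (i.e. before the pop variant above):
# transpose the dummy columns and dot them with the Result vector
wants = haves.drop('Result', axis=1).T @ haves['Result']
# Dummy1    4
# Dummy2    2
# dtype: int64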

add multiple columns programmatically from individual criteria/rules

I would like to add multiple columns programmatically to a dataframe using pre-defined rules. As an example, I would like to add 3 columns to the below dataframe, based on whether or not they satisfy the three rules indicated in code below:
import numpy as np
import pandas as pd

# define dataframe
df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})
#define "rules" for adding subsequent columns
rule_1 = (df1["time1"] == 1)
rule_2 = (df1["time2"] == 1)
rule_3 = (df1["time3"] == 1)
#add new columns based on whether or not above rules are satisfied
df1["rule_1"] = np.where(rule_1, 1, 0)
df1["rule_2"] = np.where(rule_2, 1, 0)
df1["rule_3"] = np.where(rule_3, 1, 0)
As you can see, my approach gets tedious when I need to add tens of columns, each based on a different "rule", to a test dataframe.
Is there a way to do this more easily without defining each column manually along with its individual np.where clause? I tried doing something like this, but pandas does not accept it:
rules = [rule_1, rule_2, rule_3]
for rule in rules:
    df1[rule] = np.where(rule, 1, 0)
Any ideas on how to make my approach more programmatically efficient?
The solution you provided doesn't work because you are using the rule itself (a boolean Series) as the new column's name. I would solve it like this:
rules = [rule_1, rule_2, rule_3]
for i, rule in enumerate(rules):
    df1[f'rule_{i+1}'] = np.where(rule, 1, 0)
Leverage Python's f-strings in a for loop; they are good at this:
# Create a list by filtering the time columns
cols = list(df1.filter(regex='time', axis=1).columns)
# Iterate through the list of columns, imposing your condition with np.where
for col in cols:
    df1[f'{col}_new'] = df1[col].apply(lambda x: np.where(x == 1, 1, 0))
I might be oversimplifying your rules, but something like:
rules = [
    ('time1', 1),
    ('time2', 1),
    ('time3', 1),
]
for i, (col, val) in enumerate(rules):
    df1[f"rule_{i + 1}"] = np.where(df1[col] == val, 1, 0)
If all of your rules check the same thing, maybe this could be helpful: unstack the relevant columns, check the condition on the resulting Series, and convert back to a DataFrame with unstack:
df1[['rule1','rule2','rule3']] = df1[['time1','time2','time3']].unstack().eq(1).astype(int).swaplevel().unstack()
Output:
time1 time2 time3 outcome rule1 rule2 rule3
0 0 1 0 1 0 1 0
1 1 0 0 0 1 0 0
2 1 0 0 0 1 0 0
3 0 0 1 1 0 0 1
4 0 1 0 0 0 1 0
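As a variant on the loop-based answers above, a hedged sketch that keeps the rules in a dict so every new column gets an explicit name (the rule names here are made up for illustration):
import pandas as pd

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})
# map each new column name to its boolean rule
rules = {
    "rule_1": df1["time1"] == 1,
    "rule_2": df1["time2"] == 1,
    "rule_3": df1["time3"] == 1,
}
for name, mask in rules.items():
    df1[name] = mask.astype(int)  # 1 where the rule holds, else 0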

How to build a dataframe with clustered timestamps as index from a generator of tuple(key, dict)?

I'm new to Pandas, so maybe I'm missing something very simple here, but searching through other questions didn't get me what I need.
I have a Python generator that yields tuples of (timestamp, {k1: v1, k2: v2, ...}) # the timestamp is a float
and I want to build a dataframe of this form:
datetime(timestamp) (<-- this should be the index) | k1 | k2 | k3 |...
The second request (which might actually help in terms of efficiency) is to have lines that have very close timestamps (<0.3) merged into a single line (it is promised that the columns will not overlap, i.e. at least one of the lines will have NaN for every column).
The following lines did it for me, but only as a time series, not as the index of a dataframe, and I don't know how to "stick it back" into the dataframe:
times.loc[times.diff() < 0.3] = times[times.diff() > 0.3]
times = times.pad().map(datetime.fromtimestamp)
The size of the data can get to thousands of (clusters of) timestamps over a million columns.
This option was the fastest for me:
t = {}
for ts, d in file_content:
    for k, v in d.items():
        t.setdefault(ts, {})[k] = v
df1 = pd.DataFrame.from_dict(t, orient='index')
Loading into dict took 14sec, and loading the dict into df took 30sec (where the output dataframe is of size ~1GB), but this is without any optimization over the timestamp clustering.
What's the best way to load the dataframe, and what's the code that can build and "attach" the timestamp index to this dataframe?
EDIT:
here's an example of the first tuple from file_content:
In [2]: next(file_content)
Out[2]:
(1628463575.9415462,
{'E2_S0_ME_rbw': 0,
'E2_S0_ME_rio': 0,
'E2_S0_ME_rlat': 0,
'E2_S0_ME_rmdi': 0,
'E2_S0_ME_wbw': 0,
'E2_S0_ME_wio': 0,
'E2_S0_ME_wlat': 0,
'E2_S0_ME_wmdi': 0})
EDIT2:
the second tuple (note that the timestamp is VERY close to the previous one, AND that the keys are completely different):
In [12]: next(file_content)
Out[12]:
(1628463575.946525,
{'E2_S1_ME_errors': 0,
'E2_S1_ME_messages': 0})
You discovered that you can use a dictionary to load your data; that could be written slightly more simply:
>>> pd.DataFrame.from_dict(dict(file_contents), orient='index')
E2_S0_ME_rbw E2_S0_ME_rio E2_S0_ME_rlat E2_S0_ME_rmdi E2_S0_ME_wbw E2_S0_ME_wio E2_S0_ME_wlat E2_S0_ME_wmdi
1.628464e+09 0 0 0 0 0 0 0 0
You can also directly load the iterable into a dataframe and then normalize from there:
>>> fc = pd.DataFrame(file_contents)
>>> fc
0 1
0 1.628464e+09 {'E2_S0_ME_rbw': 0, 'E2_S0_ME_rio': 0, 'E2_S0_...'
>>> df = pd.json_normalize(fc[1]).join(fc[0].rename('timestamp'))
>>> df
E2_S0_ME_rbw E2_S0_ME_rio E2_S0_ME_rlat E2_S0_ME_rmdi E2_S0_ME_wbw E2_S0_ME_wio E2_S0_ME_wlat E2_S0_ME_wmdi timestamp
0 0 0 0 0 0 0 0 0 1.628464e+09
Now for coalescing lines, let's start with a dataframe that has values as you describe. Here there are 2 groups, one of rows 0-3 and the other of rows 4-5, with at most one non-NaN value per column within each group:
>>> df
timestamp E2_S0_ME_rbw E2_S0_ME_rio E2_S0_ME_rlat E2_S0_ME_rmdi E2_S0_ME_wbw E2_S0_ME_wio E2_S0_ME_wlat E2_S0_ME_wmdi
0 1.628464e+09 NaN NaN NaN 0.886793 0.525714 NaN NaN NaN
1 1.628464e+09 NaN 0.638154 0.319839 NaN NaN 0.375288 NaN NaN
2 1.628464e+09 NaN NaN NaN NaN NaN NaN 0.660108 NaN
3 1.628464e+09 0.969127 NaN NaN NaN NaN NaN NaN 0.362666
4 1.628464e+09 NaN NaN NaN 0.879372 NaN NaN 0.851226 NaN
5 1.628464e+09 0.029188 0.757706 0.718359 NaN 0.491337 0.239511 NaN 0.503021
>>> df['timestamp'].astype('datetime64[s]')
0 2021-08-08 22:59:35
1 2021-08-08 22:59:36
2 2021-08-08 22:59:36
3 2021-08-08 22:59:36
4 2021-08-08 22:59:36
5 2021-08-08 22:59:37
Name: timestamp, dtype: datetime64[ns]
>>> df['timestamp'].diff()
0 NaN
1 0.2
2 0.2
3 0.2
4 0.4
5 0.2
Name: timestamp, dtype: float64
You want to merge all lines that are within .3s of each other, which we can check with diff(): we start a new group every time a diff is greater than .3s. Then .first() gets the first non-NA value in each column of each group:
>>> df.groupby((df['timestamp'].diff().rename(None) > .3).cumsum()).first()
timestamp E2_S0_ME_rbw E2_S0_ME_rio E2_S0_ME_rlat E2_S0_ME_rmdi E2_S0_ME_wbw E2_S0_ME_wio E2_S0_ME_wlat E2_S0_ME_wmdi
0 1.628464e+09 0.969127 0.638154 0.319839 0.886793 0.525714 0.375288 0.660108 0.362666
1 1.628464e+09 0.029188 0.757706 0.718359 0.879372 0.491337 0.239511 0.851226 0.503021
Note that with .resample(), if you've got values that are close but on opposite sides of a bin boundary, e.g. 0.299s and 0.301s, they'll get aggregated into different lines.
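For completeness, a resample-based sketch (not the groupby approach above, and subject to the bin-boundary caveat just mentioned); it assumes the same df with a float timestamp column:
# convert the float timestamps to a DatetimeIndex, then bin into fixed 300 ms windows
out = (df.set_index(pd.to_datetime(df['timestamp'], unit='s'))
         .drop(columns='timestamp')
         .resample('300ms')
         .first()              # first non-NA value per column in each bin
         .dropna(how='all'))   # discard empty bins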
I've made an example of your data:
file_content = [
(1628463575.9415462,
{'E2_S0_ME_rbw': 0,
'E2_S0_ME_rio': 0,
'E2_S0_ME_rlat': 0,
'E2_S0_ME_rmdi': 0,
'E2_S0_ME_wbw': 0,
'E2_S0_ME_wio': 0,
'E2_S0_ME_wlat': 0,
'E2_S0_ME_wmdi': 0}
),
(1628463576.7,
{'E2_S0_ME_rbw': 0,
'E2_S0_ME_rio': 0,
'E2_S0_ME_rlat': 0,
'E2_S0_ME_rmdi': 1,
'E2_S0_ME_wbw': 0,
'E2_S0_ME_wio': 0,
'E2_S0_ME_wlat': 0,
'E2_S0_ME_wmdi': 0}
),
(1628464579,
{'E2_S0_ME_rbw': 0,
'E2_S0_ME_rio': 1,
'E2_S0_ME_rlat': 0,
'E2_S0_ME_rmdi': 0,
'E2_S0_ME_wbw': 0,
'E2_S0_ME_wio': 0,
'E2_S0_ME_wlat': 0,
'E2_S0_ME_wmdi': 0}
),
(1628493589,
{'E2_S0_ME_rbw': 0,
'E2_S0_ME_rio': 0,
'E2_S0_ME_rlat': 0,
'E2_S0_ME_rmdi': 0,
'E2_S0_ME_wbw': 0,
'E2_S0_ME_wio': 0,
'E2_S0_ME_wlat': 0,
'E2_S0_ME_wmdi': 0}
)
]
Here is the code to generate a dataframe with the date as index:
from datetime import datetime

for i in range(len(file_content)):
    file_content[i][1]['time'] = file_content[i][0]
    file_content[i] = file_content[i][1]
d = pd.DataFrame(file_content)
d['time'] = d['time'].apply(datetime.fromtimestamp)
d = d.set_index('time')
Then you can use resample to merge rows that fall into the same time bin. This works well if the timestamps in your data are not too far apart; if they are very spread out you may end up with a lot of all-NaN rows. In my example it looks like this:
d = d.resample('3s').mean()
Of course you could just drop the NaN rows after that, but it might produce a needlessly big dataframe first if your data is infrequent. You can also use other functions to aggregate the values, like min or max.

Replace the current row with data with the previous row if the value of a certain column in the current row is 1

I have a pandas dataframe and want to replace the current row with the previous row if the value of a certain column in the current row is 1, but I have had no success yet. Any help appreciated.
It could be done like this:
#B is the column that lets you know if the row should change or not
for i in range(1, len(df)):
    if df.loc[i, 'B'] == 1:
        df.loc[i, :] = df.loc[i-1, :]
This can be done just using the shift operator and assigning. If b is the column we want to condition on, then we create a condition based on b first:
import pandas as pd
df = pd.DataFrame({
'a':[1, 2, 3, 4, 5, 6],
'b':[0, 1, 1, 0, 0, 1],
'c':[7, 8, 9, 0, 1, 2]}
)
# this is our condition (which row we're changing)
index_to_change = df['b'] == 1
# this is a list of columns we want to change
cols_to_change = ['a', 'c']
df.loc[index_to_change, cols_to_change] = df[cols_to_change].shift(1).loc[index_to_change]
Output:
In []: df
Out[]:
a b c
0 1.0 0 7.0
1 1.0 1 7.0
2 2.0 1 8.0
3 4.0 0 0.0
4 5.0 0 1.0
5 5.0 1 1.0
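One subtle difference, as an aside: with consecutive 1s in b the two answers disagree. The loop copies the already-replaced previous row, so a run of 1s all inherit the last row where b != 1, whereas shift(1) copies the original previous row (as in the output above). A hedged sketch of the propagate-through-runs behaviour using ffill:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [0, 1, 1, 0, 0, 1],
                   'c': [7, 8, 9, 0, 1, 2]})
cols_to_change = ['a', 'c']
# blank out the flagged rows, then forward-fill from the last unflagged row
df.loc[df['b'].eq(1), cols_to_change] = np.nan
df[cols_to_change] = df[cols_to_change].ffill()
# rows 1 and 2 now both carry row 0's values (a=1, c=7)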

Appending pandas DataFrame with MultiIndex with data containing new labels, but preserving the integer positions of the old MultiIndex

Base scenario
For a recommendation service I am training a matrix factorization model (LightFM) on a set of user-item interactions. For the matrix factorization model to yield the best results, I need to map my user and item IDs to a continuous range of integer IDs starting at 0.
I'm using a pandas DataFrame in the process, and I have found a MultiIndex to be extremely convenient to create this mapping, like so:
ratings = [{'user_id': 1, 'item_id': 1, 'rating': 1.0},
{'user_id': 1, 'item_id': 3, 'rating': 1.0},
{'user_id': 3, 'item_id': 1, 'rating': 1.0},
{'user_id': 3, 'item_id': 3, 'rating': 1.0}]
df = pd.DataFrame(ratings, columns=['user_id', 'item_id', 'rating'])
df = df.set_index(['user_id', 'item_id'])
df
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
And then it allows me to get the continuous maps like so:
df.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1], dtype='int8')
df.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1], dtype='int8')
Afterwards, I can map them back using df.index.levels[0].get_loc method. Great!
Extension
But, now I'm trying to streamline my model training process, ideally by training it incrementally on new data, preserving the old ID mappings. Something like:
new_ratings = [{'user_id': 2, 'item_id': 1, 'rating': 1.0},
{'user_id': 2, 'item_id': 2, 'rating': 1.0}]
df2 = pd.DataFrame(new_ratings, columns=['user_id', 'item_id', 'rating'])
df2 = df2.set_index(['user_id', 'item_id'])
df2
Out:
rating
user_id item_id
2 1 1.0
2 2 1.0
Then, simply appending the new ratings to the old DataFrame
df3 = df.append(df2)
df3
Out:
rating
user_id item_id
1 1 1.0
1 3 1.0
3 1 1.0
3 3 1.0
2 1 1.0
2 2 1.0
Looks good, but
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 2, 2, 1, 1], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 2, 0, 2, 0, 1], dtype='int8')
I added user_id=2 and item_id=2 in the later DataFrame on purpose, to illustrate where it goes wrong for me. In df3, labels 3 (for both user and item), have moved from integer position 1 to 2. So the mapping is no longer the same. What I'm looking for is [0, 0, 1, 1, 2, 2] and [0, 1, 0, 1, 0, 2] for user and item mappings respectively.
This is probably because of ordering in pandas Index objects, and I'm unsure if what I want is at all possible using a MultiIndex strategy. Looking for help on how most to effectively tackle this problem :)
Some notes:
I find using DataFrames convenient for several reasons, but I use the MultiIndex purely for the ID mappings. Alternatives without MultiIndex are completely acceptable.
I cannot guarantee that new user_id and item_id entries in new ratings are larger than any values in the old dataset, hence my example of adding id 2 when [1, 3] were present.
For my incremental training approach, I will need to store my ID maps somewhere. If I only load new ratings partially, I will have to store the old DataFrame and ID maps somewhere. Would be great if it could all be in one place, like it would be with an index, but columns work too.
EDIT: An additional requirement is to allow for row re-ordering of the original DataFrame, as might happen when duplicate ratings exist, and I want to keep the most recent one.
Solution (credits to #jpp for original)
I've made a modification to #jpp's answer to satisfy the additional requirement I've added later (tagged with EDIT). This also truly satisfies the original question as posed in the title, since it preserves the old index integer positions, regardless of rows being reordered for whatever reason. I've also wrapped things into functions:
from itertools import chain
from toolz import unique

def expand_index(source, target, index_cols=['user_id', 'item_id']):
    # Elevate index to series, keeping source with index
    temp = source.reset_index()
    target = target.reset_index()
    # Convert columns to categorical, using the source index and target columns
    for col in index_cols:
        i = source.index.names.index(col)
        col_cats = list(unique(chain(source.index.levels[i], target[col])))
        temp[col] = pd.Categorical(temp[col], categories=col_cats)
        target[col] = pd.Categorical(target[col], categories=col_cats)
    # Convert series back to index
    source = temp.set_index(index_cols)
    target = target.set_index(index_cols)
    return source, target

def concat_expand_index(old, new):
    old, new = expand_index(old, new)
    return pd.concat([old, new])

df3 = concat_expand_index(df, df2)
The result:
df3.index.labels[0] # For users
Out:
FrozenNDArray([0, 0, 1, 1, 2, 2], dtype='int8')
df3.index.labels[1] # For items
Out:
FrozenNDArray([0, 1, 0, 1, 0, 2], dtype='int8')
I think the use of MultiIndex overcomplicates this objective:
I need to map my user and item IDs to a continuous range of integer IDs starting at 0.
This solution falls in to the below category:
Alternatives without MultiIndex are completely acceptable.
def add_mapping(df, df2, df3, column_name='user_id'):
    initial = df.loc[:, column_name].unique()
    new = df2.loc[~df2.loc[:, column_name].isin(initial), column_name].unique()
    maps = np.arange(len(initial))
    mapping = dict(zip(initial, maps))
    maps = np.append(maps, np.arange(np.max(maps)+1, np.max(maps)+1+len(new)))
    total = np.append(initial, new)
    mapping = dict(zip(total, maps))
    df3[column_name+'_map'] = df3.loc[:, column_name].map(mapping)
    return df3

add_mapping(df, df2, df3, column_name='item_id')
add_mapping(df, df2, df3, column_name='user_id')
user_id item_id rating item_id_map user_id_map
0 1 1 1.0 0 0
1 1 3 1.0 1 0
2 3 1 1.0 0 1
3 3 3 1.0 1 1
0 2 1 1.0 0 2
1 2 2 1.0 2 2
Explanation
This is how to maintain a mapping for the user_id values. Same holds for the item_id values as well.
These are the initial user_id values (unique):
initial_users = df['user_id'].unique()
# initial_users = array([1, 3])
user_map maintains a mapping for user_id values, as per your requirement:
user_id_maps = np.arange(len(initial_users))
# user_id_maps = array([0, 1])
user_map = dict(zip(initial_users, user_id_maps))
# user_map = {1: 0, 3: 1}
These are the new user_id values you got from df2 - ones that you didn't see in df:
new_users = df2[~df2['user_id'].isin(initial_users)]['user_id'].unique()
# new_users = array([2])
Now we update user_map for the total user base with the new users:
user_id_maps = np.append(user_id_maps, np.arange(np.max(user_id_maps)+1, np.max(user_id_maps)+1+len(new_users)))
# array([0, 1, 2])
total_users = np.append(initial_users, new_users)
# array([1, 3, 2])
user_map = dict(zip(total_users, user_id_maps))
# user_map = {1: 0, 2: 2, 3: 1}
Then, just map the values from user_map to df['user_id']:
df3['user_map'] = df3['user_id'].map(user_map)
user_id item_id rating user_map
0 1 1 1.0 0
1 1 3 1.0 0
2 3 1 1.0 1
3 3 3 1.0 1
0 2 1 1.0 2
1 2 2 1.0 2
Forcing alignment of index labels after concatenation does not appear straightforward and, if there is a solution, it is poorly documented.
One option which may appeal to you is Categorical Data. With some careful manipulation, this can achieve the same purpose: each unique index value within a level has a one-to-one mapping to an integer, and this mapping persists even after concatenation with other dataframes.
from itertools import chain
from toolz import unique
# elevate index to series
df = df.reset_index()
df2 = df2.reset_index()
# define columns for reindexing
index_cols = ['user_id', 'item_id']
# convert to categorical with merged categories
for col in index_cols:
    col_cats = list(unique(chain(df[col], df2[col])))
    df[col] = pd.Categorical(df[col], categories=col_cats)
    df2[col] = pd.Categorical(df2[col], categories=col_cats)
# convert series back to index
df = df.set_index(index_cols)
df2 = df2.set_index(index_cols)
I use toolz.unique to return an ordered unique list, but if you don't have access to this library, you can use the identical unique_everseen recipe from the itertools docs.
Now let's have a look at the category codes underlying the 0th index level:
for data in [df, df2]:
    print(data.index.get_level_values(0).codes.tolist())
[0, 0, 1, 1]
[2, 2]
Then perform our concatenation:
df3 = pd.concat([df, df2])
Finally, check that categorical codes are aligned:
print(df3.index.get_level_values(0).codes.tolist())
[0, 0, 1, 1, 2, 2]
For each index level, note we must take the union of all index values across dataframes to form col_cats, otherwise the concatenation will fail.
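If you later need to map an integer code back to the original ID, the categories of each level give you the reverse mapping. A quick check, assuming the df3 built above:
user_level = df3.index.get_level_values(0)  # CategoricalIndex of user_id
print(list(user_level.categories))          # [1, 3, 2] - order is stable across the concat
print(user_level.categories[2])             # 2, the user added in df2
print(user_level.codes.tolist())            # [0, 0, 1, 1, 2, 2]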
