Adding new rows with new values at some specific columns in pandas - python

Assume we have a table that looks like the following:
id  week_num  people  date     level  a  b
1   1         20      1990101  1      2  3
1   2         30      1990108  1      2  3
1   3         40      1990115  1      2  3
1   5         100     1990129  1      2  3
1   7         100     1990212  1      2  3
week_num skips "4" and "6" because the corresponding "people" is 0. However, we want all the rows included, as in the following table.
id  week_num  people  date     level  a  b
1   1         20      1990101  1      2  3
1   2         30      1990108  1      2  3
1   3         40      1990115  1      2  3
1   4         0       1990122  1      2  3
1   5         100     1990129  1      2  3
1   6         0       1990205  1      2  3
1   7         100     1990212  1      2  3
The date starts at 1990101; each next row's date is +7 days when the week_num is consecutive (e.g. 1,2 is consecutive; 1,3 is not).
How can we use Python (pandas) to achieve this?
Note: each id has 10 week_num values (1, 2, 3, ..., 10); the output must include all "week_num" values with the corresponding "people" and "date".
Update: other columns like "level", "a", "b" should stay the same even when we add the skipped week_num rows.
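For reference, the sample table can be rebuilt as a DataFrame like this (a minimal sketch with the values copied straight from the table above):
import pandas as pd

df = pd.DataFrame({
    "id":       [1, 1, 1, 1, 1],
    "week_num": [1, 2, 3, 5, 7],
    "people":   [20, 30, 40, 100, 100],
    "date":     [1990101, 1990108, 1990115, 1990129, 1990212],
    "level":    [1, 1, 1, 1, 1],
    "a":        [2, 2, 2, 2, 2],
    "b":        [3, 3, 3, 3, 3],
})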

This assumes that the date restarts at 1990-01-01 for each id:
import itertools
import pandas as pd

# reindex to get all combinations of ids and week numbers
df_full = (df.set_index(["id", "week_num"])
             .reindex(list(itertools.product([1, 2], range(1, 11))))
             .reset_index())
# fill people with zero
df_full = df_full.fillna({"people": 0})
# forward fill some other columns
cols_ffill = ["level", "a", "b"]
df_full[cols_ffill] = df_full[cols_ffill].ffill()
# reconstruct date from the week number, starting at 1990-01-01 for each id
df_full["date"] = pd.to_datetime("1990-01-01") + (df_full.week_num - 1) * pd.Timedelta("1w")
df_full
# out:
id week_num people date level a b
0 1 1 20.0 1990-01-01 1.0 2.0 3.0
1 1 2 30.0 1990-01-08 1.0 2.0 3.0
2 1 3 40.0 1990-01-15 1.0 2.0 3.0
3 1 4 0.0 1990-01-22 1.0 2.0 3.0
4 1 5 100.0 1990-01-29 1.0 2.0 3.0
5 1 6 0.0 1990-02-05 1.0 2.0 3.0
6 1 7 100.0 1990-02-12 1.0 2.0 3.0
7 1 8 0.0 1990-02-19 1.0 2.0 3.0
8 1 9 0.0 1990-02-26 1.0 2.0 3.0
9 1 10 0.0 1990-03-05 1.0 2.0 3.0
10 2 1 0.0 1990-01-01 1.0 2.0 3.0
11 2 2 0.0 1990-01-08 1.0 2.0 3.0
12 2 3 0.0 1990-01-15 1.0 2.0 3.0
13 2 4 0.0 1990-01-22 1.0 2.0 3.0
14 2 5 0.0 1990-01-29 1.0 2.0 3.0
15 2 6 0.0 1990-02-05 1.0 2.0 3.0
16 2 7 0.0 1990-02-12 1.0 2.0 3.0
17 2 8 0.0 1990-02-19 1.0 2.0 3.0
18 2 9 0.0 1990-02-26 1.0 2.0 3.0
19 2 10 0.0 1990-03-05 1.0 2.0 3.0
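If the ids should not be hard-coded, one possible tweak (a sketch; it assumes every id of interest already appears in df) is to build the reindexing product from the data itself:
import itertools

# take the ids from the data instead of spelling out [1, 2]
ids = df["id"].unique()
df_full = (df.set_index(["id", "week_num"])
             .reindex(list(itertools.product(ids, range(1, 11))))
             .reset_index())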


How to iterate over an array using a lambda function with pandas apply

I have the following dataset:
0 1 2
0 2.0 2.0 4
0 1.0 1.0 2
0 1.0 1.0 3
3 1.0 1.0 5
4 1.0 1.0 2
5 1.0 NaN 1
6 NaN 1.0 1
and what I want to do is insert a new column such that, for each row, if there is a NaN the new column gets 0, and otherwise it copies the value from column '2', to get this:
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
The following code is what I have so far, which works fine but does not iterate over the values of column '2'.
df.isna().sum(axis=1).apply(lambda x: df[2].iloc[x] if x==0 else 0)
If I use df.iloc[x] I get:
0 4
1 4
2 4
3 4
4 4
5 0
6 0
How can I iterate over the column '2'?
Try the code below, using np.where with isna and any:
>>> import numpy as np
>>> df['3'] = np.where(df[['0', '1']].isna().any(axis=1), 0, df['2'])
>>> df
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
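If you prefer to stay within pandas, an equivalent sketch (assuming the same string column labels as above) uses Series.where:
# keep the value from column '2' where neither '0' nor '1' is NaN, otherwise use 0
df['3'] = df['2'].where(df[['0', '1']].notna().all(axis=1), 0)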

Rearranging dataframe based on index and column values [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a dataframe that looks like this:
df
Out[262]:
klass varde
0 1.0 53.801840
1 1.0 58.524591
2 1.0 51.879057
3 1.0 48.391662
4 1.0 48.451202
5 1.0 53.072189
6 1.0 55.418486
7 1.0 56.561995
8 1.0 59.161386
9 1.0 53.033094
0 1.0 52.438421
1 1.0 53.554198
2 1.0 38.968125
3 1.0 53.895055
4 1.0 55.335374
5 1.0 48.885893
6 1.0 48.173335
7 1.0 45.083425
8 1.0 50.846878
9 1.0 47.132339
0 2.0 88.804034
1 2.0 105.083136
2 2.0 96.204701
3 2.0 94.890052
4 2.0 90.846715
5 2.0 99.433425
6 2.0 113.972773
7 2.0 94.816123
8 2.0 114.141583
9 2.0 91.235912
0 2.0 104.331863
1 2.0 106.283919
2 2.0 105.769039
3 2.0 97.678197
4 2.0 106.136627
5 2.0 90.884468
6 2.0 104.920153
7 2.0 81.463938
8 2.0 107.859278
9 2.0 90.248085
I want to reshape the dataframe so that 'varde' values with the same index and the same value in column 'klass' are put beside each other, like this:
klass varde varde
0 1.0 53.801840 52.438421
1 1.0 58.524591 53.554198
2 1.0 51.879057 38.968125
3 1.0 48.391662 53.895055
4 1.0 48.451202 55.335374
5 1.0 53.072189 48.885893
6 1.0 55.418486 48.173335
7 1.0 56.561995 45.083425
8 1.0 59.161386 50.846878
9 1.0 53.033094 47.132339
0 2.0 88.804034 104.331863
1 2.0 105.083136 106.283919
2 2.0 96.204701 105.769039
3 2.0 94.890052 97.678197
4 2.0 90.846715 106.136627
5 2.0 99.433425 90.884468
6 2.0 113.972773 104.920153
7 2.0 94.816123 81.463938
8 2.0 114.141583 107.859278
9 2.0 91.235912 90.248085
I'm really stuck on this...
We can chain several commands. Since the grouping key here is the (unnamed) index, give it a name first so it can be grouped together with 'klass':
>>> df.rename_axis("id").groupby(["id", "klass"])['varde'].apply(list).apply(pd.Series).reset_index()
id klass 0 1
0 0 1.0 53.801840 52.438421
1 0 2.0 88.804034 104.331863
2 1 1.0 58.524591 53.554198
3 1 2.0 105.083136 106.283919
4 2 1.0 51.879057 38.968125
5 2 2.0 96.204701 105.769039
6 3 1.0 48.391662 53.895055
7 3 2.0 94.890052 97.678197
8 4 1.0 48.451202 55.335374
9 4 2.0 90.846715 106.136627
10 5 1.0 53.072189 48.885893
11 5 2.0 99.433425 90.884468
12 6 1.0 55.418486 48.173335
13 6 2.0 113.972773 104.920153
14 7 1.0 56.561995 45.083425
15 7 2.0 94.816123 81.463938
16 8 1.0 59.161386 50.846878
17 8 2.0 114.141583 107.859278
18 9 1.0 53.033094 47.132339
19 9 2.0 91.235912 90.248085
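An alternative sketch built on the generic pivot approach from the linked duplicate; it assumes the repeated 0-9 index plays the role of an "id" and that repeats should land in separate columns in order of appearance:
import pandas as pd

out = (df.rename_axis("id")
         .reset_index()
         .assign(rep=lambda d: d.groupby(["id", "klass"]).cumcount())  # 0 for the first repeat, 1 for the second
         .set_index(["id", "klass", "rep"])["varde"]
         .unstack("rep")
         .reset_index())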

Python: how to select indexes from pandas dataframe?

I have a dataframe that looks like the following:
df
0 1 2 3 4
0 0.0 NaN NaN NaN NaN
1 NaN 0.0 0.0 NaN 4.0
2 NaN 2.0 0.0 NaN 5.0
3 NaN NaN NaN 0.0 NaN
4 NaN 0.0 3.0 NaN 0.0
I would like to have a dataframe of all the (row, column) pairs whose value is not NaN. The dataframe should look like the following:
df
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 5.0
4 3 3 0.0
5 4 1 0.0
6 4 2 3.0
7 4 4 0.0
Use DataFrame.stack with DataFrame.rename_axis and DataFrame.reset_index:
df = df.stack().rename_axis(('i','j')).reset_index(name='val')
print (df)
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
Like this:
In [379]: df.stack().reset_index(name='val').rename(columns={'level_0':'i', 'level_1':'j'})
Out[379]:
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
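For reference, either answer can be sanity-checked like this (a sketch; the frame is rebuilt from the question's table):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0,    np.nan, np.nan, np.nan, np.nan],
                   [np.nan, 0.0,    0.0,    np.nan, 4.0],
                   [np.nan, 2.0,    0.0,    np.nan, 5.0],
                   [np.nan, np.nan, np.nan, 0.0,    np.nan],
                   [np.nan, 0.0,    3.0,    np.nan, 0.0]])

out = df.stack().rename_axis(('i', 'j')).reset_index(name='val')
print(out)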

Reshape, recombine row with column

I have a dataset like this:
Block Vector
blk_-1 0.0 2 3, 0.5 3 8, 0.7 33 5
blk_-2 1.0 4 1, 2.0 2 4
blk_-3 0.0 0 0, 6.0 0 7
blk_-4 8.0 3 0, 7.0 5 8
blk_-5 9.0 0 5, 5.0 0 2, 5.2 3 2, 5.9 5 3
dat = {'Block': ['blk_-1', 'blk_-2', 'blk_-3', 'blk_-4', 'blk_-5'],
       'Vector': ['0.0 2 3, 0.5 3 8, 0.7 33 5',
                  '1.0 4 1, 2.0 2 4',
                  '0.0 0 0, 6.0 0 7',
                  '8.0 3 0, 7.0 5 8',
                  '9.0 0 5, 5.0 0 2, 5.2 3 2, 5.9 5 3']
       }
I want to get:
Block Vector
blk_-1 0.0 2 3
blk_-1 0.5 3 8
blk_-1 0.7 33 5
blk_-2 1.0 4 1
blk_-2 2.0 2 4
blk_-3 0.0 0 0
blk_-3 6.0 0 7
blk_-4 8.0 3 0
blk_-4 7.0 5 8
blk_-5 9.0 0 5
blk_-5 5.0 0 2
blk_-5 5.2 3 2
blk_-5 5.9 5 3
I tried:
df['Vector'] = df['Vector'].apply(lambda x: list(map(str, x.split(','))))
df.Vector.apply(pd.Series) \
    .merge(df, left_index=True, right_index=True) \
    .drop(["Vector"], axis=1)
and got:
0 1 2 3 Block
0 0.0 2 3 0.5 3 8 0.7 33 5 NaN blk_-1
1 1.0 4 1 2.0 2 4 NaN NaN blk_-2
2 0.0 0 0 6.0 0 7 NaN NaN blk_-3
3 8.0 3 0 7.0 5 8 NaN NaN blk_-4
4 9.0 0 5 5.0 0 2 5.2 3 2 5.9 5 3 blk_-5
I'm stuck at this point. Waiting for your ideas and comments :)
You can use split, explode and join.
df[['Block']].join(df.Vector.str.split(',').explode())
Block Vector
0 blk_-1 0.0 2 3
0 blk_-1 0.5 3 8
0 blk_-1 0.7 33 5
1 blk_-2 1.0 4 1
1 blk_-2 2.0 2 4
2 blk_-3 0.0 0 0
2 blk_-3 6.0 0 7
3 blk_-4 8.0 3 0
3 blk_-4 7.0 5 8
4 blk_-5 9.0 0 5
4 blk_-5 5.0 0 2
4 blk_-5 5.2 3 2
4 blk_-5 5.9 5 3
Solution for pandas 0.25+: split the column with Series.str.split, assign it back with DataFrame.assign, apply DataFrame.explode, and finally restore a default index with DataFrame.reset_index(drop=True):
df = pd.DataFrame(dat)
df = df.assign(Vector=df['Vector'].str.split(',')).explode('Vector').reset_index(drop=True)
print (df)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
3 blk_-2 1.0 4 1
4 blk_-2 2.0 2 4
5 blk_-3 0.0 0 0
6 blk_-3 6.0 0 7
7 blk_-4 8.0 3 0
8 blk_-4 7.0 5 8
9 blk_-5 9.0 0 5
10 blk_-5 5.0 0 2
11 blk_-5 5.2 3 2
12 blk_-5 5.9 5 3
Version for oldier pandas versions - use pop + split + stack + reset_index + rename for new Series and then join to original:
df = (df.join(df.pop('Vector')
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=1, drop=True)
                .rename('Vector'))
        .reset_index(drop=True))
print (df)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
3 blk_-2 1.0 4 1
4 blk_-2 2.0 2 4
5 blk_-3 0.0 0 0
6 blk_-3 6.0 0 7
7 blk_-4 8.0 3 0
8 blk_-4 7.0 5 8
9 blk_-5 9.0 0 5
10 blk_-5 5.0 0 2
11 blk_-5 5.2 3 2
12 blk_-5 5.9 5 3
For versions lower than 0.25:
final = df.merge(df['Vector'].str.split(',', expand=True).stack().reset_index(0, name='Vector'),
                 left_index=True, right_on='level_0', suffixes=('_x', '')).drop(['level_0', 'Vector_x'], 1)
print(final)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
0 blk_-2 1.0 4 1
1 blk_-2 2.0 2 4
0 blk_-3 0.0 0 0
1 blk_-3 6.0 0 7
0 blk_-4 8.0 3 0
1 blk_-4 7.0 5 8
0 blk_-5 9.0 0 5
1 blk_-5 5.0 0 2
2 blk_-5 5.2 3 2
3 blk_-5 5.9 5 3
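One detail worth noting with all the split-based answers: splitting on ',' leaves a leading space on every element after the first (e.g. ' 0.5 3 8'). A possible cleanup step, as a sketch:
# trim leading/trailing whitespace from each vector string after exploding
df['Vector'] = df['Vector'].str.strip()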

How to remove duplicate groups in a pandas dataframe?

I would like to remove duplicated sets in the dataframe.
import pandas as pd
import pdb

filename = "result_4_tiling_116.csv"
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, names=['id', 'tileID', 'x', 'y', 'h', 'w'], chunksize=chunksize):
    pdb.set_trace()
An example of the first 31 lines of the data:
chunk.head(31)
tileID x y h w
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
25 0 0 8.0 1 4.0
26 1 0 12.0 1 4.0
27 2 0 0.0 1 2.0
28 3 0 2.0 1 2.0
29 4 0 4.0 1 2.0
30 5 0 6.0 1 2.0
I would like to filter out the duplicated ones. The data contains a set of groups (each starting with tileID=0), as follows:
1.
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2.
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
3.
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
4.
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
5.
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
6.
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
7.
25 0 0 8.0 1 4.0
26 1 0 12.0 1 4.0
27 2 0 0.0 1 2.0
28 3 0 2.0 1 2.0
29 4 0 4.0 1 2.0
30 5 0 6.0 1 2.0
In this example, groups 5 and 7 are duplicates. I tried to use drop_duplicates, but without success so far.
But note that drop_duplicates does work:
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
df = df.drop_duplicates(subset=['A', 'C'], keep=False)
print(df)
This may not be the most efficient way to solve the problem, but it gives correct results.
Let df be your initial dataframe:
unique_chunks = []
for _, chunk in df.groupby((df['tileID'].diff() != 1).cumsum()):
    unindexed_chunk = chunk.reset_index(drop=True)

    for unique_chunk in unique_chunks:
        unindexed_unique_chunk = unique_chunk.reset_index(drop=True)
        if unindexed_chunk.equals(unindexed_unique_chunk):
            break
    else:
        unique_chunks.append(chunk)

output_df = pd.concat(unique_chunks)
will give:
tileID x y h w
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
The idea here is to iterate over the chunks of the initial dataframe, collect them in a list, and check whether the chunk in the current iteration is already present in that list. Don't forget to reset the indices!
For explanations on how to iterate over chunks, see this answer.
Edit:
For the case of a very large input file (~20 GB), you can try saving the processed unique chunks to a file instead of keeping them in a list, and then reading them back in chunks, in the same manner as you do with the input file.
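As an alternative to the pairwise comparison above, you could hash each group and keep only its first occurrence. A sketch, assuming (as in the question) that every group starts with tileID == 0:
import pandas as pd

# label each tiling group; a new group starts whenever tileID resets to 0
group_id = (df['tileID'] == 0).cumsum()

# serialize each group's rows into a hashable key
keys = df.groupby(group_id).apply(
    lambda g: tuple(map(tuple, g[['tileID', 'x', 'y', 'h', 'w']].to_numpy()))
)

# keep only the first occurrence of each distinct group
first_occurrence = ~keys.duplicated()
output_df = pd.concat(g for gid, g in df.groupby(group_id) if first_occurrence[gid])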
