I have a dataset like this:
Block Vector
blk_-1 0.0 2 3, 0.5 3 8, 0.7 33 5
blk_-2 1.0 4 1, 2.0 2 4
blk_-3 0.0 0 0, 6.0 0 7
blk_-4 8.0 3 0, 7.0 5 8
blk_-5 9.0 0 5, 5.0 0 2, 5.2 3 2, 5.9 5 3
dat = {'Block': ['blk_-1', 'blk_-2', 'blk_-3', 'blk_-4', 'blk_-5'],
       'Vector': ['0.0 2 3, 0.5 3 8, 0.7 33 5',
                  '1.0 4 1, 2.0 2 4',
                  '0.0 0 0, 6.0 0 7',
                  '8.0 3 0, 7.0 5 8',
                  '9.0 0 5, 5.0 0 2, 5.2 3 2, 5.9 5 3']}
I want to get:
Block Vector
blk_-1 0.0 2 3
blk_-1 0.5 3 8
blk_-1 0.7 33 5
blk_-2 1.0 4 1
blk_-2 2.0 2 4
blk_-3 0.0 0 0
blk_-3 6.0 0 7
blk_-4 8.0 3 0
blk_-4 7.0 5 8
blk_-5 9.0 0 5
blk_-5 5.0 0 2
blk_-5 5.2 3 2
blk_-5 5.9 5 3
I tried:
df['Vector'] = df['Vector'].apply(lambda x : list(map(str, x.split(','))))
df.Vector.apply(pd.Series) \
.merge(df, left_index = True, right_index = True) \
.drop(["Vector"], axis = 1)
and I get:
0 1 2 3 Block
0 0.0 2 3 0.5 3 8 0.7 33 5 NaN blk_-1
1 1.0 4 1 2.0 2 4 NaN NaN blk_-2
2 0.0 0 0 6.0 0 7 NaN NaN blk_-3
3 8.0 3 0 7.0 5 8 NaN NaN blk_-4
4 9.0 0 5 5.0 0 2 5.2 3 2 5.9 5 3 blk_-5
I'm stuck at this point. Waiting for your ideas and comments :)
You can use split, explode and join.
df[['Block']].join(df.Vector.str.split(',').explode())
Block Vector
0 blk_-1 0.0 2 3
0 blk_-1 0.5 3 8
0 blk_-1 0.7 33 5
1 blk_-2 1.0 4 1
1 blk_-2 2.0 2 4
2 blk_-3 0.0 0 0
2 blk_-3 6.0 0 7
3 blk_-4 8.0 3 0
3 blk_-4 7.0 5 8
4 blk_-5 9.0 0 5
4 blk_-5 5.0 0 2
4 blk_-5 5.2 3 2
4 blk_-5 5.9 5 3
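For reference, a minimal self-contained sketch of this approach (explode requires pandas 0.25+; the str.strip and reset_index(drop=True) steps are optional additions here, to trim the space left after each comma and to renumber the index):

import pandas as pd

dat = {'Block': ['blk_-1', 'blk_-2'],
       'Vector': ['0.0 2 3, 0.5 3 8, 0.7 33 5', '1.0 4 1, 2.0 2 4']}
df = pd.DataFrame(dat)

# split each Vector string into a list, explode to one element per row,
# then join back to Block and renumber the index
out = (df[['Block']]
       .join(df['Vector'].str.split(',').explode().str.strip())
       .reset_index(drop=True))
print(out)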
Solution for pandas 0.25+: split the column with Series.str.split, assign it back with DataFrame.assign, reshape with DataFrame.explode, and finally create a default index with DataFrame.reset_index(drop=True):
df = pd.DataFrame(dat)
df = df.assign(Vector=df['Vector'].str.split(',')).explode('Vector').reset_index(drop=True)
print (df)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
3 blk_-2 1.0 4 1
4 blk_-2 2.0 2 4
5 blk_-3 0.0 0 0
6 blk_-3 6.0 0 7
7 blk_-4 8.0 3 0
8 blk_-4 7.0 5 8
9 blk_-5 9.0 0 5
10 blk_-5 5.0 0 2
11 blk_-5 5.2 3 2
12 blk_-5 5.9 5 3
Version for older pandas: use pop + split + stack + reset_index + rename to build a new Series, then join it to the original:
df = (df.join(df.pop('Vector')
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=1, drop=True)
                .rename('Vector'))
        .reset_index(drop=True))
print (df)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
3 blk_-2 1.0 4 1
4 blk_-2 2.0 2 4
5 blk_-3 0.0 0 0
6 blk_-3 6.0 0 7
7 blk_-4 8.0 3 0
8 blk_-4 7.0 5 8
9 blk_-5 9.0 0 5
10 blk_-5 5.0 0 2
11 blk_-5 5.2 3 2
12 blk_-5 5.9 5 3
For pandas versions below 0.25:
final = (df.merge(df['Vector'].str.split(',', expand=True).stack().reset_index(0, name='Vector'),
                  left_index=True, right_on='level_0', suffixes=('_x', ''))
           .drop(columns=['level_0', 'Vector_x']))
print(final)
Block Vector
0 blk_-1 0.0 2 3
1 blk_-1 0.5 3 8
2 blk_-1 0.7 33 5
0 blk_-2 1.0 4 1
1 blk_-2 2.0 2 4
0 blk_-3 0.0 0 0
1 blk_-3 6.0 0 7
0 blk_-4 8.0 3 0
1 blk_-4 7.0 5 8
0 blk_-5 9.0 0 5
1 blk_-5 5.0 0 2
2 blk_-5 5.2 3 2
3 blk_-5 5.9 5 3
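Note that this variant keeps the stacked level-1 index (0, 1, 2, ... restarting per block), as seen above; if a flat 0..12 index is preferred, one more step should do it:

final = final.reset_index(drop=True)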
Assume we have a table that looks like the following:
id  week_num  people  date     level  a  b
1   1         20      1990101  1      2  3
1   2         30      1990108  1      2  3
1   3         40      1990115  1      2  3
1   5         100     1990129  1      2  3
1   7         100     1990212  1      2  3
week_num skips "4" and "6" because the corresponding "people" is 0. However, we want all the rows included, as in the following table.
id  week_num  people  date     level  a  b
1   1         20      1990101  1      2  3
1   2         30      1990108  1      2  3
1   3         40      1990115  1      2  3
1   4         0       1990122  1      2  3
1   5         100     1990129  1      2  3
1   6         0       1990205  1      2  3
1   7         100     1990212  1      2  3
The date starts at 1990101; each subsequent row adds 7 days when the week_num is continuous (e.g. 1, 2 is continuous; 1, 3 is not).
How can we use Python (pandas) to achieve this goal?
Note: each id has 10 week_num values (1, 2, 3, ..., 10); the output must include all week_num values with the corresponding "people" and "date".
Update: other columns like "level", "a", "b" should stay the same even when we add the skipped week_num rows.
This assumes that the date restarts at 1990-01-01 for each id:
import itertools
import pandas as pd

# reindex to get all combinations of ids and week numbers
df_full = (df.set_index(["id", "week_num"])
             .reindex(list(itertools.product([1, 2], range(1, 11))))
             .reset_index())

# fill people with zero
df_full = df_full.fillna({"people": 0})

# forward fill some other columns
cols_ffill = ["level", "a", "b"]
df_full[cols_ffill] = df_full[cols_ffill].ffill()

# reconstruct date from week number, starting from 1990-01-01 for each id
df_full["date"] = pd.to_datetime("1990-01-01") + (df_full.week_num - 1) * pd.Timedelta("1w")

df_full
# out:
id week_num people date level a b
0 1 1 20.0 1990-01-01 1.0 2.0 3.0
1 1 2 30.0 1990-01-08 1.0 2.0 3.0
2 1 3 40.0 1990-01-15 1.0 2.0 3.0
3 1 4 0.0 1990-01-22 1.0 2.0 3.0
4 1 5 100.0 1990-01-29 1.0 2.0 3.0
5 1 6 0.0 1990-02-05 1.0 2.0 3.0
6 1 7 100.0 1990-02-12 1.0 2.0 3.0
7 1 8 0.0 1990-02-19 1.0 2.0 3.0
8 1 9 0.0 1990-02-26 1.0 2.0 3.0
9 1 10 0.0 1990-03-05 1.0 2.0 3.0
10 2 1 0.0 1990-01-01 1.0 2.0 3.0
11 2 2 0.0 1990-01-08 1.0 2.0 3.0
12 2 3 0.0 1990-01-15 1.0 2.0 3.0
13 2 4 0.0 1990-01-22 1.0 2.0 3.0
14 2 5 0.0 1990-01-29 1.0 2.0 3.0
15 2 6 0.0 1990-02-05 1.0 2.0 3.0
16 2 7 0.0 1990-02-12 1.0 2.0 3.0
17 2 8 0.0 1990-02-19 1.0 2.0 3.0
18 2 9 0.0 1990-02-26 1.0 2.0 3.0
19 2 10 0.0 1990-03-05 1.0 2.0 3.0
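A small variation, in case hardcoding the ids as [1, 2] is undesirable: the full index can be built from the ids actually present in the data. This is a sketch under the same assumption that every id should get weeks 1 through 10:

import pandas as pd

full_idx = pd.MultiIndex.from_product(
    [df["id"].unique(), range(1, 11)], names=["id", "week_num"])

df_full = (df.set_index(["id", "week_num"])
             .reindex(full_idx)
             .reset_index())
df_full["people"] = df_full["people"].fillna(0)
# note: a plain ffill can leak values across ids; group first if that matters
df_full[["level", "a", "b"]] = df_full.groupby("id")[["level", "a", "b"]].ffill()
df_full["date"] = pd.to_datetime("1990-01-01") + (df_full["week_num"] - 1) * pd.Timedelta(days=7)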
I have the following dataset:
0 1 2
0 2.0 2.0 4
0 1.0 1.0 2
0 1.0 1.0 3
3 1.0 1.0 5
4 1.0 1.0 2
5 1.0 NaN 1
6 NaN 1.0 1
What I want to do is insert a new column that, for each row, is 0 if the row contains a NaN and otherwise copies the value from column '2', to get this:
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
The following code is what I have so far; it runs, but it does not pick up the corresponding values from column '2'.
df.isna().sum(axis=1).apply(lambda x: df[2].iloc[x] if x==0 else 0)
If I use df.iloc[x] I get:
0 4
1 4
2 4
3 4
4 4
5 0
6 0
How can I iterate over the column '2'?
Try the code below, using np.where with isna and any:
>>> import numpy as np
>>> df['3'] = np.where(df[['0', '1']].isna().any(axis=1), 0, df['2'])
>>> df
0 1 2 3
0 2.0 2.0 4 4
0 1.0 1.0 2 2
0 1.0 1.0 3 3
3 1.0 1.0 5 5
4 1.0 1.0 2 2
5 1.0 NaN 1 0
6 NaN 1.0 1 0
>>>
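If numpy is not wanted, an equivalent sketch with the same logic uses Series.where, which keeps values where the condition holds and substitutes otherwise (quote the column labels or not depending on whether they are strings or integers in your frame):

# keep column '2' where both '0' and '1' are present, otherwise 0
df['3'] = df['2'].where(df[['0', '1']].notna().all(axis=1), 0)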
I have a dataframe that looks like the following:
df
0 1 2 3 4
0 0.0 NaN NaN NaN NaN
1 NaN 0.0 0.0 NaN 4.0
2 NaN 2.0 0.0 NaN 5.0
3 NaN NaN NaN 0.0 NaN
4 NaN 0.0 3.0 NaN 0.0
I would like a dataframe of all the (i, j) pairs whose values are not NaN. The dataframe should be like the following:
df
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 5.0
4 3 3 0.0
5 4 1 0.0
6 4 2 3.0
7 4 4 0.0
Use DataFrame.stack with DataFrame.rename_axis and DataFrame.reset_index:
df = df.stack().rename_axis(('i','j')).reset_index(name='val')
print (df)
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
Like this:
In [379]: df.stack().reset_index(name='val').rename(columns={'level_0':'i', 'level_1':'j'})
Out[379]:
i j val
0 0 0 0.0
1 1 1 0.0
2 1 2 0.0
3 1 4 4.0
4 2 1 2.0
5 2 2 0.0
6 2 4 5.0
7 3 3 0.0
8 4 1 0.0
9 4 2 3.0
10 4 4 0.0
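Both answers lean on the same detail: stack drops NaN entries by default, which is exactly the filtering asked for here. A minimal self-contained check:

import numpy as np
import pandas as pd

df = pd.DataFrame([[0.0, np.nan], [np.nan, 3.0]])
out = df.stack().rename_axis(('i', 'j')).reset_index(name='val')
print(out)
#    i  j  val
# 0  0  0  0.0
# 1  1  1  3.0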
I would like to remove duplicated sets in the dataframe.
import pandas as pd
import pdb
filename = "result_4_tiling_116.csv"
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, names=['id', 'tileID', 'x', 'y', 'h', 'w'], chunksize=chunksize):
pdb.set_trace()
An example of the first 31 lines of the data:
chunk.head(31)
tileID x y h w
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
25 0 0 8.0 1 4.0
26 1 0 12.0 1 4.0
27 2 0 0.0 1 2.0
28 3 0 2.0 1 2.0
29 4 0 4.0 1 2.0
30 5 0 6.0 1 2.0
I would like to filter out the duplicates. The data contains a set of groups (each starting with tileID=0), as follows:
1.
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2.
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
3.
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
4.
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
5.
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
6.
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
7.
25 0 0 8.0 1 4.0
26 1 0 12.0 1 4.0
27 2 0 0.0 1 2.0
28 3 0 2.0 1 2.0
29 4 0 4.0 1 2.0
30 5 0 6.0 1 2.0
In this example, groups 5 and 7 are duplicates. I tried drop_duplicates, but with no success yet.
But look, drop_duplicates does work:
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
print(df.drop_duplicates(subset=['A', 'C'], keep=False))
This may not be the most efficient way to solve this problem, but it gives correct results.
Let df be your initial dataframe:
unique_chunks = []
for _, chunk in df.groupby((df['tileID'].diff() != 1).cumsum()):
    unindexed_chunk = chunk.reset_index(drop=True)

    for unique_chunk in unique_chunks:
        unindexed_unique_chunk = unique_chunk.reset_index(drop=True)
        if unindexed_chunk.equals(unindexed_unique_chunk):
            break
    else:  # no break: this chunk has not been seen before
        unique_chunks.append(chunk)

output_df = pd.concat(unique_chunks)
will give:
tileID x y h w
0 0 0 0.0 1 8.0
1 1 0 8.0 1 8.0
2 0 0 8.0 1 8.0
3 1 0 0.0 1 4.0
4 2 0 4.0 1 4.0
5 0 0 0.0 1 4.0
6 1 0 4.0 1 4.0
7 2 0 8.0 1 4.0
8 3 0 12.0 1 4.0
9 0 0 4.0 1 4.0
10 1 0 8.0 1 4.0
11 2 0 12.0 1 4.0
12 3 0 0.0 1 2.0
13 4 0 2.0 1 2.0
14 0 0 8.0 1 4.0
15 1 0 12.0 1 4.0
16 2 0 0.0 1 2.0
17 3 0 2.0 1 2.0
18 4 0 4.0 1 2.0
19 5 0 6.0 1 2.0
20 0 0 12.0 1 4.0
21 1 0 0.0 1 2.0
22 2 0 2.0 1 2.0
23 3 0 4.0 1 2.0
24 4 0 6.0 1 2.0
The idea here is to iterate over the chunks of the initial dataframe, collect them in a list, and check whether the chunk of the current iteration is already present in that list. Don't forget to reset the indices!
For explanations on how to iterate over chunks, see this answer.
Edit:
For the case of a very large input file of ~20 GB, you can try saving the processed unique chunks to a file instead of keeping them in a list, and reading them back in chunks, in the same manner as with the input file.
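If the quadratic compare-against-every-seen-chunk loop becomes the bottleneck, one possible sketch (untested against data of that size) reduces each group to a hashable key and keeps only its first occurrence:

import pandas as pd

seen = set()
unique_chunks = []
for _, chunk in df.groupby((df['tileID'].diff() != 1).cumsum()):
    # serialize the group's cell values (ignoring the index) into a hashable key
    key = tuple(map(tuple, chunk.to_numpy()))
    if key not in seen:
        seen.add(key)
        unique_chunks.append(chunk)

output_df = pd.concat(unique_chunks)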
I am trying to do the following: when the value of 'content' is NaN, replace it with the value from the target row. Below is my code, which does this by iterating over all rows, but it is ugly and slow. I suspect there is a more elegant/fast way to do this with mask, so I figured someone might inspire me:
Inputs:
import pandas as pd
d = {'content': [1, 3, None, 6, 1, 59, None], 'target': [0,1,0,2,4,5,4]}
df = pd.DataFrame(data=d)
print(df)
for index, row in df.iterrows():
    if df.loc[index, 'content'] != df.loc[index, 'content']:  # NaN != NaN, so this detects NaN
        df.loc[index, 'content'] = df.loc[df.loc[index, 'target'], 'content']
print(df)
outputs:
content target
0 1.0 0
1 3.0 1
2 NaN 0
3 6.0 2
4 1.0 4
5 59.0 5
6 NaN 4
content target
0 1.0 0
1 3.0 1
2 1.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 1.0 4
Thanks in advance!
Note: only when the content of a row is NaN should it be replaced with the content of the target row.
Additional question: can I do the same thing whenever the content is 59 or 6? Thanks a lot!
By using fillna (note: this fills each NaN with the target label itself, not with the target row's content):
df.content=df.content.fillna(df.target)
df
Out[268]:
content target
0 1.0 0
1 3.0 1
2 0.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 4.0 4
EDIT: with ffill (which fills from the previous row, not from the target row):
df.ffill()
Out[487]:
content target
0 1.0 0
1 3.0 1
2 3.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 59.0 4
I guess you need this:
df.content.reindex(df.target)
Out[492]:
target
0 1.0
1 3.0
0 1.0
2 NaN
4 1.0
5 59.0
4 1.0
Name: content, dtype: float64
After assigning it back:
df.content=df.content.reindex(df.target).values
df
Out[494]:
content target
0 1.0 0
1 3.0 1
2 1.0 0
3 NaN 2
4 1.0 4
5 59.0 5
6 1.0 4
Row 3 got clobbered with NaN because its target row's content is itself NaN, so let me edit again, filling only the NaN positions:
df.content.fillna(df.content.reindex(df.target).reset_index(drop=True))
Out[508]:
0 1.0
1 3.0
2 1.0
3 6.0
4 1.0
5 59.0
6 1.0
Name: content, dtype: float64
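For the additional question (replace content whenever it equals 59 or 6 with the target row's content), a hedged sketch using Series.map, which looks values up by the target label:

# map each target label to the original content value at that label
looked_up = df['target'].map(df['content'])

# replace only where content is 59 or 6; mask leaves other rows untouched
df['content'] = df['content'].mask(df['content'].isin([59, 6]), looked_up)

The same map also covers the NaN case in one line: df['content'] = df['content'].fillna(df['target'].map(df['content'])).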