python recode csv with condition

I am a beginner in python.
I need to recode a CSV file:
unique_id,pid,Age
1,1,1
1,2,3
2,1,5
2,2,6
3,1,6
3,2,4
3,3,6
3,4,1
3,5,4
4,1,6
4,2,5
The condition is: for each [unique_id], if there is any [Age]==6, then put a value of 1 in the corresponding row where [pid]==1; all other rows should get 0.
the output csv will look like this:
unique_id,pid,Age,recode
1,1,1,0
1,2,3,0
2,1,5,1
2,2,6,0
3,1,6,1
3,2,4,0
3,3,6,0
3,4,1,0
3,5,4,0
4,1,6,1
4,2,5,0
I was using numpy, like the following:
import numpy
import pandas as pd
input_file1 = "data.csv"
input_folder = 'G:/My Drive/'
Her_HH =pd.read_csv(input_folder + input_file1)
Her_HH['recode'] = numpy.select([Her_PP['Age']==6,Her_PP['Age']<6], [1,0], default=Her_HH['recode'])
Her_HH.to_csv('recode_elderly.csv', index=False)
but it does not put value 1 in where [pid] is 1. Any help will be appreciated.

You can use DataFrame.assign to create a helper column, GroupBy.transform with 'any' to test whether at least one value per group matches, chain that with the mask for pid == 1 using & (bitwise AND), and finally cast the output to integers:
#sorting if necessary
df = df.sort_values('unique_id')
m1 = df.assign(test=df['Age'] == 6).groupby('unique_id')['test'].transform('any')
Another idea for getting the groups that contain a 6 is to filter their unique_id values with Series.isin:
m1 = df['unique_id'].isin(df.loc[df['Age'] == 6, 'unique_id'])
m2 = df['pid'] == 1
df['recode'] = (m1 & m2).astype(int)
print (df)
unique_id pid Age recode
0 1 1 1 0
1 1 2 3 0
2 2 1 5 1
3 2 2 6 0
4 3 1 6 1
5 3 2 4 0
6 3 3 6 0
7 3 4 1 0
8 3 5 4 0
9 4 1 6 1
10 4 2 5 0
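Putting those pieces together with the questioner's read/write steps, a minimal end-to-end sketch (assuming the file is data.csv as in the question) could look like this:
import pandas as pd

df = pd.read_csv('data.csv')

# flag rows where Age is 6, then check per unique_id whether any row in the group matches
m1 = df.assign(test=df['Age'] == 6).groupby('unique_id')['test'].transform('any')
# only rows with pid == 1 should receive the 1
m2 = df['pid'] == 1

df['recode'] = (m1 & m2).astype(int)
df.to_csv('recode_elderly.csv', index=False)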
EDIT:
To check the groups with no 6 in the Age column, filter by the inverted mask using ~; if you only want one row per unique_id, add DataFrame.drop_duplicates:
print (df[~m1])
unique_id pid Age
0 1 1 1
1 1 2 3
df1 = df[~m1].drop_duplicates('unique_id')
print (df1)
unique_id pid Age
0 1 1 1

This is a bit clumsy, since I know numpy a lot better than pandas.
Load your csv sample into a dataframe:
In [205]: df = pd.read_csv('stack59885878.csv')
In [206]: df
Out[206]:
unique_id pid Age
0 1 1 1
1 1 2 3
2 2 1 5
3 2 2 6
4 3 1 6
5 3 2 4
6 3 3 6
7 3 4 1
8 3 5 4
9 4 1 6
10 4 2 5
Generate a groupby object based on the unique_id column:
In [207]: gps = df.groupby('unique_id')
In [209]: gps.groups
Out[209]:
{1: Int64Index([0, 1], dtype='int64'),
2: Int64Index([2, 3], dtype='int64'),
3: Int64Index([4, 5, 6, 7, 8], dtype='int64'),
4: Int64Index([9, 10], dtype='int64')}
I've seen pandas ways of iterating over groups, but here's a list comprehension. The iteration produces a tuple of the id and a dataframe. We want to test each group dataframe for its 'Age' and 'pid' values:
In [211]: recode_values = [(gp['Age']==6).any() & (gp['pid']==1) for x, gp in gps]
In [212]: recode_values
Out[212]:
[0 False
1 False
Name: pid, dtype: bool, 2 True
3 False
Name: pid, dtype: bool, 4 True
5 False
6 False
7 False
8 False
Name: pid, dtype: bool, 9 True
10 False
Name: pid, dtype: bool]
The result is a list of Series, with True where pid is 1 and there is an 'Age' of 6 in the group.
Joining these Series with numpy.hstack produces a boolean array, which we can convert to an integer array:
In [214]: np.hstack(recode_values)
Out[214]:
array([False, False, True, False, True, False, False, False, False,
True, False])
In [215]: df['recode']=_.astype(int) # assign that to a new column
In [216]: df
Out[216]:
unique_id pid Age recode
0 1 1 1 0
1 1 2 3 0
2 2 1 5 1
3 2 2 6 0
4 3 1 6 1
5 3 2 4 0
6 3 3 6 0
7 3 4 1 0
8 3 5 4 0
9 4 1 6 1
10 4 2 5 0
Again, I think there's an idiomatic pandas way of joining those series. But for now this works.
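One candidate for that idiomatic join (my assumption, not part of the original session) is to let pd.concat stitch the per-group Series back together, since each one keeps its original row labels:
# concatenate the list of boolean Series, restore row order, and cast to 0/1
df['recode'] = pd.concat(recode_values).sort_index().astype(int)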
===
OK, the groupby object has an apply:
In [223]: def foo(gp):
...: return (gp['Age']==6).any() & (gp['pid']==1).astype(int)
...:
In [224]: gps.apply(foo)
Out[224]:
unique_id
1 0 0
1 0
2 2 1
3 0
3 4 1
5 0
6 0
7 0
8 0
4 9 1
10 0
Name: pid, dtype: int64
And remove the multi-indexing with:
In [242]: gps.apply(foo).reset_index(0, True)
Out[242]:
0 0
1 0
2 1
3 0
4 1
5 0
6 0
7 0
8 0
9 1
10 0
Name: pid, dtype: int64
In [243]: df['recode']=_ # and assign to recode
Lots of experimenting and learning here.

Filtering out the end of a dataframe in pandas

I have a large dataframe with timestamps and there is one value that starts with a decrease, then stays 0 for a while and increases again starting the next cycle.
I would like to analyze the decreasing and stable part, but not the increasing part.
Ideally the code should check whether a 0 has occurred earlier in the df, and if so, exclude any value > 0 that comes afterwards. It would also work to determine where the last value ==0 occurs and delete all the data after that.
Is there a possibility to do this?
Cheers!
import pandas as pd
data = {
"Period_index": [1,2,3,4,5,6,7,8,9,10],
"Value": [9, 7, 3, 0, 0, 0, 0, 2, 4, 6]
}
df = pd.DataFrame(data)
If your data always starts with a decrease, stays at 0 for a while, and then increases, this should work:
import numpy as np
df[:np.where(df.Value == 0)[0].max()]
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
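Note that the slice above is exclusive, so the final zero row (index 6) is dropped. If you want to keep every zero up to and including the last one, adding 1 to the position should work (a small variation on the answer above, not the original code):
df[:np.where(df.Value == 0)[0].max() + 1]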
Literally as you say, keep a row if either:
there has never been a zero, checked with .gt(0).cummin(), which returns True for non-zero values and then takes the cumulative minimum, i.e. as soon as there is one False it returns only False from that point on,
or the value itself is zero, checked with .eq(0).
>>> df['Value'].gt(0).cummin()
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 False
8 False
9 False
Name: Value, dtype: bool
>>> df[df['Value'].gt(0).cummin() | df['Value'].eq(0)]
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
6 7 0
If you’re afraid of getting unconnected zeros again you can pass the whole mask to cummin() again, i.e.:
df[(df['Value'].gt(0).cummin() | df['Value'].eq(0)).cummin()]
FWIW this is the only answer that really stops at the first zero:
>>> test = pd.Series([1, 0, 0, 1, 0])
>>> test[(test.gt(0).cummin() | test.eq(0)).cummin()]
0 1
1 0
2 0
dtype: int64
>>> test.loc[:test.eq(0).idxmax()]
0 1
1 0
dtype: int64
>>> test.loc[:test.eq(0)[::-1].idxmax()]
0 1
1 0
2 0
3 1
4 0
dtype: int64
#SeaBean’s answer is also good (or a variant thereof):
>>> test[test.gt(test.shift()).cumsum().eq(0)]
0 1
1 0
2 0
dtype: int64
This code will cut the series when it starts growing again.
The rolling window computes the difference between consecutive values; once the series starts growing again that difference turns positive, which is captured by cummax:
df[df.Value.rolling(2).apply(lambda w:w.values[1]-w.values[0]).cummax().fillna(-1)<=0]
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
6 7 0
try via idxmax() + iloc:
out=df.iloc[:df['Value'].eq(0).idxmax()]
#df.loc[:df['Value'].eq(0).idxmax()-1]
output of out:
Period_index Value
0 1 9
1 2 7
2 3 3
OR
via idxmax() and loc:
out=df.loc[:df['Value'].eq(0)[::-1].idxmax()]
output of out:
Period_index Value
0 1 9
1 2 7
2 3 3
3 4 0
4 5 0
5 6 0
6 7 0
OR
If you don't want to get unconnected 0's then use argmin():
df=df.loc[:(df['Value'].eq(0) | (~df['Value'].gt(df['Value'].shift()))).argmin()-1]

How to divide dataframe into 2 equal parts (first half rows and second half rows) - in Python

I have a dataframe and need to break it into 2 equal dataframes.
1st dataframe would contain top half rows and 2nd would contain the remaining rows.
Please help how to achieve this using python.
The solution should handle both the even-row and odd-row scenarios (with an odd number of rows I would need to drop the last row to make the halves equal).
Consider df:
In [122]: df
Out[122]:
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
12 3 4 5 6
Use numpy.array_split():
In [127]: import numpy as np
In [128]: def split_df(df):
...: if len(df) % 2 != 0: # Handling `df` with `odd` number of rows
...: df = df.iloc[:-1, :]
...: df1, df2 = np.array_split(df, 2)
...: return df1, df2
...:
In [130]: df1, df2 = split_df(df)
In [131]: df1
Out[131]:
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
In [133]: df2
Out[133]:
id days sold days_lag
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
With a simple example, you can try it as below:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13],['Tom',20],['Jerry',25]]
#data = [['Alex',10],['Bob',12],['Clarke',13],['Tom',20]]
data1 = data[0:int(len(data)/2)]
if (len(data) % 2) == 0:
    data2 = data[int(len(data)/2):]
else:
    data2 = data[int(len(data)/2):-1]
df1 = pd.DataFrame(data1, columns=['Name', 'Age'], dtype=float); print("1st half:\n",df1)
df2 = pd.DataFrame(data2, columns=['Name', 'Age'], dtype=float); print("2nd Half:\n",df2)
Output:
D:\Python>python temp.py
1st half:
Name Age
0 Alex 10.0
1 Bob 12.0
2nd Half:
Name Age
0 Clarke 13.0
1 Tom 20.0
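For the DataFrame df from the first answer, a plain iloc-based split (a sketch of the same idea without relying on numpy) would be:
half = len(df) // 2           # integer half; with an odd row count the last row is left out
df1 = df.iloc[:half]
df2 = df.iloc[half:half * 2]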

Comparing columns of 2 dataframes

I am trying to get the columns that are unique to a data frame.
DF_A has 10 columns
DF_B has 3 columns (all three match column names in DF_A).
Before I was using:
cols_to_use = DF_A.columns - DF_B.columns.
Since my pandas update, I am getting this error:
TypeError: cannot perform sub with this index type:
What should I be doing now instead?
Thank you!
You can use the Index.difference method:
Demo:
In [12]: df
Out[12]:
a b c d
0 0 8 0 3
1 3 4 1 7
2 0 5 4 0
3 0 9 7 0
4 5 8 5 4
In [13]: df2
Out[13]:
a d
0 4 3
1 3 1
2 1 2
3 3 4
4 0 3
In [14]: df.columns.difference(df2.columns)
Out[14]: Index(['b', 'c'], dtype='object')
In [15]: cols = df.columns.difference(df2.columns)
In [16]: df[cols]
Out[16]:
b c
0 8 0
1 4 1
2 5 4
3 9 7
4 8 5
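Applied to the frames named in the question, that becomes:
cols_to_use = DF_A.columns.difference(DF_B.columns)
DF_A[cols_to_use]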

Pandas DataFrame column value remapping

Assuming the following DataFrame:
df = pd.DataFrame({'id': [8,16,23,8,23], 'count': [5,8,7,1,2]}, columns=['id', 'count'])
id count
0 8 5
1 16 8
2 23 7
3 8 1
4 23 2
...is there some Pandas magic that allows me to remap the ids so that the ids become sequential? Looking for a result like:
id count
0 0 5
1 1 8
2 2 7
3 0 1
4 2 2
where the original ids [8,16,23] were remapped to [0,1,2]
Note: the remapping doesn't have to maintain original order of ids. For example, the following remapping would also be fine: [8,16,23] -> [2,0,1], but the id space after remapping should be contiguous.
I'm currently using a for loop and a dict to keep track of the remapping, but it feels like Pandas might have a better solution.
Use factorize:
>>> df
id count
0 8 5
1 16 8
2 23 7
3 8 1
4 23 2
>>> df['id'] = pd.factorize(df['id'])[0]
>>> df
id count
0 0 5
1 1 8
2 2 7
3 0 1
4 2 2
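A side note (not from the original answer): factorize assigns codes in order of first appearance; if you prefer the codes to follow the sorted order of the original ids, it takes a sort flag:
df['id'] = pd.factorize(df['id'], sort=True)[0]   # codes follow sorted unique ids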
You can do this via a groupby's labels:
In [11]: df
Out[11]:
id count
0 8 5
1 16 8
2 23 7
3 8 1
4 23 2
In [12]: g = df.groupby("id")
In [13]: g.grouper.labels
Out[13]: [array([0, 1, 2, 0, 2])]
In [14]: df["id"] = g.grouper.labels[0]
In [15]: df
Out[15]:
id count
0 0 5
1 1 8
2 2 7
3 0 1
4 2 2
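Note that g.grouper.labels is an internal attribute that newer pandas versions no longer expose; in recent releases the public equivalent should be GroupBy.ngroup, which numbers each group sequentially (not part of the original answer):
df['id'] = df.groupby('id').ngroup()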
This may be helpful to you.
x, y = pd.factorize(df['id'])              # x: new sequential codes, y: the original unique ids
remap = dict(set(zip(list(x), list(y))))   # maps new id -> original id

Pandas: expand index of a series so it contains all values in a range

I have a pandas series that looks like this:
>>> x.sort_index()
2 1
5 2
6 3
8 4
I want to fill out this series so that the "missing" index rows are represented, filling in the data values with a 0.
So that when I list the new series, it looks like this:
>>> z.sort_index()
1 0
2 1
3 0
4 0
5 2
6 3
7 0
8 4
I have tried creating a "dummy" Series
>>> y = pd.Series([0 for i in range(0,8)])
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
And then concat'ing them together - but the results are either:
>>> pd.concat([x,y],axis=0)
2 1
5 2
6 3
8 4
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
Or
>>> pd.concat([x,y],axis=1)
0 1
0 NaN 0
1 NaN 0
2 1 0
3 NaN 0
4 NaN 0
5 2 0
6 3 0
7 NaN 0
8 4 NaN
Neither of which is my target structure listed above.
I could try performing some arithmetic on the axis=1 version, and taking a sum of columns 1 and 2, but am looking for a neater, one-line version of this - does such an index filling/cleansing operation exist, and if so, what is it?
What you want is a reindex. First create the index as you want (in this case just a range), and then reindex with it:
In [64]: x = pd.Series([1,2,3,4], index=[2,5,6,8])
In [65]: x
Out[65]:
2 1
5 2
6 3
8 4
dtype: int64
In [66]: x.reindex(range(9), fill_value=0)
Out[66]:
0 0
1 0
2 1
3 0
4 0
5 2
6 3
7 0
8 4
dtype: int64
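To match the target index from the question exactly (1 through 8 rather than starting at 0), the same call with a shifted range should do it:
x.reindex(range(1, 9), fill_value=0)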
Apologies - slightly embarrassing situation, but having read up on what to do in this situation, I am offering an answer to my own question.
I read the documentation here - one way of doing what I'm looking for is this:
>>> x.combine_first(y)
0 0
1 0
2 1
3 0
4 0
5 2
6 3
7 0
8 4
dtype: float64
N.B. in the above,
>>> y = pd.Series([0 for i in range(0,8)])
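As the output above shows, combine_first upcasts the result to float64 here; if you want integers back, casting afterwards should work:
z = x.combine_first(y).astype(int)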
