I have a Python pandas dataframe like this, with 200k to 400k rows:
Index value
1 a
2
3 v
4
5
6 6077
7
8 h
I want the value column to be filled downward with the most recent single-character string value, so that blank rows and numeric entries (like 6077 here) are replaced by the last letter seen above them.
I want my dataframe to look like this:
Index value
1 a
2 a
3 v
4 v
5 v
6 v
7 v
8 h
If you need to propagate only the strings of length 1, use Series.str.match with the regex ^[a-zA-Z]$ to check for single-letter strings, replace the non-matching values with NaN via Series.where, and finally forward fill the missing values with ffill:
df['value'] = df['value'].where(df['value'].str.match('^[a-zA-Z]{1}$', na=False)).ffill()
print (df)
Index value
0 1 a
1 2 a
2 3 v
3 4 v
4 5 v
5 6 v
6 7 v
7 8 h
Another idea:
m1 = df['value'].str.len().eq(1)
m2 = df['value'].str.isalpha()
df['value'] = df['value'].where(m1 & m2).ffill()
The forward fill method in fillna is exactly for this, as long as the empty cells are NaN rather than empty strings.
This should work for you:
df.fillna(method='ffill')
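If the blank cells are actually empty strings (as the sample suggests), a minimal sketch of the same idea, assuming string data, converts them to NaN first; note this only handles the blanks, the numeric 6077 still needs the regex-based answers below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': ['a', '', 'v', '', '', '6077', '', 'h']})
# turn empty strings into NaN so forward fill has something to fill
df['value'] = df['value'].replace('', np.nan).ffill()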
try this,
import pandas as pd
import numpy as np

df['value'].replace(r'\d+', np.nan, regex=True).ffill()
0 a
1 a
2 v
3 v
4 v
5 v
6 v
7 h
Name: value, dtype: object
Once you have removed all numbers, convert the empty strings to NaN and forward fill:
df.loc[df['value'] == '', 'value'] = np.nan
df.fillna(method='ffill')
Assuming that any value that is not an empty string or a number should be forward filled, the regular expression r'^\d*$' matches both an empty string and a number. These values can be replaced with np.nan, and then ffill can be called:
import numpy as np
df['value'] = df['value'].replace(r'^\d*$', np.nan, regex=True)
df['value'] = df['value'].ffill()
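Put together with the question's sample data (a sketch, assuming the blanks are empty strings and the numbers are stored as strings in the object column):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': ['a', '', 'v', '', '', '6077', '', 'h']})
# empty strings and pure numbers both match ^\d*$
df['value'] = df['value'].replace(r'^\d*$', np.nan, regex=True).ffill()
print(df)
#   value
# 0     a
# 1     a
# 2     v
# 3     v
# 4     v
# 5     v
# 6     v
# 7     h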
I have a dataframe and I want to change some elements of a column based on a condition.
In particular, given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
The 'VALUE' column contains both strings and numbers; when I find a particular substring in a string value, I want to keep the same value with that substring removed.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both attempts return IndexError: invalid index to scalar variable.
Any advice?
As the column contains both numeric (non-string) and string values, you cannot use .str.replace(), since it handles strings only and converts non-string elements to NaN. You have to use .replace() instead.
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace(), you will get the following - note the NaN results for the numeric (non-string) values:
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
In general, if you want to remove a leading alphabetic substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
>>> df['VALUE'].str.replace(r'KKK', '')
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
Your second solution fails because you also need to apply the row selector to the right side of your assignment.
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
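Equivalently, the row selector can be computed once and reused, which avoids repeating the str.contains call (the same fix, just restyled):
mask = df['VALUE'].str.contains('KKK', na=False)
df.loc[mask, 'VALUE'] = df.loc[mask, 'VALUE'].str[3:]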
Looking at your sample data, if 'K' is the only problem, just replace it with an empty string:
df['VALUE'].str.replace('K', '')
0 0
1 "1076A"
2 12
3 9
4 "0139"
5 5
Name: VALUE, dtype: object
If you want to target only specific occurrences or positions of 'K', you can do that as well.
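For example, a sketch that strips only a leading run of 'K' characters (leaving any 'K' elsewhere in the value untouched) can use an anchored regex:
# remove 'K' only at the start of a string value; numbers are left as-is
df['VALUE'] = df['VALUE'].replace(r'^K+', '', regex=True)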
As stated, I would like to remove a specific row based on group-by logic. In the dataframe below, wherever the combination of F and G occurs for an ID, I would like to remove the row with value G.
import pandas as pd
op_d = {'ID': [1,1,2,2,3,4],'Value':['F','G','K','G','H','G']}
df = pd.DataFrame(data=op_d)
df
In this case, I would like to remove the second row, which has value 'G' for ID = 1. So far:
temp = df.groupby('ID').apply(lambda x: (x['Value'].nunique()>1)).reset_index().rename(columns={0:'Expected_Output'})
temp = temp.loc[temp['Expected_Output']==True]
multiple_options = df.loc[df['ID'].isin(temp['ID'])]
So far, I am able to figure out which IDs have more than one value. Could you tell me how to remove this specific row?
Use Series.eq with GroupBy.transform and any:
# m1: rows equal to 'F', m2: rows equal to 'G'
m1, m2 = df['Value'].eq('F'), df['Value'].eq('G')
# rows to drop: 'G' rows whose ID group contains both an 'F' and a 'G'
m = m2 & m1.groupby(df['ID']).transform('any') & m2.groupby(df['ID']).transform('any')
df1 = df[~m]
Result:
print(df1)
ID Value
0 1 F
2 2 K
3 2 G
4 3 H
5 4 G
Using isin:
c = (df['Value'].isin(['F','G']).groupby(df['ID']).transform('sum').eq(2)
& df['Value'].eq('G'))
out = df[~c].copy()
ID Value
0 1 F
2 2 K
3 2 G
4 3 H
5 4 G
I have a dataframe with duplicate rows
>>> d = pd.DataFrame({'n': ['a', 'a', 'a'], 'v': [1,2,1]})
>>> d
n v
0 a 1
1 a 2
2 a 1
I would like to understand how to use the .groupby() method specifically, so that I can add a new column to the dataframe showing the count of rows identical to the current one.
>>> dd = d.groupby(by=['n','v'], as_index=False) # Use all columns to find groups of identical rows
>>> for k,v in dd:
... print(k, "\n", v, "\n") # Check what we found
...
('a', 1)
n v
0 a 1
2 a 1
('a', 2)
n v
1 a 2
When I try dd.count() on the resulting DataFrameGroupBy object, I get IndexError: list index out of range. This seems to happen because all columns are used in the grouping operation and there is no other column left to count. Similarly, dd.agg({'n', 'count'}) fails with ValueError: no results.
I could use .apply() to achieve something that looks like the desired result:
>>> dd.apply(lambda x: x.assign(freq=len(x)))
n v freq
0 0 a 1 2
2 a 1 2
1 1 a 2 1
However, this has two issues: 1) something happens to the index, so it is hard to map the result back to the original index; 2) this does not seem like idiomatic pandas, and the manuals discourage using .apply() as it can be slow.
Is there more idiomatic way to count duplicate rows when using .groupby()?
One solution is to use GroupBy.size for an aggregated output with the counts:
d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
Your solution works if you specify a column name after the groupby, because there are no columns other than n and v in the input DataFrame:
d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
If you instead need a new column in the original DataFrame, use GroupBy.transform - the new column is filled with the aggregated values and keeps the original index:
d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
n v c
0 a 1 2
1 a 2 1
2 a 1 2
I want to drop a group (all rows in the group) if the sum of values in a group is equal to a certain value.
The following code provides an example:
>>> from numpy.random import randn
>>> df = pd.DataFrame(randn(10,10), index=pd.date_range('20130101',periods=10,freq='T'))
>>> df = pd.DataFrame(df.stack(), columns=['Values'])
>>> df.index.names = ['Time', 'Group']
>>> df.head(12)
Values
Time Group
2013-01-01 00:00:00 0 0.541795
1 0.060798
2 0.074224
3 -0.006818
4 1.211791
5 -0.066994
6 -1.019984
7 -0.558134
8 2.006748
9 2.737199
2013-01-01 00:01:00 0 1.655502
1 0.376214
>>> df['Values'].groupby('Group').sum()
Group
0 3.754481
1 -5.234744
2 -2.000393
3 0.991431
4 3.930547
5 -3.137915
6 -1.260719
7 0.145757
8 -1.832132
9 4.258525
Name: Values, dtype: float64
So the question is: how can I, for instance, drop all group rows where the grouped sum is negative? In my actual dataset I want to drop the groups where the sum or mean is zero.
Using GroupBy + transform with sum, followed by Boolean indexing:
res = df[df.groupby('Group')['Values'].transform('sum') > 0]
From the pandas documentation, filtration seems more suitable:
df2 = df.groupby('Group').filter(lambda g: g['Values'].sum() >= 0)
(Old answer):
This worked for me:
# Change the index to *just* the `Group` column
df.reset_index(inplace=True)
df.set_index('Group', inplace=True)
# Then create a filter using the groupby object
gb = df['Values'].groupby('Group')
gb_sum = gb.sum()
val_filter = gb_sum[gb_sum >= 0].index
# Print results
print(df.loc[val_filter])
The condition on which you filter can be changed accordingly.
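For the zero-sum (or zero-mean) case mentioned in the question, the same filter just needs a different condition - a sketch, using np.isclose as a tolerance because the values are floats (the result names are only illustrative):
import numpy as np

# keep only the groups whose sum is not (approximately) zero
df_nonzero = df.groupby('Group').filter(lambda g: not np.isclose(g['Values'].sum(), 0))
# the same idea with the mean instead of the sum
df_nonzero_mean = df.groupby('Group').filter(lambda g: not np.isclose(g['Values'].mean(), 0))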
I need to group by and then return the values of a column in concatenated form. While I have managed to do this, the returned dataframe has a column named 0. Just 0. Is there a way to specify what the result column will be called?
all_columns_grouped = all_columns.groupby(['INDEX','URL'], as_index = False)['VALUE'].apply(lambda x: ' '.join(x)).reset_index()
The resulting groupby object has the headers
INDEX | URL | 0
The results are in the 0 column.
While I have managed to rename the column using
.rename(index=str, columns={0: "variant"})
this seems very inelegant.
Is there a way to provide a header for the column? Thanks
The simplest approach is to remove as_index=False so that a Series is returned, and pass the name parameter to reset_index:
Sample:
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
'URL':[5,5,4,4,4,4],
'INDEX':list('aaabbb')})
print (all_columns)
INDEX URL VALUE
0 a 5 a
1 a 5 s
2 a 4 d
3 b 4 ss
4 b 4 t
5 b 4 y
all_columns_grouped = all_columns.groupby(['INDEX','URL'])['VALUE'] \
.apply(' '.join) \
.reset_index(name='variant')
print (all_columns_grouped)
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
You can use agg on a column (VALUE in this case) to assign a column name to the result of the function.
# Sample data (thanks #jezrael)
all_columns = pd.DataFrame({'VALUE':['a','s','d','ss','t','y'],
'URL':[5,5,4,4,4,4],
'INDEX':list('aaabbb')})
# Solution
>>> all_columns.groupby(['INDEX','URL'], as_index=False)['VALUE'].agg(
{'variant': lambda x: ' '.join(x)})
INDEX URL variant
0 a 4 d
1 a 5 a s
2 b 4 ss t y
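Note that this dict-based renaming on a single column's agg was deprecated and later removed; in recent pandas versions, named aggregation gives the same result (a sketch of the equivalent call):
out = (all_columns.groupby(['INDEX', 'URL'])['VALUE']
       .agg(variant=' '.join)
       .reset_index())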