Split string in a column based on character position - python

I have a dataframe like this:
Basic Stats Min Max Mean Stdev
1 LT50300282010256PAC01 0.336438 0.743478 0.592622 0.052544
2 LT50300282009269PAC01 0.313259 0.678561 0.525667 0.048047
3 LT50300282008253PAC01 0.374522 0.746828 0.583513 0.055989
4 LT50300282007237PAC01 -0.000000 0.749325 0.330068 0.314351
5 LT50300282006205PAC01 -0.000000 0.819288 0.600136 0.170060
and for the column Basic Stats I want to retain only the characters at positions 9 through 12 (i.e. the slice [9:13]), so for row 1 I only want to retain 2010 and for row 2 I only want to retain 2009. Is there a way to do this?

Just use the vectorised str accessor to slice your strings:
In [23]:
df['Basic Stats'].str[9:13]
Out[23]:
0 2010
1 2009
2 2008
3 2007
4 2006
Name: Basic Stats, dtype: object
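For a self-contained run (a minimal sketch; the stats columns are omitted and only the Basic Stats values from the question are assumed), assign the slice back to overwrite the column:
import pandas as pd

df = pd.DataFrame({'Basic Stats': ['LT50300282010256PAC01',
                                   'LT50300282009269PAC01',
                                   'LT50300282008253PAC01']})
df['Basic Stats'] = df['Basic Stats'].str[9:13]  # keep only the year characters
print(df)
#   Basic Stats
# 0        2010
# 1        2009
# 2        2008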

One way would be to use
df['Basic Stats'] = df['Basic Stats'].map(lambda x: x[9:13])

You can slice
df["Basic Stats"] = df["Basic Stats"].str.slice(9,13)
Output:
Basic Stats Min Max Mean Stdev
0 2010 0.336438 0.743478 0.592622 0.052544
1 2009 0.313259 0.678561 0.525667 0.048047
2 2008 0.374522 0.746828 0.583513 0.055989
3 2007 -0.000000 0.749325 0.330068 0.314351
4 2006 -0.000000 0.819288 0.600136 0.170060

You can do:
df["Basic Stats"] = [ x[9:13] for x in df["Basic Stats"] ]
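One practical difference between these approaches (my note, not from the answers above): the .str accessor propagates missing values, while the list comprehension and map versions raise on them. A small sketch:
import pandas as pd

s = pd.Series(['LT50300282010256PAC01', None])
print(s.str[9:13])        # the None entry stays NaN
# [x[9:13] for x in s]    # would raise TypeError: 'NoneType' object is not subscriptable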


How to remove .0 from number

I have one column DOB(Year) in the df dataframe, which consists of values like the ones below:
DOB(Year)
1990.0
1998.0
2015.0
2017.0
I want to remove .0 from all values.
I have tried
df['DOB(Year)'] = df['DOB(Year)'].astype(str)
df['DOB(Year)'] = df['DOB(Year)'].str.replace(".0$", "", regex=True)
But the resulting column values are nan.
Can anyone please suggest a solution for this?
If you want a safe method that works on numeric/string input:
df['DOB(Year)'] = (pd.to_numeric(df['DOB(Year)'], errors='coerce')
                     .round()
                     .convert_dtypes())
Example (as new column):
DOB(Year) DOB(Year)_converted
0 1990.0 1990
1 1998.0 1998
2 2015.0 2015
3 2017.0 2017
4 2011.0001 2011
5 abc <NA>
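For reference, a runnable version of the same idea (the column values are the ones from the question plus the two edge cases shown above):
import pandas as pd

df = pd.DataFrame({'DOB(Year)': ['1990.0', '1998.0', '2015.0', '2017.0',
                                 '2011.0001', 'abc']})
df['DOB(Year)_converted'] = (pd.to_numeric(df['DOB(Year)'], errors='coerce')
                               .round()
                               .convert_dtypes())
print(df)  # 'abc' is coerced to NaN and ends up as <NA> in the nullable Int64 column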
Try this:
df['DOB(Year)'] = df['DOB(Year)'].astype('int')
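A caveat on this one-liner (my note, assuming a reasonably recent pandas): astype('int') raises if the column contains missing values, whereas the nullable Int64 dtype keeps them:
import pandas as pd

s = pd.Series([1990.0, 1998.0, None])
# s.astype('int')         # raises: cannot convert non-finite values (NA or inf) to integer
print(s.astype('Int64'))  # 1990, 1998, <NA>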

Dataframe group by a specific column, aggregate ratio of some other column?

I have a Data Frame with columns: Year and Min Delay. Sample rows as follows:
2014 0
2014 2
2014 0
2014 4
2015 4
2015 4
2015 2
2015 2
I want to group this dataframe by year and find the delay ratio per year (i.e. number of non-zero entries that year divided by total number of entries for that year). So if we consider the data frame above, what I am trying to get is:
2014 0.5
2015 1
(There are 2 delays out of 4 entries in 2014, and 4 delays out of 4 entries in 2015. A delay is defined by Min Delay > 0.)
This is what I tried:
def find_ratio(df):
    ratio = 1 - (len(df[df == 0]) / len(df))
    return ratio
print(df.groupby(["Year"])["Min Delay"].transform(find_ratio).unique())
which prints: [0.5 1]
How can I get a data frame instead of an array?
First, I think unique is not a good idea here, because it loses the mapping from each ratio back to its year.
Also, transform is the right tool when you need a new column in the DataFrame, not an aggregated DataFrame.
I think you need GroupBy.apply, and the function can be simplified by taking the mean of a boolean mask:
def find_ratio(df):
    ratio = (df != 0).mean()
    return ratio
print(df.groupby(["Year"])["Min Delay"].apply(find_ratio).reset_index(name='ratio'))
Year ratio
0 2014 0.5
1 2015 1.0
Solution with lambda function:
print (df.groupby(["Year"])["Min Delay"]
         .apply(lambda x: (x != 0).mean())
         .reset_index(name='ratio'))
Year ratio
0 2014 0.5
1 2015 1.0
Solution with GroupBy.transform return new column:
df['ratio'] = df.groupby(["Year"])["Min Delay"].transform(find_ratio)
print (df)
Year Min Delay ratio
0 2014 0 0.5
1 2014 2 0.5
2 2014 0 0.5
3 2014 4 0.5
4 2015 4 1.0
5 2015 4 1.0
6 2015 2 1.0
7 2015 2 1.0
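For reference, a self-contained version of the aggregated solution (the frame below rebuilds the sample data from the question):
import pandas as pd

df = pd.DataFrame({'Year': [2014]*4 + [2015]*4,
                   'Min Delay': [0, 2, 0, 4, 4, 4, 2, 2]})
out = (df.groupby('Year')['Min Delay']
         .apply(lambda x: (x != 0).mean())
         .reset_index(name='ratio'))
print(out)
#    Year  ratio
# 0  2014    0.5
# 1  2015    1.0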

Pandas: Get per-year counts for Dateranges spanning multiple years

I have a dataframe with records spanning multiple years:
WarName | StartDate | EndDate
---------------------------------------------
'fakewar1' 01-01-1990 02-02-1995
'examplewar' 05-01-1990 03-07-1998
(...)
'examplewar2' 05-07-1999 06-09-2002
I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:
Year | Number_of_wars
----------------------------
1989 0
1990 2
1991 2
1992 3
1994 2
Usually I would use something like df.groupby('year').count() to get total wars by year, but since I am currently working with ranges instead of set dates that approach wouldn't work.
I am currently writing a function that generates a list of years, and then for each year in the list checks each row in the dataframe and runs a function that checks if the year is within the date-range of that row (returning True if that is the case).
years = range(1816, 2006)
year_dict = {}
for year in years:
    for index, row in df.iterrows():
        in_range = year_in_range(year, row)
        if in_range:
            year_dict[year] = year_dict.get(year, 0) + 1
This works, but it also seems extremely convoluted. So I was wondering, what am I missing? What would be the canonical 'pandas-way' to solve this issue?
Use a comprehension with pd.value_counts
pd.value_counts([
    d.year for s, e in zip(df.StartDate, df.EndDate)
    for d in pd.date_range(s, e, freq='Y')
]).sort_index()
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
Alternate
from functools import reduce
def r(t):
    return pd.date_range(t.StartDate, t.EndDate, freq='Y')
pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()
Setup
df = pd.DataFrame(dict(
    WarName=['fakewar1', 'examplewar', 'feuxwar2'],
    StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
    EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
), columns=['WarName', 'StartDate', 'EndDate'])
df
WarName StartDate EndDate
0 fakewar1 1990-01-01 1995-02-02
1 examplewar 1990-05-01 1998-03-07
2 feuxwar2 1999-05-07 2002-06-09
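One caveat worth adding here (my note, not part of the original answer): freq='Y' generates year-end timestamps, so a range that starts and ends within a single calendar year yields no dates at all, and that war would silently drop out of the counts. A quick sketch of the edge case:
import pandas as pd

# a "war" contained within one calendar year produces an empty range
print(len(pd.date_range('1995-03-01', '1995-06-01', freq='Y')))  # 0
# counting year numbers directly still covers it
print(list(range(pd.Timestamp('1995-03-01').year,
                 pd.Timestamp('1995-06-01').year + 1)))          # [1995]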
By using np.unique
x, y = np.unique(sum([list(range(x.year, y.year))
                      for x, y in zip(df.StartDate, df.EndDate)], []),
                 return_counts=True)
pd.Series(dict(zip(x, y)))
Out[222]:
1990 2
1991 2
1992 2
1993 2
1994 2
1995 1
1996 1
1997 1
1999 1
2000 1
2001 1
dtype: int64
The other answers with pandas are far preferable, but the native Python answer you showed didn't have to be so convoluted; just instantiate and directly index into an array:
wars = [0] * 191   # max(df['EndDate']).year - min(df['StartDate']).year + 1
yr_offset = 1816   # min(df['StartDate']).year
for _, row in df.iterrows():
    for yr in range(row['StartDate'].year - yr_offset, row['EndDate'].year - yr_offset):  # or maybe (year+1)
        wars[yr] += 1

Filtering Dataframe in Python

I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing, for each year, the country with the maximum count.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way is to use groupby along with size to get the count in each category, then sort the values and slice by the number of unique years. You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
Use:
1.
First get the count of each (Year, Country) pair with groupby and size.
Then get the index of the max value per year with idxmax and select the rows with loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use custom function with value_counts and head:
df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year','Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Here is a method without groupby:
Count = (pd.Series(list(zip(df.Year, df.Country))).value_counts()
           .head(2).reset_index(name='Count'))
Count[['Year','Country']] = Count['index'].apply(pd.Series)
Count.drop('index', axis=1)
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India
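With a newer pandas (DataFrame.value_counts exists from 1.1), the same result can be had more directly; a sketch under that assumption, rebuilding the question's data:
import pandas as pd

df = pd.DataFrame({'Year':    [2015]*5 + [2016]*3,
                   'Country': ['US', 'US', 'UK', 'Indonesia', 'US',
                               'India', 'India', 'UK']})
counts = df.value_counts(['Year', 'Country']).reset_index(name='Count')
# value_counts sorts by count descending, so the first row per year is its maximum
out = counts.drop_duplicates('Year').sort_values('Year').reset_index(drop=True)
print(out)
#    Year Country  Count
# 0  2015      US      3
# 1  2016   India      2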

Pandas: Rolling sum with multiple indexes (i.e. panel data)

I have a dataframe with a MultiIndex and would like to compute a rolling sum of some data, separately for each id in the index.
For instance, let us say I have two indexes (Firm and Year) and I have some data with name zdata. The working example is the following:
import pandas as pd
# generating data
firms = ['firm1']*5+['firm2']*5
years = [2000+i for i in range(5)]*2
zdata = [1 for i in range(10)]
# Creating the dataframe
mydf = pd.DataFrame({'firms':firms,'year':years,'zdata':zdata})
# Setting the two indexes
mydf.set_index(['firms','year'],inplace=True)
print(mydf)
zdata
firms year
firm1 2000 1
2001 1
2002 1
2003 1
2004 1
firm2 2000 1
2001 1
2002 1
2003 1
2004 1
And now, I would like to have a rolling sum that starts over for each firm. However, if I type
new_rolling_df=mydf.rolling(window=2).sum()
print(new_rolling_df)
zdata
firms year
firm1 2000 NaN
2001 2.0
2002 2.0
2003 2.0
2004 2.0
firm2 2000 2.0
2001 2.0
2002 2.0
2003 2.0
2004 2.0
It doesn't take the MultiIndex into account and just does a normal rolling sum across the whole frame. Does anyone have an idea how I should do this, especially since I have even more index levels than two (firm, worker, country, year)?
Thanks,
Adrien
Option 1
mydf.unstack(0).rolling(2).sum().stack().swaplevel(0, 1).sort_index()
Option 2
mydf.groupby(level=0, group_keys=False).rolling(2).sum()
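A note on Option 2 (my addition; the exact index shape varies across pandas versions): the groupby/rolling result may carry the firm label twice in the index. Dropping the extra level restores the original (firms, year) shape, assuming the mydf built in the question:
rolled = mydf.groupby(level='firms').rolling(2).sum()
# some pandas versions prepend the group key, giving a 3-level index
if rolled.index.nlevels == 3:
    rolled = rolled.droplevel(0)
print(rolled)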
