Removing words from strings within a dataframe column - Python

I have a dataframe like this:
Num Text
1 15 March 2020 - There was...
2 15 March 2020 - There has been...
3 24 April 2018 - Nothing has ...
4 07 November 2014 - The Kooks....
...
I would like to remove the first 4 words from each row in Text (i.e. 15 March 2020 - , 24 April 2018 - , ...).
I tried
df['Text'] = df['Text'].str.replace(' ', )
but I do not know what I should include in the brackets to replace those values with an empty string (or just nothing).

You can use Series.str.split (limiting the number of splits with n) and take the last element:
df['test'].str.split(n=4).str[-1]

What might work is using the split command to split the string into words and then take everything after the 4th word using [4:].
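A minimal sketch of that idea (using a toy frame modeled on the question's data):

```python
import pandas as pd

df = pd.DataFrame({'Text': ['15 March 2020 - There was',
                            '07 November 2014 - The Kooks']})
# Split each string on whitespace, drop the first four tokens
# ('15', 'March', '2020', '-') and re-join the remainder
df['Text'] = df['Text'].str.split().str[4:].str.join(' ')
print(df['Text'].tolist())  # ['There was', 'The Kooks']
```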

Python regular expressions can handle this too; for the four leading tokens in the question you could use df['Text'].str.replace(r'^\d+ \w+ \d+ - ', '', regex=True). The documentation for Python's re module explains how to detect different patterns in strings.

Even if it is less elegant, I prefer to use .find() with .apply(). Whatever happens, the first "- " is taken as the delimiter.
t = pd.DataFrame({'Num':[1,2,3,4], 'Text':['15 March 2020 - There was','15 March 2020 - There has been','24 April 2018 - Nothing has','07 November 2014 - The Kooks']})
t["text2"] = t.apply(lambda x: x['Text'][str(x['Text']).find("- ")+2:], axis=1)
This:
Num Text
1 15 March 2020 - There was...
2 15 March 2020 - There has been...
3 24 April 2018 - Nothing has ...
4 07 November 2014 - The Kooks....
becomes this:
Num Text text2
0 1 15 March 2020 - There was There was
1 2 15 March 2020 - There has been There has been
2 3 24 April 2018 - Nothing has Nothing has
3 4 07 November 2014 - The Kooks The Kooks

You can do this using str.split:
Considering your df to be:
In [1193]: df = pd.DataFrame({'Num':[1,2,3,4], 'Text':['15 March 2020 - There was','15 March 2020 - There has been','24 April 2018 - Nothing has','07 November 2014 - The Kooks']})
In [1194]: df
Out[1194]:
Num Text
0 1 15 March 2020 - There was
1 2 15 March 2020 - There has been
2 3 24 April 2018 - Nothing has
3 4 07 November 2014 - The Kooks
In [1207]: df['Text'].str.split().str[4:].apply(' '.join)
Out[1207]:
0 There was
1 There has been
2 Nothing has
3 The Kooks
Name: Text, dtype: object


Move data from row 1 to row 0

I have this function written in Python. I want it to show the difference between rows of the production column.
Here's the code
def print_df():
    mycursor.execute("SELECT * FROM productions")
    myresult = mycursor.fetchall()
    myresult.sort(key=lambda x: x[0])
    df = pd.DataFrame(myresult, columns=['Year', 'Production (Ton)'])
    df['Dif'] = abs(df['Production (Ton)'].diff())
    print(df)
And of course the output is this
Year Production (Ton) Dif
0 2010 339491 NaN
1 2011 366999 27508.0
2 2012 361986 5013.0
3 2013 329461 32525.0
4 2014 355464 26003.0
5 2015 344998 10466.0
6 2016 274317 70681.0
7 2017 200916 73401.0
8 2018 217246 16330.0
9 2019 119830 97416.0
10 2020 66640 53190.0
But I want the output like this
Year Production (Ton) Dif
0 2010 339491 27508.0
1 2011 366999 5013.0
2 2012 361986 32525.0
3 2013 329461 26003.0
4 2014 355464 10466.0
5 2015 344998 70681.0
6 2016 274317 73401.0
7 2017 200916 16330.0
8 2018 217246 97416.0
9 2019 119830 53190.0
10 2020 66640 66640.0
What should I change or add to my code?
You can use a negative period input to diff to get the differences the way you want, and then fillna to fill the last value with the value from the Production column:
df['Dif'] = df['Production (Ton)'].diff(-1).fillna(df['Production (Ton)']).abs()
Output:
Year Production (Ton) Dif
0 2010 339491 27508.0
1 2011 366999 5013.0
2 2012 361986 32525.0
3 2013 329461 26003.0
4 2014 355464 10466.0
5 2015 344998 70681.0
6 2016 274317 73401.0
7 2017 200916 16330.0
8 2018 217246 97416.0
9 2019 119830 53190.0
10 2020 66640 66640.0
Use shift(-1) to shift all rows one position up.
df['Dif'] = (df['Production (Ton)'] - df['Production (Ton)'].shift(-1).fillna(0)).abs()
Notice that by setting fillna(0), you avoid the NaNs.
You can also use diff:
df['Dif'] = df['Production (Ton)'].diff().shift(-1).fillna(0).abs()

Python pandas: Calculated Week number overlaps with two months

I have a data frame that contains daily data for the last five years. Besides the value columns, the data frame also contains a date field and a regulatory year column. I wanted to create two columns: the regulatory week number and the regulatory month number. The regulatory year starts on the 1st of April and ends on the 31st of March. So I used the following code to generate the regulatory week number and month number:
df['Week'] = np.where(df['date'].dt.isocalendar().week > 13, df['date'].dt.isocalendar().week-13,df['date'].dt.isocalendar().week + 39)
df['month'] = df['date'].dt.strftime('%b')
months = ['Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar']
df['month'] = pd.Categorical(df['month'], ordered=True, categories=months)
df['month number'] = df['month'].apply(lambda x: months.index(x)+1)
After creating the above-mentioned two columns, my data frame looks as follows:
RY month Week Value 1 Value 2 Value 3 Value 4 month number
2016 Apr 1 0.00000 0.00000 0.000000 0.00000 1
2016 Apr 2 1.31394 0.02961 1.313940 0.02961 1
2016 Apr 3 4.98354 0.07146 4.983540 0.07146 1
2016 Apr 4 4.30606 0.05742 4.306060 0.05742 1
2016 Apr 5 1.94634 0.01958 1.946340 0.01958 1
2016 May 5 0.25342 0.01625 0.253420 0.01625 2
2016 May 6 0.64051 0.00777 0.640510 0.00777 2
2016 May 7 1.26451 0.02994 1.264510 0.02994 2
2016 May 8 2.71035 0.08150 2.194947 0.08150 2
2016 May 9 11.95120 0.13386 1.624328 0.13386 2
2016 Jun 10 6.93051 0.08126 6.930510 0.08126 3
2016 Jun 11 1.18872 0.03953 1.188720 0.03953 3
2016 Jun 12 3.19961 0.05760 0.924562 0.05760 3
2016 Jun 13 3.90429 0.04985 0.956445 0.04985 3
2016 Jun 14 0.84002 0.01738 0.840020 0.01738 3
2016 Jul 14 0.07358 0.00562 0.073580 0.00562 4
2016 Jul 15 0.78253 0.03014 0.782530 0.03014 4
2016 Jul 16 1.23036 0.01816 1.230360 0.01816 4
2016 Jul 17 0.62948 0.01341 0.629480 0.01341 4
2016 Jul 18 0.45513 0.00552 0.455130 0.00552 4
Now I want to create a data frame that contains the mean of the value columns grouped by Week. So I used the following command to calculate the mean:
mean_df = df.groupby('Week')[['Value 1','Value 2','Value 3','Value 4']].mean().reset_index()
The new dataframe looks as follows:
Week Value 1 Value 2 Value 3 Value 4
1 3.013490 0.039740 1.348016 0.039740
2 3.094456 0.045142 3.094456 0.045142
3 1.615948 0.027216 1.615948 0.027216
4 2.889245 0.043998 1.903319 0.043998
5 0.431549 0.009679 0.431549 0.009679
6 1.045670 0.017302 1.045670 0.017302
7 2.444196 0.034304 2.444196 0.034304
8 1.041210 0.026464 0.938129 0.026464
9 2.068607 0.030550 0.921176 0.030550
10 2.400118 0.051476 2.400118 0.051476
11 1.738332 0.035362 1.738332 0.035362
12 1.369790 0.038576 0.914780 0.038576
13 1.921781 0.021218 0.749460 0.021218
14 1.471432 0.027367 1.471432 0.027367
15 2.722526 0.053794 1.676559 0.053794
16 3.132406 0.043520 1.195321 0.043520
17 0.733952 0.021142 0.733952 0.021142
18 0.645236 0.014454 0.645236 0.014454
19 2.466326 0.049704 0.879481 0.049704
20 2.111326 0.013262 0.682253 0.013262
21 1.301004 0.023048 1.301004 0.023048
22 0.705360 0.023439 0.705360 0.023439
23 1.323438 0.019103 1.323438 0.019103
24 0.569906 0.012540 0.569906 0.012540
25 7.898792 0.034246 1.382349 0.034246
26 0.896413 0.013013 0.896413 0.013013
27 4.478349 0.039749 1.703887 0.039749
28 5.807160 0.052526 2.036502 0.052526
29 3.308176 0.043984 2.117939 0.043984
30 1.991078 0.046058 1.991078 0.046058
31 0.806589 0.016945 0.806589 0.016945
32 2.091860 0.029234 2.091860 0.029234
33 1.149280 0.025194 1.149280 0.025194
34 4.746376 0.067742 2.863484 0.067742
35 5.128558 0.029608 1.537541 0.029608
36 2.765563 0.052125 2.765563 0.052125
37 2.314376 0.036046 2.314376 0.036046
38 2.552290 0.030626 1.483397 0.030626
39 1.456778 0.037448 1.456778 0.037448
40 1.212090 0.024698 1.212090 0.024698
41 4.729104 0.037646 1.296358 0.037646
42 3.412830 0.053132 3.412830 0.053132
43 8.916526 0.050044 1.839411 0.050044
44 2.450281 0.029806 0.942205 0.029806
45 2.156186 0.024064 2.156186 0.024064
46 2.336330 0.042538 2.336330 0.042538
47 1.798326 0.025270 1.798326 0.025270
48 1.352004 0.018382 1.352004 0.018382
49 10.220510 0.073480 1.607830 0.073480
50 2.575344 0.047760 2.575344 0.047760
51 1.226056 0.028676 1.226056 0.028676
52 0.470392 0.009991 0.466561 0.009991
Now I want to insert the month number and month name from the first data frame into the new data frame. I thought to merge the two data frames on 'Week', but I found that the same week number is assigned to two different months (in the first data frame). For example, Week 5 is assigned to both April and May.
Ideally, a week number should be assigned to only one month. I am not sure whether I am calculating the week number in the right manner. Has anyone come across the same problem? Any advice on how to calculate the week number so that a week does not overlap two months?
Presumably, week 5 contains some days in April and some in May. So it's not possible to assign week 5 (as a whole) to a single month.
Perhaps you could assign the month in which the first day of the week falls?
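A sketch of that fix, reusing the week formula from the question on a toy date column (the column names here are illustrative): each regulatory week is labeled with the month of its earliest date, so no week spans two months.

```python
import pandas as pd
import numpy as np

# Toy frame: daily dates spanning an April/May boundary
df = pd.DataFrame({'date': pd.date_range('2016-04-25', periods=14, freq='D')})
iso = df['date'].dt.isocalendar().week.astype(int)
# Same regulatory-week formula as in the question
df['Week'] = np.where(iso > 13, iso - 13, iso + 39)
# Assign each week the month of its first day, so a week
# never overlaps two months
df['month'] = df.groupby('Week')['date'].transform('min').dt.strftime('%b')
```

After this, merging on 'Week' gives a unique month per week by construction.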

Pandas: Simple Analysis of growth (comparatively) and with Fillna

Below is the basic data I'm provided with every month. There are many department-related files I get, and the job gets very monotonous and repetitive.
Month,year,sales,
January,2017,34400,
February,2017,35530,
March,2017,34920,
April,2017,35950,
May,2017,36230,
June,2017,36820,
July,2017,34590,
August,2017,36500,
September,2017,36600,
October,2017,37140,
November,2017,36790,
December,2017,43500,
January,2018,34900,
February,2018,37700,
March,2018,37900,
April,2018,38100,
May,2018,37800,
June,2018,38500,
July,2018,39400,
August,2018,39700,
September,2018,39980,
October,2018,40600,
November,2018,39100,
December,2018,46600,
January,2019,42500,
I've tried functions like value_counts (which sadly gives only a summary) to achieve this output, and failed. (See the desired output below.)
I need to autofill the 3rd and 4th columns (with fillna=True/False).
The third column just tells whether the month is P/L compared to the previous month (e.g. if April is greater than March, then it is Profit).
The fourth column shows the running streak of P/L, i.e. 2 months or 5 months of profit (or loss) in a row (continuously, as it results in certain awards/recognition for teams).
The fifth column is the max sales achieved in the last 'n' months.
They only allow Apache OpenOffice for our job, and hence no Excel. But we have permission from IT to install Python.
The solution in this Link is not helping me as they are grouping-by two columns. The columns in my output are inter-dependent.
import pandas as pd
df = pd.read_csv("Test_1.csv")
df['comparative_position'] = df['sales'].diff()
df.loc[df['comparative_position'] > 0.0, 'comparative_position'] = "Profit"
df.loc[df['comparative_position'] < 0.0, 'comparative_position'] = "Loss"
Month,Year,Sales,comparative_position,Months_in_P(or)L,Highest_in_12Months
January,2016,34400,NaN,NaN,NaN
February,2016,35530,Profit,1,NaN
March,2016,34920,Loss,1,NaN
April,2016,35950,Profit,1,NaN
May,2016,36230,Profit,2,NaN
June,2016,36820,Profit,3,NaN
July,2016,34590,Loss,1,NaN
August,2016,36500,Profit,1,NaN
September,2016,36600,Profit,2,NaN
October,2016,37140,Profit,3,NaN
November,2016,36790,Loss,1,NaN
December,2016,43500,Profit,1,43500
January,2017,34900,Loss,1,43500
February,2017,37700,Profit,1,43500
March,2017,37900,Profit,2,43500
April,2017,38100,Profit,3,43500
May,2017,37800,Loss,1,43500
June,2017,38500,Profit,1,43500
July,2017,39400,Profit,2,43500
August,2017,39700,Profit,3,43500
September,2017,39980,Profit,4,43500
October,2017,40600,Profit,5,43500
November,2017,39100,Loss,1,43500
December,2017,46600,Profit,1,46600
January,2018,42500,Loss,1,46600
AFAIU this should work for you:
# Get difference from previous as True / False
df['P/L'] = df.sales > df.sales.shift()
# Add column counting 'streaks' of P or L
df['streak'] = df['P/L'].groupby(df['P/L'].ne(df['P/L'].shift()).cumsum()).cumcount()
# map True/False to string of Profit/Loss
df['P/L'] = df['P/L'].map({True:'Profit', False:'Loss'})
# max of last n months where n is 12, as in your example, you can change it to any int
df['12_max'] = df.sales.rolling(12).max()
Output:
Month year sales P/L streak 12_max
0 January 2017 34400 False 0 NaN
1 February 2017 35530 True 0 NaN
2 March 2017 34920 False 0 NaN
3 April 2017 35950 True 0 NaN
4 May 2017 36230 True 1 NaN
5 June 2017 36820 True 2 NaN
6 July 2017 34590 False 0 NaN
7 August 2017 36500 True 0 NaN
8 September 2017 36600 True 1 NaN
9 October 2017 37140 True 2 NaN
10 November 2017 36790 False 0 NaN
11 December 2017 43500 True 0 43500.0
12 January 2018 34900 False 0 43500.0
13 February 2018 37700 True 0 43500.0
14 March 2018 37900 True 1 43500.0
15 April 2018 38100 True 2 43500.0
16 May 2018 37800 False 0 43500.0
17 June 2018 38500 True 0 43500.0
18 July 2018 39400 True 1 43500.0
19 August 2018 39700 True 2 43500.0
20 September 2018 39980 True 3 43500.0
21 October 2018 40600 True 4 43500.0
22 November 2018 39100 False 0 43500.0
23 December 2018 46600 True 0 46600.0
24 January 2019 42500 False 0 46600.0

pandas histogram: extracting column and group by from data

I have a dataframe for which I'm looking at histograms of subsets of the data using column and by of pandas' hist() method, as in:
ax = df.hist(column='activity_count', by='activity_month')
(then I go along and plot this info). I'm trying to determine how to programmatically pull out two pieces of data: the number of records with that particular value of 'activity_month' as well as the value of 'activity_month' when I loop over the axes:
for i, x in enumerate(ax):
    print("the value of a is", a)
    print("the number of rows with value of a is", b)
so that I'd get:
January 1002
February 4305
etc
Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have activity_month equal to a given value:
a = "January"
len(df[df["activity_month"] == a])
but I'd like to do that within the loop, for a particular iteration of i,x. How do I get a handle on the subsetted data within "x" on each iteration so I can look at the value of the "activity_month" and the number of rows with that value on that iteration?
Here is a short example dataframe:
import pandas as pd
df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
['November',4],['February',98],['January',44],['October',47],['January',4],
['April',8],['March',21],['April',41],['June',34],['March',63]],
columns=['activity_month','activity_count'])
Yields:
activity_month activity_count
0 January 19
1 March 6
2 January 24
3 November 83
4 February 23
5 November 4
6 February 98
7 January 44
8 October 47
9 January 4
10 April 8
11 March 21
12 April 41
13 June 34
14 March 63
If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:
df.groupby('activity_month')['activity_count'].sum()
Gives:
activity_month
April 49
February 121
January 91
June 34
March 90
November 87
October 47
Name: activity_count, dtype: int64
To get the number of rows that correspond to a given group:
df.groupby('activity_month')['activity_count'].agg('count')
Gives:
activity_month
April 2
February 2
January 4
June 1
March 3
November 2
October 1
Name: activity_count, dtype: int64
After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.
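For instance, the per-month row counts the loop was chasing come straight from the frame (a sketch on a cut-down version of the example data):

```python
import pandas as pd

df = pd.DataFrame([['January', 19], ['March', 6], ['January', 24],
                   ['February', 23], ['January', 44]],
                  columns=['activity_month', 'activity_count'])
# Rows per month, straight from the data; no axes handling needed
counts = df['activity_month'].value_counts()
print(counts['January'])  # 3
```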

set_index equivalent for column headings

In Pandas, if I have a DataFrame that looks like:
0 1 2 3 4 5 6
0 2013 2012 2011 2010 2009 2008
1 January 3,925 3,463 3,289 3,184 3,488 4,568
2 February 3,632 2,983 2,902 3,053 3,347 4,527
3 March 3,909 3,166 3,217 3,175 3,636 4,594
4 April 3,903 3,258 3,146 3,023 3,709 4,574
5 May 4,075 3,234 3,266 3,033 3,603 4,511
6 June 4,038 3,272 3,316 2,909 3,057 4,081
7 July 3,661 3,359 3,062 3,354 4,215
8 August 3,942 3,417 3,077 3,395 4,139
9 September 3,703 3,169 3,095 3,100 3,752
10 October 3,727 3,469 3,179 3,375 3,874
11 November 3,722 3,145 3,159 3,213 3,567
12 December 3,866 3,251 3,199 3,324 3,362
13 Total 23,482 41,997 38,946 37,148 40,601 49,764
I can convert the first column to be the index using:
In [55]: df.set_index([0])
Out[55]:
1 2 3 4 5 6
0
2013 2012 2011 2010 2009 2008
January 3,925 3,463 3,289 3,184 3,488 4,568
February 3,632 2,983 2,902 3,053 3,347 4,527
March 3,909 3,166 3,217 3,175 3,636 4,594
April 3,903 3,258 3,146 3,023 3,709 4,574
May 4,075 3,234 3,266 3,033 3,603 4,511
June 4,038 3,272 3,316 2,909 3,057 4,081
July 3,661 3,359 3,062 3,354 4,215
August 3,942 3,417 3,077 3,395 4,139
September 3,703 3,169 3,095 3,100 3,752
October 3,727 3,469 3,179 3,375 3,874
November 3,722 3,145 3,159 3,213 3,567
December 3,866 3,251 3,199 3,324 3,362
Total 23,482 41,997 38,946 37,148 40,601 49,764
My question is how to convert the first row to be the column headings?
The closest I can get is:
In [53]: df.set_index([0]).rename(columns=df.loc[0])
Out[53]:
2013 2012 2011 2010 2009 2008
0
2013 2012 2011 2010 2009 2008
January 3,925 3,463 3,289 3,184 3,488 4,568
February 3,632 2,983 2,902 3,053 3,347 4,527
March 3,909 3,166 3,217 3,175 3,636 4,594
April 3,903 3,258 3,146 3,023 3,709 4,574
May 4,075 3,234 3,266 3,033 3,603 4,511
June 4,038 3,272 3,316 2,909 3,057 4,081
July 3,661 3,359 3,062 3,354 4,215
August 3,942 3,417 3,077 3,395 4,139
September 3,703 3,169 3,095 3,100 3,752
October 3,727 3,469 3,179 3,375 3,874
November 3,722 3,145 3,159 3,213 3,567
December 3,866 3,251 3,199 3,324 3,362
Total 23,482 41,997 38,946 37,148 40,601 49,764
but then I have to go in and remove the first row.
The best way to handle this is to avoid getting into this situation.
How was df created? For example, if you used read_csv or a variant, then header=0 will tell read_csv to parse the first line as the column names.
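For example, a sketch with an in-memory CSV (a real call would pass the filename on disk instead of a StringIO buffer):

```python
import io
import pandas as pd

# Simulated file contents standing in for the real file
csv = "Month,2013,2012\nJanuary,3925,3463\nFebruary,3632,2983\n"
# header=0 parses the first line as column names;
# index_col=0 makes the Month column the index
df = pd.read_csv(io.StringIO(csv), header=0, index_col=0)
print(df.columns.tolist())  # ['2013', '2012']
```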
Given df as you have it, I don't think there is an easier way to fix it than what you've described. To remove the first row, you could use df.iloc:
df = df.iloc[1:]
I'm not sure if this is more efficient, but you could try creating a data frame with the correct index and default column names out of your problem data frame, and then rename the columns, also using the problematic data frame. For example:
import pandas as pd
from pandas import DataFrame

data = {'0': [' ', 'Jan', 'Feb', 'Mar', 'April'],
        '1': ['2013', 3926, 3456, 3245, 1254],
        '2': ['2012', 3346, 4342, 1214, 4522],
        '3': ['2011', 3946, 4323, 1214, 8922]}
DF = DataFrame(data)
DF2 = DF.iloc[1:, 1:].set_index(DF.iloc[1:, 0])
DF2.columns = DF.iloc[0, 1:]
DF2
If there is a valid index you can double-transpose like this:
If you know the name of the row (in this case: 0)
df.T.set_index(0).T
If you know the position of the row (in this case: 0)
df.T.set_index(df.index[0]).T
Or for multiple rows to MultiIndex:
df.T.set_index(list(df.index[0:2])).T
