I have the following DataFrame with a string column ("Info"):
df = pd.DataFrame({'Date': ["2014/02/02", "2014/02/03"],
                   'Info': ["Out of 78 shares traded during the session today, there were 54 increases, 9 without change and 15 decreases.",
                            "Out of 76 shares traded during the session today, there were 60 increases, 4 without change and 12 decreases."]})
I need to extract the numbers from "Info" into 4 new columns in the same df.
The first row should have the values [78, 54, 9, 15].
I have been trying with
df[["new1","new2","new3","new4"]]= df.Info.str.extract('(\d+(?:\.\d+)?)', expand=True).astype(int)
but I think that is more complicated than it needs to be.
Just so I understand, you're trying to avoid capturing decimal parts of numbers, right? (The (?:\.\d+)? part.)
First off, you need to use pd.Series.str.extractall if you want all the matches; extract stops after the first.
Using your df, try this code:
# Get a multiindexed dataframe using extractall
expanded = df.Info.str.extractall(r"(\d+(?:\.\d+)?)")
# Pivot the index labels
df_2 = expanded.unstack()
# Drop the multiindex
df_2.columns = df_2.columns.droplevel()
# Add the columns to the original dataframe (inplace or make a new df)
df_combined = pd.concat([df, df_2], axis=1)
extractall might be better suited for this task:
df[["new1","new2","new3","new4"]] = df['Info'].str.extractall(r'(\d+)')[0].unstack()
Date Info new1 new2 new3 new4
0 2014/02/02 Out of 78 shares traded during the session tod... 78 54 9 15
1 2014/02/03 Out of 76 shares traded during the session tod... 76 60 4 12
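Note that str.extractall returns the matches as strings; if you need numeric columns, you can cast after unstacking. A minimal sketch of that idea, reusing the same column names as above:

df[["new1", "new2", "new3", "new4"]] = (
    df['Info'].str.extractall(r'(\d+)')[0].unstack().astype(int)
)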
I'm trying to create a forecast which takes the previous day's 'Forecast' total and adds it to the current day's 'Appt'. Something which is straightforward in Excel, but which I'm struggling with in pandas. At the moment, all I can get in pandas using .loc is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02','2022-12-03','2022-12-04','2022-12-05'],
'Appt': [12,10,5,4,13],
'Forecast': [37,0,0,0,0]
})
What I'm looking for it to do is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02','2022-12-03','2022-12-04','2022-12-05'],
'Appt': [12,10,5,4,13],
'Forecast': [37,47,52,56,69]
})
E.g. the 'Forecast' total on the 1st of December is 37. On the 2nd of December the value in the 'Appt' column is 10. I want it to take 37 and add 10, then put the result in the 'Forecast' column for the 2nd of December, and then iterate over the rest of the column.
I've tried using .loc with the index, and experimented with .shift(), but neither seems to work for what I'd like. I also looked into .rolling(), but I think that's not appropriate either.
I'm sure there must be a simple way to do this?
Apologies, the original df has 'Date' as a datetime column.
You can use mask and cumsum:
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
# or, using numpy (assumes import numpy as np)
df['Forecast'] = np.where(df['Forecast'].eq(0), df['Appt'], df['Forecast']).cumsum()
Output:
         Date  Appt  Forecast
0  2022-12-01    12        37
1  2022-12-02    10        47
2  2022-12-03     5        52
3  2022-12-04     4        56
4  2022-12-05    13        69
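A minimal sketch of the same idea written out step by step, seeding the running total from the first Forecast value (assuming, as in the example, that every later Forecast entry is a 0 placeholder):

import pandas as pd

df = pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
                   'Appt': [12, 10, 5, 4, 13],
                   'Forecast': [37, 0, 0, 0, 0]})

# take each day's Appt, but replace the first entry with the known starting Forecast
running = df['Appt'].copy()
running.iloc[0] = df['Forecast'].iloc[0]
df['Forecast'] = running.cumsum()   # 37, 47, 52, 56, 69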
You have to make sure that your column has a datetime/date type; then you can filter the df like this:
# previous code & imports
from datetime import datetime, timedelta

yesterday = datetime.now().date() - timedelta(days=1)
df[df["date"] == yesterday]["your_column"].sum()
I have duplicate rows in a PySpark data frame and I want to combine and sum all of them into one row per column based on duplicate entries in one column.
Current Table
Deal_ID Title Customer In_Progress Deal_Total
30 Deal 1 Client A 350 900
30 Deal 1 Client A 360 850
50 Deal 2 Client B 30 50
30 Deal 1 Client A 125 200
30 Deal 1 Client A 90 100
10 Deal 3 Client C 32 121
Attempted PySpark Code
F.when(F.count(F.col('Deal_ID')) > 1, F.sum(F.col('In_Progress')) && F.sum(F.col('Deal_Total'))))
.otherwise(),
Expected Table
Deal_ID Title Customer In_Progress Deal_Total
30 Deal 1 Client A 925 2050
50 Deal 2 Client B 30 50
10 Deal 3 Client C 32 121
I think you need to group by the columns that identify the duplicated rows and then aggregate the amounts. I think this solves your problem:
df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
You have a SQL tag, so here is how it would work there:
select
deal_id,
title,
customer,
sum(in_progress) as in_progress,
sum(deal_total) as deal_total
from <table_name>
group by 1,2,3
Otherwise you can use the same group-by approach in Python and apply it to your DataFrame:
You have to pass in the columns you need to aggregate by as a list, then specify the aggregation type and the column you want to add up:
df = df.groupBy(['deal_id', 'title', 'Customer']).agg({'in_progress': 'sum', 'deal_total': 'sum'})
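Note that the dict-style agg names the result columns sum(in_progress) and sum(deal_total). If you want to keep the original names, a sketch using explicit aliases (assuming pyspark.sql.functions is imported as F):

from pyspark.sql import functions as F

df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg(
    F.sum('In_Progress').alias('In_Progress'),
    F.sum('Deal_Total').alias('Deal_Total'),
)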
I'm sure that this question alone is not really helpful and could mean a lot of things, so I'll try to explain it with an example.
My goal is to delete rows in a DataFrame like the following one if the row cannot be part of a run of consecutive days that is as long as a given time period t. If t is 3, for example, then the last row needs to be deleted, because there is a gap between it and the row before. If t were 4, then the first three rows would also have to be deleted, since 07.04.2012 (or 03.04.2012) is missing. Hopefully you can understand what I am trying to explain here.
Date        Value
04.04.2012  24
05.04.2012  21
06.04.2012  20
08.04.2012  21
09.04.2012  23
10.04.2012  21
11.04.2012  26
13.04.2012  24
My attempt was to iterate over the values in the 'Date' column and, for every element x, check whether the value of element x minus the value of element x + t equals -t. If that is not the case, the whole row of the element should be deleted. But while searching for how to iterate over a DataFrame I read several times that this is not recommended, because it needs a lot of computing time for big DataFrames. Unfortunately I couldn't find any other method or function that could do this. Therefore, I would be really glad if someone could help me out here. Thank you! :)
With the dates as index you can expand the index of the dataframe to include the missing days. The new dates will create NaN values. Create a group for every NaN value with .isna().cumsum() and count the size of each group. Finally, select the rows with a count greater than or equal to the desired time period.
period = 3
df.set_index('Date', inplace=True)
df[df.groupby(df.reindex(pd.date_range(df.index.min(), df.index.max()))
.Value.isna().cumsum())
.transform('count').ge(period).Value].reset_index()
Output
Date Value
0 2012-04-04 24
1 2012-04-05 21
2 2012-04-06 20
3 2012-04-08 21
4 2012-04-09 23
5 2012-04-10 21
6 2012-04-11 26
To create the dataframe used in this solution:
t = '''
Date Value
04.04.2012 24
05.04.2012 21
06.04.2012 20
08.04.2012 21
09.04.2012 23
10.04.2012 21
11.04.2012 26
13.04.2012 24
'''
import io
import pandas as pd
from datetime import datetime

# in pandas >= 2.0 you can pass date_format='%d.%m.%Y' instead of date_parser
df = pd.read_csv(io.StringIO(t), sep=r'\s+', parse_dates=['Date'],
                 date_parser=lambda x: datetime.strptime(x, '%d.%m.%Y'))
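For readability, here is the same logic as the chained expression above, split into named steps (a sketch; it assumes df already has 'Date' as its index and period defined as above):

full_range = pd.date_range(df.index.min(), df.index.max())   # every calendar day in the span
gaps = df.reindex(full_range).Value.isna().cumsum()          # group id increases after each missing day
run_length = df.groupby(gaps).transform('count')             # length of the consecutive run each row belongs to
result = df[run_length.ge(period).Value].reset_index()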
Data import from csv:
Date        Item_1  Item 2
1990-01-01  34      78
1990-01-02  42      19
...
2020-12-31  41      23
df = pd.read_csv(r'Insert file directory')
df.index = pd.to_datetime(df.index)
gb= df.groupby([(df.index.year),(df.index.month)]).mean()
Issue:
So basically, the requirement is to group the data by year and month before processing, and I thought that the groupby function would have grouped the data so that mean() calculates the averages of all the values grouped under Jan-1990, Feb-1990 and so on. However, I was wrong: the output results in the average of all values under Item_1.
My example is similar to the post below, but in my case it is calculating the mean. I am guessing that it has to do with the way the data is arranged after groupby, or that some parameters in mean() have to be specified, but I have no idea which is the cause. Can someone enlighten me on how to correct the code?
Pandas groupby month and year
Update:
Hi all, I have created the sample .csv data file with 3 items and 3 months of data. I am wondering if the cause has to do with the conversion of the data into a df when it is imported from .csv, because I have noticed some weird time data in the leftmost column, as shown below:
Link to sample file is:
https://www.mediafire.com/file/t81wh3zem6vf4c2/test.csv/file
import pandas as pd
df = pd.read_csv('test.csv', index_col='date')
df.index = pd.to_datetime(df.index)
df.groupby([(df.index.year),(df.index.month)]).mean()
This seems to do the trick with the provided data.
IIUC, you want to calculate the mean of all elements. You can use numpy's mean function, which operates on the flattened array by default:
import numpy as np

df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
gb = df.groupby([(df.index.year), (df.index.month)]).apply(lambda d: np.mean(d.values))
output:
date date
1990 1 0.563678
2 0.489105
3 0.459131
4 0.755165
5 0.424466
6 0.523857
7 0.612977
8 0.396031
9 0.452538
10 0.527063
11 0.397951
12 0.600371
dtype: float64
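An equivalent pandas-only way to average over every element is to stack the item columns into one long Series before grouping. A sketch of that idea (it assumes there are no missing values, since stack drops NaNs):

stacked = df.stack()                          # one value per (date, item) pair
dates = stacked.index.get_level_values(0)     # recover the date level of the MultiIndex
gb = stacked.groupby([dates.year, dates.month]).mean()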
I have two data frames. I want to remove rows where the indexes do not occur in both data frames.
Here is an example of the data frames:
import pandas as pd
data = {'Correlation': [1.000000, 0.607340, 0.348844]}
df = pd.DataFrame(data, columns=['Correlation'])
df = df.rename(index={0: 'GINI'})
df = df.rename(index={1: 'Central government debt, total (% of GDP)'})
df = df.rename(index={2: 'Grants and other revenue (% of revenue)'})
data_2 = {'Correlation': [1.000000, 0.607340, 0.348844, 0.309390, -0.661046]}
df_2 = pd.DataFrame(data_2, columns=['Correlation'])
df_2 = df_2.rename(index={0: 'GINI'})
df_2 = df_2.rename(index={1: 'Central government debt, total (% of GDP)'})
df_2 = df_2.rename(index={2: 'Grants and other revenue (% of revenue)'})
df_2 = df_2.rename(index={3: 'Compensation of employees (% of expense)'})
df_2 = df_2.rename(index={4: 'Central government debt, total (current LCU)'})
I have found this question: How to remove rows in a Pandas dataframe if the same row exists in another dataframe? but was unable to use it, as I am trying to remove rows based on the index name.
I also saw this question: pandas get rows which are NOT in other dataframe, which removes rows that are equal in both data frames, but I did not find this useful either.
What I have thought to do is to transpose, then concat the data frames and remove duplicate columns:
df = df.T
df_2 = df_2.T
df3 = pd.concat([df,df_2],axis = 1)
df3.iloc[: , ~df3.columns.duplicated()]
The problem with this is that it only removes one of the duplicated columns, but I want it to remove both of these columns.
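(For what it's worth, duplicated can be told to mark every occurrence, which would drop both copies. A sketch with the same df3 as above:

# keep=False flags every occurrence of a duplicated column name, not just the later ones
df3.loc[:, ~df3.columns.duplicated(keep=False)]
)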
Any help doing this would be much appreciated, cheers.
You can just compare the indexes and use .loc to pull the relevant rows:
In [19]: df1 = pd.DataFrame(list(range(50)), index=range(0, 100, 2))
In [20]: df2 = pd.DataFrame(list(range(34)), index=range(0, 100, 3))
In [21]: df2.loc[df2.index.difference(df1.index)]
Out[21]:
0
3 1
9 3
15 5
21 7
27 9
33 11
39 13
45 15
51 17
57 19
63 21
69 23
75 25
81 27
87 29
93 31
99 33
You can simply do this to get the indices that are in df2 but not in df1:
df_2[~df_2.index.isin(df.index)]
Correlation
Compensation of employees (% of expense) 0.309390
Central government debt, total (current LCU) -0.661046
I have managed to work this out by adapting the answers already submitted:
df_2[df_2.index.isin(df.index)]
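If the goal is to end up with both frames restricted to the labels they share, a short sketch using Index.intersection (with the df and df_2 names from above):

# keep only index labels that appear in both frames
common = df.index.intersection(df_2.index)
df_trimmed = df.loc[common]
df_2_trimmed = df_2.loc[common]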