pandas dataframe condition based on regex expression - python

TTT
1. 802010001-999-00000285-888-
2. 256788
3. 1940
4. NaN
5. NaN
6. 702010001-X-2YZ-00000285-888-
I want to fill the GGT column with all values except the amounts (pure numbers).
The required table would look like this:
TTT GGT
1. 802010001-999-00000285-888- 802010001-999-00000285-888-
2. 256788 NaN
3. 1940 NaN
4. NaN NaN
5. NaN NaN
6. 702010001-X-2YZ-00000285-888- 702010001-X-2YZ-00000285-888-
The original table has more than 200,000 rows.

If you want to exclude the rows that contain only digits, you can use the str.match() method on the string elements of column TTT. You can use code like this:
df["GGT"] = df["TTT"][~df["TTT"].str.match(r'^\d+$', na=True)]  # na=True also masks NaN and non-string entries

Use Series.mask:
df['GGT'] = df['TTT'].mask(pd.to_numeric(df['TTT'], errors='coerce').notna())
Or:
df['GGT'] = df['TTT'].mask(df['TTT'].astype(str).str.contains(r'^\d+$', na=True))
print(df)
TTT GGT
0 802010001-999-00000285-888- 802010001-999-00000285-888-
1 256788 NaN
2 1940 NaN
3 NaN NaN
4 702010001-X-2YZ-00000285-888- 702010001-X-2YZ-00000285-888-
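For reference, here is a minimal runnable sketch of the to_numeric variant, assuming the sample data from the question:

import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'TTT': ['802010001-999-00000285-888-', '256788', '1940',
                           np.nan, np.nan, '702010001-X-2YZ-00000285-888-']})

# Hide every value that parses as a number; all other values are kept
df['GGT'] = df['TTT'].mask(pd.to_numeric(df['TTT'], errors='coerce').notna())
print(df)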

Replace a Pandas subrow with an array of values?

I have a (simple?) problem, but I cannot understand how to solve it in a pandas way.
I have this CSV:
,Unnamed: 0,Unnamed: 1,1Me,2Gi,3Ve,4Sa,5Do,6Lu,7Ma,8Me,9Gi,0Ve,1Sa,2Do,3Lu,4Ma,5Me,6Gi,7Ve,8Sa,9Do,0Lu,1Ma,2Me,3Gi,4Ve,5Sa,6Do,7Lu,8Ma,9Me,0Gi,1Ve,Unnamed: 2
0,,Ore,,",30",",46",",50",,,",20",",48",",41",",07",",52",",11",,",53",",51",",14",",28",",33",,",32",",10",",03",",44",",39",",04",,",26",",15",",07",",11",",59",
1,,Ore,,,,,,",53",,,,,,,,,,,,,,,,,,,,,,,,,,
That, when loaded, results in this dataframe:
>>> df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN ,30 ,46 ,50 ... ,26 ,15 ,07 ,11 ,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
I also have values, which is a numpy array of two lists.
>>> values
array([list(['6,30', '5,46', '4,50', '5,20', '7,48', '5,41', '2,07', '3,52', '3,11', '4,53', '4,51', '5,14', '4,28', '3,33', '5,32', '3,10', '5,03', '4,44', '4,39', '5,04', '5,26', '7,15', '5,07', '6,11', '2,59']),
list(['2,53'])], dtype=object)
I want to replace every element in the DataFrame that matches a specific regex with the corresponding element of the values array.
I assume that df and values have the same length (in this case 2), and that the "wrong" numbers to be replaced in each row of df correspond, in order, to the entries of the matching row in the values array.
In my case, I tried using df.replace(), but it didn't work; I got this error:
>>> df_lattice2.replace(r"\d?,\d+", values)
TypeError: Invalid "to_replace" type: 'str'
After a while, I came up with an iterative algorithm using df.iterrows(), counters, and element-by-element checks; I think, however, that a pandas solution to a problem like this must exist, but I didn't find anything.
My expected output is:
>>> expected_df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 ... 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
A precondition is that any function should work row by row (so no applymap), because some values appear in the second row while the corresponding value in the first row is NaN, and applymap works column by column.
Simple pandas solution
import numpy as np

s = df.stack()
s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
out = s.unstack().reindex(df.columns, axis=1)
Explanation
stack the dataframe to reshape it. Note: the stacking operation also drops NaN values by default.
>>> df.stack()
0 Unnamed: 1 Ore
2Gi ,30
3Ve ,46
4Sa ,50
...
9Me ,07
0Gi ,11
1Ve ,59
1 Unnamed: 1 Ore
6Lu ,53
dtype: object
Match the regular expression pattern (\d?,\d+) against the stacked frame using str.contains; this essentially creates a boolean mask:
>>> s.str.contains(r'\d?,\d+', na=False)
0 Unnamed: 1 False
2Gi True
3Ve True
4Sa True
...
9Me True
0Gi True
1Ve True
1 Unnamed: 1 False
6Lu True
dtype: bool
Using hstack, flatten the values array, then assign these values to the matched positions in the stacked frame:
>>> s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
>>> s
0 Unnamed: 1 Ore
2Gi 6,30
3Ve 5,46
4Sa 4,50
...
9Me 5,07
0Gi 6,11
1Ve 2,59
1 Unnamed: 1 Ore
6Lu 2,53
dtype: object
Now unstack to reshape back into a dataframe and reindex the columns:
>>> s.unstack().reindex(df.columns, axis=1)
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa 5Do 6Lu 7Ma 8Me 9Gi 0Ve 1Sa 2Do 3Lu 4Ma 5Me 6Gi 7Ve 8Sa 9Do 0Lu 1Ma 2Me 3Gi 4Ve 5Sa 6Do 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 NaN NaN 5,20 7,48 5,41 2,07 3,52 3,11 NaN 4,53 4,51 5,14 4,28 3,33 NaN 5,32 3,10 5,03 4,44 4,39 5,04 NaN 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN NaN 2,53 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
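One caveat the answer does not state explicitly: the boolean assignment only lines up if the number of matched cells equals the number of flattened values. A quick sanity check before assigning might be:

import numpy as np

mask = s.str.contains(r'\d?,\d+', na=False)
flat = np.hstack(values)
# Positional assignment requires exactly one replacement value per match
assert mask.sum() == len(flat), (mask.sum(), len(flat))
s[mask] = flat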
You are dealing with integers in df, but strings everywhere else, so that's part of the problem. Replacing with a regex requires strings, so trying it on a df of integers will fail. Also, passing the entire values array to replace doesn't seem like what you want to do. iterrows is very slow, so avoid it at all costs.
It looks like you want to find the string '30' and replace it with '6,30', for example. You can do that with df.replace(), all at once, as you originally wanted. You can also use replace on integers, e.g., replace the integer 30 with '6,30' in whatever data format that is. I'm not sure exactly what data you are working with or what data types you want in the end, so see this toy example for replacing all matching values in a df at once:
row1list = ['30', '46', '50', '20']
row2list = ['48', '41', '07', '52']
df = pd.DataFrame([row1list, row2list], columns=['1Me', '2Gi', '3ve', '4sa'])

values = ['6,30', '5,46', '2,07', '3,52']
for val in values:
    left, right = val.split(',')
    df = df.replace(right, val)

print(df)
#     1Me   2Gi   3ve   4sa
# 0  6,30  5,46    50    20
# 1    48    41  2,07  3,52

Generate scenarios with different means from a data frame

I have the following data frame:
Cluster OPS(4) mean(ln) std(ln)
0 5-894 5-894a 2.203 0.775
1 5-894 5-894b 2.203 0.775
2 5-894 5-894c 2.203 0.775
3 5-894 5-894d 2.203 0.775
4 5-894 5-894e 2.203 0.775
For each surgery type (in column OPS(4)), I would like to generate 10,000 scenarios, which should be stored in another data frame.
I know that I can create scenarios with:
num_reps = 10_000
scenarios = np.ceil(np.random.lognormal(mean, std, num_reps))
And the new data frame should look like this, with 10,000 scenarios in each column:
scen_per_surg = pd.DataFrame(index=range(num_reps), columns=merged_information['OPS(4)'])
OPS(4) 5-894a 5-894b 5-894c 5-894d 5-894e
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
...
Unfortunately, I don't know how to iterate over the rows of the first data frame to create the scenarios.
Can somebody help me?
Best regards
Create some data to experiment with:
import pandas as pd
df = pd.DataFrame(data=[
    ['5-894', '5-894a', 2.0, 0.70],
    ['5-894', '5-894b', 2.1, 0.71],
    ['5-894', '5-894c', 2.2, 0.72],
    ['5-894', '5-894d', 2.3, 0.73],
    ['5-894', '5-894e', 2.4, 0.74]],
    columns=['Cluster', 'OPS(4)', 'mean(ln)', 'std(ln)'])
print(df)
Create an empty dataframe:
new_df = pd.DataFrame()
Define a function that will be applied to each row of the original df; it generates the required random values and assigns them to a column in the new df:
import numpy as np

def geb_scenarios(row):
    # Unpack the OPS(4) code, mean and std from the row
    col, mean, std = row[1:]
    new_df[col] = np.ceil(np.random.lognormal(mean, std, 10))
Apply the function
df.apply(geb_scenarios, axis=1)
print(new_df)
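Alternatively, a sketch that builds the new frame directly, without writing to an outer variable from inside apply (using the question's 10,000 repetitions):

import numpy as np
import pandas as pd

num_reps = 10_000

# One column of rounded-up lognormal draws per OPS(4) code
scen_per_surg = pd.DataFrame({
    row['OPS(4)']: np.ceil(np.random.lognormal(row['mean(ln)'], row['std(ln)'], num_reps))
    for _, row in df.iterrows()
})
print(scen_per_surg.head())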

Pandas combine two data series into one series

I need to combine the data series rateScore and rate into one.
This is the current DataFrame I have:
rateScore rate
10 NaN 4.5
11 2.5 NaN
12 4.5 NaN
13 NaN 5.0
..
235 NaN 4.7
236 3.8 NaN
This needs to be something like this:
rateScore
10 4.5
11 2.5
12 4.5
13 5.0
..
235 4.7
236 3.8
The rate column needs to be dropped after merging the series, and for each row the index number needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and dropping rate:
df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from a second Series:
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
Let us use add:
df['rateScore'] = df['rateScore'].add(df['rate'],fill_value=0)
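For reference, a minimal runnable sketch of the combine_first approach on a few of the sample rows above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'rateScore': [np.nan, 2.5, 4.5, np.nan],
                   'rate': [4.5, np.nan, np.nan, 5.0]},
                  index=[10, 11, 12, 13])

# Fill the gaps in rateScore from rate, then drop rate; the index is preserved
df['rateScore'] = df['rateScore'].combine_first(df['rate'])
df = df.drop(columns='rate')
print(df)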

Remove Specific Characters/Strings/Sequences of Characters in Python

I am creating a long list of what seem to be tuples that I would like to later convert into a DataFrame, but there are certain common sequences of characters that prevent this from being possible. An example of a fraction of the output:
0,"GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
Name: 0, dtype: object"
Whereas the desired output would be:
GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
How can I get rid of the 0 and " at the start, and then the trash at the end past the TIME_ELAPSED_PERIOD? The int at the start and the one in the bottom row increases by 1 until the end of my program, which could likely go upwards of around 320,000, so I will need the code to be able to adapt for a range of int values. I think it would be easiest to do this after the creation of my list, so it shouldn't be necessary for me to show you any of my code. Just a systematic manipulation of characters should do the trick. Thanks!
Provided that your input data is in the form of a list, you can try the following to meet your requirements:
inputlist = Your_list_to_be_corrected  # Assign your input list here

# Remove the rows in the list that have the format 'Name: 0, dtype: object"'
inputlist = [x for x in inputlist if "dtype: object" not in x]

# Correct the rows containing GAME_ID by removing the int number and special characters
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if "GAME_ID" in element:
        inputlist[index] = 'GAME_ID' + element.split(sep, 1)[1]
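As a quick check, running the two steps on a small hypothetical sample shaped like the question's output:

# Hypothetical sample: a prefixed first row, a normal row, and the trailing dtype line
inputlist = ['0,"GAME_ID 21900001', 'EVENTNUM 2', 'Name: 0, dtype: object"']

inputlist = [x for x in inputlist if "dtype: object" not in x]

sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if "GAME_ID" in element:
        inputlist[index] = 'GAME_ID' + element.split(sep, 1)[1]

print(inputlist)  # ['GAME_ID 21900001', 'EVENTNUM 2']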

Excluding the NaN values while doing a sum operation across the rows inside a for loop

I have two data frames, as given below:
df1=
2492 3853 2486 3712 2288
0 4 NaN 3.5 NaN NaN
1 3 NaN 2.0 4.5 3.5
2 3 3.5 4.5 NaN 3.5
3 3. NaN 3.5 4.5 NaN
df2=
2492 0.476683
3853 0.464110
2486 0.438992
3712 0.400275
2288 0.379856
Right now I would like to get, for each row of df1, the sum of the df2 values over the columns where df1 is not NaN.
Expected output
0    0.915675  [0.476683+0.438992]
1    1.695806  [0.476683+0.438992+0.400275+0.379856]
2    1.759641  [0.476683+0.464110+0.438992+0.379856]
3    1.315950  [0.476683+0.438992+0.400275]
Please let me know your thoughts on how to achieve this (without replacing the NaN values with 0).
df2.sum(1).sum()
Should be enough, and it skips NaNs.
The first sum is a DataFrame method that returns a Series containing the sum of every row; the second sums the values of this Series.
NaNs are ignored by default.
Edit: simply using df2.sum() should be enough.
You can do:
>>> ((df1.fillna(0)>0)*1).mul(df2.iloc[:,1].values).sum(axis=1)
0 0.915675
1 1.695806
2 1.759641
3 1.315950
dtype: float64
Note that the NaN values are not replaced in place; you still have NaN in your original df1 after this operation.
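For comparison, a sketch of an equivalent that leaves the NaN values untouched by using notna() instead of fillna() (assuming, as above, that df2's values sit in its second column, ordered like df1's columns):

import pandas as pd

# Per-column weights from df2, aligned to df1's columns by position
weights = pd.Series(df2.iloc[:, 1].values, index=df1.columns)

# notna() gives a boolean mask; multiplying by the weights and summing per
# row adds up exactly the weights of the non-NaN cells
out = df1.notna().mul(weights, axis=1).sum(axis=1)
print(out)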
