pandas dataframe condition based on regex expression - python
   TTT
1  802010001-999-00000285-888-
2  256788
3  1940
4  NaN
5  NaN
6  702010001-X-2YZ-00000285-888-
I want to fill the GGT column with all the other values except for the amounts (pure numbers).
The required table would look like this:
   TTT                             GGT
1  802010001-999-00000285-888-     802010001-999-00000285-888-
2  256788                          NaN
3  1940                            NaN
4  NaN                             NaN
5  NaN                             NaN
6  702010001-X-2YZ-00000285-888-   702010001-X-2YZ-00000285-888-
The original table has more than 200,000 rows.
If you want to exclude the values that contain only digits, you can use the str.match() method on the string elements of column TTT, for example:
df["GGT"] = df["TTT"][df["TTT"].str.match(r'^(\d)+$')==False]
Use Series.mask:
df['GGT'] = df['TTT'].mask(pd.to_numeric(df['TTT'], errors='coerce').notna())
Or:
df['GGT'] = df['TTT'].mask(df["TTT"].astype(str).str.contains(r'^\d+$', na=True))
print(df)

                             TTT                            GGT
0    802010001-999-00000285-888-    802010001-999-00000285-888-
1                         256788                            NaN
2                           1940                            NaN
3                            NaN                            NaN
4                            NaN                            NaN
5  702010001-X-2YZ-00000285-888-  702010001-X-2YZ-00000285-888-
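One difference between the two masks is worth noting: pd.to_numeric also treats decimal or signed strings as amounts, while the ^\d+$ pattern only matches plain digit runs. A small sketch (with hypothetical values) of the difference:

s = pd.Series(['25.67', '-1940', '802010001-999-'])
# to_numeric parses floats and negatives as numbers too
print(pd.to_numeric(s, errors='coerce').notna().tolist())   # [True, True, False]
# the digits-only regex matches none of these
print(s.str.contains(r'^\d+$').tolist())                    # [False, False, False]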
Related
Replace a Pandas subrow with an array of values?
I have a (simple?) problem, but I cannot understand how to solve it in a pandas way. I have this CSV:

,Unnamed: 0,Unnamed: 1,1Me,2Gi,3Ve,4Sa,5Do,6Lu,7Ma,8Me,9Gi,0Ve,1Sa,2Do,3Lu,4Ma,5Me,6Gi,7Ve,8Sa,9Do,0Lu,1Ma,2Me,3Gi,4Ve,5Sa,6Do,7Lu,8Ma,9Me,0Gi,1Ve,Unnamed: 2
0,,Ore,,",30",",46",",50",,,",20",",48",",41",",07",",52",",11",,",53",",51",",14",",28",",33",,",32",",10",",03",",44",",39",",04",,",26",",15",",07",",11",",59",
1,,Ore,,,,,,",53",,,,,,,,,,,,,,,,,,,,,,,,,,

That, when loaded, results in this dataframe:

>>> df
  Unnamed: 0 Unnamed: 1  1Me  2Gi  3Ve  4Sa ...  7Lu  8Ma  9Me  0Gi  1Ve Unnamed: 2
0        NaN        Ore  NaN  ,30  ,46  ,50 ...  ,26  ,15  ,07  ,11  ,59        NaN
1        NaN        Ore  NaN  NaN  NaN  NaN ...  NaN  NaN  NaN  NaN  NaN        NaN

I also have values, a numpy array of two lists:

>>> values
array([list(['6,30', '5,46', '4,50', '5,20', '7,48', '5,41', '2,07', '3,52', '3,11', '4,53', '4,51', '5,14', '4,28', '3,33', '5,32', '3,10', '5,03', '4,44', '4,39', '5,04', '5,26', '7,15', '5,07', '6,11', '2,59']),
       list(['2,53'])], dtype=object)

My question is: I want to replace all elements in the dataframe that match a specific regex with the corresponding element of the values list. I assume that df and values have the same length (in this case 2) and also that the "wrong" numbers to be replaced inside df are the same as those of the corresponding row in the values array. I tried using df.replace(), but it didn't work; I got this error:

>>> df_lattice2.replace(r"\d?,\d+", values)
TypeError: Invalid "to_replace" type: 'str'

After a while, I came up with an iterative algorithm using df.iterrows(), counters, and checking the elements one by one; I think, however, that a pandas solution to a problem like this must exist, but I didn't find anything. My expected output is:

>>> expected_df
  Unnamed: 0 Unnamed: 1  1Me   2Gi   3Ve   4Sa ...   7Lu   8Ma   9Me   0Gi   1Ve Unnamed: 2
0        NaN        Ore  NaN  6,30  5,46  4,50 ...  5,26  7,15  5,07  6,11  2,59        NaN
1        NaN        Ore  NaN   NaN   NaN   NaN ...   NaN   NaN   NaN   NaN   NaN        NaN

A precondition is that any function should work row by row (so no applymap), because some values are found in the second row (and the corresponding value in the first row is NaN), while applymap works column by column.
Simple pandas solution:

s = df.stack()
s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
out = s.unstack().reindex(df.columns, axis=1)

Explanation: stack the dataframe to reshape it. Note that the stacking operation also drops the NaN values by default.

>>> df.stack()
0  Unnamed: 1    Ore
   2Gi           ,30
   3Ve           ,46
   4Sa           ,50
   ...
   9Me           ,07
   0Gi           ,11
   1Ve           ,59
1  Unnamed: 1    Ore
   6Lu           ,53
dtype: object

Match the regular expression pattern (\d?,\d+) against the stacked frame using str.contains; this essentially creates a boolean mask:

>>> s.str.contains(r'\d?,\d+', na=False)
0  Unnamed: 1    False
   2Gi            True
   3Ve            True
   4Sa            True
   ...
   9Me            True
   0Gi            True
   1Ve            True
1  Unnamed: 1    False
   6Lu            True
dtype: bool

Flatten the values with hstack, then assign them to the matched strings in the stacked frame:

>>> s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
>>> s
0  Unnamed: 1     Ore
   2Gi           6,30
   3Ve           5,46
   4Sa           4,50
   ...
   9Me           5,07
   0Gi           6,11
   1Ve           2,59
1  Unnamed: 1     Ore
   6Lu           2,53
dtype: object

Now unstack to reshape back into a dataframe and reindex the columns:

>>> s.unstack().reindex(df.columns, axis=1)
  Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa 5Do 6Lu 7Ma 8Me 9Gi 0Ve 1Sa 2Do 3Lu 4Ma 5Me 6Gi 7Ve 8Sa 9Do 0Lu 1Ma 2Me 3Gi 4Ve 5Sa 6Do 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0        NaN        Ore NaN 6,30 5,46 4,50 NaN NaN 5,20 7,48 5,41 2,07 3,52 3,11 NaN 4,53 4,51 5,14 4,28 3,33 NaN 5,32 3,10 5,03 4,44 4,39 5,04 NaN 5,26 7,15 5,07 6,11 2,59 NaN
1        NaN        Ore NaN NaN NaN NaN NaN 2,53 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
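To see the mechanics in isolation, here is a tiny self-contained sketch of the same stack, mask-assign, unstack pattern (hypothetical two-column data; the extra dropna() guards against pandas versions where stack no longer drops NaN by default):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [',30', np.nan], 'B': [',46', ',53']})
values = [['6,30', '5,46'], ['2,53']]     # one list per row, as in the question

s = df.stack().dropna()                   # flatten to a Series; NaN removed
s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
out = s.unstack().reindex(df.columns, axis=1)
print(out)
#       A     B
# 0  6,30  5,46
# 1   NaN  2,53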
You are dealing with integers in df but strings everywhere else, so that's part of the problem. Replacing with a regex requires strings, so trying it on your df of digits will fail. Also, trying to replace using the entire values array doesn't seem like what you want to do. iterrows is very slow, so avoid it at all costs.

It looks like you want to find the string '30' and replace it with '6,30', for example. You can do that with df.replace(), all at once, as you originally wanted. You can also use replace on integers, e.g., replace the integer 30 with '6,30', in whatever data format that is. I'm not sure what the exact data is that you are working with, or what data types you want in the end, so see this toy example for replacing all matching values in a df at once:

row1list = ['30', '46', '50', '20']
row2list = ['48', '41', '07', '52']
df = pd.DataFrame([row1list, row2list], columns=['1Me', '2Gi', '3ve', '4sa'])

values = ['6,30', '5,46', '2,07', '3,52']
for val in values:
    left, right = val.split(',')
    df = df.replace(right, val)

print(df)
#     1Me   2Gi   3ve   4sa
# 0  6,30  5,46    50    20
# 1    48    41  2,07  3,52
Generate scenarios with different means from a data frame
I have the following data frame:

  Cluster  OPS(4)  mean(ln)  std(ln)
0   5-894  5-894a     2.203    0.775
1   5-894  5-894b     2.203    0.775
2   5-894  5-894c     2.203    0.775
3   5-894  5-894d     2.203    0.775
4   5-894  5-894e     2.203    0.775

For each surgery type (in column OPS(4)) I would like to generate 10,000 scenarios, which should be stored in another data frame. I know that I can create the scenarios with:

num_reps = 10_000
scenarios = np.ceil(np.random.lognormal(mean, std, num_reps))

The new data frame should look like this, with 10,000 scenarios in each column:

scen_per_surg = pd.DataFrame(index=range(num_reps), columns=merged_information['OPS(4)'])

OPS(4)  5-894a  5-894b  5-894c  5-894d  5-894e
0          NaN     NaN     NaN     NaN     NaN
1          NaN     NaN     NaN     NaN     NaN
2          NaN     NaN     NaN     NaN     NaN
3          NaN     NaN     NaN     NaN     NaN
4          NaN     NaN     NaN     NaN     NaN
5          NaN     NaN     NaN     NaN     NaN
...

Unfortunately, I don't know how to iterate over the rows of the first data frame to create the scenarios. Can somebody help me? Best regards
Create some data to experiment with:

import pandas as pd

df = pd.DataFrame(data=[
    ['5-894', '5-894a', 2.0, 0.70],
    ['5-894', '5-894b', 2.1, 0.71],
    ['5-894', '5-894c', 2.2, 0.72],
    ['5-894', '5-894d', 2.3, 0.73],
    ['5-894', '5-894e', 2.4, 0.74]],
    columns=['Cluster', 'OPS(4)', 'mean(ln)', 'std(ln)'])
print(df)

Create an empty dataframe:

new_df = pd.DataFrame()

Define a function that will be applied to each row of the original df; it generates the required random values and assigns them to a column in the new df:

import numpy as np

def gen_scenarios(row):
    col, mean, std = row[1:]
    new_df[col] = np.ceil(np.random.lognormal(mean, std, 10))

Apply the function:

df.apply(gen_scenarios, axis=1)
print(new_df)
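The apply call above works through a side effect (the function writes into new_df from the enclosing scope), which can be easy to misread. As an alternative sketch, assuming the same experimental df, the columns can be built directly with a dict comprehension:

import numpy as np

num_reps = 10_000
scen_per_surg = pd.DataFrame({
    row['OPS(4)']: np.ceil(np.random.lognormal(row['mean(ln)'], row['std(ln)'], num_reps))
    for _, row in df.iterrows()   # one column of scenarios per surgery type
})
print(scen_per_surg.head())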
Pandas: combine two data series into one series
I need to combine the data series rateScore and rate into one. This is the current DataFrame I have:

     rateScore  rate
10         NaN   4.5
11         2.5   NaN
12         4.5   NaN
13         NaN   5.0
..
235        NaN   4.7
236        3.8   NaN

This needs to become something like this:

     rateScore
10         4.5
11         2.5
12         4.5
13         5.0
..
235        4.7
236        3.8

The rate column needs to be dropped after merging the series, and for each row the index number needs to stay the same.
You can try the following with fillna(), redefining the rateScore column and dropping rate (this works here because, in each row, at most one of the two columns holds a value):

df = df.fillna(0)
df['rateScore'] = df['rateScore'] + df['rate']
df = df.drop(columns='rate')
You could use combine_first to fill NaN values from the second Series:

df['rateScore'] = df['rateScore'].combine_first(df['rate'])
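combine_first keeps the original index labels intact, so the row numbers (10, 11, ..., 236) stay the same; to finish the requested result you would then drop the rate column, for example:

df = df.drop(columns='rate')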
Let us use Series.add with fill_value:

df['rateScore'] = df['rateScore'].add(df['rate'], fill_value=0)
Remove Specific Characters/Strings/Sequences of Characters in Python
I am creating a long list of what seem to be tuples that I would like to later convert into a DataFrame, but there are certain common sequences of characters that prevent this from being possible. An example of a fraction of the output:

0,"GAME_ID 21900001 EVENTNUM 2 EVENTMSGTYPE 12 EVENTMSGACTIONTYPE 0 PERIOD 1 WCTIMESTRING 8:04 PM PCTIMESTRING 12:00 HOMEDESCRIPTION NEUTRALDESCRIPTION VISITORDESCRIPTION SCORE NaN SCOREMARGIN NaN PERSON1TYPE 0 PLAYER1_ID 0 PLAYER1_NAME NaN PLAYER1_TEAM_ID NaN PLAYER1_TEAM_CITY NaN PLAYER1_TEAM_NICKNAME NaN PLAYER1_TEAM_ABBREVIATION NaN PERSON2TYPE 0 PLAYER2_ID 0 PLAYER2_NAME NaN PLAYER2_TEAM_ID NaN PLAYER2_TEAM_CITY NaN PLAYER2_TEAM_NICKNAME NaN PLAYER2_TEAM_ABBREVIATION NaN PERSON3TYPE 0 PLAYER3_ID 0 PLAYER3_NAME NaN PLAYER3_TEAM_ID NaN PLAYER3_TEAM_CITY NaN PLAYER3_TEAM_NICKNAME NaN PLAYER3_TEAM_ABBREVIATION NaN VIDEO_AVAILABLE_FLAG 0 DESCRIPTION TIME_ELAPSED 0 TIME_ELAPSED_PERIOD 0 Name: 0, dtype: object"

Whereas the desired output would be:

GAME_ID 21900001 EVENTNUM 2 EVENTMSGTYPE 12 EVENTMSGACTIONTYPE 0 PERIOD 1 WCTIMESTRING 8:04 PM PCTIMESTRING 12:00 HOMEDESCRIPTION NEUTRALDESCRIPTION VISITORDESCRIPTION SCORE NaN SCOREMARGIN NaN PERSON1TYPE 0 PLAYER1_ID 0 PLAYER1_NAME NaN PLAYER1_TEAM_ID NaN PLAYER1_TEAM_CITY NaN PLAYER1_TEAM_NICKNAME NaN PLAYER1_TEAM_ABBREVIATION NaN PERSON2TYPE 0 PLAYER2_ID 0 PLAYER2_NAME NaN PLAYER2_TEAM_ID NaN PLAYER2_TEAM_CITY NaN PLAYER2_TEAM_NICKNAME NaN PLAYER2_TEAM_ABBREVIATION NaN PERSON3TYPE 0 PLAYER3_ID 0 PLAYER3_NAME NaN PLAYER3_TEAM_ID NaN PLAYER3_TEAM_CITY NaN PLAYER3_TEAM_NICKNAME NaN PLAYER3_TEAM_ABBREVIATION NaN VIDEO_AVAILABLE_FLAG 0 DESCRIPTION TIME_ELAPSED 0 TIME_ELAPSED_PERIOD 0

How can I get rid of the 0 and " at the start, and then the trash at the end past TIME_ELAPSED_PERIOD? The int at the start and the one in the bottom row increase by 1 until the end of my program, which could likely go upwards of around 320,000, so the code will need to adapt to a range of int values. I think it would be easiest to do this after the creation of my list, so it shouldn't be necessary for me to show you any of my code; just a systematic manipulation of characters should do the trick. Thanks!
Provided that your input data is in the form of a list, you can try the following to meet your requirements:

inputlist = Your_list_to_be_corrected  # Assign your input list here

# Remove the rows in the list that have the format 'Name: 0, dtype: object"'
inputlist = [x for x in inputlist if "dtype: object" not in x]

# Correct the rows containing GAME_ID by removing the int number and special characters
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if sep in element:
        inputlist[index] = sep + element.split(sep, 1)[1]
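A small self-contained check of this cleanup, using two hypothetical (shortened) list entries that mimic the output shown in the question:

inputlist = [
    '0,"GAME_ID 21900001 EVENTNUM 2 EVENTMSGTYPE 12',
    'Name: 0, dtype: object"',
]

# Drop the trailing-residue rows
inputlist = [x for x in inputlist if "dtype: object" not in x]

# Strip everything before GAME_ID
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if sep in element:
        inputlist[index] = sep + element.split(sep, 1)[1]

print(inputlist)
# ['GAME_ID 21900001 EVENTNUM 2 EVENTMSGTYPE 12']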
Excluding NaN values while doing a sum operation across the rows inside a for loop
I have two data frames as given below:

df1 =
   2492  3853  2486  3712  2288
0   4.0   NaN   3.5   NaN   NaN
1   3.0   NaN   2.0   4.5   3.5
2   3.0   3.5   4.5   NaN   3.5
3   3.0   NaN   3.5   4.5   NaN

df2 =
2492    0.476683
3853    0.464110
2486    0.438992
3712    0.400275
2288    0.379856

For each row of df1 I would like to get the sum of the df2 values for the columns where df1 is not NaN.

Expected output:

0    0.915675  [0.476683 + 0.438992]
1    1.695806  [0.476683 + 0.438992 + 0.400275 + 0.379856]
2    1.759641  [0.476683 + 0.464110 + 0.438992 + 0.379856]
3    1.315950  [0.476683 + 0.438992 + 0.400275]

Please let me know your thoughts on how to achieve this (without replacing the NaN values with 0).
df2.sum(1).sum()

should be enough and skips NaNs: the first sum is a DataFrame method that returns a Series containing the sum of every line, and the second sums the values of that Series. NaNs are ignored by default.

Edit: simply df2.sum() should be enough.
You can do:

>>> ((df1.fillna(0) > 0) * 1).mul(df2.iloc[:, 1].values).sum(axis=1)
0    0.915675
1    1.695806
2    1.759641
3    1.315950
dtype: float64

Note that NaN are not replaced "by reference": you still have NaN in your original df1 after this operation.
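An equivalent and slightly more direct sketch, assuming (as above) that df2's weights sit in its second column, uses notna() as the mask instead of fillna(0) > 0 (which would also misclassify a legitimate rating of 0 as missing):

# True/False per cell -> 1/0 when multiplied by the column weights
out = df1.notna().mul(df2.iloc[:, 1].values).sum(axis=1)
print(out)
# 0    0.915675
# 1    1.695806
# 2    1.759641
# 3    1.315950
# dtype: float64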