Remove Specific Characters/Strings/Sequences of Characters in Python

I am creating a long list of what seem to be tuples that I would like to later convert into a DataFrame, but there are certain common sequences of characters that prevent this from being possible. An example of a fraction of the output:
0,"GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
Name: 0, dtype: object"
Whereas the desired output would be:
GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
How can I get rid of the 0 and " at the start, and the trash at the end past TIME_ELAPSED_PERIOD? The int at the start and the one in the bottom row increase by 1 until the end of my program, which could likely go upwards of around 320,000, so the code will need to adapt to a range of int values. I think it would be easiest to do this after the creation of my list, so it shouldn't be necessary to show you any of my code; a systematic manipulation of characters should do the trick. Thanks!

Provided that your input data is in the form of a list, you can try the following to meet your requirements:
inputlist = Your_list_to_be_corrected  # Assign your input list here

# Remove the rows in the list that have the format 'Name: 0, dtype: object"'
inputlist = [x for x in inputlist if "dtype: object" not in x]

# Correct the rows containing GAME_ID by removing the leading int and special characters
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if sep in element:
        inputlist[index] = sep + element.split(sep, 1)[1]
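If each element of the list is instead one multi-line string shaped like the sample above, a couple of regular expressions can strip the leading index/quote and the trailing footer for any integer index - a minimal sketch under that assumption:

import re

def clean_record(record):
    # Drop the leading '<int>,"' prefix
    record = re.sub(r'^\d+,"', '', record)
    # Drop the trailing 'Name: <int>, dtype: object"' footer
    return re.sub(r'\s*Name: \d+, dtype: object"$', '', record)

cleaned = [clean_record(r) for r in inputlist]

Because both patterns match any run of digits, this keeps working as the index climbs toward 320,000.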

Related

Replace a Pandas subrow with an array of values?

I have a (simple?) problem, but I cannot understand how to solve it in a pandas way.
I have this CSV:
,Unnamed: 0,Unnamed: 1,1Me,2Gi,3Ve,4Sa,5Do,6Lu,7Ma,8Me,9Gi,0Ve,1Sa,2Do,3Lu,4Ma,5Me,6Gi,7Ve,8Sa,9Do,0Lu,1Ma,2Me,3Gi,4Ve,5Sa,6Do,7Lu,8Ma,9Me,0Gi,1Ve,Unnamed: 2
0,,Ore,,",30",",46",",50",,,",20",",48",",41",",07",",52",",11",,",53",",51",",14",",28",",33",,",32",",10",",03",",44",",39",",04",,",26",",15",",07",",11",",59",
1,,Ore,,,,,,",53",,,,,,,,,,,,,,,,,,,,,,,,,,
That, when loaded, results in this dataframe:
>>> df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN ,30 ,46 ,50 ... ,26 ,15 ,07 ,11 ,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
And also, I have values, that is a numpy array of two lists.
>>> values
array([list(['6,30', '5,46', '4,50', '5,20', '7,48', '5,41', '2,07', '3,52', '3,11', '4,53', '4,51', '5,14', '4,28', '3,33', '5,32', '3,10', '5,03', '4,44', '4,39', '5,04', '5,26', '7,15', '5,07', '6,11', '2,59']),
list(['2,53'])], dtype=object)
My question is: I want to replace every element in the DataFrame that matches a specific regex with the corresponding element of the values list.
I assume that df and values have the same length (in this case 2) and also that the "wrong" numbers to be replaced inside df are the same as those in the corresponding row of the values array.
In my case, I tried using df.replace(), but it didn't work; I got this error:
>>> df_lattice2.replace(r"\d?,\d+", values)
TypeError: Invalid "to_replace" type: 'str'
After a while, I came up with an iterative algorithm using df.iterrows(), counters, and checking the elements one by one; I think, however, that a pandas solution to a problem like this must exist, but I didn't find anything.
My expected output is:
>>> expected_df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 ... 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
A precondition is that any function should work row-to-row (so no applymap), because some values are found in the second row - and the corresponding value in the first row is NaN - while applymap works column by column.
Simple pandas solution
import numpy as np

s = df.stack()
s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
out = s.unstack().reindex(df.columns, axis=1)
Explanation
Stack the dataframe to reshape it. Note: the stacking operation also drops NaN values by default.
>>> df.stack()
0 Unnamed: 1 Ore
2Gi ,30
3Ve ,46
4Sa ,50
...
9Me ,07
0Gi ,11
1Ve ,59
1 Unnamed: 1 Ore
6Lu ,53
dtype: object
Match the regular expression pattern (\d?,\d+) against the stacked frame using str.contains; this essentially creates a boolean mask:
>>> s.str.contains(r'\d?,\d+', na=False)
0 Unnamed: 1 False
2Gi True
3Ve True
4Sa True
...
9Me True
0Gi True
1Ve True
1 Unnamed: 1 False
6Lu True
dtype: bool
Use np.hstack to flatten values, then assign the flattened values to the matched strings in the stacked frame (this relies on the number of matches equalling the number of flattened values, in the same order):
>>> s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
>>> s
0 Unnamed: 1 Ore
2Gi 6,30
3Ve 5,46
4Sa 4,50
...
9Me 5,07
0Gi 6,11
1Ve 2,59
1 Unnamed: 1 Ore
6Lu 2,53
dtype: object
Now unstack to reshape back into a dataframe and reindex the columns
>>> s.unstack().reindex(df.columns, axis=1)
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa 5Do 6Lu 7Ma 8Me 9Gi 0Ve 1Sa 2Do 3Lu 4Ma 5Me 6Gi 7Ve 8Sa 9Do 0Lu 1Ma 2Me 3Gi 4Ve 5Sa 6Do 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 NaN NaN 5,20 7,48 5,41 2,07 3,52 3,11 NaN 4,53 4,51 5,14 4,28 3,33 NaN 5,32 3,10 5,03 4,44 4,39 5,04 NaN 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN NaN 2,53 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
You are dealing with integers in df, but strings everywhere else, so that's part of the problem. Trying to replace with a regex requires strings, so trying it on your df of digits will fail. Also, trying to replace using the entire values array doesn't seem like what you want to do. iterrows is very slow, so avoid it at all costs.
Looks like you want to find the string '30' and replace it with '6,30', for example. You can do that with df.replace(), all at once, as you originally wanted. You can also use replace on integers, e.g. replace the integer 30 with '6,30' in whatever data format that is. I'm not sure what the exact data is that you are working with or what data types you want in the end, so see this toy example for replacing all matching values in a df at once:
import pandas as pd

row1list = ['30', '46', '50', '20']
row2list = ['48', '41', '07', '52']
df = pd.DataFrame([row1list, row2list], columns=['1Me', '2Gi', '3ve', '4sa'])

values = ['6,30', '5,46', '2,07', '3,52']
for val in values:
    left, right = val.split(',')
    df = df.replace(right, val)

print(df)
#     1Me   2Gi   3ve   4sa
# 0  6,30  5,46    50    20
# 1    48    41  2,07  3,52
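If each bare string occurs only once in the frame, the loop can also be collapsed into a single replace call with a dict - a sketch under that assumption:

# Build {'30': '6,30', '46': '5,46', ...} and apply it in one pass
mapping = {val.split(',')[1]: val for val in values}
df = df.replace(mapping)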

How to change value in pandas DataFrame dependent on if clause in a for loop

I would like to analyse a list of orchids (input_df) to check whether it contains orchid species that are on one of six lists. I import these lists from an xlsx file with six sheets, as a dictionary containing the six lists as DataFrames (orchid_checklists).
import pandas as pd
orchid_checklists = pd.read_excel('\\orchid_checklists.xlsx', sheet_name=None)
input_df = pd.read_excel('\\input.xlsx')
input_df['Orchideen-Checkliste'] = ''
With the following for loop and if condition, I am trying to write the name of the matching checklist into the 'Orchideen-Checkliste' column for each item in input_df['Input Name'], to visualize which checklist one should refer to.
for item in input_df['Input Name']:
    for list_name, sheet in orchid_checklists.items():
        genus = item.split(' ')[0]
        if genus in sheet['referenced'].values:
            input_df['Orchideen-Checkliste'] = list_name
        else:
            pass
In my test input list there is one species called "Bulbophyllum pachyrachis" that should be found. Unfortunately the name of the list "CL_Bulbophyllum" is put into all rows. I can't figure out why.
In the next step I want to check whether the species name is also in the column "exceptions" of either of my checklists; in that case, that would not be the correct checklist. In these cases the full species name (e.g. "Aerangis ellisii", see CL_App_I and CL_III below) is found in the column "referenced" of another list.
I haven't started coding this exception because I am still stuck on the part before, but any pointers on how to approach this are highly welcome.
This is the input data:
Input Name Orchideen-Checkliste
0 Sobralia madisonii
1 Stelis cocornaensis
2 Stelis gelida
3 Braemia vittata
4 Brassia escobariana
5 Aspasia silvana
6 Bulbophyllum maximum
7 Bulbophyllum pachyrachis
8 Chondroscaphe amabilis
9 Dresslerella hispida
10 Elleanthus sodiroi
11 Maxillaria mathewsii
orchid_checklists:
CL_III
referenced exceptions
0 Aerangis Aerangis ellisii
1 Angraecum NaN
2 Ascocentrum NaN
3 Bletilla NaN
4 Brassavola NaN
5 Calanthe NaN
6 Catasetum NaN
7 Miltonia NaN
8 Miltoniopsis NaN
9 Renanthera NaN
10 Renantherella NaN
11 Rhynchostylis NaN
12 Rossioglossum NaN
13 Vanda NaN
14 Vandopsis NaN
CL_App_I
referenced exceptions
0 Paphiopedilum NaN
1 Phragmipedium NaN
2 Aerangis ellisii NaN
3 Cattleya jongheana NaN
4 Cattleya lobata NaN
5 Dendrobium cruentum NaN
6 Mexipedium xerophyticum NaN
7 Peristeria elata NaN
8 Renanthera imshootiana NaN
CL_Bulbophyllum
referenced exceptions
0 Acrochaene NaN
1 Bulbophyllum NaN
2 Chaseella NaN
3 Codonosiphon NaN
4 Drymoda NaN
5 Monomeria NaN
6 Monosepalum NaN
7 Pedilochilus NaN
8 Succoglossum NaN
9 Sunipia NaN
10 Trias NaN
Thank you in advance for your help!
input_df['Orchideen-Checkliste'] = list_name
This assigns a value to every row of that column, because you did not specify a row indexer.
Without changing your process too much: enumerate the items in input_df['Input Name'] when iterating and use the enumeration to specify the row for the assignment.
for index, item in enumerate(input_df['Input Name']):
    for list_name, sheet in orchid_checklists.items():
        genus = item.split(' ')[0]
        if genus in sheet['referenced'].values:
            input_df.loc[index, 'Orchideen-Checkliste'] = list_name
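For the exceptions step mentioned in the question, the same loop could be extended - a sketch, assuming a full species name in 'referenced' always outranks a genus match and that 'exceptions' holds full species names as shown in the sheets above:

for index, item in enumerate(input_df['Input Name']):
    genus = item.split(' ')[0]
    for list_name, sheet in orchid_checklists.items():
        referenced = sheet['referenced'].values
        exceptions = sheet['exceptions'].dropna().values
        # A full-species match always counts; a genus match counts
        # only if the species is not listed as an exception
        if item in referenced or (genus in referenced and item not in exceptions):
            input_df.loc[index, 'Orchideen-Checkliste'] = list_name

With the data above, "Aerangis ellisii" would then skip CL_III (it is an exception there) and land on CL_App_I, where the full name is referenced.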

How to get Dataframe with Table ID in Pandas?

I want to extract dataframe from HTML using URL.
The page contains 59 table/dataframe.
I want to extract 1 particular table which can be identified by its ID "ctl00_Menu1"
Following is my trial, which is giving an error.
import pandas as pd
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S12",attrs = {'id': 'ctl00_Menu1'})
As I am at a very early stage with Python, there may be a simple solution, but I have been unable to find it. I'd appreciate any help.
I would look at how the URL passes params and probably try to read a dataframe directly from it. I'm unsure whether you are trying to develop a function, a script, or just practising.
If you do the following (notice the 58 at the end of the URL), it works and gives you table 59:
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S1258", attrs={'id': 'ctl00_Menu1'})
[ 0 1 2 \
0 Partywise Partywise NaN
1 Partywise NaN NaN
2 Constituencywise-All Candidates NaN NaN
3 Constituencywise Trends NaN NaN
3 4 5 \
0 Constituencywise-All Candidates Constituencywise-All Candidates NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
6 7
0 Constituencywise Trends Constituencywise Trends
1 NaN NaN
2 NaN NaN
3 NaN NaN ]
Unsure if that's the table you want to extract, but most of the time it's easier to pass it as a URL parameter. If you try it without the 58, it works too. I believe the 'ElectionResult' argument might not be a table identifier, which may be why you can't find any tables with that name.
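If read_html keeps reporting that no tables match, it can help to first list the table ids the page actually exposes before filtering - a minimal sketch, assuming requests and beautifulsoup4 are installed:

import requests
from bs4 import BeautifulSoup

html = requests.get("http://eciresults.nic.in/statewiseS12.htm?st=S12").text
soup = BeautifulSoup(html, "html.parser")

# Print the id attribute of every table on the page (None if a table has no id)
for table in soup.find_all("table"):
    print(table.get("id"))

Whatever id shows up for the table you want is the value to pass to attrs= in pd.read_html.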

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas Dataframe containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values that take a certain time period as an option. For example, ADX with timeperiod=5 shows the ADX for the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame, since the prices of the first 5 days are used to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-LIB for example, this tulip library does not provide NaN-values for these first couple of empty days...
Is there an easy way to prepend these NaN to the ndarray?
Or insert into df at a certain index & have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!
Maybe make the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
df.iloc[period - 1:, df.columns.get_loc('mdi_14')] = mdi14  # avoids chained assignment
I hope they will fill the first values with NaN in the lib in the future. It's dangerous to leave time series data like this without any label.
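Alternatively, the NaNs can be prepended to the ndarray itself, which answers the question directly - a minimal sketch, assuming the indicator output is exactly period - 1 values shorter than the frame:

import numpy as np

pad = np.full(period - 1, np.nan)            # one NaN per warm-up day
df['mdi_14'] = np.concatenate([pad, mdi14])  # length now matches the index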
Full MCVE
import numpy as np
import pandas as pd

df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)   # the "real" values
b = np.full((6, df.shape[1]), np.nan)        # the NaN padding
c = np.row_stack([b, a])                     # padding on top, values below
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
# Create the dataframe with close prices; use a list, not a set literal,
# so the order of the prices is preserved
prices = pd.DataFrame(data=[81.59, 81.06, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
# Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real=prices['close'].to_numpy(), period=5, stddev=2)))
# Dynamically realign the index; note from the tulip library documentation that the
# price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
# Put the indicator values into the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.59 NaN NaN NaN
1 81.06 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 80.987142 82.738 84.488858
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129
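The same realignment can be wrapped in a small helper and reused for any tulipy indicator that returns a single shorter array - a sketch with names of my own choosing (ti.sma is used purely as an example of another single-output indicator):

def align_indicator(frame, arr):
    """Right-align a shorter indicator array against frame's index."""
    s = pd.Series(arr)
    s.index += frame.index.max() - s.index.max()
    return s

# Assignment aligns on the index, so the warm-up rows become NaN automatically
prices['sma_5'] = align_indicator(prices, ti.sma(prices['close'].to_numpy(), period=5))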

Change value if consecutive number of certain condition is achieved in Pandas

I would like to change the value of certain DataFrame entries, but only if a certain condition is met by at most n consecutive entries.
Example:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12,0]=-40
df.iloc[10:12,1]=-40
Which gives me this DF:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 -1.045209
4 40.000000 0.598657 -1.268399
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 1.744822
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 1.416020
10 -1.337494 -40.000000 -1.195780
11 -0.703669 -40.000000 0.657519
12 -40.000000 -0.288235 -0.840145
13 -1.084869 -0.298030 -1.592004
14 -0.617568 -1.046210 -0.531523
Now, if I do
a=df.copy()
a[ abs(a) > abs(a.std()) ] = float('nan')
I get
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 NaN 0.598657 NaN
5 NaN 0.442297 -0.016363
6 NaN -0.316817 NaN
7 NaN 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
which is fair. However, I would like to replace the values with NaN only where these conditions are met by a maximum of 2 consecutive entries (so I can interpolate later). For example, I would like the result to be:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
Apparently there's no ready-to-use method to do this. The solution I found that most closely resembles my problem was this one, but I couldn't make it work for me.
Any ideas?
See below - the tricky part is (cond[c] != cond[c].shift(1)).cumsum(), which breaks the data into contiguous runs of the same value.
In [23]: cond = abs(df) > abs(df.std())

In [24]: for c in df.columns:
    ...:     grouper = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
    ...:     fill = cond[c] & (df.groupby(grouper)[c].transform('size') <= 2)
    ...:     df.loc[fill, c] = np.nan
In [25]: df
Out[25]:
0 1 2
0 1.238892 0.802318 -0.013856
1 -1.136326 -0.527263 -0.260975
2 1.118771 0.031517 0.527350
3 1.629482 -0.158941 NaN
4 40.000000 0.598657 NaN
5 40.000000 0.442297 -0.016363
6 40.000000 -0.316817 NaN
7 40.000000 0.193083 0.914172
8 0.322756 -0.680682 0.888702
9 -1.204531 -0.240042 NaN
10 -1.337494 NaN NaN
11 -0.703669 NaN 0.657519
12 NaN -0.288235 -0.840145
13 -1.084869 -0.298030 NaN
14 -0.617568 -1.046210 -0.531523
To explain a bit more, cond[c] is a boolean series indicating whether your condition is true or not.
The cond[c] != cond[c].shift(1) compares the current row's condition to the previous row's. This has the effect of 'marking' where a run of values begins with the value True.
The .cumsum() converts the bools to integers and takes the cumulative sum. It may not be immediately intuitive, but this 'numbers' the groups of contiguous values. Finally the * cond[c] reassigns all groups that didn't meet the criteria to 0 (using False == 0).
So now you have groups of contiguous numbers that meet your condition; the next step performs a groupby to count how many values are in each group (transform('size')).
Finally, a new bool condition (guarded by cond[c] so the non-matching group 0 is never touched) is used to assign missing values to those groups with 2 or fewer values meeting the condition.
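Wrapped up as a reusable helper, the whole recipe looks like this - a sketch, assuming the goal is to NaN out only runs of at most max_run consecutive matches:

import numpy as np

def mask_short_runs(df, cond, max_run=2):
    """Replace with NaN the cells where cond holds for at most max_run consecutive rows."""
    out = df.copy()
    for c in df.columns:
        # Number the contiguous runs, sending non-matching rows to group 0
        runs = (cond[c] != cond[c].shift(1)).cumsum() * cond[c]
        # Keep only matching rows whose run is short enough
        short = cond[c] & (cond[c].groupby(runs).transform('size') <= max_run)
        out.loc[short, c] = np.nan
    return out

a = mask_short_runs(df, abs(df) > abs(df.std()))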
