Regex validation not working for large numbers in a pandas column - Python

I am trying to validate columns of a dataframe against a particular regex. The limit for a number is (20,3), i.e. a maximum length of 20 for an integer or 23 for a value with a decimal part. But pandas is converting the original number to a different value (precision is lost), so my regex validation fails. I have checked that the regex itself is correct.
Dataframe :
FirstColumn,SecondColumn,ThirdColumn
111900987654123.123,111900987654123.123,111900987654123.123
111900987654123.12,111900987654123.12,111900987654123.12
111900987654123.1,111900987654123.1,111900987654123.1
111900987654123,111900987654123,111900987654123
111900987654123,-111900987654123,-111900987654123
-111900987654123.123,-111900987654123.123,-111900987654123.1
-111900987654123.12,-111900987654123.12,-111900987654123.12
-111900987654123.1,-111900987654123.1,-111900987654123.1
11119009876541231111,1111900987654123,1111900987654123
Code:
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF=pd.read_csv("D:\\FTP\\LocalUser\\NCCLCOLL\\COLLATERALUPLOAD\\upld\\SplitFiles\\AACCR6675H_22102021_07_1 - Copy.csv")
pd.set_option('display.float_format', '{:.3f}'.format)
rslt_df2=df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1=rslt_df2[~rslt_df2.iloc[:,0].apply(str).str.contains(NumberValidationRegexnegative, regex=True)].index
print("rslt_df1",rslt_df1)
Output Result:
rslt_df1 Int64Index([8], dtype='int64')
Expected Result:
rslt_df1 Int64Index([], dtype='int64')

Use dtype=str as a parameter of pd.read_csv. This keeps the values as strings, so pandas never converts them to float64, which only holds about 15-17 significant digits and silently rounds the 20-digit value in the last row:
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF = pd.read_csv("data.csv", dtype=str)
rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1 = rslt_df2[~rslt_df2.iloc[:,0] \
.str.contains(NumberValidationRegexnegative, regex=True)].index
Output:
>>> print("rslt_df1", rslt_df1)
rslt_df1 Int64Index([], dtype='int64')
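For context, a minimal sketch (not code from the post) showing why the default float64 parsing breaks the check, using the 20-digit value from the last row of the sample file:
import re

pattern = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
value = "11119009876541231111"                 # 20-digit value from the sample CSV

as_float = float(value)                        # what pandas stores when dtype is not forced to str
print(str(as_float))                           # '1.1119009876541231e+19' -> notation changed, precision lost
print(bool(re.match(pattern, str(as_float))))  # False
print(bool(re.match(pattern, value)))          # True when the value stays a string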


String type to array or list pandas column [duplicate]

I have pandas dataframe as below:
id emb
0 529581720 [-0.06815625727176666, 0.054927315562963486, 0...
1 663817504 [-0.05805483087897301, 0.031277190893888474, 0...
2 507084910 [-0.07410381734371185, -0.03922194242477417, 0...
3 1774950548 [-0.09088297933340073, -0.04383128136396408, -...
4 725573369 [-0.06329705566167831, 0.01242107804864645, 0....
The data type of the emb column is object. Now I want to convert those values into a numpy array, so I tried the following:
embd = df['emb'].values
But as the values are strings, I get the following output:
embd[0]
out:
array('[-0.06815625727176666, 0.054927315562963486, 0.056555990129709244, -0.04559280723333359, -0.025042753666639328, -0.06674829870462418, -0.027613995596766472,
0.05307046324014664, 0.020159300416707993, 0.012015435844659805, 0.07048438489437103,
-0.020022081211209297, -0.03899797052145004, -0.03358669579029083, -0.06369364261627197,
-0.045727960765361786, -0.05619484931230545, -0.07043793052434921, -0.07021039724349976,
2.8020248282700777E-4, -0.04271571710705757, -0.04004468396306038, 0.01802503503859043, -0.0553901381790638, 0.0068290019407868385, -0.021117383614182472, -0.06583991646766663]',
dtype='<U11190')
Can someone tell me how I can convert this into an array of float32 values?
You can use the numpy function numpy.array() to convert an array of strings to an array with float32 values. Here is an example:
import numpy as np
string_array = ["1.0", "2.5", "3.14"]
float_array = np.array(string_array, dtype=np.float32)
Alternatively, you can use the pandas function pandas.to_numeric() to convert the values of a column of a dataframe from string to float32. Here is an example:
import pandas as pd
df = pd.DataFrame({"A": ["1.0", "2.5", "3.14"]})
df["A"] = pd.to_numeric(df["A"], downcast='float')
You can also use the pd.to_numeric() method and catch the errors that might arise when trying to convert the string to float, using the errors='coerce' argument. This will replace the invalid string values with NaN.
df['A'] = pd.to_numeric(df['A'], errors='coerce')
Use ast.literal_eval:
import ast
df['emb'] = df['emb'].apply(ast.literal_eval)
Output:
>>> df['emb'].values
array([list([-0.06815625727176666, 0.054927315562963486]),
list([-0.05805483087897301, 0.031277190893888474]),
list([-0.07410381734371185, -0.03922194242477417]),
list([-0.09088297933340073, -0.04383128136396408]),
list([-0.06329705566167831, 0.01242107804864645])], dtype=object)
>>> np.stack(df['emb'].values)
array([[-0.06815626, 0.05492732],
[-0.05805483, 0.03127719],
[-0.07410382, -0.03922194],
[-0.09088298, -0.04383128],
[-0.06329706, 0.01242108]])
Alternatively, to store each list directly as a numpy array:
df['emb'] = df['emb'].apply(lambda x: np.array(ast.literal_eval(x)))
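Since the question specifically asks for float32, a small follow-up sketch building on the literal_eval approach above, stacking the parsed lists into a float32 matrix:
import ast
import numpy as np

# parse each string into a Python list, then stack into a 2-D float32 array
df['emb'] = df['emb'].apply(ast.literal_eval)
emb_matrix = np.stack(df['emb'].values).astype(np.float32)
print(emb_matrix.dtype)   # float32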

How can I convert a struct timestamp column with start and end into normal pythonic timestamp columns?

I have a time-series pivot table with a struct timestamp column containing the start and end of each record's time frame, as follows:
import pandas as pd
pd.set_option('max_colwidth', 400)
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'],
"X1": [25],
"X2": [33],
})
df
# timestamp X1 X2
#0 {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"} 25 33
Since I will later use the timestamps as the index for time-series analysis, I need to convert this column into separate start/end timestamp columns.
I have tried to find a solution using regex, so far unsuccessfully, based on this post:
df[["start_timestamp", "end_timestamp"]] = (
df["timestamp"].str.extractall(r"(\d+\.\d+\.\d+)").unstack().ffill(axis=1)
)
but I get:
ValueError: Columns must be same length as key
So I am trying to reach the following expected dataframe:
df = pd.DataFrame({'timestamp': ['{"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"}'],
'start_timestamp': ['2022-01-19T00:00:00.000+0000'],
'end_timestamp': ['2022-01-20T00:00:00.000+0000'],
"X1": [25],
"X2": [33]})
df
# timestamp start_timestamp end_timestamp X1 X2
#0 {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:00.000+0000"} 2022-01-19T00:00:00.000+0000 2022-01-20T00:00:00.000+0000 25 33
You can extract both values with an extract call:
df[["start_timestamp", "end_timestamp"]] = df["timestamp"].str.extract(r'"start":"([^"]*)","end":"([^"]+)')
The "start":"([^"]*)","end":"([^"]+) regex matches "start":", then captres any zero or more chars other than " into Group 1 (the start column value) and then matches ","end":" and then captures one or more chars other than " into Group 2 (the end column value).
Also, if the data you have is valid JSON, you can parse the JSON instead of using a regex:
import json

def extract_startend(x):
    j = json.loads(x)
    return pd.Series([j["start"], j["end"]])

df[["start_timestamp", "end_timestamp"]] = df["timestamp"].apply(extract_startend)
Output of print(df.to_string()):
timestamp X1 X2 start_timestamp end_timestamp
0 {"start":"2022-01-19T00:00:00.000+0000","end":"2022-01-20T00:00:......... 25 33 2022-01-19T00:00:00.000+0000 2022-01-20T00:00:00.000+0000
This may not be the most efficient approach, but it works:
df[['start_timestamp', 'end_timestamp']] = df['timestamp'].str.split(',', expand=True)
df['start_timestamp'] = df['start_timestamp'].str.extract(r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{4})')
df['end_timestamp'] = df['end_timestamp'].str.extract(r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{4})')
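Since the extracted values will be used as a time-series index, a hedged follow-up sketch: whichever answer you use for the extraction, the resulting string columns can be converted to real timestamps with pd.to_datetime and then set as the index:
# convert the extracted string columns into timezone-aware datetimes
df['start_timestamp'] = pd.to_datetime(df['start_timestamp'])
df['end_timestamp'] = pd.to_datetime(df['end_timestamp'])

# e.g. index the frame by the start of each time frame
df = df.set_index('start_timestamp')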

How to turn value in timestamp column into numbers

I have a dataframe:
id timestamp
1 "2025-08-02 19:08:59"
1 "2025-08-02 19:08:59"
1 "2025-08-02 19:09:59"
I need to turn the timestamp into an integer number to iterate over conditions, so it looks like this:
id timestamp
1 20250802190859
1 20250802190859
1 20250802190959
You can convert the strings using pandas' string methods:
df = pd.DataFrame({'id':[1,1,1],'timestamp':["2025-08-02 19:08:59",
"2025-08-02 19:08:59",
"2025-08-02 19:09:59"]})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df['timestamp'] = df['timestamp'].str.replace(r'[-\s:]', '', regex=True).astype('float64')
>>> df
id timestamp
0 1 20250802190859.000
1 1 20250802190859.000
2 1 20250802190959.000
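If you want actual integers rather than floats with display formatting, the same replacement can be cast to int64 instead (these values fit comfortably in 64 bits); a small hedged variant of the line above:
# cast to int64 so the values print as plain integers
df['timestamp'] = df['timestamp'].str.replace(r'[-\s:]', '', regex=True).astype('int64')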
Have you tried opening the file, skipping the first line (or better: validating that it contains the header fields as expected) and for each line, splitting it at the first space/tab/whitespace. The second part, e.g. "2025-08-02 19:08:59", can be parsed using datetime.fromisoformat(). You can then turn the datetime object back to a string using datetime.strftime(format) with e.g. format = '%Y%m%d%H%M%S'. Note that there is no "milliseconds" format in strftime though. You could use %f for microseconds.
Note: if datetime.fromisoformat() fails to parse the dates, try datetime.strptime(date_string, format) with a different format, e.g. format = '%Y-%m-%d %H:%M:%S'.
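A minimal sketch of that approach, covering only the parsing and re-formatting step for the quoted timestamp string:
from datetime import datetime

raw = "2025-08-02 19:08:59"
dt = datetime.fromisoformat(raw)                 # parse the ISO-like string
as_number = int(dt.strftime('%Y%m%d%H%M%S'))     # 20250802190859
print(as_number)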
You can use the solutions provided in this post: How to turn timestamp into float number? and loop through the dataframe.
Let's say you have already imported pandas and have a dataframe df, see the additional code below:
import re

df = pd.DataFrame(l)
df1 = df.copy()
for x in range(len(df[0])):
    df1.loc[x, 0] = re.sub(r'\D', '', df[0][x])
This way you will not modify the original dataframe df and will get desired output in a new dataframe df1.
Full code that I tried (including creation of the first dataframe), which might help clear up any confusion:
import pandas as pd
import re
l = ["2025-08-02 19:08:59", "2025-08-02 19:08:59", "2025-08-02 19:09:59"]
df = pd.DataFrame(l)
df1 = df.copy()
for x in range(len(df[0])):
    df1.loc[x, 0] = re.sub(r'\D', '', df[0][x])

Specify dtype option on import or set low_memory=False

I am using the following code:
df = pd.read_csv('/Python Test/AcquirerRussell3000.csv')
I have the following type of data:
18.07.2000 27.1875 0 08.08.2000 25.3125 0.1 05.09.2000 \
0 19.07.00 26.6250 -0.020690 09.08.00 25.2344 -0.003085 06.09.00
1 20.07.00 26.6250 0.000000 10.08.00 25.1406 -0.003717 07.09.00
2 21.07.00 25.6875 -0.035211 11.08.00 25.5781 0.017402 08.09.00
3 24.07.00 26.2500 0.021898 14.08.00 25.4375 -0.005497 11.09.00
4 25.07.00 26.6875 0.016667 15.08.00 25.5625 0.004914 12.09.00
I am getting the following error:
Pythone Test/untitled0.py:1: DtypeWarning: Columns (long list of numbers) have mixed types.
Specify dtype option on import or set low_memory=False.
So every 3rd column is a date and the rest are numbers. I guess there is no single dtype, since dates are strings and the rest are floats or ints? I have about 5000 columns or more and around 400 rows.
I have seen similar questions to this but don't quite know how to apply this to my data. Furthermore, I want to run the following code afterwards to stack the data frame.
a = np.arange(len(df.columns))
df.columns = [a % 3, a // 3]
df = df.stack().reset_index(drop=True)
df.to_csv('AcquirerRussell3000stacked.csv', sep=',')
What dtype should I use? Or should I just set low_memory to false?
This, from here, solved my problem:
dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Could anyone explain this answer to me though?
df = pd.read_csv('/Python Test/AcquirerRussell3000.csv', engine='python')
or
df = pd.read_csv('/Python Test/AcquirerRussell3000.csv', low_memory=False)
does the trick for me.
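A middle ground, if you would rather control the types yourself, is to read everything as strings and convert explicitly afterwards; a hedged sketch assuming the layout described in the question (every 3rd column a date, the rest numeric):
import pandas as pd

# read every column as string so pandas does not have to infer types chunk by chunk
df = pd.read_csv('/Python Test/AcquirerRussell3000.csv', dtype=str)

# convert the non-date columns to numbers; columns 0, 3, 6, ... are the date columns
numeric_cols = [c for i, c in enumerate(df.columns) if i % 3 != 0]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')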

Drop 0 values, NaN values, and empty strings

import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(r'^\s*$', np.nan, regex=True)
filevalues = filevalues.fillna(int(0))
int_series = filevalues.astype(int)
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
So I have hundreds of CSV files with many empty spots for values. Some of the blank spaces are detected as NaNs and others as empty strings. This forced me to write my code the way it is now: because I need to apply a formula to each value (in this example multiplying by 1/1.2), I changed all such NaNs and empty strings to 0 so that the formula can be applied. The problem is that I do not want to see values that are 0, NaN, or empty strings when printing my dataframe.
I have tried to use the following:
filevalues = filevalues.dropna()
But because certain CSV files have empty strings, this method does not fully work and I get the error:
ValueError: invalid literal for int() with base 10: ' '
I have also tried the following after converting all values to 0:
filevalues = filevalues.loc[:, (filevalues != 0).all(axis=0)]
and
mask = np.any(np.isnan(filevalues) | np.equal(a, 0), axis=1)
Every method seems to be giving different errors. Is there a clean way to not count these types of values when I am printing my pandas dataframe? Please let me know if an example csv file is needed.
Got it to work! Here is the answer if it is of use to anyone.
import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(" ", "", regex=True)
filevalues.replace("", np.nan, inplace=True) # replace empty string with np.nan
filevalues.dropna(inplace=True) # drop nan values
int_series = filevalues.astype(int) # change type
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
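A more compact variant of the same idea, sketched under the same assumptions about the file and column names: pd.to_numeric with errors='coerce' turns both empty strings and other non-numeric junk into NaN in one step, which can then be dropped:
import pandas as pd

readfile = pd.read_csv('50.csv')
filevalues = readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']

# coerce blanks and non-numeric strings to NaN, drop them, then apply the formula
filevalues = pd.to_numeric(filevalues, errors='coerce').dropna()
calculated_series = filevalues * (1 / 1.2)
print(calculated_series)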
