I am trying to validate columns in a dataframe against a particular regex. The limit of the number is (20,3), i.e. at most 20 digits in the integer part and up to 3 decimal places. But pandas converts the original numbers to seemingly random values, so my regex validation fails. I have checked that my regex is correct.
DataFrame:
FirstColumn,SecondColumn,ThirdColumn
111900987654123.123,111900987654123.123,111900987654123.123
111900987654123.12,111900987654123.12,111900987654123.12
111900987654123.1,111900987654123.1,111900987654123.1
111900987654123,111900987654123,111900987654123
111900987654123,-111900987654123,-111900987654123
-111900987654123.123,-111900987654123.123,-111900987654123.1
-111900987654123.12,-111900987654123.12,-111900987654123.12
-111900987654123.1,-111900987654123.1,-111900987654123.1
11119009876541231111,1111900987654123,1111900987654123
Code:
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF = pd.read_csv("D:\\FTP\\LocalUser\\NCCLCOLL\\COLLATERALUPLOAD\\upld\\SplitFiles\\AACCR6675H_22102021_07_1 - Copy.csv")
pd.set_option('display.float_format', '{:.3f}'.format)
rslt_df2=df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1=rslt_df2[~rslt_df2.iloc[:,0].apply(str).str.contains(NumberValidationRegexnegative, regex=True)].index
print("rslt_df1",rslt_df1)
Output Result:
rslt_df1 Int64Index([8], dtype='int64')
Expected Result:
rslt_df1 Int64Index([], dtype='int64')
Use dtype=str as a parameter of pd.read_csv:
NumberValidationRegexnegative = r"^-?[0-9]{1,20}(?:\.[0-9]{1,3})?$"
df_CPCodeDF = pd.read_csv("data.csv", dtype=str)
rslt_df2 = df_CPCodeDF[df_CPCodeDF.iloc[:, 0].notna()]
rslt_df1 = rslt_df2[~rslt_df2.iloc[:, 0]
                    .str.contains(NumberValidationRegexnegative, regex=True)].index
Output:
>>> print("rslt_df1", rslt_df1)
rslt_df1 Int64Index([], dtype='int64')
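For context, row 8 fails because a 20-digit integer exceeds float64 precision (roughly 15-17 significant decimal digits), so parsing the column as numbers already mangles the value. A minimal sketch of the effect, using the data above:
import pandas as pd
from io import StringIO

# the column mixes decimals with a 20-digit integer, so pandas parses it as float64
df = pd.read_csv(StringIO("FirstColumn\n111900987654123.123\n11119009876541231111"))
print(f"{df['FirstColumn'].iloc[1]:.0f}")  # the digits no longer match the CSV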
I have a column that will not change from object to float. The data from the xlsx file is always presented the same way (as a number), but somehow only this column is read as object.
The numbers in the column represent percentages, using a point (.) as the decimal separator.
xls3[' Vesturland'] = xls3[' Vesturland'].astype(float)
does not work. There are no special characters to replace (e.g. with str.replace()); I have tried that as well.
I dare not use
xls3[' Vesturland'] = pd.to_numeric(xls3[' Vesturland'])
because it changes all the floats to NaN, and the whole column consists of percentage values.
The only thing I can think of is that the number of decimals is not consistent, but that shouldn't really matter, should it?
The only error I get when I try to convert to float is could not convert string to float: '', and searching for my specific problem has not given any results yet.
You have empty strings in your pd.Series, which cannot be readily converted to a float data type. What you can do is check for them and remove them. An example script is:
import pandas as pd

# build a small frame where one 'nums' entry is an empty string
a = pd.DataFrame([['a', 'b', 'c'], ['2.42', '', '3.285']]).T
a.columns = ['names', 'nums']
# convert only the non-empty strings; the empty cell becomes NaN on realignment
a['nums'] = a['nums'][a['nums'] != ''].astype(float)
Note: if you try to run a['nums'] = a['nums'].astype(float) before selecting the non-empty strings, the same error you mentioned will be thrown.
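A shorter route to the same result (a sketch of my own, not part of the script above) is to coerce the whole column at once, so empty strings become NaN instead of raising:
a['nums'] = pd.to_numeric(a['nums'], errors='coerce')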
First use this line to obtain the current dtypes:
col_dtypes = dict([(k, v.name) for k, v in dict(df.dtypes).items()])
Like so:
xls3 = pd.read_csv('path/to/file')
col_dtypes = dict([(k, v.name) for k, v in dict(xls3.dtypes).items()])
print(col_dtypes)
Copy the value that is printed.
It should be like this:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
Then, for each column whose datatype you know shouldn't be object, change it to the required type ('int32', 'int64', 'float32' or 'float64').
Example:
The datatypes might be detected as:
{'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64', ' Vesturland': 'object', ...}
If we know ' Vesturland' is supposed to be float, then we can edit this to be:
col_dtypes = {
    'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
    ' Vesturland': 'float64', ...
}
Now, with this snippet you can find the non-numeric values:
def clean_non_numeric_values(series, col_type):
    illegal_value_pos = []
    for i in range(len(series)):
        try:
            if col_type in ('int64', 'int32'):
                val = int(series.iloc[i])
            elif col_type in ('float32', 'float64'):
                val = float(series.iloc[i])
        except (ValueError, TypeError):
            illegal_value_pos.append(i)
            # series.iloc[i] = None  # We can set the illegal values to None
            # to remove them later using xls3.dropna()
    return series, illegal_value_pos
# Now we will manually replace the dtype of the column ' Vesturland' like so:
col_dtypes = {
    'Date': 'object', 'Karlar': 'float64', 'Konur': 'float64',
    ' Vesturland': 'float64'
}

for col in list(xls3.columns):
    # .get avoids a KeyError for columns not listed in col_dtypes
    if col_dtypes.get(col) in ['int32', 'int64', 'float32', 'float64']:
        series, illegal_value_pos = (
            clean_non_numeric_values(series=xls3[col], col_type=col_dtypes[col])
        )
        xls3[col] = series
        print(illegal_value_pos)
        if illegal_value_pos:
            illegal_rows = xls3.iloc[illegal_value_pos]
            # This will print all the illegal values.
            print(illegal_rows[col])
Now you can use this information to remove the non-numeric values from the dataframe.
Warning: since this uses a for loop it is slow, but it will help you remove the values you don't want.
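If the loop is too slow, a vectorized sketch of the same check (my own addition, assuming the ' Vesturland' column from the question):
# rows that hold a value but fail numeric conversion
converted = pd.to_numeric(xls3[' Vesturland'], errors='coerce')
illegal = xls3[' Vesturland'][converted.isna() & xls3[' Vesturland'].notna()]
print(illegal)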
After much trial and error, I ended up opening the Excel sheet and deleting about 10 rows below the last data input. Then I unfroze the rows/columns, read it into Jupyter Notebook again, and now ALL OF THE DATA IS FLOAT. I don't know which change did the trick, but this is resolved now.
Thank you all that helped me here for your time and your attempts to solve this.
Sometimes a cell can be blank. Open your CSV file in Excel, apply a filter (Ctrl+Shift+L), and check for blank cells. You can count the cells that contain a single space like this:
len([x for x in xls3[' Vesturland'] if x == ' '])
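That check only catches cells holding a single space; a variant of my own that also counts empty and whitespace-only cells:
xls3[' Vesturland'].astype(str).str.strip().eq('').sum()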
Could it be that you do for i in range(0, len(tablename)) and you need len(tablename) - 1 because you start at 0?
I'm trying to convert a string into a float in a Dash callback, but when I run my code my Dash app shows this error (I don't get it in the terminal):
lati = float(lati[-1])
ValueError: could not convert string to float: 'float64'
First I need to extract a given latitude (and longitude) number. To do that I convert it to a string and split it, because I could not find a better way to get this number from the CSV file using pandas.
Output:
# converting to string:
12 41.6796
Name: latitude, dtype: float64
# splitting:
['12', '', '', '', '41.6796']
# converting to float:
41.6796
This is the actual code:
@app.callback(Output('text-output', 'children'),
              [Input('submit-val', 'n_clicks')],
              [State('search-input', 'value')])
def updateText(n_clicks, searchVar):
    df = pd.read_csv("powerplant.csv")
    df = df[df.name == searchVar]
    # converting to string
    lati = str(df['latitude'])
    longi = str(df['longitude'])
    # splitting it
    lati = lati.split('\n', 1)
    lati = lati[0].split(' ', 4)
    longi = longi.split('\n', 1)
    longi = longi[0].split(' ', 4)
    # converting to float
    lati = float(lati[-1])
    longi = float(longi[-1])
I actually tested this code in another script and it worked just fine. Is there a better way to extract the latitude and longitude numbers?
The data can be downloaded from https://datasets.wri.org/dataset/globalpowerplantdatabase; here is an excerpt.
country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,other_fuel3,commissioning_year,owner,source,url,geolocation_source,wepp_id,year_of_capacity_data,generation_gwh_2013,generation_gwh_2014,generation_gwh_2015,generation_gwh_2016,generation_gwh_2017,estimated_generation_gwh
AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009793,2017,,,,,,
AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787,Hydro,,,,,,GEODB,http://globalenergyobservatory.org,GEODB,1009795,2017,,,,,,
ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305,Hydro,,,,1963.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021238,,,,,,,79.22851153039832
ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936,Hydro,,,,1958.0,,Energy Charter Secretariat,http://www.energycharter.org/fileadmin/DocumentsMedia/IDEER/IDEER-Albania_2013_en.pdf,GEODB,1021241,,,,,,,82.52969951083159
The issue is the way you are accessing the values in a dataframe. Pandas allows you to access the data without having to parse the string representation.
You can access the row and the column in one call to .loc
If you know you will have a single value, you can call the squeeze method
>>> import pandas as pd
>>> from io import StringIO
>>> # data shortened for brevity
>>> df = pd.read_csv(StringIO("""country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude
... AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.3220,65.1190
... AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.5560,69.4787
... ALB,Albania,Shkopet,WRI1002173,24.0,41.6796,19.8305
... ALB,Albania,Ulez,WRI1002174,25.0,41.6796,19.8936"""))
>>> searchVar = "Ulez"
>>> df.loc[df["name"] == searchVar, "latitude"] # here you have a pd.Series
3 41.6796
Name: latitude, dtype: float64
>>> df.loc[df["name"] == searchVar, "latitude"].squeeze() # here you have a scalar
41.6796
>>> df.loc[df["name"] == searchVar, "longitude"].squeeze()
19.8936
If for some reason you have several rows with the same name, you will get a Series back rather than a scalar. But maybe that is a case where failing is better than passing ambiguous data along.
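If you would rather fail explicitly, Series.item() (instead of squeeze) raises a ValueError unless exactly one value matched; a quick sketch in the same session:
>>> df.loc[df["name"] == searchVar, "latitude"].item()  # raises if 0 or 2+ rows match
41.6796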
What you're looking at is a pandas.Series object containing a single row of data, and you're trying to chop up its __repr__ to get at the value. There is no need for this. I'm not familiar with the Python version of Plotly, but I see that you have a callback, so I've wrapped it up into a function (I'm not sure whether a case exists where the name can't be found):
import pandas as pd

def get_by_name(name):
    df = pd.read_csv('powerplants.csv')
    df = df[df['name'] == name]
    if not df.empty:
        return df[['latitude', 'longitude']].values.tolist()[0]
    return None, None

lat, lon = get_by_name('Kajaki Hydroelectric Power Plant Afghanistan')
I saved a set parameter using to_csv. The CSV file looks like this:
1,59,"set([17122, 196, 26405, 13032, 39657, 12427, 25133, 35951,
38928, 2 6088, 10258, 49235, 10326, 13176, 30450, 41787, 14084,
46149])",18,19.0,1 1,5.36363649368
Can I use read_csv and get the set type back rather than str?
users = pd.read_csv(DATA_PATH + "users_match.csv", dtype={
})
The answer is yes. Your solution
users = pd.read_csv(DATA_PATH + "users_match.csv", header = None)
will already return column 2 as a string as long as you have double quotes around set([...]).
Then use
users[2].apply(lambda x: eval(x))
to convert it back to a set.
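Putting both steps together, a minimal sketch assuming the file from the question:
import pandas as pd

users = pd.read_csv("users_match.csv", header=None)
users[2] = users[2].apply(eval)  # parses each "set([...])" string back into a set
print(type(users[2][0]))  # <class 'set'>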
To convert the DataFrame's str object (the string starting with the characters "set") into a built-in Python set object, here is one way:
>>> import pandas as pd
>>> df = pd.read_csv('users_match.csv', header=None)
>>> type(df[2][0])
str
>>> df.at[0, 2] = eval(df[2][0])  # set_value was removed in pandas 1.0; .at is its replacement
>>> type(df[2][0])
set
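One caveat (my own note): eval executes arbitrary code from the file. If you control the writing side, saving the column as a plain set literal such as "{17122, 196, 26405}" lets you read it back with the safer ast.literal_eval:
import ast

ast.literal_eval('{17122, 196, 26405}')  # returns the set without eval's code-execution risk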