From float to string without losing zeros (conda/python/jupyter notebook) - python

I am doing data analysis in a Jupyter notebook using Python and pandas. In my master CSV, thousands are indicated with dots, and pandas reads the columns as floats. I want to eliminate the dots and do some calculations afterwards.
If I transform the column into a string, I lose the zeros from the thousands.
df['name_of_a_new_column'] = df['Name_of_The_column_it_is_a_float'].astype(str)
The result is that 249.000 becomes 249.0.
Is there any way to change from float to integer or string without losing the zeros from the thousands?
Thanks in advance!

In my master csv, thousands are indicated with dots
Read the original CSV with the thousands option:
data.csv:
column1,column2
1.000.000,2.000.000
test.py:
import pandas as pd
df = pd.read_csv('data.csv', thousands='.')
print(df)
Output:
   column1  column2
0  1000000  2000000
See the read_csv documentation for lots of other options that control how the CSV is read. For example, some countries use a comma as the decimal point in floats (decimal=',') and a semicolon as the column separator (sep=';') in their CSVs as well.
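For instance, a minimal sketch of reading such a European-style CSV (the file name and sample values here are made up for illustration):
data_eu.csv:
column1;column2
1.000.000,50;2.000.000,25
test_eu.py:
import pandas as pd

# sep=';' for semicolon-separated columns; thousands='.' and decimal=',' for European number formatting
df = pd.read_csv('data_eu.csv', sep=';', thousands='.', decimal=',')
print(df)  # both columns are parsed as floats: 1000000.50 and 2000000.25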

Related

Pandas read_parquet partially parses binary column

I'm trying to read a parquet file that contains a binary column with multiple hex values, which is causing issues when reading it with Pandas. Pandas is automatically converting some of the hex values to characters, but some are left untouched, so the data is not really usable anymore. When reading it with PySpark, it converts all hex values to decimal base, but as the output is consistent, it's usable.
Any ideas why pandas parses this column differently and how I can get the same output as Spark returns, or at least a consistent one (no partial parsing applied)?
The snippets of code and returned outputs:
Pandas :
df = pd.read_parquet('data.parquet')
pd.read_parquet output:
Spark :
spark_df = spark.read.parquet("data.parquet")
df = spark_df.toPandas()
Spark.read.parquet output:
Pandas is returning byte strings; some bytes happen to be displayed as ASCII characters, but nothing is wrong with the data. For example:
x = bytes([1, 10, 100])  # x is shown as b'\x01\nd', where the trailing 'd' is just the ASCII character for 100
list(x)                  # [1, 10, 100] - the same bytes as a list of numbers
To make your pandas dataframe look like the Spark one, use:
df['BASE_PERIOD_VECTOR'].apply(list)
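Putting it together, a minimal self-contained sketch (the column name BASE_PERIOD_VECTOR comes from the question; the byte values are made up):
import pandas as pd

# stand-in for the binary column read from the parquet file
df = pd.DataFrame({'BASE_PERIOD_VECTOR': [bytes([1, 10, 100]), bytes([0, 255, 16])]})
print(df['BASE_PERIOD_VECTOR'])  # displayed as byte strings, e.g. b'\x01\nd'

# convert each byte string to a list of integers, matching the Spark output
df['BASE_PERIOD_VECTOR'] = df['BASE_PERIOD_VECTOR'].apply(list)
print(df['BASE_PERIOD_VECTOR'])  # [1, 10, 100] and [0, 255, 16]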

converting a multiple space delimited file to CSV using python 2.7 with Pandas?

I have multiple problems that I'm trying to solve with this single file, but my immediate concern is converting this file, whose fields are delimited by variable numbers of spaces, to a standard CSV file without 1000 lines of genius-level code. I know one way to do it, because I did it in a previous project a couple of years ago: set up functions similar to the old left$, mid$ and right$ functions in VB to select out the particular characters from each row that I am interested in. The data is very well defined and neatly parsed, i.e., each column is the same width all the way down, so I can grab the header row by using those functions to select out the field names of the columns, then go row by row using the same functions to pull the numeric data as strings with mid$(), write that to another file while adding a "," between each written string, and convert the strings back to floats; then I've got my CSV file with headers. But wow, is that cumbersome and ugly - I want to use Pandas to make it more elegant, concise and sharp.
Here is a snippet of the first few lines of a data file - I have hundreds of them to process. In the actual files there are dozens more columns; this is just a sample that demonstrates the variable spaces between fields as delimiters.
DATE......................TIME.....................CH4.......................H2O
2021-04-01................01:47:45.407..............2.0063472018E+00..........1.2005321188E+00...
2021-04-01................01:47:46.336..............2.0063472018E+00..........1.2005321188E+00...
2021-04-01................01:47:47.244..............2.0063472018E+00..........1.2025918742E+00...
2021-04-01................01:47:49.049..............2.0059096902E+00..........1.2025918742E+00...
I also need to parse the DATE and TIME columns as a timestamp object, which I've been trying to do with pandas read_csv(parse_dates=[[0,1]]), and which almost works. I need the dates for plotting the x-axis labels for each series... but that's another problem for another post, haha.
Thanks in advance for any assistance!!
john rainh2o
Using Pandas, specify the delimiter as a space (assuming your example has replaced the spaces with dots). Next, specify skipinitialspace=True. The DATE and TIME columns can then be combined into a single datetime64 column:
import pandas as pd
df = pd.read_csv('input.txt', delimiter=' ', skipinitialspace=True, parse_dates=[['DATE', 'TIME']])
print(df)
print(df.dtypes)
This would give you:
                DATE_TIME       CH4       H2O
0 2021-04-01 01:47:45.407  2.006347  1.200532
1 2021-04-01 01:47:46.336  2.006347  1.200532
2 2021-04-01 01:47:47.244  2.006347  1.202592
3 2021-04-01 01:47:49.049  2.005910  1.202592
DATE_TIME    datetime64[ns]
CH4                 float64
H2O                 float64
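If the single-space delimiter ever misfires, a regex separator that treats any run of whitespace as one delimiter is a common alternative; this sketch also combines the two columns by hand, which sidesteps the nested parse_dates list (deprecated in recent pandas versions):
import pandas as pd

# sep=r'\s+' treats any run of whitespace as a single delimiter
df = pd.read_csv('input.txt', sep=r'\s+')
# concatenate the two string columns, then parse them as one timestamp
df['DATE_TIME'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'])
df = df.drop(columns=['DATE', 'TIME'])
print(df.dtypes)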

Replacing dot with comma from a dataframe using Python

I have a dataframe, for example df.
I'm trying to replace the dots with commas to be able to do calculations in Excel.
I used :
df = df.stack().str.replace('.', ',').unstack()
or
df = df.apply(lambda x: x.str.replace('.', ','))
Results :
Nothing changes, but I receive this warning at the end of an execution without errors:
FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
View of what I have:
Expected results:
Updated question with more information, thanks to @Pythonista anonymous:
print(df.dtypes)
returns :
Date object
Open object
High object
Low object
Close object
Adj Close object
Volume object
dtype: object
I'm exporting the data with the to_excel method:
df.to_excel()
I'm not exporting the dataframe to a .csv file but to an .xlsx file.
Where does the dataframe come from - how was it generated? Was it imported from a CSV file?
Your code works if you apply it to columns which are strings, as long as you remember to do df = df.apply(...) and not just df.apply(...), e.g.:
import pandas as pd

df = pd.DataFrame()
df['a'] = ['some . text', 'some . other . text']
# regex=False treats '.' as a literal dot; with regex=True a lone '.' would match any character
df = df.apply(lambda x: x.str.replace('.', ',', regex=False))
print(df)
However, you are trying to do this with numbers, not strings.
To be precise, the other question is: what are the dtypes of your dataframe?
If you type
df.dtypes
what's the output?
I presume your columns are numeric and not strings, right? After all, if they are numbers they should be stored as such in your dataframe.
The next question: how are you exporting this table to Excel?
If you are saving a CSV file, pandas' to_csv() method has a decimal argument which lets you specify the separator for decimals (typically a dot in the English-speaking world and a comma in many countries in continental Europe); see the sketch below.
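A minimal sketch, with a made-up frame and output file name:
import pandas as pd

df = pd.DataFrame({'value': [249.5, 1000.25]})
# decimal=',' writes decimals with a comma; sep=';' keeps the column separator from clashing with it
df.to_csv('out.csv', sep=';', decimal=',', index=False)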
If you are using the to_excel() method, it shouldn't matter, because Excel should treat it internally as a number, and how it displays it (whether with a dot or a comma as the decimal separator) will typically depend on the options set on your computer.
Please clarify how you are exporting the data and what happens when you open it in Excel: does Excel treat it as a string? Or as a number, but you would like to see a different separator for the decimals?
Also look here for how to change decimal separators in Excel: https://www.officetooltips.com/excel_2016/tips/change_the_decimal_point_to_a_comma_or_vice_versa.html
UPDATE
OP, you have still not explained where the dataframe comes from. Do you import it from an external source? Do you create it/ calculate it yourself?
The fact that the columns are objects makes me think they are either stored as strings, or maybe some rows are numeric and some are not.
What happens if you try to convert a column to float?
df['Open'] = df['Open'].astype('float64')
If the entire column should be numeric but it's not, then start by cleansing your data.
Second question: what happens when you use Excel to open the file you have just created? Excel displays a comma, but what character Excel uses to separate decimals depends on the Windows/Mac/Excel settings, not on how pandas created the file. Have you tried the link I gave above - can you change how Excel displays decimals? Also, does Excel treat those numbers as numbers or as strings?
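If that float conversion fails because some rows aren't numeric, a minimal cleansing sketch (the sample values are made up):
import pandas as pd

df = pd.DataFrame({'Open': ['1.5', '2.7', 'n/a']})  # a mixed column stored as strings
# errors='coerce' turns unparseable values into NaN instead of raising
df['Open'] = pd.to_numeric(df['Open'], errors='coerce')
print(df['Open'])  # 1.5, 2.7, NaN - the dtype is now float64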

How to prevent truncation using pd.read_sas()

I am new to Python and use the following code to read in a SAS file:
df = pd.read_sas('C:\\test\\test.sas7bdat', format='sas7bdat', encoding='latin-1')
Some columns contain either a 7-character code or just "M" for missing. In columns where the first few rows contain only "M" and the 7-character codes appear only in later rows, the values are truncated to a single character for all rows; this does not happen in columns that have a 7-character code in the first rows.
This is how the original data looks in SAS.
How can I prevent pandas from truncating the text when reading in the data?
Thank you.
Lia

Column value is read as date instead of string - Pandas

I have an Excel file in which one row of the column Model has the value "9-3", which is a string value. I double-checked the Excel file to make sure the column's data type is Plain string instead of Date. But still, when I use read_excel and convert it into a data frame, the value is shown as 2017-09-03 00:00:00 instead of the string "9-3".
Here is how I read the excel file:
table = pd.read_excel('ManualProfitAdjustmentUpdates.xlsx', header=0, converters={'Model': str})
Any idea why pandas is not treating the value as a string even when I set the converter to str?
The Plain string setting in the Excel file affects only how the data is shown in Excel.
The str setting in the converter affects only how pandas treats the data it receives.
To force the excel file to return the data as string, the cell's first character should be an apostrophe.
Change "9-3" to "'9-3".
The problem may be with Excel. Make sure the entire column is stored as text, not just the single value you are talking about. If Excel ever saved the column as a date, it will store a year in that cell no matter what is shown or what the data type is changed to. Pandas is going to read the entire column as one data type, so if you have dates above 9-3, it will be converted. Changing dates to strings without years can be tricky. It may be better to save the Excel sheet as a CSV once it is in the proper format you like and then use pandas' pd.read_csv(). I made a test Excel workbook "book1.xlsx":
9-3 1 Hello
12-1 2 World
1-8 3 Test
Then ran
import pandas as pd
df = pd.read_excel('book1.xlsx', header=0)
print(df)
and got back my data frame correctly. Thus, I am led to believe it is Excel. Sorry it isn't the best answer, but I don't believe it is a pandas error.
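Following that suggestion, a sketch of the CSV route (the file name book1.csv and the Model column header are assumptions, as if the sheet above had been saved as a CSV with a header row):
import pandas as pd

# dtype=str keeps every column as text, so values like '9-3' are never reinterpreted as dates
df = pd.read_csv('book1.csv', dtype=str)
print(df['Model'])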
