I have a csv file containing numerical values such as 1524.449677. There are always exactly 6 decimal places.
When I import the csv file (and other columns) via pandas read_csv, the column automatically gets the datatype object. My issue is that the values are shown as 2470.6911370000003 which actually should be 2470.691137. Or the value 2484.30691 is shown as 2484.3069100000002.
This seems to be a datatype issue in some way. I tried to explicitly provide the data type when importing via read_csv by giving the dtype argument as {'columnname': np.float64}. Still the issue did not go away.
How can I get the values imported and shown exactly as they are in the source csv file?
Pandas uses a dedicated decimal-to-binary converter that sacrifices accuracy in favour of speed.
Passing float_precision='round_trip' to read_csv fixes this.
Check out this page for more detail on this.
After processing your data, if you want to save it back to a csv file, you can pass float_format="%.nf" (where n is the number of decimal places you want) to the corresponding method.
A full example:
import pandas as pd
df_in = pd.read_csv(source_file, float_precision='round_trip')
df_out = ... # some processing of df_in
df_out.to_csv(target_file, float_format="%.3f") # for 3 decimal places
I realise this is an old question, but maybe this will help someone else:
I had a similar problem, but couldn't quite use the same solution. Unfortunately the float_precision option only exists when using the C engine and not with the python engine. So if you have to use the python engine for some other reason (for example because the C engine can't deal with regex delimiters), this little "trick" worked for me:
In the pd.read_csv arguments, define dtype='str' and then convert your dataframe to whatever dtype you want, e.g. df = df.astype('float64').
Bit of a hack, but it seems to work. If anyone has any suggestions on how to solve this in a better way, let me know.
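A minimal sketch of that workaround, assuming a regex delimiter that forces the python engine and a column named columnname as in the question:
import pandas as pd

# The python engine is needed for the regex delimiter, so float_precision
# is unavailable; read everything as strings, then convert the one column.
df = pd.read_csv("data.csv", sep=r";\s*", engine="python", dtype=str)
df["columnname"] = df["columnname"].astype("float64")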
I am very new to Python and would like to use it for my mass spectrometry data analysis. I have a txt file that is separated by tabulator. I can import it into Excel with the import assistant.
I have also managed to import it into spyder with the import assistant, but I would like to automate the process.
Is there a way to "record" the import settings I use while manually loading the data? That way I would generate a code that I could use in the future for the other txt files.
I've tried using NumPy and pandas to import my data but my txt file contains strings and numbers (floats) and I have not managed to tell Python to distinguish between the two.
When I import the file manually I get the exact DataFrame I want, with the first row as a header and the strings and numbers correctly formatted.
here is a sample of my txt file:
Protein.IDs Majority.protein.IDs Peptide.counts..all.
0 LmxM.01.0330.1-p1 LmxM.01.0330.1-p1 5
1 LmxM.01.0410.1-p1 LmxM.01.0410.1-p1 15
2 LmxM.01.0480.1-p1 LmxM.01.0480.1-p1 14
3 LmxM.01.0490.1-p1 LmxM.01.0490.1-p1 27
4 LmxM.01.0520.1-p1 LmxM.01.0520.1-p1 27
Using numpy or pandas is the best way to automate the process, so good job using the right tools.
I suggest that you look at all the options that the pandas read_csv function has to offer. There is most likely a single line of code that can import the data properly by using the right options.
In particular, look at the decimal option if the floats are not parsed correctly.
Other solutions, which you may still want to use even if you use pandas properly are:
Formatting the input data to make your life easier: either when it is generated, or using some editor with good macros (Notepad++ can replace expressions or perform tedious repeated keystrokes for you).
Formatting the output of the pandas import. If you still have strings that should be interpreted as numeric values, maybe you can run a loop to check that all values are converted to the format they should be in.
Finally, you may want to provide some examples when you ask technical questions: showing a sample of the data, the code that you're using, and the output of your code would make answering your question easier :)
Edit:
From the data example that you posted, it seems to me that pandas should separate the data just fine and detect strings and numerical values without trouble.
Look at the sep option of read_csv. The default is ',', and you probably want to switch it to a tab: '\t'.
Try this:
import pandas
pandas.read_csv(my_filename, sep='\t')
You may run into some header issue, which you can solve using the header and names options.
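For instance, a hedged sketch of what that might look like for the sample above (the file name is an assumption):
import pandas as pd

# Read the tab-separated file; the first row becomes the header and pandas
# infers string vs. numeric dtypes per column.
df = pd.read_csv("ms_data.txt", sep="\t", header=0)
print(df.dtypes)

# If the header row were missing, names= could supply the column labels:
# df = pd.read_csv("ms_data.txt", sep="\t", header=None,
#                  names=["Protein.IDs", "Majority.protein.IDs", "Peptide.counts..all."])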
I want to get the discord.user_id. I am VERY new to Python and just need help getting this data.
I have tried everything and there is no clear answer online.
Currently, this works to get a data point in the attributes section:
pledge.relationship('patron').attribute('first_name')
You should try this:
import pandas as pd
df = pd.read_json("path_to_your/file.json")
The output will be a DataFrame, i.e. a table in which the json attributes become the column names. You will have to manipulate it afterwards, which is preferable, as operations on DataFrames are optimized in terms of processing time.
Here is the official documentation, take a look.
Assuming the whole object is called myObject, you can obtain the discord.user_id by calling myObject.json_data.attributes.social_connections.discord.user_id
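If the file is instead loaded as a plain dict with the json module, the same value can be reached with key lookups; the nesting below is an assumption that simply mirrors the attribute chain above and may need adjusting to your actual JSON:
import json

# Sketch only: adjust the key path to match the real structure of the export.
with open("file.json") as f:
    data = json.load(f)

user_id = data["attributes"]["social_connections"]["discord"]["user_id"]
print(user_id)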
I'm taking in shipment data from a csv file (I've edited the data for privacy purposes), but the thing to look at is that when using pandas.read_csv on my csv file, the original shown below is normal in this sense: the ZIP code (01234) has a leading 0, and the order number (22276) is an integer.
After using pandas.read_csv and printing out my data (and viewing my data in a text editor) I see that the leading 0 was taken out from the ZIP code (it is now 1234), and the order number is now a floating number (22276.0)
Original:
GROUND,THIRD PARTY,Company Name,1 Road
Ave,Town,State,01234,,22276,22276,22276,,Customer Name,Street
Name,00000 00th Ave
Z.Z.,,Town,State,00001,V476V6,18001112222,,,,Package,1
After using pandas.read_csv:
GROUND,THIRD PARTY,Dreams,100 Higginson
Ave,LINCOLN,RI,1234,,22276.0,22276.0,22276.0,,Customer Name,Street
Name,00000 00th Ave
Z.Z.,,Town,State,00001,V476V6,18001112222,,,,Package,1
I've seen others have these issues as well, and in those questions you will see well-written answers about HOW to fix the problem. What I want to know is WHY the problem exists in the first place. Why would a reading function end up writing data back to the file differently from the original?
EDIT
Here's the code I'm currently working with, reference is the name of the column with the order number.
import pandas
grid = pandas.read_csv("thirdparty.csv", dtype={'ZIP': int, 'REFERENCE': int})
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(grid)
How
You'll want to use the dtype argument of pd.read_csv. One solution would be to read in all the columns as string type. This will preserve the values exactly as they were in your csv file.
import pandas as pd
data = pd.read_csv("thirdparty.csv", dtype=str)
Though a better solution would be to specify your desired dtype of each column:
data = pd.read_csv("thirdparty.csv", dtype={'ZIP': str, 'REFERENCE': int})
When writing the csv file back out again, you should also use the float_format argument to ensure any floats are written as you desire.
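For example, a small sketch of the full round trip under the same assumed column names (the output file name is made up):
import pandas as pd

# Keep ZIP as text so the leading zero survives; REFERENCE stays an integer.
data = pd.read_csv("thirdparty.csv", dtype={'ZIP': str, 'REFERENCE': int})

# float_format controls how any remaining float columns are written back out.
data.to_csv("thirdparty_fixed.csv", index=False, float_format="%.2f")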
Why
You also asked why the "problem" exists.
Essentially, when you use pd.read_csv without specifying a dtype, columns that look numeric are parsed as numbers: leading zeros are dropped, so 01234 becomes 1234 on read, and a numeric column containing any missing values is stored as floats.
When you then write back out to a file, those numbers are written in their new form. The pd.read_csv function itself never writes data back to the original file.
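A tiny demonstration of that inference, using a made-up two-row csv:
import pandas as pd
from io import StringIO

csv = "ZIP,REFERENCE\n01234,22276\n01235,\n"

# Default inference: ZIP loses its leading zero and the missing value in
# REFERENCE forces that column to float (22276 -> 22276.0).
print(pd.read_csv(StringIO(csv)))

# Reading with dtype=str keeps the present values exactly as written.
print(pd.read_csv(StringIO(csv), dtype=str))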
I have to dump data from SAS datasets. I found a Python module called sas7bdat.py that says it can read SAS .sas7bdat datasets, and I think it would be simpler and more straightforward to do the project in Python rather than SAS due to the other functionality required. However, the help(sas7bdat) in interactive Python is not very useful and the only example I was able to find to dump a dataset is as follows:
import sas7bdat
from sas7bdat import *
# following line is sas dataset to convert
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
#following line is txt file to create
foo.convertFile('/support/textfiles/locked_data.txt','\t')
This doesn't do what I want because a) it uses the SAS variable names as column headers and I need it to use the variable labels, and b) it uses "nan" to denote missing numeric values where I'd rather just leave the value blank.
Can anyone point me to some useful documentation on the methods included in sas7bdat.py? I've Googled every permutation of key words that I could think of, with no luck. If not, can someone give me an example or two of using readColumnAttributes(), readColumnLabels(), and/or readColumnNames()?
Thanks, all.
As time passes, solutions become easier. I think this one is easiest if you want to work with pandas:
import pandas as pd
df = pd.read_sas('/support/sas/locked_data.sas7bdat')
Note that it is easy to get a NumPy array from the result by using df.values.
This is only a partial answer as I've found no [easy to read] concrete documentation.
You can view the source code here
This shows some basic info regarding what arguments the methods require, such as:
readColumnAttributes(self, colattr)
readColumnLabels(self, collabs, coltext, colcount)
readColumnNames(self, colname, coltext)
I think most of what you are after is stored in the "header" class returned when creating an object with SAS7BDAT. If you just print that class you'll get a lot of info, but you can also access class attributes as well. I think most of what you may be looking for would be under foo.header.cols. I suspect you use various header attributes as parameters for the methods you mention.
Maybe something like this will get you closer?
from sas7bdat import SAS7BDAT
foo = SAS7BDAT(inFile) #your file here...
for i in foo.header.cols:
print '"Atrributes"', i.attr
print '"Labels"', i.label
print '"Name"', i.name
Edit: Unrelated to this specific question, but the type() and dir() functions come in handy when trying to figure out what is going on in an unfamiliar class/library.
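For example, continuing with the foo object created above:
# Quick interactive exploration of the unfamiliar header object.
print(type(foo.header))
print(dir(foo.header))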
I know I'm late with the answer, but in case someone searches for a similar question, the best option is:
from sas7bdat import SAS7BDAT
foo = SAS7BDAT('/support/sas/locked_data.sas7bdat')
# This converts to dataframe:
ds = foo.to_data_frame()
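If the frame then needs to be written out, to_csv's na_rep argument (blank by default) controls how missing values appear, which touches on point b) of the original question; a small sketch reusing the output path from the question:
# Write the frame as tab-separated text; na_rep='' leaves missing values blank.
ds.to_csv('/support/textfiles/locked_data.txt', sep='\t', index=False, na_rep='')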
Personally I think the better approach would be to export the data using SAS then process the external file as needed using Python.
In SAS, you can do this...
libname datalib "/support/sas";
filename sasdump "/support/textfiles/locked_data.txt";
proc export
data = datalib.locked_data
outfile = sasdump
dbms = tab
label
replace;
run;
The downside to this is that while the column labels are used rather than the variable names, the labels are enclosed in double quotes. When processing in Python, you may need to programmatically remove them if they cause a problem. I hope that helps even though it doesn't use Python like you wanted.
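If the quoted headers do get in the way, here is a small hedged sketch of stripping them in plain Python (the path is taken from the SAS code above):
# Read the tab-delimited dump and strip the double quotes that PROC EXPORT
# places around the label-based column headers.
with open('/support/textfiles/locked_data.txt') as f:
    header = [label.strip('"') for label in f.readline().rstrip('\n').split('\t')]
    rows = [line.rstrip('\n').split('\t') for line in f]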