Pandas Dataframe to_csv flipping column names? - python

Issue:
Pandas appears to be swapping the column data in the data frame when it saves to CSV. What is going on?
# Code
myDF.to_csv('./myDF.csv')
print(myDF)
# Print Output
dd-3 dd-4
5346177884_triplet+ 3 3
5346177884_dublet- 5 5
5346177884_dublet+ 3 3
...
1434120345_triplet+ NaN 1
1434120345_singlet+ NaN 3
# CSV File
,dd-3,dd-4
5346177884_triplet+,3.0,3
5346177884_dublet-,5.0,5
5346177884_dublet+,3.0,3
...
1434120345_triplet+,,1
1434120345_singlet+,,3
Anyone seen anything like this before?

Be sure to check the raw CSV file to make sure it is not the tool you are using to display the CSV that is interpreting the file incorrectly. For instance, pandas writes NaNs as empty fields in a CSV file, while LibreOffice Calc's import dialog can be set to merge repeated delimiters (useful for space-separated files with runs of spaces). If you accidentally leave that option on when importing a CSV that has empty fields between delimiters, you may see an effect similar to what you have reported.
Issue:
# CSV Format
,h1,h2,h3
obj,v1,v2,v3
# Pandas writes the NaNs for v1 & v2 as empty fields
,h1,h2,h3
obj,,,v3
# Merged-delimiter interpretation collapses the empty fields
,h1,h2,h3
obj,v3
# Resulting View
    h1 h2 h3
obj v3
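To confirm the blanks come from pandas itself rather than the viewer, you can reproduce the NaN-to-empty-field behaviour directly (a minimal sketch with values made up from the printed output above):

```python
import numpy as np
import pandas as pd

# made-up frame mirroring the printed output in the question
df = pd.DataFrame({'dd-3': [3.0, np.nan], 'dd-4': [3, 1]},
                  index=['5346177884_triplet+', '1434120345_triplet+'])

csv_text = df.to_csv()
print(csv_text)   # the NaN becomes an empty field between two commas
```

Opening the resulting text in a plain editor (not a spreadsheet import dialog) shows the columns are in the right order, with the NaN as a bare `,,`.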

Related

Excel Steps Formula in Python

So I am venturing into Python scripts. I am a beginner, but I have been tasked with converting an Excel formula to Python code. I have a text file that contains 3+ million rows; each row has three columns and is delimited by tabs.
All rows come in as strings, and the first two columns have no problems. The problem with the third column is that, when downloaded, numerical content gets padded with leading 0s to make 18 characters.
In the same column there are also values that contain spaces in between, like 00 5372. Some are fully text format, identified by letters or other characters, like ABC3400 or 00-ab-fu-99 or 00-33-44-66.
A1 B1 Values Output
AA00 dd 000000000000056484 56484
AB00 dd 00 564842 00 564842
AC00 dd 00-563554-f 00-563554-f
AD00 dd STO45642 STO45642
AE00 dd 45632 45632
I need to clean these codes to make the output text clean, while:
1) leaving the spaces in between,
2) cleaning the leading and trailing spaces,
3) cleaning the value if it is padded with 0's in front.
I do this in Excel for a limited amount of data by using the following formula.
=TRIM(IFERROR(IF((FIND(" ";A2))>=1;A2);TEXT(A2;0)))
*Semicolon due to regional language setting.
For a large file, I use the following Power Query steps.
= Table.ReplaceValue(#"Trimmed Text", each [Values], each if Text.Contains([Values]," ") then [Values] else if Number.From([Values]) is number then Text.TrimStart([Values],"0") else [Values],Replacer.ReplaceValue,{"Values"})
First trim, then replace values. This does the job in Power Query very well. Now I would like to do it with a Python script, but as a noob I am stuck at the very beginning. Can anyone help me with the library and code?
My end target is to get the data saved in txt/csv with cleaned values.
Excel ScreenShot
*Edited to correct point 1) Leaving and not removing and further clarification with data.
Try the code below (replace column3 with the respective column name and assign the file path to the variable file_name; if the Python script and the data file are saved in the same location, the file name alone is enough). Since the source is a tab-delimited text file, read_csv is the right reader; read_excel does not accept sep or lineterminator arguments:
import pandas as pd
df = pd.read_csv(file_name, sep='\t', dtype=str, skipinitialspace=True)
df['column3'] = df['column3'].str.strip()  # trim leading/trailing spaces, keep internal ones
df.to_csv('output.csv', index=False)
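Trimming alone does not cover the zero-padding rule from the question. A minimal sketch of the full cleaning logic, using the question's own sample values (the real file would be loaded with pd.read_csv(path, sep='\t', header=None, dtype=str)):

```python
import pandas as pd

def clean_value(v: str) -> str:
    """Trim, keep internal spaces, strip zero padding from pure numbers."""
    v = v.strip()                 # 2) drop leading/trailing spaces
    if ' ' in v:                  # 1) values with internal spaces stay as-is
        return v
    if v.isdigit():               # 3) purely numeric: remove the 0 padding
        return v.lstrip('0') or '0'
    return v                      # mixed alphanumeric codes are untouched

# sample rows copied from the question's table
df = pd.DataFrame({'A1': ['AA00', 'AB00', 'AC00', 'AD00', 'AE00'],
                   'B1': ['dd'] * 5,
                   'Values': ['000000000000056484', '00 564842',
                              '00-563554-f', 'STO45642', '45632']})
df['Output'] = df['Values'].map(clean_value)
print(df[['Values', 'Output']])
```

Writing the result out is then df.to_csv('cleaned.txt', sep='\t', index=False), matching the txt/csv target.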

Replacing a single character in a dataframe?

I've been looking all over Google for a solution.
I am pulling data using requests.get(), which outputs a lengthy block in JSON.
I've been able to normalize it and figure out how to put it into a PANDAS dataframe.
The problem I am having is taking the URLs in Columns X, Y, Z and removing the percent encoding. I'd be okay with removing it from the entire dataframe.
stat.get.url
0
1
2 entrance%7Cstate%2Fgoogle
3 entrance%7Cstate%2Fyahoo
4 entrance%7Cstate%2Fmsn
I've tried this code:
df.replace('%7C', '|', regex=True)
But that doesn't replace anything in the dataframe.
How can I replace the percent encoded characters and get them to be saved on the dataframe?
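One sketch of a way to do this: urllib.parse.unquote decodes every percent escape at once, and since pandas operations like replace and map return new objects rather than modifying in place, the result must be assigned back to the frame (column name taken from the question's output):

```python
import pandas as pd
from urllib.parse import unquote

# sample values copied from the question's output
df = pd.DataFrame({'stat.get.url': ['entrance%7Cstate%2Fgoogle',
                                    'entrance%7Cstate%2Fyahoo']})

# assign the decoded column back; unquote handles %7C, %2F, etc. together
df['stat.get.url'] = df['stat.get.url'].map(unquote)
print(df)
```

The original df.replace('%7C', '|', regex=True) also works, but only for that one escape, and only if its return value is assigned back (or inplace=True is passed).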

remove the index values while converting excel file to text file using python

I am using python 3.7 and I wanted to convert .xlsx file into .txt file and below is my code:
import pandas as pd
dataframe1 = pd.read_excel(r'C:\path\exceldata1.xlsx')
print(dataframe1)
with open(r'C:\path\exceldata1.txt', 'w') as text_file:
    dataframe1.to_string(text_file)
I am able to convert .xlsx into .txt but I am also getting the index value printed in the text file. I want to remove that.
0 9100499S 1
1 9100099S 10
2 910000SW 1
3 91Y961SR 5
4 9120301S 20
above is the text file created but I do not need the index values in the first column.
And I also wanted to separate the two column values with a tab, to make it a TSV text file so that they do not end up in one cell. How do I go about it?
9100499S 1
9100099S 10
910000SW 1
91Y961SR 5
9120301S 20
Adding index=False to your to_string call will prevent the index from being included.
If you want tab-separated values, use to_csv instead of to_string. This function is meant to generate comma-separated values, but you can change which character is used as the separator by adding sep='\t' to get tab-separated values instead.
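Putting both suggestions together (with a small DataFrame standing in for the read_excel result, since the .xlsx file itself is not shown):

```python
import pandas as pd

# stand-in for the Excel contents, using values from the question
dataframe1 = pd.DataFrame({'code': ['9100499S', '9100099S', '910000SW'],
                           'qty': [1, 10, 1]})

# tab separator, no index column; header=False also drops the column names
txt = dataframe1.to_csv(sep='\t', index=False, header=False)
print(txt)
```

Passing a path instead of capturing the string (dataframe1.to_csv(r'C:\path\exceldata1.txt', sep='\t', index=False, header=False)) writes the file directly.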

Force Pandas to keep multiple columns with the same name

I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to the Excel file:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
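If the final output can be CSV rather than Excel, the same header override works with to_csv, which is easy to verify inline (a minimal sketch with made-up column names):

```python
import pandas as pd

# pandas has deduplicated the headers to x.1, x.2, x.3 internally
df = pd.DataFrame([[1, 2, 3]], columns=['x.1', 'x.2', 'x.3'])

# header= replaces the column names only in the output, not in the DataFrame
out = df.to_csv(index=False, header=['x', 'x', 'x'])
print(out)
```

The DataFrame keeps its distinct internal names, so later lookups are unambiguous; only the presentation file shows the duplicates.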
You can add spaces to the end of the column name. It will appear the same in Excel, but pandas can distinguish the difference (note one trailing space on the second name and two on the third).
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x', 'x ', 'x  '])
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9

Reference Excel in Python

I am writing a python code for beam sizing. I have an Excel workbook from AISC that has all the data of the shapes and other various information on the cross-sections. I would like to be able to reference data in particular cells in this Excel workbook in my python code.
For example, if the width of a rectangle is 2in and stored in cell A1, and the height is 10in and stored in cell B1, I want to write code in Python that somehow pulls in cells A1 and B1 and multiplies them.
I do not need to export back into excel I just want to make python do all the work and use excel purely as reference material.
Thank you in advance for all your advice and input!!
Try pandas as well...might be easier to work with than lists in this case
DATA :
Width Height
4 2
4 4
1 1
4 5
2 2
Code
import pandas as pd
#read the file
beam = pd.read_csv('cross_section.csv')
beam['BeamSize'] = beam['Width']*beam['Height'] #do your calculations
Output:
>>> beam
Width Height BeamSize
0 4 2 8
1 4 4 16
2 1 1 1
3 4 5 20
4 2 2 4
You can slice and dice the data as you wish.
For example, let's say you want the fifth beam:
>>> beam.iloc[4]
Width 2
Height 2
BeamSize 4
Name: 4, dtype: int64
Check this for more info:
http://pandas.pydata.org/pandas-docs/stable/
You can read directly from excel as well..
Thank you for your inputs. I have found the solution I was looking for by using NumPy.
data = np.loadtxt(r'C:\Users\[User_Name]\Desktop\[fname].csv', delimiter=',')
Using that, it took the data and created an array with the data I needed. Now I am able to use the data like any other matrix or array.
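Once loadtxt has produced the array, the width-times-height calculation from the question is a single vectorised multiply (made-up dimensions below standing in for the loaded CSV):

```python
import numpy as np

# stand-in for the loaded CSV: one row per shape, columns are width, height
data = np.array([[2.0, 10.0],
                 [4.0, 5.0]])

areas = data[:, 0] * data[:, 1]   # elementwise width * height
print(areas)
```

Any other per-row formula (section modulus, moment of inertia, etc.) works the same way on the column slices.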
If you don't mind adding a (somewhat heavy) dependency to your project, pandas has a read_excel function that can read a sheet from an Excel workbook into a pandas DataFrame object, which acts sort of like a dictionary. Then reading a cell would just amount to something like:
data = pd.read_excel('/path/to/file.xls')
cell_a1 = data.loc[1, 'a'] # or however it organizes it on import
For future readers of this question, it should be mentioned that xlrd is the "most exact" solution to your requirements. It will allow you to read data directly from an Excel file (no need to convert to CSV first).
Many other packages that read Excel files (including pandas) use xlrd themselves to provide that capability. They are useful, but "heavier" than xlrd (larger, more dependencies, may require compilation of C code, etc.).
(Incidentally, pandas happens to use both xlrd and NumPy.)
