Converting a text file to .xlsx in Python

10 4
0 0 0 0
0 0 1 0
1 0 2 0
2 2 0 1
0 0 1 1
0 1 2 1
1 1 1 1
1 2 2 0
0 0 1 0
2 0 1 0
I want to read the above text file and store it in .xlsx format. With my code, each whole row ends up in a single cell, but I want every item in its own cell. The items are space-separated.
import pandas as pd
df=pd.read_csv('input.txt')
df.to_excel('test2.xls',index=False)
Expected result: Each item stored in different cell.
Note that the first row only has 2 elements.

You need to specify that the delimiter is whitespace instead of a comma.
import pandas as pd
df = pd.read_csv("input.txt", header=None, delim_whitespace=True, names=['a','b','c','d'])
df.to_excel('test2.xlsx', index=False, header=False)
Notice that I used delim_whitespace=True to tell pandas to split on whitespace instead of commas (sep=r'\s+' is the equivalent spelling, and recent pandas releases deprecate delim_whitespace in its favour). Also, since you have only 2 elements in the first row, I name the columns to let pandas know that you expect 4 columns; the missing cells in the first row are left empty.
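The effect of naming four columns can be checked without touching the disk. The following is a minimal sketch, assuming the first three lines of the file from the question are held in an in-memory buffer (the variable name sample is made up for illustration) and using sep=r"\s+" as the whitespace separator:

```python
from io import StringIO

import pandas as pd

# Stand-in for input.txt (first three lines of the file in the question).
sample = StringIO("10 4\n0 0 0 0\n0 0 1 0\n")

# sep=r"\s+" splits on runs of whitespace; naming four columns lets pandas
# pad the two-field first row with NaN instead of failing on later rows.
df = pd.read_csv(sample, header=None, sep=r"\s+", names=["a", "b", "c", "d"])

print(df.shape)
print(df.iloc[0].tolist())
```

Because of the NaN padding, columns c and d come back as floats; writing the frame with to_excel leaves those two cells blank.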

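import pandas as pd
with open('input.txt') as fin:
    lines = fin.readlines()
# Pad the short first line so it has four fields like every other row.
# Strip its newline first, otherwise the trailing " 0" lands on the next line.
lines[0] = "0 " + lines[0].rstrip("\n") + " 0\n"
with open('input2.txt', 'w') as fout:
    for line in lines:
        fout.write(line)
df = pd.read_csv('input2.txt', header=None, delim_whitespace=True)
df.to_excel('output.xlsx', index=False, header=False)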
This creates an extra file, input2.txt, and assumes that the first line of the text file is the first row of your data.
The only change to the data is that 10 4 becomes 0 10 4 0, so the first row has 4 elements like all the others.
I passed delim_whitespace=True to read_csv() because it tells pandas to split on whitespace instead of commas.
At the top, I open the file, read it with lines = fin.readlines(), and modify lines[0], because lines is a list and its first element is at index 0. I put a 0 and a space on the left and a space and a 0 on the right. Then I write all the lines to a new text file with a for loop.
Finally, I pass that file to read_csv() and call to_excel() so that neither the index nor the column labels are written.
The intermediate text file is not strictly needed, and that part does not use pandas at all. I have not tested this code, so treat it as a sketch.
Hope this helps!
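The same padding idea can be tried without the intermediate file. This is only a sketch, assuming the lines are already in a list (here a hard-coded stand-in for fin.readlines()) and that an in-memory StringIO buffer is an acceptable replacement for input2.txt:

```python
from io import StringIO

import pandas as pd

# Stand-in for the list returned by fin.readlines() on input.txt.
lines = ["10 4\n", "0 0 0 0\n", "0 0 1 0\n"]

# Pad the first line to four fields; strip its newline first so the
# trailing " 0" stays on the same line.
lines[0] = "0 " + lines[0].rstrip("\n") + " 0\n"

# Parse the patched text directly from memory instead of writing input2.txt.
df = pd.read_csv(StringIO("".join(lines)), header=None, sep=r"\s+")
print(df)
```

Note that the padding changes the data (10 4 becomes 0 10 4 0), which may or may not be what the spreadsheet should contain.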

Related

Pandas Dataframe Remove all Rows with Letters in Certain Column

I have a pandas dataframe in python that I want to remove rows that contain letters in a certain column. I have tried a few things, but nothing has worked.
Input:
   A    B  C
0  9    1  a
1  8    2  b
2  7  cat  c
3  6    4  d
I would then remove rows that contained letters in column 'B'...
Expected Output:
   A  B  C
0  9  1  a
1  8  2  b
3  6  4  d
Update:
After seeing the replies, I still haven't been able to get this to work. I'm going to just place my entire code here. Maybe I'm not understanding something...
import pandas as pd
#takes file path from user and removes quotation marks if necessary
sysco1file = input("Input path of FS1 file: ").replace("\"","")
sysco2file = input("Input path of FS2 file: ").replace("\"","")
sysco3file = input("Input path of FS3 file: ").replace("\"","")
#tab separated files, all values string
sysco_1 = pd.read_csv(sysco1file, sep='\t', dtype=str)
sysco_2 = pd.read_csv(sysco2file, sep='\t', dtype=str)
sysco_3 = pd.read_csv(sysco3file, sep='\t', dtype=str)
#combine all rows from the 3 files into one dataframe
sysco_all = pd.concat([sysco_1,sysco_2,sysco_3])
#Also dropping nulls from CompAcctNum column
sysco_all.dropna(subset=['CompAcctNum'], inplace=True)
#ensure all values are string
sysco_all = sysco_all.astype(str)
#implemented solution from stackoverflow
#I also tried putting "sysco_all = " in front of this
sysco_all.loc[~sysco_all['CompanyNumber'].str.isalpha()]
#writing dataframe to new csv file
sysco_all.to_csv(r"C:\Users\user\Desktop\testcsvfile.csv")
I do not get an error. However, the csv still has rows with letters in this column.
Assuming the B column is of string type, we can use str.contains here:
df[~df["B"].str.contains(r'^[A-Za-z]+$', regex=True)]
Here is another way to do it:
# use isalpha to check if value is alphabetic
# use negation to pick where value is not alphabetic
df=df.loc[~df['B'].str.isalpha()]
df
   A  B  C
0  9  1  a
1  8  2  b
3  6  4  d
OR
# output the filtered result to csv, preserving the original DF
df.loc[~df['B'].str.isalpha()].to_csv('out.csv')
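Two details are easy to miss with these filters: boolean indexing returns a new DataFrame, so the result has to be assigned or written out, and str.isalpha() only flags values made up entirely of letters (a value such as "cat1" would pass). The sketch below rebuilds the sample frame from the question and drops any row whose B value contains at least one ASCII letter; the pattern is an assumption about what "contains letters" should mean.

```python
import pandas as pd

# Sample frame from the question; B is kept as strings.
df = pd.DataFrame(
    {"A": [9, 8, 7, 6], "B": ["1", "2", "cat", "4"], "C": ["a", "b", "c", "d"]}
)

# True for any value in B that contains at least one ASCII letter.
has_letters = df["B"].str.contains(r"[A-Za-z]", regex=True)

# Boolean indexing returns a new DataFrame; assign it to keep the result.
filtered = df[~has_letters]
print(filtered)
```

The original df is left unchanged, so writing df to CSV afterwards would still include the "cat" row; it is filtered that should be written.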

Pandas Dataframe reads value of one column wrong

The csv file:
Link to Github
This is my code:
import pandas as pd
df = pd.read_csv("log_1_2018_09_07.csv", encoding="ISO-8859-1", delimiter=';')
print(df.columns.tolist())
dates = []
times = []
outputs = []
for date in df.loc[:, "Datum"]:
    dates.append(date)
    print("date")
    print(date)
for time in df.loc[:, " Zeit"]:
    times.append(time)
    print("time")
    print(time)
for out in df.iloc[:, 19]:
    print("output")
    outputs.append(out)
    print(out)
It reads the dates and times correctly. As I read the file, column T (index 19) should be all 0 except for the 6th value, which is 990. Pandas, however, reads it as all 0 except for the 9th value, which is 1.
Does anybody know why it's reading the wrong values?
Thank you!!
import pandas as pd
url = 'https://raw.github.com/liamrisch/helper/master/log_1_2018_09_07.csv'
df = pd.read_csv(url, encoding="ISO-8859-1", delimiter=';')
df.iloc[:,[6,19]]
Gives:
Teil 1-8 - Abstand Rasthaken MP1-MP2 Teil 1-8
0 26,764 0
1 26,787 0
2 26,792 0
3 26,788 0
4 26,771 0
5 999,990 0
6 26,786 0
7 26,785 0
8 26,780 1
9 26,783 0
10 26,798 0
Take a very close look at the data: the value in that column is actually 1 (in row 8), but since 999,990 takes up more visual space than the other numbers, it creates the illusion that 990 is the value in that cell.
Printing that column of the df (before any manipulation) shows its actual values, without any surprises.
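One way to avoid misreading neighbouring columns is to look at one column at a time. The sketch below uses a tiny made-up excerpt (the column names Abstand and Teil are invented, not the real headers) and additionally assumes that the commas in values such as 26,764 are decimal separators, which read_csv can handle through its decimal parameter.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-column excerpt; the short column names are invented.
sample = StringIO("Abstand;Teil\n26,764;0\n999,990;0\n26,780;1\n")

# decimal="," makes pandas parse 26,764 as the float 26.764.
df = pd.read_csv(sample, sep=";", decimal=",")

# Inspect a single column by position so its values cannot be confused
# with those of the column next to it.
print(df.iloc[:, 1].tolist())
print(df.dtypes)
```

Keep in mind that iloc positions are 0-based, so df.iloc[:, 19] is the 20th column of the file.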

How to skip the first two lines when reading in multiple files into a Pandas df [duplicate]

I'm trying to import a .csv file using pandas.read_csv(), however, I don't want to import the 2nd row of the data file (the row with index = 1 for 0-indexing).
I can't see how not to import it because the arguments used with the command seem ambiguous:
From the pandas website:
skiprows : list-like or integer
Row numbers to skip (0-indexed) or number of rows to skip (int) at the start of the file.
If I put skiprows=1 in the arguments, how does it know whether to skip the first row or skip the row with index 1?
You can try yourself:
>>> import pandas as pd
>>> from io import StringIO
>>> s = """1, 2
... 3, 4
... 5, 6"""
>>> pd.read_csv(StringIO(s), skiprows=[1], header=None)
   0  1
0  1  2
1  5  6
>>> pd.read_csv(StringIO(s), skiprows=1, header=None)
   0  1
0  3  4
1  5  6
I don't have the reputation to comment yet, but I want to add to alko's answer for further reference.
From the docs:
skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
I had the same issue with skiprows while reading a csv file.
I was passing skip_rows=1, which does not work; the parameter is spelled skiprows.
This simple example shows how to use skiprows while reading a csv file:
import pandas as pd
#skiprows=1 will skip first line and try to read from second line
df = pd.read_csv('my_csv_file.csv', skiprows=1)
#print the data frame
df
All of these answers miss one important point -- the n'th line is the n'th line in the file, and not the n'th row in the dataset. I have a situation where I download some antiquated stream gauge data from the USGS. The head of the dataset is commented with '#', the first line after that holds the labels, next comes a line that describes the data types, and last comes the data itself. I never know how many comment lines there are, but I know what the first couple of rows are. Example:
# ----------------------------- WARNING ----------------------------------
# Some of the data that you have obtained from this U.S. Geological Survey database
# may not have received Director's approval. ...
agency_cd site_no datetime tz_cd 139719_00065 139719_00065_cd
5s 15s 20d 6s 14n 10s
USGS 08041780 2018-05-06 00:00 CDT 1.98 A
It would be nice if there were a way to automatically skip the n'th row as well as the n'th line.
As a note, I was able to fix my issue with:
import pandas as pd
ds = pd.read_csv(fname, comment='#', sep='\t', header=0, parse_dates=True)
ds.drop(0, inplace=True)
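The fix above can be tried on a small in-memory sample. This is a sketch under assumptions: the text below is a shortened, invented stand-in for a real USGS download, and the column name gauge is hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical, shortened stand-in for a tab-separated USGS download.
raw = (
    "# ---- WARNING ----\n"
    "# provisional data\n"
    "agency_cd\tsite_no\tgauge\n"
    "5s\t15s\t14n\n"
    "USGS\t08041780\t1.98\n"
    "USGS\t08041780\t2.01\n"
)

# comment="#" drops the comment lines, so header=0 picks the labels line
# regardless of how many comment lines precede it.
ds = pd.read_csv(StringIO(raw), comment="#", sep="\t", header=0, dtype=str)

# Row 0 of the dataset is the field-format line (5s, 15s, ...); drop it.
ds = ds.drop(index=0).reset_index(drop=True)
ds["gauge"] = ds["gauge"].astype(float)
print(ds)
```

Reading everything as strings first keeps the leading zero in site_no and avoids the format line forcing the numeric column to a mixed type.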
Indices in read_csv refer to line/row numbers in your csv file (the first line has the index 0). You have the following options to skip rows:
import pandas as pd
from io import StringIO
csv = \
"""col1,col2
1,a
2,b
3,c
4,d
"""
pd.read_csv(StringIO(csv))
# Output:
col1 col2 # index 0
0 1 a # index 1
1 2 b # index 2
2 3 c # index 3
3 4 d # index 4
Skip two lines at the start of the file (index 0 and 1). The original column names (index 0) are skipped as well, so the first remaining line is used for the column names. To supply column names yourself, pass names=['col1', 'col2']:
pd.read_csv(StringIO(csv), skiprows=2)
# Output:
2 b
0 3 c
1 4 d
Skip second and fourth lines (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=[1, 3])
# Output:
col1 col2
0 2 b
1 4 d
Skip last two lines:
pd.read_csv(StringIO(csv), engine='python', skipfooter=2)
# Output:
col1 col2
0 1 a
1 2 b
Use a lambda function to skip every second line (index 1 and 3):
pd.read_csv(StringIO(csv), skiprows=lambda x: (x % 2) != 0)
# Output:
col1 col2
0 2 b
1 4 d
skiprows=[1] will skip the second line, not the first one.

How to create a DataFrame from custom values

I am reading in a text file, on each line there are multiple values. I am parsing them based on requirements using function parse.
def parse(line):
    ......
    ......
    return line[0],line[2],line[5]
I want to create a dataframe with each line as a row and the three returned values as columns.
df = pd.DataFrame()
with open('data.txt') as f:
    for line in f:
        df.append(line(parse(line)))
When I run the above code, I get all values in a single column. Is it possible to get them in a proper tabular format?
You shouldn't .append to a DataFrame in a loop; that is very inefficient anyway. Do something like:
colnames = ['col1','col2','col3'] # or whatever you want
with open('data.txt') as f:
    df = pd.DataFrame([parse(l) for l in f], columns=colnames)
Note, the fundamental problem is that pd.DataFrame.append expects another DataFrame and appends that DataFrame's rows. It interprets a flat list as a bunch of single-value rows. If you structure your list as a list of "rows", it works as intended. But you shouldn't be using .append here anyway:
In [6]: df.append([1,2,3])
Out[6]:
   0
0  1
1  2
2  3
In [7]: df = pd.DataFrame()
In [8]: df.append([[1, 2, 3]])
Out[8]:
   0  1  2
0  1  2  3
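Building the frame in one call can be sketched end to end as follows. The parse() body here is a hypothetical stand-in (the asker's real function is elided in the question), and the two sample lines replace data.txt. Since DataFrame.append was removed in pandas 2.0, collecting rows in a list first is also the approach that keeps working on current versions.

```python
import pandas as pd


def parse(line):
    # Hypothetical stand-in for the asker's parse(): split on whitespace
    # and keep the fields at positions 0, 2 and 5.
    fields = line.split()
    return fields[0], fields[2], fields[5]


# Stand-in for the lines of data.txt.
lines = ["a b c d e f\n", "g h i j k l\n"]

# Collect one tuple per line, then build the DataFrame in a single call.
rows = [parse(line) for line in lines]
df = pd.DataFrame(rows, columns=["col1", "col2", "col3"])
print(df)
```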
A quick way to do this (TL;DR):
Creating the new column:
df['com_zeros'] = '0'
Applying the condition:
for b in df.itertuples():
    df.loc[b.Index, 'com_zeros'] = '0'+str(b.battles) if b.battles<10 else str(b.battles)
Result:
df
     regiment company deaths  battles size com_zeros
0  Nighthawks     1st    kkk        5    l        05
1  Nighthawks     1st     52       42   ll        42
2  Nighthawks     2nd     25        2    l        02
3  Nighthawks     2nd    616        2    m        02
See the example at https://repl.it/JHW6.
Note:
The example running on repl.it may seem to hang, but it does not; loading pandas on repl.it is just always slow.
To suppress warnings on jupyter notebook:
import warnings
warnings.filterwarnings('ignore')
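The row-by-row loop can likely be replaced by a vectorized string operation, which also avoids the assignment warnings. This sketch swaps in Series.str.zfill, which is not what the answer above uses, and reproduces only the battles column from the result table.

```python
import pandas as pd

# Only the battles column from the result table is reproduced here.
df = pd.DataFrame({"battles": [5, 42, 2, 2]})

# Convert to strings and left-pad with zeros to a width of two characters.
df["com_zeros"] = df["battles"].astype(str).str.zfill(2)
print(df)
```

Values that already have two or more digits are left as they are, so no explicit condition is needed.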
In addition to @juanpa.arrilaga's answer:
It seems that you have a structured file and just need the 1st, 3rd and 6th items (indices 0, 2 and 5) of each line.
Load it and use drop:
df = pd.read_csv('file')
df = df.drop(columns, axis=1)  # columns: list of the column labels you do not need
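If the file really is a regular delimited file, the unwanted columns do not have to be loaded at all. A minimal sketch, assuming a header-less comma-separated file with six fields per line (the sample text is invented), using the usecols parameter of read_csv:

```python
from io import StringIO

import pandas as pd

# Hypothetical header-less CSV with six fields per line.
sample = StringIO("a,b,c,d,e,f\ng,h,i,j,k,l\n")

# usecols keeps only the fields at positions 0, 2 and 5 while parsing.
df = pd.read_csv(sample, header=None, usecols=[0, 2, 5])
print(df)
```

This only helps when the selection is purely positional; any extra logic inside parse() would still need the list-of-rows approach.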

