How to separate a CSV column by position in Python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 1 year ago.
I have data that I want to separate into 3 columns from the one column in a CSV file.
The original file looks like this:
0400000006340000000000965871
0700000007850000000000336487
0100000003360000000000444444
I would like to separate the columns to resemble the list below, while still preserving the leading zeros:
04 0000000634 0000000000965871
07 0000000785 0000000000336487
01 0000000336 0000000000444444
I can load the file in Python, but I don't know which delimiter or positions I have to use. The code I have so far:
import pandas as pd
df = pd.read_csv('new_numbers.txt', header=None)
Thank you for the help.

Use the pandas read_fwf() method - which stands for "fixed-width format":
pd.read_fwf('new_numbers.txt', widths=[2, 10, 16], header=None)
which will drop the leading zeroes:
   0    1       2
0  4  634  965871
1  7  785  336487
2  1  336  444444
To keep them, specify the dtype as strings with object:
pd.read_fwf('new_numbers.txt', widths=[2, 10, 16], dtype=object, header=None)
Output:
    0           1                 2
0  04  0000000634  0000000000965871
1  07  0000000785  0000000000336487
2  01  0000000336  0000000000444444
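If the field boundaries are easier to express as character positions than as widths, read_fwf also accepts a colspecs argument. A minimal sketch, with the question's sample data fed through an in-memory buffer so it runs standalone:

```python
import io

import pandas as pd

# Sample lines from the question, via a buffer instead of the real file
raw = io.StringIO(
    "0400000006340000000000965871\n"
    "0700000007850000000000336487\n"
    "0100000003360000000000444444\n"
)

# colspecs lists (start, end) character positions for each column;
# dtype=object keeps the fields as strings so leading zeros survive
df = pd.read_fwf(raw, colspecs=[(0, 2), (2, 12), (12, 28)],
                 dtype=object, header=None)
print(df)
```

The widths=[2, 10, 16] form above is just shorthand for these same boundaries.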

It looks like there is no delimiter and the fields have fixed lengths.
You can pull fixed-length fields out of each line with slice notation.
So for instance:
str1 = "0400000006340000000000965871"
str1A = str1[:2]
str1B = str1[2:12]
str1C = str1[12:]
I wouldn't particularly bother with pandas for it unless you need a dataframe out the far end.
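Scaled up to the whole file, the slicing approach might look like this (a sketch; the sample lines are inlined so the snippet is self-contained):

```python
# Slice each record at positions 2 and 12 (field widths 2, 10, 16)
lines = [
    "0400000006340000000000965871",
    "0700000007850000000000336487",
    "0100000003360000000000444444",
]

rows = [f"{s[:2]} {s[2:12]} {s[12:]}" for s in lines]
for row in rows:
    print(row)
```

Writing rows out with "\n".join(rows) then gives exactly the three-column layout from the question, leading zeros intact.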

You don't need pandas to load your text file and read its contents (and also, you aren't loading a CSV file).
with open("new_numbers.txt") as f:
    lines = f.readlines()
What I suggest is to use the re module.
import re
PATTERN = re.compile(r"(\d{2})(\d{10})(\d{16})")
Each group captures one fixed-width field, leading zeros included; you can test the pattern against your sample lines.
Then you need to get matches from your lines, and join them with a space.
matches = []
for line in lines:
    match = PATTERN.match(line)
    first, second, third = match.group(1, 2, 3)
    matches.append(" ".join([first, second, third]))
At the end, matches will be a list of space-separated numbers (with leading zeros preserved).
At this point you can write them to another file, or do whatever you need to do with it.
towrite = "\n".join(matches)
with open("output.txt", "w") as f:
    f.write(towrite)

Related

Is there a way to write a one column pd data frame on top of a mxn pd dataframe to a text file in Python

I have two data frames of different dimensions that I want to write to a text (.txt) file such that one is on top of the other. I'm sure it's easy, but I haven't found a way to do it.
The data I want to write is:
import numpy as np
import pandas as pd
preamble = pd.DataFrame(np.array(["software", "version", "frequency: 100", "firmware:100.10.1"]).T)
data = pd.DataFrame.from_dict({"frame": np.array([1, 2, 3, 4, 5]), "X": np.array([2,4,6,8,10]), "Y": np.array([3,6,9,12,15]), "Z": np.array([1,2,3,4,5])})
I want to create a text file that looks like this:
software
version
frequency:100
firmware: 100.10.1
frame X Y Z
1 2 3 1
2 4 6 2
3 6 9 3
4 8 12 4
5 10 15 5
I tried to get the format right at the top.
I want to keep the [frame, X, Y, Z] headers where they are. But place the "preamble" at the top in a column.
I've tried to append and combine the two data frames, but can't do it. I don't think that's possible.
I've tried looking for ways to write the preamble in cell (column = 1, row = 1) and then start the data in cell (column = 1, row = 5).
Any help here would be appreciated! Please let me know if you need more information!
The main idea is to use to_string().
Here is an idea of how to do this in your example:
with open('myfile.txt', 'w') as fp:
    fp.write(preamble.to_string(index=False, header=False))
    fp.write('\n')
    fp.write(data.to_string(index=False))
Here I passed index=False and header=False so that neither pandas' generated column label nor the row index ends up in the file, and added one newline character \n to separate the two DataFrames.
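Because to_string() right-aligns the preamble strings, each preamble line can carry leading spaces. An alternative sketch that writes the preamble column directly and reserves to_string() for the table (out.txt is a hypothetical file name; the frames are re-created inline so the block runs standalone):

```python
import pandas as pd

# Re-create the question's two frames inline
preamble = pd.DataFrame(["software", "version", "frequency: 100", "firmware:100.10.1"])
data = pd.DataFrame({"frame": [1, 2, 3, 4, 5], "X": [2, 4, 6, 8, 10],
                     "Y": [3, 6, 9, 12, 15], "Z": [1, 2, 3, 4, 5]})

with open("out.txt", "w") as fp:
    for line in preamble[0]:   # one preamble entry per line, no padding
        fp.write(line + "\n")
    fp.write(data.to_string(index=False))
```

This keeps the [frame, X, Y, Z] header in place on the line right after the preamble.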

Import CSV file where last column has many separators [duplicate]

This question already has an answer here: python pandas read_csv delimiter in column data (1 answer). Closed 2 years ago.
The dataset looks like this:
region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
As you can see, the last column sometimes has separators in it. This makes it hard to import the CSV file in pandas using read_csv(). The only way I can import the file is by adding the parameter error_bad_lines=False to the function. But this way I'm losing some of the data.
How can I import the CSV file without losing data?
I would read the file as one single column and parse manually:
df = pd.read_csv(filename, sep='\t')
pat = ','.join([f'(?P<{x}>[^,]*)' for x in ['region','state','latitude','longitude']])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:,0].str.extract(pat)
Output:
    region state  latitude  longitude             status
0  florida    FL   27.8333    -81.717  open,for,activity
1  georgia    GA   32.9866   -83.6487               open
2   hawaii    HI   21.1098  -157.5311      illegal,stuff
3     iowa    IA   42.0046    -93.214    medical,limited
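Fed the sample rows through an in-memory buffer, the single-column-plus-extract approach can be sketched end to end (io.StringIO stands in for the real file):

```python
import io

import pandas as pd

raw = """region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited
"""

# Use a separator that never occurs, so each line lands in one column
df = pd.read_csv(io.StringIO(raw), sep='\t')

# Four comma-free fields, then everything remaining captured as status
pat = ','.join(f'(?P<{x}>[^,]*)' for x in ['region', 'state', 'latitude', 'longitude'])
pat = '^' + pat + ',(?P<status>.*)$'
df = df.iloc[:, 0].str.extract(pat)
print(df)
```

The named groups become the column names of the extracted frame, so no rename step is needed afterwards.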
Have you tried the old-school technique with the split function? A major downside is that you'd end up losing data or bumping into errors if your data has a , in any of the first 4 fields/columns, but if not, you could use it.
data = open(file, 'r').read().split('\n')
for line in data:
    items = line.split(',', 4)  # assuming there are 4 standard columns, and the 5th column has commas
Each row items would look, for example, like this:
['hawaii', 'HI', '21.1098', '-157.5311', 'illegal,stuff']
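The maxsplit rows drop straight into a DataFrame once the header is peeled off. A sketch with the sample data inlined:

```python
import pandas as pd

raw = """region,state,latitude,longitude,status
florida,FL,27.8333,-81.717,open,for,activity
georgia,GA,32.9866,-83.6487,open
hawaii,HI,21.1098,-157.5311,illegal,stuff
iowa,IA,42.0046,-93.214,medical,limited"""

header, *lines = raw.split("\n")
columns = header.split(",")

# maxsplit=4 keeps the first four fields intact and leaves the rest glued together
rows = [line.split(",", 4) for line in lines]
df = pd.DataFrame(rows, columns=columns)
print(df)
```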

Alignment of column names and its corresponding rows in Python

I have a CSV file which is very messy in terms of column and row alignment. In the first cell, all column names are stated, but they do not align with the rows beneath. So when I load this CSV in Python using pandas, I do not get a clean dataframe.
Below is an example of how it should look, with the columns separated and matching the rows.
Some details:
Few lines of raw CSV file:
Columns:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu"
Rows:
ITLT4301;1;"1-5-2018";976439;35059255;53842;6545371441;3235864;95200029;"MemActive";"4096";"0";"0"
Code:
df = pd.read_csv(file_location, sep=";")
Output when loading the dataframe in python:
VMName;"Cluster";"time";"AvgValue";"MinValue";"MaxValue";"MetricId";"MemoryMB";"CpuMHz";"NumCpu",,,
ITLT4301;1;"1-5-2018";976439,35059255 53842,6545371441 3235864,"95200029 MemActive"" 4096"" 0"" 0"""
Desired output:
VMName Cluster time AvgValue MinValue MaxValue MetricId MemoryMB CpuMHz
ITLT4301 1 1-5-201 976439 35059255 53842 6545371441 95200029 MemActive
NumCpu
4096
Hopefully this clears up the topic and problem a bit. The desired output is a well-organized data frame where the columns match the rows based on the separator sign ";".
Your input data file is not a standard csv file. The correct way would be to fix the previous step in order to get a normal csv file, instead of a mess of double quotes that prevents any decent csv parser from correctly extracting the data.
As a workaround, it is possible to remove the initial and terminating double quote, remove any doubled double quotes, and split every line on the semicolon, ignoring any remaining double quotes. Optionally, you could also try to just remove every double quote and split the lines on ';'. It really depends on what values you expect.
A possible code could be:
def split_line(line):
    '''Split a line on ";" after stripping whitespace; the initial and
    terminating double quote and any doubled double quotes are removed.'''
    return line.strip()[1:-1].replace('""', '').split(';')

with open('file.dat') as fd:
    cols = split_line(next(fd))               # extract column names from the header line
    data = [split_line(line) for line in fd]  # process the data lines
    df = pd.DataFrame(data, columns=cols)     # build a dataframe from that
With that input:
"VMName;""Cluster"";""time"";""AvgValue"";""MinValue"";""MaxValue"";""MetricId"";""MemoryMB"";""CpuMHz"";""NumCpu"""
"ITLT4301;1;""1-5-2018"";976439" 35059255;53842 6545371441;3235864 "95200029;""MemActive"";""4096"";""0"";""0"""
"ITLT4301;1;""1-5-2018"";98" 9443749608104;29 3435452286154;673 "067568681366;""CpuUsageMHz"";""0"";""5600"";""2"""
It gives:
VMName Cluster time AvgValue MinValue \
0 ITLT4301 1 1-5-2018 976439" 35059255 53842 6545371441
1 ITLT4301 1 1-5-2018 98" 9443749608104 29 3435452286154
MaxValue MetricId MemoryMB CpuMHz NumCpu
0 3235864 "95200029 MemActive 4096 0 0
1 673 "067568681366 CpuUsageMHz 0 5600 2

Extract data in R or Python from data file with no column headers [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 4 years ago.
I have a txt file with several columns. see sample data below.
25  180701  1  12
25  180701  2  15
25  180701  3  11
25  180702  1  11
25  180702  2  14
25  180722  2  14
14  180701  1  11
14  180701  2  13
There are no column headers. Column 1 is ID, Column 2 is date, Column 3 is Hour, Column 4 is value. I am trying to look up the number 25 in column 1 and extract data for all hours during period 180701 to say 180705 all values. so the result would be a new text file with following data.
25  180701  1  12
25  180701  2  15
25  180701  3  11
25  180702  1  11
25  180702  2  14
Any help in R or Python is appreciated. Thanks!
When reading the file with read.csv/read.table, set the option header = FALSE and supply the column names with col.names:
df1 <- read.csv("file.csv", header = FALSE,
                col.names = c("ID", "date", "Hour", "value"))
and subset the values later
subset(df1, ID == 25 & (date %in% 180701:180705), select = 1:4)
In R, readr::read_delim() has a col_names parameter that you can set to FALSE:
> readr::read_delim('hi;1;T\nbye;2;F', delim = ';', col_names = F)
# A tibble: 2 x 3
X1 X2 X3
<chr> <int> <lgl>
1 hi 1 TRUE
2 bye 2 FALSE
In Python, try this:
import pandas as pd

# To read csv files without headers, use header=None to be explicit
df = pd.read_csv('test.csv', header=None)
df

# Then rename the generated columns
df2 = df.rename({0: 'ID', 1: 'Date', 2: 'Hours', 3: 'Value'}, axis='columns')
df2
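To mirror the R subset() call in pandas, filter df2 by ID and a date range with between(). A sketch with the sample data typed in inline so the block is self-contained:

```python
import pandas as pd

df2 = pd.DataFrame({
    "ID":    [25, 25, 25, 25, 25, 25, 14, 14],
    "Date":  [180701, 180701, 180701, 180702, 180702, 180722, 180701, 180701],
    "Hours": [1, 2, 3, 1, 2, 2, 1, 2],
    "Value": [12, 15, 11, 11, 14, 14, 11, 13],
})

# Equivalent of subset(df1, ID == 25 & (date %in% 180701:180705))
result = df2[(df2["ID"] == 25) & df2["Date"].between(180701, 180705)]
print(result)
```

The filtered frame can then be written back out with result.to_csv(..., sep=" ", index=False, header=False) to match the headerless input format.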

python read data from file

I've a very simple question: which is the most efficient way to read different entries from a txt file with Python?
Suppose I've a text file like:
42017 360940084.621356 21.00 09/06/2015 13:08:04
42017 360941465.680841 29.00 09/06/2015 13:31:05
42017 360948446.517761 16.00 09/06/2015 15:27:26
42049 361133954.539315 31.00 11/06/2015 18:59:14
42062 361208584.222483 10.00 12/06/2015 15:43:04
42068 361256740.238150 19.00 13/06/2015 05:05:40
In C I would do:
while(fscanf(file_name, "%d %lf %f %d/%d/%d %d:%d:%d", &id, &t0, &score, &day, &month, &year, &hour, &minute, &second) != EOF){...some instruction...}
What would be the best way to do something like this in Python, storing every value in a different variable (since I have to work with those variables throughout the code)?
Thanks in advance!
I feel like the muddyfish answer is good, here is another way (maybe a bit lighter)
import time

with open(file) as f:
    for line in f:
        identifier, t0, score, date, hour = line.split()
        # You can also get a time_struct from the time
        timer = time.strptime(date + hour, "%d/%m/%Y%H:%M:%S")
I would look up the string.split() method
For example you could use
for line in file.readlines():
    data = dict(zip(("id", "t0", "score", "date", "time"), line.split(" ")))
    instructions()
Depending on what you want to do with the data, pandas may be something to look into:
import pandas as pd
with open(file_name) as infile:
    df = pd.read_fwf(infile, header=None, parse_dates=[[3, 4]],
                     date_parser=lambda x: pd.to_datetime(x, format='%d/%m/%Y %H:%M:%S'))
The double list [[3, 4]], together with the date_parser argument, will read columns 3 and 4 (0-indexed) as a single datetime column. You can then access individual parts of that column with
>>> df['3_4'].dt.hour
0 13
1 13
2 15
3 18
4 15
5 5
dtype: int64
(If you don't like the '3_4' key, use the parse_dates argument above as follows:
parse_dates={'timestamp': [3, 4]}
)
read_fwf is for reading fixed-width columns, which your data seems to adhere to. Alternatively, there are functions such as read_csv, read_table and many more.
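For instance, since the fields here are simply whitespace-separated, read_csv with a whitespace separator works too; the column names below are assumptions chosen to match the C sscanf variables. A sketch with the sample rows inlined:

```python
import io

import pandas as pd

raw = """42017 360940084.621356 21.00 09/06/2015 13:08:04
42017 360941465.680841 29.00 09/06/2015 13:31:05
42049 361133954.539315 31.00 11/06/2015 18:59:14
"""

# One or more whitespace characters as the separator; name the columns up front
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None,
                 names=['id', 't0', 'score', 'date', 'time'])

# Combine the two text columns into a proper datetime afterwards
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['time'],
                                 format='%d/%m/%Y %H:%M:%S')
print(df[['id', 'timestamp']])
```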
(This answer is pretty much a duplicate of this SO answer, but since this question here is more general, I've put this here as another answer, not as a comment.)
