I have the following dataset:
https://github.com/felipe0216/fdfds/blob/master/sheffield_weather_station.csv
I know it is a csv, so I can use the pd.read_csv() function. However, if you look into the file, it is not comma separated nor tab separated, and I am not sure what separator it has exactly. Any ideas about how to open it?
The proper way to do this is as follows:
df = pd.read_csv("sheffield_weather_station.csv", comment="#", delim_whitespace=True)
You should first download the .csv file. You have to tell pandas that there are comment lines and that the columns are separated only by whitespace; the number of spaces does not matter.
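If you are on a newer pandas where delim_whitespace is deprecated, the equivalent regex separator does the same job. A small sketch, assuming the file has been downloaded to the working directory:

import pandas as pd

# sep=r"\s+" treats any run of whitespace as a single delimiter,
# and comment="#" skips the metadata lines that start with '#'
df = pd.read_csv("sheffield_weather_station.csv", comment="#", sep=r"\s+")
print(df.head())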
I'm saving my pd.DataFrame with
"""df.to_csv('df.csv', encoding='utf-8-sig)"""
and my csv file has a problem...
Please see the rows that contain content2-1, content2-2, and content2-3 in this pic.
Before saving (to_csv) there was no problem: all the data was in the right columns and 'content2' was not split up. But after df -> csv...
'content2' is all split up, and the other fields of 'id2' are allocated to the wrong columns.
"2018-04-21" has to be in column D, the 0s in E, F, and G, and the url must be in column I.
Why does this happen? Is it because of the large csv file (774,740 KB), because of the language (Korean), or because csv cannot handle the enter key? (All the data with problems, such as content2, was split at the enter key.)
How can I resolve this? I have no idea.
Unfortunately I never figured out the reason for this. I assumed it was something to do with the size of the data I was working with and Excel not liking it.
What worked for me, though, was using .to_excel() instead of .to_csv(). I know, far from a perfect answer, but I thought I'd put it here in case it is enough for your case.
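If it is only Excel that displays the file wrong, the likely culprit is the embedded newlines inside fields like content2: pandas quotes such fields when writing, and pd.read_csv will read them back correctly, but some CSV viewers still split rows at those newlines. A minimal sketch of stripping the newlines before saving, assuming df is the DataFrame from the question:

# collapse embedded newlines inside string fields such as content2,
# so every logical row stays on one physical line in the CSV
df = df.replace(r'\r?\n', ' ', regex=True)
df.to_csv('df.csv', index=False, encoding='utf-8-sig')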
I am trying to load a large text file into a Python dataframe. One thing I noticed is that if I want to load it successfully, I have to drop all the bad lines, but I would like to load all the rows first, take a look, and then clean the data manually. Is there a way to do that?
data = pd.read_csv('filename.txt', sep="\t", error_bad_lines=False, engine='python')
Here are the warnings I've got. It's a common error, but all the solutions just skip these lines, and I really need to load all rows... any thoughts?
Skipping XXX line: Expected 28 fields in line XXX, saw 29
Without knowing more about the specific CSV file, it looks like there is either:
Too many columns in that row (an extra comma)
Quoting is off, meaning there's a comma that should be quoted but isn't
The best way to remedy this is to fix the problem in the CSV file.
Technically you're not just loading the file, but also parsing it at the same time. It looks like you've handled the delimiter properly, so as you may have guessed you have too many columns or too few in some of your rows. That may actually be the case, or perhaps you have tabs within text fields that are being interpreted as delimiters.
In any case, pandas isn't going to parse those inconsistent lines.
A typical approach is to open the file in a robust text editor and look at the lines that are erroring out in Pandas. See what's actually wrong and either fix it in the text editor, or use python's native open() function to load the entire file and iterate line by line, with logic that fixes whatever the problem is.
Once you're certain that you have the same number of columns in every row, load it with Pandas.
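A minimal sketch of that inspection step, assuming a tab-separated file named filename.txt where 28 fields are expected, as in the question:

# collect the raw text of every line whose field count is off,
# so the offending rows can be inspected and fixed by hand
expected = 28
bad_lines = []
with open('filename.txt', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        n_fields = len(line.rstrip('\n').split('\t'))
        if n_fields != expected:
            bad_lines.append((lineno, n_fields, line))

for lineno, n_fields, line in bad_lines:
    print(f'line {lineno}: expected {expected} fields, saw {n_fields}')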
I want to read a .txt file with Pandas, and the problem is that the separator/delimiter consists of a number followed by a minimum of two blanks.
I already tried it similar to this code (How to make separator in pandas read_csv more flexible wrt whitespace?):
pd.read_csv("whitespace.txt", header=None, delimiter=r"\s+")
This only works with one blank or more, so I adjusted it to the following code.
delimiter=r"\d\s\s+"
But this splits my dataframe whenever it sees two blanks or more, whereas I strictly need the number followed by at least two blanks. Does anyone have an idea how to fix it?
My data looks as follows:
I am an example of a dataframe
I have Problems to get read
100,00
So How can I read it
20,00
So the first row should be:
I am an example of a dataframe I have Problems to get read 100,00
followed by the second row:
So How can I read it 20,00
I'd try it like this.
I'd manipulate the text file before attempting to parse it into a dataframe, as follows:
import pandas as pd
import re

# read the whole file and join its physical lines into one string
with open("whitespace.txt", "r") as f:
    g = f.read().replace("\n", " ")

# insert a '#' marker after every number like 100,00 - these numbers
# end a logical row, so '#' becomes an unambiguous row delimiter
prepared_text = re.sub(r'(\d+,\d+)', r'\1#', g)

# split on the inserted marker to get one logical row per entry
df = pd.DataFrame({'My columns': prepared_text.split('#')})
print(df)
This gives the following:
My columns
0 I am an example of a dataframe I have Problems...
1 So How can I read it 20,00
2
I guess this'd suffice as long as the input file isn't too large, but using the re module and substitution gives you the control you seek.
The (\d+,\d+) parentheses mark a group which we want to match. We're basically matching any of your numbers in your text file.
Then we use \1, which is called a backreference: it refers back to the matched group when specifying a replacement. So each match of \d+,\d+ is replaced by the match itself followed by #.
Then we use the inserted character as a delimiter.
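For instance, here is the substitution in isolation on one of the lines above:

import re

# the backreference \1 re-inserts the matched number, with '#' appended
print(re.sub(r'(\d+,\d+)', r'\1#', 'I have Problems to get read 100,00'))
# prints: I have Problems to get read 100,00#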
There are some good examples here:
https://lzone.de/examples/Python%20re.sub
My Python script parses some text out of an Excel file. It strips whitespace, changes the delimiters
(from " : " --> " , ")
and outputs to a CSV file. Much of the data looks like this
(what data looks like in Excel)
Fields end up shifted by a column due to there being an extra comma or two.
CSV == Comma separated values.
I have tried using if statements to add or subtract commas to try to shore it up, but it ends up completely messing up the relative order it was first in. Driving me nuts!
To try to do it another way, I installed the pandas library (a data manipulation library) using pip.
Is it possible to merge columns that have no column headers inside a single DataFrame? There's plenty of advice regarding separate DataFrames, but not much for a single one.
Furthermore, how can I merge the columns while retaining the row position? The emails are in the correct row position but not the column position.
Or am I on the wrong track completely: is pandas overkill for a simple parsing script? I've been learning Python as I go along to try to complete the script, so I might have missed a simple way of doing it.
Some sample data:
C5XXEmployeeNumXX,C5XXEmployeeNumXX,JohnSmith,1,,John,,Smith,,IT Supp.Centre,EU,,London1,,,59XXXX,ITServiceDesk,LOND01,,,,Notmaintained,,,,,,,,john.smith#company.com,
Snippet of parsing logic
for line in f:
    # finds the identifier for users
    if ':LON ' in line:
        # parsing logic:
        # delimiters are swapped, whitespace is scrubbed
        line = line.replace(':', ',')
        line = line.replace(' ', '')
You can use a separator/delimiter of your choice. Check out: https://docs.python.org/2/library/csv.html#csv.Dialect.delimiter.
Also, regarding the order: if you are reading into a list it should be fine, but if you are reading the contents of a row into a dict then it is normal that the order is not preserved.
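For example, a small sketch with the csv module, reading the ':'-delimited export and writing a normal comma-separated file (the file names are made up). Stripping each field individually also avoids deleting the spaces inside values, which line.replace(' ', '') would do:

import csv

with open('export.txt', newline='') as src, \
        open('cleaned.csv', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter=':')
    writer = csv.writer(dst, delimiter=',')
    for row in reader:
        # strip padding around each field but keep internal spaces
        writer.writerow(field.strip() for field in row)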
A csv (comma delimited) file where the lines have an extra trailing delimiter seems to confuse pandas.read_csv. (The data file is [1].)
It treats the extra delimiter as if there were an extra column, so there is one more column than the headers require. pandas.read_csv then takes the first column as row labels. The overall effect is that the columns and headers are no longer aligned: the first column becomes the row labels, the second column is named by the first header, and so on.
It is quite annoying. Any idea how to tell pandas.read_csv to do the right thing? I couldn't find one.
Great book, BTW.
[1]: 2012 FEC Election Database from chapter 9 of the book Python for Data Analysis
For everyone who is still finding this: Wes wrote a blog post about this. The problem is that if there is one value too many in a row, it is treated as the row's name.
This behaviour can be changed by setting index_col=False as an option to read_csv.
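A minimal sketch of the fix; the file name here is an assumption:

import pandas as pd

# index_col=False stops pandas from using the first column as row labels,
# so the empty trailing field no longer shifts the columns
df = pd.read_csv('fec_data.csv', index_col=False)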
I created a GitHub issue to have a look at handling this issue automatically:
https://github.com/pydata/pandas/issues/2442
I think the FEC file format changed slightly causing this annoying issue-- if you use the one posted here http://github.com/pydata/pydata-book you hopefully won't have that problem.
Well, there's a very simple workaround: add a dummy column to the header when reading the csv file in:
cols = ...  # the real column names, obtained separately
cols.append('')  # dummy column that absorbs the extra trailing delimiter
records = pandas.read_csv('filename.txt', skiprows=1, names=cols)
Then columns and header get aligned again.