I am trying to use tables in pandas.
The original data looks like this (.txt file):
µm nm
1.34E+00 1.39E+00
1.34E+00 1.61E+00
...
When I manually convert the file from .txt to .csv, by opening it in Excel and saving as a .csv file, I obtain something like this:
µm;nm
1.339216;1.388997
1.340324;1.612847
1.341462;1.587352
1.342533;1.686544
...
Which is working fine in pandas, using the following code:
import pandas as pd

file = 'filename.csv'
df = pd.read_csv(file, sep=';')
df
[screenshot: dataframe from the manually converted .csv file]
Which is what I want. But since I am planning to deal with a lot of those files, I need to process them as a batch. So I need to obtain the same dataframe from the original files, which come as .txt.
But if I try to do that from the original data, it looks like this:
[screenshot: the whole file is read into a single column]
The code is as follows:
df2 = pd.read_csv('filename.txt', sep=';', encoding='unicode_escape')
df2.to_csv('filename-2.csv', sep='\t', index=None)
df2
Please note that I use the 'unicode_escape' value to avoid the error message "'utf-8' codec can't decode byte 0xb5 in position 0: invalid start byte".
I tried to specify various separators, but without success so far.
I hope someone will be able to help.
Thanks,
Sébastien.
When reading df2, you should use the same separator as the one given while writing it (assuming it was written from df).
df2 = pd.read_csv('filename.txt', sep = ';')
Or if .txt is a separate file altogether,
df2 = pd.read_csv('filename.txt', sep = '\t')
This should give a correctly formatted dataframe.
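Since the goal is batch processing, here is a minimal sketch, assuming the original .txt files are whitespace-separated and Latin-1 encoded (0xb5 is 'µ' in Latin-1); the glob pattern and output naming are illustrative:
import glob
import pandas as pd

for path in glob.glob('*.txt'):
    # sep=r'\s+' splits on any run of spaces or tabs
    df = pd.read_csv(path, sep=r'\s+', encoding='latin-1')
    df.to_csv(path.replace('.txt', '.csv'), sep=';', index=False)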
I am trying to read this file using read_csv in pandas (Python), but I am not able to capture all columns. Can you help?
Here is the code:
file = r'path of file'
df = pd.read_csv(file, encoding='cp1252', on_bad_lines='skip')
Thank you
I tried to read your file, and I first noticed that the encoding you specified does not correspond to the one used in your file. I also noticed that the separator is not a comma (,) but a tab (\t).
First, to get the file encoding (in Linux), you just need to run:
$ file -i kopie.csv
kopie.csv: text/plain; charset=utf-16le
In Python:
import pandas as pd
path_to_file = 'kopie.csv'
df = pd.read_csv(path_to_file, encoding='utf-16le', sep='\t')
And when I print the shape of the loaded dataframe:
>>> df.shape
(869, 161)
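If you are not on Linux, a rough Python-side equivalent is to guess the encoding with the third-party chardet package (an assumption here; install it with pip install chardet):
import chardet

# Read a raw byte sample and let chardet guess the encoding.
with open('kopie.csv', 'rb') as f:
    raw = f.read(100_000)
print(chardet.detect(raw))  # e.g. {'encoding': 'UTF-16', 'confidence': 1.0, ...}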
I am trying to merge a large number of .csv files. They all have the same table format, with 60 columns each. The merged data comes out fine, except that the first row consists of 640 columns instead of 60; the remainder of the merged .csv has the desired 60-column format. I am unsure where in the merge process it went wrong.
The first item in the problematic row is the first item in 20140308.export.CSV while the second (starting in column 61) is the first item in 20140313.export.CSV. The first .csv file is 20140301.export.CSV the last is 20140331.export.CSV (YYYYMMDD.export.csv), for a total of 31 .csv files. This means that the problematic row consists of the first item from different .csv files.
The data comes from http://data.gdeltproject.org/events/index.html, in particular the dates March 01 - March 31, 2014. Inspecting each individual downloaded .csv file shows that each file is formatted the same way, with tab delimiters (despite the .CSV extension).
The code I used is below. If there is anything else I can post, please let me know. All of this was run through Jupyter Lab through Google Cloud Platform. Thanks for the help.
import glob
import pandas as pd
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory= False) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
I used the following bash code to download the data:
!curl -LO http://data.gdeltproject.org/events/[20140301-20140331].export.CSV.zip
I used the following code to unzip the data:
!unzip -a "********".export.CSV.zip
I used the following code to transfer to my storage bucket:
!gsutil cp 2014DataCombinedMarch.csv gs://ddeltdatabucket/2014DataCombinedMarch.csv
Looks like these CSV files have no header on them, so Pandas is trying to use the first row in the file as a header. Then, when Pandas tries to concat() the dataframes together, it's trying to match the column names which it has inferred for each file.
I figured out how to suppress that behavior:
import glob
import pandas as pd
def read_file(f):
    names = [f"col_{i}" for i in range(58)]
    return pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, names=names)
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([read_file(f) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
You can supply your own column names to Pandas through the names parameter. Here, I'm just supplying col_0, col_1, col_2, etc for the names, because I don't know what they should be. If you know what those columns should be, you should change that names = line.
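As a side note, if generic names are acceptable anyway, passing header=None alone should also stop pandas from consuming the first data row as a header; pandas then numbers the columns 0, 1, 2, ... by itself. A hedged one-line alternative to the names= approach:
# header=None keeps the first row as data and auto-numbers the columns.
df = pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, header=None)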
I tested this script, but only with 2 data files as input, not all 31.
PS: Have you considered using Google BigQuery to get the data? I've worked with GDELT before through that interface and it's way easier.
I have a messy text file that I need to sort into columns in a dataframe so I can do the data analysis I need to do. Here is the messy looking file:
[screenshot: the messy text file]
I can read it in as a csv file, which looks a bit nicer, using:
import pandas as pd
data = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt')
print(data)
This prints the data aligned, but the output is [640 rows x 1 column], and I need to separate it into multiple columns and manipulate it as a dataframe.
I have tried a number of solutions using StringIO that have worked here before, but nothing seems to be doing the trick.
One option is to let pandas split on runs of whitespace by passing delim_whitespace=True (see the read_csv docs):
df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', delim_whitespace=True)
Your input file is actually not in CSV format.
As you provided only a .png picture, it is not even clear whether this file is divided into rows or not.
If not, you have to start by "cutting" the content into individual lines and reading the content from the output file - the result of this cutting.
I think this is the first step before you can use either read_csv or read_table (of course, with delim_whitespace=True).
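Putting the two answers together, a minimal sketch, assuming the file really is one record per line with whitespace between fields (sep=r'\s+' is the regex equivalent of delim_whitespace=True and works on current pandas, where delim_whitespace is deprecated):
import pandas as pd

df = pd.read_csv('phx_30kV_indepth_0_0_outfile.txt', sep=r'\s+')
print(df.shape)  # should now report multiple columns instead of 1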
I have a text file that contains data like this. It is just a small example, but the real one is pretty similar.
I am wondering how to display such data in an "Excel Table" like this using Python?
The pandas library is wonderful for reading csv files (which is the file content in the image you linked). You can read in a csv or a txt file using the pandas library and output this to Excel in 3 simple lines.
import pandas as pd
df = pd.read_csv('input.csv') # if your file is comma separated
Or, if your file is tab-delimited ('\t'):
df = pd.read_csv('input.csv', sep='\t')
To save to an Excel file, add the following:
df.to_excel('output.xlsx', 'Sheet1')
Complete code:
import pandas as pd
df = pd.read_csv('input.csv') # can replace with df = pd.read_table('input.txt') for '\t'
df.to_excel('output.xlsx', 'Sheet1')
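Note: writing .xlsx files requires an Excel engine such as openpyxl to be installed (pip install openpyxl); pandas picks it up automatically.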
This will keep the index by default, so if your input file was:
A,B,C
1,2,3
4,5,6
7,8,9
Your output Excel would look like this:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
You can see your data has been shifted one column and your index axis has been kept. If you do not want this index column (because you have not assigned your df an index so it has the arbitrary one provided by pandas):
df.to_excel('output.xlsx', 'Sheet1', index=False)
Your output will look like:
A  B  C
1  2  3
4  5  6
7  8  9
Here you can see the index has been dropped from the Excel file.
You do not need Python! Just rename your text file to .csv and voilà, you get your desired output :)
If you want to rename using Python, you can use the os.rename function:
import os
os.rename(src, dst)
where src is the source file and dst is the destination file.
XLWT
I use the XLWT library. It produces native Excel files, which is much better than simply importing text files as CSV files. It is a bit of work, but provides most key Excel features, including setting column widths, cell colors, cell formatting, etc.
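For illustration, a minimal xlwt sketch (xlwt writes legacy .xls files; the data and styling below are made up to show column widths and cell formatting):
import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')

# A bold header style; easyxf parses a small formatting mini-language.
header = xlwt.easyxf('font: bold on')
for col, name in enumerate(['A', 'B', 'C']):
    ws.write(0, col, name, header)
    ws.col(col).width = 256 * 12  # widths are measured in 1/256ths of a character

for row, values in enumerate([(1, 2, 3), (4, 5, 6), (7, 8, 9)], start=1):
    for col, value in enumerate(values):
        ws.write(row, col, value)

wb.save('testfile.xls')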
With pandas, saving is simply:
df.to_excel("testfile.xlsx")
I have around 60 .csv files which I would like to combine in pandas. So far I've used this:
import pandas as pd
import glob
total_files = glob.glob("something*.csv")
data = []
for csv in total_files:
    list = pd.read_csv(csv, encoding="utf-8", sep='delimiter', engine='python')
    data.append(list)
biggerlist = pd.concat(data, ignore_index=True)
biggerlist.to_csv("output.csv")
This works somewhat, but although the files I would like to combine all have the same structure of 15 columns with the same headers, when I use this code only one column is filled with the contents of the entire row, and every column name is a concatenation of all the column names (e.g. SEARCH_ROW, DATE, TEXT, etc.).
How can I combine these csv files, while keeping the same structure of the original files?
Edit:
So perhaps I should be a bit more specific regarding my data. This is a snapshot of one of the .csv files I'm using:
As you can see it is just newspaper-data, where the last column is 'TEXT', which isn't shown completely when you open the file.
This is a part of how it looks when I have combined the data using my code.
Apart from that, I can read any of these .csv files without problems using:
data = pd.read_csv("something.csv", encoding="utf-8", sep='delimiter', engine='python')
I solved it!
The problem was the number of commas in the text part of my .csv files. So after removing all commas (just using search/replace), I used:
import pandas
import glob
filenames = glob.glob("something*.csv")
df = pandas.DataFrame()
for filename in filenames:
    df = df.append(pandas.read_csv(filename, encoding="utf-8", sep=";"))
Thanks for all the help.
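For completeness, a version of the same loop that avoids DataFrame.append (removed in pandas 2.0); the file pattern and separator are taken from the solution above:
import glob
import pandas as pd

filenames = glob.glob("something*.csv")
# Read every file, then concatenate once; this is also faster than
# appending inside the loop.
df = pd.concat(
    (pd.read_csv(f, encoding="utf-8", sep=";") for f in filenames),
    ignore_index=True,
)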