Combining .csv Files in Python - Merged File Data Error - Jupyter Lab

I am trying to merge a large number of .csv files. They all have the same table format, with 60 columns each. The merged data comes out fine, except that the first row consists of 640 columns instead of 60; the remainder of the merged .csv has the desired 60-column format. I'm unsure where the merge process went wrong.
The first item in the problematic row is the first item in 20140308.export.CSV while the second (starting in column 61) is the first item in 20140313.export.CSV. The first .csv file is 20140301.export.CSV the last is 20140331.export.CSV (YYYYMMDD.export.csv), for a total of 31 .csv files. This means that the problematic row consists of the first item from different .csv files.
The data comes from http://data.gdeltproject.org/events/index.html, in particular the dates March 01 - March 31, 2014. Inspecting each individual downloaded .csv file shows that every file is formatted the same way, with tab delimiters between the values (despite the .CSV extension).
The code I used is below. If there is anything else I can post, please let me know. All of this was run through Jupyter Lab through Google Cloud Platform. Thanks for the help.
import glob
import pandas as pd
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
I used the following bash code to download the data:
!curl -LO http://data.gdeltproject.org/events/[20140301-20140331].export.CSV.zip
I used the following code to unzip the data:
!unzip -a "********".export.CSV.zip
I used the following code to transfer to my storage bucket:
!gsutil cp 2014DataCombinedMarch.csv gs://ddeltdatabucket/2014DataCombinedMarch.csv

Looks like these CSV files have no header row, so Pandas uses the first data row of each file as its header. Then, when Pandas tries to concat() the dataframes together, it tries to match the column names it inferred for each file, and the result contains the union of all the inferred names — which is why your first row balloons to 640 columns.
I figured out how to suppress that behavior:
import glob
import pandas as pd
def read_file(f):
    # supply explicit column names so Pandas doesn't treat the first data row as a header
    names = [f"col_{i}" for i in range(58)]
    return pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, names=names)
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([read_file(f) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
You can supply your own column names to Pandas through the names parameter. Here, I'm just supplying col_0, col_1, col_2, etc for the names, because I don't know what they should be. If you know what those columns should be, you should change that names = line.
I tested this script, but only with 2 data files as input, not all 31.
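Alternatively — a minimal sketch, not from the original answer — you can pass header=None and let Pandas assign integer column labels itself, which then match across files without supplying any names:
import glob
import pandas as pd
# header=None marks the files as header-less, so every file gets the
# same integer column labels (0..57) and concat aligns them correctly
combined = pd.concat(
    pd.read_csv(f, delimiter='\t', encoding='UTF-8', header=None, low_memory=False)
    for f in glob.glob('*.export.CSV')
)
combined.to_csv('2014DataCombinedMarch.csv')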
PS: Have you considered using Google BigQuery to get the data? I've worked with GDELT before through that interface and it's way easier.
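If you do go the BigQuery route, here is a rough sketch using the official client. The table name gdelt-bq.full.events is from memory and should be verified; it is an assumption, not part of the original answer:
from google.cloud import bigquery
client = bigquery.Client()
sql = """
SELECT *
FROM `gdelt-bq.full.events`  -- public GDELT 1.0 events table (verify the name)
WHERE SQLDATE BETWEEN 20140301 AND 20140331
"""
# requires the google-cloud-bigquery package (and its dataframe extras)
df = client.query(sql).to_dataframe()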

Related

Extracting a column from a collection of csv files and constructing a new table with said data

I'm a newbie when it comes to Python, with a bit more experience in MATLAB. I'm currently trying to write a script that loops through a folder to pick up all the .csv files, extracts column 14 from csv file 1 and adds it to column 1 of a new table, extracts column 14 from csv file 2 and adds it to column 2 of the new table, and so on, building up a table of column 14 from all csv files in the folder. Ideally I'd like the headers of the new table to show the respective filename that each column 14 was extracted from.
I'm aware that Python is zero-indexed, so I've double-checked that it reads the desired column, but as my code stands, I can only get it to print all the files' 14th columns into one array, and I'm not sure how to split it up to put it into a table. Perhaps via a dataframe, although I'm not entirely sure how they work.
Any help would be greatly appreciated!
Code attached below:
import os
import sys
import csv

pathName = "D:/GLaDOS-CAMPUS/data/TestData-AB/"
numFiles = []
fileNames = os.listdir(pathName)
for fileName in fileNames:
    if fileName.endswith(".csv"):
        numFiles.append(fileName)
print(numFiles)

for i in numFiles:
    file = open(os.path.join(pathName, i), "rU")
    reader = csv.reader(file, delimiter=',')
    for column in reader:
        print(column[13])
Finding files
I can't verify whether your way of finding files is right, since I don't have a folder of csv files to test against, but it is much cleaner to use glob to get the list of files:
from glob import glob
files = glob("/Path/To/Files/*.csv")
This will return all csv files.
Reading CSV files
Now we need a way to read all the files and get the 14th column (index 13, since Python is zero-indexed). I don't know if it is overkill, but I prefer to use pandas and numpy for this.
To read a column of a csv file using pandas one can use:
pd.read_csv(file, usecols=[COL])
Now we can loop over the files and get the 14th column of each:
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
Notice we converted all values to numpy arrays.
Merging all columns
In columns, each file's extracted column is an element of a list, so technically they are rows, not columns.
Now we should get the transpose of the array so it will become columns:
pd.DataFrame(np.transpose(columns))
The code
The whole code would look like:
from glob import glob
import pandas as pd
import numpy as np
files = glob("/Path/To/Files/*.csv")
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
print(pd.DataFrame(np.transpose(columns)))
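The question also asked for the new table's headers to show the source filenames. A minimal sketch of one way to do that (not from the original answer; columns of different lengths are padded with NaN):
from glob import glob
import os
import pandas as pd
files = glob("/Path/To/Files/*.csv")
# map each filename to its 14th column (index 13); the dict keys
# become the column headers of the resulting DataFrame
data = {os.path.basename(f): pd.read_csv(f, usecols=[13]).iloc[:, 0] for f in files}
print(pd.DataFrame(data))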

Extracting individual rows from dataframe

I am currently doing one of my final assignments, and I have a CSV file with a few columns of different data.
I am interested in extracting a single column and writing each of its rows out to its own txt file.
Here is my code:
import pandas as pd
import csv
df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])
num_of_review = len(df["content"])
print(num_of_review)
for i in range(num_of_review):
    with open("{}.txt".format(i), "a", encoding="utf-8") as f:
        f.write(df["content"][i])
There was no issue with extracting the individual rows. But when I examined the txt files that were produced and looked at their content, I noticed that the text was copied out (which is what I want), but twice (which is not what I want).
Example:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
This is what was copied to the txt file:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file.This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
Any advice on how to copy the content only once?
Thanks! While thinking about how to rectify this, I came to the same conclusion as you: opening with "a" appends, so every run of the script adds another copy of the text to the existing files, whereas "w" truncates each file before writing. I made the switch from "a" to "w" and it solved the issue.
I'm too used to append, so I tried that before I tried write.
The correct code:
import pandas as pd
import csv
df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])
num_of_review = len(df["content"])
print(num_of_review)
for i in range(num_of_review):
    with open("{}.txt".format(i), "w", encoding="utf-8") as f:
        f.write(df["content"][i])
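For what it's worth, a slightly more idiomatic sketch of the same loop (same AUS_NZ.csv file and content column as above), using pathlib:
import pandas as pd
from pathlib import Path
df = pd.read_csv("AUS_NZ.csv")
# write_text opens in "w" mode, so re-running the script overwrites
# the files instead of appending duplicate copies of the text
for i, text in enumerate(df["content"]):
    Path(f"{i}.txt").write_text(text, encoding="utf-8")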

How to convert data from txt files to Excel files using python

I have a text file that contains data like this. It is just a small example, but the real one is pretty similar.
I am wondering how to display such data in an "Excel Table" like this using Python?
The pandas library is wonderful for reading csv files (which is the file content in the image you linked). You can read in a csv or a txt file using the pandas library and output this to excel in 3 simple lines.
import pandas as pd
df = pd.read_csv('input.csv') # if your file is comma separated
or, if your file is tab delimited ('\t'):
df = pd.read_csv('input.csv', sep='\t')
To save to an excel file, add the following:
df.to_excel('output.xlsx', 'Sheet1')
Complete code:
import pandas as pd
df = pd.read_csv('input.csv')  # or df = pd.read_table('input.txt') for tab-delimited input
df.to_excel('output.xlsx', 'Sheet1')  # writing .xlsx needs an engine such as openpyxl installed
This keeps the index by default, so if your input file was:
A,B,C
1,2,3
4,5,6
7,8,9
Your output excel would look like this, with the index written as an extra first column:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
You can see your data has been shifted one column and your index axis has been kept. If you do not want this index column (because you have not assigned your df an index so it has the arbitrary one provided by pandas):
df.to_excel('output.xlsx', 'Sheet1', index=False)
Your output will look like:
A  B  C
1  2  3
4  5  6
7  8  9
Here you can see the index has been dropped from the excel file.
You do not need Python! Just rename your text file to .csv and voila, you get your desired output :)
If you want to rename using Python, you can use the os.rename function:
os.rename(src, dst)
Where src is the source file and dst is the destination file
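For example (hypothetical filenames, just to illustrate):
import os
# renaming only changes the extension; it does not convert the contents,
# so this only helps if the text is already comma separated
os.rename("input.txt", "input.csv")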
XLWT
I use the XLWT library. It produces native Excel files, which is much better than simply importing text files as CSV files. It is a bit of work, but provides most key Excel features, including setting column widths, cell colors, cell formatting, etc.
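A minimal xlwt sketch (sheet name and cell values are illustrative, not from the original post):
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')
ws.write(0, 0, 'A')  # write(row, column, value)
ws.write(0, 1, 'B')
ws.write(1, 0, 1)
ws.write(1, 1, 2)
wb.save('output.xls')  # xlwt produces legacy .xls files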

Combining multiple .csv files using pandas and keeping the original structure

I have around 60 .csv files which I would like to combine in pandas. So far I've used this:
import pandas as pd
import glob
total_files = glob.glob("something*.csv")
data = []
for csv in total_files:
    list = pd.read_csv(csv, encoding="utf-8", sep='delimiter', engine='python')
    data.append(list)
biggerlist = pd.concat(data, ignore_index=True)
biggerlist.to_csv("output.csv")
This works somewhat, except that the files I would like to combine all have the same structure of 15 columns with the same headers, and when I use this code, only one column is filled with the info of the entire row, and every column name is an add-up of all the column names (e.g. SEARCH_ROW, DATE, TEXT, etc.).
How can I combine these csv files, while keeping the same structure of the original files?
Edit:
So perhaps I should be a bit more specific regarding my data. This is a snapshot of one of the .csv files I'm using:
As you can see, it is just newspaper data, where the last column is 'TEXT', which isn't shown completely when you open the file.
This is a part of how it looks when I have combined the data using my code.
Separately, I can read any of these .csv files without a problem using
data = pd.read_csv("something.csv", encoding="utf-8", sep='delimiter', engine='python')
I solved it!
The problem was the number of commas in the text part of my .csv files. So after removing all the commas (just using search/replace), I used:
import pandas
import glob
filenames = glob.glob("something*.csv")
df = pandas.DataFrame()
for filename in filenames:
    df = df.append(pandas.read_csv(filename, encoding="utf-8", sep=";"))
Thanks for all the help.
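(As an aside, not part of the original answer: DataFrame.append was removed in pandas 2.0, so on current pandas the equivalent of the loop above is a single pd.concat:)
import glob
import pandas as pd
filenames = glob.glob("something*.csv")
# read every file, then concatenate once instead of appending in a loop
df = pd.concat(
    (pd.read_csv(f, encoding="utf-8", sep=";") for f in filenames),
    ignore_index=True,
)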

I want to combine csv's, dropping rows while only keeping certain columns

This is the code I have so far:
import pandas as pd
import glob, os
os.chdir("L:/FMData/")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results = results.append(namedf)
results.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This, however, produces a new document as instructed but doesn't copy any data into it. I want to look up files in a certain location with names along the lines of FM001, combine them, skip the first 7 rows of each csv, and keep only columns 1 and 2 in the new file. Can anyone help with my code?
Thanks in advance!!!
To combine multiple csv files, you should create a list of dataframes. Then combine the dataframes within your list via pd.concat in a single step. This is much more efficient than appending to an existing dataframe.
In addition, you need to write your result to a file outside your for loop.
For example:
results = []
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results.append(namedf)  # append to the list; list.append returns None, so don't reassign
df = pd.concat(results, axis=0)
df.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This code works on my side (using Linux and Python 3); it populates a csv file with data in it.
Add a print just after the read_csv to see if your csv file actually reads any data; if nothing is read, nothing will be written. Like this:
namedf = pd.read_csv(file)
print(namedf)
results = results.append(namedf)
It adds row 1 (probably because it is considered the header) and then skips 7 rows before continuing. This is my result for a csv file containing the words one to eleven, one per row:
F5331_FM001.csv
      one
0    nine
1     ten
2  eleven
Addition:
If print(namedf) shows nothing, then check your input files.
The Python program is looking in L:/FMData/ for your files. Are you sure your files are located in that directory? You can change the directory by passing the correct path to the os.chdir command.
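A quick check along those lines (same path and pattern as the question):
import glob
import os
os.chdir("L:/FMData/")             # directory where the csv files should live
print(glob.glob("F5331_FM001**"))  # an empty list means a wrong directory or pattern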
