Python pandas: create a dataframe from CSV embedded within a web .txt file

I am trying to import CSV-formatted data into a pandas dataframe. The CSV data is located within a .txt file that is hosted at a web URL. The issue is that I only want to import the part (or parts) of the .txt file that is formatted as CSV. Essentially I need to skip the first 9 rows and then import rows 10-16 as CSV.
My code
import io
import pandas as pd

url = "http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt"
df = pd.read_csv(io.StringIO(url), skiprows=9, sep=',', skipinitialspace=True)
df
I get a lengthy error message that ultimately ends with "EmptyDataError: No columns to parse from file".
I have looked at similar examples, such as "Read .txt file with Python Pandas - strings and floats", but this is different.

The code above treats the URL string itself as the CSV data rather than the text file fetched from that URL. To see what I mean, take out the skiprows parameter and then show the dataframe. You'll see this:
Empty DataFrame
Columns: [http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt]
Index: []
Note that the column name is the URL itself.
Import requests (you may have to install it first) and then try this:
import requests

content = requests.get(url).content
df = pd.read_csv(io.StringIO(content.decode('utf-8')), skiprows=9)
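Since only a slice of the file is CSV-formatted, it may also help to cap how many rows are read. A minimal sketch, assuming the CSV block really does span rows 10-16 (one header row plus six data rows is my reading of the question, not something verified against the file):

import io
import requests
import pandas as pd

url = "http://www.bom.gov.au/climate/averages/climatology/windroses/wr15/data/086282-3pmMonth.txt"
text = requests.get(url).content.decode('utf-8')

# Row 10 becomes the header after skipping 9 rows; nrows then limits
# the read to the six data rows that follow (an assumed count).
df = pd.read_csv(io.StringIO(text), skiprows=9, nrows=6,
                 sep=',', skipinitialspace=True)
print(df)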

Related

How to avoid pandas to_json escaping forward slashes in URLs

I am trying to load JSON data from a file into a dataframe, filter a few records, and write the result back to a file. My file contains one JSON record per line, and each one has a URL in it.
This is the sample data in the input file.
{"site_code":"111","site_url":"https://www.site111.com"}
{"site_code":"222","site_url":"https://www.site333.com"}
{"site_code":"333","site_url":"https://www.site333.com"}
Sample code I used
import pandas as pd
sites = pd.read_json('sites.json', lines=True)
modified_sites = sites[sites['site_code']!=222]
modified_sites.to_json('modified_sites.json',orient='records',lines=True)
But the generated file contains escaped forward slashes
{"site_code":111,"site_url":"https:\/\/www.site111.com"}
{"site_code":333,"site_url":"https:\/\/www.site333.com"}
How can I avoid it and get the following data in the generated file?
{"site_code":111,"site_url":"https://www.site111.com"}
{"site_code":333,"site_url":"https://www.site333.com"}
Note: I referred to "pandas to_json() redundant backslashes" but it was not helpful for my case.
You can post-process the escaped slashes directly and save the result to a file:
import pandas as pd

sites = pd.read_json('sites.json', lines=True)
modified_sites = sites[sites['site_code'] != 222]

# Serialize to a JSON Lines string, then un-escape the forward slashes.
formatted_json = modified_sites.to_json(orient='records', lines=True).replace('\\/', '/')
with open('modified_sites.json', 'w') as f:
    f.write(formatted_json)
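Alternatively, a sketch that sidesteps pandas' serializer entirely: the standard json module does not escape forward slashes, so each filtered record can be written out as its own line, rebuilding the JSON Lines format by hand:

import json
import pandas as pd

sites = pd.read_json('sites.json', lines=True)
modified_sites = sites[sites['site_code'] != 222]

# json.dumps leaves "/" alone, so the URLs come out unescaped.
with open('modified_sites.json', 'w') as f:
    for record in modified_sites.to_dict(orient='records'):
        f.write(json.dumps(record) + '\n')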

Extracting individual rows from dataframe

I am currently doing one of my final assignments and I have a CSV file with a few columns of different data.
I am interested in extracting a single column and writing its individual rows out as .txt files.
Here is my code:
import pandas as pd

df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])
num_of_review = len(df["content"])
print(num_of_review)
for i in range(num_of_review):
    with open("{}.txt".format(i), "a", encoding="utf-8") as f:
        f.write(df["content"][i])
No issue with extracting the individual rows. But when I examined the .txt files that were extracted and looked at the content, I noticed that the text (which is what I want) was copied out twice (which is not what I want).
Example:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
This is what was copied to the txt file:
"This is an example of what the dataframe have at that particular column which I want to convert to a txt file.This is an example of what the dataframe have at that particular column which I want to convert to a txt file."
Any advice on how to copy the content only once?
Thanks! While thinking about how to rectify this, I came to the same conclusion as you. I made the switch from "a" to "w" and it solved the issue: append mode adds to files left over from an earlier run of the script, so the same text ended up in each file twice, while write mode truncates the file first.
I'm too used to append, so I tried that before I tried write.
The correct code:
import pandas as pd

df = pd.read_csv("AUS_NZ.csv")
print(df.head(10))
print(df["content"])
num_of_review = len(df["content"])
print(num_of_review)
for i in range(num_of_review):
    with open("{}.txt".format(i), "w", encoding="utf-8") as f:
        f.write(df["content"][i])
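For what it's worth, a slightly more idiomatic sketch of the same loop, iterating over the column directly with Series.items() (this assumes the default integer index, so the file names come out the same):

import pandas as pd

df = pd.read_csv("AUS_NZ.csv")
for i, text in df["content"].items():
    # "w" truncates any existing file, so re-running the script
    # cannot duplicate content the way append mode did.
    with open("{}.txt".format(i), "w", encoding="utf-8") as f:
        f.write(text)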

Combining .csv Files in Python - Merged File Data Error - Jupyter Lab

I am trying to merge a large number of .csv files. They all have the same table format, with 60 columns each. The data in my merged table comes out fine, except that the first row consists of 640 columns instead of 60. The remainder of the merged .csv has the desired 60-column format. I am unsure where in the merge process it went wrong.
The first item in the problematic row is the first item in 20140308.export.CSV, while the second (starting in column 61) is the first item in 20140313.export.CSV. The first .csv file is 20140301.export.CSV and the last is 20140331.export.CSV (YYYYMMDD.export.csv), for a total of 31 .csv files. This means that the problematic row consists of the first items from different .csv files.
The Data comes from http://data.gdeltproject.org/events/index.html. In particular the dates of March 01 - March 31, 2014. Inspecting the download of each individual .csv file shows that each file is formatted the same way, with tab delimiters and comma separated values.
The code I used is below. If there is anything else I can post, please let me know. All of this was run through Jupyter Lab through Google Cloud Platform. Thanks for the help.
import glob
import pandas as pd
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory= False) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
I used the following bash code to download the data:
!curl -LO http://data.gdeltproject.org/events/[20140301-20140331].export.CSV.zip
I used the following code to unzip the data:
!unzip -a "********".export.CSV.zip
I used the following code to transfer to my storage bucket:
!gsutil cp 2014DataCombinedMarch.csv gs://ddeltdatabucket/2014DataCombinedMarch.csv
Looks like these CSV files have no header row, so pandas is trying to use the first row in each file as a header. Then, when pandas tries to concat() the dataframes together, it tries to match the column names it has inferred for each file.
I figured out how to suppress that behavior:
import glob
import pandas as pd
def read_file(f):
    names = [f"col_{i}" for i in range(58)]
    return pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, names=names)
file_extension = '.export.CSV'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
combined_csv_data = pd.concat([read_file(f) for f in all_filenames])
combined_csv_data.to_csv('2014DataCombinedMarch.csv')
You can supply your own column names to pandas through the names parameter. Here, I'm just supplying col_0, col_1, col_2, etc. for the names, because I don't know what they should be. If you know what those columns should be, you should change that names = line.
I tested this script, but only with 2 data files as input, not all 31.
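If the real column names don't matter at all, an alternative (my suggestion, not something tested against the GDELT files) is to pass header=None, which makes pandas assign the integer labels 0..N-1 itself, so every file gets identical column names and concat() lines them up:

import glob
import pandas as pd

# header=None tells pandas the files have no header row; columns are
# labeled 0, 1, 2, ... so all frames align when concatenated.
all_filenames = glob.glob("*.export.CSV")
combined = pd.concat(
    [pd.read_csv(f, delimiter='\t', encoding='UTF-8', low_memory=False, header=None)
     for f in all_filenames]
)
combined.to_csv('2014DataCombinedMarch.csv')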
PS: Have you considered using Google BigQuery to get the data? I've worked with GDELT before through that interface and it's way easier.

Combining multiple .csv files using pandas and keeping the original structure

I have around 60 .csv files which I would like to combine in pandas. So far I've used this:
import pandas as pd
import glob
total_files = glob.glob("something*.csv")
data = []
for csv in total_files:
    list = pd.read_csv(csv, encoding="utf-8", sep='delimiter', engine='python')
    data.append(list)
biggerlist = pd.concat(data, ignore_index=True)
biggerlist.to_csv("output.csv")
This works somewhat, but the files I would like to combine all have the same structure of 15 columns with the same headers. When I use this code, only one column is filled, containing the info of the entire row, and the single column name is a concatenation of all the original column names (e.g. SEARCH_ROW, DATE, TEXT, etc.).
How can I combine these csv files, while keeping the same structure of the original files?
Edit:
So perhaps I should be a bit more specific about my data. A snapshot of one of the .csv files I'm using shows it is just newspaper data, where the last column is 'TEXT', which isn't shown completely when you open the file. When I combine the data using my code, the result looks as described above.
Separately, I can read any of these .csv files without a problem using
data = pd.read_csv("something.csv",encoding="utf-8", sep='delimiter', engine='python')
I solved it!
The problem was the number of commas in the text part of my .csv files. So after removing all commas (just using search/replace), I used:
import glob
import pandas

filenames = glob.glob("something*.csv")
# DataFrame.append was removed in pandas 2.0, so collect the frames
# in a list and concatenate them once instead.
frames = [pandas.read_csv(filename, encoding="utf-8", sep=";") for filename in filenames]
df = pandas.concat(frames, ignore_index=True)
Thanks for all the help.
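For what it's worth, the commas may not have been the root cause. With engine='python', a separator longer than one character is treated as a regular expression, so sep='delimiter' matches almost nothing and every line lands in a single column. If the TEXT field is properly quoted in the source files, the default comma parser should handle embedded commas without any search-and-replace. A sketch, assuming standard quoting:

import glob
import pandas as pd

filenames = glob.glob("something*.csv")
# The default C engine splits on commas and respects quoted fields,
# so commas inside the TEXT column are preserved.
df = pd.concat(
    (pd.read_csv(filename, encoding="utf-8") for filename in filenames),
    ignore_index=True,
)
df.to_csv("output.csv", index=False)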

How to use Python web scraping to download a CSV file and convert it to a pandas dataframe?

I'd like my script to do the following:
1) Access this website: http://vincentarelbundock.github.io/Rdatasets/datasets.html
2) Import the CSV file titled "Sales Data with Leading Indicator"
3) Convert it to a pandas dataframe for data analysis.
Currently, the code I have is this:
from urllib import request

response = request.urlopen("http://vincentarelbundock.github.io/Rdatasets/datasets.html")
csv = response.read()
Thanks in advance
pandas.read_csv() accepts a URL to a CSV file as its filepath_or_buffer argument, so
import pandas as pd
pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv')
should basically work. See the pandas read_csv documentation for further info.
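To complete the third step, assign the result and inspect it (BJsales is, as far as I can tell, the Rdatasets entry for "Sales Data with Leading Indicator"):

import pandas as pd

# BJsales: "Sales Data with Leading Indicator" in the Rdatasets index.
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv')
print(df.head())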
