I'm trying to automate my code a bit more, but for that my program needs to know the filenames:
uploaded1 = files.upload()
df = pd.read_csv('Formulário sem título1.csv')
I already tried doing it like this:
uploaded1 = files.upload()
df = pd.read_csv(uploaded1)
But it doesn't work like that. I don't know if it's the best approach, but I'm thinking of doing something like this:
uploaded1 = files.upload()
file_name = uploaded1[filename]
df = pd.read_csv(file_name)
You didn't say what files is; I found files.upload() only in snippets for Google Colab, so I assume it comes from Google Colab - it is not part of pandas.
The Colab snippets show that you can get the filenames using .keys():
from google.colab import files
uploaded = files.upload()
for name in uploaded.keys():
    print('filename:', name)
    print('length:', len(uploaded[name]))
EDIT:
Full working code
from google.colab import files
import pandas as pd
uploaded = files.upload()
for name in uploaded.keys():
    print('filename:', name)
    print('length:', len(uploaded[name]))

df = pd.read_csv(name)
print(df)
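As a side note, files.upload() returns a dict that maps each filename to the file's raw bytes, so you can also read the upload without writing it to disk first. A minimal sketch, assuming a single CSV was uploaded:
import io
import pandas as pd
from google.colab import files

uploaded = files.upload()
# the dict key is the filename, the value is the raw file content (bytes)
name, data = next(iter(uploaded.items()))
df = pd.read_csv(io.BytesIO(data))
print(name, df.shape)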
I've looked all over SO for an approach to this problem, and none of the ones I've tried have worked. I've seen several posts about downloading zip files from URL or unzipping files from a local directory in Python, but I am a bit confused on how to put it all together.
My problem: I have a page of zipped data that is organized by month going back to 2010. I'd like to use some Python code to:
scrape the page and nab only the .zip files (there's other data on the page)
unzip each respective monthly dataset
extract and concatenate all the .csv files in each unzipped folder to one long Pandas dataframe
I've tried:
from urllib.request import urlopen
url = "https://s3.amazonaws.com/capitalbikeshare-data/2010-capitalbikeshare-tripdata.zip"
save_as = "tripdata1.csv"
# Download from URL
with urlopen(url) as file:
    content = file.read()

# Save to file
with open(save_as, 'wb') as download:
    download.write(content)
but this returns gibberish.
Then, I tried an approach I saw from a related SO post:
import glob
import pandas as pd
from zipfile import ZipFile
path = r'https://s3.amazonaws.com/capitalbikeshare-data/index.html' # my path to site
#load all zip files in folder
all_files = glob.glob(path + "/*.zip")
df_comb = pd.DataFrame()
flag = False
for filename in all_files:
    zip_file = ZipFile(filename)
    for text_file in zip_file.infolist():
        if text_file.filename.endswith('tripdata.csv'):
            df = pd.read_csv(zip_file.open(text_file.filename),
                             delimiter=';',
                             header=0,
                             index_col=['ride_id'])
            if not flag:
                df_comb = df
                flag = True
            else:
                df_comb = pd.concat([df_comb, df])

print(df_comb.info())
But this returned a df with zero data, or, with additional tinkering, an error that there were no filenames on the page... :/
Final data should essentially just be a row-wise merge of all the monthly trip data from the index.
Any advice or fixes will be highly appreciated!
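For what it's worth, here is a rough sketch of one possible direction: download a zip into memory with urllib, open it with zipfile, and concatenate its CSV members with pandas. It only handles the single 2010 URL from your first attempt (repeat it per zip link you scrape), and the read_csv arguments are assumptions since column layouts vary between years.
import io
from urllib.request import urlopen
from zipfile import ZipFile
import pandas as pd

url = "https://s3.amazonaws.com/capitalbikeshare-data/2010-capitalbikeshare-tripdata.zip"

# the download is a zip archive, not a csv, so unpack it in memory
with urlopen(url) as resp:
    archive = ZipFile(io.BytesIO(resp.read()))

frames = [pd.read_csv(archive.open(member))
          for member in archive.namelist()
          if member.endswith('.csv')]
trips = pd.concat(frames, ignore_index=True)
print(trips.info())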
I tried to get the genres of the songs in regional-us-daily-latest and output the genres and other data as a CSV file. But Colab said:
FileNotFoundError: [Errno 2] No such file or directory: 'regional-us-daily-latest.csv'
I mounted My Drive, but it still didn't work.
Could you shed some light on this?
!pip3 install spotipy
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import json
from google.colab import drive
client_id = 'ID'
client_secret = 'SECRET'
client_credentials_manager = spotipy.oauth2.SpotifyClientCredentials(client_id, client_secret)
spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
import csv
csvfile = open('/content/drive/MyDrive/regional-us-daily-latest.csv', encoding='utf-8')
csvreader = csv.DictReader(csvfile)
us = ("regional-us-daily-latest.csv", "us.csv")
for region in (us):
    inputfile = region[0]
    outputfile = region[1]
    songs = pd.read_csv(inputfile, index_col=0, header=1)
    songs = songs.assign(Genre=0)
    for index, row in songs.iterrows():
        artist = row["Artist"]
        result = spotify.search(artist, limit=1, type="artist")
        genre = result["artists"]["items"][0]["genres"]
        songs['Genre'][index] = genre
    songs.head(10)
    songs.to_csv(outputfile)
    files.download(outputfile)
Save the CSV file in Google Drive, go to your notebook, click on Drive, and search for your file there. Then copy the path of the CSV file into a variable and pass that variable to the read_csv() method.
Please mount the drive first:
from google.colab import drive
drive.mount('/content/drive')
Change the directory to MyDrive and check the current directory:
import os
os.chdir("drive/My Drive/")
print(os.getcwd())
!ls
Set the path of the file, and use the source_file variable wherever the file name is required:
source_file = os.path.join(os.getcwd(), "regional-us-daily-latest.csv")
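With the path set, reading the file is a one-liner (using the source_file variable from above):
import pandas as pd

df = pd.read_csv(source_file)  # source_file points at the CSV in MyDrive
df.head()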
I have a GCS bucket and can list all the files in it from Google Colab like this:
!gsutil ls gs://custom_jobs/python_test/
This lists all the files, which are:
test_1.csv
test_2.csv
I can read a single file at a time like this:
d = pd.read_csv('gs://custom_jobs/python_test/test_1.csv')
What I intend to do is read test_1.csv and test_2.csv into a single dataframe, like I can do locally:
import glob
files = glob.glob("/home/shashi/python_test/*.csv")
all_dat = pd.DataFrame()
for file in files:
    dat = pd.read_csv(file)
    all_dat = all_dat.append(dat, ignore_index=True)
How is this possible in google colab when my files are on google storage bucket?
Try using the ls command in gsutil
Ex:
import subprocess
import pandas as pd

result = subprocess.run(['gsutil', 'ls', 'gs://custom_jobs/python_test/*.csv'],
                        stdout=subprocess.PIPE)

all_dat = pd.DataFrame()
# stdout is bytes, so decode it before splitting into lines
for file in result.stdout.decode().splitlines():
    dat = pd.read_csv(file.strip())
    all_dat = all_dat.append(dat, ignore_index=True)
One simple solution might be:
from google.cloud import storage
import pandas as pd

bucket_name = "your-bucket-name"
all_dat = pd.DataFrame()
storage_client = storage.Client()

# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
for blob in blobs:
    dat = pd.read_csv("gs://{}/{}".format(bucket_name, blob.name))
    all_dat = all_dat.append(dat, ignore_index=True)
One simple solution that I found was:
files = !gsutil ls -r gs://custom_jobs/python_test/
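Colab returns that listing as a Python list of strings, so you can feed it straight to pandas. A sketch, assuming gcsfs is installed so read_csv can open gs:// paths (your earlier single-file read suggests it is):
import pandas as pd

# 'files' holds the output lines of the gsutil call above; keep only CSV objects,
# since 'gsutil ls -r' also prints directory header lines
csv_paths = [f for f in files if f.strip().endswith('.csv')]
all_dat = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)
print(all_dat.shape)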
I want to download the daily data about COVID-19 cases from the ECDC website. How do I do that with Python code and import it into my notebook?
I have previously downloaded the data from GitHub, but I have no idea how to download the data from a link provided on a live website.
from github.MainClass import Github
g = Github('KEY')
repo = g.get_repo("CSSEGISandData/COVID-19")
file_list = repo.get_contents("csse_covid_19_data/csse_covid_19_daily_reports")
github_dir_path = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/'
file_path = github_dir_path + str(file_list[-2]).split('/')[-1].split(".")[0]+ '.csv'
I was just able to use this and download it. Is your issue getting the list of files, or were you unaware that you can use URLs in read_csv?
import pandas as pd
url = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv'
df = pd.read_csv(url, error_bad_lines = False)
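If you want several daily reports at once, a hedged sketch (assuming the file names keep the MM-DD-YYYY.csv pattern shown above):
import pandas as pd

base = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
        'csse_covid_19_data/csse_covid_19_daily_reports/')
dates = pd.date_range('2020-01-22', '2020-01-25')  # example range, adjust as needed
frames = [pd.read_csv(base + d.strftime('%m-%d-%Y') + '.csv') for d in dates]
daily = pd.concat(frames, ignore_index=True)
print(daily.shape)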
A TSV (tab-separated values) file can't be uploaded to Google Colab and read with pandas.
I used this to upload my file:
import io
df2 = pd.read_csv(io.BytesIO(uploaded['Filename.csv']))
import io
stk = pd.read_csv(io.BytesIO(uploaded['train.tsv']))
The TSV file should be uploaded and read into the dataframe stk.
import pandas as pd
from google.colab import files
import io
#firstly, upload file to colab
uploaded = files.upload()
#secondly, get path to file in colab
file_path = io.BytesIO(uploaded['file_name.tsv'])
#the last step is familiar to us
df = pd.read_csv(file_path, sep='\t', header=0)
To save a TSV file on Google Colab, the .to_csv function can be used as follows:
df.to_csv('path_in_drive/filename.tsv', sep='\t', index=False, header=False)
stk = pd.read_csv('path_in_drive/filename.tsv', sep='\t')  # to read the file back
I don't know if this solves your problem, since it doesn't upload the files, but with this approach you can import files that are on your Google Drive.
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive/My\ Drive/
After mounting, you should be able to load files into your script just like on your desktop.
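For example, a quick sketch assuming a hypothetical my_data.csv sitting directly in My Drive:
import pandas as pd

# hypothetical file in the root of My Drive
df = pd.read_csv('/gdrive/My Drive/my_data.csv')
df.head()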