I have created a function to parse data from the CityBik API for four cities.
In the function I parse the JSON, add the details I need, and then build a DataFrame using pandas.
When I run the function, the data appears to come from one city only. Does anyone know how I can fix this?
def parse_bike_data(city_name, fpath):
    fin = open(fpath, "r")
    json_data = fin.read()  # read in the raw JSON text
    data = json.loads(json_data)
    stations = data['network']['stations']
    rows = []
    # look at each observation (one per station)
    for obs in stations:
        row = {"City": city_name}
        # parse the local datetime in ISO 8601 format
        obs_date = datetime.strptime(obs["timestamp"], "%Y-%m-%dT%H:%M:%S.%f%z")
        # strip the timezone
        obs_date = obs_date.replace(tzinfo=None)
        # round it to the nearest hour
        row["Timestamp"] = round_datetime(obs_date)
        # add in the relevant station data
        row['Station'] = obs["name"]
        row["Free Bikes"] = obs["free_bikes"]
        row["Empty Slots"] = obs["empty_slots"]
        row["Capacity"] = obs["extra"]["slots"]
        rows.append(row)
    fin.close()
    return pd.DataFrame(rows)  # return a data frame
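For reference, a minimal sketch of how a per-city parser like this is usually combined into a single multi-city DataFrame with pd.concat (the file paths and city names below are hypothetical placeholders); if only one call's result is kept, or each call overwrites the previous frame, only one city will show up:

import pandas as pd

# Hypothetical file names and city names; the real ones are whatever was
# downloaded from the CityBik API for each of the four cities.
city_files = {
    "Dublin": "dublin.json",
    "Paris": "paris.json",
    "Oslo": "oslo.json",
    "Madrid": "madrid.json",
}

# Build one frame per city, then stack them into a single frame.
frames = [parse_bike_data(name, path) for name, path in city_files.items()]
all_cities = pd.concat(frames, ignore_index=True)
print(all_cities["City"].value_counts())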
I'm new to this community and I hope you can help me with my problem. In my current project I want to scrape a page that lists gas stations, each with several pieces of information. At the moment all of the information from the petrol stations ends up in a single variable. However, I want each gas station to have its own row so that I get one large data frame. Each individual gas station has an id, and the ids are stored in the variable ids.
ids = results["objectID"].tolist()
id_details = []
for i, id in enumerate(ids):
    input_dict = {
        'diff_time_zone': -1,
        'objectID': id,
        'poiposition': '50.5397219 8.7328552',
        'stateAll': '2',
        'category': 1,
        'language': 'de',
        'prognosis_offset': -1,
        'windowSize': 305
    }
    encoded_input_string = json.dumps(input_dict, indent=2).encode('utf-8')
    encoded_input_string = base64.b64encode(encoded_input_string).decode("utf-8")
    r = s.post("https://example.me/getObject_detail.php", headers=headers, data="post=" + encoded_input_string)
    soup = BeautifulSoup(r.text, "lxml")
    lists = soup.find('div', class_='inside')
    rs = lists.find_all("p")
    final = []
    for lists in rs:
        txt = lists if type(lists) == NavigableString else lists.text
        id_details.append(txt)
df = pd.DataFrame(id_details, columns=['place'])
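For clarity, the one-row-per-station shape I am aiming for would look roughly like this sketch (the keys are hypothetical placeholders, since the real fields come from the scraped page):

import pandas as pd

rows = []
for i, id in enumerate(ids):
    # ... build the request and parse the detail page for this id, as above ...
    # Everything belonging to this one station goes into a single dict;
    # "place" here stands in for whatever fields are scraped from the page.
    rows.append({
        "objectID": id,
        "place": " ".join(p.get_text(strip=True) for p in rs),
    })

# One row per gas station, one column per collected field.
df = pd.DataFrame(rows)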
Personally, I would use a database rather than a data frame in that case, and probably not save to a file at all. From what I can see the data is dictionary-based, so it could easily go into Elasticsearch, for example.
If there is a reason that rules out any kind of database and forces you to use a DataFrame, then opening the file and appending to the end of it works fine. You should make your chunks as large as possible, because opening and writing to the file is the bottleneck here, but they still have to be chunks because RAM is not unlimited.
Update: the second approach was asked for.
Some parts of your code are missing, but you will get the idea.
file_name = 'My_File.csv'
cols = ['place']  # e.g. creating an empty csv with only one column - place - using pandas
data = dict(zip(cols, [[] for i in range(len(cols))]))
df = pd.DataFrame(data)  # creating the empty df
df.to_csv(file_name, mode='w', index=False, header=True)  # saving it with the header

id_details = {'place': []}
for i, id in enumerate(ids):
    # Some algo...
    for lists in rs:
        id_details['place'].append(txt)
    if i % 100 == 0:
        df_main = pd.DataFrame(id_details)
        df_main.to_csv(file_name, mode='a', index=False, header=False)
        id_details['place'] = []

# flush whatever is left after the loop
df_main = pd.DataFrame(id_details)
df_main.to_csv(file_name, mode='a', index=False, header=False)
I'm trying to read my Twitter data saved in json format using the following code:
import json
with open(file, 'r') as f:
    line = f.readline()
    tweet = json.loads(line)

df1 = pd.DataFrame(tweet)
This code reads only one tweet and it works, but when I try to read the whole file with:
with open(file, 'r') as f:
    for line in f:
        tweet = json.loads(line)
I receive an error:
JSONDecodeError: Expecting value: line 2 column 1 (char 1)
What should I change to read this file properly?
My main task is to find the creation dates of those tweets, and I found them using the following filters (I just used the one tweet that worked at the beginning):
df2 = df[["user"]]
df3 = df2.loc[['created_at']]
df3
Is there a better way than DataFrames to handle this data?
A more succinct way to read in (all of) your JSON file, for me, looks like this:
import pandas as pd
df = pd.read_json("python.json", orient = 'records', lines = True)
You can then apply transformations to df to get the data from the columns that you are interested in.
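For example, a minimal sketch (assuming the standard Twitter status JSON, where each record has a top-level created_at field and a nested user object):

import pandas as pd

df = pd.read_json("python.json", orient="records", lines=True)

# Creation date of each tweet (top-level field in the status JSON).
created = df["created_at"]

# Creation date of each tweet's author, pulled from the nested user object.
user_created = df["user"].apply(lambda u: u.get("created_at"))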
You can do something like this:
import pandas as pd

# results is the JSON tweet data.
# Define the columns you want to extract.
resultFrame = pd.DataFrame(columns=["username", "created_at", "tweet"])
print(len(results))
for i in range(len(results)):
    resultFrame.loc[i, "username"] = results[i].user.screen_name
    resultFrame.loc[i, "created_at"] = results[i].created_at
    resultFrame.loc[i, "tweet"] = results[i].text
print(resultFrame.head())
I'm trying to write a dataframe to a .csv using df.to_csv(). For some reason, it's only writing the last value (the data for the last ticker). The script reads through a list of tickers (turtle.csv, all tickers are in the first column) and spits out price data for each ticker. I can print all the data without a problem, but I can't seem to write it all to the .csv. Any idea why? Thanks.
input_file = pd.read_csv("turtle.csv", header=None)
for ticker in input_file.iloc[:, 0].tolist():
    data = web.DataReader(ticker, "yahoo", datetime(2011,06,1), datetime(2016,05,31))
    data['ymd'] = data.index
    year_month = data.index.to_period('M')
    data['year_month'] = year_month
    first_day_of_months = data.groupby(["year_month"])["ymd"].min()
    first_day_of_months = first_day_of_months.to_frame().reset_index(level=0)
    last_day_of_months = data.groupby(["year_month"])["ymd"].max()
    last_day_of_months = last_day_of_months.to_frame().reset_index(level=0)
    fday_open = data.merge(first_day_of_months, on=['ymd'])
    fday_open = fday_open[['year_month_x', 'Open']]
    lday_open = data.merge(last_day_of_months, on=['ymd'])
    lday_open = lday_open[['year_month_x', 'Open']]
    fday_lday = fday_open.merge(lday_open, on=['year_month_x'])
    monthly_changes = {i: MonthlyChange(i) for i in range(1, 13)}
    for index, ym, openf, openl in fday_lday.itertuples():
        month = ym.strftime('%m')
        month = int(month)
        diff = (openf - openl) / openf
        monthly_changes[month].add_change(diff)
    changes_df = pd.DataFrame([monthly_changes[i].get_data() for i in monthly_changes],
                              columns=["Month", "Avg Inc.", "Inc", "Avg.Dec", "Dec"])
    CSVdir = r"C:\Users\..."
    realCSVdir = os.path.realpath(CSVdir)
    if not os.path.exists(CSVdir):
        os.makedirs(CSVdir)
    new_file_name = os.path.join(realCSVdir, 'PriceData.csv')
    new_file = open(new_file_name, 'wb')
    new_file.write(ticker)
    changes_df.to_csv(new_file)
Use a (append) instead of wb, because wb overwrites the data on every iteration of the loop. For the different modes of opening a file, see here.
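As a rough sketch of that change (assuming the variables realCSVdir, ticker, and changes_df from the question's loop):

import os

# Inside the ticker loop: open in append mode so each ticker's rows are added
# to the end of PriceData.csv instead of replacing the previous contents.
new_file_name = os.path.join(realCSVdir, 'PriceData.csv')
with open(new_file_name, 'a') as new_file:
    new_file.write(ticker + '\n')
    changes_df.to_csv(new_file, header=False)

# Equivalently, pandas can append directly:
# changes_df.to_csv(new_file_name, mode='a', header=False)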
I have a data file that has 14 lines of header. In the header is the metadata for the latitude-longitude coordinates and the time. I am currently using
pandas.read_csv(filename, delimiter=",", header=14)
to read in the file, but this just gets the data and I can't seem to get the metadata. Would anyone know how to read in the information in the header? The header looks like:
CSD,20160315SSIO
NUMBER_HEADERS = 11
EXPOCODE = 33RR20160208
SECT_ID = I08
STNBBR = 1
CASTNO = 1
DATE = 20160219
TIME = 0558
LATITUDE = -66.6027
LONGITUDE = 78.3815
DEPTH = 462
INSTRUMENT_ID = 0401
CTDPRS,CTDPRS_FLAG,CTDTMP,CTDTMP_FLAG
DBAR,,ITS-90,,PSS-78
You have to parse the metadata header yourself, but you can do it in an elegant manner in one pass, and even use it on the fly so that you can extract data out of it, check the correctness of the file, and so on.
First, open the file yourself:
f = open(filename)
Then, do the work of parsing each metadata line to extract the data out of it. For the sake of the explanation, I'm just skipping these rows:
for i in range(13):  # skip the first 13 lines that are useless for the columns definition
    f.readline()  # use the resulting string for metadata extraction
Now the file pointer is sitting on the single header line you want to use to load the DataFrame. The cool thing is that read_csv accepts file objects! Thus you can start loading your DataFrame right away:
pandas.read_csv(f, sep=",")
Note that I don't use the header argument, since from your description only that one last header line is useful for your dataframe. You can build on this example and adjust the header parsing and the number of rows to skip.
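A sketch of what the metadata extraction inside that skip loop might look like, assuming KEY = value lines like the ones shown in the question (the line counts mirror the header sample above):

import pandas as pd

metadata = {}
with open(filename) as f:
    f.readline()  # first line ("CSD,20160315SSIO"), not a KEY = value pair
    for i in range(12):  # the NUMBER_HEADERS line plus the 11 metadata lines
        key, _, value = f.readline().partition("=")
        metadata[key.strip()] = value.strip()
    # f is now positioned on the column-name line, so read_csv uses it as the header.
    df = pd.read_csv(f, sep=",")

lat = float(metadata["LATITUDE"])
lon = float(metadata["LONGITUDE"])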
Although the following method does not use Pandas, I was able to extract the header information.
import csv

with open(fname) as csvfile:
    forheader_IO2016 = csv.reader(csvfile, delimiter=',')
    header_IO2016 = []
    for row in forheader_IO2016:
        header_IO2016.append(row[0])

date = header_IO2016[7].split(" ")[2]
time = header_IO2016[8].split(" ")[2]
lat = float(header_IO2016[9].split(" ")[2])
lon = float(header_IO2016[10].split(" ")[4])
I have a .CSV file with 75 columns and almost 4000 rows. I need to create a shapefile (point) for the entire .CSV file, with each column as a field. All 75 columns need to be brought over to the new shapefile, with each column representing a field.
There seems to be a good amount on this topic already, but everything I can find addresses .csv files with a small number of columns.
https://gis.stackexchange.com/questions/17590/why-is-an-extra-field-necessary-when-creating-point-shapefile-from-csv-files-in
https://gis.stackexchange.com/questions/35593/using-the-python-shape-library-pyshp-how-to-convert-csv-file-to-shp
This script looks close to what I need to accomplish, but again it adds a field for every column in the .CSV; in this example there are three fields: DATE, LAT, LON.
import arcpy, csv

arcpy.env.overwriteOutput = True

#Set variables
arcpy.env.workspace = "C:\\GIS\\StackEx\\"
outFolder = arcpy.env.workspace
pointFC = "art2.shp"
coordSys = "C:\\Program Files\\ArcGIS\\Desktop10.0\\Coordinate Systems" + \
           "\\Geographic Coordinate Systems\\World\\WGS 1984.prj"
csvFile = "C:\\GIS\\StackEx\\chicken.csv"
fieldName = "DATE1"

#Create shapefile and add field
arcpy.CreateFeatureclass_management(outFolder, pointFC, "POINT", "", "", "", coordSys)
arcpy.AddField_management(pointFC, fieldName, "TEXT", "", "", 10)

gpsTrack = open(csvFile, "r")
headerLine = gpsTrack.readline()
#print headerLine

#I updated valueList to remove the '\n'
valueList = headerLine.strip().split(",")
print valueList

latValueIndex = valueList.index("LAT")
lonValueIndex = valueList.index("LON")
dateValueIndex = valueList.index("DATE")

# Read each line in csv file
cursor = arcpy.InsertCursor(pointFC)
for point in gpsTrack.readlines():
    segmentedPoint = point.split(",")
    # Get the lat/lon values of the current reading
    latValue = segmentedPoint[latValueIndex]
    lonValue = segmentedPoint[lonValueIndex]
    dateValue = segmentedPoint[dateValueIndex]
    vertex = arcpy.CreateObject("Point")
    vertex.X = lonValue
    vertex.Y = latValue
    feature = cursor.newRow()
    feature.shape = vertex
    feature.DATE1 = dateValue
    cursor.insertRow(feature)
del cursor
Is there a simpler way to create a shapefile using python without adding a field for all 75 columns in the .CSV file? Any help is greatly appreciated.
Simply select just the columns you need; you are not required to use all columns.
Use the csv module to read the file, then just pick out the 2 values from each row:
import csv

cursor = arcpy.InsertCursor(pointFC)
with open('yourcsvfile.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        point = arcpy.CreateObject("Point")
        point.X, point.Y = float(row[5]), float(row[27])  # take the 6th and 28th columns from the row
        cursor.insertRow(point)