I was using R to get my data prepared, but now I find myself forced to use Python instead.
The CSV files were exported from an sf data frame, where a geometry column stores both longitude and latitude.
In my files, I have the following structure:
a,geometry,b
50,c(-95.11, 10.19),32.24
60,,c(-95.12, 10.27),22.79
70,c(-95.13, 10.28),14.91
80,c(-95.14, 10.33),18.35
90,c(-95.15, 10.5),28.35
99,c(-95.16, 10.7),48.91
The aim here is to read the file while knowing that c(-95.11, 10.19) holds two values, lon and lat, so they can be stored in two different columns. However, having the separator inside a value that is also not quoted as a string makes this really hard to do.
The expected output should be :
a,long,lat,b
50,-95.11, 10.19,32.24
60,,-95.12, 10.27,22.79
70,-95.13, 10.28,14.91
80,-95.14, 10.33,18.35
90,-95.15, 10.5,28.35
99,-95.16, 10.7,48.91
Does this work (input file: data.csv; output file: data_out.csv):
import csv
with open('data.csv', 'r') as fin, open('data_out.csv', 'w', newline='') as fout:
    reader, writer = csv.reader(fin), csv.writer(fout)
    next(reader)                                # skip the original header
    writer.writerow(['a', 'long', 'lat', 'b'])  # write the new header
    for row in reader:
        row[1] = row[1][2:]    # drop the leading "c(" from the longitude field
        row[2] = row[2][1:-1]  # drop the leading space and trailing ")" from the latitude field
        writer.writerow(row)
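If you would rather stay in pandas, here is a minimal sketch of the same idea that pulls the two numbers out of c(...) with a regular expression. It assumes every data row has the four fields shown in the sample (treating the doubled comma in the second row as a typo) and uses the file names from this question:
import re
import pandas as pd

rows = []
with open('data.csv') as f:
    next(f)  # skip the header line
    for line in f:
        # capture a, the two numbers inside c(...), and b
        m = re.match(r'([^,]*),+c\(([^,]+),\s*([^)]+)\),(.*)', line.strip())
        if m:
            rows.append(m.groups())

df = pd.DataFrame(rows, columns=['a', 'long', 'lat', 'b']).apply(pd.to_numeric)
df.to_csv('data_out.csv', index=False)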
In your sample output there is a blank after the second column: is this intended? Also, your sample input has a doubled "," after the first column in line two: is that a typo?
If you were looking for an R-based solution, you may consider extracting the coordinates from the {sf} geometry column into regular columns, and saving accordingly.
Consider this example, built on three semi-random North Carolina cities:
library(sf)
library(dplyr)
cities <- data.frame(name = c("Raleigh", "Greensboro", "Wilmington"),
                     x = c(-78.633333, -79.819444, -77.912222),
                     y = c(35.766667, 36.08, 34.223333)) %>%
  st_as_sf(coords = c("x", "y"), crs = 4326)
cities # a class sf data.frame
Simple feature collection with 3 features and 1 field
geometry type: POINT
dimension: XY
bbox: xmin: -79.81944 ymin: 34.22333 xmax: -77.91222 ymax: 36.08
geographic CRS: WGS 84
name geometry
1 Raleigh POINT (-78.63333 35.76667)
2 Greensboro POINT (-79.81944 36.08)
3 Wilmington POINT (-77.91222 34.22333)
mod_cit <- cities %>%
  mutate(long = st_coordinates(.)[,1],
         lat = st_coordinates(.)[,2]) %>%
  st_drop_geometry()
mod_cit # a regular data.frame
name long lat
1 Raleigh -78.63333 35.76667
2 Greensboro -79.81944 36.08000
3 Wilmington -77.91222 34.22333
I got my .dat data formatted into arrays I could use in graphs and whatnot.
I got my data from this website and it requires an account if you want to download it yourself. The data will still be provided below, however.
https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1028
data in python:
import pandas as pd
df = pd.read_csv("ocean_flux_co2_2d.dat", header=None)
print(df.head())
0 1 2 3
0 -178.75 -77.0 0.000003 32128.7
1 -176.25 -77.0 0.000599 32128.7
2 -173.75 -77.0 0.001649 39113.5
3 -171.25 -77.0 0.003838 58934.0
4 -168.75 -77.0 0.007192 179959.0
I then decided to put this data into arrays that could be put into graphs and other functions.
Like so:
import numpy as np

lat = []
lon = []
sed = []
area = []

with open('/home/srowpie/SrowFinProj/Datas/ocean_flux_tss_2d.dat') as f:
    for line in f:
        parts = line.split(',')
        lat.append(float(parts[0]))
        lon.append(float(parts[1]))
        sed.append(float(parts[2]))
        area.append(float(parts[3]))

lat = np.array(lat)
lon = np.array(lon)
sed = np.array(sed)
area = np.array(area)
My question now is how can I put this data into a map with data points? Column 1 is latitude, Column 2 is longitude, Column 3 is sediment flux, and Column 4 is the area covered. Or do I have to bootleg it by making a graph that takes into account the variables lat, lon, and sed?
You don't need to get the data into an array. Just apply df.values and you would have a numpy array of all the data in the dataframe.
Example -
array([[-1.78750e+02, -7.70000e+01, 3.00000e-06, 3.21287e+04],
[-1.76250e+02, -7.70000e+01, 5.99000e-04, 3.21287e+04],
[-1.73750e+02, -7.70000e+01, 1.64900e-03, 3.91135e+04],
[-1.71250e+02, -7.70000e+01, 3.83800e-03, 5.89340e+04],
[-1.68750e+02, -7.70000e+01, 7.19200e-03, 1.79959e+05]])
I would not recommend storing individual columns as variables. Instead, just set the column names for the dataframe and then use them to extract a pandas Series of the data in each column.
df.columns = ["Latitude", "Longitude", "Sediment Flux", "Area covered"]
This is what the table would look like after that:
   Latitude  Longitude  Sediment Flux  Area covered
0   -178.75      -77.0       0.000003       32128.7
1   -176.25      -77.0       0.000599       32128.7
2   -173.75      -77.0       0.001649       39113.5
3   -171.25      -77.0       0.003838       58934.0
4   -168.75      -77.0       0.007192      179959.0
Simply do df[column_name] to get the data in that column.
For example -> df["Latitude"]
Output -
0 -178.75
1 -176.25
2 -173.75
3 -171.25
4 -168.75
Name: Latitude, dtype: float64
Once you have done all this, you can use folium to plot the rows on real interactive maps.
import folium as fl

map = fl.Map(df.iloc[0, :2], zoom_start = 100)
for index in df.index:
    row = df.loc[index, :]
    fl.Marker(row[:2].values, f"{dict(row[2:])}").add_to(map)
map
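If you are not working in a notebook, you can write the map to an HTML file and open it in a browser (the file name here is just an example):
map.save("ocean_flux_map.html")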
This is the data in a single cell of a dataframe with 14 columns (a cell being one element of a column). There are 45k+ cells like this, so doing it manually would be hell.
I'd like to do 3 things with this cell:
move the text part with the address, state, and zip to another column;
delete the parentheses () from the cell;
separate the longitude and latitude into 2 columns.
How can this be done?
Here's a simple, working example with 2 data points:
text1 = """30881 EKLUTNA LAKE RD
CHUGIAK, AK 99567
(61.4478, -149.3136)"""
text2 = """30882 FAKE STR
CHUGIAK, AK 98817
(43.4478, -119.3136)"""
import pandas as pd

d = {'col1': [text1, text2]}
df = pd.DataFrame(data=d)

def fix(row):
    # We split the text by newline
    address, cp, latlong = row.col1.split('\n')
    # We get the latitude and longitude by splitting by a comma
    latlong_vec = latlong[1:-1].split(',')
    # This part isn't really necessary but we create the variables for clarity
    lat = float(latlong_vec[0])
    long = float(latlong_vec[1])
    return pd.Series([address + ". " + cp, lat, long])

df[['full address', 'lat', 'long']] = df.apply(fix, axis = 1)
Output of the 3 new columns:
df['full address']
0    30881 EKLUTNA LAKE RD. CHUGIAK, AK 99567
1           30882 FAKE STR. CHUGIAK, AK 98817
Name: full address, dtype: object

df['lat']
0    61.4478
1    43.4478
Name: lat, dtype: float64

df['long']
0   -149.3136
1   -119.3136
Name: long, dtype: float64
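If every cell follows the same three-line pattern (street address, city/state/zip, then the coordinates in parentheses), a vectorized sketch with str.extract could avoid the row-wise apply, which matters with 45k+ cells. The column name col1 is the one from the example above:
pattern = (r'(?P<address>.+)\n'
           r'(?P<cp>.+)\n'
           r'\((?P<lat>[-\d.]+),\s*(?P<long>[-\d.]+)\)')
parts = df['col1'].str.extract(pattern)

df['full address'] = parts['address'] + '. ' + parts['cp']
df['lat'] = parts['lat'].astype(float)
df['long'] = parts['long'].astype(float)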
I have a file called sampleweather100 which has latitudes and longitudes of addresses. If I manually type these lats and longs into the location_list argument, I get the output I desire. However, I want to write a function that pulls the output for all rows of my csv without me manually entering them:
import pandas as pd
my_cities = pd.read_csv('sampleweather100.csv')
from wwo_hist import retrieve_hist_data
#lat = -31.967819
#lng = 115.87718
#location_list = ["-31.967819,115.87718"]
frequency=24
start_date = '11-JAN-2018'
end_date = '11-JAN-2019'
api_key = 'MyKey'
location_list = ["('sampleweather100.csv')['Lat'],('sampleweather100.csv')['Long']"]
hist_weather_data = retrieve_hist_data(api_key,
                                       location_list,
                                       start_date,
                                       end_date,
                                       frequency,
                                       location_label = False,
                                       export_csv = True,
                                       store_df = True)
My attempt location_list = ["('sampleweather100.csv')['Lat'],('sampleweather100.csv')['Long']"] does not work. Is there a better way, or a for loop, that will fetch each row's lat and long into that location_list argument?
Reprex of dataset:
my_cities
Out[89]:
City Lat Long
0 Lancaster 39.754545 -82.636371
1 Canton 40.851178 -81.470345
2 Edison 40.539561 -74.336307
3 East Walpole 42.160667 -71.213680
4 Dayton 39.270486 -119.577078
5 Fort Wainwright 64.825343 -147.673877
6 Crystal 45.056106 -93.350020
7 Medford 42.338916 -122.839771
8 Spring Valley 41.103816 -74.045399
9 Hillsdale 41.000879 -74.026089
10 Newyork 40.808582 -73.951553
Your way of building the list just does not make sense. You are using the filename of the csv, which is just a string and holds no reference to the file itself or the dataframe you have created from it.
Since you built a dataframe called my_cities from your csv using pandas, you need to extract your list of pairs from the dataframe my_cities:
location_list = [','.join([str(lat), str(lon)]) for lat, lon in zip(my_cities['Lat'], my_cities['Long'])]
This is the list you get with the above line using your sample dataframe:
['39.754545,-82.636371', '40.851178000000004,-81.470345',
'40.539561,-74.33630699999999', '42.160667,-71.21368000000001',
'39.270486,-119.577078', '64.825343,-147.673877', '45.056106,-93.35002',
'42.338916,-122.839771', '41.103815999999995,-74.045399',
'41.000879,-74.026089', '40.808582,-73.951553']
You could use one of these to convert the dataframe into a list of comma-separated pairs:
location_list = [
    '{},{}'.format(row['Lat'], row['Long'])
    for i, row in my_cities.iterrows()
]
or
location_list = [
    '{},{}'.format(lat, lon)
    for lat, lon in my_cities[['Lat', 'Long']].values
]
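If the extra floating-point digits visible in the first answer's list (e.g. 40.851178000000004) are a problem for the weather API, one option (just a sketch) is to round while formatting:
location_list = [
    '{:.6f},{:.6f}'.format(lat, lon)
    for lat, lon in my_cities[['Lat', 'Long']].values
]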
I am trying to create a program that will take the most recent 30 CSV files of data within a folder and calculate totals of certain columns. There are 4 columns of data, with the first column being the identifier and the rest being the data related to the identifier. Here's an example:
file1
Asset X Y Z
12345 250 100 150
23456 225 150 200
34567 300 175 225
file2
Asset X Y Z
12345 270 130 100
23456 235 190 270
34567 390 115 265
I want to be able to match the asset # in both CSVs to return each column's value and then perform calculations on each column. Once I have completed those calculations I intend on graphing various data as well. So far the only thing I have been able to complete is extracting ALL the data from the CSV files using the following code:
import glob
import pandas as pd

csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv')
listData = []
for files in csvfile:
    df = pd.read_csv(files, index_col=0)
    listData.append(df)

concatenated_data = pd.concat(listData, sort=False)
group = concatenated_data.groupby('ASSET')[['Slip Expense ($)', 'Net Win ($)']].sum()
group.to_csv("C:\\Users\\tdjones\\Desktop\\Python Work Files\\Test\\NewFDRConcat.csv", header=('Slip Expense', 'Net Win'))
I am very new to Python so any and all direction is welcome. Thank you!
I'd probably also set the asset number as the index while you're reading the data, since this can help with sifting through data. So
rd = pd.read_csv(files, index_col=0)
Then you can do as Alex Yu suggested and just pick all the data from a specific asset number out when you're done using
asset_data = rd.loc[asset_number, column_name]
You'll generally need to format the data in the DataFrame before you append it to the list if you only want specific inputs. Exactly how to do that naturally depends on what you want, i.e. what kind of calculations you perform.
If you want a function that just returns all the data for one specific asset, you could do something along the lines of
import glob
import pandas as pd

def get_asset(asset_number):
    csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
    asset_data = []
    for file in csvfile:
        # keep only the lines whose first field matches the asset number
        data = [line for line in open(file, 'r').read().splitlines()
                if line.split(',')[0] == str(asset_number)]
        for line in data:
            asset_data.append(line.split(','))
    return pd.DataFrame(asset_data, columns=['Asset', 'X', 'Y', 'Z'], dtype=float)
How well the above performs is going to depend on how large the dataset you're going through is. The above method needs to search through every line and perform several high-level functions on each line, so it could potentially be problematic if you have millions of lines of data in each file.
Also, the above assumes that all data elements are strings of numbers (so they can be cast to integers or floats). If that's not the case, leave the dtype argument out of the DataFrame definition, but keep in mind that everything returned will then be stored as a string.
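For example, calling the function with one of the sample asset numbers (assuming the files really are comma-separated with those headers) gathers that asset's rows from every file into one DataFrame:
asset_12345 = get_asset(12345)
print(asset_12345)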
I suppose that you need to add pandas.concat of your listData to your code.
So it will become:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
listData = []
for files in csvfile:
    rd = pd.read_csv(files)
    listData.append(rd)

concatenated_data = pd.concat(listData)
After that you can use aggregate functions on this concatenated_data DataFrame, such as concatenated_data['A'].max(), concatenated_data['A'].count(), groupbys, etc.
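For the Asset/X/Y/Z example files above, a minimal sketch of the per-asset totals (again assuming the files are comma-separated with those four headers) could look like this:
import glob
import pandas as pd

files = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
combined = pd.concat(pd.read_csv(f) for f in files)

# sum X, Y and Z across all files for each asset
totals = combined.groupby('Asset')[['X', 'Y', 'Z']].sum()
print(totals)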
I have a log file which I need to plot in Python as a multi-line plot, with a line for each unique data point. The problem is that in some samples some points are missing and new points appear in others, as shown in the example below, where each line denotes a sample of n points and n is variable:
2015-06-20 16:42:48,135 current stats=[ ('keypassed', 13), ('toy', 2), ('ball', 2),('mouse', 1) ...]
2015-06-21 16:42:48,135 current stats=[ ('keypassed', 20, ('toy', 5), ('ball', 7), ('cod', 1), ('fish', 1) ... ]
In the above, 'mouse' is present in the first sample but absent in the second, while new data points like 'cod' and 'fish' are added in each sample.
So how can this be done in Python in the quickest and cleanest way? Are there any existing Python utilities which can help plot this timed log file? Also, being a log file, the samples number in the thousands, so the visualization should be able to display them properly.
I am also interested in applying multivariate hexagonal binning to this, with a different-colored hexagon for each unique column ("ball", "mouse", etc.). scikit offers hexagonal binning but I can't figure out how to render different colors for each hexagon based on the unique data point. Any other visualization technique would also help.
Getting the data into pandas:
import pandas as pd

records = []
with open(logfilepath) as f:
    for line in f:
        timestamp = line.split(',')[0]
        # the data part of each line can be evaluated directly as a Python list
        data = eval(line.split('=')[1])
        # convert the input data from wide format to long format
        for name, value in data:
            records.append({'timestamp': timestamp, 'name': name, 'value': value})

df = pd.DataFrame(records, columns=['timestamp', 'name', 'value'])

# convert from long format back to wide format, and fill null values with 0
df2 = df.pivot_table(index = 'timestamp', columns = 'name')
df2 = df2.fillna(0)
df2
Out[142]:
                     value
name                  ball  cod  fish  keypassed  mouse  toy
timestamp
2015-06-20 16:42:48      2    0     0         13      1    2
2015-06-21 16:42:48      7    1     1         20      0    5
Plot the data:
import matplotlib.pylab as plt
df2.value.plot()
plt.show()
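Since a real log file has thousands of samples, plotting every point may be unreadable. One option (a sketch, assuming the timestamps parse cleanly) is to convert the index to datetimes and downsample before plotting:
# aggregate to daily means so thousands of samples stay readable
df2.index = pd.to_datetime(df2.index)
df2.value.resample('D').mean().plot()
plt.show()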