I have the following data:
Example:
DRIVER_ID;TIMESTAMP;POSITION
156;2014-02-01 00:00:00.739166+01;POINT(41.8836718276551 12.4877775603346)
I want to create a pandas dataframe with 4 columns that are the id, time, longitude, latitude.
So far, I got:
cur_cab = pd.DataFrame.from_csv(
path,
sep=";",
header=None,
parse_dates=[1]).reset_index()
cur_cab.columns = ['cab_id', 'datetime', 'point']
path specifies the .txt file containing the data.
I already wrote a function that returns the longitude and latitude values from the point-formatted string.
How do I expand the data frame with additional columns holding the split values?
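After loading, if you're using a recent version of pandas, you can use the vectorised str methods to parse the column. Assign the result to a list of two column names; unpacking it into df['pos_x'], df['pos_y'] would iterate over the column labels of the split result and store 0 and 1 instead of the values:
In [87]:
df[['pos_x', 'pos_y']] = df['point'].str[6:-1].str.split(expand=True)
df
Out[87]:
cab_id datetime \
0 156 2014-01-31 23:00:00.739166
point pos_x pos_y
0 POINT(41.8836718276551 12.4877775603346) 41.8836718276551 12.4877775603346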
Also, you should stop using from_csv: it is deprecated (and was removed in pandas 1.0). Use the top-level read_csv instead, so your loading code would be:
cur_cab = pd.read_csv(
path,
sep=";",
header=None,
parse_dates=[1],
names=['cab_id', 'datetime', 'point'],
skiprows=1)
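As a self-contained sketch of the load-and-split step (the inline sample and the column names lat and lon are my own assumptions, and date parsing is left out to keep it short): str.extract pulls both numbers out of the POINT(...) text in one pass, and astype(float) makes them numeric rather than strings. The sample values look like Rome, so I'm assuming the first number is the latitude; swap the group names if your data is the other way round.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the .txt file from the question.
sample = StringIO(
    "DRIVER_ID;TIMESTAMP;POSITION\n"
    "156;2014-02-01 00:00:00.739166+01;POINT(41.8836718276551 12.4877775603346)\n"
)

cur_cab = pd.read_csv(
    sample,
    sep=";",
    header=None,
    names=["cab_id", "datetime", "point"],
    skiprows=1,  # skip the original header line
)

# Named groups become column names; assumed order is (lat, lon).
coords = (
    cur_cab["point"]
    .str.extract(r"POINT\((?P<lat>[-\d.]+) (?P<lon>[-\d.]+)\)")
    .astype(float)
)
cur_cab = cur_cab.join(coords)
print(cur_cab[["cab_id", "lat", "lon"]])
```

If you already have your own parsing function, the same join works on whatever two-column frame it returns.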
I am new to Python. Can I please ask for some help from the experts here?
I wish to construct a dataframe from the https://api.cryptowat.ch/markets/summaries JSON response,
based on the following filter criteria:
Kraken-listed currency pairs (please note that there are also kraken-futures pairs; I don't want those)
Currency pairs quoted in USD only, i.e. aaveusd, adausd, ...
The ideal dataframe I am looking for is shown in the screenshot below (somehow Excel loads this JSON perfectly).
[Screenshot: Dataframe_Excel_Screenshot]
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
kraken_assets = resp.json()
df = pd.json_normalize(kraken_assets)
print(df)
Output:
result.binance-us:aaveusd.price.last result.binance-us:aaveusd.price.high ...
0 264.48 267.32 ...
[1 rows x 62688 columns]
When I paste the link into a browser, the JSON response uses double quotes ("), but when I fetch it with Python code, all double quotes (") are changed to single quotes ('). Any idea why? I tried to solve it with json_normalize, but then the response becomes [1 rows x 62688 columns]. I am not sure how to even begin working with 1 row and 62k columns, and I don't know how to extract the exact information into the dataframe format I need (please see the Excel screenshot).
Any help is much appreciated. Thank you!
the result JSON is a dict
load this into a dataframe
decode columns into products & measures
filter to required data
import requests
import pandas as pd
import numpy as np
# load results into a data frame
df = pd.json_normalize(requests.get("https://api.cryptowat.ch/markets/summaries").json()["result"])
# columns are encoded as product and measure. decode columns and transpose into rows that include product and measure
cols = np.array([c.split(".", 1) for c in df.columns]).T
df.columns = pd.MultiIndex.from_arrays(cols, names=["product","measure"])
df = df.T
# finally filter down to required data and structure measures as columns
df.loc[df.index.get_level_values("product").str[:7]=="kraken:"].unstack("measure").droplevel(0,1)
sample output
product         price.last  price.high  price.low  price.change.percentage  price.change.absolute  volume       volumeQuote
kraken:aaveaud  347.41      347.41      338.14     0.0274147                9.27                   1.77707      613.281
kraken:aavebtc  0.008154    0.008289    0.007874   0.0219326                0.000175               403.506      3.2797
kraken:aaveeth  0.1327      0.1346      0.1327     -0.00673653              -0.0009                287.113      38.3549
kraken:aaveeur  219.87      226.46      209.07     0.0331751                7.06                   1202.65      259205
kraken:aavegbp  191.55      191.55      179.43     0.030559                 5.68                   6.74476      1238.35
kraken:aaveusd  259.53      267.48      246.64     0.0339841                8.53                   3623.66      929624
kraken:adaaud   1.61792     1.64602     1.563      0.0211692                0.03354                5183.61      8366.21
kraken:adabtc   3.757e-05   3.776e-05   3.673e-05  0.0110334                4.1e-07                252403       9.41614
kraken:adaeth   0.0006108   0.00063     0.0006069  -0.0175326               -1.09e-05              590839       367.706
kraken:adaeur   1.01188     1.03087     0.977345   0.0209986                0.020811               1.99104e+06  1.98693e+06
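The filter above keeps every Kraken pair, while the question asked for USD pairs only. Here is a minimal offline sketch of that extra step. The result dict is a hand-made stand-in for the "result" part of the response: its nesting is inferred from the flattened column names, the numbers are illustrative, and the kraken-futures market name is hypothetical. Matching on the prefix "kraken:" (with the colon) leaves out "kraken-futures:", and the "usd" suffix keeps the USD-quoted pairs.

```python
import pandas as pd

# Stand-in for the "result" dict of the API response (assumed shape,
# illustrative numbers, hypothetical futures market name).
result = {
    "binance-us:aaveusd": {"price": {"last": 264.48, "high": 267.32}, "volume": 1.0},
    "kraken:aaveusd": {"price": {"last": 259.53, "high": 267.48}, "volume": 3623.66},
    "kraken:aaveeur": {"price": {"last": 219.87, "high": 226.46}, "volume": 1202.65},
    "kraken:adausd": {"price": {"last": 1.0, "high": 1.1}, "volume": 10.0},
    "kraken-futures:pi_xbtusd": {"price": {"last": 2.0, "high": 2.2}, "volume": 20.0},
}

# One row per market: flatten each nested summary, then label rows by market.
markets = pd.json_normalize(list(result.values()))
markets.index = pd.Index(list(result), name="product")

# Keep spot Kraken markets whose name ends in "usd".
is_kraken_usd = (
    markets.index.str.startswith("kraken:") & markets.index.str.endswith("usd")
)
kraken_usd = markets[is_kraken_usd]
print(kraken_usd)
```

A suffix test is crude (it would also keep any pair whose name merely ends in "usd"), so check the result against the pairs you expect.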
Hello. Try the code below. I have looked at the structure of the dataset and reshaped it to get the desired output.
import requests
import pandas as pd
resp = requests.get("https://api.cryptowat.ch/markets/summaries")
a = resp.json()
# create a DataFrame from the 'result' key
da = pd.DataFrame(a['result'])
# transpose to get the required columns and index
da = da.transpose()
# the price column contains a dict whose entries need to become separate columns of the data frame
db = da['price'].to_dict()
da.drop('price', axis=1, inplace=True)
# initialise a separate DataFrame for price
z = pd.DataFrame({})
for key in db.keys():
    row = pd.DataFrame(db[key], index=[key])
    z = pd.concat([z, row], axis=0)
da = pd.concat([z, da], axis=1)
da.to_excel('nex.xlsx')
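On the side question about the quotes: resp.json() has already parsed the JSON text into ordinary Python dicts and lists, and printing a dict shows Python's own repr, which uses single quotes. Nothing in the data has changed. A small sketch with a made-up payload (not the real API response) shows the difference; json.dumps turns the object back into JSON text with double quotes.

```python
import json

# Made-up JSON text standing in for the body of the HTTP response.
text = '{"result": {"kraken:aaveusd": 259.53}}'

data = json.loads(text)        # roughly what resp.json() does with the body
as_python = repr(data)         # what print(data) shows: single quotes
as_json = json.dumps(data)     # serialised back to JSON: double quotes

print(as_python)
print(as_json)
```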
I have a text file that contains 2 columns of data, separated by a variable number of whitespace characters. I want to load it into a pandas DataFrame.
Example:
306.000000 1.125783
307.000000 0.008101
308.000000 -0.005917
309.000000 0.003784
310.000000 -0.516513
Please note that each line also starts with whitespace.
My desired output would be like:
output = {'Wavelength': [306.000000, 307.000000, 308.000000, 309.000000, 310.000000],
'Reflectance': [1.125783, 0.008101, -0.005917, 0.003784, -0.516513]}
df = pd.DataFrame(data=output)
Use read_csv:
df = pd.read_csv('file.txt', sep='\\s+', names=['Wavelength', 'Reflectance'], header=None)
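A quick way to check this without a file is to feed the sample lines through StringIO (the variable name raw and the exact amount of padding are mine). In this sketch sep=r'\s+' treats any run of whitespace as a single delimiter, and the whitespace at the start of each line does not produce an extra empty column.

```python
from io import StringIO

import pandas as pd

# Sample lines with leading whitespace and uneven gaps between the columns.
raw = (
    "  306.000000    1.125783\n"
    " 307.000000  0.008101\n"
    "   308.000000     -0.005917\n"
    " 309.000000 0.003784\n"
    "  310.000000   -0.516513\n"
)

df = pd.read_csv(
    StringIO(raw),
    sep=r"\s+",                            # one or more whitespace characters
    names=["Wavelength", "Reflectance"],   # the file has no header row
)
print(df)
```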
I have a Dataframe with the following date field:
463 14-05-2019
535 03-05-2019
570 11-05-2019
577 09-05-2019
628 08-08-2019
630 25-05-2019
Name: Date, dtype: object
I have to format it as DDMMAAAA (day, month, four-digit year). This is what I'm doing inside a loop (for idx, row in df.iterrows():):
I'm removing the - character using a regex:
df.at[idx, 'Date'] = re.sub(r'-', '', df.at[idx, 'Date'])
then using apply to enforce an 8-digit string with leading zeros:
df['Date'] = df['Date'].apply(lambda x: '{0:0>8}'.format(x))
But even though the df['Date'] field has the 8 digits with the leading 0 in the DataFrame, when exporting it to CSV the leading zeros are removed in the exported file, as shown below.
df.to_csv(path_or_buf=report, header=True, index=False, sep=';')
field as in csv:
Dt_DDMMAAAA
30102019
12052019
7052019
26042019
3052019
22042019
25042019
2062019
I know I must be missing the point somewhere along the way here, but I just can't figure out what the issue is (or whether it's even an issue, rather than a misused method).
IMO the simplest method is to use the date_format argument when writing to CSV. This means you will need to convert the "Date" column to datetime beforehand using pd.to_datetime. Because the strings are day-first, pass an explicit format (or dayfirst=True); otherwise an ambiguous value such as 03-05-2019 can be read as month-first.
(df.assign(Date=pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce'))
.to_csv(path_or_buf=report, date_format='%d%m%Y', index=False))
This writes,
Date
14052019
03052019
11052019
09052019
08082019
25052019
More information on arguments to to_csv can be found in Writing a pandas DataFrame to CSV file.
What I would do is use strftime plus to_excel. A CSV file keeps no formatting information: if you open it in a text editor it shows the leading zero, but a spreadsheet drops the zero when it displays the value as a number. In that case you can write to Excel instead:
pd.to_datetime(df.Date,dayfirst=True).dt.strftime('%d%m%Y').to_excel('your.xlsx')
The strftime step on its own gives:
Out[722]:
463 14052019
535 03052019
570 11052019
577 09052019
628 08082019
630 25052019
Name: Date, dtype: object
Firstly, your method is producing a file which contains leading zeros just as you expect. I reconstructed this minimal working example from your description and it works just fine:
import pandas
import re
df = pandas.DataFrame([["14-05-2019"],
["03-05-2019"],
["11-05-2019"],
["09-05-2019"],
["08-08-2019"],
["25-05-2019"]], columns=['Date'])
for idx in df.index:
    df.at[idx, 'Date'] = re.sub(r'-', '', df.at[idx, 'Date'])
df['Date'] = df['Date'].apply(lambda x: '{0:0>8}'.format(x))
df.to_csv(path_or_buf="report.csv", header=True, index=False, sep=';')
At this point report.csv contains this (with leading zeros just as you wanted).
Date
14052019
03052019
11052019
09052019
08082019
25052019
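Now, as to why you thought it wasn't working: whatever reads the file back is probably converting the field to a number, which drops the leading zeros. If you are working mainly in pandas, you can stop it from guessing the type of the column by specifying a dtype in read_csv: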
df_readback = pandas.read_csv('report.csv', dtype={'Date': str})
Date
0 14052019
1 03052019
2 11052019
3 09052019
4 08082019
5 25052019
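To see both effects side by side without touching the disk, here is a small round-trip sketch (the three sample dates and the variable names are mine). It also swaps the row-by-row re.sub loop for the vectorised str.replace, which is an alternative I'm suggesting, not something from the question. The CSV text itself keeps the zeros; they only disappear when the reader infers an integer column.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"Date": ["14-05-2019", "03-05-2019", "08-08-2019"]})

# Vectorised replacement for the per-row re.sub loop.
df["Date"] = df["Date"].str.replace("-", "", regex=False)

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False, sep=";")

# Default type inference turns the column into integers (zeros lost).
guessed = pd.read_csv(StringIO(csv_text), sep=";")

# Forcing a string dtype keeps the text exactly as written.
as_text = pd.read_csv(StringIO(csv_text), sep=";", dtype={"Date": str})

print(csv_text)
print(guessed["Date"].tolist())
print(as_text["Date"].tolist())
```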
It might also be that you are reading this in Excel (I'm guessing this from the fact that you are using ; separators). Unfortunately there is no way to ensure that Excel reads this field correctly on double-click, but if this is your final target, you can see how to mangle your file for Excel to read correctly in this answer.
I am using the following code:
df = pd.read_csv('/Python Test/AcquirerRussell3000.csv')
I have the following type of data:
18.07.2000 27.1875 0 08.08.2000 25.3125 0.1 05.09.2000 \
0 19.07.00 26.6250 -0.020690 09.08.00 25.2344 -0.003085 06.09.00
1 20.07.00 26.6250 0.000000 10.08.00 25.1406 -0.003717 07.09.00
2 21.07.00 25.6875 -0.035211 11.08.00 25.5781 0.017402 08.09.00
3 24.07.00 26.2500 0.021898 14.08.00 25.4375 -0.005497 11.09.00
4 25.07.00 26.6875 0.016667 15.08.00 25.5625 0.004914 12.09.00
I am getting the following error:
Pythone Test/untitled0.py:1: DtypeWarning: Columns (long list of numbers) have mixed types.
Specify dtype option on import or set low_memory=False.
So every 3rd column is a date and the rest are numbers. I guess there is no single dtype, since dates are strings and the rest are floats or ints? I have about 5000 columns or more and around 400 rows.
I have seen similar questions to this but don't quite know how to apply the answers to my data. Furthermore, I want to run the following code afterwards to stack the data frame.
a = np.arange(len(df.columns))
df.columns = [a % 3, a // 3]
df = df.stack().reset_index(drop=True)
df.to_csv('AcquirerRussell3000stacked.csv', sep=',')
What dtype should I use? Or should I just set low_memory to false?
This, taken from here, solved my problem:
dashboard_df = pd.read_csv(p_file, sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
Could anyone explain this answer to me, though?
df = pd.read_csv('/Python Test/AcquirerRussell3000.csv', engine='python')
or
df = pd.read_csv('/Python Test/AcquirerRussell3000.csv', low_memory=False)
does the trick for me.
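Some background on why these work: with the default low_memory=True the C parser processes the file in chunks and infers types chunk by chunk, so one column can end up with mixed types; low_memory=False infers each column in one go, and passing a dtype skips the guessing altogether. Below is a minimal sketch of the dtype route on a cut-down version of the data. I'm assuming the file has no header row (the first row of the printed frame looks like data) and that every third column, starting with the first, is a date. Everything is read as text, then the non-date columns are converted to numbers explicitly.

```python
from io import StringIO

import pandas as pd

# Cut-down stand-in for the CSV: two (date, price, return) blocks, no header.
raw = (
    "18.07.2000,27.1875,0,08.08.2000,25.3125,0\n"
    "19.07.00,26.6250,-0.020690,09.08.00,25.2344,-0.003085\n"
    "20.07.00,26.6250,0.000000,10.08.00,25.1406,-0.003717\n"
)

# Read every column as text so pandas never has to guess a type.
df = pd.read_csv(StringIO(raw), header=None, dtype=str)

# Columns 0, 3, 6, ... hold dates; convert all the others to numbers.
for col in df.columns:
    if col % 3 != 0:
        df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

The date columns are left as strings here because the sample mixes four-digit and two-digit years, which would need its own handling.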
I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from io import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame
obs = pd.read_csv(StringIO(x.strip()), sep=r",\s*", engine="python")
print(obs)
which produces
longitude latitude
0 degrees_east degrees_north
1 -142.842 -1.82
2 -25.389 39.87
3 -37.704 27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?
Allowing pandas to read the second line as data is screwing up the dtype of the columns. Instead of a float dtype, the presence of strings makes the dtype of the columns object, and the underlying objects, even the numbers, are strings. This screws up all numerical operations:
In [8]: obs['latitude']+obs['longitude']
Out[8]:
0 degrees_northdegrees_east
1 -1.82-142.842
2 39.87-25.389
3 27.114-37.704
In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv skip the second line.
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from io import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
content = StringIO(x.strip())
def read_csv(content):
    # first line: column names; second line: units
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    # the remaining lines are purely numeric data
    obs = pd.read_table(content, sep=r",\s*", header=None, engine="python")
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs
obs = read_csv(content)
print(obs)
# longitude (degrees_east) latitude (degrees_north)
# 0 -142.842 -1.820
# 1 -25.389 39.870
# 2 -37.704 27.114
print(obs.dtypes)
# longitude (degrees_east) float64
# latitude (degrees_north) float64
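Another option, offered as a sketch and not as the one right way: let read_csv treat both lines as a two-level header with header=[0, 1], then peel the unit level off into a plain dict (the name units is mine). The data columns stay numeric, the column names stay short, and the units are still available later for things like axis labels.

```python
from io import StringIO

import pandas as pd

raw = """longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
"""

# Rows 0 and 1 together form the header: (column name, unit).
obs = pd.read_csv(StringIO(raw), header=[0, 1])

# Each column label is a (name, unit) tuple, so dict() gives name -> unit.
units = dict(obs.columns.tolist())

# Flatten the columns back to the plain names.
obs.columns = obs.columns.get_level_values(0)

print(units)
print(obs.dtypes)

# Example of using the units later, e.g. for an axis label.
label = "{} ({})".format("latitude", units["latitude"])
```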