I am new to Python and pandas and am attempting to make aggregate plots of orientation data across response time for my research. The approach I am attempting requires grouping and splitting the data by trial; however, I have no index variable to group by in the raw data file. For some context, I am working with about 300 .csv files with 10-15k rows in each. Here is a snippet of the format of the raw .csv files:
25.10 3.1 7.8 173.6 0.695646 -0.046507 0.716452 -0.024699 -0.014172 -0.712739 -0.086428 0.695940 88.4 1.8 -174.3
25.25 3.1 7.6 173.6 0.696440 -0.045587 0.715711 -0.025514 -0.013402 -0.712050 -0.085468 0.696778 88.5 1.7 -174.3
25.40 2.9 7.6 173.6 0.697160 -0.045407 0.715048 -0.024725 -0.014230 -0.711399 -0.085251 0.697454 88.6 1.7 -174.3
25.55 3.2 7.8 173.6 0.695360 -0.046466 0.716729 -0.024797 -0.014058 -0.713018 -0.086403 0.695660 88.4 1.8 -174.2
Response: S
END TRIAL 1
BEGIN TRIAL 2
0.05 126.4 126.4 0.0 -0.322306 -0.712941 -0.535465 -0.317978 -0.322306 -0.712941 -0.535465 -0.317978 105.7 -34.0 74.7
0.20 129.1 129.1 0.0 -0.311974 -0.711464 -0.555195 -0.297070 -0.311974 -0.711464 -0.555195 -0.297070 105.1 -37.2 76.3
As you can see, there are no headers and no variable that identifies the trial; the trials are separated only by three marker rows with a different structure.
I have managed to get column headers and extract the relevant variables with pandas:
df = pd.read_csv('data.csv', delim_whitespace=True,
                 names=['time', 'S1', 'S2', 'Enc', 'q1a', 'q1b', 'q1c', 'q1d',
                        'q2a', 'q2b', 'q2c', 'q2d', 'yaw', 'pitch', 'roll'],
                 header=0, usecols=['time', 'S1', 'S2'])
Which outputs this data structure:
time S1 S2
0 0.25 277.5 277.5
1 0.25 277.5 277.5
2 0.40 277.5 277.5
3 0.55 277.5 277.5
4 0.70 277.5 277.5
5 0.85 277.5 277.5
.........................
784 117.70 161.2 96.9
785 END TRIAL 1.0
786 BEGIN TRIAL 2.0
787 0.10 159.9 159.9
Closer, but now I am stuck on how to parse the individual trials into separate columns or data structures because there is no grouping list that signifies trial number and the number of rows varies for each trial. After reading through the Pandas documentation for the last few days and scouring similar questions raised here, I am having trouble finding the best solution.
Eventually, I'd like to have a single data structure to make some exploratory visualizations. Am I heading in the right direction, or is there a smarter approach to this?
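One possible direction, just a sketch that reuses the file name and column list from the read_csv call above and assumes that any rows before the first BEGIN marker belong to trial 1: read everything as strings so the marker rows survive, turn the BEGIN rows into a running trial counter with cumsum, then drop the marker rows and convert back to numbers.

import pandas as pd

cols = ['time', 'S1', 'S2', 'Enc', 'q1a', 'q1b', 'q1c', 'q1d',
        'q2a', 'q2b', 'q2c', 'q2d', 'yaw', 'pitch', 'roll']

# read everything as strings so the 'Response:'/'END'/'BEGIN' marker rows survive parsing
df = pd.read_csv('data.csv', delim_whitespace=True, header=None,
                 names=cols, usecols=['time', 'S1', 'S2'], dtype=str)

# every 'BEGIN TRIAL n' row starts a new trial; cumsum turns those markers into
# a running trial number (rows before the first BEGIN are counted as trial 1)
df['trial'] = df['time'].eq('BEGIN').cumsum() + 1

# drop the marker rows and convert the remaining values back to numbers
df = df[~df['time'].isin(['Response:', 'END', 'BEGIN'])].copy()
df[['time', 'S1', 'S2']] = df[['time', 'S1', 'S2']].apply(pd.to_numeric)

# each trial can now be handled separately, e.g. via df.groupby('trial')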
Related
I have a problem I've been struggling with for a few days, and I can't find my way out!
I have a folder with many CSVs, each containing two columns: "date" (YYYY-MM-DD) and "value" (a float). The dates are usually a range of consecutive days (but some days might be missing).
Now each of these CSVs starts from a different date.
I need to merge them into a single pandas DataFrame with "date" as the index and several columns like "csv1_value", "csv2_value", "csv3_value", etc. I've done it with the 'merge' command on 'date', which gives me a DataFrame containing only the rows where the same "date" was found across all the CSVs.
This was useful because some dates in the range might be missing from a file, and in that case I needed that date to be deleted from the DataFrame even if it's present in the other files.
BUT I actually need the start of the range in the DataFrame to be the oldest date I have, and if that date is missing in the other files (because they start later), the value for that file should be 0.
AND any date that is missing from one file's range should be filled with the latest value (useful to keep 0.00 for any file starting later until there's actually some value).
It's a bit complex, so I'll try an example:
csv1:
"2020-01-01","1.01"
"2020-01-02","2.01"
"2020-01-03","3.01"
"2020-01-04","4.01"
"2020-01-05","5.01"
"2020-01-06","6.01"
"2020-01-07","7.01"
"2020-01-08","8.01"
"2020-01-09","9.01"
"2020-01-10","10.01"
csv2:
"2020-01-04","4.02"
"2020-01-05","5.02"
"2020-01-06","6.02"
"2020-01-08","8.02"
"2020-01-09","9.02"
"2020-01-10","10.02"
csv3:
"2020-01-03","3.03"
"2020-01-04","4.03"
"2020-01-05","5.03"
"2020-01-06","6.03"
"2020-01-07","7.03"
"2020-01-09","9.03"
"2020-01-10","10.03"
The resulting DataFrame should be:
"2020-01-01","1.01","0.00","0.00"
"2020-01-02","2.01","0.00","0.00"
"2020-01-03","3.01","0.00","3.03"
"2020-01-04","4.01","4.02","4.03"
"2020-01-05","5.01","5.02","5.03"
"2020-01-06","6.01","6.02","6.03"
"2020-01-07","7.01","6.02","7.03"
"2020-01-08","8.01","8.02","7.03"
"2020-01-09","9.01","9.02","9.03"
"2020-01-10","10.01","10.02","10.03"
Does anyone have an idea how I could achieve all this? My head is exploding...
You can do this using two outer joins, then fill NA with zeros:
import pandas as pd

# the sample files have no header row, so give the columns names explicitly
df1 = pd.read_csv('csv1', header=None, names=['date', 'csv1_value'])
df2 = pd.read_csv('csv2', header=None, names=['date', 'csv2_value'])
df3 = pd.read_csv('csv3', header=None, names=['date', 'csv3_value'])

DF = pd.merge(df1, df2, how='outer', on='date')
DF = pd.merge(DF, df3, how='outer', on='date')
DF.fillna(0, inplace=True)
My solution is designed to cope with an arbitrary number of input files (not only 3, as in the other solution).
Start by reading your input files, creating a list of DataFrames with proper names for the second column:
import glob
import pandas as pd

frames = []
for i, fn in enumerate(glob.glob('Input*.csv'), start=1):
    frames.append(pd.read_csv(fn, parse_dates=[0], names=['Date', f'csv{i}_value']))
Then join them into a single DataFrame:
df = frames.pop(0)
while len(frames) > 0:
    df2 = frames.pop(0)
    df = df.join(df2.set_index('Date'), on='Date')
For now, from your sample files, you have:
Date csv1_value csv2_value csv3_value
0 2020-01-01 1.01 NaN NaN
1 2020-01-02 2.01 NaN NaN
2 2020-01-03 3.01 NaN 3.03
3 2020-01-04 4.01 4.02 4.03
4 2020-01-05 5.01 5.02 5.03
5 2020-01-06 6.01 6.02 6.03
6 2020-01-07 7.01 NaN 7.03
7 2020-01-08 8.01 8.02 NaN
8 2020-01-09 9.01 9.02 9.03
9 2020-01-10 10.01 10.02 10.03
And to get the result, run:
df = df.ffill().fillna(0.0)
The result is:
Date csv1_value csv2_value csv3_value
0 2020-01-01 1.01 0.00 0.00
1 2020-01-02 2.01 0.00 0.00
2 2020-01-03 3.01 0.00 3.03
3 2020-01-04 4.01 4.02 4.03
4 2020-01-05 5.01 5.02 5.03
5 2020-01-06 6.01 6.02 6.03
6 2020-01-07 7.01 6.02 7.03
7 2020-01-08 8.01 8.02 7.03
8 2020-01-09 9.01 9.02 9.03
9 2020-01-10 10.01 10.02 10.03
How to find possible errors
One of the things to check is whether the program finds the expected CSV files.
To check it, run:
for i, fn in enumerate(glob.glob('Input*.csv'), start=1):
    print(i, fn)
and you should get a list of files found.
Another detail to check is whether your files have names starting with Input and a .csv extension. Maybe you should change Input*.csv to some other pattern?
Also try running my code in parts, i.e.:
first the loop creating the list of DataFrames,
then check the size of this list, print some of the DataFrames
and invoke info() on them (make test printouts, as sketched below),
and after that run the second part of my code (the while loop).
If some error occurs, state in which instruction it occurred.
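For example, a few test printouts after the reading loop (assuming the frames list from above is non-empty) might look like:

print(len(frames))        # how many DataFrames were created
print(frames[0].head())   # a quick look at the first one
frames[0].info()          # column names, dtypes and row counts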
I have a script where I do munging with dataframes and extract data like the following:
times = pd.Series(df.loc[df['sy_x'].str.contains('AA'), ('t_diff')].quantile([.1, .25, .5, .75, .9]))
I want to add the resulting data from quantile() to a DataFrame with separate columns for each of those quantiles; let's say the columns are:
ID pt_1 pt_2 pt_5 pt_7 pt_9
AA
BB
CC
How might I add the quantiles to each row of ID?
new_df = None
for index, value in times.items():
    for col in df[['pt_1', 'pt_2', 'pt_5', 'pt_7', 'pt_9']]:
        ...
...but that feels wrong and not idiomatic. Should I be using loc or iloc? I have a couple more Series that I'll need to add to other columns not shown, but I think I can figure that out once I know how to do this one.
EDIT:
Some of the output of times looks like:
0.1 -0.5
0.25 -0.3
0.5 0.0
0.75 2.0
0.90 4.0
Thanks in advance for any insight
IIUC, you want a groupby():
import numpy as np
import pandas as pd

# toy data
np.random.seed(1)
df = pd.DataFrame({'sy_x': np.random.choice(['AA', 'BB', 'CC'], 100),
                   't_diff': np.random.randint(0, 100, 100)})

df.groupby('sy_x').t_diff.quantile((0.1, .25, .5, .75, .9)).unstack(1)
Output:
0.10 0.25 0.50 0.75 0.90
sy_x
AA 16.5 22.25 57.0 77.00 94.5
BB 9.1 21.00 58.5 80.25 91.3
CC 9.7 23.25 40.5 65.75 84.1
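If you also want the exact column layout from the question (an ID index with pt_1 ... pt_9 columns), a possible follow-up on the same toy data is to rename the quantile columns (the pt_* names are just taken from the question):

quantiles = (df.groupby('sy_x').t_diff
               .quantile((0.1, .25, .5, .75, .9))
               .unstack(1)
               .rename(columns={0.1: 'pt_1', 0.25: 'pt_2', 0.5: 'pt_5',
                                0.75: 'pt_7', 0.9: 'pt_9'})
               .rename_axis('ID'))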
Try something like:
pd.DataFrame(times.values.T, index=times.keys())
I have a dataframe which looks like this:
1 2 3 4 Density
Mineral
Quartz 13.4 23.0 23.4 28.3 2.648
Plagioclase 5.2 8.2 8.5 11.7 2.620
K-feldspar 2.3 2.4 2.6 3.1 2.750
What I need to do is calculate new rows based on values from the existing rows:
DESIRED OUTPUT
1 2 3 4 Density
Mineral
Quartz 13.4 23.0 23.4 28.3 2.648
Plagioclase 5.2 8.2 8.5 11.7 2.620
K-feldspar 2.3 2.4 2.6 3.1 2.750
Quartz_v 5.06 8.69 8.84 10.69 2.648
Plagioclase_v ...
So basically I need to do the following steps:
1) Define the new row, for example, Quartz_v
2) Then perform the following calculation: Quartz_v = each column value of Quartz divided by the Density value of Quartz
I have already loaded the data as two DataFrames, the density and mineral ones, and merged them, so that each mineral has the correct density in front of it.
Use DataFrame.div with axis=0 to perform the division, rename to rename the index, and append to concatenate the result to the original (you can also use pd.concat instead).
d = df['Density']
result = df.append(df.div(d, axis=0).assign(Density=d).rename(lambda x: x+'_v'))
result
1 2 3 4 Density
Mineral
Quartz 13.400000 23.000000 23.400000 28.300000 2.648
Plagioclase 5.200000 8.200000 8.500000 11.700000 2.620
K-feldspar 2.300000 2.400000 2.600000 3.100000 2.750
Quartz_v 5.060423 8.685801 8.836858 10.687311 2.648
Plagioclase_v 1.984733 3.129771 3.244275 4.465649 2.620
K-feldspar_v 0.836364 0.872727 0.945455 1.127273 2.750
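Note that DataFrame.append was removed in pandas 2.0, so on recent versions you would use pd.concat as mentioned above. A rough equivalent with the same df would be:

d = df['Density']
# stack the original rows and the divided, renamed rows into one DataFrame
result = pd.concat([df, df.div(d, axis=0).assign(Density=d).rename(lambda x: x + '_v')])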
I am merging one column from one DataFrame (df1) with another DataFrame (df2), where both have the same index. The result of this operation gives me a lot more rows than I started with (duplicates). Is there a way to avoid duplicates? Please see the example code below to replicate my issue.
df1 = pd.DataFrame([[1, 1.0, 2.3, 0.2, 0.53],
                    [2, 3.35, 2.0, 0.2, 0.65],
                    [2, 3.4, 2.0, 0.25, 0.55]],
                   columns=["Sample_ID", "NaX", "NaU", "OC", "EC"]).set_index('Sample_ID')

df2 = pd.DataFrame([[1, 0.2, 1.5, 82],
                    [2, 3.35, 2.4, 92],
                    [2, 3.4, 2.0, 0.25]],
                   columns=["Sample_ID", "OC", "Flow", "Diameter"]).set_index('Sample_ID')

df1 = pd.merge(df1, df2['Flow'].to_frame(), left_index=True, right_index=True)
My result (below) has two entries for sample "2" starting with 3.35 and then two entries for "2" starting with 3.40.
What I was expecting was just two entries for "2", one starting with 3.35 and the other line for "2" starting with 3.40. So the total number of rows should be only three, while I have a total of 5 rows of data now.
Can you please see what the reason for this is? Thanks for your help!
NaX NaU OC EC Flow
Sample_ID
1 1.00 2.3 0.20 0.53 1.5
2 3.35 2.0 0.20 0.65 2.4
2 3.35 2.0 0.20 0.65 2.0
2 3.40 2.0 0.25 0.55 2.4
2 3.40 2.0 0.25 0.55 2.0
What you want to do is concatenate as follows:
pd.concat([df1, df2['Flow'].to_frame()], axis=1)
...which returns your desired output. The axis=1 argument lets you "glue on" extra columns.
As to why your join is returning twice as many entries for Sample_ID = 2, you can read through the docs on joins. The relevant portion is:
In SQL / standard relational algebra, if a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated data.
I am using NYC trips data. I want to convert the lat-longs present in the data to the respective boroughs in NYC. In particular, I want to know whether an NYC airport (Laguardia/JFK) is present in one of those trips.
I know that the Google Maps API and even libraries like Geopy can do reverse geocoding. However, most of them give city- and country-level results.
I want to extract the borough or airport name (like Queens, Manhattan, JFK, Laguardia, etc.) from the lat-long. I have lat-longs for both pickup and dropoff locations.
Here is a sample dataset in pandas dataframe.
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
You can find the data here too:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
After a bit of research I found I can leverage the Google Maps API to get county- and even establishment-level data.
Here is the code I wrote:
A mapper function to get the geocode data from the Google API for the lat-long passed:
import requests

def reverse_geocode(latlng):
    result = {}
    url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
    request = url.format(latlng)
    data = requests.get(request).json()
    if len(data['results']) > 0:
        result = data['results'][0]
    return result
# Geo_code data for pickup-lat-long
trip_data_sample["est_pickup"] = [
    y["address_components"][0]["long_name"]
    for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)
]
trip_data_sample["locality_pickup"] = [
    y["address_components"][2]["long_name"]
    for y in map(reverse_geocode, trip_data_sample["lat_long_pickup"].values)
]
However, I initially had 1.4MM records. It was taking a lot of time to get this done, so I reduced it to 200K. Even that was taking a lot of time to run, so I then reduced it to 115K. Even that was taking too much time.
So now I have reduced it to 50K, but such a sample would hardly have a representative distribution of the whole data.
I was wondering if there is a better and faster way to get the reverse geocode of these lat-longs. I am not using Spark since I am running this on a local Mac, and Spark might not give much of a speed advantage on a single machine. Please advise.
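One thing that might help regardless of the geocoding service, sketched here with the reverse_geocode function and the Pickup_latitude/Pickup_longitude columns from the sample above (the rounding precision and the pickup_key helper column are arbitrary choices), is to geocode each distinct rounded coordinate only once and map the cached result back, since many trips share virtually the same pickup point:

# round to 3 decimals (roughly 100 m) so nearby points collapse into one lookup key
trip_data_sample['pickup_key'] = (
    trip_data_sample['Pickup_latitude'].round(3).astype(str) + ','
    + trip_data_sample['Pickup_longitude'].round(3).astype(str)
)

# one API call per unique rounded coordinate instead of one per trip
lookup = {key: reverse_geocode(key) for key in trip_data_sample['pickup_key'].unique()}

# map the cached results back onto every row
trip_data_sample['est_pickup'] = trip_data_sample['pickup_key'].map(
    lambda k: lookup[k].get('address_components', [{}])[0].get('long_name')
)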