I am generating a pivot table report using the pandas Python module. The source data includes a lot of readings measured in milliseconds. If the number of milliseconds exceeds 999 then the value in that CSV file will include commas (e.g. 1,234 = 1.234 seconds).
Here's how I'm trying to run the report:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
The problem is that the dtype of the Average Usage Time (ms) column is object, so the np.max aggregation just treats those values as NaN. I therefore never see any values greater than 999.
I tried fixing the issue like this:
import pandas as pd
import numpy as np
pool_usage = pd.read_csv("c:/foo/ds-dump.csv")
# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"
# Convert strings to numbers if possible
pool_usage = pool_usage.convert_objects(convert_numeric=True)
report = pool_usage.pivot_table(values=['Average Pool Size', 'Average Usage Time (ms)'], index=['Source File'], aggfunc=np.max)
print(report)
This did actually change the dtype of the Average Usage Time column to float, but all of the values greater than 999 are still treated as NaN.
How can I convert the Average Usage Time column to a float even though it's possible that some of the values may include commas?
The read_csv function takes an optional thousands argument. Its default is None so you can change it to "," to have it recognise 1,234 as 1234 when it reads the file:
pd.read_csv("c:/foo/ds-dump.csv", thousands=",")
The column holding the millisecond values should then have the int64 datatype once the file has been read into memory.
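Applied to your original report, only the read_csv call needs to change; the rest stays the same (a sketch, assuming the column names from your snippet):
import pandas as pd
import numpy as np

# thousands="," makes read_csv parse "1,234" as the number 1234
pool_usage = pd.read_csv("c:/foo/ds-dump.csv", thousands=",")

# Add a column to the end that shows you where the data came from
pool_usage["Source File"] = "ds-dump.csv"

report = pool_usage.pivot_table(
    values=["Average Pool Size", "Average Usage Time (ms)"],
    index=["Source File"],
    aggfunc=np.max,
)
print(report)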
Hello I'm learning Python and Pandas, and I'm working on an exercise. I'm loading in 2 csv files and merging them into one dataframe.
import pandas as pd
# File to Load (Remember to Change These)
school_data_to_load = "Resources/schools_complete.csv"
student_data_to_load = "Resources/students_complete.csv"
# Read School and Student Data File and store into Pandas DataFrames
school_data_df = pd.read_csv(school_data_to_load)
student_data_df = pd.read_csv(student_data_to_load)
# Combine the data into a single dataset.
school_data_complete_df = pd.merge(student_data_df, school_data_df, how="left", on=["school_name", "school_name"])
school_data_complete_df.head()
The output looks like the picture above.
I'm trying to:
Calculate the percentage of students with a passing math score (70 or greater)
Calculate the percentage of students with a passing reading score (70 or greater)
Calculate the percentage of students who passed math and reading (% Overall Passing)
I'm looking to populate a new dataframe containing only the students who got 70 or greater on both their math and reading scores, using the loc command.
I got this error. I don't understand it, because the values in the columns should all be integers, so why does it say I'm passing strings in as well?
You are not comparing the values in the column. You are just comparing "math_score" >= 70. There's a string on the left, and an integer on the right, hence your problem.
Fix the placement of your square brackets, and you should be good to go:
passing_maths_total = school_data_complete_df.loc[school_data_complete_df["math_score"] >= 70]
Pandas broadcasts the result of the >= comparison, so comparing the Pandas Series school_data_complete_df["math_score"] with 70 results in a boolean Pandas Series which can be used for indexing, e.g. in .loc.
The colon is unnecessary because the row indexer comes first in .loc anyway.
This solution is not tested.
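If you also need the three percentages you listed, the boolean masks can be combined directly. A minimal, untested sketch, assuming the columns are named math_score and reading_score:
passing_math = school_data_complete_df["math_score"] >= 70
passing_reading = school_data_complete_df["reading_score"] >= 70

# .mean() on a boolean Series gives the fraction of True values
pct_passing_math = passing_math.mean() * 100
pct_passing_reading = passing_reading.mean() * 100
pct_overall_passing = (passing_math & passing_reading).mean() * 100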
I have two .csv files with data arranged date-wise. I want to compute the monthly accumulated value for each month across all the years. The CSV files read in without any error. However, when computing the monthly accumulated values, the code works correctly for one time series (one CSV file) but malfunctions for the other. The only difference I notice is that when I read the first CSV file (with a 'Date' and 'Value' column and a total of 826 rows), the dataframe has 827 rows (the last row is all NaN). This does not happen with the other CSV file.
Please note that my time series runs from 06-06-20xx to 01-10-20xx each year, for the years 2008-2014. I am obtaining the monthly accumulated value for each month and then removing the zero values (for the months Jan-May and Nov-Dec). When my code runs, for the first CSV I get monthly accumulated values starting from June 2008. But for the second, it starts from January 2008 (with a non-zero value, which ideally should be zero).
Since I am new to Python coding, I cannot figure out where the issue is. Any help is appreciated. Thanks in advance.
Here is my code:
import pandas as pd
import numpy as np
# read the csv file
df_obs = pd.read_csv("..path/first_csv.csv")
df_fore = pd.read_csv("..path/second_csv.csv")
# convert 'Date' column to datetime index
df_obs['Date'] = pd.to_datetime(df_obs['Date'])
df_fore['Date'] = pd.to_datetime(df_fore['Date'])
# perform GroupBy operation over monthly frequency
monthly_accumulated_obs = df_obs.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
monthly_accumulated_fore = df_fore.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
Sometimes more verbose but explicit solutions work better and are more flexible, so here's an alternative one, using convtools:
from datetime import date, datetime
from convtools import conversion as c
from convtools.contrib.tables import Table
# generate an ad hoc grouper;
# it's a simple function to be reused further
converter = (
    c.group_by(c.item("Date"))
    .aggregate(
        {
            "Date": c.item("Date"),
            "Observed": c.ReduceFuncs.Sum(c.item("Observed")),
        }
    )
    .gen_converter()
)
# read the stream of prepared rows
rows = (
    Table.from_csv("..path/first_csv.csv", header=True)
    .update(
        Date=c.call_func(
            datetime.strptime, c.col("Date"), "%m-%d-%Y"
        ).call_method("replace", day=1),
        Observed=c.col("Observed").as_type(float),
    )
    .into_iter_rows(dict)
)
# process them
result = converter(rows)
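The converter returns a list of dicts, one per month present in the data, so you can inspect the first few aggregated rows like this (still assuming the value column is really called Observed):
# Each row is a dict with the month-start date and the monthly sum
for row in result[:3]:
    print(row["Date"].strftime("%Y-%m"), row["Observed"])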
I have timeseries data and I would like to clean it by approximating the missing data points and standardizing the sample rate.
Since there might be some unevenly spaced datapoints, I would like to define a function that takes the timeseries and an interval X (e.g., 30 minutes or any other interval) as input and returns the timeseries with points spaced at X intervals as output.
As you can see below, the readings are every 10 minutes but some data points are missing. The algorithm should detect the missing times, insert the appropriate timestamps, and generate values for them. Then, based on the defined function, the sample rate should be changed and standardized.
For approximating missing data and cleaning it, either average or linear interpolation would work.
Here is a part of raw data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "Time": ["10:09:00", "10:19:00", "10:29:00", "10:43:00", "10:59:00 ", "11:09:00"],
    "Value": ["378", "378", "379", "377", "376", "377"],
})
df
First of all you need to convert "Time" into a datetime index. Make pandas recognize the times as actual datetimes with df["Time"] = pd.to_datetime(df["Time"]). Then set "Time" as the index: df = df.set_index("Time").
Once you have the datetime index, you can do all sorts of time-based operations with it. In your case, you want to resample: df.resample('10T')
This leaves us with the following code:
df["Time"] = pd.to_datetime(df["Time"], format="%H:%S:%M")
df = df.set_index("Time")
df.resample('10T')
From here on you have a lot of options on how to treat cases in which you have missing data (fill / interpolate / ...), or in which you have multiple data points for one new one (average / sum / ...). I suggest you take a look at the pandas resampling api. For conversions and formatting between string and datetime refer to strftime.
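Putting the pieces together, here is one possible end-to-end version, untested against your full data, assuming the values should be numeric and that linear interpolation is acceptable as you mention:
# Values are strings in the sample frame, so make them numeric first
df["Value"] = pd.to_numeric(df["Value"])

# Parse the times (strip the stray trailing space) and index by them
df["Time"] = pd.to_datetime(df["Time"].str.strip(), format="%H:%M:%S")
df = df.set_index("Time")

# Resample to a regular 10-minute grid, averaging any duplicates,
# then fill the gaps by linear interpolation
resampled = df.resample("10T").mean().interpolate(method="linear")
print(resampled)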
I used following code:
query = """select * from ZONE.STATE_MASTER_DATA WHERE TIME_KEY BETWEEN '2020-01-01' AND '2020-03-31'"""
webinar_data = gbq.read_gbq(query,project_id='Project1')
However, only 1,000 rows of data are captured. In Google BigQuery, the number of rows is 401,321.
How can I capture all rows of data?
Thanks!
It is described in the Pandas documentation that in order to change the maximum number of rows which will be shown, you should use display.max_rows.
After importing the pandas library, you can use the set_option() function to set the number of rows in the output. You can also choose None, which will output all the rows in your dataset. Be careful with None, though: if you have a very large number of rows, your kernel might be busy for a while until it displays everything.
Below, there are some usage examples:
1) Setting the maximum of rows to 3000 in the output
import pandas as pd
pd.set_option('display.max_rows', 3000)
2) Setting the maximum of rows to unlimited in the output
import pandas as pd
pd.set_option('display.max_rows', None)
In addition, I should point out that the current default is 60 rows per output.
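If you only want the full output for a single display rather than changing the option globally, pandas also provides option_context, which restores the previous setting afterwards:
import pandas as pd

# Temporarily lift the row limit for one print, then restore the old setting
with pd.option_context('display.max_rows', None):
    print(webinar_data)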
I've been working on this for a few hours and am giving up at this point. I have a scientific tool that is slightly glitching and creating a .csv database with datapoints out of order, i.e.
Test_ID  Data_Point  Test_Time    Step_Time    etc...
1        1439        1441.044976  1328.572329
1        1440        1442.046983  1329.574335
1        1121        1122.423305  1009.950658
1        1122        1123.424295  1010.951648
Note how the data skips from 1440 back to 1121. If you backtracked in the .csv file you'd find a section of about 40 rows, after 1120, missing. This is a large data file of about 125k rows.
I'm using Python in the Canopy environment with pandas installed. I'm trying to sort the database on the Data_Point column (as I thought it'd be the easiest; you could do it based on test or step time), keeping the rows intact. Here is the code I've tried:
import pandas as pd
import numpy
from pylab import plt, plot, legend, show
df = pd.read_csv("C:\ArbinData\PanCell3_Cycling_0-30.csv")
df2 = df.sort_values('Data_Point', ascending = 0)
for x in range(1, len(df2['Data_Point'])):
#Do science.
Thanks for any help, I'm out of energy on this one.
You are using sort_values incorrectly. The ascending argument expects boolean (True/False) values, not binary (1/0) values.
It should be:
df2 = df.sort_values(by=['Data_Point'], ascending=False)
This sorts the values in descending order.
Newer syntax, starting from v0.17.0, supports using the integers 1/0 to represent the boolean values True/False respectively.
df2 = df.sort_values(by=['Data_Point'], ascending=0)
You can even pass a list of elements to the ascending keyword argument whose length corresponds to the number of items passed to the by keyword argument.
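For example, to sort primarily by Data_Point ascending and break ties by Test_Time descending (illustrative, using the column names from your snippet):
# One ascending flag per column in the by list
df2 = df.sort_values(by=['Data_Point', 'Test_Time'], ascending=[True, False])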