I have two pandas DataFrames. df1 is 2 years of time series data recorded hourly for 20,000+ users, and it looks something like this:
TimeStamp | UserID1 | UserID2 | ... | UserID20000 |
---------------------------------------------------------------
2017-01-01 00:00:00 | 1.5 | 22.5 | ... | 5.5 |
2017-01-01 01:00:00 | 4.5 | 3.2 | ... | 9.12 |
.
.
.
2019-12-31 22:00:00 | 4.2 | 7.6 | ... | 8.9 |
2019-12-31 23:00:00 | 3.2 | 0.9 | ... | 11.2 |
df2 is ~ 20 attributes for each of the users and looks something like this:
User | Attribute1 | Attribute2 | ... | Attribute20 |
------------------------------------------------------------
UserID1 | yellow | big | ... | 450 |
UserID2 | red | small | ... | 6500 |
.
.
.
UserID20000 | yellow | small | ... | 950 |
I would like to create a Plotly Dash app with callbacks where a user can specify attribute values or ranges of values (e.g. Attribute1 == 'yellow', or Attribute20 < 1000 AND Attribute20 > 500) to create line graphs of the time series data for only the users that meet the specified attribute criteria.
I'm new to Plotly, but I'm able to create static plots with matplotlib by filtering df2 based on the attributes I want, making a list of the User IDs after filtering, and reindexing df1 with the list of filtered User IDs:
filtered_users = df2.loc[df2['Attribute1'] == 'yellow', 'User'].to_list()
df1 = df1.reindex(filtered_users, axis=1)
While this works, I'm not sure if the code is that efficient, and I'd like to be able to explore the data interactively, hence the move to Plotly.
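A minimal sketch of the kind of Dash callback described above, assuming df1 is indexed by TimeStamp and df2 is laid out as shown; the component IDs, the single Attribute1 dropdown, and the rest of the layout are illustrative assumptions (real range filters would need additional inputs):

from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go

app = Dash(__name__)

app.layout = html.Div([
    dcc.Dropdown(
        id='attribute1-dropdown',
        options=[{'label': v, 'value': v} for v in sorted(df2['Attribute1'].unique())],
        value='yellow',
    ),
    dcc.Graph(id='timeseries-graph'),
])

@app.callback(
    Output('timeseries-graph', 'figure'),
    Input('attribute1-dropdown', 'value'),
)
def update_graph(attr1_value):
    # Filter df2 on the chosen attribute, then pick those users' columns from df1.
    users = df2.loc[df2['Attribute1'] == attr1_value, 'User'].tolist()
    filtered = df1[users]
    fig = go.Figure()
    for user in filtered.columns:
        fig.add_trace(go.Scatter(x=filtered.index, y=filtered[user],
                                 mode='lines', name=user))
    return fig

if __name__ == '__main__':
    app.run(debug=True)  # app.run_server(debug=True) on older Dash versions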
Given table 1 below, we cannot know in advance how many distinct values of change_reason there are, but we need to split the table into different sub-tables according to change_reason, for example as tables 2-4. Is there a way to avoid filtering each change reason one by one, since the values are different every time and a lot of new ones keep popping up?
table 1:
| Borrower | Impact |change_reason |
| -------- | -------|--------------|
| AAA | 2.5 | ICR upgrade |
| BBB | 4.0 | ICR downgrade|
| CCC | 5.0 | ICR upgrade |
| DDD | 2.2 | New borrower |
| EEE | 1.0 | ICR downgrade|
...
table 2:
|Borrower | Impact | change_reason |
|---------|--------|---------------|
| AAA | 2.5 | ICR upgrade |
| CCC | 5.0 | ICR upgrade |
table 3:
|Borrower | Impact | change_reason |
|---------|--------|---------------|
| BBB | 4.0 | ICR downgrade |
| EEE | 1.0 | ICR downgrade |
table 4:
|Borrower | Impact | change_reason |
|---------|--------|---------------|
| DDD | 2.2 | New borrower |
If I understand your issue correctly, the following code may help (table refers to the original combined DataFrame). Create a dictionary of DataFrames keyed by change_reason:
tables = dict(tuple(table.groupby('change_reason')))
and then select sub-tables by key:
tables["ICR upgrade"]
print(tables["ICR downgrade"])
You can try the following loop to export each sub-table to a CSV file; I hope this works. Note: table is your original DataFrame.
tables = dict(tuple(table.groupby('change_reason')))
for key in tables.keys():
    table.loc[table["change_reason"] == key].to_csv("{}.csv".format(key), header=True)
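Since tables already holds each sub-DataFrame, a slightly simpler variant of the same idea (just reusing the dictionary values instead of re-filtering) would be:

for key, sub_table in tables.items():
    # Each dictionary value is already the filtered sub-DataFrame for that change_reason.
    sub_table.to_csv("{}.csv".format(key), header=True)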
I have a huge CSV file which I initially converted into a Parquet file. This file contains information from different sensors.
| | Unnamed: 0 | sensor_id | timestamp | P1 | P2 |
|---:|-------------:|------------:|:--------------------|------:|-----:|
| 0 | 0 | 4224 | 2020-05-01T00:00:00 | 0.5 | 0.5 |
| 1 | 1 | 3016 | 2020-05-01T00:00:00 | 0.77 | 0.7 |
| 2 | 2 | 29570 | 2020-05-01T00:00:00 | 0.82 | 0.52 |
In order to process the data I want to create several smaller DataFrames (using resampling etc.) containing the time series of each sensor. These time series should then be inserted into an HDF5 file.
Is there any faster possibility besides looping over every group?
import dask.dataframe as dd
import numpy as np

def parse(d):
    # ... parsing
    return d

# load data
data = dd.read_parquet(fp)
sensor_ids = np.unique(data['sensor_id'].values).compute()  # get array of all ids/groups
groups = data.groupby('sensor_id')
res = []
for idx in sensor_ids:
    d = parse(groups.get_group(idx).compute())
    res.append(d)
# ... loop over res ... store ...
I was thinking about using data.groupby('sensor_id').apply(...), but this results in a single DataFrame, while the solution above calls the compute() method in every iteration, leading to far too high a computation time. The data contains a total of approx. 200,000,000 rows and approx. 11,000 sensors/groups.
Can I put writing the time series to an HDF5 file for every sensor into a function and call it via apply?
The desired result for one group/sensor looks like this:
parse(data.groupby('sensor_id').get_group(4224).compute()).to_markdown()
| timestamp | sensor_id | P1 | P2 |
|:--------------------|------------:|--------:|--------:|
| 2020-05-01 00:00:00 | 4224 | 2.75623 | 1.08645 |
| 2020-05-02 00:00:00 | 4224 | 5.69782 | 3.21847 |
Looping is not the best way here. If you are happy to save the small datasets as Parquet, you can just use the partition_on option:
import dask.dataframe as dd
data = dd.read_parquet(fp)
data.to_parquet("data_partitioned", partition_on="sensor_id")
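To later work with a single sensor you can read back just that partition. This is a suggestion on top of the answer above, using the filters argument of dd.read_parquet; note that, depending on the pyarrow version, the values of a partition_on column may come back as strings/categories, so you might need "4224" instead of 4224:

import dask.dataframe as dd

# Only the matching sensor_id directory of the partitioned dataset is read.
single = dd.read_parquet("data_partitioned", filters=[("sensor_id", "==", 4224)])
df_4224 = single.compute()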
I would like to convert the following SQL statement to the equivalent pandas expression.
select
a1.country,
a1.platform,
a1.url_page as a1_url_page,
a2.url_page as a2_url_page,
a1.userid, a1.a1_min_time,
min(a2.dvce_created_tstamp) as a2_min_time
from(
select country, platform, url_page, userid,
min(dvce_created_tstamp) as a1_min_time
from pageviews
group by 1,2,3,4) as a1
left outer join pageviews as a2 on a1.userid=a2.userid
and a1.a1_min_time < a2.dvce_created_tstamp
and a2.url_page <> a1.url_page
group by 1,2,3,4,5,6
I am aware of the pandas merge command, however in our case we have a composite join clause that also includes an inequality. I haven't found any documentation on how to handle this case.
Of course, as a last resort, I could iterate through the DataFrames, but I do not think that this is the most efficient way to do it.
For example we can add some sample input data
| country | platform | url_page | userid | dvce_created_tstamp |
|---------|----------|----------|--------|---------------------|
| gr | win | a | bar | 2019-01-01 00:00:00 |
| gr | win | b | bar | 2019-01-01 00:01:00 |
| gr | win | a | bar | 2019-01-01 00:02:00 |
| gr | win | a | foo | 2019-01-01 00:00:00 |
| gr | win | a | foo | 2019-01-01 01:00:00 |
The response from SQL:
When I use the DataFrame left merge command I get the following output:
It is obvious that we are missing the rows where a2_url_page is null.
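One common way to express this kind of composite/non-equi join in pandas is to merge on the equality key only and apply the inequality conditions as a filter afterwards. A rough sketch of that approach, assuming the page views live in a DataFrame called pageviews with the same column names as in the SQL (not a tested drop-in translation):

import pandas as pd

# Sample input from the question
pageviews = pd.DataFrame({
    'country': ['gr'] * 5,
    'platform': ['win'] * 5,
    'url_page': ['a', 'b', 'a', 'a', 'a'],
    'userid': ['bar', 'bar', 'bar', 'foo', 'foo'],
    'dvce_created_tstamp': pd.to_datetime([
        '2019-01-01 00:00:00', '2019-01-01 00:01:00', '2019-01-01 00:02:00',
        '2019-01-01 00:00:00', '2019-01-01 01:00:00']),
})

# a1: first page view time per (country, platform, url_page, userid)
a1 = (pageviews
      .groupby(['country', 'platform', 'url_page', 'userid'], as_index=False)
      .agg(a1_min_time=('dvce_created_tstamp', 'min')))

# Equality part of the join only, then filter on the inequality conditions.
joined = a1.merge(pageviews, on='userid', how='left', suffixes=('', '_a2'))
mask = ((joined['dvce_created_tstamp'] > joined['a1_min_time']) &
        (joined['url_page_a2'] != joined['url_page']))

agg = (joined[mask]
       .groupby(['country', 'platform', 'url_page', 'url_page_a2',
                 'userid', 'a1_min_time'], as_index=False)
       .agg(a2_min_time=('dvce_created_tstamp', 'min')))

# Re-attach a1 rows that had no qualifying a2 rows, mimicking the LEFT OUTER JOIN,
# so the rows with a null a2_url_page are kept.
result = (a1.merge(agg,
                   on=['country', 'platform', 'url_page', 'userid', 'a1_min_time'],
                   how='left')
            .rename(columns={'url_page': 'a1_url_page',
                             'url_page_a2': 'a2_url_page'}))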
I have a dataframe like this which is an application log:
+---------+----------------+----------------+---------+----------+-------------------+------------+
| User | ReportingSubId | RecordLockTime | EndTime | Duration | DurationConverted | ActionType |
+---------+----------------+----------------+---------+----------+-------------------+------------+
| User 5 | 21 | 06:19.6 | 06:50.5 | 31 | 00:00:31 | Edit |
| User 4 | 19 | 59:08.6 | 59:27.6 | 19 | 00:00:19 | Add |
| User 25 | 22 | 29:09.4 | 29:37.0 | 28 | 00:00:28 | Edit |
| User 10 | 19 | 28:36.9 | 33:37.0 | 300 | 00:05:00 | Add |
| User 27 | 22 | 13:27.7 | 16:54.9 | 207 | 00:03:27 | Edit |
| User 5 | 21 | 11:22.8 | 12:37.3 | 75 | 00:01:15 | Edit |
+---------+----------------+----------------+---------+----------+-------------------+------------+
I wanted to visualize the duration of adds and edits for each user, and a Gantt chart seemed ideal for me.
I was able to do it for a sample dataframe of 807 rows with the following code:
import plotly.plotly as py
import plotly.figure_factory as ff

data = []
for row in df_temp.itertuples():
    data.append(dict(Task=str(row.User), Start=str(row.RecordLockTime),
                     Finish=str(row.EndTime), Resource=str(row.ActionType)))

colors = {'Add': 'rgb(110, 244, 65)',
          'Edit': 'rgb(244, 75, 66)'}

fig = ff.create_gantt(data, colors=colors, index_col='Resource',
                      show_colorbar=True, group_tasks=True)

for i in range(len(fig["data"]) - 2):
    text = ("User: {}<br>Start: {}<br>Finish: {}<br>Duration: {}<br>"
            "Number of Adds: {}<br>Number of Edits: {}").format(
                df_temp["User"].loc[i],
                df_temp["RecordLockTime"].loc[i],
                df_temp["EndTime"].loc[i],
                df_temp["DurationConverted"].loc[i],
                counts[counts["User"] == df_temp["User"].loc[i]]["Add"].iloc[0],
                counts[counts["User"] == df_temp["User"].loc[i]]["Edit"].iloc[0])
    fig["data"][i].update(text=text, hoverinfo="text")

fig['layout'].update(autosize=True, margin=dict(l=150))
py.iplot(fig, filename='gantt-group-tasks-together', world_readable=True)
and I am more than happy with the result: https://plot.ly/~pawelty/90.embed
However, my original df has more users and 2500 rows in total. That seems to be too much for Plotly: I get a 502 error.
I am a huge fan of Plotly but I might have reached its limit. Can I change something in order to visualize it with Plotly? Any other tool I could use?
I started using plotly.offline.plot(fig) to plot offline; it works much faster and I get fewer errors. I also have the problem that my graph sometimes doesn't get displayed, or only displays in fullscreen mode...
I import plotly instead of plotly.plotly, though, otherwise it doesn't work.
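For reference, a minimal offline-plotting sketch along those lines, reusing data and colors from the question's code (the output filename is just an example):

import plotly.offline as pyo
import plotly.figure_factory as ff

fig = ff.create_gantt(data, colors=colors, index_col='Resource',
                      show_colorbar=True, group_tasks=True)
# Writes a standalone HTML file locally instead of uploading to the Chart Studio servers.
pyo.plot(fig, filename='gantt-offline.html', auto_open=False)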
I come from an SPSS background and I want to declare missing values in a pandas DataFrame.
Consider the following dataset from a Likert Scale:
SELECT COUNT(*),v_6 FROM datatable GROUP BY v_6;
| COUNT(*) | v_6 |
+----------+------+
| 1268 | NULL |
| 2 | -77 |
| 3186 | 1 |
| 2700 | 2 |
| 512 | 3 |
| 71 | 4 |
| 17 | 5 |
| 14 | 6 |
I have a DataFrame
pdf = psql.frame_query('SELECT * FROM datatable', con)
The null values are already declared as NaN - now I want -77 also to be a missing value.
In SPSS I am used to:
MISSING VALUES v_6 (-77).
Now I am looking for the pandas counterpart.
I have read:
http://pandas.pydata.org/pandas-docs/stable/missing_data.html
but I honestly do not see how the proposed approach would apply to my case...
Use pandas.Series.replace():
import numpy as np

df['v_6'] = df['v_6'].replace(-77, np.nan)
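If you later need to declare several sentinel codes as missing at once, replace() also accepts a list (the -99 below is just a hypothetical second code):

df['v_6'] = df['v_6'].replace([-77, -99], np.nan)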