I am trying to manipulate my dataframe before I conduct network analysis using networkx.
Here is a sample of the data I have:
sample data
I am trying to use the title and cast columns and turn them into something like this:
ideal format
The ideal result is to have one column for each individual actor and one for the movie/show they are in. If an actor appears in more than one show/movie, I want a separate row for each appearance.
Could someone please advise me on how to make it happen? Thank you!!
To use pandas, you first read your data into a DataFrame. Let's call it "f".
import pandas
f = pandas.read_csv('path/to/csv')
After that you can access individual columns by doing:
f['title']
similar to a dictionary. If you want both in the same DataFrame, pass in a list of columns like so:
f[['title', 'cast']]
that is as much as I can provide without knowing the extent of the project.
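Once the two columns are in the frame, one way to get one row per actor is str.split plus explode. This is only a sketch: it assumes the cast column holds comma-separated names, and the sample titles/actors below are made up.

```python
import pandas as pd

# Made-up stand-in for the real data; assumes cast is comma-separated
f = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "cast": ["Alice, Bob", "Alice"],
})

# Split each cast string into a list, then give each actor their own row
f["cast"] = f["cast"].str.split(", ")
pairs = f.explode("cast").rename(columns={"cast": "actor"})
```

Each (title, actor) pair then becomes its own row, which is the edge-list shape networkx expects.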
I would like to import data from the Navigraph survey results.
https://navigraph.com/blog/survey2022
The dataset is here:
https://download.navigraph.com/docs/flightsim-community-survey-by-navigraph-2022-data.zip
However, I noticed the structure is something I'm not quite used to, and perhaps this is how a lot of polling data is shared. The semicolons being separators is not an issue. It's the fact there's a mix of "select multiple" responses as columns. The tidiest part is that, starting at the third row, each row is a single respondent.
How can I clean up this data so it is as "tidy" as possible? How would I melt() these columns into rows? How do I handle the multiple selection responses in the sub-columns?
I'd like the questions and responses to simply be two columns respectively.
Hello! I don't have deep experience with this kind of work, but I believe you will have to:
1- Read the file as is
2- Concatenate the columns of questions and answers
3- Create the dataset that will be used
I believe pandas has commands that will help you; the key is finding the patterns that define what counts as a "question" and an "answer" in this dataset.
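As a rough sketch of step 2 using pandas melt: the column names below are invented, not the real survey headers (the actual file uses the question text as headers, with sub-columns for the "select multiple" options).

```python
import pandas as pd

# Hypothetical wide survey frame: one row per respondent, one column
# per question or per "select multiple" option
wide = pd.DataFrame({
    "respondent": [1, 2],
    "Q1": ["Yes", "No"],
    "Q2 / Option A": [1, 0],
    "Q2 / Option B": [0, 1],
})

# melt turns each question column into a (question, response) row pair
tidy = wide.melt(id_vars="respondent", var_name="question", value_name="response")
```

For the multiple-selection sub-columns, each option ends up as its own question row with a 0/1 response, which you can then filter or relabel as needed.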
I have an Excel file containing some data. The first row contains names, some of which are similar.
I want to create a loop in Python in order to apply an operation on each group of similar rows.
I tried this, but it didn't help because it shows only one group of my data:
df.loc[df['name'] == 'sofia']
DATA
This is my data; it represents the data of one cell of a network.
In this very long file (about 19,000 rows) I have the data for each cell for two years, 2020 and 2021, and I have to predict the data_trend for 2022. I applied the Holt-Winters method to one cell only, and I want to apply it to all the cells. That is my problem. I am a beginner in Python, so I would be very grateful if you could help me out.
Thanks in advance.
Can you give us some example data and describe what you want to do with the rows?
Offhand, it sounds like you should do something like:
def your_func(x):
    # your operations on the group's values go here
    return x  # return the transformed values
df['col_to_upd'] = df.groupby('name')['col_to_upd'].transform(your_func)
This page may be relevant for you to look up the pandas functions. Like erikheld writes, it is difficult to give an exact solution to your problem without a data matrix or an example of the operation you want to perform on each group.
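As a concrete sketch of that groupby pattern (the column names and the centering operation below are made up; substitute your own per-group operation, e.g. a Holt-Winters fit):

```python
import pandas as pd

# Invented stand-in data: one value column, grouped by 'name'
df = pd.DataFrame({
    "name": ["sofia", "sofia", "marc"],
    "value": [1.0, 3.0, 5.0],
})

def your_func(x):
    # placeholder operation: center each group on its own mean
    return x - x.mean()

# transform aligns each group's result back to the original rows
df["value"] = df.groupby("name")["value"].transform(your_func)
```

The same groupby call runs your_func once per distinct name, so one function handles every group in the file.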
I've been going through hours of research just to try to solve this seemingly simple issue, and I'm not sure why it's been so hard to find. I'm trying to plot the stock data for AAPL. When I extract the data from Ameritrade, it's a nested JSON dictionary. I came from MATLAB, where this was very simple, but I am not sure how to extract the nested JSON here. I used pd.read_json to parse the outer JSON, but there is still one level left inside the DataFrame that holds the data I need to plot. Any help would be greatly appreciated. Below is what it looks like:
df = pd.read_json(aapldata)
The df looks like this; I'm trying to extract the data within the 'candles' column.
Dataframe Picture Showing Candle Column
As long as there is only one level of nesting, you should be able to do this:
import pandas as pd
df = pd.json_normalize(aapldata)
(json_normalize used to live in pandas.io.json, but since pandas 1.0 it is exposed at the top level as pd.json_normalize.)
source
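If the nesting is specifically inside the 'candles' column, json_normalize's record_path argument can flatten each candle into its own row. The response shape below is only an assumption about what the Ameritrade payload looks like:

```python
import pandas as pd

# Hypothetical response shape; the real Ameritrade payload may differ
aapldata = {
    "symbol": "AAPL",
    "candles": [
        {"open": 150.0, "close": 151.2, "datetime": 1640995200000},
        {"open": 151.2, "close": 149.8, "datetime": 1641081600000},
    ],
}

# record_path flattens each candle into a row;
# meta carries the top-level symbol along with each one
candles = pd.json_normalize(aapldata, record_path="candles", meta=["symbol"])
```

The resulting frame has one row per candle, which is the shape you want for plotting.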
I want to convert a CSV file of time-series data with multiple sensors.
This is what the data currently looks like:
The different sensors are described by numbers and have different numbers of axes. If a new activity is labeled, everything below belongs to this new label. The label is in the same column as the first entry of each sensor.
This is the way I would like the data to be:
Each sensor axis has its own column and the according label is added in the last column.
So far, I have created a DataObject class to access timestamp, sensortype, sensorvalues, and the belonging parent_label for each row in the CSV.
I thought the most convenient way to solve this would be with a pandas DataFrame, but simply using pd.DataFrame(timestamp, sensortype, sensorvalues, label) won't work.
Any ideas/hints? Maybe other ways to solve this problem?
I am fairly new to programming, especially Python, so I have already run out of ideas.
Thanks in advance
Try creating a numpy matrix of the columns you require, then convert it to a pandas DataFrame.
Otherwise, you can also try to import the csv using pandas from the start.
Also, for the following:
pd.DataFrame(timestamp, sensortype, sensorvalues, label)
try referring to the pd.concat function as well. You would need to convert each array to a DataFrame, put them in a list and then concat them with pandas.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
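A minimal sketch of that concat approach, assuming hypothetical per-row arrays pulled from your DataObject instances (all names and values below are made up):

```python
import pandas as pd

# Hypothetical arrays extracted from the DataObject instances
timestamps = ["2021-01-01 00:00:00", "2021-01-01 00:00:01"]
sensorvalues = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # one list per row, one entry per axis
labels = ["walking", "walking"]

# Wrap each array in a DataFrame, collect them in a list,
# then concat them column-wise (axis=1)
df = pd.concat(
    [
        pd.DataFrame({"timestamp": timestamps}),
        pd.DataFrame(sensorvalues, columns=["x", "y", "z"]),
        pd.DataFrame({"label": labels}),
    ],
    axis=1,
)
```

Each sensor axis gets its own column and the label lands in the last column, as described in the question.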
I have a seemingly complicated problem and I have a general idea of how I should solve it but I am not sure if it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new with Pandas so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a DataFrame. The example I am working through contains 2742 rows × 136 columns. The rows are variable but the columns are set. I have a set of 23 lookup tables (also CSV files) named per year, per quarter (range: 2020 Q3 back to 2015 Q1). The lookup files are named like PPRRVU203.csv, which contains values from the 3rd quarter of 2020. The lookup tables are matched on two columns ('Code' and 'Mod'), and I use three values from each matching lookup row.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how to place the results back in. My question, for those who understand pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original DataFrame into 23 separate DataFrames, do the merge on each one individually, then concat them into a new DataFrame and output to CSV.
This seems highly inefficient, though.
I can post code, but I am mostly looking for high-level thoughts.
I'm not sure exactly what your DataFrame looks like, but the DataFrame.query() method may prove useful for selecting data.
name = df.query('columnname == "something"')
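As a rough sketch of one quarter's pass combining query with merge (the column names and values below are invented stand-ins for your 'Code'/'Mod' lookup; the 'Quarter' column is an assumption about how you identify which lookup file applies):

```python
import pandas as pd

# Invented lookup table standing in for one PPRRVU-style file
lookup = pd.DataFrame(
    {"Code": ["A1", "B2"], "Mod": ["", "26"], "Value": [1.0, 2.0]}
)

# Invented slice of the main frame; 'Quarter' marks which lookup applies
df = pd.DataFrame({
    "Code": ["A1", "B2", "A1"],
    "Mod": ["", "26", ""],
    "Quarter": ["203", "203", "203"],
})

# Filter the rows for this quarter, then merge in the lookup values
subset = df.query('Quarter == "203"')
merged = subset.merge(lookup, on=["Code", "Mod"], how="left")
```

Repeating this per quarter and concatenating the merged pieces is essentially the 23-frame approach you describe; doing the filter/merge inside a loop over the lookup files keeps it down to one function.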