Provide sample DataFrame from csv file when asking question on stackoverflow [duplicate] - python

This question already has answers here:
How to make good reproducible pandas examples
(5 answers)
Closed 17 days ago.
When asking a Python/pandas question on Stack Overflow I often like to provide a sample DataFrame.
I usually have a local CSV file I work with for testing.
So for a DataFrame I like to provide code in my question like
df = pd.DataFrame()
Is there an easy way or tool to turn a CSV file into code in a format like this, so another user can easily recreate the DataFrame?
For now I usually do it manually, which is annoying and time consuming. I have to copy/paste the data from Excel to Stack Overflow, remove tabs/spaces, rearrange numbers to get a list or dictionary, and so on.
Example CSV file:
col1,col2
1,3
2,4
If I want to provide this table, I can provide code like:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
I have to create the dictionary and DataFrame manually and type the code into the Stack Overflow editor.
For a more complex table this can be a lot of work.
Hope you get the "problem".
Thank you.

You can make a dict from the .csv and pass it to the pandas.DataFrame constructor:
import pandas as pd

N = 5  # <- adjust here to choose the number of rows
dico = pd.read_csv("f.csv").sample(N).to_dict("list")
S = f"df = pd.DataFrame({dico})"
print(S)  # <- copy the output and paste it into your Stack Overflow question
You can also use pyperclip to copy the text you'll paste into your question directly to the clipboard:
# pip install pyperclip
import pyperclip

pyperclip.copy(S)
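For the small example file above, the generated snippet would look roughly like this (a sketch based on the two-row CSV shown in the question); another user only has to paste the single pd.DataFrame line to recreate the data:

import pandas as pd

# The generated one-liner that ends up in the question
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df)
#    col1  col2
# 0     1     3
# 1     2     4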

Related

Loop function to rename dataframes

I am new to coding and currently I want to create individual dataframes from each Excel tab. It works so far thanks to a search in this forum (I found a sample using a dictionary), but now I need one more step which I can't figure out.
This is the code I am using:
import pandas as pd

excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
d = {}
for sheet in xls.sheet_names:
    print(sheet)
    d[sheet] = pd.read_excel(xls, sheet_name=sheet)
Let's say I have 3 Excel tabs called 'alpha', 'beta' and 'charlie'.
The code above gives me 3 dataframes and I can call them by typing d['alpha'], d['beta'] and d['charlie'].
What I want is to rename the dataframes so that instead of calling them by typing (for example) d['alpha'], I just need to write alpha (without any extras).
Edit: The Excel file I want to parse has 50+ tabs and it can grow.
Edit 2: Thank you all for the links and the answers! It is a great help.
Don't rename them.
I can think of two scenarios here:
1. The sheets are fundamentally different
When people ask how to dynamically assign to variable names, the usual (and best) answer is "Use a dictionary". Here's one example.
Indeed, this is the reason Pandas does it this way!
In this case, my opinion is that your best move here is to do nothing, and just use the dictionary you have.
2. The sheets are roughly the same
If the sheets are all basically the same, and only differ by one attribute (e.g. they represent monthly sales and the names of the sheets are 'May', 'June', etc), then your best move is to merge them somehow, adding a column to reflect the sheet name (month, in my example).
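For example, a minimal sketch of that merge, reusing the d dict built in the question and assuming all sheets share the same columns:

import pandas as pd

# d maps sheet name -> DataFrame, as built in the question
frames = []
for name, frame in d.items():
    frame = frame.copy()
    frame['sheet'] = name  # record which tab each row came from
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)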
Whatever you do, don't use exec or eval, no matter what anyone tells you. They are not options for beginner programmers.
I think you are looking for the built-in exec function, which executes strings.
But I do not recommend using exec; it is widely discussed why it shouldn't be used, or at least should be used cautiously.
As I do not have your data, I think it is achievable using the following code:
import pandas as pd

excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
for sheet in xls.sheet_names:
    print(sheet)
    code_to_execute = f'{sheet} = pd.read_excel(xls, sheet_name="{sheet}")'
    exec(code_to_execute)
But again, I highlight that this is not the cleanest way to do it. Your approach is definitely cleaner; to be more precise, I would always use dicts for these kinds of assignments. See here for more about exec.
In general, you want to generate a string.
possible_string = 'a=10'
exec(possible_string)
print(a) # 10
You need to create variables which correspond to the three dataframes:
alpha, beta, charlie = d.values()
Edit:
Since you mentioned that the Excel file could have 50+ tabs and could grow, you may prefer to keep your original loop. This can be done dynamically using exec:
import pandas as pd

excel = 'sample.xlsx'
xls = pd.ExcelFile(excel)
for sheet in xls.sheet_names:
    print(sheet)
    exec(f"{sheet} = pd.read_excel(xls, sheet_name=sheet)")
It might be better practice, however, to simply index your sheets and access them by index. A 50+ length collection of excel sheets is probably better organized by appending to a list and accessing by index:
d = []
for sheet in xls.sheet_names:
    print(sheet)
    d.append(pd.read_excel(xls, sheet_name=sheet))

# d[0] = alpha; d[1] = beta, and so on...

Create an array from data in an excel file in python

I'm really new to Python and pandas, so would you please help me answer this seemingly simple question? I already have an Excel file containing my data; now I want to create an array containing that data in Python. For example, I have data in Excel that looks like this:
I want to create from that data a matrix of the form shown in the Python code below:
Actually, my data is much longer, so is there any way that I can take advantage of pandas to put the data from my Excel file into a matrix in Python, similar to the simple example above?
Thank you!
You can put all your values into a = np.array([40, 56, 87, 98, 58, 98, 56, 63]), then reshape with a.reshape(4, 2) (in your case a.reshape(3, 9)). Hope you get my point.
You can use pandas.read_excel().
In the documentation there are also some examples, like:
pd.read_excel('tmp.xlsx', index_col=0)
       Name  Value
0   string1      1
1   string2      2
2  #Comment      3
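To go from the Excel file straight to a NumPy matrix, as the question asks, here is a minimal sketch; the file name tmp.xlsx is just a placeholder, and header=None assumes the sheet has no header row:

import pandas as pd

# Read the sheet, then convert the DataFrame to a plain NumPy array
df = pd.read_excel('tmp.xlsx', header=None)
matrix = df.to_numpy()
print(matrix.shape)  # e.g. (3, 9) for 3 rows by 9 columns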

Splitting a single large csv file to resample by two columns

I am doing a machine learning project with phone sensor data (accelerometer). I need to preprocess the dataset before I export it to the ML model. I have 25 classes (alphabets in the dataset) and there are 20 subjects (how many times I recorded the alphabet) for each class. Since the lengths are different for each class and subject, I have to resample. I want to split a single CSV file by class and subject to be able to resample. I have tried some things like groupby() and others, but they did not work. I would be glad if you could share thoughts on what I can do about this problem. This is my first time asking a question on this site; if I made a mistake, I would appreciate it if you pointed it out. Thank you in advance.
I share some code and outputs to help you understand my question better.
This is what I got when I tried groupby(), but it is not exactly what I wanted.
This is what my CSV file looks like. It contains more than 300,000 rows.
Some code snippet:
import pandas as pd
import numpy as np

def read_data(file_path):
    data = pd.read_csv(file_path)
    return data

# read csv file
dataset = read_data('raw_data.csv')

df1 = pd.DataFrame(dataset.groupby(['alphabet', 'subject'])['x_axis'].count())
df1['x_axis'].head(20)
I also need to do this for x_axis, y_axis and z_axis, so what can I use other than the groupby() function? I do not want to use only the lengths but also the values of all three axes, to be able to resample.
First, find the largest sample size available in every group, i.e. the smallest group size:
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()
Now you can sample
df.groupby(['alphabet', 'subject']).sample(num_sample)
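Putting both steps together, a sketch that reuses the raw_data.csv file from the question:

import pandas as pd

df = pd.read_csv('raw_data.csv')

# Smallest group size across all (alphabet, subject) combinations
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()

# Draw the same number of rows from every group, keeping x, y and z columns together
resampled = df.groupby(['alphabet', 'subject']).sample(num_sample)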

Pandas gives me a SettingWithCopyWarning [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I am trying to create two new columns in my dataframe depending on the values of the columns Subscriber, External Party and Direction. If the Direction is I for Incoming, column a should become External Party and column b should become Subscriber. If the Direction is O for Outgoing, it should be the other way around. I use this code:
import pandas as pd
import numpy as np
...
df['a'] = np.where((df.Direction == 'I'), df['External Party'], df['Subscriber'])
df['b'] = np.where((df.Direction == 'O'), df['External Party'], df['Subscriber'])
I get a SettingWithCopyWarning from pandas, but the code does what it needs to do. How can I improve this operation to avoid the warning?
Thanks in advance!
Jo
Inspect the place in your code where df is created.
Most probably, it is a view of another DataFrame, something like:
df = df_src[...]
Then any attempt to save something into df causes just this warning.
To avoid it, create df as a truly independent DataFrame, with its
own data buffer. Something like:
df = df_src[...].copy()
Now df has its own data buffer, and can be modified without the
above warning.
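Applied to the code in the question, a minimal sketch (assuming df was originally sliced from a larger frame called df_src; the column selection is only illustrative):

import numpy as np

# Make df an independent copy instead of a view of df_src
df = df_src[['Subscriber', 'External Party', 'Direction']].copy()

df['a'] = np.where(df.Direction == 'I', df['External Party'], df['Subscriber'])
df['b'] = np.where(df.Direction == 'O', df['External Party'], df['Subscriber'])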
If you are planning to work with the same df later on in your code, it is sometimes useful to create a deep copy of the df before making any iterations.
Pandas' native copy method does not always act as one would expect; here is a similar question that might give more insight.
You can use the copy module that comes with Python to copy the entire object and to ensure that there are no links between the two dataframes.
import copy
df_copy = copy.deepcopy(df)

(Python) manually copy/paste data from pandas table without copying the index

I've been looking around but could not find a similar post, so I thought I'd give it a go.
I wrote a pandas program that successfully displays the resulting dataframe in pandas table format in a tkinter textbox. The aim is that the user can select the data and copy/paste it into an (existing) Excel sheet. When doing this, the index is always copied as well. I was wondering if one could programmatically select the complete table except the index?
I know that one can save to Excel or another format with index=False, but I could not find a kind of df.select....index=false. I hope my explanation is more or less clear ;-)
Thanks a lot
You could use the DataFrame's to_string function; here you can pass index=False as one of the parameters. For example, say we have this df:
import pandas as pd
df = pd.DataFrame({'a': ['yes', 'no', 'yes' ], 'b': [10, 5, 20]})
print(df.to_string(index = False))
this would give you:
  a   b
yes  10
 no   5
yes  20
Hope this helps!
I finally found it.
Instead of using something like self.mytable.copy('columns') to select everything and then switch to Excel and paste it, I use this line of code, which does exactly what I need:
df.to_clipboard(sep="\t", index=False)
The sep="\t" makes it split up amongst columns in Excel.
Hopefully someone can use this at some stage.
