dataframe generating own column names - python

For a project, I want to create a script that allows the user to enter values (like a value in centimetres) multiple times. I had a While-loop in mind for this.
The values need to be stored in a dataframe, which will be used to generate a graph of the values.
Also, there is no maximum number of entries that the user can enter, so the names of the variables that hold the values have to be generated with each entry (such as M1, M2, M3…Mn). However, the dataframe will only consist of one row (only for the specific case the user is entering values for).
So, my question boils down to this:
How do I create a dataframe (with pandas) where the script generates its own column name for each measurement, like M1, M2, M3, …Mn, so that all the values are stored?
I can't access my code right now, but I have created a while loop that allows the user to enter values; I'm just stuck on the dataframe and columns part.
Any help would be greatly appreciated!

I agree with #mischi that, without additional context, pandas seems like overkill, but here is an alternate method to create what you describe...
This code proposes a method to collect the values using a while loop and input() (your while loop is probably similar).
colnames = []
inputs = []
counter = 0

while True:
    value = input('Add a value: ')
    if value == 'q':  # provides a way to leave the loop
        break
    else:
        key = 'M' + str(counter)
        counter += 1
        colnames.append(key)
        inputs.append(value)

from pandas import DataFrame

df = DataFrame(inputs, colnames)  # this creates a DataFrame with a single column
                                  # and an index using the colnames
df = df.T                         # this transposes the DataFrame so the index
                                  # entries become the colnames
df.index = ['values']             # sets the name of your row
print(df)
The output of this script looks like this...
Add a value: 1
Add a value: 2
Add a value: 3
Add a value: 4
Add a value: q
M0 M1 M2 M3
values 1 2 3 4
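If you prefer to skip the transpose, a minimal alternative sketch (same colnames and inputs lists collected above) is to build the single-row frame directly:
df = DataFrame([inputs], columns=colnames, index=['values'])  # one row of data, one column per measurement
print(df)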

pandas seems a bit overkill here, but to answer your question:
Assuming you collect numerical values from your users and store them in a list:
import numpy as np
import pandas as pd
values = np.random.randint(0, 11, 10)  # 10 integers in [0, 10]; replaces the deprecated np.random.random_integers
print(values)
array([1, 5, 0, 1, 1, 1, 4, 1, 9, 6])
columns = {}
column_base_name = 'Column'

for i, value in enumerate(values):
    columns['{:s}{:d}'.format(column_base_name, i)] = value

print(columns)
{'Column0': 1,
'Column1': 5,
'Column2': 0,
'Column3': 1,
'Column4': 1,
'Column5': 1,
'Column6': 4,
'Column7': 1,
'Column8': 9,
'Column9': 6}
df = pd.DataFrame(data=columns, index=[0])
print(df)
Column0 Column1 Column2 Column3 Column4 Column5 Column6 Column7 \
0 1 5 0 1 1 1 4 1
Column8 Column9
0 9 6

Related

Inserting rows in df based on groupby using value of previous row

I need to insert rows based on the week column within each groupby type. In some cases there are missing weeks in the middle of the dataframe at different positions, and I want to insert rows to fill in the missing weeks as copies of the last existing row; in this case, copies of week 7 to fill in weeks 8 and 9, and copies of week 11 to fill in rows for weeks 12, 13 and 14. In the original table you can see the jump from week 7 to 10 and from 11 to 15.
The perfect output would be the final table with incremental values in the week column, filled in the correct way.
Below is the code I have; it inserts only one row and I'm confused why:
def middle_values(final: DataFrame) -> DataFrame:
    finaltemp = pd.DataFrame()
    out = pd.DataFrame()
    for i in range(0, len(final)):
        for f in range(1, 52, 1):
            if final.iat[i, 8] == f and final.iat[i - 1, 8] != f - 1:
                if final.iat[i, 8] > final.iat[i - 1, 8] and final.iat[i, 8] != (final.iat[i - 1, 8] - 1):
                    line = final.iloc[i - 1]
                    c1 = final[0:i]
                    c2 = final[i:]
                    c1.loc[i] = line
                    concatinated = pd.concat([c1, c2])
                    concatinated.reset_index(inplace=True)
                    concatinated.iat[i, 11] = concatinated.iat[i - 1, 11]
                    concatinated.iat[i, 9] = f - 1
                    finaltemp = finaltemp.append(concatinated)
    if 'type' in finaltemp.columns:
        for name, groups in finaltemp.groupby(["type"]):
            weeks = range(groups['week'].min(), groups['week'].max() + 1)
            out = out.append(pd.merge(finaltemp, pd.Series(weeks, name='week'), how='right').ffill())
        out.drop_duplicates(subset=['project', 'week'], keep='first', inplace=True)
        out.drop_duplicates(inplace=True)
        out.sort_values(["Budget: Budget Name", "Budget Week"], ascending=(False, True), inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        out.reset_index(inplace=True)
        out.drop(['level_0'], axis=1, inplace=True)
        return out
    else:
        return final
For the first part of your question, suppose we have a dataframe like the following:
df = pd.DataFrame({"project": [1, 1, 1, 2, 2, 2], "week": [1, 3, 4, 1, 2, 4], "value": [12, 22, 18, 17, 18, 23]})
We can create a new MultiIndex to get the additional rows that we need:
new_index = pd.MultiIndex.from_arrays(
    [sorted([i for i in df['project'].unique()] * 52),
     [i for i in np.arange(1, 53, 1)] * df['project'].unique().shape[0]],
    names=['project', 'week'])
We can then apply this index to get the new dataframe that you need, with blanks in the new rows:
df = df.set_index(['project', 'week']).reindex(new_index).reset_index().sort_values(['project', 'week'])
You would then need to apply a forward fill (using ffill) or a back fill (using bfill) with groupby and transform to get the required values in the rows that you need.
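For example, a minimal sketch of that fill step on the example df above (assuming the last known value should be carried forward within each project):
# forward-fill within each project; transform(lambda s: s.ffill()) would behave the same way
df['value'] = df.groupby('project')['value'].ffill()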

Find the growth of the same name in a dataframe

I have a dataset:
Year Name Value
1 A 10
2 B 20
3 A 25
3 B 10
I want to be able to find how each name has changed over the years. Ideally the result should look like
Name Growth/Year
A (25-10)/(3-1)
B (10-20)/(3-2)
I can build a list of unique names first and then loop through the dataset to find the value change. But is there an easier way to do it?
Try using pandas groupby:
df = pd.DataFrame({'year':[1,2,3,3], 'name':['A','B','A','B'], 'value':[10,20,25,10]})
df.groupby('name').apply(lambda x: (x['value'].iloc[1]-x['value'].iloc[0])/(x['year'].iloc[1]-x['year'].iloc[0]))
>>>
name
A 7.5
B -10.0
dtype: float64
Or to have more flexibility, you could define an aggregation function:
def value_change(x):
    yr1 = min(x['year'])
    yr2 = max(x['year'])
    # Get values corresponding to min and max years in case
    # min and max year rows aren't contiguous
    value1 = x[x['year'] == yr1]['value'].iloc[0]
    value2 = x[x['year'] == yr2]['value'].iloc[0]
    return (value2 - value1) / (yr2 - yr1)

df.groupby('name').apply(value_change)
So first you want the rows which correspond to the first and last year for each name:
df = pd.DataFrame([[1,"A", 10], [2, "B", 20], [3, "A", 25], [3, "B", 10]], columns=["Year", "Name", "Value"])
first_year = df.groupby("Name").Year.idxmin().sort_index()
last_year = df.groupby("Name").Year.idxmax().sort_index()
I added the sort_index just to be sure that the order lines up, but it might be a good idea to test this to make sure that nothing went wrong. Here's how I would test it:
assert (first_year.index == last_year.index).all()
Next, you can use these to extract the values you want and compute the change:
change_value = (df.loc[last_year.values, "Value"].values - df.loc[first_year.values, "Value"].values)
change_time = (df.loc[last_year.values, "Year"].values - df.loc[first_year.values, "Year"].values)
change_over_time = change_value / change_time
Then, you can convert this into a pandas Series if you'd like:
pd.Series(change_over_time, index=first_year.index)
There's probably a tidier way to do all of this within the groupby, but I think this way is easier to understand.
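For what it's worth, one possible groupby-only version (just a sketch; it assumes every name has at least two distinct years, otherwise the year span is zero):
growth = (df.sort_values("Year")
            .groupby("Name")
            .apply(lambda g: (g["Value"].iloc[-1] - g["Value"].iloc[0])
                           / (g["Year"].iloc[-1] - g["Year"].iloc[0])))
print(growth)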

Pandas- How to save frequencies of different values in different columns line by line in a csv file (including 0 frequencies)

I have a CSV file with the following columns of interest
fields = ['column_0', 'column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7', 'column_8', 'column_9']
For each of these columns, there are 153 rows of data, containing only two values: -1 or +1.
My problem is that, for each column, I would like to save the frequencies of the -1 and +1 values, comma-separated, line by line in a CSV file. I run into the following problems when I do this:
>>>df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
>>>print df['column_2'].value_counts()
1 148
-1 5
>>>df['column_2'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
148
5
Which is obviously not what I want; I want the values on the same line of the text file, separated by a comma (e.g., 148,5).
The second problem happens when one of the frequencies is zero:
>>> print df['column_9'].value_counts()
1 153
>>> df['column_9'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
153
I also don't want that behavior; I would like to see 153,0.
So, in summary, I would like to know how to do the following with pandas:
Given one column, save the frequencies of its different values on the same line of a csv file, separated by commas. For example:
148,5
If there is a value with frequency 0, put that in the CSV as well. For example:
153,0
Append these frequency values on different lines of the same CSV file. For example:
148,5
153,0
Can I do that with pandas, or should I move to another python library?
Example with some dummy data:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, -1, -1, -1],
                   'col2': [1, 1, 1, 1, 1, 1],
                   'col3': [-1, 1, -1, 1, -1, -1]})

counts = df.apply(pd.Series.value_counts).fillna(0).T
print(counts)
Output:
-1 1
col1 3.0 3.0
col2 0.0 6.0
col3 4.0 2.0
You can then export this to csv.
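For example, to write one frequency pair per line to the filename used in the question (the astype(int) just avoids the 3.0-style floats introduced by fillna; adjust to taste):
counts.astype(int).to_csv('result.txt', index=False, header=False)
# result.txt then contains lines like:
# 3,3
# 0,6
# 4,2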
See this answer for ref:
How to get value counts for multiple columns at once in Pandas DataFrame?
I believe you could do what you want like this:
import io
import pandas as pd

df = pd.DataFrame({'column_1': [1, -1, 1], 'column_2': [1, 1, 1]})

with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    # .to_frame to get a DataFrame and .T to transpose
    df['column_1'].value_counts().to_frame().T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
But I would suggest something like this instead, since otherwise you would have to specify explicitly that one of the expected values is missing:
with io.StringIO() as stream:
    # count the values per column, fill the missing counts with 0, then transpose
    counts = df[['column_1', 'column_2']].apply(lambda column: column.value_counts())
    counts = counts.fillna(0)
    counts.T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
Here is an example with three columns c1, c2, c3 and a data frame d, which is defined before the function is invoked.
import pandas as pd
import collections

def wcsv(d):
    # count the occurrences of each value in every column
    dc = [dict(collections.Counter(d[i])) for i in d.columns]
    for i in dc:
        if -1 not in list(i.keys()):
            i[-1] = 0
        if 1 not in list(i.keys()):
            i[1] = 0
    # pull the counts out in a fixed (1, -1) order so they line up with the column labels
    w = pd.DataFrame([[j[1], j[-1]] for j in dc],
                     columns=['1', '-1'], index=['c1', 'c2', 'c3'])
    w.to_csv("t.csv")

d = pd.DataFrame([[1, 1, -1], [-1, 1, 1], [1, 1, -1], [1, 1, -1]],
                 columns=['c1', 'c2', 'c3'])
wcsv(d)

Selecting values from the same dataframe but for different years

I have a data frame of some features and corresponding years; each value of a feature is listed for different years. I need to compare the value for a specific year with that of 7 years earlier. So basically I need to define a function that will generate two columns: one will give me the value of the feature for a specific year, and the other the value of the same feature 7 years earlier. How can I do that?
feature year
value1 2001
value1 2008
vlaue2 1996
etc
e.g. I want to compare value1(2008) with value1(2008 - 7), etc.
There should also be some conditional statements, since for example year 2000 can't be compared with 2000 - 7 = 1993 if there is no value for the feature in 1993.
Here's a quick solution based on what I understand from your question:
import numpy as np
import pandas as pd

data = {'feature': ['A', 'B', 'C', 'A'],
        'value': [1, 10, 3, 50],
        'year': [2001, 2002, 2003, 2008]}
df = pd.DataFrame(data)

def compFeature(df, f, y):
    # only compare if a row exists for the same feature 7 years earlier
    if not df[(df.feature == f) & (df.year == (y - 7))].empty:
        now = df[(df.feature == f) & (df.year == y)].value.values
        old = df[(df.feature == f) & (df.year == (y - 7))].value.values
        result = np.subtract(now, old)
    else:
        result = np.nan
    return result
This is just to get you started.
With the minimum information you have given, this can be used as a solution:
Let's create a function to get data for both years, if available.
def compare(x):
    f1 = df.loc[df['year'] == x, 'feature'].values[0]
    y2 = x - 7
    if y2 in df['year'].unique():
        f2 = df.loc[df['year'] == y2, 'feature'].values[0]
        return (x, f1, y2, f2)
    else:
        pass
Apply the function to the year column and assign the result to a new variable:
foo = df['year'].apply(compare)
Create a dataframe of non-null values in foo:
bar = pd.DataFrame(data=list(foo.loc[~foo.isnull()]),
                   columns=['y1', 'f1', 'y2', 'f2'])  # matches the (year, feature, year-7, feature) order returned above
This will result in four columns for easy comparison. I understand you were looking for a two-column solution, but a four-column result with the comparative data next to each other makes sense for later use as well.

Combining dataframes in Python to a dictionary using one of the dataframes as key

I have 3 dataframes containing daily data: unique codes, names, and scores. The first value in row 1 is called Rank and the rest of the header holds dates; the first column under Rank contains the rank number (that first column is used as the index).
**df1** UNIQUE CODES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 Code_1 Code_3 Code_4
2 Code_2 Code_1 Code_2
...
1000 Code_5 Code_6 Code_7
**df2** NAMES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 Jon Maria Peter
2 Brian Jon Maria
...
1000 Chris Tim Charles
**df3** SCORES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 10 20 30
2 15 10 40
...
1000 25 15 20
Desired output:
I want to combine these dataframes into a dictionary, using the df1 codes as keys, so it will look like this:
dictionary = {'Code_1':[Jon, 20] , 'Code_2':[Brian, 15]}
As there are repeat competitors, I will need to sum their scores across the whole data series. So in the above example, the score stored for Jon will contain his scores for both 12/8/2017 and 12/9/2017.
There are 1000 rows and 26 columns + index, so I need a way to capture all of those. I think a nested loop could work here, but I don't have enough experience to build one that works.
In the end, I would like to sort the dictionary by highest score. Please suggest any solutions to this, or more straightforward ways to combine this data and get the score ranking.
I attached pictures of the dataframes containing the names, codes, and scores.
I used the proposed solution below on the 3 dataframes that I have. Please note that hashtags stands for the codes, players for the names, and trophies for the scores:
# reshape to get dates into rows
hashtags_reshaped = pd.melt(hashtags, id_vars=['Rank'],
                            value_vars=hashtags.columns,
                            var_name='Date',
                            value_name='Code').drop('Rank', axis=1)

# reshape to get dates into rows
players_reshaped = pd.melt(players, id_vars=['Rank'],
                           value_vars=hashtags.columns,
                           var_name='Date',
                           value_name='Name').drop('Rank', axis=1)

# reshape to get the dates into rows
trophies_reshaped = pd.melt(trophies, id_vars=['Rank'],
                            value_vars=hashtags.columns,
                            var_name='Date',
                            value_name='Score').drop('Rank', axis=1)

# merge the three together.
# This _assumes_ that the dfs are all in the same order and that all the data matches up.
merged_df = pd.DataFrame([hashtags_reshaped['Date'],
                          hashtags_reshaped['Code'], players_reshaped['Name'],
                          trophies_reshaped['Score']]).T
print(merged_df)

# group by code, name, and date; sum the scores together if multiple exist
# for a given code-name-date grouping
grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending=False)
print(grouped_df)

summed_df = merged_df.drop('Date', axis=1) \
                     .groupby(['Code', 'Name']).sum() \
                     .sort_values('Score', ascending=False).reset_index()
summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
print(summed_df)
But I'm getting strange output: the summed scores should be in the hundreds or low thousands (an average score is 200-300 and an average participation frequency is 4-6 times). The score results I'm getting are way off, but the codes and names are matched correctly.
summed_df:
0 (MandiBralaX, 996871590076253)
1 (Arso_C, 9955130513430)
2 (ThatRainbowGuy, 9946)
3 (fabi, 9940)
4 (Dogão, 991917)
5 (Hierbo, 99168)
6 (Clyde, 9916156180128)
7 (.A.R.M.I.N., 9916014310187143)
8 (keftedokofths, 9900)
9 (⚽AngelSosa⚽, 990)
10 (Totoo98, 99)
group_df:
Code Name Score \
0 #JL2J02LY MandiBralaX 996871590076253
1 #80JQ90VC Arso_C 9955130513430
2 #9GGC2CUQ ThatRainbowGuy 9946
3 #8LL989QV fabi 9940
4 #9PPC89L Dogão 991917
5 #2JPLQ8JP8 Hierbo 99168
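One possible explanation for those inflated numbers (an assumption on my part, not something confirmed in this thread) is that the trophy values were carried along as strings, in which case .sum() concatenates them instead of adding. Forcing the Score column to a numeric dtype before the groupby/sum would rule that out:
# hypothetical fix: convert Score to numbers before grouping; non-numeric entries become NaN
merged_df['Score'] = pd.to_numeric(merged_df['Score'], errors='coerce')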
This should get you much of the way there. I didn't create a dictionary at the end as you specified; while you may need that format, you'd end up with nested dictionaries or lists, as each Code has one Name but possibly many Dates and Scores associated with it. How do you want those recorded: list, dict, etc.?
The code below returns a grouped dataframe; you can output it directly to a dict (shown), but you'll probably want to specify the format in detail, especially if you need an ordered dictionary. (Dictionaries are inherently not ordered; you'll have to from collections import OrderedDict and review that documentation if you really need an ordered dictionary.)
import pandas as pd

# create the dfs; note that 'Code' is set up as a string
df1 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['1', '2'], '12/9/2017': ['3', '1']})
df1.set_index('Rank', inplace=True)

# reshape to get dates into rows
df1_reshaped = pd.melt(df1, id_vars=['Rank'],
                       value_vars=df1.columns,
                       var_name='Date',
                       value_name='Code').drop('Rank', axis=1)
#print(df1_reshaped)

# create the second df
df2 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['Name_1', 'Name_2'], '12/9/2017': ['Name_3', 'Name_1']})
df2.set_index('Rank', inplace=True)

# reshape to get dates into rows
df2_reshaped = pd.melt(df2, id_vars=['Rank'],
                       value_vars=df1.columns,
                       var_name='Date',
                       value_name='Name').drop('Rank', axis=1)
#print(df2_reshaped)

# create the third df
df3 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['10', '20'], '12/9/2017': ['30', '10']})
df3.set_index('Rank', inplace=True)

# reshape to get the dates into rows
df3_reshaped = pd.melt(df3, id_vars=['Rank'],
                       value_vars=df1.columns,
                       var_name='Date',
                       value_name='Score').drop('Rank', axis=1)
#print(df3_reshaped)

# merge the three together.
# This _assumes_ that the dfs are all in the same order and that all the data matches up.
merged_df = pd.DataFrame([df1_reshaped['Date'], df1_reshaped['Code'],
                          df2_reshaped['Name'], df3_reshaped['Score']]).T
print(merged_df)

# group by code, name, and date; sum the scores together if multiple exist
# for a given code-name-date grouping
grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending=False)
print(grouped_df)

summed_df = merged_df.drop('Date', axis=1) \
                     .groupby(['Code', 'Name']).sum() \
                     .sort_values('Score', ascending=False).reset_index()
summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
print(summed_df)
Unsorted dict:
d = dict(zip(summed_df.Code, summed_df.li))
print(d)
You can make the OrderedDict directly, of course, and should:
from collections import OrderedDict
d2 = OrderedDict(zip(summed_df.Code, summed_df.li))
print(d2)
summed_df:
Code Name Score li
0 3 Name_3 30 (Name_3, 30)
1 1 Name_1 20 (Name_1, 20)
2 2 Name_2 20 (Name_2, 20)
d:
{'3': ('Name_3', 30), '1': ('Name_1', 20), '2': ('Name_2', 20)}
d2, sorted:
OrderedDict([('3', ('Name_3', 30)), ('1', ('Name_1', 20)), ('2', ('Name_2', 20))])
This returns your (name, score) as a tuple, not a list, but... it should get more of the way there.
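If you end up with the plain dict d instead and still want it ordered by highest score, one sketch (using the (name, score) tuples stored as values) is to sort the items by the score element before building the OrderedDict:
from collections import OrderedDict
d_sorted = OrderedDict(sorted(d.items(), key=lambda kv: kv[1][1], reverse=True))
print(d_sorted)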
