grouping and reordering by partial identifiers in python

I have data from a csv that produces a dataframe that looks like the following:
d = {"clf_2007": [20],
"e_2007": [25],
"ue_2007": [17],
"clf_2008": [300],
"e_2008": [20],
"ue_2008": [10]}
df = pd.DataFrame(d)
which produces a data frame (forgive me for not knowing how to properly code that into stackoverflow)
clf_2007 clf_2008 e_2007 e_2008 ue_2007 ue_2008
0 20 300 25 20 17 10
I want to manipulate that data to produce something that looks like this:
clf e ue
2007 20 25 17
2008 300 20 10
2007 and 2008 in the original column names represent dates, but they don't need to be datetime now. I need to merge them with another dataframe that has the same "dates" eventually, but I can figure that out later.
Thus far, I've tried groupby, including grouping by string slices of the column names (like str[:8]), and, apart from it not working, I don't even think groupby is the right tool. I've also tried pd.PeriodIndex, but, again, that doesn't seem like the right tool either.
Is there a standardized way to do something like this? Or is the brute force way (get it into an excel spreadsheet and just move the data around manually), the only way to get what I'm looking for here?

I think this will be a lot easier if you pre-process your data to have three columns: key, year and value. Something like:
rows = []
for k, v in d.items():  # d.iteritems() in Python 2
    key, year = k.split("_")
    for val in v:
        rows.append({'key': key, 'year': year, 'value': val})
Put those rows into a dataframe, call it dfA. I'm assuming you might have more than one value for each (key, year) pair and you want to aggregate them somehow. I'll assume you do that and end up with a dataframe called df, whose columns are still key, year, and value. At that point, you just need to pivot:
pd.pivot_table(df, index=['year'], columns=['key'])
You end up with multi-indexed rows/columns that you'll want to clean up, but I'll leave that to you.
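For completeness, here is a minimal end-to-end sketch of that idea, using the d from the question and mean() as a placeholder aggregation (swap in whatever aggregation you actually need):
import pandas as pd
rows = []
for k, v in d.items():
    key, year = k.split("_")
    for val in v:
        rows.append({'key': key, 'year': year, 'value': val})
dfA = pd.DataFrame(rows)
# Aggregate any duplicate (key, year) pairs; mean() is only a placeholder here.
agg = dfA.groupby(['key', 'year'], as_index=False)['value'].mean()
result = pd.pivot_table(agg, index=['year'], columns=['key'], values='value')
print(result)
# key     clf     e    ue
# year
# 2007   20.0  25.0  17.0
# 2008  300.0  20.0  10.0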

You can generate a column multiindex:
df.columns = pd.MultiIndex.from_tuples([col.split("_") for col in df])
print(df)
#   clf        e       ue
#  2007 2008 2007 2008 2007 2008
# 0  20  300   25   20   17   10
And then stack the table:
df = df.stack()
print(df)
# clf e ue
#0 2007 20 25 17
# 2008 300 20 10
You can optionally flatten the index, too:
df.index = df.index.get_level_values(1)
print(df)
# clf e ue
#2007 20 25 17
#2008 300 20 10
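As yet another option, pandas has a built-in helper for exactly this wide-to-long reshaping, pd.wide_to_long. A sketch, starting again from the original wide df and assuming the column names always follow the name_year pattern:
long_df = pd.wide_to_long(df.reset_index(), stubnames=['clf', 'e', 'ue'],
                          i='index', j='year', sep='_')
long_df = long_df.droplevel('index')
print(long_df)
#       clf   e  ue
# year
# 2007   20  25  17
# 2008  300  20  10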

Create a dataframe from multiple list of dictionary values

I have the following code:
safety_df = {}
for key3, safety in analy_df.items():
    safety = pd.DataFrame({"Year": safety['index'],
                           '{}'.format(key3) + "_CR": safety['CURRENT'],
                           '{}'.format(key3) + "_ICR": safety['ICR'],
                           '{}'.format(key3) + "_D/E": safety['D/E'],
                           '{}'.format(key3) + "_D/A": safety['D/A']})
    safety_df[key3] = safety
In this code I'm extracting values from another dictionary. It loops through the various companies, which is why I build the column names with format and the company key. The output contains the above 5 columns for each company (Year, CR, ICR, D/E, D/A).
The output that gets printed has plenty of NA values.
I want a common Year column for all companies, with the other columns printed side by side, i.e. C1_CR, C2_CR, C3_CR, C1_ICR, C2_ICR, C3_ICR, ... C3_D/A.
I tried to extract using following code,
pd.concat(safety_df.values())
Sample output of this..
This extracts the values for each company, but NA values still get printed; is that because of the for loop?
I also tried groupby, but that didn't work either.
How can I set Year as the common column and show the other values side by side?
Thanks
Use axis=1 to concatenate along the columns:
import numpy as np
import pandas as pd
years = np.arange(2010, 2021)
n = len(years)
c1 = np.random.rand(n)
c2 = np.random.rand(n)
c3 = np.random.rand(n)
frames = {
'a': pd.DataFrame({'year': years, 'c1': c1}),
'b': pd.DataFrame({'year': years, 'c2': c2}),
'c': pd.DataFrame({'year': years[1:], 'c3': c3[1:]}),
}
for key in frames:
    frames[key].set_index('year', inplace=True)
df = pd.concat(frames.values(), axis=1)
print(df)
which results in
c1 c2 c3
year
2010 0.956494 0.667499 NaN
2011 0.945344 0.578535 0.780039
2012 0.262117 0.080678 0.084415
2013 0.458592 0.390832 0.310181
2014 0.094028 0.843971 0.886331
2015 0.774905 0.192438 0.883722
2016 0.254918 0.095353 0.774190
2017 0.724667 0.397913 0.650906
2018 0.277498 0.531180 0.091791
2019 0.238076 0.917023 0.387511
2020 0.677015 0.159720 0.063264
Note that I have explicitly set the index to be the 'year' column, and in my example, I have removed the first year from the 'c' column. This is to show how the indices of the different dataframes are matched when concatenating. Had the index been left to its standard value, you would have gotten the years out of sync and a NaN value at the bottom of column 'c' instead.
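Applied to the safety_df dictionary from the question, the same idea would look roughly like this (a sketch, assuming each per-company frame contains the Year column built above):
# Make Year the index of each company's frame, then concatenate side by side.
for key3 in safety_df:
    safety_df[key3] = safety_df[key3].set_index('Year')
combined = pd.concat(safety_df.values(), axis=1)
print(combined)  # one row per Year, with columns like C1_CR, C1_ICR, ..., C3_D/A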

Extract numbers from string column from Pandas DF

I have the following DataFrame with a string column ("Info"):
import pandas as pd
df = pd.DataFrame({'Date': ["2014/02/02", "2014/02/03"],
                   'Info': ["Out of 78 shares traded during the session today, there were 54 increases, 9 without change and 15 decreases.",
                            "Out of 76 shares traded during the session today, there were 60 increases, 4 without change and 12 decreases."]})
I need to extract the numbers from "Info" into 4 new columns in the same df.
The first row will have the values [78, 54, 9, 15].
I have tried
df[["new1","new2","new3","new4"]] = df.Info.str.extract(r'(\d+(?:\.\d+)?)', expand=True).astype(int)
but I think that is more complicated than it needs to be.
regards,
Just so I understand, you're trying to avoid capturing decimal parts of numbers, right? (The (?:\.\d+)? part.)
First off, you need to use pd.Series.str.extractall if you want all the matches; extract stops after the first.
Using your df, try this code:
# Get a multiindexed dataframe using extractall
expanded = df.Info.str.extractall(r"(\d+(?:\.\d+)?)")
# Pivot the index labels
df_2 = expanded.unstack()
# Drop the multiindex
df_2.columns = df_2.columns.droplevel()
# Add the columns to the original dataframe (inplace or make a new df)
df_combined = pd.concat([df, df_2], axis=1)
Extractall might be better for this task
df[["new1","new2","new3","new4"]] = df['Info'].str.extractall(r'(\d+)')[0].unstack()
Date Info new1 new2 new3 new4
0 2014/02/02 Out of 78 shares traded during the session tod... 78 54 9 15
1 2014/02/03 Out of 76 shares traded during the session tod... 76 60 4 12
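Note that extractall returns strings; if you need integers (as with the astype(int) in the question), convert the result before assigning, e.g.:
df[["new1", "new2", "new3", "new4"]] = df['Info'].str.extractall(r'(\d+)')[0].unstack().astype(int)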

Merging converted dataframes from multiple series

I receive some data in 11 different pandas series. I need to combine the whole data into one pandas dataframe to carry out further analysis and reporting.
The format in which the data is received is as under:
Series1:
Sales
Item Series Year
A Sal 2018 100
2019 200
B Sal 2018 300
2019 400
Series2:
Purchases
Item Series Year
A Pur 2018 50
2019 100
B Pur 2018 150
2019 200
Series3:
Expenses
Product Series Year
A Exp 2019 100
B Exp 2019 200
The number of series is a parameter, so I created a loop where the following code merges two series at a time until all the series are merged. I have tried to consolidate all such series into one dataframe using this code:
df = pd.merge(df,series1,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
Even if we write separate lines for each pair in this example, it would be:
df = pd.merge(series1,series2,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
df = pd.merge(df,series3,left_on=['Product','Year'],right_on=['Product','Year']).reset_index()
However, the issues with this are:
It only allows merging two series at a time.
When I merge the third series in this example, since it has no data for 2018, instead of putting NULL there it removes the 2018 rows even for the series 1 and series 2 data in the dataframe (pd.merge defaults to an inner join). So I am only left with merged data from all three series for 2019.
I considered converting each series to a list and then those lists to a dictionary, which is then converted into a dataframe. That works, but requires a lot of effort and code changes whenever the number of series changes. So this doesn't work for me.
Any other way to do this?
Did you try using the to_frame method?
For example, you could use
s = pd.Series(["a", "b", "c"])
s.to_frame()
to convert.
Try using this method in your data frame.
Here's it in the docs.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html
Try pd.concat():
import pandas as pd
s1 = pd.Series([100, 200, 300, 400], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['1','1','2','2'], [2018, 2019, 2018, 2019]]))
s2 = pd.Series([50, 100, 150, 200], index = pd.MultiIndex.from_arrays([['A','A','B','B'],['3','3','4','4'], [2018, 2019, 2018, 2019]]))
s3 = pd.Series([100, 200], index = pd.MultiIndex.from_arrays([['A','B'],['5','6'], [2019, 2019]]))
df = pd.concat([s.droplevel(1) for s in [s1, s2, s3]], axis = 1)
0 1 2
A 2018 100 50 NaN
2019 200 100 100.0
B 2018 300 150 NaN
2019 400 200 200.0
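If you want readable column names instead of 0, 1, 2, you can also pass keys to pd.concat (the names here are simply taken from the question's series):
df = pd.concat([s.droplevel(1) for s in [s1, s2, s3]], axis=1,
               keys=['Sales', 'Purchases', 'Expenses'])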

Convert dictionaries to dataframe

I am trying to convert this dictionary:
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
data
to dataframe like:
date balance
Jan 2018 1000
Feb 2018 1100
Mar 2018 1400
Apr 2018 700
May 2018 800
I used pd.DataFrame to convert it, but it didn't give the format above. How can I do it? Thank you!
pd.DataFrame.from_dict(data_c, orient='columns')
Here is my solution:
import pandas as pd
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
arr = [list(*d.items()) for d in data]
df = pd.DataFrame(arr, columns=['date', 'balance'])
You need to get a proper array out of the tuple of dictionaries before passing it to DataFrame.
Try this
df = pd.DataFrame.from_dict({k: v for d in data for k, v in d.items()},
                            orient='index',
                            columns=['balance']).rename_axis('date').reset_index()
Out[477]:
date balance
0 Jan 2018 1000
1 Feb 2018 1100
2 Mar 2018 1400
3 Apr 2018 700
4 May 2018 800
From the documentation of from_dict
orient : {‘columns’, ‘index’}, default ‘columns’
The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’.
Since you want your keys to indicate rows, changing the orient to index will give the result your want. However first you need to put your data in a single dictionary. This code will give you the result you want.
data = ({"Jan 2018":1000},{"Feb 2018":1100},{"Mar 2018":1400},{"Apr 2018":700},{"May 2018":800})
d = {}
for i in data:
for k in i.keys():
d[k] = i[k]
df = pd.DataFrame.from_dict(d, orient='index')
What you have there is a tuple of single-element dictionaries. This is unidiomatic, and poor design. If all the dictionaries correspond to the same columns, then a list of tuples would do just fine.
Solutions
I believe the currently accepted answer relies on there being only one key:value pair in each dictionary. That’s unfortunate, since it automatically excludes most situations where this design makes any sense.
If, hypothetically, the "tuple of 1-element dicts" couldn't be changed, here is how I would suggest doing things:
import pandas as pd
import itertools as itt
raw_data = ({"Jan 2018": 1000}, {"Feb 2018": 1100}, {"Mar 2018": 1400}, {"Apr 2018": 700}, {"May 2018": 800})
data = itt.chain.from_iterable(curr.items() for curr in raw_data)
df = pd.DataFrame(data, columns=['date', 'balance'])
Here is the sensible alternative to all this.
import pandas as pd
data = [("Jan 2018", 1000), ("Feb 2018", 1100), ("Mar 2018", 1400), ("Apr 2018", 700), ("May 2018", 800)]
df = pd.DataFrame(data, columns=['date', 'balance'])
df:
date balance
0 Jan 2018 1000
1 Feb 2018 1100
2 Mar 2018 1400
3 Apr 2018 700
4 May 2018 800
It would probably be even better if those dates were actual date types, not strings. I will change that later.
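For example, a small sketch of that conversion, assuming the 'Mon YYYY' strings shown above:
# Parse "Jan 2018"-style strings into proper timestamps.
df['date'] = pd.to_datetime(df['date'], format='%b %Y')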

Python processing CSV file really slow

So I am trying to open a CSV file, read its fields, fix some other fields based on them, and then save the data back to CSV. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of
ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR
I am counting how often a row with the same date appears in my records and then updating SPECIAL_ID based on that count.
Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB); this one is around 119 MB, so it is crucial that I find a good, fast solution.
My code goes as follows:
df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # when I process a row in df I append it to df_fixed
d = 31
m = 12
y = 100
s = (y, m, d)
list_dates = np.zeros(s)  # 3-dimensional array
for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY][row.MONTH][row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY][row.MONTH][row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)
df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')
I added a print for every thousand rows processed. At first, my script needs 3 seconds per 1000 rows, but the longer it runs, the slower it gets;
at row 43000 it needs 29 seconds, and so on...
Thanks for all future help :)
EDIT:
I am adding additional information about my CSV and expected output.
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011
I have to replace the trailing __- in the SPECIAL_ID field with a proper sequence number.
For example, for the row with ID = 2 the SPECIAL_ID will be 13012016505001 (__- got replaced by 001); if someone else in the CSV shares the same DAY, MONTH and YEAR, their __- will be replaced by 002, and so on...
So the expected output for the above rows would be:
ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011
EDIT:
I changed my code to something like this: I fill a list of dicts with data, then convert that list to a dataframe and save it as CSV. This takes around 30 minutes to complete:
list_popravljeni = []
df = pd.read_csv(filename, delimiter=';')
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()
for index, row in df_dates.iterrows():
    df_candidates = df.loc[(df['dan_roj'] == row['dan_roj']) &
                           (df['mesec_roj'] == row['mesec_roj']) &
                           (df['leto_roj'] == row['leto_roj'])]
    for index, row in df_candidates.iterrows():
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(index).zfill(2)
        vrstica['day'] = row['day']
        vrstica['MONTH'] = row['MONTH']
        vrstica['YEAR'] = row['YEAR']
        list_popravljeni.append(vrstica)
pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())
I think this gives what you're looking for and avoids looping. Potentially it could be more efficient (I wasn't able to find a way to avoid creating counts). However, it should be much faster than your current approach.
df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
I added a fake record at the end to confirm it does increment properly:
SPECIAL_ID sex age zone key day month year counts
0 13012016505001 F 1 1001001 1001001_F_1 13 1 2016 001
1 25122013505001 F 4 1001001 1001001_F_4 25 12 2013 001
2 24022012505001 F 5 1001001 1001001_F_5 24 2 2012 001
3 09032012505001 F 5 1001001 1001001_F_5 9 3 2012 001
4 21082011505001 F 6 1001001 1001001_F_6 21 8 2011 001
5 16082011505001 F 6 1001001 1001001_F_6 16 8 2011 001
6 21102011505002 F 6 1001001 1001001_F_6 16 8 2011 002
7 21102012505003 F 6 1001001 1001001_F_6 16 8 2011 003
If you want to get rid of counts, you just need:
df.drop('counts', inplace=True, axis=1)
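Putting this together with the file I/O from the question, a minimal end-to-end sketch (assuming the ';'-separated columns shown in the edit and that SPECIAL_ID always ends in the 3-character __- placeholder):
import pandas as pd
df = pd.read_csv(filename, delimiter=';', dtype={'SPECIAL_ID': str})
# Vectorised per-date counter instead of row-by-row iteration.
counts = df.groupby(['year', 'month', 'day']).cumcount() + 1
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(counts.astype(str).str.zfill(3))
df.to_csv(filename_fixed, sep=';', index=False, encoding='utf-8')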
