I am trying to add and update multiple columns in a pandas dataframe using a second dataframe. The problem is that when the number of columns I want to add doesn't match the number of columns in the base dataframe, I get the following error: "Shape of passed values is (2, 3), indices imply (2, 2)"
A simplified version of the problem is below
from pandas import DataFrame

tst = DataFrame({"One": [1, 2], "Two": [2, 4]})

def square(row):
    """
    for each row in the table return multiple calculated values
    """
    a = row["One"]
    b = row["Two"]
    return a ** 2, b ** 2, b ** 3

# create three new fields from the data
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
If the number of fields being added matches the number already in the table, the operation works as expected.
tst = DataFrame({"One": [1, 2], "Two": [2, 4]})

def square(row):
    """
    for each row in the table return multiple calculated values
    """
    a = row["One"]
    b = row["Two"]
    return a ** 2, b ** 2

# create two new fields from the data
tst[["One^2", "Two^2"]] = tst.apply(square, axis=1)
I realise I could do each field separately, but in the actual problem I am trying to solve I perform a join between the table being updated and an external table within the "updater" (i.e. square), and I want to be able to grab all the required information at once.
Below is how I would do it in SQL. Unfortunately the two dataframes contain data from different database technologies, which is why I have to perform the operation in pandas.
update tu
set tu.a_field = upd.the_field_i_want,
    tu.another_field = upd.the_second_required_field
from to_update tu
inner join the_updater upd
    on tu.item_id = upd.item_id
    and tu.date between upd.date_from and upd.date_to
Here you can see the exact details of what I am trying to do. I have a table "to_update" that contains point-in-time information against an item_id. The other table "the_updater" contains date range information against the item_id. For example a particular item_id may sit with customer_1 from DateA to DateB and with customer_2 between DateB and DateC etc. I want to be able to align information from the table containing the date ranges against the point-in-time table.
Please note a merge won't work due to problems with the data (this is actually being written as part of a data quality test). I really need to be able to replicate the functionality of the update statement above.
I could obviously do it as a loop but I was hoping to use the pandas framework where possible.
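For reference, a rough pandas sketch of what that SQL update-join could look like; this is only a sketch, the column names (item_id, date, date_from, date_to, a_field, the_field_i_want, another_field, the_second_required_field) are taken from the SQL above, and the merge-then-filter-then-assign pattern is one possible way to emulate UPDATE ... FROM, not necessarily the only one:

merged = to_update.reset_index().merge(the_updater, on="item_id")
# keep only the rows whose point-in-time date falls inside the range
in_range = merged[(merged["date"] >= merged["date_from"]) & (merged["date"] <= merged["date_to"])]
# write the matched values back into the original rows via the saved index
to_update.loc[in_range["index"], "a_field"] = in_range["the_field_i_want"].values
to_update.loc[in_range["index"], "another_field"] = in_range["the_second_required_field"].values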
Declare an empty column in the dataframe and assign zero to it:
tst["Two^3"] = 0
Then do the respective operations for that column, along with other columns
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1)
Try printing it
print(tst.head(5))
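As a side note, in newer pandas versions (0.23 and later) the shape mismatch can also be avoided by asking apply to expand the returned tuples into columns; a minimal sketch, reusing the three-value square from the question:

# result_type="expand" turns each returned tuple into a set of columns,
# so the number of new fields no longer has to match the existing table
tst[["One^2", "Two^2", "Two^3"]] = tst.apply(square, axis=1, result_type="expand")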
I am trying to write Python code to match two tables, puit_1 and puit_2, in order to retrieve the identifier id from table puit_2 by creating a new column id in table puit_1, whenever the name in table puit_1 matches the name in table puit_2.
Below is the loop:
puits_sig['id_credo'] = pd.Series([])
for j in range(len(puits_credo['noms'])):
    id_pcredo = []
    for i in range(len(puits_sig['sigle_puits'])):
        if puits_credo['noms'][j] == puits_credo['noms'][j]:
            if str(puits_sig['sigle_puits'][i]) in puits_credo['noms'][j]:
                id_pcredo.append(puits_credo['ID_Objet'][j])
    print(id_pcredo)
    puits_sig['id_credo'] = id_pcredo
ValueError: length of values 1 does not match length of index 1212
This code gives me an error that I couldn't solve (for information, I'm a beginner in programming).
Any help? Below are some extracts from the tables.
Extract from table 1
sigle_puit,categorie
BNY2D,ACTIF
BRM2,INACTIF
Extract from table 2
Nom,ID Object
BLY23,89231
BRM1,12175
What I believe you're looking for is an inner-merge Pandas operation. This creates a new dataframe keyed on the 'name' column that puit_1 and puit_2 have in common, carrying along all the other columns from both dataframes.
merged_df = puit_1.merge(puit_2, how='inner', on=['name'])
See this question for more:
Merge DataFrames with Matching Values From Two Different Columns - Pandas
EDIT: If you want to keep rows which have a name which only exists in one dataframe, you might want to use how='outer' in your merge operation
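Since the two extracts above don't actually share a column name (sigle_puit in one table, Nom in the other), a sketch of the same merge with left_on/right_on; the column names are taken from the extracts and may need adjusting to the real tables:

# merge on differently named key columns; swap how='inner' for how='outer'
# to keep names that only exist in one of the two tables
merged_df = puit_1.merge(puit_2, how='inner', left_on='sigle_puit', right_on='Nom')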
I want to put the std and mean of a specific column of a dataframe for different days in a new dataframe. (The data comes from analyses conducted on big data in multiple Excel files.)
I use a for-loop and append(), but it returns only the last row, not the whole set.
Here is my code:
hh = ['01:00', '02:00', '03:00', '04:00', '05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  # it works correctly, reads an individual Excel spreadsheet
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    s_td = data.iloc[:, 4].std()
    meean = data.iloc[:, 4].mean()
    final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
    final.append({'Month': j, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
I am not sure, but I believe you should create final once before the loop and assign the result of final.append(...) back to a variable:
final = final.append({'Month': month, 'Hour': j, 'standard deviation': s_td, 'average': meean}, ignore_index=True)
Update
If time efficiency is of interest to you, it is suggested to collect your desired values ({'Month': month, 'Hour': j, 'standard deviation': s_td, 'average': meean}) in a list and assign that list to the dataframe in one go, which is said to perform better. (Thanks to @stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh = ['01:00', '02:00', '03:00', '04:00', '05:00']
lister = []
final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour)  # it works correctly
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    s_td = data.iloc[:, 4].std()
    meean = data.iloc[:, 4].mean()
    lister.append({'Month': month, 'Hour': j, 'standard deviation': s_td, 'average': meean})
final = final.append(pd.DataFrame(lister), ignore_index=True)
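Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. Since lister already holds one dict per row, a sketch of the append-free equivalent:

# build the result frame directly from the list of row dicts
final = pd.DataFrame(lister, columns=['Month', 'Hour', 'standard deviation', 'average'])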
Conceptually you're just aggregating by hour with the two functions std and mean, then appending the result to your output dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note that the .agg/.aggregate() function accepts a dict mapping a column name to one or more aggregating functions, so there is no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then there is no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month', 'Hour', 'standard deviation', 'average'])
for hour in hh:
    # Read in the columns of interest from the individual Excel sheet for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data, columns=['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)'])
    # Compute the corresponding row of the aggregate...
    agg = data['Total Load (MWh)'].agg(['std', 'mean'])
    row = {'Month': 1, 'Hour': hour, 'standard deviation': agg['std'], 'average': agg['mean']}
    final = final.append(row, ignore_index=True)
Notes:
pd.read_excel's usecols=['Flowday','Interval',...] argument allows you to avoid reading in columns that you aren't interested in in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass the list of columns of interest; see the sketch after these notes. You seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store separate local variables s_td and meean; just use .aggregate() directly.
There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you run into issues with that, post updated code here and make sure it's reproducible.)
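A minimal sketch of that usecols idea (the filename here is hypothetical):

# read only the column we actually aggregate, skipping the rest of the sheet
data = pd.read_excel('demand_january.xlsx', usecols=['Total Load (MWh)'])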
So I have two different tables right now. These tables contain a series of information including one column being a specific date.
Example:
[Table 1]
Unique Identifier (Primary Key) / Date / Piece of Information
0001 / December 1, 2020 / Apples
[Table 2]
Unique Identifier (Primary Key) / Date / Piece of Information
0001 / December 5, 2020 / Oranges
I am trying to compare the two tables: if the second table has a date that is AFTER the first table's date (for the same unique identifier), I would like to write that row to a new table. There are a lot of rows in this table, and I need to keep going through them. However, I can't seem to get this to work. This is what I am doing:
import pandas as pd
from pyspark.sql.functions import desc
from pyspark.sql import functions as F
def fluvoxamine_covid_hospital_progression(fluvoxamine_covids_that_were_also_in_hospital_at_any_time, fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk):
    df_fluvoxamine_covid_outpatients = pd.DataFrame(fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk)
    df_fluvoxamine_covid_outpatients.dropDuplicates(['visit_occurrence_id'])
    df_fluvoxamine_covid_outpatients.sort(desc('visit_start_date'))
    df_fluvoxamine_converted_hospital = pd.DataFrame(fluvoxamine_covids_that_were_also_in_hospital_at_any_time)
    df_fluvoxamine_converted_hospital.dropDuplicates(['visit_occurrence_id'])
    df_fluvoxamine_converted_hospital.sort(desc('visit_start_date'))
    i = 0
    if df_fluvoxamine_covid_outpatients.sort('visit_start_date') < df_fluvoxamine_converted_hospital.sort('visit_start_date'):
        i = i + 1
Try to break it down into steps. I renamed your variables for readability.
# renamed variables
converted = fluvoxamine_covids_that_were_also_in_hospital_at_any_time
outpatients = fluvoxamine_for_hospitalization_analysis_only_outpatients_need_to_dbl_chk
For the first step, keep the first few lines of code much as you wrote them, but use the pandas drop_duplicates method (dropDuplicates is Spark syntax) and assign the result back:
# Load and clean the data
covid_outpatients = pd.DataFrame(outpatients)
converted_hospital = pd.DataFrame(converted)
covid_outpatients = covid_outpatients.drop_duplicates(['visit_occurrence_id'])
converted_hospital = converted_hospital.drop_duplicates(['visit_occurrence_id'])
Next, join the data using the unique identifier column.
all_data = covid_outpatients.set_index('Unique Identifier (Primary Key)').join(converted_hospital.set_index('Unique Identifier (Primary Key)'), lsuffix='_outpatients', rsuffix='_converted')
Reset the index with the unique identifier column.
all_data['Unique Identifier (Primary Key)'] = all_data.index
all_data.reset_index(drop=True, inplace=True)
Filter using a mask based on the date comparison. A mask is a series of boolean values with the same length as the DataFrame. In this case, the mask is True where the outpatients date is less than the converted date; otherwise, the value in the series is False.
filtered_data = all_data[all_data['visit_start_date_outpatients'] < all_data['visit_start_date_converted']]
Note, if your data is not in a date format, it might need to be converted or cast for the mask to work properly.
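A sketch of that cast with pd.to_datetime, assuming the suffixed column names from the join above:

# convert the string dates so the comparison is chronological, not lexicographic
all_data['visit_start_date_outpatients'] = pd.to_datetime(all_data['visit_start_date_outpatients'])
all_data['visit_start_date_converted'] = pd.to_datetime(all_data['visit_start_date_converted'])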
Lastly, save the output as a comma-separated values (CSV) file.
# Generate an output; for example, it's easy to save it as a CSV file.
filtered_data.to_csv('outpatient_dates_less_than_converted_dates.csv')
In addition to the official documentation for pandas dataframes, the website https://towardsdatascience.com/ has many good tips. I hope this helps!
I have a csv file imported to a pandas dataframe. It probably came from a database export that combined a one-to-many parent and detail table. The format of the csv file is as follows:
header1, header2, header3, header4, header5, header6
sample1, property1,,,average1,average2
,,detail1,detail2,,
,,detail1,detail2,,
,,detail1,detail2,,
sample2, ...
,,detail1,detail2,,
,,detail1,detail2,,
...
(i.e. line 0 is the header, line 1 is record 1, lines 2 through n are details, line n+1 is record 2 and so on...)
What is the best way to extricate (renormalize?) the details into separate DataFrames that can be referenced using values in the sample# records? The number of each subset of details are different for each sample.
I can use:
samplelist = df.header2[pd.notnull(df.header2)]
to get the starting index of each sample so that I can grab samplelist.index[0] to samplelist.index[1] and put it in a smaller dataframe. Detail records by themselves have no reference to which sample they came from, so that has to be inferred from the order of the csv file (notice that there is no intersection of filled/empty fields in my example).
Should I make a list of dataframes, a dict of dataframes, or a panel of dataframes?
Can I somehow create variables from the sample1 record fields and somehow attach them to each dataframe that has only detail records (like a collection of objects that have several scalar members and one dataframe each)?
Eventually I will create statistics on data from each detail record grouping and plot them against values in the sample records (e.g. sampletype, day or date, etc. vs. mystatistic). I will create intermediate Series to also be attached to the sample grouping like a kernel density estimation PDF or histogram.
Thanks.
You can use the fact that the first column is empty unless it's a new sample record: .fillna(method='ffill') fills the sample values downward, and .groupby('header1') then yields the separate groups. On these, you can calculate statistics right away or store each group as a separate DataFrame. A high-level sketch follows:
df.header1 = df.header1.fillna(method='ffill')
for sample, data in df.groupby('header1'):
    print(sample)  # access to the sample name
    data = ...     # process sample records
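Since the question asks whether to use a list or a dict of dataframes, a small sketch of the dict option under the same forward-fill assumption:

# one detail DataFrame per sample, keyed by the sample name
samples = {sample: data for sample, data in df.groupby('header1')}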
The answer above got me going in the right direction. With further work, the following was used. It turns out I needed to use two columns as a compound key to uniquely identify samples.
df.header1 = df.header1.fillna(method='ffill')
df.header2 = df.header2.fillna(method='ffill')
grouped = df.groupby(['header1', 'header2'])
samplelist = []
dfParent = pd.DataFrame()
dfDetail = pd.DataFrame()
for sample, data in grouped:
    samplelist.append(sample)
    dfParent = dfParent.append(grouped.get_group(sample).head(n=1), ignore_index=True)
    dfDetail = dfDetail.append(data[1:], ignore_index=True)
dfParent = dfParent.drop(columns=['header3', 'header4', etc...])  # remove columns only used in detail records
dfDetail = dfDetail.drop(columns=['header5', 'header6', etc...])  # remove columns only used once per sample
# Now details can be extracted by sample number in the sample list
# (e.g. the first 10 for sample 0)
samplenumber = 0
dfDetail[
    (dfDetail['header1'] == samplelist[samplenumber][0]) &
    (dfDetail['header2'] == samplelist[samplenumber][1])
].header3[:10]
Useful links were:
Pandas groupby and get_group
Pandas append to DataFrame
I have two tables: one contains SCHEDULE_DATE (over 300,000 records) and WORK_WEEK_CODE, and the second contains WORK_WEEK_CODE, START_DATE, and END_DATE. The first table has duplicate schedule dates, and the second table holds 3,200 unique values. I need to populate the WORK_WEEK_CODE in table one with the WORK_WEEK_CODE from table two, based on the range in which the schedule date falls. Samples of the two tables are below.
I was able to accomplish the task using arcpy.da.UpdateCursor with a nested arcpy.da.SearchCursor, but with the volume of records, it takes a long time. Any suggestions on a better (and less time consuming) method would be greatly appreciated.
Note: the date fields are formatted as strings
Table 1
SCHEDULE_DATE,WORK_WEEK_CODE
20160219
20160126
20160219
20160118
20160221
20160108
20160129
20160201
20160214
20160127
Table 2
WORK_WEEK_CODE,START_DATE,END_DATE
1601,20160104,20160110
1602,20160111,20160117
1603,20160118,20160124
1604,20160125,20160131
1605,20160201,20160207
1606,20160208,20160214
1607,20160215,20160221
You can use Pandas dataframes as a more efficient method; here is the approach. Hope this helps:
import pandas as pd
# First you need to convert your data to Pandas Dataframe I read them from csv
Table1 = pd.read_csv('Table1.csv')
Table2 = pd.read_csv('Table2.csv')
# Then you need to add a shared key for join
Table1['key'] = 1
Table2['key'] = 1
#The following line joins the two tables
mergeddf = pd.merge(Table1,Table2,how='left',on='key')
#The following line converts the string dates to actual dates
mergeddf['SCHEDULE_DATE'] = pd.to_datetime(mergeddf['SCHEDULE_DATE'],format='%Y%m%d')
mergeddf['START_DATE'] = pd.to_datetime(mergeddf['START_DATE'],format='%Y%m%d')
mergeddf['END_DATE'] = pd.to_datetime(mergeddf['END_DATE'],format='%Y%m%d')
#The following line will filter and keep only lines that you need
result = mergeddf[(mergeddf['SCHEDULE_DATE'] >= mergeddf['START_DATE']) & (mergeddf['SCHEDULE_DATE'] <= mergeddf['END_DATE'])]
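As a follow-up, note that because both tables contain a WORK_WEEK_CODE column, the merge will suffix them as WORK_WEEK_CODE_x (the empty one from Table 1) and WORK_WEEK_CODE_y (the populated one from Table 2) by default. A sketch of the cleanup, assuming those default suffixes; also bear in mind that the key=1 trick cross-joins every row of Table 1 with every row of Table 2 (roughly 300,000 x 3,200 intermediate rows here), so it can be memory-hungry:

# keep the populated code from Table 2 and restore the original layout
result = result.rename(columns={'WORK_WEEK_CODE_y': 'WORK_WEEK_CODE'})
result = result[['SCHEDULE_DATE', 'WORK_WEEK_CODE']]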