Actually there is no CSV file; I have parquet files, from which I need to extract data from three tables: the publication, section, and alt section tables.
As you can see from the images, I need the following outputs
I have a dataframe like this as shown in the screenshot.
I need to get the following output as a dataframe
pub_number   stdkw1   stdkw2
----------   ------   ------
1078143      T.       Art.
Like this: if there are 3 or more values for the same number, it should take all of them as stdkw1, stdkw2, stdkw3, etc.
Group the dataframe by pub_number. Then iterate over the groups. Append the std_section_name values with pub_number to the result list. Create a dataframe with data from the result list. Later add column names to the dataframe.
import pandas as pd

df = pd.DataFrame([[3333, 1078143, 'T.'], [3333, 1078143, 'ssss'],
                   [3334, 1078145, 'Art'], [3334, 1078145, 'Art'],
                   [3334, 1078145, 'Art'], [3334, 1078145, 'Art'],
                   [3334, 1078143, 'team']],
                  columns=['section_id', 'pub_number', 'std_section_name'])

result = list()
for name, group in df.groupby(by='pub_number'):
    if group.shape[0] < 3:
        continue
    result.append([name] + group['std_section_name'].tolist())

ref = pd.DataFrame(result)
ref.columns = ["pub_number"] + ["stdkw" + str(i) for i in range(1, ref.shape[1])]
print(ref)
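An equivalent vectorized sketch of the same idea, using the toy data above: number the values within each group with cumcount, then pivot the numbered values into stdkw columns.

```python
import pandas as pd

df = pd.DataFrame([[3333, 1078143, 'T.'], [3333, 1078143, 'ssss'],
                   [3334, 1078145, 'Art'], [3334, 1078145, 'Art'],
                   [3334, 1078145, 'Art'], [3334, 1078145, 'Art'],
                   [3334, 1078143, 'team']],
                  columns=['section_id', 'pub_number', 'std_section_name'])

# keep only pub_numbers that occur 3+ times, number the values per group,
# then pivot the numbered values into stdkw1, stdkw2, ... columns
counts = df.groupby('pub_number')['std_section_name'].transform('count')
sub = df[counts >= 3].copy()
sub['col'] = 'stdkw' + (sub.groupby('pub_number').cumcount() + 1).astype(str)
ref = sub.pivot(index='pub_number', columns='col',
                values='std_section_name').reset_index()
```

Groups with fewer values than the widest group end up with NaN in the trailing stdkw columns, matching the loop-based version.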
I am trying to combine dataframes with 2 columns into a single dataframe. The initial dataframes are generated through a for loop and stored in a list. I am having trouble getting the data from the list of dataframes into a single dataframe. Right now when I run my code, it treats each full dataframe as a row.
def linear_reg_function(category):
    df = pd.read_csv(file)
    df = df[df['category_column'] == category]
    df1 = df[['category_column', 'value_column']]
    df_export.append(df1)

df_export = []
for category in category_list:
    linear_reg_function(category)
When I run this block of code I get a list of dataframes that each have 2 columns. When I try to convert df_export to a dataframe, it ends up with 12 rows (the number of categories in category_list). I tried:
df_export = pd.DataFrame()
but the result was still not a single combined dataframe.
I would like to have a single dataframe with 2 columns, [Category, Value] that includes the values of all 12 categories generated in the for loop.
You can use pd.concat to merge a list of DataFrames into a single big DataFrame.
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store each DataFrame in the list
    appended_data.append(data)

# see pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write the combined DataFrame to an Excel sheet
appended_data.to_excel('appended.xlsx')
You can then adapt this pattern to your own needs.
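Applied to the question's function, a minimal sketch (the data and column names here are stand-ins based on the snippet) would return each filtered frame and concatenate once at the end, instead of appending to a global list:

```python
import pandas as pd

def linear_reg_function(df, category):
    # return the 2-column slice instead of appending to a global list
    sub = df[df['category_column'] == category]
    return sub[['category_column', 'value_column']]

# toy stand-in for the CSV data
df = pd.DataFrame({'category_column': ['a', 'a', 'b', 'c'],
                   'value_column': [1, 2, 3, 4]})
category_list = ['a', 'b', 'c']

# one frame with 2 columns, rows from all categories
df_export = pd.concat(
    [linear_reg_function(df, c) for c in category_list],
    ignore_index=True)
```

Returning values from the function also avoids relying on a mutable global, which makes the loop easier to test.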
I have a list of multiple data frames of cryptocurrency data. I want to apply a function to all of these data frames that converts each one so that I am left with only data from 2021.
The function looks like this:
dataframe_list = [bitcoin, aave, binance, cardano, chainlink, cosmos,
                  crypto_com, dogecoin, eos, ethereum, iota, litecoin,
                  monero, nem, polkadot, solana, stellar, tether,
                  uniswap, usdcoin, wrapped, xrp]
def date_func(i):
    i['Date'] = pd.to_datetime(i['Date'])
    i = i.set_index(i['Date'])
    i = i.sort_index()
    i = i['2021-01-01':]
    return i

for dataframe in dataframe_list:
    dataframe = date_func(dataframe)
However, I am only left with one data frame called 'dataframe', which only contains values of the xrp dataframe.
I would like to have a new dataframe from each dataframe, called aave21, bitcoin21 .... which only contains values from 2021 onwards.
What am I doing wrong?
Best regards and thanks in advance.
You are overwriting dataframe when iterating over dataframe_list, i.e. you only keep the latest dataframe.
You can either try:
frames = []
for df in dataframe_list:
    frames.append(date_func(df))
dataframe = pd.concat(frames)
Or shorter:
dataframe = pd.concat([date_func(df) for df in dataframe_list])
You are overwriting dataframe variable in your for loop when iterating over dataframe_list. You need to keep appending results into a new variable.
results = []
for dataframe in dataframe_list:
    results.append(date_func(dataframe))
final_df = pd.concat(results)
print(final_df)
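If the goal is separate named results like aave21 and bitcoin21 rather than one combined frame, a dict keyed by coin name is a common pattern. A sketch with toy data (the real frames would come from the question's list):

```python
import pandas as pd

def date_func(df):
    # same logic as the question's function, on a copy
    df = df.copy()
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.set_index('Date').sort_index()
    return df.loc['2021-01-01':]

# toy stand-ins for the real coin dataframes
frames = {
    'bitcoin': pd.DataFrame({'Date': ['2020-12-31', '2021-01-02'],
                             'Close': [100, 110]}),
    'aave': pd.DataFrame({'Date': ['2021-03-01'], 'Close': [3]}),
}

# one filtered frame per coin, keyed e.g. 'bitcoin21'
filtered = {name + '21': date_func(df) for name, df in frames.items()}
```

Looking results up as `filtered['bitcoin21']` avoids creating variables dynamically, which is generally discouraged in Python.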
I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa','bb','cc']],
'B1':dfx[['dd']],
'C1':dfx[['ee', 'ff']]}
reform = {(outerKey, innerKey): values for outerKey, innerDict in nest.items() for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains a summary of the total for the three categories represented by the new index (A1, B1, C1).
I have tried with df.loc (which I would normally use in this case) but I get an error; similarly for iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
newrow
dfzx.loc[len(dfzx)] = newrow
ValueError: cannot set a row with mismatched columns
#Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
newrow2
dfzx.loc[len(dfzx)] = newrow2
ValueError: cannot set a row with mismatched columns
How can I fix the mistake? Or else is there any other function that would allow me to proceed?
Note: the DF is destined to be moved on an Excel file (I use ExcelWriter).
The type of result I want to achieve in the end is shown in the screenshot (the gray "SUM" row).
I came up with a sort of solution on my own.
I created a separate DataFrame in Pandas that contains the summary.
I used ExcelWriter to have both dataframes on the same excel worksheet.
Technically it would then be possible to style and format the data in Excel (xlsxwriter and StyleFrame seem to be popular modules for this); alternatively, one can do that manually.
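For reference, the original ValueError arises because the new row supplies 2 (or 4) values for a 6-column frame. One way to keep everything in a single frame is to repeat each category total across that category's sub-columns. A sketch with made-up numbers in place of dfx:

```python
import numpy as np
import pandas as pd

# toy stand-in for dfx, mirroring the question's column layout
dfx = pd.DataFrame(np.arange(12).reshape(2, 6),
                   columns=['aa', 'bb', 'cc', 'dd', 'ee', 'ff'])
nest = {'A1': dfx[['aa', 'bb', 'cc']],
        'B1': dfx[['dd']],
        'C1': dfx[['ee', 'ff']]}
reform = {(outer, inner): values for outer, inner_df in nest.items()
          for inner, values in inner_df.items()}
dfzx = pd.DataFrame(reform)

# grand total per outer category (A1, B1, C1)
totals = dfzx.T.groupby(level=0).sum().T.sum()
# a new row needs one value per column, so repeat the category total
# under each of that category's sub-columns
dfzx.loc['Total'] = [totals[outer] for outer, _ in dfzx.columns]
```

This keeps the summary in the same frame, so a single `to_excel` call exports both the data and the totals row.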
So I have multiple data frames that I am attempting to loop over.
I have created a list using the following code:
data_list = [df1, df2, df3]
After that I would like to filter out a predefined range of numbers in the column 'Firm_Code' in each data frame.
So far, I am able to filter out firms with a respective code between 6000 and 6999 for a single data frame as follows:
FFirms = range(6000,7000)
Non_FFirms = [b for b in df1['Firm_Code'] if b not in FFirms]
df1 = df1.loc[df1['Firm_Code'].isin(Non_FFirms)]
Now I would like to loop over the data_list. My first try looks like the following:
for i in data_list:
    i = i.loc[i.Firm_Code.isin(Non_FFirms)]
Appreciate any suggestions!
Instead of making the list of dataframes, you can concat all the data frames into a single dataframe.
data_df = pd.concat([df1,df2,df3],ignore_index=True)
If you need to know which dataframe a row came from, you can add a new column, say 'Df_number', before concatenating.
Using data_df you can then filter the data:
FFirms = range(6000, 7000)
Non_FFirms = [b for b in data_df['Firm_Code'] if b not in FFirms]
filtered_data_df = data_df.loc[data_df['Firm_Code'].isin(Non_FFirms)]
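If you do want to keep the frames separate, rebuilding the list also works, since reassigning the loop variable never changes the list itself. A sketch with toy data in place of df1, df2, df3:

```python
import pandas as pd

# toy stand-ins for the real dataframes
df1 = pd.DataFrame({'Firm_Code': [5999, 6500, 7000]})
df2 = pd.DataFrame({'Firm_Code': [6000, 8000]})
data_list = [df1, df2]

FFirms = range(6000, 7000)
# ~isin keeps rows whose Firm_Code is outside 6000-6999
data_list = [df.loc[~df['Firm_Code'].isin(FFirms)] for df in data_list]
```

Negating `isin` on the column directly also avoids building the intermediate Non_FFirms list for every frame.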
I need to create a dataframe from a loop. The idea is that the loop reads a data frame of texts (train_vs), searches for specific key words (['govern', 'data']), and then calculates their frequency (TF). What I want is an outcome with two columns containing the TF of each word for each text. The code I am using is the following:
d = pd.DataFrame()
key = ['govern', 'data']
for k in key:
    for w in range(0, len(train_vs)):
        wordcount = Counter(train_vs['doc_text'].iloc[w])
        a_vs = (wordcount[k] / len(train_v.iloc[w]) * 1)
        temp = pd.DataFrame([{k: a_vs}])
        d = pd.concat([d, temp])
However, I am getting two columns where the rows for the first keyword have NaN in the second column, and the rows for the second keyword have NaN in the first, so the resulting dataframe has twice as many rows as texts.
I want to have both values next to each other.
Your help is highly appreciated.
Thanks.
From the pandas.concat documentation:
Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.
What happens when the loop moves on to the second key is that you concatenate a new df (temp) that has a single column ('data') to the old df that has a different single column ('govern'), and that is why you get the half-filled columns of NaNs.
What you could do instead of concatenating millions of dataframes is to build only one dataframe, by building the columns.
d = pd.DataFrame()
key = ['govern', 'data']
for k in key:
    column = []
    for w in range(0, len(train_vs)):
        wordcount = Counter(train_vs['doc_text'].iloc[w])
        a_vs = (wordcount[k] / len(train_v.iloc[w]) * 1)
        column.append(a_vs)
    d[k] = column
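A runnable toy version of that pattern, assuming doc_text holds a list of tokens per row and normalizing by the document's own length (the original normalizes by `len(train_v.iloc[w])`, which may differ):

```python
from collections import Counter

import pandas as pd

# toy stand-in for train_vs: each row's doc_text is a list of tokens
train_vs = pd.DataFrame({'doc_text': [['govern', 'data', 'data'],
                                      ['data']]})
key = ['govern', 'data']

d = pd.DataFrame()
for k in key:
    column = []
    for w in range(len(train_vs)):
        doc = train_vs['doc_text'].iloc[w]
        # term frequency: count of k divided by document length
        column.append(Counter(doc)[k] / len(doc))
    d[k] = column  # one column per keyword, one row per text
```

Each keyword becomes a column of the same length, so the two TF values for a given text sit next to each other on one row.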