I have several datasets that I import as CSV files and display as pandas DataFrames. The CSV files contain info about Covid updates.
Each dataset has several columns relating to this, for example "country_region", "last_update" & "confirmed".
Let's say I wanted to look up the confirmed cases of Covid for Germany.
I'm trying to write a function that will return a slice of the DataFrame that corresponds to those constraints to be able to display the match I'm looking for.
I need to do this in some generic way so I can provide any value from any column.
I wish I had some code to include but I'm stuck on how to even proceed.
Everything I find online only covers looking up values against a pre-defined column or value.
Something like this?
def filter(country_region_val, last_update_val, confirmed_val, df):
    df = df.loc[(df['country_region'] == country_region_val)
                & (df['last_update'] == last_update_val)
                & (df['confirmed'] == confirmed_val)].reset_index(drop=True)
    return df
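If you want it fully generic so you can provide any value from any column, one option (a minimal sketch, not tested against your data; the name filter_df is mine, chosen to avoid shadowing the built-in filter) is to accept the filters as a dict of column/value pairs:
import pandas as pd

def filter_df(df, filters):
    # filters maps column names to the values to match,
    # e.g. {'country_region': 'Germany'}
    mask = pd.Series(True, index=df.index)
    for col, val in filters.items():
        mask &= df[col] == val
    return df.loc[mask].reset_index(drop=True)

# e.g. confirmed cases for Germany:
# filter_df(df, {'country_region': 'Germany'})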
So I have a Python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that the iteration is the problem. However, I haven't had much luck with various pandas/NumPy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if not df_new[new_col_name][i] in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs held in differently named columns of two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works (supplied to me by someone much smarter). The ~ negates isin, so it keeps only the rows of df_new whose key value does not appear in df_current:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
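For example, with the sample frames above reduced to their key columns (a minimal sketch; the other columns are omitted for brevity):
import pandas as pd

# minimal reconstruction of the example frames, key columns only
df_current = pd.DataFrame({'partno': [123, 456],
                           'description': ['Logo T-Shirt', 'Knitted Shirt']})
df_new = pd.DataFrame({'mfgr_num': [456, 789],
                       'desc': ['Knitted Shirt', 'Logo Vest']})

current_col_name = 'partno'
new_col_name = 'mfgr_num'

df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
print(df_out)  # only the 789 'Logo Vest' row remains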
I am still learning Python, so kindly excuse me if the question looks trivial to some.
I have a CSV file with the following format, and I want to extract a small segment of it and write it to another CSV file.
So, this is what I want to do:
Just extract the entries under actor_list2 and the corresponding id column, and write them to a CSV file in the following format.
Since the format is not a regular set of column headers followed by values, I am not sure how to select a starting point based on a cell value in a particular column. E.g., even if we consider actor_list2, it may have any number of entries under it. Please help me understand if this can be done using pandas' dataframe-processing capability.
Update: The reason why I would like to automate it is because there can be thousands of such files and it would be impractical to manually get that info to create the final csv file which will essentially have a row for each file.
As Nour-Allah has pointed out, the formatting here is not very regular, to say the least. If your data comes out like this every time, the best you can do is skip some rows of the file:
import pandas as pd

# skip the 17 preamble rows, then read only the 8 rows of interest
df = pd.read_csv('blabla.csv', skiprows=list(range(17)), nrows=8)
df_res = df.loc[:, ['actor_list2', 'ID']]
This should get you the result, but given how erratic the formatting is, there is no way to automate this. What if next time there's another actor? Or one fewer? Even Nour-Allah's solution would not help there.
Honestly, you should just get better data.
Since the CSV file you have is not regular, it contains a lot of empty positions, which become NaN objects, and the columns will be indexed by number.
I will use pandas to read it:
import pandas as pd
df = pd.read_csv("not_regular_format.csv", header=None)
Then, initialize an empty dictionary to store the results in; we will use it to build an output DataFrame, which finally sends its contents to a CSV file:
target = {}
Now you need to find actor_list2 in the second column, which is the column with index 1. If it exists, start storing the names and scores from the following rows (columns 1 and 2) in the dictionary target:
rows_index = df[df[1] == 'actor_list2'].index
if len(rows_index) > 0:
    i = rows_index[0]
    while True:
        i += 1
        name = df.iloc[i, 1]
        score = df.iloc[i, 2]
        if pd.isna(name):  # the names sequence is finished once a 'nan' appears
            break
        target[name] = [score]
And finally, construct the DataFrame and write the new output.csv file:
df_output = pd.DataFrame(target)
df_output.to_csv('output.csv')
Now, you can go anywhere with the given example above.
Good Luck
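And per the update in the question, if there are thousands of such files, one way (a sketch under the assumption that every file shares this layout; the 'data/*.csv' glob pattern is a placeholder) is to wrap the extraction in a function and build one row per file:
import glob
import pandas as pd

def extract_actor_list2(path):
    # return a {name: score} dict for the actor_list2 block of one file
    df = pd.read_csv(path, header=None)
    target = {}
    rows_index = df[df[1] == 'actor_list2'].index
    if len(rows_index) > 0:
        i = rows_index[0]
        while i + 1 < len(df):
            i += 1
            name = df.iloc[i, 1]
            if pd.isna(name):  # end of the names sequence
                break
            target[name] = df.iloc[i, 2]
    return target

# one row per file, as described in the update
rows = [extract_actor_list2(p) for p in glob.glob('data/*.csv')]
pd.DataFrame(rows).to_csv('output.csv', index=False)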
I'm trying to organise data using a pandas dataframe.
Given the structure of the data, it seems logical to use a composite index of 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs; however, I am unable to access the data using the index.
My code can be found here:
https://repl.it/repls/OldCorruptRadius
** I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks! **
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way that I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before the multi-indexing, you filter the fixture_id and league_id as columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still effectively targeting what would have been the indexes had you gone through with multi-indexing the two columns.
If you have to use multi-indexing, try the transpose feature of a pandas DataFrame (df.T). This turns the index into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the row of `event_dates`
df.T[524][592].loc['event_date'].iloc[0] # gets you the first instance of event_dates
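Alternatively (a sketch, reusing fixture and features from the question's code), you can target the two-level index directly with a tuple and skip the transpose:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
# select league 524, fixture 592 directly, then pull event_date
df.loc[(524, 592), 'event_date']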
I need to add a 'Status' column to my pandas df that checks multiple columns of the df in order to display a message about possible options for the user. I think I could do this if the columns were static, but the next time a new tool is brought online it will show up in the data as a new column that will need to be checked. I'm trying to avoid hardcoding specific tool names.
I figured out a way to conditionally create a column for each tool, which works well, but in the case above I need a single column instead of one per tool.
Here's how I'm managing to create a new column based on all the tools that are in the dataframe:
tools = unded[unded.columns[unded.columns.str.contains("tool")]]
undedtools = tools.columns.values.tolist()
for tool in undedtools:
    unded.loc[(unded[tool] == 'Y') & (unded['lines_down'] == 'N'), tool + '_format'] = 2
    unded.loc[unded[tool] == 0, tool + '_format'] = 3
This creates a column for each tool named like "tool123_format". The columns get filled in with a number which I use for formatting a report. So now that I have a column like this for each tool I need to check all of these columns and report on the status.
I would expect it to report something like "tool123 and tool456 are open" if it finds a 2 in the format column for each of those tools. The next line may have no open tools, so it would say "All paths are closed. Escalate to eng".
How do I get the tool names conditionally into this "Status" column for each row of the dataframe? I had previously had this whole thing in SQL but I'm getting tired of adding dozens of new lines to my CASE WHEN statement each time we add a new tool.
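One possible approach (a sketch, not the OP's code; it assumes the *_format columns built above, and the helper status_message is hypothetical) is to derive the message row-wise from whichever format columns hold a 2:
# collect the generated "*_format" columns
format_cols = [c for c in unded.columns if c.endswith('_format')]

def status_message(row):
    # a tool is "open" when its format column holds a 2
    open_tools = [c[:-len('_format')] for c in format_cols if row[c] == 2]
    if open_tools:
        return ' and '.join(open_tools) + ' are open'
    return 'All paths are closed. Escalate to eng'

unded['Status'] = unded.apply(status_message, axis=1)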
I would really appreciate some help with the following problem. I'm intending to use the pandas library to solve it, so I'd be grateful if you could explain how this can be done with pandas, if possible.
I want to take the following excel file:
Before
and:
1) convert the 'Before' file into a pandas dataframe
2) look for the text in the 'Site' column. Where this text appears within the string in the 'Domain' column, return the value in the 'Owner' column under 'Output'.
3) the result should look like the 'After' file. I would like to convert this back into CSV format.
After
So essentially this is similar to an Excel VLOOKUP exercise, except it's not an exact match we're looking for between the 'Site' and 'Domain' columns.
I have already attempted this in Excel, but I'm looking at over 100,000 rows, compared against over 1,000 sites, which crashes Excel.
I have attempted to store the lookup list in the same file as the list of domains we want to classify with the 'Owner'. If there's a much better way to do this eg storing the lookup list in a separate data frame altogether, then that's fine.
Thanks in advance for any help, I really appreciate it.
Colin
I think the OP's question differs somewhat from the solutions linked in the comments, which deal either with exact lookups (map) or with lookups between dataframes. Here there is a single dataframe and a partial match to find.
import pandas as pd
import numpy as np
df = pd.ExcelFile('data.xlsx').parse(0)
df = df.astype(str)
# row-wise substring test: is this row's Site contained in its Domain?
df['Test'] = df.apply(lambda x: x['Site'] in x['Domain'], axis=1)
df['Output'] = np.where(df['Test'], df['Owner'], '')
df
The lambda applies the in substring test row by row (axis=1), returning a boolean in Test. This then acts as the rule for looking up Owner and placing it in Output.
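Since the question mentions over 100,000 rows, one speed note: apply with a row-wise lambda is still a Python-level loop, and a plain list comprehension over the two columns (a sketch with the same logic) is usually somewhat faster:
# same row-wise substring test without DataFrame.apply
df['Test'] = [site in domain for site, domain in zip(df['Site'], df['Domain'])]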