How to merge multiple csv and create a dataframe? - python

I'd like to do the following steps:
1. Merge all CSV files in the same directory.
2. Create a DataFrame from them.
3. Assign column names, drop a column, then set one of the columns ('Type') as the index.
4. For all files, melt the columns from column D through the last column into rows.
file_list = glob.glob("*.csv")
for file in file_list:
merged_file = pd.read_csv(file)
print(merged_file)
merged_file = pd.DataFrame()
df.columns = ['Type', 'Country', 'Source']
df = df.drop('Source', axis=1)
df = df.set_index('Type',drop=False).stack().reset_index()
agg_df = pd.melt(df, id_vars = [df, 'Source', 'Country', 'Type'])
df = df.sort_values('Type').reset_index(drop=True)
print(df)
The clear expected output (aligned to the left):
Mineral name - Type - Country - Prod_t_2021
Mineral name - Type - Country - Prod_t_2022
Mineral name - Type - Country - Reserves_t
Mineral name - Type - Country - Reserves_notes
Mineral name could be extracted from Type as string
The source is World.zip from URL: https://www.sciencebase.gov/catalog/item/63b5f411d34e92aad3caa57f

IIUC, use this :
df = (
pd.concat([pd.read_csv(f) for f in file_list], ignore_index=True)
.melt(id_vars = ["Source", "Country", "Type"])
.set_index("Type")
.sort_index()
)
Output :
Source Country variable value
Type
Mine production: Palladium MCS2023 United States Prod_t_est_2022 NaN
Mine production: Palladium MCS2023 South Africa Reserves_ore_kt NaN
Mine production: Palladium MCS2023 Russia Reserves_ore_kt NaN
Mine production: Palladium MCS2023 Canada Reserves_ore_kt NaN
Mine production: Palladium MCS2023 United States Reserves_ore_kt NaN
Mine production: Palladium MCS2023 United States Reserves_kt NaN
Mine production: Palladium MCS2023 Canada Reserves_kt NaN
Mine production: Palladium MCS2023 Russia Reserves_kt NaN
Mine production: Palladium MCS2023 South Africa Reserves_kt NaN
Mine production: Palladium MCS2023 Other countries Reserves_notes Included with platinum
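The question also asks for the mineral name to be pulled out of Type. A minimal sketch of the whole pipeline follows, using two tiny in-memory stand-ins for the CSV files. The names csv_a, csv_b, long_df and the Mineral column are hypothetical, the numbers are made up, and it assumes the mineral name is whatever follows the last ": " in Type.

```python
import io
import pandas as pd

# Two tiny in-memory stand-ins for the CSV files (made-up values, not the real data).
csv_a = io.StringIO(
    "Source,Country,Type,Prod_t_2021,Prod_t_2022\n"
    "MCS2023,Canada,Mine production: Palladium,10,12\n"
)
csv_b = io.StringIO(
    "Source,Country,Type,Prod_t_2021,Prod_t_2022\n"
    "MCS2023,Chile,Mine production: Copper,100,110\n"
)

long_df = (
    pd.concat([pd.read_csv(f) for f in (csv_a, csv_b)], ignore_index=True)
    .drop(columns="Source")                 # drop the column that is not needed
    .melt(id_vars=["Type", "Country"])      # every other column becomes a row
)

# Assumption: the mineral name is the text after the last ": " in Type.
long_df["Mineral"] = long_df["Type"].str.split(": ").str[-1]

long_df = long_df.sort_values(["Mineral", "Country", "variable"]).reset_index(drop=True)
print(long_df)
```

With real files, the two StringIO objects would be replaced by the paths in file_list.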


Is there a way to take all of my row values and make them col headers?

I currently have a long list of countries (234 values). For simplicity's sake, only 10 are displayed here. This is what my df currently looks like:
I want to create a matrix of some sort, where the countries listed in each row are also the col headers. This is what I want my output dataframe to look like:
Country China India U.S. Indonesia Pakistan ... Montserrat Falkland Islands Niue Tokelau Vatican City
China
India
U.S.
Indonesia
Pakistan
...
Montserrat
Falkland Islands
Niue
Tokelau
Vatican City
So to reiterate the question: how do I take the value in each row of column 1 and copy it to be my dataframe's column headers to create a matrix? This dataframe is also being scraped from a website using requests and Beautiful Soup, so it isn't like I can create a CSV file of a pre-made dataframe. Is what I want to do possible?
Initialize a Pandas DataFrame as follows
countryList = ['China', 'India', 'U.S.']
pd.DataFrame(columns=countryList, index=countryList)
and just add all the elements of countryList according to your use case. This yields an empty dataframe to insert data into.
      China India U.S.
China NaN   NaN   NaN
India NaN   NaN   NaN
U.S.  NaN   NaN   NaN
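Once the empty frame exists, cells can be filled by row and column label. A minimal sketch, assuming numeric cell values; the name matrix and the values 1.5 and 0.0 are made up for illustration:

```python
import pandas as pd

country_list = ["China", "India", "U.S."]

# Square frame with the same labels on both axes; every cell starts as NaN.
matrix = pd.DataFrame(index=country_list, columns=country_list, dtype="float64")

# Fill a single cell by (row label, column label).
matrix.loc["China", "India"] = 1.5

# Fill the diagonal, one cell per country.
for country in country_list:
    matrix.loc[country, country] = 0.0

print(matrix)
```

Passing dtype="float64" is a design choice here: it keeps the frame numeric, so later arithmetic on the matrix does not have to deal with object-typed columns.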
Will something like this work?
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df
Output:
Country US China England Spain Brazil
0 US
1 China
2 England
3 Spain
4 Brazil
You can even set the country as the index like:
data = ["US","China","England","Spain",'Brazil']
df = pd.DataFrame({"Country":data})
df[df.Country.values] = ''
df = df.set_index(df.Country)[df.Country.values].rename_axis(index=None)
Output:
US China England Spain Brazil
US
China
England
Spain
Brazil
7Shoe's answer is good, but in case you already have a dataframe:
import pandas as pd
df = pd.DataFrame({'Country':['U.S.','Canada', 'India']})
pd.DataFrame(columns=df.Country, index=df.Country).rename_axis(None)
Output
Country U.S. Canada India
U.S. NaN NaN NaN
Canada NaN NaN NaN
India NaN NaN NaN

conditional merge with pandas

I have 3 Excel files. Let's say file A.xlsx is the main file where I work, and in it I have 2 columns: Material and Location.
In file B.xlsx I have Material and Seasonality columns.
The same goes for file C.xlsx: Material and Seasonality.
The Location column contains only two countries (France and Spain) plus blank values.
File B only has seasonality information for France,
and file C only has seasonality information for Spain.
I would like to get the seasonality information from files B and C and put it in a separate column (named "Seasonality") in file A, matched to the Location value.
df1
Material Location
2527 France
2528 Spain
2627 Spain
2628 NaN
2725 France
df2
Material Seasonality
2527 Summer
2725 Autumn
df3
Material Seasonality
2528 Winter
2627 Spring
desired output
Material Location Seasonality
2527 France Summer
2528 Spain Winter
2627 Spain Spring
2628 NaN NaN
2725 France Autumn
The code I have tried is this. I know it's totally wrong, but I saw someone recommend something like this to someone else:
import pandas as pd
df= pd.read_excel('A.xlsx')
df1 = pd.read_excel('B.xlsx')
df3 = pd.read_excel('C.xlsx')
df4 = pd.merge(df,df1[['Material','Seasonality']], on='Material', how='left') #merge for France
df5=pd.merge(df,df3[['Material', 'Seasonality']], on = 'Material', how='left') if df['Location']== 'Spain'
print(df5)
Concatenating files B and C is not possible.
P.S. I don't know if it helps or not, but I have already merged files A and B, so the dataframe now has the seasonality values for France and NaNs for the rows where Location is Spain or blank. Could you please help?
Let us try concat then groupby
out = pd.concat([df1,df2,df3]).groupby('Material',as_index=False).first()
Out[90]:
Material Location Seasonality
0 2527 France Summer
1 2528 Spain Winter
2 2627 Spain Spring
3 2628 None None
4 2725 France Autumn
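If B and C really should not be concatenated, a conditional lookup per location is another option. This is a minimal sketch rather than the answer's method: it assumes Material is unique within B and within C, and the names df_a, df_b, df_c, france and spain are hypothetical stand-ins for the three files and the intermediate lookups.

```python
import pandas as pd

df_a = pd.DataFrame({
    "Material": [2527, 2528, 2627, 2628, 2725],
    "Location": ["France", "Spain", "Spain", None, "France"],
})
df_b = pd.DataFrame({"Material": [2527, 2725], "Seasonality": ["Summer", "Autumn"]})  # France only
df_c = pd.DataFrame({"Material": [2528, 2627], "Seasonality": ["Winter", "Spring"]})  # Spain only

# Look every material up in each file separately (NaN where there is no match).
france = df_a["Material"].map(df_b.set_index("Material")["Seasonality"])
spain = df_a["Material"].map(df_c.set_index("Material")["Seasonality"])

# Take the France lookup on France rows, the Spain lookup on Spain rows, NaN otherwise.
is_france = df_a["Location"].eq("France")
is_spain = df_a["Location"].eq("Spain")
df_a["Seasonality"] = france.where(is_france, spain.where(is_spain))

print(df_a)
```

Unlike the concat and groupby approach, this never lets a France row pick up a value from the Spain file (or the reverse), even if the same Material appears in both.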

How to create a column from another df according to matching columns?

I have a df named population with a column named countries. I want to merge rows so they reflect regions (Africa, West Hem, Asia, Europe, Mideast). I have another df named regionref, from Kaggle, that has all the countries of the world and the region each one is associated with.
How do I create a new column in the population df that holds the corresponding region for each country in the country column, using the region column from the Kaggle dataset?
So essentially this is the population dataframe:
CountryName 1960 1950 ...
US
Zambia
India
And this is the regionref dataset
Country Region GDP...
US West Hem
Zambia Africa
India Asia
And I want the population df to look like
CountryName Region 1960 1950 ...
US West Hem
Zambia Africa
India Asia
EDIT: I tried the concatenation but for some reason the two columns are not recognizing the same values
population['Country Name'].isin(regionref['Country']).value_counts()
This returned False for all values, as in there are no values in common.
And this is the output, as you can see there are values in common
You just need a join, which in pandas can be done by concatenating the two frames along the columns.
Given two DataFrames pop, region:
pop = pd.DataFrame([['US', 1000, 2000], ['CN', 2000, 3000]], columns=['CountryName', 1950, 1960])
CountryName 1950 1960
0 US 1000 2000
1 CN 2000 3000
region = pd.DataFrame([['US', 'AMER', '5'], ['CN', 'ASIA', '4']], columns = ['Country', 'Region', 'GDP'])
Country Region GDP
0 US AMER 5
1 CN ASIA 4
You can do:
pd.concat([region.set_index('Country'), pop.set_index('CountryName')], axis = 1)\
.drop('GDP', axis =1)
Region 1950 1960
US AMER 1000 2000
CN ASIA 2000 3000
The axis = 1 argument concatenates horizontally. You have to set the country column as the index of both frames first so that the rows line up correctly.
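The edit in the question reports that isin returned False for every value even though the names look identical. One possible cause (an assumption, the question does not confirm it) is stray whitespace or different letter case in one of the columns. A minimal sketch that normalizes both key columns before a left merge; the helper normalize, the temporary key column and the sample values are all hypothetical:

```python
import pandas as pd

# Made-up sample data with deliberately messy country names.
population = pd.DataFrame({"CountryName": [" US", "Zambia ", "india"], 1960: [1, 2, 3]})
regionref = pd.DataFrame({
    "Country": ["US", "Zambia", "India"],
    "Region": ["West Hem", "Africa", "Asia"],
    "GDP": [5, 4, 3],
})

# The raw values do not match at all, which reproduces the all-False result.
print(population["CountryName"].isin(regionref["Country"]).any())

def normalize(names: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and ignore case."""
    return names.str.strip().str.casefold()

population["key"] = normalize(population["CountryName"])
regionref["key"] = normalize(regionref["Country"])

# Left merge keeps every population row and adds Region where the key matches.
merged = (
    population.merge(regionref[["key", "Region"]], on="key", how="left")
    .drop(columns="key")
)
print(merged)
```

If the cleaned keys still do not match, printing population["CountryName"].unique() with repr can reveal invisible characters or alternative spellings.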

Pandas iterate over rows and find the column names

I have two dataframes:
df = pd.DataFrame({'America':["Ohio","Utah","New York"],
'Italy':["Rome","Milan","Venice"],
'Germany':["Berlin","Munich","Jena"]});
df2 = pd.DataFrame({'Cities':["Rome", "New York", "Munich"],
'Country':["na","na","na"]})
I want to iterate over the df2 "Cities" column, find each city in df, and write the country of that city (the df column name) to the df2 "Country" column.
Use melt with map by dictionary:
df1 = df.melt()
print (df1)
variable value
0 America Ohio
1 America Utah
2 America New York
3 Italy Rome
4 Italy Milan
5 Italy Venice
6 Germany Berlin
7 Germany Munich
8 Germany Jena
df2['Country'] = df2['Cities'].map(dict(zip(df1['value'], df1['variable'])))
#alternative, thanks #Sandeep Kadapa
#df2['Country'] = df2['Cities'].map(df1.set_index('value')['variable'])
print (df2)
Cities Country
0 Rome Italy
1 New York America
2 Munich Germany
After melting and renaming the first dataframe:
df1 = df.melt().rename(columns={'variable': 'Country', 'value': 'Cities'})
the solution is a simple merge:
df2 = df2[['Cities']].merge(df1, on='Cities')
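Note that merge defaults to an inner join, so a city that is missing from df would be dropped from the result. A minimal sketch that keeps such rows by using how='left'; the extra city "Paris" and the names lookup and out are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "America": ["Ohio", "Utah", "New York"],
    "Italy": ["Rome", "Milan", "Venice"],
    "Germany": ["Berlin", "Munich", "Jena"],
})
# "Paris" is a hypothetical city that does not appear anywhere in df.
df2 = pd.DataFrame({"Cities": ["Rome", "New York", "Munich", "Paris"]})

# Long lookup table: one row per (Country, Cities) pair.
lookup = df.melt(var_name="Country", value_name="Cities")

# how="left" keeps every row of df2; unmatched cities get NaN as Country.
out = df2.merge(lookup, on="Cities", how="left")
print(out)
```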

pandas fill missing country values based on city if it exists

I'm trying to fill in null country names in my dataframe based on other rows where the same city already has a country. For example, in the dataframe below I want to replace the NaN for City Bangalore with Country India, provided that city exists elsewhere in the dataframe.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1['Country'] = df1.groupby('City')['Country'].ffill()
should resolve your issue by forward filling missing values within each group.
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one nasty way to do it.
First use forward fill and then backward fill within each city (in case a NaN occurs first):
df['Country'] = df.groupby('City')['Country'].transform(lambda s: s.ffill().bfill())
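As a further variant of the group-wise idea, the first non-null country of each city can be broadcast back to every row of that city. A minimal sketch on made-up data where the NaN comes first in its group; "Paris" is a hypothetical city with no known country at all:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "City": ["Bangalore", "Delhi", "Bangalore", "Paris"],
    "Country": [np.nan, "India", "India", np.nan],
})

# transform("first") returns, for every row, the first non-null Country in its City group.
first_known = df1.groupby("City")["Country"].transform("first")

# Only fill the gaps; existing values are left untouched.
df1["Country"] = df1["Country"].fillna(first_known)
print(df1)
```

Cities that never appear with a country, like the hypothetical "Paris" row, stay missing.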
