Example Usage
To use renewenergy in a project:
import renewenergy
print(renewenergy.__version__)
0.1.1
Dummy Example Setup
Imports
from renewenergy.clean_data import clean_data
from renewenergy.create_scatter_plots import create_scatter_plots
from renewenergy.reading_data import reading_data
from renewenergy.impute_split import impute_split
from renewenergy.split_xy_columns import split_xy_columns
from renewenergy.plot_rmse import plot_rmse
import pandas as pd
from matplotlib import pyplot as plt
Creating Dummy Datasets
The code below creates a dummy dataset that can be used to demonstrate use of functions within this package.
test_data= {'Variable 1':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Variable 1':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Variable 1':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
'Variable 1':['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
"Renewable electricity output (% of total electricity output)":['A1', 'B1', 'C1', 'D1','E1','F1', 'G1', 'H1', 'I1','J1']}
test_df= pd.DataFrame(test_data)
test_df.head()
| Variable 1 | Renewable electricity output (% of total electricity output) | |
|---|---|---|
| 0 | A | A1 |
| 1 | B | B1 |
| 2 | C | C1 |
| 3 | D | D1 |
| 4 | E | E1 |
Function Usage & Examples
reading_data()
The reading_data function reads a specified data file in from a URL containing a zip file.
reading_data(url, data_file, data_path, file_name)
Parameters
url : str
URL of link containing zip file.
data_file: str
Specified file in .ZIP that data is to be extracted from.
data_path: str
Directory to which the imported data should be saved to.
file_name: str
Name of file that imported data will be saved to.
Returns
file_name.csv
CSV file that data is saved to.
Example
Below is a URL containing a zip file with a CSV file containing target data. The function reading_data will read in the target data and save it to a CSV named downloaded.csv in the data directory.
url= "https://github.com/DSCI-310-2024/renewenergy/raw/makingdocumentation/tests/docs_data.zip"
imported_data= reading_data(url, "testing_data.csv", "data", "imported.csv")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[4], line 3
1 url= "https://github.com/DSCI-310-2024/renewenergy/raw/makingdocumentation/tests/docs_data.zip"
----> 3 imported_data= reading_data(url, "testing_data.csv", "data", "imported.csv")
File ~/checkouts/readthedocs.org/user_builds/renewenergy/checkouts/latest/src/renewenergy/reading_data.py:60, in reading_data(url, data_file, data_path, file_name)
58 path = pathlib.Path(data_path+"/"+file_name)
59 if os.path.exists(path):
---> 60 raise ValueError("The filename already exists.")
61 else:
62 return data.to_csv(data_path+"/"+file_name)
ValueError: The filename already exists.
clean_data()
The clean_data function performs all cleaning steps on the dataset for analysis.
clean_data(dataread,dataout,datafile1, datafile2, seed)
Parameters
dataread: str
Path to dataset
dataout: str
Path to save training and testing datasets to.
datafile1: str
Name of CSV file to save test data to.
datafile2: str
Name of CSV file to save training data to.
seed: int (optional)
Used to allow for reproduceability of results.
Returns
training.csv
CSV containing the training data
test.csv
CSV containing the test data
Example
The function, clean_data will take tidy the data loaded below into tidy format. Within this function, the data will be split with a certain seed (accomplished by impute_split, which the use is demonstrated below.
cleaned_data= clean_data("data/imported.csv", "data", "energy_test.csv", "energy_train.csv",12)
split_data_train,split_data_train_test = impute_split(test_df, 0, 0.5)
print(split_data_train)
Variable 1 Renewable electricity output (% of total electricity output)
6 G G1
7 H H1
3 D D1
0 A A1
5 F F1
print(split_data_train_test)
Variable 1 Renewable electricity output (% of total electricity output)
2 C C1
8 I I1
4 E E1
9 J J1
1 B B1
create_scatter_plots()
The create_scatter_plots function creates scatter plots for given columns against a common y-column.
create_scatter_plots(data, x_columns, y_column, nrows, ncols, figsize)
Parameters
data: DataFrame
The data to plot
x_columns: list
A list of column names for the x-axis
y_column: str
The column name for the y-axis
nrows: int
Number of rows in the subplot grid
ncols: int
Number of columns in the subplot grid
figsize: tuple (optional)
Figure size
Returns
matplotlib.figure.Figure:
The created figure
Example
This function will aid in EDA, after identifying the columns to be plotted against the target variable, clean plots will be generated to further understand the relationship between variables.
from matplotlib import pyplot as plt
data = pd.read_csv("data/energy_train.csv")
toprow = ['Access to electricity (% of population)', 'CO2 emissions (kt)']
bottomrow = ['Land area (sq. km)', 'Death rate, crude (per 1,000 people)']
#combine all columns for x-axis in a single list
x_columns = toprow + bottomrow
#specify the common y-axis column for all plots
y_column = 'Renewable electricity output (% of total electricity output)'
#scatter plots
fig = create_scatter_plots(data, x_columns, y_column, nrows=2, ncols=2)
split_xy_columns()
The split_xy_columns function splist a dataset into X (explanatory variables) and Y (target variables)
split_xy_columns(dataset)
Parameters
dataset : pandas dataframe
Returns
dataset.x
Pandas dataframe containing explanatory variables
dataset.y
Pandas dataframe containing target variable
Example
This function will create the datasets with and without the target variable, which makes it easy to compare predicted vs actual values.
split_data_x,split_data_y = split_xy_columns(data)
split_data_x
| Access to electricity (% of population) | Adjusted net national income (constant 2015 US$) | CO2 emissions (kt) | Death rate, crude (per 1,000 people) | Land area (sq. km) | PM2.5 air pollution, mean annual exposure (micrograms per cubic meter) | Population, total | Renewable energy consumption (% of total final energy consumption) | |
|---|---|---|---|---|---|---|---|---|
| 0 | 100.000000 | 1.155276e+09 | 499.80 | 7.500 | 460.0 | 15.313468 | 93419.0 | 1.36 |
| 1 | 100.000000 | 5.297312e+11 | 565190.10 | 2.542 | 2149690.0 | 66.820874 | 32749848.0 | 0.01 |
| 2 | 100.000000 | 3.281082e+10 | 13139.70 | 9.600 | 20142.6 | 17.920895 | 2063531.0 | 21.38 |
| 3 | 55.100000 | 1.178042e+09 | 304.10 | 5.260 | 27990.0 | 12.490086 | 612660.0 | 48.64 |
| 4 | 98.761360 | 1.428836e+11 | 73314.40 | 13.200 | 230080.0 | 16.094924 | 19815616.0 | 23.67 |
| 5 | 100.000000 | 0.000000e+00 | 0.00 | 5.400 | 34.0 | 0.000000 | 38825.0 | 0.05 |
| 6 | 98.155426 | 7.326220e+08 | 232.30 | 5.281 | 2830.0 | 11.016346 | 203571.0 | 37.47 |
| 7 | 22.800000 | 6.950738e+09 | 1080.44 | 6.724 | 24670.0 | 38.642417 | 11642959.0 | 86.31 |
| 8 | 95.500000 | 1.083893e+12 | 1592559.40 | 13.000 | 16376870.0 | 12.874773 | 144096870.0 | 3.20 |
| 9 | 99.900000 | 3.252919e+10 | 45410.10 | 14.600 | 87460.0 | 27.892525 | 7095383.0 | 21.24 |
| 10 | 85.300000 | 2.865126e+11 | 425063.10 | 9.259 | 1213090.0 | 29.235031 | 55876504.0 | 7.73 |
| 11 | 100.000000 | 7.156612e+10 | 30753.70 | 9.900 | 48082.0 | 19.115558 | 5423801.0 | 13.41 |
plot_rmse()
The plot_rmse function performs linear regression and plots the results on a graph containing Expected vs Predicted observations
plot_rmse(training_data_path, test_data_path, output_path)
Parameters
training_data_path: str
Path to training data .csv file
test_data_path: str
Path to test data .csv file
output_path: str
Directory to which the figure should be saved to
Returns
results.png:
Figure containing the Predicted vs Expected Values of the linear regression
Example
This function will calculate the RMSE values and plot the expected vs predicted values for each value in the testing set, based on the model created with the training data.
results= plot_rmse("data/energy_train.csv", "data/energy_test.csv", "results/lin_reg_results.png")
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[16], line 1
----> 1 results= plot_rmse("data/energy_train.csv", "data/energy_test.csv", "results/lin_reg_results.png")
File /opt/anaconda3/envs/dsci_310_14/lib/python3.12/site-packages/renewenergy/plot_rmse.py:87, in plot_rmse(training_data_path, test_data_path, output_path)
83 energy_test = pd.read_csv(test_data_path)
85 #splitting the x and y columns of the data
---> 87 energy_train_x, energy_train_y = split_xy_columns(energy_train)
88 energy_test_x, energy_test_y = split_xy_columns(energy_test)
90 #making the linear model
NameError: name 'split_xy_columns' is not defined