
Learn Data Analysis with Julia

 
Julia is a programming language that combines the speed of low-level languages like C with the simplicity of Python. It is gaining popularity in the data science space, so if you want to expand your portfolio and learn a new language, you have come to the right place. 
In this tutorial, we will set up Julia for data science, load a dataset, perform data analysis, and visualize the results. The tutorial is simple enough that anyone, even a student, can start analyzing data with Julia in five minutes. 
 
1. Setting Up Your Environment
 

Download and install Julia from julialang.org. 
Next, we need to set up Julia for Jupyter Notebook. Launch a terminal (PowerShell), type `julia` to start the Julia REPL, and then run the following command. 

using Pkg
Pkg.add("IJulia")

 

Launch the Jupyter Notebook and start the new notebook with Julia as Kernel.
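If you prefer to stay inside Julia, IJulia can also launch Jupyter for you; the first call may offer to install Jupyter through Conda:

```julia
using IJulia

# Opens Jupyter Notebook in the browser with the Julia kernel available
notebook()
```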
Create the new code cell and type the following command to install the necessary data science packages. 

using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("Plots")
Pkg.add("Chain")

 
2. Loading Data
 
For this example, we are using the Online Sales Dataset from Kaggle. It contains data on online sales transactions across different product categories.
We will load the CSV file into a DataFrame, which is similar to a Pandas DataFrame. 

using CSV
using DataFrames

# Load the CSV file into a DataFrame
data = CSV.read("Online Sales Data.csv", DataFrame)

 
3. Exploring Data
 
We will use the `first` function instead of `head` to view the top 5 rows of the DataFrame. 
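Assuming the DataFrame loaded above is bound to `data`, this looks like:

```julia
# Display the first 5 rows of the DataFrame
first(data, 5)
```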

 

 
To generate the data summary, we will use the `describe` function. 
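Applied to our DataFrame, this is a single call:

```julia
# Per-column summary: mean, min, median, max, missing count, and element type
describe(data)
```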

 

 
Similar to a Pandas DataFrame, we can view a specific value by providing the row number and column name.
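For example, using the "Product Category" column that appears later in this tutorial:

```julia
# Value at row 3 of the "Product Category" column
data[3, :"Product Category"]

# An entire column as a vector
data[!, :"Product Category"]
```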

Output:

 
4. Data Manipulation
 
We will use the `filter` function to select rows based on a condition. It takes a predicate function and the DataFrame. 

filtered_data = filter(row -> row[:"Unit Price"] > 230, data)
last(filtered_data, 5)

 

 
We can also create a new column similar to Pandas. It is that simple. 

data[!, :"Total Revenue After Tax"] = data[!, :"Total Revenue"] .* 0.9
last(data, 5)

 

 
Now, we will calculate the mean of "Total Revenue After Tax" for each "Product Category". 

using Statistics

grouped_data = groupby(data, :"Product Category")
aggregated_data = combine(grouped_data, :"Total Revenue After Tax" => mean)
last(aggregated_data, 5)

 

 
5. Visualization
 
Plotting in Julia feels similar to Seaborn. Here we visualize the aggregated data as a bar chart by providing the X and Y columns, a title, and axis labels. 

using Plots

# Basic plot
bar(aggregated_data[!, :"Product Category"],
    aggregated_data[!, :"Total Revenue After Tax_mean"],
    title="Product Analysis",
    xlabel="Product Category",
    ylabel="Total Revenue After Tax Mean")

 
Electronics generates the majority of the mean total revenue. The chart is clean and easy to read.   
 
 
To generate a histogram, we only need to provide the column and the axis labels. Here we visualize the frequency of the number of units sold. 

histogram(data[!, :"Units Sold"], title="Units Sold Analysis", xlabel="Units Sold", ylabel="Frequency")

 

 
It seems like the majority of people bought one or two items. 
To save the visualization, we will use the `savefig` function.
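A minimal sketch, assuming the file name `units_sold.png`:

```julia
# Save the most recent plot to disk; the format is inferred from the extension
savefig("units_sold.png")
```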

 
6. Creating a Data Processing Pipeline
 
Creating a proper data pipeline is necessary to automate data processing workflows, ensure data consistency, and enable scalable and efficient data analysis.
We will use the `Chain` package to chain together the functions used earlier to calculate the mean total revenue per product category. 

using Chain

# Example of a simple data processing pipeline
processed_data = @chain data begin
    filter(row -> row[:"Unit Price"] > 230, _)
    groupby(_, :"Product Category")
    combine(_, :"Total Revenue" => mean)
end

first(processed_data, 5)

 

 
To save the processed DataFrame as a CSV file, we will use the `CSV.write` function. 

CSV.write("output.csv", processed_data)

 
Conclusion
 
In my opinion, Julia is simpler and faster than Python for this kind of work. Much of the syntax and many of the functions I am used to from Pandas, Seaborn, and Scikit-Learn have close equivalents in Julia. So, why not learn a new language and sharpen your toolkit? Julia is also popular in scientific and research computing, so it can help you land research-related roles. 
In this tutorial, we learned how to set up the Julia environment, load a dataset, perform data analysis and visualization, and build a data pipeline for reproducibility and reliability. If you are interested in learning more about Julia for data science, please let me know so I can write more simple tutorials for you.  
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
