Linux & the Shell — Open-Ended Exercises
How to Effectively Work at the Command Line
Tip: Unless stated otherwise, exercises assume a CSV named
penguins.csv (with a header) in the working directory. Exercise 0 shows how to download one from the internet. Answers are hidden; click to reveal.
If you do not have a Linux/Unix-based machine (e.g., you are on Windows), you can open a GitHub Codespace for one of your repositories and work through the exercises there.
0) Download a CSV from the internet (and name it penguins.csv)
Question. Use a command-line tool to download a CSV and save it as penguins.csv. Verify it looks like a CSV and preview the first few lines.
The file is available at https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
Show solution
# Using curl (follow redirects, write to file):
curl -L -o penguins.csv \
https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
# Or using wget:
wget -O penguins.csv \
https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
# (Optional) R from shell:
R -q -e "download.file('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv','penguins.csv', mode='wb')"
# (Optional) Python from shell:
python - <<'PY'
import urllib.request
url='https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv'
urllib.request.urlretrieve(url, 'penguins.csv')
PY
# Basic checks
file penguins.csv
head -n 5 penguins.csv
wc -l penguins.csv   # total lines (incl. header)
- -L (curl) follows redirects; -o (curl) and -O (wget) set the output filename.
- Use head, wc -l, and cut -d',' -f1-5 | head to quickly sanity-check the download.
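If the download looks right, a quick peek at the first few columns helps confirm the structure; this sketch assumes the standard palmerpenguins header (species, island, bill measurements, ...).
# Peek at the first five columns of the first few rows:
cut -d',' -f1-5 penguins.csv | head -n 5
# List the header fields, one per line:
head -n 1 penguins.csv | tr ',' '\n'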
1) Where am I? What’s here?
Question. Print your current directory and list files with sizes and hidden entries.
Show solution
pwd
ls -lha
2) Create a working area
Question. Make a folder shell_practice and change into it. Create notes.md.
Show solution
mkdir -p shell_practice && cd shell_practice
: > notes.md          # or: touch notes.md
3) Count rows in an uncompressed CSV (skip header)
Question. Count the number of data rows (exclude the header line) in penguins.csv.
Show solution
# total lines minus header
total=$(wc -l < penguins.csv)
echo $(( total - 1 ))
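# or with awk (NR is the total record count; subtract 1 for the header):
awk 'END{print NR-1}' penguins.csv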
# or with tail:
tail -n +2 penguins.csv | wc -l
4) Compress a CSV with gzip and pigz
Question. Create penguins.csv.gz using (a) gzip and (b) pigz. Compare time and file size.
Show solution
# (a) Using gzip
time gzip -kf penguins.csv # -k keep original, -f overwrite
ls -lh penguins.csv penguins.csv.gz
# (b) Using pigz (parallel gzip)
# If missing, install via your package manager (e.g., apt, brew, conda).
time pigz -kf penguins.csv
ls -lh penguins.csv penguins.csv.gz
# Inspect compressed vs uncompressed byte counts
gzip -l penguins.csv.gz
pigz uses multiple cores, so it is faster on large files; it produces the same gzip format, so the compression ratio is essentially the same as gzip's.
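To see the effect of parallelism directly, pigz's -p flag sets the number of threads; a minimal sketch (differences only show up on files much larger than penguins.csv):
time pigz -p 1 -kf penguins.csv
time pigz -p 4 -kf penguins.csv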
5) Count rows in a compressed CSV (.gz)
Question. Count data rows in penguins.csv.gz without fully decompressing to disk.
Show solution
# Using gzip’s decompressor:
gzip -cd penguins.csv.gz | tail -n +2 | wc -l
# Using pigz if available:
pigz -dc penguins.csv.gz | tail -n +2 | wc -l
# Using zcat (often symlinked to gzip -cd):
zcat penguins.csv.gz | tail -n +2 | wc -l
-c writes to stdout; -d decompresses. tail -n +2 skips the header.
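Related convenience wrappers that ship with gzip on most systems: zless pages a compressed file and zgrep searches it, again without writing a decompressed copy (the search pattern here is just an example):
zless penguins.csv.gz              # page through the compressed file
zgrep -c 'Adelie' penguins.csv.gz  # count lines matching a pattern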
6) Quick column exploration with cut, head, sort, uniq
Question. Show the header plus the first 5 species values (column 1), and the distinct sex values (column 7) with counts.
Show solution
head -n 6 penguins.csv | cut -d',' -f1                     # header + first 5 species values
cut -d',' -f7 penguins.csv | tail -n +2 | sort | uniq -c   # counts per sex (NA = missing)
7) Filter rows by a condition with awk
Question. Count penguins with flipper length >= 200 mm. Compute the mean body mass.
Show solution
# flipper_length_mm >= 200 (5th column)
awk -F',' 'NR>1 && $5 >= 200 {c++} END{print c+0}' penguins.csv
# mean body_mass_g (6th column), skipping missing (NA) values
awk -F',' 'NR>1 && $6!="NA" {s+=$6; n++} END{print s/n}' penguins.csv
8) Find rows with missing values in any field
Question. Count how many data rows contain a missing field (empty, or coded as NA as in this CSV).
Show solution
# simple heuristic for empty fields: consecutive delimiters or a trailing comma
grep -E ',,|,$' penguins.csv | wc -l
# more thorough: any field that is empty or coded as NA
awk -F',' 'NR>1{for(i=1;i<=NF;i++) if($i=="" || $i=="NA"){m++; break}} END{print m+0}' penguins.csv
9) Save the first 20 species values to a file
Question. Write the first 20 species values (not including the header) to sample_species.txt.
Show solution
tail -n +2 penguins.csv | cut -d',' -f1 | head -n 20 > sample_species.txt
wc -l sample_species.txt   # should be 20
10) Chain operations with pipes
Question. Among male penguins, show counts by year (8th column) in ascending order.
Show solution
awk -F',' 'NR>1 && $7=="male"{print $8}' penguins.csv | sort -n | uniq -c
11) Make the analysis reproducible with a script
Question. Create analyze.sh that prints: total data rows, rows with flipper length ≥ 200 mm, and the mean body mass. Run it.
Show solution
cat > analyze.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
csv="${1:-penguins.csv}"
echo "File: $csv"
echo -n "Total data rows: "
tail -n +2 "$csv" | wc -l
echo -n "Age >= 60 rows: "
awk -F',' 'NR>1 && $3 >= 60 {c++} END{print c+0}' "$csv"
echo -n "Mean BMI: "
awk -F',' 'NR>1 {s+=$4; n++} END{print s/n}' "$csv"
EOF
chmod +x analyze.sh
./analyze.sh penguins.csv
12) Script for compressed input
Question. Modify your script so it also accepts penguins.csv.gz seamlessly.
Show solution
cat > analyze_any.sh << 'EOF'
#!/usr/bin/env bash
set -euo pipefail
f="${1:-penguins.csv}"
stream() {
  case "$f" in
    *.gz) gzip -cd "$f" ;;
    *)    cat "$f" ;;
  esac
}
# Skip header once:
data="$(stream | tail -n +2)"
echo "File: $f"
echo "Total data rows: $(printf "%s\n" "$data" | wc -l)"
echo "Age >= 60 rows: $(printf "%s\n" "$data" | awk -F',' '$3>=60{c++} END{print c+0}')"
echo "Mean BMI: $(printf "%s\n" "$data" | awk -F',' '{s+=$4; n++} END{print s/n}')"
EOF
chmod +x analyze_any.sh
./analyze_any.sh penguins.csv.gz
13) Record a reproducible terminal session
Question. Record your workflow to session.log and preview it.
Show solution
script -q session.log
# …run a few commands…
exit
less session.log
14) One-liners for large files
Question. Show the uncompressed byte size of penguins.csv.gz without fully inflating it; then estimate memory needed to load the CSV.
Show solution
# Uncompressed and compressed sizes (bytes):
gzip -l penguins.csv.gz
# Rough row count without header (streaming):
rows=$(gzip -cd penguins.csv.gz | tail -n +2 | wc -l)
echo "Rows: $rows"gzip -l reports compressed and uncompressed sizes; not row count. - Memory needs depend on parsing overhead; this is only an order-of-magnitude check.
15) Remote/HPC touchpoint (optional)
Question. Copy your CSV to a remote machine and check its line count there.
Show solution
scp penguins.csv user@server:~/data/
ssh user@server 'wc -l ~/data/penguins.csv'
16) Parallel compression benchmarking (optional, larger files)
Question. Compare wall-clock time for gzip vs pigz on a large file.
Show solution
# Create a larger file by duplication (demo only):
awk 'NR==1 || FNR>1' penguins.csv penguins.csv penguins.csv penguins.csv penguins.csv \
  > big.csv   # keep the header from the first file; skip it in the others
# Benchmark (prints elapsed time)
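# (/usr/bin/time -f is GNU time syntax; on macOS/BSD the -f flag is not available, so use the shell's built-in time instead)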
/usr/bin/time -f "gzip: %E" gzip -kf big.csv
/usr/bin/time -f "pigz: %E" pigz -kf big.csv
ls -lh big.csv big.csv.gz
17) Sanity checks and integrity
Question. Verify that the compressed and uncompressed files have identical content checksums.
Show solution
md5sum penguins.csv
gzip -c penguins.csv | md5sum # checksum of compressed stream (different)
gzip -cd penguins.csv.gz | md5sum # checksum of decompressed content stream
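# (sha256sum follows the same pattern if you prefer a stronger hash)
sha256sum penguins.csv
gzip -cd penguins.csv.gz | sha256sum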
# To compare content equality, read both through stdin so the filename field matches:
md5sum < penguins.csv > a.md5
gzip -cd penguins.csv.gz | md5sum > b.md5
diff a.md5 b.md5   # no output => identical content
18) Find an Answer in the Shell you didn’t know
Question. Either ask a question about the shell that you do not yet know the answer to, describe one thing you'd like your terminal to do more easily, or figure out something with data (e.g., pull one column from a CSV file in bash/shell).
Share this question/solution with the instructors and the class.