PDF Accessibility – An Overlooked Barrier in Learning Content. At eLaHub, we often find PDFs uploaded to the eLearning courses we audit. Many of them contain accessibility barriers that impact learners with disabilities and access needs who use screen readers, magnification, or text-to-speech tools. Because PDF accessibility can be so challenging for L&D teams, I recommend checking out this hugely useful guidance on PDF accessibility. It's focused on Higher and Further Education but is equally relevant to workplace learning resources. 📌 Check out the guide here: https://lnkd.in/etDBS2WV Many thanks to Alistair McNaught and the team of accessibility leads from a range of UK universities for putting this together. It’s Creative Commons (CC-BY-NC), so feel free to share and use it in your training. #eLearning #accessibility #eLearningAccessibility #DigitalInclusion (A table titled 'Key issues for disabled users' summarising common accessibility challenges with PDFs. It has three columns: 'Issue', 'Typical user experience', and 'Who is affected?'. The rows list issues such as missing text in PDFs, improper heading styles, lack of image descriptions, poor hyperlink practices, unexpected reading order, poor colour use, and unmarked table headers, and describe how each affects users of screen readers, text-to-speech tools, and magnification software, as well as users with visual access needs and neurodivergent conditions.)
Organizing Digital Files Efficiently
Explore top LinkedIn content from expert professionals.
-
I know some of you might be thinking, "Why is Brij sharing another Linux post?" 😅 But let me assure you—there’s a good reason. No matter where you stand in your tech career, Linux is the foundation of so much of what we do in this industry. From DevOps to AI Engineering, Linux is an essential skill! I have created this Linux Filesystem Hierarchy infographic as part of my ongoing series. The filesystem is the beating heart of the Linux OS, and understanding it is critical. Let’s break it down:
📂 /𝗯𝗶𝗻: Home to essential user commands like 𝚕𝚜 and 𝚌𝚙. Think of it as the toolbox for day-to-day tasks.
📂 /𝗯𝗼𝗼𝘁: Contains critical files to boot the operating system. Without this, your system isn’t going anywhere!
📂 /𝗱𝗲𝘃: Represents device files for hardware like disks and peripherals. Understanding this is key for hardware troubleshooting.
📂 /𝗲𝘁𝗰: The nerve center of system-wide configurations for applications and services.
📂 /𝗵𝗼𝗺𝗲: Where all user directories live—your personal files and settings reside here.
📂 /𝗹𝗶𝗯: Shared libraries and kernel modules essential for running software and the OS itself.
📂 /𝗺𝗲𝗱𝗶𝗮: Mount point for removable devices like USB drives.
📂 /𝗺𝗻𝘁: Temporary mount point for external filesystems during maintenance or setup.
📂 /𝗼𝗽𝘁: Houses third-party or add-on software—great for custom installations.
📂 /𝗽𝗿𝗼𝗰: A virtual filesystem providing real-time kernel and process information. It’s like a live feed of what your system is doing.
📂 /𝗿𝗼𝗼𝘁: The home directory for the system’s root user. Admin-level actions start here.
📂 /𝗿𝘂𝗻: Temporary runtime data storage for active system processes.
📂 /𝘀𝗯𝗶𝗻: Essential administrative commands, like 𝚏𝚜𝚌𝚔 and 𝚛𝚎𝚋𝚘𝚘𝚝.
📂 /𝘀𝗿𝘃: Stores data for system services, such as FTP or web servers.
📂 /𝘀𝘆𝘀: Another virtual filesystem, exposing system hardware and device details.
📂 /𝘁𝗺𝗽: Temporary storage for applications. Files here are usually cleared on reboot.
📂 /𝘂𝘀𝗿: The largest directory, holding user utilities, binaries, and documentation—usually read-only.
📂 /𝘃𝗮𝗿: Dynamic files such as logs, cache, and other system data live here.
𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: Whether you're debugging, configuring a server, or deploying an application, understanding these directories saves time and makes you more effective. Every directory has a purpose, and knowing them is a fundamental skill for any tech professional.
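As a quick self-check, the directory roles above can be captured in a small lookup table. This is just a mnemonic sketch in Python (the one-line summaries mirror this post, not the full Filesystem Hierarchy Standard, and the `describe` helper is my own illustration):

```python
# Mnemonic map of top-level Linux directories to their roles,
# mirroring the summaries above (see the FHS for the authoritative spec).
HIERARCHY = {
    "/bin":   "essential user commands (ls, cp)",
    "/boot":  "files needed to boot the OS",
    "/dev":   "device files for disks and peripherals",
    "/etc":   "system-wide configuration",
    "/home":  "per-user directories and personal files",
    "/lib":   "shared libraries and kernel modules",
    "/media": "mount point for removable devices (USB drives)",
    "/mnt":   "temporary mount point for external filesystems",
    "/opt":   "third-party / add-on software",
    "/proc":  "virtual filesystem: live kernel and process info",
    "/root":  "home directory of the root user",
    "/run":   "runtime data for active processes",
    "/sbin":  "essential administrative commands",
    "/srv":   "data served by system services (FTP, web)",
    "/sys":   "virtual filesystem: hardware and device details",
    "/tmp":   "temporary files, usually cleared on reboot",
    "/usr":   "user utilities, binaries, docs (mostly read-only)",
    "/var":   "variable data: logs, cache, spool",
}

def describe(path: str) -> str:
    """Look up the role of the top-level directory containing `path`."""
    top = "/" + path.strip("/").split("/")[0]
    return HIERARCHY.get(top, "unknown: not a standard top-level directory")
```

For example, `describe("/etc/nginx/nginx.conf")` resolves to the `/etc` summary, which is exactly the reasoning you do in your head when hunting for a config file.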
-
Data compression isn’t complicated. It boils down to 5 things:
1. Choose the right data types
• Don’t store an integer as a string.
• Use INT instead of BIGINT when possible.
• Use BOOLEAN, DATE, and ENUM over freeform strings.
Smaller types = less storage = faster scans.
2. Use Parquet or ORC format
• Columnar formats compress better than row-based formats.
• Only reading the columns you need = faster queries.
Great for analytical workloads.
3. Enable encoding schemes
• Run-Length Encoding (RLE): stores repeated sequences efficiently. Perfect for repeated values (like country='US'), and the most important scheme for analytical data.
• Dictionary Encoding: maps repeated strings to integers. Amazing for fields like status, region, etc.
• Delta Encoding: stores the difference between consecutive values. Ideal for timestamps or sorted numeric data.
4. Partition and sort intelligently
• Partitioning by high-cardinality columns = bad idea.
• Sort data to maximize compression (e.g., sort logs by timestamp and user_id before writing).
Sorted + Encoded + Columnar = extremely fast at any size.
5. Use compression codecs wisely
• Snappy: fast, lower compression ratio.
• ZSTD: slower, better compression.
• GZIP: CPU-intensive, but supported everywhere.
For compute-bound workloads, prioritize speed. For storage-bound workloads, prioritize compression ratio. Most analytical workloads are storage-bound, so prioritize compression ratio with ZSTD.
What would you add here?
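To make the three encoding schemes concrete, here is a minimal pure-Python sketch of each. This is illustrative only; columnar engines like Parquet and ORC implement these natively and far more efficiently:

```python
from itertools import groupby

def rle_encode(values):
    """Run-Length Encoding: collapse runs into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def dict_encode(values):
    """Dictionary Encoding: replace repeated strings with small integers."""
    table = {}
    codes = [table.setdefault(v, len(table)) for v in values]
    dictionary = sorted(table, key=table.get)  # position = integer code
    return codes, dictionary

def delta_encode(nums):
    """Delta Encoding: keep the first value, then store the differences."""
    return nums[:1] + [b - a for a, b in zip(nums, nums[1:])]
```

So `["US", "US", "US", "UK"]` becomes `[("US", 3), ("UK", 1)]` under RLE, and a sorted timestamp column turns into a first value plus a stream of tiny deltas, which is exactly why sorting before writing pays off.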
-
Git Lifecycle for Data Engineers: Think in Pipelines ⚙️ From dev to production, Git is the “Data Lineage” for your infrastructure. If you build data pipelines, you already understand Git. The flow is almost the same.
𝗪𝗼𝗿𝗸𝗶𝗻𝗴 𝗗𝗶𝗿𝗲𝗰𝘁𝗼𝗿𝘆: Your raw zone. Files change, experiments happen, nothing is locked yet.
𝗦𝘁𝗮𝗴𝗶𝗻𝗴 𝗔𝗿𝗲𝗮: git add marks what should move forward. Like selecting the clean batch before loading.
𝗟𝗼𝗰𝗮𝗹 𝗥𝗲𝗽𝗼: git commit -m "msg" stores a snapshot. Clear history. Easy rollback.
𝗥𝗲𝗺𝗼𝘁𝗲 𝗥𝗲𝗽𝗼: Shared source of truth. git push sends your work. git pull syncs with the team.
Know these common commands you’ll use daily:
• git add → stage changes
• git commit -m → save a snapshot
• git commit -a -m → stage + commit tracked files
• git push → send to remote
• git fetch → download updates only
• git pull → fetch + merge
• git merge → combine branches
• git diff → inspect changes anytime
Image credits: Brij Kishore Pandey
Follow the data engineer’s rule: commit like pipeline checkpoints — small, clear, reversible. Version control isn’t just for devs. It’s how data teams ship with confidence. 🔁
-
LlamaIndex just unveiled a new approach involving AI agents for reliable document processing, from invoices to insurance claims and contract reviews. LlamaIndex’s new architecture, Agentic Document Workflows (ADW), goes beyond basic retrieval and extraction to orchestrate end-to-end document processing and decision-making. Imagine a contract review workflow: you don't just parse terms, you identify potential risks, cross-reference regulations, and recommend compliance actions. This level of coordination requires an agentic framework that maintains context, applies business rules, and interacts with multiple system components. Here’s how ADW works at a high level: (1) Document parsing and structuring – using robust tools like LlamaParse to extract relevant fields from contracts, invoices, or medical records. (2) Stateful agents – coordinating each step of the process, maintaining context across multiple documents, and applying logic to generate actionable outputs. (3) Retrieval and reference – tapping into knowledge bases via LlamaCloud to cross-check policies, regulations, or best practices in real time. (4) Actionable recommendations – delivering insights that help professionals make informed decisions rather than just handing over raw text. ADW provides a path to building truly “intelligent” document systems that augment rather than replace human expertise. From legal contract reviews to patient case summaries, invoice processing, and insurance claims management, ADW supports human decision-making with context-rich workflows rather than one-off extractions. Ready-to-use notebooks: https://lnkd.in/gQbHTTWC More open-source tools for AI agent developers in my recent blog post: https://lnkd.in/gCySSuS3
-
Ever wondered how a search engine like 𝗚𝗼𝗼𝗴𝗹𝗲 or 𝗕𝗶𝗻𝗴 finds results in milliseconds? It’s one of the most misunderstood system design problems - and it’s more relevant than ever for interviews and real-world roles. Let’s break it down simply.
𝗧𝗵𝗲 𝗣𝗿𝗼𝗯𝗹𝗲𝗺
You're given a million documents, each ~10KB. Someone types a few keywords, and your system needs to return all matching documents instantly. How do you design this?
𝗧𝗵𝗲 𝗖𝗼𝗿𝗲 𝗜𝗱𝗲𝗮: 𝗜𝗻𝘃𝗲𝗿𝘁𝗲𝗱 𝗜𝗻𝗱𝗲𝘅
Instead of scanning every document, we pre-build a structure that works like the index at the back of a book. For each word, we store a sorted list of locations - i.e., which documents contain the word, and where. When a user searches for multiple words, we find the intersection of these lists. And since they’re sorted, we can intersect efficiently. But that’s just the start. Real speed needs real optimization. Let’s dive deeper:
1. Delta Compression
Store the difference between document IDs instead of the full IDs. Why? Smaller data → better cache usage → faster lookups.
2. Caching Frequent Queries
User queries follow a skewed pattern - a few are extremely common. Cache them and you’ll save compute for the majority of traffic.
3. Frequency-Based Indexing
Not all documents are equal. Keep high-quality/top-ranked documents in memory and the rest on disk. Most queries will hit RAM only, keeping latency low.
4. Smart Intersection Order
Always intersect the smallest sets first. If you search "INDIA GDP 2009", it’s faster to start with "GDP" and "2009" than with "INDIA".
5. Multilevel Indexing
Want better accuracy? Break documents into paragraphs or sentences and index them too. That way, matches are not just found - they’re found in context.
Why this matters: This isn't just about search engines. It’s about designing systems that handle scale, latency, and optimization - the exact thinking top tech companies test for. Mastering this gives you an edge in interviews and real-world backend design.
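The core idea plus the smallest-sets-first trick fit in a toy Python sketch. This is a teaching aid, not a production design: real engines use delta-compressed posting lists and skip-pointer or galloping intersection instead of plain set operations:

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to a sorted list of document IDs (its posting list)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, query):
    """Intersect posting lists, smallest first, to find matching docs."""
    lists = sorted((index.get(w, []) for w in query.lower().split()), key=len)
    if not lists or not lists[0]:
        return []                     # an empty list means no match is possible
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)       # toy intersection; real engines merge sorted lists
        if not result:
            break
    return sorted(result)
```

Starting with the shortest posting list keeps the running intersection as small as possible, which is exactly the "Smart Intersection Order" point above.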
-
This AI workflow architecture cut document review time from weeks to minutes for a Canadian startup.
- The workflow starts with CondoScan's property documents, both PDFs and scanned files, stored in Google Cloud Storage.
- The documents undergo OCR processing to handle any scanned content. After that, LlamaParse parses, cleans, and structures the text for later use.
- Once the text is structured, it gets stored in a Pinecone VectorDB for semantic retrieval.
- For analysis, they use Cloud Run Jobs that operate in Docker containers. Here, a FastAPI server orchestrates everything and processes the documents, sending the results to the front-end.
- The document analysis engine, powered by LlamaIndex AgentWorkflow, pulls knowledge, recognises key entities, and generates insights like risk scores.
- Finally, all the metadata, insights, and raw documents sit in MongoDB. Users interact through CondoScan, asking natural language questions in a chat interface for instant insights.
With this setup, CondoScan significantly improves the accuracy and speed of condo document review, letting buyers make better decisions faster. Link to article: https://lnkd.in/gV2WEhZk #AI #RAG #GenAI
-
How AI reduced costs by 35% for this major bank. I see many companies struggling with inefficient, labor-intensive processes. A prime example is a leading financial institution I worked with recently. According to an Accenture study, 73% of wealth management firms struggle with data fragmentation across different systems. This institution was weighed down by over 1,000 different file formats for settlement instruction data entering the organization. Manually processing this messy data took a huge team effort, causing transcription errors, translation issues, and poor data quality. To streamline things, we started by training a machine learning system to automate handling of the 35 most prevalent file formats. Within just 4 months, this AI solution was accurately processing 95% of the workload automatically at 99.9945% accuracy levels. Only 5% of formats had to be routed for manual review. The impact? A 35% reduction in the headcount needed for this workflow, with significant cost savings. But equally important were the qualitative gains: clean data and no more costly errors. This project highlights how implementing AI thoughtfully, even for a specific narrow process, can bring impressive ROI. If repetitive data drudgery is sapping your resources, it may be time to adopt an AI solution.
-
"How do you efficiently sync 50,000 jobs daily without breaking the database?" This was the challenge I faced while building Recraft. Today I'll share how we solved it using an underrated format: JSONL.
The Problem: Our job crawler processes 200,000+ jobs daily, identifying remote positions, analyzing tech stacks, and classifying roles. After filtering, we're left with 50,000 quality jobs that need to be synced with our main platform.
The Catch?
- Our crawler runs on a separate instance
- It has its own database
- It performs heavy preprocessing
- It needs to sync everything efficiently over the network
Initially, this caused:
- Memory overflows
- Network timeouts
- Slow processing
- High costs
The Solution: JSONL + Compression + Streams
Why JSONL? Traditional JSON was killing our performance:
- Loading 50k jobs into memory 😰
- A single JSON error corrupting entire payloads
- Slow processing of massive arrays
- Network timeouts with large files
JSONL (JSON Lines) solved these elegantly. Each job is a separate line.
Our Pipeline:
a) Crawler side:
- Query processed jobs
- Stream to a JSONL file
- Compress (gzip) on the fly
- Upload the compressed file (~70% smaller!)
b) Platform side:
- Download the compressed file
- Stream decompress
- Process line by line
- Update the database in batches
The Results:
- Memory usage: 1.5GB → 200MB 📉
- Processing time: 60% faster ⚡
- Network transfer: 70% smaller 🌐
- Out-of-memory errors: zero since implementation 💪
Key Learnings:
- Stream processing is crucial for large datasets
- Compression is worth the extra complexity
- Line-by-line processing beats large arrays
- Batch database updates significantly improve speed
Questions for the tech community: How do you handle large data syncs? Have you tried JSONL in production? What compression strategies work best for you?
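The gzip + JSONL + streaming combination fits in a few lines of standard-library Python. This is a stripped-down sketch of the pattern described above, not Recraft's actual code, and the function names are my own:

```python
import gzip
import json

def write_jobs_jsonl(path, jobs):
    """Stream records into a gzip-compressed JSONL file, one JSON object per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for job in jobs:
            f.write(json.dumps(job) + "\n")   # only one record in memory at a time

def read_jobs_jsonl(path, batch_size=1000):
    """Stream-decompress and yield fixed-size batches for batched DB updates."""
    batch = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:                        # gzip.open decompresses lazily
            batch.append(json.loads(line))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if batch:
        yield batch                           # flush the final partial batch
```

Because each line is an independent JSON document, a single malformed record can be skipped or logged without corrupting the rest of the payload, which is the key resilience win over one giant JSON array.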
-
Keyword search just got 10x faster by being... lazier? My amazing colleagues at Weaviate reduced keyword search time by 10x while using 90% less storage.
𝗧𝗵𝗲 𝗜𝗻𝗻𝗼𝘃𝗮𝘁𝗶𝗼𝗻
BlockMax WAND isn't just an incremental improvement - it's a fundamental rethinking of document scoring. By dividing posting lists into blocks, each tagged with its local maximum impact score, it enables a hierarchical optimisation that wasn't possible before.
𝗧𝗵𝗲 𝗡𝘂𝗺𝗯𝗲𝗿𝘀
• Traditional WAND: inspects 15-30% of documents
• BlockMax WAND: only 5-15% of documents
• Query time reduction: 80-94% faster
• Storage reduction: 50-90% smaller indices
What makes this significant is how elegantly it solves the classic space-time tradeoff. Instead of choosing between fast queries OR efficient storage, BlockMax WAND achieves both through clever compression techniques like varenc and delta encoding. The algorithm uses block-level metadata to skip entire sections without even loading them from disk. It's like having a table of contents for your index - you know exactly where NOT to look.
For researchers working on information retrieval, this opens new possibilities:
• Scaling to truly massive datasets becomes feasible
• Real-time search in production systems with strict latency requirements
• New opportunities for hybrid vector-keyword search optimisation
In a world where text corpora are growing exponentially, being able to search billions of documents efficiently isn't just nice to have. It's essential for the future of hybrid search in RAG and AI systems. This isn't just about making search faster. It's about making previously impossible search applications possible. Learn more: https://lnkd.in/eifsqgqt
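The block-skipping idea (just the "local max impact" part, not the full WAND pivot logic) can be sketched in a few lines of Python. Block size, scores, and thresholds here are made up for illustration:

```python
def make_blocks(postings, block_size=4):
    """Split a posting list of (doc_id, score) pairs into blocks,
    each tagged with its local maximum score."""
    blocks = []
    for i in range(0, len(postings), block_size):
        block = postings[i:i + block_size]
        blocks.append((max(score for _, score in block), block))
    return blocks

def top_docs(blocks, threshold):
    """Return entries that can beat the threshold, skipping whole blocks
    whose precomputed max score rules them out."""
    hits = []
    for block_max, block in blocks:
        if block_max < threshold:
            continue                  # entire block skipped, entries never inspected
        hits.extend((d, s) for d, s in block if s >= threshold)
    return hits
```

The payoff is that a block whose max score is below the current threshold is never decoded at all; in a real index those blocks would not even be read from disk.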