Python for Secret Agents - Volume II - Steven F. Lott - E-Book

Python for Secret Agents - Volume II E-Book

Steven F. Lott

Description

Python is an easy-to-learn, extensible programming language that allows any manner of secret agent to work with a variety of data. Agents from beginners to seasoned veterans will benefit from Python's simplicity and sophistication. The standard library provides numerous packages that move beyond simple beginner missions. The Python ecosystem of related packages and libraries supports deep information processing.
This book will guide you through the process of upgrading your Python-based toolset for intelligence gathering, analysis, and communication. You'll explore the ways Python is used to analyze web logs to discover the trails of activities that can be found in web and database servers. We'll also look at how we can use Python to discover details of the social network by looking at the data available from social networking websites.
Finally, you'll see how to extract history from PDF files, which opens up new sources of data, and you’ll learn about the ways you can gather data using an Arduino-based sensor device.

You can read the e-book in the Legimi apps or in any app that supports the following formats:

EPUB
MOBI

Page count: 278

Year of publication: 2015


Table of Contents

Python for Secret Agents Volume II
Credits
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. New Missions – New Tools
Background briefing on tools
Doing a Python upgrade
Preliminary mission to upgrade pip
Background briefing: review of the Python language
Using variables to save results
Using the sequence collections: strings
Using other common sequences: tuples and lists
Using the dictionary mapping
Comparing data and using the logic operators
Using some simple statements
Using compound statements for conditions: if
Using compound statements for repetition: for and while
Defining functions
Creating script files
Mission One – upgrade Beautiful Soup
Getting an HTML page
Navigating the HTML structure
Doing other upgrades
Mission to expand our toolkit
Scraping data from PDF files
Sidebar on the ply package
Building our own gadgets
Getting the Arduino IDE
Getting a Python serial interface
Summary
2. Tracks, Trails, and Logs
Background briefing – web servers and logs
Understanding the variety of formats
Getting a web server log
Writing a regular expression for parsing
Introducing some regular expression rules and patterns
Finding a pattern in a file
Using regular expression suffix operators
Capturing characters by name
Looking at the CLF
Reading and understanding the raw data
Reading a gzip compressed file
Reading remote files
Studying a log in more detail
What are they downloading?
Trails of activity
Who is this person?
Using Python to run other programs
Processing whois queries
Breaking a request into stanzas and lines
Alternate stanza-finding algorithm
Making bulk requests
Getting logs from a server with ftplib
Building a more complete solution
Summary
3. Following the Social Network
Background briefing – images and social media
Accessing web services with urllib or http.client
Who's doing the talking?
Starting with someone we know
Finding our followers
What do they seem to be talking about?
What are they posting?
Deep Under Cover – NLTK and language analysis
Summary
4. Dredging up History
Background briefing–Portable Document Format
Extracting PDF content
Using generator expressions
Writing generator functions
Filtering bad data
Writing a context manager
Writing a PDF parser resource manager
Extending the resource manager
Getting text data from a document
Displaying blocks of text
Understanding tables and complex layouts
Writing a content filter
Filtering the page iterator
Exposing the grid
Making some text block recognition tweaks
Emitting CSV output
Summary
5. Data Collection Gadgets
Background briefing: Arduino basics
Organizing a shopping list
Getting it right the first time
Starting with the digital output pins
Designing an external LED
Assembling a working prototype
Mastering the Arduino programming language
Using the arithmetic and comparison operators
Using common processing statements
Hacking and the edit, download, test and break cycle
Seeing a better blinking light
Simple Arduino sensor data feed
Collecting analog data
Collecting bulk data with the Arduino
Controlling data collection
Data modeling and analysis with Python
Collecting data from the serial port
Formatting the collected data
Crunching the numbers
Creating a linear model
Reducing noise with a simple filter
Solving problems adding an audible alarm
Summary
Index

Python for Secret Agents Volume II

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2014

Second edition: December 2015

Production reference: 1011215

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78528-340-6

www.packtpub.com

Credits

Author

Steven F. Lott

Reviewer

Shubham Sharma

Commissioning Editor

Julian Ursell

Acquisition Editor

Subho Gupta

Content Development Editor

Riddhi Tuljapurkar

Technical Editor

Danish Shaikh

Copy Editor

Vibha Shukla

Project Coordinator

Sanchita Mandal

Proofreader

Safis Editing

Indexer

Priya Sane

Graphics

Kirk D'Penha

Production Coordinator

Komal Ramchandani

Cover Work

Komal Ramchandani

About the Author

Steven F. Lott has been programming since the 70s, when computers were large, expensive, and rare. As a contract software developer and architect, he has worked on hundreds of projects from very small to very large. He's been using Python to solve business problems for over 10 years.

He's currently leveraging Python to implement microservices and ETL pipelines.

His other titles with Packt Publishing include Python Essentials, Mastering Object-Oriented Python, Functional Python Programming, and Python for Secret Agents.

Steven is currently a technomad who lives in various places on the East Coast of the U.S. His technology blog is http://slott-softwarearchitect.blogspot.com.

About the Reviewer

Shubham Sharma holds a bachelor's degree in computer science engineering with a specialization in business analytics and optimization from UPES, Dehradun. He is skilled in a range of programming languages. He also has experience in web development, Android, and ERP development, and works as a freelancer.

Shubham also loves writing and blogs at www.cyberzonec.in/blog. He is currently using Python to identify optimal mobile phone specifications from customer reviews.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

Preface

Secret agents are dealers and brokers of information. Information that's rare or difficult to acquire has the most value. Getting, analyzing, and sharing this kind of intelligence requires a skilled use of specialized tools. This often includes programming languages such as Python and its vast ecosystem of add-on libraries.

The best agents keep their toolkits up to date. This means downloading and installing the very latest in updated software. An agent should be able to analyze logs and other large sets of data to locate patterns and trends. Social network applications such as Twitter can reveal a great deal of useful information.

An agent shouldn't find themselves stopped by arcane or complex document formats. With some effort, the data in a PDF file can be as accessible as the data in a plain text file. In some cases, agents need to build specialized devices to gather data. A small processor such as an Arduino can gather raw data for analysis and dissemination; it moves the agent to the Internet of Things.

What this book covers

Chapter 1, New Missions – New Tools, addresses the tools that we're going to use. It's imperative that agents use the latest and most sophisticated tools. We'll guide field agents through the procedures required to get Python 3.4. We'll install the Beautiful Soup package, which helps you analyze and extract data from HTML pages. We'll install the Twitter API so that we can extract data from the social network. We'll add PDFMiner3K so that we can dig data out of PDF files. We'll also add the Arduino IDE so that we can create customized gadgets based on the Arduino processor.

Chapter 2, Tracks, Trails, and Logs, looks at the analysis of bulk data. We'll focus on the kinds of logs produced by web servers as they have an interesting level of complexity and contain valuable information on who's providing intelligence data and who's gathering this data. We'll leverage Python's regular expression module, re, to parse log data files. We'll also look at ways in which we can process compressed files using the gzip module.

Chapter 3, Following the Social Network, discusses one of the social networks. A field agent should know who's communicating and what they're communicating about. A network such as Twitter will reveal social connections based on who's following whom. We can also extract meaningful content from a Twitter stream, including text and images.

Chapter 4, Dredging Up History, provides you with essential pointers on extracting useful data from PDF files. Many agents find that a PDF file is a kind of dead end because the data is inaccessible. There are tools that allow us to extract useful data from PDF. As PDF is focused on high-quality printing and display, it can be challenging to extract data suitable for analysis. We'll show some techniques with the PDFMiner package that can yield useful intelligence. Our goal is to transform a complex file into a simple CSV file, much like the logs that we analyzed in Chapter 2, Tracks, Trails, and Logs.

Chapter 5, Data Collection Gadgets, expands the field agent's scope of operations to the Internet of Things (IoT). We'll look at ways to create simple Arduino sketches in order to read a typical device; in this case, an infrared distance sensor. We'll look at how we will gather and analyze raw data to do instrument calibration.

What you need for this book

A field agent needs a computer over which they have administrative privileges. We'll be installing additional software. A secret agent without the administrative password may have trouble installing Python 3 or any of the additional packages that we'll be using.

For agents using Windows, most of the packages will come prebuilt using the .EXE installers.

For agents using Linux, developer's tools are required. The complete suite of developer's tools is generally needed. The Gnu C Compiler (GCC) is the backbone of these tools.

For agents using Mac OS X, the developer's tool, Xcode, is required and can be found at https://developer.apple.com/xcode/. We'll also need to install a tool called Homebrew (http://brew.sh) to help us add Linux packages to Mac OS X.

Python 3 is available from the Python download page at https://www.python.org/download.

We'll download and install several things beyond Python 3.4 itself:

The Pillow package will allow us to work with image files: https://pypi.python.org/pypi/Pillow/2.4.0
The Beautiful Soup version 4 package will allow us to work with HTML web pages: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
The Twitter API package will let us search the social network: https://pypi.python.org/pypi/TwitterAPI/2.3.3
We'll use PDF Miner 3k to extract meaningful data from PDF files: https://pypi.python.org/pypi/pdfminer3k/1.3.0
We'll use the Arduino IDE. This comes from https://www.arduino.cc/en/Main/Software. We'll also want to install PySerial: https://pypi.python.org/pypi/pyserial/2.7

This should demonstrate how extensible Python is. Almost anything an agent might need is already written and available through the Python Package Index (PyPI) at https://pypi.python.org/pypi.
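Once these packages are installed, an agent may want a quick way to see which ones are still missing. The following is a minimal sketch, not part of the book's own code. It assumes the usual top-level import names for these distributions (PIL for Pillow, bs4 for Beautiful Soup 4, pdfminer for PDF Miner 3k, and serial for PySerial); the helper name missing_packages is hypothetical.

```python
import importlib.util

# Distribution name on PyPI -> top-level module name used in import statements.
# These pairings are assumptions based on the packages' usual import names.
TOOLKIT = {
    "Pillow": "PIL",
    "beautifulsoup4": "bs4",
    "TwitterAPI": "TwitterAPI",
    "pdfminer3k": "pdfminer",
    "pyserial": "serial",
}


def missing_packages(modules):
    """Return the distribution names whose import module cannot be located."""
    return sorted(
        dist
        for dist, module in modules.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    # Report anything that still needs to be installed with pip.
    for dist in missing_packages(TOOLKIT):
        print("Not installed yet:", dist)
```

The check only looks for the module on the import path; it does not import the package or confirm which version is installed.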

Who this book is for

This book is for field agents who know a little bit of Python and are very comfortable installing new software. Agents must be ready, willing, and able to write some new and clever programs in Python. An agent who has never done any programming before may find some of this a bit advanced; a beginner's tutorial in the basics of Python may be helpful as preparation.

We'll expect that an agent using this book is comfortable with simple mathematics. This involves some basic statistics and elementary geometry.

We expect that secret agents using this book will be doing their own investigations as well. The book's examples are designed to get the agent started down the road to develop interesting and useful applications. Each agent will have to explore further afield on their own.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. New Missions – New Tools

The espionage job is to gather and analyze data. This requires us to use computers and software tools.

However, a secret agent's job is not limited to collecting data. It involves processing, filtering, and summarizing data, and also involves confirming the data and assuring that it contains meaningful and actionable information.

Any aspiring agent would do well to study the history of the World War II English secret agent, code-named Garbo. This is an inspiring and informative story of how secret agents operated in war time.

We're going to look at a variety of complex missions, all of which will involve Python 3 to collect, analyze, summarize, and present data. Due to our previous successes, we've been asked to expand our role in a number of ways.

HQ's briefings are going to help agents make some technology upgrades. We're going to locate and download new tools for new missions that we're going to be tackling. While we're always told that a good agent doesn't speculate, the most likely reason for new tools is a new kind of mission and dealing with new kinds of data or new sources. The details will be provided in the official briefings.

Field agents are going to be encouraged to branch out into new modes of data acquisition. Internet of Things leads to a number of interesting sources of data. HQ has identified some sources that will push the field agents in new directions. We'll be asked to push the edge of the envelope.

We'll look at the following topics:

Tool upgrades, in general. Then, we'll upgrade Python to the latest stable version. We'll also upgrade the pip utility so that we can download more tools.
Reviewing the Python language. This will only be a quick summary.
Our first real mission will be an upgrade to the Beautiful Soup package. This will help us in gathering information from HTML pages.
After upgrading Beautiful Soup, we'll use this package to gather live data from a web site.
We'll do a sequence of installations in order to prepare our toolkit for later missions.
In order to build our own gadgets, we'll have to install the Arduino IDE.

This will give us the tools for a number of data gathering and analytical missions.

Background briefing on tools

The organization responsible for tools and technology is affectionately known as The Puzzle Palace. They have provided some suggestions on what we'll need for the missions that we've been assigned. We'll start with an overview of the state of the art in Python tools, handed down from one of the puzzle solvers.

Some agents have already upgraded to Python 3.4. However, not all agents have done this. It's imperative that we use the latest and greatest tools.

There are four good reasons for this, as follows:

Features: Python 3.4 adds a number of additional library features that we can use. The list of features is available at https://docs.python.org/3/whatsnew/3.4.html.
Performance: Each new version is generally a bit faster than the previous version of Python.
Security: While Python doesn't have any large security holes, there are new security changes in Python.
Housecleaning: A number of rarely used features have been removed.

Some agents may want to start looking at Python 3.5. This release is anticipated to include some optional features to provide data type hints. We'll look at this in a few specific cases as we go forward with the mission briefings. The type-analysis features can lead to improvements in the quality of the Python programming that an agent creates. The Puzzle Palace report is based on intelligence gathered at PyCon 2015 in Montreal, Canada. Agents are advised to follow the Python Enhancement Proposals (PEP) closely. Refer to https://www.python.org/dev/peps/.

We'll focus on Python 3.4. For any agent who hasn't upgraded to Python 3.4.3, we'll look at the best way to approach this.

If you're comfortable with working on your own, you can try to move further and download and install Python 3.5. Here, the warning is that it's very new and it may not be quite as robust as the Python version 3.4. Refer to PEP 478 (https://www.python.org/dev/peps/pep-0478/) for more information about this release.

Doing a Python upgrade

It's important to consider each major release of Python as an add-on and not a replacement. Any release of Python 2 should be left in place. Most field agents will have several side-by-side versions of Python on their computers. The following are the two common scenarios:

The OS uses Python 2. Mac OS X and Linux computers require Python 2; this is the default version of Python that's found when we enter python at the command prompt. We have to leave this in place.
We might also have an older Python 3, which we used for the previous missions. We don't want to remove this until we're sure that we've got everything in place in order to work with Python 3.4.

We have to distinguish between the major, minor, and micro versions of Python. Python 3.4.3 and 3.4.2 have the same minor version (3.4). We can replace the micro version 3.4.2 with 3.4.3 without a second thought; they're always compatible with each other. However, we don't treat the minor versions quite so casually. We often want to leave 3.3 in place.
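The rule of thumb above can be sketched in a few lines of Python. This is only an illustration; the function names parse_version and same_minor are made up for this example, and the sketch assumes plain three-part version strings such as 3.4.3.

```python
def parse_version(text):
    """Split a version string such as '3.4.3' into (major, minor, micro) integers."""
    major, minor, micro = (int(part) for part in text.split("."))
    return major, minor, micro


def same_minor(a, b):
    """True when both versions share the same major and minor numbers."""
    return parse_version(a)[:2] == parse_version(b)[:2]


# Same minor version (3.4): the newer micro version can replace the older one.
print(same_minor("3.4.2", "3.4.3"))

# Different minor versions (3.3 and 3.4): install side by side instead.
print(same_minor("3.3.6", "3.4.3"))
```

Comparing the first two elements of the tuple is all that is needed to decide between replacing an installation and keeping it alongside the new one.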

Generally, we do a field upgrade as shown in the following:

Download the installer that is appropriate for the OS and Python version. Start at this URL: https://www.python.org/downloads/. The web server can usually identify your computer's OS and suggest the appropriate download with a big, friendly, yellow button. Mac OS X agents will notice that we now get a .pkg (package) file instead of a .dmg (disk image) containing .pkg. This is a nice simplification.
When installing a new minor version, make sure to install in a new directory: keep 3.3 separate from 3.4. When installing a new micro version, replace any existing installation; replace 3.4.2 with 3.4.3.
For Mac OS X and Linux, the installers will generally use names that include python3.4 so that the minor versions are kept separate and the micro versions replace each other.
For Windows, we have to make sure we use a distinct directory name based on the minor version number. For example, we want to install all new 3.4.x micro versions in C:\Python34. If we want to experiment with the Python 3.5 minor version, it would go in C:\Python35.
Tweak the PATH environment setting to choose the default Python.
For Mac OS X and Linux, this information is generally in our ~/.bash_profile file. In many cases, the Python installer will update this file in order to assure that the newest Python is at the beginning of the string of directories that are listed in the PATH setting. This file is generally used when we log in for the first time. We can either log out and log back in again, or restart the terminal tool, or we can use the source ~/.bash_profile command to force the shell to refresh its environment.
For Windows, we must update the advanced system settings to tweak the value of the PATH environment variable. In some cases, this value has a huge list of paths; we'll need to copy the string and paste it in a text editor to make the change. We can then copy it from the text editor and paste it back in the environment variable setting.
After upgrading Python, use pip3.4 (or easy_install-3.4) to add the additional packages that we need. We'll look at some specific packages in mission briefings. We'll start by adding any packages that we use frequently.
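To see the effect of the PATH setting, an agent can ask Python to list the directories in search order and report which executable each command name resolves to. This is a small sketch using only the standard library; the helper names path_directories and first_match are invented for this illustration, and the command names in the loop are simply the ones discussed in this briefing.

```python
import os
import shutil


def path_directories(path_value=None):
    """Return the directories of a PATH-style string, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # os.pathsep is ':' on Mac OS X and Linux, and ';' on Windows.
    return [directory for directory in path_value.split(os.pathsep) if directory]


def first_match(command):
    """Return the full path found for command on the current PATH, or None."""
    return shutil.which(command)


if __name__ == "__main__":
    for position, directory in enumerate(path_directories(), start=1):
        print(position, directory)
    for command in ("python", "python3", "python3.4"):
        print(command, "->", first_match(command))
```

Directories that appear earlier in the listing win, which is why the installer tries to put the newest Python at the beginning of PATH.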

At this point, we should be able to confirm that our basic toolset works. Linux and Mac OS agents can use the following command:

MacBookPro-SLott:Code slott$ python3.4

This should confirm that we've downloaded and installed Python and made it a part of our OS settings. The greeting will show which micro version of Python 3.4 we have installed.

For Windows, the command's name is usually just python. It would look similar to the following:

C:\> python

The Mac OS X interaction should include the version; it will look similar to the following code:

MacBookPro-SLott:NavTools-1.2 slott$ python3.4
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)

We've entered the python3.4 command. This shows us that things are working very nicely. We have Python 3.4.3 successfully installed.

We don't want to make a habit of using the python or python3 commands in order to run Python from the command line. These names are too generic and we could accidentally use Python 3.3 or Python 3.5, depending on what we have installed. We need to be intentional about using python3.4.
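A script can also defend itself against being started with the wrong interpreter. The following is one possible sketch, not something HQ mandates; the names is_expected_minor and require_minor are hypothetical, and the default of (3, 4) simply mirrors the version used in this briefing.

```python
import sys


def is_expected_minor(version_info, expected=(3, 4)):
    """True when the (major, minor) pair of version_info equals expected."""
    return tuple(version_info[:2]) == tuple(expected)


def require_minor(expected=(3, 4)):
    """Stop the script early if a different minor version is running it."""
    if not is_expected_minor(sys.version_info, expected):
        found = "{0}.{1}".format(*sys.version_info[:2])
        wanted = "{0}.{1}".format(*expected)
        sys.exit(
            "This script expects Python {0}, but is running {1}".format(wanted, found)
        )
```

Calling require_minor() at the top of a script turns an accidental python or python3 invocation into an immediate, readable error instead of a confusing failure later on.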

Preliminary mission to upgrade pip

The first time that we try to use pip3.4, we may see an interaction as shown in the following:

MacBookPro-SLott:Code slott$ pip3.4 install anything
You are using pip version 6.0.8, however version 7.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

The version numbers may be slightly different; this is not too surprising. The packaged version of pip isn't always the latest and greatest version. Once we've installed the Python package, we can upgrade pip3.4 to the most recent release. We'll use pip to upgrade itself.

It looks similar to the following code:

MacBookPro-SLott:Code slott$ pip3.4 install --upgrade pip
You are using pip version 6.0.8, however version 7.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting pip from https://pypi.python.org/packages/py2.py3/p/pip/pip-7.0.3-py2.py3-none-any.whl#md5=6950e1d775fea7ea50af690f72589dbd
Downloading pip-7.0.3-py2.py3-none-any.whl (1.1MB)
100% |################################| 1.1MB 398kB/s
Installing collected packages: pip
Found existing installation: pip 6.0.8
Uninstalling pip-6.0.8:
Successfully uninstalled pip-6.0.8
Successfully installed pip-7.0.3

We've run the pip installer to upgrade pip. We're shown some details about the files that are downloaded and the new version that is installed. We were able to do this with a simple pip3.4 under Mac OS X.
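As a small preview of the regular expression work in the log-analysis missions, the upgrade notice itself can be picked apart with Python's re module. This is a sketch only: the pattern assumes the exact wording of the notice shown in the interaction above, which other pip releases may phrase differently, and the function name pip_versions is invented for this example.

```python
import re

# Matches the notice text shown in the interaction above; the wording is an
# assumption that holds for that output and may differ in other pip releases.
NOTICE = re.compile(
    r"You are using pip version (?P<current>\d+(?:\.\d+)*), "
    r"however version (?P<latest>\d+(?:\.\d+)*) is available"
)


def pip_versions(output):
    """Return (current, latest) version strings from pip's notice, or None."""
    match = NOTICE.search(output)
    if match is None:
        return None
    return match.group("current"), match.group("latest")


sample = (
    "You are using pip version 6.0.8, however version 7.0.3 is available.\n"
    "You should consider upgrading via the 'pip install --upgrade pip' command.\n"
)
print(pip_versions(sample))
```

The named groups, current and latest, make the result easy to read back out of the match object without counting parentheses.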

Some packages will require system privileges that are available via the sudo command. While it's true that a few packages don't require system privileges, it's easy to assume that privileges are always required. For Windows, of course, we don't use sudo at all.

On Mac OS X, we'll often need to use sudo -H instead of simply using sudo. This option will make sure that the proper HOME environment variable is used to manage a cache directory.

Note that your actual results may differ from this example, depending on how out-of-date your copy of pip turns out to be. This pip install --upgrade pip is a pretty frequent operation as the features advance.