I want to extract the URLs from the recovery.js file (Firefox creates this file; it contains the URLs of the windows and tabs of the current session). My goal is to store and classify these URLs in a text file. I can then, for example, open a "special" Firefox session with only the desired URLs while keeping the other URLs on file. Note: I could do all of this manually from Firefox, but I find that less convenient.

In [1]:

import re
import os.path

Path to the recovery.js file:

In [2]:

# Using os.path.expanduser because of the tilde:
filename = os.path.expanduser('~/Library/Application Support/Firefox/Profiles/g9vvodyo.default/sessionstore-backups/recovery.js')
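
The profile directory name (g9vvodyo.default here) differs from one machine to another. A minimal sketch of how the file could be located without hard-coding it, assuming the same directory layout as above; the helper name find_recovery_files is hypothetical:

```python
import glob
import os.path

def find_recovery_files(profiles_dir):
    """Return the recovery.js files found in every profile under profiles_dir."""
    pattern = os.path.join(profiles_dir, '*', 'sessionstore-backups', 'recovery.js')
    return sorted(glob.glob(pattern))

# Profile location on OS X, as used in this notebook
profiles_dir = os.path.expanduser('~/Library/Application Support/Firefox/Profiles')
candidates = find_recovery_files(profiles_dir)
print(candidates)
```

If several profiles exist, the list contains one path per profile and the right one still has to be picked by hand.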

Extraction of the data:

In [3]:

with open(filename, 'r') as f:
    data = f.readlines()

Extract the URLs with a regular expression:

In [4]:

if len(data) == 1:
    # regex: extract each URL (http or https) up to the first quote (")
    urls = re.findall(r'(https?://\S[^"]*)', data[0])
else:
    print('The data are not stored in one string. You should adapt the code. Good luck!')
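
A possible alternative to the regular expression, assuming the file content is valid JSON in which each page is stored under a "url" key (this is an assumption and is not checked here for any particular Firefox version). The sample string below is hypothetical and much simpler than a real recovery.js; the helper name collect_urls is also made up for the illustration:

```python
import json

def collect_urls(node):
    """Recursively collect every string stored under a 'url' key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == 'url' and isinstance(value, str):
                found.append(value)
            else:
                found.extend(collect_urls(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_urls(item))
    return found

# Hypothetical, heavily simplified session content (not a real recovery.js)
sample = '{"windows": [{"tabs": [{"entries": [{"url": "https://github.com", "title": "GitHub"}]}]}]}'
session = json.loads(sample)
print(collect_urls(session))
```

Unlike the regular expression, this only returns values stored under "url", so URLs that appear inside other fields are ignored.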

In [5]:

len(urls)

Out[5]:

Discard the duplicates in the URLs:

In [6]:

urls = list(set(urls))
len(urls)

Out[6]:
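
Note that a set does not keep the original order of the URLs. If the order matters, a sketch of an order-preserving variant, assuming Python 3.7 or later (where dictionaries keep insertion order); the sample list is made up for the illustration:

```python
# Made-up sample with one duplicate
urls_with_duplicates = [
    'https://github.com',
    'http://www.andrewdavison.info/',
    'https://github.com',
]

# dict.fromkeys keeps the first occurrence of each key, in order
unique_urls = list(dict.fromkeys(urls_with_duplicates))
print(unique_urls)
```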

In [7]:

urls

Out[7]:

['http://public.slidesharecdn.com/b/images/logo/linkedin-ss/linkedin_ss_favicon.ico?d0e5c05903',
 'http://rrcns.readthedocs.org/en/cns2012/reproducible_research.html#a-note-on-terminology-reproduction-replication-and-reuse',
 'http://fr.slideshare.net/khinsen/presentations',
 'http://rrcns.readthedocs.org/favicon.ico',
 'http://mosaic-data-model.github.io/#',
 'https://github.com/khinsen/article-statistique-et-societe\\',
 'http://www.slideshare.net/khinsen/reproducible-research-in-molecular-biophysics-and-structural-biology',
 'http://rrcns.readthedocs.org/en/cns2012/',
 'http://www.slideshare.net/khinsen/presentations',
 'http://www.andrewdavison.info/notes/workflows-reproducible-research-comp-neuro/',
 'http://fr.slideshare.net/khinsen/reproducible-research-in-molecular-biophysics-and-structural-biology',
 'http://www.andrewdavison.info/',
 'https://github.com',
 'https://assets-cdn.github.com/favicon.ico',
 'https://github.com/khinsen/article-statistique-et-societe']
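
One entry in the list above ends with a backslash, so the same page appears twice even after removing duplicates. A likely cause (an assumption) is a URL followed by an escaped quote (\") in the file. A sketch of a stricter pattern that also stops at a backslash or at whitespace; the sample string is hypothetical:

```python
import re

# Hypothetical fragment containing a URL followed by an escaped quote
sample = r'{"url":"https://github.com","text":"see \"https://github.com/khinsen/article-statistique-et-societe\" here"}'

# Pattern used in the notebook: the second URL keeps a trailing backslash
loose = re.findall(r'(https?://\S[^"]*)', sample)

# Stricter pattern: stop at a quote, a backslash or any whitespace
strict = re.findall(r'https?://[^"\\\s]+', sample)

print(loose)
print(strict)
```

The stricter pattern would also cut a URL at a raw space, which the original pattern allows after the first character.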

Write the URLs to a file:

In [8]:

filename = 'urls.txt'
# write only if the file does not exist
if not os.path.isfile(filename):
    with open(filename, 'w') as f:
        for url in urls:
            f.write('{0}\n'.format(url))
    print('You can now open all the urls with "firefox -n $(cat urls.txt)" on linux or "open /Applications/Firefox.app $(cat urls.txt)" on OS X')
    print('NOTE: Open firefox before the previous command so firefox will open the URLs in tabs instead of sessions.')
else:
    print('The file already exists! Delete the file if you want to create a file with this name.')

You can now open all the urls with "firefox -n $(cat urls.txt)" on linux or "open /Applications/Firefox.app $(cat urls.txt)" on OS X
NOTE: Open firefox before the previous command so firefox will open the URLs in tabs instead of sessions

At this point, I classified the URLs manually and put the new file in my home directory. The URLs are grouped by theme. Each theme name is preceded by "#", followed by the URLs belonging to that theme and then 2 blank lines.

Then I wrote a new script to extract the URLs according to their theme.

Open the desired URLs in the browser

Extract the themes (the themes are preceded by "#"):

In [21]:

filename = 'urls.txt'
with open(filename, 'r') as f:
    urls = f.readlines()

In [22]:

urls

Out[22]:

['# Reproducbility:\n',
 'http://rrcns.readthedocs.org/en/cns2012/reproducible_research.html#a-note-on-terminology-reproduction-replication-and-reuse\n',
 'http://mosaic-data-model.github.io/#\n',
 'http://www.slideshare.net/khinsen/reproducible-research-in-molecular-biophysics-and-structural-biology\n',
 'http://www.andrewdavison.info/notes/workflows-reproducible-research-comp-neuro/\n',
 'http://fr.slideshare.net/khinsen/reproducible-research-in-molecular-biophysics-and-structural-biology\n',
 'https://github.com/khinsen/article-statistique-et-societe\n',
 '\n',
 '\n',
 '# Miscellaneous:\n',
 'http://www.slideshare.net/khinsen/presentations\n',
 'http://www.andrewdavison.info/\n',
 'http://rrcns.readthedocs.org/en/cns2012/\n']

Find the URL "themes":

In [23]:

# regex start with # and then any character (.) with one or more repetitions (+)
themes = [re.findall(r'^#.+', url) for url in urls]
# Flatten the preceding list of lists:
themes = [val for sublist in themes for val in sublist]
themes

Out[23]:

['# Reproducbility:', '# Miscellaneous:']

Write the URLs to a new file

For example, I want to open the URLs from the "Reproducibility" theme:

In [24]:

theme = themes[0]
theme

Out[24]:

'# Reproducbility:'

Find the line number corresponding to the theme we are interested in:

In [25]:

# Escape the theme so that it is matched as plain text, not as a regex
regex = re.escape(theme)
for cpt, url in enumerate(urls):
    # If the theme is found
    if re.match(regex, url):
        print(cpt)
        break
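
A theme name is plain text, so the lookup can also be done without the re module. A minimal sketch using str.startswith; the lines list is a shortened copy of the file content shown above, and the variable name position is only illustrative:

```python
# Shortened copy of the classified file, one string per line
lines = [
    '# Reproducbility:\n',
    'http://mosaic-data-model.github.io/#\n',
    '\n',
    '\n',
    '# Miscellaneous:\n',
    'http://www.andrewdavison.info/\n',
]
theme = '# Miscellaneous:'

# Index of the first line starting with the theme, or None if there is none
position = next(
    (i for i, line in enumerate(lines) if line.startswith(theme)),
    None,
)
print(position)  # 4
```

Returning None when the theme is missing makes a typo in the theme name visible, whereas a loop that never breaks leaves the counter on the last line.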

Now we know the index (cpt) of the line holding the theme we are interested in, so we save the URLs that follow it, up to the first blank line:

In [26]:

regex = r'\n'
filename = 'extracted_urls.txt'
# The with statement closes the file automatically
with open(filename, 'w') as f:
    for url in urls[cpt+1:]:
        print(url)
        f.write(url)
        # Stop at the first blank line, which marks the end of the theme
        if re.match(regex, url):
            break

http://rrcns.readthedocs.org/en/cns2012/reproducible_research.html#a-note-on-terminology-reproduction-replication-and-reuse

http://mosaic-data-model.github.io/#

http://www.slideshare.net/khinsen/reproducible-research-in-molecular-biophysics-and-structural-biology

http://www.andrewdavison.info/notes/workflows-reproducible-research-comp-neuro/

http://fr.slideshare.net/khinsen/reproducible-research-in-molecular-biophysics-and-structural-biology

https://github.com/khinsen/article-statistique-et-societe
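
The two steps above (find the theme line, then copy the lines up to the first blank line) can also be sketched as a single pass that groups every URL under its theme. This is only an illustration, not the script mentioned at the end of this post; the function name group_by_theme is hypothetical and the lines list is a shortened copy of the file content:

```python
def group_by_theme(lines):
    """Map each '# Theme' header to the list of URLs that follow it."""
    groups = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith('#'):
            # A new theme starts here
            current = line
            groups[current] = []
        elif line and current is not None:
            # Non-empty line below a header: a URL of the current theme
            groups[current].append(line)
    return groups

# Shortened copy of the classified file, one string per line
lines = [
    '# Reproducbility:\n',
    'http://mosaic-data-model.github.io/#\n',
    'https://github.com/khinsen/article-statistique-et-societe\n',
    '\n',
    '\n',
    '# Miscellaneous:\n',
    'http://www.andrewdavison.info/\n',
]
groups = group_by_theme(lines)
print(groups['# Miscellaneous:'])
```

A URL that merely contains "#" (such as the first one in the sample) is not mistaken for a theme, because only lines that start with "#" open a new group.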

You can now open the extracted URLs via this command line:

open /Applications/Firefox.app $(cat extracted_urls.txt)

NOTE: Open Firefox before running the previous command so that Firefox opens the URLs in tabs instead of sessions.
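
The URLs can also be opened from Python with the standard webbrowser module instead of the shell. A minimal sketch, assuming one URL per line in the file; note that webbrowser uses the system default browser, which is not necessarily Firefox. The function name open_urls and its opener parameter are hypothetical:

```python
import webbrowser

def open_urls(path, opener=webbrowser.open_new_tab):
    """Open every non-empty line of the file at path with opener."""
    opened = []
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url:
                opener(url)
                opened.append(url)
    return opened
```

Calling open_urls('extracted_urls.txt') would then open each extracted URL in a new tab. The opener parameter only exists so that the function can be tried without launching a browser, for example with opener=print.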

For this second part ("Open the desired URLs in the browser"), I wrote a Python script, "open_urls.py", that can be used from the command line. For example, ./open_urls.py urls.txt '# Reproducibility' opens all the URLs of the "Reproducibility" theme. You can download the script here.

Guillaume Chevrot


Extract the URLs of a Firefox session
