Empty databases are boring. In part 2 we made an empty database.
Now we fill the database.
We're going to use a couple of Celery tasks defined in `tasks.py`. Building this as Celery tasks has some nice benefits: the work can be spread across multiple workers in parallel, and each file is processed as its own task, so one bad file doesn't take down the whole run.

Here's the basic function layout:
An important principle of Celery tasks, and really of distributed systems in general, is to pass tasks all the context they need as simple objects: send object IDs, not full objects; send filesystem paths, not file handles. You'll see that pattern replicated here.
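A minimal, framework-free sketch of that pattern (the `RECORDS` dict and `get_record` helper are hypothetical stand-ins for a real ORM and task queue):

```python
# Sketch of "pass IDs, not objects": the worker re-fetches fresh state
# from the database instead of receiving a possibly stale serialized object.
RECORDS = {42: {"path": "/tmp/a.jpg", "processed": False}}

def get_record(record_id):
    return RECORDS[record_id]

def process_record(record_id):  # takes a plain int, not the record itself
    record = get_record(record_id)  # fetch current state inside the task
    record["processed"] = True
    return record["path"]

result = process_record(42)
```

Because only the integer ID crosses the queue, it serializes trivially and the task always operates on the latest database state.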
```python
@app.task
def find_and_process_by_path(path):
    if not path.endswith("**"):
        path = path + "**"
    for filename in glob.iglob(path, recursive=True):
        print("Checking", filename)
        will_ignore = False
        for ignore in IGNORE:
            if re.match(ignore, filename):
                print("IGNORING", filename)
                will_ignore = True
                break
        if not will_ignore:
            print("PROCESSING", filename)
            process_file_by_path.delay(filename)
    return path
```
I start by globbing the given path, checking each filename against an ignore list, and then sending the surviving files into the next task.
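The ignore check boils down to matching each filename against a list of regexes with `re.match`. A standalone sketch of that filter (this `IGNORE` list is made up for illustration; the real one lives in `tasks.py`):

```python
import re

# Hypothetical ignore patterns for illustration only
IGNORE = [r".*\.thumbnails.*", r".*~$"]

def should_ignore(filename):
    # re.match anchors at the start of the string, so patterns
    # need a leading .* to match anywhere in the path
    return any(re.match(pattern, filename) for pattern in IGNORE)

names = ["/pics/cat.jpg", "/pics/.thumbnails/cat.png", "/pics/notes.txt~"]
kept = [n for n in names if not should_ignore(n)]  # only the real photo survives
```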
```python
@app.task
def process_file_by_path(path):
    print("PROCESSING ONE FILE", path)
    if not os.path.isfile(path):
        print("NOT A FILE", path)
        return
    try:
        sf = StoredFile.objects.get(content=path)
    except StoredFile.DoesNotExist:
        # If it does NOT have an entry, create one
        sf = StoredFile(content=path)
        sf.save()
    uid = find_owner(path)
    user, _ = User.objects.get_or_create(username=uid)
    sf.user = user
    sf.save()
    if path[-4:] in ['.jpg', 'jpeg']:
        process_jpeg_metadata.delay(path)
    return sf.id
```
This function processes a single file by path name:

- First it checks that the path points to a file (and not a folder).
- Next it gets, or creates, a database entry for the file.
- After that it associates the filesystem owner with an existing or new Django `User` object.
- If it's a JPEG file, it queues up the next Celery task, `process_jpeg_metadata`.
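One subtle detail: `path[-4:]` grabs the last four characters, which is `'.jpg'` for `photo.jpg` but `'jpeg'` for `photo.jpeg`, so the two-item list covers both spellings. It does skip uppercase extensions like `.JPG`, though:

```python
def looks_like_jpeg(path):
    # last four chars: ".jpg" for *.jpg, "jpeg" for *.jpeg
    return path[-4:] in ['.jpg', 'jpeg']

print(looks_like_jpeg("photo.jpg"))   # True
print(looks_like_jpeg("photo.jpeg"))  # True
print(looks_like_jpeg("photo.JPG"))   # False -- the check is case-sensitive
```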
```python
@app.task
def process_jpeg_metadata(path):
    # Determine if this object already has an entry in the data_hub DataFile table
    try:
        sf = StoredFile.objects.get(content=path)
    except StoredFile.DoesNotExist:
        # If it does NOT have an entry, create one
        sf = StoredFile(content=path)
        sf.save()
    if sf.processor_metadata is None:
        sf.processor_metadata = {}
    sf.processor_metadata['jpeg_metadata_started'] = timezone.now().isoformat()
    img = Image.open(sf.content)
    exif = get_exif_data(img)
    lat, lng = get_lat_lng(img)
    if lat and lng:
        sf.location = Point(lng, lat)
    # Set "start" datetime based on EXIF formatted date
    if exif.get('DateTimeOriginal'):
        sf.start = datetime.datetime.strptime(exif['DateTimeOriginal'], "%Y:%m:%d %H:%M:%S")
    elif exif.get('DateTime'):
        sf.start = datetime.datetime.strptime(exif['DateTime'], "%Y:%m:%d %H:%M:%S")
    # Some programs use IPTC data for keywords and tags
    iptc = get_iptc_data(img)
    if iptc.get('Keywords', []):
        for tagname in iptc["Keywords"]:
            tag, _ = Tag.objects.get_or_create(name=tagname)
            sf.tags.add(tag)
    # Not all exif fields are json serializable
    # sf.metadata = exif
    sf.metadata = {
        "width": img.width,
        "height": img.height,
    }
    sf.kind = "Image"
    sf.mime_type = "image/jpeg"
    sf.save()
```
This task gets some interesting things out of the EXIF and IPTC data, like location, DateTime, and tags.
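EXIF stores timestamps with colons in both the date and time portions (`2018:06:16 12:34:56`), which is why the `strptime` format string above is `"%Y:%m:%d %H:%M:%S"` rather than the usual dash-separated form. For example:

```python
import datetime

# EXIF DateTimeOriginal uses colons as the date separator
exif_value = "2018:06:16 12:34:56"
start = datetime.datetime.strptime(exif_value, "%Y:%m:%d %H:%M:%S")
# start is datetime.datetime(2018, 6, 16, 12, 34, 56)
```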
To use these tasks to populate the database, I open a Python shell with `python manage.py shell` and run:
```python
from backend.tasks import *

find_and_process_by_path.delay("/home/issac/Pictures/")
```
In another tab I can run `celery -A backend worker -l info` to start eight Celery workers that churn through the tasks and fill up my database. This took a while, but at the end I had over five thousand photos in my database. Not a bad dataset at all.
16th June 2018